Optimal prediction with resource constraints using the information bottleneck

Vedant Sachdeva et al. PLoS Comput Biol. 2021 Mar 8;17(3):e1008743. doi: 10.1371/journal.pcbi.1008743. eCollection 2021 Mar.

Free PMC article

Abstract

Responding to stimuli requires that organisms encode information about the external world. Not all parts of the input are important for behavior, and resource limitations demand that signals be compressed. Prediction of future inputs is beneficial in many biological systems. We compute the trade-offs between representing the past faithfully and predicting the future using the information bottleneck approach, for input dynamics with different levels of complexity. For motion prediction, we show that, depending on the parameters in the input dynamics, velocity or position information is more useful for accurate prediction. We show which motion representations are easiest to re-use for accurate prediction in other motion contexts, and identify and quantify those with the highest transferability. For non-Markovian dynamics, we explore the role of long-term memory in shaping the internal representation. Lastly, we show that prediction in evolutionary population dynamics is linked to clustering allele frequencies into non-overlapping memories.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. A schematic representation of our predictive information bottleneck.
On the left-hand side, we have coordinates X_t evolving in time, subject to noise, giving X_{t+Δt}. We construct a representation, X̃, that compresses X_t (minimizes I(X_t; X̃)) while retaining as much information as possible about X_{t+Δt} (maximizes I(X̃; X_{t+Δt})), with the weight of prediction relative to compression set by β.
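In symbols, the caption describes the standard information bottleneck objective of Tishby and colleagues, here applied to prediction; a minimal statement in LaTeX, using the caption's notation:

    \min_{P(\tilde{X} \mid X_t)} \; \mathcal{L} \;=\; I(X_t; \tilde{X}) \;-\; \beta \, I(\tilde{X}; X_{t+\Delta t})

Small β favors compression; as β → ∞, the representation retains everything about the past that is predictive of X_{t+Δt}.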
Fig 2. Schematic of the stochastically driven damped harmonic oscillator (SDDHO).
(a) The SDDHO consists of a mass attached to a spring, undergoing viscous damping and experiencing Gaussian thermal noise. There are two parameters to be explored in this model: the damping coefficient ζ = 1/(2ω0τ) and the rescaled prediction delay Δt/τ (written simply as Δt below). (b) ζ = 1/2, Δt = 1. Here, we show an example distribution of the history (yellow, left) and its time evolution (purple, right). We take 5000 samples from the distribution, at random, and let these points evolve in time according to the SDDHO equation of motion. We visualize the evolution of the distribution of points in time via an ellipse representing the 1σ confidence region of the rescaled position and velocity. (c) We illustrate the limiting case of the information bottleneck method when β → ∞. Representations of the past, and how they constrain an estimate of the future position and velocity of the object, can be compared to the prior by examining the relative size and shape of their respective ellipses. The blue circle represents the prior and its 1σ confidence region. In yellow, we plot the inferred 1σ confidence region associated with the estimate of the past, X_t, given by the encoding distribution when β → ∞; in this limit, the distribution is reduced to a single point. In purple, we plot the 1σ confidence region of X_{t+Δt} given our knowledge of X_t. Precise knowledge of the past coordinates reduces our uncertainty about the future position and velocity (as compared to the prior), as depicted by the smaller area of the purple ellipse.
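As a concrete illustration of the sampling-and-evolution experiment in panel (b), here is a minimal Euler-Maruyama sketch, assuming the conventional damped-oscillator Langevin form ẍ = −ω0²x − ẋ/τ + ξ(t); the noise strength D and all numeric values are illustrative, not taken from the paper.

    import numpy as np

    # Minimal Euler-Maruyama sketch of the SDDHO (illustrative parameters,
    # not the paper's). Damping coefficient: zeta = 1 / (2 * omega0 * tau).
    rng = np.random.default_rng(0)
    omega0, zeta = 1.0, 0.5
    tau = 1.0 / (2.0 * omega0 * zeta)
    D = 1.0                               # assumed thermal noise strength
    dt, n_steps, n_samples = 1e-3, 1000, 5000

    x = rng.standard_normal(n_samples)    # initial positions
    v = rng.standard_normal(n_samples)    # initial velocities
    for _ in range(n_steps):
        xi = rng.standard_normal(n_samples) * np.sqrt(2.0 * D * dt)
        x, v = x + v * dt, v + (-omega0**2 * x - v / tau) * dt + xi

    # Covariance of the evolved cloud (cf. the confidence ellipses in the figure)
    cov = np.cov(np.vstack([x, v]))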
Fig 3. We consider the task of predicting the path of an SDDHO with ζ = 1/2 and Δt = 1.
(a) (left) We encode the history of the stimulus, X_t, with a representation generated by the information bottleneck, X̃, that can store 1 bit of information. Knowledge of the coordinates in the compressed representation space enables us to reduce our uncertainty about the bar's position and velocity, with a confidence region given by the yellow ellipse. This particular choice of encoding scheme enables us to predict the future, X_{t+Δt}, with a confidence region given by the purple ellipse. The information bottleneck guarantees that this uncertainty in the future prediction is minimal for a given level of encoding. (right) The uncertainty in the prediction of the future can be reduced by reducing the overall level of uncertainty in the encoding of the history, as demonstrated by increasing the amount of information X̃ can store about X_t. However, the uncertainty in the future prediction cannot be reduced below the variance of the propagator function. (b) We show how the information about X_{t+Δt} scales with the information about X_t, highlighting the points represented in panel (a).
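Both limits visible in panel (b) follow from standard identities: the data-processing inequality caps the predictive information at the full past-future mutual information, which for jointly Gaussian variables such as the SDDHO has a closed form in terms of covariance matrices:

    I(\tilde{X}; X_{t+\Delta t}) \;\le\; I(X_t; X_{t+\Delta t})
      \;=\; \frac{1}{2} \log \frac{\det \Sigma_{X_t} \, \det \Sigma_{X_{t+\Delta t}}}{\det \Sigma_{(X_t,\, X_{t+\Delta t})}}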
Fig 4. Possible behaviors of the SDDHO for a variety of timescales with a fixed I(X_t; X̃) of 5 bits.
For an overdamped SDDHO, panels (a-c), the optimal representation continues to encode mostly position information, as velocity is hard to predict. For the underdamped case, panels (g-i), as the timescale of prediction increases, the optimal representation changes from mostly position information to a mix of position and velocity information. Optimal representations for critically damped input motion are shown in panels (d-f). Comparatively, overdamped stimuli do not require precise velocity measurements, even at long timescales. Optimal predictive representations of overdamped input dynamics carry more predictive information at long timescales than those of underdamped and critically damped inputs.
Fig 5. Example of a sub-optimal compression.
An optimally predictive, compressed representation, in panel (a), compared to a suboptimal representation, in panel (b), for a prediction at Δt = 1 in the future, within the underdamped regime (ζ = 1/2). We fix the mutual information between the representations and X_t (I(X_t; X̃) = 3 bits), but find that, as expected, the suboptimal representation contains significantly less information about the future.
Fig 6. Representations learned on underdamped systems can be transferred to other types of motion, while representations learned on overdamped systems cannot be easily transferred.
(a) Here, we consider the information bottleneck bound curve (black) for a stimulus with underlying parameters (ζ, Δt). For some particular level of past information, I_past = I⁰_past, we obtain a mapping, P(X̃|X_t), that extracts some predictive information, denoted I^future_optimal((ζ, Δt), I⁰_past), about a stimulus with parameters (ζ, Δt). Keeping that mapping fixed, we determine the amount of predictive information for dynamics with new parameters (ζ′, Δt′), denoted I^future_transfer((ζ, Δt), I⁰_past → (ζ′, Δt′)). (b) One-dimensional slices of I^future_transfer in the (ζ′, Δt′) plane: I^future_transfer versus ζ′ for Δt′ = 1 (top), and versus Δt′ for ζ′ = 1 (bottom). Parameters are set to (ζ = 1, Δt = 1) and I⁰_past = 1 bit. (c) Two-dimensional map of I^future_transfer versus (ζ′, Δt′) (same parameters as b). (d) Overall transferability of the mapping. The heatmap of (c) is integrated over ζ′ and Δt′ and normalized by the integral of I^future_optimal((ζ, Δt), I_past). We see that mappings learned from underdamped systems at late times yield high levels of predictive information for a wide range of parameters, while mappings learned from overdamped systems are not generally useful.
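A compact way to write the transferability measure of panel (d); the symbol T is our shorthand, and the choice of parameters at which the denominator's optimal curve is evaluated is our reading of the caption:

    T(\zeta, \Delta t) \;=\;
    \frac{\int d\zeta' \, d\Delta t' \;\; I^{\mathrm{future}}_{\mathrm{transfer}}\big((\zeta, \Delta t), I^{0}_{\mathrm{past}} \to (\zeta', \Delta t')\big)}
         {\int d\zeta' \, d\Delta t' \;\; I^{\mathrm{future}}_{\mathrm{optimal}}\big((\zeta', \Delta t'), I^{0}_{\mathrm{past}}\big)}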
Fig 7. The ability of the information bottleneck method to predict history-dependent stimuli.
(a) The prediction problem, using an extended history and future. This problem is largely similar to the one set up for the SDDHO, but the past and the future are larger composites of observations within a window of time: t − t0 : t, expressed as X_past, for the past, and t + Δt : t + Δt + t0, expressed as X_future, for the future. (b) Predictive information I(X_{t+Δt:t+Δt+t0}; X̃) versus lag Δt. (c) The maximum available predictive information saturates as a function of the length of history used, t0.
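A sketch of how the windowed past and future of panel (a) can be assembled from a discretized trajectory before running the bottleneck on the stacked vectors; the function name, window lengths, and the random-walk trajectory are placeholders, not the paper's pipeline.

    import numpy as np

    def past_future_windows(traj, t0, dt):
        """Stack sliding windows: past = traj[t-t0:t], future = traj[t+dt:t+dt+t0]."""
        T = len(traj)
        starts = range(t0, T - dt - t0)
        X_past = np.array([traj[t - t0:t] for t in starts])
        X_future = np.array([traj[t + dt:t + dt + t0] for t in starts])
        return X_past, X_future

    # Example with a placeholder random-walk trajectory
    traj = np.cumsum(np.random.default_rng(1).standard_normal(10_000))
    Xp, Xf = past_future_windows(traj, t0=20, dt=5)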
Fig 8. The information bottleneck solution for a Wright-Fisher process.
(a) The Wright-Fisher model of evolution can be visualized as a population of N parents giving rise to a population of N offspring. Genotypes of the offspring are selected as a function of the parent generation's genotypes, subject to the mutation rate, μ, and selective pressure, s. (b) Information bottleneck schematic with a discrete (rather than continuous) representation variable, X̃. (c) Predictive information as a function of compression level. Predictive information increases with the cardinality, m, of the representation variable. The amount of predictive information is limited by log(m) (vertical dashed lines) for small m, and by the mutual information between allele frequencies at time t + Δt and time t, I(X_{t+Δt}; X_t) (horizontal dashed line), for large m. Bifurcations occur in the amount of predictive information: for small I(X_t; X̃), the encoding strategies for different m are degenerate, and the degeneracy is lifted as I(X_t; X̃) increases, with large-m schemes accessing higher I(X_t; X̃) ranges. Parameters: N = 100, Nμ = 0.2, Ns = 0.001, Δt = 1. (d-i) We explore information bottleneck solutions to Wright-Fisher dynamics under the condition that the cardinality of X̃ is m = 2, taking β large enough that I(X_t; X̃) ≈ 1 bit (β ≈ 4). Parameters: N = 100, Ns = 0.001, Δt = 1, and Nμ = 0.2, Nμ = 2, and Nμ = 40 (from left to right). (d-f) In blue, we plot the steady-state distribution. In yellow and red, we show the inferred historical distribution of alleles based on the observed value of X̃. Note that each distribution corresponds to a roughly non-overlapping portion of allele frequency space. (g-i) Predicted distribution of alleles based on the value of X̃. We observe that as the mutation rate increases, the timescale of relaxation to the steady state decreases, so historical information is less useful and the predictions become more degenerate with the steady-state distribution.
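A minimal sketch of one generation of a two-allele Wright-Fisher update as pictured in panel (a): selection, then symmetric mutation, then binomial drift. The update order and numeric values are illustrative assumptions, not necessarily the paper's implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def wright_fisher_step(x, N, s, mu):
        """One generation for the focal allele at frequency x."""
        p = x * (1 + s) / (x * (1 + s) + (1 - x))   # selection
        p = p * (1 - mu) + (1 - p) * mu             # symmetric mutation
        return rng.binomial(N, p) / N               # binomial sampling (drift)

    # Illustrative parameters in the caption's units: N*mu = 0.2, N*s = 0.001
    N, Nmu, Ns = 100, 0.2, 0.001
    x = 0.5
    for _ in range(1000):
        x = wright_fisher_step(x, N, s=Ns / N, mu=Nmu / N)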
Fig 9. Transferability of prediction schemes in Wright-Fisher dynamics.
We transfer a mapping, P(X̃|X_t), trained on one set of parameters and apply it to another. We consider transfers between two choices of mutability, Nμ₁ = 0.2 (low) and Nμ₂ = 20 (high), with N = 100, Ns = 0.001, Δt = 1. The dotted line is the steady-state allele frequency distribution, the solid lines are the transferred representations, and the dashed lines are the optimal solutions. The top panels correspond to the distributions of X_t and the bottom panels to distributions of X_{t+Δt}. (a) Transfer from high to low mutability. Optimal information values: I^past_optimal = 0.98 and I^future_optimal = 0.93; transferred information values: I^past_transfer((Nμ₂), I_past = 0.92 → (Nμ₁)) = 0.14 and I^future_transfer((Nμ₂), I_past = 0.92 → (Nμ₁)) = 0.05. Representations learned at high mutation rates are not predictive in the low-mutation regime. (b) Transfer from low to high mutability. Optimal information values: I^past_optimal = 0.92 and I^future_optimal = 0.28; transferred information values: I^past_transfer((Nμ₁), I_past = 0.98 → (Nμ₂)) = 0.79 and I^future_transfer((Nμ₁), I_past = 0.98 → (Nμ₂)) = 0.27. Transfer in this direction yields good predictive information.
Fig 10. Amount of predictive information in the Wright-Fisher dynamics as a function of model parameters.
(a-c) Value of the asymptote of the information bottleneck curve, I(X_t; X_{t+Δt}), with: (a) N = 100, Ns = 0.001, Δt = 1, as a function of μ; (b) N = 100, Nμ = 0.2, Ns = 0.001, as a function of Δt; and (c) N = 100, Nμ = 0.2, Δt = 1, as a function of s.
Fig 11. Encoding schemes with m > 2 representation variables.
The steady state is plotted as a dotted line and the representations for each realization of X̃ are plotted as solid lines. The representations carrying maximum predictive information are shown for (a) m = 2 at I(X_t; X̃) ≈ log(m) = 1 bit, and (b) m = 3 at I(X_t; X̃) ≈ log(m) ≈ 1.5 bits. The optimal representations at large m tile the space more finely and have higher predictive information. (c, d) The optimal representations for m = 200 at fixed β = 1.01 (I(X_t; X̃) = 0.28, I(X_{t+Δt}; X̃) = 0.27) (c) and β = 20 (I(X_t; X̃) = 2.77, I(X_{t+Δt}; X̃) = 2.34) (d). At low I(X_t; X̃), many of the representations are redundant and do not confer more predictive information than the m = 2 scheme; a more explicit comparison is given in S3 Fig. At high I(X_t; X̃), the degeneracy is lifted. All computations were done at N = 100, Nμ = 0.2, Ns = 0.001, Δt = 1.
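For the discrete representations of Figs 8 and 11, the self-consistent information bottleneck equations of Tishby, Pereira, and Bialek can be iterated directly at a given cardinality m. The sketch below assumes a known joint distribution p(x_t, x_{t+Δt}) on a discretized allele-frequency grid; it illustrates the generic algorithm, not the paper's specific code.

    import numpy as np

    def iterative_ib(p_xy, m, beta, n_iter=500, seed=0):
        """Iterate the IB self-consistent equations for a discrete bottleneck
        of cardinality m, given the joint p(x, y) as a 2-D array."""
        rng = np.random.default_rng(seed)
        px = p_xy.sum(axis=1)
        py_x = p_xy / px[:, None]                     # p(y|x)
        q = rng.dirichlet(np.ones(m), size=len(px))   # q(x~|x), random init
        for _ in range(n_iter):
            qx = q.T @ px                             # q(x~)
            py_t = (q * px[:, None]).T @ py_x / qx[:, None]   # p(y|x~)
            # KL(p(y|x) || p(y|x~)) for every (x, x~) pair
            log_ratio = (np.log(py_x[:, None, :] + 1e-300)
                         - np.log(py_t[None, :, :] + 1e-300))
            kl = (py_x[:, None, :] * log_ratio).sum(axis=-1)
            q = qx[None, :] * np.exp(-beta * kl)      # Boltzmann-like update
            q /= q.sum(axis=1, keepdims=True)
        return q

Increasing β sweeps along the bound curve; at small β the m columns of q collapse onto each other, mirroring the degeneracy described in Fig 8(c) and Fig 11(c).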
