Our perception of the word is the result of combining information between several senses, such as vision, audition and proprioception. These sensory modalities use widely different frames of reference to represent the properties and locations of object. Moreover, multisensory cues come with different degrees of reliability, and the reliability of a given cue can change in different contexts. The Bayesian framework--which we describe in this review--provides an optimal solution to deal with this issue of combining cues that are not equally reliable. However, this approach does not address the issue of frames of references. We show that this problem can be solved by creating cross-modal spatial links in basis function networks. Finally, we show how the basis function approach can be combined with the Bayesian framework to yield networks that can perform optimal multisensory combination. On the basis of this theory, we argue that multisensory integration is a dialogue between sensory modalities rather that the convergence of all sensory information onto a supra-modal area.