Mechanisms of human attention allow selective processing of just the relevant events among the many stimuli bombarding our senses. Most laboratory studies examine attention within a single sense, but in the real world many important events are specified multimodally, as in verbal communication. Speech comprises visual lip movements as well as sounds, and lip-reading contributes to speech perception, even for listeners with good hearing, through a process of audiovisual integration. Such examples raise the problem of how we coordinate spatial attention across the sensory modalities to select sights and sounds from a common source for further processing. Here we show that this problem is alleviated because some cross-modal matching takes place before attentional selection is completed. Cross-modal matching can lead to an illusion whereby sounds are mislocated at their apparent visual source; this cross-modal illusion can enhance selective spatial attention to speech sounds.