It is widely recognized that viewing a speaker's face enhances vocal communication, although the neural substrates of this phenomenon remain unknown. We propose that the enhancement effect uses the ongoing oscillatory activity of local neuronal ensembles in the primary auditory cortex. Neuronal oscillations reflect rhythmic shifting of neuronal ensembles between high and low excitability states. Our hypothesis holds that oscillations are 'predictively' modulated by visual input, so that related auditory input arrives during a high excitability phase and is thus amplified. We discuss the anatomical substrates and key timing parameters that enable and constrain this effect. Our hypothesis makes testable predictions for future studies and emphasizes the idea that 'background' oscillatory activity is instrumental to cortical sensory processing.