Cross-modal processing depends strongly on the compatibility between different sensory inputs, the relative timing of their arrival at the brain's processing components, and how attention is allocated. In this behavioral study, we employed a cross-modal audio-visual Stroop task in which we manipulated the within-trial stimulus onset asynchronies (SOAs) of the stimulus-component inputs, the grouping of the SOAs (blocked vs. random), the attended modality (auditory or visual), and the congruency of the Stroop color-word stimuli (congruent, incongruent, neutral) to assess how these factors interact within a multisensory context. One main result was that visual distractors produced larger incongruency effects on auditory targets than auditory distractors produced on visual targets. Moreover, as revealed by both overall shorter response times (RTs) and relative shifts in the psychometric incongruency-effect functions, visual-information processing was faster and produced stronger and longer-lasting incongruency effects than did auditory-information processing. When participants attended to either modality, stimulus incongruency from the other modality interacted with SOA, yielding larger effects when the irrelevant distractor occurred prior to the attended target, but it did not interact with SOA grouping. Finally, relative to neutral stimuli, and across the wide range of SOAs employed, congruency produced substantially more behavioral facilitation than incongruency produced interference, in contrast to findings that within-modality stimulus-compatibility effects tend to be more evenly split between facilitation and interference. In sum, the present findings reveal several key characteristics of how we process the stimulus compatibility of cross-modal sensory inputs, reflecting stimulus-processing patterns that are critical for successfully navigating our complex multisensory world.