In everyday life, visual information often precedes auditory information and thereby shapes its evaluation (e.g., seeing somebody’s angry face leads us to expect them to speak to us angrily). Using the cross-modal affective paradigm, we investigated the influence of facial gestures when the subsequent acoustic signal is emotionally ambiguous (neutral or produced with a limited repertoire of cues to anger). Auditory stimuli spoken with angry or neutral prosody were presented in isolation or preceded by pictures of emotionally related or unrelated facial gestures (angry or neutral faces). In two experiments, participants rated only the auditory stimuli, judging their valence and emotional intensity. These stimuli were created from acted speech taken from movies, delexicalized via speech synthesis, and then manipulated by partially preserving or degrading their global spectral characteristics. All participants relied on facial cues when the auditory stimuli were acoustically impoverished; however, only a subgroup used angry faces to interpret subsequent neutral prosody. Listeners are thus sensitive to facial cues when evaluating what they are about to hear, especially when the auditory input is less reliable. These results extend findings on face perception to the auditory domain and confirm inter-individual variability in the use of different sources of emotional information.
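As a rough illustration of the kind of stimulus manipulation described above (degrading the global spectral characteristics of a speech recording), the following is a minimal sketch, assuming a mono 16-bit WAV input and using low-pass filtering via scipy. The filter design, the 400 Hz cutoff, and the file names are illustrative assumptions, not parameters taken from the study.

```python
# Illustrative sketch only: one way to degrade the global spectral
# characteristics of a speech signal, here via low-pass filtering.
# The Butterworth design and 400 Hz cutoff are hypothetical choices,
# not parameters reported in the study.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def degrade_spectrum(in_path: str, out_path: str, cutoff_hz: float = 400.0) -> None:
    """Low-pass filter a WAV file, removing spectral detail above cutoff_hz."""
    rate, samples = wavfile.read(in_path)
    x = samples.astype(np.float64)
    if x.ndim > 1:  # mix down to mono if the file has multiple channels
        x = x.mean(axis=1)
    # 4th-order Butterworth low-pass, applied forward and backward
    # (zero-phase) so the temporal envelope is not shifted.
    sos = butter(4, cutoff_hz, btype="low", fs=rate, output="sos")
    y = sosfiltfilt(sos, x)
    y = np.clip(y, -32768, 32767).astype(np.int16)
    wavfile.write(out_path, rate, y)

# Hypothetical file names for demonstration purposes.
degrade_spectrum("neutral_utterance.wav", "neutral_degraded.wav")
```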