The influences of a talker's face (e.g., articulatory gestures) and voice, vocalic context, and word position were investigated in the training of Japanese and Korean learners of English as a second language to identify American English /ɹ/ and /l/. In a pretest–posttest design, an identification paradigm assessed the effects of 3 weeks of training using multiple natural exemplars on videotape. Word position, adjacent vowel, and training type (auditory–visual [AV] vs. auditory only; multiple vs. single talker for Koreans) were independent variables. Findings revealed significant effects of training type (greater improvement with AV), talker, word position, and vowel. Identification accuracy generalized successfully to novel stimuli and a new talker. Training also transferred to significant improvement in production. These findings are compatible with episodic models for the encoding of speech in memory.