1. Introduction
The recent advent of readily available large language models (LLMs) such as ChatGPT has the potential to dramatically impact how teachers approach instructional design. Researchers have investigated topics such as the use of artificially intelligent chatbots in language learning (Koç & Savaş, 2024), the use of artificial intelligence (AI) for corrective feedback (Ai, 2017), and the use of AI as language tutors (Hwang, Xie, Wah & Gašević, 2020). However, many in foreign language programs (see Perspectives column, Davin, 2024) are understandably cautious of this technology (Bekou, Ben Mhamed & Assissou, 2024; Kern, 2024; Kohnke, Moorhouse & Zou, 2023), in part due to declining enrollments (Lusin et al., 2023) and headlines regarding the replacement of foreign language departments with commercial software that uses AI (e.g. Duolingo; Pettit, 2023). Although the role of AI in language teaching and learning certainly deserves scrutiny, researchers have begun to examine how LLMs can support the work of language teachers (Gao, 2024; Thorne, 2024).
Understanding how to prompt AI models and the possible limitations of their outputs are critical components of AI competency. Prompts are the interface between human intent and machine output, usually manifesting as questions or instructions given to an AI model with the goal of eliciting a specific response (Giray, 2023). In its simplest form, a user prompts an LLM like ChatGPT with a simple command and obtains an output (Wei et al., 2023). For LLMs to successfully complete complex tasks, the ability to engineer sophisticated prompts with a high level of specificity is critical (Giray, 2023). Numerous AI prompt engineering strategies have been developed to guide generative AI solutions toward desired outputs (Bozkurt & Sharma, 2023). However, outputs can vary between users who enter the same prompt: one user's output may be correct while another's contains errors. The degree to which such errors are to be expected, or whether they reflect general weaknesses of current generative AI models when generating pedagogical materials, remains unknown.
The purpose of the present study was to provide a first assessment of how prompt specificity influences trends in output variability and weaknesses when using generative AI to write lesson plans for language teachers. We used zero-shot prompting, in which a user inputs a single prompt and does not engage in dialogue with the chatbot. This approach provides a baseline expectation of the model’s output and is the most likely to be employed by non-AI specialists. We designed five prompts in which each subsequent prompt increased in the level of specificity. Additionally, we input each prompt 10 times to analyze the variability of outputs (i.e. lesson plans produced by ChatGPT) by scoring them against criteria based on best practices and the requirements of the most commonly required foreign language teacher licensure exam in the United States (i.e. edTPA). Using the resulting scores, we quantified the variation in outputs and assessed the overall strengths and weaknesses of ChatGPT for lesson plan creation. Collectively, these results provide essential guidance on the extent to which zero-shot prompting can be used for lesson plan creation.
2. Literature review
2.1 What are LLMs?
A primary objective of natural language processing (NLP), a subfield of AI, is to enable machines to understand, interpret, and generate human language for task performance (Chowdhary, 2020). The recent release of LLMs has placed unprecedented attention on our ability to create models that allow machines to mimic human language (Roe & Perkins, 2023). This ability is due in part to advances in deep learning, in which networks of nodes that mirror our conceptual understanding of human neural networks communicate and extract meaningful content from unstructured input data (Roumeliotis & Tselikas, 2023). Understanding of input text is made possible by pre-training models on aggregations of millions of pages of text from books, websites, articles, and other sources (Wu et al., 2023). This pre-training provides a foundational basis for capturing semantic nuances of human language that can be fine-tuned for a wide range of specific applications that span content creation (Cao et al., 2023), language translation (Gu, 2023; Li, Kou & Bonk, 2023), and writing assistance (Bašić, Banovac, Kružić & Jerković, 2023; Imran & Almusharraf, 2023), to name but a few. Central to the efficacy of such applications is the prompt of the user, which embeds task descriptions as input that guides the computational response of the AI model (Lee et al., 2024).
2.2 Prompting in LLMs
Prompts act as the primary user-based input that LLMs such as ChatGPT respond to when generating output. A prompt may simply state a question or task command such as “Write a haiku about sharks in French.” In response, ChatGPT will generate the haiku. If a haiku about any aspect of sharks is the desired output, then the teacher will have achieved their goal. However, such general examples are rarely the desired output. Instead, teachers might want the haiku to be tailored to a specific proficiency level or include specific vocabulary. This need for specificity has led researchers to urge users of generative AI to understand and master fundamental concepts of prompt engineering to effectively leverage LLMs (Giray, 2023; Hatakeyama-Sato, Yamane, Igarashi, Nabae & Hayakawa, 2023; Heston & Khun, 2023). Effective prompts often comprise four components (Giray, 2023):
1. Instruction: A detailed directive (task or instruction) that steers the model’s actions toward the intended result.
2. Context: Supplementary data or context that supplies the model with foundational understanding, thereby enhancing its ability to produce accurate outputs.
3. Input data: This serves as the foundation of the prompt and influences the model’s perception of the task. This is the query or information we seek to have the model analyze and respond to.
4. Output indicator: This sets the output type and format, defining whether a brief answer, a detailed paragraph, or another particular format or combination of formats is desired.
Incorporating these components into inputs can help guide LLMs more readily toward accurate target outcomes (Giray, 2023; Jacobsen & Weber, 2023; Meskó, 2023), as illustrated in the sketch below.
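To make the four components concrete, the following R sketch assembles them into a single zero-shot prompt string. The wording of each component is hypothetical and purely illustrative; it is not drawn from the prompts used in this study.

```r
# Hypothetical example of combining the four prompt components into one
# zero-shot prompt; the wording is illustrative, not the study's prompts.
instruction      <- "Write a 50-minute lesson plan for a high school Spanish class."
context          <- "Students are at the ACTFL novice-mid proficiency level."
input_data       <- "The lesson topic is ordering food in a Costa Rican restaurant."
output_indicator <- "Organize the plan under headings for warm-up, teacher input, guided practice, independent practice, and closure."

# paste() joins the components with spaces into a single input string
prompt <- paste(instruction, context, input_data, output_indicator)
cat(prompt)
```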
2.3 Variation in outputs
The ability of AI to generate non-deterministic outputs from the same prompt has been lauded as a major achievement, but it also underscores a need for caution. This ability enables the LLM to weigh the importance of words in a sentence and generate outputs based on probability distributions (Lubiana et al., 2023). Central to this architecture is the temperature parameter, which acts as a dial for the model’s creative output. At low temperature values, words with higher probabilities are chosen and the model output becomes more deterministic (Lubiana et al., 2023). At high temperature values, the model explores a broader range of possible responses that include novel and less expected outputs (Davis, Van Bulck, Durieux & Lindvall, 2024). However, even at temperature settings near or at zero, models like ChatGPT have been found to return non-deterministic and sometimes erroneous results by chance. Jalil, Rafi, LaToza, Moran and Lam (2023) recently found that even at a temperature setting of zero, ChatGPT provided non-deterministic answers to simple prompts related to a software curriculum nearly 10% of the time, with an error rate of over 5%. Because the public release of ChatGPT most likely to be used by educators has a default temperature setting of around 0.7, this finding suggests that assessments of this tool in education should account for the potential for non-determinism. Unfortunately, how variability in output responses to the same prompt impacts the design of language teacher instructional materials remains unexplored.
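The effect of temperature on sampling can be illustrated with a toy next-word sampler. The R sketch below is a simplified stand-in for how transformer models choose words, not ChatGPT's actual implementation; the candidate words and scores are invented. Dividing the scores (logits) by a low temperature sharpens the probability distribution toward the most likely word, while a high temperature flattens it and admits less expected choices.

```r
# Toy illustration (not ChatGPT's implementation) of temperature-scaled sampling.
softmax <- function(logits) exp(logits) / sum(exp(logits))

sample_next_word <- function(words, logits, temperature = 0.7) {
  probs <- softmax(logits / temperature)  # low T sharpens, high T flattens the distribution
  sample(words, size = 1, prob = probs)
}

words  <- c("menu", "bill", "waiter", "volcano")   # hypothetical candidate next words
logits <- c(2.0, 1.2, 0.8, -1.5)                   # hypothetical model scores

set.seed(1)
table(replicate(1000, sample_next_word(words, logits, temperature = 0.2)))  # near-deterministic
table(replicate(1000, sample_next_word(words, logits, temperature = 1.5)))  # much more varied
```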
2.4 Approaches to prompting for lesson planning
The promise of using LLMs for foreign language teacher material development was recognized not long after the public release of ChatGPT (Hong, 2023; Koraishi, 2023). Since then, a range of prompting techniques such as zero-shot (Kojima, Gu, Reid, Matsuo & Iwasawa, 2022), few-shot (Brown et al., 2020; Kojima et al., 2022), chain-of-thought (Wang et al., 2022), tree-of-thought (Yao et al., 2023), and even autotune solutions (Khattab et al., 2023) have been developed. However, whether teachers require training in complex prompting strategies for routine tasks remains unclear. Corp and Revelle (2023) explored the ability of eight pre-service elementary school teachers to use zero-shot prompting with ChatGPT for lesson plan creation and found it to be feasible after a short tutorial. Zero-shot approaches have repeatedly been shown to produce quality outputs when prompted effectively (Ateia & Kruschwitz, 2023; Hu, Liu, Zhu, Lu & Wu, 2024), including in an evaluation of materials generated for physics classes that found no statistical difference in output between zero-shot prompting and other more complex approaches (Yeadon & Hardy, 2024). More recently, Karaman and Göksu (2024) evaluated the effectiveness of lesson plans generated by ChatGPT using zero-shot prompting on third graders’ mathematical performance over five weeks, noting a boost in math scores for the ChatGPT group.
These results underscore the potential benefits of incorporating AI-developed lesson plans into the educational repertoire. However, the alignment of those plans with teaching objectives and standards has been questioned (Koraishi, 2023; Lo, 2023). It is also not clear whether generated content will consistently align with objectives and standards across users. Researchers have demonstrated that LLMs such as ChatGPT can accomplish various language teaching and learning tasks (Hong, 2023; Kohnke et al., 2023). Unfortunately, to our knowledge, no published studies have examined their use for language lesson plan creation.
3. Methods
The present study sought to analyze the extent to which zero-shot prompting could create language teacher lesson plans aligned to target criteria used in licensure training. Specifically, the study was guided by the following research questions:
R1. To what degree does increasing the specificity of prompts impact the structure and content of AI-generated lesson plans?
R2. How does the specificity of a prompt influence the consistency of AI-generated responses?
R3. Does ChatGPT demonstrate any overall strengths or weaknesses in lesson plan design, regardless of prompt specificity?
3.1 Approach
We used ChatGPT Version 4.0 (OpenAI, 2024) to assess how increasing prompt specificity impacted the alignment of outputs with lesson plan criteria that are given to pre-service L2 teachers during their training at the University of North Carolina at Charlotte (UNCC). We designed five increasingly specific zero-shot prompts following general guidelines of prompt design (Giray, 2023) and recorded the resulting outputs. Each prompt was input into 10 unique chats to additionally assess the non-determinism of outputs generated from the same prompt. The resulting 50 outputs were scored for the presence/absence of criteria and subjected to a range of statistical analyses that allowed for the visualization of trends, assessment of variability between prompt groups, and a series of dissimilarity analyses. These analyses aimed to reveal whether specific features of the prompt design yielded outputs that were more or less aligned with target criteria for lesson plan design. This approach allowed us to provide an assessment of zero-shot prompting for language teacher lesson plan design.
3.2 Prompt design
The five prompts iteratively built specificity toward constructing a lesson plan that aligns with target criteria used in the UNCC foreign language teacher licensure program. The lesson plan template and scoring rubric were aligned to the requirements of Pearson’s edTPA, a performance-based assessment that teachers in North Carolina and many other states must pass for teacher licensure. Prompts were designed to adhere to the guidance of prompt design that includes instruction, context, input data, and an output indicator (Giray, 2023), and they iteratively increased in complexity. Table 1 shows the initial prompt (P.1) and the phrase or direction that was added in each subsequent prompt.
P.1 provided a general case prompt that included features common to lesson planning (Table 1). P.2 added specificity concerning the definition of proficiency based on the ACTFL Proficiency Guidelines (ACTFL, 2012), which are guidelines that describe what language users can do with the language in real-world situations at different proficiency levels (e.g. novice, intermediate, advanced). P.3 increased in specificity by adding a lesson plan format (see Supplementary Material 1) that a pre-service teacher would utilize as part of their training, uploaded as a pdf for ChatGPT to complete. P.4 included all components of P.1, P.2, and P.3, but added the condition that the lesson plan should address multiple ACTFL world-readiness standards. Finally, P.5 included a checklist of criteria that individually listed each component that the lesson plan should include (see Supplementary Material 2).
3.3 Dataset creation
All prompts were used as input for ChatGPT 4.0 (OpenAI, 2024), with the resulting lesson plan output saved to a text file. Each prompt (P.1–P.5) was input 10 times to capture non-deterministic output arising from the temperature parameter. Each prompt iteration was entered in a new chat to ensure that prior prompts did not influence the output. All text files were labeled by prompt. For example, P1.1 represented the first time Prompt 1 was entered and P5.5 represented the fifth time Prompt 5 was entered. In sum, 50 lesson plans, 10 for each of the five prompts, were generated. We then scored each output for the presence (1) or absence (0) of the components indicated in P.5, yielding a binary presence/absence matrix.
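A minimal R sketch of the resulting data structure follows. The values are simulated purely for illustration; in the study each output was scored by hand against the P.5 checklist, but the matrix shape and the row-naming convention (e.g. P1.1 for the first replicate of Prompt 1) follow the description above, and the later sketches in this section reuse these objects.

```r
# Simulated stand-in for the 50 x 25 binary presence/absence matrix of scores.
set.seed(42)
n_components <- 25  # number of checklist components listed in P.5
scores <- matrix(rbinom(50 * n_components, size = 1, prob = 0.8),
                 nrow = 50, ncol = n_components)
rownames(scores) <- sprintf("P%d.%d", rep(1:5, each = 10), rep(1:10, times = 5))
colnames(scores) <- paste0("component_", seq_len(n_components))

# Grouping factor recording which prompt (P.1-P.5) produced each output
prompt_group <- factor(rep(paste0("P.", 1:5), each = 10))
```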
3.4 Statistical analyses
All statistical analyses were conducted in R Version 4.2.1 “Bird Hippy” (R Development Core Team, 2021). To investigate how increasing the specificity of prompts impacted the structure and content of AI-generated lesson plans (R1), we first assessed general trends in the presence or absence of target lesson components. Trends in the total number of target components captured by each prompt across iterations were visualized, and group means were compared using ANOVA. This allowed us to test whether the prompt types differed significantly in generating outputs aligned with the target criteria. We conducted pairwise t-tests between groups using the correction method of Benjamini and Hochberg (1995) to mitigate the potential for false discovery. These analyses were repeated on the word counts between prompt categories to assess whether added specificity abridged the resulting output.
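A sketch of these group-mean comparisons, using the simulated scores matrix and prompt_group factor from the previous sketch, might look as follows; the word-count comparison would follow the same pattern with word counts in place of total scores.

```r
# Compare the mean number of target components captured across prompt groups.
total_score <- rowSums(scores)
dat <- data.frame(total_score = total_score, prompt = prompt_group)

# One-way ANOVA across the five prompt groups
summary(aov(total_score ~ prompt, data = dat))

# Pairwise t-tests with Benjamini-Hochberg correction to limit false discovery
pairwise.t.test(dat$total_score, dat$prompt, p.adjust.method = "BH")
```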
As an ANOVA only offers a perspective on group means, we additionally utilized several statistical approaches to analyze the dissimilarity between prompt outputs, which can reveal separation of outputs that may be masked by comparisons of group means. As Euclidean distance measures exhibit known pathologies when handling presence/absence matrices (Ricotta & Podani, 2017), we quantified dissimilarity using Jaccard distances, which are appropriate for binary data (Hao et al., 2019), using the vegan package Version 2.6.4 (Dixon, 2003). To visualize overlap between prompt outputs, we used non-metric multidimensional scaling (NMDS), treating prompts (P.1–P.5) as groups. This approach to visualizing overlap of clusters (Dornburg et al., 2016) relaxes the linearity assumptions of alternative ordination approaches such as principal components analysis. To ensure that the NMDS ordination was a viable indicator of dissimilarity, we quantified stress values and confirmed that they were below 0.1 (Clarke, 1993). To assess whether prompts formed statistically significant clusters, as would be expected if prompt specificity greatly impacted output (R1), we used the adonis2 function in vegan to conduct a permutational multivariate analysis of variance (PERMANOVA) with 999 permutations (Anderson, 2001). This test allowed us to assess whether the composition of the prompts differed in multivariate space. More specifically, a significant difference would suggest that grouping by prompt explains variance in goal achievement, indicating that different prompts lead to statistically different patterns of goal achievement. To gain additional insight into the degree of separation between groups, we complemented the PERMANOVA test with an analysis of similarities (ANOSIM), which tests the null hypothesis that there are no differences between groups through a comparison of within- and between-group dissimilarities (Chapman & Underwood, 1999; Clarke, 1993). Using 999 permutations, mean ranks (R) were quantified, with values near 0 indicating high similarity and values near 1 indicating high dissimilarity (Chapman & Underwood, 1999).
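The dissimilarity analyses could be reproduced along the following lines, again assuming the simulated scores matrix and prompt_group factor from the earlier sketch; the calls mirror functions available in the vegan package, though the exact arguments used in the study may differ.

```r
library(vegan)

# Jaccard distances on the binary presence/absence matrix
jac <- vegdist(scores, method = "jaccard", binary = TRUE)

# Non-metric multidimensional scaling; stress below ~0.1 indicates a reliable ordination
nmds <- metaMDS(jac, k = 2, trymax = 100)
nmds$stress

# PERMANOVA: does prompt group explain variance in the multivariate score profiles?
adonis2(jac ~ prompt, data = data.frame(prompt = prompt_group), permutations = 999)

# ANOSIM: R near 0 indicates within- and between-group dissimilarities are comparable;
# R near 1 indicates strong separation between prompt groups
anosim(jac, grouping = prompt_group, permutations = 999)
```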
To quantify how the specificity of a prompt influences the consistency of AI-generated responses (R2), distances were visualized in R to look for patterns of distance increasing as a function of prompt specificity. We next performed hierarchical clustering on the dissimilarity indices in vegan using the single, complete, and average linkage algorithms (Singh, Hjorleifsson & Stefansson, 2011; Vijaya, Sharma & Batra, 2019). This allowed us to visualize the distance between prompts and the degree to which output from the same prompt clustered together (R2) (Vijaya et al., 2019). Under a scenario in which prompts continually returned highly similar output, we would expect a high degree of clustering among replicates of the same prompt. In contrast, if the output were highly variable, we would expect outputs generated by different prompts to converge in their scoring rather than form prompt-specific clusters.
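A corresponding sketch of the clustering step is given below, reusing the Jaccard distance object from the previous sketch. Base R's hclust() applied to the distance object is one common way to run these linkage algorithms; the study's exact calls may differ.

```r
# Hierarchical clustering of outputs under three linkage algorithms
hc_single   <- hclust(jac, method = "single")
hc_complete <- hclust(jac, method = "complete")
hc_average  <- hclust(jac, method = "average")

# Dendrogram labels carry the prompt/replicate IDs (e.g. "P1.1"), so visual
# inspection shows whether replicates of the same prompt cluster together.
plot(hc_average, main = "Average-linkage clustering of lesson plan scores")
```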
Finally, to assess if ChatGPT demonstrated any overall strengths or weaknesses in lesson plan design, regardless of prompt specificity (R3), we quantified which components were absent across each prompt (e.g. Cultural Connections, Meaningful Context, etc.).
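A simple tally along these lines, again using the simulated matrix from the earlier sketches, would identify the components most frequently missing overall and their presence rates by prompt group.

```r
# Components most frequently missing across all 50 outputs
missing_overall <- colSums(scores == 0)
sort(missing_overall, decreasing = TRUE)

# Proportion of replicates within each prompt group that included each component
presence_by_prompt <- aggregate(as.data.frame(scores),
                                by = list(prompt = prompt_group), FUN = mean)
presence_by_prompt
```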
4. Results
4.1 To what degree does increasing the specificity of prompts impact the structure and content of AI-generated lesson plans?
There was a marginally significant difference in overall prompt scores between prompt groups (p = 0.0442, F = 2.668, DF = 4), though significance was not supported in multiple test–corrected pairwise tests (p > 0.12). Adding detailed instructions had the effect of reducing the spread of variance and, in some cases, raising the mean value of scores (Figure 1). However, the mean scores did not increase linearly as a function of prompt detail. When the lesson template was first provided in P.3, the mean score dropped from 21.2 out of 25 for P.2 to 19.8. For P.4, when the prompt included the directive to address the ACTFL world-readiness standards, the resulting score decreased from 19.8 (P.3) to 19.6 (P.4), remaining lower than the average scores from P.2. However, the addition of the checklist of criteria in P.5 raised the mean to the overall highest of 21.6. In addition, adding specificity had a significant impact on the overall word count of the lesson plan (p = 0.0047, F = 4.338, DF = 4), with lessons generated by P.3 being significantly shorter than those from P.1 and P.2 (p = 0.007 and p = 0.018, respectively) and those generated by P.4 being significantly shorter than those from P.1 (p = 0.048).
NMDS-based visualizations of prompt output scores between prompting groups revealed several general aspects of how outputs aligned with the scoring criteria (Figure 2). All prompt outputs shared some degree of overlap. However, there was also separation between substantial regions of the individual prompt clusters. This was particularly evident when comparing P.1 and P.2 with the remaining prompts, whose centroids largely occupied an alternative region of the NMDS space. This separation was supported by a PERMANOVA (F = 3.1556, p = 0.001), indicating significant differences in multivariate profiles and supporting that prompt group had a significant effect on the outputs produced. The results of an ANOSIM complemented the PERMANOVA, again supporting significant differences between groups (R = 0.141, p = 0.001).
The two prompt characteristics that most profoundly changed the ChatGPT output were the introduction of the lesson plan template in P.3 and the introduction of the scoring criteria in P.5. Once the lesson plan template was introduced, scores on the warm-up and teacher input portions of the lesson decreased dramatically. The average score on the warm-up, termed the “hook” in the template, revealed that P.1 (2.2 out of 3 possible points) and P.2 (2.5/3) scored much higher on the three portions of the checklist that corresponded to the rubric (see 2a–c in Prompt 5) than P.3 (1.4), P.4 (1.0), and P.5 (1.4). ChatGPT repeatedly provided a warm-up that was some version of the following: “Quick video showcasing a vibrant Costa Rican restaurant scene, highlighting local foods, prices, and restaurant ambiance.” Because that directive did not 1(a) address the ACTFL standards or 2(c) activate prior knowledge, it consistently scored lower.
The same was true for the Teacher Input. The average score for the four components related to the teacher input revealed that P.1 and P.5 both received perfect scores of 4 out of 4, unlike the prompts in between that scored 2.8/4 (P.2), 2/4 (P.3), and 2/4 (P.4). Lack of interaction and student engagement caused these low scores, even though the prompt for Teacher Input on the template stated, “Tip: This is where you introduce the new learning that addresses the Can-Do statement. You should engage students in interaction.” Despite that directive, the lesson plans produced for P.3 and P.4 had teacher inputs with activities like, “Introduce vocabulary related to restaurants, foods commonly found in Costa Rica (e.g. “casado,” “gallo pinto”), and phrases for asking about prices. Use images and realia to support learning.”
4.2 How does the specificity of a prompt influence the consistency of AI-generated responses?
Scoring of prompt output revealed high variance in scores within prompt groups (Figure 3). For example, scores resulting from P.1 ranged from 23/25 to 16/25, with an average score of 20.1/25. Likewise, scores for P.5, which contained the scoring criteria, ranged from a perfect score of 25/25 to 20/25. In general, outputs generated from identical prompts varied by at least five elements. Corresponding to the high variance observed in the raw scores, estimation of Jaccard distances provided little indication of distinct clustering by prompt type that would indicate strong dissimilarity between groups (Figure 4). Instead, within each prompt group, there were examples of highly divergent replicates as well as replicates that were highly similar to replicates from other prompt groups (Figure 4A). In other words, the convergence and divergence patterns observed in the distance matrix revealed a mixture of similarity and variability between and within prompt groups. For example, the 10th replicate of Prompt 5 (5.10) was highly similar to the sixth replicate of Prompt 2 (2.6), indicating that output from a less detailed prompt can converge in score with output generated by a more detailed prompt by chance. In contrast, the third output generated using Prompt 1 in independent chats (P1.3) was highly dissimilar to almost all other outputs (Figure 4A). This reveals a lack of rigid uniformity in how each distinct prompt type influenced outputs.
The dendrogram estimated using hierarchical clustering revealed a similar pattern to the raw distance matrix. Some prompt replicate outputs were highly dissimilar to the outputs from the other prompts (Figure 4B). Overall, there was some degree of differentiation between the prompt groups, with P.1 and P.2 having a higher distance on average from the P.3–P.5 replicates. However, the dendrogram also revealed numerous cases of convergence between prompt group replicates (Figure 4B). For example, three replicates from P.4 (P.4.9, P.4.5, P.4.3) were identical in scoring to a replicate from P.5 (P.5.3), two replicates from P.3 (P.3.4, P.3.2), and one replicate from P.1 (P.1.9). Similar cases of convergence in scoring groups were found throughout the dendrogram. Collectively, these results do not support the hypothesis that more specific prompts always lead to predictable and deterministic outputs, highlighting the importance of considering the variability inherent to AI-generated content when assessing prompt/output relationships.
4.3 Does ChatGPT demonstrate any overall strengths or weaknesses in lesson plan design, regardless of prompt specificity?
Assessing patterns of missing components in the scoring rubrics from prompt outputs revealed high heterogeneity between categories (Figure 5). Categories present in all outputs included meaningful context, teacher input aligning with lesson objectives, and activities appropriate to students’ proficiency level. Several categories were also present in almost all cases except a few replicates of P.1 or P.2, including fostering student engagement, showing a connection to the learner’s world, establishing a purpose for the lesson in the warm-up, and the integration of ACTFL standards into the closure and independent practice (Figure 5).
However, the presence of several categories varied conspicuously between prompt groups. For example, cultural connection and ACTFL standards being integrated into teacher input were largely restricted to the outputs generated by P.5. Warm-up serving to activate prior knowledge and integration of the ACTFL standards into the focus and review were largely restricted to groups P.1 and P.2. Teacher input engaging learners in interaction was restricted largely to P.1, P.2, and P.5. In all of these examples of heterogeneity between prompt outputs, the appearance of rubric elements was often restricted to only around 50% of the outputs, indicating non-determinism in generated responses.
5. Discussion
Our results demonstrate several significant aspects of ChatGPT’s output in regard to prompt specificity, variability, and possible weaknesses that can guide usage. Overall, these results suggest that ChatGPT was largely able to create aligned lesson plans, confirming existing research that zero-shot approaches can produce target outputs (Ateia & Kruschwitz, 2023; Hu et al., 2024). However, our results underscored that simply providing additional context and specificity in a prompt did not guarantee a concomitant increase in output score. On the one hand, we observed a moderate degree of convergence in the outputs between the prompt categories, with several features of the scoring criteria present in all prompt outputs. On the other hand, we observed extreme cases of variability in which the same prompt yielded outputs perfectly or almost perfectly aligned with desired outcomes as well as outputs that missed numerous criteria. In several cases, this variability reflected outputs aligned with anachronistic pedagogical practices that no longer reflect best practices. This suggests the presence of possible biases in the neural network stemming from training on historic data that may steer users toward research-rejected teaching practices. Whether such biases permeate other aspects of instructional design requires additional study. In the subsequent subsections, we discuss each major finding and include a pedagogical implication for each.
5.1 Increasing specificity does not always lead to increasing quality
Determining the necessary specificity for a desired output is considered a critical aspect of prompting AI models (Krause, 2023). However, we found that the relationship between prompt specificity and the output score of generated lesson plans was not linear. For example, the average score of the output resulting from P.3 decreased relative to P.2 and remained virtually unchanged (0.2 difference) for P.4. In P.3, we provided a lesson plan template for the first time. This embedded description of the task guided the computational response of the AI model toward less detailed plans. Additionally, the opening section of the lesson plan, called the Warm-up or Focus and Review, was labeled as the hook (Wiggins & McTighe, 2005). ChatGPT seemed to interpret this terminology as input that is teacher-centered and does not require student interaction, which resulted in lower scores for related portions of the lesson plan. As a consequence, once the lesson plan template was included as input beginning with P.3, the warm-up/hook became almost formulaic. Although P.3 was technically more specific, the decrease in score suggests that simply including a lesson plan template for context is not an effective strategy for increasing quality. Instead, including scoring criteria increased output quality. Just as students are more effective when given scoring criteria along with the description of an assignment (Jonsson & Svingby, 2007), so too was ChatGPT. This was readily observed with the addition of the checklist of required criteria in P.5, which raised the mean to the overall highest of 21.6/25, suggesting that such input may be critical for optimizing AI-based lesson plan generation.
When prompting ChatGPT to create foreign language lesson plans, teachers should include a meaningful context, lesson objectives, and the scoring criteria in the prompt. The inclusion of the first two in the prompt resulted in high scores for these categories across all prompts. The inclusion of scoring criteria in P.5 resulted in the highest average score on output for that prompt. If one’s school or district requires the lesson in a particular format, then including that format can be useful as well. However, specialized terminology like “hook” should be fully explained in the input. It is also important to note that a lesson plan template that works well for teachers’ own instructional design may differ from one that works well for ChatGPT and may require iterative refinement. Teacher educators should experiment with using different templates for ChatGPT to determine which works best for their needs.
5.2 Variable responses are an inherent feature of AI output
Generative AI models break down prompts into textual representation and logic to parse the tasks required of them. As users add specificity and context to a prompt, this provides additional logic to guide the model toward desirable outputs (Giray, 2023). However, our work underscores that outputs from the same prompt are often not deterministic. In our case, scoring of prompt output revealed high variance in scores within prompt groups (Figure 1). Simply by chance, some outputs from the same prompt missed over 25% of scoring criteria while others were over 90% complete. Similar variability and lack of deterministic output have been observed in other fields and are often attributed to similar weights between word pattern choices or the temperature value in the transformer model (Ouyang, Zhang, Harman & Wang, 2023). Even at temperature settings of 0, models like ChatGPT have been found to return non-deterministic and erroneous outputs (Lubiana et al., 2023).
Teachers and researchers should experiment with using additional prompt follow-ups to improve output. In the present study, we did not engage in back-and-forth dialogue with the chatbot to avoid non-tractable variability. However, the additional input provided to ChatGPT would likely produce lessons more aligned to one’s criteria. Moreover, in the context of teacher education, requiring teacher candidates to evaluate strengths and weaknesses of lesson outputs and follow-up with additional requests would provide a window into their thinking and development. For example, once a chatbot produces a lesson plan, a teacher might follow up with a prompt like, “Please engage students in interaction during the Teacher Input,” or “Please script out questions for the teacher to ask during the Teacher Input,” or “Please add a Closure in which the teacher asks students how dining practices might differ in Costa Rica and the United States.” In this way, teacher educators could assess teacher candidates’ abilities to critically evaluate a lesson plan and pinpoint areas in need of mediation.
5.3 Weaknesses in ChatGPT’s output often reflected historic shifts in instructional design
Generative AI models such as ChatGPT were trained on vast collections of text to develop mathematical associations between words that form the basis of each output. However, such training can induce algorithmic biases toward over-represented associations that do not reflect current knowledge or practices. Algorithmic bias may be particularly problematic for education, as there have been dramatic shifts in pedagogical practices in recent decades. For example, many of the lesson plans created by ChatGPT reflected aspects of the audio-lingual method, a behaviorist approach popular in the 1970s in which learners listened to and repeated pre-scripted dialogues. In some lesson plan iterations, ChatGPT directed teachers to give students a role-play script to rehearse and present. In others, the plans asked the teacher to show flashcards for students to practice pronunciation. A second example related to the historic practice in foreign language instruction of focusing solely on the teaching of language with no attention to culture (Kramsch, 1991). Pre-service teachers have historically struggled with this lesson plan component on teacher licensure exams (Hildebrandt & Swanson, 2014) and ChatGPT reflected that bias. This component was met in only five of the 50 lesson plans. Even when it was included in the input scoring rubric for P.5, output only included this component 50% of the time. This finding supports the view that the teaching of culture is an area within which current AI falls short (Kern, 2024) and that this particular aspect of lesson planning requires human attention when using generative AI.
Teacher educators must impress upon teachers the critical importance of evaluating and engaging with AI outputs. Guichon (2024) refers to this as “metatechnolinguistic competence” or “the ability to develop critical as well as ethical knowledge about the tools available to carry out a task involving language” (p. 565). Extrapolating from our results and considering the number of teachers who could use ChatGPT for lesson plan design, there is a high chance that similar relics of outdated methods will appear across outputs. Should users not look critically at generated output, these atavisms might become pervasive features of pedagogical materials. Moreover, instructional materials generated by teachers using the same prompt have the potential to vary widely in quality. Even with refinement and iterative prompting, given the temperature settings in generative models such as ChatGPT, there is no guarantee that outputs will be of high quality in every instance. For teachers, we advocate that engagement with AI tools and their output should be considered a key area of AI literacy and competency training (Dell’Acqua et al., 2023). It is not clear how teachers currently using generative AI tools are evaluating outputs, and research on this topic is needed.
6. Conclusion
Understandably, many teachers and teacher educators are skeptical and cautious about using and allowing the use of ChatGPT (see Kern, 2024). But many language teachers, especially those in elementary and secondary public schools, are experiencing increasing course loads and decreasing resources (Mason, 2017). Our results demonstrate that ChatGPT can help to streamline the process of lesson planning, although it requires critical evaluation on the part of the human teacher. Awareness of possible biases toward outdated pedagogical practices, which can occur by chance, marks an urgent need in teacher AI literacy. The rates of error that can currently be expected stochastically in foreign language instructional materials remain unknown because evaluations of models such as ChatGPT have focused on single prompt outputs. Our work suggests that historical biases could be pervasive and should be expected to occur in output by chance. To avoid reintroducing research-rejected practices into modern curricula, it is essential that teachers be trained to modify and revise outputs.
There is an ongoing debate on whether generative AI will replace human language teachers or make the need for language teaching and learning obsolete (Kern, 2024). However, our study underscores the promise of a synergistic, not antagonistic, relationship between this emerging technology and language educators. As we enter what Gao (2024) refers to as a “brave new world” (p. 556) of technological development, we urge consideration of the implications for language teacher preparation. Language teacher professional development can incorporate training on how generative AI can be used to increase the efficiency and impact of daily tasks. This provides teacher educators with the unprecedented opportunity to integrate LLMs into existing approaches to teacher preparation. Continued research into successes and pitfalls of such efforts will become critical not only for teacher preparation in the 21st century but also for teachers to keep pace with students who often use these tools in their daily lives.
Supplementary material
To view supplementary material for this article, please visit https://doi.org/10.1017/S0958344024000272
Data availability statement
Full prompts and ChatGPT-generated outputs are available in the supplementary material. Output scores and code used in the analyses are available on Zenodo (DOI: 10.5281/zenodo.11060097; Link: https://zenodo.org/records/11060097).
Ethical statement and competing interests
This work involved no human subjects. The authors declare no competing interests.
Use of artificial intelligence tools
This study assessed the performance of generative AI outputs generated by ChatGPT. Full prompts and ChatGPT-generated outputs are available in the supplementary material and associated data archive. Generative AI was not used to draft this manuscript nor generate the figures.
About the authors
Alex Dornburg is an assistant professor in the College of Computing and Informatics at the University of North Carolina at Charlotte. His research interests include AI applications, phylogenetics, evolutionary immunology, and comparative genomics.
Kristin J. Davin is a professor of foreign language education at the University of North Carolina at Charlotte. Her research interests include foreign language teacher development, language policies, and assessment.