1. Introduction
Large language models (LLMs) have revolutionized artificial intelligence (AI) and natural language processing (NLP) by achieving unprecedented proficiency in understanding and generating human language. Built upon the transformer architecture, LLMs excel in tasks such as text generation, machine translation, question answering and sentiment analysis, often matching or surpassing human performance (Naveed et al., 2024).
This paper provides a brief overview of LLMs, touching upon their theoretical foundations, technical advancements and practical applications. We begin by introducing the transformer architecture, which addressed the limitations of earlier models like recurrent neural networks (RNNs) and convolutional neural networks (CNNs) through self-attention mechanisms that capture long-range dependencies and contextual relationships. We explore how scaling laws, increased model sizes and advanced training techniques have propelled LLMs to new heights. Key innovations such as masked language modeling (MLM) and causal language modeling (CLM) underpin models like BERT (Devlin et al., 2019) and the GPT series. The paper also examines practical methodologies that enhance the adaptability and precision of LLMs, including fine-tuning strategies like parameter-efficient fine-tuning (PEFT) and techniques such as prompt engineering. We address challenges associated with LLMs, such as computational demands, biases and hallucinations – where models generate plausible but incorrect information – and present solutions like retrieval-augmented generation (RAG) to improve factual accuracy.
By outlining both the capabilities and limitations of LLMs, this paper aims to provide a foundational understanding for legal researchers, practitioners and students. We emphasize the transformative potential of these models in shaping the future of AI and language technologies, underscoring the importance of ongoing research to enhance efficiency and address ethical considerations.
2. Understanding the context: NLP and neural networks
NLP is a multidisciplinary field that combines linguistics, computer science and machine learning to enable machines to interpret and generate human language. Early NLP systems relied heavily on rule-based methods, which required extensive domain knowledge and were limited in scalability. These were soon replaced by statistical approaches and simple neural networks (McCulloch & Pitts, 1943), such as the perceptron (Rosenblatt, 1958), which could learn basic patterns from data.
2.1. Neural networks: Foundations and challenges
Neural networks, inspired by the structure of the human brain, consist of layers of interconnected nodes or “neurons.” These neurons process input data by applying weights (scaling factors) and biases (offsets) before passing the result through an activation function. Early models like the perceptron were capable of handling simple classification tasks by adjusting these parameters to minimize errors, as quantified by a loss function (Terven et al., 2024) – a measure of the difference between predicted and actual outputs. However, these models were limited to linear decision boundaries and struggled with more complex, nonlinear problems.
The introduction of the backpropagation algorithm (Werbos, 1974) marked a significant advancement, allowing neural networks to adjust weights and biases more effectively using gradient descent. This method calculates gradients of the loss function to iteratively update the network’s parameters. Despite this breakthrough, deeper networks encountered the vanishing gradient problem (Hochreiter, 1998), where gradients diminished as they propagated backward, slowing or halting the learning process in earlier layers.
2.2. Hardware and architectural advances
The resurgence of neural networks in the 21st century was driven by advancements in hardware, particularly graphics processing units (GPUs), which enabled efficient parallel computation. These improvements made it feasible to train deeper networks on large datasets, resulting in breakthroughs in tasks like computer vision and speech recognition. However, neural networks still faced limitations in handling sequential data and long-range dependencies, crucial for many NLP tasks.
2.3. From RNNs and CNNs to transformers
To address these challenges, more advanced architectures were developed. RNNs (Rumelhart, Hinton & Williams, 1986) introduced feedback loops to retain information across time steps, making them suitable for sequential data. Similarly, CNNs (LeCun et al., 1989; LeCun et al., 1998), designed for grid-like data such as images, provided local pattern detection. While these architectures offered improvements, they still struggled with scalability and efficiently capturing long-range dependencies in NLP tasks.
The introduction of transformers revolutionized NLP by addressing these challenges, offering superior handling of context and enabling parallel processing of large datasets. This innovation laid the groundwork for the development of LLMs, which can capture intricate language patterns and perform complex tasks with remarkable accuracy and fluency.
3. Transformer-based architecture: A new paradigm
Introduced in the seminal 2017 paper Attention is All You Need (Vaswani et al., 2017), the transformer architecture fundamentally shifted the way models process and understand sequential data by eliminating the need for the recurrent and convolutional networks traditionally used in language models. Instead, transformers rely on a mechanism called self-attention, which allows them to consider the entire input sequence simultaneously rather than processing it step-by-step.
4. Self-attention mechanism
Self-attention is the key innovation of the transformer architecture. Unlike recurrent networks, which process data in order, or convolutional networks, which focus on local patterns, self-attention enables the model to weigh the relevance of each word (or token) in the input sequence with respect to every other word, capturing long-range dependencies and contextual relationships more effectively.
Mathematically, self-attention operates as follows:
1. Input representation:
Given an input sequence of tokens x1, x2, …, xn, each token is embedded into a continuous vector space to obtain embeddings e1, e2, …, en.
2. Linear projections:
For each embedding ei, we compute three vectors: a query qi, a key ki and a value vi, using learned weight matrices Wq, Wk and Wv: qi = ei Wq, ki = ei Wk, vi = ei Wv.
3. Scaled dot-product attention:
The attention score between token i and token j is calculated using the scaled dot-product of their queries and keys: score(i, j) = (qi · kjᵀ) / √dk, where:
● qi · kjᵀ denotes the dot product of qi and the transpose of kj.
● dk is the dimensionality of the key vectors.
● The division by √dk scales the dot products to prevent large values that could result in small gradients during training.
The attention weights are obtained by applying the softmax function (Figure 1) to the attention scores: αij = exp(score(i, j)) / Σl exp(score(i, l)).
The output for each token i is a weighted sum of the value vectors vj of all tokens: zi = Σj αij vj.
4. Context vector computation:
Collectively, in matrix form: Attention(Q, K, V) = softmax(Q Kᵀ / √dk) V, where:
○ Q, K, and V are matrices of queries, keys, and values for all tokens.
○ Kᵀ is the transpose of K.
○ The multiplication Q Kᵀ computes the attention scores for all pairs of tokens simultaneously.
This mechanism allows the model to focus on relevant parts of the input while generating output. For example, in the sentence “The cat sat on the mat because it was soft,” the model can accurately capture that “it” refers to “the mat” by assigning higher attention weights between these tokens.
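To make these steps concrete, the following minimal Python/NumPy sketch computes scaled dot-product self-attention for a toy sequence. The dimensions and random projection matrices are illustrative assumptions, not values from any real model.

import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8                     # toy sequence of 4 tokens
E = rng.normal(size=(n, d_model))             # token embeddings e1..en
W_q = rng.normal(size=(d_model, d_k))         # learned projections (random here, illustrative)
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))
Q, K, V = E @ W_q, E @ W_k, E @ W_v           # queries, keys and values

scores = Q @ K.T / np.sqrt(d_k)               # scaled dot-product attention scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
Z = weights @ V                               # context vector for each token
print(weights.round(2), Z.shape)              # attention matrix and (4, 8)

Each row of the resulting attention matrix plays the role of the weights αij described above.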
4.1. Positional encoding
One challenge in processing sequences simultaneously is maintaining the sense of order in the data, as transformers do not process inputs sequentially like RNNs. To address this, positional encoding (Chen et al., 2021) is introduced. As shown in Figure 2, positional encoding adds information about the position of each token in the sequence, ensuring the model can differentiate between words appearing at different positions and preserve the natural order of language (Kazemnejad et al., 2023). These encodings are incorporated into the model’s input embeddings, allowing transformers to maintain both position and context without the need for recurrence.
The positional encoding PE is added to the input embeddings to inject positional information: each token’s embedding is summed with the encoding vector of its position.
The positional encoding is defined using sine and cosine functions of varying frequencies. For position pos and dimension i:
● For even dimensions (2i): PE(pos, 2i) = sin(pos / 10000^(2i/d_model)).
● For odd dimensions (2i + 1): PE(pos, 2i + 1) = cos(pos / 10000^(2i/d_model)).
Where:
● pos is the position index of the token in the sequence.
● i is the dimension index.
● d_model is the dimensionality of the embeddings.
This formulation allows the model to learn positional relationships because the positional encodings provide unique vectors for each position, and the sinusoidal functions enable the model to generalize to sequences longer than those seen during training.
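As an illustrative sketch of the formulas above (assuming an even embedding dimension), the sinusoidal encodings can be generated as follows:

import numpy as np

def positional_encoding(seq_len, d_model):
    # pos / 10000^(2i/d_model) for each position and each dimension pair
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, (2 * i) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)              # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)                               # (50, 16): one unique vector per position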
4.2. Multi-head attention
While self-attention allows the model to consider relationships between tokens, multi-head attention (represented in Figure 3) extends this capability by enabling the model to focus on different positions and representation subspaces (Cordonnier, Loukas & Jaggi, 2021).
1. Multiple attention heads:
Instead of computing attention once, the model uses h different attention heads, each with its own set of learned projection matrices Wqi, Wki and Wvi.
For head i: headi = Attention(Q Wqi, K Wki, V Wvi).
2. Concatenation and output projection:
The outputs from all heads are concatenated and projected to form the final output: MultiHead(Q, K, V) = Concat(head1, …, headh) WO,
where WO is the output projection matrix.
By having multiple heads, the model can capture diverse aspects of the input, such as syntax and semantics, and learn different types of relationships.
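A minimal sketch of this idea, reusing the attention computation shown earlier and splitting the model dimension into h heads (sizes and random weights are again illustrative):

import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    s = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
n, d_model, h = 4, 16, 4                      # 4 heads, each of size d_model // h = 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def split_heads(M):                           # (n, d_model) -> (h, n, d_k)
    return M.reshape(n, h, d_model // h).swapaxes(0, 1)

heads = attention(split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v))
concat = heads.swapaxes(0, 1).reshape(n, d_model)   # concatenate the h head outputs
output = concat @ W_o                               # final output projection WO
print(output.shape)                                 # (4, 16)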
4.3. Feed-forward networks and residual connections
After the multi-head attention layer, each position undergoes a fully connected position-wise feed-forward network (FFN): FFN(x) = max(0, x W₁ + b₁) W₂ + b₂
Where:
● W₁ and W₂ are weight matrices.
● b₁ and b₂ are bias vectors.
● max(0, x) denotes the rectified linear unit activation function.
The FFN is applied independently to each position, allowing the model to transform the attended representations into a higher-level abstraction.
To facilitate training and improve gradient flow, the transformer architecture employs residual connections and layer normalization:
1. Residual connections:
The input to each sublayer is added to its output: x + Sublayer(x).
2. Layer normalization:
The residual output is normalized to stabilize training: LayerNorm(x + Sublayer(x)).
These techniques help prevent vanishing or exploding gradients and allow for deeper networks by ensuring that the signal remains strong as it moves through the layers.
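Putting these pieces together, each encoder sublayer follows the pattern LayerNorm(x + Sublayer(x)). The sketch below applies that pattern to the FFN sublayer, with illustrative shapes and random weights:

import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2       # max(0, xW1 + b1)W2 + b2

rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 16, 64
x = rng.normal(size=(n, d_model))                     # output of the attention sublayer
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = layer_norm(x + ffn(x, W1, b1, W2, b2))          # residual connection + layer norm
print(out.shape)                                      # (4, 16)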
5. Overall transformer architecture
The transformer architecture consists of two main components: the encoder and the decoder. This design is particularly effective for sequence-to-sequence tasks like machine translation, where an input sequence in one language is transformed into an output sequence in another.
5.1. Encoder–decoder structure
The encoder-decoder structure, illustrated in Figure 4, is a fundamental mechanism in sequence-to-sequence models designed for tasks such as translation, summarization, and text generation.
The encoder processes the input sequence to generate a contextual representation. It begins by converting input tokens into continuous vectors using an embedding layer and adds positional encodings to retain the order of tokens. The encoder is composed of multiple identical layers, each containing a multi-head self-attention mechanism and a position-wise FFN, both followed by residual connections and layer normalization to enhance training stability and gradient flow.
The self-attention mechanism allows each token to attend to all other tokens in the sequence, capturing dependencies regardless of distance. The FFN further transforms these representations, introducing nonlinearity and enabling the model to learn complex patterns.
The decoder generates the output sequence by predicting one token at a time, using both the encoder’s output and its own previously generated tokens. Like the encoder, it starts with an embedding layer and positional encodings. Each decoder layer includes a masked multi-head self-attention mechanism (to prevent access to future tokens), a multi-head attention mechanism over the encoder’s output (allowing focus on relevant parts of the input) and a position-wise FFN, each followed by residual connections and layer normalization.
During the encoding phase, the encoder processes the entire input sequence simultaneously, producing encoded representations for each position. In the decoding phase, the decoder generates the output sequence step by step. At each step, it considers its own past outputs through masked self-attention and attends to the encoder’s output via encoder–decoder attention, enabling it to incorporate information from the input sequence relevant to generating the next token.
5.1.1. Decoder-only transformers
In some applications, only the decoder part of the transformer is used. Decoder-only transformers, such as the GPT series, are specialized for tasks involving sequence generation based on prior context, like language modeling and text generation. These models consist solely of decoder layers with masked multi-head self-attention to ensure that predictions depend only on preceding tokens. They are trained to predict the next token in a sequence, making them ideal for tasks like autocomplete and text continuation. A visual representation of this structure, showing an attention word heatmap of a decoder-only architecture, is illustrated in Figure 5.
5.1.2. Encoder-only transformers
Conversely, encoder-only transformers consist solely of the encoder stack and are designed for language understanding tasks. Models like BERT utilize this architecture. They employ bidirectional self-attention mechanisms, allowing tokens to attend to both past and future positions, thereby capturing context from the entire sequence. These models are trained using MLM, where some input tokens are masked, and the model learns to predict them based on surrounding context. This approach is effective for tasks such as sentiment analysis, named entity recognition, and question answering.
5.2. Open perspective
Despite their strengths, transformers are not without challenges. The self-attention mechanism, while powerful, requires significant computational resources, particularly in terms of memory. This is because self-attention involves comparing every element in the input sequence with every other element, which scales quadratically with the input length. For very large datasets or long input sequences, this can become prohibitively expensive. However, recent research has focused on addressing these limitations by developing more efficient variants of transformers, such as sparse transformers and reformers, which aim to reduce the computational load without sacrificing performance. Additionally, quantized models (Egashira et al., 2024) further enhance efficiency by reducing the precision of model weights (e.g., from 32-bit to 4-bit), allowing significant reductions in memory usage and enabling models to run on smaller hardware without significant performance loss.
6. LLMs: Scaling transformers to new heights
Building upon the transformative capabilities of the transformer architecture, LLMs represent a significant advancement in AI by scaling the core innovations of transformers to unprecedented levels (Naveed et al., 2024). LLMs leverage self-attention mechanisms and extensive training to capture intricate patterns in text, enabling them to perform a wide array of language tasks with remarkable proficiency.
6.1. Scaling laws and model sizes
A critical aspect of LLMs is their scale – in terms of model size, training data quantity and computational resources – which significantly impacts their performance. Research by AI labs and research centers has established scaling laws that describe how increasing these factors leads to predictable improvements in model capabilities:
● Model size (parameters): LLMs like GPT-3 and GPT-4 contain hundreds of billions of parameters. Increasing the number of parameters allows the model to capture more complex patterns and nuances in language.
● Data quantity: Training on larger datasets exposes the model to a broader range of language uses, contexts and knowledge. This diversity enhances the model’s ability to generalize across different tasks.
● Compute resources: Training large models on vast datasets requires substantial computational power. Advances in hardware (such as GPUs and TPUs) and distributed training techniques have enabled the training of LLMs at this scale.
Scaling laws suggest that as we proportionally scale up model size, data and compute resources, the model’s performance continues to improve, often following a power-law relationship. This has motivated the development of ever-larger models to push the boundaries of language understanding and generation. However, this approach brings with it several challenges and considerations – both economic and ethical – such as the increasing need for expensive computational resources and the environmental impact of large-scale electricity consumption. This has led researchers to explore different directions, such as using smaller models (Lu et al., 2024) in combination with highly curated and specialized training sets as an alternative to ever-growing models (Liu et al., 2024).
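For illustration only, the short sketch below evaluates a power law of the form L(N) = (Nc / N)^α, the general shape reported in scaling studies; the constants are assumptions chosen to show the trend, not measured values.

# Hypothetical constants: illustrative of the power-law shape only
N_c, alpha = 1e14, 0.08

for n_params in (1e8, 1e9, 1e10, 1e11):
    loss = (N_c / n_params) ** alpha
    print(f"{n_params:.0e} parameters -> predicted loss {loss:.2f}")

Under such a law, each tenfold increase in parameters lowers the predicted loss by a constant factor, which is why gains continue but become progressively more expensive to obtain.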
6.1.1. Domain-specific small language models
While scaling has driven remarkable achievements in general-purpose language models, recent research has demonstrated the promise of smaller, specialized models trained on domain-specific data. These models, typically ranging from hundreds of millions to a few billion parameters, leverage targeted training data to achieve performance comparable to larger models within their specialized domains (Hsieh et al., 2023; Javaheripi et al., 2023). The efficiency of these models stems from their concentrated focus on domain-specific patterns, terminology and task requirements, effectively reducing the computational overhead associated with maintaining broad language understanding; with parameter counts several orders of magnitude smaller, they also have a significantly reduced environmental impact (Schick & Schütze, 2020). This approach has proven particularly valuable in fields such as biomedicine (Gu et al., 2021) and legal document analysis, where domain expertise and precision are necessary. The success of these specialized models suggests that strategic data curation and domain-focused architecture optimization may offer a complementary path to the scaling paradigm (Zhang et al., 2024).
6.2. Training techniques
LLMs are typically trained in two stages: pretraining (Schneider, Meske & Kuss, 2024; Wang et al., 2023; Zhou et al., 2023) and fine-tuning (Parthasarathy et al., 2024). During pretraining, the model learns general language representations from vast amounts of text data without explicit supervision (Ding et al., 2023). Two primary objectives guide this phase:
1. Causal language modeling (CLM): Utilized by models like the GPT series, CLM trains the model to predict the next word in a sequence given all previous words. This unidirectional approach is suitable for generation tasks, where the model maximizes the likelihood of the next word based on the preceding context.
2. Masked language modeling (MLM): Employed by models such as BERT, MLM involves predicting missing words in a sequence where some tokens are randomly masked. This bidirectional approach allows the model to learn from both left and right contexts, minimizing the prediction error for the masked tokens (Merchant et al., 2020). A minimal illustration of both objectives is sketched below.
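As a minimal illustration (the sentence and the masked positions are invented for demonstration), the two objectives construct their training targets differently:

tokens = ["the", "court", "ruled", "in", "favour", "of", "the", "plaintiff"]

# Causal language modeling: at each position the target is simply the next token.
clm_pairs = [(tokens[:i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]
print(clm_pairs[2])      # (['the', 'court', 'ruled'], 'in')

# Masked language modeling: hide some tokens (BERT masks roughly 15% at random;
# here two positions are masked by hand) and predict them from both sides.
mask_positions = {2, 6}
masked = ["[MASK]" if i in mask_positions else tok for i, tok in enumerate(tokens)]
targets = {i: tokens[i] for i in mask_positions}
print(masked)            # ['the', 'court', '[MASK]', 'in', 'favour', 'of', '[MASK]', 'plaintiff']
print(targets)           # {2: 'ruled', 6: 'the'}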
After pre-training, LLMs undergo fine-tuning on task-specific datasets to adapt them to particular applications. Fine-tuning can be supervised, using labeled data for tasks like question answering, sentiment analysis, or named entity recognition. In cases where labeled data is scarce, unsupervised fine-tuning leverages unsupervised objectives to adapt the model to new domains.
Unsupervised learning plays a crucial role in the initial training phase, enabling the model to learn general language patterns from unlabeled data. Supervised learning becomes important during fine-tuning, where the model is taught to perform specific tasks based on labeled datasets.
6.2.1. Parameter-efficient fine-tuning (PEFT)
PEFT (Han et al., 2024; Xu et al., 2023) methods have emerged to address the computational challenges associated with fine-tuning massive models with billions of weights (Fu et al., 2023). Instead of updating all weights, techniques like Low-Rank Adaptation (LoRA) allow only a small subset of weights to be fine-tuned. LoRA (Hu et al., 2021) introduces low-rank matrices to specific layers, adapting the model without altering its full architecture, which significantly reduces memory and computational demands.
Quantized LoRA (QLoRA) combines this approach with quantization, storing model weights in lower-precision formats like 4-bit, further reducing resource needs while maintaining accuracy. Quantization-aware LoRA (QA-LoRA) (Xu et al., 2023) goes a step further by applying quantization selectively to critical weight matrices, balancing efficiency and performance in constrained environments. These techniques enable the fine-tuning of large models on smaller hardware, reducing computational overhead without sacrificing precision.
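A minimal NumPy sketch of the LoRA idea, using illustrative sizes: the pretrained matrix W stays frozen, and only the two small matrices A and B are trained.

import numpy as np

rng = np.random.default_rng(0)
d = 1024
W = rng.normal(size=(d, d))                 # frozen pretrained weight (illustrative size)

r, alpha = 8, 16                            # low rank r << d and a scaling factor
A = rng.normal(size=(r, d)) * 0.01          # trainable
B = np.zeros((d, r))                        # trainable, initialized to zero

W_adapted = W + (alpha / r) * (B @ A)       # effective weight used at inference
print(A.size + B.size, "trainable vs", W.size, "frozen parameters")   # 16,384 vs 1,048,576

QLoRA follows the same pattern but stores the frozen base weights in a 4-bit format, so only the small adapter matrices remain in full precision.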
6.3. Parameter tuning
While fine-tuning optimizes model architecture for specific applications, parameter tuning offers flexible adjustments to model outputs based on input prompts. This helps tailor responses for characteristics like creativity, precision or length, enhancing task-specific performance without altering the model’s structure (Liao et al., 2022).
6.3.1. Key parameters in LLM tuning
● Temperature: This controls the randomness of the model’s output by adjusting the probability distribution of predicted words. Lower temperatures make responses more deterministic, while higher temperatures increase variability, fostering creativity in responses like poetry.
● Seed: The seed ensures reproducibility by fixing the random number generator’s starting point, making it possible to produce the same outputs for identical inputs across multiple trials – crucial for testing and debugging.
● Top-k sampling: This technique restricts the next word prediction to the k most probable words, reducing the risk of the model choosing unlikely or incoherent words. Smaller values of k make the output more focused, enhancing accuracy.
● Top-p (nucleus sampling): A more dynamic approach than Top-k, nucleus sampling selects words whose combined probability exceeds a certain threshold (p), ensuring a balance between diversity and coherence.
● Max Tokens: This parameter limits the number of tokens the model generates in response, useful for managing the length of outputs, such as in summarization tasks where brevity is needed.
● Frequency and Presence Penalty: Both parameters manage word repetition. Frequency penalties reduce redundancy by discouraging the model from repeating words, while presence penalties further limit the reuse of words that have already appeared in the text.
● Stop sequences: These are specific tokens that signal the model to halt text generation, particularly useful for structured tasks or dialogues that need concise responses.
● Logit bias: Logit bias allows for direct control over the probability distribution, steering the model toward or away from certain words – vital for ensuring the use of domain-specific terminology or avoiding irrelevant language.
By adjusting these parameters, users can ensure LLMs meet specific needs, whether optimizing for creativity, precision or domain-specific vocabulary. This layer of control complements fine-tuning and provides a powerful toolset for task adaptation, enabling more effective utilization of LLMs across diverse applications.
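The following sketch shows, under simplified assumptions, how temperature, top-k, top-p and a fixed seed interact when sampling a single next token from a toy vocabulary; real inference APIs expose these as request parameters rather than user-written code.

import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, seed=None):
    rng = np.random.default_rng(seed)               # fixed seed -> reproducible choice
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-6)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_k is not None:                           # keep only the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
        probs /= probs.sum()
    if top_p is not None:                           # nucleus: smallest set with mass >= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()
    return rng.choice(len(probs), p=probs)

vocab = ["contract", "tort", "statute", "banana"]   # toy vocabulary, illustrative only
logits = [2.0, 1.5, 1.0, -3.0]
print(vocab[sample_next_token(logits, temperature=0.7, top_k=3, seed=42)])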
6.4. Prompt engineering
Prompt engineering is a technique used to optimize the inputs provided to LLMs, ensuring they generate more accurate, relevant and useful outputs (Chen et al., 2023). A “prompt” in this context refers to the text or instructions given to the model, guiding its response. Unlike methods that alter the model’s architecture or underlying weights, prompt engineering focuses solely on refining the input to influence the output without changing the model itself.
6.4.1. Key concepts in prompt engineering
● Clarity and specificity: Well-crafted prompts are clear and specific, reducing ambiguity and leading to more accurate responses.
● Contextual information: Providing the right amount of background or context can significantly improve the relevance and coherence of the model’s outputs.
● Task demonstration (Few-shot learning): By including examples of the desired task (few-shot learning), the model can generalize better and provide higher-quality responses.
6.4.2. Techniques in prompt engineering
1. Zero-shot prompting: The model is expected to generate a response without any examples, relying purely on pre-trained knowledge.
2. One-shot prompting: A single example of the task is provided in the prompt, helping the model better understand the format and expected output while still minimizing the number of examples.
3. Few-shot prompting: A few examples of the task are included in the prompt, which helps the model understand the format and expected output.
4. Chain-of-thought (CoT) prompting: CoT guides the model through a step-by-step reasoning process, which is particularly effective for tasks that require logical progression or complex reasoning. CoT methodologies often have subcategories, such as Tabular CoT, which is tailored for handling tasks that involve structured data or tables by applying step-by-step reasoning within tabular formats.
5. Instruction tuning: Clear and direct instructions help the model perform specific tasks, such as summarizing or generating lists.
6. Self-consistency: This technique involves generating multiple responses for the same prompt and selecting the most consistent one, improving reliability, especially in reasoning tasks.
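As a purely illustrative example (the clauses and wording are invented), a prompt combining few-shot examples with chain-of-thought instructions might look like this:

You are assisting with contract review. Classify each clause as 'limitation of liability' or 'other'. Reason step by step before giving the label.

Clause: 'The Supplier's aggregate liability shall not exceed the fees paid in the preceding 12 months.'
Reasoning: The clause caps the total amount recoverable, so it restricts liability.
Label: limitation of liability

Clause: 'This Agreement shall be governed by the laws of England and Wales.'
Reasoning: The clause selects the applicable law; it does not restrict liability.
Label: other

Clause: '[clause to classify]'
Reasoning: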
6.5. Evaluating LLMs performance
Evaluating LLMs is essential to ensuring their accuracy, reliability and fairness across various applications, from healthcare to law. Performance evaluation can be divided into two main categories: human assessments and automated methods.
Human evaluation involves domain experts or users reviewing model outputs for factors like fluency, coherence, relevance and factual accuracy (Feng et al., 2024). This approach is especially critical in legal (Chalkidis et al., 2020), financial (Wu et al., 2023) and medical applications (Wang & Zhang, 2024), where nuanced and context-specific knowledge is required. However, human evaluation is labor-intensive and difficult to scale for large volumes of model iterations or outputs.
Automated evaluation methods provide scalable and objective metrics. They measure aspects such as fluency, accuracy and relevance of the text output. Common methods include the following:
● Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002), Recall-Oriented Understudy for Gisting Evaluation (ROUGE) (Lin, 2004) and Metric for Evaluation of Translation with Explicit ORdering (METEOR) (Lavie & Denkowski, 2009) scores: These assess the quality of text generation (e.g., translation, summarization) by comparing model outputs to reference texts based on content overlap and lexical similarity.
● Perplexity (Colla et al., 2022) and F1 scores (Zhang, Wang & Zhao, 2015): Perplexity measures how well a language model predicts sequences of text, focusing on fluency. F1 scores, combining precision and recall, are used for classification tasks to evaluate how well the model categorizes or identifies information.
● Adversarial robustness testing (Zimmermann et al., 2022): This method tests how LLMs perform under challenging or adversarial inputs, ensuring that models can handle unexpected or tricky queries without producing incorrect or biased responses.
● Fairness and bias testing (Rodolfa, Saleiro & Ghani, 2020): These frameworks measure the ethical performance of models by identifying and mitigating any gender, racial or cultural biases in generated content, ensuring the model outputs are fair and nondiscriminatory.
These evaluation techniques help optimize LLMs for performance while ensuring they meet ethical and reliability standards across various applications.
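As a small worked example of one automated metric, precision, recall and their harmonic mean (the F1 score) can be computed directly from confusion counts; the clause-classification numbers below are invented for illustration.

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)                      # share of flagged items that were correct
    recall = tp / (tp + fn)                         # share of relevant items that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy run: 40 clauses correctly flagged, 10 false alarms, 20 missed clauses
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")   # precision=0.80 recall=0.67 F1=0.73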
7. Context windows, hallucinations and other challenges in LLMs
LLMs excel in tasks involving language comprehension and generation, but they are not without limitations. Two of the most prominent challenges are the management of context windows and the issue of hallucinations, among other inherent difficulties in LLMs.
LLMs operate within fixed context windows (Dsouza et al., 2024), typically ranging from a few thousand to a few hundred thousand tokens. This limitation constrains the amount of text the model can consider at once. In scenarios requiring long-form analysis, like legal reviews or complex conversations, earlier parts of the input might be discarded, leading to a loss of continuity and potentially impacting the quality of the response. While newer models such as GPT-4 have extended context windows, the inherent limitation remains, posing challenges for tasks that demand deep contextual understanding.
Hallucinations (Azamfirei, Kudchadkar & Fackler, 2023) occur when LLMs generate text that is plausible but incorrect or entirely fabricated (Ye et al., 2023). Because LLMs generate predictions based on statistical patterns learned from training data, they might confidently present false information. This is particularly dangerous in critical fields such as healthcare, finance and law, where factual accuracy is essential. Models can invent statistics, references or claims, making their outputs harder to trust and verify. Mitigation strategies include refining training datasets, incorporating real-time knowledge bases and enhancing human oversight during model fine-tuning.
LLMs also struggle with bias amplification, as they reflect the biases present in their training data, which can perpetuate harmful stereotypes (illustrated in Figure 6, where an example of gender bias learned by the model is depicted). Additionally, LLMs remain opaque in their decision-making processes, making it difficult to interpret how outputs are generated. Finally, energy consumption is a growing concern (Samsi et al., 2023), as training large models demands substantial computational resources, raising ethical and environmental considerations.
8. Improving LLM accuracy: Retrieval-augmented generation (RAG)
LLMs have shown remarkable capabilities in natural language understanding and generation. However, as discussed above, they still face limitations, particularly around the accuracy and timeliness of the information they generate. These models are trained on vast datasets but may lack up-to-date or domain-specific knowledge, leading to hallucinations, outdated responses or factually incorrect outputs. This is where RAG steps in to enhance the accuracy and factual reliability of LLMs.
8.1. Retrieval-augmented generation (RAG)
RAG is an advanced approach that integrates information retrieval systems with LLMs to enhance their accuracy and relevance (Lewis et al., 2020; Li et al., 2022). RAG models combine the generative capabilities of LLMs with external knowledge sources, such as databases or document collections, enabling the model to pull real-time information rather than relying solely on pretrained data. This mechanism addresses the limitations of LLMs, such as hallucinations and outdated knowledge, by grounding generated responses in retrieved, factual data. A flowchart illustrating the architecture of a RAG model, detailing the interaction between the retrieval and generation components, is shown in Figure 7.
RAG operates in two phases: retrieval and generation. In the retrieval phase, the model searches a vast external knowledge base to gather relevant information based on the input query. In the generation phase, the retrieved data is then used to condition the response, enabling the LLM to provide more accurate and contextually grounded outputs. By integrating retrieval with generation, RAG mitigates the issue of hallucinations, significantly reducing instances of fabricated or inaccurate content.
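The sketch below shows the two phases in miniature; the two-document corpus, the word-overlap scoring and the prompt template are illustrative stand-ins for a real vector database and an actual LLM call.

documents = {
    "doc1": "The GDPR has applied across the EU since 25 May 2018.",
    "doc2": "The transformer architecture was introduced by Vaswani et al. in 2017.",
}

def retrieve(query, corpus, k=1):
    # Rank documents by simple word overlap with the query (stand-in for vector search)
    q = set(query.lower().split())
    ranked = sorted(corpus.values(),
                    key=lambda text: len(q & set(text.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_grounded_prompt(query, passages):
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the sources below.\nSources:\n{context}\nQuestion: {query}\nAnswer:"

query = "Since when has the GDPR applied in the EU?"
prompt = build_grounded_prompt(query, retrieve(query, documents))
print(prompt)   # in the generation phase, this grounded prompt is passed to the LLM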
The ability to retrieve up-to-date information makes RAG particularly effective for dynamic fields such as news reporting, medical diagnostics and legal document analysis, where real-time accuracy is paramount.
8.2. Evaluation of RAG efficacy and metrics
Evaluating the efficacy of RAG models requires both traditional and novel metrics tailored to the retrieval-enhanced framework. Key metrics include the following:
● Retrieval accuracy: Ensures that the external knowledge source effectively supplements the LLM, reducing hallucinations and improving factuality. Tools such as Retrieval Augmented Generation Assessment (RAGAs) (Es et al., 2023) are emerging to automate this process by evaluating both the retrieval quality and the final generated output.
● Factuality and groundedness: A critical metric for RAG models is ensuring that generated responses are factually grounded in the retrieved documents. Evaluation frameworks like Luna (Saidov et al., 2024) assess how well the generated text aligns with retrieved facts, helping to reduce inaccuracies and inconsistencies.
9. Conclusions
LLMs have revolutionized NLP by harnessing transformer architectures to achieve unprecedented proficiency in language understanding and generation. They have transformed industries such as healthcare, law and customer service by enabling applications that require high fluency and precision. Despite these advancements, LLMs face ongoing challenges, including computational resource demands, context window limitations and issues related to bias and factual accuracy.
Innovations like PEFT, quantization techniques and RAG are actively addressing these challenges, enhancing the efficiency, scalability and reliability of LLMs. As these models continue to grow in scale and capability, they hold the promise of extending beyond language tasks to impact fields like computer vision and enable multimodal AI applications.
With a continued focus on improving efficiency and addressing ethical considerations, LLMs are poised to play a pivotal role in shaping the future of technology and AI, driving forward the capabilities of AI systems across a wide array of domains.
Funding statement
This research was supported by the Fondazione CRT, under the 2023 Call, aimed at advancing interdisciplinary studies in the intersection of law, LLMs and policy. The funding body played no role in the design, execution or publication of this work.
Competing interests
The authors declare that they have no competing interests, financial or otherwise, that could have influenced the content or conclusions of this research.
Andrea Filippo Ferraris, PhD Fellow LAST-JD, ALMA AI, University of Bologna and PhD Fellow in Law at DIKE and Law faculty, Vrije Universiteit Brussel, Brussels. Email: [email protected] and [email protected]
Davide Audrito, PhD Fellow at LAST-JD, Computer Science Department, University of Torino, and Legal Studies Department, University of Bologna. Email: [email protected] and [email protected]
Luigi Di Caro, Associate Professor of Computer Science, Computer Science Department, University of Torino. Email: [email protected]
Cristina Poncibò, Full Professor of Comparative Law, Department of Law, University of Turin. Email: [email protected]