A large language model can be defined as a computational model, typically based on deep neural networks, trained on vast datasets of text to perform a wide range of language-related tasks. The introduction of the transformer architecture (Vaswani et al. 2017) marked a pivotal shift in NLP, enabling models to process and generate text with far greater accuracy than earlier approaches. LLMs, such as OpenAI’s GPT series or Google’s BERT, leverage this architecture to model the probabilistic relationships between words, phrases, and sentences in a given context (Devlin et al. 2019). The "large" aspect of LLMs refers both to their scale, often comprising billions of parameters, and to the extensive corpora used for training, which may include books, websites, and other textual sources (Brown et al. 2020). These models are pre-trained in an unsupervised or self-supervised manner, learning general language patterns before being fine-tuned for specific tasks such as sentiment analysis or question answering (Radford et al. 2018). This pre-training and fine-tuning paradigm distinguishes LLMs from earlier, task-specific NLP models, offering greater flexibility and generalisation.
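To make this probabilistic framing concrete, a GPT-style autoregressive model factorises the probability of a token sequence into a product of next-token conditionals, and self-supervised pre-training amounts to maximising the likelihood of the training corpus under this factorisation (a sketch of the standard formulation; BERT instead optimises a masked-token variant of the objective):

```latex
P(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} P\left(w_t \mid w_1, \dots, w_{t-1}\right)
```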
The defining hallmark of LLMs is their unprecedented scale, both in terms of parameters and training data. Models like GPT-3, with its 175 billion parameters, exemplify this trend, enabling the capture of intricate linguistic patterns that facilitate nuanced text generation and comprehension (Brown et al. 2020). The complexity of these models arises from their deep neural architectures, which comprise multiple layers of interconnected nodes, each contributing to the model’s ability to encode semantic and syntactic relationships. This scale allows LLMs to model high-dimensional probability distributions over sequences of words, resulting in outputs that closely mimic human language. However, this scale comes at a significant cost. Training such models requires vast computational resources, often involving thousands of GPU hours, which translates into substantial energy consumption. Strubell et al. (2019) quantify this impact, estimating that training a single large-scale NLP model can produce carbon emissions equivalent to multiple transatlantic flights. Moreover, the resource intensity of LLMs raises accessibility concerns, as only well-funded organisations can afford the infrastructure necessary for their development and deployment. This concentration of capability underscores a critical tension between technological advancement and equitable access, motivating research into more efficient training and compression techniques such as knowledge distillation, pruning, and quantisation (Sanh et al. 2019).
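A rough, illustrative calculation (the figures below are back-of-envelope estimates, not values from the cited papers) shows how parameter count alone dictates a substantial memory footprint, which is one reason lower-precision quantisation is attractive:

```python
# Back-of-envelope memory footprint of model weights at different numeric precisions.
# Illustrative only: training additionally stores activations, gradients, and optimiser state.
PARAMS = 175e9  # GPT-3-scale parameter count (Brown et al. 2020)

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8 (quantised)": 1}

for precision, nbytes in BYTES_PER_PARAM.items():
    gigabytes = PARAMS * nbytes / 1e9
    print(f"{precision:>17}: ~{gigabytes:,.0f} GB just to store the weights")
```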
The architectural foundation of LLMs lies in the transformer model, introduced by Vaswani et al. (2017), which has supplanted earlier recurrent neural network (RNN) approaches due to its efficiency and performance. Transformers leverage self-attention mechanisms, which dynamically assign weights to words in a sequence based on their contextual relevance, irrespective of their positional distance. This capability enables LLMs to capture long-range dependencies in text, a significant advancement over RNNs, which struggled with vanishing gradients and sequential processing limitations (Hochreiter and Schmidhuber 1997). The self-attention mechanism operates by computing attention scores across all tokens in an input sequence, allowing the model to prioritise relevant linguistic elements. For instance, in the sentence “The cat, which was hiding under the table, jumped,” the transformer can effectively link “cat” and “jumped” despite the intervening clause. Additionally, transformers facilitate parallelisation, enabling faster training on large datasets compared to the sequential nature of RNNs. This architectural efficiency has been pivotal in scaling LLMs, as evidenced by models like BERT and GPT, which rely on variants of the transformer to achieve state-of-the-art performance across diverse NLP benchmarks (Devlin et al. 2019; Brown et al. 2020). Ongoing research aims to refine transformer architectures, exploring sparse attention mechanisms to further enhance computational efficiency (Zaheer et al. 2020).
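A minimal NumPy sketch of single-head scaled dot-product self-attention may help make the mechanism concrete; the toy dimensions and random weights are illustrative, and the multi-head splitting, masking, and learned output projection used in full transformers are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (Vaswani et al. 2017)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project tokens to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # pairwise relevance of every token to every other
    weights = softmax(scores, axis=-1)        # attention distribution for each token
    return weights @ V                        # context-mixed token representations

# Toy example: 5 tokens, embedding size 8, head size 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # -> (5, 4)
```

In the earlier example sentence, it is these attention weights that let the representation of "jumped" draw directly on "cat" despite the intervening clause.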
LLMs demonstrate remarkable generalisation, enabling them to perform effectively across a wide array of tasks with minimal task-specific fine-tuning. This is achieved through transfer learning, wherein models are pre-trained on vast, diverse text corpora to learn general linguistic representations, which are then adapted to specific tasks (Radford et al. 2018). For example, BERT’s bidirectional pre-training, which considers both preceding and following context, equips it to excel in tasks such as text classification, question answering, and named entity recognition (Devlin et al. 2019). The efficacy of transfer learning lies in the model’s ability to encode general language patterns during pre-training, which can then be fine-tuned on smaller, task-specific datasets to achieve high performance. This shift from task-specific models to general-purpose language models has broadened access to high-performing NLP, enabling developers to adapt pre-trained LLMs to bespoke applications without training a model from scratch. However, generalisation has limits; LLMs may struggle with tasks requiring domain-specific knowledge absent from their training data, highlighting the need for continual learning strategies to adapt models to new domains (Bender et al. 2021).
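As a minimal sketch of the fine-tuning step, assuming the Hugging Face transformers library, the bert-base-uncased checkpoint, and an illustrative two-class sentiment task (none of which the cited papers prescribe), a pre-trained encoder can be wrapped with a freshly initialised classification head and trained on a small labelled batch:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained encoder plus a new, randomly initialised 2-class classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["A wonderful, moving film.", "A tedious, overlong mess."],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative (illustrative labels)

outputs = model(**batch, labels=labels)   # supplying labels makes the model return a loss
outputs.loss.backward()                   # an optimiser step on this gradient would fine-tune
                                          # both the head and the pre-trained encoder
```

The point of the sketch is that only the small classification head is new; everything else reuses representations learned during pre-training.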
A core strength of LLMs is their ability to model contextual relationships within text, a departure from traditional rule-based or statistical NLP approaches. By estimating the conditional probability of words given their context, LLMs generate coherent and contextually appropriate responses, as seen in applications like conversational agents and automated writing tools (Brown et al. 2020). This contextual prowess stems from the transformer’s ability to attend to relevant tokens across an input sequence, enabling the model to disambiguate meanings based on surrounding text. For instance, in the sentence “The bank was flooded”, an LLM can infer whether “bank” refers to a financial institution or a riverbank based on contextual cues. However, this reliance on context can lead to errors when inputs are ambiguous or fall outside the model’s training distribution. Bender et al. (2021) note that LLMs may produce plausible but incorrect outputs in such cases, a phenomenon known as “hallucination”. Addressing these limitations requires advances in robust contextual modelling, such as incorporating external knowledge bases or improving out-of-domain generalisation (Weidinger et al. 2021).
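A small sketch, again assuming the transformers library and the publicly available gpt2 checkpoint, shows how conditional next-token probabilities expose this contextual sensitivity; the context string and candidate words are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "After days of heavy rain, the river rose and the bank was"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]   # scores for the next token only
probs = torch.softmax(next_token_logits, dim=-1)

for candidate in [" flooded", " robbed"]:
    token_id = tokenizer.encode(candidate)[0]           # first sub-word of the candidate
    print(f"P({candidate!r} | context) = {probs[token_id].item():.4f}")
```

A riverside context should shift probability mass towards "flooded", whereas a financial context would favour "robbed", illustrating disambiguation purely through conditioning.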
References:
1. Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?’. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623.
2. Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. ‘Language Models Are Few-Shot Learners’. Advances in Neural Information Processing Systems 33: 1877–1901.
3. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. ‘BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding’. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186.
4. Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. ‘Long Short-Term Memory’. Neural Computation 9 (8): 1735–1780.
5. Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. ‘Improving Language Understanding by Generative Pre-Training’. OpenAI Technical Report.
6. Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. ‘DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter’. arXiv preprint arXiv:1910.01108. https://arxiv.org/abs/1910.01108
7. Strubell, Emma, Ananya Ganesh, and Andrew McCallum. 2019. ‘Energy and Policy Considerations for Deep Learning in NLP’. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3645–3650. Florence, Italy: Association for Computational Linguistics. https://aclanthology.org/P19-1355/
8. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. ‘Attention Is All You Need’. arXiv preprint. doi:10.48550/ARXIV.1706.03762
9. Weidinger, Laura, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. 2021. ‘Ethical and Social Risks of Harm from Language Models’. arXiv preprint arXiv:2112.04359. https://arxiv.org/abs/2112.04359
10. Zaheer, Manzil, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. ‘Big Bird: Transformers for Longer Sequences’. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://proceedings.neurips.cc/paper/2020/hash/c8512d142a2d849725f31a9a7a361ab9-Abstract.html