A large language model (LLM) can be defined as a computational model, typically based on deep neural networks, trained on vast datasets of text to perform a wide range of language-related tasks. The introduction of the transformer architecture (Vaswani et al. 2017) marked a pivotal shift in NLP, enabling models to process and generate text with a fluency and accuracy earlier approaches could not match. LLMs such as OpenAI’s GPT series or Google’s BERT leverage this architecture to model the probabilistic relationships between words, phrases, and sentences in a given context (Devlin et al. 2019). The “large” aspect of LLMs refers both to their scale, often comprising billions of parameters, and to the extensive corpora used for training, which may include books, websites, and other textual sources (Brown et al. 2020). These models are pre-trained in an unsupervised or self-supervised manner, learning general language patterns before being fine-tuned for specific tasks such as sentiment analysis or question answering (Radford et al. 2018). This pre-training and fine-tuning paradigm distinguishes LLMs from earlier, task-specific NLP models, offering greater flexibility and generalisation.
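To make this probabilistic framing concrete, the objective most generative LLMs optimise can be written as the standard autoregressive factorisation below (a generic formulation, not tied to any particular model cited above): a sequence of tokens $w_1, \dots, w_T$ is assigned the probability

\[
P(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} P\bigl(w_t \mid w_1, \dots, w_{t-1}; \theta\bigr),
\]

and pre-training adjusts the parameters $\theta$ to maximise the corresponding log-likelihood $\sum_{t=1}^{T} \log P(w_t \mid w_{<t}; \theta)$ over the training corpus. Masked-language models such as BERT instead predict held-out tokens from their bidirectional context, but the underlying principle of learning conditional distributions over text is the same.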
The defining hallmark of LLMs is their unprecedented scale, both in parameters and in training data. While foundational models like GPT-3, with its 175 billion parameters, demonstrated the power of scale (Brown et al. 2020), the frontier has since shifted towards even larger models, often employing Mixture-of-Experts (MoE) architectures. Models like Mixtral 8x7B route each token through only a sparse subset of expert sub-networks, achieving the performance of much larger dense models while containing computational costs during inference (Jiang et al. 2024). The complexity of these models arises from their deep neural architectures, which comprise many layers of interconnected nodes, each contributing to the model’s ability to encode semantic and syntactic relationships. This scale allows LLMs to model high-dimensional probability distributions over sequences of words, resulting in outputs that closely mimic human language. However, such scale incurs significant costs. Training these models requires vast computational resources, typically weeks or months on large GPU clusters, which translates into substantial energy consumption. Strubell et al. (2019) quantify this impact, estimating that the carbon footprint of training a single large NLP model can rival that of a transcontinental flight, and can grow by orders of magnitude once architecture search is factored in. Moreover, the resource intensity of LLMs raises accessibility concerns, as only well-funded organisations can afford the infrastructure necessary for their development and deployment. This concentration of capability underscores a critical tension between technological advancement and equitable access, necessitating research into more efficient training and inference paradigms, such as the sparse activation of MoE models, quantisation, and speculative decoding (Zhao et al. 2023).
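To illustrate the sparse-activation principle behind MoE layers, the following sketch routes each token to only its top-two experts. It is a simplified toy in PyTorch, not Mixtral’s actual implementation; the layer sizes, expert count, and top-2 routing are assumptions chosen for clarity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    # Toy sparsely activated MoE feed-forward layer: a linear router scores
    # the experts for each token, and only the top_k experts are evaluated.
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                 # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)              # renormalise the kept scores
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:                    # no token routed to this expert
                continue
            # Each routed token receives this expert's output, scaled by its gate weight.
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

# Example: 16 token embeddings pass through the layer; per token, only 2 of the
# 8 experts run, so the active parameter count is a fraction of the total.
moe = SparseMoELayer()
print(moe(torch.randn(16, 512)).shape)                    # torch.Size([16, 512])

The design point is that total capacity (all experts) grows independently of per-token compute (only the routed experts), which is what lets sparse models match larger dense ones at lower inference cost.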
The architectural foundation of LLMs lies predominantly in the transformer model (Vaswani et al. 2017). Its self-attention mechanism, detailed in Chapter 2.4, is pivotal for capturing long-range linguistic dependencies more effectively than earlier recurrent networks such as LSTMs (Hochreiter & Schmidhuber 1997), and its parallelisable nature has been key to scaling models like BERT and GPT (Devlin et al. 2019; Brown et al. 2020). While research continues to refine transformer-based approaches, for example through sparse attention (Zaheer et al. 2020), compelling alternative architectures such as State Space Models (SSMs) have also emerged, offering competitive performance with different computational trade-offs (Gu & Dao 2023). A defining characteristic of LLMs is their remarkable ability to generalise, allowing them to perform effectively across a wide array of tasks with minimal task-specific adjustment. This capability is achieved through the transfer learning paradigm of pre-training on vast corpora followed by task-specific fine-tuning. However, this generalisation has its limits: models may struggle with tasks requiring domain-specific knowledge absent from their training data, a challenge that necessitates ongoing research into continual learning (Bender et al. 2021).
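For reference, the scaled dot-product attention at the heart of this architecture (Vaswani et al. 2017; treated in full in Chapter 2.4) computes

\[
\mathrm{Attention}(Q, K, V) \;=\; \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
\]

where $Q$, $K$, and $V$ are the query, key, and value projections of the input tokens and $d_k$ is the key dimension. Because every token attends to every other token in a single matrix operation, the computation parallelises across the sequence rather than unfolding step by step as in a recurrent network, which is precisely what makes the architecture amenable to the scale discussed above.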
A core strength of LLMs is their ability to model contextual relationships within text, a departure from traditional rule-based or statistical NLP approaches. By estimating the conditional probability of words given their context, LLMs generate coherent and contextually appropriate responses, as seen in applications like conversational agents and automated writing tools (Brown et al. 2020). This contextual prowess stems from the transformer’s ability to attend to relevant tokens across an input sequence, enabling the model to disambiguate meanings based on surrounding text. For instance, in the sentence “The bank was flooded”, an LLM can infer whether “bank” refers to a financial institution or a riverbank based on contextual cues. However, this reliance on context can lead to errors when inputs are ambiguous or fall outside the model’s training distribution. Bender et al. (2021) note that LLMs may produce plausible but incorrect outputs in such cases, a phenomenon known as “hallucination”. Addressing these limitations requires advances in robust contextual modelling, such as incorporating external knowledge bases or improving out-of-domain generalisation (Weidinger et al. 2021).
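This kind of contextual disambiguation can be inspected directly through a model’s contextual embeddings. The sketch below assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint; the example sentences and the expectation in the final comment are illustrative assumptions rather than guaranteed results.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the contextual embedding of the token "bank" in the given sentence.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (seq_len, hidden_dim)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = (enc["input_ids"][0] == bank_id).nonzero()[0].item()
    return hidden[position]

river = bank_vector("The bank was flooded when the river rose overnight.")
money = bank_vector("The bank was flooded with loan applications.")
finance = bank_vector("She opened a savings account at the bank.")

cos = torch.nn.functional.cosine_similarity
print("river sense vs. finance sense:", cos(river, finance, dim=0).item())
print("money sense vs. finance sense:", cos(money, finance, dim=0).item())  # often higher, though not guaranteed

Because the same surface form “bank” receives a different vector in each sentence, downstream layers can condition on the intended sense, which is the mechanism behind the disambiguation described in the paragraph above.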
References:
1. Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?’. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623.
2. Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. ‘Language Models Are Few-Shot Learners’. Advances in Neural Information Processing Systems 33: 1877–1901.
3. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. ‘BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding’. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186.
4. Gu, Albert, and Tri Dao. 2023. ‘Mamba: Linear-Time Sequence Modeling with Selective State Spaces’. arXiv preprint arXiv:2312.00752. https://arxiv.org/abs/2312.00752
5. Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. ‘Long Short-Term Memory’. Neural Computation 9 (8): 1735–1780.
6. Jiang, Albert Q., et al. 2024. ‘Mixtral of Experts’. arXiv preprint arXiv:2401.04088. https://arxiv.org/abs/2401.04088
7. Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. ‘Improving Language Understanding by Generative Pre-Training’. OpenAI.
8. Strubell, Emma, Ananya Ganesh, and Andrew McCallum. 2019. ‘Energy and Policy Considerations for Deep Learning in NLP’. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3645–3650. Florence, Italy: Association for Computational Linguistics. https://aclanthology.org/P19-1355/
9. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. ‘Attention Is All You Need’. arXiv. doi:10.48550/ARXIV.1706.03762
10. Weidinger, Laura, et al. 2021. ‘Ethical and Social Risks of Harm from Language Models’. arXiv preprint arXiv:2112.04359. https://arxiv.org/abs/2112.04359
11. Zaheer, Manzil, et al. 2020. ‘Big Bird: Transformers for Longer Sequences’. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://proceedings.neurips.cc/paper/2020/hash/c8512d142a2d849725f31a9a7a361ab9-Abstract.html
12. Zhao, Wayne Xin, et al. 2023. ‘A Survey of Large Language Models’. arXiv preprint arXiv:2303.18223. https://arxiv.org/abs/2303.18223