A defining characteristic of large language models (LLMs) is their scale, measured by the number of parameters, which has grown exponentially in recent years. Models such as GPT-3, with its 175 billion parameters, and its successors have demonstrated remarkable capabilities, raising questions about the relationship between model size and performance (Brown et al. 2020). This essay explores why size matters in LLMs, examining how scaling affects performance, capabilities, and limitations, while drawing on foundational and contemporary scholarly sources.
The scaling hypothesis posits that increasing the size of a neural network, alongside sufficient data and computational resources, leads to improved performance across a range of tasks. Kaplan et al. (2020) formalised this idea, demonstrating empirically that larger models exhibit predictable improvements in perplexity, a measure of how well a model predicts a sequence of words. Their work established scaling laws, showing that loss falls as a power-law function of model size, dataset size, and compute. For instance, doubling the number of parameters in a transformer-based model yields a predictable reduction in loss, translating into better language understanding and generation. This hypothesis has been borne out by models like GPT-3, which outperforms its smaller predecessors, such as GPT-2 (Radford et al. 2019), on tasks ranging from text completion to question answering. The increased parameter count allows larger models to capture more complex patterns in data, enabling them to generalise better across diverse linguistic contexts. However, scaling is not without trade-offs: as models grow, the computational cost of training and inference rises steeply, raising concerns about energy consumption and accessibility (Strubell et al. 2019).
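The quantitative claim can be made concrete. As a minimal sketch, assuming the approximate fit that Kaplan et al. (2020) report for non-embedding parameter count, the converged test loss L of a model with N parameters follows

```latex
% Approximate parameter-count scaling law from Kaplan et al. (2020);
% the constants are the paper's reported fits, quoted roughly.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13},
```

with analogous power laws in dataset size and training compute. Under this fit, doubling N multiplies the loss by about 2^(-0.076) ≈ 0.95, a modest but highly predictable improvement.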
Beyond raw performance metrics, scaling unlocks emergent capabilities in LLMs: abilities that smaller models struggle to exhibit. Brown et al. (2020) highlight that GPT-3 demonstrates few-shot learning, where the model performs a task given only a handful of examples in the prompt. This capability emerges only at larger scales, as smaller models lack the capacity to generalise from such sparse supervision. For example, GPT-3 can translate between languages, write code, or solve simple mathematical problems with few or no task-specific training examples, a feat unattainable by earlier, smaller models like BERT (Devlin et al. 2019). Moreover, larger models exhibit improved contextual understanding, allowing them to maintain coherence over longer text sequences. This is particularly evident in tasks like story generation or dialogue, where narrative consistency is critical. Wei et al. (2022) argue that scale enables "emergent abilities", such as reasoning and commonsense understanding, which are not explicitly programmed but arise from the model's capacity to encode vast amounts of world knowledge. However, these capabilities are not universal; performance on specialised tasks, such as medical or legal reasoning, may still require fine-tuning or domain-specific data (Bommasani et al. 2022).
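To make few-shot prompting concrete, the sketch below assembles the kind of prompt Brown et al. (2020) describe: a short task description, a handful of worked examples, and a new query, all supplied as plain text at inference time. The translation pairs echo the paper's illustration; the commented-out `complete` call is a hypothetical stand-in for whatever text-completion interface is available, not a specific library API.

```python
# Minimal sketch of few-shot prompting in the style described by Brown et al. (2020).
# `complete` is a hypothetical placeholder for any text-completion backend.

def build_few_shot_prompt(examples, query):
    """Concatenate a task description, worked examples, and the new query."""
    lines = ["Translate English to French."]
    for english, french in examples:
        lines.append(f"English: {english}\nFrench: {french}")
    lines.append(f"English: {query}\nFrench:")  # the model continues from here
    return "\n\n".join(lines)

examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]
prompt = build_few_shot_prompt(examples, "peppermint")
# completion = complete(prompt)  # hypothetical LLM call; expected continuation: "menthe poivrée"
print(prompt)
```

No gradient updates are involved: the "learning" happens entirely in the forward pass, conditioned on the examples in the prompt, which is why the behaviour only appears once models are large enough.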
While scaling yields impressive gains, it also introduces significant challenges. One major limitation is diminishing returns: Kaplan et al. (2020) note that beyond a certain point, increasing model size yields smaller improvements in performance relative to the computational cost. This raises questions about the sustainability of pursuing ever-larger models, particularly as the environmental impact of training LLMs becomes a pressing concern. Strubell et al. (2019) estimate that training a single large model can emit roughly as much carbon as one passenger's share of a trans-American round-trip flight, prompting calls for more efficient architectures and training methods. Another challenge is the amplification of biases. Larger models, trained on vast datasets scraped from the internet, often encode the societal biases present in their training data. Bender et al. (2021) argue that scaling exacerbates these issues, as larger models are more likely to reproduce harmful stereotypes or generate toxic content. Mitigating these biases requires careful dataset curation and post-training interventions, which are resource-intensive and not always effective. Furthermore, larger models are less accessible to researchers and organisations with limited computational resources. The democratisation of AI research is hindered when only well-funded entities can afford to train or deploy state-of-the-art models (Bommasani et al. 2022). This creates an uneven playing field, limiting innovation and diversity in NLP applications.
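The power-law fit quoted earlier makes the diminishing returns explicit; the arithmetic below is illustrative and simply reuses the approximate exponent rather than any newly measured value.

```latex
% Worked example of diminishing returns under the earlier fit alpha_N ~ 0.076.
\frac{L(10N)}{L(N)} \approx 10^{-\alpha_N} \approx 10^{-0.076} \approx 0.84
```

A tenfold increase in parameters, with the accompanying growth in data and compute, therefore buys only around a 16 per cent reduction in loss, while the training bill grows by an order of magnitude or more.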
The success of scaling can be traced to foundational ideas in machine learning. Rumelhart et al. (1986) showed how distributed representations, in which knowledge is encoded across a network's parameters, can be learned by back-propagating errors. Larger models leverage this principle by distributing linguistic and world knowledge across billions of parameters, enabling richer representations. Additionally, the universal approximation theorem (Cybenko 1989) establishes that sufficiently large neural networks can approximate any continuous function, providing a theoretical basis for why scaling improves model expressiveness. However, these theoretical insights do not fully explain emergent behaviours in LLMs. Wei et al. (2022) suggest that scale induces qualitative shifts in model behaviour, potentially due to phase transitions in learning dynamics. These phenomena are not yet fully understood, highlighting the need for further research into the mechanisms underlying scaling.
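Cybenko's result can be stated informally (this is a paraphrase, not the paper's exact formulation): for any continuous function f on the unit cube and any tolerance ε > 0, there exist weights, biases, and coefficients such that a finite sum of sigmoid units approximates f uniformly,

```latex
% Informal statement of Cybenko (1989): finite sums of sigmoidal units
% are dense in the continuous functions on the unit cube.
G(x) = \sum_{i=1}^{N} \alpha_i \,\sigma\!\left(w_i^{\top} x + b_i\right),
\qquad \sup_{x \in [0,1]^n} \left| G(x) - f(x) \right| < \varepsilon .
```

The theorem guarantees only that some sufficiently large N exists, not that it is small or learnable in practice, which is why it motivates rather than explains the empirical benefits of scale.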
Given the challenges of scaling, researchers are exploring alternatives to simply increasing model size. Techniques such as model pruning, quantisation, and knowledge distillation aim to create smaller, more efficient models without sacrificing performance (Bommasani et al. 2022). Additionally, modular architectures, where smaller specialised models collaborate on complex tasks, offer a promising avenue for balancing capability and efficiency. Another direction is to prioritise data quality over quantity: smaller models trained on carefully curated, high-quality datasets can sometimes outperform larger models trained on noisy data (Bender et al. 2021). This approach aligns with efforts to address biases and reduce the environmental footprint of NLP research.
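Of these techniques, knowledge distillation is the most straightforward to sketch: a small student model is trained to match the softened output distribution of a large teacher as well as the true labels. The PyTorch snippet below shows the standard soft-target distillation loss; the temperature, mixing weight, and random tensors are illustrative assumptions rather than values taken from the cited works.

```python
# Sketch of a knowledge-distillation loss: the student mimics the teacher's
# softened predictions while also fitting the ground-truth labels.
# The temperature and alpha are illustrative choices, not cited values.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets: teacher and student distributions at a raised temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The KL term is scaled by T^2 so its gradients match the hard-label term's scale.
    kd_term = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    # Ordinary cross-entropy against the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term

# Example with random tensors standing in for real model outputs.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```

The same loss applies whether the student is a pruned or quantised copy of the teacher or an entirely different, smaller architecture, which is what makes distillation attractive as an efficiency strategy.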
References:
1. Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?’. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623.
2. Bommasani, Rishi, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, et al. 2022. ‘On the Opportunities and Risks of Foundation Models’. Center for Research on Foundation Models, Stanford HAI. https://doi.org/10.48550/arXiv.2108.07258
3. Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. ‘Language Models Are Few-Shot Learners’. Advances in Neural Information Processing Systems 33: 1877–1901.
4. Cybenko, George. 1989. ‘Approximation by Superpositions of a Sigmoidal Function’. Mathematics of Control, Signals, and Systems 2: 303–314. https://doi.org/10.1007/BF02551274
5. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. ‘BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding’. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186.
6. Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. ‘Scaling Laws for Neural Language Models’. arXiv preprint arXiv:2001.08361. https://doi.org/10.48550/arXiv.2001.08361
7. Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. ‘Language Models Are Unsupervised Multitask Learners’. OpenAI technical report.
8. Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. ‘Learning Representations by Back-Propagating Errors’. Nature 323 (6088): 533–536.
9. Strubell, Emma, Ananya Ganesh, and Andrew McCallum. 2019. ‘Energy and Policy Considerations for Deep Learning in NLP’. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3645–3650. https://aclanthology.org/P19-1355
10. Wei, Jason, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, et al. 2022. ‘Emergent Abilities of Large Language Models’. Transactions on Machine Learning Research. https://doi.org/10.48550/arXiv.2206.07682