Small Language Models (SLMs) are compact neural networks designed to perform natural language processing (NLP) tasks with significantly fewer parameters and lower computational requirements than their larger counterparts. SLMs aim to deliver robust performance in resource-constrained environments, such as mobile devices or edge computing systems, where efficiency is paramount. The two primary techniques for creating SLMs are knowledge distillation from larger models and training highly efficient models from scratch on curated, high-quality data.
Knowledge distillation, introduced by Hinton et al. (2015), is a model compression technique in which a smaller "student" model is trained to replicate the behaviour of a larger "teacher" model. In NLP, distillation transfers knowledge from large language models (LLMs), such as BERT or GPT-3, to SLMs by leveraging the teacher’s output distributions or internal representations. The student model is trained on a combination of the teacher’s soft labels (probability distributions over outputs) and the original dataset’s hard labels, enabling it to capture nuanced patterns while maintaining efficiency. Pioneering examples such as DistilBERT (Sanh et al., 2019) and TinyBERT (Jiao et al., 2019) demonstrated the effectiveness of this approach by successfully compressing BERT-family models. More recent efforts have shifted to distilling the capabilities of modern generative LLMs; advances such as step-by-step distillation show that smaller models can even outperform LLMs while using less training data (Hsieh et al., 2023).
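As a minimal sketch of this combined objective, assuming a PyTorch setting with a classification-style student (the temperature and weighting values below are illustrative choices, not prescriptions from the cited papers):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Blend the teacher's soft labels with the dataset's hard labels."""
    # Soft-label term: KL divergence between temperature-softened distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # rescaling as suggested by Hinton et al. (2015)

    # Hard-label term: ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    # alpha trades off imitating the teacher against fitting the data directly.
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

In practice, the teacher’s logits are computed with gradients disabled, so only the student’s parameters are updated during training.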
Beyond distillation, a second paradigm has emerged in SLM development: training smaller models from the ground up on exceptionally high-quality data. This "less is more" approach to data curation contrasts with the traditional reliance on vast, unfiltered web scrapes. Models like Microsoft's Phi-4-reasoning exemplify this strategy: this 14-billion-parameter model, trained on a carefully curated set of "teachable" prompts, achieves reasoning capabilities that outperform models many times its size (Abdin et al., 2025). This indicates that data quality, rather than sheer quantity, can be a primary driver of performance, enabling SLMs to acquire complex knowledge without direct distillation from a larger teacher model.
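The curation pipelines behind models like Phi-4-reasoning are not published as code, but the general idea can be sketched as a simple filter: score each candidate document or prompt for how "teachable" it is and keep only the highest-scoring ones. The score_quality heuristic below is a hypothetical stand-in for the trained classifiers or LLM-based judges such pipelines typically rely on:

```python
def score_quality(text: str) -> float:
    """Toy heuristic stand-in for a learned quality/teachability scorer."""
    words = text.split()
    if len(words) < 50:  # discard very short fragments
        return 0.0
    # Penalise boilerplate and repetition via lexical diversity.
    return len(set(words)) / len(words)

def curate(corpus: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only the documents judged informative enough to train on."""
    return [doc for doc in corpus if score_quality(doc) >= threshold]
```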
SLMs offer several advantages, particularly where computational resources are limited. Their reduced size and faster inference enable deployment on edge devices, supporting real-time applications such as virtual assistants or on-device translation without reliance on cloud infrastructure (McMahan et al., 2017). This enhances user privacy by minimising data transmission and reduces latency, which is critical for time-sensitive tasks such as chatbots or content moderation. Effective distillation methods further improve SLM performance, making them viable alternatives to LLMs (Sanh et al., 2019). Additionally, SLMs contribute to sustainable AI by lowering energy consumption compared with LLMs, aligning with environmental considerations in model development (Strubell et al., 2019). They also democratise access to advanced NLP, allowing small businesses or academic institutions to deploy models for tasks like sentiment analysis or document classification without requiring high-end hardware (Wolf et al., 2020).
Despite their benefits, SLMs face challenges related to performance and distillation design. Distilled models may struggle with tasks requiring complex reasoning or nuanced understanding, where LLMs typically excel (Jiao et al., 2019); this trade-off necessitates careful optimisation to balance efficiency and capability. The distillation process itself presents technical hurdles: selecting an appropriate teacher model, designing effective loss functions, and choosing training objectives are all critical to success. Over-compression can lead to underfitting, while insufficient distillation may fail to capture the teacher’s knowledge (Gou et al., 2021). Recent research explores advanced techniques, such as multi-teacher distillation or task-specific fine-tuning, to address these issues (Sun et al., 2019). Ethical considerations also arise, as SLMs can inherit biases from their teacher models, potentially perpetuating harmful stereotypes or misinformation. Mitigating these biases requires rigorous evaluation and correction techniques, an area of active research (Bender et al., 2021).
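One common multi-teacher formulation, sketched below under the assumption that every teacher’s logits are available for the same batch, is a weighted sum of per-teacher soft-label losses; the uniform default weighting is an illustrative choice rather than a recommendation from the cited work:

```python
import torch.nn.functional as F

def multi_teacher_soft_loss(student_logits, teacher_logits_list,
                            weights=None, temperature=2.0):
    """Weighted sum of soft-label losses against several teacher models."""
    if weights is None:
        # Default: treat all teachers equally.
        weights = [1.0 / len(teacher_logits_list)] * len(teacher_logits_list)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    loss = 0.0
    for w, t_logits in zip(weights, teacher_logits_list):
        p_teacher = F.softmax(t_logits / temperature, dim=-1)
        loss = loss + w * F.kl_div(log_p_student, p_teacher,
                                   reduction="batchmean") * temperature ** 2
    return loss
```

This term would typically be combined with a hard-label loss, as in the single-teacher sketch above.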
The future of SLM development appears to be advancing along two parallel tracks. On one hand, research continues to refine distillation techniques, exploring adaptive methods and combining them with other compression methods such as quantisation and pruning to maximise efficiency (Gou et al., 2021); a brief sketch of these compression utilities appears at the end of this section. On the other, the "quality data" paradigm is being pushed forward through more sophisticated data-synthesis techniques and investigation of the theoretical principles that make "textbook-like" data so effective. Creating domain-specific SLMs remains a highly promising avenue applicable to both approaches: whether by distilling a fine-tuned LLM or by training a model from scratch on high-quality, specialised data (e.g., in medicine or law), highly capable and efficient models can be developed for critical, resource-constrained applications.

SLMs, empowered by either knowledge distillation or high-quality training regimes, offer an efficient and accessible alternative to LLMs, enabling advanced NLP in resource-constrained environments. While challenges such as performance trade-offs and ethical concerns persist, innovations in model design and data curation continue to expand their potential. As the demand for sustainable and deployable AI grows, SLMs are set to play a pivotal role in the future of language processing, bridging the gap between capability and efficiency.
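As a small, self-contained illustration of the compression toolbox mentioned above, the sketch below combines magnitude pruning with post-training dynamic quantisation using standard PyTorch utilities on a toy model rather than any particular SLM; a real pipeline would re-evaluate accuracy after each step:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for a distilled student model.
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

# Unstructured magnitude pruning: zero out the 30% smallest weights per layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the sparsity permanent

# Post-training dynamic quantisation: int8 weights for all Linear layers.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```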
References:
1. Abdin, Marah, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, et al. 2025. ‘Phi-4-reasoning Technical Report’. arXiv preprint arXiv:2504.21318.
2. Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?’. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623.
3. Gou, Jianping, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao. 2021. ‘Knowledge Distillation: A Survey’. International Journal of Computer Vision 129: 1789–1819. https://doi.org/10.1007/s11263-021-01453-z
4. Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. ‘Distilling the Knowledge in a Neural Network’. NIPS 2014 Deep Learning Workshop. arXiv:1503.02531 [stat.ML]. https://doi.org/10.48550/arXiv.1503.02531
5. Hsieh, Cheng-Yu, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. 2023. ‘Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes’. arXiv preprint arXiv:2305.02301 [cs.CL]. https://doi.org/10.48550/arXiv.2305.02301
6. Jiao, Xiaoqi, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. ‘TinyBERT: Distilling BERT for Natural Language Understanding’. Findings of EMNLP 2020. arXiv:1909.10351 [cs.CL]. https://doi.org/10.48550/arXiv.1909.10351
7. McMahan, Brendan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. ‘Communication-Efficient Learning of Deep Networks from Decentralized Data’. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR 54:1273–1282. https://proceedings.mlr.press/v54/mcmahan17a.html
8. Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. ‘DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter’. 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS 2019. arXiv:1910.01108 [cs.CL]. https://doi.org/10.48550/arXiv.1910.01108
9. Strubell, Emma, Ananya Ganesh, and Andrew McCallum. 2019. ‘Energy and Policy Considerations for Deep Learning in NLP’. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3645–3650. Florence, Italy: Association for Computational Linguistics. https://aclanthology.org/P19-1355/
10. Sun, Siqi, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. ‘Patient Knowledge Distillation for BERT Model Compression’. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4323–4332, Hong Kong, China. Association for Computational Linguistics.
11. Wolf, Thomas et al. 2020. ‘Transformers: State-of-the-Art Natural Language Processing’. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.