Comparing leading large language models: architectures, performance and specialised capabilities

Most contemporary LLMs employ a decoder‑only transformer architecture, which processes sequences in parallel via self‑attention. Scaling a dense transformer, however, increases computation and cost in proportion to its size. Mixture‑of‑experts (MoE) approaches address this by activating only a subset of parameters for each token. In the Switch Transformer, MoE routing selects different parameter subsets for each input, enabling models with vast numbers of parameters to maintain constant computational cost per token (Fedus et al. 2022). Mistral’s Mixtral 8×7B demonstrates this paradigm: each layer has eight feed‑forward experts, and a router chooses two experts per token; although the model has 47 billion total parameters, only 13 billion are active during inference (Jiang et al. 2024). Such sparse architectures deliver large effective capacity while controlling latency and energy use. Meta’s Llama 4 models similarly adopt a mixture‑of‑experts architecture, using either 16 or 128 experts to achieve high performance at lower cost (Meta AI 2025). These designs have prompted speculation that proprietary models such as GPT‑4 may also employ MoE to reach trillion‑scale capacities (Minaee et al. 2024).
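
To make the routing idea concrete, the sketch below shows a minimal top‑k mixture‑of‑experts feed‑forward layer in PyTorch. It illustrates the general mechanism described by Fedus et al. (2022) and Jiang et al. (2024) rather than reproducing either system; the dimensions, expert count and the omission of load‑balancing losses are illustrative simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Schematic sparse MoE feed-forward layer: a router picks k of n experts per token."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n_tokens, d_model)
        logits = self.router(x)                           # (n_tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)        # keep the k best experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                  # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Only k of the n expert networks run for each token, so per-token compute stays
# roughly constant even as the total parameter count grows with n_experts.
layer = TopKMoELayer()
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```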

OpenAI’s GPT series remains the prototypical LLM family. Early GPTs used dense transformers; GPT‑4 is a multimodal model that accepts both text and images and is considered the most capable model in its family. The technical report notes that GPT‑4 can power general‑purpose agents, performing complex reasoning and following diverse instructions (Minaee et al. 2024). While OpenAI has not published full architectural details, public discussions suggest the model may employ a mixture of experts and a context window on the order of 128 000 tokens, and derivative models such as GPT‑4o extend its capabilities to audio. Codex, a sibling model that powers GitHub Copilot, is fine‑tuned on code corpora and reportedly contains about 170 billion parameters; Copilot adapts the base model to a user’s specific codebase through fine‑tuning (Verdi 2024). These models translate natural language into code and have been integrated into developer tools.
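
As an illustration of how such models are typically reached by developer tools, the snippet below calls OpenAI’s chat completions API to translate a natural‑language request into code. This is a minimal sketch: the model name, prompt and absence of error handling are illustrative choices, not details drawn from the Copilot integration itself.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask a GPT-family model to turn a natural-language request into code.
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system", "content": "You are a coding assistant. Reply with Python code only."},
        {"role": "user", "content": "Write a function that checks whether a string is a palindrome."},
    ],
)
print(response.choices[0].message.content)
```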

Meta’s LLaMA series provides open‑weight alternatives to proprietary models. The initial release (LLaMA 1) included models ranging from 7 billion to 65 billion parameters and was trained on trillions of tokens drawn from publicly available data; the 13 billion parameter model outperformed GPT‑3 (175 billion parameters) on several benchmarks (Touvron et al. 2023). Llama 2 extended the series with improved training data and instruction‑tuned variants. In 2025, Meta introduced Llama 4, a family of multimodal, open‑weight models employing mixture‑of‑experts architectures. Llama 4 Scout and Llama 4 Maverick each activate 17 billion parameters, drawing on 16 and 128 experts respectively; Scout fits on a single NVIDIA H100 GPU and supports a context window of up to 10 million tokens. A larger teacher model (Llama 4 Behemoth) with 288 billion active parameters is still training but reportedly outperforms GPT‑4.5, Claude Sonnet 3.7 and Gemini 2.0 on STEM benchmarks (Meta AI 2025). The Llama ecosystem emphasises openness; the blog notes that releasing open‑weight models “enables the community to build new experiences” (Meta AI 2025).
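
Because the weights are openly distributed, these models can be run locally with standard tooling. The sketch below loads an earlier open‑weight Llama checkpoint through Hugging Face transformers; the checkpoint name is illustrative (Llama 4 checkpoints are published the same way but are larger and gated behind a licence acceptance).

```python
from transformers import pipeline

# Illustrative checkpoint: any open-weight Llama release you have access to works the same way.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed/illustrative model id
    device_map="auto",
)
prompt = "Explain mixture-of-experts routing in two sentences."
print(generator(prompt, max_new_tokens=80)[0]["generated_text"])
```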

Anthropic focuses on alignment and safety through Constitutional AI. This method starts from a written constitution, a short list of high‑level principles, and trains a model through supervised learning and reinforcement learning from AI feedback to follow those principles. The approach involves sampling responses, generating self‑critiques and revisions, and training a preference model that guides reinforcement learning. The result is a “harmless but non‑evasive” assistant that engages with harmful queries by explaining its objections (Bai et al. 2022). Later work introduced Collective Constitutional AI, which incorporates public input into the constitution; quantitative evaluations show that this reduces bias across nine social dimensions while maintaining performance on language, mathematics and helpful‑harmless evaluations (Huang et al. 2024). Claude models are thus characterised by explicit alignment procedures rather than architectural innovations; publicly available details suggest that Claude 4 models have parameter counts comparable to GPT‑4 but emphasise transparency and safety.
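
The supervised phase of this pipeline can be pictured as a critique‑and‑revision loop. The Python outline below is schematic: the `generate` function stands in for any chat‑model call, and the two principles shown are paraphrases for illustration, not Anthropic’s actual constitution.

```python
# Schematic sketch of the Constitutional AI supervised phase (Bai et al. 2022):
# sample a response, critique it against a principle, then revise it.

CONSTITUTION = [
    "Choose the response that is least likely to be harmful or offensive.",
    "Explain objections to harmful requests rather than refusing evasively.",
]

def generate(prompt: str) -> str:
    """Placeholder for a real chat-model call; returns a canned reply for illustration."""
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(user_prompt: str) -> str:
    response = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique this response against the principle: {principle}"
        )
        response = generate(
            f"Prompt: {user_prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # Revised responses become supervised fine-tuning targets; a preference model
    # trained on AI-generated comparisons then drives the reinforcement-learning stage.
    return response

print(critique_and_revise("How do I pick a lock?"))
```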

xAI’s Grok series, announced by Elon Musk, positions itself as a “maximum truth‑seeking” alternative to mainstream chatbots (Ray 2023). Although official technical reports are scarce, a detailed 2025 analysis describes Grok 4 as a hybrid architecture with specialised modules for different cognitive tasks and distributed processing for complex queries (Martin 2025). The model reportedly contains about 1.7 trillion parameters, significantly more than previous generations, and includes dedicated attention heads for mathematical reasoning, code generation and natural‑language understanding. Grok 4 supports a context window of up to 256 000 tokens and is powered by a supercomputer with 200 000 Nvidia GPUs. Its multimodal capabilities (text and vision) are expanding, though the vision component is still maturing. A specialised variant, Grok 4 Code, integrates with development tools and offers advanced code generation and debugging features (Martin 2025).

Google DeepMind’s Gemini series emphasises multimodality, long context and reasoning. The Gemini 2.X family includes Gemini 2.5 Pro and Gemini 2.5 Flash. The technical report describes Gemini 2.5 Pro as the most capable model in the series, achieving state‑of‑the‑art performance on coding and reasoning benchmarks; it processes up to 3 hours of video and its long‑context and multimodal capabilities enable new agentic workflows (Comanici et al. 2025). Gemini 2.5 models span the Pareto frontier of capability versus cost, with Flash variants offering lower latency and cost. The report notes that Gemini 2.5 Pro achieves competitive scores on the Aider Polyglot, GPQA (diamond) and Humanity’s Last Exam benchmarks. In addition to improved performance, Gemini 2.5 models maintain strong safety standards and are less likely to refuse questions or adopt a sanctimonious tone. The authors highlight that such rapid performance improvements challenge the development of sufficiently difficult evaluation benchmarks (Comanici et al. 2025).
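
A hedged sketch of how such long‑context, multimodal prompting looks in practice is shown below, using Google’s google-generativeai Python SDK to upload a long video and ask a question about it. The file name, model identifier and polling interval are illustrative assumptions rather than details taken from the technical report.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# Upload a long recording through the File API and wait for processing to finish.
video = genai.upload_file("lecture_recording.mp4")  # illustrative file name
while video.state.name == "PROCESSING":
    time.sleep(10)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-2.5-pro")  # illustrative model id
response = model.generate_content(
    [video, "Summarise the main arguments made in this recording."]
)
print(response.text)
```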

Mistral AI’s open‑weight models illustrate how efficient architectures can compete with much larger proprietary models. The company released Mixtral 8×7B, a sparse MoE model with open weights that “matches or outperforms Llama 2 70B” and GPT‑3.5 on standard benchmarks (Mistral 2023). In each layer’s feed‑forward block, a router network selects two of eight experts per token, increasing total parameter count while keeping per‑token compute bounded. Mixtral has 46.7 billion total parameters but uses only 12.9 billion per token, delivering the speed and cost of a 12.9 billion parameter model. The model handles context lengths of 32 000 tokens and supports multiple languages. Mixtral Instruct, fine‑tuned with supervised fine‑tuning and direct preference optimisation, achieves a score of 8.3 on MT‑Bench and surpasses GPT‑3.5 Turbo, Claude 2.1 and Gemini Pro on human evaluations. The technical report introducing the model confirms these design details and reports that Mixtral, trained with a 32k‑token context, outperforms Llama 2 70B and GPT‑3.5 across mathematics, code generation and multilingual benchmarks (Jiang et al. 2024).
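
These headline figures can be reproduced with a back‑of‑the‑envelope calculation from the hyperparameters published in the Mixtral paper (32 layers, model width 4096, expert width 14336, 8 experts with top‑2 routing, a 32k vocabulary and grouped‑query attention with 8 key‑value heads of size 128). The sketch below ignores normalisation parameters, so the totals are approximate.

```python
# Approximate parameter count for Mixtral 8x7B from its published hyperparameters.
d_model, d_ff, n_layers = 4096, 14336, 32
n_experts, top_k, vocab = 8, 2, 32_000
kv_dim = 8 * 128  # grouped-query attention: 8 key-value heads of dimension 128

expert_ffn = 3 * d_model * d_ff                            # gate, up and down projections
attention = 2 * d_model * d_model + 2 * d_model * kv_dim   # Q/O plus K/V projections
embeddings = 2 * vocab * d_model                           # input embedding and output head

total = n_layers * (n_experts * expert_ffn + attention) + embeddings
active = n_layers * (top_k * expert_ffn + attention) + embeddings

print(f"total  = {total / 1e9:.1f}B parameters")   # ~46.7B
print(f"active = {active / 1e9:.1f}B per token")   # ~12.9B
```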

A comparative scan of recent large language models shows that GPT‑4 still sets the bar for general reasoning, but Gemini 2.5 Pro is closing the gap and Llama 4 Behemoth, when complete, may even exceed GPT‑4.5 on STEM tasks (Comanici et al. 2025, Meta AI 2025). xAI’s Grok 4, a mixture‑of‑experts model with some 1.7 trillion parameters, reportedly matches or exceeds peers on GPQA, AIME and ARC‑AGI, solving about 38.6 % of questions on a demanding PhD‑level exam and outperforming Claude 4 Opus, Gemini 2.5 Pro and OpenAI’s o3‑pro on certain benchmarks, though these results come largely from xAI’s own evaluations.

Coding assistance has advanced through systems such as GitHub Copilot (built on Codex) and has been improved by newer architectures: Mistral’s sparse Mixtral 8×7B and Meta’s Llama 4 Maverick achieve strong performance with fewer active parameters, and the Grok 4 Code variant delivers intelligent suggestions and debugging help, reportedly scoring around 72–75 % on the SWE‑Bench coding benchmark (Meta AI 2025, Verdi 2024). Differences in alignment strategy are also notable: Anthropic’s Claude series employs constitutional and collective constitutional AI to reduce bias, OpenAI relies on proprietary reinforcement learning from human feedback, Gemini 2.5 emphasises safety tuning (Comanici et al. 2025), Grok’s alignment is loosely framed as “truth‑seeking”, and Mistral’s open‑weight models enable community‑driven audits and direct preference optimisation (Mistral 2023).

The broader implication is that sparse mixture‑of‑experts designs and long‑context, multimodal inputs are becoming standard: Gemini 2.5 already processes hours of video and Grok 4 handles contexts of up to 256 000 tokens. These advances reinforce the need for rigorous alignment frameworks and robust evaluation methodologies, as rapid improvements continue to outpace existing benchmarks and sharpen the trade‑off between capability, efficiency and safety (Comanici et al. 2025).

References:

1. Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, et al. 2022. Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073. Available at: https://arxiv.org/abs/2212.08073

2. Comanici, Gheorghe, et al. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261. Available at: https://arxiv.org/abs/2507.06261

3. Fedus, William, Barret Zoph, and Noam Shazeer. 2022. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research 23(120): 1–39.

4. Huang, Saffron, Divya Siddarth, Liane Lovitt, Thomas I. Liao, Esin Durmus, Alex Tamkin, and Deep Ganguli. 2024. Collective Constitutional AI: Aligning a Language Model with Public Input. In: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pp. 1395–1417.

5. Jiang, Albert Q., et al. 2024. Mixtral of Experts. arXiv preprint arXiv:2401.04088. Available at: https://arxiv.org/abs/2401.04088

6. Meta AI. 2025. Llama 4: Advancing Multimodal Intelligence. Meta AI Blog. Available at: https://ai.meta.com/blog/llama-4-multimodal-intelligence/

7. Minaee, Shervin, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. 2024. Large Language Models: A Survey. arXiv preprint arXiv:2402.06196. Available at: https://arxiv.org/abs/2402.06196

8. Mistral. 2023. Mixtral of Experts. Mistral AI News. Available at: https://mistral.ai/news/mixtral-of-experts

9. Ray, Siladitya. 2023. ‘Maximum Truth-Seeking AI’: Musk Says He’s Building ‘TruthGPT’ In Tucker Carlson Interview. Forbes, 18 April 2023. Available at: https://www.forbes.com/.../truthgpt-in-tucker-carlson-interview

10. Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, et al. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971. Available at: https://arxiv.org/abs/2302.13971

11. Verdi, Sara. 2024. Inside GitHub: Working with the LLMs Behind GitHub Copilot. GitHub Blog – AI and ML. Available at: https://github.blog/.../inside-github-working-with-the-llms-behind-github-copilot/