Generative models represent a fundamental paradigm in machine learning, enabling computers to create new data samples that closely mirror real-world examples. These models have become indispensable tools across diverse fields including image creation, natural language processing, and scientific research. Three architectures have emerged as the dominant approaches: Generative Adversarial Networks (GANs), diffusion models, and autoregressive models (Bond-Taylor et al. 2021). Each methodology rests on distinct theoretical foundations and operational mechanisms, offering unique advantages for different applications whilst presenting specific challenges and limitations.
Generative Adversarial Networks, introduced by Goodfellow et al. in 2014, revolutionised generative modelling through their innovative adversarial training paradigm (Goodfellow et al. 2014). The fundamental principle involves a competitive relationship between two neural networks: a generator that produces synthetic data samples from random noise, and a discriminator that attempts to distinguish between real and generated samples. This relationship resembles a counterfeiter trying to create fake currency whilst a detective attempts to identify forgeries. The counterfeiter continuously improves based on feedback from the detective, whilst the detective becomes increasingly skilled at spotting fakes. This adversarial process drives both networks to improve iteratively, with the generator learning to produce increasingly realistic synthetic data.
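To make the adversarial dynamic concrete, the following is a minimal training sketch in PyTorch. The tiny two-layer networks, the toy two-dimensional "real" data, and all hyperparameters are illustrative assumptions rather than the setup of Goodfellow et al. (2014); only the alternating loss structure is the point.

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 2, 128

# Generator maps random noise to synthetic samples; the discriminator
# outputs a logit for "real vs generated".
generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(batch, data_dim) * 0.5 + 2.0   # stand-in "real" data
    fake = generator(torch.randn(batch, latent_dim))

    # Discriminator step: label real samples 1 and generated samples 0.
    d_loss = (bce(discriminator(real), torch.ones(batch, 1))
              + bce(discriminator(fake.detach()), torch.zeros(batch, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: try to make the discriminator label fakes as real.
    g_loss = bce(discriminator(fake), torch.ones(batch, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```

Note the `fake.detach()` in the discriminator step: it stops the discriminator's loss from updating the generator, mirroring the separation between the two players.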
GANs have demonstrated remarkable success in image generation, achieving photorealistic results across diverse domains. Notable applications include StyleGAN for high-quality face generation, which can produce human faces that are often difficult to distinguish from photographs (Karras et al. 2019). For researchers, GANs have proven particularly valuable in medical imaging, where synthetic data can expand small datasets for training diagnostic models, addressing privacy concerns whilst preserving the statistical properties of the original data. However, adversarial training presents inherent challenges, including mode collapse, where the generator produces only a limited variety of outputs, and training instability that can prevent convergence (Arjovsky et al. 2017). These limitations motivated improved variants such as Wasserstein GANs, which stabilise training by replacing the original objective with an estimate of the Wasserstein (earth mover's) distance between the real and synthetic data distributions (Arjovsky et al. 2017; Gulrajani et al. 2017). Subsequent innovations such as Progressive GANs and related architectural refinements have further enhanced both stability and output quality.
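As a sketch of what changing the distance measure means in practice, the WGAN-GP critic objective (Gulrajani et al. 2017) can be written as follows. The `critic` and the real and fake batches are assumed placeholders; the essential change is that the critic's unbounded scores estimate a Wasserstein distance, with a gradient penalty enforcing the required Lipschitz constraint.

```python
import torch

def critic_loss(critic, real, fake, gp_weight=10.0):
    """WGAN-GP critic objective: Wasserstein estimate plus gradient penalty."""
    # The critic should assign high scores to real samples and low scores to fakes.
    w_term = critic(fake).mean() - critic(real).mean()

    # Gradient penalty on random interpolations between real and fake samples,
    # pushing the critic's gradient norm towards 1 (the Lipschitz constraint).
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)))
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    penalty = ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

    return w_term + gp_weight * penalty
```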
Diffusion models represent a paradigm shift in generative modelling, drawing inspiration from non-equilibrium thermodynamics and physical processes observed in nature. The foundational theoretical framework was first outlined by Sohl-Dickstein et al. in 2015, establishing the mathematical basis for this approach (Sohl-Dickstein et al. 2015). Subsequently, Ho et al. advanced the field significantly with their work on Denoising Diffusion Probabilistic Models (DDPMs), demonstrating that these models could achieve image quality comparable to or exceeding that of GANs (Ho et al. 2020). The core principle involves a two-stage process analogous to gradually adding noise to a clear image until it becomes pure static, then learning to reverse this process. In the forward diffusion stage, the model systematically adds Gaussian noise to training images across multiple timesteps until the original image is completely obscured. The reverse denoising stage involves training a neural network to learn how to remove this noise step by step, effectively reconstructing coherent images from pure noise. During generation, the model starts with random noise and applies the learned denoising process iteratively, gradually refining the noise into a coherent image. This progressive refinement approach often produces higher quality results compared to single-step generation methods, as each denoising step allows for careful adjustment and improvement of the emerging image.
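A minimal sketch of this training procedure, loosely following Ho et al. (2020): sample a random timestep, corrupt the image with the closed-form forward process, and train the network to predict the added noise. The linear beta schedule matches the original paper; the `model(noisy_image, timestep)` interface is an assumed placeholder for a denoising network such as a U-Net.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal retention

def ddpm_loss(model, x0):
    """One DDPM training step: predict the noise added at a random timestep."""
    t = torch.randint(0, T, (x0.size(0),))
    noise = torch.randn_like(x0)
    ab = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    # The whole forward process collapses to a single step:
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise
    return F.mse_loss(model(x_t, t), noise)
```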
Diffusion models have achieved state-of-the-art results in image synthesis, often surpassing GANs in terms of sample quality and diversity (Dhariwal & Nichol 2021). Notable implementations include DALL-E 2 for text-to-image generation, Stable Diffusion for high-quality image synthesis, and various applications in audio generation. For researchers, diffusion models are particularly appealing due to their strong theoretical foundation and versatility across multiple modalities, including images, audio, and even molecular structures. The primary limitation of diffusion models lies in their computational intensity: generation typically requires hundreds or thousands of sequential denoising steps, making it significantly slower than single-pass alternatives such as GANs. However, recent advances such as latent diffusion models have addressed this challenge by operating on compressed representations of the data, substantially reducing computational requirements whilst maintaining output quality (Rombach et al. 2022).
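That cost is easiest to see in the sampling loop itself. The sketch below performs plain DDPM ancestral sampling, reusing the schedule from the training sketch above; `model` is again an assumed noise-prediction network, and every one of the T steps requires a full forward pass, which is precisely what latent-space and fast-sampler methods aim to reduce.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def sample(model, shape):
    """Plain DDPM ancestral sampling: T sequential denoising steps."""
    x = torch.randn(shape)                           # start from pure noise
    for t in reversed(range(T)):
        eps = model(x, torch.full((shape[0],), t))   # predicted noise at step t
        alpha, ab = 1.0 - betas[t], alphas_bar[t]
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        x = (x - (1.0 - alpha) / (1.0 - ab).sqrt() * eps) / alpha.sqrt()
        if t > 0:                                    # add noise except at the final step
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```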
Autoregressive models constitute a fundamental class that creates data sequentially, where each new element is generated conditioned on all previously generated elements. These models decompose the joint probability distribution of the data into a product of conditional probabilities, enabling tractable likelihood computation and straightforward sampling. The underlying principle resembles writing a story word by word, where each new word is chosen based on all preceding words. The model learns patterns and relationships in sequences, enabling prediction of what should come next given a particular context. This sequential generation allows for variable-length outputs and the natural incorporation of context and dependencies. The transformer architecture has become the dominant framework for implementing autoregressive models, particularly in natural language processing (Vaswani et al. 2017). Its key innovation is the attention mechanism, which allows the model to focus on relevant parts of the input sequence when generating each new element, enabling sophisticated handling of long-range dependencies and contextual relationships. The GPT family exemplifies the power of autoregressive approaches in natural language processing: Radford et al. demonstrated that such models, trained without supervision on large text corpora, learn diverse linguistic patterns and can perform many tasks with no task-specific training (Radford et al. 2019). Subsequent iterations have shown that scaling model parameters and training data leads to emergent capabilities in few-shot learning, reasoning, and task generalisation.
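The sequential factorisation translates directly into a sampling loop. In the sketch below, `model` is an assumed next-token network returning logits of shape (batch, sequence, vocabulary); the loop implements p(x) = p(x_1) · p(x_2 | x_1) · … by repeatedly conditioning on everything generated so far.

```python
import torch

@torch.no_grad()
def generate(model, prompt, max_new_tokens=50, temperature=1.0):
    """Autoregressive sampling: draw one token at a time from p(x_t | x_<t)."""
    tokens = prompt                                  # (1, seq_len) integer token ids
    for _ in range(max_new_tokens):
        logits = model(tokens)[:, -1, :]             # condition on all prior tokens
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```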
Autoregressive models extend well beyond text generation. In image generation, models such as PixelRNN and PixelCNN apply the same sequential principle to pixels, building images one pixel at a time (van den Oord et al. 2016), demonstrating the versatility of autoregressive modelling across data modalities. For researchers, autoregressive models are particularly valuable for their flexibility across tasks, from text generation to time-series prediction, and they are especially useful when the probability of an output matters, since they provide tractable likelihood estimates. However, their sequential nature makes generation computationally intensive, especially for high-dimensional outputs such as high-resolution images, because each element must be generated in order.
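That cost is visible in a raster-order sampling sketch in the spirit of PixelCNN (van den Oord et al. 2016). The `model` here is an assumed network whose masked convolutions make each output position depend only on pixels above and to the left, with a 256-way categorical output per pixel; generating a single image requires one forward pass per pixel.

```python
import torch

@torch.no_grad()
def sample_image(model, height=28, width=28):
    """Generate one greyscale image pixel by pixel, in raster order."""
    img = torch.zeros(1, 1, height, width)
    for i in range(height):
        for j in range(width):
            logits = model(img)                      # assumed shape (1, 256, H, W)
            probs = torch.softmax(logits[0, :, i, j], dim=0)
            img[0, 0, i, j] = torch.multinomial(probs, 1).item() / 255.0
    return img
```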
References:
1. Arjovsky, Martin, Soumith Chintala, and Léon Bottou. 2017. ‘Wasserstein Generative Adversarial Networks’. Proceedings of the 34th International Conference on Machine Learning, PMLR 70: 214–223.
2. Bond-Taylor, Samuel, Adam Leach, Yang Long, and Christopher G. Willcocks. 2021. ‘Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models’. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(11): 7327–7347.
3. Dhariwal, Prafulla, and Alexander Nichol. 2021. ‘Diffusion Models Beat GANs on Image Synthesis’. Advances in Neural Information Processing Systems 34: 8780–8794.
4. Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. ‘Generative Adversarial Nets’. Advances in Neural Information Processing Systems 27.
5. Gulrajani, Ishaan, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. 2017. ‘Improved Training of Wasserstein GANs’. Advances in Neural Information Processing Systems 30: 5767–5777.
6. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. ‘Denoising Diffusion Probabilistic Models’. Advances in Neural Information Processing Systems 33: 6840–6851.
7. Karras, Tero, Samuli Laine, and Timo Aila. 2019. ‘A Style-Based Generator Architecture for Generative Adversarial Networks’. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4401–4410.
8. Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. ‘Language Models Are Unsupervised Multitask Learners’. OpenAI Technical Report.
9. Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. ‘High-Resolution Image Synthesis with Latent Diffusion Models’. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695.
10. Sohl-Dickstein, Jascha, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. ‘Deep Unsupervised Learning Using Nonequilibrium Thermodynamics’. Proceedings of the 32nd International Conference on Machine Learning, PMLR 37: 2256–2265.
11. van den Oord, Aäron, Nal Kalchbrenner, and Koray Kavukcuoglu. 2016. ‘Pixel Recurrent Neural Networks’. Proceedings of the 33rd International Conference on Machine Learning, PMLR 48: 1747–1756.
12. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. ‘Attention Is All You Need’. Advances in Neural Information Processing Systems 30: 5998–6008.