Like many who have spent time in academia, I sometimes have to resist the thought that what I really need to do is a(nother) PhD. I have been eyeing up Imperial College London’s ML research group as one of the most exciting research centres for machine learning in the UK. So I was delighted and flattered to be asked by Imperial’s Data Learning seminar to speak to them about latent diffusion models, the architecture behind many image generation models.
I won’t claim that my talk presents novel research, though it does contain details on how we accelerate the computation of forward passes of these models in production.
The talk walks through some of the mathematics behind these models, to show how image generation is possible at all. Even now it is hard to believe that image models manage to encode such rich visual data as they do, and I found denoising diffusion, a technique that predates generative AI, similarly incredible. Here’s an equation-free introduction.
Consider ink diffusing in water, via some stochastic process. At the end of the diffusion process, the ink is uniformly distributed through the water: the water is uniformly inky.
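To make the forward process concrete, here is a minimal Python sketch of my own (not from the talk) that treats the ink as particles taking small random steps in a one-dimensional container with reflecting walls. Whatever the starting configuration, after enough steps the histogram of positions flattens out: uniformly inky water.

```python
import numpy as np

rng = np.random.default_rng(0)

# All the “ink” starts concentrated at the centre of a 1-D container [0, 1].
n_particles = 10_000
positions = np.full(n_particles, 0.5)

# Diffuse: each particle takes small Gaussian steps, reflected at the
# walls so the ink stays inside the container.
step_size = 0.01
for _ in range(5_000):
    positions += step_size * rng.standard_normal(n_particles)
    positions = np.abs(positions)          # reflect at the wall at 0
    positions = 1 - np.abs(1 - positions)  # reflect at the wall at 1

# After many steps the distribution is flat: each bin holds ~10% of the ink.
counts, _ = np.histogram(positions, bins=10, range=(0, 1))
print(counts / n_particles)
```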
The task of inferring backwards from this final state of uniformity, one tiny step at a time, is a problem from non-equilibrium thermodynamics.
If you can solve it, then you can recover the initial conditions of the system.
A deep learning method called score estimation allows you to solve the stochastic differential equation that defines the reverse diffusion process, that is, the backwards-ink-diffusion process.
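To see what solving that reverse SDE looks like numerically, here is a toy sketch. I should stress the assumptions: in a real model the score function is a neural network trained by score matching, but here I use a one-dimensional Gaussian data distribution whose noised score is known in closed form, so a reverse-time Euler–Maruyama loop can run without any training. The constants are illustrative choices, not production settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data distribution: N(0, 0.1^2). Under the forward (variance-preserving)
# SDE dx = -0.5*beta*x dt + sqrt(beta) dW, the noised marginal stays Gaussian,
# so its score is available in closed form and no network is needed here.
beta = 1.0      # constant noise schedule (assumed for simplicity)
sigma0 = 0.1    # std of the data distribution
T = 5.0         # total diffusion time
n_steps = 1_000
dt = T / n_steps

def marginal_var(t):
    # Var[x_t] = sigma0^2 * exp(-beta*t) + (1 - exp(-beta*t))
    decay = np.exp(-beta * t)
    return sigma0**2 * decay + (1.0 - decay)

def score(x, t):
    # Score of a centred Gaussian: grad log p_t(x) = -x / Var[x_t]
    return -x / marginal_var(t)

# Reverse-time Euler–Maruyama: start from (approximately) pure noise at
# t = T and integrate the reverse SDE back towards t = 0.
x = rng.standard_normal(10_000)
t = T
for _ in range(n_steps):
    drift = -0.5 * beta * x - beta * score(x, t)  # f(x,t) - g(t)^2 * score
    x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(x.shape)
    t -= dt

print(x.std())  # ~0.1: the noise has been integrated back to the data distribution
```

Swap the closed-form `score` for a learned network and the same loop becomes a score-based generative sampler.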
Denoising diffusion models use the same mathematics to literally create images out of noise. The models are trained on “noised images”: images that have been corrupted by adding noise, step by step, until no information from the original image remains. At inference time, the models begin with a sample of pure noise, and incrementally denoise it to obtain a sample that is not noise, but an image! Text conditioning means that this denoising can be “guided” by text information, so that the resulting sample corresponds to the content of the text. This is how denoising diffusion models make images from prompt inputs.
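Here is a minimal sketch of both halves of that recipe in the DDPM formulation: building a “noised image” training pair with the closed-form noising formula, and the step-by-step denoising loop used at inference time. The `model(x, t)` argument is a hypothetical stand-in for a trained noise-prediction network; in a text-conditioned model it would also take a text embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# DDPM-style noise schedule: alpha_bar[t] is the fraction of the original
# signal surviving after t noising steps.
n_steps = 1_000
betas = np.linspace(1e-4, 0.02, n_steps)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def make_training_pair(image, t):
    """Corrupt `image` to noise level t; the model learns to predict `eps`."""
    eps = rng.standard_normal(image.shape)
    noised = np.sqrt(alpha_bar[t]) * image + np.sqrt(1.0 - alpha_bar[t]) * eps
    return noised, eps  # (model input, training target)

def sample(model, shape):
    """Ancestral sampling: start from pure noise and denoise step by step."""
    x = rng.standard_normal(shape)  # begin with a sample of pure noise
    for t in reversed(range(n_steps)):
        eps_hat = model(x, t)  # predicted noise in x at step t
        # Remove the predicted noise (the DDPM posterior mean)...
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        # ...then re-inject a little fresh noise, except at the final step.
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Exercise the loop end to end with a trivial stand-in for the network:
dummy_model = lambda x, t: np.zeros_like(x)
print(sample(dummy_model, shape=(4, 4)).shape)  # (4, 4)
```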
This talk was given in June 2024.