To be, fo hend!
First her sense ountier to Jupits,
be horse.
Wise words! This is the results of 2 hours of training my very own PyTorch Diffusion Language Model on an M2 MacBook Air.
You can check out the code over at GitHub: github.com/Encrux/simple_dlm
Why?
Diffusion Language Models are kind of a hot-topic right now in Machine Learning. The basic idea: corrupt some data with noise, then train a model to reverse that corruption over many small steps.
They’re used in a variety of domains, most notably in image synthesis. Image generation models like Stable Diffusion treat this as a continuous problem on a per-pixel basis, because a pixel value of 134 is close to 135. For text, this principle is not as straightforward: the letter “A” (ASCII 65) is in no meaningful sense closer to “B” (ASCII 66) than to “Z” (ASCII 90). The fix is to give up on numeric noise altogether. We corrupt the sequence by replacing tokens with a [MASK] token, and let the model learn to predict what was there.
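The corruption step can be sketched in a few lines. This is a minimal illustration, not the repo’s actual code; the `MASK_ID` value and the per-character token IDs are assumptions.

```python
import torch

MASK_ID = 0  # assumed ID for the [MASK] token

def corrupt(tokens: torch.Tensor, mask_prob: float) -> torch.Tensor:
    """Independently replace each token with [MASK] with probability mask_prob."""
    mask = torch.rand(tokens.shape) < mask_prob
    noisy = tokens.clone()
    noisy[mask] = MASK_ID
    return noisy

tokens = torch.tensor([72, 101, 108, 108, 111])  # "Hello" as character codes
noisy = corrupt(tokens, 0.5)  # roughly half the characters become [MASK]
```

Because the noise is discrete replacement rather than added numbers, there is no notion of “a little bit noisy” per token; instead, the noise level is the *fraction* of tokens masked.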
This feels nothing like the physical noise picture, but it’s still proper diffusion math under the hood.
Diffusion vs autoregressive
Pretty much any LLM currently in use decodes tokens autoregressively, so why care about diffusion? Autoregressive models force left-to-right decoding by design. With diffusion, we can decode the entire sequence in parallel and reach a low-entropy state (i.e. actual text) by repeated denoising over the same sequence.
In theory, this can yield significantly higher tokens per second. Models like Mercury are working towards demonstrating this in the real world.
Training loop
For training, we grab a random 128-character chunk from the training data and sample a random masking probability mask_prob ~ U(0, 1). That fraction of tokens is replaced with the [MASK] token.
After the forward pass, we compute the cross-entropy loss on the masked tokens only. mask_prob itself also gets passed into the model as an input. That’s what lets one network handle every noise level, from a barely-masked sequence to a fully-masked one.
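Putting those pieces together, one training step might look like the sketch below. The model interface (`model(tokens, mask_prob)` returning per-position logits), `MASK_ID`, and the data layout are assumptions about the repo, not its actual code.

```python
import torch
import torch.nn.functional as F

MASK_ID, SEQ_LEN = 0, 128  # assumed mask-token ID and chunk length

def train_step(model, data: torch.Tensor, optimizer) -> float:
    # Grab a random 128-char chunk from the training data.
    start = torch.randint(0, len(data) - SEQ_LEN, (1,)).item()
    chunk = data[start:start + SEQ_LEN]

    # Sample mask_prob ~ U(0, 1) and corrupt that fraction of tokens.
    mask_prob = torch.rand(1).item()
    mask = torch.rand(SEQ_LEN) < mask_prob
    if not mask.any():  # rare edge case: nothing was masked, nothing to learn
        return 0.0
    noisy = chunk.clone()
    noisy[mask] = MASK_ID

    # mask_prob is also fed to the model, so one network covers all noise levels.
    logits = model(noisy.unsqueeze(0), mask_prob)  # (1, SEQ_LEN, vocab)

    # Cross-entropy on the masked positions only.
    loss = F.cross_entropy(logits[0][mask], chunk[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Scoring only the masked positions matters: the unmasked tokens are given to the model verbatim, so predicting them carries no learning signal.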
Sampling
For decoding, we begin by setting all tokens to [MASK]. We then run k denoising steps, committing more tokens each time, until the sequence is fully revealed. In this example, k = 20.
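A minimal version of that decoding loop could look like this. The confidence-based commit schedule shown here is one common choice for masked diffusion samplers, and an assumption on my part; the repo’s exact schedule and model interface may differ.

```python
import torch

MASK_ID, SEQ_LEN = 0, 128  # assumed mask-token ID and sequence length

@torch.no_grad()
def sample(model, k: int = 20) -> torch.Tensor:
    seq = torch.full((SEQ_LEN,), MASK_ID)  # start fully masked
    for step in range(k):
        mask_prob = 1.0 - step / k  # remaining noise level, fed to the model
        logits = model(seq.unsqueeze(0), mask_prob)[0]  # (SEQ_LEN, vocab)
        probs = logits.softmax(-1)
        probs[:, MASK_ID] = 0.0  # never predict the mask token itself
        conf, pred = probs.max(-1)

        # Commit the most confident predictions among still-masked positions,
        # revealing a growing fraction of the sequence each step.
        still_masked = seq == MASK_ID
        n_reveal = int(SEQ_LEN * (step + 1) / k) - int((~still_masked).sum())
        if n_reveal > 0:
            conf[~still_masked] = -1.0  # never re-commit revealed tokens
            idx = conf.topk(n_reveal).indices
            seq[idx] = pred[idx]
    return seq
```

With k = 20 and 128 positions, each step commits roughly 6–7 tokens, so every forward pass reveals several tokens at once instead of exactly one, which is where the parallelism over autoregressive decoding comes from.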
What undertraining sounds like
Step 67k, loss 1.22:
To be, and be of men?
Prown AMEN:
O yout aboars of
Ra':
Un
Step 77k, loss 1.09:
To be, fo hend!
First her sense ountier to Jupits,
be horse.
The output is obviously mostly nonsense, but the fact that it learned to output real words and strings that even slightly resemble actual sentences is quite impressive considering the hardware it was trained on. Tokens are encoded per character, so the model had to learn spelling from scratch.
Stepping back
This write-up barely scratches the surface. The different flavors of language models keep growing in number. I didn’t address shortcomings like actual model performance or fixed decoding lengths, nor how these are (or could be) addressed.
There are a lot of scary buzzwords floating around in the age of AI. Projects like these help me make sense of key concepts that I think are worth knowing about. I find diffusion models fascinating, and in the future I definitely want to learn more about their inner workings, especially when it comes to multi-modal models.
References
- Mercury (Inception Labs, 2025) - very fast DLM
- LLaDA: Large Language Diffusion Models (Nie et al., 2025)
- How Efficient Are Diffusion Language Models? (2026)
- The original masked language model: BERT (Devlin et al., 2018)
- Sander Dieleman, “Diffusion is spectral autoregression”
- Inspiration (and dataset) for this Project: Karpathy, nanoGPT
- Source code