Back to Basics: Build Your Own LLM from Scratch

Feed-Forward Network (FFN)

Attention mixes information across tokens. The FFN then processes each token independently, applying the same weights at every position to transform what attention gathered.

FFN(x) = activation(x · W1 + b1) · W2 + b2

Two linear layers with a non-linearity in between:

Expand: W1 of shape (n_embd × d_ff), typically d_ff = 4 × n_embd
Activate: GELU (GPT) or ReLU
Compress: W2 of shape (d_ff × n_embd)

The expand-then-compress pattern gives the model a wider working space to compute in.
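
The formula above can be sketched in plain Python. This is a minimal illustration, not the workshop's own code: the helper names (`gelu`, `linear`, `ffn`, `random_matrix`), the toy sizes, and the random initialization are all assumptions, and GELU is written in its common tanh approximation.

```python
import math
import random

def gelu(v):
    """Tanh approximation of GELU for a single float."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def linear(x, W, b):
    """Compute x · W + b for one token vector x and W of shape (n_in × n_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = activation(x · W1 + b1) · W2 + b2, for a single token."""
    hidden = [gelu(h) for h in linear(x, W1, b1)]  # expand + activate: n_embd -> d_ff
    return linear(hidden, W2, b2)                  # compress: d_ff -> n_embd

def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights, for illustration only."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_embd = 8            # toy size, assumed for the example
d_ff = 4 * n_embd     # the typical 4x expansion

W1, b1 = random_matrix(n_embd, d_ff, rng), [0.0] * d_ff
W2, b2 = random_matrix(d_ff, n_embd, rng), [0.0] * n_embd

# A sequence of 5 token vectors; each one goes through the FFN on its own.
sequence = [[rng.gauss(0.0, 1.0) for _ in range(n_embd)] for _ in range(5)]
outputs = [ffn(token, W1, b1, W2, b2) for token in sequence]

print(len(outputs), len(outputs[0]))  # 5 8
```

Because `ffn` is called once per token with the same weights, changing one token's vector cannot affect another token's output. Only attention moves information between positions.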

	a	b	c	d
Model A	0.10	0.10	0.70	0.10
Model B	0.25	0.25	0.25	0.25
Model C	0.40	0.40	0.05	0.15

	Training	Inference
Input	Batches of full sequences	User's prompt + tokens so far
Use which logits	Every position (parallel prediction)	Only the last position
Compare to truth	Yes — cross-entropy loss	No — there is no answer key
Backprop	Yes	No
Weights	Updated every step	Frozen
Output	A slightly better model	The next token
Loop	Over millions of batches	Over generated tokens
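
The "use which logits" row can be made concrete with a small sketch. The logits and targets below are made-up toy values, the function names are hypothetical, and picking the highest-scoring token (greedy decoding) is an assumption chosen for simplicity; sampling would work at the same point.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def training_loss(logits_per_position, targets):
    """Training: score every position against its known next token."""
    losses = [-math.log(softmax(row)[t]) for row, t in zip(logits_per_position, targets)]
    return sum(losses) / len(losses)  # mean cross-entropy

def next_token(logits_per_position):
    """Inference: only the last position's logits are used (greedy pick, an assumption)."""
    last = logits_per_position[-1]
    return max(range(len(last)), key=lambda i: last[i])

# Toy model output: 3 positions, vocabulary of 4 tokens (made-up numbers).
logits = [
    [2.0, 0.5, 0.1, 0.1],
    [0.1, 0.2, 3.0, 0.3],
    [0.0, 1.5, 0.2, 0.1],
]
targets = [0, 2, 1]  # the "answer key", available only during training

loss = training_loss(logits, targets)  # would feed backprop and a weight update
token = next_token(logits)             # would be appended to the prompt, then the loop repeats

print(token)  # 1
```

During training the loss is a single number summarizing all positions at once. During inference there is no target to compare against, so the model's output is simply the chosen token.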

Workshop Overview

Why build one?

Goal of language modeling

The whole model in 5 numbers

A simple GPT-like model looks like

Tokens

Tokenizer

Embeddings

Positional encoding

Why attention?

Single-head attention: Q, K, V

Attention scores (scaled)

Softmax intuition

Causal mask

Apply attention

Multi-head attention

Combining heads with Wo

Residual + LayerNorm

Feed-Forward Network (FFN)

One full transformer block

Stacking layers

Output logits

Generation: from logits to text

Cross-entropy loss: how wrong is the model?

Cross-entropy: the formula

Why we minimize it

Backpropagation: who is to blame?

How backprop computes gradients

From gradients to weight updates

Training

Training vs. inference

What gets learned, what stays fixed

Parameter count: the formula

Parameter count: worked example

Scaling: why bigger?

Recap: what we just built

Hands-on: the plan

Hands-on: Walkthrough

Hands-on: Train

Hands-on: Generate

References & Credits

Thank you!