Back to Basics: Build Your Own LLM from Scratch

Open the black box by building it yourself, from tokenizer to transformer, every layer, every weight, every line of code.

Last Modified: 2026-06-14T00:07:30+0530

Workshop Overview

By the end of this workshop, you will have:

  • Built a tokenizer from scratch
  • Understood every component of a GPT-style transformer
  • Trained a small language model on your own data
  • Generated text from your trained model

We will not use any black-box library calls for the model itself; every layer, every weight, every line is yours to inspect.

Why build one?

  • Demystify: LLMs are not magic. They are matrix multiplications, gradients, and a lot of data.
  • Intuition: Once you have built the smallest version, you can reason about why larger ones behave the way they do.
  • Agency: You decide the data, the size, the training.
  • Foundations: Every modern model shares the same skeleton we will build today.

Goal of language modeling

Given a sequence of tokens, predict the next one.

"The cat sat on the __?"

The model assigns a probability to every possible next token in the vocabulary. We pick one (greedily, or by sampling) and append it. Repeat.

That is it. Everything else, attention, embeddings, layers, is machinery to make this prediction better.

The whole model in 5 numbers

class GPTConfig:
    vocab_size = 65     # how many unique tokens
    n_embd     = 256    # vector size for each token
    n_head     = 4      # parallel attention heads (must divide n_embd evenly)
    block_size = 256    # max context length
    n_layer    = 4      # stacked transformer blocks

These five numbers define the shape of the model. We will unpack each as we go, and revisit them at the end when we count parameters.

A simple GPT-like model looks like

input tokens → embedding + positional encoding → N transformer blocks → output logits → next token

Tokens

Computers don't read text; they read numbers. A token is the unit of text the model sees.

Choices:

  • Character-level: one token per character. Small vocab, long sequences. We will use this, as it is simple to build.
  • Word-level: one token per word. Huge vocab, lots of unknown words.
  • Subword (BPE): middle ground. Common words stay whole, rare words split into pieces. GPT, LLaMA, Claude all use this family.

Example:

"unhappiness"  tokenized to  ['u','n','h','a','p','p','i','n','e','s','s']  # character
"unhappiness"  tokenized to  ["un", "happi", "ness"]   # subword

Tokenizer

A tokenizer does two jobs:

  1. Encode: text into list of token IDs (integers)
  2. Decode: list of token IDs into text
encode( "hello" ) =  [104, 101, 108, 108, 111]
decode([104, 101, 108, 108, 111]) = "hello"
# The exact numbers depend on your tokenizer's vocab; these are illustrative.

The size of the tokenizer's vocabulary becomes our vocab_size. We will build a small character-level tokenizer later in the workshop.
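
The two jobs can be sketched in a few lines, assuming a character-level vocabulary built from the training text. The class name CharTokenizer matches the walkthrough list later in the deck, but the method bodies and the stoi/itos names are illustrative assumptions, not the workshop's actual code.

```python
class CharTokenizer:
    """Character-level tokenizer: one token ID per unique character."""

    def __init__(self, text: str):
        # Sorted so the ID assignment is deterministic for a given text.
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}   # char -> ID
        self.itos = {i: ch for ch, i in self.stoi.items()}  # ID -> char
        self.vocab_size = len(chars)

    def encode(self, text: str) -> list[int]:
        # Raises KeyError for characters that were not in the vocab text.
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)


tok = CharTokenizer("hello world")
ids = tok.encode("hello")
print(ids, tok.decode(ids), tok.vocab_size)
```

The IDs here are indices into the sorted character set, so they differ from the byte-like values in the illustration above. Only the round trip matters.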

Embeddings

A token ID like 104 is just a number; it carries no meaning. We need a representation the model can do math on.

The embedding table is a learned lookup:

  embedding_lookup(token_ids)  =  E   # shape: (seq_len × n_embd)

  • seq_len is the number of tokens in the input (the length of token_ids).
  • Each row of E is one token's learned vector of length n_embd.
  • Shape of the table: (vocab_size × n_embd)
  • Initialized randomly, updated during training.
  • Tokens that appear in similar contexts tend to end up with similar vectors.
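
As a rough sketch, the lookup is plain row indexing into a table. The sizes, the seed, and the 0.02 standard deviation are assumptions for illustration. A real model would use a framework tensor so the table can receive gradients.

```python
import random

vocab_size, n_embd = 8, 4   # tiny illustrative sizes, not the workshop config
random.seed(0)

# One row per token ID: shape (vocab_size x n_embd), random at initialization.
table = [[random.gauss(0.0, 0.02) for _ in range(n_embd)] for _ in range(vocab_size)]


def embedding_lookup(token_ids):
    # Row indexing only; the result has shape (seq_len x n_embd).
    return [table[t] for t in token_ids]


E = embedding_lookup([3, 2, 4, 4, 5])
print(len(E), len(E[0]))  # 5 4
```

The same token ID always maps to the same row, which is why positions 2 and 3 above (both ID 4) get identical vectors before position information is added.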

Positional encoding

Attention treats the input as a set of tokens, not a sequence. Without position information, the model cannot tell:

"cat sat on mat"   vs   "mat on sat cat"

We add a position-dependent vector to each token's embedding:

X = E + positional_encoding(seq_len)        # shape: (seq_len × n_embd)

Two common flavors:

  • Learned (GPT-style): another lookup table of shape (block_size × n_embd)
  • Fixed sinusoidal (original transformer): no parameters, just sin/cos at different frequencies
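
A minimal sketch of the fixed sinusoidal flavor, using the sin/cos formula from the original transformer paper. The function names are hypothetical, and the learned flavor would simply be a second lookup table instead.

```python
import math


def sinusoidal_encoding(seq_len: int, n_embd: int) -> list[list[float]]:
    """Fixed encoding: sin on even dimensions, cos on odd dimensions."""
    pe = [[0.0] * n_embd for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, n_embd, 2):
            # Each dimension pair uses a different frequency.
            angle = pos / (10000 ** (i / n_embd))
            pe[pos][i] = math.sin(angle)
            if i + 1 < n_embd:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def add_position(E, pe):
    # X = E + positional_encoding, element-wise; both are (seq_len x n_embd).
    return [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(E, pe)]


pe = sinusoidal_encoding(seq_len=4, n_embd=6)
X = add_position([[0.0] * 6 for _ in range(4)], pe)
```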

Why attention?

Consider:

"The trophy did not fit in the suitcase because it was too big."

What does it refer to? To answer, the model must look back at earlier tokens and decide which one is relevant.

  • A fixed-window approach (look at last 3 words) fails on long-range references.
  • A recurrent approach (RNN) struggles to carry information across many steps.

Attention: for each token, compute a weighted sum over all previous tokens, where the weights come from learned projections and depend on the content.

Single-head attention: Q, K, V

For each token's vector x, we compute three projections:

  • Query (Q): what am I looking for?
  • Key (K): what do I offer?
  • Value (V): what do I pass on if matched?

Q = X · Wq
K = X · Wk
V = X · Wv

Each weight matrix Wq, Wk, Wv has shape (n_embd × n_embd). These are learned during training.

Attention scores (scaled)

Each token's query is compared with every other token's key via dot product:

scores = (Q · Kᵀ) / √(d_k)
  • High dot product = "this key matches my query well"
  • √(d_k) scaling is critical. Without it, dot products grow large as d_k increases, pushing softmax into saturated regions where one weight is near 1, the rest are near 0, and gradients are close to zero (no learning).
  • Shape of scores: (seq_len × seq_len), every token's relevance to every other token.

Softmax intuition

Raw scores are arbitrary numbers. We want weights that:

  • are all positive
  • sum to 1
  • behave like "importance"
softmax(x_i) = exp(x_i) / Σ exp(x_j)

Example:

raw scores  →  [2.0, 1.0, -∞]
softmax     →  [0.73, 0.27, 0.00]

After softmax, each row of the score matrix is a probability distribution over which earlier tokens to "pay attention to."
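
The example above can be reproduced with a few lines of plain Python. Subtracting the maximum before exponentiating is a common stability trick that does not change the result. It is an addition here, not something the slides prescribe.

```python
import math


def softmax(xs):
    # Subtract the max first: same probabilities, no overflow for large scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


weights = softmax([2.0, 1.0, float("-inf")])
print([round(w, 2) for w in weights])  # [0.73, 0.27, 0.0]
```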

Causal mask

In a language model, token i must not see future tokens j > i, otherwise training would be cheating (the model could just copy the answer).

Before softmax, we set future positions to -∞:

scores[i, j] = -∞   for all j > i

After softmax, those positions become exactly 0.

  • Used in GPT, LLaMA, Claude, all autoregressive models
  • Not used in encoder-only models like BERT (they see everything at once)
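
A small sketch of the mask on a 3 × 3 score matrix, in plain Python. The helper names are hypothetical. Row i of the result only puts weight on positions j ≤ i.

```python
import math

NEG_INF = float("-inf")


def causal_mask(scores):
    """Return a copy of a square score matrix with future positions set to -inf."""
    n = len(scores)
    return [[scores[i][j] if j <= i else NEG_INF for j in range(n)] for i in range(n)]


def softmax_rows(scores):
    out = []
    for row in scores:
        m = max(row)   # the diagonal is never masked, so m is finite
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


scores = [[1.0, 2.0, 3.0],
          [1.0, 1.0, 5.0],
          [0.0, 0.0, 0.0]]
weights = softmax_rows(causal_mask(scores))
for row in weights:
    print([round(w, 2) for w in row])
```

The first token can only attend to itself, so its row is always [1, 0, 0] regardless of the raw scores.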

Apply attention

Once we have the masked, softmaxed weights, we use them to take a weighted sum of the values:

attention_out = softmax(scores) · V        # shape: (seq_len × n_embd)

Each output row is a blend of value vectors from earlier positions, weighted by relevance.

This is the heart of the transformer. Everything else is plumbing.
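
Putting the pieces together, a single causal attention head might look like the sketch below. It uses plain Python lists and identity weight matrices purely to keep the demo readable. In a trained model Wq, Wk, Wv are learned, and the function names here are assumptions.

```python
import math


def matmul(A, B):
    # (n x k) times (k x m) -> (n x m)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(X, Wq, Wk, Wv):
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)                       # (seq_len x seq_len)
    n = len(scores)
    for i in range(n):
        for j in range(n):
            # Scale by sqrt(d_k), then hide future positions.
            scores[i][j] = scores[i][j] / math.sqrt(d_k) if j <= i else float("-inf")
    weights = [softmax(row) for row in scores]
    return matmul(weights, V), weights            # weighted sum of values


I2 = [[1.0, 0.0], [0.0, 1.0]]                     # identity weights, demo only
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # seq_len=3, n_embd=2
out, weights = causal_attention(X, I2, I2, I2)
```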

Multi-head attention

One attention head learns one kind of relationship. To capture different relationships in parallel, we run several heads simultaneously.

d_k = n_embd / n_head
  • Split Q, K, V across n_head heads
  • Each head gets a slice of shape (seq_len × d_k)
  • Each head runs the same attention math independently
  • Heads can specialize: one for syntax, one for coreference, one for position, etc.

Combining heads with Wo

After running all heads in parallel, we concatenate their outputs back to (seq_len × n_embd) and project through one more matrix:

combined = concat(head_1, head_2, ..., head_n)    # (seq_len × n_embd)
attn_out = combined · Wo
  • Wo has shape (n_embd × n_embd)
  • It learns how to mix information across heads
  • Without Wo, heads would be siloed; Wo lets them talk
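
The split and concatenate steps are only slicing and joining columns. The sketch below shows just that bookkeeping on a toy matrix, with hypothetical helper names. In the full layer, each head would run attention between the two steps and the result would then be multiplied by Wo.

```python
def split_heads(M, n_head):
    """Split a (seq_len x n_embd) matrix into n_head slices of shape (seq_len x d_k)."""
    n_embd = len(M[0])
    if n_embd % n_head != 0:
        raise ValueError("n_embd must be divisible by n_head")
    d_k = n_embd // n_head
    return [[row[h * d_k:(h + 1) * d_k] for row in M] for h in range(n_head)]


def concat_heads(heads):
    """Concatenate per-head outputs back to (seq_len x n_embd)."""
    seq_len = len(heads[0])
    return [[x for head in heads for x in head[t]] for t in range(seq_len)]


Q = [[1, 2, 3, 4],
     [5, 6, 7, 8]]                      # seq_len=2, n_embd=4
heads = split_heads(Q, n_head=2)        # two slices, each with d_k=2
combined = concat_heads(heads)
```

The divisibility check is why n_embd and n_head have to be chosen together: d_k = n_embd / n_head must be a whole number.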

Residual + LayerNorm

Two tricks that make deep transformers trainable. We apply this pattern twice per block: once around attention, once around the FFN.

x = x + sublayer(LayerNorm(x))
  • Residual (x + ...): lets gradients flow directly through deep stacks; without it, training a 6+ layer model is unstable.
  • LayerNorm: normalizes each token's vector to mean 0, variance 1, then applies learned scale γ and shift β. Keeps activations in a healthy range.

Parameters per LayerNorm: 2 × n_embd (γ and β).
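
A minimal sketch of both operations on a single token vector. The eps value and the function names are assumptions (1e-5 is a common default), and γ and β start at 1 and 0 before training adjusts them.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector to mean 0, variance 1, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]


def residual(x, sublayer_out):
    # The skip connection: add the sublayer's output back onto its input.
    return [a + b for a, b in zip(x, sublayer_out)]


x = [1.0, 2.0, 3.0, 4.0]
gamma, beta = [1.0] * 4, [0.0] * 4      # 2 x n_embd learned parameters in total
normed = layer_norm(x, gamma, beta)
print([round(v, 3) for v in normed])
```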

Feed-Forward Network (FFN)

Attention mixes information across tokens. The FFN processes each token independently, giving the model room to think.

FFN(x) = activation(x · W1 + b1) · W2 + b2

Two linear layers with a non-linearity in between:

  • Expand: W1 of shape (n_embd × d_ff), typically d_ff = 4 × n_embd
  • Activate: GELU (GPT) or ReLU
  • Compress: W2 of shape (d_ff × n_embd)

The expand then compress pattern gives the model a wider working space to compute in.
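
The formula can be sketched for one token vector as follows. This uses the exact erf-based GELU (some implementations use a tanh approximation instead), and the constant 0.1 weights are placeholders chosen only so the output is easy to check by hand.

```python
import math


def gelu(v):
    # Exact GELU: v * Phi(v), where Phi is the standard normal CDF.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(x, W, b):
    # x: (d_in,), W: (d_in x d_out), b: (d_out,)
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def ffn(x, W1, b1, W2, b2):
    hidden = [gelu(h) for h in linear(x, W1, b1)]   # expand, then activate
    return linear(hidden, W2, b2)                    # compress back to n_embd


n_embd, d_ff = 2, 8                                  # d_ff = 4 x n_embd
W1 = [[0.1] * d_ff for _ in range(n_embd)]
b1 = [0.0] * d_ff
W2 = [[0.1] * n_embd for _ in range(d_ff)]
b2 = [0.0] * n_embd
y = ffn([1.0, -1.0], W1, b1, W2, b2)
```

Because the FFN takes one token vector at a time, applying it to a sequence is just a loop over rows, with no interaction between positions.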

One full transformer block

   x ──┬──→ LayerNorm → Multi-head Attn ──→ (+) ──→ h
       │                                     ↑
       └───────────── residual ──────────────┘

   h ──┬──→ LayerNorm → FFN ──────────────→ (+) ──→ x_out
       │                                     ↑
       └───────────── residual ──────────────┘

Two sublayers, each wrapped in residual + pre-LayerNorm. This is the GPT-2 / modern convention.

Stacking layers

The block is the unit. We stack n_layer of them, each with its own weights:

x = embeddings + positional_encoding
for i in range(n_layer):
    x = transformer_block_i(x)
x_final = LayerNorm(x)
  • Lower layers tend to learn surface patterns (syntax, common phrases)
  • Upper layers tend to learn abstract patterns (semantics, reasoning)
  • More layers = more capacity, more compute, more data needed

Output logits

After the final block, we project back to vocabulary size to get a score for every possible next token:

logits = x_final · W_out          # shape: (seq_len × vocab_size)
  • One row per input position
  • Each row has one number per token in the vocab
  • The Linear layer here is often weight-tied to the input embedding table (same matrix, transposed). Saves parameters and works well in practice.

Generation: from logits to text

During inference, we use only the last row of logits (the prediction for the next token):

logits[-1]  →  softmax  →  probabilities over vocab
            →  sample (or argmax)
            →  next token ID
            →  decode → character

Then append the new token to the input and repeat. This is how GPT writes a sentence: one token at a time, in a loop.

Sampling knobs (we'll see these later): temperature, top-k, top-p.
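
As a preview, here is a hedged sketch of how temperature and top-k could act on the last row of logits (top-p is left out). Treating temperature 0 as greedy argmax is a convention assumed here, and the function name is hypothetical.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Pick a next-token ID from one row of logits."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])   # greedy
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # Keep the top_k largest logits (top_k >= 1; ties at the cutoff are kept).
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]       # unnormalized softmax
    return rng.choices(range(len(exps)), weights=exps, k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]
rng = random.Random(0)                              # seeded for reproducibility
greedy = sample_next(logits, temperature=0)
picks = {sample_next(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(200)}
```

With top_k=2, only token IDs 0 and 1 can ever be drawn, no matter how many samples are taken.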

Cross-entropy loss: how wrong is the model?

After softmax, the model gives us a probability for every possible next token. We know the true next token. How do we score the model?

Suppose vocab = [a, b, c, d] and the true next token is c.

            a      b      c      d
Model A   0.10   0.10   0.70   0.10
Model B   0.25   0.25   0.25   0.25
Model C   0.40   0.40   0.05   0.15

  • Model A is confident and right, i.e., low loss
  • Model B is clueless (uniform guess), i.e., medium loss
  • Model C is confident and wrong, i.e., high loss

We need a number that captures this. That number is cross-entropy.

Cross-entropy: the formula

For one prediction, with true token at index t:

loss = -log( p_t )

where p_t is the probability the model assigned to the correct token.

Plugging in our three models (true token = c):

Model A:  -log(0.70) = 0.36     # low, good
Model B:  -log(0.25) = 1.39     # medium
Model C:  -log(0.05) = 3.00     # high, bad
  • p_t = 1 (perfect) → loss = 0
  • p_t → 0 (totally wrong) → loss → ∞

The -log makes confident wrong answers hurt much more than uncertain ones. That's the signal training needs.
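
The three numbers above can be checked directly. This sketch uses the natural logarithm, and the dictionary layout is just for illustration.

```python
import math


def cross_entropy(probs, target_index):
    # Loss for one prediction: -log of the probability given to the true token.
    return -math.log(probs[target_index])


models = {
    "A": [0.10, 0.10, 0.70, 0.10],
    "B": [0.25, 0.25, 0.25, 0.25],
    "C": [0.40, 0.40, 0.05, 0.15],
}
target = 2   # index of the true token "c" in vocab [a, b, c, d]

losses = {name: cross_entropy(p, target) for name, p in models.items()}
for name, loss in losses.items():
    print(f"Model {name}: {loss:.2f}")
```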

Why we minimize it

Cross-entropy gives us a single number that says "how wrong was the model on this example?"

Training's job: find weights that make this number small across the whole dataset.

average loss = (1/N) · Σ -log(p_t)    over all N training tokens

  • Low average loss → model assigns high probability to the actual next tokens it sees in training → it has learned the patterns of the data.
  • Minimizing cross-entropy is mathematically equivalent to maximizing the likelihood of the training data under the model.

We don't minimize it by hand — gradients and the optimizer do that for us. That's the next slide.

Backpropagation: who is to blame?

We have one number: the loss. We have millions of weights. Which ones should we change, and by how much?

Backprop answers: for every weight w, compute how much the loss would change if we nudged w a tiny bit. That number is called the gradient:

∂loss / ∂w     →     "if I increase w by a little, does loss go up or down, and how fast?"

  • Gradient is positive → increasing w makes loss worse → decrease w
  • Gradient is negative → increasing w makes loss better → increase w
  • Gradient is near zero → this weight doesn't matter much for this example

Every weight gets its own gradient. Every batch of examples produces a fresh set of gradients.
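
To make "nudge w a tiny bit" concrete, the sketch below estimates a gradient by finite differences on a toy one-weight loss. Backprop computes the same quantity analytically and far more cheaply. The toy loss and the eps value are assumptions for illustration.

```python
def numerical_gradient(loss_fn, w, eps=1e-6):
    """Central-difference estimate of d(loss)/dw for a single scalar weight."""
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)


def toy_loss(w):
    return (w - 3.0) ** 2        # minimized at w = 3


g_left = numerical_gradient(toy_loss, 1.0)    # negative: increasing w helps
g_right = numerical_gradient(toy_loss, 5.0)   # positive: decreasing w helps
g_min = numerical_gradient(toy_loss, 3.0)     # near zero: nothing to gain
print(round(g_left, 3), round(g_right, 3), round(g_min, 3))
```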

How backprop computes gradients

The forward pass goes left to right:

input → layer 1 → layer 2 → ... → loss

Backprop goes right to left, using the chain rule from calculus:

loss → ∂loss/∂(layer N output) → ∂loss/∂(layer N weights)
                               → ∂loss/∂(layer N-1 output) → ∂loss/∂(layer N-1 weights)
                               → ... → all the way back to embeddings

  • Each layer "passes blame backward" to the layer before it
  • Each layer also computes how much its own weights contributed
  • Frameworks like PyTorch do this automatically (loss.backward())
  • The math is just the chain rule applied many times; the engineering is keeping track of it efficiently

You will not write backprop by hand. But knowing it's the chain rule, not magic, is what matters.

From gradients to weight updates

Gradients tell us the direction. The optimizer decides the step size and actually changes the weights:

w_new = w_old  −  learning_rate × gradient
  • learning_rate: small number (e.g. 3e-4). Too big → unstable. Too small → slow.
  • The minus sign matters: we move against the gradient (gradient points uphill, we want to go downhill).
  • AdamW (what we'll use) is a fancier optimizer that adapts the step size per weight using running averages of past gradients.
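
The plain update rule can be tried on a one-weight toy problem. This is basic gradient descent, not AdamW, and the learning rates are picked only to show the three regimes on this particular loss.

```python
def grad(w):
    # Analytic gradient of the toy loss (w - 3)^2.
    return 2.0 * (w - 3.0)


def descend(learning_rate, steps=100, w=0.0):
    for _ in range(steps):
        w = w - learning_rate * grad(w)   # step against the gradient
    return w


good = descend(0.1)         # converges close to the minimum at w = 3
too_big = descend(1.1)      # overshoots further on every step and diverges
too_small = descend(1e-4)   # barely moves from the start in 100 steps
print(good, too_big, too_small)
```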

One full training step:

1. forward pass  →  logits  →  loss
2. backward pass →  gradients for every weight
3. optimizer     →  update every weight a tiny bit
4. repeat

Training

During training, we put the full loop together:

batch → forward → logits → softmax → probabilities → cross-entropy loss
                                                              ↓
                       update every weight  ←  optimizer  ←  backprop
                                  ↓
                              repeat

One step = one batch through this loop.
One epoch = one full pass over the training data.
Training a small GPT typically runs for tens of thousands of steps.

Every weight we listed, Wq, Wk, Wv, Wo, W1, W2, γ, β, and the embedding table, gets nudged on every step.

Training vs. inference

Same brain, different loop. Both use the same forward pass, the same weights, the same math. What differs is the loop around it. Causal mask is on in both.

                   Training                               Inference
Input              Batches of full sequences              User's prompt + tokens so far
Use which logits   Every position (parallel prediction)   Only the last position
Compare to truth   Yes (cross-entropy loss)               No (there is no answer key)
Backprop           Yes                                    No
Weights            Updated every step                     Frozen
Output             A slightly better model                The next token
Loop               Over millions of batches               Over generated tokens

The causal mask is part of the architecture, not a training-only trick. Dropout, if used, is on in training and off in inference.

What gets learned, what stays fixed

Learned (updated by gradients):

  • Embedding table
  • Positional encoding (if learned)
  • All Wq, Wk, Wv, Wo per layer
  • All W1, W2, b1, b2 per layer
  • All γ, β (LayerNorm)
  • Output projection

Fixed (architecture choices):

  • vocab_size, n_embd, n_head, block_size, n_layer
  • The causal mask
  • The shape of softmax, residual connections

Parameter count: the formula

Per transformer layer, with d = n_embd and d_ff = 4d:

  • Attention (Wq, Wk, Wv, Wo): 4 × d² = 4d²
  • FFN (W1, W2 + biases): 8d² + 5d
  • LayerNorm (2 per layer, γ and β): 4d

Per layer ≈ 12d² + 9d, dominated by 12d².

Plus, once at the top:

  • Embedding table: vocab_size × d
  • Output head: vocab_size × d (often tied with embedding)
  • Positional encoding: block_size × d (if learned)

Parameter count: worked example

Given:

n_embd = 256
n_head = 4 # so d_k = 64
n_layer = 4
vocab_size = 65
block_size = 256

Per layer:

Attention:  4 × 256²        = 262,144
FFN:        8 × 256² + 5×256 = 525,568
LayerNorm:  4 × 256          =   1,024
                              ─────────
Per layer:                     788,736
4 layers:                    3,154,944

Plus embeddings + position + output head ≈ (65 + 256 + 65) × 256 ≈ 99K.

Total ≈ 3.25M parameters. A roughly 40x-smaller cousin of GPT-2 small (124M).
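
The same arithmetic as a small function, following the per-layer formula from the earlier slide. Like the slide, it leaves out the final LayerNorm (another 2 × n_embd). The function name and the tied_head flag are assumptions for illustration.

```python
def count_parameters(vocab_size, n_embd, n_layer, block_size, tied_head=False):
    d = n_embd
    attention = 4 * d * d            # Wq, Wk, Wv, Wo (no biases)
    ffn = 8 * d * d + 5 * d          # W1, W2, plus b1 (4d) and b2 (d)
    layer_norm = 4 * d               # two LayerNorms, gamma and beta each
    per_layer = attention + ffn + layer_norm
    embeddings = vocab_size * d + block_size * d
    head = 0 if tied_head else vocab_size * d
    return {
        "per_layer": per_layer,
        "layers": n_layer * per_layer,
        "other": embeddings + head,
        "total": n_layer * per_layer + embeddings + head,
    }


counts = count_parameters(vocab_size=65, n_embd=256, n_layer=4, block_size=256)
print(counts)
```

Setting tied_head=True drops the separate output head, which saves vocab_size × n_embd parameters.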

Scaling: why bigger?

Empirically, performance improves predictably as we scale three things together:

  • Parameters (depth × width)
  • Data (tokens trained on)
  • Compute (total training FLOPs, roughly proportional to parameters × tokens)

Rough mental model:

  • 3M params: learns Shakespeare's character patterns
  • 124M params (GPT-2 small): coherent paragraphs
  • 1.5B params (GPT-2 XL): plausible essays
  • 175B+ (GPT-3 and up): the modern era

We will train at the 3M end. Same recipe, smaller dial.

Recap: what we just built

text
  → tokenizer            → token IDs
  → embedding + position → X
  → for each of n_layer blocks:
      → LayerNorm
      → multi-head attention (Q, K, V → scores → softmax → ·V → Wo)
      → + residual
      → LayerNorm
      → FFN (expand → activate → compress)
      → + residual
  → final LayerNorm
  → output projection    → logits
  → softmax              → next token

Every arrow is code we will write together.

Hands-on: the plan

  • Create a file called build_and_test.py
  • This is the single file we will add code to and run
  • Make it into a uv script with all dependencies at the top
  • Though we could use GPU, for simplicity we will always use CPU here
  • Use torch
  • Write a GPT class which extends nn.Module and implements forward + generate. Also add GPTConfig.
  • It's a CLI with two commands: train or generate
  • Based on the command, take other inputs
  • train arguments:
    • --data — path to text file
    • --max-steps — number of training steps
    • Others (batch_size, n_layer, n_head, n_embd, block_size) come from GPTConfig
  • generate arguments:
    • --checkpoint — path to a .pt file
    • --prompt — starting text
    • --num-new-tokens — how many tokens to produce
    • --temperature — sampling sharpness
    • --top-k — restrict sampling to top-k tokens
    • --seed — for reproducibility
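
One possible skeleton for the command-line part of the plan, using argparse subcommands. The flag names come from the lists above. The defaults, the help strings, the requires-python value, and the file names in the demo call are placeholders, not the workshop's actual choices. The comment header is inline script metadata of the kind uv reads.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = ["torch"]
# ///
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="build_and_test.py")
    sub = parser.add_subparsers(dest="command", required=True)

    train = sub.add_parser("train", help="train a model on a text file")
    train.add_argument("--data", required=True, help="path to text file")
    train.add_argument("--max-steps", type=int, default=5000)   # placeholder default

    gen = sub.add_parser("generate", help="generate text from a checkpoint")
    gen.add_argument("--checkpoint", required=True, help="path to a .pt file")
    gen.add_argument("--prompt", default="", help="starting text")
    gen.add_argument("--num-new-tokens", type=int, default=200)
    gen.add_argument("--temperature", type=float, default=1.0)
    gen.add_argument("--top-k", type=int, default=None)
    gen.add_argument("--seed", type=int, default=None)
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["generate", "--checkpoint", "model.pt", "--prompt", "To be", "--top-k", "20"]
)
print(args.command, args.prompt, args.top_k)
```

In the real script, parse_args() would be called with no arguments so it reads sys.argv, and the chosen command would dispatch to the training or generation code.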

Hands-on: Walkthrough

① GPTConfig            - slide "The whole model in 5 numbers"
② CharTokenizer        - slides "Tokens" / "Tokenizer"
③ GPT.__init__/forward - slide "Recap: what we just built" (the big pipeline)
④ CausalSelfAttention  - slides "Single-head attention" to "Combining heads with Wo"
⑤ FeedForward          - slide "Feed-Forward Network (FFN)"
⑥ TransformerBlock     - slide "One full transformer block"
⑦ get_batch            - (where x/y "next-token" pairs come from)
⑧ train()              - slides "Cross-entropy" to "Training"
⑨ GPT.generate()       - slide "Generation: from logits to text"
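
Step ⑦ has no slide of its own, so here is a hedged sketch of where the next-token pairs could come from: y is the same window as x, moved one position ahead. The signature and the use of plain lists are assumptions. The workshop version would likely return torch tensors.

```python
import random


def get_batch(data, block_size, batch_size, rng=random):
    """Sample (x, y) pairs where y[i] is the token that follows x[i]."""
    xs, ys = [], []
    for _ in range(batch_size):
        # Leave room for the one extra token that y needs at the end.
        start = rng.randrange(len(data) - block_size)
        xs.append(data[start:start + block_size])
        ys.append(data[start + 1:start + block_size + 1])
    return xs, ys


data = list(range(20))   # stand-in for a list of encoded token IDs
xs, ys = get_batch(data, block_size=4, batch_size=3, rng=random.Random(0))
for x, y in zip(xs, ys):
    print(x, "->", y)
```

Every position in x contributes one training example, which is why training uses the logits at every position rather than only the last one.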

Hands-on: Train

Hands-on: Generate

References & Credits

Thank you!

Any questions?

Thejesh GN

Speaker notes:

Input tokens -> embedding + positional encoding -> N transformer blocks -> output logits -> next token

Speaker notes:

  • Stress: cross-entropy is just "how surprised is the model by the right answer."
  • Low surprise = low loss. The model isn't shocked by what came next.
  • -log(1) = 0 and -log(0) = ∞ are worth pausing on.

Speaker notes: If asked: AdamW = Adam + a corrected (decoupled) version of weight decay. Adam came out in 2014, AdamW in 2017. Used by GPT-2, GPT-3, LLaMA, and most modern LLMs.

Analogy: training is a student doing practice problems with an answer key, adjusting after each. Inference is the same student taking the exam: same knowledge, no peeking, no learning during.
