Speaker notes:
Input tokens -> embedding + positional encoding -> N transformer blocks -> output logits -> next token
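If a concrete demo of the last arrow helps, here is a minimal sketch of "output logits -> next token" using greedy decoding. The four-word vocabulary and the logit values are made up for illustration, and greedy argmax is only one decoding choice (sampling is another).

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-token vocabulary and made-up logits for one position.
vocab = ["the", "cat", "sat", "mat"]
logits = [1.0, 3.0, 0.5, 2.0]

probs = softmax(logits)
# Greedy decoding: pick the index with the highest probability.
next_id = max(range(len(probs)), key=probs.__getitem__)
next_token = vocab[next_id]
print(next_token)
```

The embedding, positional encoding, and transformer blocks are what produce the logits; this sketch starts from the point where they already exist.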
Speaker notes:
- Stress: cross-entropy is just "how surprised is the model by the right answer."
- Low surprise = low loss. The model isn't shocked by what came next.
- -log(1) = 0 and -log(0) = ∞ are worth pausing on.
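To make the "surprise" framing concrete, here is a small sketch of the per-token loss as a function of the probability the model gave to the correct token. The function name `cross_entropy` and the sample probabilities are illustrative choices, not taken from any library.

```python
import math

def cross_entropy(p_correct):
    """Loss for one prediction: -log of the probability given to the right token."""
    if p_correct == 0.0:
        return math.inf  # -log(0) diverges: the model ruled out the right answer
    # log(1/p) equals -log(p); written this way so p = 1 gives 0.0, not -0.0.
    return math.log(1.0 / p_correct)

# Confident and right -> tiny loss; confident and wrong -> huge loss.
for p in (1.0, 0.9, 0.5, 0.01, 0.0):
    print(f"p(correct) = {p:<4} -> loss = {cross_entropy(p):.3f}")
```

The loss only looks at the probability assigned to the correct token, so lowering it means moving probability mass toward the right answer.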
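Speaker notes: If asked: AdamW is Adam with decoupled weight decay, meaning the decay shrinks the weights directly instead of being added to the gradient. The Adam paper appeared in 2014 and the AdamW paper in 2017. AdamW or a close variant is the standard optimizer for GPT-style models, LLaMA, and most modern LLMs.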
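If someone wants to see what the weight decay in AdamW looks like in practice, here is a rough sketch of one update for a single scalar weight. The function name `adamw_step`, the hyperparameter defaults, and the toy quadratic are assumptions for illustration; real training uses a library optimizer over tensors.

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar weight. Returns (w, m, v)."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * weight_decay * w              # decoupled decay: shrink w directly
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # the usual Adam step
    return w, m, v

# Toy problem: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (w - 3)
    w, m, v = adamw_step(w, grad, m, v, t, lr=0.01)
print(w)  # should end near 3; the decay term pulls it slightly toward 0
```

The point to highlight is the decay line: it never touches `m` or `v`, which is the difference from adding an L2 penalty to the gradient in plain Adam.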
Analogy: training is a student doing practice problems with an answer key, adjusting after each one. Inference is the same student taking the exam: same knowledge, no peeking at the answer key, and no learning during the exam (the weights stay frozen).
Speaker notes: