A Humble Disclaimer

I'm not a machine learning researcher, and I'm definitely not a Rust core contributor. I spend my weekdays writing TypeScript for React frontends and Go for gRPC microservices. My formal ML experience is mostly limited to linear regression, logistic regression, decision trees, and maybe a naive Bayes classifier—the kind of stuff you'd learn in a 10-week bootcamp or an intro university course.

So when I say I "built GPT-2 from scratch," I don't mean I invented anything new. What I actually did was implement the GPT-2 architecture from OpenAI's 2019 paper, using Rust on my Ubuntu desktop and, after a bit of trial and error, Candle—all fueled by curiosity and weekend tinkering.

This isn't a tutorial for experts. Think of it as a field report from the trenches: my failed first attempt with tch-rs on Linux, why I switched to Candle, and the lessons I picked up along the way.

Why This Project? The Spark Behind the Code

A few weeks ago, I watched Andrej Karpathy's legendary video, "Let's build GPT". For 90 minutes, he coded a working GPT in Python, explaining every line with calm clarity. At the end, his model generated Shakespearean prose from scratch.

I was mesmerized but also frustrated. Because while I understood the concepts - embeddings, attention, feed-forward networks - I had never felt them. In my day job, I use APIs and get results. It's powerful, but it's passive.

I wanted to build, not just consume.

And since I've been slowly learning Rust in my spare time - not because I need it, but because I love its rigor - I asked a dangerous question:

"What if I tried to build GPT-2… in Rust?"

Not for performance. Not for production. Just to see if I could.

The Tools: My Journey from tch-rs to Candle (on Ubuntu)

When I first started, I reached for what seemed like the obvious choice: tch-rs, the official Rust bindings to PyTorch.

After all, PyTorch powers most modern ML research. If I used tch-rs, I'd get:

- Battle-tested tensor ops
- GPU support (in theory)
- Autograd that "just works"
- Familiar APIs if I ever read PyTorch code

So I dove in—on my Ubuntu 22.04 computer, with an RTX 5060 Ti and a Ryzen 5.

The tch-rs Phase: Fighting CUDA, Not Models

I started with tch-rs, but ran into a wall:

- It ships with a precompiled libtorch tied to a specific PyTorch/CUDA combo.
- My system (CUDA 12.8) didn't match the expected version.
- The "fix"? Setting obscure env vars like LIBTORCH_BYPASS_VERSION_CHECK=1 and living with deprecation warnings. (By the way, this didn't fix the problem.)

So instead of training GPT-2, I was debugging library mismatches and praying that my build would finally run.

The Switch to Candle: Pure Rust, Zero Friction

Then I tried Candle, Hugging Face's pure-Rust ML framework. The difference was night and day:

- No libtorch, no CUDA mismatch: just cargo add candle-core candle-nn.
- Readable Rust source: I could actually inspect matmul or softmax in minutes.
- Learning-first: enough to build GPT-2 without PyTorch baggage.
- Cross-platform: Ubuntu, macOS, Windows—no setup headaches.

With tch-rs, I was debugging linker errors. With Candle, I debugged attention masks—the real problem I wanted to solve.
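To give a feel for how little setup that is, here is a minimal, self-contained sketch (not code from my project, just an illustration) that builds two random tensors, multiplies them, and runs a softmax on the CPU backend:

use candle_core::{Device, Result, Tensor, D};
use candle_nn::ops::softmax;

fn main() -> Result<()> {
    // CPU device; a CUDA build would use Device::new_cuda(0) instead.
    let device = Device::Cpu;

    // Two small random matrices: [2, 3] x [3, 4] -> [2, 4].
    let a = Tensor::randn(0f32, 1f32, (2, 3), &device)?;
    let b = Tensor::randn(0f32, 1f32, (3, 4), &device)?;
    let scores = a.matmul(&b)?;

    // Softmax over the last dimension, the same op the attention layer uses.
    let probs = softmax(&scores, D::Minus1)?;
    println!("{probs}");
    Ok(())
}

That is the entire setup: two crates in Cargo.toml and a cargo run.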
A compact blueprint: GPT-2 as a stack of clear responsibilities

GPT-2 is an orchestration of these parts:

- Tokenization → discrete IDs
- Token + Positional Embeddings → geometry
- LayerNorm → stabilize activations
- Masked Multi-Head Self-Attention → weighted memory
- Residual connections → gradient highways
- Position-wise Feed-Forward → per-token processing
- Dropout → regularization
- Output head / projection → logits over vocabulary

Below I show why each part is there and include the relevant snippets from my gpt.rs to make the mapping explicit.

Tokenization - the first translation

Why tokenization exists: Language is continuous; computers need discrete symbols. Tokenization breaks text into subword units that balance vocabulary size, coverage, and generalization.

Example usage:

let tokenizer = get_bpe_from_model("gpt2")?;
let encoded = tokenizer.encode_with_special_tokens("Hello, I am");
// -> [15496, 11, 314, 257]

These IDs are the indices the embedding layer maps into vectors.

Embeddings — where symbols become vectors

Why it exists: Token IDs are discrete points; embeddings map them into a dense, continuous vector space where relationships become geometric (similar words → nearby vectors). GPT-2 uses both token embeddings and learned positional embeddings, which are added to the token embeddings.

Key excerpt (from the gpt.rs forward pass):

let token_emb = self.token_emb.forward(xs)?; // [batch, seq, emb_dim]
let pos_ids = Tensor::arange(0, seq_len as u32, xs.device())?;
let pos_embed = self.pos_emb.embeddings().index_select(&pos_ids, 0)?;
let mut x = token_emb.broadcast_add(&pos_embed)?;
x = self.dropout.forward(&x, train)?;

Why positional embeddings? Transformers are permutation-invariant; positional embeddings inject order information so the model knows where a token sits in the sequence.
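To convince myself that broadcast_add really lines up a [seq, emb_dim] positional table against a [batch, seq, emb_dim] batch of token embeddings, I found it helpful to check the shapes on dummy tensors. A standalone sketch (not from gpt.rs):

use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::Cpu;
    let (batch, seq_len, emb_dim) = (2usize, 4usize, 8usize);

    // Stand-ins for the learned tables: one token vector per position in the
    // batch, and one positional vector per sequence position.
    let token_emb = Tensor::randn(0f32, 1f32, (batch, seq_len, emb_dim), &device)?;
    let pos_emb = Tensor::randn(0f32, 1f32, (seq_len, emb_dim), &device)?;

    // broadcast_add repeats the positional table across the batch dimension.
    let x = token_emb.broadcast_add(&pos_emb)?;
    assert_eq!(x.dims(), &[batch, seq_len, emb_dim]);
    println!("combined shape: {:?}", x.dims());
    Ok(())
}

If the positional table is sliced to the wrong length, this is exactly the call that fails, so the thirty-second shape check pays for itself.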
Layer Normalization - the silent stabilizer

Why it exists: LayerNorm keeps each token's activations well-scaled and centered across the embedding dimension so deep stacks don't blow up or collapse. It's applied per token (not per batch), which suits variable sequence lengths.

Implementation (from gpt.rs):

impl Module for LayerNorm {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        let mean = xs.mean_keepdim(D::Minus1)?;
        let variance = xs.var_keepdim(D::Minus1)?;

        // prevent division by zero
        let eps = Tensor::new(&[self.eps], xs.device())?;

        // (x - mean) / sqrt(var + epsilon)
        let dividend = xs.broadcast_sub(&mean)?;
        let divisor = variance.broadcast_add(&eps)?;
        let divisor_sqrt = divisor.sqrt()?;
        let normed = dividend.broadcast_div(&divisor_sqrt)?;

        let scaled_shifted = normed
            .broadcast_mul(&self.scale)?
            .broadcast_add(&self.shift)?;
        Ok(scaled_shifted)
    }
}

Practical note: Without LayerNorm the model's loss quickly explodes. It's conceptually similar to feature standardization in classical ML, but applied repeatedly inside the network to internal representations.

Masked Multi-Head Self-Attention — the engine of context

Why masked attention exists: To predict the next word, the model must attend only to what came before. Masked attention enforces causality, so the model never cheats by peeking at the future.

Key steps (and why):

1. Project the input into queries (Q), keys (K), and values (V).
2. Split into num_heads and compute scaled dot-products: Q · Kᵀ / √d_k. Scaling prevents softmax saturation.
3. Apply a causal mask (set future positions to −∞) so attention can't peek forward.
4. Softmax → attention weights → weighted sum with V.
5. Reassemble the heads and apply a linear projection.

Implementation (abridged from gpt.rs):

pub fn forward(&self, xs: &Tensor, train: bool) -> Result<Tensor> {
    let (b, num_tokens, _d_in) = xs.dims3()?;

    let query = self.w_query.forward_t(xs, train)?;
    let key = self.w_key.forward_t(xs, train)?;
    let values = self.w_value.forward_t(xs, train)?;

    // Split matrices into multiple heads
    let query = query
        .reshape((b, num_tokens, self.num_heads, self.head_dim))?
        .transpose(1, 2)?
        .contiguous()?;
    let key = key
        .reshape((b, num_tokens, self.num_heads, self.head_dim))?
        .transpose(1, 2)?
        .contiguous()?;
    // (values reshape is similar)
    let values = values
        .reshape((b, num_tokens, self.num_heads, self.head_dim))?
        .transpose(1, 2)?
        .contiguous()?;

    let key_t = &key.transpose(D::Minus2, D::Minus1)?;
    let attention_scores = query.matmul(key_t)?;

    let mask = get_mask(num_tokens, xs.device())?;
    let masked_tensor = masked_fill(
        &attention_scores,
        &mask.broadcast_left((b, self.num_heads)).unwrap(),
        f32::NEG_INFINITY,
    )?;

    let mut attention_weights = softmax(&(masked_tensor * self.scaling)?, D::Minus1)?;
    attention_weights = self.dropout.forward(&attention_weights, train)?;

    let context_vec = attention_weights
        .matmul(&values)?
        .transpose(1, 2)?
        .reshape((b, num_tokens, self.dim_out))?
        .contiguous()?;

    self.out_proj.forward_t(&context_vec, train)
}
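The snippet leans on two small helpers, get_mask and masked_fill, that aren't shown above. Here is a rough sketch of how they could be written in Candle (my guess at their shape, not a copy of the repo's versions): get_mask marks strictly-future positions with 1, and masked_fill uses where_cond to overwrite those positions with a fill value such as negative infinity.

use candle_core::{Device, Result, Tensor};

// Hypothetical helper: a [size, size] matrix with 1s strictly above the
// diagonal, i.e. the "future" positions each token must not attend to.
fn get_mask(size: usize, device: &Device) -> Result<Tensor> {
    let mask: Vec<u8> = (0..size)
        .flat_map(|i| (0..size).map(move |j| u8::from(j > i)))
        .collect();
    Tensor::from_slice(&mask, (size, size), device)
}

// Hypothetical helper: wherever `mask` is 1, replace the value in `on_false`
// with `on_true` (e.g. f32::NEG_INFINITY); elsewhere keep the original score.
fn masked_fill(on_false: &Tensor, mask: &Tensor, on_true: f32) -> Result<Tensor> {
    let fill = Tensor::new(on_true, on_false.device())?.broadcast_as(mask.shape().dims())?;
    mask.where_cond(&fill, on_false)
}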
Gotcha story: when the mask ate my tokens

I once wired up the causal mask and thought, "Okay, that's the easy part." Slap on a triangular mask, stop the model from cheating, move on.

Except during training, something weird happened. Every next-token prediction came back as nothing. Empty strings. Just zeroes. It was like the model had been lobotomized overnight.

I dug through the code thinking maybe my embeddings were broken, maybe gradients were exploding. Nope. The real culprit? The mask. I had accidentally broadcast it across all positions instead of just the "future" ones. Which meant the model wasn't just blind to tomorrow's tokens—it was blind to everything. Every attention score got nuked to −∞, the softmax collapsed into uniform noise, and poof: the model literally had no past to attend to.

Lesson learned: when you implement masked attention, don't just trust the math, print your mask. For a sequence of 5 tokens, you should see a neat little lower-triangular staircase. If you see a solid wall instead, congratulations: you've just silenced your entire model.
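For the record, here is the kind of throwaway check I mean: build the 5-token mask by hand and print which positions each token is allowed to see (1 = visible, 0 = masked). This is an illustrative sketch, not code from the repo:

use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let size = 5;
    // 1 where attention is allowed (current and past tokens), 0 for the future.
    let allowed: Vec<u8> = (0..size)
        .flat_map(|i| (0..size).map(move |j| u8::from(j <= i)))
        .collect();
    let mask = Tensor::from_slice(&allowed, (size, size), &Device::Cpu)?;

    // Expect a lower-triangular "staircase"; a solid block of 0s or 1s means
    // the mask is being applied to every position, not just future ones.
    for row in mask.to_vec2::<u8>()? {
        println!("{row:?}");
    }
    Ok(())
}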
Feed-Forward Network (position-wise) - per-token processing

Why it exists: Attention mixes information across tokens; the feed-forward network processes each token independently with a non-linear transformation. In GPT-2 it's typically a 4× expansion (e.g., 768 → 3072 → 768) with GeLU activation.

Implementation (from gpt.rs):

impl Module for FeedForward {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        let xs = self.top.forward(xs)?;
        let xs = GELU.forward(&xs)?;
        let xs = self.bottom.forward(&xs)?;
        Ok(xs)
    }
}

💡 Why GeLU? It's smoother than ReLU, which helps with gradient flow in deep networks. The tanh approximation avoids the expensive error function—perfect for Rust's no_std-friendly ethos.

💡 Insight: This layer is where much of the model's computation happens. Attention decides what to attend to; the feed-forward network decides what to do with it.
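For reference, the tanh approximation mentioned above looks like this as a plain scalar function. This is a sketch of the standard formula, not the GELU struct from gpt.rs:

// GeLU, tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
fn gelu_approx(x: f32) -> f32 {
    let c = (2.0_f32 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}

fn main() {
    for x in [-2.0_f32, -1.0, 0.0, 1.0, 2.0] {
        println!("gelu({x}) = {:.4}", gelu_approx(x));
    }
}

As far as I can tell, Candle also exposes a gelu op directly on Tensor, so a hand-rolled version is mostly a learning exercise.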
Residual connections — the gradient highway

Why it exists: Deep stacks can suffer from vanishing gradients. Residuals (x + f(x)) create shortcut paths that allow gradients to flow directly from later layers to earlier ones, enabling much deeper models to train reliably.

Implementation (transformer block forward):

impl ModuleT for TransformerBlock {
    fn forward_t(&self, xs: &Tensor, train: bool) -> Result<Tensor> {
        let shortcut = xs.clone();
        let mut x = xs.to_owned();
        x = self.layer_norm1.forward(&x)?;
        x = self.mha.forward(&x, train)?;
        x = self.dropout.forward(&x, train)?;
        x = (x + shortcut)?;

        let shortcut = x.clone();
        x = self.layer_norm2.forward(&x)?;
        x = self.ff.forward(&x)?;
        x = self.dropout.forward(&x, train)?;
        x = (x + shortcut)?;
        Ok(x)
    }
}

Design note: Residuals pair with LayerNorm and dropout. The order (norm → block → dropout → add shortcut) matters, and it's implemented here to match GPT-2's pre-norm transformer layout.

💡 Rust nuance: The + operator in Candle is overloaded for tensors, but it requires identical shapes. One off-by-one in sequence length, and your shortcut becomes a crash site.

Output head & generation — closing the loop

Why it exists: After the final normalization, the model must map hidden vectors to logits over the vocabulary. Many implementations tie the output weights to the token embeddings; my implementation leaves the projection as a separate Linear for flexibility.

GPT forward (embeddings → transformer → final norm → projection):

impl ModuleT for GPTModel {
    fn forward_t(&self, xs: &Tensor, train: bool) -> Result<Tensor> {
        let (_batch_size, seq_len) = xs.dims2()?;
        let token_emb = self.token_emb.forward(xs)?;
        let pos_ids = Tensor::arange(0, seq_len as u32, xs.device())?;
        let pos_embed = self.pos_emb.embeddings().index_select(&pos_ids, 0)?;
        let mut x = token_emb.broadcast_add(&pos_embed)?;
        x = self.dropout.forward(&x, train)?;
        x = self.transformer_blocks.forward_t(&x, train)?;
        x = self.final_layer_norm.forward(&x)?;
        let result = self.linear_output_layer.forward(&x)?;
        Ok(result)
    }
}

Text Generation

The example sticks to greedy decoding (taking the argmax at each step). For richer and more varied text, techniques like temperature-based sampling, top-k, or nucleus (top-p) sampling are usually preferred in real applications. I left those out here since this was meant to stay a small weekend project, and diving into decoding strategies felt like it would stretch the scope too far. (A rough sketch of what top-k sampling could look like follows the greedy version below.)

pub fn generate_text_simple<M: GPT + ModuleT>(
    model: &M,
    mut tokens: Tensor,    // the "story so far" (starting sequence of token IDs)
    max_new_tokens: usize, // how many more words we want to add
    context_size: usize,   // how much memory the model has (how many words it can "see")
) -> Result<Tensor> {
    // We will add words one by one until we reach max_new_tokens
    for _ in 0..max_new_tokens {
        // Find out how many words we already have
        let (_batch, seq_len) = tokens.dims2()?;

        // The model can only remember `context_size` words at a time,
        // so if our story is longer, we cut it to just the recent part.
        let start = seq_len.saturating_sub(context_size);
        let context = tokens.i((.., start..seq_len))?;

        // Ask the model: "What do you think comes next after this context?"
        // The model replies with scores (logits) for every word in its dictionary.
        let logits = model.forward_t(&context, false)?;

        // We only care about the scores for the very last word position,
        // because that's where the model guesses the "next" word.
        let (_batch, seq_len, _vocab) = logits.dims3()?;
        let last_logits = logits.i((.., seq_len - 1, ..))?;

        // Turn those scores into probabilities (like dice odds).
        // Example: word A = 80% chance, word B = 15%, word C = 5%.
        let probs = softmax(&last_logits, 1)?;

        // Pick the most likely word (greedy = always choose the biggest number).
        // This makes the story very predictable, like always choosing
        // vanilla ice cream instead of trying new flavors.
        let next_token = probs.argmax_keepdim(D::Minus1)?;

        // Stick that new word onto the end of our story.
        tokens = Tensor::cat(&[&tokens, &next_token], D::Minus1)?;
    }
    // When we're done, return the whole story (original + new words).
    Ok(tokens)
}
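Since I mentioned top-k sampling, here is roughly what swapping out the greedy argmax could look like. It's only a sketch under the same assumptions as the function above (a logits tensor for the last position, single sequence), it pulls the logits onto the CPU rather than using any particular Candle sampling API, and it leans on the rand crate for the actual draw:

use candle_core::{Result, Tensor};
use rand::Rng;

// Hypothetical replacement for the greedy argmax step: keep only the k highest
// logits, renormalize them with a softmax, and sample one token id from that
// truncated distribution.
fn sample_top_k(last_logits: &Tensor, k: usize) -> Result<u32> {
    // Assume a single sequence: flatten [1, vocab] -> Vec<f32>.
    let logits: Vec<f32> = last_logits.flatten_all()?.to_vec1()?;

    // Indices of the k largest logits.
    let mut indexed: Vec<(usize, f32)> = logits.into_iter().enumerate().collect();
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(k);

    // Softmax over just the kept logits (subtract the max for stability).
    let max = indexed[0].1;
    let weights: Vec<f32> = indexed.iter().map(|(_, l)| (l - max).exp()).collect();
    let total: f32 = weights.iter().sum();

    // Draw from the truncated distribution.
    let mut pick = rand::thread_rng().gen_range(0.0..total);
    for ((token_id, _), w) in indexed.iter().zip(&weights) {
        if pick <= *w {
            return Ok(*token_id as u32);
        }
        pick -= w;
    }
    Ok(indexed[0].0 as u32) // fallback for floating-point edge cases
}

The returned id would then be wrapped in a 1×1 tensor, the same way next_token is handled above, before being concatenated onto tokens.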
Training

Once the model was built, I moved on to training — and honestly, this was where things started to feel tough. Not because the code was complicated (most of it was just setting up batches, feeding them through the model, and applying AdamW), but because training is a totally different beast compared to writing code.

And here's me being fully honest: I didn't hand-craft the training loop myself. I GPT-ed most of it. It was faster, and honestly, I wasn't here to reinvent the wheel. I just wanted something that worked so I could see my model actually train. The point of this project (for me) wasn't to become a production-ready model; it was to understand GPT-2 better.

Video: https://marbelona.com/static/images/gpt2-from-scratch-rust-transformers/gpt2train.mp4?embedable=true

🔑 What the Training Loop Does (in plain terms)

Here's the heart of the training code, stripped down to the essentials:

for (input_batch, target_batch) in prebaked_batches.iter() {
    // 1. Forward pass — get model predictions (logits)
    let logits = model.forward_t(input_batch, true)?;

    // 2. Flatten predictions and targets for cross-entropy
    let logits_flat = logits.reshape((batch_size * seq_len, vocab_size))?;
    let targets_flat = target_batch.reshape((batch_size * seq_len,))?;

    // 3. Compute loss
    let loss = cross_entropy(&logits_flat, &targets_flat)?;

    // 4. Backpropagation + optimizer step
    optimizer.backward_step(&loss)?;
}

- Forward pass: run input tokens through the model.
- Reshape: make tensors fit the loss function.
- Loss: measure how wrong the predictions were.
- Backprop + optimizer: adjust weights to improve next time.
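For completeness, the pieces the loop takes for granted (cross_entropy and the AdamW optimizer) come from candle_nn. Here is a minimal sketch of how they might be wired up; the model constructor, config, and learning rate are placeholders, not the repo's actual names:

use candle_core::{DType, Device, Result};
use candle_nn::{AdamW, ParamsAdamW, VarBuilder, VarMap};

fn main() -> Result<()> {
    let device = Device::Cpu;

    // A VarMap collects every trainable parameter created through its VarBuilder,
    // so the optimizer can later see them all via varmap.all_vars().
    let varmap = VarMap::new();
    let _vb = VarBuilder::from_varmap(&varmap, DType::F32, &device);
    // Placeholder: the model would be built from this VarBuilder, e.g.
    // let model = GPTModel::new(&config, _vb)?;

    // AdamW over all registered variables; 3e-4 is just a common starting point.
    let params = ParamsAdamW { lr: 3e-4, ..Default::default() };
    let mut _optimizer = AdamW::new(varmap.all_vars(), params)?;

    // Each step of the loop above then does:
    //   let loss = candle_nn::loss::cross_entropy(&logits_flat, &targets_flat)?;
    //   _optimizer.backward_step(&loss)?; // needs the candle_nn::Optimizer trait in scope
    Ok(())
}

backward_step computes gradients for every variable the optimizer was given and applies the AdamW update in one call, which is why the loop itself stays so short.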
That's training in a nutshell. GPT helped me scaffold this loop, but I still had to wrestle with the details (tensor shapes, batching, optimizer parameters).

I didn't spend weeks fine-tuning or pushing it to convergence — this was meant to be a weekend project after all. So the training I ran was short, the loss didn't drop much, and the generations that came out of it were basically trash. But that was expected. I wasn't aiming for SOTA performance, just the satisfaction of seeing the whole pipeline run end-to-end.

Here's a real example of what my "trained" model produced after just a little bit of training:

What This Taught Me - Beyond the Code

1. Tooling matters.
Pure-Rust Candle removed the build friction on Ubuntu and let me focus on model bugs (masks, shapes, scaling).

2. Deep learning builds on classical ML.
- Overfitting → Dropout
- Feature scaling → LayerNorm
- Sequential prediction → Autoregressive generation
- Ensemble refinement → Residual connections
The math is more complex, but the principles are the same.

3. Rust rewards patience.
- The borrow checker forced me to think deeply about data ownership.
- Result types made errors explicit - no silent NaNs.
- cargo clippy caught performance anti-patterns early.
It's frustrating at first—but once you "get" it, you write safer, clearer code.

4. Understanding comes from doing.
I'd read about attention a dozen times. But only when I forgot the causal mask and saw the model break did I truly understand its purpose.

5. You don't need to be an expert to learn deeply.
Curiosity, time, and a willingness to fail are enough—even on a standard Ubuntu setup with mismatched CUDA.

Final Reflections: Why This Matters

I'll go back to writing TypeScript tomorrow. I'll debug Go race conditions and argue about React state management. And that's okay; those are the skills that pay the bills.
But now, when someone says "GPT-2," I don't see a black box. I see:

- A tokenizer slicing "Hello" into 15496
- Embeddings placing it in 768D space
- 12 attention heads whispering about context
- LayerNorm keeping activations sane
- Residual connections carrying gradients home
- GeLU curves adding non-linearity
- All of it compiled, safe, and running in Rust on Ubuntu, with no CUDA drama

And that's why I did this. Not to build the next AI, but to understand the one we already have.

The next step will probably just be learning how to load the official pretrained weights instead of training the model myself on a consumer-grade machine, 'cause obviously that's kinda dumb to do. I also want to learn how fine-tuning works, adapter tuning to be exact, so that's on the list.

Resources

LLM From Scratch GitHub Repo: this served as my main code reference for dissecting how to implement a GPT model from the ground up.
https://github.com/rasbt/LLMs-from-scratch

Andrej Karpathy - GPT2
https://www.youtube.com/watch?v=l8pRSuU81PU&pp=ygUUYW5kcmVqIGthcnBhdGh5IGdwdDI%3D

Repository for this GPT-2 clone
https://github.com/montecassino/gpt2-rs