A Friendly Guide to the Brain-like Dragon Hatchling (BDH)

Modern neural networks can recognize faces, write stories, and even pass programming interviews — but they all share the same limitation: they stop learning once deployed.

A few weeks ago, a group of engineers and researchers — Adrian Kosowski, Przemysław Uznanski, Jan Chorowski, Zuzanna Stamirowska, and Michał Bartoszkiewicz — published a fascinating paper introducing a new idea in the field of machine learning and neural architectures. In simple terms, they proposed a new type of artificial neural network.

https://arxiv.org/abs/2509.26507

The paper itself is quite dense — filled with math, formulas, and graphs — but full of bold ideas. I wanted to unpack it in a way that's easier to digest: a popular-science overview, with a few metaphors and simplifications of my own.

Imagine a young dragon hatchling that has just broken out of its shell. It already knows how to fly and breathe fire — but it doesn't yet know how to react to the world around it. It doesn't learn from books, but from experience — right in the middle of flight — memorizing which actions helped and which didn't.

That's the essence of BDH — the Brain-like Dragon Hatchling: a new neural architecture that combines classic pretraining (like in standard networks) with instant, self-directed learning during inference.

A neural network is a system of neurons connected by "weights" that adjust through gradient descent, gradually reducing error — much like a student improving after each test by reviewing mistakes. However, once the test is over, the student no longer learns — the learning happened earlier, before the test. That's how today's models like GPT work: they learn inside the egg — and then stop.

What makes the Dragon Hatchling different?

The BDH is designed a bit smarter. It has two kinds of memory:

- Permanent memory, like any normal neural network — this is what it learned before hatching.
- Temporary memory, resembling instincts or short-term connections between thoughts.

When BDH processes information, it creates new connections on the fly. If two neurons activate together — the connection between them strengthens. This is known as the Hebbian learning rule:

"Neurons that fire together, wire together."

These connections are stored in a separate matrix σ, which acts as a temporary map of what has recently happened. If a similar situation occurs later, BDH recalls: "Ah, I've seen this before — and here's what worked."

What changes with BDH?

BDH transforms the learning process itself. It learns while it works, even without running backpropagation. It can adapt to new information on the go, without retraining or heavy GPU computations. In other words — BDH is a network that learns to live, not just to repeat.
Learning to Stand, Fly, and Breathe Fire

Every living creature has its own learning stages. A dragon hatchling first learns to stand, then to flap its wings, and eventually to breathe fire. The BDH model follows a similar path — each stage of its "life" brings a different kind of learning.

Stage 1: Standing (Classic Pretraining)

This is where BDH learns, like any traditional neural network. It's trained on data, adjusts weights via gradient descent, and minimizes loss — the familiar supervised learning phase. Think of it as the dragon strengthening its legs before taking the first flight.

At this stage, the model is trained offline on a large dataset — text corpora, translations, and other examples. It uses standard backpropagation, an optimizer like AdamW, and a loss function that predicts the next token. During this process, BDH develops its permanent weights, referred to as "G" in the paper (the fixed ruleset). These correspond to what, in a transformer, would be parameters like Wq, Wk, Wv, W1, W2, and so on.

Stage 2: Flying (Online Adaptation)

Once training ends, most networks stop changing. But BDH keeps learning in real time. It has a Hebbian memory — a fast-acting connection map that updates itself during inference. If certain neurons activate together, their connection grows stronger; if not, it weakens. This is how BDH adapts to new situations mid-flight, without retraining.

During inference — when BDH reads or generates text — it updates its temporary internal states, denoted as σ(i, j), or "synaptic weights." This process isn't gradient descent. Instead, it follows a local learning rule: if neuron i and neuron j fire together, strengthen their connection σ(i, j). This simple rule implements Hebbian learning — often summarized as "neurons that fire together, wire together."

These updates are short-lived: they exist only while a dialogue or reasoning session is active. Once σ is reset, the model returns to its original "hatched" knowledge — the way it was trained before flight.

Stage 3: Breathing Fire (Self-regulation)

BDH doesn't just strengthen all connections — it keeps them balanced. The model uses sparsity thresholds and normalization to prevent runaway feedback loops. It learns to "breathe fire" carefully — powerful, but controlled. Too much activation would lead to instability; too little would make it unresponsive. The balance between those extremes is what gives BDH its "life."

The paper briefly mentions an intriguing idea: if the Hebbian updates (σ) are preserved and averaged over time, BDH could develop something resembling long-term memory — a mechanism akin to slowly updating its core weights. However, the authors haven't yet formalized the exact algorithm for this process. They suggest that:

- Fast memory (σ) operates on short timescales — minutes or a few hundred tokens.
- Slow memory (G) evolves over much longer periods — days or across model updates.

This opens the door to lifelong learning — systems that can continuously acquire new knowledge without erasing what they already know. Unlike classic transformers, which suffer from catastrophic forgetting, BDH hints at a future where models can remember their past while growing into the future.
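To make the two timescales concrete, here is a schematic of the two update rules in my own simplified notation (not the paper's exact formulas). The slow weights G move only during pretraining, by gradient descent on the loss L; the fast memory σ moves at inference time, by a local Hebbian rule with forgetting:

G ← G − λ·∇G L                 (slow memory: offline, gradient descent)
σᵢⱼ ← (1 − u)·σᵢⱼ + η·xᵢ·yⱼ    (fast memory: online, Hebbian with forgetting rate u)

Here xᵢ and yⱼ are the activations of the two neurons, η is a small Hebbian learning rate, and λ is the optimizer's step size. We will see exactly this pattern again in the code below.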
Why I Believe BDH Is an Evolution, Not Just Another Model

The paper "The Brain-like Dragon Hatchling (BDH)" isn't just theoretical — it points toward a new direction in AI architecture that offers real, measurable advantages.

Transparent and Interpretable AI

One of the biggest pain points in modern LLMs is opacity — we rarely know why a model made a particular decision. BDH changes that: its "synapses" directly correspond to conceptual relationships. You can literally see which connections strengthen as the model "thinks" about a given idea. Its activations are sparse and positive (just like in the brain), making it possible to debug and even audit reasoning processes.

➡️ This opens the door for explainable AI in critical domains — medicine, finance, law — where understanding why a model reached its conclusion is as important as the conclusion itself.

On-the-Fly Learning (Inference-Time Learning)

BDH applies Hebbian learning even during inference — meaning the connections between neurons can evolve without retraining. It adapts to the user or context in real time, developing a form of short-term memory that "remembers" ideas across tokens and paragraphs.

➡️ This pushes LLMs closer to lifelong learning — models that keep improving mid-conversation, the way humans do, without any extra fine-tuning.

Stable and Scalable Reasoning Over Time

Transformers struggle with long-range reasoning — once you go beyond their trained context window, coherence collapses. BDH, however, is designed as a scale-free system — its behavior remains stable as reasoning depth and neuron count grow.

➡️ That means we can build agentic systems that run for days or even weeks — planning, researching, or simulating — without losing logical consistency.

Merging Models Without Catastrophic Forgetting

BDH introduces a unique property called model merging: two models can be "fused" simply by connecting their graphs. Unlike transformers, this doesn't degrade performance or require retraining.

➡️ You can combine models from different domains (say, medical and legal) without fine-tuning.
➡️ This paves the way for modular AI, where reusable "neural plugins" can be connected like software components.

Performance and Efficiency

BDH-GPU works as a state-space system, meaning it can be trained efficiently using PyTorch and GPUs. Its parameter and compute costs grow linearly — not quadratically, as they do in dense transformer blocks.

➡️ This enables building powerful models in the 10M–1B parameter range, making BDH accessible to independent researchers and startups alike.

Connection to Neuromorphic Computing

Because BDH is naturally defined in terms of neurons and synapses, it's a perfect fit for neuromorphic hardware — chips like Loihi or TrueNorth that emulate biological networks directly in silicon.

➡️ This opens possibilities for running large-scale reasoning models on energy-efficient edge devices, robotics platforms, or bio-inspired systems.
A Step Toward “Axiomatic AI”

The authors introduce the idea of Axiomatic AI — systems whose behavior can not only be observed but formally predicted over time. It's like discovering the "thermodynamics of intelligence": predictable scaling laws and stable reasoning dynamics.

➡️ This points toward certifiable and safe AI architectures, suitable for use in autonomous, high-stakes environments — from finance and healthcare to transportation.

Building a Simple Neural Network

To really understand how BDH works, I decided to build a tiny proof-of-concept — a minimal "tiny-BDH" in Rust, trained on the classic XOR problem. It uses autograd via tch-rs (a Rust wrapper around libtorch, the C++ core of PyTorch). This little project was inspired by the famous "A Neural Network in 11 Lines of Python", but my goal wasn't brevity — it was clarity. I wanted to deeply understand how BDH's mechanisms could work in practice.

The full source code is available in my GitHub repo ZhukMax/tiny_bdh_xor, prepared specifically for this article. Below, I'll walk through the implementation step by step. It may look verbose, but that's intentional — the goal here is maximum transparency and accessibility for anyone curious about BDH internals.

Cargo.toml

Since this example is written in Rust, we start with a Cargo.toml file — the manifest that defines the project and its dependencies.

The key dependency here is tch, a safe Rust wrapper around the libtorch C++ library, which powers PyTorch. It gives us access to tensors, autograd, and other core features of deep learning directly from Rust. Because BDH uses familiar concepts like neurons and synapses, it makes sense to reuse these existing abstractions rather than re-implement them from scratch. Our goal isn't to recreate PyTorch — it's to explore the learning logic behind BDH in the simplest possible form.

Here's the relevant snippet from Cargo.toml:

```toml
[package]
name = "tiny_bdh_xor"
version = "0.1.0"
edition = "2021"

[dependencies]
anyhow = "1.0.100"
tch = { version = "0.22", features = ["download-libtorch"] }
```

💡 The download-libtorch feature tells Cargo to automatically fetch and link the correct libtorch binaries for your OS and architecture. Without it, you'd need to manually install PyTorch and set the LIBTORCH environment variable. With it, everything "just works" — Cargo downloads and links the library during build.

(Note: the exact version of tch may differ depending on your setup.)

src/main.rs — The Core of Our Tiny BDH

In Rust projects, all source files live inside the src directory.
Since this is a minimal example, we'll keep everything in a single file — main.rs. Let's import the necessary dependencies and set up the entry point:

```rust
use anyhow::Result;
use tch::{nn, Device, Kind, Reduction, Tensor};
use tch::nn::{Init, OptimizerConfig};

fn main() -> Result<()> {
    let dev = if tch::Cuda::is_available() { Device::Cuda(0) } else { Device::Cpu };
    Ok(())
}
```

Choosing the Device (CPU or GPU)

On line 6, we decide where to run the computations — on the GPU or CPU:

- tch::Cuda::is_available() checks whether CUDA is installed and detects any NVIDIA GPUs.
- If CUDA is available, the code selects the first GPU: Device::Cuda(0).
- If CUDA isn't available (for example, on a Mac or a CPU-only server), it defaults to Device::Cpu.

The variable dev is then passed into other components such as VarStore::new(dev) so that all tensors are created and computed on the same device.

Creating the Training Data

Next, we define the input and output tensors for our tiny XOR neural network — its training set:

```rust
let x = Tensor::from_slice(&[
    0f32,0.,1.,
    0.,1.,1.,
    1.,0.,1.,
    1.,1.,1.
]).reshape([4,3]).to_device(dev);
let y = Tensor::from_slice(&[0f32,1.,1.,0.]).reshape([4,1]).to_device(dev);
```

We start with a flat array of 12 numbers (4 × 3), describing four XOR samples. Each triplet of numbers is one example:

[0, 0, 1]
[0, 1, 1]
[1, 0, 1]
[1, 1, 1]

The first two values are binary inputs (X₁ and X₂), and the third is a constant bias input (always 1), helping the model separate data linearly. Then .reshape([4,3]) converts this flat array into a 4×3 matrix — four samples, each with three input features. Finally, .to_device(dev) moves the tensor to the selected device (GPU or CPU), ensuring all computations happen in one place.

The second tensor, y, contains the expected outputs for each input: [0], [1], [1], [0]. These correspond to the XOR truth table:

| X₁ | X₂ | Y |
|----|----|---|
| 0  | 0  | 0 |
| 0  | 1  | 1 |
| 1  | 0  | 1 |
| 1  | 1  | 0 |

Network Hyperparameters

```rust
let n: i64 = 64;
let d: i64 = 16;
let u: f64 = 0.20;
let hebb_lr: f64 = 0.01;
let smax: f64 = 1.0;
let sparsity_thresh: f64 = 5e-3;
let lr: f64 = 5e-3;
let steps = 3000;
```

- n = 64 — the size of the neural field (number of neurons in the layer).
- d = 16 — the low-rank dimension for matrices E and D, defining how much the data is compressed and expanded.
- u = 0.20 — the forgetting rate for the fast memory σ; higher values make it "forget" faster.
- hebb_lr = 0.01 — the learning rate for Hebbian updates; controls how strongly new activations modify σ.

Hebbian Memory: In BDH, memory is represented by a special connection matrix σ (sigma) — a temporary synaptic memory. It doesn't store the model's learned weights (those are handled by gradient descent). Instead, it remembers which neurons were active together, forming short-term associations — a kind of "working memory" active during inference.

Continuing:

- smax = 1.0 — limits the maximum connection strength in σ, preventing runaway values.
- sparsity_thresh = 5e-3 — zeroes out very small σ elements, keeping the memory sparse and stable.
- lr = 5e-3 — the learning rate for the Adam optimizer that updates regular model parameters (E, D, R_in, W_read).
- steps = 3000 — the number of training iterations (how many times the model sees the data).

Initializing Parameters and the "Neural Field"

After defining our hyperparameters, we create a parameter store — a container that holds all trainable weights and biases of the network. Then we add the model's learnable parameters — its "weights," which will be updated during training:

```rust
let vs = nn::VarStore::new(dev);
let root = &vs.root();
let e  = root.var("E",  &[n,d], Init::Randn { mean: 0.0, stdev: 0.05 });
let dx = root.var("Dx", &[n,d], Init::Randn { mean: 0.0, stdev: 0.05 });
let dy = root.var("Dy", &[n,d], Init::Randn { mean: 0.0, stdev: 0.05 });
let r_in   = root.var("R_in",   &[3,n], Init::Randn { mean: 0.0, stdev: 0.20 });
let w_read = root.var("W_read", &[n,1], Init::Randn { mean: 0.0, stdev: 0.20 });
```

Each variable defines part of the BDH model:

- r_in — the input projection into the neural field.
- E, Dx, Dy — the internal transformations, analogous to the weights of a hidden layer. But remember: BDH doesn't have layers in the usual sense — it's more like a single self-connected field of neurons.
- w_read — the output projection, used to read the network's final activations.
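As a quick sanity check, you can count how many trainable parameters now live in the VarStore. This is a small optional sketch; it assumes tch's VarStore::trainable_variables() accessor, whose exact name may vary between tch versions:

```rust
// E, Dx, Dy are [64,16] each, R_in is [3,64], W_read is [64,1]:
// 3*(64*16) + 3*64 + 64*1 = 3328 trainable parameters in total.
let n_params: i64 = vs
    .trainable_variables()
    .iter()
    .map(|t| t.size().iter().product::<i64>())
    .sum();
println!("trainable parameters: {n_params}");
```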
The Optimizer and Fast Memory

Next, we initialize the Adam optimizer, a popular variant of gradient descent that automatically tunes learning rates per parameter. We also create a tensor σ — a square [n × n] matrix filled with zeros. This represents BDH's fast Hebbian memory, which stores temporary connections between neurons and is updated at every training step.

```rust
let mut opt = nn::Adam::default().build(&vs, lr)?;
let mut sigma = Tensor::zeros(&[n, n], (Kind::Float, dev));

for step in 0..steps {
    // ...
}
```

Inside this training loop, we'll add the code that teaches our "Dragon Hatchling" while it's still in its egg — that is, during offline pretraining.

Forward Pass — The Dragon's First Flight

The next code block performs the forward pass, the main computation step where inputs are transformed into outputs (logits):

```rust
let x_neu = x.matmul(&r_in);
let y1 = relu_lowrank_forward(&x_neu, &e, &dx);
let a = x_neu.matmul(&sigma.transpose(-1, -2));
let y2 = y1 + a;
let z = relu_lowrank_forward(&y2, &e, &dy);
let logits = z.matmul(&w_read);
```

Here's what happens step by step:

- x_neu = x.matmul(&r_in) — the input data enters the neural field.
- y1 = relu_lowrank_forward(...) — the data is compressed, expanded, and passed through a ReLU activation. (We'll define this helper function next.)
- a = x_neu.matmul(&sigma.T) — retrieves the additional signal from Hebbian memory σ, based on temporary neuron associations.
- y2 = y1 + a — merges the "current" signal with short-term memory — this is the core idea of BDH.
- z and logits — the final processing and output projection, combining both short-term and long-term knowledge of the model.

The output logits aren't yet passed through a sigmoid; they represent the raw predictions before activation — the dragon's unrefined thoughts before taking shape.
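To keep the dimensions straight, here is the shape of every tensor as the XOR batch (B = 4) flows through the forward pass above. This is a reference sketch of the code we already wrote, not extra code to add:

```rust
// x        : [4, 3]                          four samples: two inputs + bias
// R_in     : [3, 64]   → x_neu = x·R_in      : [4, 64]  input enters the neural field
// E        : [64, 16]  → x_neu·E             : [4, 16]  compression into the latent space
// Dx, Dy   : [64, 16]  → (·)·Dxᵀ, then ReLU  : [4, 64]  expansion back to the field
// σ        : [64, 64]  → a = x_neu·σᵀ        : [4, 64]  contribution of the fast Hebbian memory
// W_read   : [64, 1]   → logits = z·W_read   : [4, 1]   one raw prediction per sample
```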
Low-Rank + ReLU Helper

As promised, here's the ReLU helper we use in the forward pass:

```rust
/// y = ReLU( (x E) D^T )
fn relu_lowrank_forward(x: &Tensor, e: &Tensor, d: &Tensor) -> Tensor {
    let h = x.matmul(e);                     // [B,n]·[n,d] = [B,d]
    h.matmul(&d.transpose(-1, -2)).relu()    // [B,d]·[d,n] = [B,n]
}
```

This is a low-rank linear layer with ReLU. Instead of a big dense matrix W ∈ R^{n×n}, we factor it as W ≈ E · Dᵀ with E ∈ R^{n×d}, D ∈ R^{n×d}, d ≪ n.

The idea is straightforward: you don't need all possible synapses. Project into a compact latent space of size d, then project back. For tiny demos like XOR this is mostly illustrative; for GPT-scale models the memory savings can be massive (terabytes at scale).

The first matmul compresses the high-dimensional "neural field" (n features) into a latent space of size d. The second expands it back to n as a linear combination of decoder patterns from D. Together this acts like a single multiplication by W ≈ E · Dᵀ, but uses 2·n·d parameters instead of n². With our values (n = 64, d = 16) that's 2·64·16 = 2048 numbers instead of 64² = 4096; modest here, but the gap grows quickly as n does.

Loss, Backprop, Step

Now let's add the standard training step — compute the loss, run backprop, update weights:

```rust
let loss = logits
    .binary_cross_entropy_with_logits::<Tensor>(&y, None, None, Reduction::Mean);
opt.zero_grad();
loss.backward();
opt.step();
```

These four lines are the heart of the training loop: measure error, compute how to fix the model, and apply the update. After each iteration, the network moves a little closer to the correct solution.
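For reference, binary_cross_entropy_with_logits with Reduction::Mean computes the standard binary cross-entropy applied to sigmoid(logits). Written out, with z the logit and y the target label:

loss = −mean( y·log sigmoid(z) + (1 − y)·log(1 − sigmoid(z)) )

Fusing the sigmoid and the loss into a single call is the numerically stable way to do this, which is also why the forward pass leaves the logits raw.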
Hebbian Fast Memory Update (σ)

The last part — and really the core BDH twist — is the Hebbian fast-memory update. It runs outside autograd and keeps values stable:

```rust
tch::no_grad(|| {
    let bsz = x.size()[0] as f64;

    // 1) Build co-activation map: outer = y2ᵀ @ x_neu
    let outer = y2
        .detach()                      // detach from autograd
        .transpose(-1, -2)             // [B,n]ᵀ → [n,B]
        .matmul(&x_neu.detach())       // [n,B] @ [B,n] → [n,n]
        .to_kind(Kind::Float)
        * (hebb_lr / bsz);             // scale by batch size and Hebb LR

    // 2) Work on a shallow copy to avoid move/borrow issues
    let zeros = Tensor::zeros_like(&sigma);
    let mut s = sigma.shallow_clone();

    // 3) Exponential forgetting + add fresh co-activations
    s *= 1.0 - u;                      // older σ fades out
    s += &outer;                       // Hebbian boost for co-firing neurons

    // 4) Safety rails: clamp to prevent blow-ups
    //    (I originally skipped this and hit runtime errors during training)
    s = s.clamp(-smax, smax);

    // 5) Sparsify: zero-out tiny values (efficiency + stability)
    let keep = s.abs().ge(sparsity_thresh);
    s = s.where_self(&keep, &zeros);

    // 6) Row-wise normalization: stabilize the energy of σ @ x
    let row_norm = s.square().sum_dim_intlist([1].as_ref(), true, Kind::Float).sqrt();
    s = &s / &row_norm.clamp_min(1.0);

    // 7) Write back into σ without changing ownership
    sigma.copy_(&s);
});
```

Think of this as BDH's working memory: it quickly adapts to the current context (Hebbian), gradually forgets old patterns (u), stays compact (sparsity), and remains numerically stable (clamp + normalization).
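Condensed into one line, the fast update that the loop above performs at every step is (my shorthand: η = hebb_lr, B = batch size, θ = sparsity_thresh):

σ ← rownorm( sparsify_θ( clamp_±smax( (1 − u)·σ + (η/B)·y₂ᵀ·x_neu ) ) )

In words: decay the old memory, add the fresh co-activation map, then clamp, sparsify, and row-normalize so that σ's contribution to the next forward pass stays bounded.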
What We've Built

We've implemented a network with the two learning modes described in the paper:

- Slow learning — classic backprop that shapes the permanent weights (E, D, R_in, W_read).
- Fast learning — Hebbian updates of the σ matrix during inference/training.

We intentionally leave out the third piece — transferring fast memory into long-term weights — because, as the authors note, it's not fully specified yet. Designing that mechanism is nontrivial and beyond the scope of this overview; even the research paper only sketches this direction at a high level.

How to Run It

```bash
# 1) Create the project and add the files
cargo new tiny_bdh_xor && cd tiny_bdh_xor
# (replace Cargo.toml and src/main.rs with the code above)

# 2) Build & run
cargo run --release
```

As expected, after a couple thousand steps the network converges (loss ↓, acc → 1.0) and predicts XOR correctly.

Logging to the Console

To make the training dynamics and results easy to inspect, let's add some lightweight logging.

1) Progress every 300 steps

Print loss and accuracy during training:

```rust
if step % 300 == 0 {
    let y_hat = logits.sigmoid();
    let acc = y_hat.gt(0.5)
        .eq_tensor(&y.gt(0.5))
        .to_kind(Kind::Float)
        .mean(Kind::Float)
        .double_value(&[]);
    println!("step {:4} loss {:.4} acc {:.2}", step, loss.double_value(&[]), acc);
}
```

2) Final predictions

After training, dump the model's predictions:

```rust
let x_neu = x.matmul(&r_in);
let y1 = relu_lowrank_forward(&x_neu, &e, &dx);
let a = x_neu.matmul(&sigma.transpose(-1, -2));
let y2 = y1 + a;
let z = relu_lowrank_forward(&y2, &e, &dy);
let preds = z.matmul(&w_read).sigmoid().gt(0.5).to_kind(Kind::Int64);
println!("\nPred:\n{:?}", preds);
```

3) With vs. without fast memory (σ)

Compare predictions when the Hebbian memory is on vs off:

```rust
// σ = on
let probs = z.matmul(&w_read).sigmoid();
println!("\nProbs (σ=on):");
probs.print();
println!("Preds (σ=on):");
preds.print();

// σ = off
let y1_nos = relu_lowrank_forward(&x_neu, &e, &dx);
let y2_nos = y1_nos; // no 'a' term from σ
let z_nos = relu_lowrank_forward(&y2_nos, &e, &dy);
let preds_nos = z_nos.matmul(&w_read).sigmoid().gt(0.5).to_kind(Kind::Int64);
println!("\nPreds (σ=off):");
preds_nos.print();
```

For the full working code, see the repository: https://github.com/ZhukMax/tiny_bdh_xor

Build, Training, and Prediction Results

The model converges quickly, and you can see that:

- Probs (σ = on) are almost perfect: [~0, 1, 1, ~0].
- Preds (σ = off) match — which is expected for XOR: it's a static task solvable by the "slow" weights without fast memory.
```
Running `target/debug/tiny_bdh_xor`
step 0 loss 0.6931 acc 0.50
step 300 loss 0.0000 acc 1.00
step 600 loss 0.0000 acc 1.00
step 900 loss 0.0000 acc 1.00
step 1200 loss 0.0000 acc 1.00
step 1500 loss 0.0000 acc 1.00
step 1800 loss 0.0000 acc 1.00
step 2100 loss 0.0000 acc 1.00
step 2400 loss 0.0000 acc 1.00
step 2700 loss 0.0000 acc 1.00

Pred:
Tensor[[4, 1], Int64]

Probs (σ=on):
 7.4008e-09
 1.0000e+00
 1.0000e+00
 6.6654e-17
[ CPUFloatType{4,1} ]
Preds (σ=on):
 0
 1
 1
 0
[ CPULongType{4,1} ]

Preds (σ=off):
 0
 1
 1
 0
[ CPULongType{4,1} ]
```

Why σ Isn’t “Needed” for XOR

XOR is a simple Boolean function that the network can learn with its slow parameters (E/Dx/Dy/R_in/W_read). The Hebbian layer σ shines when there's context over time — sequences, associations, "what happened earlier" — not when each sample is independent.

What to Try Next to See σ Pay Off

- Sequences (context memory): predict the final symbol of a pair that appeared earlier in the same sequence (copy / associative recall).
- Long-range dependencies: balanced-parentheses tasks — check pairing correctness across 20–100 steps.
- On-the-fly adaptation: during inference, "inject a new rule" (a token pair) and verify the model uses it without gradient updates.
- σ ablations: compare convergence speed/quality with σ on/off on harder prediction tasks. Log nnz(σ) and watch how connections strengthen/decay over time.

The AI Incubator Is Near (Conclusions)

BDH isn't just "another alternative to transformers." It's a glimpse into the next era of neural architectures — ones that learn not on schedule, but in the moment of action. Instead of waiting for retraining or requiring terabytes of data, BDH adjusts itself during reasoning, in real time.

If transformers are like "students" who completed a course and earned their diploma, then BDH is a dragon hatchling — freshly born, exploring the world, making mistakes, adapting, and remembering everything new it encounters.

This direction brings AI back to its original spirit: not just to compute probabilities, but to think within context and experience.