What if I told you that three numbers - just three - could diagnose whether your AI is hallucinating or actually converging? No new architecture. No re-training. No infrastructure update. Just mathematics that has been sitting in a box since 1950, waiting for someone to open it.

These three numbers would have flagged:

Your Tesla phantom-braking for no apparent reason.

GPT-4 confidently citing a tribunal case that never existed, backed by a fabricated advocate.

Your model that trained for months, looked flawless, shipped to prod, and fell apart the moment it met real data.

Three Numbers Nobody Watches

Nobody monitors them, yet they say more about your model than anything else you track. Not your architecture. Not your dataset. Not your loss curves. These three numbers determine:

Whether your model's reasoning is sound or hallucinatory.

Whether your convergence is real or fake.

Whether your training run will deliver results or quietly burn your compute.

1. Condition number κ (λ_max / λ_min): Is your optimizer wasting compute zig-zagging? This number measures how "stretched" your loss landscape is. A high κ means a canyon - and you could be spending 90% of your compute moving sideways instead of down (toward the error minimum).

2. Eigenvalue magnitude ε (λ_max): Is your model genuinely learning - or just memorizing? Read on, and you'll find a GIF below that explains it better than a whole series of equations.

3. Negative eigenvalue count δ (#λ < 0): You might be convinced that the gradient your AI framework reports is enough to finish training your shiny new network. That's what you were told: the gradient says "converged, nowhere left to go" because the slope is flat in every direction it checks. But what if your training is parked at a fake minimum? That's what this killer δ number is for. As a preview:

If δ = 0, you have reached a genuine minimum - celebrate.

If δ > 0, you are sitting on a saddle point.

Further down, you'll find an animated graphic showing exactly the trap you may be stuck in.

And Yet - Believe It or Not - This Is Ignored

These quantities have existed since the 1950s, before anyone had trained a neural network. Seventy-five years of mathematicians screaming into the void. And still - right now, today, as you read this - they are not in PyTorch. Not in TensorFlow. Not in JAX. Not in Keras. Not in any standard training pipeline burning millions of dollars. Three numbers that would tell you everything. Hidden in plain sight.

The biggest AI labs in the world? They stare at loss curves. They cross their fingers. And they call it engineering.
OpenAI ignores them. Google skips them. Anthropic doesn't even know they exist. Nobody checks.

Fine, I'm exaggerating. But sadly, not by much.

Curious about the hidden side of AI? Discover more on my page here.

Three Numbers Missing From Every Dashboard

So you've met κ, ε, and δ. Three numbers that predict everything. Three numbers that have existed since your grandparents were young. So where are they? Not on your screen. Not in your logs. Not anywhere in the trillion-dollar AI industry.

These three numbers aren't exotic science. They are the instrument panel training should have had all along... but doesn't. Instead, here's what the most popular AI frameworks - PyTorch, TensorFlow, JAX, Keras - actually give you:

Loss value ✓

Gradient direction ✓

Learning rate ✓

That's it. That's the whole dashboard. Three numbers - but not the right three. Not κ (condition number). Not ε (eigenvalue magnitude). Not δ (negative eigenvalue count).

And it's not just your setup. OpenAI's infrastructure? Same blind spot. Google's stack? Same. Anthropic's training pipeline? Same missing panel. The entire industry flies the same way.

AI training today is like flying a 747 with a speedometer, a compass, and vibes.

The loss is going down. Great - that tells you the error is shrinking and training looks fine. But is the model actually finding a robust solution, or is it just wedging itself into a narrow crack and hoping nothing ever shifts? The eigenvalues know. Your AI framework's dashboard doesn't. Figure 2b makes this obvious:

So where do these three numbers come from? From something called the Hessian - the second-derivative matrix of your loss function. It encodes the full curvature of your loss landscape. Every canyon, every ridge, every flat plateau. It's all in there.

And here's the good part: the Hessian is perfectly computable. It sits right there, pure diagnostic data, and nobody looks at it.

Here's how that plays out. Look at the two panels below. On the left: what PyTorch shows you. Loss looks fine. Gradient at zero. Learning rate set. Green light. Ship it. On the right: what κ, ε, and δ reveal. Condition number through the roof. Razor-sharp curvature. Negative directions present. You know instantly: you're stuck at a saddle point pretending to be done.

Same numbers. Same moment. Opposite conclusions.

OpenAI doesn't compute this. Google doesn't compute this. Your $100,000-a-month GPU cluster doesn't compute eigenvalues. It stares at loss curves and - you guessed it - hopes!
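To be concrete about what "computing them" even means, here is a minimal sketch - assuming a generic PyTorch model, a loss_fn, and one batch (x, y), all of them placeholders - that estimates the dominant (largest-magnitude) Hessian eigenvalue with plain power iteration on Hessian-vector products. Libraries such as PyHessian package the same machinery far more carefully (plus traces and spectral-density estimates); this is only to show there is no dark magic involved.

```python
import torch

def top_hessian_eigenvalue(model, loss_fn, x, y, iters=50):
    """Estimate the dominant Hessian eigenvalue of the loss by power iteration.

    Uses Hessian-vector products (Pearlmutter's trick), so the full n x n
    Hessian is never built - each step costs roughly one extra backward pass.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    # Keep the graph so the gradient can be differentiated a second time.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit starting vector, one block per parameter tensor.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
    v = [vi / norm for vi in v]

    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product: gradient of (grad . v) w.r.t. the parameters.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((hvi * vi).sum() for hvi, vi in zip(hv, v)).item()  # Rayleigh quotient
        norm = torch.sqrt(sum((hvi ** 2).sum() for hvi in hv))
        v = [hvi / (norm + 1e-12) for hvi in hv]
    return eig
```

The matrix itself never appears: every iteration is just another gradient call. Keep that in mind for the objection that comes next.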
"But the Hessian is huge!" This is where someone from Google raises a hand. The Hessian is n×n. For a billion-parameter model, that's billion-by-billion. You can't compute that. It is mathematically insane!

Correct. And completely irrelevant. You don't need the full matrix. You never did.

A Hungarian physicist named Cornelius Lanczos worked this out in 1950 - back when "computer" meant a room full of vacuum tubes and obsessive finger-crossing. His method extracts the extreme eigenvalues using iterative matrix-vector products. The complexity: O(n) per iteration. Basically free compared to a single forward pass.

Seventy-five years of progress since then? We now have Hutchinson's trace estimator, stochastic Lanczos quadrature, randomized SVD. You can get spectral density estimates, top-k eigenvalues, condition number bounds - all at negligible computational cost. Tools like PyHessian show this working on ImageNet-scale models. Today. Right now.

So why isn't this in PyTorch? Why isn't it in TensorFlow? Why isn't it in JAX?

Because the people who understand spectral theory publish mathematics papers, and the people who build training infrastructure don't read them. Frameworks ship the features that demo well. And model builders keep staring at their loss curves to reassure themselves that everything is fine.

The mathematics exists. The engineering exists. The will to connect them? Apparently not. Seventy-five years. Still waiting…

What These Numbers Actually Tell You

Let's make this concrete, practical, and - above all - visual, so you can build real intuition for what goes wrong.

The Condition Number Disaster

The condition number κ (kappa) = λ_max / λ_min controls how efficiently your problem can be optimized at all. Here's what nobody tells you: gradient descent shrinks the error by a factor of roughly (κ-1)/(κ+1) per iteration.

Translation: κ = 100? Roughly a hundred iterations to make real progress. κ = 10,000? Roughly 10,000 iterations for the same progress. And κ = 10,000 is common in practice. Not pathological. Not rare. Tuesday.

Why does high κ hurt? Picture rolling a marble down a valley. Nice round bowl? The marble rolls straight to the bottom. Done. But high κ means your valley is a razor-thin slot canyon: walls a mile high, floor barely visible. Your marble doesn't roll down. It pinballs off the walls. Left. Right. Left. Right. Burning energy going sideways instead of down.

That's your gradient. It points exactly where calculus says it should: steepest descent. The problem is that "steepest descent" points into the canyon walls, not along the floor toward the solution.

Loss stuck at 0.0847, then 0.0846, then 0.0847 again? You tweak the learning rate. Nothing. You sacrifice a rubber duck to the ML gods 🚮. Nothing. Those training runs that plateau for hours? That's high κ. Your optimizer isn't broken. It's doing exactly what you asked. What you asked for is just geometrically insane.
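If you'd rather feel that rate than take it on faith, here is a toy sketch - a hand-made two-parameter quadratic, not anything from the figures in this article - that runs plain gradient descent at κ = 1, 100, and 10,000 and counts the iterations needed to reach the same tolerance:

```python
import numpy as np

def iterations_to_converge(kappa, tol=1e-6, max_iters=1_000_000):
    """Gradient descent on f(x) = 0.5 * (x1^2 + kappa * x2^2), minimum at the origin.

    The fixed step size 2 / (lambda_min + lambda_max) yields the classic
    (kappa - 1) / (kappa + 1) contraction of the error per step.
    """
    lam = np.array([1.0, float(kappa)])   # Hessian eigenvalues of the quadratic
    step = 2.0 / (lam.min() + lam.max())  # best possible fixed step size
    x = np.array([1.0, 1.0])
    for i in range(max_iters):
        if np.linalg.norm(x) < tol:
            return i
        x = x - step * lam * x            # gradient of the quadratic is lam * x
    return max_iters

for kappa in (1, 100, 10_000):
    print(f"kappa = {kappa:>6}: {iterations_to_converge(kappa)} iterations")
```

The iteration counts grow roughly in proportion to κ, which is exactly the (κ-1)/(κ+1) contraction at work: the ill-conditioned run spends nearly all of its steps bouncing between the canyon walls.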
Given that geometry, the optimizer is doing the best anything could. The fix? Preconditioning. Second-order methods. Adaptive optimizers that genuinely adapt to curvature. The mathematics is available; it has been in textbooks since the 1960s. But first you have to know that κ is the problem. Your dashboard won't tell you. So you'll never ask. And your compute bill keeps climbing.

The next figure shows exactly what this costs. Same starting point. Same minimum. One trajectory takes ~100 iterations. The other takes ~10,000 - zigzagging off the canyon walls while your GPU burns money.

Sharp vs. Flat: The Generalization Prophecy

Now for ε (epsilon) - the magnitude of the largest eigenvalue, λ_max. This number answers the question every ML engineer secretly dreads: Did your model actually learn anything, or did it just memorize the test?

Picture the minima of your loss landscape as places. A flat minimum is a broad valley - you can wander around in it and the loss barely changes. A sharp minimum is a knife-edge ridge - one step sideways and you plunge into the abyss.

Small ε = flat minimum = good news. When ε is small, your model found that wide valley. Production data comes in slightly different from training data: users type weird things, lighting changes, accents vary, and your model shrugs. Slightly different input, nearly identical behavior. That's generalization. That's what you're paid for.

The GIF below shows what this looks like geometrically:

What happens when ε is huge? When your eigenvalues are screaming large numbers? Training loss looks perfect. Validation is your "best run ever." Well, when ε is large, your model has squeezed itself into a tiny, sharp crevice.

Then real data arrives. Slightly different wording. Slightly different images. Slightly different anything. The model doesn't just perform a bit worse. It blows up:

Confidence scores go crazy.

Predictions come out of nowhere.

Your alert-log system explodes at 2 AM. 💀

That's a sharp minimum in action. The walls are so steep that the tiniest shift sends the loss rocketing. Your model wasn't robust. It was brittle and faking it.

The GIF below shows the difference. Same horizontal shift. Wildly different outcomes.

As you see, this is not philosophy. This is geometry you can measure. Your model that aced the benchmark and then face-planted in production? Sharp minimum. The eigenvalues would have flagged it... but nobody checked.

Saddle Points: The Silent Killer

Now for δ (delta) - the count of negative eigenvalues, #(λ < 0). These are different. These mess with your head.

Your gradient hits zero. Your loss curve goes flat. Your framework prints "converged." You relax. But you're not at an error minimum. You're at a saddle point! 💀

A mountain pass: terrain that rises in some directions and falls in others. "But PyTorch shows my gradient is zero - doesn't that mean optimization is done?" No, dear reader. No. It only means you're balanced on a ridge, not that you've reached anything useful.
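To make δ tangible, here is a toy sketch on a two-parameter function - the textbook saddle f(x, y) = x² - y², with a helper name of my own invention, not a framework feature. For a real network you would estimate the most negative eigenvalue with the same Hessian-vector-product trick shown earlier instead of materializing the matrix.

```python
import torch

def classify_critical_point(f, point, tol=1e-8):
    """At a point where the gradient is (near) zero, count negative Hessian
    eigenvalues: delta = 0 suggests a true minimum, delta > 0 a saddle."""
    p = point.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(f(p), p)[0]
    hessian = torch.autograd.functional.hessian(f, point)
    eigvals = torch.linalg.eigvalsh(hessian)
    delta = int((eigvals < -tol).sum())
    return grad, eigvals, delta

# f(x, y) = x^2 - y^2: zero gradient at the origin, one negative-curvature direction.
f = lambda p: p[0] ** 2 - p[1] ** 2
grad, eigvals, delta = classify_critical_point(f, torch.zeros(2))
print("gradient:   ", grad)     # ~[0., 0.]  -> the dashboard says "converged"
print("eigenvalues:", eigvals)  # ~[-2., 2.]
print("delta:      ", delta)    # 1          -> saddle, not a minimum
```

The gradient alone says "done"; the single negative eigenvalue says "you are parked on a ridge." That is the whole point of δ.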
How common are saddle points? Do the math. At any critical point, you can think of each eigenvalue as having a coin-flip chance of being positive or negative. For a true minimum, you need all of them positive. The rough probability? One-half raised to the power of your parameter dimension. For a million-parameter model, that's (1/2)^(10^6). You have better odds of winning the lottery while being struck by lightning while a shark bites your leg.

In high dimensions, almost every critical point is effectively a saddle. True minima are statistical miracles.

The good news: most saddle points are unstable, and SGD's inherent noise usually kicks you off them eventually. The bad news: "eventually" can mean days of wasted compute. Degenerate saddles - where eigenvalues hover near zero - become plateaus where the gradient crawls instead of flowing. The loss doesn't budge. You sit there, wondering whether to kill the run or keep hoping.

δ > 0 would tell you instantly. One number. Saddle or not. But your framework doesn't compute it.

The GIF below shows what this trap looks like:

The Gradient Flow Time Bomb

Oh, you thought we were done? It gets worse. Your neural network isn't one function. It's a chain of functions - layer after layer after layer. And gradients have to flow backward through every single one during training. Like a game of telephone, except each person might whisper too quietly or scream too loudly.

Each layer has a Jacobian matrix - the matrix of partial derivatives that governs how signals propagate through it. The singular values of these Jacobians determine whether your gradients survive the journey or die along the way.

Singular values > 1: the gradient gets amplified at every layer. By the time it reaches the early layers, it isn't a gradient anymore - it's a bomb. Exploding gradients. Your weights shoot off to infinity. Training crashes. NaN city.

Singular values < 1: the opposite disaster. The gradient gets squashed at every layer. By the time it reaches the early layers, it's a rounding error. Vanishing gradients. Your early layers stop learning. They sit frozen while the rest of the network pretends to train.

Singular values ≈ 1: the Goldilocks zone. Gradients flow cleanly from the last layer to the first. Every layer learns. This is why orthogonal initialization works. This is why spectral normalization exists.

But here's the thing: these techniques were discovered empirically and applied as band-aids. Nobody monitors Jacobian spectra during training. Nobody tracks the singular values. The diagnosis that matters - "Layer 47 is killing your gradient flow" - simply doesn't exist in any commercial framework. Your network can be rotting from the inside, and the dashboard shows nothing. See it for yourself:
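You don't even need autograd for a first look. Below is a rough sketch - the helper and the toy network are mine, not a framework feature - that reads singular values straight off the weight matrices of Linear layers. For a linear layer, that weight matrix is the layer's input-output Jacobian; ignoring the activation functions makes this only a proxy, but it already catches the worst amplification and squashing.

```python
import torch
from torch import nn

def weight_spectra(model):
    """Report (sigma_max, sigma_min) of every Linear layer's weight matrix.

    Values drifting far above 1 hint at explosion, far below 1 at vanishing.
    """
    report = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            svals = torch.linalg.svdvals(module.weight.detach())
            report[name] = (svals.max().item(), svals.min().item())
    return report

# Hypothetical toy network, just to show the output format.
net = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
for name, (s_max, s_min) in weight_spectra(net).items():
    print(f"layer {name}: sigma_max = {s_max:.2f}, sigma_min = {s_min:.2f}")
```

A more faithful per-layer check would propagate random vectors with forward-mode autodiff (torch.func.jvp) to include the nonlinearities, but even this crude version will point a finger at the layer whose spectrum has drifted.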
And So: The Dashboard Your AI Framework Should Have Had

None of this requires new research. A mathematically literate training loop would run lightweight spectral probes at checkpoints and act on them:

Hessian condition number exceeds a threshold → switch to a preconditioned method.

Jacobian singular values drift away from 1 → apply spectral normalization.

Negative Hessian eigenvalues detected → you're at a saddle, perturb to escape.

Right now, no commercially available AI framework gives you these basic eigenvalue predictors. Instead, you're hand-tuning learning rates and hoping…

Why is this ignored? Three reasons:

Scaling obscures mathematical sins. When throwing more compute at the problem works, nobody questions the foundations. This is temporary. Scaling laws plateau. When they do, the industry will suddenly need mathematics it never bothered to learn.

Disciplinary silos. The people who know spectral theory sit in applied-mathematics departments publishing papers few engineers read. The people who build AI frameworks come from software engineering and statistics. The overlap in that Venn diagram is depressingly small.

Abstraction debt. Proper spectral monitoring requires infrastructure nobody wants to build. Everyone would benefit. No one wants to pay.

The Bill Comes Due

Here's what's missing from every production framework - and what it costs. The industry has invested $100 billion in scaling AI while the mathematical foundations were left to rot. Every mysterious training failure, every generalization anomaly, every model that shines in the lab and collapses in production - these are untreated mathematical pathologies that three numbers would have exposed.

What can you do right now to get ahead of the big AI labs? Start here:

Learn what Hessian eigenvalues mean for your specific architecture.

Monitor spectral statistics during training - even rough estimates help.

Measure sharpness before you deploy - PyHessian exists; use it.

Question everything when the loss plateaus - you might be at a saddle, not a minimum.

The math exists. It's been waiting since 1950. The only question is whether you'll learn it before your next production failure teaches you the hard way.

Curious about the hidden side of AI? Find out more on my page here.