GPT in 200 Lines: The Beautiful Simplicity Beneath Modern AI

Andrej Karpathy is a name familiar to many people in the AI world. He taught at Stanford for a time, then led Tesla's AI division, worked at OpenAI, and, in my opinion, makes some of the best AI educational videos available today.

His latest project, microGPT, is something I can only describe as a work of art. microGPT is a complete GPT implementation written in only about 200 lines of Python code. And this is not heavily compressed or obfuscated code. On the contrary, the code is clean, well structured, and thoroughly commented. It even avoids external deep learning libraries. In other words, it contains not only the neural network itself, but also the minimal machinery required to train and run it. The whole project does not even have a full repository; it was simply published as a single GitHub Gist.

https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95?embedable=true

In this article, I will walk through how this beautiful piece of code works in a way that anyone with basic Python knowledge can understand.

Let's start with the name: GPT. GPT stands for Generative Pretrained Transformer. ChatGPT itself carries this acronym in its name, and most of the major alternatives are built on the same core idea.

A GPT model predicts the next word (or token) in a text. This is essentially what all language models do. They generate responses by predicting one word at a time: first they predict the next word, then they append it to the text and predict the next one again, and so on. Doing this well requires some understanding of the text and its meaning.

In neural networks, this "thinking" process is expressed as a very large mathematical function. The preceding words are converted into numbers and fed into this function, which then calculates the next word. In principle, any system of rules can be represented mathematically with such functions. This function is what we call the model.

The problem is that this function is extremely complex. Modern language models contain billions of parameters, meaning the mathematical formula describing them can have billions of components. Such enormous formulas are impossible for humans to design or manipulate manually.

Luckily, there is a clever workaround. Instead of writing the formula by hand, we define a "template" for it: a structure with many adjustable parameters. These parameters can be tuned automatically until the model works.

There are several such templates, depending on the type of problem we want to solve:
- Convolutional networks are commonly used for image processing.
- MLPs (multi-layer perceptrons) are used for general function approximation.
- Transformers are used for language models such as ChatGPT.
- And so on.

These templates are large mathematical structures that can approximate complex rules when their parameters are properly adjusted. The question then is: how do we adjust these parameters? Or, in other words: how do we "program" a neural network?
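To make the "template with adjustable parameters" idea concrete before answering that question, here is a minimal sketch. The name `line_template` and the straight-line formula are illustrative choices of mine, not part of microGPT; a real network is the same idea with vastly more parameters.

```python
def line_template(w, b):
    """Return the function f(x) = w * x + b for one specific choice of parameters."""
    def f(x):
        return w * x + b
    return f

# Same template, two different parameter settings, two different behaviors.
doubler = line_template(w=2.0, b=0.0)
shifter = line_template(w=1.0, b=5.0)

print(doubler(3.0))  # 6.0
print(shifter(3.0))  # 8.0
```

The structure of the formula never changes; only the numbers `w` and `b` do. Training is the process of finding good values for those numbers automatically.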
Manually changing the parameters of a neural network is clearly not feasible, especially when working with models that contain millions or even billions of parameters. Luckily, there is a way to do this automatically, using data. This process is called training, and the method that drives it is called gradient descent.

If we have a large dataset and a large mathematical formula (our model), we can calculate the model's error for each example in the dataset. In the case of transformers, the process works roughly like this. Words are represented as points in a high-dimensional vector space, where words with similar properties lie close to each other. The model receives a sequence of words as input and tries to predict the next word. Since words are represented as vectors, we can calculate how far the predicted word is from the correct one. This distance gives us a numerical measure of the error.

And now comes the magic. Using gradient descent, we can calculate how the parameters of the formula should be changed to reduce this error. If we have enough data, a sufficiently large network, and train it for long enough, the formula gradually adapts to the patterns in the dataset and produces increasingly accurate predictions.

But what exactly does this "magic" do? How does gradient descent actually find better parameters?

We can visualize the error as a function over a high-dimensional space, where each dimension corresponds to one parameter of the model. Thinking in many dimensions is nearly impossible for humans, so instead we can imagine a simpler analogy: a landscape of mountains and valleys.
In this analogy:
- Each point in the landscape represents a specific combination of model parameters.
- The height of the landscape at that point represents the model's error.

At the start of training, the parameters are initialized randomly. This is like being dropped at a random spot on a hillside. Our goal is to minimize the error, which means finding the lowest point in the landscape.

The problem is that we do not know what the landscape looks like. It is as if we were walking on a mountain in the dark or in thick fog. So what can we do?

This is where gradient descent comes in. In mathematics, there is a concept called the derivative, which describes the slope of a function at a given point. This is extremely useful here. The derivative tells us which direction the landscape slopes downward.

So the algorithm simply does the following:
1. Measure the slope of the landscape at the current position.
2. Take a small step in the direction where the error decreases.
3. Repeat.

Step by step, the model gradually walks downhill until it reaches a low point in the error landscape. This is essentially what gradient descent does.

The remaining question is: how do we calculate the slope of such a complex function? I will not go into the full mathematical detail here, because Karpathy has a great video dedicated to it. For our purposes, it is enough to know the following. Every mathematical operation has a corresponding rule that lets us calculate its local gradient. In addition, there is a rule called the chain rule, which lets us combine these local gradients to calculate the gradient of an entire composite function.
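The downhill walk and the chain rule can be sketched on a one-parameter toy problem. Everything here is an illustrative assumption: the error function, the learning rate, and the step count are made up for the example and are not taken from microGPT.

```python
def error(w):
    # A toy "landscape" with its lowest point at w = 3.
    return (w - 3.0) ** 2

def slope(w):
    # Chain rule: for u = w - 3, d(u ** 2)/dw = 2 * u * du/dw = 2 * (w - 3) * 1.
    return 2.0 * (w - 3.0)

def gradient_descent(w_start, learning_rate=0.1, num_steps=100):
    w = w_start
    for _ in range(num_steps):
        # Take a small step in the direction where the error decreases.
        w -= learning_rate * slope(w)
    return w

w_final = gradient_descent(w_start=-4.0)
print(w_final, error(w_final))
```

Each step shrinks the distance to the minimum by a constant factor, so the parameter settles very close to 3 even though the algorithm only ever looks at the local slope.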
The key idea is that we move backwards through the chain of operations, collecting gradients along the way. This process is called backpropagation.

All major deep learning frameworks use this mechanism. For example:
- TensorFlow uses a system called GradientTape, which records operations like a tape recorder so that gradients can be calculated afterwards.
- PyTorch attaches gradient information directly to tensors. Each tensor remembers how it was created, which allows gradients to be calculated automatically.

Karpathy's implementation follows the same basic idea. Let's see how.

    import math

    # Let there be Autograd to recursively apply the chain rule through a computation graph
    class Value:
        __slots__ = ('data', 'grad', '_children', '_local_grads') # Python optimization for memory usage

        def __init__(self, data, children=(), local_grads=()):
            self.data = data                # scalar value of this node calculated during forward pass
            self.grad = 0                   # derivative of the loss w.r.t. this node, calculated in backward pass
            self._children = children       # children of this node in the computation graph
            self._local_grads = local_grads # local derivative of this node w.r.t. its children

        def __add__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            return Value(self.data + other.data, (self, other), (1, 1))

        def __mul__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            return Value(self.data * other.data, (self, other), (other.data, self.data))

        def __pow__(self, other): return Value(self.data**other, (self,), (other * self.data**(other-1),))
        def log(self): return Value(math.log(self.data), (self,), (1/self.data,))
        def exp(self): return Value(math.exp(self.data), (self,), (math.exp(self.data),))
        def relu(self): return Value(max(0, self.data), (self,), (float(self.data > 0),))
        def __neg__(self): return self * -1
        def __radd__(self, other): return self + other
        def __sub__(self, other): return self + (-other)
        def __rsub__(self, other): return other + (-self)
        def __rmul__(self, other): return self * other
        def __truediv__(self, other): return self * other**-1
        def __rtruediv__(self, other): return other * self**-1

        def backward(self):
            # Order the graph so that every node comes after the nodes it was computed from
            topo = []
            visited = set()
            def build_topo(v):
                if v not in visited:
                    visited.add(v)
                    for child in v._children:
                        build_topo(child)
                    topo.append(v)
            build_topo(self)
            self.grad = 1
            # Walk from the output back to the inputs, applying the chain rule at each node
            for v in reversed(topo):
                for child, local_grad in zip(v._children, v._local_grads):
                    child.grad += local_grad * v.grad

In Karpathy's code, the entire gradient computation logic lives in the Value class, which is roughly 40 lines long. This class is essentially a wrapper around numeric values. In addition to storing the data, it also stores:
- which values it was computed from (its children), and
- the local gradients needed for backpropagation.

If we look at the code, we can see that the standard operators are redefined: __add__, __mul__, and so on. When we perform operations on Value objects, the program not only computes the result, but also records which values produced it and how the gradient should be propagated.

Finally, the backward() method sorts the recorded graph topologically, walks it in reverse from the output back to the inputs, and uses the stored local gradients to accumulate the gradient of every value via the chain rule. And this is essentially the whole idea.

Another important part of training is the gradient descent step itself, which uses the calculated gradients to update the model's parameters. In other words, this is the part of the code that walks downhill using the slope information.

    # Adam optimizer update: update the model parameters based on the corresponding gradients
    lr_t = learning_rate * (1 - step / num_steps) # linear learning rate decay
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0

In Karpathy's implementation, this is done with the Adam optimizer, one of the most widely used optimization algorithms in deep learning. Adam is slightly more sophisticated than basic gradient descent. Instead of taking a fixed-size step, it adapts the step size dynamically based on the history of previous gradients. This makes the training process both faster and more stable.

The gradient computation described earlier and the optimization step together form what we could call the deep learning framework part of the code. Every neural network, whether it powers a language model, an image generator, a robot controller, or a self-driving car, is trained using essentially the same principle. These few lines of code capture the core idea behind modern AI.
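To see the Adam update in isolation, here is a sketch that applies the same formulas to a single plain float instead of a list of Value objects. The function name `adam_minimize`, the toy gradient, and all hyperparameter values are assumptions chosen for illustration; they are not the settings used in the gist.

```python
def adam_minimize(grad_fn, x0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, num_steps=1000):
    x, m, v = x0, 0.0, 0.0
    for step in range(num_steps):
        g = grad_fn(x)
        lr_t = learning_rate * (1 - step / num_steps)  # linear learning rate decay
        m = beta1 * m + (1 - beta1) * g                # running average of gradients
        v = beta2 * v + (1 - beta2) * g ** 2           # running average of squared gradients
        m_hat = m / (1 - beta1 ** (step + 1))          # bias correction for the early steps
        v_hat = v / (1 - beta2 ** (step + 1))
        x -= lr_t * m_hat / (v_hat ** 0.5 + eps)       # step size adapts to gradient history
    return x

# Toy problem: minimize (x - 3) ** 2, whose gradient is 2 * (x - 3).
x_final = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=-4.0)
print(x_final)
```

Dividing by the square root of `v_hat` is what makes the step size adaptive: parameters that have seen large gradients take proportionally smaller steps.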
Frameworks such as:
- TensorFlow
- PyTorch
- JAX

provide highly optimized implementations that can run on GPUs, TPUs, and distributed clusters, and they include many additional tricks and optimizations. But if we strip everything down to the basics, the underlying principle is exactly the same as what we see in this tiny Python implementation.

Now that we have seen the deep learning framework part of the code, we can move on to the actual neural network, in other words, the GPT model itself.

As mentioned earlier, the goal of GPT is to predict the next token based on the tokens that came before it. The earlier description was slightly simplified, because the input does not actually consist of words, but of tokens. Tokens are similar to words, but not quite the same thing. Instead of relying on a predefined dictionary, the system learns a vocabulary statistically from the training data. So training does not start from a fixed list of words in which each word has a number assigned to it. This is especially helpful for agglutinative languages such as Hungarian, where a single word can take many different forms because of suffixes and grammatical endings.

With enough training data, a language model can learn any language, whether natural or artificial. In fact, tokens do not even have to represent text. They can represent:
- parts of images
- fragments of audio
- sensor readings
- almost any type of data

Because of this, transformer models are not limited to language processing. They can also be used for image generation, speech processing, robotics, and many other tasks.

Karpathy's model is intentionally small, so in this case the tokens are not words but characters.
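Character-level tokenization can be sketched in a few lines. The three names below are a hypothetical stand-in for the real dataset, and the `encode` and `decode` helpers are my own illustration; wrapping each name in a special BOS token on both sides is an assumption about how sequence boundaries are marked.

```python
names = ["emma", "olivia", "ava"]  # hypothetical stand-in for the real dataset

uchars = sorted(set("".join(names)))  # unique characters become token ids 0..n-1
BOS = len(uchars)                     # one extra id marks a sequence boundary
char_to_id = {ch: i for i, ch in enumerate(uchars)}

def encode(name):
    """Turn a string into a list of integer token ids, wrapped in BOS markers."""
    return [BOS] + [char_to_id[ch] for ch in name] + [BOS]

def decode(token_ids):
    """Turn token ids back into a string, dropping the BOS markers."""
    return "".join(uchars[t] for t in token_ids if t != BOS)

tokens = encode("ava")
print(tokens)          # [7, 0, 6, 0, 7]
print(decode(tokens))  # ava
```

With this tiny dataset the vocabulary is the seven characters a, e, i, l, m, o, v plus the BOS token, so every name becomes a short list of integers between 0 and 7.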
The training dataset consists of a large list of names, and after training we expect the model to generate new names that statistically resemble the examples.

This is obviously far from a full-scale language model like ChatGPT. But in reality, the difference is mostly one of scale. If we scaled this model up millions of times, used tokens instead of characters, and trained it on massive datasets collected from the internet, we would get something close to a modern large language model.

These large models are typically trained in two phases:
- Pre-training: training on massive internet datasets to learn general language patterns.
- Fine-tuning: further training on curated, human-generated conversations to improve the responses.

The process requires enormous computational resources and huge amounts of high-quality data, often costing millions of dollars in compute. Since most of us do not have access to such resources, we will have to settle for generating names.

    import os
    import random

    # Let there be a Dataset `docs`: list[str] of documents (e.g. a list of names)
    if not os.path.exists('input.txt'):
        import urllib.request
        names_url = 'https://raw.githubusercontent.com/karpathy/makemore/988aa59/names.txt'
        urllib.request.urlretrieve(names_url, 'input.txt')
    docs = [line.strip() for line in open('input.txt') if line.strip()]
    random.shuffle(docs)
    print(f"num docs: {len(docs)}")

    # Let there be a Tokenizer to translate strings to sequences of integers ("tokens") and back
    uchars = sorted(set(''.join(docs))) # unique characters in the dataset become token ids 0..n-1
    BOS = len(uchars) # token id for a special Beginning of Sequence (BOS) token
    vocab_size = len(uchars) + 1 # total number of unique tokens, +1 is for BOS
    print(f"vocab size: {vocab_size}")

At the beginning of the code, we find the section responsible for loading the dataset of names and building the vocabulary. In this case, the vocabulary simply consists of the list of characters that appear in the dataset. Each character is assigned a numeric identifier, which allows the text to be converted into numbers that the neural network can process.

Here we meet one of the most important concepts in deep learning: embeddings. The idea is simple but very powerful. Instead of working with raw token IDs, we map each token to a point in a high-dimensional vector space. These vectors are what the neural network works with.

In fact, any neural network can be viewed as a function that maps vectors from one high-dimensional space into another. For example:
- If we train a neural network to classify images as dogs or cats, the network maps the image representation into a two-dimensional space, where one dimension corresponds to "dogness" and the other to "catness".
- If we imagine an image generator such as Midjourney, it maps random noise into a high-dimensional space where each point represents an image conditioned on the prompt.

Whatever the task, the network is always performing a vector-to-vector transformation using a large mathematical function. Naturally, the same is true for GPT.

The dimensionality of the vector space is defined in the code by the constant n_embd, which in this implementation is 16. This means that each token (in this case, each character) is represented as a 16-dimensional vector. Mathematically, this mapping is simply a matrix multiplication, which for a single token ID amounts to selecting one row of the matrix. In the code, the matrix responsible for this transformation is called wte, which stands for word/token embedding.

The identity of a token is not the only thing that matters; its position also matters. For example, the meaning of a sequence changes if we rearrange its characters. To capture positional information, the model uses positional embeddings, stored in a matrix called wpe.

The code maps both the token and its position into 16-dimensional vectors, and then simply adds the two vectors together. The result is a single vector that encodes both:
- the identity of the token
- its position within the sequence

Earlier, we mentioned that these vector representations must be meaningful, because later the model will compute errors based on distances between vectors. Ideally:
- Almost correct predictions should be close to the correct vector.
- Very wrong predictions should be far from it.

This raises an interesting question: how do we design a good embedding space? The solution is surprisingly simple: we don't.
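The token-plus-position lookup described above can be sketched as follows. The sizes, the `random_matrix` helper, and the `embed` function are illustrative assumptions (the article's model uses 16 dimensions; 4 keeps the example small). Note that both tables start out as nothing more than random numbers.

```python
import random

random.seed(0)
n_embd = 4        # embedding dimension (16 in the article's model)
vocab_size = 5    # hypothetical tiny vocabulary
block_size = 8    # hypothetical maximum sequence length

def random_matrix(rows, cols, std=0.08):
    """A table of small random numbers, one row per token id or position."""
    return [[random.gauss(0, std) for _ in range(cols)] for _ in range(rows)]

wte = random_matrix(vocab_size, n_embd)  # token embedding table
wpe = random_matrix(block_size, n_embd)  # position embedding table

def embed(token_id, pos_id):
    tok_emb = wte[token_id]  # row lookup: which token is this?
    pos_emb = wpe[pos_id]    # row lookup: where in the sequence is it?
    return [t + p for t, p in zip(tok_emb, pos_emb)]

x0 = embed(token_id=2, pos_id=0)
x1 = embed(token_id=2, pos_id=1)
print(x0)
print(x1)
```

The same token at two different positions produces two different vectors, which is how the network can tell "ab" apart from "ba".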
Rather than designing the embedding space by hand, we initialize the embedding matrices (wte and wpe) with random numbers and allow gradient descent to learn the correct representation during training. If we have enough data, the optimization process will gradually adjust the matrices until they represent useful relationships.

This can lead to surprisingly powerful emergent properties. For example, in the well-known word2vec embedding model, vector arithmetic can capture semantic relationships. A classic example is:

king − man + woman ≈ queen

Here we can already see that the embedding space begins to represent a kind of simplified model of the world, where relationships between concepts appear as geometric relationships between vectors.

Now that we have seen how the vectors are created, we can look at the neural network itself, the part that transforms these vectors into new vectors representing the next token. In other words, the network maps vectors from the token embedding space back into the same space, but shifted by one token. For each token, it predicts which token should follow next. By repeatedly applying this process, the model can generate an entire sequence of text or, in this case, a name.

The architecture used for this is called the Transformer. The Transformer was introduced in 2017 by researchers at Google in the famous paper "Attention Is All You Need." The original architecture was designed for machine translation. It consisted of two main parts:
- an encoder
- a decoder

The encoder processed the input sentence, while the decoder generated the translated output sentence. But for generative models like GPT, we only need half of the original architecture: the decoder stack. This is why GPT models are often described as decoder-only transformers.

The decoder receives the input tokens and repeatedly processes them through a stack of identical layers.
Each layer contains two main components:
- Self-attention
- A feed-forward neural network (MLP)

These layers are repeated many times in large models. In diagrams, you often see this represented as ×N, meaning the block is stacked multiple times.

One of the key innovations of the Transformer architecture is that it processes the entire sequence at once. Older language models, especially recurrent neural networks (RNNs), processed text one word at a time, sequentially passing information along the sequence. Transformers can see all tokens at the same time, which allows the model to learn relationships between any parts of the text. This mechanism is called attention, which is why the original paper was titled "Attention Is All You Need."

The attention mechanism calculates how relevant each token is to every other token in the sequence. For each token vector, the model computes a set of weights describing how much attention it should pay to the other tokens. It then combines the information from those tokens accordingly. The resulting vector therefore represents not only the token itself, but also its meaning in the context of the entire sequence.

This may sound complicated, but the intuition is straightforward. Imagine we ask a language model: "What is the capital of France?" If we looked only at the word "capital", we could not determine the answer. But attention allows the model to connect the word "capital" to "France". The resulting representation captures the meaning of the phrase "capital of France", making it possible for the model to produce the correct answer: Paris.

One way to think about transformers is to imagine them as a kind of soft database. Instead of storing explicit facts, the model stores knowledge in a vector space representation. Because neural networks approximate functions rather than memorize exact rules, they can often answer questions they have never seen before.
Returning to our earlier embedding example: if the training data contains information about kings and women, the model may still be able to answer questions about queens, because the relationships between these concepts are captured in the vector space.

If we follow this database analogy, we might say:
- Attention acts like an index, helping the model locate relevant information.
- The MLP layers hold the actual knowledge.

This mental model is useful for intuition, but it is not literally true. In a real transformer like the one used in ChatGPT, these attention + MLP blocks are repeated many times. Knowledge is not stored in a single location but is distributed across layers. Additionally, each layer contains a residual connection, which adds the original input vectors to the newly computed vectors.

As the vectors pass through the layers, new abstractions and meanings can emerge. By the time the final layer produces its output, the model has combined information from many different levels of representation. The whole process is far too complex to follow with step-by-step human intuition. Yet despite this complexity, the system works remarkably well in practice.

Now that we have a rough intuition about the transformer architecture, let's look at one of its most important components in more detail: attention.

    # 1) Multi-head Attention block
    x_residual = x
    x = rmsnorm(x)
    # project the current token vector into query, key, and value vectors
    q = linear(x, state_dict[f'layer{li}.attn_wq'])
    k = linear(x, state_dict[f'layer{li}.attn_wk'])
    v = linear(x, state_dict[f'layer{li}.attn_wv'])
    # remember this token's key and value so later tokens can attend to it
    keys[li].append(k)
    values[li].append(v)
    x_attn = []
    for h in range(n_head):
        hs = h * head_dim
        # each head works on its own slice of the vectors
        q_h = q[hs:hs+head_dim]
        k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
        v_h = [vi[hs:hs+head_dim] for vi in values[li]]
        # scaled dot product of the query with every stored key
        attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5 for t in range(len(k_h))]
        attn_weights = softmax(attn_logits)
        # weighted sum of the stored value vectors
        head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h))) for j in range(head_dim)]
        x_attn.extend(head_out)
    x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
    x = [a + b for a, b in zip(x, x_residual)]

In Karpathy's implementation, attention is implemented using three matrices called:
- Q (Query)
- K (Key)
- V (Value)

These matrices perform vector projections. In other words, each token vector is mapped into three different vector spaces. For every token we compute:
- a query vector
- a key vector
- a value vector

Once we have these vectors, we compare the query vector of a token with the key vectors of the tokens in the sequence (in this code, all tokens processed so far, which are collected in keys[li] and values[li]). Mathematically, this is done using a dot product. The dot product gives a score that represents how strongly two vectors are related. This produces a set of numbers that represent how relevant each token is to the current token.

These scores are then passed through the softmax function, which transforms the scores into values between 0 and 1 that sum to 1. These values represent how much attention each token should receive.
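The score, softmax, and weighted-sum steps can be tried out on plain floats. This is a single-head sketch with hand-picked toy vectors; the `attend` helper and its inputs are assumptions made for illustration, not code from the gist.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    d = len(query)
    # Dot product of the query with every key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output dimension at a time.
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, out

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, out = attend([2.0, 0.0], keys, values)
print(weights)
print(out)
```

The query points along the first axis, so the first and third keys (which both have a component along that axis) receive equal, larger weights, while the second key receives less attention.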
Finally, the model combines the value vectors using these attention weights, producing a new vector that contains information gathered from the entire context.

The formula for scaled dot-product attention looks like this:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Here:
- QKᵀ computes the similarity between queries and keys
- √dₖ is a scaling factor that stabilizes training
- softmax converts the scores into attention probabilities
- V provides the information that is combined according to those probabilities

The result is a new representation for each token that reflects its meaning in the context of the entire sequence.

At this point, it is worth mentioning an important concept: context length. As we discussed earlier, transformers process the entire sequence at once. This is necessary because attention requires comparing every token with every other token. That means the computational cost grows quadratically with the number of tokens. If we double the context length, the amount of computation increases roughly four times.

This is one of the main limitations of transformer models. Unlike some other architectures, transformers do not have a separate memory system. They can only "see" the tokens that fit within their context window. Everything outside that window is effectively invisible to the model. This is why context length is such an important characteristic of modern language models.

In many modern AI systems, this limitation is addressed by adding an external memory mechanism. A common approach is to use a vector database. Instead of storing knowledge directly in the model, information can be stored externally as vector embeddings. When the model receives a question, the system can:
1. Convert the question into a vector.
2. Search the vector database for related information.
3. Insert the retrieved information into the model's context.
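The three retrieval steps above can be sketched with nothing but the standard library. Everything here is a simplifying assumption: the `embed` function is a bag-of-words counter standing in for a real embedding model, the "vector database" is a plain Python list, and the documents are made up for the example.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

documents = [
    "Paris is the capital of France.",
    "The Adam optimizer adapts its step size.",
    "Tokens can be characters or word pieces.",
]
index = [(doc, embed(doc)) for doc in documents]  # the toy "vector database"

def build_prompt(question, top_k=1):
    q_vec = embed(question)                                   # 1. question -> vector
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    context = "\n".join(doc for doc, _ in ranked[:top_k])     # 2. nearest documents
    return f"Context:\n{context}\n\nQuestion: {question}"     # 3. insert into the context

prompt = build_prompt("What is the capital of France?")
print(prompt)
```

The language model never sees the database itself, only the prompt string, which is why every retrieved document consumes part of the context window.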
This means the model sees both:
- the question, and
- the relevant knowledge retrieved from the database.

Because both are placed in the context window, the model can produce an answer based on this information. This technique is known as Retrieval-Augmented Generation (RAG) and is widely used in modern AI systems and agent frameworks. In this setup, the language model's main role is not to store knowledge, but to generate coherent answers based on the information available in its context. But as we can see, this requires space in the context window, which is why context length remains so important.

Returning to Karpathy's implementation, the model uses multi-head attention, which is an improved form of the basic attention mechanism. Instead of computing attention using a single set of Q, K, and V matrices, the model uses multiple attention heads. In this implementation, there are four heads. Using more heads improves the quality of the representation.

To keep the computational cost roughly the same, the dimensionality of each head is reduced. With a single head, attention would map vectors from a 16-dimensional space to another 16-dimensional space. With four attention heads, each head instead works in a 4-dimensional space. The results from the heads are then combined back into a single vector. Although each head operates on lower-dimensional vectors, the combined result is usually more expressive and more effective than using a single attention head.

Now that we have covered the attention mechanism, let's move on to the second major component of the transformer block: the MLP, or feed-forward neural network.

In the code, the MLP block looks something like this:

    # 2) MLP block
    x_residual = x
    x = rmsnorm(x)
    x = linear(x, state_dict[f'layer{li}.mlp_fc1'])  # first linear transformation
    x = [xi.relu() for xi in x]                      # nonlinear activation
    x = linear(x, state_dict[f'layer{li}.mlp_fc2'])  # second linear transformation
    x = [a + b for a, b in zip(x, x_residual)]       # residual connection

An MLP is a classic neural network architecture. If we look at the structure of the weight matrices, we can think of the rows of a matrix as neurons. A neuron is a simple computing unit that:
1. multiplies each input by a weight,
2. sums the results,
3. then applies a nonlinear activation function to produce the output.

This model was originally inspired by biological neurons in the human brain. In that sense, neural networks were historically motivated by attempts to mimic how the brain might process information. However, modern AI systems have moved quite far from this original analogy. In the MLP components, we can still recognize something that resembles neurons. But when we think about mechanisms like attention, it becomes much harder to maintain the brain-inspired interpretation. Because of this, it is often better to think of modern AI systems simply as trainable mathematical functions, not as literal models of the brain.

The MLP block in the code consists of three main steps:
1. a linear transformation (matrix multiplication),
2. a nonlinear activation function (ReLU),
3. another linear transformation.

This may look simple, but structures like this have a very powerful mathematical property. They are known as universal approximators. This means that, under certain conditions, a sufficiently large MLP can approximate essentially any continuous mathematical function to an arbitrary degree of accuracy. In principle, a single huge MLP could learn almost anything.
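The three MLP steps are easy to trace by hand on plain floats. This sketch leaves out the rmsnorm call and the Value objects from the excerpt above, and the weight matrices are hand-picked hypothetical numbers rather than trained parameters.

```python
def linear(x, w):
    # Each row of w acts as one "neuron": a weighted sum of the inputs.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def relu(x):
    # Nonlinear activation: negative values are clipped to zero.
    return [max(0.0, xi) for xi in x]

def mlp_block(x, w_fc1, w_fc2):
    hidden = relu(linear(x, w_fc1))         # linear transformation + ReLU
    out = linear(hidden, w_fc2)             # second linear transformation
    return [a + b for a, b in zip(out, x)]  # residual connection adds the input back

# Hypothetical weights: 2 inputs -> 3 hidden neurons -> 2 outputs.
w_fc1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w_fc2 = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

y = mlp_block([2.0, -3.0], w_fc1, w_fc2)
print(y)  # [4.0, -2.0]
```

For the input [2, -3] the hidden layer computes [2, -3, 1], ReLU clips it to [2, 0, 1], the second layer yields [2, 1], and adding the input back gives [4, -2].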
In practice, learning everything with a single enormous MLP would be inefficient. This is why transformer architectures combine several mechanisms, including attention and stacked layers, to distribute the computation more effectively.

The output of the network is not a single token, but a probability distribution over all possible tokens. In other words, for each token in the vocabulary, the model outputs the probability that it should appear next in the sequence. During generation, the algorithm then samples from this probability distribution. This means that tokens with higher probability are more likely to be chosen, but there is still an element of randomness.

This randomness is controlled by a parameter called temperature. The temperature parameter determines how deterministic or creative the model's output is.
- Low temperature: the model strongly favors the most probable tokens, producing more predictable responses.
- High temperature: the probability distribution flattens, allowing lower-probability tokens to be chosen more often, which results in more diverse or creative outputs.

For example:
- If we want the model to analyze a document and answer factual questions, a low temperature is usually preferable.
- If we want the model to generate creative text or explore new ideas, a higher temperature can produce more interesting results.

This is roughly what I wanted to explain about this beautiful piece of code, and about GPT models in general. In many places, the explanation necessarily remained somewhat superficial. My goal was to strike a balance between two things:
- including as much useful insight as possible,
- while still keeping the discussion within the scope of a single article.

For readers who found parts of the explanation a little unclear, or who want to explore the details in more depth, I recommend Andrej Karpathy's personal website, his YouTube channel, and his blog. There you will find excellent material covering everything needed to understand the concepts discussed here.

I hope this article has been useful to many readers. If nothing else, may it serve as an invitation to explore the beautiful world of AI.