GPT in 200 Lines: The Beautiful Simplicity Behind Modern AI

Andrej Karpathy is a name that many people in the AI world will recognize. He spent years in academia at Stanford, later became the head of AI at Tesla, and also worked at OpenAI. In my opinion, he also produces some of the best educational AI videos available today.

His latest project, however, is something I can only describe as a work of art: microGPT.

microGPT is a complete GPT implementation written in only about 200 lines of Python code. And this is not some cleverly compressed or obfuscated code. On the contrary, the code is clean, well-structured, and thoroughly commented. It even avoids using external deep learning libraries. In other words, it not only contains the neural network itself, but also the minimal framework needed for training and running it.

In fact, the whole project does not even have a full repository: it was published as a single GitHub Gist.

https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95?embedable=true

In this article, I will try to explain how this beautiful piece of code works in a way that anyone with basic Python knowledge can understand.

Let's start with the name: GPT.

GPT stands for Generative Pretrained Transformer. This architecture is the basis of modern large language models. ChatGPT itself carries the acronym in its name, and most of its major alternatives today are built on the same basic idea.

A GPT model predicts the next word (or token) based on the previous words in the text.

This is what all language models do. They generate a response by predicting one word at a time: first they predict the next word, then they append that word to the text and predict the next word again, and so on.

At first this may seem simple, but if we think about it carefully, predicting the next word correctly requires some degree of understanding of the text and its meaning.

In a neural network, this "thinking" process is implemented as a very large mathematical function.

The input words are converted into numbers and fed into this function, which then computes the next word.

In principle, any system of rules can be expressed mathematically using such a function. So the task becomes finding a suitable function that captures the structure of language. This function is what we call the model.

The problem is that this function can be extremely complex. Modern language models have hundreds of billions of parameters, which means the mathematical formula describing them can have hundreds of billions of components. A formula of this size is impossible for humans to design or manage by hand.

Fortunately, there is a clever workaround. Instead of writing the exact formula ourselves, we define a "template" for the formula: a structure with a large number of adjustable parameters. These parameters can then be tuned automatically until the model performs well.

We have several such templates, depending on the type of problem we want to solve:

- Convolutional networks are widely used for image processing
- MLPs (multi-layer perceptrons) are used for general function approximation
- Transformers are used for language models such as ChatGPT
- and so on

These templates represent huge mathematical structures that can approximate complex rules once their parameters are properly adjusted.

The remaining question is: How do we adjust these parameters? Or, in other words: How do we "program" a neural network?

Manually adjusting the parameters of a neural network would obviously be impossible, especially when we are dealing with models that contain millions or even billions of parameters.

Fortunately, there is a way to do this automatically using data. This process is what we call training, and the method that makes it possible is called gradient descent.

If we have a large dataset and a large mathematical formula (our model), we can compute the model's error for each example in the dataset.

In the case of transformers, the process works roughly like this.
Words are represented as points in a high-dimensional vector space, in which words with similar meanings appear close to each other. Each word corresponds to a point in this space.

The model receives a sequence of words as input and tries to predict the next word. Because words are represented as vectors, we can compute how far the predicted word is from the correct one. This distance gives us a numerical measure of the error. (In practice, GPT-style models, microGPT included, express this error as a cross-entropy loss, the negative log of the probability the model assigned to the correct token, but the distance picture is a useful intuition.)

And now comes the magic. Using gradient descent, we can compute how the parameters of the formula should change in order to reduce this error.

If we have enough data, a sufficiently large network, and we train it for long enough, the formula gradually adapts to the patterns in the dataset and produces increasingly accurate predictions.

But how exactly does this "magic" work? How does gradient descent actually find better parameters?

We can picture the error as a function in a high-dimensional space, where each dimension corresponds to one parameter of the model.

Thinking in many dimensions is practically impossible for humans, so instead we can imagine a simpler analogy: a landscape of hills and valleys.

In this analogy:

- each point in the landscape represents a particular combination of the model's parameters
- the height of the landscape at that point represents the model's error

At the start of training, the parameters are initialized randomly. This is like being dropped somewhere random on a mountain.

Our goal is to minimize the error, which means finding the lowest point in the landscape.

The difficulty is that we do not know what the landscape looks like. It is as if we were trying to walk down a mountain blindfolded, or in thick fog.

So what can we do? This is where gradient descent comes in.

In mathematics there is a concept called the derivative, which describes the slope of a function at a given point. This is incredibly useful here: the derivative tells us which direction the landscape slopes downward.

So the algorithm does the following:

1. Measure the slope of the landscape.
2. Take a small step in the direction in which the error decreases.
3. Repeat.

Step by step, the model gradually walks down until it reaches a low point in the error landscape. This is what gradient descent does.

The remaining question is: how do we compute the slope of such a complex function?

I will not go into all the mathematical details here, because Karpathy has an excellent video explanation. For our purposes it is enough to know the following.

Every mathematical operation has a corresponding rule that lets us compute its local gradient. There is also a rule called the chain rule, which lets us combine these local gradients to compute the gradient of an entire composite function.

The key idea is that we move backward through the chain of operations, accumulating gradients along the way. This process is called backpropagation.

All major deep learning frameworks implement this mechanism.

For example, TensorFlow uses a system called GradientTape, which records operations like a tape recorder so that gradients can be computed afterwards.

PyTorch attaches gradient information directly to tensors. Each tensor remembers how it was created, which allows gradients to be computed automatically. This system is called autograd.

Karpathy's implementation follows the same basic idea. Let's look at how.

    import math  # needed by log() and exp() below

    # Let there be Autograd to recursively apply the chain rule through a computation graph
    class Value:
        __slots__ = ('data', 'grad', '_children', '_local_grads') # Python optimization for memory usage

        def __init__(self, data, children=(), local_grads=()):
            self.data = data                # scalar value of this node calculated during forward pass
            self.grad = 0                   # derivative of the loss w.r.t. this node, calculated in backward pass
            self._children = children       # children of this node in the computation graph
            self._local_grads = local_grads # local derivative of this node w.r.t. its children

        def __add__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            return Value(self.data + other.data, (self, other), (1, 1))

        def __mul__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            return Value(self.data * other.data, (self, other), (other.data, self.data))

        def __pow__(self, other): return Value(self.data**other, (self,), (other * self.data**(other-1),))
        def log(self): return Value(math.log(self.data), (self,), (1/self.data,))
        def exp(self): return Value(math.exp(self.data), (self,), (math.exp(self.data),))
        def relu(self): return Value(max(0, self.data), (self,), (float(self.data > 0),))
        def __neg__(self): return self * -1
        def __radd__(self, other): return self + other
        def __sub__(self, other): return self + (-other)
        def __rsub__(self, other): return other + (-self)
        def __rmul__(self, other): return self * other
        def __truediv__(self, other): return self * other**-1
        def __rtruediv__(self, other): return other * self**-1

        def backward(self):
            topo = []
            visited = set()
            def build_topo(v):
                if v not in visited:
                    visited.add(v)
                    for child in v._children:
                        build_topo(child)
                    topo.append(v)
            build_topo(self)
            self.grad = 1
            for v in reversed(topo):
                for child, local_grad in zip(v._children, v._local_grads):
                    child.grad += local_grad * v.grad

In Karpathy's code, the entire gradient computation logic is implemented in the Value class, which is only about 40 lines long.

This class is essentially a wrapper around a numeric value. In addition to storing the data itself, it also stores:

- the values it was computed from (its children), and
- the local gradients needed for backpropagation.

If we look at the code, we can see that the standard operators have been redefined: __add__, __mul__, and others.

This means that every time we perform a mathematical operation on Value objects, the program not only computes the result but also records which values produced it and how the gradient should propagate.

Finally, the backward() method performs the backpropagation process. It walks backward through the chain of operations and accumulates, for every value involved, the gradient of the error with respect to that value.

And that is the whole idea.

The other important component of training is gradient descent itself, which uses the computed gradients to update the model's parameters.

In other words, this is the part of the code that actually walks down the hill using the slope information.

    # Adam optimizer update: update the model parameters based on the corresponding gradients
    # (excerpt: learning_rate, beta1, beta2, eps_adam, m, v and params are defined earlier in the script)
    lr_t = learning_rate * (1 - step / num_steps) # linear learning rate decay
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0

In Karpathy's implementation this is done with the Adam optimizer, one of the most widely used optimization algorithms in deep learning.

Adam is more sophisticated than basic gradient descent. Instead of taking fixed-size steps, it adapts the step size dynamically based on the history of previous gradients, which makes the training process both faster and more stable. In the code above, m[i] is a running average of the gradient, v[i] is a running average of its square, and each update is scaled by m_hat divided by the square root of v_hat.

The gradient computation described earlier and the optimization step together form what we might call the deep learning framework part of the code.

Any neural network, whether it powers a language model, an image generator, a robot controller, or a self-driving car, is trained using the same principles. These few lines of code capture the core idea behind modern AI.

Large frameworks like:

- TensorFlow
- PyTorch
- JAX

provide highly optimized implementations that can run on GPUs, TPUs, and distributed clusters, and they include many additional techniques and optimizations. But if we strip everything down to the essentials, the underlying principle is exactly what we see in this small Python implementation.

Now that we have seen the deep learning framework part of the code, we can move on to the actual neural network, in other words, the GPT model itself.
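Before moving on, the "measure the slope, take a small step, repeat" loop described above can be illustrated with a minimal sketch. It minimizes a one-parameter error function whose derivative is written by hand. The function, the starting point, and the learning rate are assumptions chosen for illustration; they do not come from microGPT, which obtains its gradients from backward() and updates its parameters with Adam.

```python
def error(w):
    # Hypothetical error landscape with its lowest point at w = 3.
    return (w - 3.0) ** 2


def slope(w):
    # Derivative of error(w) with respect to w, derived by hand.
    return 2.0 * (w - 3.0)


def gradient_descent(w_start, learning_rate=0.1, num_steps=100):
    w = w_start
    for _ in range(num_steps):
        w -= learning_rate * slope(w)  # small step in the downhill direction
    return w


w_final = gradient_descent(w_start=-5.0)
print(w_final, error(w_final))
```

Each step shrinks the distance to the minimum by a constant factor, so the parameter settles very close to 3. With millions of parameters the idea is the same, except that the slope is a vector with one entry per parameter.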
As mentioned earlier, the goal of GPT is to predict what comes next based on what came before. Describing this in terms of words is a slight simplification, because the input does not consist of words but of tokens.

Instead of relying on a predefined dictionary, the system learns a vocabulary statistically from the training data. The tokens are character sequences that appear frequently in the dataset.

So training does not start with a fixed list of words, each with a preassigned number. Instead, this "vocabulary" emerges from the data itself during preprocessing.

This approach is especially useful for agglutinative languages such as Hungarian, where a single word can take many different forms because of affixes and grammatical endings.

With enough training data, a language model can learn any language, whether natural or artificial.

In fact, tokens do not have to represent text. They can represent:

- pieces of images
- pieces of audio
- sensor readings
- or almost any kind of data

This is why transformer models are not limited to language processing. They can also be used for image generation, speech processing, robotics, and many other tasks.

Karpathy's model is intentionally very small, so in this case, the tokens are not words but characters.

Accordingly, the goal of the model is not to generate questions or complete answers, but simply to produce realistic-looking names.

The training dataset consists of a large list of names, and after training, we expect the model to generate new names that statistically resemble the examples.

This is obviously far from a full-scale language model like ChatGPT, but in reality the difference is mostly one of scale. If we scaled this model up millions of times, used tokens instead of characters, and trained it on a huge dataset collected from the internet, we would end up with something resembling a modern large language model.

These large models are usually trained in two stages:

- Pretraining: training on massive internet datasets to learn general language patterns.
- Fine-tuning: additional training on human-created conversations to improve the quality of the responses.

This process requires enormous computational resources and large amounts of high-quality data, and it often costs millions of dollars in compute.

Since most of us do not have access to such resources, we will have to settle for generating names.

    import os      # used to check whether the dataset file already exists
    import random  # used to shuffle the documents

    # Let there be a Dataset `docs`: list[str] of documents (e.g. a list of names)
    if not os.path.exists('input.txt'):
        import urllib.request
        names_url = 'https://raw.githubusercontent.com/karpathy/makemore/988aa59/names.txt'
        urllib.request.urlretrieve(names_url, 'input.txt')
    docs = [line.strip() for line in open('input.txt') if line.strip()]
    random.shuffle(docs)
    print(f"num docs: {len(docs)}")

    # Let there be a Tokenizer to translate strings to sequences of integers ("tokens") and back
    uchars = sorted(set(''.join(docs))) # unique characters in the dataset become token ids 0..n-1
    BOS = len(uchars) # token id for a special Beginning of Sequence (BOS) token
    vocab_size = len(uchars) + 1 # total number of unique tokens, +1 is for BOS
    print(f"vocab size: {vocab_size}")

At the beginning of the code we find the part responsible for loading the dataset of names and building the vocabulary.

In this case, the vocabulary simply consists of the list of characters that appear in the dataset, plus one special BOS (Beginning of Sequence) token. Each character is assigned a numerical identifier, allowing the text to be converted into numbers that the neural network can process.

Next comes one of the most important concepts in deep learning: embeddings.

The idea is simple but powerful. Instead of working with raw token IDs, we map each token to a point in a high-dimensional vector space. These vectors are what the neural network actually processes.

In fact, any neural network can be viewed as a function that maps vectors from one high-dimensional space into another.

For example, if we train a neural network to classify images as dogs or cats, the network maps the image representation into a two-dimensional space where one dimension corresponds to "dog" and the other to "cat".

If we imagine an image generator like Midjourney, it maps random noise into a high-dimensional space where each point represents an image conditioned on the prompt.

Whatever the task, the network always performs a vector-to-vector transformation using a large mathematical function.

The same is true for GPT.

The dimensionality of the vector space is defined in the code by the constant n_embd, which in this implementation is set to 16.

This means that each token (in this case, each character) is represented as a 16-dimensional vector.

Mathematically, this mapping is just a matrix multiplication, which in practice amounts to looking up the row of the matrix that belongs to the token.

In the code, the matrix responsible for this transformation is called wte, which stands for word/token embedding.

However, knowing which characters appear in a word is not enough. Their position also matters. For example, the meaning of a sequence changes if we rearrange the characters.

To incorporate positional information, the model uses positional embeddings, implemented using another matrix called wpe.

The code maps both the token and its position to 16-dimensional vectors, and then simply adds the two vectors together.

The result is a single vector that encodes both:

- the identity of the token
- its position within the sequence

Earlier, we mentioned that these vector representations must be meaningful, because later the model will compute errors based on distances between vectors.

Ideally:

- an almost correct prediction should lie close to the correct vector
- a very wrong prediction should lie far from it

This raises an interesting question: How do we design a good embedding space?

The answer is surprisingly simple: We don't.

Instead, we initialize the embedding matrices (wte and wpe) with random numbers and allow gradient descent to learn the correct representation during training.

If we have enough data, the optimization process gradually adjusts the matrices until they express useful relationships.

This can lead to surprisingly powerful emergent properties. For example, in the well-known word2vec embedding model, vector arithmetic can capture semantic relationships. A classic example is:

king − man + woman ≈ queen

Here we can already see that the embedding space starts to represent a kind of simplified model of the world, where relationships between concepts appear as geometric relationships between vectors.

Now that we have seen how the vectors are created, we can look at the neural network itself, the component that transforms these vectors into new vectors representing the next token.

In other words, the network maps vectors from the token embedding space back into the same space, but shifted by one token. For each token, it predicts which token should follow.

By applying this process repeatedly, the model can generate an entire text sequence, or in this case, a name.

The architecture used for this is called the Transformer.

The Transformer was introduced in 2017 by researchers at Google in the famous paper "Attention Is All You Need."

The original architecture was designed for machine translation. It consisted of two main parts:

- an encoder
- a decoder

The encoder processed the input sentence, while the decoder generated the translated output sentence.

However, for generative models such as GPT, we only need half of the original architecture: the decoder stack.

This is why GPT models are often described as decoder-only transformers.

The decoder receives the input tokens and repeatedly processes them through a stack of identical layers. Each layer contains two main components:

- self-attention
- a feed-forward neural network (MLP)

These layers are repeated many times in large models. In diagrams, you often see this represented as ×N, meaning the block is stacked multiple times.

One of the key innovations of the Transformer architecture is that it processes the entire sequence at once.

Older language models, especially recurrent neural networks (RNNs), processed text one word at a time and passed information along the sequence step by step.

Transformers work differently. They can look at all tokens simultaneously, allowing the model to learn relationships between any parts of the text.

This mechanism is called attention, which is why the original paper was titled Attention Is All You Need.

The attention mechanism calculates how relevant each token is to every other token in the sequence.

For each token vector, the model computes a set of weights describing how much attention it should pay to the other tokens. It then combines the information from those tokens accordingly.

The resulting vector therefore represents not only the token itself, but also its meaning in the context of the entire sequence.

This may sound complicated, but the intuition is simple.

Suppose we ask a language model: "What is the capital of France?"

If we only looked at the word "capital", we could not determine the answer. But attention allows the model to connect the word "capital" with "France".

The resulting representation captures the meaning of the phrase "capital of France", making it possible for the model to produce the correct answer: Paris.

One way to think about transformers is to imagine them as a kind of soft database.

Instead of storing explicit facts, the model stores knowledge in a vector space representation. Because neural networks approximate functions rather than memorize exact rules, they can often answer questions they have never seen before.

Going back to our earlier embedding example: if the training data contains information about kings and women, the model may still be able to answer questions about queens, because the relationships between these concepts are captured in the vector space.

If we follow this database analogy, we might say:

- Attention acts like an index, helping the model find the relevant information.
- The MLP layers contain the knowledge itself.

This mental model is useful as an intuition, but it is not strictly accurate.

In a real transformer like the one used in ChatGPT, these attention + MLP blocks are repeated many times. Knowledge is not stored in a single location but is distributed across layers.

Additionally, each layer includes a residual connection, which mixes the original input vectors with the newly computed vectors. This allows information to flow through the network more effectively and stabilizes training.

As the vectors pass through the layers, new abstractions and meanings can emerge. By the time the final layer produces its output, the model has combined information from many different levels of representation.

The whole process is far too complex to follow step by step with human intuition. Yet despite this complexity, the system works remarkably well in practice.

Now that we have a rough intuition about the transformer architecture, let's look at one of its most important components in more detail: attention.

    # 1) Multi-head Attention block
    # (excerpt from the per-layer loop: li is the layer index, x the current token's vector)
    x_residual = x
    x = rmsnorm(x)
    q = linear(x, state_dict[f'layer{li}.attn_wq'])
    k = linear(x, state_dict[f'layer{li}.attn_wk'])
    v = linear(x, state_dict[f'layer{li}.attn_wv'])
    keys[li].append(k)
    values[li].append(v)
    x_attn = []
    for h in range(n_head):
        hs = h * head_dim
        q_h = q[hs:hs+head_dim]
        k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
        v_h = [vi[hs:hs+head_dim] for vi in values[li]]
        attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5 for t in range(len(k_h))]
        attn_weights = softmax(attn_logits)
        head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h))) for j in range(head_dim)]
        x_attn.extend(head_out)
    x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
    x = [a + b for a, b in zip(x, x_residual)]

In Karpathy's implementation, attention is calculated using three matrices (attn_wq, attn_wk, and attn_wv in the code) called:

- Q (Query)
- K (Key)
- V (Value)

These matrices perform vector projections.
In other words, each token vector is mapped into three different vector spaces. For every token we compute:

- a query vector
- a key vector
- a value vector

Once we have these vectors, we compare the query vector of one token with the key vectors of all tokens in the sequence (in a GPT-style decoder, the tokens seen so far).

Mathematically, this is done using the dot product. The dot product gives a score that indicates how strongly two vectors are related.

This produces a set of numbers showing how relevant each token is to the current token.

However, these raw scores are not yet probabilities. To convert them into a probability distribution, we apply the softmax function, which transforms the scores into values between 0 and 1 that sum to 1.

These values represent how much attention each token should receive.

Finally, the model combines the value vectors using these attention weights, producing a new vector that contains information gathered from the entire context.

The formula for scaled dot-product attention looks like this:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$

Here:

- $QK^{\top}$ computes the similarity between queries and keys
- $\sqrt{d_k}$ is a scaling factor that stabilizes training
- softmax converts the scores into attention probabilities
- $V$ provides the information that is combined according to those probabilities

The result is a new representation for each token that reflects its meaning in the context of the entire sequence.

At this point, it is worth mentioning an important concept: context length.

As we mentioned earlier, transformers process the whole sequence at once. This is necessary because attention has to compare every token with every other token.

That means the computational cost grows quadratically with the number of tokens. If we double the context length, the amount of computation increases roughly four times.

This is one of the main limitations of transformer models.

Unlike some other architectures, transformers do not have a separate memory system. They can only "see" the tokens that fit within their context window.
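The attention formula and the quadratic cost can both be made concrete with a small sketch. The plain-Python function below is an illustration written for this explanation, not code from microGPT: it applies scaled dot-product attention to a short sequence without causal masking, and it counts the query-key dot products along the way. The function names and the toy input vectors are assumptions.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns the output vectors and the number of query-key dot products.
    """
    d_k = len(keys[0])
    outputs, comparisons = [], 0
    for q in queries:
        scores = []
        for k in keys:
            scores.append(sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k))
            comparisons += 1
        weights = softmax(scores)  # non-negative values that sum to 1
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs, comparisons


# Toy sequence of 4 tokens with 2-dimensional vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs_short, cost_short = attention(tokens, tokens, tokens)
outputs_long, cost_long = attention(tokens * 2, tokens * 2, tokens * 2)
print(cost_short, cost_long)  # 16 and 64 dot products
```

Doubling the sequence from 4 to 8 tokens quadruples the number of comparisons. A GPT-style decoder only attends to earlier tokens, which roughly halves the count but does not change the quadratic growth.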
Everything outside that window is effectively invisible to the model. This is why context length is such an important property of modern language models.

In many modern AI systems, this limitation is addressed by adding an external memory mechanism. A common approach is to use a vector database.

Instead of storing knowledge directly in the model, information can be stored externally as vector embeddings.

When the model receives a question, the system can:

1. Convert the question into a vector.
2. Search the vector database for related information.
3. Insert the retrieved information into the model's context.

This means the model sees both:

- the question
- the relevant knowledge retrieved from the database

Because both appear in the context window, the model can generate an answer based on this information.

This technique is known as Retrieval-Augmented Generation (RAG) and is widely used in modern AI systems and agent frameworks.

In this setup, the main role of the language model is not to store knowledge, but to generate a coherent answer based on the information available in its context.

But as we can see, this requires space in the context window, which is why context length remains so important.

Returning to Karpathy's implementation, the model uses multi-head attention, an extended form of the basic attention mechanism.

Instead of computing attention with a single set of Q, K, and V matrices, the model uses multiple attention heads.

In this implementation, there are four heads.

Each head learns to focus on different relationships between tokens. For example, one head might focus on grammatical relationships, while another might capture long-range dependencies.

Using multiple heads improves the quality of the representation.

To keep the computational cost roughly the same, the dimensionality of each head is reduced.

Earlier, we mapped vectors from a 16-dimensional space to another 16-dimensional space.

With four attention heads, each head instead works in a 4-dimensional space. The results from the heads are then combined back into a single vector. In the attention code, each head works on its own head_dim-sized slice of the q, k, and v vectors, and the head outputs are concatenated before the final attn_wo projection.

Although each head works with lower-dimensional vectors, the combined result is generally more expressive and more accurate than using a single attention head.

Now that we have covered the attention mechanism, let's move on to the second main component of the transformer block: the MLP, or feed-forward neural network.

In the code, the MLP block looks like this:

    # 2) MLP block
    x_residual = x
    x = rmsnorm(x)
    x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
    x = [xi.relu() for xi in x]
    x = linear(x, state_dict[f'layer{li}.mlp_fc2'])
    x = [a + b for a, b in zip(x, x_residual)]

An MLP is a classic neural network architecture. If we look at the structure of the weight matrices, we can interpret the rows of the matrix as neurons.

A neuron is a simple computational unit that:

- multiplies each input by a weight,
- sums the results,
- and then applies a nonlinear activation function to produce its output.

This model was originally inspired by biological neurons in the human brain. In this sense, neural networks were historically motivated by attempts to imitate how the brain processes information.

However, modern AI systems have moved quite far from this original analogy.

In the MLP component we can still recognize something resembling neurons, but when we look at mechanisms such as attention, it becomes much harder to maintain the brain-inspired interpretation.

For this reason, it is often better to think about modern AI systems simply as trainable mathematical functions rather than as literal models of the brain.

The MLP block in the code consists of three main steps:

- a linear transformation (matrix multiplication),
- a nonlinear activation function (ReLU),
- another linear transformation.

This may look simple, but structures like this have very powerful mathematical properties. They are known as universal approximators.

This means that, under certain conditions, a sufficiently large MLP can approximate any continuous function to any desired degree of accuracy.

In principle, a single enormous MLP could learn almost anything.
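The three steps of the MLP block (linear, ReLU, linear) can be sketched with a tiny hand-built network. The example below is an illustration, not code from microGPT: the linear and relu helpers are simplified stand-ins that operate on plain floats, and the weights are chosen by hand rather than learned. They are picked so that the network computes the absolute value of its input, something no single linear transformation can do.

```python
def linear(x, w):
    # Each row of w plays the role of one neuron: a weighted sum of the inputs.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(x):
    # Nonlinear activation: negative values are clipped to zero.
    return [max(0.0, xi) for xi in x]


# Hand-picked weights (assumed for illustration): 1 input -> 2 hidden -> 1 output.
W1 = [[1.0], [-1.0]]   # the two hidden neurons compute x and -x
W2 = [[1.0, 1.0]]      # the output neuron adds the two hidden activations


def mlp(x):
    hidden = relu(linear(x, W1))
    return linear(hidden, W2)


print(mlp([3.0]), mlp([-2.0]))  # [3.0] [2.0]
```

Without the ReLU in the middle, the two matrices would collapse into a single linear map and the output would always be zero here. The nonlinearity is what lets stacked layers represent more than a straight line.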
In practice, however, such a network would be very inefficient. That is why the transformer architecture combines several mechanisms, including attention and stacked layers, to distribute the computation more efficiently.

The output of the network is not a single token but a probability distribution over all possible tokens.

In other words, for each token in the vocabulary, the model outputs the probability that it should appear next in the sequence.

During generation, the algorithm then samples from this probability distribution. This means that tokens with higher probability are more likely to be chosen, but there is still an element of randomness.

This randomness is controlled by a parameter called temperature.

The temperature parameter determines how deterministic or creative the model's output will be.

- Low temperature: the model strongly favors the most likely tokens, which leads to more predictable and precise responses.
- High temperature: the probability distribution becomes flatter, so less likely tokens are chosen more often, which leads to more varied or creative output.

For example:

- If we want the model to analyze a document and answer factual questions, a low temperature is usually preferable.
- If we want the model to generate creative text or explore new ideas, a higher temperature can produce more interesting results.

That is everything I wanted to explain about this beautiful piece of code and about GPT models in general.

In many places the explanation necessarily stayed somewhat on the surface. My goal was to strike a balance between two things: including as much useful insight as possible while keeping the discussion within the scope of a single article.

For readers who found parts of the explanation a bit unclear, or who want to explore the details more deeply, I highly recommend Andrej Karpathy's personal website, his YouTube channel, and his blog. There you will find excellent material that explains everything needed to fully understand the concepts discussed here.

I hope this article was useful to many readers. If nothing else, perhaps it serves as an invitation to explore the fascinating world of AI.
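As a postscript, the effect of the temperature parameter discussed above can be sketched in a few lines. The sketch divides the raw scores (logits) by the temperature before applying softmax, which is the common formulation. The logits, the function names, and the three-token vocabulary are made up for illustration, and whether any particular implementation samples in exactly this way is an assumption.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    # Dividing by the temperature sharpens (T < 1) or flattens (T > 1) the distribution.
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    probs = softmax_with_temperature(logits, temperature)
    # random.choices draws one index according to the given weights.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary
cold = softmax_with_temperature(logits, 0.5)
hot = softmax_with_temperature(logits, 2.0)
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
print(sample_token(logits, temperature=1.0))
```

At the lower temperature most of the probability mass sits on the highest-scoring token, while at the higher temperature the three probabilities move closer together, so the sampled output varies more from run to run.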