GPT in 200 Lines: The Simplicity Behind the Technology

Andrej Karpathy is a name that many people in the AI world know. He did his PhD at Stanford, where he also taught, then led the AI team at Tesla, worked at OpenAI, and, in my opinion, makes some of the best AI training videos currently available. His latest project, however, is something that can be appreciated as a small work of art in its own right: microGPT.

microGPT is a complete GPT implementation written in only about 200 lines of Python code. In fact, the whole project doesn't even have a full repository; it is simply published as a single GitHub Gist.

https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95?embedable=true

In this article, I will try to explain how this beautiful piece of code works in a way that anyone with basic Python knowledge can understand.

Let's start with the name. GPT stands for Generative Pretrained Transformer. This architecture is the foundation of most modern large language models. ChatGPT itself carries the acronym in its name, and most of today's major alternatives are built on the same underlying idea.

A GPT model predicts the next word (or token) based on the previous words in a text. This is what all language models do. They predict one token at a time: first they predict the next token, then they append that token to the text and predict the one after it, and so on.

At first glance, this may seem simple. But to make good predictions, predicting the next token requires a certain level of understanding of the text and its meaning.

In a neural network, this "understanding" is implemented as a very large mathematical function. The input words are converted into numbers and fed into this function, which then predicts the next token. In principle, any intelligent system could be described mathematically by such a function. This function is what we call the model.

This function can be extremely complex. Modern language models contain billions of parameters, which means the mathematical formula that describes them can have billions of terms.

Fortunately, there is a clever workaround. Instead of writing the exact formula ourselves, we create a "template": a structure with many adjustable parameters. These parameters can then be tuned automatically until the model works well.
We have several such templates, depending on the type of problem we want to solve:

- Convolutional networks are typically used for image processing.
- MLPs (multi-layer perceptrons) are used for general function approximation.
- Transformers are used for language models such as ChatGPT.
- And so on.

These models are large mathematical structures whose parameters can be adjusted to fit the task at hand. The next question is: how do we adjust these parameters? Or, put differently: how do we "program" a neural network?

Setting the parameters of a neural network by hand is practically impossible, especially when there are millions or billions of them. Fortunately, there is a way to do this automatically, using data. This process is what we call training, and the method that makes it possible is gradient descent.

If we have a large dataset and a large mathematical formula (our model), we can run the model on every example in the dataset and measure how wrong its prediction is.

In the case of transformers, the process works roughly like this. Tokens are represented as points in a high-dimensional vector space, where each token corresponds to one point. The model receives tokens as input and tries to predict the next token. The error measures how far the prediction is from the token that actually comes next.

But now comes the magic. Using gradient descent, we can work out how to adjust the model's parameters so that this error decreases. Given enough data, a sufficiently large network, and enough training time, the formula adapts to the patterns in the data and makes accurate predictions most of the time.

But how exactly does this "magic" work? How does gradient descent actually find better parameters?

We can visualize the error as a function in a high-dimensional space, where each dimension corresponds to one parameter of the model. Nobody can really picture that many dimensions, so we can use a simpler analogy: a landscape of hills and valleys. In this analogy:

- Each point on the surface corresponds to one particular setting of the model parameters.
- The height of the terrain at that point represents the model's error.

At the start of training, the parameters are initialized randomly. This is like being dropped at some random place in the landscape. Our goal is to minimize the error, which means finding the lowest point in the landscape.
This is where the derivative comes in. In mathematics, the derivative describes the slope of a function at a given point. That is exactly what we need here: derivatives tell us which direction the landscape slopes downward.

The algorithm simply does the following:

- Compute the gradient (the slope) at the current position.
- Take a small step in the direction where the error decreases.

Repeating this step by step, the model moves across the landscape until it reaches a low point. This is essentially what gradient descent does.

The next question is: how do we differentiate such a complex function? I won't go into the full mathematical details here, because Karpathy already has an excellent video on the topic. For our purposes, it is enough to know the following: every elementary mathematical operation has a known derivative, and the chain rule lets us combine these local derivatives into the derivative of the whole computation.

All major deep learning frameworks implement this mechanism. For example:

- TensorFlow uses a system called GradientTape, which records operations as they are executed so that they can be differentiated afterwards.
- PyTorch stores the gradient information directly on the tensors. Each tensor remembers how it was created, which makes it possible to propagate gradients backward through the operations that produced it.

Karpathy's implementation follows the same basic idea. Let's see how.

```python
import math  # needed by Value.log() and Value.exp()

# Let there be Autograd to recursively apply the chain rule through a computation graph
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')  # Python optimization for memory usage

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                 # scalar value of this node calculated during forward pass
        self.grad = 0                    # derivative of the loss w.r.t. this node, calculated in backward pass
        self._children = children        # children of this node in the computation graph
        self._local_grads = local_grads  # local derivative of this node w.r.t. its children

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    # Each operation returns a new node plus the local derivative w.r.t. its inputs
    def __pow__(self, other): return Value(self.data**other, (self,), (other * self.data**(other-1),))
    def log(self): return Value(math.log(self.data), (self,), (1/self.data,))
    def exp(self): return Value(math.exp(self.data), (self,), (math.exp(self.data),))
    def relu(self): return Value(max(0, self.data), (self,), (float(self.data > 0),))

    # The remaining operators are expressed through the ones above
    def __neg__(self): return self * -1
    def __radd__(self, other): return self + other
    def __sub__(self, other): return self + (-other)
    def __rsub__(self, other): return other + (-self)
    def __rmul__(self, other): return self * other
    def __truediv__(self, other): return self * other**-1
    def __rtruediv__(self, other): return other * self**-1

    def backward(self):
        # Build a topological ordering of the graph that ends at this node
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        # Walk the graph in reverse and apply the chain rule at every node
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad
```

In Karpathy's code, all of the gradient computing logic lives in a single class called Value, which is only about 40 lines long.

This class is essentially a wrapper around a number. Besides storing the data, it also stores what is needed to compute the gradient: the nodes the value was computed from and the local derivatives with respect to them.

If you look at the code, you will find the standard operations defined: addition, multiplication, and others. When you use one of these operations, the program not only computes the result but also records the operands and how the gradient flows through them.

Finally, the backward() method performs the backpropagation. It walks through the computation graph in reverse order and accumulates the gradient at every node.
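To make the backward pass less abstract, here is a minimal sketch of the same idea on a two-input expression. The `Scalar` class is a hypothetical, stripped-down stand-in for `Value` (only addition and multiplication between nodes), written for illustration rather than copied from the gist.

```python
class Scalar:
    """Hypothetical, simplified stand-in for the Value class (add and mul only)."""

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        return Scalar(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a * b)/da = b and d(a * b)/db = a
        return Scalar(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topological order guarantees a node is processed after everything that uses it
        topo, visited = [], set()

        def build(node):
            if node not in visited:
                visited.add(node)
                for child in node._children:
                    build(child)
                topo.append(node)

        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            for child, local_grad in zip(node._children, node._local_grads):
                child.grad += local_grad * node.grad  # chain rule


a = Scalar(2.0)
b = Scalar(3.0)
loss = a * b + a  # loss = a*b + a
loss.backward()

print(loss.data)  # 8.0
print(a.grad)     # 4.0, because d(loss)/da = b + 1
print(b.grad)     # 2.0, because d(loss)/db = a
```

Note that `a` is used twice in the expression, which is why gradients are accumulated with `+=` rather than assigned.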
In essence, that is the whole trick. The only other essential piece is using the computed gradients to update the model parameters, which is the gradient descent step itself. This part of the code is what moves us across the landscape using the slope information.

```python
# Adam optimizer update: update the model parameters based on the corresponding gradients
# (excerpt: params, m, v, step and the hyperparameters come from the surrounding training loop)
lr_t = learning_rate * (1 - step / num_steps)  # linear learning rate decay
for i, p in enumerate(params):
    m[i] = beta1 * m[i] + (1 - beta1) * p.grad       # running average of the gradient
    v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2  # running average of the squared gradient
    m_hat = m[i] / (1 - beta1 ** (step + 1))         # bias correction for the early steps
    v_hat = v[i] / (1 - beta2 ** (step + 1))
    p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
    p.grad = 0                                       # reset the gradient for the next step
```

In Karpathy's implementation, this is done with the Adam optimizer, which is one of the most widely used optimization algorithms in deep learning. Adam is a refinement of plain gradient descent that keeps running averages of the gradient and of its square, which usually makes training faster and more stable.

The gradient computation described earlier, together with this optimization step, is the part of the code that can be thought of as the deep learning framework.

Every neural network, whether it powers a language model, an image generator, a robot controller, or a self-driving car, is trained using essentially the same principle. These few lines of code capture the core idea behind modern AI.

Large frameworks such as TensorFlow, PyTorch, and JAX provide highly optimized implementations that can run on GPUs, TPUs, and distributed clusters, and they include many additional tricks and optimizations. But if we strip everything down to the essentials, the underlying principle is exactly the same as what we see in this tiny Python implementation.

Now that we have seen the deep learning framework part of the code, we can move on to the actual neural network: the GPT model. As mentioned at the start, the job of a GPT is to predict the next word.
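To get a feel for the Adam update shown above, it can be tried on a single parameter. The sketch below reuses the same update formulas with plain floats; the toy objective f(p) = (p - 3)^2 and all hyperparameter values are assumptions chosen for illustration, not values taken from the gist.

```python
# Toy objective (an assumption for illustration): f(p) = (p - 3)^2, minimum at p = 3
learning_rate = 0.1
beta1, beta2, eps_adam = 0.9, 0.999, 1e-8
num_steps = 200


def run_adam(start=0.0):
    p = start
    m = 0.0  # running average of the gradient
    v = 0.0  # running average of the squared gradient
    for step in range(num_steps):
        grad = 2 * (p - 3)  # analytic derivative of (p - 3)^2
        lr_t = learning_rate * (1 - step / num_steps)  # linear learning rate decay
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** (step + 1))  # bias correction
        v_hat = v / (1 - beta2 ** (step + 1))
        p -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
    return p


p_final = run_adam()
print(f"parameter after training: {p_final:.4f}")  # should end up close to 3
```

In a real model the only difference is that `grad` comes from `backward()` instead of a hand-written derivative, and the loop runs over every parameter.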
Saying that a GPT predicts the next word is a slight simplification, because the model actually works not with words but with tokens. Tokens are typically fragments of words, so processing starts by splitting the text into tokens.

This approach is especially useful for agglutinative languages such as Hungarian, where a single root word can take many different forms because of suffixes and grammatical endings. With good enough training data, a language model can learn any language.

In fact, tokens do not even have to represent text. They can represent:

- image patches
- audio segments
- sensor readings
- any other kind of data

Because of this, transformer models are not limited to language processing. They can also be used for image generation, speech processing, robotics, and many other tasks.

Karpathy's model is very small, so in this case the tokens are not words but individual characters. The model's goal is therefore not to generate full sentences or answers, but simply to produce realistic-looking names.

The training dataset consists of a large list of names, and after training the model is able to generate new names that are statistically similar to the ones in the dataset.

This is of course far from a full-scale language model like ChatGPT. But in reality, the difference is mostly a matter of scale. If we scaled this model up a million times, used word-piece tokens instead of character tokens, and trained it on an internet-scale dataset, we would get something very similar to today's large language models.

These large models are generally built in two stages:

- Pretraining: training on internet-scale text to build general language capabilities.
- Fine-tuning: using curated human feedback to improve the answers.

The process requires enormous computational resources and huge amounts of high-quality data, often costing millions of dollars in compute. Since most of us cannot afford such resources, we will stick to generating names.

```python
import os
import random

# Let there be a Dataset `docs`: list[str] of documents (e.g. a list of names)
if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/988aa59/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')
docs = [line.strip() for line in open('input.txt') if line.strip()]
random.shuffle(docs)
print(f"num docs: {len(docs)}")

# Let there be a Tokenizer to translate strings to sequences of integers ("tokens") and back
uchars = sorted(set(''.join(docs)))  # unique characters in the dataset become token ids 0..n-1
BOS = len(uchars)                    # token id for a special Beginning of Sequence (BOS) token
vocab_size = len(uchars) + 1         # total number of unique tokens, +1 is for BOS
print(f"vocab size: {vocab_size}")
```

At the beginning of the code, we find the section responsible for loading the dataset of names and building the vocabulary. In this case, the vocabulary is simply the list of characters that appear in the dataset. Each character is assigned a numeric ID, allowing text to be converted into numbers that the neural network can process.

Next comes one of the most important concepts in deep learning: embeddings.

The idea is simple but powerful. Instead of using the token IDs directly, each token ID is mapped to a point in a high-dimensional vector space. These vectors are what the neural network actually works with.

In fact, any neural network can be thought of as a function that maps vectors from one high-dimensional space into another. For example:

- If we train a neural network to sort images into two classes, the network maps the image vector into a two-dimensional space, where one dimension scores the first class and the other dimension scores the second.
- If we take an image generator such as Midjourney, the network maps an input vector to a point in a very high-dimensional space, where each point corresponds to an image.

In every case, what happens is a vector-to-vector transformation using a large mathematical function. The same is true for GPT.

The dimensionality of the vector space is set in the code by n_embd, which in this implementation is set to 16. Every token (in this case, every character) is represented as a 16-dimensional vector.

The embedding itself is just a matrix multiplication (equivalently, looking up one row of the embedding matrix).
In this code, the matrix that performs this transformation is called wte, which stands for word/token embedding.

However, knowing which characters appear in a word is not enough. Their position matters too. The same letters in a different order spell a different word.

To capture positional information, the model uses positional embeddings, stored in another matrix called wpe.

The code maps both the token and its position into 16-dimensional vectors, and then simply adds the two vectors together. The result is a single vector that encodes both:

- the identity of the token
- its position within the sequence

Ideally, these vector representations should be meaningful, because the model later makes its predictions based on distances between vectors. In the ideal case:

- Almost correct predictions should be close to the correct vector.
- Completely wrong predictions should be far away from it.

This raises an interesting question: how do we design a good embedding space?

The answer is very simple: we don't. Instead, we initialize the embedding matrices (wte and wpe) with random numbers and let gradient descent learn the right representation during training.

This can lead to surprisingly powerful emergent properties. For example, in the famous word2vec embedding model, vector arithmetic can capture semantic relationships. A classic example is:

king − man + woman ≈ queen

Here we can already see that the embedding space begins to represent a kind of simplified model of the world, where relationships between concepts appear as geometric relationships between vectors.

Now that we have seen how vectors are created, we can finally look at the neural network itself, the component that transforms these vectors into new vectors representing the next token.

The architecture used for this is called the Transformer.
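Going back to the embedding step for a moment, the lookup-and-add described above can be sketched in a few lines. The three-name dataset, the helper names (`stoi`, `random_matrix`, `embed`) and the small sizes are assumptions made for illustration; the gist uses its own dataset, a BOS token, and n_embd = 16.

```python
import random

random.seed(0)

docs = ["emma", "olivia", "ava"]               # tiny stand-in dataset (assumption)
uchars = sorted(set("".join(docs)))            # unique characters become token ids
stoi = {ch: i for i, ch in enumerate(uchars)}  # hypothetical helper: character -> token id

n_embd = 4      # embedding size, kept small so the vectors are easy to print
block_size = 8  # maximum sequence length supported by the position table


def random_matrix(rows, cols):
    """Small random numbers; training would later adjust every entry."""
    return [[random.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]


wte = random_matrix(len(uchars), n_embd)  # one row per token
wpe = random_matrix(block_size, n_embd)   # one row per position


def embed(word):
    """Return one vector per character: token embedding + position embedding."""
    tokens = [stoi[ch] for ch in word]
    return [
        [t + p for t, p in zip(wte[tok], wpe[pos])]
        for pos, tok in enumerate(tokens)
    ]


x = embed("ava")
print([stoi[ch] for ch in "ava"])  # [0, 6, 0]
print(x[0] == x[2])                # False: same letter, different position
```

The last line shows why the position embedding matters: both "a" characters share a token vector, but the sum differs because they sit at different positions.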
The Transformer was introduced in 2017 by researchers at Google in the famous paper "Attention Is All You Need."

The original architecture was designed for machine translation and consisted of two main parts:

- an encoder
- a decoder

The encoder processed the input sentence, while the decoder generated the translated output sentence.

For generative models like GPT, however, we only need half of the original architecture: the decoder stack. This is why GPT models are often referred to as decoder-only transformers.

The decoder takes the input tokens and passes them through a stack of identical blocks, one after another. Each block has two main components:

- Self-attention
- A feed-forward neural network (MLP)

In larger models, these blocks are repeated many times. In diagrams, this is often shown as ×N, meaning the block is stacked multiple times.

One of the key innovations of the Transformer architecture is that it processes the entire sequence at once. Earlier language models, especially recurrent neural networks (RNNs), processed text one token at a time, carrying information forward step by step.

Transformers work differently. They can look at all tokens simultaneously, allowing the model to learn relationships between any parts of the text. This mechanism is called attention, hence the title of the original paper: Attention Is All You Need.

The attention mechanism calculates how relevant each token is to every other token in the sequence. For each token vector, the model computes a set of weights describing how much attention it should pay to the other tokens. It then combines the information from those tokens accordingly.

The resulting vector represents not just the token itself, but its meaning in the context of the entire sequence.

This may sound complicated, but the intuition is straightforward. Suppose we ask a language model: "What is the capital of France?"

If we only look at the word "capital", we cannot determine the answer. But attention lets the model connect the word "capital" with "France". The resulting representation captures the meaning of the phrase "capital of France", making it possible for the model to produce the correct answer: Paris.
One way to think about transformers is to imagine them as a kind of soft database.

The model stores knowledge in the structure of the vector space. Instead of storing exact records, neural networks encode relationships.

Returning to our earlier embedding example: if the training data contains information about kings and women, the model may still be able to answer questions about queens, because the relationships between these concepts are captured in the vector space.

If we follow this database analogy, we could say:

- Attention acts like an index, helping the model locate relevant information.
- The MLP layers contain the knowledge itself.

This mental model is useful for intuition, but it is not literally correct. In a real transformer, such as the ones used in ChatGPT, these attention + MLP blocks are repeated many times. Knowledge is not stored in a single location but is distributed across layers.

Additionally, each layer includes a residual connection, which mixes the original input vectors with the newly computed vectors. This allows information to flow through the network more effectively and stabilizes training.

As the vectors pass through the layers, new abstractions and meanings can emerge. By the time the final layer produces its output, the model has combined information from many different levels of representation.

The whole process is extremely complex. Yet despite this complexity, the system works remarkably well in practice.

Now that we have a rough intuition about the transformer architecture, let's look at one of its most important components in more detail: attention.
```python
# 1) Multi-head Attention block
# (excerpt from the model's forward pass; rmsnorm, linear, softmax, state_dict,
#  keys and values are defined elsewhere in the gist)
x_residual = x
x = rmsnorm(x)
q = linear(x, state_dict[f'layer{li}.attn_wq'])  # query projection
k = linear(x, state_dict[f'layer{li}.attn_wk'])  # key projection
v = linear(x, state_dict[f'layer{li}.attn_wv'])  # value projection
keys[li].append(k)    # remember keys and values of all tokens seen so far
values[li].append(v)
x_attn = []
for h in range(n_head):
    hs = h * head_dim  # start index of this head's slice
    q_h = q[hs:hs+head_dim]
    k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
    v_h = [vi[hs:hs+head_dim] for vi in values[li]]
    # scaled dot product between the query and every stored key
    attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5 for t in range(len(k_h))]
    attn_weights = softmax(attn_logits)
    # weighted sum of the value vectors
    head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h))) for j in range(head_dim)]
    x_attn.extend(head_out)
x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
x = [a + b for a, b in zip(x, x_residual)]  # residual connection
```

In Karpathy's implementation, attention is calculated using three matrices called:

- Q (Query)
- K (Key)
- V (Value)

These matrices perform vector projections. In other words, each token vector is projected into three different vector spaces. For every token we compute:

- a query vector
- a key vector
- a value vector

Once we have these vectors, we compare the query vector of one token with the key vectors of all tokens in the sequence (in this implementation, the keys collected so far). The comparison is a dot product, scaled by the square root of the head dimension.

This produces a set of numbers that represent how relevant each token is to the current token.

However, these raw scores are not yet normalized. To turn them into proper weights, we apply the softmax function, which transforms the scores into values between 0 and 1 that sum to 1.

These values represent how much attention each token should receive.

Finally, the model uses these attention weights to combine the value vectors, producing a new vector that incorporates contextual information.
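The steps just described (compare, normalize, combine) can be traced by hand on a tiny example. This is a minimal single-head sketch with plain floats; the function names `softmax` and `attend` and all the numbers are made up for illustration, and the learned Q, K, V projections are skipped by supplying the vectors directly.

```python
import math


def softmax(scores):
    """Turn raw scores into weights between 0 and 1 that sum to 1."""
    largest = max(scores)  # subtracting the max keeps exp() numerically safe
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    # 1) compare the query with every key
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # 2) normalize the scores into attention weights
    weights = softmax(scores)
    # 3) combine the value vectors according to the weights
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, out


# Three tokens with 2-dimensional keys and values (illustrative numbers)
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]  # points in the same direction as the first key

weights, out = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(o, 3) for o in out])
```

The first and third keys both have a component along the query direction, so they receive equal and larger weights than the second key, and the output leans toward their value vectors.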
The formula for scaled dot-product attention looks like this:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Here:

- QKᵀ computes the similarity between queries and keys.
- √dₖ is a scaling factor that stabilizes training.
- softmax turns the scores into a probability distribution.
- V holds the information, which is combined according to these weights.

The result is a new representation for each token: its meaning in the context of the entire sequence.

At this point, an important concept comes up: context length.

As we discussed earlier, transformers process the entire sequence at once. This is necessary because attention requires comparing every token with every other token. The computational cost therefore grows quadratically with the number of tokens: if the context length doubles, the amount of computation roughly quadruples. This is one of the main limitations of transformer models.

Unlike some other architectures, transformers do not have a separate memory system. They can only "see" the tokens that fit within their context window. Anything outside this window is invisible to the model. This is why context length is such an important property of modern language models.

In many modern AI systems, this limitation is addressed with an external memory mechanism. A common approach is to use a vector database. Instead of storing information directly in the model, information can be stored externally as vector embeddings. When the model receives a question, the system can:

1. Convert the question into a vector.
2. Search the vector database for related information.
3. Insert the retrieved information into the context.

The model then sees both:

- the question
- the relevant information retrieved from the database

Since both appear in the context window, the model can generate an answer based on this information. This technique is known as Retrieval-Augmented Generation (RAG) and is widely used in modern AI systems and agent frameworks.

In this setup, the language model's main role is not to store knowledge, but to generate coherent answers based on the information available in its context. But as we can see, this requires space in the context window, which is why context length remains so important.

Returning to Karpathy's implementation, the model uses multi-head attention, which is an improved form of the basic attention mechanism.
Instead of computing attention with a single set of Q, K, and V matrices, the model uses several attention heads. In this implementation, there are four heads.

Each head can focus on a different kind of relationship between tokens. For example, one head might focus on grammatical relationships, while another might track higher-level dependencies.

To keep the computational cost the same, the dimensionality of each head is reduced. Earlier, we mapped vectors from a 16-dimensional space to another 16-dimensional space. With four attention heads, each head instead works in a 4-dimensional space, and the outputs of the heads are then concatenated back into a single vector.

Although each head works with lower-dimensional vectors, the combined result is typically more expressive and more accurate than using a single attention head.

Now that we've covered the attention mechanism, let's move on to the second major component of the transformer block: the MLP, or feed-forward neural network.

In the code, the MLP block looks like this:

```python
# 2) MLP block
# (excerpt from the same forward pass as the attention block)
x_residual = x
x = rmsnorm(x)
x = linear(x, state_dict[f'layer{li}.mlp_fc1'])  # first linear transformation
x = [xi.relu() for xi in x]                      # nonlinear activation
x = linear(x, state_dict[f'layer{li}.mlp_fc2'])  # second linear transformation
x = [a + b for a, b in zip(x, x_residual)]       # residual connection
```

An MLP is a classic neural network architecture. If we look at the structure of the weight matrices, we can interpret the rows of the matrix as neurons.

A neuron is a simple computational unit:

- it multiplies each input by a weight,
- sums the results,
- then applies a nonlinear activation function to produce the output.

This model was originally inspired by the way biological neurons were thought to work. However, modern AI systems have moved far away from this original analogy.
In the MLP component, we can still loosely recognize something that resembles neurons. But when we look at mechanisms like attention, the biological analogy no longer really applies.

For this reason, it is better to think of today's neural networks simply as trainable mathematical functions, rather than literal models of the brain.

The MLP block in the code consists of three main steps:

- a linear transformation (matrix multiplication),
- a nonlinear activation function (ReLU),
- another linear transformation.

This may seem simple, but structures like this have a very powerful mathematical property. They are known as universal approximators. This means that, under certain conditions, a sufficiently large MLP can approximate any continuous mathematical function to an arbitrary degree of accuracy.

In theory, a single large MLP could learn anything. However, that would not be very efficient in practice. That is why transformer architectures combine multiple mechanisms, including attention and stacked layers, to distribute the computation more effectively.

The output of the network is not a single token, but a probability distribution over all possible tokens. In other words, for each token in the vocabulary, the model outputs the probability that it should appear next in the sequence.

The algorithm then samples from this probability distribution. This means that tokens with higher probability are more likely to be chosen, but there is still an element of randomness.

This randomness is controlled by a parameter called temperature. The temperature parameter determines how deterministic or creative the model's output will be.

- Low temperature: the model strongly favors the most probable tokens, producing more predictable and accurate responses.
- High temperature: the probability distribution becomes flatter, so lower-probability tokens are chosen more often, which leads to more varied or creative output.

For example:

- If we want our model to process a text and answer factual questions, a low temperature is usually preferable.
- If we want our model to write creative text or explore new ideas, a higher temperature can produce more interesting results.
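The effect of temperature is easy to see numerically. The sketch below divides the raw scores (logits) by the temperature before applying softmax and then samples from the result; this is a common way to implement it, and the function names and logit values are assumptions for illustration rather than a copy of the gist.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then normalize with softmax."""
    scaled = [l / temperature for l in logits]
    largest = max(scaled)  # for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, temperature, rng):
    """Pick one token id at random, weighted by its probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary

cold = softmax_with_temperature(logits, 0.5)  # sharper distribution
hot = softmax_with_temperature(logits, 2.0)   # flatter distribution
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])

rng = random.Random(42)  # fixed seed so repeated runs pick the same token
token = sample(logits, 1.0, rng)
print(token)
```

At temperature 0.5 most of the probability mass sits on the first token, while at 2.0 the three tokens are much closer together, which is exactly the predictable-versus-creative trade-off described above.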
This is roughly what I wanted to explain about this beautiful piece of code, and about GPT models in general.

In many places, the explanation is necessarily simplified. My aim was to strike a balance between two things: keeping the article at a reasonable length while still covering the most important ideas.

For readers who found parts of the explanation a bit unclear, or who want to explore the details more deeply, I highly recommend Andrej Karpathy's personal website, YouTube channel, and blog. There, you will find excellent material explaining everything needed to fully understand the concepts discussed here.