This article is the third in my series on future technologies:
I’m writing this series because even as cutting-edge technologies reshape our world (as Marc Andreessen of Andreessen Horowitz put it, “software is eating the world”), the complexities of how they are built are not well understood. I decided to write a trilogy that demystifies these emerging technologies and the future they are shaping.
I am Manoj Boopathi Raj, a Senior Software Engineer at Google. I’ve worked on Google products used by hundreds of millions of users, perhaps even you. If you’ve ever used Google Assistant in your car, I made sure it actually understands you over all the noise of the road and highway. I made sure that when you say “take a selfie,” your Android phone does exactly that. I’ve also kept spam out of YouTube, so your search results are exactly what you’re looking for, and made sure your eSIM-enabled Android phone stays connected to the strongest network, so you’re never stuck on a loading screen. And yes, I believe humanity should “boldly go where no man has gone before.”
Today, the technology I’m most fascinated by is Large Language Models (LLMs) and how they’re revolutionizing human-computer interaction. You may ask: what are LLMs, and why do they matter? I cannot overstate their importance to the next decade of nearly every industry. Companies that master them will lead their sectors; employees who do will be on the fast track to success. LLMs are at the core of the AI systems being trained today, and those that will be trained tomorrow, to perform an ever-wider range of tasks.
LLMs are models trained on colossal amounts of text, ingesting books, articles, code, and other forms of written content. This firehose of information allows them to grasp the nuances of language, including the statistical relationships between words and how they’re used in context. Let’s walk through the pipeline that turns that raw data into a working model:
Data Sources: LLMs are pre-trained on massive datasets of text and code, often on the order of terabytes or more; for comparison, most people’s own documents and emails amount to mere megabytes or gigabytes. This data can be gathered with web-scraping tools or drawn from public document archives and proprietary datasets. Common sources include books (e.g., Project Gutenberg), articles (e.g., Wikipedia, news archives), code repositories (e.g., GitHub), and social media conversations (after anonymization and other ethical safeguards).
Tokenization: The text data is pre-processed by splitting it into individual units called tokens. Tokenization strategies vary from model to model. Two common approaches are word-level tokenization, which splits text on word boundaries, and subword tokenization (e.g., Byte-Pair Encoding or WordPiece), which breaks rare words into smaller, reusable pieces so the vocabulary stays manageable.
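To make the subword idea concrete, here is a toy greedy longest-match tokenizer. The hand-picked vocabulary and the greedy matching rule are simplifications for illustration; real tokenizers learn their vocabularies from data.

```python
# Toy greedy longest-match subword tokenizer. The vocabulary below is
# hand-crafted for this example; real subword tokenizers (BPE, WordPiece)
# learn theirs from the training corpus.

VOCAB = {"un", "break", "able", "token", "izer", "s"}

def tokenize(word, vocab):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible substring first.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Unknown character: emit it as-is (real tokenizers use an <unk> token).
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unbreakable", VOCAB))  # ['un', 'break', 'able']
print(tokenize("tokenizers", VOCAB))   # ['token', 'izer', 's']
```

Notice how “unbreakable” decomposes into meaningful pieces the model has seen many times, even if the full word is rare; that is exactly why subword tokenization keeps vocabularies small without losing coverage.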
Word Embeddings: Each token is converted into a dense vector representation known as an embedding. These vectors are low-dimensional relative to the vocabulary size (e.g., 300 or 512 dimensions) yet capture the semantic meaning of a word and its relationship to other words. Popular techniques include word2vec and GloVe, as well as contextual embeddings produced by models like BERT or XLNet, which use the surrounding text to create more nuanced, context-dependent representations.
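A quick sketch of the core idea: tokens map to vectors, and geometric closeness between vectors reflects semantic relatedness. The 4-dimensional vectors below are hand-crafted for the example; real embeddings have hundreds of learned dimensions.

```python
# Toy embedding table plus cosine similarity. The vectors are invented for
# illustration; real embeddings are learned from data.
import math

EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.9, 0.7, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.9, 0.8],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim_royal = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"])
sim_fruit = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["apple"])
print(f"king~queen: {sim_royal:.2f}, king~apple: {sim_fruit:.2f}")
# king~queen scores much higher than king~apple.
```

This is the property that makes embeddings useful downstream: related words end up near each other, so the model can generalize from words it has seen to words used in similar contexts.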
Transformer Decoder Network: At the heart of the pre-training process lies a deep learning architecture called the transformer. Most modern LLMs use a decoder-only variant, which stacks many identical decoder layers. Each layer combines masked self-attention (every token attends only to the tokens before it), a position-wise feed-forward network, residual connections, and layer normalization.
Pre-training equips the LLM with a strong foundation in language understanding. However, to excel at specific tasks, LLMs undergo further training on a smaller dataset curated for that particular domain. This dataset is labeled with examples relevant to the target goal. Here's a deeper dive into fine-tuning techniques with a technical focus:
Supervised Fine-Tuning:
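The supervised recipe (forward pass, loss against labeled examples, gradient step) can be illustrated on a model small enough to fit in a few lines. In real supervised fine-tuning the “model” is a pretrained transformer and the labels are curated input/output pairs; the toy logistic classifier below is only a stand-in to show the shape of the training loop.

```python
# Toy supervised training loop: forward pass, cross-entropy gradient, update.
# The model here is a one-parameter logistic classifier, standing in for a
# much larger pretrained network being fine-tuned on labeled data.
import math

# Labeled data: the target label is 1 when x > 0, else 0.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = 0.0, 0.0          # model parameters
lr = 0.5                 # learning rate (a key fine-tuning hyperparameter)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for epoch in range(200):
    for x, y in data:
        p = sigmoid(w * x + b)        # forward pass
        grad_w = (p - y) * x          # gradient of the cross-entropy loss
        grad_b = (p - y)
        w -= lr * grad_w              # gradient descent update
        b -= lr * grad_b

print(sigmoid(w * 2.0 + b))   # close to 1.0 after training
print(sigmoid(w * -2.0 + b))  # close to 0.0 after training
```

The learning rate matters here just as it does at scale: fine-tuning typically uses a much smaller learning rate than pre-training so the model adapts to the new task without overwriting what it already knows.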
Unsupervised Fine-tuning:
Reinforcement Learning from Human Feedback (RLHF):
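A heavily simplified sketch of the RLHF idea: sample a response, score it with a reward signal, and nudge the policy toward higher-reward responses (a REINFORCE-style update). Real RLHF trains a separate reward model on human preference data and optimizes the LLM with algorithms like PPO; the two canned responses and the hard-coded reward below are hypothetical miniatures.

```python
# Toy policy-gradient update from feedback (REINFORCE-style). Everything here
# is a deliberately tiny stand-in for the full RLHF pipeline.
import math
import random

random.seed(0)
responses = ["helpful answer", "rude answer"]
logits = [0.0, 0.0]  # policy parameters, one logit per canned response

def probs(logits):
    """Softmax over logits, with max-subtraction for stability."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reward(response):
    # Stand-in for a learned reward model trained on human preferences.
    return 1.0 if response == "helpful answer" else -1.0

lr = 0.1
for step in range(500):
    p = probs(logits)
    i = random.choices(range(len(responses)), weights=p)[0]  # sample a response
    r = reward(responses[i])
    # REINFORCE: move probability mass toward the sampled action in
    # proportion to its reward.
    for j in range(len(logits)):
        indicator = 1.0 if j == i else 0.0
        logits[j] += lr * r * (indicator - p[j])

print(probs(logits))  # probability mass shifts toward the helpful answer
```

The point of the sketch is the feedback loop: the policy’s own samples, scored by a reward signal, steer its future behavior, which is how human preferences get folded into an LLM after pre-training and supervised fine-tuning.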
Additional Considerations:
Fine-tuning is a crucial step in transforming a general-purpose LLM into a powerful tool for real-world applications. By carefully selecting the fine-tuning approach, loss functions, hyperparameters, and regularization techniques, we can unlock the potential of LLMs to excel in various tasks, from generating different creative text formats to performing complex question answering or machine translation. As research in this field continues to evolve, we can expect even more sophisticated fine-tuning methods to emerge, further pushing the boundaries of LLM capabilities.
The development and fine-tuning of LLMs are pivotal in building the future, as these models can understand and generate human-like text, making digital assistants more responsive and intelligent. The potential of LLMs extends far beyond current applications, promising to revolutionize industries, streamline processes, and create more personalized user experiences. As we continue to explore their capabilities, I’m excited to be at the forefront of this technological evolution, shaping the future of AI-driven interactions. I hope you’re as excited as I am about how the next generation of LLMs will change the world.