Generative AI: Expert Insights on Evolution, Challenges, and Future Trends

AI has captured the attention of tech enthusiasts and industry experts for quite some time. In this article, we look at the evolution of AI, the issues it poses, and the emerging trends on the horizon. As AI technology grows rapidly, understanding its capabilities becomes increasingly important for making the most of its benefits. Volodymyr Getmanskyi, Head of the Data Science Office at ELEKS, shares his insights and expertise on the topic.

AI vs. GenAI – Key Differences Explained

First, generative AI is a part of the AI field. While AI in general focuses on automating or optimizing human tasks, generative AI focuses on creating new objects. Typical AI tasks, such as building conversational or decision-making agents, intelligent automation, image recognition and processing, and translation, can be enhanced with GenAI. It makes it possible to generate text and reports, images and designs, speech and music, and more. As a result, generative AI is increasingly easy to integrate into everyday tasks and workflows.

One might wonder which type of data generation is the most popular. The answer is not straightforward. Multimodal models can generate different types of data from diverse input, so even with usage statistics it would be hard to determine the most popular type of generated data. Judging by current business needs, however, large language models are among the most popular. These models can process both text and numerical information and can be used for tasks such as question answering, text transformation (translation, spell-checking, enrichment), and report generation. This functionality is a significant part of operational activities for enterprises across industries, unlike image or video generation, which is less common.

Large Language Models: From Text Generation to Modern Giants

Large language models (LLMs) are huge transformers, which are a type of deep learning model or, to put it simply, a specific kind of neural network. Many current LLMs have anywhere from 8 billion to 70 billion parameters and are trained on vast amounts of data. For instance, Common Crawl, one of the largest datasets, contains web pages and information from the past decade, amounting to dozens of petabytes of data. To put this in perspective, the Titanic dataset, which consists of around 900 samples describing which passengers survived the Titanic shipwreck, is less than 1 MB in size, and a model that efficiently predicts the probability of survival may have only around 25 to 100 parameters.

LLMs did not appear suddenly; they have a long history. For example, the ELEKS data science department used GPT-2 for response generation in 2019, while the first GPT (generative pre-trained transformer) model was released in 2018. Even that was not the first appearance of text generation models. Before the transformer era started in 2017, text generation had been addressed with other approaches, for example, generative adversarial networks, in which the generator is trained on feedback from another network (the discriminator), and autoencoders, a general and well-known approach in which the model tries to reproduce its input.

In 2013, efficient vector word embeddings such as word2vec were proposed, and even earlier, in the previous century, there were examples of probabilistic and pattern-based generation, such as the ELIZA chatbot in 1964. So natural language generation (NLG) tasks and attempts have existed for many years. Most users of current LLM products, such as ChatGPT, GPT, Gemini, Copilot, and Claude, are likely unaware of this, because earlier results were not as promising as those that followed the first release of InstructGPT, when OpenAI offered and promoted public access.
The first release of ChatGPT followed in November 2022 and received millions of mentions on social media.

The AI Regulation Debate: Balancing Innovation and Safety

Today, the AI community is divided on AI risks and compliance needs: some advocate AI regulation and safety controls, while others oppose them. Among the critics is Yann LeCun, Chief AI Scientist at Meta, who has stated that the intelligence of such AI agents does not even match that of a dog. The Meta AI group (formerly Facebook AI Research) is one of the developers of free, publicly available AI models such as Detectron, Llama, Segment Anything, and ELF, which can be freely downloaded and used with only some commercial limitations. This open access has been well received by the worldwide AI community.

"Those systems are still very limited; they don't have any understanding of the underlying reality of the real world because they are purely trained on text, a massive amount of text." (Yann LeCun, Chief AI Scientist at Meta)

Officials have also raised concerns about regulation. For example, French President Emmanuel Macron warned that landmark EU legislation designed to govern the development of artificial intelligence risks hampering European tech companies compared to rivals in the US, UK, and China.

On the other hand, AI regulation has its supporters. According to Elon Musk, Tesla CEO, AI is one of the biggest risks to the future of civilization. Representatives of closed, paid AI providers hold a similar position, although in their case the real driver may be market competition: limiting the spread of competing AI models.

Overview of the EU Artificial Intelligence Act

In 2023, EU lawmakers reached a political agreement on the AI Act, which the European Parliament formally approved in March 2024. It is the first set of comprehensive rules governing the use of AI technologies within the European Union. This legislation sets a precedent for responsible and ethical AI development and implementation.
Key issues addressed by the EU AI Act:

First, there are logical limitations on the use of personal data, as already outlined by standards such as GDPR (EU), APPI (Japan), HIPAA (US), and PIPEDA (Canada), which cover personal data processing, biometric identification, and so on.
Connected to this are scoring systems and any other form of categorizing people, where model bias can have a significant impact and potentially lead to discrimination.
Finally, there is behavioral manipulation, where models may be used to push business KPIs (conversion rates, overconsumption).

AI Model Preparation and Usage: Challenges and Concerns

There are many issues and concerns connected to model preparation, usage, and other hidden activities. For example, training data may include personal data that was never authorized for such purposes. Global providers offer services built around private correspondence (email) or other private assets (photos, videos), which could be used for model training covertly, without any announcement. OpenAI's CTO was recently asked about the use of private videos for training Sora, a nonpublic OpenAI service that generates videos from textual queries, but she could not provide a clear answer.

Another issue relates to data labeling and filtering: we don't know the personal characteristics, skills, stereotypes, and knowledge of the specialists involved, and these can introduce unwanted statements or content into the data. There is also an ethical issue: there have been reports that some global GenAI providers employed data labelers in Kenya and underpaid them.

Model bias and so-called model hallucinations, in which a model gives incorrect or partially incorrect answers that look perfectly plausible, are also problems.
Recently, the ELEKS data science team was working on improving our customers' retrieval-augmented generation (RAG) solution, in which the model is shown some data and then summarizes it or answers questions based on it. During the process, our team realized that many modern models, both online (larger but paid) and offline (smaller and public), confuse enterprise names and numbers.

We had data containing financial statements and audit information for a few companies, and the request was to show company A's revenue. However, the revenue for company A wasn't directly provided in the data and needed to be calculated. Most models, including leaders in the LLM Arena benchmark, responded with the wrong revenue figure, which belonged to company B. The error was caused by partially similar character combinations in the companies' names, such as "Ltd" and "Service". Even prompt-level instructions didn't help: adding a statement like "if you aren't confident or some information is missing, please answer don't know" didn't resolve the issue.

Another problem concerns numerical representation. LLMs perceive numbers as tokens, or even as several tokens: for example, 0.33333 can be encoded as '0.3' and '3333' under byte-pair encoding, so complicated numerical transformations are hard to handle without additional adapters.

The recent appointment of retired U.S. Army General Paul M. Nakasone to OpenAI's board of directors has sparked a mixed reaction. On the one hand, Nakasone's extensive background in cybersecurity and intelligence is seen as a significant asset that is likely to bring robust strategies for defending against cyber attacks, which is crucial for a company dealing with AI research and development. On the other hand, there are concerns about the potential implications of the appointment due to his military and intelligence background (former head of the National Security Agency (NSA) and U.S.
Cyber Command), which may lead to increased government surveillance and intervention. The fear is that Nakasone could facilitate more extensive access by government agencies to OpenAI's data and services. Some therefore fear that the appointment could affect how government agencies use the service, its data, and user requests, as well as the limitations of the service itself.

Finally, there are other concerns, such as vulnerabilities in generated code, contradictory suggestions, inappropriate usage (passing exams or getting instructions on how to build a bomb), and more.

How to Improve LLM Usage for More Robust Results

First, it's crucial to determine whether using an LLM is necessary at all and whether it should be a general foundation model. In some cases, the purpose and the decomposed task are not that complicated and can be handled by simpler offline models, such as spell-checking, pattern-based generation, and parsing or information retrieval. In addition, a general model can answer questions that have nothing to do with the intended purpose of the LLM integration. There are examples of companies that integrated an online LLM (e.g., GPT, Gemini) without any additional adapters (pre- and post-processors) and encountered unexpected behavior. For example, a user asked a car dealer chatbot to write a Python script to solve the Navier-Stokes fluid flow equations, and the chatbot said, "Certainly! I'll do that."

Next comes the question of which LLM to use: public and offline, or paid and online. The decision depends on the complexity of the task and the available computing resources. Online, paid models are larger and perform better, while offline, public models require significant expenditure on hosting, often needing at least 40 GB of VRAM. When using online models, it's essential to have strict control of sensitive data shared with the provider.
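One way to exercise that control is to redact sensitive fragments before a query leaves the organization. The following is a minimal sketch rather than a production filter: the pattern set, the placeholder labels, and the `redact` helper are all assumptions made for illustration, and real deployments usually need locale-aware rules or a named-entity recognizer on top of regular expressions.

```python
import re

# Hypothetical patterns, chosen only for illustration.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "AMOUNT": re.compile(r"(?:USD|EUR|\$|€)\s?\d[\d,]*(?:\.\d+)?"),
}


def redact(query: str) -> str:
    """Replace sensitive spans with typed placeholders, leaving the rest untouched."""
    for label, pattern in PATTERNS.items():
        query = pattern.sub(f"[{label}]", query)
    return query


if __name__ == "__main__":
    raw = "Invoice for $12,500.00 was sent to cfo@example.com by a mid-size firm in Lviv."
    print(redact(raw))
```

Typed placeholders such as `[AMOUNT]` keep the query readable for the model while hiding the actual value, and non-sensitive context (company size, approximate location) passes through unchanged.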
In practice, we typically build a preprocessing module that removes personal or sensitive information, such as financial details or private agreements, without significantly changing the query, so that the context is preserved; information such as the enterprise size or approximate location can be left in if needed.

The first step toward decreasing model bias and avoiding hallucinations is to choose the right data or context, or to rank the candidates (e.g., for RAG). Vector representations and similarity metrics such as cosine similarity are sometimes not effective here, because small variations, like the presence of the word "no" or slight differences in names (e.g., Oracle vs. Orache), can change the meaning significantly while barely changing the similarity score. As for post-processing, we can instruct the model to respond with "don't know" if its confidence is low and develop a verification adapter that checks the accuracy of the model's responses.

Emerging Trends and Future Directions in the LLM Field

Numerous research directions exist in the field of LLMs, and new scientific articles emerge weekly. They cover a range of topics, including transformer and LLM optimization, robustness, efficiency (such as how to generalize models without significantly increasing their size or parameter count), typical optimization techniques (like distillation), and methods for increasing input (context) length.

Prominent recent directions include Mixture-of-tokens, Mixture-of-experts, Mixture-of-depths, Skeleton-of-thought, RoPE, and Chain-of-thought prompting. Let's briefly describe what each of these means.

Mixture-of-experts (MoE) is a variation of the transformer architecture. It typically has a dynamic layer consisting of several (8 in Mixtral) or many dense layers, the experts, each representing different knowledge.
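The layer described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Mixtral's implementation: each expert is a single random linear map, the gate is a plain linear scorer, and every token is sent to its two highest-scoring experts (the `top_k=2` default is an assumption made for the example). All function names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(dim, rng):
    """A toy expert: one dim x dim linear map with random weights."""
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    return lambda x: [sum(w * v for w, v in zip(row, x)) for row in weights]


def moe_layer(token, experts, gate_weights, top_k=2):
    """Route one token vector to its top_k experts and mix their outputs."""
    # Gate: one score per expert, computed from the token itself.
    scores = [sum(w * v for w, v in zip(row, token)) for row in gate_weights]
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Renormalize the gate over the chosen experts only.
    mix = softmax([scores[i] for i in chosen])
    output = [0.0] * len(token)
    for weight, idx in zip(mix, chosen):
        expert_out = experts[idx](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen


if __name__ == "__main__":
    rng = random.Random(0)
    dim, num_experts = 4, 8  # 8 experts as in Mixtral; dim is arbitrary
    experts = [make_expert(dim, rng) for _ in range(num_experts)]
    gate_weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]
    out, used = moe_layer([0.5, -1.0, 0.25, 2.0], experts, gate_weights)
    print("experts used for this token:", used)
```

Only `top_k` of the experts run for any given token, which is why the total parameter count can grow with the number of experts while the compute per token stays roughly constant.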
The MoE architecture includes switch or routing methods, for example, a gating function that selects which tokens should be processed by which experts, so that each token, or group of tokens, is processed by only a few experts, or by a single expert in the case of a switch layer. This allows efficient model scaling and improves performance by using different submodels (experts) for different parts of the input, which is more effective than using one general and even larger layer.

Mixture-of-tokens is related to Mixture-of-experts: here, tokens are grouped according to their importance (softmax weights) for a specific expert.

The Mixture-of-depths technique is also related to MoE, particularly in terms of routing. It aims to reduce the compute graph (the compute budget) by limiting it to the top tokens that will take part in the attention mechanism. Tokens deemed less important for the specific sequence (e.g., punctuation) are skipped. Which tokens participate is therefore dynamic, but their number k (the top-k tokens) is static, so tensor sizes can be reduced according to the chosen compute budget, that is, the chosen k.

Skeleton-of-thought helps scale LLM inference: the model first produces a skeleton of the answer, a list of points that can be handled independently, and the parts of the completion (the model response) are then generated in parallel.

There are other challenges, for example, the input size. Users often want to give an LLM large amounts of information, sometimes even whole books, while keeping the number of parameters unchanged. Two known methods, ALiBi (Attention with Linear Biases) and RoPE (Rotary Position Embedding), address this: they make it possible to extrapolate, or interpolate, positional information using dynamic positional encoding and a scaling factor, so that the context length can be extended beyond the length used during training.
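For RoPE, the role of the scaling factor can be shown with a small sketch. It assumes the common formulation in which consecutive pairs of dimensions are rotated by an angle proportional to the token position, and it models interpolation simply as dividing the position by a `scale` parameter; the function names and the `base` default are assumptions for illustration, not taken from any particular library.

```python
import math


def rope(vec, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `position`.

    A `scale` above 1 squeezes positions together (position interpolation),
    so positions beyond the training range map back into it.
    """
    dim = len(vec)
    assert dim % 2 == 0, "RoPE pairs up dimensions, so dim must be even"
    pos = position / scale
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # rotation angle for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 0.0, 0.5, -0.5]
    k = [0.3, 0.7, -1.0, 0.2]
    near = dot(rope(q, 3), rope(k, 1))
    far = dot(rope(q, 10), rope(k, 8))
    print(abs(near - far) < 1e-9)         # scores depend on relative offset only
    print(rope(q, 8, scale=2.0) == rope(q, 4))  # position 8 scaled by 2 acts like position 4
```

Because the attention score depends only on the distance between two positions, stretching or squeezing the position axis changes how far apart tokens appear to be without touching the model's weights, which is what makes context extension by scaling possible.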
Chain-of-thought prompting, which is an example of few-shot prompting (the user supplies the supervision for the LLM in the context), aims to decompose a question into several steps. It is mostly applied to reasoning problems in which the logic can be split into a computational plan. The example from the original paper: "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? Thoughts plan: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11."

Besides these, there are many other directions, and several significant new papers appear around them every week. Simply keeping up with all these challenges and achievements is sometimes an additional problem for data scientists.

What Can End Users Expect From the Latest AI Developments?

There are many trends. To sum up, AI regulations may become stronger, which will restrict some solutions and may ultimately narrow the generality or domain coverage of available models.

Other trends mostly concern improving existing approaches, for example, decreasing the number of parameters and the memory needed (e.g., quantization, or even 1-bit LLMs, in which each parameter is ternary and can take the values -1, 0, or 1). As a result, we can expect offline LLMs or Diffusion Transformers (DiT, the successors of modern diffusion models and Vision Transformers, used primarily for image generation) to run even on our phones. There are already several examples, such as Microsoft's Phi-2 model, which generates about 3 to 10 tokens per second on modern Snapdragon-based Android devices.

There will also be more advanced personalization (using all previous user experience and feedback to provide more suitable results), even up to digital twins.
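The ternary weights mentioned above for 1-bit LLMs can be illustrated with a short sketch. It assumes one simple scheme, scaling by the mean absolute weight and then rounding and clipping to -1, 0, or 1; the function names are hypothetical, and actual low-bit models may quantize differently and also handle activations, which this example ignores.

```python
def ternarize(weights, eps=1e-8):
    """Map float weights to {-1, 0, 1} plus a single float scale (absmean scaling)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Recover an approximation of the original weights."""
    return [t * scale for t in ternary]


if __name__ == "__main__":
    weights = [0.9, -0.05, -1.2, 0.4]
    ternary, scale = ternarize(weights)
    print(ternary)
    print(dequantize(ternary, scale))
```

Three possible states need about log2(3) ≈ 1.58 bits per parameter instead of 16, which is where most of the memory saving comes from, at the cost of a coarser approximation of each weight.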
Many other things will have been improved that are available right now – assistants/model customization and marketplaces, one model for everything (multimodal direction), security (a more efficient mechanism to work with personal data, to encode it, etc.), and others. Ready to unlock the potential of AI for your business? Contact ELEKS expert. AI has captured the attention of tech enthusiasts and industry experts for quite some time. In this article, we delve into the evolution of AI, shedding light on the issues it poses and the emerging trends on the horizon. AI has captured the attention of tech enthusiasts and industry experts for quite some time. In this article, we delve into the evolution of AI, shedding light on the issues it poses and the emerging trends on the horizon. As we observe the exponential growth of AI technology , it becomes increasingly crucial to have a comprehensive understanding of its capabilities in order to maximize its potential benefits. Delving into this complex realm, Volodymyr Getmanskyi, the Head of the Data Science Office at ELEKS, shares his insights and expertise on this trending topic. AI technology AI vs. GenAI – Key Differences Explained Firstly, generative AI is part of the AI field. While AI mainly focuses on automating or optimizing human tasks, generative AI focuses on creating different objects. Typical AI tasks such as building conversational or decision-making agents, intelligent automation, image recognition and processing, as well as translation, can be enhanced with GenAI. It allows the generation of text and reports, images and designs, speech and music, and more. As a result, the integration of generative AI into everyday tasks and workflows has become increasingly seamless and impactful. One might wonder which type of data generation is the most popular. However, the answer is not straightforward. Multimodal models allow the generation of different types of data based on diverse input. 
So, even if we had usage statistics, it would be challenging to determine the most popular type of data being generated. However, based on current business needs, large language models are among the most popular. These models can process both text and numerical information and can be used for tasks like question-answering, text transformation (translation, spell-checking, enrichment), and generating reports. This functionality is a significant part of operational activities for enterprises across industries, unlike image or video generation, which is less common. Large Language Models: From Text Generation to Modern Giants Large language models (LLMs) are huge transformers, which are a type of deep learning model or, to put it simply, specific neural networks. Generally, LLMs have anywhere from 8 billion to 70 billion parameters and are trained on vast amounts of data. For instance, Crawl, one of the largest datasets, contains web pages and information from the past decade, amounting to dozens of petabytes of data. To put it in perspective, the Titanic dataset, which consists of around 900 samples describing which passengers survived the Titanic shipwreck, is less than 1 Mb in size, and the model that can efficiently predict the probability of survival may have around 25 to 100 parameters. LLMs also have a long history, and they didn't suddenly appear. For example, the ELEKS data science department used GPT-2 for response generation in 2019, while the first GPT (generative pre-trained transformer) model was released in 2018. However, even that wasn't the first appearance of the text generation models. Before the transformer era started in 2017, tasks such as text generation had been addressed using different approaches, for example: Generative adversarial networks - an approach where the generator trains based on the feedback from another network or discriminator, Autoencoders - a general and well-known approach where the model tries to reproduce the input. 
Generative adversarial networks - an approach where the generator trains based on the feedback from another network or discriminator, Autoencoders - a general and well-known approach where the model tries to reproduce the input. In 2013, efficient vector word embeddings like word2vec were proposed, and even earlier, in the previous century, there were examples of probabilistic and pattern-based generation, such as the Eliza chatbot in 1964. So, as we can see, the natural language generation (NLG) tasks and attempts have existed for many years. Most of the current LLMs users, such as ChatGPT, GPT, Gemini, Copilot, Claude, etc., are likely unaware of this because the results weren’t as promising as after the first release of InstructGPT, where OpenAI proposed public access, promoting it. Following the first release of ChatGPT in November 2022, which received millions of mentions on social media. The AI Regulation Debate: Balancing Innovation and Safety Nowadays, the AI community is divided on the topic of AI risks and compliance needs, with some advocating for AI regulations and safety control while others oppose them. Among the critics is Yann LeCun, Chief of Meta (Facebook) AI, who stated that such AI agents have intelligence even not similar to that of a dog. Meta AI group (formerly Facebook AI Research) is one of the developers of free and publicly available AI models such as Detectron, Llama, SegmentAnything, and ELF, which can be freely downloaded and used with only some commercial limitations. Open access has definitely been favorably received by the worldwide AI community. Those systems are still very limited; they don’t have any understanding of the underlying reality of the real world because they are purely trained on text, a massive amount of text. 
— Yann LeCun, Chief AI Scientist at Meta Those systems are still very limited; they don’t have any understanding of the underlying reality of the real world because they are purely trained on text, a massive amount of text. Those systems are still very limited; they don’t have any understanding of the underlying reality of the real world because they are purely trained on text, a massive amount of text. — Yann LeCun, Chief AI Scientist at Meta The concerns regarding the regulations have also been raised by officials. For example, French President Emmanuel Macron warned that landmark EU legislation designed to tackle the development of artificial intelligence risks hampering European tech companies compared to rivals in the US, UK, and China. On the other hand, there are AI regulation supporters. According to Elon Musk, Tesla CEO, AI is one of the biggest risks to the future of civilization. This is the same as nonpublic/paid AI representatives, but here, the real exciters of such a position can be market competition—to limit the spread of competing AI models. Overview of the EU Artificial Intelligence Act In 2023, the EU parliament passed the AI Act, the first set of comprehensive rules governing the use of AI technologies within the European Union. This legislation sets a precedent for responsible and ethical AI development and implementation. Key issues addressed by the EU AI Act: Key issues addressed by the EU AI Act: Firstly, there are logical limitations to personal data, as already outlined by different standards, like GDPR (EU), APPI (Japan), HIPPA (US), and PIPEDA (Canada), which cover personal data processing, biometric identification, etc. Firstly, there are logical limitations to personal data, as already outlined by different standards, like GDPR (EU), APPI (Japan), HIPPA (US), and PIPEDA (Canada), which cover personal data processing, biometric identification, etc. 
Connected to this are scoring systems or any form of people categorization, where model bias can have a significant impact, potentially leading to discrimination. Connected to this are scoring systems or any form of people categorization, where model bias can have a significant impact, potentially leading to discrimination. Finally, there is behavioral manipulation, where some models can try to increase any business KPIs (conversion rates, overconsumption). Finally, there is behavioral manipulation, where some models can try to increase any business KPIs (conversion rates, overconsumption). AI Model Preparation and Usage: Challenges and Concerns There are many issues and concerns connected to model preparation, usage, and other hidden activities. For example, the data used for the model training consists of personal data, which wasn't authorized for such purposes. Global providers offer services focused on private correspondence (emails) or other private assets (photos, video) that can be used for the model training in the hidden mode without any announcement. There was recently a question addressed to OpenAI's CTO regarding the use of private videos for SORA training, a nonpublic OpenAI service for generating videos based on textual queries, but she could not provide a clear answer. Another issue can be related to data labeling and filtering—we don't know the personal characteristics, skills, stereotypes, and knowledge of specialists involved there, and this can introduce unwanted statements/content to the data. Also, there was an ethical issue—there was information that some of the global GenAI providers involved labelers from Kenya and underpaid them. Model bias and so-called model hallucinations, in which the models provide incorrect or partially incorrect answers that appear to be perfect, are also problems. 
Recently, the ELEKS data science team was working on improving our customers' retrieval augmented generation (RAG) solution, which covers showing some data for the model, and the model summarizes or provides answers based on that data. During the process, our team realized that many modern online (larger but paid) or offline (smaller and public) models confuse the enterprise names and numbers. We had data containing financial statements and audit information for a few companies, and the request was to show company A's revenue. However, the revenue for company A wasn't directly provided in the data and needed to be calculated. Most models, including leaders in the LLM Arena benchmark, responded with the wrong revenue level that belonged to company B. This error occurred due to partially similar character combinations in companies' names such as "Ltd", "Service", etc. Here, even the prompt learning didn't help; adding a statement like "if you aren't confident or some information is missing, please answer don't know" didn't resolve the issue. Another thing is about numerical representation—the LLMs perceive numbers as tokens, or even many tokens, like 0.33333 can be encoded as '0.3' and '3333' according to the byte-pair encoding approach, so it is hard to deal with complicated numerical transformations without additional adapters. We had data containing financial statements and audit information for a few companies, and the request was to show company A's revenue. However, the revenue for company A wasn't directly provided in the data and needed to be calculated. Most models, including leaders in the LLM Arena benchmark, responded with the wrong revenue level that belonged to company B. This error occurred due to partially similar character combinations in companies' names such as "Ltd", "Service", etc. 
Here, even the prompt learning didn't help; adding a statement like "if you aren't confident or some information is missing, please answer don't know" didn't resolve the issue. We had data containing financial statements and audit information for a few companies, and the request was to show company A's revenue. However, the revenue for company A wasn't directly provided in the data and needed to be calculated. Most models, including leaders in the LLM Arena benchmark, responded with the wrong revenue level that belonged to company B. This error occurred due to partially similar character combinations in companies' names such as "Ltd", "Service", etc. Here, even the prompt learning didn't help; adding a statement like "if you aren't confident or some information is missing, please answer don't know" didn't resolve the issue. Another thing is about numerical representation—the LLMs perceive numbers as tokens, or even many tokens, like 0.33333 can be encoded as '0.3' and '3333' according to the byte-pair encoding approach, so it is hard to deal with complicated numerical transformations without additional adapters. Another thing is about numerical representation—the LLMs perceive numbers as tokens, or even many tokens, like 0.33333 can be encoded as '0.3' and '3333' according to the byte-pair encoding approach, so it is hard to deal with complicated numerical transformations without additional adapters. The recent appointment of retired U.S. Army General Paul M. Nakasone to OpenAI's board of directors has sparked a mixed reaction. On the one hand, Nakasone's extensive background in cybersecurity and intelligence is seen as a significant asset, likely to implement robust strategies to defend against cyber attacks, crucial for a company dealing with AI research and development. 
On the other hand, there are concerns about the potential implications of Nakasone's appointment due to his military and intelligence background (former Head of the National Security Agency (NSA) and U.S. Cyber Command), which may lead to increased government surveillance and intervention. The fear is that Nakasone could facilitate more extensive access by government agencies to OpenAI's data and services. Thus, some fear that this appointment can affect both the use of the service, data, requests by government agencies, and the limitations of the service itself. Finally, there are other concerns, such as the generated code vulnerability, contradictory suggestions, inappropriate usage (passing exams or getting instruction on how to create the bomb), and more. How to Improve the LLMs Usage for More Robust Results First, it's crucial to determine whether using LLM is necessary and whether it should be a general foundational model. In some cases, the purpose and the decomposed task are not so complicated and can be resolved by simpler offline models such as misspelling, pattern-based generation, and parsing/information retrieval. Additionally, the general model can answer questions not related to the intended purpose of LLM integration. There are examples when the company encouraged online LLM integration (e.g., GPT, Gemini) without any additional adapters (pre and post-processors) and encountered unexpected behavior. For example, the user asked a car dealer chatbot to write the Python script to solve the Navier-Stokes fluid flow equation, and the chatbot said, "Certainly! I'll do that." Next, comes the question of which LLM to use—public and offline or paid and offline. The decision depends on the complexity of the task and the computing possibilities. Online and paid models are larger and have higher performance, while offline and public models require significant expenditures for hosting, often needing at least 40Gb of VRAM. 
When using online models, it's essential to strictly control the sensitive data shared with the provider. Typically, we build a preprocessing module that removes personal or sensitive information, such as financial details or private agreements, without significantly changing the query, so that the context is preserved; information like the enterprise size or approximate location can be left in if needed.
The initial step toward decreasing the model's bias and avoiding hallucinations is to choose the right data or context, or to rank the candidates (e.g. for RAG). Sometimes, vector representations and similarity metrics, such as cosine similarity, may not be effective. This is because small variations, like the presence of the word "no" or slight differences in names (e.g. Oracle vs Orache), can change the meaning significantly while leaving the vectors almost identical. As for post-processing, we can instruct the model to respond with "don't know" if confidence is low and develop a verification adapter that checks the accuracy of the model's responses.
Emerging Trends and Future Directions in the LLM Field
Numerous research directions exist in the field of LLMs, and new scientific articles emerge weekly. These articles cover a range of topics, including transformer/LLM optimization, robustness, efficiency (such as how to generalize models without significantly increasing their size or parameter count), typical optimization techniques (like distillation), and methods for increasing input (context) length. Among the various directions, prominent recent ones include Mixture-of-tokens, Mixture-of-experts, Mixture-of-depths, Skeleton-of-thought, RoPE, and Chain-of-thought prompting. Let's briefly describe what each of these means.
Mixture-of-experts (MoE) is a different transformer architecture. It typically has a dynamic layer consisting of several (8 in Mixtral) or many dense/flattened layers representing different knowledge.
This architecture includes switch or routing methods, for example, a gating function that selects which tokens should be processed by which experts. This reduces the number of layers ("experts") applied per token or group of tokens, down to a single expert in the case of a switch layer. It allows for efficient model scaling and improves performance by using different submodels (experts) for different parts of the input, making it more effective than using one general and even larger layer.
Mixture-of-tokens is connected to the Mixture-of-experts described above: here, tokens are grouped by their importance (softmax activation) for a specific expert.
The Mixture-of-depths technique is also connected to MoEs, particularly in terms of routing. It aims to shrink the computation graph (compute budget) by limiting it to the top tokens that will be used in the attention mechanism. Tokens deemed less important for the specific sequence (e.g. punctuation) are skipped. Which tokens participate is therefore dynamic, but the number of tokens k (the top-k tokens) is static, so tensor sizes can be reduced according to the chosen compute budget (that is, the chosen k).
Skeleton-of-thought is efficient for LLM scaling: it allows parts of the completion (the model's response) to be generated in parallel, based on a primary skeleton request that consists of points that can be parallelized.
There are other challenges, for example, the input size. Users often want to provide an LLM with large amounts of information, sometimes even whole books, while keeping the number of parameters unchanged. Two known methods, ALiBi (Attention with Linear Biases) and RoPE (Rotary Position Embedding), can extrapolate, or possibly interpolate, the input embedding using dynamic positional encoding and a scaling factor, allowing users to increase the context length beyond what was used during training.
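To illustrate why a method like ALiBi can handle inputs longer than those seen in training, here is a minimal sketch of its bias term in plain Python. It assumes the commonly described formulation: each attention head gets a fixed slope from a geometric sequence, and a penalty proportional to the distance between query and key positions is added to the attention scores. The function names are our own, the slope formula assumes the number of heads is a power of two, and the `-inf` entries are the usual causal mask rather than part of ALiBi itself.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Per-head slopes: a geometric sequence starting at 2^(-8 / num_heads).

    This simple form assumes num_heads is a power of two.
    """
    ratio = 2.0 ** (-8.0 / num_heads)
    return [ratio ** (head + 1) for head in range(num_heads)]


def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Bias added to attention scores for one head.

    Entry [i][j] is slope * (j - i) for keys at or before the query
    (zero on the diagonal, more negative with distance). Future keys
    get -inf, which is the standard causal mask.
    """
    return [
        [slope * (j - i) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
for row in alibi_bias(4, slopes[0]):
    print(row)
```

Because the bias depends only on the relative distance between positions and has no learned parameters, the same function can be evaluated for any `seq_len`, including lengths larger than those used in training.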
Chain-of-thought prompting, which is an example of few-shot prompting (the user provides supervision for the LLM in the context), aims to decompose the question into several steps. It is mostly applied to reasoning problems, where the logic can be split into a computational plan. The example from the original paper: "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? Thoughts plan: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11."
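As a rough illustration of how such a chain-of-thought prompt might be assembled in practice, the sketch below builds a few-shot prompt from worked examples and leaves the final answer open for the model to complete. The `Q:`/`A:` layout, the function name, and the new question about pens are assumptions made for this example; only the tennis ball exemplar comes from the text above.

```python
def build_cot_prompt(exemplars, question):
    """Assemble a few-shot chain-of-thought prompt.

    Each exemplar is a (question, reasoning, answer) tuple; the reasoning
    steps are shown before the answer so the model imitates that pattern.
    """
    blocks = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in exemplars
    ]
    # The new question ends with an open "A:" for the model to continue.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


exemplars = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11.",
        11,
    )
]

# Hypothetical follow-up question, used only to show the prompt shape.
prompt = build_cot_prompt(
    exemplars,
    "A shop has 4 boxes of 6 pens and sells 5 pens. How many pens are left?",
)
print(prompt)
```

The resulting string would then be sent to whichever LLM is in use; the worked exemplar nudges the model to write out its intermediate steps before stating the final answer.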
Besides that, there are many other directions, and every week several new significant papers appear around them. Keeping up with all these challenges and achievements is sometimes an additional problem for data scientists.
What Can End Users Expect From the Latest AI Developments?
There are also many trends. To sum up, there may be stronger AI regulations that will limit different solutions and will ultimately affect the generalization or field coverage of available models. Other trends mostly concern improving existing approaches, for example, decreasing the number of parameters and the memory needed (e.g. quantization, or even 1-bit LLMs, where each parameter is ternary and can take the values -1, 0, or 1). So, we can expect offline LLMs or Diffusion Transformers (DiT, the successors of modern diffusion models and Vision Transformers, primarily for image generation tasks) running even on our phones. There are already several examples, such as Microsoft's Phi-2 model, with a generation speed of about 3-10 tokens per second on modern Snapdragon-based Android devices. Also, there will be more advanced personalization (using all previous user experience and feedback to provide more suitable results), even up to digital twins. Many things that are available right now will also be improved: assistant/model customization and marketplaces, one model for everything (the multimodal direction), security (more efficient mechanisms to work with personal data, to encode it, etc.), and others.
Ready to unlock the potential of AI for your business? Contact an ELEKS expert.