Authors:
(1) Timothy R. McIntosh;
(2) Teo Susnjak;
(3) Tong Liu;
(4) Paul Watters;
(5) Malka N. Halgamuge.
Background: Evolution of Generative AI
The Current Generative AI Research Taxonomy
Impact Analysis on Generative AI Research Taxonomy
Emergent Research Priorities in Generative AI
Practical Implications and Limitations of Generative AI Technologies
Impact of Generative AI on Preprints Across Disciplines
Conclusions, Disclaimer, and References
The ascent of Generative AI has been marked by significant milestones, with each new model paving the way for the next evolutionary leap. The progression from single-purpose algorithms to LLMs such as OpenAI's ChatGPT, and on to the latest multimodal systems, has transformed the AI landscape and disrupted countless other fields.
A. The Evolution of Language Models
Language models have undergone a transformative journey (Fig. 3), evolving from rudimentary statistical methods to the complex neural network architectures that underpin today's LLMs [36], [37]. This evolution has been driven by a relentless quest for models that more accurately reflect the nuances of human language, as well as the desire to push the boundaries of what machines can understand and generate [36], [38], [37]. However, this rapid advancement has not been without its challenges. As language models have grown in capability, so too have the ethical and safety concerns surrounding their use, prompting a reevaluation of how these models are developed and the purposes for which they are employed [36], [39], [40].
1) Language Models as Precursors: The inception of language modeling can be traced to the statistical approaches of the late 1980s, a period marked by the transition from rule-based to machine learning methods in Natural Language Processing (NLP) [41], [42], [43], [44], [45]. Early models, primarily n-gram based, calculated the probability of word sequences in a corpus, providing a rudimentary account of language structure [41]. Simplistic yet groundbreaking, these models laid the groundwork for later advances in language understanding. As computational power increased, NLP pivoted towards statistical models capable of ‘soft’ probabilistic decisions, in contrast to the rigid, ‘handwritten’ rule-based systems that had dominated the field [43]. IBM’s development of increasingly sophisticated statistical models during this period signified the growing importance and success of these approaches. In the subsequent decade, the popularity and applicability of statistical models surged, proving invaluable in managing the growing flood of digital text: the 1990s saw statistical methods firmly established in NLP research, with n-grams becoming instrumental in numerically capturing linguistic patterns. The introduction of Long Short-Term Memory (LSTM) networks in 1997 [46], and their application to voice and text processing a decade later [47], [48], [49], marked a further milestone, leading to the current era in which neural network models represent the cutting edge of NLP research and development.
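To make the n-gram idea concrete, the sketch below (a toy illustration under our own assumptions, not code from any of the cited works) estimates bigram probabilities from a small corpus with add-alpha smoothing, the kind of ‘soft’ probabilistic decision that distinguished statistical models from hand-written rules:

```python
from collections import defaultdict, Counter

def train_bigram_model(corpus):
    """Count unigram and bigram frequencies from a list of tokenized sentences."""
    unigrams, bigrams = Counter(), defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens)
        for prev, curr in zip(tokens, tokens[1:]):
            bigrams[prev][curr] += 1
    return unigrams, bigrams

def bigram_probability(unigrams, bigrams, prev, curr, vocab_size, alpha=1.0):
    """P(curr | prev) with add-alpha smoothing to avoid zero probabilities."""
    return (bigrams[prev][curr] + alpha) / (unigrams[prev] + alpha * vocab_size)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
unigrams, bigrams = train_bigram_model(corpus)
print(bigram_probability(unigrams, bigrams, "the", "cat", len(unigrams)))
```

The same counting-and-smoothing recipe extends to trigrams and higher orders, which is essentially how n-gram systems of the 1990s captured local linguistic patterns before neural models took over.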
2) Large Language Models: Technical Advancement and Commercial Success: The advent of deep learning has revolutionized NLP, leading to the development of LLMs like GPT, BERT, and, notably, OpenAI’s ChatGPT. Recent models such as GPT-4 and LLaMA have pushed the boundaries further through transformer architectures and advanced natural language understanding, illustrating the rapid evolution of the field [37]. These models represent a significant leap in NLP capabilities, leveraging vast computational resources and extensive datasets to achieve new heights in language understanding and generation [37], [50]. ChatGPT has demonstrated impressive conversational skill and contextual understanding across a broad spectrum of uses. Its technical and commercial success, including adoption by over 100 million users shortly after launch, underscores robust market demand for natural language AI and has catalyzed interdisciplinary research into its applications in sectors like education, healthcare, and commerce [8], [50], [51], [52], [53]. In education, ChatGPT offers innovative approaches to personalized learning and interactive teaching [54], [51], [55], [56], while in commerce, it is transforming customer service and content creation [57], [58]. The widespread use of ChatGPT, Google Bard, Anthropic Claude, and similar commercial LLMs has reignited important debates in AI, particularly concerning AI consciousness and safety: their human-like interaction capabilities raise significant ethical questions and highlight the need for robust governance and safety measures in AI development [59], [31], [32], [11]. This influence extends beyond technical achievements, shaping cultural and societal discussions about the role and future of AI.
The advancements in LLMs, including the development of models like GPT and BERT, have paved the way for the conceptualization of Q*. Specifically, the scalable architecture and extensive training data that characterize these models are foundational to the proposed capabilities of Q*. The success of ChatGPT in contextual understanding and conversational AI, for example, informs the design principles of Q*, suggesting a trajectory towards more sophisticated, context-aware, and adaptive language processing capabilities. Similarly, the emergence of multimodal systems like Gemini, capable of integrating text, images, audio, and video, reflects an evolutionary path that Q* could extend, combining the versatility of LLMs with advanced learning and pathfinding algorithms for a more holistic AI solution.
3) Fine-tuning, Hallucination Reduction, and Alignment in LLMs: The advancement of LLMs has underlined the significance of fine-tuning [60], [61], [62], [63], hallucination reduction [64], [65], [66], [67], and alignment [68], [69], [70], [71], [72]. These aspects are crucial in enhancing the functionality and reliability of LLMs. Fine-tuning, which involves adapting pre-trained models to specific tasks, has seen significant progress: techniques like prompt-based and few-shot learning [73], [74], [75], [76], alongside supervised fine-tuning on specialized datasets [60], [77], [78], [79], have enhanced the adaptability of LLMs in various contexts, but challenges remain, particularly in bias mitigation and the generalization of models across diverse tasks [60], [80], [72]. Hallucination reduction is a persistent challenge in LLMs, characterized by the generation of confident but factually incorrect information [36]. Strategies such as confidence penalty regularization during fine-tuning have been implemented to mitigate overconfidence and improve accuracy [81], [82], [83]. Despite these efforts, the complexity of human language and the breadth of topics make completely eradicating hallucinations a daunting task, especially in culturally sensitive contexts [36], [9]. Alignment, ensuring LLM outputs are congruent with human values and ethics, is an area of ongoing research. Innovative approaches, from constrained optimization [84], [85], [86], [87], [88], to different types of reward modeling [89], [90], [91], [92], aim to embed human preferences within AI systems. While advancements in fine-tuning, hallucination reduction, and alignment have propelled LLMs forward, these areas still present considerable challenges. The complexity of aligning AI with the diverse spectrum of human ethics and the persistence of hallucinations, particularly on culturally sensitive topics, highlight the need for continued interdisciplinary research in the development and application of LLMs [9].
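As an illustration of one such strategy, the sketch below implements a generic confidence penalty (subtracting a scaled entropy bonus from the cross-entropy loss during fine-tuning) that discourages overly peaked output distributions. It is a minimal example under our own assumptions, not the exact formulation used in the cited works; the tensor shapes, vocabulary size, and `beta` weight are illustrative:

```python
import torch
import torch.nn.functional as F

def confidence_penalized_loss(logits, targets, beta=0.1):
    """Cross-entropy plus a confidence penalty on low-entropy (overconfident) outputs.
    logits: (batch, vocab_size), targets: (batch,) ground-truth token ids."""
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    # Subtracting the entropy term rewards less peaked predictive distributions.
    return ce - beta * entropy

# Hypothetical usage inside a fine-tuning step:
logits = torch.randn(8, 32000, requires_grad=True)   # model outputs for 8 tokens
targets = torch.randint(0, 32000, (8,))              # ground-truth token ids
loss = confidence_penalized_loss(logits, targets)
loss.backward()
```

In practice such a term is only one ingredient; retrieval grounding, better training data, and alignment techniques are typically combined with it to reduce hallucinations.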
4) Mixture of Experts: A Paradigm Shift: The adoption of the MoE architecture in LLMs marks a critical evolution in AI technology. This innovative approach, exemplified by advanced models like Google’s Switch Transformer [5] and Mistral AI’s Mixtral-8x7B [6], leverages multiple transformer-based expert modules with dynamic token routing, enhancing modeling efficiency and scalability. The primary advantage of MoE lies in its ability to handle vast parameter scales while significantly reducing memory footprint and computational costs [93], [94], [95], [96], [97]. This is achieved through model parallelism across specialized experts, allowing the training of models with trillions of parameters; the specialization of individual experts in handling diverse data distributions also enhances capability in few-shot learning and other complex tasks [94], [95]. To illustrate the practicality of MoE, consider its application in healthcare. An MoE-based system could be used for personalized medicine, where different ‘expert’ modules specialize in various aspects of patient data analysis, including genomics, medical imaging, and electronic health records. This approach could significantly enhance diagnostic accuracy and treatment personalization. Similarly, in finance, MoE models can be deployed for risk assessment, where experts analyze distinct financial indicators, market trends, and regulatory compliance factors.
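To ground the idea of dynamic token routing, here is a minimal sketch of a top-1 routed MoE feed-forward layer: a gating network scores each token, and the token is processed only by the highest-scoring expert. The class name, dimensions, and expert count are illustrative assumptions, and this is far simpler than production systems such as Switch Transformer or Mixtral:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Minimal top-1 routed mixture-of-experts feed-forward layer."""
    def __init__(self, d_model=64, d_hidden=128, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                # x: (num_tokens, d_model)
        gate_probs = F.softmax(self.router(x), dim=-1)   # routing weights per token
        top_prob, top_idx = gate_probs.max(dim=-1)       # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                          # tokens routed to expert e
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(TinyMoELayer()(tokens).shape)   # torch.Size([10, 64])
```

Because each token touches only one expert’s parameters, total parameter count can grow with the number of experts while per-token compute stays roughly constant, which is the efficiency argument behind MoE scaling.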
Despite its benefits, MoE faces challenges in dynamic routing complexity [98], [99], [100], [101], [102], expert imbalance [103], [104], [105], [106], and probability dilution [107]; these technical hurdles demand sophisticated solutions to fully harness MoE’s potential. Moreover, while MoE may offer performance gains, it does not inherently solve ethical alignment issues in AI [108], [109], [110]. The complexity and specialization of MoE models can obscure their decision-making processes, complicating efforts to ensure ethical compliance and alignment with human values [108], [111]. Although the paradigm shift to MoE signifies a major leap in LLM development, offering significant scalability and specialization advantages, ensuring the safety, ethical alignment, and transparency of these models remains a paramount concern. The MoE architecture, while technologically advanced, requires continued interdisciplinary research and governance to align AI with broader societal values and ethical standards.
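Expert imbalance, in particular, is commonly mitigated with an auxiliary load-balancing loss. The sketch below follows the general form popularized by Switch Transformer, scaling the sum over experts of (fraction of tokens dispatched to the expert) × (mean routing probability assigned to it); the function name and arguments are our own illustrative choices:

```python
import torch

def load_balancing_loss(gate_probs, expert_indices, num_experts):
    """Auxiliary loss encouraging tokens to spread evenly across experts.
    gate_probs: (num_tokens, num_experts) softmax router outputs
    expert_indices: (num_tokens,) chosen expert per token"""
    # f_i: fraction of tokens actually dispatched to each expert
    dispatch_fraction = torch.bincount(
        expert_indices, minlength=num_experts
    ).float() / expert_indices.numel()
    # P_i: mean router probability assigned to each expert
    mean_gate_prob = gate_probs.mean(dim=0)
    return num_experts * torch.sum(dispatch_fraction * mean_gate_prob)

gate_probs = torch.softmax(torch.randn(10, 4), dim=-1)
expert_indices = gate_probs.argmax(dim=-1)
print(load_balancing_loss(gate_probs, expert_indices, num_experts=4))
```

Added to the main training objective with a small weight, this term pushes the router away from collapsing onto a few favored experts, though it does not by itself resolve routing complexity or probability dilution.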
B. Multimodal AI and the Future of Interaction
The advent of multimodal AI marks a transformative era in AI development, revolutionizing how machines interpret and interact with a diverse array of human sensory inputs and contextual data.
1) Gemini: Redefining Benchmarks in Multimodality: Gemini, a pioneering multimodal conversational system, marks a significant shift in AI technology by surpassing traditional text-based LLMs like GPT-3 and even its multimodal counterpart, ChatGPT-4. Gemini’s architecture has been designed to incorporate the processing of diverse data types such as text, images, audio, and video, a feat facilitated by its unique multimodal encoder, cross-modal attention network, and multimodal decoder [112]. The architectural core of Gemini is its dual-encoder structure, with separate encoders for visual and textual data, enabling sophisticated multimodal contextualization [112]. This architecture is believed to surpass the capabilities of single-encoder systems, allowing Gemini to associate textual concepts with image regions and achieve a compositional understanding of scenes [112]. Furthermore, Gemini integrates structured knowledge and employs specialized training paradigms for cross-modal intelligence, setting new benchmarks in AI [112]. In [112], Google has claimed and demonstrated that Gemini distinguishes itself from ChatGPT-4 through several key features:
• Breadth of Modalities: Unlike ChatGPT-4, which primarily focuses on text, documents, images, and code, Gemini handles a wider range of modalities, including audio and video. This extensive range allows Gemini to tackle complex tasks and understand real-world contexts more effectively.
• Performance: Gemini Ultra excels in key multimodality benchmarks, notably in Massive Multitask Language Understanding (MMLU), which encompasses a diverse array of domains such as science, law, and medicine, outperforming ChatGPT-4.
• Scalability and Accessibility: Gemini is available in three tailored versions – Ultra, Pro, and Nano – catering to a range of applications from data centers to on-device tasks, a level of flexibility not yet seen in ChatGPT-4.
• Code Generation: Gemini’s proficiency in understanding and generating code across various programming languages is more advanced, offering practical applications beyond ChatGPT-4’s capabilities.
• Transparency and Explainability: A focus on explainability sets Gemini apart, as it provides justifications for its outputs, enhancing user trust and understanding of the AI’s reasoning process.
Despite these advancements, Gemini’s real-world performance in complex reasoning tasks that require integration of commonsense knowledge across modalities remains to be thoroughly evaluated.
2) Technical Challenges in Multimodal Systems: The development of multimodal AI systems faces several technical hurdles, including creating robust and diverse datasets, managing scalability, and enhancing user trust and system interpretability [113], [114], [115]. Challenges like data skew and bias, which stem from data acquisition and annotation issues, require effective dataset management through strategies such as data augmentation, active learning, and transfer learning [113], [116], [80], [115]. A further challenge is the computational demand of processing multiple data streams simultaneously, which requires powerful hardware and optimized model architectures with multiple encoders [117], [118]. Advanced algorithms and multimodal attention mechanisms are needed to balance attention across different input media and to resolve conflicts between modalities, especially when they provide contradictory information [119], [120], [118]. Scalability issues, arising from the extensive computational resources required, are exacerbated by the limited availability of high-performance hardware [121], [122]. There is also a pressing need for calibrated multimodal encoders capable of compositional scene understanding and data integration [120]. Evaluation metrics must be refined to accurately assess performance on real-world tasks, which calls for comprehensive datasets and unified benchmarks, while user trust and system interpretability should be strengthened through explainable AI in multimodal contexts. Addressing these challenges is vital for the advancement of multimodal AI systems, enabling seamless and intelligent interaction aligned with human expectations.
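As a concrete, simplified picture of the multimodal attention mechanisms mentioned above, the sketch below lets text-token queries attend over image-patch features in a single cross-modal attention hop; the dimensions, sequence lengths, and class name are illustrative assumptions, not the architecture of Gemini or any other specific system discussed here:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text queries attend over image-patch keys/values (one attention hop)."""
    def __init__(self, d_model=64, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, text_feats, image_feats):
        # text_feats: (batch, text_len, d_model); image_feats: (batch, patches, d_model)
        fused, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        return fused + text_feats   # residual keeps the text stream intact

text = torch.randn(2, 16, 64)    # e.g. 16 text tokens
image = torch.randn(2, 49, 64)   # e.g. a 7x7 grid of image-patch embeddings
print(CrossModalAttention()(text, image).shape)   # torch.Size([2, 16, 64])
```

Real systems stack many such layers (often in both directions) and must weigh modalities against each other, which is precisely where the balancing and conflict-resolution challenges noted above arise.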
3) Multimodal AI: Beyond Text in Ethical and Social Contexts: The expansion of multimodal AI systems introduces both benefits and complex ethical and social challenges that extend beyond those faced by text-based AI. In commerce, multimodal AI can transform customer engagement by integrating visual, textual, and auditory data [123], [124], [125]. For autonomous vehicles, multimodality can enhance safety and navigation by synthesizing data from various sensors, including visual, radar, and Light Detection and Ranging (LIDAR) [126], [125], [127]. Still, DeepFake technology’s ability to generate convincingly realistic videos, audio, and images is a critical concern in multimodality. It poses risks of misinformation and manipulation that can significantly affect public opinion, political landscapes, and personal reputations, compromising the authenticity of digital media and raising problems for social engineering defenses and digital forensics, where distinguishing genuine from AI-generated content becomes increasingly challenging [128], [129]. Privacy concerns are amplified in multimodal AI because of its ability to process and correlate diverse data sources, potentially enabling intrusive surveillance and profiling; this raises questions about the consent and rights of individuals, especially when personal media is used without permission for AI training or content creation [113], [130], [131]. Moreover, multimodal AI can propagate and amplify biases and stereotypes across different modalities, and if unchecked, this can perpetuate discrimination and social inequities, making it imperative to address algorithmic bias effectively [132], [133], [134]. The ethical development of multimodal AI systems therefore requires robust governance frameworks focusing on transparency, consent, data handling protocols, and public awareness; ethical guidelines must evolve to address the unique challenges posed by these technologies, including setting standards for data usage and safeguarding against the nonconsensual exploitation of personal information [135], [136]. Additionally, the development of AI literacy programs will be crucial in helping society understand and responsibly interact with multimodal AI technologies [113], [135]. As the field progresses, interdisciplinary collaboration will be key to ensuring these systems are developed and deployed in a manner that aligns with societal values and ethical principles [113].
C. Speculative Advances and Chronological Trends
In the dynamic landscape of AI, the speculative capabilities of the Q* project, blending LLMs, Q-learning, and the A* (A-star) search algorithm, embody a significant leap forward. This section explores the evolutionary trajectory from game-centric AI systems to the broad applications anticipated with Q*.
1) From AlphaGo’s Groundtruth to Q-Star’s Exploration: The journey from AlphaGo, a game-centric AI, to the conceptual Q-Star project represents a significant paradigm shift in AI. AlphaGo’s mastery of the game of Go highlighted the effectiveness of deep learning and tree search algorithms within well-defined rule-based environments, underscoring the potential of AI in complex strategy and decision-making [137], [138]. Q-Star, however, is speculated to move beyond these confines, aiming to amalgamate the strengths of reinforcement learning (as seen in AlphaGo) with the knowledge, natural language generation, creativity, and versatility of LLMs, and the strategic efficiency of pathfinding algorithms like A*. This blend of pathfinding algorithms and LLMs could enable AI systems to transcend the confines of board games and, through natural language processing, interact with human language, enabling nuanced interactions and marking a leap towards AI adept in both structured tasks and complex human-like communication and reasoning. Moreover, the incorporation of Q-learning and A* algorithms would enable Q-Star to optimize decision paths and learn from its interactions, making it more adaptable and intelligent over time. The combination of these technologies could lead to AI that is not only more efficient in problem-solving but also creative and insightful in its approach. This speculative advancement from the game-focused power of AlphaGo to the comprehensive potential of Q-Star illustrates the dynamic and ever-evolving nature of AI research, and opens up possibilities for AI applications that are more integrated with human life and capable of handling a broader range of tasks with greater autonomy and sophistication.
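For readers unfamiliar with the two classical components invoked here, the sketch below restates the tabular Q-learning update and a basic A* search over a graph. It is a textbook-style illustration only, with hypothetical names and toy inputs, and implies nothing about how the speculated Q* system actually combines these ideas with LLMs:

```python
import heapq
from collections import defaultdict

# Tabular Q-learning update: Q(s,a) <- Q(s,a) + lr * (r + gamma * max_a' Q(s',a') - Q(s,a))
def q_update(Q, state, action, reward, next_state, actions, lr=0.1, gamma=0.99):
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += lr * (reward + gamma * best_next - Q[(state, action)])

# A* search: expand nodes in order of f(n) = g(n) + h(n), where h is a heuristic estimate
def a_star(start, goal, neighbors, heuristic):
    frontier = [(heuristic(start), 0, start, [start])]   # (f, g, node, path)
    visited = set()
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt, cost in neighbors(node):
            heapq.heappush(frontier, (g + cost + heuristic(nxt), g + cost, nxt, path + [nxt]))
    return None

Q = defaultdict(float)
q_update(Q, state="s0", action="a0", reward=1.0, next_state="s1", actions=["a0", "a1"])
```

Q-learning supplies value estimates learned from interaction, while A* supplies efficient goal-directed search; the speculation around Q* is essentially about coupling such mechanisms to the linguistic and generative competence of LLMs.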
2) Bridging Structured Learning with Creativity: The anticipated Q* project, blending Q-learning and A* algorithms with the creativity of LLMs, embodies a groundbreaking step in AI, potentially surpassing recent innovations like Gemini. The fusion suggested in Q* points to an integration of structured, goal-oriented learning with generative, creative capabilities, a combination that could transcend the existing achievements of Gemini. While Gemini represents a significant leap in multimodal AI, combining various forms of data inputs such as text, images, audio, and video, Q* is speculated to bring a more profound integration of creative reasoning and structured problem-solving. This would be achieved by merging the precision and efficiency of algorithms like A* with the learning adaptability of Q-learning and the complex understanding of human language and context offered by LLMs. Such an integration could enable AI systems not only to process and analyze complex multimodal data but also to autonomously navigate structured tasks while engaging in creative problem-solving and knowledge generation, mirroring the multifaceted nature of human cognition. The implications of this potential advancement are vast, suggesting applications that span beyond the capabilities of current multimodal systems like Gemini. By aligning the deterministic aspects of traditional AI algorithms with the creative and generative potential of LLMs, Q* could offer a more holistic approach to AI development, bridging the gap between the logical, rule-based processing of AI and the creative, abstract thinking characteristic of human intelligence. The anticipated unveiling of Q*, merging structured learning techniques and creative problem-solving in a single advanced framework, holds the promise of not only extending but significantly surpassing the multimodal capabilities of systems like Gemini. Such a development would herald another game-changing era in generative AI and stands as one of the most eagerly awaited milestones in the ongoing evolution of the field.
This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.
[5] https://huggingface.co/google/switch-c-2048
[6] https://huggingface.co/mistralai/Mixtral-8x7B-v0.1