Are we on the brink of debunking Moravec’s Paradox? In the 1980s, AI and robotics researcher Hans Moravec highlighted a counterintuitive aspect of AI: tasks requiring high-level reasoning — like chess or Go — are easier for machines to master than basic sensory and motor skills — such as walking or recognizing your mom’s face — which humans find instinctive. Adding to the complexity, these “simpler” skills actually demand far more computational power. This insight sheds light on the difficulty of replicating human-like perception and dexterity, the outcome of millions of years of evolution, as opposed to logical reasoning, a much more recent development. In today’s AI and ML landscape, the paradox underscores the challenge of creating robots and AI systems capable of seamlessly navigating and interacting with the physical world.

However, last week Bernt Bornich, CEO and founder of 1X, a humanoid robotics company, wrote: “New progress update on the droids dropping in 4 weeks, looks like Moravec’s paradox might be debunked, and we just didn’t have the data.”

I suspect this has something to do with the advancements in foundation models. Originally known for their ability to perform a wide range of tasks based on a single type of data (like text for language models), these models become “multimodal” when they integrate and interpret information across different sensory inputs, closely mirroring human-like understanding. Could the embodiment of AI, with all its sensory inputs, plus reasoning-imitating algorithms like LLMs, be the pool of data that disproves Moravec’s paradox?

Another intriguing development caught my attention. Jensen Huang, Nvidia’s CEO, responded to a question from Wired about what current development could change everything: “There are a couple of things. One doesn’t really have a name, but it’s part of the work we’re doing in foundational robotics. If you can generate text and images, can you also generate motion?
The answer is probably yes. And if you can generate motion, you can understand the intent and generate a generalized version of articulation. Therefore, humanoid robotics should be right around the corner.” Something to observe in the coming weeks!

In related news from the robotics universe, Figure AI, a humanoid robotics startup, made headlines by raising approximately $675 million in funding. What’s more impressive is the list of backers: Amazon, NVIDIA, Microsoft, OpenAI, Intel, LG, and Samsung. This indicates a strong belief in the potential of humanoid robotics to disrupt various sectors.

Yet there are skeptical voices. Rodney Brooks, who coined Nouvelle AI, posted last week: “Tele-op robots presented as autonomous, like the Tesla Optimus humanoid folding a shirt, and 1X humanoid robots, are misrepresentations of what robots are actually doing, which can also be called LIES. Note that the Stanford robot cooking and cleaning videos are also tele-operated.”

If 2023 was the year of LLMs, are we ready to move on to embodied AI and make 2024 the year of robots?

🎁 Bonus: The freshest research papers from the week of Feb 19 — Feb 25

Enhancing Large Language Models (LLMs)

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens: Expands the processing capability of LLMs to handle over 2 million tokens, pushing the boundaries of context window sizes for more comprehensive understanding and generation tasks. Read the paper.

OmniPred: Language Models as Universal Regressors: Demonstrates the versatility of LLMs in performing numerical regression tasks, suggesting their potential as universal tools for predictive modeling across a variety of domains. Read the paper.

Divide-or-Conquer? Which Part Should You Distill Your LLM?: Investigates efficient strategies for distilling large models into smaller, more manageable ones, particularly for reasoning tasks, emphasizing the importance of decomposition over problem-solving.
Read the paper.

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition: Proposes a method to enhance the efficiency of self-attention mechanisms in LLMs, crucial for improving performance and reducing resource consumption. Read the paper.

USER-LLM: Efficient LLM Contextualization with User Embeddings: Introduces a framework for personalizing LLM interactions using user embeddings, enhancing the model’s responsiveness to individual user preferences and histories.

Multimodal and Multi-Agent Systems

World Model on Million-Length Video and Language with RingAttention: Explores integrating video and language for advanced AI understanding and interaction, leveraging a novel RingAttention mechanism for efficient multimodal learning. Read the paper.

AgentScope: A Flexible yet Robust Multi-Agent Platform: Develops a multi-agent platform that enhances cooperation and flexibility among agents, addressing the complexity of multi-agent systems and their practical applications. Read the paper.

TinyLLaVA: A Framework of Small-scale Large Multimodal Models: Focuses on the design and analysis of small-scale multimodal models, showing that with strategic optimizations, smaller models can achieve or surpass the performance of larger counterparts. Read the paper.

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling: Introduces a versatile multimodal language model that processes speech, text, images, and music, demonstrating the power of discrete representations in unifying various data modalities within a single framework. This approach simplifies the integration of new modalities without needing to modify the underlying architecture or training methodologies. Read the paper.

A Touch, Vision, and Language Dataset for Multimodal Alignment: Presents a novel dataset that enhances multimodal understanding by incorporating touch alongside vision and language, aiming to advance touch-vision-language alignment and understanding through a tactile encoder and text generation model. Read the paper.

Advancements in Specific Domains

YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information: Introduces a novel object detection model that leverages programmable gradient information for enhanced accuracy and efficiency in learning. Read the paper.

Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping: Demonstrates the application of Transformers in complex planning tasks, offering a method that surpasses traditional search algorithms in efficiency and effectiveness. Read the paper.

Developer Tools and APIs

API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs: Presents a vast dataset designed for training LLMs to interact with APIs, addressing the challenge of creating effective models for API usage and integration. Read the paper.

OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement: Develops an open-source system for code generation, execution, and refinement, facilitated by a dataset of multi-turn interactions, aiming to bridge the gap between code generation models and practical coding tasks. Read the paper.

Security and Adversarial Research

Coercing LLMs to do and reveal (almost) anything: Explores the susceptibility of LLMs to a wide range of adversarial attacks, highlighting the need for comprehensive security measures to protect against unintended behaviors and data extraction. Read the paper.

Model Efficiency and Quantization

OneBit: Towards Extremely Low-bit Large Language Models: Discusses a novel framework for quantizing LLM weight matrices to 1-bit to drastically reduce storage and computational demands while maintaining performance, enabling efficient deployment of LLMs on resource-constrained devices. Read the paper.

Instruction Tuning and Data Quality

Reformatted Alignment: Introduces REALIGN, a method for refining instruction data quality for LLMs to better align with human values, emphasizing the importance of instruction data quality in model alignment and suggesting areas for further exploration in LLM science. Read the paper.

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models: Proposes a novel method for instruction tuning that generates synthetic instruction data across all disciplines, showcasing a scalable and customizable approach to instruction tuning without relying on specific training data. Read the paper.

Instruction-tuned Language Models are Better Knowledge Learners: Proposes a pre-instruction-tuning method to enhance LLMs’ knowledge updating capabilities, demonstrating significant improvements in factual knowledge absorption and cross-domain generalization. Read the paper.

*Also published here.*
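To close with a flavor of the extreme-quantization idea behind papers like OneBit: the simplest form of 1-bit weight quantization keeps only the sign of each weight plus one scale factor per row. The sketch below is a generic illustration of that concept, not OneBit’s actual method (the paper’s decomposition is more sophisticated):

```python
import numpy as np

def quantize_1bit(W):
    """Approximate each row of W as scale_i * sign(W_i).

    Using the mean absolute value of the row as the scale minimizes
    the L2 reconstruction error for this sign-based form.
    """
    signs = np.sign(W)
    signs[signs == 0] = 1.0  # map zeros to +1 so every entry is strictly +/-1
    scales = np.abs(W).mean(axis=1, keepdims=True)
    return signs, scales

def dequantize(signs, scales):
    """Reconstruct the approximate full-precision matrix."""
    return signs * scales

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
signs, scales = quantize_1bit(W)
W_hat = dequantize(signs, scales)

# Storage drops from 32 bits to ~1 bit per weight (plus one scale per row),
# traded against reconstruction error:
err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative error: {err:.2f}")
```

The trade-off is visible immediately: the sign matrix preserves the direction of each weight, while the per-row scale preserves its average magnitude, which is why such aggressive schemes need the additional machinery the paper describes to keep model quality intact.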