Last year was all about the race among commercial LLMs. Although the LLaMA leak and Meta's subsequent moves on the open-source front were significant, and Hugging Face kept flourishing as a platform for open source, this year starts with a powerful player stepping in with a truly open-source approach.
We are talking, of course, about OLMo: Accelerating the Science of Language Models, released by the Allen Institute for Artificial Intelligence. Newsletters such as AlphaSignal, TheSequence, Data Machina, Smol Talk, and Interconnects (Nathan Lambert is one of the authors of the OLMo paper) did a good job of explaining the difference between almost open-source and truly open-source models.
The gist is that truly open source means releasing not only the model weights and inference code but the whole package: the training data, the training and evaluation code, and a comprehensive framework for studying language modeling.
Who is behind OLMo?
What I found interesting is who stands behind the release. EleutherAI’s Pythia and BigScience’s BLOOM had already set a precedent for releasing fully open-source models; what distinguishes OLMo is that it comes from a true nonprofit organization, the Allen Institute for AI (AI2). AI2 was founded in 2014 by philanthropist and Microsoft co-founder Paul G. Allen to conduct high-impact research and engineering in artificial intelligence. He was also very interested in teaching machines “common sense.”
And he funded this cause well. I once had a conversation with one of AI2’s top executives, who said that thanks to Paul Allen’s financing structure, AI2 is well funded, is free from the influence of large companies, and is under no pressure to make money.
AI2 is famous for conducting cutting-edge research in AI and aiming to influence the broader AI research community by releasing open-source software, datasets, and research findings. Projects like the Semantic Scholar academic search engine democratize access to information and accelerate scientific breakthroughs.
Why OLMo is special
The OLMo framework includes multiple training checkpoints, training logs, the exact datasets used, and a permissive license, establishing a new standard for openness in the field. The authors also don’t mind the model being used for commercial purposes. Unlike many other labs, the researchers readily embrace openness, believing that its benefits outweigh the low risk of misuse: their models are not designed as chatbots and are meant to contribute to science rather than to commercial products.
Furthermore, they released ‘Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.’ According to Luca Soldaini, ‘the name of the pretraining corpus, “Dolma,” stands for Data to feed OLMo’s Appetite.’
What also surprised me was that the authors highlighted the environmental impact of training large LMs, providing estimates of power consumption and carbon emissions. They advocate for transparency in reporting these impacts and emphasize the potential for open models like OLMo to mitigate future emissions by minimizing redundant model training.
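To make that kind of estimate concrete, here is a minimal sketch of the usual back-of-the-envelope accounting: the energy drawn by the hardware, scaled by the data center's power usage effectiveness (PUE), multiplied by the carbon intensity of the local grid. The function name and all the numbers below are hypothetical placeholders for illustration; they are not figures from the OLMo paper, and the paper's exact procedure may differ.

```python
def training_emissions_tco2e(
    energy_mwh: float,
    pue: float,
    grid_intensity_tco2e_per_mwh: float,
) -> float:
    """Rough carbon estimate for one training run, in tonnes of CO2-equivalent.

    energy_mwh: electricity drawn by the accelerators during training (MWh)
    pue: power usage effectiveness of the data center (>= 1.0)
    grid_intensity_tco2e_per_mwh: emissions per MWh of the local grid
    """
    if energy_mwh < 0 or grid_intensity_tco2e_per_mwh < 0:
        raise ValueError("energy and grid intensity must be non-negative")
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    # Total facility energy = hardware energy * PUE; emissions = energy * intensity.
    return energy_mwh * pue * grid_intensity_tco2e_per_mwh


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the arithmetic.
    estimate = training_emissions_tco2e(
        energy_mwh=100.0,
        pue=1.1,
        grid_intensity_tco2e_per_mwh=0.5,
    )
    print(f"Estimated emissions: {estimate:.1f} tCO2e")
```

The same run on a cleaner grid gives a proportionally smaller number, which is why reporting all three inputs, and not just the final figure, makes such estimates comparable across labs.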
A great start to the year for open source!
News from The Usual Suspects ©
Google
- In their blog post, Google announced MobileDiffusion, a new approach for sub-second text-to-image generation on mobile devices, addressing the efficiency challenges of current large models. This model combines a pre-trained diffusion model with a GAN to enable one-step sampling during inference, resulting in a compact model size of 520M parameters capable of generating high-quality images rapidly on iOS and Android devices. The research includes an in-depth exploration of model architecture, optimizing for mobile deployment while maintaining image quality.
“There are several strategic benefits. First, open source software is typically safer and more secure, as well as more compute efficient to operate due to all the ongoing feedback, scrutiny, and development from the community. This is a big deal because safety is one of the most important issues in AI. Efficiency improvements and lowering the compute costs also benefit everyone including us. Second, open source software often becomes an industry standard, and when companies standardize on building with our stack, that then becomes easier to integrate new innovations into our products.
That’s subtle, but the ability to learn and improve quickly is a huge advantage and being an industry standard enables that. Third, open source is hugely popular with developers and researchers. We know that people want to work on open systems that will be widely adopted, so this helps us recruit the best people at Meta, which is a very big deal for leading in any new technology area. And again, we typically have unique data and build unique product integrations anyway, so providing infrastructure like Llama as open source doesn’t reduce our main advantages. This is why our long-standing strategy has been to open source general infrastructure and why I expect it to continue to be the right approach for us going forward.” — Mark Zuckerberg
The freshest research papers, categorized for your convenience
Language Modeling and Efficiency
- Large Language Models for Mathematical Reasoning: Progresses and Challenges: Examines LLMs in mathematical reasoning, highlighting capabilities and challenges. read the paper
- Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens: Introduces an ∞-gram model for extensive text analysis. read the paper
- Scavenging Hyena: Distilling Transformers into Long Convolution Models: Proposes a sustainable and efficient LLM pre-training method. read the paper
- Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling: Presents a strategy for efficient LLM pre-training using rephrased web documents. read the paper
Advanced Reasoning and Contextual Understanding
- Efficient Tool Use with Chain-of-Abstraction Reasoning: Enhances multi-step reasoning in LLMs for improved accuracy and efficiency. read the paper
- RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval: Improves performance on complex reasoning tasks through hierarchical text summarization. read the paper
- Can Large Language Models Understand Context?: Investigates LLMs’ contextual understanding and the impact of compression techniques. read the paper
Enhancements in AI Frameworks and Methodologies
- SymbolicAI: A framework for logic-based approaches combining generative models and solvers: Merges generative models with solvers for advanced concept learning. read the paper
- Efficient Exploration for LLMs: Demonstrates the benefits of efficient exploration in LLM training. read the paper
- Transforming and Combining Rewards for Aligning Large Language Models: Discusses methods for improving LLM alignment with human preferences. read the paper
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models: Introduces a MoE approach for efficient multi-modal learning in LVLMs. read the paper
Novel Applications and Security Insights
- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception: Develops an autonomous agent for mobile devices using visual perception. read the paper
- Transfer Learning for Text Diffusion Models: Explores text diffusion as a competitive method for LLMs. read the paper
- Weak-to-Strong Jailbreaking on Large Language Models: Identifies a security issue in LLMs and proposes initial defense strategies. read the paper
In other newsletters
- Newcomer has a very interesting read about a new trend: VCs backing companies whose revenue comes mostly from government contracts.
- ChinAI shares a research report with interesting details about the types of security assessments Chinese AI labs are implementing to comply with generative AI regulations.
- ML Engineer Newsletter celebrates a major milestone towards democratizing AI inference: their Vulkan Kompute project was adopted as one of the backends for the llama.cpp and GPT4All frameworks.