At its Worldwide Developers Conference (WWDC) 2024, Apple presented a number of enhancements intended to streamline the deployment and improve the performance of on-device AI models. Among the most important changes were major updates to Core ML Tools, shipped as the pre-release version 8.0b1.
These updates aim to make deploying machine learning (ML) models on Apple devices more efficient and effective. Here is a breakdown of these innovations, how they affect developers, and what they mean for end users.
Before diving into the updates, let's clarify some key terms:
Palettization reduces the precision of a model's weights by grouping them into clusters and representing each cluster with a single value stored in a look-up table (palette). It is like a painting in which one representative color stands in for a whole range of similar shades. By compressing the weight values this way, palettization significantly reduces the size of a model.
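As a rough illustration of the idea (not the Core ML implementation), the sketch below clusters weights with a few iterations of a simple k-means loop in NumPy, producing a small palette plus per-weight indices:

```python
import numpy as np

def palettize(weights, n_clusters=4, n_iters=10):
    """Compress weights into a small look-up table (palette) plus indices."""
    flat = weights.ravel()
    # Initialize cluster centers evenly across the weight range.
    centers = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iters):
        # Assign each weight to its nearest center.
        idx = np.abs(flat[:, None] - centers[None, :]).argmin(axis=1)
        # Move each center to the mean of its assigned weights.
        for k in range(n_clusters):
            if np.any(idx == k):
                centers[k] = flat[idx == k].mean()
    idx = np.abs(flat[:, None] - centers[None, :]).argmin(axis=1)
    return centers, idx.reshape(weights.shape)

weights = np.array([[0.9, 0.11, 0.88], [0.1, 0.92, 0.09]])
lut, indices = palettize(weights, n_clusters=2)
restored = lut[indices]  # decompression: look each index up in the palette
```

With 2 palette entries, each weight is stored as a 1-bit index instead of a full float, at the cost of a small approximation error.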
Quantization reduces the precision of weights and activations from floating-point numbers, such as 32-bit floats, to lower-precision numbers, such as 8-bit integers. This compression technique reduces model size and also speeds up inference by making computations faster on lower-precision hardware.
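A minimal sketch of symmetric 8-bit quantization (one shared scale for the whole tensor; a simplification of what real toolchains do):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric 8-bit quantization: map floats to int8 with a single scale."""
    scale = np.abs(x).max() / 127.0          # largest magnitude maps to 127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.75], dtype=np.float32)
q, scale = quantize_int8(w)   # stored as 1 byte per weight instead of 4
w_hat = dequantize(q, scale)  # approximate reconstruction
```

Each weight now occupies a quarter of its original storage, and the reconstruction stays within the quantization step of the original value.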
This variant of quantization divides the model's weights into smaller blocks and quantizes each block separately, which improves accuracy because the quantization parameters are tailored to each block.
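The sketch below shows why per-block scales help: an outlier in one block no longer forces a coarse scale onto every other block (again a NumPy illustration, not the Core ML implementation):

```python
import numpy as np

def quantize_blockwise(x, block_size=4):
    """Quantize each block with its own scale, so an outlier in one block
    does not degrade precision everywhere else."""
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales

def dequantize_blockwise(q, scales, shape):
    return (q.astype(np.float32) * scales).reshape(shape)

# One block holds small values; the other contains a large outlier.
w = np.array([0.01, 0.02, -0.03, 0.02, 5.0, 0.01, -0.02, 0.03],
             dtype=np.float32)
q, scales = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, scales, w.shape)
```

With a single global scale, the 5.0 outlier would dominate and the small weights would all collapse to zero; with per-block scales, the first block is reconstructed almost exactly.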
Pruning is a data compression technique that eliminates weights that are non-critical and have the least impact on the model's predictions. The least important weights are set to zero, and the resulting sparse weight matrices can be stored efficiently using sparse representations.
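A minimal sketch of magnitude pruning followed by a sparse (index, value) encoding; a toy illustration of the principle, not the actual Core ML storage format:

```python
import numpy as np

def prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights (magnitude pruning)."""
    k = int(weights.size * sparsity)
    threshold = np.sort(np.abs(weights).ravel())[k]
    return np.where(np.abs(weights) < threshold, 0.0, weights)

def to_sparse(pruned):
    """Store only the nonzero values plus their positions."""
    idx = np.flatnonzero(pruned)
    return idx, pruned.ravel()[idx]

w = np.array([0.9, 0.01, -0.8, 0.02, 0.03, 0.7])
p = prune(w, sparsity=0.5)     # small weights 0.01, 0.02, 0.03 become zero
idx, vals = to_sparse(p)       # only 3 of the 6 values need storing
```

At 50% sparsity, only half the values (plus their indices) are kept, and the dropped weights were the ones contributing least to predictions.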
Stateful models keep track of information that needs to be passed across multiple runs of the model; in other words, they retain context as state. This is especially important for tasks such as language modeling, where the model needs to remember previously generated words in order to produce the next ones coherently.
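The idea can be reduced to a toy example: state that persists between calls, much like a KV cache lets a language model remember earlier tokens (this is a conceptual sketch, not a Core ML API):

```python
class ToyStatefulModel:
    """Toy illustration: state persists across calls, so each prediction
    can use everything the model has seen so far."""
    def __init__(self):
        self.history = []           # the model's persistent state

    def predict(self, token):
        self.history.append(token)  # update state with the new input
        # The "prediction" here just echoes the remembered context.
        return " ".join(self.history)

model = ToyStatefulModel()
model.predict("The")
model.predict("quick")
out = model.predict("fox")  # the model still remembers "The" and "quick"
```

A stateless model would see only "fox" on the third call; the stateful one carries the full context forward without the caller re-supplying it.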
Core ML Tools (coremltools) is a Python package for converting third-party models to formats suitable for Core ML, Apple's framework for integrating machine learning models into apps. It supports conversion from popular libraries such as TensorFlow and PyTorch into the Core ML model package format.
The coremltools package allows you to:
Convert trained models from libraries such as TensorFlow and PyTorch to the Core ML format
Read, write, and optimize Core ML models
Verify converted models by making predictions using Core ML
Core ML provides a unified representation for all models, allowing your app to use Core ML APIs and user data to make predictions and to fine-tune models directly on the user’s device. This approach removes the need for a network connection, keeps user data private, and makes your app more responsive. Core ML optimizes on-device performance by leveraging the CPU, GPU, and Neural Engine (NE) all while minimizing memory footprint and power consumption.
With the theory and terminology covered, it is time to dive into the new features and changes in the upcoming Core ML Tools version 8.0b1.
The introduction of coremltools.utils.MultiFunctionDescriptor() and coremltools.utils.save_multifunction simplifies the creation of ML programs with multiple functions that can share weights. This makes models more versatile and easier to use, as a specific function can be loaded on demand for prediction.
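A minimal sketch of how these utilities fit together, based on the coremltools 8 API; the package paths and function names below are hypothetical, and the input .mlpackage files must already exist:

```python
from coremltools.utils import MultiFunctionDescriptor, save_multifunction

desc = MultiFunctionDescriptor()
# Each add_function call pulls one function out of an existing .mlpackage.
desc.add_function("classifier.mlpackage",
                  src_function_name="main",
                  target_function_name="classify")
desc.add_function("regressor.mlpackage",
                  src_function_name="main",
                  target_function_name="regress")
desc.default_function_name = "classify"

# Merge into one ML program; weights shared between the source models
# are stored only once.
save_multifunction(desc, "combined.mlpackage")
```

At prediction time, an app can then load just the function it needs from the combined package instead of shipping two full models.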
Core ML now supports stateful models: the converter can generate models with the new State Type introduced in iOS 18 and macOS 15. These models maintain information from one inference run to the next, which is especially useful for tasks where the model needs to remember inputs it has seen before.
Core ML Tools has expanded its compression capabilities to shrink model sizes while maintaining performance. The updated coremltools.optimize module supports a broader set of techniques, including joint compression modes such as 8-bit look-up tables (LUTs) for palettization and weight pruning combined with quantization or palettization, offering efficient ways to reduce model size and improve performance.
The coremltools.optimize module also received significant API updates to support advanced compression techniques. For example, a new API for activation quantization based on calibration data can turn a W16A16 Core ML model (16-bit weights and activations) into a W8A8 model (8-bit weights and activations), improving efficiency while retaining accuracy. Additionally, coremltools.optimize.torch gained both data-free compression methods and methods based on calibration data, making PyTorch model optimization for Core ML easier.
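The calibration idea can be sketched in a few lines: record activations on sample inputs, derive a scale from their observed range, then quantize future activations with that scale (a conceptual NumPy illustration, not the coremltools API):

```python
import numpy as np

def calibrate_scale(calibration_activations):
    """Derive a per-tensor activation scale from calibration data."""
    max_abs = max(np.abs(a).max() for a in calibration_activations)
    return max_abs / 127.0

def quantize_activations(a, scale):
    return np.clip(np.round(a / scale), -128, 127).astype(np.int8)

# Hypothetical calibration set: activations recorded from sample inputs.
calib = [np.array([0.2, -1.5, 0.7]), np.array([1.2, -0.4, 0.9])]
scale = calibrate_scale(calib)
q = quantize_activations(np.array([0.5, -1.0]), scale)
```

Unlike weights, activations are not known until runtime, which is why a representative calibration set is needed to pick a scale that covers their typical range.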
The latest operating systems support new operations such as constexpr_blockwise_shift_scale, constexpr_lut_to_dense, and constexpr_sparse_to_dense, which are crucial for efficient model compression. Updates to the Gated Recurrent Unit (GRU) operation and the addition of the PyTorch scaled_dot_product_attention operation improve performance and help transformer models and other complex architectures run well on Apple silicon. These updates ensure more efficient execution and better utilization of hardware capabilities.
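Conceptually, an op like constexpr_lut_to_dense expands compressed palette indices back into a dense weight tensor when the model runs. The NumPy emulation below shows only the idea; the real op executes inside the Core ML runtime:

```python
import numpy as np

def lut_to_dense(indices, lut):
    """Emulate the idea behind constexpr_lut_to_dense: expand palette
    indices back into a dense weight tensor."""
    return lut[indices]

lut = np.array([-0.5, 0.0, 0.5, 1.0], dtype=np.float32)  # 2-bit palette
indices = np.array([[0, 3, 2], [1, 2, 0]], dtype=np.uint8)
dense = lut_to_dense(indices, lut)
```

Because the op is "constexpr", the runtime knows the compressed representation is constant and can decompress it efficiently rather than storing the dense tensor in the model file.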
Support for torch.export conversion makes it possible to convert a PyTorch model directly to Core ML.
This process involves:
Importing the necessary libraries
Exporting the PyTorch model using torch.export
Converting the exported program into a Core ML model using coremltools.convert
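The steps above can be sketched as follows; TinyModel and the output file name are hypothetical, and both PyTorch and coremltools must be installed:

```python
import torch
import coremltools as ct

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

model = TinyModel().eval()
example_input = (torch.rand(1, 3),)

# Step 1: export the PyTorch model into an ExportedProgram.
exported = torch.export.export(model, example_input)

# Step 2: convert the exported program to a Core ML model.
mlmodel = ct.convert(exported)
mlmodel.save("tiny.mlpackage")
```

The exported program carries the full graph and shapes, so no separate tracing step is needed before conversion.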
This streamlined process reduces the complexity of deploying PyTorch models on Apple devices while taking advantage of Core ML's performance optimizations.
Support for multifunction models in Core ML Tools allows merging models with shared weights into a single ML program. This is advantageous for applications that perform multiple tasks, such as combining a feature extractor with a classifier and a regressor. The MultiFunctionDescriptor and save_multifunction utilities ensure that shared weights are not duplicated, saving storage space and improving performance.
The new Core ML Tools version 8.0b1 also includes various bug fixes, enhancements, and optimizations that make the development experience smoother. Known issues such as conversion failures with certain palettization modes and incorrect quantization scales have been fixed, improving the reliability and accuracy of compressed models.
The enhancements in the coremltools 8.0b1 pre-release bring several major benefits to end users, improving the overall experience with AI-powered applications.
The pre-release of coremltools 8.0b1 represents a significant step forward in on-device AI model deployment. Now, developers can create more efficient, compact, and versatile ML models with enhanced compression techniques, stateful model support, and multifunction model utilities. These advancements highlight Apple's commitment to providing robust tools for developers to leverage the power of Apple silicon, ultimately delivering faster, more efficient, and more capable on-device AI applications.
As Core ML and its ecosystem evolve, the possibilities for innovation in AI-powered apps continue to expand, opening the door to more sophisticated and user-friendly experiences.
In the upcoming article, we will demonstrate these new features practically in a sample project, showcasing how to apply them in real-life scenarios. Stay tuned!