Comprehensive List of Feature Store Architectures for Data Scientists and Big Data Professionals
Feature store has become an important unit of organizations developing predictive services across any industry domain. Some of the earlier challenges in deploying ML solutions at scale involves :
To overcome the above limitations, Architects. Data scientists, Big Data, and Analytics professionals have felt the necessity to walk under one roof with one unified framework to facilitate easier collaboration, sharing of data, results, reports.
Departments, teams and organizations shared some of the similar notions of Feature Engineering:
While sharing a similar opinion, it became easier to come together and create a Unified Framework called Feature Store. This would enhance the speed of ML model deployment life-cycle along with the creation of proper documents, required version analysis, and model performance in order to save time and effort.
In this blog, we highlight on the features supported by different Feature Store frameworks, that are primarily developed by different leading industry giants.
Michaelangelo - a framework developed by Uber that allows feature integration/joining in both offline and online pipelines. Here Hive (Offline) and Cassandra (Online) acts as the main storage unit for raw/transformed features. It provides a horizontally scalable multi-tenant architecture for multiple models with suitable scaling and monitoring. Training jobs can be configured and managed through a web UI or an API, via Jupyter notebook.
It further provides options to define hierarchical partitioning schema to train models per partition, that can be deployed as a single logical model. This provides easy bootstrapping and helps to overcome challenges when several models need to be trained based on the hierarchical structure of the data.
At runtime during serving, it finds root to the best model for each node. Further its best known for its ability to support continuous learning, providing integration with AutoML, along with its support for distributed deep learning.
Google released Feast which is primarily built around Google Cloud services: Big Query (offline) and Big Table (online) and Redis (low-latency), with Apache Beam for feature engineering. It allows a clear separation between big data and model development. This online predictive service allows feature sharing among teams with strong consistency between model training and serving.
Further Feast comes with centralized feature management, discovery, feature validation, and feature aggregation. The feature columns reside inside wide-entity tables. In addition, the composite entities separate individual features.
Wix provides a platform for feature-sharing across different ML models for both batch and real-time datasets. It supports a pre-configured set of feature families on the site and user-level for both training and serving models. The different stages of data management, model training and deployment are marked and show in the figure above. It further uses S3 to store real-time extracted features.
The Feature Store developed by Comcast helps data scientists to reuse versioned features, upload online (real-time)/streaming data, and review feature metrics by models. The product is available in multiple pluggable feature store components. The built-in model repository contains artifacts related to data pre-processing (normalization, scaling) displaying the required mapping to the features needed to execute the model. Further, the architecture is built using Spark on Alluxio (open source data orchestration layer that brings data close to compute for big data and AI/ML workloads in the cloud), S3, HDFS, RDBMS, Kafka, Kinesis. The Model deployment with Kubeflow helps to build a resilient, highly available distributed systems with support for rate-limiting, shadow deployments, and auto-scaling.
The integration with Data Lake with suitable API s helps data scientists to use SQL and create training/validation/test datasets that can be versioned and integrated into the full model pipeline. In addition, the framework comes with the support of Seldon Inference Graphs for A/B Testing, Ensembles, Multi-armed bandits, Custom combinations. The end to end system not only provides traceability from use-case, models, features, model to features mapping, versioned datasets, model training codebase, model deployment containers, and prediction/outcome sinks, it is also known for integration with Feature-Store, Container Repository, and Git to integrate data, code and run-time artifacts for CI/CD integration.
Just like any other architecture, it has continuous Feature Aggregation on streaming data + on-demand features. The Online Feature Store uses the following sequences before giving a prediction:
HopWorks Enterprise Edition is a multi-tenant architecture that integrates AWS Sagemaker, Databricks, Kubernetes, and Jupyter Notebook. It also supports integration with Authentication frameworks like LDAP, Kerberos, and Oauth2.
The Batch / Live Streaming functionality is facilitated by Apache Beam, Apache Flink, and Apache Spark, whereas the model governance and monitoring pipeline are built using Kafka and Spark Streaming.
The architecture is composed of several building blocks namely
The feature store developed by Netflix supports both online and offline model training and development. The online micro-services enables the framework to collect the data elements required by the feature encoders in a model. It further passes this downstream for future use by offline predictions. The Fact Logging service of Netflix logs user-related, video-related and computation specific features in a serialized format in appropriate storage units (S3).
The unique point of this architecture is the presence of components that help to:
As snapshotting data for all contexts (e.g all member profiles, devices, times of day) would incur overhead and cost, Netflix relies on selecting samples of contexts to snapshot periodically (at regular intervals - daily/twice daily), though different algorithms. It achieves this through Spark, by training data on different distributions, and by using stratified samples based on properties such as viewing patterns, devices, time spent on the service, region, etc.
Netflix embraces a fine-grained Service Oriented Architecture for cloud-based deployment model.
The FBLearner designed by Facebook is a framework for AI WorkFlow with Model Management and Deployment. It is mainly composed of 3 components - FB Learner Feature Store (runs on CPU), FB Learner Flow (runs on CPU +GPU), and FB Learner Predictor (runs on CPU). It supports building all kinds of deep learning models (Caffe2, Pytorch, Tensorflow, MxNet, CNTK) and models can be stored in ONNX format (standardizes portability across converters, runtimes, compilers, and visualizers. supports and to) across different hardware/software platforms.
The above broad categories can be seen as creating logical units from hardware to application software.
Facebook also uses a principle to split development and deployment (production) environments.
Pinterest's - Big Data Machine Learning is a classic example of high speed and quality which is scalable, reliable, and secure. This Metadata-driven framework is built using open-source technology with individual building blocks that help in reusability. It also provides governance: enforcement & tracking.
The uniqueness of this architecture lies in capturing relationships and interactions (clicks made by users) between pins (how objects are organized into collections).
The below figure illustrates the different components in model governance and development architecture
The predictive system ZipLine created by Airbnb relies on a scoring service based on features gathered in due time and space. The scoring log (acts as debug/audit log) is computed/updated daily to ensure feature consistency and single feature definition both during training ML model and deploying them at production. In addition, it ensures Data Quality monitoring, feature back-filling, and making features searchable and sharable.
The architecture integrated with data sources -- Hive Table, databases and Jitney's Event Bus apart from Apache Spark (batch) and Flink (streaming) with Lambda as serving point. The uniqueness of this platform lies in :
TensorFlow Extended (TFX), a TensorFlow based general-purpose machine learning platform provides orchestration of many components—a learner for generating models based on training data, modules for analyzing and validating both data as well as models, and finally infrastructure for serving models in production. The platform is particularly known for training, validation, visualization, and deployment of fresh newly trained models in production continuously relatively quickly. The individual components can share utilities that allow them to communicate and share assets. Due to fast training data and deserialization teams and community can share their data, models, tools, visualizations, optimizations, and other techniques
The components are further known for gathering statistics over feature values: for continuous features, the statistics include quantiles, equi-width histograms, the mean and standard deviation, whereas for discrete features they include the top-K values by frequency. In addition, the components support the computation of model metrics on slices of data e.g., on negative and positive examples in a binary classification problem) and cross-feature statistics like correlation and covariance between features. These statistics give insights to users on the shape of each dataset.
Further, the architecture also provides configuration free validation-setup enabled for all users, multi-tenancy to serve multiple machine-learned models concurrently, soft model-isolation to increase model performance.
Apache Airflow : Source
Apache Airflow's entire architecture is based on the concept of DAG (Directed Acyclic Graph), which takes into account the dependencies within them. Its principal responsibility to ensure all things happen at the right time and in the right order. The DAGs define a single logical workflow and they are defined in python files.
Further, it supports Airflow Operators which states what steps are executed over time (e.g. download or transfer operators- GoogleCloudStorageDownloadOperator ). One such Operator is the GoogleCloudStorageObjectSensor which pauses execution until aa key appears in S3.
Apache Airflow guarantees Idempotence (ensuring subsequent execution of any step produces the same end-result, irrespective of the number of times.), Atomicity, and Metadata Exchange. Data exchange between different components of this distributed architecture is facilitated using XCOM (cross-communication) that provided an exchange of small metadata. However, for large volumes of data, it supports shared network storage, data lake (S3) or URI based exchange through XCOM.
Parameterized representations of operators help DAG to run tasks that spawn a TaskInstance at a particular instant of time. Further, the instances within Apache AirFlow DAG are grouped into a DagRun.
Zomato's restaurant business heavily relies on stream data processing to compute running orders at the restaurant at any given point. The architecture use Apache Flink that provides job level isolation for each ML model as features from each ML model maintain their separate space for research, analysis, logging and do not interact with features from other ML models.
In addition to streaming and online feature extraction, the life-cycle management of ML models are provided by MLFlow. The ML models are served to the external world via API Gateway by means of AWS Sagemaker endpoints.
Overton automates the life cycle of model construction, deployment, and monitoring by providing a set of novel high-level, declarative abstractions. It supports multi-task learning to concurrently predict several ML models in both real-time and backend production applications.
Further, the architecture allows separation between model and data with two components the tasks, which capture the tasks the model needs to accomplish, and payloads that represent sources of data, such as tokens or entity embeddings.
The model training is governed by a schema file, which acts as a guide to compile a TensorFlow model and to describe its output for downstream use. Overton also embeds raw data into a payload, which is then used as input to a task or to another payload. The payloads are either singletons (e.g., a query), sequences (e.g. a query tokenized into words or characters), and sets (e.g., a set of candidate entities).
StreamSQL Feature store is alow latency based model development framework with high throughput serving. It allows new model features to be deployed confidently with versioning with much with ease. With the use of feature definitions, consistent feature deployment is ensured across training, in serving and across production.
The architecture is also known for its ability to increase model performance by integrating features from 3rd party. It combines batch and stream processing with an immutable ledger, where each event is appended to the end of the ledger. Further, the framework at any point allows the addition of new data sources/transformations (from Flink and Spark. Files, tables, and stream), modify or create a new set of features and even analyze/discover features from feature registry.
Tecton has come up with a unified architecture to develop, deploy, curate/govern and monitor a platform built to standardize high-quality features, labels, and data sets for ML models in production, ensuring the safe operation of models over time, with proper reproducibility, lineage, and logging.
The Tecton platform consists of:
Feature Pipelines for transforming your raw data into features or labels
The above figure illustrates a Hybrid Feature Store with Data Pipeline, BI Platforms (Tableau) using Apache Airflow, S3, Hopsworks Feature Store, and Data Lakes from Cloudera. The platform is capable of ingesting raw data, event or SQL data at the input.
The Feature Store provided by Scribble Data puts lots of stress on Input Data Correctness and Completeness (gaps, duplicates, exceptions, invalid values), as it is known to play an impact on ML models' prediction. Hence it recommends a continuous check/early morning system to prevent poor quality data from coming into the system. On the reactive side, the system undertakes a continuous process to improve ML operations over time.
Here we have discussed about different architectural frameworks using Big Data (some of them are Open Source tools), ML model training and serving tools, along with orchestration layer (such as Kubernetes). Each of the component is equally important and they go hand in hand to create a real-time end to end predictive system.