Comprehensive List of Feature Store Architectures for Data Scientists and Big Data Professionals

Introduction & Motivation - Why Feature Store

The feature store has become an important unit for organizations developing predictive services across any industry domain. Some of the earlier challenges in deploying ML solutions at scale were:

- Developing and maintaining customized systems by individual teams with little or no coordination.
- No collaborative system for sharing features between similar types of ML models (models from a similar domain, or models addressing the same business use-cases or customer domains).
- Increased cognitive burden without the proper scope of scalability.
- Limited integration with big-data ecosystems.
- Limited scope for model retraining, comparison, model governance, and traceability, limiting an agile development life-cycle.
- Difficulty tracking and retraining models on data which exhibits seasonality.

To overcome the above limitations, architects, data scientists, and big data and analytics professionals felt the need to work under one roof with one unified framework to facilitate easier collaboration and sharing of data, results, and reports. Departments, teams and organizations shared similar notions of feature engineering:

- Feature engineering is expensive, and its amortization happens over time and across models.
- The increase in cost is non-linear/exponential with the increase in the number of features.
- The number of triggers/alerts due to the addition/removal of features is high.
- Most often, dependencies are not documented/tracked, which results in an increase of implicit and explicit dependencies getting added over time.

Sharing this opinion, it became easier to come together and create a unified framework called the Feature Store. This enhances the speed of the ML model deployment life-cycle along with the creation of proper documentation, required versions, and model performance analysis, in order to save time and effort.

In this blog, we highlight the features supported by different feature store frameworks, primarily developed by leading industry giants.

Advantages of a Feature Store

- Ability to re-use and discover features between teams across the organization.
- Features are governed by adding capabilities like access control and versioning.
- Ability to precompute and automatically backfill features, including online computation and offline aggregation.
- Helps to create a collaborative environment between data scientists and big data engineers.
- Saves effort and cost by sharing not only features but also related artifacts, documents, and marketing insights of models developed from these features.
- Enables consistency between training and serving.

Michelangelo from Uber

Michelangelo is a framework developed by Uber that allows feature integration/joining in both offline and online pipelines. Here Hive (offline) and Cassandra (online) act as the main storage units for raw/transformed features. It provides a horizontally scalable, multi-tenant architecture for multiple models with suitable scaling and monitoring. Training jobs can be configured and managed through a web UI or an API, via Jupyter notebook. It further provides options to define a hierarchical partitioning scheme to train models per partition, which can be deployed as a single logical model. This provides easy bootstrapping and helps to overcome challenges when several models need to be trained based on the hierarchical structure of the data.
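To make the partitioned-model idea concrete, here is a minimal, hypothetical sketch (not Michelangelo's actual API): one model is trained per partition key from an offline feature table, and a single logical predict function routes requests to the right per-partition model at serving time. The table and column names are placeholders.

```python
# Hypothetical sketch of per-partition models exposed behind one logical model.
# Assumes a Hive-backed offline feature table "features.trip_features" with
# columns city_id, f1, f2, label; none of this is Michelangelo's real API.
from pyspark.sql import SparkSession
from sklearn.linear_model import LogisticRegression

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
features = spark.table("features.trip_features")

models = {}
for row in features.select("city_id").distinct().collect():
    city = row["city_id"]
    pdf = features.filter(features.city_id == city).toPandas()
    models[city] = LogisticRegression().fit(pdf[["f1", "f2"]], pdf["label"])

def predict(city_id, feature_vector):
    """Single logical model: route the request to its partition's model."""
    return models[city_id].predict([feature_vector])[0]
```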
At runtime during serving, Michelangelo routes to the best model for each node. Further, it is best known for its ability to support continuous learning, providing integration with AutoML, along with its support for distributed deep learning.

Feast Feature Store

Google released Feast, which is primarily built around Google Cloud services - BigQuery (offline), Bigtable (online) and Redis (low-latency) - with Apache Beam for feature engineering. It allows a clear separation between big data and model development. This online predictive service allows feature sharing among teams with strong consistency between model training and serving.

Further, Feast comes with centralized feature management, discovery, feature validation, and feature aggregation. The feature columns reside inside wide entity tables. In addition, composite entities separate individual features.

Wix Feature Store

Wix provides a platform for feature-sharing across different ML models for both batch and real-time datasets. It supports a pre-configured set of feature families at the site and user level for both training and serving models. The different stages of data management, model training and deployment are marked and shown in the figure above. It further uses S3 to store real-time extracted features.

FeatureStore from Comcast

The Feature Store developed by Comcast helps data scientists to reuse versioned features, upload online (real-time)/streaming data, and review feature metrics by model. The product is available as multiple pluggable feature store components. The built-in model repository contains artifacts related to data pre-processing (normalization, scaling), displaying the required mapping to the features needed to execute the model. Further, the architecture is built using Spark on Alluxio (an open-source data orchestration layer that brings data close to compute for big data and AI/ML workloads in the cloud), with S3, HDFS, RDBMS, Kafka and Kinesis as data sources. Model deployment with Kubeflow helps to build resilient, highly available distributed systems with support for rate-limiting, shadow deployments, and auto-scaling.

The integration with the data lake through suitable APIs helps data scientists to use SQL and create training/validation/test datasets that can be versioned and integrated into the full model pipeline. In addition, the framework comes with support for Seldon Inference Graphs for A/B testing, ensembles, multi-armed bandits, and custom combinations. The end-to-end system not only provides traceability across use-cases, models, features, model-to-feature mappings, versioned datasets, the model training codebase, model deployment containers, and prediction/outcome sinks, it is also known for integrating the Feature Store, Container Repository, and Git to bring together data, code and run-time artifacts for CI/CD.

Just like any other architecture, it has continuous feature aggregation on streaming data plus on-demand features. The Online Feature Store runs through the following sequence before returning a prediction (a minimal sketch follows this list):

- The payload only contains the model name and account number.
- Model metadata informs which features are needed for the model.
- The required features are pulled by account number.
- The full set of assembled features is passed in for model execution.
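A minimal sketch of that serving sequence, assuming hypothetical metadata-registry, online-store and model-runtime clients (the names are illustrative, not Comcast's API):

```python
# Illustrative only: registry, online_store and runtime are hypothetical
# stand-ins for the metadata service, online feature store and model executor.
def predict(payload, registry, online_store, runtime):
    model_name = payload["model_name"]            # 1. payload carries model name + account number
    account_number = payload["account_number"]

    needed = registry.features_for(model_name)    # 2. metadata says which features the model needs
    features = online_store.get(account_number, needed)  # 3. pull those features by account number
    return runtime.execute(model_name, features)  # 4. run the model on the assembled feature set
```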
HopWorks Enterprise Edition

HopWorks Enterprise Edition is a multi-tenant architecture that integrates AWS SageMaker, Databricks, Kubernetes, and Jupyter Notebook. It also supports integration with authentication frameworks like LDAP, Kerberos, and OAuth2.

The batch / live streaming functionality is facilitated by Apache Beam, Apache Flink, and Apache Spark, whereas the model governance and monitoring pipeline is built using Kafka and Spark Streaming.

The architecture is composed of several building blocks, namely:

- The Feature Store API - for reading/writing to/from the feature store
- The Feature Store Registry - a user interface to discover features
- Feature Metadata - documentation, analysis and versioning
- Feature Engineering Jobs - for computation
- The Storage Layer - for feature storage

Netflix Feature Store

The feature store developed by Netflix supports both online and offline model training and development. The online micro-services enable the framework to collect the data elements required by the feature encoders in a model, and pass them downstream for future use by offline predictions. Netflix's Fact Logging service logs user-related, video-related and computation-specific features in a serialized format in appropriate storage units (S3).

The unique point of this architecture is the presence of components that help to:

- Develop/create contexts to snapshot
- Snapshot data of various micro-services for the selected context
- Build APIs to serve this data for a given time coordinate in the past

As snapshotting data for all contexts (e.g. all member profiles, devices, times of day) would incur overhead and cost, Netflix relies on selecting samples of contexts to snapshot periodically (at regular intervals - daily or twice daily), through different algorithms. It achieves this through Spark, by training on data from different distributions, and by using stratified samples based on properties such as viewing patterns, devices, time spent on the service, region, etc.

Netflix embraces a fine-grained service-oriented architecture for its cloud-based deployment model.

FBLearner from Facebook

FBLearner, designed by Facebook, is a framework for AI workflow with model management and deployment. It is mainly composed of 3 components - FBLearner Feature Store (runs on CPU), FBLearner Flow (runs on CPU + GPU), and FBLearner Predictor (runs on CPU). It supports building all kinds of deep learning models (Caffe2, PyTorch, TensorFlow, MXNet, CNTK), and models can be stored in ONNX format, which standardizes converters, runtimes, compilers, and visualizers and supports portability across different hardware/software platforms.

The above broad categories can be seen as logical units spanning from hardware to application software: frameworks (FBLearner Feature Store) needed to create, migrate and train models; platforms (FBLearner Flow) for model deployment and management; and infrastructure (FBLearner Predictor) needed to compute workloads and store data.

Facebook also uses this principle to split development and deployment (production) environments.
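As a concrete illustration of the ONNX-based portability mentioned above, the snippet below exports a toy PyTorch model to ONNX so it can be loaded by any ONNX-compatible runtime; the model and file name are placeholders and not part of FBLearner itself.

```python
# Export a small PyTorch model to ONNX; the resulting file can be consumed
# by ONNX converters, runtimes, compilers and visualizers on other platforms.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
model.eval()
dummy_input = torch.randn(1, 10)   # example input with the expected shape

torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["features"], output_names=["score"],
)
```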
Pinterest Feature Store

Pinterest's Big Data Machine Learning platform is a classic example of high speed and quality that is scalable, reliable, and secure. This metadata-driven framework is built using open-source technology with individual building blocks that help in reusability. It also provides governance: enforcement and tracking.

The uniqueness of this architecture lies in capturing relationships and interactions (clicks made by users) between pins (how objects are organized into collections).

The figure below illustrates the different components in the model governance and development architecture.

Zipline from Airbnb

The predictive system Zipline created by Airbnb relies on a scoring service based on features gathered in due time and space. The scoring log (which acts as a debug/audit log) is computed/updated daily to ensure feature consistency and a single feature definition, both while training ML models and when deploying them to production. In addition, it ensures data quality monitoring and feature back-filling, and makes features searchable and sharable.

The architecture integrates with data sources - Hive tables, databases and Jitney's event bus - apart from Apache Spark (batch) and Flink (streaming), with Lambda as the serving point. The uniqueness of this platform lies in:

- Reduction of custom pipeline creation
- Reducing data leaks in custom aggregations
- Feature distribution observability
- Improved model iteration workflow

TFX

TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform, provides orchestration of many components - a learner for generating models based on training data, modules for analyzing and validating both data as well as models, and finally infrastructure for serving models in production. The platform is particularly known for continuously and relatively quickly training, validating, visualizing, and deploying freshly trained models to production. The individual components can share utilities that allow them to communicate and share assets. Due to fast deserialization of training data, teams and the community can share their data, models, tools, visualizations, optimizations, and other techniques.

The components are further known for gathering statistics over feature values: for continuous features, the statistics include quantiles, equi-width histograms, the mean and standard deviation, whereas for discrete features they include the top-K values by frequency. In addition, the components support the computation of model metrics on slices of data (e.g., on negative and positive examples in a binary classification problem) and cross-feature statistics like correlation and covariance between features. These statistics give users insights into the shape of each dataset.

Further, the architecture also provides a configuration-free validation setup enabled for all users, multi-tenancy to serve multiple machine-learned models concurrently, and soft model-isolation to increase model performance.

Apache Airflow

Apache Airflow's entire architecture is based on the concept of the DAG (Directed Acyclic Graph), which takes into account the dependencies within a workflow. Its principal responsibility is to ensure all things happen at the right time and in the right order. DAGs define a single logical workflow, and they are defined in Python files.

Further, it supports Airflow Operators, which state what steps are executed over time (e.g. download or transfer operators such as GoogleCloudStorageDownloadOperator). One such operator is the GoogleCloudStorageObjectSensor, which pauses execution until a given object appears in the target bucket.
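To ground the DAG and operator concepts, here is a minimal Airflow DAG sketch (Airflow 2.x imports assumed; task contents and the DAG id are placeholders): one bash task extracts raw data and a Python task computes features, with the dependency expressed as a DAG edge.

```python
# Minimal Airflow 2.x DAG: two tasks and one dependency edge.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def build_features():
    print("computing features...")   # placeholder feature-engineering step

with DAG(
    dag_id="feature_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract raw data'")
    transform = PythonOperator(task_id="transform", python_callable=build_features)
    extract >> transform             # extract must finish before transform runs
```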
Apache Airflow guarantees idempotence (ensuring that repeated execution of any step produces the same end result, irrespective of the number of runs), atomicity, and metadata exchange. Data exchange between different components of this distributed architecture is facilitated using XCom (cross-communication), which provides an exchange of small metadata. For large volumes of data, however, it supports shared network storage, a data lake (S3), or URI-based exchange through XCom.

Parameterized representations of operators help the DAG run tasks that spawn a TaskInstance at a particular instant of time. Further, the instances within an Apache Airflow DAG are grouped into a DagRun.

Zomato Feature Store

Zomato's restaurant business relies heavily on stream data processing to compute the running orders at a restaurant at any given point. The architecture uses Apache Flink, which provides job-level isolation for each ML model, as features from each ML model maintain their separate space for research, analysis and logging, and do not interact with features from other ML models.

In addition to streaming and online feature extraction, the life-cycle management of ML models is provided by MLflow. The ML models are served to the external world via an API Gateway by means of AWS SageMaker endpoints.

Overton from Apple

Overton automates the life cycle of model construction, deployment, and monitoring by providing a set of novel high-level, declarative abstractions. It supports multi-task learning to predict with several ML models concurrently in both real-time and backend production applications.

Further, the architecture allows separation between model and data with two components: tasks, which capture what the model needs to accomplish, and payloads, which represent sources of data such as tokens or entity embeddings. Model training is governed by a schema file, which acts as a guide to compile a TensorFlow model and to describe its output for downstream use. Overton also embeds raw data into a payload, which is then used as input to a task or to another payload. Payloads are either singletons (e.g., a query), sequences (e.g., a query tokenized into words or characters), or sets (e.g., a set of candidate entities).

StreamSQL Feature Store

The StreamSQL feature store is a low-latency, high-throughput-serving model development framework with versioning. It allows new model features to be deployed confidently and with ease. With the use of feature definitions, consistent feature deployment is ensured across training, serving and production.

The architecture is also known for its ability to increase model performance by integrating features from third parties. It combines batch and stream processing (using Flink and Spark) with an immutable ledger, where each event is appended to the end of the ledger. Further, the framework at any point allows the addition of new data sources/transformations (from files, tables, and streams), modifying or creating a new set of features, and even analyzing/discovering features from the feature registry.
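The ledger-style pattern can be sketched with Spark Structured Streaming (used here purely for illustration; StreamSQL's own API is not shown): events appended to a Kafka topic form the immutable log, and a streaming aggregation derives a per-user feature from it. The topic, server and field names are placeholders, and the Kafka source requires the spark-sql-kafka package.

```python
# Sketch of deriving a feature from an append-only event log.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = (spark.readStream
          .format("kafka")                                    # immutable event ledger
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "purchase_events")
          .load())

# Per-user purchase count as a simple derived feature.
purchases = events.selectExpr("CAST(key AS STRING) AS user_id")
feature = purchases.groupBy("user_id").agg(F.count("*").alias("purchase_count"))

query = (feature.writeStream
         .outputMode("complete")
         .format("console")          # in practice this would feed an online store
         .start())
```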
Feature Store from Tecton

Tecton has come up with a unified architecture to develop, deploy, curate/govern and monitor high-quality features, labels, and data sets for ML models in production, and a platform built to standardize them, ensuring the safe operation of models over time with proper reproducibility, lineage, and logging.

The Tecton platform consists of:

- Feature Pipelines for transforming your raw data into features or labels
- A Feature Store for storing historical feature and label data
- A Feature Server for serving the latest feature values in production
- An SDK for retrieving training data and manipulating feature pipelines
- A Web UI for managing and tracking features, labels, and data sets
- A Monitoring Engine for detecting data quality or drift issues and alerting

Hybrid Feature Store

The figure above illustrates a hybrid feature store with a data pipeline and BI platforms (Tableau) using Apache Airflow, S3, the Hopsworks Feature Store, and data lakes from Cloudera. The platform is capable of ingesting raw data, event data or SQL data at the input.

Feature Store from Scribble Data

The feature store provided by Scribble Data puts a lot of stress on input data correctness and completeness (gaps, duplicates, exceptions, invalid values), as these are known to impact ML models' predictions. Hence it recommends a continuous check / early-warning system to prevent poor-quality data from entering the system. On the reactive side, the system undertakes a continuous process to improve ML operations over time.

Conclusion

Here we have discussed different architectural frameworks built on big data tools (some of them open source), ML model training and serving tools, and an orchestration layer (such as Kubernetes). Each component is equally important, and they go hand in hand to create a real-time, end-to-end predictive system.

References

- FBLearner: https://www.matroid.com/scaledml/2018/yangqing.pdf
- FBLearner: https://medium.com/@jamal.robinson/how-facebook-scales-artificial-intelligence-machine-learning-693706ae296f
- Metaflow by Netflix: https://netflixtechblog.com/open-sourcing-metaflow-a-human-centric-framework-for-data-science-fa72e04a5d9
- TensorFlow Extended: http://stevenwhang.com/tfx_paper.pdf
- Apache Airflow: https://mlsys.org/Conferences/2019/doc/2019/demo_7.pdf
- Survey Monkey: http://snurran.sics.se/surveymonkey.pdf
- Overton: A Data System for Monitoring and Improving Machine-Learned Products: https://arxiv.org/pdf/1909.05372.pdf
- Pinterest Big Data Machine Learning Platform: https://www.slideshare.net/Alluxio/pinterest-big-data-machine-learning-platform-at-pinterest
- https://www.bigabid.com/blog/data-the-importance-of-having-a-feature-store
- https://towardsdatascience.com/mlops-with-a-feature-store-816cfa5966e9
- https://github.com/EthicalML/awesome-production-machine-learning#feature-stores
- http://featurestore.org/
- https://github.com/logicalclocks/hopsworks
- https://gist.github.com/mserranom/10aaac360617d58e00f1c380db22592e
- https://github.com/quantopian/zipline
- The Hopsworks Feature Store
- Ormenisan et al., Horizontally Scalable ML Pipelines with a Feature Store
- Sculley et al., What's Your ML Test Score? A Rubric for ML Production Systems
- Baylor et al., TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
- Mewald et al., Drift Detection for Production Machine Learning
- CDF Special Interest Group - MLOps (Continuous Delivery for Machine Learning, GitOps)
- Metaflow (Netflix): https://github.com/Netflix/metaflow/tree/master/test
- HopWorks: https://www.slideshare.net/dowlingjim/the-feature-store-in-hopsworks
- Tecton: https://www.tecton.ai/blog/data-platform-ml/