Feature store
A feature store is a centralised repository or data storage layer where users can store, share, and discover curated features for machine learning (ML) models. The concept is closely associated with feature engineering, the processing and transformation of raw data into features that model training and serving pipelines can consume. By providing a single place to define, store, and discover reusable features, a feature store can serve different big data models and development teams. Centralisation also improves collaboration, promotes consistency, and speeds up the deployment of ML models, because teams within an organisation can access diverse datasets without the interference often observed in traditional centralised systems.
Feature stores typically handle two primary types of data: batch data and real-time data. Batch data is derived from data lakes or data warehouses and consists of large, static datasets that are not updated in real time. Real-time data is generated from streaming sources and log events; it is continuously updated and immediately fed into the feature store.
Deployment and availability
Feature stores can be built in-house by engineering teams or obtained from companies offering feature store solutions as Platform-as-a-Service (PaaS). These solutions can be cloud-based or deployed on-premises. The first feature stores, Michelangelo Palette by Uber and Zipline by Airbnb, were based on a domain-specific language (DSL) for creating feature pipelines that write features to both an offline store and an online store. More recent open-source feature store platforms include Feast, FeatureForm, and Feathr, while commercial feature stores include Hopsworks, Tecton, Databricks, AWS SageMaker, and Google Cloud Platform (GCP) Vertex AI.
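The idea of a feature pipeline writing to both an offline and an online store can be illustrated with a minimal in-memory sketch. It is not the API of any platform named above; the class `ToyFeatureStore` and its methods are hypothetical, and it assumes that the offline store keeps the full history of feature values while the online store keeps only the most recent value per entity.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFeatureStore:
    """Hypothetical in-memory stand-in for a feature store with two stores."""

    # Offline store: append-only history of (entity_id, feature, timestamp, value).
    offline: list = field(default_factory=list)
    # Online store: latest (timestamp, value) per (entity_id, feature) key.
    online: dict = field(default_factory=dict)

    def write(self, entity_id, feature, timestamp, value):
        """Write one feature value to both stores."""
        self.offline.append((entity_id, feature, timestamp, value))
        key = (entity_id, feature)
        current = self.online.get(key)
        # Late-arriving rows must not overwrite a more recent online value.
        if current is None or timestamp >= current[0]:
            self.online[key] = (timestamp, value)

    def get_online(self, entity_id, feature):
        """Return the latest value for serving, or None if absent."""
        entry = self.online.get((entity_id, feature))
        return None if entry is None else entry[1]

    def get_history(self, entity_id, feature):
        """Return all (timestamp, value) pairs, e.g. for building training data."""
        return [(ts, v) for (e, f, ts, v) in self.offline
                if e == entity_id and f == feature]


store = ToyFeatureStore()
store.write("user_1", "orders_7d", 100, 3)
store.write("user_1", "orders_7d", 200, 5)
store.write("user_1", "orders_7d", 150, 4)  # late-arriving batch row
latest = store.get_online("user_1", "orders_7d")
history = store.get_history("user_1", "orders_7d")
```

In this sketch the late row at timestamp 150 is kept in the history but does not replace the newer online value, which is one plausible way to keep serving data consistent when batch and real-time sources overlap.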
Functionality and advantages
Feature stores provide API-based access to structured and unstructured data for machine learning workloads, supporting efficient querying and retrieval. A significant advantage of feature stores is their ability to accelerate machine learning model development and deployment. Engineering teams can reuse existing, precomputed features, significantly reducing the time required for experimentation and model training. Facebook reported that in their feature store, “most features are used by many models,” and the most popular 100 features are reused in over 100 different models.
Machine learning systems supported by feature stores typically follow the Feature-Training-Inference (FTI) pipeline architecture. In this architecture, a feature pipeline transforms input data into features, which are stored in the feature store. A training pipeline reads features and labels from the feature store, trains a model, and outputs the trained model to a model registry. An inference pipeline takes new feature data and an ML model as input, produces predictions, and logs the prediction results.
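The three FTI pipelines can be sketched as three functions that communicate only through a shared feature store and model registry. Everything here is an illustrative assumption: the stores are plain dictionaries, the feature `avg_amount` and the threshold "model" are invented for the example, and labels are passed to the training pipeline directly for brevity rather than being read from the feature store.

```python
from statistics import mean

feature_store = {}   # entity_id -> dict of feature values (stand-in)
model_registry = {}  # model name -> trained model (stand-in)
prediction_log = []  # (entity_id, prediction) pairs


def feature_pipeline(raw_events):
    """Transform raw input data into features and write them to the store."""
    for entity_id, amounts in raw_events.items():
        feature_store[entity_id] = {
            "avg_amount": mean(amounts),
            "n_events": len(amounts),
        }


def training_pipeline(labels, model_name):
    """Read features, fit a toy threshold model, and register it."""
    pos = [feature_store[e]["avg_amount"] for e, y in labels.items() if y == 1]
    neg = [feature_store[e]["avg_amount"] for e, y in labels.items() if y == 0]
    # Toy model: threshold halfway between the two class means.
    threshold = (mean(pos) + mean(neg)) / 2
    model_registry[model_name] = {"feature": "avg_amount", "threshold": threshold}


def inference_pipeline(entity_id, model_name):
    """Read a model and fresh features, predict, and log the result."""
    model = model_registry[model_name]
    value = feature_store[entity_id][model["feature"]]
    prediction = int(value > model["threshold"])
    prediction_log.append((entity_id, prediction))
    return prediction


feature_pipeline({"a": [10, 20], "b": [100, 200], "c": [90, 110]})
training_pipeline({"a": 0, "b": 1}, "toy_model")
result = inference_pipeline("c", "toy_model")
```

Because each pipeline touches only the shared stores, the three stages could in principle be developed, scheduled, and scaled independently, which is the main appeal of the separation.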
Key components of feature stores
- Centralised feature management organises features and keeps them consistent, making them easily accessible to different teams and models.
- Features are consistent and can be reused across different models, which improves the reproducibility of ML projects.
- Batch and real-time features are both managed and served, catering to a wide array of ML applications.
- Time to production is shortened because data scientists and engineering teams can collaborate efficiently: processed features remain accessible while the data pipeline is being maintained.
- Storage and computation costs may be reduced when features are computed once and reused rather than recalculated for every new model.
- Tools for monitoring, validation, and version control are included, which are critical for governance and compliance requirements.
- Programmatic interfaces are supported via SQL, Python, and PySpark.
Example of a feature store
DoorDash implemented a feature store in its food delivery service to improve machine learning (ML) model performance. Features, which serve as input variables for ML inference, were stored in a key-value system to ensure availability in production. Among the challenges the company faced when designing the feature store was meeting its scaling and complexity requirements.
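As a generic illustration of storing features in a key-value system (not a description of DoorDash's actual design), the sketch below keys each entity by a string and stores its features as one serialised value. The key format, the JSON encoding, and the function names are all assumptions made for the example; a dictionary stands in for the external key-value database.

```python
import json

kv_store = {}  # stands in for an external key-value database


def make_key(entity_type, entity_id):
    """Build a namespaced key such as 'store:42' (hypothetical format)."""
    return f"{entity_type}:{entity_id}"


def put_features(entity_type, entity_id, features):
    """Serialise all features of one entity into a single value."""
    payload = json.dumps(features, sort_keys=True).encode("utf-8")
    kv_store[make_key(entity_type, entity_id)] = payload


def get_features(entity_type, entity_id, names):
    """Fetch the requested features; missing ones come back as None."""
    raw = kv_store.get(make_key(entity_type, entity_id))
    stored = json.loads(raw) if raw is not None else {}
    return {name: stored.get(name) for name in names}


put_features("store", 42, {"avg_prep_minutes": 12.5, "orders_last_hour": 30})
found = get_features("store", 42, ["avg_prep_minutes", "rating"])
missing = get_features("store", 99, ["avg_prep_minutes"])
```

Grouping all features of an entity under one key means a single lookup per entity at inference time, at the cost of rewriting the whole value when any one feature changes; other layouts make the opposite trade-off.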
Challenges and considerations
While feature stores offer substantial advantages, their implementation requires careful consideration of several factors. Data quality is critical: feature data must be clean, accurate, and up to date for ML predictions to be effective. Other important considerations are scalability, so that large-scale feature data can be handled while maintaining low-latency access for real-time inference; integration with existing infrastructure; and access control, which enforces appropriate access policies to prevent unauthorised use and facilitate compliance with regulatory standards.
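The data quality concern can be made concrete with a small validation sketch that checks a feature row for missing values, out-of-range values, and staleness before it is served. The function name, the row layout (a dictionary with an `event_time` field in seconds), and the thresholds are hypothetical choices for illustration.

```python
def validate_feature_row(row, now, max_age_seconds, bounds):
    """Return a list of data quality problems found in one feature row.

    row: dict of feature values plus an 'event_time' timestamp (seconds).
    bounds: dict mapping feature name -> (low, high) allowed range.
    """
    problems = []
    for name, (low, high) in bounds.items():
        value = row.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif not (low <= value <= high):
            problems.append(f"{name}: out of range")
    # A row older than max_age_seconds is treated as stale.
    if now - row["event_time"] > max_age_seconds:
        problems.append("row: stale")
    return problems


bounds = {"orders_7d": (0, 1000), "avg_amount": (0.0, 10000.0)}
good = validate_feature_row(
    {"orders_7d": 5, "avg_amount": 20.0, "event_time": 950},
    now=1000, max_age_seconds=100, bounds=bounds,
)
bad = validate_feature_row(
    {"orders_7d": -1, "event_time": 500},
    now=1000, max_age_seconds=100, bounds=bounds,
)
```

A check of this kind could run in the feature pipeline before writing, or at read time before inference; which is appropriate depends on how much latency the serving path can tolerate.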