The landscape of machine learning is changing quickly, leaving organizations with a critical decision: build a feature platform from scratch or leverage cloud-native services? This post examines a pure lambda-style feature platform built entirely on Google Cloud Platform's native services, a solution we've implemented in production that delivers enterprise-scale feature engineering capabilities with surprisingly little operational overhead.

## The Zero-Ops Feature Engineering Vision

The architecture we'll explore applies the serverless philosophy to feature engineering. By combining BigQuery Materialized Views, Scheduled Queries, Dataflow pipelines, and Vertex AI Feature Store, it aims to eliminate the operational complexity typically associated with feature platforms while maintaining production-grade performance and reliability.

## Architecture Overview

*Figure 1: Lambda-style architecture leveraging GCP managed services for both batch and streaming feature pipelines*

The platform operates on two distinct but complementary pipelines.

**Batch Feature Pipeline: SQL-Driven Aggregations**

The batch pipeline leverages BigQuery's native capabilities for time-window aggregations:

Data Source → Materialized Views → Scheduled Queries → Vertex AI Feature Store

**Streaming Feature Pipeline: Real-Time Event Processing**

The streaming pipeline uses Dataflow for low-latency feature computation:

Event Streams → Dataflow (Apache Beam) → Vertex AI Feature Store

## Batch Feature Engineering

### The Power of Materialized Views

The batch pipeline's foundation lies in BigQuery Materialized Views (MVs), which solve a critical scaling challenge and create cascading benefits across the entire feature platform. In our production implementation, we battle-tested this design with 15-minute aggregate materialized views; the 10-minute interval shown in the examples is just a parameter to tune based on the refresh cadence you need from your batch pipelines and how much you want to spend.

**The Fundamental Problem:** Computing large-window features (1-day, 60-day averages) directly from raw event data means scanning massive datasets repeatedly, potentially terabytes of data for each feature calculation.
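To make the problem concrete, here's a minimal sketch of the naive approach, reusing the table and column names from the MV example below. Every scheduled run of a query like this re-scans the full 60-day slice of raw events:

```sql
-- Naive approach (sketch): recompute a 60-day average straight from raw
-- events. Every run scans up to 60 days of the source table.
SELECT
  source.userid AS entity_id,
  AVG(source.activity_value) AS avg_value_60d
FROM my_project.my_dataset.user_activity AS source
WHERE source.event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 DAY)
GROUP BY entity_id
```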
**The MV Solution:** We've found that pre-aggregating raw events into 10-minute buckets reduces downstream data processing by roughly 600x.

**Why This Transforms the Entire System:**

- **Batch Feature Speed:** Large-window aggregations compute in seconds instead of minutes
- **Cost Efficiency:** Query costs drop dramatically (scanning MB instead of TB)
- **Faster Forward Fill:** Historical feature backfilling becomes practical at enterprise scale
- **Streaming Optimization:** Since batch handles long windows efficiently, streaming can focus on short-term features (≤10 minutes), avoiding expensive long-term state management
- **System Simplicity:** Clear separation of concerns between batch (long windows) and streaming (immediate features)

```sql
CREATE MATERIALIZED VIEW user_features_by_10min_bucket_mv
-- Daily partitions on the bucket timestamp (a bare TIMESTAMP column is not a
-- valid partition expression); assumes the base table is partitioned on
-- event_timestamp.
PARTITION BY DATE(feature_timestamp)
CLUSTER BY entity_id
OPTIONS (
  enable_refresh = true,
  refresh_interval_minutes = 10
) AS
SELECT
  -- Pre-aggregate raw events into 10-minute buckets
  TIMESTAMP_BUCKET(source.event_timestamp, INTERVAL 10 MINUTE) AS feature_timestamp,
  source.userid AS entity_id,
  AVG(source.activity_value) AS avg_value_last_10_mins,
  -- Keep SUM and COUNT separately so AVG can be reconstructed over larger
  -- sliding windows downstream
  SUM(source.activity_value) AS sum_value_for_sliding_avg,
  COUNT(source.activity_value) AS count_for_sliding_avg
FROM my_project.my_dataset.user_activity AS source
-- Broad time filter keeps the full history available for backfilling
WHERE TIMESTAMP_TRUNC(source.event_timestamp, HOUR) >= TIMESTAMP('2000-01-01T00:00:00Z')
GROUP BY feature_timestamp, entity_id
```

**Key Benefits:**

- **Automatic Refresh:** MVs incrementally update every 10 minutes
- **Query Optimization:** Subsequent queries leverage pre-computed results (see the sketch below)
- **Historical Coverage:** Broad time filters enable comprehensive backfilling
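The "Query Optimization" benefit comes from BigQuery's smart tuning: when an aggregation over the base table can be answered from a materialized view, the optimizer rewrites the query automatically. Here's a hedged sketch; whether a particular query actually qualifies for the rewrite depends on how closely it matches the MV definition, so verify against the query plan:

```sql
-- This query targets the raw table, but because its grouping and aggregation
-- match the MV definition, smart tuning can serve it from
-- user_features_by_10min_bucket_mv instead of scanning raw events.
SELECT
  TIMESTAMP_BUCKET(event_timestamp, INTERVAL 10 MINUTE) AS feature_timestamp,
  userid AS entity_id,
  SUM(activity_value) AS total_value
FROM my_project.my_dataset.user_activity
WHERE TIMESTAMP_BUCKET(event_timestamp, INTERVAL 10 MINUTE)
      >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY feature_timestamp, entity_id
```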
### Leveraging MV Efficiency

Building upon the MVs, scheduled queries compute complex sliding-window features with remarkable efficiency. The key insight we discovered: instead of scanning raw events, these queries operate on the pre-aggregated 10-minute buckets, which makes a world of difference.

For refresh cadences, we implemented a rough 1/5 rule capped at a maximum of every 5 hours: 1-hour window features refresh every 15 minutes, 3-hour windows every 45 minutes, and both 24-hour and 60-day windows every 5 hours.

**Important caveat:** This MV optimization only works for simple aggregations (SUM, COUNT, AVG). We learned this the hard way with complex aggregations that require sorting and ROW_NUMBER(): the MV optimizations did not apply, and we had to run the entire aggregation logic in scheduled queries instead.

*Figure 2: Window function-based computation of 1-day and 60-day sliding averages using 10-minute bucket aggregates*

```sql
-- Window frame: last 144 buckets (1 day) ending at the current bucket
SUM(sum_value_for_sliding_avg) OVER (
  PARTITION BY entity_id
  ORDER BY feature_timestamp ASC
  ROWS BETWEEN 143 PRECEDING AND CURRENT ROW
) AS sum_1_day_sliding
```

**The Efficiency Multiplier:**

- **Traditional Approach:** A 60-day sliding window scans 60 days of raw events (potentially 1 TB+ per query)
- **MV-Powered Approach:** A 60-day sliding window scans 8,640 pre-aggregated buckets (~1 MB per query)

This ~1000x data reduction enables:

- **Sub-second Feature Computation:** Large-window features that previously took minutes now complete in seconds
- **Cost-Effective Backfilling:** Historical feature generation becomes economically viable
- **Real-time Forward Fill:** Fresh features can be computed continuously without breaking the budget
- **Streaming Focus:** Stream processing is freed from long-window state management, enabling cost-effective real-time features

**Feature Examples Enabled:**

- Average spending over 1 hour, 12 hours, and 24 hours (computed from 6, 72, and 144 buckets respectively)
- Transaction velocity across 1-day, 7-day, and 30-day windows
- User engagement trends spanning weeks or months
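Putting the pieces together, here's a minimal sketch of a scheduled query body built on the MV; assume its results land in a destination table we'll call user_sliding_features (name illustrative). Note that the row-based frames assume one row per 10-minute bucket per entity, so sparse entities would need gap filling:

```sql
-- Sliding averages computed from 10-minute buckets. AVG is reconstructed as
-- SUM/COUNT so it composes correctly across buckets.
SELECT
  feature_timestamp,
  entity_id,
  SAFE_DIVIDE(SUM(sum_value_for_sliding_avg) OVER w_1d,
              SUM(count_for_sliding_avg) OVER w_1d) AS avg_value_1_day,
  SAFE_DIVIDE(SUM(sum_value_for_sliding_avg) OVER w_60d,
              SUM(count_for_sliding_avg) OVER w_60d) AS avg_value_60_day
FROM my_project.my_dataset.user_features_by_10min_bucket_mv
WINDOW
  -- 144 buckets x 10 minutes = 1 day; 8,640 buckets = 60 days
  w_1d AS (PARTITION BY entity_id ORDER BY feature_timestamp
           ROWS BETWEEN 143 PRECEDING AND CURRENT ROW),
  w_60d AS (PARTITION BY entity_id ORDER BY feature_timestamp
            ROWS BETWEEN 8639 PRECEDING AND CURRENT ROW)
```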
## Streaming Feature Engineering

### Real-Time Processing with Dataflow

The streaming pipeline handles low-latency features that require immediate computation.

*Figure 3: Dataflow pipeline processing real-time events with windowing and state management*

### Streaming Pipeline Optimization Through MV Design

The materialized view strategy fundamentally changes what the streaming pipeline needs to handle.

**Before MV Optimization:**

- Streaming pipeline manages state for long windows (hours, days)
- Expensive persistent state storage for millions of entities
- Complex checkpointing for multi-day windows
- High memory requirements and operational overhead

**After MV Optimization:**

- Streaming focuses on immediate features (≤10 minutes)
- Lightweight state management for short windows
- Reduced operational complexity and costs
- Clear architectural boundaries

**Key Streaming Features (Optimized Scope):**

- **Count Events Last N Minutes:** Only short windows (≤10 min), since batch handles longer periods efficiently
- **Time Since Last Event:** Stateful processing per entity, reset frequently
- **Last Event Type:** State-based feature tracking with a minimal memory footprint
- **Real-time Anomaly Flags:** Immediate detection requiring sub-second latency

### Streaming Feature Backfilling

For streaming features, we use a unified Beam pipeline that reuses the exact streaming logic for historical data. This ensures identical computation semantics and eliminates discrepancies between batch and streaming feature calculations.

In our implementation, all streaming features are simple aggregations needed in real time: event counts, sums, and basic statistical measures over short windows. The streaming pipeline handles the "last mile" features, specifically the latest 15-minute window aggregations. These streaming features are then augmented with the longer-term batch features before being sent to our models, giving us both real-time responsiveness and historical context.
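The Beam code itself is beyond the scope of this post, but the reference semantics of those "last mile" features can be expressed in SQL. The sketch below is illustrative only: in production, Dataflow maintains these values incrementally rather than re-querying, and the feature names are ours, not a fixed schema:

```sql
-- Reference semantics of the streaming features, per entity, over the most
-- recent 15-minute window.
SELECT
  userid AS entity_id,
  COUNT(*) AS event_count_last_15_min,
  SUM(activity_value) AS sum_value_last_15_min,
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(event_timestamp), SECOND)
    AS seconds_since_last_event
FROM my_project.my_dataset.user_activity
WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 15 MINUTE)
GROUP BY entity_id
```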
## Vertex AI Feature Store Integration

The platform culminates in Vertex AI Feature Store V2, which we chose after careful consideration. Its batch export functionality has only recently become generally available; we tested it at a smaller scale, and it looks promising so far. The higher-maintenance alternative would have been the battle-tested open-source Feast feature store, but we decided to bet on Google's managed offering to reduce our operational overhead.

The integration provides:

*Figure 4: Unified feature serving with point-in-time correctness for both batch and streaming features*

**Key Capabilities:**

- **Point-in-Time Correctness:** Accurate training data generation (see the sketch below)
- **Online Serving:** Low-latency feature retrieval
- **Mixed Feature Types:** Batch and streaming features co-exist
- **Automatic Versioning:** Feature schema evolution support
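Point-in-time correctness is worth unpacking, since it's what prevents label leakage in training data: each labeled example is joined to the most recent feature values at or before the label's timestamp, never after. Feature Store's batch retrieval performs this lookup for you; the sketch below shows the equivalent BigQuery logic against the user_sliding_features table from earlier, with a hypothetical training_labels table:

```sql
-- Conceptual point-in-time join (illustrative). For each label, keep only
-- the latest feature row at or before the label timestamp.
SELECT
  l.entity_id,
  l.label_timestamp,
  l.label,
  f.avg_value_1_day,
  f.avg_value_60_day
FROM my_project.my_dataset.training_labels AS l
JOIN my_project.my_dataset.user_sliding_features AS f
  ON f.entity_id = l.entity_id
WHERE f.feature_timestamp <= l.label_timestamp
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY l.entity_id, l.label_timestamp
  ORDER BY f.feature_timestamp DESC
) = 1
```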
## Strengths of this Approach

**Operational Excellence**

- **Zero Infrastructure Management:** Fully managed services eliminate operational overhead
- **Automatic Scaling:** Services scale based on demand without intervention
- **Built-in Monitoring:** Native GCP monitoring and alerting
- **Simplified Deployments:** No cluster management or resource provisioning

**Cost Efficiency Through MV-Driven Design**

- **Pay-as-you-go:** Only pay for actual compute and storage usage
- **Dramatic Query Cost Reduction:** MVs reduce data scanning by ~1000x (TB → MB per query)
- **Streaming Cost Optimization:** The short-window focus eliminates expensive long-term state management
- **Efficient Forward Fill:** Historical feature generation becomes economically viable

**Developer Productivity**

- **SQL-First Approach:** Familiar tooling for data practitioners
- **Rapid Prototyping:** Quick iteration on feature definitions
- **IDE Integration:** Native BigQuery and Dataflow tooling

**Technical Advantages**

- **Proven Scalability:** BigQuery handles petabyte-scale datasets
- **Automatic Optimization:** The query optimizer handles performance tuning
- **Data Freshness:** Near real-time updates through MVs and streaming
- **Backup and Recovery:** Built-in data protection and disaster recovery

## Limitations and Trade-offs

**Platform Lock-in Concerns**

- **Vendor Dependency:** Heavy reliance on GCP-specific services
- **Migration Complexity:** Difficult to port to other cloud providers
- **Pricing Volatility:** Subject to GCP pricing changes
- **Feature Parity:** Limited by GCP service capabilities and roadmap

**Architectural Constraints**

- **Limited Flexibility:** Constrained by BigQuery and Dataflow capabilities
- **Complex Features:** Some ML features may not map well to SQL
- **Cross-Service Dependencies:** Failures cascade across multiple services
- **Consistency Challenges:** Eventual consistency between batch and streaming

**Data Engineering Limitations**

- **Transformation Complexity:** Complex business logic is harder to express in SQL
- **Schema Evolution:** Changes require careful coordination across services

## What We've Learned About Lambda-Style Feature Engineering

After implementing this GCP-native feature platform in production, we've found it represents a compelling vision of infrastructure-as-code applied to feature engineering. By embracing the lambda architecture paradigm and leveraging managed services, we've been able to dramatically reduce operational complexity while maintaining enterprise-scale capabilities.

**This approach excels when:**

- Teams want to focus on feature logic rather than infrastructure
- Operational simplicity is prioritized over maximum flexibility
- Organizations already have significant GCP investments
- Time-to-market is critical for competitive advantage

**Consider alternatives when:**

- Maximum control over processing logic is required
- Multi-cloud or hybrid deployment strategies are needed
- Complex, non-SQL-friendly feature transformations are common
- Vendor lock-in presents significant business risks

The lambda-style approach fundamentally shifts the feature platform paradigm from "infrastructure management" to "feature logic optimization." For many organizations, this trade-off represents a strategic advantage, enabling data science teams to focus on what matters most: creating features that drive business value.
As cloud-native services continue to mature, we can expect this architectural pattern to become increasingly prevalent, making sophisticated feature engineering capabilities accessible to organizations without large platform engineering teams.