I've spent the last six years building and rebuilding recommendation systems at companies ranging from mid-sized retailers to platforms processing billions of events daily. Here's what three production outages taught me: the architecture diagrams in research papers bear little resemblance to what survives contact with real traffic.
Most articles about recommendation systems focus on the algorithms: collaborative filtering, matrix factorization, deep learning embeddings. All important, sure. But they skip over the messy reality. What happens when your system needs to serve recommendations in under 100 milliseconds while your database is struggling under peak traffic?
I once watched a well-designed collaborative filtering model become useless within 30 minutes of a flash sale starting. The model was trained on normal browsing patterns. During the sale, users were behaving completely differently, racing to checkout on deeply discounted items. The cached recommendations were still showing 'you might also like' suggestions based on yesterday's behavior.
The main message: real-time recommendation systems are distributed systems problems disguised as machine learning problems. Design them as modeling exercises and they'll fail in production. Design them as distributed systems with embedded learning and they'll scale.
When a user clicks, scrolls, searches, or adds an item to cart, that event immediately becomes a ranking signal. But it is useless if it sits in a log file for five minutes waiting for a batch job. Real-time systems treat behavior as a streaming signal, not as historical data.
At small scale, you can cheat. At large scale, every shortcut becomes technical debt with high interest.
In the next section, I will break down the core components of a real-time recommendation system, starting with event ingestion and state management, because that is where most elegant models quietly fall apart.
The Two-Layer Architecture Used in Industry
Every production recommendation system I've built uses the same basic split: a fast layer and a smart layer. The fast layer handles serving, the smart layer handles learning. This separation is what keeps serving latency predictable while learning stays free to be slow.
Here's what this looks like in practice. The key is combining multiple cheap retrieval strategies, then using a lightweight model to rank the merged candidates:
```python
class RecommendationPipeline:
    """
    Two-stage pipeline: candidate generation + ranking.
    Stage 1 is fast and broad, Stage 2 is precise but slower.
    """

    def get_recommendations(self, user_id, context, num_items=10):
        # Stage 1: fast candidate retrieval (1000s of items in <20ms)
        candidates = self.generate_candidates(user_id, context, top_k=500)
        # Stage 2: precise ranking (score 500 items in <40ms)
        scored = self.rank_candidates(user_id, candidates, context)
        # Post-processing: diversity, business rules
        return self.apply_filters(scored, num_items)

    def generate_candidates(self, user_id, context, top_k=500):
        """
        Combine multiple fast retrieval strategies.
        Each strategy is cheap and retrieves from pre-computed indices.
        (Redis values are assumed to be deserialized into Python lists;
        the `or []` guards handle missing keys.)
        """
        candidates = set()

        # Strategy 1: user's collaborative filtering neighbors,
        # pre-computed daily, stored in Redis
        cf_items = (self.redis.get(f"cf_recs:{user_id}") or [])[:200]
        candidates.update(cf_items)

        # Strategy 2: popular items in user's preferred categories
        user_categories = self.get_user_categories(user_id)
        for cat in user_categories[:3]:
            trending = (self.redis.get(f"trending:{cat}") or [])[:100]
            candidates.update(trending)

        # Strategy 3: items similar to recently viewed
        recent_views = self.redis.lrange(f"recent:{user_id}", 0, 5)
        for item in recent_views:
            similar = (self.redis.get(f"similar:{item}") or [])[:50]
            candidates.update(similar)

        # Strategy 4: session-based (if user searched/filtered)
        if context.get("search_query"):
            search_results = self.search_index.query(
                context["search_query"], limit=100
            )
            candidates.update(search_results)

        return list(candidates)[:top_k]

    def rank_candidates(self, user_id, candidates, context):
        """
        Precise scoring with a lightweight model.
        Features are pre-computed or very fast to retrieve.
        """
        features = self.batch_get_features(user_id, candidates, context)
        scores = self.model.predict(features)  # single batch inference
        return sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True,
        )
```
Your serving layer needs to return recommendations in 50-100 milliseconds. That's not enough time to run a complex model, query user history from a database, and compute personalized scores across your entire catalog. Instead, you pre-compute candidate sets. For each user or user segment, you maintain a list of likely relevant items, scored and ranked, sitting in a low-latency store like Redis or Memcached.
The smart layer runs asynchronously. It processes user events, updates models, recomputes embeddings, and refreshes those candidate sets. This layer can take minutes or hours. It doesn't matter because users never wait for it.
Here's where it gets interesting. The fast layer isn't just a cache. It needs fallback logic for new users, diversification rules to avoid showing the same item repeatedly, business logic to exclude out-of-stock products, and real-time filtering based on the current session. All of this happens in milliseconds.
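To make that post-processing step concrete, here is a minimal sketch of what an `apply_filters` pass might look like. The in-memory `catalog` dict stands in for your product metadata store, and the stock check and per-category cap are illustrative rules, not the only ones you'd ship:

```python
# Hypothetical product metadata; in production this lives in a fast store.
catalog = {
    "a1": {"category": "laptops", "in_stock": True},
    "a2": {"category": "laptops", "in_stock": True},
    "a3": {"category": "laptops", "in_stock": False},
    "b1": {"category": "books", "in_stock": True},
}

def apply_filters(scored, num_items, max_per_category=2):
    """Post-process ranked (item, score) pairs: drop items that fail
    business rules and cap how many items any one category contributes."""
    final, per_category = [], {}
    for item, score in scored:
        meta = catalog.get(item)
        if not meta or not meta["in_stock"]:
            continue  # business rule: never show out-of-stock items
        cat = meta["category"]
        if per_category.get(cat, 0) >= max_per_category:
            continue  # diversity rule: avoid a wall of near-identical items
        per_category[cat] = per_category.get(cat, 0) + 1
        final.append((item, score))
        if len(final) == num_items:
            break
    return final
```

Because the input is already sorted by score, the pass is a single cheap loop: the filters only ever skip items, so the relative ranking survives intact.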
Handling Events at Scale Without Losing Your Mind
Real-time means different things to different people. For recommendations, it usually means incorporating user behavior within seconds or minutes, not literally as each click happens. The distinction matters because true real-time processing of every event is expensive and often unnecessary.
I've seen teams build Kafka pipelines that process every pageview, every product click, every cart addition immediately. The infrastructure cost was enormous. The improvement in recommendation quality was marginal. Most of the time, batch processing every few minutes gives you 90% of the benefit at 10% of the cost.
The exception is session-based signals. When a user adds something to their cart or searches for a specific product, you want that reflected in recommendations immediately. These high-value events go through a fast path. Lower-value events like pageviews get batched. The event schema below captures both the priority flag and all the context you need to make smart routing decisions.
```json
{
  "event_id": "evt_1a2b3c4d",
  "timestamp": 1705593600000,
  "user_id": "usr_789",
  "session_id": "ses_abc123",
  "event_type": "product_view",
  "properties": {
    "product_id": "prod_4567",
    "category": "electronics/laptops",
    "price": 1299.99,
    "in_stock": true,
    "view_duration_ms": 15000
  },
  "context": {
    "device_type": "mobile",
    "platform": "ios",
    "referrer": "google_search",
    "page_url": "/products/laptop-xyz"
  },
  "priority": "standard"
}
```
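One way to implement that split is a small router keyed on event type plus the schema's `priority` field. This is a sketch under simplifying assumptions: the two deques stand in for what would be separate Kafka topics in production, and the fast-path event set is illustrative:

```python
from collections import deque

# Events worth reflecting in recommendations within seconds.
FAST_PATH_EVENTS = {"add_to_cart", "search", "purchase"}

# Stand-ins for two Kafka topics with different consumer cadences.
fast_queue = deque()
batch_buffer = deque()

def route_event(event):
    """Send high-value events down the fast path; buffer everything
    else for the next batch window (flushed every few minutes)."""
    if event["event_type"] in FAST_PATH_EVENTS or event.get("priority") == "high":
        fast_queue.append(event)
        return "fast"
    batch_buffer.append(event)
    return "batch"
```

The routing decision is deliberately dumb and cheap: anything smarter than a set lookup at this point in the pipeline adds latency to every single event.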
Your event processing pipeline needs to handle duplicates, out-of-order events, and sudden traffic spikes. Use idempotent operations wherever possible. Design your aggregations so that reprocessing a window of data gives the same result. When traffic doubles during a sale, your system should slow down gracefully, not crash.
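A minimal sketch of what idempotent ingestion can look like, assuming events carry a unique `event_id` as in the schema above. A production version would bound the `seen` set with a TTL or a Bloom filter rather than growing it forever:

```python
class IdempotentCounter:
    """Windowed aggregation that ignores duplicate event_ids, so
    replaying the same window of data yields the same totals."""

    def __init__(self):
        self.seen = set()   # production: TTL'd keys or a Bloom filter
        self.counts = {}

    def ingest(self, event):
        if event["event_id"] in self.seen:
            return  # duplicate delivery from the pipeline: no-op
        self.seen.add(event["event_id"])
        key = (event["user_id"], event["event_type"])
        self.counts[key] = self.counts.get(key, 0) + 1
```

Because ingestion is a no-op on duplicates, reprocessing a whole window after a consumer crash produces exactly the same counts, which is the property the paragraph above is asking for.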
Feature Engineering for Speed
The features you use for scoring matter as much as your model architecture. In production, feature computation time often dominates inference time. You need a feature store strategy that balances freshness with latency.
I organize features into three tiers based on update frequency. User-level features like purchase history and category preferences get updated daily in batch jobs. These are expensive to compute but change slowly. Item-level features like pricing, stock status, and recent conversion rates update whenever the product catalog changes. Session features like current cart contents and recently viewed items update in real-time as users browse.
The feature store layout below shows this three-tier approach. Everything lives in Redis for fast lookup, but the update cadence varies dramatically. You pre-compute what you can, and only compute on-the-fly what absolutely must be fresh.
User Features (Updated every 6-24 hours)

```
user:12345 → {
  "purchase_count_30d": 3,
  "avg_order_value": 156.80,
  "favorite_categories": ["electronics", "books"],
  "price_sensitivity": 0.72,
  "last_purchase_days_ago": 8,
  "session_frequency": "high"
}
```

Item Features (Updated when product catalog changes)

```
product:67890 → {
  "category_path": "home/kitchen/appliances",
  "price_tier": "premium",
  "avg_rating": 4.3,
  "review_count": 847,
  "conversion_rate_7d": 0.034,
  "stock_level": "high",
  "margin_percentage": 28.5
}
```

Real-time Session Features (Updated per request)

```
session:abc123 → {
  "viewed_items": [67890, 11223, 44556],
  "time_on_site_sec": 420,
  "pages_viewed": 7,
  "cart_items": [44556],
  "current_category": "home/kitchen",
  "device": "mobile"
}
```
This approach lets you serve features in under 20ms even when combining dozens of signals. The alternative is querying your transactional database during serving, which will destroy your latency budget and potentially take down your database under load.
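A sketch of how the three tiers might be merged into one flat feature dict at request time. A plain dict stands in for Redis here, and the keys mirror the examples above; with real Redis you would batch the three lookups into a single MGET or pipeline round trip:

```python
# In-memory stand-in for Redis, keyed like the three tiers above.
store = {
    "user:12345": {"purchase_count_30d": 3, "price_sensitivity": 0.72},
    "product:67890": {"price_tier": "premium", "conversion_rate_7d": 0.034},
    "session:abc123": {"cart_items": [44556], "device": "mobile"},
}

def assemble_features(user_id, product_id, session_id):
    """Merge user, item, and session tiers into one namespaced feature
    dict for the ranker. Namespacing prevents key collisions between
    tiers (e.g. a user-level and item-level field with the same name)."""
    features = {}
    for ns, key in (("user", user_id), ("product", product_id), ("session", session_id)):
        for name, value in store.get(f"{ns}:{key}", {}).items():
            features[f"{ns}_{name}"] = value
    return features
```

The `store.get(..., {})` fallback matters: a brand-new session or a cold user simply contributes no features rather than failing the request.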
When Not to Build the Above
If you're serving fewer than 10,000 recommendations per day, don't build this system. Use a simple collaborative filtering library and pre-compute recommendations overnight. The complexity isn't worth it until you hit scale where infrastructure costs and latency actually matter.
Similarly, if your product catalog is tiny (under 1,000 items), the candidate generation stage is unnecessary. You can score all items in real-time and still hit your latency budget.
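For the tiny-catalog case, the whole pipeline can collapse to a brute-force scoring pass. This sketch assumes simple embedding vectors and dot-product scoring, which is one common choice, not the only one:

```python
def recommend_small_catalog(user_vec, item_vecs, num_items=10):
    """Score every item against the user vector with a dot product.
    For catalogs under ~1,000 items this full scan fits comfortably
    inside a 50-100ms budget, so no candidate stage is needed."""
    scores = [
        (item, sum(u * v for u, v in zip(user_vec, vec)))
        for item, vec in item_vecs.items()
    ]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:num_items]
```

Start here, and only graduate to the two-stage pipeline when the full scan stops fitting in the latency budget.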
Outro
Building a real-time recommendation system at scale isn't about implementing the perfect ML algorithm. It's about making practical trade-offs between accuracy, latency, and cost.
Your first version won't be perfect. That's fine. Start with a simple two-stage pipeline, pre-compute what you can, and measure everything. Focus on getting the infrastructure right before you worry about the latest deep learning architecture. Most gains come from better data pipelines and smarter caching strategies, not fancier models.
The real skill is knowing when good enough is actually good enough. That 2% improvement in recommendation quality might not be worth doubling your infrastructure costs.
