332K Orders Later: How Ensemble ML Cut False Positives by 35%

Written by pradeepkalluri | Published 2026/03/04
Tech Story Tags: data-engineering | machine-learning | python | apache-kafka | data-science | open-source | anomaly-detection | software-engineering

TL;DR: A 25-day production experiment processed 332K orders to compare a single Isolation Forest model against a 3-model ensemble (Isolation Forest, LSTM, Autoencoder) for data quality monitoring. The ensemble reduced false positives by 35% and caught 30% more real anomalies, with only a slight increase in inference time. Results show ensemble ML is worth the complexity for multi-type anomaly detection.

Ensemble ML is everywhere - every blog post, every conference talk claims "combine multiple models for better results." But does it actually work in production?

I built a data quality monitoring system to find out: three ML models (Isolation Forest, LSTM, Autoencoder) working together, processing 332K synthetic orders over 25 days.

Here's what actually happened.

Why I Tested This

"Use ensemble methods" is the standard advice for ML in production. Combine multiple models, get better predictions, and reduce false positives.

Sounds great in theory. But I wanted to know:

  • Does it actually reduce false positives?
  • Is the complexity worth it?
  • Does it work for data quality monitoring specifically?

So I built it. Ran it continuously. Measured everything.

The Setup

Stack:

  • Apache Kafka streaming orders
  • Python processing pipeline
  • PostgreSQL for metrics
  • Three ML models in ensemble
  • Docker Compose (runs locally)

Data: Synthetic e-commerce orders with realistic quality issues injected.

Goal: Compare single model vs. ensemble. Which catches more real issues? Which has fewer false positives?

Baseline: Single Model (Isolation Forest)

Started with just Isolation Forest - the standard choice for anomaly detection:

from sklearn.ensemble import IsolationForest

# Train on 24 hours of quality metrics
historical_data = get_metrics(hours=24)
model = IsolationForest(contamination=0.1)
model.fit(historical_data)

# Predict: scikit-learn returns -1 for anomalies, 1 for normal points
is_anomaly = model.predict(current_metrics) == -1
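For readers who want to try the baseline, here is a fully self-contained version, with synthetic metrics standing in for the `get_metrics()` helper (the numbers and feature layout are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Stand-in for 24h of quality metrics: completeness and validity
# scores that normally hover around 95.
historical_data = rng.normal(loc=95.0, scale=1.5, size=(500, 2))

model = IsolationForest(contamination=0.1, random_state=0)
model.fit(historical_data)

normal_point = [[95.0, 95.0]]
outlier_point = [[45.0, 95.0]]  # validity score crashed

print(model.predict(normal_point))   # [1]  -> normal
print(model.predict(outlier_point))  # [-1] -> anomaly
```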

Week 1 Results:

  • 93% accuracy
  • <10ms inference
  • Caught obvious anomalies
  • 15% false positive rate
  • Missed gradual changes

The false positive rate was the killer: roughly one in every 6-7 alerts was wrong, and teams started ignoring them.

Adding SHAP (Because Black Boxes Don't Work)

Before adding more models, I needed to understand why the single model flagged things:

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(current_metrics)

# Now I get:
# "Primary driver: validity_score = 45.2 (30% below baseline)"

This helped debug false positives. But didn't reduce them.

That's when I decided to try the ensemble approach everyone talks about.

The Ensemble: 3 Models

The theory: Different algorithms catch different problems.

Model 1: Isolation Forest (40% weight)

  • Fast, catches statistical outliers
  • Good at: Sudden spikes, obvious anomalies
  • Misses: Gradual changes, temporal patterns

Model 2: LSTM (30% weight)

  • Neural network for sequences
  • Good at: Temporal patterns, gradual degradation
  • Misses: Sudden outliers

Model 3: Autoencoder (30% weight)

  • Reconstruction-based detection
  • Good at: Unusual feature combinations
  • Misses: Single-dimension outliers

Voting strategy:

ensemble_score = (
    isolation_score * 0.4 +
    lstm_score * 0.3 +
    autoencoder_score * 0.3
)

# Each model "votes" when its own score crosses the threshold
votes = sum(score > 0.5 for score in
            (isolation_score, lstm_score, autoencoder_score))

# Flag if:
# - Combined score high, OR
# - At least 2 models agree
is_anomaly = ensemble_score > 0.5 or votes >= 2
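Wrapped in a function, the voting rule is easy to sanity-check (the scores and 0.5 threshold here are illustrative, not taken from the production system):

```python
def ensemble_decision(isolation_score, lstm_score, autoencoder_score,
                      threshold=0.5):
    """Combine three per-model anomaly scores (each in [0, 1])."""
    ensemble_score = (
        isolation_score * 0.4
        + lstm_score * 0.3
        + autoencoder_score * 0.3
    )
    # A model "votes" anomalous when its own score crosses the threshold.
    votes = sum(s > threshold
                for s in (isolation_score, lstm_score, autoencoder_score))
    # Flag if the weighted score is high OR at least two models agree.
    return ensemble_score > threshold or votes >= 2

# Only Isolation Forest fires (e.g. an unusual-but-valid weekend pattern):
# the ensemble correctly stays quiet.
print(ensemble_decision(0.9, 0.2, 0.2))  # False
# Two models agreeing is enough even if the weighted score is modest.
print(ensemble_decision(0.2, 0.6, 0.6))  # True
```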

The Results: Ensemble vs. Single Model

After 25 days processing 332K orders:

Metric                  Single Model    Ensemble    Improvement
─────────────────────────────────────────────────────────────
Accuracy                93.2%          93.8%       +0.6%
False Positives         15%            9.7%        -35%
Real Anomalies Caught   baseline       +30%        +30%
Inference Time          <5ms           <5ms        same
Training Time           <1 min         3 min       acceptable

The big wins:

  • False positives dropped 35% (15% → 9.7%)
  • Caught 30% more real issues
  • Same inference speed

The cost:

  • Training takes 3 minutes instead of 1
  • More code to maintain
  • Need to monitor 3 models

What Each Model Caught

Example 1: Isolation Forest Alone

Sudden spike in missing fields. Obvious outlier.

  • IF: Detected
  • LSTM: Missed (not temporal)
  • Autoencoder: Missed (single dimension)
  • Ensemble: Detected (1 vote enough)

Example 2: LSTM Alone

Completeness scores dropping 8% over 6 hours.

  • IF: Missed (each point looked normal)
  • LSTM: Detected (temporal pattern)
  • Autoencoder: Missed (not reconstruction issue)
  • Ensemble: Detected (1 vote enough)

Example 3: Autoencoder Alone

Unusual combination: low volume + high value + weekend.

  • IF: Missed (individually normal)
  • LSTM: Missed (no temporal pattern)
  • Autoencoder: Detected (unusual combination)
  • Ensemble: Detected (1 vote enough)

Example 4: False Positive Reduction

Weekend pattern that's unusual but valid.

  • IF: Flagged (outlier)
  • LSTM: Correctly ignored (normal for weekends)
  • Autoencoder: Correctly ignored (seen before)
  • Ensemble: Correctly ignored (2 models voted no)

This is where ensemble shines - reducing false positives through voting.

Does the Complexity Pay Off?

YES, but with conditions:

When ensemble is worth it:

  • You need a low false positive rate (<10%)
  • Different anomaly types exist (temporal, statistical, combinatorial)
  • You can afford 3x training time
  • You have enough data for 3 models

When a single model is enough:

  • False positives aren't a big problem
  • Only one type of anomaly
  • You need the fastest possible training
  • Limited data

For data quality monitoring specifically, the ensemble is worth it.

Why? False positives kill trust. Teams ignore alerts if too many are wrong. Reducing false positives from 15% to 9.7% makes the system actually usable.

The Drift Problem (Affects All Models)

Both single model and ensemble faced the same issue: concept drift.

After 2 weeks, accuracy dropped from 93% to 78%.

Solution: Statistical drift detection with auto-retraining:

from scipy.stats import ks_2samp

reference = get_baseline(feature, hours=24)
current = get_recent(feature, hours=1)

# Kolmogorov-Smirnov test: has the feature's distribution shifted?
statistic, p_value = ks_2samp(reference, current)

if p_value < 0.01:  # distributions differ significantly -> retrain
    retrain_all_models()

This works for both single model and ensemble. Not specific to ensemble approach.
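A quick synthetic illustration of what the KS check catches (NumPy and SciPy only; the distributions are invented for the demo):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Stand-ins for quality-score distributions: a 24h baseline window,
# a drifted recent window, and a stable recent window.
baseline = rng.normal(loc=95.0, scale=2.0, size=1000)
drifted = rng.normal(loc=88.0, scale=2.0, size=200)
stable = rng.normal(loc=95.0, scale=2.0, size=200)

_, p_drift = ks_2samp(baseline, drifted)
_, p_stable = ks_2samp(baseline, stable)

print(p_drift < 0.01)  # True -> retraining would trigger
print(p_stable)        # large p-value -> no retraining
```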

Performance: Does Ensemble Slow Things Down?

Inference (per order):

  • Single model: 4.2ms
  • Ensemble (3 models): 4.8ms
  • Difference: +0.6ms (14% slower)

Training:

  • Single model: 52 seconds
  • Ensemble (3 models): 3 minutes 18 seconds
  • Difference: ~4x slower

Training every 2 hours, so 3 minutes is acceptable. Inference is still <5ms, fast enough for real-time.

Optimization used:

# Run 3 models in parallel
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=3) as executor:
    if_future = executor.submit(isolation_forest.predict, data)
    lstm_future = executor.submit(lstm.predict, data)
    ae_future = executor.submit(autoencoder.predict, data)

    if_result = if_future.result()
    lstm_result = lstm_future.result()
    ae_result = ae_future.result()

Without parallelization: 12ms. With: 4.8ms.
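A toy demonstration of why parallel dispatch helps, using dummy models that just sleep (all names and timings here are made up for the demo):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_predict(name, delay=0.05):
    """Stand-in for one model's predict(): sleeps, then returns a score."""
    time.sleep(delay)
    return name, 0.5

models = ["isolation_forest", "lstm", "autoencoder"]

# Sequential: the delays add up.
start = time.perf_counter()
sequential = [slow_predict(m) for m in models]
t_seq = time.perf_counter() - start

# Parallel: the three sleeps overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(slow_predict, m) for m in models]
    parallel = [f.result() for f in futures]
t_par = time.perf_counter() - start

print(f"sequential: {t_seq * 1000:.0f}ms, parallel: {t_par * 1000:.0f}ms")
```

With three 50 ms "models", the sequential loop takes roughly 150 ms while the threaded version finishes in about 50 ms, the same shape of speedup as the 12 ms vs. 4.8 ms measurement.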

Production Metrics (25 Days)

Orders Processed: 332,308

Quality Checks: 2.8M+

System Uptime: 603.7 hours (100%)

Average Latency: 4.8ms per order

Ensemble Performance:

  • Accuracy: 93.8%
  • False Positives: 9.7%
  • Precision: 94.2%
  • Recall: 91.8%

Comparison to Single Model:

  • False Positives: -35% improvement
  • Real Anomalies: +30% more caught
  • Inference Time: +14% slower (acceptable)
  • Training Time: 4x slower (acceptable)

What I Learned

1. Ensemble DOES reduce false positives

Not by a little. By 35%. This matters a lot for trust.

2. Different models catch different things

Not marketing fluff. Actually true. IF catches outliers, LSTM catches temporal, AE catches combinations.

3. Voting is powerful

When 2/3 models say "not anomalous," they're usually right. This reduces false alarms.

4. The complexity is manageable

Three models instead of one isn't 3x the work. Maybe 1.5x once you have infrastructure.

5. But you need enough data

Each model needs reasonable training data. If you have <100 samples, single model is better.

6. Explainability still required

Ensemble without SHAP is useless. Add SHAP to all three models.

Should You Use Ensemble?

YES if:

  • False positives are a problem
  • You have different types of anomalies
  • You can handle 3-5 minute training
  • You have 500+ training samples

NO if:

  • Single model works fine
  • Training time is critical
  • Limited data (<100 samples)
  • Simple anomaly patterns

For data quality monitoring specifically: Yes.

Why? Data quality has multiple anomaly types (missing fields, wrong formats, unusual patterns, temporal issues). Ensemble catches more with fewer false alarms.

The Code

Open source: https://github.com/kalluripradeep/realtime-data-quality-monitor

Compare single model vs. ensemble yourself. Clone the repo and run:

docker compose up -d
Bottom Line

Ensemble ML isn't just hype. It actually works:

  • 35% fewer false positives
  • 30% more real anomalies caught
  • Same inference speed
  • Worth the extra complexity

For data quality monitoring, it's worth doing.

About the Author

Pradeep Kalluri is a Data Engineer at NatWest Bank and an Apache Airflow and dbt contributor.


Published by HackerNoon on 2026/03/04