If you've ever built a data pipeline for analytics or business intelligence, you know the basics — ingest, transform, store. But regulated industries are a different game entirely. A missed record in a BI pipeline means a slightly off dashboard. A missed record in a compliance pipeline means a regulatory fine, a failed audit, or worse.
Over the past couple of years, I've been designing and running ETL pipelines that process over 500,000 compliance data files every month from more than 25 data providers — emails, instant messages, voice recordings, video recordings, and mobile communications. Every single record needs to be ingested, parsed, normalized, and stored with a full audit trail. No exceptions.
Along the way, I've made plenty of mistakes and discovered patterns that actually work at this scale. Here's what I've learned.
The Problem: Chaos In, Order Out
Here's the reality no one warns you about: no two data providers behave the same way. Some push files via SFTP on a schedule. Others drop them into S3 buckets through event triggers. Some send nicely structured JSON; others don't push at all, so you have to build pullers. Still others send multi-gigabyte XML files with schemas that are barely documented, if documented at all. And every once in a while, a provider changes their format without telling anyone.
Now here's the fun part. Regulations like SEC Rule 17a-4 couldn't care less about your provider's quirks. Every record must be captured completely, stored immutably, and traceable from source to storage.
That gap between messy inputs and strict rules? That's where all the interesting engineering lives.
How We Structured It: A Three-Stage Pipeline
After a lot of trial and error, we landed on a three-stage architecture:
```
File Upload → EventBridge → Parser (Stage 1)
        ↓
Canonical Store (MongoDB)
        ↓
Processor (Stage 2)
        ↓
Alerts + Audit Trail + Retention
        ↓
Export (Stage 3, on-demand)
```
AWS Step Functions orchestrate everything. Each source type gets its own state machine — workflow, retries, error handling — all defined in declarative YAML. Want to change the retry count from 3 to 5? That's a config change, not a deployment.
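The real config schema is internal, but a hypothetical slice of it might look like this (every name and number below is illustrative):

```typescript
// Hypothetical shape of the per-source config -- the real schema is internal.
// The point: retry behavior lives in data, so tuning it is not a deployment.
interface StageRetryConfig {
  maxAttempts: number;
  intervalSeconds: number;
  backoffRate: number; // delay multiplier applied on each retry
}

interface SourceStageConfig {
  stateMachine: string;
  retry: StageRetryConfig;
}

const emailParserStage: SourceStageConfig = {
  stateMachine: 'email-parser',
  retry: {
    maxAttempts: 3, // bump this to 5 here; no code change needed
    intervalSeconds: 30,
    backoffRate: 2.0,
  },
};
```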
This separation turned out to be crucial. Parsing is a different problem from processing, which is a different problem from exporting. Mixing them creates a mess that's impossible to debug at 3 AM when something breaks.
The Generic Parser: From Weeks to Days
Early on, we built custom parsers for every data provider. Each one was a snowflake — different structure, different assumptions, different bugs. Adding a new provider meant 3-4 weeks of custom work. That doesn't scale when you're integrating 25+ sources.
The fix was surprisingly simple: define two interfaces and make everything implement them.
```typescript
interface IParserService {
  parse(input: RawInput): AsyncIterable<BaseParsedData>;
}

interface IProcessingService {
  process(data: BaseParsedData[]): Promise<ProcessedResult>;
}
```
Each data source plugs in its own implementation via dependency injection. The pipeline doesn't care whether it's parsing XML, JSON, or EML — it calls parse(), gets back normalized records, and moves on.
After making this switch, new integrations dropped to about 100 lines of transformer code and 3-5 days of work. That's the difference between supporting 5 providers and supporting 25+.
The principle is simple: standardize the interface, not the source. You'll never get providers to send data in a consistent format. But you can absolutely control how your system consumes it.
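To make the wiring concrete, here's a minimal sketch of how implementations might plug in, with a hand-rolled registry standing in for our actual dependency-injection container (types trimmed, names illustrative):

```typescript
// Minimal sketch of the plug-in wiring. RawInput and BaseParsedData are
// trimmed stubs; the registry stands in for a real DI container.
interface RawInput { sourceKey: string; }
interface BaseParsedData { originId: string; contentText: string; }

interface IParserService {
  parse(input: RawInput): AsyncIterable<BaseParsedData>;
}

const parserRegistry = new Map<string, IParserService>();

function registerParser(sourceType: string, parser: IParserService): void {
  parserRegistry.set(sourceType, parser);
}

function getParser(sourceType: string): IParserService {
  const parser = parserRegistry.get(sourceType);
  if (!parser) throw new Error(`No parser registered for: ${sourceType}`);
  return parser;
}
```

Adding a provider then means writing one `IParserService` implementation and registering it; nothing downstream changes.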
One Schema to Rule Them All
With 25+ source types, you need a single internal representation. Otherwise, every downstream system has to understand every source format — and that's a maintenance nightmare.
We went with a type hierarchy:
```typescript
interface IBaseParsedData {
  originId: string;          // Unique ID from source
  content: string;           // Raw content
  contentText: string;       // Plain text extraction
  originalTimestamp: number; // Epoch ms
  sourceKey: string;         // Original file location
  type: MessageType;         // EMAIL | IM | VOICE | SMS | ...
}
```
Specialized types extend the base. Instant messages have threading and participant fields. Email has distribution lists and attachments. Voice recordings have timestamps and speaker attribution. Downstream processors can work with the base type for generic operations — retention, audit trail creation — and narrow to a specific type only when they actually need source-specific fields.
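A sketch of that narrowing, with the base type trimmed to a few fields and the IM-specific field names assumed for illustration:

```typescript
// Sketch of base-to-specific narrowing. The IM field names are assumptions,
// and the base type is trimmed for brevity.
type MessageType = 'EMAIL' | 'IM' | 'VOICE' | 'SMS';

interface IBaseParsedData {
  originId: string;
  contentText: string;
  type: MessageType;
}

interface IInstantMessageData extends IBaseParsedData {
  type: 'IM';
  threadId: string;
  participants: string[];
}

// Type guard: processors narrow only when they need IM-specific fields.
function isInstantMessage(data: IBaseParsedData): data is IInstantMessageData {
  return data.type === 'IM';
}

function participantCount(data: IBaseParsedData): number {
  return isInstantMessage(data) ? data.participants.length : 0;
}
```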
One rule we enforce strictly: never throw away source-specific data. If a provider sends a field that doesn't map to our canonical model, we preserve it in the raw output anyway. Sometimes auditors ask unpredictable questions, and they ask them years after the data was ingested. "Sorry, we discarded that field" is not an acceptable answer.
Streaming: Because 4GB Files Won't Fit in Memory
Here's a problem you don't encounter until you hit a certain scale. Some providers deliver batch files that are multiple gigabytes in size. Try loading a 4GB XML file into a Lambda function's memory and see what happens. (Spoiler: nothing good.)
We needed constant memory usage regardless of file size. The answer was async iterables — all the way through.
```typescript
async function* parseSourceFile(
  s3Stream: ReadableStream
): AsyncIterable<IBaseParsedData> {
  const parser = createStreamingParser();
  for await (const chunk of s3Stream) {
    parser.feed(chunk);
    while (parser.hasCompleteRecord()) {
      yield normalize(parser.nextRecord());
    }
  }
}
```
The parser yields records one at a time as it reads the file. The downstream processor accumulates them into chunks of 500, flushes to MongoDB, and moves on. At no point does the full file exist in memory. The footprint stays constant: one chunk plus the parser buffer, no matter how big the file is.
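The accumulate-and-flush loop on the processor side looks roughly like this; `persist()` stands in for the MongoDB bulk write:

```typescript
// Sketch of the processor-side accumulator. persist() stands in for the
// MongoDB bulk write; only one chunk is ever held in memory.
const CHUNK_SIZE = 500;

async function processRecords<T>(
  records: AsyncIterable<T>,
  persist: (chunk: T[]) => Promise<void>
): Promise<number> {
  let chunk: T[] = [];
  let total = 0;
  for await (const record of records) {
    chunk.push(record);
    total++;
    if (chunk.length >= CHUNK_SIZE) {
      await persist(chunk); // flush a full chunk, then start a fresh one
      chunk = [];
    }
  }
  if (chunk.length > 0) await persist(chunk); // flush the final partial chunk
  return total;
}
```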
We also set hard ceilings — 2GB for parser text buffers, 200MB for processor chunks. When the limit is hit, the pipeline halts with a clear error instead of quietly dying from an OOM crash.
When Node.js Isn't Fast Enough
Several of our parsers handled deeply nested XML files with hundreds of thousands of records. The Node.js implementation worked fine for small files. But on large exports, it was spending hours on files that needed to finish in minutes to meet SLAs.
So we decided to experiment: we rewrote those parsers in Rust.
The Rust version uses the quick-xml crate with async streaming, compiled to a native Lambda binary. The outcome was roughly a 40x speedup: what used to take a couple of hours now finishes in a few minutes.
Here's the important part — we didn't rewrite the whole pipeline in Rust. That would've been overkill. We identified the single compute-bound bottleneck, rewrote that component, and plugged it back in through Step Functions. The Rust Lambda processes the file and writes normalized output. The rest of the TypeScript pipeline consumes it normally, completely unaware that anything changed upstream.
TypeScript is great for business logic and orchestration, but Rust is great for making a parser go much faster.
Concurrency: MongoDB as a Distributed Throttle
When multiple sources trigger the pipeline simultaneously, they can overload it. We needed a way to cap the number of concurrent runs.
The obvious solution is Redis or a dedicated queue. But we already had MongoDB Atlas as our primary data store, and we didn't want to add another service to monitor and maintain. So we came up with an idea:
```typescript
async function tryAcquireSlot(
  runId: string,
  type: 'Parser' | 'Processor'
): Promise<boolean> {
  const result = await ThrottlingList.findOneAndUpdate(
    {
      type,
      $expr: { $lt: [{ $size: '$runIds' }, MAX_CONCURRENT_RUNS] }
    },
    { $addToSet: { runIds: runId } },
    { new: true }
  );
  return result !== null;
}
```
MongoDB's findOneAndUpdate is atomic — it checks capacity and reserves a slot in a single operation without any race conditions. If the list is full, the Step Function enters a wait state with exponential backoff and retries until a slot opens.
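The wait side reduces to computing the next delay. A sketch of that calculation, with illustrative numbers rather than our production values:

```typescript
// Exponential backoff for the Step Functions wait state. The base delay
// and cap are illustrative values, not the ones we run in production.
function backoffDelaySeconds(attempt: number, base = 30, cap = 600): number {
  return Math.min(base * 2 ** attempt, cap);
}
```

When `tryAcquireSlot` returns false, the state machine waits `backoffDelaySeconds(attempt)` and tries again; the release on completion is the mirror-image update, a `$pull` of the `runId` from the list.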
This is probably not a "proper" way to do concurrency control, but when you're short on resources, you have to get creative. For our use case, concurrency was in the tens of runs, not thousands, so we tried it. It worked well and kept the operational footprint simple. Sometimes the pragmatic solution beats the textbook one.
Tiered Storage: Cutting Costs by 40%
At 500K files per month, storage costs grow quickly. We split data across access tiers:
- Hot: Standard S3 for anything under 90 days old. Fast retrieval for daily compliance work.
- Cold: Automatic lifecycle transition to Glacier Instant Retrieval for older data. SEC 17a-4 can require three to six years of retention, and Glacier handles that at a fraction of standard storage cost.
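In S3 terms, the split above is a single lifecycle rule. Here's a sketch in the shape the S3 API expects (the rule ID and prefix are placeholders):

```typescript
// Sketch of the lifecycle rule behind the tiering. The rule ID and prefix
// are placeholders; GLACIER_IR is S3's Glacier Instant Retrieval class.
const tieringRule = {
  ID: 'compliance-archive-tiering',
  Status: 'Enabled',
  Filter: { Prefix: 'records/' },
  Transitions: [
    { Days: 90, StorageClass: 'GLACIER_IR' }, // hot -> cold after 90 days
  ],
};
```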
The result: ~40% reduction in storage costs, with zero impact on compliance. Most queries hit recent data anyway. The stuff from two years ago only gets touched during audits.
The Bigger Picture
Here's what I keep seeing across regulated industries: most organizations don't have a unified pipeline. They have 5-10 disconnected tools — one for email, one for chat, one for voice, one for reporting. Compliance officers spend their time manually exporting from each system, reformatting in spreadsheets, and cross-referencing by hand.
About 77% of compliance teams still rely on manual processes. Only 1.6% of firms have fully integrated AI into their compliance workflows. At the same time, regulatory compliance runs U.S. businesses an estimated $2.155 trillion annually.
The generic parser pattern I've described isn't just an engineering convenience; it's a way to solve this fragmentation. Compliance teams get a unified view through a single interface rather than stitching together five different exports.
This matters especially for mid-sized firms, where compliance costs per employee run about 40% higher than at large enterprises. They can't afford ten specialized tools. A unified pipeline gives them comprehensive coverage without the cost and operational overhead.
What I'd Tell Someone Starting From Scratch
After several years of doing this, here's what I keep coming back to:
- Standardize the interface, not the source. Build pluggable parsers. New sources should take days, not weeks.
- Stream everything. If your pipeline loads entire files into memory, it will break at scale. Async iterables keep your footprint constant.
- Pick the right language for the bottleneck. TypeScript for logic and orchestration. Rust or C++ for the hot path. A major speedup on one parser can improve your entire system.
- Never silently drop data. Halt, alert, and let a human decide. A known gap is infinitely better than an invisible one.
- Design for the regulatory clock, not just the technical SLA. Your pipeline needs to be fast enough to meet compliance deadlines, and you need monitoring that can prove it.
