Seven major replatformings over eleven years — and I will tell you something counterintuitive: the scary part is never the cutover itself. The cutover is rehearsed, scripted, and everyone is watching the dashboards. The scary part is what happens three weeks later, when everyone has moved on, and something quietly breaks in a way that nobody notices for months. I still get a knot in my stomach thinking about the third week after cutover on my first 10TB+ migration. That is when the silent failures start whispering.

Look — I have learned this the hard way. Over those years, I have led and contributed to large-scale database replatforming efforts — moving 16TB financial systems from legacy monoliths to distributed microservice architectures without taking the system offline. The kind of migrations where a single lost transaction means real money gone, and "we'll fix it in the next release" is not an option.

What I have learned is that most migration guides get the big picture right. Dual-write, shadow traffic, canary rollout — these are well-understood patterns. But the actual failures I have seen in production do not come from getting the pattern wrong. They come from the gaps between the patterns — edge cases that surface only under real load, real latency, and real failure conditions. None of them showed up in staging.

This article catalogs five specific failure modes I have encountered, along with the detection and mitigation strategies that actually worked. If you are planning or executing a large-scale database migration, at least a couple of these will sound uncomfortably familiar.

## A Quick Note on Terminology

When I say "replatforming," I mean something more ambitious than a schema migration or a version upgrade. I am talking about moving from one database technology to another — often from a monolithic relational system to a combination of purpose-built data stores — while the production system continues to serve traffic. The data model changes, the access patterns change, the consistency guarantees change. Everything is in flux simultaneously.

This is a fundamentally different beast from running `ALTER TABLE` with zero downtime. The failure modes are architectural, not operational.

## The Standard Playbook (and Where It Falls Apart)

The typical zero-downtime migration follows some variant of this progression:

- **Phase 1 — Dual Write:** Every write goes to both the old and new data stores. The old store remains authoritative. The new store accumulates data in the background.
- **Phase 2 — Shadow Read:** Reads are served from the old store, but the system also reads from the new store and compares results. Discrepancies are logged but not surfaced to users.
- **Phase 3 — Canary Traffic:** A small percentage of real read traffic is served from the new store. The old store remains the fallback.
- **Phase 4 — Cutover:** The new store becomes authoritative. Reads and writes shift entirely. The old store is kept alive in read-only mode for rollback.
- **Phase 5 — Decommission:** The old store is retired after a bake period.

Straightforward enough on a whiteboard. In practice, each phase hides at least one way to ruin your quarter. Here are the five that cost me the most sleep.

## Failure Mode 1: Shadow Delta Drift

During the shadow read phase, you run every read against both data stores and compare results. Your dashboard shows a delta rate of 0.02% — well within your tolerance.
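In most systems I have seen, that comparison is implemented as a single aggregate counter. Here is a minimal sketch of that naive shape; the class and method names are illustrative, not from any particular codebase:

```java
import java.util.Objects;
import java.util.concurrent.atomic.LongAdder;

/** Naive shadow-read comparator: one flat delta rate across every query type. */
public class ShadowReadComparator {

    private final LongAdder comparisons = new LongAdder();
    private final LongAdder mismatches = new LongAdder();

    /** Compare the legacy store's result with the new store's result for the same read. */
    public <T> void compare(T legacyResult, T newStoreResult) {
        comparisons.increment();
        if (!Objects.equals(legacyResult, newStoreResult)) {
            // Logged for later analysis; never surfaced to the user during the shadow phase.
            mismatches.increment();
        }
    }

    /** The single number the dashboard shows; every mismatch weighted equally. */
    public double deltaRatePercent() {
        long total = comparisons.sum();
        return total == 0 ? 0.0 : 100.0 * mismatches.sum() / total;
    }
}
```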
Green across the board. You move to the canary phase feeling confident. Then a customer reports that their account balance is wrong.

The problem is not that shadow reads were inaccurate. The problem is what the 0.02% delta actually represented.

During a 16TB migration of a financial instrument platform — think stored-value products at massive scale — we discovered the hard way that our shadow deltas were overwhelmingly concentrated in high-value, low-frequency transactions. When we dug in, 92% of the mismatches were in balance-critical paths. A single undetected mismatch on this system would have meant six-figure reconciliation headaches and a very uncomfortable conversation with the finance team.

The 99.98% that matched were simple reads — profile lookups, status checks, the kind of bread-and-butter queries that every system handles well. The 0.02% that did not match included balance calculations involving complex joins across denormalized data, currency conversions with rounding differences, and transactions that hit timing windows in the dual-write propagation.

The Tier 1 delta alert fired the afternoon before our peak traffic rehearsal (14 hours out, about 36 hours before canary cutover), and we caught it only because we had started stratifying by tier. If we had waited until the following Monday? Let's just say finance would have been... vocal. Moments like that are when you learn who on your team can actually debug under pressure.

### Why This Is Hard to Catch

Shadow read comparison is typically implemented as a bulk metric: total matches divided by total comparisons. This treats a mismatched profile name with the same severity as a mismatched financial balance. When your delta rate is below threshold, nobody digs into which records are diverging.

### How We Fixed It

We stratified our shadow read comparison by business criticality tier:
- **Tier 1 (financial):** Balance reads, transaction history, settlement records → delta tolerance 0.000% (zero tolerance; any mismatch triggers an alert)
- **Tier 2 (correctness):** Order status, inventory counts, shipping state → delta tolerance 0.01%
- **Tier 3 (eventual):** User preferences, analytics counters, recommendation data → delta tolerance 0.5%

The implementation was not complicated — a simple enum with per-tier thresholds did most of the heavy lifting:

```java
// Shadow comparator tiering — added during 16TB migration (Q3 '22)
public enum DataTier {
    TIER_1_FINANCIAL(0.0),      // balances, settlements — zero tolerance
    TIER_2_CORRECTNESS(0.01),   // orders, inventory
    TIER_3_EVENTUAL(0.5),       // preferences, analytics
    // TODO: deprecate TIER_3 after cleanup phase — still used by legacy reporting
    ;

    private final double maxDeltaPercent;

    DataTier(double maxDeltaPercent) {
        this.maxDeltaPercent = maxDeltaPercent;
    }

    // Comparator routes mismatches to tier-specific alert channels:
    //   Tier 1 → PagerDuty immediate
    //   Tier 2 → Slack #migration-alerts
    //   Tier 3 → daily digest email
}
```

We tagged each read query with its tier at the application layer, and the shadow comparator routed mismatches to different alerting channels based on that tag. Tier 1 mismatches paged on-call immediately. Tier 3 mismatches went to a daily digest.

What made this painful was not the engineering — it was the politics. Defining which queries belong in which tier forced product and engineering to have conversations about what "correct" actually means for each data path. In several cases, we discovered that the legacy system's behavior was itself inconsistent, and we had to decide whether the migration target should replicate the bug or fix it.

My recommendation: have the tiering conversation before you start shadow reads, not after you find the first divergence.

## Failure Mode 2: The Bidirectional CDC Lag Trap

If you are doing a reversible migration — and for financial-grade systems, you should be, no exceptions — you probably have Change Data Capture (CDC) running in both directions. Forward CDC streams changes from old to new during the migration. Reverse CDC streams changes from new to old after cutover, so you can roll back if things go sideways.

Here is the trap: forward CDC and reverse CDC do not behave symmetrically under load.

During one migration, we had forward CDC running for months. It was well-tuned, consistently maintaining sub-second lag. We were confident. After cutover, we activated reverse CDC for rollback safety.
Within hours, the reverse CDC lag started climbing — not catastrophically, but steadily. It went from 2 seconds to 8 seconds to 45 seconds over the course of a day.

The root cause was that the new data store had a fundamentally different write amplification profile. What was a single row update in the old monolithic schema became three writes across two tables in the new microservice architecture. The reverse CDC had to transform and consolidate those writes back into the old schema's format — and here is the part that took us 11 hours of war-room debugging to figure out. I will admit we wasted the first 4 hours staring at database metrics, because that is where you expect the fire to be. When we finally thought to check the Debezium connector host, there it was: CPU pegged at 94%, threads blocked on Jackson deserialization. The connector was choking on JSON-to-relational transforms while the source and target databases sat comfortably at 12% CPU.

Moral of the story: when CDC lags, check the middleware first. I have that tattooed on my brain now.

The real danger was not the lag itself — it was what the lag meant for our rollback window. Our rollback runbook assumed we could switch back to the old store at any time with no more than 5 seconds of data loss. With 45 seconds of lag and climbing, a rollback would have silently dropped 45 seconds of transactions. For a financial system processing hundreds of operations per second, that is not a minor data loss event.

### Detection

We added a dedicated metric for CDC lag in the reverse direction and set up alerts at three thresholds:

- **Warning (lag > 5s):** Triggers investigation.
- **Critical (lag > 30s):** Halts any non-essential changes. On-call reviews rollback impact.
- **Emergency (lag > 120s):** Initiates rollback immediately, because the longer you wait, the worse the data loss on rollback becomes.

The non-obvious insight: set your emergency threshold based on the maximum acceptable data loss during rollback, not based on what "feels" like a lot of lag. Here is the actual alerting config we ended up with (sanitized but structurally identical):

```yaml
# Reverse CDC lag alerting — thresholds derived from rollback SLA
# WARNING: Do NOT relax the emergency threshold. We tried 180s on an earlier
# migration; rollback lost 47s of transactions. 120s is the ceiling.
reverse_cdc_alerts:
  warning:
    condition: lag_seconds > 5 for 5m
    action: page secondary on-call, begin investigation
  critical:
    condition: lag_seconds > 30 for 2m
    action: freeze non-essential deploys, on-call reviews rollback impact
  emergency:
    condition: lag_seconds > 120
    action: auto-trigger rollback runbook, page incident commander
```

### Prevention

Since then, we have run a "reverse CDC stress test" before every cutover.
For 48 hours, we replay production write traffic through the new data store and measure how the reverse CDC performs under realistic load. This has caught write amplification issues before they mattered.

If your forward CDC has been running smoothly for months, do not assume the reverse will behave the same way. The data models are different, the write patterns are different, and the transformation logic is different. Test it independently.

## Failure Mode 3: The Query Plan Regression

Same data. Same indexes. Same queries. Different database engine — or even just a different version of the same engine. Wildly different query execution plan.

This one almost took down a production system on a Saturday.

After migrating a 16TB dataset, we ran our standard validation suite — row counts matched, checksums matched, sample queries returned identical results. Everything looked clean. We cut over read traffic. Within two hours, the new database's CPU hit 98%. Latency spiked from 15ms to 4 seconds.

The root cause: several critical queries had picked up full table scans instead of index seeks. The query optimizer in the new database made different cost-based decisions about the same data distribution, and those decisions were catastrophically wrong for our workload.

The specific trigger was a statistics collection gap. The new database had not yet gathered enough statistics about the data to make informed optimization decisions. It was building plans from default cardinality estimates, and for our highly skewed data distributions (think: 80% of queries hitting 2% of the data), those defaults led to full table scans on our largest tables.

### The Deeper Problem

This was not just a "run `ANALYZE`" situation, though that was the immediate fix. The deeper issue was that the new database's autovacuum and statistics collection were misconfigured for our table sizes. On tables with hundreds of millions of rows, the default autovacuum settings meant that statistics were perpetually stale. We were essentially running on outdated execution plans all the time — the cutover just made it visible because the initial plans were particularly bad.

We caught a related problem in the same week: autovacuum could not keep up with our transaction rate on the largest tables, and the transaction ID counter was creeping toward wraparound. For those unfamiliar, PostgreSQL uses a 32-bit transaction ID space — 2^31 usable IDs, roughly 2.1 billion — and when it runs out, the database hard-stops all writes to protect data integrity. There is no graceful degradation. It just stops.

We were burning through 18 million transaction IDs per day, with the largest table closest to the threshold. When we finally ran `txid_current()` at 2 AM during the post-cutover war room and did the arithmetic, we had 173 million IDs of headroom left. Eight days of runway on a system handling peak-season traffic. I will not lie, I felt physically sick for about twenty minutes. Then we set `autovacuum_freeze_max_age` to 150 million (not the default 200 million) on every table over 500 million rows and watched the counter stabilize.

That config line is now in every migration runbook I touch. The extra 50 million IDs of buffer (forced freezing kicks in at 150 million instead of 200 million) gives you breathing room when autovacuum falls behind, and it will fall behind during a migration.
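If you want to watch that runway rather than discover it at 2 AM, a scheduled check against the catalogs is enough. Below is a minimal sketch, assuming PostgreSQL with its standard JDBC driver on the classpath; the class name, the `MIGRATION_DB_URL` environment variable, and the 500-million-row cutoff are illustrative, not part of our actual tooling:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Polls per-table transaction ID age so wraparound runway shows up on a dashboard, not in a war room. */
public class TxidRunwayCheck {

    // Matches the per-table setting below; alert well before tables reach it.
    private static final long FREEZE_MAX_AGE = 150_000_000L;

    public static void main(String[] args) throws Exception {
        // age(relfrozenxid) = how many transaction IDs old the table's oldest unfrozen row is.
        String sql =
            "SELECT c.relname, age(c.relfrozenxid) AS xid_age "
          + "FROM pg_class c "
          + "WHERE c.relkind = 'r' AND c.reltuples > 500000000 "  // only the very large tables
          + "ORDER BY xid_age DESC";

        try (Connection conn = DriverManager.getConnection(System.getenv("MIGRATION_DB_URL"));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                long xidAge = rs.getLong("xid_age");
                long runway = FREEZE_MAX_AGE - xidAge;  // IDs left before the anti-wraparound vacuum is forced
                System.out.printf("%-30s xid age %,15d  runway %,15d%n",
                        rs.getString("relname"), xidAge, runway);
            }
        }
    }
}
```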
Here are the exact autovacuum settings we deployed on every table over 500 million rows:

```sql
-- Per-table autovacuum tuning for large migrated tables
ALTER TABLE large_events SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold = 50000,
    autovacuum_analyze_scale_factor = 0.005,
    autovacuum_analyze_threshold = 10000,
    autovacuum_freeze_max_age = 150000000
);
```

### Mitigation

Three things we now do for every migration:

1. **Pre-warm statistics before cutover.** After the initial data load, run full statistics collection on every table. Do not rely on autovacuum to catch up.
2. **Replay production query patterns against the new store under load.** Not synthetic benchmarks — actual production query logs replayed at production volume. Compare execution plans explicitly using `EXPLAIN ANALYZE` on the top 50 queries by frequency and the top 50 by resource consumption.
3. **Tune autovacuum per table, not globally.** Large tables need aggressive autovacuum settings — lower thresholds, higher worker counts. The defaults assume moderate-sized tables with moderate write rates, which is not what you get in a system that just absorbed a 16TB migration.

## Failure Mode 4: The Forwarding Dependency Stall

You have migrated the data layer. You have migrated the primary read and write paths. But somewhere in the call graph, there is still a dependency that routes through the old monolith.

In our case, the new microservices handled all direct data operations — reads, writes, the core business logic. But a particular class of requests — let's call them aggregate queries — still needed data that lived in the old system. Rather than delay the migration to rewrite these queries, we set up a forwarding path: the new service called the old monolith's internal API to fetch the aggregated data, then combined it with data from the new store.

This worked fine for months. Then one Tuesday afternoon, the old monolith's database experienced a latency spike — unrelated to our migration, just the old system being old. Response times on the monolith went from 50ms to 3 seconds.

Normally, that is the old system's problem. But because our new service was forwarding requests to the monolith, those 3-second responses tied up connections and threads in our new service's shared pools. Within minutes, the thread pool was saturated.
All requests — including the ones that had nothing to do with the forwarding path — started queuing. Our tier-1 customer-facing requests, which were fully migrated and should have been immune to monolith issues, were failing because they could not get a thread. We had successfully migrated the data but accidentally imported a dependency on the old system's availability.

### Why This Is Insidious

This failure mode does not show up in any migration checklist I have ever seen. Believe me, I have looked. Your data is migrated. Your primary flows are migrated. The forwarding path handles edge cases that represent maybe 5% of traffic. It feels like a reasonable temporary compromise.

But that 5% traffic path shares infrastructure with your 95% path — connection pools, thread pools, circuit breakers. Unless you have explicitly isolated it, a latency spike on the forwarding path will cascade into your fully-migrated flows.

### The Fix

Two things, in order of priority:

**First, isolate the forwarding path's resources.** Give it its own connection pool, its own thread pool, its own circuit breaker with aggressive timeouts. If the old system is slow, the forwarding path should fail fast and return a degraded response — not drag down the rest of the system.

- Main thread pool: 200 threads → handles direct data operations
- Forwarding pool: 20 threads → handles legacy aggregate queries
- Circuit breaker: timeout 500ms, open after 5 failures in 10 seconds
- Fallback: return cached/stale aggregate data or graceful error

**Second, treat the forwarding dependency as tech debt with a hard deadline.** It is easy to let these linger because they "work fine." They work fine until they don't, and they always stop working at the worst possible time. We set a 90-day deadline from cutover to eliminate all forwarding dependencies, and tracked it as a top-priority reliability item.

## Failure Mode 5: The Deferred Cleanup Bomb

The migration is done. The new system is humming. The team celebrates. Three months later, performance degrades slowly and nobody connects it to the migration.

In our case, the symptom was gradually increasing query latency on the new database — not a spike, just a steady 2-3% degradation per week. For the first month, it was within normal variation. By month two, p99 latency had doubled. By month three, we were investigating.

The root cause was a combination of two deferred cleanup issues:

**Orphaned data from the dual-write phase.** During dual-write, every write went to both stores. But not every write in the old store had a corresponding cleanup. Specifically, soft-deleted records in the old system were being dual-written to the new system as active records. The new system's table sizes were growing faster than expected, and indexes were bloated with records that should not have been there.

**Residual CDC artifacts.** Our CDC pipeline wrote metadata alongside each replicated row — timestamp of replication, source transaction ID, replication batch number. This metadata was useful during migration for debugging and reconciliation. Post-migration, it was dead weight.
But nobody thought to clean it up, and these extra columns were increasing row sizes by roughly 15%, which meant more I/O for every table scan and index operation.

Neither of these individually was catastrophic. Together, they compounded into measurable degradation that slowly ate into our performance budget.

### What We Do Now

We maintain a "migration cleanup checklist" that runs for 90 days post-cutover:

- **Week 1-2:** Remove CDC metadata columns. Drop replication-specific indexes.
- **Week 2-4:** Reconcile record counts between old and new stores. Identify and purge orphaned records from dual-write inconsistencies.
- **Week 4-8:** Full vacuum and reindex on all migrated tables. Validate that table sizes match expected growth projections.
- **Week 8-12:** Remove forwarding dependencies (see Failure Mode 4). Drop temporary compatibility views and translation layers.

The checklist is boring, unglamorous work. I get it — nobody wants to spend their sprint capacity on database housekeeping after the excitement of the migration. But it is the difference between a migration that stays healthy and one that slowly rots.

## The Reversible Migration Framework

The five failure modes above share a common thread: they all happen in the gaps between migration phases. Shadow deltas drift because the validation is too coarse. CDC lag traps appear because forward and reverse paths are tested asymmetrically. Query plans regress because pre-warming is skipped. Forwarding dependencies stall because isolation is deferred. Cleanup bombs detonate because post-migration work is not tracked.

To address these gaps, I have settled on a framework that treats reversibility as the central design constraint — not as a nice-to-have rollback plan, but as a property that the system must maintain at every point during the migration. The framework has three principles:

**1. Every phase must be independently reversible.** Moving from Phase 2 (shadow read) to Phase 3 (canary traffic) should be a configuration change, not a code deployment. Moving backward from Phase 3 to Phase 2 should be equally trivial. This means all phase transitions are controlled by feature flags or dynamic configuration — never by code changes that require a deploy.

**2. The exit criteria for each phase must be quantitative, not qualitative — and I mean actually quantitative.** "Shadow reads look good" is not an exit criterion. "Zero Tier-1 shadow mismatches for 72 consecutive hours, Tier-2 delta rate < 0.01%, reverse CDC lag < 5s for 48 consecutive hours, p99 read latency within 10% of baseline" — that is an exit criterion. We document these thresholds before the migration begins, and we do not override them under schedule pressure.

**3. Rollback cost is measured continuously, not assumed.** At every point during the migration, we know the answer to: "If we roll back right now, how much data do we lose?"
This is a function of reverse CDC lag, in-flight transaction count, and any irreversible side effects (external API calls, published events). If the rollback cost exceeds our tolerance, we pause and fix the underlying issue before proceeding.

## Measuring Success

After applying this framework across multiple large-scale migrations, the outcomes were consistent:

- Zero customer-visible incidents during cutover across all migrations.
- 40% reduction in infrastructure costs post-migration (the primary business driver in most cases).
- 15-minute maximum cutover windows, with instant rollback capability throughout.
- Post-migration latency within 5% of pre-migration baseline after cleanup.

But honestly, the metric I care about most is the one that is hardest to measure: the number of silent failures that did not happen because we caught them during shadow reads or reverse CDC testing. You do not get a dashboard for disasters prevented.

## Closing Thoughts

If there is a single takeaway here, it is this: database migrations fail silently because we validate the happy path and skip the edges. The bulk of your data will migrate cleanly. The bulk of your queries will work fine. The bulk of your CDC stream will keep up. It is the 0.02% of shadow deltas that concentrate in your most critical data paths. It is the reverse CDC that works perfectly at low volume but falls behind under production load. And it is the query that ran fine for five years on the old optimizer and suddenly decides a full table scan is a good idea on the new one.

None of these show up in a migration guide. They show up at 2 AM on a Tuesday when you have already marked the migration as "complete" in your project tracker.

My advice: do not trust aggregate metrics. Stratify everything by business impact. Test your rollback path as rigorously as your forward path. And budget three months of cleanup work after cutover — not because you might need it, but because you will. Every. Single. Time.

I learned this the hard way on my second major migration — skipped the 90-day cleanup because the system looked fine. Three months later, we spent two engineering weeks untangling index bloat that would have taken two hours to prevent. Do not be like past-me.