"After years of searching, there is still no cure for Digital Disposophobia"

What it takes to move a multi-petabyte archive from legacy tape to hybrid object storage—and why planning, hashing, and real-world limitations matter more than any cloud calculator.

Introduction — The Hidden Costs of Data Migration

When people hear you're migrating 34 petabytes of data, they expect it'll be expensive—but not that expensive. After all, storage is cheap. Cloud providers quote pennies per gigabyte per month, and object storage vendors often pitch compelling cost-per-terabyte pricing. Tape is still considered low-cost. Object systems are marketed as plug-and-play. And the migration itself? Supposedly, just a big copy job.

In reality, the true cost of a large-scale data migration isn't in the storage—it's in the movement.

If you're managing long-term digital archives at scale, you already know: every file has history, metadata, and risk. Every storage platform has bottlenecks. Every bit has to be accounted for. And every misstep—be it silent corruption, metadata loss, or bad recall logic—can cost you time, money, and trust.

This article outlines the early stages of our ongoing migration of 34 petabytes of tape-based archival data to a new on-premises hybrid object storage system—and the operational, technical, and hidden costs we're uncovering along the way.

The Day-to-Day Life of the Preservation Environment

Before we examine the scale and complexity of our migration effort, it's important to understand the operational heartbeat of the current digital preservation environment. This is not a cold archive sitting idle—it is a living, actively maintained preservation system adhering to a rigorous 3-2-1 policy: at least three copies, on two distinct media types, with one copy geographically off-site.

3-2-1 in Practice

Our preservation strategy is based on three concurrent and deliberately separated storage layers:

- Primary Copy (Tape-Based, On-Premises): Housed in our main data center, this is the primary deep archive. It includes Oracle SL8500 robotic libraries using T10000D media and a Quantum i6000 with LTO-9 cartridges, and is orchestrated entirely by Versity ScoutAM.

- Secondary Copy (Tape-Based, Alternate Facility): Located in a separate data center, this second copy is maintained on a distinct tape infrastructure. It acts as both a resiliency layer and a compliance requirement, ensuring survivability in case of a catastrophic site failure at the primary location.

- Tertiary Copy (Cloud-Based, AWS us-east-2): Every morning, newly ingested files written to the Versity ScoutAM system are reviewed and queued for replication to Amazon S3 buckets in the us-east-2 region. This process is automated and hash-validated, ensuring the offsite copy is both complete and independently recoverable.
Importantly, this cloud-based copy is contractual in nature—subject to renewal terms, vendor viability, and pricing structures. To uphold the 3-2-1 preservation standard long-term, we treat this copy as disposable yet essential: if and when the cloud contract expires, the full cloud copy is re-propagated to a new geographically distributed storage location—potentially another cloud region, vendor, or sovereign archive environment. This design ensures that dependency on any single cloud provider is temporary, not foundational.

Daily Lifecycle Operations

Despite the appearance of a "cold archive," this system is active, transactional, and managed daily. Key operations include:

- New Ingests: Files continue to be written to ScoutFS via controlled data pipelines. These often come from internal digitization projects, external partners, or ongoing digital collections initiatives.

- Fixity Verification: For each new ingest, cryptographic checksums are embedded into the user hash space of ScoutFS to ensure future validation. These hashes are stored at time of write and used for all subsequent checks.

- Replication Pipeline (Cloud Offsite Copy): Once a file is written and verified locally, a daily script scans the Versity environment for the current scheduler and gathers entries from the archiver.log to identify directories that had archive jobs executed the previous day. These identified files are queued for replication to AWS S3 in the us-east-2 region. Files are transmitted in their original structure, and upon successful upload, the cloud-stored version is validated using the same hash metadata. Any mismatch is flagged for remediation.
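To make that daily pass concrete, here is a minimal sketch of what it might look like. The log path, the log-line pattern, and the bucket name are placeholders of my own, and the real script's scheduler discovery and remediation workflow are omitted; treat this as a shape, not the production implementation.

```python
#!/usr/bin/env python3
"""Illustrative daily replication pass: find directories archived yesterday
and mirror their files to S3, tagging each object with its SHA-256.
Paths, the log-line pattern, and the bucket name are placeholders."""
import hashlib
import re
from datetime import date, timedelta
from pathlib import Path

import boto3

ARCHIVER_LOG = Path("/var/log/scoutam/archiver.log")  # hypothetical location
BUCKET = "example-preservation-offsite"                # placeholder bucket
REGION = "us-east-2"


def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def directories_archived_on(day: date) -> set[Path]:
    """Collect directories with archive jobs on the given day, assuming log
    lines start with an ISO date and name the archived directory."""
    pattern = re.compile(rf"^{day.isoformat()}.*archived\s+(?P<dir>/\S+)")
    dirs = set()
    for line in ARCHIVER_LOG.read_text().splitlines():
        match = pattern.match(line)
        if match:
            dirs.add(Path(match.group("dir")))
    return dirs


def replicate(day: date) -> None:
    s3 = boto3.client("s3", region_name=REGION)
    for directory in directories_archived_on(day):
        for path in sorted(p for p in directory.rglob("*") if p.is_file()):
            digest = sha256_of(path)
            key = str(path).lstrip("/")  # preserve the original structure
            s3.upload_file(str(path), BUCKET, key,
                           ExtraArgs={"Metadata": {"sha256": digest}})
            # Confirm the object landed with the expected hash metadata.
            # In production the comparison is against the fixity hash stored
            # in ScoutFS at write time, and mismatches go to remediation.
            head = s3.head_object(Bucket=BUCKET, Key=key)
            if head["Metadata"].get("sha256") != digest:
                print(f"REMEDIATE: hash metadata mismatch for {key}")


if __name__ == "__main__":
    replicate(date.today() - timedelta(days=1))
```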
A Moving Target

This is the reality we are migrating from—not a static legacy tape pool, but an active, resilient, and highly instrumented preservation environment. The migration plan outlined in the next section doesn't replace this environment overnight—it transitions just one of the three preservation copies to a new hybrid object storage model. The second tape copy remains fully operational, continuing to receive daily writes, while cloud replication continues for all eligible content. This overlapping strategy allows us to validate new infrastructure in production without putting preservation guarantees at risk.

Upcoming Migration — From Tape to Hybrid Object Archive

We're in the early planning stages of a migration project to move 34PB of legacy cold storage to a new on-premises hybrid object archival storage system. "Hybrid" here refers to an architecture that blends both high-capacity disk and modern tape tiers, all behind an S3-compatible interface. This design gives us the best of both worlds: faster recall and metadata access when needed, with cost-effective, long-term retention via tape.

Legacy Environment:

- Oracle SL8500 robotic tape libraries containing the majority of our archive, based on T10000D cartridges
- Approximately 100 LTO-9 tapes also stored within the SL8500 system
- A Quantum i6000 tape library housing another ~500 LTO-9 cartridges
- Managed and orchestrated via Versity ScoutAM

This mixed tape environment presents real-world operational challenges:

- Legacy T10000D drives are slower, with long mount and seek times
- LTO-9 drives are higher performing but operate in a separate mechanical and logical tier
- Drive sharing, recall contention, and concurrent read bandwidth must be carefully managed

To reduce risk and improve data fidelity, we've started integrating fixity hash values directly into the user hash space within the ScoutFS file system. This ensures each file can be validated during staging, catching any corruption, truncation, or misread before it's written to the new system.
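As a rough illustration of what that integration can look like, the sketch below computes a SHA-256 at ingest and stores it alongside the file. A generic Linux extended attribute with a made-up name stands in for the ScoutFS user hash space, whose actual interface differs.

```python
import hashlib
import os
from pathlib import Path

# Illustrative attribute name; the real ScoutFS user hash space field differs.
FIXITY_ATTR = "user.fixity.sha256"


def record_fixity(path: Path) -> str:
    """Compute a SHA-256 at ingest time and persist it with the file so every
    later stage can validate the bytes without re-reading the original source."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    os.setxattr(path, FIXITY_ATTR, digest.encode("ascii"))  # Linux xattr as a stand-in
    return digest
```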
Our migration target includes not just the 34PB of existing tape-based data, but enough capacity to absorb an additional ~4PB of new ingest annually, for at least the first year. The total provisioned capacity in the new system is 40PB—designed to give us a buffer without overextending infrastructure.

The Real Costs in Migration

Migrations of this scale aren't just about buying space—they're about managing risk, trust, throughput, future-proofing, and time. It's not enough to copy data from point A to point B. At any given moment, you're balancing three active datasets:

- Current production (new data being ingested)
- Data in migration (from legacy tape to staging)
- Data in verification (testing the copied files post-ingest)

Most vendor proposals and cloud calculators overlook the operational cost of running all three states simultaneously. Here's a breakdown of what truly drives cost and complexity in the real world.

System Cost

The new hybrid on-premises archive system is provisioned to support approximately 40PB, allowing us to:

- Absorb the full 34PB migration dataset
- Accommodate at least one year of new ingest, estimated at ~4PB annually

The migration from the legacy tape environment is orchestrated by Versity ScoutAM, which manages a multi-stage pipeline:

- Volume serial number (VSN)-driven recalls from both T10000D and LTO-9 cartridges
- Staging of data into disk-based scratch/cache pools
- Controlled archival into the new S3-compatible object storage system

Additional cache storage was provisioned to:

- Support simultaneous ingest and migration staging
- Handle production workloads
- Allow for delayed verification of migrated files before release

Validation Overhead

To ensure bit-level data fidelity, we've begun populating user hash space fields in the ScoutFS file system with cryptographic fixity checksums prior to recall.

This approach enables:

- On-the-fly validation of files as they are staged from tape
- Comparison of staged file hashes with original stored hashes to immediately detect file corruption, byte truncation, or mismatches from degraded tape or faulty drives

This strategy significantly reduces:

- Redundant hashing workloads during object ingest
- Silent corruption risks introduced during mechanical tape reads
- Migration delays due to manual file triage or inconsistent validation logic
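A companion sketch of the staging-side check, reusing the same illustrative attribute from the earlier example: the hash recorded at write time is compared against a fresh hash of the staged copy, and any mismatch is routed to remediation. Again, this is a shape under stated assumptions, not the production logic.

```python
import hashlib
import os
from pathlib import Path

FIXITY_ATTR = "user.fixity.sha256"  # same illustrative attribute as above


def verify_staged_copy(original: Path, staged: Path) -> bool:
    """Compare the hash recorded at write time against a fresh hash of the
    staged copy; a mismatch signals corruption, truncation, or a bad read."""
    expected = os.getxattr(original, FIXITY_ATTR).decode("ascii")
    h = hashlib.sha256()
    with staged.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != expected:
        print(f"REMEDIATE: {staged} does not match the fixity recorded for {original}")
        return False
    return True
```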
Hidden Taxes — Time, Energy, and Human Overhead

Some of the most significant costs in a multi-petabyte migration don't show up on vendor quotes or capacity calculators—they're buried in the human effort, infrastructure overlap, and round-the-clock support needed to make it all happen. Here's what that looks like in practice.

1. Dual-System Overhead

We expect to operate both the legacy and new archival systems in parallel for at least two full years. That means:

- Power, cooling, and maintenance costs for legacy robotics, tape drives, and storage controllers—even as data is actively migrating away
- Infrastructure costs for the new system (rack space, spinning disk, tape robotics, S3 interface endpoints) that must scale up before the old system scales down
- Ongoing monitoring and maintenance across both environments, which includes two independent telemetry stacks, alerting layers, and queue management processes

The dual-stack reality introduces complexity not just in capacity planning, but in operational overhead—particularly when issues affect both sides of the migration simultaneously.

2. Staffing Requirements
To meet our timeline and operational commitments, the migration team is scheduled for 6-day-per-week operations, running 24 hours per day, with shifts covering:

- Tape handling and media recalls
- Staging and ingest monitoring
- Fixity verification and issue resolution
- Log review, alerting, and dashboard tuning
- Daily oversight of both legacy and new systems

Staff must be able to respond to issues across multiple layers—tape robotics, disk cache performance, object storage health, and software automation pipelines.

3. ScoutAM Operational Load

While Versity ScoutAM serves as the backbone of the migration orchestration, it requires constant operational intervention in a complex legacy environment:

- Frequent manual remediation for ACSLS (Automated Cartridge System Library Software) issues, which affect tape visibility and mount accuracy
- Managing high stage queues, which can stall throughput if not carefully balanced across drives, media pools, and disk cache availability
- Regular validation and tuning of configuration to prevent deadlocks, retries, or starvation scenarios under load

This means that even with automation in place, the system must be actively managed and routinely adjusted to avoid migration stalls.

4. Migration Timeline Pressure

The goal: complete 34PB of migration in 18 to 23 months. That requires:

- Continuous tuning of recall-to-ingest pipelines
- Load balancing across tape drives, scratch pools, and object ingest nodes
- Real-time monitoring of errors, retries, and throughput drops
- Maintaining progress while still supporting current ingest and user requests

Every delay has downstream consequences:

- A failed or slow tape recall can back up staging
- A hash mismatch triggers manual triage
- A missed verification step risks corrupted long-term storage

These aren't exceptions—they're expected parts of the workflow. And they require human expertise, resilience, and continuous iteration to manage effectively.
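To put that timeline in perspective, a quick back-of-the-envelope calculation (decimal petabytes, the 24/6 schedule described above, and no allowance for retries or verification passes) shows the sustained recall-to-ingest rate the pipeline has to hold:

```python
PETABYTE = 10**15  # decimal petabytes
TOTAL_BYTES = 34 * PETABYTE

for months in (18, 23):
    operating_days = months * 30.44 * (6 / 7)  # 24-hour days, 6 days per week
    seconds = operating_days * 86_400
    rate_mb_s = TOTAL_BYTES / seconds / 10**6
    print(f"{months} months: ~{rate_mb_s:,.0f} MB/s sustained, before retries or verification")
```

That works out to roughly 650 to 840 MB/s of sustained end-to-end throughput across mixed tape generations, around the clock, for a year and a half or more.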
The Vendor Blind Spot: Why Calculators Don't Work

Storage vendors and cloud platforms love calculators. Plug in how many terabytes you have, pick a redundancy level, maybe add a retrieval rate, and out comes a tidy monthly cost or migration estimate. It all looks scientific—until you actually try to move 34 petabytes of long-term archive data. The reality is that most calculators are built for static cost modeling, not for complex data movement and verification pipelines that span years, formats, and evolving systems. Here's where they fall short:

1. They Don't Account for Legacy Media Complexity

Calculators assume all your data is neatly stored and instantly accessible. But we're migrating from:

- T10000D cartridges with long mount and seek times
- LTO-9 cartridges in multiple libraries
- A blend of media types, drive generations, and recall strategies

Vendor models don't include the cost of slow robotic mounts, incompatible drive pools, or long recall chains. And they certainly don't account for the manual intervention required to babysit legacy systems like ACSLS.

2. They Ignore Fixity Validation Workflows

Most calculators focus on bytes moved, not bytes verified. In our case:

- Every file must be validated against stored checksums in ScoutFS
- Hash mismatches trigger triage workflows
- Post-write verification in the object system must be staged, timed, and tracked

This adds both compute and storage demand to the migration, as data often exists in three states:

- Original tape format
- Staged file on disk
- Verified object in long-term archive

The calculators? They don't factor in staging costs, hash workloads, or space for verification.

3. They Omit Human Labor

People run migrations—not spreadsheets. Calculators ignore:

- 24/6 staffing models
- On-call support
- Tape librarians
- Log monitoring teams
- Software maintainers

We're running two live environments for two years, with full coverage across legacy tape infrastructure, object archive ingest, and monitoring and verification systems. The people-hours alone are non-trivial operational costs, yet they never appear on vendor estimates.

4. They Assume Ideal Conditions

Calculators assume perfect conditions: all tapes readable, all files intact, all drives healthy, no queue contention, no ingest bottlenecks. That's not real life. In production, drives fail, mounts time out, fixity checks fail, scripts stall, and resources saturate. And every hour lost to those failures is time you can't get back—or model.

5. They Treat Migration as a Cost, Not a Capability

Most importantly, calculators treat migration as a one-time line item, not as a multi-phase operational capability that must be designed, tuned, scaled, monitored, and documented.

For us, migration is a platform feature—not a side task. It requires:

- Real-time logging
- Prometheus/Grafana-based alerting
- API-level orchestration
- Hash-aware data flow management

None of this is in the default TCO calculator.
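To give a flavor of what that instrumentation looks like, here is a minimal sketch of a pipeline metrics exporter using the Python prometheus_client library. The metric names and the sampling loop are illustrative placeholders of mine, not our production schema; a real exporter would pull values from the scheduler, the staging cache, and the object store.

```python
import random  # stand-in for real queue and throughput sampling
import time

from prometheus_client import Counter, Gauge, start_http_server

# Illustrative metric names, not a production schema.
STAGE_QUEUE_DEPTH = Gauge(
    "migration_stage_queue_depth", "Files waiting to be staged from tape")
BYTES_VERIFIED = Counter(
    "migration_bytes_verified_total", "Bytes that passed fixity verification")
HASH_MISMATCHES = Counter(
    "migration_hash_mismatches_total", "Staged files that failed fixity verification")


def sample_pipeline() -> None:
    """Placeholder sampling; a real exporter would query the scheduler,
    the staging cache, and the object store instead of random numbers."""
    STAGE_QUEUE_DEPTH.set(random.randint(0, 5_000))
    BYTES_VERIFIED.inc(random.randint(0, 10**9))


if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    while True:
        sample_pipeline()
        time.sleep(15)
```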
Recommendations for Teams Planning Large Migrations

If you're planning a multi-petabyte migration—especially from legacy tape to modern hybrid storage—understand that your success depends less on how much storage you buy and more on how well you architect your operational pipeline. Here are our key takeaways for teams facing similar challenges:

1. Map Your Environment Thoroughly

- Inventory every media type, volume serial number, and drive model
- Understand robotic behaviors and drive sharing limitations
- Track mount latencies, not just theoretical throughput

2. Build for Simultaneous Ingest, Recall, and Verification

- Expect to run multiple systems in parallel for months to years
- Provision dedicated staging storage to buffer tape recalls and object ingest
- Treat hash verification as a core architectural feature—not a post-process

3. Treat Hashing as Core Metadata

- Use file system-level hash fields (like ScoutFS user hash space) early
- Don't rehash if you can avoid it—store once, validate often
- Ensure every copy operation is backed by fixity-aware logic

4. Invest in Open Monitoring and Alerting

- Use tools like Prometheus, Grafana, and custom log collectors
- Instrument every part of the pipeline—from tape mount to hash verification
- Build dashboards and alert rules before your first PB moves

5. Automate What You Can, Document What You Can't

- Script all recall, ingest, and validation tasks
- Maintain a living runbook for exceptions and intervention playbooks
- Expect edge cases, and document them when they happen

6. Design for Graceful Failure and Retry

- Every file should have a known failure state and retry path (see the sketch below)
- Don't let bad tapes, bad hashes, or stalled queues stop the pipeline
- Build small, testable units of work, not monolithic jobs
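As referenced above, here is a minimal sketch of that idea: each file is a small unit of work with an explicit state and a bounded retry budget, so one bad tape or bad hash never blocks the rest of the queue. The states and retry policy shown are illustrative, not our actual job model.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable


class State(Enum):
    PENDING = auto()
    RECALLED = auto()
    STAGED = auto()
    VERIFIED = auto()
    ARCHIVED = auto()
    FAILED = auto()  # terminal state: parked for human triage


@dataclass
class FileJob:
    """One file, one explicit state, one bounded retry budget."""
    path: str
    state: State = State.PENDING
    attempts: int = 0
    max_attempts: int = 3


def advance(job: FileJob, step: Callable[[FileJob], State]) -> FileJob:
    """Run one pipeline step (recall, stage, verify, archive). On failure,
    retry up to the budget, then park the job so the queue keeps moving."""
    try:
        job.state = step(job)
    except Exception:
        job.attempts += 1
        if job.attempts >= job.max_attempts:
            job.state = State.FAILED
    return job
```

Keeping the failure state explicit means stalled work can be drained, reviewed, and retried later without re-planning the whole migration.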
Conclusion: Migration Is Infrastructure, Not a One-Time Task

Moving 34PB of data isn't a project—it's the creation of an ongoing operational platform that defines how preservation happens, how access is retained, and how risk is managed.

For many institutions, the assumption has been that data needs to be migrated from tape every 7 to 10 years, driven by media obsolescence, hardware aging, and shifting vendor support lifecycles. That rhythm alone is expensive—and it multiplies with every additional tape copy you maintain "just in case."

But what if the storage platform itself were built for permanence? What we're working toward is not just a migration—but a transition to an archival system that inherently supports long-term durability:

- Built-in fault tolerance
- Geographic or media-tier redundancy
- Self-healing mechanisms like checksums and erasure coding
- Verification pipelines that ensure data integrity over decades

If these characteristics are fully realized, it opens the door to reducing the number of physical tape copies required to meet digital preservation standards. Instead of three physical copies to ensure survivability, you may achieve equivalent or better protection with:

- A primary object storage layer
- A cold, fault-tolerant tape tier
- A hash-validated verification log or metadata registry

It doesn't eliminate preservation requirements—it modernizes how we meet them. True digital stewardship means designing systems that migrate themselves, that verify without intervention, and that allow future generations to access and trust the data without redoing all the work. Preservation is no longer about saving the bits. It's about building platforms that do it for us—consistently, verifiably, and automatically.

As we look beyond this migration cycle, a compelling evolution of the traditional 3-2-1 preservation strategy is the integration of ultra-resilient, long-lived media for one of the three preservation copies—specifically, Copy 2.
By writing this second copy to a century-class storage medium such as DNA-based storage, fused silica glass (e.g., Project Silica), ceramic, or film, we can significantly reduce the operational burden of decadal migrations. These emerging storage formats offer write-once, immutable characteristics with theoretical lifespans of 100 years or more, making them ideal candidates for infrequently accessed preservation tiers. If successfully adopted, this approach would allow institutions to focus active migration and infrastructure upgrades on only a single dynamic copy, while the long-lived copy serves as a stable anchor across technology generations. It's not a replacement for redundancy—it's an enhancement of durability and sustainability in preservation planning.