Andrea Marinaro

@andreamarinaro

What I learned by migrating an RDS database into production using AWS DMS

What I learned by migrating an RDS database into production using AWS DMS (Database Migration Service)

…or how to avoid a big headache while live migrating and replicating your db instances

Image from Jan Antonin Kolar — Unsplash

I am currently managing the migration of some on premise services, and other services already present on AWS under a single VPC.

Having active production systems H24 7/7 it was essential to design and implement a fairly precise migration plan with a zero error margin.

While regarding services like EC2 instances, EFS, Security Group, Target Group and Load Balancer the transition was quite simple, I encountered some obstacles during the migration of our RDS database (MariaDB) from one VPC to another.

For this task the native Amazon service, DMS (Database Migration Service) was chosen.

Obviously, not being able to allow discrepancies between the existing and new production databases, the obligatory choice fell on a "live" migration through the "Ongoing replication" mode.

According to Amazon: “…the quickest and easiest way to migrate databases to AWS with low cost and minimal downtime.”

Even if i managed to successfully migrate other non-production databases using DMS, i was kinda like…

To avoid unpleasant surprises, I performed a snapshot of the production database, created a new RDS instance (mariaDB) to then restore the snapshot on the new database.

For those who have never used DMS, the service involves the creation of a replication instance, two endpoints (source and target) and a task (which will take care of migrating data from one database to another).
I preferred to use the db of the production db, and not of another database, as I wanted to replicate in “sandbox env” the exact structure of the database, including the attributes of the tables.

In this article we will not deal with the technical aspects, configurations related to DMS, so if you are already accustomed the following represents the replication task created for my prod-test migration (ARNs are not visible cause … you know why… ):

Task name 
An identifier for your Task
***********:-db-task
Task ARN
This ARN is the stable way uniquely identify your Replication Task when calling DMS APIs
arn:aws:dms:eu-central-1:***********:task:***************
Status
The current computed status of the task. Note that this is a computed value and may not match the raw status from the service API
Load complete
Migration type
How should this task migrate data
Full Load
Replication instance
:***********:
Source endpointSource endpoint
:***********:-frankfurt
Target endpointTarget endpoint
:***********:-db
Mapping method
This is a json document that details how source tables are mapped to the target
{"rules":[{"rule-type":"selection","rule-id":"1","rule-name":"1","object-locator":{"schema-name":"tmp","table-name":"%"},"rule-action":"include","filters":[]},{"rule-type":"selection","rule-id":"2","rule-name":"2","object-locator":{"schema-name":"digital_content_store_kareem","table-name":"%"},"rule-action":"include","filters":[]},{"rule-type":"selection","rule-id":"3","rule-name":"3","object-locator":{"schema-name":"digital_content_store","table-name":"%"},"rule-action":"include","filters":[]},{"rule-type":"selection","rule-id":"4","rule-name":"4","object-locator":{"schema-name":"dcw_unit_testing","table-name":"%"},"rule-action":"include","filters":[]},{"rule-type":"selection","rule-id":"5","rule-name":"5","object-locator":{"schema-name":"dcs_unit_testing","table-name":"%"},"rule-action":"include","filters":[]},{"rule-type":"selection","rule-id":"6","rule-name":"6","object-locator":{"schema-name":"dcs_issam","table-name":"%"},"rule-action":"include","filters":[]},{"rule-type":"selection","rule-id":"7","rule-name":"7","object-locator":{"schema-name":"DCW_reporting","table-name":"%"},"rule-action":"include","filters":[]},{"rule-type":"selection","rule-id":"8","rule-name":"8","object-locator":{"schema-name":"DCW","table-name":"%"},"rule-action":"include","filters":[]}]}
Last failure message
The last failure message from the time the task was run and failed
Created
March 28, 2019 at 4:11:15 PM UTC+1
Started
March 29, 2019 at 2:14:56 PM UTC+1
CDC start positionOption to start Change Data Capture (CDC)
CDC stop positionOption to stop Change Data Capture (CDC)
CDC recovery checkpoint
DMS specific recovery checkpoint string to recover from a previous point
Task Settings
{
  "TargetMetadata": {
    "TargetSchema": "",
    "SupportLobs": true,
    "FullLobMode": true,
    "LobChunkSize": 64,
    "LimitedSizeLobMode": false,
    "LobMaxSize": 0,
    "InlineLobMaxSize": 0,
    "LoadMaxFileSize": 0,
    "ParallelLoadThreads": 0,
    "ParallelLoadBufferSize": 0,
    "BatchApplyEnabled": false,
    "TaskRecoveryTableEnabled": false
  },
  "FullLoadSettings": {
    "TargetTablePrepMode": "DROP_AND_CREATE",
    "CreatePkAfterFullLoad": false,
    "StopTaskCachedChangesApplied": false,
    "StopTaskCachedChangesNotApplied": false,
    "MaxFullLoadSubTasks": 8,
    "TransactionConsistencyTimeout": 600,
    "CommitRate": 10000
  },
  "Logging": {
    "EnableLogging": false,
    "LogComponents": [
      {
        "Id": "SOURCE_UNLOAD",
        "Severity": "LOGGER_SEVERITY_DEFAULT"
      },
      {
        "Id": "SOURCE_CAPTURE",
        "Severity": "LOGGER_SEVERITY_DEFAULT"
      },
      {
        "Id": "TARGET_LOAD",
        "Severity": "LOGGER_SEVERITY_DEFAULT"
      },
      {
        "Id": "TARGET_APPLY",
        "Severity": "LOGGER_SEVERITY_DEFAULT"
      },
      {
        "Id": "TASK_MANAGER",
        "Severity": "LOGGER_SEVERITY_DEFAULT"
      }
    ],
    "CloudWatchLogGroup": null,
    "CloudWatchLogStream": null
  },
  "ControlTablesSettings": {
    "historyTimeslotInMinutes": 5,
    "ControlSchema": "",
    "HistoryTimeslotInMinutes": 5,
    "HistoryTableEnabled": false,
    "SuspendedTablesTableEnabled": false,
    "StatusTableEnabled": false
  },
  "StreamBufferSettings": {
    "StreamBufferCount": 3,
    "StreamBufferSizeInMB": 8,
    "CtrlStreamBufferSizeInMB": 5
  },
  "ChangeProcessingDdlHandlingPolicy": {
    "HandleSourceTableDropped": true,
    "HandleSourceTableTruncated": true,
    "HandleSourceTableAltered": true
  },
  "ErrorBehavior": {
    "DataErrorPolicy": "LOG_ERROR",
    "DataTruncationErrorPolicy": "LOG_ERROR",
    "DataErrorEscalationPolicy": "SUSPEND_TABLE",
    "DataErrorEscalationCount": 0,
    "TableErrorPolicy": "SUSPEND_TABLE",
    "TableErrorEscalationPolicy": "STOP_TASK",
    "TableErrorEscalationCount": 0,
    "RecoverableErrorCount": -1,
    "RecoverableErrorInterval": 5,
    "RecoverableErrorThrottling": true,
    "RecoverableErrorThrottlingMax": 1800,
    "ApplyErrorDeletePolicy": "IGNORE_RECORD",
    "ApplyErrorInsertPolicy": "LOG_ERROR",
    "ApplyErrorUpdatePolicy": "LOG_ERROR",
    "ApplyErrorEscalationPolicy": "LOG_ERROR",
    "ApplyErrorEscalationCount": 0,
    "ApplyErrorFailOnTruncationDdl": false,
    "FullLoadIgnoreConflicts": true,
    "FailOnTransactionConsistencyBreached": false,
    "FailOnNoTablesCaptured": false
  },
  "ChangeProcessingTuning": {
    "BatchApplyPreserveTransaction": true,
    "BatchApplyTimeoutMin": 1,
    "BatchApplyTimeoutMax": 30,
    "BatchApplyMemoryLimit": 500,
    "BatchSplitSize": 0,
    "MinTransactionSize": 1000,
    "CommitTimeout": 1,
    "MemoryLimitTotal": 1024,
    "MemoryKeepTime": 60,
    "StatementCacheSize": 50
  },
  "ValidationSettings": {
    "EnableValidation": true,
    "ValidationMode": "ROW_LEVEL",
    "ThreadCount": 5,
    "PartitionSize": 10000,
    "FailureMaxCount": 10000,
    "RecordFailureDelayInMinutes": 5,
    "RecordSuspendDelayInMinutes": 30,
    "MaxKeyColumnSize": 8096,
    "TableFailureMaxCount": 1000,
    "ValidationOnly": false,
    "HandleCollationDiff": false,
    "RecordFailureDelayLimitInMinutes": 0
  },
  "PostProcessingRules": null
}

Everything should be ok, right ?
Let’s check for example some random table attributes on both databases:

  • PROD-LIKE DB (the one we snapshotted):
| portal_page | CREATE TABLE `portal_page` (   `id` int(10) NOT NULL AUTO_INCREMENT,   `portal_id` int(10) NOT NULL,   `page_id` int(10) NOT NULL,   `section_id` int(10) DEFAULT NULL,   `layout` varchar(10) DEFAULT NULL,   `route_alias` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,   `active` enum('1','0') NOT NULL DEFAULT '1',   PRIMARY KEY (`id`),   UNIQUE KEY `portal_page_section` (`portal_id`,`page_id`,`section_id`),   KEY `portal_id` (`portal_id`),   KEY `page_id` (`page_id`),   CONSTRAINT `FKpp_page_id` FOREIGN KEY (`page_id`) REFERENCES `default_pages` (`id`),   CONSTRAINT `FKpp_portal_id` FOREIGN KEY (`portal_id`) REFERENCES `portal` (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=346 DEFAULT CHARSET=utf8 |
  • PROD-RESTORED, same table on the replicated RDS database:
| portal_page | CREATE TABLE `portal_page` (   `id` int(11) NOT NULL,   `portal_id` int(11) NOT NULL,   `page_id` int(11) NOT NULL,   `section_id` int(11) DEFAULT NULL,   `layout` varchar(10) DEFAULT NULL,   `route_alias` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,   `active` enum('1','0') NOT NULL,   PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 |

So we ended up with all the data in the database, no error show, but tables structure completely altered (or better, not replicated at all).

Well, maybe it was my fault. Let’s open a support ticket to AWS.

In a couple of days i received an answer (and trust me, Aws support guys are amazing and very technically skilled)…
Basically, according to them: ”DMS takes a minimalist approach and creates only those objects required to efficiently migrate the data. In other words, AWS DMS creates tables, primary keys, and in some cases unique indexes, but doesn’t create any other objects that are not required to efficiently migrate the data from the source. For example, it doesn’t create secondary indexes, non primary key constraints, or data defaults”.

Well, Houston we have a problem…

We absolutely needed to have the identical data structure on the new database, so what now ?
Since we are migrating from MariaDB to another MariaDB, AWS support guy suggest me to use a mysqldump to create an identical replica (…)

But no, wait… we need live data replicated to avoid data losses.
So, no old good mysqldump for us.
But hey, wait a moment… it’s only a data structure issue, right ?
So why not trying to dump ONLY the data structure from the old database, restore the data structure dump on the new database and then start again the replication task ? In this way we should be able to keep both structure and data consistency… ! (?)

Only way to know, trying it.
A good reference about mysqldump command can be found here: https://dev.mysql.com/doc/refman/8.0/en/mysqldump.html

Once done, i started again my replication task with “Truncate” option enabled (this is the most important thing because it’s the only way to replicate your data and keep the data structure as it is. If you choose “Drop tables on target” for example, you’ll end up with the same situation we encountered on our first test, that means no data structure consistency during and after replication).

And here we are, we now have an ongoing replication instance that will replicate every single change on the main database to the other one.

Have fun!

NOTE: All product names, logos, and brands are property of their respective owners.

More by Andrea Marinaro

Topics of interest

More Related Stories