What I learned by migrating an RDS database into production using AWS DMS (Database Migration Service)

by Andrea Marinaro, April 4th, 2019

…or how to avoid a big headache while live migrating and replicating your db instances

I am currently managing the migration of some on-premise services, along with other services already running on AWS, under a single VPC.

With production systems active 24/7, it was essential to design and implement a fairly precise migration plan with zero margin for error.

While the transition was quite simple for services like EC2 instances, EFS, Security Groups, Target Groups and Load Balancers, I encountered some obstacles while migrating our RDS database (MariaDB) from one VPC to another.

For this task, the native Amazon service DMS (Database Migration Service) was chosen.

Obviously, since we could not allow any discrepancy between the existing and the new production databases, the obligatory choice was a "live" migration through the "Ongoing replication" mode.

According to Amazon: “…the quickest and easiest way to migrate databases to AWS with low cost and minimal downtime.”

Even though I had already managed to successfully migrate other non-production databases using DMS, I was still a bit wary about touching production…

<a href="https://medium.com/media/02ed01f042aa15adced4ad548395d86a/href">https://medium.com/media/02ed01f042aa15adced4ad548395d86a/href</a>

To avoid unpleasant surprises, I took a snapshot of the production database and created a new RDS instance (MariaDB) onto which I restored the snapshot.
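For reference, this snapshot-and-restore step can also be done from the AWS CLI. This is only a rough sketch; the instance and snapshot identifiers below are made-up placeholders, not the real ones:

# Take a manual snapshot of the production instance (identifiers are placeholders)
aws rds create-db-snapshot \
  --db-instance-identifier prod-mariadb \
  --db-snapshot-identifier prod-mariadb-pre-dms

# Wait for the snapshot to become available
aws rds wait db-snapshot-available \
  --db-snapshot-identifier prod-mariadb-pre-dms

# Restore it into a brand new "prod-like" instance to experiment on
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier prod-like-mariadb \
  --db-snapshot-identifier prod-mariadb-pre-dms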


For those who have never used DMS, the service involves creating a replication instance, two endpoints (source and target) and a task (which takes care of migrating the data from one database to the other). I preferred to use a copy of the production db, and not another database, because I wanted to replicate in a "sandbox env" the exact structure of the database, including the attributes of its tables.
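To give a rough idea of what that setup looks like from the CLI, here is a minimal sketch; every name, host, credential and ARN below is a placeholder, not the real configuration used for this migration:

# 1. The replication instance (placeholder name and size)
aws dms create-replication-instance \
  --replication-instance-identifier dms-migration-instance \
  --replication-instance-class dms.t2.medium \
  --allocated-storage 50

# 2. The source and target endpoints (placeholder hosts and credentials)
aws dms create-endpoint \
  --endpoint-identifier source-mariadb --endpoint-type source --engine-name mariadb \
  --server-name old-db.example.com --port 3306 --username dms_user --password '********'

aws dms create-endpoint \
  --endpoint-identifier target-mariadb --endpoint-type target --engine-name mariadb \
  --server-name new-db.example.com --port 3306 --username dms_user --password '********'

# 3. The task that ties everything together (ARNs are placeholders)
aws dms create-replication-task \
  --replication-task-identifier prod-test-migration \
  --migration-type full-load-and-cdc \
  --source-endpoint-arn arn:aws:dms:eu-central-1:123456789012:endpoint:SOURCE \
  --target-endpoint-arn arn:aws:dms:eu-central-1:123456789012:endpoint:TARGET \
  --replication-instance-arn arn:aws:dms:eu-central-1:123456789012:rep:INSTANCE \
  --table-mappings file://table-mappings.json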

In this article we will not go into the technical details of configuring DMS, so if you are already familiar with it, the following is the replication task created for my prod-test migration (ARNs are hidden because… you know why…):

Task name: ***********:-db-task

Task ARN: arn:aws:dms:eu-central-1:***********:task:***************

Status: Load complete

Migration type: Full Load

Replication instance: :***********:

Source endpoint: :***********:-frankfurt

Target endpoint: :***********:-db

Created: March 28, 2019 at 4:11:15 PM UTC+1

Started: March 29, 2019 at 2:14:56 PM UTC+1

Last failure message: (none)

CDC start position / CDC stop position / CDC recovery checkpoint: (not set)

Mapping method (a JSON document that details how source tables are mapped to the target):

{
  "rules": [
    {"rule-type": "selection", "rule-id": "1", "rule-name": "1", "object-locator": {"schema-name": "tmp", "table-name": "%"}, "rule-action": "include", "filters": []},
    {"rule-type": "selection", "rule-id": "2", "rule-name": "2", "object-locator": {"schema-name": "digital_content_store_kareem", "table-name": "%"}, "rule-action": "include", "filters": []},
    {"rule-type": "selection", "rule-id": "3", "rule-name": "3", "object-locator": {"schema-name": "digital_content_store", "table-name": "%"}, "rule-action": "include", "filters": []},
    {"rule-type": "selection", "rule-id": "4", "rule-name": "4", "object-locator": {"schema-name": "dcw_unit_testing", "table-name": "%"}, "rule-action": "include", "filters": []},
    {"rule-type": "selection", "rule-id": "5", "rule-name": "5", "object-locator": {"schema-name": "dcs_unit_testing", "table-name": "%"}, "rule-action": "include", "filters": []},
    {"rule-type": "selection", "rule-id": "6", "rule-name": "6", "object-locator": {"schema-name": "dcs_issam", "table-name": "%"}, "rule-action": "include", "filters": []},
    {"rule-type": "selection", "rule-id": "7", "rule-name": "7", "object-locator": {"schema-name": "DCW_reporting", "table-name": "%"}, "rule-action": "include", "filters": []},
    {"rule-type": "selection", "rule-id": "8", "rule-name": "8", "object-locator": {"schema-name": "DCW", "table-name": "%"}, "rule-action": "include", "filters": []}
  ]
}

Task Settings
{
  "TargetMetadata": {
    "TargetSchema": "",
    "SupportLobs": true,
    "FullLobMode": true,
    "LobChunkSize": 64,
    "LimitedSizeLobMode": false,
    "LobMaxSize": 0,
    "InlineLobMaxSize": 0,
    "LoadMaxFileSize": 0,
    "ParallelLoadThreads": 0,
    "ParallelLoadBufferSize": 0,
    "BatchApplyEnabled": false,
    "TaskRecoveryTableEnabled": false
  },
  "FullLoadSettings": {
    "TargetTablePrepMode": "DROP_AND_CREATE",
    "CreatePkAfterFullLoad": false,
    "StopTaskCachedChangesApplied": false,
    "StopTaskCachedChangesNotApplied": false,
    "MaxFullLoadSubTasks": 8,
    "TransactionConsistencyTimeout": 600,
    "CommitRate": 10000
  },
  "Logging": {
    "EnableLogging": false,
    "LogComponents": [
      {
        "Id": "SOURCE_UNLOAD",
        "Severity": "LOGGER_SEVERITY_DEFAULT"
      },
      {
        "Id": "SOURCE_CAPTURE",
        "Severity": "LOGGER_SEVERITY_DEFAULT"
      },
      {
        "Id": "TARGET_LOAD",
        "Severity": "LOGGER_SEVERITY_DEFAULT"
      },
      {
        "Id": "TARGET_APPLY",
        "Severity": "LOGGER_SEVERITY_DEFAULT"
      },
      {
        "Id": "TASK_MANAGER",
        "Severity": "LOGGER_SEVERITY_DEFAULT"
      }
    ],
    "CloudWatchLogGroup": null,
    "CloudWatchLogStream": null
  },
  "ControlTablesSettings": {
    "historyTimeslotInMinutes": 5,
    "ControlSchema": "",
    "HistoryTimeslotInMinutes": 5,
    "HistoryTableEnabled": false,
    "SuspendedTablesTableEnabled": false,
    "StatusTableEnabled": false
  },
  "StreamBufferSettings": {
    "StreamBufferCount": 3,
    "StreamBufferSizeInMB": 8,
    "CtrlStreamBufferSizeInMB": 5
  },
  "ChangeProcessingDdlHandlingPolicy": {
    "HandleSourceTableDropped": true,
    "HandleSourceTableTruncated": true,
    "HandleSourceTableAltered": true
  },
  "ErrorBehavior": {
    "DataErrorPolicy": "LOG_ERROR",
    "DataTruncationErrorPolicy": "LOG_ERROR",
    "DataErrorEscalationPolicy": "SUSPEND_TABLE",
    "DataErrorEscalationCount": 0,
    "TableErrorPolicy": "SUSPEND_TABLE",
    "TableErrorEscalationPolicy": "STOP_TASK",
    "TableErrorEscalationCount": 0,
    "RecoverableErrorCount": -1,
    "RecoverableErrorInterval": 5,
    "RecoverableErrorThrottling": true,
    "RecoverableErrorThrottlingMax": 1800,
    "ApplyErrorDeletePolicy": "IGNORE_RECORD",
    "ApplyErrorInsertPolicy": "LOG_ERROR",
    "ApplyErrorUpdatePolicy": "LOG_ERROR",
    "ApplyErrorEscalationPolicy": "LOG_ERROR",
    "ApplyErrorEscalationCount": 0,
    "ApplyErrorFailOnTruncationDdl": false,
    "FullLoadIgnoreConflicts": true,
    "FailOnTransactionConsistencyBreached": false,
    "FailOnNoTablesCaptured": false
  },
  "ChangeProcessingTuning": {
    "BatchApplyPreserveTransaction": true,
    "BatchApplyTimeoutMin": 1,
    "BatchApplyTimeoutMax": 30,
    "BatchApplyMemoryLimit": 500,
    "BatchSplitSize": 0,
    "MinTransactionSize": 1000,
    "CommitTimeout": 1,
    "MemoryLimitTotal": 1024,
    "MemoryKeepTime": 60,
    "StatementCacheSize": 50
  },
  "ValidationSettings": {
    "EnableValidation": true,
    "ValidationMode": "ROW_LEVEL",
    "ThreadCount": 5,
    "PartitionSize": 10000,
    "FailureMaxCount": 10000,
    "RecordFailureDelayInMinutes": 5,
    "RecordSuspendDelayInMinutes": 30,
    "MaxKeyColumnSize": 8096,
    "TableFailureMaxCount": 1000,
    "ValidationOnly": false,
    "HandleCollationDiff": false,
    "RecordFailureDelayLimitInMinutes": 0
  },
  "PostProcessingRules": null
}


Everything should be OK, right? Let's check, for example, some random table attributes on both databases.
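A quick way to do this comparison is to dump the DDL that each instance reports for the same table and diff the two; this is just a sketch, with placeholder hostnames, user and schema name:

# Dump the DDL of the same table from both instances and diff the two files
# (hostnames, user and schema name are placeholders)
mysql -h prod-like.xxxx.eu-central-1.rds.amazonaws.com -u admin -p \
  -e "SHOW CREATE TABLE my_schema.portal_page\G" > prod_like_ddl.txt

mysql -h prod-restored.xxxx.eu-central-1.rds.amazonaws.com -u admin -p \
  -e "SHOW CREATE TABLE my_schema.portal_page\G" > prod_restored_ddl.txt

diff prod_like_ddl.txt prod_restored_ddl.txt

This is what the two instances reported: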

  • PROD-LIKE DB (the one we snapshotted):

    CREATE TABLE portal_page (
      id int(10) NOT NULL AUTO_INCREMENT,
      portal_id int(10) NOT NULL,
      page_id int(10) NOT NULL,
      section_id int(10) DEFAULT NULL,
      layout varchar(10) DEFAULT NULL,
      route_alias varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
      active enum('1','0') NOT NULL DEFAULT '1',
      PRIMARY KEY (id),
      UNIQUE KEY portal_page_section (portal_id,page_id,section_id),
      KEY portal_id (portal_id),
      KEY page_id (page_id),
      CONSTRAINT FKpp_page_id FOREIGN KEY (page_id) REFERENCES default_pages (id),
      CONSTRAINT FKpp_portal_id FOREIGN KEY (portal_id) REFERENCES portal (id)
    ) ENGINE=InnoDB AUTO_INCREMENT=346 DEFAULT CHARSET=utf8

  • PROD-RESTORED, same table on the replicated RDS database:

    CREATE TABLE portal_page (
      id int(11) NOT NULL,
      portal_id int(11) NOT NULL,
      page_id int(11) NOT NULL,
      section_id int(11) DEFAULT NULL,
      layout varchar(10) DEFAULT NULL,
      route_alias varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
      active enum('1','0') NOT NULL,
      PRIMARY KEY (id)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8

<a href="https://medium.com/media/7ffedd984d9bb9e803086cc8a94c5c80/href">https://medium.com/media/7ffedd984d9bb9e803086cc8a94c5c80/href</a>

So we ended up with all the data in the database and no errors shown, but with the table structure completely altered (or rather, not replicated at all).

Well, maybe it was my fault. Let's open a support ticket with AWS.


In a couple of days I received an answer (and trust me, the AWS support guys are amazing and very technically skilled)… Basically, according to them: "DMS takes a minimalist approach and creates only those objects required to efficiently migrate the data. In other words, AWS DMS creates tables, primary keys, and in some cases unique indexes, but doesn't create any other objects that are not required to efficiently migrate the data from the source. For example, it doesn't create secondary indexes, non primary key constraints, or data defaults."

Well, Houston, we have a problem…


We absolutely needed an identical data structure on the new database, so what now? Since we were migrating from MariaDB to another MariaDB, the AWS support guy suggested using mysqldump to create an identical replica (…)




But no, wait… we need live data replicated to avoid data losses. So, no good old mysqldump for us. But hey, wait a moment… it's only a data structure issue, right? So why not dump ONLY the data structure from the old database, restore that structure dump on the new database, and then start the replication task again? This way we should be able to keep both structure and data consistency…! (?)


The only way to know is to try it. A good reference for the mysqldump command can be found here: https://dev.mysql.com/doc/refman/8.0/en/mysqldump.html
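This is roughly what the structure-only dump and restore looks like; hostnames, user and schema names are placeholders, and the exact options depend on your setup:

# Dump ONLY the table structure (no rows) from the old production database
# (hostnames, user and schema names are placeholders)
mysqldump -h old-db.example.com -u admin -p \
  --no-data --routines --triggers \
  --databases my_schema_1 my_schema_2 \
  > structure_only.sql

# Restore that structure onto the new RDS instance
mysql -h new-db.example.com -u admin -p < structure_only.sql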

Once done, I restarted my replication task with the "Truncate" option enabled (this is the most important part, because it's the only way to replicate your data while keeping the data structure as it is. If you choose "Drop tables on target", for example, you'll end up in the same situation we hit in our first test, i.e. no data structure consistency during and after the replication).
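In the task settings, the console's "Truncate" option corresponds to TargetTablePrepMode set to TRUNCATE_BEFORE_LOAD (instead of the DROP_AND_CREATE you can see in the settings above). As a sketch, with a placeholder task ARN, you could apply it from the CLI along these lines:

# Switch the target table preparation mode to TRUNCATE_BEFORE_LOAD
# (task ARN is a placeholder; the task must be stopped before it can be modified,
#  and in practice you may need to pass the full task-settings document, not just this fragment)
aws dms modify-replication-task \
  --replication-task-arn arn:aws:dms:eu-central-1:123456789012:task:EXAMPLE \
  --replication-task-settings '{"FullLoadSettings":{"TargetTablePrepMode":"TRUNCATE_BEFORE_LOAD"}}'

# Restart the task once the structure dump has been restored on the target
aws dms start-replication-task \
  --replication-task-arn arn:aws:dms:eu-central-1:123456789012:task:EXAMPLE \
  --start-replication-task-type reload-target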

And here we are: we now have an ongoing replication task that replicates every single change on the main database to the new one.
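If you want to keep an eye on it, DMS exposes per-table statistics (rows loaded, inserts/updates/deletes applied, validation state) that you can poll; the task ARN below is a placeholder:

# Check per-table progress and validation state of the ongoing replication
# (task ARN is a placeholder)
aws dms describe-table-statistics \
  --replication-task-arn arn:aws:dms:eu-central-1:123456789012:task:EXAMPLE \
  --query 'TableStatistics[].{Table:TableName,State:TableState,Inserts:Inserts,Updates:Updates,Deletes:Deletes,Validation:ValidationState}' \
  --output table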

Have fun!

NOTE: All product names, logos, and brands are property of their respective owners.