Using AWS Macie To Classify Databases

by Leandro Mantovani, June 16th, 2021

Too Long; Didn't Read

Data breaches are the number one threat today, and every company should be doing something to protect its most sensitive data. The big problem is knowing where data is stored and what kind of data you have. The whole process is designed to take snapshots of databases, store them in S3, and trigger Macie jobs that look for sensitive data. The scans run once a week, and the findings are sent to Security Hub so they can be managed there. The same process works with any database backup in SQL format (and other formats).

Data breaches are the number one threat today, and every company should be doing something to protect its most sensitive data. That sounds nice on paper, but in practice it's a pain in the ass.

Setting up security policies to protect data isn't much trouble; I think it's the easiest part of the problem.

The big problem is knowing where data is stored and what kind of data you have. There are different levels of risk depending on how sensitive the data is; holding customer emails is not the same as holding full credit card details.

You can't protect what you don't know

It would be like a bank trying to protect its money without knowing where it's stored.

If I don't know where my most sensitive data is, I can't protect it. No matter how hard I try, I'll fail.

Goal

I'll show how to use Macie to scan any database, including on-premises ones, to discover sensitive data in its tables.

I'll do it by taking snapshots or full backups from databases and launching Macie scanning jobs automatically.

When scanning is done, I'll get a full summary of what kind of sensitive data is stored.

How does Macie work?

The service is designed to scan S3 buckets, and to identify sensitive data in objects stored there. Currently, it doesn't have any integration with databases.

Macie has predefined patterns to find these kinds of data:

Credentials
Financial Information
Personal Health Information
Personally Identifiable Information

If I want, I can extend the discovery scope with my own pattern matching. I just need to write custom regular expressions and provide them to Macie. This way, I can define whatever pattern I consider sensitive for my company, like internal employee IDs.
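As a minimal sketch of what such a custom pattern could look like: the employee ID format (`EMP-` plus six digits), the identifier name, and the keywords below are all hypothetical. `create_custom_data_identifier` is the boto3 `macie2` call for registering a pattern; the client is passed in so the pattern can be tried locally with Python's `re` module before touching AWS.

```python
import re

# Hypothetical internal employee ID format: "EMP-" followed by six digits.
EMPLOYEE_ID_REGEX = r"EMP-[0-9]{6}"


def build_identifier_request(name, regex, keywords=None):
    """Build the keyword arguments for macie2.create_custom_data_identifier."""
    re.compile(regex)  # fail early if the pattern has a syntax error
    request = {
        "name": name,
        "regex": regex,
        "description": "Custom pattern for " + name,
    }
    if keywords:
        # Macie only reports a match when one of the keywords appears
        # within maximumMatchDistance characters of it.
        request["keywords"] = list(keywords)
        request["maximumMatchDistance"] = 50
    return request


def register_identifier(macie_client, request):
    """Send the request to Macie and return the new identifier's ID."""
    response = macie_client.create_custom_data_identifier(**request)
    return response["customDataIdentifierId"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   macie = boto3.client("macie2")
#   request = build_identifier_request(
#       "employee-id", EMPLOYEE_ID_REGEX, ["employee", "staff id"])
#   identifier_id = register_identifier(macie, request)
```

The returned ID can later be passed to a classification job so the scan looks for this pattern in addition to the predefined ones.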

Solution

It seems complicated, but it's much simpler than it looks. The whole process is designed to take snapshots of databases, store them in S3, and trigger Macie jobs that look for sensitive data. The scans run once a week, and the findings are sent to Security Hub so they can be managed there. Finally, once the job has completed, the previously created snapshots are deleted.

I've created several Lambda functions to automate the whole process. Snapshot creation and export are usually manual tasks, and so is Macie job creation; AWS Lambda let me automate all of them.

This design is for an RDS database, but the same process works with any database backup in SQL format (and other formats). The advantage of RDS snapshots is that when a snapshot is exported to S3, Amazon writes the data in Apache Parquet format, which Macie is able to read.

The first Lambda function (Start Backup Function) has a time-based trigger. I've configured it to run once a week. It's in charge of initiating snapshot creation for RDS. The logic is pretty simple: the only task of this Lambda is to start the snapshot.
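A sketch of how this first function might look, assuming a single RDS instance. The instance name `my-database` and the `macie-scan` snapshot prefix are made up for illustration; `create_db_snapshot` is the boto3 RDS call that starts a manual snapshot and returns before the snapshot is finished.

```python
from datetime import datetime, timezone

DB_INSTANCE_ID = "my-database"   # hypothetical RDS instance name
SNAPSHOT_PREFIX = "macie-scan"   # hypothetical prefix to recognize these snapshots


def snapshot_name(db_instance_id, now):
    """Build a unique snapshot identifier (letters, digits and hyphens only)."""
    return f"{SNAPSHOT_PREFIX}-{db_instance_id}-{now:%Y-%m-%d-%H-%M}"


def start_backup(rds_client, db_instance_id, now=None):
    """Ask RDS to start a manual snapshot and return its identifier."""
    now = now or datetime.now(timezone.utc)
    name = snapshot_name(db_instance_id, now)
    rds_client.create_db_snapshot(
        DBSnapshotIdentifier=name,
        DBInstanceIdentifier=db_instance_id,
    )
    return name


def lambda_handler(event, context):
    # boto3 is bundled with the Lambda Python runtime.
    import boto3
    return {"snapshot": start_backup(boto3.client("rds"), DB_INSTANCE_ID)}
```

The weekly schedule itself would live outside the code, for example in an EventBridge rule with a schedule expression such as `rate(7 days)`.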

The second Lambda function (export snapshot S3) is also time-based, but I've configured it to run every 15 minutes. Its job is to check whether the snapshot has completed and, if so, export it to the S3 bucket. RDS snapshots take a while to finish, so the function runs at short intervals just to check whether any snapshots are pending export.
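The polling logic could be sketched roughly as follows. The `macie-scan` prefix, the bucket, the role, and the KMS key are placeholders; `describe_db_snapshots`, `describe_export_tasks`, and `start_export_task` are the boto3 RDS calls involved. Pagination is ignored to keep the example short, and export task identifiers are limited to 60 characters, so long snapshot names would need shortening.

```python
SNAPSHOT_PREFIX = "macie-scan"  # hypothetical prefix used when creating snapshots


def snapshots_ready_for_export(snapshots, exported_arns):
    """Keep finished snapshots of ours that have no export task yet."""
    return [
        snap for snap in snapshots
        if snap["Status"] == "available"
        and snap["DBSnapshotIdentifier"].startswith(SNAPSHOT_PREFIX)
        and snap["DBSnapshotArn"] not in exported_arns
    ]


def export_pending_snapshots(rds_client, db_instance_id, bucket, role_arn, kms_key_id):
    """Start an S3 export for every completed snapshot still pending export."""
    snapshots = rds_client.describe_db_snapshots(
        DBInstanceIdentifier=db_instance_id,
        SnapshotType="manual",
    )["DBSnapshots"]
    tasks = rds_client.describe_export_tasks()["ExportTasks"]
    exported_arns = {task["SourceArn"] for task in tasks}

    started = []
    for snap in snapshots_ready_for_export(snapshots, exported_arns):
        rds_client.start_export_task(
            ExportTaskIdentifier=f"export-{snap['DBSnapshotIdentifier']}",
            SourceArn=snap["DBSnapshotArn"],
            S3BucketName=bucket,
            IamRoleArn=role_arn,    # role allowed to write to the bucket
            KmsKeyId=kms_key_id,    # exports are always encrypted with a KMS key
        )
        started.append(snap["DBSnapshotIdentifier"])
    return started
```

Checking existing export tasks before starting a new one keeps the 15-minute schedule from exporting the same snapshot twice.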

The last Lambda function (Create Macie Job) is triggered by every object that is written to the bucket dedicated to Macie scans. Its job is to configure, create, and start the Macie job that scans the bucket.
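A possible shape for this function, assuming the standard S3 event notification payload and a one-time job over the whole bucket. The job naming scheme and the account ID argument are illustrative; `create_classification_job` is the boto3 `macie2` call, and the job starts as soon as it is created.

```python
import uuid


def buckets_from_s3_event(event):
    """Collect the distinct bucket names mentioned in an S3 event notification."""
    return sorted({rec["s3"]["bucket"]["name"] for rec in event.get("Records", [])})


def build_job_request(account_id, bucket, custom_identifier_ids=()):
    """Build the arguments for a one-time Macie job over a single bucket."""
    request = {
        "clientToken": str(uuid.uuid4()),  # idempotency token required by the API
        "jobType": "ONE_TIME",
        "name": f"db-scan-{bucket}-{uuid.uuid4().hex[:8]}",
        "s3JobDefinition": {
            "bucketDefinitions": [{"accountId": account_id, "buckets": [bucket]}]
        },
    }
    if custom_identifier_ids:
        request["customDataIdentifierIds"] = list(custom_identifier_ids)
    return request


def create_macie_jobs(macie_client, account_id, event, custom_identifier_ids=()):
    """Create one classification job per bucket in the event; return the job IDs."""
    job_ids = []
    for bucket in buckets_from_s3_event(event):
        request = build_job_request(account_id, bucket, custom_identifier_ids)
        response = macie_client.create_classification_job(**request)
        job_ids.append(response["jobId"])
    return job_ids
```

One thing to keep in mind with this trigger: a snapshot export writes many objects, so a real function would probably need some way to avoid creating a new job for each of them.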

Macie is configured to send every sensitive data finding to Security Hub automatically. There I can manage the findings alongside other security events and aggregate them all together, to get a better view of my landscape.
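By default, Macie publishes only policy findings to Security Hub, so sensitive data findings have to be switched on explicitly. A minimal sketch of that setting, using the boto3 `macie2` call `put_findings_publication_configuration` (the function name and the choice to keep policy findings enabled are my own):

```python
def publish_findings_to_security_hub(macie_client):
    """Enable publication of both finding types to Security Hub."""
    config = {
        "publishClassificationFindings": True,  # sensitive data findings
        "publishPolicyFindings": True,          # bucket policy findings
    }
    macie_client.put_findings_publication_configuration(
        securityHubConfiguration=config
    )
    return config


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   publish_findings_to_security_hub(boto3.client("macie2"))
```

This is an account-level setting, so it only needs to run once rather than on every scan.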

To delete the exported snapshots, I've configured a lifecycle rule on the S3 bucket that deletes objects after 1 day.
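The lifecycle rule can also be created from code. This sketch uses the boto3 S3 call `put_bucket_lifecycle_configuration`; the rule ID and function names are invented for the example. Note that this call replaces any lifecycle configuration the bucket already has.

```python
def one_day_expiration_rule(prefix=""):
    """Lifecycle configuration that expires matching objects after one day."""
    return {
        "Rules": [
            {
                "ID": "expire-exported-snapshots",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},       # empty prefix = whole bucket
                "Expiration": {"Days": 1},
            }
        ]
    }


def apply_lifecycle(s3_client, bucket, prefix=""):
    """Attach the one-day expiration rule to the scan bucket."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=one_day_expiration_rule(prefix),
    )


# Usage (requires boto3 and AWS credentials; the bucket name is a placeholder):
#   import boto3
#   apply_lifecycle(boto3.client("s3"), "my-macie-scan-bucket")
```

The rule only removes the exported objects in S3; the one-day window has to be long enough for the Macie job to finish reading them.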

That's not all, folks

Would you like to see a PoC? I'm still finishing the code, because so far I've done everything manually. Follow my channel; I'll post updates about new posts there!

I hope you've learned something new from this post. If so, I encourage you to join my fantastic Telegram channel about Cloud and Security.