Fines related to sensitive data exposure are growing. For instance, major GDPR violations can cost companies up to 4% of their annual global turnover, while gross HIPAA violations can result in imprisonment.
Your production environment might be thoroughly protected. But what about testing initiatives and sales demos? Are you confident in the third-party contractors that have access to your sensitive data? Will they do their best to protect it?
To ensure compliance and data safety, companies are turning to data management service providers. If you are also interested, check out this guide, which answers three important questions: what data masking is, why your business needs it, and how to implement it.
It also presents a detailed data masking example from our portfolio. After reading the article, you will have enough information to negotiate with data masking vendors.
So, what is data masking?
Data masking is defined as building a realistic and structurally similar but nonetheless fake version of organizational data. It alters the original data values using manipulation techniques while maintaining the same format and delivers a new version that can’t be reverse-engineered or tracked back to the authentic values. Here is an example of masked data:
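The sketch below is a purely hypothetical illustration: every value in it is invented, and the masked fields simply preserve the format of the source fields.

```python
# Purely illustrative, invented values: the masked record keeps the
# structure and format of the original, but cannot be traced back to it.

original_record = {
    "name": "Olivia Baker",
    "email": "olivia.baker@example.com",
    "ssn": "387-65-4321",
    "credit_card": "4556 7375 8689 9855",
}

masked_record = {
    "name": "Emma Collins",               # realistic substitute name
    "email": "emma.collins@example.com",  # same pattern, different identity
    "ssn": "512-48-9073",                 # same 3-2-4 digit format
    "credit_card": "4929 1842 5067 3316", # same 4x4 digit grouping
}
```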
Do you need to apply data masking algorithms to all the data stored within your company? Most likely not. Here are the data types that you definitely need to protect:
Data masking protects sensitive information utilized for non-production purposes. So, as long as you use any of the sensitive data types presented in the previous section in training, testing, sales demos, or any other non-production activities, you need to apply data masking techniques. This makes sense, as non-production environments are normally less protected and introduce more security vulnerabilities.
Moreover, if there is a need to share your data with third-party vendors and partners, you can grant access to masked data instead of forcing the other party to comply with your extensive security measures to access the original database. Statistics show that 19% of data breaches take place due to compromises on the business partner’s side.
Additionally, data masking can provide the following advantages:
There are five main types of data masking that aim to cover different organizational needs.
Static data masking implies creating a backup of the original data and keeping it safe in a separate environment for production use cases. It then disguises the copy with fake but realistic values and makes it available for non-production purposes (e.g., testing, research), as well as for sharing with contractors.
Dynamic data masking aims to modify an excerpt of the original data at runtime, when the database receives a query. So, when a user who is not authorized to view sensitive information queries the production database, the response is masked on the fly without changing the original values. You can implement it via a database proxy, as illustrated below. This data masking type is normally used in read-only settings to prevent overwriting production data.
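As a rough sketch of that proxy idea in Python, the function below masks sensitive columns in query results for non-privileged users. The role name, the column list, and the masking rules are assumptions made for illustration, not a prescription for how the proxy must be built.

```python
# Minimal sketch of dynamic (on-the-fly) masking in a proxy layer.
# Role names, column list, and masking rules are hypothetical.

SENSITIVE_COLUMNS = {"ssn", "email", "salary"}

def mask_value(column, value):
    """Return a masked representation without changing the stored value."""
    if column == "ssn":
        return "***-**-" + str(value)[-4:]      # keep only the last 4 digits
    if column == "email":
        name, _, domain = str(value).partition("@")
        return name[0] + "***@" + domain        # keep first letter and domain
    return "MASKED"

def fetch_rows(cursor, query, user_role):
    """Run a read-only query; mask sensitive columns for non-privileged roles."""
    cursor.execute(query)
    columns = [desc[0] for desc in cursor.description]
    for row in cursor.fetchall():
        record = dict(zip(columns, row))
        if user_role != "privileged":
            for column in record:
                if column in SENSITIVE_COLUMNS:
                    record[column] = mask_value(column, record[column])
        yield record
```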
On-the-fly data masking disguises data while it is being transferred from one environment to another, such as from production to testing. It is popular with organizations that continuously deploy software and perform large data integrations.
Deterministic data masking replaces a given value with the same substitute value wherever it appears. For instance, if you want to replace “Olivia” with “Emma”, you have to do it in all the associated tables, not only in the table you are currently masking.
Statistical data obfuscation is used to reveal information about patterns and trends in a dataset without sharing any details on the actual people represented there.
Below you can find seven of the most popular data masking techniques. You can combine them to cover the various needs of your business; a short code sketch after the list shows a few of them in action.
Shuffling. You can shuffle and reassign data values within the same table. For example, if you shuffle the employee name column, you will get the real personal details of one employee matched to another.
Scrambling. Rearranges the characters and digits of a data field in random order. If an employee’s original ID is 97489376, after applying scrambling, you will receive something like 37798649. This technique is restricted to specific data types.
Nulling out. This is a simple masking strategy where a data field is assigned a null value. This method has limited usage as it tends to break the application’s logic.
Substitution. Original data is substituted with fake but realistic values, meaning that the new value still needs to satisfy all domain constraints. For instance, you could substitute someone’s credit card number with another number that conforms to the rules enforced by the issuing bank.
Number variance. This is mostly applicable to financial information. One example is masking original salaries by applying +/-20% variance.
Date aging. This method increases or decreases a date by a specified range, while ensuring that the resulting date still satisfies the application’s constraints. For instance, you can age all contracts by 50 days.
Averaging. Involves replacing all the original data values with an average value. For instance, you can replace every individual salary field with the average of the salaries in that table.
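To make the techniques above more tangible, here is a minimal Python sketch combining shuffling, scrambling, number variance, and date aging on a handful of invented records. The column names, the 20% spread, and the 50-day range are taken from the examples above as assumptions; a real implementation would also need to enforce domain constraints and repeatability.

```python
import random
from datetime import date, timedelta

# Hypothetical employee records used only to illustrate the techniques above.
employees = [
    {"name": "Olivia", "employee_id": "97489376", "salary": 52000, "hired": date(2020, 3, 14)},
    {"name": "Liam",   "employee_id": "18203947", "salary": 61000, "hired": date(2019, 7, 2)},
    {"name": "Ava",    "employee_id": "55502918", "salary": 47000, "hired": date(2021, 1, 25)},
]

# Shuffling: reassign the values of one column across rows.
names = [e["name"] for e in employees]
random.shuffle(names)
for e, shuffled_name in zip(employees, names):
    e["name"] = shuffled_name

# Scrambling: rearrange the characters of a field in random order.
for e in employees:
    e["employee_id"] = "".join(random.sample(e["employee_id"], len(e["employee_id"])))

# Number variance: apply a +/-20% variance to salaries.
for e in employees:
    e["salary"] = round(e["salary"] * random.uniform(0.8, 1.2))

# Date aging: shift hire dates by up to 50 days in either direction.
for e in employees:
    e["hired"] += timedelta(days=random.randint(-50, 50))

print(employees)
```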
Here is your 5-step data masking implementation plan.
Before you start, you will need to identify which aspects you will cover. Here is a list of typical questions that your data team can study before proceeding with the masking initiatives:
During this step, you need to identify which technique, or which combination of techniques and tools, is the best fit for the task at hand.
First of all, you need to identify which data types you need to mask, for instance, names, dates, or financial data, as different types require dedicated data masking algorithms. Based on that, you and your vendor can choose which open-source libraries can be reused to produce the best-suited data masking solution. We advise turning to a software vendor, as they will help you customize the solution and integrate it smoothly into workflows across the whole company without interrupting any business processes. It is also possible to build something from scratch to cover the company’s unique needs.
There are ready-made data masking tools that you can purchase and deploy yourself, such as Oracle Data Masking, IRI FieldShield, DATPROF, and many more. You can opt for this strategy if you manage all your data by yourself, you understand how different data flows work, and you have an IT department that can help integrate this new data masking solution into the existing processes without hindering productivity.
The security of your sensitive data largely depends on the security of the selected fake data-generating algorithms. Therefore, only authorized personnel should know which data masking algorithms are deployed, since anyone with this knowledge could reverse-engineer the masked data back to the original dataset. It’s a good practice to apply separation of duties: for instance, the security department selects the best-suited algorithms and tools, while data owners maintain the settings applied in masking their data.
Referential integrity means that each data type is masked in the same way everywhere it appears, so that relationships between tables remain valid after masking. This can be a challenge if your organization is rather large, with several business functions and product lines, because different teams are likely to use different data masking algorithms for their tasks.
To overcome this issue, identify all the tables that contain referential constraints and determine the order in which you will mask the data, as parent tables should be masked before the corresponding child tables. After completing the masking process, do not forget to check whether referential integrity was maintained.
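As an illustration of masking parent tables before child tables while keeping foreign keys consistent, here is a minimal Python sketch based on deterministic substitution. The table layouts, the secret key, and the pseudonymize helper are hypothetical assumptions, not the only way to achieve this.

```python
import hmac, hashlib

SECRET_KEY = b"rotate-me-regularly"   # hypothetical key; restrict it to authorized staff

def pseudonymize(value: str) -> str:
    """Deterministically map a value to a fake ID: same input -> same output."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return "CUST-" + digest[:8]

# Hypothetical parent and child tables linked by customer_id.
customers = [{"customer_id": "C001", "name": "Olivia"}, {"customer_id": "C002", "name": "Liam"}]
orders    = [{"order_id": "O-1", "customer_id": "C001"}, {"order_id": "O-2", "customer_id": "C001"}]

# Mask the parent table first...
for row in customers:
    row["customer_id"] = pseudonymize(row["customer_id"])

# ...then the child table, using the same deterministic function,
# so every foreign key still points to the right (masked) parent row.
for row in orders:
    row["customer_id"] = pseudonymize(row["customer_id"])

# Check that referential integrity was maintained after masking.
parent_ids = {row["customer_id"] for row in customers}
assert all(row["customer_id"] in parent_ids for row in orders)
```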
Any adjustment to a particular project, or just general changes within your organization, can result in modifying the sensitive data and creating new data sources, posing the need to repeat the masking process.
There are instances where data masking can be a one-time effort, such as preparing a specialized training dataset that will be used for a few months on a small project. But if you want a solution that will serve you for a prolonged time, the masked data will eventually become outdated as the source data changes. So, invest time and effort in formalizing the masking process to make it fast, repeatable, and as automated as possible.
Develop a set of masking rules, such as which data has to be masked. Identify any exceptions or special cases that you can foresee at this point. Acquire/build scripts and automated tools to apply these masking rules in a consistent manner.
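One possible way to formalize such rules, assuming a Python-based toolchain, is to express them as declarative data that your masking scripts read on every run. The tables, columns, techniques, and parameters below are invented for illustration.

```python
# Hypothetical masking rules expressed as data, so that scripts can apply
# them consistently across environments and repeated runs.
MASKING_RULES = [
    {"table": "employees", "column": "name",      "technique": "substitution"},
    {"table": "employees", "column": "salary",    "technique": "number_variance", "params": {"spread": 0.2}},
    {"table": "contracts", "column": "signed_on", "technique": "date_aging",      "params": {"max_days": 50}},
]

# Exceptions and special cases foreseen up front, e.g. columns that must stay intact.
EXCEPTIONS = [
    {"table": "employees", "column": "department", "reason": "needed as-is for test reporting"},
]
```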
Whether you work with a software vendor of your choice or opt for a ready-made solution, the final product needs to follow these data-masking best practices:
Here is a list of challenges that you might face during implementation.
An international healthcare organization was looking to obscure sensitive personally identifiable information (PII) presented in multiple formats and residing in both production and non-production environments. They wanted to build an ML-powered data masking software that can discover and obfuscate PII while complying with the company’s internal policies, GDPR, and other data privacy regulations.
Our team immediately noticed the following challenges:
Due to this large variety, our team wanted to come up with a set of policies and processes that would guide different dataset owners on how to mask their data and would serve as the basis for our solution. For instance, a dataset owner could come up with a list of data points they want to obfuscate, whether once or continuously, and the solution, guided by these policies, would study the data, select appropriate obfuscation techniques, and apply them.
We approached this project by surveying the landscape through the following questions:
After answering these questions, we suggested providing data masking as a service, mainly because the client had too many data sources to begin with, and it might have taken years to cover them all.
In the end, we delivered data masking services with the help of a custom ML-driven tool that can semi-automatically perform data masking in four steps:
This data masking solution helped the client comply with GDPR, dramatically reduced the time needed to form non-production environments, and lowered the costs of transferring data from production to sandbox.
Your efforts do not stop when confidential data is masked. You still need to maintain it over time. Here are the steps that will help you in this initiative:
Data masking will protect your data in non-production environments, enable you to share information with third-party contractors, and help you with compliance. You can purchase and deploy a data obfuscation solution yourself if you have an IT department and control your data flows. However, keep in mind that improper data masking implementation can lead to rather unpleasant consequences. Here are some of the most prominent ones:
Hence, if a company isn’t confident in its abilities to execute data obfuscation initiatives, it’s best to contact an external vendor who will help select the right data masking techniques and integrate the final product into your workflows with minimal interruptions.
Stay protected!
Considering implementing a data masking solution? Get in touch! We will help you prioritize your data, build a compliant obfuscation tool, and deploy it without interrupting your business processes.