The Essential Data Cleansing Checklist
Data Scientist for Food Tech. Based in South Korea
After some time working as a data scientist in my startup, I came to a point where I needed to ask for external help with my project.
I hired an external team to build it.
That meant I had to turn over a significant portion of my raw data to get the project running. In the process, I learned that there are some things you have to do to keep the price low and speed up the build while you work with a contractor.
Following these tips will greatly help your contractor and may earn you some respect as a fellow data scientist.
Here are the main lessons I took away from this data project:
I. Help them understand your data and things will go much more smoothly
CAUTION: This applies only once your legal team has gone over the details of the N.D.A. between the two companies.
- Create an E.R.D. (entity-relationship diagram): this helps the contractor understand what kind of data or system they are tasked to build. Think of it as an overview of the project. To be effective, the diagram must show the vital primary keys (P.R.K.) and foreign keys (F.R.K.).
- Create an extremely detailed schema: this helps you and your contractor answer the following questions: what kind of data each column holds (data type), what must not be missing (NOT NULL), what data links to what (P.R.K., F.R.K.), and how the data flows through the system (core tables, join tables).
- Additional notes: this may not be common, but a system may have dual P.R.K.s (for example, a composite key) or unique default values for each column. These notes may also describe how the data is created, such as special indexes.
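To make the schema points above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables and columns (product, store, store_product) are hypothetical names I made up for illustration, not the schema from my project. It shows data types, NOT NULL columns, a documented default value, foreign keys, and a join table with a two-column primary key.

```python
import sqlite3

# Hypothetical food-tech tables; every name here is illustrative only.
SCHEMA = """
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,          -- primary key
    name       TEXT    NOT NULL,             -- must not be missing
    unit       TEXT    NOT NULL DEFAULT 'g'  -- documented default value
);
CREATE TABLE store (
    store_id INTEGER PRIMARY KEY,
    country  TEXT NOT NULL
);
-- Join table with a composite (two-column) primary key
CREATE TABLE store_product (
    store_id   INTEGER NOT NULL REFERENCES store(store_id),
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    price      REAL,                          -- nullable on purpose
    PRIMARY KEY (store_id, product_id)
);
"""


def build_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite does not enforce foreign keys unless this is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def describe(conn, table):
    """Return (column, type, not_null, default, pk_position) for each column."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [(r[1], r[2], bool(r[3]), r[4], r[5]) for r in rows]


if __name__ == "__main__":
    conn = build_db()
    for column in describe(conn, "store_product"):
        print(column)
```

The output of describe() is a quick way to hand a contractor a column-by-column summary next to the diagram; a pk_position above zero marks a column that is part of the primary key.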
II. Basic Information for Dataset Profiling and E.D.A. (Subjective Opinion)
- D.B. encoding format (UTF-8 (up to 4 bytes per character), ASCII, etc.)
- Check for special characters (return and escape characters usually cause errors when uploading in text format)
- Capitalization and synonym uniformity, e.g., Korea, Korea (south), R.O.K.
- Missing values, NaN (Is the value missing for a real reason, or is it a random error?)
- P.R.K. ratio check (Does the number of index IDs match the number of actual entries? If not, why? Do we have a way to fix it? [Clue: a multi-key model could be an option, or maybe serialization.])
- Range distribution (Are there outliers that don’t match the set?)
- Units (There may be more than one unit in the data, e.g. grams, IU, ml. Check whether you need them to be unified.)
- Naming errors (Certain downloaded sets have encoding errors in their names that may need thorough cleaning, much like the special character errors.)
- Table description (Column names don’t tell the whole story; documentation is important.)
- Multi-platform check (Does the data open well in Excel, a D.B., and Python? This saves fellow data scientists the time of figuring out the encoding or manually cleaning a new set before they can use it.)
- Remove unwanted observations (duplicates, irrelevant records). Domain knowledge is needed for this.
- Structural errors (Check that hierarchies, groups, and classes fit the data model you are currently creating. For example, ages 10–20 may be considered teenagers in certain situations, while in other cases ages 18–20 may be grouped into the adult category, depending on the context in which the model is being implemented.)
- Documentation of how missing data was handled (dropping or imputing the missing portions)
- Flag & fill data (replacement terms and other special systematic labeling done by the set creator)
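Several of the checks above can be automated before the handover. The following is a minimal sketch in plain Python, assuming the data has been loaded as a list of dicts. The synonym map, the missing-value tokens, the column names, and the fixed low/high bounds are all assumptions I chose for illustration; a real outlier check would need domain knowledge rather than hard-coded limits.

```python
import math
import re
from collections import Counter

# Hypothetical synonym map; extend it with your own domain knowledge.
COUNTRY_SYNONYMS = {
    "korea": "South Korea",
    "korea (south)": "South Korea",
    "r.o.k.": "South Korea",
}
# Assumed spellings of "missing"; adjust to match your data source.
MISSING_TOKENS = {"", "nan", "null", "n/a"}
# ASCII control characters, including \r, \n, \t and ESC.
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def is_missing(value):
    """True for None, float NaN, and the assumed missing-value tokens."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip().lower() in MISSING_TOKENS


def normalize_country(value):
    """Map known synonyms to one spelling; leave unknown values as they are."""
    text = str(value).strip()
    return COUNTRY_SYNONYMS.get(text.lower(), text)


def profile(rows, key, numeric_col, low, high):
    """Summarise basic quality issues in a list of dict rows."""
    keys = [r[key] for r in rows]
    report = {
        "rows": len(rows),
        # 1.0 means every row has its own key value.
        "unique_key_ratio": len(set(keys)) / len(rows) if rows else 1.0,
        "duplicate_keys": [k for k, n in Counter(keys).items() if n > 1],
        "missing": Counter(),
        "control_chars": [],
        "out_of_range": [],
    }
    for r in rows:
        for col, val in r.items():
            if is_missing(val):
                report["missing"][col] += 1
            elif isinstance(val, str) and CONTROL_CHARS.search(val):
                report["control_chars"].append((r[key], col))
        val = r.get(numeric_col)
        if not is_missing(val) and not (low <= float(val) <= high):
            report["out_of_range"].append(r[key])
    return report


if __name__ == "__main__":
    sample = [
        {"id": 1, "country": "Korea", "weight_g": 120.0},
        {"id": 2, "country": "R.O.K.", "weight_g": float("nan")},
        {"id": 2, "country": "Korea (south)\r", "weight_g": 95000.0},
        {"id": 4, "country": "", "weight_g": 80.0},
    ]
    print(profile(sample, key="id", numeric_col="weight_g", low=0, high=5000))
    print(sorted({normalize_country(r["country"]) for r in sample}))
```

A report like this does not fix anything by itself, but attaching it to the dataset tells the contractor up front which keys are duplicated, which columns have gaps, and which rows carry stray control characters.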
After you have implemented parts I & II, you are ready to turn over your massive data to your teammate or contractor (freelancer) so they can help you create an awesome data product.
I hope this guide helps you understand why data scientists are often said to spend 80% of their time cleaning, labeling, and even documenting their datasets. It should also highlight the importance of documentation and schemas when working in teams in the data science space.
If you think I have left anything out of this checklist, please feel free to comment so that I can update the list for others to see.