Part 2— cleaner: a story about how to write a program that cleans raw data
In Introduction — Part 1, I’ve defined the infrastructure of this project, drawing the whole picture and gathering all the pieces.
Now it’s time to write the first program: the Cleaner!
The only thing the Cleaner has to do is to parse raw CSV files into ready-to-use CSV files by removing useless data and metadata.
The result will be used by the Collector (part 3).
In this article, I cover the second part of this journey:
Setup it first!
Cleaner will be a Python program, so I need to setup the docker container first!
As I mentioned in the previous article, I don’t want to have anything installed on my system, so I’m going to update the setup-python service inside the docker-compose.yml file that will be called just once at the beginning, by adding a new command:
I defined the setup service just to make the Python development environment ready.
Further more, I told docker-compose to build data.python3:1 image (that will be used in other services) from this Python Dockerfile:
Fixing versions makes your system predictable and deterministic: don’t rely on latest versions.
Python container has only the python package manager (pip) installed and a generic entrypoint used to pass commands from outside:
Now, let’s try this container and install a new project’s dependency: pymongo!
Yay, it will be the Python driver for MongoDB.
To do this, I need to get inside the setup container and launch “pip install” and “pip freeze” commands, so I can have the dependency installed and every file needed by pip to define the requirements.txt:
# first, build the image
$ docker-compose build
# --rm will remove the container after its execution
$ docker-compose run --rm setup bash
And then, from inside the container:
# don't forget to initialize, so the project's dependencies will be stored locally
$ virtualenv .
$ source ./bin/activate
# install and freeze
# no need to mark a specific pymongo version during the setup phase
$ pip install pymongo
$ pip freeze > requirements.txt
All right, the setup is done!
If I need to restore the environment on another system, no problem, docker-compose will do the same for me:
$ docker-compose run --rm setup-python
Cleaner: hands on!
Finally, the core of this article. Data! Yay!
This is the first part of the project I’m going to code: the cleaner.
This tool has to read from a folder full of CSV files and then write to another folder the resulting cleaned files.
I need to go from this sample:
# this is a header row
Date Account Name Number Description Notes Memo Category Type Action Reconcile To With Sym From With Sym To Num. From Num. To Rate/Price From Rate/Price
# this is a data row
# another data row, part of the previous one
# another data row, part of the previous one
# header row
Date,Account Name,Description,To Num.
# just one data row
Let’s do this with Python!
The very first thing to do is to define the source folder (where the cleaner gets the raw CSV data) and the destination folder (where the cleaned data will be stored).
Let’s create a samples folder (the source) and a cleaned folder (the destination).
By the way, I want the user to have the opportunity to specify its own directories: so, I’m gonna use environment variables with the defaults listed above, one called SOURCE_FOLDER and the other one called CLEANED_FOLDER.
But first, I’m going to create the cleaner service inside the docker-compose.yaml:
Cleaner will use the same docker image of the setup-python service because it’s actually the same.
After defining the environment variables, I need to mount the whole root folder inside the container so when the command gets executed it will find python/src/cleaner.py.
So, let’s add an empty cleaner.py inside src folder:
$ touch python/src/cleaner.py
Now, what should the cleaner do?
Let’s define the procedure:
- define the CSV columns indexes required (like “Date”, “Description”, etc.)
- retrieve the source and dest folders from the environment
- restore/create the dest folder (I want to have it cleaned from previous iterations)
- read the files list from the source folder
- for each raw file, create a new cleaned file inside the dest folder
It’s time to write it down!
Looking at the samples above (raw and cleaned), I can say that the columns needed are:
- Date (index: 0)
- Account Name (index: 1)
- Description (index: 3)
- To Num. (index: 12)
I’m gonna put them inside constants:
Then, let’s get the source and dest folders from environment vars, with os.getenv:
Restore/create procedure can be defined as rmdir and mkdir in sequence, with shutil.rmtree and os.makedirs:
Reading the CSV raw files is quite easy with listdir and list comprehension:
Now, the big part!
Scan each raw file, parse it and create a new cleaned file inside the dest folder, with csv.reader and csv.writer:
CSV files are opened and parsed with csv.reader.
Then, the cleaned file is created inside the dest folder.
In the end, each row of the CSV file is parsed and only those having the DATE_COLUMN field not empty are taken.
Ok, also the cleaner is ready to be used.
And that’s the whole source code:
$ docker-compose run --rm cleaner
When the execution terminates, the dest folder has been filled with cleaned data, like the previous expected sample listed above.
End of part 2
Now, I have got a cleaner program that let me convert raw unreadable data into fresh cleaned data, ready to be used.
I wanted to produce this intermediate step because I want to check if the cleaned data is reliable or not; in fact, the next program will be the collector and it will use this transformed data to do the job.
If you enjoyed this article don’t forget to share it!
See you in Part 3: Collector!
Source code is already available here: https://github.com/wilk/from-csv-to-buxfer