Part 2 - cleaner: a story about how to write a program that cleans raw data

Preamble

In Part 1: Introduction, I defined the infrastructure of this project, drawing the whole picture and gathering all the pieces. Now it’s time to write the first program: the Cleaner!

The only thing the Cleaner has to do is to parse raw CSV files into ready-to-use CSV files by removing useless data and metadata. The result will be used by the Collector (part 3).

Journey

In this article, I cover the second part of this journey:

- Part 1: Introduction
- Part 2 (this part): Cleaner
- Part 3: Collector
- Part 4: Goxfer
- Part 5: Conclusions

Set it up first!

Cleaner will be a Python program, so I need to set up the Docker container first! As I mentioned in the previous article, I don’t want to have anything installed on my system, so I’m going to update the setup-python service inside the docker-compose.yml file, which will be called just once at the beginning, by adding a new command.

That’s good! I defined the setup service just to make the Python development environment ready. Furthermore, I told docker-compose to build the data.python3:1 image (which will be used in other services) from a Python Dockerfile.

Fixing versions makes your system predictable and deterministic: don’t rely on latest versions.

The Python container has only the Python package manager (pip) installed and a generic entrypoint used to pass commands from outside.

Now, let’s try this container and install a new project dependency: pymongo! Yay, it will be the Python driver for MongoDB. To do this, I need to get inside the setup container and launch the “pip install” and “pip freeze” commands, so I can have the dependency installed and every file needed by pip to define the requirements.txt:

    # first, build the image
    $ docker-compose build

    # --rm will remove the container after its execution
    $ docker-compose run --rm setup bash

And then, from inside the container:

    # don't forget to initialize the virtualenv, so the project's dependencies will be stored locally
    $ virtualenv .
    $ source ./bin/activate

    # install and freeze
    # no need to mark a specific pymongo version during the setup phase
    $ pip install pymongo
    $ pip freeze > requirements.txt

All right, the setup is done! If I need to restore the environment on another system, no problem, docker-compose will do the same for me:

    $ docker-compose run --rm setup-python

Cleaner: hands on!

Finally, the core of this article. Data! Yay! This is the first part of the project I’m going to code: the cleaner. This tool has to read from a folder full of CSV files and then write the resulting cleaned files to another folder. I need to go from this sample:

    # this is a header row
    Date Account Name Number Description Notes Memo Category Type Action Reconcile To With Sym From With Sym To Num. From Num. To Rate/Price From Rate/Price

    # this is a data row
    "04/06/2016","Abbigliamento","","maglietta","","","Sbilancio-EUR","T","","N","€ 5,00","","5,00","","",""

    # another data row, part of the previous one
    "","","","","","","Abbigliamento","S","","N","€ 5,00","","5,00","","1,00",""

    # another data row, part of the previous one
    "","","","","","","Sbilancio-EUR","S","","N","","-€ 5,00","","-5,00","","1,00"

to this:

    # header row
    Date,Account Name,Description,To Num.

    # just one data row
    04/06/2016,Abbigliamento,maglietta,"5,00"

Let’s do this with Python! The very first thing to do is to define the source folder (where the cleaner gets the raw CSV data) and the destination folder (where the cleaned data will be stored). Let’s create a samples folder (the source) and a cleaned folder (the destination). By the way, I want users to be able to specify their own directories, so I’m gonna use environment variables with the defaults listed above, one called SOURCE_FOLDER and the other one called CLEANED_FOLDER. But first, I’m going to create the cleaner service inside the docker-compose.yml.

Cleaner will use the same Docker image as the setup-python service because it’s actually the same. After defining the environment variables, I need to mount the whole root folder inside the container, so when the command gets executed it will find python/src/cleaner.py. So, let’s add an empty cleaner.py inside the src folder:

    $ touch python/src/cleaner.py

Now, let’s define the procedure: what should the cleaner do?

- define the required CSV column indexes (like “Date”, “Description”, etc.)
- retrieve the source and dest folders from the environment
- restore/create the dest folder (I want to have it cleaned from previous iterations)
- read the files list from the source folder
- for each raw file, create a new cleaned file inside the dest folder

It’s time to write it down! Looking at the samples above (raw and cleaned), I can say that the columns needed are:

- Date (index: 0)
- Account Name (index: 1)
- Description (index: 3)
- To Num. (index: 12)
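
To double-check those indexes against the raw sample, here is a minimal sketch that parses the first sample data row with csv.reader and keeps only the four fields. REQUIRED_COLUMNS, RAW_ROW and pick_columns are hypothetical names used for illustration; they are not taken from the project’s source.

```python
import csv
import io

# Indexes of the needed columns in the raw export:
# Date, Account Name, Description, To Num.
REQUIRED_COLUMNS = (0, 1, 3, 12)  # hypothetical constant name

# First data row of the raw sample shown earlier in the article.
RAW_ROW = (
    '"04/06/2016","Abbigliamento","","maglietta","","","Sbilancio-EUR",'
    '"T","","N","€ 5,00","","5,00","","",""\n'
)


def pick_columns(row, indexes=REQUIRED_COLUMNS):
    """Return only the fields found at the given indexes."""
    return [row[i] for i in indexes]


# csv.reader handles the quoting, so "5,00" stays a single field.
row = next(csv.reader(io.StringIO(RAW_ROW)))
picked = pick_columns(row)
print(picked)  # ['04/06/2016', 'Abbigliamento', 'maglietta', '5,00']
```

The raw row has 16 fields, so index 12 is always in range for rows shaped like the sample.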

I’m gonna put them inside constants. Then, let’s get the source and dest folders from the environment vars, with os.getenv.

The restore/create procedure can be defined as rmdir and mkdir in sequence, with shutil.rmtree and os.makedirs.

Reading the raw CSV files is quite easy with listdir and a list comprehension.

Now, the big part! Scan each raw file, parse it and create a new cleaned file inside the dest folder, with csv.reader and csv.writer. CSV files are opened and parsed with csv.reader. Then, the cleaned file is created inside the dest folder. In the end, each row of the CSV file is parsed and only the rows whose DATE_COLUMN field is not empty are kept.

OK, the cleaner is ready to be used, and that’s the whole source code. Let’s try it:

    $ docker-compose run --rm cleaner

When the execution terminates, the dest folder has been filled with cleaned data, matching the expected sample listed above. Great job!

End of part 2

Now I have a cleaner program that lets me convert raw, unreadable data into fresh cleaned data, ready to be used. I wanted to produce this intermediate step because I want to check whether the cleaned data is reliable or not; in fact, the next program will be the collector and it will use this transformed data to do its job.

If you enjoyed this article, don’t forget to share it! See you in Part 3: Collector!

Spoiler

Source code is already available here: https://github.com/wilk/from-csv-to-buxfer
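
For reference, the steps walked through in this part can be sketched as one small script. This is a minimal sketch under a few assumptions, not the code from the repository: only DATE_COLUMN, SOURCE_FOLDER and CLEANED_FOLDER are names from the article, while the other constants and both function names are hypothetical. It also assumes the raw header row is comma-separated like the data rows, so it passes the same "date field not empty" filter.

```python
import csv
import os
import shutil

# Column indexes in the raw export (listed earlier in the article).
DATE_COLUMN = 0
ACCOUNT_COLUMN = 1      # hypothetical name
DESCRIPTION_COLUMN = 3  # hypothetical name
TO_NUM_COLUMN = 12      # hypothetical name
REQUIRED_COLUMNS = (DATE_COLUMN, ACCOUNT_COLUMN, DESCRIPTION_COLUMN, TO_NUM_COLUMN)


def folders_from_env():
    """Read the source and dest folders from the environment, with defaults."""
    return (os.getenv("SOURCE_FOLDER", "samples"),
            os.getenv("CLEANED_FOLDER", "cleaned"))


def clean_folder(source, dest):
    """Write a cleaned copy of every CSV file in `source` into `dest`."""
    # Restore/create the dest folder: drop leftovers, then recreate it.
    shutil.rmtree(dest, ignore_errors=True)
    os.makedirs(dest)

    # Read the files list from the source folder.
    raw_files = [name for name in os.listdir(source) if name.endswith(".csv")]

    for name in raw_files:
        with open(os.path.join(source, name), newline="", encoding="utf-8") as raw:
            with open(os.path.join(dest, name), "w", newline="", encoding="utf-8") as cleaned:
                writer = csv.writer(cleaned)
                for row in csv.reader(raw):
                    # Keep only complete rows whose date field is not empty;
                    # continuation rows of a transaction have an empty date.
                    if len(row) > TO_NUM_COLUMN and row[DATE_COLUMN]:
                        writer.writerow([row[i] for i in REQUIRED_COLUMNS])
    return raw_files
```

Calling clean_folder(*folders_from_env()) would run it against the configured folders. Note that the dest folder is deleted on every run, so it should never point at a directory holding anything worth keeping.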