Part 3: Collector, a story about how to write a program that collects a pure dataset
In Cleaner (Part 2), I coded the first program of this project, which cleans the raw data and makes it available to the second program: the Collector 📦!
In this article, I cover the third part of this journey and the second important block of this adventure: the Collector.
The procedure is similar to the Cleaner: update the docker-compose.yml file and then write the program in Python.
So, like the Cleaner, the collector service uses the same Python image, mounts the same volume, launches a similar command and uses some of the Cleaner’s environment variables.
Let’s update the collector service:
Actually, the collector needs something more: for example, a MongoDB instance attached!
Ok, connecting to MongoDB requires some variables… yay, environment variables!
And of course, the collector service will depend on the mongodb service, so I’m going to rewrite it:
I’ve got almost everything I need to start writing the collector program: I just need a couple more pieces of information.
For instance, I want to customise the account name for each file, like expenses and income, and map the transactions’ tags to my new custom tags.
As done previously, I’ll put them inside the environment space:
Now, TAGS_FILE is a reference to a file where I map the tags found inside the CSV files to my own custom tags, like this:
"Abbigliamento": ["Abbigliamento", "Uscite / Annuali"]
I have to do this because my current transactions have more than one tag each.
So, for a transaction like this one:
the result will be a new transaction with two tags associated: Abbigliamento and Uscite / Annuali.
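A minimal sketch of how such a mapping could be applied in Python, assuming the tags file is JSON shaped like the line shown above. The helper name `map_tags` is hypothetical, and falling back to the original tag when no mapping exists is an assumption of this sketch, not necessarily what collector.py does:

```python
import json

# Assumed content of the tags file: original tag -> list of custom tags.
TAGS_JSON = '{"Abbigliamento": ["Abbigliamento", "Uscite / Annuali"]}'


def map_tags(original_tag, tags):
    """Return the custom tags mapped to original_tag.

    Assumption: a tag without a mapping is kept as it is.
    """
    return list(tags.get(original_tag, [original_tag]))


tags = json.loads(TAGS_JSON)
mapped = map_tags("Abbigliamento", tags)  # two custom tags
unmapped = map_tags("Altro", tags)        # "Altro" is a made-up, unmapped tag
```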
EXPENSE_ACCOUNT and INCOME_ACCOUNT define the accounts for expenses and income transactions respectively.
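On the Python side, these three variables could be read roughly like this. It is only a sketch: the function name `read_settings`, the dictionary keys, and the example values are made up for illustration.

```python
import os


def read_settings(env=os.environ):
    """Collect the collector settings from the environment.

    A missing variable raises KeyError, so a misconfigured service
    fails immediately instead of halfway through the import.
    """
    return {
        "tags_file": env["TAGS_FILE"],
        "expense_account": env["EXPENSE_ACCOUNT"],
        "income_account": env["INCOME_ACCOUNT"],
    }


# Example values, invented for this sketch only.
example_env = {
    "TAGS_FILE": "tags.json",
    "EXPENSE_ACCOUNT": "Expenses",
    "INCOME_ACCOUNT": "Income",
}
settings = read_settings(example_env)
```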
It’s time to define the collector procedure:
- connect to MongoDB
- read CSV cleaned files
- for each row, map the right account and tags
- for each row, create a MongoDB document
- check the number of transactions added
Ok, first of all, I need the collector to check the existence of the cleaned folder:
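A check of this kind might look like the sketch below. The function name `require_folder` is hypothetical, and raising an exception (rather than, say, exiting with an error code) is just one possible choice:

```python
import os


def require_folder(path):
    """Return path if it is an existing directory, otherwise raise."""
    if not os.path.isdir(path):
        raise FileNotFoundError(f"cleaned folder not found: {path}")
    return path
```

Calling `require_folder` at the very start of the program stops the collector before it touches the database when the Cleaner has not run yet.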
Then, the DB connection, followed by an erasing procedure used to clean previous executions:
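As a rough sketch of these two steps with PyMongo (the function names, and passing host, port, database and collection names as parameters, are assumptions of this example; `MongoClient` and `delete_many` are regular PyMongo calls):

```python
def connect(host, port, db_name, collection_name):
    """Open a connection and return the target collection.

    Requires the third-party pymongo package; it is imported lazily so
    the rest of this sketch can be tried without it.
    """
    from pymongo import MongoClient

    client = MongoClient(host, port)
    return client[db_name][collection_name]


def erase_previous_run(collection):
    """Delete every document left behind by a previous execution."""
    result = collection.delete_many({})  # an empty filter matches all documents
    return result.deleted_count
```

Erasing the collection first keeps the collector idempotent: running it twice over the same cleaned files does not duplicate transactions.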
Let’s read some files:
tags is a dictionary used to map old tags to new ones, while csv_files holds the references to the CSV files.
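One way to build these two values is sketched below, assuming the tags file is JSON and the cleaned files carry a `.csv` extension. The helper names `load_tags` and `list_csv_files` are hypothetical:

```python
import glob
import json
import os


def load_tags(tags_file):
    """Load the old-tag -> new-tags dictionary from a JSON file."""
    with open(tags_file, encoding="utf-8") as handle:
        return json.load(handle)


def list_csv_files(cleaned_folder):
    """Return the paths of the CSV files in cleaned_folder, sorted by name."""
    return sorted(glob.glob(os.path.join(cleaned_folder, "*.csv")))
```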
Now, it’s time to write the collector core logic:
Long story short: for each CSV file, put a new transaction inside MongoDB, using a dictionary to represent it.
The transaction model is composed of a date (the datetime package is really useful in these cases), an account (defining the transaction’s type), a description, an amount (built by converting a string into a float) and a list of tags (combining the TAGS_FILE mapping with the original transaction’s tag).
Then some counters, like transactions_counter and total_counter, are updated because they will be used for the final test.
When the collector’s execution has finished, I need to check that the number of transactions read from the CSV files equals the number of transactions inserted into MongoDB: I need to check both the counter and the total amount.
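The shape of this logic can be sketched as follows. The CSV column names (`date`, `description`, `amount`, `tag`), the date format, and the sample row are all assumptions made for illustration; the real cleaned files may look different. The insert into MongoDB is replaced here by appending to a list:

```python
from datetime import datetime


def build_transaction(row, account, tags, date_format="%Y-%m-%d"):
    """Turn one CSV row (a dict) into a transaction document."""
    original_tag = row["tag"]
    return {
        "date": datetime.strptime(row["date"], date_format),
        "account": account,
        "description": row["description"],
        "amount": float(row["amount"]),
        # Assumption: unmapped tags are kept unchanged.
        "tags": tags.get(original_tag, [original_tag]),
    }


tags = {"Abbigliamento": ["Abbigliamento", "Uscite / Annuali"]}
rows = [  # made-up row standing in for a cleaned CSV file
    {"date": "2018-01-15", "description": "New shoes",
     "amount": "49.90", "tag": "Abbigliamento"},
]

documents = []      # stand-in for the MongoDB collection
total_counter = 0   # number of transactions added
total_amount = 0.0  # sum of every transaction added
for row in rows:
    transaction = build_transaction(row, "Expenses", tags)
    documents.append(transaction)  # the collector would insert it into MongoDB here
    total_counter += 1
    total_amount += transaction["amount"]
```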
With assert, I can test if total_counter (that is the number of transactions added) and total_amount (the sum of every transaction added) are equal to collection_counter and amount_counter (counterparts, read and calculated from the DB).
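A sketch of this final check, assuming the documents store the amount under an `amount` field. The function names are hypothetical; `count_documents` and `find` are regular PyMongo collection methods. Comparing the amounts with a small tolerance is a choice of this sketch, since summing floats in a different order can produce slightly different results:

```python
import math


def read_db_totals(collection):
    """Return (number of documents, sum of their amounts) from the DB."""
    collection_counter = collection.count_documents({})
    amount_counter = sum(
        doc["amount"] for doc in collection.find({}, {"amount": 1})
    )
    return collection_counter, amount_counter


def verify(total_counter, total_amount, collection_counter, amount_counter):
    """Assert that what was read from CSV matches what landed in the DB."""
    assert total_counter == collection_counter, "transaction count mismatch"
    assert math.isclose(total_amount, amount_counter, abs_tol=0.005), \
        "total amount mismatch"
```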
At this point, running collector after cleaner makes it easy to fill the DB with cleaned data, so that fresh transactions are ready to be pushed to Buxfer!
Here is the source code of collector.py:
End of part 3
Before the collector I had just a bunch of CSV files, containing data like this:
But now I’ve got something better! Take a look:
Much better 😁
So, Collector promotes cleaned unstructured data to well-defined and well-structured documents inside MongoDB.
This is useful for data analysis and it’s definitely a bridge between Python and GoLang (used in the next step).
If you enjoyed this article don’t forget to share it!
See you in Part 4: Goxfer!
Source code is already available here: https://github.com/wilk/from-csv-to-buxfer