Part 3, Collector: a story about how to write a program that collects a pure dataset

Preamble

In Part 2 (Cleaner), I coded the first program of this project, which cleans the raw data and makes it available to the second program: the Collector 📦!

The Collector has to read the cleaned data from CSV files and store it inside a MongoDB instance, using the data model defined in Part 1.

Journey

In this article, I cover the third part of this journey:

Part 1: Introduction
Part 2: Cleaner
Part 3 (this part): Collector
Part 4: Goxfer
Part 5: Conclusions

Getting started

This is the second important block of this adventure: the Collector. The procedure is similar to the one used for the Cleaner: update the docker-compose.yml file and then write the program in Python.

So, like the Cleaner, the collector service uses the same Python image, mounts the same volume, launches a similar command and uses some of the Cleaner's environment variables. Let's update the collector service:

Actually, the collector needs something more: for example, a MongoDB instance attached! Ok, connecting to MongoDB requires some variables… yay, environment variables! And of course, the collector service will depend on the mongodb service, so I'm going to rewrite it:

Very well! I've got almost everything I need to start writing the collector program: I just need a couple more pieces of information. For instance, I want to customise the account name for each file, like expenses and income, and map the transactions' tags to my new custom tags. As done previously, I'll put them inside the environment space:

Now, TAGS_FILE is a reference to a file where I map the tags found inside the CSV files to my own custom tags, like this:

    {"expenses.csv": {"Abbigliamento": ["Abbigliamento", "Uscite / Annuali"]}}

I have to do this because my current transactions have more than one tag each. So, for a transaction like this one:

    04/06/2016,Abbigliamento,maglietta,"5,00"

the result will be a new transaction with two tags associated: Abbigliamento and Uscite / Annuali.
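To make the tag mapping concrete, here is a minimal sketch of how such a lookup could work. It is not the code from collector.py: the function name map_tags and the fallback to the original tag for unmapped entries are assumptions made for illustration, and the JSON is inlined instead of being read from TAGS_FILE.

```python
import json

# Same shape as the TAGS_FILE example: file name -> original tag -> custom tags.
# Inlined here for the sketch; the real file is referenced by TAGS_FILE.
TAGS_JSON = '{"expenses.csv": {"Abbigliamento": ["Abbigliamento", "Uscite / Annuali"]}}'


def map_tags(tags, file_name, original_tag):
    """Return the custom tags for a CSV tag (hypothetical helper)."""
    # Assumption: a tag without a mapping is kept as a single-item list.
    return tags.get(file_name, {}).get(original_tag, [original_tag])


tags = json.loads(TAGS_JSON)
print(map_tags(tags, "expenses.csv", "Abbigliamento"))
print(map_tags(tags, "expenses.csv", "Hobby"))
```

The first call returns both tags for the maglietta transaction, while the second one falls back to the tag found in the CSV file.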
EXPENSE_ACCOUNT and INCOME_ACCOUNT define the accounts for expense and income transactions respectively.

Collector

It's time to define the collector procedure:

connect to MongoDB
read the cleaned CSV files
for each row, map the right account and tags
for each row, create a MongoDB document
check the number of transactions added

Ok, first of all, I need the collector to check that the cleaned folder exists:

Then comes the DB connection, followed by an erasing procedure that removes the data left by previous executions:

That's great! Let's read some files:

tags is a dictionary used to map old tags to new ones, while csv_files holds the references to the CSV files.

Now, it's time to write the collector core logic:

Long story short: for each row of each CSV file, put a new transaction inside MongoDB, using a dictionary to represent it. The transaction model is composed of a date (the datetime package is really useful in these cases), an account (defining the transaction's type), a description, an amount (built by converting the string into a float) and a list of tags (combining the TAGS_FILE mapping with the transaction's original tag). Then some counters, like transactions_counter and total_counter, are updated because they will be used for the final test.

Testing

When the collector's execution has finished, I need to check that the transactions read from the CSV files match the transactions inserted into MongoDB: I need to check both the counter and the total amount.

With assert, I can test that total_counter (the number of transactions added) and total_amount (the sum of every transaction added) are equal to collection_counter and amount_counter (their counterparts, read and calculated from the DB).

At this point, running the collector after the cleaner makes it easy to fill the DB with cleaned data, so fresh transactions are ready to be pushed to Buxfer!
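The core logic and the final check described above can be sketched as follows. This is only an illustration under a few assumptions: a plain Python list stands in for the MongoDB collection so the snippet runs on its own, the helper to_document and the account name "expenses" are hypothetical, and the sample rows use the cleaned format shown in this series.

```python
import csv
import io
from datetime import datetime

# Sample rows in the cleaned format: date, tag, description, amount.
CSV_TEXT = '''02/01/2016,Alimentari,spesa,"17,64"
26/03/2016,Hobby,libri,"30,46"
04/06/2016,Abbigliamento,maglietta,"5,00"
'''

# Tag mapping for a single file, same shape as one entry of TAGS_FILE.
TAGS = {"Abbigliamento": ["Abbigliamento", "Uscite / Annuali"]}


def to_document(row, account, tags):
    """Turn one cleaned CSV row into a transaction dictionary (hypothetical helper)."""
    date, tag, description, amount = row
    return {
        "date": datetime.strptime(date, "%d/%m/%Y"),
        "account": account,
        "description": description,
        # Amounts use a decimal comma, so swap it before converting to float.
        "amount": float(amount.replace(",", ".")),
        # Assumption: unmapped tags are kept as they are.
        "tags": tags.get(tag, [tag]),
    }


collection = []  # assumption: stand-in for the MongoDB collection
total_counter = 0
total_amount = 0.0

for row in csv.reader(io.StringIO(CSV_TEXT)):
    document = to_document(row, "expenses", TAGS)
    collection.append(document)
    total_counter += 1
    total_amount += document["amount"]

# Counterparts computed from what was actually stored.
collection_counter = len(collection)
amount_counter = sum(d["amount"] for d in collection)

assert total_counter == collection_counter
# A small tolerance avoids false alarms caused by float rounding.
assert abs(total_amount - amount_counter) < 0.005
```

With a real database, the list would be replaced by the collection object, and the two counterparts would come from a count and a sum computed over the stored documents.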
The complete program lives in collector.py (the repository is linked at the end of this article).

End of part 3

Before the collector, I had just a bunch of CSV files containing data like this:

    02/01/2016,Alimentari,spesa,"17,64"
    26/03/2016,Hobby,libri,"30,46"
    04/06/2016,Abbigliamento,maglietta,"5,00"

But then, I've got something better! Take a look:

Much better 😁

So, the Collector promotes cleaned, unstructured data to well-defined and well-structured documents inside MongoDB. This is useful for data analysis, and it's definitely a bridge between Python and GoLang (used in the next step).

If you enjoyed this article, don't forget to share it! See you in Part 4: Goxfer!

Spoiler

Source code is already available here: https://github.com/wilk/from-csv-to-buxfer