Tracking, Reproducibility and Collaboration in Data Projects

by @datmo_io


Documenting your work is necessary but boring, regardless of the type of work you do. Tracking and reproducing work for most generic web-connected applications and workflows is becoming more standardized (e.g., document state-saving and tracking through Google Docs, and code collaboration and version control with git and GitHub), but there is currently no widely accepted standard or simple automation for data science and machine learning. This is not to say developers and data scientists don’t track their work, but their process tends to be rinse-and-repeat, time-consuming, and rarely automated.

Not only is keeping track of the state of your work an important part of getting things done, but automating and observing best practices in tracking also drives better productivity and collaboration. Below, we propose some best practices and an open source system for solving tracking and reproducibility when working with data and machine learning. Furthermore, we introduce a new way to automate this process.

Especially in data and machine learning projects, tracking and saving work means you:

  1. Prevent lost history of your trained models, configurations, metrics, and environments
  2. Ensure accurate reporting of results
  3. Avoid errors when repeating, learning from, and reproducing someone else’s work (or even your own work!)

Broadly, we can categorize tracking in data projects as keeping a log of the inputs and outputs of each computation. Below we break down the current state of the art, some best practices for handling these workflows, and the Datmo equivalents for those best practices.
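To make "keeping a log of inputs and outputs" concrete, here is a minimal sketch of what doing this by hand can look like, using only the Python standard library. It is not Datmo's implementation; the function names (`make_run_record`, `append_record`, `read_records`) and the record layout are assumptions chosen for illustration.

```python
import json
import platform
import sys
import time


def make_run_record(config, metrics):
    """Bundle one run's inputs and outputs into a JSON-serializable dict.

    The layout is hypothetical: inputs are the configuration and the
    environment, outputs are the resulting metrics.
    """
    return {
        "timestamp": time.time(),
        "inputs": {
            "config": config,
            "environment": {
                "python": sys.version.split()[0],
                "platform": platform.platform(),
            },
        },
        "outputs": {"metrics": metrics},
    }


def append_record(path, record):
    """Append one JSON record per line so earlier history is never overwritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def read_records(path):
    """Load every record from the log, oldest first."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `append_record("runs.jsonl", make_run_record({"lr": 0.01}, {"accuracy": 0.9}))` after each training run builds an append-only history. Even this small amount of bookkeeping has to be remembered on every run, which is the manual effort described above.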

Tracking Conundrums with Existing Solutions

Best Practices with Existing Solutions

Best Practices using Existing Tools

Datmo Solution

New Datmo Paradigm for Tracking the Workflow

Keeping track of workflows while working with and modeling data shouldn’t be like pulling teeth. Datmo’s simple Command Line Interface takes into account the common workflows and best practices that data scientists and developers are used to, and automates the entire tracking process.

More specifically, we allow users to keep track of three main components:

  • Datasets, versioned for use with models
  • Snapshots of models, each capturing the trained model at a point in time
  • Training tasks run in parallel
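As a rough illustration of the first two ideas (dataset versions and model snapshots), the sketch below identifies a dataset by the SHA-256 hash of its contents and derives a snapshot ID from the dataset hash, model hash, configuration, and metrics together. This is an assumption about one reasonable way to do it by hand, not a description of how Datmo works internally; `file_sha256` and `make_snapshot` are hypothetical helpers.

```python
import hashlib
import json


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_snapshot(dataset_path, model_path, config, metrics):
    """Describe a trained model at one point in time (hypothetical layout).

    The ID is a hash of everything in the snapshot, so identical inputs
    and outputs always produce the same ID, and any change produces a new one.
    """
    body = {
        "dataset_sha256": file_sha256(dataset_path),
        "model_sha256": file_sha256(model_path),
        "config": config,
        "metrics": metrics,
    }
    canonical = json.dumps(body, sort_keys=True).encode("utf-8")
    return {"id": hashlib.sha256(canonical).hexdigest()[:12], **body}
```

Because the ID depends on the dataset's contents rather than its file name, two collaborators can check that they trained on the same data by comparing a single string.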

The Datmo CLI, together with the GUI platform, lets you visualize and collaborate on data projects. Collaborators can see the dataset used, along with all tasks and model snapshots, and can then comment on, improve, and start new experiments from there. This improves the way everyone works on their data projects.

If you want to check out our Datmo CLI, click here to get started today. We hope no one will have to deal with tracking work manually ever again. Looking forward to your feedback :)

If you’re looking for a demo to see how it works, check out our video here:

Sign up for our newsletter at https://datmo.io/ to get updates on our machine learning suite.

P.S. Thanks for reading this far! If you found value in this, we’d really appreciate it if you recommended this post (by clicking the ❤ button) so other people can see it!


