Technology Fridays: MLDB is the Database Every Data Scientist Dreams Of

Written by jrodthoughts | Published 2018/06/08
Tech Story Tags: machine-learning | deep-learning | artificial-intelligence


Machine learning solutions in the real world are rarely just a matter of building and testing models. Managing and automating the lifecycle of machine learning models from training to optimization is, by far, the hardest problem to solve in machine learning solutions. To control the lifecycle of a model, data scientists need to be able to persist and query its state at scale. This problem might seem trivial until you consider that an average deep learning model can include hundreds of hidden layers and millions of interconnected nodes ;) Storing and accessing large computation graphs is far from trivial. Most of the time, data science teams spend a lot of effort trying to adapt commodity NoSQL databases to machine learning models before arriving at the not-so-obvious conclusion: machine learning solutions need a new type of database.

MLDB is a database designed for the machine learning era. The platform is optimized for storing, transforming and navigating a computation graph that represents a machine learning structure such as a deep neural network. I know what you are thinking 😉 Cloud machine learning platforms such as AWS SageMaker or Azure ML already include persistence models for machine learning graphs, so why do we need another solution? Well, it turns out that there are quite a few requirements of real-world machine learning solutions that can benefit from a real database.

Enter MLDB

MLDB provides an open-source, native database for the storage and query of machine learning models. The platform was first incubated within Datacratic and was recently acquired by AI powerhouse Element AI as a validation of the relevance of the database engine in modern machine learning projects. MLDB is available in different forms, such as a cloud service, a VirtualBox VM or a Docker instance that can be deployed on any container platform.

The architecture of MLDB combines several artifacts that abstract the different elements of the lifecycle of a machine learning solution. Technically, the MLDB model can be summarized in six simple components: Files, Datasets, Procedures, Functions, Queries and APIs.

Files

Files represent the common unit of abstraction in the MLDB architecture. In the MLDB model, Files can be used to load data for a model, supply parameters for a function, or persist a specific dataset. MLDB supports native integration with popular file systems such as HDFS and S3.
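For illustration, here is a minimal sketch of pulling a CSV file straight into a dataset using MLDB's import.text procedure type (the s3:// path and dataset name are hypothetical; check the MLDB docs for the full parameter set):

from pymldb import Connection

mldb = Connection("http://localhost")

# Import a CSV file from S3 into a new dataset in one shot
mldb.put("/v1/procedures/import_data", {
    "type": "import.text",
    "params": {
        "dataFileUrl": "s3://my-bucket/training.csv",  # hypothetical path
        "outputDataset": "training_data",
        "runOnCreation": True
    }
})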

Datasets

MLDB Datasets represent the main data unit used by Procedures and machine learning models. Structurally, Datasets are schema-less, append-only named sets of data points, which are contained in cells, which sit at the intersection of rows and columns. Data points are composed of a value and a timestamp. Each data point can thus be represented as a (row, column, timestamp, value) tuple, and datasets can be thought of as sparse 3-dimensional matrices. Datasets can be created, and data can be appended to them, via MLDB's REST API, and they can also be loaded from or saved to files via Procedures.
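To make the tuple model concrete, here is a hedged sketch of appending a single data point with an explicit timestamp over the REST API, reusing the mldb connection from the sketch above (the dataset and column names are made up; MLDB also accepts numeric epoch timestamps):

# Create a mutable sparse dataset
mldb.put("/v1/datasets/sensors", {"type": "sparse.mutable"})

# Each column entry is [column, value, timestamp]; together with the row
# name this forms the (row, column, timestamp, value) tuple described above
mldb.post("/v1/datasets/sensors/rows", {
    "rowName": "reading-1",
    "columns": [["temperature", 21.5, "2018-06-08T00:00:00Z"]]
})
mldb.post("/v1/datasets/sensors/commit")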

Procedures

In MLDB, Procedures are used to implement the different aspects of a machine learning model such as training or data transformation. From the technical standpoint, Procedures are named, reusable programs used to implement long-running batch operations with no return values. Procedures generally run over Datasets and can be configured via SQL expressions. The outputs of a Procedure can include Datasets and files.
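As a concrete example, here is a sketch of a training Procedure built on MLDB's classifier.train procedure type, continuing with the same connection (the dataset name, model path and algorithm choice are illustrative assumptions):

# Train a classifier over a dataset; the trainingData SQL expression
# selects the features and the label, and the model is persisted to a file
mldb.put("/v1/procedures/train_model", {
    "type": "classifier.train",
    "params": {
        "trainingData": "SELECT {* EXCLUDING (label)} AS features, label FROM training_data",
        "modelFileUrl": "file://models/demo.cls",
        "algorithm": "glz",
        "mode": "boolean",
        "runOnCreation": True
    }
})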

Functions

MLDB Functions abstract data computation routines used in Procedures. Functions are named, reusable programs used to implement streaming computations which can accept input values and return output values. Commonly, MLDB Functions encapsulate SQL expressions that express a specific computation.
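For instance, a scoring Function can be parameterized with the model file produced by the training Procedure above. A minimal sketch, assuming the names from the earlier snippets (the classifier function type is MLDB's own; everything else here is an assumption):

# Expose the trained model as a reusable, streaming scoring function
mldb.put("/v1/functions/score", {
    "type": "classifier",
    "params": {"modelFileUrl": "file://models/demo.cls"}
})

# Functions are callable inline from SQL
mldb.query("SELECT score({features: {* EXCLUDING (label)}}) AS result FROM training_data LIMIT 5")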

Queries

One of the main advantages of MLDB is that it uses SQL as the mechanism to query data stored in the database. The platform supports a fairly complete, SQL-based grammar that includes familiar constructs such as SELECT, WHERE, FROM, GROUP BY, ORDER BY and many others. For instance, in MLDB, we can use a SQL query to prepare a training dataset for an image classification model:

mldb.query("SELECT * FROM images LIMIT 3000")

APIs & Pymldb

All the capabilities of MLDB are exposed via a simple REST API. The platform also includes pymldb, a Python library that wraps the REST API in a very friendly syntax. The following code shows how to use pymldb to create and query a dataset.

from pymldb import Connection

mldb = Connection("http://localhost")

mldb.put("/v1/datasets/demo", {"type": "sparse.mutable"})
mldb.post("/v1/datasets/demo/rows", {"rowName": "first", "columns": [["a", 1, 0], ["b", 2, 0]]})
mldb.post("/v1/datasets/demo/rows", {"rowName": "second", "columns": [["a", 3, 0], ["b", 4, 0]]})
mldb.post("/v1/datasets/demo/commit")

df = mldb.query("SELECT * FROM demo")
print(type(df))

The result of a query comes back as a pandas DataFrame, which makes it easy to plug into the rest of the Python data stack.

Support for Machine Learning Algorithms

MLDB provides support for a large number of machine learning algorithms that can be used from Procedures and Functions. The platform also natively supports the computation graphs of different deep learning engines such as TensorFlow.
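As a rough sketch of what the TensorFlow integration looks like, a serialized graph can be wrapped as an MLDB Function via the tensorflow.graph function type (the file path and node names below are hypothetical, and the exact parameter semantics are worth checking against the MLDB docs):

# Wrap a frozen TensorFlow graph as a queryable MLDB function
mldb.put("/v1/functions/tf_model", {
    "type": "tensorflow.graph",
    "params": {
        "modelFileUrl": "file://models/graph.pb",
        "inputs": "input_tensor",
        "outputs": "softmax"
    }
})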

Bringing it All Together

Let’s take a common workflow in machine learning solutions such as the training and scoring of a model. Here is how it can be implemented in MLDB (a rough end-to-end sketch in pymldb follows the list):

  1. The process starts with a file full of training data, which is loaded into a Training Dataset.
  2. A Training Procedure is run to produce a Model File.
  3. The Model File is used to parameterize a Scoring Function.
  4. This Scoring Function is immediately accessible via a REST Endpoint for real-time scoring.
  5. The Scoring Function is also immediately accessible via a SQL Query.
  6. A Batch Scoring Procedure uses SQL to apply the Scoring Function to an Unscored Dataset in batch, producing a Scored Dataset.
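Under the same assumptions as the earlier sketches (a training_data dataset with a label column, and hypothetical names throughout), the whole workflow compresses into a few pymldb calls:

from pymldb import Connection

mldb = Connection("http://localhost")

# Steps 1-2: train over the Training Dataset and persist a Model File
mldb.put("/v1/procedures/train", {
    "type": "classifier.train",
    "params": {
        "trainingData": "SELECT {* EXCLUDING (label)} AS features, label FROM training_data",
        "modelFileUrl": "file://models/demo.cls",
        "algorithm": "glz",
        "mode": "boolean",
        "runOnCreation": True
    }
})

# Step 3: parameterize a Scoring Function with the Model File
mldb.put("/v1/functions/score", {
    "type": "classifier",
    "params": {"modelFileUrl": "file://models/demo.cls"}
})

# Steps 4-5: the function is now reachable over REST
# (GET /v1/functions/score/application) and from any SQL query

# Step 6: batch-score an Unscored Dataset into a Scored Dataset
mldb.put("/v1/procedures/batch_score", {
    "type": "transform",
    "params": {
        "inputData": "SELECT score({features: {*}}) AS * FROM unscored_data",
        "outputDataset": "scored_data",
        "runOnCreation": True
    }
})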

Conclusion

MLDB is one of the first instances of a database designed from the ground up to enable machine learning solutions. The platform can still be improved a lot to support modern machine learning and deep learning techniques, but its flexibility and extensibility make it a great first iteration in this new space.


Written by jrodthoughts | Chief Scientist, Managing Partner at Invector Labs. CTO at IntoTheBlock. Angel Investor, Writer, Boa
Published by HackerNoon on 2018/06/08