A New Programming Language For AI: Linear Regression, But With Mojo Language

Written by alexandraoberemok | Published 2023/12/28
Tech Story Tags: mojo | ai | ml-algorithm | ml | linear-regression | simple-linear-regression | ai-development | python

TL;DR: I write a linear regression, but in the Mojo language. The article contains a brief explanation of the project, the code itself, a comparison with Python, and the challenges I ran into.

What is Mojo language?

Mojo is a new programming language for AI development, created by the company Modular. According to its developers, it combines the best of Python syntax with systems programming and metaprogramming, letting you write portable code that is faster than C and interoperates seamlessly with the Python ecosystem.

According to the official website, Mojo supports parallelism and can be up to 68,000 times faster than Python on selected benchmarks.

Why is this project important?

Mojo is becoming more and more popular, and its community is growing actively, so I decided to write my first data science project in Mojo.

I was excited to try a new programming language and start learning it before its popularity explodes. This article was written in December 2023, so the code is valid as of that date.

I strongly believe this little guide will be valuable for anyone interested in Mojo.

The code from the article is available in this GitHub repo:

https://github.com/OberemokAlexandra/mojo_experiments/tree/main

How to install Mojo language and start experiments?

There are two ways to get started with Mojo. The first is to install it manually on your machine; instructions for each operating system can be found here:

https://developer.modular.com/download

Installation takes some time, so be prepared.

But there is an easier and faster way to try out Mojo on your own: the free JupyterLab playground, available here:

https://playground.modular.com/

I set up all the components on my laptop, and all the code below was tested in that locally configured environment.

Choosing the dataset

Since this is my very first data project in Mojo, I went with a regression task and chose the diabetes dataset from the sklearn library: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html

The Iris and Titanic datasets are meant for classification tasks, so not today 🙂.

This dataset contains measurements from 442 diabetic patients. There are 10 features with various medical measurements, and the target is a quantitative measure of disease progression one year after baseline.

A great explanation of the features is given here: https://rowannicholls.github.io/python/data/sklearn_datasets/diabetes.html

Installing and importing Python modules in Mojo

Mojo has built-in support for the Python ecosystem, so it's quite easy to import a required module.

For example, importing numpy looks like this:

# This line imports the Python interop support in Mojo
from python import Python
# This line imports the numpy module itself
let np = Python.import_module("numpy")

Don’t forget to install the necessary modules in your environment first. For example,

pip install numpy
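
As a quick sanity check that the interop works, here is a minimal sketch of calling numpy from a Mojo fn. It assumes numpy is installed in the active environment, and numpy_check is just an illustrative name of my own, not something from the article's repo.

from python import Python

fn numpy_check() raises:
    let np = Python.import_module("numpy")
    let arr = np.arange(5)   # a regular numpy array, held as a PythonObject
    print(np.mean(arr))      # 2.0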

Get the dataset

Variables in Mojo can be declared with two keywords: let and var. The main difference between them is that "let" declares an immutable variable, while "var" declares a mutable one.
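
A tiny sketch of the difference (this reflects Mojo as of late 2023, and variables_demo is just an illustrative name):

fn variables_demo():
    let base = 10    # immutable: re-assigning base would be a compile-time error
    var counter = 0  # mutable: can be re-assigned later
    counter += 1
    print(base + counter)  # 11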

In Mojo, there are two ways to declare a function: with the def or the fn keyword.

More detailed information about the difference between the keywords can be found here: https://docs.modular.com/mojo/manual/functions.html
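
Roughly speaking, fn enforces strict, typed signatures, while def behaves more like Python. Here is a minimal sketch of both (valid for Mojo as of late 2023; add and add_loose are just illustrative names):

# fn requires explicit argument and return types; arguments are immutable by default
fn add(x: Int, y: Int) -> Int:
    return x + y

# def is more permissive: untyped arguments are treated as dynamic objects
def add_loose(x, y):
    return x + y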

Here is a function that returns the dataset for further modeling:

fn get_data() raises -> PythonObject:
    # import the sklearn.datasets submodule
    # (it is possible to import a whole submodule this way)
    let datasets = Python.import_module("sklearn.datasets")
    let ds = datasets.load_diabetes()
    return ds
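
Before moving on, here is a small optional sketch to peek at what the returned object contains. It reuses get_data from above; inspect_data is just an illustrative name of mine, not part of the original project.

fn inspect_data() raises:
    let ds = get_data()
    # the sklearn Bunch object exposes the feature names and the data matrix as attributes
    print(ds.feature_names)   # ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    print(ds.data.shape)      # (442, 10)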

Function to split the data, build an ML model, and evaluate its performance

fn main() raises:
    # get the data from the previous function
    let ds = get_data()
    let X = ds['data']
    let y = ds['target']
    # import the sklearn module needed for the train/test split
    let model_selection = Python.import_module("sklearn.model_selection")
    let split = model_selection.train_test_split(X, y)
    # unpacking looks much clumsier than in Python
    let X_train = split[0]
    let X_test = split[1]
    let y_train = split[2]
    let y_test = split[3]
    # the modeling process
    let linear_model = Python.import_module("sklearn.linear_model")
    var regr = linear_model.LinearRegression()
    regr.fit(X_train, y_train)
    let pred = regr.predict(X_test)

Metrics and results

Let’s evaluate the regression model and see the results. These lines continue inside fn main(), right after the prediction step:

    # continuing inside fn main(); numpy is imported here for the mean of the target
    let np = Python.import_module("numpy")
    let metrics = Python.import_module("sklearn.metrics")
    print("mean")
    print(np.mean(y_test))
    print("mse")
    print(metrics.mean_squared_error(y_test, pred))
    print("mae")
    print(metrics.mean_absolute_error(y_test, pred))
    print("r2")
    print(metrics.r2_score(y_test, pred))

The results look like this

Warning: your results may differ from those given here because the data is split into training and test samples at random, which also affects the evaluation.

Brief metrics analysis and explanation:

  • Mean shows the mean value of the target in the test split
  • A huge mse value compared to a relatively small mae value may indicate outliers in the test sample: squaring makes large errors dominate, so a single error of 40 adds 40 to the absolute total but 1,600 to the squared total.
  • The R-squared metric shows how much of the variance around the mean is explained by the model. In our case it is approximately 0.6, which means the model is worth considering.

Mojo vs Python

I’ve rewritten the same code in Python. To keep the comparison fair, I used the same settings in Python (trust me, you can do much more with Python).

Here is the code:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

def get_data():
    """returns dataset for modeling"""
    return load_diabetes()


def main():
    """performs all the modeling actions"""
    ds = get_data()
    X, y = ds['data'], ds['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    regr = LinearRegression()
    regr.fit(X_train, y_train)
    pred = regr.predict(X_test)
    print("mean")
    print(np.mean(y_test))
    print("mse")
    print(mean_squared_error(y_test, pred))
    print("mae")
    print(mean_absolute_error(y_test, pred))
    print("r2")
    print(r2_score(y_test, pred))


if __name__ == "__main__":
    main()

As you can see, the modeling results in Python are comparable.

Challenges

The challenges below are valid as of December 2023. By the time you are reading this article, they may already have been fixed.

  1. For now, the hardest part is that it’s almost impossible to pass keyword arguments to a Python function from Mojo

    data = load_diabetes(as_frame=True)


    At the moment, I can’t pass parameters like as_frame=True in Mojo code. This can make it difficult to use and tune more complex models (a possible workaround is sketched after this list).

  2. Judging by the information on GitHub, this feature has already been requested and is waiting to be implemented

    issue link - https://github.com/modularml/mojo/issues/702

  3. I couldn't unpack values in “the Python way”; I can't write a line of code like this:

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    

I could only do something like this:

let split = model_selection.train_test_split(X, y)
let X_train = split[0]
let X_test = split[1]
let y_train = split[2]
let y_test = split[3]
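
Regarding the first challenge, one possible (and admittedly hacky) workaround might be to build the whole call as a Python expression string and hand it to Python.evaluate, so that the Python interpreter itself handles the keyword argument. The sketch below is untested and only illustrates the idea; load_with_kwargs is a hypothetical helper name of mine.

from python import Python

fn load_with_kwargs() raises -> PythonObject:
    # untested sketch: Python.evaluate runs a Python *expression*, so the import
    # has to happen inside the expression via __import__ instead of an import statement
    return Python.evaluate(
        "__import__('sklearn.datasets').datasets.load_diabetes(as_frame=True)"
    )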

Conclusion

Mojo is a new programming language that is still under development. Today, it is already possible to write code for a simple regression model. Some features still need to be added for more serious development, but I strongly believe the new language has potential and will keep gaining popularity in the future.

