Piotr Płoński


MLJAR Academy: Start with Machine Learning

Machine Learning is a hot topic nowadays. It has become a buzzword that everyone tries to use to look smarter. I asked a friend what type of algorithm his company uses for classification, and he said:

“To be honest, I don’t have ML experience. I’m a tech guy. But to sound smart, I would probably say: we are using a complex machine learning algorithm based on an artificial neural network.”

Actually, it sounds quite good :-) To help people satisfy their hunger for machine learning knowledge, I will post a series about Machine Learning, and predictive analytics in particular.

Machine Learning according to Wikipedia:

Machine learning is the subfield of computer science that … gives “computers the ability to learn without being explicitly programmed.” Evolved from the study of pattern recognition and computational learning theory in artificial intelligence, machine learning explores the study and construction of algorithms that can learn from and make predictions on data ...

To put it simply, machine learning allows you to build an algorithm (a recipe) that doesn’t have to be explicitly defined. Is that possible? Yes: everything in this algorithm is learned from your data, so each condition or step is based on your data. The result of learning is an algorithm or, to be more precise, a model. This model holds some knowledge about your data and can serve responses based on input data.

There will always be some kind of response from the model, because otherwise it would be useless. The response can take different forms: numbers, images, audio, or even the internal structure of the model, which can give us some insights. Based on the response you can take actions.

Usually, there are two phases in the life of a machine learning model:

  • Phase 1: Learning. In this phase you train your model by showing it data.
  • Phase 2: Prediction. This is the ‘production’ phase: new data is presented to your model, and it serves responses based on the knowledge gained in phase 1.
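To make the two phases concrete, here is a minimal sketch in plain Python. The `ThresholdModel` class and the tiny data set are made up for illustration; real models (and the ones MLJAR trains) are far more complex, but the fit-then-predict shape is the same.

```python
from dataclasses import dataclass


@dataclass
class ThresholdModel:
    """A toy model: predicts 1 when the input is at or above a learned threshold."""

    threshold: float = 0.0

    def fit(self, values, labels):
        # Phase 1 (learning): try each observed value as a threshold and keep
        # the first one that classifies the most training examples correctly.
        best_correct = -1
        for candidate in sorted(set(values)):
            correct = sum(
                (value >= candidate) == bool(label)
                for value, label in zip(values, labels)
            )
            if correct > best_correct:
                best_correct = correct
                self.threshold = candidate
        return self

    def predict(self, values):
        # Phase 2 (prediction): apply the learned rule to new data.
        return [int(value >= self.threshold) for value in values]


# Made-up training data: hours worked per week and a 0/1 label.
train_hours = [20, 25, 30, 45, 50, 60]
train_labels = [0, 0, 0, 1, 1, 1]

model = ThresholdModel().fit(train_hours, train_labels)
predictions = model.predict([22, 55])
print(model.threshold, predictions)
```

Note that nobody wrote the rule "45 hours or more" by hand. The threshold came out of the data, which is what "learned, not explicitly programmed" means in practice.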

You cannot skip the learning phase, because an untrained model will produce more or less random responses. However, sometimes you can buy access to an already trained model and use it for predictions (for example Clarifai, Algorithmia).

There are many ways to categorize machine learning algorithms. One of them depends on the data available for learning:

  • supervised learning: training with a teacher (supervisor). Your data contains the desired response values, and you teach your model to return them. Some examples: spam classification in emails, object recognition in images, credit risk assessment, churn prediction, stock market price prediction.
  • unsupervised learning: there is no information available to guide the model toward a desired response, so the model needs to figure out the structure of the data by itself. Some examples: user segmentation, visualization of complex data.
  • reinforcement learning: during the learning phase your model interacts with a dynamic environment in which it has a certain goal; it can take actions and receives feedback (rewards and punishments) from the environment. Some examples: self-driving cars or playing a game against an opponent.
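The difference between the first two categories above is easiest to see in what the learning function receives. The sketch below is only an illustration on made-up numbers: the supervised part is a nearest-class-mean classifier, the unsupervised part is a very small two-cluster k-means, and all function names are my own. Reinforcement learning needs an environment to interact with, so it is left out here.

```python
def mean(values):
    return sum(values) / len(values)


def nearest(value, centers):
    # Index of the center closest to value.
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))


def fit_supervised(values, labels):
    # Supervised: every example comes with its desired response (label),
    # so we can simply average the examples of each class.
    classes = sorted(set(labels))
    return [mean([v for v, l in zip(values, labels) if l == c]) for c in classes]


def fit_unsupervised(values, steps=10):
    # Unsupervised: only the values are given, so the two groups have to be
    # discovered (k-means with k=2 and a simple starting guess).
    centers = [min(values), max(values)]
    for _ in range(steps):
        groups = [[], []]
        for v in values:
            groups[nearest(v, centers)].append(v)
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


values = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
labels = [0, 0, 0, 1, 1, 1]

supervised_centers = fit_supervised(values, labels)   # uses the labels
unsupervised_centers = fit_unsupervised(values)       # never sees the labels
print(supervised_centers, unsupervised_centers)
```

On data this clean both approaches end up with the same two centers. On real data the unsupervised groups need not match any labels you have in mind.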

In my opinion the best way for humans to learn is by doing, so after this bit of theory we will jump into a classification task (which is part of supervised learning). OK, let’s train our model!

I’ve prepared several data sets that are good (easy) for a start. They are available on my GitHub. We will use the Adult data set:

  • Link for data: here
  • original data source: here

In this task, we are training a model to predict whether a person’s income exceeds $50k per year based on census data. Let’s look at our dataset:

The top few rows of the Adult dataset.

There are 15 columns and 32,562 rows in our dataset. The first row is a header that describes the meaning of each column, and every other row describes one person. Each person is described with:

  • age (continuous attribute)
  • workclass (categorical attribute)
  • fnlwgt (continuous attribute): the “final weight”, a census sampling weight indicating roughly how many people the record represents
  • education (categorical attribute)
  • education-num (continuous attribute)
  • marital-status (categorical attribute)
  • occupation (categorical attribute)
  • relationship (categorical attribute)
  • race (categorical attribute)
  • sex (categorical attribute)
  • capital-gain (continuous attribute)
  • capital-loss (continuous attribute)
  • hours-per-week (continuous attribute)
  • native-country (categorical attribute)
  • income (categorical attribute)

A continuous attribute means that the values in that column are numbers; for example, in the “age” column we will see only numbers. A categorical attribute means that the column contains strings; for example, “sex” takes the values “male” or “female”. String values will be transformed into numbers (we will not go into preprocessing details right now; they will be covered in the next lessons).
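As a rough preview of what "transforming strings into numbers" can look like, here is a small sketch using only the Python standard library. The three rows are made up (they only borrow a few Adult column names), the helper functions are hypothetical, and this is not necessarily how MLJAR preprocesses data.

```python
import csv
import io

# A few made-up rows using a subset of the Adult column names.
SAMPLE_CSV = """age,workclass,sex,hours-per-week,income
39,State-gov,Male,40,<=50K
50,Self-emp,Male,13,<=50K
38,Private,Female,40,>50K
"""


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def column_types(rows):
    # Treat a column as continuous when every value parses as a number.
    return {
        name: "continuous" if all(is_number(r[name]) for r in rows) else "categorical"
        for name in rows[0]
    }


def encode_categories(rows, name):
    # Map each distinct string to an integer code, assigned in sorted order.
    codes = {value: i for i, value in enumerate(sorted({r[name] for r in rows}))}
    return [codes[r[name]] for r in rows], codes


# DictReader consumes the first line as the header, so it is not counted as data.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
types = column_types(rows)
sex_codes, sex_mapping = encode_categories(rows, "sex")
print(types)
print(sex_mapping, sex_codes)
```

Integer codes like these imply an order between categories that may not exist, which is one reason other encodings (such as one-hot encoding) are often used instead.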

To build the classifier we will use MLJAR, which has an easy web-based UI. You can set up an account there and receive free credits to start (enough to run the experiments from the MLJAR Academy lessons). We start the analysis by creating a new project.

After creating the project, please go to Sources and add a new data source.

After adding the new dataset, please go to Preview and check the data, column types and distributions.

Before training, we need to select which attributes will be used as input for the model (‘Use it’ columns) and which column is the ‘Target’ column.

The target column is used to supervise model training. I will sometimes call the target column the output column, because the model will learn to give responses (output) similar to the target values. After selecting the column usage we need to click ‘Accept Attributes Usage’ at the top of Preview. Now we are ready to start our first Machine Learning experiment! Let’s go to Experiments and add a new experiment.

Wow! This dialog has a lot of options, but don’t worry, many of them are set to smart defaults! In this dialog we will set three things:

  1. Input dataset: it will be our data source
  2. Learning Algorithm: we will select Random Forest
  3. Metric to be optimized: we will set Area Under ROC Curve
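If you are curious what the metric in point 3 measures: Area Under ROC Curve (AUC) can be read as the probability that a randomly chosen positive example gets a higher score from the model than a randomly chosen negative one. The sketch below computes it directly from that definition on made-up scores. It is a simple illustration (the function name is my own), not the code MLJAR runs.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs where the positive scores higher."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up true labels and model scores (higher score = more likely class 1).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

A value of 1.0 means every positive is ranked above every negative, and 0.5 is what you would expect from random scores. The pairwise loop is fine for small examples; libraries use a faster rank-based computation.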

That’s all! We click ‘Create & Start’ and model training starts (in the background we launch AWS instances that will do the job for you). Please don’t worry about the other experiment parameters, such as preprocessing; we will go into details in the next lessons. After creating the experiment you will be redirected to the Results view.

OK, what’s going on? There are several rows in the results table and each represents one model. You can click on a row in the table to see the model’s parameters.

When you train a machine learning algorithm, it usually has many parameters that control the learning process. Their values need to be carefully selected, and MLJAR does this for you.
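To give a feeling for what "selecting parameter values" involves, here is a minimal sketch of the simplest approach: try a few candidate values, score each on data the model did not train on, and keep the best. I use a tiny k-nearest-neighbours classifier as a stand-in because it has a single parameter (k). The data is made up, and I am not claiming this is the search strategy MLJAR uses.

```python
def knn_predict(train, value, k):
    # Majority label (0 or 1) among the k training points closest to value.
    nearest = sorted(train, key=lambda point: abs(point[0] - value))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)


def accuracy(train, validation, k):
    hits = sum(knn_predict(train, v, k) == label for v, label in validation)
    return hits / len(validation)


# Made-up (value, label) pairs; the point (2.5, 1) is deliberately noisy.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (2.5, 1),
         (4.0, 1), (6.0, 1), (7.0, 1), (8.0, 1)]
validation = [(2.4, 0), (2.6, 0), (6.5, 1), (7.5, 1)]

# Score every candidate k on the validation set and keep the best one.
scores = {k: accuracy(train, validation, k) for k in (1, 3, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Here k=1 follows the noisy point too closely and k=7 is dominated by the majority class, so the middle value wins. With more parameters the number of combinations grows quickly, which is why automating the search is useful.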

That’s all for a good start! You made great progress today:

  • you know what Machine Learning is
  • you know what the types of learning are
  • you trained your first classifier with the Random Forest algorithm

In the next lessons we will cover more details:

  • hyper-parameter tuning
  • data preprocessing
  • model validation
  • and many more! So please subscribe at mljar.com so you don’t miss the next posts

I’m really interested in your feedback on the MLJAR platform!

Disclaimer: some of the definitions were simplified to make it easier to explain and understand.
