Hackernoon logoHow To Predict Water Pumps Failure in Tanzania using CatBoost Library by@sagol

How To Predict Water Pumps Failure in Tanzania using CatBoost Library

Taras Baranyuk Hacker Noon profile picture

@sagolTaras Baranyuk

17+ years of experience in creating software products in various positions.

Data Mining and a Simple Starter Model

This article is based on the competition. In the largest country in East Africa, with about 60 million people, half of the population does not have access to clean water. Billions of dollars in foreign aid are being provided to the country to tackle the freshwater problem. However, the government cannot solve this problem. A significant part of water pumps is entirely out of order or practically does not function; the others require repair.


The data has many characteristics associated with water pumps. The water supply points were divided into functional, non-functional and functional but in need of repair. The goal of the competition is to build a model that predicts the functionality of water supply points.

The modelling data has 59400 rows and 40 columns without the label that comes in a separate file.

The metric used for this competition is the classification rate, which calculates the percentage of rows where the predicted class in the submission matches the actual class in the test set. The maximum is 1, and the minimum is 0. The goal is to maximize the classification rate.

EDA (Exploratory Data Analysis)

A detailed description of each feature in the dataset can be found on the competition page.

First of all, let’s look at the target — the classes don’t have an even distribution.

The small number of labels for water pumps in need of repair. We will not solve this issue but use the appropriate metric when creating the model and library capabilities.

Let’s see how the water pumps are distributed across the territory of the country.

It is known that some functions contain empty values - let's see them on the chart.

We can see that there are very few rows with missing values, with scheme_name having the largest number.
The following heatmap represents the presence/absence relationships between variables. It is worth paying attention to the correlation between permitinstaller and funder.

Let’s see the general picture of the relationships on the dendrogram.

In the characteristics of water pumps, there is one that shows the amount of water. We can check how the water amount is related to the pumps’ condition (quantity_group).

There are many wells with sufficient water that are not functioning. From the point of view of investment efficiency, it is logical to focus on repairing this particular group in the first place. Also, it is observed that most dry pumps are not working. By finding a solution to fill these wells again with water, they can probably be functional.

Does water quality affect the condition of the water pumps? We can see the data grouped by quality_group.

Unfortunately, this graph is not very informative, since the number of sources with good water prevails. Let’s try to group only for sources with less quality water.

Most pumps with an unknown quality_group are non-functional.

There is another attractive characteristic of waterpoints — their type (waterpoint_type_group).

Analysis of the data by waterpoints shows that the group with other types contains many inoperative pumps. Are they outdated? We can check how the year the pump was constructed affects.

The older the waterpoint, the higher the probability that it is not functioning, mostly before the 80s.

Now we will try to get insights from the information about the funding organizations. The condition of the wells should be correlated with funding. Consider only organizations that fund more than 500 waterpoints.

Danida — they have many working water points, the percentage of broken ones is very high. Similar situation with RWSSP(Rural Water Supply and Sanitation Program), Dhv and a few more. It should be noted that most of the wells financed by the German Republic and by Private Individuals are mostly in working state. In contrast, a large number of wells that are financed by the state are not functioning. Most of the water points established by the central government and district council are also not working.

Let us consider the hypothesis that the water’s purity and the water basin to which the well belongs can influence the functioning. First of all, let’s look at the water basins.

Two basins stand out strongly — Reuben and Lake Rukwa. The number of broken water points there is the majority.

It is known that some of the wells are not free. We can assume that payments can positively affect keeping the pumps in working order.

The hypothesis is fully confirmed — payment for water helps to keep the source in a working state.

Let's build a simple DecisionTreeClassifier with a 4 depth and see how this tree looks.

from sklearn import tree
from dtreeviz.trees import *
from sklearn.utils.class_weight import compute_sample_weight

clf = tree.DecisionTreeClassifier(max_depth=4, random_state=42)

y_train = df['labels']
X_train = df.drop('labels', axis=1)
sample_weight = compute_sample_weight(
    y = y_train)

clf.fit(X_train, y_train, sample_weight=sample_weight)

    clf, x_data=X_train, y_data=y_train, target_name='labels',
    class_names=["functional", "non functional",
        "functional needs repair"],
    title="Decision Tree")

Here you can look at the full-size image.

The data contains numeric information that we can look at and maybe find something interesting in addition to categorical parameters.

Part of the data was filled with 0 values instead of real data. We can also see that amount_tsh is higher in workable water points (label = 0). Also, you should pay attention to the outliers in the amount_tsh feature. As a feature, one can note the difference in elevation and the fact that a significant part of the population lives 500 meters above the mean sea level.

Data Cleaning

Before starting to create a model, we need to clean and prepare the data.

  1. The installer feature contains many repetitions with different cases, spelling errors and abbreviations. Let’s put everything in lowercase first. Then, using simple rules, we reduce the number of mistakes and do the grouping.
  2. After cleaning, we replace any items that occur less than 71 times (0.95 quantiles) with ‘other’ items.
  3. We repeat by analogy with the funder feature. The cut-off threshold is 98.
  4. The data contains features with very similar categories. Let’s choose only one of them. Since there is not much data in the dataset, we leave the feature with the smallest categories. Delete scheme_managementquantity_groupwater_qualitypayment_typeextraction_typewaterpoint_type_groupregion_code.
  5. Replace the latitude and longitude values of outliers with the corresponding region_code median values.
  6. A similar technique for replacing missing values is applicable for subvillage and scheme_name.
  7. Missing values in public_meeting and permit are replaced with median values.
  8. For subvillagepublic_meetingscheme_namepermit, we can create different binary features that show missing values.
  9. The features scheme_managementquantity_groupwater_qualityregion_codepayment_typeextraction_typewaterpoint_type_groupdate_recorded, and recorded_by can be deleted is either duplicate information or it is useless.


The data contains a large number of categorical features. The most suitable for obtaining a base-line model, in my opinion, is CatBoost. It is a high-performance, open-source library for gradient boosting on decision trees.

We will not select the optimal parameters; let it be homework. Let’s write a function to initialize and train the model.

def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
return model.fit(

For the evaluation, AUC was chosen because the data is highly unbalanced, and this metric is the best for such cases.

For the target metric, we can write our function.

def classification_rate(y, y_pred):
    return np.sum(y==y_pred)/len(y)

Since there is little data, it is not great to split the dataset into train and validation parts. In this case, it is better to use OOF (Out-of-Fold) predictions. We will not use third-party libraries; let’s try to write a simple function. Please note that splitting the dataset into folds must be stratified.

def get_oof(n_folds, x_train, y, x_test, cat_features, seeds):
    ntrain = x_train.shape[0]
    ntest = x_test.shape[0]  

    oof_train = np.zeros((len(seeds), ntrain, 3))
    oof_test = np.zeros((ntest, 3))
    oof_test_skf = np.empty((len(seeds), n_folds, ntest, 3))
    test_pool = Pool(data=x_test, cat_features=cat_features) 
    models = {}
    for iseed, seed in enumerate(seeds):
        kf = StratifiedKFold(
        for i, (train_index, test_index) in enumerate(kf.split(x_train, y)):
            print(f'\nSeed {seed}, Fold {i}')
            x_tr = x_train.iloc[train_index, :]
            y_tr = y[train_index]
            x_te = x_train.iloc[test_index, :]
            y_te = y[test_index]
            train_pool = Pool(
                data=x_tr, label=y_tr, cat_features=cat_features)
            valid_pool = Pool(
                data=x_te, label=y_te, cat_features=cat_features)
            model = fit_model(
                train_pool, valid_pool,
            oof_train[iseed, test_index, :] = model.predict_proba(x_te)
            oof_test_skf[iseed, i, :, :] = model.predict_proba(x_test)
            models[(seed, i)] = model
oof_test[:, :] = oof_test_skf.mean(axis=1).mean(axis=0)
    oof_train = oof_train.mean(axis=0)
    return oof_train, oof_test, models

To reduce the dependence on splitting randomness, we will set several different seeds to calculate predictions.

The learning curves look incredibly optimistic, and the model should look good.

Having looked at the importance of the model’s features, we can make sure that there is no obvious leak.

After averaging the predictions:

balanced accuracy: 0.6703822994494413
classification rate: 0.8198316498316498

This result was obtained when uploading predictions on the competition website.

Considering that the top5 result was only about 0.005 better at the time of this writing, we can say that the base-line model is good.

To ensure that all the work on the analysis and data cleaning was not done in vain, we will build a model based solely on the data. The only thing we'll do is fill in the missing values with zeros.

balanced accuracy: 0.6549535670689709
classification rate: 0.8108249158249158

The result is noticeably worse.


In this post, we:

  1. got acquainted with the data and looked for insights that can lead to thoughts for feature generation;
  2. cleaned up and prepared the provided data to create the model;
  3. decided to use CatBoost, since the bulk of the features are categorical;
  4. wrote a function for OOF-prediction;
  5. got an excellent result for the base-line model.

The right approach to data preparation and choosing the right tools for creating a model can give great results even without making additional features.

As a homework assignment, I suggest adding new features, choosing the model’s optimal parameters, using other libraries for gradient boosting, and building ensembles from the resulting models.

The code from the article can be viewed here.

Also published on Dev.to

Taras Baranyuk Hacker Noon profile picture

@sagolTaras Baranyuk

Read my stories

17+ years of experience in creating software products in various positions.


Join Hacker Noon

Create your free account to unlock your custom reading experience.