Data Mining and a Simple Starter Model

This article is based on the competition. In Tanzania, the largest country in East Africa, with about 60 million people, half of the population does not have access to clean water. Billions of dollars in foreign aid are being provided to the country to tackle the freshwater problem. However, the government has not been able to solve it. A significant share of the water pumps are entirely out of order or barely function; the others require repair.

Data

The data has many characteristics associated with water pumps. The water supply points are divided into functional, non-functional, and functional but in need of repair. The goal of the competition is to build a model that predicts the functionality of water supply points.

The modelling data has 59400 rows and 40 columns, not counting the label, which comes in a separate file.

The metric used for this competition is the classification rate: the share of rows where the predicted class in the submission matches the actual class in the test set. The maximum is 1, and the minimum is 0. The goal is to maximize the classification rate.

EDA (Exploratory Data Analysis)

A detailed description of each feature in the dataset can be found on the competition page.

First of all, let's look at the target. The classes are not evenly distributed: there are few labels for water pumps in need of repair. We will not rebalance the classes; instead, we will use an appropriate metric and the library's built-in capabilities when creating the model.

Let's see how the water pumps are distributed across the territory of the country.

It is known that some features contain empty values; let's see them on the chart. Only a few features have missing values, with scheme_name having the largest number.

The following heatmap represents the presence/absence relationships between variables. It is worth paying attention to the correlation between permit, installer and funder.

Let's see the general picture of the relationships on the dendrogram.

Among the characteristics of water pumps, there is one that shows the amount of water. We can check how the water amount (quantity_group) is related to the pumps' condition.

There are many wells with sufficient water that are not functioning. From the point of view of investment efficiency, it is logical to focus on repairing this particular group first. Most dry pumps are also not working. If a way is found to fill these wells with water again, they can probably become functional.

Does water quality affect the condition of the water pumps? We can see the data grouped by quality_group.

Unfortunately, this graph is not very informative, since sources with good water dominate. Let's group only the sources with lower-quality water.

Most pumps with an unknown quality_group are non-functional.

Another interesting characteristic of waterpoints is their type (waterpoint_type_group).

Analysis of the data by waterpoint type shows that the group of "other" types contains many inoperative pumps. Are they outdated? We can check how the year the pump was constructed affects its condition.

The older the waterpoint, the higher the probability that it is not functioning, especially for those built before the 1980s.

Now we will try to get insights from the information about the funding organizations. The condition of the wells should be correlated with funding. Consider only organizations that fund more than 500 waterpoints.

Danida funds many working water points, but its percentage of broken ones is very high. The situation is similar for RWSSP (Rural Water Supply and Sanitation Program), Dhv and a few more. Most of the wells financed by the German Republic and by private individuals are in a working state. In contrast, a large number of wells financed by the state are not functioning.
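The funder comparison above can be reproduced with a simple aggregation. The following is only a minimal sketch using the standard library so that it runs anywhere; the function name `status_share_by_funder`, the `(funder, status)` pair layout, and the toy rows are assumptions made for illustration and are not taken from the article's code, which presumably works on a pandas DataFrame.

```python
from collections import Counter, defaultdict


def status_share_by_funder(rows, min_waterpoints=500):
    """Return {funder: {status: share}} for funders with more than
    `min_waterpoints` waterpoints.

    `rows` is an iterable of (funder, status) pairs (hypothetical layout).
    """
    counts = defaultdict(Counter)
    for funder, status in rows:
        counts[funder][status] += 1

    shares = {}
    for funder, status_counts in counts.items():
        total = sum(status_counts.values())
        # Keep only the large funders, as the text does with its 500 cut-off.
        if total > min_waterpoints:
            shares[funder] = {s: n / total for s, n in status_counts.items()}
    return shares


# Toy data invented for illustration; the threshold is lowered to 3.
rows = (
    [("donor_a", "functional")] * 3
    + [("donor_a", "non functional")] * 1
    + [("donor_b", "functional")] * 2
)
shares = status_share_by_funder(rows, min_waterpoints=3)
print(shares)
```

Comparing shares rather than raw counts is what makes a funder with many working pumps but a high failure percentage stand out.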
Most of the water points established by the central government and district councils are also not working.

Let us consider the hypothesis that the water's purity and the water basin to which the well belongs can influence the functioning. First of all, let's look at the water basins.

Two basins stand out strongly: Ruvuma and Lake Rukwa. Broken water points are the majority there.

It is known that some of the wells are not free. We can assume that payments help keep the pumps in working order.

The data support this hypothesis: payment for water is associated with sources being kept in a working state.

Let's build a simple DecisionTreeClassifier with a depth of 4 and see how this tree looks.

from sklearn import tree
# Older dtreeviz API (dtreeviz.trees), as used in the original article
from dtreeviz.trees import dtreeviz
from sklearn.utils.class_weight import compute_sample_weight

clf = tree.DecisionTreeClassifier(max_depth=4, random_state=42)

y_train = df['labels']
X_train = df.drop('labels', axis=1)
# Weight samples inversely to class frequency to compensate for the imbalance
sample_weight = compute_sample_weight(
    class_weight='balanced',
    y=y_train)

clf.fit(X_train, y_train, sample_weight=sample_weight)

dtreeviz(
    clf, x_data=X_train, y_data=y_train, target_name='labels',
    feature_names=X_train.columns.tolist(),
    class_names=["functional", "non functional", "functional needs repair"],
    title="Decision Tree")

Here you can look at the full-size image.

In addition to the categorical parameters, the data contains numeric information that we can look at and maybe find something interesting in. Part of the data was filled with 0 values instead of real data.

We can also see that amount_tsh is higher in workable water points (label = 0). You should also pay attention to the outliers in the amount_tsh feature. Another thing worth noting is the difference in elevation and the fact that a significant part of the population lives 500 meters above mean sea level.

Data Cleaning

Before starting to create a model, we need to clean and prepare the data.

The installer feature contains many repetitions with different letter cases, spelling errors and abbreviations. Let's put everything in lowercase first. Then, using simple rules, we reduce the number of mistakes and do the grouping.

After cleaning, we replace any item that occurs fewer than 71 times (the 0.95 quantile) with 'other'.

We repeat the same procedure for the funder feature. The cut-off threshold is 98.

The data contains features with very similar categories. Let's choose only one of them. Since there is not much data in the dataset, we keep the feature with the smallest number of categories. Delete scheme_management, quantity_group, water_quality, payment_type, extraction_type, waterpoint_type_group, region_code.

Replace the latitude and longitude values of outliers with the median values for the corresponding region_code.

A similar technique for replacing missing values is applicable to subvillage and scheme_name.

Missing values in public_meeting and permit are replaced with median values.

For subvillage, public_meeting, scheme_name and permit, we can create separate binary features that flag missing values.

The features scheme_management, quantity_group, water_quality, region_code, payment_type, extraction_type, waterpoint_type_group, date_recorded, and recorded_by can be deleted because they either duplicate information or are useless.

Modelling

The data contains a large number of categorical features. The most suitable library for obtaining a base-line model, in my opinion, is CatBoost. It is a high-performance, open-source library for gradient boosting on decision trees.

We will not select the optimal parameters; let it be homework. Let's write a function to initialize and train the model.

from catboost import CatBoostClassifier, Pool

def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        max_ctr_complexity=5,
        task_type='CPU',
        iterations=10000,
        eval_metric='AUC',
        od_type='Iter',   # overfitting detector: stop after od_wait
        od_wait=500,      # iterations without improvement
        **kwargs
    )

    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=1000,
        plot=False,
        use_best_model=True)

For the evaluation, AUC was chosen because the data is highly unbalanced, and this metric is well suited to such cases.

For the target metric, we can write our own function.

import numpy as np

def classification_rate(y, y_pred):
    return np.sum(y==y_pred)/len(y)

Since there is little data, splitting the dataset into train and validation parts is not ideal. In this case, it is better to use OOF (Out-of-Fold) predictions. We will not use third-party libraries; let's try to write a simple function. Please note that splitting the dataset into folds must be stratified.

from sklearn.model_selection import StratifiedKFold

def get_oof(n_folds, x_train, y, x_test, cat_features, seeds):
    ntrain = x_train.shape[0]
    ntest = x_test.shape[0]

    # One slice of out-of-fold probabilities per seed; 3 is the number of classes
    oof_train = np.zeros((len(seeds), ntrain, 3))
    oof_test = np.zeros((ntest, 3))
    oof_test_skf = np.empty((len(seeds), n_folds, ntest, 3))
    test_pool = Pool(data=x_test, cat_features=cat_features)
    models = {}

    for iseed, seed in enumerate(seeds):
        kf = StratifiedKFold(
            n_splits=n_folds,
            shuffle=True,
            random_state=seed)
        for i, (train_index, test_index) in enumerate(kf.split(x_train, y)):
            print(f'\nSeed {seed}, Fold {i}')
            x_tr = x_train.iloc[train_index, :]
            y_tr = y[train_index]
            x_te = x_train.iloc[test_index, :]
            y_te = y[test_index]
            train_pool = Pool(
                data=x_tr, label=y_tr, cat_features=cat_features)
            valid_pool = Pool(
                data=x_te, label=y_te, cat_features=cat_features)
            model = fit_model(
                train_pool, valid_pool,
                loss_function='MultiClass',
                random_seed=seed
            )
            oof_train[iseed, test_index, :] = model.predict_proba(x_te)
            oof_test_skf[iseed, i, :, :] = model.predict_proba(x_test)
            models[(seed, i)] = model

    # Average test predictions over folds, then over seeds
    oof_test[:, :] = oof_test_skf.mean(axis=1).mean(axis=0)
    # Average out-of-fold predictions over seeds
    oof_train = oof_train.mean(axis=0)
    return oof_train, oof_test, models

To reduce the dependence on splitting randomness, we will set several different seeds to calculate predictions.

The learning curves look very optimistic, so the model should perform well.

Having looked at the importance of the model's features, we can make sure that there is no obvious leak.

After averaging the predictions:

balanced accuracy: 0.6703822994494413
classification rate: 0.8198316498316498

This result was obtained when uploading predictions on the competition website.

Considering that the top5 result was only about 0.005 better at the time of this writing, we can say that the base-line model is good.

To ensure that all the work on the analysis and data cleaning was not done in vain, we will build a model based solely on the original, uncleaned data. The only thing we'll do is fill in the missing values with zeros.

balanced accuracy: 0.6549535670689709
classification rate: 0.8108249158249158

The result is noticeably worse.

Summary

In this post, we:

- got acquainted with the data and looked for insights that can lead to ideas for feature generation;
- cleaned up and prepared the provided data to create the model;
- decided to use CatBoost, since the bulk of the features are categorical;
- wrote a function for OOF prediction;
- got an excellent result for the base-line model.

The right approach to data preparation and choosing the right tools for creating a model can give great results even without making additional features.

As a homework assignment, I suggest adding new features, choosing the model's optimal parameters, using other libraries for gradient boosting, and building ensembles from the resulting models.

The code from the article can be viewed here.

Also published on Dev.to
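The article reports a balanced accuracy and a classification rate after averaging the predictions, but does not show how the averaged probabilities are turned into those two numbers. A minimal sketch of one way to do it follows, assuming integer class labels 0, 1, 2 and a row of three probabilities per sample; the helper names `predicted_classes` and `balanced_accuracy` and the toy values are illustrative assumptions, and balanced accuracy is taken here in its usual sense of the mean of per-class recall.

```python
def predicted_classes(probabilities):
    """Pick the index of the most probable class for every row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


def classification_rate(y_true, y_pred):
    """Share of rows where the predicted class equals the actual class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy averaged probabilities for 3 classes (values invented for illustration).
oof_probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.1, 0.3],  # a rare class 2 sample that is predicted as class 0
]
y_true = [0, 0, 0, 1, 2]
y_pred = predicted_classes(oof_probs)

print(classification_rate(y_true, y_pred))
print(balanced_accuracy(y_true, y_pred))
```

On imbalanced data the two metrics can diverge: a missed rare class barely moves the classification rate but pulls balanced accuracy down, which is consistent with the gap between the two figures reported above.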

How To Predict Water Pumps Failure in Tanzania using CatBoost Library
