I have developed a fascinating dataset: a list of users, their installed applications, user gender, and statistics on the gender distribution for apps.

Dataset DOI: 10.34740/KAGGLE/DSV/2309388

For a successful advertising campaign, working with a segment is vital, and knowing the user’s gender can greatly simplify segment selection. I will tell you how collecting statistics on applications allows ML to predict a user’s gender.

Data

Two new files have been added to the dataset:

- users.csv: a list of users with their most likely gender and a list of several installed applications.
- bundles_gender.csv: the gender distribution of users in the application.

Pay attention to the cnt field: it shows the number of users who have this app installed and whose gender we know, so we can collect statistics for the app. This field can also be used as a measure of confidence in the information about the application.

EDA

First of all, it is interesting to look at how gender is distributed among devices.

One might expect the devices to be roughly equally divided, but this has not happened. Therefore, I hypothesize that women are less likely to indicate their gender in the app’s settings. Perhaps this is influenced by the fact that fewer applications are made exclusively for a female audience than for a male audience. The following picture indirectly confirms this.

Let’s look at the histogram. The first one will be without additional filters.

Almost nothing is visible except for symmetrical and pronounced peaks. Let’s take a closer look at one of the peaks.

genders_df[
    (genders_df['F'] >= 0.3325) &
    (genders_df['F'] <= 0.3375)
].describe()

If you ignore the outlier, you can see that most of the applications in this subsample are extremely rare, which leads to a large number of identical values. Let’s try to keep only those applications that are encountered more than 10 times.

Peaks are still visible but not so clear. Increasing the threshold to 50 almost eliminates the peaks.

The graph clearly shows that there are fewer applications with a female audience.

New Features

Let’s create a few additional features that carry extra information. It can be assumed that the number of installed applications can be helpful.

users_df['apps_count'] = users_df['ids'].apply(len)
users_df.groupby('gend')['apps_count'].describe()

You can see that women, on average, install more apps on their devices.

I have data about users, their gender and installed apps, and information about the distribution of gender for these applications. Is there a correlation between this data? It is logical to assume that there is, but how strong is this correlation?

# Female share per application id
g_dict = genders_df['F'].to_dict()
# Mean female share over the user's apps that have statistics.
# An explicit None check replaces filter(None.__ne__, ...), which
# relies on the truthiness of NotImplemented.
users_df['F_prob'] = users_df['ids'].apply(
    lambda x: np.mean(
        [v for v in map(g_dict.get, x) if v is not None]
    )
)

Instead of the average, you can use more complex methods, but for the initial analysis, this is quite enough.

np.corrcoef(
    users_df['F_prob'],
    users_df['gend'].astype('category').cat.codes
)[0, 1]

The correlation turned out to be substantial (the sign simply reflects how the gender categories are encoded).

-0.46602945129982887

The histogram shows that users are well divided into two groups.

Baseline

For conclusions and assessments, I need a baseline model. Therefore, I choose the simplest approach.

# Numeric gender codes used as the ground truth
y_true = users_df['gend'].astype('category').cat.codes
print(f"Accuracy: {accuracy_score(y_true, users_df['F_prob'] < 0.5)}")
print(f"AUC     : {1 - roc_auc_score(y_true, users_df['F_prob'])}")

Even such a naive approach gives a good result, but let’s try to improve it further.

Accuracy: 0.740925288445762
AUC     : 0.7793767183917958

Train and Test

Since the dataset with users is large, I can set aside a subset on which the models will be checked and compared.

train, test = train_test_split(
    users_df, train_size=0.7,
    random_state=0, stratify=users_df['gend'])

Logistic Regression

First, I’ll try the simplest and most common method: logistic regression. But for this, we need numeric features instead of lists of ids. Again, I can use the simplest method, binarization. But there is an obvious problem: the number of unique ids.

109186

Fortunately, the resulting binarized data is sparse, which allows the use of sparse matrices.

mlb = MultiLabelBinarizer(sparse_output=True)
mlb.fit(users_df['ids'])
train_mlb = mlb.transform(train['ids'])
test_mlb = mlb.transform(test['ids'])

I use the OOF (Out-of-Fold) approach to obtain reliable results and reduce the influence of randomness when dividing into training and validation subsamples. I don’t use third-party libraries and wrote a simple function. Please note that splitting the dataset into folds must be stratified.

def get_oof_lr(n_folds, x_train, y, x_test, seeds):
    ntrain = x_train.shape[0]
    ntest = x_test.shape[0]

    # One slice of class probabilities per seed
    oof_train = np.zeros((len(seeds), ntrain, 2))
    oof_test = np.zeros((ntest, 2))
    oof_test_skf = np.empty((len(seeds), n_folds, ntest, 2))
    models = {}

    for iseed, seed in enumerate(seeds):
        kf = StratifiedKFold(
            n_splits=n_folds,
            shuffle=True,
            random_state=seed)
        for i, (tr_i, t_i) in enumerate(kf.split(x_train, y)):
            print(f'\nSeed {seed}, Fold {i}')
            x_tr = x_train[tr_i, :]
            y_tr = y[tr_i]
            x_te = x_train[t_i, :]
            y_te = y[t_i]
            model = LogisticRegression(
                random_state=seed,
                max_iter=10000,
                verbose=1,
                n_jobs=20)
            model.fit(x_tr, y_tr)
            # Predict only the held-out fold
            oof_train[iseed, t_i, :] = \
                model.predict_proba(x_te)
            print(f"AUC: {roc_auc_score(y_te, oof_train[iseed, t_i, :][:, 1])}")
            oof_test_skf[iseed, i, :, :] = \
                model.predict_proba(x_test)
            models[(seed, i)] = model
    # Average test predictions over folds, then over seeds
    oof_test[:, :] = oof_test_skf.mean(axis=1).mean(axis=0)
    oof_train = oof_train.mean(axis=0)
    return oof_train, oof_test, models

AUC:
Seed 0, Fold 0: 0.8752592302937795
Seed 0, Fold 1: 0.8741272807694727
Seed 0, Fold 2: 0.8754404425783484
Seed 0, Fold 3: 0.8750862228494931
Seed 0, Fold 4: 0.8767777821454008
Seed 42, Fold 0: 0.876839970445301
Seed 42, Fold 1: 0.8771914077769174
Seed 42, Fold 2: 0.8762049208242458
Seed 42, Fold 3: 0.8725705419477277
Seed 42, Fold 4: 0.8731672122759209
Seed 888, Fold 0: 0.8752996641300741
Seed 888, Fold 1: 0.8749304780764804
Seed 888, Fold 2: 0.8762614986655877
Seed 888, Fold 3: 0.8765240184267109
Seed 888, Fold 4: 0.8725618258459555

Let’s check the prediction on the test subsample.

Accuracy: 0.8208932240918818
AUC     : 0.8798990678456793

I would say that the difference is big compared to the baseline. I assume that the quality can be increased further by tuning the hyperparameters; let that be the reader’s homework.

CatBoost #1

When I look at the ids feature, I see a list of tokens. Why not try working with this data like plain text?

I chose CatBoost as the free library for the model. CatBoost is a high-performance, open-source library for gradient boosting on decision trees. From release 0.19.1, it supports text features for classification on GPU out-of-the-box. The main advantage is that CatBoost can handle categorical features and text features in your data without additional preprocessing. You can find more detail about text features in the article Unconventional Sentiment Analysis: BERT vs. Catboost.

!pip install catboost

Let’s write a function to initialize and train the model.

def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        task_type='GPU',
        iterations=10000,
        eval_metric='AUC',
        od_type='Iter',
        od_wait=1000,
        learning_rate=0.1,
        **kwargs
    )
    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=1000,
        plot=False,
        use_best_model=True)

Unfortunately, in the current version of CatBoost, it is impossible to use a list of already prepared tokens. Therefore, let’s do a little trick: turn the feature into text and use it to create a model.

users_df['ids_txt'] = \
    users_df['ids'].apply(lambda x: " ".join([str(i) for i in x]))

As with logistic regression, I make an OOF prediction.

columns = ['ids_txt', 'apps_count']
oof_train_cb, oof_test_cb, models_cb = get_oof_cb(
    n_folds=5,
    x_train=train[columns],
    y=train['gend'].values,
    x_test=test[columns],
    text_features=['ids_txt'],
    seeds=[0, 42, 888]
)

Quality metrics on the test subsample are better than with logistic regression.

Accuracy: 0.8218224490121011
AUC     : 0.8856101448105046

Interestingly, two completely different approaches give very similar results. In such a situation, it is logical to assume that combining the methods will give a synergistic effect.

CatBoost #2

As a new feature, I’ve added OOF predictions from the logistic regression model. In addition, do not forget about the F_prob feature, which worked well for the base model.

columns = ['ids_txt', 'F_prob', 'lr', 'apps_count']
oof_train_cb_2, oof_test_cb_2, models_cb_2 = get_oof(
    n_folds=5,
    x_train=train_2[columns],
    y=train_2['gend'].values,
    x_test=test_2[columns],
    text_features=['ids_txt'],
    seeds=[0, 42, 888]
)

I can say that the model predicts the gender of the user well, with an AUC above 0.9, using only information about the applications installed on the device.

Accuracy: 0.836950230713273
AUC     : 0.9010077023800467

Summary

In this story, I:

- introduced a new free dataset;
- did exploratory data analysis;
- created several new features;
- created several models for predicting the gender of a user of a mobile device.

All this required accumulating statistics about the applications people use and about the gender distribution among each application’s users.

The code from the article can be viewed here.
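
As a supplement to the New Features and Baseline sections, here is a minimal, self-contained sketch of how the F_prob feature and the 0.5-threshold baseline could be computed. It uses plain Python instead of pandas so it runs anywhere. The app ids, the shares in g_dict, and the helper name female_share are made up for illustration; only the names ids, gend, and F_prob follow the article.

```python
from math import isnan, nan
from statistics import fmean

# Hypothetical female share per app id (the role of the 'F' column).
g_dict = {101: 0.9, 102: 0.2, 103: 0.6}

# Toy users: installed app ids and the known gender.
users = [
    {"ids": [101, 103], "gend": "F"},
    {"ids": [102, 999], "gend": "M"},  # app 999 has no statistics
    {"ids": [101, 102, 103], "gend": "F"},
]


def female_share(app_ids, shares):
    """Mean female share over the apps that have statistics (NaN if none)."""
    known = [shares[a] for a in app_ids if a in shares]
    return fmean(known) if known else nan


for user in users:
    user["F_prob"] = female_share(user["ids"], g_dict)
    # Naive baseline: predict "F" when the mean share is at least 0.5.
    # A NaN share (no known apps) fails the comparison and falls back to "M".
    user["pred"] = "F" if user["F_prob"] >= 0.5 else "M"

# Share of users whose predicted gender matches the known one.
accuracy = sum(u["pred"] == u["gend"] for u in users) / len(users)
```

Apps without statistics are skipped rather than counted as 0, so a single unknown app does not drag the average down.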

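To make the OOF procedure from the Logistic Regression section more concrete, the sketch below walks through the same mechanics on toy data: stratified folds, a prediction for each row from a model that never saw it, and averaging over several seeds. It is a simplified illustration under stated assumptions, not the article's implementation: stratified_folds is a hand-written stand-in for scikit-learn's StratifiedKFold, and the "model" is a hypothetical constant predictor that returns the share of class 1 in its training part.

```python
import random
from collections import defaultdict
from statistics import fmean


def stratified_folds(labels, n_folds, seed):
    """Assign each row to a fold so every fold keeps a similar class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled rows of one class round-robin across folds.
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % n_folds
    return fold_of


def oof_predictions(labels, n_folds, seeds):
    """Average out-of-fold predictions of a toy constant model over seeds."""
    n = len(labels)
    per_seed = []
    for seed in seeds:
        fold_of = stratified_folds(labels, n_folds, seed)
        oof = [0.0] * n
        for fold in range(n_folds):
            train_idx = [i for i in range(n) if fold_of[i] != fold]
            held_out = [i for i in range(n) if fold_of[i] == fold]
            # "Fit": the toy model memorises the share of class 1.
            prior = fmean(labels[i] for i in train_idx)
            # "Predict" only for rows the model has not seen.
            for i in held_out:
                oof[i] = prior
        per_seed.append(oof)
    # Average the per-seed predictions row by row.
    return [fmean(col) for col in zip(*per_seed)]


labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
oof = oof_predictions(labels, n_folds=5, seeds=[0, 42, 888])
```

Because every row is predicted by a model trained without it, the averaged OOF values can be used as a feature for a second model, which is how the lr column is used in CatBoost #2.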
Gender Prediction Using Mobile App Data
