I have developed a fascinating dataset — a list of users, installed applications, user gender, and statistics on the gender distribution for apps.
DOI: 10.34740/KAGGLE/DSV/2309388
For a successful advertising campaign, working with a segment is vital, and knowing the user's gender greatly simplifies the task of selecting segments.
I will show how collecting statistics on installed applications allows ML to predict a user's gender.
Two new files have been added to the dataset:
users.csv — A list of users with their most likely gender and a list of several installed applications.
bundles_gender.csv — Gender distribution of users in the application.
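To make the later snippets easier to follow, here is a minimal loading sketch; the exact file layout (the ids column stored as a stringified list, a bundle_id index column) is an assumption rather than part of the dataset description.
import ast

import pandas as pd

# users.csv: the user's most likely gender and the list of installed app ids;
# the 'ids' column is assumed to be stored as a string representation of a list
users_df = pd.read_csv('users.csv')
users_df['ids'] = users_df['ids'].apply(ast.literal_eval)

# bundles_gender.csv: per-app gender distribution and the number of users
# with a known gender ('cnt'); the 'bundle_id' index column is an assumption
genders_df = pd.read_csv('bundles_gender.csv', index_col='bundle_id')

genders_df.head()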
Pay attention to the cnt field — it shows the number of users who have the app installed and whose gender is known, i.e. the number of users over which the app's statistics were collected. This field can also be used as a measure of confidence in the information about the application.
First of all, it is interesting to look at how gender is distributed across devices.
One might expect devices to be split roughly equally between genders, but that is not the case. A plausible hypothesis is that women are less likely to indicate their gender in an app's settings.
Perhaps this is influenced by the fact that fewer applications are made exclusively for a female audience than for a male one. The following picture indirectly confirms this.
Let's look at the histogram of the per-app gender distribution. The first one is built without any additional filters.
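The figures themselves are not reproduced here; a sketch of how such a histogram could be drawn with matplotlib (the bin count is arbitrary):
import matplotlib.pyplot as plt

# share of female users per app, no filtering by cnt
genders_df['F'].hist(bins=200)
plt.xlabel('share of female users per app')
plt.ylabel('number of apps')
plt.show()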
Almost nothing is visible except for symmetrical and pronounced peaks. Let’s take a closer look at one of the peaks.
genders_df[
    (genders_df['F'] >= 0.3325) &
    (genders_df['F'] <= 0.3375)
].describe()
If you ignore the outlier, you can see that most of the applications in this subsample are extremely rare, which leads to a large number of identical values.
Let’s try to keep only those applications that are encountered more than 10 times.
Peaks are still visible but not so clear. Increasing the threshold to 50 almost eliminates the peaks.
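A sketch of this filtering step, assuming the cnt field described above and matplotlib for plotting:
import matplotlib.pyplot as plt

# keep only apps installed by more than a minimum number of known-gender users
for threshold in (10, 50):
    genders_df[genders_df['cnt'] > threshold]['F'].hist(bins=200)
    plt.title(f'apps with cnt > {threshold}')
    plt.show()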
The graph clearly shows that there are fewer applications with a female audience.
Let's create a few more features that may carry useful information.
It can be assumed that the number of installed applications can be helpful.
users_df['apps_count'] = users_df['ids'].apply(len)
users_df.groupby('gend')['apps_count'].describe()
You can see that women, on average, install more apps on their devices.
I have data about users, their gender, and the apps they have installed, as well as information about the gender distribution for these applications. Is there a relationship between the two? It is logical to assume there is, but how strong is it?
# map each app id to its share of female users
g_dict = genders_df['F'].to_dict()

# average that share over a user's installed apps, skipping unknown ids
users_df['F_prob'] = users_df['ids'].apply(
    lambda x: np.mean(
        list(filter(None.__ne__, list(map(g_dict.get, x))))
    )
)
Instead of the average, you can use more complex methods, but for the initial analysis, this is quite enough.
np.corrcoef(
    users_df['F_prob'],
    users_df['gend'].astype('category').cat.codes
)[0, 1]
The correlation turned out to be quite strong.
-0.46602945129982887
The histogram shows that users are well divided into two groups.
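The histogram itself isn't included here; a sketch of how the two groups could be plotted, assuming matplotlib and the gend column holding the gender labels:
import matplotlib.pyplot as plt

# distribution of the averaged per-app probability for each gender group
for gender, group in users_df.groupby('gend'):
    group['F_prob'].hist(bins=50, alpha=0.5, label=str(gender))
plt.xlabel('F_prob')
plt.legend()
plt.show()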
To draw conclusions and make comparisons, I need a baseline model, so I choose the simplest possible approach.
print(f"Accuracy: \
{accuracy_score(users_df['gend'].astype('category').cat.codes, users_df['F_prob']<0.5)}")
print(f"AUC: \
{1 - roc_auc_score(users_df['gend'].astype('category').cat.codes, users_df['F_prob'])}")
Even such a naive approach gives a good result, but let’s try to improve it further.
Accuracy: 0.740925288445762
AUC : 0.7793767183917958
Since the dataset with users is large, I can select a subset on which the models will be checked and compared.
train, test = train_test_split(
    users_df, train_size=0.7,
    random_state=0, stratify=users_df['gend'])
First, I'll try the simplest and most common method — logistic regression. But for this, I need numeric features instead of lists of ids. Again, I can use the simplest approach — binarization.
But there is an obvious problem — the number of unique ids.
109186
However, the resulting binarized data will be sparse, which allows the use of sparse matrices.
mlb = MultiLabelBinarizer(sparse_output=True)
mlb.fit(users_df['ids'])
train_mlb = mlb.transform(train['ids'])
test_mlb = mlb.transform(test['ids'])
To obtain reliable results and reduce the influence of randomness when splitting into training and validation subsamples, I use the OOF (Out-of-Fold) approach. Instead of a third-party library, I wrote a simple function. Please note that splitting the dataset into folds must be stratified.
def get_oof_lr(n_folds, x_train, y, x_test, seeds):
    ntrain = x_train.shape[0]
    ntest = x_test.shape[0]
    oof_train = np.zeros((len(seeds), ntrain, 2))
    oof_test = np.zeros((ntest, 2))
    oof_test_skf = np.empty((len(seeds), n_folds, ntest, 2))
    models = {}
    for iseed, seed in enumerate(seeds):
        kf = StratifiedKFold(
            n_splits=n_folds,
            shuffle=True,
            random_state=seed)
        for i, (tr_i, t_i) in enumerate(kf.split(x_train, y)):
            print(f'\nSeed {seed}, Fold {i}')
            x_tr = x_train[tr_i, :]
            y_tr = y[tr_i]
            x_te = x_train[t_i, :]
            y_te = y[t_i]
            model = LogisticRegression(
                random_state=seed,
                max_iter=10000,
                verbose=1,
                n_jobs=20
            )
            model.fit(x_tr, y_tr)
            oof_train[iseed, t_i, :] = \
                model.predict_proba(x_te)
            print(f"AUC: {roc_auc_score(y_te, oof_train[iseed, t_i, :][:, 1])}")
            oof_test_skf[iseed, i, :, :] = \
                model.predict_proba(x_test)
            models[(seed, i)] = model
    oof_test[:, :] = oof_test_skf.mean(axis=1).mean(axis=0)
    oof_train = oof_train.mean(axis=0)
    return oof_train, oof_test, models
AUC:
Seed 0, Fold 0: 0.8752592302937795
Seed 0, Fold 1: 0.8741272807694727
Seed 0, Fold 2: 0.8754404425783484
Seed 0, Fold 3: 0.8750862228494931
Seed 0, Fold 4: 0.8767777821454008
Seed 42, Fold 0: 0.876839970445301
Seed 42, Fold 1: 0.8771914077769174
Seed 42, Fold 2: 0.8762049208242458
Seed 42, Fold 3: 0.8725705419477277
Seed 42, Fold 4: 0.8731672122759209
Seed 888, Fold 0: 0.8752996641300741
Seed 888, Fold 1: 0.8749304780764804
Seed 888, Fold 2: 0.8762614986655877
Seed 888, Fold 3: 0.8765240184267109
Seed 888, Fold 4: 0.8725618258459555
Let’s check the prediction on the test subsample.
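A sketch of how these metrics can be computed from the OOF test predictions, assuming the arrays returned by get_oof_lr were named oof_train_lr and oof_test_lr, and that the second column holds the probability of the class pandas encodes as 1:
from sklearn.metrics import accuracy_score, roc_auc_score

# oof_test_lr is assumed to be the test prediction returned by get_oof_lr;
# column 1 is taken as the probability of the class encoded as 1
y_test = test['gend'].astype('category').cat.codes
print(f"Accuracy: {accuracy_score(y_test, oof_test_lr[:, 1] > 0.5)}")
print(f"AUC: {roc_auc_score(y_test, oof_test_lr[:, 1])}")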
Accuracy: 0.8208932240918818
AUC : 0.8798990678456793
I would say the improvement over the baseline is substantial. Presumably the quality could be increased further by tuning the hyperparameters; I'll leave that as homework for the reader.
When I look at the ids feature, I see a list of tokens. Why not try working with this data like plain text?
I chose CatBoost for the model. CatBoost is a high-performance, open-source library for gradient boosting on decision trees. Since release 0.19.1, it supports text features for classification on GPU out of the box. The main advantage is that CatBoost can handle categorical and text features in your data without additional preprocessing. You can find more detail about text features in the article Unconventional Sentiment Analysis: BERT vs. Catboost.
!pip install catboost
Let’s write a function to initialize and train the model.
def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        task_type='GPU',
        iterations=10000,
        eval_metric='AUC',
        od_type='Iter',
        od_wait=1000,
        learning_rate=0.1,
        **kwargs
    )
    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=1000,
        plot=False,
        use_best_model=True
    )
Unfortunately, in the current version of CatBoost, it is impossible to use a list of already prepared tokens. Therefore, let’s do a little trick — turn the feature into text and use it to create a model.
users_df['ids_txt'] = users_df['ids'].apply(
    lambda x: " ".join([str(i) for i in x]))
As with logistic regression, I make an OOF prediction.
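A minimal sketch of such a get_oof_cb function, mirroring get_oof_lr above but wrapping the data in catboost.Pool objects so the text features reach fit_model; the exact original implementation may differ.
import numpy as np
from catboost import Pool
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def get_oof_cb(n_folds, x_train, y, x_test, text_features, seeds):
    ntrain, ntest = x_train.shape[0], x_test.shape[0]
    oof_train = np.zeros((len(seeds), ntrain, 2))
    oof_test = np.zeros((ntest, 2))
    oof_test_skf = np.empty((len(seeds), n_folds, ntest, 2))
    models = {}
    # the test pool is the same for every fold, so build it once
    test_pool = Pool(data=x_test, text_features=text_features)
    for iseed, seed in enumerate(seeds):
        kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        for i, (tr_i, t_i) in enumerate(kf.split(x_train, y)):
            print(f'\nSeed {seed}, Fold {i}')
            train_pool = Pool(data=x_train.iloc[tr_i], label=y[tr_i],
                              text_features=text_features)
            valid_pool = Pool(data=x_train.iloc[t_i], label=y[t_i],
                              text_features=text_features)
            model = fit_model(train_pool, valid_pool, random_seed=seed)
            oof_train[iseed, t_i, :] = model.predict_proba(valid_pool)
            # fold AUC against the second class of the fitted model
            fold_auc = roc_auc_score((y[t_i] == model.classes_[1]).astype(int),
                                     oof_train[iseed, t_i, 1])
            print(f"AUC: {fold_auc}")
            oof_test_skf[iseed, i, :, :] = model.predict_proba(test_pool)
            models[(seed, i)] = model
    oof_test[:, :] = oof_test_skf.mean(axis=1).mean(axis=0)
    return oof_train.mean(axis=0), oof_test, models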
columns = ['ids_txt', 'apps_count']
oof_train_cb, oof_test_cb, models_cb = get_oof_cb(
    n_folds=5,
    x_train=train[columns],
    y=train['gend'].values,
    x_test=test[columns],
    text_features=['ids_txt'],
    seeds=[0, 42, 888]
)
The model quality metrics on the test subsample are better than those obtained with logistic regression.
Accuracy: 0.8218224490121011
AUC : 0.8856101448105046
Interestingly, two completely different approaches give very similar results. In such a situation, it is logical to assume that combining methods will give a synergistic effect.
As a new feature, I’ve added OOF predictions from a logistic regression model. In addition, do not forget about the F_prob feature, which worked well for the base model.
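One way to attach the logistic-regression OOF probabilities as an lr column; the names oof_train_lr and oof_test_lr and the column index are assumptions, not taken from the original code:
# attach the logistic-regression OOF probability of the second class as a
# new feature; oof_train_lr / oof_test_lr are assumed to be the arrays
# returned by get_oof_lr for the train and test subsamples
train_2 = train.copy()
test_2 = test.copy()
train_2['lr'] = oof_train_lr[:, 1]
test_2['lr'] = oof_test_lr[:, 1]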
columns = ['ids_txt', 'F_prob', 'lr', 'apps_count']
oof_train_cb_2, oof_test_cb_2, models_cb_2 = get_oof(
    n_folds=5,
    x_train=train_2[columns],
    y=train_2['gend'].values,
    x_test=test_2[columns],
    text_features=['ids_txt'],
    seeds=[0, 42, 888]
)
I can say that the model predicts the user's gender remarkably well using only information about the applications installed on the device.
Accuracy: 0.836950230713273
AUC : 0.9010077023800467
In this story, I explored the gender distribution across applications and devices, built a simple probability-based baseline, trained a logistic regression model on binarized app ids and a CatBoost model that treats the id list as text, and combined the two approaches to push the quality even higher.
All this required the accumulation of statistics about the applications users install and about the gender distribution of users for the applications themselves.
The code from the article can be viewed here.