Machine Learning Models for User Identification with Keystroke Dynamics, by @tudoracheabogdan


Bogdan Tudorache | 10m | 2023/10/10

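This article explores using machine learning models to identify users from keystroke dynamics, a behavioral biometric method. The process involves analyzing keystroke data to extract typing patterns and building three machine learning models: Support Vector Machine (SVM), Random Forest, and Extreme Gradient Boosting (XGBoost). The data is first processed to compute features such as hold time, press-press time, release-release time, and release-press time. These features are then used to train the ML models. Code examples are provided for training each model. The article also discusses the importance of feature extraction and provides additional resources for further reading.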

ML Models for User Identification Using Keystroke Dynamics

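The keystroke dynamics used in this article's machine learning models for user identification are a form of behavioral biometrics. Keystroke dynamics uses the distinctive way each person types to confirm their identity. This is done by analyzing the two events that make up a keystroke on a computer keyboard, Key-Press and Key-Release, to extract typing patterns.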


This article examines how these patterns can be applied to create three accurate machine learning models for user identification.


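The goal of this article is split into two parts: building and training the machine learning models (1. SVM, 2. Random Forest, 3. XGBoost), and deploying the models in a live single-endpoint API that can predict the user from 5 input parameters: the ML model and 4 keystroke times.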




1. Building the Models

Problem Statement

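The goal of this part is to build machine learning models for user identification from keystroke data. Keystroke dynamics is a behavioral biometric that uses the distinctive way a person types to verify their identity.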


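Typing patterns are extracted mainly from computer keyboards. The patterns used in keystroke dynamics are derived mostly from the two events that make up a keystroke: Key-Press and Key-Release.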


The Key-Press event occurs when a key is first pressed down, and the Key-Release event occurs when that key is subsequently released.


In this step, a dataset of users' keystroke information is given, with the following details:


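  • keystroke.csv: this dataset contains keystroke data collected from 110 users.
  • Every user was asked to type the same 13-character string 8 times, and the keystroke data (the press time and release time of each key) was recorded.
  • The dataset contains 880 rows and 27 columns.
  • The first column holds the user ID; the remaining columns hold the press and release times for the 1st through 13th characters.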
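
The numbers in the list above fit together, and a quick sketch makes the layout concrete. The column names follow the `press-{j}` / `release-{j}` pattern that `process_csv()` reads later in this article; the interleaved column order shown here is an assumption, and the real file may order its columns differently.

```python
# Sketch of the assumed keystroke.csv layout. Column ORDER is an assumption;
# only the press-{j} / release-{j} naming is taken from process_csv().
NUM_USERS = 110    # users in the dataset
REPETITIONS = 8    # times each user typed the string
NUM_KEYS = 13      # characters in the typed string

columns = ["user"]
for j in range(NUM_KEYS):
    columns += [f"press-{j}", f"release-{j}"]

expected_rows = NUM_USERS * REPETITIONS

print(len(columns))   # 27 columns: 1 user ID + 13 press + 13 release
print(expected_rows)  # 880 rows: one per typed sample
```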


You should do the following:

  1. Raw data is usually not informative enough, so informative features need to be extracted from it to build a good model.


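For this purpose, four features are introduced:


  • Hold Time "HT": the release time of a key minus its press time

  • Press-Press Time "PPT": the press time of the next key minus the press time of the current key

  • Release-Release Time "RRT": the release time of the next key minus the release time of the current key

  • Release-Press Time "RPT": the press time of the next key minus the release time of the current key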
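
As a minimal sketch of how the four features relate, the snippet below computes them for a single typed sample, using the same subtractions as the comments in `process_csv()`. The function name `keystroke_features` and the timestamps are hypothetical, chosen only for illustration.

```python
def keystroke_features(press, release):
    """Return the HT, PPT, RRT and RPT lists for one typed sample."""
    n = len(press)
    ht = [release[j] - press[j] for j in range(n)]               # hold time
    ppt = [press[j + 1] - press[j] for j in range(n - 1)]        # press-press
    rrt = [release[j + 1] - release[j] for j in range(n - 1)]    # release-release
    rpt = [press[j + 1] - release[j] for j in range(n - 1)]      # release-press
    return ht, ppt, rrt, rpt


# Hypothetical timestamps in milliseconds for a three-key sample.
press = [0, 150, 320]
release = [90, 260, 400]

ht, ppt, rrt, rpt = keystroke_features(press, release)
print(ht)   # [90, 110, 80]
print(ppt)  # [150, 170]
print(rrt)  # [170, 140]
print(rpt)  # [60, 60]
```

A sample with n keys gives n hold times and n - 1 values for each of the other three features. PPT equals HT plus RPT for the same key, and RPT becomes negative when the next key is pressed before the current one is released.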


  2. For keystroke.csv, you should generate these features for every two consecutive keys.


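  3. After completing the previous step, you should compute the mean and standard deviation of each feature for every row. As a result, each row should have 8 features (4 means and 4 standard deviations). → process_csv()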



    from math import sqrt


    def calculate_mean_and_standard_deviation(feature_list):
        """Return the mean and the sample standard deviation (n - 1) of a list."""
        # calculate the mean
        mean = sum(feature_list) / len(feature_list)
        # calculate the squared differences from the mean
        squared_diffs = [(x - mean) ** 2 for x in feature_list]
        # calculate the sum of the squared differences
        sum_squared_diffs = sum(squared_diffs)
        # calculate the variance (sample variance, hence the n - 1 divisor)
        variance = sum_squared_diffs / (len(feature_list) - 1)
        # calculate the standard deviation
        std_dev = sqrt(variance)
        return mean, std_dev
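
The helper above divides by n - 1, so it computes the sample standard deviation rather than the population one. As a hedged cross-check, the sketch below compares the same formula with Python's `statistics` module on hypothetical hold times; `statistics.stdev` uses the n - 1 divisor and `statistics.pstdev` uses n.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

hold_times = [90, 110, 80]  # hypothetical hold times in milliseconds

m = mean(hold_times)
sample_sd = stdev(hold_times)       # divides by n - 1, like the helper above
population_sd = pstdev(hold_times)  # divides by n

# Same formula as the helper, written out by hand.
manual_sd = sqrt(sum((x - m) ** 2 for x in hold_times) / (len(hold_times) - 1))

print(abs(manual_sd - sample_sd) < 1e-9)  # the two n - 1 versions agree
print(population_sd < sample_sd)          # the n divisor gives a smaller value
```

The n - 1 divisor also means the helper needs at least two values per feature; a single value would cause a division by zero.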



    import pandas as pd

    NUM_KEYS = 13  # the typed string has 13 characters: press-0 ... press-12


    def process_csv(df_input_csv_data):
        data = {
            'user': [],
            'ht_mean': [], 'ht_std_dev': [],
            'ppt_mean': [], 'ppt_std_dev': [],
            'rrt_mean': [], 'rrt_std_dev': [],
            'rpt_mean': [], 'rpt_std_dev': [],
        }

        # iterate over each row in the dataframe
        for i, row in df_input_csv_data.iterrows():
            # I use the IF to select only the first X rows of the csv
            if i >= 885:
                break

            ht_list = []   # list of hold times
            ppt_list = []  # list of press-press times
            rrt_list = []  # list of release-release times
            rpt_list = []  # list of release-press times

            for j in range(NUM_KEYS):
                # hold time: release[j] - press[j]
                ht_list.append(row[f"release-{j}"] - row[f"press-{j}"])

                # the other three features need a following key, so skip the last key
                if j < NUM_KEYS - 1:
                    # press-press time: press[j+1] - press[j]
                    ppt_list.append(row[f"press-{j + 1}"] - row[f"press-{j}"])
                    # release-release time: release[j+1] - release[j]
                    rrt_list.append(row[f"release-{j + 1}"] - row[f"release-{j}"])
                    # release-press time: press[j+1] - release[j]
                    rpt_list.append(row[f"press-{j + 1}"] - row[f"release-{j}"])

            # each list holds the values of one feature for this row -> feature_list
            ht_mean, ht_std_dev = calculate_mean_and_standard_deviation(ht_list)
            ppt_mean, ppt_std_dev = calculate_mean_and_standard_deviation(ppt_list)
            rrt_mean, rrt_std_dev = calculate_mean_and_standard_deviation(rrt_list)
            rpt_mean, rpt_std_dev = calculate_mean_and_standard_deviation(rpt_list)

            data['user'].append(row['user'])
            data['ht_mean'].append(ht_mean)
            data['ht_std_dev'].append(ht_std_dev)
            data['ppt_mean'].append(ppt_mean)
            data['ppt_std_dev'].append(ppt_std_dev)
            data['rrt_mean'].append(rrt_mean)
            data['rrt_std_dev'].append(rrt_std_dev)
            data['rpt_mean'].append(rpt_mean)
            data['rpt_std_dev'].append(rpt_std_dev)

        data_df = pd.DataFrame(data)
        return data_df


All the code is available in the KeystrokeDynamics repository on my GitHub.


Training the ML Models

Now that the data has been parsed, we can start building the models by training them.
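
The training functions in this article fit on whatever data they receive and do not report accuracy. One common way to estimate how well a model identifies users is to hold out part of the samples, as sketched below. This is not taken from the article's repository: the data is synthetic, and the split ratio and sizes are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the processed data: 5 users x 8 samples x 8 features.
n_users, n_samples, n_features = 5, 8, 8
centers = rng.normal(0, 10, size=(n_users, n_features))
X = np.vstack([c + rng.normal(0, 1, size=(n_samples, n_features)) for c in centers])
y = np.repeat(np.arange(n_users), n_samples)

# stratify=y keeps every user represented in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = SVC().fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

With only 8 samples per user in keystroke.csv, a stratified split matters: a plain random split could leave some users with no test samples at all.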


Support Vector Machine

    import joblib
    from sklearn.svm import SVC


    def train_svm(training_data, features):
        """
        SVM stands for Support Vector Machine, which is a type of machine learning
        algorithm used for classification and regression analysis.

        SVM algorithm aims to find a hyperplane in an n-dimensional space that
        separates the data into two classes. The hyperplane is chosen in such a way
        that it maximizes the margin between the two classes, making the
        classification more robust and accurate.

        In addition, SVM can also handle non-linearly separable data by mapping the
        original features to a higher-dimensional space, where a linear hyperplane
        can be used for classification.

        :param training_data: DataFrame with the feature columns and a 'user' column
        :param features: list of feature column names to train on
        :return: ML Trained model
        """
        # Split the data into features and labels
        X = training_data[features]
        y = training_data['user']

        # Train an SVM model on the data
        svm_model = SVC()
        svm_model.fit(X, y)

        # Save the trained model to disk (the models/ directory must already exist)
        svm_model_name = 'models/svm_model.joblib'
        joblib.dump(svm_model, svm_model_name)
        return svm_model
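
SVMs with the default RBF kernel are sensitive to feature scale, so features are often standardized before training. The sketch below shows one way to do that with a scikit-learn pipeline, and then saves and reloads it with joblib to show the round trip. It is a variation on `train_svm`, not the article's implementation; the data, user labels and file name are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, well-separated data for two hypothetical users (8 features each).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(100, 5, size=(8, 8)),
    rng.normal(200, 5, size=(8, 8)),
])
y = np.array(["user_a"] * 8 + ["user_b"] * 8)

# The scaler is fitted together with the SVM, so it is reapplied at predict time.
pipeline = make_pipeline(StandardScaler(), SVC())
pipeline.fit(X, y)

# Save and reload the whole pipeline, as the article does for the bare model.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "svm_pipeline.joblib")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

prediction = restored.predict(np.full((1, 8), 100.0))
print(prediction[0])
```

Saving the pipeline rather than the bare `SVC` keeps the scaling parameters with the model, so an API that loads the file does not need to repeat the preprocessing by hand.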


Additional reading:


Random Forest


    import joblib
    from sklearn.ensemble import RandomForestClassifier


    def train_random_forest(training_data, features):
        """
        Random Forest is a type of machine learning algorithm that belongs to the
        family of ensemble learning methods. It is used for classification,
        regression, and other tasks that involve predicting an output value based
        on a set of input features.

        The algorithm works by creating multiple decision trees, where each tree is
        built using a random subset of the input features and a random subset of
        the training data. Each tree is trained independently, and the final output
        is obtained by combining the outputs of all the trees in some way, such as
        taking the average (for regression) or majority vote (for classification).

        :param training_data: DataFrame with the feature columns and a 'user' column
        :param features: list of feature column names to train on
        :return: ML Trained model
        """
        # Split the data into features and labels
        X = training_data[features]
        y = training_data['user']

        # Train a Random Forest model on the data
        rf_model = RandomForestClassifier()
        rf_model.fit(X, y)

        # Save the trained model to disk (the models/ directory must already exist)
        rf_model_name = 'models/rf_model.joblib'
        joblib.dump(rf_model, rf_model_name)
        return rf_model
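
A fitted random forest exposes `feature_importances_`, which can hint at which of the eight keystroke features carry the most signal. The sketch below uses synthetic data in which only the first column is informative by construction, so the ranking is known in advance. The feature names match the columns produced by `process_csv()`; everything else is an assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = [
    "ht_mean", "ht_std_dev", "ppt_mean", "ppt_std_dev",
    "rrt_mean", "rrt_std_dev", "rpt_mean", "rpt_std_dev",
]

# Synthetic data: 10 hypothetical users x 8 samples, all columns pure noise...
rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(80, len(feature_names)))
y = np.repeat(np.arange(10), 8)
# ...except the first column, which is shifted per user to make it informative.
X[:, 0] += y * 5

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each feature name with its importance and sort, most important first.
ranked = sorted(
    zip(feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:12s} {importance:.3f}")
```

The importances sum to 1, so each value can be read as a share of the total. On real keystroke data the ranking would need to be checked rather than assumed.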


Additional reading:



Extreme Gradient Boosting (XGBoost)


    import joblib
    import xgboost as xgb
    from sklearn.preprocessing import LabelEncoder


    def train_xgboost(training_data, features):
        """
        XGBoost stands for Extreme Gradient Boosting, which is a type of gradient
        boosting algorithm used for classification and regression analysis.

        XGBoost is an ensemble learning method that combines multiple decision
        trees to create a more powerful model. Each tree is built using a gradient
        boosting algorithm, which iteratively improves the model by minimizing a
        loss function.

        XGBoost has several advantages over other boosting algorithms, including
        its speed, scalability, and ability to handle missing values.

        :param training_data: DataFrame with the feature columns and a 'user' column
        :param features: list of feature column names to train on
        :return: ML Trained model
        """
        # Split the data into features and labels
        X = training_data[features]
        # Encode the user IDs as consecutive integers starting at 0
        label_encoder = LabelEncoder()
        y = label_encoder.fit_transform(training_data['user'])

        # Train an XGBoost model on the data
        xgb_model = xgb.XGBClassifier()
        xgb_model.fit(X, y)

        # Save the trained model to disk (the models/ directory must already exist).
        # Note: the model predicts encoded labels, and the encoder is not saved here.
        xgb_model_name = 'models/xgb_model.joblib'
        joblib.dump(xgb_model, xgb_model_name)
        return xgb_model
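
Because `train_xgboost` trains on encoded labels, the model's predictions are integer codes rather than user IDs. The sketch below shows how scikit-learn's `LabelEncoder` maps IDs to codes and back. The user IDs and the "predicted" codes are hypothetical stand-ins, so XGBoost itself is not needed to run it.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

users = ["u17", "u03", "u17", "u42", "u03"]  # hypothetical user IDs

label_encoder = LabelEncoder()
# Classes are sorted before numbering: u03 -> 0, u17 -> 1, u42 -> 2.
y = label_encoder.fit_transform(users)
print(list(y))  # [1, 0, 1, 2, 0]

# Stand-in for the integer codes a trained classifier would return.
predicted_codes = np.array([1, 2])
predicted_users = label_encoder.inverse_transform(predicted_codes)
print(list(predicted_users))  # ['u17', 'u42']
```

To decode predictions after reloading the model, the fitted encoder has to be available as well, for example by saving it with `joblib.dump` next to the model file.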


Additional reading:




Also published here.

If you liked this article and would like to support me, make sure to:

🔔 Follow me: Bogdan Tudorache

🔔 Connect with me: LinkedIn | Reddit