# What is Hyperparameter Optimization?

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.

# Strategies

Several strategies can be used for performing optimization. The simplest one is manual tuning. One such example is using the [Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) to choose *k* in the [*k*-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm). Complex models, on the other hand, have dozens of hyperparameters, and since some of them are continuous, the size of the search space explodes, and so does the manual effort. To tackle this issue, several other "smarter" approaches exist. Some of them are:

* Grid search
* Random search
* Bayesian optimization
* Gradient-based optimization
* Evolutionary optimization
* Population-based training

To get more familiar with how these approaches work and what the differences between them are, I encourage you to read these articles: [7 Hyperparameter Optimization Techniques Every Data Scientist Should Know](https://towardsdatascience.com/7-hyperparameter-optimization-techniques-every-data-scientist-should-know-12cdebe713da), [How To Make Deep Learning Models That Don’t Suck](https://nanonets.com/blog/hyperparameter-optimization/), or [Hyperparameter Optimization Approaches](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Approaches).

# Tooling

Several frameworks provide implementations of the approaches mentioned above. In this tutorial, we are going to explore [Weights & Biases - Sweeps](https://docs.wandb.ai/sweeps) *(WANDB for short)*.

# Setup

For this tutorial, we are going to build a classifier for the [Heart Disease UCI](https://www.kaggle.com/ronitf/heart-disease-uci) dataset. We will use `RandomForestClassifier` from sklearn to predict the presence of heart disease.

This tutorial does not focus on data pre-processing, so we'll dive straight into splitting the data into train and test sets and training the model once with default values for the hyperparameters.

```py
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/heart.csv')

X = df.drop(['target'], axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```

Train the model:

```py
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
```

And that's it, we have now successfully trained a random forest classifier with default values for its hyperparameters. Among others, this classifier has the following hyperparameters:

* bootstrap
* max_depth
* max_features
* min_samples_leaf
* min_samples_split
* n_estimators

Now let's get into setting up the optimization.

# Optimization

## Step 1: Define the Training Script

For more details, see the [docs](https://docs.wandb.ai/guides/sweeps/quickstart#set-up-your-python-training-script). This script serves as the main entry point for the optimization: it receives a configuration from outside (through `wandb.config`) and performs one training and evaluation of the model with those fixed hyperparameter values.
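Every sweep-compatible script follows the same basic pattern: initialize a run, read the hyperparameter values from `wandb.config`, train, and log the metric the sweep optimizes. Here is a minimal sketch of that pattern before the full script; the `n_estimators` default and the hard-coded `accuracy` value are placeholders only, and the default-overriding behavior described in the comment is my understanding of the usual W&B sweeps setup (see the docs linked above):

```py
import wandb

# Defaults keep the script runnable on its own for a quick smoke test; when the
# script is launched by a sweep agent, the agent supplies the values defined in
# sweep.yml instead of these placeholders.
defaults = {"n_estimators": 100}

with wandb.init(project="hyperparameter-optimization", config=defaults):
    config = wandb.config              # hyperparameters injected by the sweep
    # ... build, train, and evaluate a model using config.n_estimators ...
    accuracy = 0.0                     # placeholder for the real evaluation result
    wandb.log({"accuracy": accuracy})  # the metric named in sweep.yml
```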
The name can be arbitrary; for this example it is `train.py`.

```py
import wandb
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

WANDB_PROJECT_NAME = "hyperparameter-optimization"

with wandb.init(project=WANDB_PROJECT_NAME):
    df = pd.read_csv('data/heart.csv')

    X = df.drop(['target'], axis=1)
    y = df['target']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    config = wandb.config

    rfc = RandomForestClassifier(
        bootstrap=config.bootstrap,
        max_depth=config.max_depth,
        max_features=config.max_features,
        min_samples_leaf=config.min_samples_leaf,
        min_samples_split=config.min_samples_split,
        n_estimators=config.n_estimators,
    )

    rfc.fit(X_train, y_train)
    y_pred = rfc.predict(X_test)

    wandb.log({'accuracy': accuracy_score(y_test, y_pred)})
```

## Step 2: Define the Optimization Strategy and Configuration

The optimization strategy, i.e. which approach should be used, which values or ranges should be tried for every hyperparameter, and which objective should be optimized, is defined in a configuration file, `sweep.yml`. This file also contains some additional configuration, such as the Python interpreter path and the path to the training script from Step 1; to get more familiar with it, see the [docs](https://docs.wandb.ai/guides/sweeps/quickstart#2.-configure-your-sweep). For this example, the `sweep.yml` file is:

```yml
program: train.py
method: bayes
project: hyperparameter-optimization
command:
  - ${env}
  - ~/envs/hyperopt/bin/python
  - ${program}
  - ${args}
metric:
  name: accuracy
  goal: maximize
parameters:
  bootstrap:
    values: [True, False]
  max_depth:
    values: [2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, null]
  max_features:
    values: ['auto', 'sqrt']
  min_samples_leaf:
    values: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  min_samples_split:
    values: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
  n_estimators:
    values: [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 500]
```

The interpretation is that the training script is located at `train.py`, Bayesian optimization is going to be used, for `bootstrap` the values `True` and `False` are going to be tried, for `max_depth` [2, 3, 4, ...], and similarly for all of the other hyperparameters. The objective is maximizing accuracy. In other words, **we want to assign values to the hyperparameters such that the accuracy is maximized**.

## Step 3: Initializing and Running the Optimization

To start the optimization, open a shell (with your favorite terminal emulator).

1. Activate the Python virtual environment (for UNIX: `source ~/envs/hyperopt/bin/activate`; for other platforms, see the official Python [guide](https://docs.python.org/3/tutorial/venv.html)).
2. Initialize the sweep:

   ```bash
   wandb sweep sweep.yml
   ```

   The command prints the newly created sweep ID together with the `wandb agent` command to run next.
3. Run the sweep agent:

   ```bash
   wandb agent aleksandar1932/hyperparameter-optimization/r3s5xf4d
   ```

   The agent starts launching runs, each with a different combination of hyperparameter values.

And that's it, the running sweep can now be observed in WANDB.

## Step 4: Monitoring

Go to the sweep URL from your shell output. For this example, the sweep is available [here](https://wandb.ai/aleksandar1932/hyperparameter-optimization/sweeps/m07mk186?workspace=user-aleksandar1932).

As the model is trained with different combinations of hyperparameters, the results are updated in real time. We can either wait for the given approach to find the combination of values that yields the best model, or stop the optimization, either by terminating the running shell or through WANDB.
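Besides the dashboard, the results can also be pulled programmatically through the W&B public API, which is handy for grabbing the best run once the sweep has finished. A minimal sketch, assuming the public API's `Api.sweep` and `Sweep.best_run` helpers; the sweep path reuses the ID from the `wandb agent` command above, so substitute your own:

```py
import wandb

api = wandb.Api()

# Path format: <entity>/<project>/<sweep_id> (printed when the sweep is created)
sweep = api.sweep("aleksandar1932/hyperparameter-optimization/r3s5xf4d")

best_run = sweep.best_run()  # run that scores best on the sweep's metric
print("Best accuracy:", best_run.summary.get("accuracy"))
print("Best hyperparameters:", best_run.config)
```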
# Conclusion

This sweep performed 79 runs, and the best model scored `0.9016` accuracy on a randomly sampled test set for that particular run *(in `train.py` the `train_test_split` is done for every run, so each run is evaluated on a different split; a quick cross-validation sanity check of the winning configuration is sketched at the end of the post)*.

So, it can be concluded that the best `RandomForestClassifier` for this dataset should be instantiated with the following hyperparameters:

```py
model = RandomForestClassifier(
    bootstrap=False,
    max_depth=4,
    max_features='sqrt',
    min_samples_leaf=3,
    min_samples_split=4,
    n_estimators=200,
)
```

The code from this tutorial is available on [GitHub](https://github.com/Aleksandar1932/hyperparameter-optimization).

*First published [here](https://blog.codechem.com/hyperparameter-optimization-with-weights-and-biases).*
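As mentioned in the conclusion, each sweep run evaluates on its own random train/test split, so the reported `0.9016` comes from a single split. A more stable estimate can be obtained by cross-validating the selected hyperparameters; here is a minimal sketch, assuming the same `data/heart.csv` layout as in the Setup section (the 5-fold choice is arbitrary):

```py
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv('data/heart.csv')
X = df.drop(['target'], axis=1)
y = df['target']

# Hyperparameters reported by the sweep, re-evaluated with 5-fold cross-validation
# to average out the variance introduced by the per-run train/test split.
model = RandomForestClassifier(
    bootstrap=False,
    max_depth=4,
    max_features='sqrt',
    min_samples_leaf=3,
    min_samples_split=4,
    n_estimators=200,
)

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Mean CV accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
```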