I got you!
In this article, I will show you how to convert your script into an objective function that can be optimized with any hyperparameter optimization library.
It will take just 3 steps and you will be tuning model parameters like there is no tomorrow.
Ready?
Let’s go!
Suppose your main.py script looks something like this:
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
data = pd.read_csv('data/train.csv', nrows=10000)
X = data.drop(['ID_code', 'target'], axis=1)
y = data['target']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1234)
train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)
params = {'objective': 'binary',
          'metric': 'auc',
          'learning_rate': 0.4,
          'max_depth': 15,
          'num_leaves': 20,
          'feature_fraction': 0.8,
          'subsample': 0.2}
model = lgb.train(params, train_data,
                  num_boost_round=300,
                  early_stopping_rounds=30,
                  valid_sets=[valid_data],
                  valid_names=['valid'])
score = model.best_score['valid']['auc']
print('validation AUC:', score)
Take the parameters that you want to tune and put them in a dictionary at the top of your script. By doing that, you effectively decouple the search parameters from the rest of the code.
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
SEARCH_PARAMS = {'learning_rate': 0.4,
                 'max_depth': 15,
                 'num_leaves': 20,
                 'feature_fraction': 0.8,
                 'subsample': 0.2}
data = pd.read_csv('../data/train.csv', nrows=10000)
X = data.drop(['ID_code', 'target'], axis=1)
y = data['target']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1234)
train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)
params = {'objective': 'binary',
          'metric': 'auc',
          **SEARCH_PARAMS}
model = lgb.train(params, train_data,
                  num_boost_round=300,
                  early_stopping_rounds=30,
                  valid_sets=[valid_data],
                  valid_names=['valid'])
score = model.best_score['valid']['auc']
print('validation AUC:', score)
Now, you can put the entire training and evaluation logic inside a train_evaluate function. This function takes parameters as input and outputs the validation score.
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
SEARCH_PARAMS = {'learning_rate': 0.4,
                 'max_depth': 15,
                 'num_leaves': 20,
                 'feature_fraction': 0.8,
                 'subsample': 0.2}
def train_evaluate(search_params):
    data = pd.read_csv('../data/train.csv', nrows=10000)
    X = data.drop(['ID_code', 'target'], axis=1)
    y = data['target']
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1234)
    train_data = lgb.Dataset(X_train, label=y_train)
    valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)
    params = {'objective': 'binary',
              'metric': 'auc',
              **search_params}
    model = lgb.train(params, train_data,
                      num_boost_round=300,
                      early_stopping_rounds=30,
                      valid_sets=[valid_data],
                      valid_names=['valid'])
    score = model.best_score['valid']['auc']
    return score

if __name__ == '__main__':
    score = train_evaluate(SEARCH_PARAMS)
    print('validation AUC:', score)
We are almost there.
All you need to do now is use this train_evaluate function as the objective for the black-box optimization library of your choice.
I will use Scikit-Optimize, which I have described in great detail in another article, but you can use any hyperparameter optimization library out there.
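To make that "any library" claim concrete, here is a minimal sketch of how the same train_evaluate function could be wrapped for Optuna. This snippet is an illustration, not part of the original scripts: it assumes Optuna is installed and simply mirrors the search ranges from the SPACE defined below.

import optuna

from script_step2 import train_evaluate

def objective(trial):
    # sample a candidate configuration; ranges mirror the skopt SPACE below
    params = {'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.5, log=True),
              'max_depth': trial.suggest_int('max_depth', 1, 30),
              'num_leaves': trial.suggest_int('num_leaves', 2, 100),
              'feature_fraction': trial.suggest_float('feature_fraction', 0.1, 1.0),
              'subsample': trial.suggest_float('subsample', 0.1, 1.0)}
    # Optuna can maximize directly, so no sign flip is needed
    return train_evaluate(params)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print('best result: ', study.best_value)
print('best parameters: ', study.best_params)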
With Scikit-Optimize, in a nutshell, I define the search SPACE, wrap train_evaluate in an objective function to be minimized, and run the optimization.
In this example, I will try 100 different configurations, starting with 10 randomly chosen parameter sets.
import skopt
from script_step2 import train_evaluate
SPACE = [
    skopt.space.Real(0.01, 0.5, name='learning_rate', prior='log-uniform'),
    skopt.space.Integer(1, 30, name='max_depth'),
    skopt.space.Integer(2, 100, name='num_leaves'),
    skopt.space.Real(0.1, 1.0, name='feature_fraction', prior='uniform'),
    skopt.space.Real(0.1, 1.0, name='subsample', prior='uniform')]
@skopt.utils.use_named_args(SPACE)
def objective(**params):
    return -1.0 * train_evaluate(params)
results = skopt.forest_minimize(objective, SPACE, n_calls=100, n_random_starts=10)
best_auc = -1.0 * results.fun
best_params = results.x
print('best result: ', best_auc)
print('best parameters: ', best_params)
This is it.
The results object contains information about the best score and parameters that produced it.
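Since results.x is just a list of values in the same order as the SPACE definition, a small helper (a sketch based on the snippet above, not part of the original script) turns it back into a named dictionary:

# map the raw list in results.x back to the parameter names defined in SPACE
best_named_params = {dimension.name: value for dimension, value in zip(SPACE, results.x)}
print('best parameters: ', best_named_params)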
If you want to visualize your training and save diagnostic charts after it finishes, you can add one callback and one function call to log the entire hyperparameter search to Neptune.
Just use the skopt monitoring helpers from neptunecontrib, as shown below.
import neptune
import neptunecontrib.monitoring.skopt as sk_utils
import skopt
from script_step2 import train_evaluate
neptune.init('jakub-czakon/blog-hpo')
neptune.create_experiment('hpo-on-any-script', upload_source_files=['*.py'])
SPACE = [
    skopt.space.Real(0.01, 0.5, name='learning_rate', prior='log-uniform'),
    skopt.space.Integer(1, 30, name='max_depth'),
    skopt.space.Integer(2, 100, name='num_leaves'),
    skopt.space.Real(0.1, 1.0, name='feature_fraction', prior='uniform'),
    skopt.space.Real(0.1, 1.0, name='subsample', prior='uniform')]
@skopt.utils.use_named_args(SPACE)
def objective(**params):
    return -1.0 * train_evaluate(params)
monitor = sk_utils.NeptuneMonitor()
results = skopt.forest_minimize(objective, SPACE, n_calls=100, n_random_starts=10, callback=[monitor])
sk_utils.log_results(results)
neptune.stop()
Now, when you run your parameter sweep, you will see the charts and results update live in Neptune as the search progresses.
Check out the skopt hyperparameter sweep experiment with all the code, charts and results.
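If you prefer to stay local, scikit-optimize also ships basic diagnostic plots. Here is a minimal sketch (assuming matplotlib is installed; this is an optional extra, not part of the original scripts):

import matplotlib.pyplot as plt
from skopt.plots import plot_convergence, plot_objective

# best score found so far at each iteration
plot_convergence(results)
plt.savefig('convergence.png')

# estimated effect of each hyperparameter on the objective
plot_objective(results)
plt.savefig('objective.png')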
In this article, you’ve learned how to optimize hyperparameters of pretty much any Python script in just 3 steps.
Hopefully, with this knowledge, you will build better machine learning models with less effort.
Happy training!
This article was originally posted on the Neptune blog. If you liked it, you may like it there :)
You can also find me tweeting @Neptune_ai or posting on LinkedIn about ML and Data Science stuff.