Hyperparameter tuning/Tracking ML models using MLflow
There is no escaping hyperparameter tuning when dealing with machine learning models: we try different sets of values for the hyperparameters associated with a model and compare the results. I used to keep track of all my hyperparameter tuning experiments in an Excel sheet. Now I use MLflow for that, so let me show you how I do it.
In this blog, we will classify the very famous iris dataset using k-nearest neighbours (k-NN). We will tune the hyperparameter k in k-NN and keep track of all the experiments and models using MLflow.
Abiding by my lazy nature, I will be reusing part of the iris classification code from
https://deepnote.com/@ndungu/Implementing-KNN-Algorithm-on-the-Iris-Dataset-58Fkk1AMQki-VJOJ3mA_Fg
I am using Google Colab for the experiments. Let’s first install MLflow:
!pip install mlflow
Let’s import all the libraries:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
import mlflow
The dataset looks like this:
iris = datasets.load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])
iris_df.head(5)
Segregating independent and dependent variables.
x = iris_df.iloc[:, :-1]  # all columns except the last: the features
y = iris_df.iloc[:, -1]   # last column: the target
Splitting the data into train set and test set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    test_size=0.2,
    shuffle=True,  # shuffle the data to avoid bias
    random_state=0,
)
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
x_test = np.asarray(x_test)
y_test = np.asarray(y_test)
Taking k=2, a simple k-NN classifier would look like this:
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(x_train, y_train)
y_pred_sklearn = knn.predict(x_test)
print(y_pred_sklearn)
This prints the predicted class label for each sample in the test set. The accuracy of the classifier on the test set is then:
accuracy_score(y_test, y_pred_sklearn)
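For intuition, accuracy is just the fraction of predictions that match the true labels. The snippet below is a minimal sketch of that idea in plain Python. The helper name `accuracy` and the toy labels are made up for illustration and are not the iris results.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical toy labels, for illustration only (not the iris predictions).
toy_true = [0, 1, 2, 2, 1]
toy_pred = [0, 1, 2, 1, 1]
print(accuracy(toy_true, toy_pred))  # 4 of 5 labels match -> 0.8
```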
Now, what if I want to try out different values of k, see how the accuracy_score changes, and choose the best classifier based on it? MLflow to the rescue.
mlflow.sklearn.autolog()
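for k in range(2, 10):
    # one MLflow run per value of k
    with mlflow.start_run():
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(x_train, y_train)
        y_pred_sklearn = knn.predict(x_test)
        # log the test-set accuracy alongside the autologged values
        mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred_sklearn))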
After running this block of code, a folder named “mlruns” will be created.
Each run subfolder corresponds to one value of the hyperparameter k, which is why there are 8 subfolders (k = 2 to 9). Let’s take a look at the first subfolder:
The model.pkl file is the pickled model trained with that particular set of hyperparameter values. The metrics folder contains some default metrics along with the metric logged explicitly in the code, in our case “accuracy”. The params folder contains all the hyperparameters, and their values, that were used to build the model in this subfolder.
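To make the folder layout a bit more concrete, here is a rough sketch that reads the logged values straight from disk using only the standard library. It rests on assumptions about MLflow's local file store that may not hold for every version: that runs live under `mlruns/<experiment_id>/<run_id>/`, that each metric is a text file whose lines look like `<timestamp> <value> <step>`, and that each param is a text file holding just the value. The helper name `summarize_runs` is hypothetical.

```python
from pathlib import Path


def summarize_runs(mlruns_dir, metric="accuracy", param="n_neighbors"):
    """Return (run_id, param value, last metric value) tuples, best metric first."""
    rows = []
    # Assumed layout: <mlruns_dir>/<experiment_id>/<run_id>/{metrics,params}/<name>
    for metric_file in Path(mlruns_dir).glob(f"*/*/metrics/{metric}"):
        lines = metric_file.read_text().strip().splitlines()
        if not lines:
            continue  # nothing was logged for this metric
        # Assumed line format: "<timestamp> <value> <step>"; use the last entry.
        value = float(lines[-1].split()[1])
        run_dir = metric_file.parent.parent
        param_file = run_dir / "params" / param
        param_value = param_file.read_text().strip() if param_file.exists() else None
        rows.append((run_dir.name, param_value, value))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Only meaningful if an "mlruns" folder exists in the working directory.
if Path("mlruns").is_dir():
    for run_id, k, acc in summarize_runs("mlruns"):
        print(run_id, k, acc)
```

Reading the files directly is mainly useful for understanding what MLflow stores. For day-to-day work, the MLflow UI or API is the more robust way to inspect runs.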
Based on the accuracy score, we pick the subfolder with the best-performing model and use that model for deployment or inference.
I generally use MLflow with Databricks as it makes my life a bit easier :p