Feature Selection with Optuna | by Nicolas Lupi | May 2024

A versatile and promising approach for the feature selection task

Photo by Edu Grande on Unsplash

Feature selection is a vital step in many machine learning pipelines. In practice, we typically have a wide range of variables available as predictors for our models, but only a few of them are related to our target. Feature selection consists of finding a reduced set of these features, mainly for:

  • Improved generalization — using a reduced number of features minimizes the risk of overfitting.
  • Better inference — by removing redundant features (for example, two features that are highly correlated with each other), we can retain only one of them and better capture its effect.
  • Efficient training — having fewer features means shorter training times.
  • Better interpretability — reducing the number of features produces more parsimonious models that are easier to understand.

There are many techniques available to perform feature selection, each with varying complexity. In this article, I want to share a way of using a powerful open source optimization tool, Optuna, to perform the feature selection task in an innovative way. The main idea is to have a versatile tool that can handle feature selection for a wide range of tasks, by efficiently testing different feature combinations (rather than trying them all one by one). Below, we'll go through a hands-on example implementing this approach, and also compare it to other common feature selection techniques. To experiment with the feature selection techniques discussed, you can follow along with this Colab Notebook.

In this example, we'll focus on a classification task based on the Mobile Price Classification dataset from Kaggle. We have 20 features, including 'battery_power', 'clock_speed' and 'ram', to predict the 'price_range' feature, which can belong to four different bands: 0, 1, 2 and 3.

We first split our dataset into train and test sets, and we also prepare a 5-fold validation split within the train set — this will be useful later on.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

SEED = 32

# Load data
filename = "train.csv" # train.csv from https://www.kaggle.com/datasets/iabhishekofficial/mobile-price-classification

df = pd.read_csv(filename)

# Train - test split
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.iloc[:,-1], random_state=SEED)
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

# The last column is the target variable
X_train = df_train.iloc[:,0:20]
y_train = df_train.iloc[:,-1]
X_test = df_test.iloc[:,0:20]
y_test = df_test.iloc[:,-1]

# Stratified k-fold over the train set for cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
splits = list(skf.split(X_train, y_train))

The model we'll use throughout the example is the Random Forest Classifier, using the scikit-learn implementation and default parameters. We first train the model using all features to set our benchmark. The metric we'll measure is the F1 score weighted across all four price ranges. After fitting the model on the train set, we evaluate it on the test set, obtaining an F1 score of around 0.87.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, classification_report

model = RandomForestClassifier(random_state=SEED)
model.fit(X_train, y_train)
preds = model.predict(X_test)

print(classification_report(y_test, preds))
print(f"Global F1: {f1_score(y_test, preds, average='weighted')}")

Image by author

The goal now is to improve these metrics by selecting a reduced feature set. We will first outline how our Optuna-based approach works, and then test and compare it with other common feature selection techniques.

Optuna is an optimization framework primarily used for hyperparameter tuning. One of the key features of the framework is its use of Bayesian optimization techniques to search the parameter space. The main idea is that Optuna tries different combinations of parameters and evaluates how the objective function changes with each configuration. From these trials, it builds a probabilistic model used to estimate which parameter values are likely to yield better results.
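To make the trial-and-objective mechanics concrete before we adapt them to feature selection, here is a minimal, self-contained sketch (the quadratic objective and its search range are invented purely for illustration):

import optuna

# Each trial suggests a value for x; the sampler gradually learns which
# regions of the search space tend to minimize the objective
def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)  # should approach {'x': 2.0}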

This strategy is much more efficient than grid or random search. For example, if we had n features and tried every possible feature subset, we would have to perform 2^n trials. With 20 features, that would be more than a million trials (2^20 = 1,048,576). Instead, with Optuna, we can explore the search space with far fewer trials.

Optuna offers various samplers to try. For our case, we'll use the default one, the TPESampler, based on the Tree-structured Parzen Estimator (TPE) algorithm. This sampler is the most commonly used, and it's recommended for searching categorical parameters, which is our case as we'll see below. According to the documentation, this algorithm "fits one Gaussian Mixture Model (GMM) l(x) to the set of parameter values associated with the best objective values, and another GMM g(x) to the remaining parameter values. It chooses the parameter value x that maximizes the ratio l(x)/g(x)."

As mentioned earlier, Optuna is typically used for hyperparameter tuning. This is usually done by training the model repeatedly on the same data using a fixed set of features, and in each trial testing a new set of hyperparameters determined by the sampler. The parameter set that minimizes the given objective function is then returned as the best trial.

In our case, however, we'll use a fixed model with predetermined parameters, and in each trial, we'll allow Optuna to select which features to try. The process aims to find the set of features that minimizes the loss function. In our case, we'll guide the algorithm to maximize the F1 score (or minimize the negative of the F1). Additionally, we'll add a small penalty for each feature used, to encourage smaller feature sets (if two feature sets yield similar results, we'll prefer the one with fewer features).

The data we'll use is the train dataset, split into 5 folds. In each trial, we'll fit the classifier 5 times, using 4 of the 5 folds for training and the remaining fold for validation. We'll then average the validation metrics and add the penalty term to calculate the trial's loss.

Below is the implemented class that performs the feature selection search:

import optuna

class FeatureSelectionOptuna:
    """
    This class implements feature selection using the Optuna optimization framework.

    Parameters:

    - model (object): The predictive model to evaluate; this should be any object that implements fit() and predict() methods.
    - loss_fn (function): The loss function to use for evaluating the model performance. This function should take the true labels and the
    predictions as inputs and return a loss value.
    - features (list of str): A list containing the names of all possible features that can be selected for the model.
    - X (DataFrame): The complete set of feature data (pandas DataFrame) from which subsets will be selected for training the model.
    - y (Series): The target variable associated with the X data (pandas Series).
    - splits (list of tuples): A list of tuples where each tuple contains two elements, the train indices and the validation indices.
    - penalty (float, optional): A factor used to penalize the objective function based on the number of features used.
    """

    def __init__(self,
                 model,
                 loss_fn,
                 features,
                 X,
                 y,
                 splits,
                 penalty=0):

        self.model = model
        self.loss_fn = loss_fn
        self.features = features
        self.X = X
        self.y = y
        self.splits = splits
        self.penalty = penalty

    def __call__(self,
                 trial: optuna.trial.Trial):

        # Select True / False for each feature
        selected_features = [trial.suggest_categorical(name, [True, False]) for name in self.features]

        # List with names of selected features
        selected_feature_names = [name for name, selected in zip(self.features, selected_features) if selected]

        # Optional: adds a penalty for the number of features used
        n_used = len(selected_feature_names)
        total_penalty = n_used * self.penalty

        loss = 0

        for split in self.splits:
            train_idx = split[0]
            valid_idx = split[1]

            X_train = self.X.iloc[train_idx].copy()
            y_train = self.y.iloc[train_idx].copy()
            X_valid = self.X.iloc[valid_idx].copy()
            y_valid = self.y.iloc[valid_idx].copy()

            X_train_selected = X_train[selected_feature_names].copy()
            X_valid_selected = X_valid[selected_feature_names].copy()

            # Train model, get predictions and accumulate loss
            self.model.fit(X_train_selected, y_train)
            pred = self.model.predict(X_valid_selected)

            loss += self.loss_fn(y_valid, pred)

        # Take the average loss across all splits
        loss /= len(self.splits)

        # Add the penalty to the loss
        loss += total_penalty

        return loss

The key part is where we define which features to use. We treat each feature as one parameter, which can take the values True or False. These values indicate whether the feature should be included in the model. We use the suggest_categorical method so that Optuna selects one of the two possible values for each feature.

We now initialize our Optuna study and run the search for 100 trials. Note that we enqueue a first trial using all features, as a starting point for the search, allowing Optuna to compare subsequent trials against a fully-featured model:

from optuna.samplers import TPESampler

def loss_fn(y_true, y_pred):
    """
    Returns the negative F1 score, to be treated as a loss function.
    """
    res = -f1_score(y_true, y_pred, average='weighted')
    return res

features = list(X_train.columns)

model = RandomForestClassifier(random_state=SEED)

sampler = TPESampler(seed=SEED)
study = optuna.create_study(direction="minimize", sampler=sampler)

# We first try the model using all features
default_features = {ft: True for ft in features}
study.enqueue_trial(default_features)

study.optimize(FeatureSelectionOptuna(
    model=model,
    loss_fn=loss_fn,
    features=features,
    X=X_train,
    y=y_train,
    splits=splits,
    penalty=1e-4,
), n_trials=100)

After completing the 100 trials, we retrieve the best one from the study and the features used in it. These are the following:

[‘battery_power’, ‘blue’, ‘dual_sim’, ‘fc’, ‘mobile_wt’, ‘px_height’, ‘px_width’, ‘ram’, ‘sc_w’]
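This retrieval step isn't spelled out above; a minimal sketch of one way to do it, reusing the study and features objects we defined earlier:

# The best trial stores one True/False parameter per feature name
best_trial = study.best_trial
selected_features = [ft for ft in features if best_trial.params[ft]]
print(selected_features)
print(best_trial.value)  # best (penalty-adjusted) validation loss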

Notice that from the original 20 features, the search concluded with only 9 of them, which is a significant reduction. These features yielded a minimal validation loss of around -0.9117, which means they achieved an average F1 score of around 0.9108 across all folds (after adjusting for the penalty term).

The next step is to train the model on the entire train set using these selected features and evaluate it on the test set. This results in an F1 score of around 0.882:
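This retraining step also isn't shown explicitly; a small helper along these lines would do it (a sketch, assuming the X_train, y_train, X_test, y_test and SEED objects defined earlier — we'll reuse it for the other methods below):

# Train on the full train set with a given feature subset and report test F1
def evaluate_on_test(feature_subset):
    model = RandomForestClassifier(random_state=SEED)
    model.fit(X_train[feature_subset], y_train)
    preds = model.predict(X_test[feature_subset])
    return f1_score(y_test, preds, average='weighted')

print(f"Test F1: {evaluate_on_test(selected_features):.3f}")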

Image by author

By selecting the right features, we were able to reduce our feature set by more than half, while still achieving a higher F1 score than with the full set. Below we discuss some pros and cons of using Optuna for feature selection:

Pros:

  • It searches across feature sets efficiently, taking into account which feature combinations are most likely to produce good results.
  • Adaptable to many scenarios: as long as there is a model and a loss function, we can use it for any feature selection task.
  • Sees the whole picture: unlike methods that evaluate features individually, Optuna takes into account which features tend to go well with each other, and which don't.
  • Dynamically determines the number of features as part of the optimization process. This can be tuned with the penalty term.

Cons:

  • It's not as straightforward as simpler methods, and for smaller and simpler datasets it might not be worth it.
  • Although it requires far fewer trials than other methods (like exhaustive search), it still typically requires around 100 to 1000 trials. Depending on the model and dataset, this can be time-consuming and computationally expensive.

Next, we'll compare our approach to other common feature selection techniques.

Filter Methods — Chi-Squared

One of the simplest alternatives is to evaluate each feature individually using a statistical test and retain the top k features based on their scores. Note that this approach doesn't require any machine learning model. For example, for a classification task, we can choose the chi-squared test, which determines whether there is a statistically significant association between each feature and the target variable. We'll use the SelectKBest class from scikit-learn, which applies the score function (chi-squared) to each feature and returns the top k scoring variables. Unlike the Optuna method, the number of features isn't determined during the selection process, but must be set beforehand. In this case, we'll set this number to ten. These approaches fall within the filter methods category. They tend to be the simplest and fastest to compute, since they don't require any model behind them.

from sklearn.feature_selection import SelectKBest, chi2

skb = SelectKBest(score_func=chi2, k=10)
skb.fit(X_train, y_train)

scores = pd.DataFrame(skb.scores_)
cols = pd.DataFrame(X_train.columns)
featureScores = pd.concat([cols, scores], axis=1)
featureScores.columns = ['feature', 'score']
featureScores.nlargest(10, 'score')

Image by author

In our case, ram scored the highest by far in the chi-squared test, followed by px_height and battery_power. Note that these features were also selected by our Optuna method above, along with px_width, mobile_wt and sc_w. However, there are some new additions like int_memory and talk_time — these weren't picked by the Optuna study. After training the random forest with these 10 features and evaluating it on the test set, we achieved an F1 score slightly higher than our previous best, at roughly 0.888:
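For completeness, a short sketch of how that evaluation could look, extracting the chosen columns with get_support() and reusing the evaluate_on_test helper sketched earlier:

# Columns kept by SelectKBest, evaluated on the test set
chi2_features = list(X_train.columns[skb.get_support()])
print(f"Test F1: {evaluate_on_test(chi2_features):.3f}")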

Image by author

Pros:

  • Model agnostic: it doesn't require a machine learning model.
  • Easy and fast to implement and run.

Cons:

  • It needs to be adapted for each task. For instance, some score functions are only applicable to classification tasks, and others only to regression tasks.
  • Greedy: depending on the variant used, it usually looks at features one at a time, without taking into account which ones are already included in the set.
  • Requires the number of features to select to be set beforehand.

Wrapper Methods — Forward Search

Wrapper methods are another category of feature selection techniques. These are iterative methods; they involve training the model with a set of features, evaluating its performance, and then deciding whether to add or remove features. Our Optuna strategy falls within these methods. However, the most common examples include forward selection and backward selection. With forward selection, we begin with no features and, at each step, we greedily add the feature that provides the greatest performance gain, until a stopping criterion is met (number of features or performance decline). Conversely, backward selection starts with all features and iteratively removes the least significant ones at each step.

Below, we try the SequentialFeatureSelector class from scikit-learn, performing a forward selection until we find the top 10 features. This method will also make use of the 5-fold split we prepared above, averaging performance across the validation splits at each step.

from sklearn.feature_selection import SequentialFeatureSelector

model = RandomForestClassifier(random_state=SEED)
sfs = SequentialFeatureSelector(model, n_features_to_select=10, cv=splits)
sfs.fit(X_train, y_train)

selected_features = list(X_train.columns[sfs.get_support()])
print(selected_features)

This method ends up selecting the following features:

[‘battery_power’, ‘blue’, ‘fc’, ‘mobile_wt’, ‘px_height’, ‘px_width’, ‘ram’, ‘talk_time’, ‘three_g’, ‘touch_screen’]

Again, some are common to the previous methods, and some are new (e.g., three_g and touch_screen). Using these features, the Random Forest achieves a lower test F1 score, slightly below 0.88.

Image by author

Pros

  • Easy to implement in just a few lines of code.
  • It can also be used to determine the number of features to use (via the tolerance parameter); see the sketch below.
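As a brief illustration of that last point, here is a sketch using n_features_to_select="auto" together with tol (available in recent scikit-learn versions; the threshold value here is an arbitrary choice):

# Sketch: let the selector decide the subset size, stopping once adding
# a feature improves the cross-validated score by less than tol
sfs_auto = SequentialFeatureSelector(
    RandomForestClassifier(random_state=SEED),
    n_features_to_select="auto",
    tol=1e-3,
    cv=splits,
)
sfs_auto.fit(X_train, y_train)
print(list(X_train.columns[sfs_auto.get_support()]))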

Cons

  • Time-consuming: starting with zero features, it trains the model each time using a different variable, and keeps the best one. For the next step, it again tries out all features (now including the previous one), and again selects the best one. This is repeated until the desired number of features is reached.
  • Greedy: once a feature is included, it stays. This may lead to suboptimal results, since the feature providing the greatest individual gain in early rounds might not be the best choice in the context of other feature interactions.

Feature Importance

Finally, we'll explore another simple selection strategy, which involves using the feature importances the model learns (if available). Certain models, like Random Forests, provide a measure of which features are most important for prediction. We can use these rankings to filter out the features that, according to the model, have the least importance. In this case, we train the model on the entire train dataset and retain the 10 most important features:

model = RandomForestClassifier(random_state=SEED)
model.fit(X_train, y_train)

importance = pd.DataFrame({'feature': X_train.columns, 'importance': model.feature_importances_})
importance.nlargest(10, 'importance')

Image by author

Notice how, once again, ram is ranked highest, far above the second most important feature. Training with these 10 features, we obtain a test F1 score of almost 0.883, similar to the ones we've been seeing. Also, note how the features selected via feature importance are the same as those selected using the chi-squared test, although they're ranked differently. This difference in ranking leads to a slightly different outcome.
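The column extraction for this method is slightly different; a sketch, again reusing the evaluate_on_test helper introduced in the Optuna section:

# Keep the 10 most important features and evaluate them on the test set
top_features = list(importance.nlargest(10, 'importance')['feature'])
print(f"Test F1: {evaluate_on_test(top_features):.3f}")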

Image by author

Pros:

  • Easy and fast to implement: it requires a single training of the model and directly uses the derived feature importances.
  • It can be adapted into a recursive version, in which at each step the least important feature is removed and the model is then trained again (see Recursive Feature Elimination; a sketch follows this list).
  • Contained within the model: if the model we're using provides feature importances, we already have a feature selection alternative available at no extra cost.
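A minimal sketch of that recursive variant, using scikit-learn's RFE class, which refits the model and drops the least important feature at each step until the requested number remains:

from sklearn.feature_selection import RFE

# Recursively eliminate the least important feature until 10 remain
rfe = RFE(RandomForestClassifier(random_state=SEED), n_features_to_select=10)
rfe.fit(X_train, y_train)
print(list(X_train.columns[rfe.get_support()]))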

Cons:

  • Feature importance might not be aligned with our end goal. For instance, a feature might appear unimportant on its own but could be critical due to its interaction with other features. Also, an important feature might be counterproductive overall, by affecting the performance of other useful predictors.
  • Not all models offer feature importance estimation.
  • Requires the number of features to select to be predefined.

To conclude, we've seen how to use Optuna, a powerful optimization tool, for the feature selection task. By efficiently navigating the search space, it is able to find good feature subsets with relatively few trials. Not only that, it is also versatile and can be adapted to many scenarios, as long as we have a model and a loss function defined.

Throughout our examples, we saw that all techniques yielded similar feature sets and results. This is mainly because the dataset we used is rather simple. In these cases, simpler methods already produce a good feature selection, so it wouldn't make much sense to go with the Optuna approach. However, for more complex datasets, with more features and intricate relationships between them, using Optuna might be a good idea. So, all in all, given its relative ease of implementation and ability to deliver good results, using Optuna for feature selection is a worthwhile addition to the data scientist's toolkit.

Thanks for reading!
