A Refined Bias that May Affect Your Determination Bushes and Random Forests | by Gyorgy Kovacs

Machine Learning

A Refined Bias that May Affect Your Determination Bushes and Random Forests | by Gyorgy Kovacs | Dec, 2023

hhhhm

2024年1月1日

A Refined Bias that May Affect Your Determination Bushes and Random Forests | by Gyorgy Kovacs | Dec, 2023

[ad_1]

Determination bushes and random forests are broadly adopted classification and regression methods in machine studying. Determination bushes are favored for his or her interpretability, whereas random forests stand out as extremely aggressive and general-purpose state-of-the-art methods. Generally used CART implementations, akin to these within the Python bundle sklearn and the R packages tree and caret, assume that every one options are steady. Regardless of this silent assumption of continuity, each methods are routinely utilized to datasets with various characteristic sorts.

In a latest paper, we investigated the sensible implications of violating the belief of continuity and located that it results in a bias. Importantly, these assumptions are virtually at all times violated in observe. On this article, we current and focus on our findings, illustrate and clarify the background, and suggest some easy methods to mitigate the bias.

Let’s leap into it with an instance utilizing the CPU efficiency dataset from the UCI repository. We’ll import it by means of the common-datasets bundle, to simplify the preprocessing and bypass the necessity for characteristic encoding and lacking information imputation.

import numpy as npfrom sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_validate
from scipy.stats import wilcoxon
from common_datasets.regression import load_cpu_performance
dataset = load_cpu_performance()
X = dataset['data']
y = dataset['target']
# a cross-validation wrapper to simplify the code
def cv_rf(X, y, regressor=RandomForestRegressor):
return cross_validate(
estimator=regressor(max_depth=11),
X=X, y=y,
cv=RepeatedKFold(n_splits=5, n_repeats=400, random_state=5),
scoring='r2'
)['test_score']
r2_original = cv_rf(X, y)
r2_mirrored = cv_rf(-X, y)

Within the experiment, we consider the efficiency of the random forest regressor on each the unique information and its mirrored model (every characteristic multiplied by -1). The hyperparameter for the regressor (max_depth=11 ) was chosen in a devoted mannequin choice step, maximizing the r2 rating throughout an inexpensive vary of depths. The cross-validation employed for analysis is considerably extra complete than what is often utilized in machine studying, encompassing a complete of 2000 folds.

print(f'unique r2: {np.imply(r2_original):.4f}')
print(f'mirrored r2: {np.imply(r2_mirrored):.4f}')
print(f'p-value: {wilcoxon(r2_original, r2_mirrored, zero_method="zsplit").pvalue:.4e}')
# unique r2: 0.8611
# mirrored r2: 0.8595
# p-value: 6.2667e-04

When it comes to r2 scores, we observe a deterioration of 0.2 share factors when the attributes are mirrored. Moreover, the distinction is statistically important at typical ranges (p << 0.01).

The outcomes are considerably stunning and counter-intuitive. Machine studying methods are sometimes invariant to sure kinds of transformations. For instance, ok Nearest Neighbors is invariant to any orthogonal transformation (like rotation), and linear-ish methods are sometimes invariant to the scaling of attributes. For the reason that area partitioning in resolution bushes is axis aligned, it can’t be anticipated to be invariant to rotations. Nevertheless, it’s invariant to scaling: making use of any constructive multiplier to any characteristic will result in the very same tree. Consequently, there should be one thing occurring with the mirroring of the axes.

An intriguing query arises: what if mirroring the axes results in higher outcomes? Ought to we think about one other diploma of freedom (multiplying by -1) in mannequin choice past figuring out the optimum depth? Effectively, in the remainder of the put up we work out what’s going on right here!

Now, let’s briefly evaluate some essential traits of constructing and making inferences with binary Classification And Regression Bushes (CART), that are utilized by most implementations. A notable distinction in comparison with different tree induction methods like ID3 and C4.5 is that CART bushes don’t deal with categorical options in any particular means. CART bushes assume that every one options are steady.

Given a coaching set (classification or regression), resolution bushes are induced by recursively partitioning the coaching set alongside the characteristic area utilizing circumstances like characteristic < threshold or alternatively, options <= threshold. The selection of conditioning is normally an intrinsic property of the implementations. For instance, the Python bundle sklearn makes use of circumstances of the shape characteristic <= threshold, whereas the R bundle tree makes use of characteristic < threshold. Be aware that these circumstances are aligned with the presumption of all options being steady.

Nonetheless, the presumption of steady options isn’t a limitation. Integer options, class options by means of some encoding, or binary options can nonetheless be fed into these bushes. Let’s look at an instance tree in a hypothetical mortgage approval state of affairs (a binary classification downside), based mostly on three attributes:

graduated (binary): 0 if the applicant didn’t graduate, 1 if the applicant graduated;
earnings (float): the yearly gross earnings of the applicant;
dependents (int): the variety of dependents;

and the goal variable is binary: whether or not the applicant defaults (1) or pays again (0).

A choice tree constructed for a hypothetical mortgage approval state of affairs

The construction of the tree, in addition to the circumstances within the nodes (which threshold on which characteristic), are inferred from the coaching information. For extra particulars about tree induction, discuss with resolution tree studying on Wikipedia.

Given a tree like this, inference for a brand new document is performed by ranging from the leaf node, recursively making use of the circumstances, and routing the document to the department akin to the output of the situation. When a leaf node is encountered, the label (or ultimately distribution) recorded within the leaf node is returned because the prediction.

A finite set of coaching information can not suggest a novel partitioning of the characteristic area. For instance, the tree within the determine above may very well be induced from information the place there isn’t any document with commencement = 0 and earnings within the vary ]60k, 80k[. The tree induction method identifies that a split should be made between the income values 60k and 80k. In the absence of further information, the midpoint of the interval (70k) is used as the threshold. Generally, it could be 65k or 85k as well. Using the midpoints of the unlabeled intervals is a common practice and a reasonable choice: in line with the assumption of continuous features, 50% of the unlabeled interval is assigned to the left and 50% to the right branches.

Due to the use of midpoints as thresholds, the tree induction is completely independent of the choice of the conditioning operator: using both <= and < leads to the same tree structure, with the same thresholds, except for the conditioning operator.

However, inference does depend on the conditioning operator. In the example, if a record representing an applicant with a 70k income is to be inferred, then in the depicted setup, it will be routed to the left branch. However, using the operator <, it will be routed to the right branch. With truly continuous features, there is a negligible chance for a record with exactly 70k income to be inferred. However, in reality, the income might be quoted in units of 1k, 5k, or eventually 10k, which makes it probable that the choice of the conditioning operator has a notable impact on predictions.

Why do we talk about the conditioning when the problem we observed is about the mirroring of features? The two are basically the same. A condition “feature < threshold” is equivalent to the condition “-feature <= -threshold” in the sense that they lead to the same, but mirrored partitioning of the real axis. Namely, in both cases, if the feature value equals the threshold, that value is in the same partition where the feature values greater than the threshold are. For example, compare the two trees below, the one we used for illustration earlier, except all conditioning operators are changed to <, and another tree where the operator is kept, but the tree is mirrored: one can readily see that for any record they lead to the same predictions.

The previous tree with the conditioning operator <

The tree built on the mirrored data (still using the conditioning operator ≤)

Since tree induction is independent of the choice of conditioning, building a tree on mirrored data and then predicting mirrored vectors is equivalent to using the non-default conditioning operator (<) for inference on non-mirrored records. When the trees of the forest were fitted to the mirrored data, even though sklearn uses the ‘<=’ operator for conditioning, it worked as if it used the ‘<’ operator. Consequently, the performance deterioration we discovered with mirroring is due to thresholds coinciding with feature values, leading to different predictions during the evaluation of the test sets.

For the sake of completeness, we note that the randomization in certain steps of tree induction might lead to slightly different trees when fitted to the mirrored data. However, these differences smooth out in random forests, especially in 2k folds of evaluation. The observed performance deterioration is a consequence of the systematic effect of thresholds coinciding with feature values.

Primarily, two circumstances increase the likelihood of the phenomenon:

When a feature domain contains highly probable equidistant values: This sets the stage for a threshold (being the mid-point of two observations) to coincide with a feature value with high probability.
Relatively deep trees are built: Generally, as a tree gets deeper, the training data becomes sparser at the nodes. When certain observations are absent at greater depths, thresholds might fall on those values.

Interestingly, features taking a handful of equidistant values are very common in numerous domains. For example:

The age feature in medical datasets.
Rounded decimals (observed to the, say, 2nd digit will form a lattice).
Monetary figures quoted in units of millions or billions in financial datasets.

Additionally, almost 97% of features in the toy regression and classification datasets in sklearn.datasets are of this kind. Therefore, it is not an over-exaggeration to say that features taking equidistant values with high probability are present everywhere. Consequently, as a rule of thumb, the deeper trees or forests one builds, the more likely it becomes that thresholds interfere with feature values.

We have seen that the two conditioning operators (the non-default one imitated by the mirroring of data) can lead to different prediction results with statistical significance. The two predictions cannot be unbiased at the same time. Therefore, we consider the use of either form of conditioning introducing a bias when thresholds coincide with feature values.

Alternatively, it is tempting to consider one form of conditioning to be luckily aligned with the data and improving the performance. Thus, model selection could be used to select the most suitable form of conditioning (or whether the data should be mirrored). However, in a particular model selection scenario, using some k-fold cross-validation scheme, we can only test which operator is typically favorable if, say, 20% of the data is removed (5-fold) from training and then used for evaluation. When a model is trained on all data, other thresholds might interfere with feature values, and we have no information on which conditioning would improve the performance.

A natural way to eliminate the bias is to integrate out the effect of the choice of conditioning operators. This involves carrying out predictions with both operators and averaging the results.

In practice, with random forests, exploiting the equivalence of data mirroring and changing the conditioning operator, this can be approximated for basically no cost. Instead of using a forest of N_e estimators, one can build two forests of half the size, fit one to the original data, the other to the mirrored data, and take the average of the results. Note that this approach is applicable with any random forest implementation, and has only marginal additional cost (like multiplying the data by -1 and averaging the results).

For example, we implement this strategy in Python below, aiming to integrate out the bias from the sklearn random forest.

from sklearn.base import RegressorMixinclass UnbiasedRandomForestRegressor(RegressorMixin):
def __init__(self, **kwargs):
# determining the number of estimators used in the
# two subforests (with the same overall number of trees)
self.n_estimators = kwargs.get('n_estimators', 100)
n_leq = int(self.n_estimators / 2)
n_l = self.n_estimators - n_estimators_leq
# instantiating the subforests
self.rf_leq = RandomForestRegressor(**(kwargs | {'n_estimators': n_leq}))
self.rf_l = RandomForestRegressor(**(kwargs | {'n_estimators': n_l}))
def fit(self, X, y, sample_weight=None):
# fitting both subforests
self.rf_leq.fit(X, y, sample_weight)
self.rf_l.fit(-X, y, sample_weight)
return self
def predict(self, X):
# taking the average of the predictions
return np.mean([self.rf_leq.predict(X), self.rf_l.predict(-X)], axis=0)
def get_params(self, deep=False):
# returning the parameters
return self.rf_leq.get_params(deep) | {'n_estimators': self.n_estimators}

Subsequent, we will execute the identical experiments as earlier than, utilizing the very same folds:

r2_unbiased = cv_rf(X, y, UnbiasedRandomForestRegressor)

Let’s evaluate the outcomes!

print(f'unique r2: {np.imply(r2_original):.4f}')
print(f'mirrored r2: {np.imply(r2_mirrored):.4f}')
print(f'unbiased r2: {np.imply(r2_unbiased):.4f}')
# unique r2: 0.8611
# mirrored r2: 0.8595
# unbiased r2: 0.8608

In keeping with expectations, the r2 rating of the unbiased forest falls between the scores achieved by the unique forest with and with out mirroring the info. It might sound that eliminating the bias is detrimental to the efficiency; nonetheless, we emphasize once more that when the forest is match with all information, the relations is perhaps reversed, and the unique mannequin would possibly result in worse predictions than the mirrored mannequin. Eliminating the bias by integrating out the dependence on the conditioning operator eliminates the chance of deteriorated efficiency resulting from counting on the default conditioning operator.

The presence of a bias associated to the interplay of the selection of conditioning and options taking equidistant values has been established and demonstrated. Given the frequent prevalence of options of this type, the bias is more likely to be current in sufficiently deep resolution bushes and random forests. The doubtless detrimental impact might be eradicated by averaging the predictions carried out by the 2 conditioning operators. Apparently, within the case of random forests, this may be achieved at principally no value. Within the instance we used, the advance reached the extent of 0.1–0.2 share factors of r2 scores. Lastly, we emphasize that the outcomes generalize to classification issues and single resolution bushes as nicely (see preprint).

For additional particulars, I like to recommend:

[ad_2]