Coverage vs. Accuracy: Striking a Balance in Data Science | by Nadav Har-Tuv | Apr, 2024

An example

This example shows how the concept of an agile workflow can create great value. It is a very simple example, meant to visualize the concept. Real-life examples will be a lot less obvious, but the idea you will see here is just as relevant.

Let's look at this two-dimensional data that I simulated from three equally sized classes.

import numpy as np

num_samples_A = 500
num_samples_B = 500
num_samples_C = 500

# Class A
mean_A = [3, 2]
cov_A = [[0.1, 0], [0, 0.1]] # Low variance
class_A = np.random.multivariate_normal(mean_A, cov_A, num_samples_A)

# Class B
mean_B = [0, 0]
cov_B = [[1, 0.5], [0.5, 1]] # Bigger variance with some overlap with class C
class_B = np.random.multivariate_normal(mean_B, cov_B, num_samples_B)

# Class C
mean_C = [0, 1]
cov_C = [[2, 0.5], [0.5, 2]] # Bigger variance with some overlap with class B
class_C = np.random.multivariate_normal(mean_C, cov_C, num_samples_C)

Figure: Two-dimensional data from three classes (a scatter plot of the simulated data)

Now we try to fit a machine learning classifier to this data. It looks like an SVM classifier with a Gaussian ('rbf') kernel might do the trick:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Creating DataFrame
data = np.concatenate([class_A, class_B, class_C])
labels = np.concatenate([np.zeros(num_samples_A), np.ones(num_samples_B), np.ones(num_samples_C) * 2])
df = pd.DataFrame(data, columns=['x', 'y'])
df['label'] = labels.astype(int)

# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df[['x', 'y']], df['label'], test_size=0.2, random_state=42)

# Training SVM model with RBF kernel (probability=True enables predict_proba)
svm_rbf = SVC(kernel='rbf', probability=True)
svm_rbf.fit(X_train, y_train)

# Predict probabilities for each class
svm_rbf_probs = svm_rbf.predict_proba(X_test)

# Get predicted classes and corresponding confidences
svm_rbf_predictions = [(X_test.iloc[i]['x'], X_test.iloc[i]['y'], true_class, np.argmax(probs), np.max(probs)) for i, (true_class, probs) in enumerate(zip(y_test, svm_rbf_probs))]

svm_predictions_df = pd.DataFrame(svm_rbf_predictions).rename(columns={0: 'x', 1: 'y', 2: 'true_class', 3: 'predicted_class', 4: 'confidence'})

How does this model perform on our data?

accuracy = (svm_predictions_df['true_class'] == svm_predictions_df['predicted_class']).mean() * 100
print(f'Accuracy = {round(accuracy, 2)}%')

Accuracy = 75.33%

75% accuracy is disappointing, but does this mean that the model is useless?

Now we want to look at the most confident predictions and see how the model performs on them. How do we define the most confident predictions? We can try out different confidence (predict_proba) thresholds, see what coverage and accuracy we get for each one, and then decide which threshold meets our business needs. Here, coverage is the share of test samples whose confidence exceeds the threshold, and accuracy is measured only on those covered samples.

thresholds = [.5, .55, .6, .65, .7, .75, .8, .85, .9]
results = []

for threshold in thresholds:
    # Keep only the predictions whose confidence exceeds the threshold
    svm_df_covered = svm_predictions_df.loc[svm_predictions_df['confidence'] > threshold]
    coverage = len(svm_df_covered) / len(svm_predictions_df) * 100
    accuracy_covered = (svm_df_covered['true_class'] == svm_df_covered['predicted_class']).mean() * 100

    results.append({'Threshold': threshold, 'Coverage (%)': round(coverage, 2), 'Accuracy on covered data (%)': round(accuracy_covered, 2)})

results_df = pd.DataFrame(results)
print(results_df)

And we get

Table: Coverage and accuracy by threshold (the output of the code block above)

Or, if we want a more detailed look, we can create a plot of the coverage and accuracy by threshold:

Figure: Accuracy and coverage as a function of threshold (a line plot for the model on the simulated data)
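The code behind this plot is not shown. A minimal sketch, assuming matplotlib is available and that both curves share one percentage axis, might look like this. The function name plot_coverage_accuracy and the toy numbers are illustrative assumptions, not the author's code or results.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt


def plot_coverage_accuracy(thresholds, coverage, accuracy):
    """Draw coverage and accuracy (both in %) against the confidence threshold."""
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.plot(thresholds, coverage, marker="o", label="Coverage (%)")
    ax.plot(thresholds, accuracy, marker="o", label="Accuracy on covered data (%)")
    ax.set_xlabel("Confidence threshold")
    ax.set_ylabel("Percent")
    ax.set_title("Accuracy and coverage as a function of threshold")
    ax.legend()
    return fig, ax


# Toy numbers for illustration only; they are NOT the results from the article.
toy_thresholds = [0.5, 0.7, 0.9]
toy_coverage = [95.0, 70.0, 40.0]
toy_accuracy = [78.0, 88.0, 96.0]
fig, ax = plot_coverage_accuracy(toy_thresholds, toy_coverage, toy_accuracy)
```

To plot the real numbers, pass the threshold, coverage, and accuracy columns of results_df instead of the toy lists, then call fig.savefig(...) or plt.show() in an interactive session.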

We can now select the threshold that fits our business logic. For example, if our company's policy is to guarantee at least 90% accuracy, then we can choose a threshold of 0.75 and get an accuracy of 90% for 62% of the data. This is a huge improvement over throwing out the model, especially if we don't have any model in production!
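Picking the threshold can also be automated. The sketch below assumes the rule "among all thresholds that meet the accuracy target, take the one with the highest coverage". The helper pick_threshold, its dictionary keys, and the toy rows are hypothetical and are not taken from the results above.

```python
def pick_threshold(results, min_accuracy):
    """Return the row with the highest coverage whose accuracy meets the target.

    `results` is a list of dicts with the keys 'threshold', 'coverage' and
    'accuracy' (the last two as percentages). Returns None if nothing qualifies.
    """
    qualifying = [row for row in results if row["accuracy"] >= min_accuracy]
    if not qualifying:
        return None
    return max(qualifying, key=lambda row: row["coverage"])


# Toy rows for illustration only; they are NOT the results from the article.
toy_results = [
    {"threshold": 0.5, "coverage": 95.0, "accuracy": 78.0},
    {"threshold": 0.7, "coverage": 70.0, "accuracy": 88.0},
    {"threshold": 0.8, "coverage": 55.0, "accuracy": 92.0},
    {"threshold": 0.9, "coverage": 40.0, "accuracy": 96.0},
]
best = pick_threshold(toy_results, min_accuracy=90.0)
print(best)
```

One design point to keep in mind: a threshold chosen on the same test set that is used to report accuracy tends to look slightly better than it will in production, so choosing it on a separate validation split is usually the safer option.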

Now that our model is happily working in production on about 60% of the data, we can shift our focus to the rest of the data. We can collect more data, do more feature engineering, try more complex models, or get help from a domain expert.
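In production, this split between covered and uncovered data comes down to a routing decision for each prediction. Here is a minimal sketch, assuming the class probabilities arrive as a plain Python list (for example, one row of predict_proba converted with .tolist()). The function route_prediction and the idea of sending the uncovered cases to a manual fallback are illustrative assumptions.

```python
def route_prediction(class_probabilities, threshold):
    """Return (predicted_class, confidence, accepted) for one sample.

    `accepted` is True when the top class probability exceeds the threshold,
    which means the model's answer can be used automatically. Otherwise the
    sample should go to a fallback such as manual review.
    """
    confidence = max(class_probabilities)
    predicted_class = class_probabilities.index(confidence)
    accepted = confidence > threshold
    return predicted_class, confidence, accepted


# A confident prediction is accepted, an uncertain one is sent to the fallback.
auto = route_prediction([0.05, 0.85, 0.10], threshold=0.75)
manual = route_prediction([0.40, 0.35, 0.25], threshold=0.75)
print(auto)
print(manual)
```

The strict comparison (>) matches the filter used when coverage was computed, so the share of accepted samples should line up with the coverage reported for the chosen threshold.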
