Fraud Detection with Generative Adversarial Nets (GANs) | by Michio Suginoo | Jan, 2024

Finally, it is time for us to use GANs for data augmentation.

So how much synthetic data do we need to create?

First of all, our interest in data augmentation is only for the model training. Since the test dataset is out-of-sample data, we want to preserve the original form of the test dataset. Secondly, because our intention is to perfectly balance the imbalanced dataset, we do not want to augment the majority class of non-fraud cases.

Simply put, we want to augment only the train dataset of the minority fraud class, nothing else.

Now, let's split the working dataframe into the train dataset and the test dataset in an 80/20 ratio, using a stratified data split method.

import pandas as pd
from sklearn.model_selection import train_test_split

# Separate features and target variable
X = df.drop('Class', axis=1)
y = df['Class']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Combine the features and the label for the train dataset
train_df = pd.concat([X_train, y_train], axis=1)

As a result, the shape of the train dataset is as follows:

  • train_df.shape = (226010, 7)

Let's see the composition (the fraud cases and the non-fraud cases) of the train dataset.

# Load the dataset (fraud and non-fraud data)
fraud_data = train_df[train_df['Class'] == 1].drop('Class', axis=1).values
non_fraud_data = train_df[train_df['Class'] == 0].drop('Class', axis=1).values

# Calculate the number of synthetic fraud samples to generate
num_real_fraud = len(fraud_data)
num_synthetic_samples = len(non_fraud_data) - num_real_fraud
print("# of non-fraud: ", len(non_fraud_data))
print("# of Real Fraud:", num_real_fraud)
print("# of Synthetic Fraud required:", num_synthetic_samples)

# of non-fraud: 225632
# of Real Fraud: 378
# of Synthetic Fraud required: 225254

This tells us that the train dataset (226,010 rows) is comprised of 225,632 non-fraud cases and 378 fraud cases. In other words, the difference between them is 225,254. This number is the number of synthetic fraud samples (num_synthetic_samples) that we need to generate in order to perfectly match the counts of these two classes within the train dataset: as a reminder, we preserve the original test dataset.

Next, let's code the GANs.

First, let's create custom functions to define the two agents: the discriminator and the generator.

For the generator, I create a noise distribution function, build_generator(), which requires two parameters: latent_dim (the dimension of the noise) as the shape of its input; and the shape of its output, output_dim, which corresponds to the number of features.

from keras.models import Sequential
from keras.layers import Dense, Input

# Define the generator network
def build_generator(latent_dim, output_dim):
    model = Sequential()
    model.add(Dense(64, input_shape=(latent_dim,)))
    model.add(Dense(128, activation='sigmoid'))
    model.add(Dense(output_dim, activation='sigmoid'))
    return model

For the discriminator, I create a custom function, build_discriminator(), that takes input_dim, which corresponds to the number of features.

# Define the discriminator network
def build_discriminator(input_dim):
    model = Sequential()
    model.add(Input(shape=(input_dim,)))
    model.add(Dense(128, activation='sigmoid'))
    model.add(Dense(1, activation='sigmoid'))
    return model

Then, we can call these functions to create the generator and the discriminator. Here, for the generator I arbitrarily set latent_dim to 32: you can try other values here, if you like.

# Dimensionality of the input noise for the generator
latent_dim = 32

# Build the generator and discriminator models
generator = build_generator(latent_dim, fraud_data.shape[1])
discriminator = build_discriminator(fraud_data.shape[1])

At this stage, we need to compile the discriminator, which is going to be nested in the main (upper) optimization loop later. We can compile the discriminator with the following argument setting.

  • the loss function of the discriminator: the generic cross-entropy loss function for a binary classifier
  • the evaluation metrics: precision and recall.
# Compile the discriminator model
from keras.metrics import Precision, Recall
from keras.optimizers import Adam

discriminator.compile(optimizer=Adam(learning_rate=0.0002, beta_1=0.5), loss='binary_crossentropy', metrics=[Precision(), Recall()])

For the generator, we will compile it when we assemble the main (upper) optimization loop.

At this stage, we can define the custom objective function for the generator as follows. Remember, the recommended objective was to maximize the following formula:

E[log(D(G(z)))] (Image by Author)
from keras import backend as K

def generator_loss_log_d(y_true, y_pred):
    return -K.mean(K.log(y_pred + K.epsilon()))

Above, the negative sign is required, since the loss function by default is designed to be minimized.
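To make the sign intuitive, here is a small numeric sanity check of my own (not part of the original code): the loss shrinks as the discriminator's output on the generated samples approaches 1, so minimizing it pushes the generator toward samples that the discriminator scores as real.

import numpy as np

# Illustrative sketch: evaluate -mean(log(D(G(z)))) for a few hypothetical
# discriminator outputs on generated samples
for p in [0.1, 0.5, 0.9]:
    loss = -np.mean(np.log(np.full(4, p) + 1e-7))
    print(f"D(G(z)) = {p:.1f} -> generator loss = {loss:.4f}")
# Approximate output: 2.3026, 0.6931, 0.1054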

Then, we can assemble the main (upper) loop of the bi-level optimization architecture, build_gan(generator, discriminator). In this main loop, we compile the generator implicitly. In this context, we need to use the custom objective function of the generator, generator_loss_log_d, when we compile the main loop.

As mentioned before, we need to freeze the discriminator when we train the generator.

# Build and compile the GANs upper optimization loop combining the generator and the discriminator
def build_gan(generator, discriminator):
    discriminator.trainable = False
    model = Sequential()
    model.add(generator)
    model.add(discriminator)
    model.compile(optimizer=Adam(learning_rate=0.0002, beta_1=0.5), loss=generator_loss_log_d)

    return model

# Call the upper loop function
gan = build_gan(generator, discriminator)

On the last line above, gan calls build_gan() in order to implement the batch training below, using Keras' model.train_on_batch() method.

As a reminder, while we train the discriminator, we need to freeze the training of the generator; and while we train the generator, we need to freeze the training of the discriminator.

Here is the batch training code incorporating the alternating training process of these two agents under the bi-level optimization framework.

import numpy as np

# Set hyperparameters
epochs = 10000
batch_size = 32

# Training loop for the GANs
for epoch in range(epochs):
    # Train the discriminator (freeze the generator)
    discriminator.trainable = True
    generator.trainable = False

    # Random sampling from the real fraud data
    real_fraud_samples = fraud_data[np.random.randint(0, num_real_fraud, batch_size)]

    # Generate fake fraud samples using the generator
    noise = np.random.normal(0, 1, size=(batch_size, latent_dim))
    fake_fraud_samples = generator.predict(noise)

    # Create labels for real and fake fraud samples
    real_labels = np.ones((batch_size, 1))
    fake_labels = np.zeros((batch_size, 1))

    # Train the discriminator on real and fake fraud samples
    d_loss_real = discriminator.train_on_batch(real_fraud_samples, real_labels)
    d_loss_fake = discriminator.train_on_batch(fake_fraud_samples, fake_labels)
    d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

    # Train the generator (freeze the discriminator)
    discriminator.trainable = False
    generator.trainable = True

    # Generate synthetic fraud samples and create labels for training the generator
    noise = np.random.normal(0, 1, size=(batch_size, latent_dim))
    valid_labels = np.ones((batch_size, 1))

    # Train the generator to generate samples that "fool" the discriminator
    g_loss = gan.train_on_batch(noise, valid_labels)

    # Print the progress
    if epoch % 100 == 0:
        print(f"Epoch: {epoch} - D Loss: {d_loss} - G Loss: {g_loss}")

Here, I have a quick question for you.

Below we have an excerpt relevant to the generator training from the code above.

Can you explain what this code is doing?

# Generate synthetic fraud samples and create labels for training the generator
noise = np.random.normal(0, 1, size=(batch_size, latent_dim))
valid_labels = np.ones((batch_size, 1))

In the first line, noise generates the synthetic data. In the second line, valid_labels assigns the label of the synthetic data.

Why do we need to label it with 1, which is supposed to be the label for the real data? Didn't you find the code counter-intuitive?

Ladies and gentlemen, welcome to the world of counterfeiters.

This is the labeling magic that trains the generator to create samples that can fool the discriminator.
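To see the mechanism, remember that discriminator.trainable was set to False before gan was compiled, so a gan.train_on_batch() call updates only the generator's weights while the discriminator acts as a fixed judge of "real" (label 1). Below is a minimal sanity check of my own (not part of the original code), assuming the generator, discriminator, and gan models defined above:

import numpy as np

# Snapshot the weights of both agents before one combined training step
d_weights_before = [w.copy() for w in discriminator.get_weights()]
g_weights_before = [w.copy() for w in generator.get_weights()]

# One training step of the combined model with "real" (label 1) targets
noise = np.random.normal(0, 1, size=(batch_size, latent_dim))
gan.train_on_batch(noise, np.ones((batch_size, 1)))

# The discriminator's weights should be unchanged; the generator's should move
d_frozen = all(np.allclose(b, a) for b, a in zip(d_weights_before, discriminator.get_weights()))
g_updated = any(not np.allclose(b, a) for b, a in zip(g_weights_before, generator.get_weights()))
print("Discriminator frozen:", d_frozen)   # expected: True
print("Generator updated:", g_updated)     # expected: True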

Now, let's use the trained generator to create the synthetic data for the minority fraud class.

# After training, use the generator to create synthetic fraud data
noise = np.random.normal(0, 1, size=(num_synthetic_samples, latent_dim))
synthetic_fraud_data = generator.predict(noise)

# Convert the result to a Pandas DataFrame format
# ('features' holds the feature column names, a pandas Index defined outside this excerpt)
fake_df = pd.DataFrame(synthetic_fraud_data, columns=features.to_list())

Finally, the synthetic data is created.

In the next section, we will combine this synthetic fraud data with the original train dataset to make the entire train dataset perfectly balanced. I hope that the perfectly balanced training dataset will improve the performance of the fraud detection classification model.

To repeat, the use of GANs in this project is purely for data augmentation, not for classification.

First of all, we need a benchmark model as the basis of comparison in order to evaluate the improvement that the GANs-based data augmentation makes on the performance of the fraud detection model.

As a binary classifier algorithm, I selected an Ensemble Method for building the fraud detection model. As the benchmark scenario, I developed a fraud detection model only with the original imbalanced dataset: thus, without data augmentation. Then, for the second scenario with data augmentation by GANs, I can train the same algorithm with the perfectly balanced train dataset, which contains the synthetic fraud data created by the GANs.

  • Benchmark Scenario: Ensemble Classifier without data augmentation
  • GANs Scenario: Ensemble Classifier with data augmentation by GANs

Benchmark Scenario: Ensemble without data augmentation

Next, let's define the benchmark scenario (without data augmentation). I decided to select an Ensemble Classifier: the voting method as the meta learner with the following 3 base learners:

  • Gradient Boosting
  • Decision Tree
  • Random Forest

Since the original dataset is highly imbalanced, rather than accuracy I will pick the evaluation metrics from the following 3 options: precision, recall, and F1-Score.
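As a quick illustration of why accuracy is misleading here (a toy sketch of my own with made-up numbers roughly matching the observed imbalance, not part of the original code), a classifier that always predicts "non-fraud" scores near-perfect accuracy while being useless on precision, recall, and F1-Score:

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels: roughly 0.17% fraud, similar to the train dataset above
y_true = np.array([0] * 59622 + [1] * 100)
y_pred = np.zeros_like(y_true)  # a degenerate classifier that always predicts non-fraud

print("Accuracy :", accuracy_score(y_true, y_pred))                    # ~0.9983
print("Precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("Recall   :", recall_score(y_true, y_pred))                      # 0.0
print("F1-Score :", f1_score(y_true, y_pred, zero_division=0))         # 0.0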

The following custom function, ensemble_training(X_train, y_train), defines the training and validation process.

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def ensemble_training(X_train, y_train):
    # Initialize the base learners
    gradient_boosting = GradientBoostingClassifier(random_state=42)
    decision_tree = DecisionTreeClassifier(random_state=42)
    random_forest = RandomForestClassifier(random_state=42)

    # Define the base models
    base_models = {
        'RandomForest': random_forest,
        'DecisionTree': decision_tree,
        'GradientBoosting': gradient_boosting
    }

    # Initialize the meta learner
    meta_learner = VotingClassifier(estimators=[(name, model) for name, model in base_models.items()], voting='soft')

    # Lists to store training and validation metrics
    train_f1_scores = []
    val_f1_scores = []

    # Split the train set further into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42, stratify=y_train)

    # Training and validation
    for model_name, model in base_models.items():
        model.fit(X_train, y_train)

        # Training metrics
        train_predictions = model.predict(X_train)
        train_f1 = f1_score(y_train, train_predictions)
        train_f1_scores.append(train_f1)

        # Validation metrics using the validation set
        val_predictions = model.predict(X_val)
        val_f1 = f1_score(y_val, val_predictions)
        val_f1_scores.append(val_f1)

    # Train the meta learner on the entire training set
    meta_learner.fit(X_train, y_train)

    return meta_learner, train_f1_scores, val_f1_scores, base_models

The next function block, ensemble_evaluations(meta_learner, X_train, y_train, X_test, y_test), calculates the performance evaluation metrics at the meta learner level.

from sklearn.metrics import precision_score, recall_score

def ensemble_evaluations(meta_learner, X_train, y_train, X_test, y_test):
    # Predictions of the ensemble model on both the training and test datasets
    ensemble_train_predictions = meta_learner.predict(X_train)
    ensemble_test_predictions = meta_learner.predict(X_test)

    # Calculate the F1-Score for the ensemble model
    ensemble_train_f1 = f1_score(y_train, ensemble_train_predictions)
    ensemble_test_f1 = f1_score(y_test, ensemble_test_predictions)

    # Calculate precision and recall for both the training and test datasets
    precision_train = precision_score(y_train, ensemble_train_predictions)
    recall_train = recall_score(y_train, ensemble_train_predictions)
    precision_test = precision_score(y_test, ensemble_test_predictions)
    recall_test = recall_score(y_test, ensemble_test_predictions)

    # Output precision, recall, and F1-Score for both the training and test datasets
    print("Ensemble Model Metrics:")
    print(f"Training Precision: {precision_train:.4f}, Recall: {recall_train:.4f}, F1-score: {ensemble_train_f1:.4f}")
    print(f"Test Precision: {precision_test:.4f}, Recall: {recall_test:.4f}, F1-score: {ensemble_test_f1:.4f}")

    return ensemble_train_predictions, ensemble_test_predictions, ensemble_train_f1, ensemble_test_f1, precision_train, recall_train, precision_test, recall_test

Below, let's look at the performance of the benchmark Ensemble Classifier.

Training Precision: 0.9811, Recall: 0.9603, F1-score: 0.9706
Test Precision: 0.9351, Recall: 0.7579, F1-score: 0.8372

At the meta-learner level, the benchmark model generated an F1-Score at a reasonable level of 0.8372.

Next, let's move on to the scenario with data augmentation using GANs. We want to see if the performance of the GANs scenario can outperform the benchmark scenario.

GANs Scenario: Fraud Detection with data augmentation by GANs

Finally, let's build a perfectly balanced dataset by combining the original imbalanced train dataset (both non-fraud and fraud cases), train_df, with the synthetic fraud dataset generated by the GANs, fake_df. Here, we preserve the test dataset as original by not involving it in this process.

# Label the synthetic fraud samples as fraud (Class = 1) so that they carry the
# target value when combined with the original train dataset
fake_df['Class'] = 1

wdf = pd.concat([train_df, fake_df], axis=0)
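As a quick sanity check of my own (not part of the original code), the combined train dataset should now contain exactly as many fraud rows (378 real plus 225,254 synthetic) as non-fraud rows:

# Verify the class balance of the combined train dataset
print(wdf['Class'].value_counts())
# Expected: 225632 rows for Class 0 and 225632 rows for Class 1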

We will train the same ensemble method with the mixed balanced dataset to see if it will outperform the benchmark model.

Now, we need to split the mixed balanced dataset into the features and the label.

X_mixed = wdf[wdf.columns.drop("Class")]
y_mixed = wdf["Class"]

Remember, when I ran the benchmark scenario earlier, I already defined the necessary custom function blocks to train and evaluate the ensemble classifier. I can use those custom functions here as well to train the same Ensemble algorithm with the mixed balanced data.

We can pass the features and the label (X_mixed, y_mixed) into the custom Ensemble Classifier function ensemble_training().

meta_learner_GANs, train_f1_scores_GANs, val_f1_scores_GANs, base_models_GANs = ensemble_training(X_mixed, y_mixed)

Finally, we can evaluate the model with the test dataset.

ensemble_evaluations(meta_learner_GANs, X_mixed, y_mixed, X_test, y_test)

Here is the result.

Ensemble Model Metrics:
Training Precision: 1.0000, Recall: 0.9999, F1-score: 0.9999
Test Precision: 0.9714, Recall: 0.7158, F1-score: 0.8242
