De-biasing Treatment Effects with Double Machine Learning | by Ryan O’Sullivan | Apr, 2024


Causal AI, exploring the integration of causal reasoning into machine learning

Photo by Ales Nesetril on Unsplash

Welcome to my series on Causal AI, where we will explore the integration of causal reasoning into machine learning models. Expect to explore a number of practical applications across different business contexts.

In the last article we explored making Causal Discovery work in real-world business settings. This time we will cover de-biasing treatment effects with Double Machine Learning.

If you missed the last article on Causal Discovery, check it out here:

This article will demonstrate why Double Machine Learning is an essential part of the Causal AI toolbox.

Expect to gain a deep understanding of:

  • Average treatment effects (ATE)
  • The challenges of using linear regression to estimate the ATE
  • Double Machine Learning and how it overcomes the challenges linear regression faces
  • A worked case study in Python illustrating how to apply Double Machine Learning

The full notebook can be found here:

ATE

The ATE is the average impact of a treatment or intervention on a population. We can calculate it by comparing the average change in a chosen metric between a treatment and control group.

For example, consider a marketing team running a promotion. The treatment group consists of customers who receive an offer, while the control group consists of customers who don't. We can calculate the ATE by comparing the average number of orders in the treatment and control groups.
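
As a minimal sketch (with made-up order counts, purely for illustration), the ATE here is just the difference in group means:

import numpy as np

# Hypothetical order counts for illustration only
treatment_orders = np.array([3, 5, 2, 4, 6])  # customers sent the offer
control_orders = np.array([2, 3, 1, 2, 3])    # customers not sent the offer

# ATE estimate: difference in average orders between the two groups
ate = treatment_orders.mean() - control_orders.mean()
print(f'Estimated ATE: {ate:.2f} additional orders per customer')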

Potential outcomes framework

The potential outcomes framework was developed by Donald Rubin and has become a foundational concept in causal inference. Let's try to understand it using the marketing example above.

  1. Treatment assignment: Each customer has two potential outcomes: the outcome of being in the treatment group (sent the offer) and the outcome of being in the control group (not sent the offer). However, only one potential outcome is observed for each customer.
  2. Counterfactuals: The potential outcome which is not observed is a counterfactual, e.g. what would have happened if this customer had been in the control group (not sent the offer).
  3. Causal effect: The causal effect of a treatment is the difference between the potential outcomes under different treatment conditions (sent the offer vs not sent the offer); the sketch after this list makes this concrete.
  4. Estimation: Causal effects can be estimated from experimental or observational data using a range of causal techniques.
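
A minimal sketch (with invented numbers) can make this concrete. With synthetic data we can peek at both potential outcomes, something that is impossible with real customers:

import pandas as pd

# Toy data for illustration: both potential outcomes are visible here,
# which never happens in reality
potential = pd.DataFrame({
    'orders_if_treated': [4, 6, 3, 5],  # Y(1): orders if sent the offer
    'orders_if_control': [3, 4, 3, 2],  # Y(0): orders if not sent the offer
})

# Individual causal effects, and the ATE as their average
potential['effect'] = potential['orders_if_treated'] - potential['orders_if_control']
print(f"ATE: {potential['effect'].mean():.2f}")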

Several assumptions are made to help ensure the estimated effects are valid:

  • Stable Unit Treatment Value Assumption (SUTVA): The potential outcome for any customer is unaffected by the treatment assignment of other customers.
  • Positivity: For any combination of features, there must be some probability that a customer could receive either treatment or control.
  • Ignorability: All confounders which affect both treatment and outcome are observed.

Experimental data

Estimating the ATE with experimental data is relatively straightforward.

Randomised Controlled Trials (RCTs) or AB tests are designed to randomly assign participants to treatment and control groups. This ensures that any differences in outcomes can be attributed to the treatment effect rather than pre-existing characteristics of the participants.

Back to the example from the marketing team: if they randomly split customers between the treatment and control groups, the average difference in orders is the causal effect of the offer sent.

Observational data

Estimating the ATE using observational data is more challenging.

The most common challenge is confounding variables which affect both the treatment and the outcome. Failure to control for confounders will lead to biased estimates of the treatment effect. We'll come back to this later in the article in the worked case study.
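
To preview why, here is a minimal sketch (using a made-up engagement score as the confounder) where a naive comparison of group means overstates a true treatment effect of 0.5:

import numpy as np
from scipy.special import expit

np.random.seed(42)

# Engagement drives both who receives the offer and how much they order
engagement = np.random.normal(size=10000)
treated = np.random.binomial(1, expit(engagement))
orders = 2.0 + 0.5 * treated + engagement + np.random.normal(size=10000)

# The naive difference in means mixes the treatment effect
# with the engagement difference between the two groups
naive_ate = orders[treated == 1].mean() - orders[treated == 0].mean()
print(f'Naive ATE estimate: {naive_ate:.2f} (true effect is 0.50)')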

Other challenges include:

  • Selection bias: the treatment assignment is influenced by factors related to the outcome.
  • Heterogeneous treatment effects: the treatment effect varies across different subgroups of the population.

Overview

Linear regression can be used to estimate the ATE using observational data. The treatment (T) and control features (X) are included as variables in the model.

Outcome = β0 + β1 * T + β2 * X + ε

The coefficient of the treatment variable (β1 above) is the ATE: the average change in the outcome variable associated with a unit change in the treatment variable, while holding the control features constant.

Data-generating process

We can use a simple data-generating process with one outcome, treatment and confounder to illustrate how we can use linear regression to estimate the ATE.

To start with, we can visualise the causal graph:

import numpy as np

# Create node lookup variables
node_lookup = {0: 'Confounder',
               1: 'Treatment',
               2: 'Outcome'}

total_nodes = len(node_lookup)

# Create adjacency matrix - this is the base for our graph
graph_actual = np.zeros((total_nodes, total_nodes))

# Create graph using expert domain knowledge
graph_actual[0, 1] = 1.0  # Confounder -> Treatment
graph_actual[0, 2] = 1.0  # Confounder -> Outcome
graph_actual[1, 2] = 1.0  # Treatment -> Outcome

# plot_graph is a plotting helper from the accompanying notebook
plot_graph(input_graph=graph_actual, node_lookup=node_lookup)


And then we can create samples using the simple data-generating process. Pay close attention to the coefficient of the treatment variable (0.75): this is our ground truth ATE.

import pandas as pd
import seaborn as sns

np.random.seed(123)

# Create dataframe with a confounder, treatment and outcome
df = pd.DataFrame(columns=['Confounder', 'Treatment', 'Outcome'])
df['Confounder'] = np.random.normal(loc=100, scale=25, size=1000)
df['Treatment'] = np.random.normal(loc=50, scale=10, size=1000) + 0.50 * df['Confounder']
df['Outcome'] = 0.25 * df['Confounder'] + 0.75 * df['Treatment'] + np.random.normal(loc=0, scale=5, size=1000)

sns.pairplot(df, corner=True)


Linear regression

We can then train a linear regression model and extract the coefficient of the treatment variable. We can see that it correctly estimates the ATE (0.75).

from sklearn.linear_model import RidgeCV

# Set target and features
y = df['Outcome']
X = df[['Confounder', 'Treatment']]

# Train model
model = RidgeCV()
model = model.fit(X, y)

# Extract the treatment coefficient
ate_lr = round(model.coef_[1], 2)

print(f'The average treatment effect using Linear Regression is: {ate_lr}')


Challenges

Linear regression can be a very effective method for estimating the ATE. However, there are some challenges to be aware of:

  • It struggles when we have high-dimensional data.
  • The "nuisance parameters" (the control features which are a "nuisance" to estimate) may be too complex for linear regression to capture.
  • It assumes the treatment effect is constant across different subgroups of the population (i.e. no heterogeneity).
  • It assumes there are no unobserved confounders.
  • It assumes that the treatment effect is linear.

Overview

Double Machine Learning is a causal method first introduced in 2017 in the paper "Double/Debiased Machine Learning for Treatment and Structural Parameters":

It aims to reduce bias and improve the estimation of causal effects in situations where we have high-dimensional data and/or complex nuisance parameters.

It’s impressed by the Frisch-Waugh-Lovell theorem, so let’s begin by understanding this.

Frisch-Waugh-Lovell theorem

The FWL theorem is used to decompose the effects of multiple regressors on an outcome variable, allowing us to isolate the effects of interest.

Imagine you had two sets of features, X1 and X2. You could estimate the model parameters using linear regression as we did before. However, you can also obtain the same parameter for X1 by following these steps:

  1. Use X2 only to predict the outcome.
  2. Use X2 only to predict X1.
  3. Calculate the residuals from the outcome model (step 1) and the feature model (step 2).
  4. Regress the residuals of the outcome model on the residuals of the feature model to estimate the parameter for X1.

At first glance this can be quite hard to follow, so let's try it out in Python as an illustration. We use the same data as before, but take the treatment column as X1 and the confounder column as X2:

# Set treatment, outcome and confounder samples
treatment = df['Treatment'].to_numpy().reshape(-1, 1)
outcome = df['Outcome'].to_numpy().reshape(-1, 1)
confounder = df['Confounder'].to_numpy().reshape(-1, 1)

# Train treatment model and calculate residuals
treatment_model = RidgeCV()
treatment_model = treatment_model.fit(confounder, treatment)
treatment_pred = treatment_model.predict(confounder)
treatment_residuals = treatment - treatment_pred

# Train outcome model and calculate residuals
outcome_model = RidgeCV()
outcome_model = outcome_model.fit(confounder, outcome)
outcome_pred = outcome_model.predict(confounder)
outcome_residuals = outcome - outcome_pred

# Train residual model and calculate average treatment effect
final_model = RidgeCV()
final_model = final_model.fit(treatment_residuals, outcome_residuals)
ate_fwl = round(final_model.coef_[0][0], 2)

print(f'The average treatment effect is: {ate_fwl}')


We can see that it correctly estimates the coefficient of the treatment variable (0.75).

Double Machine Studying

Double Machine Learning builds upon FWL by isolating the effects of the treatment and control features, and by using flexible machine learning models.

The first stage is often referred to as orthogonalisation, as the nuisance parameters are estimated independently of the treatment effect estimation.

First stage:

  • Treatment model (de-biasing): a machine learning model used to estimate the probability of treatment assignment (often referred to as the propensity score). The treatment model residuals are then calculated.
  • Outcome model (de-noising): a machine learning model used to estimate the outcome using just the control features. The outcome model residuals are then calculated.

Second stage:

  • The treatment model residuals are used to predict the outcome model residuals.

The coefficient of the second stage model is the ATE. It is worth noting that the second stage model is a linear model, meaning we are assuming the treatment effect is linear (this is why we call DML a partially linear model).
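
To make the two stages concrete, here is a minimal from-scratch sketch for a binary treatment using scikit-learn, with cross-fitted residuals via cross_val_predict (the gradient boosting models are an illustrative choice, not a requirement):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

def dml_ate(y, t, W):
    # First stage (de-biasing): cross-fitted propensity scores from the controls
    t_hat = cross_val_predict(GradientBoostingClassifier(), W, t,
                              cv=5, method='predict_proba')[:, 1]

    # First stage (de-noising): cross-fitted outcome predictions from the controls
    y_hat = cross_val_predict(GradientBoostingRegressor(), W, y, cv=5)

    # Residualise the treatment and the outcome
    t_res = t - t_hat
    y_res = y - y_hat

    # Second stage: regress outcome residuals on treatment residuals;
    # the coefficient is the ATE
    final_model = LinearRegression().fit(t_res.reshape(-1, 1), y_res)
    return final_model.coef_[0]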

Rather than coding it up ourselves (and handling inference, continuous treatments and so on), we can use the Microsoft package EconML. EconML has a wide range of Causal ML techniques implemented, including several implementations of DML:

from econml.dml import LinearDML

# Train DML model
dml = LinearDML(discrete_treatment=False)
dml.fit(df['Outcome'].to_numpy().reshape(-1, 1),
        T=df['Treatment'].to_numpy().reshape(-1, 1),
        X=None,
        W=df['Confounder'].to_numpy().reshape(-1, 1))

# Calculate average treatment effect
ate_dml = round(dml.ate()[0], 2)

print(f'The average treatment effect using DML is: {ate_dml}')


Again we can see that it correctly estimates the coefficient of the treatment variable (0.75).

Background

The Marketing team send attractive offers to selected customers. They don't currently hold out a randomly selected sample of customers to measure the impact of the offers.

The Data Science team is asked to estimate how the offers affect customer orders.

Confounding bias

Naively comparing customers who were and weren't sent offers gives a biased estimate. This is driven by confounding factors:

  • Customers who opt out of email can't receive an offer: this population is less engaged and less likely to order.
  • The CRM team target customers based on their order history: order history affects how likely you are to order again.

Data-generating process

We set up a data-generating process with the following characteristics:

  • Difficult nuisance parameters
  • A simple treatment effect (no heterogeneity)

The X features are customer characteristics taken before the treatment. T is a binary flag indicating whether the customer received the offer. Both feed into the outcome y, as set out in the data-generating code below:

from scipy.special import expit

np.random.seed(123)

# Set number of observations
n = 100000

# Set number of features
p = 10

# Create features
X = np.random.uniform(size=n * p).reshape((n, -1))

# Nuisance parameters
b = (
    np.sin(np.pi * X[:, 0] * X[:, 1])
    + 2 * (X[:, 2] - 0.5) ** 2
    + X[:, 3]
    + 0.5 * X[:, 4]
    + X[:, 5] * X[:, 6]
    + X[:, 7] ** 3
    + np.sin(np.pi * X[:, 8] * X[:, 9])
)

# Create binary treatment
T = np.random.binomial(1, expit(b))

# Set treatment effect
tau = 0.75

# Calculate outcome
y = b + T * tau + np.random.normal(size=n)

The data-generating Python code is based on the synthetic data creator from Uber's CausalML package. Being able to create realistic synthetic data is crucial when it comes to assessing causal inference methods, so I highly recommend you check it out:

Linear Regression

We start by using linear regression to estimate the ATE. Our expectation is that it will struggle to capture the nuisance parameters and will therefore mis-specify the treatment effect.

# Append features and treatment
X_T = np.append(X, T.reshape(-1, 1), axis=1)

# Train linear regression model
model = RidgeCV()
model = model.fit(X_T, y)
y_pred = model.predict(X_T)

# Extract the treatment coefficient
ate_lr = round(model.coef_[-1], 2)

print(f'The average treatment effect using Linear Regression is: {ate_lr}')


Double Machine Studying

We then train a DML model using LightGBM for the flexible first stage models. This should allow us to capture the difficult nuisance parameters while correctly calculating the treatment effect.

from lightgbm import LGBMClassifier, LGBMRegressor

np.random.seed(123)

# Train DML model using flexible stage 1 models
dml = LinearDML(model_y=LGBMRegressor(), model_t=LGBMClassifier(), discrete_treatment=True)
dml.fit(y, T=T, X=None, W=X)

# Calculate average treatment effect
ate_dml = round(dml.ate(), 2)

print(f'The average treatment effect using DML is: {ate_dml}')


Comparison

When we compare the results, we observe that linear regression gives us a biased estimate while DML is very close to the ground truth. This really shows the power of DML!

import matplotlib.pyplot as plt

# Plot comparison of results
categories = ['Ground truth', 'DML', 'Linear Regression']
sns.barplot(x=categories, y=[tau, ate_dml, ate_lr])
plt.ylabel('ATE')
plt.title('Average Treatment Effect comparison')
plt.show()

There are several other causal methods which we can use to estimate the ATE (many of which are implemented in both the EconML and CausalML packages):

  • Propensity score matching (PSM)
  • Inverse propensity score matching (IPSM)
  • S-Learner
  • T-Learner
  • Doubly Robust Learner (DR)
  • Instrumental variable learner (IV)

If you want to delve into these methods further, I would recommend starting with the S-Learner and T-Learner (often referred to as meta-learners). A couple of key learnings to help you start to work out when and where you can apply them (see the sketch after this list):

  • When your treatment is binary, and your treatment and control groups are roughly equal in size, the T-Learner is often a simpler alternative to DML.
  • When your treatment is continuous, and you suspect the treatment effect may be non-linear, the S-Learner may be more appropriate than DML.
  • Meta-learners can struggle with regularisation bias (particularly the S-Learner). When we do see DML outperform meta-learners, this is usually the reason.
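
To illustrate the first point, here is a minimal T-Learner sketch applied to the binary-treatment case study above (LightGBM as the base learner is an illustrative choice):

from lightgbm import LGBMRegressor

# T-Learner: fit separate outcome models on the treated and control groups
model_treated = LGBMRegressor().fit(X[T == 1], y[T == 1])
model_control = LGBMRegressor().fit(X[T == 0], y[T == 0])

# Score every customer with both models; the average difference in
# predicted outcomes is the ATE estimate
ate_t_learner = (model_treated.predict(X) - model_control.predict(X)).mean()
print(f'The average treatment effect using the T-Learner is: {ate_t_learner:.2f}')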
