Home Machine Learning Drawing From a Random Distribution in SQL | by Sami Abboud | Feb, 2024

Drawing From a Random Distribution in SQL | by Sami Abboud | Feb, 2024

0
Drawing From a Random Distribution in SQL | by Sami Abboud | Feb, 2024

[ad_1]

From a likelihood density perform to random samples

Picture by Moritz Kindler on Unsplash

Tlisted below are completely different strategies for updating a reinforcement studying agent’s coverage at every iteration. A couple of weeks in the past we began experimenting with changing our present methodology with a Bayesian inference step. Among the information workloads inside our agent are written in SQL that’s executed on GCP’s BigQuery engine. We use this stack as a result of it gives scalable computational capabilities, ML packages and an easy SQL interface.

The Bayesian inference step we appeared to implement makes use of one of many various parametrizations of a beta distribution. Which implies that we’ll want to have the ability to draw from a beta distribution in SQL. Whereas engaged on this, I found that drawing from the random distribution in SQL is a subject with only a few properly documented examples. So I’m writing about it right here..

BigQuery doesn’t have a beta distribution. It doesn’t have the potential to attract from any random distribution. So my first instinct was to take the definition of the beta distribution, write it in SQL, set the parameters utilizing a CTA, draw a random quantity between 0 and 1 and compute the worth of that perform.

However it’s 2024, so I requested ChatGPT how it will do it:

Me: “​How do you create random attracts from a beta distribution in BigQuery?

ChatGPT:

WITH beta_draws AS (
SELECT
(alpha — 1) / (alpha + beta — 2) AS a,
(beta — 1) / (alpha + beta — 2) AS b
FROM
(SELECT 2 AS alpha, 5 AS beta) — Set your required alpha and beta values
)
SELECT
(a * POW(RAND(), alpha - 1) * POW(1 - RAND(), beta - 1) + b) AS beta_draw
FROM
beta_draws,
UNNEST(GENERATE_ARRAY(1, 1000)) AS rand_num

Me considering to myself: Proper, in order that clearly received’t work.

Do you see the issue within the code? ChatGPT attracts two completely different x values for the presumed beta distribution PDF. I fastened this, cleaned up the question a bit of and sampled 1,000 values. And right here’s the SQL code for doing that:

WITH raw_data AS (
SELECT
‘user_a’ AS contact_id
, 0.674 AS likelihood
, 72 AS lifetime_messages_received
), parameters AS (
SELECT
contact_id
, likelihood*lifetime_messages_received AS alpha
, (1.0 — likelihood)*lifetime_messages_received AS beta
, RAND() AS x
FROM raw_data
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000)) AS draw_id
)
SELECT
contact_id
, ARRAY_AGG(POW(x, alpha — 1.0) * POW(1.0 — x, beta — 1)) AS beta_x
FROM parameters
GROUP BY contact_id

Thanks all, that’s a wrap 🎁 See you within the subsequent put up!

WRONG! 🔴

Let’s take a trusted implementation of drawing from a beta distribution utilizing the identical parameters and examine the outcomes. I’ve used SciPy’s beta.rvs() in Python and listed below are two 100-bin histograms that can enable evaluating the 2 drawn distributions.

from scipy.stats import beta

alpha_param = 0.674 * 72
beta_param = (1–0.674) * 72

scipy_beta_draw = beta.rvs(alpha_param, beta_param, measurement=1000)

Left: Naive draws using BigQuery. Right: Draws using SciPy’s beta.rvs()
(Left): Naive attracts utilizing BigQuery. (Proper): Attracts utilizing SciPy’s beta.rvs()

Nicely, it doesn’t take a magnifying glass to comprehend that the distributions are completely different. I went again the beta distribution definition and realized that it could be as a result of the beta distribution additionally has a scaling fixed which relies on the gamma perform that I didn’t embrace within the calculation 🤦.

Downside: the gamma perform doesn’t have a closed-form expression, and BigQuery doesn’t present an implementation that approximates it. So at this level I made a decision to change to Python, a language that I’m extra aware of and can make my experimentation extra environment friendly. The considering was that if I nail it down in Python, I’ll be capable of translate it to SQL. I might nonetheless want some approach to approximate a gamma perform, however one step at a time.

Let’s implement a handbook draw from a beta distribution in Python, however now with the right fixed utilizing SciPy’s gamma perform:

import numpy as np
from scipy.particular import gamma
from scipy.stats import uniform

alpha_param = 0.674 * 72
beta_param = (1–0.674) * 72

fixed = gamma(alpha_param + beta_param) / (gamma(alpha_param) * gamma(beta_param))
scipy_manual_beta_draw = np.array([
constant*pow(x, alpha_param-1)*pow(1-x, beta_param-1)
for x in uniform.rvs(size=1000)
])

Let’s study the distribution utilizing a 100-bin histogram once more:

Naive attracts utilizing Python

The very first thing we discover is that the size is now completely different, however the distribution nonetheless seems to be just like the one drawn in BigQuery.

… one thing is incorrect… it’s time for a brief stroll to suppose 🚶

After a brief stroll:

What does drawing from a random distribution truly imply? What I’ve applied to this point is randomly sampling from the beta likelihood density perform (PDF) and it wasn’t working.

So I needed to dig up some statistics courses.

Listed below are a few good refreshers on:

Briefly, the conclusion is that drawing from a random variable truly means sampling from the inverse cumulative distribution perform (CDF), not from the likelihood density perform (PDF) like I used to be doing to this point.

After all 🤦. My likelihood professor, who I simply discovered had handed away from sickness in 2020, would have inspired me to “evaluate the fundamentals” at this level..

Okay. Let’s revisit the Python code, now drawing samples from the inverse CDF (which can also be referred to as the quantile perform) of our beta distribution, and examine it to the distribution drawn utilizing SciPy’s beta.rvs():

import numpy as np
from scipy.particular import gamma
from scipy.stats import uniform, beta

alpha_param = 0.674 * 72
beta_param = (1–0.674) * 72
n_draws = 1000

# Use SciPy RVS for comparability
scipy_beta_draw = beta.rvs(alpha_param, beta_param, measurement=n_draws)

# Handbook beta draw with the assistance of the SciPy Gamma perform

# We begin with a discrete analogue of the Beta PDF we want to draw from.
# That is simply sampling from the PDF at fastened intervals however do try
# this evaluate for a extra in-depth remedy of the topic:
# https://jsdajournal.springeropen.com/articles/10.1186/s40488-015-0028-6

# Set the decision for producing the discrete PDF
n_samples = 1000

# The beta distribution is supported on the vary [0, 1], so we set the
# pdf min and max parameters accordingly
pdf_min = 0.0
pdf_max = 1.0

x_span = np.linspace(pdf_min, pdf_max, n_samples)
fixed = gamma(alpha_param + beta_param) / (gamma(alpha_param) * gamma(beta_param))
beta_pdf = np.array([
constant * pow(x, alpha_param — 1) * pow(1 — x, beta_param — 1)
for x in x_span
])

# Utilizing the discrete Beta PDF, we now compute a discrete Beta CDF.
# To try this, we combine the PDF. For every level x, we sum the PDF till
# that time and a number of with the width of every pattern.
freq = 1.0 / n_samples
beta_cdf = beta_pdf.cumsum() * freq

def inv(cdf, q):
“””Return inverse CDF for worth q utilizing the quantile perform”””
return x_span[np.argmin(cdf < q)]

# Lastly, we are able to now draw n_draws from the discrete inverse of CDF, aka
# generate random samples from a beta distribution
manual_beta_draw = np.array([
inv(beta_cdf, x)
for x in uniform.rvs(size=n_draws)
])

*phew* this seems to be a lot better:

An overlay of two histograms evaluating a 1,000 attracts utilizing SciPy’s beta.rvs() and a handbook draw

Now that we’ve received drawing samples from a random variable straight, it’s time to maneuver again to SQL. For the sake of simplicity, and since BigQuery doesn’t readily include an implementation of a Gamma function¹ I’m going to attract from the logistic distribution (with parameters a=0 and b=1).

 — The next 3 parameters have to be adjusted based mostly on the help of the
— PDF of the distribution you want to draw from. This values are set for a logistic
— distribution with a=0 and b=1

DECLARE pdf_min INT64 DEFAULT -10;
DECLARE pdf_max INT64 DEFAULT 10;
DECLARE n_samples INT64 DEFAULT 5000;
DECLARE sampling_step FLOAT64 DEFAULT (pdf_max — pdf_min) / n_samples;

— The variety of random attracts you want to carry out
DECLARE n_draws INT64 DEFAULT 1000;

WITH pdf AS (

— The discrete sampling of the logistic distribution PDF

SELECT
x
, exp(-x) / pow(1 + exp(-x), 2) AS y — a=0, b=1
FROM UNNEST(GENERATE_ARRAY(pdf_min, pdf_max, sampling_step)) AS x
), cdf AS (

— The discrete CDF

SELECT
x
, SUM(y)
OVER (
ORDER BY x
) * (1.0 / n_samples) AS y
FROM pdf
), random_draws AS (

— Random attracts within the vary of [0, max(cdf)]

SELECT
RAND() * (SELECT MAX(y) FROM cdf) as q
, draw_id
FROM UNNEST(GENERATE_ARRAY(1, n_draws)) AS draw_id
)

— Calculate the inverse CDF per draw utilizing the quantile perform by producing
— and array of the discrete help of the distribution and returning the worth
— of the index simply earlier than the randomly generated quantity is bigger than the CDF

SELECT
ARRAY_AGG(x ORDER BY x)[OFFSET(SUM(CAST(y < q AS INT64)))] AS x
FROM random_draws
JOIN cdf
ON TRUE
GROUP BY draw_id;

Let’s now examine the distributions of the three sampling strategies:

  1. SciPy’s logistic.rvs()
  2. Manually sampling the logistic distribution PDF in Python and drawing a random pattern as per Step 2 above
  3. Doing the identical in SQL
An overlay of three histograms evaluating 1,000 attracts utilizing SciPy’s beta.rvs(), a handbook attract Python and a handbook attract SQL

This seems to be like a hit to me! 💪

This SQL code above samples from the logistic distribution, nevertheless it ought to work on any distribution the place you’ll be able to get a discrete illustration of the PDF by sampling it at constant intervals!

[ad_2]