4 Methods to Quantify Fat Tails with Python | by Shawhin Talebi | Dec, 2023

We begin by importing some useful libraries.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import powerlaw
from scipy.stats import kurtosis

Next, we will load each dataset and store them in a dictionary.

filename_list = ['medium-followers', 'YT-earnings', 'LI-impressions']

df_dict = {}

for filename in filename_list:
    df = pd.read_csv('data/'+filename+'.csv')
    df = df.set_index(df.columns[0]) # set index
    df_dict[filename] = df

At this point, looking at the data is always a good idea. We can do that by plotting histograms and printing the top 5 records for each dataset.

for filename in filename_list:
    df = df_dict[filename]

    # plot histograms (function is defined in the notebook on GitHub; a sketch is below)
    plot_histograms(df.iloc[:,0][df.iloc[:,0]>0], filename, filename.split('-')[1])
    plt.savefig("images/"+filename+"_histograms.png")

    # print top 5 records
    print("Top 5 Records by Percentage")
    print((df.iloc[:,0]/df.iloc[:,0].sum()).sort_values(ascending=False)[:5])
    print("")

Histograms for monthly Medium followers. Image by author.
Histograms for YouTube video earnings. Note: if you notice a difference from the previous article, it's because I found a rogue record in the data (that's why looking is a good idea 😅). Image by author.
Histograms for daily LinkedIn impressions. Image by author.

Based on the histograms above, each dataset appears fat-tailed to some extent. Let's look at the top 5 records by percentage to get another view of this.

Top 5 records by percentage for each dataset. Image by author.

From this view, Medium followers appear the most fat-tailed, with 60% of followers coming from just 2 months. YouTube earnings are also strongly fat-tailed, where about 60% of revenue comes from just 4 videos. LinkedIn impressions seem the least fat-tailed.

While we can get a qualitative sense of the fat-tailedness just by looking at the data, let's make this more quantitative via our 4 heuristics.

Heuristic 1: Power Law Tail Index

To obtain an α for each dataset, we can use the powerlaw library as we did in the previous article. This is done in the code block below, where we perform the fit and print the parameter estimates for each dataset in a for loop.

for filename in filename_list:
    df = df_dict[filename]

    # perform Power Law fit
    results = powerlaw.Fit(df.iloc[:,0])

    # print results
    print("")
    print(filename)
    print("-"*len(filename))
    print("Power Law Fit")
    print("a = " + str(results.power_law.alpha-1))
    print("xmin = " + str(results.power_law.xmin))
    print("")

Power Law fit results. Image by author.

The results above match our qualitative assessment that Medium followers are the most fat-tailed, followed by YouTube earnings and LinkedIn impressions (remember, a smaller α means a fatter tail).
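As a quick visual sanity check on these fits (not shown in the original walkthrough), the powerlaw library can overlay the fitted power law on the empirical complementary CDF. A sketch for one dataset:

# sketch: compare the empirical CCDF to the fitted Power Law for one dataset
results = powerlaw.Fit(df_dict['medium-followers'].iloc[:,0])

ax = results.plot_ccdf(label='empirical')                       # empirical CCDF
results.power_law.plot_ccdf(ax=ax, linestyle='--', label='fit') # fitted tail
ax.legend()
plt.show()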

Heuristic 2: Kurtosis

An easy way to compute kurtosis is using an off-the-shelf implementation. Here, I use SciPy and print the results in the same way as before.

for filename in filename_list:
    df = df_dict[filename]

    # print results
    print(filename)
    print("-"*len(filename))
    print("kurtosis = " + str(kurtosis(df.iloc[:,0], fisher=True)))
    print("")

Kurtosis values for each dataset. Image by author.

Kurtosis tells us a different story than Heuristic 1. The ranking of fat-tailedness according to this measure is as follows: LinkedIn > Medium > YouTube.

However, these results should be taken with a grain of salt. As we saw with the power law fits above, all 3 datasets fit a power law with α < 4, meaning the kurtosis is infinite. So, while the computation returns a value, it's probably wise to be suspicious of these numbers.
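One way to see why (an illustrative check under my own assumptions, not part of the original article) is to recompute kurtosis on bootstrap resamples: when fourth moments don't exist, the estimate swings wildly from resample to resample.

# illustrative check: kurtosis estimates swing across bootstrap resamples
X = df_dict['medium-followers'].iloc[:,0].to_numpy()

kurt_list = []
for i in range(1_000):
    resample = np.random.choice(X, size=len(X), replace=True)
    kurt_list.append(kurtosis(resample, fisher=True))

print("kurtosis across resamples: min = " + str(np.min(kurt_list)) +
      ", max = " + str(np.max(kurt_list)))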

Heuristic 3: Log-normal’s σ

We can again use the powerlaw library to obtain σ estimates, similar to what we did for Heuristic 1. Here's what that looks like.

for filename in filename_list:
    df = df_dict[filename]

    # perform fit (powerlaw fits the Log-normal alongside the Power Law)
    results = powerlaw.Fit(df.iloc[:,0])

    # print results
    print("")
    print(filename)
    print("-"*len(filename))
    print("Log Normal Fit")
    print("mu = " + str(results.lognormal.mu))
    print("sigma = " + str(results.lognormal.sigma))
    print("")

Log-normal fit results. Image by author.

Looking at the σ values above, we see all fits imply the data are fat-tailed, where Medium followers and LinkedIn impressions have similar σ estimates. YouTube earnings, on the other hand, have a significantly larger σ value, implying a (much) fatter tail.

One cause for suspicion, however, is that the fit estimates a negative μ, which may suggest a Log-normal fit doesn't explain the data well.
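One way to probe this suspicion (a quick check I'm adding here, not part of the original walkthrough) is powerlaw's built-in likelihood-ratio test, which compares two candidate distributions head to head: a positive R favors the first candidate, and the p-value indicates whether the sign of R can be trusted.

# sketch: likelihood-ratio test between Power Law and Log-normal fits
for filename in filename_list:
    results = powerlaw.Fit(df_dict[filename].iloc[:,0])

    # R > 0 favors the power law, R < 0 favors the log-normal
    R, p = results.distribution_compare('power_law', 'lognormal')
    print(filename + ": R = " + str(R) + ", p = " + str(p))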

Heuristic 4: Taleb’s κ

Since I couldn't find an off-the-shelf Python implementation for computing κ (I didn't look very hard), this computation requires a few extra steps. Namely, we need to define 3 helper functions, as shown below.

def mean_abs_deviation(S):
    """
    Computation of mean absolute deviation of an input sample S
    """
    M = np.mean(np.abs(S - np.mean(S)))

    return M

def generate_n_sample(X,n):
    """
    Function to generate n random samples of size len(X) from an array X
    """
    # initialize sample
    S_n=0

    for i in range(n):
        # randomly sample len(X) observations from X and add them to the sample
        S_n = S_n + X[np.random.randint(len(X), size=int(np.round(len(X))))]

    return S_n

def kappa(X,n):
    """
    Taleb's kappa metric from n0=1 as described here: https://arxiv.org/abs/1802.05495

    Note: K_1n = kappa(1,n) = 2 - ((log(n)-log(1))/log(M_n/M_1)), where M_n denotes the mean absolute deviation of the sum of n random samples
    """
    S_1 = X
    S_n = generate_n_sample(X,n)

    M_1 = mean_abs_deviation(S_1)
    M_n = mean_abs_deviation(S_n)

    K_1n = 2 - (np.log(n)/np.log(M_n/M_1))

    return K_1n

The first function, mean_abs_deviation(), computes the mean absolute deviation as defined earlier.

Next, we need a way to generate and sum n samples from our empirical data. Here, I take a naive approach and randomly sample an input array (X) n times and sum the samples together.

Finally, I bring together mean_abs_deviation(S) and generate_n_sample(X,n) to implement the κ calculation defined before and compute it for each dataset.

n = 100 # number of samples to include in kappa calculation

for filename in filename_list:
    df = df_dict[filename]

    # print results
    print(filename)
    print("-"*len(filename))
    print("kappa_1n = " + str(kappa(df.iloc[:,0].to_numpy(), n)))
    print("")

κ(1,100) values for each dataset. Image by author.

The results above give us yet another story. However, given the implicit randomness of this calculation (recall the generate_n_sample() definition) and the fact we're dealing with fat tails, point estimates (i.e. just running the computation once) cannot be trusted.

Accordingly, I run the same calculation 1,000 times and print the mean κ(1,100) for each dataset.

num_runs = 1_000
kappa_dict = {}

for filename in filename_list:
    df = df_dict[filename]

    kappa_list = []
    for i in range(num_runs):
        kappa_list.append(kappa(df.iloc[:,0].to_numpy(), n))

    kappa_dict[filename] = np.array(kappa_list)

    print(filename)
    print("-"*len(filename))
    print("mean kappa_1n = " + str(np.mean(kappa_dict[filename])))
    print("")

Mean κ(1,100) values from 1,000 runs for each dataset. Image by author.

These more stable results indicate Medium followers are the most fat-tailed, followed by LinkedIn impressions and YouTube earnings.

Note: One can compare these values to Table III in ref [3] to better understand each κ value. Namely, these values are comparable to a Pareto distribution with α between 2 and 3.
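To ground that comparison without the table (a cross-check of my own, not from the original article), one can run kappa() on a synthetic Pareto sample with a known α and see that the resulting κ(1,100) lands in the same neighborhood as our empirical values.

# sketch: kappa(1,100) for a synthetic Pareto sample with known alpha
alpha = 2.5
X_pareto = np.random.pareto(alpha, size=100_000) + 1  # Pareto with x_min = 1
print("kappa_1n (Pareto, alpha=2.5) = " + str(kappa(X_pareto, 100)))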

Although each heuristic told a slightly different story, all signs point toward Medium followers gained being the most fat-tailed of the 3 datasets.

While binary labeling of data as fat-tailed (or not) may be tempting, fat-tailedness lives on a spectrum. Here, we broke down 4 heuristics for quantifying how fat-tailed data are.

Although each approach has its limitations, they provide practitioners with quantitative ways of comparing the fat-tailedness of empirical data.

👉 More on Power Laws & Fat Tails: Introduction | Power Law Fits
