The Math Behind “The Curse of Dimensionality”

By Maxime Wolf, April 2024

Dive into the “Curse of Dimensionality” concept and understand the math behind all the surprising phenomena that arise in high dimensions.

Image from DALL-E

In the realm of machine learning, handling high-dimensional vectors is not just common; it is essential. This is illustrated by the architecture of popular models like Transformers. For example, BERT uses 768-dimensional vectors to encode the tokens of the input sequences it processes and to better capture complex patterns in the data. Given that our brain struggles to visualize anything beyond 3 dimensions, the use of 768-dimensional vectors is quite mind-blowing!

While some Machine and Deep Learning models excel in these high-dimensional scenarios, they also present many challenges. In this article, we will explore the concept of the “curse of dimensionality”, explain some interesting phenomena associated with it, delve into the mathematics behind these phenomena, and discuss their general implications for your Machine Learning models.

Note that detailed mathematical proofs related to this article are available on my website as a supplementary extension to this article.

People often assume that geometric concepts familiar in three dimensions behave similarly in higher-dimensional spaces. This is not the case. As dimension increases, many interesting and counterintuitive phenomena arise. The “Curse of Dimensionality” is a term coined by Richard Bellman (a famous mathematician) that refers to all these surprising effects.

What is so special about high dimensions is how the “volume” of the space (we will explore that in more detail soon) grows exponentially. Take a graduated line (in one dimension) from 1 to 10. There are 10 integers on this line. Extend that to 2 dimensions: it is now a square with 10 × 10 = 100 points with integer coordinates. Now consider “only” 80 dimensions: you would already have 10⁸⁰ points, which is roughly the number of atoms in the universe.

In other words, as the dimension increases, the volume of the space grows exponentially, resulting in data becoming increasingly sparse.
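As a quick illustration of this growth, here is a tiny sketch (my own illustration, not code from the original article) that counts the integer-coordinate points of a side-10 grid as the dimension increases:

# Number of integer-coordinate points in a side-10 grid, as the dimension grows
for dimension in [1, 2, 3, 10, 80]:
    print(f"{dimension} dimension(s): {10 ** dimension} grid points")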

High-dimensional spaces are “empty”

Consider this other example. We want to calculate the farthest distance between two points in a unit hypercube (where each side has a length of 1):

  • In 1 dimension (the hypercube is a line segment from 0 to 1), the maximum distance is simply 1.
  • In 2 dimensions (the hypercube forms a square), the maximum distance is the distance between the opposite corners [0,0] and [1,1], which is √2, calculated using the Pythagorean theorem.
  • Extending this concept to n dimensions, the distance between the points at [0,0,…,0] and [1,1,…,1] is √n. This formula arises because each additional dimension adds a square of 1 to the sum under the square root (again by the Pythagorean theorem).

Interestingly, as the number of dimensions n increases, the largest distance within the hypercube grows only at an O(√n) rate. This phenomenon illustrates a diminishing returns effect, where increases in dimensional space lead to proportionally smaller gains in spatial distance. More details on this effect and its implications will be explored in the following sections of this article.
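To make the √n growth concrete, here is a small sketch (my own illustration, using NumPy) that measures the hypercube diagonal for a few dimensions:

import numpy as np

# Distance between the corners [0, ..., 0] and [1, ..., 1] of the unit hypercube
for n in [1, 2, 3, 10, 100, 768]:
    diagonal = np.linalg.norm(np.ones(n) - np.zeros(n))
    print(f"n = {n}: diagonal = {diagonal:.3f} (sqrt(n) = {np.sqrt(n):.3f})")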

Let’s dive deeper into the notion of distances we started exploring in the previous section.

We had our first glimpse of how high-dimensional spaces render the notion of distance almost meaningless. But what does this really mean, and can we mathematically visualize this phenomenon?

Let’s consider an experiment, using the same n-dimensional unit hypercube we defined before. First, we generate a dataset by randomly sampling many points in this cube: we effectively simulate a multivariate uniform distribution. Then, we sample another point (a “query” point) from that distribution and observe the distance from its nearest and farthest neighbor in our dataset.

Here is the corresponding Python code.

import numpy as np

def generate_data(dimension, num_points):
    ''' Generate random data points within [0, 1] for each coordinate in the given dimension '''
    data = np.random.rand(num_points, dimension)
    return data

def neighbors(data, query_point):
    ''' Return the distances to the nearest and farthest points in data from query_point '''
    nearest_distance = float('inf')
    farthest_distance = 0
    for point in data:
        distance = np.linalg.norm(point - query_point)
        if distance < nearest_distance:
            nearest_distance = distance
        if distance > farthest_distance:
            farthest_distance = distance
    return nearest_distance, farthest_distance
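To reproduce the experiment, here is a minimal usage sketch built on the two functions above (the dimensions and number of points are arbitrary choices, not necessarily those used for the figure):

# Compare nearest and farthest distances as the dimension increases
np.random.seed(42)
num_points = 1000
for dimension in [2, 10, 100, 1000]:
    data = generate_data(dimension, num_points)
    query_point = np.random.rand(dimension)
    nearest, farthest = neighbors(data, query_point)
    print(f"dim={dimension}: nearest={nearest:.3f}, farthest={farthest:.3f}, "
          f"relative difference={(farthest - nearest) / farthest:.3f}")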

We can also plot these distances:

Distances to the nearest and farthest points as n increases (Image by the author)

Using a log scale, we observe that the relative difference between the distance to the nearest and the farthest neighbor tends to decrease as the dimension increases.

This is a very unintuitive behavior: as explained in the previous section, points are very sparse because of the exponentially increasing volume of the space, but at the same time, the relative distances between points become smaller.

The notion of nearest neighbors vanishes

This means that the very concept of distance becomes less relevant and less discriminative as the dimension of the space increases. As you can imagine, this poses problems for Machine Learning algorithms that rely solely on distances, such as kNN.

We will now talk about some other interesting phenomena. For this, we need the n-ball. An n-ball is the generalization of a ball in n dimensions. The n-ball of radius R is the collection of points at a distance of at most R from the center of the space, 0.

Let’s consider a radius of 1. The 1-ball is the segment [-1, 1]. The 2-ball is the disk delimited by the unit circle, whose equation is x² + y² ≤ 1. The 3-ball (what we usually call a “ball”) has the equation x² + y² + z² ≤ 1. As you can see, we can extend that definition to any dimension:

x₁² + x₂² + … + xₙ² ≤ 1

The question now is: what is the volume of this ball? This is not an easy question and requires quite a bit of math, which I won’t detail here. However, you can find all the details on my website, in my post about the volume of the n-ball.

After some fun (integral calculus), you can show that the volume of the n-ball can be expressed as follows, where Γ denotes the Gamma function.
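In standard notation, this closed form reads:

Vₙ(R) = π^(n/2) / Γ(n/2 + 1) · Rⁿ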

For example, with R = 1 and n = 2, the volume is πR², because Γ(2) = 1. This is indeed the “volume” of the 2-ball (also referred to as the “area” of a circle in this case).

However, beyond being an interesting mathematical challenge, the volume of the n-ball also has some very surprising properties.

As the dimension n increases, the volume of the n-ball converges to 0.

This is true for every radius, but let’s visualize this phenomenon with a few values of R.
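Here is a minimal sketch of how these volumes can be computed numerically (my own illustration, using the closed form above):

import math

def n_ball_volume(n, radius=1.0):
    ''' Volume of the n-ball: pi^(n/2) / Gamma(n/2 + 1) * R^n '''
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * radius ** n

for radius in [0.5, 1.0, 1.5]:
    volumes = [n_ball_volume(n, radius) for n in range(1, 21)]
    peak_n = 1 + volumes.index(max(volumes))
    print(f"R = {radius}: volume peaks at n = {peak_n}, then decreases toward 0")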

Volume of the n-ball for different radii as the dimension increases (Image by the author)

As you can see, it does not simply converge to 0: it starts by increasing and then decreases to 0. For R = 1, the ball with the largest volume is the 5-ball, and the value of n that reaches the maximum shifts to the right as R increases.

Here are the first values of the volume of the unit n-ball, up to n = 10.

Volume of the unit n-ball for different values of n (Image by the author)

The volume of a high-dimensional unit ball is concentrated near its surface.

For small dimensions, the volume of a ball looks quite “homogeneous”: this is not the case in high dimensions.

A spherical shell

Let’s consider an n-ball with radius R and another with radius R − dR, where dR is very small. The portion of the n-ball between these 2 balls is called a “shell” and corresponds to the portion of the ball near its surface (see the visualization above in 3D). We can compute the ratio between the volume of the inner ball (of radius R − dR) and the volume of the entire ball, which tells us how much of the volume lies outside the thin shell.

Ratio (inner ball volume / total volume) as n increases (Image by the author)

As we can see, this ratio converges very quickly to 0: almost all of the volume is near the surface in high-dimensional spaces. For instance, for R = 1, dR = 0.05, and n = 50, about 92.3% of the volume is concentrated in the thin shell. This shows that in higher dimensions, the volume sits in the “corners”. This is again related to the distortion of the concept of distance we have seen earlier.
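We can verify this figure directly: since the volume scales as Rⁿ, the fraction of the volume left inside the inner ball of radius R − dR is ((R − dR)/R)ⁿ, and the shell contains the rest. A minimal sketch:

# Fraction of the n-ball volume contained in the thin outer shell
R, dR = 1.0, 0.05
for n in [2, 10, 50, 100]:
    inner_fraction = ((R - dR) / R) ** n
    shell_fraction = 1 - inner_fraction
    print(f"n = {n}: {100 * shell_fraction:.1f}% of the volume is in the shell")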

Note that the volume of the hypercube [-1, 1]ⁿ, the smallest cube containing the unit ball, is 2ⁿ. The unit ball is basically “empty” in very high dimensions, while the surrounding hypercube, in contrast, gets exponentially larger. Again, this shows how the idea of a “nearest neighbor” of a point loses its effectiveness, because there is almost no point within a distance R of a query point q when n is large.

The curse of dimensionality is closely related to the overfitting principle. Because of the exponential growth of the volume of the space with the dimension, we need very large datasets to adequately capture and model high-dimensional patterns. Even worse: the number of samples we need grows exponentially with the dimension to overcome this limitation. This scenario, characterized by many features but relatively few data points, is particularly prone to overfitting.

Occam’s Razor suggests that simpler models are generally better than complex ones because they are less likely to overfit. This principle is particularly relevant in high-dimensional contexts (where the curse of dimensionality plays a role) because it encourages reducing model complexity.

Applying Occam’s Razor in high-dimensional scenarios can mean reducing the dimensionality of the problem itself (via techniques like PCA, feature selection, etc.), thereby mitigating some effects of the curse of dimensionality. Simplifying the model’s structure or the feature space helps in managing the sparse data distribution and in making distance metrics meaningful again. For instance, dimensionality reduction is a very common preliminary step before applying the kNN algorithm. More recent methods, such as ANNs (Approximate Nearest Neighbors), also emerge as a way to deal with high-dimensional scenarios.
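As an illustration (a minimal sketch with scikit-learn on a synthetic dataset, not code from the original article), dimensionality reduction can be chained with kNN like this:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Synthetic high-dimensional dataset: 500 features, only 20 of them informative
X, y = make_classification(n_samples=2000, n_features=500, n_informative=20, random_state=0)

# Reduce to 20 dimensions with PCA before applying kNN
model = make_pipeline(PCA(n_components=20), KNeighborsClassifier(n_neighbors=5))
print("Cross-validated accuracy:", cross_val_score(model, X, y, cv=5).mean())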

Image from DALL-E

While we have outlined the challenges of high-dimensional settings in machine learning, there are also some advantages!

  • High dimensions can enhance linear separability, making techniques like kernel methods more effective.
  • Additionally, deep learning architectures are particularly adept at navigating and extracting complex patterns from high-dimensional spaces.

As always with Machine Learning, this is a trade-off: leveraging these advantages involves balancing the increased computational demands against the potential gains in model performance.

Hopefully, this gives you an idea of how “weird” geometry can be in high dimensions and the many challenges it poses for machine learning model development. We saw how, in high-dimensional spaces, data is very sparse but also tends to be concentrated in the corners, and how distances lose their usefulness. For a deeper dive into the n-ball and the mathematical proofs, I encourage you to visit the extended version of this article on my website.

While the “curse of dimensionality” outlines significant limitations in high-dimensional spaces, it is exciting to see how modern deep learning models are increasingly adept at navigating these complexities. Consider embedding models or the latest LLMs, for example, which use very high-dimensional vectors to more effectively discern and model textual patterns.
