Why Deep Learning Models Run Faster on GPUs: A Brief Introduction to CUDA Programming
by Lucas de Lima Nogueira | Apr 2024

For those who want to understand what .to("cuda") does

Image by the author with the assistance of AI (https://copilot.microsoft.com/images/create)

Nowadays, when we talk about deep learning, it is very common to associate its implementation with the use of GPUs in order to improve performance.

GPUs (Graphics Processing Units) were originally designed to accelerate the rendering of images and 2D and 3D graphics. However, due to their capability of performing many parallel operations, their usefulness extends beyond that to applications such as deep learning.

The use of GPUs for deep learning models started around the mid to late 2000s and became very popular around 2012 with the emergence of AlexNet. AlexNet, a convolutional neural network designed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. This victory marked a milestone, as it demonstrated the effectiveness of deep neural networks for image classification and the use of GPUs for training large models.

Following this breakthrough, the use of GPUs for deep learning models became increasingly popular, which contributed to the creation of frameworks like PyTorch and TensorFlow.

Nowadays, we just write .to("cuda") in PyTorch to send data to the GPU and expect the training to be accelerated. But how do deep learning algorithms take advantage of GPU computation performance in practice? Let's find out!

Deep learning architectures like neural networks, CNNs, RNNs and transformers are basically built using mathematical operations such as matrix addition, matrix multiplication and applying a function to a matrix. Thus, if we find a way to optimize these operations, we can improve the performance of deep learning models.

So, let's start simple. Imagine you want to add two vectors C = A + B.

A simple implementation of this in C would be:

void AddTwoVectors(float A[], float B[], float C[]) {
    for (int i = 0; i < N; i++) { // N is the number of elements, assumed defined elsewhere
        C[i] = A[i] + B[i];
    }
}

As you can notice, the computer must iterate over the vectors, adding each pair of elements sequentially at every iteration. But these operations are independent of each other. The addition of the i-th pair of elements does not depend on any other pair. So, what if we could execute these operations concurrently, adding all the pairs of elements in parallel?

A straightforward approach would be to use CPU multithreading in order to run all of the computation in parallel. However, when it comes to deep learning models, we are dealing with massive vectors, with millions of elements. A common CPU can only handle around a dozen threads simultaneously. That's where GPUs come into action! Modern GPUs can run millions of threads simultaneously, enhancing the performance of these mathematical operations on massive vectors.
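
As a point of reference, here is a minimal sketch of that CPU multithreading approach using OpenMP, one common way to parallelize a loop on the CPU in C/C++ (the function name and the use of OpenMP are illustrative choices, not part of the original code; compile with -fopenmp):

#include <omp.h>

// Parallel vector addition on the CPU: OpenMP splits the loop iterations
// across the available CPU threads, typically only a few dozen at most.
void AddTwoVectorsCPU(const float A[], const float B[], float C[], int N) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}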

Although CPU computations can be faster than GPU computations for a single operation, the advantage of GPUs relies on their parallelization capabilities. The reason for this is that they are designed with different goals. Whereas the CPU is designed to execute a sequence of operations (a thread) as fast as possible (and can only execute dozens of them simultaneously), the GPU is designed to execute millions of them in parallel (while sacrificing the speed of individual threads).

To illustrate, imagine that the CPU is like a Ferrari and the GPU is like a bus. If your task is to move one person, the Ferrari (CPU) is the better choice. However, if you are moving several people, even though the Ferrari (CPU) is faster per trip, the bus (GPU) can transport everyone in a single go, moving all of the people at once faster than the Ferrari traveling the route several times. So CPUs are better suited for handling sequential operations and GPUs for parallel operations.

Image by the author with the assistance of AI (https://copilot.microsoft.com/images/create)

In order to provide greater parallel capabilities, GPU designs allocate more transistors to data processing than to data caching and flow control, unlike CPUs, which allocate a significant portion of their transistors to those purposes in order to optimize single-threaded performance and complex instruction execution.

The figure below illustrates the distribution of chip resources for a CPU vs a GPU.

Image by the author, with inspiration from the CUDA C++ Programming Guide

CPUs have powerful cores and a more complex cache memory architecture (allocating a significant number of transistors to that). This design enables faster handling of sequential operations. On the other hand, GPUs prioritize having a large number of cores in order to achieve a higher level of parallelism.

Now that we understand these basic concepts, how can we take advantage of these parallel computation capabilities in practice?

When you run a deep learning model, you probably choose to use some popular Python library such as PyTorch or TensorFlow. However, it is well known that the core of these libraries runs C/C++ code under the hood. Also, as we mentioned before, you might use GPUs to speed up processing. That's where CUDA comes in! CUDA stands for Compute Unified Device Architecture, and it is a platform developed by NVIDIA for general-purpose processing on its GPUs. Thus, while DirectX is used by game engines to handle graphical computation, CUDA enables developers to integrate NVIDIA's GPU computational power into their general-purpose software applications, extending beyond just graphics rendering.

In order to do that, CUDA provides a simple C/C++-based interface (CUDA C/C++) that grants access to the GPU's virtual instruction set and specific operations (such as moving data between the CPU and the GPU).

Before we go further, let's understand some basic CUDA programming concepts and terminology:

  • host: refers to the CPU and its memory;
  • device: refers to the GPU and its memory;
  • kernel: refers to a function that is executed on the device (GPU).

So, in a basic piece of code written using CUDA, the program runs on the host (CPU), sends data to the device (GPU) and launches kernels (functions) to be executed on the device (GPU). These kernels are executed by several threads in parallel. After the execution, the results are transferred back from the device (GPU) to the host (CPU).

So let's go back to our problem of adding two vectors:

#include <stdio.h>

void AddTwoVectors(float A[], float B[], float C[]) {
    for (int i = 0; i < N; i++) { // N is the number of elements, assumed defined elsewhere
        C[i] = A[i] + B[i];
    }
}

int main() {
    ...
    AddTwoVectors(A, B, C);
    ...
}

In CUDA C/C++, programmers can define C/C++ functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads.

To define a kernel, you use the __global__ declaration specifier, and the number of CUDA threads that execute the kernel can be specified using the <<<...>>> notation:

#include <stdio.h>

// Kernel definition
__global__ void AddTwoVectors(float A[], float B[], float C[]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    ...
    // Kernel invocation with N threads
    AddTwoVectors<<<1, N>>>(A, B, C);
    ...
}

Each thread executes the kernel and is given a unique thread ID, threadIdx, accessible within the kernel through built-in variables. The code above adds two vectors, A and B, of size N and stores the result in vector C. As you can notice, instead of a loop that executes each pair-wise addition sequentially, CUDA allows us to perform all of these operations simultaneously, using N threads in parallel.

But before we can run this code, we need to make another modification. It is important to remember that the kernel function runs on the device (GPU), so all of its data needs to be stored in device memory. You can do this using the following CUDA built-in functions:

#include <stdio.h>

// Kernel definition
__global__ void AddTwoVectors(float A[], float B[], float C[]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {

    int N = 1000; // Size of the vectors
    float A[N], B[N], C[N]; // Arrays for vectors A, B, and C

    ...

    float *d_A, *d_B, *d_C; // Device pointers for vectors A, B, and C

    // Allocate memory on the device for vectors A, B, and C
    cudaMalloc((void **)&d_A, N * sizeof(float));
    cudaMalloc((void **)&d_B, N * sizeof(float));
    cudaMalloc((void **)&d_C, N * sizeof(float));

    // Copy vectors A and B from host to device
    cudaMemcpy(d_A, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation with N threads
    AddTwoVectors<<<1, N>>>(d_A, d_B, d_C);

    // Copy vector C from device to host
    cudaMemcpy(C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

}

Instead of directly passing the variables A, B and C to the kernel, we need to use pointers. In CUDA programming, you can't directly use host arrays (like A, B, and C in the example) inside kernel launches (<<<...>>>). CUDA kernels operate on device memory, so you need to pass the device pointers (d_A, d_B, and d_C) to the kernel for it to operate on.

Beyond that, we need to allocate memory on the device using cudaMalloc, and copy data between host and device using cudaMemcpy.

Now we can add the initialization of vectors A and B, and free the device memory at the end of the code.

#include <stdio.h>

// Kernel definition
__global__ void AddTwoVectors(float A[], float B[], float C[]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {

    int N = 1000; // Size of the vectors
    float A[N], B[N], C[N]; // Arrays for vectors A, B, and C

    // Initialize vectors A and B
    for (int i = 0; i < N; ++i) {
        A[i] = 1;
        B[i] = 3;
    }

    float *d_A, *d_B, *d_C; // Device pointers for vectors A, B, and C

    // Allocate memory on the device for vectors A, B, and C
    cudaMalloc((void **)&d_A, N * sizeof(float));
    cudaMalloc((void **)&d_B, N * sizeof(float));
    cudaMalloc((void **)&d_C, N * sizeof(float));

    // Copy vectors A and B from host to device
    cudaMemcpy(d_A, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation with N threads
    AddTwoVectors<<<1, N>>>(d_A, d_B, d_C);

    // Copy vector C from device to host
    cudaMemcpy(C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

Also, we need to add cudaDeviceSynchronize(); after we call the kernel. This is a function used to synchronize the host thread with the device. When it is called, the host thread will wait until all previously issued CUDA commands on the device are completed before continuing execution.

Beyond that, it is important to add some CUDA error checking so we can identify bugs on the GPU. If we don't add this checking, the code will continue the execution of the host thread (CPU) and it will be difficult to identify CUDA-related errors.

The implementation of both techniques is shown below:

#include <stdio.h>
#include <stdlib.h>

// Kernel definition
__global__ void AddTwoVectors(float A[], float B[], float C[]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {

    int N = 1000; // Size of the vectors
    float A[N], B[N], C[N]; // Arrays for vectors A, B, and C

    // Initialize vectors A and B
    for (int i = 0; i < N; ++i) {
        A[i] = 1;
        B[i] = 3;
    }

    float *d_A, *d_B, *d_C; // Device pointers for vectors A, B, and C

    // Allocate memory on the device for vectors A, B, and C
    cudaMalloc((void **)&d_A, N * sizeof(float));
    cudaMalloc((void **)&d_B, N * sizeof(float));
    cudaMalloc((void **)&d_C, N * sizeof(float));

    // Copy vectors A and B from host to device
    cudaMemcpy(d_A, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation with N threads
    AddTwoVectors<<<1, N>>>(d_A, d_B, d_C);

    // Check for errors
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(error));
        exit(-1);
    }

    // Wait until all CUDA threads are executed
    cudaDeviceSynchronize();

    // Copy vector C from device to host
    cudaMemcpy(C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

To compile and run the CUDA code, you need to ensure that the CUDA toolkit is installed on your system. Then, you can compile the code using nvcc, the NVIDIA CUDA Compiler. If you don't have a GPU on your machine, you can use Google Colab. You just need to select a GPU in Runtime → Notebook settings, then save the code in an example.cu file and run:

%%shell
nvcc example.cu -o compiled_example # compile
./compiled_example # run

# you can also run the code with the bug detection sanitizer
compute-sanitizer --tool memcheck ./compiled_example

However, our code is still not fully optimized. The example above uses a vector of size N = 1000. But this is a small number that does not fully demonstrate the GPU's parallelization capability. Also, when dealing with deep learning problems, we often handle massive vectors with millions of parameters. However, if we try setting, for example, N = 500000 and run the kernel with <<<1, 500000>>> using the example above, it will throw an error. Thus, to improve the code and perform such an operation, we first need to understand an important concept of CUDA programming: the thread hierarchy.

Kernel functions are launched using the notation <<<number_of_blocks, threads_per_block>>>. So, in our example above, we run 1 block with N CUDA threads. However, each block has a limit on the number of threads it can support. This happens because every thread within a block is required to be located on the same streaming multiprocessor core and must share the memory resources of that core.

You can get this limit using the following snippet of code:

int device;
cudaDeviceProp props;
cudaGetDevice(&device);
cudaGetDeviceProperties(&props, device);
printf("Maximum threads per block: %d\n", props.maxThreadsPerBlock);

On current Colab GPUs, a thread block may contain up to 1024 threads. Thus, we need more blocks in order to execute many more threads and process the massive vector in our example. Also, blocks are organized into grids, as illustrated below:

https://handwiki.org/wiki/index.php?curid=1157670 (CC BY-SA 3.0)

Now, the global index of each thread can be obtained using:

int i = blockIdx.x * blockDim.x + threadIdx.x;

So, our script becomes:

#include <stdio.h>
#include <stdlib.h>

// Kernel definition
__global__ void AddTwoVectors(float A[], float B[], float C[], int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) // To avoid exceeding the array limit
        C[i] = A[i] + B[i];
}

int main() {
    int N = 500000; // Size of the vectors
    int threads_per_block;
    int device;
    cudaDeviceProp props;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&props, device);
    threads_per_block = props.maxThreadsPerBlock;
    printf("Maximum threads per block: %d\n", threads_per_block); // 1024

    // Host arrays for vectors A, B, and C (heap-allocated, since vectors
    // this large would not safely fit on the stack)
    float *A = (float *)malloc(N * sizeof(float));
    float *B = (float *)malloc(N * sizeof(float));
    float *C = (float *)malloc(N * sizeof(float));

    // Initialize vectors A and B
    for (int i = 0; i < N; ++i) {
        A[i] = 1;
        B[i] = 3;
    }

    float *d_A, *d_B, *d_C; // Device pointers for vectors A, B, and C

    // Allocate memory on the device for vectors A, B, and C
    cudaMalloc((void **)&d_A, N * sizeof(float));
    cudaMalloc((void **)&d_B, N * sizeof(float));
    cudaMalloc((void **)&d_C, N * sizeof(float));

    // Copy vectors A and B from host to device
    cudaMemcpy(d_A, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation with multiple blocks of threads_per_block threads each
    int number_of_blocks = (N + threads_per_block - 1) / threads_per_block;
    AddTwoVectors<<<number_of_blocks, threads_per_block>>>(d_A, d_B, d_C, N);

    // Check for errors
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(error));
        exit(-1);
    }

    // Wait until all CUDA threads are executed
    cudaDeviceSynchronize();

    // Copy vector C from device to host
    cudaMemcpy(C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    free(A);
    free(B);
    free(C);
}

Below is a comparison of CPU and GPU computation of this two-vector addition for different vector sizes.

Image by the author

As you can see, the advantage of GPU processing only becomes apparent at large vector sizes N. Also, keep in mind that this time comparison only considers the execution of the kernel/function. It does not take into account the time to copy data between host and device, which, although not significant in most cases, is relatively considerable here, since we are performing only a simple addition operation. Therefore, it is important to remember that GPU computation only demonstrates its advantage when dealing with computations that are both highly compute-intensive and highly parallelizable.
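
If you want to reproduce this kind of kernel-only measurement, one common approach is to time the launch with CUDA events; the fragment below is a sketch that assumes the variables (d_A, d_B, d_C, N, number_of_blocks, threads_per_block) from the previous example:

// Time only the kernel execution with CUDA events,
// excluding the host<->device copies
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
AddTwoVectors<<<number_of_blocks, threads_per_block>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);

cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel time: %f ms\n", milliseconds);

cudaEventDestroy(start);
cudaEventDestroy(stop);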

Okay, now we know how to improve the performance of a simple array operation. But when dealing with deep learning models, we need to handle matrix and tensor operations. In our previous example, we only used one-dimensional blocks with N threads. However, it is also possible to execute multidimensional thread blocks (up to 3 dimensions). So, for convenience, you can run a thread block of NxM threads if you need to run matrix operations. In this case, you can obtain the matrix row and column indices as row = threadIdx.x, col = threadIdx.y. Also, for convenience, you can use the dim3 variable type to define number_of_blocks and threads_per_block.

The example below illustrates how to add two matrices.

#include <stdio.h>

// Kernel definition (N assumed to be a compile-time constant, e.g. #define N 32)
__global__ void AddTwoMatrices(float A[N][N], float B[N][N], float C[N][N]) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main() {
    ...
    // Kernel invocation with 1 block of NxN threads
    dim3 threads_per_block(N, N);
    AddTwoMatrices<<<1, threads_per_block>>>(A, B, C);
    ...
}

You can also extend this example to handle multiple blocks:

#include <stdio.h>

// Kernel definition (N assumed to be a compile-time constant)
__global__ void AddTwoMatrices(float A[N][N], float B[N][N], float C[N][N]) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N) {
        C[i][j] = A[i][j] + B[i][j];
    }
}

int main() {
    ...
    // Kernel invocation with multiple blocks of 32x32 threads each
    dim3 threads_per_block(32, 32);
    dim3 number_of_blocks((N + threads_per_block.x - 1) / threads_per_block.x, (N + threads_per_block.y - 1) / threads_per_block.y);
    AddTwoMatrices<<<number_of_blocks, threads_per_block>>>(A, B, C);
    ...
}

You can also extend this example to process three-dimensional operations using the same idea.
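
For instance, here is a minimal sketch of how that could look for the element-wise addition of two three-dimensional tensors (the kernel and launch configuration below are illustrative, not code from the original post; D is assumed to be a compile-time constant):

// Kernel definition: element-wise addition of two DxDxD tensors
__global__ void AddTwoTensors(float A[D][D][D], float B[D][D][D], float C[D][D][D]) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < D && j < D && k < D) {
        C[i][j][k] = A[i][j][k] + B[i][j][k];
    }
}

int main() {
    ...
    // Kernel invocation with a 3D grid of 8x8x8-thread blocks (512 threads per block)
    dim3 threads_per_block(8, 8, 8);
    dim3 number_of_blocks((D + 7) / 8, (D + 7) / 8, (D + 7) / 8);
    AddTwoTensors<<<number_of_blocks, threads_per_block>>>(A, B, C);
    ...
}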

Now that you know how to operate on multidimensional data, there is another important and simple concept to learn: how to call functions inside a kernel. Basically, this is done by using the __device__ declaration specifier. It defines functions that can be called by the device (GPU) directly. Thus, they can only be called from __global__ functions or from other __device__ functions. The example below applies a sigmoid operation to a vector (a very common operation in deep learning models).

#include <math.h>

// Sigmoid function
__device__ float sigmoid(float x) {
    return 1 / (1 + expf(-x));
}

// Kernel definition for applying the sigmoid function to a vector
__global__ void sigmoidActivation(float input[], float output[]) {
    int i = threadIdx.x;
    output[i] = sigmoid(input[i]);
}
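
Launching this kernel follows the same pattern as before; a minimal usage sketch (the device pointer names and the single-block launch are assumptions for illustration, with N at most 1024):

int main() {
    ...
    // d_input and d_output are device pointers allocated with cudaMalloc,
    // and d_input is filled via cudaMemcpy, as in the earlier examples
    sigmoidActivation<<<1, N>>>(d_input, d_output);
    ...
}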

So, now that you know the basic concepts of CUDA programming, you can start creating your own CUDA kernels. In the case of deep learning models, they are basically a bunch of matrix and tensor operations such as sums, multiplications, convolutions, normalizations and others. For instance, a naive matrix multiplication algorithm can be parallelized as follows:

// GPU version
// (M, N and P assumed to be compile-time constants)

__global__ void matMul(float A[M][N], float B[N][P], float C[M][P]) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < M && col < P) {
        float C_value = 0;
        for (int i = 0; i < N; i++) {
            C_value += A[row][i] * B[i][col];
        }
        C[row][col] = C_value;
    }
}
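
A possible launch configuration for this kernel, analogous to the matrix addition example (the 32x32 block size and the d_A, d_B, d_C pointer names are illustrative assumptions):

// Enough 32x32-thread blocks to cover the MxP output matrix:
// the x dimension indexes rows and the y dimension indexes columns
dim3 threads_per_block(32, 32);
dim3 number_of_blocks((M + threads_per_block.x - 1) / threads_per_block.x,
                      (P + threads_per_block.y - 1) / threads_per_block.y);
matMul<<<number_of_blocks, threads_per_block>>>(d_A, d_B, d_C);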

Now compare this with a normal CPU implementation of matrix multiplication below:

// CPU version

void matMul(float A[M][N], float B[N][P], float C[M][P]) {
    for (int row = 0; row < M; row++) {
        for (int col = 0; col < P; col++) {
            float C_value = 0;
            for (int i = 0; i < N; i++) {
                C_value += A[row][i] * B[i][col];
            }
            C[row][col] = C_value;
        }
    }
}

You may notice that in the GPU version we have fewer loops, resulting in faster processing of the operation. Below is a comparison of CPU and GPU performance for NxN matrix multiplications:

Image by the author

As you may observe, the performance improvement of GPU processing becomes even greater for matrix multiplication as the matrix size increases.

Now, consider a basic neural network, which mostly involves y = σ(Wx + b) operations, as shown below:

Image by the author

These operations primarily comprise matrix multiplication, matrix addition, and applying a function to an array, and you are already familiar with the parallelization techniques for all of them. Thus, you are now capable of implementing your own neural network that runs on GPUs from scratch!
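
As a taste of what that could look like, here is a minimal sketch of a single dense-layer forward pass y = σ(Wx + b) built from the pieces above (the kernel name, the one-thread-per-output-neuron mapping and the IN/OUT constants are illustrative assumptions, not code from the original post):

// One thread per output neuron: each thread computes one row of Wx,
// adds the bias and applies the __device__ sigmoid defined earlier
// (IN and OUT assumed to be compile-time constants)
__global__ void DenseLayerForward(float W[OUT][IN], float x[IN], float b[OUT], float y[OUT]) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < OUT) {
        float acc = b[i];
        for (int j = 0; j < IN; j++) {
            acc += W[i][j] * x[j];
        }
        y[i] = sigmoid(acc);
    }
}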

In this post, we covered introductory concepts regarding GPU processing to enhance the performance of deep learning models. However, it is also important to mention that the concepts you have seen here are only the basics and there is a lot more to learn. Libraries like PyTorch and TensorFlow implement optimization techniques that involve other, more complex concepts such as optimized memory access, batched operations and so on (they harness libraries built on top of CUDA, such as cuBLAS and cuDNN). Nevertheless, I hope this post helps clear up what goes on behind the scenes when you write .to("cuda") and execute deep learning models on GPUs.

In future posts, I will try to bring more complex concepts regarding CUDA programming. Please let me know what you think or what you would like me to write about next in the comments! Thanks so much for reading! 😊

CUDA Programming Guide — NVIDIA CUDA programming documentation.

CUDA Documentation — NVIDIA's complete CUDA documentation.

CUDA Neural Network training implementation — Pure CUDA C++ implementation of neural network training.

CUDA LLM training implementation — Training implementation of an LLM in pure CUDA C.
