Why Deep Learning Models Run Faster on GPUs: A Brief Introduction to CUDA Programming
by Lucas de Lima Nogueira | Apr 2024

For those who want to understand what .to("cuda") does

Image by the author with the assistance of AI (https://copilot.microsoft.com/images/create)

Nowadays, when we talk about deep learning, it is very common to associate its implementation with the use of GPUs in order to improve performance.

GPUs (Graphics Processing Units) were originally designed to accelerate the rendering of images and 2D and 3D graphics. However, due to their capability of performing many parallel operations, their usefulness extends beyond that to applications such as deep learning.

The use of GPUs for deep learning models started around the mid to late 2000s and became very popular around 2012 with the emergence of AlexNet. AlexNet, a convolutional neural network designed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. This victory marked a milestone, as it demonstrated the effectiveness of deep neural networks for image classification and the use of GPUs for training large models.

Following this breakthrough, the use of GPUs for deep learning models became increasingly popular, which contributed to the creation of frameworks like PyTorch and TensorFlow.

Nowadays, we just write .to("cuda") in PyTorch to send data to the GPU and expect the training to be accelerated. But how do deep learning algorithms take advantage of GPU computation performance in practice? Let's find out!

Deep learning architectures like neural networks, CNNs, RNNs and transformers are basically built using mathematical operations such as matrix addition, matrix multiplication and applying a function to a matrix. Thus, if we find a way to optimize these operations, we can improve the performance of deep learning models.

So, let's start simple. Imagine you want to add two vectors C = A + B.

A simple implementation of this in C would be:

void AddTwoVectors(float A[], float B[], float C[]) {
    for (int i = 0; i < N; i++) { // N is the number of elements, assumed defined elsewhere
        C[i] = A[i] + B[i];
    }
}

As you can notice, the computer must iterate over the vectors, adding each pair of elements sequentially at every iteration. But these operations are independent of each other. The addition of the i-th pair of elements does not depend on any other pair. So, what if we could execute these operations concurrently, adding all the pairs of elements in parallel?

A straightforward approach would be to use CPU multithreading in order to run all of the computation in parallel. However, when it comes to deep learning models, we are dealing with massive vectors, with millions of elements. A common CPU can only handle around a dozen threads simultaneously. That's where GPUs come into action! Modern GPUs can run millions of threads simultaneously, enhancing the performance of these mathematical operations on massive vectors.
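
As a point of reference, here is a minimal sketch of that CPU multithreading approach using OpenMP, one common way to parallelize a loop on the CPU in C/C++ (the function name and the use of OpenMP are illustrative choices, not part of the original code; compile with -fopenmp):

#include <omp.h>

// Parallel vector addition on the CPU: OpenMP splits the loop iterations
// across the available CPU threads, typically only a few dozen at most.
void AddTwoVectorsCPU(const float A[], const float B[], float C[], int N) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}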

Although CPU computations can be faster than GPU computations for a single operation, the advantage of GPUs relies on their parallelization capabilities. The reason for this is that they are designed with different goals. Whereas the CPU is designed to execute a sequence of operations (a thread) as fast as possible (and can only execute dozens of them simultaneously), the GPU is designed to execute millions of them in parallel (while sacrificing the speed of individual threads).

To illustrate, imagine that the CPU is like a Ferrari and the GPU is like a bus. If your task is to move one person, the Ferrari (CPU) is the better choice. However, if you are moving several people, even though the Ferrari (CPU) is faster per trip, the bus (GPU) can transport everyone in a single go, moving all of the people at once faster than the Ferrari traveling the route several times. So CPUs are better suited for handling sequential operations and GPUs for parallel operations.

Image by the author with the assistance of AI (https://copilot.microsoft.com/images/create)

In order to provide greater parallel capabilities, GPU designs allocate more transistors to data processing than to data caching and flow control, unlike CPUs, which allocate a significant portion of their transistors to those purposes in order to optimize single-threaded performance and complex instruction execution.

The figure below illustrates the distribution of chip resources for a CPU vs a GPU.

Image by the author, with inspiration from the CUDA C++ Programming Guide

CPUs have powerful cores and a more complex cache memory architecture (allocating a significant number of transistors to that). This design enables faster handling of sequential operations. On the other hand, GPUs prioritize having a large number of cores in order to achieve a higher level of parallelism.

Now that we understand these basic concepts, how can we take advantage of these parallel computation capabilities in practice?

When you run a deep learning model, you probably choose to use some popular Python library such as PyTorch or TensorFlow. However, it is well known that the core of these libraries runs C/C++ code under the hood. Also, as we mentioned before, you might use GPUs to speed up processing. That's where CUDA comes in! CUDA stands for Compute Unified Device Architecture, and it is a platform developed by NVIDIA for general-purpose processing on its GPUs. Thus, while DirectX is used by game engines to handle graphical computation, CUDA enables developers to integrate NVIDIA's GPU computational power into their general-purpose software applications, extending beyond just graphics rendering.

In order to do that, CUDA provides a simple C/C++-based interface (CUDA C/C++) that grants access to the GPU's virtual instruction set and specific operations (such as moving data between the CPU and the GPU).

Before we go further, let's understand some basic CUDA programming concepts and terminology:

  • host: refers to the CPU and its memory;
  • device: refers to the GPU and its memory;
  • kernel: refers to a function that is executed on the device (GPU).

So, in a basic piece of code written using CUDA, the program runs on the host (CPU), sends data to the device (GPU) and launches kernels (functions) to be executed on the device (GPU). These kernels are executed by several threads in parallel. After the execution, the results are transferred back from the device (GPU) to the host (CPU).

So let's go back to our problem of adding two vectors:

#include <stdio.h>

void AddTwoVectors(float A[], float B[], float C[]) {
    for (int i = 0; i < N; i++) { // N is the number of elements, assumed defined elsewhere
        C[i] = A[i] + B[i];
    }
}

int main() {
    ...
    AddTwoVectors(A, B, C);
    ...
}

In CUDA C/C++, programmers can define C/C++ functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads.

To define a kernel, you use the __global__ declaration specifier, and the number of CUDA threads that execute the kernel can be specified using the <<<...>>> notation:

#include <stdio.h>

// Kernel definition
__global__ void AddTwoVectors(float A[], float B[], float C[]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    ...
    // Kernel invocation with N threads
    AddTwoVectors<<<1, N>>>(A, B, C);
    ...
}

Each thread executes the kernel and is given a unique thread ID, threadIdx, accessible within the kernel through built-in variables. The code above adds two vectors, A and B, of size N and stores the result in vector C. As you can notice, instead of a loop that executes each pair-wise addition sequentially, CUDA allows us to perform all of these operations simultaneously, using N threads in parallel.

But before we can run this code, we need to make another modification. It is important to remember that the kernel function runs on the device (GPU), so all of its data needs to be stored in device memory. You can do this using the following CUDA built-in functions:

#include <stdio.h>

// Kernel definition
__global__ void AddTwoVectors(float A[], float B[], float C[]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {

    int N = 1000; // Size of the vectors
    float A[N], B[N], C[N]; // Arrays for vectors A, B, and C

    ...

    float *d_A, *d_B, *d_C; // Device pointers for vectors A, B, and C

    // Allocate memory on the device for vectors A, B, and C
    cudaMalloc((void **)&d_A, N * sizeof(float));
    cudaMalloc((void **)&d_B, N * sizeof(float));
    cudaMalloc((void **)&d_C, N * sizeof(float));

    // Copy vectors A and B from host to device
    cudaMemcpy(d_A, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation with N threads
    AddTwoVectors<<<1, N>>>(d_A, d_B, d_C);

    // Copy vector C from device to host
    cudaMemcpy(C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

}

Instead of directly passing the variables A, B and C to the kernel, we need to use pointers. In CUDA programming, you can't directly use host arrays (like A, B, and C in the example) inside kernel launches (<<<...>>>). CUDA kernels operate on device memory, so you need to pass the device pointers (d_A, d_B, and d_C) to the kernel for it to operate on.

Beyond that, we need to allocate memory on the device using cudaMalloc, and copy data between host and device using cudaMemcpy.

Now we can add the initialization of vectors A and B, and free the device memory at the end of the code.

#include <stdio.h>

// Kernel definition
__global__ void AddTwoVectors(float A[], float B[], float C[]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {

    int N = 1000; // Size of the vectors
    float A[N], B[N], C[N]; // Arrays for vectors A, B, and C

    // Initialize vectors A and B
    for (int i = 0; i < N; ++i) {
        A[i] = 1;
        B[i] = 3;
    }

    float *d_A, *d_B, *d_C; // Device pointers for vectors A, B, and C

    // Allocate memory on the device for vectors A, B, and C
    cudaMalloc((void **)&d_A, N * sizeof(float));
    cudaMalloc((void **)&d_B, N * sizeof(float));
    cudaMalloc((void **)&d_C, N * sizeof(float));

    // Copy vectors A and B from host to device
    cudaMemcpy(d_A, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation with N threads
    AddTwoVectors<<<1, N>>>(d_A, d_B, d_C);

    // Copy vector C from device to host
    cudaMemcpy(C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

Also, we need to add cudaDeviceSynchronize(); after we call the kernel. This is a function used to synchronize the host thread with the device. When it is called, the host thread will wait until all previously issued CUDA commands on the device are completed before continuing execution.

Beyond that, it is important to add some CUDA error checking so we can identify bugs on the GPU. If we don't add this checking, the code will continue the execution of the host thread (CPU) and it will be difficult to identify CUDA-related errors.

The implementation of both techniques is shown below:

#include <stdio.h>
#include <stdlib.h>

// Kernel definition
__global__ void AddTwoVectors(float A[], float B[], float C[]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {

    int N = 1000; // Size of the vectors
    float A[N], B[N], C[N]; // Arrays for vectors A, B, and C

    // Initialize vectors A and B
    for (int i = 0; i < N; ++i) {
        A[i] = 1;
        B[i] = 3;
    }

    float *d_A, *d_B, *d_C; // Device pointers for vectors A, B, and C

    // Allocate memory on the device for vectors A, B, and C
    cudaMalloc((void **)&d_A, N * sizeof(float));
    cudaMalloc((void **)&d_B, N * sizeof(float));
    cudaMalloc((void **)&d_C, N * sizeof(float));

    // Copy vectors A and B from host to device
    cudaMemcpy(d_A, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation with N threads
    AddTwoVectors<<<1, N>>>(d_A, d_B, d_C);

    // Check for errors
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(error));
        exit(-1);
    }

    // Wait until all CUDA threads are executed
    cudaDeviceSynchronize();

    // Copy vector C from device to host
    cudaMemcpy(C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

To compile and run the CUDA code, you need to ensure that the CUDA toolkit is installed on your system. Then, you can compile the code using nvcc, the NVIDIA CUDA Compiler. If you don't have a GPU on your machine, you can use Google Colab. You just need to select a GPU in Runtime → Notebook settings, then save the code in an example.cu file and run:

%%shell
nvcc example.cu -o compiled_example # compile
./compiled_example # run

# you can also run the code with the bug detection sanitizer
compute-sanitizer --tool memcheck ./compiled_example

However, our code is still not fully optimized. The example above uses a vector of size N = 1000. But this is a small number that does not fully demonstrate the GPU's parallelization capability. Also, when dealing with deep learning problems, we often handle massive vectors with millions of parameters. However, if we try setting, for example, N = 500000 and run the kernel with <<<1, 500000>>> using the example above, it will throw an error. Thus, to improve the code and perform such an operation, we first need to understand an important concept of CUDA programming: the thread hierarchy.

Kernel functions are launched using the notation <<<number_of_blocks, threads_per_block>>>. So, in our example above, we run 1 block with N CUDA threads. However, each block has a limit on the number of threads it can support. This happens because every thread within a block is required to be located on the same streaming multiprocessor core and must share the memory resources of that core.

You can get this limit using the following snippet of code:

int device;
cudaDeviceProp props;
cudaGetDevice(&device);
cudaGetDeviceProperties(&props, device);
printf("Maximum threads per block: %d\n", props.maxThreadsPerBlock);

On current Colab GPUs, a thread block may contain up to 1024 threads. Thus, we need more blocks in order to execute many more threads and process the massive vector in our example. Also, blocks are organized into grids, as illustrated below:

https://handwiki.org/wiki/index.php?curid=1157670 (CC BY-SA 3.0)

Now, the global index of each thread can be obtained using:

int i = blockIdx.x * blockDim.x + threadIdx.x;

So, our script becomes:

#include <stdio.h>
#include <stdlib.h>

// Kernel definition
__global__ void AddTwoVectors(float A[], float B[], float C[], int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) // To avoid exceeding the array limit
        C[i] = A[i] + B[i];
}

int main() {
    int N = 500000; // Size of the vectors
    int threads_per_block;
    int device;
    cudaDeviceProp props;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&props, device);
    threads_per_block = props.maxThreadsPerBlock;
    printf("Maximum threads per block: %d\n", threads_per_block); // 1024

    // Host arrays for vectors A, B, and C (heap-allocated, since vectors
    // this large would not safely fit on the stack)
    float *A = (float *)malloc(N * sizeof(float));
    float *B = (float *)malloc(N * sizeof(float));
    float *C = (float *)malloc(N * sizeof(float));

    // Initialize vectors A and B
    for (int i = 0; i < N; ++i) {
        A[i] = 1;
        B[i] = 3;
    }

    float *d_A, *d_B, *d_C; // Device pointers for vectors A, B, and C

    // Allocate memory on the device for vectors A, B, and C
    cudaMalloc((void **)&d_A, N * sizeof(float));
    cudaMalloc((void **)&d_B, N * sizeof(float));
    cudaMalloc((void **)&d_C, N * sizeof(float));

    // Copy vectors A and B from host to device
    cudaMemcpy(d_A, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation with multiple blocks of threads_per_block threads each
    int number_of_blocks = (N + threads_per_block - 1) / threads_per_block;
    AddTwoVectors<<<number_of_blocks, threads_per_block>>>(d_A, d_B, d_C, N);

    // Check for errors
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(error));
        exit(-1);
    }

    // Wait until all CUDA threads are executed
    cudaDeviceSynchronize();

    // Copy vector C from device to host
    cudaMemcpy(C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    free(A);
    free(B);
    free(C);
}

Below is a comparison of CPU and GPU computation of this two-vector addition for different vector sizes.

Image by the author

As you can see, the advantage of GPU processing only becomes apparent at large vector sizes N. Also, keep in mind that this time comparison only considers the execution of the kernel/function. It does not take into account the time to copy data between host and device, which, although not significant in most cases, is relatively considerable here, since we are performing only a simple addition operation. Therefore, it is important to remember that GPU computation only demonstrates its advantage when dealing with computations that are both highly compute-intensive and highly parallelizable.
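
If you want to reproduce this kind of kernel-only measurement, one common approach is to time the launch with CUDA events; the fragment below is a sketch that assumes the variables (d_A, d_B, d_C, N, number_of_blocks, threads_per_block) from the previous example:

// Time only the kernel execution with CUDA events,
// excluding the host<->device copies
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
AddTwoVectors<<<number_of_blocks, threads_per_block>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);

cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel time: %f ms\n", milliseconds);

cudaEventDestroy(start);
cudaEventDestroy(stop);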

Okay, now we know how to improve the performance of a simple array operation. But when dealing with deep learning models, we need to handle matrix and tensor operations. In our previous example, we only used one-dimensional blocks with N threads. However, it is also possible to execute multidimensional thread blocks (up to 3 dimensions). So, for convenience, you can run a thread block of NxM threads if you need to run matrix operations. In this case, you can obtain the matrix row and column indices as row = threadIdx.x, col = threadIdx.y. Also, for convenience, you can use the dim3 variable type to define number_of_blocks and threads_per_block.

The example below illustrates how to add two matrices.

#include <stdio.h>

// Kernel definition (N assumed to be a compile-time constant, e.g. #define N 32)
__global__ void AddTwoMatrices(float A[N][N], float B[N][N], float C[N][N]) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main() {
    ...
    // Kernel invocation with 1 block of NxN threads
    dim3 threads_per_block(N, N);
    AddTwoMatrices<<<1, threads_per_block>>>(A, B, C);
    ...
}

You can also extend this example to handle multiple blocks:

#include <stdio.h>

// Kernel definition (N assumed to be a compile-time constant)
__global__ void AddTwoMatrices(float A[N][N], float B[N][N], float C[N][N]) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N) {
        C[i][j] = A[i][j] + B[i][j];
    }
}

int main() {
    ...
    // Kernel invocation with multiple blocks of 32x32 threads each
    dim3 threads_per_block(32, 32);
    dim3 number_of_blocks((N + threads_per_block.x - 1) / threads_per_block.x, (N + threads_per_block.y - 1) / threads_per_block.y);
    AddTwoMatrices<<<number_of_blocks, threads_per_block>>>(A, B, C);
    ...
}

You can also extend this example to process three-dimensional operations using the same idea.
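
For instance, here is a minimal sketch of how that could look for the element-wise addition of two three-dimensional tensors (the kernel and launch configuration below are illustrative, not code from the original post; D is assumed to be a compile-time constant):

// Kernel definition: element-wise addition of two DxDxD tensors
__global__ void AddTwoTensors(float A[D][D][D], float B[D][D][D], float C[D][D][D]) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < D && j < D && k < D) {
        C[i][j][k] = A[i][j][k] + B[i][j][k];
    }
}

int main() {
    ...
    // Kernel invocation with a 3D grid of 8x8x8-thread blocks (512 threads per block)
    dim3 threads_per_block(8, 8, 8);
    dim3 number_of_blocks((D + 7) / 8, (D + 7) / 8, (D + 7) / 8);
    AddTwoTensors<<<number_of_blocks, threads_per_block>>>(A, B, C);
    ...
}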

Now that you know how to operate on multidimensional data, there is another important and simple concept to learn: how to call functions inside a kernel. Basically, this is done by using the __device__ declaration specifier. It defines functions that can be called by the device (GPU) directly. Thus, they can only be called from __global__ functions or from other __device__ functions. The example below applies a sigmoid operation to a vector (a very common operation in deep learning models).

#include <math.h>

// Sigmoid function
__device__ float sigmoid(float x) {
    return 1 / (1 + expf(-x));
}

// Kernel definition for applying the sigmoid function to a vector
__global__ void sigmoidActivation(float input[], float output[]) {
    int i = threadIdx.x;
    output[i] = sigmoid(input[i]);
}
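
Launching this kernel follows the same pattern as before; a minimal usage sketch (the device pointer names and the single-block launch are assumptions for illustration, with N at most 1024):

int main() {
    ...
    // d_input and d_output are device pointers allocated with cudaMalloc,
    // and d_input is filled via cudaMemcpy, as in the earlier examples
    sigmoidActivation<<<1, N>>>(d_input, d_output);
    ...
}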

So, now that you know the basic concepts of CUDA programming, you can start creating your own CUDA kernels. In the case of deep learning models, they are basically a bunch of matrix and tensor operations such as sums, multiplications, convolutions, normalizations and others. For instance, a naive matrix multiplication algorithm can be parallelized as follows:

// GPU version
// (M, N and P assumed to be compile-time constants)

__global__ void matMul(float A[M][N], float B[N][P], float C[M][P]) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < M && col < P) {
        float C_value = 0;
        for (int i = 0; i < N; i++) {
            C_value += A[row][i] * B[i][col];
        }
        C[row][col] = C_value;
    }
}
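
A possible launch configuration for this kernel, analogous to the matrix addition example (the 32x32 block size and the d_A, d_B, d_C pointer names are illustrative assumptions):

// Enough 32x32-thread blocks to cover the MxP output matrix:
// the x dimension indexes rows and the y dimension indexes columns
dim3 threads_per_block(32, 32);
dim3 number_of_blocks((M + threads_per_block.x - 1) / threads_per_block.x,
                      (P + threads_per_block.y - 1) / threads_per_block.y);
matMul<<<number_of_blocks, threads_per_block>>>(d_A, d_B, d_C);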

Now compare this with a normal CPU implementation of matrix multiplication below:

// CPU version

void matMul(float A[M][N], float B[N][P], float C[M][P]) {
    for (int row = 0; row < M; row++) {
        for (int col = 0; col < P; col++) {
            float C_value = 0;
            for (int i = 0; i < N; i++) {
                C_value += A[row][i] * B[i][col];
            }
            C[row][col] = C_value;
        }
    }
}

You may notice that in the GPU version we have fewer loops, resulting in faster processing of the operation. Below is a comparison of CPU and GPU performance for NxN matrix multiplications:

Image by the author

As you may observe, the performance improvement of GPU processing becomes even greater for matrix multiplication as the matrix size increases.

Now, consider a basic neural network, which mostly involves y = σ(Wx + b) operations, as shown below:

Image by the author

These operations primarily comprise matrix multiplication, matrix addition, and applying a function to an array, and you are already familiar with the parallelization techniques for all of them. Thus, you are now capable of implementing your own neural network that runs on GPUs from scratch!
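
As a taste of what that could look like, here is a minimal sketch of a single dense-layer forward pass y = σ(Wx + b) built from the pieces above (the kernel name, the one-thread-per-output-neuron mapping and the IN/OUT constants are illustrative assumptions, not code from the original post):

// One thread per output neuron: each thread computes one row of Wx,
// adds the bias and applies the __device__ sigmoid defined earlier
// (IN and OUT assumed to be compile-time constants)
__global__ void DenseLayerForward(float W[OUT][IN], float x[IN], float b[OUT], float y[OUT]) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < OUT) {
        float acc = b[i];
        for (int j = 0; j < IN; j++) {
            acc += W[i][j] * x[j];
        }
        y[i] = sigmoid(acc);
    }
}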

In this post, we covered introductory concepts regarding GPU processing to enhance the performance of deep learning models. However, it is also important to mention that the concepts you have seen here are only the basics and there is a lot more to learn. Libraries like PyTorch and TensorFlow implement optimization techniques that involve other, more complex concepts such as optimized memory access, batched operations and so on (they harness libraries built on top of CUDA, such as cuBLAS and cuDNN). Nevertheless, I hope this post helps clear up what goes on behind the scenes when you write .to("cuda") and execute deep learning models on GPUs.

In future posts, I will try to bring more complex concepts regarding CUDA programming. Please let me know what you think or what you would like me to write about next in the comments! Thanks so much for reading! 😊

CUDA Programming Guide — NVIDIA CUDA programming documentation.

CUDA Documentation — NVIDIA's complete CUDA documentation.

CUDA Neural Network training implementation — Pure CUDA C++ implementation of neural network training.

CUDA LLM training implementation — Training implementation of an LLM in pure CUDA C.
