Recreating PyTorch from Scratch (with GPU Support and Automatic Differentiation) | by Lucas de Lima Nogueira | May 2024

Build your own deep learning framework based on C/C++, CUDA, and Python, with GPU support and automatic differentiation

Image by Author with the assistance of AI (https://copilot.microsoft.com/images/create)

For many years I have used PyTorch to build and train deep learning models. Even though I have learned its syntax and rules, something has always aroused my curiosity: what is happening internally during these operations? How does all of this work?

If you have gotten here, you probably have the same questions. If I ask you how to create and train a model in PyTorch, you will probably come up with something similar to the code below:

import torch
import torch.nn as nn
import torch.optim as optim

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(1, 10)
        self.sigmoid = nn.Sigmoid()
        self.fc2 = nn.Linear(10, 1)

    def forward(self, x):
        out = self.fc1(x)
        out = self.sigmoid(out)
        out = self.fc2(out)

        return out

...

model = MyModel().to(device)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(epochs):
    for x, y in ...

        x = x.to(device)
        y = y.to(device)

        outputs = model(x)
        loss = criterion(outputs, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

But what if I ask you how this backward step works? Or, for instance, what happens when you reshape a tensor? Is the data rearranged internally? How does that happen? Why is PyTorch so fast? How does PyTorch handle GPU operations? These are the kinds of questions that have always intrigued me, and I imagine they intrigue you as well. So, in order to better understand these concepts, what is better than building your own tensor library from scratch? And that is what you will learn in this article!

In order to build a tensor library, the first concept you need to learn is, obviously: what is a tensor?

You may have an intuitive idea that a tensor is a mathematical concept of an n-dimensional data structure that contains some numbers. But here we need to understand how to model this data structure from a computational perspective. We can think of a tensor as consisting of the data itself plus some metadata describing aspects of the tensor, such as its shape or the device it lives on (i.e. CPU memory, GPU memory...).

Image by Author

There is also a less popular piece of metadata that you may never have heard of, called stride. This concept is very important for understanding the internals of tensor data rearrangement, so we need to discuss it a little more.

Imagine a 2-D tensor with shape [4, 8], illustrated below.

4×8 Tensor (Image by Author)

The data (i.e. the float numbers) of a tensor is actually stored as a 1-dimensional array in memory:

1-D data array of the tensor (Image by Author)

So, in order to represent this 1-dimensional array as an N-dimensional tensor, we use strides. Basically the idea is the following:

We have a matrix with 4 rows and 8 columns. Considering that all of its elements are organized row by row in the 1-dimensional array, if we want to access the value at position [2, 3], we need to traverse 2 rows (of 8 elements each) plus 3 positions. In mathematical terms, we need to traverse 3 + 2 * 8 elements in the 1-dimensional array:

Image by Author

So this '8' is a stride: it tells us how many elements we need to skip in the 1-dimensional array in order to "jump" one position along the first dimension, i.e. from one row to the next.

Thus, to access the element [i, j] of a 2-dimensional tensor with shape [shape_0, shape_1], we basically need to access the element at position j + i * shape_1 in the array.

Now, let us imagine a 3-dimensional tensor:

5x4x8 Tensor (Image by Author)

You can think of this 3-dimensional tensor as a sequence of matrices. For example, you can think of this [5, 4, 8] tensor as 5 matrices of shape [4, 8].

Now, in order to access the element at position [1, 2, 7], you need to traverse 1 complete matrix of shape [4, 8], 2 rows of 8 elements each, and 7 single elements. So, you need to traverse (1 * 4 * 8) + (2 * 8) + (7 * 1) positions in the 1-dimensional array:

Image by Author

Thus, to access the element [i][j][k] of a 3-D tensor with shape [shape_0, shape_1, shape_2], you access the following position in the 1-D data array:

index = i * (shape_1 * shape_2) + j * shape_2 + k * 1

This shape_1 * shape_2 is the stride of the first dimension, shape_2 is the stride of the second dimension, and 1 is the stride of the third dimension.

Then, in order to generalize: for an n-dimensional tensor, the element [index_0, index_1, ..., index_{n-1}] sits at the following position in the 1-D array:

index = index_0 * stride_0 + index_1 * stride_1 + ... + index_{n-1} * stride_{n-1}

where the stride of each dimension can be calculated as the product of the shapes of all the following dimensions:

stride_i = shape_{i+1} * shape_{i+2} * ... * shape_{n-1}

Then we set stride_{n-1} = 1.

For our example tensor of shape [5, 4, 8] we would have strides = [4*8, 8, 1] = [32, 8, 1].

You can test it yourself:

import torch

torch.rand([5, 4, 8]).stride()
#(32, 8, 1)
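Although PyTorch computes this for us, the rule is easy to reproduce. Below is a small helper (the function name is my own, not part of PyTorch or Norch) that derives the row-major strides from a shape by multiplying the trailing dimensions, matching the output above:

def compute_strides(shape):
    # The stride of the last dimension is 1; each earlier stride is the
    # product of all the shapes that come after it.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

print(compute_strides([5, 4, 8]))
# [32, 8, 1]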

Okay, but why do we need shapes and strides? Beyond accessing elements of N-dimensional tensors stored as 1-dimensional arrays, these concepts can be used to manipulate tensor arrangements very easily.

For example, to reshape a tensor, you only need to set the new shape and calculate the new strides based on it (since the new shape guarantees the same number of elements)!

import torch

t = torch.rand([5, 4, 8])

print(t.shape)
# [5, 4, 8]

print(t.stride())
# [32, 8, 1]

new_t = t.reshape([4, 5, 2, 2, 2])

print(new_t.shape)
# [4, 5, 2, 2, 2]

print(new_t.stride())
# [40, 8, 4, 2, 1]

Internally, the tensor is still stored as the same 1-dimensional array. The reshape method did not change the order of the elements within the array! That's amazing, isn't it? 😁

You can verify this yourself using the following function, which accesses the internal 1-dimensional array of a PyTorch tensor:

import ctypes

def print_internal(t: torch.Tensor):
    print(
        torch.frombuffer(
            ctypes.string_at(t.data_ptr(), t.storage().nbytes()), dtype=t.dtype
        )
    )

print_internal(t)
# [0.0752, 0.5898, 0.3930, 0.9577, 0.2276, 0.9786, 0.1009, 0.138, ...

print_internal(new_t)
# [0.0752, 0.5898, 0.3930, 0.9577, 0.2276, 0.9786, 0.1009, 0.138, ...

Or, for example, if you want to transpose two axes: internally, you just need to swap the respective strides!

t = torch.arange(0, 24).reshape(2, 3, 4)
print(t)
# [[[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11]],

# [[12, 13, 14, 15],
# [16, 17, 18, 19],
# [20, 21, 22, 23]]]

print(t.shape)
# [2, 3, 4]

print(t.stride())
# [12, 4, 1]

new_t = t.transpose(0, 1)
print(new_t)
# [[[ 0, 1, 2, 3],
# [12, 13, 14, 15]],

# [[ 4, 5, 6, 7],
# [16, 17, 18, 19]],

# [[ 8, 9, 10, 11],
# [20, 21, 22, 23]]]

print(new_t.shape)
# [3, 2, 4]

print(new_t.stride())
# [4, 12, 1]

If you print the internal array, both have the same values:

print_internal(t)
# [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

print_internal(new_t)
# [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

However, the strides of new_t no longer match the equation I showed above. This is because the tensor is no longer contiguous. That means that, although the internal array stays the same, the order of its values in memory does not match the actual order of the tensor.

t.is_contiguous()
# True

new_t.is_contiguous()
# False
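To connect this back to the stride equation, the following small check (the helper below is mine, not a PyTorch function) recomputes the strides that a contiguous, row-major tensor of each shape would have and compares them with the actual strides:

import torch

def row_major_strides(shape):
    # strides a contiguous (row-major) tensor of this shape would have
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)

t = torch.arange(0, 24).reshape(2, 3, 4)
new_t = t.transpose(0, 1)

print(row_major_strides(t.shape), t.stride())
# (12, 4, 1) (12, 4, 1)  -> contiguous

print(row_major_strides(new_t.shape), new_t.stride())
# (8, 4, 1) (4, 12, 1)   -> not contiguous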

This means that accessing the non-contiguous elements in sequence is less efficient (since the actual tensor elements are not ordered sequentially in memory). In order to fix that, we can do:

new_t_contiguous = new_t.contiguous()

print(new_t_contiguous.is_contiguous())
# True

If we analyze the internal array, its order now matches the actual tensor order, which can provide better memory access efficiency:

print(new_t)
# [[[ 0, 1, 2, 3],
# [12, 13, 14, 15]],

# [[ 4, 5, 6, 7],
# [16, 17, 18, 19]],

# [[ 8, 9, 10, 11],
# [20, 21, 22, 23]]]

print_internal(new_t)
# [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

print_internal(new_t_contiguous)
# [ 0, 1, 2, 3, 12, 13, 14, 15, 4, 5, 6, 7, 16, 17, 18, 19, 8, 9, 10, 11, 20, 21, 22, 23]

Now that we understand how a tensor is modeled, let us start creating our library!

I will call it Norch, which stands for NOT PyTorch, and also makes an allusion to my last name, Nogueira 😁

The first thing to know is that, although PyTorch is used through Python, internally it runs C/C++. So we will first create our internal C/C++ functions.

We can first define a tensor as a struct that stores its data and metadata, and create a function to instantiate it:

//norch/csrc/tensor.cpp

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

typedef struct {
    float* data;
    int* strides;
    int* shape;
    int ndim;
    int size;
    char* device;
} Tensor;

Tensor* create_tensor(float* data, int* shape, int ndim) {

    Tensor* tensor = (Tensor*)malloc(sizeof(Tensor));
    if (tensor == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }
    tensor->data = data;
    tensor->shape = shape;
    tensor->ndim = ndim;

    tensor->size = 1;
    for (int i = 0; i < ndim; i++) {
        tensor->size *= shape[i];
    }

    tensor->strides = (int*)malloc(ndim * sizeof(int));
    if (tensor->strides == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }
    int stride = 1;
    for (int i = ndim - 1; i >= 0; i--) {
        tensor->strides[i] = stride;
        stride *= shape[i];
    }

    return tensor;
}

In order to access an element, we can take advantage of strides, as we learned before:

//norch/csrc/tensor.cpp

float get_item(Tensor* tensor, int* indices) {
    int index = 0;
    for (int i = 0; i < tensor->ndim; i++) {
        index += indices[i] * tensor->strides[i];
    }

    float result;
    result = tensor->data[index];

    return result;
}

Now we can create the tensor operations. I will show some examples here, and you can find the complete version in the repository linked at the end of this article.

//norch/csrc/cpu.cpp

void add_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {

    for (int i = 0; i < tensor1->size; i++) {
        result_data[i] = tensor1->data[i] + tensor2->data[i];
    }
}

void sub_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {

    for (int i = 0; i < tensor1->size; i++) {
        result_data[i] = tensor1->data[i] - tensor2->data[i];
    }
}

void elementwise_mul_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {

    for (int i = 0; i < tensor1->size; i++) {
        result_data[i] = tensor1->data[i] * tensor2->data[i];
    }
}

void assign_tensor_cpu(Tensor* tensor, float* result_data) {

    for (int i = 0; i < tensor->size; i++) {
        result_data[i] = tensor->data[i];
    }
}

...

After that, we are able to create the other tensor functions that call these operations:

//norch/csrc/tensor.cpp

Tensor* add_tensor(Tensor* tensor1, Tensor* tensor2) {
    if (tensor1->ndim != tensor2->ndim) {
        fprintf(stderr, "Tensors must have the same number of dimensions %d and %d for addition\n", tensor1->ndim, tensor2->ndim);
        exit(1);
    }

    int ndim = tensor1->ndim;
    int* shape = (int*)malloc(ndim * sizeof(int));
    if (shape == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }

    for (int i = 0; i < ndim; i++) {
        if (tensor1->shape[i] != tensor2->shape[i]) {
            fprintf(stderr, "Tensors must have the same shape %d and %d at index %d for addition\n", tensor1->shape[i], tensor2->shape[i], i);
            exit(1);
        }
        shape[i] = tensor1->shape[i];
    }
    float* result_data = (float*)malloc(tensor1->size * sizeof(float));
    if (result_data == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }
    add_tensor_cpu(tensor1, tensor2, result_data);

    return create_tensor(result_data, shape, ndim);
}

As mentioned before, tensor reshaping does not modify the internal data array:

//norch/csrc/tensor.cpp

Tensor* reshape_tensor(Tensor* tensor, int* new_shape, int new_ndim) {

    int ndim = new_ndim;
    int* shape = (int*)malloc(ndim * sizeof(int));
    if (shape == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }

    for (int i = 0; i < ndim; i++) {
        shape[i] = new_shape[i];
    }

    // Calculate the total number of elements in the new shape
    int size = 1;
    for (int i = 0; i < new_ndim; i++) {
        size *= shape[i];
    }

    // Check if the total number of elements matches the current tensor's size
    if (size != tensor->size) {
        fprintf(stderr, "Cannot reshape tensor. Total number of elements in new shape does not match the current size of the tensor.\n");
        exit(1);
    }

    float* result_data = (float*)malloc(tensor->size * sizeof(float));
    if (result_data == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }
    assign_tensor_cpu(tensor, result_data);
    return create_tensor(result_data, shape, ndim);
}

Although we can now run some tensor operations, nobody deserves to run them in raw C/C++, right? Let us start building our Python wrapper!

There are a lot of options for running C/C++ code from Python, such as Pybind11 and Cython. For our example we will use ctypes.

The basic structure of ctypes is shown below:

//C code
#include <stdio.h>

float add_floats(float a, float b) {
    return a + b;
}

# Compile
gcc -shared -o add_floats.so -fPIC add_floats.c

# Python code
import ctypes

# Load the shared library
lib = ctypes.CDLL('./add_floats.so')

# Define the argument and return types for the function
lib.add_floats.argtypes = [ctypes.c_float, ctypes.c_float]
lib.add_floats.restype = ctypes.c_float

# Convert Python floats to the c_float type
a = ctypes.c_float(3.5)
b = ctypes.c_float(2.2)

# Call the C function
result = lib.add_floats(a, b)
print(result)
# 5.7

As you can see, it is very intuitive. After you compile the C/C++ code, you can use it from Python with ctypes very easily. You only need to define the argument and return ctypes of the function, convert your variables to their respective ctypes, and call the function. For more complex types such as arrays (float lists) you can use pointers:

data = [1.0, 2.0, 3.0]
data_ctype = (ctypes.c_float * len(data))(*data)

lib.some_array_func.argtypes = [ctypes.POINTER(ctypes.c_float)]

...

lib.some_array_func(data_ctype)

And for struct types we can create our own ctypes Structure:

class CustomType(ctypes.Structure):
    _fields_ = [
        ('field1', ctypes.POINTER(ctypes.c_float)),
        ('field2', ctypes.POINTER(ctypes.c_int)),
        ('field3', ctypes.c_int),
    ]

# Can be used as ctypes.POINTER(CustomType)

After this brief explanation, let us build the Python wrapper for our tensor C/C++ library!

# norch/tensor.py

import os
import ctypes

class CTensor(ctypes.Structure):
    _fields_ = [
        ('data', ctypes.POINTER(ctypes.c_float)),
        ('strides', ctypes.POINTER(ctypes.c_int)),
        ('shape', ctypes.POINTER(ctypes.c_int)),
        ('ndim', ctypes.c_int),
        ('size', ctypes.c_int),
    ]

class Tensor:
    os.path.abspath(os.curdir)
    _C = ctypes.CDLL("COMPILED_LIB.so")

    def __init__(self, data=None):

        data, shape = self.flatten(data)
        self.data_ctype = (ctypes.c_float * len(data))(*data)
        self.shape_ctype = (ctypes.c_int * len(shape))(*shape)
        self.ndim_ctype = ctypes.c_int(len(shape))

        self.shape = shape
        self.ndim = len(shape)

        Tensor._C.create_tensor.argtypes = [ctypes.POINTER(ctypes.c_float), ctypes.POINTER(ctypes.c_int), ctypes.c_int]
        Tensor._C.create_tensor.restype = ctypes.POINTER(CTensor)

        self.tensor = Tensor._C.create_tensor(
            self.data_ctype,
            self.shape_ctype,
            self.ndim_ctype,
        )

    def flatten(self, nested_list):
        """
        This method simply converts a nested-list tensor to a flat list together with its shape

        Example:

        Arguments:
            nested_list: [[1, 2, 3], [-5, 2, 0]]
        Return:
            flat_data: [1, 2, 3, -5, 2, 0]
            shape: [2, 3]
        """
        def flatten_recursively(nested_list):
            flat_data = []
            shape = []
            if isinstance(nested_list, list):
                for sublist in nested_list:
                    inner_data, inner_shape = flatten_recursively(sublist)
                    flat_data.extend(inner_data)
                shape.append(len(nested_list))
                shape.extend(inner_shape)
            else:
                flat_data.append(nested_list)
            return flat_data, shape

        flat_data, shape = flatten_recursively(nested_list)
        return flat_data, shape

Now we include the Python tensor operations that call the C/C++ operations.

# norch/tensor.py

def __getitem__(self, indices):
    """
    Access tensor by index tensor[i, j, k...]
    """

    if len(indices) != self.ndim:
        raise ValueError("Number of indices must match the number of dimensions")

    Tensor._C.get_item.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(ctypes.c_int)]
    Tensor._C.get_item.restype = ctypes.c_float

    indices = (ctypes.c_int * len(indices))(*indices)
    value = Tensor._C.get_item(self.tensor, indices)

    return value

def reshape(self, new_shape):
    """
    Reshape tensor
    result = tensor.reshape([1,2])
    """
    new_shape_ctype = (ctypes.c_int * len(new_shape))(*new_shape)
    new_ndim_ctype = ctypes.c_int(len(new_shape))

    Tensor._C.reshape_tensor.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(ctypes.c_int), ctypes.c_int]
    Tensor._C.reshape_tensor.restype = ctypes.POINTER(CTensor)
    result_tensor_ptr = Tensor._C.reshape_tensor(self.tensor, new_shape_ctype, new_ndim_ctype)

    result_data = Tensor()
    result_data.tensor = result_tensor_ptr
    result_data.shape = new_shape.copy()
    result_data.ndim = len(new_shape)
    result_data.device = self.device

    return result_data

def __add__(self, other):
    """
    Add tensors
    result = tensor1 + tensor2
    """

    if self.shape != other.shape:
        raise ValueError("Tensors must have the same shape for addition")

    Tensor._C.add_tensor.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(CTensor)]
    Tensor._C.add_tensor.restype = ctypes.POINTER(CTensor)

    result_tensor_ptr = Tensor._C.add_tensor(self.tensor, other.tensor)

    result_data = Tensor()
    result_data.tensor = result_tensor_ptr
    result_data.shape = self.shape.copy()
    result_data.ndim = self.ndim
    result_data.device = self.device

    return result_data

# Include the other operations:
# __str__
# __sub__ (-)
# __mul__ (*)
# __matmul__ (@)
# __pow__ (**)
# __truediv__ (/)
# log
# ...

If you got this far, you are now able to run the code and start doing some tensor operations!

import norch

tensor1 = norch.Tensor([[1, 2, 3], [3, 2, 1]])
tensor2 = norch.Tensor([[3, 2, 1], [1, 2, 3]])

result = tensor1 + tensor2
print(result[0, 0])
# 4

After creating the basic structure of our library, we will now take it to a new level. It is well known that you can call .to("cuda") to send data to the GPU and run math operations faster. I will assume that you have basic knowledge of how CUDA works, but if you do not, you can read my other article: CUDA tutorial. I will wait here. 😊

For those in a hurry, here is a brief introduction:

Basically, all of our code so far runs on CPU memory. Although for single operations CPUs are faster, the advantage of GPUs lies in their parallelization capabilities. While CPU design aims to execute a sequence of operations (threads) quickly (but can only execute dozens of them at a time), GPU design aims to execute millions of operations in parallel (by sacrificing individual thread performance).

Thus, we can take advantage of this capability to perform operations in parallel. For example, in a million-sized tensor addition, instead of adding the elements of each index sequentially inside a loop, using a GPU we can add all of them in parallel at once. In order to do that, we can use CUDA, a platform developed by NVIDIA to enable developers to integrate GPU support into their software applications.

In order to do that, you can use CUDA C/C++, which is a simple C/C++-based interface designed to run specific GPU operations (such as copying data from CPU memory to GPU memory).

The code below basically uses some CUDA C/C++ functions to copy data from the CPU to the GPU, and runs the AddTwoArrays function (also called a kernel) on a total of N GPU threads in parallel, each responsible for adding a different element of the arrays.

#include <stdio.h>

// CPU version for comparison
void AddTwoArrays_CPU(float A[], float B[], float C[], int N) {
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

// Kernel definition
__global__ void AddTwoArrays_GPU(float A[], float B[], float C[]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {

    int N = 1000; // Size of the arrays
    float A[N], B[N], C[N]; // Arrays A, B, and C

    ...

    float *d_A, *d_B, *d_C; // Device pointers for arrays A, B, and C

    // Allocate memory on the device for arrays A, B, and C
    cudaMalloc((void **)&d_A, N * sizeof(float));
    cudaMalloc((void **)&d_B, N * sizeof(float));
    cudaMalloc((void **)&d_C, N * sizeof(float));

    // Copy arrays A and B from host to device
    cudaMemcpy(d_A, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation with N threads
    AddTwoArrays_GPU<<<1, N>>>(d_A, d_B, d_C);

    // Copy array C from device to host
    cudaMemcpy(C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

}

As you can notice, instead of adding each element pair per operation, we run all of the addition operations in parallel, eliminating the loop instruction.

After this brief introduction, we can go back to our tensor library.

The first step is to create a function that sends tensor data from the CPU to the GPU and vice versa.

//norch/csrc/tensor.cpp

void to_device(Tensor* tensor, char* target_device) {
    if ((strcmp(target_device, "cuda") == 0) && (strcmp(tensor->device, "cpu") == 0)) {
        cpu_to_cuda(tensor);
    }

    else if ((strcmp(target_device, "cpu") == 0) && (strcmp(tensor->device, "cuda") == 0)) {
        cuda_to_cpu(tensor);
    }
}

//norch/csrc/cuda.cu

__host__ void cpu_to_cuda(Tensor* tensor) {

    float* data_tmp;
    cudaMalloc((void **)&data_tmp, tensor->size * sizeof(float));
    cudaMemcpy(data_tmp, tensor->data, tensor->size * sizeof(float), cudaMemcpyHostToDevice);

    tensor->data = data_tmp;

    const char* device_str = "cuda";
    tensor->device = (char*)malloc(strlen(device_str) + 1);
    strcpy(tensor->device, device_str);

    printf("Successfully sent tensor to: %s\n", tensor->device);
}

__host__ void cuda_to_cpu(Tensor* tensor) {
    float* data_tmp = (float*)malloc(tensor->size * sizeof(float));

    cudaMemcpy(data_tmp, tensor->data, tensor->size * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(tensor->data);

    tensor->data = data_tmp;

    const char* device_str = "cpu";
    tensor->device = (char*)malloc(strlen(device_str) + 1);
    strcpy(tensor->device, device_str);

    printf("Successfully sent tensor to: %s\n", tensor->device);
}

The Python wrapper:

# norch/tensor.py

def to(self, device):
    self.device = device
    self.device_ctype = self.device.encode('utf-8')

    Tensor._C.to_device.argtypes = [ctypes.POINTER(CTensor), ctypes.c_char_p]
    Tensor._C.to_device.restype = None
    Tensor._C.to_device(self.tensor, self.device_ctype)

    return self

Then, we create GPU versions of all of our tensor operations. I will show examples for addition and subtraction:

//norch/csrc/cuda.cu

#define THREADS_PER_BLOCK 128

__global__ void add_tensor_cuda_kernel(float* data1, float* data2, float* result_data, int size) {

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        result_data[i] = data1[i] + data2[i];
    }
}

__host__ void add_tensor_cuda(Tensor* tensor1, Tensor* tensor2, float* result_data) {

    int number_of_blocks = (tensor1->size + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    add_tensor_cuda_kernel<<<number_of_blocks, THREADS_PER_BLOCK>>>(tensor1->data, tensor2->data, result_data, tensor1->size);

    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(error));
        exit(-1);
    }

    cudaDeviceSynchronize();
}

__global__ void sub_tensor_cuda_kernel(float* data1, float* data2, float* result_data, int size) {

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        result_data[i] = data1[i] - data2[i];
    }
}

__host__ void sub_tensor_cuda(Tensor* tensor1, Tensor* tensor2, float* result_data) {

    int number_of_blocks = (tensor1->size + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    sub_tensor_cuda_kernel<<<number_of_blocks, THREADS_PER_BLOCK>>>(tensor1->data, tensor2->data, result_data, tensor1->size);

    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(error));
        exit(-1);
    }

    cudaDeviceSynchronize();
}

...

Next, we make use of the char* device attribute of the tensor in tensor.cpp (and pass the device to create_tensor as well) to select where each operation will run (CPU or GPU):

//norch/csrc/tensor.cpp

Tensor* add_tensor(Tensor* tensor1, Tensor* tensor2) {
    if (tensor1->ndim != tensor2->ndim) {
        fprintf(stderr, "Tensors must have the same number of dimensions %d and %d for addition\n", tensor1->ndim, tensor2->ndim);
        exit(1);
    }

    if (strcmp(tensor1->device, tensor2->device) != 0) {
        fprintf(stderr, "Tensors must be on the same device: %s and %s\n", tensor1->device, tensor2->device);
        exit(1);
    }

    char* device = (char*)malloc(strlen(tensor1->device) + 1);
    if (device != NULL) {
        strcpy(device, tensor1->device);
    } else {
        fprintf(stderr, "Memory allocation failed\n");
        exit(-1);
    }
    int ndim = tensor1->ndim;
    int* shape = (int*)malloc(ndim * sizeof(int));
    if (shape == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }

    for (int i = 0; i < ndim; i++) {
        if (tensor1->shape[i] != tensor2->shape[i]) {
            fprintf(stderr, "Tensors must have the same shape %d and %d at index %d for addition\n", tensor1->shape[i], tensor2->shape[i], i);
            exit(1);
        }
        shape[i] = tensor1->shape[i];
    }

    if (strcmp(tensor1->device, "cuda") == 0) {

        float* result_data;
        cudaMalloc((void **)&result_data, tensor1->size * sizeof(float));
        add_tensor_cuda(tensor1, tensor2, result_data);
        return create_tensor(result_data, shape, ndim, device);
    }
    else {
        float* result_data = (float*)malloc(tensor1->size * sizeof(float));
        if (result_data == NULL) {
            fprintf(stderr, "Memory allocation failed\n");
            exit(1);
        }
        add_tensor_cpu(tensor1, tensor2, result_data);
        return create_tensor(result_data, shape, ndim, device);
    }
}

Now our library has GPU support!

import norch

tensor1 = norch.Tensor([[1, 2, 3], [3, 2, 1]]).to("cuda")
tensor2 = norch.Tensor([[3, 2, 1], [1, 2, 3]]).to("cuda")

result = tensor1 + tensor2

One of the main reasons why PyTorch became so popular is its Autograd module. It is a core component that enables automatic differentiation in order to compute gradients (crucial for training models with optimization algorithms such as gradient descent). By calling a single method, .backward(), it computes all of the gradients of the previous tensor operations:

x = torch.tensor([[1., 2, 3], [3., 2, 1]], requires_grad=True)
# [[1, 2, 3],
# [3, 2., 1]]

y = torch.tensor([[3., 2, 1], [1., 2, 3]], requires_grad=True)
# [[3, 2, 1],
# [1, 2, 3]]

L = ((x - y) ** 3).sum()

L.backward()

# You can access the gradients of x and y
print(x.grad)
# [[12, 0, 12],
# [12, 0, 12]]

print(y.grad)
# [[-12, 0, -12],
# [-12, 0, -12]]

# In order to minimize L, you can use the gradients for gradient descent:
# x = x - learning_rate * x.grad
# y = y - learning_rate * y.grad

In order to understand what is happening, let us try to replicate the same procedure manually.

Let us first calculate the derivative of L with respect to x.

Note that x is a matrix, so we need to calculate the derivative of L with respect to each of its elements individually. Additionally, L is a summation over all elements, but it is important to remember that, for each element, the other elements do not interfere with its derivative. Therefore, we obtain the following term:

dL/dx_ij = d((x_ij - y_ij)^3) / dx_ij

Applying the chain rule to each term, we differentiate the outer function and multiply it by the derivative of the inner function:

dL/dx_ij = 3 * (x_ij - y_ij)^2 * d(x_ij - y_ij)/dx_ij

where:

d(x_ij - y_ij)/dx_ij = 1

Finally, we have the following equation to calculate the derivative of L with respect to x, applied element-wise:

dL/dx = 3 * (x - y)^2

Substituting the values into the equation:

x - y = [[-2, 0, 2],
         [ 2, 0, -2]]

dL/dx = 3 * [[4, 0, 4],
             [4, 0, 4]]

Calculating the result, we get the same values we obtained with PyTorch:

dL/dx = [[12, 0, 12],
         [12, 0, 12]]

The same reasoning applied to y gives dL/dy = -3 * (x - y)^2 = [[-12, 0, -12], [-12, 0, -12]].
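As a quick numerical check in plain Python (independent of any library), we can plug the values in element-wise and confirm that the hand calculation matches PyTorch's output:

x = [[1., 2., 3.], [3., 2., 1.]]
y = [[3., 2., 1.], [1., 2., 3.]]

# dL/dx = 3 * (x - y) ** 2 and dL/dy = -3 * (x - y) ** 2, element-wise
dL_dx = [[3 * (xi - yi) ** 2 for xi, yi in zip(xr, yr)] for xr, yr in zip(x, y)]
dL_dy = [[-3 * (xi - yi) ** 2 for xi, yi in zip(xr, yr)] for xr, yr in zip(x, y)]

print(dL_dx)  # [[12.0, 0.0, 12.0], [12.0, 0.0, 12.0]]
print(dL_dy)  # [[-12.0, 0.0, -12.0], [-12.0, 0.0, -12.0]]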

Now, let us analyze what we just did:

Basically, we looked at all of the operations involved in reverse order: a summation, a power of 3, and a subtraction. Then we applied the chain rule, calculating the derivative of each operation and recursively computing the derivative of the next one. So, first we need an implementation of the derivative of each math operation:

For addition:

# norch/autograd/functions.py

class AddBackward:
    def __init__(self, x, y):
        self.input = [x, y]

    def backward(self, gradient):
        return [gradient, gradient]

For sin:

# norch/autograd/functions.py

class SinBackward:
    def __init__(self, x):
        self.input = [x]

    def backward(self, gradient):
        x = self.input[0]
        return [x.cos() * gradient]

For cosine:

# norch/autograd/functions.py

class CosBackward:
    def __init__(self, x):
        self.input = [x]

    def backward(self, gradient):
        x = self.input[0]
        return [- x.sin() * gradient]

For element-wise multiplication:

# norch/autograd/functions.py

class ElementwiseMulBackward:
    def __init__(self, x, y):
        self.input = [x, y]

    def backward(self, gradient):
        x = self.input[0]
        y = self.input[1]
        return [y * gradient, x * gradient]

For summation:

# norch/autograd/functions.py

class SumBackward:
    def __init__(self, x):
        self.input = [x]

    def backward(self, gradient):
        # Since sum reduces a tensor to a scalar, the gradient is broadcast to match the original shape.
        return [float(gradient.tensor.contents.data[0]) * self.input[0].ones_like()]
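The worked example above also involves a subtraction and a power, whose backward classes follow the same pattern. Here is a sketch of what they could look like (the versions in the repository may differ in details; they reuse the Tensor operations such as **, negation, and scalar multiplication shown earlier):

# norch/autograd/functions.py

class SubBackward:
    def __init__(self, x, y):
        self.input = [x, y]

    def backward(self, gradient):
        # d(x - y)/dx = 1 and d(x - y)/dy = -1
        return [gradient, -gradient]

class PowBackward:
    def __init__(self, x, exponent):
        self.input = [x]
        self.exponent = exponent

    def backward(self, gradient):
        # d(x ** n)/dx = n * x ** (n - 1)
        x = self.input[0]
        return [self.exponent * (x ** (self.exponent - 1)) * gradient]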

You can access the GitHub repository link at the end of the article to explore the other operations.

Now that we have derivative expressions for each operation, we can proceed with implementing the recursive backward chain rule. We can set a requires_grad argument on our tensor to indicate that we want to store its gradients. If it is true, we will store the gradients for each tensor operation. For instance:

# norch/tensor.py

def __add__(self, other):

    if self.shape != other.shape:
        raise ValueError("Tensors must have the same shape for addition")

    Tensor._C.add_tensor.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(CTensor)]
    Tensor._C.add_tensor.restype = ctypes.POINTER(CTensor)

    result_tensor_ptr = Tensor._C.add_tensor(self.tensor, other.tensor)

    result_data = Tensor()
    result_data.tensor = result_tensor_ptr
    result_data.shape = self.shape.copy()
    result_data.ndim = self.ndim
    result_data.device = self.device

    result_data.requires_grad = self.requires_grad or other.requires_grad
    if result_data.requires_grad:
        result_data.grad_fn = AddBackward(self, other)

    return result_data

Then, we implement the .backward() method:

# norch/tensor.py

def backward(self, gradient=None):
    if not self.requires_grad:
        return

    if gradient is None:
        if self.shape == [1]:
            gradient = Tensor([1]) # dx/dx = 1 case
        else:
            raise RuntimeError("Gradient argument must be specified for non-scalar tensors.")

    if self.grad is None:
        self.grad = gradient

    else:
        self.grad += gradient

    if self.grad_fn is not None: # not a leaf
        grads = self.grad_fn.backward(gradient) # call the operation's backward
        for tensor, grad in zip(self.grad_fn.input, grads):
            if isinstance(tensor, Tensor):
                tensor.backward(grad) # recursively call backward again on the gradient expression (chain rule)

Finally, we just implement .zero_grad() to zero the gradient of a tensor, and .detach() to remove the tensor's autograd history:

# norch/tensor.py

def zero_grad(self):
    self.grad = None

def detach(self):
    self.grad = None
    self.grad_fn = None
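With these pieces in place, the PyTorch example from the beginning of this section can be reproduced with Norch. This is a minimal sketch that assumes the Tensor constructor accepts a requires_grad flag, as used in the snippets above:

import norch

x = norch.Tensor([[1., 2, 3], [3., 2, 1]], requires_grad=True)
y = norch.Tensor([[3., 2, 1], [1., 2, 3]], requires_grad=True)

L = ((x - y) ** 3).sum()
L.backward()

print(x.grad)  # expected: [[12, 0, 12], [12, 0, 12]]
print(y.grad)  # expected: [[-12, 0, -12], [-12, 0, -12]]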

Congratulations! You just created a complete tensor library with GPU support and automatic differentiation! Now we can create the nn and optim modules to train some deep learning models more easily.

The nn module is for building neural networks and deep learning models, and optim is related to the optimization algorithms used to train these models. In order to recreate them, the first thing to do is to implement a Parameter, which is simply a trainable tensor with the same operations, but with requires_grad always set to True and with some random initialization technique.

# norch/nn/parameter.py

from norch.tensor import Tensor
from norch.utils import utils
import random

class Parameter(Tensor):
    """
    A parameter is a trainable tensor.
    """
    def __init__(self, shape):
        data = utils.generate_random_list(shape=shape)
        super().__init__(data, requires_grad=True)

# norch/utils/utils.py

def generate_random_list(shape):
    """
    Generate a nested list with random numbers and shape 'shape'
    [4, 2] --> [[rand1, rand2], [rand3, rand4], [rand5, rand6], [rand7, rand8]]
    """
    if len(shape) == 0:
        return []
    else:
        inner_shape = shape[1:]
        if len(inner_shape) == 0:
            return [random.uniform(-1, 1) for _ in range(shape[0])]
        else:
            return [generate_random_list(inner_shape) for _ in range(shape[0])]

Using parameters, we can start constructing modules:

# norch/nn/module.py

from .parameter import Parameter
from collections import OrderedDict
from abc import ABC
import inspect

class Module(ABC):
    """
    Abstract class for modules
    """
    def __init__(self):
        self._modules = OrderedDict()
        self._params = OrderedDict()
        self._grads = OrderedDict()
        self.training = True

    def forward(self, *inputs, **kwargs):
        raise NotImplementedError

    def __call__(self, *inputs, **kwargs):
        return self.forward(*inputs, **kwargs)

    def train(self):
        self.training = True
        for param in self.parameters():
            param.requires_grad = True

    def eval(self):
        self.training = False
        for param in self.parameters():
            param.requires_grad = False

    def parameters(self):
        for name, value in inspect.getmembers(self):
            if isinstance(value, Parameter):
                yield self, name, value
            elif isinstance(value, Module):
                yield from value.parameters()

    def modules(self):
        yield from self._modules.values()

    def gradients(self):
        for module in self.modules():
            yield module._grads

    def zero_grad(self):
        for _, _, parameter in self.parameters():
            parameter.zero_grad()

    def to(self, device):
        for _, _, parameter in self.parameters():
            parameter.to(device)

        return self

    def inner_repr(self):
        return ""

    def __repr__(self):
        string = f"{self.get_name()}("
        tab = "   "
        modules = self._modules
        if modules == {}:
            string += f'\n{tab}(parameters): {self.inner_repr()}'
        else:
            for key, module in modules.items():
                string += f"\n{tab}({key}): {module.get_name()}({module.inner_repr()})"
        return f'{string}\n)'

    def get_name(self):
        return self.__class__.__name__

    def __setattr__(self, key, value):
        self.__dict__[key] = value

        if isinstance(value, Module):
            self._modules[key] = value
        elif isinstance(value, Parameter):
            self._params[key] = value
For example, we can build our custom modules by inheriting from nn.Module, or we can use some previously created modules, such as Linear, which implements the y = Wx + b operation.

# norch/nn/modules/linear.py

from ..module import Module
from ..parameter import Parameter

class Linear(Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.weight = Parameter(shape=[self.output_dim, self.input_dim])
        self.bias = Parameter(shape=[self.output_dim, 1])

    def forward(self, x):
        z = self.weight @ x + self.bias
        return z

    def inner_repr(self):
        return f"input_dim={self.input_dim}, output_dim={self.output_dim}, " \
               f"bias={True if self.bias is not None else False}"

Now we can implement some loss and activation functions. For instance, a mean squared error loss and a sigmoid function:

# norch/nn/loss.py

from .module import Module

class MSELoss(Module):
    def __init__(self):
        pass

    def forward(self, predictions, labels):
        assert labels.shape == predictions.shape, \
            "Labels and predictions shape does not match: {} and {}".format(labels.shape, predictions.shape)

        return ((predictions - labels) ** 2).sum() / predictions.numel

    def __call__(self, *inputs):
        return self.forward(*inputs)

# norch/nn/activation.py

from .module import Module
import math

class Sigmoid(Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 1.0 / (1.0 + (math.e) ** (-x))

Finally, we create the optimizers. In our example I will implement the Stochastic Gradient Descent algorithm:

# norch/optim/optimizer.py

from abc import ABC
from norch.tensor import Tensor

class Optimizer(ABC):
    """
    Abstract class for optimizers
    """

    def __init__(self, parameters):
        if isinstance(parameters, Tensor):
            raise TypeError("parameters should be an iterable but got {}".format(type(parameters)))
        elif isinstance(parameters, dict):
            parameters = parameters.values()

        self.parameters = list(parameters)

    def step(self):
        raise NotImplementedError

    def zero_grad(self):
        for module, name, parameter in self.parameters:
            parameter.zero_grad()

class SGD(Optimizer):
    def __init__(self, parameters, lr=1e-1, momentum=0):
        super().__init__(parameters)
        self.lr = lr
        self.momentum = momentum
        self._cache = {'velocity': [p.zeros_like() for (_, _, p) in self.parameters]}

    def step(self):
        for i, (module, name, _) in enumerate(self.parameters):
            parameter = getattr(module, name)

            velocity = self._cache['velocity'][i]

            velocity = self.momentum * velocity - self.lr * parameter.grad

            updated_parameter = parameter + velocity

            setattr(module, name, updated_parameter)

            self._cache['velocity'][i] = velocity

            parameter.detach()
            velocity.detach()

And that's it! We just created our own deep learning framework! 🥳

Let's do some training:

import norch
import norch.nn as nn
import norch.optim as optim
import random
import math

random.seed(1)

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(1, 10)
        self.sigmoid = nn.Sigmoid()
        self.fc2 = nn.Linear(10, 1)

    def forward(self, x):
        out = self.fc1(x)
        out = self.sigmoid(out)
        out = self.fc2(out)

        return out

device = "cuda"
epochs = 10

model = MyModel().to(device)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)
loss_list = []

x_values = [0. , 0.4, 0.8, 1.2, 1.6, 2. , 2.4, 2.8, 3.2, 3.6, 4. ,
4.4, 4.8, 5.2, 5.6, 6. , 6.4, 6.8, 7.2, 7.6, 8. , 8.4,
8.8, 9.2, 9.6, 10. , 10.4, 10.8, 11.2, 11.6, 12. , 12.4, 12.8,
13.2, 13.6, 14. , 14.4, 14.8, 15.2, 15.6, 16. , 16.4, 16.8, 17.2,
17.6, 18. , 18.4, 18.8, 19.2, 19.6, 20.]

y_true = []
for x in x_values:
    y_true.append(math.pow(math.sin(x), 2))

for epoch in range(epochs):
    for x, target in zip(x_values, y_true):
        x = norch.Tensor([[x]]).T
        target = norch.Tensor([[target]]).T

        x = x.to(device)
        target = target.to(device)

        outputs = model(x)
        loss = criterion(outputs, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'Epoch [{epoch + 1}/{epochs}], Loss: {loss[0]:.4f}')
    loss_list.append(loss[0])

# Epoch [1/10], Loss: 1.7035
# Epoch [2/10], Loss: 0.7193
# Epoch [3/10], Loss: 0.3068
# Epoch [4/10], Loss: 0.1742
# Epoch [5/10], Loss: 0.1342
# Epoch [6/10], Loss: 0.1232
# Epoch [7/10], Loss: 0.1220
# Epoch [8/10], Loss: 0.1241
# Epoch [9/10], Loss: 0.1270
# Epoch [10/10], Loss: 0.1297

Image by Author

The model was successfully created and trained using our custom deep learning framework!

You can check the complete code here.

In this post we covered the basic concepts of what a tensor is and how it is modeled, as well as more advanced topics such as CUDA and Autograd. We successfully created a deep learning framework with GPU support and automatic differentiation. I hope this post helped you briefly understand how PyTorch works under the hood.

In future posts, I will try to cover more advanced topics such as distributed training (multinode / multi-GPU) and memory management. Please let me know what you think, or what you would like me to write about next, in the comments! Thank you so much for reading! 😊

Also, follow me here and on my LinkedIn profile to stay updated on my latest articles!

PyNorch — GitHub repository of this project.

CUDA Tutorial — A brief introduction to how CUDA works.

PyTorch — PyTorch documentation.

MartinLwx's blog — Tutorial on strides.

Strides tutorial — Another tutorial about strides.

PyTorch internals — A guide on how PyTorch is structured.

Nets — A PyTorch recreation using NumPy.

Autograd — A live coding of an Autograd library.
