Project author: AhmetFurkanDEMIR

Project description: NVIDIA GPU benchmark
Primary language: Jupyter Notebook
Project address: git://github.com/AhmetFurkanDEMIR/NVIDIA-GPU-benchmark.git
Created: 2020-11-09T20:29:26Z
Project community: https://github.com/AhmetFurkanDEMIR/NVIDIA-GPU-benchmark

NVIDIA GPU benchmark


Hello, I have prepared two speed tests on the NVIDIA GPUs that I have access to. The GPUs were accessed via Google Colab and AWS.

WARNING: Rather than evaluating these GPUs in isolation, I recommend examining them together with the rest of the machine's hardware; the same GPUs may give different results in different applications, or in the same test run at different times.

Graphics processing unit: The graphics processing unit, or GPU for short, is the device used for graphics rendering in personal computers, workstations and game consoles. Modern GPUs are extremely efficient at rendering and displaying computer graphics, and their highly parallel structure makes them more efficient than CPUs for algorithms that process large blocks of data in parallel. The GPU can sit on a dedicated graphics card or be integrated into the motherboard.
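
To make the CPU/GPU contrast concrete, here is a minimal PyTorch sketch (my own illustration, separate from the benchmarks below) that times the same matrix multiplication on the CPU and, if one is available, on the GPU; the matrix size of 4000 is an arbitrary choice:

    import time
    import torch

    def timed_matmul(device, n=4000):
        # Build two random n x n matrices on the requested device.
        a = torch.rand(n, n, device=device)
        b = torch.rand(n, n, device=device)
        start = time.time()
        c = a @ b
        if device.type == "cuda":
            # CUDA kernels launch asynchronously; wait for the result
            # before reading the clock.
            torch.cuda.synchronize()
        return time.time() - start

    print("CPU :", timed_matmul(torch.device("cpu")))
    if torch.cuda.is_available():
        print("GPU :", timed_matmul(torch.device("cuda")))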

1. Speed test: I create four matrices with 10000 rows and 10000 columns on the GPU. First I multiply matrices a and b and assign the result to y, then I multiply matrices c and d and assign the result to z, and finally I multiply y and z and assign the result to x. I repeat this sequence of operations 1000 times in total.

    import time, torch

    bas = time.time()
    # Four random 10000 x 10000 matrices created directly on the GPU.
    a = torch.rand(10000, 10000, device=torch.device("cuda"))
    b = torch.rand(10000, 10000, device=torch.device("cuda"))
    c = torch.rand(10000, 10000, device=torch.device("cuda"))
    d = torch.rand(10000, 10000, device=torch.device("cuda"))
    # Chain three matrix multiplications, repeated 1000 times.
    for i in range(0, 1000):
        y = a @ b
        z = c @ d
        x = y @ z
    son = time.time()
    print("1.test result (second) : " + str(son - bas))
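
One caveat about this measurement: PyTorch queues CUDA operations asynchronously, so the final time.time() call can, in principle, run before the last multiplications have finished on the device. A variant of the same test that waits for the GPU before taking the final timestamp, my own sketch rather than the code that produced the results below, could look like this:

    import time, torch

    bas = time.time()
    dev = torch.device("cuda")
    a, b, c, d = (torch.rand(10000, 10000, device=dev) for _ in range(4))
    for _ in range(1000):
        y = a @ b
        z = c @ d
        x = y @ z
    # Wait for all queued GPU kernels to finish before reading the clock.
    torch.cuda.synchronize()
    son = time.time()
    print("1.test result (second, synchronized) : " + str(son - bas))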

2. Speed test: With C++, I manually allocated space in GPU memory for two matrices (10000 rows and 10000 columns each), filled them with values using loops, and then multiplied the two matrices with each other on the GPU.

You can access the CUDA source and header files written for Test 2 from these links: https://ahmetfurkandemir.s3.amazonaws.com/kernel.cu (kernel.cu), https://ahmetfurkandemir.s3.amazonaws.com/dev_array.h (dev_array.h), https://ahmetfurkandemir.s3.amazonaws.com/kernel.h (kernel.h), https://ahmetfurkandemir.s3.amazonaws.com/matrixmul.cu (matrixmul.cu).

    #include <iostream>
    #include <vector>
    #include <stdlib.h>
    #include <time.h>
    #include <cuda_runtime.h>
    #include "kernel.h"
    #include "kernel.cu"
    #include "dev_array.h"
    #include <math.h>
    #include <stdio.h>

    using namespace std;

    int main()
    {
        // Perform matrix multiplication C = A*B
        // where A, B and C are NxN matrices
        int N = 10000;
        int SIZE = N*N;

        // Allocate memory on the host
        vector<float> h_A(SIZE);
        vector<float> h_B(SIZE);
        vector<float> h_C(SIZE);

        // Initialize matrices on the host
        for (int i=0; i<N; i++){
            for (int j=0; j<N; j++){
                h_A[i*N+j] = sin(i);
                h_B[i*N+j] = cos(j);
            }
        }

        // Allocate memory on the device
        dev_array<float> d_A(SIZE);
        dev_array<float> d_B(SIZE);
        dev_array<float> d_C(SIZE);

        // Copy the host matrices to the device and multiply them there
        d_A.set(&h_A[0], SIZE);
        d_B.set(&h_B[0], SIZE);
        matrixMultiplication(d_A.getData(), d_B.getData(), d_C.getData(), N);
        cudaDeviceSynchronize();

        // Copy the result back to the host
        d_C.get(&h_C[0], SIZE);
        cudaDeviceSynchronize();

        printf("END");
        return 0;
    }
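
The multiplication kernel itself lives in kernel.cu, and dev_array.h wraps the device allocations; neither file is reproduced here. Assuming matrixMultiplication computes a plain N x N matrix product, a rough PyTorch equivalent of this program (my own sketch, not part of the benchmark) would be:

    import torch

    N = 10000
    dev = torch.device("cuda")

    # Same initialisation as the host loops: A[i][j] = sin(i), B[i][j] = cos(j).
    i = torch.arange(N, dtype=torch.float32, device=dev)
    j = torch.arange(N, dtype=torch.float32, device=dev)
    A = torch.sin(i).unsqueeze(1).expand(N, N)
    B = torch.cos(j).unsqueeze(0).expand(N, N)

    # C = A * B as an N x N matrix product, computed on the GPU.
    C = A @ B
    torch.cuda.synchronize()
    print("END")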

Let’s get to know the contestants :)

There are a total of 4 GPU setups, all from the Tesla series; let's examine them in order.

4 X NVIDIA Tesla V100 GPU


  • We have a machine with 4 Tesla V100 GPUs (64 GB of video memory in total), and we also have a 16-core Intel(R) Xeon(R) CPU.

NVIDIA Tesla P4 GPU


  • We have a Tesla P4 GPU with 7.6 GB of video memory, and we also have a 1-core Intel(R) Xeon(R) CPU.

NVIDIA Tesla P100 GPU


  • We have a Tesla P100 GPU with 16.2 GB of video memory, and we also have a 1-core Intel(R) Xeon(R) CPU.

NVIDIA Tesla T4 GPU


  • We have a Tesla T4 GPU with 15 GB of video memory, and we also have a 1-core Intel(R) Xeon(R) CPU.
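
On any of these machines, the GPU count and video memory can also be checked from inside the notebook; here is a small sketch (my own addition, not part of the benchmark) using PyTorch's CUDA utilities:

    import torch

    # List every visible CUDA device with its name and total memory in GB.
    for idx in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(idx)
        print(idx, props.name, round(props.total_memory / 1024**3, 1), "GB")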

Test 1 Results

1. Let’s recall what our test is. I create four matrices with 10000 rows and 10000 columns on the GPU. First I multiply matrices a and b and assign the result to y, then I multiply matrices c and d and assign the result to z, and finally I multiply y and z and assign the result to x. I repeat this sequence of operations 1000 times in total.

    import time, torch

    bas = time.time()
    # Four random 10000 x 10000 matrices created directly on the GPU.
    a = torch.rand(10000, 10000, device=torch.device("cuda"))
    b = torch.rand(10000, 10000, device=torch.device("cuda"))
    c = torch.rand(10000, 10000, device=torch.device("cuda"))
    d = torch.rand(10000, 10000, device=torch.device("cuda"))
    # Chain three matrix multiplications, repeated 1000 times.
    for i in range(0, 1000):
        y = a @ b
        z = c @ d
        x = y @ z
    son = time.time()
    print("1.test result (second) : " + str(son - bas))

Performance of GPUs, in seconds

  • 1-) 4 X NVIDIA Tesla V100 GPU: 291.4778277873993 seconds, about 4.85 minutes.

  • 2-) NVIDIA Tesla P4 GPU: 1071.427838563919 seconds, about 17.85 minutes.

  • 3-) NVIDIA Tesla P100 GPU: 479.9311819076538 seconds, about 7.99 minutes.

  • 4-) NVIDIA Tesla T4 GPU: 1293.739860534668 seconds, about 21.56 minutes.

Our machine with 4 X NVIDIA Tesla V100 GPUs won this race.


Test 2 Results

2. Let’s recall what our test is. With C++, I manually allocated space in GPU memory for two matrices (10000 rows and 10000 columns each), filled them with values using loops, and then multiplied the two matrices with each other on the GPU.

    #include <iostream>
    #include <vector>
    #include <stdlib.h>
    #include <time.h>
    #include <cuda_runtime.h>
    #include "kernel.h"
    #include "kernel.cu"
    #include "dev_array.h"
    #include <math.h>
    #include <stdio.h>

    using namespace std;

    int main()
    {
        // Perform matrix multiplication C = A*B
        // where A, B and C are NxN matrices
        int N = 10000;
        int SIZE = N*N;

        // Allocate memory on the host
        vector<float> h_A(SIZE);
        vector<float> h_B(SIZE);
        vector<float> h_C(SIZE);

        // Initialize matrices on the host
        for (int i=0; i<N; i++){
            for (int j=0; j<N; j++){
                h_A[i*N+j] = sin(i);
                h_B[i*N+j] = cos(j);
            }
        }

        // Allocate memory on the device
        dev_array<float> d_A(SIZE);
        dev_array<float> d_B(SIZE);
        dev_array<float> d_C(SIZE);

        // Copy the host matrices to the device and multiply them there
        d_A.set(&h_A[0], SIZE);
        d_B.set(&h_B[0], SIZE);
        matrixMultiplication(d_A.getData(), d_B.getData(), d_C.getData(), N);
        cudaDeviceSynchronize();

        // Copy the result back to the host
        d_C.get(&h_C[0], SIZE);
        cudaDeviceSynchronize();

        printf("END");
        return 0;
    }

Test 2-a, performance of GPUs, in seconds:

In test 2-a, let's first see which machine compiles the CUDA file named matrixmul.cu fastest. The files are compiled with nvcc (the CUDA compiler); a scripted alternative to the notebook's ! magic is sketched after the 2-a results below.

    # compilation test
    bas = time.time()
    !nvcc matrixmul.cu
    son = time.time()
    print("2.test-a result (second) : " + str(son - bas))

  • 1-) 4 X NVIDIA Tesla V100 GPU: 1.413379192352295 seconds.

  • 2-) NVIDIA Tesla P4 GPU: 2.9613592624664307 seconds.

  • 3-) NVIDIA Tesla P100 GPU: 1.4539947509765625 seconds.

  • 4-) NVIDIA Tesla T4 GPU: 1.6754465103149414 seconds.

The machine with 4 X NVIDIA Tesla V100 GPUs won race 2-a by a small margin.
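
As an aside, both the compile step above and the run step in test 2-b below can be timed without the notebook's ! magic; here is a sketch using Python's subprocess module, assuming nvcc is installed and matrixmul.cu is in the working directory:

    import subprocess, time

    def timed(cmd):
        # Run the command, wait for it to finish, and return the elapsed wall time.
        start = time.time()
        subprocess.run(cmd, check=True)
        return time.time() - start

    print("compile (second) :", timed(["nvcc", "matrixmul.cu"]))
    print("run (second)     :", timed(["./a.out"]))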

Test 2-b, performance of GPUs, in seconds:

In test 2-b, let's see which machine finishes running the compiled file first.

    # run the compiled file, test
    bas = time.time()
    !./a.out
    son = time.time()
    print("2.test-b result (second) : " + str(son - bas))

  • 1-) 4 X NVIDIA Tesla V100 GPU: 9.453376293182373 seconds.

  • 2-) NVIDIA Tesla P4 GPU: 8.686630487442017 seconds.

  • 3-) NVIDIA Tesla P100 GPU: 8.072553873062134 seconds.

  • 4-) NVIDIA Tesla T4 GPU: 8.99604868888855 seconds.

The machine with the NVIDIA Tesla P100 GPU won race 2-b by a small margin.


My own conclusions based on these results

  • According to my observations, in short and simple operations all of the GPUs finish in a very short and very similar time, regardless of video memory or CPU.

  • But in long and demanding calculations, plenty of GPU memory and a good CPU let a machine stand out from the other competitors.

  • Looking at the results, today's winner is the machine with 4 X NVIDIA Tesla V100 GPUs :).

  • WARNING: Rather than evaluating these GPUs in isolation, I recommend examining them together with the rest of the machine's hardware; the same GPUs may give different results in different applications, or in the same test run at different times.