Numba supports CUDA-enabled GPUs with compute capability (CC) 2.0 or above, together with an up-to-date NVIDIA driver. It is still wise to use a GPU with compute capability 3.0 or above. (Double precision has actually been supported in hardware since CC 1.3, so older devices are not strictly single-precision only, but other tools are stricter: MATLAB, for instance, only permits GPUs with CC >= 3.0.)
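Once Numba is installed (see below), you can query the compute capability of the active card directly. A minimal check using numba.cuda's device API:

from numba import cuda

dev = cuda.get_current_device()
print(dev.name)                 # e.g. b'GeForce GTX 1050 Ti'
print(dev.compute_capability)   # e.g. (6, 1)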
On my machine, I have two NVIDIA GPUs: a GeForce GTX 960 (CC 5.2) and a GeForce GTX 1050 Ti (CC 6.1).
Sat Dec 15 22:01:07 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:0F:00.0  On |                  N/A |
| 30%   26C    P8    N/A /  75W |    284MiB /  4038MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 960     Off  | 00000000:42:00.0 Off |                  N/A |
|  0%   30C    P8     7W / 160W |     12MiB /  2002MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       982      G   /usr/lib/xorg/Xorg                           136MiB |
|    0      1150      G   /usr/bin/gnome-shell                          98MiB |
|    0      3742      C   ...TPL/anaconda/python3/install/bin/python    45MiB |
+-----------------------------------------------------------------------------+
First, you need to install the CUDA toolkit. Using conda:
conda update conda
conda install accelerate
conda install cudatoolkit
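A quick sanity check that the install worked and the package is importable:

import numba
print(numba.__version__)   # the version conda just installed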
Originally, GPU support was part of NumbaPro, which has since been deprecated. The old way to check CUDA compatibility was:
from numbapro import check_cuda
check_cuda()
If you’re lucky, numbapro will merely emit a deprecation warning; I received an error. The underlying routines behind check_cuda() are still present, albeit in a different location: numba.
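In current Numba, the closest one-line replacement (numba.cuda.is_available simply reports whether the CUDA driver can be initialized and a device found) is:

from numba import cuda
print(cuda.is_available())   # True when a usable driver and device are present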
Let’s check Numba’s CUDA support on my system more thoroughly. First, a search for the CUDA libraries with numba.cuda.cudadrv.libs.test():
In [4]: numba.cuda.cudadrv.libs.test()
Finding cublas
    located at /home/sconde/TPL/anaconda/python3/install/lib/libcublas.so.7.5
    trying to open library... ok
Finding cusparse
    located at /home/sconde/TPL/anaconda/python3/install/lib/libcusparse.so.7.5
    trying to open library... ok
Finding cufft
    located at /home/sconde/TPL/anaconda/python3/install/lib/libcufft.so.7.5
    trying to open library... ok
Finding curand
    located at /home/sconde/TPL/anaconda/python3/install/lib/libcurand.so.7.5
    trying to open library... ok
Finding nvvm
    located at /home/sconde/TPL/anaconda/python3/install/lib/libnvvm.so.3.0.0
    trying to open library... ok
    finding libdevice for compute_20... ok
    finding libdevice for compute_30... ok
    finding libdevice for compute_35... ok
    finding libdevice for compute_50... ok
Out[4]: True
Now let’s check the devices detected and available for computation with numba.cuda.api.detect():
In [5]: numba.cuda.api.detect()
Found 2 CUDA devices
id 0    b'GeForce GTX 1050 Ti'    [SUPPORTED]
        compute capability: 6.1
             pci device id: 0
                pci bus id: 15
id 1    b'GeForce GTX 960'        [SUPPORTED]
        compute capability: 5.2
             pci device id: 0
                pci bus id: 66
Summary:
    2/2 devices are supported
Out[5]: True
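Since there are two devices, it is worth knowing how to pick one. A short sketch using numba.cuda's device-management API: cuda.gpus lists the detected devices, and cuda.select_device binds the calling thread to one of them.

from numba import cuda

# Enumerate the detected devices; each entry carries an id and a name
for gpu in cuda.gpus:
    print(gpu.id, gpu.name)

# Use the GTX 960 (device id 1, as reported above) for subsequent kernels
cuda.select_device(1)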
Everything seems to be working. How about a small kernel as a further test? Here is an element-wise addition of two single-precision vectors:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

@vectorize(["float32(float32, float32)"], target='cuda')
def VectorAdd(a, b):
    return a + b

def main():
    N = 320000000

    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)

    # Element-wise addition on the GPU (includes the host<->device transfers)
    start = timer()
    C = VectorAdd(A, B)
    vectoradd_timer = timer() - start

    # The same operation with NumPy on the CPU, for comparison
    start = timer()
    C_np = A + B
    np_vectoradd_timer = timer() - start

    # Largest element-wise difference between the two results
    error = np.abs(C - C_np).max()
    print("Error: ", error)
    print("VectorAdd took %f seconds" % vectoradd_timer)
    print("VectorAdd(NP) took %f seconds" % np_vectoradd_timer)

if __name__ == '__main__':
    main()
Running this script first emits a warning about the launch configuration (tpb is threads per block), then prints the timings:
warnings.warn('Could not autotune, using default tpb of 128')
Error: 0.0
VectorAdd took 1.276218 seconds
VectorAdd(NP) took 0.591913 seconds
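Keep in mind that the GPU timing above includes compiling the kernel on first call and copying about 1.2 GiB per array between host and device. To time just the kernel, one can pre-copy the inputs with cuda.to_device; a sketch, continuing the script above (VectorAdd, A, B, and timer are as defined earlier):

from numba import cuda

# Copy the inputs to the device once, outside the timed region
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)

start = timer()
d_C = VectorAdd(d_A, d_B)   # device-array inputs: the result stays on the GPU
cuda.synchronize()          # wait until the kernel has actually finished
kernel_timer = timer() - start

C = d_C.copy_to_host()      # device -> host copy, only when the result is needed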
NumPy is still faster in the naive timing, and I wasn’t expecting VectorAdd to outperform NumPy on a memory-bound operation like this anyway. What matters is that the code runs. But how do I know that it’s actually running on the GPU?