Numba supports CUDA-enabled GPUs with compute capability (CC) 2.0 or above, together with an up-to-date NVIDIA driver. However, it is wise to use a GPU with compute capability 3.0 or above, since that is what allows for double precision operations; anything lower will only support single precision. For comparison, MATLAB permits only CC >= 3.0 for double precision work.
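If you want to query the compute capability of your card directly from Python, numba exposes it on the device object; a minimal sketch:

from numba import cuda

dev = cuda.get_current_device()
cc = dev.compute_capability   # a tuple, e.g. (6, 1)
print("Device:", dev.name)
print("Compute capability: %d.%d" % cc)
if cc < (3, 0):
    print("Warning: this device is limited to single precision")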
On my machine, I have two NVIDIA GPUs: a GeForce GTX 960 (CC 5.2) and a GeForce GTX 1050 Ti (CC 6.1). Here is the output of nvidia-smi:
Sat Dec 15 22:01:07 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 105... Off | 00000000:0F:00.0 On | N/A |
| 30% 26C P8 N/A / 75W | 284MiB / 4038MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 960 Off | 00000000:42:00.0 Off | N/A |
| 0% 30C P8 7W / 160W | 12MiB / 2002MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 982 G /usr/lib/xorg/Xorg 136MiB |
| 0 1150 G /usr/bin/gnome-shell 98MiB |
| 0 3742 C ...TPL/anaconda/python3/install/bin/python 45MiB |
+-----------------------------------------------------------------------------+
First, you need to install the CUDA toolkit. Using Conda:
conda update conda
conda install accelerate
conda install cudatoolkit
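Once the toolkit is in place, a quick sanity check is numba.cuda.is_available(), which reports whether numba can find a usable CUDA driver and device:

from numba import cuda

print(cuda.is_available())  # True if a CUDA driver and GPU were found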
Originally, the GPU support was part of numbapro, which has since been deprecated. Previously, you would check CUDA compatibility with:
from numbapro import check_cuda
check_cuda()
If you’re lucky, numbapro gives a deprecation warning; I received an error instead. The underlying routines for check_cuda() are still present, albeit in a different location: numba itself.
Let’s check numba’s CUDA support on my system. First, search for the CUDA libraries with numba.cuda.cudadrv.libs.test():
In [4]: numba.cuda.cudadrv.libs.test()
Finding cublas
located at /home/sconde/TPL/anaconda/python3/install/lib/libcublas.so.7.5
trying to open library... ok
Finding cusparse
located at /home/sconde/TPL/anaconda/python3/install/lib/libcusparse.so.7.5
trying to open library... ok
Finding cufft
located at /home/sconde/TPL/anaconda/python3/install/lib/libcufft.so.7.5
trying to open library... ok
Finding curand
located at /home/sconde/TPL/anaconda/python3/install/lib/libcurand.so.7.5
trying to open library... ok
Finding nvvm
located at /home/sconde/TPL/anaconda/python3/install/lib/libnvvm.so.3.0.0
trying to open library... ok
finding libdevice for compute_20... ok
finding libdevice for compute_30... ok
finding libdevice for compute_35... ok
finding libdevice for compute_50... ok
Out[4]: True
Now let’s check the devices detected and available for computation with numba.cuda.api.detect()
:
In [5]: numba.cuda.api.detect()
Found 2 CUDA devices
id 0 b'GeForce GTX 1050 Ti' [SUPPORTED]
compute capability: 6.1
pci device id: 0
pci bus id: 15
id 1 b'GeForce GTX 960' [SUPPORTED]
compute capability: 5.2
pci device id: 0
pci bus id: 66
Summary:
2/2 devices are supported
Out[5]: True
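With two supported devices, numba uses GPU 0 by default. To pin work to a particular card, you can use cuda.select_device or the cuda.gpus context manager (setting CUDA_VISIBLE_DEVICES before launching Python also works); a minimal sketch:

from numba import cuda

# Bind this thread's CUDA context to device 1 (the GTX 960 above).
cuda.select_device(1)

# Or switch devices temporarily with a context manager.
with cuda.gpus[0]:
    print(cuda.get_current_device().name)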
Everything seems to be working. How about a small kernel for further testing? Here is a simple vector addition of two single-precision vectors:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

# Compile an element-wise addition ufunc for the GPU.
@vectorize(["float32(float32, float32)"], target='cuda')
def VectorAdd(a, b):
    return a + b

def main():
    N = 320000000  # 320 million elements, ~1.28 GB per array
    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)

    # Time the CUDA ufunc; this includes host<->device transfers.
    start = timer()
    C = VectorAdd(A, B)
    vectoradd_timer = timer() - start

    # Time plain numpy on the CPU for comparison.
    start = timer()
    C_np = A + B
    np_vectoradd_timer = timer() - start

    error = np.abs(C - C_np).max()
    print("Error: ", error)
    print("VectorAdd took %f seconds" % vectoradd_timer)
    print("VectorAdd(NP) took %f seconds" % np_vectoradd_timer)

if __name__ == '__main__':
    main()
Running this script first emits a warning about the launch configuration (tpb stands for threads per block), followed by the timings:
warnings.warn('Could not autotune, using default tpb of 128')
Error: 0.0
VectorAdd took 1.276218 seconds
VectorAdd(NP) took 0.591913 seconds
We can see that numpy is still faster. I wasn’t expecting the VectorAdd function to actually outperform numpy here: the GPU timing includes copying both inputs to the device and the result back to the host, and for a memory-bound operation like addition those transfers dominate. We can see, however, that the code is running. But how do I know that it’s running on the GPU?
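One way to check, beyond watching nvidia-smi while the script runs, is to manage device memory yourself: copy the operands over once with cuda.to_device, call the ufunc on the device arrays, and time only the computation. A minimal sketch, reusing A and B from the script above:

from numba import cuda

# Copy the inputs to the GPU once, outside the timed region.
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)

start = timer()
d_C = VectorAdd(d_A, d_B)  # runs on the device, returns a device array
cuda.synchronize()         # wait for the kernel to actually finish
kernel_timer = timer() - start

C = d_C.copy_to_host()     # bring the result back only when needed
print("Kernel-only time: %f seconds" % kernel_timer)

If the kernel-only time drops well below the numpy time, the transfers were the bottleneck and the arithmetic itself is indeed happening on the GPU.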