numpy - How to properly use Anaconda Accelerate for GPU
I am trying to do fast matrix computations with Anaconda Accelerate. I started with a basic example: multiply two matrices.
My goal is to somehow get GPU multiplication that beats the usual numpy.dot.
Here is my basic example, based on the documentation.
from numbapro import guvectorize
import numpy as np
import time

@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
             '(m,n),(n,p)->(m,p)', target='gpu')
def matmul(a, b, c):
    m, n = a.shape
    n, p = b.shape
    for i in range(m):
        for j in range(p):
            c[i, j] = 0
            for k in range(n):
                c[i, j] += a[i, k] * b[k, j]

for dim in [50, 100, 200]:
    rnd = np.random.RandomState(0)
    a = rnd.rand(dim, dim).astype(np.float32)
    b = rnd.rand(dim, dim).astype(np.float32)
    resgpu = np.zeros_like(a)

    start = time.time()
    rescpu = np.dot(a, b)
    print('cpu:', time.time() - start)

    start = time.time()
    resgpu = matmul(a, b)
    print('gpu:', time.time() - start)

    print(np.allclose(rescpu, resgpu))
    print(np.allclose(resgpu, rescpu))
The results are bad: the GPU is incredibly slow compared to the CPU.
cpu: 0.00011801719665527344
gpu: 0.05677294731140137
True
True
cpu: 0.00011205673217773438
gpu: 0.3881375789642334
True
True
cpu: 0.00038933753967285156
gpu: 3.018171787261963
True
True
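One caveat with timings like those above: the first call to a JIT-compiled GPU function also pays compilation and host-to-device transfer costs, so timing a single call heavily penalizes the GPU. A fairer comparison warms the function up once and then takes the best of several runs. Below is a minimal, hypothetical benchmarking helper (the `bench` name and structure are my own, not from Anaconda's example), demonstrated here on plain `np.dot` so it runs without a GPU:

```python
import time
import numpy as np

def bench(fn, *args, repeats=10):
    """Time fn(*args): one warm-up call (absorbs any JIT-compile or
    transfer cost), then return the best of `repeats` timed runs."""
    fn(*args)  # warm-up; for a GPU kernel this also triggers compilation
    best = float('inf')
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

rnd = np.random.RandomState(0)
a = rnd.rand(200, 200).astype(np.float32)
b = rnd.rand(200, 200).astype(np.float32)
print('cpu best of 10:', bench(np.dot, a, b))
```

The same helper could wrap the `matmul` gufunc; its second and later calls would exclude the one-time compilation overhead, though for matrices this small the CPU would likely still win.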
Of course I understand that the internal numpy implementation is well optimized, but I expected the official Anaconda example to perform well. I am using Python 3.4.3, and I got errors when trying to use these two helper libraries: http://www.cs.toronto.edu/~tijmen/gnumpy.html and https://github.com/rctn/gpupy
I should mention that with gpupy I did get a successful speedup on Python 2.7.
So my question is: how can I do matrix multiplication faster than numpy on the CPU by using the GPU? What is wrong with the official Anaconda example, and is there a working library for Python 3 that lets me use the GPU in a numpy-like way?
===
Results
Unfortunately, there is no simple and good way for Python 3; use 2.7 instead.
Thanks to @rth for recommending the awesome library scikits.cuda.
Some benchmarks (tested using Anaconda MKL, so numpy is fast too):
import numpy as np
import time
import pycuda.autoinit            # initialize the CUDA context
from pycuda import gpuarray
import scikits.cuda.linalg as culinalg
culinalg.init()

dim = 10000
rnd = np.random.RandomState(0)
a = rnd.rand(dim, dim).astype(np.float32)
b = rnd.rand(dim, dim).astype(np.float32)
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)

start = time.time()
rescpu = np.dot(a, b)
print 'cpu:', time.time() - start

start = time.time()
resgpu = culinalg.dot(a_gpu, b_gpu)
print 'gpu:', time.time() - start

resgpu = resgpu.get()
print np.allclose(rescpu, resgpu)
print np.allclose(resgpu, rescpu)
And the results:
cpu: 16.4765479565
gpu: 0.000520944595337
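A sanity check on these numbers: a dense n×n matmul costs about 2n³ floating-point operations, so the timings translate into FLOP rates. The CPU figure comes out plausible for an MKL build, while the GPU figure is physically impossible; that is almost certainly because the cuBLAS call launches asynchronously and returns before the kernel finishes, so the timed section should really include the `resgpu.get()` (or an explicit synchronization). A quick back-of-the-envelope check:

```python
# Back-of-the-envelope FLOP rates from the measured timings above.
n = 10000
flops = 2 * n**3  # multiply-adds in a dense n x n matrix product
cpu_time = 16.4765479565
gpu_time = 0.000520944595337

print('cpu: %.1f GFLOP/s' % (flops / cpu_time / 1e9))
# The GPU rate is in the petaFLOP range -- implausible for one card,
# which indicates the kernel launch returned before the work was done.
print('gpu: %.1f GFLOP/s' % (flops / gpu_time / 1e9))
```

Moving `resgpu.get()` above the second `time.time()` would give an honest GPU timing; the GPU would still win at this size, just not by five orders of magnitude.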
You should have a look at BLAS implementations, which provide highly optimized routines for classical linear algebra operations. Multiplication of dense matrices is performed with the gemm function.

- For instance, matrix multiplication in numpy is significantly improved if numpy is compiled against an optimized BLAS implementation (OpenBLAS, ATLAS, MKL, etc.).
- For the GPU, NVIDIA provides the cuBLAS implementation. According to this answer, it can be called on numpy arrays using the scikits.cuda module. Anaconda Accelerate, which you are using, also provides direct bindings to cuBLAS.
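To make the gemm connection concrete: `np.dot` on float32 matrices ultimately dispatches to the BLAS `sgemm` routine, which SciPy also exposes directly (this sketch assumes SciPy is installed; it is not required by the answer above):

```python
import numpy as np
from scipy.linalg.blas import sgemm  # single-precision GEMM: C = alpha * A @ B

rnd = np.random.RandomState(0)
a = rnd.rand(4, 4).astype(np.float32)
b = rnd.rand(4, 4).astype(np.float32)

# Call the BLAS routine directly and compare with numpy's own path.
c = sgemm(alpha=1.0, a=a, b=b)
print(np.allclose(c, np.dot(a, b)))
```

cuBLAS provides the same `gemm` interface on the device, which is why `culinalg.dot` on `gpuarray`s is a drop-in analogue of `np.dot`.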
BTW, if you want to benchmark CPU vs. GPU performance for matrix multiplication, you should also specify which BLAS is used by numpy for the CPU calculations, since the results can differ by an order of magnitude (see this benchmark).
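To find out which BLAS your numpy build actually links against (MKL in an Anaconda install, OpenBLAS, reference BLAS, ...), numpy ships a built-in report:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries numpy was compiled against,
# e.g. mkl_rt for Anaconda's MKL build.
np.show_config()
```

Quoting this output alongside CPU timings makes a CPU-vs-GPU benchmark reproducible and comparable.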