numpy - How to properly use Anaconda Accelerate for GPU


I am trying to do fast matrix computations with Anaconda Accelerate. I started with a basic example: multiplying two matrices.

My goal is a GPU multiplication that somehow beats the usual numpy.dot.

Here is my basic example, based on the documentation.

    from numbapro import guvectorize
    from numpy import arange

    @guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                 '(m,n),(n,p)->(m,p)', target='gpu')
    def matmul(a, b, c):
        m, n = a.shape
        n, p = b.shape
        for i in range(m):
            for j in range(p):
                c[i, j] = 0
                for k in range(n):
                    c[i, j] += a[i, k] * b[k, j]

    import numpy as np
    import time

    for dim in [50, 100, 200]:
        rnd = np.random.RandomState(0)
        a = rnd.rand(dim, dim).astype(np.float32)
        b = rnd.rand(dim, dim).astype(np.float32)
        resgpu = np.zeros_like(a)

        start = time.time()
        rescpu = np.dot(a, b)
        print('cpu:', time.time() - start)

        start = time.time()
        resgpu = matmul(a, b)
        print('gpu:', time.time() - start)

        print(np.allclose(rescpu, resgpu))
        print(np.allclose(resgpu, rescpu))

The results are bad: the GPU is incredibly slower than the CPU.

    cpu: 0.00011801719665527344
    gpu: 0.05677294731140137
    True
    True
    cpu: 0.00011205673217773438
    gpu: 0.3881375789642334
    True
    True
    cpu: 0.00038933753967285156
    gpu: 3.018171787261963
    True
    True

Of course I understand that the internal NumPy implementation is well optimized, but I expected the official Anaconda example to be at least competitive. I'm using Python 3.4.3, and I got errors when trying these two helper libs: http://www.cs.toronto.edu/~tijmen/gnumpy.html and https://github.com/rctn/gpupy

I should mention that with gpupy I did get a successful speedup on Python 2.7.

So my question is: how can I do matrix multiplication that beats NumPy on the CPU by using the GPU? What is wrong with the official Anaconda example, and is there a working library for Python 3 that lets me use the GPU in a NumPy-like way?

===

Results

Unfortunately, there is no simple and working way for Python 3; use 2.7 instead.

Thanks to @rth for recommending the awesome library scikits.cuda.

Available functions

Some benchmarks (tested using Anaconda MKL, so NumPy is fast too):

    import numpy as np
    import time
    # setup imports assumed from the scikits.cuda examples
    import pycuda.autoinit
    from pycuda import gpuarray
    import scikits.cuda.linalg as culinalg
    culinalg.init()

    dim = 10000
    rnd = np.random.RandomState(0)
    a = rnd.rand(dim, dim).astype(np.float32)
    b = rnd.rand(dim, dim).astype(np.float32)
    a_gpu = gpuarray.to_gpu(a)
    b_gpu = gpuarray.to_gpu(b)

    start = time.time()
    rescpu = np.dot(a, b)
    print 'cpu:', time.time() - start

    start = time.time()
    resgpu = culinalg.dot(a_gpu, b_gpu)
    print 'gpu:', time.time() - start

    resgpu = resgpu.get()
    print np.allclose(rescpu, resgpu)
    print np.allclose(resgpu, rescpu)

And the results:

    cpu: 16.4765479565
    gpu: 0.000520944595337

You should have a look at BLAS implementations, which provide highly optimized routines for classical linear algebra operations. Multiplication of dense matrices is performed by the gemm function, which you can also call directly, as sketched below.
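To make this concrete, here is a minimal sketch of calling gemm directly through SciPy's BLAS bindings (scipy.linalg.blas.sgemm, the single-precision variant). This illustration is mine, not part of the original answer; np.dot ends up in the same routine when NumPy is linked against an optimized BLAS.

    import numpy as np
    from scipy.linalg.blas import sgemm  # single-precision gemm

    a = np.random.rand(500, 500).astype(np.float32)
    b = np.random.rand(500, 500).astype(np.float32)

    # gemm computes alpha * a.dot(b) (plus beta * c when c is given);
    # with an optimized BLAS this is the routine np.dot dispatches to
    c = sgemm(1.0, a, b)
    print(np.allclose(c, np.dot(a, b), atol=1e-3))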

  • For instance, matrix multiplication in NumPy is significantly improved if it is compiled against an optimized BLAS implementation (OpenBLAS, ATLAS, MKL, etc.).
  • For the GPU, NVIDIA provides the cuBLAS implementation. According to this answer, it can be called on NumPy arrays using the scikits.cuda module. Anaconda Accelerate, which you are using, also provides direct bindings to cuBLAS (see the sketch after this list).
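On the GPU side, here is a minimal sketch of the direct cuBLAS binding mentioned above, assuming the numbapro.cudalib.cublas.Blas API from the Anaconda Accelerate docs; exact signatures may differ between versions. Note that cuBLAS expects column-major arrays, hence the Fortran ordering.

    import numpy as np
    from numbapro.cudalib import cublas  # Anaconda Accelerate binding

    dim = 1024
    # cuBLAS works on column-major (Fortran-ordered) arrays
    a = np.asfortranarray(np.random.rand(dim, dim).astype(np.float32))
    b = np.asfortranarray(np.random.rand(dim, dim).astype(np.float32))
    c = np.zeros((dim, dim), dtype=np.float32, order='F')

    blas = cublas.Blas()
    # gemm computes c = alpha * a.dot(b) + beta * c on the device;
    # host arrays are transferred automatically (per the NumbaPro docs)
    blas.gemm('N', 'N', dim, dim, dim, 1.0, a, b, 0.0, c)

    print(np.allclose(c, a.dot(b), atol=1e-3))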

BTW, if you want to benchmark CPU vs GPU performance for matrix multiplication, you should also specify which BLAS is used by NumPy for the CPU calculations, since the results can differ by an order of magnitude (see this benchmark).
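A quick way to check which BLAS a given NumPy build is linked against is NumPy's own np.__config__.show():

    import numpy as np

    # prints the build configuration, including the BLAS/LAPACK
    # libraries NumPy was compiled against (MKL, OpenBLAS, ATLAS,
    # or the unoptimized reference BLAS)
    np.__config__.show()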

