Opened 9 months ago

Last modified 5 weeks ago

#1076 new enhancement

support CUDA as well as OpenCL

Reported by: pkienzle
Owned by:
Priority: major
Milestone: SasView 4.3.0
Component: SasView
Keywords:
Cc:
Work Package: SasView Bug Fixing

Description

Most GPU computing clusters are based on NVidia devices and will have the correct drivers installed for CUDA, but they may not have drivers for OpenCL. Also, NVidia may choose to drop support for OpenCL and then we will be forced to switch.

The sasmodels cuda-test branch demonstrates that the kernels can run under CUDA. There appears to be no performance advantage: on a fast NVidia card the 2D cylinder kernel is slightly faster when running with OpenCL. Note that the sync() function in kernelcuda.GPUKernel.call is slow; I set the step size to infinite when running the timing tests. It should be easy enough to fix, but that was not required for this proof of concept.

Running under CUDA required a number of code changes:

  1. Every function that runs under CUDA must be tagged as a __device__ function. This includes Iq, form_volume, etc. in user-defined plugins. The tag is easy enough to strip with a macro when running the OpenCL or DLL engines, but automatically adding it for CUDA when it is missing will be a bit tricky.
  2. Global, local and constant memory have different qualifiers in OpenCL and CUDA. Unfortunately, some of them must appear in the function declaration, and the qualifier there may differ from the one required when defining the variable. For example, the constant modifier is required on both parameters and function declarations in OpenCL, but the equivalent __constant__ modifier in CUDA may only appear in the variable definition, not in the function declaration. An ugly solution is to use something like constant_var when defining the variable and constant_par when declaring it as a function parameter.
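A minimal sketch of how the qualifier mismatch in item 2 could be handled: prepend target-specific C preprocessor defines to the kernel source before compiling it, so the model code only ever uses neutral names. The macro names (constant_var, constant_par, DEVICE, etc.) follow the ticket's suggestion and are hypothetical; the CUDA spelling for constant_par (plain const) is an assumption, since CUDA has no address-space qualifier for function parameters.

```python
# Hypothetical sketch: choose qualifier spellings per target by prepending
# C preprocessor defines to the kernel source before compilation.
# All macro names here are illustrative, not the actual sasmodels names.

OPENCL_DEFINES = {
    "constant_var": "constant",      # OpenCL: constant works in definitions...
    "constant_par": "constant",      # ...and in parameter declarations
    "global_par": "global",
    "local_par": "local",
    "DEVICE": "",                    # no device qualifier needed in OpenCL
}

CUDA_DEFINES = {
    "constant_var": "__constant__",  # CUDA: only valid on variable definitions
    "constant_par": "const",         # assumption: plain const in parameter lists
    "global_par": "",                # global address space is implicit in CUDA
    "local_par": "__shared__",
    "DEVICE": "__device__",
}

def preamble(target):
    """Return the #define block to prepend for the given compile target."""
    defines = CUDA_DEFINES if target == "cuda" else OPENCL_DEFINES
    return "\n".join("#define %s %s" % (k, v) for k, v in defines.items())
```

With this in place, a model parameter would be declared as constant_par double *coefs and a lookup table as constant_var double table[], and the preamble maps both to the right spelling for each engine.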

Both problems could be addressed without changing our model API by writing a simple parser to adjust the function declarations appropriately for CUDA. It will need to be called every time a model is called, so make sure it is fast enough.

Change History (5)

comment:1 Changed 2 months ago by pkienzle

Currently, the device and constant qualifiers require changing the model code. Only the cylinder model has been changed.

For marking device functions in cuda, a regex replacement for functions may be good enough:

qualifiers return_type function ( args ) => __device__ qualifiers return_type function ( args )
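The rewrite above can be sketched as a single multiline regex substitution. This is a hypothetical implementation, not the code from the branch: it assumes model functions are defined at file scope with the body's opening brace on the same logical statement, and it deliberately requires a { (not a ;) so that prototypes and function calls are left alone.

```python
import re

# Hypothetical sketch of the regex rewrite described above: prefix each
# C function definition with __device__ when compiling for CUDA.
FUNCTION_DEF = re.compile(
    r"^(\s*)"                       # leading indentation
    r"((?:static\s+|inline\s+)*"    # optional storage qualifiers
    r"[A-Za-z_][\w\s\*]*?"          # return type (possibly pointer)
    r"\b[A-Za-z_]\w*\s*"            # function name
    r"\([^;]*\)\s*\{)",             # parameter list followed by the body
    re.MULTILINE)

def tag_device_functions(source):
    """Insert __device__ before each function definition (not declarations)."""
    return FUNCTION_DEF.sub(r"\1__device__ \2", source)
```

Keywords such as if, for and while are not matched because the pattern cannot consume the ( that separates them from their condition, and statements ending in ; never reach the required {.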

Regarding the use of global and constant qualifiers on values, the branch currently uses global/local/constant macros with _par and _var suffixes to allow different handling between CUDA and OpenCL. This only affects kernel_iq.c (which users don't see) and polevl.c. It would be somewhat cleaner to mark the arrays passed to polevl as constant rather than constant_var (since there are a lot of them) and only mark polevl's parameter as constant_par.

Packaging for CUDA will need pycuda in addition to pyopencl.

CUDA models are triggered by setting SAS_OPENCL=cuda in the environment. In the absence of SAS_OPENCL we could check whether a CUDA device is available and use it before falling back to OpenCL.
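The fallback order suggested above could be sketched as follows. This is an illustrative function, not code from the branch; the engine names it returns are assumptions, and pycuda/pyopencl are treated as optional dependencies whose absence (or lack of a usable device) simply moves us down the list.

```python
import os

def default_engine():
    """Pick a compute engine: explicit SAS_OPENCL wins, then CUDA,
    then OpenCL, then the compiled DLL engine.  (Hypothetical sketch.)"""
    engine = os.environ.get("SAS_OPENCL", "").lower()
    if engine:                      # explicit user choice wins
        return engine
    try:
        import pycuda.driver as cuda    # optional dependency
        cuda.init()                     # raises if no CUDA driver present
        if cuda.Device.count() > 0:
            return "cuda"
    except Exception:
        pass                        # pycuda missing or no usable device
    try:
        import pyopencl             # optional dependency
        if pyopencl.get_platforms():
            return "opencl"
    except Exception:
        pass                        # pyopencl missing or no platforms
    return "none"                   # fall back to the DLL engine
```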

Here are the environment variables used by pycuda:

PYCUDA_DEFAULT_NVCC_FLAGS
PYCUDA_CACHE_DIR
PYCUDA_DISABLE_CACHE
CUDA_PATH
CUDA_ROOT
CUDA_DEVICE

comment:2 Changed 5 weeks ago by pkienzle

In 74e9b5fa9e460b3726df690a1411aef195f15b4e/sasmodels:

autotag functions as device functions for cuda. Refs #1076.

comment:3 Changed 5 weeks ago by pkienzle

In 4de145843b2267c4f1e8be16517c48de601151b6/sasmodels:

allow sascomp to use cuda. Refs #1076.

comment:4 Changed 5 weeks ago by pkienzle

With a simple regex marking the functions as device functions, all tests now pass. There was no need to modify any of the model code.

A simple timing test of a 2D cylinder shows some performance problems:

# OpenCL
$ ./sascomp cylinder -2d -nq=200 -neval=20 -poly -noplot
INFO:root:building cylinder-float32-E2AA8C1A for OpenCL GeForce GTX 980 Ti
GPU[32] t=30.40 ms, intensity=10294733

# CUDA [30x slower than OpenCL]
$ ./sascomp cylinder -2d -nq=200 -neval=2 -poly -noplot -DSAS_OPENCL=cuda
INFO:root:building cylinder-float32-E2AA8C1A for CUDA
GPU[32] t=899.80 ms, intensity=10294733

# DLL [250x slower than OpenCL; 125x with OpenMP]
$ ./sascomp cylinder -2d -nq=200 -neval=2 -poly -noplot -DSAS_OPENCL=none
DLL[64] t=7829.20 ms, intensity=10294736

The kernel is more or less the same, the card is the same and NVidia is providing both the OpenCL and the CUDA drivers. Probably need to tweak the calling interface in kernelcuda.py.

comment:5 Changed 5 weeks ago by pkienzle

Speed issue sorted with latest checkin.
