Opened 6 years ago

Last modified 6 years ago

#1198 new defect

better gpu timeout selection

Reported by: pkienzle
Owned by:
Priority: major
Milestone: sasmodels Next Release +1
Component: sasmodels
Keywords:
Cc:
Work Package: SasModels Infrastructure

Description

When a GPU device is also driving a monitor, the OpenCL driver (and possibly the CUDA driver) halts any calculation that runs past a timeout, even if it has not completed, and gives no indication that the computation timed out. On some platforms this has triggered a system reboot.

To prevent this, sasmodels/kernelcl.py (and kernelcuda.py on the cuda-test branch) contains a timeout loop that breaks after every n kernel evaluations.

The problem is that no fixed value of "n" is optimal for all models and all platforms. If the value is too low there is significant overhead (for example, a 30x slowdown on a GTX-1080), but if it is too high some platforms might crash.
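
For reference, the batching idea can be sketched in plain Python. This is only an illustration of the trade-off, not the code in kernelcl.py: run_in_batches and the evaluate callback are hypothetical names, and evaluate stands in for one kernel launch followed by a wait for completion.

```python
def run_in_batches(evaluate, total, n):
    """Call evaluate(start, stop) on successive slices of at most n evaluations.

    Hypothetical helper: 'evaluate' stands in for one kernel launch plus a
    wait for it to finish. Returns the number of launches performed.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    launches = 0
    for start in range(0, total, n):
        stop = min(start + n, total)  # the last batch may be shorter
        evaluate(start, stop)
        launches += 1
    return launches


if __name__ == "__main__":
    calls = []
    count = run_in_batches(lambda a, b: calls.append((a, b)), total=10, n=4)
    print(count, calls)
```

A small n means many launches, each paying the launch and synchronization cost; a large n means each launch runs longer and is more likely to hit the driver timeout.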

We need more control over this value. Ideally we would ask the card to fail gracefully; otherwise, provide a "computational power" scale factor applied to the number of function evaluations allowed.

We could set the scale factor to infinity for headless GPUs (there is no timeout when no monitor is attached to the card). That will interfere with Ctrl-C handling, but this should not be a problem in environments where a headless GPU is available.
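
One possible shape for the scale factor, as a rough sketch under assumptions: batch_size, batch_ranges, base_n and scale are all hypothetical names, not existing sasmodels options. An infinite scale is mapped to "no limit", so the whole calculation goes out as a single launch.

```python
import math


def batch_size(base_n, scale):
    """Evaluations allowed per launch, or None when there is no limit.

    Hypothetical: base_n is a conservative per-launch count and scale is
    the "computational power" factor; math.inf marks a headless GPU.
    """
    if not scale > 0:  # also rejects NaN
        raise ValueError("scale must be positive")
    if math.isinf(scale):
        return None
    return max(1, int(base_n * scale))


def batch_ranges(total, n):
    """Yield (start, stop) slices covering range(total); n=None gives one slice."""
    if total <= 0:
        return
    step = total if n is None else n
    for start in range(0, total, step):
        yield start, min(start + step, total)


if __name__ == "__main__":
    print(list(batch_ranges(10, batch_size(2, 2.0))))
    print(list(batch_ranges(10, batch_size(2, math.inf))))
```

With a single launch there is no point between batches at which the host can respond to Ctrl-C, which is the interference noted above.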

Change History (1)

comment:1 Changed 6 years ago by butler

  • Work Package changed from SasModels Redesign to SasModels Infrastructure