Opened 8 years ago
Last modified 6 years ago
#782 new enhancement
Performance tuning for 2D calculations
| Reported by: | pkienzle | Owned by: | |
|---|---|---|---|
| Priority: | minor | Milestone: | sasmodels WishList |
| Component: | sasmodels | Keywords: | |
| Cc: | | Work Package: | SasView Bug Fixing |
Description
Can save 6 trig functions, 9 multiplications, and 5 additions by precomputing the orientation info for each q point. In absolute terms, that is 325k operations on a 128x128 detector. In relative terms, the fcc model uses an additional 4 special functions, 49 multiplications, and 16 adds, so this could be a 25% speed-up.
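As a check on the absolute count (treating each of the 6 + 9 + 5 saved operations as one, over the full detector):

```latex
(6 + 9 + 5) \times 128^2 = 20 \times 16384 = 327\,680 \approx 325\text{k}
```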
Need to transform:
```c
const double q = sqrt(qx*qx + qy*qy);
const double qxhat = qx/q;
const double qyhat = qy/q;
double sin_theta, cos_theta;
double sin_phi, cos_phi;
double sin_psi, cos_psi;
SINCOS(theta*M_PI_180, sin_theta, cos_theta);
SINCOS(phi*M_PI_180, sin_phi, cos_phi);
SINCOS(psi*M_PI_180, sin_psi, cos_psi);
const double cos_alpha = cos_theta*cos_phi*qxhat + sin_theta*qyhat;
const double cos_mu = (-sin_theta*cos_psi*cos_phi - sin_psi*sin_phi)*qxhat
                      + cos_theta*cos_psi*qyhat;
const double cos_nu = (-cos_phi*sin_psi*sin_theta + sin_phi*cos_psi)*qxhat
                      + sin_psi*cos_theta*qyhat;
```
Into a precompute phase:
```c
double sin_theta, cos_theta;
double sin_phi, cos_phi;
double sin_psi, cos_psi;
SINCOS(theta*M_PI_180, sin_theta, cos_theta);
SINCOS(phi*M_PI_180, sin_phi, cos_phi);
SINCOS(psi*M_PI_180, sin_psi, cos_psi);
const double alpha_x = cos_theta*cos_phi;
const double alpha_y = sin_theta;
const double mu_x = -sin_theta*cos_psi*cos_phi - sin_psi*sin_phi;
const double mu_y = cos_theta*cos_psi;
const double nu_x = -cos_phi*sin_psi*sin_theta + sin_phi*cos_psi;
const double nu_y = sin_psi*cos_theta;
```
and a compute phase:
```c
const double q = sqrt(qx*qx + qy*qy);
const double qxhat = qx/q;
const double qyhat = qy/q;
const double cos_alpha = alpha_x*qxhat + alpha_y*qyhat;
const double cos_mu = mu_x*qxhat + mu_y*qyhat;
const double cos_nu = nu_x*qxhat + nu_y*qyhat;
```
For polydisperse systems, need to precompute for each independent (theta,phi,psi) triple, but this can be done in parallel.
For polydisperse systems, can save a sqrt, 4 multiplies and an add by precomputing q, qxhat and qyhat for each point. Again, this can be done in parallel.
Could be implemented using global working memory (ticket #679).
Playing with a model with lots of polydispersity, computation efficiency for the 2-D ellipsoid kernel on an NVIDIA 1080 Ti is 25% of the theoretical maximum.
Can maybe improve performance 20% by prefetching the pd values and weights for the inner loop from global memory to shared memory. Support long pd vectors by introducing an outermost loop that prefetches the next block of the innermost loop whenever all the other loops have exhausted the current block.
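A plain-C sketch of the blocking idea (all names hypothetical; in the real kernel the memcpy would be a cooperative global-to-`__local`/shared copy followed by a barrier): an outer loop stages the next BLOCK pd values and weights into fast memory, and the inner pd loop reads only the staged copy.

```c
#include <assert.h>
#include <string.h>

#define BLOCK 64  /* hypothetical shared-memory block size */

/* Returns the weighted sum as a stand-in for the real kernel work;
 * the point is the loop structure, not the arithmetic. */
double blocked_pd_sum(int n, const double pd_value[], const double pd_weight[])
{
    double cache_v[BLOCK], cache_w[BLOCK];  /* stands in for __local/shared */
    double total = 0.0;
    for (int start = 0; start < n; start += BLOCK) {
        const int len = (n - start < BLOCK) ? n - start : BLOCK;
        /* prefetch: cooperative copy + barrier in CUDA/OpenCL */
        memcpy(cache_v, pd_value + start, len * sizeof(double));
        memcpy(cache_w, pd_weight + start, len * sizeof(double));
        for (int k = 0; k < len; k++) {     /* inner pd loop reads the cache */
            total += cache_w[k] * cache_v[k];
        }
    }
    return total;
}
```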
A simple experiment replacing the fetching code with a constant shows an improvement to 35% of the theoretical maximum.
Need to shut off the partial dispatch in kernelcl/kernelcuda to achieve maximum performance. It is there to prevent machines from crashing or returning bad results if the computation kernel takes too long. Without the ability to turn this off, the additional performance will not be realized.
Test command:
Current speed: 0.85s on OpenCL, 0.77s on CUDA.
Expected speed: 0.71s on OpenCL, 0.62s on CUDA.
Floating-point operations for the computation are 2.1 TFLOP equivalent, after adjusting for the fact that sin, etc., take four cycles rather than one.
Control flow, etc., adds another 100 instructions to the inner loop, so 55% may be the best we can achieve (equivalent to 0.33s, or a factor of 2+ better than we are currently doing).
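As a rough cross-check (assuming the 1080 Ti's nominal single-precision peak of about 11.3 TFLOPS, a figure not stated in the ticket):

```latex
t \approx \frac{2.1\ \text{TFLOP}}{0.55 \times 11.3\ \text{TFLOP/s}} \approx 0.34\ \text{s}
```

which is consistent with the 0.33 s figure above.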
Relative improvement for more complicated models will be less. The additional time and complexity to implement and maintain this may not be worthwhile, especially if it only affects a few simple models.