Opened 6 years ago

Last modified 6 years ago

#1091 new enhancement

improve parallelism for 1D integration models

Reported by: pkienzle Owned by:
Priority: major Milestone: SasView 4.3.0
Component: SasView Keywords:
Cc: Work Package: SasView Bug Fixing

Description

On a GPU with more cores than q points, the speed of 1D models is limited by the cost of the integration loop. For example, on the Radeon R9 Nano, the paracrystalline models each require about 400 ms to evaluate for nq up to 12000. Given that a typical curve has on the order of 120 points (e.g., the P123 example data sets), this suggests that only 1% of the GPU is active for any given function evaluation using 120 points. [Counting cores instead, 128 points on 4096 cores suggests 3% usage.]
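The two utilization estimates above can be reproduced with a line of arithmetic each. This is only a sketch: the function name `utilization` is made up for illustration, and it assumes each q point occupies exactly one parallel lane of the device.

```python
def utilization(n_points, capacity):
    """Fraction of the device kept busy, assuming one q point per lane."""
    return min(1.0, n_points / capacity)

# Numbers taken from the ticket text.
# Estimate 1: run time stays flat up to nq of about 12000, curve has 120 points.
print(f"saturation estimate: {utilization(120, 12000):.1%}")
# Estimate 2: 4096 cores, 128 points.
print(f"core-count estimate: {utilization(128, 4096):.1%}")
```

Either way, a single 1D evaluation leaves almost all of the GPU idle.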

To improve parallelism, we could unroll the integration loop by evaluating the different (theta, phi) points in gauss_z X gauss_z in parallel, then summing the resulting grid in parallel. The 76x76x120 point 2D calculation for sc_paracrystal takes 7.5 ms vs. 423 ms for the 1D calculation (56x speedup). For core-shell parallelepiped the speedup is only 5x. Even symmetric shapes such as barbell can benefit, with a 7x speedup for a 2D 76x120 pattern compared to a 1D loop over gauss_z. More speedup would be possible with specialized code, since some parts of the equation can be precomputed and shared for all points at a given q (the sphere form in the paracrystal example) or a given theta (the C direction in the parallelepiped models).
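A rough sketch of the restructuring, in plain Python rather than OpenCL. It is not SasView code: `toy_kernel` is a made-up integrand, and a midpoint rule stands in for the gauss_z points to keep the example short. `average_loop` does the integration inside each q point, while `average_flat` turns every (q, theta, phi) combination into an independent work item (the part a GPU could run in parallel) and then reduces per q.

```python
import math

def midpoint_rule(n, a, b):
    """(node, weight) pairs for the midpoint rule; stand-in for gauss_z."""
    h = (b - a) / n
    return [(a + (i + 0.5) * h, h) for i in range(n)]

def toy_kernel(q, theta, phi):
    """Hypothetical oriented intensity, NOT a SasView model."""
    qc = q * math.cos(theta)
    qa = q * math.sin(theta) * math.cos(phi)
    return math.cos(qc) ** 2 * math.cos(qa) ** 2

def average_loop(kernel, q_values, n=76):
    """Current scheme: one work item per q, integration loop inside it."""
    thetas = midpoint_rule(n, 0.0, math.pi)
    phis = midpoint_rule(n, 0.0, 2.0 * math.pi)
    result = []
    for q in q_values:
        total = 0.0
        for theta, w_theta in thetas:
            for phi, w_phi in phis:
                weight = w_theta * w_phi * math.sin(theta) / (4.0 * math.pi)
                total += weight * kernel(q, theta, phi)
        result.append(total)
    return result

def average_flat(kernel, q_values, n=76):
    """Proposed scheme: one work item per (q, theta, phi), then a reduction."""
    thetas = midpoint_rule(n, 0.0, math.pi)
    phis = midpoint_rule(n, 0.0, 2.0 * math.pi)
    items = [(iq, q, theta, phi,
              w_theta * w_phi * math.sin(theta) / (4.0 * math.pi))
             for iq, q in enumerate(q_values)
             for theta, w_theta in thetas
             for phi, w_phi in phis]
    # Map step: every item is independent, so each could be its own GPU thread.
    values = [(iq, w * kernel(q, theta, phi)) for iq, q, theta, phi, w in items]
    # Reduce step: sum the grid belonging to each q point.
    result = [0.0] * len(q_values)
    for iq, value in values:
        result[iq] += value
    return result
```

For 120 q points and n=76 the flat version exposes 76x76x120 independent items instead of 120, which is the extra parallelism the timings above rely on.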

If we define the q points for the 2D calculator in polar coordinates, then larger rings at higher |q| could use more points, giving a simple form of adaptive integration (ticket #392).
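One way the polar layout might look, as a sketch only. The names `ring_points` and `ring_average` and the parameters `points_per_unit_q` and `min_points` are hypothetical, and the rule "number of points proportional to |q|" is an assumption about what "more points on larger rings" would mean in practice.

```python
import math

def ring_points(q, points_per_unit_q=400.0, min_points=8):
    """(qx, qy) points on the ring |q| = q; larger rings get more points."""
    n = max(min_points, math.ceil(points_per_unit_q * q))
    return [(q * math.cos(2.0 * math.pi * (k + 0.5) / n),
             q * math.sin(2.0 * math.pi * (k + 0.5) / n))
            for k in range(n)]

def ring_average(kernel_2d, q, **kwargs):
    """Average a 2D pattern I(qx, qy) over one ring to get a 1D value."""
    points = ring_points(q, **kwargs)
    return sum(kernel_2d(qx, qy) for qx, qy in points) / len(points)
```

All rings could be concatenated into a single 2D evaluation, with the per-ring averages taken afterwards.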

As an alternative, we could use the same GPU in parallel from different fit processes. Testing with MPI and DREAM (4 fit pars, 32 evals per step), this can give close to a 4x speedup for population fitters (DREAM and DE):

processes   time
     1      13.6
     2      13.2
     4       6.8
     8       3.8
    16       4.3           
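The speedup relative to a single process can be read off the table with a few lines of arithmetic. The variable and function names here are illustrative only; the timings are copied from the table above.

```python
# processes -> time, copied from the table above
timings = {1: 13.6, 2: 13.2, 4: 6.8, 8: 3.8, 16: 4.3}

def speedups(times):
    """Speedup of each run relative to the single-process time."""
    base = times[1]
    return {n: base / t for n, t in times.items()}

best = min(timings, key=timings.get)  # process count with the shortest time
for n, s in speedups(timings).items():
    print(f"{n:2d} processes: {s:.2f}x")
```

The best case is 8 processes at 13.6/3.8, a little under 4x, and going to 16 processes is slightly slower again.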

Note that turning on resolution slows down the program by 3x because the points below q min and the points above q max are computed in separate batches, each of which takes the same time as the measured q set even though they may have only one or two points. This will be fixed with ticket #717.
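To illustrate why three batches cost three times as much when each call has a fixed cost, here is a sketch that merges the batches into one call and splits the result afterwards. This is only one possible approach and not necessarily what ticket #717 does; `CountingKernel` and `evaluate_merged` are hypothetical names.

```python
class CountingKernel:
    """Stand-in for a GPU model call; counts how often it is launched."""
    def __init__(self):
        self.calls = 0

    def __call__(self, q_values):
        self.calls += 1
        return [1.0 / (1.0 + q * q) for q in q_values]  # toy I(q)

def evaluate_separately(kernel, batches):
    """One launch per batch: the fixed per-call cost is paid every time."""
    return [kernel(batch) for batch in batches]

def evaluate_merged(kernel, batches):
    """Concatenate all batches, launch once, then split the result again."""
    merged = [q for batch in batches for q in batch]
    values = kernel(merged)
    out, start = [], 0
    for batch in batches:
        out.append(values[start:start + len(batch)])
        start += len(batch)
    return out
```

With batches for the points below q min, the measured q set, and the points above q max, the merged version makes one call instead of three and returns the same values.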

Change History (2)

comment:1 Changed 6 years ago by pkienzle

  • Milestone changed from SasView 4.2.0 to SasView 4.3.0
  • Type changed from defect to enhancement

comment:2 Changed 6 years ago by richardh

Moving this ticket to the beta approximation project. It is not really part of that project, but Paul K would like to keep it in mind whilst making all the other changes. See also #392.
