Opened 6 years ago
Last modified 6 years ago
#1230 new defect
parallelize polydispersity loops
Reported by: | pkienzle | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | SasView 4.3.0 |
Component: | SasView | Keywords: | |
Cc: | | Work Package: | McSAS Integration Project |
Description
There is unexploited parallelism in the polydispersity calculation. This limits the speedup available on high-end graphics cards when they operate on a 1-D dataset, since the current implementation uses only one processor per q value. A card with 5000 separate processors will be mostly idle.
This is particularly important for mcSAS, which needs to evaluate
I(q_j) = sum_{i=1}^m w_i P(q_j, r_i)
where P(q,r) is the sphere form factor, q has length n and the distribution (w, r) has length m, for a total of m x n evaluations. The current sasmodels code does this in parallel over q_j, with the sum over w_i, r_i running serially. Instead, we could break up the loop into non-intersecting stripes, first computing
I_k(q_j) = sum_{i=0}^{m/p-1} w_{k+p*i} P(q_j, r_{k+p*i}),  k = 1, ..., p
where p is the number of stripes. We can then keep n x p processors busy at the same time, at the cost of storing n x p intermediate results. With 256 q points, the 5120 processors on an NVIDIA V100 can compute 20 stripes in parallel, using minimal extra memory (256 x 20 x 8 bytes = 40 KiB). It may be faster to use p=16 so that memory accesses align better.
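As a rough illustration of the striping idea, here is a plain Python sketch (not sasmodels code) that computes the stripe partial sums and checks that adding them up reproduces the serial sum. The helper names (`sphere_form`, `serial_iq`, `stripe_partials`, `combine`) are made up for this example, the form factor is a simplified normalized sphere without volume or contrast scaling, and indices are 0-based, so stripe k takes elements k, k+p, k+2p, ... On a GPU each (k, j) entry of the partial table would be handled by its own work item; here they are just nested loops.

```python
import math

def sphere_form(q, r):
    """Simplified normalized sphere form factor (assumption: no volume/contrast terms)."""
    x = q * r
    if x < 1e-6:
        return 1.0  # limit of the expression below as q*r -> 0
    amp = 3.0 * (math.sin(x) - x * math.cos(x)) / x**3
    return amp * amp

def serial_iq(q, w, r):
    """Reference: one serial sum over the whole distribution per q value."""
    return [sum(wi * sphere_form(qj, ri) for wi, ri in zip(w, r)) for qj in q]

def stripe_partials(q, w, r, p):
    """partials[k][j] = sum over i of w[k + p*i] * P(q[j], r[k + p*i])."""
    return [
        [sum(wi * sphere_form(qj, ri) for wi, ri in zip(w[k::p], r[k::p]))
         for qj in q]
        for k in range(p)
    ]

def combine(partials):
    """Second pass: I(q_j) = sum over stripes k of I_k(q_j)."""
    return [sum(column) for column in zip(*partials)]

# Small made-up problem: n = 8 q points, m = 40 radii, p = 4 stripes.
q = [0.001 * (j + 1) for j in range(8)]
r = [10.0 + 5.0 * i for i in range(40)]
w = [1.0 / len(r)] * len(r)

direct = serial_iq(q, w, r)
partials = stripe_partials(q, w, r, p=4)
striped = combine(partials)
```

The two results agree only up to floating point rounding, since the striped version adds the terms in a different order.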
Next, turn the problem on its side and compute the following:
I(q_j) = sum_k I_k(q_j)
with one processor for each q value. We can perhaps do better by summing pairs in parallel, then pairs of pairs, requiring four cycles for p=16 rather than 16, though the overhead of managing this may outweigh any benefit. Looking at the graphs on page 5 of the following:
https://www.cl.cam.ac.uk/teaching/1617/AdvGraph/07_OpenCL.pdf
I'm guessing that a 4k-element reduction is too small to warrant a fast algorithm.
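For reference, the pairs-then-pairs-of-pairs reduction could look something like the following Python sketch. `tree_reduce` is a hypothetical helper written for this ticket, not an existing sasmodels or OpenCL function, and it only mimics the order of operations; the additions within each pass are independent, which is what a GPU kernel would exploit.

```python
def tree_reduce(partials):
    """Sum p rows of length n pairwise; return (totals, number of passes)."""
    rows = [list(row) for row in partials]
    passes = 0
    while len(rows) > 1:
        half = (len(rows) + 1) // 2
        next_rows = []
        for k in range(half):
            if k + half < len(rows):
                # Row k absorbs row k + half; every k in this pass is independent.
                next_rows.append([a + b for a, b in zip(rows[k], rows[k + half])])
            else:
                # Odd row count: the middle row is carried over unchanged.
                next_rows.append(rows[k])
        rows = next_rows
        passes += 1
    return rows[0], passes

# Made-up partial sums: p = 16 stripes, n = 4 q points.
p, n = 16, 4
partials = [[float((k + 1) * (j + 1)) for j in range(n)] for k in range(p)]
totals, passes = tree_reduce(partials)
```

With p=16 this takes four passes instead of a 16-term serial sum per q value. Whether that saving beats the synchronization cost between passes is exactly the open question above.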
The existing kernel_iq.c would benefit from this, at least for the inner polydispersity loop, if you are willing to tackle it. Generating a specialized kernel for the particular problem of a distribution of spheres in mcSAS will probably be easier.
See also ticket #1172.