Opened 3 months ago

Last modified 3 months ago

## #1230 new defect

# parallelize polydispersity loops

Reported by: | pkienzle | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | SasView 4.3.0 |
Component: | SasView | Keywords: | |
Cc: | | Work Package: | McSAS Integration Project |

### Description

There is unexploited parallelism in the polydispersity calculation. This limits the speedup available on high-end graphics cards operating on 1-D datasets, since the current implementation uses only one processor per q value. A card with 5000 separate processors will be mostly idle.

This is particularly important for mcSAS, which needs to evaluate

I(q_j) = sum_{i=1}^m w_i P(q_j, r_i)

where P(q,r) is the sphere form factor, q is length n and the distribution w,r is length m, for a total of m x n evaluations. The current sasmodels code does this in parallel over q_j, with the sum over w_i, r_i running in serial. Instead, we could break up the loop into non-intersecting stripes, first computing

I_k(q_j) = sum_{i=0}^{m/p-1} w_{k+p*i} P(q_j, r_{k+p*i})

for k = 1, ..., p, where p is the number of stripes (assuming p divides m). We can then keep n x p processors busy at the same time, at the cost of n x p intermediate results. With 256 q points, the 5120 processors on an NVIDIA V100 can compute 20 stripes in parallel, using minimal extra memory (256 x 20 x 8 bytes = 40 KiB). It may be faster to use p=16 so that memory accesses align better.
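As a rough illustration of the striping idea, here is a minimal pure-Python sketch. It is not the sasmodels kernel. The names `sphere_form`, `iq_serial` and `iq_partials` are hypothetical, and the form factor is assumed to be the normalized sphere form factor with volume and contrast scaling left out. Indices are 0-based, so stripe k covers i = k, k+p, k+2p, ...

```python
import math

def sphere_form(q, r):
    """Normalized sphere form factor (assumption: no volume/contrast scaling)."""
    x = q * r
    if abs(x) < 1e-3:
        # Series limit of 3*(sin x - x cos x)/x^3 avoids cancellation near 0.
        amplitude = 1.0 - x * x / 10.0
    else:
        amplitude = 3.0 * (math.sin(x) - x * math.cos(x)) / x**3
    return amplitude * amplitude

def iq_serial(q, w, r):
    """I(q_j) = sum_i w_i P(q_j, r_i), with the sum over i running in serial."""
    return [sum(wi * sphere_form(qj, ri) for wi, ri in zip(w, r)) for qj in q]

def iq_partials(q, w, r, p):
    """partials[k][j] = sum over stripe k of w_i P(q_j, r_i).

    Every (k, j) entry is independent of the others, so on a GPU each
    entry could be computed by its own processor.
    """
    return [
        [sum(wi * sphere_form(qj, ri) for wi, ri in zip(w[k::p], r[k::p]))
         for qj in q]
        for k in range(p)
    ]

# Occupancy arithmetic for the numbers quoted in the description.
n_q = 256                    # number of q points
n_proc = 5120                # processors on the card
p_max = n_proc // n_q        # stripes that fit at once
extra_bytes = n_q * p_max * 8  # double-precision partial sums
```

Because the stripes are taken with slicing (`w[k::p]`), this sketch also works when p does not divide m; the last stripes are simply one element shorter. Summing `partials[k][j]` over k recovers the same I(q_j) as `iq_serial`, up to floating-point reordering.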

Next, turn the problem on its side and compute the following:

I(q_j) = sum_k I_k(q_j)

with one processor for each q value. We can perhaps do better by computing pairs in parallel, then pairs of pairs, requiring four cycles for p=16 rather than 16, though the overhead of managing this may outweigh any benefit. Looking at the graphs on page 5 of the following:

https://www.cl.cam.ac.uk/teaching/1617/AdvGraph/07_OpenCL.pdf

I'm guessing that a reduction over 4k values is too small to warrant a fast algorithm.
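The two ways of combining the partial sums can be sketched as follows. This is only an illustration in plain Python, with hypothetical names `reduce_serial` and `reduce_pairwise`. It counts reduction steps rather than measuring GPU time, and the pairwise version assumes p is a power of two.

```python
def reduce_serial(partials):
    """I(q_j) = sum_k I_k(q_j), one plain loop over k per q value."""
    return [sum(column) for column in zip(*partials)]

def reduce_pairwise(partials):
    """Add pairs of stripes, then pairs of pairs, until one row remains.

    Returns (totals, steps). Assumes len(partials) is a power of two.
    """
    rows = [list(row) for row in partials]
    steps = 0
    while len(rows) > 1:
        half = len(rows) // 2
        # Additions within a single step do not depend on each other,
        # so they could all run at the same time.
        rows = [
            [a + b for a, b in zip(rows[k], rows[k + half])]
            for k in range(half)
        ]
        steps += 1
    return rows[0], steps
```

With p=16 stripes, `reduce_pairwise` finishes in four steps, matching the count given above. Whether that beats the simple loop on real hardware depends on synchronization overhead, which this sketch does not model.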

The existing kernel_iq.c would benefit from this, at least for the inner polydispersity loop, if you are willing to tackle it. Generating a specialized kernel for the particular problem of a distribution of spheres in mcSAS will probably be easier.


See also ticket #1172.