diff --git a/final/libomptarget/deviceRTLs/nvptx/docs/ReductionDesign.txt b/final/libomptarget/deviceRTLs/nvptx/docs/ReductionDesign.txt
new file mode 100644
index 0000000..989a01f
--- /dev/null
+++ b/final/libomptarget/deviceRTLs/nvptx/docs/ReductionDesign.txt
@@ -0,0 +1,523 @@
+
+**Design document for OpenMP reductions on the GPU**
+
+//Abstract: //In this document we summarize the new design for an OpenMP
+implementation of reductions on NVIDIA GPUs. This document comprises
+* a succinct background review,
+* an introduction to the decoupling of reduction algorithm and
+ data-structure-specific processing routines,
+* detailed illustrations of the reduction algorithms used, and
+* a brief overview of the steps we have taken beyond the previous implementation.
+
+**Problem Review**
+
+Consider a typical OpenMP program with a reduction pragma:
+
+```
+ double foo, bar;
+ #pragma omp parallel for reduction(+:foo, bar)
+ for (int i = 0; i < N; i++) {
+ foo+=A[i]; bar+=B[i];
+ }
+```
+where 'foo' and 'bar' are reduced across all threads in the parallel region.
+Our primary goal is to efficiently aggregate the values of foo and bar in
+a manner that
+* keeps the compiler logically concise, and
+* reduces efficiently across threads within a warp, across warps within a
+  block, and across blocks on the device.
+
+**Introduction to Decoupling**
+
+In this section we address the problem of making the compiler
+//logically concise// by partitioning the task of reduction into two broad
+categories: data-structure specific routines and algorithmic routines.
+
+The previous reduction implementation was tightly coupled to the specifics of
+the reduction element data structures (e.g., sizes, data types) and the
+reduction operators (e.g., addition, multiplication). In our implementation
+we strive to decouple them; in the final implementation we were able to
+remove all template functions from the runtime system.
+
+The (simplified) pseudo code generated by LLVM is as follows:
+
+```
+ 1. Create private copies of variables: foo_p, bar_p
+ 2. Each thread reduces the chunk of A and B assigned to it and writes
+ to foo_p and bar_p respectively.
+ 3. ret = kmpc_nvptx_reduce_nowait(..., reduceData, shuffleReduceFn,
+ interWarpCpyFn)
+ where:
+ struct ReduceData {
+ double *foo;
+ double *bar;
+ } reduceData
+ reduceData.foo = &foo_p
+ reduceData.bar = &bar_p
+
+   shuffleReduceFn and interWarpCpyFn are two auxiliary functions
+   generated to aid the runtime in performing the algorithmic steps
+   while remaining agnostic about the layout of ReduceData.
+
+ In particular, shuffleReduceFn is a function that takes the following
+ inputs:
+ a. local copy of ReduceData
+ b. its lane_id
+ c. the offset of the lane_id which hosts a remote ReduceData
+ relative to the current one
+     d. an algorithm version parameter determining which reduction
+        algorithm to use.
+     This shuffleReduceFn retrieves the remote ReduceData through shuffle
+     intrinsics and, using the algorithm specified by the 4th parameter,
+     reduces the local ReduceData with the remote ReduceData element-wise,
+     placing the resulting values into the local ReduceData.
+
+ Different reduction algorithms are implemented with different runtime
+ functions, but they all make calls to this same shuffleReduceFn to
+ perform the essential reduction step. Therefore, based on the 4th
+ parameter, this shuffleReduceFn will behave slightly differently to
+ cooperate with the runtime function to ensure correctness under
+ different circumstances.
+
+     InterWarpCpyFn, as the name suggests, is a function that copies data
+     across warps. Its purpose is to tunnel the thread-private ReduceData
+     that has already been reduced within each warp to a lane in the first
+     warp, with a minimal shared memory footprint. This is an essential step
+     in preparing for the final stage of a block reduction.
+
+ (Warp, block, device level reduction routines that utilize these
+ auxiliary functions will be discussed in the next section.)
+
+ 4. if ret == 1:
+ The master thread stores the reduced result in the globals.
+ foo += reduceData.foo; bar += reduceData.bar
+```
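+
+To make the decoupled interface concrete, the following is a minimal sketch
+(our illustration, not the actual compiler output) of what a shuffleReduceFn
+for the two-double ReduceData above could look like. The names
+exampleShuffleReduceFn and shuffle_down_double are assumptions; the latter is
+sketched under "Full Warp Reduction" below, and the real runtime may use
+different parameter types.
+
+```
+// Illustrative sketch only: a hand-written analogue of the compiler-generated
+// shuffleReduceFn for the example's ReduceData.
+struct ReduceData { double *foo; double *bar; };
+
+// Assumed helper that shuffles a 64-bit double down the warp (sketched later).
+__device__ double shuffle_down_double(double val, int offset);
+
+__device__ void exampleShuffleReduceFn(ReduceData *data, int lane_id,
+                                       int offset, int algo_version) {
+  // Retrieve the remote ReduceData element-wise through shuffles.
+  double remote_foo = shuffle_down_double(*data->foo, offset);
+  double remote_bar = shuffle_down_double(*data->bar, offset);
+  // Combine or forward according to the algorithm version; these conditions
+  // mirror the three per-element behaviors described in the next section.
+  if (algo_version == 0 ||                                     // full warp
+      (algo_version == 1 && lane_id < offset) ||               // contiguous partial warp
+      (algo_version == 2 && lane_id % 2 == 0 && offset > 0)) { // dispersed partial warp
+    *data->foo += remote_foo;  // '@' is '+' in this example
+    *data->bar += remote_bar;
+  } else {
+    *data->foo = remote_foo;
+    *data->bar = remote_bar;
+  }
+}
+```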
+
+**Reduction Algorithms**
+
+On the warp level, we have three versions of the reduction algorithm:
+
+1. Full Warp Reduction
+
+```
+gpu_regular_warp_reduce(void *reduce_data,
+ kmp_ShuffleReductFctPtr ShuffleReduceFn) {
+ for (int offset = WARPSIZE/2; offset > 0; offset /= 2)
+ ShuffleReduceFn(reduce_data, 0, offset, 0);
+}
+```
+ShuffleReduceFn is called here with lane_id set to 0 because the lane_id is not
+used by this version; we therefore save instructions by not retrieving it from
+the corresponding special register. The 4th parameter, which represents the
+version of the algorithm being used, is set to 0 to signify full warp reduction.
+
+For the version specified here (=0), ShuffleReduceFn behaves, per element, as
+follows:
+
+```
+//reduce_elem refers to an element in the local ReduceData
+//remote_elem is retrieved from a remote lane
+remote_elem = shuffle_down(reduce_elem, offset, 32);
+reduce_elem = reduce_elem @ remote_elem;
+
+```
+
+An illustration of this algorithm operating on a hypothetical 8-lane full-warp
+would be:
+{F74}
+The coloring invariant is that elements with the same color will be combined
+and reduced in the next reduction step. As can be observed, there is no
+overhead and exactly log2(N) steps are needed.
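+
+The per-element shuffle above moves 64-bit doubles, while the sm_30 shuffle
+intrinsics operate on 32-bit registers. A common way to bridge this (a sketch
+under that assumption; shuffle_down_double is our own name, not a runtime
+symbol) is to move the two halves separately:
+
+```
+// Sketch: shuffle a 64-bit double down the warp with the 32-bit, pre-Volta
+// __shfl_down intrinsic by splitting it into two halves.
+__device__ double shuffle_down_double(double val, int offset) {
+  int lo, hi;
+  // Split the double into two 32-bit halves.
+  asm volatile("mov.b64 {%0,%1}, %2;" : "=r"(lo), "=r"(hi) : "d"(val));
+  lo = __shfl_down(lo, offset, 32);
+  hi = __shfl_down(hi, offset, 32);
+  // Reassemble the shuffled halves.
+  double ret;
+  asm volatile("mov.b64 %0, {%1,%2};" : "=d"(ret) : "r"(lo), "r"(hi));
+  return ret;
+}
+```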
+
+2. Contiguous Partial Warp Reduction
+```
+gpu_irregular_warp_reduce(void *reduce_data,
+ kmp_ShuffleReductFctPtr ShuffleReduceFn, int size,
+ int lane_id) {
+ int curr_size;
+ int offset;
+ curr_size = size;
+  offset = curr_size/2;
+  while (offset>0) {
+ ShuffleReduceFn(reduce_data, lane_id, offset, 1);
+ curr_size = (curr_size+1)/2;
+ offset = curr_size/2;
+ }
+}
+```
+
+For the version specified here (=1), ShuffleReduceFn behaves, per element, as
+follows:
+```
+//reduce_elem refers to an element in the local ReduceData
+//remote_elem is retrieved from a remote lane
+remote_elem = shuffle_down(reduce_elem, offset, 32);
+if (lane_id < offset) {
+ reduce_elem = reduce_elem @ remote_elem
+} else {
+ reduce_elem = remote_elem
+}
+```
+
+An important invariant (also a restriction on the starting state of the
+reduction) is that this algorithm assumes that all unused ReduceData are
+located in a contiguous subset of threads in a warp starting from lane 0.
+
+When the number of participating lanes in a reduction step is odd, the value
+of the trailing active lane is not aggregated with that of any other lane in
+this step. Therefore, in order to preserve the invariant, its ReduceData is
+copied to the first lane whose thread-local ReduceData has already been used
+in the reduction and would therefore be useless otherwise. For example, with
+five active lanes and offset 2, lanes 0 and 1 combine with lanes 2 and 3,
+while lane 4 cannot be paired; its ReduceData is therefore copied into lane 2,
+whose own value has already been read by lane 0.
+
+An illustration of this algorithm operating on a hypothetical 8-lane partial
+warp would be:
+{F75}
+
+As illustrated, this version of the algorithm introduces the overhead of
+copying data between lanes whenever a reduction step has an odd number of
+participating lanes.
+
+3. Dispersed Partial Warp Reduction
+```
+gpu_irregular_simt_reduce(void *reduce_data,
+ kmp_ShuffleReductFctPtr ShuffleReduceFn) {
+ int size, remote_id;
+ int logical_lane_id = find_number_of_dispersed_active_lanes_before_me() * 2;
+ do {
+ remote_id = find_the_next_active_lane_id_right_after_me();
+    // the above function returns 0 if no active lane
+    // is present right after the current thread.
+ size = get_number_of_active_lanes_in_this_warp();
+ logical_lane_id /= 2;
+ ShuffleReduceFn(reduce_data, logical_lane_id, remote_id-1-threadIdx.x, 2);
+ } while (logical_lane_id % 2 == 0 && size > 1);
+```
+
+There is no assumption made about the initial state of the reduction.
+Any number of lanes (>=1) could be active at any position. The reduction
+result is kept in the first active lane.
+
+For the version specified here (=2), ShuffleReduceFn behaves, per element, as
+follows:
+```
+//reduce_elem refers to an element in the local ReduceData
+//remote_elem is retrieved from a remote lane
+remote_elem = shuffle_down(reduce_elem, offset, 32);
+if (lane_id % 2 == 0 && offset > 0) {
+ reduce_elem = reduce_elem @ remote_elem
+} else {
+ reduce_elem = remote_elem
+}
+```
+We now briefly explain some of the arguments passed in. Note that this
+section introduces the concept of a logical_lane_id, which must be
+distinguished from the physical lane_id defined by NVIDIA.
+1. //logical_lane_id//: as the name suggests, it refers to a calculated
+   lane id (instead of the physical one defined by NVIDIA) that keeps
+   our algorithm logically concise. A thread with logical_lane_id k has k
+   active lanes before it.
+2. //remote_id-1-threadIdx.x//: remote_id is the NVIDIA-defined lane id of
+   the remote lane from which we will retrieve the ReduceData. We subtract
+   (threadIdx.x+1) from it because we would like to maintain only one
+   underlying shuffle intrinsic (which is used to communicate among lanes in
+   a warp). The particular shuffle intrinsic we use accepts only offsets
+   rather than absolute lane ids, so the subtraction converts the lane id we
+   calculated into an offset (the helper queries used above are sketched
+   right after this list).
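+
+As a concrete illustration of those helper queries, here is a sketch of how
+they could be implemented with the warp vote and bit intrinsics (an
+assumption, not the runtime's actual code; threadIdx.x % 32 is taken as the
+physical lane id):
+
+```
+// Sketch of the helper queries used by gpu_irregular_simt_reduce.
+__device__ int find_number_of_dispersed_active_lanes_before_me() {
+  unsigned active = __ballot(1);                          // mask of active lanes
+  unsigned lower  = active & ((1u << (threadIdx.x % 32)) - 1);
+  return __popc(lower);                                   // active lanes below mine
+}
+
+__device__ int find_the_next_active_lane_id_right_after_me() {
+  unsigned active = __ballot(1);
+  unsigned higher = active & ~((2u << (threadIdx.x % 32)) - 1); // lanes above mine
+  // __ffs returns 0 if no bit is set and the 1-based bit position otherwise,
+  // which is why the caller computes the offset as remote_id-1-threadIdx.x.
+  return __ffs(higher);
+}
+
+__device__ int get_number_of_active_lanes_in_this_warp() {
+  return __popc(__ballot(1));
+}
+```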
+
+This algorithm differs from the previous ones in two ways and is not,
+conceptually, a generalization of them.
+1. It reduces elements that are close to each other. For instance, the values
+   in the 0th lane are combined with those of the 1st lane, and the values in
+   the 2nd lane with those of the 3rd lane. We did not use the previous
+   approach, where the first half of the (partial) warp is reduced with the
+   second half, because the mapping
+   f(x): logical_lane_id -> physical_lane_id
+   can be easily calculated whereas its inverse
+   f^-1(x): physical_lane_id -> logical_lane_id
+   cannot, and performing such a reduction requires the inverse to be known.
+2. Because this algorithm is agnostic about the positions of the active lanes,
+   we do not need to perform the copying step used in the second algorithm.
+An illustrative run would look like
+{F76}
+As observed, the overhead is high because both logical_lane_id and remote_id
+are recalculated in every reduction step.
+
+On a block level, we have implemented the following block reduce algorithm:
+
+```
+gpu_irregular_block_reduce(void *reduce_data,
+ kmp_ShuffleReductFctPtr shuflReduceFn,
+ kmp_InterWarpCopyFctPtr interWarpCpyFn,
+ int size) {
+
+ int wid = threadIdx.x/WARPSIZE;
+ int lane_id = threadIdx.x%WARPSIZE;
+
+ int warp_needed = (size+WARPSIZE-1)/WARPSIZE; //ceiling of division
+
+ unsigned tnum = __ballot(1);
+ int thread_num = __popc(tnum);
+
+ //full warp reduction
+ if (thread_num == WARPSIZE) {
+ gpu_regular_warp_reduce(reduce_data, shuflReduceFn);
+ }
+ //partial warp reduction
+ if (thread_num < WARPSIZE) {
+ gpu_irregular_warp_reduce(reduce_data, shuflReduceFn, thread_num,
+ lane_id);
+ }
+  //Gather all the reduced values from each warp
+  //to the first warp.
+  //A named_barrier inside this function ensures
+  //correctness; it is effectively a __syncthreads()
+  //that won't deadlock.
+ interWarpCpyFn(reduce_data, warp_needed);
+
+ //This is to reduce data gathered from each "warp master".
+ if (wid==0) {
+ gpu_irregular_warp_reduce(reduce_data, shuflReduceFn, warp_needed,
+ lane_id);
+ }
+
+ return;
+}
+```
+In this function, ShuffleReduceFn is not called directly; instead, the
+function calls the various warp-reduction routines shown above. It first
+reduces ReduceData warp by warp; at that point the number of remaining
+ReduceData equals the number of warps present in the thread block. We then
+proceed to gather all such ReduceData into the first warp.
+
+As observed, in this algorithm we make use of the function InterWarpCpyFn,
+which copies data from each "warp master" (the 0th lane of each warp, where
+the warp-reduced ReduceData is held) to the 0th warp. This step reduces (in
+a mathematical sense) the problem of reduction across warp masters in a block
+to the problem of warp reduction, for which we already have solutions.
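+
+A minimal sketch of what an interWarpCpyFn could look like for the two-double
+ReduceData follows (our illustration, not the compiler's actual output;
+MAX_WARPS, the barrier id, and the thread counts passed to named_barrier are
+assumptions):
+
+```
+// Sketch: stage each warp master's value in shared memory, then let the
+// first warp pick the staged values up, one per lane.
+struct ReduceData { double *foo; double *bar; };  // as in the earlier sketch
+#define MAX_WARPS 32                              // assumed: 1024 threads / 32 lanes
+
+// Runtime-provided named barrier (see the Appendix); effectively a
+// __syncthreads() over the participating threads that cannot deadlock.
+__device__ void named_barrier(int barrier_id, int num_threads);
+
+__device__ void exampleInterWarpCpyFn(ReduceData *data, int num_warps) {
+  __shared__ double scratch[MAX_WARPS];
+  int wid = threadIdx.x / 32;
+  int lane_id = threadIdx.x % 32;
+
+  // Stage 'foo' from each warp master, then let warp 0 collect one value
+  // per lane.
+  if (lane_id == 0)
+    scratch[wid] = *data->foo;
+  named_barrier(1, num_warps * 32);
+  if (wid == 0 && lane_id < num_warps)
+    *data->foo = scratch[lane_id];
+  named_barrier(1, num_warps * 32);
+
+  // Repeat for 'bar'.
+  if (lane_id == 0)
+    scratch[wid] = *data->bar;
+  named_barrier(1, num_warps * 32);
+  if (wid == 0 && lane_id < num_warps)
+    *data->bar = scratch[lane_id];
+}
+```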
+
+We can thus completely avoid the use of atomics to reduce in a threadblock.
+
+**Efficient Cross Block Reduce**
+
+The next challenge is to reduce values across threadblocks. We aim to do this
+without atomics or critical sections.
+
+Let a kernel be started with TB threadblocks.
+Let the GPU have S SMs.
+There can be at most N active threadblocks per SM at any time.
+
+Consider a threadblock tb (tb < TB) running on SM s (s < S). 'tb' is one of
+at most 'N' active threadblocks on SM s. Let each threadblock active on an SM
+be given an instance identifier id (0 <= id < N). Therefore, the tuple (s, id)
+uniquely identifies an active threadblock on the GPU.
+
+To efficiently implement the cross block reduce, we first allocate, for each
+value to be reduced, an array of size S*N (the maximum number of threadblocks
+that can be active on the device at any time).
+
+Each threadblock reduces its value to slot [s][id]. This can be done without
+locking since no other threadblock can write to the same slot concurrently.
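+
+In code, the publishing step amounts to a single indexed store. The sketch
+below assumes 's' and 'id' have already been obtained (e.g., from the %smid
+special register plus a per-SM instance counter, which is left to the
+runtime):
+
+```
+// Sketch: the block's master thread publishes its block-reduced value into
+// slot [s][id] of the preallocated S*N array; no locking is needed because
+// no other resident block maps to the same slot.
+__device__ void publish_block_result(double *reduction_array, int s, int id,
+                                     int N, double block_result) {
+  reduction_array[s * N + id] = block_result;
+}
+```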
+
+As a final stage, we reduce the values in the array as follows:
+
+```
+// Compiler generated wrapper function for each target region with a
+// reduction clause.
+target_function_wrapper(map_args, reduction_array) <--- start with 1 team
+                                                        and 1 thread.
+  // Use dynamic parallelism to launch M teams, N threads as requested by
+  // the user to execute the target region.
+
+  target_function<<<M, N>>>(map_args)
+
+ Reduce values in reduction_array
+
+```
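+
+The final "Reduce values in reduction_array" step can then be a simple fold
+over the S*N slots performed by the single wrapper thread (a sketch; it
+assumes unused slots were initialized to the operator's identity, 0 for '+'):
+
+```
+// Sketch: fold the S*N partial results for one reduction variable.
+__device__ double reduce_slots(const double *reduction_array, int S, int N) {
+  double result = 0.0;              // identity of '+', the example's operator
+  for (int i = 0; i < S * N; i++)
+    result += reduction_array[i];
+  return result;
+}
+```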
+
+**Comparison with the Previous Version**
+
+
+The (simplified) pseudo code generated by LLVM on the host is as follows:
+
+
+```
+ 1. Create private copies of variables: foo_p, bar_p
+ 2. Each thread reduces the chunk of A and B assigned to it and writes
+ to foo_p and bar_p respectively.
+ 3. ret = kmpc_reduce_nowait(..., reduceData, reduceFn, lock)
+ where:
+ struct ReduceData {
+ double *foo;
+ double *bar;
+ } reduceData
+ reduceData.foo = &foo_p
+ reduceData.bar = &bar_p
+
+ reduceFn is a pointer to a function that takes in two inputs
+ of type ReduceData, "reduces" them element wise, and places the
+ result in the first input:
+ reduceFn(ReduceData *a, ReduceData *b)
+ a = a @ b
+
+ Every thread in the parallel region calls kmpc_reduce_nowait with
+ its private copy of reduceData. The runtime reduces across the
+     threads (using tree reduction on the operator 'reduceFn') and stores
+ the final result in the master thread if successful.
+ 4. if ret == 1:
+ The master thread stores the reduced result in the globals.
+ foo += reduceData.foo; bar += reduceData.bar
+ 5. else if ret == 2:
+ In this case kmpc_reduce_nowait() could not use tree reduction,
+ so use atomics instead:
+ each thread atomically writes to foo
+ each thread atomically writes to bar
+```
+
+On a GPU, a similar reduction may need to be performed across SIMT threads,
+warps, and threadblocks. The challenge is to do so efficiently in a fashion
+that is compatible with the LLVM OpenMP implementation.
+
+In the previously released 0.1 version of the LLVM OpenMP compiler for GPUs,
+the salient steps of the code generated are as follows:
+
+
+```
+ 1. Create private copies of variables: foo_p, bar_p
+ 2. Each thread reduces the chunk of A and B assigned to it and writes
+ to foo_p and bar_p respectively.
+ 3. ret = kmpc_reduce_nowait(..., reduceData, reduceFn, lock)
+ status = can_block_reduce()
+ if status == 1:
+ reduce efficiently to thread 0 using shuffles and shared memory.
+ return 1
+ else
+ cannot use efficient block reduction, fallback to atomics
+ return 2
+ 4. if ret == 1:
+ The master thread stores the reduced result in the globals.
+ foo += reduceData.foo; bar += reduceData.bar
+ 5. else if ret == 2:
+ In this case kmpc_reduce_nowait() could not use tree reduction,
+ so use atomics instead:
+ each thread atomically writes to foo
+ each thread atomically writes to bar
+```
+
+The function can_block_reduce() is defined as follows:
+
+
+```
+int32_t can_block_reduce() {
+ int tid = GetThreadIdInTeam();
+ int nt = GetNumberOfOmpThreads(tid);
+ if (nt != blockDim.x)
+ return 0;
+ unsigned tnum = __ballot(1);
+ if (tnum != (~0x0)) {
+ return 0;
+ }
+ return 1;
+}
+```
+
+This function permits the use of the efficient block reduction algorithm
+using shuffles and shared memory (return 1) only if (a) all SIMT threads in
+a warp are active (i.e., the number of threads in the parallel region is a
+multiple of 32) and (b) the number of threads in the parallel region
+(set by the num_threads clause) equals blockDim.x.
+
+If either of these preconditions is not true, each thread in the threadblock
+updates the global value using atomics.
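+
+As an illustration (ours, not from the original text), a parallel region like
+the following would fail precondition (a) and take the atomic fallback under
+the 0.1 scheme:
+
+```
+double foo = 0.0;
+// 50 is not a multiple of 32 (and need not equal blockDim.x), so
+// can_block_reduce() returns 0 and every thread updates foo atomically.
+#pragma omp parallel for num_threads(50) reduction(+:foo)
+for (int i = 0; i < N; i++)
+  foo += A[i];
+```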
+
+Atomics and compare-and-swap operations are expensive on many-threaded
+architectures such as GPUs, and we must avoid them completely.
+
+
+**Appendix: Implementation Details**
+
+
+```
+// Compiler generated function.
+reduceFn(ReduceData *a, ReduceData *b)
+ a->foo = a->foo + b->foo
+ a->bar = a->bar + b->bar
+
+// Compiler generated function.
+swapAndReduceFn(ReduceData *thread_private, int lane)
+ ReduceData *remote = new ReduceData()
+ remote->foo = shuffle_double(thread_private->foo, lane)
+ remote->bar = shuffle_double(thread_private->bar, lane)
+ reduceFn(thread_private, remote)
+
+// OMP runtime function.
+warpReduce_regular(ReduceData *thread_private, Fn *swapAndReduceFn):
+ offset = 16
+ while (offset > 0)
+ swapAndReduceFn(thread_private, offset)
+ offset /= 2
+
+// OMP runtime function.
+warpReduce_irregular():
+ ...
+
+// OMP runtime function.
+kmpc_reduce_warp(reduceData, swapAndReduceFn)
+ if all_lanes_active:
+ warpReduce_regular(reduceData, swapAndReduceFn)
+ else:
+ warpReduce_irregular(reduceData, swapAndReduceFn)
+ if in_simd_region:
+ // all done, reduce to global in simd lane 0
+ return 1
+ else if in_parallel_region:
+ // done reducing to one value per warp, now reduce across warps
+ return 3
+
+// OMP runtime function; one for each basic type.
+kmpc_reduce_block_double(double *a)
+ if lane == 0:
+ shared[wid] = *a
+ named_barrier(1, num_threads)
+ if wid == 0
+ block_reduce(shared)
+ if lane == 0
+ *a = shared[0]
+ named_barrier(1, num_threads)
+ if wid == 0 and lane == 0
+ return 1 // write back reduced result
+ else
+ return 0 // don't do anything
+
+```
+
+
+
+```
+// Compiler generated code.
+ 1. Create private copies of variables: foo_p, bar_p
+ 2. Each thread reduces the chunk of A and B assigned to it and writes
+ to foo_p and bar_p respectively.
+ 3. ret = kmpc_reduce_warp(reduceData, swapAndReduceFn)
+ 4. if ret == 1:
+ The master thread stores the reduced result in the globals.
+ foo += reduceData.foo; bar += reduceData.bar
+ 5. else if ret == 3:
+    ret = kmpc_reduce_block_double(reduceData.foo)
+ if ret == 1:
+ foo += reduceData.foo
+    ret = kmpc_reduce_block_double(reduceData.bar)
+ if ret == 1:
+ bar += reduceData.bar
+```
+
+**Notes**
+
+ 1. This scheme requires that the CUDA OMP runtime can call LLVM-generated
+    functions. This functionality now works.
+ 2. If the user inlines the CUDA OMP runtime bitcode, all of the machinery
+    (including calls through function pointers) is optimized away.
+ 3. If we are reducing multiple variables in a parallel region, the reduce
+    operations are all performed in warpReduce_[ir]regular(). This results
+    in more instructions in the loop and should result in fewer stalls due
+    to data dependencies. Unfortunately we cannot do the same in
+    kmpc_reduce_block_double() without increasing shared memory usage.