Diffstat (limited to 'final/libomptarget/deviceRTLs/nvptx/docs/ReductionDesign.txt')
-rw-r--r-- | final/libomptarget/deviceRTLs/nvptx/docs/ReductionDesign.txt | 523 |
1 files changed, 523 insertions, 0 deletions
diff --git a/final/libomptarget/deviceRTLs/nvptx/docs/ReductionDesign.txt b/final/libomptarget/deviceRTLs/nvptx/docs/ReductionDesign.txt
new file mode 100644
index 0000000..989a01f
--- /dev/null
+++ b/final/libomptarget/deviceRTLs/nvptx/docs/ReductionDesign.txt
@@ -0,0 +1,523 @@

**Design document for OpenMP reductions on the GPU**

//Abstract: //In this document we summarize the new design for an OpenMP
implementation of reductions on NVIDIA GPUs. This document comprises
* a succinct background review,
* an introduction to the decoupling of the reduction algorithm from the
  data-structure-specific processing routines,
* detailed illustrations of the reduction algorithms used and
* a brief overview of the steps we have taken beyond the previous
  implementation.

**Problem Review**

Consider a typical OpenMP program with a reduction pragma.

```
    double foo, bar;
    #pragma omp parallel for reduction(+:foo, bar)
    for (int i = 0; i < N; i++) {
      foo += A[i]; bar += B[i];
    }
```
where 'foo' and 'bar' are reduced across all threads in the parallel region.
Our primary goal is to aggregate the values of foo and bar efficiently, in a
manner that
* keeps the compiler logically concise, and
* reduces efficiently within warps, threads, blocks and across the device.

**Introduction to Decoupling**
In this section we address the problem of making the compiler
//logically concise// by partitioning the task of reduction into two broad
categories: data-structure-specific routines and algorithmic routines.

The previous reduction implementation was highly coupled to the specifics of
the reduction element data structures (e.g., sizes, data types) and the
operators of the reduction (e.g., addition, multiplication). In our
implementation we strive to decouple them. As a result, we were able to
remove all template functions from our runtime system.

The (simplified) pseudo code generated by LLVM is as follows:

```
    1. Create private copies of variables: foo_p, bar_p
    2. Each thread reduces the chunk of A and B assigned to it and writes
       to foo_p and bar_p respectively.
    3. ret = kmpc_nvptx_reduce_nowait(..., reduceData, shuffleReduceFn,
               interWarpCpyFn)
        where:
        struct ReduceData {
          double *foo;
          double *bar;
        } reduceData
        reduceData.foo = &foo_p
        reduceData.bar = &bar_p

        shuffleReduceFn and interWarpCpyFn are two auxiliary functions
        generated by the compiler to help the runtime perform the
        algorithmic steps while remaining agnostic about the layout of
        ReduceData.

        In particular, shuffleReduceFn is a function that takes the
        following inputs:
        a. a local copy of ReduceData
        b. its lane_id
        c. the offset, relative to the current lane, of the lane that
           hosts a remote ReduceData
        d. an algorithm version parameter determining which reduction
           algorithm to use.
        shuffleReduceFn retrieves the remote ReduceData through shuffle
        intrinsics, reduces the local ReduceData with the remote ReduceData
        element-wise using the algorithm specified by the 4th parameter,
        and places the resulting values into the local ReduceData.

        Different reduction algorithms are implemented in different runtime
        functions, but they all call this same shuffleReduceFn to perform
        the essential reduction step. Based on the 4th parameter,
        shuffleReduceFn therefore behaves slightly differently so as to
        cooperate with the calling runtime function and ensure correctness
        under different circumstances.

        interWarpCpyFn, as the name suggests, is a function that copies
        data across warps. Its purpose is to tunnel all the thread-private
        ReduceData that has already been reduced within a warp to a lane in
        the first warp with a minimal shared memory footprint. This is an
        essential step in preparing for the final step of a block
        reduction.

        (Warp, block and device level reduction routines that utilize these
        auxiliary functions will be discussed in the next section.)

    4. if ret == 1:
        The master thread stores the reduced result in the globals.
        foo += reduceData.foo; bar += reduceData.bar
```
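
To make the decoupling concrete, here is a minimal sketch of what the
data-structure-specific pieces could look like for the foo/bar example above.
This is illustrative only, not actual compiler output: the per-version
branches anticipate the three algorithm versions described in the next
section, CUDA's __shfl_down_sync and __activemask are assumed to be
available, and reduce_elem stands for the '@' operator (here '+').

```
// Illustrative sketch only (not actual compiler output) of the generated,
// data-structure-specific routines for the foo/bar example.
struct ReduceData {
  double *foo;
  double *bar;
};

// The '@' operator of the reduction; '+' for this example.
__device__ static double reduce_elem(double local, double remote) {
  return local + remote;
}

// shuffleReduceFn(local ReduceData, lane_id, offset, algorithm version).
// The behavior of each algorithm version is spelled out in the next section.
__device__ void shuffleReduceFn(void *reduce_data, int lane_id, int offset,
                                int algo_version) {
  ReduceData *rd = static_cast<ReduceData *>(reduce_data);
  unsigned mask = __activemask();
  double remote_foo = __shfl_down_sync(mask, *rd->foo, offset);
  double remote_bar = __shfl_down_sync(mask, *rd->bar, offset);

  bool do_reduce;
  if (algo_version == 0)                 // full warp reduction
    do_reduce = true;
  else if (algo_version == 1)            // contiguous partial warp reduction
    do_reduce = (lane_id < offset);
  else                                   // dispersed partial warp reduction
    do_reduce = (lane_id % 2 == 0 && offset > 0);

  if (do_reduce) {
    *rd->foo = reduce_elem(*rd->foo, remote_foo);
    *rd->bar = reduce_elem(*rd->bar, remote_bar);
  } else {                               // copy-only step
    *rd->foo = remote_foo;
    *rd->bar = remote_bar;
  }
}
```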

**Reduction Algorithms**

On the warp level, we have three versions of the algorithm:

1. Full Warp Reduction

```
gpu_regular_warp_reduce(void *reduce_data,
                        kmp_ShuffleReductFctPtr ShuffleReduceFn) {
  for (int offset = WARPSIZE/2; offset > 0; offset /= 2)
    ShuffleReduceFn(reduce_data, 0, offset, 0);
}
```
ShuffleReduceFn is called here with lane_id set to 0 because lane_id is not
used by this algorithm version; we therefore save instructions by not
retrieving it from the corresponding special registers. The 4th parameter,
which represents the version of the algorithm being used, is set to 0 to
signify full warp reduction.

In this version (=0), ShuffleReduceFn behaves, per element, as follows:

```
//reduce_elem refers to an element in the local ReduceData
//remote_elem is retrieved from a remote lane
remote_elem = shuffle_down(reduce_elem, offset, 32);
reduce_elem = reduce_elem @ remote_elem;
```

An illustration of this algorithm operating on a hypothetical 8-lane full
warp would be:
{F74}
The coloring invariant is that elements with the same color will be combined
and reduced in the next reduction step. As can be observed, no overhead is
incurred and exactly log(2, N) steps are needed.

2. Contiguous Partial Warp Reduction
```
gpu_irregular_warp_reduce(void *reduce_data,
                          kmp_ShuffleReductFctPtr ShuffleReduceFn, int size,
                          int lane_id) {
  int curr_size;
  int offset;
  curr_size = size;
  offset = curr_size/2;
  while (offset > 0) {
    ShuffleReduceFn(reduce_data, lane_id, offset, 1);
    curr_size = (curr_size+1)/2;
    offset = curr_size/2;
  }
}
```

In this version (=1), ShuffleReduceFn behaves, per element, as follows:
```
//reduce_elem refers to an element in the local ReduceData
//remote_elem is retrieved from a remote lane
remote_elem = shuffle_down(reduce_elem, offset, 32);
if (lane_id < offset) {
  reduce_elem = reduce_elem @ remote_elem
} else {
  reduce_elem = remote_elem
}
```

An important invariant (and a restriction on the starting state of the
reduction) is that this algorithm assumes that all ReduceData still to be
reduced are located in a contiguous subset of lanes in a warp, starting from
lane 0.

When the number of participating lanes is odd, the value held by the trailing
active lane will not be aggregated with that of any other lane in this step.
Therefore, in order to preserve the invariant, such ReduceData is copied to
the first lane whose thread-local ReduceData has already been used in a
previous reduction step and would otherwise be useless.

An illustration of this algorithm operating on a hypothetical 8-lane partial
warp would be:
{F75}

As illustrated, this version of the algorithm introduces overhead whenever an
odd number of lanes participates in a reduction step, because data must be
copied between lanes.
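
Both warp-level algorithms above move an entire reduction element between
lanes with a shuffle_down primitive. For the double elements of our example,
such a primitive can be synthesized from 32-bit shuffles; the following is a
minimal sketch under that assumption (the helper name is illustrative, and
CUDA's __shfl_down_sync, __activemask and bit-reinterpretation intrinsics are
assumed):

```
// Moves a 64-bit element 'offset' lanes down the warp by shuffling its two
// 32-bit halves.  Illustrative helper; not a routine from the runtime.
__device__ static double shuffle_down_double(double elem, int offset) {
  long long bits = __double_as_longlong(elem);
  int lo = static_cast<int>(bits & 0xffffffffLL);
  int hi = static_cast<int>(bits >> 32);
  unsigned mask = __activemask();
  lo = __shfl_down_sync(mask, lo, offset);
  hi = __shfl_down_sync(mask, hi, offset);
  bits = (static_cast<long long>(hi) << 32) |
         (static_cast<long long>(lo) & 0xffffffffLL);
  return __longlong_as_double(bits);
}
```

On recent CUDA versions __shfl_down_sync can shuffle a double directly, in
which case the helper collapses to a single call.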

3. Dispersed Partial Warp Reduction
```
gpu_irregular_simt_reduce(void *reduce_data,
                          kmp_ShuffleReductFctPtr ShuffleReduceFn) {
  int size, remote_id;
  int logical_lane_id = find_number_of_dispersed_active_lanes_before_me() * 2;
  do {
    remote_id = find_the_next_active_lane_id_right_after_me();
    // the above function returns 0 if no active lane
    // is present right after the current thread.
    size = get_number_of_active_lanes_in_this_warp();
    logical_lane_id /= 2;
    ShuffleReduceFn(reduce_data, logical_lane_id, remote_id-1-threadIdx.x, 2);
  } while (logical_lane_id % 2 == 0 && size > 1);
}
```

No assumption is made about the initial state of the reduction.
Any number of lanes (>= 1) can be active at any position. The reduction
result is kept in the first active lane.

In this version (=2), ShuffleReduceFn behaves, per element, as follows:
```
//reduce_elem refers to an element in the local ReduceData
//remote_elem is retrieved from a remote lane
remote_elem = shuffle_down(reduce_elem, offset, 32);
if (LaneId % 2 == 0 && Offset > 0) {
  reduce_elem = reduce_elem @ remote_elem
} else {
  reduce_elem = remote_elem
}
```
A brief explanation of some of the arguments passed in follows. Note that
this section introduces the concept of a //logical_lane_id//, which must be
distinguished from the physical lane_id defined by NVIDIA.
1. //logical_lane_id//: as the name suggests, this is a computed lane id
   (rather than the physical one defined by NVIDIA) that makes our algorithm
   logically concise. A thread with logical_lane_id k means there are (k-1)
   threads before it.
2. //remote_id-1-threadIdx.x//: remote_id is the NVIDIA-defined physical lane
   id of the remote lane from which we retrieve the ReduceData. We subtract
   (threadIdx.x+1) from it because we want to maintain only one underlying
   shuffle intrinsic (which is used to communicate among the lanes of a
   warp). The particular shuffle intrinsic we use accepts only offsets, not
   absolute lane ids; the subtraction therefore converts the absolute lane id
   we computed into an offset.

This algorithm differs in two ways from the previous ones and is not,
conceptually, a generalization of them.
1. It reduces elements that are close to each other: for instance, the value
   in the 0th lane is combined with that of the 1st lane, and the value in
   the 2nd lane with that of the 3rd lane. We did not reuse the previous
   scheme, in which the first half of the (partial) warp is reduced with the
   second half, because the mapping
      f(x): logical_lane_id -> physical_lane_id
   can be computed easily, whereas its inverse
      f^-1(x): physical_lane_id -> logical_lane_id
   cannot, and performing such a reduction requires the inverse to be known.
2. Because this algorithm is agnostic about the positions of the active
   lanes, we do not need to perform the copying step used in the second
   algorithm.
An illustrative run would look like:
{F76}
As observed, the overhead is high because logical_lane_id and remote_id must
be recalculated in each and every reduction step.
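
The pseudo code above leaves its helper routines abstract. Purely as an
illustration (these are assumptions, not the runtime's actual helpers), the
quantities they compute can be derived from the warp's active-lane mask using
CUDA's ballot/population-count intrinsics:

```
// Illustrative only: possible implementations of the helper quantities used
// by gpu_irregular_simt_reduce, derived from the active-lane mask.
__device__ static unsigned my_lane_id() { return threadIdx.x % 32; }

// Number of dispersed active lanes with a smaller physical lane id.
__device__ static int dispersed_active_lanes_before_me() {
  unsigned active = __activemask();
  unsigned below_me = (1u << my_lane_id()) - 1u;   // lanes below this one
  return __popc(active & below_me);
}

// Physical lane id of the next active lane after this one, or 0 if there is
// none (the convention used in the pseudo code above).
__device__ static int next_active_lane_after_me() {
  unsigned active = __activemask();
  unsigned above_me = active & ~(((1u << my_lane_id()) << 1u) - 1u);
  return above_me ? (__ffs(above_me) - 1) : 0;
}

// Total number of active lanes in this warp.
__device__ static int active_lanes_in_this_warp() {
  return __popc(__activemask());
}
```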

On a block level, we have implemented the following block reduce algorithm:

```
gpu_irregular_block_reduce(void *reduce_data,
              kmp_ShuffleReductFctPtr shuflReduceFn,
              kmp_InterWarpCopyFctPtr interWarpCpyFn,
              int size) {

  int wid = threadIdx.x/WARPSIZE;
  int lane_id = threadIdx.x%WARPSIZE;

  int warp_needed = (size+WARPSIZE-1)/WARPSIZE; //ceiling of division

  unsigned tnum = __ballot(1);
  int thread_num = __popc(tnum);

    //full warp reduction
  if (thread_num == WARPSIZE) {
    gpu_regular_warp_reduce(reduce_data, shuflReduceFn);
  }
    //partial warp reduction
  if (thread_num < WARPSIZE) {
    gpu_irregular_warp_reduce(reduce_data, shuflReduceFn, thread_num,
                              lane_id);
  }
    //Gather all the reduced values from each warp
    //to the first warp.
    //named_barrier inside this function to ensure
    //correctness. It is effectively a sync_thread
    //that won't deadlock.
  interWarpCpyFn(reduce_data, warp_needed);

    //This is to reduce data gathered from each "warp master".
  if (wid == 0) {
    gpu_irregular_warp_reduce(reduce_data, shuflReduceFn, warp_needed,
                              lane_id);
  }

  return;
}
```
In this function, no ShuffleReduceFn is called directly; instead it calls the
various warp-reduction functions described above. It first reduces ReduceData
warp by warp; at that point, the number of live ReduceData equals the number
of warps present in this thread block. We then gather all such ReduceData
into the first warp.

As observed, this algorithm makes use of the function interWarpCpyFn, which
copies data from each "warp master" (the 0th lane of each warp, where a
warp-reduced ReduceData is held) to the 0th warp. This step reduces (in a
mathematical sense) the problem of reduction across the warp masters of a
block to the warp-reduction problem, which we already know how to solve.

We can thus completely avoid the use of atomics to reduce in a threadblock.
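
For concreteness, below is a minimal sketch of what a compiler-generated
interWarpCpyFn could look like for the foo/bar ReduceData used throughout
this document. It is an assumption, not actual compiler output: the
shared-memory transfer buffers and their sizes are illustrative, and
named_barrier stands in for the runtime's non-deadlocking block
synchronization referred to in the comments above (here all threads of the
block are assumed to participate).

```
#define WARPSIZE 32

// As sketched earlier for the foo/bar example.
struct ReduceData {
  double *foo;
  double *bar;
};

// Provided by the runtime: a named barrier that synchronizes only the
// participating threads (see the comments in the pseudo code above).
__device__ void named_barrier(int id, int num_threads);

// Illustrative interWarpCpyFn: tunnel each warp-reduced element to the
// first warp through shared memory.
__device__ void interWarpCpyFn(void *reduce_data, int num_warps) {
  __shared__ double transfer_foo[WARPSIZE];  // one slot per possible warp
  __shared__ double transfer_bar[WARPSIZE];

  ReduceData *rd = static_cast<ReduceData *>(reduce_data);
  int wid = threadIdx.x / WARPSIZE;
  int lane_id = threadIdx.x % WARPSIZE;

  // Each warp master publishes its warp-reduced element.
  if (lane_id == 0) {
    transfer_foo[wid] = *rd->foo;
    transfer_bar[wid] = *rd->bar;
  }
  named_barrier(1, blockDim.x);

  // Lanes 0..num_warps-1 of the first warp each pick up one value, so the
  // final gpu_irregular_warp_reduce call can reduce across warp masters.
  if (wid == 0 && lane_id < num_warps) {
    *rd->foo = transfer_foo[lane_id];
    *rd->bar = transfer_bar[lane_id];
  }
  named_barrier(1, blockDim.x);
}
```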

**Efficient Cross Block Reduce**

The next challenge is to reduce values across threadblocks. We aim to do this
without atomics or critical sections.

Let a kernel be started with TB threadblocks.
Let the GPU have S SMs.
There can be at most N active threadblocks per SM at any time.

Consider a threadblock tb (tb < TB) running on SM s (s < S). 'tb' is one of
at most 'N' active threadblocks on SM s. Let each threadblock active on an SM
be given an instance identifier id (0 <= id < N). The tuple (s, id) therefore
uniquely identifies an active threadblock on the GPU.

To implement cross block reduce efficiently, we first allocate, for each
value to be reduced, an array of size S*N (the maximum number of threadblocks
that can be active on the device at any time).

Each threadblock reduces its value into slot [s][id]. This can be done
without locking, since no other threadblock can write to the same slot
concurrently.

As a final stage, we reduce the values in the array as follows:

```
// Compiler generated wrapper function for each target region with a
// reduction clause.
target_function_wrapper(map_args, reduction_array)  <--- start with 1 team
                                                         and 1 thread.
  // Use dynamic parallelism to launch M teams, N threads as requested by
  // the user to execute the target region.

  target_function<<<M, N>>>(map_args)

  Reduce values in reduction_array

```
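
A sketch of the slot-based scheme follows, under illustrative assumptions
only: get_instance_id() is a hypothetical helper for the per-SM instance
identifier, the scratch array is assumed to be zero-initialized, and only the
'foo' value is shown. The SM identifier itself can be read from the %smid
special register.

```
// Illustrative only; not the runtime's actual cross-block machinery.
__device__ static unsigned get_smid() {
  unsigned smid;
  asm("mov.u32 %0, %%smid;" : "=r"(smid));  // SM id of this block
  return smid;
}

// Hypothetical helper: instance identifier id (0 <= id < N) of this block
// among the blocks currently active on its SM.
__device__ unsigned get_instance_id();

// Executed by the master thread of a block after its block-level reduction
// of foo has completed.  No lock is needed: slot (s, id) is exclusive to
// the blocks that occupy it, and they never run concurrently.
__device__ void publish_block_foo(double *reduction_array, int N,
                                  double block_foo) {
  unsigned slot = get_smid() * N + get_instance_id();
  reduction_array[slot] += block_foo;
}

// Final stage, run by the wrapper with a single team and thread: fold the
// S*N slots (assumed zero-initialized) into the reduction variable.
__device__ void fold_reduction_array(const double *reduction_array,
                                     int S, int N, double *foo) {
  for (int i = 0; i < S * N; ++i)
    *foo += reduction_array[i];
}
```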

**Comparison with Last Version**

The (simplified) pseudo code generated by LLVM on the host is as follows:

```
    1. Create private copies of variables: foo_p, bar_p
    2. Each thread reduces the chunk of A and B assigned to it and writes
       to foo_p and bar_p respectively.
    3. ret = kmpc_reduce_nowait(..., reduceData, reduceFn, lock)
        where:
        struct ReduceData {
          double *foo;
          double *bar;
        } reduceData
        reduceData.foo = &foo_p
        reduceData.bar = &bar_p

        reduceFn is a pointer to a function that takes in two inputs
        of type ReduceData, "reduces" them element wise, and places the
        result in the first input:
        reduceFn(ReduceData *a, ReduceData *b)
          a = a @ b

        Every thread in the parallel region calls kmpc_reduce_nowait with
        its private copy of reduceData. The runtime reduces across the
        threads (using tree reduction on the operator 'reduceFn') and stores
        the final result in the master thread if successful.
    4. if ret == 1:
        The master thread stores the reduced result in the globals.
        foo += reduceData.foo; bar += reduceData.bar
    5. else if ret == 2:
        In this case kmpc_reduce_nowait() could not use tree reduction,
        so use atomics instead:
        each thread atomically writes to foo
        each thread atomically writes to bar
```

On a GPU, a similar reduction may need to be performed across SIMT threads,
warps, and threadblocks. The challenge is to do so efficiently in a fashion
that is compatible with the LLVM OpenMP implementation.

In the previously released 0.1 version of the LLVM OpenMP compiler for GPUs,
the salient steps of the generated code are as follows:

```
    1. Create private copies of variables: foo_p, bar_p
    2. Each thread reduces the chunk of A and B assigned to it and writes
       to foo_p and bar_p respectively.
    3. ret = kmpc_reduce_nowait(..., reduceData, reduceFn, lock)
        status = can_block_reduce()
        if status == 1:
          reduce efficiently to thread 0 using shuffles and shared memory.
          return 1
        else
          cannot use efficient block reduction, fallback to atomics
          return 2
    4. if ret == 1:
        The master thread stores the reduced result in the globals.
        foo += reduceData.foo; bar += reduceData.bar
    5. else if ret == 2:
        In this case kmpc_reduce_nowait() could not use tree reduction,
        so use atomics instead:
        each thread atomically writes to foo
        each thread atomically writes to bar
```

The function can_block_reduce() is defined as follows:

```
int32_t can_block_reduce() {
  int tid = GetThreadIdInTeam();
  int nt = GetNumberOfOmpThreads(tid);
  if (nt != blockDim.x)
    return 0;
  unsigned tnum = __ballot(1);
  if (tnum != (~0x0)) {
    return 0;
  }
  return 1;
}
```

This function permits the use of the efficient block reduction algorithm
using shuffles and shared memory (return 1) only if (a) all SIMT threads in
a warp are active (i.e., the number of threads in the parallel region is a
multiple of 32) and (b) the number of threads in the parallel region
(set by the num_threads clause) equals blockDim.x.

If either of these preconditions is not true, each thread in the threadblock
updates the global value using atomics.

Atomics and compare-and-swap operations are expensive on many-threaded
architectures such as GPUs, and we must avoid them completely.


**Appendix: Implementation Details**


```
// Compiler generated function.
reduceFn(ReduceData *a, ReduceData *b)
  a->foo = a->foo + b->foo
  a->bar = a->bar + b->bar

// Compiler generated function.
swapAndReduceFn(ReduceData *thread_private, int lane)
  ReduceData *remote = new ReduceData()
  remote->foo = shuffle_double(thread_private->foo, lane)
  remote->bar = shuffle_double(thread_private->bar, lane)
  reduceFn(thread_private, remote)

// OMP runtime function.
warpReduce_regular(ReduceData *thread_private, Fn *swapAndReduceFn):
  offset = 16
  while (offset > 0)
    swapAndReduceFn(thread_private, offset)
    offset /= 2

// OMP runtime function.
warpReduce_irregular():
  ...

// OMP runtime function.
kmpc_reduce_warp(reduceData, swapAndReduceFn)
  if all_lanes_active:
    warpReduce_regular(reduceData, swapAndReduceFn)
  else:
    warpReduce_irregular(reduceData, swapAndReduceFn)
  if in_simd_region:
    // all done, reduce to global in simd lane 0
    return 1
  else if in_parallel_region:
    // done reducing to one value per warp, now reduce across warps
    return 3

// OMP runtime function; one for each basic type.
kmpc_reduce_block_double(double *a)
  if lane == 0:
    shared[wid] = *a
  named_barrier(1, num_threads)
  if wid == 0
    block_reduce(shared)
  if lane == 0
    *a = shared[0]
  named_barrier(1, num_threads)
  if wid == 0 and lane == 0
    return 1 // write back reduced result
  else
    return 0 // don't do anything

```



```
// Compiler generated code.
    1. Create private copies of variables: foo_p, bar_p
    2. Each thread reduces the chunk of A and B assigned to it and writes
       to foo_p and bar_p respectively.
    3. ret = kmpc_reduce_warp(reduceData, swapAndReduceFn)
    4. if ret == 1:
        The master thread stores the reduced result in the globals.
        foo += reduceData.foo; bar += reduceData.bar
    5. else if ret == 3:
        ret = block_reduce_double(reduceData.foo)
        if ret == 1:
          foo += reduceData.foo
        ret = block_reduce_double(reduceData.bar)
        if ret == 1:
          bar += reduceData.bar
```

**Notes**

    1. This scheme requires that the CUDA OMP runtime can call llvm generated
       functions. This functionality now works.
    2. If the user inlines the CUDA OMP runtime bitcode, all of the machinery
       (including calls through function pointers) is optimized away.
    3. If we are reducing multiple variables in a parallel region, the
       reduce operations are all performed in warpReduce_[ir]regular(). This
       results in more instructions in the loop and should result in fewer
       stalls due to data dependencies. Unfortunately we cannot do the same
       in kmpc_reduce_block_double() without increasing shared memory usage.