To illustrate the changes (and the shared code), we use the overlay function as an example. The original program exploits the equivalence between pointers and arrays: array indexing is done implicitly through two pointers, pl and pr, and pointer arithmetic is used to step through the arrays. ICC++ requires that arrays and collections be explicitly declared using the [ ] syntax, which produces the program shown in Program 1.2. Note that the only difference is in the call to polyOverlayLn, where array accessing is explicit.
polyVec_p overlay(
    polyVec_p left,             // left input vector
    polyVec_p right             // right input vector
){
  int il, ir;                   // loop indices
  polyLn_p outList = NULL;      // output list
  polyLn_p newLn = NULL;        // newly-created polygon link
  for (il=0; il<left->len; il++){
    for (ir=0; ir<right->len; ir++){
      newLn = polyOverlayLn(&(left->vec[il]), &(right->vec[ir]));
      if (newLn != NULL){
        newLn->link = outList;
        outList = newLn;
      }
    }
  }
  return polyLn2Vec(outList);   // convert on the way out
}
The ICC++ version uses explicit array indexing and passes pointers to elements of the arrays, thus preserving the interface to functions such as polyOverlayLn. In this case, using explicit array indexing produces code of equal efficiency and may even make the code structure more lucid.
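For comparison, the corresponding loops in the original ANSI C program would have looked roughly like the following. This is a reconstruction rather than the chapter's actual listing: pl and pr walk the two vectors by pointer arithmetic, so the call to polyOverlayLn carries no index expressions.

for (il=0, pl=left->vec; il<left->len; il++, pl++){
  for (ir=0, pr=right->vec; ir<right->len; ir++, pr++){
    newLn = polyOverlayLn(pl, pr);   /* pl and pr point at the current elements */
    if (newLn != NULL){
      newLn->link = outList;
      outList = newLn;
    }
  }
}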
The first stage of parallelizing these algorithms was to add conc annotations to the main loops, as shown in Programs 1.3 and 1.4. With these small changes, both algorithms are concurrent, and the basic object consistency model is sufficient to ensure correct output. However, the loops are still serialized by the accumulation of new polygons into the output list. Note that the code sequences in Programs 1.3 and 1.4 are essentially what was run to produce the sequential performance numbers in Section 1.7.
conc for (il=0; il<leftlimit; il++){
  conc for (ir=0; ir<rightlimit; ir++){
    newLn = polyOverlayLn(&(left->vec[il]), &(right->vec[ir]));
    if (newLn != NULL){
      newLn->link = outList;
      outList = newLn;
    }
  }
}
// declarations as in sequential program
rightList = polyVec2AreaLn(right, TRUE);
conc for (il=0, pl=left->vec; il<left->len; il++, pl++){
  leftArea = polyArea(pl);
  rightPre = rightList;
  conc while ((leftArea > 0) &&
              ((rightCurr = rightPre->link) != NULL)){
    if ((newLn = polyOverlayLn(pl, rightCurr->poly)) != NULL){
      // add new polygon as in sequential program
    }
    if (rightCurr->area == 0){
      rightPre->link = rightCurr->link;
    } else {
      rightPre = rightCurr;
    }
  }
}
To achieve scalable parallel performance, it is necessary to change the sequential accumulation structure from a single list to a parallel accumulation structure. We chose the natural extension of a single global list to per-processor lists. This change, in conjunction with locality optimizations, is described below.
Some of the changes required to implement this were as follows:
void polyLn_array::local_overlay(int proc, int limit, int stride,
                                 int xl, int yl, int xh, int yh,
                                 poly_t (*vec)[]){
  // strided partition: processor proc overlays elements proc, proc+stride, ...
  for (int i=proc; i<limit; i+=stride){
    if (polyLn_p newLn = (*vec)[i].OverlayLn(xl, yl, xh, yh))
      push(newLn);               // accumulate onto this processor's local list
  }
}
void overlay(polyVec_p left, polyVec_p right){
  int stride = NUM_PROCS;
  poly_t (*leftvec)[], (*rightvec)[];
  polyLn_array (*accum)[] = new polyLn_array[NUM_PROCS];
  conc for (int proc = 0; proc < NUM_PROCS; proc++){
    int upper_limit = right->len;
    conc for (int il=0; il<left->len; il++){
      poly_p left_el = &(*left->vec)[il];
      (*accum)[proc].local_overlay(proc, upper_limit, stride,
                                   left_el->xl, left_el->yl,
                                   left_el->xh, left_el->yh,
                                   right->vec);
    }
  }
  // string the per-processor lists together for the existing output routines
  conc for (int proc = 0; proc < (NUM_PROCS-1); proc++){
    polyLn_p last_one = (*accum)[proc].last_one();
    last_one->link = (*accum)[proc+1].link;
  }
}
As shown in Program 1.5, the accum[] collection implements the parallel list, with operations local_overlay(), push(), and last_one() for building the output polygons, accumulating them, and stringing the lists together to interface with the existing output routines. local_overlay() overlays the polygon passed in against all of the polygons on that processor, providing a tight inner loop of purely local computation. Similarly, the last_one() method, shown below, runs down a local list to find its end:
polyLn_p polyLn_array::last_one(){
  polyLn_p current, previous = NULL;
  current = link;                 // start with my pointer
  while (current != NULL){        // walk to the end of the local list
    previous = current;
    current = current->link;
  }
  return previous;                // NULL if the local list is empty
}
Tuning the locality for the area-linked polygon overlay algorithm was a bit more involved, but was achieved with essentially the same strategy. The same parallel list structure, accum[], is used to collect the new polygons. However, because the area-linked algorithm prunes both lists based on the area remaining, it maintains a linked list for right->vec. This linked list forms a sequential bottleneck, so to parallelize the program effectively, we again changed the sequential list structure to a parallel list data structure, as shown in Program 1.6. Note that our data decomposition partitions right->vec; because it does not partition left->vec, the parallel version is of similar efficiency to the sequential program. Empirical comparisons confirm that our parallel area-linked implementation performs a similar number of polygon compares to the sequential implementation, although it does more work as the degree of parallelism increases.
int polyLn_array::local_OverlayAreaLn(int running_area,
                                      int l_xl, int l_yl, int l_xh, int l_yh,
                                      polyAreaLn_p list)
{
  polyLn_p newLn;
  polyAreaLn_p current = list;
  polyAreaLn_p prev = NULL;
  int leftArea = running_area;
  int newArea = 0;
  while ((current != NULL) && (leftArea > 0)){
    newLn = current->poly->local_OverlayLn(l_xl, l_yl, l_xh, l_yh);
    if (newLn != NULL){
      push(newLn);                       // accumulate onto the local output list
      newArea = newLn->poly.area();
      leftArea -= newArea;               // prune the left polygon's remaining area
      current->area -= newArea;          // prune the right polygon's remaining area
      if (current->area == 0){
        if (prev != NULL) prev->snap();  // unlink the exhausted right polygon
        current = current->link;
      } else {
        prev = current; current = current->link;
      }
    } else {
      prev = current; current = current->link;
    }
  }
  return leftArea;
}
void poly_s::local_OverlayAreaLn(polyLn_array (*accum)[],
                                 polyAreaLn_array (*input)[])
{
  int leftArea = area();
  int proc = 0;
  // visit each processor's portion of the right list until this
  // polygon's area is exhausted
  while ((leftArea > 0) && (proc < NUM_PROCS)){
    leftArea = (*accum)[proc].local_OverlayAreaLn(leftArea, xl, yl,
                                                  xh, yh, (*input)[proc].link);
    proc++;
  }
}
// calling loop
polyLn_array (*accum)[] = new polyLn_array[NUM_PROCS];
polyAreaLn_array (*input)[] = new polyAreaLn_array[NUM_PROCS];
polyVec2AreaLn(right, input);      // distribute right->vec across the processors
leftvec = left->vec;
conc for (il=0; il<leftlimit; il++)
  (*leftvec)[il].local_OverlayAreaLn(accum, input);
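The conversion routine polyVec2AreaLn(right, input), which builds the per-processor area-linked lists, is not shown in the chapter. A minimal sketch, assuming the same strided decomposition used by local_overlay() and a hypothetical polyAreaLn_s node type with the poly, area, and link fields used in Program 1.6, might look like this:

void polyVec2AreaLn(polyVec_p right, polyAreaLn_array (*input)[]){
  conc for (int proc = 0; proc < NUM_PROCS; proc++){
    for (int i = proc; i < right->len; i += NUM_PROCS){
      polyAreaLn_p node = new polyAreaLn_s;
      node->poly = &(*right->vec)[i];    // polygon owned by this processor
      node->area = node->poly->area();   // remaining area still to be overlaid
      node->link = (*input)[proc].link;  // push onto this processor's list
      (*input)[proc].link = node;
    }
  }
}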
The parallel list data structure effectively enables parallelism, and also achieves high data locality for each processor. As with the accumulation list, this parallel list structure supports push and next operations; the object concurrency model provides all of the necessary synchronization to ensure correct results. Note that the area-linked algorithm uses a linked list and polygon deletion to reduce the number of polygon-polygon comparisons. Our parallel area-linked algorithm preserves the efficiency of the sequential version and, as a result, there is not enough work to generate much parallelism.
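The snap() call in Program 1.6 performs the polygon deletion on this parallel list. Its body is not shown; a minimal sketch, assuming it is a method on the same hypothetical polyAreaLn_s node type that unlinks the node's successor, exactly as the sequential code does with rightPre->link = rightCurr->link, might be:

void polyAreaLn_s::snap(){
  if (link != NULL)
    link = link->link;   // splice out the exhausted polygon's node
}

Because each deletion is a single method invocation on one list node, the object concurrency model serializes it with any other concurrent access to that node.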
A naïve partitioning that distributed both vectors (or replicated one of them) across the nodes would effectively destroy the performance of the area-linked approach, because the total area that overlays a given polygon might be distributed across the nodes; no single node would then be able to delete that polygon from its work list. By retaining the left vector as a global structure, we accumulate areas for the left vector globally and avoid this problem. Hence, while parallel execution causes this approach to perform somewhat more comparisons, the area-linked approach should not degrade into the naïve case.