next up previous
Next: Performance Up: ICC++: A C++ Dialect Previous: Unstructured Concurrency

Polygon Overlay in ICC++

  This section discusses how we parallelized the polygon overlay problem in ICC++. We focus upon the changes needed to the code and how data locality is managed. Performance results are given in Section 1.7

Porting to ICC++

  We developed our polygon overlay program using the sequential ANSI C implementation as a starting point. While this program was essentially acceptable to our ICC++ compiler, several minor changes were required. For completeness, and to give the reader an idea what program porting effort would be required, we list these changes exhaustively. Note that the first two points are artifacts of our current preliminary implementation.
  1. Because the data representation used by the ICC++ runtime is not identical to the standard C++ representation, we were unable to use the qsort library, so we wrote a new quicksort routine. This would have been necessary for parallelization in any event.
  2. The differences between the ICC++ runtime system and C++ regarding object layout also made it impossible to use functions like fopen directly (fopen returns a FILE *); we wrote a library of wrapper functions that manipulated files using fopen, fscanf and the like, and communicated only ints to the ICC++ program. This will be unnecessary when we complete a simple interface to the stdio library.
  3. Finally, the original code made heavy use of pointers to create the polygon arrays. Because ICC++ does not allow pointers to be used as arrays, we rewrote the array allocation statements (a few lines), and all type declarations which described the ``arrays'' (converted to collections). In most cases this simply involved changing a declaration like poly_t* polygons to poly_t[ ] polygons.

To illustrate the changes (and shared code), we use the overlay function as example. The original program exploits the equivalence between pointers and arrays. Array indexing is done implicitly using two points, pl and pr, and pointer arithmetic is used to sequence the arrays. ICC++ requires that arrays and collections be explicitly declared using the [ ] syntax, which produces the program shown in Program 1.2. Note the only difference is in the call to polyOverlayLn, where array accessing is explicit.

polyVec_p overlay(
  polyVec_p     left,                   // left input vector
  polyVec_p     right                   // right input vector
){
  int           il, ir;                 // loop indices
  polyLn_p      outList = NULL;         // output list
  polyLn_p      newLn = NULL;           // newly-created polygon link

  for (il=0; il<left->len; il++){
    for (ir=0; ir<right->len; ir++){
      newLn = polyOverlayLn(&(left->vec[il]), &(right->vec[ir]));
      if (newLn != NULL){
        newLn->link = outList;
        outList = newLn;
      }
    }
  }
  return polyLn2Vec(outList);           // convert on the way out
}

ICC++ Code for Naıve Algorithm 

The ICC++ version uses explicit array indexing and passes pointers to elements of the arrays, thus preserving the interface to functions such as polyOverlayLn. In this case, using explicit array indexing produces code of equal efficiency and may even make the code structure more lucid.

Parallelization

  We parallelized two of the polygon overlay algorithms: the naıve algorithm and the area-linked algorithm. These two were chosen because they exemplified the two major data structure interactions (i.e. array-array and array-list) present in the polygon overlay program. In addition, the majority of algorithmic speedup was achieved with the area-linked algorithm for polygon overlay.

The first stage of parallelizing these algorithms was to add conc annotations to the main loops, as shown in Programs 1.3 and 1.4. With these small changes, both algorithms are concurrent, and the basic object consistency model is sufficient to ensure correct output. However, the loops are still serialized by the accumulation of new polygons into the output list. Note that the code sequences in Programs 1.3 and 1.4 areessentially what was run to produce the sequential performance numbers in Section 1.7.

conc for (il=0; il<leftlimit; il++){
  conc for (ir=0; ir<rightlimit; ir++){
    newLn = polyOverlayLn(&(left->vec[il]), &(right->vec[ir]));
    if (newLn != NULL){
      newLn->link = outList;
      outList = newLn;
    }
  }
}

Parallelized kernel of function overlay  

declarations as in sequential program
  rightList = polyVec2AreaLn(right, TRUE);
  conc for (il=0, pl=left->vec; il<left->len; il++, pl++){
    leftArea = polyArea(pl);
    rightPre = rightList;
    conc while ((leftArea > 0) &&
           ((rightCurr = rightPre->link) != NULL)){
      if ((newLn = polyOverlayLn(pl, rightCurr->poly)) != NULL){
        add new polygon as in sequential program
      }
      if (rightCurr->area == 0){
        rightPre->link = rightCurr->link;
      } else{
        rightPre = rightCurr;
      }
    }
  }

Parallelized kernel of function overlayAreaLinked  

To achieve scalable parallel performance, it is necessary to change the sequential accumulation structure from a list to a parallel accumulation structure. We chose the natural extension of a single global list into per-processor lists. This change, in conjunction with locality optimizations is described below.

Distributing PolyOver

  Because data locality is critical to achieving high performance, it is necessary to reorganize the computation to increase the node-level reuse of data. For both the naıve algorithm and area linked algorithms (i.e. the overlay and overlayAreaLinked' functions), the basic idea of the new structure is to distribute the ``right'' vector (i.e. right->vec) across the nodes, so each node has a local portion.Then, for each element of the ``left'' vector (i.e. left->vec), each nodes does the overlay with its local portion of right->vec. Thus, right->vec is partitioned across the nodes, but left->vec remains a global data structure, which is neither replicated nor explicitly partitioned. Note that this data decomposition preserves the list structure (for the area-linked algorithm) and therefore is of similar efficiency to the sequential program.

Some of the changes required to implement this were as follows:

void polyLn_array::local_overlay(int proc, int limit, int stride,
                                 int xl, int yl, int xh, int yh,
                                 poly_t (*vec)[]){
  for (int i=0; i<limit; i+=stride){
    if (polyLn_p newLn = (*vec)[i].OverlayLn(xl, yl, xh, yh)) 
      push(newLn);
  }
}

void overlay(polyVec_p left, polyVec_p right){
  int stride = NUM_PROCS;
  poly_t (*leftvec)[],(*rightvec)[];
  polyLn_array (*accum)[] = new polyLn_array[NUM_PROCS];

  conc for (proc = 0; proc < NUM_PROCS; proc++){
    int upper_limit = right->len+proc;
    conc for (il=0; il<left->len; il++){
      poly_p left_el = &(*left->vec)[il];
      (*accum)[proc].local_overlay(proc, upper_limit, stride,
                                   left_el->xl, left_el->yl, left_el->xh, left_el->yh,
                                   right->vec);
    }
  }

  conc for (proc = 0; proc < (NUM_PROCS-1); proc++){
    polyLn_p last_one = (*accum)[proc].last_one();
    last_one->link = (*accum)[proc+1].link;
  }
}

Locality-Tuned Naıve Polygon Overlay Algorithm 

As shown in Program 1.5, the accum[] collection implements the parallel list, with operations local_overlay(), push(), and last_one() for building the output polygons, accumulating them, and stringing the lists together to interface to the existing output routines. local_overlay() overlays the polygon passed in against all of the polygons on that processor, thus providing a tight inner loop of purely local computation. Similarly, the last_one() method code shown below runs down the local list to find its end:

polyLn_p polyLn_array[]::last_one(){
  polyLn_p current, previous;

  current = link;	// start with my pointer
  while (current != NULL){
    previous = current;
    current = current->link;
  }
  return previous;
}

Tuning the locality for the area-linked polygon overlay algorithm was a bit more involved, but was achieved with essentially the same strategy. The same parallel list structure, accum[], is used to collect the new polygons. However, because the area-linked algorithm prunes both lists based on the area remaining, it maintains a linked list for right->vec. This linked list forms a sequential bottleneck, so to parallelize the program effectively, we again changed the sequential list structure to a parallel list data structure. This is shown in Program 1.6. Note that our data decomposition partitions right->vec, but because it does not partition left->vec it is of similar efficiency to the sequential program. Empirical comparisons confirm that our parallel area-linked implementation performs a similar number of polygon compares to the sequential implementation, although is does more work as the degree of parallelism increases.

int polyLn_array::local_OverlayAreaLn(int running_area,
                                      int l_xl, int l_yl, int l_xh, int l_yh,
                                      polyAreaLn_p list)
{
  polyLn_p newLn;
  polyAreaLn_p current = list;
  polyAreaLn_p prev = NULL;
  int leftArea = running_area;
  int newArea = 0;

  while ((current != NULL)&&(leftArea>0)){
    newLn = current->poly->local_OverlayLn(l_xl,l_yl,l_xh,l_yh);

    if (newLn != NULL){
      push(newLn);     
      newArea = newLn->poly.area();
      leftArea -= newArea; 
      current->area -= newArea;
      if (current->area == 0){ 
        if (prev != NULL) prev->snap();
        current = current->link;
      } else {
        prev = current; current = current->link;
      }
    } else{
      prev = current; current = current->link;
    }
  }
  return leftArea;
}

void poly_s::local_OverlayAreaLn(polyLn_array (*accum)[], 
                                 polyAreaLn_array (*input)[])
{
  int leftArea = area();
  int proc = 0;

  while ((leftArea>0)&&(proc < NUM_PROCS)){
    leftArea = (*accum)[proc].local_OverlayAreaLn(leftArea,xl,yl,
                                                  xh,yh,(*input)[proc].link); 
    proc++;
  }
}

// calling loop
polyLn_array (*accum)[] = new polyLn_array[NUM_PROCS];
polyAreaLn_array (*input)[] = new polyAreaLn_array[NUM_PROCS];
polyVec2AreaLn(right, input);
leftvec = left->vec;
conc for (il=0; il<leftlimit; il++)
  (*leftvec)[il].local_OverlayAreaLn(accum,input);

Parallel Version of Area-linked Polygon Overlay Algorithm 

The parallel list data structure effectively enables parallelism, and also achieves high data locality for each processor. As with the accumulation list, this parallel list structure supports push and next operations; the object concurrency model provides all of the necessary synchronization to ensure correct results. Note that the area-linked algorithm uses a linked list and polygon deletion to reduce the number of polygon-polygon comparisons. Our parallel area-linked algorithm preserves the efficiency of the sequential version and, as a result, there is not enough work to generate much parallelism.

A na¨ıve partitioning that distributed both vectors (or replicated one of them) across the nodes would effectively destroy the performance of the area-linked approach because the total area that overlays a given polygon might be distributed across the nodes. In this case, no single node would be able to delete it from its work list. By retaining the left vector as a global structure, we avoid this problem by accumulating areas for the left vector globally. Hence, while parallel execution will cause this approach to perform somewhat more comparisons, the area-linked approach should not degrade into the na¨ıve case.


next up previous
Next: Performance Up: ICC++: A C++ Dialect Previous: Unstructured Concurrency

Julian Dolby
dolby@cs.uiuc.edu