
Performance

  The Illinois Concert C++ compiler exploits the Concert project's optimization technology for concurrent object-oriented programs, such as static type inference [32], aggressive cloning [33], and inlining [35]. This technology has been demonstrated both to give concurrent object-oriented programs high sequential efficiency and to support fine-grained object-oriented programs efficiently.

This base, originally developed for the Concurrent Aggregates [12] programming language, has been adapted to support ICC++ as well. The adaptation of the compiler is ongoing, so the performance figures quoted here are for a preliminary implementation. All execution times are reported in seconds.

Sequential Performance

  The sequential performance of the ICC++ polygon overlay program was measured by running it on a multiprocessor simulator. This simulator has significant performance disadvantages compared to a sequential C++ program: it uses threads to provide runtime services (incurring context-switching overhead), executes ``parallel ready'' code that can deal with distributed data, and provides fully automatic storage management.

The simulator is used because the runtime system requires multithreading. For the sequential performance numbers presented here, locality and concurrency-control annotations were added at call sites in the computational kernels; these annotations enable the compiler to eliminate locality checks and locking operations. Because we consider the current form of the annotations rather primitive, they are not presented in detail. Table 1.1 shows the performance of our program: ICC++ comes within a factor of two of sequential C++ performance.

 
Inputs             Algorithm     C++ (secs)   ICC++ (secs)   Ratio (ICC++/C++)
k100.00 k100.01    naive               2767           5242               1.89
                   area-linked           41             61               1.49
k100.10 k100.11    naive               2850           5503               1.93
                   area-linked           38             64               1.68
k100.18 k100.19    naive               2863           5713               2.00
                   area-linked           37             62               1.68

Table 1.1: Sequential Performance of Polygon Overlay on an SS20/51

Six data sets containing approximately 60,000 polygons each were used for these tests; no large data-dependent differences in execution time were observed. These numbers provide a baseline for evaluating the performance of ICC++. As can be seen, the sequential performance of the ICC++ program is generally one-half to two-thirds that of the C++ program. Note, however, that the C++ program does not use C++'s object-oriented features. The majority of the performance difference is due to storage management overhead (i.e., automatic garbage collection) and concurrency control overheads which are unnecessary in a sequential program.
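As a rough illustration of where these overheads come from, the sketch below contrasts a generic ``parallel ready'' call path, which pays a locality check and per-object locking on every invocation, with the direct call the compiler can emit once annotations (or a purely sequential setting) let it assume local, uncontended access. The names and types here are hypothetical illustrations in plain C++, not the ICC++ annotation syntax or the Concert runtime API.

#include <mutex>

// Hypothetical stand-ins, not the Concert runtime: a polygon and a wrapper
// representing a possibly-remote, lockable object managed by the runtime.
struct Polygon {
    long xl, yl, xh, yh;
    long area() const { return (xh - xl) * (yh - yl); }
};

struct SharedPolygon {
    Polygon body;
    bool is_local = true;   // locality flag the runtime would maintain
    std::mutex lock;        // per-object concurrency control
};

// Unannotated call path: every invocation pays a locality check and a lock.
long area_generic(SharedPolygon& p) {
    if (!p.is_local) {
        // the runtime would forward the invocation to the owning node here
    }
    std::lock_guard<std::mutex> guard(p.lock);
    return p.body.area();
}

// With annotations asserting local, exclusive access, the compiler can emit
// (and inline) the direct call, matching plain sequential C++.
long area_annotated(const Polygon& p) {
    return p.area();
}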

Parallel Performance

  The naive and area-linked algorithms were parallelized and tuned for locality. In addition to the code changes described in the previous section, annotations in the computational kernels allow the compiler to eliminate locality checks and locking operations and provide explicit control over inlining decisions, enabling it to generate efficient sequential kernels. As before, these are call-site annotations; because we consider their current form rather primitive, they are not presented in detail. The overall execution times for both algorithms, and for their respective C++ equivalents, are shown in Tables 1.2 to 1.5. These numbers were all taken on a Cray T3D, with the sequential times determined by running the original C code on one node.

We show performance both on small data sets, containing approximately 12,000 polygons each, and on data sets containing approximately 60,000 polygons each. The larger sets are included to allow comparisons with other systems; the small sets are shown because our system cannot run the large ones on fewer than four T3D nodes.

 

PEs        sequential      1      2      4      8     16     32     64    128
input               -      4      6      6      7      9     19     54    182
overlay             -    333    171    133     64     33     18     10      8
output              -      4      3      2      2      3      3      6      6
overall           158    358    189    149     77     47     41     82    211

Table 1.2: Naive Algorithm, 12K-polygon Data Sets on the T3D (all times in seconds)

 

PEs        sequential      1      2      4      8     16     32
input               -      3      6      6      7      9     20
listify             -    0.2    0.2    0.1      0      0      0
overlay             -    173    171     82     42     23     13
output              -      4      3      2      2      2      3
overall           110    190    189     95     55     35     38

Table 1.3: Area-Linked Algorithm, 12K-polygon Data Sets on the T3D (all times in seconds)

Our results for the 12K-polygon data sets illustrate the performance our preliminary ICC++ compiler can achieve on the Cray T3D: the sequential time on one node is roughly half that of the sequential C++ code. The parallel speedup for the naive polygon overlay phase alone (see Table 1.2 and Figure 1.1) is about one quarter of ideal up to 64 nodes (where it is 16), but rises only to 20 when the number of nodes is doubled to 128. When speedup is computed against the one-node performance of the ICC++ code, it is roughly half of ideal up to 64 nodes. The results achieved with the area-linked algorithm are slightly better (see Table 1.3 and Figure 1.2), with speedup slightly better than one quarter of ideal (5 on 16 nodes and 9 on 32 nodes). When calculated with respect to ICC++ single-node performance, speedup is roughly one third of ideal. Speedup for the smart algorithm is similar to that of the naive one, since our parallelization of the smart algorithm preserves the optimizations of the sequential code.
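For concreteness, the fractions of ideal quoted above follow from the usual definitions of speedup and efficiency; a short worked example uses the Table 1.2 entries for the naive overlay phase (sequential baseline 158 seconds, one-node ICC++ time 333 seconds, 64-node time 10 seconds):

  S_P = \frac{T_{\mathrm{baseline}}}{T_P}, \qquad E_P = \frac{S_P}{P}

  % against the sequential C baseline
  S_{64} = \frac{158}{10} \approx 16, \qquad E_{64} \approx \frac{16}{64} = \frac{1}{4}

  % against the one-node ICC++ time
  S_{64} = \frac{333}{10} \approx 33, \qquad E_{64} \approx \frac{33}{64} \approx \frac{1}{2}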

Figure 1.1: ICC++ Speedups (Naive Algorithm, 12K-polygon Data Sets)


Figure 1.2: ICC++ Speedups (Area-Linked Algorithm, 12K-polygon Data Sets). Note that our implementation preserves the sequential algorithm's efficiency, so there is little computation.


Full-Size Sets

 

 

PEs        sequential      4      8     16     32     64    128
input               -     35     41     44     81    270    708
overlay             -   3283   1475    739    371    188     99
output              -      9     11     12     11     12     12
overall          3500   3359   1603    835    487    494    818

Table 1.4: Naive Algorithm, 60K-polygon Data Sets on the T3D (all times in seconds)

 

PEs        sequential      4      8     16     32
input               -     31     36     43     70
listify             -    0.4    0.2    0.1    0.1
overlay             -     68     49     42     39
output              -     11     12     10     13
overall            57    117    103    100    127

Table 1.5: Area-Linked Algorithm, 60K-polygon Data Sets on the T3D (all times in seconds). Note that our parallel implementation preserves the optimizations of the sequential algorithm.

The full-sized data sets are too large for the ICC++ program to run on one or two nodes, due to the storage overhead for objects in our runtime system, so performance numbers are shown for four nodes and up. Direct comparison with single-node sequential performance is therefore impossible, but we would expect the same slowdown of roughly a factor of two seen for the small data sets. As Table 1.4 and Figure 1.3 show, the speedup for the naive algorithm is slightly superlinear starting from 4 nodes, and better than one quarter of ideal when calculated from the sequential code's running time.

Figure 1.3: ICC++ Speedups (Naive Algorithm, 60K-polygon Data Sets on the T3D)

 

Figure 1.4: ICC++ Speedups (Area-Linked Algorithm, 60K-polygon Data Sets on the T3D). Note that our implementation preserves the sequential algorithm's efficiency, so there is little computation.

The speedup for the area-linked algorithm (see Table 1.5 and Figure 1.4), on the other hand, is very poor: going from 4 to 32 nodes yields a speedup of only 1.7. This is much worse than for the small data set, simply because the large data set is sorted. The area-linked algorithm is heavily dependent upon input file ordering for high performance, and there is very little work when the file is sorted. Note that our parallel implementation preserves the efficiency of the area-linked algorithm as described in Section 1.6.3. Also, the amount of work per polygon varies, making load balancing difficult.

Discussion

  As can be seen, the majority of the time in the naive algorithm goes to the polygon overlay computation itself, until I/O time dominates at large numbers of nodes. For the area-linked algorithm, however, a significant fraction of the time is spent preparing for input/output, and this performs badly enough to make the parallel code actually slower than the sequential C++ for the large, sorted data sets.

Clearly, for realistic algorithms, the input/output phases must be parallelized to achieve good performance; a high-performance implementation would therefore parallelize the input/output sections of the program, e.g. by using mapped files. There are no significant obstacles to doing this in ICC++, but we did not have time to complete the parallelization. Instead, we chose to isolate the input/output times from the basic polygon overlay functionality.
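A minimal sketch of the mapped-file approach, assuming a POSIX platform and a simple byte-range partitioning of the input, is shown below; it is an illustration of the idea rather than the ICC++ I/O code, and the helper name map_slice is hypothetical. Each node maps the shared input file and parses only its own slice, so input time scales down with the number of nodes; real code would also snap the slice boundaries to whole polygon records.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Hypothetical helper: map the whole input file read-only and compute the
// byte range that node `rank` of `nprocs` should parse.  Returns the mapped
// base address (or nullptr on failure); `*len` receives the total file size.
const char* map_slice(const char* path, int rank, int nprocs,
                      std::size_t* begin, std::size_t* end, std::size_t* len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                             // the mapping outlives the descriptor
    if (base == MAP_FAILED) return nullptr;
    std::size_t chunk = (std::size_t)st.st_size / nprocs;
    *begin = (std::size_t)rank * chunk;
    *end   = (rank == nprocs - 1) ? (std::size_t)st.st_size : *begin + chunk;
    *len   = (std::size_t)st.st_size;
    return static_cast<const char*>(base);
}

Each node would then parse polygons from its [begin, end) range and unmap the file when finished.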


