This base, originally developed for the Concurrent Aggregates [12] programming language, has been adapted to also support ICC++. The adaptation of this compiler is continuing, so the performance figures we quote are for a preliminary implementation. All execution times are reported in seconds.
The simulator is used because the runtime system requires multithreading. For the sequential performance numbers presented here, locality and concurrency control annotations were added to the computational kernels; these annotations are applied to call sites (that is, places in the code where function calls appear) and enable the compiler to eliminate locality checks and locking operations. Because we consider the current form of these annotations rather primitive, they are not presented in detail. Table 1.1 shows the performance of our program; the figures show that ICC++ can come within a factor of two of sequential performance.
Six data sets containing approximately 60,000 polygons each were used for these tests; no large data-dependent differences in execution time were noted. These numbers provide a baseline for evaluating the performance of ICC++. As can be seen, the sequential performance of the ICC++ program is generally one-half to two-thirds that of the C++ program. Note, however, that the C++ program does not use C++'s object-oriented features. The majority of the performance difference is due to additional storage management overhead (i.e., automatic garbage collection) and concurrency control overheads, which are unnecessary in a sequential program.
We show performance both on small data sets, containing approximately 12,000 polygons each, and on large data sets containing approximately 60,000 polygons each. The large sets are included to allow comparisons with other systems; the small sets are shown because our system cannot run the large ones on fewer than four T3D nodes.
Our results for the 12K-polygon data sets illustrate the performance our preliminary ICC++ compiler can achieve on the Cray T3D: the sequential time on one node is roughly half that of the sequential C++ code. The parallel speedup for the naive polygon overlay phase alone (see Table 1.2 and Figure 1.1) is about one quarter of ideal up to 64 nodes (where it is 16), but rises only to 20 when the number of nodes is doubled to 128. When speedup is computed against the one-node performance of the ICC++ code, it is roughly half of ideal up to 64 nodes. The results achieved using the area-linked algorithm are slightly better (see Table 1.3 and Figure 1.2), with the speedup being slightly better than one quarter of ideal (5 on 16 nodes and 9 on 32 nodes). When calculated with respect to ICC++ single-node performance, though, the speedup is roughly one third of ideal. Speedup on the smart algorithm is similar to that of the naive one, since our parallelization of the smart algorithm preserves the optimizations of the sequential code.
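For reference, the two speedup baselines used in this section can be written out explicitly; the notation below is ours and does not appear in the original tables, with T denoting measured execution time and P the number of nodes:

\[
  S_{\mathrm{C++}}(P) = \frac{T_{\mathrm{C++}}(1)}{T_{\mathrm{ICC++}}(P)}, \qquad
  S_{\mathrm{ICC++}}(P) = \frac{T_{\mathrm{ICC++}}(1)}{T_{\mathrm{ICC++}}(P)},
\]

so a run is "half of ideal", for example, when S(P) is approximately P/2.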
Figure 1.1: ICC++ Speedups (Naive Algorithm, 12K-polygon Data Sets)
Figure 1.2: ICC++ Speedups (Area-Linked Algorithm, 12K-polygon Data Sets). Note that our implementation preserves the sequential algorithm's efficiency, so there is little computation.
The full-sized data sets are too large for the ICC++ program to run on one or two nodes, due to the storage overhead for objects in our runtime system, so performance numbers are shown for 4 nodes and up. Direct comparison with sequential performance is therefore impossible, but we would expect the same slowdown of approximately two that was seen for the small sets. As Table 1.4 and Figure 1.3 show, the speedup for the naive algorithm is slightly superlinear starting from 4 nodes, and better than one quarter of ideal when calculated from the sequential code's running time.
Figure 1.3: ICC++ Speedups: Naive Algorithm, 60K-Polygon Data Set on the T3D
Figure 1.4: ICC++ Speedups: Area-Linked Algorithm, 60K-Polygon Data Set on the T3D (Note that our implementation preserves the sequential algorithm's efficiency, so there is little computation.)
The speedup for the area-linked algorithm (see Table 1.5 and Figure 1.4), on the other hand, is very poor: going from 4 to 32 nodes yields a speedup of only 1.7. This is much worse than for the small data sets, simply because the large data set is sorted: the area-linked algorithm depends heavily on input file ordering for high performance, and there is very little work left to do when the file is sorted. Note that our parallel implementation preserves the efficiency of the area-linked algorithm, as described in Section 1.6.3. In addition, the amount of work per polygon varies, making load balancing difficult.
Clearly, for realistic algorithms, parallelizing the input/output phases is necessary to achieve good overall performance; a high-performance implementation would therefore have to parallelize the input/output sections of the program, e.g., by using mapped files. There are no significant obstacles to doing this in ICC++, but we did not have time to complete the parallelization. Instead, we chose to isolate the input/output times from the basic polygon overlay functionality.
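To make the mapped-file suggestion concrete, the sketch below shows one conventional way to expose an input file of fixed-size polygon records through POSIX mmap, so that each node can index directly into the file image rather than waiting on a single sequential read. The Polygon record layout, the file format, and the helper name map_polygons are illustrative assumptions rather than the benchmark's actual code; only the POSIX calls (open, fstat, mmap) themselves are standard.

    // Illustrative sketch, not the benchmark's implementation: map an input
    // file of fixed-size polygon records so it can be read in place.
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstddef>

    struct Polygon {            // assumed on-disk record: axis-aligned rectangle
        int xl, yl, xh, yh;     // lower-left and upper-right corners
    };

    // Map the whole file read-only and return a pointer to its records.
    const Polygon* map_polygons(const char* path, std::size_t* count) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return nullptr;

        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return nullptr; }

        void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                          // the mapping outlives the descriptor
        if (base == MAP_FAILED) return nullptr;

        *count = static_cast<std::size_t>(st.st_size) / sizeof(Polygon);
        return static_cast<const Polygon*>(base);
    }

Each node would then work on its own contiguous slice of the mapped records, so the input phase no longer serializes on a single node.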