\documentstyle[class,11pt]{article}
\def\ClassName{Computer~Organization}
\def\ClassNumber{CS~333}
\handedout{(11/15/95) Handout \#14}
\vspace{0.5in}

\begin{document}

\begin{center}
{\large \bf Solutions to Homework \#5} \\
\end{center}

{\bf Introduction:} This assignment focuses on vector processors,
their relation to instruction-level parallel scalar processors, and
basics of memory hierarchies.  Before starting the homework, you
should have read Appendix B, and Chapter 5.1-5 in the textbook.

\begin{problems}
\problem \Points{40}
{\bf Vector Machine Performance} -- H\&P B.3 (page B-43)

a)  The vector code for DAXPY for arbitrary vector size n is 
based on P\&H Pg. B-18, and is as follows:
\begin{verbatim}
        addi r2, r0, #(n*8)     ; # of bytes in vector
        add r2, r2, ra          ; address of the end of vector a
        addi r1, r0, #(n mod 8) ; length of first segment
        movi2s vlr, r1          ; length in bytes of first segment
        addi r3,r0,#64          ; vector length for other segments
loop:
                        ; Only the inner portion of the loop
                        ; is required for your answer.
        lv v1, ra
        lv v2, rb
        addv v3, v2, v1
        sv rb, v3
        multsv v4, v2, f0
        sv ra, v4

        add ra, ra, r1
        add rb, rb, r1
        addi r1, r0, #512       ; load byte offset of next segment
        movi2s vlr, r3          ; set length to 64 elements
        sub r4, r2, ra
        bnz r4, loop
\end{verbatim}

b) $$T_n = {\left\lceil {n \over MVL} \right\rceil} *
		(T_{loop} + T_{start}) + n * T_{chime}
$$
$$
T_{100} = {\left\lceil {100 \over 64} \right\rceil} *
		(15 + (12 + 12 + 6 + 12 + 12)) + 4*100
	= 538\ cycles
$$
Note that $T_{loop}$ is assumed to be 15;
and $T_{start}$ is calculated by convoys as follows.
The first LV has a startup of 12 cycles;
LV followed by ADDV is 12 + 6, since the ADDV depends on
the LV (See H\&P Fig.B-13);
SV and MULTSV has a startup penalty of only 12, since
the MULTSV does not depend on the SV(See H\&P Fig.B-5);
and the final SV is 12;
for a total of 54 cycles for $T_{start}$.
$$
R_{100} = {2*100*200 \over 538} = 74.35\ MFLOPS
$$

Extra Credit for calculating $T_{loop}$ (3 points.)  $T_{loop}$ = Time for
ADD, ADD, ADDI, MOVI2S , SUB , BNZ
= say, 1+1+1+1+1+2 = 7 cycles.
\newpage

c)
$$
R_\infty = \lim_{n\to\infty}{\left( {Operations\ Per\ Iteration
		* Clock\ Rate \over Clock\ Cycles\ Per\ Iteration}
		\right)}
$$
$$
= {2*200\ MHz\over \lim_{n\to\infty}
	{\left(
		{5n+69\over n}
	\right)}
}
= {2*200\over 5} \approx 80\ MFLOPS
$$

d)
$$
MFLOPS = {FLOPS\ executed\ in\ {N_{1\over2}}\ Iterations
	\over Clock\ Cycles\ for \ {N_{1\over2}}\ Iterations}
	* { cycles \over seconds} * 10^{-6}
$$
$$
= { 2*{N_{1\over2}}*200 \over T_{N_{1\over2}} }
$$
We wish to determine when this equals ${R_\infty\over2} = 40$.  Thus
we obtain
$$
T_{N_{1\over2}} = 10*{N_{1\over2}}
$$
Assuming ${N_{1\over2}} \le 64$, we get $T_n=4*n+69$ from Part(b).
Equating the above two equations we get:
$$
T_{N_{1\over2}} = 10*{N_{1\over2}} = 4*{N_{1\over2}}+69
$$
So, $${N_{1\over2}} = {69/6} \approx 12$$

e)
For the scalar code, we can estimate $T_{start}$ by $4*memory\ operations
+ 2*FLOPS= 4*6+2*3$.

Thus $Scalar\ Time = n*(15+30) = 45n$.

Vector $T_n = 4n+69,\ for\ n\le64$.

Equating, we get
$ 4*N_v+69 = 45*N_v$.  Thus $N_v = \left\lceil 69\over 41 \right\rceil= 2$.

f) We divide the code into two convoys as follows. LV, LV, ADDV; and
SV, MULTSV, SV.  $T_{start}$ remains the same, i.e. 69.

g)
$$T_{100} = {\left\lceil {100 \over 64} \right\rceil} *
                (69) + 2*100
	= 338\ cycles
$$
$$
R_{100} = {2*100*200 \over 338} = 118.34 \ MFLOPS
$$

\problem \Points{10}
{\bf Vector Machine vs. ILP} -- Using the vector machine and program
fragment in H\&P Problem B.3, (i.e. the above problem)
consider a comparison
against a quad issue superscalar machine with equivalent clock rate
and functional unit parallelism (number of functional units and their
latencies).  Assume the quad issue, similar to the DEC Alpha 21164 and
IBM PowerPC 620 can issue at most two integer and two floating point
instructions per cycle (e.g. no more than 2 integer ops in a given
cycle and likewise for the floats).  Calculate $N_V$?  and show your
work.

Solution:\\
We unroll the loop repeatedly in blocks of the following code.
The upper case instructions are related.
Loop prefix and suffix are ignored.

\begin{verbatim}
Cycle
1       LOAD    LOAD
2       Addv    Multsv  Load    Load
3       Store   Store   Addv    Multsv
4                       Store   Store
...
13      Load    Load
14      ADDV    MULTSV  Load    Load
15      Store   Store   Addv    Multsv
16                      Store   Store
...
21      Load    Load
22      Addv    Multsv  Load    Load
23      STORE   STORE   Addv    Multsv
24                      Store   Store
25      bnz
\end{verbatim}

As can be seen,
the unrolled loop computes 2 iterations in 4 cycles.
Note there is no loop overhead other than the final branch
since we can schedule such instructions in the delay slots above.
We need about 36 registers, three for each iteration, and perhaps
a few more for the loop overhead calculations.
Note that we cannot make the loop faster since we are issuing
two Load/Store instructions every cycle, which is maximal.
Thus the scalar loop executes at 2 cycles per iteration.
For n elements, $Scalar\ T_n = 2*n\ cycles+some \ overhead$.

$$
Vector\ T_n = {\left\lceil {n \over MVL} \right\rceil} *
                (T_{loop} + T_{start}) + n * T_{chime}
$$
$$
= {\left\lceil {n \over 64} \right\rceil} * 69 + 2*n \approx 3n+69
$$
Thus we see that the scalar loop is always faster.
So, for this case $N_v = \infty$.

This is
because we did not permit loop unrolling and tailgating for
the vector loop.  If we did, we would get
$Vector\ T_n = 2n+69$, which is about equal to the quad-issue
super-scalar case.

We can see from this why ILP machines can
replace vector machines; since by careful compilation we can
obtain about the same performance without binary level code
incompatibility.  Also, this result points out that in fact the
major difference between ILP and Vector
machines is generally a cache vs. a multibanked memory system (real
memory bandwidth).

\problem \Points{15}
{\bf Software for Vector Machines} -- H\&P Exercises B.9 (page B-45) 

There is a loop-carried dependence in the inner loop from a(i-1,j)
to a(i, j), since the value of a(i-1,j) is needed to calculate a(i, j).

The loop can be parallelized and vectorized.  Since our dependences
move down individual columns of the matrix aa, each column can be
computed in parallel.  Or we could vectorize across rows as follows.

By performing loop inversion we get,
\begin{verbatim}
        do 290 i = 2, n
                do 290 j = i, n
                        aa(i,j) = aa(i-1,j) * aa(i-1,j) + bb(i,j)
    290 continue
\end{verbatim}

Now the loop can be vectorized:
\begin{verbatim}
        do 290 i = 2, n
                do 290 j = i, n
                        aa[i:j,n] = aa[i-1:j,n] * aa[i-1:j,n] + bb[i:j,n]
    290 continue
\end{verbatim}
The notation aa[i:j,n] means we are applying the computation to
elements j through n of vector aa[i].

\problem \Points{10}
{\bf Memory Hierarchies I} -- H\&P Exercise 5.3

a)
Size of the cache:  From the graph for the various data sizes,
we see that up to the input data size of 64K we have a low
access time.  For data sizes above 64K the access time increases
steeply.  This is because we have exceeded the cache size
and this increases the cache miss rate.
This tells us the cache size is 64K.

b)
There are two possible solutions to this problem.
(This demonstrates that the performance of even a very simple cache
cannot be simply captured in a micro-benchmark graph !)

(1) Consider the graphs of data size $\ge$ 128K.
Here misses will occur since the cache size is 64KB.
We see that the average access time steadies out at
about stride 32.  This is because we are now hitting
a different block with every access.  Prior to this
we hit more than once in a single block, thereby
reducing the average access time.  So, block size
is 32 bytes.  (16 bytes gets partial credit, since
32 bytes is the better choice.)

(2) Consider the graphs of data size $\le$ 64K.
No misses occur since the cache size is 64KB.
However the graphs diverge at stride=128.  This
is because now we are accessing in steps of more than
one block.
The variance of the performance lines for the small data sizes
indicate that there's variability in the read/write times, which
could arise from reducing the number of reads/writes per line to one.
This allows other factors to intrude significantly, whereas with the
smaller lines the noise is amortized.
So, block size is 64 bytes.
(128 bytes gets partial credit.)

c)
When block size equals stride we access a different block with every
access, and we never skip a block.  Thus drawing a vertical line
up at stride = block size, we see that the access time
for the ``size $\ge$ 128K'' graphs
(which miss at each block) is about 970 ns, and for the ``size $\le$64K'' graphs
(which never miss at any block) is about 350 ns.

Thus miss penalty for read + write is $970 - 350 \approx 620\ ns$.

Note that since we access a data item by reading and immediately
writing back to it, we have only one miss, the read miss.
So Miss Penalty (for read) $\approx$ 620 ns.

\problem \Points{10}
{\bf Memory Hierarchies II} -- H\&P Exercise 5.5

a)
Let us consider what happens in each case for 100 instructions.
There are 100 instruction fetches, of which 26 are loads and
9 are stores.  Since the stores always take 2 cycles, this
means the execution time is 109 cycles + stalls.

$$
Stalls_{wb} = 100\ fetches * 0.5\%\ miss\ rate * 50\ cycles\ miss\ penalty
$$
$$
	+ (26\ loads + 9\ stores) * 1\%\ miss\ rate * (50\ cycles\ miss\ penalty
$$
$$
		+ 50\%\ dirty\ blocks * 50\ cycles\ to\ write\ to\ memory)
$$
$$
	= 51.25\ cycles
$$

Similarly,
$$
Stalls_{wt} = 100\ fetches * 0.5\%\ miss\ rate * 50\ cycles\ miss\ penalty
$$
$$
        + (26\ loads+ 9\ stores) * 1\%\ miss\ rate * (50\ cycles\ miss\ penalty)
$$
$$
	= 42.5\ cycles
$$

Thus,
$ Execution\ Time_{wb} = 109 + 51.25\ cycles = 160.25\ cycles$.

$ Execution\ Time_{wt} = 109 + 42.5\ cycles = 151.5\ cycles$.

So, the write through cache is 5.46\% faster than the
write back cache.

b)
The calculations are the same, except that since the stores 
for the write through cache no longer take 2 cycles but
only 1 cycle for hits, this
means the execution time for the write through cache
is 100 cycles + stalls. 

Thus, $ Execution\ Time_{wb} = 109 + 51.25\ cycles = 160.25\ cycles$.

$ Execution\ Time_{wt} = 100 + 42.5\ cycles = 142.5\ cycles$.

So the  write through cache is 11.08\% faster than the
write back cache, i.e. about twice as fast as the previous
case.

\problem \Points{15}
{\bf Memory Hierarchies III} -- H\&P Exercise 5.7

a)
Given the derivation presented in the book, we note that the
miss-rate and miss-penalty equations are the same.  The hit time
formula is relevant here too, except that the alternate hit rate is now
only multiplied by 1 rather than 2.

$$
AMT_{pseudo} = HitTime_{1way} + (MissRate_{1way} - MissRate_{2way})*1
$$
$$
+ MissRate_{2way} * MissPenalty_{1way}
$$

We also assume there is no decrease in the hit rate to the fast blocks
with the proposed configuration.

b)
$$
AMT_{pseudo\ 2KB} = 1 + (0.098 - 0.076)*1+(0.076*50) = 4.822
$$
$$
AMT_{pseudo\ 128KB} = 1 + (0.010 - 0.007)*1 + (0.007*50) = 1.353
$$

There is a very slight improvement over the normal pseudo-associative
cache.  This is because of the factor of 1 rather than 2 for the
alternate hit rate.  More
correctly, it is because of our assumption that the hit rate remains
the same; this is very unlikely.

\end{problems}

\end{document}


