\documentstyle[class,11pt]{article}
\def\ClassName{Computer~Organization}
\def\ClassNumber{CS~333}
\handedout{(10/13/95) Handout \#10}
\vspace{0.5in}

\begin{center}
{\bf Solutions to Homework \#3}
\end{center}

\begin{document}
\def\tenrm=\cmex10

{\bf Introduction:} This assignment focuses on Instruction Set Architecture and
Implementation, emphasizing pipelining.  Before starting the problems,
you should have read Chapter 3 and Section 4.1 in the textbook.  {\bf This
homework includes laboratory exercises}, so please start early enough
so that you can become facile with the tools.  

\begin{center}
{\bf Solutions:}
\end{center}

\begin{problems}
\problem \Points{10}
{\bf Instruction Sets} -- H\&P Exercises 2.7

Well, I guess everyone must be convinced by now of the relative
simplicity of the DLX or RISC ISA, compared to the 80x86 ISA !
Here is one possible solution to the question:

\begin{verbatim}
        movl [2000], #0 ; i = 0
loop:
        movl [2000], %eax ; eax = i
        sall %eax, #2     ; eax = i*4
        addl %eax, #5000  ; eax = address of B[i]

        movl %ebx, 0[%eax]; ebx = B[i]
        addl %ebx, [1500] ; ebx = B[i] + C

        movl %eax, [2000] ; eax = i
        sall %eax, #2     ; eax = i*4 = address of A[i]

        movl 0[%eax], %ebx; A[i] = B[i] + C

        movl %eax, [2000] ; eax = i
        incl %eax         ; eax = i+1
        movl [2000], %eax ; i = eax

        cmpl %eax, #101   ; i == 101 ?
        jne  loop         ; if not, repeat loop
\end{verbatim}

Instructions executed $= 1 + (13 * 101) = 1314$.

Data References executed $= 1 + (7 * 101) = 708$.

Code Size = 60 bytes, according to an x86 assembler.

\newpage
\problem \Points{10}
{\bf Instruction Sets} -- H\&P Exercises 2.8
\begin{verbatim}
                ;; R1 is A[i]
                ;; R2 is B[i]
                ;; R3 is C
                ;; R9 is i*4

                lhi r9, #404     ; We go down the loop, i = 100 ... 0
                lw  r3, 1500(r0) ; r3 = C
        loop:
                subi r9, r9, #4  ; r9 = i * 4
                lw r2, 5000(r9)  ; r2 = B[i]
                add r1, r2, r3   ; r1 = B[i] + C
                sw 0(r9), r1     ; A[i] = B[i] + C
                bnez r9, loop    ; r9 == 0 when i == 0

                lhi r9, #101     ; r9 = i = 101 at end of loop
                sw 2000(r0), r9  ; i = 101
\end{verbatim}
Instructions executed $= 2 + (5 * 101) + 2 = 509$.

Data references executed $= 1 + (2 * 101) + 1 = 204$.

Code Size $= 9 * 4 = 36$ bytes.

\problem \Points{25}
{\bf Basic Pipelining} -- H\&P Exercise 3.1

{\bf Solution to 3a.}
\begin{verbatim}
                                           1 1 1 1 1 1 1 1 1 1 2 2
                         1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        lw r1, 0(r2)     F D X M W
        addi r1,r1,#1      F s s D X M W
        sw r1,0(r2)              F s s D X M W
        addi r2,r2,#4                  F D X M W
        sub r4,r3,r2                     F s s D X M W
        bnz r4, loop                           F s s D X M W
        lw r1,0(r2)                                  F s s F D X M W
\end{verbatim}
\begin{itemize}
\item
Cycles 3-4: addi stalls ID to wait for lw to write back r1.
\item
Cycles 6-7: sw stalls ID to wait for addi to write back r1.
\item
Cycles 10-11: sub stalls ID to wait for addi to write back r2.
\item
Cycles 13-14: bnz stalls ID to wait for sub to write back r4.
\item
Cycles 16-17: bnz computes the next PC in MEM; so the lw of the
next iteration cannot be fetched until after cycle 17.
\end{itemize}

From the above figure, we see that the second iteration begins
17 clocks after the first iteration and the last iteration
takes 18 clocks to complete.  Since the loop executes 99 times
totally, it takes $(98*17)+18 = 1684$ clock cycles.

Common Errors:\\
1. Using the single-cycle branch delay pipeline of Fig.3.22,
rather than the simple pipeline of Fig.3.4 without forwarding
or bypassing.

{\bf Solution to 3b.}
\begin{verbatim}
                                           1 1 1 1 1 1 1 1 1 1 2 2
                         1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        lw r1, 0(r2)     F D X M W
        addi r1,r1,#1      F D s X M W
        sw r1,0(r2)          F s D X M W
        addi r2,r2,#4            F D X M W
        sub r4,r3,r2               F D X M W
        bnz r4, loop                 F D X M W
        lw r1,0(r2)                    F m m F D X M W     case(1)
             or
        lw r1,0(r2)                    F m F D X M W       case(2)
\end{verbatim}
\begin{itemize}
\item
Cycle 4: addi stalls EX to wait for lw to forward r1.
\item
Cycles 8-10: sw stalls ID to wait for addi to write back r1.
\item
Cycles 10-11: sub stalls ID to wait for addi to write back r2.
\item
Cycles 13-14: bnz stalls ID to wait for sub to write back r4.
\item
Cycles 8-10: bnz computes the next PC in MEM; so the lw of the
next iteration cannot be fetched until after cycle 10.
\end{itemize}
The first lw is based on the pipeline of Fig.3.21 of the text, and 
the second lw is for those who have incorrectly assumed the pipeline
of Fig. 3.22 (no points deducted for the repeated error.)

Since the bnz is predicted as not taken, the instruction statically
following the bnz in memory is fetched and goes to decode before
the pipeline is aware of the misprediction.

For the first case
from the above figure, we see that the second iteration begins
10 clocks after the first iteration and the last iteration
takes 11 clocks to complete.  Since the loop executes 99 times
totally, it takes $(98*10)+11 = 991$ clock cycles.

For the second case we get 795 cycles.

\vfill
{\bf Solution to 3c.}

The original code has a load delay slot, and a branch delay slot.
Basically, a correct solution will have no stalls.
One possible solution is as follows:
\begin{verbatim}
        loop:
                lw r1, 0(r2)     ; r1 = A[i]
                addi r2, r2, #4  ; i*4 = i*4+4
                addi r1, r1, #1  ; r1 = A[i]+1
                sub r4, r3, r2   ; r4 = r3 - i*4
                bnz r4, loop
                sw r1,-4(r2)     ; store A[i], executes in the branch delay slot
\end{verbatim}
\vfill
\rightline{\bf ...contd. over}
\newpage
The pipeline timing diagram is :
\begin{verbatim}
                                           1 1 1 1 1 1 1 1 1 1 2 2
                         1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        lw r1, 0(r2)     F D X M W
        addi r2,r2,#4      F D X M W
        addi r1,r1,#1        F D X M W
        sub r4,r3,r2           F D X M W
        bnz r4, loop             F D X M W
        sw r1,-4(r2)               F D X M W
        lw r1,0(r2)                  F D X M W
\end{verbatim}
Total number of cycles is $(98*6)+10=598$.

\problem \Points{10}
{\bf Hazards} -- H\&P Exercise 3.11

If IF/ID contains any instruction that requires its source operands
at the beginning of EX (i.e. loads, register-ALU, FP Add, FP Multiply)
then stall if:

\begin{itemize}
\item
ID/EX.opcode has a load instruction and ID/EX.IR{11..15} = IF/ID.IR{6..10}, OR
\item
\{ID/A1 or A1/A2 or A2/A3\}.opcode has an (Add, Sub, Convert) instruction
and
$
\{same\} .$
IR{11..15} = IF/ID.IR{6..10}\ or\ IF/ID.IR{11..15}, OR
\item
\{ID/M1 or M1/M2 or ... or M5/M6\}.opcode has an instruction
and \{same\}.IR{11..15} = IF/ID.IR{6..10} or IF/ID.IR{11..15}.
\end{itemize}

If IF/ID contains any Store instruction (these require source operands
only at the beginning of MEM) then stall if:

\begin{itemize}
\item
We do not need to stall for loads.
\item
\{ID/A1 or A1/A2\}.opcode has an (Add, Sub, Convert) instruction
and \{same\}.IR{11..15} = IF/ID.IR{6..10} or IF/ID.IR{11..15}
(i.e. We do not need to stall for A2/A3.)
, OR
\item
\{ID/M1 or M1/M2 or ... or M4/M5\}.opcode has an instruction
and \{same\} .IR{11..15} = IF/ID.IR{6..10} or IF/ID.IR{11..15}.
(i.e. We do not need to stall for M5/M6.)
\end{itemize}

WAR hazards never occur since all instructions are issued in order
and read operands in the ID stage (or have them forwarded in EX.)

WAW hazards can be handled by determining if ANY instruction in
\{ID/EX, ID/A1 ... A4/MEM, ID/M1 ... M7/MEM\} has the same
destination register as the instruction in IF/ID.

Common Error:\\
1. Almost no-one considered the second case (store).  No points were
deducted.

\newpage
\problem \Points{75}
{\bf Pipelines and Software Issues}.
In this exercise, you will use software tools to explore the
performance of a simple numerical program.  Consider the simple C
matrix multiply program shown below.  In this problem, you will
characterize its performance, optimize it, and characterize its
fundamental limits.

\begin{verbatim}
/* Matrix Multiplication */
#define SIZE 10

main()
{
float a[SIZE][SIZE],b[SIZE][SIZE],c[SIZE][SIZE];
int i, j, k, sum;

for (i=0; i<SIZE; i++){
   for (j=0; j<SIZE;j++){
      sum = 0;
      for (k=0; k<SIZE; k++) sum += a[i][k] * b[k][j];
      c[i][j] = sum;
      };
   };
}
\end{verbatim}

a. Compile the program using {\tt dlxcc} and execute it using {\tt
dlxsim}.  Looking at the assembly output produced by {\tt dlxcc},
write an analytic expression for the number of instructions
executed as a function of {\tt SIZE}.  Verify your expression
for SIZE 10, 17, 39 using the {\it stats} command of {\tt dlxsim}.
(For SIZE 39, {\tt dlxsim} might take a while to execute the program.)
List out your expression's calculated value and the dlxsim ``Total operations''
value for the above cases in the form of a table on your answer sheet.
Also record the number of total cycles, load stalls, FP stalls, and branches (total,
taken \%, and untaken \%) that DLXsim shows for each case.

{\bf Solution:}

Some students typed ``go'' instead of ``go \_main'' at the dlxsim prompt.
This means that the ``crt0.o'' file
(which is prepended to every file compiled in C)
was also executed in dlxsim.
This added something
like 64 instructions to the number of operations executed.
I am only going to give the solution for the ``go \_main'' case.
(No points were deducted for the other case.)

\begin{verbatim}
     Total      Total FP   Total     Load    FP         B r a n c h e s
SIZE Operations Operations Cycles    Stalls  Stalls  Total  Taken   Untaken
10      42,310    4,100       52,620   5,310   5,000  1,221  111     1110
                                                            (9.09%) (90.91%)
17     200,895   19,941      250,909  25,449  24,565  5,526  307     5219
                                                            (5.56%) (94.44%)
39   3,636,251  358,956    4,714,679 301,197 777,231 62,440 1561    60879
                                                            (2.5%)  (97.5%)
\end{verbatim}

The function for the total number of operations as
a function of SIZE, x, is
$f_1(x) = 39x^3 + 31 x^2 +17x + 40$.
Note that this is $O(n^3)$.
\begin{center}
\begin{tabular}{|c|c|}
\hline
SIZE   & f(SIZE) \\
\hline
10      &  42,310 \\
17      & 200,895 \\
39    & 2,361,295 \\
\hline
\end{tabular}
\end{center}
This matches the empirical data for SIZE 10 and 17, but fails
miserably for SIZE 39.  This is because the compiler code for
SIZE 39 is different from the code for SIZE 10 and 17.

The difference lies in the computing of array offsets:
	e.g. offset(a[i][j]) = (i*17 + j)*4

For size 17 the compiler computes the array offsets as follows:
$(i << 16+i) << 4 + j << 4$
i.e. using slli and add.

For size 39 the compiler computes the array offset (i*39+j)*4, as
	$i * 156 + j << 4$
i.e. using mult, slli, and add.

Since integer mults are FP operations in DLX, this bumps up the
number of FP operations in the inner loop from 4 to 6 (one extra
mult each for a[i][j] and b[i][j].)  This explains
why the number of FP operations increases to 358956 rather than
something like $(17/39)^3 * 19941 = 240765$ FP ops.

Also, since we have to move between FP registers and Integer registers,
the total number of operations in the inner loop increases from about
40 to 62.
This changes the coefficients of our analytical expression relating
operations as a function of size; and explains why we get 3636,251
total ops instead of our analytical result of 2361,295.

\smallskip
\hrule
\smallskip
b. If {\tt load}s take a single cycle, floating point {\tt add}s and
{\tt mult}s take three cycles, and all other instructions take a
single cycle, write a simple analytical expression for the
execution time (in cycles) as a function of SIZE.  Verify your expression
for SIZE 10, 17, 39 using the {\it stats} command of {\tt dlxsim}.
(Do not forget to change the default FP hardware configuration of DLXsim.)
List out your expression's calculated value and the dlxsim ``Total cycles''
value for the above cases in the form of a table on your answer sheet.

{\bf Solution:}

The dlxsim command line should be:

\begin{center}
{\tt dlxsim -al3 -ml3 -dl1}
\end{center}

The load latency is by default 1, as specified in Section 3.4.1 (Load
Stalls) of the dlxsim manual.

The results obtained are:
\begin{center}
\begin{tabular}{|c|c|}
\hline
SIZE & Total Cycles \\
\hline
10   & 51,620 \\
17   & 245,996 \\
39   & 4,415,042 \\
\hline
\end{tabular}
\end{center}
The function for the total number of cycles as a function of SIZE is
$$f_2(x) = f_1(x) + stalls(x) = (39x^3 + 31 x^2 +17x + 40) +
	(9x^3 + 3x^2 + x)
$$
$$
= 48x^3 + 34 x^2 +18x + 40
$$

Evaluating this expression for SIZE we have:

\begin{center}
\begin{tabular}{|c|c|}
\hline
SIZE & Total Cycles \\
\hline
10   & 51,620 \\
17   & 245,996 \\
39   & 2,899,768 \\
\hline
\end{tabular}
\end{center}
Again, the value of $f_2(x)$ for SIZE 39 does not match, for the
same reason given above.

\smallskip
\hrule
\smallskip
c. Compile
the source program, for SIZE 17, with the {\tt dlxcc -O} (optimization) option.
Run this optimized version of the program on {\tt dlxsim}.
Record the number of total operations, total cycles,
load stalls, FP stalls, and branches (total,
taken \%, and untaken \%) that {\tt dlxsim} shows for this case.
Comment on the differences between the unoptimized and optimized
versions.  Can you improve the innermost loop of the optimized 
version solely by re-scheduling code, i.e without loop unrolling ?

{\bf Solution:}\\
The dlxcc command line should be:
\begin{center}
{\tt dlxcc -O 17.c}
\end{center}

The results obtained are:
\begin{center}
\begin{tabular}{|r|l|}
\hline
SIZE                &   17 \\
Total Operations    & 113,584  \\
Total FP Operations &  19,941 \\
Total Cycles        & 143,062 \\
Load Stalls         &   4,913 \\
FP Stalls           &  24,565 \\
Total Branches      &   5,526 \\
Taken Branches      &   4,912 (88.89\%) \\
Untaken Branches    &     614 (11.11\%) \\
\hline
\end{tabular}
\end{center}

Comparing 5c to 5a, we have the
Ratio of Total Operations $= {5c\over 5a} = 56.5\%$.
So, compiler optimization gives us a performance improvement
of about 45\%.

Optimization of the inner loop is possible.  The stop and go
commands in dlxsim can be used to greatly simplify the detection
of stalls.  Once stalls are detected, instructions can be rescheduled
to remove stalls.  Load stalls can be made 0, and FP stalls can be
reduced but not eliminated by rescheduling in this particular case.
The sample rescheduled code of one of the students (used with permission)
can be found in $\sim$cs333/public\_html/Notes/sol3/hw3.5c.

\smallskip
\hrule
\smallskip
d. Optimize the pipelined execution of the optimized version above (innermost loop only) by
unrolling the loop, renaming registers, 
and so on.
(Do not optimize the entire program, but merely the assembly code
of the innermost loop.)
Include a print-out of your optimized inner loop in your answer sheet.
Carefully detail the code changes you made, and comment on what a
compiler would need to know to make such changes automatically.

Run your optimized version of the program on {\tt dlxsim}.
Record the number of total operations, total cycles,
load stalls, FP stalls, and branches (total,
taken \%, and untaken \%) that {\tt dlxsim} shows for this case.
Characterize the achieved performance and how much difference 
loop unrolling made, by \\
1. Developing an analytic expression for the
execution time (in cycles) as a function of SIZE.\\
2. Comparing and contrasting this case with the unoptimized version.\\
3. Comparing and contrasting this case with Case (c).

{\bf Solution:}

Solutions can differ in the number of unrollings of the innermost
loop.  Basically, the aim is to eliminate all load stalls and FP stalls.
Because assembly code is so impenetrable, code changes must be ``clearly
detailed'' or points are taken off.

Optimized code:\\
The sample 2-unrolled code of one of the students (used with permission)
can be found in $\sim$cs333/public\_html/Notes/sol3/hw3.5d.

Comments on what the compiler needs to know: \\
1. How many iterations the loop is going to execute. \\
2. What about pointers, i.e. aliasing of memory locations ? \\
3. Can the condition of the {\tt for} statement be changed by the
body of the loop ?

1)
Analytical expressions will likely be different because each student's
code is very likely to be different.
But your calculated result must match your simulated result.
An analytical expression (valid for only SIZE 17) for 2 unrollings 
and optimization is:
$$
=13x^3+8x^2+14x+60 = 66479
$$


2)
Comparing 5d to 5a, we have:
Ratio of Total Operations $= {5d\over 5a} = {66479 \over 200895} = 33\%$.
We are performing only 33\% of the number of
operations by manually optimizing, i.e. we are executing the same
code 3 times faster.

3)
Comparing 5d to 5c, we have:
Ratio of Total Operations $= {5d\over 5c} = {66479 \over 113584} = 58.5\%$.
Still very good, we are doing 40\% better than the optimizing compiler.

Is most of the code in the world
running at about 60\% of its potential speed, or even less ?  One
cannot extrapolate on so little data, but there
is certainly scope for improvement and further study.
Does it matter, since hardware improves this much every few years ?
Yes, because we need the performance !

\smallskip
\hrule
\smallskip
e. What are the limits on the performance achievable for the innermost
loop on the machine characterized by the default settings for DLXsim
(a basic R2000), and were you able to achieve them?  Comment on how
a machine designer might relax these constraints for higher
performance (give specific suggestions and assess the potential
benefit).

{\bf Solution:}
In a general pipeline, the fundamantal limits are the number
of registers,
the number of functional units,
the latencies of the functional units,
and whether these are pipelined or not.

For this particular pipeline and code, we have sufficient registers
to remove all load stalls by register renaming,
and so we turn our attention to the FUs.

Our default pipeline has FP add, multiply and divide latencies of
2, 5, and 19 cycles respectively.  FP Divide is not used by our
code sample.

By examining the code, we can see that our FP stalls are because of
the 4 cycle FP Multiply delay, and the fact that the FP Multiply
unit is NOT pipelined.  We could certainly benefit from either
increasing the number of FPMUs, decreasing its latency, making the
FPMU pipelined, or a combination of these.

So these are some constraints that the designer might relax for
higher performance.  The potential benefit would be to reduce the
number of unavoidable FP stalls for common cases, and thereby
improve performance.

\end{problems}

\end{document}

