\documentstyle[class,11pt]{article}
\def\ClassName{Computer~Organization}
\def\ClassNumber{CS~333}
\handedout{(9/22/95) Handout \#8}
%\vspace{0.5in}

\begin{center}
\large{\bf Solutions to Homework \#2} \\
\end{center}

\begin{document}

%Common Error: Work was not shown. {\it Please make sure to show
%how you obtained your solution.  If you used the right formula,
%but made mistakes in calculation, you can still get some points !
%If you did the right thing and got the right solution; but showed
%no work -- you will not be given points !!
%Etc. Etc...}

\begin{problems}
\problem \Points{25}
{\bf Instruction Set Types} -- H\&P Exercise 2.3

{\bf Purpose of the Exercise:}  Your task is to compare the memory
efficiency of different styles of ISA.

{\bf Solution:} \\
The assembly mnemonics are self-explanatory.  a, b, and c are the
addresses of A, B and C respectively.  In the cases where more than
one operand appears in an instruction, the target operand is first.

a. Accumulator ISA:

\begin{verbatim}
  INSTRUCTIONS        Instruction        Bytes of Data    Comments
                      Size               Accessed         (Acc = Accumulator)
  ------------        -----------        -------------    ----------------
  load b              1+2                4                Acc = B
  add c               1+2                4                Acc = B+C
  store a             1+2                4                A = Acc = B+C
  add c               1+2                4                Acc = A+C
  store b             1+2                4                B = Acc = A+C
  neg                 1                                   Acc = -B
  add a               1+2                4                Acc = -B+A = A-B
  store d             1+2                4                D = Acc = A-B
  ------------        -----------        -------------
  TOTAL                22                28        =50
\end{verbatim}

An ingenious solution was proposed for this case by one of the students, 
using the fact that $D = A - B = A-(A+C_{original}) = -C_{original}$.
However, this can be applied to all cases.

b. Memory-Memory ISA:

\begin{verbatim}
  INSTRUCTIONS        Instruction        Bytes of Data        Comments
                      Size               Accessed
  ------------        -----------        -------------        ----------------
  add a, b, c         1+2+2+2            4+4+4                A = B+C
  add b, a, c         1+2+2+2            4+4+4                B = A+C
  sub d, a, b         1+2+2+2            4+4+4                D = A-B
  ------------        -----------        -------------
  TOTAL                21                36        =57
\end{verbatim}

\newpage

c. Stack ISA:

Case 1:
\begin{verbatim}
  INSTRUCTIONS        Instruction        Bytes of Data    Comments
                      Size               Accessed         (TOS, TOS-1...)
  ------------        -----------        -------------    ----------------
  push b              1+2                4                B
  push c              1+2                4                C,B
  add                 1                                   B+C
  pop a               1+2                4                        (A=B+C)
  push a              1+2                4                A
  push c              1+2                4                C,A
  add                 1                                   A+C
  pop b               1+2                4                        (B=A+C)
  push a              1+2                4                A,B
  push b              1+2                4                B
  sub                 1                                   A-B
  pop d               1+2                4                        (D=A-B)
  ------------        -----------        -------------
  TOTAL                30                36        =66
\end{verbatim}

The last ``push a'' and ``push b'' could be in reverse order, depending on
the definition of the ``sub'' instruction.

Case 2: \\
If a dup instruction were available which duplicates the top-most
element of the stack (a dup is basically a push without a memory access),
it might appear that we can do the following:

\begin{verbatim}
  INSTRUCTIONS        Instruction        Bytes of Data    Comments
                      Size               Accessed         (TOS, TOS-1...)
  ------------        -----------        -------------    ----------------
  push b              1+2                4                B
  push c              1+2                4                C,B
  add                 1                                   B+C
  dup                 1                                   B+C, B+C
  pop a               1+2                4                A        (A=B+C)
  dup                 1                                   A, A
  push c              1+2                4 *              C,A,A
  add                 1                    *              A+C,A
  dup                 1                    *              A+C, A+C, A
  pop b               1+2                4 *              B,A        (B=A+C)
  neg                 1                                   -B,A
  add                 1                                   -B+A
  pop d               1+2                4                           (D=A-B)
  ------------        -----------        -------------
  TOTAL                25                24        =49
\end{verbatim}

As you can see this reduces our memory bandwidth.
However, since our stack is only two registers deep, whenever it grows
above size 2, every operation (indicated by *'s) could take an extra 
memory access.  This shows us that it is the number of registers
rather than the stack model that provides performance.

We could try to keep the stack below size 2, and still use
dups to do a better job than in case 1. However, it
is unfair to compare a stack ISA with only 2 registers with
the load/store architecture that has 16 registers,
so no points will be deducted for the above case, even though
it is, if one takes the textbook question literally, incorrect.

d. Load-Store ISA:        DLX is assumed.

\begin{verbatim}
  INSTRUCTIONS        Instruction        Bytes of Data    Comments
                      Size               Accessed         (TOS, TOS-1...)
  ------------        -----------        -------------    ----------------
  lw R1,b(R0)         1+.5+2+.5=4        4                R1 = B
  lw R3,c(R0)         4                  4                R3 = C
  add R1, R1, R3      1+.5+.5+.5=3                        R1 = B+C
                 (Actally, 2.5, but all instructions are integral in size.)
  sw a(R0), R1        1+2+.5+.5=4        4                A = B+C, so R1==A
  add R2, R1, R3      3                                   R2 = A+C
  sw b(R0), R2        4                  4                B = A+C, so R2==B
  sub R4, R1, R2      3                                   R4 = A-B
  sw d(R0), R4        4                  4                D = A-B, so R4==D
  ------------        -----------        -------------
  TOTAL                29                20        =49
\end{verbatim}

So we can see that \\
1. The best ISA by code size is Memory-Memory with 21.\\
2. The best ISA by (code+data) bandwidth is Load-Store or Stack (case 2 only !)

Note:  The text says to ``write the best code for the fragments
given.''  Some students have assumed each statement to be a different
fragment.  That is not exactly right, but since the interpretation
seems valid and the
purpose of the exercise is not affected, no points will be
taken off.

{\bf Common Mistakes:} \\
1. Assuming that each statement must be assembled separately. \\
2. Not using an integral value for the total size of an instruction.

\problem \Points{25}
{\bf Relation of High Level and Machine Languages} -- H\&P Exercise 2.6

{\bf Purpose of the Exercise:}  To help you understand how a
simple compiler will compile code, and to familiarize you with
DLX assembly code for the next assignment.

{\bf Solution:} \\
The code to be compiled is:
\begin{verbatim}
        for (i=0; i<=100; i++)
                {A[i] = B[i] + C;}

; Assignment of variables is as follows:
;     R1 == A[i]     0 <= i <= 100
;     R2 == B[i]     0 <= i <= 100
;     R3 == C
;     R9 == i * 4    0 <= i <= 100
;     R10 == boolean test (i<=100)
; `index' means i, `offset' means i*4 since each operand is 4 bytes in size.

[1]     sw i(R0), R0           ; i=0
[2]  loop:
[3]     lw R9, 2000(R0)        ;          R9 = i*4
[4]     lw R2, 5000(R9)        ;          R2 = B[i]
[5]     lw R3, 1500(R0)        ;          R3 = C
[6]     add R1, R2, R3         ;          R1 = B[i] + C
[7]     sw 0(R9), R1           ; A[i] = (R1) == B[i] + C
[8]     addi R9, R9, #4        ;          R9 = (i+1) * 4 = 4*i + 4 = (R9) + 4
[9]     sw 2000(R0), R9        ; i = i+1
[10]    slei R10, R9, #400     ; if (i <= 100)
[11]    bnez R10, loop         ;         goto loop

[12]    halt
\end{verbatim}

There are a total of 11 instructions, of which 9 are within the loop.
The loop executes 101 times.  
The dynamic (or runtime) number of instructions executed is $2 + 9*101 = 911 $.
Memory data references occur in load/store instructions only. 
There are 6 load/store's, with 5 inside the loop.
The number of memory data references is $1+ 5*101 = 506 $.
The total number of instructions is 11.  Since every DLX instruction
is 4 bytes long, the (static) code size in bytes, as asked in the question, is
44 bytes.

You can find the following files in {$\sim$}cs333/public\_html/Notes/sol2/ on
the sparcs :
\begin{itemize}\itemsep=-3pt
\item hw2p2.c	- input to dlxcc, the C compiler for DLX.
\item hw2p2.commented.s	- Compiled, commented DLX assembly code.
\item hw2p2.optimized.commented.s - Optimization compiled, commented DLX assembly code.
\end{itemize}

You are strongly encouraged to take a look at these files to see how a
compiler might compile the above fragment of code.

{\bf Common Mistakes:} \\
1.  Not taking into account the fact that the array elements were
stored in consecutive words of 4 bytes, not in consecutive bytes. \\
2.  Not computing extra instructions for loops in which the branch
was tested at different stages of the loop.\\
3.  Not realizing that each instruction was 4 bytes long.

\problem \Points{25}
{\bf CPI and Performance} -- H\&P Exercise 2.11

{\bf Purpose of the Exercise:}
To enhance your intuition about CPI.

{\bf Solution:} \\
The averages of {\tt gcc} and {\tt expresso} are:

\begin{verbatim}
        add 19.2
        sub .25
        mul .05
    compare 10.35
   load imm 4.05
      shift 6.6
        and 5.5
         or 4.5
      other 1.25
            -----
            51.75

       load 21.9
      store 9.7
            ----
            31.6

cond.branch 13.25
            -----
            13.25 (taken = 13.25% * 60%, untaken = 13.25% * 40%)

uncond branch 0.9
         call 0.75
       return 1.0
              ----
              2.65

CPI = 51.75% * 1+ 31.6% *1.4 + 13.25% * 60% * 2 + 13.25% * 40% * 1.5 
              + 2.65% * 1.2
    = 1.2298 = 1.23 cycles per instruction
\end{verbatim}

Depending on rounding, and whether you normalized for the missing 0.75\%
of instructions, you can get a CPI of upto 1.24.

{\bf Common Mistakes:} \\
1. Not grouping instructions into the correct categories - this sometimes
did not affect the final answer, but is wrong nevertheless ! \\
2. Not averaging gcc and expresso instruction mixes or CPI's.

\problem \Points{25}
{\bf Tradeoffs in Adding Instructions} H\&P 2.12

{\bf Solution:}

a.
10\% of displacement loads and stores can use the new addressing
mode.  Figure 2.26 of the text tells us that 26\% and 9\% of the
instruction mix are displacement loads and stores respectively, i.e. 35\% of
the total instructions can use the new addressing mode.  

Each time
the new addressing mode is used, two instructions get merged into
one, i.e. one less Add/Load instruction is executed.
The new addressing mode is used for 10\% of the displacement
loads and stores.  
So 10\% of 35\% gives us 3.5\% fewer instructions using the
new addressing mode.  Thus the ratio of instruction counts is

$$
	{Instruction\ Count_{new} \over Instruction\ Count_{old}}
	= {96.5\% \over 100\%} = 0.965
$$

b.
We know that:
$$
CPU\ Time = CPI* {Clock\ Cycle\ Time} * {Instruction\ Count}
$$

We assume the new CPI and old CPI are the same (although, strictly
speaking, they might not be) since insufficient information is given
in the question to calculate otherwise.
$$
Speedup = {CPI_{new}*{Clock\ Cycle\ Time_{new}} * {Instruction\ Count_{new}} \over
 	CPI_{old}*{Clock\ Cycle\ Time_{old}} * {Instruction\ Count_{old}} }
$$
$$
	= {CPI_{old}*(1.05*{Clock\ Cycle\ Time_{old}}) * 
			(0.965*{Instruction\ Count_{old}}) \over
 	CPI_{old}*{Clock\ Cycle\ Time_{old}} * {Instruction\ Count_{old}} }
$$
$$
	= 1.013
$$

Thus we see that the new addressing mode, though it improves MIPS,
actually decreases performance by $0.013 = 1.3\%$.

{\bf Common Mistakes:} \\
1. Dividing the 10\% of the 35\% of load/store instructions by 2,
yielding a result of .9825 instead of .965.

\end{problems}

\end{document}

