\documentstyle[class,11pt]{article}
\def\ClassName{Computer~Organization}
\def\ClassNumber{CS~333}
\handedout{(9/20/95) Handout \#7}
\vspace{0.5in}

\begin{center}
\large{\bf Homework \#3} \\
{\bf Due before 5:00 PM, September 29, 1995 \\
Turn in at TA's office L423 DCL (Basement)} \\
(Slide in under door if locked.)
\end{center}
\begin{document}

{\it As with all class assignments, you are expected to work the problems
on your own.  Discussions with other students should be limited to
understanding the problem and discussing general issues, NOT solving
the problems.  Electronic mail or posting to the course news group
(uiuc.class.cs333) is probably the fastest way to get clarification of
any problems, but you are welcome to contact any of the teaching staff
directly.  Check the WWW (URL: http://www-courses.cs.uiuc.edu/{$\sim$}cs333) 
for course information such as office hours.}

{\bf Introduction:} This assignment focuses on Instruction Set Architecture and
Implementation, emphasizing pipelining.  Before starting the problems,
you should have read Chapter 3 and Section 4.1 in the textbook.  {\bf This
homework includes laboratory exercises}, so please start early enough
so that you can become facile with the tools.  

{\bf Software Details:}
You will be using {\tt dlxcc} 
to compile a C program into DLX assembly code; 
and {\tt dlxsim} to execute DLX assembly code.
{\tt dlxcc} is a C compiler based on {\tt gcc},
and brief documentation (which is all you will need)
can be found in the online manual page.
You will need more information to use {\tt dlxsim},
and a postscript version of the {\tt dlxsim} report,
which includes the man page, helpful examples, etc.,
can be found in /home/cs333/doc on the sparcs in DCL 1235/1245.

You will need to have
/home/cs333/bin in your PATH,
and /home/cs333/man in your MANPATH.
These should already be set up for you; 
but if you have problems getting the tools to run,
please notify the TA.
If you do not have an account on the sparcs,
please contact the TA immediately 
(by email at jdesouza\@cs.uiuc.edu.)
{\it Any announcements will be posted to the class newsgroup 
(uiuc.class.cs333).}

{\bf Problems:}
If you feel that a problem is specified insufficiently, state clearly
and justify the assumptions you are making.  Please write your
solutions neatly, making sure to SHOW CLEARLY how you arrived at your
answers.  Illegible or insufficiently clear solutions may be graded 
as incorrect.

\begin{problems}
\problem \Points{10}
{\bf Instruction Sets} -- H\&P Exercises 2.7

\problem \Points{10}
{\bf Instruction Sets} -- H\&P Exercises 2.8

\problem \Points{25}
{\bf Basic Pipelining} -- H\&P Exercise 3.1

\problem \Points{10}
{\bf Hazards} -- H\&P Exercise 3.11

\problem \Points{75}
{\bf Pipelines and Software Issues}.
In this exercise, you will use software tools to explore the
performance of a simple numerical program.  Consider the simple C
matrix multiply program shown below.  In this problem, you will
characterize its performance, optimize it, and characterize its
fundamental limits.

\begin{verbatim}
/* Matrix Multiplication */
#define SIZE 10

main()
{
float a[SIZE][SIZE],b[SIZE][SIZE],c[SIZE][SIZE];
int i, j, k, sum;
\end{verbatim}

(continued on reverse)
\newpage
\begin{verbatim}
for (i=0; i<SIZE; i++){
   for (j=0; j<SIZE;j++){
      sum = 0;
      for (k=0; k<SIZE; k++) sum += a[i][k] * b[k][j];
      c[i][j] = sum;
      };
   };
}
\end{verbatim}

a. Compile the program using {\tt dlxcc} and execute it using {\tt
dlxsim}.  Looking at the assembly output produced by {\tt dlxcc},
write an analytic expression for the number of instructions
executed as a function of {\tt SIZE}.  Verify your expression
for SIZE 10, 17, 39 using the {\it stats} command of {\tt dlxsim}.
(For SIZE 39, {\tt dlxsim} might take a while to execute the program.)
List out your expression's calculated value and the dlxsim ``Total operations''
value for the above cases in the form of a table on your answer sheet.
Also record the number of total cycles, load stalls, FP stalls, and branches (total,
taken \%, and untaken \%) that DLXsim shows for each case.

b. If {\tt load}s take a single cycle, floating point {\tt add}s and
{\tt mult}s take three cycles, and all other instructions take a
single cycle, write a simple analytical expression for the
execution time (in cycles) as a function of SIZE.  Verify your expression
for SIZE 10, 17, 39 using the {\it stats} command of {\tt dlxsim}.
(Do not forget to change the default FP hardware configuration of DLXsim.)
List out your expression's calculated value and the dlxsim ``Total cycles''
value for the above cases in the form of a table on your answer sheet.

c. Compile
the source program, for SIZE 17, with the {\tt dlxcc -O} (optimization) option.
Run this optimized version of the program on {\tt dlxsim}.
Record the number of total operations, total cycles,
load stalls, FP stalls, and branches (total,
taken \%, and untaken \%) that {\tt dlxsim} shows for this case.
Comment on the differences between the unoptimized and optimized
versions.  Can you improve the innermost loop of the optimized 
version solely by re-scheduling code, i.e without loop unrolling ?

d. Optimize the pipelined execution of the optimized version above (innermost loop only) by
unrolling the loop, renaming registers, 
and so on.
(Do not optimize the entire program, but merely the assembly code
of the innermost loop.)
Include a print-out of your optimized inner loop in your answer sheet.
Carefully detail the code changes you made, and comment on what a
compiler would need to know to make such changes automatically.

Run your optimized version of the program on {\tt dlxsim}.
Record the number of total operations, total cycles,
load stalls, FP stalls, and branches (total,
taken \%, and untaken \%) that {\tt dlxsim} shows for this case.
Characterize the achieved performance and how much difference 
loop unrolling made, by \\
1. Developing an analytic expression for the
execution time (in cycles) as a function of SIZE.\\
2. Comparing and contrasting this case with the unoptimized version.\\
3. Comparing and contrasting this case with Case (c).

e. What are the limits on the performance achievable for the innermost
loop on the machine characterized by the default settings for DLXsim
(a basic R2000), and were you able to achieve them?  Comment on how
a machine designer might relax these constraints for higher
performance (give specific suggestions and assess the potential
benefit).

\end{problems}

\end{document}

