\documentstyle[class,11pt]{article}
\def\ClassName{Computer~Organization}
\def\ClassNumber{CS~333}
\handedout{(11/22/95) Handout \#16}
\vspace{0.5in}

\begin{document}
\begin{center}
{\large \bf Solutions to Homework \#6} \\
\end{center}

{\bf Introduction:} This assignment focuses on cache memory
hierarchies.  Before starting the homework, you should have read
Chapter 5 in the textbook.

{\bf Solutions:}
Many students made mistakes solving the problems.  Please go through
the solution carefully to understand the behaviour of write-through
vs. write-back caches during hits and misses.

\begin{problems}
\problem \Points{15}
{\bf Memory Hierarchies I} -- H\&P Exercise 5.4

a)
We have $10^9$ cache references being generated per second.
Let us calculate how many memory accesses this generates.
(Each memory access is one word.)
For a write through cache we have the following table:

\begin{center}
\begin{tabular}{|l|l|c|c|}
\hline
Type of Cache & Frequency & Memory Accesses/ & Average Memory Accesses/ \\
Reference && Cache Reference &Cache Reference \\
\hline
read hit   & 95\% * 75\% & 0 & 0 \\
write hit  & 95\% * 25\% & 1 & .238 \\
read miss  & 5\% * 75\% & 2 & .076 \\
write miss & 5\% * 25\% & 3 & .039 \\
\hline
\end{tabular}
\end{center}

Note that for a write hit the write through cache updates
only the word that is being written in main memory, i.e. it
does not write the block (two words) to main memory.

For a read miss, we have 2 memory accesses since we retrieve the
block containing the missed word.  Blocks are two words long.

For a write miss on a write through cache, a block is never
dirty.  So we just read in the missed block (read miss) and then do a
write (write hit.)

Totalling, we obtain the number of memory accesses per reference to cache.
This works out to .353.  Thus the memory bandwidth used is:
$$
Bandwidth\ used = { .35 \ accesses/ cache\ reference* 10^9\ cache\ references/ second \over
	 10^9\ memory\ bandwidth} = 35\%
$$

b)
This question is solved similar to the above one.
We have $10^9$ cache references being generated per second.
Let us calculate how many memory accesses this generates.
For a write back cache we have the following table:

\begin{center}
\begin{tabular}{|l|l|c|c|}
\hline
Type of Cache & Frequency & Memory Accesses/ & Average Memory Accesses/ \\
Reference && Cache Reference &Cache Reference \\
\hline
read hit   & 95\% * 75\% & 0 & 0 \\
write hit  & 95\% * 25\% & 0 & 0 \\
read miss  & 5\% * 75\% * 70\% clean & 2 & .0525 \\
read miss  & 5\% * 75\% * 30\% dirty & 4 & .045 \\
write miss & 5\% * 25\% * 70\% clean & 2 & .0175 \\
write miss & 5\% * 25\% * 30\% dirty & 4 & .015 \\
\hline
\end{tabular}
\end{center}

Note that for a read/write hit the write back cache accesses/updates
only the block in cache.

For a read/write miss, 
if the block being replaced is dirty, we need to write it back
to memory first.  Blocks are two words long.
Therefore we have 2 (or 4, for dirty blocks) memory accesses since we also
need to retrieve the
block containing the missed word.  

Totalling, we obtain the number of memory accesses per reference to cache.
This works out to .13.  Thus the memory bandwidth used is:
$$
Bandwidth\ used = { .13 \ accesses/ cache\ reference* 10^9\ cache\ references/ second \over
	 10^9\ memory\ bandwidth} = 13\%
$$

This is only about a third of the memory bandwidth needed for a
write through cache.

Common Errors:\\
1. Assuming that a write-through cache has dirty blocks.\\
2. Assuming that you only access memory for a cache miss.\\
3. Assuming you access memory in blocks (or words) only.

\problem \Points{15}
{\bf Memory Hierarchies II} -- H\&P Exercise 5.8

a) An ideal TLB is one that never causes a stall,
either for a cache hit or cache miss.
$$
CPI = Base\ CPI + Stall\ CPI = 1.5 + Stall\ CPI
$$

The Stall CPI (number of stall cycles per instruction)
depends on the Miss Rate for the cache,
the number of memory references (= blocks) each instruction makes on average
(1 fetch + 20\% loads/stores),
the fact that dirty blocks add an additional 50\% blocks of memory accesses,
and the miss penalty.  The miss penalty is the number of cycles needed
to access one block of 32 bytes.  This is memory latency plus the time
to transfer 32 bytes at 4 bytes/cycle $= 40 + 32/4 = 48\ cycles$.
$$
Stall\ CPI = Miss\ Rate * 1.2\ references/instruction * 1.5\ dirty\ factor * 48
	\ Miss\ Penalty
$$
So we have:

\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
Cache	&Miss Rate	&Stall CPI	&CPI\\
\hline
16-KB direct mapped cache & 0.029 & 2.5056 & 4.0056\\
16-KB 2-way cache & 0.022 & 1.9008 & 3.4008\\
32-KB direct mapped cache & 0.020 & 1.7280 & 3.2280\\
\hline
\end{tabular}
\end{center}

b)
Now the TLB is not ideal; so it introduces a 20 cycle penalty for
0.2\% of the cache misses.  How does the TLB
behave for dirty transfers ?  First
we translate the original virtual
address to a physical address and check if we have a miss.
If we do, we check if the block in cache is dirty.  If so,
we perform a dirty store.  Finally, we perform our original
access.
Since the dirty transfer will be executed first,
we need to store the original virtual address' translation, or recompute it.
I assume it is recomputed, and thus the dirty factor adds to
the number of times the TLB is accessed for a cache miss.
But the alternate assumption is also valid and has been graded
as correct.
$$
CPI = Base\ CPI + Stall\ CPI = 1.5 + Stall\ CPI
$$
$$
Stall\ CPI = Miss\ Rate * 1.2\ references/instruction * 1.5\ dirty\ factor * 
	(48 + .002*20)
$$

\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
Cache	&Miss Rate	&Stall CPI	&CPI\\
\hline
16-KB direct mapped cache & 0.029 & 2.507688 & 4.007688\\
16-KB 2-way cache & 0.022 & 1.902384 & 3.402384\\
32-KB direct mapped cache & 0.020 & 1.729440 & 3.229440\\
\hline
\end{tabular}
\end{center}

A very small difference indeed !  Even considering
the number of cycles executed by a processor each second (for
a 100 MHz processor,
we execute about 10000 instructions less each second).  Note
however, that these measurements are not at all exact !

c)
Many students did not understand the question.  Restated,
how does a virtually addressed cache affect TLB performance ?
Similarly, for a physically addressed cache.
(Note that a physically addressed cache does not mean we do
not have virtual memory.  It means that the TLB must perform
address translation before we access the cache.)

For a physically addressed cache, the TLB has to first
translate the Virtual Page number into a Physical Page
number before applying the address to the cache.  Essentially,
the TLB and cache are in series here.

For virtually addressed caches, the TLB can be accessed
in parallel and the physical tags can be compared to
determine if we have a hit or miss.

\begin{verbatim}
               +----->CACHE----->+ PP#
               |                 |
               |                 |
     VA ------>+                 equal? Yes----------> hit
               |                 |      No-----------> miss
               |                 |
               +----->TLB------->+ PP#
\end{verbatim}

Thus, in this case, the TLB does not degrade performance.
\end{problems}

\end{document}

