\documentstyle[class,11pt]{article}
\def\ClassName{Computer~Organization}
\def\ClassNumber{CS~333}
\handedout{(12/6/95) Handout \#18}
\vspace{0.5in}

\begin{document}
\begin{center}
{\large \bf Solutions to Homework \#7} \\
\end{center}

{\bf Introduction:} This assignment focuses on virtual memory, cache
coherence, and input/output.  Before starting the homework, you should
have read Chapter 6 and Sections 8.3, 8.4 and 8.6 in the textbook.

\begin{center}
{\bf Solutions}
\end{center}

\begin{problems}
\problem \Points{30}
{\bf Virtual Memory Translation}

a. Virtual memory systems use page tables to translate from virtual
addresses to physical addresses.  Consider a machine such as the DEC
Alpha 21064 which has 64 bit registers and manipulates 64-bit
addresses.  If the page size is 8KB, how many bits of virtual page
number are there?  If the page table used for translation from virtual
to physical addresses were 8 bytes per entry, how much memory is
required for the page table and is this amount of memory feasible?

{\bf Solution:}
8 KB requires 13 bits for the page offset.
Thus we have $64-13=51$ bits left over for the virtual page number.
A page table for such size would require $8*2^{51}$ bytes of memory
(or disk),
which is definitely not feasible !!

b. The actual DEC 21064 processor only implements 43 bits of virtual
address space and 34 bits of physical address.  Explain how this could
be done when programs manipulate 64-bit addresses.  Does this cause
programmers that use less than $2^{43}$ bytes of address space any
difficulty?

{\bf Solution:}
The programs are forced to access only 43 bits of virtual address
by restricting what the upper 21 bits can be.
These ``64 bits'' are then translated by a three step lookup
process into 34 bits of virtual memory.

Obviously, any program that uses less than 43 address bits
will not be affected by this scheme any more than one that
tries to use more than 43 bits.

43 address bits can reference 8 terabytes, which is probably enough for
most current mainstream applications.
But note that the designers of the Alpha have built in the capability
to increase the size of the virtual address space.
Of course a new chip is required, but it requires only minor
modification and has already been designed.

c. Consider the TLB design for the 21064.  Reducing the virtual
address size and physical address size simplifies the TLB
significantly.  If a TLB with 64-bit addresses (both virtual and
physical) with 16 entries could be implemented, how much larger a TLB
could be implemented if the storage is the limiting implementation
constraint? 

{\bf Solution:}
Assuming 13 bits of page offset, and ignoring status bits,
we have two fields per entry of the page table;
51 bits for the virtual page number,
and 51 bits for the physical page number.

For the 21064, we have 43 bits of virtual address
and 34 bits of physical address.
For the page offset of 13 bits,
we are left with 30 and 21 bits respectively for
the virtual and physical page numbers --
for a total of 51 bits per page table entry.

This allows us twice as many page table entries,
i.e. 32 in this case.

d. Consider a simultaneous translation scheme which uses the lower 13
bits of address to index and select from the L1 caches in the DEC
Alpha 21064.  If L1 cache misses take 30 cycles to service and TLB
misses take 50 cycles to service, for L1 cache miss rates of 1,2,4,
and 8\% (for both the instruction and data caches), calculate the TLB
miss rate which would result in an equal amount of execution time
spent servicing TLB and L1 cache misses.  Make sure to account for
both instruction and data references, and assume the Load/store
fraction is 30\%.

{\bf Solution:}
$$
Stalls_{L1} = 1.3*1\%*30
$$
$$
Stalls_{TLB} = 1.3 * \ x\ *50
$$
We wish to find {\it x} when the above equations are equal.
Thus,
$$
x = 1\%*30/50 = .6\%
$$
Similarly, we obtain 1.2\%, 2.4\%, and 4.8\% for 2\%, 4\%, and 8\%
respectively.

\problem \Points{10}
{\bf Cache Coherence I} -- H\&P Exercise 8.3

{\bf Solution:}
The transitions from the {\it Invalid} and {\it Shared} states
are unchanged for the write-through cache.
The transitions from the {\it Exclusive} state are as follows:

\begin{center}
\begin{tabular}{|l|l|l|l|l|}
\hline
Initial State	&Stimulus	&Action		&Final State	&Note	\\
\hline
Exclusive	&CPU read hit	&None		&Exclusive&\\
Exclusive	&CPU write hit	&None		&Exclusive&\\
Exclusive	&CPU read miss	&Place ``Read Miss'' on bus	&Shared&\\
Exclusive	&CPU write miss	&Place ``Write Miss'' on bus	&Exclusive&\\
Exclusive	&Bus read miss	&None		&Shared&\\
Exclusive	&Bus write miss	&None		&Invalid&\\
\hline
\end{tabular}
\end{center}

Note that the ``CPU Read Miss'' means that the CPU accesses some
memory address that maps to the same cache block,
but we already have a different memory block loaded into this cache block.
Therefore we need to load in the new cache block
(overwriting the old one.)  Similarly, for the ``CPU Write Miss''.

\problem \Points{30}
{\bf Cache Coherence II} -- H\&P Exercise 8.8

{\bf Solution:}
The relaxed consistency model specified in the question means
that if Processor P1 executes ``WRITE A'' followed by
``READ B'', the latter must be allowed to complete even
if $W_A$ has not completed.
However, if P1 executes $W_A$ followed by $R_A$, the $R_A$
must be delayed until the $W_A$ completes.
In other words, ``the ordering of accesses to the SAME
memory location is defined by the coherence model,
and the ordering of accesses to different memory
locations is defined by the consistency model.''

This can be extended to multiple processors, by considering that
A and B are shared data,
and that Processor P1 executes $W_A$,
and Processor P2 executes $R_A$ or $R_B$.
P2 must be allowed to execute $R_B$, but not $R_A$,
before P1's $W_A$ completes.

Note that a Data Write-back of a cache block only occurs when the 
block is in the {\it Exclusive} state.
The above consistency behaviour can be implemented by noticing that in Fig.8.24,
Data write-back's are explicitly called.
We do not need to change Fig.8.24 at all.
Fig.8.25 handles the ``Data Write-back'' when in {\it Exclusive}
state by performing some actions and changing state from {\it Exclusive}
to {\it Uncached} state.
We change the handling of the ``Data Write-back'' here
by introducing a new state {\it Write Incomplete},
and on a ``Data Write-back'' we perform the same actions
as before, but we transition to this new state.
We remain in this new state until the Write Miss completes,
and then transition to {\it Uncached} state.

To see why this works, consider an extended example.

Assume A and B are in different memory blocks that map to
the same cache block.
Say we are in {\it Shared} state for the block containing A,
and we perform $W_A$.  The CPU
sends a {\bf Write Miss} message to the directory, as shown in Fig.8.24.
The directory controller {\bf Invalidates} the block,
makes {\bf Sharers=\{P\}}, sends back the block to the
requesting processor P1 by doing a {\bf data value reply},
and then transitions to the {\it Exclusive} state.

Now we are in {\it Exclusive} state for the block containing A,
and we perform $W_B$, which is in a different memory block that maps to
the same cache block.  The CPU does a {\bf Data Write Back}
and places a {\bf Write Miss} on the bus, as shown in Fig.8.24.

First we consider the {\bf Data Write Back} of block A.
In the directory controller, we make {\bf Sharers=\{\}} and
change state to the new {\it Write Incomplete} state.
In this state, the only transition out is when the Write completes.
Assume the data write back merely writes the old block to the buffer,
and this does not reach memory yet.

Next we consider the {\bf Write Miss}.
Assume, the directory controller is also in state {\it Exclusive}
for the cache block containing B.  (It could be in some other
state too.)  It then performs the actions for a
Write miss as shown in Fig. 8.25.
So now it {\bf Invalidates} the cache blocks containing B,
places a Fetch on the bus, and returns the block containing B to
the requesting processor, making {\bf Sharers=\{P\}}.

Now P1 can go ahead and Write or Read B.

But if P1 tries to Read A, it must Write-back B, and
generate a Bus Read Miss for A.
However, since the block containing A is in {\it Write Incomplete} state,
the directory controller cannot react to the Read Miss and stalls it until the
Write completes.  This enforces the desired behaviour.
Note that the same behaviour is enforced
if another processor P2 performs $R_A$ or $R_B$.

A lot more can be said about the protocol, but let's just leave it
at that.

\problem \Points{30}
{\bf Input/Output (Disk Arrays)} 

Disk arrays: Consider a disk array built from Seagate ST31401N Elite-2
Drives (see Figure 6.2 in the text).  The array has five disks, with
parity interleaved across the disks in 512 byte sectors, using a
RAID-5 organization.

a. Assuming a seek time of exactly 10
milliseconds and synchronized disks, how long would it take to do a
32 kilobyte and 16 megabyte read respectively?

{\bf Solution:}

We have 21 tracks per cylinder, and 99 sectors of 512
bytes each per track.  The RPM of 5400 implies RPS of 90.
(This is how we get a data transfer rate of
$512 * 99 * 21 * 90 \approx 4.6\ MB$.
This is the rate at which bits are read off the platter.)

Now, $T_{sector}$ = Average Seek Time + Average Rotational Delay
+ Transfer Time + Controller Overhead.

We know the Seek Time is exactly 10 ms, i.e. 0.01 seconds;
and the Controller Overhead is 0.001 seconds.

(a)
32 KB is stored on 4 disks, with the fifth disk used for parity.
Thus the time to read 32 KB,
is the same as the time to read 8 KB off each disk,
assuming all disks are not busy.
Since each track on a disk holds well over 8 KB,
we need to seek for only one track on each disk.
Thus the read time becomes:
$$
T_{32KB/4disks} = T_{8KB/disk} = .01 + {0.5 \over 90\ RPS}
	+ {8\ KB\over 4.6*1024\ MB} + 0.001 = 18.254\ ms
$$

(b)
To read 16 MB off 4 disks, we need to access 4 MB off each disk.
Each cylinder of a disk can be accessed with one seek;
and this gives us $21*99*512 = 1064448$ bytes per seek.
Thus,
$$
T_{1064448B/disk} = .01 + {0.5 \over 90\ RPS}
	+ 1 + 0.001 = 1.016556 \ s
$$
We need 4 seeks to access 4\ MB.
$$
T_{16MB/5disks} = T_{16MB/4disks} = T_{4MB/disk} = {4*1024*1024\over 1064448}*1.016556\ seconds
$$
$$ = 4.00559072\ seconds $$

b. If the disks in the array are not synchronized how would this
affect the average time to do a 16 megabyte read?  Calculate this
as a function of the largest seek time of the 5 disks.

{\bf Solution:}
The formula is as above, except that
$$
T_{1064448B/disk} = Seek_{MAX} + etc.
$$
Thus,
$$
T_{16MB/4disks} = T_{4MB/disk} = 4*Seek_{MAX}+{4*1024*1024\over 1064448}*etc.
$$
Of course, here we are assuming that we always take $Seek_{MAX}$ time
units for every seek.

c. Discuss how would the numbers for part 2 change if the disk array had 9
disks?  17 disks?  (Don't have to calculate the answer precisely.)

{\bf Solution:}
The access time would be as follows:
$$T_{16MB/9disks} = {T_{16MB/5disks} \over 2} $$
$$T_{16MB/17disks} = {T_{16MB/5disks} \over 4} $$

d. Finally, the worst case performance for RAID-5's is small writes.
Assuming we're writing 512-byte blocks, what is the average latency
for 512-byte writes?  What is the average throughput for the 5-disk
array for such writes, assuming first-come-first-serve disk head
scheduling?

{\bf Solution:}
Each Read requires 4 accesses to 2 disks as described in the text.
Thus there are two accesses to the same block on each disk.
We assume there is only one seek to find the track;
and then the head stays over the track for the whole revolution,
and writes back the data when the same block passes under
the head again.  Thus we have,
$$
T_{512B\ Write} = 0.010 + {1.5 \over 90} + {1\ KB\over 4.6*1024\ KB}
	+0.002 = 28.789\ ms
$$
Average Throughput for the RAID would be equal to (1 or 4 times,
depending on assumptions) the throughput
per disk if all the data we access happens to be on the same disk,
in the worst case.
$$
Throughput = {0.5\ KB\over .028789\ seconds} = 17.312\ KB/second
$$
\end{problems}

\end{document}

