Tertiary Disk: A Large Scale Distributed Storage System

Nisha Talagala, Satoshi Asami, David Patterson, Tom Anderson, Ken Lutz

Tertiary Disk is the mass storage component of the UC Berkeley Network of Workstations (NOW) project. As with the rest of the NOW project, a goal of Tertiary Disk is to study the design of distributed systems built from commodity components; in this case, we concentrate on mass storage. Applications we consider suitable for such storage systems include video on demand; web caches that hold current as well as outdated versions of objects, so that users can retrieve a "snapshot" of the web at any point in time; and medical data servers, which must hold massive amounts of data and serve them in a timely fashion.

Over the past 10 years, the cost/performance gap between secondary and tertiary storage has been widening. The cost of raw disks has been falling by a factor of 2 per year, compared to a factor of 1.5 per year for tape drives and libraries. Disk areal densities have been increasing at 60% per year, with 9 GB 3.5-inch disks starting to ship in 1996. Disk data rates have been increasing at 40% per year and are expected to pass 40 MB/s by the end of the decade. If these trends continue, large storage systems composed of disks will have significant cost/performance advantages over tape libraries of similar capacity.

The other alternative, hardware RAID, has drawbacks in terms of cost/performance, availability, and scalability. Because of its custom hardware, the $/MB of a RAID increases with system capacity, unlike raw disks and tape systems. A RAID must also usually be connected to a host computer, which becomes a bottleneck for both performance and availability. Its scalability is limited by the number of disks the infrastructure can support; when this limit is reached, another RAID must be added, which increases the likelihood that one of the systems fails and prevents incremental expansion.

Tertiary Disk will address these limitations by studying storage systems built from disks attached to PCs connected by a fast network. Data will be striped across PCs to avoid a single point of failure within the system and to remove the host I/O bottleneck. To further improve availability, each string of disks will be double-ended, i.e., connected to two host adapter cards in different PCs; this reduces the window of vulnerability when a PC or adapter card fails. RAID will be done in software, with the work of parity computation and reconstruction distributed among the PCs.

We are currently building a prototype with 30 200 MHz Pentium Pro PCs and 500 4.3 GB Seagate Barracuda Fast Wide SCSI disks. The PCs will be connected to each other and to the rest of the NOW cluster through a switched network of 80 MB/s Myrinet links. The complete system will have a total capacity of over 2 TB, with an expected aggregate bandwidth of over 500 MB/s.

In contrast to previous research on RAID, Tertiary Disk will explore issues that arise in systems with very large numbers of disks in a distributed environment. For instance, RAID models will be extended to describe more complex environments, including recovery from host and network failures as well as disk failures. Also, while most RAID research assumes a single hardware controller, Tertiary Disk offers an opportunity to develop algorithms for efficient disk reconstruction and parity computation in a distributed environment.
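To make the distributed software RAID idea concrete, the sketch below shows a RAID-5-style layout in which the blocks of each stripe are spread across PCs and parity is the bytewise XOR of the stripe's data blocks, so a block lost to a failed disk, adapter, or host can be rebuilt from the survivors. This is an illustrative sketch under assumed parameters (node count, block size, parity rotation), not the Tertiary Disk implementation.

    # Illustrative sketch of RAID-5-style striping with distributed XOR
    # parity. Not the actual Tertiary Disk code; the node count, block
    # size, and rotation scheme are assumptions made for this example.

    NUM_NODES = 5          # PCs holding one block of each stripe (assumed)
    BLOCK_SIZE = 4096      # bytes per block (assumed)

    def xor_blocks(blocks):
        """Parity is the bytewise XOR of all blocks in a stripe."""
        parity = bytearray(BLOCK_SIZE)
        for block in blocks:
            for i, b in enumerate(block):
                parity[i] ^= b
        return bytes(parity)

    def build_stripe(data_blocks, stripe_no):
        """Lay out NUM_NODES - 1 data blocks plus one parity block across
        the nodes, rotating the parity position from stripe to stripe."""
        parity = xor_blocks(data_blocks)
        parity_node = stripe_no % NUM_NODES
        layout = []
        it = iter(data_blocks)
        for node in range(NUM_NODES):
            layout.append(parity if node == parity_node else next(it))
        return layout, parity_node

    def reconstruct(layout, failed_node):
        """Rebuild the block on a failed node by XORing the survivors;
        this works whether the lost block was data or parity."""
        survivors = [blk for node, blk in enumerate(layout) if node != failed_node]
        return xor_blocks(survivors)

    # Example usage: build one stripe, lose node 2, rebuild its block.
    blocks = [bytes([i + 1]) * BLOCK_SIZE for i in range(NUM_NODES - 1)]
    layout, parity_node = build_stripe(blocks, stripe_no=7)
    assert reconstruct(layout, failed_node=2) == layout[2]

Rotating the parity position spreads parity-update and reconstruction work across all PCs instead of making one node a hot spot; preserving this property at hundreds of disks, and across host and network failures, is the kind of problem the distributed algorithms mentioned above must address.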
Part of the design includes exploring the cost, performance, and reliability tradeoffs of different configurations under different workloads. Other interesting research issues include problems that arise from the large number and size of the disks involved, such as bit error rates, as well as maintenance and backup for large-capacity disk systems. Initial results and measurements from Tertiary Disk prototypes, as well as a more detailed description of the research issues, can be found at http://now.cs.berkeley.edu/Td/.
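As one concrete illustration of the bit-error-rate issue, a back-of-the-envelope estimate shows why media errors become routine at this capacity. The unrecoverable bit error rate used below is an assumed vendor-style figure for disks of this class, not a Tertiary Disk measurement.

    # Back-of-the-envelope estimate of why bit error rates matter at this
    # scale. The unrecoverable bit error rate below is an assumed,
    # vendor-style figure, not a Tertiary Disk measurement.

    TOTAL_BYTES = 500 * 4.3e9   # 500 disks x 4.3 GB, about 2.15 TB
    UBER = 1e-14                # assumed unrecoverable errors per bit read

    bits_per_full_scan = TOTAL_BYTES * 8
    expected_errors = bits_per_full_scan * UBER
    print(f"Expected unrecoverable errors per full-system read: {expected_errors:.2f}")
    # ~0.17 errors per scan: full-system operations such as reconstruction
    # or backup will regularly encounter media errors.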