System Administration Solutions

This poster is divided into two parts: infrastructure for building secure
communication between principals, and system monitoring and diagnosis.

Semantic Security Properties for Messages
-------- -------- ---------- --- --------

Traditional security solutions provide low-level cryptographic operations
(Encrypt, Decrypt, Sign, Verify) and some help with key management
(Kerberos, PGP).  We use those pieces to build messages divided into parts
with various security properties, such as:

 * Signed -- The receiver knows that the part has not been modified.

 * Encrypted -- The receiver knows that the part has not been modified and
   could not be read by a third party.

 * NonRepudiatable -- The receiver knows that the part can be given to a
   third party, which can verify that the part came from the original
   sender and has not been modified.

 * EncryptedNonRepudiatable -- As NonRepudiatable, but in addition the
   message could not be read by a third party.

In addition, we have both independent and chained messages.  An independent
message can be decoded and verified without looking at any other messages;
however, in the absence of additional protocol information, the message
could have been a replay.  A chained message depends on previous messages,
so senders and receivers can exchange nonces and then know that the
following stream of messages is fresh.  Furthermore, chained messages are
faster to compute because they avoid slow public-key operations.

We have implemented the system using the Crypto++ library, and we support a
variety of encryption algorithms.  We plan to integrate support for
Kerberos and PGP principals soon.

We plan to use this library to support authenticated actions, which are
partially described in an extended abstract available from:

    http://now.cs.berkeley.edu/Sysadmin/authact/intro.html

The abstract describes an outdated version of the message format.  We also
plan to use the library to support system monitoring and diagnosis,
described below.  The library is also being used in the WebOS project to
provide security for remote job execution on the web.

The most recent copy of the code is available from the author, Eric
Anderson (eanders@cs.berkeley.edu).

System Monitoring and Diagnosis
------ ---------- --- ---------

This work is intended to scale system monitoring and diagnosis from systems
with a few important servers and hundreds of unimportant clients to the
realm of NOWs, where hundreds of machines work together to provide a
cooperative service.  The original design is described in an extended
abstract available from:

    http://now.cs.berkeley.edu/Sysadmin/sys-diag-console/intro.html

Recently, we have decided that the diagnostic console should be split into
two separate pieces: a display and control piece (the console), and a
back-end that gathers, caches, and retains information from the system.

The extended abstract describes our thoughts on the display piece, with a
focus on maximizing the information displayed on the screen through
aggregation.  For example, an aggregate node might use fill to indicate
overall utilization, shade to indicate variability, and color to indicate
general functioning.  We are implementing the console in Perl/Tk so that
the administrator can easily add new display methods while the system is
running.  We will also use similar techniques of fill, shade, and color for
network links, but we do not yet have data gathering for that information.
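As a rough illustration of the aggregation idea (this is a sketch, not code
from the console itself), one way to reduce per-node readings to the three
display attributes is shown below: fill from the mean utilization, shade
from the variability (standard deviation), and color from the fraction of
nodes that are responding.  The sample data, thresholds, and color choices
are invented for the example.

    #!/usr/bin/perl -w
    # Sketch only: compute fill/shade/color for an aggregate node from
    # hypothetical per-node samples.  The real console's data structures
    # and mappings may differ.
    use strict;

    # Hypothetical samples: node name -> [ utilization (0..1), responding? ]
    my %node = (
        'now001' => [ 0.85, 1 ],
        'now002' => [ 0.40, 1 ],
        'now003' => [ 0.90, 0 ],   # not responding
        'now004' => [ 0.65, 1 ],
    );

    my @util = map { $_->[0] } values %node;
    my $n    = scalar @util;

    # Fill: overall utilization (mean across the nodes in the aggregate).
    my $mean = 0; $mean += $_ for @util; $mean /= $n;

    # Shade: variability (standard deviation of the utilizations).
    my $var = 0; $var += ($_ - $mean) ** 2 for @util; $var /= $n;
    my $shade = sqrt($var);

    # Color: general functioning (fraction of nodes responding).
    my $up = grep { $_->[1] } values %node;
    my $color = ($up == $n)     ? 'green'
              : ($up >= $n / 2) ? 'yellow'
              :                   'red';

    printf "fill=%.2f shade=%.2f color=%s\n", $mean, $shade, $color;

In the console these values would drive how an aggregate node is drawn on
the Tk canvas; here they are simply printed.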
The front-end will also support running programs across groups of nodes to
test either single-node problems or combined node and network problems.
The results from those programs will be displayed in the same way as other
gathered information.

The back-end data gathering should use some sort of database rather than
having each console individually re-poll source nodes for information.  By
using a database, we can gain consistent, secure access to the data with
more flexibility than is offered by solutions such as SNMP.  Moreover, the
database can retain information for historical access and should improve
performance as the number of monitoring nodes increases.  We plan to use
many different forms of data gathering, including SNMP and remote procedure
execution through authenticated actions, to get data into the database.  We
are currently evaluating which type of database to use given the
requirements of very high fault tolerance (since tools like this are often
used when the system is working poorly), scalability, the ability to
automatically keep cached data up to date, and ease of distributing the
resulting system.  A toy sketch of the caching idea appears below.

An initial version of the diagnostic console is being cleaned up for an
internal release, but code can be acquired by contacting the author, Albert
Goto (goto@cs.berkeley.edu).  The current version also requires Glunix for
data gathering.
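To make the caching idea concrete, the following toy sketch (not the actual
back-end, which we expect to build around a database) keeps the last value
gathered from each node together with a timestamp; a console request is
served from the cache when the entry is fresh and triggers one new poll of
the source node otherwise.  The gather() routine, the node names, and the
30-second freshness threshold are placeholders.

    #!/usr/bin/perl -w
    # Sketch only: a single-process cache standing in for the back-end, so
    # that consoles read cached values instead of re-polling source nodes.
    use strict;

    my $MAX_AGE = 30;   # seconds before a cached entry is considered stale
    my %cache;          # node name -> { value => ..., time => ... }

    # Placeholder gatherer: in the real system this would be an SNMP query
    # or a probe run through authenticated remote execution; here we just
    # fake a load-average reading.
    sub gather {
        my ($node) = @_;
        return sprintf "load=%.2f", rand(4);
    }

    # Return a (possibly cached) reading for a node, refreshing it if stale.
    sub lookup {
        my ($node) = @_;
        my $entry = $cache{$node};
        if (!defined $entry || time() - $entry->{time} > $MAX_AGE) {
            $entry = { value => gather($node), time => time() };
            $cache{$node} = $entry;
        }
        return $entry->{value};
    }

    # Two consoles asking about the same node share one poll of that node.
    print "console A sees: ", lookup('now001'), "\n";
    print "console B sees: ", lookup('now001'), "\n";

The same lookup interface could sit in front of a real database, with
historical values retained rather than overwritten.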