Why AFS?

A Little History

If you were previously affiliated with the AI Lab, you are used to a particular centralized computing resource model. Many of you have voiced concerns that this familiar, comfortable system is being ripped out from under you for no good reason. Additionally, many AI folks feel that the level of personal helpdesk support they are used to has been diminished or eliminated due to the new structure.

If you were previously affiliated with LCS, you might not be used to the idea of certain labwide services in the first place.

When CRS (the former LCS IT department) and More Magic (the former AI IT department) were combined on the very front line of the AI-LCS merger, we set about trying to figure out a feasible way to bring the sort of economy of scale that the AI Lab had enjoyed to the whole new organization. We also hoped to reduce or eliminate the widespread duplication of cost and effort that resulted from decentralized system administration in LCS by providing the most commonly needed services as a central resource: file storage, email, authentication, web and database service, and so on.

The AI computing resources were the result of decades of organic growth and the grafting on of new technologies as they became available. After so many years, the system had become nearly unmanageable. I can't tell you how many times I've seen one service go down in AI and cause a domino effect because of all the undocumented, quirky interdependencies. Clearly, we could not dream of extending this system to a population of 1000+ users and expect it to work. In particular, the legacy directory service, NIS, had grown long in the tooth and was the most common vector for AI Lab security breaches. NIS is not typically used in new IT systems because of its inherent lack of security and scalability.

Why AFS and not NFS (or SMB)?

For all the quick-fix warm fuzziness of NFS and NIS, there were several major problems with both. The older AI file servers (in particular Maytag, a Network Appliance file server) had gone past the manufacturer's "EOL", or end-of-life, meaning that if one broke, the downtime could be weeks instead of the hours we had been used to for years. In fact, Maytag did experience a massive hardware failure that required a record eleven-hour tech support call and parts delivery at 3 in the morning as we tried to thump it back to life. A new file server, or several, was badly needed. Meanwhile, the capacity of cheap IDE workstation disks has grown far faster than that of specialized file server appliances, so in most cases it no longer makes sense to invest in that sort of system.

Another problem that those of us who were sysadmins in the former AI Lab had way too much experience with was the pain of migrating users from one full NFS file system to another. Using NFS file systems requires intimate knowledge of the location and configuration of disks. A migration requires system downtime and a "freeze" on data while the transfer takes place. This was hard enough within the small AI Lab family, and we felt certain that it would not scale once we added all of the former LCS.

Trust Me

NFS is based on the weak notion of UID-based authentication. That is to say, anyone who can access an NFS filesystem and give him/herself an arbitrary UID (by editing /etc/passwd, for example) can access files owned by that UID, with no other form of credentials whatsoever. This has been a major problem, and accounts for the Great AI Crack last year, when a single compromised machine allowed a cracker to recursively delete every single AI user's email spool and data files.
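To make the weakness concrete, here is a minimal toy model of UID-only trust. It is not a real NFS client or server; the StoredFile class, the uid_trusting_read function, the path, and the UID values are all hypothetical names invented for illustration. The point it sketches is that the server compares numbers and nothing else.

```python
# Toy model of UID-only trust. This is NOT real NFS code; every name
# and value here is hypothetical and exists only for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredFile:
    path: str
    owner_uid: int
    contents: str


def uid_trusting_read(file: StoredFile, claimed_uid: int) -> str:
    """Return the contents if the claimed UID matches the owner's UID.

    The server cannot check that the caller really is that user; it
    simply believes whatever number the client machine sends.
    """
    if claimed_uid != file.owner_uid:
        raise PermissionError(f"uid {claimed_uid} may not read {file.path}")
    return file.contents


# Hypothetical data: the user with UID 1001 owns a mail spool.
spool = StoredFile("/var/mail/alice", owner_uid=1001, contents="private mail")

# Anyone with root on a client machine can give a local account UID 1001
# and then present that number to the server. No password or key is involved.
attacker_claimed_uid = 1001
stolen = uid_trusting_read(spool, attacker_claimed_uid)
```

In this sketch the only thing standing between an attacker and the file is knowing (or guessing) the right integer.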

We needed to find something that would provide the same level of universal authentication and file service, but that would also be robust and secure in the face of the thousands of security probes and crack attempts that the lab faces each day. After much research, we came to the conclusion that Kerberos V5 was the best possible solution for the authentication part of the picture, because it is an open standard; supported and used widely by MIT; available for essentially every major operating system and even many obscure ones; and based on actual cryptographic principles and real authentication, as opposed to UID-only trust. Of course, Kerberos cannot totally take the place of NIS, because NIS is a directory service, whereas Kerberos does exactly one thing: authentication. We needed to figure out that piece of the puzzle, too, since managing UIDs was still important. For this, we leveraged the existing INQUIR database system, and it is still at the core of the password file generation utility that is on the CSAIL Debian workstations. We have been talking internally about possibly consolidating the various legacy sources of directory information that are currently in use and making it available via LDAP, which could ostensibly take the place of the current password-file-pushing hack.
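The difference between UID-only trust and cryptographic authentication can be sketched with a deliberately simplified example. This is not the Kerberos protocol: real Kerberos tickets are encrypted and carry more, such as session keys. The Ticket type, the issue_ticket and verify_ticket functions, and the key value are hypothetical stand-ins. The sketch only illustrates the idea that a service accepts a claim when it is backed by a secret the claimant could not have forged.

```python
# Simplified illustration only: this is NOT the Kerberos protocol.
# All names and values are hypothetical.
import hashlib
import hmac
from typing import NamedTuple


class Ticket(NamedTuple):
    principal: str
    expires_at: int
    tag: bytes


def _message(principal: str, expires_at: int) -> bytes:
    return f"{principal}|{expires_at}".encode()


def issue_ticket(service_key: bytes, principal: str, expires_at: int) -> Ticket:
    """Stand-in for a trusted authentication server that knows the service key."""
    tag = hmac.new(service_key, _message(principal, expires_at), hashlib.sha256).digest()
    return Ticket(principal, expires_at, tag)


def verify_ticket(service_key: bytes, ticket: Ticket, now: int) -> bool:
    """Accept the ticket only if its tag is valid and it has not expired."""
    expected = hmac.new(
        service_key, _message(ticket.principal, ticket.expires_at), hashlib.sha256
    ).digest()
    return hmac.compare_digest(ticket.tag, expected) and now < ticket.expires_at


SERVICE_KEY = b"hypothetical-file-server-key"
good = issue_ticket(SERVICE_KEY, "alice", expires_at=1000)
forged = Ticket("alice", 1000, b"\x00" * 32)  # claims to be alice, but has no key
```

Here verify_ticket(SERVICE_KEY, good, now=500) succeeds, while the forged ticket and an expired one are both rejected. Merely asserting an identity, as in the UID case, is no longer enough.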

For the solution to the file service piece of the puzzle, AFS made the most sense. It integrates nicely with Kerberos, is cross-platform compatible, has built-in redundancy, fault-tolerance, high availability, and backup/restore management. Plus, it supports a much more granular access control model than standard UNIX modes, including user-defined access groups and even unauthenticated/anonymous access if desired. It seemed like a perfect cure-all.
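As a rough sketch of how this kind of access control differs from UNIX modes, the following models an AFS-style directory ACL. AFS attaches ACLs to directories and uses the seven rights r, l, i, d, w, k, and a (read, lookup, insert, delete, write, lock, administer), plus the built-in groups system:anyuser and system:authuser. The effective_rights function, the sample ACL, and the group membership table are hypothetical, and the sketch ignores negative rights entirely.

```python
# Sketch of AFS-style directory ACL evaluation (positive rights only).
# The function, sample ACL, and group table are hypothetical.
AFS_RIGHTS = set("rlidwka")  # read, lookup, insert, delete, write, lock, administer


def effective_rights(acl, user, groups):
    """Union of the rights granted to the user, the user's groups,
    and the built-in groups that apply. Pass user=None for an
    unauthenticated (anonymous) caller.
    """
    principals = {"system:anyuser"}
    if user is not None:
        principals |= {user, "system:authuser"} | groups.get(user, set())
    rights = set()
    for principal in principals:
        rights |= set(acl.get(principal, ""))
    return rights & AFS_RIGHTS


# Hypothetical ACL on one directory.
acl = {
    "alice": "rlidwka",        # the owner has every right
    "alice:labmates": "rl",    # a user-defined group may read and list
    "system:anyuser": "l",     # anonymous users may only list
}
groups = {"bob": {"alice:labmates"}}

bob_rights = effective_rights(acl, "bob", groups)
anonymous_rights = effective_rights(acl, None, groups)
```

With this sample data, bob ends up with read and lookup through the user-defined group, while an anonymous caller gets lookup only. Standard UNIX modes cannot express either case without changing the file's group or opening it to everyone.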

However, we soon discovered two scenarios in which AFS did not make sense: data files larger than 2GB do not work under some older versions of AFS, and I/O patterns in which multiple users or hosts need concurrent read/write access to a dataset do not work properly because of the way AFS caches data. So we recognized that NFS would need to stay on in some capacity to address these situations. I think a lot of AI folks believe (incorrectly) that our goal is to squash NFS out of existence entirely and force everyone into AFS; this is not the case. We've added several terabytes of storage that are available via NFS for people who need the performance that AFS isn't meant to provide. If your group needs NFS storage, just send email to help@csail.mit.edu and we'll carve out some space for you.
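The two exceptions above can be written down as a small rule of thumb. This is only a sketch of the reasoning, not a tool we actually run: the suggest_storage function and its parameters are hypothetical, and it assumes "2GB" means 2 * 1024**3 bytes.

```python
# Rule-of-thumb sketch of the AFS/NFS choice described above.
# Hypothetical helper; assumes "2GB" means 2 * 1024**3 bytes.
TWO_GB = 2 * 1024**3


def suggest_storage(largest_file_bytes: int, concurrent_read_write: bool) -> str:
    """Suggest "NFS" for the two cases where AFS is a poor fit, else "AFS".

    largest_file_bytes: size of the biggest single file in the dataset.
    concurrent_read_write: True if multiple users or hosts need to read
        and write the same dataset at the same time.
    """
    if largest_file_bytes > TWO_GB or concurrent_read_write:
        return "NFS"
    return "AFS"


home_directory = suggest_storage(50 * 1024**2, concurrent_read_write=False)
big_dataset = suggest_storage(5 * 1024**3, concurrent_read_write=False)
shared_scratch = suggest_storage(10 * 1024**2, concurrent_read_write=True)
```

An ordinary home directory lands on AFS; a dataset with a 5GB file or a scratch area written by several hosts at once lands on NFS.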
Topic revision: 21 Nov 2006, JasonDorfman
