NFS at CSAIL

While AFS is our preferred storage model for security and manageability (it’s the original “cloud storage” concept), it doesn’t always meet users’ needs for performance.

Rules and recommendations

To make the best use of the new NFS servers, we ask users to follow a few rules:

  1. Users of NFS storage are expected to join the nfs-users@csail mailing list to receive notifications of system outages.
  2. NFS storage is a costly resource to maintain. Do not use it for backups. If your research does not absolutely require the capacity or performance of NFS, keep your data in AFS, which is much cheaper to maintain.
  3. Please request one new filesystem per dataset or per project. Filesystems are cheap, and keeping different kinds of data separate has numerous advantages:
    • It allows our backup system to operate more efficiently.
    • It allows us to set storage parameters appropriately for the type of data being stored.
    • It makes it easier to identify and resolve performance problems relating to filesystem activity.
  4. Please don’t mix different kinds of data in one filesystem. For example, if you’re working on video analysis, DO NOT take a multi-gigabyte video file and explode it into a million individual frames in the same filesystem – use a separate scratch filesystem for that. (Feel free to ask for one if you need it, after checking within your group.)
  5. Also, please don’t mix data with radically different access patterns. If you have a large dataset that you use as reference material – that is, you extract it once but never change it – we want to put it in a different filesystem with different storage parameters and backup policy. (We can even make it read-only to ensure that you don’t change it by accident.)
  6. Creating hundreds of millions of files is almost never a good idea. Under no circumstances should you create even hundreds of thousands of entries in the same directory – this causes all accesses to that directory to be much slower, and may prevent the backup system from ever finishing a backup of your data. Use multiple levels of subdirectories to limit the size of any one directory, and keep large numbers of small files in archives (tar or ZIP) when you do not need to access them immediately. (For many cluster applications, it may make sense to store all your data in an archive, and extract to temporary storage only those files required for the computation on each node.) Please ask TIG if you need assistance in structuring your data.
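
As a rough illustration of rule 6, here is a minimal Python sketch of both suggestions: spreading files across multiple levels of subdirectories, and pulling only the files a job needs out of an archive into temporary storage. The function names (sharded_path, extract_only), the choice of SHA-256, and the two-level, two-hex-digit layout are assumptions made for this example, not a layout that TIG prescribes; adapt them to your own data.

```python
import hashlib
import shutil
import tarfile
from pathlib import Path


def sharded_path(root: Path, name: str, levels: int = 2, width: int = 2) -> Path:
    """Map a file name to root/<xx>/<yy>/name using leading hex digits of its hash.

    With the defaults there are 256 * 256 = 65,536 leaf directories, so a
    million files average roughly 15 entries per directory (assuming the
    hash spreads names evenly).
    """
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return root.joinpath(*parts, name)


def extract_only(archive: Path, wanted: set, dest: Path) -> list:
    """Copy only the named regular files out of a tar archive into dest.

    'wanted' holds member names exactly as stored in the archive. The names
    are supplied by the caller, so they are treated as trusted here.
    """
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive, "r:*") as tar:  # "r:*" auto-detects compression
        for member in tar:
            if member.isfile() and member.name in wanted:
                target = dest / member.name
                target.parent.mkdir(parents=True, exist_ok=True)
                with tar.extractfile(member) as src, open(target, "wb") as out:
                    shutil.copyfileobj(src, out)
                extracted.append(member.name)
    return extracted
```

A job running on a cluster node might call extract_only(archive, names_for_this_node, local_scratch_dir) at startup, so that the NFS server sees one large sequential read of the archive instead of many small-file operations. Note that the archive is still scanned from the beginning, since tar has no index.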

Requests for new filesystems

When requesting a new NFS filesystem, please provide the following information in an email to help@csail.mit.edu. NOTE WELL: The information you include in your request is a contract between you and TIG. Do not expect sympathy if you use a filesystem for something other than what you told us and run into problems. (We will of course still help you, but that may mean that we help you to copy your data to a more appropriate place.)

  1. What will the filesystem be used for? This and the following question will determine the name for your new filesystem. Please note: “my files” is not an adequate explanation of what your filesystem will be used for. Most people work on multiple projects during their time at CSAIL, and most projects will have multiple people working on them; each project’s data needs to be stored in its own filesystem. A good answer to this question will be something like “storing raw frog genome sequences” or “accumulating simulation results for Prof. X’s new cache coherence protocol” or “reference data from the FROBNITZ consortium that I’m using to train my speech recognizer”.
  2. Who (what group or project) will be maintaining it? (We want the name of a Unix/AFS filesystem group here, not a list of users, but if there is someone who will be the principal administrator, please let us know that too.)
  3. How long will this data need to remain online, and do your funding agencies have any requirements for data stewardship or preservation?
  4. What is the access pattern? Will it be write-once/read-many, or will there be simultaneous reads and updates? Please describe in detail how you will be using the data. For example, if your data will be read sequentially, so that each file is read only once per job and then not used again until the next job starts, that is important to know. Conversely, if small segments of files will be read and updated at random, we need to know that. Make sure to include information about how parallel your use is likely to be: if you will be running a 64-node cluster and every node updates the same file, we may need to configure things differently.
  5. How much space is required? How much do you need now, and how much will it grow over time? Keep in mind that all NFS filesystems have snapshots, which will take up some of your allocated space.
  6. Do you need this data to be backed up, or can you reload/regenerate it from some other source if necessary?
  7. Is your data stored in a compressed or encrypted form?

    Images and videos are almost always compressed.