Storage tiers
There are three tiers of NFS storage, with different performance characteristics: scratch, production, and archival. Some research groups have their own scratch and production file servers, but TIG operates servers in these tiers for shared use by all CSAIL members.
All three storage tiers are split between 32-341 (the main machine room in Stata, where high-speed access is available from OpenStack and group-owned compute clusters in the building) and OC40-250 (the Massachusetts Green High-Performance Computing Center in Holyoke). Because the NFS protocol does not pipeline requests well over higher-latency links, when performance is a concern users should make sure that their jobs run on machines in the same building as the file servers holding the data they are accessing. The Holyoke data center is subject to an annual 24-hour maintenance shutdown during which all servers and data located there will be inaccessible; the date of the shutdown is usually in May and will be announced to the community a few months in advance.
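When in doubt about where a given directory actually lives, the client's mount table tells you which server is backing it. The snippet below is a minimal sketch for a Linux client; mapping a server hostname to a building is site-specific and not shown here.

```python
#!/usr/bin/env python3
"""Report which NFS server backs a given path (Linux client sketch)."""
import os
import sys

def nfs_source(path):
    """Return (server:export, mountpoint) for the longest matching NFS mount."""
    real = os.path.realpath(path)
    best = None
    with open("/proc/mounts") as mounts:
        for line in mounts:
            source, mountpoint, fstype = line.split()[:3]
            if not fstype.startswith("nfs"):
                continue
            if real == mountpoint or real.startswith(mountpoint.rstrip("/") + "/"):
                if best is None or len(mountpoint) > len(best[1]):
                    best = (source, mountpoint)
    return best

if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    hit = nfs_source(target)
    if hit:
        print(f"{target} is served from {hit[0]} (mounted at {hit[1]})")
    else:
        print(f"{target} does not appear to be on an NFS mount")
```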
Scratch
Scratch storage does what it says on the tin: it is intended for high-performance storage of temporary files, intermediate computations, and similar data that can be easily reconstructed or otherwise does not require backups.
- No backups or snapshots are taken, and users are limited to 1 TB of public scratch usage. (Quotas on private group-owned scratch servers are determined by the relevant principal investigator(s).) Temporary quota elevation is sometimes available on request, depending on overall usage and available space.
- Scratch storage is overcommitted by design: there is not enough storage for every user to occupy their full quota at the same time.
- TIG will periodically prune files that have not been accessed recently.
- No attempt will be made to preserve or restore data in the case of a hardware or software fault that affects the filesystem or the file server.
- Data which is overwritten or deleted by mistake cannot be recovered.
- All data is compressed automatically, and quotas are applied after compression.
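Because quotas are charged against the compressed on-disk size rather than the logical size of your files, tools that report apparent size can overstate your quota usage. The sketch below compares the two for a directory tree; it assumes a POSIX client where st_blocks is reported in 512-byte units, and it ignores hard links and sparse-file subtleties.

```python
#!/usr/bin/env python3
"""Compare apparent size with on-disk (post-compression) size for a tree."""
import os
import sys

def usage(root):
    apparent = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or unreadable; skip it
            apparent += st.st_size            # logical bytes written
            allocated += st.st_blocks * 512   # on-disk bytes, after compression
    return apparent, allocated

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    apparent, allocated = usage(root)
    ratio = apparent / allocated if allocated else float("inf")
    print(f"apparent: {apparent / 2**30:.2f} GiB")
    print(f"on disk:  {allocated / 2**30:.2f} GiB (~{ratio:.2f}x compression)")
```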
Production
Production storage is the main workhorse of CSAIL NFS. Production servers are optimized for relatively high performance with parallel reads and writes from multiple client systems.
- The same checksum and compression settings are used as for scratch storage.
- All production filesystems are fully provisioned; storage is not overcommitted.
- Some groups have a shared quota across multiple filesystems, but for most groups, the space allocation is determined on an individual filesystem basis. In either case it is important for efficient operations and cost-effective use of the backup system that data from unrelated activities not be mixed in a single filesystem.
- The regular snapshot policy applies, with hourly, daily, weekly, and monthly snapshots allowing easy (no TIG intervention required) recovery of deleted or overwritten files; see the recovery sketch after this list.
- Each filesystem may or may not be backed up, depending on user needs and access patterns; when enabled, backups are taken daily.
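As an illustration of self-service recovery from snapshots, the sketch below assumes the file servers expose read-only snapshots through a hidden .zfs/snapshot directory at the root of each filesystem (typical of ZFS-backed servers, though the exact path is not specified above); snapshot names are also site-specific.

```python
#!/usr/bin/env python3
"""Sketch: recover an overwritten file from a read-only snapshot.
Assumes snapshots are exposed under <filesystem root>/.zfs/snapshot;
adjust the path if your server does it differently."""
import os
import shutil
import sys

def find_in_snapshots(fs_root, rel_path):
    """Yield (snapshot_name, path) for every snapshot containing rel_path."""
    snapdir = os.path.join(fs_root, ".zfs", "snapshot")
    if not os.path.isdir(snapdir):
        return  # no snapshot directory exposed here
    for snap in sorted(os.listdir(snapdir)):
        candidate = os.path.join(snapdir, snap, rel_path)
        if os.path.exists(candidate):
            yield snap, candidate

if __name__ == "__main__":
    # Usage: recover.py /path/to/filesystem/root relative/path/to/file
    fs_root, rel_path = sys.argv[1], sys.argv[2]
    matches = list(find_in_snapshots(fs_root, rel_path))
    for snap, path in matches:
        print(f"{snap}: {path}")
    if matches:
        # Copy the last match (newest, if snapshot names sort by age)
        # alongside the original with a .restored suffix.
        newest = matches[-1][1]
        dest = os.path.join(fs_root, rel_path) + ".restored"
        shutil.copy2(newest, dest)
        print(f"copied {newest} -> {dest}")
```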
Archival
Archival storage is intended for data that is no longer being actively updated and does not require high-performance parallel random reads. Presently archival storage is split between Stata and OC40, but TIG's long-term plan is to shift all of it to OC40.
- The disk layout is optimized for reliability over access speeds.
- A stronger checksum is used to ensure that errors are detected while automatic data recovery is still possible.
- Automatic data compression is enabled, but users are advised that offline data compression (e.g., bzip2 or lzma) will make more efficient use of the disk.
- Archival filesystems may or may not be backed up; when enabled, backups are taken weekly. The usual snapshot policy applies.
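Any standard offline compressor is fine for this; purely as an illustration, the sketch below compresses files to .xz using Python's standard-library lzma module and reports the savings.

```python
#!/usr/bin/env python3
"""Minimal sketch of offline compression before parking data on archival
storage: write a .xz copy of each file and report the space saved."""
import lzma
import os
import shutil
import sys

def compress_file(path, preset=6):
    """Compress path to path + '.xz' and return the compressed size in bytes."""
    dest = path + ".xz"
    with open(path, "rb") as src, lzma.open(dest, "wb", preset=preset) as out:
        shutil.copyfileobj(src, out)
    return os.path.getsize(dest)

if __name__ == "__main__":
    for path in sys.argv[1:]:
        before = os.path.getsize(path)
        after = compress_file(path)
        ratio = after / before if before else 1.0
        print(f"{path}: {before} -> {after} bytes ({ratio:.1%} of original)")
        # Remove the original only after verifying the .xz copy
        # (e.g., with 'xz -t'); that step is deliberately left manual here.
```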