NFS FAQ
Frequently Asked Questions
Access
How do I access /data?
CSAIL NFS is only supported on CSAIL Ubuntu clients.
The /data filesystem is synthesized inside the NFS automounter based
on configuration files installed by CSAIL Ubuntu’s configuration
management.
When properly set up, the df command will show:
$ df /data
Filesystem 1K-blocks Used Available Use% Mounted on
/etc/auto.d/auto.data 0 0 0 - /data
If the automounter is not installed, configured, and running, that usually means that the machine hasn’t been set up for NFS access. To fix this, run the following commands as root:
# echo 'autofs=yes' >> /etc/facter/facts.d/csail.txt
# puppet agent -t
/data is mounted but I don’t see anything!
The automounter synthesizes directory entries as and when they are
accessed.
On first access to a filesystem, the automounter will use its
configuration files (stored in /etc/auto.d) to locate the correct
server and path, create a mount point, and mount the remote
filesystem.
Before a filesystem has been mounted, it will not show up in directory
listings, and there will be a short delay on first access while the
mount completes.
Thus, you should not depend on browsing to find your filesystem —
always go directly to the full path you have been given.
The CSAIL automount configuration uses a hierarchical design, so
filesystems will be divided by function and research group.
The df command will tell you whether a particular
directory is being synthesized by the automounter or represents a real
directory on an NFS server.
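A rough illustration of the difference (the group name, filesystem name, server name, and sizes below are all placeholders):
$ df /data
Filesystem            1K-blocks  Used Available Use% Mounted on
/etc/auto.d/auto.data         0     0         0    - /data
$ df /data/mygroup/somefs
Filesystem                              1K-blocks    Used Available Use% Mounted on
files.example.csail.mit.edu:/vol/somefs  10485760 5242880   5242880  50% /data/mygroup/somefs
The first entry is synthesized by the automounter; the second is a real NFS mount, so df reports the exporting server and actual usage.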
Can I access my CSAIL NFS from my laptop/desktop/custom server?
CSAIL NFS is only supported on CSAIL Ubuntu. The NFS protocol is standard, and we do not employ technical measures to prevent anyone from mounting the CSAIL servers on an unsupported platform, but we cannot assist you if things go wrong. Furthermore, we will sometimes move filesystems from server to server without notice; clients not using CSAIL Ubuntu with up-to-date CSAIL configuration management will not receive updated configuration files and thus will have the filesystem ripped out from under them. Finally, unmanaged clients like laptops will not have the correct mapping of user and group IDs, so you will not be able to access files as yourself and will be able to create files that you cannot access or remove from another computer.
Why am I suddenly getting Read-only file system errors?
Usually this happens as a part of filesystem migration from one server
to another.
A client receiving this error is still mounting the filesystem from
the old server, usually because some process was holding the
filesystem mounted when the transition happened, so the automounter
could not unmount the old server.
We try to keep the old copy online, in read-only mode, for a week or so
after completing a move in case there are clients stuck in this state.
Use sudo lsof /data/mygroup/mountpoint to identify the processes which
are hanging on to the old mount point and then terminate them so the
automounter can do its job.
(Sometimes if things get really stuck you may need to
sudo umount /data/mygroup/mountpoint after terminating the remaining
processes.)
As an alternative, you can reboot the client node.
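A typical recovery sequence, using the mount point from the example above (the PID comes from whatever lsof reports on your client):
$ sudo lsof /data/mygroup/mountpoint    # list the processes holding the old mount
$ sudo kill <PID>                       # terminate each process lsof reported
$ sudo umount /data/mygroup/mountpoint  # only if the automounter still cannot unmount it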
Capacity
Why is the size of my disk shrinking?
The CSAIL NFS servers implement snapshots.
The quota for your filesystem counts both current data and old data
referenced by snapshots, but NFS does not provide a way to convey this
information to clients, so the space used by snapshots is subtracted from
your quota when reported in df.
In theory, we could define quotas another way (refquota and
refreservation rather than quota and reservation, for those
familiar with ZFS),
but this would require us to overcommit storage, which we do not want
to do.
(A few filesystems on group-owned servers are configured this way,
since the whole of the storage is allocated to the same group.)
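For the ZFS-curious, the server-side accounting looks roughly like this; this is not something you can run on an NFS client, and the pool/dataset name and values are made up for illustration:
# zfs get quota,refquota,usedbysnapshots tank/groups/mygroup
NAME                 PROPERTY         VALUE  SOURCE
tank/groups/mygroup  quota            10T    local
tank/groups/mygroup  refquota         none   default
tank/groups/mygroup  usedbysnapshots  1.2T   -
Because quota (rather than refquota) is in effect, the 1.2T held by snapshots counts against the 10T limit, and therefore disappears from the space df shows you.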
How much space can I get?
There are four distinct answers to this question.
- Individual users of shared “scratch” filesystems are limited to 1 TiB per filesystem. (Group “scratch” filesystems may have other quotas or no per-user quotas at all, at the discretion of the PI.)
- Individual filesystems should not exceed 25 TiB before compression if they are to be backed up.
- The total of a research group’s NFS allocation should not exceed 65 TiB of shared storage. (This limit is subject to review as technology improves.)
- Group-owned storage is unlimited, but if the group wants TIG to manage the file server, take backups, and handle drive replacements and migrations, the group must buy the equipment that we specify.
All of our filesystems are configured with compression enabled, so
depending on the format and redundancy of your data you may be able to
store substantially more.
Quotas are based on physical storage consumed and not the uncompressed
size of the data.
(The du utility by default shows the actual disk space occupied.
Use du --apparent-size to find the uncompressed size.)
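To compare the two measures yourself (the path below is a placeholder):
$ du -sh /data/mygroup/dataset                  # physical space consumed; this is what counts toward quota
$ du -sh --apparent-size /data/mygroup/dataset  # logical, uncompressed size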
I’m curious, how much storage does CSAIL have?
It’s a challenge to measure this accurately for reasons related to RAID overhead, and any static web page (like this one) will inevitably be out of date. In addition, there are multiple tiers of storage and much of the storage TIG manages is actually owned by research groups rather than being a lab-wide shared resource. However, as of 2022-02-15, the current raw amount of shared storage in the “production” tier is 1567 TiB, spread over three servers (two in Stata, one in MGHPCC).
Security
Can I access NFS from outside CSAIL?
No. NFS does not have any meaningful security mechanism, so to the extent we can provide any security at all, it depends on restricting network access to the NFS servers. (Arguably, even restricting it to the CSAIL network is too open: anyone who controls a machine on an “inside” network can read, write, create or delete any files on any CSAIL NFS server. CSAIL originally started with only AFS as a result of a security incident in the then AI Lab, wherein an attacker on an AI Lab workstation methodically began deleting all of the files on the AI Lab’s NFS server. There is an authentication protocol that we could configure, but you would have to have Kerberos tickets to authenticate to NFS as well as AFS, and our users have consistently told us that they care more about frictionless access than they do about security. See below for more information.)
Some people have successfully used sshfs to access CSAIL NFS over an
SSH connection to a server on the CSAIL network.
This will not perform especially well but is a reasonable option for
casual access, software development, and other low-volume, low-speed
interactive tasks.
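A minimal sketch, assuming you can SSH to some host on the CSAIL network that mounts the filesystem (the hostname and path below are placeholders):
$ mkdir -p ~/csail-data
$ sshfs somehost.csail.mit.edu:/data/mygroup/mountpoint ~/csail-data
$ ls ~/csail-data              # browse the files over the SSH connection
$ fusermount -u ~/csail-data   # unmount when finished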
Can I store NDA-protected data on CSAIL NFS?
Yes and no, but mostly no.
Because NFS lacks meaningful security, there is no way to protect data
stored in CSAIL NFS from unauthorized access.
You can store such data in encrypted form (we recommend age) so long
as your encryption keys are not stored on NFS (e.g., keep them on
local disk on a compute node, readable only by authorized users), but
for the lowest friction, we recommend that you provision your own file
server on a private network — see Computing in Secured Medium Risk
Environments.
Note that this represents a substantial hardware expense that you should build
into your grant proposals when undertaking research with this sort of
data.
Such servers are not integrated into the CSAIL NFS environment.
(Depending on your specific requirements, CSAIL AFS may provide a sufficient level of security when combined with additional access controls, albeit at a lower level of performance.)
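If you do take the encryption route, a minimal sketch with age looks like this (the filenames are placeholders, and the identity file must live on local disk, not on NFS):
$ age-keygen -o ~/nda-key.txt                            # prints the public key; keep this file off NFS
$ age -r age1... -o dataset.tar.age dataset.tar          # encrypt to the public key before writing to NFS
$ age -d -i ~/nda-key.txt dataset.tar.age > dataset.tar  # decrypt with the locally stored identity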
How does NFS access control actually work?
The NFS server trusts clients’ assertions about who they are. Every request simply states a user ID number and a set of group ID numbers, and the server uses those values to perform access control. This means that anyone who has superuser access on any machine on the CSAIL network can potentially read, modify, create, and delete files assuming the identity of any other user.
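In concrete terms, the only identity the server ever sees is roughly what id prints on the client (the names and numbers below are placeholders):
$ id
uid=12345(alice) gid=20001(mygroup) groups=20001(mygroup),20002(otherlab)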
Furthermore, NFS traffic is sent in the clear, without integrity protection, over the network. Thus, any attacker in the middle can also alter legitimate NFS requests and responses.
There are available mitigations for both of these flaws, but they would require making all existing clients and servers obsolete, and in some cases would require substantially altering the NFS service offering by requiring users to maintain valid Kerberos tickets on every client they use (making it more like AFS, and incompatible with batch computing like SLURM).