Performance recommendations

We are frequently asked to improve the performance of NFS access for a specific application or user community. Often, the problem lies in how the data is structured or how compute jobs access it. The following recommendations are intended to help you maximize the performance of your compute jobs.

Minimize round-trips

Above all else, minimize the number of round trips to the server. NFS is based on the remote procedure call (RPC) paradigm; most operations require the client to send a request to the server and wait for a response that includes information needed to make the next request. Read and write operations can be asynchronous and pipelined, but open, close, and metadata operations generally cannot. A single round trip can take as much time as reading or writing megabytes of bulk data. It is therefore highly advantageous to open an input or output file once, at the beginning of your compute job, and hold it open for the entire length of the job as you read and write, rather than repeatedly opening and closing files: each open/close cycle costs at least two, and possibly three, round trips, even when it is always the same file.
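
For illustration, here is a minimal sketch in Python (the file name results.jsonl and the placeholder records are hypothetical): hold one file handle open for the whole job instead of reopening the file for every record.

    import json

    # Placeholder stream of results produced by the job.
    records = ({"step": i, "loss": 1.0 / (i + 1)} for i in range(100_000))

    # One open and one close for the entire job; the writes themselves can be
    # pipelined by the NFS client.
    with open("results.jsonl", "w") as out:
        for rec in records:
            out.write(json.dumps(rec) + "\n")

    # By contrast, reopening the file for every record pays for an open and a
    # close, i.e. several synchronous round trips, per record:
    #
    #   for rec in records:
    #       with open("results.jsonl", "a") as out:
    #           out.write(json.dumps(rec) + "\n")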

Avoid deep directory structures

Directory traversal is both particularly expensive and inherently serialized. Each directory in a pathname requires one or two additional round trips, depending on what action you are taking.

Avoid large directories

This may seem to conflict with the previous recommendation, but very large directories (containing hundreds of thousands of files or subdirectories) are expensive to operate on, and they are particularly problematic for our backup system.

Enumerate your data set once

A particularly slow pattern we often see is traversing an entire file system to generate a list of files for training or analysis. If you are going to use the same dataset (or a subset of one) repeatedly, generate the list of files once, in a format containing just the information your jobs need, and cache it in a file. When your jobs run, they can read that manifest quickly and identify the files they will work on without slow and expensive directory tree walking.
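
A rough sketch of this pattern in Python; the paths and the .npy filter are hypothetical stand-ins for whatever your dataset actually contains.

    import os

    DATASET_ROOT = "/data/myproject"           # hypothetical dataset location
    MANIFEST = "/data/myproject/manifest.txt"  # hypothetical manifest path

    # Run once: walk the tree and record every data file, one path per line.
    with open(MANIFEST, "w") as m:
        for dirpath, _dirnames, filenames in os.walk(DATASET_ROOT):
            for name in filenames:
                if name.endswith(".npy"):
                    m.write(os.path.join(dirpath, name) + "\n")

    # In every subsequent job: read the manifest instead of re-walking the tree.
    with open(MANIFEST) as m:
        files = [line.strip() for line in m if line.strip()]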

Use large files

We use a block size of 128 KiB for NFS. This means that opening and reading the entire contents of a file takes the same amount of time for any file of that size or smaller. Reading multiple blocks at a time is substantially faster, because read RPCs can be pipelined, but only if the individual files you read are much larger than one block.

If your data files are small and your compute cluster has sufficient local temporary space, we strongly recommend keeping your data set in an archive (tar or ZIP) on the server and extracting the files each job needs to local storage; this will likely take much less time overall than reading many individual small files over NFS. There are libraries for all major languages that make it nearly as easy to access members of an archive as to access the file system. (Note that ZIP archives have a central directory that makes them superior for applications requiring random access.)
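
As a minimal sketch in Python, assuming a hypothetical dataset.zip on the NFS server and local scratch space for extraction:

    import os
    import tempfile
    import zipfile

    ARCHIVE = "/data/myproject/dataset.zip"          # hypothetical archive on NFS
    needed = ["images/0001.png", "images/0002.png"]  # members this job will use

    with zipfile.ZipFile(ARCHIVE) as zf, tempfile.TemporaryDirectory() as scratch:
        # The archive is read with a few large sequential NFS reads instead of
        # one open/read/close cycle per small file.
        zf.extractall(path=scratch, members=needed)
        with open(os.path.join(scratch, "images/0001.png"), "rb") as f:
            data = f.read()

If a job only needs a handful of members, zipfile.ZipFile.open() can also read a member directly without extracting it first.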

The situation is much the same when writing: better to save multi-gigabyte checkpoints or ZIP files of results than to write millions of individual files with only a few blocks of data in each. Note that writing to logs in append-only mode (e.g., >> redirection in the shell or a+ mode in the standard I/O library) is much more expensive than opening a file for plain writing (> or w+ mode) because each write operation is serialized (requiring two round trips).
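
For example, sketched in Python with a hypothetical results.zip and placeholder per-item records, many small results can be packed into a single archive rather than written as individual files:

    import json
    import zipfile

    # Placeholder results: many small per-item records.
    results = {f"item_{i:06d}.json": {"score": i * 0.1} for i in range(10_000)}

    # One large, sequentially written file on NFS instead of 10,000 tiny ones.
    with zipfile.ZipFile("results.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, payload in results.items():
            zf.writestr(name, json.dumps(payload))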

In addition, modern disks have a 4 KiB (4,096-byte) sector size. With our typical RAID-Z2 configuration, the smallest space any file can occupy on disk is therefore 12 KiB: two parity sectors for each sector of data, or 200% overhead. Larger files have much lower overhead, typically around 33% (again, this depends on the storage subsystem's configuration).
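
As a back-of-the-envelope illustration of that arithmetic, in Python (a deliberately simplified model that ignores RAID-Z padding, compression, and metadata):

    SECTOR_BYTES = 4096     # 4 KiB disk sectors
    PARITY_SECTORS = 2      # RAID-Z2 stores two parity sectors per stripe

    def on_disk_bytes(data_sectors: int) -> int:
        """Approximate space consumed by a file occupying data_sectors sectors."""
        return (data_sectors + PARITY_SECTORS) * SECTOR_BYTES

    def parity_overhead(data_sectors: int) -> float:
        """Parity bytes written per byte of file data."""
        return PARITY_SECTORS / data_sectors

    print(on_disk_bytes(1), parity_overhead(1))  # 12288 bytes (12 KiB), 2.0 -> 200% overhead
    print(parity_overhead(6))                    # e.g. 6 data + 2 parity -> ~0.33 -> 33%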

Don’t write into (near-)full file systems

As a file system fills up, finding free space becomes more and more computationally expensive. In addition, most of our file systems are configured with compression enabled. When a file system is close to its storage quota, write operations are serialized so that a precise “out of quota” error can be returned to the exact write RPC that failed, which means waiting for each write’s data to finish being compressed. Writing at high speed into a nearly full file system can completely hang the NFS server, as all available threads stall waiting for a slow write request to either commit or return an error.

Don’t run databases over NFS

NFS does not reliably provide the semantics that database engines expect, nor the performance characteristics they need. Run databases only on local storage.