Basic Usage / Quick Start

What is Slurm?

Slurm is a cluster scheduler used to share compute resources in a managed queue.

Our cluster consists of the GPU nodes vcuda-[0-4] and tig-k80-0, and the CPU nodes groenig-[0-15].

At the time of this writing, all but one of these nodes are hosted on OpenStack. The systems vcuda-[0-4] each have PCI pass-through access to eight Titan XP cards (12G video memory each) and have been assigned 16 virtual CPU cores (on the hypervisors' Xeon E5 processors) and 128G of memory. Nodes groenig-[0-15] are also virtual, with 16 virtual cores and 64G of memory each. tig-k80-0 is our one physical host, with eight Tesla K80s (12G video memory each), 32 Xeon E5 cores, and 128G of memory.

Getting Started

Prerequisites

You will need a CSAIL membership and credentials.

Your program or script must be located in an NFS filesystem. See the NFS documentation for information about requesting a filesystem. (Note: AFS tokens cannot be used within the cluster, and this includes your CSAIL home directory.)

To run a job, write an sbatch script describing it (see below) and submit the script with sbatch from the login node, tig-slurm.csail.mit.edu.

Data placed in /home will not be accessible across the cluster, and may cause your program to behave unexpectedly.
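
One quick way to check whether your current directory will be visible to the cluster is to ask df for its filesystem type (a type of nfs or nfs4 indicates an NFS mount):

% df -hT .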

Writing an sbatch script

An sbatch script contains a shell interpreter line, #SBATCH directives describing the resources the job needs, and the commands to run (typically launched with srun).

Here is a simple example.

% cat hello.sbatch

#!/bin/bash
#SBATCH --time=00:00:10
#SBATCH --partition=cpu
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=128
srun echo 'Hello, world!'

The script is submitted to the cluster as a single job, and output from its srun tasks is written to a file named after the job ID.

% sbatch hello.sbatch
Submitted batch job 127314

% cat slurm-127314.out
Hello, world!

Submit multiple jobs

We can request an array of jobs with sbatch.

% cat dice.sbatch

#!/bin/bash
#SBATCH --time=00:00:10
#SBATCH --partition=cpu
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=128
srun echo $(( $RANDOM % 6 + 1))

Each #SBATCH directive applies to all srun commands in the script, and when the script is submitted with --array, each task in the array receives its own copy of the requested resources.

% sbatch --array=0-5 dice.sbatch
Submitted batch job 129942


% ls slurm-129942* && cat slurm-129942*
slurm-129942_0.out  slurm-129942_1.out  slurm-129942_2.out  slurm-129942_3.out  slurm-129942_4.out  slurm-129942_5.out
1
6
4
3
2
6

Use the array index

A task in an array can read its index from the SLURM_ARRAY_TASK_ID environment variable.

% cat nlog.sbatch

#!/bin/bash
#SBATCH --time=00:00:10
#SBATCH --partition=cpu
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=128
srun python -c 'import math,os; \
  my_index = int(os.getenv("SLURM_ARRAY_TASK_ID"));\
  print(math.log( my_index ))'

% sbatch --array=1-8 nlog.sbatch

Submitted batch job 128222

% cat slurm-128222_*
0.0
0.69314718056
1.09861228867
1.38629436112
1.60943791243
1.79175946923
1.94591014906
2.07944154168
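
A common pattern is to use the index to select a different input for each task. Below is a minimal sketch of that pattern; the input directory and processing script are hypothetical placeholders for your own data and code.

% cat process.sbatch

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --partition=cpu
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=128
# Pick the Nth input file, where N is this task's array index.
# /path/to/nfs/inputs and process.py are placeholders.
INPUT=$(ls /path/to/nfs/inputs/*.txt | sed -n "$((SLURM_ARRAY_TASK_ID + 1))p")
srun python process.py "$INPUT"

% sbatch --array=0-9 process.sbatch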

Request GPU resources

If you require use of a GPU, submit your job to the “gpu” partition and request GPUs with --gres (short for “generic resource”).

% cat gpu.sbatch

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1

srun nvidia-smi -q

To request a specific GPU type, add it to the --gres directive, e.g. --gres=gpu:titan:1.

% sbatch gpu.sbatch
Submitted batch job 129931

% head slurm-129931.out

==============NVSMI LOG==============

Timestamp                           : Tue Jun  2 09:00:00 2020
Driver Version                      : 430.50
CUDA Version                        : 10.1

Attached GPUs                       : 8
GPU 00000000:04:00.0
    Product Name                    : Tesla K80

Additional options

For advanced usage of sbatch, please refer to https://slurm.schedmd.com/sbatch.html
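
A few commonly used directives (standard sbatch options, shown here as an illustrative sketch) give the job a name and control where its output goes; %j is replaced by the job ID.

#SBATCH --job-name=hello
#SBATCH --output=hello_%j.out
#SBATCH --error=hello_%j.err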

Interactive shell

For debugging, it can be useful to open an interactive shell on the cluster.

% srun --pty --cpus-per-task=1 --mem-per-cpu=2000M --time=00:45:00 /bin/bash
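
The same approach works for debugging GPU code; for example, combining the interactive options above with the GPU partition and --gres request described earlier:

% srun --pty --partition=gpu --gres=gpu:1 --cpus-per-task=1 --mem-per-cpu=2000M --time=00:45:00 /bin/bash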

Job Priority / QoS

When a job is submitted without a --qos option, the default QoS limits the resources you can claim. Current limits can be seen on the login banner at tig-slurm.csail.mit.edu.

This quota can be bypassed by submitting with --qos=low. This is useful when the cluster is mostly idle and you would like to make use of available resources beyond your quota. However, if these resources are required for someone else's job, your job may be terminated.
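
For example, to submit the hello-world script from earlier beyond your normal quota:

% sbatch --qos=low hello.sbatch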

Your job and the queue

sinfo

sinfo provides basic information about available cluster resources.

    % sinfo

    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    cpu          up 4-00:00:00      2   idle groenig-[0-2]
    gpu*         up 4-00:00:00      1   resv vcuda-4
    gpu*         up 4-00:00:00      1    mix vcuda-0
    gpu*         up 4-00:00:00      3   idle vcuda-[1-3]

Additional details about a node are available with ‘scontrol show node’.

% scontrol show node groenig-0
NodeName=groenig-0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=groenig-0 NodeHostName=groenig-0 Version=17.11
   OS=Linux 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019
   RealMemory=55000 AllocMem=0 FreeMem=58731 Sockets=16 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=cpu
   BootTime=2019-12-05T17:19:47 SlurmdStartTime=2020-05-31T06:25:10
   CfgTRES=cpu=16,mem=55000M,billing=16
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

squeue

You may list the contents of the queue with squeue.

% squeue

    JOBID  PARTITION     NAME     USER    ST       TIME  NODES NODELIST(REASON)
    129959       cpu     bash     erin     R       0:05      1 groenig-0
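
To list only your own jobs, pass your username to squeue:

% squeue -u $USER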

scontrol show job

Detailed information about your job is available with ‘scontrol show job’.

% scontrol show job 129959

   JobId=129959 JobName=bash
   UserId=erin(23372) GroupId=erin(23372) MCS_label=N/A
   Priority=1 Nice=0 Account=csail QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:01:40 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2020-06-02T12:12:45 EligibleTime=2020-06-02T12:12:45
   StartTime=2020-06-02T12:12:45 EndTime=2020-06-03T12:12:45 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-06-02T12:12:45
   Partition=cpu AllocNode:Sid=slurm-control-0:29786
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=groenig-0
   BatchHost=groenig-0
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=2000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/home/erin
   Power=

scancel

You can use scancel to terminate your job early, for example:

% scancel -v 129960
scancel: Terminating job 129960
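
scancel can also cancel all of your pending and running jobs at once by specifying your username:

% scancel -u $USER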