SLURM Compute Cluster
Overview
Slurm is an open-source, highly scalable cluster management and job scheduling system.
At CSAIL, our Slurm cluster pools together the computing power of individual research groups alongside lab-wide general-use systems. This allows researchers to queue, manage, and execute intensive computational workloads efficiently without manually hunting for available machines.
Why Use Slurm?
- Access to Premium Hardware: Easily request specific resources for your jobs, from high-memory CPU nodes to the latest high-performance GPUs (including A100s, H100s, and H200s).
- Automated Job Management: Submit your code as a batch job (sbatch), and Slurm will automatically allocate resources, run your code, and save the output logs to your NFS directory while you focus on other work.
- Fair Resource Allocation: Slurm's Fairshare algorithm ensures that all users and groups get equitable access to compute time based on their contributions and recent usage.
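To make the batch-job workflow concrete, here is a minimal sketch of a Slurm batch script. The job name, resource values, and script contents are illustrative placeholders, not CSAIL defaults; consult the Quick Start Guide for site-specific settings.

```shell
#!/bin/bash
# Minimal Slurm batch script sketch (values are placeholders).
#SBATCH --job-name=example
#SBATCH --output=slurm-%j.out   # %j expands to the job ID
#SBATCH --time=01:00:00         # walltime limit (HH:MM:SS)
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

# Your workload goes here; stdout/stderr land in the --output file,
# which should live on NFS so it survives the job.
python train.py
```

Save this as, say, `job.sh` and submit it with `sbatch job.sh`; Slurm prints the assigned job ID and queues the work.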
Knowledge Base Directory
Whether you are running your first job or optimizing a massive array of GPU tasks, the guides below will help you navigate the CSAIL Slurm cluster:
Getting Started
- Quick Start Guide: Step-by-step instructions for writing your first Slurm script and running interactive jobs.
- Commands & Usage: A cheat sheet for essential Slurm commands like srun, sbatch, salloc, and sinfo.
- Frequently Asked Questions: Solutions for common issues regarding Conda, pip, Apptainer/Docker, and X-Forwarding.
- Want to contribute hardware? If your research group owns computing hardware at a CSAIL datacenter and you would like to have your jobs scheduled with Slurm, please inquire at help@csail.mit.edu. For more info, see our page on Joining CSAIL Slurm.
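As a quick taste of the commands covered in the cheat sheet, a few common invocations are sketched below. The resource values are illustrative only, and these commands require a working Slurm cluster, so run them from a login node.

```shell
# Show partitions and node states
sinfo

# Run a command interactively on freshly allocated resources
srun --pty --cpus-per-task=1 --mem=2G bash

# Allocate resources for an interactive session (shell opens inside it)
salloc --cpus-per-task=2 --time=00:30:00

# Submit a batch script; Slurm prints the assigned job ID
sbatch my_job.sh

# List your own queued and running jobs
squeue -u "$USER"
```

See Commands & Usage for the full option reference for each command.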
Managing Your Jobs
- Compute Requirements: How to properly request CPUs, memory, time limits, and specific GPU types.
- Partitions & QoS: Understand the different hardware groupings and Quality of Service limits available to your account.
- Priority & Fairshare: Learn how your job's place in the queue is calculated.
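Resource requests are expressed as `#SBATCH` directives in a batch script (or the equivalent command-line flags). A hedged sketch follows; the GPU type string and partition name are placeholders that vary by cluster, so check the Compute Requirements and Partitions & QoS pages for the values valid at CSAIL.

```shell
# Job-script fragment: requesting resources (placeholder values).
#SBATCH --cpus-per-task=8        # CPU cores for the task
#SBATCH --mem=32G                # total memory for the job
#SBATCH --time=04:00:00          # time limit; jobs are killed when it expires
#SBATCH --gres=gpu:1             # one GPU of any type
# To request a specific GPU type instead (type names vary by site):
##SBATCH --gres=gpu:a100:1
#SBATCH --partition=<partition>  # placeholder; see Partitions & QoS
```

Asking for only what your job needs also helps your Fairshare standing, since priority accounts for recent resource usage.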
Environment & Infrastructure
- Storage in Slurm: Best practices for managing your data. (Note: AFS is not supported; use NFS or temporary local storage.)
- SSH to Compute Nodes: Rules and methods for getting an interactive shell on a running job.
- Shared Resources: Details on the csail-shared partition and preemptible group hardware.
- Maintenance Schedule: View the bi-monthly cluster downtime calendar.
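Within a job, heavy I/O is usually best directed at node-local scratch rather than NFS. The sketch below assumes the node exposes a per-job scratch directory via `$TMPDIR` and uses hypothetical file names; verify the actual scratch path on the Storage in Slurm page before relying on it.

```shell
#!/bin/bash
#SBATCH --job-name=scratch-demo
# Assumption: $TMPDIR points at node-local scratch (verify for CSAIL nodes).
# Stage input from NFS to local scratch, compute there, copy results back.
cp ~/data/input.dat "$TMPDIR"/
cd "$TMPDIR"
./process input.dat > results.out   # hypothetical program and file names
cp results.out ~/results/
```

Copying results back before the script exits matters: local scratch is typically cleaned up when the job ends.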
Announcements
Stay Informed: Join the Slurm-Announce email list. Scheduled maintenance, downtime, and impactful changes will be announced here.