Job Control

Controlling Jobs

Job Control Command                        Explanation
squeue                                     View job and job step information for jobs managed by Slurm.
scontrol show node                         Show detailed information about compute nodes.
scontrol show partition <partition name>   Show detailed information about a specific partition/queue.
scontrol show job <job ID>                 Show detailed information about a specific job, or about all jobs if no job ID is given.
sinfo                                      View information about Slurm nodes and partitions/queues.
scancel <job ID>                           Kill a job. Users can kill their own jobs; root can kill any job.
scontrol hold <job ID>                     Hold a job.
scontrol release <job ID>                  Release a job.
sbalance                                   Check available account balance.

Sample Command Outputs:

List Jobs
$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    106  standard slurm-jo    user1   R       0:04      1 atom01

Get Job Details
$ scontrol show job 106
JobId=106 Name=slurm-job.sh
   UserId=user1(1001) GroupId=user1(1001)
   Priority=4294901717 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:07 TimeLimit=14-00:00:00 TimeMin=N/A
   SubmitTime=2013-01-26T12:55:02 EligibleTime=2013-01-26T12:55:02
   StartTime=2013-01-26T12:55:02 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=standard AllocNode:Sid=atom-head1:3526
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=atom01
   BatchHost=atom01
   NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/user1/slurm/local/slurm-job.sh
   WorkDir=/home/user1/slurm/local
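
When a script needs only one value from this output, the Key=Value pairs can be split apart with standard text tools. The following is a minimal sketch rather than a Slurm feature: job_field is a hypothetical helper name, and it assumes the requested value contains no spaces.

```shell
# job_field KEY: read `scontrol show job` output on stdin and print the
# value of KEY. Hypothetical helper (not part of Slurm); assumes the value
# itself contains no spaces.
job_field() {
    # Put each Key=Value pair on its own line, then print the text after "KEY=".
    tr -s ' ' '\n' | awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2) }'
}

# Usage on a cluster (not run here): scontrol show job 106 | job_field JobState
```

Matching on the exact key name keeps NodeList from also matching ReqNodeList or ExcNodeList.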

Kill a Job
$ scancel 135
$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)

Hold a Job
$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    139  standard   simple    user1  PD       0:00      1 (Dependency)
    138  standard   simple    user1   R       0:16      1 atom01
$ scontrol hold 139
$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    139  standard   simple    user1  PD       0:00      1 (JobHeldUser)
    138  standard   simple    user1   R       0:32      1 atom01

Release a Job
$ scontrol release 139
$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    139  standard   simple    user1  PD       0:00      1 (Dependency)
    138  standard   simple    user1   R       0:46      1 atom01
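
The reason column in these listings is what distinguishes a held job from one that is merely waiting. As a rough sketch, the default squeue layout shown here can be filtered with awk to list the IDs of jobs currently held by the user. held_jobs is a hypothetical helper name, and the column positions assume the default output format and job names without spaces.

```shell
# held_jobs: read default-format `squeue` output on stdin and print the IDs
# of pending jobs whose reason is (JobHeldUser). Hypothetical helper; assumes
# the eight default columns and no spaces inside the job name.
held_jobs() {
    # NR > 1 skips the header line; column 5 is the state, column 8 the reason.
    awk 'NR > 1 && $5 == "PD" && $8 == "(JobHeldUser)" { print $1 }'
}

# Usage on a cluster (not run here): squeue | held_jobs
```

Each ID printed this way could then be passed to scontrol release, as in the example above.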

View the Available Partitions/Queues and Node Status
$ sinfo -s
PARTITION     AVAIL   TIMELIMIT   NODES(A/I/O/T)  NODELIST
standard         up 3-00:00:00    32/356/54/442  cn[001-384],gpu[001-022],hm[001-036]
gpu              up 3-00:00:00        0/21/1/22  gpu[001-022]
hm               up 3-00:00:00        0/35/1/36  hm[001-036]
standard-low*    up 3-00:00:00    32/356/54/442  cn[001-384],gpu[001-022],hm[001-036]
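
In the NODES(A/I/O/T) column the four numbers are the allocated, idle, other, and total node counts, and the asterisk after a partition name marks the default partition. A small sketch of pulling the idle count per partition out of this summary follows. idle_nodes is a hypothetical helper name, and it assumes the column layout shown above.

```shell
# idle_nodes: read `sinfo -s` output on stdin and print "partition idle_count"
# for each partition. Hypothetical helper; assumes NODES(A/I/O/T) is column 4.
idle_nodes() {
    awk 'NR > 1 {
        split($4, counts, "/")   # counts[1..4] = allocated, idle, other, total
        name = $1
        sub(/\*$/, "", name)     # drop the default-partition marker
        print name, counts[2]
    }'
}

# Usage on a cluster (not run here): sinfo -s | idle_nodes
```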
