Job Control
Controlling Jobs
| Job Control Command | Explanation |
|---|---|
| `squeue` | View job and job step information for jobs managed by Slurm. |
| `scontrol show node` | Show detailed information about compute nodes. |
| `scontrol show partition <partition name>` | Show detailed information about a specific partition/queue. |
| `scontrol show job <job ID>` | Show detailed information about a specific job, or about all jobs if no job ID is given. |
| `sinfo` | View information about Slurm nodes and partitions/queues. |
| `scancel <job ID>` | Cancel (kill) a job. Users can cancel their own jobs; root can cancel any job. |
| `scontrol hold <job ID>` | Hold a pending job so that it is not started. |
| `scontrol release <job ID>` | Release a held job so that it can be scheduled again. |
| `sbalance` | Check the available account balance. |
Sample Command Outputs:
List Jobs
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
106 standard slurm-jo user1 R 0:04 1 atom01
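The default `squeue` layout is column-aligned, with the job state code in the fifth (`ST`) column: `R` for running, `PD` for pending. The `NAME` column is truncated to eight characters by default, which is why the job script `slurm-job.sh` is listed as `slurm-jo`. The sketch below tallies jobs per state from that layout. It assumes the default column order shown above, and `count_by_state` is a hypothetical helper name; the input is sample output from this section, read from a here-document so the example runs without a cluster. For scripts on a real system, an explicit format such as `squeue -h -o '%t'` is less fragile than relying on the default columns.

```shell
#!/bin/sh
# count_by_state (hypothetical helper): reads default-format squeue output on
# stdin, skips the header line, and prints "<state> <count>" for each state
# code found in the ST (5th) column, sorted by state code.
count_by_state() {
    awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample output from this section; on a cluster you would instead run:
#   squeue -u "$USER" | count_by_state
count_by_state <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
139 standard simple user1 PD 0:00 1 (Dependency)
138 standard simple user1 R 0:16 1 atom01
EOF
```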
Get Job Details
$ scontrol show job 106
JobId=106 Name=slurm-job.sh
   UserId=user1(1001) GroupId=user1(1001)
   Priority=4294901717 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:07 TimeLimit=14-00:00:0 TimeMin=N/A
   SubmitTime=2013-01-26T12:55:02 EligibleTime=2013-01-26T12:55:02
   StartTime=2013-01-26T12:55:02 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=standard AllocNode:Sid=atom-head1:3526
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=atom01 BatchHost=atom01
   NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/user1/slurm/local/slurm-job.sh
   WorkDir=/home/user1/slurm/local
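`scontrol show job` prints whitespace-separated `Key=Value` pairs (`scontrol -o show job <job ID>` puts each job on a single line). A minimal sketch for pulling out one field, such as `JobState` or `NodeList`, is to split on whitespace and strip the `Key=` prefix. The helper name `job_field` is hypothetical, and the approach assumes the value itself contains no spaces; that holds for the fields in the sample above, but a `Command` or `WorkDir` path containing spaces would break it.

```shell
#!/bin/sh
# job_field KEY (hypothetical helper): reads `scontrol show job` output on
# stdin, puts each whitespace-separated token on its own line, and prints the
# value of the first token that starts with "KEY=". Assumes values have no spaces.
job_field() {
    tr -s '[:space:]' '\n' |
        awk -v key="$1" 'index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }'
}

# A subset of the job details above; on a cluster you would instead run:
#   scontrol show job 106 | job_field JobState
sample='JobId=106 Name=slurm-job.sh
   JobState=RUNNING Reason=None Dependency=(null)
   NodeList=atom01 BatchHost=atom01'
printf '%s\n' "$sample" | job_field JobState   # prints RUNNING
printf '%s\n' "$sample" | job_field NodeList   # prints atom01
```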
Kill a Job
$ scancel 135
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
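In the example above, `scancel` prints nothing, and the cancellation is confirmed by the job no longer appearing in `squeue`. The hypothetical `job_listed` helper below performs that check against the `JOBID` column of default-format `squeue` output and succeeds only if the given ID is present. The sample input is taken from the `squeue` listings in this section; on a cluster you could pipe `squeue -u "$USER"` into it instead.

```shell
#!/bin/sh
# job_listed ID (hypothetical helper): reads default-format squeue output on
# stdin and exits 0 if ID appears in the JOBID (1st) column, 1 otherwise.
job_listed() {
    awk -v id="$1" 'NR > 1 && $1 == id { found = 1 } END { exit !found }'
}

hdr='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)'

# Job 106 is still in the queue.
printf '%s\n' "$hdr" '106 standard slurm-jo user1 R 0:04 1 atom01' |
    job_listed 106 && echo "106 still queued"

# After `scancel 135`, squeue shows only the header, so 135 is gone.
printf '%s\n' "$hdr" | job_listed 135 || echo "135 no longer listed"
```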
Hold a Job
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
139 standard simple user1 PD 0:00 1 (Dependency)
138 standard simple user1 R 0:16 1 atom01
$ scontrol hold 139
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
139 standard simple user1 PD 0:00 1 (JobHeldUser)
138 standard simple user1 R 0:32 1 atom01
Release a Job
$ scontrol release 139
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
139 standard simple user1 PD 0:00 1 (Dependency)
138 standard simple user1 R 0:46 1 atom01
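The `NODELIST(REASON)` column is what changes in these examples: for a pending job it shows why the job is waiting. Job 139 was already waiting on a dependency, so it shows `(JobHeldUser)` after `scontrol hold` and returns to `(Dependency)` after `scontrol release`; releasing the hold does not clear the dependency. A minimal sketch for reading that value for one job from default-format `squeue` output follows. `job_reason` is a hypothetical helper, and it assumes the reason is the last column and contains no spaces; on a real system, `squeue -h -j <job ID> -o '%r'` asks Slurm for the reason directly.

```shell
#!/bin/sh
# job_reason ID (hypothetical helper): reads default-format squeue output on
# stdin and prints the last column (NODELIST(REASON)) for job ID with the
# parentheses removed. For a running job this is the node list instead.
job_reason() {
    awk -v id="$1" 'NR > 1 && $1 == id { r = $NF; gsub(/[()]/, "", r); print r; exit }'
}

hdr='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)'

# Output after `scontrol hold 139`
printf '%s\n' "$hdr" \
    '139 standard simple user1 PD 0:00 1 (JobHeldUser)' \
    '138 standard simple user1 R 0:32 1 atom01' | job_reason 139   # prints JobHeldUser

# Output after `scontrol release 139`
printf '%s\n' "$hdr" \
    '139 standard simple user1 PD 0:00 1 (Dependency)' \
    '138 standard simple user1 R 0:46 1 atom01' | job_reason 139   # prints Dependency
```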
View the Available Partitions/Queues and Node Status
$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
standard up 3-00:00:00 32/356/54/442 cn[001-384],gpu[001-022],hm[001-036]
gpu up 3-00:00:00 0/21/1/22 gpu[001-022]
hm up 3-00:00:00 0/35/1/36 hm[001-036]
standard-low* up 3-00:00:00 32/356/54/442 cn[001-384],gpu[001-022],hm[001-036]
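In `sinfo -s` output, the `NODES(A/I/O/T)` column packs four counts into one field: allocated, idle, other (for example down or drained), and total nodes. The `standard` row above, for instance, reads 32 allocated + 356 idle + 54 other = 442 total. The `*` after `standard-low` marks the default partition. A minimal awk sketch that expands the field per partition follows; `expand_nodes` is a hypothetical helper name, and the input is the summary shown above.

```shell
#!/bin/sh
# expand_nodes (hypothetical helper): reads `sinfo -s` output on stdin and
# prints one line per partition with the NODES(A/I/O/T) field (4th column)
# split into allocated, idle, other and total node counts.
expand_nodes() {
    awk 'NR > 1 {
        split($4, c, "/")
        printf "%s allocated=%s idle=%s other=%s total=%s\n", $1, c[1], c[2], c[3], c[4]
    }'
}

# Sample summary from this section; on a cluster: sinfo -s | expand_nodes
expand_nodes <<'EOF'
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
standard up 3-00:00:00 32/356/54/442 cn[001-384],gpu[001-022],hm[001-036]
gpu up 3-00:00:00 0/21/1/22 gpu[001-022]
hm up 3-00:00:00 0/35/1/36 hm[001-036]
standard-low* up 3-00:00:00 32/356/54/442 cn[001-384],gpu[001-022],hm[001-036]
EOF
```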