Not Running: Insufficient amount of resource: ncpus Even Though Resources Are Available



Problem: Jobs are stuck in the queue, showing "Insufficient amount of resource: ncpus" even though resources are available.
Troubleshooting: Start by checking the daemon log files and the tracejob output. At the default log levels these did not give any detailed information, so we increased the log events.

To increase the logging verbosity in the Server, Scheduler, and MoM:
For the server: qmgr -c "s s log_events=2047" ("s s" is qmgr shorthand for "set server")
For the scheduler: set log_filter to 0 in the sched_config file (/var/spool/PBS/sched_priv/sched_config)
For the MoM: add $logevent 0xffffffff to the MoM config file
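These can be applied from the head node as below (a sketch; the sed edit and the pgrep-based HUP signals are my own additions, assuming a default PBS Pro install under /var/spool/PBS):

# Server: takes effect immediately, no restart needed
qmgr -c "set server log_events = 2047"

# Scheduler: disable log filtering, then signal pbs_sched to re-read its config
sed -i 's/^log_filter:.*/log_filter: 0/' /var/spool/PBS/sched_priv/sched_config
kill -HUP $(pgrep -x pbs_sched)

# MoM: log all events, then signal pbs_mom to re-read its config
echo '$logevent 0xffffffff' >> /var/spool/PBS/mom_priv/config
kill -HUP $(pgrep -x pbs_mom)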


Then, while troubleshooting the problem, I checked the tracejob output (in tracejob output, S marks server entries, L scheduler entries, and A accounting entries):
[root@hn2017 sched_priv]# tracejob 16998

Job: 16998.hn2017

11/15/2017 14:17:59  L    Considering job to run
11/15/2017 14:17:59  L    Insufficient amount of resource: ncpus
11/15/2017 14:17:59  S    enqueuing into routeq, state 1 hop 1
11/15/2017 14:17:59  S    dequeuing from routeq, state 1
11/15/2017 14:17:59  S    enqueuing into pdd, state 1 hop 1
11/15/2017 14:17:59  S    Job Queued at request of ushak@hn2017, owner = ushak@hn2017, job name = Case1_coarse, queue = pdd
11/15/2017 14:17:59  S    Job Modified at request of Scheduler@hn2017
11/15/2017 14:17:59  A    queue=routeq
11/15/2017 14:17:59  A    queue=pdd
11/15/2017 14:25:05  L    Considering job to run
11/15/2017 14:25:05  L    Insufficient amount of resource: ncpus
11/15/2017 14:25:30  L    Considering job to run
11/15/2017 14:25:30  L    Insufficient amount of resource: ncpus
11/15/2017 14:30:43  L    Considering job to run
11/15/2017 14:30:43  L    Insufficient amount of resource: mem (R: 30gb A: 0kb T: 0kb)
11/15/2017 14:30:43  S    enqueuing into pdd, state 1 hop 1
11/15/2017 14:30:43  S    Requeueing job, substate: 11 Requeued in queue: pdd
11/15/2017 14:30:43  S    Job Modified at request of Scheduler@hn2017
11/15/2017 14:30:43  L    Job will never run with the resources currently configured in the complex
11/15/2017 14:30:58  L    Considering job to run
11/15/2017 14:30:58  L    Evaluating subchunk: ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb
11/15/2017 14:30:58  L    Failed to satisfy subchunk: 1:ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb
11/15/2017 14:30:58  L    Insufficient amount of resource: ncpus
11/15/2017 14:30:58  S    Job Modified at request of Scheduler@hn2017
11/15/2017 14:30:59  L    Considering job to run
11/15/2017 14:30:59  L    Evaluating subchunk: ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb
11/15/2017 14:30:59  L    Failed to satisfy subchunk: 1:ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb
11/15/2017 14:30:59  L    Insufficient amount of resource: ncpus
11/15/2017 14:39:13  S    Asked external license server for 4 cpu licenses, got 4
11/15/2017 14:39:13  S    Allocated 4 cpu licenses, float avail global 645815, float avail local 0, used locally 101
11/15/2017 14:39:13  L    Considering job to run
11/15/2017 14:39:13  L    Evaluating subchunk: ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb
11/15/2017 14:39:13  L    Allocated one subchunk: ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb
11/15/2017 14:39:13  S    Job Run at request of Scheduler@hn2017 on exec_vnode (cn002:ncpus=4:mem=31457280kb:lscratch=31457280kb)
11/15/2017 14:39:13  L    Job run
11/15/2017 14:39:22  A    user=ushak group=engr project=_pbs_project_default jobname=Case1_coarse queue=pdd ctime=1510735679 qtime=1510735679 etime=1510735679
                          start=1510736962 exec_host=cn002/1*4 exec_vnode=(cn002:ncpus=4:mem=31457280kb:lscratch=31457280kb) Resource_List.abaqus_lic=8
                          Resource_List.mem=30720mb Resource_List.mpiprocs=4 Resource_List.ncpus=4 Resource_List.nodect=1 Resource_List.place=shared
                          Resource_List.select=1:ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb Resource_List.software=Abaqus resource_assigned.mem=31457280kb
                          resource_assigned.ncpus=4
11/15/2017 14:58:26  S    Python spawn status 0 exit value 0
Note that the tracejob output above effectively splits in two: the earlier entries were captured at the default (lower) log level, while the later, more detailed entries (subchunk evaluation, license allocation) appeared only after raising the verbosity (log_filter 0 and log_events 2047).
1) I checked the tracejob output and the scheduler logs under /var/spool/PBS/sched_logs.
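The scheduler writes one log file per day (named YYYYMMDD) under sched_logs, so grepping for the job ID pulls out the relevant lines, for example:

grep 16998 /var/spool/PBS/sched_logs/20171115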
11/15/2017 14:30:58;0400;pbs_sched;Node;16998.hn2017;Evaluating subchunk: ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb
11/15/2017 14:30:58;0400;pbs_sched;Node;cn001;Job would conflict with reservation or top job
11/15/2017 14:30:58;0400;pbs_sched;Node;cn002;Job would conflict with reservation or top job
11/15/2017 14:30:58;0400;pbs_sched;Node;cn003;Job would conflict with reservation or top job
11/15/2017 14:30:58;0400;pbs_sched;Node;cn005;Job would conflict with reservation or top job
11/15/2017 14:30:58;0400;pbs_sched;Node;cn004;Job would conflict with reservation or top job
11/15/2017 14:30:58;0400;pbs_sched;Node;cn006;Job would conflict with reservation or top job
11/15/2017 14:30:58;0400;pbs_sched;Node;cn007;Job would conflict with reservation or top job
11/15/2017 14:30:58;0400;pbs_sched;Node;cn008;Insufficient amount of resource: ncpus (R: 4 A: 1 T: 28)
From the log, the job conflicts with a reservation or a top job on most nodes, and cn008 lacks free ncpus (R: 4 requested, A: 1 available, T: 28 total).
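A "top job" is a high-priority queued job for which the scheduler sets resources aside. To see when the scheduler estimates such jobs will start, PBS Pro's estimated-start-time display can help (a sketch; this option is my suggestion, not from the original session):

qstat -T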
Checking the queued jobs with the qstat command:

16868.hn2017                   vinaydh         fastq           JD_J1                --     8   224  960gb   --  Q  --
    --
   Not Running: Insufficient amount of resource: mem (R: 960gb A: 1004256164kb T: 1580972964kb)
16869.hn2017                   vinaydh         fastq           JD_J2                --     8   224  960gb   --  Q  --
    --
   Not Running: Insufficient amount of resource: mem (R: 960gb A: 1004256164kb T: 1580972964kb)
16870.hn2017                   vinaydh         fastq           JD_J3                --     8   224  960gb   --  Q  --
    --
   Not Running: Insufficient amount of resource: mem (R: 960gb A: 1004256164kb T: 1580972964kb)

PROBLEM: These jobs (16868, 16869, 16870) have been waiting for a long time. They are in the fastq queue (priority 100), while the recently submitted jobs are in the pdd queue (priority 80), so fastq has higher priority than pdd. Because the scheduler holds resources for the top-priority fastq jobs, it does not allow other small jobs to run even though resources are available; only once the big jobs run will other jobs be started. It is essentially a head-of-line bottleneck.
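The queue priorities can be confirmed directly with qmgr (queue names as configured on this cluster):

qmgr -c "list queue fastq priority"
qmgr -c "list queue pdd priority"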
SOLUTION: Either kill these big jobs or reduce the priority of that particular queue; once that is done, the small jobs start running.
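For example (job IDs and queue name are from this cluster; the new priority value of 70 is illustrative, anything below pdd's 80 works):

# Option 1: remove the blocking jobs
qdel 16868 16869 16870

# Option 2: drop fastq's priority below pdd (80) so smaller jobs can start
qmgr -c "set queue fastq priority = 70"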
