Problem: Jobs are stuck in the queue, showing insufficient amount of resource ncpus even though resources are available.
Troubleshooting: While troubleshooting, check the log files and the tracejob output as well. With the default settings we were unable to find any detailed information, so we increased the log events to raise the verbosity of the server, scheduler, and MoM.
For the server: qmgr -c "s s log_events=2047"
For the scheduler: set log_filter to 0 in the sched_config file (/var/spool/PBS/sched_priv/sched_config)
For MoM: add this line to the MoM config file: $logevent 0xffffffff
With the higher verbosity in place, I checked tracejob again.
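The scheduler change above is a one-line edit to sched_config, so it can be scripted. Here is a minimal sketch, assuming the file already contains a line of the form `log_filter: <value>`. The function name set_log_filter is made up for this example, and a backup copy with a .bak suffix is kept next to the file.

```shell
# Hypothetical helper: set log_filter in a PBS sched_config file.
# $1 = path to sched_config, $2 = new filter value (0 filters nothing out)
set_log_filter() {
    conf=$1
    value=$2
    # Rewrite the existing log_filter line in place; the original
    # file is preserved as "$conf.bak".
    sed -i.bak -E "s/^[[:space:]]*log_filter:.*/log_filter: ${value}/" "$conf"
}
```

On the scheduler host this would be called as `set_log_filter /var/spool/PBS/sched_priv/sched_config 0`. The scheduler usually has to re-read its configuration (for example after a HUP signal to pbs_sched) before the new filter takes effect.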
Inside the table, I have split the output into two rows: the first row shows the output with fewer log events, and the second row shows the output with full log events (log_filter 0 and log_events 2047). 1) I checked tracejob and the scheduler logs in /var/spool/PBS/sched_logs.
The log says the job would conflict with a reservation or top job. I then checked the qstat output.
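To pull that message out of a long scheduler log, the log can be filtered by job ID. This is a rough sketch, assuming each relevant log line contains both the job ID and the phrase "conflict with reservation or top job"; the exact wording may differ between PBS versions, and find_conflict_lines is a made-up name.

```shell
# Hypothetical helper: show scheduler log lines where a job is reported
# as conflicting with a reservation or top job.
# $1 = job ID (or a prefix of it), $2 = path to a scheduler log file
find_conflict_lines() {
    jobid=$1
    logfile=$2
    # First keep lines that mention the job, then keep only the
    # conflict message (fixed-string, case-insensitive match).
    grep -F -- "$jobid" "$logfile" | grep -F -i "conflict with reservation or top job"
}
```

Because the match is a plain substring, passing the full job ID including the server name avoids hits on longer IDs that merely start with the same digits.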
PROBLEM: These jobs (1688, 16869, 16870) have been waiting for a long time. They are in the fastq queue (priority 100), while the recently submitted jobs are in the pdd queue (priority 80). fastq has a higher priority than pdd, so the scheduler treats the big fastq jobs as top jobs. That is the reason why, even though resources are available, it does not allow any other small jobs to run. Only once the big jobs run on the cluster will other jobs be allowed to run. It is a bottleneck problem.
SOLUTION: We need to either kill these big jobs or reduce the priority of that particular queue; the small jobs then start running.
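The priority change can be made with qmgr. Below is a small sketch that prints the command by default rather than running it, so it can be reviewed first. The helper name set_queue_priority and the DRY_RUN switch are made up for this example, and the value 70 used below is only an illustration of "something lower than pdd's 80".

```shell
# Hypothetical helper: change a queue's priority through qmgr.
# $1 = queue name, $2 = new priority
# Prints the command unless DRY_RUN=0 is set in the environment.
set_queue_priority() {
    queue=$1
    prio=$2
    directive="set queue ${queue} priority = ${prio}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: only show what would be executed.
        printf 'qmgr -c "%s"\n' "$directive"
    else
        qmgr -c "$directive"
    fi
}
```

For example, `set_queue_priority fastq 70` prints the qmgr command that would put fastq below pdd, and `DRY_RUN=0 set_queue_priority fastq 70` would apply it. The other option from the solution, killing a big job, is done with qdel followed by the job ID.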