troubleshooting

troubleshooting [2022/06/22 14:44] – old revision restored (2022/06/22 10:27) 35.156.240.123
troubleshooting [2022/06/22 14:48] (current) – old revision restored (2022/06/17 22:18) 35.156.240.123
 +====== Troubleshooting Jobs ======
  
 +===== Job stuck in the queue and won't run: =====
 +
 +Show the queue to get JOBID and general information about what is running:
 +    $ showq
 +
 +Active jobs are jobs that are currently running. Eligible jobs are jobs that are waiting in line to run as soon as resources are available. Blocked jobs are jobs that cannot run for some reason; usually it is because the job has requested more resources than are allowed.
 +
 +Check the job for errors or to see why it isn't running:
 +    $ sudo -i
 +    # checkjob *JOBID*
 +
 +If the job is blocked, the checkjob output will tell you why. Usually it's because there are not enough resources available, or because the job has asked for more resources than it is allowed to use.
 +
 +Cancel the job and resubmit it with new resource requirements:
 +    # mjobctl -c *JOBID*
 +
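Because a cancel takes effect immediately, it can help to sanity-check the job ID before passing it to mjobctl. The following is a minimal sketch, not part of Moab: `cancel_job` is a hypothetical helper name, and the check assumes job IDs contain only letters, digits, and dots (adjust it to whatever showq prints on your system).

```shell
#!/bin/sh
# Hypothetical helper: refuse to call mjobctl unless the argument looks like a job ID.
# Assumption: job IDs consist only of letters, digits, and dots.
cancel_job() {
    case "$1" in
        ''|*[!0-9A-Za-z.]*)
            # Empty argument, or it contains a character outside the allowed set
            # (for example a placeholder such as *JOBID* pasted by mistake).
            echo "refusing: '$1' does not look like a job ID" >&2
            return 2
            ;;
    esac
    mjobctl -c "$1"
}
```

Usage, as root: `cancel_job 1234`.
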
 +===== Job is in a running state, but not making progress: =====
 +
 +Cancel the job. The user will need to troubleshoot the script or software that is being run. One way to do this is to bypass the scheduler: SSH into a node and run the job directly, with standard input/output attached to the terminal.
 +
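When running a job script by hand on a node, shell tracing makes it easier to see where the script stalls or fails. This is a rough sketch under stated assumptions: `run_traced` is a hypothetical helper name, the job script is assumed to be a bash script, and the output is written to a log file rather than the terminal so it can be shared with the user afterwards.

```shell
#!/bin/sh
# Hypothetical helper: run a job script outside the scheduler with tracing on,
# writing stdout and stderr together into one log file.
# Assumption: the job script is a bash script.
run_traced() {
    script=$1
    log=$2
    # bash -x prints each command (prefixed with "+ ") before executing it.
    bash -x "$script" > "$log" 2>&1
    status=$?
    # Record the script's exit status at the end of the log.
    echo "exit status: $status" >> "$log"
    return "$status"
}
```

Usage, on the node: `run_traced ./myjob.sh /tmp/myjob.log`, then inspect `/tmp/myjob.log`. Keep in mind that environment variables the scheduler normally sets for a job will be missing when the script is run this way.
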
 +===== No jobs running: =====
 +
 +Check that the scheduler is running, and restart it if necessary:
 +    # systemctl status moab
 +    # systemctl restart moab
 +If Moab won't start, check the Moab logs for clues:
 +    # tail -300 /opt/moab/log/moab.log
 +
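Three hundred log lines can be a lot to read, so filtering the tail for likely failure messages may speed things up. A minimal sketch: `recent_problems` is a hypothetical helper name, and the keyword list is a guess rather than anything taken from Moab's log format, so extend it as needed.

```shell
#!/bin/sh
# Hypothetical helper: print lines from the end of a log that mention
# common failure keywords.
# Assumption: the keyword list below is a guess, not Moab's documented format.
recent_problems() {
    logfile=$1
    lines=${2:-300}   # how many trailing lines to inspect (default 300)
    tail -n "$lines" "$logfile" | grep -iE 'error|fail|fatal|cannot|denied'
}
```

Usage, as root: `recent_problems /opt/moab/log/moab.log`. If nothing matches, the function prints nothing and returns a non-zero status, in which case fall back to reading the full tail.
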
 +Check the status of the nodes:
 +    # pbsnodes
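
On a cluster with many nodes, the full pbsnodes listing is long, and usually only the nodes that are not accepting work matter. The sketch below filters that listing. `unhealthy_nodes` is a hypothetical helper name, and it assumes each node record starts with the node name at the beginning of a line, followed by indented `key = value` lines that include `state = ...`; check this against the output on your system.

```shell
#!/bin/sh
# Hypothetical helper: read pbsnodes-style output on stdin and print
# "node state" for every node whose state is neither free nor job-exclusive.
# Assumption: records look like
#   node01
#        state = free
#        np = 8
unhealthy_nodes() {
    awk '
        # An unindented, non-empty line starts a new record: remember the node name.
        /^[^[:space:]]/ { node = $1 }
        # An indented "state = VALUE" line: report anything that is not in service.
        $1 == "state" && $2 == "=" {
            if ($3 != "free" && $3 != "job-exclusive") print node, $3
        }
    '
}
```

Usage, as root: `pbsnodes | unhealthy_nodes`.
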