Troubleshooting Jobs

Show the queue to get JOBID and general information about what is running:

  $ showq

Active jobs are jobs that are currently running. Eligible jobs are jobs that are waiting in line to run as soon as resources are available. Blocked jobs are jobs that cannot run for some reason; usually it is because the job has requested more resources than are allowed.
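To look at only one of the three groups described above, showq can be limited to a single job state. The sketch below assumes the standard Moab showq flags (-r for active, -i for eligible, -b for blocked); the helper name showq_flag is hypothetical and is only there to make the mapping explicit.

```shell
# Hypothetical helper: map a job state name to the showq flag that filters on it.
# Assumes the standard Moab flags: -r (active), -i (eligible), -b (blocked).
showq_flag() {
  case "$1" in
    active)   echo "-r" ;;
    eligible) echo "-i" ;;
    blocked)  echo "-b" ;;
    *)        echo "unknown state: $1" >&2; return 1 ;;
  esac
}

# Example use on a host where Moab is installed:
#   showq "$(showq_flag blocked)"
```

Running showq with the blocked filter is usually the quickest way to find the JOBIDs worth passing to checkjob.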

Check a job for errors, or to see why it isn't running:

  $ sudo -i
  # checkjob *JOBID*

If the job is blocked, the checkjob output will tell you why. Usually it's because there are not enough resources available, or because the job has asked for more resources than it is allowed to use. In that case, cancel the job and resubmit it with new resource requirements:

  # mjobctl -c *JOBID*

If checkjob shows errors instead, cancel the job as well. The user will need to troubleshoot the script or software that is being run. One way to do this is to bypass the scheduler: ssh into a node and run the job directly, with standard input/output.
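Running the job by hand on a node can be sketched as follows. This is a minimal illustration, not a site procedure: the function name run_direct, the default log name job_debug.log, and the use of bash to run the script are all assumptions.

```shell
# Hypothetical helper: run a job script outside the scheduler and keep
# everything it prints (stdout and stderr) in one log file.
run_direct() {
  script="$1"
  log="${2:-job_debug.log}"   # assumed default log name
  status=0
  bash "$script" > "$log" 2>&1 || status=$?
  echo "exit status: $status (output in $log)"
  return "$status"
}

# Example use, after sshing into a compute node:
#   run_direct ./myjob.sh myjob_debug.log
```

A non-zero exit status together with the captured output usually points at the failing step in the script, which the scheduler's own output may hide.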

Check that the scheduler is running, and restart it if necessary:

  # systemctl status moab
  # systemctl restart moab

If Moab won't start, check the Moab log for clues:

  # tail -300 /opt/moab/log/moab.log
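The last 300 lines can still be noisy, so it may help to filter for lines that look like problems. The sketch below assumes that relevant entries contain the words "error", "alert" or "fail"; that keyword list is a guess, not a documented Moab log format, and the function name recent_log_errors is hypothetical.

```shell
# Hypothetical helper: show the most recent log lines that look like problems.
# The keyword list (error, alert, fail) is an assumption; adjust it as needed.
recent_log_errors() {
  log="${1:-/opt/moab/log/moab.log}"
  count="${2:-20}"
  grep -iE 'error|alert|fail' "$log" | tail -n "$count"
}

# Example use as root:
#   recent_log_errors /opt/moab/log/moab.log 50
```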

Check the status of the nodes:

  # pbsnodes
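On a large cluster the full pbsnodes listing is long, and usually only the unhealthy nodes matter. The filter below is a sketch that assumes the usual TORQUE layout, where each node name starts a line and is followed by indented "state = ..." attributes; the function name down_nodes is hypothetical.

```shell
# Hypothetical helper: read pbsnodes output on stdin and print only nodes
# whose state includes down, offline or unknown.
# Assumes node names start at column 0 and attributes are indented.
down_nodes() {
  awk '
    /^[^ \t]/ { node = $1 }
    /^[ \t]+state = / && $3 ~ /down|offline|unknown/ { print node, $3 }
  '
}

# Example use as root:
#   pbsnodes -a | down_nodes
```

TORQUE's own "pbsnodes -l" gives a similar short list without any parsing, so the filter is mainly useful when the output has to be combined with other checks.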