转矩工作不进入“E”状态(除非“qrun”)

我添加到队列中的作业将保持“排队”状态,而不会尝试执行(除非我手动执行qrun

/var/spool/torque/server_logs

 04/11/2011 12:43:27;0100;PBS_Server;Job;16.localhost;enqueuing into batch, state 1 hop 1 04/11/2011 12:43:27;0008;PBS_Server;Job;16.localhost;Job Queued at request of test@localhost, owner = test@localhost, job name = Qqq, queue = batch 

这项工作只需要1个CPU在1个节点上。

 # qmgr -c "list queue batch" Queue batch queue_type = Execution total_jobs = 0 state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 max_running = 3 acl_host_enable = True acl_hosts = localhost resources_min.ncpus = 1 resources_min.nodect = 1 resources_default.ncpus = 1 resources_default.nodes = 1 resources_default.walltime = 00:00:10 mtime = Mon Apr 11 12:07:10 2011 resources_assigned.ncpus = 0 resources_assigned.nodect = 0 kill_delay = 3 enabled = True started = True 

我无法将resources_assigned设置为非零,因为Cannot set attribute, read only or insufficient permission resources_assigned.ncpus

当我运行一些任务时,这是妈妈的日志:

 04/11/2011 21:27:48;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE 04/11/2011 21:27:48;0001; pbs_mom;Job;TMomFinalizeJob3;job 18.localhost started, pid = 28592 04/11/2011 21:27:48;0080; pbs_mom;Job;18.localhost;scan_for_terminated: job 18.localhost task 1 terminated, sid=28592 04/11/2011 21:27:48;0008; pbs_mom;Job;18.localhost;job was terminated 04/11/2011 21:27:48;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 04/11/2011 21:27:48;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 04/11/2011 21:27:48;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat 04/11/2011 21:27:48;0080; pbs_mom;Job;18.localhost;obit sent to server 

调度程序日志( /var/spool/torque/sched_logs/20110705 ):

 07/05/2011 21:44:53;0002; pbs_sched;Svr;Log;Log opened 07/05/2011 21:44:53;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20110705 opened 07/05/2011 21:44:53;0002; pbs_sched;Svr;main;/usr/sbin/pbs_sched startup pid 16234 

qstat -f

 Job Id: 26.localhost Job_Name = qwe Job_Owner = test@localhost job_state = Q queue = batch server = localhost Checkpoint = u ctime = Tue Jul 5 21:43:31 2011 Error_Path = localhost:/home/test/jscfi/default/0.738784810485275/qwe.e26 Hold_Types = n Join_Path = n Keep_Files = n Mail_Points = a mtime = Tue Jul 5 21:43:31 2011 Output_Path = localhost:/home/test/jscfi/default/0.738784810485275/qwe.o26 Priority = 0 qtime = Tue Jul 5 21:43:31 2011 Rerunable = True Resource_List.ncpus = 1 Resource_List.neednodes = 1:ppn=1 Resource_List.nodect = 1 Resource_List.nodes = 1:ppn=1 Resource_List.walltime = 00:01:00 substate = 10 Variable_List = PBS_O_HOME=/home/test,PBS_O_LANG=en_US.UTF-8, PBS_O_LOGNAME=test, PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games, PBS_O_MAIL=/var/mail/test,PBS_O_SHELL=/bin/sh,PBS_SERVER=127.0.0.1, PBS_O_WORKDIR=/home/test/jscfi/default/0.738784810485275, PBS_O_QUEUE=batch,PBS_O_HOST=localhost euser = test egroup = test queue_rank = 1 queue_type = E etime = Tue Jul 5 21:43:31 2011 submit_args = run.pbs Walltime.Remaining = 6 fault_tolerant = False 

如何使它自动执行作业,无需手动qrun

我花了几个小时的类似症状的问题,最后是在服务器设置中缺less单个选项:

 qmgr -c "set server scheduling = True" 

通常情况下,调度程序会决定何时运行作业,即何时有足够的资源,并通知服务器运行作业。 你正在运行一个调度程序? TORQUE包含一个基本的调度程序( pbs_sched ),或者您可以安装并运行更复杂的maui(free)或moab(pay-for)。

PBS / TORQUE的pbs_server部分是一个“资源pipe理器” – 本质上只是一个“框架”。 它本身不做任何决定:这是调度程序的工作。