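I am using Torque 4.0.1 on openSUSE 12.1 in a cluster environment. When I qsub a job (a simple "echo hello"), it stays in the 'Q' state and is never scheduled. I can force the job to run with qrun, and it then executes on the first node without errors.
I have spent the past few days looking for a solution, without success. I read the manual, the logs, and even the source code, but I still cannot find the problem. I also searched extensively and tried various suggested fixes, but none of them worked.
Here is some information that may be useful: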
05/13/2012 18:55:08;0002; pbs_sched;Svr;Log;Log opened
05/13/2012 18:55:08;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120513 opened
05/13/2012 18:55:08;0002; pbs_sched;Svr;main;pbs_sched startup pid 32604
05/13/2012 19:33:08;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.1, loglevel = 0
05/13/2012 19:33:56;0100;PBS_Server;Job;16.head;enqueuing into batch, state 1 hop 1
05/13/2012 19:33:56;0008;PBS_Server;Job;16.head;Job Queued at request of pubuser@head, owner = pubuser@head, job name = STDIN, queue = batch
Job Id: 16.head
    Job_Name = STDIN
    Job_Owner = pubuser@head
    job_state = Q
    queue = batch
    server = head
    Checkpoint = u
    ctime = Sun May 13 19:33:56 2012
    Error_Path = head:/fserver/home/pubuser/STDIN.e16
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Sun May 13 19:33:56 2012
    Output_Path = head:/fserver/home/pubuser/STDIN.o16
    Priority = 0
    qtime = Sun May 13 19:33:56 2012
    Rerunable = True
    Resource_List.walltime = 01:00:00
    substate = 10
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/,
        PBS_O_WORKDIR=/fserver/home/pubuser,PBS_O_HOST=head,PBS_O_SERVER=head,
        PBS_O_WORKDIR=/fserver/home/pubuser
    euser = pubuser
    egroup = users
    queue_rank = 4
    queue_type = E
    etime = Sun May 13 19:33:56 2012
    fault_tolerant = False
    job_radix = 0
    submit_host = head
    init_work_dir = /fserver/home/pubuser
sun1
    state = free
    np = 2
    ntype = cluster
    status = rectime=1336910403,varattr=,jobs=,state=free,netload=44492032184,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1697420kb,totmem=1802616kb,idletime=241085,nusers=0,nsessions=0,uname=Linux sun1 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
    mom_service_port = 15002
    mom_manager_port = 15003
    gpus = 0

sun2
    state = free
    np = 2
    ntype = cluster
    status = rectime=1336910408,varattr=,jobs=,state=free,netload=39762812881,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1701012kb,totmem=1802616kb,idletime=239982,nusers=0,nsessions=0,uname=Linux sun2 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
    mom_service_port = 15002
    mom_manager_port = 15003
    gpus = 0

sun3
    state = free
    np = 2
    ntype = cluster
    status = rectime=1336910400,varattr=,jobs=,state=free,netload=45984311925,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1699772kb,totmem=1802616kb,idletime=212303,nusers=0,nsessions=0,uname=Linux sun3 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
    mom_service_port = 15002
    mom_manager_port = 15003
    gpus = 0

sun4
    state = free
    np = 2
    ntype = cluster
    status = rectime=1336910407,varattr=,jobs=,state=free,netload=37538584401,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805480kb,totmem=1908308kb,idletime=211197,nusers=0,nsessions=0,uname=Linux sun4 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
    mom_service_port = 15002
    mom_manager_port = 15003
    gpus = 0

sun5
    state = free
    np = 2
    ntype = cluster
    status = rectime=1336910411,varattr=,jobs=,state=free,netload=173547166,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1803816kb,totmem=1908308kb,idletime=211199,nusers=0,nsessions=0,uname=Linux sun5 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
    mom_service_port = 15002
    mom_manager_port = 15003
    gpus = 0

sun6
    state = free
    np = 2
    ntype = cluster
    status = rectime=1336910411,varattr=,jobs=,state=free,netload=24641446,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805704kb,totmem=1908308kb,idletime=212999,nusers=0,nsessions=0,uname=Linux sun6 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
    mom_service_port = 15002
    mom_manager_port = 15003
    gpus = 0

sun7
    state = free
    np = 2
    ntype = cluster
    status = rectime=1336910412,varattr=,jobs=,state=free,netload=1548383055,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805432kb,totmem=1908308kb,idletime=215630,nusers=0,nsessions=0,uname=Linux sun7 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
    mom_service_port = 15002
    mom_manager_port = 15003
    gpus = 0

sun8
    state = free
    np = 2
    ntype = cluster
    status = rectime=1336910400,varattr=,jobs=,state=free,netload=128755968,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1803448kb,totmem=1908308kb,idletime=211866,nusers=0,nsessions=0,uname=Linux sun8 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
    mom_service_port = 15002
    mom_manager_port = 15003
    gpus = 0

sun9
    state = free
    np = 2
    ntype = cluster
    status = rectime=1336910374,varattr=,jobs=,state=free,netload=1371896399,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805664kb,totmem=1908308kb,idletime=211161,nusers=0,nsessions=0,uname=Linux sun9 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
    mom_service_port = 15002
    mom_manager_port = 15003
    gpus = 0
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = head
set server managers = pubuser@head
set server managers += root@head
set server operators = pubuser@head
set server operators += root@head
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server keep_completed = 0
set server submit_hosts = head
set server next_job_number = 17
set server moab_array_compatible = True
Host: sun1/sun1   Version: 4.0.1   PID: 5362
Server[0]: head (192.168.0.1:15001)
  Last Msg From Server:   1584 seconds (DeleteJob)
  Last Msg To Server:     7 seconds
HomeDirectory:          /var/spool/torque/mom_priv
stdout/stderr spool directory: '/var/spool/torque/spool/' (4457492 blocks available)
MOM active:             229485 seconds
Check Poll Time:        45 seconds
Server Update Interval: 45 seconds
LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    TCP
MemLocked:              TRUE (mlock)
TCP Timeout:            0 seconds
Trusted Client List:    127.0.0.1:0,192.168.0.1:0,192.168.0.101:0,192.168.0.101:15003,192.168.0.102:15003,192.168.0.103:15003,192.168.0.104:15003,192.168.0.105:15003,192.168.0.106:15003,192.168.0.107:15003,192.168.0.108:15003,192.168.0.109:15003: 0
Copy Command:           /usr/bin/scp -rpB
NOTE: no local jobs detected
diagnostics complete
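One thing I noticed is that the MOM reports a TCP timeout of 0 seconds, which does not look right, given that the server is configured with tcp_timeout = 300. While running the diagnostics, I also found the following entry in mom_logs: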
05/13/2012 20:30:10;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Resource temporarily unavailable (11) in tcp_read_proto_version, no protocol version number End of File (errno 2)
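In case it helps anyone comparing these logs, here is a minimal Python sketch that splits one log entry into its parts. It assumes each entry is six semicolon-separated fields, as the entries above appear to be. The field names (`timestamp`, `event_code`, `daemon`, `obj_class`, `obj_name`, `message`) are my own labels for illustration, not official Torque terminology.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    # Field names are hypothetical labels chosen for this sketch.
    timestamp: str
    event_code: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_log_line(line: str) -> LogEntry:
    """Split one log entry into six fields.

    Splits at most five times so that any semicolons inside the
    message text stay part of the message.
    """
    parts = [p.strip() for p in line.split(";", 5)]
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    return LogEntry(*parts)


sample = (
    "05/13/2012 20:30:10;0001; pbs_mom;Svr;pbs_mom;"
    "LOG_ERROR::Resource temporarily unavailable (11) in "
    "tcp_read_proto_version, no protocol version number End of File (errno 2)"
)
entry = parse_log_line(sample)
print(entry.timestamp, entry.daemon, entry.message)
```

Running this over the scheduler, server, and MOM logs makes it easier to filter by daemon or job ID when looking for the moment a job should have been picked up.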
I googled this error but found nothing.
I hope someone can help me solve this problem. Thanks!