
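Why does requesting GPUs as a generic resource fail on a cluster running SLURM with the built-in plugin?

Disclaimer: this post is rather long because I have tried to provide all the relevant configuration information.

Status and problem:

I administer a GPU cluster and want to use slurm for job management. Unfortunately, I cannot request GPUs using slurm's corresponding generic resource plugin.

Note: test.sh is a small script that prints the environment variable CUDA_VISIBLE_DEVICES.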
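For reference, a minimal sketch of what a script like test.sh might look like. The actual script is not shown in the post, so the function name print_cuda_visible_devices and the exact output format are assumptions, chosen to match the `CUDA_VISIBLE_DEVICES=...` output quoted further down.

```shell
#!/bin/sh
# Hypothetical stand-in for test.sh: print the value of CUDA_VISIBLE_DEVICES
# as seen by the job step. An unset variable is printed as an empty value.
print_cuda_visible_devices() {
    printf 'CUDA_VISIBLE_DEVICES=%s\n' "${CUDA_VISIBLE_DEVICES-}"
}

print_cuda_visible_devices
```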

Running a job with `--gres=gpu:1` fails

Running srun -n1 --gres=gpu:1 test.sh results in the following error:

    srun: error: Unable to allocate resources: Requested node configuration is not available

Log:

    gres: gpu state for job 83
      gres_cnt:4 node_cnt:0 type:(null)
    _pick_best_nodes: job 83 never runnable
    _slurm_rpc_allocate_resources: Requested node configuration is not available

Running a job with `--gres=gram:500` completes

If I call srun -n1 --gres=gram:500 test.sh, the job runs and prints

    CUDA_VISIBLE_DEVICES=NoDevFiles

Log:

    sched: _slurm_rpc_allocate_resources JobId=76 NodeList=smurf01 usec=193
    debug: Configuration for job 76 complete
    debug: laying out the 1 tasks on 1 hosts smurf01 dist 1
    job_complete: JobID=76 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
    job_complete: JobID=76 State=0x8003 NodeCnt=1 done

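So slurm appears to be correctly configured to run jobs via srun with generic resources requested through --gres, but for some reason it does not recognize the gpus.

My first idea was to use a different name for the GPU generic resource, since the other generic resources seem to work, but I would like to stick with the gpu plugin.

Configuration

The cluster has more than two slave hosts, but for the sake of clarity I will stick to two slave hosts with slightly different configurations and the controller host: papa (controller), smurf01 and smurf02.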

slurm.conf

The parts of the slurm configuration relevant to generic resources:

    ...
    TaskPlugin=task/cgroup
    ...
    GresTypes=gpu,ram,gram,scratch
    ...
    NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
    NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
    ...

Note: ram is given in GB, gram in MB, and scratch again in GB.

`scontrol show node` output

    NodeName=smurf01 Arch=x86_64 CoresPerSocket=6
       CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 Features=intel,fermi
       Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
       NodeAddr=192.168.1.101 NodeHostName=smurf01 Version=14.11
       OS=Linux RealMemory=1 AllocMem=0 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
       BootTime=2015-04-23T13:58:15 SlurmdStartTime=2015-04-24T10:30:46
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

    NodeName=smurf02 Arch=x86_64 CoresPerSocket=6
       CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.01 Features=intel,fermi
       Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
       NodeAddr=192.168.1.102 NodeHostName=smurf02 Version=14.11
       OS=Linux RealMemory=1 AllocMem=0 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
       BootTime=2015-04-23T13:57:56 SlurmdStartTime=2015-04-24T10:24:12
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

smurf01 configuration

GPUs

    > ls /dev | grep nvidia
    nvidia0
    ...
    nvidia7
    > nvidia-smi | grep Tesla
    | 0 Tesla M2090 On | 0000:08:00.0 Off | 0 |
    ...
    | 7 Tesla M2090 On | 0000:1B:00.0 Off | 0 |
    ...

gres.conf

    Name=gpu Type=tesla File=/dev/nvidia0 CPUs=0
    Name=gpu Type=tesla File=/dev/nvidia1 CPUs=1
    Name=gpu Type=tesla File=/dev/nvidia2 CPUs=2
    Name=gpu Type=tesla File=/dev/nvidia3 CPUs=3
    Name=gpu Type=tesla File=/dev/nvidia4 CPUs=4
    Name=gpu Type=tesla File=/dev/nvidia5 CPUs=5
    Name=gpu Type=tesla File=/dev/nvidia6 CPUs=6
    Name=gpu Type=tesla File=/dev/nvidia7 CPUs=7
    Name=ram Count=48
    Name=gram Count=6000
    Name=scratch Count=1300

smurf02 configuration

GPUs

Same configuration/output as smurf01.

gres.conf on smurf02

    Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]
    Name=ram Count=48
    Name=gram Count=6000
    Name=scratch Count=1300

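Note: the daemons have been restarted and the machines have been rebooted as well. The slurm user and the job-submitting user have the same IDs/groups on the slave and controller nodes, and munge authentication is working properly.

Log output

I added DebugFlags=Gres to the slurm.conf file, and the GPUs seem to be recognized by the plugin:

Controller log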

    gres/gpu: state for smurf01
      gres_cnt found:8 configured:8 avail:8 alloc:0
      gres_bit_alloc:
      gres_used:(null)
      topo_cpus_bitmap[0]:0
      topo_gres_bitmap[0]:0
      topo_gres_cnt_alloc[0]:0
      topo_gres_cnt_avail[0]:1
      type[0]:tesla
      topo_cpus_bitmap[1]:1
      topo_gres_bitmap[1]:1
      topo_gres_cnt_alloc[1]:0
      topo_gres_cnt_avail[1]:1
      type[1]:tesla
      topo_cpus_bitmap[2]:2
      topo_gres_bitmap[2]:2
      topo_gres_cnt_alloc[2]:0
      topo_gres_cnt_avail[2]:1
      type[2]:tesla
      topo_cpus_bitmap[3]:3
      topo_gres_bitmap[3]:3
      topo_gres_cnt_alloc[3]:0
      topo_gres_cnt_avail[3]:1
      type[3]:tesla
      topo_cpus_bitmap[4]:4
      topo_gres_bitmap[4]:4
      topo_gres_cnt_alloc[4]:0
      topo_gres_cnt_avail[4]:1
      type[4]:tesla
      topo_cpus_bitmap[5]:5
      topo_gres_bitmap[5]:5
      topo_gres_cnt_alloc[5]:0
      topo_gres_cnt_avail[5]:1
      type[5]:tesla
      topo_cpus_bitmap[6]:6
      topo_gres_bitmap[6]:6
      topo_gres_cnt_alloc[6]:0
      topo_gres_cnt_avail[6]:1
      type[6]:tesla
      topo_cpus_bitmap[7]:7
      topo_gres_bitmap[7]:7
      topo_gres_cnt_alloc[7]:0
      topo_gres_cnt_avail[7]:1
      type[7]:tesla
      type_cnt_alloc[0]:0
      type_cnt_avail[0]:8
      type[0]:tesla
    ...
    gres/gpu: state for smurf02
      gres_cnt found:TBD configured:8 avail:8 alloc:0
      gres_bit_alloc:
      gres_used:(null)
      type_cnt_alloc[0]:0
      type_cnt_avail[0]:8
      type[0]:tesla

Slave log

    Gres Name=gpu Type=tesla Count=8 ID=7696487 File=/dev/nvidia[0-7]
    ...
    gpu 0 is device number 0
    gpu 1 is device number 1
    gpu 2 is device number 2
    gpu 3 is device number 3
    gpu 4 is device number 4
    gpu 5 is device number 5
    gpu 6 is device number 6
    gpu 7 is device number 7

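There seems to be a problem with the Type option in the installed version (14.11.5): removing Type=... from gres.conf and changing the node configuration lines accordingly (to Gres=gpu:N,ram:...) allows jobs that request gpus via --gres=gpu:N to run successfully.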
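As a rough illustration of that workaround, the sketch below applies the two changes to sample lines taken from the configuration above. The helper names strip_gres_type and strip_gpu_type are hypothetical, and the sed patterns assume the one-entry-per-line layout of the files shown earlier. It only prints the rewritten lines and does not modify any real files.

```shell
#!/bin/sh
# Hypothetical helper: drop the Type=... field from gres.conf lines read on stdin.
strip_gres_type() {
    sed -E 's/[[:space:]]+Type=[^[:space:]]+//'
}

# Hypothetical helper: rewrite "gpu:<type>:<count>" as "gpu:<count>" in a
# slurm.conf Gres= value read on stdin. Other resources are left untouched.
strip_gpu_type() {
    sed -E 's/gpu:[A-Za-z_][A-Za-z0-9_]*:([0-9]+)/gpu:\1/'
}

# Sample lines copied from the smurf02 gres.conf and the slurm.conf node entry.
echo 'Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]' | strip_gres_type
echo 'Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300' | strip_gpu_type
```

After editing the real files in this way, the daemons would need to pick up the new configuration before a request such as --gres=gpu:1 is retried.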
