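I have a desktop PC that serves as my server, primesystem, and a laptop that serves as my client, zerosystem, which connects to it. They act as my ssh-server and ssh-client respectively, and are connected by an Ethernet (not crossover) cable.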
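I followed the instructions in both of these tutorials: Running an MPI Cluster within a LAN, and Setting Up an MPICH2 Cluster in Ubuntu. The only difference is that I want to use a Python implementation of MPI, so I used mpi4py to test whether both PCs can make use of MPI.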
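I set up a directory /cloud on primesystem that is supposed to be shared across my network, and mounted it on zerosystem as instructed in the first tutorial (so I can work on both systems without going through ssh).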
On the server, primesystem, if I run the sample helloworld script, it works fine:
    one@primesystem:/cloud$ mpirun -np 5 -hosts primesystem python -m mpi4py helloworld
    Hello, World! I am process 0 of 5 on primesystem.
    Hello, World! I am process 1 of 5 on primesystem.
    Hello, World! I am process 2 of 5 on primesystem.
    Hello, World! I am process 3 of 5 on primesystem.
    Hello, World! I am process 4 of 5 on primesystem.
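If I run it through the host zerosystem, it also works (though it should be noted that there is a noticeable delay in execution, since it is using the remote CPU on zerosystem):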
    one@primesystem:/cloud$ mpirun -np 5 -hosts zerosystem python -m mpi4py helloworld
    Hello, World! I am process 0 of 5 on zerosystem.
    Hello, World! I am process 1 of 5 on zerosystem.
    Hello, World! I am process 2 of 5 on zerosystem.
    Hello, World! I am process 3 of 5 on zerosystem.
    Hello, World! I am process 4 of 5 on zerosystem.
But if I use both hosts, it seems to stop responding:
    one@primesystem:/cloud$ mpirun -np 5 -hosts primesystem,zerosystem python -m mpi4py helloworld
    Hello, World! I am process 0 of 5 on primesystem.
(If I swap the order of the hosts so that zerosystem comes first, no Hello World response is shown at all.)
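Since each host works on its own and only the combination hangs, one thing that might be worth ruling out is name resolution. Ubuntu often maps the machine's own hostname to 127.0.1.1 in /etc/hosts, and I am only assuming (not asserting) that this could matter when processes on the two machines try to reach each other. Below is a minimal sketch of a check that could be run on each machine; the helper name `resolves_to_loopback` and its `resolver` parameter are my own, purely for illustration.

```python
import ipaddress
import socket


def resolves_to_loopback(hostname, resolver=socket.gethostbyname):
    """Return True if `hostname` resolves to a loopback address (127.0.0.0/8).

    `resolver` is a hypothetical hook so the check can be exercised
    without touching the real network configuration.
    """
    address = resolver(hostname)
    return ipaddress.ip_address(address).is_loopback


# Example use on each machine (hostnames taken from the setup described above):
#   for host in ("primesystem", "zerosystem"):
#       print(host, resolves_to_loopback(host))
```

If a hostname reports True on the machine it belongs to, that would at least be a candidate to look into, though I have not confirmed it is the cause here.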
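I tried putting the list of hosts, along with the number of processes to spawn on each, in a .mpi-config file, and then using the -f argument instead of -hosts: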
    zerosystem:4
    primesystem:2
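To double-check that the process counts in that file add up to what I pass to -np, here is a small sketch that parses the `host:count` lines. It assumes one entry per line, treats `#` as a comment marker, and defaults a missing count to 1; those parsing rules and the function name `parse_hostfile` are my assumptions for illustration, not something taken from the tutorials.

```python
def parse_hostfile(text):
    """Parse 'host:count' entries (one per line) into a dict of host -> count."""
    slots = {}
    for line in text.splitlines():
        # Assumption: anything after '#' is a comment.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        host, _, count = line.partition(":")
        # Assumption: a host listed without a count gets one process.
        slots[host] = int(count) if count else 1
    return slots


hostfile_text = "zerosystem:4\nprimesystem:2\n"
slots = parse_hostfile(hostfile_text)
total = sum(slots.values())
print(slots, total)
```

For the file above the total comes to 6, which is the value I use for -np with this host file.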
But it still behaves the same way, and after several seconds or minutes, this is the error output:
    one@primesystem:/cloud$ mpirun -np 6 -f .mpi-config python -m mpi4py helloworld
    ===================================================================================
    =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
    =   PID 23329 RUNNING AT primesystem
    =   EXIT CODE: 139
    =   CLEANING UP REMAINING PROCESSES
    =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
    ===================================================================================
    [proxy:0:1@zerosystem] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
    [proxy:0:1@zerosystem] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
    [proxy:0:1@zerosystem] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
    [mpiexec@primesystem] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
    [mpiexec@primesystem] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
    [mpiexec@primesystem] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
    [mpiexec@primesystem] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
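As far as I understand the usual shell convention, an exit code above 128 means the process was killed by a signal numbered (code - 128), so 139 would correspond to signal 11, a segmentation fault on Linux. The sketch below just decodes the number under that assumption; the helper name `describe_exit_code` is hypothetical, and it says nothing about why the process crashed.

```python
import signal


def describe_exit_code(code):
    """Interpret an exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The platform does not define a signal with this number.
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {code}"


print(describe_exit_code(139))
```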
Why is this happening? Any ideas?