How to run GPGPU memory tests

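We do a lot of GPGPU computing (mostly CUDA, but also some OpenCL). Often, when users run their code, it fails with memory errors on only one of our hosts. I suspect one of the cards is faulty. Sometimes it brings down the whole system; other times only the program blows up.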

What is the easiest, fastest, and most thorough way to fully test a GPU for possible faults?

I know of a few programs that come with NVIDIA's CUDA SDK:

deviceQuery nvidia-smi 

But I need something more thorough. Suggestions? Experience?

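The de facto standard seems to be CUDA GPU Memtest. As @c2h5oh mentioned, it appears to be based on the memtest86 test patterns, so I am confident it does a good job. It ran relatively quickly on the high-end GPUs I tested (30 minutes on a Quadro 6000, 20 minutes on a Tesla C2075). It runs inside the OS (unlike memtest), so monitoring it is a bit different. You will probably want to write stdout and stderr to files so you can look at them later. Consider running it like this, so that if you lose your terminal output you can still see what the tests found: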

    cuda_memtest 2>cuda_memtest.stderr 1>cuda_memtest.stdout &
    tail -f cuda_memtest.stdout &
    tail -f cuda_memtest.stderr &
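If the worry is losing the terminal partway through a long run, one option is to detach the test from the terminal as well as log it. The following is only a sketch: `start_logged` is a hypothetical helper name, and it assumes a POSIX shell with `nohup` available. It writes stdout and stderr to separate files and records the process ID so the run can be checked on later.

```shell
# start_logged PREFIX CMD [ARGS...]   (hypothetical helper, not part of cuda_memtest)
# Starts CMD in the background, immune to terminal hangups, with
# stdout in PREFIX.stdout, stderr in PREFIX.stderr, and the PID in PREFIX.pid.
start_logged() {
    prefix=$1
    shift
    # stdin comes from /dev/null so the job never waits on the terminal
    nohup "$@" >"$prefix.stdout" 2>"$prefix.stderr" </dev/null &
    echo $! >"$prefix.pid"
}
```

It could be used as `start_logged cuda_memtest cuda_memtest`, followed by `tail -f cuda_memtest.stdout cuda_memtest.stderr` to watch both files from any terminal.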

You also need to make sure nobody else is using the system and/or the card. You can put the GPU into exclusive mode with:

 nvidia-smi --compute-mode=EXCLUSIVE_PROCESS 
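Since exclusive mode stays in effect after the test finishes, it may be worth switching the card back afterwards. The wrapper below is a sketch under a few assumptions: `with_exclusive_gpu` and the `NVIDIA_SMI` override variable are hypothetical names introduced here, the GPU is selected with `nvidia-smi -i`, and changing the compute mode usually needs root privileges.

```shell
# with_exclusive_gpu GPU_INDEX CMD [ARGS...]   (hypothetical helper)
# Puts one GPU into EXCLUSIVE_PROCESS mode, runs CMD, then restores
# DEFAULT mode and returns CMD's exit status.
with_exclusive_gpu() {
    gpu=$1
    shift
    # NVIDIA_SMI is an assumed override, e.g. NVIDIA_SMI=echo for a dry run
    smi=${NVIDIA_SMI:-nvidia-smi}
    $smi -i "$gpu" --compute-mode=EXCLUSIVE_PROCESS || return 1
    status=0
    "$@" || status=$?
    # Restore the shared mode even if the test itself failed
    $smi -i "$gpu" --compute-mode=DEFAULT
    return "$status"
}
```

For example, `with_exclusive_gpu 0 cuda_memtest` would run the test on GPU 0 and hand the card back to other users when it exits.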

Here is some output from sample runs on the Quadro and the Tesla, in case you are curious what the tests report:

    [09/07/2012 11:56:22][hydro][0]:Running cuda memtest, version 1.2.2
    [09/07/2012 11:56:23][hydro][0]:Warning: Getting serial number failed
    [09/07/2012 11:56:23][hydro][0]:NVRM version: NVIDIA UNIX x86_64 Kernel Module 295.41 Fri Apr 6 23:18:58 PDT 2012
    [09/07/2012 11:56:23][hydro][0]:num_gpus=1
    [09/07/2012 11:56:23][hydro][0]:Device name=Quadro 6000, global memory size=6441992192
    [09/07/2012 11:56:23][hydro][0]:major=2, minor=0
    [09/07/2012 11:56:24][hydro][0]:Attached to device 0 successfully.
    [09/07/2012 11:56:24][hydro][0]:Allocated 6040 MB
    [09/07/2012 11:56:24][hydro][0]:Test0 [Walking 1 bit]
    [09/07/2012 11:56:30][hydro][0]:Test0 finished in 5.7 seconds
    [09/07/2012 11:56:30][hydro][0]:Test1 [Own address test]
    [09/07/2012 11:56:33][hydro][0]:Test1 finished in 3.5 seconds
    [09/07/2012 11:56:33][hydro][0]:Test2 [Moving inversions, ones&zeros]
    [09/07/2012 11:57:05][hydro][0]:Test2 finished in 32.3 seconds
    [09/07/2012 11:57:05][hydro][0]:Test3 [Moving inversions, 8 bit pat]
    [09/07/2012 11:57:37][hydro][0]:Test3 finished in 31.9 seconds
    [09/07/2012 11:57:37][hydro][0]:Test4 [Moving inversions, random pattern]
    [09/07/2012 11:57:53][hydro][0]:Test4 finished in 15.9 seconds
    [09/07/2012 11:57:53][hydro][0]:Test5 [Block move, 64 moves]
    [09/07/2012 11:57:59][hydro][0]:Test5 finished in 6.3 seconds
    [09/07/2012 11:57:59][hydro][0]:Test6 [Moving inversions, 32 bit pat]
    [09/07/2012 12:18:46][hydro][0]:Test6 finished in 1246.6 seconds
    [09/07/2012 12:18:46][hydro][0]:Test7 [Random number sequence]
    [09/07/2012 12:19:06][hydro][0]:Test7 finished in 19.8 seconds
    [09/07/2012 12:19:06][hydro][0]:Test8 [Modulo 20, random pattern]
    [09/07/2012 12:19:06][hydro][0]:test8[mod test]: p1=0x13472f5f, p2=0xecb8d0a0
    [09/07/2012 12:20:34][hydro][0]:Test8 finished in 88.0 seconds
    [09/07/2012 12:20:34][hydro][0]:Test10 [Memory stress test]
    [09/07/2012 12:20:34][hydro][0]:Test10 with pattern=0x55f6c69858704128
    [09/07/2012 12:21:11][hydro][0]:Test10 finished in 36.8 seconds
    [09/07/2012 12:21:11][hydro][0]:Test0 [Walking 1 bit]
    [09/07/2012 12:21:16][hydro][0]:Test0 finished in 5.8 seconds

    [09/06/2012 18:49:07][hydro][0]:Running cuda memtest, version 1.2.2
    [09/06/2012 18:49:10][hydro][0]:Warning: Getting serial number failed
    [09/06/2012 18:49:10][hydro][0]:NVRM version: NVIDIA UNIX x86_64 Kernel Module 295.41 Fri Apr 6 23:18:58 PDT 2012
    [09/06/2012 18:49:10][hydro][0]:num_gpus=1
    [09/06/2012 18:49:10][hydro][0]:Device name=Tesla C2075, global memory size=5636292608
    [09/06/2012 18:49:10][hydro][0]:major=2, minor=0
    [09/06/2012 18:49:11][hydro][0]:Attached to device 0 successfully.
    [09/06/2012 18:49:11][hydro][0]:Allocated 5273 MB
    [09/06/2012 18:49:11][hydro][0]:Test0 [Walking 1 bit]
    [09/06/2012 18:49:22][hydro][0]:Test0 finished in 11.1 seconds
    [09/06/2012 18:49:22][hydro][0]:Test1 [Own address test]
    [09/06/2012 18:49:25][hydro][0]:Test1 finished in 3.1 seconds
    [09/06/2012 18:49:25][hydro][0]:Test2 [Moving inversions, ones&zeros]
    [09/06/2012 18:49:52][hydro][0]:Test2 finished in 27.4 seconds
    [09/06/2012 18:49:52][hydro][0]:Test3 [Moving inversions, 8 bit pat]
    [09/06/2012 18:50:20][hydro][0]:Test3 finished in 27.9 seconds
    [09/06/2012 18:50:20][hydro][0]:Test4 [Moving inversions, random pattern]
    [09/06/2012 18:50:34][hydro][0]:Test4 finished in 13.7 seconds
    [09/06/2012 18:50:34][hydro][0]:Test5 [Block move, 64 moves]
    [09/06/2012 18:50:39][hydro][0]:Test5 finished in 5.5 seconds
    [09/06/2012 18:50:39][hydro][0]:Test6 [Moving inversions, 32 bit pat]
    [09/06/2012 19:08:34][hydro][0]:Test6 finished in 1074.9 seconds
    [09/06/2012 19:08:34][hydro][0]:Test7 [Random number sequence]
    [09/06/2012 19:08:51][hydro][0]:Test7 finished in 17.1 seconds
    [09/06/2012 19:08:51][hydro][0]:Test8 [Modulo 20, random pattern]
    [09/06/2012 19:08:51][hydro][0]:test8[mod test]: p1=0x63136646, p2=0x9cec99b9
    [09/06/2012 19:10:10][hydro][0]:Test8 finished in 78.4 seconds
    [09/06/2012 19:10:10][hydro][0]:Test10 [Memory stress test]
    [09/06/2012 19:10:10][hydro][0]:Test10 with pattern=0x26341d134a89ac2b
    [09/06/2012 19:10:39][hydro][0]:Test10 finished in 29.0 seconds
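With the output saved to a file, a short `awk` pass can condense it into per-test timings. This is a rough sketch that assumes the log keeps the `TestN finished in X seconds` line format shown in the sample runs; `summarize_memtest_log` is a hypothetical helper name, and the script only reports durations, it does not detect failures.

```shell
# summarize_memtest_log FILE   (hypothetical helper)
# Prints "<test name> <seconds>" for every "finished in" line in FILE,
# followed by the total time across those lines.
summarize_memtest_log() {
    awk '
        $3 == "finished" && $4 == "in" {
            name = $2
            sub(/.*:/, "", name)      # drop the "[time][host][gpu]:" prefix
            printf "%s %.1f\n", name, $5
            total += $5
        }
        END { printf "total %.1f\n", total }
    ' "$1"
}
```

Running `summarize_memtest_log cuda_memtest.stdout` after a pass makes it easy to see which tests dominate the runtime; in the sample runs that is Test6.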

Google: Memtest + GPU. Any of the first three results looks like a valid answer. No personal experience.

http://sourceforge.net/projects/cudagpumemtest/

http://www.softpedia.com/get/Tweak/Memory-Tweak/CUDA-MemTest.shtml

https://simtk.org/home/memtest/