Memory bandwidth on a 24-core AMD server

I need some help determining whether the memory bandwidth I'm seeing on a Linux server is normal. Here is the server spec:

    HP ProLiant DL165 G7
    2x AMD Opteron 6164 HE 12-Core
    40 GB RAM (10 x 4 GB DDR3-1333)
    Debian 6.0

Using mbw on this server, I get the following numbers:

    foo1:~# mbw -n 3 1024
    Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
    Using 262144 bytes as blocks for memcpy block copy test.
    Getting down to business... Doing 3 runs per test.
    0    Method: MEMCPY   Elapsed: 0.58047  MiB: 1024.00000  Copy: 1764.082 MiB/s
    1    Method: MEMCPY   Elapsed: 0.58012  MiB: 1024.00000  Copy: 1765.152 MiB/s
    2    Method: MEMCPY   Elapsed: 0.58010  MiB: 1024.00000  Copy: 1765.201 MiB/s
    AVG  Method: MEMCPY   Elapsed: 0.58023  MiB: 1024.00000  Copy: 1764.811 MiB/s
    0    Method: DUMB     Elapsed: 0.36174  MiB: 1024.00000  Copy: 2830.778 MiB/s
    1    Method: DUMB     Elapsed: 0.35869  MiB: 1024.00000  Copy: 2854.817 MiB/s
    2    Method: DUMB     Elapsed: 0.35848  MiB: 1024.00000  Copy: 2856.481 MiB/s
    AVG  Method: DUMB     Elapsed: 0.35964  MiB: 1024.00000  Copy: 2847.310 MiB/s
    0    Method: MCBLOCK  Elapsed: 0.23546  MiB: 1024.00000  Copy: 4348.860 MiB/s
    1    Method: MCBLOCK  Elapsed: 0.23544  MiB: 1024.00000  Copy: 4349.230 MiB/s
    2    Method: MCBLOCK  Elapsed: 0.23544  MiB: 1024.00000  Copy: 4349.359 MiB/s
    AVG  Method: MCBLOCK  Elapsed: 0.23545  MiB: 1024.00000  Copy: 4349.149 MiB/s

On one of my other servers (based on an Intel Xeon E3-1270):

    foo2:~# mbw -n 3 1024
    Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
    Using 262144 bytes as blocks for memcpy block copy test.
    Getting down to business... Doing 3 runs per test.
    0    Method: MEMCPY   Elapsed: 0.18960  MiB: 1024.00000  Copy: 5400.901 MiB/s
    1    Method: MEMCPY   Elapsed: 0.18922  MiB: 1024.00000  Copy: 5411.690 MiB/s
    2    Method: MEMCPY   Elapsed: 0.18944  MiB: 1024.00000  Copy: 5405.491 MiB/s
    AVG  Method: MEMCPY   Elapsed: 0.18942  MiB: 1024.00000  Copy: 5406.024 MiB/s
    0    Method: DUMB     Elapsed: 0.14838  MiB: 1024.00000  Copy: 6901.200 MiB/s
    1    Method: DUMB     Elapsed: 0.14818  MiB: 1024.00000  Copy: 6910.561 MiB/s
    2    Method: DUMB     Elapsed: 0.14820  MiB: 1024.00000  Copy: 6909.628 MiB/s
    AVG  Method: DUMB     Elapsed: 0.14825  MiB: 1024.00000  Copy: 6907.127 MiB/s
    0    Method: MCBLOCK  Elapsed: 0.04362  MiB: 1024.00000  Copy: 23477.623 MiB/s
    1    Method: MCBLOCK  Elapsed: 0.04262  MiB: 1024.00000  Copy: 24025.151 MiB/s
    2    Method: MCBLOCK  Elapsed: 0.04258  MiB: 1024.00000  Copy: 24048.849 MiB/s
    AVG  Method: MCBLOCK  Elapsed: 0.04294  MiB: 1024.00000  Copy: 23847.599 MiB/s

And here, for reference, are the numbers from my Intel-based laptop:

    laptop:~$ mbw -n 3 1024
    Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
    Using 262144 bytes as blocks for memcpy block copy test.
    Getting down to business... Doing 3 runs per test.
    0    Method: MEMCPY   Elapsed: 0.40566  MiB: 1024.00000  Copy: 2524.269 MiB/s
    1    Method: MEMCPY   Elapsed: 0.38458  MiB: 1024.00000  Copy: 2662.638 MiB/s
    2    Method: MEMCPY   Elapsed: 0.38876  MiB: 1024.00000  Copy: 2634.043 MiB/s
    AVG  Method: MEMCPY   Elapsed: 0.39300  MiB: 1024.00000  Copy: 2605.600 MiB/s
    0    Method: DUMB     Elapsed: 0.30707  MiB: 1024.00000  Copy: 3334.745 MiB/s
    1    Method: DUMB     Elapsed: 0.30425  MiB: 1024.00000  Copy: 3365.653 MiB/s
    2    Method: DUMB     Elapsed: 0.30342  MiB: 1024.00000  Copy: 3374.849 MiB/s
    AVG  Method: DUMB     Elapsed: 0.30491  MiB: 1024.00000  Copy: 3358.328 MiB/s
    0    Method: MCBLOCK  Elapsed: 0.07875  MiB: 1024.00000  Copy: 13003.670 MiB/s
    1    Method: MCBLOCK  Elapsed: 0.08374  MiB: 1024.00000  Copy: 12228.034 MiB/s
    2    Method: MCBLOCK  Elapsed: 0.07635  MiB: 1024.00000  Copy: 13411.216 MiB/s
    AVG  Method: MCBLOCK  Elapsed: 0.07961  MiB: 1024.00000  Copy: 12862.006 MiB/s

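So according to mbw, my laptop is 3 times faster than the server! Please help me explain this. I also tried mounting a RAM disk and benchmarking it with dd, and I got a similar difference, so I don't think mbw is to blame.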
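To put a number on "3 times faster", the AVG lines of the three mbw runs can be compared directly. The following is only a small sketch in Python: the figures are copied from the output above, while the dictionary layout and the names `avg_mib_s` and `speedup_vs_server` are made up for this illustration.

```python
# AVG copy rates in MiB/s, copied from the three mbw runs above.
avg_mib_s = {
    "server": {"MEMCPY": 1764.811, "DUMB": 2847.310, "MCBLOCK": 4349.149},
    "Xeon E3-1270": {"MEMCPY": 5406.024, "DUMB": 6907.127, "MCBLOCK": 23847.599},
    "laptop": {"MEMCPY": 2605.600, "DUMB": 3358.328, "MCBLOCK": 12862.006},
}


def speedup_vs_server(machine, method):
    """Return how many times faster `machine` is than the Opteron server."""
    return avg_mib_s[machine][method] / avg_mib_s["server"][method]


for machine in ("Xeon E3-1270", "laptop"):
    for method in ("MEMCPY", "DUMB", "MCBLOCK"):
        ratio = speedup_vs_server(machine, method)
        print(f"{machine:>12} {method:<8} {ratio:.2f}x")
```

Run this way, the factor of about 3 for the laptop shows up only in the MCBLOCK test; for MEMCPY and DUMB the laptop comes out at roughly 1.5x and 1.2x the server.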

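I checked the BIOS settings, and the memory appears to be running at full speed. According to the hosting company, the modules are all OK.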

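Could this be related to NUMA? Node interleaving appears to be disabled on this server. Would enabling it (and thereby turning NUMA off) make a difference?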

    foo1:~# numactl --hardware
    available: 4 nodes (0-3)
    node 0 cpus: 0 1 2 3 4 5
    node 0 size: 8190 MB
    node 0 free: 7898 MB
    node 1 cpus: 6 7 8 9 10 11
    node 1 size: 12288 MB
    node 1 free: 12073 MB
    node 2 cpus: 18 19 20 21 22 23
    node 2 size: 12288 MB
    node 2 free: 12034 MB
    node 3 cpus: 12 13 14 15 16 17
    node 3 size: 8192 MB
    node 3 free: 8032 MB
    node distances:
    node   0   1   2   3
      0:  10  20  20  20
      1:  20  10  20  20
      2:  20  20  10  20
      3:  20  20  20  10
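As a rough check on how the 40 GB is spread over the four NUMA nodes, the "size" lines of the numactl output can be pulled out and summed. This is a sketch only: the regular expression and the names `numactl_sizes`, `node_mb` and `dimms_per_node` are invented for the example, and reading each node's size as a count of 4 GB modules is an assumption, not something numactl reports.

```python
import re

# The "size" lines from the numactl --hardware output above.
numactl_sizes = """
node 0 size: 8190 MB
node 1 size: 12288 MB
node 2 size: 12288 MB
node 3 size: 8192 MB
"""

# Map node number -> memory size in MB.
node_mb = {
    int(node): int(size)
    for node, size in re.findall(r"node (\d+) size: (\d+) MB", numactl_sizes)
}
total_mb = sum(node_mb.values())

# Assumption: every DIMM is 4 GB (4096 MB), as stated in the server spec.
dimms_per_node = {node: round(mb / 4096) for node, mb in node_mb.items()}

print(node_mb, total_mb, dimms_per_node)
```

Under that assumption two nodes hold two modules each and the other two hold three each, so the memory is not spread evenly across the nodes.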

Update:

I disabled NUMA (numa=off on the Linux boot command line) and also disabled ECC in the BIOS. No change; I still get the same numbers as above.

Update 2:

Here is the memory layout according to dmidecode:

    PROC 1 DIMM 1
    PROC 1 DIMM 4
    PROC 1 DIMM 7
    PROC 1 DIMM 10
    PROC 1 DIMM 12
    PROC 2 DIMM 1
    PROC 2 DIMM 4
    PROC 2 DIMM 7
    PROC 2 DIMM 10
    PROC 2 DIMM 12

These are all 4 GB Samsung modules (part number M393B5270CH0-CH9).

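I looked at HP's documentation on how to populate the memory in this server, and if I understand it correctly, the module currently in DIMM 12 should be in the DIMM 3 slot instead. Could a misconfiguration like this explain the results I'm getting?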

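Update 3:

I have now removed 2 modules, leaving 4 x 4 GB per CPU (4-4) in slots 1-4-7-10. Unfortunately I see no difference in the benchmarks. Shouldn't the server now be able to use all four channels? I also tried the stream benchmark with multithreading, and the results were very disappointing. The only thing I can think of now is to ask the hosting company to replace the entire server.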

Update 4:

I must have made a mistake when I tested the last setup (32 GB) yesterday, because today I'm seeing excellent results:

    foo1:~# ./stream
    -------------------------------------------------------------
    STREAM version $Revision: 5.9 $
    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 2000000, Offset = 0
    Total memory required = 45.8 MB.
    Each test is run 10 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Number of Threads requested = 24
    -------------------------------------------------------------
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 703 microseconds.
       (= 703 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function      Rate (MB/s)   Avg time     Min time     Max time
    Copy:       36873.0022       0.0009       0.0009       0.0010
    Scale:      34699.5160       0.0009       0.0009       0.0010
    Add:        30868.8427       0.0016       0.0016       0.0017
    Triad:      25558.7904       0.0019       0.0019       0.0020
    -------------------------------------------------------------
    Solution Validates
    -------------------------------------------------------------

(I have given up on mbw because it only runs single-threaded, and it still gives the same poor results on this server.)

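So the problem must have been those last two 4 GB modules, which forced the server to run in single-channel mode, just as @chx pointed out. The only remaining question is whether I can use 40 GB and still get full bandwidth. Could I use 2 x 8 GB + 6 x 4 GB? Does it matter which channels I put the larger modules in?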

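You are forcing the system to run in single-channel (!) mode by using 5-5 modules per CPU instead of 4-4 or 8-8. That is the cause. Try removing one module from each CPU and report back.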

The 6164 is a socket G34 CPU and can use quad-channel memory if the modules are populated correctly. Your setup is the worst possible one.
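For a sense of scale, the theoretical peak can be estimated from the memory type. The sketch below assumes the DIMMs really run at DDR3-1333 (1333 million transfers per second over a 64-bit, i.e. 8-byte, channel) and that all four channels on both sockets are active; the constant and function names are illustrative, and the STREAM Copy figure is the one reported in the question.

```python
# Assumptions, not measurements: nominal DDR3-1333 on a 64-bit channel.
TRANSFERS_PER_SEC = 1333e6   # DDR3-1333
BYTES_PER_TRANSFER = 8       # 64-bit wide memory channel
CHANNELS_PER_SOCKET = 4      # socket G34, quad channel
SOCKETS = 2


def peak_mb_s(channels):
    """Theoretical peak in MB/s (1 MB = 10**6 bytes) for `channels` channels."""
    return channels * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e6


per_channel = peak_mb_s(1)
whole_machine = peak_mb_s(CHANNELS_PER_SOCKET * SOCKETS)

# Best STREAM Copy rate from the question, in MB/s.
stream_copy = 36873.0022
fraction_of_peak = stream_copy / whole_machine

print(f"per channel:   {per_channel:.0f} MB/s")
print(f"whole machine: {whole_machine:.0f} MB/s")
print(f"STREAM Copy reaches {fraction_of_peak:.0%} of that")
```

Under these assumptions one channel tops out at about 10.7 GB/s and the whole machine at about 85 GB/s, so the 24-thread STREAM Copy result sits at a bit over 40% of the nominal peak. Treat the comparison as rough: real workloads never reach the nominal figure, and the STREAM run above used only 45.8 MB of arrays, so some cache effect cannot be ruled out.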