I'm trying to optimize a storage setup on some Sun hardware with Linux. Any ideas would be greatly appreciated.
We have the following hardware:
Here is the data sheet for the SAS hardware:
http://www.sun.com/storage/storage_networking/hba/sas/PCIe.pdf
It uses PCI Express 1.0a, 8x lanes. With a bandwidth of 250 MB/sec per lane, we should be able to do 2000 MB/sec per SAS controller.
Each controller can do 3 Gb/sec per port and has two 4-port PHYs. We connect both PHYs from a controller to a JBOD. So between the JBOD and the controller we have 2 PHYs * 4 SAS ports * 3 Gb/sec = 24 Gb/sec of bandwidth, which is more than the PCI Express bandwidth.
With write caching enabled, each disk can sustain about 80 MB/sec when doing large writes (near the start of the disk). With 24 disks, that means we should be able to do 1920 MB/sec per JBOD.
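The three limits above are just multiplication; as a sanity check, the arithmetic can be written out in a few lines of shell (all figures are the ones quoted in the text):

```shell
# Theoretical throughput limits, using only the numbers stated above.
pcie_lanes=8
mb_per_lane=250                               # PCIe 1.0a, per lane
pcie_limit=$((pcie_lanes * mb_per_lane))      # per SAS controller

phys=2
ports_per_phy=4
gbit_per_port=3
sas_gbit=$((phys * ports_per_phy * gbit_per_port))   # controller <-> JBOD

disks=24
mb_per_disk=80                                # sustained large writes
jbod_limit=$((disks * mb_per_disk))

echo "PCIe limit per controller: ${pcie_limit} MB/sec"
echo "SAS link bandwidth:        ${sas_gbit} Gb/sec"
echo "Disk limit per JBOD:       ${jbod_limit} MB/sec"
```

So the disks (1920 MB/sec per JBOD) are the binding constraint, sitting just under the PCIe ceiling of 2000 MB/sec per controller.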
multipath {
        rr_min_io 100
        uid 0
        path_grouping_policy multibus
        failback manual
        path_selector "round-robin 0"
        rr_weight priorities
        alias somealias
        no_path_retry queue
        mode 0644
        gid 0
        wwid somewwid
}
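To sanity-check that this configuration is actually in effect and that both PHYs show up as live paths, the standard multipath-tools commands can be used (the alias here is the placeholder from the config above; run as root on the actual box):

```shell
# Reload maps from /etc/multipath.conf, then inspect the topology.
# With path_grouping_policy multibus, all paths to the device should
# appear in a single path group, each "active ready running".
multipath -r
multipath -ll somealias
```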
I've tried rr_min_io values of 50, 100, and 1000, but it doesn't seem to make much difference.
Along with varying rr_min_io, I tried adding some delay between starting the dd's to keep them from all writing over the same PHY at the same time, but that made no difference either, so I think the I/Os are getting spread out properly.
According to /proc/interrupts, the SAS controllers are using an "IR-IO-APIC-compact" interrupt scheme. For some reason, core #0 in the machine is handling all of these interrupts. I can improve performance slightly by assigning a separate core to handle the interrupts for each SAS controller:
echo 2 > /proc/irq/24/smp_affinity
echo 4 > /proc/irq/26/smp_affinity
Writing to disk with dd produces "Function call interrupts" (no idea what these are), which are handled by core #4, so I keep other processes off that core as well.
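The smp_affinity values above are hex bitmasks with one bit per CPU (2 = core #1, 4 = core #2). A tiny helper makes the mapping from core number to mask explicit; this version only prints the writes it would do, so it can be checked without root (IRQ numbers 24 and 26 are the ones from the text, and will differ per machine per /proc/interrupts):

```shell
# Dry-run: print the smp_affinity write that pins an IRQ to one core.
pin_irq() {
    irq=$1
    cpu=$2
    mask=$(printf '%x' $((1 << cpu)))   # hex bitmask with only that CPU set
    echo "echo ${mask} > /proc/irq/${irq}/smp_affinity"
}

pin_irq 24 1   # first SAS controller  -> core #1 (mask 2)
pin_irq 26 2   # second SAS controller -> core #2 (mask 4)
```

Dropping the leading `echo` inside the function (and running as root) would apply the pinning for real.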
I run 48 dd's (one per disk), assigning them to cores that aren't handling interrupts, like so:
taskset -c somecore dd if=/dev/zero of=/dev/mapper/mpathx oflag=direct bs=128M
oflag=direct keeps any kind of buffer cache from getting involved.
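A launcher for that many dd's can round-robin over the free cores; a minimal sketch, printing the commands it would run rather than executing them (the mpath device names and the core list are placeholders, not the real 48-disk setup):

```shell
# Sketch: one pinned dd per multipath device, rotating through the
# cores that are NOT busy handling interrupts. Prints instead of runs.
plan_dd() {
    set -- 3 5 6 7   # placeholder list of interrupt-free cores
    for dev in mpatha mpathb mpathc mpathd mpathe; do
        core=$1
        shift
        set -- "$@" "$core"   # rotate the core list (round-robin)
        echo "taskset -c $core dd if=/dev/zero of=/dev/mapper/$dev oflag=direct bs=128M"
    done
}

plan_dd
```

Replacing `echo ...` with the command itself (plus a trailing `&` to background each dd) would launch them for real.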
None of my cores seem maxed out. The cores handling interrupts are mostly idle, and all the other cores are waiting on I/O.
Cpu0:  0.0%us,  1.0%sy,  0.0%ni, 91.2%id,  7.5%wa,  0.0%hi,  0.2%si,  0.0%st
Cpu1:  0.0%us,  0.8%sy,  0.0%ni, 93.0%id,  0.2%wa,  0.0%hi,  6.0%si,  0.0%st
Cpu2:  0.0%us,  0.6%sy,  0.0%ni, 94.4%id,  0.1%wa,  0.0%hi,  4.8%si,  0.0%st
Cpu3:  0.0%us,  7.5%sy,  0.0%ni, 36.3%id, 56.1%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4:  0.0%us,  1.3%sy,  0.0%ni, 85.7%id,  4.9%wa,  0.0%hi,  8.1%si,  0.0%st
Cpu5:  0.1%us,  5.5%sy,  0.0%ni, 36.2%id, 58.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6:  0.0%us,  5.0%sy,  0.0%ni, 36.3%id, 58.7%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7:  0.0%us,  5.1%sy,  0.0%ni, 36.3%id, 58.5%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8:  0.1%us,  8.3%sy,  0.0%ni, 27.2%id, 64.4%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9:  0.1%us,  7.9%sy,  0.0%ni, 36.2%id, 55.8%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10: 0.0%us,  7.8%sy,  0.0%ni, 36.2%id, 56.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11: 0.0%us,  7.3%sy,  0.0%ni, 36.3%id, 56.4%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12: 0.0%us,  5.6%sy,  0.0%ni, 33.1%id, 61.2%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13: 0.1%us,  5.3%sy,  0.0%ni, 36.1%id, 58.5%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14: 0.0%us,  4.9%sy,  0.0%ni, 36.4%id, 58.7%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15: 0.1%us,  5.4%sy,  0.0%ni, 36.5%id, 58.1%wa,  0.0%hi,  0.0%si,  0.0%st
Given all of that, the throughput reported by running "dstat 10" is in the range of 2200-2300 MB/sec.
Given the math above, I would expect something in the range of 2 * 1920 ~= 3600+ MB/sec.
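For what it's worth, the size of the gap works out to (taking 2250 MB/sec as the midpoint of the dstat readings):

```shell
# Measured throughput as a fraction of the two-JBOD disk limit.
expected=$((2 * 1920))                        # 3840 MB/sec
measured=2250                                 # midpoint of 2200-2300
pct=$((measured * 100 / expected))            # integer percent
echo "achieving ${pct}% of the ${expected} MB/sec disk limit"
```

i.e. roughly 40% of the theoretical bandwidth is unaccounted for.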
Does anybody know where my missing bandwidth went?
Thanks!
Nice, well-prepared question :)
I'm a bit of a speed freak myself, and honestly I think your numbers are about right. I was half expecting to see your throughput come in lower than it does; what I think you have there is a build-up of minor, expected inefficiencies. For example, it's very hard for a PCIe bus to hit 100% utilization all the time; better to assume a low-90% overall rate. Given the jitter this causes, the PHYs won't be 100% "fed" all the time either, so you lose a little there, and the same goes for the cache, the disks, uncoalesced interrupts, I/O scheduling and so on. Basically it's minor inefficiency times minor inefficiency times... and so on, and it ends up adding to more than the 5-10% inefficiency you'd expect from any one of them. I've seen this kind of thing before with HP DL servers talking to their MSA SAS boxes under W2K3 and then NLB'ed over multiple NICs - frustrating, but understandable I guess. Anyway, that's my 2c; sorry it's not more positive.