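Bad sequential write performance on RAID-5 when not writing huge amounts of data

I had trouble getting acceptable read/write performance out of my RAID5 + crypt + ext4 setup, and was finally able to track it down to the following problem: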

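Hardware

4x WD RED 3 TB hard disks (WDC WD30EFRX-68EUZN0) as /dev/sd[efgh]
sde and sdf are connected via controller A over a 3 Gbps SATA link (even though 6 Gbps would be available)
sdg and sdh are connected via controller B over a 6 Gbps SATA link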

Single-disk performance

Write test, 4 runs per disk (everything as I would expect)

    # dd if=/dev/zero of=/dev/sd[efgh] bs=2G count=1 oflag=dsync
    sde: 2147479552 bytes (2.1 GB) copied, xxx s, [127, 123, 132, 127] MB/s
    sdf: 2147479552 bytes (2.1 GB) copied, xxx s, [131, 130, 118, 137] MB/s
    sdg: 2147479552 bytes (2.1 GB) copied, xxx s, [145, 145, 145, 144] MB/s
    sdh: 2147479552 bytes (2.1 GB) copied, xxx s, [126, 132, 132, 132] MB/s

Read test using hdparm and dd, 2 runs per disk (everything as I would expect)

    # hdparm -tT /dev/sd[efgh]
    # echo 3 | tee /proc/sys/vm/drop_caches; dd of=/dev/null if=/dev/sd[efgh] bs=2G count=1 iflag=fullblock

    (sde)
    Timing cached reads:        xxx MB in 2.00 seconds = [13983.68, 14136.87] MB/sec
    Timing buffered disk reads: xxx MB in 3.00 seconds = [143.16, 143.14] MB/sec
    2147483648 bytes (2.1 GB) copied, xxx s, [140, 141] MB/s

    (sdf)
    Timing cached reads:        xxx MB in 2.00 seconds = [14025.80, 13995.14] MB/sec
    Timing buffered disk reads: xxx MB in 3.00 seconds = [140.31, 140.61] MB/sec
    2147483648 bytes (2.1 GB) copied, xxx s, [145, 141] MB/s

    (sdg)
    Timing cached reads:        xxx MB in 2.00 seconds = [14005.61, 13801.93] MB/sec
    Timing buffered disk reads: xxx MB in 3.00 seconds = [153.11, 151.73] MB/sec
    2147483648 bytes (2.1 GB) copied, xxx s, [154, 155] MB/s

    (sdh)
    Timing cached reads:        xxx MB in 2.00 seconds = [13816.84, 14335.93] MB/sec
    Timing buffered disk reads: xxx MB in 3.00 seconds = [142.50, 142.12] MB/sec
    2147483648 bytes (2.1 GB) copied, xxx s, [140, 140] MB/s

Partitions on sd[efgh]

4x 32 GiB for testing

    # gdisk -l /dev/sd[efgh]
    GPT fdisk (gdisk) version 0.8.10

    Partition table scan:
      MBR: protective
      BSD: not present
      APM: not present
      GPT: present

    Found valid GPT with protective MBR; using GPT.
    Disk /dev/sde: 5860533168 sectors, 2.7 TiB
    Logical sector size: 512 bytes
    Disk identifier (GUID): xxx
    Partition table holds up to 128 entries
    First usable sector is 34, last usable sector is 5860533134
    Partitions will be aligned on 2048-sector boundaries
    Total free space is 5793424237 sectors (2.7 TiB)

    Number  Start (sector)  End (sector)  Size      Code  Name
         1            2048      67110911  32.0 GiB  FD00  Linux RAID

RAID array

    # mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 --chunk=256K /dev/sd[efgh]1
    (some tests later ...)
    # mdadm --grow --verbose /dev/md0 --layout=right-asymmetric
    # mdadm --detail /dev/md0
    /dev/md0:
            Version : 1.2
      Creation Time : Sat Dec 10 03:07:56 2016
         Raid Level : raid5
         Array Size : 100561920 (95.90 GiB 102.98 GB)
      Used Dev Size : 33520640 (31.97 GiB 34.33 GB)
       Raid Devices : 4
      Total Devices : 4
        Persistence : Superblock is persistent

        Update Time : Sat Dec 10 23:56:53 2016
              State : clean
     Active Devices : 4
    Working Devices : 4
     Failed Devices : 0
      Spare Devices : 0

             Layout : right-asymmetric
         Chunk Size : 256K

               Name : vm:0  (local to host vm)
               UUID : 80d0f886:dc380755:5387f78c:1fac60da
             Events : 158

        Number   Major   Minor   RaidDevice State
           0       8       65        0      active sync   /dev/sde1
           1       8       81        1      active sync   /dev/sdf1
           2       8       97        2      active sync   /dev/sdg1
           4       8      113        3      active sync   /dev/sdh1
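For reference, the amount of data held by one full stripe follows from the creation parameters: one chunk per stripe is parity, so the data width is (raid devices - 1) times the chunk size. A small sketch of that arithmetic; full_stripe_kib is a made-up helper name, not an mdadm feature:

```shell
# Data held by one full RAID-5 stripe: (raid_devices - 1) * chunk size.
# full_stripe_kib is a hypothetical helper used only for illustration.
full_stripe_kib() {
    devices=$1      # number of member devices (--raid-devices)
    chunk_kib=$2    # chunk size in KiB (--chunk)
    echo $(( (devices - 1) * chunk_kib ))
}

full_stripe_kib 4 256   # data width of the array created above, in KiB
```

For this array that gives 768 KiB, so a single 256K write covers only a third of a stripe.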

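The current situation

I expected sequential read and write rates of roughly 350-400 MB/s from this array. Reading or writing the entire volume does in fact yield results squarely within that range: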

    # echo 3 | tee /proc/sys/vm/drop_caches; dd of=/dev/null if=/dev/md0 bs=256K
    102975406080 bytes (103 GB) copied, 261.373 s, 394 MB/s

    # dd if=/dev/zero of=/dev/md0 bs=256K conv=fdatasync
    102975406080 bytes (103 GB) copied, 275.562 s, 374 MB/s
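An estimate in the 350-400 MB/s range can be reproduced with the common rule of thumb that a RAID-5 array streams at about (disks - 1) times the single-disk rate, because one chunk per stripe holds parity. A rough sketch using the single-disk write rates measured above; raid5_estimate is a hypothetical helper, and the rule ignores controller and parity-calculation overhead:

```shell
# Rule-of-thumb estimate: RAID-5 sequential rate ~ (disks - 1) * per-disk rate.
# raid5_estimate is a hypothetical helper, not part of any tool.
raid5_estimate() {
    disks=$1        # number of member disks
    per_disk=$2     # sustained single-disk rate in MB/s
    echo $(( (disks - 1) * per_disk ))
}

raid5_estimate 4 118   # slowest single-disk write run measured above
raid5_estimate 4 130   # a typical single-disk write run measured above
```

The two calls print 354 and 390, which brackets the 374 MB/s and 394 MB/s seen for whole-volume transfers.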

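However, write performance depends heavily on the amount of data written. As expected, the transfer rate rises with the amount of data, but it drops sharply once the size reaches 2 GiB and recovers only slowly as the size grows further: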

    # dd if=/dev/zero of=/dev/md0 bs=256K conv=fdatasync count=x
    count=1:         262144 bytes (262 kB) copied, xxx s, [3.6, 7.6, 8.9, 8.9] MB/s
    count=2:         524288 bytes (524 kB) copied, xxx s, [3.1, 17.7, 15.3, 15.7] MB/s
    count=4:        1048576 bytes (1.0 MB) copied, xxx s, [13.2, 23.9, 26.9, 25.4] MB/s
    count=8:        2097152 bytes (2.1 MB) copied, xxx s, [24.3, 46.7, 45.9, 42.8] MB/s
    count=16:       4194304 bytes (4.2 MB) copied, xxx s, [5.1, 77.3, 42.6, 73.2, 79.8] MB/s
    count=32:       8388608 bytes (8.4 MB) copied, xxx s, [68.6, 101, 99.7, 101] MB/s
    count=64:      16777216 bytes (17 MB) copied, xxx s, [52.5, 136, 159, 159] MB/s
    count=128:     33554432 bytes (34 MB) copied, xxx s, [38.5, 175, 185, 189, 176] MB/s
    count=256:     67108864 bytes (67 MB) copied, xxx s, [53.5, 244, 229, 238] MB/s
    count=512:    134217728 bytes (134 MB) copied, xxx s, [111, 288, 292, 288] MB/s
    count=1K:     268435456 bytes (268 MB) copied, xxx s, [171, 328, 319, 322] MB/s
    count=2K:     536870912 bytes (537 MB) copied, xxx s, [228, 337, 330, 334] MB/s
    count=4K:    1073741824 bytes (1.1 GB) copied, xxx s, [338, 348, 348, 343] MB/s  <-- ok!
    count=8K:    2147483648 bytes (2.1 GB) copied, xxx s, [168, 147, 138, 139] MB/s  <-- bad!
    count=16K:   4294967296 bytes (4.3 GB) copied, xxx s, [155, 160, 178, 144] MB/s
    count=32K:   8589934592 bytes (8.6 GB) copied, xxx s, [256, 238, 264, 246] MB/s
    count=64K:  17179869184 bytes (17 GB) copied, xxx s, [298, 285] MB/s
    count=128K: 34359738368 bytes (34 GB) copied, xxx s, [347, 336] MB/s
    count=256K: 68719476736 bytes (69 GB) copied, xxx s, [363, 356] MB/s  <-- getting better

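(The first measurement in each row below 2 GiB seems to indicate that some read caching is involved.)

When transferring 2 GiB or more, I observed something strange in iotop:

Stage 1: At the beginning, both "Total DISK WRITE" and "Actual DISK WRITE" are at about 400 MB/s. dd's IO value is around 85 %, while it is 0 % for everything else. This stage lasts longer for larger transfers.
Stage 2: A few seconds before the transfer completes, a kworker jumps in and steals 30-50 percentage points of IO from dd. The split fluctuates between 30:50 and 50:30. At the same time, "Total DISK WRITE" drops to 0 B/s and "Actual DISK WRITE" jumps around between 20 and 70 MB/s. This stage seems to last for a while.
Stage 3: During the last 3 seconds, "Actual DISK WRITE" jumps to > 400 MB/s while "Total DISK WRITE" stays at 0 B/s. Both dd and the kworker are listed with an IO value of 0 %.
Stage 4: dd's IO value jumps to 5 % for a single second. At the same time, the transfer completes.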

Some more tests

    # dd if=/dev/zero of=/dev/md0 bs=256K count=32K oflag=direct
    8589934592 bytes (8.6 GB) copied, 173.083 s, 49.6 MB/s

    # dd if=/dev/zero of=/dev/md0 bs=256M count=64 oflag=direct
    17179869184 bytes (17 GB) copied, 47.792 s, 359 MB/s

    # dd if=/dev/zero of=/dev/md0 bs=768M count=16K oflag=direct
    50734301184 bytes (51 GB) copied, 136.347 s, 372 MB/s  <-- peak performance

    # dd if=/dev/zero of=/dev/md0 bs=1G count=16K oflag=direct
    41875931136 bytes (42 GB) copied, 112.518 s, 372 MB/s  <-- peak performance

    # dd if=/dev/zero of=/dev/md0 bs=2G count=16 oflag=direct
    34359672832 bytes (34 GB) copied, 103.355 s, 332 MB/s

    # dd if=/dev/zero of=/dev/md0 bs=256K count=32K oflag=dsync
    8589934592 bytes (8.6 GB) copied, 498.77 s, 17.2 MB/s

    # dd if=/dev/zero of=/dev/md0 bs=256M count=64 oflag=dsync
    17179869184 bytes (17 GB) copied, 58.384 s, 294 MB/s

    # dd if=/dev/zero of=/dev/md0 bs=1G count=8 oflag=dsync
    8589934592 bytes (8.6 GB) copied, 26.4799 s, 324 MB/s

    # dd if=/dev/zero of=/dev/md0 bs=2G count=8 oflag=dsync
    17179836416 bytes (17 GB) copied, 192.327 s, 89.3 MB/s

    # dd if=/dev/zero of=/dev/md0 bs=256K; echo "sync"; sync
    102975406080 bytes (103 GB) copied, 275.378 s, 374 MB/s
    sync

bs=256K oflag=direct -> 100 % IO, no kworker present, bad performance
bs=1G oflag=direct -> < 5 % IO, no kworker present, good performance
bs=2G oflag=direct -> > 80 % IO, kworker jumps in, good performance
oflag=dsync -> < 5 % IO, kworker jumps in; needs huge block sizes to reach acceptable speed, but block sizes > 2G cause a performance drop
echo "sync"; sync -> same as conv=fdatasync; sync returns immediately

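Questions

What is this mysterious stage 2, in which the two processes seem to fight over IO?

Who is transferring the data to the hardware in stage 3?

And most important: how can I minimize this strange effect to get the full 400 MB/s the array seems able to deliver? (Or am I even asking an XY question?)

Bonus

A long period of trial and error preceded the current state. I switched the scheduler from cfq to noop and reduced the RAID chunk size from 512K to 256K, which produced slightly better results. --layout=right-asymmetric did not change anything. Temporarily disabling the hard drives' write cache made performance worse.

The crypt layer mentioned in the first sentence is currently absent altogether and will be reintroduced later.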

    # uname -a
    Linux vm 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux

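What you are seeing is an artifact of your dd command line, specifically of the conv=fdatasync option. From the man page:

Each CONV symbol may be:
…
fdatasync: physically write output file data before finishing
…

conv=fdatasync basically instructs dd to issue a single, final fdatasync syscall before returning. However, writes are cached while dd is running. Your I/O stages can be explained as follows: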

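dd writes into the page cache very quickly, without actually touching the disks
when the page cache is nearly full, the kernel starts flushing it to the disks (this is the kworker you see in iotop). While the page cache is being flushed, dd is temporarily paused (causing high iowait); once some page cache has been freed, dd can resume operation
the difference between TOTAL and ACTUAL disk writes in iotop depends on how the page cache is being filled and flushed, respectively
the cycle repeats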
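One way to watch this cycle is to poll the Dirty and Writeback counters in /proc/meminfo while dd runs, and to check the thresholds at which the kernel starts flushing. A minimal sketch, assuming a Linux system; show_writeback is a hypothetical helper name, and the commented-out commands are meant to be run by hand:

```shell
# Print how much page cache is dirty (not yet written) and how much is
# currently being written back. The optional argument allows reading a
# saved copy instead of the live /proc/meminfo.
show_writeback() {
    grep -E '^(Dirty|Writeback):' "${1:-/proc/meminfo}"
}

# Thresholds, in percent of available memory, at which background
# flushing starts and at which writers are throttled:
# cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio

# Poll once per second in a second terminal while dd is running:
# while sleep 1; do show_writeback; done
```

During the fast phase Dirty should grow; once flushing dominates, Writeback rises and Dirty shrinks.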

In short, there is no problem here. If you want to observe uncached behavior, replace conv=fdatasync with oflag=direct: with this flag, the page cache is bypassed entirely.

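To observe cached but synchronized behavior, replace conv=fdatasync with oflag=sync: with this flag, dd opens the output with O_SYNC, so each block's write returns only once it has reached the disk.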
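The three modes can be compared side by side on a scratch file before touching the array. A sketch, assuming GNU dd; write_test and its mode names are invented for illustration, and oflag=direct may be rejected on filesystems that do not support O_DIRECT (tmpfs, for example):

```shell
# Write 4 MiB of zeros to the file in $1 using the sync mode named in $2.
# write_test and the mode names are hypothetical, for illustration only.
write_test() {
    case $2 in
        final)  # cached writes, one fdatasync at the very end
            dd if=/dev/zero of="$1" bs=256K count=16 conv=fdatasync ;;
        sync)   # output opened with O_SYNC: every write is synchronous
            dd if=/dev/zero of="$1" bs=256K count=16 oflag=sync ;;
        direct) # output opened with O_DIRECT: page cache bypassed
            dd if=/dev/zero of="$1" bs=256K count=16 oflag=direct ;;
        *)
            echo "unknown mode: $2" >&2
            return 2 ;;
    esac
}

# Example (placeholder path):
# write_test /var/tmp/scratch.bin final
```

dd prints its timing summary on stderr, so the three runs can be compared directly.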

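Further optimization can be obtained by fine-tuning your I/O stack (i.e. the I/O scheduler, merge behavior, stripe cache, etc.), but that is another question entirely.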
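As a starting point for the stripe cache, the kernel's md documentation gives its memory cost as page size times number of disks times stripe_cache_size. A small sketch of that calculation, plus the sysfs files involved; stripe_cache_bytes is a hypothetical helper, a 4096-byte page size is assumed, and 4096 entries is only an example value, not a recommendation:

```shell
# Memory consumed by the md stripe cache:
#   page_size * nr_disks * stripe_cache_size
# stripe_cache_bytes is a hypothetical helper; page size defaults to 4096.
stripe_cache_bytes() {
    entries=$1
    disks=$2
    page_size=${3:-4096}
    echo $(( page_size * disks * entries ))
}

stripe_cache_bytes 4096 4   # cost in bytes for a 4-disk array

# Inspect the current settings (read-only):
# cat /sys/block/md0/md/stripe_cache_size
# cat /sys/block/sde/queue/scheduler
# Changing the value requires root:
# echo 4096 > /sys/block/md0/md/stripe_cache_size
```

For the 4-disk array in the question, 4096 entries would cost 67108864 bytes (64 MiB).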