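ZFS read behavior on a single disk

I am trying to set up ZFS on a single disk because of its amazing compression and snapshot features. My workload is a postgres server. The usual guides suggest the following settings: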

    atime = off
    compression = lz4
    primarycache = metadata
    recordsize = 16k

With these settings, however, I see some weird read speeds. I am only looking at this for now!

Here is my test drive (Intel P4800X) with XFS. This is a simple direct I/O test with dd:

    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=4K iflag=direct
    910046+0 records in
    910046+0 records out
    3727548416 bytes (3.7 GB) copied, 10.9987 s, 339 MB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=8K iflag=direct
    455023+0 records in
    455023+0 records out
    3727548416 bytes (3.7 GB) copied, 6.05091 s, 616 MB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=16K iflag=direct
    227511+1 records in
    227511+1 records out
    3727548416 bytes (3.7 GB) copied, 3.8243 s, 975 MB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=32K iflag=direct
    113755+1 records in
    113755+1 records out
    3727548416 bytes (3.7 GB) copied, 2.78787 s, 1.3 GB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=64K iflag=direct
    56877+1 records in
    56877+1 records out
    3727548416 bytes (3.7 GB) copied, 2.18482 s, 1.7 GB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=128K iflag=direct
    28438+1 records in
    28438+1 records out
    3727548416 bytes (3.7 GB) copied, 1.83346 s, 2.0 GB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=256K iflag=direct
    14219+1 records in
    14219+1 records out
    3727548416 bytes (3.7 GB) copied, 1.69168 s, 2.2 GB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=512K iflag=direct
    7109+1 records in
    7109+1 records out
    3727548416 bytes (3.7 GB) copied, 1.54205 s, 2.4 GB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=1M iflag=direct
    3554+1 records in
    3554+1 records out
    3727548416 bytes (3.7 GB) copied, 1.51988 s, 2.5 GB/s

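As you can see, the drive reaches about 80k IOPS at 4K reads and about the same at 8K, so throughput scales linearly here. According to the spec it can reach 550k IOPS at QD16, but I am testing single-threaded sequential reads here, so everything is as expected.

Kernel parameters for zfs: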
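As a rough check of the IOPS figure above, the request rate can be recomputed from the dd output. This is only a sketch: the numbers are copied from the XFS runs, the helper names are made up for illustration, and the trailing partial record of the 16K run is ignored.

```python
# Figures copied from the XFS dd runs above:
# (block size in bytes, full records read, elapsed seconds).
XFS_RUNS = [
    (4 * 1024, 910046, 10.9987),
    (8 * 1024, 455023, 6.05091),
    (16 * 1024, 227511, 3.8243),  # dd also reported one partial record
]

TOTAL_BYTES = 3727548416  # size of large_file.bin as reported by dd


def iops(records: int, seconds: float) -> float:
    """Requests completed per second for one dd run."""
    return records / seconds


def throughput_mb_s(total_bytes: int, seconds: float) -> float:
    """Throughput in MB/s, using dd's convention of 1 MB = 10**6 bytes."""
    return total_bytes / seconds / 1e6


for bs, records, seconds in XFS_RUNS:
    print(
        f"bs={bs // 1024}K  {iops(records, seconds):,.0f} IOPS  "
        f"{throughput_mb_s(TOTAL_BYTES, seconds):.0f} MB/s"
    )
```

The 4K run works out to a little under 83k requests per second, which is where the "about 80k IOPS" estimate comes from.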

    options zfs zfs_vdev_scrub_min_active=48
    options zfs zfs_vdev_scrub_max_active=128
    options zfs zfs_vdev_sync_write_min_active=64
    options zfs zfs_vdev_sync_write_max_active=128
    options zfs zfs_vdev_sync_read_min_active=64
    options zfs zfs_vdev_sync_read_max_active=128
    options zfs zfs_vdev_async_read_min_active=64
    options zfs zfs_vdev_async_read_max_active=128
    options zfs zfs_top_maxinflight=320
    options zfs zfs_txg_timeout=30
    options zfs zfs_dirty_data_max_percent=40
    options zfs zfs_vdev_scheduler=deadline
    options zfs zfs_vdev_async_write_min_active=8
    options zfs zfs_vdev_async_write_max_active=64

Now the same test using ZFS with a 16K block size (recordsize):

    910046+0 records in
    910046+0 records out
    3727548416 bytes (3.7 GB) copied, 39.6985 s, 93.9 MB/s
    455023+0 records in
    455023+0 records out
    3727548416 bytes (3.7 GB) copied, 20.2442 s, 184 MB/s
    227511+1 records in
    227511+1 records out
    3727548416 bytes (3.7 GB) copied, 10.5837 s, 352 MB/s
    113755+1 records in
    113755+1 records out
    3727548416 bytes (3.7 GB) copied, 6.64908 s, 561 MB/s
    56877+1 records in
    56877+1 records out
    3727548416 bytes (3.7 GB) copied, 4.85928 s, 767 MB/s
    28438+1 records in
    28438+1 records out
    3727548416 bytes (3.7 GB) copied, 3.91185 s, 953 MB/s
    14219+1 records in
    14219+1 records out
    3727548416 bytes (3.7 GB) copied, 3.41855 s, 1.1 GB/s
    7109+1 records in
    7109+1 records out
    3727548416 bytes (3.7 GB) copied, 3.17058 s, 1.2 GB/s
    3554+1 records in
    3554+1 records out
    3727548416 bytes (3.7 GB) copied, 2.97989 s, 1.3 GB/s

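As you can see, the 4K read test already tops out at 93 MB/s, the 8K read at 184 MB/s, and the 16K read at 352 MB/s. Based on the previous tests I would definitely have expected faster reads at 4k (243.75 MB/s), 8k (487.5 MB/s) and 16k (975 MB/s). Also, I read that recordsize has no impact on read performance, but clearly it does.

For comparison, 128k recordsize: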

    910046+0 records in
    910046+0 records out
    3727548416 bytes (3.7 GB) copied, 107.661 s, 34.6 MB/s
    455023+0 records in
    455023+0 records out
    3727548416 bytes (3.7 GB) copied, 55.6932 s, 66.9 MB/s
    227511+1 records in
    227511+1 records out
    3727548416 bytes (3.7 GB) copied, 27.3412 s, 136 MB/s
    113755+1 records in
    113755+1 records out
    3727548416 bytes (3.7 GB) copied, 14.1506 s, 263 MB/s
    56877+1 records in
    56877+1 records out
    3727548416 bytes (3.7 GB) copied, 7.4061 s, 503 MB/s
    28438+1 records in
    28438+1 records out
    3727548416 bytes (3.7 GB) copied, 4.1867 s, 890 MB/s
    14219+1 records in
    14219+1 records out
    3727548416 bytes (3.7 GB) copied, 2.6765 s, 1.4 GB/s
    7109+1 records in
    7109+1 records out
    3727548416 bytes (3.7 GB) copied, 1.87574 s, 2.0 GB/s
    3554+1 records in
    3554+1 records out
    3727548416 bytes (3.7 GB) copied, 1.40653 s, 2.7 GB/s

I can also clearly see with iostat that the average request size on the disk matches the recordsize. But the IOPS are much lower than with XFS.
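One way to read these numbers is to convert each run into time per request. The sketch below does that for the two ZFS runs. The helper name is hypothetical and the request count is derived from the file size rather than measured, so treat the result as an approximation.

```python
import math

TOTAL_BYTES = 3727548416  # size of large_file.bin as reported by dd

# (dd block size in KiB, elapsed seconds), copied from the ZFS runs above.
ZFS_16K = [(4, 39.6985), (8, 20.2442), (16, 10.5837)]
ZFS_128K = [
    (4, 107.661), (8, 55.6932), (16, 27.3412),
    (32, 14.1506), (64, 7.4061), (128, 4.1867),
]


def usec_per_request(read_kib: int, seconds: float) -> float:
    """Average microseconds per dd read, counting the final partial read."""
    requests = math.ceil(TOTAL_BYTES / (read_kib * 1024))
    return seconds / requests * 1e6


for label, runs in (("recordsize=16K", ZFS_16K), ("recordsize=128K", ZFS_128K)):
    for read_kib, seconds in runs:
        print(f"{label}  bs={read_kib}K  "
              f"{usec_per_request(read_kib, seconds):.1f} us/request")
```

Under these assumptions the 16K dataset stays at roughly 44 to 47 µs per request for reads up to 16K, and the 128K dataset at roughly 118 to 147 µs for reads up to 128K. That would be consistent with every request costing about as much as one full record, whatever the dd block size.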

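Is this how it is supposed to behave? Where is that behavior documented? I need good performance for my postgres server (sequential + random), but I also want excellent performance for my backups, copies, etc. (sequential). So it seems I get either good sequential speed with large records or good random speed with small records.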

Edit: Additionally, I have also tested with primarycache=all, and there is more weirdness: it maxes out at 1.3 GB/s regardless of the recordsize.

Server details:

64 GB DDR4 RAM
Intel Xeon E5-2620v4
Intel P4800X

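The observed behavior is due to how ZFS does end-to-end checksumming, which is based on the concept of records.

Basically, each object is decomposed into an appropriate number of recordsize-sized chunks, which are checksummed separately. This means that a read smaller than the recordsize actually needs to transfer and re-checksum the entire record, leading to "wasted" storage bandwidth.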
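The effect can be put into numbers with a small model. This is a sketch under stated assumptions, not a description of ZFS internals: reads are aligned, nothing is served from cache (as with primarycache=metadata), prefetch is ignored, and the device is assumed to deliver whole records at the rate XFS measured for that block size. The function names are made up for illustration.

```python
def read_amplification(read_size: int, recordsize: int) -> float:
    """Bytes fetched from disk per byte requested for one aligned, uncached read."""
    records_touched = -(-read_size // recordsize)  # ceiling division
    return records_touched * recordsize / read_size


def expected_useful_mb_s(read_size: int, recordsize: int,
                         record_rate_mb_s: float) -> float:
    """Useful throughput if the disk streams whole records at record_rate_mb_s."""
    return record_rate_mb_s / read_amplification(read_size, recordsize)


KIB = 1024
XFS_16K_RATE = 975  # MB/s, the XFS result at bs=16K from the question

for read_kib in (4, 8, 16):
    amp = read_amplification(read_kib * KIB, 16 * KIB)
    rate = expected_useful_mb_s(read_kib * KIB, 16 * KIB, XFS_16K_RATE)
    print(f"bs={read_kib}K on recordsize=16K: {amp:.0f}x amplification, "
          f"at most ~{rate:.2f} MB/s useful")
```

With a 16K recordsize this reproduces the 243.75, 487.5 and 975 MB/s figures from the question. The measured values (93.9, 184 and 352 MB/s) are lower still, so the model is better read as an upper bound: it accounts for the wasted bandwidth only, not for any other per-record cost.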

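This means that a ZFS dataset with a large recordsize performs poorly with small reads and, conversely, well with large reads. A dataset with a small recordsize, on the other hand, performs well with small reads and worse with large reads.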

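Please note that compression and snapshots also work at record granularity: a dataset with a 4K or 8K recordsize will have a lower compression ratio than, say, a 32K dataset.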
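The compression effect is easy to demonstrate outside ZFS. The sketch below compresses the same buffer in independent chunks of different sizes. It is an illustration only: it uses zlib from the Python standard library as a stand-in for lz4, the data is synthetic, and it ignores how ZFS rounds compressed blocks up to whole disk sectors.

```python
import zlib


def compressed_size(data: bytes, recordsize: int) -> int:
    """Total size when each recordsize-sized chunk is compressed on its own."""
    return sum(
        len(zlib.compress(data[off:off + recordsize]))
        for off in range(0, len(data), recordsize)
    )


# Synthetic, repetitive "table-like" data; purely illustrative.
data = b"".join(
    f"row {i} user{i % 97} status=active\n".encode() for i in range(20000)
)

for recordsize in (4096, 8192, 32768, 131072):
    ratio = len(data) / compressed_size(data, recordsize)
    print(f"recordsize={recordsize // 1024}K  ratio={ratio:.2f}")
```

Each chunk starts with no history to refer back to, so smaller chunks give the compressor less redundancy to exploit and the overall ratio drops.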

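In short, there is no "bulletproof" value for the ZFS recordsize; rather, you need to tune it to the specific application requirements. This also means that dd is a poor choice for benchmarking (although, being quick and dirty, I use it extensively too!); instead, you should use fio (tuned to behave like your application) or the application itself.

You can read more here.

For general use I would leave it at the default (128K), while for databases and virtual machines I would use a smaller 32K value.

Finally, pay attention to ZFS read-ahead/prefetch tuning, which can significantly increase read speed.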