SSD performance degradation – IBM X3650 M4 (7915)

I have built a test environment for development purposes. It consists of an IBM X3650 M4 (7915) server with:

  • 2 x Intel Xeon E5-2690 @ 2.90GHz
  • 96GB 1333MHz ECC RAM
  • 2 x 146GB 15k RPM HDDs
  • 6 x 525GB SSDs (Crucial MX300)
  • embedded ServeRAID M5110e in JBOD mode, with no cache
  • Ubuntu Server 16.10
  • md software RAID on both the HDDs (RAID0) and the SSDs (RAID10) – see the mdadm sketch after this list
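
For context, a sketch of how such an md layout can be created; the device names and the absence of extra options are assumptions, not my exact commands:

    # Hypothetical mdadm commands matching the layout above
    mdadm --create /dev/md0 --level=0  --raid-devices=2 /dev/sda /dev/sdb   # 2 x HDD  -> RAID0
    mdadm --create /dev/md1 --level=10 --raid-devices=6 /dev/sd[c-h]        # 6 x SSD  -> RAID10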

I cannot completely bypass the RAID controller, because it is integrated on the motherboard and there is no dedicated HBA card (should I buy one?), but I have set it to JBOD mode.

I have benchmarked these SSDs both as single disks and in RAID10 and RAID0 configurations. I observed the expected behavior from the software RAID, but not from the single disks: the RAID scales as expected (which is fine by me), but the individual SSDs run at only half their expected IOPS!

The tests were run with fio, using the configuration described by storagereview.com (link).
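
For reference, a minimal sketch of the kind of single-disk run used, assuming /dev/sdc as the target and libaio as the engine (the actual storagereview.com job file differs in details such as preconditioning and ramp time):

    # Hypothetical 60-second 4k 100% random read run against one SSD;
    # repeat with --iodepth=1,2,4,...,32 to build the IOPS-vs-depth curve.
    fio --name=4k-randread --filename=/dev/sdc \
        --direct=1 --ioengine=libaio --rw=randread --bs=4k \
        --iodepth=32 --runtime=60 --time_based --group_reporting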

Here is a chart summarizing the averaged runs across all 6 SSDs (1 x 60-second run on each SSD):

Chart: SSD IOPS vs. IO depth, for a 4k 100% random read workload and an 8k 70% random read / 30% random write workload

Judging from various benchmarks (storagereview.com, tomshardware.com, etc.) and the official specifications, these disks should reach twice the random read IOPS. For example:

  • For the 4k workload, Tom's Hardware measures reads topping out at 92,358 IOPS at an IO depth of 32, while mine top out at 37,400 IOPS (link).
  • storagereview.com runs slightly different benchmarks, but all of them give completely different results – about 90k IOPS for 4k aligned reads (link).
  • Hardware.info shows the same results for the 1TB model (link).

I have tuned all the various /sys/block/sd* parameters of every /dev/sd* device, such as scheduler, nr_requests, rotational and fifo_batch.
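
As an illustration, a sketch of the kind of tuning applied; the device range and the values are examples, not the exact ones I settled on:

    # Hypothetical per-device block-layer tuning (SSDs assumed to be sdc..sdh)
    for d in /sys/block/sd[c-h]; do
        echo deadline > "$d/queue/scheduler"           # I/O scheduler
        echo 0        > "$d/queue/rotational"          # mark the device as non-rotational
        echo 256      > "$d/queue/nr_requests"         # deeper request queue
        echo 16       > "$d/queue/iosched/fifo_batch"  # deadline batch size
    done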

我该找什么?

Update 1

I forgot to mention that the disks are over-provisioned at 25%, so the overall size reported in the outputs below is roughly 75% of 525GB. In any case, the IOPS never went above the 37k limit, either before or after over-provisioning.
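
For completeness, one common way to over-provision is to shrink the drive's visible LBA count with a host protected area; this is only a sketch of that approach, using the sector count reported below, and not necessarily how I did it:

    # Hypothetical: limit /dev/sdc to 769208076 visible sectors (~75% of the raw capacity)
    hdparm -Np769208076 --yes-i-know-what-i-am-doing /dev/sdc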

Output of hdparm -I /dev/sdc:

    /dev/sdc:
    ATA device, with non-removable media
        Model Number:       Crucial_CT525MX300SSD1
        Serial Number:      163113837E16
        Firmware Revision:  M0CR031
        Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
    Standards:
        Used: unknown (minor revision code 0x006d)
        Supported: 10 9 8 7 6 5
        Likely used: 10
    Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:    16514064
        LBA    user addressable sectors:   268435455
        LBA48  user addressable sectors:   769208076
        Logical  Sector size:                   512 bytes
        Physical Sector size:                   512 bytes
        Logical Sector-0 offset:                  0 bytes
        device size with M = 1024*1024:      375589 MBytes
        device size with M = 1000*1000:      393834 MBytes (393 GB)
        cache/buffer size  = unknown
        Form Factor: 2.5 inch
        Nominal Media Rotation Rate: Solid State Device
    Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Advanced power management level: 254
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
    Commands/features:
        Enabled Supported:
           *    SMART feature set
                Security Mode feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    NOP cmd
           *    DOWNLOAD_MICROCODE
           *    Advanced Power Management feature set
           *    48-bit Address feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART error logging
           *    SMART self-test
           *    General Purpose Logging feature set
           *    WRITE_{DMA|MULTIPLE}_FUA_EXT
           *    64-bit World wide name
           *    IDLE_IMMEDIATE with UNLOAD
                Write-Read-Verify feature set
           *    WRITE_UNCORRECTABLE_EXT command
           *    {READ,WRITE}_DMA_EXT_GPL commands
           *    Segmented DOWNLOAD_MICROCODE
                unknown 119[8]
           *    Gen1 signaling speed (1.5Gb/s)
           *    Gen2 signaling speed (3.0Gb/s)
           *    Gen3 signaling speed (6.0Gb/s)
           *    Native Command Queueing (NCQ)
           *    Phy event counters
           *    NCQ priority information
           *    READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
           *    DMA Setup Auto-Activate optimization
                Device-initiated interface power management
           *    Software settings preservation
                Device Sleep (DEVSLP)
           *    SMART Command Transport (SCT) feature set
           *    SCT Write Same (AC2)
           *    SCT Features Control (AC4)
           *    SCT Data Tables (AC5)
           *    reserved 69[3]
           *    reserved 69[4]
           *    reserved 69[7]
           *    DOWNLOAD MICROCODE DMA command
           *    WRITE BUFFER DMA command
           *    READ BUFFER DMA command
           *    Data Set Management TRIM supported (limit 8 blocks)
           *    Deterministic read ZEROs after TRIM
    Security:
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
        not     frozen
        not     expired: security count
                supported: enhanced erase
        2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT.
    Logical Unit WWN Device Identifier: 500a075113837e16
        NAA             : 5
        IEEE OUI        : 00a075
        Unique ID       : 113837e16
    Device Sleep:
        DEVSLP Exit Timeout (DETO): 50 ms (drive)
        Minimum DEVSLP Assertion Time (MDAT): 10 ms (drive)
    Checksum: correct

Output of fdisk -l /dev/sdc:

    Disk /dev/sdc: 366.8 GiB, 393834534912 bytes, 769208076 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes

cat /sys/block/sdc/queue/scheduler

    noop [deadline] cfq

Output of dmesg | grep "ahci\|ncq":

    [ 5.490677] ahci 0000:00:1f.2: version 3.0
    [ 5.490901] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 1.5 Gbps 0x2 impl SATA mode
    [ 5.498675] ahci 0000:00:1f.2: flags: 64bit ncq sntf led clo pio slum part ems apst
    [ 5.507315] scsi host1: ahci
    [ 5.507435] scsi host2: ahci
    [ 5.507529] scsi host3: ahci
    [ 5.507620] scsi host4: ahci
    [ 5.507708] scsi host5: ahci
    [ 5.507792] scsi host6: ahci
    [ 14.382326] Modules linked in: ioatdma(+) ipmi_si(+) ipmi_msghandler mac_hid shpchp lpc_ich ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi coretemp ip_tables x_tables autofs4 btrfs raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid10 raid1 ses enclosure scsi_transport_sas crct10dif_pclmul crc32_pclmul ghash_clmulni_intel igb aesni_intel hid_generic dca aes_x86_64 lrw ptp glue_helper ablk_helper ahci usbhid cryptd pps_core wmi hid libahci megaraid_sas i2c_algo_bit fjes

Looking deeper into the dmesg output, I found the following strange messages, which look rather suspicious:

    ...
    [ 0.081418] CPU: Physical Processor ID: 0
    [ 0.081421] CPU: Processor Core ID: 0
    [ 0.081427] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
    [ 0.081430] ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
    [ 0.081434] mce: CPU supports 20 MCE banks
    [ 0.081462] CPU0: Thermal monitoring enabled (TM1)
    ...
    [ 0.341838] cpuidle: using governor menu
    [ 0.341841] PCCT header not found.
    [ 0.341868] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
    [ 0.341873] ACPI: bus type PCI registered
    ...
    [ 1.313494] NET: Registered protocol family 1
    [ 1.313857] pci 0000:16:00.0: [Firmware Bug]: VPD access disabled
    [ 1.314223] pci 0000:04:00.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
    ...
    [ 1.591739] PCI: Probing PCI hardware (bus 7f)
    [ 1.591761] ACPI: \: failed to evaluate _DSM (0x1001)
    [ 1.591764] PCI host bridge to bus 0000:7f
    ...
    [ 1.595018] PCI: root bus ff: using default resources
    [ 1.595019] PCI: Probing PCI hardware (bus ff)
    [ 1.595039] ACPI: \: failed to evaluate _DSM (0x1001)
    ...
    [ 1.854466] ACPI: Power Button [PWRF]
    [ 1.855209] ERST: Can not request [mem 0x7e908000-0x7e909bff] for ERST.
    [ 1.855492] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.
    ...

Update 2

My question is not a duplicate of this question, because my IOPS are consistently half the expected IOPS for a single SSD, not for the whole RAID, even at low IO depths where the IOPS are very small (<10k).

Look at the chart above: at an IO depth of 1 the single SSDs average 5,794 IOPS, whereas each one should deliver at least 8,000, and that is far from my upper ceiling of 40k. I did not write down the RAID results because they match the expected behavior, but here they are: the RAID10 reaches about 120k IOPS at IO depths of 16 and 32, which is 40k times 3 (roughly 40k IOPS per pair of disks, due to the RAID10 mirroring penalty).

I also considered that my embedded RAID card could be the bottleneck, but I could not find a definite answer. For example, I observed that running the fio tests in parallel on every SSD (6 tests at the same time, each running on a single SSD) halves the per-SSD IOPS at IO depths of 16 and 32, dropping them to 20k where they reach 40k when tested alone.
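
A sketch of that parallel run, assuming the six SSDs are /dev/sdc through /dev/sdh (the device names are assumptions):

    # Hypothetical: one fio job per SSD, all started at the same time
    for dev in /dev/sd[c-h]; do
        fio --name="par-${dev##*/}" --filename="$dev" \
            --direct=1 --ioengine=libaio --rw=randread --bs=4k \
            --iodepth=32 --runtime=60 --time_based --group_reporting \
            --output="par-${dev##*/}.log" &
    done
    wait   # then compare per-device IOPS with the single-disk runs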

Let's try the following to analyze the single device sda (a combined command sketch follows this list):

  • check whether the SSD's private DRAM cache is enabled, by issuing hdparm -I /dev/sda (output reported above)
  • make sure your partitions (if any) are properly aligned (show the output of fdisk -l /dev/sda)
  • set the scheduler to deadline
  • make sure NCQ is enabled, using dmesg | grep -i ncq (again, output reported above)
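
A minimal sketch of those checks gathered in one place, assuming sda is the device under test:

    hdparm -I /dev/sda | grep -i 'write cache'      # is the drive's DRAM write cache enabled?
    fdisk -l /dev/sda                               # partition layout / alignment
    cat /sys/block/sda/queue/scheduler              # current I/O scheduler
    echo deadline > /sys/block/sda/queue/scheduler  # switch to deadline if needed
    dmesg | grep -i ncq                             # confirm NCQ is enabled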