Inconsistent number of physical disks reported by MegaCli

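First, here is the abridged version of my problem. I have a flashing red light on a drive in my RAID array. Although MegaCli does not report any disk failures or warnings, some MegaCli commands show 24 disks while others show only 23. I am also seeing the following error every day: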

Event Description: Controller encountered a fatal error and was reset

Are these things related? Is there a problem here?

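Now here is the longer version. I have inherited responsibility for a server (let's call it my_server) that is hosted in a data center and that, I believe, has an LSI MegaRAID SAS 9265-8i with a RAID 50 / RAID 5+0 configuration. I received an email from the data center saying that a red light was flashing on one of the drives. Unfortunately I know next to nothing about RAID arrays, so I have had to feel my way through the MegaRAID SAS Software User Guide and various online tutorials.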

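I ssh'ed in to the server to try to diagnose the problem. What follows is a sample shell session that demonstrates my efforts and provides relevant information about the system.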

First I check the basic system info:

    $ cat /etc/issue
    CentOS release 6.4 (Final)
    Kernel \r on an \m

    $ uname -a
    Linux my_server 2.6.32-358.11.1.el6.x86_64 #1 SMP Wed Jun 12 03:34:52 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Next I verify the RAID array and the MegaCli version:

    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -aALL | grep "Product Name"
    Product Name : LSI MegaRAID SAS 9265-8i

    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -a0 | grep 'RAID Level'
    RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3

    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -v
    MegaCLI SAS RAID Management Tool Ver 8.04.07 May 28, 2012
    (c)Copyright 2011, LSI Corporation, All Rights Reserved.
    Exit Code: 0x00

Now, some summary info about the drives in the array:

    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | grep -A8 "Device Present"
    Device Present
    ================
    Virtual Drives : 1
      Degraded : 0
      Offline : 0
    Physical Devices : 27
      Disks : 24
      Critical Disks : 0
      Failed Disks : 0

Everything looks fine here. Then I check for SMART alerts:

    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep 'SMART'
    Drive has flagged a SMART alert : No
    Drive has flagged a SMART alert : No
    [...]
    Drive has flagged a SMART alert : No
    Drive has flagged a SMART alert : No

There are no SMART alerts, so after reading a few tutorials I ran some other commands:

    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -ldinfo -lall -a0 | grep Drives
    Number Of Drives : 23

    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -aALL | grep -Pi 'SPAN|Span\ Ref|Number\ of'
    Number of DISK GROUPS: 1
    Number of Spans: 1
    SPAN: 0
    Span Reference: 0x00
    Number of PDs: 23
    Number of VDs: 1
    Number of dedicated Hotspares: 0
    Number Of Drives : 23
    Span Depth : 1
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 0
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 1
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 2
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 3
    [...]
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 20
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 21
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 22

Now I am a bit confused, because some commands (e.g. adpallinfo and pdlist) show 24 disks present, while others (e.g. ldinfo and CfgDsply) show only 23.

Finally, I generate an event log file and look for signs of trouble:

    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -adpeventlog -getevents -f lsi-events.log -a0 -nolog

    $ cat lsi-events.log | grep -P -i 'fail|error|warn'
    [...]
    Event Description: Controller encountered a fatal error and was reset
    Event Description: Controller encountered a fatal error and was reset
    Event Description: Controller encountered a fatal error and was reset
    Event Description: Controller encountered a fatal error and was reset
    Event Description: Controller encountered a fatal error and was reset

    $ cat lsi-events.log | grep -B6 -A3 -P -i 'fail|error|warn'
    [...]
    seqNum: 0x000f8644
    Time: Sun Feb 26 07:32:16 2017
    Code: 0x00000159
    Class: 2
    Locale: 0x20
    Event Description: Controller encountered a fatal error and was reset
    Event Data:
    ===========
    None

I also look for messages specifically related to slot 23:

    $ cat lsi-events.log | grep -P -i 's23' | tail -30
    Event Description: Power state change on PD 1f(e0x21/s23) from POWERSAVE(1) to TRANSITION(ff)
    Event Description: Power state change on PD 1f(e0x21/s23) from TRANSITION(ff) to ON(0)
    Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
    Event Description: Inserted: PD 1f(e0x21/s23)
    Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
    Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
    Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
    Event Description: Power state change on PD 1f(e0x21/s23) from POWERSAVE(1) to TRANSITION(ff)
    Event Description: Power state change on PD 1f(e0x21/s23) from TRANSITION(ff) to ON(0)
    Event Description: Inserted: PD 1f(e0x21/s23)
    Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
    Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
    Event Description: Inserted: PD 1f(e0x21/s23)
    Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
    Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
    Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
    Event Description: Inserted: PD 1f(e0x21/s23)
    Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
    Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
    Event Description: Inserted: PD 1f(e0x21/s23)
    Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
    Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
    Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
    Event Description: Global Hot Spare PD 1f(e0x21/s23) (global,rev) disabled
    Event Description: State change on PD 1f(e0x21/s23) from HOT SPARE(2) to UNCONFIGURED_GOOD(0)
    Event Description: Power state change on PD 1f(e0x21/s23) from POWERSAVE(1) to TRANSITION(ff)
    Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
    Event Description: State change on PD 1f(e0x21/s23) from UNCONFIGURED_GOOD(0) to HOT SPARE(2)
    Event Description: Power state change on PD 1f(e0x21/s23) from TRANSITION(ff) to ON(0)
    Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
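
A note on reading these identifiers: the events name the drive as `e0x21/s23`, whereas `-PhysDrv` takes `[E:S]` in decimal. The sketch below assumes the enclosure is written in hex and the slot in decimal, which is consistent with 0x21 being 33, the Enclosure Device ID that `-PDInfo` reports on this controller. The helper name `pd_to_physdrv` is made up for illustration.

```shell
# Hypothetical helper: turn an event-log PD reference such as
# "PD 1f(e0x21/s23)" into the "[E:S]" form used with -PhysDrv.
# Assumption: enclosure is hex (0x..), slot is already decimal.
pd_to_physdrv() {
    # Extract "0x21 23" from the "(e0x21/s23)" part of the argument.
    pair=$(printf '%s\n' "$1" |
        sed -n 's|.*(e\(0x[0-9a-fA-F]*\)/s\([0-9]*\)).*|\1 \2|p')
    [ -n "$pair" ] || return 1
    # printf %d converts the 0x-prefixed enclosure to decimal.
    printf '[%d:%s]\n' "${pair% *}" "${pair#* }"
}
```

With that, `pd_to_physdrv 'PD 1f(e0x21/s23)'` prints `[33:23]`.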

I contacted the data center and was told that the flashing light is on drive 10, so I looked at that drive:

    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDInfo -PhysDrv [33:10] -a0
    Enclosure Device ID: 33
    Slot Number: 10
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 10
    Enclosure position: 1
    Device Id: 18
    WWN: 5000C500344D5940
    Sequence Number: 2
    Media Error Count: 0
    Other Error Count: 0
    Predictive Failure Count: 0
    Last Predictive Failure Event Seq Number: 0
    PD Type: SAS
    Raw Size: 1.819 TB [0xe8e088b0 Sectors]
    Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
    Coerced Size: 1.818 TB [0xe8d00000 Sectors]
    Emulated Drive: No
    Firmware state: Online, Spun Up
    Commissioned Spare : No
    Emergency Spare : No
    Device Firmware Level: 0006
    Shield Counter: 0
    Successful diagnostics completion on : N/A
    SAS Address(0): 0x5000c500344d5941
    SAS Address(1): 0x5000c500344d5942
    Connected Port Number: 0(path0) 1(path1)
    Inquiry Data: SEAGATE ST32000444SS 00069WM6369D
    FDE Capable: Not Capable
    FDE Enable: Disable
    Secured: Unsecured
    Locked: Unlocked
    Needs EKM Attention: No
    Foreign State: None
    Device Speed: 6.0Gb/s
    Link Speed: 6.0Gb/s
    Media Type: Hard Disk Device
    Drive Temperature :26C (78.80 F)
    PI Eligibility: No
    Drive is formatted for PI information: No
    PI: No PI
    Port-0 :
    Port status: Active
    Port's Linkspeed: 6.0Gb/s
    Port-1 :
    Port status: Active
    Port's Linkspeed: 6.0Gb/s
    Drive has flagged a SMART alert : No

    Exit Code: 0x00

I also tried using smartctl:

    $ sudo smartctl -a -d megaraid,18 /dev/sdc
    smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.11.1.el6.x86_64] (local build)
    Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

    Vendor: SEAGATE
    Product: ST32000444SS
    Revision: 0006
    User Capacity: 2,000,398,934,016 bytes [2.00 TB]
    Logical block size: 512 bytes
    Logical Unit id: 0x5000c500344d5943
    Serial number: 9WM6369D0000914458SC
    Device type: disk
    Transport protocol: SAS
    Local Time is: Tue Feb 28 17:18:33 2017 CST
    Device supports SMART and is Enabled
    Temperature Warning Enabled
    SMART Health Status: OK

    Current Drive Temperature: 26 C
    Drive Trip Temperature: 68 C
    Manufactured in week 21 of year 2011
    Specified cycle count over device lifetime: 10000
    Accumulated start-stop cycles: 41
    Specified load-unload count over device lifetime: 300000
    Accumulated load-unload cycles: 41
    Elements in grown defect list: 0
    Vendor (Seagate) cache information
      Blocks sent to initiator = 3508224337
      Blocks received from initiator = 38846232
      Blocks read from cache and sent to initiator = 44013719
      Number of read and write commands whose size <= segment size = 2649500
      Number of read and write commands whose size > segment size = 4
    Vendor (Seagate/Hitachi) factory information
      number of hours powered up = 45862.30
      number of minutes until next internal SMART test = 46

    Error counter log:
               Errors Corrected by           Total   Correction     Gigabytes    Total
                   ECC          rereads/    errors   algorithm      processed    uncorrected
               fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
    read:    22540834        0         0   22540834    22540834        230.346           0
    write:          0        0         0          0           0         20.012           0
    verify: 161330204        1         0  161330205   161330205       1896.577           0

    Non-medium error count: 0

    [GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
    No self-tests have been logged
    Long (extended) Self Test duration: 18500 seconds [308.3 minutes]

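The discrepancy between the logical drive view and the physical device view arises because the drive in your slot 23 is configured as a Global Hotspare. It is therefore not assigned to any logical drive, and it can step in as a spare for any LD that enters a degraded state. So you have 24 physical drives: 23 assigned to LD 0, plus one Global Hotspare.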
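
One way to see that split directly is to tally the `Firmware state` lines of `-PDList`. This is only a sketch: the function name `count_pd_states` is made up, and it assumes every physical drive record carries a `Firmware state: <state>, <spin status>` line like the one in your `-PDInfo` output.

```shell
# Hypothetical helper: read `MegaCli64 -PDList -aALL` text on stdin and
# count drives per firmware state (the part before the first comma).
count_pd_states() {
    awk -F': *' '
        /^Firmware state/ {
            state = $2
            sub(/,.*/, "", state)         # drop the ", Spun Up" style suffix
            sub(/[ \t\r]+$/, "", state)   # trim trailing whitespace
            count[state]++
        }
        END { for (s in count) printf "%s: %d\n", s, count[s] }
    ' | sort
}
```

It would be run as `sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | count_pd_states`. If the explanation above is right, the counts should add up to 24, with a single drive in a hot spare state.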

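As for the flashing red light on the drive, you should check with the DC which slot it is, then use MegaCli -PDInfo -PhysDrv [E:S] -a0 to see detailed information about that drive's state, where E is the enclosure number and S is the slot number. A flashing red light is usually a sign of a PFA / SMART predicted failure, though your mileage may vary.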
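
To compare that drive against its neighbours, the error counters of every drive can be condensed to one line each. The sketch below is an illustration rather than a supported tool: `pd_summary` is a made-up name, and it assumes each record in the `-PDList` output starts with an `Enclosure Device ID` line and uses the same field labels as the `-PDInfo` output in the question.

```shell
# Hypothetical helper: one summary line per physical drive, read from
# `MegaCli64 -PDList -aALL` (or -PDInfo) text on stdin.
pd_summary() {
    awk -F': *' '
        function emit() {
            if (slot != "")
                printf "[%s:%s] media=%s other=%s predictive=%s state=%s\n",
                       encl, slot, media, other, pred, state
        }
        # Assumption: "Enclosure Device ID" is the first line of each record.
        /^Enclosure Device ID/ {
            emit()
            encl = $2; slot = ""
            media = other = pred = state = "?"
        }
        /^Slot Number/              { slot = $2 }
        /^Media Error Count/        { media = $2 }
        /^Other Error Count/        { other = $2 }
        /^Predictive Failure Count/ { pred = $2 }
        /^Firmware state/           { state = $2 }
        END { emit() }
    '
}
```

Run as `sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | pd_summary`. A drive whose counters are non-zero, or whose state differs from the rest, is the one to look at first.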

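As a side note, using grep to check the results of commands with verbose, human-readable output such as MegaCli is a habit that will eventually get you into trouble.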
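
As a rough illustration of a more record-aware approach, the event log can be read one event at a time and filtered on its `Class` field instead of on keywords in the description. This sketch makes two assumptions: that every event in the file has the `seqNum` / `Time` / `Class` / `Event Description` fields shown in the question, and that a higher `Class` number means a more severe event (the fatal reset above has `Class: 2`). The name `events_by_class` is made up.

```shell
# Hypothetical helper: read a MegaCli event log on stdin and print one
# line per event whose Class is at least the given threshold (default 1).
events_by_class() {
    min_class=${1:-1}
    awk -v min="$min_class" '
        /^seqNum:/ { seq = $2; when = ""; class = "" }   # new event record
        /^Time:/   { when = $0; sub(/^Time: */, "", when) }
        /^Class:/  { class = $2 }
        /^Event Description:/ {
            desc = $0
            sub(/^Event Description: */, "", desc)
            if (class != "" && class + 0 >= min + 0)
                printf "%s | %s | class=%s | %s\n", seq, when, class, desc
        }
    '
}
```

For example, `events_by_class 2 < lsi-events.log` would list each controller reset together with its timestamp, which makes it easier to see whether the resets line up with the hot spare in slot 23 being re-inserted.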