We were recently called in to diagnose a relatively new Promise Pegasus2 R6 that intermittently refused to mount. The Promise Utility app reported nothing amiss with the RAID or the drives, green lights everywhere, so we used the command line to dig a little deeper.
We ran a verbose SMART check on the unit:
promiseutil -C smart -v
The first three drives checked out. Drive 4 indicated that SMART thought everything was fine:
PdId: 4
Model Number: TOSHIBA DT01ACA2
Drive Type: SATA
SMART Status: Enable
SMART Health Status: OK
But then a little further down, CRC errors:
Error 165 occurred at disk power-on lifetime: 1176 hours (49 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 50 b0 ee 81 0d
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 80 a8 80 ee 81 40 00      18:38:48.276  WRITE FPDMA QUEUED
  61 80 a0 00 ee 81 40 00      18:38:48.276  WRITE FPDMA QUEUED
  61 80 98 80 ed 81 40 00      18:38:48.276  WRITE FPDMA QUEUED
  61 80 90 00 ed 81 40 00      18:38:48.276  WRITE FPDMA QUEUED
  61 80 88 80 ec 81 40 00      18:38:48.275  WRITE FPDMA QUEUED

Error 164 occurred at disk power-on lifetime: 1175 hours (48 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 10 f0 ad 6b 0d  Error: ICRC, ABRT 16 sectors at LBA = 0x0d6badf0 = 225160688
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 80 80 ad 6b 40 00      18:36:07.145  WRITE DMA EXT
  35 00 80 00 ae 6b 40 00      18:36:07.144  WRITE DMA EXT
  35 00 80 00 ad 6b 40 00      18:36:07.144  WRITE DMA EXT
  35 00 80 80 ab 6b 40 00      18:36:07.139  WRITE DMA EXT
  35 00 80 00 ab 6b 40 00      18:36:07.139  WRITE DMA EXT

Error 163 occurred at disk power-on lifetime: 1175 hours (48 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 f0 10 5e 5d 0d  Error: ICRC, ABRT 240 sectors at LBA = 0x0d5d5e10 = 224222736
  ...
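If you want to sanity-check what those register bytes mean, here is a rough Python sketch that decodes one "registers were" row. It assumes the standard ATA error register layout (bit 7 is ICRC, bit 2 is ABRT) and a 28-bit LBA assembled from the low nibble of DH plus CH, CL and SN, which reproduces the LBA values printed in the log above. The function names are ours, purely for illustration; they are not part of promiseutil.

```python
# Decode the seven bytes from a SMART error log "registers were" row:
# ER ST SC SN CL CH DH. Assumes the 28-bit register layout, which matches
# the LBA values printed in the log excerpt.

# A subset of the ATA error register bits (bit number -> flag name).
ERROR_BITS = {7: "ICRC", 6: "UNC", 4: "IDNF", 2: "ABRT"}


def decode_error_registers(row):
    """Return the error flags, sector count and LBA for one register row."""
    er, st, sc, sn, cl, ch, dh = (int(byte, 16) for byte in row.split())
    flags = [name for bit, name in sorted(ERROR_BITS.items(), reverse=True)
             if er & (1 << bit)]
    # The low nibble of DH holds LBA bits 24-27; CH, CL, SN hold the rest.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {"flags": flags, "sectors": sc, "lba": lba}


def describe(row):
    """Format a register row in the same style as the log's 'Error:' text."""
    d = decode_error_registers(row)
    return (f"{', '.join(d['flags'])} {d['sectors']} sectors "
            f"at LBA = 0x{d['lba']:08x} = {d['lba']}")


print(describe("84 51 10 f0 ad 6b 0d"))
# ICRC, ABRT 16 sectors at LBA = 0x0d6badf0 = 225160688
```

Feeding it the Error 165 row (84 51 50 b0 ee 81 0d), which has no "Error:" summary in our capture, gives the same ICRC and ABRT flags, so all three logged errors look like interface CRC failures during writes.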
The client confirmed that they’d seen a warning light on drive 4, but that it had “gone away”. We had them back up the data immediately. Promise support subsequently verified from the logs that the drive had failed and sent out a replacement.
If the drive had failed completely, I assume the RAID would have kicked in, taken the bad drive offline and continued spinning. Because the drive hadn’t actually failed, the volume was struggling along with a failing member, and that was causing the boot and performance issues.
The take-away is that there’s a generous gap between a drive that’s beginning to fail and a drive that’s failed enough for the Promise Utility app to detect it. Verbose mode is your friend.
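Since the health flag alone can't be trusted, one option is to save the verbose output to a text file and scan it for logged errors. The sketch below is only that, a sketch: it assumes the text looks like the excerpt in this post ("SMART Health Status: OK", "Error N occurred at disk power-on lifetime", "ICRC"), and the function name and the sample string are made up for illustration.

```python
import re


def summarize_smart(text):
    """Summarize one drive's verbose SMART text.

    Assumes the wording shown in this post; other firmware or tool
    versions may phrase these lines differently.
    """
    health_ok = re.search(r"SMART Health Status:\s*OK", text) is not None
    logged = len(re.findall(r"Error \d+ occurred at disk power-on lifetime", text))
    crc = len(re.findall(r"\bICRC\b", text))
    return {
        "health_ok": health_ok,
        "logged_errors": logged,
        "crc_errors": crc,
        # The case from this post: health says OK, but errors are logged.
        "suspect": health_ok and logged > 0,
    }


# Trimmed, hypothetical sample built from lines in the excerpt above.
sample = """PdId: 4
SMART Health Status: OK
Error 164 occurred at disk power-on lifetime: 1175 hours (48 days + 23 hours)
Error: ICRC, ABRT 16 sectors at LBA = 0x0d6badf0 = 225160688
Error 163 occurred at disk power-on lifetime: 1175 hours (48 days + 23 hours)
Error: ICRC, ABRT 240 sectors at LBA = 0x0d5d5e10 = 224222736
"""

print(summarize_smart(sample))
# {'health_ok': True, 'logged_errors': 2, 'crc_errors': 2, 'suspect': True}
```

A drive that comes back with "suspect" set is worth a closer look by hand, even when every light on the enclosure is green.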