Ideally, if a Promise Pegasus has a failing RAID member, or disk, we would want the Promise Utility GUI to report that. But that’s not always what happens.
A few years ago, when Backblaze published the SMART stats they pay attention to, I adopted a practice of replacing drives that show a non-zero RAW_VALUE for any of those stats. Backblaze looks at:
SMART 5 – Reallocated_Sector_Count.
SMART 187 – Reported_Uncorrectable_Errors.
SMART 188 – Command_Timeout.
SMART 197 – Current_Pending_Sector_Count.
SMART 198 – Offline_Uncorrectable.
I’ve never seen SMART 187 or 188 reported by a drive member on a Promise Pegasus RAID, but the other values are there.
We can obtain the SMART status of the members of a RAID like this:
- In Terminal, start promiseutil
- At the prompt, type smart -v (with the verbose flag on).
The output will show the SMART statistics for each member of the RAID.
So, what does a failing Promise Pegasus RAID member look like?
In this example, the Promise Utility reported the health of the RAID as fine, but the performance of this RAID suggested otherwise. Here’s the top of the log:
-------------------------------------------------------
PdId: 2
Model Number: TOSHIBA DT01ACA2
Drive Type: SATA
SMART Status: Enable
SMART Health Status: OK
SCT Status Version: 3
SCT Version (vendor specific): 256 (0x0100)
SCT Support Level: 1
Device State: SMART Off-line Data Collection executing in background (4)
Current Temperature: 37 Celsius
Power Cycle Min/Max Temperature: 29/40 Celsius
Lifetime Min/Max Temperature: 19/44 Celsius
Under/Over Temperature Limit Count: 0/0
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Error logging capability: (0x01) Error logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 249) minutes.
SCT capabilities: (0x003d) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.
SMART Self-test log structure revision number: 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Error Log Version: 1
Hmm…no signs of trouble. We see SMART Health Status: OK, so if we were just grepping or awking for that, we’d assume that all was well. But a few lines down, we find ATA Error Count: 4. This line doesn’t appear at all for a healthy member, even with the -v verbose flag. And it’s followed by the four errors themselves.
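Since the health line alone can mislead, a check can look at both lines for each member. Here is a minimal sketch of that idea, assuming the smart -v output has already been saved into a string. The function names are my own, and the only things it relies on are the PdId:, SMART Health Status: and ATA Error Count: labels seen in the log above.

```python
import re


def summarize_members(smart_output):
    """Split saved `smart -v` text into per-member sections and pull out two fields."""
    # re.split with one capture group returns [preamble, id1, body1, id2, body2, ...]
    parts = re.split(r"PdId:\s*(\d+)", smart_output)
    members = []
    for pd_id, body in zip(parts[1::2], parts[2::2]):
        health = re.search(r"SMART Health Status:\s*(\S+)", body)
        errors = re.search(r"ATA Error Count:\s*(\d+)", body)
        members.append({
            "pd_id": int(pd_id),
            "health": health.group(1) if health else None,
            # The line is missing on healthy members, so treat "missing" as 0.
            "ata_errors": int(errors.group(1)) if errors else 0,
        })
    return members


def suspect_members(smart_output):
    """Members whose health is not OK or that have any logged ATA errors."""
    return [m for m in summarize_members(smart_output)
            if m["health"] != "OK" or m["ata_errors"] > 0]


# Abridged, illustrative input: PdId 2 mirrors the log above, PdId 1 is a
# made-up healthy member with no ATA Error Count line.
sample = """\
PdId: 1
SMART Health Status: OK
SMART Error Log Version: 1
-------------------------------------------------------
PdId: 2
SMART Health Status: OK
SMART Error Log Version: 1
ATA Error Count: 4
"""

for member in suspect_members(sample):
    print(member)
```

With this sample, only PdId 2 is reported, even though both members claim a health status of OK.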
SMART Error Log Version: 1
ATA Error Count: 4
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 4 occurred at disk power-on lifetime: 34794 hours (1449 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
40 51 80 80 e7 1c 0d  Error: UNC 128 sectors at LBA = 0x0d1ce780 = 219998080
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
25 00 80 80 e7 1c 40 00  2d+17:20:39.072   READ DMA EXT
35 00 00 00 44 13 40 00  2d+17:20:39.070   WRITE DMA EXT
35 00 00 00 35 13 40 00  2d+17:20:39.061   WRITE DMA EXT
35 00 80 80 25 13 40 00  2d+17:20:39.052   WRITE DMA EXT
25 00 58 a8 1f 13 40 00  2d+17:20:39.046   READ DMA EXT

Error 3 occurred at disk power-on lifetime: 34794 hours (1449 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
40 51 78 80 b2 1e 0d  Error: UNC 120 sectors at LBA = 0x0d1eb280 = 220115584
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
25 00 78 80 b2 1e 40 00  2d+17:20:35.539   READ DMA EXT
35 00 00 00 ad 1e 40 00  2d+17:20:35.538   WRITE DMA EXT
25 00 18 e8 e1 1c 40 00  2d+17:20:34.067   READ DMA EXT
61 80 00 80 ac 1e 40 00  2d+17:20:31.037   WRITE FPDMA QUEUED
2f 00 01 10 00 00 00 00  2d+17:20:31.036   READ LOG EXT

Error 2 occurred at disk power-on lifetime: 34794 hours (1449 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
40 51 a8 58 e2 1c 0d
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
61 80 28 80 ad 1e 40 00  2d+17:20:09.122   WRITE FPDMA QUEUED
61 80 20 00 ae 1e 40 00  2d+17:20:09.121   WRITE FPDMA QUEUED
61 80 18 80 ae 1e 40 00  2d+17:20:09.121   WRITE FPDMA QUEUED
61 80 10 00 af 1e 40 00  2d+17:20:09.121   WRITE FPDMA QUEUED
61 80 08 80 af 1e 40 00  2d+17:20:09.121   WRITE FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 34706 hours (1446 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
40 51 20 78 c5 1f 0d
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
61 80 e0 80 d8 1a 40 00  01:51:41.474      WRITE FPDMA QUEUED
61 80 c8 00 d8 1a 40 00  01:51:41.474      WRITE FPDMA QUEUED
61 80 c0 00 d4 1a 40 00  01:51:41.473      WRITE FPDMA QUEUED
61 80 b8 80 d7 1a 40 00  01:51:41.473      WRITE FPDMA QUEUED
61 80 b0 00 d7 1a 40 00  01:51:41.473      WRITE FPDMA QUEUED
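The register rows in the error log can be cross-checked against the figures printed next to them. This is a small sketch, assuming the 28-bit reading in which the LBA is the low nibble of DH followed by CH, CL and SN. The helper names are my own, and I only check the result against Errors 4 and 3, where the log itself prints the LBA.

```python
def parse_register_row(row):
    """Turn a row like '40 51 80 80 e7 1c 0d' into named integer registers."""
    names = ["ER", "ST", "SC", "SN", "CL", "CH", "DH"]
    return dict(zip(names, (int(token, 16) for token in row.split())))


def lba_from_registers(sn, cl, ch, dh):
    """Combine the low nibble of DH with CH, CL and SN (assumed 28-bit layout)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register row from Error 4 in the log above.
regs = parse_register_row("40 51 80 80 e7 1c 0d")
lba = lba_from_registers(regs["SN"], regs["CL"], regs["CH"], regs["DH"])
print(regs["SC"], "sectors at LBA", lba)

# The power-on lifetime is the same kind of arithmetic: hours into days + hours.
days, hours = divmod(34794, 24)
print(days, "days +", hours, "hours")
```

For Error 4 this gives 128 sectors at LBA 219998080 and 1449 days + 18 hours, matching what the log prints.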
All of the errors occurred on a power up of the RAID. So what do the SMART stats for Backblaze’s preferred values (bolded) look like on this drive?
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b 074   074   016    Pre-fail Always  -           54134415
  2 Throughput_Performance  0x0005 139   139   054    Pre-fail Offline -           70
  3 Spin_Up_Time            0x0007 129   129   024    Pre-fail Always  -           295 (Average 294)
  4 Start_Stop_Count        0x0012 097   097   000    Old_age  Always  -           15544
  5 Reallocated_Sector_Ct   0x0033 081   081   005    Pre-fail Always  -           517
  7 Seek_Error_Rate         0x000b 100   100   067    Pre-fail Always  -           0
  8 Seek_Time_Performance   0x0005 124   124   020    Pre-fail Offline -           33
  9 Power_On_Hours          0x0012 096   096   000    Old_age  Always  -           34799
 10 Spin_Retry_Count        0x0013 100   100   060    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           41
192 Power-Off_Retract_Count 0x0032 076   076   000    Old_age  Always  -           29286
193 Load_Cycle_Count        0x0012 076   076   000    Old_age  Always  -           29286
194 Temperature_Celsius     0x0002 162   162   000    Old_age  Always  -           37 (Lifetime Min/Max 19/44)
196 Reallocated_Event_Count 0x0032 063   063   000    Old_age  Always  -           846
197 Current_Pending_Sector  0x0022 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0008 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x000a 200   200   000    Old_age  Always  -           0
As expected, no SMART 187 or 188 values. And 197 Current_Pending_Sector and 198 Offline_Uncorrectable are both 0.
But look at the RAW_VALUE for 5 Reallocated_Sector_Ct: 517. Not good. And while it’s not on Backblaze’s list, the RAW_VALUE for 1 Raw_Read_Error_Rate, 54134415, is really high.
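The rule of thumb I follow (replace on any non-zero RAW_VALUE for the Backblaze attributes) can be written down as a short sketch. It assumes the attribute table has the column layout shown above and has been saved as text. The regular expression and function name are my own and only read the leading integer of RAW_VALUE.

```python
import re

# The five attribute IDs from Backblaze's list.
BACKBLAZE_IDS = {5, 187, 188, 197, 198}

# ID, name, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, then the
# leading integer of RAW_VALUE (anything after it, like "(Average 294)", is ignored).
ROW = re.compile(
    r"\b(\d{1,3})\s+([A-Za-z][\w-]*)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d{3}\s+\d{3}\s+\d{3}\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def nonzero_backblaze_attrs(attribute_text):
    """Return {id: (name, raw_value)} for Backblaze attributes with a non-zero raw value."""
    hits = {}
    for match in ROW.finditer(attribute_text):
        attr_id = int(match.group(1))
        raw_value = int(match.group(3))
        if attr_id in BACKBLAZE_IDS and raw_value > 0:
            hits[attr_id] = (match.group(2), raw_value)
    return hits


# A few rows copied from the table above.
rows = """\
  1 Raw_Read_Error_Rate     0x000b 074   074   016    Pre-fail Always  -           54134415
  5 Reallocated_Sector_Ct   0x0033 081   081   005    Pre-fail Always  -           517
197 Current_Pending_Sector  0x0022 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0008 100   100   000    Old_age  Offline -           0
"""

print(nonzero_backblaze_attrs(rows))
```

On these rows only attribute 5 is flagged. Raw_Read_Error_Rate is skipped because it isn't on Backblaze's list, so a stricter check would need to add ID 1 to the set.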
The other RAID members had no errors, so we replaced the drive and rebuilt the RAID, and performance returned to normal.