Tag Archives: Promise

Promise Pegasus2: Scripting a SMART check with promiseutil

We’ve found that the Promise Pegasus2 Thunderbolt 2 RAID can report that the SMART Health status of its disks is just dandy, while the unit is quietly accumulating ATA errors that may indicate the pending failure of a disk.

I want to be notified if the Pegasus either has a SMART status failure or if ATA errors are present on any of the disks.

This script does just that. It’s essentially a more refined version of the previous promiseutil scripts that grabs the simple SMART status of each disk, greps to see if it’s “OK”, then runs a line of awk that looks at the report to see if there’s an “ATA Error Count”. As always, it logs to system.log and optionally sends error reports by email.

#!/bin/bash
#
# promise_smart_check.sh
#
# Checks Promise Pegasus2 SMART status, checks for ATA errors, logs and mails the output if there's an issue.
#
# Author: AB @ Modest Industries
#
# Requires Promise Utility for Pegasus2 (http://www.promise.com), tested with v3.18.0000.18
# Requires sendemail for email alerts (http://caspian.dotconf.net/menu/Software/SendEmail/)
#
# Edit History
# 2014-04-21 - AB: Version 1.0.
# 2014-04-24 - AB: Refactored.
# 2014-05-01 - AB: Incorporate the awk script to check for ATI errors.
# 2014-05-08 - AB: Refinements.
# 2014-05-15 - AB: Update to message body construction, tmp file & sendemail sanity checks.
# 2014-05-17 - AB: Added promiseutil path check.

export DATESTAMP=`date +%Y-%m-%d\ %H:%M:%S`

# Editable variables

# Path to sendemail
sendemail_path="/Library/Scripts/Monitoring/sendemail"

# Send email alerts?
send_email_alert=true

# Variables for sendemail
# Sender's address
alert_sender="[email protected]"

# Recipient's addresses, comma separated.
#alert_recipient='[email protected], [email protected]'
alert_recipient="[email protected]"

# SMTP server to send the messages through
alert_smtp_server="smtp.example.com"

# ------------ You probably shouldn't edit below this line ------------------
# Variables

# Default the error flags to false.
smart_error_flag="false"
ata_error_flag="false"

# Alert subject
alert_subject="ALERT: Promise Pegasus2 SMART problem detected on $HOSTNAME."

# Alert header
alert_header="At $DATESTAMP, a problem was detected on this device:\n"

# Pass / Fail messages
pass_msg="Promise Pegasus SMART check successful."
fail_msg=" *** Promise Pegasus SMART check FAILED!!! ***"

# Default the message body
message_body=""

# Alert footer
alert_footer="Run 'promiseutil -C smart -v' for more information."

# Promise Pegasus command line utility default path
promiseutil_path="/usr/bin/promiseutil"

# ----------------- Check for promiseutil, sendemail & set up temp files ------------------
if [ ! -f $promiseutil_path ]; then
        echo "$0 ERROR: $promiseutil_path does not exist"
        echo  "Please download and install the Promise Pegasus Utility app from http://promise.com"
        exit 1
fi

if [ ! -f $sendemail_path ]; then
        echo "$0 ERROR: $sendemail_path does not exist"
        echo  "Please download from http://caspian.dotconf.net/menu/Software/SendEmail/ and then set the \$sendmemail_path variable inside this script"
        exit 1
fi

unit_ID_tmp=`mktemp -q "/tmp/$$_unit_ID.XXXX"`
if [ $? -ne 0 ]; then
        echo "$0: ERROR: Can't create temp file, exiting..."
        exit 1
fi

smart_results_tmp=`mktemp -q "/tmp/$$_smart_results.XXXX"`
if [ $? -ne 0 ]; then
        echo "$0: ERROR: Can't create temp file, exiting..."
        exit 1
fi

# ----------------- Run promiseutil, evaluate the results ------------------

# Get Unit ID information for this Promise unit. Includes workaround for promiseutil tty issue.
screen -D -m sh -c "$promiseutil_path -C subsys -v >$unit_ID_tmp"

# Drop the output into a variable.
unit_ID=$(<$tmpdir$unit_ID_tmp)

# Get the SMART report, put it into a tmp file.
screen -D -m sh -c "$promiseutil_path -C smart -v >$smart_results_tmp"

# Grab the header for each PdId in the Promise
smart_status=$(cat $smart_results_tmp | grep -A4 "^PdId")

# Check the header to see if SMART Health Check reports a problem
if grep "^SMART Health Status:" <<< "$smart_status" | grep -qv "OK"
then
        smart_error_flag="true"
fi

# Check for ATA errors, which may indicate that the drive is failing even if SMART Health is OK
ata_errors=$(awk '/^PdId: [1-9][0-9]*/ \
                                { a=$0; n=4; next } \
                                n { --n; a=a "\n" $0; next } \
                                /^ATA Error Count*/ \
                                { ata_err=$0; print a "\n" ata_err "\n" }' \
                                "$smart_results")
# Flag if there were ATA errors
if [ "$ata_errors" != "" ]; then
        ata_error_flag="true"
fi

# ----------------- Build the message_body ------------------

# If there's a problem, build the header.
if [ "$smart_error_flag" ==  "true" ] || [ "$ata_error_flag" == "true" ]; then
        message_body="$alert_header\n\n$fail_msg\n\n$unit_ID\n\n"

        # SMART Health status.
        if [ "$smart_error_flag" == "true" ]; then
                message_body="$message_body\nSMART Health Status is reporting one or more bad drives."
        fi

        # Always include the smart_status
        message_body="$message_body\n\n$smart_status"

        # Then the ATA errors.
        if [ "$ata_error_flag" == "true" ]; then
                message_body="$message_body\n\nOne or more drives has an ATA Error Count and may be failing.\n\n$ata_errors"
        fi
fi

#  ----------------- Logging & email ------------------

# Log the results, conditionally send email on failure.
if [ "$ata_error_flag" == "true" ] || [ "$smart_error_flag" == "true" ]; then
        message_body="$message_body\n\n$alert_footer"
        echo "$DATESTAMP: \n\n$message_body" >> /var/log/system.log
        if [ "$send_email_alert" == "true" ] ; then
                "$sendemail_path" -f $alert_sender -t $alert_recipient -u $alert_subject -m "$message_body" -s $alert_smtp_server
        fi
else
        echo "$DATESTAMP: $pass_msg\n\n$unit_ID" >> /var/log/system.log
fi

# ----------------- Cleanup ------------------

rm -f rm -f $unit_ID_tmp $smart_results_tmp


This version of the script checks for the presence of promiseutil and sendemail. We call screen here because the promiseutil seems to need a TTY in order to run properly.

Hope you find it useful.

Promise Pegasus2: Scripting an Enclosure check with promise_enclosure_check.sh

The Promise Pegasus2 has onboard sensors that monitor the power supply  voltages, speed of the fan, and temperature of the controller and backplane.

This seems worth performing the occasional check on.

The example script below runs an initial check of the enclosure using promiseutil. If it doesn’t find that “Everything is OK”, it runs a more verbose check, logs the problem and optionally sends email.

#!/bin/bash
#
# promise_enclosure_check.sh
#
# Checks the status of a Promise Pegasus2 RAID enclosure and mails the output if there's an issue.
#
# Author: AB @ Modest Industries
#
# Works with Promise Utility for Pegasus2 v3.18.0000.18 (http://www.promise.com)
# Requires sendemail for email alerts (http://caspian.dotconf.net/menu/Software/SendEmail/)
#
# Edit History
# 2014-04-21 - AB: Version 1.0.
# 2014-05-08 - AB: Refinements.
# 2014-05-09 - AB: Better message_body if failed.

export DATESTAMP=`date +%Y-%m-%d\ %H:%M:%S`

# Editable variables

# Path to sendemail
sendemail_path="/Library/Scripts/Monitoring/sendemail"

# If a problem is found, send email?
send_email_alert=true

# Variables for sendemail
# Sender's address
alert_sender="[email protected]"

# Recipient's addresses, comma separated.
#alert_recipient='[email protected], [email protected]'
alert_recipient="[email protected]"

# SMTP server to send the messages through
alert_smtp_server="smtp.example.com"

# ------------ Do not edit below this line ------------------
# Variables

# Pass / fail flags
enclosure_pass=true

# The subject line of the alert.
alert_subject="Alert: Promise Pegasus2 enclosure problem detected on $HOSTNAME."

# Alert header
alert_header="At $DATESTAMP, an enclosure problem was detected on this device:\n"

# Pass / Fail messages
pass_msg="Promise Pegasus Enclosure check successful."
fail_msg=" *** Promise Pegasus Enclosure check FAILED!!! ***\n\n"

# Alert footer
alert_footer="Run 'promiseutil -C enclosure -v' for more information."

# Create temp files
unit_ID_tmp=`mktemp "/tmp/$$_unit_ID.XXXX"`
enclosure_results_tmp=`mktemp "/tmp/$$_enclosure_results.XXXX"`

message_body="$alert_header"

# Get the information for this Promise unit. Includes workaround for promiseutil tty issue.
screen -D -m sh -c "promiseutil -C subsys -v >$unit_ID_tmp"

# Drop the output into a variable.
unit_ID=$(<$unit_ID_tmp)

# Get the report, put it into a tmp file.
screen -D -m sh -c "promiseutil -C enclosure >$enclosure_results_tmp"

if ! grep -qv "Everything is OK" $enclosure_results_tmp
then
        enclosure_pass="false"
        # Get a more detailed report, put it into a tmp file.
        screen -D -m sh -c "promiseutil -C enclosure -v >$enclosure_results_tmp"

        # Build the message.
        message_body=$message_body$fail_header$unit_ID$(<$enclosure_results_tmp)
fi

#  ----------------- Logging & email ------------------

# Log the results, conditionally send email on failure.
if [ "$enclosure_pass" == "false" ]; then
        message_body="$message_body\n\n$alert_footer"
        echo "$DATESTAMP: \n\n$message_body" >> /var/log/system.log
        if [ "$send_email_alert" == "true" ] ; then
                "$sendemail_path" -f $alert_sender -t $alert_recipient -u $alert_subject -m "$message_body" -s $alert_smtp_server
        fi
else
        echo "$DATESTAMP: $pass_msg" >> /var/log/system.log
fi
# Cleanup
rm -f rm -f $unit_ID_tmp $enclosure_results_tmp

The script was developed against a Promise Pegasus2. It hasn’t been tested with the earlier Promise Pegasus series.

2014-11-07 – Update: Merci to Stéphane Allain for catching a typo in the script.

Promise Pegasus2: The gap between a failing disk and a failed disk.

We were recently called in to diagnose a relatively new Promise Pegasus2 R6 that intermittently refused to mount. The Promise Utility app reported nothing amiss with the RAID or the drives, green lights everywhere, so we used the command line to dig a little deeper.

So let’s run a verbose SMART check on the unit:

promiseutil -C smart -v

The first three drives checked out. Drive 4 indicated that SMART thought everything was fine:

PdId: 4
Model Number: TOSHIBA DT01ACA2
Drive Type: SATA
SMART Status: Enable
SMART Health Status: OK

But then a little further down,  CRC errors:

Error 165 occurred at disk power-on lifetime: 1176 hours (49 days + 0 hours)
  When the command that caused the error occurred,
  the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 50 b0 ee 81 0d

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 80 a8 80 ee 81 40 00      18:38:48.276  WRITE FPDMA QUEUED
  61 80 a0 00 ee 81 40 00      18:38:48.276  WRITE FPDMA QUEUED
  61 80 98 80 ed 81 40 00      18:38:48.276  WRITE FPDMA QUEUED
  61 80 90 00 ed 81 40 00      18:38:48.276  WRITE FPDMA QUEUED
  61 80 88 80 ec 81 40 00      18:38:48.275  WRITE FPDMA QUEUED

Error 164 occurred at disk power-on lifetime: 1175 hours (48 days + 23 hours)
  When the command that caused the error occurred,
  the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 10 f0 ad 6b 0d  Error: ICRC, ABRT 16 sectors at LBA = 0x0d6badf0 = 225160688

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 80 80 ad 6b 40 00      18:36:07.145  WRITE DMA EXT
  35 00 80 00 ae 6b 40 00      18:36:07.144  WRITE DMA EXT
  35 00 80 00 ad 6b 40 00      18:36:07.144  WRITE DMA EXT
  35 00 80 80 ab 6b 40 00      18:36:07.139  WRITE DMA EXT
  35 00 80 00 ab 6b 40 00      18:36:07.139  WRITE DMA EXT

Error 163 occurred at disk power-on lifetime: 1175 hours (48 days + 23 hours)
  When the command that caused the error occurred,
  the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 f0 10 5e 5d 0d  Error: ICRC, ABRT 240 sectors at LBA = 0x0d5d5e10 = 224222736
...

The client confirmed that he’d seen a warning light on drive 4, but that it had “gone away”. We had them back the data up immediately. Promise support subsequently verified that the drive had failed based on the logs and sent a replacement drive out.

If the drive had failed completely, I assume the RAID would have kicked in, taken the bad drive offline and continued spinning, but since the drive hadn’t actually failed, the volume was struggling with a failing member and that was causing boot and performance issues.

The take-away is that there’s a generous gap between a drive that’s beginning to fail and a drive that’s failed enough for the Promise Utility app to detect it. Verbose mode is your friend.

Turn off Promise Pegasus2 power savings one-liner.

From Terminal:

promiseutil -C ctrl -a mod -s “powersavinglevel=0″

Now check your handiwork:

promiseutil -C ctrl -v | grep PowerSavingLevel

You should see:

PowerSavingLevel: 0

You may be tempted to turn off PowerManagement. Here’s what Promise’s website has to say about that:

Do not disable Power Management. Doing so will make the Pegasus2 hang when the Mac is in sleep or shut down and restarted and will require the Pegasus and the Mac to be power cycled to return to normal operation.

Promise Pegasus2 command line tools.

When I began deploying Promise Pegasus2 storage devices, I wasn’t happy with the state of the Promise Utility app. It doesn’t provide email alerts except when a user is logged in and this is isn’t optimal for most of our deployments.

Then I stumbled on a couple of Ruby scripts by GriffithStudio that showed a way around many of the limitations of the Promise GUI.

When you install the Promise Utility for Promise Pegasus2, it includes a command line utility. You can view status and even change the settings of the device. In a Terminal window, type:

promiseutil

You’ll be greeted with an interactive command line.

-------------------------------------------------------------
Promise Utility
Version: 3.18.0000.18 Build Date: Oct 29, 2013
-------------------------------------------------------------
 
List available RAID HBAs and Subsystems
=============================================================
Type  #    Model         Alias                            WWN                 
=============================================================
hba   1  * Pegasus2 R4                                    2000-0001-5557-98bf 
 
Totally 1 HBA(s) and 0 Subsystem(s)
 
-------------------------------------------------------------
The row with '*' sign refers the current working HBA/Subsystem path
To change the current HBA/Subsystem path, you may use the following command:
  
  spath -a chgpath -t hba|subsys -p <path #>.
 
Type help or ? to display all the available commands
-------------------------------------------------------------
 
cliib> 

To get a list of commands, type ? and press return. Some of the available commands include:

subsys - Model, serial number, hardware revision.
enclosure - Enclosure status.
ctrl - Firmware version, array & RAID status.
phydrv - Physical drive status.
array - Array status.
logdrv - Logical drive status.
event - Event log, including abnormal shutdowns.

Many of the commands yield a brief Pass/Fail style response:

cliib> enclosure 
=============================================================
Id  EnclosureType               OpStatus  StatusDescription                    
=============================================================
1   Pegasus2-R4                 OK        Everything is OK

If you want more details, you can add the verbose flag, -v. Want the serial number, model and hardware revision?

cliib> subsys -v
 
-------------------------------------------------------------
Alias: 
Vendor: Promise Technology,Inc.        Model: Pegasus2 R4
PartNo: F29DS4722000000                SerialNo: M00H00CXXXXXXXX
Rev: B3                                WWN: 2000-0001-5557-98bf

You can grab enclosure information, including temperature of box, backplane and controller card, as well as the rotation speed of the fans and voltage on the power rails with this:

cliib> enclosure -v
 
 
-------------------------------------------------------------
Enclosure Setting:
 
EnclosureId: 1
CtrlWarningTempThreshold: 63C/145F     CtrlCriticalTempThreshold: 68C/154F
 
 
-------------------------------------------------------------
Enclosure Info and Status:
 
EnclosureId: 1
EnclosureType: Pegasus2-R4
SEPFwVersion: 1.00
MaxNumOfControllers: 1                 MaxNumOfPhyDrvSlots: 4
MaxNumOfFans: 1                        MaxNumOfBlowers: 0
MaxNumOfTempSensors: 2                 MaxNumOfPSUs: 1
MaxNumOfBatteries: 0                   MaxNumOfVoltageSensors: 3
 
=============================================================
PSU       Status                        
=============================================================
1         Powered On and Functional     
 
=============================================================
Fan Location        FanStatus             HealthyThreshold  CurrentFanSpeed 
=============================================================
1   Backplane       Functional            > 1000 RPM        1200 RPM        
 
=============================================================
TemperatureSensor   Location       HealthThreshold   CurrentTemp    Status    
=============================================================
1                   Controller     < 63C/145F        49C/120F       normal    
2                   Backplane      < 53C/127F        47C/116F       normal    
 
=============================================================
VoltageSensor  Type    HealthyThreshold         CurrentVoltage  Status         
=============================================================
1              3.3V    +/- 5% (3.13 - 3.46) V   3.2V            Operational    
2              5.0V    +/- 5% (4.75 - 5.25) V   5.0V            Operational    
3              12.0V   +/- 8% (11.04 - 12.96) V 12.0V           Operational 

How about the state of the physical drives?

cliib> phydrv   
=============================================================
PdId Model        Type      Capacity  Location      OpStatus  ConfigStatus     
=============================================================
1    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot1   OK        Array0 No.0      
2    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot2   OK        Array0 No.1      
3    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot3   OK        Array0 No.2      
4    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot4   OK        Array0 No.3

You can also run commands without entering interactive mode and this is useful when incorporating into bash scripts. Simply add the -C flag, followed by the command you want to run. For example:

krieger:~ admin$ promiseutil -C logdrv -v

will let you view the logical drives.

Many of these commands will run on the previous generation Promise Pegasus units.

Be aware: it’s possible to change the configuration of your Pegasus2 or even destroy your RAID setup from the command line, so use caution when working on production systems.