Promise Pegasus2: Scripting a SMART check with promiseutil

We’ve found that the Promise Pegasus2 Thunderbolt 2 RAID can report that the SMART Health status of its disks is just dandy, while the unit is quietly accumulating ATA errors that may indicate the pending failure of a disk.

I want to be notified if the Pegasus either has a SMART status failure or if ATA errors are present on any of the disks.

This script does just that. It’s essentially a more refined version of the previous promiseutil scripts that grabs the simple SMART status of each disk, greps to see if it’s “OK”, then runs a line of awk that looks at the report to see if there’s an “ATA Error Count”. As always, it logs to system.log and optionally sends error reports by email.

#!/bin/bash
#
# promise_smart_check.sh
#
# Checks Promise Pegasus2 SMART status, checks for ATA errors, logs and mails the output if there's an issue.
#
# Author: AB @ Modest Industries
#
# Requires Promise Utility for Pegasus2 (http://www.promise.com), tested with v3.18.0000.18
# Requires sendemail for email alerts (http://caspian.dotconf.net/menu/Software/SendEmail/)
#
# Edit History
# 2014-04-21 - AB: Version 1.0.
# 2014-04-24 - AB: Refactored.
# 2014-05-01 - AB: Incorporate the awk script to check for ATA errors.
# 2014-05-08 - AB: Refinements.
# 2014-05-15 - AB: Update to message body construction, tmp file & sendemail sanity checks.
# 2014-05-17 - AB: Added promiseutil path check.

export DATESTAMP=`date +%Y-%m-%d\ %H:%M:%S`

# Editable variables

# Path to sendemail
sendemail_path="/Library/Scripts/Monitoring/sendemail"

# Send email alerts?
send_email_alert=true

# Variables for sendemail
# Sender's address
alert_sender="[email protected]"

# Recipient's addresses, comma separated.
#alert_recipient='[email protected], [email protected]'
alert_recipient="[email protected]"

# SMTP server to send the messages through
alert_smtp_server="smtp.example.com"

# ------------ You probably shouldn't edit below this line ------------------
# Variables

# Default the error flags to false.
smart_error_flag="false"
ata_error_flag="false"

# Alert subject
alert_subject="ALERT: Promise Pegasus2 SMART problem detected on $HOSTNAME."

# Alert header
alert_header="At $DATESTAMP, a problem was detected on this device:\n"

# Pass / Fail messages
pass_msg="Promise Pegasus SMART check successful."
fail_msg=" *** Promise Pegasus SMART check FAILED!!! ***"

# Default the message body
message_body=""

# Alert footer
alert_footer="Run 'promiseutil -C smart -v' for more information."

# Promise Pegasus command line utility default path
promiseutil_path="/usr/bin/promiseutil"

# ----------------- Check for promiseutil, sendemail & set up temp files ------------------
if [ ! -f "$promiseutil_path" ]; then
        echo "$0 ERROR: $promiseutil_path does not exist"
        echo  "Please download and install the Promise Pegasus Utility app from http://promise.com"
        exit 1
fi

# Only insist on sendemail if email alerts are enabled.
if [ "$send_email_alert" == "true" ] && [ ! -f "$sendemail_path" ]; then
        echo "$0 ERROR: $sendemail_path does not exist"
        echo  "Please download from http://caspian.dotconf.net/menu/Software/SendEmail/ and then set the \$sendemail_path variable inside this script"
        exit 1
fi

unit_ID_tmp=`mktemp -q "/tmp/$$_unit_ID.XXXX"`
if [ $? -ne 0 ]; then
        echo "$0: ERROR: Can't create temp file, exiting..."
        exit 1
fi

smart_results_tmp=`mktemp -q "/tmp/$$_smart_results.XXXX"`
if [ $? -ne 0 ]; then
        echo "$0: ERROR: Can't create temp file, exiting..."
        exit 1
fi

# ----------------- Run promiseutil, evaluate the results ------------------

# Get Unit ID information for this Promise unit. Includes workaround for promiseutil tty issue.
screen -D -m sh -c "$promiseutil_path -C subsys -v >$unit_ID_tmp"

# Drop the output into a variable.
unit_ID=$(<"$unit_ID_tmp")

# Get the SMART report, put it into a tmp file.
screen -D -m sh -c "$promiseutil_path -C smart -v >$smart_results_tmp"

# Grab the header for each PdId in the Promise
smart_status=$(grep -A4 "^PdId" "$smart_results_tmp")

# Check the header to see if SMART Health Check reports a problem
if grep "^SMART Health Status:" <<< "$smart_status" | grep -qv "OK"
then
        smart_error_flag="true"
fi

# Check for ATA errors, which may indicate that the drive is failing even if SMART Health is OK
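# The awk script remembers each PdId header plus the four lines after it,
# then prints that header whenever an "ATA Error Count" line follows.
ata_errors=$(awk '/^PdId: [1-9][0-9]*/ \
                                { a=$0; n=4; next } \
                                n { --n; a=a "\n" $0; next } \
                                /^ATA Error Count/ \
                                { ata_err=$0; print a "\n" ata_err "\n" }' \
                                "$smart_results_tmp")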
# Flag if there were ATA errors
if [ "$ata_errors" != "" ]; then
        ata_error_flag="true"
fi

# ----------------- Build the message_body ------------------

# If there's a problem, build the header.
if [ "$smart_error_flag" ==  "true" ] || [ "$ata_error_flag" == "true" ]; then
        message_body="$alert_header\n\n$fail_msg\n\n$unit_ID\n\n"

        # SMART Health status.
        if [ "$smart_error_flag" == "true" ]; then
                message_body="$message_body\nSMART Health Status is reporting one or more bad drives."
        fi

        # Always include the smart_status
        message_body="$message_body\n\n$smart_status"

        # Then the ATA errors.
        if [ "$ata_error_flag" == "true" ]; then
                message_body="$message_body\n\nOne or more drives has an ATA Error Count and may be failing.\n\n$ata_errors"
        fi
fi

#  ----------------- Logging & email ------------------

# Log the results, conditionally send email on failure.
if [ "$ata_error_flag" == "true" ] || [ "$smart_error_flag" == "true" ]; then
        message_body="$message_body\n\n$alert_footer"
        # -e makes bash's echo expand the \n sequences in the message.
        echo -e "$DATESTAMP: \n\n$message_body" >> /var/log/system.log
        if [ "$send_email_alert" == "true" ] ; then
                "$sendemail_path" -f "$alert_sender" -t "$alert_recipient" -u "$alert_subject" -m "$message_body" -s "$alert_smtp_server"
        fi
else
        echo -e "$DATESTAMP: $pass_msg\n\n$unit_ID" >> /var/log/system.log
fi

# ----------------- Cleanup ------------------

rm -f "$unit_ID_tmp" "$smart_results_tmp"


This version of the script checks for the presence of promiseutil and sendemail before it does anything else. We call screen here because promiseutil seems to need a TTY in order to run properly.
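
If you'd like to try the two checks without a Pegasus attached, here's a minimal sketch that wraps them in functions and runs them against a sample report. The sample is made up: it only mimics the line prefixes the script looks for (PdId:, SMART Health Status:, ATA Error Count) and is not real promiseutil output. The function names are mine, not part of the script above.

```shell
#!/bin/bash
# Sketch: the two checks from promise_smart_check.sh as functions that read a
# report file. The sample report below is invented for illustration; real
# 'promiseutil -C smart -v' output has more fields.

# Succeeds (exit 0) if any "SMART Health Status:" line is not OK.
smart_health_failed() {
        grep "^SMART Health Status:" "$1" | grep -qv "OK"
}

# Prints the PdId of every drive block that contains an "ATA Error Count" line.
drives_with_ata_errors() {
        awk '/^PdId:/ { id = $2 }
             /^ATA Error Count/ { print id }' "$1"
}

sample_report=$(mktemp "/tmp/smart_sample.XXXXXX") || exit 1
cat > "$sample_report" <<'EOF'
PdId: 1
Model Number: EXAMPLE DISK
SMART Health Status: OK

PdId: 2
Model Number: EXAMPLE DISK
SMART Health Status: OK
ATA Error Count: 3
EOF

if smart_health_failed "$sample_report"; then
        echo "SMART health: FAILED"
else
        echo "SMART health: OK"
fi
echo "Drives with ATA errors: $(drives_with_ata_errors "$sample_report")"

rm -f "$sample_report"
```

With this sample, SMART health comes back OK while drive 2 is still flagged for ATA errors, which is exactly the situation the script is meant to catch.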

Hope you find it useful.

Promise Pegasus2 command line tools.

When I began deploying Promise Pegasus2 storage devices, I wasn’t happy with the state of the Promise Utility app. It doesn’t provide email alerts except when a user is logged in, and this isn’t optimal for most of our deployments.

Then I stumbled on a couple of Ruby scripts by GriffithStudio that showed a way around many of the limitations of the Promise GUI.

When you install the Promise Utility for Promise Pegasus2, it includes a command line utility. You can view status and even change the settings of the device. In a Terminal window, type:

promiseutil

You’ll be greeted with an interactive command line.

-------------------------------------------------------------
Promise Utility
Version: 3.18.0000.18 Build Date: Oct 29, 2013
-------------------------------------------------------------
 
List available RAID HBAs and Subsystems
=============================================================
Type  #    Model         Alias                            WWN                 
=============================================================
hba   1  * Pegasus2 R4                                    2000-0001-5557-98bf 
 
Totally 1 HBA(s) and 0 Subsystem(s)
 
-------------------------------------------------------------
The row with '*' sign refers the current working HBA/Subsystem path
To change the current HBA/Subsystem path, you may use the following command:
  
  spath -a chgpath -t hba|subsys -p <path #>.
 
Type help or ? to display all the available commands
-------------------------------------------------------------
 
cliib> 

To get a list of commands, type ? and press return. Some of the available commands include:

subsys - Model, serial number, hardware revision.
enclosure - Enclosure status.
ctrl - Firmware version, array & RAID status.
phydrv - Physical drive status.
array - Array status.
logdrv - Logical drive status.
event - Event log, including abnormal shutdowns.

Many of the commands yield a brief Pass/Fail style response:

cliib> enclosure 
=============================================================
Id  EnclosureType               OpStatus  StatusDescription                    
=============================================================
1   Pegasus2-R4                 OK        Everything is OK

If you want more details, you can add the verbose flag, -v. Want the serial number, model and hardware revision?

cliib> subsys -v
 
-------------------------------------------------------------
Alias: 
Vendor: Promise Technology,Inc.        Model: Pegasus2 R4
PartNo: F29DS4722000000                SerialNo: M00H00CXXXXXXXX
Rev: B3                                WWN: 2000-0001-5557-98bf
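
If you only need one of those values in a script (say, the serial number for an alert email), a small sed filter can pull it out of the verbose output. This is a rough sketch based only on the layout shown above; promise_field is a hypothetical helper name, and it only works for single-word values, so a key like Model would come back truncated at the first space.

```shell
#!/bin/bash
# Sketch: extract the value that follows "<Key>: " from subsys -v style output.
# Reads the report on stdin. Only the first word of the value is returned.
# promise_field is a hypothetical helper, not part of promiseutil.

promise_field() {
        sed -n "s/.*$1: *\([^ ]*\).*/\1/p"
}
```

From a terminal, something like promiseutil -C subsys -v | promise_field SerialNo should print just the serial number.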

You can grab enclosure information, including the temperature of the box, backplane and controller card, as well as the rotation speed of the fans and the voltage on the power rails, with this:

cliib> enclosure -v
 
 
-------------------------------------------------------------
Enclosure Setting:
 
EnclosureId: 1
CtrlWarningTempThreshold: 63C/145F     CtrlCriticalTempThreshold: 68C/154F
 
 
-------------------------------------------------------------
Enclosure Info and Status:
 
EnclosureId: 1
EnclosureType: Pegasus2-R4
SEPFwVersion: 1.00
MaxNumOfControllers: 1                 MaxNumOfPhyDrvSlots: 4
MaxNumOfFans: 1                        MaxNumOfBlowers: 0
MaxNumOfTempSensors: 2                 MaxNumOfPSUs: 1
MaxNumOfBatteries: 0                   MaxNumOfVoltageSensors: 3
 
=============================================================
PSU       Status                        
=============================================================
1         Powered On and Functional     
 
=============================================================
Fan Location        FanStatus             HealthyThreshold  CurrentFanSpeed 
=============================================================
1   Backplane       Functional            > 1000 RPM        1200 RPM        
 
=============================================================
TemperatureSensor   Location       HealthThreshold   CurrentTemp    Status    
=============================================================
1                   Controller     < 63C/145F        49C/120F       normal    
2                   Backplane      < 53C/127F        47C/116F       normal    
 
=============================================================
VoltageSensor  Type    HealthyThreshold         CurrentVoltage  Status         
=============================================================
1              3.3V    +/- 5% (3.13 - 3.46) V   3.2V            Operational    
2              5.0V    +/- 5% (4.75 - 5.25) V   5.0V            Operational    
3              12.0V   +/- 8% (11.04 - 12.96) V 12.0V           Operational 
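
The temperature table is easy to watch from a script. Here's a sketch that prints any sensor whose Status column isn't "normal". It assumes the table layout shown above; I haven't seen what the unit prints when a sensor goes over threshold, so treat the status test as a guess, and hot_sensors as a made-up name.

```shell
#!/bin/bash
# Sketch: print temperature sensors whose Status is not "normal" in
# 'enclosure -v' output read from stdin. hot_sensors is a hypothetical name.

hot_sensors() {
        awk '
                /^TemperatureSensor/ { in_temp = 1; next }   # temperature table header
                /^VoltageSensor/     { in_temp = 0 }         # next table begins
                in_temp && /^[0-9]/ && $NF != "normal" {
                        # $2 = location, second-to-last field = current temp
                        print $2 " sensor: " $(NF-1) " (" $NF ")"
                }
        '
}
```

Piping promiseutil -C enclosure -v through hot_sensors should print nothing while both sensors report normal.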

How about the state of the physical drives?

cliib> phydrv   
=============================================================
PdId Model        Type      Capacity  Location      OpStatus  ConfigStatus     
=============================================================
1    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot1   OK        Array0 No.0      
2    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot2   OK        Array0 No.1      
3    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot3   OK        Array0 No.2      
4    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot4   OK        Array0 No.3
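
That table is a bit awkward to parse because fields like Model and Location contain spaces. One way around it, sketched here, is to find where "OpStatus" starts in the header row and read the same column position from each drive row. This assumes the columns stay aligned as they are above; failed_drives is a hypothetical name.

```shell
#!/bin/bash
# Sketch: list drives whose OpStatus is not "OK" in brief phydrv output read
# from stdin. The column is located by the position of "OpStatus" in the
# header row. failed_drives is a hypothetical name.

failed_drives() {
        awk '
                /OpStatus/ { col = index($0, "OpStatus"); next }   # header row
                col && /^[0-9]/ {                                  # drive rows
                        split(substr($0, col), f, " ")
                        if (f[1] != "OK") print "PdId " $1 ": " f[1]
                }
        '
}
```

From a terminal, promiseutil -C phydrv | failed_drives should stay silent as long as every drive reports OK.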

You can also run commands without entering interactive mode, which is useful when incorporating them into bash scripts. Simply add the -C flag, followed by the command you want to run. For example:

krieger:~ admin$ promiseutil -C logdrv -v

will let you view the logical drives.
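
In a script, it can be handy to wrap that call in a function. The sketch below borrows the screen workaround from the SMART check script above, since promiseutil seems to want a TTY. The promise_cmd name and the use_screen switch are my own inventions for illustration; set use_screen to false if you're running from a terminal anyway.

```shell
#!/bin/bash
# Sketch: run one promiseutil command non-interactively and print its output.
# promiseutil_path and the screen trick come from promise_smart_check.sh;
# promise_cmd and use_screen are hypothetical.

promiseutil_path="${promiseutil_path:-/usr/bin/promiseutil}"
use_screen="${use_screen:-true}"

# Usage: promise_cmd <command> [flags...]   e.g. promise_cmd logdrv -v
promise_cmd() {
        local out
        out=$(mktemp "/tmp/promise_cmd.XXXXXX") || return 1
        if [ "$use_screen" == "true" ]; then
                # screen supplies the TTY that promiseutil appears to need.
                screen -D -m sh -c "'$promiseutil_path' -C $* > '$out'"
        else
                "$promiseutil_path" -C "$@" > "$out"
        fi
        cat "$out"
        rm -f "$out"
}
```

With that in place, logical_drives=$(promise_cmd logdrv -v) captures the report in a variable for further checks.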

Many of these commands will run on the previous generation Promise Pegasus units.

Be aware: it’s possible to change the configuration of your Pegasus2 or even destroy your RAID setup from the command line, so use caution when working on production systems.