Promise Pegasus2: Scripting a disk check with promise_disk_check.sh

When you deploy a Promise Pegasus2, you want to run regular disk health checks and send an email notification if there’s a problem. The Promise Utility app can theoretically do this* when there’s someone logged in at the console, but we’re rarely running these in environments where there’s anyone logged at the console.

The solution is to script a check of the disks using the promiseutil command line utility and then create a cronjob to run it at regular intervals.

Here’s an example disk check that parses the output of phydrv, logs each run to system.log and can optionally send email if a problem is found.

#!/bin/bash
#
# promise_disk_check.sh
#
# Checks the phydrv status of a Promise Pegasus, logs and mails the output if there's an issue.
#
# Author: A @ Modest Industries
# Last update: 2014-07-19
# 2014-07-19 - tweaked grep to allow for Media Patrol
#
# Works with Promise Utility for Pegasus2 v3.18.0000.18 (http://www.promise.com)
# Requires sendemail for email alerts (http://caspian.dotconf.net/menu/Software/SendEmail/)

export DATESTAMP=`date +%Y-%m-%d\ %H:%M:%S`

# Editable variables

# Path to sendemail
sendemail_path="/Library/Scripts/Monitoring/sendemail"
# Email alert?
send_email_alert=true

# Variables for sendemail
# Sender's address
alert_sender="[email protected]"

# Recipient's addresses, comma separated.
#alert_recipient='[email protected], [email protected]'
alert_recipient="[email protected]"

# SMTP server to send the messages through
# alert_smtp_server="smtp.example.com:port"
alert_smtp_server="smtp.example.com"

# Subject line of the alert.
alert_subject="Alert: Promise disk problem detected on $HOSTNAME."

# Header line at the top of the alert message 
alert_header="At $DATESTAMP, a problem was detected on this device:\n"

# Pass / Fail messages
pass_msg="Promise disk check successful."
fail_msg=" *** Promise disk check FAILED!!! ***"

# ------------ Do not edit below this line ------------------
# Variables
pass=true
results=""

# Create temp files
unit_ID_tmp=`mktemp "/tmp/$$_ID.XXXX"`
results_tmp=`mktemp "/tmp/$$_results.XXXX"`

# Get header information for this Promise unit. Includes workaround for promiseutil tty issue.
screen -D -m sh -c "promiseutil -C subsys -v >$tmpdir$unit_ID_tmp"
unit_ID=$(<$tmpdir$unit_ID_tmp)

# Get status of the disks.  Includes workaround for promiseutil tty issue.
screen -D -m sh -c "promiseutil -C phydrv >$tmpdir$results_tmp"

# Check each line of the output the test results.
while read -r line
do
        if grep '^[0-9]' <<< "$line" | grep -Eqv 'OK|Media'
        then
                results=$results"BAD DRIVE DETECTED: $line\n\n"
                pass=false
        fi
done < $tmpdir$results_tmp

# Log the results, conditionally send email on failure.
if [ "$pass" = false ] ; then
        results="$alert_header$unit_ID\n\n$results\n$alert_footer"
        echo "$DATESTAMP: $fail_msg\n\n$results" >> /var/log/system.log
        if [ "$send_email_alert" = true ] ; then
                "$sendemail_path" -f $alert_sen:der -t $alert_recipient -u $alert_subject -m "$results" -s $alert_smtp_server
        fi
else
        echo "$DATESTAMP: $pass_msg" >> /var/log/system.log
fi

# Cleanup
rm -f $tmpdir$unit_ID_tmp $tmpdir$results_tmp

Note that the script uses sendemail for sending mail, a very useful little drop in for when the local machine isn’t running mail services.

*I say “theoretically” because configuring email in the Promise Utility is a mess and I’ve yet to see a single successful notification after configuring it.

2014-05-04 – Updated to make the path to sendemail a variable.

2014-07-19 – Changed grep to handle false positive during Media Patrol runs

Mac OS X automated backups of Server 2.x

My colleague and I had been trying to come up with a decent backup script for Server 2.x services on Mac OS X 10.8 Mountain Lion. We had the skeleton of a Bash script from another colleague but wanted more functionality, so we began bouncing the script back and forth between us. Before we knew it, we had a working system that we now both deploy on 10.8 Server systems, osx-backup.sh.

I like to keep things simple when it comes to backup and this script is no exception. Edit the variables at the top to suit your fancy and then create a nightly cronjob to run it as root and you’re done.

Among the services that get backed up are:

  • Open Directory archive
  • Profile Manager database
  • DNS records
  • Service plists
  • Wiki
  • CalDAV & CardDAV databases

The script automatically shuts down each service that needs to be disabled before backup, then re-enables the service as soon as possible.

On the weekend, over a big pile of bacon on the weekend, we decided to turn the script loose to the community, so my colleague has posted the script on his GitHub repository. This version works with Server 2.2.2 on 10.8.5.

Caveats: Like any backup system, read the script completely and test it before putting it into production.

Promise Pegasus2: The gap between a failing disk and a failed disk.

We were recently called in to diagnose a relatively new Promise Pegasus2 R6 that intermittently refused to mount. The Promise Utility app reported nothing amiss with the RAID or the drives, green lights everywhere, so we used the command line to dig a little deeper.

So let’s run a verbose SMART check on the unit:

promiseutil -C smart -v

The first three drives checked out. Drive 4 indicated that SMART thought everything was fine:

PdId: 4
Model Number: TOSHIBA DT01ACA2
Drive Type: SATA
SMART Status: Enable
SMART Health Status: OK

But then a little further down,  CRC errors:

Error 165 occurred at disk power-on lifetime: 1176 hours (49 days + 0 hours)
  When the command that caused the error occurred,
  the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 50 b0 ee 81 0d

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 80 a8 80 ee 81 40 00      18:38:48.276  WRITE FPDMA QUEUED
  61 80 a0 00 ee 81 40 00      18:38:48.276  WRITE FPDMA QUEUED
  61 80 98 80 ed 81 40 00      18:38:48.276  WRITE FPDMA QUEUED
  61 80 90 00 ed 81 40 00      18:38:48.276  WRITE FPDMA QUEUED
  61 80 88 80 ec 81 40 00      18:38:48.275  WRITE FPDMA QUEUED

Error 164 occurred at disk power-on lifetime: 1175 hours (48 days + 23 hours)
  When the command that caused the error occurred,
  the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 10 f0 ad 6b 0d  Error: ICRC, ABRT 16 sectors at LBA = 0x0d6badf0 = 225160688

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 80 80 ad 6b 40 00      18:36:07.145  WRITE DMA EXT
  35 00 80 00 ae 6b 40 00      18:36:07.144  WRITE DMA EXT
  35 00 80 00 ad 6b 40 00      18:36:07.144  WRITE DMA EXT
  35 00 80 80 ab 6b 40 00      18:36:07.139  WRITE DMA EXT
  35 00 80 00 ab 6b 40 00      18:36:07.139  WRITE DMA EXT

Error 163 occurred at disk power-on lifetime: 1175 hours (48 days + 23 hours)
  When the command that caused the error occurred,
  the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 f0 10 5e 5d 0d  Error: ICRC, ABRT 240 sectors at LBA = 0x0d5d5e10 = 224222736
...

The client confirmed that he’d seen a warning light on drive 4, but that it had “gone away”. We had them back the data up immediately. Promise support subsequently verified that the drive had failed based on the logs and sent a replacement drive out.

If the drive had failed completely, I assume the RAID would have kicked in, taken the bad drive offline and continued spinning, but since the drive hadn’t actually failed, the volume was struggling with a failing member and that was causing boot and performance issues.

The take-away is that there’s a generous gap between a drive that’s beginning to fail and a drive that’s failed enough for the Promise Utility app to detect it. Verbose mode is your friend.

Turn off Promise Pegasus2 power savings one-liner.

From Terminal:

promiseutil -C ctrl -a mod -s “powersavinglevel=0″

Now check your handiwork:

promiseutil -C ctrl -v | grep PowerSavingLevel

You should see:

PowerSavingLevel: 0

You may be tempted to turn off PowerManagement. Here’s what Promise’s website has to say about that:

Do not disable Power Management. Doing so will make the Pegasus2 hang when the Mac is in sleep or shut down and restarted and will require the Pegasus and the Mac to be power cycled to return to normal operation.

Promise Pegasus2 command line tools.

When I began deploying Promise Pegasus2 storage devices, I wasn’t happy with the state of the Promise Utility app. It doesn’t provide email alerts except when a user is logged in and this is isn’t optimal for most of our deployments.

Then I stumbled on a couple of Ruby scripts by GriffithStudio that showed a way around many of the limitations of the Promise GUI.

When you install the Promise Utility for Promise Pegasus2, it includes a command line utility. You can view status and even change the settings of the device. In a Terminal window, type:

promiseutil

You’ll be greeted with an interactive command line.

-------------------------------------------------------------
Promise Utility
Version: 3.18.0000.18 Build Date: Oct 29, 2013
-------------------------------------------------------------
 
List available RAID HBAs and Subsystems
=============================================================
Type  #    Model         Alias                            WWN                 
=============================================================
hba   1  * Pegasus2 R4                                    2000-0001-5557-98bf 
 
Totally 1 HBA(s) and 0 Subsystem(s)
 
-------------------------------------------------------------
The row with '*' sign refers the current working HBA/Subsystem path
To change the current HBA/Subsystem path, you may use the following command:
  
  spath -a chgpath -t hba|subsys -p <path #>.
 
Type help or ? to display all the available commands
-------------------------------------------------------------
 
cliib> 

To get a list of commands, type ? and press return. Some of the available commands include:

subsys - Model, serial number, hardware revision.
enclosure - Enclosure status.
ctrl - Firmware version, array & RAID status.
phydrv - Physical drive status.
array - Array status.
logdrv - Logical drive status.
event - Event log, including abnormal shutdowns.

Many of the commands yield a brief Pass/Fail style response:

cliib> enclosure 
=============================================================
Id  EnclosureType               OpStatus  StatusDescription                    
=============================================================
1   Pegasus2-R4                 OK        Everything is OK

If you want more details, you can add the verbose flag, -v. Want the serial number, model and hardware revision?

cliib> subsys -v
 
-------------------------------------------------------------
Alias: 
Vendor: Promise Technology,Inc.        Model: Pegasus2 R4
PartNo: F29DS4722000000                SerialNo: M00H00CXXXXXXXX
Rev: B3                                WWN: 2000-0001-5557-98bf

You can grab enclosure information, including temperature of box, backplane and controller card, as well as the rotation speed of the fans and voltage on the power rails with this:

cliib> enclosure -v
 
 
-------------------------------------------------------------
Enclosure Setting:
 
EnclosureId: 1
CtrlWarningTempThreshold: 63C/145F     CtrlCriticalTempThreshold: 68C/154F
 
 
-------------------------------------------------------------
Enclosure Info and Status:
 
EnclosureId: 1
EnclosureType: Pegasus2-R4
SEPFwVersion: 1.00
MaxNumOfControllers: 1                 MaxNumOfPhyDrvSlots: 4
MaxNumOfFans: 1                        MaxNumOfBlowers: 0
MaxNumOfTempSensors: 2                 MaxNumOfPSUs: 1
MaxNumOfBatteries: 0                   MaxNumOfVoltageSensors: 3
 
=============================================================
PSU       Status                        
=============================================================
1         Powered On and Functional     
 
=============================================================
Fan Location        FanStatus             HealthyThreshold  CurrentFanSpeed 
=============================================================
1   Backplane       Functional            > 1000 RPM        1200 RPM        
 
=============================================================
TemperatureSensor   Location       HealthThreshold   CurrentTemp    Status    
=============================================================
1                   Controller     < 63C/145F        49C/120F       normal    
2                   Backplane      < 53C/127F        47C/116F       normal    
 
=============================================================
VoltageSensor  Type    HealthyThreshold         CurrentVoltage  Status         
=============================================================
1              3.3V    +/- 5% (3.13 - 3.46) V   3.2V            Operational    
2              5.0V    +/- 5% (4.75 - 5.25) V   5.0V            Operational    
3              12.0V   +/- 8% (11.04 - 12.96) V 12.0V           Operational 

How about the state of the physical drives?

cliib> phydrv   
=============================================================
PdId Model        Type      Capacity  Location      OpStatus  ConfigStatus     
=============================================================
1    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot1   OK        Array0 No.0      
2    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot2   OK        Array0 No.1      
3    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot3   OK        Array0 No.2      
4    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot4   OK        Array0 No.3

You can also run commands without entering interactive mode and this is useful when incorporating into bash scripts. Simply add the -C flag, followed by the command you want to run. For example:

krieger:~ admin$ promiseutil -C logdrv -v

will let you view the logical drives.

Many of these commands will run on the previous generation Promise Pegasus units.

Be aware: it’s possible to change the configuration of your Pegasus2 or even destroy your RAID setup from the command line, so use caution when working on production systems.