Category Archives: Hackery

Using promiseutil to Find a Failing RAID Member.

Ideally, if a Promise Pegasus has a failing RAID member, or disk, we would want the Promise Utility GUI to report that. But that’s not always what happens.

A few years ago, Backblaze published the SMART stats they pay attention to, and I adopted the practice of replacing drives that show a non-zero RAW_VALUE for any of those stats. Backblaze looks at:

  • SMART 5 – Reallocated_Sector_Count.
  • SMART 187 – Reported_Uncorrectable_Errors.
  • SMART 188 – Command_Timeout.
  • SMART 197 – Current_Pending_Sector_Count.
  • SMART 198 – Offline_Uncorrectable.

I’ve never seen SMART 187 or 188 reported by a drive member on a Promise Pegasus RAID, but the other values are there.

We can obtain the SMART status of the members of a RAID like this:

  1. In Terminal, start promiseutil
  2. At the prompt, type smart -v (with the verbose flag on).

The output will show the SMART statistics for each member of the RAID.

So, what does a failing Promise Pegasus RAID member look like?

In this example, the Promise Utility reported the health of the RAID as fine, but the performance of this RAID suggested otherwise. Here’s the top of the log:

-------------------------------------------------------
 PdId: 2
 Model Number: TOSHIBA DT01ACA2
 Drive Type: SATA
 SMART Status: Enable
 SMART Health Status: OK
 SCT Status Version:                  3
 SCT Version (vendor specific):       256 (0x0100)
 SCT Support Level:                   1
 Device State:                        SMART Off-line Data Collection executing in background (4)
 Current Temperature:                    37 Celsius
 Power Cycle Min/Max Temperature:     29/40 Celsius
 Lifetime    Min/Max Temperature:     19/44 Celsius
 Under/Over Temperature Limit Count:   0/0
 Self-test execution status:      (   0)    The previous self-test routine
                     completed without error or no self-test
                     has ever been run.
 Error logging capability:        (0x01)    Error logging supported.
 Short self-test routine 
 recommended polling time:      (   1) minutes.
 Extended self-test routine
 recommended polling time:      ( 249) minutes.
 SCT capabilities:            (0x003d) SCT Status supported.
                     SCT Feature Control supported.
                     SCT Data Table supported.
 SMART Self-test log structure revision number: 1
 No self-tests have been logged.  [To run self-tests, use: smartctl -t]
 SMART Error Log Version: 1

Hmm…no signs of trouble. We see SMART Health Status: OK, so if we were just grepping or awking for that, we’d assume that all was well. But a few lines down, we find ATA Error Count: 4. This value doesn’t even appear on a healthy member, even with the -v verbose flag. And that’s followed by four errors.

SMART Error Log Version: 1
 ATA Error Count: 4
     CR = Command Register [HEX]
     FR = Features Register [HEX]
     SC = Sector Count Register [HEX]
     SN = Sector Number Register [HEX]
     CL = Cylinder Low Register [HEX]
     CH = Cylinder High Register [HEX]
     DH = Device/Head Register [HEX]
     DC = Device Command Register [HEX]
     ER = Error register [HEX]
     ST = Status register [HEX]
 Powered_Up_Time is measured from power on, and printed as
 DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
 SS=sec, and sss=millisec. It "wraps" after 49.710 days.
 Error 4 occurred at disk power-on lifetime: 34794 hours (1449 days + 18 hours)
   When the command that caused the error occurred,
   the device was active or idle.
 After command completion occurred, registers were:
   ER ST SC SN CL CH DH
 
 40 51 80 80 e7 1c 0d  Error: UNC 128 sectors at LBA = 0x0d1ce780 = 219998080
 Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   25 00 80 80 e7 1c 40 00   2d+17:20:39.072  READ DMA EXT
   35 00 00 00 44 13 40 00   2d+17:20:39.070  WRITE DMA EXT
   35 00 00 00 35 13 40 00   2d+17:20:39.061  WRITE DMA EXT
   35 00 80 80 25 13 40 00   2d+17:20:39.052  WRITE DMA EXT
   25 00 58 a8 1f 13 40 00   2d+17:20:39.046  READ DMA EXT
 Error 3 occurred at disk power-on lifetime: 34794 hours (1449 days + 18 hours)
   When the command that caused the error occurred,
   the device was active or idle.
 After command completion occurred, registers were:
   ER ST SC SN CL CH DH
 
 40 51 78 80 b2 1e 0d  Error: UNC 120 sectors at LBA = 0x0d1eb280 = 220115584
 Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   25 00 78 80 b2 1e 40 00   2d+17:20:35.539  READ DMA EXT
   35 00 00 00 ad 1e 40 00   2d+17:20:35.538  WRITE DMA EXT
   25 00 18 e8 e1 1c 40 00   2d+17:20:34.067  READ DMA EXT
   61 80 00 80 ac 1e 40 00   2d+17:20:31.037  WRITE FPDMA QUEUED
   2f 00 01 10 00 00 00 00   2d+17:20:31.036  READ LOG EXT
 Error 2 occurred at disk power-on lifetime: 34794 hours (1449 days + 18 hours)
   When the command that caused the error occurred,
   the device was active or idle.
 After command completion occurred, registers were:
   ER ST SC SN CL CH DH
 
 40 51 a8 58 e2 1c 0d
 Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   61 80 28 80 ad 1e 40 00   2d+17:20:09.122  WRITE FPDMA QUEUED
   61 80 20 00 ae 1e 40 00   2d+17:20:09.121  WRITE FPDMA QUEUED
   61 80 18 80 ae 1e 40 00   2d+17:20:09.121  WRITE FPDMA QUEUED
   61 80 10 00 af 1e 40 00   2d+17:20:09.121  WRITE FPDMA QUEUED
   61 80 08 80 af 1e 40 00   2d+17:20:09.121  WRITE FPDMA QUEUED
 Error 1 occurred at disk power-on lifetime: 34706 hours (1446 days + 2 hours)
   When the command that caused the error occurred,
   the device was active or idle.
 After command completion occurred, registers were:
   ER ST SC SN CL CH DH
 
 40 51 20 78 c5 1f 0d
 Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   61 80 e0 80 d8 1a 40 00      01:51:41.474  WRITE FPDMA QUEUED
   61 80 c8 00 d8 1a 40 00      01:51:41.474  WRITE FPDMA QUEUED
   61 80 c0 00 d4 1a 40 00      01:51:41.473  WRITE FPDMA QUEUED
   61 80 b8 80 d7 1a 40 00      01:51:41.473  WRITE FPDMA QUEUED
   61 80 b0 00 d7 1a 40 00      01:51:41.473  WRITE FPDMA QUEUED

All of the errors occurred on a power up of the RAID. So what do Backblaze’s preferred SMART stats look like on this drive?

SMART Attributes Data Structure revision number: 16
 Vendor Specific SMART Attributes with Thresholds:
 ID# ATTRIBUTE_NAME
     FLAG    VALUE WORST THRESH TYPE      UPDATED    WHEN_FAILED  RAW_VALUE
 1 Raw_Read_Error_Rate     
     0x000b  074   074   016    Pre-fail  Always     -            54134415
   2 Throughput_Performance  
     0x0005  139   139   054    Pre-fail  Offline    -            70
   3 Spin_Up_Time            
     0x0007  129   129   024    Pre-fail  Always     -            295 (Average 294)
   4 Start_Stop_Count        
     0x0012  097   097   000    Old_age   Always     -            15544
   5 Reallocated_Sector_Ct   
     0x0033  081   081   005    Pre-fail  Always     -            517
   7 Seek_Error_Rate         
     0x000b  100   100   067    Pre-fail  Always     -            0
   8 Seek_Time_Performance   
     0x0005  124   124   020    Pre-fail  Offline    -            33
   9 Power_On_Hours          
     0x0012  096   096   000    Old_age   Always     -            34799
  10 Spin_Retry_Count        
     0x0013  100   100   060    Pre-fail  Always     -            0
  12 Power_Cycle_Count       
     0x0032  100   100   000    Old_age   Always     -            41
 192 Power-Off_Retract_Count 
     0x0032  076   076   000    Old_age   Always     -            29286
 193 Load_Cycle_Count        
     0x0012  076   076   000    Old_age   Always     -            29286
 194 Temperature_Celsius     
     0x0002  162   162   000    Old_age   Always     -            37 (Lifetime Min/Max 19/44)
 196 Reallocated_Event_Count 
     0x0032  063   063   000    Old_age   Always     -            846
 197 Current_Pending_Sector  
     0x0022  100   100   000    Old_age   Always     -            0
 198 Offline_Uncorrectable   
     0x0008  100   100   000    Old_age   Offline    -            0
 199 UDMA_CRC_Error_Count    
     0x000a  200   200   000    Old_age   Always     -            0

As expected, no SMART 187 or 188 values. And 197 Current_Pending_Sector and 198 Offline_Uncorrectable are both 0.

But look at the RAW_VALUE for 5 Reallocated_Sector_Ct. Not good. And while it’s not on Backblaze’s list, 1 Raw_Read_Error_Rate is really high.

The other RAID elements have no errors. So we replaced the drive, and rebuilt the RAID, and performance returned to normal.
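Both manual checks above (spotting an ATA Error Count line, then reading the RAW_VALUE of the Backblaze attributes) can be sketched as filters over a saved report. This is a minimal sketch, not a tested monitoring tool: it assumes the report keeps the layout shown in the samples above, including the two-line attribute rows, and the function names ata_error_summary and flag_backblaze_attrs are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for triaging a saved "smart -v" report.
# Both read the report on stdin and assume the layout shown in the samples.

# Print one line per physical drive (PdId) that reports an ATA Error Count.
# Leading whitespace is tolerated, since the sample lines are indented.
ata_error_summary() {
    awk '
        /^[ \t]*PdId:/            { pd = $2 }
        /^[ \t]*ATA Error Count:/ { print "PdId " pd ": " $NF " ATA errors" }
    '
}

# Print any Backblaze-watched attribute (5, 187, 188, 197, 198) whose
# RAW_VALUE is non-zero. Attribute rows span two lines in this report:
# "<ID> <NAME>" first, then the flag line ending in RAW_VALUE.
flag_backblaze_attrs() {
    awk '
        NF == 2 && $1 ~ /^(5|187|188|197|198)$/ { id = $1; name = $2; next }
        id != "" && $1 ~ /^0x[0-9a-fA-F]+$/ {
            if ($8 + 0 != 0) print id, name, $8   # field 8 is RAW_VALUE
            id = ""
            next
        }
        { id = "" }
    '
}
```

With a report saved to a file (smart_report.txt is a placeholder name), `ata_error_summary < smart_report.txt` and `flag_backblaze_attrs < smart_report.txt` should each print nothing for a healthy member. For the drive in this post, the second filter would report attribute 5 with its raw value of 517.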

In Which the Promise Utility GUI is Not Showing Stats

You open the Promise Utility to get some stats from the Subsystem (Promise Utility > Subsystem Information), but the Promise Utility displays nothing.

You try the command line. Nada. Zip.

Fear not. Try this. Your mileage may vary, and it’s entirely at your own risk: if you’re not utterly certain what you’re doing, do not do any of this.

1. Unmount your Promise volume(s) first.

diskutil unmount /Volumes/NameOfVolume1

2. Unload and delete the kernel extension:

sudo kextunload -b com.promise.driver.stex
sudo rm -rf /Library/Extensions/PromiseSTEX.kext

3. Delete the Promise Utility:

sudo rm -rf /Applications/Promise\ Utility.app

4. Delete the Promise Utility plist:

sudo rm -rf /Users/<username>/Library/Preferences/com.promise.PromiseUtility.plist

5. Delete the LaunchDaemons plist files:

sudo rm -rf /Library/LaunchDaemons/com.promise.emaild.plist
sudo rm -rf /Library/LaunchDaemons/com.promise.bgasched.plist
sudo rm -rf /Library/LaunchDaemons/com.promise.BGPMain_R.plist
sudo rm -rf /Library/LaunchDaemons/com.promise.diskmonitor.plist

6. Delete promiseutil (The Promise Utility installer pkg will re-install this):

sudo rm -rf /usr/local/bin/promiseutil

7. Restart the Mac.

8. Install the Pegasus 6.2.9 driver and the Promise Utility.

9. See if you now have stats for the Promise Pegasus via GUI and CLI.
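Before and after running the removal steps, it can help to see which of those files are actually present. A minimal, non-destructive sketch: the helper name existing_paths is hypothetical, and the paths are the ones listed in the steps above.

```shell
#!/bin/sh
# Print each argument that exists on disk, one per line. Nothing is deleted.
existing_paths() {
    for p in "$@"; do
        if [ -e "$p" ]; then
            printf '%s\n' "$p"
        fi
    done
}

# Paths copied from the steps above (the per-user plist is left out because
# its location depends on the username).
existing_paths \
    "/Library/Extensions/PromiseSTEX.kext" \
    "/Applications/Promise Utility.app" \
    "/Library/LaunchDaemons/com.promise.emaild.plist" \
    "/Library/LaunchDaemons/com.promise.bgasched.plist" \
    "/Library/LaunchDaemons/com.promise.BGPMain_R.plist" \
    "/Library/LaunchDaemons/com.promise.diskmonitor.plist" \
    "/usr/local/bin/promiseutil"
```

Run after step 6, it should print nothing if the cleanup removed everything. Run after step 8, it shows what the installer put back.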

Scripting Promise Utility Media Patrol

Promise describes Media Patrol as a routine maintenance procedure that searches the physical drives in a Promise Pegasus unit for media errors. If you’ve got a spare drive in the array, Media Patrol can invoke Predictive Data Migration if it encounters a critical error. This seems like a pretty good idea and something we should schedule.

While the Promise Utility app provides lots of functionality, it requires a running console (you need to be logged in) for some of its functions, like the Scheduler, to run. If you log out, you’ll find the Scheduler and any attendant Background Activities will stop running. This is…sub-optimal. So we’ll work around it using promiseutil.


#!/bin/bash
#
# promise_media_patrol.sh
#
# Runs Promise Media Patrol on a Pegasus2, logs  the run
#
# Author: AB @ Modest Industries
#
# Requires Promise Utility for Pegasus2 (http://www.promise.com), tested with v3.18.0000.18
#
# Edit History
# 2014-07-19 - AB: Version 1.0.

export DATESTAMP=`date +%Y-%m-%d\ %H:%M:%S`

# Start / finish messages
start_msg="Promise Media Patrol running..."
finish_msg="Promise Media Patrol complete!"

# Promise Pegasus command line utility default path
promiseutil_path="/usr/bin/promiseutil"

# ----------------- Check for promiseutil & set up temp files ------------------
if [ ! -f $promiseutil_path ]; then
        echo "$0 ERROR: $promiseutil_path does not exist"
        echo  "Please download and install the Promise Pegasus Utility app from http://promise.com"
        exit 1
fi

unit_ID_tmp=`mktemp -q "/tmp/$$_unit_ID.XXXX"`
if [ $? -ne 0 ]; then
        echo "$0: ERROR: Can't create temp file, exiting..."
        exit 1
fi

# ----------------- Run promiseutil, evaluate the results ------------------

# Get Unit ID information for this Promise unit. Includes workaround for promiseutil tty issue.
screen -D -m sh -c "$promiseutil_path -C subsys -v >$unit_ID_tmp"

# Drop the output into a variable.
unit_ID=$(<"$unit_ID_tmp")



# ----------------- Build the message_body ------------------

# If there's a problem, build the header.
if [ "$smart_error_flag" ==  "true" ] || [ "$ata_error_flag" == "true" ]; then
        message_body="$alert_header\n\n$fail_msg\n\n$unit_ID\n\n"

        # SMART Health status.
        if [ "$smart_error_flag" == "true" ]; then
                message_body="$message_body\nSMART Health Status is reporting one or more bad drives."
        fi

        # Always include the smart_status
        message_body="$message_body\n\n$smart_status"

        # Then the ATA errors.
        if [ "$ata_error_flag" == "true" ]; then
                message_body="$message_body\n\nOne or more drives has an ATA Error Count and may be failing.\n\n$ata_errors"
        fi
fi

#  ----------------- Logging & email ------------------

# Log the results, conditionally send email on failure.
if [ "$ata_error_flag" == "true" ] || [ "$smart_error_flag" == "true" ]; then
        message_body="$message_body\n\n$alert_footer"
        echo "$DATESTAMP: \n\n$message_body" >> /var/log/system.log
        if [ "$send_email_alert" == "true" ] ; then
                "$sendemail_path" -f $alert_sender -t $alert_recipient -u $alert_subject -m "$message_body" -s $alert_smtp_server
        fi
else
        echo "$DATESTAMP: $pass_msg\n\n$unit_ID" >> /var/log/system.log
fi

# ----------------- Cleanup ------------------

rm -f "$unit_ID_tmp"

Using your scheduler of choice (cron, launchd), create a schedule for your script (we’re running it every two weeks across a weekend, when activity on the network is light) and you’re done.
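cron has no direct way to say "every two weeks", so one option is a weekly job that only does the work on even-numbered weeks. A minimal sketch, assuming the script above is saved under /Library/Scripts/Monitoring/ (a hypothetical location, as is the wrapper name biweekly_patrol.sh) and that `date +%V` (ISO week number) is available:

```shell
#!/bin/sh
# Succeeds when the given ISO week number (01-53) is even.
is_even_week() {
    week=${1#0}                 # drop a leading zero so "08" is not read as octal
    [ $(( $week % 2 )) -eq 0 ]
}

# Intended use from a wrapper that cron starts every Saturday, for example
# with the crontab entry:  0 1 * * 6 /Library/Scripts/Monitoring/biweekly_patrol.sh
#
#   if is_even_week "$(date +%V)"; then
#       /Library/Scripts/Monitoring/promise_media_patrol.sh
#   fi
```

One wrinkle with this approach: in a year with 53 ISO weeks, week 53 and the following week 1 are both odd, so the gap between runs stretches to three weeks once.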

One-liner: Check the five most important SMART parameters on a disk.

A while ago, Backblaze published a report on what they consider to be the most reliable SMART parameters for determining whether a disk is failing. These include:

  • 5 – Reallocated_Sector_Ct
  • 187 – Uncorrectable_Error_Cnt
  • 188 – Command_Timeout
  • 197 – Current_Pending_Sector_Count
  • 198 – Offline_Uncorrectable

For a complete description of these parameters, take a look at the Wikipedia article on SMART.

While our sample of failing disks is nowhere near as large as Backblaze’s, our results have, unsurprisingly, correlated pretty strongly with theirs.

Note that not all of these parameters are supported by the drive manufacturers and that we typically don’t see many of these parameters on the hard disks supplied in Apple hardware. Additionally, note that SMART is not supported on some drives.

Assuming you’ve got smartmontools installed, this one-liner will very quickly give you a snapshot of the key values we look for as strong indicators that a drive needs to be replaced:

smartctl -a disk0 | grep -E "^ *(5|187|188|197|198) "

where disk0 is the disk you’re testing. To get the disks available to test, run

diskutil list

You’ll get back output that looks like this:

/dev/disk0
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *256.1 GB   disk0
   1:                        EFI                         209.7 MB   disk0s1
   2:                  Apple_HFS Macintosh HD            255.2 GB   disk0s2
   3:                 Apple_Boot Recovery HD             650.0 MB   disk0s3
/dev/disk1
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *500.1 GB   disk1
   1:                        EFI                         209.7 MB   disk1s1
   2:                  Apple_HFS Storage                 499.8 GB   disk1s2

In the example above, there are two disks to choose from, disk0 and disk1.

Assuming the drive supports all five SMART parameters, you’ll get back something that looks like this:

  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0

Those trailing zeros are what we like to see. Positive values in the last column mean that the drive probably needs to be replaced.
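The eyeball check on that last column can also be sketched as a filter that only speaks up when a raw value is non-zero. This is a rough sketch under the assumption that the attribute rows have the ten-column layout shown above; the function name check_key_attrs is hypothetical.

```shell
#!/bin/sh
# Reads smartctl attribute rows on stdin and prints the ID, name and RAW_VALUE
# of attributes 5, 187, 188, 197 and 198 when RAW_VALUE (field 10) is non-zero.
# Exit status is 1 if anything was printed, 0 otherwise.
check_key_attrs() {
    awk '
        $1 ~ /^(5|187|188|197|198)$/ && $10 + 0 != 0 { print $1, $2, $10; bad = 1 }
        END { exit bad + 0 }
    '
}
```

Something like `smartctl -a disk0 | check_key_attrs || echo "disk0 needs attention"` would then stay silent for a drive whose key values are all zero.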

Sometimes, you stumble on the right person…

…and they reveal to you a bit of magic you didn’t know existed.

In this case, it is an undocumented flag in Promise Technology’s command line utility for the Promise Pegasus2 Thunderbolt RAID, promiseutil.

As previously discussed, it appeared to be impossible to check the status of more than one Promise Pegasus enclosure from inside a script using promiseutil. We had filed a support ticket, hoping for some kind of resolution, but were told that promiseutil works as intended.

On a hunch, I reached out to someone at Promise and asked for their help investigating this issue.  I was pleasantly surprised when the contact not only took the issue seriously, he immediately looped in other support engineers to look at the problem.

After a week of back and forth about what an appropriate solution would be, perhaps a feature request, the support engineer discovered that there is an undocumented flag that allows you to specify the hba of the Promise unit you want to execute a command on.

Here’s an example. Let’s say we want to check the SMART status of two Promise Pegasus units from the command line:

 promiseutil -C smart -v

will return the information for the default device only.

If you want to be explicit about which Promise Pegasus you’re checking, first get the hba numbers of the connected units:

promiseutil -C spath

The results will be something like this:

archer:~ admin$ promiseutil -C spath
=================================================
Type  #    Model        Alias   WWN          Seq
=================================================
hba   1  * Pegasus2 R4       2000-0001-5558-2fe2  1
hba   2    Pegasus2 M4       2000-0001-5558-3f92  1

Now we use the magic (apparently undocumented) -P (uppercase, not the documented lowercase) flag to specify the unit we want to look at.

promiseutil -T hba -P 1 -C smart -v

which returns the results for the first unit.

promiseutil -T hba -P 2 -C smart -v

will return results for the second unit.
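Putting the two commands together, the hba numbers can be pulled out of the spath listing and fed to -P in a loop. A minimal sketch, assuming the spath output keeps the column layout shown above; the helper name hba_numbers is hypothetical, and inside a script each promiseutil call may still need the screen wrapper used in the other scripts on this page.

```shell
#!/bin/sh
# Reads "promiseutil -C spath" output on stdin and prints the number of each hba row.
hba_numbers() {
    awk '$1 == "hba" && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Intended use (left as comments because it needs attached Pegasus units):
#   for n in $(promiseutil -C spath | hba_numbers); do
#       promiseutil -T hba -P "$n" -C smart -v
#   done
```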

My sincere thanks to the people at Promise who helped us sort this out (you know who you are) and to my fellow bug wrestler, Allen Hancock of Watchman Monitoring.

As always, be cautious with promiseutil. Its power is mighty and Bad Things® can happen if used incorrectly.

Scanning more than one Promise device with promiseutil

So, comes the day when you have more than a single Promise Pegasus attached to a Mac and you’d like to leverage some of your utilities to check the second device.

“No problem,” you think, “I’ll just count the number of devices, then check each one in sequence.”

Except…

promiseutil is broken in one very important way.

From inside promiseutil, the command to switch to the second unit in the chain would be something like:

spath -a chgpath -t hba -p 2

And that command works just fine. But as we’ve seen from previous work, executing promiseutil from inside a bash script requires the use of the screen command.

Executing this command from inside promiseutil run under screen does not work correctly. promiseutil appears to ignore the command and remains on the default device.

The official response from Promise is as follows:

This has been made/designed in a way to work as it is described in the KB article (and it is not a bug,but that’s how it has been designed to work) that was given on my earlier reply and it can’t used in the way that you have given and I am sorry that there are no work around available.

If you know someone at Promise and have any influence, it would be a significant improvement to have this bug removed from the next release of promiseutil.

Heck, if you’re feeling bored, file a bug report with them here.

Logging time-stamped ping results to a file using AppleScript and bash.

I deal with a number of remote workers who, for one reason or another, don’t work in the company office. Often, they’re using a VPN tunnel to connect to a server back at the company.

Occasionally, we’ll see intermittent connectivity issues from the client. Perhaps it’s their ISP, perhaps it’s the VPN tunnel, perhaps it’s a piece of software triggering IDS on a managed firewall.

In any case, we can triangulate the problem by launching a script on the client’s side that pings endpoints of our choosing to check connectivity. But we also want to time stamp and capture the results of the pings to a text file we can review later.

This is where tee is your friend. As the man entry says, tee is a “pipe fitting”.

The tee utility copies standard input to standard output, making a copy in zero or more files.
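As a rough illustration of the pattern used further down, here is a small sketch that time-stamps whatever arrives on standard input, shows it, and keeps a copy in a file. The function name stamp_and_log is hypothetical.

```shell
#!/bin/sh
# Prefix each line read from stdin with the current date, print it, and
# save a copy to the file named in $1 (tee overwrites an existing file).
stamp_and_log() {
    while IFS= read -r line; do
        printf '%s: %s\n' "$(date)" "$line"
    done | tee "$1"
}
```

For example, `ping 8.8.8.8 | stamp_and_log "$HOME/Desktop/ping-8.8.8.8.txt"` would show each reply in the terminal and write the same time-stamped line to the file.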

So, here are our requirements:

  1. Script is user-initiated.
  2. Script gets out of the user’s way.
  3. Timestamps and logs the pings to a text file in a folder on the Desktop.

This AppleScript, which makes a bunch of bash calls, does all of that.

# Simple ping monitor
# A script that pings servers of your choice by IP or DNS name and logs the results to a text file in a folder on the Desktop.
#
# Written by AB @ Modest Industries (modestindustries.com)
#
# 2012-07-25 - AB: First draft.
# 2014-07-25 - AB: Formatting cleanup. 

#Servers to ping. For each server you name here, you'll need to set up a ping statement below.
set server1 to "google.com"
set server2 to "8.8.8.8"
set server3 to "yahoo.com"

property the_prefix : space

property the_sep : "-"

# Format a date to use as a datestamp.
on myDate()
    
    set myYear to "" & year of (current date)
    
    set myMth to text -2 thru -1 of ("0" & (month of (current date)) * 1)
    
    set myDay to text -2 thru -1 of ("0" & day of (current date))
    
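    # Zero-pad hours and minutes so the timestamp is always four digits (e.g. 0905).
    set myHours to text -2 thru -1 of ("0" & hours of (current date))
    
    set myMinutes to text -2 thru -1 of ("0" & minutes of (current date))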
    
    return {myYear, myMth, myDay, myHours, myMinutes}
    
end myDate

# Check for a folder called Monitoring on the Desktop. If it doesn't exist, make one.
tell application "Finder"
    set the directory to desktop
    if (exists folder "Monitoring") is false then
        make new folder at desktop with properties {name:"Monitoring"}
    end if
    
    set the_path to folder "Monitoring" of desktop
    
    set the_name to (item 1 of my myDate())
    
    set the_name to (the_name & the_sep & item 2 of my myDate())
    
    set the_name to (the_name & the_sep & item 3 of my myDate())
    
    set the_timestamp to item 4 of my myDate() & item 5 of my myDate()
    
    -- set the directory to "Monitoring"
    if (exists folder the_name of folder "Monitoring" of desktop) is false then
        make new folder at the_path with properties {name:the_name}
    end if
    
    set the_path to folder the_name of folder "Monitoring" of desktop as alias
    
    set posixPath to POSIX path of the_path
end tell

# Ping servers of your choice. You'll need one statement for each server named above.

tell application "Terminal" to do script "ping " & server1 & " | while read pong; do echo \"$(date): $pong\"; done | tee " & quoted form of posixPath & the_name & the_sep & the_timestamp & the_sep & server1 & ".txt"

tell application "Terminal" to do script "ping " & server2 & " | while read pong; do echo \"$(date): $pong\"; done | tee " & quoted form of posixPath & the_name & the_sep & the_timestamp & the_sep & server2 & ".txt"

tell application "Terminal" to do script "ping " & server3 & " | while read pong; do echo \"$(date): $pong\"; done | tee " & quoted form of posixPath & the_name & the_sep & the_timestamp & the_sep & server3 & ".txt"

# Hide all the windows.
tell application "System Events" to set visible of process "Terminal" to false

# Tell the user it's running.
display dialog "Ping monitor is running!" buttons {"OK"} default button 1

# Switch back to the Finder.
tell application "Finder" to activate

You might want to tweak the dialogue to tell the user to leave the Terminal app running.

Should this be a bash script? Probably. But this works and can be launched by the user and hides most of the gubbins so that the user can get on with their business.

Promise Pegasus2: Scripting a SMART check with promiseutil

We’ve found that the Promise Pegasus2 Thunderbolt 2 RAID can report that the SMART Health status of its disks is just dandy, while the unit is quietly accumulating ATA errors that may indicate the pending failure of a disk.

I want to be notified if the Pegasus either has a SMART status failure or if ATA errors are present on any of the disks.

This script does just that. It’s essentially a more refined version of the previous promiseutil scripts that grabs the simple SMART status of each disk, greps to see if it’s “OK”, then runs a line of awk that looks at the report to see if there’s an “ATA Error Count”. As always, it logs to system.log and optionally sends error reports by email.

#!/bin/bash
#
# promise_smart_check.sh
#
# Checks Promise Pegasus2 SMART status, checks for ATA errors, logs and mails the output if there's an issue.
#
# Author: AB @ Modest Industries
#
# Requires Promise Utility for Pegasus2 (http://www.promise.com), tested with v3.18.0000.18
# Requires sendemail for email alerts (http://caspian.dotconf.net/menu/Software/SendEmail/)
#
# Edit History
# 2014-04-21 - AB: Version 1.0.
# 2014-04-24 - AB: Refactored.
# 2014-05-01 - AB: Incorporate the awk script to check for ATA errors.
# 2014-05-08 - AB: Refinements.
# 2014-05-15 - AB: Update to message body construction, tmp file & sendemail sanity checks.
# 2014-05-17 - AB: Added promiseutil path check.

export DATESTAMP=`date +%Y-%m-%d\ %H:%M:%S`

# Editable variables

# Path to sendemail
sendemail_path="/Library/Scripts/Monitoring/sendemail"

# Send email alerts?
send_email_alert=true

# Variables for sendemail
# Sender's address
alert_sender="[email protected]"

# Recipient's addresses, comma separated.
#alert_recipient='[email protected], [email protected]'
alert_recipient="[email protected]"

# SMTP server to send the messages through
alert_smtp_server="smtp.example.com"

# ------------ You probably shouldn't edit below this line ------------------
# Variables

# Default the error flags to false.
smart_error_flag="false"
ata_error_flag="false"

# Alert subject
alert_subject="ALERT: Promise Pegasus2 SMART problem detected on $HOSTNAME."

# Alert header
alert_header="At $DATESTAMP, a problem was detected on this device:\n"

# Pass / Fail messages
pass_msg="Promise Pegasus SMART check successful."
fail_msg=" *** Promise Pegasus SMART check FAILED!!! ***"

# Default the message body
message_body=""

# Alert footer
alert_footer="Run 'promiseutil -C smart -v' for more information."

# Promise Pegasus command line utility default path
promiseutil_path="/usr/bin/promiseutil"

# ----------------- Check for promiseutil, sendemail & set up temp files ------------------
if [ ! -f $promiseutil_path ]; then
        echo "$0 ERROR: $promiseutil_path does not exist"
        echo  "Please download and install the Promise Pegasus Utility app from http://promise.com"
        exit 1
fi

if [ ! -f $sendemail_path ]; then
        echo "$0 ERROR: $sendemail_path does not exist"
        echo  "Please download from http://caspian.dotconf.net/menu/Software/SendEmail/ and then set the \$sendemail_path variable inside this script"
        exit 1
fi

unit_ID_tmp=`mktemp -q "/tmp/$$_unit_ID.XXXX"`
if [ $? -ne 0 ]; then
        echo "$0: ERROR: Can't create temp file, exiting..."
        exit 1
fi

smart_results_tmp=`mktemp -q "/tmp/$$_smart_results.XXXX"`
if [ $? -ne 0 ]; then
        echo "$0: ERROR: Can't create temp file, exiting..."
        exit 1
fi

# ----------------- Run promiseutil, evaluate the results ------------------

# Get Unit ID information for this Promise unit. Includes workaround for promiseutil tty issue.
screen -D -m sh -c "$promiseutil_path -C subsys -v >$unit_ID_tmp"

# Drop the output into a variable.
unit_ID=$(<"$unit_ID_tmp")

# Get the SMART report, put it into a tmp file.
screen -D -m sh -c "$promiseutil_path -C smart -v >$smart_results_tmp"

# Grab the header for each PdId in the Promise
smart_status=$(grep -A4 "^PdId" "$smart_results_tmp")

# Check the header to see if SMART Health Check reports a problem
if grep "^SMART Health Status:" <<< "$smart_status" | grep -qv "OK"
then
        smart_error_flag="true"
fi

# Check for ATA errors, which may indicate that the drive is failing even if SMART Health is OK
ata_errors=$(awk '/^PdId: [1-9][0-9]*/ \
                                { a=$0; n=4; next } \
                                n { --n; a=a "\n" $0; next } \
                                /^ATA Error Count*/ \
                                { ata_err=$0; print a "\n" ata_err "\n" }' \
                                "$smart_results_tmp")
# Flag if there were ATA errors
if [ "$ata_errors" != "" ]; then
        ata_error_flag="true"
fi

# ----------------- Build the message_body ------------------

# If there's a problem, build the header.
if [ "$smart_error_flag" ==  "true" ] || [ "$ata_error_flag" == "true" ]; then
        message_body="$alert_header\n\n$fail_msg\n\n$unit_ID\n\n"

        # SMART Health status.
        if [ "$smart_error_flag" == "true" ]; then
                message_body="$message_body\nSMART Health Status is reporting one or more bad drives."
        fi

        # Always include the smart_status
        message_body="$message_body\n\n$smart_status"

        # Then the ATA errors.
        if [ "$ata_error_flag" == "true" ]; then
                message_body="$message_body\n\nOne or more drives has an ATA Error Count and may be failing.\n\n$ata_errors"
        fi
fi

#  ----------------- Logging & email ------------------

# Log the results, conditionally send email on failure.
if [ "$ata_error_flag" == "true" ] || [ "$smart_error_flag" == "true" ]; then
        message_body="$message_body\n\n$alert_footer"
        echo "$DATESTAMP: \n\n$message_body" >> /var/log/system.log
        if [ "$send_email_alert" == "true" ] ; then
                "$sendemail_path" -f $alert_sender -t $alert_recipient -u $alert_subject -m "$message_body" -s $alert_smtp_server
        fi
else
        echo "$DATESTAMP: $pass_msg\n\n$unit_ID" >> /var/log/system.log
fi

# ----------------- Cleanup ------------------

rm -f "$unit_ID_tmp" "$smart_results_tmp"


This version of the script checks for the presence of promiseutil and sendemail. We call screen here because promiseutil seems to need a TTY in order to run properly.

Hope you find it useful.

Promise Pegasus2: Scripting an Enclosure check with promise_enclosure_check.sh

The Promise Pegasus2 has onboard sensors that monitor the power supply voltages, speed of the fan, and temperature of the controller and backplane.

That seems worth checking occasionally.

The example script below runs an initial check of the enclosure using promiseutil. If it doesn’t find that “Everything is OK”, it runs a more verbose check, logs the problem and optionally sends email.

#!/bin/bash
#
# promise_enclosure_check.sh
#
# Checks the status of a Promise Pegasus2 RAID enclosure and mails the output if there's an issue.
#
# Author: AB @ Modest Industries
#
# Works with Promise Utility for Pegasus2 v3.18.0000.18 (http://www.promise.com)
# Requires sendemail for email alerts (http://caspian.dotconf.net/menu/Software/SendEmail/)
#
# Edit History
# 2014-04-21 - AB: Version 1.0.
# 2014-05-08 - AB: Refinements.
# 2014-05-09 - AB: Better message_body if failed.

export DATESTAMP=`date +%Y-%m-%d\ %H:%M:%S`

# Editable variables

# Path to sendemail
sendemail_path="/Library/Scripts/Monitoring/sendemail"

# If a problem is found, send email?
send_email_alert=true

# Variables for sendemail
# Sender's address
alert_sender="[email protected]"

# Recipient's addresses, comma separated.
#alert_recipient='[email protected], [email protected]'
alert_recipient="[email protected]"

# SMTP server to send the messages through
alert_smtp_server="smtp.example.com"

# ------------ Do not edit below this line ------------------
# Variables

# Pass / fail flags
enclosure_pass=true

# The subject line of the alert.
alert_subject="Alert: Promise Pegasus2 enclosure problem detected on $HOSTNAME."

# Alert header
alert_header="At $DATESTAMP, an enclosure problem was detected on this device:\n"

# Pass / Fail messages
pass_msg="Promise Pegasus Enclosure check successful."
fail_msg=" *** Promise Pegasus Enclosure check FAILED!!! ***\n\n"

# Alert footer
alert_footer="Run 'promiseutil -C enclosure -v' for more information."

# Create temp files
unit_ID_tmp=`mktemp "/tmp/$$_unit_ID.XXXX"`
enclosure_results_tmp=`mktemp "/tmp/$$_enclosure_results.XXXX"`

message_body="$alert_header"

# Get the information for this Promise unit. Includes workaround for promiseutil tty issue.
screen -D -m sh -c "promiseutil -C subsys -v >$unit_ID_tmp"

# Drop the output into a variable.
unit_ID=$(<$unit_ID_tmp)

# Get the report, put it into a tmp file.
screen -D -m sh -c "promiseutil -C enclosure >$enclosure_results_tmp"

if ! grep -q "Everything is OK" "$enclosure_results_tmp"
then
        enclosure_pass="false"
        # Get a more detailed report, put it into a tmp file.
        screen -D -m sh -c "promiseutil -C enclosure -v >$enclosure_results_tmp"

        # Build the message.
        message_body="$message_body$fail_msg$unit_ID\n\n$(<"$enclosure_results_tmp")"
fi

#  ----------------- Logging & email ------------------

# Log the results, conditionally send email on failure.
if [ "$enclosure_pass" == "false" ]; then
        message_body="$message_body\n\n$alert_footer"
        echo "$DATESTAMP: \n\n$message_body" >> /var/log/system.log
        if [ "$send_email_alert" == "true" ] ; then
                "$sendemail_path" -f $alert_sender -t $alert_recipient -u $alert_subject -m "$message_body" -s $alert_smtp_server
        fi
else
        echo "$DATESTAMP: $pass_msg" >> /var/log/system.log
fi
# Cleanup
rm -f "$unit_ID_tmp" "$enclosure_results_tmp"

The script was developed against a Promise Pegasus2. It hasn’t been tested with the earlier Promise Pegasus series.
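
For anyone adapting the check, here is a minimal sketch of the pass/fail test described above: look for the phrase "Everything is OK" in a saved report and treat anything else as a failure. The function name enclosure_ok and the sample report lines are made up for illustration; only the "Everything is OK" phrase comes from the script.

```shell
#!/bin/bash
# Sketch: decide pass/fail from a saved enclosure report.
# The report text below is invented for illustration, apart from the
# "Everything is OK" phrase that the script searches for.

# Returns 0 when the report file contains the all-clear phrase.
enclosure_ok() {
        grep -q "Everything is OK" "$1"
}

report_tmp=$(mktemp "${TMPDIR:-/tmp}/enclosure_sketch.XXXXXX")

# A hypothetical healthy report.
printf 'Enclosure 1: Everything is OK\n' > "$report_tmp"
if enclosure_ok "$report_tmp"; then first_result="pass"; else first_result="fail"; fi

# A hypothetical faulty report.
printf 'Enclosure 1: Fan 1 below threshold\n' > "$report_tmp"
if enclosure_ok "$report_tmp"; then second_result="pass"; else second_result="fail"; fi

echo "healthy report: $first_result, faulty report: $second_result"
rm -f "$report_tmp"
```

With this test an empty report, for example when promiseutil could not run at all, also counts as a failure, which seems like the safer default for a monitoring script.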

2014-11-07 – Update: Merci to Stéphane Allain for catching a typo in the script.

Promise Pegasus2: Scripting a disk check with promise_disk_check.sh

When you deploy a Promise Pegasus2, you want to run regular disk health checks and send an email notification if there’s a problem. The Promise Utility app can theoretically do this* when there’s someone logged in at the console, but we’re rarely running these in environments where anyone is logged in at the console.

The solution is to script a check of the disks using the promiseutil command line utility and then create a cronjob to run it at regular intervals.

Here’s an example disk check that parses the output of phydrv, logs each run to system.log and can optionally send email if a problem is found.

#!/bin/bash
#
# promise_disk_check.sh
#
# Checks the phydrv status of a Promise Pegasus, logs and mails the output if there's an issue.
#
# Author: A @ Modest Industries
# Last update: 2014-07-19
# 2014-07-19 - tweaked grep to allow for Media Patrol
#
# Works with Promise Utility for Pegasus2 v3.18.0000.18 (http://www.promise.com)
# Requires sendemail for email alerts (http://caspian.dotconf.net/menu/Software/SendEmail/)

export DATESTAMP=`date +%Y-%m-%d\ %H:%M:%S`

# Editable variables

# Path to sendemail
sendemail_path="/Library/Scripts/Monitoring/sendemail"
# Email alert?
send_email_alert=true

# Variables for sendemail
# Sender's address
alert_sender="[email protected]"

# Recipient's addresses, comma separated.
#alert_recipient='[email protected], [email protected]'
alert_recipient="[email protected]"

# SMTP server to send the messages through
# alert_smtp_server="smtp.example.com:port"
alert_smtp_server="smtp.example.com"

# Subject line of the alert.
alert_subject="Alert: Promise disk problem detected on $HOSTNAME."

# Header line at the top of the alert message 
alert_header="At $DATESTAMP, a problem was detected on this device:\n"

# Pass / Fail messages
pass_msg="Promise disk check successful."
fail_msg=" *** Promise disk check FAILED!!! ***"

# ------------ Do not edit below this line ------------------
# Variables
pass=true
results=""

# Create temp files
unit_ID_tmp=`mktemp "/tmp/$$_ID.XXXX"`
results_tmp=`mktemp "/tmp/$$_results.XXXX"`

# Get header information for this Promise unit. Includes workaround for promiseutil tty issue.
screen -D -m sh -c "promiseutil -C subsys -v >$unit_ID_tmp"
unit_ID=$(<"$unit_ID_tmp")

# Get status of the disks. Includes workaround for promiseutil tty issue.
screen -D -m sh -c "promiseutil -C phydrv >$results_tmp"

# Check each line of the output for drives that are not reporting OK.
while read -r line
do
        if grep '^[0-9]' <<< "$line" | grep -Eqv 'OK|Media'
        then
                results=$results"BAD DRIVE DETECTED: $line\n\n"
                pass=false
        fi
done < "$results_tmp"

# Log the results, conditionally send email on failure.
if [ "$pass" = false ] ; then
        results="$alert_header$unit_ID\n\n$results"
        echo "$DATESTAMP: $fail_msg\n\n$results" >> /var/log/system.log
        if [ "$send_email_alert" = true ] ; then
                "$sendemail_path" -f $alert_sender -t $alert_recipient -u $alert_subject -m "$results" -s $alert_smtp_server
        fi
else
        echo "$DATESTAMP: $pass_msg" >> /var/log/system.log
fi

# Cleanup
rm -f "$unit_ID_tmp" "$results_tmp"

Note that the script uses sendemail for sending mail, a very useful little drop-in for when the local machine isn’t running mail services.
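
The cron job mentioned above could look something like this. It's only a sketch: the script location is an assumption (it borrows the directory used for sendemail), and the hourly schedule is an arbitrary choice. Since the script appends to /var/log/system.log, it most likely needs to go in root's crontab.

```shell
#!/bin/bash
# Hypothetical install location; adjust to wherever the script actually lives.
script_path="/Library/Scripts/Monitoring/promise_disk_check.sh"

# Standard five-field crontab schedule: minute 0 of every hour.
cron_line="0 * * * * $script_path"

# Print the entry for review. To install it, paste the line into
# root's crontab with 'sudo crontab -e'.
echo "$cron_line"
```

Printing the line instead of installing it automatically keeps the sketch harmless to run and leaves the final edit of the crontab to you.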

*I say “theoretically” because configuring email in the Promise Utility is a mess and I’ve yet to see a single successful notification after configuring it.

2014-05-04 – Updated to make the path to sendemail a variable.

2014-07-19 – Changed grep to handle false positive during Media Patrol runs
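
To see what that grep change does, here's a small sketch that runs the same two-stage filter over a few sample lines. The sample lines and the helper names is_bad_line and check_lines are made up for illustration; real phydrv output will look different.

```shell
#!/bin/bash
# Sketch of the per-line filter from promise_disk_check.sh.
# The sample lines below are invented; real phydrv output may differ.

# Returns 0 when a line starts with a digit (a drive row) and
# mentions neither OK nor Media (as in Media Patrol).
is_bad_line() {
        grep '^[0-9]' <<< "$1" | grep -Eqv 'OK|Media'
}

# Reads lines from stdin and prints the ones the filter flags.
check_lines() {
        local line
        while read -r line; do
                if is_bad_line "$line"; then
                        echo "BAD: $line"
                fi
        done
}

sample='PdId Model OpStatus
1 EXAMPLE-DISK OK
2 EXAMPLE-DISK Media Patrol
3 EXAMPLE-DISK Dead'

check_lines <<< "$sample"
```

Only the last sample line is flagged. The header row is skipped because it doesn't start with a digit, and the drive running Media Patrol is skipped because of the Media alternative in the pattern.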