Category Archives: Monitoring

One-liner: Check the five, most important SMART parameters on a disk.

A while ago, Backblaze published a report on what they consider to be the most reliable SMART parameters for determining whether a disk is failing. These include:

  • 5 – Reallocated_Sector_Ct
  • 187 – Uncorrectable_Error_Cnt
  • 188 – Command_Timeout
  • 197 – Current_Pending_Sector_Count
  • 198 – Offline_Uncorrectable

For a complete description of these parameters, take a look at the Wikipedia article on SMART.

While our sample of failing disks is no where near as large as Backblaze’s, our results have, unsurprisingly, correlated pretty strongly to theirs.

Note that not all of these parameters are supported by the drive manufacturers and that we typically don’t see many of these parameters on the hard disks supplied in Apple hardware. Additionally, note that SMART is not supported on some drives.

Assuming you’ve got smartmontools installed, this one-liner will very quickly give you a snapshot of the key values we look for as strong indicators that a drive needs to be replaced:

smartctl -a disk0 | egrep "^( 5|187|188|197|198)"

where

disk0

is the disk you’re testing. To get the disks available to test, run

diskutil list

You’ll get back output that looks like this:

/dev/disk0
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *256.1 GB   disk0
   1:                        EFI                         209.7 MB   disk0s1
   2:                  Apple_HFS Macintosh HD                 255.2 GB   disk0s2
   3:                 Apple_Boot Recovery HD             650.0 MB   disk0s3
/dev/disk1
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *500.1 GB   disk1
   1:                        EFI                         209.7 MB   disk1s1
   2:                  Apple_HFS Storage                 499.8 GB   disk1s2

In the example above, there are two disks to choose from,

disk0

and

disk1

Assuming the drive supports all five SMART parameters, you’ll get back something that looks like this:

  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0

Those trailing zeros are what we like to see. Positive values in the last column mean that the drive probably needs to be replaced.

Logging time-stamped ping results to a file using Applescript and bash.

I deal with a number of remote workers who, for one reason or the other, don’t work in the company office. Often, they’re using a VPN tunnel to connect to a server back at the company.

Occasionally, we’ll see intermittent connectivity issues from the client. Perhaps it’s their ISP, perhaps it’s the VPN tunnel, perhaps it’s a piece of software triggering IDS on a managed firewall.

In any case, we can triangulate the problem by launching a script on the client’s side that pings endpoints of our choosing to check connectivity. But we also want to time stamp and capture the results of the pings to a text file we can review later.

This is where

tee

is your friend. As the man entry says, tee is a “pipe fitting”.

The tee utility copies standard input to standard output, making a copy in zero or more files.

So, here are our requirements:

  1. Script is user-initiated.
  2. Script gets out of the user’s way.
  3. Timestamps and logs the pings to a text file in a  folder on the Desktop.

This Applescript, which makes a bunch of bash calls, does all of that.

# Simple ping monitor
# A script that pings servers of your choice by IP or DNS name and logs the results to a text file in a folder on the Desktop.
#
# Written by AB @ Modest Industries (modestindustries.com)
#
# 2012-07-25 - AB: First draft.
# 2014-07-25 - AB: Formatting cleanup. 

#Servers to ping. For each server you name here, you'll need to set up a ping statement below.
set server1 to "google.com"
set server2 to "8.8.8.8"
set server3 to "yahoo.com

property the_prefix : space

property the_sep : "-"

# Format a date to use as a datestamp.
on myDate()
    
    set myYear to "" & year of (current date)
    
    set myMth to text -2 thru -1 of ("0" & (month of (current date)) * 1)
    
    set myDay to text -2 thru -1 of ("0" & day of (current date))
    
    set myHours to hours of (current date)
    
    set myMinutes to minutes of (current date)
    
    return {myYear, myMth, myDay, myHours, myMinutes}
    
end myDate

# Check for a folder called Monitoring on the Desktop. If it doesn't exist, make one.
tell application "Finder"
    set the directory to desktop
    if (exists folder "Monitoring") is false then
        make new folder at desktop with properties {name:"Monitoring"}
    end if
    
    set the_path to folder "Monitoring" of desktop
    
    set the_name to (item 1 of my myDate())
    
    set the_name to (the_name & the_sep & item 2 of my myDate())
    
    set the_name to (the_name & the_sep & item 3 of my myDate())
    
    set the_timestamp to item 4 of my myDate() & item 5 of my myDate()
    
    -- set the directory to "Monitoring"
    if (exists folder the_name of folder "Monitoring" of desktop) is false then
        make new folder at the_path with properties {name:the_name}
    end if
    
    set the_path to folder the_name of folder "Monitoring" of desktop as alias
    
    set posixPath to POSIX path of the_path
end tell

# Ping servers of your choice. You'll need one statement for each server named above.

tell application "Terminal" to do script "ping " & server1 & " | while read pong; do echo \"$(date): $pong\"; done | tee " & quoted form of posixPath & the_name & the_sep & the_timestamp & the_sep & server1 & ".txt"

tell application "Terminal" to do script "ping " & server2 & " | while read pong; do echo \"$(date): $pong\"; done | tee " & quoted form of posixPath & the_name & the_sep & the_timestamp & the_sep & server2 & ".txt"

tell application "Terminal" to do script "ping " & server3 & " | while read pong; do echo \"$(date): $pong\"; done | tee " & quoted form of posixPath & the_name & the_sep & the_timestamp & the_sep & server3 & ".txt"

# Hide all the windows.
tell application "System Events" to set visible of process "Terminal" to false

# Tell the user it's running.
display dialog "Ping monitor is running!" buttons {"OK"} default button 1

# Switch back to the Finder.
tell application "Finder" to activate

You might want to tweak the dialogue to tell the user to leave the Terminal app running.

Should this be a bash script? Probably. But this works and can be launched by the user and hides most of the gubbins so that the user can get on with their business.

Promise Pegasus2: Scripting a SMART check with promiseutil

We’ve found that the Promise Pegasus2 Thunderbolt 2 RAID can report that the SMART Health status of its disks is just dandy, while the unit is quietly accumulating ATA errors that may indicate the pending failure of a disk.

I want to be notified if the Pegasus either has a SMART status failure or if ATA errors are present on any of the disks.

This script does just that. It’s essentially a more refined version of the previous promiseutil scripts that grabs the simple SMART status of each disk, greps to see if it’s “OK”, then runs a line of awk that looks at the report to see if there’s an “ATA Error Count”. As always, it logs to system.log and optionally sends error reports by email.

#!/bin/bash
#
# promise_smart_check.sh
#
# Checks Promise Pegasus2 SMART status, checks for ATA errors, logs and mails the output if there's an issue.
#
# Author: AB @ Modest Industries
#
# Requires Promise Utility for Pegasus2 (http://www.promise.com), tested with v3.18.0000.18
# Requires sendemail for email alerts (http://caspian.dotconf.net/menu/Software/SendEmail/)
#
# Edit History
# 2014-04-21 - AB: Version 1.0.
# 2014-04-24 - AB: Refactored.
# 2014-05-01 - AB: Incorporate the awk script to check for ATI errors.
# 2014-05-08 - AB: Refinements.
# 2014-05-15 - AB: Update to message body construction, tmp file & sendemail sanity checks.
# 2014-05-17 - AB: Added promiseutil path check.

export DATESTAMP=`date +%Y-%m-%d\ %H:%M:%S`

# Editable variables

# Path to sendemail
sendemail_path="/Library/Scripts/Monitoring/sendemail"

# Send email alerts?
send_email_alert=true

# Variables for sendemail
# Sender's address
alert_sender="[email protected]"

# Recipient's addresses, comma separated.
#alert_recipient='[email protected], [email protected]'
alert_recipient="[email protected]"

# SMTP server to send the messages through
alert_smtp_server="smtp.example.com"

# ------------ You probably shouldn't edit below this line ------------------
# Variables

# Default the error flags to false.
smart_error_flag="false"
ata_error_flag="false"

# Alert subject
alert_subject="ALERT: Promise Pegasus2 SMART problem detected on $HOSTNAME."

# Alert header
alert_header="At $DATESTAMP, a problem was detected on this device:\n"

# Pass / Fail messages
pass_msg="Promise Pegasus SMART check successful."
fail_msg=" *** Promise Pegasus SMART check FAILED!!! ***"

# Default the message body
message_body=""

# Alert footer
alert_footer="Run 'promiseutil -C smart -v' for more information."

# Promise Pegasus command line utility default path
promiseutil_path="/usr/bin/promiseutil"

# ----------------- Check for promiseutil, sendemail & set up temp files ------------------
if [ ! -f $promiseutil_path ]; then
        echo "$0 ERROR: $promiseutil_path does not exist"
        echo  "Please download and install the Promise Pegasus Utility app from http://promise.com"
        exit 1
fi

if [ ! -f $sendemail_path ]; then
        echo "$0 ERROR: $sendemail_path does not exist"
        echo  "Please download from http://caspian.dotconf.net/menu/Software/SendEmail/ and then set the \$sendmemail_path variable inside this script"
        exit 1
fi

unit_ID_tmp=`mktemp -q "/tmp/$$_unit_ID.XXXX"`
if [ $? -ne 0 ]; then
        echo "$0: ERROR: Can't create temp file, exiting..."
        exit 1
fi

smart_results_tmp=`mktemp -q "/tmp/$$_smart_results.XXXX"`
if [ $? -ne 0 ]; then
        echo "$0: ERROR: Can't create temp file, exiting..."
        exit 1
fi

# ----------------- Run promiseutil, evaluate the results ------------------

# Get Unit ID information for this Promise unit. Includes workaround for promiseutil tty issue.
screen -D -m sh -c "$promiseutil_path -C subsys -v >$unit_ID_tmp"

# Drop the output into a variable.
unit_ID=$(<$tmpdir$unit_ID_tmp)

# Get the SMART report, put it into a tmp file.
screen -D -m sh -c "$promiseutil_path -C smart -v >$smart_results_tmp"

# Grab the header for each PdId in the Promise
smart_status=$(cat $smart_results_tmp | grep -A4 "^PdId")

# Check the header to see if SMART Health Check reports a problem
if grep "^SMART Health Status:" <<< "$smart_status" | grep -qv "OK"
then
        smart_error_flag="true"
fi

# Check for ATA errors, which may indicate that the drive is failing even if SMART Health is OK
ata_errors=$(awk '/^PdId: [1-9][0-9]*/ \
                                { a=$0; n=4; next } \
                                n { --n; a=a "\n" $0; next } \
                                /^ATA Error Count*/ \
                                { ata_err=$0; print a "\n" ata_err "\n" }' \
                                "$smart_results")
# Flag if there were ATA errors
if [ "$ata_errors" != "" ]; then
        ata_error_flag="true"
fi

# ----------------- Build the message_body ------------------

# If there's a problem, build the header.
if [ "$smart_error_flag" ==  "true" ] || [ "$ata_error_flag" == "true" ]; then
        message_body="$alert_header\n\n$fail_msg\n\n$unit_ID\n\n"

        # SMART Health status.
        if [ "$smart_error_flag" == "true" ]; then
                message_body="$message_body\nSMART Health Status is reporting one or more bad drives."
        fi

        # Always include the smart_status
        message_body="$message_body\n\n$smart_status"

        # Then the ATA errors.
        if [ "$ata_error_flag" == "true" ]; then
                message_body="$message_body\n\nOne or more drives has an ATA Error Count and may be failing.\n\n$ata_errors"
        fi
fi

#  ----------------- Logging & email ------------------

# Log the results, conditionally send email on failure.
if [ "$ata_error_flag" == "true" ] || [ "$smart_error_flag" == "true" ]; then
        message_body="$message_body\n\n$alert_footer"
        echo "$DATESTAMP: \n\n$message_body" >> /var/log/system.log
        if [ "$send_email_alert" == "true" ] ; then
                "$sendemail_path" -f $alert_sender -t $alert_recipient -u $alert_subject -m "$message_body" -s $alert_smtp_server
        fi
else
        echo "$DATESTAMP: $pass_msg\n\n$unit_ID" >> /var/log/system.log
fi

# ----------------- Cleanup ------------------

rm -f rm -f $unit_ID_tmp $smart_results_tmp


This version of the script checks for the presence of promiseutil and sendemail. We call screen here because the promiseutil seems to need a TTY in order to run properly.

Hope you find it useful.

Promise Pegasus2: Scripting an Enclosure check with promise_enclosure_check.sh

The Promise Pegasus2 has onboard sensors that monitor the power supply  voltages, speed of the fan, and temperature of the controller and backplane.

This seems worth performing the occasional check on.

The example script below runs an initial check of the enclosure using promiseutil. If it doesn’t find that “Everything is OK”, it runs a more verbose check, logs the problem and optionally sends email.

#!/bin/bash
#
# promise_enclosure_check.sh
#
# Checks the status of a Promise Pegasus2 RAID enclosure and mails the output if there's an issue.
#
# Author: AB @ Modest Industries
#
# Works with Promise Utility for Pegasus2 v3.18.0000.18 (http://www.promise.com)
# Requires sendemail for email alerts (http://caspian.dotconf.net/menu/Software/SendEmail/)
#
# Edit History
# 2014-04-21 - AB: Version 1.0.
# 2014-05-08 - AB: Refinements.
# 2014-05-09 - AB: Better message_body if failed.

export DATESTAMP=`date +%Y-%m-%d\ %H:%M:%S`

# Editable variables

# Path to sendemail
sendemail_path="/Library/Scripts/Monitoring/sendemail"

# If a problem is found, send email?
send_email_alert=true

# Variables for sendemail
# Sender's address
alert_sender="[email protected]"

# Recipient's addresses, comma separated.
#alert_recipient='[email protected], [email protected]'
alert_recipient="[email protected]"

# SMTP server to send the messages through
alert_smtp_server="smtp.example.com"

# ------------ Do not edit below this line ------------------
# Variables

# Pass / fail flags
enclosure_pass=true

# The subject line of the alert.
alert_subject="Alert: Promise Pegasus2 enclosure problem detected on $HOSTNAME."

# Alert header
alert_header="At $DATESTAMP, an enclosure problem was detected on this device:\n"

# Pass / Fail messages
pass_msg="Promise Pegasus Enclosure check successful."
fail_msg=" *** Promise Pegasus Enclosure check FAILED!!! ***\n\n"

# Alert footer
alert_footer="Run 'promiseutil -C enclosure -v' for more information."

# Create temp files
unit_ID_tmp=`mktemp "/tmp/$$_unit_ID.XXXX"`
enclosure_results_tmp=`mktemp "/tmp/$$_enclosure_results.XXXX"`

message_body="$alert_header"

# Get the information for this Promise unit. Includes workaround for promiseutil tty issue.
screen -D -m sh -c "promiseutil -C subsys -v >$unit_ID_tmp"

# Drop the output into a variable.
unit_ID=$(<$unit_ID_tmp)

# Get the report, put it into a tmp file.
screen -D -m sh -c "promiseutil -C enclosure >$enclosure_results_tmp"

if ! grep -qv "Everything is OK" $enclosure_results_tmp
then
        enclosure_pass="false"
        # Get a more detailed report, put it into a tmp file.
        screen -D -m sh -c "promiseutil -C enclosure -v >$enclosure_results_tmp"

        # Build the message.
        message_body=$message_body$fail_header$unit_ID$(<$enclosure_results_tmp)
fi

#  ----------------- Logging & email ------------------

# Log the results, conditionally send email on failure.
if [ "$enclosure_pass" == "false" ]; then
        message_body="$message_body\n\n$alert_footer"
        echo "$DATESTAMP: \n\n$message_body" >> /var/log/system.log
        if [ "$send_email_alert" == "true" ] ; then
                "$sendemail_path" -f $alert_sender -t $alert_recipient -u $alert_subject -m "$message_body" -s $alert_smtp_server
        fi
else
        echo "$DATESTAMP: $pass_msg" >> /var/log/system.log
fi
# Cleanup
rm -f rm -f $unit_ID_tmp $enclosure_results_tmp

The script was developed against a Promise Pegasus2. It hasn’t been tested with the earlier Promise Pegasus series.

2014-11-07 – Update: Merci to Stéphane Allain for catching a typo in the script.

Promise Pegasus2: Scripting a disk check with promise_disk_check.sh

When you deploy a Promise Pegasus2, you want to run regular disk health checks and send an email notification if there’s a problem. The Promise Utility app can theoretically do this* when there’s someone logged in at the console, but we’re rarely running these in environments where there’s anyone logged at the console.

The solution is to script a check of the disks using the promiseutil command line utility and then create a cronjob to run it at regular intervals.

Here’s an example disk check that parses the output of phydrv, logs each run to system.log and can optionally send email if a problem is found.

#!/bin/bash
#
# promise_disk_check.sh
#
# Checks the phydrv status of a Promise Pegasus, logs and mails the output if there's an issue.
#
# Author: A @ Modest Industries
# Last update: 2014-07-19
# 2014-07-19 - tweaked grep to allow for Media Patrol
#
# Works with Promise Utility for Pegasus2 v3.18.0000.18 (http://www.promise.com)
# Requires sendemail for email alerts (http://caspian.dotconf.net/menu/Software/SendEmail/)

export DATESTAMP=`date +%Y-%m-%d\ %H:%M:%S`

# Editable variables

# Path to sendemail
sendemail_path="/Library/Scripts/Monitoring/sendemail"
# Email alert?
send_email_alert=true

# Variables for sendemail
# Sender's address
alert_sender="[email protected]"

# Recipient's addresses, comma separated.
#alert_recipient='[email protected], [email protected]'
alert_recipient="[email protected]"

# SMTP server to send the messages through
# alert_smtp_server="smtp.example.com:port"
alert_smtp_server="smtp.example.com"

# Subject line of the alert.
alert_subject="Alert: Promise disk problem detected on $HOSTNAME."

# Header line at the top of the alert message 
alert_header="At $DATESTAMP, a problem was detected on this device:\n"

# Pass / Fail messages
pass_msg="Promise disk check successful."
fail_msg=" *** Promise disk check FAILED!!! ***"

# ------------ Do not edit below this line ------------------
# Variables
pass=true
results=""

# Create temp files
unit_ID_tmp=`mktemp "/tmp/$$_ID.XXXX"`
results_tmp=`mktemp "/tmp/$$_results.XXXX"`

# Get header information for this Promise unit. Includes workaround for promiseutil tty issue.
screen -D -m sh -c "promiseutil -C subsys -v >$tmpdir$unit_ID_tmp"
unit_ID=$(<$tmpdir$unit_ID_tmp)

# Get status of the disks.  Includes workaround for promiseutil tty issue.
screen -D -m sh -c "promiseutil -C phydrv >$tmpdir$results_tmp"

# Check each line of the output the test results.
while read -r line
do
        if grep '^[0-9]' <<< "$line" | grep -Eqv 'OK|Media'
        then
                results=$results"BAD DRIVE DETECTED: $line\n\n"
                pass=false
        fi
done < $tmpdir$results_tmp

# Log the results, conditionally send email on failure.
if [ "$pass" = false ] ; then
        results="$alert_header$unit_ID\n\n$results\n$alert_footer"
        echo "$DATESTAMP: $fail_msg\n\n$results" >> /var/log/system.log
        if [ "$send_email_alert" = true ] ; then
                "$sendemail_path" -f $alert_sen:der -t $alert_recipient -u $alert_subject -m "$results" -s $alert_smtp_server
        fi
else
        echo "$DATESTAMP: $pass_msg" >> /var/log/system.log
fi

# Cleanup
rm -f $tmpdir$unit_ID_tmp $tmpdir$results_tmp

Note that the script uses sendemail for sending mail, a very useful little drop in for when the local machine isn’t running mail services.

*I say “theoretically” because configuring email in the Promise Utility is a mess and I’ve yet to see a single successful notification after configuring it.

2014-05-04 – Updated to make the path to sendemail a variable.

2014-07-19 – Changed grep to handle false positive during Media Patrol runs