You can imagine where it goes from here…

Monitoring HDD temperature

While the Antec P180 case is a favourite amongst enthusiasts, it really is a poor option for a RAID array, the bottom cage especially. It houses 4 drives and has a single fan in between the cage and the power supply, with limited space for a lot of cabling. (In my case the stock fan was replaced with a Scythe SFLEX fluid bearing fan for lower noise. The server & desktop under my desk are virtually noiseless, as I like them, but that’s another post.) The fan also has no grilles on either side. I have already lost one 500GB Seagate drive (full of data) when a cable got stuck in the fan and caused the drives to overheat. The bottom cage has no ventilation at all when the fan stops working for one reason or another. The case really should give you the ability to add another fan in front of the bottom cage, as it does for the center cage. Perhaps I can jury-rig something.

This prompted me to create a 3x500GB Linux software RAID5, using all WD RE2 drives this time around. So far so good, until last night, when on a whim I checked HDD temperatures (as I do from time to time; call me paranoid) and found the 3 drives in the bottom cage sitting at 60C! I dove under my desk and found that the bottom fan was not working. It looks like the power connector got bumped out when I tidied up under there with velcro ties recently after upgrading some hardware. No idea how long they’ve been running like that, could be up to a few days 😦 Short of replacing the case with something better suited, for the short term I thought some proactive monitoring was in order. I installed hddtemp with yum (its output needs no parsing; otherwise I could have just used smartmontools instead) and whipped up a quick script which I run through cron roughly every 15 minutes:

[root@gatekeeper bin]# crontab -l
15,30,45,59 * * * * /usr/local/bin/monitorhddtemp.sh    

[root@gatekeeper bin]# cat monitorhddtemp.sh
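#!/bin/bash
# Check each listed drive; warn at ALERT_LEVEL, halt the machine at SHUTDOWN_LEVEL.
HDDS="/dev/sda /dev/sdb /dev/sdc /dev/sdd"
HDT=/usr/sbin/hddtemp
LOG=/usr/bin/logger
DOWN=/sbin/shutdown
ALERT_LEVEL=40       # degrees C: log and send a warning email
SHUTDOWN_LEVEL=55    # degrees C: halt the machine
for disk in $HDDS
do
  # Skip entries that are not block devices
  if [ -b "$disk" ]; then
        # -n prints only the temperature, without the drive name or unit
        HDTEMP=$("$HDT" -n "$disk")
        # Skip this drive if hddtemp did not return a plain number,
        # otherwise the numeric comparisons below would fail
        case "$HDTEMP" in
           ''|*[!0-9]*)
              $LOG "hard disk : $disk returned no usable temperature: $HDTEMP"
              continue ;;
        esac
        if [ "$HDTEMP" -ge "$ALERT_LEVEL" ]; then
           $LOG "hard disk : $disk temperature $HDTEMP°C crossed its alert limit"
           echo "hard disk : $disk temperature $HDTEMP°C crossed its alert limit" | mail -s "HDD TEMPERATURE WARNING" your@email.here
        fi
        if [ "$HDTEMP" -ge "$SHUTDOWN_LEVEL" ]; then
           $LOG "System going down as hard disk : $disk temperature $HDTEMP°C crossed its critical limit"
           # Flush pending writes before halting
           sync;sync
           $DOWN -h 0
        fi
  fi
done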

This script emails me when the temperature of any listed drive reaches the alert level (40C) and shuts the machine down when a drive reaches 55C, the manufacturer’s operating limit.
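The threshold logic can also be pulled out into a small function, which makes it easy to try against sample readings without touching a real drive. This is only a sketch: classify_temp is a made-up helper name, not part of the script above, and it assumes a reading is either a bare integer (as hddtemp -n normally prints) or something unusable such as an error message.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Thresholds mirror ALERT_LEVEL and SHUTDOWN_LEVEL from monitorhddtemp.sh.
ALERT_LEVEL=40
SHUTDOWN_LEVEL=55

# classify_temp READING
# Prints one of: ok, alert, shutdown, unknown
classify_temp() {
  local reading="$1"
  if ! [[ "$reading" =~ ^[0-9]+$ ]]; then
    # Empty or non-numeric reading: do not attempt a numeric comparison
    echo "unknown"
  elif [ "$reading" -ge "$SHUTDOWN_LEVEL" ]; then
    echo "shutdown"
  elif [ "$reading" -ge "$ALERT_LEVEL" ]; then
    echo "alert"
  else
    echo "ok"
  fi
}

# Sample readings (illustrative values, not real measurements)
for sample in 39 40 55 "n/a"; do
  echo "$sample -> $(classify_temp "$sample")"
done
```

Checking the critical level first means a drive at 55C or above is reported once as "shutdown" rather than also as "alert"; the original script deliberately does both, so it still sends the warning email before the machine goes down.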

November 21, 2007 - Posted by | Tech
