:: 04-Dec-2006 23:49 GMT (Monday) ::
Our stats server, Fritz, is currently offline due to its ongoing RAID
issues. Although the machine is actually back online right now, we
have the webpages turned off until we finish making some more tweaks.
For the technically interested, the problem appears to be one of the
following:
1) Four of our WDC hard drives (SATA model WD2000JB) may be affected
by a timeout issue related to thermal calibration, or by a lack of
TLER (Time Limited Error Recovery).
Western Digital claims the problem only affects certain older ATA
drives (but ours are SATA) http://lnk.nu/wdc.custhelp.com/c6c.php
3Ware confirms that the ATA version of our model number is affected
(but not necessarily the SATA version). http://lnk.nu/3ware.com/c6d.aspx
There is a drive firmware update, but it is only available for ATA
drives. We opened support tickets with 3Ware and WDC more than a
week ago and are still waiting for responses.
2) Physical drive failure. We’ve already had all of the drives RMA’ed
at least once when we first started having these problems, so we
don’t believe there is a physical failure in the normal sense. The
drives report no errors after a reboot.
3) Motherboard compatibility with our RAID controller. We have a Tyan
S2882 motherboard, but 3Ware’s compatibility page for the
9550SX-8LP says only Tyan S2880 and S2885 are “officially”
supported. http://lnk.nu/3ware.com/c6e.pdf We don’t think this is
a very likely cause, though.
4) FreeBSD updates. We’re currently on FreeBSD 6.0 stable, but 6.1
stable has some additional 3Ware driver updates, so tonight we will
be upgrading to that. http://lnk.nu/freebsd.org/c6f.html
5) 3Ware RAID firmware updates. We updated to the latest firmware a
couple of weeks ago, before this most recent outage, so the
firmware alone is not a fix.
6) 3Ware RAID controller. Several months ago we tried replacing the
RAID controller with a slightly different 3Ware model to see if
that would affect things, but the problem persisted.
We’ve also just recently purchased a KVM-over-IP solution to allow us
to remotely manage the machine if it becomes inaccessible over the
network. Unfortunately, this most recent failure wedged the OS,
preventing even a keyboard-initiated reboot from working.
If we don’t get any further responses from WDC or 3Ware, our next
possible option is to go out and buy 4 new 200GB+ SATA drives from
another manufacturer and see if that improves things.
We might also try moving some of the drives (containing the OS and
swap) to the onboard RAID controller and see if that keeps the OS
from going down when the data volume goes down.
Thanks for your patience!