staff blogs

distributed.net staff keep (relatively) up-to-date logs of their activities in .plan files. These were traditionally available via finger, but we've put them on the web for easier consumption.

2005/11/19

nugget [19-Nov-2005 @ 12:32]

Filed under: stats @ 12:32 UTC

:: 19-Nov-2005 12:32 GMT (Saturday) ::

We made good progress this morning in diagnosing the problems with the
stats server. As Decibel mentioned last night, we started seeing random
read errors when pulling data off the drives. Hashing the PostgreSQL
backup file (10GB) with SHA1 or MD5 twice in a row would never yield
the same digest. Quite creepy to see.
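
The repeated-hash check described above can be sketched roughly as
follows. This is a minimal illustration, not the commands we actually
ran: the function names file_digest and reads_are_stable are made up
for this example, and it assumes the file is large enough (or the
cache cold enough) that each pass really reads from the drives rather
than from the OS page cache.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so a multi-GB file never sits in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def reads_are_stable(path, passes=2, algorithm="sha1"):
    """Hash the same file several times (hypothetical helper).

    Returns (stable, digests), where stable is True only if every pass
    produced an identical digest.
    """
    digests = [file_digest(path, algorithm) for _ in range(passes)]
    return len(set(digests)) == 1, digests
```

On healthy hardware every pass should agree. A mismatch between
passes over an unchanged file points at the read path (drive, cable,
controller, or memory) rather than at the file itself.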

At first we thought we might be dealing with an OS issue, since we’d
taken this downtime as a good opportunity to upgrade the server from
FreeBSD 5.x to 6.0-STABLE, so we got a little sidetracked debugging
UFS2 and newfs options (which we’d also experimented with during the
restore). In that experimenting, Leto managed to ferret out a weird
bug in FreeBSD 6 where the system will panic if you copy a large
directory structure to a drive which has been tuned with a large
average filesize parameter. (We sent PR amd64/89202 to the FreeBSD team.)

http://www.freebsd.org/cgi/query-pr.cgi?pr=amd64/89202

Once we moved past that, though, we were still facing the weird read
errors. This morning I nicked two drives out of the raid10 volume (which
was empty anyway) and plugged them into a spare 9500S card that we’ve
got on hand. We’re unable to repro the read errors off that card, which
would seem to indicate that the problem is indeed the old 3Ware 8506.

Sadly, the 9500S card is only the four-port model, so we can’t just
swap it in and start using it. We’ll have to order a new card for
the stats server.

I’m quite encouraged that we seem to have isolated the problem to the
controller card. It’s under warranty, but it’s a depot repair and
the vendor won’t just cross-ship us a replacement. We’ll have to
order a new card if we want to get the server back up and running in
a reasonable amount of time.