staff blogs

distributed.net staff keep (relatively) up-to-date logs of their activities in .plan files. These were traditionally available via finger, but we've put them on the web for easier consumption.

2005-11-30

decibel [30-Nov-2005 @ 00:21]

Filed under: stats @ 00:21 +00:00

:: 30-Nov-2005 00:21 GMT (Wednesday) ::

Well… when it rains…

Nov 30 05:39:02 fritz kernel: twa0: INFO: (0x04: 0x000b): Rebuild started: unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0026): Drive ECC error reported: port=5, unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x002d): Source drive error occurred: unit=1, port=5
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0004): Rebuild failed: unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0002): Degraded unit: unit=1, port=3
Nov 30 05:51:47 fritz kernel: twa0: INFO: (0x04: 0x000b): Rebuild started: unit=1

In plain English… another drive has failed. I’ve heard it’s common for drives
from the same manufacturing run to all fail at the same time; I guess this is
proof.
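For anyone who wants to watch their own logs for this kind of thing, here is a minimal sketch of pulling the error events out of kernel messages like the ones quoted above. The pattern is inferred only from those six lines, so other twa driver messages may not fit it. The function names and the field names `source` and `code` (for the two hex values) are made up for illustration; the post does not say what those values mean.

```python
import re

# Pattern inferred from the log lines quoted in this post only.
# "source" and "code" are hypothetical names for the two hex values.
TWA_LINE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<device>\w+):\s+(?P<severity>\w+):\s+"
    r"\((?P<source>0x[0-9a-fA-F]+):\s+(?P<code>0x[0-9a-fA-F]+)\):\s+"
    r"(?P<message>[^:]+):\s+(?P<details>.*)$"
)


def parse_twa_line(line):
    """Return a dict describing one log line, or None if it does not match."""
    match = TWA_LINE.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    # Turn "port=5, unit=1" into {"port": 5, "unit": 1}.
    fields = {}
    for pair in event.pop("details").split(","):
        key, sep, value = pair.strip().partition("=")
        if sep:
            fields[key] = int(value) if value.isdigit() else value
    event["fields"] = fields
    return event


def error_events(lines):
    """Keep only the parsed events whose severity is ERROR."""
    events = (parse_twa_line(line) for line in lines)
    return [e for e in events if e and e["severity"] == "ERROR"]
```

Feeding the six lines above through `error_events` would leave the four ERROR entries, each with its unit and port numbers available in `fields`.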

I’m going to turn stats back on again, but I highly recommend you not make any
changes to team or participant information until this is all cleared up. It is
very possible that we will end up losing the entire array again, which right
now would mean reverting to a backup that could be days (or possibly even
weeks, depending on how long this takes) old.

We’ve already RMA’d two 200GB drives. Once those come back it shouldn’t be much
of an issue for us to deal with drive failures, since we’ll have some spares
on hand. I’m also going to set up replication of critical data so that even if
we do lose the database again, loss of user-modified data should be minimal.

Thanks for your patience.

2005-11-29

decibel [29-Nov-2005 @ 10:21]

Filed under: stats @ 10:21 +00:00

:: 29-Nov-2005 10:21 GMT (Tuesday) ::

UD seems to still be working the kinks out of their new data center. In the
meantime, stats and cvs are down…

2005-11-28

decibel [28-Nov-2005 @ 13:29]

Filed under: stats @ 13:29 +00:00

:: 28-Nov-2005 13:29 GMT (Monday) ::

In case anyone didn’t notice… stats are back. :)

2005-11-23

decibel [23-Nov-2005 @ 23:05]

Filed under: stats @ 23:05 +00:00

:: 23-Nov-2005 23:05 GMT (Wednesday) ::

23:02 <+dctievent> (statsbox-iv/r72) Daily processing for 20051106 has completed

As soon as fritz is moved back into a datacenter we should be all set. In the
meantime, it’s playing catchup.

2005-11-22

nugget [22-Nov-2005 @ 20:31]

Filed under: stats @ 20:31 +00:00

:: 22-Nov-2005 20:31 GMT (Tuesday) ::

The new raid controller for statsbox arrived today (3Ware 9550SX-8) and
I’ve got it plugged in and running. Everything looks great so far,
although the “SX” series cards are a bit new for FreeBSD stable and we’ll
have to tapdance a bit on startup to get the proper twa driver loaded. I
see that the driver version we need was committed to FreeBSD current
about two weeks ago, so the awkwardness should be short-lived; I’d
expect an MFC into stable before too long.

The universe just keeps piling on, though, and one of the new 300GB
drives we bought died today while I was trying to initialize the
RAID10 volume. I ran to Fry’s to pick up a new, new drive and this
one seems fine. Right now I’m working on moving the contents of the
200GB RAID1 system volume (the OS and home directories) onto a new
300GB mirror made from two of the new drives. This will give us an
extra 100GB to play around with in our home directories, which ought
to be nice. Once I’ve verified that the system volume has copied to
the 300GB drives I’ll wipe the old ones and rebuild the RAID10
(database) volume from the six remaining 200GB drives.

I should have all that wrapped up by tomorrow, which means we’ll be
in a position to restore the stats database backup and kick off the
catchup runs from all the keymaster log files that have been piling
up during this downtime.

Thanks again for your patience and understanding as we bring stats
back to life. Hopefully this means we’ll have gotten the next few
years’ worth of problems out of the way all in this one massive crash.

Moo.

2005-11-19

nugget [19-Nov-2005 @ 12:32]

Filed under: stats @ 12:32 +00:00

:: 19-Nov-2005 12:32 GMT (Saturday) ::

We made good progress this morning in diagnosing the problems with the
stats server. As Decibel mentioned last night, we started seeing random
read errors when pulling data off the drives. Hashing the PostgreSQL
backup file (10GB) with SHA1 or MD5 twice in a row would never yield the
same result. Quite creepy to see.
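The repeat-hash check is easy to reproduce. What follows is a minimal sketch, not the commands that were actually run on fritz (those used the command-line hash tools). The function names are made up for illustration. One assumption matters here: the file has to be large enough, or the cache cold enough, that each pass really reads from the disks rather than from memory.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, read in 1 MiB chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def distinct_digests(path, runs=2, algorithm="sha1"):
    """Hash the same file `runs` times and return the set of digests seen.

    On healthy hardware the set has exactly one element. More than one
    means the bytes coming back from storage changed between reads.
    """
    return {file_digest(path, algorithm) for _ in range(runs)}
```

If `distinct_digests` returns more than one value for a file nobody is writing to, the storage path (drive, cable, controller, or memory) is returning different data on different reads.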

At first we thought we might be dealing with an OS issue, since we’d
taken this downtime as a good opportunity to upgrade the server from
FreeBSD 5.x to 6.0-STABLE, so we got a little sidetracked debugging
UFS2 and newfs options (which we’d also experimented with during the
restore). In that experimenting, Leto managed to ferret out a weird
bug in FreeBSD 6 where the system will panic if you copy a large
directory structure to a drive which has been tuned with a large
average filesize parameter. (Sent PR amd64/89202 to the FreeBSD team)

http://www.freebsd.org/cgi/query-pr.cgi?pr=amd64/89202

Once we moved past that, though, we were still facing the weird read
errors. This morning I nicked two drives out of the raid10 volume (which
was empty anyway) and plugged them into a spare 9500S card that we’ve
got on hand. We’re unable to repro the read errors off that card, which
would seem to indicate that the problem is indeed the old 3Ware 8506.

Sadly, the 9500S card is only the four-port model, so we can’t just
swap it in and start using it. We’ll have to order a new card for
the stats server.

I’m quite encouraged that we seem to have isolated the problem to the
controller card. It’s under warranty, but it’s a depot repair and
the vendor won’t just cross-ship us a replacement. We’ll have to
order a new card if we want to get the server back up and running in
a reasonable amount of time.

2005-11-18

decibel [18-Nov-2005 @ 17:09]

Filed under: stats @ 17:09 +00:00

:: 18-Nov-2005 17:09 GMT (Friday) ::

Can someone tell me what’s wrong with this picture?

decibel@fritz.1[16:52]~:60>sha1 fritz-20051002.sql.bz2
SHA1 (fritz-20051002.sql.bz2) = 6b3bb0796f7025fc243b2bfe8e9ec8b2c661045b
decibel@fritz.1[16:55]~:61>sha1 fritz-20051002.sql.bz2
SHA1 (fritz-20051002.sql.bz2) = c012f152f05d5e33a88e027948d5e267e7003e2b
decibel@fritz.1[17:01]~:62>

In a nutshell: fritz is throwing random errors when reading from either drive
array. Obviously not a feature one looks for in a database server. I suspect
it’s the 3ware controller, but we’ll need to do more testing to find out.

The machine is also being moved this weekend, probably on Sunday. Between the
hardware issues and the move, people probably shouldn’t expect stats to be back
up until next week at the earliest.

Also, stats were inadvertently turned back on last night. Unfortunately, any
modifications that were made last night will be lost. So, for example, if you
created a team, or changed some of your participant information last night,
that will be gone when we come back.

2005-11-17

decibel [17-Nov-2005 @ 09:17]

Filed under: stats @ 09:17 +00:00

:: 17-Nov-2005 09:17 GMT (Thursday) ::

Here’s the situation so far with stats:

Thanks to poor driver support, we had been running for who knows how long with
3 failing drives in the raid10 array that housed the database. But that wasn’t
actually what caused the outage… if a machine with an 8500 in it goes down
unexpectedly (think power failure), the controller can’t trust the data on the
drives to be in-sync, so it needs to rebuild the array. Unfortunately, one
of the drives it picked to be authoritative was failing, and decided that it
wasn’t going to give up its data.

Unfortunately we’ve been unable to recover the array. We tried using SpinRite
as a last resort, but at the rate it was going it would have taken something
like a week to recover the drive. This means that when we get back online,
we’ll be running from a stats backup taken Nov. 6, about 4 days before the
failure. Any changes made to participant accounts or teams in the meantime will
have been lost.

In an ironic twist of fate, we’ve been working on getting a new machine in
production that would have allowed replicating user-modifiable tables (i.e.,
participant accounts and teams) to another machine. Had that been in place we
would have lost very little, if any, of this data.

The current situation is that we’ve bought 3 new drives and used them to
rebuild the array. We’ve also taken this opportunity to upgrade to FreeBSD 6.0.
But now any time we try to access the array, the machine reboots.

Once someone is on-site to investigate we’ll hopefully know more.

2005-11-13

chrisj [13-Nov-2005 @ 12:33]

Filed under: stats @ 12:33 +00:00

:: 13-Nov-2005 12:33 GMT (Sunday) ::

fritz (statsbox) is currently offline due to issues with the drive controller.
We’re working to bring it back up as soon as possible, but in the meantime,
we’re keeping stats offline while we make sure the hardware is ok.

All work is still being counted. The stats site will catch up once we bring it
online again.

Stay tuned for further updates!

2005-11-11

decibel [11-Nov-2005 @ 10:50]

Filed under: stats @ 10:50 +00:00

:: 11-Nov-2005 10:50 GMT (Friday) ::

More drive issues with fritz, so stats are off for right now…
