:: 08-Dec-2005 01:43 GMT (Thursday) ::
Stats are back online again. Sorry for all the recent downtime.
distributed.net staff keep (relatively) up-to-date logs of their activities in .plan files. These were traditionally available via finger, but we've put them on the web for easier consumption.
:: 07-Dec-2005 15:07 GMT (Wednesday) ::
Fritz is looking happy again. I’m running a vacuum of the entire database to
make sure PostgreSQL is happy as well. Once that’s done I’ll turn stats back
on.
:: 06-Dec-2005 19:05 GMT (Tuesday) ::
*sigh*
Got a background fsck failure on /usr which I wasn’t able to handle remotely.
My attempt ended up rendering the box off the net, so we’re now stuck until
someone can get to the console, which might well be tomorrow. Oops.
Sorry for the continued delay…
:: 06-Dec-2005 13:46 GMT (Tuesday) ::
Replacement drives are finally here. We’re working on getting a backup before
doing the RAID rebuild, which is why stats are down. They should hopefully be
back up in time for statsrun.
:: 30-Nov-2005 00:21 GMT (Wednesday) ::
Well… when it rains…
Nov 30 05:39:02 fritz kernel: twa0: INFO: (0x04: 0x000b): Rebuild started: unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0026): Drive ECC error reported: port=5, unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x002d): Source drive error occurred: unit=1, port=5
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0004): Rebuild failed: unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0002): Degraded unit: unit=1, port=3
Nov 30 05:51:47 fritz kernel: twa0: INFO: (0x04: 0x000b): Rebuild started: unit=1
In plain English… another drive has failed. I’ve heard it’s common for drives
from the same manufacturing run to all fail at the same time; I guess this is
proof.
I’m going to turn stats back on again, but I highly recommend you not make any
changes to team or participant information until this is all cleared up. It is
very possible that we will end up losing the entire array again, which right
now would mean reverting to a backup that could be days (or possibly even
weeks, depending on how long this takes) old.
We’ve already RMA’d two 200G drives. Once those come back it shouldn’t be much
of an issue for us to deal with drive failures, since we’ll have some spares
on-hand. I’m also going to set up replication of critical data so that even if
we do lose the database again loss of user-modified data should be minimal.
Thanks for your patience.
:: 29-Nov-2005 10:21 GMT (Tuesday) ::
UD seems to still be working the kinks out of their new data center. In the
meantime, stats and cvs are down…
:: 28-Nov-2005 13:29 GMT (Monday) ::
In case anyone didn’t notice… stats are back. :)
:: 23-Nov-2005 23:05 GMT (Wednesday) ::
23:02 <+dctievent> (statsbox-iv/r72) Daily processing for 20051106 has
completed
As soon as fritz is moved back into a datacenter we should be all set. In the
meantime, it’s playing catchup.
:: 18-Nov-2005 17:09 GMT (Friday) ::
Can someone tell me what’s wrong with this picture?
decibel@fritz.1[16:52]~:60>sha1 fritz-20051002.sql.bz2
SHA1 (fritz-20051002.sql.bz2) = 6b3bb0796f7025fc243b2bfe8e9ec8b2c661045b
decibel@fritz.1[16:55]~:61>sha1 fritz-20051002.sql.bz2
SHA1 (fritz-20051002.sql.bz2) = c012f152f05d5e33a88e027948d5e267e7003e2b
decibel@fritz.1[17:01]~:62>
In a nutshell: fritz is throwing random errors when reading from either drive
array. Obviously not a feature one looks for in a database server. I suspect
it’s the 3ware controller, but we’ll need to do more testing to find out.
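For anyone who wants to run the same kind of check on their own hardware, here
is a rough sketch in Python of what the transcript above does by hand: hash the
same file several times and see whether the digests agree. The function names
and the number of passes are made up for illustration; this is not the tooling
used on fritz. Note that the OS may serve repeat reads from its cache, so a
file much larger than RAM is probably needed to actually exercise the disks.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read until read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def distinct_digests(path, passes=3):
    """Hash the file `passes` times and return the set of digests seen.

    `passes` is an arbitrary illustrative default. More than one digest
    for an unchanged file suggests the read path is returning bad data.
    """
    return {sha1_of_file(path) for _ in range(passes)}
```

If `distinct_digests("fritz-20051002.sql.bz2")` comes back with more than one
entry while nothing is writing to the file, something between the platters and
the application is corrupting reads.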
The machine is also being moved this weekend, probably on Sunday. Between the
hardware issues and the move, people probably shouldn’t expect stats to be back
up until next week at the earliest.
Also, stats were inadvertently turned back on last night. Unfortunately, any
modifications that were made last night will be lost. So, for example, if you
created a team, or changed some of your participant information last night,
that will be gone when we come back.
:: 17-Nov-2005 09:17 GMT (Thursday) ::
Here’s the situation so far with stats:
Thanks to poor driver support, we had been running for who knows how long with
3 failing drives in the raid10 array that housed the database. But that wasn’t
actually what caused the outage… if a machine with an 8500 in it goes down
unexpectedly (think power failure), the controller can’t trust the data on the
drives to be in-sync, so it needs to rebuild the array. Unfortunately, one
of the drives it picked to be authoritative was failing, and decided that it
wasn’t going to give up its data.
Unfortunately we’ve been unable to recover the array. We tried using spinrite
as a last resort, but at the rate it was going it would have taken something
like a week to recover the drive. This means that when we get back online,
we’ll be running from a stats backup taken Nov. 6, about 4 days before the
failure. Any changes made to participant accounts or teams in the meantime will
have been lost.
In an ironic twist of fate, we’ve been working on getting a new machine in
production that would have allowed replicating user-modifiable tables (i.e.,
participant accounts and teams) to another machine. Had that been in place we
would have lost very little, if any, of this data.
The current situation is that we’ve bought 3 new drives and used them to
rebuild the array. We’ve also taken this opportunity to upgrade to FreeBSD 6.0.
But now any time we try to access the array, the machine reboots.
Once someone is on-site to investigate we’ll hopefully know more.