staff blogs

distributed.net staff keep (relatively) up-to-date logs of their activities in .plan files. These were traditionally available via finger, but we've put them on the web for easier consumption.

2005-11-30

decibel [30-Nov-2005 @ 00:21]

Filed under: stats — decibel @ 00:21 +00:00

:: 30-Nov-2005 00:21 GMT (Wednesday) ::

Well… when it rains…

Nov 30 05:39:02 fritz kernel: twa0: INFO: (0x04: 0x000b): Rebuild started: unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0026): Drive ECC error reported: port=5, unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x002d): Source drive error occurred: unit=1, port=5
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0004): Rebuild failed: unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0002): Degraded unit: unit=1, port=3
Nov 30 05:51:47 fritz kernel: twa0: INFO: (0x04: 0x000b): Rebuild started: unit=1

In plain English… another drive has failed. I’ve heard it’s common for drives
from the same manufacturing run to all fail at the same time; I guess this is
proof.
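
For anyone who wants to pull the same picture out of a log like the one above, here is a minimal sketch in Python. The line format is inferred only from the six lines quoted in this post, not from any controller documentation, and the names `parse_line` and `error_ports` are made up for illustration.

```python
import re

# The six kernel log lines quoted in this post.
SAMPLE = """\
Nov 30 05:39:02 fritz kernel: twa0: INFO: (0x04: 0x000b): Rebuild started: unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0026): Drive ECC error reported: port=5, unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x002d): Source drive error occurred: unit=1, port=5
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0004): Rebuild failed: unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0002): Degraded unit: unit=1, port=3
Nov 30 05:51:47 fritz kernel: twa0: INFO: (0x04: 0x000b): Rebuild started: unit=1
""".splitlines()

# Assumed layout: "<date> <host> kernel: <dev>: <LEVEL>: (<src>: <code>): <message>: k=v, k=v"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ [\d:]{8}) (?P<host>\S+) kernel: (?P<dev>\w+): "
    r"(?P<level>INFO|ERROR): \((?P<source>0x[0-9a-fA-F]+): (?P<code>0x[0-9a-fA-F]+)\): "
    r"(?P<message>[^:]+): (?P<fields>.*)$"
)

def parse_line(line):
    """Return a dict describing one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    event = m.groupdict()
    fields = {}
    for pair in event["fields"].split(","):
        key, _, value = pair.strip().partition("=")
        # Keep numeric values as ints; leave anything else as text.
        fields[key] = int(value) if value.isdigit() else value
    event["fields"] = fields
    return event

def error_ports(lines):
    """Sorted list of port numbers that appear in ERROR-level lines."""
    ports = set()
    for line in lines:
        event = parse_line(line)
        if event and event["level"] == "ERROR" and "port" in event["fields"]:
            ports.add(event["fields"]["port"])
    return sorted(ports)
```

Calling `error_ports(SAMPLE)` lists the ports named in the error lines, which is a quick way to see which drives the controller is complaining about. It does not tell you which of them has actually failed; that still takes reading the messages themselves.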

I’m going to turn stats back on again, but I highly recommend you not make any
changes to team or participant information until this is all cleared up. It is
very possible that we will end up losing the entire array again, which right
now would mean reverting to a backup that could be days (or possibly even
weeks, depending on how long this takes) old.

We’ve already RMA’d two 200G drives. Once those come back it shouldn’t be much
of an issue for us to deal with drive failures, since we’ll have some spares
on-hand. I’m also going to set up replication of critical data so that even if
we do lose the database again, loss of user-modified data should be minimal.

Thanks for your patience.