staff blogs

distributed.net staff keep (relatively) up-to-date logs of their activities in .plan files. These were traditionally available via finger, but we've put them on the web for easier consumption.

2006-12-17

bovine [17-Dec-2006 @ 05:16]

Filed under: keyservers — bovine @ 05:16 +00:00

:: 17-Dec-2006 05:16 GMT (Sunday) ::

There was a planned power outage today that required us to temporarily
shut down our primary webserver (www.distributed.net) and our
keymaster. Power has been restored, and all backlogged work that was
buffered during the outage should be getting processed now.

Since we didn’t get much early warning about this outage, we couldn’t
get an announcement out beforehand. Thanks for your patience. Moo!

2006-12-13

bovine [13-Dec-2006 @ 06:57]

Filed under: stats — bovine @ 06:57 +00:00

:: 13-Dec-2006 06:57 GMT (Wednesday) ::

Today 4 new hard drives were installed into Fritz and the RAID array
has been successfully rebuilt. The new drives reportedly have TLER, so
hopefully drives will no longer drop out of the RAID. Additionally, we
are hoping that the newer FreeBSD kernel will not freeze whenever the
RAID controller resets itself.

Stats should now be online and accessible, though the server is still
re-processing the backlog of data from the time it was offline.
Hopefully it will be fully caught up in a few hours.

Thanks for your patience! Keep crunching!

2006-12-04

bovine [04-Dec-2006 @ 23:49]

Filed under: stats — bovine @ 23:49 +00:00

:: 04-Dec-2006 23:49 GMT (Monday) ::

Our stats server, Fritz, is currently offline due to its ongoing RAID
issues. Although the machine is actually back online right now, we
have the webpages turned off until we finish making some more tweaks.

For the technically interested, the problem appears to be one of the
following:

1) Four of the WDC hard drives (SATA model WD2000JB) we have may be
affected by a timeout issue related to thermal calibration, or by a
lack of TLER (Time Limited Error Recovery).

Western Digital claims the problem only affects certain older ATA
drives (but ours are SATA). http://lnk.nu/wdc.custhelp.com/c6c.php
3Ware confirms the problem for the ATA version of our model number
(but not necessarily SATA). http://lnk.nu/3ware.com/c6d.aspx

There is a drive firmware update, but it is only available for ATA
drives. We opened support tickets with 3Ware and WDC more than a
week ago and are still waiting for responses.

2) Physical drive failure. We’ve already had all of the drives RMA’ed
at least once when we first started having these problems, so we
don’t believe there is a physical failure in the normal sense. The
drives report no errors after a reboot.

3) Motherboard compatibility with our RAID controller. We have a Tyan
S2882 motherboard, but 3Ware’s compatibility page for the
9550SX-8LP says only Tyan S2880 and S2885 are “officially”
supported. http://lnk.nu/3ware.com/c6e.pdf We don’t think this is
a very likely cause, though.

4) FreeBSD updates. We’re currently on FreeBSD 6.0 stable, but 6.1
stable has some additional 3Ware driver updates, so tonight we will
be upgrading to that. http://lnk.nu/freebsd.org/c6f.html

5) 3Ware RAID firmware updates. We already updated to the latest
firmware a couple of weeks ago, prior to this most recent outage, so
the firmware alone is not a fix.

6) 3Ware RAID controller. Several months ago we tried replacing the
RAID controller with a slightly different 3Ware model to see if
that would affect things, but the problem persisted.

We’ve also just recently purchased a KVM-over-IP solution to allow us
to remotely manage the machine if it becomes inaccessible over the
network. Unfortunately, this most recent failure wedged the OS,
preventing even a keyboard-initiated reboot from working.

If we don’t get any further responses from WDC or 3Ware, our next
possible option is to go out and buy 4 new 200GB+ SATA drives from
another manufacturer and see if that improves things.

We might also try moving some of the drives (containing the OS and
swap) to the onboard RAID controller and see if that can prevent the
OS from going down when the data volume goes down.

Thanks for your patience!

2006-12-02

bovine [02-Dec-2006 @ 21:04]

Filed under: keyservers — bovine @ 21:04 +00:00

:: 02-Dec-2006 21:04 GMT (Saturday) ::

Our fullserver in Australia, proxy1.bris.qld.au.proxy.distributed.net,
has changed IP addresses and the server that was running at the old
address will be shut down in a few days. If you have not hard-coded
IP addresses into your config files, then you should be fine and
unaffected by this address change.
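
For anyone unsure whether their configuration is affected, here is a
minimal sketch of a check for literal IP addresses among a list of
configured server entries. It is only an illustration: the function
names are hypothetical, it does not read the client's actual config
file format, and it assumes each entry is a bare host with no port
suffix.

```python
import ipaddress


def is_hardcoded_ip(host: str) -> bool:
    """Return True if host is a literal IPv4/IPv6 address, not a hostname."""
    try:
        ipaddress.ip_address(host.strip())
        return True
    except ValueError:
        # Not parseable as an address, so treat it as a DNS name.
        return False


def find_hardcoded_ips(hosts):
    """Return the entries (hypothetical config values) that are literal IPs."""
    return [h for h in hosts if is_hardcoded_ip(h)]
```

Entries flagged by a check like this would need to be changed to the
hostname so that they follow future address changes through DNS.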

Also worth noting: earlier this week on Thursday, our keymaster server
was relocated to a new physical location. This planned move took only
a couple hours and was completed successfully without impacting
operations, due to the fully buffered nature of our proxy network.
The only effect was a brief gap in our keyrate graphing during the
time, and a surge once the keymaster was restarted.
http://stats.distributed.net/keyrate.php?project_id=25
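
To illustrate why a buffered proxy layer shows up as a gap followed by
a surge, here is a rough sketch of the general store-and-forward idea.
This is not our actual proxy code; the class, its methods, and the
upstream_send callable are all made up for the example.

```python
from collections import deque


class BufferingProxy:
    """Hypothetical store-and-forward proxy: queue results, deliver in order."""

    def __init__(self, upstream_send):
        # upstream_send is an assumed callable that raises ConnectionError
        # while the upstream server (e.g. a keymaster) is unreachable.
        self._send = upstream_send
        self._backlog = deque()

    def submit(self, result):
        """Accept a result from a client, then try to deliver the backlog."""
        self._backlog.append(result)
        self.flush()

    def flush(self):
        """Deliver queued results oldest-first; stop at the first failure.

        Returns the number of results delivered by this call.
        """
        delivered = 0
        while self._backlog:
            try:
                self._send(self._backlog[0])
            except ConnectionError:
                break  # upstream still down; keep everything queued
            self._backlog.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self):
        """Number of results still waiting for the upstream server."""
        return len(self._backlog)
```

While the upstream is down, submit() keeps accepting work and pending
grows, so clients never notice. The first successful flush() after a
restart delivers the whole backlog at once, which is the kind of surge
seen on the keyrate graph.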