staff blogs

distributed.net staff keep (relatively) up-to-date logs of their activities in .plan files. These were traditionally available via finger, but we've put them on the web for easier consumption.

2007-02-26

lefty [26-Feb-2007 @ 22:06]

Filed under: stats @ 22:06 +00:00

:: 26-Feb-2007 22:06 GMT (Monday) ::

Hello cows!

I’m now on board to help hack stats code. I’ll also be assisting with manual
retires while the current retire system is overhauled.

2007-01-06

chrisj [06-Jan-2007 @ 22:29]

Filed under: stats @ 22:29 +00:00

:: 06-Jan-2007 22:29 GMT (Saturday) ::

Done toying with stats for the night. If anyone notices anything wrong, drop me
an email.

chrisj [06-Jan-2007 @ 20:09]

Filed under: stats @ 20:09 +00:00

:: 06-Jan-2007 20:09 GMT (Saturday) ::

I’m going to be doing some updates on fritz over the next couple of hours.

Apologies in advance if I have to take stats offline for a bit.

2006-12-13

bovine [13-Dec-2006 @ 06:57]

Filed under: stats @ 06:57 +00:00

:: 13-Dec-2006 06:57 GMT (Wednesday) ::

Today 4 new hard drives were installed into Fritz and the RAID array
has been successfully rebuilt. The new drives reportedly have TLER, so
the problem of drives dropping out of the RAID should hopefully not
occur anymore. Additionally, we are hoping that the newer FreeBSD
kernel will not freeze whenever the RAID controller resets itself.

Stats should now be online and accessible, though the server is still
re-processing the backlog of data from the time it was offline.
It should be fully caught up in a few hours.

Thanks for your patience! Keep crunching!

2006-12-04

bovine [04-Dec-2006 @ 23:49]

Filed under: stats @ 23:49 +00:00

:: 04-Dec-2006 23:49 GMT (Monday) ::

Our stats server, Fritz, is currently offline due to its ongoing RAID
issues. Although the machine is actually back online right now, we
have the webpages turned off until we finish making some more tweaks.

For the technically interested, the problem appears to be one of the
following:

1) Four of the WDC hard drives (SATA model WD2000JB) we have are
suspected of being affected by a timeout issue related to
thermal calibration, or a lack of TLER (Time Limited Error Recovery).

Western Digital claims the problem only affects certain older ATA
drives (but ours are SATA): http://lnk.nu/wdc.custhelp.com/c6c.php
3Ware confirms the problem for the ATA version of our model number
(but not necessarily the SATA version): http://lnk.nu/3ware.com/c6d.aspx

There is a drive firmware update, but it is only available for ATA
drives. We opened support tickets with 3Ware and WDC more than a
week ago and are still waiting for responses.

2) Physical drive failure. We’ve already had all of the drives RMA’ed
at least once when we first started having these problems, so we
don’t believe there is a physical failure in the normal sense. The
drives report no errors after a reboot.

3) Motherboard compatibility with our RAID controller. We have a Tyan
S2882 motherboard, but 3Ware’s compatibility page for the
9550SX-8LP says only Tyan S2880 and S2885 are “officially”
supported. http://lnk.nu/3ware.com/c6e.pdf We don’t think this is
a very likely cause, though.

4) FreeBSD updates. We’re currently on FreeBSD 6.0 stable, but 6.1
stable has some additional 3Ware driver updates, so tonight we will
be upgrading to that. http://lnk.nu/freebsd.org/c6f.html

5) 3Ware RAID firmware updates. We already updated to the latest
firmware a couple of weeks ago, prior to this most recent outage, so
the firmware alone is not a fix.

6) 3Ware RAID controller. Several months ago we tried replacing the
RAID controller with a slightly different 3Ware model to see if
that would affect things, but the problem persisted.

We’ve also just recently purchased a KVM-over-IP solution to allow us
to remotely manage the machine if it becomes inaccessible over the
network. Unfortunately, this most recent failure wedged the OS,
preventing even a keyboard-initiated reboot from working.

If we don’t get any further responses from WDC or 3Ware, our next
possible option is to go out and buy 4 new 200GB+ SATA drives from
another manufacturer and see if that improves things.

We might also try moving some of the drives (containing the OS and
swap) to the onboard RAID controller to see if that keeps the OS
from going down when the data volume goes down.

Thanks for your patience!

2006-11-14

chrisj [14-Nov-2006 @ 11:39]

Filed under: stats @ 11:39 +00:00

:: 14-Nov-2006 11:39 GMT (Tuesday) ::

Stats are back up again.

Apologies for the extended down-time, folks.

2006-11-05

chrisj [05-Nov-2006 @ 17:07]

Filed under: stats @ 17:07 +00:00

:: 05-Nov-2006 17:07 GMT (Sunday) ::

As you’ve no doubt all noticed, stats has gone down. Again.

It’s looking like fritz is having some more drive troubles. We’re working as
fast as we can to get the box back online and stable again.

As usual, all work is being logged, and will be credited when the site is back
online again.

Apologies for the extended down-time.

2006-10-30

chrisj [30-Oct-2006 @ 16:08]

Filed under: stats @ 16:08 +00:00

:: 30-Oct-2006 16:08 GMT (Monday) ::

At the risk of sounding like a broken record, stats are down again. More
information when we know more.

2006-09-30

chrisj [30-Sep-2006 @ 12:05]

Filed under: stats @ 12:05 +00:00

:: 30-Sep-2006 12:05 GMT (Saturday) ::

Fritz is back up again. We’re in the process of catching the database up now.

Thanks for your patience.

2006-09-28

chrisj [28-Sep-2006 @ 15:44]

Filed under: stats @ 15:44 +00:00

:: 28-Sep-2006 15:44 GMT (Thursday) ::

It looks like statsbox has suffered another drive failure. Unfortunately, it’s
looking like it will be a few days before anyone can go out and diagnose it.

We apologise for the downtime again. All work will be credited as soon as
statsbox is back online again.
