staff blogs

distributed.net staff keep (relatively) up-to-date logs of their activities in .plan files. These were traditionally available via finger, but we've put them on the web for easier consumption.

2005-11-22

nugget [22-Nov-2005 @ 20:31]

Filed under: stats — nugget @ 20:31 +00:00

:: 22-Nov-2005 20:31 GMT (Tuesday) ::

The new RAID controller for statsbox arrived today (3Ware 9550SX-8) and
I’ve got it plugged in and running. Everything looks great so far,
although the “SX” series cards are a bit new for FreeBSD stable and we’ll
have to tap-dance a bit on startup to get the proper twa driver loaded. I
see that the driver version we need was committed to FreeBSD current
about two weeks ago, so the awkwardness should be short-lived; I’d
expect an MFC into stable before too long.

The universe just keeps piling on, though, and one of the new 300GB
drives we bought died today while I was trying to initialize the
RAID10 volume. I ran to Fry’s to pick up a new, new drive and this
one seems fine. Right now I’m working on moving the contents of the
200GB RAID1 system volume (the OS and home directories) onto a new
300GB mirror made from two of the new drives. This will give us an
extra 100GB to play around with in our home directories, which ought
to be nice. Once I’ve verified that the system volume has copied to
the 300GB drives I’ll wipe the old ones and rebuild the RAID10
(database) volume from the six remaining 200GB drives.
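
The capacity arithmetic behind that plan can be sketched roughly in Python.
This is only an illustration: it assumes plain two-way mirroring (RAID1 as a
single mirrored pair, RAID10 as striped mirrored pairs), ignores filesystem
overhead, and the helper name usable_gb is made up for this example.

```python
def usable_gb(level: str, drive_gb: int, drives: int) -> int:
    """Rough usable capacity in GB, assuming two-way mirroring (hypothetical helper)."""
    if level == "RAID1":
        if drives != 2:
            raise ValueError("this sketch only models a two-drive mirror")
        # A mirror stores the same data on both drives, so one drive's worth is usable.
        return drive_gb
    if level == "RAID10":
        if drives < 4 or drives % 2 != 0:
            raise ValueError("RAID10 needs an even number of drives, at least four")
        # Striped mirrored pairs: half of the raw capacity is usable.
        return drive_gb * drives // 2
    raise ValueError(f"unsupported level: {level}")


old_system = usable_gb("RAID1", 200, 2)   # the existing 200GB system mirror
new_system = usable_gb("RAID1", 300, 2)   # the new 300GB system mirror
database = usable_gb("RAID10", 200, 6)    # six remaining 200GB drives

print(f"system volume gains {new_system - old_system}GB")
print(f"database volume: {database}GB usable")
```

Under those assumptions the system volume grows by the 100GB mentioned above,
and the six-drive RAID10 volume comes out at about 600GB usable.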

I should have all that wrapped up by tomorrow, which means we’ll be
in a position to restore the stats database backup and kick off the
catchup runs from all the keymaster log files that have been piling
up during this downtime.

Thanks again for your patience and understanding as we bring stats
back to life. Hopefully this means we’ll have gotten the next few
years’ worth of problems out of the way all in this one massive crash.

Moo.

2005-11-19

nugget [19-Nov-2005 @ 12:32]

Filed under: stats — nugget @ 12:32 +00:00

:: 19-Nov-2005 12:32 GMT (Saturday) ::

We made good progress this morning in diagnosing the problems with the
stats server. As Decibel mentioned last night, we started seeing random
read errors when pulling data off the drives. Running a SHA1 or MD5 hash
over the PostgreSQL backup file (10GB) twice in a row would never yield
the same hash. Quite creepy to see.
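
The same kind of check can be scripted. What follows is a minimal sketch in
Python, not the procedure the staff actually used (they ran the sha1 command
by hand); the function names file_digest and reads_are_consistent are
invented for this example.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha1", chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large backups need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def reads_are_consistent(path: str, passes: int = 2, algorithm: str = "sha1") -> bool:
    """Hash the same file several times; healthy storage gives one distinct digest."""
    digests = {file_digest(path, algorithm) for _ in range(passes)}
    return len(digests) == 1
```

Calling reads_are_consistent on an unchanged file should return True on
healthy hardware. One caveat: the operating system may serve repeated reads
from its cache instead of the disk, so a check like this is most telling on
files too large to stay cached.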

At first we thought we might be dealing with an OS issue, since we’d
taken this downtime as a good opportunity to upgrade the server from
FreeBSD 5.x to 6.0-STABLE, so we got a little sidetracked debugging
UFS2 and newfs options (which we’d also experimented with during the
restore). In that experimenting, Leto managed to ferret out a weird
bug in FreeBSD 6 where the system will panic if you copy a large
directory structure to a drive which has been tuned with a large
average filesize parameter. (We sent PR amd64/89202 to the FreeBSD team.)

http://www.freebsd.org/cgi/query-pr.cgi?pr=amd64/89202

Once we moved past that, though, we were still facing the weird read
errors. This morning I nicked two drives out of the RAID10 volume (which
was empty anyway) and plugged them into a spare 9500S card that we’ve
got on hand. We’re unable to repro the read errors off that card, which
would seem to indicate that the problem is indeed the old 3Ware 8506.

Sadly, the 9500S card is only the four-port model, so we can’t just
swap it in and start using it; we’ll have to order a new card for
the stats server.

I’m quite encouraged that we seem to have isolated the problem to the
controller card. It’s under warranty, but it’s a depot repair and
the vendor won’t just cross-ship us a replacement. We’ll have to
order a new card if we want to get the server back up and running in
a reasonable amount of time.

2005-11-18

decibel [18-Nov-2005 @ 17:09]

Filed under: stats — decibel @ 17:09 +00:00

:: 18-Nov-2005 17:09 GMT (Friday) ::

Can someone tell me what’s wrong with this picture?

decibel@fritz.1[16:52]~:60>sha1 fritz-20051002.sql.bz2
SHA1 (fritz-20051002.sql.bz2) = 6b3bb0796f7025fc243b2bfe8e9ec8b2c661045b
decibel@fritz.1[16:55]~:61>sha1 fritz-20051002.sql.bz2
SHA1 (fritz-20051002.sql.bz2) = c012f152f05d5e33a88e027948d5e267e7003e2b
decibel@fritz.1[17:01]~:62>

In a nutshell, fritz is throwing random errors when reading from either drive
array. Obviously not a feature one looks for in a database server. I suspect
it’s the 3ware controller, but we’ll need to do more testing to find out.

The machine is also being moved this weekend, probably on Sunday. Between the
hardware issues and the move, people probably shouldn’t expect stats to be back
up until next week at the earliest.

Also, stats were inadvertently turned back on last night. Unfortunately, any
modifications that were made last night will be lost. So, for example, if you
created a team, or changed some of your participant information last night,
that will be gone when we come back.

2005-11-17

decibel [17-Nov-2005 @ 09:17]

Filed under: stats — decibel @ 09:17 +00:00

:: 17-Nov-2005 09:17 GMT (Thursday) ::

Here’s the situation so far with stats:

Thanks to poor driver support, we had been running for who knows how long with
3 failing drives in the raid10 array that housed the database. But that wasn’t
actually what caused the outage… if a machine with an 8500 in it goes down
unexpectedly (think power failure), the controller can’t trust the data on the
drives to be in-sync, so it needs to rebuild the array. Unfortunately, one
of the drives it picked to be authoritative was failing, and decided that it
wasn’t going to give up its data.

Unfortunately, we’ve been unable to recover the array. We tried using SpinRite
as a last resort, but at the rate it was going it would have taken something
like a week to recover the drive. This means that when we get back online,
we’ll be running from a stats backup taken Nov. 6, about 4 days before the
failure. Any changes made to participant accounts or teams in the meantime will
have been lost.

In an ironic twist of fate, we’ve been working on getting a new machine in
production that would have allowed replicating user-modifiable tables (i.e.,
participant accounts and teams) to another machine. Had that been in place we
would have lost very little, if any, of this data.

The current situation is that we’ve bought 3 new drives and used them to
rebuild the array. We’ve also taken this opportunity to upgrade to FreeBSD 6.0.
But now any time we try to access the array, the machine reboots.

Once someone is on-site to investigate we’ll hopefully know more.

2005-11-13

chrisj [13-Nov-2005 @ 12:33]

Filed under: stats — chrisj @ 12:33 +00:00

:: 13-Nov-2005 12:33 GMT (Sunday) ::

fritz (statsbox) is currently offline due to issues with the drive controller.
We’re working to bring it back up as soon as possible, but in the meantime,
we’re keeping stats offline while we make sure the hardware is ok.

All work is still being counted. The stats site will catch up once we bring it
online again.

Stay tuned for further updates!

2005-11-11

decibel [11-Nov-2005 @ 10:50]

Filed under: stats — decibel @ 10:50 +00:00

:: 11-Nov-2005 10:50 GMT (Friday) ::

More drive issues with fritz, so stats are off for right now…

2005-11-09

floppus [09-Nov-2005 @ 18:40]

Filed under: Uncategorized — floppus @ 18:40 +00:00

:: 09-Nov-2005 18:40 GMT (Wednesday) ::

An issue with the script that allows for fetching and flushing via e-mail was
resolved. Previously, users would always receive 24 packets when requesting
OGR-P2 work, no matter how much work they requested.

If you are not familiar with fetching and flushing via e-mail,
and would like more information, please visit our help pages at:
http://www.distributed.net/docs/tutor_netopt.php#no_email

2005-11-01

bovine [01-Nov-2005 @ 23:14]

Filed under: stats — bovine @ 23:14 +00:00

:: 01-Nov-2005 23:14 GMT (Tuesday) ::

Although the stats website is still accessible, stats are not currently
being updated because we are investigating some possible
hardware issues that were noticed after a recent power failure.
Hopefully once we are confident about the status of the box, we will
resume stats updates again. Don’t worry: no data loss is
expected. :)

2005-10-18

nugget [18-Oct-2005 @ 18:51]

Filed under: Uncategorized — nugget @ 18:51 +00:00

:: 18-Oct-2005 18:51 GMT (Tuesday) ::

I got a request a few weeks back for a distributed.net shirt that wasn’t white
or grey. Cafepress can’t accommodate that need since they only do digital
transfer printing, which is impractical on darker-colored items. Hackerthreads
wanted a (fairly) large commitment for quantity, so I went looking for
alternatives.

I’m pleased to announce that we’ve got a handful of actual screen-transfer
shirts available from spreadshirt.com now — both with and without slogans on
the back. These are high-quality plot printing transfers, which will not fade.

http://dnetware.spreadshirt.com/

You can never have too much cow swag.

2005-09-23

nugget [23-Sep-2005 @ 21:00]

Filed under: Uncategorized — nugget @ 21:00 +00:00

:: 23-Sep-2005 21:00 GMT (Friday) ::

Thanks to ODD, the remains of oldnodezero.distributed.net arrived this
evening via FedEx. If anyone’s curious, I snapped a few photos at
http://slacker.com/photos/oldnodezero/ — Rockin’ AMD K6-2 power!

In other news, I brought the ledger up to current here on the site
and we’ve gone ahead and ordered a more modern (Opteron) replacement
box which will get prepped and shipped out to visi.com next week or
the week thereafter. We decided to go ahead and spend a bit more than
we might have otherwise done (about three grand, all told) since history
would indicate that we can expect to be using this replacement server
until sometime in 2012. If nothing else, the new box’s keyrate will
be a lot faster than that old K6.

Writing a cheque for three grand is always a bit uncomfortable, so if
you’ve ever wanted to pick up a slick distributed.net t-shirt, today
would be the day. With the RC5 projects getting ridiculously huge, we’re
going to be relying more on member support to keep things running in
the coming years.

http://distributed.net/dnetware/

Moo.
