staff blogs

distributed.net staff keep (relatively) up-to-date logs of their activities in .plan files. These were traditionally available via finger, but we've put them on the web for easier consumption.

2016-08-24

Time for a new IRC home

Filed under: Uncategorized — nugget @ 01:50 +00:00

TL;DR — Join us in #distributed on irc.freenode.net

29-Aug-1997 18:00 #rc5 topic is Has anyone seen my keys? -- http://rc5.distributed.net/
29-Aug-1997 18:00 #rc5 topic set by BovineOne on Fri Aug 29 18:38:59
29-Aug-1997 18:00 #rc5 created on Fri Apr 04 11:02:59

In early 1997 distributed.net became an actual thing entirely within the context of an Internet Relay Chat channel, #rc5 on the EFnet IRC network. Aside from the occasional “beerfest” meetup, the vast majority of this project’s coordination, victory celebrations, and commiserations have been shared on IRC. IRC has been the foundation for everything we’ve accomplished as a community.

<_GNU_> Lets set up a irc.distributed.net and put the channel there! :P

Some time around 2000 we split from EFnet and created our own public IRC network, mainly so that we could start encrypting the IRC traffic, a worthwhile goal for a security- and encryption-focused group of geeks. Encrypted IRC was brand-new and barely supported at the time. Our IRC network has continued to serve us well for the past fifteen years, but looking around, it's clear the landscape has changed significantly. Encryption is supported everywhere, and proper channel and nick services bots are ubiquitous. It's become more difficult to justify the effort of running our own network, and our reliance on the generosity of our hosting partners to do so.

It makes sense to re-join a “proper” IRC network and benefit from that scale and attention to operations. So… Effective immediately, the official support and community channel for distributed.net can be found on the freenode IRC network, and irc.distributed.net will be shutting down very soon.

We’ll be forever grateful to Paul Followell at LightBound and FlightAware for their server space and network time. Many thanks to the developers of UnrealIRCd for a decade of secure and reliable server code. And also thanks to everyone at freenode for welcoming us to our new home.

PING PONG /CTCP SOUND moo.wav

2006-09-06

nugget [06-Sep-2006 @ 15:52]

Filed under: stats — nugget @ 15:52 +00:00

:: 06-Sep-2006 15:52 GMT (Wednesday) ::

Statsbox recovery is proceeding as expected. We’ve got the hardware all sorted
out and now we’re just doing one last quick backup of the postgresql database
prior to starting the raid10 rebuild (currently degraded).

Thanks again for your patience while we replace yet another of these old SATA
drives.

2006-04-11

nugget [11-Apr-2006 @ 16:36]

Filed under: stats — nugget @ 16:36 +00:00

:: 11-Apr-2006 16:36 GMT (Tuesday) ::

I made good progress on statsbox today and I think we’ve finally found the
fundamental problem that keeps taking drive 8 offline. Each of the 9
drive bays in fritz’s case has a little hotswap backplane board which
connects to the drive’s SATA and power connectors on the front, and to
the case power supply and SATA cable on the back side. It looks like
cable tension for the bundle of cables for the last three bays has been
pulling down on those three cables and loosening the connection between
the SATA cable and the backplane board. The cables for all three bays
are really, really loose and bay 7 even has broken plastic.

Here’s the guts of the machine, if you want to see what I mean:
http://slacker.com/photos/computers/inside

And here’s a closeup of the last three bay connectors (this is logically
“upside-down” from the first picture, looking at the back of the left-most
drive bays, under the optical drive):
http://slacker.com/photos/computers/fritzbackplane

Since we’re only using 8 of the 9 bays, I shuffled the drives around to
avoid the worst connector, and I also re-routed the cables so that they’d
be pulled up instead of from below, to best compensate for the looseness.
I’m talking with the vendor to see about replacing the dodgy backplane
boards. Since each bay has its own board, I’m optimistic that we’ll be
able to buy just three of them for cheap and hook them up.

I also hooked up our new 3Ware battery backup unit to the 9550SX RAID
controller. This thing had been on backorder for months, and they’re
finally hitting the marketplace. The battery has to test for 24 hours,
but after that we’ll be able to finally turn on write-caching, which
should really speed things up.

I still need to swap out the two failed drive fans in the front of the
case, too, but I no longer think they’re a factor in the crashing.

The RAID10 volume is rebuilding and so far no drives have dropped offline:

Unit  UnitType  Status      %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
---------------------------------------------------------------------------
u0    RAID-1    OK          -      -       279.387   OFF    OFF      OFF
u1    RAID-10   REBUILDING  68     64K     558.762   OFF    OFF      OFF

Port  Status    Unit  Size       Blocks
---------------------------------------
p0    OK        u0    279.46 GB  586072368
p1    OK        u0    279.46 GB  586072368
p2    OK        u1    232.88 GB  488397168
p3    OK        u1    186.31 GB  390721968
p4    OK        u1    186.31 GB  390721968
p5    OK        u1    232.88 GB  488397168
p6    OK        u1    186.31 GB  390721968
p7    DEGRADED  u1    186.31 GB  390721968

Name  OnlineState  BBUReady  Status   Volt  Temp  Hours  LastCapTest
--------------------------------------------------------------------
bbu   On           No        Testing  OK    OK    0      xx-xxx-xxxx
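While a rebuild like this runs, the thing to watch is whether any port leaves the OK state. Here is a hypothetical Python helper for that; it is not part of any 3Ware tool, and it assumes the port listing keeps the column order shown above (port name, status, unit, size, blocks) with whitespace-separated fields.

```python
def degraded_ports(status_text: str) -> list[tuple[str, str]]:
    """Return (port, status) pairs for every port whose status is not OK.

    Hypothetical helper: assumes data rows start with a port name like
    p0, p1, ... followed by the status column. Header lines, dash rulers,
    unit rows (u0, u1) and the battery row (bbu) are skipped.
    """
    problems = []
    for line in status_text.splitlines():
        fields = line.split()
        # Only rows whose first field is "p" followed by digits are ports.
        if len(fields) < 2 or not fields[0].startswith("p") or not fields[0][1:].isdigit():
            continue
        port, status = fields[0], fields[1]
        if status != "OK":
            problems.append((port, status))
    return problems
```

Fed the listing above, this would flag only p7, the port still marked DEGRADED while the volume rebuilds.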

Thanks again for your patience during the significant downtime we’ve
had recently. I’m really hopeful that we’ve figured it out and will
be able to stabilize things really soon.

2006-03-27

nugget [27-Mar-2006 @ 11:54]

Filed under: stats — nugget @ 11:54 +00:00

:: 27-Mar-2006 11:54 GMT (Monday) ::

I’ve got statsbox back up and online, and the raid10 volume is
currently rebuilding. It looks like the drive tray fans on
drives 7 and 8 have stopped working, which may be the source
of the problem. All those SATA drives are crammed in close
together and perhaps the drive weirdness we’ve seen lately is
the result of heat issues from the failing fans.

I’ve got the stats website shut off for now while the volume
rebuilds and Decibel can get a chance to nose around and make
sure that all the data looks sane.

EDIT: crap, drive 7 just disconnected itself again.

2005-12-29

nugget [29-Dec-2005 @ 17:03]

Filed under: stats — nugget @ 17:03 +00:00

:: 29-Dec-2005 17:03 GMT (Thursday) ::

You’ve probably noticed that statsbox is offline yet again. We don’t know
what’s happened to it. Unfortunately, the two people who have access to the
data center are both unavailable due to the holidays and we’re not sure when
we’ll be able to get in there to take a look.

In related news, we’re planning to relocate statsbox to a new (better) location
sometime in January.

2005-11-22

nugget [22-Nov-2005 @ 20:31]

Filed under: stats — nugget @ 20:31 +00:00

:: 22-Nov-2005 20:31 GMT (Tuesday) ::

The new raid controller for statsbox arrived today (3Ware 9550SX-8) and
I’ve got it plugged up and running. Everything looks great so far,
although the “SX” series cards are a bit new for FreeBSD stable and we’ll
have to tapdance a bit on startup to get the proper twa driver loaded. I
see that the driver version we need was committed to FreeBSD current
about two weeks ago, so the awkwardness should be short-lived; I'd
expect an MFC into stable before too long.

The universe just keeps piling on, though, and one of the new 300GB
drives we bought died today while I was trying to initialize the
RAID10 volume. I ran to Fry’s to pick up a new, new drive and this
one seems fine. Right now I’m working on moving the contents of the
200GB RAID1 system volume (the OS and home directories) onto a new
300GB mirror made from two of the new drives. This will give us an
extra 100GB to play around with in our home directories, which ought
to be nice. Once I’ve verified that the system volume has copied to
the 300GB drives I’ll wipe the old ones and rebuild the RAID10
(database) volume from the six remaining 200GB drives.
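The capacity arithmetic behind this shuffle can be sketched under a simplified model that ignores controller metadata: a mirror is only as large as its smallest member, and RAID10 stripes across mirrored pairs, so n drives yield n/2 times the smallest drive. The function names below are illustrative, not from any RAID tool.

```python
def raid1_usable(sizes_gb):
    """Usable size of a RAID1 mirror: every member holds a full copy,
    so capacity is limited by the smallest drive."""
    if len(sizes_gb) < 2:
        raise ValueError("RAID1 needs at least two drives")
    return min(sizes_gb)


def raid10_usable(sizes_gb):
    """Usable size of RAID10 (a stripe across mirrored pairs): half the
    drives hold copies, and each stripe member is limited by the
    smallest drive in the set."""
    n = len(sizes_gb)
    if n < 4 or n % 2:
        raise ValueError("RAID10 needs an even number of drives, at least four")
    return (n // 2) * min(sizes_gb)


# Moving the system volume from a 200GB mirror to a 300GB mirror:
print(raid1_usable([300, 300]) - raid1_usable([200, 200]))  # 100
# Rebuilding the database volume from six 200GB drives:
print(raid10_usable([200] * 6))  # 600
```

As a rough cross-check against the later status listing in this log, its six-drive RAID10 unit has a smallest member of 186.31 GB, and 3 × 186.31 = 558.93 GB, close to the 558.762 GB the controller reports; the small gap is presumably space the controller keeps for itself.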

I should have all that wrapped up by tomorrow, which means we’ll be
in a position to restore the stats database backup and kick off the
catchup runs from all the keymaster log files that have been piling
up during this downtime.

Thanks again for your patience and understanding as we bring stats
back to life. Hopefully this means we’ll have gotten the next few
years’ worth of problems out of the way all in this one massive crash.

Moo.

2005-11-19

nugget [19-Nov-2005 @ 12:32]

Filed under: stats — nugget @ 12:32 +00:00

:: 19-Nov-2005 12:32 GMT (Saturday) ::

We made good progress this morning in diagnosing the problems with the
stats server. As Decibel mentioned last night, we started seeing random
read errors when pulling data off the drives. Running a SHA1 or MD5 hash
of the PostgreSQL backup file (10GB) twice in a row would never yield
the same hash both times. Quite creepy to see.
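A minimal Python sketch of that consistency check (the function names are hypothetical): it hashes a file twice in fixed-size chunks with `hashlib`, so even a 10GB file never has to fit in memory, and reports whether the digests agree. On healthy hardware they always match. One caveat: if the file fits in the OS page cache, the second pass may be served from RAM rather than the disk, so the test is most meaningful on a file larger than available memory.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large backups are streamed, not loaded."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def reads_are_stable(path, passes=2, algorithm="sha1"):
    """Hash the same file several times. Identical bytes must give an
    identical digest, so more than one distinct digest means the data
    coming back from the disk or controller changed between reads."""
    digests = {file_digest(path, algorithm) for _ in range(passes)}
    return len(digests) == 1
```

Calling `reads_are_stable(path)` on the backup file would have returned False on the failing controller described here.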

At first we thought we might be dealing with an OS issue, since we’d
taken this downtime as a good opportunity to upgrade the server from
FreeBSD 5.x to 6.0-STABLE, so we got a little sidetracked debugging
UFS2 and newfs options (which we’d also experimented with during the
restore). In that experimenting, Leto managed to ferret out a weird
bug in FreeBSD 6 where the system will panic if you copy a large
directory structure to a drive which has been tuned with a large
average filesize parameter. (Sent PR amd64/89202 to the FreeBSD team)

http://www.freebsd.org/cgi/query-pr.cgi?pr=amd64/89202

Once we moved past that, though, we were still facing the weird read
errors. This morning I nicked two drives out of the raid10 volume (which
was empty anyway) and plugged them into a spare 9500S card that we’ve
got on hand. We’re unable to repro the read errors off that card, which
would seem to indicate that the problem is indeed the old 3Ware 8506.

Sadly, the 9500S card is only the four port model, so we can’t just
swap it in and start using it; we’ll have to order a new card for
the stats server.

I’m quite encouraged that we seem to have isolated the problem to the
controller card. It’s under warranty, but it’s a depot repair and
the vendor won’t just cross-ship us a replacement. We’ll have to
order a new card if we want to get the server back up and running in
a reasonable amount of time.

2005-10-18

nugget [18-Oct-2005 @ 18:51]

Filed under: Uncategorized — nugget @ 18:51 +00:00

:: 18-Oct-2005 18:51 GMT (Tuesday) ::

I got a request a few weeks back for a distributed.net shirt that wasn’t white
or grey. Cafepress can’t accommodate that need since they only do digital
transfer printing, which is impractical on darker-colored items. Hackerthreads
wanted a (fairly) large commitment for quantity, so I went looking for
alternatives.

I’m pleased to announce that we’ve got a handful of actual screen-transfer
shirts available from spreadshirt.com now — both with and without slogans on
the back. These are high-quality plot printing transfers, which will not fade.

http://dnetware.spreadshirt.com/

You can never have too much cow swag.

2005-09-23

nugget [23-Sep-2005 @ 21:00]

Filed under: Uncategorized — nugget @ 21:00 +00:00

:: 23-Sep-2005 21:00 GMT (Friday) ::

Thanks to ODD, the remains of oldnodezero.distributed.net arrived this
evening via FedEx. If anyone’s curious, I snapped a few photos at
http://slacker.com/photos/oldnodezero/ — Rockin’ AMD K6-2 power!

In other news, I brought the ledger up to current here on the site
and we’ve gone ahead and ordered a more modern (Opteron) replacement
box which will get prepped and shipped out to visi.com next week or
the week thereafter. We decided to go ahead and spend a bit more than
we might have otherwise done (about three grand, all told) since history
would indicate that we can expect to be using this replacement server
until sometime in 2012. If nothing else, the new box’s keyrate will
be a lot faster than that old K6.

Writing a cheque for three grand is always a bit uncomfortable, so if
you’ve ever wanted to pick up a slick distributed.net t-shirt, today
would be the day. With the RC5 projects getting ridiculously huge, we’re
going to be relying more on member support to keep things running in
the coming years.

http://distributed.net/dnetware/

Moo.

2005-09-14

nugget [14-Sep-2005 @ 16:08]

Filed under: Uncategorized — nugget @ 16:08 +00:00

:: 14-Sep-2005 16:08 GMT (Wednesday) ::

A moment of silence for oldnodezero…

859 days ago (check my .plan) the UPS sitting behind our server
oldnodezero.distributed.net exploded in an exciting boom described even
today as “the great UPS cataclysm” by visi.com staff. The UPS explosion
dashed the spiffy 761-day uptime on the server.

Today the box finally died, setting a new record 859-day uptime. Attempts
to resuscitate the box have not been successful. It died with page fault
warnings on the console and now it won’t even POST. ODD is going to swing
by visi.com tomorrow to pick up the carcass and ship it down here to Austin
for forensics.

It really says a lot about a colo facility when they can provide that kind
of stability though. Visi has been babysitting this box since we put it
into production in 1998. It’s the first piece of hardware that we ever
bought for distributed.net and it’s been in productive use ever since.

It’s an AT case. Pre-ATX. An AMD K6 with a gigantic 8GB IDE hard drive.
Garage-built consumer junk, but it sure did hold up well. We’ll be
replacing it with a more modern 1U rackmount machine soon. In the
meantime though, we’re down to just the single server for dns and web.

Thanks again to visi.com who have consistently provided peerless colo
services. I’m sure if we manage to beat 859 days this time around, the
eventual failure will again be something beyond their control.