staff blogs

distributed.net staff keep (relatively) up-to-date logs of their activities in .plan files. These were traditionally available via finger, but we've put them on the web for easier consumption.

1999-12-21

nugget [21-Dec-1999 @ 21:15]

Filed under: Uncategorized @ 21:15 +00:00

:: 21-Dec-1999 21:21 (Tuesday) ::

Looks like things on statsbox are healthy. We had to re-create some
indexes, but everything else checks out ok. Thanks for your patience
as we hammered on the bits.

So, stats are up. Far more importantly, though, the mac client is out! :)

http://n0cgi.distributed.net/cgi/dnet-finger.cgi?user=moose

nugget [21-Dec-1999 @ 05:27]

Filed under: Uncategorized @ 05:27 +00:00

:: 21-Dec-1999 05:39 (Tuesday) ::

During tonight’s stats run things got extremely unhealthy during the
creation of the rc5-64 overall email rankings table. Sybase consumed
all its locks and several processes became deadlocked and refused to
die gracefully. (Heck, they wouldn’t even die screaming).

We spent a few hours poking and prodding, trying to convince the blocking
process to go away, to no avail. Unfortunately, the only recourse at that
point was to down the whole server in a rather indelicate manner (not
quite a kill -9, but pretty close).

Now that the server is back online, we’re sifting through the data,
trying to determine the scope of the damage (if any) and get a feel
for the situation.

More info as the situation progresses…

1999-12-03

nugget [03-Dec-1999 @ 04:42]

Filed under: Uncategorized @ 04:42 +00:00

:: 03-Dec-1999 04:45 (Friday) ::

Well, just when it looked like things were in good shape.. :) Just can’t
win today, I guess.

As you can see, stats are nearly back to full operation. It’s been a busy
36 or so hours catching up the rc5-64 data, but we’re back on track there.
The only little hitch from today is that the csc stats for 2-Dec are
incomplete. For some reason the 23:00 through 23:59 logfile never made
it across from the keymaster.

Normally I’d just re-run the day’s stats, but rc5-64 stats are running
now and I’m reluctant to try running both at the same time. So…
the numbers for yesterday are ~1/24th too low, and I’ll re-process with
the missing logfile in the morning.

Moo.

1999-12-01

nugget [01-Dec-1999 @ 21:19]

Filed under: Uncategorized @ 21:19 +00:00

:: 01-Dec-1999 21:24 (Wednesday) ::

Well, this is most excellent news. We’ve isolated the slowdown to
a single php page on the stats site, and a single change made to
the code on that php page.

We had recently made a modification to the /rc5-64/psearch.php queries
that pulled the search data from a different table than it normally
pulls from. For some reason, this is running much, much slower than
it should and is causing the incredible slowdown while stats are
up and available.

When we remove access to this one page (rc5-64/psearch) stats seem
to run just fine and dandy. So, at this point, the worst-case scenario
we’re looking at is not being able to do a participant search while the
daily processing is running. At least until we can figure out why
the query is running so slowly.

I’ve already rebuilt the relevant indexes and that hasn’t helped any,
so there’s still more debugging we need to do. The good news is that
since we know what has to happen to repair the performance, I can
go ahead and start catching up the rc5-64 data to bring it current.

More details as the situation progresses.

nugget [01-Dec-1999 @ 05:01]

Filed under: Uncategorized @ 05:01 +00:00

:: 01-Dec-1999 05:04 (Wednesday) ::

Well, that certainly took a lot longer than any of us anticipated!

We’ve created a new rc5_64_master table, but there are still a few minor
issues with it, and the data is completely unaudited against the old table.
However, it’s late, I’ve done too many consecutive “up till 3am” nights,
and I don’t think I’ll be awake much longer tonight.

So, until tomorrow, I’ve just caught up the csc stats and brought the web
server back up. rc5 stats are still wonky and off, unfortunately. But,
we may be close. If the new data looks ok and the indexing runs smoothly
tomorrow, we may find ourselves in very good shape.

Thanks, as always, for all your patience and kind words and emails. We’ll
have this situation resolved as soon as is possible.

1999-11-30

nugget [30-Nov-1999 @ 15:50]

Filed under: Uncategorized @ 15:50 +00:00

:: 30-Nov-1999 15:51 (Tuesday) ::

Just a quick update: the rebuild of the rc5_64_master table is churning.
We’re at about 14 million rows (of 20 million) and it’s been running for
about 12 hours or so.
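
For a rough sense of how much longer that leaves, here is a minimal sketch
in Python. It assumes the copy proceeds at a constant rate, which the
rebuild may well not do, and the function name is purely illustrative and
not part of any stats tooling.

```python
def estimate_remaining_hours(rows_done, rows_total, hours_elapsed):
    """Linear extrapolation of remaining time, assuming a constant copy rate."""
    if rows_done <= 0 or hours_elapsed <= 0:
        raise ValueError("need positive progress and elapsed time to extrapolate")
    rows_per_hour = rows_done / hours_elapsed
    return (rows_total - rows_done) / rows_per_hour


# Figures from the update above: about 14 million of 20 million rows in about 12 hours.
remaining = estimate_remaining_hours(14_000_000, 20_000_000, 12.0)
print(f"roughly {remaining:.1f} hours of copying left")
```

This covers only the row copy. Generating new indexes on the rebuilt
table comes afterwards and is not included in the estimate.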

1999-11-29

nugget [29-Nov-1999 @ 21:19]

Filed under: Uncategorized @ 21:19 +00:00

:: 29-Nov-1999 21:22 (Monday) ::

Thanks to some very dedicated snooping on the part of Bruce Wilson, our
“hired gun” transact-sql guru, we’ve got a somewhat better handle on the
problems with statsbox. The good news is, we now have a very long list
of things that we know /aren’t/ the problem. :)

The bad news is, we still don’t know what the problem is. We’ve finally
decided that the next step is to rebuild the big rc5_64_master table
from the ground up. This involves copying 20 million records or so, and
then generating new indexes on that data. I suspect it will take a non-
trivial number of hours to churn through this data, so please be patient
as it progresses. To speed things as much as possible, you’ll see that
the stats web site has been disabled while it runs.

Once the table is rebuilt, we’ll have a much better feel for where we stand
on the slowdown issue.

1999-11-27

nugget [27-Nov-1999 @ 03:05]

Filed under: Uncategorized @ 03:05 +00:00

:: 27-Nov-1999 03:10 (Saturday) ::

We’ve attracted quite a number of offers to assist in debugging the Sybase
problems. So many, in fact, that I haven’t had a chance to respond to
all of them. Two very talented guys have started going over the code
and Sybase configuration looking for oddities or suspicious configuration
items. So far, no serious red flags have turned up, but we have decided
on a good number of performance improvements that can be made.

I’ve spent the past 36 hours continuing to do all the normal repair and
check routines that I’m accustomed to doing: dropping and rebuilding indexes,
and doing checktables and checkdbs. The “dbcc checktable” on the rc5_64_master
table took just over 12 hours to complete!

Given the serious slowdown brought about by having web access enabled, I’ve
unfortunately had to conduct most of these repairs and checks with stats
disabled. I know it’s been frustrating. As of tonight’s run (yes, csc stats
are current) I’m still unable to do anything, even bcp data into the server,
if the web server is enabled. This fact makes it impractical to be running
the rc5-64 logs as it would require even more downtime from the server.

Rest assured, we’re still prodding and tweaking, looking for the cause of the
problem. Thanks, as always, for your continued patience and support.

1999-11-24

nugget [24-Nov-1999 @ 04:26]

Filed under: Uncategorized @ 04:26 +00:00

:: 24-Nov-1999 04:32 (Wednesday) ::

Well, statsbox is still extremely unhealthy. We’re seeing some very strange
behavior in Sybase, with cpu utilization far in excess of what we should be
seeing. Decibel and I are both at a loss as to the cause of the problem.

I guess at this point we’re seeking some peer review on the problem.
If you have practical experience in Sybase database administration and
transact-sql coding, please drop me a line. We’d like a few more sets
of eyes to take a look at this situation. I suspect mssql 6.5 experience
would be equally valuable.

We’ve gotten to the point where we’re not even sure if we can trust what’s
coming out of sp_sysmon.

1999-11-23

nugget [23-Nov-1999 @ 15:33]

Filed under: Uncategorized @ 15:33 +00:00

:: 23-Nov-1999 15:37 (Tuesday) ::

The 21-Nov RC5 stats run is about half-way through, and after this will
come the 22-Nov RC5 run. CSC stats are current and accurate; it’s just
RC5 that is affected at this point.

The trouble stems from the fact that (as decibel mentioned last night)
the box has developed an annoying habit. If apache is on and serving
requests, the stats run completely halts. If I turn off apache, it
runs full speed. So, in the interests of wrapping up the late runs, I’ve
got apache turned off.

(Hey, if nothing else, it’s been a good opportunity to upgrade to the
most current apache and php releases)

Once the 21-Nov update wraps up, I think I’ll reboot the box, although
I dunno what I expect that to accomplish. Can’t hurt, I suppose.

More details as the situation progresses.
