staff blogs

Staff keep (relatively) up-to-date logs of their activities in .plan files. These were traditionally available via finger, but we've put them on the web for easier consumption.


nugget [11-Apr-2006 @ 16:36]

Filed under: stats

I made good progress on statsbox today and I think we’ve finally found the
fundamental problem that keeps taking drive 8 offline. Each of the 9
drive bays in fritz’s case has a little hotswap backplane board which
connects to the drive’s SATA and power connectors on the front, and to
the case power supply and SATA cable on the back side. It looks like
cable tension for the bundle of cables for the last three bays has been
pulling down on those three cables and loosening the connection between
the SATA cable and the backplane board. The cables for all three bays
are really, really loose and bay 7 even has broken plastic.

Here’s the guts of the machine, if you want to see what I mean:

And here’s a closeup of the last three bay connectors (this is logically
“upside-down” from the first picture, looking at the back of the left-most
drive bays, under the optical drive):

Since we’re only using 8 of the 9 bays, I shuffled the drives around to
avoid the worst connector, and I also re-routed the cables so that they’d
be pulled up instead of from below, to best compensate for the looseness.
I’m talking with the vendor to see about replacing the dodgy backplane
boards. Since each bay has its own board, I’m optimistic that we’ll be
able to buy just three of them for cheap and hook them up.

I also hooked up our new 3Ware battery backup unit to the 9550SX RAID
controller. This thing had been on backorder for months, and they’re
finally hitting the marketplace. The battery has to run a 24-hour test, but
after that we'll finally be able to turn on write-caching, which should
really speed things up.
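For anyone curious about the mechanics, this is all done through 3ware's tw_cli utility. A rough sketch of the steps, assuming the controller shows up as /c0 and the RAID-10 unit is /u1 (adjust the IDs for your own setup):

```shell
# Check BBU state -- after the 24-hour capacity test, the BBUReady
# column should flip from "No" to "Yes"
tw_cli /c0/bbu show all

# Once the BBU is ready, it's safe to enable the unit's write cache
tw_cli /c0/u1 set cache=on

# Confirm the Cache column now reads ON
tw_cli /c0 show
```

The reason for waiting on the battery: with write-caching on, acknowledged writes sit in controller RAM, and only a healthy BBU keeps them from being lost in a power failure.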

I still need to swap out the two failed drive fans in the front of the
case, too, but I no longer think they’re a factor in the crashing.

The RAID10 volume is rebuilding and so far no drives have dropped offline:

Unit  UnitType  Status      %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
u0    RAID-1    OK          -      -       279.387   OFF    OFF      OFF
u1    RAID-10   REBUILDING  68     64K     558.762   OFF    OFF      OFF

Port  Status    Unit  Size       Blocks
p0    OK        u0    279.46 GB  586072368
p1    OK        u0    279.46 GB  586072368
p2    OK        u1    232.88 GB  488397168
p3    OK        u1    186.31 GB  390721968
p4    OK        u1    186.31 GB  390721968
p5    OK        u1    232.88 GB  488397168
p6    OK        u1    186.31 GB  390721968
p7    DEGRADED  u1    186.31 GB  390721968

Name  OnlineState  BBUReady  Status   Volt  Temp  Hours  LastCapTest
bbu   On           No        Testing  OK    OK    0      xx-xxx-xxxx
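As a sanity check on the Size column in that output: the numbers line up with the 512-byte block counts converted to binary gigabytes and truncated to two decimals (my inference from the figures, not anything the tw_cli docs spell out). A quick Python check:

```python
def blocks_to_gb(blocks, block_size=512):
    """Convert a 512-byte block count to binary GB, truncated to two
    decimal places (which appears to match the controller's Size column)."""
    gb = blocks * block_size / 2**30
    return int(gb * 100) / 100

# The three distinct drive sizes from the port listing above
print(blocks_to_gb(586072368))  # 279.46 (nominally 300 GB drives)
print(blocks_to_gb(488397168))  # 232.88 (nominally 250 GB drives)
print(blocks_to_gb(390721968))  # 186.31 (nominally 200 GB drives)
```

Note the usual marketing-vs-binary gap: a "200 GB" drive is 200 x 10^9 bytes, which comes out to only 186.31 binary gigabytes.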

Thanks again for your patience during the significant downtime we've
had recently. I'm hopeful that we've finally figured it out and will
be able to stabilize things soon.