Dusting off the Array! (Part 7)

What happened

Oh my, I really hope this is the final chapter of this fucking story… On Feb. 20th, 2018, one of the HGST disks failed (what a surprise!). Since the serial number reported by hdparm bears absolutely no resemblance to the serial number printed on the disk's label (thanks a bunch, HGST, BTW), I pulled the wrong disk, inserted the spare, started the rebuild and… lo and behold! The failing disk, still sitting in the array, kept failing, and my attempts to recover the RAID destroyed it completely.

The array was already kaputt by the time I thought of a way to identify the failing disk's slot:

# badblocks <device reported by the kernel>

This should light up the LED permanently on the external SATA casing.
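For the record, a sketch of that trick. On real hardware you would point badblocks at the device the kernel complained about; the device name `/dev/sde` in the comment is just an example, and the demo below runs the same read-only scan against a scratch image file so it works without a real disk:

```shell
# On real hardware:  badblocks -sv /dev/sde
# The sustained read-only scan keeps the activity LED of that slot lit,
# telling you which bay holds the suspect drive.
# Demonstrated here on a throwaway 4 MiB image file:
dd if=/dev/zero of=/tmp/led-demo.img bs=1M count=4 status=none
badblocks -sv /tmp/led-demo.img && echo "scan finished without errors"
```

A plain `-sv` run never writes to the drive, so it is safe even on a disk you still hope to recover.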

I was so fed up that I ordered four 2 TB SSDs shortly after that. Yesterday (Mar. 2nd, 2018) I finally had time to install them. Of course this setup has its quirks, too, but at least I can identify the disks via hdparm: the serial it reports is actually the serial on the label:

HDD1: SerialNo 1744197E67EE
HDD2: SerialNo 1744197E7B92
HDD3: SerialNo 1744197E7104
HDD4: SerialNo 1744197E836D
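The check itself is just hdparm's identify output. Since the real command needs the actual drive, the snippet below runs the same grep against a captured sample; the device name in the comment and the model string are illustrative, only the first serial is from my list above:

```shell
# Real command:  hdparm -I /dev/sde | grep 'Serial Number'
# Below, the same grep against a sample of hdparm's identify output:
sample='ATA device, with non-removable media
        Model Number:       Example 2TB SSD
        Serial Number:      1744197E67EE'
printf '%s\n' "$sample" | grep 'Serial Number'
```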

The Quirks

Of course it didn’t just work out of the box™. When I booted with the shiny new SSDs, hadante got stuck at the BIOS splash screen while HDD3 was showing a solid red light. I pulled HDD3 and HDD4, rebooted and got a login prompt. Since SATA is hot-pluggable, I then re-inserted HDD3 and HDD4, and fortunately they showed up on the SCSI bus (cries of joy!).

I created a RAID5 with:

# mdadm --create /dev/md1 --level=5 --raid-devices=4 /dev/sd[efgh]

and waited until today (Mar. 3rd, 2018) for the rebuild to finish. After that I tested the setup:
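To watch a rebuild, `cat /proc/mdstat` shows the recovery progress. A minimal sketch that pulls the percentage out of an mdstat status block; since the real file only exists while an array is rebuilding, the sample text below stands in for it, and all the numbers in it are made up:

```shell
# Real command:  cat /proc/mdstat   (or: watch cat /proc/mdstat)
# Sample rebuild status in the format mdstat prints (values invented):
mdstat='md1 : active raid5 sdh[4] sdg[2] sdf[1] sde[0]
      5860147200 blocks level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [=====>...............]  recovery = 27.9% (546012345/1953382400) finish=123.4min speed=190000K/sec'
printf '%s\n' "$mdstat" | grep -o 'recovery = [0-9.]*%'
```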

  • Power off hadante
  • Turn off the external casing
  • Wait about 30 seconds
  • Turn on the external casing and then hadante
  • Wait eagerly…

… and watch the kernel error messages scrolling down the screen 🙁

The solution

Note which drive is (not really) failing by staring at the LEDs of the external casing. Power off hadante and pull the failing drive and any other non-failing drive! It’s important to pull 2 drives, so the kernel cannot assemble the RAID! Then reboot and stop the broken RAID:

# mdadm -S /dev/md?

Now hot-plug the missing drives, reboot again and be amazed how everything magically works again 🙂
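To convince yourself the array really is healthy again, `mdadm --detail /dev/md1` should report a clean state with all four devices active and none failed. Sketched here against a sample of that output, since the real command needs the live array (the sample lines are assumptions about what a healthy /dev/md1 would show):

```shell
# Real command:  mdadm --detail /dev/md1
# Sample of the relevant lines from its output on a healthy array:
detail='          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0'
printf '%s\n' "$detail" | grep -E 'State|Failed'
```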

From my observations, the drives in the external bay stay recognized until you cut the power, but that’s just a guess.