Dusting off the Array! (Part 9)

Shit happens!

What shit happened? Well, of course the SSD RAID failed, too. Spontaneous massive existence failure. All non-backed-up data lost. Mostly because RAID5 makes data harder to recover, I decided to scrap the software RAID and go with single disks. Just the device mapper for encryption and one big logical volume per disk.

That went well for quite some time. An occasional SATA error, but nothing to worry about. Or so I thought… I could even zero out the unused disks with dd without getting errors. But recently (Oct. 31, 2018) even the single disk failed. No SATA errors, just a stalled rsync to a zeroed-out disk. That made me suspicious. The Crucial SSDs couldn’t be that bad!

So I decided to change the bus from eSATA to USB. I had tried that before, but not with the new external casing. The old one crashed hard when I tried to sync the RAID, so I didn’t think it was an option. But with single drives and a new casing I gave it another try.

What can I say? I guess it was a faulty cable or something. With USB it works perfectly fine! 4 days and running, not a single error. Throughput is as good as SATA, so I’m keeping my fingers crossed! I really hope that this is the end of this story…

Dusting off the Array! (Part 7)

What happened

Oh my, I really hope this is the final chapter of this fucking story… On Feb. 20th, 2018, one of the HGST disks failed (what a surprise!). Since the serial number reported by hdparm bears absolutely no resemblance to the serial number printed on the label of the disk (thanks a bunch HGST, BTW), I pulled the wrong disk, inserted the spare, started the rebuild and… Lo and behold! The failing disk still failed! My attempts to recover the RAID destroyed it completely.

It was already kaputt when I thought of a way to identify the failing disk’s slot:

# badblocks <device reported by the kernel>

This keeps the disk busy with a read-only scan, so its activity LED on the external SATA casing should stay lit while the scan runs.

I was so fed up that I ordered four 2 TB SSDs shortly after that. Yesterday (Mar. 2nd, 2018) I finally had time to install them. Of course this setup has its quirks, too, but at least I can identify the disks via hdparm. The serial reported is actually the serial on the label:

HDD1: SerialNo 1744197E67EE
HDD2: SerialNo 1744197E7B92
HDD3: SerialNo 1744197E7104
HDD4: SerialNo 1744197E836D
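
A minimal sketch of how such a device-to-serial list could be pulled together, so the right disk gets pulled next time. It assumes that `hdparm -I` prints a `Serial Number:` line and that the drives show up as /dev/sd[efgh], as in this setup. The function names are made up for illustration.

```shell
#!/bin/sh
# Extract the serial number from `hdparm -I` output read on stdin.
# Assumption: the output contains a line like "Serial Number:  XYZ".
serial_from_hdparm() {
    awk -F: '/Serial Number/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim surrounding whitespace
        print $2
        exit
    }'
}

# Print "<device> <serial>" for every device given (needs root and hdparm).
list_serials() {
    for dev in "$@"; do
        printf '%s %s\n' "$dev" "$(hdparm -I "$dev" 2>/dev/null | serial_from_hdparm)"
    done
}

# Example call for the four drives of this setup:
#   list_serials /dev/sd[efgh]
```

Comparing that output with the labels before pulling anything takes a minute and is a lot cheaper than rebuilding an array.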

The Quirks

Of course it didn’t just work out of the box™. When I booted with the shiny new SSDs, hadante got stuck at the BIOS splash screen while HDD3 was showing a bright red light. I pulled HDD3 and HDD4, rebooted and got a login prompt. Since SATA is hot-pluggable, I inserted HDD3 and HDD4 again. Fortunately, they showed up on the SCSI bus (cries of joy!).

I created a RAID5 with:

# mdadm --create /dev/md1 --level=5 --raid-devices=4 /dev/sd[efgh]

and waited until today (Mar. 3rd, 2018) for the rebuild to finish. After that I tested the setup:

  • Power off hadante
  • Turn off the external casing
  • Wait about 30 seconds
  • Turn on the external casing and then hadante
  • Wait eagerly…

… and watch the kernel error messages scrolling down the screen 🙁

The solution

Note the (not really) failing drive by staring at the LEDs of the external casing. Power off hadante and pull the failing drive and one other, non-failing drive! It’s important to pull 2 drives, so the kernel cannot assemble the RAID! Then reboot and stop the failed RAID:

# mdadm -S /dev/md?

Now hot-plug the missing drives, reboot again and be amazed how everything magically works again 🙂

From my observations the drives in the external bay are recognized until you cut the power, but that’s just a guess.
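
To see whether the array really came back complete after that last reboot, one could look at /proc/mdstat. The sketch below is only an illustration: it assumes the usual mdstat layout, where each array starts with a line like `md1 : active raid5 …` and its status line carries a marker such as `[UUUU]` (all members up) or `[UU_U]` (one member missing). The function name is hypothetical.

```shell
#!/bin/sh
# Print the names of md arrays that have at least one missing member.
# Reads mdstat-formatted text from the file in $1 (default: /proc/mdstat).
# Assumption: arrays are named md<number>; other naming schemes are ignored.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the current array
        /\[[U_]+\]/ && name != "" {          # status marker like [UU_U]
            if ($0 ~ /\[[U_]*_[U_]*\]/)      # an underscore is a missing member
                print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}

# Example call:
#   degraded_arrays
```

No output means every array found has all of its members.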

SSD Installation

On May 27, 2015, I replaced the system RAID of hadante (4 spinning 500 GB disks, RAID5) with 4 Samsung SSDs (also 500 GB, RAID5). It was well worth it. The speed is amazing!

Along with the disks I ordered four 3.5″ -> 2.5″ installation frames. As it turned out I only needed two, because with the right kind of frame you can easily stack two SSDs on one. There is even a gap in between, so I don’t expect heat problems.

The RAID5-rebuild was blazing fast. Overall, everything seems to be much snappier.

The serial numbers – SSD drives from top to bottom:

  1. S21JNXAG415926
  2. S21JNXAG415880
  3. S21JNXAG433264
  4. S21JNXAG433168

The old serial numbers – spinning drives from top to bottom:

  1. S13TJ1EQ401080 (Samsung)
  2. 5VMJ32Q9 (Seagate)
  3. 3PM23C12 (Seagate)
  4. S13TJ1EQ401081 (Samsung)