RAID Drive Replacement

On the 20th of May, I noticed an email from mdadm (the Linux software RAID manager) saying that a Degraded Array event had been detected. It looked like two drives had gone down at the same time (SDC and SDD). Before I had done any diagnosis of the problem, I had already ordered two refurbished replacement drives.
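A degraded array isn't a guessing game, though; mdadm will tell you exactly what state everything is in. The first look is something like this (a minimal Python sketch for illustration, run as root, using the md device names on this box):

    # Minimal sketch: see what the kernel and mdadm think of the arrays
    # after a DegradedArray email. Assumes mdadm is installed; run as root.
    import subprocess

    # /proc/mdstat is a one-screen summary of every md array and any resync
    with open("/proc/mdstat") as f:
        print(f.read())

    # mdadm --detail lists per-member state (active sync, faulty, removed)
    for md in ("/dev/md0", "/dev/md1", "/dev/md2", "/dev/md3"):
        subprocess.run(["mdadm", "--detail", md], check=False)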

I went for refurbished because getting new ‘affordable’ drives that don’t use SMR (Shingled Magnetic Recording) is difficult. SMR packs more capacity into the same area, but the drives slow down dramatically once the roughly 25GB cache region fills up, which makes them a poor fit for Network Attached Storage. (Even some WD Red NAS drives use SMR, and WD didn’t disclose it!)

So I went for some refurbished Seagate Barracuda 2TB drives. These were cheap and they used CMR 🙂

After a bit more diagnosing and a reboot, it looked like the SDC drive was okay and had just been knocked offline because SDD was corrupting the SATA bus. That made me feel a little safer, as I don’t like running systems with no margin for failure. I ran a full set of diagnostics on SDC, reintroduced it into the array, and after a data check it came back online just fine.
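Those steps amount to surprisingly little typing. Roughly (a sketch, with /dev/sdc2 and /dev/md1 as example names for the partition and its array, run as root):

    # Sketch: check the suspect drive over, then put it back into its array.
    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run(["smartctl", "-t", "long", "/dev/sdc"])  # start a long SMART self-test
    # ...wait for the test to finish, then review the attributes and test log
    run(["smartctl", "-a", "/dev/sdc"])

    # Put the member back; mdadm resyncs it against the rest of the array
    run(["mdadm", "/dev/md1", "--re-add", "/dev/sdc2"])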

I then had to wait a little while for my refurbished drives to arrive from Germany. They took a couple of days to arrive, which I didn’t think was too bad considering the world is kinda messed up right now.

Once the drives had arrived, I started my usual round of tests for new drives: making sure they’d survived shipping, that I hadn’t been sold a lemon, and that they were going to give a decent level of service.

My testing involves running the SMART self-test and recording the results, zeroing the drive and recording the results again, then overwriting the drive four times with different patterns and reading each pass back to verify it. Once that’s done, I record the SMART results one more time and compare them against the earlier runs to make sure the testing hasn’t uncovered any problems.
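That maps neatly onto standard tools. The sketch below isn’t my exact script, but badblocks’ destructive write test does the same kind of four-pattern overwrite-and-verify (0xaa, 0x55, 0xff, 0x00). The device name is a placeholder, and this will destroy anything on the drive you point it at:

    # Rough burn-in sketch: SMART self-test, destructive 4-pattern overwrite,
    # then compare SMART attributes before and after. DESTROYS all data on DEV.
    import datetime
    import subprocess

    DEV = "/dev/sdX"  # placeholder; point this at the drive under test
    LOG = "burnin.log"

    def log_run(cmd):
        # Append the command and its full output to the log with a timestamp
        with open(LOG, "a") as f:
            f.write(f"\n--- {datetime.datetime.now()} : {' '.join(cmd)}\n")
            f.flush()
            subprocess.run(cmd, stdout=f, stderr=subprocess.STDOUT, check=False)

    log_run(["smartctl", "-t", "long", DEV])  # long self-test (wait for it to complete)
    log_run(["smartctl", "-a", DEV])          # record the baseline SMART attributes
    log_run(["badblocks", "-wsv", DEV])       # write 0xaa/0x55/0xff/0x00, verify each pass
    log_run(["smartctl", "-a", DEV])          # record again and diff against the baseline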

Next comes partitioning the drive. I just copied the partition layout from one of the existing disks and wrote that partition table to the replacement. I then asked mdadm to add the new partitions to the RAID devices (md0, md1, md2, md3), and it started rebuilding the missing drive onto the new blank. You can see in the screenshot that it’s about 9.2% of the way through recovering the largest md device, md1.
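In command terms it’s roughly this (a sketch rather than a transcript of what I ran; sdb stands in for a healthy existing member, sdd for the replacement, and the partition-to-array mapping is illustrative):

    # Sketch: clone the partition table from a good disk to the replacement,
    # then add each new partition to its md array. Device names are examples.
    import subprocess

    def run(cmd, **kwargs):
        print("+", " ".join(cmd))
        return subprocess.run(cmd, check=True, text=True, **kwargs)

    # Dump the layout of a healthy member and write it to the new disk
    layout = run(["sfdisk", "--dump", "/dev/sdb"], capture_output=True).stdout
    run(["sfdisk", "/dev/sdd"], input=layout)

    # Add the new partitions; the kernel rebuilds each array in the background
    for md, part in [("/dev/md0", "/dev/sdd1"), ("/dev/md1", "/dev/sdd2"),
                     ("/dev/md2", "/dev/sdd3"), ("/dev/md3", "/dev/sdd4")]:
        run(["mdadm", md, "--add", part])

    # Rebuild progress shows up in /proc/mdstat (e.g. "recovery = 9.2%")
    with open("/proc/mdstat") as f:
        print(f.read())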

From discovery to fix, this entire process took about 5 days. Actual user input was only about an hour, plus checking back and forth to make sure the drive was behaving.

Of course, RAID is not backup, but it’s great if your system can take two drives failing and still run fine. I have a backup system on a separate drive, plus cloud backups. This is because in 2010, I typed an F instead of a G and wiped out the last 10 years of data.

Checking back through the logs, the problem was first reported on the 5th, but I didn’t see the email alert until the 25th. At least it’s all fixed now. In the end I didn’t need both drives, but it’s good to have a ‘cold’ spare in stock now 🙂