Help recovering from ZFS failure
Background: I have a NAS (OpenMediaVault, latest major release) for storing a ton of data – mostly RAW photos. Everything I read online said to stick with mirrored pairs, so that’s what I did, with 4x12TB drives:
data partition = (A+B) + (C+D). Each disk (A-D) is 12TB.
Most of the data was copied from portable 2-5TB disks, though some was copied fresh to the NAS to free up space on my local machine. (Yes, I realize this doesn’t follow 3-2-1, but it’s where I’m at now…)
I had one disk (C) fail, but was not at all concerned because it was under warranty and I had mirrored pairs. The replacement came, and I started the resilvering process. I think the process got ~2% done before ZFS decided it didn’t trust the source disk (D), and the whole process just hung.
Before the drives failed, I realized the ~22TB of usable space would not be sufficient, so I bought 2x20TB drives for additional space; the goal was to add them as a third mirrored pair, so that:
data partition = (A+B) + (C+D) + (E+F), where E & F are the new 20TB drives.
For better or worse, I never got that far, because I wanted the resilver to finish before adding the new vdev.
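As a sanity check on those capacity figures, here is a small sketch. It assumes the drive sizes are decimal terabytes (as printed on the label) and ignores ZFS metadata overhead; `tb_to_tib` is a helper name made up for this example. Each mirrored pair contributes the size of one disk.

```shell
# Convert decimal terabytes (drive label) to binary TiB (what many tools report).
# tb_to_tib is a hypothetical helper, not part of any ZFS tooling.
tb_to_tib() {
  awk -v tb="$1" 'BEGIN { printf "%.1f\n", tb * 1e12 / 2^40 }'
}

tb_to_tib 24   # (A+B) + (C+D): 12 + 12 TB usable
tb_to_tib 44   # (A+B) + (C+D) + (E+F): 12 + 12 + 20 TB usable
```

The first figure comes out just under 22, which matches the "~22TB" I see today.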
—
The AI model I consulted suggested a few different strategies:
Attempt to recover the ‘new’ files, not found elsewhere — though now that the whole data partition is down, I’m not sure if that’s feasible. Maybe if I mount everything as read-only?
Use ddrescue to copy D->E and then …? Can I tell ZFS to treat E as D and run in a degraded state until I can copy over the raw data? (I assume I should give up on the E+F vdev for now, even if it were a separate folder path, since I don’t want to strain the existing vdevs)
The AI gives me the impression that the data on the A/B vdev is ‘safe’, though I’m not sure what that means for readability. Does ZFS divide data on a per-file basis here? How would I extract data currently on the A/B vdev if C/D, and thus the data partition, is down?
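For the read-only idea, this is roughly what I think the attempt would look like. It is a sketch only: the pool name `tank` and the mount point `/mnt/recovery` are placeholders, and the `run` wrapper is a made-up helper that prints each command instead of executing it unless `DRY_RUN=0` is set. As far as I understand, ZFS spreads blocks across all top-level vdevs rather than keeping whole files on one pair, so the import may still refuse if C/D is unreadable.

```shell
POOL="tank"               # hypothetical pool name; substitute the real one
ALTROOT="/mnt/recovery"   # hypothetical alternate root for the read-only import

# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# List importable pools and the state of each vdev, without importing anything.
run zpool import

# Import read-only under an alternate root so nothing is written to the disks.
run zpool import -o readonly=on -R "$ALTROOT" "$POOL"

# Show per-device error counts and any files ZFS already reports as damaged.
run zpool status -v "$POOL"
```

If the read-only import works, the 'new' files could be copied off to a separate disk before anything else is attempted.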
—
I’m comfortable enough with ddrescue, so I suspect that’s my next step. I assume I’d write to E as a raw device (e.g., ddrescue -f /dev/sdd1 /dev/sde1) rather than to an image file on a filesystem on E (e.g., /mnt-of-sde1/disk.img).
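Spelled out, the clone I have in mind would look something like the sketch below. The device names are the ones I guessed above and need to be verified (e.g. with `lsblk`) before running anything, the mapfile path is a placeholder, and `run` is a made-up wrapper that only prints the commands unless `DRY_RUN=0` is set. It also assumes the destination partition already exists and is at least as large as the source.

```shell
SRC="/dev/sdd1"       # failing disk D (assumed name; verify first)
DST="/dev/sde1"       # new disk E (assumed name; its contents get overwritten)
MAP="./d-to-e.map"    # hypothetical mapfile path; keep it on a healthy disk

# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Pass 1: copy everything that reads cleanly and skip scraping the bad areas.
# -f is needed because the output is a device rather than a regular file.
run ddrescue -f -n "$SRC" "$DST" "$MAP"

# Pass 2: return to the bad areas with direct disc access and up to 3 retries.
# The mapfile lets ddrescue resume where pass 1 left off.
run ddrescue -f -d -r3 "$SRC" "$DST" "$MAP"
```

My understanding is that ZFS identifies pool members by their on-disk labels rather than by device name, so the clone on E should be picked up in place of D, provided the original D is disconnected first so the two copies don't conflict.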
Are there other options to consider? If ddrescue is the right next step, what would you do next to preserve as much data as possible before going to a 3-2-1 model?
Finally, is ZFS the right technology here, and is my topology as smart as I thought it was? I considered SnapRAID at one point, but was skeptical of its long-term support. However, now that I’ve (likely?) lost data, the idea of being able to read data from the surviving disks after a failure sure seems appealing…
