Remove a failed drive from an LVM volume group… and recover partial data from an incomplete LV (missing PV)

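I have been struggling with this problem for a while now.

I have a logical volume spanning 3 disks: 1.5 TB, 2 TB and 3 TB. The 1.5 TB drive is failing, with lots of I/O errors and dead sectors. I started a pvmove to move the existing extents from the failing drive to the 3 TB drive (which has enough free space). 99% of the extents were moved, but the last percent seems to be impossible to read. The read fails and pvmove exits.

This is the current state: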

    root@server:~# pvdisplay
      /dev/sdd: read failed after 0 of 4096 at 0: Input/output error
      /dev/sdd: read failed after 0 of 4096 at 1500301819904: Input/output error
      /dev/sdd: read failed after 0 of 4096 at 1500301901824: Input/output error
      /dev/sdd: read failed after 0 of 4096 at 4096: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 1500300771328: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 1500300853248: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 0: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 4096: Input/output error
      Couldn't find device with uuid hFhfbQ-4cuW-CSlE-qhfO-GNl8-Jvt7-4nZTWK.
      --- Physical volume ---
      PV Name               /dev/sda # old, working drive
      VG Name               lvm_group1
      PV Size               1.82 TiB / not usable 1.09 MiB
      Allocatable           yes (but full)
      PE Size               4.00 MiB
      Total PE              476932
      Free PE               0
      Allocated PE          476932
      PV UUID               FEoDYU-Lhjf-FdI1-Ei5p-koue-PIma-TGvs9A

      --- Physical volume ---
      PV Name               /dev/sdd1 # old failing drive
      VG Name               lvm_group1
      PV Size               1.36 TiB / not usable 2.40 MiB
      Allocatable           NO
      PE Size               4.00 MiB
      Total PE              357699
      Free PE               357600
      Allocated PE          99
      PV UUID               hFhfbQ-4cuW-CSlE-qhfO-GNl8-Jvt7-4nZTWK

      --- Physical volume ---
      PV Name               /dev/sdf # new drive
      VG Name               lvm_group1
      PV Size               2.73 TiB / not usable 4.46 MiB
      Allocatable           yes
      PE Size               4.00 MiB
      Total PE              715396
      Free PE               357746
      Allocated PE          357650
      PV UUID               qs4BVK-PAPv-I1DG-x5wJ-dRNq-vhBE-wQeJL6

This is what pvmove says:

    root@server:~# pvmove /dev/sdd1:335950-336500 /dev/sdf --verbose
        Finding volume group "lvm_group1"
        Archiving volume group "lvm_group1" metadata (seqno 93).
        Creating logical volume pvmove0
        Moving 50 extents of logical volume lvm_group1/cryptex
        Found volume group "lvm_group1"
        activation/volume_list configuration setting not defined: Checking only host tags for lvm_group1/cryptex
        Updating volume group metadata
        Found volume group "lvm_group1"
        Found volume group "lvm_group1"
        Creating lvm_group1-pvmove0
        Loading lvm_group1-pvmove0 table (253:2)
        Loading lvm_group1-cryptex table (253:0)
        Suspending lvm_group1-cryptex (253:0) with device flush
        Suspending lvm_group1-pvmove0 (253:2) with device flush
        Found volume group "lvm_group1"
        activation/volume_list configuration setting not defined: Checking only host tags for lvm_group1/pvmove0
        Resuming lvm_group1-pvmove0 (253:2)
        Found volume group "lvm_group1"
        Loading lvm_group1-pvmove0 table (253:2)
        Suppressed lvm_group1-pvmove0 identical table reload.
        Resuming lvm_group1-cryptex (253:0)
        Creating volume group backup "/etc/lvm/backup/lvm_group1" (seqno 94).
        Checking progress before waiting every 15 seconds
      /dev/sdd1: Moved: 4.0%
      /dev/sdd1: read failed after 0 of 4096 at 0: Input/output error
      No physical volume label read from /dev/sdd1
      Physical volume /dev/sdd1 not found
      ABORTING: Can't reread PV /dev/sdd1
      ABORTING: Can't reread VG for /dev/sdd1

Only 99 extents are left on the failing drive. I can afford to lose that data; I just want to discard this drive without losing the data on the other drives.
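For scale, here is a quick back-of-the-envelope check of how much data those 99 extents represent. It only multiplies two numbers taken from the pvdisplay output above (Allocated PE and PE Size for /dev/sdd1); the function name `extents_to_mib` is made up for this illustration.

```shell
#!/bin/sh
# Values copied from the pvdisplay output for /dev/sdd1.
PE_SIZE_MIB=4      # "PE Size 4.00 MiB"
ALLOCATED_PE=99    # "Allocated PE 99"

# Hypothetical helper: size in MiB of a number of extents.
# $1 = extent count, $2 = extent size in MiB
extents_to_mib() {
    echo $(( $1 * $2 ))
}

echo "Still on the failing PV: $(extents_to_mib "$ALLOCATED_PE" "$PE_SIZE_MIB") MiB"
```

So roughly 396 MiB of the logical volume is still stuck on the failing drive.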

So I tried pvremove:

    root@server:~# pvremove /dev/sdd1
      /dev/sdd1: read failed after 0 of 4096 at 1500300771328: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 1500300853248: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 0: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 4096: Input/output error
      No physical volume label read from /dev/sdd1
      Physical Volume /dev/sdd1 not found

Then vgreduce:

    root@server:~# vgreduce lvm_group1 --removemissing
      /dev/sdd: read failed after 0 of 4096 at 0: Input/output error
      /dev/sdd: read failed after 0 of 4096 at 1500301819904: Input/output error
      /dev/sdd: read failed after 0 of 4096 at 1500301901824: Input/output error
      /dev/sdd: read failed after 0 of 4096 at 4096: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 1500300771328: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 1500300853248: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 0: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 4096: Input/output error
      Couldn't find device with uuid hFhfbQ-4cuW-CSlE-qhfO-GNl8-Jvt7-4nZTWK.
      WARNING: Partial LV cryptex needs to be repaired or removed.
      WARNING: Partial LV pvmove0 needs to be repaired or removed.
      There are still partial LVs in VG lvm_group1.
      To remove them unconditionally use: vgreduce --removemissing --force.
      Proceeding to remove empty missing PVs.

pvdisplay still shows the failing drive…

Any ideas?

In the end I solved it by manually editing /etc/lvm/backup/lvm_group1.

Here are the steps, for anyone else who runs into this problem:

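  1. I physically removed the dead drive from the server.
  2. I ran vgreduce lvm_group1 --removemissing --force
  3. I removed the dead drive from the configuration.
  4. I added another stripe on the "good" drive in place of the unreadable extents on the dead drive.
  5. I ran vgcfgrestore -f edited_config_file.cfg lvm_group1
  6. I rebooted.
  7. Voila! The volume is visible and can be mounted.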
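The command-line part of the steps above can be sketched roughly as follows. This is a print-only script: it echoes the commands instead of running them, so they can be reviewed and executed one at a time. The hand edits in steps 3 and 4 are not automated. One assumption of mine that the steps do not spell out: the copy of the metadata backup is taken before vgreduce, because vgreduce with --force removes the partial LVs from the metadata and writes a fresh backup. The function names are invented for this sketch.

```shell
#!/bin/sh
# Print-only sketch of the recovery steps; nothing here modifies LVM.
VG="${VG:-lvm_group1}"
EDITED="${EDITED:-edited_config_file.cfg}"   # file name used in step 5

show() { printf '%s\n' "$*"; }

save_config_copy() {
    # Assumption: keep a copy of the metadata backup BEFORE vgreduce,
    # since vgreduce rewrites /etc/lvm/backup/$VG without the partial LV.
    show cp "/etc/lvm/backup/$VG" "$EDITED"
}

remove_missing_pv() {
    # Step 2. With --force, LVs that were partly on the missing PV are
    # removed from the metadata as well.
    show vgreduce "$VG" --removemissing --force
}

restore_edited_config() {
    # Step 5. Only meaningful after $EDITED has been edited by hand.
    show vgcfgrestore -f "$EDITED" "$VG"
}

recovery_steps() {
    save_config_copy
    remove_missing_pv
    show "# steps 3-4: edit $EDITED by hand (drop the dead PV, remap its extents)"
    restore_edited_config
    show "# step 6: reboot"
}

recovery_steps
```

If metadata archiving is enabled, older copies of the metadata should also be available under /etc/lvm/archive, which gives a fallback if the edited file turns out to be wrong.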

It took me 4 days to learn the ins and outs of LVM well enough to solve this.

So far it looks good. No errors. Happy camper.

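If you can stop LVM temporarily (and close the underlying LUKS containers, if used), an alternative is to copy as much as possible of the PV (or the underlying LUKS container) to a good disk with GNU ddrescue, then remove the old disk before restarting LVM.

While I like Sniku's LVM solution, ddrescue may be able to recover more data than pvmove.

(The reason for stopping LVM is that LVM has multipath support and would balance write operations across pairs of PVs with the same UUID as soon as it discovers them. In addition, LVM and LUKS should be stopped to make sure all recent writes are visible on the underlying devices. Rebooting the system and not providing the LUKS passwords is the simplest way to ensure that.)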
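A rough sketch of what the ddrescue route could look like, again as a print-only script so nothing touches a disk until the commands have been checked by hand. `/dev/sdd1` and `lvm_group1` come from the question; the target partition `/dev/sdX1`, the mapfile name `rescue.map` and the LUKS mapping name `cryptex_open` are placeholders I made up. It assumes the filesystem is already unmounted and that the target is a spare partition at least as large as the source.

```shell
#!/bin/sh
# Print-only sketch: run the printed commands by hand after double-checking
# every device name. Writing to the wrong target destroys its contents.
VG="${VG:-lvm_group1}"
SRC="${SRC:-/dev/sdd1}"                  # failing PV from the question
DST="${DST:-/dev/sdX1}"                  # hypothetical spare partition
MAP="${MAP:-rescue.map}"                 # hypothetical mapfile, lets ddrescue resume
LUKS_NAME="${LUKS_NAME:-cryptex_open}"   # hypothetical LUKS mapping name

show() { printf '%s\n' "$*"; }

rescue_steps() {
    # Close the LUKS mapping (if one is open) and deactivate the VG so that
    # nothing writes to the PVs during the copy.
    show cryptsetup close "$LUKS_NAME"
    show vgchange -an "$VG"
    # Pass 1: copy everything that reads cleanly, skipping the scraping phase.
    show ddrescue -f -n "$SRC" "$DST" "$MAP"
    # Pass 2: return to the bad areas with direct disk access, 3 retry passes.
    show ddrescue -f -d -r3 "$SRC" "$DST" "$MAP"
    show "# unplug the old disk before reactivating, then:"
    show vgchange -ay "$VG"
}

rescue_steps
```

Because the copy carries the PV label and UUID with it, the old disk has to be gone before the VG is activated again, which is the point made above about duplicate UUIDs.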