Hi all, I'm new to this list. I hope somebody here can help me.

We had a disk go bad (disk commands timed out, and took many seconds to do
so) in our LVM installation with mirroring. With some trouble, we managed to
pvremove the offending disk, and used `lvconvert --repair -y nova/$lv` to
repair (restore redundancy on) the logical volumes.

One logical volume still seems to have trouble though. In `lvs -o devices -a`
it shows no devices for 2 of its subvolumes, and it has the weird 'v' status:

```
  LV                VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  lvname            nova Rwi-aor--- 800.00g                                    100.00           lvname_rimage_0(0),lvname_rimage_1(0),lvname_rimage_2(0),lvname_rimage_3(0)
  [lvname_rimage_0] nova iwi-aor--- 400.00g                                                     /dev/sdc1(19605)
  [lvname_rimage_1] nova iwi-aor--- 400.00g                                                     /dev/sdi1(19605)
  [lvname_rimage_2] nova vwi---r--- 400.00g
  [lvname_rimage_3] nova iwi-aor--- 400.00g                                                     /dev/sdj1(19605)
  [lvname_rmeta_0]  nova ewi-aor---  64.00m                                                     /dev/sdc1(19604)
  [lvname_rmeta_1]  nova ewi-aor---  64.00m                                                     /dev/sdi1(19604)
  [lvname_rmeta_2]  nova ewi---r---  64.00m
  [lvname_rmeta_3]  nova ewi-aor---  64.00m                                                     /dev/sdj1(19604)
```

and also according to `lvdisplay -am` there is a problem with `..._rimage_2`
and `..._rmeta_2`:

```
  --- Logical volume ---
  Internal LV Name       lvname_rimage_2
  VG Name                nova
  LV UUID                xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  LV Write Access        read/write
  LV Creation host, time xxxxxxxxx, 2021-07-09 16:45:21 +0000
  LV Status              NOT available
  LV Size                400.00 GiB
  Current LE             6400
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto

  --- Segments ---
  Virtual extents 0 to 6399:
    Type                error

  --- Logical volume ---
  Internal LV Name       lvname_rmeta_2
  VG Name                nova
  LV UUID                xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  LV Write Access        read/write
  LV Creation host, time xxxxxxxxx, 2021-07-09 16:45:21 +0000
  LV Status              NOT available
  LV Size                64.00 MiB
  Current LE             1
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto

  --- Segments ---
  Virtual extents 0 to 0:
    Type                error
```

Correspondingly, the VG metadata shows the same error segment:

```
  lvname_rimage_2 {
      id = "..."
      status = ["READ", "WRITE"]
      flags = []
      creation_time = 1625849121      # 2021-07-09 16:45:21 +0000
      creation_host = "cbk130133"
      segment_count = 1

      segment1 {
          start_extent = 0
          extent_count = 6400         # 400 Gigabytes

          type = "error"
      }
  }
```

On the other hand, the health status appears to read as normal:

```
  [13:38:20] root@cbk130133:~# lvs -o +lv_health_status
    LV     VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Health
    lvname nova Rwi-aor--- 800.00g ..                                 100.00
```

We tried various combinations of `lvconvert --repair -y nova/$lv` and
`lvchange --syncaction repair` on it, without effect. `lvchange -ay`
doesn't work either:

```
  $ sudo lvchange -ay nova/lvname_rmeta_2
    Operation not permitted on hidden LV nova/lvname_rmeta_2.
  $ sudo lvchange -ay nova/lvname
  $ # (no effect)
  $ sudo lvconvert --repair nova/lvname_rimage_2
    WARNING: Disabling lvmetad cache for repair command.
    WARNING: Not using lvmetad because of repair.
    Command on LV nova/lvname_rimage_2 does not accept LV type error.
    Command not permitted on LV nova/lvname_rimage_2.
  $ sudo lvchange --resync nova/lvname_rimage_2
    WARNING: Not using lvmetad because a repair command was run.
    Command on LV nova/lvname_rimage_2 does not accept LV type error.
    Command not permitted on LV nova/lvname_rimage_2.
  $ sudo lvchange --resync nova/lvname
    WARNING: Not using lvmetad because a repair command was run.
    Logical volume nova/lvname in use.
    Can't resync open logical volume nova/lvname.
  $ lvchange --rebuild /dev/sdf1 nova/lvname
    WARNING: Not using lvmetad because a repair command was run.
  Do you really want to rebuild 1 PVs of logical volume nova/lvname [y/n]: y
    device-mapper: create ioctl on lvname_rmeta_2 LVM-blah failed: Device or resource busy
    Failed to lock logical volume nova/lvname.
  $ lvchange --raidsyncaction repair nova/lvname
  $ # (took a long time to complete but didn't change anything)
  $ sudo lvconvert --mirrors +1 nova/lvname
    Using default stripesize 64.00 KiB.
    --mirrors/-m cannot be changed with raid10.
```

Any idea how to restore redundancy on this logical volume? It is in
continuous use, of course...

It seems like somehow we must convince LVM to allocate some space for it,
instead of using the error segment (there is plenty of space available in
the volume group).

Thanks in advance.

-Olaf
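For context on that last point, the two generic levers in this kind of situation
are pointing `lvconvert --repair` at a PV that still has free extents, and
dumping the VG metadata with `vgcfgbackup` to see exactly what the error segment
replaced. A minimal sketch of both, assuming the problem LV is `nova/lvname` and
using `/dev/sdd1` purely as a placeholder for a PV with room (the default form of
`--repair` was already tried above without success):

```
# Which PVs still have free extents to hold a replacement image?
pvs -o pv_name,vg_name,pv_free

# Retry the repair while naming a PV with free space explicitly
# (/dev/sdd1 is a placeholder; use whichever PV has room).
lvconvert --repair nova/lvname /dev/sdd1

# Dump the current VG metadata to a file for inspection. vgcfgrestore can
# write back an edited copy, but that is a last resort: only with the LV
# deactivated and a known-good backup of metadata (and ideally data) in hand.
vgcfgbackup -f /tmp/nova-meta.txt nova
```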
>>>>> "Olaf" == Olaf Seibert <o.seibert@syseleven.de> writes: I'm leaving for the rest of the weekend, but hopefully this will help you... Olaf> Hi all, I'm new to this list. I hope somebody here can help me. We will try! But I would strongly urge that you take backups of all your data NOW, before you do anything else. Copy to another disk which is seperate from this system just in case. My next suggestion would be for you to provide the output of the 'pvs', 'vgs' and 'lvs' commands. Also, which disk died? And have you replaced it? My second suggestion would be for you to use 'md' as the lower level RAID1/10/5/6 level underneath your LVM volumes. Alot of people think it's better to have it all in one tool (btrfs, zfs, others) but I stronly feel that using nice layers helps keep things organized and reliable. So if you can, add two new disks into your system, add a full-disk partition which starts at offset of 1mb or so, and maybe even leaves a couple of MBs of free space at the end, and then create an MD pair on them: mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdy1 /dev/sdz1 Now you can add that disk in your nova VG with: vgextend nova /dev/md0 Then try to move your LV named 'lvname' onto the new MD PV. pvmove -n lvname /dev/<source_PV> /dev/md0 I think you really want to move the *entire* top level LV onto new storage. Then you will know you have safe data. And this can be done while the volume is up and running. But again!!!!!! Please take a backup (rsync onto a new LV maybe?) of your current data to make sure you don't lose anything. Olaf> We had a disk go bad (disk commands timed out and took many Olaf> seconds to do so) in our LVM installation with mirroring. With Olaf> some trouble, we managed to pvremove the offending disk, and Olaf> used `lvconvert --repair -y nova/$lv` to repair (restore Olaf> redundancy) the logical volumes. How many disks do you have in the system? Please don't try to hide names of disks and such unless you really need to. It makes it much harder to diagnose. Olaf> One logical volume still seems to have trouble though. 
Olaf> [...]
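To make the suggested sequence concrete, here is a consolidated sketch of it.
The device names are the same placeholders as above; the pvcreate and the final
verification step are added here for completeness and are assumptions about how
one would wire this up, not part of the original suggestion:

```
# Mirror pair on the two new disks (placeholder partition names).
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdy1 /dev/sdz1

# Turn the new array into a PV and add it to the existing VG.
pvcreate /dev/md0
vgextend nova /dev/md0

# Move the affected LV off one of the old PVs onto the MD array.
# For a raid LV the data actually lives in the _rimage_/_rmeta_ sub-LVs,
# so if -n does not pick them up, a plain "pvmove /dev/<source_PV> /dev/md0"
# moves everything allocated on that PV instead.
pvmove -n lvname /dev/<source_PV> /dev/md0

# Verify where every raid image now lives.
lvs -a -o lv_name,segtype,devices nova
```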
First, John, thanks for your reply.

On 28.05.22 18:15, John Stoffel wrote:
>>>>>> "Olaf" == Olaf Seibert <o.seibert@syseleven.de> writes:
>
> I'm leaving for the rest of the weekend, but hopefully this will help you...
>
> Olaf> Hi all, I'm new to this list. I hope somebody here can help me.
>
> We will try! But I would strongly urge that you take backups of all
> your data NOW, before you do anything else. Copy to another disk which
> is separate from this system, just in case.

Unfortunately there are some complicating factors that I left out so far.
The machine in question is a host for virtual machines run by customers,
so we can't even just look at the data, never mind rsyncing it. (The name
"nova" might have given that away; it is the name of the OpenStack compute
service.)

> My next suggestion would be for you to provide the output of the 'pvs',
> 'vgs' and 'lvs' commands. Also, which disk died? And have you replaced it?

/dev/sde died. It is still in the machine.

```
$ sudo pvs
  PV         VG     Fmt  Attr PSize   PFree
  /dev/sda2  system lvm2 a--  445.22g 347.95g
  /dev/sdb2  system lvm2 a--  445.22g 347.94g
  /dev/sdc1  nova   lvm2 a--    1.75t 412.19g
  /dev/sdd1  nova   lvm2 a--    1.75t   1.75t
  /dev/sdf1  nova   lvm2 a--    1.75t 812.25g
  /dev/sdg1  nova   lvm2 a--    1.75t   1.75t
  /dev/sdh1  nova   lvm2 a--    1.75t   1.75t
  /dev/sdi1  nova   lvm2 a--    1.75t 412.19g
  /dev/sdj1  nova   lvm2 a--    1.75t 412.19g

$ sudo vgs
  VG     #PV #LV #SN Attr   VSize   VFree
  nova     7  20   0 wz--n-  12.23t   7.24t
  system   2   2   0 wz--n- 890.45g 695.89g

$ sudo lvs
  LV   VG     Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  1b77 nova   Rwi-aor---  50.00g                                    100.00
  1c13 nova   Rwi-aor---  50.00g                                    100.00
  203f nova   Rwi-aor--- 800.00g                                    100.00
  3077 nova   Rwi-aor---  50.00g                                    100.00
  61a0 nova   Rwi-a-r---  50.00g                                    100.00
  63c1 nova   Rwi-aor---  50.00g                                    100.00
  8958 nova   Rwi-aor--- 800.00g                                    100.00
  8a4f nova   Rwi-aor---  50.00g                                    100.00
  965a nova   Rwi-aor--- 100.00g                                    100.00
  9d89 nova   Rwi-aor--- 200.00g                                    100.00
  9df4 nova   Rwi-a-r---  50.00g                                    100.00
  b41b nova   Rwi-aor---  50.00g                                    100.00
  c517 nova   Rwi-aor---  50.00g                                    100.00
  d36b nova   Rwi-aor---  50.00g                                    100.00
  dd1b nova   Rwi-a-r---  50.00g                                    100.00
  e2ed nova   Rwi-aor---  50.00g                                    100.00
  ef6c nova   Rwi-aor---  50.00g                                    100.00
  f5ce nova   Rwi-aor--- 100.00g                                    100.00
  f952 nova   Rwi-aor---  50.00g                                    100.00
  fbf6 nova   Rwi-aor---  50.00g                                    100.00
  boot system mwi-aom---   1.91g                [boot_mlog]         100.00
  root system mwi-aom---  95.37g                [root_mlog]         100.00
```

I am abbreviating the LV names since they are long, boring UUIDs related to
customer data. "203f" is "lvname", the LV that has the problem.

> My second suggestion would be for you to use 'md' as the lower-level
> RAID1/10/5/6 layer underneath your LVM volumes. A lot of people think
> it's better to have it all in one tool (btrfs, zfs, others) but I
> strongly feel that using nice layers helps keep things organized and
> reliable.
>
> So if you can, add two new disks to your system, add a full-disk
> partition which starts at an offset of 1 MB or so, and maybe even
> leaves a couple of MB of free space at the end, and then create an MD
> pair on them:

I am not sure if there are any free slots for more disks. We would need to
send somebody to the datacenter to put in any disks, in any case.

I think I understand what you are getting at here, redundancy-wise. But
won't it confuse LVM? If it decides to store one side of any mirror on
this new md0, won't this result in 3 copies of the data for that volume?

In the list of commands I tried, there was this one:

> Olaf> $ sudo lvchange --resync nova/lvname
> Olaf> WARNING: Not using lvmetad because a repair command was run.
> Olaf> Logical volume nova/lvname in use.
> Olaf> Can't resync open logical volume nova/lvname.

Any chance that this command might work, if we can ask the customer to shut
down their VM for a while? On the other hand, there were some other commands
that took a while to run, and therefore seemed to do something, but in the
end they didn't.

It seems that this "error" segment (which apparently replaced the bad disk)
is really confusing LVM, even for commands such as `lvconvert --repair`,
which worked on the other LVs.

> [...]
>
> How many disks do you have in the system? Please don't try to hide the
> names of disks and such unless you really need to. It makes it much
> harder to diagnose.

There are 10 disks (sda-j), of which sde is broken and no longer listed.

> Olaf> It seems like somehow we must convince LVM to allocate some space for
> Olaf> it, instead of using the error segment (there is plenty available in the
> Olaf> volume group).

Thanks,
-Olaf.
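For what it's worth, such a maintenance-window attempt would roughly look like
the sketch below. Whether `--resync` can achieve anything while `_rimage_2` is
still an error-type segment rather than real storage is exactly the open
question here, so this is only a sketch of the mechanics:

```
# With the guest shut down, nothing should hold the LV open any more.
lvchange -an nova/lvname          # deactivate the raid LV
lvchange --resync nova/lvname     # request a full resynchronization
lvchange -ay nova/lvname          # reactivate it

# Watch whether Cpy%Sync climbs back towards 100.
lvs -a -o lv_name,segtype,sync_percent,devices nova
```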
Replying to myself:

On 30.05.22 10:16, Olaf Seibert wrote:
> First, John, thanks for your reply.

I contacted the customer and it turned out their VM's disk (this LV) was
broken anyway. So there is no need any more to try to repair it...

Thanks for your thoughts anyway.

-Olaf.
On Mon, May 30, 2022 at 10:16:27AM +0200, Olaf Seibert wrote:
> First, John, thanks for your reply.
>
> On 28.05.22 18:15, John Stoffel wrote:
> >>>>>> "Olaf" == Olaf Seibert <o.seibert@syseleven.de> writes:
> >
> > I'm leaving for the rest of the weekend, but hopefully this will help you...
> >
> > Olaf> Hi all, I'm new to this list. I hope somebody here can help me.
> >
> > We will try! But I would strongly urge that you take backups of all
> > your data NOW, before you do anything else. Copy to another disk which
> > is separate from this system, just in case.
>
> Unfortunately there are some complicating factors that I left out so far.
> The machine in question is a host for virtual machines run by customers,
> so we can't even just look at the data, never mind rsyncing it. (The name
> "nova" might have given that away; it is the name of the OpenStack compute
> service.)

Can you try to live-migrate the VMs off of this node? If not, can you
announce a maintenance window and power off the VMs so you can take a
block-level backup?

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
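If such a maintenance window is possible, the block-level copy can be as
simple as the sketch below; `backuphost` and the destination path are
placeholders, and the guest must stay powered off while it runs:

```
# Stream the raw LV to another machine over ssh (placeholder host and path).
dd if=/dev/nova/lvname bs=64M status=progress \
  | ssh backuphost 'dd of=/srv/backups/lvname.img bs=64M'
```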
On 30.05.22 16:07, Demi Marie Obenour wrote:
> Can you try to live-migrate the VMs off of this node? If not, can you
> announce a maintenance window and power off the VMs so you can take a
> block-level backup?

Alas, this (type of) compute node is designed specifically to use local
storage instead of (shared) network storage. This prevents live migration.

If live migration were possible (and it would include migrating the disk
storage), then this would more or less automatically solve the problem: on
the new node the LV would be a fresh one and thus nicely mirrored. Maybe
somebody could add support for something like that to OpenStack, but right
now it can't do it.

Cheers,
-Olaf.
>>>>> "Olaf" == Olaf Seibert <o.seibert@syseleven.de> writes: Olaf> Replying to myself: Olaf> On 30.05.22 10:16, Olaf Seibert wrote: >> First, John, thanks for your reply. Olaf> I contacted the customer and it turned out their VM's disk (this Olaf> LV) was broken anyway. So there is no need any more to try to Olaf> repair it... So I'm not really surprised, because when that disk dies, it probably took out their data, or at least a chunk of it, so even though it looks like it might have kept running, it probably also got corrupted in a big way too. So I think you guys need to re-architect your storage design. If you have paying customers on there, you should really be using MD with RAID10, and a hot spare disk on there as well, so when a disk dies, it can be automatically replaced, even if it fails at 2am in the morning. It's not cheap, but neither is a customer losing data. The other critical thing to do here is to make sure you're using disks with proper SCTERC timeouts, so that when they have problems, the disks just fail quickly, without blocking the system and causing outages. Look back in the linux-raid mailing list archives for discussions on this. And of course I'd also try to setup a remote backup server with even bigger disks, so that you can replicate customer data onto other storage just in case. Olaf> Thanks for your thoughts anyway. Glad I could try to help, been flat out busy with $WORK and just now following up here. Sorry! Olaf> -- Olaf> SysEleven GmbH Olaf> Boxhagener Straße 80 Olaf> 10245 Berlin Olaf> T +49 30 233 2012 0 Olaf> F +49 30 616 7555 0 Olaf> http://www.syseleven.de Olaf> http://www.facebook.com/SysEleven Olaf> https://www.instagram.com/syseleven/ Olaf> Aktueller System-Status immer unter: Olaf> http://www.twitter.com/syseleven Olaf> Firmensitz: Berlin Olaf> Registergericht: AG Berlin Charlottenburg, HRB 108571 B Olaf> Geschäftsführer: Marc Korthaus, Jens Ihlenfeld, Andreas Hermann Olaf> _______________________________________________ Olaf> linux-lvm mailing list Olaf> linux-lvm@redhat.com Olaf> https://listman.redhat.com/mailman/listinfo/linux-lvm Olaf> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/ _______________________________________________ linux-lvm mailing list linux-lvm@redhat.com https://listman.redhat.com/mailman/listinfo/linux-lvm read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/