linux-lvm.redhat.com archive mirror
* [linux-lvm] raid10 with missing redundancy, but health status claims it is ok.
@ 2022-05-27 13:56 Olaf Seibert
  2022-05-28 16:15 ` John Stoffel
  0 siblings, 1 reply; 7+ messages in thread
From: Olaf Seibert @ 2022-05-27 13:56 UTC (permalink / raw)
  To: linux-lvm

Hi all, I'm new to this list. I hope somebody here can help me.

We had a disk go bad (disk commands timed out and took many seconds to
do so) in our LVM installation with mirroring. With some trouble, we
managed to pvremove the offending disk, and used `lvconvert --repair -y
nova/$lv` to repair (restore redundancy) the logical volumes.
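
(For context, the repair pass over the volume group was essentially a loop
of this shape; the `lvs` invocation below is a reconstruction, not the
verbatim command we ran.)

```
# Sketch of the repair pass; the lvs listing is a reconstruction.
for lv in $(lvs --noheadings -o lv_name nova); do
    lvconvert --repair -y "nova/$lv"
done
```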

One logical volume still seems to have trouble, though. In `lvs -o
devices -a` it shows no devices for two of its sub-LVs, and one of them
has the weird 'v' (virtual) attribute:

```
  LV                VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  lvname            nova Rwi-aor--- 800.00g                                    100.00           lvname_rimage_0(0),lvname_rimage_1(0),lvname_rimage_2(0),lvname_rimage_3(0)
  [lvname_rimage_0] nova iwi-aor--- 400.00g                                                     /dev/sdc1(19605)
  [lvname_rimage_1] nova iwi-aor--- 400.00g                                                     /dev/sdi1(19605)
  [lvname_rimage_2] nova vwi---r--- 400.00g
  [lvname_rimage_3] nova iwi-aor--- 400.00g                                                     /dev/sdj1(19605)
  [lvname_rmeta_0]  nova ewi-aor---  64.00m                                                     /dev/sdc1(19604)
  [lvname_rmeta_1]  nova ewi-aor---  64.00m                                                     /dev/sdi1(19604)
  [lvname_rmeta_2]  nova ewi---r---  64.00m
  [lvname_rmeta_3]  nova ewi-aor---  64.00m                                                     /dev/sdj1(19604)
```

and also according to `lvdisplay -am` there is a problem with
`..._rimage_2` and `..._rmeta_2`:

```
  --- Logical volume ---
  Internal LV Name       lvname_rimage_2
  VG Name                nova
  LV UUID                xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  LV Write Access        read/write
  LV Creation host, time xxxxxxxxx, 2021-07-09 16:45:21 +0000
  LV Status              NOT available
  LV Size                400.00 GiB
  Current LE             6400
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto

  --- Segments ---
  Virtual extents 0 to 6399:
    Type                error

  --- Logical volume ---
  Internal LV Name       lvname_rmeta_2
  VG Name                nova
  LV UUID                xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  LV Write Access        read/write
  LV Creation host, time xxxxxxxxx, 2021-07-09 16:45:21 +0000
  LV Status              NOT available
  LV Size                64.00 MiB
  Current LE             1
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto

  --- Segments ---
  Virtual extents 0 to 0:
    Type                error
```

Similarly, the LVM metadata for that sub-LV shows the same error segment:

```
                lvname_rimage_2 {
                        id = "..."
                        status = ["READ", "WRITE"]
                        flags = []
                        creation_time = 1625849121      # 2021-07-09
16:45:21 +0000
                        creation_host = "cbk130133"
                        segment_count = 1

                        segment1 {
                                start_extent = 0
                                extent_count = 6400     # 400 Gigabytes

                                type = "error"
                        }
                }
```

On the other hand, the health status appears to read out normal:

```
[13:38:20] root@cbk130133:~# lvs -o +lv_health_status
  LV     VG     Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Health
  lvname nova   Rwi-aor--- 800.00g                                    100.00
```



We tried various combinations of `lvconvert --repair -y nova/$lv` and
`lvchange --syncaction repair` on it without effect.
`lvchange -ay` doesn't work either:

```
$ sudo lvchange -ay   nova/lvname_rmeta_2
  Operation not permitted on hidden LV nova/lvname_rmeta_2.
$ sudo lvchange -ay   nova/lvname
$ # (no effect)
$ sudo lvconvert --repair nova/lvname_rimage_2
  WARNING: Disabling lvmetad cache for repair command.
  WARNING: Not using lvmetad because of repair.
  Command on LV nova/lvname_rimage_2 does not accept LV type error.
  Command not permitted on LV nova/lvname_rimage_2.
$ sudo lvchange --resync nova/lvname_rimage_2
  WARNING: Not using lvmetad because a repair command was run.
  Command on LV nova/lvname_rimage_2 does not accept LV type error.
  Command not permitted on LV nova/lvname_rimage_2.
$ sudo lvchange --resync nova/lvname
  WARNING: Not using lvmetad because a repair command was run.
  Logical volume nova/lvname in use.
  Can't resync open logical volume nova/lvname.
$ lvchange --rebuild /dev/sdf1 nova/lvname
  WARNING: Not using lvmetad because a repair command was run.
Do you really want to rebuild 1 PVs of logical volume nova/lvname [y/n]: y
  device-mapper: create ioctl on lvname_rmeta_2 LVM-blah failed: Device or resource busy
  Failed to lock logical volume nova/lvname.
$ lvchange --raidsyncaction repair nova/lvname
# (took a long time to complete but didn't change anything)
$ sudo lvconvert --mirrors +1 nova/lvname
  Using default stripesize 64.00 KiB.
  --mirrors/-m cannot be changed with raid10.
```



Any idea how to restore redundancy on this logical volume? It is in
continuous use, of course...
It seems like we must somehow convince LVM to allocate new space for the
missing image and metadata sub-LVs, instead of the error segments (there
is plenty of free space in the volume group).
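
For what it's worth, the kind of invocation I would expect to do that is
something like the untested sketch below. It assumes `lvconvert --repair`
accepts a list of PVs to allocate the replacement from (as I read its man
page) and can cope with the error segment; the PV names are just examples:

```
# Untested sketch; the PVs listed are merely examples with free extents.
lvconvert --repair -y nova/lvname /dev/sdd1 /dev/sdg1
```

So far, though, the plain `--repair` runs shown above have not touched it.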

Thanks in advance.

-Olaf

-- 
SysEleven GmbH
Boxhagener Straße 80
10245 Berlin

T +49 30 233 2012 0
F +49 30 616 7555 0

http://www.syseleven.de
http://www.facebook.com/SysEleven
https://www.instagram.com/syseleven/

Aktueller System-Status immer unter:
http://www.twitter.com/syseleven

Firmensitz: Berlin
Registergericht: AG Berlin Charlottenburg, HRB 108571 B
Geschäftsführer: Marc Korthaus, Jens Ihlenfeld, Andreas Hermann

_______________________________________________
linux-lvm mailing list
linux-lvm@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [linux-lvm] raid10 with missing redundancy, but health status claims it is ok.
  2022-05-27 13:56 [linux-lvm] raid10 with missing redundancy, but health status claims it is ok Olaf Seibert
@ 2022-05-28 16:15 ` John Stoffel
  2022-05-30  8:16   ` Olaf Seibert
  0 siblings, 1 reply; 7+ messages in thread
From: John Stoffel @ 2022-05-28 16:15 UTC (permalink / raw)
  To: LVM general discussion and development

>>>>> "Olaf" == Olaf Seibert <o.seibert@syseleven.de> writes:

I'm leaving for the rest of the weekend, but hopefully this will help you...

Olaf> Hi all, I'm new to this list. I hope somebody here can help me.

We will try!  But I would strongly urge that you take backups of all
your data NOW, before you do anything else.  Copy to another disk
which is separate from this system, just in case.

My next suggestion would be for you to provide the output of the
'pvs', 'vgs' and 'lvs' commands.   Also, which disk died?  And have
you replaced it?    

My second suggestion would be for you to use 'md' as the lower-level
RAID1/10/5/6 layer underneath your LVM volumes.  A lot of people think
it's better to have it all in one tool (btrfs, zfs, others), but I
strongly feel that using nice layers helps keep things organized and
reliable.

So if you can, add two new disks to your system, create on each a
full-disk partition which starts at an offset of 1 MiB or so (and maybe
even leaves a couple of MiB of free space at the end), and then create
an MD pair on them:

   mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdy1 /dev/sdz1
     
Now you can add that device to your nova VG with:

   vgextend nova /dev/md0

Then try to move your LV named 'lvname' onto the new MD PV.

   pvmove -n lvname /dev/<source_PV> /dev/md0

I think you really want to move the *entire* top level LV onto new
storage.  Then you will know you have safe data.  And this can be done
while the volume is up and running.
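
If lvname's extents are spread across several source PVs, that means one
pvmove per source PV, roughly like this; the PV names below are
placeholders for your actual layout:

```
# Placeholders: substitute the PVs that actually hold lvname's extents.
for pv in /dev/sdc1 /dev/sdi1 /dev/sdj1; do
    pvmove -n lvname "$pv" /dev/md0
done
```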

But again!!!!!!  Please take a backup (rsync onto a new LV maybe?) of
your current data to make sure you don't lose anything.  

Olaf> We had a disk go bad (disk commands timed out and took many
Olaf> seconds to do so) in our LVM installation with mirroring. With
Olaf> some trouble, we managed to pvremove the offending disk, and
Olaf> used `lvconvert --repair -y nova/$lv` to repair (restore
Olaf> redundancy) the logical volumes.

How many disks do you have in the system?  Please don't try to hide
names of disks and such unless you really need to.  It makes it much
harder to diagnose.  


_______________________________________________
linux-lvm mailing list
linux-lvm@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [linux-lvm] raid10 with missing redundancy, but health status claims it is ok.
  2022-05-28 16:15 ` John Stoffel
@ 2022-05-30  8:16   ` Olaf Seibert
  2022-05-30  8:49     ` Olaf Seibert
  2022-05-30 14:07     ` Demi Marie Obenour
  0 siblings, 2 replies; 7+ messages in thread
From: Olaf Seibert @ 2022-05-30  8:16 UTC (permalink / raw)
  To: linux-lvm

First, John, thanks for your reply.

On 28.05.22 18:15, John Stoffel wrote:
>>>>>> "Olaf" == Olaf Seibert <o.seibert@syseleven.de> writes:
> 
> I'm leaving for the rest of the weekend, but hopefully this will help you...
> 
> Olaf> Hi all, I'm new to this list. I hope somebody here can help me.
> 
> We will try!  But I would strongly urge that you take backups of all
> your data NOW, before you do anything else.  Copy to another disk
> which is seperate from this system just in case.

Unfortunately there are some complicating factors that I left out so far.
The machine in question is a host for virtual machines run by customers,
so we can't even just look at the data, never mind rsync it.
(The name "nova" might have given that away; it is the name of the
OpenStack compute service.)

> My next suggestion would be for you to provide the output of the
> 'pvs', 'vgs' and 'lvs' commands.   Also, which disk died?  And have
> you replaced it?    

/dev/sde died. It is still in the machine.

$ sudo pvs
  PV         VG     Fmt  Attr PSize   PFree
  /dev/sda2  system lvm2 a--  445.22g 347.95g
  /dev/sdb2  system lvm2 a--  445.22g 347.94g
  /dev/sdc1  nova   lvm2 a--    1.75t 412.19g
  /dev/sdd1  nova   lvm2 a--    1.75t   1.75t
  /dev/sdf1  nova   lvm2 a--    1.75t 812.25g
  /dev/sdg1  nova   lvm2 a--    1.75t   1.75t
  /dev/sdh1  nova   lvm2 a--    1.75t   1.75t
  /dev/sdi1  nova   lvm2 a--    1.75t 412.19g
  /dev/sdj1  nova   lvm2 a--    1.75t 412.19g

$ sudo vgs
  VG     #PV #LV #SN Attr   VSize   VFree
  nova     7  20   0 wz--n-  12.23t   7.24t
  system   2   2   0 wz--n- 890.45g 695.89g

$ sudo lvs
  LV   VG     Attr       LSize   Pool Origin Data%  Meta%  Move Log         Cpy%Sync Convert
  1b77 nova   Rwi-aor---  50.00g                                            100.00
  1c13 nova   Rwi-aor---  50.00g                                            100.00
  203f nova   Rwi-aor--- 800.00g                                            100.00
  3077 nova   Rwi-aor---  50.00g                                            100.00
  61a0 nova   Rwi-a-r---  50.00g                                            100.00
  63c1 nova   Rwi-aor---  50.00g                                            100.00
  8958 nova   Rwi-aor--- 800.00g                                            100.00
  8a4f nova   Rwi-aor---  50.00g                                            100.00
  965a nova   Rwi-aor--- 100.00g                                            100.00
  9d89 nova   Rwi-aor--- 200.00g                                            100.00
  9df4 nova   Rwi-a-r---  50.00g                                            100.00
  b41b nova   Rwi-aor---  50.00g                                            100.00
  c517 nova   Rwi-aor---  50.00g                                            100.00
  d36b nova   Rwi-aor---  50.00g                                            100.00
  dd1b nova   Rwi-a-r---  50.00g                                            100.00
  e2ed nova   Rwi-aor---  50.00g                                            100.00
  ef6c nova   Rwi-aor---  50.00g                                            100.00
  f5ce nova   Rwi-aor--- 100.00g                                            100.00
  f952 nova   Rwi-aor---  50.00g                                            100.00
  fbf6 nova   Rwi-aor---  50.00g                                            100.00
  boot system mwi-aom---   1.91g                                [boot_mlog] 100.00
  root system mwi-aom---  95.37g                                [root_mlog] 100.00

I am abbreviating the LV names since they are long boring UUIDs 
related to customer data. "203f" is "lvname", the LV which has problems.

> My second suggestion would be for you to use 'md' as the lower level
> RAID1/10/5/6 level underneath your LVM volumes.  Alot of people think
> it's better to have it all in one tool (btrfs, zfs, others) but I
> stronly feel that using nice layers helps keep things organized and
> reliable.
> 
> So if you can, add two new disks into your system, add a full-disk
> partition which starts at offset of 1mb or so, and maybe even leaves a
> couple of MBs of free space at the end, and then create an MD pair on
> them:

I am not sure if there are any free slots for more disks. We would need
to send somebody to the datacenter to put in any disks in any case.

I think I understand what you are getting at here, redundancy-wise.
But won't it confuse LVM? If it decides to store one side of any mirror on
this new md0, won't this result in 3 copies of the data for that volume?

In the list of commands I tried, there was this one:

> Olaf> $ sudo lvchange --resync nova/lvname
> Olaf>   WARNING: Not using lvmetad because a repair command was run.
> Olaf>   Logical volume nova/lvname in use.
> Olaf>   Can't resync open logical volume nova/lvname.

Any chance that this command might work, if we can ask the customer to
shut down their VM for a while? 

On the other hand, some commands took a while to run, and therefore
seemed to do something, but in the end they changed nothing. One example
is `lvconvert --repair`, which did work on the other LVs.
It seems that this "error" segment (which apparently replaced the bad
disk) is really confusing LVM.

>    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdy1 /dev/sdz1
>      
> Now you can add that disk in your nova VG with:
> 
>    vgextend nova /dev/md0
> 
> Then try to move your LV named 'lvname' onto the new MD PV.
> 
>    pvmove -n lvname /dev/<source_PV> /dev/md0
> 
> I think you really want to move the *entire* top level LV onto new
> storage.  Then you will know you have safe data.  And this can be done
> while the volume is up and running.
> 
> But again!!!!!!  Please take a backup (rsync onto a new LV maybe?) of
> your current data to make sure you don't lose anything.  
> 
> Olaf> We had a disk go bad (disk commands timed out and took many
> Olaf> seconds to do so) in our LVM installation with mirroring. With
> Olaf> some trouble, we managed to pvremove the offending disk, and
> Olaf> used `lvconvert --repair -y nova/$lv` to repair (restore
> Olaf> redundancy) the logical volumes.
> 
> How many disks do you have in the system?  Please don't try to hide
> names of disks and such unless you really need to.  It makes it much
> harder to diagnose.  

There are 10 disks (sda-j) of which sde is broken and no longer listed.

> Olaf> It seems like somehow we must convince LVM to allocate some space for
> Olaf> it, instead of using the error segment (there is plenty available in the
> Olaf> volume group).

Thanks,
-Olaf.

-- 
SysEleven GmbH
Boxhagener Straße 80
10245 Berlin

T +49 30 233 2012 0
F +49 30 616 7555 0

http://www.syseleven.de
http://www.facebook.com/SysEleven
https://www.instagram.com/syseleven/

Aktueller System-Status immer unter:
http://www.twitter.com/syseleven

Firmensitz: Berlin
Registergericht: AG Berlin Charlottenburg, HRB 108571 B
Geschäftsführer: Marc Korthaus, Jens Ihlenfeld, Andreas Hermann

_______________________________________________
linux-lvm mailing list
linux-lvm@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [linux-lvm] raid10 with missing redundancy, but health status claims it is ok.
  2022-05-30  8:16   ` Olaf Seibert
@ 2022-05-30  8:49     ` Olaf Seibert
  2022-06-01 21:58       ` John Stoffel
  2022-05-30 14:07     ` Demi Marie Obenour
  1 sibling, 1 reply; 7+ messages in thread
From: Olaf Seibert @ 2022-05-30  8:49 UTC (permalink / raw)
  To: linux-lvm

Replying to myself:

On 30.05.22 10:16, Olaf Seibert wrote:
> First, John, thanks for your reply.

I contacted the customer and it turned out their VM's disk (this LV)
was broken anyway. So there is no need any more to try to repair
it...

Thanks for your thoughts anyway.

-Olaf.

-- 
SysEleven GmbH
Boxhagener Straße 80
10245 Berlin

T +49 30 233 2012 0
F +49 30 616 7555 0

http://www.syseleven.de
http://www.facebook.com/SysEleven
https://www.instagram.com/syseleven/

Aktueller System-Status immer unter:
http://www.twitter.com/syseleven

Firmensitz: Berlin
Registergericht: AG Berlin Charlottenburg, HRB 108571 B
Geschäftsführer: Marc Korthaus, Jens Ihlenfeld, Andreas Hermann

_______________________________________________
linux-lvm mailing list
linux-lvm@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [linux-lvm] raid10 with missing redundancy, but health status claims it is ok.
  2022-05-30  8:16   ` Olaf Seibert
  2022-05-30  8:49     ` Olaf Seibert
@ 2022-05-30 14:07     ` Demi Marie Obenour
  2022-05-31 11:27       ` Olaf Seibert
  1 sibling, 1 reply; 7+ messages in thread
From: Demi Marie Obenour @ 2022-05-30 14:07 UTC (permalink / raw)
  To: LVM general discussion and development



On Mon, May 30, 2022 at 10:16:27AM +0200, Olaf Seibert wrote:
> First, John, thanks for your reply.
> 
> On 28.05.22 18:15, John Stoffel wrote:
> >>>>>> "Olaf" == Olaf Seibert <o.seibert@syseleven.de> writes:
> > 
> > I'm leaving for the rest of the weekend, but hopefully this will help you...
> > 
> > Olaf> Hi all, I'm new to this list. I hope somebody here can help me.
> > 
> > We will try!  But I would strongly urge that you take backups of all
> > your data NOW, before you do anything else.  Copy to another disk
> > which is seperate from this system just in case.
> 
> Unfortunately there are some complicating factors that I left out so far.
> The machine in question is a host for virtual machines run by customers.
> So we can't just even look at the data, never mind rsyncing it.
> (the name "nova" might have given that away; that is the name of the 
> OpenStack compute service)

Can you try to live-migrate the VMs off of this node?  If not, can you
announce a maintenance window and power off the VMs so you can take a
block-level backup?
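
If a window is possible, the block-level copy itself can be as simple as
the sketch below, assuming a backup target with enough space is mounted
at an example path:

```
# With the VM powered off; /mnt/backup is an assumed destination.
dd if=/dev/nova/lvname of=/mnt/backup/lvname.img bs=4M conv=sparse status=progress
```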
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


_______________________________________________
linux-lvm mailing list
linux-lvm@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [linux-lvm] raid10 with missing redundancy, but health status claims it is ok.
  2022-05-30 14:07     ` Demi Marie Obenour
@ 2022-05-31 11:27       ` Olaf Seibert
  0 siblings, 0 replies; 7+ messages in thread
From: Olaf Seibert @ 2022-05-31 11:27 UTC (permalink / raw)
  To: linux-lvm

On 30.05.22 16:07, Demi Marie Obenour wrote:
> Can you try to live-migrate the VMs off of this node?  If not, can you
> announce a maintenance window and power off the VMs so you can take a
> block-level backup?

Alas, this (type of) compute node is designed specifically to use local
storage instead of (shared) network storage. This prevents live migration.
If live migration were possible (and it would include migrating the
disk storage), then this would more or less automatically solve
the problem: on the new node the LV would be a fresh one
and thus nicely mirrored.

Maybe somebody could add support for something like that to OpenStack,
but right now it can't do it.

Cheers,
-Olaf.

-- 
SysEleven GmbH
Boxhagener Straße 80
10245 Berlin

T +49 30 233 2012 0
F +49 30 616 7555 0

http://www.syseleven.de
http://www.facebook.com/SysEleven
https://www.instagram.com/syseleven/

Aktueller System-Status immer unter:
http://www.twitter.com/syseleven

Firmensitz: Berlin
Registergericht: AG Berlin Charlottenburg, HRB 108571 B
Geschäftsführer: Marc Korthaus, Jens Ihlenfeld, Andreas Hermann

_______________________________________________
linux-lvm mailing list
linux-lvm@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [linux-lvm] raid10 with missing redundancy, but health status claims it is ok.
  2022-05-30  8:49     ` Olaf Seibert
@ 2022-06-01 21:58       ` John Stoffel
  0 siblings, 0 replies; 7+ messages in thread
From: John Stoffel @ 2022-06-01 21:58 UTC (permalink / raw)
  To: LVM general discussion and development

>>>>> "Olaf" == Olaf Seibert <o.seibert@syseleven.de> writes:

Olaf> Replying to myself:
Olaf> On 30.05.22 10:16, Olaf Seibert wrote:
>> First, John, thanks for your reply.

Olaf> I contacted the customer and it turned out their VM's disk (this
Olaf> LV) was broken anyway. So there is no need any more to try to
Olaf> repair it...

So I'm not really surprised: when that disk died, it probably took out
their data, or at least a chunk of it.  Even though the LV looks like it
kept running, the data probably got corrupted in a big way too.


So I think you guys need to re-architect your storage design.  If you
have paying customers on there, you should really be using MD RAID10
with a hot-spare disk as well, so that when a disk dies it can be
replaced automatically, even if it fails at 2 am.  It's not cheap, but
neither is a customer losing data.
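
Roughly the shape I have in mind, with placeholder device names (and the
disks partitioned as described earlier in the thread):

```
# Placeholders: four data devices plus one hot spare.
mdadm --create /dev/md0 --level=10 --raid-devices=4 --spare-devices=1 \
      /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
```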

The other critical thing to do here is to make sure you're using disks
with proper SCTERC timeouts, so that when they have problems, the
disks just fail quickly, without blocking the system and causing
outages.
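
For example, with smartctl from smartmontools (the 7-second value below
is the commonly suggested setting; check what your drives actually
support):

```
# Query the current SCT error recovery control setting, then set it to
# 7 seconds for reads and writes (the unit is tenths of a second).
smartctl -l scterc /dev/sdb
smartctl -l scterc,70,70 /dev/sdb
```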

Look back in the linux-raid mailing list archives for discussions on
this.

And of course I'd also try to set up a remote backup server with even
bigger disks, so that you can replicate customer data onto other
storage just in case.

Olaf> Thanks for your thoughts anyway.

Glad I could try to help, been flat out busy with $WORK and just now
following up here.  Sorry!

_______________________________________________
linux-lvm mailing list
linux-lvm@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-06-01 21:58 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-27 13:56 [linux-lvm] raid10 with missing redundancy, but health status claims it is ok Olaf Seibert
2022-05-28 16:15 ` John Stoffel
2022-05-30  8:16   ` Olaf Seibert
2022-05-30  8:49     ` Olaf Seibert
2022-06-01 21:58       ` John Stoffel
2022-05-30 14:07     ` Demi Marie Obenour
2022-05-31 11:27       ` Olaf Seibert
