* [PATCH] md: warn about using another MD array as write journal
@ 2021-03-17  4:37 Manuel Riel
  2021-03-19 23:16 ` Song Liu
  0 siblings, 1 reply; 8+ messages in thread
From: Manuel Riel @ 2021-03-17  4:37 UTC (permalink / raw)
  To: Linux-RAID, Song Liu; +Cc: Vojtech Myslivec

To follow up on a previous discussion[1] about stuck RAIDs, I'd like to propose adding a warning
about this to the relevant docs. Specifically, users shouldn't add another MD array as the journal device.

Ideally mdadm would check for this, but having it in the docs is useful too.

1: https://lore.kernel.org/linux-btrfs/d3fced3f-6c2b-5ffa-fd24-b24ec6e7d4be@xmyslivec.cz/
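
For illustration, such a check could be as small as the following untested
sketch (the device name and the sysfs test are just an example, not existing
mdadm behaviour):

    #!/bin/sh
    # Hypothetical pre-flight check, not part of mdadm: refuse to use an
    # MD array as the write journal of another array.
    dev=/dev/md2                     # candidate journal device (example)
    name=$(basename "$dev")
    # MD block devices expose an "md" directory in sysfs.
    if [ -d "/sys/class/block/$name/md" ]; then
        echo "warning: $name is itself an MD array; nesting it as a write journal is known to cause problems" >&2
        exit 1
    fi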

---

diff --git a/Documentation/driver-api/md/raid5-cache.rst b/Documentation/driver-api/md/raid5-cache.rst
index d7a15f44a..128044018 100644
--- a/Documentation/driver-api/md/raid5-cache.rst
+++ b/Documentation/driver-api/md/raid5-cache.rst
@@ -17,7 +17,10 @@ And switch it back to write-through mode by::
        echo "write-through" > /sys/block/md0/md/journal_mode

 In both modes, all writes to the array will hit cache disk first. This means
-the cache disk must be fast and sustainable.
+the cache disk must be fast and sustainable. The cache disk also can't be
+another MD RAID array, since such a nested setup can cause problems when
+assembling an array or lead to the primary array getting stuck during
+operation.

 write-through mode
 ==================

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] md: warn about using another MD array as write journal
  2021-03-17  4:37 [PATCH] md: warn about using another MD array as write journal Manuel Riel
@ 2021-03-19 23:16 ` Song Liu
  2021-03-20  1:12   ` Manuel Riel
  0 siblings, 1 reply; 8+ messages in thread
From: Song Liu @ 2021-03-19 23:16 UTC (permalink / raw)
  To: Manuel Riel; +Cc: Linux-RAID, Vojtech Myslivec

On Tue, Mar 16, 2021 at 9:39 PM Manuel Riel <manu@snapdragon.cc> wrote:
>
> To follow up on a previous discussion[1] about stuck RAIDs, I'd like to propose adding a warning
> about this to the relevant docs. Specifically, users shouldn't add another MD array as the journal device.
>
> Ideally mdadm would check for this, but having it in the docs is useful too.
>
> 1: https://lore.kernel.org/linux-btrfs/d3fced3f-6c2b-5ffa-fd24-b24ec6e7d4be@xmyslivec.cz/
>
> ---
>
> diff --git a/Documentation/driver-api/md/raid5-cache.rst b/Documentation/driver-api/md/raid5-cache.rst
> index d7a15f44a..128044018 100644
> --- a/Documentation/driver-api/md/raid5-cache.rst
> +++ b/Documentation/driver-api/md/raid5-cache.rst
> @@ -17,7 +17,10 @@ And switch it back to write-through mode by::
>         echo "write-through" > /sys/block/md0/md/journal_mode
>
>  In both modes, all writes to the array will hit cache disk first. This means
> -the cache disk must be fast and sustainable.
> +the cache disk must be fast and sustainable. The cache disk also can't be
> +another MD RAID array, since such a nested setup can cause problems when
> +assembling an array or lead to the primary array getting stuck during
> +operation.

Sorry for being late on this issue.

Manuel and Vojtech, are we confident that this issue only happens when we use
another md array as the journal device?

Thanks,
Song

>
>  write-through mode
>  ==================

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] md: warn about using another MD array as write journal
  2021-03-19 23:16 ` Song Liu
@ 2021-03-20  1:12   ` Manuel Riel
  2021-03-21  4:22     ` Manuel Riel
  0 siblings, 1 reply; 8+ messages in thread
From: Manuel Riel @ 2021-03-20  1:12 UTC (permalink / raw)
  To: Song Liu; +Cc: Linux-RAID, Vojtech Myslivec

On Mar 20, 2021, at 7:16 AM, Song Liu <song@kernel.org> wrote:
> 
> Sorry for being late on this issue.
> 
> Manuel and Vojtech, are we confident that this issue only happens when we use
> another md array as the journal device?
> 
> Thanks,
> Song

Hi Song,

thanks for getting back.

Unfortunately it's still happening, even when using an NVMe partition directly. It just took three long weeks to happen, so please discard my patch. Here's how it went down yesterday:

- process md4_raid6 is running with 100% CPU utilization, all I/O to the array is blocked
- no disk activity on the physical drives
- soft reboot doesn't work, as md4_raid6 blocks, so hard reset is needed
- when booting to rescue mode, it tries to assemble the array and shows the same issue of 100% CPU utilization. Also can't reboot.
- when manually assembling it *with* the journal drive, it will read a few GB from the journal device and then get stuck at 100% CPU utilization again without any disk activity.

The solution in the end was to avoid assembling the array on reboot, then assemble it *without* the existing journal and add an empty journal drive later. This led to some data loss and a full resync.
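
Roughly, the recovery came down to something like the following (an untested
outline with placeholder device names rather than the exact commands I ran;
double-check everything against `mdadm --examine` first):

    # 1. Assemble from the data disks only, leaving the old journal out.
    #    --force is needed because the journal is considered missing, and
    #    mdadm warns about possible data loss at this point.
    mdadm --assemble --force /dev/md4 /dev/sd[a-f]1

    # 2. Attach a fresh (empty) journal device afterwards.
    mdadm /dev/md4 --add-journal /dev/nvme0n1p2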

I'm currently moving all data off this machine and will repave it, then see if that changes anything.

My main OS is CentOS 8 and the rescue system was Debian. Both showed a similar issue. This must be connected to the journal drive somehow.

My journal drive is a ~180 GB partition on an NVMe drive.

Thanks for any pointers I could try next.

Manu

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] md: warn about using another MD array as write journal
  2021-03-20  1:12   ` Manuel Riel
@ 2021-03-21  4:22     ` Manuel Riel
  2021-03-22 17:13       ` Song Liu
  0 siblings, 1 reply; 8+ messages in thread
From: Manuel Riel @ 2021-03-21  4:22 UTC (permalink / raw)
  To: Song Liu; +Cc: Linux-RAID, Vojtech Myslivec

My impression is that the write-journal feature isn't fully stable yet, as was already reported in 2019[1]. Vojtech and I are seeing the same errors as mentioned there.

It doesn't matter whether the journal is on a plain block device or another RAID array.

1: https://www.spinics.net/lists/raid/msg62646.html


> On Mar 20, 2021, at 9:12 AM, Manuel Riel <manu@snapdragon.cc> wrote:
> 
> On Mar 20, 2021, at 7:16 AM, Song Liu <song@kernel.org> wrote:
>> 
>> Sorry for being late on this issue.
>> 
>> Manuel and Vojtech, are we confident that this issue only happens when we use
>> another md array as the journal device?
>> 
>> Thanks,
>> Song
> 
> Hi Song,
> 
> thanks for getting back.
> 
> Unfortunately it's still happening, even when using an NVMe partition directly. It just took three long weeks to happen, so please discard my patch. Here's how it went down yesterday:
> 
> - process md4_raid6 is running with 100% CPU utilization, all I/O to the array is blocked
> - no disk activity on the physical drives
> - soft reboot doesn't work, as md4_raid6 blocks, so hard reset is needed
> - when booting to rescue mode, it tries to assemble the array and shows the same issue of 100% CPU utilization. Also can't reboot.
> - when manually assembling it *with* the journal drive, it will read a few GB from the journal device and then get stuck at 100% CPU utilization again without any disk activity.
> 
> The solution in the end was to avoid assembling the array on reboot, then assemble it *without* the existing journal and add an empty journal drive later. This led to some data loss and a full resync.
> 
> I'm currently moving all data off this machine and will repave it, then see if that changes anything.
> 
> My main OS is CentOS 8 and the rescue system was Debian. Both showed a similar issue. This must be connected to the journal drive somehow.
> 
> My journal drive is a ~180 GB partition on an NVMe drive.
> 
> Thanks for any pointers I could try next.
> 
> Manu


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] md: warn about using another MD array as write journal
  2021-03-21  4:22     ` Manuel Riel
@ 2021-03-22 17:13       ` Song Liu
  2021-03-23  3:27         ` Manuel Riel
  2021-05-12 22:39         ` Vojtech Myslivec
  0 siblings, 2 replies; 8+ messages in thread
From: Song Liu @ 2021-03-22 17:13 UTC (permalink / raw)
  To: Manuel Riel; +Cc: Linux-RAID, Vojtech Myslivec

On Sat, Mar 20, 2021 at 9:22 PM Manuel Riel <manu@snapdragon.cc> wrote:
>
> My impression is that the write-journal feature isn't fully stable yet, as was already reported in 2019[1]. Vojtech and I are seeing the same errors as mentioned there.
>
> It doesn't matter whether the journal is on a plain block device or another RAID array.
>
> 1: https://www.spinics.net/lists/raid/msg62646.html
>
>
> > On Mar 20, 2021, at 9:12 AM, Manuel Riel <manu@snapdragon.cc> wrote:
> >
> > On Mar 20, 2021, at 7:16 AM, Song Liu <song@kernel.org> wrote:
> >>
> >> Sorry for being late on this issue.
> >>
> >> Manuel and Vojtech, are we confident that this issue only happens when we use
> >> another md array as the journal device?
> >>
> >> Thanks,
> >> Song
> >
> > Hi Song,
> >
> > thanks for getting back.
> >
> > Unfortunately it's still happening, even when using an NVMe partition directly. It just took three long weeks to happen, so please discard my patch. Here's how it went down yesterday:
> >
> > - process md4_raid6 is running with 100% CPU utilization, all I/O to the array is blocked
> > - no disk activity on the physical drives
> > - soft reboot doesn't work, as md4_raid6 blocks, so hard reset is needed
> > - when booting to rescue mode, it tries to assemble the array and shows the same issue of 100% CPU utilization. Also can't reboot.
> > - when manually assembling it *with* the journal drive, it will read a few GB from the journal device and then get stuck at 100% CPU utilization again without any disk activity.
> >
> > The solution in the end was to avoid assembling the array on reboot, then assemble it *without* the existing journal and add an empty journal drive later. This led to some data loss and a full resync.

Thanks for the information. Quick question: does the kernel have the
following change? It fixes an issue at recovery time. Since you see the
issue in normal execution, it is probably something different.

Thanks,
Song

commit c9020e64cf33f2dd5b2a7295f2bfea787279218a
Author: Song Liu <songliubraving@fb.com>
Date:   9 months ago

    md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes

    In recovery, if we process too much data, raid5-cache may set
    MD_SB_CHANGE_PENDING, which causes spinning in handle_stripe().
    Fix this issue by clearing the bit before flushing data only
    stripes. This issue was initially discussed in [1].

    [1] https://www.spinics.net/lists/raid/msg64409.html

    Signed-off-by: Song Liu <songliubraving@fb.com>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] md: warn about using another MD array as write journal
  2021-03-22 17:13       ` Song Liu
@ 2021-03-23  3:27         ` Manuel Riel
  2021-05-12 22:39         ` Vojtech Myslivec
  1 sibling, 0 replies; 8+ messages in thread
From: Manuel Riel @ 2021-03-23  3:27 UTC (permalink / raw)
  To: Song Liu; +Cc: Linux-RAID, Vojtech Myslivec


> On Mar 23, 2021, at 1:13 AM, Song Liu <song@kernel.org> wrote:
> 
> Thanks for the information. Quick question: does the kernel have the
> following change? It fixes an issue at recovery time. Since you see the
> issue in normal execution, it is probably something different.
> 
> Thanks,
> Song
> 
> commit c9020e64cf33f2dd5b2a7295f2bfea787279218a
> Author: Song Liu <songliubraving@fb.com>
> Date:   9 months ago
> 
>    md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes

Interesting. No, it doesn't have this change. My active kernel here is CentOS's 4.18.0-240; they only added this patch in 4.18.0-277 [1].

I'll try a kernel with this commit then. Thanks for the hint!
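
For anyone else checking a distro kernel, something like this should show
whether the installed kernel's changelog mentions the fix (the package name
and the grep pattern are guesses, adjust as needed):

    # Rough check on an RPM-based system; package name and pattern may vary.
    rpm -q --changelog kernel | grep -i 'raid5-cache'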


1: https://rpmfind.net/linux/RPM/centos/8-stream/baseos/x86_64/Packages/kernel-4.18.0-277.el8.x86_64.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] md: warn about using another MD array as write journal
  2021-03-22 17:13       ` Song Liu
  2021-03-23  3:27         ` Manuel Riel
@ 2021-05-12 22:39         ` Vojtech Myslivec
  2021-05-13  1:19           ` Guoqing Jiang
  1 sibling, 1 reply; 8+ messages in thread
From: Vojtech Myslivec @ 2021-05-12 22:39 UTC (permalink / raw)
  To: Song Liu; +Cc: Manuel Riel, Linux-RAID, Michal Moravec

[-- Attachment #1: Type: text/plain, Size: 1543 bytes --]

It has been two months since I last reported the state of the issue:

On 17. 03. 21 16:55, Vojtech Myslivec wrote:
 > Thanks a lot Manuel for your findings and information.
 >
 > I have moved the journal from a logical volume on RAID1 to a plain
 > partition on an SSD and I will monitor the state.

So, we now run the MD RAID 6 array (/dev/md1) with the journal device on
a plain partition of one of the SSDs (/dev/sdh5). See the attached files
for more details.


Since then (March 17th), the issue we discussed has happened "only" three
times. The first occurrence was on April 21st, five weeks after moving the
journal.

*I can confirm that the issue still persists, but it is definitely less
frequent.*



On 22. 03. 21 18:13, Song Liu wrote:
 > Thanks for the information. Quick question: does the kernel have the
 > following change?
 >
 > commit c9020e64cf33f2dd5b2a7295f2bfea787279218a
 > Author: Song Liu <songliubraving@fb.com>
 > Date:   9 months ago
 >
 > ...

We run the latest available kernel from the Debian backports repository,
which is currently Linux 5.10. I checked that we already had kernel 5.10
back in March, when I moved the journal.

As far as I can tell, this particular patch is already part of kernel 5.9.
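
One way to double-check this from a mainline kernel checkout, using the
commit id Song posted above (just a suggested verification, run in a clone
of the mainline tree):

    # Lists the release tags that already contain the fix.
    git tag --contains c9020e64cf33f2dd5b2a7295f2bfea787279218a | sort -V | head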



Maybe unrelated, but I noticed this log message right after our "unstuck"
script performed some random I/O (just as I described earlier in this
e-mail thread):

     May 2 ... kernel: [2035647.004554] md: md1: data-check done.
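
(For context, the "unstuck" workaround boils down to something like the
sketch below -- a simplified example with a placeholder path, not the
actual script:)

    # Issue a small amount of synchronous I/O on the stuck array's
    # filesystem to nudge the raid thread, then clean up the temporary file.
    dd if=/dev/zero of=/mnt/data/.unstuck bs=4k count=16 oflag=sync
    rm -f /mnt/data/.unstuck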



I can provide more information if needed. Thanks for any new insights.

Vojtech Myslivec

[-- Attachment #2: lsblk.txt --]
[-- Type: text/plain, Size: 1475 bytes --]

NAME           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda              8:0    0   7,3T  0 disk  
└─md1            9:1    0  29,1T  0 raid6 /mnt/data
sdb              8:16   0   7,3T  0 disk  
└─md1            9:1    0  29,1T  0 raid6 /mnt/data
sdc              8:32   0   7,3T  0 disk  
└─md1            9:1    0  29,1T  0 raid6 /mnt/data
sdd              8:48   0   7,3T  0 disk  
└─md1            9:1    0  29,1T  0 raid6 /mnt/data
sde              8:64   0   7,3T  0 disk  
└─md1            9:1    0  29,1T  0 raid6 /mnt/data
sdf              8:80   0   7,3T  0 disk  
└─md1            9:1    0  29,1T  0 raid6 /mnt/data
sdg              8:96   1 223,6G  0 disk  
├─sdg1           8:97   1  37,3G  0 part  
│ └─md0          9:0    0  37,2G  0 raid1 
│   ├─vg0-swap 253:0    0   3,7G  0 lvm   [SWAP]
│   └─vg0-root 253:1    0  14,9G  0 lvm   /
├─sdg2           8:98   1     1K  0 part  
├─sdg5           8:101  1     8G  0 part  
└─sdg6           8:102  1 178,3G  0 part  
sdh              8:112  1 223,6G  0 disk  
├─sdh1           8:113  1  37,3G  0 part  
│ └─md0          9:0    0  37,2G  0 raid1 
│   ├─vg0-swap 253:0    0   3,7G  0 lvm   [SWAP]
│   └─vg0-root 253:1    0  14,9G  0 lvm   /
├─sdh2           8:114  1     1K  0 part  
├─sdh5           8:117  1     8G  0 part  
│ └─md1          9:1    0  29,1T  0 raid6 /mnt/data
└─sdh6           8:118  1 178,3G  0 part  

[-- Attachment #3: mdstat-detail-md0.txt --]
[-- Type: text/plain, Size: 813 bytes --]

/dev/md0:
           Version : 1.2
     Creation Time : Tue Jan  8 13:16:26 2019
        Raid Level : raid1
        Array Size : 39028736 (37.22 GiB 39.97 GB)
     Used Dev Size : 39028736 (37.22 GiB 39.97 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Thu May 13 00:17:06 2021
             State : clean 
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : backup1:0  (local to host backup1)
              UUID : fe06ac67:967c62f7:5ef1b67b:7b951104
            Events : 697

    Number   Major   Minor   RaidDevice State
       0       8       97        0      active sync   /dev/sdg1
       1       8      113        1      active sync   /dev/sdh1


[-- Attachment #4: mdstat-detail-md1.txt --]
[-- Type: text/plain, Size: 1207 bytes --]

/dev/md1:
           Version : 1.2
     Creation Time : Wed Apr  3 17:16:20 2019
        Raid Level : raid6
        Array Size : 31256100864 (29808.14 GiB 32006.25 GB)
     Used Dev Size : 7814025216 (7452.04 GiB 8001.56 GB)
      Raid Devices : 6
     Total Devices : 7
       Persistence : Superblock is persistent

       Update Time : Thu May 13 00:15:22 2021
             State : clean 
    Active Devices : 6
   Working Devices : 7
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : journal

              Name : backup1:1  (local to host backup1)
              UUID : fd61cb22:30bfc616:6506829d:9319af95
            Events : 2588836

    Number   Major   Minor   RaidDevice State
       1       8       16        0      active sync   /dev/sdb
       2       8        0        1      active sync   /dev/sda
       3       8       32        2      active sync   /dev/sdc
       4       8       48        3      active sync   /dev/sdd
       5       8       64        4      active sync   /dev/sde
       6       8       80        5      active sync   /dev/sdf

       7       8      117        -      journal   /dev/sdh5


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] md: warn about using another MD array as write journal
  2021-05-12 22:39         ` Vojtech Myslivec
@ 2021-05-13  1:19           ` Guoqing Jiang
  0 siblings, 0 replies; 8+ messages in thread
From: Guoqing Jiang @ 2021-05-13  1:19 UTC (permalink / raw)
  To: Vojtech Myslivec, Song Liu; +Cc: Manuel Riel, Linux-RAID, Michal Moravec



On 5/13/21 6:39 AM, Vojtech Myslivec wrote:
> It has been two months since I last reported the state of the issue:
>
> On 17. 03. 21 16:55, Vojtech Myslivec wrote:
> > Thanks a lot Manuel for your findings and information.
> >
> > I have moved the journal from a logical volume on RAID1 to a plain
> > partition on an SSD and I will monitor the state.
>
> So, we now run the MD RAID 6 array (/dev/md1) with the journal device
> on a plain partition of one of the SSDs (/dev/sdh5). See the attached
> files for more details.
>
>
> Since then (March 17th), the issue we discussed has happened "only"
> three times. The first occurrence was on April 21st, five weeks after
> moving the journal.
>
> *I can confirm that the issue still persists, but it is definitely less
> frequent.*

Could you check if this helps?

diff --git a/drivers/md/md.c b/drivers/md/md.c
index bd813f747769..b97429f19247 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1014,6 +1014,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
             !test_bit(LastDev, &rdev->flags))
                 ff = MD_FAILFAST;
         bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH | REQ_FUA | ff;
+       bio->bi_opf |= REQ_IDLE;

Thanks,
Guoqing

^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-05-13  1:20 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-17  4:37 [PATCH] md: warn about using another MD array as write journal Manuel Riel
2021-03-19 23:16 ` Song Liu
2021-03-20  1:12   ` Manuel Riel
2021-03-21  4:22     ` Manuel Riel
2021-03-22 17:13       ` Song Liu
2021-03-23  3:27         ` Manuel Riel
2021-05-12 22:39         ` Vojtech Myslivec
2021-05-13  1:19           ` Guoqing Jiang
