* [PATCH] md: warn about using another MD array as write journal
From: Manuel Riel
Date: 2021-03-17  4:37
To: Linux-RAID, Song Liu; +Cc: Vojtech Myslivec

To follow up on a previous discussion[1] about stuck RAIDs, I'd like to propose
adding a warning about this to the relevant docs. Specifically, users shouldn't
add another MD array as a journal device.

Ideally mdadm would check for this, but having it in the docs is useful too.

1: https://lore.kernel.org/linux-btrfs/d3fced3f-6c2b-5ffa-fd24-b24ec6e7d4be@xmyslivec.cz/

---

diff --git a/Documentation/driver-api/md/raid5-cache.rst b/Documentation/driver-api/md/raid5-cache.rst
index d7a15f44a..128044018 100644
--- a/Documentation/driver-api/md/raid5-cache.rst
+++ b/Documentation/driver-api/md/raid5-cache.rst
@@ -17,7 +17,10 @@ And switch it back to write-through mode by::

 	echo "write-through" > /sys/block/md0/md/journal_mode

 In both modes, all writes to the array will hit cache disk first. This means
-the cache disk must be fast and sustainable.
+the cache disk must be fast and sustainable. The cache disk also can't be
+another MD RAID array, since such a nested setup can cause problems when
+assembling an array or lead to the primary array getting stuck during
+operation.

 write-through mode
 ==================
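[Editorial note: the message above says mdadm would ideally reject such a setup
itself. A minimal user-side pre-flight check is easy to sketch; the
`is_md_device` helper below is hypothetical — it is not part of mdadm — and
simply inspects the canonical device name.]

```shell
#!/bin/sh
# Hypothetical pre-flight check before adding a journal device: refuse one
# that is itself an MD array. A user-side sketch, not part of mdadm.
is_md_device() {
    # readlink -f canonicalizes symlinks such as /dev/md/name -> /dev/md127
    dev=$(readlink -f "$1")
    case "$(basename "$dev")" in
        md*) return 0 ;;   # mdX (or a partition on one): an MD array
        *)   return 1 ;;
    esac
}

if is_md_device /dev/md0; then
    echo "refusing: journal device is itself an MD array"
fi
```

[A real check inside mdadm would look at the device's major number (9 for md)
rather than its name, but the name check is enough to illustrate the idea.]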
* Re: [PATCH] md: warn about using another MD array as write journal
From: Song Liu
Date: 2021-03-19 23:16
To: Manuel Riel; +Cc: Linux-RAID, Vojtech Myslivec

On Tue, Mar 16, 2021 at 9:39 PM Manuel Riel <manu@snapdragon.cc> wrote:
>
> To follow up on a previous discussion[1] about stuck RAIDs, I'd like to propose adding a warning
> about this to the relevant docs. Specifically users shouldn't add other MD arrays as journal device.
>
> Ideally mdadm would check for this, but having it in the docs is useful too.
>
> 1: https://lore.kernel.org/linux-btrfs/d3fced3f-6c2b-5ffa-fd24-b24ec6e7d4be@xmyslivec.cz/

Sorry for being late on this issue.

Manuel and Vojtech, are we confident that this issue only happens when we use
another md array as the journal device?

Thanks,
Song
* Re: [PATCH] md: warn about using another MD array as write journal
From: Manuel Riel
Date: 2021-03-20  1:12
To: Song Liu; +Cc: Linux-RAID, Vojtech Myslivec

On Mar 20, 2021, at 7:16 AM, Song Liu <song@kernel.org> wrote:
>
> Sorry for being late on this issue.
>
> Manuel and Vojtech, are we confident that this issue only happens when we use
> another md array as the journal device?

Hi Song,

thanks for getting back.

Unfortunately it's still happening, even when using an NVMe partition directly.
It just took a long three weeks to happen. So discard my patch. Here's how it
went down yesterday:

- the md4_raid6 process runs at 100% CPU utilization, and all I/O to the array
  is blocked
- there is no disk activity on the physical drives
- a soft reboot doesn't work, as md4_raid6 blocks, so a hard reset is needed
- when booting into rescue mode, it tries to assemble the array and shows the
  same 100% CPU utilization. It also can't reboot.
- when manually assembling the array *with* the journal drive, it reads a few
  GB from the journal device and then gets stuck at 100% CPU utilization again,
  without any disk activity

The solution in the end was to avoid assembling the array on reboot, then
assemble it *without* the existing journal and add an empty journal drive
later. This led to some data loss and a full resync.

I'm currently moving all data off this machine and will repave it. Then I'll
see if that changes anything.

My main OS is CentOS 8 and the rescue system was Debian. Both showed a similar
issue, so this must be connected to the journal drive somehow. My journal drive
is a partition on an NVMe, ~180 GB in size.

Thanks for any pointers I could try next.

Manu
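[Editorial note: the first symptom above — an `md4_raid6` kernel thread pinned
at 100% CPU with no disk activity — can be watched for automatically. A rough
monitoring sketch follows; the 90% threshold and the captured `ps` sample are
illustrative, not taken from the incident itself.]

```shell
#!/bin/sh
# Flag md RAID kernel threads that are pinned on the CPU, the symptom
# described above. On a live system you would feed this from
# `ps -eo pcpu,comm --sort=-pcpu`; a captured sample stands in here.
sample='%CPU COMMAND
 99.9 md4_raid6
  0.3 kworker/0:1
  0.1 sshd'

printf '%s\n' "$sample" | awk '
    $2 ~ /^md[0-9]+_raid/ && $1 + 0 > 90 {
        print $2 " is pinned at " $1 "% CPU - possible stuck journal"
    }'
```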
* Re: [PATCH] md: warn about using another MD array as write journal
From: Manuel Riel
Date: 2021-03-21  4:22
To: Song Liu; +Cc: Linux-RAID, Vojtech Myslivec

My impression is that the write-journal feature isn't fully stable yet, as was
already reported in 2019[1]. Vojtech and I are seeing the same errors as
mentioned there, no matter whether the journal is on a plain block device or on
another RAID.

1: https://www.spinics.net/lists/raid/msg62646.html

> On Mar 20, 2021, at 9:12 AM, Manuel Riel <manu@snapdragon.cc> wrote:
>
> Unfortunately it's still happening, even when using a NVMe partition directly.
> It just took a long 3 weeks to happen. So discard my patch.
* Re: [PATCH] md: warn about using another MD array as write journal
From: Song Liu
Date: 2021-03-22 17:13
To: Manuel Riel; +Cc: Linux-RAID, Vojtech Myslivec

On Sat, Mar 20, 2021 at 9:22 PM Manuel Riel <manu@snapdragon.cc> wrote:
>
> My impression is that the write-journal feature isn't fully stable yet, as was
> already reported in 2019[1]. Vojtech and me are seeing the same errors as
> mentioned there. No matter if the journal is on a block device or another RAID.
>
> 1: https://www.spinics.net/lists/raid/msg62646.html

Thanks for the information. Quick question: does the kernel have the following
change? It fixes an issue at recovery time. Since you see the issue in normal
execution, it is probably something different.

Thanks,
Song

commit c9020e64cf33f2dd5b2a7295f2bfea787279218a
Author: Song Liu <songliubraving@fb.com>
Date:   9 months ago

    md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes

    In recovery, if we process too much data, raid5-cache may set
    MD_SB_CHANGE_PENDING, which causes spinning in handle_stripe().
    Fix this issue by clearing the bit before flushing data only stripes.

    This issue was initially discussed in [1].

    [1] https://www.spinics.net/lists/raid/msg64409.html

    Signed-off-by: Song Liu <songliubraving@fb.com>
* Re: [PATCH] md: warn about using another MD array as write journal
From: Manuel Riel
Date: 2021-03-23  3:27
To: Song Liu; +Cc: Linux-RAID, Vojtech Myslivec

> On Mar 23, 2021, at 1:13 AM, Song Liu <song@kernel.org> wrote:
>
> Thanks for the information. Quick question, does the kernel have the
> following change? It fixes an issue at recovery time. Since you see the
> issue in normal execution, it is probably something different.
>
> commit c9020e64cf33f2dd5b2a7295f2bfea787279218a
> Author: Song Liu <songliubraving@fb.com>
>
>     md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes

Interesting. No, it doesn't have this change. My active kernel here is CentOS
4.18.0-240. They only added this patch in 4.18.0-277.[1]

I'll try a kernel with this commit then. Thanks for the hint!

1: https://rpmfind.net/linux/RPM/centos/8-stream/baseos/x86_64/Packages/kernel-4.18.0-277.el8.x86_64.html
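[Editorial note: because distribution kernels backport fixes without changing
the upstream version string, the build number is what matters here. A small
sketch of that comparison follows — the 4.18.0-277 cutoff comes from the
rpmfind link above, and the `has_fix` helper name is made up for this sketch.]

```shell
#!/bin/sh
# Decide whether a CentOS 8 kernel build already carries the raid5-cache fix
# by comparing release strings. Cutoff: 4.18.0-277, the first build listed as
# carrying the backport.
has_fix() {
    first_fixed="4.18.0-277"
    # sort -V orders version strings numerically; the fix is present when the
    # queried release sorts at or after the first fixed build.
    printf '%s\n%s\n' "$first_fixed" "$1" | sort -V | head -n 1 | grep -qx "$first_fixed"
}

has_fix "4.18.0-240" && echo "patched" || echo "missing fix"   # kernel from this mail
has_fix "4.18.0-277" && echo "patched" || echo "missing fix"
```

[On the machine itself one would pass `$(uname -r)` instead of a literal
string; the `.el8.x86_64` suffix does not disturb the `sort -V` ordering.]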
* Re: [PATCH] md: warn about using another MD array as write journal
From: Vojtech Myslivec
Date: 2021-05-12 22:39
To: Song Liu; +Cc: Manuel Riel, Linux-RAID, Michal Moravec

It has been two months since I last reported the state of the issue:

On 17. 03. 21 16:55, Vojtech Myslivec wrote:
> Thanks a lot Manuel for your findings and information.
>
> I have moved the journal from a logical volume on RAID1 to a plain
> partition on an SSD and I will monitor the state.

So we now run the MD level 6 array (/dev/md1) with the journal device on a
plain partition of one of the SSDs (/dev/sdh5). See the attached files for
more details.

Since then (March 17th), our discussed issue has happened "only" three times.
The first occurrence was on April 21st, five weeks after moving the journal.

*I can confirm that the issue still persists, but it is definitely less
frequent.*

On 22. 03. 21 18:13, Song Liu wrote:
> Thanks for the information. Quick question, does the kernel have the
> following change?
>
> commit c9020e64cf33f2dd5b2a7295f2bfea787279218a
> Author: Song Liu <songliubraving@fb.com>

We run the latest available kernel from the "Debian backports" distribution
repository, which is currently Linux 5.10. I checked that we already had
kernel 5.10 in March, when I moved the journal. If I checked correctly, this
particular patch is already part of kernel 5.9.

Maybe unrelated, but I noticed this log message just after our "unstuck"
script performed some random I/O operations (as I described before in this
e-mail thread):

May  2 ... kernel: [2035647.004554] md: md1: data-check done.

I can provide more information if needed. Thanks for any new info.

Vojtech Myslivec

[-- Attachment: lsblk.txt --]

NAME             MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                8:0    0   7,3T  0 disk
└─md1              9:1    0  29,1T  0 raid6 /mnt/data
sdb                8:16   0   7,3T  0 disk
└─md1              9:1    0  29,1T  0 raid6 /mnt/data
sdc                8:32   0   7,3T  0 disk
└─md1              9:1    0  29,1T  0 raid6 /mnt/data
sdd                8:48   0   7,3T  0 disk
└─md1              9:1    0  29,1T  0 raid6 /mnt/data
sde                8:64   0   7,3T  0 disk
└─md1              9:1    0  29,1T  0 raid6 /mnt/data
sdf                8:80   0   7,3T  0 disk
└─md1              9:1    0  29,1T  0 raid6 /mnt/data
sdg                8:96   1 223,6G  0 disk
├─sdg1             8:97   1  37,3G  0 part
│ └─md0            9:0    0  37,2G  0 raid1
│   ├─vg0-swap   253:0    0   3,7G  0 lvm   [SWAP]
│   └─vg0-root   253:1    0  14,9G  0 lvm   /
├─sdg2             8:98   1     1K  0 part
├─sdg5             8:101  1     8G  0 part
└─sdg6             8:102  1 178,3G  0 part
sdh                8:112  1 223,6G  0 disk
├─sdh1             8:113  1  37,3G  0 part
│ └─md0            9:0    0  37,2G  0 raid1
│   ├─vg0-swap   253:0    0   3,7G  0 lvm   [SWAP]
│   └─vg0-root   253:1    0  14,9G  0 lvm   /
├─sdh2             8:114  1     1K  0 part
├─sdh5             8:117  1     8G  0 part
│ └─md1            9:1    0  29,1T  0 raid6 /mnt/data
└─sdh6             8:118  1 178,3G  0 part

[-- Attachment: mdstat-detail-md0.txt --]

/dev/md0:
           Version : 1.2
     Creation Time : Tue Jan  8 13:16:26 2019
        Raid Level : raid1
        Array Size : 39028736 (37.22 GiB 39.97 GB)
     Used Dev Size : 39028736 (37.22 GiB 39.97 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Thu May 13 00:17:06 2021
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : backup1:0  (local to host backup1)
              UUID : fe06ac67:967c62f7:5ef1b67b:7b951104
            Events : 697

    Number   Major   Minor   RaidDevice State
       0       8       97        0      active sync   /dev/sdg1
       1       8      113        1      active sync   /dev/sdh1

[-- Attachment: mdstat-detail-md1.txt --]

/dev/md1:
           Version : 1.2
     Creation Time : Wed Apr  3 17:16:20 2019
        Raid Level : raid6
        Array Size : 31256100864 (29808.14 GiB 32006.25 GB)
     Used Dev Size : 7814025216 (7452.04 GiB 8001.56 GB)
      Raid Devices : 6
     Total Devices : 7
       Persistence : Superblock is persistent

       Update Time : Thu May 13 00:15:22 2021
             State : clean
    Active Devices : 6
   Working Devices : 7
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : journal

              Name : backup1:1  (local to host backup1)
              UUID : fd61cb22:30bfc616:6506829d:9319af95
            Events : 2588836

    Number   Major   Minor   RaidDevice State
       1       8       16        0      active sync   /dev/sdb
       2       8        0        1      active sync   /dev/sda
       3       8       32        2      active sync   /dev/sdc
       4       8       48        3      active sync   /dev/sdd
       5       8       64        4      active sync   /dev/sde
       6       8       80        5      active sync   /dev/sdf

       7       8      117        -      journal   /dev/sdh5
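[Editorial note on reading the attachment above: the journal member is the row
whose State column says "journal", and it can be pulled out of `mdadm --detail`
output mechanically. A sketch using a captured fragment of the table; on a live
system one would pipe `mdadm --detail /dev/md1` in instead.]

```shell
#!/bin/sh
# Extract the journal device from `mdadm --detail` output. The sample is a
# fragment of the md1 attachment above.
sample='    Number   Major   Minor   RaidDevice State
       6       8       80        5      active sync   /dev/sdf
       7       8      117        -      journal   /dev/sdh5'

# In the device table, a journal member has "journal" in field 5 and the
# device node in field 6; active members have "active sync" there instead.
printf '%s\n' "$sample" | awk '$5 == "journal" { print $6 }'
```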
* Re: [PATCH] md: warn about using another MD array as write journal
From: Guoqing Jiang
Date: 2021-05-13  1:19
To: Vojtech Myslivec, Song Liu; +Cc: Manuel Riel, Linux-RAID, Michal Moravec

On 5/13/21 6:39 AM, Vojtech Myslivec wrote:
> Since then (March 17th), our discussed issue happened "only" three
> times. First occurrence was on April 21st, 5 weeks after moving the
> journal.
>
> *I can confirm that the issue still persist, but it is definitely less
> frequent.*

Could you check if this helps?

diff --git a/drivers/md/md.c b/drivers/md/md.c
index bd813f747769..b97429f19247 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1014,6 +1014,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
 	    !test_bit(LastDev, &rdev->flags))
 		ff = MD_FAILFAST;
 	bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH | REQ_FUA | ff;
+	bio->bi_opf |= REQ_IDLE;

Thanks,
Guoqing