Fwd: The read data is wrong from raid5 when recovery happens

All of lore.kernel.org
 help / color / mirror / Atom feed

* Fwd: The read data is wrong from raid5 when recovery happens
       [not found] <CALTww28aV5CGXQAu46Rkc=fG1jK=ARzCT8VGoVyje8kQdqEXMg@mail.gmail.com>
@ 2023-05-26  2:08 ` Xiao Ni
  2023-05-26  2:17   ` Yu Kuai
                     ` (2 more replies)
  2023-05-26  3:09 ` Guoqing Jiang
  1 sibling, 3 replies; 33+ messages in thread
From: Xiao Ni @ 2023-05-26  2:08 UTC (permalink / raw)
  To: linux-raid, Song Liu, Guoqing Jiang, Heinz Mauelshagen, Nigel Croxon

I received an email that this email can't delivered to someone. Resent
it to linux-raid again.

---------- Forwarded message ---------
From: Xiao Ni <xni@redhat.com>
Date: Fri, May 26, 2023 at 9:49 AM
Subject: The read data is wrong from raid5 when recovery happens
To: Song Liu <song@kernel.org>, Guoqing Jiang <guoqing.jiang@linux.dev>
Cc: linux-raid <linux-raid@vger.kernel.org>, Heinz Mauelshagen
<heinzm@redhat.com>, Nigel Croxon <ncroxon@redhat.com>

Hi all

We found a problem recently. The read data is wrong when recovery
happens. Now we've found it's introduced by patch 10764815f (md: add
io accounting for raid0 and raid5). I can reproduce this 100%. This
problem exists in upstream. The test steps are like this:

1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean
2. mkfs.ext4 -F $devname
3. mount $devname $mount_point
4. mdadm --incremental --fail sdd
5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000 status=progress
6. mdadm /dev/md126 --add /dev/sdd
7. create 31 processes that writes and reads. It compares the content
with md5sum. The test will go on until the recovery stops
8. wait for about 10 minutes, we can see some processes report
checksum is wrong. But if it re-read the data again, the checksum will
be good.

I tried to narrow this problem like this:

-       md_account_bio(mddev, &bi);
+       if (rw == WRITE)
+               md_account_bio(mddev, &bi);
If it only do account for write requests, the problem can disappear.

-       if (rw == READ && mddev->degraded == 0 &&
-           mddev->reshape_position == MaxSector) {
-               bi = chunk_aligned_read(mddev, bi);
-               if (!bi)
-                       return true;
-       }
+       //if (rw == READ && mddev->degraded == 0 &&
+       //    mddev->reshape_position == MaxSector) {
+       //      bi = chunk_aligned_read(mddev, bi);
+       //      if (!bi)
+       //              return true;
+       //}

        if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
                make_discard_request(mddev, bi);
@@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev
*mddev, struct bio * bi)
                        md_write_end(mddev);
                return true;
        }
-       md_account_bio(mddev, &bi);
+       if (rw == READ)
+               md_account_bio(mddev, &bi);

I comment the chunk_aligned_read out and only account for read
requests, this problem can be reproduced.

-- 
Best Regards
Xiao Ni

-- 
Best Regards
Xiao Ni

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Fwd: The read data is wrong from raid5 when recovery happens
  2023-05-26  2:08 ` Fwd: The read data is wrong from raid5 when recovery happens Xiao Ni
@ 2023-05-26  2:17   ` Yu Kuai
  2023-05-26  2:40     ` Xiao Ni
  2023-05-26  3:56   ` d tbsky
  2024-02-14 15:15   ` Fwd: " Mateusz Kusiak
  2 siblings, 1 reply; 33+ messages in thread
From: Yu Kuai @ 2023-05-26  2:17 UTC (permalink / raw)
  To: Xiao Ni, linux-raid, Song Liu, Guoqing Jiang, Heinz Mauelshagen,
	Nigel Croxon, yukuai (C)

Hi,

在 2023/05/26 10:08, Xiao Ni 写道:
> I received an email that this email can't delivered to someone. Resent
> it to linux-raid again.
> 
> ---------- Forwarded message ---------
> From: Xiao Ni <xni@redhat.com>
> Date: Fri, May 26, 2023 at 9:49 AM
> Subject: The read data is wrong from raid5 when recovery happens
> To: Song Liu <song@kernel.org>, Guoqing Jiang <guoqing.jiang@linux.dev>
> Cc: linux-raid <linux-raid@vger.kernel.org>, Heinz Mauelshagen
> <heinzm@redhat.com>, Nigel Croxon <ncroxon@redhat.com>
> 
> 
> Hi all
> 
> We found a problem recently. The read data is wrong when recovery
> happens. Now we've found it's introduced by patch 10764815f (md: add
> io accounting for raid0 and raid5). I can reproduce this 100%. This
> problem exists in upstream. The test steps are like this:
> 
> 1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean
> 2. mkfs.ext4 -F $devname
> 3. mount $devname $mount_point
> 4. mdadm --incremental --fail sdd
> 5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000 status=progress
> 6. mdadm /dev/md126 --add /dev/sdd
Can you try to zero superblock before add sdd? just to bypass readd.

Thanks,
Kuai
> 7. create 31 processes that writes and reads. It compares the content
> with md5sum. The test will go on until the recovery stops
> 8. wait for about 10 minutes, we can see some processes report
> checksum is wrong. But if it re-read the data again, the checksum will
> be good.
> 
> I tried to narrow this problem like this:
> 
> -       md_account_bio(mddev, &bi);
> +       if (rw == WRITE)
> +               md_account_bio(mddev, &bi);
> If it only do account for write requests, the problem can disappear.
> 
> -       if (rw == READ && mddev->degraded == 0 &&
> -           mddev->reshape_position == MaxSector) {
> -               bi = chunk_aligned_read(mddev, bi);
> -               if (!bi)
> -                       return true;
> -       }
> +       //if (rw == READ && mddev->degraded == 0 &&
> +       //    mddev->reshape_position == MaxSector) {
> +       //      bi = chunk_aligned_read(mddev, bi);
> +       //      if (!bi)
> +       //              return true;
> +       //}
> 
>          if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
>                  make_discard_request(mddev, bi);
> @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev
> *mddev, struct bio * bi)
>                          md_write_end(mddev);
>                  return true;
>          }
> -       md_account_bio(mddev, &bi);
> +       if (rw == READ)
> +               md_account_bio(mddev, &bi);
> 
> I comment the chunk_aligned_read out and only account for read
> requests, this problem can be reproduced.
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Fwd: The read data is wrong from raid5 when recovery happens
  2023-05-26  2:17   ` Yu Kuai
@ 2023-05-26  2:40     ` Xiao Ni
  2023-05-26  2:47       ` Yu Kuai
  0 siblings, 1 reply; 33+ messages in thread
From: Xiao Ni @ 2023-05-26  2:40 UTC (permalink / raw)
  To: Yu Kuai
  Cc: linux-raid, Song Liu, Guoqing Jiang, Heinz Mauelshagen,
	Nigel Croxon, yukuai (C)

On Fri, May 26, 2023 at 10:18 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2023/05/26 10:08, Xiao Ni 写道:
> > I received an email that this email can't delivered to someone. Resent
> > it to linux-raid again.
> >
> > ---------- Forwarded message ---------
> > From: Xiao Ni <xni@redhat.com>
> > Date: Fri, May 26, 2023 at 9:49 AM
> > Subject: The read data is wrong from raid5 when recovery happens
> > To: Song Liu <song@kernel.org>, Guoqing Jiang <guoqing.jiang@linux.dev>
> > Cc: linux-raid <linux-raid@vger.kernel.org>, Heinz Mauelshagen
> > <heinzm@redhat.com>, Nigel Croxon <ncroxon@redhat.com>
> >
> >
> > Hi all
> >
> > We found a problem recently. The read data is wrong when recovery
> > happens. Now we've found it's introduced by patch 10764815f (md: add
> > io accounting for raid0 and raid5). I can reproduce this 100%. This
> > problem exists in upstream. The test steps are like this:
> >
> > 1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean
> > 2. mkfs.ext4 -F $devname
> > 3. mount $devname $mount_point
> > 4. mdadm --incremental --fail sdd
> > 5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000 status=progress
> > 6. mdadm /dev/md126 --add /dev/sdd
> Can you try to zero superblock before add sdd? just to bypass readd.

Hi Kuai

I tried with this. It can still be reproduced.

>
> Thanks,
> Kuai
> > 7. create 31 processes that writes and reads. It compares the content
> > with md5sum. The test will go on until the recovery stops
> > 8. wait for about 10 minutes, we can see some processes report
> > checksum is wrong. But if it re-read the data again, the checksum will
> > be good.
> >
> > I tried to narrow this problem like this:
> >
> > -       md_account_bio(mddev, &bi);
> > +       if (rw == WRITE)
> > +               md_account_bio(mddev, &bi);
> > If it only do account for write requests, the problem can disappear.
> >
> > -       if (rw == READ && mddev->degraded == 0 &&
> > -           mddev->reshape_position == MaxSector) {
> > -               bi = chunk_aligned_read(mddev, bi);
> > -               if (!bi)
> > -                       return true;
> > -       }
> > +       //if (rw == READ && mddev->degraded == 0 &&
> > +       //    mddev->reshape_position == MaxSector) {
> > +       //      bi = chunk_aligned_read(mddev, bi);
> > +       //      if (!bi)
> > +       //              return true;
> > +       //}
> >
> >          if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
> >                  make_discard_request(mddev, bi);
> > @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev
> > *mddev, struct bio * bi)
> >                          md_write_end(mddev);
> >                  return true;
> >          }
> > -       md_account_bio(mddev, &bi);
> > +       if (rw == READ)
> > +               md_account_bio(mddev, &bi);
> >
> > I comment the chunk_aligned_read out and only account for read
> > requests, this problem can be reproduced.
> >
>


-- 
Best Regards
Xiao Ni


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Fwd: The read data is wrong from raid5 when recovery happens
  2023-05-26  2:40     ` Xiao Ni
@ 2023-05-26  2:47       ` Yu Kuai
  2023-05-26  3:02         ` Xiao Ni
  0 siblings, 1 reply; 33+ messages in thread
From: Yu Kuai @ 2023-05-26  2:47 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai
  Cc: linux-raid, Song Liu, Guoqing Jiang, Heinz Mauelshagen,
	Nigel Croxon, yukuai (C)

Hi,

在 2023/05/26 10:40, Xiao Ni 写道:
> On Fri, May 26, 2023 at 10:18 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> 在 2023/05/26 10:08, Xiao Ni 写道:
>>> I received an email that this email can't delivered to someone. Resent
>>> it to linux-raid again.
>>>
>>> ---------- Forwarded message ---------
>>> From: Xiao Ni <xni@redhat.com>
>>> Date: Fri, May 26, 2023 at 9:49 AM
>>> Subject: The read data is wrong from raid5 when recovery happens
>>> To: Song Liu <song@kernel.org>, Guoqing Jiang <guoqing.jiang@linux.dev>
>>> Cc: linux-raid <linux-raid@vger.kernel.org>, Heinz Mauelshagen
>>> <heinzm@redhat.com>, Nigel Croxon <ncroxon@redhat.com>
>>>
>>>
>>> Hi all
>>>
>>> We found a problem recently. The read data is wrong when recovery
>>> happens. Now we've found it's introduced by patch 10764815f (md: add
>>> io accounting for raid0 and raid5). I can reproduce this 100%. This
>>> problem exists in upstream. The test steps are like this:
>>>
>>> 1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean
>>> 2. mkfs.ext4 -F $devname
>>> 3. mount $devname $mount_point
>>> 4. mdadm --incremental --fail sdd
>>> 5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000 status=progress
>>> 6. mdadm /dev/md126 --add /dev/sdd
>> Can you try to zero superblock before add sdd? just to bypass readd.
> 
> Hi Kuai
> 
> I tried with this. It can still be reproduced.

Ok, I asked this because we found readd has some problem while testing
raid10, and it's easy to reporduce...

Then, there is a related fixed patch just merged:

md/raid5: fix miscalculation of 'end_sector' in raid5_read_one_chunk()

The upstream that you tried contain this one?

Thanks,
Kuai
> 
>>
>> Thanks,
>> Kuai
>>> 7. create 31 processes that writes and reads. It compares the content
>>> with md5sum. The test will go on until the recovery stops
>>> 8. wait for about 10 minutes, we can see some processes report
>>> checksum is wrong. But if it re-read the data again, the checksum will
>>> be good.
>>>
>>> I tried to narrow this problem like this:
>>>
>>> -       md_account_bio(mddev, &bi);
>>> +       if (rw == WRITE)
>>> +               md_account_bio(mddev, &bi);
>>> If it only do account for write requests, the problem can disappear.
>>>
>>> -       if (rw == READ && mddev->degraded == 0 &&
>>> -           mddev->reshape_position == MaxSector) {
>>> -               bi = chunk_aligned_read(mddev, bi);
>>> -               if (!bi)
>>> -                       return true;
>>> -       }
>>> +       //if (rw == READ && mddev->degraded == 0 &&
>>> +       //    mddev->reshape_position == MaxSector) {
>>> +       //      bi = chunk_aligned_read(mddev, bi);
>>> +       //      if (!bi)
>>> +       //              return true;
>>> +       //}
>>>
>>>           if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
>>>                   make_discard_request(mddev, bi);
>>> @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev
>>> *mddev, struct bio * bi)
>>>                           md_write_end(mddev);
>>>                   return true;
>>>           }
>>> -       md_account_bio(mddev, &bi);
>>> +       if (rw == READ)
>>> +               md_account_bio(mddev, &bi);
>>>
>>> I comment the chunk_aligned_read out and only account for read
>>> requests, this problem can be reproduced.
>>>
>>
> 
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Fwd: The read data is wrong from raid5 when recovery happens
  2023-05-26  2:47       ` Yu Kuai
@ 2023-05-26  3:02         ` Xiao Ni
  0 siblings, 0 replies; 33+ messages in thread
From: Xiao Ni @ 2023-05-26  3:02 UTC (permalink / raw)
  To: Yu Kuai
  Cc: linux-raid, Song Liu, Guoqing Jiang, Heinz Mauelshagen,
	Nigel Croxon, yukuai (C)

On Fri, May 26, 2023 at 10:48 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2023/05/26 10:40, Xiao Ni 写道:
> > On Fri, May 26, 2023 at 10:18 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> Hi,
> >>
> >> 在 2023/05/26 10:08, Xiao Ni 写道:
> >>> I received an email that this email can't delivered to someone. Resent
> >>> it to linux-raid again.
> >>>
> >>> ---------- Forwarded message ---------
> >>> From: Xiao Ni <xni@redhat.com>
> >>> Date: Fri, May 26, 2023 at 9:49 AM
> >>> Subject: The read data is wrong from raid5 when recovery happens
> >>> To: Song Liu <song@kernel.org>, Guoqing Jiang <guoqing.jiang@linux.dev>
> >>> Cc: linux-raid <linux-raid@vger.kernel.org>, Heinz Mauelshagen
> >>> <heinzm@redhat.com>, Nigel Croxon <ncroxon@redhat.com>
> >>>
> >>>
> >>> Hi all
> >>>
> >>> We found a problem recently. The read data is wrong when recovery
> >>> happens. Now we've found it's introduced by patch 10764815f (md: add
> >>> io accounting for raid0 and raid5). I can reproduce this 100%. This
> >>> problem exists in upstream. The test steps are like this:
> >>>
> >>> 1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean
> >>> 2. mkfs.ext4 -F $devname
> >>> 3. mount $devname $mount_point
> >>> 4. mdadm --incremental --fail sdd
> >>> 5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000 status=progress
> >>> 6. mdadm /dev/md126 --add /dev/sdd
> >> Can you try to zero superblock before add sdd? just to bypass readd.
> >
> > Hi Kuai
> >
> > I tried with this. It can still be reproduced.
>
> Ok, I asked this because we found readd has some problem while testing
> raid10, and it's easy to reporduce...
>
> Then, there is a related fixed patch just merged:
>
> md/raid5: fix miscalculation of 'end_sector' in raid5_read_one_chunk()
>
> The upstream that you tried contain this one?


Hi

My test version doesn't have this patch. But as mentioned in the
original email, I comment the codes that calling raid5_read_one_chunk
out and it still can reproduce this bug. So this patch should not fix
this bug.

Regards
Xiao

>
>
> Thanks,
> Kuai
> >
> >>
> >> Thanks,
> >> Kuai
> >>> 7. create 31 processes that writes and reads. It compares the content
> >>> with md5sum. The test will go on until the recovery stops
> >>> 8. wait for about 10 minutes, we can see some processes report
> >>> checksum is wrong. But if it re-read the data again, the checksum will
> >>> be good.
> >>>
> >>> I tried to narrow this problem like this:
> >>>
> >>> -       md_account_bio(mddev, &bi);
> >>> +       if (rw == WRITE)
> >>> +               md_account_bio(mddev, &bi);
> >>> If it only do account for write requests, the problem can disappear.
> >>>
> >>> -       if (rw == READ && mddev->degraded == 0 &&
> >>> -           mddev->reshape_position == MaxSector) {
> >>> -               bi = chunk_aligned_read(mddev, bi);
> >>> -               if (!bi)
> >>> -                       return true;
> >>> -       }
> >>> +       //if (rw == READ && mddev->degraded == 0 &&
> >>> +       //    mddev->reshape_position == MaxSector) {
> >>> +       //      bi = chunk_aligned_read(mddev, bi);
> >>> +       //      if (!bi)
> >>> +       //              return true;
> >>> +       //}
> >>>
> >>>           if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
> >>>                   make_discard_request(mddev, bi);
> >>> @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev
> >>> *mddev, struct bio * bi)
> >>>                           md_write_end(mddev);
> >>>                   return true;
> >>>           }
> >>> -       md_account_bio(mddev, &bi);
> >>> +       if (rw == READ)
> >>> +               md_account_bio(mddev, &bi);
> >>>
> >>> I comment the chunk_aligned_read out and only account for read
> >>> requests, this problem can be reproduced.
> >>>
> >>
> >
> >
>


-- 
Best Regards
Xiao Ni


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
       [not found] <CALTww28aV5CGXQAu46Rkc=fG1jK=ARzCT8VGoVyje8kQdqEXMg@mail.gmail.com>
  2023-05-26  2:08 ` Fwd: The read data is wrong from raid5 when recovery happens Xiao Ni
@ 2023-05-26  3:09 ` Guoqing Jiang
  2023-05-26  6:45   ` Xiao Ni
  1 sibling, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2023-05-26  3:09 UTC (permalink / raw)
  To: Xiao Ni, Song Liu; +Cc: linux-raid, Heinz Mauelshagen, Nigel Croxon



On 5/26/23 09:49, Xiao Ni wrote:
> Hi all
>
> We found a problem recently. The read data is wrong when recovery happens.
> Now we've found it's introduced by patch 10764815f (md: add io accounting
> for raid0 and raid5). I can reproduce this 100%. This problem exists in
> upstream. The test steps are like this:
>
> 1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean
> 2. mkfs.ext4 -F $devname
> 3. mount $devname $mount_point
> 4. mdadm --incremental --fail sdd
> 5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000
> status=progress
> 6. mdadm /dev/md126 --add /dev/sdd
> 7. create 31 processes that writes and reads. It compares the content with
> md5sum. The test will go on until the recovery stops
> 8. wait for about 10 minutes, we can see some processes report checksum is
> wrong. But if it re-read the data again, the checksum will be good.
>
> I tried to narrow this problem like this:
>
> -       md_account_bio(mddev, &bi);
> +       if (rw == WRITE)
> +               md_account_bio(mddev, &bi);
> If it only do account for write requests, the problem can disappear.
>
> -       if (rw == READ && mddev->degraded == 0 &&
> -           mddev->reshape_position == MaxSector) {
> -               bi = chunk_aligned_read(mddev, bi);
> -               if (!bi)
> -                       return true;
> -       }
> +       //if (rw == READ && mddev->degraded == 0 &&
> +       //    mddev->reshape_position == MaxSector) {
> +       //      bi = chunk_aligned_read(mddev, bi);
> +       //      if (!bi)
> +       //              return true;
> +       //}
>
>          if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
>                  make_discard_request(mddev, bi);
> @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev *mddev,
> struct bio * bi)
>                          md_write_end(mddev);
>                  return true;
>          }
> -       md_account_bio(mddev, &bi);
> +       if (rw == READ)
> +               md_account_bio(mddev, &bi);
>
> I comment the chunk_aligned_read out and only account for read requests,
> this problem can be reproduced.

After a quick look,raid5_read_one_chunk clones bio by itself, so no need to
do it for the chunk aligned readcase. Could you pls try this?

--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6120,6 +6120,7 @@static bool raid5_make_request(struct mddev *mddev, 
struct bio * bi)
        const int rw = bio_data_dir(bi);
        enum stripe_result res;
        int s, stripe_cnt;
+bool account_bio = true;

        if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
                int ret = log_handle_flush_request(conf, bi);
@@ -6148,6 +6149,7 @@static bool raid5_make_request(struct mddev *mddev, 
struct bio * bi)
        if (rw == READ && mddev->degraded == 0 &&
            mddev->reshape_position == MaxSector) {
                bi = chunk_aligned_read(mddev, bi);
+account_bio = false;
                if (!bi)
                        return true;
        }
@@ -6180,7 +6182,8 @@static bool raid5_make_request(struct mddev *mddev, 
struct bio * bi)
                        md_write_end(mddev);
                return true;
        }
-       md_account_bio(mddev, &bi);
+if (account_bio)
+md_account_bio(mddev, &bi);


Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-26  2:08 ` Fwd: The read data is wrong from raid5 when recovery happens Xiao Ni
  2023-05-26  2:17   ` Yu Kuai
@ 2023-05-26  3:56   ` d tbsky
  2023-05-26  6:20     ` Xiao Ni
  2024-02-14 15:15   ` Fwd: " Mateusz Kusiak
  2 siblings, 1 reply; 33+ messages in thread
From: d tbsky @ 2023-05-26  3:56 UTC (permalink / raw)
  To: linux-raid

Xiao Ni <xni@redhat.com>
> We found a problem recently. The read data is wrong when recovery
> happens. Now we've found it's introduced by patch 10764815f (md: add
> io accounting for raid0 and raid5). I can reproduce this 100%. This
> problem exists in upstream. The test steps are like this:

sorry for the interruption. but I wonder if any current RHEL version
has this problem?
I hope it is safe since I can not find it in bugzilla.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-26  3:56   ` d tbsky
@ 2023-05-26  6:20     ` Xiao Ni
  0 siblings, 0 replies; 33+ messages in thread
From: Xiao Ni @ 2023-05-26  6:20 UTC (permalink / raw)
  To: d tbsky, linux-raid


在 2023/5/26 上午11:56, d tbsky 写道:
> Xiao Ni <xni@redhat.com>
>> We found a problem recently. The read data is wrong when recovery
>> happens. Now we've found it's introduced by patch 10764815f (md: add
>> io accounting for raid0 and raid5). I can reproduce this 100%. This
>> problem exists in upstream. The test steps are like this:
> sorry for the interruption. but I wonder if any current RHEL version
> has this problem?
> I hope it is safe since I can not find it in bugzilla.
>

Hi

There is a bug https://bugzilla.redhat.com/show_bug.cgi?id=2183033 that 
points this problem.

Regards

Xiao


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-26  3:09 ` Guoqing Jiang
@ 2023-05-26  6:45   ` Xiao Ni
  2023-05-26  7:12     ` Guoqing Jiang
  0 siblings, 1 reply; 33+ messages in thread
From: Xiao Ni @ 2023-05-26  6:45 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: Song Liu, linux-raid, Heinz Mauelshagen, Nigel Croxon

On Fri, May 26, 2023 at 11:09 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
>
>
> On 5/26/23 09:49, Xiao Ni wrote:
> > Hi all
> >
> > We found a problem recently. The read data is wrong when recovery happens.
> > Now we've found it's introduced by patch 10764815f (md: add io accounting
> > for raid0 and raid5). I can reproduce this 100%. This problem exists in
> > upstream. The test steps are like this:
> >
> > 1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean
> > 2. mkfs.ext4 -F $devname
> > 3. mount $devname $mount_point
> > 4. mdadm --incremental --fail sdd
> > 5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000
> > status=progress
> > 6. mdadm /dev/md126 --add /dev/sdd
> > 7. create 31 processes that writes and reads. It compares the content with
> > md5sum. The test will go on until the recovery stops
> > 8. wait for about 10 minutes, we can see some processes report checksum is
> > wrong. But if it re-read the data again, the checksum will be good.
> >
> > I tried to narrow this problem like this:
> >
> > -       md_account_bio(mddev, &bi);
> > +       if (rw == WRITE)
> > +               md_account_bio(mddev, &bi);
> > If it only do account for write requests, the problem can disappear.
> >
> > -       if (rw == READ && mddev->degraded == 0 &&
> > -           mddev->reshape_position == MaxSector) {
> > -               bi = chunk_aligned_read(mddev, bi);
> > -               if (!bi)
> > -                       return true;
> > -       }
> > +       //if (rw == READ && mddev->degraded == 0 &&
> > +       //    mddev->reshape_position == MaxSector) {
> > +       //      bi = chunk_aligned_read(mddev, bi);
> > +       //      if (!bi)
> > +       //              return true;
> > +       //}
> >
> >          if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
> >                  make_discard_request(mddev, bi);
> > @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev *mddev,
> > struct bio * bi)
> >                          md_write_end(mddev);
> >                  return true;
> >          }
> > -       md_account_bio(mddev, &bi);
> > +       if (rw == READ)
> > +               md_account_bio(mddev, &bi);
> >
> > I comment the chunk_aligned_read out and only account for read requests,
> > this problem can be reproduced.
>
> After a quick look,raid5_read_one_chunk clones bio by itself, so no need to
> do it for the chunk aligned readcase. Could you pls try this?
>
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6120,6 +6120,7 @@static bool raid5_make_request(struct mddev *mddev,
> struct bio * bi)
>         const int rw = bio_data_dir(bi);
>         enum stripe_result res;
>         int s, stripe_cnt;
> +bool account_bio = true;
>
>         if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
>                 int ret = log_handle_flush_request(conf, bi);
> @@ -6148,6 +6149,7 @@static bool raid5_make_request(struct mddev *mddev,
> struct bio * bi)
>         if (rw == READ && mddev->degraded == 0 &&
>             mddev->reshape_position == MaxSector) {
>                 bi = chunk_aligned_read(mddev, bi);
> +account_bio = false;
>                 if (!bi)
>                         return true;
>         }
> @@ -6180,7 +6182,8 @@static bool raid5_make_request(struct mddev *mddev,
> struct bio * bi)
>                         md_write_end(mddev);
>                 return true;
>         }
> -       md_account_bio(mddev, &bi);
> +if (account_bio)
> +md_account_bio(mddev, &bi);
>
>
> Thanks,
> Guoqing
>

Hi Guoqing

When chunk_aligned_read runs successfully, it just returns. In this
case, it does the account by itself. If it fails to execute, it still
needs run md_account_bio in raid5_make_request. So now the logic
should be right.

And by the way, in the original email, as mentioned, I commented the
chunk aligned read codes out, and it still can be reproduced.


-- 
Best Regards
Xiao Ni


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-26  6:45   ` Xiao Ni
@ 2023-05-26  7:12     ` Guoqing Jiang
  2023-05-26  7:23       ` Xiao Ni
  0 siblings, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2023-05-26  7:12 UTC (permalink / raw)
  To: Xiao Ni; +Cc: Song Liu, linux-raid, Heinz Mauelshagen, Nigel Croxon



On 5/26/23 14:45, Xiao Ni wrote:
> On Fri, May 26, 2023 at 11:09 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>>
>>
>> On 5/26/23 09:49, Xiao Ni wrote:
>>> Hi all
>>>
>>> We found a problem recently. The read data is wrong when recovery happens.
>>> Now we've found it's introduced by patch 10764815f (md: add io accounting
>>> for raid0 and raid5). I can reproduce this 100%. This problem exists in
>>> upstream. The test steps are like this:
>>>
>>> 1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean
>>> 2. mkfs.ext4 -F $devname
>>> 3. mount $devname $mount_point
>>> 4. mdadm --incremental --fail sdd
>>> 5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000 status=progress

I suppose /tmp is the mount point.

>>> 6. mdadm /dev/md126 --add /dev/sdd
>>> 7. create 31 processes that writes and reads. It compares the content with
>>> md5sum. The test will go on until the recovery stops

Could you share the test code/script for step 7? Will try it from my side.

>>> 8. wait for about 10 minutes, we can see some processes report checksum is
>>> wrong. But if it re-read the data again, the checksum will be good.

So it is interim, I guess it appeared before recover was finished.

>>> I tried to narrow this problem like this:
>>>
>>> -       md_account_bio(mddev, &bi);
>>> +       if (rw == WRITE)
>>> +               md_account_bio(mddev, &bi);
>>> If it only do account for write requests, the problem can disappear.
>>>
>>> -       if (rw == READ && mddev->degraded == 0 &&
>>> -           mddev->reshape_position == MaxSector) {
>>> -               bi = chunk_aligned_read(mddev, bi);
>>> -               if (!bi)
>>> -                       return true;
>>> -       }
>>> +       //if (rw == READ && mddev->degraded == 0 &&
>>> +       //    mddev->reshape_position == MaxSector) {
>>> +       //      bi = chunk_aligned_read(mddev, bi);
>>> +       //      if (!bi)
>>> +       //              return true;
>>> +       //}
>>>
>>>           if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
>>>                   make_discard_request(mddev, bi);
>>> @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev *mddev,
>>> struct bio * bi)
>>>                           md_write_end(mddev);
>>>                   return true;
>>>           }
>>> -       md_account_bio(mddev, &bi);
>>> +       if (rw == READ)
>>> +               md_account_bio(mddev, &bi);
>>>
>>> I comment the chunk_aligned_read out and only account for read requests,
>>> this problem can be reproduced.
>> After a quick look,raid5_read_one_chunk clones bio by itself, so no need to
>> do it for the chunk aligned readcase. Could you pls try this?

[...]

>> Hi Guoqing
>>
>> When chunk_aligned_read runs successfully, it just returns. In this
>> case, it does the account by itself. If it fails to execute, it still
>> needs run md_account_bio in raid5_make_request. So now the logic
>> should be right.

Hmm, I was confused.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-26  7:12     ` Guoqing Jiang
@ 2023-05-26  7:23       ` Xiao Ni
  2023-05-26  9:13         ` Mariusz Tkaczyk
  2023-05-29  2:25         ` Guoqing Jiang
  0 siblings, 2 replies; 33+ messages in thread
From: Xiao Ni @ 2023-05-26  7:23 UTC (permalink / raw)
  To: Guoqing Jiang, Tkaczyk, Mariusz
  Cc: Song Liu, linux-raid, Heinz Mauelshagen, Nigel Croxon

On Fri, May 26, 2023 at 3:12 PM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
>
>
> On 5/26/23 14:45, Xiao Ni wrote:
> > On Fri, May 26, 2023 at 11:09 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
> >>
> >>
> >> On 5/26/23 09:49, Xiao Ni wrote:
> >>> Hi all
> >>>
> >>> We found a problem recently. The read data is wrong when recovery happens.
> >>> Now we've found it's introduced by patch 10764815f (md: add io accounting
> >>> for raid0 and raid5). I can reproduce this 100%. This problem exists in
> >>> upstream. The test steps are like this:
> >>>
> >>> 1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean
> >>> 2. mkfs.ext4 -F $devname
> >>> 3. mount $devname $mount_point
> >>> 4. mdadm --incremental --fail sdd
> >>> 5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000 status=progress
>
> I suppose /tmp is the mount point.

/tmp/pythontest is the mount point

>
> >>> 6. mdadm /dev/md126 --add /dev/sdd
> >>> 7. create 31 processes that writes and reads. It compares the content with
> >>> md5sum. The test will go on until the recovery stops
>
> Could you share the test code/script for step 7? Will try it from my side.

The test scripts are written by people from intel.
Hi, Mariusz. Can I share the test scripts here?

>
> >>> 8. wait for about 10 minutes, we can see some processes report checksum is
> >>> wrong. But if it re-read the data again, the checksum will be good.
>
> So it is interim, I guess it appeared before recover was finished.

Yes, it appears before recovery finishes. The test will finish once
the recovery finishes.

>
> >>> I tried to narrow this problem like this:
> >>>
> >>> -       md_account_bio(mddev, &bi);
> >>> +       if (rw == WRITE)
> >>> +               md_account_bio(mddev, &bi);
> >>> If it only do account for write requests, the problem can disappear.
> >>>
> >>> -       if (rw == READ && mddev->degraded == 0 &&
> >>> -           mddev->reshape_position == MaxSector) {
> >>> -               bi = chunk_aligned_read(mddev, bi);
> >>> -               if (!bi)
> >>> -                       return true;
> >>> -       }
> >>> +       //if (rw == READ && mddev->degraded == 0 &&
> >>> +       //    mddev->reshape_position == MaxSector) {
> >>> +       //      bi = chunk_aligned_read(mddev, bi);
> >>> +       //      if (!bi)
> >>> +       //              return true;
> >>> +       //}
> >>>
> >>>           if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
> >>>                   make_discard_request(mddev, bi);
> >>> @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev *mddev,
> >>> struct bio * bi)
> >>>                           md_write_end(mddev);
> >>>                   return true;
> >>>           }
> >>> -       md_account_bio(mddev, &bi);
> >>> +       if (rw == READ)
> >>> +               md_account_bio(mddev, &bi);
> >>>
> >>> I comment the chunk_aligned_read out and only account for read requests,
> >>> this problem can be reproduced.
> >> After a quick look,raid5_read_one_chunk clones bio by itself, so no need to
> >> do it for the chunk aligned readcase. Could you pls try this?
>
> [...]
>
> >> Hi Guoqing
> >>
> >> When chunk_aligned_read runs successfully, it just returns. In this
> >> case, it does the account by itself. If it fails to execute, it still
> >> needs run md_account_bio in raid5_make_request. So now the logic
> >> should be right.
>
> Hmm, I was confused.

Which part?
>
> Thanks,
> Guoqing
>


-- 
Best Regards
Xiao Ni


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-26  7:23       ` Xiao Ni
@ 2023-05-26  9:13         ` Mariusz Tkaczyk
  2023-05-26 21:13           ` Song Liu
  2023-05-29  2:25         ` Guoqing Jiang
  1 sibling, 1 reply; 33+ messages in thread
From: Mariusz Tkaczyk @ 2023-05-26  9:13 UTC (permalink / raw)
  To: Xiao Ni
  Cc: Guoqing Jiang, Tkaczyk, Mariusz, Song Liu, linux-raid,
	Heinz Mauelshagen, Nigel Croxon

On Fri, 26 May 2023 15:23:58 +0800
Xiao Ni <xni@redhat.com> wrote:

> On Fri, May 26, 2023 at 3:12 PM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
> >
> >
> >
> > On 5/26/23 14:45, Xiao Ni wrote:  
> > > On Fri, May 26, 2023 at 11:09 AM Guoqing Jiang <guoqing.jiang@linux.dev>
> > > wrote:  
> > >>
> > >>
> > >> On 5/26/23 09:49, Xiao Ni wrote:  
> > >>> Hi all
> > >>>
> > >>> We found a problem recently. The read data is wrong when recovery
> > >>> happens. Now we've found it's introduced by patch 10764815f (md: add io
> > >>> accounting for raid0 and raid5). I can reproduce this 100%. This
> > >>> problem exists in upstream. The test steps are like this:
> > >>>
> > >>> 1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean
> > >>> 2. mkfs.ext4 -F $devname
> > >>> 3. mount $devname $mount_point
> > >>> 4. mdadm --incremental --fail sdd
> > >>> 5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000
> > >>> status=progress  
> >
> > I suppose /tmp is the mount point.  
> 
> /tmp/pythontest is the mount point
> 
> >  
> > >>> 6. mdadm /dev/md126 --add /dev/sdd
> > >>> 7. create 31 processes that writes and reads. It compares the content
> > >>> with md5sum. The test will go on until the recovery stops  
> >
> > Could you share the test code/script for step 7? Will try it from my side.  
> 
> The test scripts are written by people from intel.
> Hi, Mariusz. Can I share the test scripts here?

Yes. Let us know if there is something else we can do to help here.

Thanks,
Mariusz

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-26  9:13         ` Mariusz Tkaczyk
@ 2023-05-26 21:13           ` Song Liu
  2023-05-27  0:56             ` Xiao Ni
  0 siblings, 1 reply; 33+ messages in thread
From: Song Liu @ 2023-05-26 21:13 UTC (permalink / raw)
  To: Mariusz Tkaczyk
  Cc: Xiao Ni, Guoqing Jiang, Tkaczyk, Mariusz, linux-raid,
	Heinz Mauelshagen, Nigel Croxon

On Fri, May 26, 2023 at 2:13 AM Mariusz Tkaczyk
<mariusz.tkaczyk@linux.intel.com> wrote:
>
> On Fri, 26 May 2023 15:23:58 +0800
> Xiao Ni <xni@redhat.com> wrote:
>
> > On Fri, May 26, 2023 at 3:12 PM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
> > >
> > >
> > >
> > > On 5/26/23 14:45, Xiao Ni wrote:
> > > > On Fri, May 26, 2023 at 11:09 AM Guoqing Jiang <guoqing.jiang@linux.dev>
> > > > wrote:
> > > >>
> > > >>
> > > >> On 5/26/23 09:49, Xiao Ni wrote:
> > > >>> Hi all
> > > >>>
> > > >>> We found a problem recently. The read data is wrong when recovery
> > > >>> happens. Now we've found it's introduced by patch 10764815f (md: add io
> > > >>> accounting for raid0 and raid5). I can reproduce this 100%. This
> > > >>> problem exists in upstream. The test steps are like this:
> > > >>>
> > > >>> 1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean
> > > >>> 2. mkfs.ext4 -F $devname
> > > >>> 3. mount $devname $mount_point
> > > >>> 4. mdadm --incremental --fail sdd
> > > >>> 5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000
> > > >>> status=progress
> > >
> > > I suppose /tmp is the mount point.
> >
> > /tmp/pythontest is the mount point
> >
> > >
> > > >>> 6. mdadm /dev/md126 --add /dev/sdd
> > > >>> 7. create 31 processes that writes and reads. It compares the content
> > > >>> with md5sum. The test will go on until the recovery stops
> > >
> > > Could you share the test code/script for step 7? Will try it from my side.
> >
> > The test scripts are written by people from intel.
> > Hi, Mariusz. Can I share the test scripts here?
>
> Yes. Let us know if there is something else we can do to help here.

I tried to reproduce this with fio, but didn't get much luck. Please share
the test scripts.

Thanks,
Song

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-26 21:13           ` Song Liu
@ 2023-05-27  0:56             ` Xiao Ni
  2023-07-11  0:39               ` Xiao Ni
  0 siblings, 1 reply; 33+ messages in thread
From: Xiao Ni @ 2023-05-27  0:56 UTC (permalink / raw)
  To: Song Liu
  Cc: Mariusz Tkaczyk, Guoqing Jiang, Tkaczyk, Mariusz, linux-raid,
	Heinz Mauelshagen, Nigel Croxon

[-- Attachment #1: Type: text/plain, Size: 2253 bytes --]

On Sat, May 27, 2023 at 5:13 AM Song Liu <song@kernel.org> wrote:
>
> On Fri, May 26, 2023 at 2:13 AM Mariusz Tkaczyk
> <mariusz.tkaczyk@linux.intel.com> wrote:
> >
> > On Fri, 26 May 2023 15:23:58 +0800
> > Xiao Ni <xni@redhat.com> wrote:
> >
> > > On Fri, May 26, 2023 at 3:12 PM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
> > > >
> > > >
> > > >
> > > > On 5/26/23 14:45, Xiao Ni wrote:
> > > > > On Fri, May 26, 2023 at 11:09 AM Guoqing Jiang <guoqing.jiang@linux.dev>
> > > > > wrote:
> > > > >>
> > > > >>
> > > > >> On 5/26/23 09:49, Xiao Ni wrote:
> > > > >>> Hi all
> > > > >>>
> > > > >>> We found a problem recently. The read data is wrong when recovery
> > > > >>> happens. Now we've found it's introduced by patch 10764815f (md: add io
> > > > >>> accounting for raid0 and raid5). I can reproduce this 100%. This
> > > > >>> problem exists in upstream. The test steps are like this:
> > > > >>>
> > > > >>> 1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean
> > > > >>> 2. mkfs.ext4 -F $devname
> > > > >>> 3. mount $devname $mount_point
> > > > >>> 4. mdadm --incremental --fail sdd
> > > > >>> 5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000
> > > > >>> status=progress
> > > >
> > > > I suppose /tmp is the mount point.
> > >
> > > /tmp/pythontest is the mount point
> > >
> > > >
> > > > >>> 6. mdadm /dev/md126 --add /dev/sdd
> > > > >>> 7. create 31 processes that writes and reads. It compares the content
> > > > >>> with md5sum. The test will go on until the recovery stops
> > > >
> > > > Could you share the test code/script for step 7? Will try it from my side.
> > >
> > > The test scripts are written by people from intel.
> > > Hi, Mariusz. Can I share the test scripts here?
> >
> > Yes. Let us know if there is something else we can do to help here.
>
> I tried to reproduce this with fio, but didn't get much luck. Please share
> the test scripts.
>
> Thanks,
> Song
>

Hi all

The attachment is the scripts.

1. It needs to modify the member disks and raid name in prepare_rebuild_env.sh
2. It needs to modify the member disks and raid name in 01-test.sh
3. Then run 01-test.sh directly

-- 
Best Regards
Xiao Ni

[-- Attachment #2: readdata.tar.gz --]
[-- Type: application/x-gzip, Size: 1888 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-26  7:23       ` Xiao Ni
  2023-05-26  9:13         ` Mariusz Tkaczyk
@ 2023-05-29  2:25         ` Guoqing Jiang
  2023-05-29  3:41           ` Xiao Ni
  2023-05-29 13:51           ` Xiao Ni
  1 sibling, 2 replies; 33+ messages in thread
From: Guoqing Jiang @ 2023-05-29  2:25 UTC (permalink / raw)
  To: Xiao Ni, Tkaczyk, Mariusz
  Cc: Song Liu, linux-raid, Heinz Mauelshagen, Nigel Croxon



On 5/26/23 15:23, Xiao Ni wrote:
>
>>>>> 6. mdadm /dev/md126 --add /dev/sdd
>>>>> 7. create 31 processes that writes and reads. It compares the content with
>>>>> md5sum. The test will go on until the recovery stops
>> Could you share the test code/script for step 7? Will try it from my side.
> The test scripts are written by people from intel.
> Hi, Mariusz. Can I share the test scripts here?
>
>>>>> 8. wait for about 10 minutes, we can see some processes report checksum is
>>>>> wrong. But if it re-read the data again, the checksum will be good.
>> So it is interim, I guess it appeared before recover was finished.
> Yes, it appears before recovery finishes. The test will finish once
> the recovery finishes.
>
>>>>> I tried to narrow this problem like this:
>>>>>
>>>>> -       md_account_bio(mddev, &bi);
>>>>> +       if (rw == WRITE)
>>>>> +               md_account_bio(mddev, &bi);
>>>>> If it only do account for write requests, the problem can disappear.
>>>>>
>>>>> -       if (rw == READ && mddev->degraded == 0 &&
>>>>> -           mddev->reshape_position == MaxSector) {
>>>>> -               bi = chunk_aligned_read(mddev, bi);
>>>>> -               if (!bi)
>>>>> -                       return true;
>>>>> -       }
>>>>> +       //if (rw == READ && mddev->degraded == 0 &&
>>>>> +       //    mddev->reshape_position == MaxSector) {
>>>>> +       //      bi = chunk_aligned_read(mddev, bi);
>>>>> +       //      if (!bi)
>>>>> +       //              return true;
>>>>> +       //}
>>>>>
>>>>>            if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
>>>>>                    make_discard_request(mddev, bi);
>>>>> @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev *mddev,
>>>>> struct bio * bi)
>>>>>                            md_write_end(mddev);
>>>>>                    return true;
>>>>>            }
>>>>> -       md_account_bio(mddev, &bi);
>>>>> +       if (rw == READ)
>>>>> +               md_account_bio(mddev, &bi);
>>>>>
>>>>> I comment the chunk_aligned_read out and only account for read requests,
>>>>> this problem can be reproduced.

Only write bio and non aligned chunk read bio call md_account_bio, and 
only account
write bio is fine per your test. It means the md5sum didn't match 
because of non aligned
chunk read bio, so it is not abnormal that data in another chunk could 
be changed with
the recovery is not finished, right?

BTW, I had run the test with bio accounting disabled by default, and 
seems the result is
same.

> git tag  --sort=taggerdate --contain 10764815f |head -1
v5.14-rc1

localhost:~/readdata #uname -r
5.15.0-rc4-59.24-default
localhost:~/readdata #cat /sys/block/md126/queue/iostats
0

And I can still see relevant log from the terminal which runs 01-test.sh

file /tmp/pythontest/data.out.nhoBR6 Test failed 
(org_checksum|checksum_to_match) [582df4ccea6f5851462379fe4b17abf6 
  -|d41d8cd98f00b204e9800998ec
f8427e  -]
file /tmp/pythontest/data.out.nhoBR6 calculate checksum again: 
d41d8cd98f00b204e9800998ecf8427e  -

file /tmp/pythontest/data.out.srGqF9 Test failed 
(org_checksum|checksum_to_match) [7a85ec35b171e52ce41ebdd5da86f1d9 
  -|d41d8cd98f00b204e9800998ec
f8427e  -]
file /tmp/pythontest/data.out.srGqF9 calculate checksum again: 
d41d8cd98f00b204e9800998ecf8427e  -
...

file /tmp/pythontest/data.out.q6m1kK Test failed 
(org_checksum|checksum_to_match) [996f5bb5d2a2ddfd378a80a563ab4ad5 
  -|d41d8cd98f00b204e9800998ec
f8427e  -]
file /tmp/pythontest/data.out.q6m1kK calculate checksum again: 
d41d8cd98f00b204e9800998ecf8427e  -

Rebuild ended, testing done!
stresstest end at : Sun May 28 22:23:53 EDT 2023
Sun May 28 22:23:53 EDT 2023

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-29  2:25         ` Guoqing Jiang
@ 2023-05-29  3:41           ` Xiao Ni
  2023-05-29  8:33             ` Guoqing Jiang
  2023-05-29 13:51           ` Xiao Ni
  1 sibling, 1 reply; 33+ messages in thread
From: Xiao Ni @ 2023-05-29  3:41 UTC (permalink / raw)
  To: Guoqing Jiang
  Cc: Tkaczyk, Mariusz, Song Liu, linux-raid, Heinz Mauelshagen, Nigel Croxon

On Mon, May 29, 2023 at 10:27 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
>
>
> On 5/26/23 15:23, Xiao Ni wrote:
> >
> >>>>> 6. mdadm /dev/md126 --add /dev/sdd
> >>>>> 7. create 31 processes that writes and reads. It compares the content with
> >>>>> md5sum. The test will go on until the recovery stops
> >> Could you share the test code/script for step 7? Will try it from my side.
> > The test scripts are written by people from intel.
> > Hi, Mariusz. Can I share the test scripts here?
> >
> >>>>> 8. wait for about 10 minutes, we can see some processes report checksum is
> >>>>> wrong. But if it re-read the data again, the checksum will be good.
> >> So it is interim, I guess it appeared before recover was finished.
> > Yes, it appears before recovery finishes. The test will finish once
> > the recovery finishes.
> >
> >>>>> I tried to narrow this problem like this:
> >>>>>
> >>>>> -       md_account_bio(mddev, &bi);
> >>>>> +       if (rw == WRITE)
> >>>>> +               md_account_bio(mddev, &bi);
> >>>>> If it only do account for write requests, the problem can disappear.
> >>>>>
> >>>>> -       if (rw == READ && mddev->degraded == 0 &&
> >>>>> -           mddev->reshape_position == MaxSector) {
> >>>>> -               bi = chunk_aligned_read(mddev, bi);
> >>>>> -               if (!bi)
> >>>>> -                       return true;
> >>>>> -       }
> >>>>> +       //if (rw == READ && mddev->degraded == 0 &&
> >>>>> +       //    mddev->reshape_position == MaxSector) {
> >>>>> +       //      bi = chunk_aligned_read(mddev, bi);
> >>>>> +       //      if (!bi)
> >>>>> +       //              return true;
> >>>>> +       //}
> >>>>>
> >>>>>            if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
> >>>>>                    make_discard_request(mddev, bi);
> >>>>> @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev *mddev,
> >>>>> struct bio * bi)
> >>>>>                            md_write_end(mddev);
> >>>>>                    return true;
> >>>>>            }
> >>>>> -       md_account_bio(mddev, &bi);
> >>>>> +       if (rw == READ)
> >>>>> +               md_account_bio(mddev, &bi);
> >>>>>
> >>>>> I comment the chunk_aligned_read out and only account for read requests,
> >>>>> this problem can be reproduced.
>
> Only write bio and non aligned chunk read bio call md_account_bio, and
> only account
> write bio is fine per your test. It means the md5sum didn't match
> because of non aligned
> chunk read bio, so it is not abnormal that data in another chunk could
> be changed with
> the recovery is not finished, right?

That's right, only non aligned read requests can cause this problem.
Good catch. If I understand right, you mean the non aligned read
request reads data from the chunk which hasn't been recovered, right?

>
> BTW, I had run the test with bio accounting disabled by default, and
> seems the result is
> same.
>
> > git tag  --sort=taggerdate --contain 10764815f |head -1
> v5.14-rc1
>
> localhost:~/readdata #uname -r
> 5.15.0-rc4-59.24-default
> localhost:~/readdata #cat /sys/block/md126/queue/iostats
> 0
>
> And I can still see relevant log from the terminal which runs 01-test.sh

Hmm, thanks for this. I'll have a try again. Which kind of disks do
you use for testing?

Regards
Xiao
>
> file /tmp/pythontest/data.out.nhoBR6 Test failed
> (org_checksum|checksum_to_match) [582df4ccea6f5851462379fe4b17abf6
>   -|d41d8cd98f00b204e9800998ec
> f8427e  -]
> file /tmp/pythontest/data.out.nhoBR6 calculate checksum again:
> d41d8cd98f00b204e9800998ecf8427e  -
>
> file /tmp/pythontest/data.out.srGqF9 Test failed
> (org_checksum|checksum_to_match) [7a85ec35b171e52ce41ebdd5da86f1d9
>   -|d41d8cd98f00b204e9800998ec
> f8427e  -]
> file /tmp/pythontest/data.out.srGqF9 calculate checksum again:
> d41d8cd98f00b204e9800998ecf8427e  -
> ...
>
> file /tmp/pythontest/data.out.q6m1kK Test failed
> (org_checksum|checksum_to_match) [996f5bb5d2a2ddfd378a80a563ab4ad5
>   -|d41d8cd98f00b204e9800998ec
> f8427e  -]
> file /tmp/pythontest/data.out.q6m1kK calculate checksum again:
> d41d8cd98f00b204e9800998ecf8427e  -
>
> Rebuild ended, testing done!
> stresstest end at : Sun May 28 22:23:53 EDT 2023
> Sun May 28 22:23:53 EDT 2023
>
> Thanks,
> Guoqing
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-29  3:41           ` Xiao Ni
@ 2023-05-29  8:33             ` Guoqing Jiang
  2023-05-29  8:40               ` Xiao Ni
  0 siblings, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2023-05-29  8:33 UTC (permalink / raw)
  To: Xiao Ni
  Cc: Tkaczyk, Mariusz, Song Liu, linux-raid, Heinz Mauelshagen, Nigel Croxon



On 5/29/23 11:41, Xiao Ni wrote:
> On Mon, May 29, 2023 at 10:27 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>>
>>
>> On 5/26/23 15:23, Xiao Ni wrote:
>>>>>>> 6. mdadm /dev/md126 --add /dev/sdd
>>>>>>> 7. create 31 processes that writes and reads. It compares the content with
>>>>>>> md5sum. The test will go on until the recovery stops
>>>> Could you share the test code/script for step 7? Will try it from my side.
>>> The test scripts are written by people from intel.
>>> Hi, Mariusz. Can I share the test scripts here?
>>>
>>>>>>> 8. wait for about 10 minutes, we can see some processes report checksum is
>>>>>>> wrong. But if it re-read the data again, the checksum will be good.
>>>> So it is interim, I guess it appeared before recover was finished.
>>> Yes, it appears before recovery finishes. The test will finish once
>>> the recovery finishes.
>>>
>>>>>>> I tried to narrow this problem like this:
>>>>>>>
>>>>>>> -       md_account_bio(mddev, &bi);
>>>>>>> +       if (rw == WRITE)
>>>>>>> +               md_account_bio(mddev, &bi);
>>>>>>> If it only do account for write requests, the problem can disappear.
>>>>>>>
>>>>>>> -       if (rw == READ && mddev->degraded == 0 &&
>>>>>>> -           mddev->reshape_position == MaxSector) {
>>>>>>> -               bi = chunk_aligned_read(mddev, bi);
>>>>>>> -               if (!bi)
>>>>>>> -                       return true;
>>>>>>> -       }
>>>>>>> +       //if (rw == READ && mddev->degraded == 0 &&
>>>>>>> +       //    mddev->reshape_position == MaxSector) {
>>>>>>> +       //      bi = chunk_aligned_read(mddev, bi);
>>>>>>> +       //      if (!bi)
>>>>>>> +       //              return true;
>>>>>>> +       //}
>>>>>>>
>>>>>>>             if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
>>>>>>>                     make_discard_request(mddev, bi);
>>>>>>> @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev *mddev,
>>>>>>> struct bio * bi)
>>>>>>>                             md_write_end(mddev);
>>>>>>>                     return true;
>>>>>>>             }
>>>>>>> -       md_account_bio(mddev, &bi);
>>>>>>> +       if (rw == READ)
>>>>>>> +               md_account_bio(mddev, &bi);
>>>>>>>
>>>>>>> I comment the chunk_aligned_read out and only account for read requests,
>>>>>>> this problem can be reproduced.
>> Only write bio and non aligned chunk read bio call md_account_bio, and
>> only account write bio is fine per your test. It means the md5sum didn't match
>> because of non aligned chunk read bio, so it is not abnormal that data in another chunk could
>> be changed with the recovery is not finished, right?
> That's right, only non aligned read requests can cause this problem.
> Good catch. If I understand right, you mean the non aligned read
> request reads data from the chunk which hasn't been recovered, right?

Yes, I don't think compare md5sum for such scenario makes more sense given
the state is interim. And it also appeared in my test with disable io 
accounting.

>> BTW, I had run the test with bio accounting disabled by default, and
>> seems the result is
>> same.
>>
>>> git tag  --sort=taggerdate --contain 10764815f |head -1
>> v5.14-rc1
>>
>> localhost:~/readdata #uname -r
>> 5.15.0-rc4-59.24-default
>> localhost:~/readdata #cat /sys/block/md126/queue/iostats
>> 0
>>
>> And I can still see relevant log from the terminal which runs 01-test.sh
> Hmm, thanks for this. I'll have a try again. Which kind of disks do
> you use for testing?

Four SCSI disks (1G capacity) inside VM.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-29  8:33             ` Guoqing Jiang
@ 2023-05-29  8:40               ` Xiao Ni
  2023-05-30  1:36                 ` Guoqing Jiang
  0 siblings, 1 reply; 33+ messages in thread
From: Xiao Ni @ 2023-05-29  8:40 UTC (permalink / raw)
  To: Guoqing Jiang
  Cc: Tkaczyk, Mariusz, Song Liu, linux-raid, Heinz Mauelshagen, Nigel Croxon

On Mon, May 29, 2023 at 4:34 PM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
>
>
> On 5/29/23 11:41, Xiao Ni wrote:
> > On Mon, May 29, 2023 at 10:27 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
> >>
> >>
> >> On 5/26/23 15:23, Xiao Ni wrote:
> >>>>>>> 6. mdadm /dev/md126 --add /dev/sdd
> >>>>>>> 7. create 31 processes that writes and reads. It compares the content with
> >>>>>>> md5sum. The test will go on until the recovery stops
> >>>> Could you share the test code/script for step 7? Will try it from my side.
> >>> The test scripts are written by people from intel.
> >>> Hi, Mariusz. Can I share the test scripts here?
> >>>
> >>>>>>> 8. wait for about 10 minutes, we can see some processes report checksum is
> >>>>>>> wrong. But if it re-read the data again, the checksum will be good.
> >>>> So it is interim, I guess it appeared before recover was finished.
> >>> Yes, it appears before recovery finishes. The test will finish once
> >>> the recovery finishes.
> >>>
> >>>>>>> I tried to narrow this problem like this:
> >>>>>>>
> >>>>>>> -       md_account_bio(mddev, &bi);
> >>>>>>> +       if (rw == WRITE)
> >>>>>>> +               md_account_bio(mddev, &bi);
> >>>>>>> If it only do account for write requests, the problem can disappear.
> >>>>>>>
> >>>>>>> -       if (rw == READ && mddev->degraded == 0 &&
> >>>>>>> -           mddev->reshape_position == MaxSector) {
> >>>>>>> -               bi = chunk_aligned_read(mddev, bi);
> >>>>>>> -               if (!bi)
> >>>>>>> -                       return true;
> >>>>>>> -       }
> >>>>>>> +       //if (rw == READ && mddev->degraded == 0 &&
> >>>>>>> +       //    mddev->reshape_position == MaxSector) {
> >>>>>>> +       //      bi = chunk_aligned_read(mddev, bi);
> >>>>>>> +       //      if (!bi)
> >>>>>>> +       //              return true;
> >>>>>>> +       //}
> >>>>>>>
> >>>>>>>             if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
> >>>>>>>                     make_discard_request(mddev, bi);
> >>>>>>> @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev *mddev,
> >>>>>>> struct bio * bi)
> >>>>>>>                             md_write_end(mddev);
> >>>>>>>                     return true;
> >>>>>>>             }
> >>>>>>> -       md_account_bio(mddev, &bi);
> >>>>>>> +       if (rw == READ)
> >>>>>>> +               md_account_bio(mddev, &bi);
> >>>>>>>
> >>>>>>> I comment the chunk_aligned_read out and only account for read requests,
> >>>>>>> this problem can be reproduced.
> >> Only write bio and non aligned chunk read bio call md_account_bio, and
> >> only account write bio is fine per your test. It means the md5sum didn't match
> >> because of non aligned chunk read bio, so it is not abnormal that data in another chunk could
> >> be changed with the recovery is not finished, right?
> > That's right, only non aligned read requests can cause this problem.
> > Good catch. If I understand right, you mean the non aligned read
> > request reads data from the chunk which hasn't been recovered, right?
>
> Yes, I don't think compare md5sum for such scenario makes more sense given
> the state is interim. And it also appeared in my test with disable io
> accounting.

It's the data which customers read that can be wrong. So it's a
dangerous thing. Because customers use the data directly. If it's true
it looks like there is a race between non aligned read and recovery
process.

Regards
Xiao

>
> >> BTW, I had run the test with bio accounting disabled by default, and
> >> seems the result is
> >> same.
> >>
> >>> git tag  --sort=taggerdate --contain 10764815f |head -1
> >> v5.14-rc1
> >>
> >> localhost:~/readdata #uname -r
> >> 5.15.0-rc4-59.24-default
> >> localhost:~/readdata #cat /sys/block/md126/queue/iostats
> >> 0
> >>
> >> And I can still see relevant log from the terminal which runs 01-test.sh
> > Hmm, thanks for this. I'll have a try again. Which kind of disks do
> > you use for testing?
>
> Four SCSI disks (1G capacity) inside VM.
>
> Thanks,
> Guoqing
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-29  2:25         ` Guoqing Jiang
  2023-05-29  3:41           ` Xiao Ni
@ 2023-05-29 13:51           ` Xiao Ni
  2023-05-30  0:53             ` Guoqing Jiang
  1 sibling, 1 reply; 33+ messages in thread
From: Xiao Ni @ 2023-05-29 13:51 UTC (permalink / raw)
  To: Guoqing Jiang
  Cc: Tkaczyk, Mariusz, Song Liu, linux-raid, Heinz Mauelshagen, Nigel Croxon

On Mon, May 29, 2023 at 10:27 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
>
>
> On 5/26/23 15:23, Xiao Ni wrote:
> >
> >>>>> 6. mdadm /dev/md126 --add /dev/sdd
> >>>>> 7. create 31 processes that writes and reads. It compares the content with
> >>>>> md5sum. The test will go on until the recovery stops
> >> Could you share the test code/script for step 7? Will try it from my side.
> > The test scripts are written by people from intel.
> > Hi, Mariusz. Can I share the test scripts here?
> >
> >>>>> 8. wait for about 10 minutes, we can see some processes report checksum is
> >>>>> wrong. But if it re-read the data again, the checksum will be good.
> >> So it is interim, I guess it appeared before recover was finished.
> > Yes, it appears before recovery finishes. The test will finish once
> > the recovery finishes.
> >
> >>>>> I tried to narrow this problem like this:
> >>>>>
> >>>>> -       md_account_bio(mddev, &bi);
> >>>>> +       if (rw == WRITE)
> >>>>> +               md_account_bio(mddev, &bi);
> >>>>> If it only do account for write requests, the problem can disappear.
> >>>>>
> >>>>> -       if (rw == READ && mddev->degraded == 0 &&
> >>>>> -           mddev->reshape_position == MaxSector) {
> >>>>> -               bi = chunk_aligned_read(mddev, bi);
> >>>>> -               if (!bi)
> >>>>> -                       return true;
> >>>>> -       }
> >>>>> +       //if (rw == READ && mddev->degraded == 0 &&
> >>>>> +       //    mddev->reshape_position == MaxSector) {
> >>>>> +       //      bi = chunk_aligned_read(mddev, bi);
> >>>>> +       //      if (!bi)
> >>>>> +       //              return true;
> >>>>> +       //}
> >>>>>
> >>>>>            if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
> >>>>>                    make_discard_request(mddev, bi);
> >>>>> @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev *mddev,
> >>>>> struct bio * bi)
> >>>>>                            md_write_end(mddev);
> >>>>>                    return true;
> >>>>>            }
> >>>>> -       md_account_bio(mddev, &bi);
> >>>>> +       if (rw == READ)
> >>>>> +               md_account_bio(mddev, &bi);
> >>>>>
> >>>>> I comment the chunk_aligned_read out and only account for read requests,
> >>>>> this problem can be reproduced.
>
> Only write bio and non aligned chunk read bio call md_account_bio, and
> only account
> write bio is fine per your test. It means the md5sum didn't match
> because of non aligned
> chunk read bio, so it is not abnormal that data in another chunk could
> be changed with
> the recovery is not finished, right?
>
> BTW, I had run the test with bio accounting disabled by default, and
> seems the result is
> same.
>
> > git tag  --sort=taggerdate --contain 10764815f |head -1
> v5.14-rc1
>
> localhost:~/readdata #uname -r
> 5.15.0-rc4-59.24-default
> localhost:~/readdata #cat /sys/block/md126/queue/iostats
> 0

I can reproduce this after setting /sys/block/md126/queue/iostats to
0. But if I comment md_account_bio, this can't be reproduced. If
setting iostats to 0, md_account_bio only checks the bit
QUEUE_FLAG_IO_STAT of the request queue of md126. In
chunk_aligned_read, it can split the original bio and returns the
split bio if it can't do the align read. Are there problems in this
case?

Regards
Xiao
>
> And I can still see relevant log from the terminal which runs 01-test.sh
>
> file /tmp/pythontest/data.out.nhoBR6 Test failed
> (org_checksum|checksum_to_match) [582df4ccea6f5851462379fe4b17abf6
>   -|d41d8cd98f00b204e9800998ec
> f8427e  -]
> file /tmp/pythontest/data.out.nhoBR6 calculate checksum again:
> d41d8cd98f00b204e9800998ecf8427e  -
>
> file /tmp/pythontest/data.out.srGqF9 Test failed
> (org_checksum|checksum_to_match) [7a85ec35b171e52ce41ebdd5da86f1d9
>   -|d41d8cd98f00b204e9800998ec
> f8427e  -]
> file /tmp/pythontest/data.out.srGqF9 calculate checksum again:
> d41d8cd98f00b204e9800998ecf8427e  -
> ...
>
> file /tmp/pythontest/data.out.q6m1kK Test failed
> (org_checksum|checksum_to_match) [996f5bb5d2a2ddfd378a80a563ab4ad5
>   -|d41d8cd98f00b204e9800998ec
> f8427e  -]
> file /tmp/pythontest/data.out.q6m1kK calculate checksum again:
> d41d8cd98f00b204e9800998ecf8427e  -
>
> Rebuild ended, testing done!
> stresstest end at : Sun May 28 22:23:53 EDT 2023
> Sun May 28 22:23:53 EDT 2023
>
> Thanks,
> Guoqing
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-29 13:51           ` Xiao Ni
@ 2023-05-30  0:53             ` Guoqing Jiang
  0 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2023-05-30  0:53 UTC (permalink / raw)
  To: Xiao Ni
  Cc: Tkaczyk, Mariusz, Song Liu, linux-raid, Heinz Mauelshagen, Nigel Croxon



On 5/29/23 21:51, Xiao Ni wrote:
> I can reproduce this after setting /sys/block/md126/queue/iostats to
> 0. But if I comment md_account_bio, this can't be reproduced. If
> setting iostats to 0, md_account_bio only checks the bit
> QUEUE_FLAG_IO_STAT of the request queue of md126.

I don't get why the issue can't be reproduced after comment md_acount_bio
out manually, the only difference between it and disable iostats is that 
chunk
aligned read bio still record time.

> In chunk_aligned_read, it can split the original bio and returns the
> split bio if it can't do the align read. Are there problems in this
> case?

Not sure.

Thanks,
Guoqing


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-29  8:40               ` Xiao Ni
@ 2023-05-30  1:36                 ` Guoqing Jiang
  2023-05-30  2:02                   ` Yu Kuai
  0 siblings, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2023-05-30  1:36 UTC (permalink / raw)
  To: Xiao Ni
  Cc: Tkaczyk, Mariusz, Song Liu, linux-raid, Heinz Mauelshagen, Nigel Croxon

On 5/29/23 16:40, Xiao Ni wrote:
> It's the data which customers read that can be wrong. So it's a
> dangerous thing. Because customers use the data directly. If it's true
> it looks like there is a race between non aligned read and recovery
> process.

But did you get similar report from customer from the past years? I am not
sure it is reasonable to expect the md5sum should be match under such
conditions (multiple processes do write/read with recovery in progress).

Maybe the test launched 32 processes in parallel is another factor, not sure
it still happens with only 1 process which means only do read after write.

Anyway, I think io accounting is irrelevant.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-30  1:36                 ` Guoqing Jiang
@ 2023-05-30  2:02                   ` Yu Kuai
  2023-05-30  2:11                     ` Xiao Ni
  0 siblings, 1 reply; 33+ messages in thread
From: Yu Kuai @ 2023-05-30  2:02 UTC (permalink / raw)
  To: Guoqing Jiang, Xiao Ni
  Cc: Tkaczyk, Mariusz, Song Liu, linux-raid, Heinz Mauelshagen,
	Nigel Croxon, yukuai (C)

Hi,

在 2023/05/30 9:36, Guoqing Jiang 写道:
> 
> 
> On 5/29/23 16:40, Xiao Ni wrote:
>> It's the data which customers read that can be wrong. So it's a
>> dangerous thing. Because customers use the data directly. If it's true
>> it looks like there is a race between non aligned read and recovery
>> process.
> 
> But did you get similar report from customer from the past years? I am not
> sure it is reasonable to expect the md5sum should be match under such
> conditions (multiple processes do write/read with recovery in progress).
> 
> Maybe the test launched 32 processes in parallel is another factor, not 
> sure
> it still happens with only 1 process which means only do read after write.

May I ask if these processes write the same file with same offset? It's
insane if they do... If not, this cound be a problem.

> 
> Anyway, I think io accounting is irrelevant.

Yes, I can't figure out why io accounting can be involved.

Thanks,
Kuai
> 
> Thanks,
> Guoqing
> .
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-30  2:02                   ` Yu Kuai
@ 2023-05-30  2:11                     ` Xiao Ni
  2023-05-30  2:23                       ` Guoqing Jiang
  0 siblings, 1 reply; 33+ messages in thread
From: Xiao Ni @ 2023-05-30  2:11 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Guoqing Jiang, Tkaczyk, Mariusz, Song Liu, linux-raid,
	Heinz Mauelshagen, Nigel Croxon, yukuai (C)

On Tue, May 30, 2023 at 10:02 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2023/05/30 9:36, Guoqing Jiang 写道:
> >
> >
> > On 5/29/23 16:40, Xiao Ni wrote:
> >> It's the data which customers read that can be wrong. So it's a
> >> dangerous thing. Because customers use the data directly. If it's true
> >> it looks like there is a race between non aligned read and recovery
> >> process.
> >
> > But did you get similar report from customer from the past years? I am not
> > sure it is reasonable to expect the md5sum should be match under such
> > conditions (multiple processes do write/read with recovery in progress).

I didn't get a similar report these years besides this one.

> >
> > Maybe the test launched 32 processes in parallel is another factor, not
> > sure
> > it still happens with only 1 process which means only do read after write.
>
> May I ask if these processes write the same file with same offset? It's
> insane if they do... If not, this cound be a problem.

They write to different files. One process writes to its own file.

>
> >
> > Anyway, I think io accounting is irrelevant.
>
> Yes, I can't figure out why io accounting can be involved.

Thanks, I'll look more.

Regards
Xiao
>
> Thanks,
> Kuai
> >
> > Thanks,
> > Guoqing
> > .
> >
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-30  2:11                     ` Xiao Ni
@ 2023-05-30  2:23                       ` Guoqing Jiang
  2023-05-30  2:30                         ` Xiao Ni
  0 siblings, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2023-05-30  2:23 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai
  Cc: Tkaczyk, Mariusz, Song Liu, linux-raid, Heinz Mauelshagen,
	Nigel Croxon, yukuai (C)



On 5/30/23 10:11, Xiao Ni wrote:
>> May I ask if these processes write the same file with same offset? It's
>> insane if they do... If not, this cound be a problem.
> They write to different files. One process writes to its own file.

How big is the capacity of your array? I see the script write 100G file 
first, then create
different files with 3GB size. So you probably need a array with 200G array.

./01-test.sh:dd if=/dev/zero of=/tmp/pythontest/file1bs=1M count=100000 
status=progress

Thanks,
Guoqing



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-30  2:23                       ` Guoqing Jiang
@ 2023-05-30  2:30                         ` Xiao Ni
  2023-05-30  2:43                           ` Guoqing Jiang
  2023-06-14  8:27                           ` Kusiak, Mateusz
  0 siblings, 2 replies; 33+ messages in thread
From: Xiao Ni @ 2023-05-30  2:30 UTC (permalink / raw)
  To: Guoqing Jiang
  Cc: Yu Kuai, Tkaczyk, Mariusz, Song Liu, linux-raid,
	Heinz Mauelshagen, Nigel Croxon, yukuai (C)

On Tue, May 30, 2023 at 10:23 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
>
>
> On 5/30/23 10:11, Xiao Ni wrote:
> >> May I ask if these processes write the same file with same offset? It's
> >> insane if they do... If not, this cound be a problem.
> > They write to different files. One process writes to its own file.
>
> How big is the capacity of your array? I see the script write 100G file
> first, then create
> different files with 3GB size. So you probably need a array with 200G array.
>
> ./01-test.sh:dd if=/dev/zero of=/tmp/pythontest/file1bs=1M count=100000
> status=progress
>
> Thanks,
> Guoqing
>
>
I used ssd disks and without --size. So the raid is almost 1TB

Regards
Xiao


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-30  2:30                         ` Xiao Ni
@ 2023-05-30  2:43                           ` Guoqing Jiang
  2023-06-14  8:27                           ` Kusiak, Mateusz
  1 sibling, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2023-05-30  2:43 UTC (permalink / raw)
  To: Xiao Ni
  Cc: Yu Kuai, Tkaczyk, Mariusz, Song Liu, linux-raid,
	Heinz Mauelshagen, Nigel Croxon, yukuai (C)



On 5/30/23 10:30, Xiao Ni wrote:
> On Tue, May 30, 2023 at 10:23 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>>
>>
>> On 5/30/23 10:11, Xiao Ni wrote:
>>>> May I ask if these processes write the same file with same offset? It's
>>>> insane if they do... If not, this cound be a problem.
>>> They write to different files. One process writes to its own file.
>> How big is the capacity of your array? I see the script write 100G file
>> first, then create
>> different files with 3GB size. So you probably need a array with 200G array.
>>
>> ./01-test.sh:dd if=/dev/zero of=/tmp/pythontest/file1bs=1M count=100000
>> status=progress
>>
>> Thanks,
>> Guoqing
>>
>>
> I used ssd disks and without --size. So the raid is almost 1TB

I can't reproduce it after disable write 100G file1 and decrease 
FILE_SIZE to 30
to fit my array, and iostats is on, just FYI.

localhost:~/readdata #cat /sys/block/md126/queue/iostats
1
localhost:~/readdata #./01-test.sh
mdadm: stopped /dev/md126
umount: /dev/md126: no mount point specified.
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md126 started.
mke2fs 1.43.8 (1-Jan-2018)
/dev/md126 contains a ext4 file system
        last mounted on /tmp/pythontest on Mon May 29 22:21:18 2023
Creating filesystem with 784896 4k blocks and 196224 inodes
Filesystem UUID: 2eb67cc5-67d0-432e-9b66-d3bc3c472d93
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912

Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

Mount point already exists: /tmp/pythontest
Clearing mount point!
mdadm: set sdd faulty in md126
mdadm: hot removed sdd from md126
5.15.0-rc4-59.24-default
Mon May 29 22:31:27 EDT 2023
dd start at : Mon May 29 22:31:27 EDT 2023
Mon May 29 22:31:27 EDT 2023
dd ended at : Mon May 29 22:31:27 EDT 2023
mdadm: added /dev/sdd
Mon May 29 22:31:30 EDT 2023
stresstest start  at : Mon May 29 22:31:30 EDT 2023
Test started!
start
[...]
start
Rebuild ended, testing done!
stresstest end at : Mon May 29 22:32:30 EDT 2023
Mon May 29 22:32:30 EDT 2023


Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-30  2:30                         ` Xiao Ni
  2023-05-30  2:43                           ` Guoqing Jiang
@ 2023-06-14  8:27                           ` Kusiak, Mateusz
  2023-06-14  8:46                             ` Xiao Ni
  1 sibling, 1 reply; 33+ messages in thread
From: Kusiak, Mateusz @ 2023-06-14  8:27 UTC (permalink / raw)
  To: Xiao Ni, Guoqing Jiang
  Cc: Yu Kuai, Tkaczyk, Mariusz, Song Liu, linux-raid,
	Heinz Mauelshagen, Nigel Croxon, yukuai (C)

Hi Xiao,
looks like this mail thread has gone silent for a while, were you able
to determine the issue source?

Thanks,
Mateusz

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-06-14  8:27                           ` Kusiak, Mateusz
@ 2023-06-14  8:46                             ` Xiao Ni
  0 siblings, 0 replies; 33+ messages in thread
From: Xiao Ni @ 2023-06-14  8:46 UTC (permalink / raw)
  To: Kusiak, Mateusz
  Cc: Guoqing Jiang, Yu Kuai, Tkaczyk, Mariusz, Song Liu, linux-raid,
	Heinz Mauelshagen, Nigel Croxon, yukuai (C)

On Wed, Jun 14, 2023 at 4:29 PM Kusiak, Mateusz
<mateusz.kusiak@linux.intel.com> wrote:
>
> Hi Xiao,
> looks like this mail thread has gone silent for a while, were you able
> to determine the issue source?
>
> Thanks,
> Mateusz
>

Hi Mateusz

Sorry, I'm still trying to figure out the root cause.

Regards
Xiao


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-05-27  0:56             ` Xiao Ni
@ 2023-07-11  0:39               ` Xiao Ni
  2023-07-14  1:30                 ` Yu Kuai
  0 siblings, 1 reply; 33+ messages in thread
From: Xiao Ni @ 2023-07-11  0:39 UTC (permalink / raw)
  To: Song Liu
  Cc: Mariusz Tkaczyk, Guoqing Jiang, Tkaczyk, Mariusz, linux-raid,
	Heinz Mauelshagen, Nigel Croxon


在 2023/5/27 上午8:56, Xiao Ni 写道:
> Hi all
>
> The attachment is the scripts.
>
> 1. It needs to modify the member disks and raid name in prepare_rebuild_env.sh
> 2. It needs to modify the member disks and raid name in 01-test.sh
> 3. Then run 01-test.sh directly
>

Hi all

I have tried with a work around patch and can confirm this problem can't 
be reproduced again with this patch.

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4cdb35e54251..96d7f8048876 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6190,7 +6190,8 @@ static bool raid5_make_request(struct mddev 
*mddev, struct bio * bi)
                         md_write_end(mddev);
                 return true;
         }
-       md_account_bio(mddev, &bi);
+       if (rw == WRITE)
+               md_account_bio(mddev, &bi);

         /*
          * Lets start with the stripe with the lowest chunk offset in 
the first


This patch only disables the accounting for non-align read requests. I 
know it's not a good one. But the data corruption is more

serious than io accouting.

Regards

Xiao


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: The read data is wrong from raid5 when recovery happens
  2023-07-11  0:39               ` Xiao Ni
@ 2023-07-14  1:30                 ` Yu Kuai
  0 siblings, 0 replies; 33+ messages in thread
From: Yu Kuai @ 2023-07-14  1:30 UTC (permalink / raw)
  To: Xiao Ni, Song Liu
  Cc: Mariusz Tkaczyk, Guoqing Jiang, Tkaczyk, Mariusz, linux-raid,
	Heinz Mauelshagen, Nigel Croxon, yukuai (C)

Hi,

在 2023/07/11 8:39, Xiao Ni 写道:
> 
> 在 2023/5/27 上午8:56, Xiao Ni 写道:
>> Hi all
>>
>> The attachment is the scripts.
>>
>> 1. It needs to modify the member disks and raid name in 
>> prepare_rebuild_env.sh
>> 2. It needs to modify the member disks and raid name in 01-test.sh
>> 3. Then run 01-test.sh directly
>>
> 
> Hi all
> 
> I have tried with a work around patch and can confirm this problem can't 
> be reproduced again with this patch.
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 4cdb35e54251..96d7f8048876 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6190,7 +6190,8 @@ static bool raid5_make_request(struct mddev 
> *mddev, struct bio * bi)
>                          md_write_end(mddev);
>                  return true;
>          }
> -       md_account_bio(mddev, &bi);
> +       if (rw == WRITE)
> +               md_account_bio(mddev, &bi);

After spending sometime review related code, I still can't figure out
what is wrong and how can this solves anything. ☹️

I assume your testcase doesn't involved io error, and
'rdev->recovery_offset' in only updated after sync io is done
from md_do_sync(), and 'rdev->recovery_offset' is checked while choosing
the rdev to read from analyse_stripe(). (There is a problem from
raid5_read_one_chunk() about checking 'recovery_offset', but as you
said chunk_aligned_read() is not involed here.) Hence I think it's not
possible to read data from the disk that is not recovered.

And again, I don't think there is a difference for what will be with or
without this md_account_bio().

Thanks,
Kuai
> 
>          /*
>           * Lets start with the stripe with the lowest chunk offset in 
> the first
> 
> 
> This patch only disables the accounting for non-align read requests. I 
> know it's not a good one. But the data corruption is more
> 
> serious than io accouting.
> 
> Regards
> 
> Xiao
> 
> .
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Fwd: The read data is wrong from raid5 when recovery happens
  2023-05-26  2:08 ` Fwd: The read data is wrong from raid5 when recovery happens Xiao Ni
  2023-05-26  2:17   ` Yu Kuai
  2023-05-26  3:56   ` d tbsky
@ 2024-02-14 15:15   ` Mateusz Kusiak
  2024-02-14 17:12     ` Song Liu
       [not found]     ` <CALTww29s1WupaVRSrEX1GbD=1Bt7b5cxseDnBLARkH1uHUhtCA@mail.gmail.com>
  2 siblings, 2 replies; 33+ messages in thread
From: Mateusz Kusiak @ 2024-02-14 15:15 UTC (permalink / raw)
  To: xni; +Cc: guoqing.jiang, heinzm, linux-raid, ncroxon, song

Hi Xiao,
I'm bringing back this old thread to ask if there is any progress on the 
topic?
If I'm correct, this is not yet fixed in upstream, right?

We have multiple issues submitted for multiple systems that we belive 
are related to this behavior.
Redhat backports hotfix for their systems, but the vulnerability is 
still present on other OSes.
Is there a plan to have it resolved in upstream?

Thanks,
Mateusz

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Fwd: The read data is wrong from raid5 when recovery happens
  2024-02-14 15:15   ` Fwd: " Mateusz Kusiak
@ 2024-02-14 17:12     ` Song Liu
       [not found]     ` <CALTww29s1WupaVRSrEX1GbD=1Bt7b5cxseDnBLARkH1uHUhtCA@mail.gmail.com>
  1 sibling, 0 replies; 33+ messages in thread
From: Song Liu @ 2024-02-14 17:12 UTC (permalink / raw)
  To: Mateusz Kusiak; +Cc: xni, guoqing.jiang, heinzm, linux-raid, ncroxon

On Wed, Feb 14, 2024 at 7:16 AM Mateusz Kusiak
<mateusz.kusiak@linux.intel.com> wrote:
>
> Hi Xiao,
> I'm bringing back this old thread to ask if there is any progress on the
> topic?
> If I'm correct, this is not yet fixed in upstream, right?
>
> We have multiple issues submitted for multiple systems that we belive
> are related to this behavior.
> Redhat backports hotfix for their systems, but the vulnerability is
> still present on other OSes.
> Is there a plan to have it resolved in upstream?

I think 45b478951b2ba5aea70b2850c49c1aa83aedd0d2 have fixed it.
Do we still get reports from systems with this fix?

Thanks,
Song

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Fwd: The read data is wrong from raid5 when recovery happens
       [not found]     ` <CALTww29s1WupaVRSrEX1GbD=1Bt7b5cxseDnBLARkH1uHUhtCA@mail.gmail.com>
@ 2024-02-15 10:41       ` Mateusz Kusiak
  0 siblings, 0 replies; 33+ messages in thread
From: Mateusz Kusiak @ 2024-02-15 10:41 UTC (permalink / raw)
  To: Xiao Ni, Song Liu, linux-raid

On 15.02.2024 04:23, Xiao Ni wrote:

>
> Hi Mateusz
>
> We used a rhel only patch to fix this problem. Now upstream patch 
> 45b478951b2ba5aea70b2850c49c1aa83aedd0d2 should fix this problem. But 
> there is a bug report against this patch 
> https://lkml.iu.edu/hypermail/linux/kernel/2312.0/08634.html. So we 
> didn't backport this patch to rhel and we still use the rhel only 
> patch. We'll consider to backport in next release. For other OSes, you 
> can have a try with this patch.
>
> Regards
> Xiao


Hello,
thanks for the info, I missed that this patch was merged with upstream.

On 14.02.2024 18:12, Song Liu wrote:
> I think 45b478951b2ba5aea70b2850c49c1aa83aedd0d2 have fixed it.
> Do we still get reports from systems with this fix?
>
> Thanks,
> Song
We tested this with much older inbox kernels. I will try upstream kernel 
to verify this.Thanks, Mateusz

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2024-02-15 10:41 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CALTww28aV5CGXQAu46Rkc=fG1jK=ARzCT8VGoVyje8kQdqEXMg@mail.gmail.com>
2023-05-26  2:08 ` Fwd: The read data is wrong from raid5 when recovery happens Xiao Ni
2023-05-26  2:17   ` Yu Kuai
2023-05-26  2:40     ` Xiao Ni
2023-05-26  2:47       ` Yu Kuai
2023-05-26  3:02         ` Xiao Ni
2023-05-26  3:56   ` d tbsky
2023-05-26  6:20     ` Xiao Ni
2024-02-14 15:15   ` Fwd: " Mateusz Kusiak
2024-02-14 17:12     ` Song Liu
     [not found]     ` <CALTww29s1WupaVRSrEX1GbD=1Bt7b5cxseDnBLARkH1uHUhtCA@mail.gmail.com>
2024-02-15 10:41       ` Mateusz Kusiak
2023-05-26  3:09 ` Guoqing Jiang
2023-05-26  6:45   ` Xiao Ni
2023-05-26  7:12     ` Guoqing Jiang
2023-05-26  7:23       ` Xiao Ni
2023-05-26  9:13         ` Mariusz Tkaczyk
2023-05-26 21:13           ` Song Liu
2023-05-27  0:56             ` Xiao Ni
2023-07-11  0:39               ` Xiao Ni
2023-07-14  1:30                 ` Yu Kuai
2023-05-29  2:25         ` Guoqing Jiang
2023-05-29  3:41           ` Xiao Ni
2023-05-29  8:33             ` Guoqing Jiang
2023-05-29  8:40               ` Xiao Ni
2023-05-30  1:36                 ` Guoqing Jiang
2023-05-30  2:02                   ` Yu Kuai
2023-05-30  2:11                     ` Xiao Ni
2023-05-30  2:23                       ` Guoqing Jiang
2023-05-30  2:30                         ` Xiao Ni
2023-05-30  2:43                           ` Guoqing Jiang
2023-06-14  8:27                           ` Kusiak, Mateusz
2023-06-14  8:46                             ` Xiao Ni
2023-05-29 13:51           ` Xiao Ni
2023-05-30  0:53             ` Guoqing Jiang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.