* raid5 deadlock issue
@ 2022-11-09  6:50 Zhang Tianci
  2022-11-09 22:37 ` Song Liu
  0 siblings, 1 reply; 7+ messages in thread
From: Zhang Tianci @ 2022-11-09  6:50 UTC (permalink / raw)
  To: song; +Cc: linux-raid, Jiachen Zhang

Hi Song,

I am tracking down a deadlock in Linux-5.4.56.

I end up seeing multiple hung tasks with the following stack trace:

  #0 [ffff9dccdd79f538] __schedule at ffffffffaf993b2d
  #1 [ffff9dccdd79f5c8] schedule at ffffffffaf993eaa
  #2 [ffff9dccdd79f5e0] md_bitmap_startwrite at ffffffffc08eef61 [md_mod]
  #3 [ffff9dccdd79f658] add_stripe_bio at ffffffffc0a4b627 [raid456]
  #4 [ffff9dccdd79f6a8] raid5_make_request at ffffffffc0a508fe [raid456]
  #5 [ffff9dccdd79f788] md_handle_request at ffffffffc08e4920 [md_mod]
  #6 [ffff9dccdd79f7f0] md_make_request at ffffffffc08e4a76 [md_mod]
  #7 [ffff9dccdd79f818] generic_make_request at ffffffffaf5ab4bb
  #8 [ffff9dccdd79f878] submit_bio at ffffffffaf5ab6f8
  #9 [ffff9dccdd79f8e0] ext4_io_submit at ffffffffc0834ab9 [ext4]
  #10 [ffff9dccdd79f8f0] ext4_bio_write_page at ffffffffc0834d7e [ext4]

They are all blocked in md_bitmap_startwrite.

After some debugging, I found two kinds of deadlock states, and I
suspect the latest kernel has the same problems.

1. Two threads' bios belong to the same stripe_head:
      threadA                                     threadB
  raid5_make_request
    raid5_get_active_stripe(stripe count=1)
    add_stripe_bio
      md_bitmap_startwrite(takes a bitmap count)
    release_stripe_plug
      add this stripe to the blk plug
    -------------------
    someone does the unplug:
    raid5_unplug
                                             raid5_make_request
                                               raid5_get_active_stripe(count=2)
      __release_stripe                        add_stripe_bio
        stripe count=1                          md_bitmap_startwrite(blocked)
        return

Because the maximum bitmap counter value is (1 << 14) - 1, a single
stripe cannot cause this kind of deadlock by itself, but if enough
stripe_heads end up in the state above, there will be an "AA" deadlock.
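
For context, this is roughly where a writer blocks once a bitmap chunk's
counter is saturated. A simplified paraphrase of the 5.4-era
md_bitmap_startwrite() in drivers/md/md-bitmap.c, not a literal quote:

```
spin_lock_irq(&bitmap->counts.lock);
bmc = md_bitmap_get_counter(&bitmap->counts, offset, &blocks, 1);
if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
        DEFINE_WAIT(__wait);

        /* (1 << 14) - 1 writes are already in flight on this bitmap
         * chunk: sleep on overflow_wait until one of them calls
         * md_bitmap_endwrite().  If every holder of the counter is
         * itself parked on a blk plug or on delayed_list, nothing
         * ever wakes us up. */
        prepare_to_wait(&bitmap->overflow_wait, &__wait,
                        TASK_UNINTERRUPTIBLE);
        spin_unlock_irq(&bitmap->counts.lock);
        schedule();
        finish_wait(&bitmap->overflow_wait, &__wait);
        /* retry the counter check */
}
(*bmc)++;       /* otherwise take one more reference on this chunk */
```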

Of course, this kind of deadlock is almost impossible to trigger in
practice, and the deadlock I encountered was not of this kind. But this
behaviour is one component of the second kind of deadlock.

2. The stripe_head is in the same situation as above, but thread A may
also do something else important: it may execute
`atomic_inc(&conf->preread_active_stripes);`. This counter prevents
raid5d from activating delayed stripes:

```
static void raid5_activate_delayed(struct r5conf *conf)
{
        if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
                while (!list_empty(&conf->delayed_list)) {
                        struct list_head *l = conf->delayed_list.next;
                        struct stripe_head *sh;
                        sh = list_entry(l, struct stripe_head, lru);
                        list_del_init(l);
                        clear_bit(STRIPE_DELAYED, &sh->state);
                        if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
                                atomic_inc(&conf->preread_active_stripes);
                        list_add_tail(&sh->lru, &conf->hold_list);
                        raid5_wakeup_stripe_thread(sh);
                }
        }
}
```
raid5d will only handle delayed stripe_heads when
`conf->preread_active_stripes < IO_THRESHOLD`, where `IO_THRESHOLD` is 1.
So there can be an ABBA deadlock:
many stripe_heads hold a bitmap count and are waiting for
conf->preread_active_stripes to drop, while
some stripe_heads hold conf->preread_active_stripes and are waiting for
the bitmap count.
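
For reference, one place a writer pins that counter on the way in is the
REQ_SYNC case in raid5_make_request(). Paraphrased from the 5.4-era code
(approximate, batch handling omitted):

```
set_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
/* a sync write marks the stripe PREREAD_ACTIVE and bumps the counter
 * that raid5_activate_delayed() compares against IO_THRESHOLD */
if ((bi->bi_opf & REQ_SYNC) &&
    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
        atomic_inc(&conf->preread_active_stripes);
release_stripe_plug(mddev, sh);
```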

I guess the deadlock I encountered was the second kind. Here is some
information about the raid5 array I am using:

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md10 : active raid5 nvme9n1p1[9] nvme8n1p1[7] nvme7n1p1[6]
nvme6n1p1[5] nvme5n1p1[4] nvme4n1p1[3] nvme3n1p1[2] nvme2n1p1[1]
nvme1n1p1[0]
      15001927680 blocks super 1.2 level 5, 512k chunk, algorithm 2
[9/9] [UUUUUUUUU]
      [====>................]  check = 21.0% (394239024/1875240960)
finish=1059475.2min speed=23K/sec
      bitmap: 1/14 pages [4KB], 65536KB chunk

$ mdadm -D /dev/md10
/dev/md10:
        Version : 1.2
  Creation Time : Fri Sep 23 11:47:03 2022
     Raid Level : raid5
     Array Size : 15001927680 (14306.95 GiB 15361.97 GB)
  Used Dev Size : 1875240960 (1788.37 GiB 1920.25 GB)
   Raid Devices : 9
  Total Devices : 9
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun Nov  6 01:29:49 2022
          State : active, checking
 Active Devices : 9
Working Devices : 9
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

   Check Status : 21% complete

           Name : dc02-pd-t8-n021:10  (local to host dc02-pd-t8-n021)
           UUID : 089300e1:45b54872:31a11457:a41ad66a
         Events : 3968

    Number   Major   Minor   RaidDevice State
       0     259        8        0      active sync   /dev/nvme1n1p1
       1     259        6        1      active sync   /dev/nvme2n1p1
       2     259        7        2      active sync   /dev/nvme3n1p1
       3     259       12        3      active sync   /dev/nvme4n1p1
       4     259       11        4      active sync   /dev/nvme5n1p1
       5     259       14        5      active sync   /dev/nvme6n1p1
       6     259       13        6      active sync   /dev/nvme7n1p1
       7     259       21        7      active sync   /dev/nvme8n1p1
       9     259       20        8      active sync   /dev/nvme9n1p1

And some internal state of the raid5 array, obtained via crash and sysfs:

$ cat /sys/block/md10/md/stripe_cache_active
4430               # a very large number of active stripe_heads

crash > foreach UN bt | grep md_bitmap_startwrite | wc -l
48                 # so only 48 stripe_heads are blocked on the bitmap counter
crash > list -o stripe_head.lru -s stripe_head.state -O r5conf.delayed_list -h 0xffff90c1951d5000
....               # a lot of stripe_heads here: 4382 on delayed_list

There are 4430 active stripe_heads: 4382 are on delayed_list, and the
remaining 48 are blocked on the bitmap counter.
So I guess this is the second kind of deadlock.

Then I reviewed the changelog after commit 391b5d39faea ("md/raid5:
Fix Force reconstruct-write io stuck in degraded raid5", dated
2020-07-31) and found no related fix. I'm also not sure my
understanding of raid5 is right, so I am wondering if you could help
confirm whether my analysis is correct.

If my analysis is right, I have one idea to fix the first problem:

We can add a field `bit_counter` to struct stripe_head that counts the
stripe_head's in-flight writes. Only when it goes from 0 to 1 do we call
md_bitmap_startwrite(), and only when it goes from 1 to 0 do we call
md_bitmap_endwrite(). That way each stripe_head takes at most one bitmap
count.
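
A minimal, untested sketch of what I mean (the field name and call sites
are approximate, and the endwrite arguments are simplified):

```
struct stripe_head {
        /* ... existing fields ... */
        atomic_t        bit_counter;    /* writes charged to the bitmap */
};

/* in add_stripe_bio(), instead of one md_bitmap_startwrite() per first
 * write to a dev, only the 0 -> 1 transition takes a bitmap count: */
if (conf->mddev->bitmap && atomic_inc_return(&sh->bit_counter) == 1)
        md_bitmap_startwrite(conf->mddev->bitmap, sh->sector,
                             STRIPE_SECTORS, 0);

/* and in the write-completion path, only the 1 -> 0 transition drops it: */
if (conf->mddev->bitmap && atomic_dec_and_test(&sh->bit_counter))
        md_bitmap_endwrite(conf->mddev->bitmap, sh->sector, STRIPE_SECTORS,
                           !test_bit(STRIPE_DEGRADED, &sh->state), 0);
```

I have not checked how this should interact with stripe batching, so
please treat it only as a sketch of the idea.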

But for the second problem, I do not really understand the purpose of
r5conf->preread_active_stripes, so I have no good idea how to fix it.

Thanks,

Tianci

* Re: raid5 deadlock issue
  2022-11-09  6:50 raid5 deadlock issue Zhang Tianci
@ 2022-11-09 22:37 ` Song Liu
  2022-11-10  3:24   ` [External] " Zhang Tianci
  0 siblings, 1 reply; 7+ messages in thread
From: Song Liu @ 2022-11-09 22:37 UTC (permalink / raw)
  To: Zhang Tianci; +Cc: linux-raid, Jiachen Zhang

 Hi Tianci,

Thanks for the report.


On Tue, Nov 8, 2022 at 10:50 PM Zhang Tianci
<zhangtianci.1997@bytedance.com> wrote:
>
> Hi Song,
>
> I am tracking down a deadlock in Linux-5.4.56.
>
[...]
>
> $ cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md10 : active raid5 nvme9n1p1[9] nvme8n1p1[7] nvme7n1p1[6]
> nvme6n1p1[5] nvme5n1p1[4] nvme4n1p1[3] nvme3n1p1[2] nvme2n1p1[1]
> nvme1n1p1[0]
>       15001927680 blocks super 1.2 level 5, 512k chunk, algorithm 2
> [9/9] [UUUUUUUUU]
>       [====>................]  check = 21.0% (394239024/1875240960)
> finish=1059475.2min speed=23K/sec
>       bitmap: 1/14 pages [4KB], 65536KB chunk

How many instances of this issue do we have? If more than one, I wonder
whether they are all running the raid5 check (as this one is).

>
> $ mdadm -D /dev/md10
> /dev/md10:
>         Version : 1.2
>   Creation Time : Fri Sep 23 11:47:03 2022
>      Raid Level : raid5
>      Array Size : 15001927680 (14306.95 GiB 15361.97 GB)
>   Used Dev Size : 1875240960 (1788.37 GiB 1920.25 GB)
>    Raid Devices : 9
>   Total Devices : 9
>     Persistence : Superblock is persistent
>
>   Intent Bitmap : Internal
>
>     Update Time : Sun Nov  6 01:29:49 2022
>           State : active, checking
>  Active Devices : 9
> Working Devices : 9
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : left-symmetric
>      Chunk Size : 512K
>
>    Check Status : 21% complete
>
>            Name : dc02-pd-t8-n021:10  (local to host dc02-pd-t8-n021)
>            UUID : 089300e1:45b54872:31a11457:a41ad66a
>          Events : 3968
>
>     Number   Major   Minor   RaidDevice State
>        0     259        8        0      active sync   /dev/nvme1n1p1
>        1     259        6        1      active sync   /dev/nvme2n1p1
>        2     259        7        2      active sync   /dev/nvme3n1p1
>        3     259       12        3      active sync   /dev/nvme4n1p1
>        4     259       11        4      active sync   /dev/nvme5n1p1
>        5     259       14        5      active sync   /dev/nvme6n1p1
>        6     259       13        6      active sync   /dev/nvme7n1p1
>        7     259       21        7      active sync   /dev/nvme8n1p1
>        9     259       20        8      active sync   /dev/nvme9n1p1
>
> And some internal state of the raid5 by crash or sysfs:
>
> $ cat /sys/block/md10/md/stripe_cache_active
> 4430               # There are so many active stripe_head
>
> crash > foreach UN bt | grep md_bitmap_startwrite | wc -l
> 48                    # So there are only 48 stripe_head blocked by
> the bitmap counter.
> crash > list -o stripe_head.lru -s stripe_head.state -O
> r5conf.delayed_list -h 0xffff90c1951d5000
> .... # There are so many stripe_head, and the number is 4382.
>
> There are 4430 active stripe_head, and 4382 are in delayed_list, the
> last 48 blocked by the bitmap counter.
> So I guess this is the second deadlock.
>
> Then I reviewed the changelog after the commit 391b5d39faea "md/raid5:
> Fix Force reconstruct-write io stuck in degraded raid5" date
> 2020-07-31, and found no related fixup commit. And I'm not sure my
> understanding of raid5 is right. So I wondering if you can help
> confirm whether my thoughts are right or not.

Have you tried to reproduce this with the latest kernel? There are a few fixes
after 2020, for example

  commit 3312e6c887fe7539f0adb5756ab9020282aaa3d4

Thanks,
Song

* Re: [External] Re: raid5 deadlock issue
  2022-11-09 22:37 ` Song Liu
@ 2022-11-10  3:24   ` Zhang Tianci
  2022-11-10 16:24     ` Zhang Tianci
  0 siblings, 1 reply; 7+ messages in thread
From: Zhang Tianci @ 2022-11-10  3:24 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid, Jiachen Zhang

Hi Song,

Thanks for your quick reply.

On Thu, Nov 10, 2022 at 6:37 AM Song Liu <song@kernel.org> wrote:
>
>  Hi Tianci,
>
> Thanks for the report.
>
>
> On Tue, Nov 8, 2022 at 10:50 PM Zhang Tianci
> <zhangtianci.1997@bytedance.com> wrote:
> >
> > Hi Song,
> >
> > I am tracking down a deadlock in Linux-5.4.56.
> >
> [...]
> >
> > $ cat /proc/mdstat
> > Personalities : [raid6] [raid5] [raid4]
> > md10 : active raid5 nvme9n1p1[9] nvme8n1p1[7] nvme7n1p1[6]
> > nvme6n1p1[5] nvme5n1p1[4] nvme4n1p1[3] nvme3n1p1[2] nvme2n1p1[1]
> > nvme1n1p1[0]
> >       15001927680 blocks super 1.2 level 5, 512k chunk, algorithm 2
> > [9/9] [UUUUUUUUU]
> >       [====>................]  check = 21.0% (394239024/1875240960)
> > finish=1059475.2min speed=23K/sec
> >       bitmap: 1/14 pages [4KB], 65536KB chunk
>
> How many instances of this issue do we have? If more than one, I wonder
> they are all running the raid5 check (as this one is).

We had three instances, but one machine has been rebooted, so we have two now.

And you are right, they are all running the raid5 check now. But I have
been debugging this problem for almost three weeks, and I remember that
they were not doing a sync when I first checked /proc/mdstat.

Now their raid5 resync threads' backtraces are all:

#0 [ffffa8d15fedfb90] __schedule at ffffffffa0d93b2d
 #1 [ffffa8d15fedfc20] schedule at ffffffffa0d93eaa
 #2 [ffffa8d15fedfc38] md_bitmap_cond_end_sync at ffffffffc045758e [md_mod]
 #3 [ffffa8d15fedfc90] raid5_sync_request at ffffffffc0637e2f [raid456]
 #4 [ffffa8d15fedfcf8] md_do_sync at ffffffffc044bd1c [md_mod]
 #5 [ffffa8d15fedfe98] md_thread at ffffffffc0448e50 [md_mod]
 #6 [ffffa8d15fedff10] kthread at ffffffffa06b0d76
 #7 [ffffa8d15fedff50] ret_from_fork at ffffffffa0e001cf

So I guess the resync thread is just a victim.

>
> >
> > $ mdadm -D /dev/md10
> > /dev/md10:
> >         Version : 1.2
> >   Creation Time : Fri Sep 23 11:47:03 2022
> >      Raid Level : raid5
> >      Array Size : 15001927680 (14306.95 GiB 15361.97 GB)
> >   Used Dev Size : 1875240960 (1788.37 GiB 1920.25 GB)
> >    Raid Devices : 9
> >   Total Devices : 9
> >     Persistence : Superblock is persistent
> >
> >   Intent Bitmap : Internal
> >
> >     Update Time : Sun Nov  6 01:29:49 2022
> >           State : active, checking
> >  Active Devices : 9
> > Working Devices : 9
> >  Failed Devices : 0
> >   Spare Devices : 0
> >
> >          Layout : left-symmetric
> >      Chunk Size : 512K
> >
> >    Check Status : 21% complete
> >
> >            Name : dc02-pd-t8-n021:10  (local to host dc02-pd-t8-n021)
> >            UUID : 089300e1:45b54872:31a11457:a41ad66a
> >          Events : 3968
> >
> >     Number   Major   Minor   RaidDevice State
> >        0     259        8        0      active sync   /dev/nvme1n1p1
> >        1     259        6        1      active sync   /dev/nvme2n1p1
> >        2     259        7        2      active sync   /dev/nvme3n1p1
> >        3     259       12        3      active sync   /dev/nvme4n1p1
> >        4     259       11        4      active sync   /dev/nvme5n1p1
> >        5     259       14        5      active sync   /dev/nvme6n1p1
> >        6     259       13        6      active sync   /dev/nvme7n1p1
> >        7     259       21        7      active sync   /dev/nvme8n1p1
> >        9     259       20        8      active sync   /dev/nvme9n1p1
> >
> > And some internal state of the raid5 by crash or sysfs:
> >
> > $ cat /sys/block/md10/md/stripe_cache_active
> > 4430               # There are so many active stripe_head
> >
> > crash > foreach UN bt | grep md_bitmap_startwrite | wc -l
> > 48                    # So there are only 48 stripe_head blocked by
> > the bitmap counter.
> > crash > list -o stripe_head.lru -s stripe_head.state -O
> > r5conf.delayed_list -h 0xffff90c1951d5000
> > .... # There are so many stripe_head, and the number is 4382.
> >
> > There are 4430 active stripe_head, and 4382 are in delayed_list, the
> > last 48 blocked by the bitmap counter.
> > So I guess this is the second deadlock.
> >
> > Then I reviewed the changelog after the commit 391b5d39faea "md/raid5:
> > Fix Force reconstruct-write io stuck in degraded raid5" date
> > 2020-07-31, and found no related fixup commit. And I'm not sure my
> > understanding of raid5 is right. So I wondering if you can help
> > confirm whether my thoughts are right or not.
>
> Have you tried to reproduce this with the latest kernel? There are a few fixes
> after 2020, for example
>
>   commit 3312e6c887fe7539f0adb5756ab9020282aaa3d4
>
Thanks for the pointer to this commit. I tried to understand it, and I
think it is an optimization of the batch list rather than a fix?

I have not tried to reproduce this with the latest kernel, because the
maximum bitmap counter ((1 << 14) - 1) is so large that I think the race
is very hard to hit. So I created a special environment by hacking the
kernel (based on 5.4.56) to change the bitmap counter maximum
(COUNTER_MAX) to 4. Then I could trigger a deadlock with this command:

fio -filename=testfile -ioengine=libaio -bs=16M -size=10G -numjobs=100 \
    -iodepth=1 -runtime=60 -rw=write -group_reporting -name="test"
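
For reference, the COUNTER_MAX hack is roughly the change below in
drivers/md/md-bitmap.h (an approximation, shown only to make the
reproduction easier to follow):

```
/* Debug hack only.  Stock kernels keep 14 counter bits per bitmap chunk
 * (two of the 16 bits are the NEEDED/RESYNC flags), so COUNTER_MAX is
 * normally (1 << 14) - 1, which is why this window is so hard to hit in
 * practice.  Shrinking it lets a handful of concurrent writers saturate
 * a chunk and park in md_bitmap_startwrite(). */
#define COUNTER_MAX ((bitmap_counter_t) 4)
```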

Then I found the first deadlock state, but it is not the real reason.

I will run a test with the latest kernel and report the result back later.

Thanks,
Tianci

* Re: [External] Re: raid5 deadlock issue
  2022-11-10  3:24   ` [External] " Zhang Tianci
@ 2022-11-10 16:24     ` Zhang Tianci
  2022-11-11  4:14       ` Song Liu
  0 siblings, 1 reply; 7+ messages in thread
From: Zhang Tianci @ 2022-11-10 16:24 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid, Jiachen Zhang

On Thu, Nov 10, 2022 at 11:24 AM Zhang Tianci
<zhangtianci.1997@bytedance.com> wrote:
>
> Hi Song,
>
> Thanks for your quick reply.
>
> On Thu, Nov 10, 2022 at 6:37 AM Song Liu <song@kernel.org> wrote:
> >
> >  Hi Tianci,
> >
> > Thanks for the report.
> >
> >
> > On Tue, Nov 8, 2022 at 10:50 PM Zhang Tianci
> > <zhangtianci.1997@bytedance.com> wrote:
> > >
> > > Hi Song,
> > >
> > > I am tracking down a deadlock in Linux-5.4.56.
> > >
> > [...]
> > >
> > > $ cat /proc/mdstat
> > > Personalities : [raid6] [raid5] [raid4]
> > > md10 : active raid5 nvme9n1p1[9] nvme8n1p1[7] nvme7n1p1[6]
> > > nvme6n1p1[5] nvme5n1p1[4] nvme4n1p1[3] nvme3n1p1[2] nvme2n1p1[1]
> > > nvme1n1p1[0]
> > >       15001927680 blocks super 1.2 level 5, 512k chunk, algorithm 2
> > > [9/9] [UUUUUUUUU]
> > >       [====>................]  check = 21.0% (394239024/1875240960)
> > > finish=1059475.2min speed=23K/sec
> > >       bitmap: 1/14 pages [4KB], 65536KB chunk
> >
> > How many instances of this issue do we have? If more than one, I wonder
> > they are all running the raid5 check (as this one is).
>
> We have three instances, but reboot one machine, so we have two machines now.
>
> And you are right, they are all running the raid5 check, but because I
> have debugged
> this problem for almost three weeks, and I remember they were not
> doing sync when
> I first checked /proc/mdstat.
>
> Now their raid5_resync threads backtrace are all:
>
> #0 [ffffa8d15fedfb90] __schedule at ffffffffa0d93b2d
>  #1 [ffffa8d15fedfc20] schedule at ffffffffa0d93eaa
>  #2 [ffffa8d15fedfc38] md_bitmap_cond_end_sync at ffffffffc045758e [md_mod]
>  #3 [ffffa8d15fedfc90] raid5_sync_request at ffffffffc0637e2f [raid456]
>  #4 [ffffa8d15fedfcf8] md_do_sync at ffffffffc044bd1c [md_mod]
>  #5 [ffffa8d15fedfe98] md_thread at ffffffffc0448e50 [md_mod]
>  #6 [ffffa8d15fedff10] kthread at ffffffffa06b0d76
>  #7 [ffffa8d15fedff50] ret_from_fork at ffffffffa0e001cf
>
> So I guess the resync thread is just a victim.
>
> >
> > >
> > > $ mdadm -D /dev/md10
> > > /dev/md10:
> > >         Version : 1.2
> > >   Creation Time : Fri Sep 23 11:47:03 2022
> > >      Raid Level : raid5
> > >      Array Size : 15001927680 (14306.95 GiB 15361.97 GB)
> > >   Used Dev Size : 1875240960 (1788.37 GiB 1920.25 GB)
> > >    Raid Devices : 9
> > >   Total Devices : 9
> > >     Persistence : Superblock is persistent
> > >
> > >   Intent Bitmap : Internal
> > >
> > >     Update Time : Sun Nov  6 01:29:49 2022
> > >           State : active, checking
> > >  Active Devices : 9
> > > Working Devices : 9
> > >  Failed Devices : 0
> > >   Spare Devices : 0
> > >
> > >          Layout : left-symmetric
> > >      Chunk Size : 512K
> > >
> > >    Check Status : 21% complete
> > >
> > >            Name : dc02-pd-t8-n021:10  (local to host dc02-pd-t8-n021)
> > >            UUID : 089300e1:45b54872:31a11457:a41ad66a
> > >          Events : 3968
> > >
> > >     Number   Major   Minor   RaidDevice State
> > >        0     259        8        0      active sync   /dev/nvme1n1p1
> > >        1     259        6        1      active sync   /dev/nvme2n1p1
> > >        2     259        7        2      active sync   /dev/nvme3n1p1
> > >        3     259       12        3      active sync   /dev/nvme4n1p1
> > >        4     259       11        4      active sync   /dev/nvme5n1p1
> > >        5     259       14        5      active sync   /dev/nvme6n1p1
> > >        6     259       13        6      active sync   /dev/nvme7n1p1
> > >        7     259       21        7      active sync   /dev/nvme8n1p1
> > >        9     259       20        8      active sync   /dev/nvme9n1p1
> > >
> > > And some internal state of the raid5 by crash or sysfs:
> > >
> > > $ cat /sys/block/md10/md/stripe_cache_active
> > > 4430               # There are so many active stripe_head
> > >
> > > crash > foreach UN bt | grep md_bitmap_startwrite | wc -l
> > > 48                    # So there are only 48 stripe_head blocked by
> > > the bitmap counter.
> > > crash > list -o stripe_head.lru -s stripe_head.state -O
> > > r5conf.delayed_list -h 0xffff90c1951d5000
> > > .... # There are so many stripe_head, and the number is 4382.
> > >
> > > There are 4430 active stripe_head, and 4382 are in delayed_list, the
> > > last 48 blocked by the bitmap counter.
> > > So I guess this is the second deadlock.
> > >
> > > Then I reviewed the changelog after the commit 391b5d39faea "md/raid5:
> > > Fix Force reconstruct-write io stuck in degraded raid5" date
> > > 2020-07-31, and found no related fixup commit. And I'm not sure my
> > > understanding of raid5 is right. So I wondering if you can help
> > > confirm whether my thoughts are right or not.
> >
> > Have you tried to reproduce this with the latest kernel? There are a few fixes
> > after 2020, for example
> >
> >   commit 3312e6c887fe7539f0adb5756ab9020282aaa3d4
> >
> Thanks for your commit, I tried to understand this commit, and I think
> it is an optimization
> of the batch list but not a fixup?
>
> I have not tried to reproduce this with the latest kernel, because the
> max bitmap counter
> (1 << 14) is so large that I think it is difficult to reproduce. So I
> create a special
> environment by hacking kernel(base on 5.4.56) that changes the bitmap counter
> max number(COUNTER_MAX) to 4. Then I could get a deadlock by this command:
>
> fio -filename=testfile -ioengine=libaio -bs=16M -size=10G -numjobs=100
> -iodepth=1 -runtime=60
> -rw=write -group_reporting -name="test"
>
> Then I found the first deadlock state, but it is not the real reason.
>
> I will do a test with the latest kernel. I will report to you the result later.
>
I can reproduce the first deadlock on linux-6.1-rc4.
There are 26 stripe_heads, and 26 fio threads are blocked with the same
backtrace:

 #0 [ffffc9000cd0f8b0] __schedule at ffffffff818b3c3c
 #1 [ffffc9000cd0f940] schedule at ffffffff818b4313
 #2 [ffffc9000cd0f950] md_bitmap_startwrite at ffffffffc063354a [md_mod]
 #3 [ffffc9000cd0f9c0] __add_stripe_bio at ffffffffc064fbd6 [raid456]
 #4 [ffffc9000cd0fa00] raid5_make_request at ffffffffc065a84c [raid456]
 #5 [ffffc9000cd0fb30] md_handle_request at ffffffffc0628496 [md_mod]
 #6 [ffffc9000cd0fb98] __submit_bio at ffffffff813f308f
 #7 [ffffc9000cd0fbb8] submit_bio_noacct_nocheck at ffffffff813f3501
 #8 [ffffc9000cd0fc00] __block_write_full_page at ffffffff8134ca64
 #9 [ffffc9000cd0fc60] __writepage at ffffffff8123f4a3
#10 [ffffc9000cd0fc78] write_cache_pages at ffffffff8123fb57
#11 [ffffc9000cd0fd70] generic_writepages at ffffffff8123feef
#12 [ffffc9000cd0fdc0] do_writepages at ffffffff81241f12
#13 [ffffc9000cd0fe28] filemap_fdatawrite_wbc at ffffffff8123306b
#14 [ffffc9000cd0fe48] __filemap_fdatawrite_range at ffffffff81239154
#15 [ffffc9000cd0fec0] file_write_and_wait_range at ffffffff812393e1
#16 [ffffc9000cd0fef0] blkdev_fsync at ffffffff813ec223
#17 [ffffc9000cd0ff08] do_fsync at ffffffff81342798
#18 [ffffc9000cd0ff30] __x64_sys_fsync at ffffffff813427e0
#19 [ffffc9000cd0ff38] do_syscall_64 at ffffffff818a6114
#20 [ffffc9000cd0ff50] entry_SYSCALL_64_after_hwframe at ffffffff81a0009b

> Thanks,
> Tianci

* Re: [External] Re: raid5 deadlock issue
  2022-11-10 16:24     ` Zhang Tianci
@ 2022-11-11  4:14       ` Song Liu
  2022-11-11  6:26         ` Zhang Tianci
  0 siblings, 1 reply; 7+ messages in thread
From: Song Liu @ 2022-11-11  4:14 UTC (permalink / raw)
  To: Zhang Tianci; +Cc: linux-raid, Jiachen Zhang

On Thu, Nov 10, 2022 at 10:25 AM Zhang Tianci
<zhangtianci.1997@bytedance.com> wrote:
>

> > fio -filename=testfile -ioengine=libaio -bs=16M -size=10G -numjobs=100
> > -iodepth=1 -runtime=60
> > -rw=write -group_reporting -name="test"
> >
> > Then I found the first deadlock state, but it is not the real reason.
> >
> > I will do a test with the latest kernel. I will report to you the result later.
> >
> I can reproduce the first deadlock in linux-6.1-rc4.
> There are 26 stripe_head and 26 fio threads blocked with same backtrace:
>
>  #0 [ffffc9000cd0f8b0] __schedule at ffffffff818b3c3c
>  #1 [ffffc9000cd0f940] schedule at ffffffff818b4313
>  #2 [ffffc9000cd0f950] md_bitmap_startwrite at ffffffffc063354a [md_mod]
>  #3 [ffffc9000cd0f9c0] __add_stripe_bio at ffffffffc064fbd6 [raid456]
>  #4 [ffffc9000cd0fa00] raid5_make_request at ffffffffc065a84c [raid456]
>  #5 [ffffc9000cd0fb30] md_handle_request at ffffffffc0628496 [md_mod]
>  #6 [ffffc9000cd0fb98] __submit_bio at ffffffff813f308f
>  #7 [ffffc9000cd0fbb8] submit_bio_noacct_nocheck at ffffffff813f3501
>  #8 [ffffc9000cd0fc00] __block_write_full_page at ffffffff8134ca64
>  #9 [ffffc9000cd0fc60] __writepage at ffffffff8123f4a3
> #10 [ffffc9000cd0fc78] write_cache_pages at ffffffff8123fb57
> #11 [ffffc9000cd0fd70] generic_writepages at ffffffff8123feef
> #12 [ffffc9000cd0fdc0] do_writepages at ffffffff81241f12
> #13 [ffffc9000cd0fe28] filemap_fdatawrite_wbc at ffffffff8123306b
> #14 [ffffc9000cd0fe48] __filemap_fdatawrite_range at ffffffff81239154
> #15 [ffffc9000cd0fec0] file_write_and_wait_range at ffffffff812393e1
> #16 [ffffc9000cd0fef0] blkdev_fsync at ffffffff813ec223
> #17 [ffffc9000cd0ff08] do_fsync at ffffffff81342798
> #18 [ffffc9000cd0ff30] __x64_sys_fsync at ffffffff813427e0
> #19 [ffffc9000cd0ff38] do_syscall_64 at ffffffff818a6114
> #20 [ffffc9000cd0ff50] entry_SYSCALL_64_after_hwframe at ffffffff81a0009b

Thanks for this information.

I guess this is with COUNTER_MAX of 4? And it is slightly different to the
issue you found?

I will try to look into this next week (taking some time off this week).

Thanks,
Song

* Re: [External] Re: raid5 deadlock issue
  2022-11-11  4:14       ` Song Liu
@ 2022-11-11  6:26         ` Zhang Tianci
  2022-11-30  6:21           ` Zhang Tianci
  0 siblings, 1 reply; 7+ messages in thread
From: Zhang Tianci @ 2022-11-11  6:26 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid, Jiachen Zhang

On Fri, Nov 11, 2022 at 12:14 PM Song Liu <song@kernel.org> wrote:
>
> On Thu, Nov 10, 2022 at 10:25 AM Zhang Tianci
> <zhangtianci.1997@bytedance.com> wrote:
> >
>
> > > fio -filename=testfile -ioengine=libaio -bs=16M -size=10G -numjobs=100
> > > -iodepth=1 -runtime=60
> > > -rw=write -group_reporting -name="test"
> > >
> > > Then I found the first deadlock state, but it is not the real reason.
> > >
> > > I will do a test with the latest kernel. I will report to you the result later.
> > >
> > I can reproduce the first deadlock in linux-6.1-rc4.
> > There are 26 stripe_head and 26 fio threads blocked with same backtrace:
> >
> >  #0 [ffffc9000cd0f8b0] __schedule at ffffffff818b3c3c
> >  #1 [ffffc9000cd0f940] schedule at ffffffff818b4313
> >  #2 [ffffc9000cd0f950] md_bitmap_startwrite at ffffffffc063354a [md_mod]
> >  #3 [ffffc9000cd0f9c0] __add_stripe_bio at ffffffffc064fbd6 [raid456]
> >  #4 [ffffc9000cd0fa00] raid5_make_request at ffffffffc065a84c [raid456]
> >  #5 [ffffc9000cd0fb30] md_handle_request at ffffffffc0628496 [md_mod]
> >  #6 [ffffc9000cd0fb98] __submit_bio at ffffffff813f308f
> >  #7 [ffffc9000cd0fbb8] submit_bio_noacct_nocheck at ffffffff813f3501
> >  #8 [ffffc9000cd0fc00] __block_write_full_page at ffffffff8134ca64
> >  #9 [ffffc9000cd0fc60] __writepage at ffffffff8123f4a3
> > #10 [ffffc9000cd0fc78] write_cache_pages at ffffffff8123fb57
> > #11 [ffffc9000cd0fd70] generic_writepages at ffffffff8123feef
> > #12 [ffffc9000cd0fdc0] do_writepages at ffffffff81241f12
> > #13 [ffffc9000cd0fe28] filemap_fdatawrite_wbc at ffffffff8123306b
> > #14 [ffffc9000cd0fe48] __filemap_fdatawrite_range at ffffffff81239154
> > #15 [ffffc9000cd0fec0] file_write_and_wait_range at ffffffff812393e1
> > #16 [ffffc9000cd0fef0] blkdev_fsync at ffffffff813ec223
> > #17 [ffffc9000cd0ff08] do_fsync at ffffffff81342798
> > #18 [ffffc9000cd0ff30] __x64_sys_fsync at ffffffff813427e0
> > #19 [ffffc9000cd0ff38] do_syscall_64 at ffffffff818a6114
> > #20 [ffffc9000cd0ff50] entry_SYSCALL_64_after_hwframe at ffffffff81a0009b
>
> Thanks for this information.
>
> I guess this is with COUNTER_MAX of 4? And it is slightly different to the
> issue you found?

Yes, I hacked COUNTER_MAX to 4; I think this increases the probability
of hitting the bitmap counter race.

And this kind of deadlock is very unlikely to happen without the hack.

It only happened while I was debugging, but it helped me form a guess
(the second deadlock state in my first email) about the real cause.

>
> I will try to look into this next week (taking some time off this week).

Thanks,
Tianci

* Re: [External] Re: raid5 deadlock issue
  2022-11-11  6:26         ` Zhang Tianci
@ 2022-11-30  6:21           ` Zhang Tianci
  0 siblings, 0 replies; 7+ messages in thread
From: Zhang Tianci @ 2022-11-30  6:21 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid, Jiachen Zhang

On Fri, Nov 11, 2022 at 2:26 PM Zhang Tianci
<zhangtianci.1997@bytedance.com> wrote:
>
> On Fri, Nov 11, 2022 at 12:14 PM Song Liu <song@kernel.org> wrote:
> >
> > On Thu, Nov 10, 2022 at 10:25 AM Zhang Tianci
> > <zhangtianci.1997@bytedance.com> wrote:
> > >
> >
> > > > fio -filename=testfile -ioengine=libaio -bs=16M -size=10G -numjobs=100
> > > > -iodepth=1 -runtime=60
> > > > -rw=write -group_reporting -name="test"
> > > >
> > > > Then I found the first deadlock state, but it is not the real reason.
> > > >
> > > > I will do a test with the latest kernel. I will report to you the result later.
> > > >
> > > I can reproduce the first deadlock in linux-6.1-rc4.
> > > There are 26 stripe_head and 26 fio threads blocked with same backtrace:
> > >
> > >  #0 [ffffc9000cd0f8b0] __schedule at ffffffff818b3c3c
> > >  #1 [ffffc9000cd0f940] schedule at ffffffff818b4313
> > >  #2 [ffffc9000cd0f950] md_bitmap_startwrite at ffffffffc063354a [md_mod]
> > >  #3 [ffffc9000cd0f9c0] __add_stripe_bio at ffffffffc064fbd6 [raid456]
> > >  #4 [ffffc9000cd0fa00] raid5_make_request at ffffffffc065a84c [raid456]
> > >  #5 [ffffc9000cd0fb30] md_handle_request at ffffffffc0628496 [md_mod]
> > >  #6 [ffffc9000cd0fb98] __submit_bio at ffffffff813f308f
> > >  #7 [ffffc9000cd0fbb8] submit_bio_noacct_nocheck at ffffffff813f3501
> > >  #8 [ffffc9000cd0fc00] __block_write_full_page at ffffffff8134ca64
> > >  #9 [ffffc9000cd0fc60] __writepage at ffffffff8123f4a3
> > > #10 [ffffc9000cd0fc78] write_cache_pages at ffffffff8123fb57
> > > #11 [ffffc9000cd0fd70] generic_writepages at ffffffff8123feef
> > > #12 [ffffc9000cd0fdc0] do_writepages at ffffffff81241f12
> > > #13 [ffffc9000cd0fe28] filemap_fdatawrite_wbc at ffffffff8123306b
> > > #14 [ffffc9000cd0fe48] __filemap_fdatawrite_range at ffffffff81239154
> > > #15 [ffffc9000cd0fec0] file_write_and_wait_range at ffffffff812393e1
> > > #16 [ffffc9000cd0fef0] blkdev_fsync at ffffffff813ec223
> > > #17 [ffffc9000cd0ff08] do_fsync at ffffffff81342798
> > > #18 [ffffc9000cd0ff30] __x64_sys_fsync at ffffffff813427e0
> > > #19 [ffffc9000cd0ff38] do_syscall_64 at ffffffff818a6114
> > > #20 [ffffc9000cd0ff50] entry_SYSCALL_64_after_hwframe at ffffffff81a0009b
> >
> > Thanks for this information.
> >
> > I guess this is with COUNTER_MAX of 4? And it is slightly different to the
> > issue you found?
>
> Yes, I hack COUNTER_MAX to 4, I think this could increase the
> probability of bitmap
> counter racing.
>
> And this kind of deadlock is very difficult to happen without hacking.
>
> It just happened when I debugged, but it help me find a guess(the
> second deadlock state
> in the first email) about the real reason.
>
> >
> > I will try to look into this next week (taking some time off this week).
>
> Thanks,
> Tianci

Hi Song,

We have hit this problem on a new machine, and /proc/mdstat shows it is
not doing a sync:

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md10 : active raid5 nvme9n1p1[9] nvme8n1p1[7] nvme7n1p1[6]
nvme6n1p1[5] nvme5n1p1[4] nvme4n1p1[3] nvme3n1p1[2] nvme2n1p1[1]
nvme1n1p1[0]
      15001927680 blocks super 1.2 level 5, 512k chunk, algorithm 2
[9/9] [UUUUUUUUU]
      bitmap: 6/14 pages [24KB], 65536KB chunk

unused devices: <none>

And more than 15,000 stripe_heads are blocked.

# cat /sys/block/md10/md/stripe_cache_active
15456

I guess this is the same problem as before.

What do you think about my deadlock guess in the first email?

Thanks,
Tianci
