* [PATCH] md/bitmap: wait for bitmap writes to complete during the tear down sequence
@ 2021-04-08 21:39 Sudhakar Panneerselvam
  2021-04-10 15:27 ` heming.zhao
  0 siblings, 1 reply; 8+ messages in thread
From: Sudhakar Panneerselvam @ 2021-04-08 21:39 UTC (permalink / raw)
  To: linux-raid, song
  Cc: heming.zhao, ghe, lidong.zhong, xni, colyli, martin.petersen

NULL pointer dereference was observed in super_written() when it tries
to access the mddev structure.

[The below stack trace is from an older kernel, but the problem described in
this patch applies to the mainline kernel.]

[ 1194.474861] task: ffff8fdd20858000 task.stack: ffffb99d40790000
[ 1194.488000] RIP: 0010:super_written+0x29/0xe1
[ 1194.499688] RSP: 0018:ffff8ffb7fcc3c78 EFLAGS: 00010046
[ 1194.512477] RAX: 0000000000000000 RBX: ffff8ffb7bf4a000 RCX: ffff8ffb78991048
[ 1194.527325] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ffb56b8a200
[ 1194.542576] RBP: ffff8ffb7fcc3c90 R08: 000000000000000b R09: 0000000000000000
[ 1194.558001] R10: ffff8ffb56b8a298 R11: 0000000000000000 R12: ffff8ffb56b8a200
[ 1194.573070] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1194.588117] FS:  0000000000000000(0000) GS:ffff8ffb7fcc0000(0000) knlGS:0000000000000000
[ 1194.604264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1194.617375] CR2: 00000000000002b8 CR3: 00000021e040a002 CR4: 00000000007606e0
[ 1194.632327] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1194.647865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1194.663316] PKRU: 55555554
[ 1194.674090] Call Trace:
[ 1194.683735]  <IRQ>
[ 1194.692948]  bio_endio+0xae/0x135
[ 1194.703580]  blk_update_request+0xad/0x2fa
[ 1194.714990]  blk_update_bidi_request+0x20/0x72
[ 1194.726578]  __blk_end_bidi_request+0x2c/0x4d
[ 1194.738373]  __blk_end_request_all+0x31/0x49
[ 1194.749344]  blk_flush_complete_seq+0x377/0x383
[ 1194.761550]  flush_end_io+0x1dd/0x2a7
[ 1194.772910]  blk_finish_request+0x9f/0x13c
[ 1194.784544]  scsi_end_request+0x180/0x25c
[ 1194.796149]  scsi_io_completion+0xc8/0x610
[ 1194.807503]  scsi_finish_command+0xdc/0x125
[ 1194.818897]  scsi_softirq_done+0x81/0xde
[ 1194.830062]  blk_done_softirq+0xa4/0xcc
[ 1194.841008]  __do_softirq+0xd9/0x29f
[ 1194.851257]  irq_exit+0xe6/0xeb
[ 1194.861290]  do_IRQ+0x59/0xe3
[ 1194.871060]  common_interrupt+0x1c6/0x382
[ 1194.881988]  </IRQ>
[ 1194.890646] RIP: 0010:cpuidle_enter_state+0xdd/0x2a5
[ 1194.902532] RSP: 0018:ffffb99d40793e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff43
[ 1194.917317] RAX: ffff8ffb7fce27c0 RBX: ffff8ffb7fced800 RCX: 000000000000001f
[ 1194.932056] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000000
[ 1194.946428] RBP: ffffb99d40793ea0 R08: 0000000000000004 R09: 0000000000002ed2
[ 1194.960508] R10: 0000000000002664 R11: 0000000000000018 R12: 0000000000000003
[ 1194.974454] R13: 000000000000000b R14: ffffffff925715a0 R15: 0000011610120d5a
[ 1194.988607]  ? cpuidle_enter_state+0xcc/0x2a5
[ 1194.999077]  cpuidle_enter+0x17/0x19
[ 1195.008395]  call_cpuidle+0x23/0x3a
[ 1195.017718]  do_idle+0x172/0x1d5
[ 1195.026358]  cpu_startup_entry+0x73/0x75
[ 1195.035769]  start_secondary+0x1b9/0x20b
[ 1195.044894]  secondary_startup_64+0xa5/0xa5
[ 1195.084921] RIP: super_written+0x29/0xe1 RSP: ffff8ffb7fcc3c78
[ 1195.096354] CR2: 00000000000002b8

The bio in the above stack trace is a bitmap write whose completion is invoked
after the tear down sequence has set the mddev pointer in the rdev to NULL.

During tear down, there is an attempt to flush the bitmap writes, but it does
not fully wait for all of them to complete. For instance, md_bitmap_flush() is
called to flush the bitmap writes, but the last call to md_bitmap_daemon_work()
in md_bitmap_flush() can generate new bitmap writes for which there is no
explicit wait. This results in a kernel panic when the completion routine,
super_written(), is called and tries to reference mddev through the rdev, which
has already been set to NULL (in unbind_rdev_from_array() by the tear down
sequence).
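
For reference, a trimmed sketch of the two relevant pieces of code is shown
below (paraphrased, not verbatim; exact details vary across kernel versions):

/* Trimmed sketch, paraphrased - not verbatim drivers/md/md.c */
static void super_written(struct bio *bio)
{
	struct md_rdev *rdev = bio->bi_private;
	struct mddev *mddev = rdev->mddev;	/* rdev->mddev may already be NULL */

	/* ... error handling ... */

	if (atomic_dec_and_test(&mddev->pending_writes))	/* NULL dereference here */
		wake_up(&mddev->sb_wait);
	rdev_dec_pending(rdev, mddev);
	bio_put(bio);
}

static void unbind_rdev_from_array(struct md_rdev *rdev)
{
	/* ... */
	rdev->mddev = NULL;	/* tear down clears the back-pointer the callback needs */
	/* ... */
}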

The solution is to call md_bitmap_wait_writes() after the last call to
md_bitmap_daemon_work() in md_bitmap_flush() to ensure there are no pending
bitmap writes before proceeding with the tear down.

Signed-off-by: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com>
---
 drivers/md/md-bitmap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 200c5d0f08bf..e0fdc3a090c5 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -1722,6 +1722,7 @@ void md_bitmap_flush(struct mddev *mddev)
 	md_bitmap_daemon_work(mddev);
 	bitmap->daemon_lastrun -= sleep;
 	md_bitmap_daemon_work(mddev);
+	md_bitmap_wait_writes(mddev->bitmap);
 	md_bitmap_update_sb(bitmap);
 }
 
-- 
1.8.3.1



* Re: [PATCH] md/bitmap: wait for bitmap writes to complete during the tear down sequence
  2021-04-08 21:39 [PATCH] md/bitmap: wait for bitmap writes to complete during the tear down sequence Sudhakar Panneerselvam
@ 2021-04-10 15:27 ` heming.zhao
  2021-04-10 21:35   ` Sudhakar Panneerselvam
  0 siblings, 1 reply; 8+ messages in thread
From: heming.zhao @ 2021-04-10 15:27 UTC (permalink / raw)
  To: Sudhakar Panneerselvam, linux-raid, song
  Cc: lidong.zhong, xni, colyli, martin.petersen

On 4/9/21 5:39 AM, Sudhakar Panneerselvam wrote:
> NULL pointer dereference was observed in super_written() when it tries
> to access the mddev structure.
> 
> [The below stack trace is from an older kernel, but the problem described in
> this patch applies to the mainline kernel.]
> 
> ... ...
> 
> bio in the above stack is a bitmap write whose completion is invoked after the
> tear down sequence sets the mddev structure to NULL in rdev.
> 
> During tear down, there is an attempt to flush the bitmap writes, but it
> doesn't fully wait for all the bitmap writes to complete. For instance,
> md_bitmap_flush() is called to flush the bitmap writes, but the last call to
> md_bitmap_daemon_work() in md_bitmap_flush() could generate new bitmap writes
> for which there is no explicit wait to complete those writes. This results in a kernel
> panic when the completion routine, super_written() is called which tries to
> reference mddev in the rdev that has been set to
> NULL(in unbind_rdev_from_array() by tear down sequence).
> 
> The solution is to call md_bitmap_wait_writes() after the last call to
> md_bitmap_daemon_work() in md_bitmap_flush() to ensure there are no pending
> bitmap writes before proceeding with the tear down.
> 
> Signed-off-by: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com>
> ---
>   drivers/md/md-bitmap.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index 200c5d0f08bf..e0fdc3a090c5 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -1722,6 +1722,7 @@ void md_bitmap_flush(struct mddev *mddev)
>   	md_bitmap_daemon_work(mddev);
>   	bitmap->daemon_lastrun -= sleep;
>   	md_bitmap_daemon_work(mddev);
> +	md_bitmap_wait_writes(mddev->bitmap);
>   	md_bitmap_update_sb(bitmap);
>   }
>   
> 

Hello Sudhakar,

First, let's discuss this against the master branch kernel.

What command or action does "tear down" stand for?
From your description, it looks very much like the STOP_ARRAY ioctl.
Your crash is related to super_written, which is the callback for
updating the array sb, not the bitmap sb. In md_update_sb() there is a sync
point, md_super_wait(), which guarantees all sb bios have finished successfully
(see the md_super_wait() sketch after the call flow below).

As for your patch, did you check md_bitmap_free(), which already does the job
your patch adds?

The call flow:
```
do_md_stop //by STOP_ARRAY
  + __md_stop_writes()
  |  md_bitmap_flush
  |  md_update_sb
  |   + md_super_write
  |   |  bio->bi_end_io = super_written
  |   + md_super_wait(mddev) //wait for all bios done
  + __md_stop(mddev)
  |  md_bitmap_destroy(mddev);
  |   + md_bitmap_free //see below
  + ...

md_bitmap_free
{
    ...
     //do your patch job.
     /* Shouldn't be needed - but just in case.... */
     wait_event(bitmap->write_wait,
            atomic_read(&bitmap->pending_writes) == 0);
    ...
}
```
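
For reference, md_super_wait() is essentially a wait on mddev->pending_writes.
A trimmed sketch (paraphrased, not verbatim):
```
int md_super_wait(struct mddev *mddev)
{
	/* wait for all superblock writes that were started to finish */
	wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes) == 0);
	if (test_and_clear_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags))
		return -EAGAIN;
	return 0;
}
```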

Would you share more analysis or test results for your patch?

Thanks,
Heming



* RE: [PATCH] md/bitmap: wait for bitmap writes to complete during the tear down sequence
  2021-04-10 15:27 ` heming.zhao
@ 2021-04-10 21:35   ` Sudhakar Panneerselvam
  2021-04-11  9:04     ` heming.zhao
  0 siblings, 1 reply; 8+ messages in thread
From: Sudhakar Panneerselvam @ 2021-04-10 21:35 UTC (permalink / raw)
  To: heming.zhao, linux-raid, song; +Cc: lidong.zhong, xni, colyli, Martin Petersen

> Hello Sudhakar,
> 
> First, let's discuss with master branch kernel.
> 
> What command or action stands for "tear down" ?
>  From your description, it very like ioctl STOP_ARRAY.
> Your crash was related with super_written, which is the callback for
> updating array sb, not bitmap sb. in md_update_sb() there is a sync
> point md_super_wait(), which will guarantee all sb bios finished successfully.
> 
> for your patch, do you check md_bitmap_free, which already done the your patch's job.
> 
> the call flow:
> ```
> do_md_stop //by STOP_ARRAY
>   + __md_stop_writes()
>   |  md_bitmap_flush
>   |  md_update_sb
>   |   + md_super_write
>   |   |  bio->bi_end_io = super_written
>   |   + md_super_wait(mddev) //wait for all bios done
>   + __md_stop(mddev)
>   |  md_bitmap_destroy(mddev);
>   |   + md_bitmap_free //see below
>   + ...
> 
> md_bitmap_free
> {
>     ...
>      //do your patch job.
>      /* Shouldn't be needed - but just in case.... */
>      wait_event(bitmap->write_wait,
>             atomic_read(&bitmap->pending_writes) == 0);
>     ...
> }
> ```
> 
> Would you share more analysis or test results for your patch?

Hello Heming,

Please find more details about our system configuration and the detailed
sequence that led to the crash.

We have a RAID1 configured on an external imsm container. /proc/mdstat output
looks like this:

***
md24 : active raid1 sdb[1] sda[0]
      141557760 blocks super external:/md127/0 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

md127 : inactive sdb[1](S) sda[0](S)
      10402 blocks super external:imsm
***

All our system partitions are laid out on top of the above RAID1 device as
shown below.

***
/dev/md24p5 on / type xfs (rw,relatime,attr2,inode64,noquota)
/dev/md24p1 on /boot type xfs (rw,nodev,relatime,attr2,inode64,noquota)
/dev/md24p15 on /home type xfs (rw,nodev,relatime,attr2,inode64,noquota)
/dev/md24p12 on /var type xfs (rw,nodev,relatime,attr2,inode64,noquota)
/dev/md24p16 on /tmp type xfs (rw,nodev,relatime,attr2,inode64,noquota)
/dev/md24p11 on /var/log type xfs (rw,nodev,relatime,attr2,inode64,noquota)
/dev/md24p14 on /var/log/audit type xfs (rw,nodev,relatime,attr2,inode64,noquota)
***

In such a configuration, the kernel panic described in this patch occurs
during the "shutdown" of the server. Since the system partitions are laid out on
the RAID1, there can be write I/Os going to the RAID device from the system log
and filesystem activity that typically occur during shutdown.

From the core dump, I see that, after the filesystems are unmounted and all
other userspace processes are killed, the "mdmon" thread initiates the "stop"
on the md device.

***
COMMAND: "mdmon"
   TASK: ffff8ff2b1663c80  [THREAD_INFO: ffff8ff2b1663c80]
    CPU: 30
  STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 7390   TASK: ffff8ff2b1663c80  CPU: 30  COMMAND: "mdmon"
 #0 [ffffb99d4eacba30] __schedule at ffffffff91884acc
 #1 [ffffb99d4eacbad0] schedule at ffffffff918850e6
 #2 [ffffb99d4eacbae8] schedule_timeout at ffffffff91889616
 #3 [ffffb99d4eacbb78] wait_for_completion at ffffffff91885d1b
 #4 [ffffb99d4eacbbe0] __wait_rcu_gp at ffffffff9110c123
 #5 [ffffb99d4eacbc28] synchronize_sched at ffffffff9111008e
 #6 [ffffb99d4eacbc78] unbind_rdev_from_array at ffffffff9169640f
 #7 [ffffb99d4eacbcc0] do_md_stop at ffffffff9169f163
 #8 [ffffb99d4eacbd58] array_state_store at ffffffff9169f644
 #9 [ffffb99d4eacbd90] md_attr_store at ffffffff9169b71e
#10 [ffffb99d4eacbdc8] sysfs_kf_write at ffffffff91320edf
#11 [ffffb99d4eacbdd8] kernfs_fop_write at ffffffff913203c4
#12 [ffffb99d4eacbe18] __vfs_write at ffffffff9128fcfa
#13 [ffffb99d4eacbea0] vfs_write at ffffffff9128ffd2
#14 [ffffb99d4eacbee0] sys_write at ffffffff9129025c
#15 [ffffb99d4eacbf28] do_syscall_64 at ffffffff91003a39
#16 [ffffb99d4eacbf50] entry_SYSCALL_64_after_hwframe at ffffffff91a001b1
    RIP: 00007ff3bfc846fd  RSP: 00007ff3bf8a6cf0  RFLAGS: 00000293
    RAX: ffffffffffffffda  RBX: 000055e257ef0941  RCX: 00007ff3bfc846fd
    RDX: 0000000000000005  RSI: 000055e257ef0941  RDI: 0000000000000010
    RBP: 0000000000000010   R8: 0000000000000001   R9: 0000000000000000
    R10: 00000000ffffff01  R11: 0000000000000293  R12: 0000000000000001
    R13: 0000000000000000  R14: 0000000000000001  R15: 0000000000000000
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
crash>
***

While handling the "md" stop in the kernel, the code sequence based on the
above md configuration is:

do_md_stop
 + __md_stop_writes
 | + md_bitmap_flush
 | | + md_bitmap_daemon_work
 | | | + md_bitmap_wait_writes()
 | | | | (This wait is for bitmap writes that happened up until now and to avoid
 | | | |  overlapping with the new bitmap writes.)
 | | | | (wait flag is set to zero for async writes to bitmap.)
 | | | + write_page()
 | | | | | (For external bitmap, bitmap->storage.file is NULL, hence ends up
 | | | | |  calling write_sb_page())
 | | | | + write_sb_page()
 | | | | | + md_super_write()
 | | | | | | (Since the wait flag is false, the bitmap write is submitted
 | | | | | |  without waiting for it to complete.)
 | | | | | | + mddev->pending_writes is incremented, then the IO is submitted
 | | | | | |   and the call returns without waiting
 | | + md_bitmap_update_sb() - (This call simply returns because
 | | | | "bitmap->mddev->bitmap_info.external" is true for external bitmaps)
 | + md_update_sb() - (This won't be called as the call is conditional and the
 | | | | | | condition evaluates to false in my setup (see below for crash info))
 + __md_stop
 | + md_bitmap_destroy
 | | + md_bitmap_free
 | | | + wait_event(bitmap->write_wait,
 | | | |            atomic_read(&bitmap->pending_writes) == 0);
 | | | | bitmap->pending_writes is zero, but the mddev->pending_writes is 1,
 | | | | so this call returns immediately.
 | md detachment continues while there is pending mddev I/O, causing a NULL pointer
 | dereference when the I/O callback, super_written(), is called.

***
        crash> struct mddev.bitmap_info.external 0xffff8ffb62f13800
                bitmap_info.external = 1,
        crash> struct mddev.in_sync 0xffff8ffb62f13800
                in_sync = 1
        crash> struct mddev.sb_flags 0xffff8ffb62f13800
                sb_flags = 0
        crash> struct mddev.ro 0xffff8ffb62f13800
                ro = 0
        crash> struct mddev.cluster_info 0xffff8ffb62f13800
                cluster_info = 0x0
        crash>
***
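
To make the asynchronous path in the flow above concrete, here is a trimmed
sketch of md_super_write() (paraphrased, not verbatim; the exact code differs
across kernel versions):

***
void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
		    sector_t sector, int size, struct page *page)
{
	struct bio *bio;

	/* ... allocate and set up the bio ... */
	bio->bi_private = rdev;
	bio->bi_end_io = super_written;		/* completion runs later, from interrupt context */
	/* ... */
	atomic_inc(&mddev->pending_writes);	/* tracked on mddev, not on bitmap->pending_writes */
	submit_bio(bio);			/* returns to the caller without waiting */
}
***

Its caller, write_sb_page(), only follows up with md_super_wait() when the wait
flag is set, and the daemon work path passes wait == 0, so this write is still
pending when the tear down proceeds.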

Please review and let me know your thoughts.

Thanks
Sudhakar







> 
> Thanks,
> Heming



* Re: [PATCH] md/bitmap: wait for bitmap writes to complete during the tear down sequence
  2021-04-10 21:35   ` Sudhakar Panneerselvam
@ 2021-04-11  9:04     ` heming.zhao
  2021-04-12  6:06       ` Song Liu
  0 siblings, 1 reply; 8+ messages in thread
From: heming.zhao @ 2021-04-11  9:04 UTC (permalink / raw)
  To: Sudhakar Panneerselvam, linux-raid, song
  Cc: lidong.zhong, xni, colyli, Martin Petersen

On 4/11/21 5:35 AM, Sudhakar Panneerselvam wrote:
>> Hello Sudhakar,
>>
>> First, let's discuss with master branch kernel.
>>
>> What command or action stands for "tear down" ?
>>   From your description, it very like ioctl STOP_ARRAY.
>> Your crash was related with super_written, which is the callback for
>> updating array sb, not bitmap sb. in md_update_sb() there is a sync
>> point md_super_wait(), which will guarantee all sb bios finished successfully.
>>
>> for your patch, do you check md_bitmap_free, which already done the your patch's job.
>>
>> the call flow:
>> ```
>> do_md_stop //by STOP_ARRAY
>>    + __md_stop_writes()
>>    |  md_bitmap_flush
>>    |  md_update_sb
>>    |   + md_super_write
>>    |   |  bio->bi_end_io = super_written
>>    |   + md_super_wait(mddev) //wait for all bios done
>>    + __md_stop(mddev)
>>    |  md_bitmap_destroy(mddev);
>>    |   + md_bitmap_free //see below
>>    + ...
>>
>> md_bitmap_free
>> {
>>      ...
>>       //do your patch job.
>>       /* Shouldn't be needed - but just in case.... */
>>       wait_event(bitmap->write_wait,
>>              atomic_read(&bitmap->pending_writes) == 0);
>>      ...
>> }
>> ```
>>
>> Would you share more analysis or test results for your patch?
> 
> Hello Heming,
> 
> Please find more details about our system configuration and the detailed
> sequence that led to the crash.
> 
> We have a RAID1 configured on an external imsm container. /proc/mdstat output
> looks like this:
> 
> ***
> md24 : active raid1 sdb[1] sda[0]
>        141557760 blocks super external:/md127/0 [2/2] [UU]
>        bitmap: 1/2 pages [4KB], 65536KB chunk
> 
> md127 : inactive sdb[1](S) sda[0](S)
>        10402 blocks super external:imsm
> ***
> 
> All our system partition is laid out on top of the above RAID1 device as
> shown below.
> 
> ***
> /dev/md24p5 on / type xfs (rw,relatime,attr2,inode64,noquota)
> /dev/md24p1 on /boot type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> /dev/md24p15 on /home type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> /dev/md24p12 on /var type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> /dev/md24p16 on /tmp type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> /dev/md24p11 on /var/log type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> /dev/md24p14 on /var/log/audit type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> ***
> 
> In such a configuration, the kernel panic described in this patch occurs
> during the "shutdown" of the server. Since the system partition is laid out on
> the RAID1, there could be write I/Os going to the RAID device as part of system
> log and filesystem activity that typically occur during the shutdown.
> 
>  From the core dump, I see that, after filesystems are unmounted and killing all
> other userspace processes, "mdmon" thread initiates the "stop" on the md device.
> 
> ***
> COMMAND: "mdmon"
>     TASK: ffff8ff2b1663c80  [THREAD_INFO: ffff8ff2b1663c80]
>      CPU: 30
>    STATE: TASK_UNINTERRUPTIBLE
> crash> bt
> PID: 7390   TASK: ffff8ff2b1663c80  CPU: 30  COMMAND: "mdmon"
>   #0 [ffffb99d4eacba30] __schedule at ffffffff91884acc
>   #1 [ffffb99d4eacbad0] schedule at ffffffff918850e6
>   #2 [ffffb99d4eacbae8] schedule_timeout at ffffffff91889616
>   #3 [ffffb99d4eacbb78] wait_for_completion at ffffffff91885d1b
>   #4 [ffffb99d4eacbbe0] __wait_rcu_gp at ffffffff9110c123
>   #5 [ffffb99d4eacbc28] synchronize_sched at ffffffff9111008e
>   #6 [ffffb99d4eacbc78] unbind_rdev_from_array at ffffffff9169640f
>   #7 [ffffb99d4eacbcc0] do_md_stop at ffffffff9169f163
>   #8 [ffffb99d4eacbd58] array_state_store at ffffffff9169f644
>   #9 [ffffb99d4eacbd90] md_attr_store at ffffffff9169b71e
> #10 [ffffb99d4eacbdc8] sysfs_kf_write at ffffffff91320edf
> #11 [ffffb99d4eacbdd8] kernfs_fop_write at ffffffff913203c4
> #12 [ffffb99d4eacbe18] __vfs_write at ffffffff9128fcfa
> #13 [ffffb99d4eacbea0] vfs_write at ffffffff9128ffd2
> #14 [ffffb99d4eacbee0] sys_write at ffffffff9129025c
> #15 [ffffb99d4eacbf28] do_syscall_64 at ffffffff91003a39
> #16 [ffffb99d4eacbf50] entry_SYSCALL_64_after_hwframe at ffffffff91a001b1
>      RIP: 00007ff3bfc846fd  RSP: 00007ff3bf8a6cf0  RFLAGS: 00000293
>      RAX: ffffffffffffffda  RBX: 000055e257ef0941  RCX: 00007ff3bfc846fd
>      RDX: 0000000000000005  RSI: 000055e257ef0941  RDI: 0000000000000010
>      RBP: 0000000000000010   R8: 0000000000000001   R9: 0000000000000000
>      R10: 00000000ffffff01  R11: 0000000000000293  R12: 0000000000000001
>      R13: 0000000000000000  R14: 0000000000000001  R15: 0000000000000000
>      ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
> crash>
> ***
> 
> While handling the "md" stop in the kernel, the code sequence based on the
> above md configuration is
> 
> do_md_stop
>   + __md_stop_writes
>   | + md_bitmap_flush
>   | | + md_bitmap_daemon_work
>   | | | + md_bitmap_wait_writes()
>   | | | | (This wait is for bitmap writes that happened up until now and to avoid
>   | | | |  overlapping with the new bitmap writes.)
>   | | | | (wait flag is set to zero for async writes to bitmap.)
>   | | | + write_page()
>   | | | | | (For external bitmap, bitmap->storage.file is NULL, hence ends up
>   | | | | |  calling write_sb_page())
>   | | | | + write_sb_page()
>   | | | | | + md_super_write()
>   | | | | | | (Since the wait flag is false, the bitmap write is submitted
>   | | | | | |  without waiting for it to complete.)
>   | | | | | | + mddev->pending_writes is incremented, then the IO is submitted
>   | | | | | |   and the call returns without waiting
>   | | + md_bitmap_update_sb() - (This call simply returns because
>   | | | | "bitmap->mddev->bitmap_info.external" is true for external bitmaps)
>   | + md_update_sb() - (This won't be called as the call is conditional and the
>   | | | | | | condition evaluates to false in my setup(see below for crash info)
>   + __md_stop
>   | + md_bitmap_destroy
>   | | + md_bitmap_free
>   | | | + wait_event(bitmap->write_wait,
>   | | | |            atomic_read(&bitmap->pending_writes) == 0);
>   | | | | bitmap->pending_writes is zero, but the mddev->pending_writes is 1,
>   | | | | so this call returns immediately.
>   | md detachment continues while there is pending mddev I/O cauing NULL pointer
>   | derefence when the I/O callback, super_written is called.
> 
> ***
>          crash> struct mddev.bitmap_info.external 0xffff8ffb62f13800
>                  bitmap_info.external = 1,
>          crash> struct mddev.in_sync 0xffff8ffb62f13800
>                  in_sync = 1
>          crash> struct mddev.sb_flags 0xffff8ffb62f13800
>                  sb_flags = 0
>          crash> struct mddev.ro 0xffff8ffb62f13800
>                  ro = 0
>          crash> struct mddev.cluster_info 0xffff8ffb62f13800
>                  cluster_info = 0x0
>          crash>
> ***
> 
> Please review and let me know your thoughts.
> 

Hello

In my previous mail, I was wrong to assume that the array's mddev->bitmap_info.external is 0, so my analysis did not apply to your case.

Now we understand your env, which involves IMSM & an external bitmap, and I have only limited knowledge of this area. We may need to add more people to the cc list.

Just from the code logic, your analysis is reasonable.
The key question is why the md layer didn't wait for the bitmap bios. There are wait points in md_bitmap_daemon_work, write_sb_page, md_bitmap_update_sb & md_update_sb, but in your scenario none of them took effect. In the end, md_bitmap_free waits on bitmap->pending_writes but not on mddev->pending_writes, and the crash is triggered.

In my opinion, using a dedicated wait is clearer than calling the general md_bitmap_wait_writes(); md_bitmap_wait_writes() makes it look like the bitmap module is repeating its cleanup work.

My idea is something like:
```
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 200c5d0f08bf..ea6fa5a2cb6b 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -1723,6 +1723,8 @@ void md_bitmap_flush(struct mddev *mddev)
         bitmap->daemon_lastrun -= sleep;
         md_bitmap_daemon_work(mddev);
         md_bitmap_update_sb(bitmap);
+       if (mddev->bitmap_info.external)
+               md_super_wait(mddev);
  }
  
  /*
@@ -1746,6 +1748,7 @@ void md_bitmap_free(struct bitmap *bitmap)
         /* Shouldn't be needed - but just in case.... */
         wait_event(bitmap->write_wait,
                    atomic_read(&bitmap->pending_writes) == 0);
+       wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0);
  
         /* release the bitmap file  */
         md_bitmap_file_unmap(&bitmap->storage);
```

All in all, your fix makes sense to me. Let's wait for other people's feedback.

Thanks,
Heming



* Re: [PATCH] md/bitmap: wait for bitmap writes to complete during the tear down sequence
  2021-04-11  9:04     ` heming.zhao
@ 2021-04-12  6:06       ` Song Liu
  2021-04-12  9:38         ` Sudhakar Panneerselvam
  0 siblings, 1 reply; 8+ messages in thread
From: Song Liu @ 2021-04-12  6:06 UTC (permalink / raw)
  To: heming.zhao
  Cc: Sudhakar Panneerselvam, linux-raid, lidong.zhong, xni, colyli,
	Martin Petersen

On Sun, Apr 11, 2021 at 2:04 AM heming.zhao@suse.com
<heming.zhao@suse.com> wrote:
>
> On 4/11/21 5:35 AM, Sudhakar Panneerselvam wrote:
> >> Hello Sudhakar,
> >>
> >> First, let's discuss with master branch kernel.
> >>
> >> What command or action stands for "tear down" ?
> >>   From your description, it very like ioctl STOP_ARRAY.
> >> Your crash was related with super_written, which is the callback for
> >> updating array sb, not bitmap sb. in md_update_sb() there is a sync
> >> point md_super_wait(), which will guarantee all sb bios finished successfully.
> >>
> >> for your patch, do you check md_bitmap_free, which already done the your patch's job.
> >>
> >> the call flow:
> >> ```
> >> do_md_stop //by STOP_ARRAY
> >>    + __md_stop_writes()
> >>    |  md_bitmap_flush
> >>    |  md_update_sb
> >>    |   + md_super_write
> >>    |   |  bio->bi_end_io = super_written
> >>    |   + md_super_wait(mddev) //wait for all bios done
> >>    + __md_stop(mddev)
> >>    |  md_bitmap_destroy(mddev);
> >>    |   + md_bitmap_free //see below
> >>    + ...
> >>
> >> md_bitmap_free
> >> {
> >>      ...
> >>       //do your patch job.
> >>       /* Shouldn't be needed - but just in case.... */
> >>       wait_event(bitmap->write_wait,
> >>              atomic_read(&bitmap->pending_writes) == 0);
> >>      ...
> >> }
> >> ```
> >>
> >> Would you share more analysis or test results for your patch?
> >
> > Hello Heming,
> >
> > Please find more details about our system configuration and the detailed
> > sequence that led to the crash.
> >
> > We have a RAID1 configured on an external imsm container. /proc/mdstat output
> > looks like this:
> >
> > ***
> > md24 : active raid1 sdb[1] sda[0]
> >        141557760 blocks super external:/md127/0 [2/2] [UU]
> >        bitmap: 1/2 pages [4KB], 65536KB chunk
> >
> > md127 : inactive sdb[1](S) sda[0](S)
> >        10402 blocks super external:imsm
> > ***
> >
> > All our system partition is laid out on top of the above RAID1 device as
> > shown below.
> >
> > ***
> > /dev/md24p5 on / type xfs (rw,relatime,attr2,inode64,noquota)
> > /dev/md24p1 on /boot type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p15 on /home type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p12 on /var type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p16 on /tmp type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p11 on /var/log type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p14 on /var/log/audit type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > ***
> >
> > In such a configuration, the kernel panic described in this patch occurs
> > during the "shutdown" of the server. Since the system partition is laid out on
> > the RAID1, there could be write I/Os going to the RAID device as part of system
> > log and filesystem activity that typically occur during the shutdown.
> >
> >  From the core dump, I see that, after filesystems are unmounted and killing all
> > other userspace processes, "mdmon" thread initiates the "stop" on the md device.
> >
> > ***
> > COMMAND: "mdmon"
> >     TASK: ffff8ff2b1663c80  [THREAD_INFO: ffff8ff2b1663c80]
> >      CPU: 30
> >    STATE: TASK_UNINTERRUPTIBLE
> > crash> bt
> > PID: 7390   TASK: ffff8ff2b1663c80  CPU: 30  COMMAND: "mdmon"
> >   #0 [ffffb99d4eacba30] __schedule at ffffffff91884acc
> >   #1 [ffffb99d4eacbad0] schedule at ffffffff918850e6
> >   #2 [ffffb99d4eacbae8] schedule_timeout at ffffffff91889616
> >   #3 [ffffb99d4eacbb78] wait_for_completion at ffffffff91885d1b
> >   #4 [ffffb99d4eacbbe0] __wait_rcu_gp at ffffffff9110c123
> >   #5 [ffffb99d4eacbc28] synchronize_sched at ffffffff9111008e
> >   #6 [ffffb99d4eacbc78] unbind_rdev_from_array at ffffffff9169640f
> >   #7 [ffffb99d4eacbcc0] do_md_stop at ffffffff9169f163
> >   #8 [ffffb99d4eacbd58] array_state_store at ffffffff9169f644
> >   #9 [ffffb99d4eacbd90] md_attr_store at ffffffff9169b71e
> > #10 [ffffb99d4eacbdc8] sysfs_kf_write at ffffffff91320edf
> > #11 [ffffb99d4eacbdd8] kernfs_fop_write at ffffffff913203c4
> > #12 [ffffb99d4eacbe18] __vfs_write at ffffffff9128fcfa
> > #13 [ffffb99d4eacbea0] vfs_write at ffffffff9128ffd2
> > #14 [ffffb99d4eacbee0] sys_write at ffffffff9129025c
> > #15 [ffffb99d4eacbf28] do_syscall_64 at ffffffff91003a39
> > #16 [ffffb99d4eacbf50] entry_SYSCALL_64_after_hwframe at ffffffff91a001b1
> >      RIP: 00007ff3bfc846fd  RSP: 00007ff3bf8a6cf0  RFLAGS: 00000293
> >      RAX: ffffffffffffffda  RBX: 000055e257ef0941  RCX: 00007ff3bfc846fd
> >      RDX: 0000000000000005  RSI: 000055e257ef0941  RDI: 0000000000000010
> >      RBP: 0000000000000010   R8: 0000000000000001   R9: 0000000000000000
> >      R10: 00000000ffffff01  R11: 0000000000000293  R12: 0000000000000001
> >      R13: 0000000000000000  R14: 0000000000000001  R15: 0000000000000000
> >      ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
> > crash>
> > ***
> >
> > While handling the "md" stop in the kernel, the code sequence based on the
> > above md configuration is
> >
> > do_md_stop
> >   + __md_stop_writes
> >   | + md_bitmap_flush
> >   | | + md_bitmap_daemon_work
> >   | | | + md_bitmap_wait_writes()
> >   | | | | (This wait is for bitmap writes that happened up until now and to avoid
> >   | | | |  overlapping with the new bitmap writes.)
> >   | | | | (wait flag is set to zero for async writes to bitmap.)
> >   | | | + write_page()
> >   | | | | | (For external bitmap, bitmap->storage.file is NULL, hence ends up
> >   | | | | |  calling write_sb_page())
> >   | | | | + write_sb_page()
> >   | | | | | + md_super_write()
> >   | | | | | | (Since the wait flag is false, the bitmap write is submitted
> >   | | | | | |  without waiting for it to complete.)
> >   | | | | | | + mddev->pending_writes is incremented, then the IO is submitted
> >   | | | | | |   and the call returns without waiting
> >   | | + md_bitmap_update_sb() - (This call simply returns because
> >   | | | | "bitmap->mddev->bitmap_info.external" is true for external bitmaps)
> >   | + md_update_sb() - (This won't be called as the call is conditional and the
> >   | | | | | | condition evaluates to false in my setup(see below for crash info)
> >   + __md_stop
> >   | + md_bitmap_destroy
> >   | | + md_bitmap_free
> >   | | | + wait_event(bitmap->write_wait,
> >   | | | |            atomic_read(&bitmap->pending_writes) == 0);
> >   | | | | bitmap->pending_writes is zero, but the mddev->pending_writes is 1,
> >   | | | | so this call returns immediately.
> >   | md detachment continues while there is pending mddev I/O cauing NULL pointer
> >   | derefence when the I/O callback, super_written is called.
> >
> > ***
> >          crash> struct mddev.bitmap_info.external 0xffff8ffb62f13800
> >                  bitmap_info.external = 1,
> >          crash> struct mddev.in_sync 0xffff8ffb62f13800
> >                  in_sync = 1
> >          crash> struct mddev.sb_flags 0xffff8ffb62f13800
> >                  sb_flags = 0
> >          crash> struct mddev.ro 0xffff8ffb62f13800
> >                  ro = 0
> >          crash> struct mddev.cluster_info 0xffff8ffb62f13800
> >                  cluster_info = 0x0
> >          crash>
> > ***
> >
> > Please review and let me know your thoughts.
> >
>
> Hello
>
> On previous mail, I was wrong to assume the array mddev->bitmap_info.external is 0. So my analysis was not suite for your case.
>
> Now we understand your env, which is related with IMSM & external bitmap, and I have a little knowledge about this field. May need to add more people in the cc list.
>
> Just from code logic, your analysis is reasonable.
> The key of the crash is why md layer didn't do the waiting job for bitmap bios. There have wait points in md_bitmap_daemon_work, write_sb_page, md_bitmap_update_sb & md_update_sb. But in your scenario there all didn't take effect. At last md_bitmap_free do the additional wait bitmap->pending_writes but not mddev->pending_writes. the crash is triggered.
>
> In my opinion, using a special wait is more clear than calling general md_bitmap_wait_writes(). the md_bitmap_wait_writes makes people feel bitmap module does repetitive clean job.
>
> My idea like:
> ```
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index 200c5d0f08bf..ea6fa5a2cb6b 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -1723,6 +1723,8 @@ void md_bitmap_flush(struct mddev *mddev)
>          bitmap->daemon_lastrun -= sleep;
>          md_bitmap_daemon_work(mddev);
>          md_bitmap_update_sb(bitmap);
> +       if (mddev->bitmap_info.external)
> +               md_super_wait(mddev);
>   }
>
>   /*
> @@ -1746,6 +1748,7 @@ void md_bitmap_free(struct bitmap *bitmap)
>          /* Shouldn't be needed - but just in case.... */
>          wait_event(bitmap->write_wait,
>                     atomic_read(&bitmap->pending_writes) == 0);
> +       wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0);
>
>          /* release the bitmap file  */
>          md_bitmap_file_unmap(&bitmap->storage);
> ```

I like Heming's version better.

Hi Sudhakar,

Are you able to reproduce the issue? If so, could you please verify
that Heming's
change fixes the issue?

Thanks,
Song


* RE: [PATCH] md/bitmap: wait for bitmap writes to complete during the tear down sequence
  2021-04-12  6:06       ` Song Liu
@ 2021-04-12  9:38         ` Sudhakar Panneerselvam
  2021-04-12  9:59           ` heming.zhao
  0 siblings, 1 reply; 8+ messages in thread
From: Sudhakar Panneerselvam @ 2021-04-12  9:38 UTC (permalink / raw)
  To: Song Liu, heming.zhao
  Cc: linux-raid, lidong.zhong, xni, colyli, Martin Petersen

> > In my opinion, using a special wait is more clear than calling general md_bitmap_wait_writes(). the md_bitmap_wait_writes makes people
> feel bitmap module does repetitive clean job.
> >
> > My idea like:
> > ```
> > diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> > index 200c5d0f08bf..ea6fa5a2cb6b 100644
> > --- a/drivers/md/md-bitmap.c
> > +++ b/drivers/md/md-bitmap.c
> > @@ -1723,6 +1723,8 @@ void md_bitmap_flush(struct mddev *mddev)
> >          bitmap->daemon_lastrun -= sleep;
> >          md_bitmap_daemon_work(mddev);
> >          md_bitmap_update_sb(bitmap);
> > +       if (mddev->bitmap_info.external)
> > +               md_super_wait(mddev);

Agreed. This looks much cleaner.

> >   }
> >
> >   /*
> > @@ -1746,6 +1748,7 @@ void md_bitmap_free(struct bitmap *bitmap)
> >          /* Shouldn't be needed - but just in case.... */
> >          wait_event(bitmap->write_wait,
> >                     atomic_read(&bitmap->pending_writes) == 0);
> > +       wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0);

I think this call looks redundant as this wait is already covered by md_update_sb() for non-external bitmaps and the proposed change in md_bitmap_flush() for external bitmaps. So, can we omit this change? 

> >
> >          /* release the bitmap file  */
> >          md_bitmap_file_unmap(&bitmap->storage);
> > ```
> 
> I like Heming's version better.
> 
> Hi Sudhakar,
> 
> Are you able to reproduce the issue? If so, could you please verify
> that Heming's
> change fixes the issue?

Hi Song, Heming,

Please review my comments above. I can try to reproduce the issue with the fix once we agree on the final fix.

Thanks
Sudhakar

> 
> Thanks,
> Song


* Re: [PATCH] md/bitmap: wait for bitmap writes to complete during the tear down sequence
  2021-04-12  9:38         ` Sudhakar Panneerselvam
@ 2021-04-12  9:59           ` heming.zhao
  2021-04-13  0:17             ` Sudhakar Panneerselvam
  0 siblings, 1 reply; 8+ messages in thread
From: heming.zhao @ 2021-04-12  9:59 UTC (permalink / raw)
  To: Sudhakar Panneerselvam, Song Liu
  Cc: linux-raid, lidong.zhong, xni, colyli, Martin Petersen

On 4/12/21 5:38 PM, Sudhakar Panneerselvam wrote:
>>> In my opinion, using a special wait is more clear than calling general md_bitmap_wait_writes(). the md_bitmap_wait_writes makes people
>> feel bitmap module does repetitive clean job.
>>>
>>> My idea like:
>>> ```
>>> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
>>> index 200c5d0f08bf..ea6fa5a2cb6b 100644
>>> --- a/drivers/md/md-bitmap.c
>>> +++ b/drivers/md/md-bitmap.c
>>> @@ -1723,6 +1723,8 @@ void md_bitmap_flush(struct mddev *mddev)
>>>           bitmap->daemon_lastrun -= sleep;
>>>           md_bitmap_daemon_work(mddev);
>>>           md_bitmap_update_sb(bitmap);
>>> +       if (mddev->bitmap_info.external)
>>> +               md_super_wait(mddev);
> 
> Agreed. This looks much cleaner.
> 
>>>    }
>>>
>>>    /*
>>> @@ -1746,6 +1748,7 @@ void md_bitmap_free(struct bitmap *bitmap)
>>>           /* Shouldn't be needed - but just in case.... */
>>>           wait_event(bitmap->write_wait,
>>>                      atomic_read(&bitmap->pending_writes) == 0);
>>> +       wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0);
> 
> I think this call looks redundant as this wait is already covered by md_update_sb() for non-external bitmaps and the proposed change in md_bitmap_flush() for external bitmaps. So, can we omit this change?

Yes, it's an absolutely redundant step; whether to add or remove this line is up to you.
I added it to follow the style of the existing bitmap->pending_writes wait. The comment in this area also gives an explanation.




* RE: [PATCH] md/bitmap: wait for bitmap writes to complete during the tear down sequence
  2021-04-12  9:59           ` heming.zhao
@ 2021-04-13  0:17             ` Sudhakar Panneerselvam
  0 siblings, 0 replies; 8+ messages in thread
From: Sudhakar Panneerselvam @ 2021-04-13  0:17 UTC (permalink / raw)
  To: heming.zhao, Song Liu
  Cc: linux-raid, lidong.zhong, xni, colyli, Martin Petersen



> -----Original Message-----
> From: heming.zhao@suse.com [mailto:heming.zhao@suse.com]
> Sent: Monday, April 12, 2021 3:59 AM
> To: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com>; Song Liu <song@kernel.org>
> Cc: linux-raid@vger.kernel.org; lidong.zhong@suse.com; xni@redhat.com; colyli@suse.com; Martin Petersen
> <martin.petersen@oracle.com>
> Subject: Re: [PATCH] md/bitmap: wait for bitmap writes to complete during the tear down sequence
> 
> On 4/12/21 5:38 PM, Sudhakar Panneerselvam wrote:
> >>> In my opinion, using a special wait is more clear than calling general md_bitmap_wait_writes(). the md_bitmap_wait_writes makes
> people
> >> feel bitmap module does repetitive clean job.
> >>>
> >>> My idea like:
> >>> ```
> >>> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> >>> index 200c5d0f08bf..ea6fa5a2cb6b 100644
> >>> --- a/drivers/md/md-bitmap.c
> >>> +++ b/drivers/md/md-bitmap.c
> >>> @@ -1723,6 +1723,8 @@ void md_bitmap_flush(struct mddev *mddev)
> >>>           bitmap->daemon_lastrun -= sleep;
> >>>           md_bitmap_daemon_work(mddev);
> >>>           md_bitmap_update_sb(bitmap);
> >>> +       if (mddev->bitmap_info.external)
> >>> +               md_super_wait(mddev);
> >
> > Agreed. This looks much cleaner.
> >
> >>>    }
> >>>
> >>>    /*
> >>> @@ -1746,6 +1748,7 @@ void md_bitmap_free(struct bitmap *bitmap)
> >>>           /* Shouldn't be needed - but just in case.... */
> >>>           wait_event(bitmap->write_wait,
> >>>                      atomic_read(&bitmap->pending_writes) == 0);
> >>> +       wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0);
> >
> > I think this call looks redundant as this wait is already covered by md_update_sb() for non-external bitmaps and the proposed change in
> md_bitmap_flush() for external bitmaps. So, can we omit this change?
> 
> Yes, it's absolute redundant step, to add or remove this line is up to you.
> I added this line for following the style of bitmap->pending_writes. The comment in this area also give a explanation.
> 

Hello Heming, Song,

I could not reproduce the issue with the revised fix. I will be submitting the revised patch shortly.

Thanks
Sudhakar

