From: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com>
To: "heming.zhao@suse.com" <heming.zhao@suse.com>,
	"linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>,
	"song@kernel.org" <song@kernel.org>
Cc: "lidong.zhong@suse.com" <lidong.zhong@suse.com>,
	"xni@redhat.com" <xni@redhat.com>,
	"colyli@suse.com" <colyli@suse.com>,
	Martin Petersen <martin.petersen@oracle.com>
Subject: RE: [PATCH] md/bitmap: wait for bitmap writes to complete during the tear down sequence
Date: Sat, 10 Apr 2021 21:35:51 +0000	[thread overview]
Message-ID: <CY4PR10MB20077F9F84680DC1C0CAE281FD729@CY4PR10MB2007.namprd10.prod.outlook.com> (raw)
In-Reply-To: <ba0f4827-83ae-b7e2-2230-5f4afca2538a@suse.com>

> Hello Sudhakar,
> 
> First, let's discuss with master branch kernel.
> 
> What command or action does "tear down" stand for?
> From your description, it sounds very much like the ioctl STOP_ARRAY.
> Your crash was related to super_written, which is the callback for
> updating the array sb, not the bitmap sb. In md_update_sb() there is a
> sync point, md_super_wait(), which guarantees that all sb bios have
> finished successfully.
> 
> For your patch, did you check md_bitmap_free(), which already does your
> patch's job?
> 
> The call flow:
> ```
> do_md_stop //by STOP_ARRAY
>   + __md_stop_writes()
>   |  md_bitmap_flush
>   |  md_update_sb
>   |   + md_super_write
>   |   |  bio->bi_end_io = super_written
>   |   + md_super_wait(mddev) //wait for all bios done
>   + __md_stop(mddev)
>   |  md_bitmap_destroy(mddev);
>   |   + md_bitmap_free //see below
>   + ...
> 
> md_bitmap_free
> {
>     ...
>      //do your patch job.
>      /* Shouldn't be needed - but just in case.... */
>      wait_event(bitmap->write_wait,
>             atomic_read(&bitmap->pending_writes) == 0);
>     ...
> }
> ```
> 
> Would you share more analysis or test results for your patch?

Hello Heming,

Please find below more details about our system configuration and the exact
sequence that led to the crash.

We have a RAID1 array configured in an external imsm container. The
/proc/mdstat output looks like this:

***
md24 : active raid1 sdb[1] sda[0]
      141557760 blocks super external:/md127/0 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

md127 : inactive sdb[1](S) sda[0](S)
      10402 blocks super external:imsm
***

All our system partitions are laid out on top of the above RAID1 device, as
shown below.

***
/dev/md24p5 on / type xfs (rw,relatime,attr2,inode64,noquota)
/dev/md24p1 on /boot type xfs (rw,nodev,relatime,attr2,inode64,noquota)
/dev/md24p15 on /home type xfs (rw,nodev,relatime,attr2,inode64,noquota)
/dev/md24p12 on /var type xfs (rw,nodev,relatime,attr2,inode64,noquota)
/dev/md24p16 on /tmp type xfs (rw,nodev,relatime,attr2,inode64,noquota)
/dev/md24p11 on /var/log type xfs (rw,nodev,relatime,attr2,inode64,noquota)
/dev/md24p14 on /var/log/audit type xfs (rw,nodev,relatime,attr2,inode64,noquota)
***

In this configuration, the kernel panic described in this patch occurs
during the shutdown of the server. Since the system partitions are laid out
on the RAID1 device, write I/O can still be going to it from the system log
and filesystem activity that typically occurs during shutdown.

From the core dump, I see that after the filesystems are unmounted and all
other userspace processes are killed, the "mdmon" thread initiates the "stop"
on the md device.

***
COMMAND: "mdmon"
   TASK: ffff8ff2b1663c80  [THREAD_INFO: ffff8ff2b1663c80]
    CPU: 30
  STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 7390   TASK: ffff8ff2b1663c80  CPU: 30  COMMAND: "mdmon"
 #0 [ffffb99d4eacba30] __schedule at ffffffff91884acc
 #1 [ffffb99d4eacbad0] schedule at ffffffff918850e6
 #2 [ffffb99d4eacbae8] schedule_timeout at ffffffff91889616
 #3 [ffffb99d4eacbb78] wait_for_completion at ffffffff91885d1b
 #4 [ffffb99d4eacbbe0] __wait_rcu_gp at ffffffff9110c123
 #5 [ffffb99d4eacbc28] synchronize_sched at ffffffff9111008e
 #6 [ffffb99d4eacbc78] unbind_rdev_from_array at ffffffff9169640f
 #7 [ffffb99d4eacbcc0] do_md_stop at ffffffff9169f163
 #8 [ffffb99d4eacbd58] array_state_store at ffffffff9169f644
 #9 [ffffb99d4eacbd90] md_attr_store at ffffffff9169b71e
#10 [ffffb99d4eacbdc8] sysfs_kf_write at ffffffff91320edf
#11 [ffffb99d4eacbdd8] kernfs_fop_write at ffffffff913203c4
#12 [ffffb99d4eacbe18] __vfs_write at ffffffff9128fcfa
#13 [ffffb99d4eacbea0] vfs_write at ffffffff9128ffd2
#14 [ffffb99d4eacbee0] sys_write at ffffffff9129025c
#15 [ffffb99d4eacbf28] do_syscall_64 at ffffffff91003a39
#16 [ffffb99d4eacbf50] entry_SYSCALL_64_after_hwframe at ffffffff91a001b1
    RIP: 00007ff3bfc846fd  RSP: 00007ff3bf8a6cf0  RFLAGS: 00000293
    RAX: ffffffffffffffda  RBX: 000055e257ef0941  RCX: 00007ff3bfc846fd
    RDX: 0000000000000005  RSI: 000055e257ef0941  RDI: 0000000000000010
    RBP: 0000000000000010   R8: 0000000000000001   R9: 0000000000000000
    R10: 00000000ffffff01  R11: 0000000000000293  R12: 0000000000000001
    R13: 0000000000000000  R14: 0000000000000001  R15: 0000000000000000
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
crash>
***

While handling the "md" stop in the kernel, the code sequence for the above
md configuration is:

do_md_stop
 + __md_stop_writes
 | + md_bitmap_flush
 | | + md_bitmap_daemon_work
 | | | + md_bitmap_wait_writes()
 | | | | (This wait covers the bitmap writes issued up to this point, so they
 | | | |  do not overlap with the new bitmap writes below.)
 | | | | (The wait flag is set to zero, so the new bitmap writes are async.)
 | | | + write_page()
 | | | | | (For an external bitmap, bitmap->storage.file is NULL, so this
 | | | | |  ends up calling write_sb_page().)
 | | | | + write_sb_page()
 | | | | | + md_super_write()
 | | | | | | (Since the wait flag is false, the bitmap write is submitted
 | | | | | |  without waiting for it to complete.)
 | | | | | | + mddev->pending_writes is incremented, then the IO is submitted
 | | | | | |   and the call returns without waiting
 | | + md_bitmap_update_sb() - (This call simply returns because
 | | |  "bitmap->mddev->bitmap_info.external" is true for external bitmaps.)
 | + md_update_sb() - (This is never called: the call is conditional, and the
 | |  condition evaluates to false in my setup; see the crash output below.)
 + __md_stop
 | + md_bitmap_destroy
 | | + md_bitmap_free
 | | | + wait_event(bitmap->write_wait,
 | | | |            atomic_read(&bitmap->pending_writes) == 0);
 | | | | bitmap->pending_writes is zero, but mddev->pending_writes is 1,
 | | | | so this wait returns immediately.
 | The md detachment then continues while there is pending mddev I/O, causing
 | a NULL pointer dereference when the I/O completion callback,
 | super_written(), is called (see the sketch below).
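
To make the race window concrete, here is a simplified sketch of the
submission and completion paths, paraphrased from drivers/md/md.c (the exact
bodies differ between kernel versions, so treat this as an illustration
rather than the literal code):

```
/* Paraphrased from drivers/md/md.c -- simplified, not the literal code. */

void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
		    sector_t sector, int size, struct page *page)
{
	struct bio *bio = bio_alloc_mddev(GFP_NOIO, 1, mddev);

	/* ... point the bio at the rdev, sector and page ... */
	bio->bi_private = rdev;
	bio->bi_end_io = super_written;

	/* accounted against the mddev, not the bitmap */
	atomic_inc(&mddev->pending_writes);
	submit_bio(bio);	/* returns immediately; completion is async */
}

static void super_written(struct bio *bio)
{
	struct md_rdev *rdev = bio->bi_private;
	struct mddev *mddev = rdev->mddev;	/* NULL once the rdev has been
						 * unbound from the array
						 * (unbind_rdev_from_array) */

	/* ... error handling elided ... */
	if (atomic_dec_and_test(&mddev->pending_writes))	/* crash site */
		wake_up(&mddev->sb_wait);
	bio_put(bio);
}
```

The only place that drains mddev->pending_writes is md_super_wait(), which is
essentially wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes) == 0),
and as shown in the sequence above it is never reached on this path.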

***
        crash> struct mddev.bitmap_info.external 0xffff8ffb62f13800
                bitmap_info.external = 1,
        crash> struct mddev.in_sync 0xffff8ffb62f13800
                in_sync = 1
        crash> struct mddev.sb_flags 0xffff8ffb62f13800
                sb_flags = 0
        crash> struct mddev.ro 0xffff8ffb62f13800
                ro = 0
        crash> struct mddev.cluster_info 0xffff8ffb62f13800
                cluster_info = 0x0
        crash>
***
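
These values matter because of the guard in __md_stop_writes() that decides
whether md_update_sb() (and with it md_super_wait()) runs at all. Roughly,
paraphrased from drivers/md/md.c (the exact form varies by kernel version):

```
/* Sketch of the guard in __md_stop_writes() -- paraphrased, not literal. */
if (mddev->ro == 0 &&
    ((!mddev->in_sync && !mddev_is_clustered(mddev)) ||
     mddev->sb_flags)) {
	/* mark the array as shut down cleanly */
	mddev->in_sync = 1;
	md_update_sb(mddev, 1);	/* ends with md_super_wait(mddev) */
}
```

With ro == 0, in_sync == 1, sb_flags == 0 and cluster_info == NULL, the whole
condition evaluates to false, so nothing on this path waits on
mddev->pending_writes before __md_stop() proceeds.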

To fix this, the tear down sequence must also wait for the writes accounted
in mddev->pending_writes, not just bitmap->pending_writes.
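
A minimal sketch of the idea, assuming the wait is added in md_bitmap_free()
(illustrative only, not the literal diff):

```
/* Illustrative sketch of the idea -- not the literal patch. */
static void md_bitmap_free(struct bitmap *bitmap)
{
	/* ... */

	/* existing wait: only covers writes counted in
	 * bitmap->pending_writes */
	wait_event(bitmap->write_wait,
		   atomic_read(&bitmap->pending_writes) == 0);

	/* additionally drain the writes counted in mddev->pending_writes
	 * (the external-bitmap pages submitted via md_super_write()), so
	 * that super_written() cannot run after the rdevs are unbound */
	md_super_wait(bitmap->mddev);

	/* ... */
}
```

Please review and let me know your thoughts.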

Thanks
Sudhakar

> 
> Thanks,
> Heming


