From: "heming.zhao@suse.com" <heming.zhao@suse.com>
To: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com>,
	"linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>,
	"song@kernel.org" <song@kernel.org>
Cc: "lidong.zhong@suse.com" <lidong.zhong@suse.com>,
	"xni@redhat.com" <xni@redhat.com>,
	"colyli@suse.com" <colyli@suse.com>,
	Martin Petersen <martin.petersen@oracle.com>
Subject: Re: [PATCH] md/bitmap: wait for bitmap writes to complete during the tear down sequence
Date: Sun, 11 Apr 2021 17:04:09 +0800
Message-ID: <2becafd0-7df7-7a79-8478-b8246f353c9b@suse.com>
In-Reply-To: <CY4PR10MB20077F9F84680DC1C0CAE281FD729@CY4PR10MB2007.namprd10.prod.outlook.com>

On 4/11/21 5:35 AM, Sudhakar Panneerselvam wrote:
>> Hello Sudhakar,
>>
>> First, let's discuss with master branch kernel.
>>
>> What command or action does "tear down" stand for?
>>  From your description, it looks very much like the ioctl STOP_ARRAY.
>> Your crash was related to super_written, which is the callback for
>> updating the array sb, not the bitmap sb. In md_update_sb() there is a sync
>> point, md_super_wait(), which guarantees that all sb bios have finished successfully.
>>
>> As for your patch, did you check md_bitmap_free()? It already does what your patch does.
>>
>> The call flow:
>> ```
>> do_md_stop //by STOP_ARRAY
>>    + __md_stop_writes()
>>    |  md_bitmap_flush
>>    |  md_update_sb
>>    |   + md_super_write
>>    |   |  bio->bi_end_io = super_written
>>    |   + md_super_wait(mddev) //wait for all bios done
>>    + __md_stop(mddev)
>>    |  md_bitmap_destroy(mddev);
>>    |   + md_bitmap_free //see below
>>    + ...
>>
>> md_bitmap_free
>> {
>>      ...
>>       //do your patch job.
>>       /* Shouldn't be needed - but just in case.... */
>>       wait_event(bitmap->write_wait,
>>              atomic_read(&bitmap->pending_writes) == 0);
>>      ...
>> }
>> ```
>>
>> Would you share more analysis or test results for your patch?
> 
> Hello Heming,
> 
> Please find more details about our system configuration and the detailed
> sequence that led to the crash.
> 
> We have a RAID1 configured on an external imsm container. /proc/mdstat output
> looks like this:
> 
> ***
> md24 : active raid1 sdb[1] sda[0]
>        141557760 blocks super external:/md127/0 [2/2] [UU]
>        bitmap: 1/2 pages [4KB], 65536KB chunk
> 
> md127 : inactive sdb[1](S) sda[0](S)
>        10402 blocks super external:imsm
> ***
> 
> All of our system partitions are laid out on top of the above RAID1 device, as
> shown below.
> 
> ***
> /dev/md24p5 on / type xfs (rw,relatime,attr2,inode64,noquota)
> /dev/md24p1 on /boot type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> /dev/md24p15 on /home type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> /dev/md24p12 on /var type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> /dev/md24p16 on /tmp type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> /dev/md24p11 on /var/log type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> /dev/md24p14 on /var/log/audit type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> ***
> 
> In such a configuration, the kernel panic described in this patch occurs
> during the "shutdown" of the server. Since the system partition is laid out on
> the RAID1, there could be write I/Os going to the RAID device as part of system
> log and filesystem activity that typically occur during the shutdown.
> 
>  From the core dump, I see that, after the filesystems are unmounted and all
> other userspace processes are killed, the "mdmon" thread initiates the "stop" on the md device.
> 
> ***
> COMMAND: "mdmon"
>     TASK: ffff8ff2b1663c80  [THREAD_INFO: ffff8ff2b1663c80]
>      CPU: 30
>    STATE: TASK_UNINTERRUPTIBLE
> crash> bt
> PID: 7390   TASK: ffff8ff2b1663c80  CPU: 30  COMMAND: "mdmon"
>   #0 [ffffb99d4eacba30] __schedule at ffffffff91884acc
>   #1 [ffffb99d4eacbad0] schedule at ffffffff918850e6
>   #2 [ffffb99d4eacbae8] schedule_timeout at ffffffff91889616
>   #3 [ffffb99d4eacbb78] wait_for_completion at ffffffff91885d1b
>   #4 [ffffb99d4eacbbe0] __wait_rcu_gp at ffffffff9110c123
>   #5 [ffffb99d4eacbc28] synchronize_sched at ffffffff9111008e
>   #6 [ffffb99d4eacbc78] unbind_rdev_from_array at ffffffff9169640f
>   #7 [ffffb99d4eacbcc0] do_md_stop at ffffffff9169f163
>   #8 [ffffb99d4eacbd58] array_state_store at ffffffff9169f644
>   #9 [ffffb99d4eacbd90] md_attr_store at ffffffff9169b71e
> #10 [ffffb99d4eacbdc8] sysfs_kf_write at ffffffff91320edf
> #11 [ffffb99d4eacbdd8] kernfs_fop_write at ffffffff913203c4
> #12 [ffffb99d4eacbe18] __vfs_write at ffffffff9128fcfa
> #13 [ffffb99d4eacbea0] vfs_write at ffffffff9128ffd2
> #14 [ffffb99d4eacbee0] sys_write at ffffffff9129025c
> #15 [ffffb99d4eacbf28] do_syscall_64 at ffffffff91003a39
> #16 [ffffb99d4eacbf50] entry_SYSCALL_64_after_hwframe at ffffffff91a001b1
>      RIP: 00007ff3bfc846fd  RSP: 00007ff3bf8a6cf0  RFLAGS: 00000293
>      RAX: ffffffffffffffda  RBX: 000055e257ef0941  RCX: 00007ff3bfc846fd
>      RDX: 0000000000000005  RSI: 000055e257ef0941  RDI: 0000000000000010
>      RBP: 0000000000000010   R8: 0000000000000001   R9: 0000000000000000
>      R10: 00000000ffffff01  R11: 0000000000000293  R12: 0000000000000001
>      R13: 0000000000000000  R14: 0000000000000001  R15: 0000000000000000
>      ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
> crash>
> ***
> 
> While handling the "md" stop in the kernel, the code sequence, based on the
> above md configuration, is:
> 
> do_md_stop
>   + __md_stop_writes
>   | + md_bitmap_flush
>   | | + md_bitmap_daemon_work
>   | | | + md_bitmap_wait_writes()
>   | | | | (This wait is for bitmap writes that happened up until now and to avoid
>   | | | |  overlapping with the new bitmap writes.)
>   | | | | (wait flag is set to zero for async writes to bitmap.)
>   | | | + write_page()
>   | | | | | (For external bitmap, bitmap->storage.file is NULL, hence ends up
>   | | | | |  calling write_sb_page())
>   | | | | + write_sb_page()
>   | | | | | + md_super_write()
>   | | | | | | (Since the wait flag is false, the bitmap write is submitted
>   | | | | | |  without waiting for it to complete.)
>   | | | | | | + mddev->pending_writes is incremented, then the IO is submitted
>   | | | | | |   and the call returns without waiting
>   | | + md_bitmap_update_sb() - (This call simply returns because
>   | | | | "bitmap->mddev->bitmap_info.external" is true for external bitmaps)
>   | + md_update_sb() - (This won't be called, as the call is conditional and the
>   | | | | | | condition evaluates to false in my setup; see below for crash info.)
>   + __md_stop
>   | + md_bitmap_destroy
>   | | + md_bitmap_free
>   | | | + wait_event(bitmap->write_wait,
>   | | | |            atomic_read(&bitmap->pending_writes) == 0);
>   | | | | bitmap->pending_writes is zero, but the mddev->pending_writes is 1,
>   | | | | so this call returns immediately.
>   | md detachment continues while there is pending mddev I/O, causing a NULL pointer
>   | dereference when the I/O callback, super_written, is called.
> 
> ***
>          crash> struct mddev.bitmap_info.external 0xffff8ffb62f13800
>                  bitmap_info.external = 1,
>          crash> struct mddev.in_sync 0xffff8ffb62f13800
>                  in_sync = 1
>          crash> struct mddev.sb_flags 0xffff8ffb62f13800
>                  sb_flags = 0
>          crash> struct mddev.ro 0xffff8ffb62f13800
>                  ro = 0
>          crash> struct mddev.cluster_info 0xffff8ffb62f13800
>                  cluster_info = 0x0
>          crash>
> ***
> 
> Please review and let me know your thoughts.
> 

Hello

In my previous mail, I was wrong to assume that mddev->bitmap_info.external is 0 for your array, so my analysis did not fit your case.

Now we understand your environment, which involves IMSM and an external bitmap; I have only a little knowledge of this area, so we may need to add more people to the cc list.

Just from the code logic, your analysis is reasonable.
The key question behind the crash is why the md layer didn't wait for the bitmap bios. There are wait points in md_bitmap_daemon_work, write_sb_page, md_bitmap_update_sb and md_update_sb, but in your scenario none of them took effect. Finally, md_bitmap_free does an additional wait, but on bitmap->pending_writes rather than mddev->pending_writes, so the crash is triggered.

In my opinion, a dedicated wait is clearer than calling the general md_bitmap_wait_writes(); reusing md_bitmap_wait_writes() makes it look as if the bitmap module is doing a redundant cleanup job.

My idea looks like this:
```
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 200c5d0f08bf..ea6fa5a2cb6b 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -1723,6 +1723,8 @@ void md_bitmap_flush(struct mddev *mddev)
         bitmap->daemon_lastrun -= sleep;
         md_bitmap_daemon_work(mddev);
         md_bitmap_update_sb(bitmap);
+       if (mddev->bitmap_info.external)
+               md_super_wait(mddev);
  }
  
  /*
@@ -1746,6 +1748,7 @@ void md_bitmap_free(struct bitmap *bitmap)
         /* Shouldn't be needed - but just in case.... */
         wait_event(bitmap->write_wait,
                    atomic_read(&bitmap->pending_writes) == 0);
+       wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0);
  
         /* release the bitmap file  */
         md_bitmap_file_unmap(&bitmap->storage);
```

All in all, your fix makes sense to me. Let's wait for feedback from others.

Thanks,
Heming


Thread overview: 8+ messages
2021-04-08 21:39 [PATCH] md/bitmap: wait for bitmap writes to complete during the tear down sequence Sudhakar Panneerselvam
2021-04-10 15:27 ` heming.zhao
2021-04-10 21:35   ` Sudhakar Panneerselvam
2021-04-11  9:04     ` heming.zhao [this message]
2021-04-12  6:06       ` Song Liu
2021-04-12  9:38         ` Sudhakar Panneerselvam
2021-04-12  9:59           ` heming.zhao
2021-04-13  0:17             ` Sudhakar Panneerselvam
