From: Song Liu
Date: Sun, 11 Apr 2021 23:06:03 -0700
Subject: Re: [PATCH] md/bitmap: wait for bitmap writes to complete during the tear down sequence
To: "heming.zhao@suse.com"
Cc: Sudhakar Panneerselvam, "linux-raid@vger.kernel.org", "lidong.zhong@suse.com",
 "xni@redhat.com", "colyli@suse.com", Martin Petersen
In-Reply-To: <2becafd0-7df7-7a79-8478-b8246f353c9b@suse.com>
References: <20210408213917.GA3986@oracle.com> <2becafd0-7df7-7a79-8478-b8246f353c9b@suse.com>
List-ID: linux-raid@vger.kernel.org

On Sun, Apr 11, 2021 at 2:04 AM heming.zhao@suse.com wrote:
>
> On 4/11/21 5:35 AM, Sudhakar Panneerselvam wrote:
> >> Hello Sudhakar,
> >>
> >> First, let's discuss the master branch kernel.
> >>
> >> What command or action stands for "tear down"?
> >> From your description, it looks very much like the STOP_ARRAY ioctl.
> >> Your crash is related to super_written, which is the callback for
> >> updating the array sb, not the bitmap sb. In md_update_sb() there is a
> >> sync point, md_super_wait(), which guarantees that all sb bios have
> >> finished successfully.
> >>
> >> For your patch, did you check md_bitmap_free, which already does what
> >> your patch does?
> >>
> >> The call flow:
> >> ```
> >> do_md_stop                  // by STOP_ARRAY
> >>  + __md_stop_writes()
> >>  | md_bitmap_flush
> >>  | md_update_sb
> >>  | + md_super_write
> >>  | | bio->bi_end_io = super_written
> >>  | + md_super_wait(mddev)   // wait for all bios done
> >>  + __md_stop(mddev)
> >>  | md_bitmap_destroy(mddev);
> >>  | + md_bitmap_free         // see below
> >>  + ...
> >>
> >> md_bitmap_free
> >> {
> >>     ...
> >>     // does your patch's job.
> >>     /* Shouldn't be needed - but just in case.... */
> >>     wait_event(bitmap->write_wait,
> >>                atomic_read(&bitmap->pending_writes) == 0);
> >>     ...
> >> }
> >> ```
> >>
> >> Would you share more analysis or test results for your patch?
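(Some context for anyone skimming the thread: the "sync point" Heming mentions
is the mddev->pending_writes accounting in drivers/md/md.c. Roughly, and heavily
abbreviated from my reading of the mainline code rather than quoted verbatim:)

```
/* Sketch only: error handling, failfast bits and bio setup are elided. */

/* completion callback for every bio submitted through md_super_write() */
static void super_written(struct bio *bio)
{
	struct md_rdev *rdev = bio->bi_private;
	struct mddev *mddev = rdev->mddev;

	/* ... error handling elided ... */
	if (atomic_dec_and_test(&mddev->pending_writes))
		wake_up(&mddev->sb_wait);
	bio_put(bio);
}

/* the sync point: sleep until every such bio has completed */
int md_super_wait(struct mddev *mddev)
{
	wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes) == 0);
	return 0;	/* MD_SB_NEED_REWRITE handling elided */
}
```

Note that the counter here is mddev->pending_writes; the wait_event() in
md_bitmap_free() quoted above checks bitmap->pending_writes, which is a
different counter.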
> >
> > Hello Heming,
> >
> > Please find more details about our system configuration and the detailed
> > sequence that led to the crash.
> >
> > We have a RAID1 configured on an external imsm container. /proc/mdstat output
> > looks like this:
> >
> > ***
> > md24 : active raid1 sdb[1] sda[0]
> >       141557760 blocks super external:/md127/0 [2/2] [UU]
> >       bitmap: 1/2 pages [4KB], 65536KB chunk
> >
> > md127 : inactive sdb[1](S) sda[0](S)
> >       10402 blocks super external:imsm
> > ***
> >
> > All our system partitions are laid out on top of the above RAID1 device as
> > shown below.
> >
> > ***
> > /dev/md24p5 on / type xfs (rw,relatime,attr2,inode64,noquota)
> > /dev/md24p1 on /boot type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p15 on /home type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p12 on /var type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p16 on /tmp type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p11 on /var/log type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p14 on /var/log/audit type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > ***
> >
> > In such a configuration, the kernel panic described in this patch occurs
> > during the "shutdown" of the server. Since the system partitions are laid
> > out on the RAID1, there can be write I/Os going to the RAID device from the
> > system-log and filesystem activity that typically occurs during shutdown.
> >
> > From the core dump, I see that, after the filesystems are unmounted and all
> > other userspace processes are killed, the "mdmon" thread initiates the
> > "stop" on the md device.
> >
> > ***
> >   COMMAND: "mdmon"
> >      TASK: ffff8ff2b1663c80  [THREAD_INFO: ffff8ff2b1663c80]
> >       CPU: 30
> >     STATE: TASK_UNINTERRUPTIBLE
> > crash> bt
> > PID: 7390   TASK: ffff8ff2b1663c80  CPU: 30  COMMAND: "mdmon"
> >  #0 [ffffb99d4eacba30] __schedule at ffffffff91884acc
> >  #1 [ffffb99d4eacbad0] schedule at ffffffff918850e6
> >  #2 [ffffb99d4eacbae8] schedule_timeout at ffffffff91889616
> >  #3 [ffffb99d4eacbb78] wait_for_completion at ffffffff91885d1b
> >  #4 [ffffb99d4eacbbe0] __wait_rcu_gp at ffffffff9110c123
> >  #5 [ffffb99d4eacbc28] synchronize_sched at ffffffff9111008e
> >  #6 [ffffb99d4eacbc78] unbind_rdev_from_array at ffffffff9169640f
> >  #7 [ffffb99d4eacbcc0] do_md_stop at ffffffff9169f163
> >  #8 [ffffb99d4eacbd58] array_state_store at ffffffff9169f644
> >  #9 [ffffb99d4eacbd90] md_attr_store at ffffffff9169b71e
> > #10 [ffffb99d4eacbdc8] sysfs_kf_write at ffffffff91320edf
> > #11 [ffffb99d4eacbdd8] kernfs_fop_write at ffffffff913203c4
> > #12 [ffffb99d4eacbe18] __vfs_write at ffffffff9128fcfa
> > #13 [ffffb99d4eacbea0] vfs_write at ffffffff9128ffd2
> > #14 [ffffb99d4eacbee0] sys_write at ffffffff9129025c
> > #15 [ffffb99d4eacbf28] do_syscall_64 at ffffffff91003a39
> > #16 [ffffb99d4eacbf50] entry_SYSCALL_64_after_hwframe at ffffffff91a001b1
> >     RIP: 00007ff3bfc846fd  RSP: 00007ff3bf8a6cf0  RFLAGS: 00000293
> >     RAX: ffffffffffffffda  RBX: 000055e257ef0941  RCX: 00007ff3bfc846fd
> >     RDX: 0000000000000005  RSI: 000055e257ef0941  RDI: 0000000000000010
> >     RBP: 0000000000000010   R8: 0000000000000001   R9: 0000000000000000
> >     R10: 00000000ffffff01  R11: 0000000000000293  R12: 0000000000000001
> >     R13: 0000000000000000  R14: 0000000000000001  R15: 0000000000000000
> >     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
> > crash>
> > ***
> >
> > While handling the "md" stop in the kernel, the code sequence based on the
> > above md configuration is:
> >
> > do_md_stop
> >  + __md_stop_writes
> >  | + md_bitmap_flush
> >  | | + md_bitmap_daemon_work
> >  | | | + md_bitmap_wait_writes()
> >  | | | |   (This wait is for bitmap writes that happened up until now and
> >  | | | |    to avoid overlapping with the new bitmap writes.)
> >  | | | |   (The wait flag is set to zero for async writes to the bitmap.)
> >  | | | + write_page()
> >  | | | |   (For an external bitmap, bitmap->storage.file is NULL, hence
> >  | | | |    this ends up calling write_sb_page())
> >  | | | | + write_sb_page()
> >  | | | | | + md_super_write()
> >  | | | | | |   (Since the wait flag is false, the bitmap write is submitted
> >  | | | | | |    without waiting for it to complete.)
> >  | | | | | | + mddev->pending_writes is incremented, then the IO is
> >  | | | | | |   submitted and the call returns without waiting
> >  | | + md_bitmap_update_sb() - (This call simply returns because
> >  | | |   "bitmap->mddev->bitmap_info.external" is true for external bitmaps)
> >  | + md_update_sb() - (This won't be called, as the call is conditional and
> >  | |   the condition evaluates to false in my setup (see below for crash info))
> >  + __md_stop
> >  | + md_bitmap_destroy
> >  | | + md_bitmap_free
> >  | | | + wait_event(bitmap->write_wait,
> >  | | |              atomic_read(&bitmap->pending_writes) == 0);
> >  | | |   bitmap->pending_writes is zero, but mddev->pending_writes is 1,
> >  | | |   so this call returns immediately.
> >  | md detachment continues while there is pending mddev I/O, causing a NULL
> >  | pointer dereference when the I/O callback, super_written, is called.
> >
> > ***
> > crash> struct mddev.bitmap_info.external 0xffff8ffb62f13800
> >   bitmap_info.external = 1,
> > crash> struct mddev.in_sync 0xffff8ffb62f13800
> >   in_sync = 1
> > crash> struct mddev.sb_flags 0xffff8ffb62f13800
> >   sb_flags = 0
> > crash> struct mddev.ro 0xffff8ffb62f13800
> >   ro = 0
> > crash> struct mddev.cluster_info 0xffff8ffb62f13800
> >   cluster_info = 0x0
> > crash>
> > ***
> >
> > Please review and let me know your thoughts.
> >
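(A quick note before Heming's reply below: the reason the wait in
md_bitmap_free() does not help here is which counter the flushed bitmap page
was accounted against. A heavily abbreviated sketch of write_page() in
drivers/md/md-bitmap.c, as I read the mainline code, not verbatim:)

```
/* Sketch only: locking, error and return-value handling are elided. */
static void write_page(struct bitmap *bitmap, struct page *page, int wait)
{
	if (bitmap->storage.file == NULL) {
		/* Bitmap stored in the member devices' metadata (this setup):
		 * write_sb_page() ends up in md_super_write(), so the bio is
		 * accounted in mddev->pending_writes and completed via
		 * super_written() / mddev->sb_wait. */
		write_sb_page(bitmap, page, wait);
	} else {
		/* File-backed bitmap: each buffer_head submitted here bumps
		 * bitmap->pending_writes, and completion wakes
		 * bitmap->write_wait, the only counter md_bitmap_free()
		 * waits on. */
	}
}
```

So with wait == 0 and no bitmap file, the in-flight page write is visible only
through mddev->pending_writes, which this particular tear-down path never
waits on.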
>
> Hello,
>
> In my previous mail I was wrong to assume that the array's
> mddev->bitmap_info.external is 0, so my analysis did not apply to your case.
>
> Now we understand your env, which involves IMSM and an external bitmap, and I
> have only a little knowledge of this field. We may need to add more people to
> the cc list.
>
> Just from the code logic, your analysis is reasonable.
> The key question for the crash is why the md layer didn't wait for the bitmap
> bios. There are wait points in md_bitmap_daemon_work, write_sb_page,
> md_bitmap_update_sb & md_update_sb, but in your scenario none of them takes
> effect. In the end md_bitmap_free does an additional wait on
> bitmap->pending_writes but not on mddev->pending_writes, and the crash is
> triggered.
>
> In my opinion, using a dedicated wait is clearer than calling the general
> md_bitmap_wait_writes(); md_bitmap_wait_writes() makes it look as if the
> bitmap module is doing a redundant cleanup job.
>
> My idea is something like:
> ```
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index 200c5d0f08bf..ea6fa5a2cb6b 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -1723,6 +1723,8 @@ void md_bitmap_flush(struct mddev *mddev)
>         bitmap->daemon_lastrun -= sleep;
>         md_bitmap_daemon_work(mddev);
>         md_bitmap_update_sb(bitmap);
> +       if (mddev->bitmap_info.external)
> +               md_super_wait(mddev);
>  }
>
>  /*
> @@ -1746,6 +1748,7 @@ void md_bitmap_free(struct bitmap *bitmap)
>         /* Shouldn't be needed - but just in case.... */
>         wait_event(bitmap->write_wait,
>                    atomic_read(&bitmap->pending_writes) == 0);
> +       wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes) == 0);
>
>         /* release the bitmap file  */
>         md_bitmap_file_unmap(&bitmap->storage);
> ```

I like Heming's version better.

Hi Sudhakar,

Are you able to reproduce the issue? If so, could you please verify that
Heming's change fixes the issue?

Thanks,
Song