From: Yu Kuai <yukuai1@huaweicloud.com>
To: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>,
	Yu Kuai <yukuai1@huaweicloud.com>
Cc: AceLan Kao <acelan@gmail.com>, Song Liu <song@kernel.org>,
	Guoqing Jiang <guoqing.jiang@linux.dev>,
	Bagas Sanjaya <bagasdotme@gmail.com>,
	Christoph Hellwig <hch@lst.de>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linux Regressions <regressions@lists.linux.dev>,
	Linux RAID <linux-raid@vger.kernel.org>,
	"yangerkun@huawei.com" <yangerkun@huawei.com>,
	"yukuai (C)" <yukuai3@huawei.com>
Subject: Re: Infinite systemd loop when powering off the machine with multiple MD RAIDs
Date: Thu, 7 Sep 2023 20:53:30 +0800
Message-ID: <513ea05e-cc2e-e0c8-43cc-6636b0631cdf@huaweicloud.com>
In-Reply-To: <20230907144153.00002492@linux.intel.com>

Hi,

On 2023/09/07 20:41, Mariusz Tkaczyk wrote:
> On Thu, 7 Sep 2023 20:14:03 +0800
> Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
>> Hi,
>>
>> On 2023/09/07 19:26, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2023/09/07 18:18, Mariusz Tkaczyk wrote:
>>>> On Thu, 7 Sep 2023 10:04:11 +0800
>>>> Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>>   
>>>>> Hi,
>>>>>
>>>>> On 2023/09/06 18:27, Mariusz Tkaczyk wrote:
>>>>>> On Wed, 6 Sep 2023 14:26:30 +0800
>>>>>> AceLan Kao <acelan@gmail.com> wrote:
>>>>>>>    From previous testing, I don't think it's an issue in systemd, so I
>>>>>>> did a simple test and found that the issue goes away.
>>>>>>> You only need to add a small delay in md_release(); then the issue
>>>>>>> can't be reproduced.
>>>>>>>
>>>>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>>>>> index 78be7811a89f..ef47e34c1af5 100644
>>>>>>> --- a/drivers/md/md.c
>>>>>>> +++ b/drivers/md/md.c
>>>>>>> @@ -7805,6 +7805,7 @@ static void md_release(struct gendisk *disk)
>>>>>>> {
>>>>>>>           struct mddev *mddev = disk->private_data;
>>>>>>>
>>>>>>> +       msleep(10);
>>>>>>>           BUG_ON(!mddev);
>>>>>>>           atomic_dec(&mddev->openers);
>>>>>>>           mddev_put(mddev);
>>>>>>
>>>>>> I have a repro and I tested it on my setup. It is not working for me.
>>>>>> My setup could be more "advanced" to maximize the chance of reproduction:
>>>>>>
>>>>>> # cat /proc/mdstat
>>>>>> Personalities : [raid1] [raid6] [raid5] [raid4] [raid10] [raid0]
>>>>>> md121 : active raid0 nvme2n1[1] nvme5n1[0]
>>>>>>          7126394880 blocks super external:/md127/0 128k chunks
>>>>>>
>>>>>> md122 : active raid10 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
>>>>>>          104857600 blocks super external:/md126/0 64K chunks 2 near-copies [4/4] [UUUU]
>>>>>>
>>>>>> md123 : active raid5 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
>>>>>>          2655765504 blocks super external:/md126/1 level 5, 32k chunk, algorithm 0 [4/4] [UUUU]
>>>>>>
>>>>>> md124 : active raid1 nvme0n1[1] nvme3n1[0]
>>>>>>          99614720 blocks super external:/md125/0 [2/2] [UU]
>>>>>>
>>>>>> md125 : inactive nvme3n1[1](S) nvme0n1[0](S)
>>>>>>          10402 blocks super external:imsm
>>>>>>
>>>>>> md126 : inactive nvme7n1[3](S) nvme1n1[2](S) nvme6n1[1](S) nvme4n1[0](S)
>>>>>>          20043 blocks super external:imsm
>>>>>>
>>>>>> md127 : inactive nvme2n1[1](S) nvme5n1[0](S)
>>>>>>          10402 blocks super external:imsm
>>>>>>
>>>>>> I have an almost 99% repro ratio; slowly moving forward...
>>>>>>
>>>>>> It is an endless loop because systemd-shutdown sends the "stop_array"
>>>>>> ioctl, which is successful, but the array is not stopped. For that
>>>>>> reason it sets "changed = true".
>>>>>
>>>>> How does systemd-shutdown judge whether the array is stopped? cat /proc/mdstat,
>>>>> ls /dev/md*, or some other way?
>>>>
>>>> Hi Yu,
>>>>
>>>> It trusts the return value; I confirmed that 0 is returned.
>>>> The weirdest part is that we are returning 0 but the array is still there,
>>>> and it is stopped again in the next systemd loop. I don't understand why yet.
>>>>   
>>>>>> Systemd-shutdown sees the change and retries, to check if there is
>>>>>> something else which can be stopped now, and again, and again...
>>>>>>
>>>>>> I will check what is returned first; it could be 0 or it could be a
>>>>>> positive errno (nit?), because systemd only cares about "if (r < 0)".
>>>>>
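For reference, the retry logic on the systemd side is roughly like this (only a
simplified sketch of the behaviour described above, not the actual
systemd-shutdown source; the device paths and helper name are made up):

/*
 * Simplified sketch of the stop-and-retry loop (illustrative only).
 * Build against the kernel UAPI headers for STOP_ARRAY.
 */
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/major.h>
#include <linux/raid/md_u.h>	/* STOP_ARRAY */

static bool stop_all_arrays(const char *const *devs, int n)
{
	bool changed = false;

	for (int i = 0; i < n; i++) {
		int fd = open(devs[i], O_RDONLY | O_NONBLOCK);

		if (fd < 0)
			continue;	/* node already gone, nothing to stop */

		int r = ioctl(fd, STOP_ARRAY, NULL);

		close(fd);
		if (r < 0)		/* only a negative return counts as a failure */
			fprintf(stderr, "cannot stop %s\n", devs[i]);
		else
			changed = true;	/* "progress was made", so go around again */
	}
	return changed;
}

int main(void)
{
	const char *const devs[] = { "/dev/md123", "/dev/md126" };	/* illustrative */

	/*
	 * If STOP_ARRAY keeps returning 0 while the array node stays around,
	 * "changed" never becomes false and the shutdown loop never ends.
	 */
	while (stop_all_arrays(devs, 2))
		;
	return 0;
}
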
>>>>> I did notice that there are lots of logs about md123 being stopped:
>>>>>
>>>>> [ 1371.834034] md122:systemd-shutdow bd_prepare_to_claim return -16
>>>>> [ 1371.840294] md122:systemd-shutdow blkdev_get_by_dev return -16
>>>>> [ 1371.846845] md: md123 stopped.
>>>>> [ 1371.850155] md122:systemd-shutdow bd_prepare_to_claim return -16
>>>>> [ 1371.856411] md122:systemd-shutdow blkdev_get_by_dev return -16
>>>>> [ 1371.862941] md: md123 stopped.
>>>>>
>>>>> And md_ioctl->do_md_stop doesn't have an error path after printing this
>>>>> log, hence 0 will be returned to the user.
>>>>>
>>>>> The normal case is that:
>>>>>
>>>>> open md123
>>>>> ioctl STOP_ARRAY -> all rdevs should be removed from the array
>>>>> close md123 -> mddev will finally be freed by:
>>>>>      md_release
>>>>>       mddev_put
>>>>>        set_bit(MD_DELETED, &mddev->flags) -> user should not see this mddev
>>>>>        queue_work(md_misc_wq, &mddev->del_work)
>>>>>
>>>>>      mddev_delayed_delete
>>>>>       kobject_put(&mddev->kobj)
>>>>>
>>>>>      md_kobj_release
>>>>>       del_gendisk
>>>>>        md_free_disk
>>>>>         mddev_free
>>>>>   
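In user space, that whole normal case boils down to something like this (a
minimal sketch; the device path is illustrative and error handling is trimmed).
The final close() is the point where md_release() runs and the teardown above
is kicked off:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/major.h>
#include <linux/raid/md_u.h>	/* STOP_ARRAY */

int main(void)
{
	int fd = open("/dev/md123", O_RDONLY | O_EXCL);	/* illustrative path */

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* STOP_ARRAY should detach all rdevs from the array. */
	if (ioctl(fd, STOP_ARRAY, NULL) < 0)
		perror("STOP_ARRAY");

	/*
	 * Dropping the last opener is what ends up in md_release() ->
	 * mddev_put(), which sets MD_DELETED and queues the delayed delete.
	 */
	close(fd);
	return 0;
}
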
>>>> Ok thanks, I understand that md_release is called on descriptor
>>>> closing, right?
>>>>   
>>>
>>> Yes, normally closing md123 should drop the last reference.
>>>>   
>>>>> Now that you can reproduce this problem 99% of the time, can you dig
>>>>> deeper and find out what is wrong?
>>>>
>>>> Yes, working on it!
>>>>
>>>> My first idea was that mddev_get() and mddev_put() are missing on the
>>>> md_ioctl() path, but it doesn't help with the issue. My motivation here
>>>> was that md_attr_store() and md_attr_show() are using them.
>>>>
>>>> Systemd regenerates the list of MD arrays on every loop and it is always
>>>> there; systemd is able to open a file descriptor (maybe the array is
>>>> inactive?).
>>>
>>> md123 should not be opened again; ioctl(STOP_ARRAY) already sets the
>>> 'MD_CLOSING' flag to prevent that. Are you sure that systemd-shutdown does
>>> open and close the array in each loop?
>>
>> I realized that I'm wrong here. 'MD_CLOSING' is cleared before the ioctl
>> returns, by commit 065e519e71b2 ("md: MD_CLOSING needs to be cleared after
>> called md_set_readonly or do_md_stop").
>>
>> I'm confused here: the commit message says 'MD_CLOSING' should not be set
>> for the STOP_ARRAY_RO case, but I don't understand why it's cleared for
>> STOP_ARRAY as well.
>>

Can you try the following patch?

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3afd57622a0b..31b9cec7e4c0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7668,7 +7668,8 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
                         err = -EBUSY;
                         goto out;
                 }
-               did_set_md_closing = true;
+               if (cmd == STOP_ARRAY_RO)
+                       did_set_md_closing = true;
                 mutex_unlock(&mddev->open_mutex);
                 sync_blockdev(bdev);
         }

I think preventing the array from being opened again after STOP_ARRAY might
fix this.
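
If that is right, the user-visible difference should be easy to check with
something like this (only a sketch; the path is illustrative and the exact
errno of the failing open depends on the kernel):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/major.h>
#include <linux/raid/md_u.h>	/* STOP_ARRAY */

int main(void)
{
	int fd = open("/dev/md123", O_RDONLY | O_EXCL);

	if (fd < 0)
		return 1;
	if (ioctl(fd, STOP_ARRAY, NULL) < 0)
		perror("STOP_ARRAY");

	/*
	 * With MD_CLOSING left set after STOP_ARRAY, this second open is
	 * expected to fail, so systemd-shutdown would see an error instead
	 * of "successfully" stopping the same array on every iteration.
	 */
	int fd2 = open("/dev/md123", O_RDONLY);

	if (fd2 < 0)
		perror("re-open after STOP_ARRAY");
	else
		close(fd2);

	close(fd);
	return 0;
}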

Thanks,
Kuai

> Not related to your findings, but:
> 
> I replaced the "if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))"
> check, because that is the path that exits without queuing the work:
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 0fe7ab6e8ab9..80bd7446be94 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -618,8 +618,7 @@ static void mddev_delayed_delete(struct work_struct *ws);
> 
> void mddev_put(struct mddev *mddev)
> {
> -       if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
> -               return;
> +       spin_lock(&all_mddevs_lock);
>          if (!mddev->raid_disks && list_empty(&mddev->disks) &&
>              mddev->ctime == 0 && !mddev->hold_active) {
>                  /* Array is not configured at all, and not held active,
> @@ -634,6 +633,7 @@ void mddev_put(struct mddev *mddev)
>                  INIT_WORK(&mddev->del_work, mddev_delayed_delete);
>                  queue_work(md_misc_wq, &mddev->del_work);
>          }
> +       atomic_dec(&mddev->active);
>          spin_unlock(&all_mddevs_lock);
> }
> 
> After that I got a kernel panic, but it seems that the workqueue is scheduled:
> 
>   51.535103] BUG: kernel NULL pointer dereference, address: 0000000000000008
> [   51.539115] ------------[ cut here ]------------
> [   51.543867] #PF: supervisor read access in kernel mode
> 1;[3 9 m S5t1a.r5tPF: error_code(0x0000) - not-present page
> [   51.543875] PGD 0 P4D 0
> .k[ 0 m5.1
>   54ops: 0000 [#1] PREEMPT SMP NOPTI
> [   51.552207] refcount_t: underflow; use-after-free.
> [   51.556820] CPU: 19 PID: 368 Comm: kworker/19:1 Not tainted 6.5.0+ #57
> [   51.556825] Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0027.P82.2204080829 04/08/2022
> [   51.561979] WARNING: CPU: 26 PID: 376 at lib/refcount.c:28 refcount_warn_saturate+0x99/0xe0
> [   51.567273] Workqueue:  mddev_delayed_delete [md_mod]
> [   51.569822] Modules linked in:
> [   51.574351]  (events)
> [   51.574353] RIP: 0010:process_one_work+0x10f/0x3d0
> [   51.579155]  configfs
> 
> In my case, it seems that the IMSM container device is stopped in a loop, and
> it is inactive from the start. It is not something I'm totally sure about, but
> it could lead us to the root cause. As far as I know, the original reporter
> uses IMSM arrays too.
> 
> Thanks,
> Mariusz
> 
> .
> 



Thread overview: 32+ messages
2023-08-16  9:37 Fwd: Infinite systemd loop when powering off the machine with multiple MD RAIDs Bagas Sanjaya
2023-08-18  8:16 ` Mariusz Tkaczyk
2023-08-18  9:21   ` Hannes Reinecke
2023-08-21  3:23     ` AceLan Kao
2023-08-22  3:51   ` Guoqing Jiang
2023-08-22  6:17     ` Song Liu
2023-08-22  6:39       ` Mariusz Tkaczyk
2023-08-22  8:13         ` AceLan Kao
2023-08-22 12:41           ` Guoqing Jiang
2023-08-23  8:02             ` AceLan Kao
2023-08-23 13:25               ` Song Liu
2023-08-26  4:31                 ` AceLan Kao
2023-08-28  5:20                   ` Song Liu
2023-08-28 10:48                     ` AceLan Kao
2023-08-29  3:12                       ` AceLan Kao
2023-08-28 13:50                     ` Yu Kuai
2023-08-31  2:28                       ` Yu Kuai
2023-08-31  6:50                         ` Mariusz Tkaczyk
2023-09-06  6:26                           ` AceLan Kao
2023-09-06 10:27                             ` Mariusz Tkaczyk
2023-09-07  2:04                               ` Yu Kuai
2023-09-07 10:18                                 ` Mariusz Tkaczyk
     [not found]                                   ` <cffca94f-5729-622d-9327-632b3ff2891a@huaweicloud.com>
     [not found]                                     ` <3e7edf0c-cadd-59b0-4e10-dffdb86b93b7@huaweicloud.com>
2023-09-07 12:41                                       ` Mariusz Tkaczyk
2023-09-07 12:53                                         ` Yu Kuai [this message]
2023-09-07 15:09                                           ` Mariusz Tkaczyk
2023-09-08 20:25                                             ` Song Liu
2023-08-21 13:18 ` Fwd: " Yu Kuai
2023-08-22  1:39   ` AceLan Kao
2023-08-22 18:56 ` Song Liu
2023-08-22 19:13   ` Carlos Carvalho
2023-08-23  1:28     ` Yu Kuai
2023-08-23  6:04       ` Hannes Reinecke
