From: Yu Kuai <yukuai1@huaweicloud.com>
To: Yu Kuai <yukuai1@huaweicloud.com>, Song Liu <song@kernel.org>
Cc: Guoqing Jiang <guoqing.jiang@linux.dev>,
logang@deltatee.com, pmenzel@molgen.mpg.de, agk@redhat.com,
snitzer@kernel.org, linux-kernel@vger.kernel.org,
linux-raid@vger.kernel.org, yi.zhang@huawei.com,
yangerkun@huawei.com, Marc Smith <msmith626@gmail.com>,
"yukuai (C)" <yukuai3@huawei.com>
Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"
Date: Fri, 5 May 2023 17:05:01 +0800 [thread overview]
Message-ID: <e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com> (raw)
In-Reply-To: <9d92a862-e728-5493-52c0-abc634eb6e97@huaweicloud.com>
Hi, Song and Guoqing
On 2023/04/06 16:53, Yu Kuai wrote:
> Hi,
>
> On 2023/03/29 7:58, Song Liu wrote:
>> On Wed, Mar 22, 2023 at 11:32 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>
>>> Hi,
>>>
>>> On 2023/03/23 11:50, Guoqing Jiang wrote:
>>>
>>>> Combined your debug patch with above steps. Seems you are
>>>>
>>>> 1. add delay to action_store, so it can't get lock in time.
>>>> 2. echo "want_replacement" triggers md_check_recovery which can grab
>>>> lock to start sync thread.
>>>> 3. action_store finally holds lock to clear RECOVERY_RUNNING in reap
>>>> sync thread.
>>>> 4. Then the newly added BUG_ON is invoked since RECOVERY_RUNNING is
>>>> cleared in step 3.
>>>
>>> Yes, this is exactly what I did.
>>>
>>>> sync_thread can be interrupted once MD_RECOVERY_INTR is set, which
>>>> means the RUNNING bit can be cleared, so I am not sure the added
>>>> BUG_ON is reasonable. And change BUG_ON
>>>
>>> I think BUG_ON() is reasonable because only md_reap_sync_thread() can
>>> clear it; md_do_sync() will exit quickly if MD_RECOVERY_INTR is set, but
>>> md_do_sync() should not see that MD_RECOVERY_RUNNING is cleared, otherwise
>>> there is no guarantee that only one sync_thread can be in progress.
>>>
>>>> like this makes more sense to me.
>>>>
>>>> + BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>>> +        !test_bit(MD_RECOVERY_INTR, &mddev->recovery));
>>>
>>> I think this can be reproduced likewise: md_check_recovery() clears
>>> MD_RECOVERY_INTR, and the new sync_thread triggered by echo
>>> "want_replacement" won't set this bit.
>>>
>>>>
>>>> I think there might be a racy window like you described, but it should
>>>> be really small. I prefer to just add a few lines like this instead of
>>>> reverting and introducing a new lock to resolve the same issue (if it
>>>> is one).
>>>
>>> The new lock that I add in this patchset just tries to synchronize
>>> 'idle' and 'frozen' from action_store() (patch 3); I can drop it if you
>>> think it is not necessary.
>>>
>>> The main change is patch 4; it doesn't add many new lines, and I really
>>> don't like to add new flags unless we have to, since the current code is
>>> already hard to understand...
>>>
>>> By the way, I'm concerned that dropping the mutex to unregister
>>> sync_thread might not be safe, since the mutex protects lots of stuff,
>>> and there might exist other implicit dependencies.
>>>
>>>>
>>>> TBH, I am reluctant to see the changes in the series; it can only be
>>>> considered acceptable with conditions:
>>>>
>>>> 1. the previous raid456 bug can be fixed in this way too; hopefully
>>>> Marc or others can verify it.
After reading the thread:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
The deadlock in raid456 has the same conditions as the raid10 one:
1) echo idle holds the mutex to stop the sync thread;
2) the sync thread waits for io to complete;
3) io can't be handled by the daemon thread because the sb flag is set;
4) the sb flag can't be cleared because the daemon thread can't hold the mutex.
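To make the circular wait explicit, here is a small illustrative model (my own sketch, not kernel code; the node names are just labels for the actors in conditions 1)-4) above) that walks the wait-for edges and recovers the cycle:

```python
# Hypothetical wait-for graph for the raid456 hang described above.
# Node names are illustrative labels, not kernel symbols.
wait_for = {
    "echo idle (holds mutex)": "sync thread",    # 1) waits for sync thread to stop
    "sync thread": "in-flight io",               # 2) waits for io to complete
    "in-flight io": "daemon thread",             # 3) stalled while sb flag is set
    "daemon thread": "echo idle (holds mutex)",  # 4) needs the mutex to clear sb flag
}

def find_cycle(graph, start):
    """Follow wait-for edges from start; return the cycle once a node repeats."""
    seen = []
    node = start
    while node not in seen:
        seen.append(node)
        node = graph[node]
    return seen[seen.index(node):]

cycle = find_cycle(wait_for, "echo idle (holds mutex)")
print(" -> ".join(cycle))  # all four actors appear: a circular wait, i.e. deadlock
```

Every actor waits on the next and the last waits on the first, so no one can make progress.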
I tried to reproduce the deadlock with the reproducer provided in the
thread; however, the deadlock did not reproduce after running for more
than a day. I changed the reproducer to the one below:
[root@fedora raid5]# cat test_deadlock.sh
#! /bin/bash
(
	while true; do
		echo check > /sys/block/md0/md/sync_action
		sleep 0.5
		echo idle > /sys/block/md0/md/sync_action
	done
) &

echo 0 > /proc/sys/vm/dirty_background_ratio

(
	while true; do
		fio -filename=/dev/md0 -bs=4k -rw=write -numjobs=1 \
			-name=xxx
	done
) &
And I was finally able to reproduce the deadlock with this patch
reverted (running for about an hour):
[root@fedora raid5]# ps -elf | grep " D " | grep -v grep
1 D root     156     2 16  80   0 -     0 md_wri 06:51 ?        00:19:15 [kworker/u8:11+flush-9:0]
5 D root    2239     1  2  80   0 -   992 kthrea 06:57 pts/0    00:02:15 sh test_deadlock.sh
1 D root   42791     2  0  80   0 -     0 raid5_ 07:45 ?        00:00:00 [md0_resync]
5 D root   42803 42797  0  80   0 - 92175 balanc 07:45 ?        00:00:06 fio -filename=/dev/md0 -bs=4k -rw=write -numjobs=1 -name=xxx
[root@fedora raid5]# cat /proc/2239/stack
[<0>] kthread_stop+0x96/0x2b0
[<0>] md_unregister_thread+0x5e/0xd0
[<0>] md_reap_sync_thread+0x27/0x370
[<0>] action_store+0x1fa/0x490
[<0>] md_attr_store+0xa7/0x120
[<0>] sysfs_kf_write+0x3a/0x60
[<0>] kernfs_fop_write_iter+0x144/0x2b0
[<0>] new_sync_write+0x140/0x210
[<0>] vfs_write+0x21a/0x350
[<0>] ksys_write+0x77/0x150
[<0>] __x64_sys_write+0x1d/0x30
[<0>] do_syscall_64+0x45/0x70
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[root@fedora raid5]# cat /proc/42791/stack
[<0>] raid5_get_active_stripe+0x606/0x960
[<0>] raid5_sync_request+0x508/0x570
[<0>] md_do_sync.cold+0xaa6/0xee7
[<0>] md_thread+0x266/0x280
[<0>] kthread+0x151/0x1b0
[<0>] ret_from_fork+0x1f/0x30
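(Aside: the `ps -elf | grep " D "` step above can be scripted; the helper below is my own sketch, not part of the reproducer, and simply pulls the pid and command of D-state tasks out of `ps -elf`-style output:)

```python
# Extract tasks in uninterruptible sleep ("D" state) from `ps -elf` output.
# Illustrative helper, not part of the original reproducer.
def d_state_tasks(ps_output):
    """Return (pid, cmd) for lines whose state column (field 2) is 'D'."""
    tasks = []
    for line in ps_output.splitlines():
        fields = line.split()
        # `ps -elf` columns: F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
        if len(fields) >= 15 and fields[1] == "D":
            tasks.append((fields[3], " ".join(fields[14:])))
    return tasks

sample = """\
1 D root 156 2 16 80 0 - 0 md_wri 06:51 ? 00:19:15 [kworker/u8:11+flush-9:0]
5 S root 2240 1 0 80 0 - 992 wait 06:57 pts/0 00:00:00 bash
1 D root 42791 2 0 80 0 - 0 raid5_ 07:45 ? 00:00:00 [md0_resync]"""

tasks = d_state_tasks(sample)
print(tasks)
```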
And with this patchset applied, I have run the above reproducer for more
than a day now, and I think the deadlock in raid456 is fixed.
Can this patchset be considered for the next merge window? If so, I'll
rebase it.
Thanks,
Kuai
>>>> 2. pass all the tests in mdadm
>>
>> AFAICT, this set looks like a better solution for this problem. But I
>> agree
>> that we need to make sure it fixes the original bug. mdadm tests are not
>> in a very good shape at the moment. I will spend more time to look into
>> these tests.
>
> While I'm working on another thread to protect md_thread with rcu, I
> found that this patch has another defect that can cause a
> null-ptr-dereference in theory, where
> md_unregister_thread(&mddev->sync_thread) can run concurrently with
> other contexts accessing sync_thread, for example:
>
> t1: md_set_readonly                       t2: action_store
> // 'reconfig_mutex' is held by caller
>                                           md_unregister_thread
>                                           // 'reconfig_mutex' is not held
> if (mddev->sync_thread)
>                                           thread = *threadp
>                                           *threadp = NULL
> wake_up_process(mddev->sync_thread->tsk)
> // null-ptr-dereference
>
> So, I think this revert will make more sense. 😉
>
> Thanks,
> Kuai
>
> .
>
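The quoted window can be replayed deterministically. The sketch below is my own model, not kernel code (`Mddev` and `MdThread` are stand-ins for the kernel structures); it shows why t1's check of the pointer is defeated when md_unregister_thread runs without 'reconfig_mutex':

```python
# Deterministic replay of the t1/t2 interleaving quoted above.
# Mddev and MdThread are illustrative stand-ins, not kernel structures.
class MdThread:
    tsk = "task_struct"

class Mddev:
    def __init__(self):
        self.sync_thread = MdThread()

mddev = Mddev()

# t1 (md_set_readonly, reconfig_mutex held): the pointer looks valid here
t1_saw_thread = mddev.sync_thread is not None

# t2 (action_store, reconfig_mutex NOT held): md_unregister_thread runs now
# mirrors: thread = *threadp; *threadp = NULL
thread, mddev.sync_thread = mddev.sync_thread, None

# t1 resumes: wake_up_process(mddev->sync_thread->tsk)
try:
    mddev.sync_thread.tsk
    outcome = "ok"
except AttributeError:  # in the kernel this would be a null-ptr-dereference
    outcome = "null-ptr-deref"

print(t1_saw_thread, outcome)
```

The check and the use are not atomic with respect to t2, so holding the mutex on only one side leaves the window open.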
2023-03-22 6:41 [PATCH -next 0/6] md: fix that MD_RECOVERY_RUNNING can be cleared while sync_thread is still running Yu Kuai
2023-03-22 6:41 ` [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store" Yu Kuai
2023-03-22 7:19 ` Guoqing Jiang
2023-03-22 9:00 ` Yu Kuai
2023-03-22 14:32 ` Guoqing Jiang
2023-03-23 1:36 ` Yu Kuai
2023-03-23 3:50 ` Guoqing Jiang
2023-03-23 6:32 ` Yu Kuai
2023-03-28 23:58 ` Song Liu
2023-04-06 8:53 ` Yu Kuai
2023-05-05 9:05 ` Yu Kuai [this message]
2023-03-22 6:41 ` [PATCH -next 2/6] md: refactor action_store() for 'idle' and 'frozen' Yu Kuai
2023-03-22 6:41 ` [PATCH -next 3/6] md: add a mutex to synchronize idle and frozen in action_store() Yu Kuai
2023-03-22 6:41 ` [PATCH -next 4/6] md: refactor idle/frozen_sync_thread() Yu Kuai
2023-03-22 6:41 ` [PATCH -next 5/6] md: wake up 'resync_wait' at last in md_reap_sync_thread() Yu Kuai
2023-03-22 6:41 ` [PATCH -next 6/6] md: enhance checking in md_check_recovery() Yu Kuai