From: Guoqing Jiang <guoqing.jiang@linux.dev>
To: Yu Kuai <yukuai1@huaweicloud.com>,
	logang@deltatee.com, pmenzel@molgen.mpg.de, agk@redhat.com,
	snitzer@kernel.org, song@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	yi.zhang@huawei.com, yangerkun@huawei.com,
	Marc Smith <msmith626@gmail.com>,
	"yukuai (C)" <yukuai3@huawei.com>
Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"
Date: Thu, 23 Mar 2023 11:50:48 +0800	[thread overview]
Message-ID: <3fc2a539-e4cc-e057-6cf0-da7b3953be6e@linux.dev> (raw)
In-Reply-To: <31e7f59e-579a-7812-632d-059ed0a6d441@huaweicloud.com>



On 3/23/23 09:36, Yu Kuai wrote:
> Hi,
>
> 在 2023/03/22 22:32, Guoqing Jiang 写道:
>>>> Could you explain how the same work can be re-queued? Isn't
>>>> PENDING_BIT already set in t3? I believe queue_work shouldn't do
>>>> that per the comment, but I am not an expert ...
>>>
>>> This is not related to workqueue, it is just because raid10
>>> reinitializes the work that is already queued,
>>
>> I am trying to understand the possibility.
>>
>>> like I described later in t3:
>>>
>>> t2:
>>> md_check_recovery:
>>>  INIT_WORK -> clear pending
>>>  queue_work -> set pending
>>>   list_add_tail
>>> ...
>>>
>>> t3: -> work is still pending
>>> md_check_recovery:
>>>  INIT_WORK -> clear pending
>>>  queue_work -> set pending
>>>   list_add_tail -> list is corrupted
>>
>> First, t2 and t3 can't run in parallel since reconfig_mutex must be
>> held. And if sync_thread existed, the second process would unregister
>> and reap sync_thread, which means the second process will call
>> INIT_WORK and queue_work again.
>>
>> Maybe your description is valid, but I would prefer to call
>> work_pending and flush_workqueue instead of INIT_WORK and queue_work.
>
> This is not enough; it's true this can avoid the list corruption, but
> the worker function md_start_sync just registers a sync_thread, and
> md_do_sync() can still be in progress, hence this can't prevent a new
> sync_thread from starting while the old one is not done. Some other
> problems like deadlock can still be triggered.
>
>>> Of course, our 5.10 and mainline are the same,
>>>
>>> here are some tests:
>>>
>>> First, the deadlock can be reproduced reliably; the test script is simple:
>>>
>>> mdadm -Cv /dev/md0 -n 4 -l10 /dev/sd[abcd]
>>
>> So this is raid10 while the previous problem appeared in raid456; I am
>> not sure it is the same issue, but let's see.
>
> Ok, I'm not quite familiar with raid456 yet; however, the problem is
> still related to action_store holding the mutex to unregister
> sync_thread, right?

Yes and no; the previous raid456 bug also existed because it can't get a
stripe while the barrier is involved, as you mentioned in patch 4, which
is different.

>
>>> Then, the problem that MD_RECOVERY_RUNNING can be cleared can't be
>>> reproduced reliably; usually it takes 2+ days to trigger a problem,
>>> and each time the problem phenomenon can be different. I'm hacking
>>> the kernel and added some BUG_ON to test MD_RECOVERY_RUNNING in the
>>> attached patch; the following test can trigger the BUG_ON:
>>
>> Also, your debug patch obviously added a large delay which makes the
>> calltrace happen; I doubt a user can hit it in real life. Anyway, I
>> will try the below test from my side.
>>
>>> mdadm -Cv /dev/md0 -e1.0 -n 4 -l 10 /dev/sd{a..d} --run
>>> sleep 5
>>> echo 1 > /sys/module/md_mod/parameters/set_delay
>>> echo idle > /sys/block/md0/md/sync_action &
>>> sleep 5
>>> echo "want_replacement" > /sys/block/md0/md/dev-sdd/state

I combined your debug patch with the above steps. It seems you are:

1. adding a delay to action_store, so it can't get the lock in time;
2. echoing "want_replacement", which triggers md_check_recovery, which
   can grab the lock to start a sync thread;
3. action_store finally holds the lock and clears RECOVERY_RUNNING in
   reap sync thread;
4. then the newly added BUG_ON is invoked, since RECOVERY_RUNNING was
   cleared in step 3.
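
To spell out the interleaving (a rough sketch based only on the steps
above, not the exact code paths):

    action_store("idle")              md_check_recovery() (want_replacement)
    --------------------              ---------------------------------------
    (debug delay, cannot take
     the lock in time)                grabs the lock
                                      starts a new sync_thread
                                      -> md_do_sync() begins
    finally takes the lock
    md_reap_sync_thread()
      clears MD_RECOVERY_RUNNING      md_do_sync() is still in progress
                                      -> the added BUG_ON fires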

>>>
>>> test result:
>>>
>>> [  228.390237] md_check_recovery: running is set
>>> [  228.391376] md_check_recovery: queue new sync thread
>>> [  233.671041] action_store unregister success! delay 10s
>>> [  233.689276] md_check_recovery: running is set
>>> [  238.722448] md_check_recovery: running is set
>>> [  238.723328] md_check_recovery: queue new sync thread
>>> [  238.724851] md_do_sync: before new wor, sleep 10s
>>> [  239.725818] md_do_sync: delay done
>>> [  243.674828] action_store delay done
>>> [  243.700102] md_reap_sync_thread: running is cleared!
>>> [  243.748703] ------------[ cut here ]------------
>>> [  243.749656] kernel BUG at drivers/md/md.c:9084!
>>
>> After your debug patch is applied, does L9084 point to the line below?
>>
>> 9084                                 mddev->curr_resync = MaxSector;
>
> In my environment, it's a BUG_ON() that I added in md_do_sync:

Ok, so we are on different code bases ...

> 9080  skip:
> 9081         /* set CHANGE_PENDING here since maybe another update is needed,
> 9082          * so other nodes are informed. It should be harmless for normal
> 9083          * raid */
> 9084         BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> 9085         set_mask_bits(&mddev->sb_flags, 0,
> 9086                       BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
>
>>
>> I don't understand how it triggers the below calltrace, and it has
>> nothing to do with list corruption, right?
>
> Yes, this is just an early BUG_ON() to detect whether MD_RECOVERY_RUNNING
> is cleared while sync_thread is still in progress.

sync_thread can be interrupted once MD_RECOVERY_INTR is set, which means
RUNNING can be cleared, so I am not sure the added BUG_ON is reasonable.
Changing the BUG_ON like this makes more sense to me:

+       BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
+              !test_bit(MD_RECOVERY_INTR, &mddev->recovery));

I think there might be a racy window like you described, but it should be
really small. I prefer to just add a few lines like this instead of
reverting and introducing a new lock to resolve the same issue (if it is
the same issue).

@@ -4792,9 +4793,15 @@ action_store(struct mddev *mddev, const char *page, size_t len)
                        if (mddev->sync_thread) {
                                sector_t save_rp = mddev->reshape_position;

+                               set_bit(MD_RECOVERY_DONOT, &mddev->recovery);
@@ -4805,6 +4812,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
                                mddev->reshape_position = save_rp;
                                set_bit(MD_RECOVERY_INTR, &mddev->recovery);
                                md_reap_sync_thread(mddev);
+                               clear_bit(MD_RECOVERY_DONOT, &mddev->recovery);
                        }
                        mddev_unlock(mddev);
@@ -9296,6 +9313,9 @@ void md_check_recovery(struct mddev *mddev)
        if (!md_is_rdwr(mddev) &&
            !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
                return;
+       /* action_store is in the middle of reap sync thread, let's wait */
+       if (test_bit(MD_RECOVERY_DONOT, &mddev->recovery))
+               return;
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -553,6 +553,7 @@ enum recovery_flags {
        MD_RECOVERY_ERROR,      /* sync-action interrupted because io-error */
        MD_RECOVERY_WAIT,       /* waiting for pers->start() to finish */
        MD_RESYNCING_REMOTE,    /* remote node is running resync thread */
+       MD_RECOVERY_DONOT,      /* for a nasty racy issue */
};
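
With that flag, the intended behaviour is roughly this (sketch only,
derived from the hunks above):

    action_store("idle")              md_check_recovery()
    --------------------              -------------------
    set_bit(MD_RECOVERY_DONOT)
    drop and re-take reconfig_mutex   sees MD_RECOVERY_DONOT set
    md_reap_sync_thread()             -> returns early, so no new
    clear_bit(MD_RECOVERY_DONOT)         sync_thread is started in the window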

TBH, I am reluctant to see the changes in this series; they can only be
considered acceptable under these conditions:

1. the previous raid456 bug can be fixed this way too; hopefully Marc or
   others can verify it.
2. all the tests in mdadm pass.

Thanks,
Guoqing

