From: Guoqing Jiang <guoqing.jiang@linux.dev>
To: Yu Kuai <yukuai1@huaweicloud.com>,
	logang@deltatee.com, pmenzel@molgen.mpg.de, agk@redhat.com,
	snitzer@kernel.org, song@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	yi.zhang@huawei.com, yangerkun@huawei.com,
	Marc Smith <msmith626@gmail.com>,
	"yukuai (C)" <yukuai3@huawei.com>
Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"
Date: Thu, 23 Mar 2023 11:50:48 +0800	[thread overview]
Message-ID: <3fc2a539-e4cc-e057-6cf0-da7b3953be6e@linux.dev> (raw)
In-Reply-To: <31e7f59e-579a-7812-632d-059ed0a6d441@huaweicloud.com>



On 3/23/23 09:36, Yu Kuai wrote:
> Hi,
>
> 在 2023/03/22 22:32, Guoqing Jiang 写道:
>>>> Could you explain how the same work can be re-queued? Isn't
>>>> PENDING_BIT already set in t3? I believe queue_work shouldn't do
>>>> that per the comment, but I am not an expert ...
>>>
>>> This is not related to workqueue, it is just because raid10
>>> reinitializes the work that is already queued,
>>
>> I am trying to understand the possibility.
>>
>>> like I described later in t3:
>>>
>>> t2:
>>> md_check_recovery:
>>>  INIT_WORK -> clear pending
>>>  queue_work -> set pending
>>>   list_add_tail
>>> ...
>>>
>>> t3: -> work is still pending
>>> md_check_recovery:
>>>  INIT_WORK -> clear pending
>>>  queue_work -> set pending
>>>   list_add_tail -> list is corrupted
>>
>> First, t2 and t3 can't run in parallel since reconfig_mutex must be
>> held. And if sync_thread existed, the second process would unregister
>> and reap sync_thread, which means the second process will call
>> INIT_WORK and queue_work again.
>>
>> Maybe your description is valid, but I would prefer to call
>> work_pending and flush_workqueue instead of INIT_WORK and queue_work.
>
> This is not enough; it's true this can avoid the list corruption, but
> the worker function md_start_sync just registers a sync_thread, and
> md_do_sync() can still be in progress, hence this can't prevent a new
> sync_thread from starting while the old one is not done. Some other
> problems like deadlock can still be triggered.
>
>>> Of course, our 5.10 and mainline are the same,
>>>
>>> here are some tests:
>>>
>>> First, the deadlock can be reproduced reliably; the test script is simple:
>>>
>>> mdadm -Cv /dev/md0 -n 4 -l10 /dev/sd[abcd]
>>
>> So this is raid10 while the previous problem appeared in raid456; I am
>> not sure it is the same issue, but let's see.
>
> Ok, I'm not quite familiar with raid456 yet; however, the problem is
> still related to action_store holding the mutex to unregister
> sync_thread, right?

Yes and no; the previous raid456 bug also existed because it can't get a
stripe while the barrier is involved, as you mentioned in patch 4, which
is different.

>
>>> Then, the problem that MD_RECOVERY_RUNNING can be cleared can't be
>>> reproduced reliably; usually it takes 2+ days to trigger a problem,
>>> and each time the problem phenomenon can be different. I'm hacking
>>> the kernel and added some BUG_ON to test MD_RECOVERY_RUNNING in the
>>> attached patch; the following test can trigger the BUG_ON:
>>
>> Also, your debug patch obviously added a large delay which makes the
>> calltrace happen; I doubt a user can hit it in real life. Anyway, I
>> will try the below test from my side.
>>
>>> mdadm -Cv /dev/md0 -e1.0 -n 4 -l 10 /dev/sd{a..d} --run
>>> sleep 5
>>> echo 1 > /sys/module/md_mod/parameters/set_delay
>>> echo idle > /sys/block/md0/md/sync_action &
>>> sleep 5
>>> echo "want_replacement" > /sys/block/md0/md/dev-sdd/state

I combined your debug patch with the above steps. It seems you are:

1. adding a delay to action_store, so it can't get the lock in time;
2. echoing "want_replacement", which triggers md_check_recovery, which
   can grab the lock to start a sync thread;
3. action_store finally holds the lock and clears RECOVERY_RUNNING in
   reap sync thread;
4. then the newly added BUG_ON is invoked, since RECOVERY_RUNNING was
   cleared in step 3.
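
To spell out the interleaving (a rough sketch based only on the steps
above, not the exact code paths):

    action_store("idle")              md_check_recovery() (want_replacement)
    --------------------              ---------------------------------------
    (debug delay, cannot take
     the lock in time)                grabs the lock
                                      starts a new sync_thread
                                      -> md_do_sync() begins
    finally takes the lock
    md_reap_sync_thread()
      clears MD_RECOVERY_RUNNING      md_do_sync() is still in progress
                                      -> the added BUG_ON fires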

>>>
>>> test result:
>>>
>>> [  228.390237] md_check_recovery: running is set
>>> [  228.391376] md_check_recovery: queue new sync thread
>>> [  233.671041] action_store unregister success! delay 10s
>>> [  233.689276] md_check_recovery: running is set
>>> [  238.722448] md_check_recovery: running is set
>>> [  238.723328] md_check_recovery: queue new sync thread
>>> [  238.724851] md_do_sync: before new wor, sleep 10s
>>> [  239.725818] md_do_sync: delay done
>>> [  243.674828] action_store delay done
>>> [  243.700102] md_reap_sync_thread: running is cleared!
>>> [  243.748703] ------------[ cut here ]------------
>>> [  243.749656] kernel BUG at drivers/md/md.c:9084!
>>
>> After your debug patch is applied, does L9084 point to the line below?
>>
>> 9084                                 mddev->curr_resync = MaxSector;
>
> In my environment, it's a BUG_ON() that I added in md_do_sync:

Ok, so we are on different code bases ...

> 9080  skip:
> 9081         /* set CHANGE_PENDING here since maybe another update is needed,
> 9082          * so other nodes are informed. It should be harmless for normal
> 9083          * raid */
> 9084         BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> 9085         set_mask_bits(&mddev->sb_flags, 0,
> 9086                       BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
>
>>
>> I don't understand how it triggers the below calltrace, and it has
>> nothing to do with list corruption, right?
>
> Yes, this is just an early BUG_ON() to detect whether MD_RECOVERY_RUNNING
> is cleared while sync_thread is still in progress.

sync_thread can be interrupted once MD_RECOVERY_INTR is set, which means
RUNNING can be cleared, so I am not sure the added BUG_ON is reasonable.
Changing the BUG_ON like this makes more sense to me:

+       BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
+              !test_bit(MD_RECOVERY_INTR, &mddev->recovery));

I think there might be a racy window like you described, but it should be
really small. I prefer to just add a few lines like this instead of
reverting and introducing a new lock to resolve the same issue (if it is
the same issue).

@@ -4792,9 +4793,15 @@ action_store(struct mddev *mddev, const char *page, size_t len)
                        if (mddev->sync_thread) {
                                sector_t save_rp = mddev->reshape_position;

+                               set_bit(MD_RECOVERY_DONOT, &mddev->recovery);
@@ -4805,6 +4812,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
                                mddev->reshape_position = save_rp;
                                set_bit(MD_RECOVERY_INTR, &mddev->recovery);
                                md_reap_sync_thread(mddev);
+                               clear_bit(MD_RECOVERY_DONOT, &mddev->recovery);
                        }
                        mddev_unlock(mddev);
@@ -9296,6 +9313,9 @@ void md_check_recovery(struct mddev *mddev)
        if (!md_is_rdwr(mddev) &&
            !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
                return;
+       /* action_store is in the middle of reap sync thread, let's wait */
+       if (test_bit(MD_RECOVERY_DONOT, &mddev->recovery))
+               return;
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -553,6 +553,7 @@ enum recovery_flags {
        MD_RECOVERY_ERROR,      /* sync-action interrupted because io-error */
        MD_RECOVERY_WAIT,       /* waiting for pers->start() to finish */
        MD_RESYNCING_REMOTE,    /* remote node is running resync thread */
+       MD_RECOVERY_DONOT,      /* for a nasty racy issue */
};
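
With that flag, the intended behaviour is roughly this (sketch only,
derived from the hunks above):

    action_store("idle")              md_check_recovery()
    --------------------              -------------------
    set_bit(MD_RECOVERY_DONOT)
    drop and re-take reconfig_mutex   sees MD_RECOVERY_DONOT set
    md_reap_sync_thread()             -> returns early, so no new
    clear_bit(MD_RECOVERY_DONOT)         sync_thread is started in the window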

TBH, I am reluctant to see the changes in this series; they can only be
considered acceptable under these conditions:

1. the previous raid456 bug can be fixed this way too; hopefully Marc or
   others can verify it.
2. all the tests in mdadm pass.

Thanks,
Guoqing

