From: Yu Kuai <yukuai1@huaweicloud.com>
To: Guoqing Jiang <guoqing.jiang@linux.dev>,
	Yu Kuai <yukuai1@huaweicloud.com>,
	logang@deltatee.com, pmenzel@molgen.mpg.de, agk@redhat.com,
	snitzer@kernel.org, song@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	yi.zhang@huawei.com, yangerkun@huawei.com,
	Marc Smith <msmith626@gmail.com>,
	"yukuai (C)" <yukuai3@huawei.com>
Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"
Date: Thu, 23 Mar 2023 09:36:27 +0800	[thread overview]
Message-ID: <31e7f59e-579a-7812-632d-059ed0a6d441@huaweicloud.com> (raw)
In-Reply-To: <b91ae03a-14d5-11eb-8ec7-3ed91ff2c59e@linux.dev>

Hi,

On 2023/03/22 22:32, Guoqing Jiang wrote:
>>> Could you explain how the same work can be re-queued? Isn't the
>>> PENDING_BIT already set in t3? I believe queue_work shouldn't do that
>>> per the comment, but I am not an expert ...
>>
>> This is not related to the workqueue itself; it is just because raid10
>> reinitializes a work item that is already queued,
> 
> I am trying to understand the possibility.
> 
>> as I described later in t3:
>>
>> t2:
>> md_check_recovery:
>>  INIT_WORK -> clear pending
>>  queue_work -> set pending
>>   list_add_tail
>> ...
>>
>> t3: -> work is still pending
>> md_check_recovery:
>>  INIT_WORK -> clear pending
>>  queue_work -> set pending
>>   list_add_tail -> list is corrupted
> 
> First, t2 and t3 can't run in parallel since reconfig_mutex must be
> held. And if sync_thread existed, the second process would unregister
> and reap sync_thread, which means the second process will call
> INIT_WORK and queue_work again.
> 
> Maybe your description is valid, but I would prefer to call
> work_pending and flush_workqueue instead of INIT_WORK and queue_work.

That alone is not enough. It's true this can avoid the list corruption,
but the worker function md_start_sync just registers a sync_thread, and
md_do_sync() can still be in progress; hence this can't prevent a new
sync_thread from starting while the old one is not done, and other
problems like deadlock can still be triggered.
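
To illustrate the list corruption in the t2/t3 timeline above, here is a
minimal sketch (illustrative only, not the actual md code path; the
struct and function names are made up, and the real code queues the work
on md's own workqueue rather than system_wq):

#include <linux/workqueue.h>

/*
 * INIT_WORK() on a work item that is still pending clears
 * WORK_STRUCT_PENDING_BIT, so a following queue_work() succeeds again
 * and list_add_tail()s an entry that is already linked -> corruption.
 */
struct demo_ctx {
	struct work_struct work;	/* stands in for the work that runs md_start_sync() */
};

static void demo_sync_fn(struct work_struct *w)
{
	/* stands in for md_start_sync() */
}

static void demo_check_recovery(struct demo_ctx *ctx)
{
	/* called at t2, and again at t3 while the work is still pending */
	INIT_WORK(&ctx->work, demo_sync_fn);	/* clears PENDING unconditionally */
	queue_work(system_wq, &ctx->work);	/* sets PENDING, then list_add_tail() */
}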

>> Of course, our 5.10 kernel and mainline are the same in this regard;
>>
>> here are some tests:
>>
>> First, the deadlock can be reproduced reliably; the test script is simple:
>>
>> mdadm -Cv /dev/md0 -n 4 -l10 /dev/sd[abcd]
> 
> So this is raid10 while the previous problem appeared in raid456; I am
> not sure it is the same issue, but let's see.

Ok, I'm not quite familiar with raid456 yet; however, the problem is
still that action_store holds the mutex while unregistering the
sync_thread, right?

>> Then, the problem that MD_RECOVERY_RUNNING can be cleared can't be
>> reproduced reliably; usually it takes 2+ days to trigger, and the
>> symptoms differ each time. I hacked the kernel and added some BUG_ON()
>> checks for MD_RECOVERY_RUNNING in the attached patch; the following
>> test can trigger the BUG_ON:
> 
> Also, your debug patch obviously added a large delay which makes the
> calltrace happen; I doubt a user can hit it in real life. Anyway, I
> will try the test below from my side.
> 
>> mdadm -Cv /dev/md0 -e1.0 -n 4 -l 10 /dev/sd{a..d} --run
>> sleep 5
>> echo 1 > /sys/module/md_mod/parameters/set_delay
>> echo idle > /sys/block/md0/md/sync_action &
>> sleep 5
>> echo "want_replacement" > /sys/block/md0/md/dev-sdd/state
>>
>> test result:
>>
>> [  228.390237] md_check_recovery: running is set
>> [  228.391376] md_check_recovery: queue new sync thread
>> [  233.671041] action_store unregister success! delay 10s
>> [  233.689276] md_check_recovery: running is set
>> [  238.722448] md_check_recovery: running is set
>> [  238.723328] md_check_recovery: queue new sync thread
>> [  238.724851] md_do_sync: before new wor, sleep 10s
>> [  239.725818] md_do_sync: delay done
>> [  243.674828] action_store delay done
>> [  243.700102] md_reap_sync_thread: running is cleared!
>> [  243.748703] ------------[ cut here ]------------
>> [  243.749656] kernel BUG at drivers/md/md.c:9084!
> 
> After your debug patch is applied, does L9084 point to the line below?
> 
> 9084                                 mddev->curr_resync = MaxSector;

In my environment, it's a BUG_ON() that I added in md_do_sync:

9080  skip:
9081         /* set CHANGE_PENDING here since maybe another update is needed,
9082          * so other nodes are informed. It should be harmless for normal
9083          * raid */
9084         BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
9085         set_mask_bits(&mddev->sb_flags, 0,
9086                       BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));

> 
> I don't understand how it triggers the calltrace below, and it has
> nothing to do with the list corruption, right?

Yes, this is just an early BUG_ON() to detect whether
MD_RECOVERY_RUNNING is cleared while the sync_thread is still in
progress.

Thanks,
Kuai


Thread overview: 16+ messages
2023-03-22  6:41 [PATCH -next 0/6] md: fix that MD_RECOVERY_RUNNING can be cleared while sync_thread is still running Yu Kuai
2023-03-22  6:41 ` [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store" Yu Kuai
2023-03-22  7:19   ` Guoqing Jiang
2023-03-22  9:00     ` Yu Kuai
2023-03-22 14:32       ` Guoqing Jiang
2023-03-23  1:36         ` Yu Kuai [this message]
2023-03-23  3:50           ` Guoqing Jiang
2023-03-23  6:32             ` Yu Kuai
2023-03-28 23:58               ` Song Liu
2023-04-06  8:53                 ` Yu Kuai
2023-05-05  9:05                   ` Yu Kuai
2023-03-22  6:41 ` [PATCH -next 2/6] md: refactor action_store() for 'idle' and 'frozen' Yu Kuai
2023-03-22  6:41 ` [PATCH -next 3/6] md: add a mutex to synchronize idle and frozen in action_store() Yu Kuai
2023-03-22  6:41 ` [PATCH -next 4/6] md: refactor idle/frozen_sync_thread() Yu Kuai
2023-03-22  6:41 ` [PATCH -next 5/6] md: wake up 'resync_wait' at last in md_reap_sync_thread() Yu Kuai
2023-03-22  6:41 ` [PATCH -next 6/6] md: enhance checking in md_check_recovery() Yu Kuai
