From: Yu Kuai <yukuai1@huaweicloud.com>
To: Dragan Stancevic <dragan@stancevic.com>,
Yu Kuai <yukuai1@huaweicloud.com>,
song@kernel.org
Cc: buczek@molgen.mpg.de, guoqing.jiang@linux.dev,
it+raid@molgen.mpg.de, linux-kernel@vger.kernel.org,
linux-raid@vger.kernel.org, msmith626@gmail.com,
"yangerkun@huawei.com" <yangerkun@huawei.com>,
"yukuai (C)" <yukuai3@huawei.com>
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
Date: Thu, 24 Aug 2023 09:18:58 +0800
Message-ID: <07d5c7c2-c444-8747-ed6d-ca24231decd8@huaweicloud.com>
In-Reply-To: <2061b123-6332-1456-e7c3-b713752527fb@stancevic.com>
Hi,
On 2023/08/23 23:33, Dragan Stancevic wrote:
> Hi Kuai-
>
> On 8/22/23 20:22, Yu Kuai wrote:
>> Hi,
>>
>> On 2023/08/23 5:16, Dragan Stancevic wrote:
>>> On Tue, 3/28/23 17:01 Song Liu wrote:
>>>> On Thu, Mar 16, 2023 at 8:25 AM Marc Smith <msmith626@gmail.com> wrote:
>>>> >
>>>> > On Tue, Mar 14, 2023 at 10:45 AM Marc Smith <msmith626@gmail.com> wrote:
>>>> > >
>>>> > > On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>>>> > > >
>>>> > > >
>>>> > > >
>>>> > > > On 3/14/23 21:25, Marc Smith wrote:
>>>> > > > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
>>>> > > > > <guoqing.jiang@cloud.ionos.com> wrote:
>>>> > > > >> Hi Donald,
>>>> > > > >>
>>>> > > > >> On 2/8/21 19:41, Donald Buczek wrote:
>>>> > > > >>> Dear Guoqing,
>>>> > > > >>>
>>>> > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>> > > > >>>>
>>>> > > > >>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>> > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>> > > > >>>>>> sync_thread, like this.
>>>> > > > >>>>>>
>>>> > > > >>>>>> /* resync has finished, collect result */
>>>> > > > >>>>>> mddev_unlock(mddev);
>>>> > > > >>>>>> md_unregister_thread(&mddev->sync_thread);
>>>> > > > >>>>>> mddev_lock(mddev);
>>>> > > > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>>> > > > >>>>> be a problem, if another user space operation takes the mutex?
>>>> > > > >>>> I don't think other places can be blocked while hold mutex, otherwise
>>>> > > > >>>> these places can cause potential deadlock. Please try above two lines
>>>> > > > >>>> change. And perhaps others have better idea.
>>>> > > > >>> Yes, this works. No deadlock after >11000 seconds,
>>>> > > > >>>
>>>> > > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>>> > > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>>> > > > >> Great. I will send a formal patch with your reported-by and tested-by.
>>>> > > > >>
>>>> > > > >> Thanks,
>>>> > > > >> Guoqing
>>>> > > > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
>>>> > > > > of the patches that supposedly resolve this were applied to the stable
>>>> > > > > kernels, however, one was omitted due to a regression:
>>>> > > > > md: don't unregister sync_thread with reconfig_mutex held (upstream
>>>> > > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
>>>> > > > >
>>>> > > > > I don't see any follow-up on the thread from June 8th 2022 asking for
>>>> > > > > this patch to be dropped from all stable kernels since it caused a
>>>> > > > > regression.
>>>> > > > >
>>>> > > > > The patch doesn't appear to be present in the current mainline kernel
>>>> > > > > (6.3-rc2) either. So I assume this issue is still present there, or it
>>>> > > > > was resolved differently and I just can't find the commit/patch.
>>>> > > >
>>>> > > > It should be fixed by commit 9dfbdafda3b3 "md: unlock mddev before reap
>>>> > > > sync_thread in action_store".
>>>> > >
>>>> > > Okay, let me try applying that patch... it does not appear to be
>>>> > > present in my 5.4.229 kernel source. Thanks.
>>>> >
>>>> > Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap
>>>> > sync_thread in action_store"' patch on top of vanilla 5.4.229 source
>>>> > appears to fix the problem for me -- I can't reproduce the issue with
>>>> > the script, and it's been running for >24 hours now. (Previously I was
>>>> > able to induce the issue within a matter of minutes.)
>>>>
>>>> Hi Marc,
>>>>
>>>> Could you please run your reproducer on the md-tmp branch?
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp
>>>>
>>>> This contains a different version of the fix by Yu Kuai.
>>>>
>>>> Thanks,
>>>> Song
>>>>
>>>
>>> Hi Song, I can easily reproduce this issue on 5.10.133 and 5.10.53.
>>> The change 9dfbdafda3b3 "md: unlock mddev before reap sync_thread in
>>> action_store" does not fix the issue for me.
>>>
>>> But I did pull the changes from the md-tmp branch you are referring to:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp
>>>
>>> I was not totally clear on which change exactly to pull, but I pulled
>>> the following changes:
>>> 2023-03-28  md: enhance checking in md_check_recovery() [md-tmp]  Yu Kuai  1  -7/+15
>>> 2023-03-28  md: wake up 'resync_wait' at last in md_reap_sync_thread()  Yu Kuai  1  -1/+1
>>> 2023-03-28  md: refactor idle/frozen_sync_thread()  Yu Kuai  2  -4/+22
>>> 2023-03-28  md: add a mutex to synchronize idle and frozen in action_store()  Yu Kuai  2  -0/+8
>>> 2023-03-28  md: refactor action_store() for 'idle' and 'frozen'  Yu Kuai  1  -16/+45
>>>
>>> I used to be able to reproduce the lockup within minutes, but with those
>>> changes the test system has been running for more than 120 hours.
>>>
>>> When you said a "different fix", can you confirm that I grabbed the
>>> right changes and that I need all 5 of them?
>>
>> Yes, you grabbed the right changes, and these patches are merged to
>> linux-next as well.
>>>
>>> My second question: has this fix been submitted upstream yet?
>>> If so, in which kernel version?
>>
>> This fix is currently in linux-next, and will be applied to v6.6-rc1
>> soon.
>
> Thank you, that is great news. I'd like to see this change backported to
> 5.10 and 6.1, do you have any plans of backporting to any of the
> previous kernels?
>
> If not, I would like to try to get your changes into 5.10 and 6.1 if
> Greg will accept them.
>
I don't have plans yet, so feel free to do this. I guess these patches
won't be picked up automatically because of the conflict. Feel free to
ask if you run into any problems.

Thanks,
Kuai
>
> Four out of five of your changes were a straight cherry-pick into 5.10,
> one needed a minor conflict resolution. But I can definitely confirm
> that your changes fix the lockup issue on 5.10.
>
> I am now switching to 6.1 and will test the changes there too.
>
>
> Thanks
>
>
> --
> Peace can only come as a natural consequence
> of universal enlightenment -Dr. Nikola Tesla