From: Song Liu
Date: Tue, 28 Mar 2023 17:01:09 -0700
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
To: Marc Smith, Yu Kuai
Cc: Guoqing Jiang, Donald Buczek, linux-raid@vger.kernel.org, Linux Kernel Mailing List, it+raid@molgen.mpg.de

On Thu, Mar 16, 2023 at 8:25 AM Marc Smith wrote:
>
> On Tue, Mar 14, 2023 at 10:45 AM Marc Smith wrote:
> >
> > On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang wrote:
> > >
> > >
> > >
> > > On 3/14/23 21:25, Marc Smith wrote:
> > > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
> > > > wrote:
> > > >> Hi Donald,
> > > >>
> > > >> On 2/8/21 19:41, Donald Buczek wrote:
> > > >>> Dear Guoqing,
> > > >>>
> > > >>> On 08.02.21 15:53, Guoqing Jiang wrote:
> > > >>>>
> > > >>>> On 2/8/21 12:38, Donald Buczek wrote:
> > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
> > > >>>>>> sync_thread, like this.
> > > >>>>>>
> > > >>>>>> /* resync has finished, collect result */
> > > >>>>>> mddev_unlock(mddev);
> > > >>>>>> md_unregister_thread(&mddev->sync_thread);
> > > >>>>>> mddev_lock(mddev);
> > > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it
> > > >>>>> be a problem, if another user space operation takes the mutex?
> > > >>>> I don't think other places can be blocked while hold mutex, otherwise
> > > >>>> these places can cause potential deadlock. Please try above two lines
> > > >>>> change. And perhaps others have better idea.
> > > >>> Yes, this works. No deadlock after >11000 seconds,
> > > >>>
> > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
> > > >> Great. I will send a formal patch with your reported-by and tested-by.
> > > >>
> > > >> Thanks,
> > > >> Guoqing
> > > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> > > > of the patches that supposedly resolve this were applied to the stable
> > > > kernels, however, one was omitted due to a regression:
> > > > md: don't unregister sync_thread with reconfig_mutex held (upstream
> > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
> > > >
> > > > I don't see any follow-up on the thread from June 8th 2022 asking for
> > > > this patch to be dropped from all stable kernels since it caused a
> > > > regression.
> > > >
> > > > The patch doesn't appear to be present in the current mainline kernel
> > > > (6.3-rc2) either. So I assume this issue is still present there, or it
> > > > was resolved differently and I just can't find the commit/patch.
> > >
> > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap
> > > sync_thread in action_store".
> >
> > Okay, let me try applying that patch... it does not appear to be
> > present in my 5.4.229 kernel source. Thanks.
>
> Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap
> sync_thread in action_store"' patch on top of vanilla 5.4.229 source
> appears to fix the problem for me -- I can't reproduce the issue with
> the script, and it's been running for >24 hours now. (Previously I was
> able to induce the issue within a matter of minutes.)

Hi Marc,

Could you please run your reproducer on the md-tmp branch?

https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp

This contains a different version of the fix by Yu Kuai.

Thanks,
Song