From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3FB38C433F5 for ; Wed, 22 Dec 2021 05:17:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230274AbhLVFRo (ORCPT ); Wed, 22 Dec 2021 00:17:44 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50256 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229509AbhLVFRo (ORCPT ); Wed, 22 Dec 2021 00:17:44 -0500 Received: from wp530.webpack.hosteurope.de (wp530.webpack.hosteurope.de [IPv6:2a01:488:42:1000:50ed:8234::]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A9550C061574 for ; Tue, 21 Dec 2021 21:17:43 -0800 (PST) Received: from ip4d173d4a.dynamic.kabel-deutschland.de ([77.23.61.74] helo=[192.168.66.200]); authenticated by wp530.webpack.hosteurope.de running ExIM with esmtpsa (TLS1.3:ECDHE_RSA_AES_128_GCM_SHA256:128) id 1mzu0O-0003Nd-7b; Wed, 22 Dec 2021 06:17:36 +0100 Message-ID: Date: Wed, 22 Dec 2021 06:17:35 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.4.0 Subject: Re: [PATCH net] igb: fix deadlock caused by taking RTNL in RPM resume path Content-Language: en-BW To: "Nguyen, Anthony L" , "kuba@kernel.org" , "davem@davemloft.net" , "Brandeburg, Jesse" Cc: "Torvalds, Linus" , "gregkh@linuxfoundation.org" , "netdev@vger.kernel.org" , "intel-wired-lan@lists.osuosl.org" , "hkallweit1@gmail.com" References: <6bb28d2f-4884-7696-0582-c26c35534bae@gmail.com> From: Thorsten Leemhuis In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-bounce-key: webpack.hosteurope.de;regressions@leemhuis.info;1640150263;b493914b; X-HE-SMSGID: 1mzu0O-0003Nd-7b Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On 20.12.21 20:56, Nguyen, Anthony L wrote: > On Sun, 2021-12-19 at 09:31 +0100, Thorsten Leemhuis wrote: >> Hi, this is your Linux kernel regression tracker speaking. >> >> On 29.11.21 22:14, Heiner Kallweit wrote: >>> Recent net core changes caused an issue with few Intel drivers >>> (reportedly igb), where taking RTNL in RPM resume path results in a >>> deadlock. See [0] for a bug report. I don't think the core changes >>> are wrong, but taking RTNL in RPM resume path isn't needed. >>> The Intel drivers are the only ones doing this. See [1] for a >>> discussion on the issue. Following patch changes the RPM resume >>> path >>> to not take RTNL. >>> >>> [0] https://bugzilla.kernel.org/show_bug.cgi?id=215129 >>> [1] >>> https://lore.kernel.org/netdev/20211125074949.5f897431@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/t/ >>> >>> Fixes: bd869245a3dc ("net: core: try to runtime-resume detached >>> device in __dev_open") >>> Fixes: f32a21376573 ("ethtool: runtime-resume netdev parent before >>> ethtool ioctl ops") >>> Tested-by: Martin Stolpe >>> Signed-off-by: Heiner Kallweit >> >> Long story short: what is taken this fix so long to get mainlined? It >> to >> me seems progressing unnecessary slow, especially as it's a >> regression >> that made it into v5.15 and thus for weeks now seems to bug more and >> more people. >> >> >> The long story, starting with the background details: >> >> The quoted patch fixes a regression among others caused by >> f32a21376573 >> ("ethtool: runtime-resume netdev parent before ethtool ioctl ops"), >> which got merged for v5.15-rc1. >> >> The regression ("kernel hangs during power down") was afaik first >> reported on Wed, 24 Nov (IOW: nearly a month ago) and forwarded to >> the >> list shortly afterwards: >> https://bugzilla.kernel.org/show_bug.cgi?id=215129 >> https://lore.kernel.org/netdev/20211124144505.31e15716@hermes.local/ >> >> The quoted patch to fix the regression was posted on Mon, 29 Nov (thx >> Heiner for providing it!). Obviously reviewing patches can take a few >> days when they are complicated, as the other messages in this thread >> show. But according to >> https://bugzilla.kernel.org/show_bug.cgi?id=215129#c8 the patch was >> ACKed by Thu, 7 Dec. To quote: ```The patch is on its way via the >> Intel >> network driver tree: >> https://kernel.googlesource.com/pub/scm/linux/kernel/git/tnguy/net-queue/+/refs/heads/dev-queue``` >> >> And that's where the patch afaics still is. It hasn't even reached >> linux-next yet, unless I'm missing something. A merge into mainline >> thus >> is not even in sight; this seems especially bad with the holiday >> season >> coming up, as getting the fix mainlined is a prerequisite to get it >> backported to 5.15.y, as our latest stable kernel is affected by >> this. > > I've been waiting for our validation team to get to this patch to do > some additional testing. However, as you mentioned, with the holidays > coming up, it seems the tester is now out. As it looks like some in the > community have been able to do some testing on this, I'll go ahead and > send this on. Thx. I see the patch now in addition to dev-queue is also in master of this repo: https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue.git/ But the fix still didn't make it in todays linux-next. Seems neither your master branch nor branches like '1GbE' (which seem to be the ones from which such fixes later get send to the net tree) are in linux-next afaic. Just wondering: Wouldn't it be better if they were? This would allow the users of linux-next and CIs checking it to test the fix before it's send to the net tree, which last week seems to have happened only a few hours (6209dd778f66) before net was merged into mainline (180f3bcfe362). Ciao, Thorsten