From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.4 required=3.0 tests=BUG6152_INVALID_DATE_TZ_ABSURD,DKIMWL_WL_HIGH,DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,INVALID_DATE_TZ_ABSURD,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 82C6FC5DF62 for ; Wed, 6 Nov 2019 12:16:39 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 4B5D220869 for ; Wed, 6 Nov 2019 12:16:39 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1573042599; bh=QQeB/GNauR9IPFh/DgcPHF2K7CGPkOycgO+5MUz3IlI=; h=To:Subject:Date:From:Cc:In-Reply-To:References:List-ID:From; b=J0/D3zj3vVIVGI/lU3Okkc2g+8KIiN8xcZLBkqFHybkKM0b2RmdG8xPKba7SsXoIu rwwmXej/H0nrAriM2EmZK1uJPpzugLscBIqQH6O9TagX+lAGel13k3DvXJqUd9Trzx EfmyJ/56HfnnVJ+sjieVuAheDBwc46hopN9A8Fus= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729894AbfKFMQi (ORCPT ); Wed, 6 Nov 2019 07:16:38 -0500 Received: from inca-roads.misterjones.org ([213.251.177.50]:52750 "EHLO inca-roads.misterjones.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725856AbfKFMQi (ORCPT ); Wed, 6 Nov 2019 07:16:38 -0500 Received: from www-data by cheepnis.misterjones.org with local (Exim 4.80) (envelope-from ) id 1iSKEl-0001CL-Ev; Wed, 06 Nov 2019 13:16:35 +0100 To: Salil Mehta Subject: RE: [PATCH] net: hns: Ensure that interface teardown cannot race with TX interrupt X-PHP-Originating-Script: 0:main.inc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Date: Wed, 06 Nov 2019 13:25:56 +0109 From: Marc Zyngier Cc: , , "lipeng (Y)" , "Zhuangyuzeng (Yisen)" , "David S . Miller" , Linuxarm In-Reply-To: <2311b5965adb4ccea83b6072115efc6c@huawei.com> References: <20191104195604.17109-1-maz@kernel.org> <20191106081748.0e21554c@why> <2311b5965adb4ccea83b6072115efc6c@huawei.com> Message-ID: <21493d3d08936d7ed67f7153cdaa418e@www.loen.fr> X-Sender: maz@kernel.org User-Agent: Roundcube Webmail/0.7.2 X-SA-Exim-Connect-IP: X-SA-Exim-Rcpt-To: salil.mehta@huawei.com, edumazet@google.com, netdev@vger.kernel.org, lipeng321@huawei.com, yisen.zhuang@huawei.com, davem@davemloft.net, linuxarm@huawei.com X-SA-Exim-Mail-From: maz@kernel.org X-SA-Exim-Scanned: No (on cheepnis.misterjones.org); SAEximRunCond expanded to false Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On 2019-11-06 12:28, Salil Mehta wrote: > Hi Marc, > >> On Tue, 5 Nov 2019 18:41:11 +0000 >> Salil Mehta wrote: >> >> Hi Salil, >> >> > Hi Marc, >> > I tested with the patch on D05 with the lockdep enabled kernel >> with below options >> > and I could not reproduce the deadlock. I do not argue the issue >> being mentioned >> > as this looks to be a clear bug which should hit while TX >> data-path is running >> > and we try to disable the interface. >> >> Lockdep screaming at you doesn't mean the deadly scenario happens in >> practice, and I've never seen the machine hanging in these >> conditions. >> But I've also never tried to trigger it in anger. >> >> > Could you please help me know the exact set of steps you used to >> get into this >> > problem. Also, are you able to re-create it easily/frequently? >> >> I just need to issue "reboot" (which calls "ip link ... down") for >> this >> to trigger. Here's a full splat[1], as well as my full config[2]. It >> is >> 100% repeatable. >> >> > # Kernel Config options: >> > CONFIG_LOCKDEP_SUPPORT=y >> > CONFIG_LOCKDEP=y >> >> You'll need at least >> >> CONFIG_PROVE_LOCKING=y >> CONFIG_NET_POLL_CONTROLLER=y > > > Few points: > 1. To me netpoll causing spinlock deadlock with IRQ leg of TX and ip > util is > highly unlikely since netpoll runs with both RX/TX interrupts > disabled. > It runs in polling mode to facilitate parallel path to features > like > Netconsole, netdump etc. hence, deadlock because of the netpoll > should > be highly unlikely. Therefore, smells of some other problem > here... > 2. Also, I remember patch[s1][s2] from Eric Dumazet to disable > netpoll on many > NICs way back in 4.19 kernel on the basis of Song Liu's findings. > > Problem: > Aah, I see the problem now, it is because of the stray code related > to the > NET_POLL_CONTROLLER in hns driver which actually should have got > remove within > the patch[s1], and that also explains why it does not get hit while > NET POLL > is disabled. > > > /* netif_tx_lock will turn down the performance, set only when > necessary */ > #ifdef CONFIG_NET_POLL_CONTROLLER > #define NETIF_TX_LOCK(ring) spin_lock(&(ring)->lock) > #define NETIF_TX_UNLOCK(ring) spin_unlock(&(ring)->lock) > #else > #define NETIF_TX_LOCK(ring) > #define NETIF_TX_UNLOCK(ring) > #endif > > > Once you define CONFIG_NET_POLL_CONTROLLER in the latest code these > macros > Kick-in even for the normal NAPI path. Which can cause deadlock and > that perhaps > is what you are seeing? Yes, that's the problem. > Now, the question is do we require these locks in normal NAPI poll? I > do not > see that we need them anymore as Tasklets are serialized to > themselves and > configuration path like "ip down" cannot conflict with NAPI > poll path > as the later is always disabled prior performing interface down > operation. > Hence, no conflict there. My preference would indeed be to drop these per-queue locks if they aren't required. I couldn't figure out from a cursory look at the code whether two CPUs could serve the same TX queue. If that cannot happen by construction, then these locks are perfectly useless and should be removed. > > As a side analysis, I could figure out some contentions in the > configuration > path not related to this though. :) > > > Suggested Solution: > Since we do not have support of NET_POLL_CONTROLLER macros > NETIF_TX_[UN]LOCK > We should remove these NET_POLL_CONTROLLER macros altogether for now. > > Though, I still have not looked comprehensively how other are able to > use > Debugging utils like netconsole etc without having > NET_POLL_CONTROLLER. > Maybe @Eric Dumazet might give us some insight on this? > > > If you agree with this then I can send a patch to remove these from > hns > driver. This should solve your problem as well? Sure, as long as you can guarantee that these locks are never used for anything useful. Thanks, M. -- Jazz is not dead. It just smells funny...