From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 30 Nov 2020 20:48:28 +0200
From: Vladimir Oltean
To: Jakub Kicinski
Cc: Eric Dumazet, Stephen Hemminger, netdev, Paul Gortmaker, Jiri Benc,
	Or Gerlitz, Cong Wang, Jamal Hadi Salim, Andrew Lunn, Florian Fainelli
Subject: Re: Correct usage of dev_base_lock in 2020
Message-ID: <20201130184828.x56bwxxiwydsxt3k@skbuf>
References: <20201129182435.jgqfjbekqmmtaief@skbuf>
 <20201129205817.hti2l4hm2fbp2iwy@skbuf>
 <20201129211230.4d704931@hermes.local>
 <20201130101405.73901b17@kicinski-fedora-pc1c0hjn.DHCP.thefacebook.com>
In-Reply-To:
<20201130101405.73901b17@kicinski-fedora-pc1c0hjn.DHCP.thefacebook.com>
X-Mailing-List: netdev@vger.kernel.org

On Mon, Nov 30, 2020 at 10:14:05AM -0800, Jakub Kicinski wrote:
> On Mon, 30 Nov 2020 11:41:10 +0100 Eric Dumazet wrote:
> > > So dev_base_lock dates back to the Big Kernel Lock breakup back in
> > > Linux 2.4 (i.e. before my time). The time has come to get rid of it.
> > >
> > > The use in sysfs is there because it could be changed to RCU. There
> > > have been issues in the past with sysfs causing lock inversions with
> > > the rtnl mutex, which is why you will see some trylock code there.
> > >
> > > My guess is that the dev_base_lock readers exist only because no one
> > > bothered to do the RCU conversion.
> >
> > I think we did, a long time ago.
> >
> > We took care of all 'fast paths' already.
> >
> > Not sure what is needed, the current situation does not bother me at all ;)
>
> Perhaps Vladimir has a plan to post separately about it (in that case
> sorry for jumping ahead), but the initial problem was procfs, which is
> (hopefully mostly irrelevant by now, and) taking only the RCU lock,
> therefore forcing drivers to have re-entrant, non-sleeping
> .ndo_get_stats64 implementations.

Right, the reason I'm even looking at this is that I want to convert all
callers of dev_get_stats to use sleepable rather than atomic context.
That makes it easier to gather statistics from devices that have
firmware, or from off-chip devices behind a slow bus like SPI.

As Jakub pointed out, some places call dev_get_stats while iterating
through the list of network interfaces - procfs is one, but not the only
one. These callers are pure readers, so they use RCU protection, but
that puts us in atomic context when calling dev_get_stats.

The naive solution is to convert all those callers to hold the RTNL
mutex, which is the writer-side protection for the network interface
lists and which is sleepable. In fact, I have a series of 8 patches
that does exactly that.
But there are some weirder cases, such as the bonding driver, where I
need to do this:

-----------------------------[cut here]-----------------------------
From 369a0e18a2446cda8ff52d72c02aa144ae6687ec Mon Sep 17 00:00:00 2001
From: Vladimir Oltean
Date: Mon, 30 Nov 2020 02:39:46 +0200
Subject: [PATCH] net: bonding: retrieve device statistics under RTNL, not RCU

In the effort of making .ndo_get_stats64 able to sleep, we need to
ensure that the callers of dev_get_stats do not use atomic context.

The bonding driver uses an RCU read-side critical section to ensure the
integrity of the list of network interfaces, because it iterates through
all net devices in the netns to find the ones which are its configured
slaves. We still need some protection against an interface registering
or deregistering, and the writer-side lock, the RTNL mutex, is fine for
that, because it offers sleepable context. We take the RTNL this way
(checking rtnl_is_locked first) because the RTNL is not guaranteed to be
held by all callers of ndo_get_stats64; in fact there will be work in
the future to avoid as much RTNL-holding as possible.
Signed-off-by: Vladimir Oltean
---
 drivers/net/bonding/bond_main.c | 18 +++++++-----------
 include/net/bonding.h           |  1 -
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e0880a3840d7..1d44534e95d2 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3738,21 +3738,17 @@ static void bond_get_stats(struct net_device *bond_dev,
 			   struct rtnl_link_stats64 *stats)
 {
 	struct bonding *bond = netdev_priv(bond_dev);
+	bool rtnl_locked = rtnl_is_locked();
 	struct rtnl_link_stats64 temp;
 	struct list_head *iter;
 	struct slave *slave;
-	int nest_level = 0;
 
+	if (!rtnl_locked)
+		rtnl_lock();
 
-	rcu_read_lock();
-#ifdef CONFIG_LOCKDEP
-	nest_level = bond_get_lowest_level_rcu(bond_dev);
-#endif
-
-	spin_lock_nested(&bond->stats_lock, nest_level);
 	memcpy(stats, &bond->bond_stats, sizeof(*stats));
 
-	bond_for_each_slave_rcu(bond, slave, iter) {
+	bond_for_each_slave(bond, slave, iter) {
 		const struct rtnl_link_stats64 *new =
 			dev_get_stats(slave->dev, &temp);
 
@@ -3763,8 +3759,9 @@ static void bond_get_stats(struct net_device *bond_dev,
 	}
 
 	memcpy(&bond->bond_stats, stats, sizeof(*stats));
-	spin_unlock(&bond->stats_lock);
-	rcu_read_unlock();
+
+	if (!rtnl_locked)
+		rtnl_unlock();
 }
 
 static int bond_do_ioctl(struct net_device *bond_dev, struct ifreq *ifr, int cmd)
@@ -5192,7 +5189,6 @@ static int bond_init(struct net_device *bond_dev)
 	if (!bond->wq)
 		return -ENOMEM;
 
-	spin_lock_init(&bond->stats_lock);
 	netdev_lockdep_set_classes(bond_dev);
 
 	list_add_tail(&bond->bond_list, &bn->dev_list);
diff --git a/include/net/bonding.h b/include/net/bonding.h
index d9d0ff3b0ad3..6fbde9713424 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -224,7 +224,6 @@ struct bonding {
 	 * ALB mode (6) - to sync the use and modifications of its hash table
 	 */
 	spinlock_t mode_lock;
-	spinlock_t stats_lock;
 	u8 send_peer_notif;
 	u8 igmp_retrans;
 #ifdef CONFIG_PROC_FS
-----------------------------[cut here]-----------------------------

Only that there's a problem with this. It's the lock ordering that
Stephen talked about.

cat /sys/class/net/bond0/statistics/*
[   38.715829]
[   38.717329] ======================================================
[   38.723528] WARNING: possible circular locking dependency detected
[   38.729730] 5.10.0-rc5-next-20201127-00016-gde02bf51f2fa #1481 Not tainted
[   38.736628] ------------------------------------------------------
[   38.742828] cat/602 is trying to acquire lock:
[   38.747285] ffffcf4908119080 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x20/0x28
[   38.754555]
[   38.754555] but task is already holding lock:
[   38.760406] ffff53364222c168 (kn->active#123){++++}-{0:0}, at: kernfs_seq_start+0x38/0xb0
[   38.768631]
[   38.768631] which lock already depends on the new lock.
[   38.768631]
[   38.776835]
[   38.776835] the existing dependency chain (in reverse order) is:
[   38.784341]
[   38.784341] -> #1 (kn->active#123){++++}-{0:0}:
[   38.790387]        lock_acquire+0x238/0x468
[   38.794583]        __kernfs_remove+0x26c/0x3e0
[   38.799040]        kernfs_remove_by_name_ns+0x5c/0xb8
[   38.804107]        remove_files.isra.1+0x40/0x80
[   38.808739]        sysfs_remove_group+0x50/0xb0
[   38.813282]        sysfs_remove_groups+0x3c/0x58
[   38.817913]        device_remove_attrs+0x64/0x98
[   38.822544]        device_del+0x16c/0x3f8
[   38.826566]        netdev_unregister_kobject+0x80/0x90
[   38.831722]        rollback_registered_many+0x444/0x688
[   38.836964]        unregister_netdevice_many.part.163+0x20/0xa8
[   38.842904]        unregister_netdevice_many+0x2c/0x38
[   38.848059]        rtnl_delete_link+0x5c/0x90
[   38.852428]        rtnl_dellink+0x158/0x310
[   38.856622]        rtnetlink_rcv_msg+0x298/0x4d8
[   38.861254]        netlink_rcv_skb+0x64/0x130
[   38.865625]        rtnetlink_rcv+0x28/0x38
[   38.869733]        netlink_unicast+0x1dc/0x288
[   38.874190]        netlink_sendmsg+0x2b0/0x3e8
[   38.878647]        ____sys_sendmsg+0x27c/0x2c0
[   38.883105]        ___sys_sendmsg+0x90/0xd0
[   38.887299]        __sys_sendmsg+0x7c/0xd0
[   38.891407]        __arm64_sys_sendmsg+0x2c/0x38
[   38.896040]        el0_svc_common.constprop.3+0x80/0x1b0
[   38.901369]        do_el0_svc+0x34/0xa0
[   38.905216]        el0_sync_handler+0x138/0x198
[   38.909760]        el0_sync+0x140/0x180
[   38.913604]
[   38.913604] -> #0 (rtnl_mutex){+.+.}-{4:4}:
[   38.919295]        check_noncircular+0x154/0x168
[   38.923926]        __lock_acquire+0x1118/0x15e0
[   38.928470]        lock_acquire+0x238/0x468
[   38.932667]        __mutex_lock+0xa4/0x970
[   38.936776]        mutex_lock_nested+0x54/0x70
[   38.941233]        rtnl_lock+0x20/0x28
[   38.944993]        bond_get_stats+0x140/0x1a8
[   38.949363]        dev_get_stats+0xdc/0xf0
[   38.953472]        netstat_show.isra.25+0x50/0xa0
[   38.958190]        collisions_show+0x2c/0x38
[   38.962472]        dev_attr_show+0x3c/0x78
[   38.966580]        sysfs_kf_seq_show+0xbc/0x138
[   38.971124]        kernfs_seq_show+0x44/0x50
[   38.975407]        seq_read_iter+0xe4/0x450
[   38.979601]        seq_read+0xf8/0x148
[   38.983359]        kernfs_fop_read+0x1e0/0x280
[   38.987817]        vfs_read+0xac/0x1c8
[   38.991576]        ksys_read+0x74/0xf8
[   38.995335]        __arm64_sys_read+0x24/0x30
[   38.999706]        el0_svc_common.constprop.3+0x80/0x1b0
[   39.005035]        do_el0_svc+0x34/0xa0
[   39.008882]        el0_sync_handler+0x138/0x198
[   39.013426]        el0_sync+0x140/0x180
[   39.017270]
[   39.017270] other info that might help us debug this:
[   39.017270]
[   39.025300]  Possible unsafe locking scenario:
[   39.025300]
[   39.031236]        CPU0                    CPU1
[   39.035779]        ----                    ----
[   39.040321]   lock(kn->active#123);
[   39.043826]                                lock(rtnl_mutex);
[   39.049507]                                lock(kn->active#123);
[   39.055540]   lock(rtnl_mutex);
[   39.058691]
[   39.058691]  *** DEADLOCK ***
[   39.058691]
[   39.064630] 3 locks held by cat/602:
[   39.068212]  #0: ffff533645926f70 (&p->lock){+.+.}-{4:4}, at: seq_read_iter+0x64/0x450
[   39.076173]  #1: ffff533643bfa090 (&of->mutex){+.+.}-{4:4}, at: kernfs_seq_start+0x30/0xb0
[   39.084482]  #2: ffff53364222c168 (kn->active#123){++++}-{0:0}, at: kernfs_seq_start+0x38/0xb0

Ok, I admit I don't really understand the message, because struct
kernfs_node::active is an atomic_t and not really a lock, but
intuitively I think lockdep is unhappy that the RTNL mutex is not being
used here as a
top-level mutex.

If somebody cats /sys/class/net/bond0/statistics/*, the kernfs node
locking happens first, and then the RTNL mutex is acquired by the
bonding implementation of ndo_get_stats64. If somebody deletes the
bonding device via an "ip link del" rtnetlink message, the RTNL mutex
is held first, and the kernfs node lock corresponding to the
interface's sysfs attributes is taken second. I don't think there is
really any potential deadlock here, only an ordering issue between
lockdep classes.

How I think this could be solved is by making the network interface
lists use some other writer-side protection than the big RTNL mutex.
So I went on to see exactly how knee-deep I would need to get into
that. There are 2 separate classes of problems:

- We already have two ways of protecting pure readers: via RCU and via
  the dev_base_lock rwlock. It doesn't help if we also add a second way
  of locking out writers; we need to first clean up what we have.
  That's the reason why I started the discussion about the rwlock.

- Some callers appear to not use any sort of protection at all. Does
  this code path look ok to you?

  nfnetlink_rcv_batch
  -> NFT_MSG_NEWCHAIN
     -> nf_tables_addchain
        -> nft_chain_parse_hook
           -> nft_chain_parse_netdev
              -> nf_tables_parse_netdev_hooks
                 -> nft_netdev_hook_alloc
                    -> __dev_get_by_name
                       -> netdev_name_node_lookup: must be under RTNL
                          mutex or dev_base_lock protection

  Unless I'm missing something, there are just tons of callers of
  __dev_get_by_name which hold neither the RTNL mutex, nor the
  dev_base_lock rwlock, nor an RCU read-side critical section.

What I was thinking is that if we added a separate mutex for the
network interface lists (and this time made it per netns, not global to
the kernel), then we could take that lock a lot more locally. I.e.
__dev_get_by_name could take the lock itself, alleviating the need for
callers to hold the RTNL mutex. That would solve some of these bugs
too, and would also be more usable for the bonding case.