netdev.vger.kernel.org archive mirror
* Re: Correct usage of dev_base_lock in 2020
       [not found] <20201129182435.jgqfjbekqmmtaief@skbuf>
@ 2020-11-29 20:58 ` Vladimir Oltean
  2020-11-30  5:12   ` Stephen Hemminger
  0 siblings, 1 reply; 28+ messages in thread
From: Vladimir Oltean @ 2020-11-29 20:58 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, Paul Gortmaker, Stephen Hemminger,
	Eric Dumazet, Jiri Benc, Or Gerlitz, Cong Wang, Jamal Hadi Salim
  Cc: Andrew Lunn, Florian Fainelli

[ resent, had forgotten to copy the list ]

Hi,

net/core/dev.c has this to say about the locking rules around the network
interface lists (dev_base_head, and I can only assume that it also applies to
the per-ifindex hash table dev_index_head and the per-name hash table
dev_name_head):

/*
 * The @dev_base_head list is protected by @dev_base_lock and the rtnl
 * semaphore.
 *
 * Pure readers hold dev_base_lock for reading, or rcu_read_lock()
 *
 * Writers must hold the rtnl semaphore while they loop through the
 * dev_base_head list, and hold dev_base_lock for writing when they do the
 * actual updates.  This allows pure readers to access the list even
 * while a writer is preparing to update it.
 *
 * To put it another way, dev_base_lock is held for writing only to
 * protect against pure readers; the rtnl semaphore provides the
 * protection against other writers.
 *
 * See, for example usages, register_netdevice() and
 * unregister_netdevice(), which must be called with the rtnl
 * semaphore held.
 */

However, as of today, most if not all of the read-side accessors of the network
interface lists have been converted to run under rcu_read_lock. As Eric explains:

commit fb699dfd426a189fe33b91586c15176a75c8aed0
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Mon Oct 19 19:18:49 2009 +0000

    net: Introduce dev_get_by_index_rcu()

    Some workloads hit dev_base_lock rwlock pretty hard.
    We can use RCU lookups to avoid touching this rwlock.

    netdevices are already freed after a RCU grace period, so this patch
    adds no penalty at device dismantle time.

    dev_ifname() converted to dev_get_by_index_rcu()

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

A lot of work has been put into eliminating the dev_base_lock rwlock
completely, as Stephen explained here:

[PATCH 00/10] netdev: get rid of read_lock(&dev_base_lock) usages
https://www.spinics.net/lists/netdev/msg112264.html

However, its use has not been completely eliminated. It is still there and,
even more confusingly, so is that comment in net/core/dev.c. What I see
dev_base_lock being used for now is a collection of complete oddballs.

- The debugfs for mac80211, in net/mac80211/debugfs_netdev.c, holds the read
  side when printing some interface properties (good luck disentangling the
  code and figuring out which ones, though). What is that read-side actually
  protecting against?

- HSR, in net/hsr/hsr_device.c (called from hsr_netdev_notify on NETDEV_UP,
  NETDEV_DOWN and NETDEV_CHANGE), takes the write side of the lock when
  modifying the RFC 2863 operstate of the interface. Why?
  Actually, accessing the RFC 2863 operstate is the most widespread use of
  dev_base_lock in the kernel today. I could only find this truncated
  discussion in the archives:
    Re: Issue 0 WAS (Re: Oustanding issues WAS(IRe: Consensus? WAS(RFC 2863)
    https://www.mail-archive.com/netdev@vger.kernel.org/msg03632.html
  and it said:

    > be transitioned to up/dormant etc. So an ethernet driver doesnt know it
    > needs to go from detecting peer link is up to next being authenticated
    > in the case of 802.1x. It just calls netif_carrier_on which checks
    > link_mode to decide on transition.

    we could protect operstate with a spinlock_irqsave() and then change it either
    from netif_[carrier|dormant]_on/off() or userspace-supplicant. However, I'm
    not feeling good about it. Look at rtnetlink_fill_ifinfo(), it is able to
    query a consistent snapshot of all interface settings as long as locking with
    dev_base_lock and rtnl is obeyed. __LINK_STATE flags are already an
    exemption, and I don't want operstate to be another. That's why I chose
    setting it from linkwatch in process context, and I really think this is the
    correct approach.

- rfc2863_policy() in net/core/link_watch.c seems to be the major writer that
  holds this lock in 2020, together with do_setlink() and set_operstate() from
  net/core/rtnetlink.c. Has the lock been repurposed over the years, and
  should we update its name accordingly?

- This usage from netdev_show() in net/core/net-sysfs.c just looks random to
  me, maybe somebody can explain:

	read_lock(&dev_base_lock);
	if (dev_isalive(ndev))
		ret = (*format)(ndev, buf);
	read_unlock(&dev_base_lock);

- This also looks like nonsense to me, maybe somebody can explain.
  drivers/infiniband/hw/mlx4/main.c, function mlx4_ib_update_qps():

	read_lock(&dev_base_lock);
	new_smac = mlx4_mac_to_u64(dev->dev_addr);
	read_unlock(&dev_base_lock);

  where mlx4_mac_to_u64 does:

static inline u64 mlx4_mac_to_u64(u8 *addr)
{
	u64 mac = 0;
	int i;

	for (i = 0; i < ETH_ALEN; i++) {
		mac <<= 8;
		mac |= addr[i];
	}
	return mac;
}

  basically a duplicate of ether_addr_to_u64. So I can only assume that
  dev_base_lock was taken to protect against... what, changes to
  dev->dev_addr? :)

So it's clear that the dev_base_lock needs to be at least renamed, if not
removed (and at least some instances of it removed). But it's not clear what to
rename it to.

Thanks,
-Vladimir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-29 20:58 ` Correct usage of dev_base_lock in 2020 Vladimir Oltean
@ 2020-11-30  5:12   ` Stephen Hemminger
  2020-11-30 10:41     ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2020-11-30  5:12 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, Jakub Kicinski, Paul Gortmaker, Eric Dumazet, Jiri Benc,
	Or Gerlitz, Cong Wang, Jamal Hadi Salim, Andrew Lunn,
	Florian Fainelli

On Sun, 29 Nov 2020 22:58:17 +0200
Vladimir Oltean <olteanv@gmail.com> wrote:

> [ resent, had forgot to copy the list ]
> 
> Hi,
> 
> net/core/dev.c has this to say about the locking rules around the network
> interface lists (dev_base_head, and I can only assume that it also applies to
> the per-ifindex hash table dev_index_head and the per-name hash table
> dev_name_head):
> 
> /*
>  * The @dev_base_head list is protected by @dev_base_lock and the rtnl
>  * semaphore.
>  *
>  * Pure readers hold dev_base_lock for reading, or rcu_read_lock()
>  *
>  * Writers must hold the rtnl semaphore while they loop through the
>  * dev_base_head list, and hold dev_base_lock for writing when they do the
>  * actual updates.  This allows pure readers to access the list even
>  * while a writer is preparing to update it.
>  *
>  * To put it another way, dev_base_lock is held for writing only to
>  * protect against pure readers; the rtnl semaphore provides the
>  * protection against other writers.
>  *
>  * See, for example usages, register_netdevice() and
>  * unregister_netdevice(), which must be called with the rtnl
>  * semaphore held.
>  */
> 
> However, as of today, most if not all the read-side accessors of the network
> interface lists have been converted to run under rcu_read_lock. As Eric explains,
> 
> commit fb699dfd426a189fe33b91586c15176a75c8aed0
> Author: Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Mon Oct 19 19:18:49 2009 +0000
> 
>     net: Introduce dev_get_by_index_rcu()
> 
>     Some workloads hit dev_base_lock rwlock pretty hard.
>     We can use RCU lookups to avoid touching this rwlock.
> 
>     netdevices are already freed after a RCU grace period, so this patch
>     adds no penalty at device dismantle time.
> 
>     dev_ifname() converted to dev_get_by_index_rcu()
> 
>     Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> A lot of work has been put into eliminating the dev_base_lock rwlock
> completely, as Stephen explained here:
> 
> [PATCH 00/10] netdev: get rid of read_lock(&dev_base_lock) usages
> https://www.spinics.net/lists/netdev/msg112264.html
> 
> However, its use has not been completely eliminated. It is still there, and
> even more confusingly, that comment in net/core/dev.c is still there. What I
> see the dev_base_lock being used for now are complete oddballs.
> 
> - The debugfs for mac80211, in net/mac80211/debugfs_netdev.c, holds the read
>   side when printing some interface properties (good luck disentangling the
>   code and figuring out which ones, though). What is that read-side actually
>   protecting against?
> 
> - HSR, in net/hsr/hsr_device.c (called from hsr_netdev_notify on NETDEV_UP
>   NETDEV_DOWN and NETDEV_CHANGE), takes the write-side of the lock when
>   modifying the RFC 2863 operstate of the interface. Why?
>   Actually the use of dev_base_lock is the most widespread in the kernel today
>   when accessing the RFC 2863 operstate. I could only find this truncated
>   discussion in the archives:
>     Re: Issue 0 WAS (Re: Oustanding issues WAS(IRe: Consensus? WAS(RFC 2863)
>     https://www.mail-archive.com/netdev@vger.kernel.org/msg03632.html
>   and it said:
> 
>     > be transitioned to up/dormant etc. So an ethernet driver doesnt know it
>     > needs to go from detecting peer link is up to next being authenticated
>     > in the case of 802.1x. It just calls netif_carrier_on which checks
>     > link_mode to decide on transition.  
> 
>     we could protect operstate with a spinlock_irqsave() and then change it either
>     from netif_[carrier|dormant]_on/off() or userspace-supplicant. However, I'm
>     not feeling good about it. Look at rtnetlink_fill_ifinfo(), it is able to
>     query a consistent snapshot of all interface settings as long as locking with
>     dev_base_lock and rtnl is obeyed. __LINK_STATE flags are already an
>     exemption, and I don't want operstate to be another. That's why I chose
>     setting it from linkwatch in process context, and I really think this is the
>     correct approach.
> 
> - rfc2863_policy() in net/core/link_watch.c seems to be the major writer that
>   holds this lock in 2020, together with do_setlink() and set_operstate() from
>   net/core/rtnetlink.c. Has the lock been repurposed over the years and we
>   should update its name appropriately?
> 
> - This usage from netdev_show() in net/core/net-sysfs.c just looks random to
>   me, maybe somebody can explain:
> 
> 	read_lock(&dev_base_lock);
> 	if (dev_isalive(ndev))
> 		ret = (*format)(ndev, buf);
> 	read_unlock(&dev_base_lock);


So dev_base_lock dates back to the Big Kernel Lock breakup in Linux 2.4
(i.e. before my time). The time has come to get rid of it.

The use in sysfs could be changed to RCU. There have been issues
in the past with sysfs causing lock inversions with the rtnl mutex; that
is why you will see some trylock code there.

My guess is that dev_base_lock readers exist only because no one bothered to do
the RCU conversion.

Complex locking rules lead to mistakes and often don't get much performance
gain.  There are really two different domains being covered by locks here.

The first area is change of state of network devices. This has traditionally
been covered by RTNL because there are places that depend on coordinating
state between multiple devices. RTNL is too big and held too long but getting
rid of it is hard because there are corner cases (like state changes from userspace
for VPN devices).

The other area is code that wants read access to the list of devices.
These pure readers can/should be converted to RCU by now. Writers should hold RTNL.

You could change the readers of operstate to use some form of RCU and atomic
operation (seqlock?). The state of the device has several components (flags,
operstate, etc.), and there is no well-defined way to read a consistent set of them.

Good Luck on your quest.




* Re: Correct usage of dev_base_lock in 2020
  2020-11-30  5:12   ` Stephen Hemminger
@ 2020-11-30 10:41     ` Eric Dumazet
  2020-11-30 18:14       ` Jakub Kicinski
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2020-11-30 10:41 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Vladimir Oltean, netdev, Jakub Kicinski, Paul Gortmaker,
	Jiri Benc, Or Gerlitz, Cong Wang, Jamal Hadi Salim, Andrew Lunn,
	Florian Fainelli

On Mon, Nov 30, 2020 at 6:12 AM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Sun, 29 Nov 2020 22:58:17 +0200
> Vladimir Oltean <olteanv@gmail.com> wrote:
>
> > [ resent, had forgot to copy the list ]
> >
> > Hi,
> >
> > net/core/dev.c has this to say about the locking rules around the network
> > interface lists (dev_base_head, and I can only assume that it also applies to
> > the per-ifindex hash table dev_index_head and the per-name hash table
> > dev_name_head):
> >
> > /*
> >  * The @dev_base_head list is protected by @dev_base_lock and the rtnl
> >  * semaphore.
> >  *
> >  * Pure readers hold dev_base_lock for reading, or rcu_read_lock()
> >  *
> >  * Writers must hold the rtnl semaphore while they loop through the
> >  * dev_base_head list, and hold dev_base_lock for writing when they do the
> >  * actual updates.  This allows pure readers to access the list even
> >  * while a writer is preparing to update it.
> >  *
> >  * To put it another way, dev_base_lock is held for writing only to
> >  * protect against pure readers; the rtnl semaphore provides the
> >  * protection against other writers.
> >  *
> >  * See, for example usages, register_netdevice() and
> >  * unregister_netdevice(), which must be called with the rtnl
> >  * semaphore held.
> >  */
> >
> > However, as of today, most if not all the read-side accessors of the network
> > interface lists have been converted to run under rcu_read_lock. As Eric explains,
> >
> > commit fb699dfd426a189fe33b91586c15176a75c8aed0
> > Author: Eric Dumazet <eric.dumazet@gmail.com>
> > Date:   Mon Oct 19 19:18:49 2009 +0000
> >
> >     net: Introduce dev_get_by_index_rcu()
> >
> >     Some workloads hit dev_base_lock rwlock pretty hard.
> >     We can use RCU lookups to avoid touching this rwlock.
> >
> >     netdevices are already freed after a RCU grace period, so this patch
> >     adds no penalty at device dismantle time.
> >
> >     dev_ifname() converted to dev_get_by_index_rcu()
> >
> >     Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> >     Signed-off-by: David S. Miller <davem@davemloft.net>
> >
> > A lot of work has been put into eliminating the dev_base_lock rwlock
> > completely, as Stephen explained here:
> >
> > [PATCH 00/10] netdev: get rid of read_lock(&dev_base_lock) usages
> > https://www.spinics.net/lists/netdev/msg112264.html
> >
> > However, its use has not been completely eliminated. It is still there, and
> > even more confusingly, that comment in net/core/dev.c is still there. What I
> > see the dev_base_lock being used for now are complete oddballs.
> >
> > - The debugfs for mac80211, in net/mac80211/debugfs_netdev.c, holds the read
> >   side when printing some interface properties (good luck disentangling the
> >   code and figuring out which ones, though). What is that read-side actually
> >   protecting against?
> >
> > - HSR, in net/hsr/hsr_device.c (called from hsr_netdev_notify on NETDEV_UP
> >   NETDEV_DOWN and NETDEV_CHANGE), takes the write-side of the lock when
> >   modifying the RFC 2863 operstate of the interface. Why?
> >   Actually the use of dev_base_lock is the most widespread in the kernel today
> >   when accessing the RFC 2863 operstate. I could only find this truncated
> >   discussion in the archives:
> >     Re: Issue 0 WAS (Re: Oustanding issues WAS(IRe: Consensus? WAS(RFC 2863)
> >     https://www.mail-archive.com/netdev@vger.kernel.org/msg03632.html
> >   and it said:
> >
> >     > be transitioned to up/dormant etc. So an ethernet driver doesnt know it
> >     > needs to go from detecting peer link is up to next being authenticated
> >     > in the case of 802.1x. It just calls netif_carrier_on which checks
> >     > link_mode to decide on transition.
> >
> >     we could protect operstate with a spinlock_irqsave() and then change it either
> >     from netif_[carrier|dormant]_on/off() or userspace-supplicant. However, I'm
> >     not feeling good about it. Look at rtnetlink_fill_ifinfo(), it is able to
> >     query a consistent snapshot of all interface settings as long as locking with
> >     dev_base_lock and rtnl is obeyed. __LINK_STATE flags are already an
> >     exemption, and I don't want operstate to be another. That's why I chose
> >     setting it from linkwatch in process context, and I really think this is the
> >     correct approach.
> >
> > - rfc2863_policy() in net/core/link_watch.c seems to be the major writer that
> >   holds this lock in 2020, together with do_setlink() and set_operstate() from
> >   net/core/rtnetlink.c. Has the lock been repurposed over the years and we
> >   should update its name appropriately?
> >
> > - This usage from netdev_show() in net/core/net-sysfs.c just looks random to
> >   me, maybe somebody can explain:
> >
> >       read_lock(&dev_base_lock);
> >       if (dev_isalive(ndev))
> >               ret = (*format)(ndev, buf);
> >       read_unlock(&dev_base_lock);
>
>
> So dev_base_lock dates back to the Big Kernel Lock breakup back in Linux 2.4
> (ie before my time). The time has come to get rid of it.
>
> The use is sysfs is because could be changed to RCU. There have been issues
> in the past with sysfs causing lock inversions with the rtnl mutex, that
> is why you will see some trylock code there.
>
> My guess is that dev_base_lock readers exist only because no one bothered to do
> the RCU conversion.

I think we did, a long time ago.

We took care of all the 'fast paths' already.

Not sure what is needed, current situation does not bother me at all ;)

>
> Complex locking rules lead to mistakes and often don't get much performance
> gain.  There are really two different domains being covered by locks here.
>
> The first area is change of state of network devices. This has traditionally
> been covered by RTNL because there are places that depend on coordinating
> state between multiple devices. RTNL is too big and held too long but getting
> rid of it is hard because there are corner cases (like state changes from userspace
> for VPN devices).
>
> The other area is code that wants to do read access to look at list of devices.
> These pure readers can/should be converted to RCU by now. Writers should hold RTNL.

Yes, and sometimes this is unfortunate.

dev_change_name(), for example, is an issue because of the
synchronize_rcu() it contains.

>
> You could change the readers of operstate to use some form of RCU and atomic
> operation (seqlock?). The state of the device has several components flags, operstate
> etc, and there is no well defined way to read a consistent set of them.
>
> Good Luck on your quest.
>
>


* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 10:41     ` Eric Dumazet
@ 2020-11-30 18:14       ` Jakub Kicinski
  2020-11-30 18:30         ` Eric Dumazet
  2020-11-30 18:48         ` Vladimir Oltean
  0 siblings, 2 replies; 28+ messages in thread
From: Jakub Kicinski @ 2020-11-30 18:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Vladimir Oltean, netdev, Paul Gortmaker,
	Jiri Benc, Or Gerlitz, Cong Wang, Jamal Hadi Salim, Andrew Lunn,
	Florian Fainelli

On Mon, 30 Nov 2020 11:41:10 +0100 Eric Dumazet wrote:
> > So dev_base_lock dates back to the Big Kernel Lock breakup back in Linux 2.4
> > (ie before my time). The time has come to get rid of it.
> >
> > The use is sysfs is because could be changed to RCU. There have been issues
> > in the past with sysfs causing lock inversions with the rtnl mutex, that
> > is why you will see some trylock code there.
> >
> > My guess is that dev_base_lock readers exist only because no one bothered to do
> > the RCU conversion.  
> 
> I think we did, a long time ago.
> 
> We took care of all ' fast paths' already.
> 
> Not sure what is needed, current situation does not bother me at all ;)

Perhaps Vladimir has a plan to post separately about it (in that case
sorry for jumping ahead), but the initial problem was procfs, which is
(hopefully mostly irrelevant by now, and) taking only the RCU lock,
therefore forcing drivers to have re-entrant, non-sleeping
.ndo_get_stats64 implementations.


* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 18:14       ` Jakub Kicinski
@ 2020-11-30 18:30         ` Eric Dumazet
  2020-11-30 18:48         ` Vladimir Oltean
  1 sibling, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2020-11-30 18:30 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stephen Hemminger, Vladimir Oltean, netdev, Paul Gortmaker,
	Jiri Benc, Or Gerlitz, Cong Wang, Jamal Hadi Salim, Andrew Lunn,
	Florian Fainelli

On Mon, Nov 30, 2020 at 7:14 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 30 Nov 2020 11:41:10 +0100 Eric Dumazet wrote:
> > > So dev_base_lock dates back to the Big Kernel Lock breakup back in Linux 2.4
> > > (ie before my time). The time has come to get rid of it.
> > >
> > > The use is sysfs is because could be changed to RCU. There have been issues
> > > in the past with sysfs causing lock inversions with the rtnl mutex, that
> > > is why you will see some trylock code there.
> > >
> > > My guess is that dev_base_lock readers exist only because no one bothered to do
> > > the RCU conversion.
> >
> > I think we did, a long time ago.
> >
> > We took care of all ' fast paths' already.
> >
> > Not sure what is needed, current situation does not bother me at all ;)
>
> Perhaps Vladimir has a plan to post separately about it (in that case
> sorry for jumping ahead) but the initial problem was procfs which is
> (hopefully mostly irrelevant by now, and) taking the RCU lock only
> therefore forcing drivers to have re-entrant, non-sleeping
> .ndo_get_stats64 implementations.

I think bonding also calls ndo_get_stats64() while in non-sleeping context.


* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 18:14       ` Jakub Kicinski
  2020-11-30 18:30         ` Eric Dumazet
@ 2020-11-30 18:48         ` Vladimir Oltean
  2020-11-30 19:00           ` Eric Dumazet
                             ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Vladimir Oltean @ 2020-11-30 18:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Eric Dumazet, Stephen Hemminger, netdev, Paul Gortmaker,
	Jiri Benc, Or Gerlitz, Cong Wang, Jamal Hadi Salim, Andrew Lunn,
	Florian Fainelli

On Mon, Nov 30, 2020 at 10:14:05AM -0800, Jakub Kicinski wrote:
> On Mon, 30 Nov 2020 11:41:10 +0100 Eric Dumazet wrote:
> > > So dev_base_lock dates back to the Big Kernel Lock breakup back in Linux 2.4
> > > (ie before my time). The time has come to get rid of it.
> > >
> > > The use is sysfs is because could be changed to RCU. There have been issues
> > > in the past with sysfs causing lock inversions with the rtnl mutex, that
> > > is why you will see some trylock code there.
> > >
> > > My guess is that dev_base_lock readers exist only because no one bothered to do
> > > the RCU conversion.
> >
> > I think we did, a long time ago.
> >
> > We took care of all ' fast paths' already.
> >
> > Not sure what is needed, current situation does not bother me at all ;)
>
> Perhaps Vladimir has a plan to post separately about it (in that case
> sorry for jumping ahead) but the initial problem was procfs which is
> (hopefully mostly irrelevant by now, and) taking the RCU lock only
> therefore forcing drivers to have re-entrant, non-sleeping
> .ndo_get_stats64 implementations.

Right, the underlying reason why I'm even looking at this is that I want to
convert all callers of dev_get_stats to use sleepable, not atomic, context.
This makes it easier to gather statistics from devices that have firmware,
or from off-chip devices behind a slow bus like SPI.

As Jakub pointed out, some places call dev_get_stats while iterating
through the list of network interfaces - procfs is one, but not the only
one. These callers are pure readers, so they use RCU protection. But
that gives us atomic context when calling dev_get_stats. The naive
solution is to convert all those callers to hold the RTNL mutex, which
is the writer-side protection for the network interface lists, and which
is sleepable. In fact I do have a series of 8 patches where I get that
done. But there are some weirder cases, such as the bonding driver,
where I need to do this:

-----------------------------[cut here]-----------------------------
From 369a0e18a2446cda8ff52d72c02aa144ae6687ec Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <vladimir.oltean@nxp.com>
Date: Mon, 30 Nov 2020 02:39:46 +0200
Subject: [PATCH] net: bonding: retrieve device statistics under RTNL, not RCU

In the effort of making .ndo_get_stats64 be able to sleep, we need to
ensure the callers of dev_get_stats do not use atomic context.

The bonding driver uses an RCU read-side critical section to ensure the
integrity of the list of network interfaces, because the driver iterates
through all net devices in the netns to find the ones which are its
configured slaves. We still need some protection against an interface
registering or deregistering, and the writer-side lock, the RTNL mutex,
is fine for that, because it offers sleepable context.

We are taking the RTNL this way (checking for rtnl_is_locked first)
because the RTNL is not guaranteed to be held by all callers of
ndo_get_stats64, in fact there will be work in the future that will
avoid as much RTNL-holding as possible.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/net/bonding/bond_main.c | 18 +++++++-----------
 include/net/bonding.h           |  1 -
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e0880a3840d7..1d44534e95d2 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3738,21 +3738,17 @@ static void bond_get_stats(struct net_device *bond_dev,
 			   struct rtnl_link_stats64 *stats)
 {
 	struct bonding *bond = netdev_priv(bond_dev);
+	bool rtnl_locked = rtnl_is_locked();
 	struct rtnl_link_stats64 temp;
 	struct list_head *iter;
 	struct slave *slave;
-	int nest_level = 0;
 
+	if (!rtnl_locked)
+		rtnl_lock();
 
-	rcu_read_lock();
-#ifdef CONFIG_LOCKDEP
-	nest_level = bond_get_lowest_level_rcu(bond_dev);
-#endif
-
-	spin_lock_nested(&bond->stats_lock, nest_level);
 	memcpy(stats, &bond->bond_stats, sizeof(*stats));
 
-	bond_for_each_slave_rcu(bond, slave, iter) {
+	bond_for_each_slave(bond, slave, iter) {
 		const struct rtnl_link_stats64 *new =
 			dev_get_stats(slave->dev, &temp);
 
@@ -3763,8 +3759,9 @@ static void bond_get_stats(struct net_device *bond_dev,
 	}
 
 	memcpy(&bond->bond_stats, stats, sizeof(*stats));
-	spin_unlock(&bond->stats_lock);
-	rcu_read_unlock();
+
+	if (!rtnl_locked)
+		rtnl_unlock();
 }
 
 static int bond_do_ioctl(struct net_device *bond_dev, struct ifreq *ifr, int cmd)
@@ -5192,7 +5189,6 @@ static int bond_init(struct net_device *bond_dev)
 	if (!bond->wq)
 		return -ENOMEM;
 
-	spin_lock_init(&bond->stats_lock);
 	netdev_lockdep_set_classes(bond_dev);
 
 	list_add_tail(&bond->bond_list, &bn->dev_list);
diff --git a/include/net/bonding.h b/include/net/bonding.h
index d9d0ff3b0ad3..6fbde9713424 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -224,7 +224,6 @@ struct bonding {
 	 * ALB mode (6) - to sync the use and modifications of its hash table
 	 */
 	spinlock_t mode_lock;
-	spinlock_t stats_lock;
 	u8	 send_peer_notif;
 	u8       igmp_retrans;
 #ifdef CONFIG_PROC_FS
-----------------------------[cut here]-----------------------------

Except that there's a problem with this: the lock ordering that
Stephen talked about.

cat /sys/class/net/bond0/statistics/*
[   38.715829]
[   38.717329] ======================================================
[   38.723528] WARNING: possible circular locking dependency detected
[   38.729730] 5.10.0-rc5-next-20201127-00016-gde02bf51f2fa #1481 Not tainted
[   38.736628] ------------------------------------------------------
[   38.742828] cat/602 is trying to acquire lock:
[   38.747285] ffffcf4908119080 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x20/0x28
[   38.754555]
[   38.754555] but task is already holding lock:
[   38.760406] ffff53364222c168 (kn->active#123){++++}-{0:0}, at: kernfs_seq_start+0x38/0xb0
[   38.768631]
[   38.768631] which lock already depends on the new lock.
[   38.768631]
[   38.776835]
[   38.776835] the existing dependency chain (in reverse order) is:
[   38.784341]
[   38.784341] -> #1 (kn->active#123){++++}-{0:0}:
[   38.790387]        lock_acquire+0x238/0x468
[   38.794583]        __kernfs_remove+0x26c/0x3e0
[   38.799040]        kernfs_remove_by_name_ns+0x5c/0xb8
[   38.804107]        remove_files.isra.1+0x40/0x80
[   38.808739]        sysfs_remove_group+0x50/0xb0
[   38.813282]        sysfs_remove_groups+0x3c/0x58
[   38.817913]        device_remove_attrs+0x64/0x98
[   38.822544]        device_del+0x16c/0x3f8
[   38.826566]        netdev_unregister_kobject+0x80/0x90
[   38.831722]        rollback_registered_many+0x444/0x688
[   38.836964]        unregister_netdevice_many.part.163+0x20/0xa8
[   38.842904]        unregister_netdevice_many+0x2c/0x38
[   38.848059]        rtnl_delete_link+0x5c/0x90
[   38.852428]        rtnl_dellink+0x158/0x310
[   38.856622]        rtnetlink_rcv_msg+0x298/0x4d8
[   38.861254]        netlink_rcv_skb+0x64/0x130
[   38.865625]        rtnetlink_rcv+0x28/0x38
[   38.869733]        netlink_unicast+0x1dc/0x288
[   38.874190]        netlink_sendmsg+0x2b0/0x3e8
[   38.878647]        ____sys_sendmsg+0x27c/0x2c0
[   38.883105]        ___sys_sendmsg+0x90/0xd0
[   38.887299]        __sys_sendmsg+0x7c/0xd0
[   38.891407]        __arm64_sys_sendmsg+0x2c/0x38
[   38.896040]        el0_svc_common.constprop.3+0x80/0x1b0
[   38.901369]        do_el0_svc+0x34/0xa0
[   38.905216]        el0_sync_handler+0x138/0x198
[   38.909760]        el0_sync+0x140/0x180
[   38.913604]
[   38.913604] -> #0 (rtnl_mutex){+.+.}-{4:4}:
[   38.919295]        check_noncircular+0x154/0x168
[   38.923926]        __lock_acquire+0x1118/0x15e0
[   38.928470]        lock_acquire+0x238/0x468
[   38.932667]        __mutex_lock+0xa4/0x970
[   38.936776]        mutex_lock_nested+0x54/0x70
[   38.941233]        rtnl_lock+0x20/0x28
[   38.944993]        bond_get_stats+0x140/0x1a8
[   38.949363]        dev_get_stats+0xdc/0xf0
[   38.953472]        netstat_show.isra.25+0x50/0xa0
[   38.958190]        collisions_show+0x2c/0x38
[   38.962472]        dev_attr_show+0x3c/0x78
[   38.966580]        sysfs_kf_seq_show+0xbc/0x138
[   38.971124]        kernfs_seq_show+0x44/0x50
[   38.975407]        seq_read_iter+0xe4/0x450
[   38.979601]        seq_read+0xf8/0x148
[   38.983359]        kernfs_fop_read+0x1e0/0x280
[   38.987817]        vfs_read+0xac/0x1c8
[   38.991576]        ksys_read+0x74/0xf8
[   38.995335]        __arm64_sys_read+0x24/0x30
[   38.999706]        el0_svc_common.constprop.3+0x80/0x1b0
[   39.005035]        do_el0_svc+0x34/0xa0
[   39.008882]        el0_sync_handler+0x138/0x198
[   39.013426]        el0_sync+0x140/0x180
[   39.017270]
[   39.017270] other info that might help us debug this:
[   39.017270]
[   39.025300]  Possible unsafe locking scenario:
[   39.025300]
[   39.031236]        CPU0                    CPU1
[   39.035779]        ----                    ----
[   39.040321]   lock(kn->active#123);
[   39.043826]                                lock(rtnl_mutex);
[   39.049507]                                lock(kn->active#123);
[   39.055540]   lock(rtnl_mutex);
[   39.058691]
[   39.058691]  *** DEADLOCK ***
[   39.058691]
[   39.064630] 3 locks held by cat/602:
[   39.068212]  #0: ffff533645926f70 (&p->lock){+.+.}-{4:4}, at: seq_read_iter+0x64/0x450
[   39.076173]  #1: ffff533643bfa090 (&of->mutex){+.+.}-{4:4}, at: kernfs_seq_start+0x30/0xb0
[   39.084482]  #2: ffff53364222c168 (kn->active#123){++++}-{0:0}, at: kernfs_seq_start+0x38/0xb0

Ok, I admit I don't really understand the message, because struct
kernfs_node::active is an atomic_t and not really a lock, but
intuitively I think lockdep is unhappy that the RTNL mutex is not being
used here as a top-level mutex.
If somebody were to cat /sys/class/net/bond0/statistics/*, the kernfs
node locking happens first, then the RTNL mutex is acquired by the
bonding implementation of ndo_get_stats64.
If somebody were to delete the bonding device via an "ip link del"
rtnetlink message, the RTNL mutex would be held first, then the kernfs
node lock corresponding to the interface's sysfs attributes second.
I don't think there's really any potential deadlock, only an ordering
issue between lockdep classes.

How I think this could be solved is by giving the network interface
lists some other writer-side protection than the big RTNL mutex.
So I went to see exactly how knee-deep I would need to get into that.

There are two separate classes of problems:
- We already have two ways of protecting pure readers: via RCU and via
  the rwlock. It doesn't help if we also add a second way of locking out
  pure writers. We need to first clean up what we have. That's the
  reason why I started the discussion about the rwlock.
- Some callers appear to not use any sort of protection at all. Does
  this code path look ok to you?

nfnetlink_rcv_batch
-> NFT_MSG_NEWCHAIN
   -> nf_tables_addchain
      -> nft_chain_parse_hook
         -> nft_chain_parse_netdev
            -> nf_tables_parse_netdev_hooks
               -> nft_netdev_hook_alloc
                  -> __dev_get_by_name
                     -> netdev_name_node_lookup: must be under RTNL mutex or dev_base_lock protection

Unless I'm missing something, there are just tons and tons of callers
of __dev_get_by_name which hold neither the RTNL mutex, nor the
dev_base_lock rwlock, nor an RCU read-side critical section.

What I was thinking was that if we could add a separate mutex for the
network interface lists (and this time make it per netns, and not global
to the kernel), then we could take that lock a lot more locally. I.e. we
could have __dev_get_by_name take that lock, and alleviate the need for
callers to hold the RTNL mutex. That would solve some of these bugs too,
and would also be more usable for the bonding case.

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 18:48         ` Vladimir Oltean
@ 2020-11-30 19:00           ` Eric Dumazet
  2020-11-30 19:03             ` Vladimir Oltean
  2020-12-01 14:42           ` Pablo Neira Ayuso
  2020-12-10  4:32           ` [PATCH] net: bonding: retrieve device statistics under RTNL, not RCU kernel test robot
  2 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2020-11-30 19:00 UTC (permalink / raw)
  To: Vladimir Oltean, Jakub Kicinski
  Cc: Eric Dumazet, Stephen Hemminger, netdev, Paul Gortmaker,
	Jiri Benc, Or Gerlitz, Cong Wang, Jamal Hadi Salim, Andrew Lunn,
	Florian Fainelli



On 11/30/20 7:48 PM, Vladimir Oltean wrote:
> On Mon, Nov 30, 2020 at 10:14:05AM -0800, Jakub Kicinski wrote:
>> On Mon, 30 Nov 2020 11:41:10 +0100 Eric Dumazet wrote:
>>>> So dev_base_lock dates back to the Big Kernel Lock breakup back in Linux 2.4
>>>> (ie before my time). The time has come to get rid of it.
>>>>
>>>> The use in sysfs is one that could be changed to RCU. There have been
>>>> issues in the past with sysfs causing lock inversions with the rtnl
>>>> mutex, which is why you will see some trylock code there.
>>>>
>>>> My guess is that dev_base_lock readers exist only because no one bothered to do
>>>> the RCU conversion.
>>>
>>> I think we did, a long time ago.
>>>
>>> We took care of all ' fast paths' already.
>>>
>>> Not sure what is needed, current situation does not bother me at all ;)
>>
>> Perhaps Vladimir has a plan to post separately about it (in that case
>> sorry for jumping ahead) but the initial problem was procfs which is
>> (hopefully mostly irrelevant by now, and) taking the RCU lock only
>> therefore forcing drivers to have re-entrant, non-sleeping
>> .ndo_get_stats64 implementations.
> 
> Right, the end reason why I'm even looking at this is because I want to
> convert all callers of dev_get_stats to use sleepable context and not
> atomic. This makes it easier to gather statistics from devices that have
> a firmware, or off-chip devices behind a slow bus like SPI.
> 
> Like Jakub pointed out, some places call dev_get_stats while iterating
> through the list of network interfaces - one would be procfs, but not
> only. These callers are pure readers, so they use RCU protection. But
> that gives us atomic context when calling dev_get_stats. The naive
> solution is to convert all those callers to hold the RTNL mutex, which
> is the writer-side protection for the network interface lists, and which
> is sleepable. In fact I do have a series of 8 patches where I get that
> done. But there are some weirder cases, such as the bonding driver,
> where I need to do this:
> 
> -----------------------------[cut here]-----------------------------
> From 369a0e18a2446cda8ff52d72c02aa144ae6687ec Mon Sep 17 00:00:00 2001
> From: Vladimir Oltean <vladimir.oltean@nxp.com>
> Date: Mon, 30 Nov 2020 02:39:46 +0200
> Subject: [PATCH] net: bonding: retrieve device statistics under RTNL, not RCU
> 
> In the effort of making .ndo_get_stats64 be able to sleep, we need to
> ensure the callers of dev_get_stats do not use atomic context.
> 
> The bonding driver uses an RCU read-side critical section to ensure the
> integrity of the list of network interfaces, because the driver iterates
> through all net devices in the netns to find the ones which are its
> configured slaves. We still need some protection against an interface
> registering or deregistering, and the writer-side lock, the RTNL mutex,
> is fine for that, because it offers sleepable context.
> 
> We are taking the RTNL this way (checking for rtnl_is_locked first)
> because the RTNL is not guaranteed to be held by all callers of
> ndo_get_stats64, in fact there will be work in the future that will
> avoid as much RTNL-holding as possible.
> 
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>  drivers/net/bonding/bond_main.c | 18 +++++++-----------
>  include/net/bonding.h           |  1 -
>  2 files changed, 7 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index e0880a3840d7..1d44534e95d2 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -3738,21 +3738,17 @@ static void bond_get_stats(struct net_device *bond_dev,
>  			   struct rtnl_link_stats64 *stats)
>  {
>  	struct bonding *bond = netdev_priv(bond_dev);
> +	bool rtnl_locked = rtnl_is_locked();
>  	struct rtnl_link_stats64 temp;
>  	struct list_head *iter;
>  	struct slave *slave;
> -	int nest_level = 0;
>  
> +	if (!rtnl_locked)
> +		rtnl_lock();

Gosh, do not do that.

Convert the bonding ->stats_lock to a mutex instead.

Adding more reliance on RTNL does not help cases where access to stats
should not be blocked by other users of RTNL (which can be abused).

>  
> -	rcu_read_lock();
> -#ifdef CONFIG_LOCKDEP
> -	nest_level = bond_get_lowest_level_rcu(bond_dev);
> -#endif
> -
> -	spin_lock_nested(&bond->stats_lock, nest_level);
>  	memcpy(stats, &bond->bond_stats, sizeof(*stats));
>  
> -	bond_for_each_slave_rcu(bond, slave, iter) {
> +	bond_for_each_slave(bond, slave, iter) {
>  		const struct rtnl_link_stats64 *new =
>  			dev_get_stats(slave->dev, &temp);
>  
> @@ -3763,8 +3759,9 @@ static void bond_get_stats(struct net_device *bond_dev,
>  	}
>  
>  	memcpy(&bond->bond_stats, stats, sizeof(*stats));
> -	spin_unlock(&bond->stats_lock);
> -	rcu_read_unlock();
> +
> +	if (!rtnl_locked)
> +		rtnl_unlock();
>  }
>  
>  static int bond_do_ioctl(struct net_device *bond_dev, struct ifreq *ifr, int cmd)
> @@ -5192,7 +5189,6 @@ static int bond_init(struct net_device *bond_dev)
>  	if (!bond->wq)
>  		return -ENOMEM;
>  
> -	spin_lock_init(&bond->stats_lock);
>  	netdev_lockdep_set_classes(bond_dev);
>  
>  	list_add_tail(&bond->bond_list, &bn->dev_list);
> diff --git a/include/net/bonding.h b/include/net/bonding.h
> index d9d0ff3b0ad3..6fbde9713424 100644
> --- a/include/net/bonding.h
> +++ b/include/net/bonding.h
> @@ -224,7 +224,6 @@ struct bonding {
>  	 * ALB mode (6) - to sync the use and modifications of its hash table
>  	 */
>  	spinlock_t mode_lock;
> -	spinlock_t stats_lock;
>  	u8	 send_peer_notif;
>  	u8       igmp_retrans;
>  #ifdef CONFIG_PROC_FS
> -----------------------------[cut here]-----------------------------
> 
> Only that there's a problem with this. It's the lock ordering that
> Stephen talked about.
> 
> cat /sys/class/net/bond0/statistics/*
> [   38.715829]
> [   38.717329] ======================================================
> [   38.723528] WARNING: possible circular locking dependency detected
> [   38.729730] 5.10.0-rc5-next-20201127-00016-gde02bf51f2fa #1481 Not tainted
> [   38.736628] ------------------------------------------------------
> [   38.742828] cat/602 is trying to acquire lock:
> [   38.747285] ffffcf4908119080 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x20/0x28
> [   38.754555]
> [   38.754555] but task is already holding lock:
> [   38.760406] ffff53364222c168 (kn->active#123){++++}-{0:0}, at: kernfs_seq_start+0x38/0xb0
> [   38.768631]
> [   38.768631] which lock already depends on the new lock.
> [   38.768631]
> [   38.776835]
> [   38.776835] the existing dependency chain (in reverse order) is:
> [   38.784341]
> [   38.784341] -> #1 (kn->active#123){++++}-{0:0}:
> [   38.790387]        lock_acquire+0x238/0x468
> [   38.794583]        __kernfs_remove+0x26c/0x3e0
> [   38.799040]        kernfs_remove_by_name_ns+0x5c/0xb8
> [   38.804107]        remove_files.isra.1+0x40/0x80
> [   38.808739]        sysfs_remove_group+0x50/0xb0
> [   38.813282]        sysfs_remove_groups+0x3c/0x58
> [   38.817913]        device_remove_attrs+0x64/0x98
> [   38.822544]        device_del+0x16c/0x3f8
> [   38.826566]        netdev_unregister_kobject+0x80/0x90
> [   38.831722]        rollback_registered_many+0x444/0x688
> [   38.836964]        unregister_netdevice_many.part.163+0x20/0xa8
> [   38.842904]        unregister_netdevice_many+0x2c/0x38
> [   38.848059]        rtnl_delete_link+0x5c/0x90
> [   38.852428]        rtnl_dellink+0x158/0x310
> [   38.856622]        rtnetlink_rcv_msg+0x298/0x4d8
> [   38.861254]        netlink_rcv_skb+0x64/0x130
> [   38.865625]        rtnetlink_rcv+0x28/0x38
> [   38.869733]        netlink_unicast+0x1dc/0x288
> [   38.874190]        netlink_sendmsg+0x2b0/0x3e8
> [   38.878647]        ____sys_sendmsg+0x27c/0x2c0
> [   38.883105]        ___sys_sendmsg+0x90/0xd0
> [   38.887299]        __sys_sendmsg+0x7c/0xd0
> [   38.891407]        __arm64_sys_sendmsg+0x2c/0x38
> [   38.896040]        el0_svc_common.constprop.3+0x80/0x1b0
> [   38.901369]        do_el0_svc+0x34/0xa0
> [   38.905216]        el0_sync_handler+0x138/0x198
> [   38.909760]        el0_sync+0x140/0x180
> [   38.913604]
> [   38.913604] -> #0 (rtnl_mutex){+.+.}-{4:4}:
> [   38.919295]        check_noncircular+0x154/0x168
> [   38.923926]        __lock_acquire+0x1118/0x15e0
> [   38.928470]        lock_acquire+0x238/0x468
> [   38.932667]        __mutex_lock+0xa4/0x970
> [   38.936776]        mutex_lock_nested+0x54/0x70
> [   38.941233]        rtnl_lock+0x20/0x28
> [   38.944993]        bond_get_stats+0x140/0x1a8
> [   38.949363]        dev_get_stats+0xdc/0xf0
> [   38.953472]        netstat_show.isra.25+0x50/0xa0
> [   38.958190]        collisions_show+0x2c/0x38
> [   38.962472]        dev_attr_show+0x3c/0x78
> [   38.966580]        sysfs_kf_seq_show+0xbc/0x138
> [   38.971124]        kernfs_seq_show+0x44/0x50
> [   38.975407]        seq_read_iter+0xe4/0x450
> [   38.979601]        seq_read+0xf8/0x148
> [   38.983359]        kernfs_fop_read+0x1e0/0x280
> [   38.987817]        vfs_read+0xac/0x1c8
> [   38.991576]        ksys_read+0x74/0xf8
> [   38.995335]        __arm64_sys_read+0x24/0x30
> [   38.999706]        el0_svc_common.constprop.3+0x80/0x1b0
> [   39.005035]        do_el0_svc+0x34/0xa0
> [   39.008882]        el0_sync_handler+0x138/0x198
> [   39.013426]        el0_sync+0x140/0x180
> [   39.017270]
> [   39.017270] other info that might help us debug this:
> [   39.017270]
> [   39.025300]  Possible unsafe locking scenario:
> [   39.025300]
> [   39.031236]        CPU0                    CPU1
> [   39.035779]        ----                    ----
> [   39.040321]   lock(kn->active#123);
> [   39.043826]                                lock(rtnl_mutex);
> [   39.049507]                                lock(kn->active#123);
> [   39.055540]   lock(rtnl_mutex);
> [   39.058691]
> [   39.058691]  *** DEADLOCK ***
> [   39.058691]
> [   39.064630] 3 locks held by cat/602:
> [   39.068212]  #0: ffff533645926f70 (&p->lock){+.+.}-{4:4}, at: seq_read_iter+0x64/0x450
> [   39.076173]  #1: ffff533643bfa090 (&of->mutex){+.+.}-{4:4}, at: kernfs_seq_start+0x30/0xb0
> [   39.084482]  #2: ffff53364222c168 (kn->active#123){++++}-{0:0}, at: kernfs_seq_start+0x38/0xb0
> 
> Ok, I admit I don't really understand the message, because struct
> kernfs_node::active is an atomic_t and not really a lock, but
> intuitively I think lockdep is unhappy that the RTNL mutex is not being
> used here as a top-level mutex.
> If somebody were to cat /sys/class/net/bond0/statistics/*, the kernfs
> node locking happens first, then the RTNL mutex is acquired by the
> bonding implementation of ndo_get_stats64.
> If somebody were to delete the bonding device via an "ip link del"
> rtnetlink message, the RTNL mutex would be held first, then the kernfs
> node lock corresponding to the interface's sysfs attributes second.
> I don't think there's really any potential deadlock, only an ordering
> issue between lockdep classes.
> 
> How I think this could be solved is if we made the network interface
> lists use some other writer-side protection than the big RTNL mutex.
> So I went on exactly to see how knee-deep I would need to get into that.
> 
> There are 2 separate classes of problems:
> - We already have two ways of protecting pure readers: via RCU and via
>   the rwlock. It doesn't help if we also add a second way of locking out
>   pure writers. We need to first clean up what we have. That's the
>   reason why I started the discussion about the rwlock.
> - Some callers appear to not use any sort of protection at all. Does
>   this code path look ok to you?
> 
> nfnetlink_rcv_batch
> -> NFT_MSG_NEWCHAIN
>    -> nf_tables_addchain
>       -> nft_chain_parse_hook
>          -> nft_chain_parse_netdev
>             -> nf_tables_parse_netdev_hooks
>                -> nft_netdev_hook_alloc
>                   -> __dev_get_by_name
>                      -> netdev_name_node_lookup: must be under RTNL mutex or dev_base_lock protection
> 
> Unless I'm missing something, there are just tons, tons of callers of
> __dev_get_by_name which hold neither the RTNL mutex, nor the
> dev_base_lock rwlock, nor RCU read-side critical section.
> 
> What I was thinking was that if we could add a separate mutex for the
> network interface lists (and this time make it per netns, and not global
> to the kernel), then we could take that lock a lot more locally. I.e. we
> could have __dev_get_by_name take that lock, and alleviate the need for
> callers to hold the RTNL mutex. That would solve some of these bugs too,
> and would also be more usable for the bonding case.
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 19:00           ` Eric Dumazet
@ 2020-11-30 19:03             ` Vladimir Oltean
  2020-11-30 19:22               ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Vladimir Oltean @ 2020-11-30 19:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jakub Kicinski, Eric Dumazet, Stephen Hemminger, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, Nov 30, 2020 at 08:00:40PM +0100, Eric Dumazet wrote:
>
>
> On 11/30/20 7:48 PM, Vladimir Oltean wrote:
> > On Mon, Nov 30, 2020 at 10:14:05AM -0800, Jakub Kicinski wrote:
> >> On Mon, 30 Nov 2020 11:41:10 +0100 Eric Dumazet wrote:
> >>>> So dev_base_lock dates back to the Big Kernel Lock breakup back in Linux 2.4
> >>>> (ie before my time). The time has come to get rid of it.
> >>>>
> >>>> The use is sysfs is because could be changed to RCU. There have been issues
> >>>> in the past with sysfs causing lock inversions with the rtnl mutex, that
> >>>> is why you will see some trylock code there.
> >>>>
> >>>> My guess is that dev_base_lock readers exist only because no one bothered to do
> >>>> the RCU conversion.
> >>>
> >>> I think we did, a long time ago.
> >>>
> >>> We took care of all ' fast paths' already.
> >>>
> >>> Not sure what is needed, current situation does not bother me at all ;)
> >>
> >> Perhaps Vladimir has a plan to post separately about it (in that case
> >> sorry for jumping ahead) but the initial problem was procfs which is
> >> (hopefully mostly irrelevant by now, and) taking the RCU lock only
> >> therefore forcing drivers to have re-entrant, non-sleeping
> >> .ndo_get_stats64 implementations.
> >
> > Right, the end reason why I'm even looking at this is because I want to
> > convert all callers of dev_get_stats to use sleepable context and not
> > atomic. This makes it easier to gather statistics from devices that have
> > a firmware, or off-chip devices behind a slow bus like SPI.
> >
> > Like Jakub pointed out, some places call dev_get_stats while iterating
> > through the list of network interfaces - one would be procfs, but not
> > only. These callers are pure readers, so they use RCU protection. But
> > that gives us atomic context when calling dev_get_stats. The naive
> > solution is to convert all those callers to hold the RTNL mutex, which
> > is the writer-side protection for the network interface lists, and which
> > is sleepable. In fact I do have a series of 8 patches where I get that
> > done. But there are some weirder cases, such as the bonding driver,
> > where I need to do this:
> >
> > -----------------------------[cut here]-----------------------------
> > From 369a0e18a2446cda8ff52d72c02aa144ae6687ec Mon Sep 17 00:00:00 2001
> > From: Vladimir Oltean <vladimir.oltean@nxp.com>
> > Date: Mon, 30 Nov 2020 02:39:46 +0200
> > Subject: [PATCH] net: bonding: retrieve device statistics under RTNL, not RCU
> >
> > In the effort of making .ndo_get_stats64 be able to sleep, we need to
> > ensure the callers of dev_get_stats do not use atomic context.
> >
> > The bonding driver uses an RCU read-side critical section to ensure the
> > integrity of the list of network interfaces, because the driver iterates
> > through all net devices in the netns to find the ones which are its
> > configured slaves. We still need some protection against an interface
> > registering or deregistering, and the writer-side lock, the RTNL mutex,
> > is fine for that, because it offers sleepable context.
> >
> > We are taking the RTNL this way (checking for rtnl_is_locked first)
> > because the RTNL is not guaranteed to be held by all callers of
> > ndo_get_stats64, in fact there will be work in the future that will
> > avoid as much RTNL-holding as possible.
> >
> > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> > ---
> >  drivers/net/bonding/bond_main.c | 18 +++++++-----------
> >  include/net/bonding.h           |  1 -
> >  2 files changed, 7 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> > index e0880a3840d7..1d44534e95d2 100644
> > --- a/drivers/net/bonding/bond_main.c
> > +++ b/drivers/net/bonding/bond_main.c
> > @@ -3738,21 +3738,17 @@ static void bond_get_stats(struct net_device *bond_dev,
> >  			   struct rtnl_link_stats64 *stats)
> >  {
> >  	struct bonding *bond = netdev_priv(bond_dev);
> > +	bool rtnl_locked = rtnl_is_locked();
> >  	struct rtnl_link_stats64 temp;
> >  	struct list_head *iter;
> >  	struct slave *slave;
> > -	int nest_level = 0;
> >
> > +	if (!rtnl_locked)
> > +		rtnl_lock();
>
> Gosh, do not do that.
>
> Convert the bonding ->stats_lock to a mutex instead.
>
> Adding more reliance to RTNL is not helping cases where
> access to stats should not be blocked by other users of RTNL (which can be abused)

I can't, Eric. The bond_for_each_slave() macro needs protection against
net devices being registered and unregistered.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 19:03             ` Vladimir Oltean
@ 2020-11-30 19:22               ` Eric Dumazet
  2020-11-30 19:32                 ` Vladimir Oltean
  2020-11-30 19:46                 ` Vladimir Oltean
  0 siblings, 2 replies; 28+ messages in thread
From: Eric Dumazet @ 2020-11-30 19:22 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Eric Dumazet, Jakub Kicinski, Stephen Hemminger, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, Nov 30, 2020 at 8:03 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 08:00:40PM +0100, Eric Dumazet wrote:
> >
> >
> > On 11/30/20 7:48 PM, Vladimir Oltean wrote:
> > > On Mon, Nov 30, 2020 at 10:14:05AM -0800, Jakub Kicinski wrote:
> > >> On Mon, 30 Nov 2020 11:41:10 +0100 Eric Dumazet wrote:
> > >>>> So dev_base_lock dates back to the Big Kernel Lock breakup back in Linux 2.4
> > >>>> (ie before my time). The time has come to get rid of it.
> > >>>>
> > >>>> The use is sysfs is because could be changed to RCU. There have been issues
> > >>>> in the past with sysfs causing lock inversions with the rtnl mutex, that
> > >>>> is why you will see some trylock code there.
> > >>>>
> > >>>> My guess is that dev_base_lock readers exist only because no one bothered to do
> > >>>> the RCU conversion.
> > >>>
> > >>> I think we did, a long time ago.
> > >>>
> > >>> We took care of all ' fast paths' already.
> > >>>
> > >>> Not sure what is needed, current situation does not bother me at all ;)
> > >>
> > >> Perhaps Vladimir has a plan to post separately about it (in that case
> > >> sorry for jumping ahead) but the initial problem was procfs which is
> > >> (hopefully mostly irrelevant by now, and) taking the RCU lock only
> > >> therefore forcing drivers to have re-entrant, non-sleeping
> > >> .ndo_get_stats64 implementations.
> > >
> > > Right, the end reason why I'm even looking at this is because I want to
> > > convert all callers of dev_get_stats to use sleepable context and not
> > > atomic. This makes it easier to gather statistics from devices that have
> > > a firmware, or off-chip devices behind a slow bus like SPI.
> > >
> > > Like Jakub pointed out, some places call dev_get_stats while iterating
> > > through the list of network interfaces - one would be procfs, but not
> > > only. These callers are pure readers, so they use RCU protection. But
> > > that gives us atomic context when calling dev_get_stats. The naive
> > > solution is to convert all those callers to hold the RTNL mutex, which
> > > is the writer-side protection for the network interface lists, and which
> > > is sleepable. In fact I do have a series of 8 patches where I get that
> > > done. But there are some weirder cases, such as the bonding driver,
> > > where I need to do this:
> > >
> > > -----------------------------[cut here]-----------------------------
> > > From 369a0e18a2446cda8ff52d72c02aa144ae6687ec Mon Sep 17 00:00:00 2001
> > > From: Vladimir Oltean <vladimir.oltean@nxp.com>
> > > Date: Mon, 30 Nov 2020 02:39:46 +0200
> > > Subject: [PATCH] net: bonding: retrieve device statistics under RTNL, not RCU
> > >
> > > In the effort of making .ndo_get_stats64 be able to sleep, we need to
> > > ensure the callers of dev_get_stats do not use atomic context.
> > >
> > > The bonding driver uses an RCU read-side critical section to ensure the
> > > integrity of the list of network interfaces, because the driver iterates
> > > through all net devices in the netns to find the ones which are its
> > > configured slaves. We still need some protection against an interface
> > > registering or deregistering, and the writer-side lock, the RTNL mutex,
> > > is fine for that, because it offers sleepable context.
> > >
> > > We are taking the RTNL this way (checking for rtnl_is_locked first)
> > > because the RTNL is not guaranteed to be held by all callers of
> > > ndo_get_stats64, in fact there will be work in the future that will
> > > avoid as much RTNL-holding as possible.
> > >
> > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> > > ---
> > >  drivers/net/bonding/bond_main.c | 18 +++++++-----------
> > >  include/net/bonding.h           |  1 -
> > >  2 files changed, 7 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> > > index e0880a3840d7..1d44534e95d2 100644
> > > --- a/drivers/net/bonding/bond_main.c
> > > +++ b/drivers/net/bonding/bond_main.c
> > > @@ -3738,21 +3738,17 @@ static void bond_get_stats(struct net_device *bond_dev,
> > >                        struct rtnl_link_stats64 *stats)
> > >  {
> > >     struct bonding *bond = netdev_priv(bond_dev);
> > > +   bool rtnl_locked = rtnl_is_locked();
> > >     struct rtnl_link_stats64 temp;
> > >     struct list_head *iter;
> > >     struct slave *slave;
> > > -   int nest_level = 0;
> > >
> > > +   if (!rtnl_locked)
> > > +           rtnl_lock();
> >
> > Gosh, do not do that.
> >
> > Convert the bonding ->stats_lock to a mutex instead.
> >
> > Adding more reliance to RTNL is not helping cases where
> > access to stats should not be blocked by other users of RTNL (which can be abused)
>
> I can't, Eric. The bond_for_each_slave() macro needs protection against
> net devices being registered and unregistered.

And?

A bonding device can absolutely maintain a private list, ready for
bonding ndo_get_stats() use, regardless
of register/unregister logic.

bond_for_each_slave() is simply a macro, you can replace it by something else.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 19:22               ` Eric Dumazet
@ 2020-11-30 19:32                 ` Vladimir Oltean
  2020-11-30 21:41                   ` Florian Fainelli
  2020-11-30 19:46                 ` Vladimir Oltean
  1 sibling, 1 reply; 28+ messages in thread
From: Vladimir Oltean @ 2020-11-30 19:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, Jakub Kicinski, Stephen Hemminger, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, Nov 30, 2020 at 08:22:01PM +0100, Eric Dumazet wrote:
> On Mon, Nov 30, 2020 at 8:03 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Mon, Nov 30, 2020 at 08:00:40PM +0100, Eric Dumazet wrote:
> > > On 11/30/20 7:48 PM, Vladimir Oltean wrote:
> > > > diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> > > > index e0880a3840d7..1d44534e95d2 100644
> > > > --- a/drivers/net/bonding/bond_main.c
> > > > +++ b/drivers/net/bonding/bond_main.c
> > > > @@ -3738,21 +3738,17 @@ static void bond_get_stats(struct net_device *bond_dev,
> > > >                        struct rtnl_link_stats64 *stats)
> > > >  {
> > > >     struct bonding *bond = netdev_priv(bond_dev);
> > > > +   bool rtnl_locked = rtnl_is_locked();
> > > >     struct rtnl_link_stats64 temp;
> > > >     struct list_head *iter;
> > > >     struct slave *slave;
> > > > -   int nest_level = 0;
> > > >
> > > > +   if (!rtnl_locked)
> > > > +           rtnl_lock();
> > >
> > > Gosh, do not do that.
> > >
> > > Convert the bonding ->stats_lock to a mutex instead.
> > >
> > > Adding more reliance to RTNL is not helping cases where
> > > access to stats should not be blocked by other users of RTNL (which can be abused)
> >
> > I can't, Eric. The bond_for_each_slave() macro needs protection against
> > net devices being registered and unregistered.
>
> And ?
>
> A bonding device can absolutely maintain a private list, ready for
> bonding ndo_get_stats() use, regardless
> of register/unregister logic.
>
> bond_for_each_slave() is simply a macro, you can replace it by something else.

Yeah, ok, let's assume I can do that for the particular case of bonding.
What about this one, though?

-----------------------------[cut here]-----------------------------
From 0d114e38ee20d93b41f8e29082e9e5e0fa7f6b0e Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <vladimir.oltean@nxp.com>
Date: Mon, 30 Nov 2020 02:27:28 +0200
Subject: [PATCH] s390/appldata_net_sum: retrieve device statistics under RTNL,
 not RCU

In the effort of making .ndo_get_stats64 be able to sleep, we need to
ensure the callers of dev_get_stats do not use atomic context.

In the case of the appldata driver, an RCU read-side critical section is
used to ensure the integrity of the list of network interfaces, because
the driver iterates through all net devices in the netns to aggregate
statistics. We still need some protection against an interface
registering or deregistering, and the writer-side lock, the RTNL mutex,
is fine for that, because it offers sleepable context.

The ops->callback function is called from under appldata_ops_mutex
protection, so this is proof that the context is sleepable and holding
RTNL is therefore fine.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 arch/s390/appldata/appldata_net_sum.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/s390/appldata/appldata_net_sum.c b/arch/s390/appldata/appldata_net_sum.c
index 59c282ca002f..22da28ab10d8 100644
--- a/arch/s390/appldata/appldata_net_sum.c
+++ b/arch/s390/appldata/appldata_net_sum.c
@@ -78,8 +78,9 @@ static void appldata_get_net_sum_data(void *data)
 	tx_dropped = 0;
 	collisions = 0;

-	rcu_read_lock();
-	for_each_netdev_rcu(&init_net, dev) {
+	rtnl_lock();
+
+	for_each_netdev(&init_net, dev) {
 		const struct rtnl_link_stats64 *stats;
 		struct rtnl_link_stats64 temp;

@@ -95,7 +96,8 @@ static void appldata_get_net_sum_data(void *data)
 		collisions += stats->collisions;
 		i++;
 	}
-	rcu_read_unlock();
+
+	rtnl_unlock();

 	net_data->nr_interfaces = i;
 	net_data->rx_packets = rx_packets;
-----------------------------[cut here]-----------------------------

Or this one.

-----------------------------[cut here]-----------------------------
From 93ffc25f30849aaf89e50e58d32b0b047831f94d Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <vladimir.oltean@nxp.com>
Date: Mon, 30 Nov 2020 02:49:25 +0200
Subject: [PATCH] parisc/led: retrieve device statistics under RTNL, not RCU

In the effort of making .ndo_get_stats64 be able to sleep, we need to
ensure the callers of dev_get_stats do not use atomic context.

The LED driver for HP-PARISC workstations uses a workqueue to
periodically check for updates in network interface statistics, and
flicker when those have changed (i.e. there has been activity on the
line). Honestly that is a strange idea even when protected by RCU, but
now the dev_get_stats call can sleep, and iterating through the list of
network interfaces still needs to preserve the integrity of that list.
So that leaves us with only one locking option given the
current design of the network stack, and that is the RTNL mutex. In the
future we might be able to make this a little bit less expensive by
creating a separate mutex for the list of network interfaces.

Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: linux-parisc@vger.kernel.org
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/parisc/led.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/parisc/led.c b/drivers/parisc/led.c
index 36c6613f7a36..09dcffaed85f 100644
--- a/drivers/parisc/led.c
+++ b/drivers/parisc/led.c
@@ -38,7 +38,6 @@
 #include <linux/ctype.h>
 #include <linux/blkdev.h>
 #include <linux/workqueue.h>
-#include <linux/rcupdate.h>
 #include <asm/io.h>
 #include <asm/processor.h>
 #include <asm/hardware.h>
@@ -356,12 +355,13 @@ static __inline__ int led_get_net_activity(void)
 
 	rx_total = tx_total = 0;
 	
-	/* we are running as a workqueue task, so we can use an RCU lookup */
-	rcu_read_lock();
-	for_each_netdev_rcu(&init_net, dev) {
+	/* we are running as a workqueue task, so we can take the RTNL mutex */
+	rtnl_lock();
+
+	for_each_netdev(&init_net, dev) {
 	    const struct rtnl_link_stats64 *stats;
 	    struct rtnl_link_stats64 temp;
-	    struct in_device *in_dev = __in_dev_get_rcu(dev);
+	    struct in_device *in_dev = __in_dev_get_rtnl(dev);
 	    if (!in_dev || !in_dev->ifa_list)
 		continue;
 	    if (ipv4_is_loopback(in_dev->ifa_list->ifa_local))
@@ -370,7 +370,8 @@ static __inline__ int led_get_net_activity(void)
 	    rx_total += stats->rx_packets;
 	    tx_total += stats->tx_packets;
 	}
-	rcu_read_unlock();
+
+	rtnl_unlock();
 
 	retval = 0;
 
-----------------------------[cut here]-----------------------------

Where I'm getting at is that we're going to need a new mutex for
write-side protection of the network interface lists, one that is !=
RTNL mutex.

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 19:22               ` Eric Dumazet
  2020-11-30 19:32                 ` Vladimir Oltean
@ 2020-11-30 19:46                 ` Vladimir Oltean
  2020-11-30 20:18                   ` Eric Dumazet
  2020-11-30 20:21                   ` Stephen Hemminger
  1 sibling, 2 replies; 28+ messages in thread
From: Vladimir Oltean @ 2020-11-30 19:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, Jakub Kicinski, Stephen Hemminger, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, Nov 30, 2020 at 08:22:01PM +0100, Eric Dumazet wrote:
> And ?
>
> A bonding device can absolutely maintain a private list, ready for
> bonding ndo_get_stats() use, regardless
> of register/unregister logic.
>
> bond_for_each_slave() is simply a macro, you can replace it by something else.

Also, taking the comment at face value:
Can it really? How? Freeing a net_device at unregister time happens
after an RCU grace period. So whatever the bonding driver does to keep a
private list of slave devices, those pointers need to be under RCU
protection. And that doesn't help with the sleepable context that we're
looking for.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 19:46                 ` Vladimir Oltean
@ 2020-11-30 20:18                   ` Eric Dumazet
  2020-11-30 20:21                   ` Stephen Hemminger
  1 sibling, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2020-11-30 20:18 UTC (permalink / raw)
  To: Vladimir Oltean, Eric Dumazet
  Cc: Eric Dumazet, Jakub Kicinski, Stephen Hemminger, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli



On 11/30/20 8:46 PM, Vladimir Oltean wrote:
> On Mon, Nov 30, 2020 at 08:22:01PM +0100, Eric Dumazet wrote:
>> And ?
>>
>> A bonding device can absolutely maintain a private list, ready for
>> bonding ndo_get_stats() use, regardless
>> of register/unregister logic.
>>
>> bond_for_each_slave() is simply a macro, you can replace it by something else.
> 
> Also, coming to take the comment at face value.
> Can it really? How? Freeing a net_device at unregister time happens
> after an RCU grace period.

Except that the device would have to be removed from the bonding list
before the RCU grace period starts.

This removal would acquire the bonding ->stats_mutex in order to change the list.

> So whatever the bonding driver does to keep a
> private list of slave devices, those pointers need to be under RCU
> protection.


Not at all, if this new list is _only_ used from process context,
and protected by a per-device mutex.

I am not speaking of existing lists that _possibly_ are
used from IRQ context, thus are using RCU.


> And that doesn't help with the sleepable context that we're
> looking for.
>

Again, RCU would not be used at all, since you want ndo_get_stats64()
to be called in process context (sleepable context).

And this should be solved without expanding RTNL usage.
(We do not want to block RTNL for ~10ms just because a device driver has to sleep
while a firmware request is processed.)
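To make the idea concrete, here is a userspace sketch with pthreads, not kernel code: a bonding-style device keeps its own slave list, used only from process context and protected by a per-device mutex (the hypothetical ->stats_mutex above). All toy_* names are invented for illustration.

```c
#include <assert.h>
#include <pthread.h>

/* Toy model: the bond's private slave list is only ever touched from
 * process context, under bond->stats_mutex. No RCU is involved, so a
 * stats walk holding the mutex may sleep. */
struct toy_slave {
	struct toy_slave *next;
	unsigned long tx_packets;
};

struct toy_bond {
	pthread_mutex_t stats_mutex;	/* protects ->slaves */
	struct toy_slave *slaves;
};

/* Slave removal takes the mutex before the slave memory can go away,
 * so a sleeping stats walk never sees a half-removed entry. */
static void toy_bond_del_slave(struct toy_bond *bond, struct toy_slave *slave)
{
	struct toy_slave **p;

	pthread_mutex_lock(&bond->stats_mutex);
	for (p = &bond->slaves; *p; p = &(*p)->next) {
		if (*p == slave) {
			*p = slave->next;
			break;
		}
	}
	pthread_mutex_unlock(&bond->stats_mutex);
}

/* Aggregating stats holds only the per-device mutex, not a global lock. */
static unsigned long toy_bond_get_stats(struct toy_bond *bond)
{
	struct toy_slave *s;
	unsigned long tx = 0;

	pthread_mutex_lock(&bond->stats_mutex);
	for (s = bond->slaves; s; s = s->next)
		tx += s->tx_packets; /* a sleeping per-slave query would go here */
	pthread_mutex_unlock(&bond->stats_mutex);
	return tx;
}
```

The point is that the per-device mutex serializes list changes against the sleepable stats walk without any global lock or RCU grace period.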


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 19:46                 ` Vladimir Oltean
  2020-11-30 20:18                   ` Eric Dumazet
@ 2020-11-30 20:21                   ` Stephen Hemminger
  2020-11-30 20:26                     ` Vladimir Oltean
  1 sibling, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2020-11-30 20:21 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Eric Dumazet, Eric Dumazet, Jakub Kicinski, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, 30 Nov 2020 21:46:17 +0200
Vladimir Oltean <olteanv@gmail.com> wrote:

> On Mon, Nov 30, 2020 at 08:22:01PM +0100, Eric Dumazet wrote:
> > And ?
> >
> > A bonding device can absolutely maintain a private list, ready for
> > bonding ndo_get_stats() use, regardless
> > of register/unregister logic.
> >
> > bond_for_each_slave() is simply a macro, you can replace it by something else.  
> 
> Also, coming to take the comment at face value.
> Can it really? How? Freeing a net_device at unregister time happens
> after an RCU grace period. So whatever the bonding driver does to keep a
> private list of slave devices, those pointers need to be under RCU
> protection. And that doesn't help with the sleepable context that we're
> looking for.

If the device is in a private list (in the bond device), the way to
handle this is to use dev_hold() to keep a refcount.
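For readers unfamiliar with the pattern, here is a minimal userspace sketch of what dev_hold()/dev_put()-style refcounting buys you (illustrative toy_* names, not the real implementation, which frees asynchronously once the count drops):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a refcount keeps the device memory alive while a private
 * list still references it, independent of register/unregister. */
struct toy_netdev {
	int refcnt;
	bool freed;
};

/* Takes an extra reference, e.g. when adding to a private list. */
static void toy_dev_hold(struct toy_netdev *dev)
{
	dev->refcnt++;
}

/* Drops a reference; the device is freed only when the last one goes. */
static void toy_dev_put(struct toy_netdev *dev)
{
	if (--dev->refcnt == 0)
		dev->freed = true; /* stands in for the actual free */
}
```

The refcount prevents a use-after-free on the device itself, but, as noted below, it does not by itself keep a *list* of devices consistent while you walk it.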

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 20:21                   ` Stephen Hemminger
@ 2020-11-30 20:26                     ` Vladimir Oltean
  2020-11-30 20:29                       ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Vladimir Oltean @ 2020-11-30 20:26 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Eric Dumazet, Eric Dumazet, Jakub Kicinski, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, Nov 30, 2020 at 12:21:29PM -0800, Stephen Hemminger wrote:
> if device is in a private list (in bond device), the way to handle
> this is to use dev_hold() to keep a ref count.

Correct, dev_hold is a tool that can also be used. But it is a tool that
does not solve the general problem, only particular ones. See the other
interesting callers of dev_get_stats in parisc, appldata, net_failover.
We can't forever ignore that RTNL is used for write-side locking.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 20:26                     ` Vladimir Oltean
@ 2020-11-30 20:29                       ` Eric Dumazet
  2020-11-30 20:36                         ` Vladimir Oltean
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2020-11-30 20:29 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Stephen Hemminger, Eric Dumazet, Jakub Kicinski, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, Nov 30, 2020 at 9:26 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 12:21:29PM -0800, Stephen Hemminger wrote:
> > if device is in a private list (in bond device), the way to handle
> > this is to use dev_hold() to keep a ref count.
>
> Correct, dev_hold is a tool that can also be used. But it is a tool that
> does not solve the general problem - only particular ones. See the other
> interesting callers of dev_get_stats in parisc, appldata, net_failover.
> We can't ignore that RTNL is used for write-side locking forever.

dev_base_lock is used to protect the list of devices (eg for /proc/net/devices),
so this will need to be replaced by something. dev_hold() won't
protect the 'list' from changing under us.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 20:29                       ` Eric Dumazet
@ 2020-11-30 20:36                         ` Vladimir Oltean
  2020-11-30 20:43                           ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Vladimir Oltean @ 2020-11-30 20:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Eric Dumazet, Jakub Kicinski, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, Nov 30, 2020 at 09:29:15PM +0100, Eric Dumazet wrote:
> On Mon, Nov 30, 2020 at 9:26 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> >
> > On Mon, Nov 30, 2020 at 12:21:29PM -0800, Stephen Hemminger wrote:
> > > if device is in a private list (in bond device), the way to handle
> > > this is to use dev_hold() to keep a ref count.
> >
> > Correct, dev_hold is a tool that can also be used. But it is a tool that
> > does not solve the general problem - only particular ones. See the other
> > interesting callers of dev_get_stats in parisc, appldata, net_failover.
> > We can't ignore that RTNL is used for write-side locking forever.
>
> dev_base_lock is used to protect the list of devices (eg for /proc/net/devices),
> so this will need to be replaced by something. dev_hold() won't
> protect the 'list' from changing under us.

Yes, so as I was saying. I was thinking that I could add another locking
mechanism, such as struct net::netdev_lists_mutex or something like that.
A mutex does not really have a read-side and a write-side, but logically
speaking, this one would. So as long as I take this mutex from all places
that also take the write-side of dev_base_lock, I should get equivalent
semantics on the read side as if I were to take the RTNL mutex. I don't
even need to convert all instances of RTNL-holding, that could be spread
out over a longer period of time. It's just that I can hold this new
netdev_lists_mutex in new code that calls for_each_netdev and friends,
and doesn't otherwise need the RTNL.

Again, the reason why I opened this thread was that I wanted to get rid
of dev_base_lock first, before I introduced the struct net::netdev_lists_mutex.
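The proposal above can be sketched in userspace with pthreads (not kernel code; the names model the hypothetical netdev_lists_mutex and are invented for illustration):

```c
#include <assert.h>
#include <pthread.h>

/* Toy model: the interface list gets its own mutex, separate from the
 * big "rtnl" lock. During a transition, writers take both locks, so
 * old-style readers (holding the big lock) and new-style readers
 * (holding only the list mutex) both stay safe. */
struct toy_netdev {
	struct toy_netdev *next;
	unsigned long rx_packets;
};

static pthread_mutex_t toy_rtnl = PTHREAD_MUTEX_INITIALIZER;	/* big lock */
static pthread_mutex_t toy_netdev_lists_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct toy_netdev *toy_dev_base;

/* Writer: registering a device takes both locks, like list_netdevice()
 * would after growing one more layer of locking. */
static void toy_register_netdev(struct toy_netdev *dev)
{
	pthread_mutex_lock(&toy_rtnl);
	pthread_mutex_lock(&toy_netdev_lists_mutex);
	dev->next = toy_dev_base;
	toy_dev_base = dev;
	pthread_mutex_unlock(&toy_netdev_lists_mutex);
	pthread_mutex_unlock(&toy_rtnl);
}

/* Reader: walking the list for statistics only needs the list mutex,
 * and may sleep while holding it, without serializing against
 * everything else the big lock protects. */
static unsigned long toy_sum_rx_packets(void)
{
	struct toy_netdev *dev;
	unsigned long sum = 0;

	pthread_mutex_lock(&toy_netdev_lists_mutex);
	for (dev = toy_dev_base; dev; dev = dev->next)
		sum += dev->rx_packets; /* a sleeping dev_get_stats() would go here */
	pthread_mutex_unlock(&toy_netdev_lists_mutex);
	return sum;
}
```

This is just the logical "read side of a mutex" idea: readers and writers share one mutex for the list, while the big lock keeps protecting whatever else it protects today.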

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 20:36                         ` Vladimir Oltean
@ 2020-11-30 20:43                           ` Eric Dumazet
  2020-11-30 20:50                             ` Vladimir Oltean
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2020-11-30 20:43 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Stephen Hemminger, Eric Dumazet, Jakub Kicinski, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, Nov 30, 2020 at 9:36 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 09:29:15PM +0100, Eric Dumazet wrote:
> > On Mon, Nov 30, 2020 at 9:26 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> > >
> > > On Mon, Nov 30, 2020 at 12:21:29PM -0800, Stephen Hemminger wrote:
> > > > if device is in a private list (in bond device), the way to handle
> > > > this is to use dev_hold() to keep a ref count.
> > >
> > > Correct, dev_hold is a tool that can also be used. But it is a tool that
> > > does not solve the general problem - only particular ones. See the other
> > > interesting callers of dev_get_stats in parisc, appldata, net_failover.
> > > We can't ignore that RTNL is used for write-side locking forever.
> >
> > dev_base_lock is used to protect the list of devices (eg for /proc/net/devices),
> > so this will need to be replaced by something. dev_hold() won't
> > protect the 'list' from changing under us.
>
> Yes, so as I was saying. I was thinking that I could add another locking
> mechanism, such as struct net::netdev_lists_mutex or something like that.
> A mutex does not really have a read-side and a write-side, but logically
> speaking, this one would. So as long as I take this mutex from all places
> that also take the write-side of dev_base_lock, I should get equivalent
> semantics on the read side as if I were to take the RTNL mutex. I don't
> even need to convert all instances of RTNL-holding, that could be spread
> out over a longer period of time. It's just that I can hold this new
> netdev_lists_mutex in new code that calls for_each_netdev and friends,
> and doesn't otherwise need the RTNL.
>
> Again, the reason why I opened this thread was that I wanted to get rid
> of dev_base_lock first, before I introduced the struct net::netdev_lists_mutex.

Understood, but really dev_base_lock can only be removed _after_ we
convert all usages to something else (mutex based, and preferably not
the global RTNL)

Focusing on dev_base_lock seems a distraction really.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 20:43                           ` Eric Dumazet
@ 2020-11-30 20:50                             ` Vladimir Oltean
  2020-11-30 21:00                               ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Vladimir Oltean @ 2020-11-30 20:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Eric Dumazet, Jakub Kicinski, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, Nov 30, 2020 at 09:43:01PM +0100, Eric Dumazet wrote:
> Understood, but really dev_base_lock can only be removed _after_ we
> convert all usages to something else (mutex based, and preferably not
> the global RTNL)

Sure.
A large part of getting rid of dev_base_lock seems to be just:
- deleting the bogus usage from mlx4 infiniband and friends
- converting procfs, sysfs and friends to netdev_lists_mutex
- renaming whatever is left into something related to the RFC 2863
  operstate.

> Focusing on dev_base_lock seems a distraction really.

Maybe.
But it's going to be awkward to explain in words what the locking rules
are, when the read side can optionally take the dev_base_lock, RCU, or
netdev_lists_lock, and the write side can optionally take the dev_base_lock,
RTNL, or netdev_lists_lock. Not to mention that anybody grepping for
dev_base_lock will see the current usage and not make a lot out of it.

I'm not really sure how to order this rework to be honest.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 20:50                             ` Vladimir Oltean
@ 2020-11-30 21:00                               ` Eric Dumazet
  2020-11-30 21:11                                 ` Vladimir Oltean
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2020-11-30 21:00 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Stephen Hemminger, Eric Dumazet, Jakub Kicinski, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, Nov 30, 2020 at 9:50 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 09:43:01PM +0100, Eric Dumazet wrote:
> > Understood, but really dev_base_lock can only be removed _after_ we
> > convert all usages to something else (mutex based, and preferably not
> > the global RTNL)
>
> Sure.
> A large part of getting rid of dev_base_lock seems to be just:
> - deleting the bogus usage from mlx4 infiniband and friends
> - converting procfs, sysfs and friends to netdev_lists_mutex
> - renaming whatever is left into something related to the RFC 2863
>   operstate.
>
> > Focusing on dev_base_lock seems a distraction really.
>
> Maybe.
> But it's going to be awkward to explain in words what the locking rules
> are, when the read side can take optionally the dev_base_lock, RCU, or
> netdev_lists_lock, and the write side can take optionally the dev_base_lock,
> RTNL, or netdev_lists_lock. Not to mention that anybody grepping for
> dev_base_lock will see the current usage and not make a lot out of it.
>
> I'm not really sure how to order this rework to be honest.

We cannot have a mix of RCU/rwlock/mutex. It must be one, because of
bonding/teaming.

So all existing uses of rwlock / RCU need to be removed.

This is probably not trivial.

Perhaps you could add a temporary ndo_get_sleepable_stats64() so that
drivers can be converted one at a time.
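The one-driver-at-a-time conversion could look like this userspace sketch (ndo_get_sleepable_stats64 is only a suggested name above; the toy_* layout below is invented for illustration, not the real netdev_ops):

```c
#include <stddef.h>

/* Toy model: keep the existing atomic-safe stats op alongside a new,
 * optional sleepable variant, so drivers convert one at a time. */
struct toy_stats {
	unsigned long rx_packets;
};

struct toy_ops {
	void (*get_stats64)(struct toy_stats *s);		/* may not sleep */
	void (*get_sleepable_stats64)(struct toy_stats *s);	/* may sleep */
};

/* Callers in sleepable context prefer the new op and fall back to the
 * old one for unconverted drivers. */
static void toy_get_stats(const struct toy_ops *ops, struct toy_stats *s)
{
	if (ops->get_sleepable_stats64)
		ops->get_sleepable_stats64(s); /* e.g. may wait for firmware */
	else
		ops->get_stats64(s);
}

/* Two pretend drivers: one unconverted, one converted. */
static void old_driver_stats(struct toy_stats *s) { s->rx_packets = 1; }
static void new_driver_stats(struct toy_stats *s) { s->rx_packets = 2; }
```

Atomic-context callers would keep calling only the old op, so nothing breaks while the conversion is in flight.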

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 21:00                               ` Eric Dumazet
@ 2020-11-30 21:11                                 ` Vladimir Oltean
  2020-11-30 21:46                                   ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Vladimir Oltean @ 2020-11-30 21:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Eric Dumazet, Jakub Kicinski, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, Nov 30, 2020 at 10:00:16PM +0100, Eric Dumazet wrote:
> On Mon, Nov 30, 2020 at 9:50 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> >
> > On Mon, Nov 30, 2020 at 09:43:01PM +0100, Eric Dumazet wrote:
> > > Understood, but really dev_base_lock can only be removed _after_ we
> > > convert all usages to something else (mutex based, and preferably not
> > > the global RTNL)
> >
> > Sure.
> > A large part of getting rid of dev_base_lock seems to be just:
> > - deleting the bogus usage from mlx4 infiniband and friends
> > - converting procfs, sysfs and friends to netdev_lists_mutex
> > - renaming whatever is left into something related to the RFC 2863
> >   operstate.
> >
> > > Focusing on dev_base_lock seems a distraction really.
> >
> > Maybe.
> > But it's going to be awkward to explain in words what the locking rules
> > are, when the read side can take optionally the dev_base_lock, RCU, or
> > netdev_lists_lock, and the write side can take optionally the dev_base_lock,
> > RTNL, or netdev_lists_lock. Not to mention that anybody grepping for
> > dev_base_lock will see the current usage and not make a lot out of it.
> >
> > I'm not really sure how to order this rework to be honest.
>
> We can not have a mix of RCU /rwlock/mutex. It must be one, because of
> bonding/teaming.
>
> So all existing uses of rwlock / RCU need to be removed.
>
> This is probably not trivial.

Now, "it's going to look nasty" is one thing, whereas "it won't work" is
completely different. I think it would work though, so could you expand
on why you're saying we can't have the mix? dev_change_name(),
list_netdevice() and unlist_netdevice() just need to take one more layer
of locking. The new netdev_lists_mutex would serve as a temporary
alternative to the RTNL mutex. Then we could gradually replace more and
more of the RTNL mutex with netdev_lists_mutex. The bonding driver can
certainly use the netdev_lists_mutex. It guarantees protection against
the three functions mentioned above, and it is sleepable, and it is not
the RTNL mutex. So can procfs and sysfs. Am I missing something?

> Perhaps you could add a temporary ndo_get_sleepable_stats64() so that
> drivers can be converted one at a time.

Yeah, been there, Jakub doesn't like it.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 19:32                 ` Vladimir Oltean
@ 2020-11-30 21:41                   ` Florian Fainelli
  0 siblings, 0 replies; 28+ messages in thread
From: Florian Fainelli @ 2020-11-30 21:41 UTC (permalink / raw)
  To: Vladimir Oltean, Eric Dumazet
  Cc: Eric Dumazet, Jakub Kicinski, Stephen Hemminger, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn



On 11/30/2020 11:32 AM, Vladimir Oltean wrote:
> -----------------------------[cut here]-----------------------------
> From 93ffc25f30849aaf89e50e58d32b0b047831f94d Mon Sep 17 00:00:00 2001
> From: Vladimir Oltean <vladimir.oltean@nxp.com>
> Date: Mon, 30 Nov 2020 02:49:25 +0200
> Subject: [PATCH] parisc/led: retrieve device statistics under RTNL, not RCU
> 
> In the effort of making .ndo_get_stats64 be able to sleep, we need to
> ensure the callers of dev_get_stats do not use atomic context.
> 
> The LED driver for HP-PARISC workstations uses a workqueue to
> periodically check for updates in network interface statistics, and
> flicker when those have changed (i.e. there has been activity on the
> line). Honestly that is a strange idea even when protected by RCU, but
> now, the dev_get_stats call can sleep, and iterating through the list of
> network interfaces still needs to ensure the integrity of list of
> network interfaces. So that leaves us only one locking option given the
> current design of the network stack, and that is the RTNL mutex. In the
> future we might be able to make this a little bit less expensive by
> creating a separate mutex for the list of network interfaces.

We have a netdev LED trigger under drivers/leds/trigger/ledtrig-netdev.c
that appears to be nicely duplicating this LED driver.
-- 
Florian

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 21:11                                 ` Vladimir Oltean
@ 2020-11-30 21:46                                   ` Eric Dumazet
  2020-11-30 21:53                                     ` Vladimir Oltean
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2020-11-30 21:46 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Stephen Hemminger, Eric Dumazet, Jakub Kicinski, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, Nov 30, 2020 at 10:12 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 10:00:16PM +0100, Eric Dumazet wrote:
> > On Mon, Nov 30, 2020 at 9:50 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> > >
> > > On Mon, Nov 30, 2020 at 09:43:01PM +0100, Eric Dumazet wrote:
> > > > Understood, but really dev_base_lock can only be removed _after_ we
> > > > convert all usages to something else (mutex based, and preferably not
> > > > the global RTNL)
> > >
> > > Sure.
> > > A large part of getting rid of dev_base_lock seems to be just:
> > > - deleting the bogus usage from mlx4 infiniband and friends
> > > - converting procfs, sysfs and friends to netdev_lists_mutex
> > > - renaming whatever is left into something related to the RFC 2863
> > >   operstate.
> > >
> > > > Focusing on dev_base_lock seems a distraction really.
> > >
> > > Maybe.
> > > But it's going to be awkward to explain in words what the locking rules
> > > are, when the read side can take optionally the dev_base_lock, RCU, or
> > > netdev_lists_lock, and the write side can take optionally the dev_base_lock,
> > > RTNL, or netdev_lists_lock. Not to mention that anybody grepping for
> > > dev_base_lock will see the current usage and not make a lot out of it.
> > >
> > > I'm not really sure how to order this rework to be honest.
> >
> > We can not have a mix of RCU /rwlock/mutex. It must be one, because of
> > bonding/teaming.
> >
> > So all existing uses of rwlock / RCU need to be removed.
> >
> > This is probably not trivial.
>
> Now, "it's going to look nasty" is one thing, whereas "it won't work" is
> completely different. I think it would work though, so could you expand
> on why you're saying we can't have the mix?

You can not use dev_base_lock() or RCU and call an ndo_get_stats64()
that could sleep.

You can not for example start changing bonding, since bond_get_stats()
could be called from non-sleepable context (net/core/net-procfs.c)

I am still referring to your patch adding :

+       if (!rtnl_locked)
+               rtnl_lock();

This is all I said.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 21:46                                   ` Eric Dumazet
@ 2020-11-30 21:53                                     ` Vladimir Oltean
  2020-11-30 22:20                                       ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Vladimir Oltean @ 2020-11-30 21:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Eric Dumazet, Jakub Kicinski, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, Nov 30, 2020 at 10:46:00PM +0100, Eric Dumazet wrote:
> You can not use dev_base_lock() or RCU and call an ndo_get_stats64()
> that could sleep.
>
> You can not for example start changing bonding, since bond_get_stats()
> could be called from non-sleepable context (net/core/net-procfs.c)
>
> I am still referring to your patch adding :
>
> +       if (!rtnl_locked)
> +               rtnl_lock();
>
> This is all I said.

Ah, ok, well I didn't show you all the patches, did I?

-----------------------------[cut here]-----------------------------
From d62c65ef6cb357e1b8c5a4ab189718e157a569ae Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <vladimir.oltean@nxp.com>
Date: Thu, 26 Nov 2020 23:08:57 +0200
Subject: [PATCH] net: procfs: retrieve device statistics under RTNL, not RCU

In the effort of making .ndo_get_stats64 be able to sleep, we need to
ensure the callers of dev_get_stats do not use atomic context.

The /proc/net/dev file uses an RCU read-side critical section to ensure
the integrity of the list of network interfaces, because it iterates
through all net devices in the netns to show their statistics.
We still need some protection against an interface registering or
deregistering, and the writer-side lock, the RTNL mutex, is fine for
that, because it offers sleepable context.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/core/net-procfs.c                   | 11 ++++++-----
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
index c714e6a9dad4..1429e8c066d8 100644
--- a/net/core/net-procfs.c
+++ b/net/core/net-procfs.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/netdevice.h>
 #include <linux/proc_fs.h>
+#include <linux/rtnetlink.h>
 #include <linux/seq_file.h>
 #include <net/wext.h>
 
@@ -21,7 +22,7 @@ static inline struct net_device *dev_from_same_bucket(struct seq_file *seq, loff
 	unsigned int count = 0, offset = get_offset(*pos);
 
 	h = &net->dev_index_head[get_bucket(*pos)];
-	hlist_for_each_entry_rcu(dev, h, index_hlist) {
+	hlist_for_each_entry(dev, h, index_hlist) {
 		if (++count == offset)
 			return dev;
 	}
@@ -51,9 +52,9 @@ static inline struct net_device *dev_from_bucket(struct seq_file *seq, loff_t *p
  *	in detail.
  */
 static void *dev_seq_start(struct seq_file *seq, loff_t *pos)
-	__acquires(RCU)
+	__acquires(rtnl_mutex)
 {
-	rcu_read_lock();
+	rtnl_lock();
 	if (!*pos)
 		return SEQ_START_TOKEN;
 
@@ -70,9 +71,9 @@ static void *dev_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 }
 
 static void dev_seq_stop(struct seq_file *seq, void *v)
-	__releases(RCU)
+	__releases(rtnl_mutex)
 {
-	rcu_read_unlock();
+	rtnl_unlock();
 }
 
 static void dev_seq_printf_stats(struct seq_file *seq, struct net_device *dev)
-----------------------------[cut here]-----------------------------

and:

-----------------------------[cut here]-----------------------------
From 0c51569116b3844d0d99831697a8e4134814d50e Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <vladimir.oltean@nxp.com>
Date: Mon, 30 Nov 2020 02:51:01 +0200
Subject: [PATCH] net: sysfs: retrieve device statistics unlocked

In the effort of making .ndo_get_stats64 be able to sleep, we need to
ensure the callers of dev_get_stats do not use atomic context.

I need to preface this by saying that I have no idea why netstat_show
takes the dev_base_lock rwlock. Two things are for certain:
(a) it's not due to dev_isalive requiring it for some mystical reason,
    because broadcast_show() also calls dev_isalive() and has had no
    problem existing since the beginning of git.
(b) the dev_get_stats function definitely does not need dev_base_lock
    protection either. In fact, holding the dev_base_lock is the entire
    problem here, because we want to make dev_get_stats sleepable, and
    holding a rwlock gives us atomic context.

So since no protection seems to be necessary, just run unlocked while
retrieving the /sys/class/net/eth0/statistics/* values.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/core/net-sysfs.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 999b70c59761..0782a476b424 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -585,14 +585,13 @@ static ssize_t netstat_show(const struct device *d,
 	WARN_ON(offset > sizeof(struct rtnl_link_stats64) ||
 		offset % sizeof(u64) != 0);
 
-	read_lock(&dev_base_lock);
 	if (dev_isalive(dev)) {
 		struct rtnl_link_stats64 temp;
 		const struct rtnl_link_stats64 *stats = dev_get_stats(dev, &temp);
 
 		ret = sprintf(buf, fmt_u64, *(u64 *)(((u8 *)stats) + offset));
 	}
-	read_unlock(&dev_base_lock);
+
 	return ret;
 }
 
-----------------------------[cut here]-----------------------------

In all fairness, I think there is precedent in the kernel for changing
so much RCU-protected code to use RTNL. I can't recall the exact link
now, except for this one:
https://patchwork.ozlabs.org/project/netdev/patch/1410306738-18036-2-git-send-email-xiyou.wangcong@gmail.com/,
but you and Cong have changed a lot of RCU-protected accesses in IPv6,
because the read-side critical section was taking too much time and was
not sleepable/preemptible.

Now, I do agree that there's only so much we can keep adding to the RTNL
mutex. I guess somebody needs to start the migration towards a different
mutex. I'll prepare some patches then, and rework the ones that I have.

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 21:53                                     ` Vladimir Oltean
@ 2020-11-30 22:20                                       ` Eric Dumazet
  2020-11-30 22:41                                         ` Vladimir Oltean
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2020-11-30 22:20 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Stephen Hemminger, Eric Dumazet, Jakub Kicinski, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, Nov 30, 2020 at 10:53 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 10:46:00PM +0100, Eric Dumazet wrote:
> > You can not use dev_base_lock() or RCU and call an ndo_get_stats64()
> > that could sleep.
> >
> > You can not for example start changing bonding, since bond_get_stats()
> > could be called from non-sleepable context (net/core/net-procfs.c)
> >
> > I am still referring to your patch adding :
> >
> > +       if (!rtnl_locked)
> > +               rtnl_lock();
> >
> > This is all I said.
>
> Ah, ok, well I didn't show you all the patches, did I?


Have you sent them during Thanksgiving, perhaps?

I suggest you follow the normal submission process, sending patch series
rather than inlining multiple patches in one email; this is becoming
hard to follow.


* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 22:20                                       ` Eric Dumazet
@ 2020-11-30 22:41                                         ` Vladimir Oltean
  0 siblings, 0 replies; 28+ messages in thread
From: Vladimir Oltean @ 2020-11-30 22:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Eric Dumazet, Jakub Kicinski, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli

On Mon, Nov 30, 2020 at 11:20:23PM +0100, Eric Dumazet wrote:
> On Mon, Nov 30, 2020 at 10:53 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> >
> > On Mon, Nov 30, 2020 at 10:46:00PM +0100, Eric Dumazet wrote:
> > > You can not use dev_base_lock() or RCU and call an ndo_get_stats64()
> > > that could sleep.
> > >
> > > You can not for example start changing bonding, since bond_get_stats()
> > > could be called from non-sleepable context (net/core/net-procfs.c)
> > >
> > > I am still referring to your patch adding :
> > >
> > > +       if (!rtnl_locked)
> > > +               rtnl_lock();
> > >
> > > This is all I said.
> >
> > Ah, ok, well I didn't show you all the patches, did I?
>
>
> Have you sent them during Thanksgiving, perhaps?
>
> I suggest you follow the normal submission process, sending patch series
> rather than inlining multiple patches in one email; this is becoming
> hard to follow.

No, I did not formally post these for review at all, nor do I intend to.
I just wrote them "for fun" (if this could be called fun) to get an idea
of how much there is to change, in the "best case" where I do no rework
to the locking at all, just use what's currently available. And I can't
submit these patches as-is, because of lockdep warnings in bonding. I
will post patches formally for review when I have a clear migration
plan.


* Re: Correct usage of dev_base_lock in 2020
  2020-11-30 18:48         ` Vladimir Oltean
  2020-11-30 19:00           ` Eric Dumazet
@ 2020-12-01 14:42           ` Pablo Neira Ayuso
  2020-12-01 18:58             ` Vladimir Oltean
  2020-12-10  4:32           ` [PATCH] net: bonding: retrieve device statistics under RTNL, not RCU kernel test robot
  2 siblings, 1 reply; 28+ messages in thread
From: Pablo Neira Ayuso @ 2020-12-01 14:42 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Jakub Kicinski, Eric Dumazet, Stephen Hemminger, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli, fw

On Mon, Nov 30, 2020 at 08:48:28PM +0200, Vladimir Oltean wrote:
[...]
> There are 2 separate classes of problems:
> - We already have two ways of protecting pure readers: via RCU and via
>   the rwlock. It doesn't help if we also add a second way of locking out
>   pure writers. We need to first clean up what we have. That's the
>   reason why I started the discussion about the rwlock.
> - Some callers appear to not use any sort of protection at all. Does
>   this code path look ok to you?
> 
> nfnetlink_rcv_batch
> -> NFT_MSG_NEWCHAIN

This path holds the nft commit mutex.

>    -> nf_tables_addchain
>       -> nft_chain_parse_hook
>          -> nft_chain_parse_netdev
>             -> nf_tables_parse_netdev_hooks
>                -> nft_netdev_hook_alloc
>                   -> __dev_get_by_name
>                      -> netdev_name_node_lookup: must be under RTNL mutex or dev_base_lock protection

The nf_tables_netdev_event() notifier holds the nft commit mutex too.
Assuming the worst case, a race between __dev_get_by_name() and device
removal, the notifier waits for the NFT_MSG_NEWCHAIN path to finish.
If the nf_tables_netdev_event() notifier wins the race, then
__dev_get_by_name() hits ENOENT.

The idea is explained here:

commit 90d2723c6d4cb2ace50fc3b932a2bcc77710450b
Author: Pablo Neira Ayuso <pablo@netfilter.org>
Date:   Tue Mar 20 17:00:19 2018 +0100

    netfilter: nf_tables: do not hold reference on netdevice from preparation phase

    The netfilter netdevice event handler hold the nfnl_lock mutex, this
    avoids races with a device going away while such device is being
    attached to hooks from the netlink control plane. Therefore, either
    control plane bails out with ENOENT or netdevice event path waits until
    the hook that is attached to net_device is registered.

I can submit a patch adding a comment so anyone else does not get
confused :-)


* Re: Correct usage of dev_base_lock in 2020
  2020-12-01 14:42           ` Pablo Neira Ayuso
@ 2020-12-01 18:58             ` Vladimir Oltean
  0 siblings, 0 replies; 28+ messages in thread
From: Vladimir Oltean @ 2020-12-01 18:58 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Jakub Kicinski, Eric Dumazet, Stephen Hemminger, netdev,
	Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn, Florian Fainelli, fw

On Tue, Dec 01, 2020 at 03:42:38PM +0100, Pablo Neira Ayuso wrote:
> On Mon, Nov 30, 2020 at 08:48:28PM +0200, Vladimir Oltean wrote:
> [...]
> > There are 2 separate classes of problems:
> > - We already have two ways of protecting pure readers: via RCU and via
> >   the rwlock. It doesn't help if we also add a second way of locking out
> >   pure writers. We need to first clean up what we have. That's the
> >   reason why I started the discussion about the rwlock.
> > - Some callers appear to not use any sort of protection at all. Does
> >   this code path look ok to you?
> > 
> > nfnetlink_rcv_batch
> > -> NFT_MSG_NEWCHAIN
> 
> This path holds the nft commit mutex.
> 
> >    -> nf_tables_addchain
> >       -> nft_chain_parse_hook
> >          -> nft_chain_parse_netdev
> >             -> nf_tables_parse_netdev_hooks
> >                -> nft_netdev_hook_alloc
> >                   -> __dev_get_by_name
> >                      -> netdev_name_node_lookup: must be under RTNL mutex or dev_base_lock protection
> 
> The nf_tables_netdev_event() notifier holds the nft commit mutex too.
> Assuming the worst case, a race between __dev_get_by_name() and device
> removal, the notifier waits for the NFT_MSG_NEWCHAIN path to finish.
> If the nf_tables_netdev_event() notifier wins the race, then
> __dev_get_by_name() hits ENOENT.
> 
> The idea is explained here:
> 
> commit 90d2723c6d4cb2ace50fc3b932a2bcc77710450b
> Author: Pablo Neira Ayuso <pablo@netfilter.org>
> Date:   Tue Mar 20 17:00:19 2018 +0100
> 
>     netfilter: nf_tables: do not hold reference on netdevice from preparation phase
> 
>     The netfilter netdevice event handler hold the nfnl_lock mutex, this
>     avoids races with a device going away while such device is being
>     attached to hooks from the netlink control plane. Therefore, either
>     control plane bails out with ENOENT or netdevice event path waits until
>     the hook that is attached to net_device is registered.
> 
> I can submit a patch adding a comment so anyone else does not get
> confused :-)

Ok, so since you are holding the net->nft.commit_mutex from a code path
that is called under RTNL (call_netdevice_notifiers_info), you are
indirectly serializing all the other holders of the commit_mutex with
the RTNL mutex.

I think it would really help if you could add a comment, yes.


Some other code paths that call __dev_get_by_name() while not holding
the RTNL mutex or the dev_base_lock:

atrtr_ioctl_addrt() <- atalk_ioctl() <- not under RTNL or dev_base_lock
handle_ip_over_ddp() <- atalk_rcv() <- not under RTNL or dev_base_lock
net/bridge/br_sysfs_if.c: store_backup_port() <- sysfs <- not under RTNL or dev_base_lock

net/netfilter/ipvs/ip_vs_sync.c: start_sync_thread() <- not under RTNL or dev_base_lock

include/net/pkt_cls.h: tcf_change_indev() <- fl_set_key() <- fl_set_parms <- RTNL potentially held depending on bool rtnl_held, dev_base_lock definitely not


* Re: [PATCH] net: bonding: retrieve device statistics under RTNL, not RCU
  2020-11-30 18:48         ` Vladimir Oltean
  2020-11-30 19:00           ` Eric Dumazet
  2020-12-01 14:42           ` Pablo Neira Ayuso
@ 2020-12-10  4:32           ` kernel test robot
  2 siblings, 0 replies; 28+ messages in thread
From: kernel test robot @ 2020-12-10  4:32 UTC (permalink / raw)
  To: Vladimir Oltean, Jakub Kicinski
  Cc: kbuild-all, clang-built-linux, Eric Dumazet, Stephen Hemminger,
	netdev, Paul Gortmaker, Jiri Benc, Or Gerlitz, Cong Wang,
	Jamal Hadi Salim, Andrew Lunn

[-- Attachment #1: Type: text/plain, Size: 5642 bytes --]

Hi Vladimir,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.10-rc7 next-20201209]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Vladimir-Oltean/net-bonding-retrieve-device-statistics-under-RTNL-not-RCU/20201201-025343
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git b65054597872ce3aefbc6a666385eabdf9e288da
config: mips-randconfig-r026-20201209 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 1968804ac726e7674d5de22bc2204b45857da344)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install mips cross compiling tool for clang build
        # apt-get install binutils-mips-linux-gnu
        # https://github.com/0day-ci/linux/commit/ae1fadfd6122065f2a211de176290d184fda525a
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Vladimir-Oltean/net-bonding-retrieve-device-statistics-under-RTNL-not-RCU/20201201-025343
        git checkout ae1fadfd6122065f2a211de176290d184fda525a
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=mips 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/net/bonding/bond_main.c:3698:12: warning: unused function 'bond_get_lowest_level_rcu'
   static int bond_get_lowest_level_rcu(struct net_device
   ^
   fatal error: error in backend: Nested variants found in inline asm string: ' .set push
   .set mips64r2
   .if ( 0x00 ) != -1)) 0x00 ) != -1)) : ($( static struct ftrace_branch_data __attribute__((__aligned__(4))) __attribute__((__section__("_ftrace_branch"))) __if_trace = $( .func = __func__, .file = "arch/mips/include/asm/atomic.h", .line = 153, $); 0x00 ) != -1)) : $))) ) && ( 0 ); .set push; .set mips64r2; .rept 1; sync 0x00; .endr; .set pop; .else; ; .endif
   1: ll $0, $1 # atomic_add
   addu $0, $2
   sc $0, $1
   beqz $0, 1b
   .set pop
   '
   clang-12: error: clang frontend command failed with exit code 70 (use -v to see invocation)
   clang version 12.0.0 (git://gitmirror/llvm_project 1968804ac726e7674d5de22bc2204b45857da344)
   Target: mipsel-unknown-linux-gnu
   Thread model: posix
   InstalledDir: /opt/cross/clang-1968804ac7/bin
   clang-12: note: diagnostic msg:
   Makefile arch drivers include kernel scripts source usr

vim +/bond_get_lowest_level_rcu +3698 drivers/net/bonding/bond_main.c

fe30937b65354c Eric Dumazet 2016-03-17  3696  
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3697  #ifdef CONFIG_LOCKDEP
b3e80d44f5b1b4 Taehee Yoo   2020-02-15 @3698  static int bond_get_lowest_level_rcu(struct net_device *dev)
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3699  {
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3700  	struct net_device *ldev, *next, *now, *dev_stack[MAX_NEST_DEV + 1];
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3701  	struct list_head *niter, *iter, *iter_stack[MAX_NEST_DEV + 1];
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3702  	int cur = 0, max = 0;
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3703  
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3704  	now = dev;
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3705  	iter = &dev->adj_list.lower;
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3706  
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3707  	while (1) {
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3708  		next = NULL;
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3709  		while (1) {
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3710  			ldev = netdev_next_lower_dev_rcu(now, &iter);
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3711  			if (!ldev)
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3712  				break;
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3713  
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3714  			next = ldev;
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3715  			niter = &ldev->adj_list.lower;
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3716  			dev_stack[cur] = now;
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3717  			iter_stack[cur++] = iter;
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3718  			if (max <= cur)
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3719  				max = cur;
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3720  			break;
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3721  		}
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3722  
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3723  		if (!next) {
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3724  			if (!cur)
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3725  				return max;
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3726  			next = dev_stack[--cur];
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3727  			niter = iter_stack[cur];
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3728  		}
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3729  
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3730  		now = next;
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3731  		iter = niter;
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3732  	}
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3733  
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3734  	return max;
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3735  }
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3736  #endif
b3e80d44f5b1b4 Taehee Yoo   2020-02-15  3737  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26899 bytes --]


end of thread, other threads:[~2020-12-10  4:34 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20201129182435.jgqfjbekqmmtaief@skbuf>
2020-11-29 20:58 ` Correct usage of dev_base_lock in 2020 Vladimir Oltean
2020-11-30  5:12   ` Stephen Hemminger
2020-11-30 10:41     ` Eric Dumazet
2020-11-30 18:14       ` Jakub Kicinski
2020-11-30 18:30         ` Eric Dumazet
2020-11-30 18:48         ` Vladimir Oltean
2020-11-30 19:00           ` Eric Dumazet
2020-11-30 19:03             ` Vladimir Oltean
2020-11-30 19:22               ` Eric Dumazet
2020-11-30 19:32                 ` Vladimir Oltean
2020-11-30 21:41                   ` Florian Fainelli
2020-11-30 19:46                 ` Vladimir Oltean
2020-11-30 20:18                   ` Eric Dumazet
2020-11-30 20:21                   ` Stephen Hemminger
2020-11-30 20:26                     ` Vladimir Oltean
2020-11-30 20:29                       ` Eric Dumazet
2020-11-30 20:36                         ` Vladimir Oltean
2020-11-30 20:43                           ` Eric Dumazet
2020-11-30 20:50                             ` Vladimir Oltean
2020-11-30 21:00                               ` Eric Dumazet
2020-11-30 21:11                                 ` Vladimir Oltean
2020-11-30 21:46                                   ` Eric Dumazet
2020-11-30 21:53                                     ` Vladimir Oltean
2020-11-30 22:20                                       ` Eric Dumazet
2020-11-30 22:41                                         ` Vladimir Oltean
2020-12-01 14:42           ` Pablo Neira Ayuso
2020-12-01 18:58             ` Vladimir Oltean
2020-12-10  4:32           ` [PATCH] net: bonding: retrieve device statistics under RTNL, not RCU kernel test robot
