linux-kernel.vger.kernel.org archive mirror
* [patch V2 0/4] net, refcount: Address dst_entry reference count scalability issues
@ 2023-03-07 12:57 Thomas Gleixner
  2023-03-07 12:57 ` [patch V2 1/4] net: dst: Prevent false sharing vs. dst_entry::__refcnt Thomas Gleixner
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Thomas Gleixner @ 2023-03-07 12:57 UTC
  To: LKML
  Cc: Linus Torvalds, x86, Wangyang Guo, Arjan van De Ven,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier

Hi!

This is version 2 of this series. Version 1 can be found here:

     https://lore.kernel.org/lkml/20230228132118.978145284@linutronix.de

Wangyang and Arjan reported a bottleneck in the networking code related to
struct dst_entry::__refcnt. Performance tanks massively when concurrency on
a dst_entry increases.

This happens when there is a large number of connections to or from the
same IP address. The memtier benchmark when run on the same host as
memcached amplifies this massively. But even over real network connections
this issue can be observed at an obviously smaller scale (due to the
network bandwidth limitations in my setup, i.e. 1Gb). How to reproduce:

  Run memcached with -t $N and memtier_benchmark with -t $M and --ratio=1:100
  on the same machine. localhost connections amplify the problem.

  Start with the defaults for $N and $M and increase them. Depending on
  your machine this will tank at some point. But even in reasonably small
  $N, $M scenarios the refcount operations and the resulting false sharing
  fallout become visible in perf top. At some point they become the
  dominating issue.

There are two factors which make this reference count a scalability issue:

   1) False sharing

      dst_entry::__refcnt is located at offset 64 of dst_entry, which puts
      it into a separate cacheline vs. the read mostly members located at
      the beginning of the struct.

      That prevents false sharing vs. the struct members in the first 64
      bytes of the structure, but there is also

      	    dst_entry::lwtstate

      which is located after the reference count and in the same cache
      line. This member is read after a reference count has been acquired.

      The other problem is struct rtable, which embeds a struct dst_entry
      at offset 0. struct dst_entry has a size of 112 bytes, which means
      that the struct members of rtable which follow the dst member share
      the same cache line as dst_entry::__refcnt. Especially

      	  rtable::rt_genid

      is also read by the contexts which have a reference count acquired
      already.

      When dst_entry::__refcnt is incremented or decremented via an atomic
      operation these read accesses stall and contribute to the performance
      problem.

   2) atomic_inc_not_zero()

      A reference on dst_entry::__refcnt is acquired via
      atomic_inc_not_zero() and released via atomic_dec_return().

      atomic_inc_not_zero() is implemented via an atomic_try_cmpxchg() loop,
      which exposes O(N^2) behaviour under contention with N concurrent
      operations. Scalability degrades with even a small number of
      contenders and gets worse from there.

      Lightweight instrumentation exposed an average of 8!! retry loops per
      atomic_inc_not_zero() invocation in an inc()/dec() loop running
      concurrently on 112 CPUs.

      There is nothing which can be done to make atomic_inc_not_zero() more
      scalable.
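
To illustrate item 2, here is a minimal user-space sketch of an
inc-not-zero operation built on a compare-and-exchange loop. It uses C11
atomics instead of the kernel's atomic_t API, and demo_inc_not_zero() and
its retries parameter are names made up for this illustration. Every
failed compare-and-exchange means another CPU changed the counter in the
meantime, so the loop has to start over with the refreshed value:

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for atomic_inc_not_zero(); not kernel code.
 * @retries counts how often the compare-and-exchange had to be repeated.
 */
static inline bool demo_inc_not_zero(atomic_int *v, unsigned int *retries)
{
	int old = atomic_load_explicit(v, memory_order_relaxed);

	for (;;) {
		/* A count of zero means the object is going away. */
		if (old == 0)
			return false;

		/* On failure, 'old' is updated to the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;

		(*retries)++;
	}
}
```

An unconditional atomic add never needs such a retry, which is the
property the new reference count implementation in this series relies on.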

The following series addresses these issues:

    1) Reorder and pad struct dst_entry to prevent the false sharing.

    2) Implement and use a reference count implementation which avoids the
       atomic_inc_not_zero() problem.

       It is slightly less performant in the case of the final 0 -> -1
       transition, but the deconstruction of these objects is a low
       frequency event. get()/put() pairs are in the hotpath and that's
       what this implementation optimizes for.

       The algorithm of this reference count is only suitable for RCU
       managed objects. Therefore it cannot replace the refcount_t
       algorithm, which is also based on atomic_inc_not_zero(), due to a
       subtle race condition related to the 0 -> -1 transition and the final
       verdict to mark the reference count dead. See details in patch 3/4.

       It might be just my lack of imagination which declares this to be
       impossible and I'd be happy to be proven wrong.

       As a bonus the new rcuref implementation provides underflow/overflow
       detection and mitigation while being performance wise on par with
       open coded atomic_inc_not_zero() / atomic_dec_return() pairs even in
       the non-contended case.
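
The idea in item 2 can be sketched roughly as follows. This is a
simplified user-space illustration with C11 atomics, not the rcuref code
from patch 3/4. The demo_* names and the DEMO_NOREF / DEMO_DEAD values are
assumptions made for this example, it uses sequentially consistent
operations throughout, and it leaves out the overflow/saturation handling.
The counter stores "references - 1", so the last put() performs the
0 -> -1 transition and then tries to mark the counter dead:

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical constants for this sketch, not the values of the series. */
#define DEMO_NOREF	(-1)		/* last reference was just dropped */
#define DEMO_DEAD	(INT_MIN / 2)	/* final verdict: object is dead */

struct demo_ref {
	atomic_int cnt;			/* holds "number of references - 1" */
};

static inline void demo_ref_init(struct demo_ref *r, int refs)
{
	atomic_init(&r->cnt, refs - 1);
}

/* Add @i atomically and return the new value without signed overflow. */
static inline int demo_add_return(atomic_int *v, int i)
{
	unsigned int old = (unsigned int)atomic_fetch_add(v, i);

	return (int)(old + (unsigned int)i);
}

/* The caller is assumed to be in an RCU read side critical section. */
static inline bool demo_ref_get(struct demo_ref *r)
{
	/* Hot path: one unconditional add, no compare-and-exchange loop. */
	if (demo_add_return(&r->cnt, 1) >= 0)
		return true;

	/* Negative result: dead. Reset so the count cannot creep to zero. */
	atomic_store(&r->cnt, DEMO_DEAD);
	return false;
}

/* Returns true if the caller dropped the last reference. */
static inline bool demo_ref_put(struct demo_ref *r)
{
	int expected = DEMO_NOREF;
	int val = demo_add_return(&r->cnt, -1);

	if (val >= 0)
		return false;		/* other references remain */

	/*
	 * 0 -> -1 transition: try to mark the counter dead. This fails
	 * when a concurrent get() moved the counter back to 0 in between.
	 */
	if (val == DEMO_NOREF)
		return atomic_compare_exchange_strong(&r->cnt, &expected,
						      DEMO_DEAD);

	/* put() on a dead or underflowed counter: keep it dead. */
	atomic_store(&r->cnt, DEMO_DEAD);
	return false;
}
```

In this sketch a get() can still succeed in the window between the
0 -> -1 transition and the final compare-and-exchange. That is only
tolerable if the object is guaranteed to stay around for such a late
get(), which is what RCU provides and why the scheme is restricted to
RCU managed objects.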

The combination of these two changes results in performance gains in micro
benchmarks and also localhost and networked memtier benchmarks talking to
memcached. It's hard to quantify the benchmark results as they depend
heavily on the micro-architecture and the number of concurrent operations.

The overall gain of both changes for localhost memtier ranges from 1.2X to
3.2X and from +2% to +5% for networked operations on a 1Gb connection.

A micro benchmark which enforces maximized concurrency shows a gain between
1.2X and 4.7X!!!

Obviously this is focussed on a particular problem and therefore needs to
be discussed in detail. It also requires wider testing outside of the cases
which this is focussed on.

However, the false sharing issue is obvious and should be addressed
independently of the more focussed reference count changes.

The series is also available from git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git

Changes vs. V1:

  - Move rt[6i]_uncached[_list] into struct dst_entry instead of blindly
    padding. (Eric)

  - Use atomic_add_negative() for rcuref (Linus)

I want to say thanks to Wangyang who analyzed the issue and provided the
initial fix for the false sharing problem. Further thanks go to Arjan,
Peter, Marc, Will and Borislav for valuable input and providing test
results on machines which I do not have access to, and to Linus and
Eric for helpful feedback on V1.

Thanks,

	tglx


* [patch V2 1/4] net: dst: Prevent false sharing vs. dst_entry::__refcnt
  2023-03-07 12:57 [patch V2 0/4] net, refcount: Address dst_entry reference count scalability issues Thomas Gleixner
@ 2023-03-07 12:57 ` Thomas Gleixner
  2023-03-15 20:36   ` Jakub Kicinski
  2023-03-07 12:57 ` [patch V2 2/4] atomics: Provide atomic_add_and_negative() variants Thomas Gleixner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2023-03-07 12:57 UTC
  To: LKML
  Cc: Linus Torvalds, x86, Wangyang Guo, Arjan van De Ven,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier

From: Wangyang Guo <wangyang.guo@intel.com>

dst_entry::__refcnt is highly contended in scenarios where many connections
happen from and to the same IP. The reference count is an atomic_t, so the
reference count operations have to take the cache-line exclusive.

Aside from the unavoidable reference count contention there is another
significant problem which is caused by that: False sharing.

perf top identified two affected read accesses: dst_entry::lwtstate and
rtable::rt_genid.

dst_entry::__refcnt is located at offset 64 of dst_entry, which puts it into
a separate cacheline vs. the read mostly members located at the beginning
of the struct.

That prevents false sharing vs. the struct members in the first 64
bytes of the structure, but there is also

  dst_entry::lwtstate

which is located after the reference count and in the same cache line. This
member is read after a reference count has been acquired.

struct rtable embeds a struct dst_entry at offset 0. struct dst_entry has a
size of 112 bytes, which means that the struct members of rtable which
follow the dst member share the same cache line as dst_entry::__refcnt.
Especially

  rtable::rt_genid

is also read by the contexts which have a reference count acquired
already.

When dst_entry::__refcnt is incremented or decremented via an atomic
operation these read accesses stall. This was found when analysing the
memtier benchmark in 1:100 mode, which amplifies the problem extremely.

Move the rt[6i]_uncached[_list] members out of struct rtable and struct
rt6_info into struct dst_entry to provide padding and move the lwtstate
member after that so it ends up in the same cache line.
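
The cache line arithmetic behind this can be checked with a small
stand-alone sketch. The two structs below are made-up stand-ins, not the
real struct dst_entry layout: only the offset 64 of the reference count
and the size of 112 bytes are taken from the description above, the
64 byte cache line size is an assumption, and the placement of lwtstate
at offset 128 follows the comment added to struct dst_entry for 64bit:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE	64	/* assumption: 64 byte cache lines */

/* Hypothetical "before" layout: lwtstate shares a line with refcnt. */
struct demo_before {
	unsigned char	read_mostly[64];	/* first cache line */
	int32_t		refcnt;			/* offset 64 */
	int32_t		use;
	uint64_t	lastuse;
	void		*lwtstate;		/* read after a get() */
};

/* Hypothetical "after" layout: other members act as padding. */
struct demo_after {
	unsigned char	read_mostly[64];	/* first cache line */
	int32_t		refcnt;			/* offset 64 */
	int32_t		use;
	uint64_t	lastuse;
	unsigned char	other_members[48];	/* fills up to offset 128 */
	void		*lwtstate;		/* next cache line */
};

static_assert(offsetof(struct demo_before, refcnt) == 64, "refcnt offset");
static_assert(offsetof(struct demo_after, lwtstate) == 128, "lwtstate offset");

/* Index of the cache line which contains the byte at @offset. */
static inline size_t demo_cache_line_of(size_t offset)
{
	return offset / DEMO_CACHE_LINE;
}

/* True if the bytes at offsets @a and @b are in the same cache line. */
static inline int demo_shares_line(size_t a, size_t b)
{
	return demo_cache_line_of(a) == demo_cache_line_of(b);
}
```

With these helpers, demo_shares_line(64, 112) is true, which matches the
statement that the rtable members following a 112 byte dst_entry land in
the cache line of the reference count.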

The resulting improvement depends on the micro-architecture and the number
of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached
benchmark.

[ tglx: Rearrange struct ]

Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
---
V2: Move uncached[_list] into dst_entry (Eric)
---
 include/net/dst.h       |   15 ++++++++++++++-
 include/net/ip6_fib.h   |    3 ---
 include/net/ip6_route.h |    2 +-
 include/net/route.h     |    3 ---
 net/ipv4/route.c        |   20 ++++++++++----------
 net/ipv4/xfrm4_policy.c |    4 ++--
 net/ipv6/route.c        |   26 +++++++++++++-------------
 net/ipv6/xfrm6_policy.c |    4 ++--
 8 files changed, 42 insertions(+), 35 deletions(-)

--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -69,15 +69,28 @@ struct dst_entry {
 #endif
 	int			__use;
 	unsigned long		lastuse;
-	struct lwtunnel_state   *lwtstate;
 	struct rcu_head		rcu_head;
 	short			error;
 	short			__pad;
 	__u32			tclassid;
 #ifndef CONFIG_64BIT
+	struct lwtunnel_state   *lwtstate;
 	atomic_t		__refcnt;	/* 32-bit offset 64 */
 #endif
 	netdevice_tracker	dev_tracker;
+
+	/*
+	 * Used by rtable and rt6_info. Moves lwtstate into the next cache
+	 * line on 64bit so that lwtstate does not cause false sharing with
+	 * __refcnt under contention of __refcnt. This also puts the
+	 * frequently accessed members of rtable and rt6_info out of the
+	 * __refcnt cache line.
+	 */
+	struct list_head	rt_uncached;
+	struct uncached_list	*rt_uncached_list;
+#ifdef CONFIG_64BIT
+	struct lwtunnel_state   *lwtstate;
+#endif
 };
 
 struct dst_metrics {
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -217,9 +217,6 @@ struct rt6_info {
 	struct inet6_dev		*rt6i_idev;
 	u32				rt6i_flags;
 
-	struct list_head		rt6i_uncached;
-	struct uncached_list		*rt6i_uncached_list;
-
 	/* more non-fragment space at head required */
 	unsigned short			rt6i_nfheader_len;
 };
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -100,7 +100,7 @@ static inline struct dst_entry *ip6_rout
 static inline void ip6_rt_put_flags(struct rt6_info *rt, int flags)
 {
 	if (!(flags & RT6_LOOKUP_F_DST_NOREF) ||
-	    !list_empty(&rt->rt6i_uncached))
+	    !list_empty(&rt->dst.rt_uncached))
 		ip6_rt_put(rt);
 }
 
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -78,9 +78,6 @@ struct rtable {
 	/* Miscellaneous cached information */
 	u32			rt_mtu_locked:1,
 				rt_pmtu:31;
-
-	struct list_head	rt_uncached;
-	struct uncached_list	*rt_uncached_list;
 };
 
 static inline bool rt_is_input_route(const struct rtable *rt)
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1508,20 +1508,20 @@ void rt_add_uncached_list(struct rtable
 {
 	struct uncached_list *ul = raw_cpu_ptr(&rt_uncached_list);
 
-	rt->rt_uncached_list = ul;
+	rt->dst.rt_uncached_list = ul;
 
 	spin_lock_bh(&ul->lock);
-	list_add_tail(&rt->rt_uncached, &ul->head);
+	list_add_tail(&rt->dst.rt_uncached, &ul->head);
 	spin_unlock_bh(&ul->lock);
 }
 
 void rt_del_uncached_list(struct rtable *rt)
 {
-	if (!list_empty(&rt->rt_uncached)) {
-		struct uncached_list *ul = rt->rt_uncached_list;
+	if (!list_empty(&rt->dst.rt_uncached)) {
+		struct uncached_list *ul = rt->dst.rt_uncached_list;
 
 		spin_lock_bh(&ul->lock);
-		list_del_init(&rt->rt_uncached);
+		list_del_init(&rt->dst.rt_uncached);
 		spin_unlock_bh(&ul->lock);
 	}
 }
@@ -1546,13 +1546,13 @@ void rt_flush_dev(struct net_device *dev
 			continue;
 
 		spin_lock_bh(&ul->lock);
-		list_for_each_entry_safe(rt, safe, &ul->head, rt_uncached) {
+		list_for_each_entry_safe(rt, safe, &ul->head, dst.rt_uncached) {
 			if (rt->dst.dev != dev)
 				continue;
 			rt->dst.dev = blackhole_netdev;
 			netdev_ref_replace(dev, blackhole_netdev,
 					   &rt->dst.dev_tracker, GFP_ATOMIC);
-			list_move(&rt->rt_uncached, &ul->quarantine);
+			list_move(&rt->dst.rt_uncached, &ul->quarantine);
 		}
 		spin_unlock_bh(&ul->lock);
 	}
@@ -1644,7 +1644,7 @@ struct rtable *rt_dst_alloc(struct net_d
 		rt->rt_uses_gateway = 0;
 		rt->rt_gw_family = 0;
 		rt->rt_gw4 = 0;
-		INIT_LIST_HEAD(&rt->rt_uncached);
+		INIT_LIST_HEAD(&rt->dst.rt_uncached);
 
 		rt->dst.output = ip_output;
 		if (flags & RTCF_LOCAL)
@@ -1675,7 +1675,7 @@ struct rtable *rt_dst_clone(struct net_d
 			new_rt->rt_gw4 = rt->rt_gw4;
 		else if (rt->rt_gw_family == AF_INET6)
 			new_rt->rt_gw6 = rt->rt_gw6;
-		INIT_LIST_HEAD(&new_rt->rt_uncached);
+		INIT_LIST_HEAD(&new_rt->dst.rt_uncached);
 
 		new_rt->dst.input = rt->dst.input;
 		new_rt->dst.output = rt->dst.output;
@@ -2859,7 +2859,7 @@ struct dst_entry *ipv4_blackhole_route(s
 		else if (rt->rt_gw_family == AF_INET6)
 			rt->rt_gw6 = ort->rt_gw6;
 
-		INIT_LIST_HEAD(&rt->rt_uncached);
+		INIT_LIST_HEAD(&rt->dst.rt_uncached);
 	}
 
 	dst_release(dst_orig);
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -91,7 +91,7 @@ static int xfrm4_fill_dst(struct xfrm_ds
 		xdst->u.rt.rt_gw6 = rt->rt_gw6;
 	xdst->u.rt.rt_pmtu = rt->rt_pmtu;
 	xdst->u.rt.rt_mtu_locked = rt->rt_mtu_locked;
-	INIT_LIST_HEAD(&xdst->u.rt.rt_uncached);
+	INIT_LIST_HEAD(&xdst->u.rt.dst.rt_uncached);
 	rt_add_uncached_list(&xdst->u.rt);
 
 	return 0;
@@ -121,7 +121,7 @@ static void xfrm4_dst_destroy(struct dst
 	struct xfrm_dst *xdst = (struct xfrm_dst *)dst;
 
 	dst_destroy_metrics_generic(dst);
-	if (xdst->u.rt.rt_uncached_list)
+	if (xdst->u.rt.dst.rt_uncached_list)
 		rt_del_uncached_list(&xdst->u.rt);
 	xfrm_dst_destroy(xdst);
 }
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -139,20 +139,20 @@ void rt6_uncached_list_add(struct rt6_in
 {
 	struct uncached_list *ul = raw_cpu_ptr(&rt6_uncached_list);
 
-	rt->rt6i_uncached_list = ul;
+	rt->dst.rt_uncached_list = ul;
 
 	spin_lock_bh(&ul->lock);
-	list_add_tail(&rt->rt6i_uncached, &ul->head);
+	list_add_tail(&rt->dst.rt_uncached, &ul->head);
 	spin_unlock_bh(&ul->lock);
 }
 
 void rt6_uncached_list_del(struct rt6_info *rt)
 {
-	if (!list_empty(&rt->rt6i_uncached)) {
-		struct uncached_list *ul = rt->rt6i_uncached_list;
+	if (!list_empty(&rt->dst.rt_uncached)) {
+		struct uncached_list *ul = rt->dst.rt_uncached_list;
 
 		spin_lock_bh(&ul->lock);
-		list_del_init(&rt->rt6i_uncached);
+		list_del_init(&rt->dst.rt_uncached);
 		spin_unlock_bh(&ul->lock);
 	}
 }
@@ -169,7 +169,7 @@ static void rt6_uncached_list_flush_dev(
 			continue;
 
 		spin_lock_bh(&ul->lock);
-		list_for_each_entry_safe(rt, safe, &ul->head, rt6i_uncached) {
+		list_for_each_entry_safe(rt, safe, &ul->head, dst.rt_uncached) {
 			struct inet6_dev *rt_idev = rt->rt6i_idev;
 			struct net_device *rt_dev = rt->dst.dev;
 			bool handled = false;
@@ -188,7 +188,7 @@ static void rt6_uncached_list_flush_dev(
 				handled = true;
 			}
 			if (handled)
-				list_move(&rt->rt6i_uncached,
+				list_move(&rt->dst.rt_uncached,
 					  &ul->quarantine);
 		}
 		spin_unlock_bh(&ul->lock);
@@ -334,7 +334,7 @@ static const struct rt6_info ip6_blk_hol
 static void rt6_info_init(struct rt6_info *rt)
 {
 	memset_after(rt, 0, dst);
-	INIT_LIST_HEAD(&rt->rt6i_uncached);
+	INIT_LIST_HEAD(&rt->dst.rt_uncached);
 }
 
 /* allocate dst with ip6_dst_ops */
@@ -2638,7 +2638,7 @@ struct dst_entry *ip6_route_output_flags
 	dst = ip6_route_output_flags_noref(net, sk, fl6, flags);
 	rt6 = (struct rt6_info *)dst;
 	/* For dst cached in uncached_list, refcnt is already taken. */
-	if (list_empty(&rt6->rt6i_uncached) && !dst_hold_safe(dst)) {
+	if (list_empty(&rt6->dst.rt_uncached) && !dst_hold_safe(dst)) {
 		dst = &net->ipv6.ip6_null_entry->dst;
 		dst_hold(dst);
 	}
@@ -2748,7 +2748,7 @@ INDIRECT_CALLABLE_SCOPE struct dst_entry
 	from = rcu_dereference(rt->from);
 
 	if (from && (rt->rt6i_flags & RTF_PCPU ||
-	    unlikely(!list_empty(&rt->rt6i_uncached))))
+	    unlikely(!list_empty(&rt->dst.rt_uncached))))
 		dst_ret = rt6_dst_from_check(rt, from, cookie);
 	else
 		dst_ret = rt6_check(rt, from, cookie);
@@ -6477,7 +6477,7 @@ static int __net_init ip6_route_net_init
 	net->ipv6.ip6_null_entry->dst.ops = &net->ipv6.ip6_dst_ops;
 	dst_init_metrics(&net->ipv6.ip6_null_entry->dst,
 			 ip6_template_metrics, true);
-	INIT_LIST_HEAD(&net->ipv6.ip6_null_entry->rt6i_uncached);
+	INIT_LIST_HEAD(&net->ipv6.ip6_null_entry->dst.rt_uncached);
 
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 	net->ipv6.fib6_has_custom_rules = false;
@@ -6489,7 +6489,7 @@ static int __net_init ip6_route_net_init
 	net->ipv6.ip6_prohibit_entry->dst.ops = &net->ipv6.ip6_dst_ops;
 	dst_init_metrics(&net->ipv6.ip6_prohibit_entry->dst,
 			 ip6_template_metrics, true);
-	INIT_LIST_HEAD(&net->ipv6.ip6_prohibit_entry->rt6i_uncached);
+	INIT_LIST_HEAD(&net->ipv6.ip6_prohibit_entry->dst.rt_uncached);
 
 	net->ipv6.ip6_blk_hole_entry = kmemdup(&ip6_blk_hole_entry_template,
 					       sizeof(*net->ipv6.ip6_blk_hole_entry),
@@ -6499,7 +6499,7 @@ static int __net_init ip6_route_net_init
 	net->ipv6.ip6_blk_hole_entry->dst.ops = &net->ipv6.ip6_dst_ops;
 	dst_init_metrics(&net->ipv6.ip6_blk_hole_entry->dst,
 			 ip6_template_metrics, true);
-	INIT_LIST_HEAD(&net->ipv6.ip6_blk_hole_entry->rt6i_uncached);
+	INIT_LIST_HEAD(&net->ipv6.ip6_blk_hole_entry->dst.rt_uncached);
 #ifdef CONFIG_IPV6_SUBTREES
 	net->ipv6.fib6_routes_require_src = 0;
 #endif
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -89,7 +89,7 @@ static int xfrm6_fill_dst(struct xfrm_ds
 	xdst->u.rt6.rt6i_gateway = rt->rt6i_gateway;
 	xdst->u.rt6.rt6i_dst = rt->rt6i_dst;
 	xdst->u.rt6.rt6i_src = rt->rt6i_src;
-	INIT_LIST_HEAD(&xdst->u.rt6.rt6i_uncached);
+	INIT_LIST_HEAD(&xdst->u.rt6.dst.rt_uncached);
 	rt6_uncached_list_add(&xdst->u.rt6);
 
 	return 0;
@@ -121,7 +121,7 @@ static void xfrm6_dst_destroy(struct dst
 	if (likely(xdst->u.rt6.rt6i_idev))
 		in6_dev_put(xdst->u.rt6.rt6i_idev);
 	dst_destroy_metrics_generic(dst);
-	if (xdst->u.rt6.rt6i_uncached_list)
+	if (xdst->u.rt6.dst.rt_uncached_list)
 		rt6_uncached_list_del(&xdst->u.rt6);
 	xfrm_dst_destroy(xdst);
 }



* [patch V2 2/4] atomics: Provide atomic_add_and_negative() variants
  2023-03-07 12:57 [patch V2 0/4] net, refcount: Address dst_entry reference count scalability issues Thomas Gleixner
  2023-03-07 12:57 ` [patch V2 1/4] net: dst: Prevent false sharing vs. dst_entry::__refcnt Thomas Gleixner
@ 2023-03-07 12:57 ` Thomas Gleixner
  2023-03-22 13:20   ` Mark Rutland
  2023-03-07 12:57 ` [patch V2 3/4] atomics: Provide rcuref - scalable reference counting Thomas Gleixner
  2023-03-07 12:57 ` [patch V2 4/4] net: dst: Switch to rcuref_t " Thomas Gleixner
  3 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2023-03-07 12:57 UTC
  To: LKML
  Cc: Linus Torvalds, x86, Wangyang Guo, Arjan van De Ven,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier

atomic_add_negative() does not provide the relaxed/acquire/release
variants.

Provide them in preparation for a new scalable reference count algorithm.
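
For readers unfamiliar with the ordering variants, the following
user-space sketch shows the general shape with C11 atomics. It is an
illustration only and not the generated kernel code: the demo_* names are
made up, and C11 fences are used as a rough stand-in for the kernel's
__atomic_acquire_fence() and __atomic_release_fence(). The acquire and
release variants are built from the relaxed operation plus a fence, which
mirrors the fence based fallbacks in the patch below:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Add @i to @v without ordering; true if the result is negative. */
static inline bool demo_add_negative_relaxed(int i, atomic_int *v)
{
	unsigned int old;

	old = (unsigned int)atomic_fetch_add_explicit(v, i,
						      memory_order_relaxed);
	/* Compute the new value in unsigned math to avoid signed overflow. */
	return (int)(old + (unsigned int)i) < 0;
}

/* Acquire: relaxed operation followed by an acquire fence. */
static inline bool demo_add_negative_acquire(int i, atomic_int *v)
{
	bool ret = demo_add_negative_relaxed(i, v);

	atomic_thread_fence(memory_order_acquire);
	return ret;
}

/* Release: release fence followed by the relaxed operation. */
static inline bool demo_add_negative_release(int i, atomic_int *v)
{
	atomic_thread_fence(memory_order_release);
	return demo_add_negative_relaxed(i, v);
}
```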

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2: New patch
---
 include/linux/atomic/atomic-arch-fallback.h |  200 +++++++++++++++++++++++++++-
 include/linux/atomic/atomic-instrumented.h  |   68 +++++++++
 include/linux/atomic/atomic-long.h          |   38 +++++
 scripts/atomic/atomics.tbl                  |    2 
 scripts/atomic/fallbacks/add_negative       |    4 
 5 files changed, 306 insertions(+), 6 deletions(-)

--- a/include/linux/atomic/atomic-arch-fallback.h
+++ b/include/linux/atomic/atomic-arch-fallback.h
@@ -1208,6 +1208,13 @@ arch_atomic_inc_and_test(atomic_t *v)
 #define arch_atomic_inc_and_test arch_atomic_inc_and_test
 #endif
 
+#ifndef arch_atomic_add_negative_relaxed
+#ifdef arch_atomic_add_negative
+#define arch_atomic_add_negative_acquire arch_atomic_add_negative
+#define arch_atomic_add_negative_release arch_atomic_add_negative
+#define arch_atomic_add_negative_relaxed arch_atomic_add_negative
+#endif /* arch_atomic_add_negative */
+
 #ifndef arch_atomic_add_negative
 /**
  * arch_atomic_add_negative - add and test if negative
@@ -1226,6 +1233,98 @@ arch_atomic_add_negative(int i, atomic_t
 #define arch_atomic_add_negative arch_atomic_add_negative
 #endif
 
+#ifndef arch_atomic_add_negative_acquire
+/**
+ * arch_atomic_add_negative - add and test if negative
+ * @i: integer value to add
+ * @v: pointer of type atomic_t
+ *
+ * Atomically adds @i to @v and returns true
+ * if the result is negative, or false when
+ * result is greater than or equal to zero.
+ */
+static __always_inline bool
+arch_atomic_add_negative_acquire(int i, atomic_t *v)
+{
+	return arch_atomic_add_return_acquire(i, v) < 0;
+}
+#define arch_atomic_add_negative_acquire arch_atomic_add_negative_acquire
+#endif
+
+#ifndef arch_atomic_add_negative_release
+/**
+ * arch_atomic_add_negative - add and test if negative
+ * @i: integer value to add
+ * @v: pointer of type atomic_t
+ *
+ * Atomically adds @i to @v and returns true
+ * if the result is negative, or false when
+ * result is greater than or equal to zero.
+ */
+static __always_inline bool
+arch_atomic_add_negative_release(int i, atomic_t *v)
+{
+	return arch_atomic_add_return_release(i, v) < 0;
+}
+#define arch_atomic_add_negative_release arch_atomic_add_negative_release
+#endif
+
+#ifndef arch_atomic_add_negative_relaxed
+/**
+ * arch_atomic_add_negative - add and test if negative
+ * @i: integer value to add
+ * @v: pointer of type atomic_t
+ *
+ * Atomically adds @i to @v and returns true
+ * if the result is negative, or false when
+ * result is greater than or equal to zero.
+ */
+static __always_inline bool
+arch_atomic_add_negative_relaxed(int i, atomic_t *v)
+{
+	return arch_atomic_add_return_relaxed(i, v) < 0;
+}
+#define arch_atomic_add_negative_relaxed arch_atomic_add_negative_relaxed
+#endif
+
+#else /* arch_atomic_add_negative_relaxed */
+
+#ifndef arch_atomic_add_negative_acquire
+static __always_inline bool
+arch_atomic_add_negative_acquire(int i, atomic_t *v)
+{
+	bool ret = arch_atomic_add_negative_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_add_negative_acquire arch_atomic_add_negative_acquire
+#endif
+
+#ifndef arch_atomic_add_negative_release
+static __always_inline bool
+arch_atomic_add_negative_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_add_negative_relaxed(i, v);
+}
+#define arch_atomic_add_negative_release arch_atomic_add_negative_release
+#endif
+
+#ifndef arch_atomic_add_negative
+static __always_inline bool
+arch_atomic_add_negative(int i, atomic_t *v)
+{
+	bool ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_add_negative_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_add_negative arch_atomic_add_negative
+#endif
+
+#endif /* arch_atomic_add_negative_relaxed */
+
 #ifndef arch_atomic_fetch_add_unless
 /**
  * arch_atomic_fetch_add_unless - add unless the number is already a given value
@@ -2329,6 +2428,13 @@ arch_atomic64_inc_and_test(atomic64_t *v
 #define arch_atomic64_inc_and_test arch_atomic64_inc_and_test
 #endif
 
+#ifndef arch_atomic64_add_negative_relaxed
+#ifdef arch_atomic64_add_negative
+#define arch_atomic64_add_negative_acquire arch_atomic64_add_negative
+#define arch_atomic64_add_negative_release arch_atomic64_add_negative
+#define arch_atomic64_add_negative_relaxed arch_atomic64_add_negative
+#endif /* arch_atomic64_add_negative */
+
 #ifndef arch_atomic64_add_negative
 /**
  * arch_atomic64_add_negative - add and test if negative
@@ -2347,6 +2453,98 @@ arch_atomic64_add_negative(s64 i, atomic
 #define arch_atomic64_add_negative arch_atomic64_add_negative
 #endif
 
+#ifndef arch_atomic64_add_negative_acquire
+/**
+ * arch_atomic64_add_negative - add and test if negative
+ * @i: integer value to add
+ * @v: pointer of type atomic64_t
+ *
+ * Atomically adds @i to @v and returns true
+ * if the result is negative, or false when
+ * result is greater than or equal to zero.
+ */
+static __always_inline bool
+arch_atomic64_add_negative_acquire(s64 i, atomic64_t *v)
+{
+	return arch_atomic64_add_return_acquire(i, v) < 0;
+}
+#define arch_atomic64_add_negative_acquire arch_atomic64_add_negative_acquire
+#endif
+
+#ifndef arch_atomic64_add_negative_release
+/**
+ * arch_atomic64_add_negative - add and test if negative
+ * @i: integer value to add
+ * @v: pointer of type atomic64_t
+ *
+ * Atomically adds @i to @v and returns true
+ * if the result is negative, or false when
+ * result is greater than or equal to zero.
+ */
+static __always_inline bool
+arch_atomic64_add_negative_release(s64 i, atomic64_t *v)
+{
+	return arch_atomic64_add_return_release(i, v) < 0;
+}
+#define arch_atomic64_add_negative_release arch_atomic64_add_negative_release
+#endif
+
+#ifndef arch_atomic64_add_negative_relaxed
+/**
+ * arch_atomic64_add_negative - add and test if negative
+ * @i: integer value to add
+ * @v: pointer of type atomic64_t
+ *
+ * Atomically adds @i to @v and returns true
+ * if the result is negative, or false when
+ * result is greater than or equal to zero.
+ */
+static __always_inline bool
+arch_atomic64_add_negative_relaxed(s64 i, atomic64_t *v)
+{
+	return arch_atomic64_add_return_relaxed(i, v) < 0;
+}
+#define arch_atomic64_add_negative_relaxed arch_atomic64_add_negative_relaxed
+#endif
+
+#else /* arch_atomic64_add_negative_relaxed */
+
+#ifndef arch_atomic64_add_negative_acquire
+static __always_inline bool
+arch_atomic64_add_negative_acquire(s64 i, atomic64_t *v)
+{
+	bool ret = arch_atomic64_add_negative_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_add_negative_acquire arch_atomic64_add_negative_acquire
+#endif
+
+#ifndef arch_atomic64_add_negative_release
+static __always_inline bool
+arch_atomic64_add_negative_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_add_negative_relaxed(i, v);
+}
+#define arch_atomic64_add_negative_release arch_atomic64_add_negative_release
+#endif
+
+#ifndef arch_atomic64_add_negative
+static __always_inline bool
+arch_atomic64_add_negative(s64 i, atomic64_t *v)
+{
+	bool ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_add_negative_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_add_negative arch_atomic64_add_negative
+#endif
+
+#endif /* arch_atomic64_add_negative_relaxed */
+
 #ifndef arch_atomic64_fetch_add_unless
 /**
  * arch_atomic64_fetch_add_unless - add unless the number is already a given value
@@ -2456,4 +2654,4 @@ arch_atomic64_dec_if_positive(atomic64_t
 #endif
 
 #endif /* _LINUX_ATOMIC_FALLBACK_H */
-// b5e87bdd5ede61470c29f7a7e4de781af3770f09
+// 63bcc1a53125d4eca5e659892e10615f00f9abf8
--- a/include/linux/atomic/atomic-instrumented.h
+++ b/include/linux/atomic/atomic-instrumented.h
@@ -592,6 +592,28 @@ atomic_add_negative(int i, atomic_t *v)
 	return arch_atomic_add_negative(i, v);
 }
 
+static __always_inline bool
+atomic_add_negative_acquire(int i, atomic_t *v)
+{
+	instrument_atomic_read_write(v, sizeof(*v));
+	return arch_atomic_add_negative_acquire(i, v);
+}
+
+static __always_inline bool
+atomic_add_negative_release(int i, atomic_t *v)
+{
+	kcsan_release();
+	instrument_atomic_read_write(v, sizeof(*v));
+	return arch_atomic_add_negative_release(i, v);
+}
+
+static __always_inline bool
+atomic_add_negative_relaxed(int i, atomic_t *v)
+{
+	instrument_atomic_read_write(v, sizeof(*v));
+	return arch_atomic_add_negative_relaxed(i, v);
+}
+
 static __always_inline int
 atomic_fetch_add_unless(atomic_t *v, int a, int u)
 {
@@ -1211,6 +1233,28 @@ atomic64_add_negative(s64 i, atomic64_t
 	return arch_atomic64_add_negative(i, v);
 }
 
+static __always_inline bool
+atomic64_add_negative_acquire(s64 i, atomic64_t *v)
+{
+	instrument_atomic_read_write(v, sizeof(*v));
+	return arch_atomic64_add_negative_acquire(i, v);
+}
+
+static __always_inline bool
+atomic64_add_negative_release(s64 i, atomic64_t *v)
+{
+	kcsan_release();
+	instrument_atomic_read_write(v, sizeof(*v));
+	return arch_atomic64_add_negative_release(i, v);
+}
+
+static __always_inline bool
+atomic64_add_negative_relaxed(s64 i, atomic64_t *v)
+{
+	instrument_atomic_read_write(v, sizeof(*v));
+	return arch_atomic64_add_negative_relaxed(i, v);
+}
+
 static __always_inline s64
 atomic64_fetch_add_unless(atomic64_t *v, s64 a, s64 u)
 {
@@ -1830,6 +1874,28 @@ atomic_long_add_negative(long i, atomic_
 	return arch_atomic_long_add_negative(i, v);
 }
 
+static __always_inline bool
+atomic_long_add_negative_acquire(long i, atomic_long_t *v)
+{
+	instrument_atomic_read_write(v, sizeof(*v));
+	return arch_atomic_long_add_negative_acquire(i, v);
+}
+
+static __always_inline bool
+atomic_long_add_negative_release(long i, atomic_long_t *v)
+{
+	kcsan_release();
+	instrument_atomic_read_write(v, sizeof(*v));
+	return arch_atomic_long_add_negative_release(i, v);
+}
+
+static __always_inline bool
+atomic_long_add_negative_relaxed(long i, atomic_long_t *v)
+{
+	instrument_atomic_read_write(v, sizeof(*v));
+	return arch_atomic_long_add_negative_relaxed(i, v);
+}
+
 static __always_inline long
 atomic_long_fetch_add_unless(atomic_long_t *v, long a, long u)
 {
@@ -2083,4 +2149,4 @@ atomic_long_dec_if_positive(atomic_long_
 })
 
 #endif /* _LINUX_ATOMIC_INSTRUMENTED_H */
-// 764f741eb77a7ad565dc8d99ce2837d5542e8aee
+// 1b485de9cbaa4900de59e14ee2084357eaeb1c3a
--- a/include/linux/atomic/atomic-long.h
+++ b/include/linux/atomic/atomic-long.h
@@ -479,6 +479,24 @@ arch_atomic_long_add_negative(long i, at
 	return arch_atomic64_add_negative(i, v);
 }
 
+static __always_inline bool
+arch_atomic_long_add_negative_acquire(long i, atomic_long_t *v)
+{
+	return arch_atomic64_add_negative_acquire(i, v);
+}
+
+static __always_inline bool
+arch_atomic_long_add_negative_release(long i, atomic_long_t *v)
+{
+	return arch_atomic64_add_negative_release(i, v);
+}
+
+static __always_inline bool
+arch_atomic_long_add_negative_relaxed(long i, atomic_long_t *v)
+{
+	return arch_atomic64_add_negative_relaxed(i, v);
+}
+
 static __always_inline long
 arch_atomic_long_fetch_add_unless(atomic_long_t *v, long a, long u)
 {
@@ -973,6 +991,24 @@ arch_atomic_long_add_negative(long i, at
 	return arch_atomic_add_negative(i, v);
 }
 
+static __always_inline bool
+arch_atomic_long_add_negative_acquire(long i, atomic_long_t *v)
+{
+	return arch_atomic_add_negative_acquire(i, v);
+}
+
+static __always_inline bool
+arch_atomic_long_add_negative_release(long i, atomic_long_t *v)
+{
+	return arch_atomic_add_negative_release(i, v);
+}
+
+static __always_inline bool
+arch_atomic_long_add_negative_relaxed(long i, atomic_long_t *v)
+{
+	return arch_atomic_add_negative_relaxed(i, v);
+}
+
 static __always_inline long
 arch_atomic_long_fetch_add_unless(atomic_long_t *v, long a, long u)
 {
@@ -1011,4 +1047,4 @@ arch_atomic_long_dec_if_positive(atomic_
 
 #endif /* CONFIG_64BIT */
 #endif /* _LINUX_ATOMIC_LONG_H */
-// e8f0e08ff072b74d180eabe2ad001282b38c2c88
+// a194c07d7d2f4b0e178d3c118c919775d5d65f50
--- a/scripts/atomic/atomics.tbl
+++ b/scripts/atomic/atomics.tbl
@@ -33,7 +33,7 @@ try_cmpxchg		B	v	p:old	i:new
 sub_and_test		b	i	v
 dec_and_test		b	v
 inc_and_test		b	v
-add_negative		b	i	v
+add_negative		B	i	v
 add_unless		fb	v	i:a	i:u
 inc_not_zero		b	v
 inc_unless_negative	b	v
--- a/scripts/atomic/fallbacks/add_negative
+++ b/scripts/atomic/fallbacks/add_negative
@@ -9,8 +9,8 @@ cat <<EOF
  * result is greater than or equal to zero.
  */
 static __always_inline bool
-arch_${atomic}_add_negative(${int} i, ${atomic}_t *v)
+arch_${atomic}_add_negative${order}(${int} i, ${atomic}_t *v)
 {
-	return arch_${atomic}_add_return(i, v) < 0;
+	return arch_${atomic}_add_return${order}(i, v) < 0;
 }
 EOF



* [patch V2 3/4] atomics: Provide rcuref - scalable reference counting
  2023-03-07 12:57 [patch V2 0/4] net, refcount: Address dst_entry reference count scalability issues Thomas Gleixner
  2023-03-07 12:57 ` [patch V2 1/4] net: dst: Prevent false sharing vs. dst_entry::__refcnt Thomas Gleixner
  2023-03-07 12:57 ` [patch V2 2/4] atomics: Provide atomic_add_and_negative() variants Thomas Gleixner
@ 2023-03-07 12:57 ` Thomas Gleixner
  2023-03-09  8:35   ` Qiuxu Zhuo
  2023-03-07 12:57 ` [patch V2 4/4] net: dst: Switch to rcuref_t " Thomas Gleixner
  3 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2023-03-07 12:57 UTC (permalink / raw)
  To: LKML
  Cc: Linus Torvalds, x86, Wangyang Guo, Arjan van De Ven,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier, Arjan Van De Ven

atomic_t based reference counting, including refcount_t, uses
atomic_inc_not_zero() for acquiring a reference. atomic_inc_not_zero() is
implemented with an atomic_try_cmpxchg() loop. High contention of the
reference count leads to retry loops and scales badly. There is nothing to
improve on this implementation as the semantics have to be preserved.
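
For illustration only, the retry loop can be modelled in userspace with C11
atomics. This is a sketch, not the kernel implementation: the name
model_inc_not_zero() is hypothetical and the default sequentially consistent
ordering of <stdatomic.h> is assumed for simplicity.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of an inc_not_zero() operation.
 * Every failed compare-exchange means another CPU changed the counter
 * in between, so the loop has to retry with the freshly observed value.
 */
static bool model_inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	do {
		/* A zero count means the object is already going away. */
		if (old == 0)
			return false;
		/* On failure 'old' is updated to the current value. */
	} while (!atomic_compare_exchange_weak(v, &old, old + 1));

	return true;
}
```

Under contention each concurrent winner forces the losers through another
iteration, which is the behaviour an unconditional atomic add avoids.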

Provide rcuref as a scalable alternative solution which is suitable for RCU
managed objects. Similar to refcount_t it comes with overflow and underflow
detection and mitigation.

rcuref treats the underlying atomic_t as an unsigned integer and partitions
this space into zones:

  0x00000000 - 0x7FFFFFFF	valid zone (1 .. INT_MAX references)
  0x80000000 - 0xBFFFFFFF	saturation zone
  0xC0000000 - 0xFFFFFFFE	dead zone
  0xFFFFFFFF   			no reference
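
As a rough sketch of how a raw counter value maps onto these zones, assuming
the boundaries listed above (the MODEL_* names and both helpers are
hypothetical and only meant for illustration; model_held_refs() follows the
decoding used by rcuref_read() in this patch):

```c
#include <stdint.h>

/* Zone boundaries taken from the table above. */
#define MODEL_MAXREF	0x7FFFFFFFU
#define MODEL_RELEASED	0xC0000000U
#define MODEL_NOREF	0xFFFFFFFFU

enum model_zone { ZONE_VALID, ZONE_SATURATED, ZONE_DEAD, ZONE_NOREF };

/* Classify a raw counter value into one of the four zones. */
static enum model_zone model_classify(uint32_t cnt)
{
	if (cnt <= MODEL_MAXREF)
		return ZONE_VALID;
	if (cnt < MODEL_RELEASED)
		return ZONE_SATURATED;
	if (cnt < MODEL_NOREF)
		return ZONE_DEAD;
	return ZONE_NOREF;
}

/* Number of held references: 0 at or above RELEASED, otherwise counter + 1. */
static uint32_t model_held_refs(uint32_t cnt)
{
	return cnt >= MODEL_RELEASED ? 0 : cnt + 1;
}
```

A counter value of 0 therefore stands for one held reference, and the
"no reference" value 0xFFFFFFFF is what a decrement from 0 produces.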

rcuref_get() unconditionally increments the reference count with
atomic_add_negative_relaxed(). rcuref_put() unconditionally decrements the
reference count with atomic_add_negative_release().

This unconditional increment avoids the inc_not_zero() problem, but
requires a more complex implementation on the put() side when the count
drops from 0 to -1.

When this transition is detected, rcuref_put() attempts to mark the reference
count dead by setting it to the midpoint of the dead zone with a single
atomic_cmpxchg_release() operation. This operation can fail due to a
concurrent rcuref_get() elevating the reference count from -1 to 0 again.

If the unconditional increment in rcuref_get() hits a reference count which
is marked dead (or saturated) it will detect it after the fact and bring
back the reference count to the midpoint of the respective zone. The zones
provide enough tolerance which makes it practically impossible to escape
from a zone.
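
The get()/put() scheme described above can be sketched as a single-file
userspace model with C11 atomics. This is an approximation under stated
assumptions, not the code in this patch: the m_* names are hypothetical, all
operations use the default sequentially consistent ordering instead of the
relaxed/release variants, the warnings are omitted, and the RCU protection
which the real rcuref_put() depends on is not modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define M_MAXREF	0x7FFFFFFFU
#define M_SATURATED	0xA0000000U
#define M_RELEASED	0xC0000000U
#define M_DEAD		0xE0000000U
#define M_NOREF		0xFFFFFFFFU

/* Add 'i' and report whether the result has the sign bit set. */
static bool m_add_negative(atomic_uint *v, unsigned int i)
{
	unsigned int res = atomic_fetch_add(v, i) + i;

	return res > M_MAXREF;
}

/* Model of get(): unconditional increment, checked after the fact. */
static bool m_get(atomic_uint *ref)
{
	unsigned int cnt;

	if (!m_add_negative(ref, 1))
		return true;

	cnt = atomic_load(ref);
	if (cnt >= M_RELEASED) {
		/* Dead: undo the increment by going back to the midpoint. */
		atomic_store(ref, M_DEAD);
		return false;
	}
	if (cnt > M_MAXREF)
		atomic_store(ref, M_SATURATED);
	return true;
}

/* Model of put(): returns true only for the final drop. */
static bool m_put(atomic_uint *ref)
{
	unsigned int cnt, expected = M_NOREF;

	if (!m_add_negative(ref, (unsigned int)-1))
		return false;

	cnt = atomic_load(ref);
	if (cnt == M_NOREF) {
		/* May fail if a concurrent get() elevated the count again. */
		return atomic_compare_exchange_strong(ref, &expected, M_DEAD);
	}
	if (cnt >= M_RELEASED) {
		/* Imbalanced put() on a dead count. */
		atomic_store(ref, M_DEAD);
		return false;
	}
	if (cnt > M_MAXREF)
		atomic_store(ref, M_SATURATED);
	return false;
}
```

Starting from a counter of 0 (one reference), a get() followed by two put()
calls leaves the counter at the dead-zone midpoint, and any later get() fails
and restores that midpoint.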

The racy implementation of rcuref_put() requires protecting rcuref_put()
against a grace period ending in order to prevent a subtle use after
free. As RCU is the only mechanism which can protect against that, it
is not possible to fully replace the atomic_inc_not_zero() based
implementation of refcount_t with this scheme.

The final drop is slightly more expensive than the atomic_dec_return()
counterpart, but that's not the case which this is optimized for. The
optimization is on the high frequency get()/put() pairs and their
scalability.

The performance of an uncontended rcuref_get()/put() pair where the put()
is not dropping the last reference is still on par with the plain atomic
operations, while at the same time providing overflow and underflow
detection and mitigation.

The performance of rcuref compared to plain atomic_inc_not_zero() and
atomic_dec_return() based reference counting under contention:

 - Micro benchmark: All CPUs running an increment/decrement loop on an
    elevated reference count, which means the 0 to -1 transition never
    happens.

    The performance gain depends on microarchitecture and the number of
    CPUs and has been observed in the range of 1.3X to 4.7X.

 - Conversion of dst_entry::__refcnt to rcuref and testing with the
    localhost memtier/memcached benchmark. That benchmark shows the
    reference count contention prominently.
    
    The performance gain depends on microarchitecture and the number of
    CPUs and has been observed in the range of 1.1X to 2.6X over the
    previous fix for the false sharing issue vs. struct
    dst_entry::__refcnt.

    When memtier is run over a real 1Gb network connection, there is a
    small gain on top of the false sharing fix. The two changes combined
    result in a 2%-5% total gain for that networked test.

Reported-by: Wangyang Guo <wangyang.guo@intel.com>
Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
---
V2: Switch to atomic_add_negative() to make the fast path lean
    (Linus)
---
 include/linux/rcuref.h |  155 +++++++++++++++++++++++++++
 include/linux/types.h  |    6 +
 lib/Makefile           |    2 
 lib/rcuref.c           |  281 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 443 insertions(+), 1 deletion(-)

--- /dev/null
+++ b/include/linux/rcuref.h
@@ -0,0 +1,155 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_RCUREF_H
+#define _LINUX_RCUREF_H
+
+#include <linux/atomic.h>
+#include <linux/bug.h>
+#include <linux/limits.h>
+#include <linux/lockdep.h>
+#include <linux/preempt.h>
+#include <linux/rcupdate.h>
+
+#define RCUREF_ONEREF		0x00000000U
+#define RCUREF_MAXREF		0x7FFFFFFFU
+#define RCUREF_SATURATED	0xA0000000U
+#define RCUREF_RELEASED		0xC0000000U
+#define RCUREF_DEAD		0xE0000000U
+#define RCUREF_NOREF		0xFFFFFFFFU
+
+/**
+ * rcuref_init - Initialize a rcuref reference count with the given reference count
+ * @ref:	Pointer to the reference count
+ * @cnt:	The initial reference count typically '1'
+ */
+static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
+{
+	atomic_set(&ref->refcnt, cnt - 1);
+}
+
+/**
+ * rcuref_read - Read the number of held reference counts of a rcuref
+ * @ref:	Pointer to the reference count
+ *
+ * Return: The number of held references (0 ... N)
+ */
+static inline unsigned int rcuref_read(rcuref_t *ref)
+{
+	unsigned int c = atomic_read(&ref->refcnt);
+
+	/* Return 0 if within the DEAD zone. */
+	return c >= RCUREF_RELEASED ? 0 : c + 1;
+}
+
+extern __must_check bool rcuref_get_slowpath(rcuref_t *ref);
+
+/**
+ * rcuref_get - Acquire one reference on a rcuref reference count
+ * @ref:	Pointer to the reference count
+ *
+ * Similar to atomic_inc_not_zero() but saturates at RCUREF_MAXREF.
+ *
+ * Provides no memory ordering, it is assumed the caller has guaranteed the
+ * object memory to be stable (RCU, etc.). It does provide a control dependency
+ * and thereby orders future stores. See documentation in lib/rcuref.c
+ *
+ * Return:
+ *	False if the attempt to acquire a reference failed. This happens
+ *	when the last reference has been put already
+ *
+ *	True if a reference was successfully acquired
+ */
+static inline __must_check bool rcuref_get(rcuref_t *ref)
+{
+	/*
+	 * Unconditionally increase the reference count. The saturation and
+	 * dead zones provide enough tolerance for this.
+	 */
+	if (likely(!atomic_add_negative_relaxed(1, &ref->refcnt)))
+		return true;
+
+	/* Handle the cases inside the saturation and dead zones */
+	return rcuref_get_slowpath(ref);
+}
+
+extern __must_check bool rcuref_put_slowpath(rcuref_t *ref);
+
+/*
+ * Internal helper. Do not invoke directly.
+ */
+static __always_inline __must_check bool __rcuref_put(rcuref_t *ref)
+{
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held() && preemptible(),
+			 "suspicious rcuref_put_rcusafe() usage");
+	/*
+	 * Unconditionally decrease the reference count. The saturation and
+	 * dead zones provide enough tolerance for this.
+	 */
+	if (likely(!atomic_add_negative_release(-1, &ref->refcnt)))
+		return false;
+
+	/*
+	 * Handle the last reference drop and cases inside the saturation
+	 * and dead zones.
+	 */
+	return rcuref_put_slowpath(ref);
+}
+
+/**
+ * rcuref_put_rcusafe -- Release one reference for a rcuref reference count RCU safe
+ * @ref:	Pointer to the reference count
+ *
+ * Provides release memory ordering, such that prior loads and stores are done
+ * before, and provides an acquire ordering on success such that free()
+ * must come after.
+ *
+ * Can be invoked from contexts, which guarantee that no grace period can
+ * happen which would free the object concurrently if the decrement drops
+ * the last reference and the slowpath races against a concurrent get() and
+ * put() pair. rcu_read_lock()'ed and atomic contexts qualify.
+ *
+ * Return:
+ *	True if this was the last reference with no future references
+ *	possible. This signals the caller that it can safely release the
+ *	object which is protected by the reference counter.
+ *
+ *	False if there are still active references or the put() raced
+ *	with a concurrent get()/put() pair. Caller is not allowed to
+ *	release the protected object.
+ */
+static inline __must_check bool rcuref_put_rcusafe(rcuref_t *ref)
+{
+	return __rcuref_put(ref);
+}
+
+/**
+ * rcuref_put -- Release one reference for a rcuref reference count
+ * @ref:	Pointer to the reference count
+ *
+ * Can be invoked from any context.
+ *
+ * Provides release memory ordering, such that prior loads and stores are done
+ * before, and provides an acquire ordering on success such that free()
+ * must come after.
+ *
+ * Return:
+ *
+ *	True if this was the last reference with no future references
+ *	possible. This signals the caller that it can safely schedule the
+ *	object, which is protected by the reference counter, for
+ *	deconstruction.
+ *
+ *	False if there are still active references or the put() raced
+ *	with a concurrent get()/put() pair. Caller is not allowed to
+ *	deconstruct the protected object.
+ */
+static inline __must_check bool rcuref_put(rcuref_t *ref)
+{
+	bool released;
+
+	preempt_disable();
+	released = __rcuref_put(ref);
+	preempt_enable();
+	return released;
+}
+
+#endif
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -175,6 +175,12 @@ typedef struct {
 } atomic64_t;
 #endif
 
+typedef struct {
+	atomic_t refcnt;
+} rcuref_t;
+
+#define RCUREF_INIT(i)	{ .refcnt = ATOMIC_INIT(i - 1) }
+
 struct list_head {
 	struct list_head *next, *prev;
 };
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -47,7 +47,7 @@ obj-y += bcd.o sort.o parser.o debug_loc
 	 list_sort.o uuid.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o rhashtable.o base64.o \
-	 once.o refcount.o usercopy.o errseq.o bucket_locks.o \
+	 once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
 	 generic-radix-tree.o
 obj-$(CONFIG_STRING_SELFTEST) += test_string.o
 obj-y += string_helpers.o
--- /dev/null
+++ b/lib/rcuref.c
@@ -0,0 +1,281 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * rcuref - A scalable reference count implementation for RCU managed objects
+ *
+ * rcuref is provided to replace open coded reference count implementations
+ * based on atomic_t. It protects explicitly RCU managed objects which can
+ * be visible even after the last reference has been dropped and the object
+ * is heading towards destruction.
+ *
+ * A common usage pattern is:
+ *
+ * get()
+ *	rcu_read_lock();
+ *	p = get_ptr();
+ *	if (p && !atomic_inc_not_zero(&p->refcnt))
+ *		p = NULL;
+ *	rcu_read_unlock();
+ *	return p;
+ *
+ * put()
+ *	if (!atomic_dec_return(&p->refcnt)) {
+ *		remove_ptr(p);
+ *		kfree_rcu(p, rcu);
+ *	}
+ *
+ * atomic_inc_not_zero() is implemented with a try_cmpxchg() loop which has
+ * O(N^2) behaviour under contention with N concurrent operations.
+ *
+ * rcuref uses atomic_add_negative_relaxed() for the fast path, which scales
+ * better under contention.
+ *
+ * Why not refcount?
+ * =================
+ *
+ * In principle it should be possible to make refcount use the rcuref
+ * scheme, but the destruction race described below cannot be prevented
+ * unless the protected object is RCU managed.
+ *
+ * Theory of operation
+ * ===================
+ *
+ * rcuref uses an unsigned integer reference counter. As long as the
+ * counter value is greater than or equal to RCUREF_ONEREF and not larger
+ * than RCUREF_MAXREF the reference is alive:
+ *
+ * ONEREF   MAXREF                SATURATED             RELEASED      DEAD    NOREF
+ * 0        0x7FFFFFFF 0x80000000 0xA0000000 0xBFFFFFFF 0xC0000000 0xE0000000 0xFFFFFFFF
+ * <---valid --------> <-------saturation zone--------> <-----dead zone----->
+ *
+ * The get() and put() operations do unconditional increments and
+ * decrements. The result is checked after the operation. This optimizes
+ * for the fast path.
+ *
+ * If the reference count is saturated or dead, then the increments and
+ * decrements are not harmful as the reference count still stays in the
+ * respective zones and is always set back to SATURATED resp. DEAD. The
+ * zones have room for 2^28 racing operations in each direction, which
+ * makes it practically impossible to escape the zones.
+ *
+ * Once the last reference is dropped the reference count becomes
+ * RCUREF_NOREF which forces rcuref_put() into the slowpath operation. The
+ * slowpath then tries to set the reference count from RCUREF_NOREF to
+ * RCUREF_DEAD via a cmpxchg(). This opens a small window where a
+ * concurrent rcuref_get() can acquire the reference count and bring it
+ * back to RCUREF_ONEREF or even drop the reference again and mark it DEAD.
+ *
+ * If the cmpxchg() succeeds then a concurrent rcuref_get() will result in
+ * DEAD + 1, which is inside the dead zone. If that happens the reference
+ * count is put back to DEAD.
+ *
+ * The actual race is possible due to the unconditional increment and
+ * decrements in rcuref_get() and rcuref_put():
+ *
+ *	T1				T2
+ *	get()				put()
+ *					if (atomic_add_negative(-1, &ref->refcnt))
+ *		succeeds->			atomic_cmpxchg(&ref->refcnt, -1, DEAD);
+ *
+ *	atomic_add_negative(1, &ref->refcnt);	<- Elevates refcount to DEAD + 1
+ *
+ * As the result of T1's add is negative, the get() goes into the slow path
+ * and observes refcnt being in the dead zone which makes the operation fail.
+ *
+ * Possible critical states:
+ *
+ *	Context Counter	References	Operation
+ *	T1	0	1		init()
+ *	T2	1	2		get()
+ *	T1	0	1		put()
+ *	T2     -1	0		put() tries to mark dead
+ *	T1	0	1		get()
+ *	T2	0	1		put() mark dead fails
+ *	T1     -1	0		put() tries to mark dead
+ *	T1    DEAD	0		put() mark dead succeeds
+ *	T2    DEAD+1	0		get() fails and puts it back to DEAD
+ *
+ * Of course there are more complex scenarios, but the above illustrates
+ * the working principle. The rest is left to the imagination of the
+ * reader.
+ *
+ * Deconstruction race
+ * ===================
+ *
+ * The release operation must be protected by prohibiting a grace period in
+ * order to prevent a possible use after free:
+ *
+ *	T1				T2
+ *	put()				get()
+ *	// ref->refcnt = ONEREF
+ *	if (!atomic_add_negative(-1, &ref->refcnt))
+ *		return false;				<- Not taken
+ *
+ *	// ref->refcnt == NOREF
+ *	--> preemption
+ *					// Elevates ref->refcnt to ONEREF
+ *					if (!atomic_add_negative(1, &ref->refcnt))
+ *						return true;			<- taken
+ *
+ *					if (put(&p->ref)) { <-- Succeeds
+ *						remove_pointer(p);
+ *						kfree_rcu(p, rcu);
+ *					}
+ *
+ *		RCU grace period ends, object is freed
+ *
+ *	atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);	<- UAF
+ *
+ * This is prevented by disabling preemption around the put() operation as
+ * that's in most kernel configurations cheaper than a rcu_read_lock() /
+ * rcu_read_unlock() pair and in many cases even a NOOP. In any case it
+ * prevents the grace period which keeps the object alive until all put()
+ * operations complete.
+ *
+ * Saturation protection
+ * =====================
+ *
+ * The reference count has a saturation limit RCUREF_MAXREF (INT_MAX).
+ * Once this is exceeded the reference count becomes stale by setting it
+ * to RCUREF_SATURATED, which will cause a memory leak, but it prevents
+ * wrap arounds which obviously cause worse problems than a memory
+ * leak. When saturation is reached a warning is emitted.
+ *
+ * Race conditions
+ * ===============
+ *
+ * All reference count increment/decrement operations are unconditional and
+ * only verified after the fact. This optimizes for the good case and takes
+ * the occasional race vs. a dead or already saturated refcount into
+ * account. The saturation and dead zones are large enough to accommodate
+ * that.
+ *
+ * Memory ordering
+ * ===============
+ *
+ * Memory ordering rules are slightly relaxed wrt regular atomic_t functions
+ * and provide only what is strictly required for refcounts.
+ *
+ * The increments are fully relaxed; these will not provide ordering. The
+ * rationale is that whatever is used to obtain the object to increase the
+ * reference count on will provide the ordering. For locked data
+ * structures, it's the lock acquire, for RCU/lockless data structures it's
+ * the dependent load.
+ *
+ * rcuref_get() provides a control dependency ordering future stores which
+ * ensures that the object is not modified when acquiring a reference
+ * fails.
+ *
+ * rcuref_put() provides release order, i.e. all prior loads and stores
+ * will be issued before. It also provides a control dependency ordering
+ * against the subsequent destruction of the object.
+ *
+ * If rcuref_put() successfully dropped the last reference and marked the
+ * object DEAD it also provides acquire ordering.
+ */
+
+#include <linux/export.h>
+#include <linux/rcuref.h>
+
+/**
+ * rcuref_get_slowpath - Slowpath of rcuref_get()
+ * @ref:	Pointer to the reference count
+ *
+ * Invoked when the reference count is outside of the valid zone.
+ *
+ * Return:
+ *	False if the reference count was already marked dead
+ *
+ *	True if the reference count is saturated, which prevents the
+ *	object from being deconstructed ever.
+ */
+bool rcuref_get_slowpath(rcuref_t *ref)
+{
+	unsigned int cnt = atomic_read(&ref->refcnt);
+
+	/*
+	 * If the reference count was already marked dead, undo the
+	 * increment so it stays in the middle of the dead zone and return
+	 * fail.
+	 */
+	if (cnt >= RCUREF_RELEASED) {
+		atomic_set(&ref->refcnt, RCUREF_DEAD);
+		return false;
+	}
+
+	/*
+	 * If it was saturated, warn and mark it so. In case the increment
+	 * was already on a saturated value restore the saturation
+	 * marker. This keeps it in the middle of the saturation zone and
+	 * prevents the reference count from overflowing. This leaks the
+	 * object memory, but prevents the obvious reference count overflow
+	 * damage.
+	 */
+	if (WARN_ONCE(cnt > RCUREF_MAXREF, "rcuref saturated - leaking memory"))
+		atomic_set(&ref->refcnt, RCUREF_SATURATED);
+	return true;
+}
+EXPORT_SYMBOL_GPL(rcuref_get_slowpath);
+
+/**
+ * rcuref_put_slowpath - Slowpath of __rcuref_put()
+ * @ref:	Pointer to the reference count
+ *
+ * Invoked when the reference count is outside of the valid zone.
+ *
+ * Return:
+ *	True if this was the last reference with no future references
+ *	possible. This signals the caller that it can safely schedule the
+ *	object, which is protected by the reference counter, for
+ *	deconstruction.
+ *
+ *	False if there are still active references or the put() raced
+ *	with a concurrent get()/put() pair. Caller is not allowed to
+ *	deconstruct the protected object.
+ */
+bool rcuref_put_slowpath(rcuref_t *ref)
+{
+	unsigned int cnt = atomic_read(&ref->refcnt);
+
+	/* Did this drop the last reference? */
+	if (likely(cnt == RCUREF_NOREF)) {
+		/*
+		 * Carefully try to set the reference count to RCUREF_DEAD.
+		 *
+		 * This can fail if a concurrent get() operation has
+		 * elevated it again or the corresponding put() even marked
+		 * it dead already. Both are valid situations and do not
+		 * require a retry. If this fails the caller is not
+		 * allowed to deconstruct the object.
+		 */
+		if (atomic_cmpxchg_release(&ref->refcnt, RCUREF_NOREF, RCUREF_DEAD) != RCUREF_NOREF)
+			return false;
+
+		/*
+		 * The caller can safely schedule the object for
+		 * deconstruction. Provide acquire ordering.
+		 */
+		smp_acquire__after_ctrl_dep();
+		return true;
+	}
+
+	/*
+	 * If the reference count was already in the dead zone, then this
+	 * put() operation is imbalanced. Warn, put the reference count back to
+	 * DEAD and tell the caller to not deconstruct the object.
+	 */
+	if (WARN_ONCE(cnt >= RCUREF_RELEASED, "rcuref - imbalanced put()")) {
+		atomic_set(&ref->refcnt, RCUREF_DEAD);
+		return false;
+	}
+
+	/*
+	 * This is a put() operation on a saturated refcount. Restore the
+	 * mean saturation value and tell the caller to not deconstruct the
+	 * object.
+	 */
+	if (cnt > RCUREF_MAXREF)
+		atomic_set(&ref->refcnt, RCUREF_SATURATED);
+	return false;
+}
+EXPORT_SYMBOL_GPL(rcuref_put_slowpath);



* [patch V2 4/4] net: dst: Switch to rcuref_t reference counting
  2023-03-07 12:57 [patch V2 0/4] net, refcount: Address dst_entry reference count scalability issues Thomas Gleixner
                   ` (2 preceding siblings ...)
  2023-03-07 12:57 ` [patch V2 3/4] atomics: Provide rcuref - scalable reference counting Thomas Gleixner
@ 2023-03-07 12:57 ` Thomas Gleixner
  2023-03-07 17:55   ` Linus Torvalds
  3 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2023-03-07 12:57 UTC (permalink / raw)
  To: LKML
  Cc: Linus Torvalds, x86, Wangyang Guo, Arjan van De Ven,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier, Arjan Van De Ven

Under high contention dst_entry::__refcnt becomes a significant bottleneck.

atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into
high retry rates on contention.

Switch the reference count to rcuref_t which results in a significant
performance gain.

The gain depends on the micro-architecture and the number of concurrent
operations and has been measured in the range of +25% to +130% with a
localhost memtier/memcached benchmark which amplifies the problem
massively.

When running the memtier/memcached benchmark over a real (1Gb) network
connection, the conversion on top of the false sharing fix for struct
dst_entry::__refcnt results in a total gain in the 2%-5% range over the
upstream baseline.

Reported-by: Wangyang Guo <wangyang.guo@intel.com>
Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
---
 include/net/dst.h               |    9 +++++----
 include/net/sock.h              |    2 +-
 net/bridge/br_nf_core.c         |    2 +-
 net/core/dst.c                  |   26 +++++---------------------
 net/core/rtnetlink.c            |    2 +-
 net/ipv6/route.c                |    6 +++---
 net/netfilter/ipvs/ip_vs_xmit.c |    4 ++--
 7 files changed, 18 insertions(+), 33 deletions(-)

--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -16,6 +16,7 @@
 #include <linux/bug.h>
 #include <linux/jiffies.h>
 #include <linux/refcount.h>
+#include <linux/rcuref.h>
 #include <net/neighbour.h>
 #include <asm/processor.h>
 #include <linux/indirect_call_wrapper.h>
@@ -65,7 +66,7 @@ struct dst_entry {
 	 * input/output/ops or performance tanks badly
 	 */
 #ifdef CONFIG_64BIT
-	atomic_t		__refcnt;	/* 64-bit offset 64 */
+	rcuref_t		__refcnt;	/* 64-bit offset 64 */
 #endif
 	int			__use;
 	unsigned long		lastuse;
@@ -75,7 +76,7 @@ struct dst_entry {
 	__u32			tclassid;
 #ifndef CONFIG_64BIT
 	struct lwtunnel_state   *lwtstate;
-	atomic_t		__refcnt;	/* 32-bit offset 64 */
+	rcuref_t		__refcnt;	/* 32-bit offset 64 */
 #endif
 	netdevice_tracker	dev_tracker;
 
@@ -241,7 +242,7 @@ static inline void dst_hold(struct dst_e
 	 * the placement of __refcnt in struct dst_entry
 	 */
 	BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
-	WARN_ON(atomic_inc_not_zero(&dst->__refcnt) == 0);
+	WARN_ON(!rcuref_get(&dst->__refcnt));
 }
 
 static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
@@ -305,7 +306,7 @@ static inline void skb_dst_copy(struct s
  */
 static inline bool dst_hold_safe(struct dst_entry *dst)
 {
-	return atomic_inc_not_zero(&dst->__refcnt);
+	return rcuref_get(&dst->__refcnt);
 }
 
 /**
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2131,7 +2131,7 @@ sk_dst_get(struct sock *sk)
 
 	rcu_read_lock();
 	dst = rcu_dereference(sk->sk_dst_cache);
-	if (dst && !atomic_inc_not_zero(&dst->__refcnt))
+	if (dst && !rcuref_get(&dst->__refcnt))
 		dst = NULL;
 	rcu_read_unlock();
 	return dst;
--- a/net/bridge/br_nf_core.c
+++ b/net/bridge/br_nf_core.c
@@ -73,7 +73,7 @@ void br_netfilter_rtable_init(struct net
 {
 	struct rtable *rt = &br->fake_rtable;
 
-	atomic_set(&rt->dst.__refcnt, 1);
+	rcuref_init(&rt->dst.__refcnt, 1);
 	rt->dst.dev = br->dev;
 	dst_init_metrics(&rt->dst, br_dst_default_metrics, true);
 	rt->dst.flags	= DST_NOXFRM | DST_FAKE_RTABLE;
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -66,7 +66,7 @@ void dst_init(struct dst_entry *dst, str
 	dst->tclassid = 0;
 #endif
 	dst->lwtstate = NULL;
-	atomic_set(&dst->__refcnt, initial_ref);
+	rcuref_init(&dst->__refcnt, initial_ref);
 	dst->__use = 0;
 	dst->lastuse = jiffies;
 	dst->flags = flags;
@@ -162,31 +162,15 @@ EXPORT_SYMBOL(dst_dev_put);
 
 void dst_release(struct dst_entry *dst)
 {
-	if (dst) {
-		int newrefcnt;
-
-		newrefcnt = atomic_dec_return(&dst->__refcnt);
-		if (WARN_ONCE(newrefcnt < 0, "dst_release underflow"))
-			net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
-					     __func__, dst, newrefcnt);
-		if (!newrefcnt)
-			call_rcu_hurry(&dst->rcu_head, dst_destroy_rcu);
-	}
+	if (dst && rcuref_put(&dst->__refcnt))
+		call_rcu_hurry(&dst->rcu_head, dst_destroy_rcu);
 }
 EXPORT_SYMBOL(dst_release);
 
 void dst_release_immediate(struct dst_entry *dst)
 {
-	if (dst) {
-		int newrefcnt;
-
-		newrefcnt = atomic_dec_return(&dst->__refcnt);
-		if (WARN_ONCE(newrefcnt < 0, "dst_release_immediate underflow"))
-			net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
-					     __func__, dst, newrefcnt);
-		if (!newrefcnt)
-			dst_destroy(dst);
-	}
+	if (dst && rcuref_put(&dst->__refcnt))
+		dst_destroy(dst);
 }
 EXPORT_SYMBOL(dst_release_immediate);
 
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -840,7 +840,7 @@ int rtnl_put_cacheinfo(struct sk_buff *s
 	if (dst) {
 		ci.rta_lastuse = jiffies_delta_to_clock_t(jiffies - dst->lastuse);
 		ci.rta_used = dst->__use;
-		ci.rta_clntref = atomic_read(&dst->__refcnt);
+		ci.rta_clntref = rcuref_read(&dst->__refcnt);
 	}
 	if (expires) {
 		unsigned long clock;
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -293,7 +293,7 @@ static const struct fib6_info fib6_null_
 
 static const struct rt6_info ip6_null_entry_template = {
 	.dst = {
-		.__refcnt	= ATOMIC_INIT(1),
+		.__refcnt	= RCUREF_INIT(1),
 		.__use		= 1,
 		.obsolete	= DST_OBSOLETE_FORCE_CHK,
 		.error		= -ENETUNREACH,
@@ -307,7 +307,7 @@ static const struct rt6_info ip6_null_en
 
 static const struct rt6_info ip6_prohibit_entry_template = {
 	.dst = {
-		.__refcnt	= ATOMIC_INIT(1),
+		.__refcnt	= RCUREF_INIT(1),
 		.__use		= 1,
 		.obsolete	= DST_OBSOLETE_FORCE_CHK,
 		.error		= -EACCES,
@@ -319,7 +319,7 @@ static const struct rt6_info ip6_prohibi
 
 static const struct rt6_info ip6_blk_hole_entry_template = {
 	.dst = {
-		.__refcnt	= ATOMIC_INIT(1),
+		.__refcnt	= RCUREF_INIT(1),
 		.__use		= 1,
 		.obsolete	= DST_OBSOLETE_FORCE_CHK,
 		.error		= -EINVAL,
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -339,7 +339,7 @@ static int
 			spin_unlock_bh(&dest->dst_lock);
 			IP_VS_DBG(10, "new dst %pI4, src %pI4, refcnt=%d\n",
 				  &dest->addr.ip, &dest_dst->dst_saddr.ip,
-				  atomic_read(&rt->dst.__refcnt));
+				  rcuref_read(&rt->dst.__refcnt));
 		}
 		if (ret_saddr)
 			*ret_saddr = dest_dst->dst_saddr.ip;
@@ -507,7 +507,7 @@ static int
 			spin_unlock_bh(&dest->dst_lock);
 			IP_VS_DBG(10, "new dst %pI6, src %pI6, refcnt=%d\n",
 				  &dest->addr.in6, &dest_dst->dst_saddr.in6,
-				  atomic_read(&rt->dst.__refcnt));
+				  rcuref_read(&rt->dst.__refcnt));
 		}
 		if (ret_saddr)
 			*ret_saddr = dest_dst->dst_saddr.in6;



* Re: [patch V2 4/4] net: dst: Switch to rcuref_t reference counting
  2023-03-07 12:57 ` [patch V2 4/4] net: dst: Switch to rcuref_t " Thomas Gleixner
@ 2023-03-07 17:55   ` Linus Torvalds
  2023-03-07 23:00     ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2023-03-07 17:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Wangyang Guo, Arjan van De Ven, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, Will Deacon,
	Peter Zijlstra, Boqun Feng, Mark Rutland, Marc Zyngier,
	Arjan Van De Ven

On Tue, Mar 7, 2023 at 4:57 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> -       atomic_t                __refcnt;       /* 64-bit offset 64 */
> +       rcuref_t                __refcnt;       /* 64-bit offset 64 */

> -       atomic_t                __refcnt;       /* 32-bit offset 64 */
> +       rcuref_t                __refcnt;       /* 32-bit offset 64 */

I assume any mis-use is caught by typechecking, but I'd be even
happier if you changed the name of the member when you fundamentally
change the use model for it (eg "__refcnt" -> "__rcuref" or
something).

Or was there some reason for not doing that?

          Linus


* Re: [patch V2 4/4] net: dst: Switch to rcuref_t reference counting
  2023-03-07 17:55   ` Linus Torvalds
@ 2023-03-07 23:00     ` Thomas Gleixner
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Gleixner @ 2023-03-07 23:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, x86, Wangyang Guo, Arjan van De Ven, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, Will Deacon,
	Peter Zijlstra, Boqun Feng, Mark Rutland, Marc Zyngier,
	Arjan Van De Ven

On Tue, Mar 07 2023 at 09:55, Linus Torvalds wrote:
> On Tue, Mar 7, 2023 at 4:57 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> -       atomic_t                __refcnt;       /* 64-bit offset 64 */
>> +       rcuref_t                __refcnt;       /* 64-bit offset 64 */
>
>> -       atomic_t                __refcnt;       /* 32-bit offset 64 */
>> +       rcuref_t                __refcnt;       /* 32-bit offset 64 */
>
> I assume any mis-use is caught by typechecking, but I'd be even
> happier if you changed the name of the member when you fundamentally
> change the use model for it (eg "__refcnt" -> "__rcuref" or
> something).
>
> Or was there some reason for not doing that?

Other than sheer laziness? No. You have a valid point and I should have
thought about that myself.

Thanks for calling me out on that.

       tglx


* Re: [patch V2 3/4] atomics: Provide rcuref - scalable reference counting
  2023-03-07 12:57 ` [patch V2 3/4] atomics: Provide rcuref - scalable reference counting Thomas Gleixner
@ 2023-03-09  8:35   ` Qiuxu Zhuo
  2023-03-20 16:05     ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Qiuxu Zhuo @ 2023-03-09  8:35 UTC (permalink / raw)
  To: tglx
  Cc: arjan.van.de.ven, arjan, boqun.feng, davem, edumazet, kuba,
	linux-kernel, mark.rutland, maz, netdev, pabeni, peterz,
	torvalds, wangyang.guo, will, x86

Hi Thomas,

Some comments on the comments.
If I'm wrong, please correct me ;-).

> From: Thomas Gleixner <tglx@linutronix.de>
> To: LKML <linux-kernel@vger.kernel.org>
> Cc: Linus Torvalds <torvalds@linuxfoundation.org>,
> 	x86@kernel.org, Wangyang Guo <wangyang.guo@intel.com>,
> 	Arjan van De Ven <arjan@linux.intel.com>,
> 	"David S. Miller" <davem@davemloft.net>,
> 	Eric Dumazet <edumazet@google.com>,
> 	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
> 	netdev@vger.kernel.org, Will Deacon <will@kernel.org>,
> 	Peter Zijlstra <peterz@infradead.org>,
> 	Boqun Feng <boqun.feng@gmail.com>,
> 	Mark Rutland <mark.rutland@arm.com>,
> 	Marc Zyngier <maz@kernel.org>,
> 	Arjan Van De Ven <arjan.van.de.ven@intel.com>
> Subject: [patch V2 3/4] atomics: Provide rcuref - scalable reference counting
> 
> atomic_t based reference counting, including refcount_t, uses
> atomic_inc_not_zero() for acquiring a reference. atomic_inc_not_zero() is
> implemented with an atomic_try_cmpxchg() loop. High contention of the
> reference count leads to retry loops and scales badly. There is nothing to
> improve on this implementation as the semantics have to be preserved.
> 
> Provide rcuref as a scalable alternative solution which is suitable for RCU
> managed objects. Similar to refcount_t it comes with overflow and underflow
> detection and mitigation.
> 
> rcuref treats the underlying atomic_t as an unsigned integer and partitions
> this space into zones:
> 
>   0x00000000 - 0x7FFFFFFF	valid zone (1 .. INT_MAX references)

From rcuref_read()'s point of view:
0x00000000 encodes 1 reference, so 0x7FFFFFFF should encode INT_MAX + 1 references.

>   0x80000000 - 0xBFFFFFFF	saturation zone
>   0xC0000000 - 0xFFFFFFFE	dead zone
>   0xFFFFFFFF   			no reference
> 
> rcuref_get() unconditionally increments the reference count with
> atomic_add_negative_relaxed(). rcuref_put() unconditionally decrements the
> reference count with atomic_add_negative_release().
> 
> This unconditional increment avoids the inc_not_zero() problem, but
> requires a more complex implementation on the put() side when the count
> drops from 0 to -1.
> 
> When this transition is detected then it is attempted to mark the reference
> count dead, by setting it to the midpoint of the dead zone with a single
> atomic_cmpxchg_release() operation. This operation can fail due to a
> concurrent rcuref_get() elevating the reference count from -1 to 0 again.
> 
> If the unconditional increment in rcuref_get() hits a reference count which
> is marked dead (or saturated) it will detect it after the fact and bring
> back the reference count to the midpoint of the respective zone. The zones
> provide enough tolerance which makes it practically impossible to escape
> from a zone.

[...]

> + * Why not refcount?
> + * =================
> + *
> + * In principle it should be possible to make refcount use the rcuref
> + * scheme, but the destruction race described below cannot be prevented
> + * unless the protected object is RCU managed.
> + *
> + * Theory of operation
> + * ===================
> + *
> + * rcuref uses an unsigned integer reference counter. As long as the
> + * counter value is greater than or equal to RCUREF_ONEREF and not larger
> + * than RCUREF_MAXREF the reference is alive:
> + *
> + * ONEREF   MAXREF                SATURATED             RELEASED      DEAD    NOREF
> + * 0        0x7FFFFFFF 0x80000000 0xA0000000 0xBFFFFFFF 0xC0000000 0xE0000000 0xFFFFFFFF
> + * <---valid --------> <-------saturation zone--------> <-----dead zone----->
> + *
> + * The get() and put() operations do unconditional increments and
> + * decrements. The result is checked after the operation. This optimizes
> + * for the fast path.
> + *
> + * If the reference count is saturated or dead, then the increments and
> + * decrements are not harmful as the reference count still stays in the
> + * respective zones and is always set back to SATURATED resp. DEAD. The
> + * zones have room for 2^28 racing operations in each direction, which
> + * makes it practically impossible to escape the zones.
> + *
> + * Once the last reference is dropped the reference count becomes
> + * RCUREF_NOREF which forces rcuref_put() into the slowpath operation. The
> + * slowpath then tries to set the reference count from RCUREF_NOREF to
> + * RCUREF_DEAD via a cmpxchg(). This opens a small window where a
> + * concurrent rcuref_get() can acquire the reference count and bring it
> + * back to RCUREF_ONEREF or even drop the reference again and mark it DEAD.
> + *
> + * If the cmpxchg() succeeds then a concurrent rcuref_get() will result in
> + * DEAD + 1, which is inside the dead zone. If that happens the reference
> + * count is put back to DEAD.
> + *
> + * The actual race is possible due to the unconditional increment and
> + * decrements in rcuref_get() and rcuref_put():
> + *
> + *	T1				T2
> + *	get()				put()
> + *					if (atomic_add_negative(1, &ref->refcnt))

For T2 put() here:
"if (atomic_add_negative(1, &ref->refcnt))" ->
"if (atomic_add_negative(-1, &ref->refcnt))"

> + *		succeeds->			atomic_cmpxchg(&ref->refcnt, -1, DEAD);

Would it be more readable with 's/-1/NOREF/g'?

> + *
> + *	atomic_add_negative(1, &ref->refcnt);	<- Elevates refcount to DEAD + 1
> + *
> + * As the result of T1's add is negative, the get() goes into the slow path
> + * and observes refcnt being in the dead zone which makes the operation fail.
> + *
> + * Possible critical states:
> + *
> + *	Context Counter	References	Operation
> + *	T1	0	1		init()
> + *	T2	1	2		get()
> + *	T1	0	1		put()
> + *	T2     -1	0		put() tries to mark dead
> + *	T1	0	1		get()
> + *	T2	0	1		put() mark dead fails
> + *	T1     -1	0		put() tries to mark dead
> + *	T1    DEAD	0		put() mark dead succeeds
> + *	T2    DEAD+1	0		get() fails and puts it back to DEAD
> + *
> + * Of course there are more complex scenarios, but the above illustrates
> + * the working principle. The rest is left to the imagination of the
> + * reader.
> + *
> + * Deconstruction race
> + * ===================
> + *
> + * The release operation must be protected by prohibiting a grace period in
> + * order to prevent a possible use after free:
> + *
> + *	T1				T2
> + *	put()				get()
> + *	// ref->refcnt = ONEREF
> + *	if (atomic_add_negative(-1, &ref->cnt))

For T1 put() here:
"if (atomic_add_negative(-1, &ref->cnt))" ->
"if (!atomic_add_negative(-1, &ref->cnt))"

> + *		return false;				<- Not taken
> + *
> + *	// ref->refcnt == NOREF
> + *	--> preemption
> + *					// Elevates ref->c to ONEREF

s/ref->c/ref->refcnt/g

> + *					if (!atomic_add_negative(1, &ref->refcnt))
> + *						return true;			<- taken
> + *
> + *					if (put(&p->ref)) { <-- Succeeds
> + *						remove_pointer(p);
> + *						kfree_rcu(p, rcu);
> + *					}
> + *
> + *		RCU grace period ends, object is freed
> + *
> + *	atomic_cmpxchg(&ref->refcnt, NONE, DEAD);	<- UAF

s/NONE/NOREF/g

[...]
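
With the sign fixes above applied (put() adds -1 and the cmpxchg compares
against NOREF), the "Possible critical states" table can be replayed in plain
user-space C. The following is only a sketch under assumptions: it uses C11
<stdatomic.h> instead of the kernel's atomic_t API, and struct model_ref,
model_get(), model_put_fast(), model_put_slow() and model_replay() are
hypothetical names that are not part of the patch. put() is split into its
two halves so the window between the decrement and the cmpxchg() can be
interleaved by hand:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Zone constants as listed in the quoted comment block. */
#define ONEREF    0x00000000u
#define MAXREF    0x7FFFFFFFu
#define SATURATED 0xA0000000u
#define RELEASED  0xC0000000u
#define DEAD      0xE0000000u
#define NOREF     0xFFFFFFFFu

struct model_ref {
	_Atomic uint32_t cnt;	/* hypothetical stand-in for the refcount */
};

/* Add @i and report whether the 32-bit result has its sign bit set. */
static bool model_add_negative(struct model_ref *r, int i, memory_order mo)
{
	uint32_t old = atomic_fetch_add_explicit(&r->cnt, (uint32_t)i, mo);

	return ((uint32_t)(old + (uint32_t)i) & 0x80000000u) != 0;
}

static bool model_get(struct model_ref *r)
{
	uint32_t c;

	/* Fast path: unconditional increment, check afterwards. */
	if (!model_add_negative(r, 1, memory_order_relaxed))
		return true;

	/* Slow path: result is in the saturation or the dead zone. */
	c = atomic_load_explicit(&r->cnt, memory_order_relaxed);
	if (c >= RELEASED) {
		atomic_store_explicit(&r->cnt, DEAD, memory_order_relaxed);
		return false;
	}
	atomic_store_explicit(&r->cnt, SATURATED, memory_order_relaxed);
	return true;
}

/* First half of put(): true means the slow path has to run. */
static bool model_put_fast(struct model_ref *r)
{
	return model_add_negative(r, -1, memory_order_release);
}

/* Second half of put(): true means the caller may free the object. */
static bool model_put_slow(struct model_ref *r)
{
	uint32_t c = atomic_load_explicit(&r->cnt, memory_order_relaxed);
	uint32_t expected = NOREF;

	if (c == NOREF) {
		/* NOREF -> DEAD; fails if a get() slipped in meanwhile. */
		if (!atomic_compare_exchange_strong_explicit(&r->cnt,
				&expected, DEAD,
				memory_order_release, memory_order_relaxed))
			return false;
		atomic_thread_fence(memory_order_acquire);
		return true;
	}
	if (c >= RELEASED)	/* put() on an already dead object */
		atomic_store_explicit(&r->cnt, DEAD, memory_order_relaxed);
	else if (c > MAXREF)	/* saturated: pin to the zone midpoint */
		atomic_store_explicit(&r->cnt, SATURATED, memory_order_relaxed);
	return false;
}

static bool model_put(struct model_ref *r)
{
	return model_put_fast(r) && model_put_slow(r);
}

/* Replays the table row by row in a single thread. */
static void model_replay(void)
{
	struct model_ref r;

	atomic_init(&r.cnt, ONEREF);		/* T1 init(): counter 0 */
	assert(model_get(&r));			/* T2 get(): counter 1 */
	assert(!model_put(&r));			/* T1 put(): counter 0 */
	assert(model_put_fast(&r));		/* T2 put(): counter -1 */
	assert(atomic_load(&r.cnt) == NOREF);
	assert(model_get(&r));			/* T1 get(): counter 0 */
	assert(!model_put_slow(&r));		/* T2: mark dead fails */
	assert(model_put(&r));			/* T1 put(): marks dead */
	assert(atomic_load(&r.cnt) == DEAD);
	assert(!model_get(&r));			/* T2 get() fails */
	assert(atomic_load(&r.cnt) == DEAD);	/* DEAD + 1 was undone */
}
```

Calling model_replay() walks through all rows of the table. It does not model
the deconstruction race, which needs real concurrency and an RCU grace period.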


* Re: [patch V2 1/4] net: dst: Prevent false sharing vs. dst_entry::__refcnt
  2023-03-07 12:57 ` [patch V2 1/4] net: dst: Prevent false sharing vs. dst_entry::__refcnt Thomas Gleixner
@ 2023-03-15 20:36   ` Jakub Kicinski
  2023-03-15 20:47     ` Eric Dumazet
  2023-03-15 22:26     ` David Ahern
  0 siblings, 2 replies; 13+ messages in thread
From: Jakub Kicinski @ 2023-03-15 20:36 UTC (permalink / raw)
  To: Eric Dumazet, David Ahern
  Cc: Thomas Gleixner, LKML, x86, David S. Miller, Paolo Abeni, netdev

On Tue,  7 Mar 2023 13:57:42 +0100 (CET) Thomas Gleixner wrote:
> Move the rt[6i]_uncached[_list] members out of struct rtable and struct
> rt6_info into struct dst_entry to provide padding and move the lwtstate
> member after that so it ends up in the same cache line.

Eric, David, looks reasonable? 


* Re: [patch V2 1/4] net: dst: Prevent false sharing vs. dst_entry::__refcnt
  2023-03-15 20:36   ` Jakub Kicinski
@ 2023-03-15 20:47     ` Eric Dumazet
  2023-03-15 22:26     ` David Ahern
  1 sibling, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2023-03-15 20:47 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Ahern, Thomas Gleixner, LKML, x86, David S. Miller,
	Paolo Abeni, netdev

On Wed, Mar 15, 2023 at 1:37 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue,  7 Mar 2023 13:57:42 +0100 (CET) Thomas Gleixner wrote:
> > Move the rt[6i]_uncached[_list] members out of struct rtable and struct
> > rt6_info into struct dst_entry to provide padding and move the lwtstate
> > member after that so it ends up in the same cache line.
>
> Eric, David, looks reasonable?

Yes indeed, thanks for the heads up

Reviewed-by: Eric Dumazet <edumazet@google.com>


* Re: [patch V2 1/4] net: dst: Prevent false sharing vs. dst_entry::__refcnt
  2023-03-15 20:36   ` Jakub Kicinski
  2023-03-15 20:47     ` Eric Dumazet
@ 2023-03-15 22:26     ` David Ahern
  1 sibling, 0 replies; 13+ messages in thread
From: David Ahern @ 2023-03-15 22:26 UTC (permalink / raw)
  To: Jakub Kicinski, Eric Dumazet
  Cc: Thomas Gleixner, LKML, x86, David S. Miller, Paolo Abeni, netdev

On 3/15/23 2:36 PM, Jakub Kicinski wrote:
> On Tue,  7 Mar 2023 13:57:42 +0100 (CET) Thomas Gleixner wrote:
>> Move the rt[6i]_uncached[_list] members out of struct rtable and struct
>> rt6_info into struct dst_entry to provide padding and move the lwtstate
>> member after that so it ends up in the same cache line.
> 
> Eric, David, looks reasonable? 

Reviewed-by: David Ahern <dsahern@kernel.org>



* Re: [patch V2 3/4] atomics: Provide rcuref - scalable reference counting
  2023-03-09  8:35   ` Qiuxu Zhuo
@ 2023-03-20 16:05     ` Thomas Gleixner
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Gleixner @ 2023-03-20 16:05 UTC (permalink / raw)
  To: Qiuxu Zhuo
  Cc: arjan.van.de.ven, arjan, boqun.feng, davem, edumazet, kuba,
	linux-kernel, mark.rutland, maz, netdev, pabeni, peterz,
	torvalds, wangyang.guo, will, x86

Qiuxu!

On Thu, Mar 09 2023 at 16:35, Qiuxu Zhuo wrote:

>> rcuref treats the underlying atomic_t as an unsigned integer and partitions
>> this space into zones:
>> 
>>   0x00000000 - 0x7FFFFFFF	valid zone (1 .. INT_MAX references)
>
> From rcuref_read()'s point of view:
> 0x00000000 encodes 1 reference, so 0x7FFFFFFF should encode INT_MAX + 1
> references.

Correct.
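
To make the off-by-one concrete, here is a small user-space sketch of the
encoding. It is an illustration only: model_zone() and model_read() are
hypothetical helpers and not the functions from the patch, and the sketch
assumes that every raw value from RELEASED upwards reads as zero references
while all other values read as raw + 1. The zone boundaries are the ones
from the comment block quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Zone constants as listed in the quoted comment block. */
#define ONEREF    0x00000000u
#define MAXREF    0x7FFFFFFFu
#define SATURATED 0xA0000000u
#define RELEASED  0xC0000000u
#define DEAD      0xE0000000u
#define NOREF     0xFFFFFFFFu

enum model_zone_id {
	ZONE_VALID,
	ZONE_SATURATED,
	ZONE_DEAD,
	ZONE_NOREF,
};

/* Hypothetical helper: classify a raw counter value into its zone. */
static enum model_zone_id model_zone(uint32_t raw)
{
	if (raw <= MAXREF)
		return ZONE_VALID;
	if (raw < RELEASED)
		return ZONE_SATURATED;
	if (raw < NOREF)
		return ZONE_DEAD;
	return ZONE_NOREF;
}

/*
 * Hypothetical helper: number of references a raw value stands for.
 * Assumption: RELEASED and above read as 0, everything else as raw + 1,
 * so raw 0 is one reference and raw 0x7FFFFFFF is INT_MAX + 1 references.
 */
static uint32_t model_read(uint32_t raw)
{
	return raw >= RELEASED ? 0 : raw + 1;
}
```

With this model, model_read(ONEREF) is 1 and model_read(MAXREF) is
0x80000000, i.e. INT_MAX + 1, which is the point made in the review comment.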

>> + * The actual race is possible due to the unconditional increment and
>> + * decrements in rcuref_get() and rcuref_put():
>> + *
>> + *	T1				T2
>> + *	get()				put()
>> + *					if (atomic_add_negative(1, &ref->refcnt))
>
> For T2 put() here:
> "if (atomic_add_negative(1, &ref->refcnt))" ->
> "if (atomic_add_negative(-1, &ref->refcnt))"

Yup.


>> + *		succeeds->			atomic_cmpxchg(&ref->refcnt, -1, DEAD);
>
> Would it be more readable with 's/-1/NOREF/g'?

True

>> + *	T1				T2
>> + *	put()				get()
>> + *	// ref->refcnt = ONEREF
>> + *	if (atomic_add_negative(-1, &ref->cnt))
>
> For T1 put() here:
> "if (atomic_add_negative(-1, &ref->cnt))" ->
> "if (!atomic_add_negative(-1, &ref->cnt))"

Indeed.

>> + *		return false;				<- Not taken
>> + *
>> + *	// ref->refcnt == NOREF
>> + *	--> preemption
>> + *					// Elevates ref->c to ONEREF
>
> s/ref->c/ref->refcnt/g

Yes.

>> + *					if (!atomic_add_negative(1, &ref->refcnt))
>> + *						return true;			<- taken
>> + *
>> + *					if (put(&p->ref)) { <-- Succeeds
>> + *						remove_pointer(p);
>> + *						kfree_rcu(p, rcu);
>> + *					}
>> + *
>> + *		RCU grace period ends, object is freed
>> + *
>> + *	atomic_cmpxchg(&ref->refcnt, NONE, DEAD);	<- UAF
>
> s/NONE/NOREF/g

Right. Thanks for spotting these details!

       Thomas


* Re: [patch V2 2/4] atomics: Provide atomic_add_and_negative() variants
  2023-03-07 12:57 ` [patch V2 2/4] atomics: Provide atomic_add_and_negative() variants Thomas Gleixner
@ 2023-03-22 13:20   ` Mark Rutland
  0 siblings, 0 replies; 13+ messages in thread
From: Mark Rutland @ 2023-03-22 13:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Linus Torvalds, x86, Wangyang Guo, Arjan van De Ven,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, Will Deacon, Peter Zijlstra, Boqun Feng, Marc Zyngier

On Tue, Mar 07, 2023 at 01:57:43PM +0100, Thomas Gleixner wrote:
> atomic_add_and_negative() does not provide the relaxed/acquire/release
             ^^^^

Spurious 'and_' above.

> variants.
> 
> Provide them in preparation for a new scalable reference count algorithm.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Other than the typo, this looks reasonable to me, so FWIW:

Acked-by: Mark Rutland <mark.rutland@arm.com>

Mark.

> ---
> V2: New patch
> ---
>  include/linux/atomic/atomic-arch-fallback.h |  200 +++++++++++++++++++++++++++-
>  include/linux/atomic/atomic-instrumented.h  |   68 +++++++++
>  include/linux/atomic/atomic-long.h          |   38 +++++
>  scripts/atomic/atomics.tbl                  |    2 
>  scripts/atomic/fallbacks/add_negative       |    4 
>  5 files changed, 306 insertions(+), 6 deletions(-)
> 
> --- a/include/linux/atomic/atomic-arch-fallback.h
> +++ b/include/linux/atomic/atomic-arch-fallback.h
> @@ -1208,6 +1208,13 @@ arch_atomic_inc_and_test(atomic_t *v)
>  #define arch_atomic_inc_and_test arch_atomic_inc_and_test
>  #endif
>  
> +#ifndef arch_atomic_add_negative_relaxed
> +#ifdef arch_atomic_add_negative
> +#define arch_atomic_add_negative_acquire arch_atomic_add_negative
> +#define arch_atomic_add_negative_release arch_atomic_add_negative
> +#define arch_atomic_add_negative_relaxed arch_atomic_add_negative
> +#endif /* arch_atomic_add_negative */
> +
>  #ifndef arch_atomic_add_negative
>  /**
>   * arch_atomic_add_negative - add and test if negative
> @@ -1226,6 +1233,98 @@ arch_atomic_add_negative(int i, atomic_t
>  #define arch_atomic_add_negative arch_atomic_add_negative
>  #endif
>  
> +#ifndef arch_atomic_add_negative_acquire
> +/**
> + * arch_atomic_add_negative - add and test if negative
> + * @i: integer value to add
> + * @v: pointer of type atomic_t
> + *
> + * Atomically adds @i to @v and returns true
> + * if the result is negative, or false when
> + * result is greater than or equal to zero.
> + */
> +static __always_inline bool
> +arch_atomic_add_negative_acquire(int i, atomic_t *v)
> +{
> +	return arch_atomic_add_return_acquire(i, v) < 0;
> +}
> +#define arch_atomic_add_negative_acquire arch_atomic_add_negative_acquire
> +#endif
> +
> +#ifndef arch_atomic_add_negative_release
> +/**
> + * arch_atomic_add_negative - add and test if negative
> + * @i: integer value to add
> + * @v: pointer of type atomic_t
> + *
> + * Atomically adds @i to @v and returns true
> + * if the result is negative, or false when
> + * result is greater than or equal to zero.
> + */
> +static __always_inline bool
> +arch_atomic_add_negative_release(int i, atomic_t *v)
> +{
> +	return arch_atomic_add_return_release(i, v) < 0;
> +}
> +#define arch_atomic_add_negative_release arch_atomic_add_negative_release
> +#endif
> +
> +#ifndef arch_atomic_add_negative_relaxed
> +/**
> + * arch_atomic_add_negative - add and test if negative
> + * @i: integer value to add
> + * @v: pointer of type atomic_t
> + *
> + * Atomically adds @i to @v and returns true
> + * if the result is negative, or false when
> + * result is greater than or equal to zero.
> + */
> +static __always_inline bool
> +arch_atomic_add_negative_relaxed(int i, atomic_t *v)
> +{
> +	return arch_atomic_add_return_relaxed(i, v) < 0;
> +}
> +#define arch_atomic_add_negative_relaxed arch_atomic_add_negative_relaxed
> +#endif
> +
> +#else /* arch_atomic_add_negative_relaxed */
> +
> +#ifndef arch_atomic_add_negative_acquire
> +static __always_inline bool
> +arch_atomic_add_negative_acquire(int i, atomic_t *v)
> +{
> +	bool ret = arch_atomic_add_negative_relaxed(i, v);
> +	__atomic_acquire_fence();
> +	return ret;
> +}
> +#define arch_atomic_add_negative_acquire arch_atomic_add_negative_acquire
> +#endif
> +
> +#ifndef arch_atomic_add_negative_release
> +static __always_inline bool
> +arch_atomic_add_negative_release(int i, atomic_t *v)
> +{
> +	__atomic_release_fence();
> +	return arch_atomic_add_negative_relaxed(i, v);
> +}
> +#define arch_atomic_add_negative_release arch_atomic_add_negative_release
> +#endif
> +
> +#ifndef arch_atomic_add_negative
> +static __always_inline bool
> +arch_atomic_add_negative(int i, atomic_t *v)
> +{
> +	bool ret;
> +	__atomic_pre_full_fence();
> +	ret = arch_atomic_add_negative_relaxed(i, v);
> +	__atomic_post_full_fence();
> +	return ret;
> +}
> +#define arch_atomic_add_negative arch_atomic_add_negative
> +#endif
> +
> +#endif /* arch_atomic_add_negative_relaxed */
> +
>  #ifndef arch_atomic_fetch_add_unless
>  /**
>   * arch_atomic_fetch_add_unless - add unless the number is already a given value
> @@ -2329,6 +2428,13 @@ arch_atomic64_inc_and_test(atomic64_t *v
>  #define arch_atomic64_inc_and_test arch_atomic64_inc_and_test
>  #endif
>  
> +#ifndef arch_atomic64_add_negative_relaxed
> +#ifdef arch_atomic64_add_negative
> +#define arch_atomic64_add_negative_acquire arch_atomic64_add_negative
> +#define arch_atomic64_add_negative_release arch_atomic64_add_negative
> +#define arch_atomic64_add_negative_relaxed arch_atomic64_add_negative
> +#endif /* arch_atomic64_add_negative */
> +
>  #ifndef arch_atomic64_add_negative
>  /**
>   * arch_atomic64_add_negative - add and test if negative
> @@ -2347,6 +2453,98 @@ arch_atomic64_add_negative(s64 i, atomic
>  #define arch_atomic64_add_negative arch_atomic64_add_negative
>  #endif
>  
> +#ifndef arch_atomic64_add_negative_acquire
> +/**
> + * arch_atomic64_add_negative - add and test if negative
> + * @i: integer value to add
> + * @v: pointer of type atomic64_t
> + *
> + * Atomically adds @i to @v and returns true
> + * if the result is negative, or false when
> + * result is greater than or equal to zero.
> + */
> +static __always_inline bool
> +arch_atomic64_add_negative_acquire(s64 i, atomic64_t *v)
> +{
> +	return arch_atomic64_add_return_acquire(i, v) < 0;
> +}
> +#define arch_atomic64_add_negative_acquire arch_atomic64_add_negative_acquire
> +#endif
> +
> +#ifndef arch_atomic64_add_negative_release
> +/**
> + * arch_atomic64_add_negative - add and test if negative
> + * @i: integer value to add
> + * @v: pointer of type atomic64_t
> + *
> + * Atomically adds @i to @v and returns true
> + * if the result is negative, or false when
> + * result is greater than or equal to zero.
> + */
> +static __always_inline bool
> +arch_atomic64_add_negative_release(s64 i, atomic64_t *v)
> +{
> +	return arch_atomic64_add_return_release(i, v) < 0;
> +}
> +#define arch_atomic64_add_negative_release arch_atomic64_add_negative_release
> +#endif
> +
> +#ifndef arch_atomic64_add_negative_relaxed
> +/**
> + * arch_atomic64_add_negative - add and test if negative
> + * @i: integer value to add
> + * @v: pointer of type atomic64_t
> + *
> + * Atomically adds @i to @v and returns true
> + * if the result is negative, or false when
> + * result is greater than or equal to zero.
> + */
> +static __always_inline bool
> +arch_atomic64_add_negative_relaxed(s64 i, atomic64_t *v)
> +{
> +	return arch_atomic64_add_return_relaxed(i, v) < 0;
> +}
> +#define arch_atomic64_add_negative_relaxed arch_atomic64_add_negative_relaxed
> +#endif
> +
> +#else /* arch_atomic64_add_negative_relaxed */
> +
> +#ifndef arch_atomic64_add_negative_acquire
> +static __always_inline bool
> +arch_atomic64_add_negative_acquire(s64 i, atomic64_t *v)
> +{
> +	bool ret = arch_atomic64_add_negative_relaxed(i, v);
> +	__atomic_acquire_fence();
> +	return ret;
> +}
> +#define arch_atomic64_add_negative_acquire arch_atomic64_add_negative_acquire
> +#endif
> +
> +#ifndef arch_atomic64_add_negative_release
> +static __always_inline bool
> +arch_atomic64_add_negative_release(s64 i, atomic64_t *v)
> +{
> +	__atomic_release_fence();
> +	return arch_atomic64_add_negative_relaxed(i, v);
> +}
> +#define arch_atomic64_add_negative_release arch_atomic64_add_negative_release
> +#endif
> +
> +#ifndef arch_atomic64_add_negative
> +static __always_inline bool
> +arch_atomic64_add_negative(s64 i, atomic64_t *v)
> +{
> +	bool ret;
> +	__atomic_pre_full_fence();
> +	ret = arch_atomic64_add_negative_relaxed(i, v);
> +	__atomic_post_full_fence();
> +	return ret;
> +}
> +#define arch_atomic64_add_negative arch_atomic64_add_negative
> +#endif
> +
> +#endif /* arch_atomic64_add_negative_relaxed */
> +
>  #ifndef arch_atomic64_fetch_add_unless
>  /**
>   * arch_atomic64_fetch_add_unless - add unless the number is already a given value
> @@ -2456,4 +2654,4 @@ arch_atomic64_dec_if_positive(atomic64_t
>  #endif
>  
>  #endif /* _LINUX_ATOMIC_FALLBACK_H */
> -// b5e87bdd5ede61470c29f7a7e4de781af3770f09
> +// 63bcc1a53125d4eca5e659892e10615f00f9abf8
> --- a/include/linux/atomic/atomic-instrumented.h
> +++ b/include/linux/atomic/atomic-instrumented.h
> @@ -592,6 +592,28 @@ atomic_add_negative(int i, atomic_t *v)
>  	return arch_atomic_add_negative(i, v);
>  }
>  
> +static __always_inline bool
> +atomic_add_negative_acquire(int i, atomic_t *v)
> +{
> +	instrument_atomic_read_write(v, sizeof(*v));
> +	return arch_atomic_add_negative_acquire(i, v);
> +}
> +
> +static __always_inline bool
> +atomic_add_negative_release(int i, atomic_t *v)
> +{
> +	kcsan_release();
> +	instrument_atomic_read_write(v, sizeof(*v));
> +	return arch_atomic_add_negative_release(i, v);
> +}
> +
> +static __always_inline bool
> +atomic_add_negative_relaxed(int i, atomic_t *v)
> +{
> +	instrument_atomic_read_write(v, sizeof(*v));
> +	return arch_atomic_add_negative_relaxed(i, v);
> +}
> +
>  static __always_inline int
>  atomic_fetch_add_unless(atomic_t *v, int a, int u)
>  {
> @@ -1211,6 +1233,28 @@ atomic64_add_negative(s64 i, atomic64_t
>  	return arch_atomic64_add_negative(i, v);
>  }
>  
> +static __always_inline bool
> +atomic64_add_negative_acquire(s64 i, atomic64_t *v)
> +{
> +	instrument_atomic_read_write(v, sizeof(*v));
> +	return arch_atomic64_add_negative_acquire(i, v);
> +}
> +
> +static __always_inline bool
> +atomic64_add_negative_release(s64 i, atomic64_t *v)
> +{
> +	kcsan_release();
> +	instrument_atomic_read_write(v, sizeof(*v));
> +	return arch_atomic64_add_negative_release(i, v);
> +}
> +
> +static __always_inline bool
> +atomic64_add_negative_relaxed(s64 i, atomic64_t *v)
> +{
> +	instrument_atomic_read_write(v, sizeof(*v));
> +	return arch_atomic64_add_negative_relaxed(i, v);
> +}
> +
>  static __always_inline s64
>  atomic64_fetch_add_unless(atomic64_t *v, s64 a, s64 u)
>  {
> @@ -1830,6 +1874,28 @@ atomic_long_add_negative(long i, atomic_
>  	return arch_atomic_long_add_negative(i, v);
>  }
>  
> +static __always_inline bool
> +atomic_long_add_negative_acquire(long i, atomic_long_t *v)
> +{
> +	instrument_atomic_read_write(v, sizeof(*v));
> +	return arch_atomic_long_add_negative_acquire(i, v);
> +}
> +
> +static __always_inline bool
> +atomic_long_add_negative_release(long i, atomic_long_t *v)
> +{
> +	kcsan_release();
> +	instrument_atomic_read_write(v, sizeof(*v));
> +	return arch_atomic_long_add_negative_release(i, v);
> +}
> +
> +static __always_inline bool
> +atomic_long_add_negative_relaxed(long i, atomic_long_t *v)
> +{
> +	instrument_atomic_read_write(v, sizeof(*v));
> +	return arch_atomic_long_add_negative_relaxed(i, v);
> +}
> +
>  static __always_inline long
>  atomic_long_fetch_add_unless(atomic_long_t *v, long a, long u)
>  {
> @@ -2083,4 +2149,4 @@ atomic_long_dec_if_positive(atomic_long_
>  })
>  
>  #endif /* _LINUX_ATOMIC_INSTRUMENTED_H */
> -// 764f741eb77a7ad565dc8d99ce2837d5542e8aee
> +// 1b485de9cbaa4900de59e14ee2084357eaeb1c3a
> --- a/include/linux/atomic/atomic-long.h
> +++ b/include/linux/atomic/atomic-long.h
> @@ -479,6 +479,24 @@ arch_atomic_long_add_negative(long i, at
>  	return arch_atomic64_add_negative(i, v);
>  }
>  
> +static __always_inline bool
> +arch_atomic_long_add_negative_acquire(long i, atomic_long_t *v)
> +{
> +	return arch_atomic64_add_negative_acquire(i, v);
> +}
> +
> +static __always_inline bool
> +arch_atomic_long_add_negative_release(long i, atomic_long_t *v)
> +{
> +	return arch_atomic64_add_negative_release(i, v);
> +}
> +
> +static __always_inline bool
> +arch_atomic_long_add_negative_relaxed(long i, atomic_long_t *v)
> +{
> +	return arch_atomic64_add_negative_relaxed(i, v);
> +}
> +
>  static __always_inline long
>  arch_atomic_long_fetch_add_unless(atomic_long_t *v, long a, long u)
>  {
> @@ -973,6 +991,24 @@ arch_atomic_long_add_negative(long i, at
>  	return arch_atomic_add_negative(i, v);
>  }
>  
> +static __always_inline bool
> +arch_atomic_long_add_negative_acquire(long i, atomic_long_t *v)
> +{
> +	return arch_atomic_add_negative_acquire(i, v);
> +}
> +
> +static __always_inline bool
> +arch_atomic_long_add_negative_release(long i, atomic_long_t *v)
> +{
> +	return arch_atomic_add_negative_release(i, v);
> +}
> +
> +static __always_inline bool
> +arch_atomic_long_add_negative_relaxed(long i, atomic_long_t *v)
> +{
> +	return arch_atomic_add_negative_relaxed(i, v);
> +}
> +
>  static __always_inline long
>  arch_atomic_long_fetch_add_unless(atomic_long_t *v, long a, long u)
>  {
> @@ -1011,4 +1047,4 @@ arch_atomic_long_dec_if_positive(atomic_
>  
>  #endif /* CONFIG_64BIT */
>  #endif /* _LINUX_ATOMIC_LONG_H */
> -// e8f0e08ff072b74d180eabe2ad001282b38c2c88
> +// a194c07d7d2f4b0e178d3c118c919775d5d65f50
> --- a/scripts/atomic/atomics.tbl
> +++ b/scripts/atomic/atomics.tbl
> @@ -33,7 +33,7 @@ try_cmpxchg		B	v	p:old	i:new
>  sub_and_test		b	i	v
>  dec_and_test		b	v
>  inc_and_test		b	v
> -add_negative		b	i	v
> +add_negative		B	i	v
>  add_unless		fb	v	i:a	i:u
>  inc_not_zero		b	v
>  inc_unless_negative	b	v
> --- a/scripts/atomic/fallbacks/add_negative
> +++ b/scripts/atomic/fallbacks/add_negative
> @@ -9,8 +9,8 @@ cat <<EOF
>   * result is greater than or equal to zero.
>   */
>  static __always_inline bool
> -arch_${atomic}_add_negative(${int} i, ${atomic}_t *v)
> +arch_${atomic}_add_negative${order}(${int} i, ${atomic}_t *v)
>  {
> -	return arch_${atomic}_add_return(i, v) < 0;
> +	return arch_${atomic}_add_return${order}(i, v) < 0;
>  }
>  EOF
> 
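
For readers without a kernel tree at hand, the two fallback shapes in the
quoted patch can be approximated in user space. This is a hedged sketch, not
the kernel implementation: it uses C11 <stdatomic.h>, assumes a 32-bit int,
and every model_* name is hypothetical. The first group mirrors the
"add_return${order}(i, v) < 0" template, the second group mirrors the branch
that builds acquire and release on top of a relaxed operation plus a fence.
C11 seq_cst is used as a rough stand-in for the kernel's fully ordered
variant, which is an approximation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Add @i to @v with ordering @mo and report whether the result is negative. */
static inline bool model_add_negative(int i, atomic_int *v, memory_order mo)
{
	int old = atomic_fetch_add_explicit(v, i, mo);

	/* Sum in unsigned arithmetic so a wrap is not signed overflow. */
	return (((uint32_t)old + (uint32_t)i) >> 31) != 0;
}

/* Variant 1: one function per ordering, same shape as the template. */
static inline bool model_add_negative_relaxed(int i, atomic_int *v)
{
	return model_add_negative(i, v, memory_order_relaxed);
}

static inline bool model_add_negative_acquire(int i, atomic_int *v)
{
	return model_add_negative(i, v, memory_order_acquire);
}

static inline bool model_add_negative_release(int i, atomic_int *v)
{
	return model_add_negative(i, v, memory_order_release);
}

static inline bool model_add_negative_full(int i, atomic_int *v)
{
	return model_add_negative(i, v, memory_order_seq_cst);
}

/* Variant 2: relaxed operation plus an explicit fence. */
static inline bool model_add_negative_acquire_fenced(int i, atomic_int *v)
{
	bool ret = model_add_negative_relaxed(i, v);

	atomic_thread_fence(memory_order_acquire);
	return ret;
}

static inline bool model_add_negative_release_fenced(int i, atomic_int *v)
{
	atomic_thread_fence(memory_order_release);
	return model_add_negative_relaxed(i, v);
}
```

All variants return the same value for the same inputs. They differ only in
the ordering guarantees, which is why rcuref can use the relaxed form on get()
and the release form on put().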


end of thread, newest message ~2023-03-22 13:21 UTC

Thread overview: 13+ messages
2023-03-07 12:57 [patch V2 0/4] net, refcount: Address dst_entry reference count scalability issues Thomas Gleixner
2023-03-07 12:57 ` [patch V2 1/4] net: dst: Prevent false sharing vs. dst_entry::__refcnt Thomas Gleixner
2023-03-15 20:36   ` Jakub Kicinski
2023-03-15 20:47     ` Eric Dumazet
2023-03-15 22:26     ` David Ahern
2023-03-07 12:57 ` [patch V2 2/4] atomics: Provide atomic_add_and_negative() variants Thomas Gleixner
2023-03-22 13:20   ` Mark Rutland
2023-03-07 12:57 ` [patch V2 3/4] atomics: Provide rcuref - scalable reference counting Thomas Gleixner
2023-03-09  8:35   ` Qiuxu Zhuo
2023-03-20 16:05     ` Thomas Gleixner
2023-03-07 12:57 ` [patch V2 4/4] net: dst: Switch to rcuref_t " Thomas Gleixner
2023-03-07 17:55   ` Linus Torvalds
2023-03-07 23:00     ` Thomas Gleixner
