netdev.vger.kernel.org archive mirror
* [patch 0/3] net, refcount: Address dst_entry reference count scalability issues
@ 2023-02-28 14:33 Thomas Gleixner
  2023-02-28 14:33 ` [patch 1/3] net: dst: Prevent false sharing vs. dst_entry::__refcnt Thomas Gleixner
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Thomas Gleixner @ 2023-02-28 14:33 UTC (permalink / raw)
  To: LKML
  Cc: Linus Torvalds, x86, Wangyang Guo, Arjan van De Ven,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier

Hi!

Wangyang and Arjan reported a bottleneck in the networking code related to
struct dst_entry::__refcnt. Performance tanks massively when concurrency on
a dst_entry increases.

This happens when there are a large number of connections to or from the
same IP address. The memtier benchmark when run on the same host as
memcached amplifies this massively. But even over real network connections
this issue can be observed at an obviously smaller scale (due to the
network bandwidth limitations in my setup, i.e. 1Gb).

There are two factors which make this reference count a scalability issue:

   1) False sharing

      dst_entry::__refcnt is located at offset 64 of dst_entry, which puts
      it into a separate cacheline vs. the read mostly members located at
      the beginning of the struct.

      That prevents false sharing vs. the struct members in the first 64
      bytes of the structure, but there is also

      	    dst_entry::lwtstate

      which is located after the reference count and in the same cache
      line. This member is read after a reference count has been acquired.

      The other problem is struct rtable, which embeds a struct dst_entry
      at offset 0. struct dst_entry has a size of 112 bytes, which means
      that the struct members of rtable which follow the dst member share
      the same cache line as dst_entry::__refcnt. Especially

      	  rtable::rt_genid

      is also read by the contexts which have a reference count acquired
      already.

      When dst_entry::__refcnt is incremented or decremented via an atomic
      operation these read accesses stall and contribute to the performance
      problem.

   2) atomic_inc_not_zero()

      A reference on dst_entry::__refcnt is acquired via
      atomic_inc_not_zero() and released via atomic_dec_return().

      atomic_inc_not_zero() is implemented via an atomic_try_cmpxchg() loop,
      which exposes O(N^2) behaviour under contention with N concurrent
      operations.

      Lightweight instrumentation exposed an average of 8!! retry loops per
      atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
      running concurrently on 112 CPUs.

      There is nothing which can be done to make atomic_inc_not_zero() more
      scalable.

The following series addresses these issues:

    1) Reorder and pad struct dst_entry to prevent the false sharing.

    2) Implement and use a reference count implementation which avoids the
       atomic_inc_not_zero() problem.

       It is slightly less performant in the case of the final 1 -> 0
       transition, but the deconstruction of these objects is a low
       frequency event. get()/put() pairs are in the hotpath and that's
       what this implementation optimizes for.

       The algorithm of this reference count is only suitable for RCU
       managed objects. Therefore it cannot replace the refcount_t
       algorithm, which is also based on atomic_inc_not_zero(), due to a
       subtle race condition related to the 1 -> 0 transition and the final
       verdict to mark the reference count dead. See details in patch 2/3.

       It might be just my lack of imagination which declares this to be
       impossible and I'd be happy to be proven wrong.

       As a bonus the new rcuref implementation provides underflow/overflow
       detection and mitigation while being performance wise on par with
       open coded atomic_inc_not_zero() / atomic_dec_return() pairs even in
       the non-contended case.

The combination of these two changes results in performance gains in micro
benchmarks and also localhost and networked memtier benchmarks talking to
memcached. It's hard to quantify the benchmark results as they depend
heavily on the micro-architecture and the number of concurrent operations.

The overall gain of both changes for localhost memtier ranges from 1.2X to
3.2X and is in the +2% to +5% range for networked operations on a 1Gb
connection.

A micro benchmark which enforces maximized concurrency shows a gain between
1.2X and 4.7X!!!

Obviously this is focussed on a particular problem and therefore needs to
be discussed in detail. It also requires wider testing outside of the cases
which this is focussed on.

Though the false sharing issue is obvious and should be addressed
independently of the more focussed reference count changes.

The series is also available from git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git

I want to say thanks to Wangyang who analyzed the issue and provided
the initial fix for the false sharing problem. Further thanks go to
Arjan, Peter, Marc, Will and Borislav for valuable input and providing
test results on machines which I do not have access to.

Thoughts?

Thanks,

	tglx
---
 include/linux/rcuref.h          |   89 +++++++++++
 include/linux/types.h           |    6 
 include/net/dst.h               |   21 ++
 include/net/sock.h              |    2 
 lib/Makefile                    |    2 
 lib/rcuref.c                    |  311 ++++++++++++++++++++++++++++++++++++++++
 net/bridge/br_nf_core.c         |    2 
 net/core/dst.c                  |   26 ---
 net/core/rtnetlink.c            |    2 
 net/ipv6/route.c                |    6 
 net/netfilter/ipvs/ip_vs_xmit.c |    4 
 11 files changed, 436 insertions(+), 35 deletions(-)


* [patch 1/3] net: dst: Prevent false sharing vs. dst_entry::__refcnt
  2023-02-28 14:33 [patch 0/3] net, refcount: Address dst_entry reference count scalability issues Thomas Gleixner
@ 2023-02-28 14:33 ` Thomas Gleixner
  2023-02-28 15:17   ` Eric Dumazet
  2023-02-28 14:33 ` [patch 2/3] atomics: Provide rcuref - scalable reference counting Thomas Gleixner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2023-02-28 14:33 UTC (permalink / raw)
  To: LKML
  Cc: Linus Torvalds, x86, Wangyang Guo, Arjan van De Ven,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier

From: Wangyang Guo <wangyang.guo@intel.com>

dst_entry::__refcnt is highly contended in scenarios where many connections
happen from and to the same IP. The reference count is an atomic_t, so the
reference count operations have to take the cache-line exclusive.

Aside from the unavoidable reference count contention there is another
significant problem which is caused by that: False sharing.

perf top identified two affected read accesses: dst_entry::lwtstate and
rtable::rt_genid.

dst_entry::__refcnt is located at offset 64 of dst_entry, which puts it into
a separate cacheline vs. the read mostly members located at the beginning
of the struct.

That prevents false sharing vs. the struct members in the first 64
bytes of the structure, but there is also

     dst_entry::lwtstate

which is located after the reference count and in the same cache line. This
member is read after a reference count has been acquired.

struct rtable embeds a struct dst_entry at offset 0. struct dst_entry has a
size of 112 bytes, which means that the struct members of rtable which
follow the dst member share the same cache line as dst_entry::__refcnt.
Especially

      	  rtable::rt_genid

is also read by the contexts which have a reference count acquired
already.

When dst_entry::__refcnt is incremented or decremented via an atomic
operation these read accesses stall.

This was found when analysing the memtier benchmark in 1:100 mode, which
amplifies the problem extremely.

Rearrange and pad the structure so that the lwtstate member is in the next
cache-line. This increases the struct size from 112 to 136 bytes on 64bit.

The resulting improvement depends on the micro-architecture and the number
of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached
benchmark.

[ tglx: Rearrange struct ]

Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
Signed-off-by: Arjan van De Ven <arjan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
---
 include/net/dst.h |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -69,15 +69,25 @@ struct dst_entry {
 #endif
 	int			__use;
 	unsigned long		lastuse;
-	struct lwtunnel_state   *lwtstate;
 	struct rcu_head		rcu_head;
 	short			error;
 	short			__pad;
 	__u32			tclassid;
 #ifndef CONFIG_64BIT
+	struct lwtunnel_state   *lwtstate;
 	atomic_t		__refcnt;	/* 32-bit offset 64 */
 #endif
 	netdevice_tracker	dev_tracker;
+#ifdef CONFIG_64BIT
+	/*
+	 * Ensure that lwtstate is not in the same cache line as __refcnt,
+	 * because that would lead to false sharing under high contention
+	 * of __refcnt. This also ensures that rtable::rt_genid is not
+	 * sharing the same cache-line.
+	 */
+	int			pad2[6];
+	struct lwtunnel_state   *lwtstate;
+#endif
 };
 
 struct dst_metrics {



* [patch 2/3] atomics: Provide rcuref - scalable reference counting
  2023-02-28 14:33 [patch 0/3] net, refcount: Address dst_entry reference count scalability issues Thomas Gleixner
  2023-02-28 14:33 ` [patch 1/3] net: dst: Prevent false sharing vs. dst_entry::__refcnt Thomas Gleixner
@ 2023-02-28 14:33 ` Thomas Gleixner
  2023-03-01  0:42   ` Linus Torvalds
  2023-02-28 14:33 ` [patch 3/3] net: dst: Switch to rcuref_t " Thomas Gleixner
  2023-02-28 15:07 ` [patch 0/3] net, refcount: Address dst_entry reference count scalability issues Eric Dumazet
  3 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2023-02-28 14:33 UTC (permalink / raw)
  To: LKML
  Cc: Linus Torvalds, x86, Wangyang Guo, Arjan Van De Ven, Will Deacon,
	Peter Zijlstra, Boqun Feng, Mark Rutland, Marc Zyngier,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev

atomic_t based reference counting, including refcount_t, uses
atomic_inc_not_zero() for acquiring a reference. atomic_inc_not_zero() is
implemented with an atomic_try_cmpxchg() loop. High contention of the
reference count leads to retry loops and scales badly. There is nothing to
improve on this implementation as the semantics have to be preserved.

Provide rcuref as a scalable alternative solution which is suitable for RCU
managed objects. Similar to refcount_t it comes with overflow and underflow
detection and mitigation.

rcuref treats the underlying atomic_t as an unsigned integer and partitions
this space into zones:

  0x00000000 - 0x7FFFFFFF	valid zone
  0x80000000 - 0xBFFFFFFF	saturation zone
  0xC0000000 - 0xFFFFFFFF	dead zone

rcuref_get() unconditionally increments the reference count with
atomic_fetch_add_relaxed(). rcuref_put() unconditionally decrements the
reference count with atomic_fetch_sub_relaxed().

This unconditional increment avoids the inc_not_zero() problem, but
requires a more complex implementation on the put() side when the count
drops from 1 to 0.

When this transition is detected, the put() slowpath attempts to mark the
reference count dead by setting it to the midpoint of the dead zone with a
single atomic_cmpxchg_release() operation. This operation can fail due to a
concurrent rcuref_get() elevating the reference count from 0 to 1.

If the unconditional increment in rcuref_get() hits a reference count which
is marked dead (or saturated) it will detect it after the fact and bring
back the reference count to the midpoint of the respective zone. The zones
provide enough tolerance to make it practically impossible to escape
from a zone.

The racy implementation of rcuref_put() requires protecting rcuref_put()
against a grace period ending, in order to prevent a subtle use after
free. As RCU is the only mechanism which can provide that protection, it
is not possible to replace the atomic_inc_not_zero() based implementation
of refcount_t with this scheme.

The final drop is slightly more expensive than the atomic_dec_return()
counterpart, but that's not the case which this is optimized for. The
optimization is for the high frequency get()/put() pairs and their
scalability.

The performance of an uncontended rcuref_get()/put() pair where the put()
is not dropping the last reference is still on par with the plain atomic
operations, while at the same time providing overflow and underflow
detection and mitigation.

The performance of rcuref compared to plain atomic_inc_not_zero() and
atomic_dec_return() based reference counting under contention:

 -  Micro benchmark: All CPUs running an increment/decrement loop on an
    elevated reference count, which means the 1 to 0 transition never
    happens.

    The performance gain depends on microarchitecture and the number of
    CPUs and has been observed in the range of 1.3X to 4.7X

 - Conversion of dst_entry::__refcnt to rcuref and testing with the
    localhost memtier/memcached benchmark. That benchmark shows the
    reference count contention prominently.
    
    The performance gain depends on microarchitecture and the number of
    CPUs and has been observed in the range of 1.1X to 2.6X over the
    previous fix for the false sharing issue vs. struct
    dst_entry::__refcnt.

    When memtier is run over a real 1Gb network connection, there is a
    small gain on top of the false sharing fix. The two changes combined
    result in a 2%-5% total gain for that networked test.

Reported-by: Wangyang Guo <wangyang.guo@intel.com>
Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
---
 include/linux/rcuref.h |   89 ++++++++++++++
 include/linux/types.h  |    6 
 lib/Makefile           |    2 
 lib/rcuref.c           |  311 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 407 insertions(+), 1 deletion(-)

--- /dev/null
+++ b/include/linux/rcuref.h
@@ -0,0 +1,89 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_RCUREF_H
+#define _LINUX_RCUREF_H
+
+#include <linux/atomic.h>
+#include <linux/bug.h>
+#include <linux/limits.h>
+#include <linux/lockdep.h>
+#include <linux/preempt.h>
+#include <linux/rcupdate.h>
+
+#define RCUREF_NOREF		0x00000000
+#define RCUREF_ONEREF		0x00000001
+#define RCUREF_MAXREF		0x7FFFFFFF
+#define RCUREF_SATURATED	0xA0000000
+#define RCUREF_RELEASED		0xC0000000
+#define RCUREF_DEAD		0xE0000000
+
+/**
+ * rcuref_init - Initialize a rcuref reference count with the given reference count
+ * @ref:	Pointer to the reference count
+ * @cnt:	The initial reference count typically '1'
+ */
+static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
+{
+	atomic_set(&ref->refcnt, cnt);
+}
+
+/**
+ * rcuref_read - Read the number of held reference counts of a rcuref
+ * @ref:	Pointer to the reference count
+ *
+ * Return: The number of held references (0 ... N)
+ */
+static inline unsigned int rcuref_read(rcuref_t *ref)
+{
+	unsigned int c = atomic_read(&ref->refcnt);
+
+	/* Return 0 if within the DEAD zone. */
+	return c >= RCUREF_RELEASED ? 0 : c;
+}
+
+extern __must_check bool rcuref_get_slowpath(rcuref_t *ref, unsigned int new);
+
+/**
+ * rcuref_get - Acquire one reference on a rcuref reference count
+ * @ref:	Pointer to the reference count
+ *
+ * Similar to atomic_inc_not_zero() but saturates at RCUREF_MAXREF.
+ *
+ * Provides no memory ordering, it is assumed the caller has guaranteed the
+ * object memory to be stable (RCU, etc.). It does provide a control dependency
+ * and thereby orders future stores. See documentation in lib/rcuref.c
+ *
+ * Return:
+ *	False if the attempt to acquire a reference failed. This happens
+ *	when the last reference has been put already
+ *
+ *	True if a reference was successfully acquired
+ */
+static inline __must_check bool rcuref_get(rcuref_t *ref)
+{
+	/*
+	 * Unconditionally increase the reference count. The saturation and
+	 * dead zones provide enough tolerance for this.
+	 */
+	unsigned int old = atomic_fetch_add_relaxed(1, &ref->refcnt);
+
+	/*
+	 * If the old value is less than RCUREF_MAXREF, this is a valid
+	 * reference.
+	 *
+	 * In case the original value was RCUREF_NOREF the above
+	 * unconditional increment raced with a concurrent put() operation
+	 * dropping the last reference. That racing put() operation
+	 * subsequently fails to mark the reference count dead because the
+	 * count is now elevated again and the concurrent caller is
+	 * therefore not allowed to deconstruct the object.
+	 */
+	if (likely(old < RCUREF_MAXREF))
+		return true;
+
+	/* Handle the cases inside the saturation and dead zones */
+	return rcuref_get_slowpath(ref, old);
+}
+
+extern __must_check bool rcuref_put(rcuref_t *ref);
+
+#endif
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -175,6 +175,12 @@ typedef struct {
 } atomic64_t;
 #endif
 
+typedef struct {
+	atomic_t refcnt;
+} rcuref_t;
+
+#define RCUREF_INIT(i)	{ .refcnt = ATOMIC_INIT(i) }
+
 struct list_head {
 	struct list_head *next, *prev;
 };
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -47,7 +47,7 @@ obj-y += bcd.o sort.o parser.o debug_loc
 	 list_sort.o uuid.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o rhashtable.o base64.o \
-	 once.o refcount.o usercopy.o errseq.o bucket_locks.o \
+	 once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
 	 generic-radix-tree.o
 obj-$(CONFIG_STRING_SELFTEST) += test_string.o
 obj-y += string_helpers.o
--- /dev/null
+++ b/lib/rcuref.c
@@ -0,0 +1,311 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * rcuref - A scalable reference count implementation for RCU managed objects
+ *
+ * rcuref is provided to replace open coded reference count implementations
+ * based on atomic_t. It protects explicitly RCU managed objects which can
+ * be visible even after the last reference has been dropped and the object
+ * is heading towards destruction.
+ *
+ * A common usage pattern is:
+ *
+ * get()
+ *	rcu_read_lock();
+ *	p = get_ptr();
+ *	if (p && !atomic_inc_not_zero(&p->refcnt))
+ *		p = NULL;
+ *	rcu_read_unlock();
+ *	return p;
+ *
+ * put()
+ *	if (!atomic_dec_return(&p->refcnt)) {
+ *		remove_ptr(p);
+ *		kfree_rcu(p, rcu);
+ *	}
+ *
+ * atomic_inc_not_zero() is implemented with a try_cmpxchg() loop which has
+ * O(N^2) behaviour under contention with N concurrent operations.
+ *
+ * rcuref uses atomic_fetch_add_relaxed() and atomic_fetch_sub_release()
+ * for the fast path, which scale better under contention.
+ *
+ * Why not refcount?
+ * =================
+ *
+ * In principle it should be possible to make refcount use the rcuref
+ * scheme, but the destruction race described below cannot be prevented
+ * unless the protected object is RCU managed.
+ *
+ * Theory of operation
+ * ===================
+ *
+ * rcuref uses an unsigned integer reference counter. As long as the
+ * counter value is greater than or equal to RCUREF_ONEREF and not larger
+ * than RCUREF_MAXREF the reference is alive:
+ *
+ * NOREF ONEREF   MAXREF             SATURATED             RELEASED      DEAD
+ * 0     1      0x7FFFFFFF 0x80000000 0xA0000000 0xBFFFFFFF 0xC0000000 0xE0000000 0xFFFFFFFF
+ * <---valid ------------> <-------saturation zone-------> <-----------dead zone---------->
+ *
+ * The get() and put() operations do unconditional increments and
+ * decrements. The result is checked after the operation. This optimizes
+ * for the fast path.
+ *
+ * If the reference count is saturated or dead, then the increments and
+ * decrements are not harmful as the reference count still stays in the
+ * respective zones and is always set back to SATURATED resp. DEAD. The
+ * zones have room for 2^28 racing operations in each direction, which
+ * makes it practically impossible to escape the zones.
+ *
+ * Once the last reference is dropped the reference count becomes
+ * RCUREF_NOREF which forces rcuref_put() into the slowpath operation. The
+ * slowpath then tries to set the reference count from RCUREF_NOREF to
+ * RCUREF_DEAD via a cmpxchg(). This opens a small window where a
+ * concurrent rcuref_get() can acquire the reference count and bring it
+ * back to RCUREF_ONEREF or even drop the reference again and mark it DEAD.
+ *
+ * If the cmpxchg() succeeds then a concurrent rcuref_get() will result in
+ * DEAD + 1, which is inside the dead zone. If that happens the reference
+ * count is put back to DEAD.
+ *
+ * The actual race is possible due to the unconditional increment and
+ * decrements in rcuref_get() and rcuref_put():
+ *
+ *	T1				T2
+ *	get()				put()
+ *					if (atomic_fetch_sub(1, &ref->refcnt) == ONEREF)
+ *		succeeds->			atomic_try_cmpxchg(&ref->refcnt, NOREF, DEAD);
+ *
+ *	old = atomic_fetch_add(1, &ref->refcnt);	<- Elevates refcount to DEAD + 1
+ *
+ * As @old observed by T1 is within the dead zone the T1 get() fails.
+ *
+ * Possible critical states:
+ *
+ *	Context Counter	References	Operation
+ *	T1	1	1		init()
+ *	T2	2	2		get()
+ *	T1	1	1		put()
+ *	T2      0	0		put() tries to mark dead
+ *	T1	1	1		get()
+ *	T2	1	1		put() mark dead fails
+ *	T1      0	0		put() tries to mark dead
+ *	T1    DEAD	0		put() mark dead succeeds
+ *	T2    DEAD+1	0		get() fails and puts it back to DEAD
+ *
+ * Of course there are more complex scenarios, but the above illustrates
+ * the working principle. The rest is left to the imagination of the
+ * reader.
+ *
+ * Deconstruction race
+ * ===================
+ *
+ * The release operation must be protected by prohibiting a grace period in
+ * order to prevent a possible use after free:
+ *
+ *	T1				T2
+ *	put()				get()
+ *	// ref->refcnt = ONEREF
+ *	if (atomic_fetch_sub(1, &ref->refcnt) > ONEREF)
+ *		return false;				<- Not taken
+ *
+ *	// ref->refcnt == NOREF
+ *	--> preemption
+ *					// Elevates ref->refcnt to ONEREF
+ *					if (atomic_fetch_add(1, &ref->refcnt) < MAXREF)
+ *						return true;			<- taken
+ *
+ *					if (put(&p->ref)) { <-- Succeeds
+ *						remove_pointer(p);
+ *						kfree_rcu(p, rcu);
+ *					}
+ *
+ *		RCU grace period ends, object is freed
+ *
+ *	atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);	<- UAF
+ *
+ * This is prevented by disabling preemption around the put() operation as
+ * that's in most kernel configurations cheaper than a rcu_read_lock() /
+ * rcu_read_unlock() pair and in many cases even a NOOP. In any case it
+ * prevents the grace period which keeps the object alive until all put()
+ * operations complete.
+ *
+ * Saturation protection
+ * =====================
+ *
+ * The reference count has a saturation limit RCUREF_MAXREF (INT_MAX).
+ * Once this is exceeded the reference count becomes stale by setting it
+ * to RCUREF_SATURATED, which will cause a memory leak, but it prevents
+ * wrap arounds which obviously cause worse problems than a memory
+ * leak. When saturation is reached a warning is emitted.
+ *
+ * Race conditions
+ * ===============
+ *
+ * All reference count increment/decrement operations are unconditional and
+ * only verified after the fact. This optimizes for the good case and takes
+ * the occasional race vs. a dead or already saturated refcount into
+ * account. The saturation and dead zones are large enough to accommodate
+ * for that.
+ *
+ * Memory ordering
+ * ===============
+ *
+ * Memory ordering rules are slightly relaxed wrt regular atomic_t functions
+ * and provide only what is strictly required for refcounts.
+ *
+ * The increments are fully relaxed; these will not provide ordering. The
+ * rationale is that whatever is used to obtain the object to increase the
+ * reference count on will provide the ordering. For locked data
+ * structures, it's the lock acquire, for RCU/lockless data structures it's
+ * the dependent load.
+ *
+ * rcuref_get() provides a control dependency ordering future stores which
+ * ensures that the object is not modified when acquiring a reference
+ * fails.
+ *
+ * rcuref_put() provides release order, i.e. all prior loads and stores
+ * will be issued before. It also provides a control dependency ordering
+ * against the subsequent destruction of the object.
+ *
+ * If rcuref_put() successfully dropped the last reference and marked the
+ * object DEAD it also provides acquire ordering.
+ */
+
+#include <linux/export.h>
+#include <linux/rcuref.h>
+
+/**
+ * rcuref_get_slowpath - Slowpath of rcuref_get()
+ * @ref:	Pointer to the reference count
+ * @old:	The reference count before the unconditional increment
+ *		operation in rcuref_get()
+ *
+ * Invoked when the reference count is outside of the valid zone.
+ *
+ * Return:
+ *	False if the reference count was already marked dead
+ *
+ *	True if the reference count is saturated, which prevents the
+ *	object from being deconstructed ever.
+ */
+bool rcuref_get_slowpath(rcuref_t *ref, unsigned int old)
+{
+	/*
+	 * If the reference count was already marked dead, undo the
+	 * increment so it stays in the middle of the dead zone and return
+	 * fail.
+	 */
+	if (old >= RCUREF_RELEASED) {
+		atomic_set(&ref->refcnt, RCUREF_DEAD);
+		return false;
+	}
+
+	/*
+	 * If it was saturated, warn and mark it so. In case the increment
+	 * was already on a saturated value restore the saturation
+	 * marker. This keeps it in the middle of the saturation zone and
+	 * prevents the reference count from overflowing. This leaks the
+	 * object memory, but prevents the obvious reference count overflow
+	 * damage.
+	 */
+	WARN_ONCE(old >= RCUREF_MAXREF, "rcuref saturated - leaking memory");
+	atomic_set(&ref->refcnt, RCUREF_SATURATED);
+	return true;
+}
+EXPORT_SYMBOL_GPL(rcuref_get_slowpath);
+
+static __must_check bool __rcuref_put(rcuref_t *ref)
+{
+	/*
+	 * Unconditionally decrement the reference count. The saturation and
+	 * dead zones provide enough tolerance for this.
+	 */
+	unsigned int old = atomic_fetch_sub_release(1, &ref->refcnt);
+
+	/*
+	 * If the old value is in the valid range and is greater than
+	 * RCUREF_ONEREF, nothing to do.
+	 */
+	if (likely(old > RCUREF_ONEREF && old <= RCUREF_MAXREF))
+		return false;
+
+	/* Did this drop the last reference? */
+	if (likely(old == RCUREF_ONEREF)) {
+		/*
+		 * Carefully try to set the reference count to RCUREF_DEAD.
+		 *
+		 * This can fail if a concurrent get() operation has
+		 * elevated it again or the corresponding put() even marked
+		 * it dead already. Both are valid situations and do not
+		 * require a retry. If this fails the caller is not
+		 * allowed to deconstruct the object.
+		 */
+		if (atomic_cmpxchg_release(&ref->refcnt, RCUREF_NOREF, RCUREF_DEAD) != RCUREF_NOREF)
+			return false;
+
+		/*
+		 * The caller can safely schedule the object for
+		 * deconstruction. Provide acquire ordering.
+		 */
+		smp_acquire__after_ctrl_dep();
+		return true;
+	}
+
+	/*
+	 * If the reference count was already in the dead zone, then this
+	 * put() operation is imbalanced. Warn, put the reference count back to
+	 * DEAD and tell the caller to not deconstruct the object.
+	 */
+	if (WARN_ONCE(old >= RCUREF_RELEASED, "rcuref - imbalanced put()")) {
+		atomic_set(&ref->refcnt, RCUREF_DEAD);
+		return false;
+	}
+
+	/*
+	 * This is a put() operation on a saturated refcount. Restore the
+	 * mean saturation value and tell the caller to not deconstruct the
+	 * object.
+	 */
+	atomic_set(&ref->refcnt, RCUREF_SATURATED);
+	return false;
+}
+
+/**
+ * rcuref_put -- Release one reference for a rcuref reference count
+ * @ref:	Pointer to the reference count
+ *
+ * Can be invoked from any context.
+ *
+ * Provides release memory ordering, such that prior loads and stores are done
+ * before, and provides an acquire ordering on success such that free()
+ * must come after.
+ *
+ * Return:
+ *
+ *	True if this was the last reference with no future references
+ *	possible. This signals the caller that it can safely schedule the
+ *	object, which is protected by the reference counter, for
+ *	deconstruction.
+ *
+ *	False if there are still active references or the put() raced
+ *	with a concurrent get()/put() pair. Caller is not allowed to
+ *	deconstruct the protected object.
+ */
+bool rcuref_put(rcuref_t *ref)
+{
+	bool released;
+
+	/*
+	 * Protect against a concurrent get()/put() pair which marks the
+	 * reference count DEAD and schedules it for RCU free. This
+	 * prevents a grace period and is cheaper than
+	 * rcu_read_lock()/unlock().
+	 */
+	preempt_disable();
+	released = __rcuref_put(ref);
+	preempt_enable();
+	return released;
+}
+EXPORT_SYMBOL_GPL(rcuref_put);



* [patch 3/3] net: dst: Switch to rcuref_t reference counting
  2023-02-28 14:33 [patch 0/3] net, refcount: Address dst_entry reference count scalability issues Thomas Gleixner
  2023-02-28 14:33 ` [patch 1/3] net: dst: Prevent false sharing vs. dst_entry::__refcnt Thomas Gleixner
  2023-02-28 14:33 ` [patch 2/3] atomics: Provide rcuref - scalable reference counting Thomas Gleixner
@ 2023-02-28 14:33 ` Thomas Gleixner
  2023-02-28 15:07 ` [patch 0/3] net, refcount: Address dst_entry reference count scalability issues Eric Dumazet
  3 siblings, 0 replies; 19+ messages in thread
From: Thomas Gleixner @ 2023-02-28 14:33 UTC (permalink / raw)
  To: LKML
  Cc: Linus Torvalds, x86, Wangyang Guo, Arjan Van De Ven,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier

Under high contention dst_entry::__refcnt becomes a significant bottleneck.

atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into
high retry rates on contention.

Switch the reference count to rcuref_t which results in a significant
performance gain.

The gain depends on the micro-architecture and the number of concurrent
operations and has been measured in the range of +25% to +130% with a
localhost memtier/memcached benchmark which amplifies the problem
massively.

Running the memtier/memcached benchmark over a real (1Gb) network
connection the conversion on top of the false sharing fix for struct
dst_entry::__refcnt results in a total gain in the 2%-5% range over the
upstream baseline.

Reported-by: Wangyang Guo <wangyang.guo@intel.com>
Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
---
 include/net/dst.h               |    9 +++++----
 include/net/sock.h              |    2 +-
 net/bridge/br_nf_core.c         |    2 +-
 net/core/dst.c                  |   26 +++++---------------------
 net/core/rtnetlink.c            |    2 +-
 net/ipv6/route.c                |    6 +++---
 net/netfilter/ipvs/ip_vs_xmit.c |    4 ++--
 7 files changed, 18 insertions(+), 33 deletions(-)

--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -16,6 +16,7 @@
 #include <linux/bug.h>
 #include <linux/jiffies.h>
 #include <linux/refcount.h>
+#include <linux/rcuref.h>
 #include <net/neighbour.h>
 #include <asm/processor.h>
 #include <linux/indirect_call_wrapper.h>
@@ -65,7 +66,7 @@ struct dst_entry {
 	 * input/output/ops or performance tanks badly
 	 */
 #ifdef CONFIG_64BIT
-	atomic_t		__refcnt;	/* 64-bit offset 64 */
+	rcuref_t		__refcnt;	/* 64-bit offset 64 */
 #endif
 	int			__use;
 	unsigned long		lastuse;
@@ -75,7 +76,7 @@ struct dst_entry {
 	__u32			tclassid;
 #ifndef CONFIG_64BIT
 	struct lwtunnel_state   *lwtstate;
-	atomic_t		__refcnt;	/* 32-bit offset 64 */
+	rcuref_t		__refcnt;	/* 32-bit offset 64 */
 #endif
 	netdevice_tracker	dev_tracker;
 #ifdef CONFIG_64BIT
@@ -238,7 +239,7 @@ static inline void dst_hold(struct dst_e
 	 * the placement of __refcnt in struct dst_entry
 	 */
 	BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
-	WARN_ON(atomic_inc_not_zero(&dst->__refcnt) == 0);
+	WARN_ON(!rcuref_get(&dst->__refcnt));
 }
 
 static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
@@ -302,7 +303,7 @@ static inline void skb_dst_copy(struct s
  */
 static inline bool dst_hold_safe(struct dst_entry *dst)
 {
-	return atomic_inc_not_zero(&dst->__refcnt);
+	return rcuref_get(&dst->__refcnt);
 }
 
 /**
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2131,7 +2131,7 @@ sk_dst_get(struct sock *sk)
 
 	rcu_read_lock();
 	dst = rcu_dereference(sk->sk_dst_cache);
-	if (dst && !atomic_inc_not_zero(&dst->__refcnt))
+	if (dst && !rcuref_get(&dst->__refcnt))
 		dst = NULL;
 	rcu_read_unlock();
 	return dst;
--- a/net/bridge/br_nf_core.c
+++ b/net/bridge/br_nf_core.c
@@ -73,7 +73,7 @@ void br_netfilter_rtable_init(struct net
 {
 	struct rtable *rt = &br->fake_rtable;
 
-	atomic_set(&rt->dst.__refcnt, 1);
+	rcuref_init(&rt->dst.__refcnt, 1);
 	rt->dst.dev = br->dev;
 	dst_init_metrics(&rt->dst, br_dst_default_metrics, true);
 	rt->dst.flags	= DST_NOXFRM | DST_FAKE_RTABLE;
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -66,7 +66,7 @@ void dst_init(struct dst_entry *dst, str
 	dst->tclassid = 0;
 #endif
 	dst->lwtstate = NULL;
-	atomic_set(&dst->__refcnt, initial_ref);
+	rcuref_init(&dst->__refcnt, initial_ref);
 	dst->__use = 0;
 	dst->lastuse = jiffies;
 	dst->flags = flags;
@@ -162,31 +162,15 @@ EXPORT_SYMBOL(dst_dev_put);
 
 void dst_release(struct dst_entry *dst)
 {
-	if (dst) {
-		int newrefcnt;
-
-		newrefcnt = atomic_dec_return(&dst->__refcnt);
-		if (WARN_ONCE(newrefcnt < 0, "dst_release underflow"))
-			net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
-					     __func__, dst, newrefcnt);
-		if (!newrefcnt)
-			call_rcu_hurry(&dst->rcu_head, dst_destroy_rcu);
-	}
+	if (dst && rcuref_put(&dst->__refcnt))
+		call_rcu_hurry(&dst->rcu_head, dst_destroy_rcu);
 }
 EXPORT_SYMBOL(dst_release);
 
 void dst_release_immediate(struct dst_entry *dst)
 {
-	if (dst) {
-		int newrefcnt;
-
-		newrefcnt = atomic_dec_return(&dst->__refcnt);
-		if (WARN_ONCE(newrefcnt < 0, "dst_release_immediate underflow"))
-			net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
-					     __func__, dst, newrefcnt);
-		if (!newrefcnt)
-			dst_destroy(dst);
-	}
+	if (dst && rcuref_put(&dst->__refcnt))
+		dst_destroy(dst);
 }
 EXPORT_SYMBOL(dst_release_immediate);
 
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -840,7 +840,7 @@ int rtnl_put_cacheinfo(struct sk_buff *s
 	if (dst) {
 		ci.rta_lastuse = jiffies_delta_to_clock_t(jiffies - dst->lastuse);
 		ci.rta_used = dst->__use;
-		ci.rta_clntref = atomic_read(&dst->__refcnt);
+		ci.rta_clntref = rcuref_read(&dst->__refcnt);
 	}
 	if (expires) {
 		unsigned long clock;
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -293,7 +293,7 @@ static const struct fib6_info fib6_null_
 
 static const struct rt6_info ip6_null_entry_template = {
 	.dst = {
-		.__refcnt	= ATOMIC_INIT(1),
+		.__refcnt	= RCUREF_INIT(1),
 		.__use		= 1,
 		.obsolete	= DST_OBSOLETE_FORCE_CHK,
 		.error		= -ENETUNREACH,
@@ -307,7 +307,7 @@ static const struct rt6_info ip6_null_en
 
 static const struct rt6_info ip6_prohibit_entry_template = {
 	.dst = {
-		.__refcnt	= ATOMIC_INIT(1),
+		.__refcnt	= RCUREF_INIT(1),
 		.__use		= 1,
 		.obsolete	= DST_OBSOLETE_FORCE_CHK,
 		.error		= -EACCES,
@@ -319,7 +319,7 @@ static const struct rt6_info ip6_prohibi
 
 static const struct rt6_info ip6_blk_hole_entry_template = {
 	.dst = {
-		.__refcnt	= ATOMIC_INIT(1),
+		.__refcnt	= RCUREF_INIT(1),
 		.__use		= 1,
 		.obsolete	= DST_OBSOLETE_FORCE_CHK,
 		.error		= -EINVAL,
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -339,7 +339,7 @@ static int
 			spin_unlock_bh(&dest->dst_lock);
 			IP_VS_DBG(10, "new dst %pI4, src %pI4, refcnt=%d\n",
 				  &dest->addr.ip, &dest_dst->dst_saddr.ip,
-				  atomic_read(&rt->dst.__refcnt));
+				  rcuref_read(&rt->dst.__refcnt));
 		}
 		if (ret_saddr)
 			*ret_saddr = dest_dst->dst_saddr.ip;
@@ -507,7 +507,7 @@ static int
 			spin_unlock_bh(&dest->dst_lock);
 			IP_VS_DBG(10, "new dst %pI6, src %pI6, refcnt=%d\n",
 				  &dest->addr.in6, &dest_dst->dst_saddr.in6,
-				  atomic_read(&rt->dst.__refcnt));
+				  rcuref_read(&rt->dst.__refcnt));
 		}
 		if (ret_saddr)
 			*ret_saddr = dest_dst->dst_saddr.in6;


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 0/3] net, refcount: Address dst_entry reference count scalability issues
  2023-02-28 14:33 [patch 0/3] net, refcount: Address dst_entry reference count scalability issues Thomas Gleixner
                   ` (2 preceding siblings ...)
  2023-02-28 14:33 ` [patch 3/3] net: dst: Switch to rcuref_t " Thomas Gleixner
@ 2023-02-28 15:07 ` Eric Dumazet
  2023-02-28 16:38   ` Thomas Gleixner
  3 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2023-02-28 15:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Linus Torvalds, x86, Wangyang Guo, Arjan van De Ven,
	David S. Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier

On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Hi!
>
> Wangyang and Arjan reported a bottleneck in the networking code related to
> struct dst_entry::__refcnt. Performance tanks massively when concurrency on
> a dst_entry increases.

We have per-cpu or per-tcp-socket dst though.

Input path is RCU and does not touch dst refcnt.

In real workloads (200Gbit NIC and above), we do not observe
contention on a dst refcnt.

So it would be nice to know in which cases you noticed these issues;
maybe there is something wrong in some layer.

>
> This happens when there are a large amount of connections to or from the
> same IP address. The memtier benchmark when run on the same host as
> memcached amplifies this massively. But even over real network connections
> this issue can be observed at an obviously smaller scale (due to the
> network bandwidth limitations in my setup, i.e. 1Gb).
>
> There are two factors which make this reference count a scalability issue:
>
>    1) False sharing
>
>       dst_entry::__refcnt is located at offset 64 of dst_entry, which puts
>       it into a separate cacheline vs. the read mostly members located at
>       the beginning of the struct.
>
>       That prevents false sharing vs. the struct members in the first 64
>       bytes of the structure, but there is also
>
>             dst_entry::lwtstate
>
>       which is located after the reference count and in the same cache
>       line. This member is read after a reference count has been acquired.
>
>       The other problem is struct rtable, which embeds a struct dst_entry
>       at offset 0. struct dst_entry has a size of 112 bytes, which means
>       that the struct members of rtable which follow the dst member share
>       the same cache line as dst_entry::__refcnt. Especially
>
>           rtable::rt_genid
>
>       is also read by the contexts which have a reference count acquired
>       already.
>
>       When dst_entry::__refcnt is incremented or decremented via an atomic
>       operation these read accesses stall and contribute to the performance
>       problem.

In our kernel profiles, we never saw dst->refcnt changes being a
serious problem.

>
>    2) atomic_inc_not_zero()
>
>       A reference on dst_entry:__refcnt is acquired via
>       atomic_inc_not_zero() and released via atomic_dec_return().
>
>       atomic_inc_not_zero() is implemented via an atomic_try_cmpxchg() loop,
>       which exposes O(N^2) behaviour under contention with N concurrent
>       operations.
>
>       Lightweight instrumentation exposed an average of 8!! retry loops per
>       atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
>       running concurrently on 112 CPUs.

User space benchmark <> kernel space.
And we tend not to use 112 CPUs for kernel stack processing.

Again, concurrent dst->refcnt changes are quite unlikely.

>
>       There is nothing which can be done to make atomic_inc_not_zero() more
>       scalable.
>
> The following series addresses these issues:
>
>     1) Reorder and pad struct dst_entry to prevent the false sharing.
>
>     2) Implement and use a reference count implementation which avoids the
>        atomic_inc_not_zero() problem.
>
>        It is slightly less performant in the case of the final 1 -> 0
>        transition, but the deconstruction of these objects is a low
>        frequency event. get()/put() pairs are in the hotpath and that's
>        what this implementation optimizes for.
>
>        The algorithm of this reference count is only suitable for RCU
>        managed objects. Therefore it cannot replace the refcount_t
>        algorithm, which is also based on atomic_inc_not_zero(), due to a
>        subtle race condition related to the 1 -> 0 transition and the final
>        verdict to mark the reference count dead. See details in patch 2/3.
>
>        It might be just my lack of imagination which declares this to be
>        impossible and I'd be happy to be proven wrong.
>
>        As a bonus the new rcuref implementation provides underflow/overflow
>        detection and mitigation while being performance wise on par with
>        open coded atomic_inc_not_zero() / atomic_dec_return() pairs even in
>        the non-contended case.
>
> The combination of these two changes results in performance gains in micro
> benchmarks and also localhost and networked memtier benchmarks talking to
> memcached. It's hard to quantify the benchmark results as they depend
> heavily on the micro-architecture and the number of concurrent operations.
>
> The overall gain of both changes for localhost memtier ranges from 1.2X to
> 3.2X and from +2% to +5% range for networked operations on a 1Gb connection.
>
> A micro benchmark which enforces maximized concurrency shows a gain between
> 1.2X and 4.7X!!!

Can you elaborate on what networking benchmark you have used,
and what actual gains you got?

In which path did access to dst->lwtstate prove to be a problem?

>
> Obviously this is focussed on a particular problem and therefore needs to
> be discussed in detail. It also requires wider testing outside of the cases
> which this is focussed on.
>
> Though the false sharing issue is obvious and should be addressed
> independent of the more focussed reference count changes.
>
> The series is also available from git:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git
>
> I want to say thanks to Wangyang who analyzed the issue and provided
> the initial fix for the false sharing problem. Further thanks go to
> Arjan Peter, Marc, Will and Borislav for valuable input and providing
> test results on machines which I do not have access to.
>
> Thoughts?

Initial feeling is that we need more details on the real workload.

Is it some degenerate case with one connected UDP socket used by
multiple threads?

To me, this looks like someone wanted to push a new piece of infra
(include/linux/rcuref.h)
and decided that dst->refcnt would be a perfect place.
Not the other way around (noticing there is an issue and asking
networking folks about it before designing a solution).

Thanks



>
> Thanks,
>
>         tglx
> ---
>  include/linux/rcuref.h          |   89 +++++++++++
>  include/linux/types.h           |    6
>  include/net/dst.h               |   21 ++
>  include/net/sock.h              |    2
>  lib/Makefile                    |    2
>  lib/rcuref.c                    |  311 ++++++++++++++++++++++++++++++++++++++++
>  net/bridge/br_nf_core.c         |    2
>  net/core/dst.c                  |   26 ---
>  net/core/rtnetlink.c            |    2
>  net/ipv6/route.c                |    6
>  net/netfilter/ipvs/ip_vs_xmit.c |    4
>  11 files changed, 436 insertions(+), 35 deletions(-)


* Re: [patch 1/3] net: dst: Prevent false sharing vs. dst_entry::__refcnt
  2023-02-28 14:33 ` [patch 1/3] net: dst: Prevent false sharing vs. dst_entry::__refcnt Thomas Gleixner
@ 2023-02-28 15:17   ` Eric Dumazet
  2023-02-28 21:31     ` Thomas Gleixner
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2023-02-28 15:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Linus Torvalds, x86, Wangyang Guo, Arjan van De Ven,
	David S. Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier

On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> From: Wangyang Guo <wangyang.guo@intel.com>
>
> dst_entry::__refcnt is highly contended in scenarios where many connections
> happen from and to the same IP. The reference count is an atomic_t, so the
> reference count operations have to take the cache-line exclusive.
>
> Aside from the unavoidable reference count contention there is another
> significant problem which is caused by that: False sharing.
>
> perf top identified two affected read accesses. dst_entry::lwtstate and
> rtable::rt_genid.
>
> dst_entry::__refcnt is located at offset 64 of dst_entry, which puts it into
> a separate cacheline vs. the read mostly members located at the beginning
> of the struct.

This will probably increase struct rt6_info past the 4 cache line size, right ?

It would be nice to allow sharing the 'hot' cache line with seldom used fields.

Instead of mere pads, add some unions, and let rt6i_uncached/rt6i_uncached_list
use them.


* Re: [patch 0/3] net, refcount: Address dst_entry reference count scalability issues
  2023-02-28 15:07 ` [patch 0/3] net, refcount: Address dst_entry reference count scalability issues Eric Dumazet
@ 2023-02-28 16:38   ` Thomas Gleixner
  2023-02-28 16:59     ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2023-02-28 16:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: LKML, Linus Torvalds, x86, Wangyang Guo, Arjan van De Ven,
	David S. Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier

Eric!

On Tue, Feb 28 2023 at 16:07, Eric Dumazet wrote:
> On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Hi!
>>
>> Wangyang and Arjan reported a bottleneck in the networking code related to
>> struct dst_entry::__refcnt. Performance tanks massively when concurrency on
>> a dst_entry increases.
>
> We have per-cpu or per-tcp-socket dst though.
>
> Input path is RCU and does not touch dst refcnt.
>
> In real workloads (200Gbit NIC and above), we do not observe
> contention on a dst refcnt.
>
> So it would be nice knowing in which case you noticed some issues,
> maybe there is something wrong in some layer.

Two lines further down I explained which benchmark was used, no?

>> This happens when there are a large amount of connections to or from the
>> same IP address. The memtier benchmark when run on the same host as
>> memcached amplifies this massively. But even over real network connections
>> this issue can be observed at an obviously smaller scale (due to the
>> network bandwidth limitations in my setup, i.e. 1Gb).
>>       atomic_inc_not_zero() is implemented via an atomic_try_cmpxchg() loop,
>>       which exposes O(N^2) behaviour under contention with N concurrent
>>       operations.
>>
>>       Lightweight instrumentation exposed an average of 8!! retry loops per
>>       atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
>>       running concurrently on 112 CPUs.
>
> User space benchmark <> kernel space.

I know that. The point was to illustrate the non-scalability.

> And we tend not to use 112 CPUs for kernel stack processing.
>
> Again, concurrent dst->refcnt changes are quite unlikely.

So unlikely that they stand out in that particular benchmark.

>> The overall gain of both changes for localhost memtier ranges from 1.2X to
>> 3.2X and from +2% to +5% range for networked operations on a 1Gb connection.
>>
>> A micro benchmark which enforces maximized concurrency shows a gain between
>> 1.2X and 4.7X!!!
>
> Can you elaborate on what networking benchmark you have used,
> and what actual gains you got ?

I'm happy to repeat here that it was memtier/memcached as I explained
more than once in the cover letter.

> In which path access to dst->lwtstate proved to be a problem ?

ip_finish_output2()
   if (lwtunnel_xmit_redirect(dst->lwtstate)) <- This read

> To me, this looks like someone wanted to push a new piece of infra
> (include/linux/rcuref.h)
> and decided that dst->refcnt would be a perfect place.
>
> Not the other way (noticing there is an issue, enquire networking
> folks about it, before designing a solution)

We looked at this because the reference count operations stood out in
perf top and we analyzed it down to the false sharing _and_ the
non-scalability of atomic_inc_not_zero().

That's what made me to think about a different implementation and yes,
the case at hand, which happens to be in the network code, allowed me to
verify that it actually works and scales better.

I'm terribly sorry that I failed to first ask network people for
permission to think. Won't happen again.

Thanks,

        tglx


* Re: [patch 0/3] net, refcount: Address dst_entry reference count scalability issues
  2023-02-28 16:38   ` Thomas Gleixner
@ 2023-02-28 16:59     ` Eric Dumazet
  2023-03-01  1:00       ` Thomas Gleixner
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2023-02-28 16:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Linus Torvalds, x86, Wangyang Guo, Arjan van De Ven,
	David S. Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier

On Tue, Feb 28, 2023 at 5:38 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Eric!
>
> On Tue, Feb 28 2023 at 16:07, Eric Dumazet wrote:
> > On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >> Hi!
> >>
> >> Wangyang and Arjan reported a bottleneck in the networking code related to
> >> struct dst_entry::__refcnt. Performance tanks massively when concurrency on
> >> a dst_entry increases.
> >
> > We have per-cpu or per-tcp-socket dst though.
> >
> > Input path is RCU and does not touch dst refcnt.
> >
> > In real workloads (200Gbit NIC and above), we do not observe
> > contention on a dst refcnt.
> >
> > So it would be nice knowing in which case you noticed some issues,
> > maybe there is something wrong in some layer.
>
> Two lines further down I explained which benchmark was used, no?
>
> >> This happens when there are a large amount of connections to or from the
> >> same IP address. The memtier benchmark when run on the same host as
> >> memcached amplifies this massively. But even over real network connections
> >> this issue can be observed at an obviously smaller scale (due to the
> >> network bandwidth limitations in my setup, i.e. 1Gb).
> >>       atomic_inc_not_zero() is implemented via an atomic_try_cmpxchg() loop,
> >>       which exposes O(N^2) behaviour under contention with N concurrent
> >>       operations.
> >>
> >>       Lightweight instrumentation exposed an average of 8!! retry loops per
> >>       atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
> >>       running concurrently on 112 CPUs.
> >
> > User space benchmark <> kernel space.
>
> I know that. The point was to illustrate the non-scalability.
>
> > And we tend not to use 112 CPUs for kernel stack processing.
> >
> > Again, concurrent dst->refcnt changes are quite unlikely.
>
> So unlikely that they stand out in that particular benchmark.
>
> >> The overall gain of both changes for localhost memtier ranges from 1.2X to
> >> 3.2X and from +2% to +5% range for networked operations on a 1Gb connection.
> >>
> >> A micro benchmark which enforces maximized concurrency shows a gain between
> >> 1.2X and 4.7X!!!
> >
> > Can you elaborate on what networking benchmark you have used,
> > and what actual gains you got ?
>
> I'm happy to repeat here that it was memtier/memcached as I explained
> more than once in the cover letter.
>
> > In which path access to dst->lwtstate proved to be a problem ?
>
> ip_finish_output2()
>    if (lwtunnel_xmit_redirect(dst->lwtstate)) <- This read

This change alone should be easy to measure; can you please do this?

Oftentimes, moving a field looks sane, but the cache line access is
simply done later.
For example when refcnt is changed :)

Making dsts one cache line bigger has a performance impact.

>
> > To me, this looks like someone wanted to push a new piece of infra
> > (include/linux/rcuref.h)
> > and decided that dst->refcnt would be a perfect place.
> >
> > Not the other way (noticing there is an issue, enquire networking
> > folks about it, before designing a solution)
>
> We looked at this because the reference count operations stood out in
> perf top and we analyzed it down to the false sharing _and_ the
> non-scalability of atomic_inc_not_zero().
>

Please share your recipe and perf results.

We must have been very lucky to not see this at Google.

tcp_rr workloads show dst_mtu() costs (60k GCU in the Google fleet),
which outweigh the dst refcnt costs you are mentioning here.


* Re: [patch 1/3] net: dst: Prevent false sharing vs. dst_entry::__refcnt
  2023-02-28 15:17   ` Eric Dumazet
@ 2023-02-28 21:31     ` Thomas Gleixner
  2023-03-01 23:12       ` Thomas Gleixner
  0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2023-02-28 21:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: LKML, Linus Torvalds, x86, Wangyang Guo, Arjan van De Ven,
	David S. Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier

Eric!

On Tue, Feb 28 2023 at 16:17, Eric Dumazet wrote:
> On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> perf top identified two affected read accesses. dst_entry::lwtstate and
>> rtable::rt_genid.
>>
>> dst_entry::__refcnt is located at offset 64 of dst_entry, which puts it into
>> a separate cacheline vs. the read mostly members located at the beginning
>> of the struct.
>
> This will probably increase struct rt6_info past the 4 cache line
> size, right ?

pahole says: /* size: 256, cachelines: 4, members: 11 */

> It would be nice to allow sharing the 'hot' cache line with seldom
> used fields.

Sure.

> Instead of mere pads, add some unions, and let rt6i_uncached/rt6i_uncached_list
> use them.

If I understand correctly, you suggest to move

   rt6_info::rt6i_uncached[_list], rtable::rt_uncached[_list]
   
into struct dst_entry and fixup the usage sites, right?

I don't see why that would need a union. dst_entry::rt_uncached[_list]
would work for both, no?

Thanks,

        tglx


* Re: [patch 2/3] atomics: Provide rcuref - scalable reference counting
  2023-02-28 14:33 ` [patch 2/3] atomics: Provide rcuref - scalable reference counting Thomas Gleixner
@ 2023-03-01  0:42   ` Linus Torvalds
  2023-03-01  1:07     ` Linus Torvalds
  2023-03-01 11:09     ` Thomas Gleixner
  0 siblings, 2 replies; 19+ messages in thread
From: Linus Torvalds @ 2023-03-01  0:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Wangyang Guo, Arjan Van De Ven, Will Deacon,
	Peter Zijlstra, Boqun Feng, Mark Rutland, Marc Zyngier,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev

On Tue, Feb 28, 2023 at 6:33 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> This unconditional increment avoids the inc_not_zero() problem, but
> requires a more complex implementation on the put() side when the count
> drops from 1 to 0.
>
> When this transition is detected then it is attempted to mark the reference
> count dead, by setting it to the midpoint of the dead zone with a single
> atomic_cmpxchg_release() operation. This operation can fail due to a
> concurrent rcuref_get() elevating the reference count from 0 to 1.

This looks sane to me. However, it does look like the code is not really optimal.

This is supposed to be a critical function, and is inlined:

> +static inline __must_check bool rcuref_get(rcuref_t *ref)
> +{
> +       unsigned int old = atomic_fetch_add_relaxed(1, &ref->refcnt);
> +
> +       if (likely(old < RCUREF_MAXREF))
> +               return true;

but that comparison would be much better if RCUREF_MAXREF was
0x80000000 and you'd end up just checking the sign of the result,
instead of checking big numbers.

Also, this optimal value choice ends up being architecture-specific,
since some do the "fetch_add", and others tend to prefer "add_return",
and so the point that is cheapest to check ends up depending on which
architecture it is.

This may seem like nit-picking, but I absolutely *HATE* our current
refcount interface for how absolutely horrid the code generation ends
up being. It's gotten better, but it's still not great.

So if we're introducing yet another refcount interface, and it's done
in the name of efficiency, I would *really* want it to actually be
exactly that: efficient. Not some half-way thing.

And yes, that may mean that it should have some architecture-specific
code (with fallback defaults for the generic case).

               Linus


* Re: [patch 0/3] net, refcount: Address dst_entry reference count scalability issues
  2023-02-28 16:59     ` Eric Dumazet
@ 2023-03-01  1:00       ` Thomas Gleixner
  2023-03-01  3:17         ` Jakub Kicinski
  0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2023-03-01  1:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: LKML, Linus Torvalds, x86, Wangyang Guo, Arjan van De Ven,
	David S. Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier

Eric!

On Tue, Feb 28 2023 at 17:59, Eric Dumazet wrote:
> On Tue, Feb 28, 2023 at 5:38 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> >>       Lightweight instrumentation exposed an average of 8!! retry loops per
>> >>       atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
>> >>       running concurrently on 112 CPUs.
>> >
>> > User space benchmark <> kernel space.
>>
>> I know that. The point was to illustrate the non-scalability.
>> >
>> > And we tend not to use 112 CPUs for kernel stack processing.

Just to be more clear: scalability of atomic_inc_not_zero() tanks
already with a small number of concurrent operations, as explained in
the cover letter and in the changelogs. The worst case:

  It exposes O(N^2) behaviour under contention with N concurrent
  operations.

The actual impact depends on the micro-architecture, but O(N^2) is bad
by definition, right? It tanks significantly way before 112 CPUs.

>> > In which path access to dst->lwtstate proved to be a problem ?
>>
>> ip_finish_output2()
>>    if (lwtunnel_xmit_redirect(dst->lwtstate)) <- This read
>
> This change alone should be easy to measure, please do this ?

Moving lwtstate out also moves rtable::rt_genid out into the next
cacheline. So it's not that easy to measure the impact of moving
lwtstate alone.

We measured the impact of changing struct dst_entry separately, as you
can see from the changelog of the actual patch [1/3]:

  "The resulting improvement depends on the micro-architecture and the number
   of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached
   benchmark."

It was very intentional _not_ to provide a single benchmark output to
boast about the improvement simply because the outcome differs
massively depending on the micro-architecture.

Here are 3 different contemporary machines to compare the impact of the
padding/reorder of dst_entry and the refcount.

memcached/memtier 1:100
                                A                       B                       C

atomic_t baseline		10702			 6156			 11993
atomic_t + padding		12505   = 1.16x		13994  = 2.27x		 24997  = 2.08x 
rcuref_t + padding		22888   = 2.13x		22080  = 3.58x		 25048  = 2.08x

atomic_t baseline		10702			 6156			 11993
rcuref_t			11843   = 1.10x		 9590  = 1.55x		 15434	= 1.28x  

atomic_t + padding baseline	12505			13994			 24997
rcuref_t + padding		22888   = 1.83x		22080  = 1.57x		 25048  = 1.01x

See?

I surely could have picked the most prominent of each and made a big
fuss about it, but if you read the changelogs carefully, I provided
range numbers which make this clear.

Here are perf top numbers from machine A with a cutoff at 1.00%:

  Baseline:

     14.24% [kernel] [k] __dev_queue_xmit
      7.12% [kernel] [k] ipv4_dst_check
      5.13% [kernel] [k] ip_finish_output2
      4.95% [kernel] [k] dst_release
      4.27% [kernel] [k] ip_rcv_finish_core.constprop.0

     35.71% SUBTOTAL

      3.86% [kernel] [k] ip_skb_dst_mtu
      3.85% [kernel] [k] ipv4_mtu
      3.58% [kernel] [k] raw_local_deliver
      2.44% [kernel] [k] tcp_v4_rcv
      1.45% [kernel] [k] tcp_ack
      1.32% [kernel] [k] skb_release_data
      1.23% [kernel] [k] _raw_spin_lock_irqsave

     53.44% TOTAL
         
  Padding:    
     
     24.21% [kernel] [k] __dev_queue_xmit
      7.75% [kernel] [k] dst_release

     31.96% SUBTOTAL

      1.96% [kernel] [k] tcp_v4_rcv
      1.88% [kernel] [k] tcp_current_mss
      1.87% [kernel] [k] __sk_dst_check
      1.84% [kernel] [k] tcp_ack_update_rtt
      1.83% [kernel] [k] _raw_spin_lock_irqsave
      1.83% [kernel] [k] skb_release_data
      1.83% [kernel] [k] ip_rcv_finish_core.constprop.0
      1.70% [kernel] [k] __tcp_ack_snd_check
      1.67% [kernel] [k] tcp_v4_do_rcv
      1.39% [kernel] [k] tcp_ack
      1.20% [kernel] [k] __fget_light.part.0
      1.00% [kernel] [k] _raw_spin_lock_bh

     51.96% TOTAL
         
  Padding + rcuref:
     
      9.23% [kernel] [k] __dev_queue_xmit
      8.81% [kernel] [k] rcuref_put
      2.22% [kernel] [k] tcp_ack
      1.86% [kernel] [k] ip_rcv_finish_core.constprop.0

     18.04% SUBTOTAL

      1.86% [kernel] [k] tcp_v4_rcv
      1.72% [kernel] [k] tcp_current_mss
      1.69% [kernel] [k] skb_release_data
      1.58% [kernel] [k] tcp_ack_update_rtt
      1.56% [kernel] [k] __sk_dst_check
      1.50% [kernel] [k] _raw_spin_lock_irqsave
      1.40% [kernel] [k] __tcp_ack_snd_check
      1.40% [kernel] [k] tcp_v4_do_rcv
      1.39% [kernel] [k] __fget_light.part.0
      1.23% [kernel] [k] tcp_sendmsg_locked
      1.19% [kernel] [k] _raw_spin_lock_bh
      1.17% [kernel] [k] tcp_recvmsg_locked
      1.15% [kernel] [k] ipv4_mtu
      1.14% [kernel] [k] __inet_lookup_established
      1.10% [kernel] [k] ip_skb_dst_mtu
      1.02% [kernel] [k] tcp_rcv_established
      1.02% [kernel] [k] tcp_poll

     45.24% TOTAL

All the outstanding numbers above each SUBTOTAL are related to the
refcnt issues.

As you can see from the above numbers of the three example machines, the
relevant perf output would be totally different, but still correlated to
both the false sharing and the refcount performance.

> Oftentimes, moving a field looks sane, but the cache line access is
> simply done later.
> For example when refcnt is changed :)

Sorry, I can't decode what you are trying to tell me here.

> Making dsts one cache line bigger has a performance impact.

Sure. I'm not claiming that this is completely free. Whether it really
matters is a different story and that's why we are debating this, right?

>> We looked at this because the reference count operations stood out in
>> perf top and we analyzed it down to the false sharing _and_ the
>> non-scalability of atomic_inc_not_zero().
>>
>
> Please share your recipe and perf results.

Sorry for not being explicit enough about this, but I was under the
impression that explicitly mentioning memcached and memtier would be
enough of a hint for people familiar with this matter.

Run memcached with -t $N and memtier_benchmark with -t $M and
--ratio=1:100 on the same machine. Localhost connections obviously
amplify the problem.

Start with the defaults for $N and $M and increase them. Depending on
your machine this will tank at some point. But even in reasonably small
$N, $M scenarios the refcount operations and the resulting false sharing
fallout become visible in perf top. At some point it becomes the
dominating issue while the machine still has capacity...

> We must have been very lucky to not see this at Google.

There _is_ a world outside of Google? :)

Seriously. The point is that even if you @google cannot observe this as
a major issue and it just gives your use case a minimal 0.X gain, it
still is contributing to the overall performance, no?

I surely have better things to do than pushing the next newfangled
infrastructure just because.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 2/3] atomics: Provide rcuref - scalable reference counting
  2023-03-01  0:42   ` Linus Torvalds
@ 2023-03-01  1:07     ` Linus Torvalds
  2023-03-01 11:09     ` Thomas Gleixner
  1 sibling, 0 replies; 19+ messages in thread
From: Linus Torvalds @ 2023-03-01  1:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Wangyang Guo, Arjan Van De Ven, Will Deacon,
	Peter Zijlstra, Boqun Feng, Mark Rutland, Marc Zyngier,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev

On Tue, Feb 28, 2023 at 4:42 PM Linus Torvalds
<torvalds@linuxfoundation.org> wrote:
>
> And yes, that may mean that it should have some architecture-specific
> code (with fallback defaults for the generic case).

Another reason for architecture-specific code is that anybody who
doesn't have atomics and just relies on an LL/SC model is actually
better off *not* having any of this complexity.

In fact, the Intel RAO instruction set would likely do that on x86
too. With that alleged future "CMPccXADD", there likely is no longer
any advantage to this model of rcuref.

Now, I don't know when - if ever - said RAO instruction set extension
comes, but I'd hope that the new scalable reference counting would be
ready for it.

             Linus


* Re: [patch 0/3] net, refcount: Address dst_entry reference count scalability issues
  2023-03-01  1:00       ` Thomas Gleixner
@ 2023-03-01  3:17         ` Jakub Kicinski
  2023-03-01 10:40           ` Thomas Gleixner
  0 siblings, 1 reply; 19+ messages in thread
From: Jakub Kicinski @ 2023-03-01  3:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Eric Dumazet, LKML, Linus Torvalds, x86, Wangyang Guo,
	Arjan van De Ven, David S. Miller, Paolo Abeni, netdev,
	Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier

FWIW looks good to me, especially the refcount part.
We do see 10% jitter in microbenchmarks due to random cache
effects, so forgive the questioning. But again, the refcount seems
like an obvious win to my noob eyes.

While I have you it would be remiss of me not to mention my ksoftirq
change which makes a large difference in production workloads:
https://lore.kernel.org/all/20221222221244.1290833-3-kuba@kernel.org/
Is Peter's "rework" of softirq going in for 6.3?

On Wed, 01 Mar 2023 02:00:20 +0100 Thomas Gleixner wrote:
> >> We looked at this because the reference count operations stood out in
> >> perf top and we analyzed it down to the false sharing _and_ the
> >> non-scalability of atomic_inc_not_zero().
> >
> > Please share your recipe and perf results.  
> 
> Sorry for not being explicit enough about this, but I was under the
> impression that explicitly mentioning memcached and memtier would be
> enough of a hint for people familiar with this matter.

I think the disconnect may be that we are indeed familiar with 
the workloads, but these exact workloads don't hit the issue
in production (I don't work at Google but a similarly large corp).
My initial reaction was also to see if I could find the issue in prod.
Not to question the findings, but in the hope that I could indeed find
a repro and make this series an easy sell.

> Run memcached with -t $N and memtier_benchmark with -t $M and
> --ratio=1:100 on the same machine. localhost connections obviously
> amplify the problem,
> 
> Start with the defaults for $N and $M and increase them. Depending on
> your machine this will tank at some point. But even in reasonably small
> $N, $M scenarios the refcount operations and the resulting false sharing
> fallout becomes visible in perf top. At some point it becomes the
> dominating issue while the machine still has capacity...
> 
> > We must have been very lucky to not see this at Google.  
> 
> There _is_ a world outside of Google? :)
> 
> Seriously. The point is that even if you @google cannot observe this as
> a major issue and it just gives your use case a minimal 0.X gain, it
> still is contributing to the overall performance, no?


* Re: [patch 0/3] net, refcount: Address dst_entry reference count scalability issues
  2023-03-01  3:17         ` Jakub Kicinski
@ 2023-03-01 10:40           ` Thomas Gleixner
  0 siblings, 0 replies; 19+ messages in thread
From: Thomas Gleixner @ 2023-03-01 10:40 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Eric Dumazet, LKML, Linus Torvalds, x86, Wangyang Guo,
	Arjan van De Ven, David S. Miller, Paolo Abeni, netdev,
	Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier

On Tue, Feb 28 2023 at 19:17, Jakub Kicinski wrote:
> FWIW looks good to me, especially the refcount part.
> We do see 10% of jitter in microbenchmarks due to random cache
> effects, so forgive the questioning.

Yes, those things are hard to measure. 

> But again, the refcount seems like an obvious win to my noob eyes.
>
> While I have you it would be remiss of me not to mention my ksoftirq
> change which makes a large difference in production workloads:
> https://lore.kernel.org/all/20221222221244.1290833-3-kuba@kernel.org/

Let me find this again.

> Is Peter's "rework" of softirq going in for 6.3?

Not that I'm aware of. That had a few loose ends and got lost in space
AFAICT.

Thanks,

        tglx


* Re: [patch 2/3] atomics: Provide rcuref - scalable reference counting
  2023-03-01  0:42   ` Linus Torvalds
  2023-03-01  1:07     ` Linus Torvalds
@ 2023-03-01 11:09     ` Thomas Gleixner
  2023-03-02  1:05       ` Thomas Gleixner
  1 sibling, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2023-03-01 11:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, x86, Wangyang Guo, Arjan Van De Ven, Will Deacon,
	Peter Zijlstra, Boqun Feng, Mark Rutland, Marc Zyngier,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev

On Tue, Feb 28 2023 at 16:42, Linus Torvalds wrote:
> On Tue, Feb 28, 2023 at 6:33 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> This may seem like nit-picking, but I absolutely *HATE* our current
> refcount interface for how absolutely horrid the code generation ends
> up being. It's gotten better, but it's still not great.
>
> So if we're introducing yet another refcount interface, and it's done
> in the name of efficiency, I would *really* want it to actually be
> exactly that: efficient. Not some half-way thing.
>
> And yes, that may mean that it should have some architecture-specific
> code (with fallback defaults for the generic case).

Let me stare at that some more.

Thanks,

        tglx


* Re: [patch 1/3] net: dst: Prevent false sharing vs. dst_entry::__refcnt
  2023-02-28 21:31     ` Thomas Gleixner
@ 2023-03-01 23:12       ` Thomas Gleixner
  0 siblings, 0 replies; 19+ messages in thread
From: Thomas Gleixner @ 2023-03-01 23:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: LKML, Linus Torvalds, x86, Wangyang Guo, Arjan van De Ven,
	David S. Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland,
	Marc Zyngier

On Tue, Feb 28 2023 at 22:31, Thomas Gleixner wrote:
> On Tue, Feb 28 2023 at 16:17, Eric Dumazet wrote:
>> Instead of mere pads, add some unions, and let rt6i_uncached/rt6i_uncached_list
>> use them.
>
> If I understand correctly, you suggest to move
>
>    rt6_info::rt6i_uncached[_list], rtable::rt_uncached[_list]
>    
> into struct dst_entry and fixup the usage sites, right?
>
> I don't see why that would need a union. dst_entry::rt_uncached[_list]
> would work for both, no?

So I came up with the below.

rt6_info shrinks from 232 to 224 bytes. rtable size is unchanged

Thanks,

        tglx
---

--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -69,15 +69,28 @@ struct dst_entry {
 #endif
 	int			__use;
 	unsigned long		lastuse;
-	struct lwtunnel_state   *lwtstate;
 	struct rcu_head		rcu_head;
 	short			error;
 	short			__pad;
 	__u32			tclassid;
 #ifndef CONFIG_64BIT
+	struct lwtunnel_state   *lwtstate;
 	atomic_t		__refcnt;	/* 32-bit offset 64 */
 #endif
 	netdevice_tracker	dev_tracker;
+
+	/*
+	 * Used by rtable and rt6_info. Moves lwtstate into the next cache
+	 * line on 64bit so that lwtstate does not cause false sharing with
+	 * __refcnt under contention of __refcnt. This also puts the
+	 * frequently accessed members of rtable and rt6_info out of the
+	 * __refcnt cache line.
+	 */
+	struct list_head	rt_uncached;
+	struct uncached_list	*rt_uncached_list;
+#ifdef CONFIG_64BIT
+	struct lwtunnel_state   *lwtstate;
+#endif
 };
 
 struct dst_metrics {
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -217,9 +217,6 @@ struct rt6_info {
 	struct inet6_dev		*rt6i_idev;
 	u32				rt6i_flags;
 
-	struct list_head		rt6i_uncached;
-	struct uncached_list		*rt6i_uncached_list;
-
 	/* more non-fragment space at head required */
 	unsigned short			rt6i_nfheader_len;
 };
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -104,7 +104,7 @@ static inline struct dst_entry *ip6_rout
 static inline void ip6_rt_put_flags(struct rt6_info *rt, int flags)
 {
 	if (!(flags & RT6_LOOKUP_F_DST_NOREF) ||
-	    !list_empty(&rt->rt6i_uncached))
+	    !list_empty(&rt->dst.rt_uncached))
 		ip6_rt_put(rt);
 }
 
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -81,9 +81,6 @@ struct rtable {
 	/* Miscellaneous cached information */
 	u32			rt_mtu_locked:1,
 				rt_pmtu:31;
-
-	struct list_head	rt_uncached;
-	struct uncached_list	*rt_uncached_list;
 };
 
 static inline bool rt_is_input_route(const struct rtable *rt)
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1508,20 +1508,20 @@ void rt_add_uncached_list(struct rtable
 {
 	struct uncached_list *ul = raw_cpu_ptr(&rt_uncached_list);
 
-	rt->rt_uncached_list = ul;
+	rt->dst.rt_uncached_list = ul;
 
 	spin_lock_bh(&ul->lock);
-	list_add_tail(&rt->rt_uncached, &ul->head);
+	list_add_tail(&rt->dst.rt_uncached, &ul->head);
 	spin_unlock_bh(&ul->lock);
 }
 
 void rt_del_uncached_list(struct rtable *rt)
 {
-	if (!list_empty(&rt->rt_uncached)) {
-		struct uncached_list *ul = rt->rt_uncached_list;
+	if (!list_empty(&rt->dst.rt_uncached)) {
+		struct uncached_list *ul = rt->dst.rt_uncached_list;
 
 		spin_lock_bh(&ul->lock);
-		list_del_init(&rt->rt_uncached);
+		list_del_init(&rt->dst.rt_uncached);
 		spin_unlock_bh(&ul->lock);
 	}
 }
@@ -1546,13 +1546,13 @@ void rt_flush_dev(struct net_device *dev
 			continue;
 
 		spin_lock_bh(&ul->lock);
-		list_for_each_entry_safe(rt, safe, &ul->head, rt_uncached) {
+		list_for_each_entry_safe(rt, safe, &ul->head, dst.rt_uncached) {
 			if (rt->dst.dev != dev)
 				continue;
 			rt->dst.dev = blackhole_netdev;
 			netdev_ref_replace(dev, blackhole_netdev,
 					   &rt->dst.dev_tracker, GFP_ATOMIC);
-			list_move(&rt->rt_uncached, &ul->quarantine);
+			list_move(&rt->dst.rt_uncached, &ul->quarantine);
 		}
 		spin_unlock_bh(&ul->lock);
 	}
@@ -1644,7 +1644,7 @@ struct rtable *rt_dst_alloc(struct net_d
 		rt->rt_uses_gateway = 0;
 		rt->rt_gw_family = 0;
 		rt->rt_gw4 = 0;
-		INIT_LIST_HEAD(&rt->rt_uncached);
+		INIT_LIST_HEAD(&rt->dst.rt_uncached);
 
 		rt->dst.output = ip_output;
 		if (flags & RTCF_LOCAL)
@@ -1675,7 +1675,7 @@ struct rtable *rt_dst_clone(struct net_d
 			new_rt->rt_gw4 = rt->rt_gw4;
 		else if (rt->rt_gw_family == AF_INET6)
 			new_rt->rt_gw6 = rt->rt_gw6;
-		INIT_LIST_HEAD(&new_rt->rt_uncached);
+		INIT_LIST_HEAD(&new_rt->dst.rt_uncached);
 
 		new_rt->dst.input = rt->dst.input;
 		new_rt->dst.output = rt->dst.output;
@@ -2859,7 +2859,7 @@ struct dst_entry *ipv4_blackhole_route(s
 		else if (rt->rt_gw_family == AF_INET6)
 			rt->rt_gw6 = ort->rt_gw6;
 
-		INIT_LIST_HEAD(&rt->rt_uncached);
+		INIT_LIST_HEAD(&rt->dst.rt_uncached);
 	}
 
 	dst_release(dst_orig);
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -91,7 +91,7 @@ static int xfrm4_fill_dst(struct xfrm_ds
 		xdst->u.rt.rt_gw6 = rt->rt_gw6;
 	xdst->u.rt.rt_pmtu = rt->rt_pmtu;
 	xdst->u.rt.rt_mtu_locked = rt->rt_mtu_locked;
-	INIT_LIST_HEAD(&xdst->u.rt.rt_uncached);
+	INIT_LIST_HEAD(&xdst->u.rt.dst.rt_uncached);
 	rt_add_uncached_list(&xdst->u.rt);
 
 	return 0;
@@ -121,7 +121,7 @@ static void xfrm4_dst_destroy(struct dst
 	struct xfrm_dst *xdst = (struct xfrm_dst *)dst;
 
 	dst_destroy_metrics_generic(dst);
-	if (xdst->u.rt.rt_uncached_list)
+	if (xdst->u.rt.dst.rt_uncached_list)
 		rt_del_uncached_list(&xdst->u.rt);
 	xfrm_dst_destroy(xdst);
 }
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -139,20 +139,20 @@ void rt6_uncached_list_add(struct rt6_in
 {
 	struct uncached_list *ul = raw_cpu_ptr(&rt6_uncached_list);
 
-	rt->rt6i_uncached_list = ul;
+	rt->dst.rt_uncached_list = ul;
 
 	spin_lock_bh(&ul->lock);
-	list_add_tail(&rt->rt6i_uncached, &ul->head);
+	list_add_tail(&rt->dst.rt_uncached, &ul->head);
 	spin_unlock_bh(&ul->lock);
 }
 
 void rt6_uncached_list_del(struct rt6_info *rt)
 {
-	if (!list_empty(&rt->rt6i_uncached)) {
-		struct uncached_list *ul = rt->rt6i_uncached_list;
+	if (!list_empty(&rt->dst.rt_uncached)) {
+		struct uncached_list *ul = rt->dst.rt_uncached_list;
 
 		spin_lock_bh(&ul->lock);
-		list_del_init(&rt->rt6i_uncached);
+		list_del_init(&rt->dst.rt_uncached);
 		spin_unlock_bh(&ul->lock);
 	}
 }
@@ -169,7 +169,7 @@ static void rt6_uncached_list_flush_dev(
 			continue;
 
 		spin_lock_bh(&ul->lock);
-		list_for_each_entry_safe(rt, safe, &ul->head, rt6i_uncached) {
+		list_for_each_entry_safe(rt, safe, &ul->head, dst.rt_uncached) {
 			struct inet6_dev *rt_idev = rt->rt6i_idev;
 			struct net_device *rt_dev = rt->dst.dev;
 			bool handled = false;
@@ -188,7 +188,7 @@ static void rt6_uncached_list_flush_dev(
 				handled = true;
 			}
 			if (handled)
-				list_move(&rt->rt6i_uncached,
+				list_move(&rt->dst.rt_uncached,
 					  &ul->quarantine);
 		}
 		spin_unlock_bh(&ul->lock);
@@ -334,7 +334,7 @@ static const struct rt6_info ip6_blk_hol
 static void rt6_info_init(struct rt6_info *rt)
 {
 	memset_after(rt, 0, dst);
-	INIT_LIST_HEAD(&rt->rt6i_uncached);
+	INIT_LIST_HEAD(&rt->dst.rt_uncached);
 }
 
 /* allocate dst with ip6_dst_ops */
@@ -2638,7 +2638,7 @@ struct dst_entry *ip6_route_output_flags
 	dst = ip6_route_output_flags_noref(net, sk, fl6, flags);
 	rt6 = (struct rt6_info *)dst;
 	/* For dst cached in uncached_list, refcnt is already taken. */
-	if (list_empty(&rt6->rt6i_uncached) && !dst_hold_safe(dst)) {
+	if (list_empty(&rt6->dst.rt_uncached) && !dst_hold_safe(dst)) {
 		dst = &net->ipv6.ip6_null_entry->dst;
 		dst_hold(dst);
 	}
@@ -2748,7 +2748,7 @@ INDIRECT_CALLABLE_SCOPE struct dst_entry
 	from = rcu_dereference(rt->from);
 
 	if (from && (rt->rt6i_flags & RTF_PCPU ||
-	    unlikely(!list_empty(&rt->rt6i_uncached))))
+	    unlikely(!list_empty(&rt->dst.rt_uncached))))
 		dst_ret = rt6_dst_from_check(rt, from, cookie);
 	else
 		dst_ret = rt6_check(rt, from, cookie);
@@ -6483,7 +6483,7 @@ static int __net_init ip6_route_net_init
 	net->ipv6.ip6_null_entry->dst.ops = &net->ipv6.ip6_dst_ops;
 	dst_init_metrics(&net->ipv6.ip6_null_entry->dst,
 			 ip6_template_metrics, true);
-	INIT_LIST_HEAD(&net->ipv6.ip6_null_entry->rt6i_uncached);
+	INIT_LIST_HEAD(&net->ipv6.ip6_null_entry->dst.rt_uncached);
 
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 	net->ipv6.fib6_has_custom_rules = false;
@@ -6495,7 +6495,7 @@ static int __net_init ip6_route_net_init
 	net->ipv6.ip6_prohibit_entry->dst.ops = &net->ipv6.ip6_dst_ops;
 	dst_init_metrics(&net->ipv6.ip6_prohibit_entry->dst,
 			 ip6_template_metrics, true);
-	INIT_LIST_HEAD(&net->ipv6.ip6_prohibit_entry->rt6i_uncached);
+	INIT_LIST_HEAD(&net->ipv6.ip6_prohibit_entry->dst.rt_uncached);
 
 	net->ipv6.ip6_blk_hole_entry = kmemdup(&ip6_blk_hole_entry_template,
 					       sizeof(*net->ipv6.ip6_blk_hole_entry),
@@ -6505,7 +6505,7 @@ static int __net_init ip6_route_net_init
 	net->ipv6.ip6_blk_hole_entry->dst.ops = &net->ipv6.ip6_dst_ops;
 	dst_init_metrics(&net->ipv6.ip6_blk_hole_entry->dst,
 			 ip6_template_metrics, true);
-	INIT_LIST_HEAD(&net->ipv6.ip6_blk_hole_entry->rt6i_uncached);
+	INIT_LIST_HEAD(&net->ipv6.ip6_blk_hole_entry->dst.rt_uncached);
 #ifdef CONFIG_IPV6_SUBTREES
 	net->ipv6.fib6_routes_require_src = 0;
 #endif
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -89,7 +89,7 @@ static int xfrm6_fill_dst(struct xfrm_ds
 	xdst->u.rt6.rt6i_gateway = rt->rt6i_gateway;
 	xdst->u.rt6.rt6i_dst = rt->rt6i_dst;
 	xdst->u.rt6.rt6i_src = rt->rt6i_src;
-	INIT_LIST_HEAD(&xdst->u.rt6.rt6i_uncached);
+	INIT_LIST_HEAD(&xdst->u.rt6.dst.rt_uncached);
 	rt6_uncached_list_add(&xdst->u.rt6);
 
 	return 0;
@@ -121,7 +121,7 @@ static void xfrm6_dst_destroy(struct dst
 	if (likely(xdst->u.rt6.rt6i_idev))
 		in6_dev_put(xdst->u.rt6.rt6i_idev);
 	dst_destroy_metrics_generic(dst);
-	if (xdst->u.rt6.rt6i_uncached_list)
+	if (xdst->u.rt6.dst.rt_uncached_list)
 		rt6_uncached_list_del(&xdst->u.rt6);
 	xfrm_dst_destroy(xdst);
 }


* Re: [patch 2/3] atomics: Provide rcuref - scalable reference counting
  2023-03-01 11:09     ` Thomas Gleixner
@ 2023-03-02  1:05       ` Thomas Gleixner
  2023-03-02  1:29         ` Randy Dunlap
  2023-03-02 19:36         ` Linus Torvalds
  0 siblings, 2 replies; 19+ messages in thread
From: Thomas Gleixner @ 2023-03-02  1:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, x86, Wangyang Guo, Arjan Van De Ven, Will Deacon,
	Peter Zijlstra, Boqun Feng, Mark Rutland, Marc Zyngier,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev

On Wed, Mar 01 2023 at 12:09, Thomas Gleixner wrote:
> On Tue, Feb 28 2023 at 16:42, Linus Torvalds wrote:
>> On Tue, Feb 28, 2023 at 6:33 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> And yes, that may mean that it should have some architecture-specific
>> code (with fallback defaults for the generic case).
>
> Let me stare at that some more.

So I went back to something which I tried first and dumped it because I
couldn't convince myself that it was correct. That first implementation
was actually incorrect, as I could not wrap my head around the dead race
UAF problem. I should have revisited it once I got that sorted. Duh!

The result of staring more is:

get():
    6b57:       f0 41 83 45 40 01       lock addl $0x1,0x40(%r13)
    6b5d:       0f 88 cd 00 00 00       js     6c30			// -> slowpath if negative

    Success

put(), PREEMPT=n or invoked from RCU safe code
     414:	f0 83 47 40 ff       	lock addl $0xffffffff,0x40(%rdi)
     419:	78 06                	js     421			// -> slowpath if negative

    not last reference (fast path)

put(), PREEMPT=y:

     574:	65 ff 05 00 00 00 00 	incl   %gs:0x0(%rip)		// preempt_disable()
     57b:	f0 83 47 40 ff       	lock addl $0xffffffff,0x40(%rdi)
     580:	0f 98 c0             	sets   %al			// safe result
     583:	78 2b                	js     5b0			// -> slowpath if negative
     585:	65 ff 0d 00 00 00 00 	decl   %gs:0x0(%rip)        	// preempt_enable()
     58c:	74 1b                	je     5a9			// -> preempt_schedule()
     58e:	84 c0                	test   %al,%al			// The actual result checked
     590:	75 06                	jne    598			// -> destruct object

    not last reference

The current code looks like this:

get():

    63b4:       41 8b 47 40             mov    0x40(%r15),%eax          // initial read
    63b8:       85 c0                   test   %eax,%eax	        // check for 0
    63ba:       0f 84 e9 00 00 00       je     64a9			// fail if 0
    63c0:       8d 50 01                lea    0x1(%rax),%edx		// + 1
    63c3:       f0 41 0f b1 57 40       lock cmpxchg %edx,0x40(%r15)	// try update
    63c9:       0f 94 44 24 07          sete   0x7(%rsp)		// store result
    63ce:       0f b6 4c 24 07          movzbl 0x7(%rsp),%ecx		// read it back !?!
    63d3:       84 c9                   test   %cl,%cl			// test for success
    63d5:       74 e1                   je     63b8			// repeat on fail

    Success

put(), w/o sanity checking:

     29a:	b8 ff ff ff ff          mov    $0xffffffff,%eax		// -1
     29f:	f0 0f c1 47 40          lock xadd %eax,0x40(%rdi)	// add
     2a4:	83 f8 01                cmp    $0x1,%eax		// check old == 1
     2a7:	74 05                   je     2ae			// slowpath destroy object

    Not last reference 

but the actual network code does some sanity checking:

     29a:	41 55                   push   %r13			// extra push
     29c:	b9 ff ff ff ff          mov    $0xffffffff,%ecx		// -1
     2a1:	41 54                   push   %r12			// extra push
     2a3:	49 89 fc                mov    %rdi,%r12		// extra save RDI
     2a6:	f0 0f c1 4f 40          lock xadd %ecx,0x40(%rdi)	// add
     2ab:	83 e9 01                sub    $0x1,%ecx		// new = old - 1
     2ae:	41 89 cd                mov    %ecx,%r13d		// extra save
     2b1:	78 24                   js     2d7			// slowpath underrun
     2b3:	74 09                   je     2be			// slowpath destroy object
     2b5:	41 5c                   pop    %r12			// extra pop
     2b7:	41 5d                   pop    %r13			// extra pop
     2b9:	e9 00 00 00 00          jmpq   2be			// not last reference (fast path)

Awesome, right?

I also thought about the newfangled CMPccXADD instruction. That will
need some macro wrappery to handle the alternative, and it gets rid of
the dead race in put(). There will be some ugly involved, but I'm sure
that this can be handled halfway sanely in common code.

All the ugly will be in the slowpath, which will still be there to
provide saturation and UAF detection/mitigation. The fast path will grow
in size as CMPccXADD is not one of the slim size instructions, but the
basic operating principle will still work out.

See the reworked patch below. This needs some atomic-fallback changes to
build. I force pushed the complete lot to:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rcuref

in case you want the full picture.

The main change is how the zones are defined. They are off by one
now. I'm glad I kept the defines of the initial version around. :)

The pathological test case showed a slight improvement in a quick test,
but I'm way too tired to say anything conclusive right now.

Thanks for nudging me!

        tglx
---
--- /dev/null
+++ b/include/linux/rcuref.h
@@ -0,0 +1,155 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_RCUREF_H
+#define _LINUX_RCUREF_H
+
+#include <linux/atomic.h>
+#include <linux/bug.h>
+#include <linux/limits.h>
+#include <linux/lockdep.h>
+#include <linux/preempt.h>
+#include <linux/rcupdate.h>
+
+#define RCUREF_ONEREF		0x00000000U
+#define RCUREF_MAXREF		0x7FFFFFFFU
+#define RCUREF_SATURATED	0xA0000000U
+#define RCUREF_RELEASED		0xC0000000U
+#define RCUREF_DEAD		0xE0000000U
+#define RCUREF_NOREF		0xFFFFFFFFU
+
+/**
+ * rcuref_init - Initialize a rcuref reference count with the given reference count
+ * @ref:	Pointer to the reference count
+ * @cnt:	The initial reference count typically '1'
+ */
+static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
+{
+	atomic_set(&ref->refcnt, cnt - 1);
+}
+
+/**
+ * rcuref_read - Read the number of held reference counts of a rcuref
+ * @ref:	Pointer to the reference count
+ *
+ * Return: The number of held references (0 ... N)
+ */
+static inline unsigned int rcuref_read(rcuref_t *ref)
+{
+	unsigned int c = atomic_read(&ref->refcnt);
+
+	/* Return 0 if within the DEAD zone. */
+	return c >= RCUREF_RELEASED ? 0 : c + 1;
+}
+
+extern __must_check bool rcuref_get_slowpath(rcuref_t *ref);
+
+/**
+ * rcuref_get - Acquire one reference on a rcuref reference count
+ * @ref:	Pointer to the reference count
+ *
+ * Similar to atomic_inc_not_zero() but saturates at RCUREF_MAXREF.
+ *
+ * Provides no memory ordering, it is assumed the caller has guaranteed the
+ * object memory to be stable (RCU, etc.). It does provide a control dependency
+ * and thereby orders future stores. See documentation in lib/rcuref.c
+ *
+ * Return:
+ *	False if the attempt to acquire a reference failed. This happens
+ *	when the last reference has been put already
+ *
+ *	True if a reference was successfully acquired
+ */
+static inline __must_check bool rcuref_get(rcuref_t *ref)
+{
+	/*
+	 * Unconditionally increase the reference count. The saturation and
+	 * dead zones provide enough tolerance for this.
+	 */
+	if (likely(!atomic_add_negative_relaxed(1, &ref->refcnt)))
+		return true;
+
+	/* Handle the cases inside the saturation and dead zones */
+	return rcuref_get_slowpath(ref);
+}
+
+extern __must_check bool rcuref_put_slowpath(rcuref_t *ref);
+
+/*
+ * Internal helper. Do not invoke directly.
+ */
+static __always_inline __must_check bool __rcuref_put(rcuref_t *ref)
+{
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held() && preemptible(),
+			 "suspicious rcuref_put_rcusafe() usage");
+	/*
+	 * Unconditionally decrease the reference count. The saturation and
+	 * dead zones provide enough tolerance for this.
+	 */
+	if (likely(!atomic_add_negative_release(-1, &ref->refcnt)))
+		return false;
+
+	/*
+	 * Handle the last reference drop and cases inside the saturation
+	 * and dead zones.
+	 */
+	return rcuref_put_slowpath(ref);
+}
+
+/**
+ * rcuref_put_rcusafe -- Release one reference for a rcuref reference count RCU safe
+ * @ref:	Pointer to the reference count
+ *
+ * Provides release memory ordering, such that prior loads and stores are done
+ * before, and provides an acquire ordering on success such that free()
+ * must come after.
+ *
+ * Can be invoked from contexts which guarantee that no grace period can
+ * happen which would free the object concurrently if the decrement drops
+ * the last reference and the slowpath races against a concurrent get() and
+ * put() pair. rcu_read_lock()'ed and atomic contexts qualify.
+ *
+ * Return:
+ *	True if this was the last reference with no future references
+ *	possible. This signals the caller that it can safely release the
+ *	object which is protected by the reference counter.
+ *
+ *	False if there are still active references or the put() raced
+ *	with a concurrent get()/put() pair. Caller is not allowed to
+ *	release the protected object.
+ */
+static inline __must_check bool rcuref_put_rcusafe(rcuref_t *ref)
+{
+	return __rcuref_put(ref);
+}
+
+/**
+ * rcuref_put -- Release one reference for a rcuref reference count
+ * @ref:	Pointer to the reference count
+ *
+ * Can be invoked from any context.
+ *
+ * Provides release memory ordering, such that prior loads and stores are done
+ * before, and provides an acquire ordering on success such that free()
+ * must come after.
+ *
+ * Return:
+ *
+ *	True if this was the last reference with no future references
+ *	possible. This signals the caller that it can safely schedule the
+ *	object, which is protected by the reference counter, for
+ *	deconstruction.
+ *
+ *	False if there are still active references or the put() raced
+ *	with a concurrent get()/put() pair. Caller is not allowed to
+ *	deconstruct the protected object.
+ */
+static inline __must_check bool rcuref_put(rcuref_t *ref)
+{
+	bool released;
+
+	preempt_disable();
+	released = __rcuref_put(ref);
+	preempt_enable();
+	return released;
+}
+
+#endif
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -175,6 +175,12 @@ typedef struct {
 } atomic64_t;
 #endif
 
+typedef struct {
+	atomic_t refcnt;
+} rcuref_t;
+
+#define RCUREF_INIT(i)	{ .refcnt = ATOMIC_INIT((i) - 1) }
+
 struct list_head {
 	struct list_head *next, *prev;
 };
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -47,7 +47,7 @@ obj-y += bcd.o sort.o parser.o debug_loc
 	 list_sort.o uuid.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o rhashtable.o base64.o \
-	 once.o refcount.o usercopy.o errseq.o bucket_locks.o \
+	 once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
 	 generic-radix-tree.o
 obj-$(CONFIG_STRING_SELFTEST) += test_string.o
 obj-y += string_helpers.o
--- /dev/null
+++ b/lib/rcuref.c
@@ -0,0 +1,281 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * rcuref - A scalable reference count implementation for RCU managed objects
+ *
+ * rcuref is provided to replace open coded reference count implementations
+ * based on atomic_t. It protects explicitly RCU managed objects which can
+ * be visible even after the last reference has been dropped and the object
+ * is heading towards destruction.
+ *
+ * A common usage pattern is:
+ *
+ * get()
+ *	rcu_read_lock();
+ *	p = get_ptr();
+ *	if (p && !atomic_inc_not_zero(&p->refcnt))
+ *		p = NULL;
+ *	rcu_read_unlock();
+ *	return p;
+ *
+ * put()
+ *	if (!atomic_dec_return(&p->refcnt)) {
+ *		remove_ptr(p);
+ *		kfree_rcu(p, rcu);
+ *	}
+ *
+ * atomic_inc_not_zero() is implemented with a try_cmpxchg() loop which has
+ * O(N^2) behaviour under contention with N concurrent operations.
+ *
+ * rcuref uses atomic_add_negative_relaxed() for the fast path, which scales
+ * better under contention.
+ *
+ * Why not refcount?
+ * =================
+ *
+ * In principle it should be possible to make refcount use the rcuref
+ * scheme, but the destruction race described below cannot be prevented
+ * unless the protected object is RCU managed.
+ *
+ * Theory of operation
+ * ===================
+ *
+ * rcuref uses an unsigned integer reference counter. As long as the
+ * counter value is greater than or equal to RCUREF_ONEREF and not larger
+ * than RCUREF_MAXREF the reference is alive:
+ *
+ * ONEREF   MAXREF               SATURATED             RELEASED      DEAD    NOREF
+ * 0        0x7FFFFFFF 0x80000000 0xA0000000 0xBFFFFFFF 0xC0000000 0xE0000000 0xFFFFFFFF
+ * <---valid --------> <-------saturation zone-------> <-----dead zone----->
+ *
+ * The get() and put() operations do unconditional increments and
+ * decrements. The result is checked after the operation. This optimizes
+ * for the fast path.
+ *
+ * If the reference count is saturated or dead, then the increments and
+ * decrements are not harmful as the reference count still stays in the
+ * respective zones and is always set back to SATURATED resp. DEAD. The
+ * zones have room for 2^28 racing operations in each direction, which
+ * makes it practically impossible to escape the zones.
+ *
+ * Once the last reference is dropped the reference count becomes
+ * RCUREF_NOREF which forces rcuref_put() into the slowpath operation. The
+ * slowpath then tries to set the reference count from RCUREF_NOREF to
+ * RCUREF_DEAD via a cmpxchg(). This opens a small window where a
+ * concurrent rcuref_get() can acquire the reference count and bring it
+ * back to RCUREF_ONEREF or even drop the reference again and mark it DEAD.
+ *
+ * If the cmpxchg() succeeds then a concurrent rcuref_get() will result in
+ * DEAD + 1, which is inside the dead zone. If that happens the reference
+ * count is put back to DEAD.
+ *
+ * The actual race is possible due to the unconditional increment and
+ * decrements in rcuref_get() and rcuref_put():
+ *
+ *	T1				T2
+ *	get()				put()
+ *					if (atomic_add_negative(1, &ref->refcnt))
+ *		succeeds->			atomic_cmpxchg(&ref->refcnt, -1, DEAD);
+ *
+ *	atomic_add_negative(1, &ref->refcnt);	<- Elevates refcount to DEAD + 1
+ *
+ * As the result of T1's add is negative, the get() goes into the slow path
+ * and observes refcnt being in the dead zone which makes the operation fail.
+ *
+ * Possible critical states:
+ *
+ *	Context Counter	References	Operation
+ *	T1	0	1		init()
+ *	T2	1	2		get()
+ *	T1	0	1		put()
+ *	T2     -1	0		put() tries to mark dead
+ *	T1	0	1		get()
+ *	T2	0	1		put() mark dead fails
+ *	T1     -1	0		put() tries to mark dead
+ *	T1    DEAD	0		put() mark dead succeeds
+ *	T2    DEAD+1	0		get() fails and puts it back to DEAD
+ *
+ * Of course there are more complex scenarios, but the above illustrates
+ * the working principle. The rest is left to the imagination of the
+ * reader.
+ *
+ * Deconstruction race
+ * ===================
+ *
+ * The release operation must be protected by prohibiting a grace period in
+ * order to prevent a possible use after free:
+ *
+ *	T1				T2
+ *	put()				get()
+ *	// ref->refcnt = ONEREF
+ *	if (atomic_add_negative(-1, &ref->refcnt))
+ *		return false;				<- Not taken
+ *
+ *	// ref->refcnt == NOREF
+ *	--> preemption
+ *					// Elevates ref->refcnt to ONEREF
+ *					if (!atomic_add_negative(1, &ref->refcnt))
+ *						return true;			<- taken
+ *
+ *					if (put(&p->ref)) { <-- Succeeds
+ *						remove_pointer(p);
+ *						kfree_rcu(p, rcu);
+ *					}
+ *
+ *		RCU grace period ends, object is freed
+ *
+ *	atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);	<- UAF
+ *
+ * This is prevented by disabling preemption around the put() operation as
+ * that is, in most kernel configurations, cheaper than an rcu_read_lock() /
+ * rcu_read_unlock() pair and in many cases even a NOOP. In any case it
+ * prevents the grace period which keeps the object alive until all put()
+ * operations complete.
+ *
+ * Saturation protection
+ * =====================
+ *
+ * The reference count has a saturation limit RCUREF_MAXREF (INT_MAX).
+ * Once this is exceeded the reference count becomes stale by setting it
+ * to RCUREF_SATURATED, which will cause a memory leak, but it prevents
+ * wraparounds which obviously cause worse problems than a memory
+ * leak. When saturation is reached a warning is emitted.
+ *
+ * Race conditions
+ * ===============
+ *
+ * All reference count increment/decrement operations are unconditional and
+ * only verified after the fact. This optimizes for the good case and takes
+ * the occasional race vs. a dead or already saturated refcount into
+ * account. The saturation and dead zones are large enough to accommodate
+ * that.
+ *
+ * Memory ordering
+ * ===============
+ *
+ * Memory ordering rules are slightly relaxed with respect to regular
+ * atomic_t functions and provide only what is strictly required for
+ * refcounts.
+ *
+ * The increments are fully relaxed; these will not provide ordering. The
+ * rationale is that whatever is used to obtain the object to increase the
+ * reference count on will provide the ordering. For locked data
+ * structures, it's the lock acquire; for RCU/lockless data structures,
+ * it's the dependent load.
+ *
+ * rcuref_get() provides a control dependency ordering future stores which
+ * ensures that the object is not modified when acquiring a reference
+ * fails.
+ *
+ * rcuref_put() provides release order, i.e. all prior loads and stores
+ * will be issued before. It also provides a control dependency ordering
+ * against the subsequent destruction of the object.
+ *
+ * If rcuref_put() successfully dropped the last reference and marked the
+ * object DEAD it also provides acquire ordering.
+ */
+
+#include <linux/export.h>
+#include <linux/rcuref.h>
+
+/**
+ * rcuref_get_slowpath - Slowpath of rcuref_get()
+ * @ref:	Pointer to the reference count
+ *
+ * Invoked when the reference count is outside of the valid zone.
+ *
+ * Return:
+ *	False if the reference count was already marked dead
+ *
+ *	True if the reference count is saturated, which prevents the
+ *	object from ever being deconstructed.
+ */
+bool rcuref_get_slowpath(rcuref_t *ref)
+{
+	unsigned int cnt = atomic_read(&ref->refcnt);
+
+	/*
+	 * If the reference count was already marked dead, undo the
+	 * increment so it stays in the middle of the dead zone and return
+	 * fail.
+	 */
+	if (cnt >= RCUREF_RELEASED) {
+		atomic_set(&ref->refcnt, RCUREF_DEAD);
+		return false;
+	}
+
+	/*
+	 * If it was saturated, warn and mark it so. In case the increment
+	 * was already on a saturated value restore the saturation
+	 * marker. This keeps it in the middle of the saturation zone and
+	 * prevents the reference count from overflowing. This leaks the
+	 * object memory, but prevents the obvious reference count overflow
+	 * damage.
+	 */
+	if (WARN_ONCE(cnt > RCUREF_MAXREF, "rcuref saturated - leaking memory"))
+		atomic_set(&ref->refcnt, RCUREF_SATURATED);
+	return true;
+}
+EXPORT_SYMBOL_GPL(rcuref_get_slowpath);
+
+/**
+ * rcuref_put_slowpath - Slowpath of __rcuref_put()
+ * @ref:	Pointer to the reference count
+ *
+ * Invoked when the reference count is outside of the valid zone.
+ *
+ * Return:
+ *	True if this was the last reference with no future references
+ *	possible. This signals the caller that it can safely schedule the
+ *	object, which is protected by the reference counter, for
+ *	deconstruction.
+ *
+ *	False if there are still active references or the put() raced
+ *	with a concurrent get()/put() pair. Caller is not allowed to
+ *	deconstruct the protected object.
+ */
+bool rcuref_put_slowpath(rcuref_t *ref)
+{
+	unsigned int cnt = atomic_read(&ref->refcnt);
+
+	/* Did this drop the last reference? */
+	if (likely(cnt == RCUREF_NOREF)) {
+		/*
+		 * Carefully try to set the reference count to RCUREF_DEAD.
+		 *
+		 * This can fail if a concurrent get() operation has
+		 * elevated it again or the corresponding put() even marked
+		 * it dead already. Both are valid situations and do not
+		 * require a retry. If this fails the caller is not
+		 * allowed to deconstruct the object.
+		 */
+		if (atomic_cmpxchg_release(&ref->refcnt, RCUREF_NOREF, RCUREF_DEAD) != RCUREF_NOREF)
+			return false;
+
+		/*
+		 * The caller can safely schedule the object for
+		 * deconstruction. Provide acquire ordering.
+		 */
+		smp_acquire__after_ctrl_dep();
+		return true;
+	}
+
+	/*
+	 * If the reference count was already in the dead zone, then this
+	 * put() operation is imbalanced. Warn, put the reference count back to
+	 * DEAD and tell the caller to not deconstruct the object.
+	 */
+	if (WARN_ONCE(cnt >= RCUREF_RELEASED, "rcuref - imbalanced put()")) {
+		atomic_set(&ref->refcnt, RCUREF_DEAD);
+		return false;
+	}
+
+	/*
+	 * Is this a put() operation on a saturated refcount? If so, restore the
+	 * mean saturation value and tell the caller to not deconstruct the
+	 * object.
+	 */
+	if (cnt > RCUREF_MAXREF)
+		atomic_set(&ref->refcnt, RCUREF_SATURATED);
+	return false;
+}
+EXPORT_SYMBOL_GPL(rcuref_put_slowpath);




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 2/3] atomics: Provide rcuref - scalable reference counting
  2023-03-02  1:05       ` Thomas Gleixner
@ 2023-03-02  1:29         ` Randy Dunlap
  2023-03-02 19:36         ` Linus Torvalds
  1 sibling, 0 replies; 19+ messages in thread
From: Randy Dunlap @ 2023-03-02  1:29 UTC (permalink / raw)
  To: Thomas Gleixner, Linus Torvalds
  Cc: LKML, x86, Wangyang Guo, Arjan Van De Ven, Will Deacon,
	Peter Zijlstra, Boqun Feng, Mark Rutland, Marc Zyngier,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev

(typos)

On 3/1/23 17:05, Thomas Gleixner wrote:
> On Wed, Mar 01 2023 at 12:09, Thomas Gleixner wrote:

> ---
> --- /dev/null
> +++ b/include/linux/rcuref.h
> @@ -0,0 +1,155 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef _LINUX_RCUREF_H
> +#define _LINUX_RCUREF_H
> +
> +#include <linux/atomic.h>
> +#include <linux/bug.h>
> +#include <linux/limits.h>
> +#include <linux/lockdep.h>
> +#include <linux/preempt.h>
> +#include <linux/rcupdate.h>
> +
> +#define RCUREF_ONEREF		0x00000000U
> +#define RCUREF_MAXREF		0x7FFFFFFFU
> +#define RCUREF_SATURATED	0xA0000000U
> +#define RCUREF_RELEASED		0xC0000000U
> +#define RCUREF_DEAD		0xE0000000U
> +#define RCUREF_NOREF		0xFFFFFFFFU
> +
> +/**
> + * rcuref_init - Initialize a rcuref reference count with the given reference count
> + * @ref:	Pointer to the reference count
> + * @cnt:	The initial reference count typically '1'

		                      count, typically

> + */
> +static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
> +{
> +	atomic_set(&ref->refcnt, cnt - 1);
> +}
> +

[snip]

> --- /dev/null
> +++ b/lib/rcuref.c
> @@ -0,0 +1,281 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * rcuref - A scalable reference count implementation for RCU managed objects
> + *
> + * rcuref is provided to replace open coded reference count implementations
> + * based on atomic_t. It protects explicitely RCU managed objects which can

                                     explicitly

> + * be visible even after the last reference has been dropped and the object
> + * is heading towards destruction.
> + *
> + * A common usage pattern is:
> + *
> + * get()
> + *	rcu_read_lock();
> + *	p = get_ptr();
> + *	if (p && !atomic_inc_not_zero(&p->refcnt))
> + *		p = NULL;
> + *	rcu_read_unlock();
> + *	return p;
> + *
> + * put()
> + *	if (!atomic_dec_return(&->refcnt)) {
> + *		remove_ptr(p);
> + *		kfree_rcu((p, rcu);
> + *	}
> + *
> + * atomic_inc_not_zero() is implemented with a try_cmpxchg() loop which has
> + * O(N^2) behaviour under contention with N concurrent operations.
> + *
> + * rcuref uses atomic_add_negative_relaxed() for the fast path, which scales
> + * better under contention.
> + *
> + * Why not refcount?
> + * =================
> + *
> + * In principle it should be possible to make refcount use the rcuref
> + * scheme, but the destruction race described below cannot be prevented
> + * unless the protected object is RCU managed.
> + *
> + * Theory of operation
> + * ===================
> + *
> + * rcuref uses an unsigned integer reference counter. As long as the
> + * counter value is greater than or equal to RCUREF_ONEREF and not larger
> + * than RCUREF_MAXREF the reference is alive:
> + *
> + * ONEREF   MAXREF               SATURATED             RELEASED      DEAD    NOREF
> + * 0        0x7FFFFFFF 0x8000000 0xA0000000 0xBFFFFFFF 0xC0000000 0xE0000000 0xFFFFFFFF
> + * <---valid --------> <-------saturation zone-------> <-----dead zone----->
> + *
> + * The get() and put() operations do unconditional increments and
> + * decrements. The result is checked after the operation. This optimizes
> + * for the fast path.
> + *
> + * If the reference count is saturated or dead, then the increments and
> + * decrements are not harmful as the reference count still stays in the
> + * respective zones and is always set back to STATURATED resp. DEAD. The

                                                 SATURATED

> + * zones have room for 2^28 racing operations in each direction, which
> + * makes it practically impossible to escape the zones.
> + *
> + * Once the last reference is dropped the reference count becomes
> + * RCUREF_NOREF which forces rcuref_put() into the slowpath operation. The
> + * slowpath then tries to set the reference count from RCUREF_NOREF to
> + * RCUREF_DEAD via a cmpxchg(). This opens a small window where a
> + * concurrent rcuref_get() can acquire the reference count and bring it
> + * back to RCUREF_ONEREF or even drop the reference again and mark it DEAD.
> + *
> + * If the cmpxchg() succeeds then a concurrent rcuref_get() will result in
> + * DEAD + 1, which is inside the dead zone. If that happens the reference
> + * count is put back to DEAD.
> + *
> + * The actual race is possible due to the unconditional increment and
> + * decrements in rcuref_get() and rcuref_put():
> + *
> + *	T1				T2
> + *	get()				put()
> + *					if (atomic_add_negative(1, &ref->refcnt))
> + *		succeeds->			atomic_cmpxchg(&ref->refcnt, -1, DEAD);
> + *
> + *	atomic_add_negative(1, &ref->refcnt);	<- Elevates refcount to DEAD + 1
> + *
> + * As the result of T1's add is negative, the get() goes into the slow path
> + * and observes refcnt being in the dead zone which makes the operation fail.
> + *
> + * Possible critical states:
> + *
> + *	Context Counter	References	Operation
> + *	T1	0	1		init()
> + *	T2	1	2		get()
> + *	T1	0	1		put()
> + *	T2     -1	0		put() tries to mark dead
> + *	T1	0	1		get()
> + *	T2	0	1		put() mark dead fails
> + *	T1     -1	0		put() tries to mark dead
> + *	T1    DEAD	0		put() mark dead succeeds
> + *	T2    DEAD+1	0		get() fails and puts it back to DEAD
> + *
> + * Of course there are more complex scenarios, but the above illustrates
> + * the working principle. The rest is left to the imagination of the
> + * reader.
> + *
> + * Deconstruction race
> + * ===================
> + *
> + * The release operation must be protected by prohibiting a grace period in
> + * order to prevent a possible use after free:
> + *
> + *	T1				T2
> + *	put()				get()
> + *	// ref->refcnt = ONEREF
> + *	if (atomic_add_negative(-1, &ref->cnt))
> + *		return false;				<- Not taken
> + *
> + *	// ref->refcnt == NOREF
> + *	--> preemption
> + *					// Elevates ref->c to ONEREF
> + *					if (!atomic_add_negative(1, &ref->refcnt))
> + *						return true;			<- taken
> + *
> + *					if (put(&p->ref)) { <-- Succeeds
> + *						remove_pointer(p);
> + *						kfree_rcu(p, rcu);
> + *					}
> + *
> + *		RCU grace period ends, object is freed
> + *
> + *	atomic_cmpxchg(&ref->refcnt, NONE, DEAD);	<- UAF
> + *
> + * This is prevented by disabling preemption around the put() operation as
> + * that's in most kernel configurations cheaper than a rcu_read_lock() /
> + * rcu_read_unlock() pair and in many cases even a NOOP. In any case it
> + * prevents the grace period which keeps the object alive until all put()
> + * operations complete.
> + *
> + * Saturation protection
> + * =====================
> + *
> + * The reference count has a saturation limit RCUREF_MAXREF (INT_MAX).
> + * Once this is exceedded the reference count becomes stale by setting it

                   exceeded

> + * to RCUREF_SATURATED, which will cause a memory leak, but it prevents
> + * wrap arounds which obviously cause worse problems than a memory

      wraparounds

> + * leak. When saturation is reached a warning is emitted.
> + *
> + * Race conditions
> + * ===============
> + *
> + * All reference count increment/decrement operations are unconditional and
> + * only verified after the fact. This optimizes for the good case and takes
> + * the occasional race vs. a dead or already saturated refcount into
> + * account. The saturation and dead zones are large enough to accomodate

                                                                 accommodate
"accommodate that" or "allow for that".

> + * for that.
> + *
> + * Memory ordering
> + * ===============
> + *
> + * Memory ordering rules are slightly relaxed wrt regular atomic_t functions

Preferably "with respect to".

> + * and provide only what is strictly required for refcounts.
> + *
> + * The increments are fully relaxed; these will not provide ordering. The
> + * rationale is that whatever is used to obtain the object to increase the
> + * reference count on will provide the ordering. For locked data
> + * structures, its the lock acquire, for RCU/lockless data structures its
> + * the dependent load.
> + *
> + * rcuref_get() provides a control dependency ordering future stores which
> + * ensures that the object is not modified when acquiring a reference
> + * fails.
> + *
> + * rcuref_put() provides release order, i.e. all prior loads and stores
> + * will be issued before. It also provides a control dependency ordering
> + * against the subsequent destruction of the object.
> + *
> + * If rcuref_put() successfully dropped the last reference and marked the
> + * object DEAD it also provides acquire ordering.
> + */
> +
> +#include <linux/export.h>
> +#include <linux/rcuref.h>
> +

[snip]

> +/**
> + * rcuref_put_slowpath - Slowpath of __rcuref_put()
> + * @ref:	Pointer to the reference count
> + *
> + * Invoked when the reference count is outside of the valid zone.
> + *
> + * Return:
> + *	True if this was the last reference with no future references
> + *	possible. This signals the caller that it can safely schedule the
> + *	object, which is protected by the reference counter, for
> + *	deconstruction.
> + *
> + *	False if there are still active references or the put() raced
> + *	with a concurrent get()/put() pair. Caller is not allowed to
> + *	deconstruct the protected object.
> + */
> +bool rcuref_put_slowpath(rcuref_t *ref)
> +{
> +	unsigned int cnt = atomic_read(&ref->refcnt);
> +
> +	/* Did this drop the last reference? */
> +	if (likely(cnt == RCUREF_NOREF)) {
> +		/*
> +		 * Carefully try to set the reference count to RCUREF_DEAD.
> +		 *
> +		 * This can fail if a concurrent get() operation has
> +		 * elevated it again or the corresponding put() even marked
> +		 * it dead already. Both are valid situations and do not
> +		 * require a retry. If this fails the caller is not
> +		 * allowed to deconstruct the object.
> +		 */
> +		if (atomic_cmpxchg_release(&ref->refcnt, RCUREF_NOREF, RCUREF_DEAD) != RCUREF_NOREF)
> +			return false;
> +
> +		/*
> +		 * The caller can safely schedule the object for
> +		 * deconstruction. Provide acquire ordering.
> +		 */
> +		smp_acquire__after_ctrl_dep();
> +		return true;
> +	}
> +
> +	/*
> +	 * If the reference count was already in the dead zone, then this
> +	 * put() operation is imbalanced. Warn, put the reference count back to
> +	 * DEAD and tell the caller to not deconstruct the object.
> +	 */
> +	if (WARN_ONCE(cnt >= RCUREF_RELEASED, "rcuref - imbalanced put()")) {
> +		atomic_set(&ref->refcnt, RCUREF_DEAD);
> +		return false;
> +	}
> +
> +	/*
> +	 * Is this a put() operation on a saturated refcount? If so, rRestore the

	                                                             restore

> +	 * mean saturation value and tell the caller to not deconstruct the
> +	 * object.
> +	 */
> +	if (cnt > RCUREF_MAXREF)
> +		atomic_set(&ref->refcnt, RCUREF_SATURATED);
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(rcuref_put_slowpath);


-- 
~Randy

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 2/3] atomics: Provide rcuref - scalable reference counting
  2023-03-02  1:05       ` Thomas Gleixner
  2023-03-02  1:29         ` Randy Dunlap
@ 2023-03-02 19:36         ` Linus Torvalds
  1 sibling, 0 replies; 19+ messages in thread
From: Linus Torvalds @ 2023-03-02 19:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Wangyang Guo, Arjan Van De Ven, Will Deacon,
	Peter Zijlstra, Boqun Feng, Mark Rutland, Marc Zyngier,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev

On Wed, Mar 1, 2023 at 5:05 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> The result of staring more is:
>
> get():
>     6b57:       f0 41 83 45 40 01       lock addl $0x1,0x40(%r13)
>     6b5d:       0f 88 cd 00 00 00       js     6c30                     // -> slowpath if negative

[ rest removed ]

Yeah, so this looks like I was hoping for.

That PREEMPT=y case of put() makes me slightly unhappy, and I'm
wondering if it can be improved with better placement of the
preempt_disable/enable, but apart from maybe some massaging to that I
don't see a good way to avoid it.

And the ugliness is mostly about the preemption side, not about the
refcount itself. I've looked at that "preempt_enable ->
preempt_schedule" code generation before, and I've disliked it before,
and I don't have an answer to it.

> but the actual network code does some sanity checking:

Ok. Not pretty. But at least it's just an xadd on the access itself,
there's just some extra noise around it.

            Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-03-02 19:36 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-28 14:33 [patch 0/3] net, refcount: Address dst_entry reference count scalability issues Thomas Gleixner
2023-02-28 14:33 ` [patch 1/3] net: dst: Prevent false sharing vs. dst_entry::__refcnt Thomas Gleixner
2023-02-28 15:17   ` Eric Dumazet
2023-02-28 21:31     ` Thomas Gleixner
2023-03-01 23:12       ` Thomas Gleixner
2023-02-28 14:33 ` [patch 2/3] atomics: Provide rcuref - scalable reference counting Thomas Gleixner
2023-03-01  0:42   ` Linus Torvalds
2023-03-01  1:07     ` Linus Torvalds
2023-03-01 11:09     ` Thomas Gleixner
2023-03-02  1:05       ` Thomas Gleixner
2023-03-02  1:29         ` Randy Dunlap
2023-03-02 19:36         ` Linus Torvalds
2023-02-28 14:33 ` [patch 3/3] net: dst: Switch to rcuref_t " Thomas Gleixner
2023-02-28 15:07 ` [patch 0/3] net, refcount: Address dst_entry reference count scalability issues Eric Dumazet
2023-02-28 16:38   ` Thomas Gleixner
2023-02-28 16:59     ` Eric Dumazet
2023-03-01  1:00       ` Thomas Gleixner
2023-03-01  3:17         ` Jakub Kicinski
2023-03-01 10:40           ` Thomas Gleixner
