netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1 00/25] seqlock: Extend seqcount API with associated locks
@ 2020-05-19 21:45 Ahmed S. Darwish
  2020-05-19 21:45 ` [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount Ahmed S. Darwish
                   ` (8 more replies)
  0 siblings, 9 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-05-19 21:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, David S. Miller,
	Andrew Morton, Jens Axboe, Jonathan Corbet, Alexander Viro,
	David Airlie, Daniel Vetter, netdev, linux-mm, linux-block,
	dri-devel, linux-fsdevel, linux-doc

Hi,

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. If the serialization primitive is
not disabling preemption implicitly, preemption has to be explicitly
disabled before entering the write side critical section.

There is no built-in debugging mechanism to verify that the lock used
for writer serialization is held and preemption is disabled. Some usage
sites like dma-buf have explicit lockdep checks for the writer-side
lock, but this covers only a small portion of the sequence counter usage
in the kernel.

Add new sequence counter types which allows to associate a lock to the
sequence counter at initialization time. The seqcount API functions are
extended to provide appropriate lockdep assertions depending on the
seqcount/lock type.

For sequence counters with associated locks that do not implicitly
disable preemption, preemption protection is enforced in the sequence
counter write side functions. This removes the need to explicitly add
preempt_disable/enable() around the write side critical sections: the
write_begin/end() functions for these new sequence counter types
automatically do this.

Extend the lockdep API with a macro asserting that preemption is
disabled.  Use it to verify that preemption is disabled for all sequence
counters write side critical sections.

If lockdep is disabled, these lock associations and non-preemptibility
checks are compiled out and have neither storage size nor runtime
overhead. If lockdep is enabled, a pointer to the lock is stored in the
seqcount and the write side API functions enable lockdep assertions.

The following seqcount types with associated locks are introduced:

     seqcount_spinlock_t
     seqcount_raw_spinlock_t
     seqcount_rwlock_t
     seqcount_mutex_t
     seqcount_ww_mutex_t

This lock association is not only useful for debugging purposes, it also
provides a mechanism for PREEMPT_RT to prevent writer starvation. On RT
kernels spinlocks and rwlocks are substituted with sleeping locks and
the code sections protected by these locks become preemptible, which has
the same problem as write side critical section with preemption enabled
on a non-RT kernel. RT utilizes this association by storing the provided
lock pointer and in case that a reader sees an active writer (seqcount
is odd), it does not spin, but blocks on the associated lock similar to
read_seqbegin_or_lock().

By using the lockdep debugging mechanisms added in this patch series, a
number of erroneous seqcount call-sites were discovered across the
kernel. The fixes are included at the beginning of the series.

Thanks,

8<--------------

Ahmed S. Darwish (25):
  net: core: device_rename: Use rwsem instead of a seqcount
  mm/swap: Don't abuse the seqcount latching API
  net: phy: fixed_phy: Remove unused seqcount
  block: nr_sects_write(): Disable preemption on seqcount write
  u64_stats: Document writer non-preemptibility requirement
  dma-buf: Remove custom seqcount lockdep class key
  lockdep: Add preemption disabled assertion API
  seqlock: lockdep assert non-preemptibility on seqcount_t write
  Documentation: locking: Describe seqlock design and usage
  seqlock: Add RST directives to kernel-doc code samples and notes
  seqlock: Add missing kernel-doc annotations
  seqlock: Extend seqcount API with associated locks
  dma-buf: Use sequence counter with associated wound/wait mutex
  sched: tasks: Use sequence counter with associated spinlock
  netfilter: conntrack: Use sequence counter with associated spinlock
  netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
  xfrm: policy: Use sequence counters with associated lock
  timekeeping: Use sequence counter with associated raw spinlock
  vfs: Use sequence counter with associated spinlock
  raid5: Use sequence counter with associated spinlock
  iocost: Use sequence counter with associated spinlock
  NFSv4: Use sequence counter with associated spinlock
  userfaultfd: Use sequence counter with associated spinlock
  kvm/eventfd: Use sequence counter with associated spinlock
  hrtimer: Use sequence counter with associated raw spinlock

 Documentation/locking/index.rst               |   1 +
 Documentation/locking/seqlock.rst             | 239 +++++
 MAINTAINERS                                   |   2 +-
 block/blk-iocost.c                            |   5 +-
 block/blk.h                                   |   2 +
 drivers/dma-buf/dma-resv.c                    |  15 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   2 -
 drivers/md/raid5.c                            |   2 +-
 drivers/md/raid5.h                            |   2 +-
 drivers/net/phy/fixed_phy.c                   |  25 +-
 fs/dcache.c                                   |   2 +-
 fs/fs_struct.c                                |   4 +-
 fs/nfs/nfs4_fs.h                              |   2 +-
 fs/nfs/nfs4state.c                            |   2 +-
 fs/userfaultfd.c                              |   4 +-
 include/linux/dcache.h                        |   2 +-
 include/linux/dma-resv.h                      |   4 +-
 include/linux/fs_struct.h                     |   2 +-
 include/linux/hrtimer.h                       |   2 +-
 include/linux/kvm_irqfd.h                     |   2 +-
 include/linux/lockdep.h                       |   9 +
 include/linux/sched.h                         |   2 +-
 include/linux/seqlock.h                       | 882 +++++++++++++++---
 include/linux/seqlock_types_internal.h        | 187 ++++
 include/linux/u64_stats_sync.h                |  38 +-
 include/net/netfilter/nf_conntrack.h          |   2 +-
 init/init_task.c                              |   3 +-
 kernel/fork.c                                 |   2 +-
 kernel/locking/lockdep.c                      |  15 +
 kernel/time/hrtimer.c                         |  13 +-
 kernel/time/timekeeping.c                     |  19 +-
 lib/Kconfig.debug                             |   1 +
 mm/swap.c                                     |  57 +-
 net/core/dev.c                                |  30 +-
 net/netfilter/nf_conntrack_core.c             |   5 +-
 net/netfilter/nft_set_rbtree.c                |   4 +-
 net/xfrm/xfrm_policy.c                        |  10 +-
 virt/kvm/eventfd.c                            |   2 +-
 38 files changed, 1325 insertions(+), 277 deletions(-)
 create mode 100644 Documentation/locking/seqlock.rst
 create mode 100644 include/linux/seqlock_types_internal.h

base-commit: 2ef96a5bb12be62ef75b5828c0aab838ebb29cb8
--
2.20.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-19 21:45 [PATCH v1 00/25] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
@ 2020-05-19 21:45 ` Ahmed S. Darwish
  2020-05-19 22:01   ` Stephen Hemminger
                     ` (2 more replies)
  2020-05-19 21:45 ` [PATCH v1 03/25] net: phy: fixed_phy: Remove unused seqcount Ahmed S. Darwish
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-05-19 21:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, David S. Miller,
	Jakub Kicinski, netdev

Sequence counters write paths are critical sections that must never be
preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.

Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
netdev name retrieval.") handled a deadlock, observed with
CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
infinitely spinning: it got scheduled after the seqcount write side
blocked inside its own critical section.

To fix that deadlock, among other issues, the commit added a
cond_resched() inside the read side section. While this will get the
non-preemptible kernel eventually unstuck, the seqcount reader is fully
exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.

The fix is also still broken: if the seqcount reader belongs to a
real-time scheduling policy, it can spin forever and the kernel will
livelock.

Disabling preemption over the seqcount write side critical section will
not work: inside it are a number of GFP_KERNEL allocations and mutex
locking through the drivers/base/ :: device_rename() call chain.

From all the above, replace the seqcount with a rwsem.

Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
Cc: <stable@vger.kernel.org>
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/dev.c | 30 ++++++++++++------------------
 1 file changed, 12 insertions(+), 18 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 522288177bbd..e18a4c23df0e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -79,6 +79,7 @@
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/string.h>
 #include <linux/mm.h>
 #include <linux/socket.h>
@@ -194,7 +195,7 @@ static DEFINE_SPINLOCK(napi_hash_lock);
 static unsigned int napi_gen_id = NR_CPUS;
 static DEFINE_READ_MOSTLY_HASHTABLE(napi_hash, 8);
 
-static seqcount_t devnet_rename_seq;
+static DECLARE_RWSEM(devnet_rename_sem);
 
 static inline void dev_base_seq_inc(struct net *net)
 {
@@ -930,18 +931,13 @@ EXPORT_SYMBOL(dev_get_by_napi_id);
  *	@net: network namespace
  *	@name: a pointer to the buffer where the name will be stored.
  *	@ifindex: the ifindex of the interface to get the name from.
- *
- *	The use of raw_seqcount_begin() and cond_resched() before
- *	retrying is required as we want to give the writers a chance
- *	to complete when CONFIG_PREEMPTION is not set.
  */
 int netdev_get_name(struct net *net, char *name, int ifindex)
 {
 	struct net_device *dev;
-	unsigned int seq;
 
-retry:
-	seq = raw_seqcount_begin(&devnet_rename_seq);
+	down_read(&devnet_rename_sem);
+
 	rcu_read_lock();
 	dev = dev_get_by_index_rcu(net, ifindex);
 	if (!dev) {
@@ -951,10 +947,8 @@ int netdev_get_name(struct net *net, char *name, int ifindex)
 
 	strcpy(name, dev->name);
 	rcu_read_unlock();
-	if (read_seqcount_retry(&devnet_rename_seq, seq)) {
-		cond_resched();
-		goto retry;
-	}
+
+	up_read(&devnet_rename_sem);
 
 	return 0;
 }
@@ -1228,10 +1222,10 @@ int dev_change_name(struct net_device *dev, const char *newname)
 	    likely(!(dev->priv_flags & IFF_LIVE_RENAME_OK)))
 		return -EBUSY;
 
-	write_seqcount_begin(&devnet_rename_seq);
+	down_write(&devnet_rename_sem);
 
 	if (strncmp(newname, dev->name, IFNAMSIZ) == 0) {
-		write_seqcount_end(&devnet_rename_seq);
+		up_write(&devnet_rename_sem);
 		return 0;
 	}
 
@@ -1239,7 +1233,7 @@ int dev_change_name(struct net_device *dev, const char *newname)
 
 	err = dev_get_valid_name(net, dev, newname);
 	if (err < 0) {
-		write_seqcount_end(&devnet_rename_seq);
+		up_write(&devnet_rename_sem);
 		return err;
 	}
 
@@ -1254,11 +1248,11 @@ int dev_change_name(struct net_device *dev, const char *newname)
 	if (ret) {
 		memcpy(dev->name, oldname, IFNAMSIZ);
 		dev->name_assign_type = old_assign_type;
-		write_seqcount_end(&devnet_rename_seq);
+		up_write(&devnet_rename_sem);
 		return ret;
 	}
 
-	write_seqcount_end(&devnet_rename_seq);
+	up_write(&devnet_rename_sem);
 
 	netdev_adjacent_rename_links(dev, oldname);
 
@@ -1279,7 +1273,7 @@ int dev_change_name(struct net_device *dev, const char *newname)
 		/* err >= 0 after dev_alloc_name() or stores the first errno */
 		if (err >= 0) {
 			err = ret;
-			write_seqcount_begin(&devnet_rename_seq);
+			down_write(&devnet_rename_sem);
 			memcpy(dev->name, oldname, IFNAMSIZ);
 			memcpy(oldname, newname, IFNAMSIZ);
 			dev->name_assign_type = old_assign_type;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v1 03/25] net: phy: fixed_phy: Remove unused seqcount
  2020-05-19 21:45 [PATCH v1 00/25] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
  2020-05-19 21:45 ` [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount Ahmed S. Darwish
@ 2020-05-19 21:45 ` Ahmed S. Darwish
  2020-05-19 21:45 ` [PATCH v1 05/25] u64_stats: Document writer non-preemptibility requirement Ahmed S. Darwish
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-05-19 21:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Andrew Lunn,
	Florian Fainelli, Heiner Kallweit, Russell King, David S. Miller,
	netdev

Commit bf7afb29d545 ("phy: improve safety of fixed-phy MII register
reading") protected the fixed PHY status with a sequence counter.

Two years later, commit d2b977939b18 ("net: phy: fixed-phy: remove
fixed_phy_update_state()") removed the sequence counter's write side
critical section -- neutralizing its read side retry loop.

Remove the unused seqcount.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/phy/fixed_phy.c | 25 ++++++++++---------------
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/drivers/net/phy/fixed_phy.c b/drivers/net/phy/fixed_phy.c
index 4a3d34f40cb9..f55365c9d1f7 100644
--- a/drivers/net/phy/fixed_phy.c
+++ b/drivers/net/phy/fixed_phy.c
@@ -34,7 +34,6 @@ struct fixed_mdio_bus {
 struct fixed_phy {
 	int addr;
 	struct phy_device *phydev;
-	seqcount_t seqcount;
 	struct fixed_phy_status status;
 	bool no_carrier;
 	int (*link_update)(struct net_device *, struct fixed_phy_status *);
@@ -80,19 +79,17 @@ static int fixed_mdio_read(struct mii_bus *bus, int phy_addr, int reg_num)
 	list_for_each_entry(fp, &fmb->phys, node) {
 		if (fp->addr == phy_addr) {
 			struct fixed_phy_status state;
-			int s;
 
-			do {
-				s = read_seqcount_begin(&fp->seqcount);
-				fp->status.link = !fp->no_carrier;
-				/* Issue callback if user registered it. */
-				if (fp->link_update)
-					fp->link_update(fp->phydev->attached_dev,
-							&fp->status);
-				/* Check the GPIO for change in status */
-				fixed_phy_update(fp);
-				state = fp->status;
-			} while (read_seqcount_retry(&fp->seqcount, s));
+			fp->status.link = !fp->no_carrier;
+
+			/* Issue callback if user registered it. */
+			if (fp->link_update)
+				fp->link_update(fp->phydev->attached_dev,
+						&fp->status);
+
+			/* Check the GPIO for change in status */
+			fixed_phy_update(fp);
+			state = fp->status;
 
 			return swphy_read_reg(reg_num, &state);
 		}
@@ -150,8 +147,6 @@ static int fixed_phy_add_gpiod(unsigned int irq, int phy_addr,
 	if (!fp)
 		return -ENOMEM;
 
-	seqcount_init(&fp->seqcount);
-
 	if (irq != PHY_POLL)
 		fmb->mii_bus->irq[phy_addr] = irq;
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v1 05/25] u64_stats: Document writer non-preemptibility requirement
  2020-05-19 21:45 [PATCH v1 00/25] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
  2020-05-19 21:45 ` [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount Ahmed S. Darwish
  2020-05-19 21:45 ` [PATCH v1 03/25] net: phy: fixed_phy: Remove unused seqcount Ahmed S. Darwish
@ 2020-05-19 21:45 ` Ahmed S. Darwish
  2020-05-19 21:45 ` [PATCH v1 15/25] netfilter: conntrack: Use sequence counter with associated spinlock Ahmed S. Darwish
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-05-19 21:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, David S. Miller,
	Jakub Kicinski, netdev

The u64_stats mechanism uses sequence counters to protect against 64-bit
values tearing on 32-bit architectures. Updating such statistics is a
sequence counter write side critical section.

Preemption must be disabled before entering this seqcount write critical
section.  Failing to do so, the seqcount read side can preempt the write
side section and spin for the entire scheduler tick.  If that reader
belongs to a real-time scheduling class, it can spin forever and the
kernel will livelock.

Document this statistics update side non-preemptibility requirement.

Reword the u64_stats header file top comment to always mention "Reader"
or "Writer" at the start of each bullet point, making it easier to
follow which side each point is actually for.

Fix the statement "whole thing is a NOOP on 64bit arches or UP kernels".
For 32-bit UP kernels, preemption is always disabled for the statistics
read side section.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/u64_stats_sync.h | 38 ++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/include/linux/u64_stats_sync.h b/include/linux/u64_stats_sync.h
index 9de5c10293f5..30358ce3d8fe 100644
--- a/include/linux/u64_stats_sync.h
+++ b/include/linux/u64_stats_sync.h
@@ -7,29 +7,31 @@
  * we provide a synchronization point, that is a noop on 64bit or UP kernels.
  *
  * Key points :
- * 1) Use a seqcount on SMP 32bits, with low overhead.
- * 2) Whole thing is a noop on 64bit arches or UP kernels.
- * 3) Write side must ensure mutual exclusion or one seqcount update could
+ *
+ * 1) Use a seqcount on 32-bit SMP, only disable preemption for 32-bit UP.
+ *
+ * 2) The whole thing is a no-op on 64-bit architectures.
+ *
+ * 3) Write side must ensure mutual exclusion, or one seqcount update could
  *    be lost, thus blocking readers forever.
- *    If this synchronization point is not a mutex, but a spinlock or
- *    spinlock_bh() or disable_bh() :
- * 3.1) Write side should not sleep.
- * 3.2) Write side should not allow preemption.
- * 3.3) If applicable, interrupts should be disabled.
  *
- * 4) If reader fetches several counters, there is no guarantee the whole values
- *    are consistent (remember point 1) : this is a noop on 64bit arches anyway)
+ * 4) Write side must disable preemption, or a seqcount reader can preempt the
+ *    writer and also spin forever.
  *
- * 5) readers are allowed to sleep or be preempted/interrupted : They perform
- *    pure reads. But if they have to fetch many values, it's better to not allow
- *    preemptions/interruptions to avoid many retries.
+ * 5) Write side must use the _irqsave() variant if other writers, or a reader,
+ *    can be invoked from an IRQ context.
  *
- * 6) If counter might be written by an interrupt, readers should block interrupts.
- *    (On UP, there is no seqcount_t protection, a reader allowing interrupts could
- *     read partial values)
+ * 6) If reader fetches several counters, there is no guarantee the whole values
+ *    are consistent w.r.t. each other (remember point #2: seqcounts are not
+ *    used for 64bit architectures).
  *
- * 7) For irq and softirq uses, readers can use u64_stats_fetch_begin_irq() and
- *    u64_stats_fetch_retry_irq() helpers
+ * 7) Readers are allowed to sleep or be preempted/interrupted: they perform
+ *    pure reads.
+ *
+ * 8) Readers must use both u64_stats_fetch_{begin,retry}_irq() if the stats
+ *    might be updated from a hardirq or softirq context (remember point #1:
+ *    seqcounts are not used for UP kernels). 32-bit UP stat readers could read
+ *    corrupted 64-bit values otherwise.
  *
  * Usage :
  *
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v1 15/25] netfilter: conntrack: Use sequence counter with associated spinlock
  2020-05-19 21:45 [PATCH v1 00/25] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
                   ` (2 preceding siblings ...)
  2020-05-19 21:45 ` [PATCH v1 05/25] u64_stats: Document writer non-preemptibility requirement Ahmed S. Darwish
@ 2020-05-19 21:45 ` Ahmed S. Darwish
  2020-05-19 21:45 ` [PATCH v1 16/25] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock Ahmed S. Darwish
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-05-19 21:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	Jakub Kicinski, netfilter-devel, coreteam, netdev

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 include/net/netfilter/nf_conntrack.h | 2 +-
 net/netfilter/nf_conntrack_core.c    | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 9f551f3b69c6..333fd54aec30 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -286,7 +286,7 @@ int nf_conntrack_hash_resize(unsigned int hashsize);
 
 extern struct hlist_nulls_head *nf_conntrack_hash;
 extern unsigned int nf_conntrack_htable_size;
-extern seqcount_t nf_conntrack_generation;
+extern seqcount_spinlock_t nf_conntrack_generation;
 extern unsigned int nf_conntrack_max;
 
 /* must be called with rcu read lock held */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index c4582eb71766..48a839377da2 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -180,7 +180,7 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 
 unsigned int nf_conntrack_max __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_max);
-seqcount_t nf_conntrack_generation __read_mostly;
+seqcount_spinlock_t nf_conntrack_generation __read_mostly;
 static unsigned int nf_conntrack_hash_rnd __read_mostly;
 
 static u32 hash_conntrack_raw(const struct nf_conntrack_tuple *tuple,
@@ -2512,7 +2512,8 @@ int nf_conntrack_init_start(void)
 	/* struct nf_ct_ext uses u8 to store offsets/size */
 	BUILD_BUG_ON(total_extension_size() > 255u);
 
-	seqcount_init(&nf_conntrack_generation);
+	seqcount_spinlock_init(&nf_conntrack_generation,
+			       &nf_conntrack_locks_all_lock);
 
 	for (i = 0; i < CONNTRACK_LOCKS; i++)
 		spin_lock_init(&nf_conntrack_locks[i]);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v1 16/25] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
  2020-05-19 21:45 [PATCH v1 00/25] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
                   ` (3 preceding siblings ...)
  2020-05-19 21:45 ` [PATCH v1 15/25] netfilter: conntrack: Use sequence counter with associated spinlock Ahmed S. Darwish
@ 2020-05-19 21:45 ` Ahmed S. Darwish
  2020-05-19 21:45 ` [PATCH v1 17/25] xfrm: policy: Use sequence counters with associated lock Ahmed S. Darwish
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-05-19 21:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	Jakub Kicinski, netfilter-devel, coreteam, netdev

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_rwlock_t data type, which allows to associate a
rwlock with the sequence counter. This enables lockdep to verify that
the rwlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 net/netfilter/nft_set_rbtree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 3ffef454d469..f50d986d43c5 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -18,7 +18,7 @@
 struct nft_rbtree {
 	struct rb_root		root;
 	rwlock_t		lock;
-	seqcount_t		count;
+	seqcount_rwlock_t	count;
 	struct delayed_work	gc_work;
 };
 
@@ -505,7 +505,7 @@ static int nft_rbtree_init(const struct nft_set *set,
 	struct nft_rbtree *priv = nft_set_priv(set);
 
 	rwlock_init(&priv->lock);
-	seqcount_init(&priv->count);
+	seqcount_rwlock_init(&priv->count, &priv->lock);
 	priv->root = RB_ROOT;
 
 	INIT_DEFERRABLE_WORK(&priv->gc_work, nft_rbtree_gc);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v1 17/25] xfrm: policy: Use sequence counters with associated lock
  2020-05-19 21:45 [PATCH v1 00/25] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
                   ` (4 preceding siblings ...)
  2020-05-19 21:45 ` [PATCH v1 16/25] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock Ahmed S. Darwish
@ 2020-05-19 21:45 ` Ahmed S. Darwish
  2020-06-08  0:57 ` [PATCH v2 00/18] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-05-19 21:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Steffen Klassert,
	Herbert Xu, David S. Miller, Jakub Kicinski, netdev

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. If the serialization primitive is
not disabling preemption implicitly, preemption has to be explicitly
disabled before entering the sequence counter write side critical
section.

A plain seqcount_t does not contain the information of which lock must
be held when entering a write side critical section.

Use the new seqcount_spinlock_t and seqcount_mutex_t data types instead,
which allow to associate a lock with the sequence counter. This enables
lockdep to verify that the lock used for writer serialization is held
when the write side critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 net/xfrm/xfrm_policy.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 297b2fdb3c29..aae78a7aecd7 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -122,7 +122,7 @@ struct xfrm_pol_inexact_bin {
 	/* list containing '*:*' policies */
 	struct hlist_head hhead;
 
-	seqcount_t count;
+	seqcount_spinlock_t count;
 	/* tree sorted by daddr/prefix */
 	struct rb_root root_d;
 
@@ -155,7 +155,7 @@ static struct xfrm_policy_afinfo const __rcu *xfrm_policy_afinfo[AF_INET6 + 1]
 						__read_mostly;
 
 static struct kmem_cache *xfrm_dst_cache __ro_after_init;
-static __read_mostly seqcount_t xfrm_policy_hash_generation;
+static __read_mostly seqcount_mutex_t xfrm_policy_hash_generation;
 
 static struct rhashtable xfrm_policy_inexact_table;
 static const struct rhashtable_params xfrm_pol_inexact_params;
@@ -719,7 +719,7 @@ xfrm_policy_inexact_alloc_bin(const struct xfrm_policy *pol, u8 dir)
 	INIT_HLIST_HEAD(&bin->hhead);
 	bin->root_d = RB_ROOT;
 	bin->root_s = RB_ROOT;
-	seqcount_init(&bin->count);
+	seqcount_spinlock_init(&bin->count, &net->xfrm.xfrm_policy_lock);
 
 	prev = rhashtable_lookup_get_insert_key(&xfrm_policy_inexact_table,
 						&bin->k, &bin->head,
@@ -1911,7 +1911,7 @@ static int xfrm_policy_match(const struct xfrm_policy *pol,
 
 static struct xfrm_pol_inexact_node *
 xfrm_policy_lookup_inexact_addr(const struct rb_root *r,
-				seqcount_t *count,
+				seqcount_spinlock_t *count,
 				const xfrm_address_t *addr, u16 family)
 {
 	const struct rb_node *parent;
@@ -4158,7 +4158,7 @@ void __init xfrm_init(void)
 {
 	register_pernet_subsys(&xfrm_net_ops);
 	xfrm_dev_init();
-	seqcount_init(&xfrm_policy_hash_generation);
+	seqcount_mutex_init(&xfrm_policy_hash_generation, &hash_resize_mutex);
 	xfrm_input_init();
 
 #ifdef CONFIG_INET_ESPINTCP
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-19 21:45 ` [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount Ahmed S. Darwish
@ 2020-05-19 22:01   ` Stephen Hemminger
  2020-05-19 22:23     ` Thomas Gleixner
  2020-05-20  2:01   ` Eric Dumazet
  2020-05-20 14:37   ` Dan Carpenter
  2 siblings, 1 reply; 38+ messages in thread
From: Stephen Hemminger @ 2020-05-19 22:01 UTC (permalink / raw)
  To: Ahmed S. Darwish
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Paul E. McKenney, Sebastian A. Siewior, Steven Rostedt, LKML,
	David S. Miller, Jakub Kicinski, netdev

On Tue, 19 May 2020 23:45:23 +0200
"Ahmed S. Darwish" <a.darwish@linutronix.de> wrote:

> Sequence counters write paths are critical sections that must never be
> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> 
> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> netdev name retrieval.") handled a deadlock, observed with
> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> infinitely spinning: it got scheduled after the seqcount write side
> blocked inside its own critical section.
> 
> To fix that deadlock, among other issues, the commit added a
> cond_resched() inside the read side section. While this will get the
> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> 
> The fix is also still broken: if the seqcount reader belongs to a
> real-time scheduling policy, it can spin forever and the kernel will
> livelock.
> 
> Disabling preemption over the seqcount write side critical section will
> not work: inside it are a number of GFP_KERNEL allocations and mutex
> locking through the drivers/base/ :: device_rename() call chain.
> 
> From all the above, replace the seqcount with a rwsem.
> 
> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Have your performance tested this with 1000's of network devices?

The reason seqcount logic is was done here was to achieve scaleablity
and a semaphore does not scale as well.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-19 22:01   ` Stephen Hemminger
@ 2020-05-19 22:23     ` Thomas Gleixner
  2020-05-19 23:11       ` Stephen Hemminger
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2020-05-19 22:23 UTC (permalink / raw)
  To: Stephen Hemminger, Ahmed S. Darwish
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E. McKenney,
	Sebastian A. Siewior, Steven Rostedt, LKML, David S. Miller,
	Jakub Kicinski, netdev

Stephen Hemminger <stephen@networkplumber.org> writes:
> On Tue, 19 May 2020 23:45:23 +0200
> "Ahmed S. Darwish" <a.darwish@linutronix.de> wrote:
>
>> Sequence counters write paths are critical sections that must never be
>> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
>> 
>> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
>> netdev name retrieval.") handled a deadlock, observed with
>> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
>> infinitely spinning: it got scheduled after the seqcount write side
>> blocked inside its own critical section.
>> 
>> To fix that deadlock, among other issues, the commit added a
>> cond_resched() inside the read side section. While this will get the
>> non-preemptible kernel eventually unstuck, the seqcount reader is fully
>> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
>> 
>> The fix is also still broken: if the seqcount reader belongs to a
>> real-time scheduling policy, it can spin forever and the kernel will
>> livelock.
>> 
>> Disabling preemption over the seqcount write side critical section will
>> not work: inside it are a number of GFP_KERNEL allocations and mutex
>> locking through the drivers/base/ :: device_rename() call chain.
>> 
>> From all the above, replace the seqcount with a rwsem.
>> 
>> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
>> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
>> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
>> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> Have your performance tested this with 1000's of network devices?

No. We did not. -ENOTESTCASE

> The reason seqcount logic is was done here was to achieve scaleablity
> and a semaphore does not scale as well.

That still does not make the livelock magically going away. Just make a
reader with real-time priority preempt the writer and the system stops
dead. The net result is perfomance <= 0.

This was observed on RT kernels without a special 1000's of network
devices test case.

Just for the record: This is not a RT specific problem. You can
reproduce that w/o an RT kernel as well. Just run the reader with
real-time scheduling policy.

As much as you hate it from a performance POV the only sane rule of
programming is: Correctness first.

And this code clearly violates that rule.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-19 22:23     ` Thomas Gleixner
@ 2020-05-19 23:11       ` Stephen Hemminger
  2020-05-19 23:42         ` Thomas Gleixner
  0 siblings, 1 reply; 38+ messages in thread
From: Stephen Hemminger @ 2020-05-19 23:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ahmed S. Darwish, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Paul E. McKenney, Sebastian A. Siewior, Steven Rostedt, LKML,
	David S. Miller, Jakub Kicinski, netdev

On Wed, 20 May 2020 00:23:48 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> Stephen Hemminger <stephen@networkplumber.org> writes:
> > On Tue, 19 May 2020 23:45:23 +0200
> > "Ahmed S. Darwish" <a.darwish@linutronix.de> wrote:
> >  
> >> Sequence counters write paths are critical sections that must never be
> >> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> >> 
> >> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> >> netdev name retrieval.") handled a deadlock, observed with
> >> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> >> infinitely spinning: it got scheduled after the seqcount write side
> >> blocked inside its own critical section.
> >> 
> >> To fix that deadlock, among other issues, the commit added a
> >> cond_resched() inside the read side section. While this will get the
> >> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> >> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> >> 
> >> The fix is also still broken: if the seqcount reader belongs to a
> >> real-time scheduling policy, it can spin forever and the kernel will
> >> livelock.
> >> 
> >> Disabling preemption over the seqcount write side critical section will
> >> not work: inside it are a number of GFP_KERNEL allocations and mutex
> >> locking through the drivers/base/ :: device_rename() call chain.
> >> 
> >> From all the above, replace the seqcount with a rwsem.
> >> 
> >> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
> >> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> >> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
> >> Cc: <stable@vger.kernel.org>
> >> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> >> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>  
> >
> > Have your performance tested this with 1000's of network devices?  
> 
> No. We did not. -ENOTESTCASE

Please try, it isn't that hard..

# time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done

real	0m17.002s
user	0m1.064s
sys	0m0.375s

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-19 23:11       ` Stephen Hemminger
@ 2020-05-19 23:42         ` Thomas Gleixner
  2020-05-20  0:06           ` Stephen Hemminger
  2020-05-20  2:57           ` David Miller
  0 siblings, 2 replies; 38+ messages in thread
From: Thomas Gleixner @ 2020-05-19 23:42 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Ahmed S. Darwish, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Paul E. McKenney, Sebastian A. Siewior, Steven Rostedt, LKML,
	David S. Miller, Jakub Kicinski, netdev

Stephen Hemminger <stephen@networkplumber.org> writes:
> On Wed, 20 May 2020 00:23:48 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>> No. We did not. -ENOTESTCASE
>
> Please try, it isn't that hard..
>
> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>
> real	0m17.002s
> user	0m1.064s
> sys	0m0.375s

And that solves the incorrectness of the current code in which way?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-19 23:42         ` Thomas Gleixner
@ 2020-05-20  0:06           ` Stephen Hemminger
  2020-05-20  1:55             ` Thomas Gleixner
  2020-05-20  2:57           ` David Miller
  1 sibling, 1 reply; 38+ messages in thread
From: Stephen Hemminger @ 2020-05-20  0:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ahmed S. Darwish, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Paul E. McKenney, Sebastian A. Siewior, Steven Rostedt, LKML,
	David S. Miller, Jakub Kicinski, netdev

On Wed, 20 May 2020 01:42:30 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> Stephen Hemminger <stephen@networkplumber.org> writes:
> > On Wed, 20 May 2020 00:23:48 +0200
> > Thomas Gleixner <tglx@linutronix.de> wrote:  
> >> No. We did not. -ENOTESTCASE  
> >
> > Please try, it isn't that hard..
> >
> > # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
> >
> > real	0m17.002s
> > user	0m1.064s
> > sys	0m0.375s  
> 
> And that solves the incorrectness of the current code in which way?

Agree that the current code is has evolved over time to a state where it is not
correct in the case of Preempt-RT. The motivation for the changes to seqcount
goes back many years when there were ISP's that were concerned about scaling of tunnels, vlans etc.

Is it too much to ask for a simple before/after test of your patch as part 
of the submission. You probably measure latency changes to the nanosecond.

Getting it correct without causing user complaints.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-20  0:06           ` Stephen Hemminger
@ 2020-05-20  1:55             ` Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: Thomas Gleixner @ 2020-05-20  1:55 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Ahmed S. Darwish, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Paul E. McKenney, Sebastian A. Siewior, Steven Rostedt, LKML,
	David S. Miller, Jakub Kicinski, netdev

Stephen Hemminger <stephen@networkplumber.org> writes:
> On Wed, 20 May 2020 01:42:30 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> Stephen Hemminger <stephen@networkplumber.org> writes:
>> > On Wed, 20 May 2020 00:23:48 +0200
>> > Thomas Gleixner <tglx@linutronix.de> wrote:  
>> >> No. We did not. -ENOTESTCASE  
>> >
>> > Please try, it isn't that hard..
>> >
>> > # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>> >
>> > real	0m17.002s
>> > user	0m1.064s
>> > sys	0m0.375s  
>> 
>> And that solves the incorrectness of the current code in which way?
>
> Agree that the current code is has evolved over time to a state where it is not
> correct in the case of Preempt-RT.

That's not a RT problem as explained in great length in the changelog
and as I pointed out in my previous reply.

 Realtime scheduling classes are available on stock kernels and all
 those attempts to "fix" the livelock problem are ignoring that fact.

Just because you or whoever involved are not using them or do not care
is not making the code more correct.

> The motivation for the changes to seqcount goes back many years when
> there were ISP's that were concerned about scaling of tunnels, vlans
> etc.

I completely understand where this comes from, but that is not a
justification for incorrect code at all.

> Is it too much to ask for a simple before/after test of your patch as part 
> of the submission. You probably measure latency changes to the
> nanosecond.

It's not too much to ask and I'm happy to provide the numbers.

But before I waste my time and produce them, can you please explain how
any numbers provided are going to change the fact that the code is
incorrect?

  A bug, is a bug no matter what the numbers are.

I don't have a insta reproducer at hand for the problem which made that
code go belly up, but the net result is simply:

      Before:			After:
	real	INFINTE         0mxx.yyys

And the 'Before' comes with the extra benefit of stall warnings (if
enabled in the config).

If you insist I surely can go the extra mile and write up the insta
reproducer and stick it into a bugzilla for you.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-19 21:45 ` [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount Ahmed S. Darwish
  2020-05-19 22:01   ` Stephen Hemminger
@ 2020-05-20  2:01   ` Eric Dumazet
  2020-05-20  6:42     ` Ahmed S. Darwish
  2020-05-20 14:37   ` Dan Carpenter
  2 siblings, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2020-05-20  2:01 UTC (permalink / raw)
  To: Ahmed S. Darwish, Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, David S. Miller, Jakub Kicinski, netdev



On 5/19/20 2:45 PM, Ahmed S. Darwish wrote:
> Sequence counters write paths are critical sections that must never be
> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> 
> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> netdev name retrieval.") handled a deadlock, observed with
> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> infinitely spinning: it got scheduled after the seqcount write side
> blocked inside its own critical section.
> 
> To fix that deadlock, among other issues, the commit added a
> cond_resched() inside the read side section. While this will get the
> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> 
> The fix is also still broken: if the seqcount reader belongs to a
> real-time scheduling policy, it can spin forever and the kernel will
> livelock.
> 
> Disabling preemption over the seqcount write side critical section will
> not work: inside it are a number of GFP_KERNEL allocations and mutex
> locking through the drivers/base/ :: device_rename() call chain.
> 
> From all the above, replace the seqcount with a rwsem.
> 
> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  net/core/dev.c | 30 ++++++++++++------------------
>  1 file changed, 12 insertions(+), 18 deletions(-)
>

Seems fine to me, assuming rwsem prevent starvation of the writer.

(Presumably this could be a per ndevice rwsem, or per netns, to provide some isolation)

Alternative would be to convert ndev->name from char array to a pointer (rcu protected),
but this looks quite invasive change, certainly not for stable branches.

Reviewed-by: Eric Dumazet <edumazet@google.com>



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-19 23:42         ` Thomas Gleixner
  2020-05-20  0:06           ` Stephen Hemminger
@ 2020-05-20  2:57           ` David Miller
  2020-05-20  3:18             ` Eric Dumazet
  2020-05-20 19:37             ` Thomas Gleixner
  1 sibling, 2 replies; 38+ messages in thread
From: David Miller @ 2020-05-20  2:57 UTC (permalink / raw)
  To: tglx
  Cc: stephen, a.darwish, peterz, mingo, will, paulmck, bigeasy,
	rostedt, linux-kernel, kuba, netdev

From: Thomas Gleixner <tglx@linutronix.de>
Date: Wed, 20 May 2020 01:42:30 +0200

> Stephen Hemminger <stephen@networkplumber.org> writes:
>> On Wed, 20 May 2020 00:23:48 +0200
>> Thomas Gleixner <tglx@linutronix.de> wrote:
>>> No. We did not. -ENOTESTCASE
>>
>> Please try, it isn't that hard..
>>
>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>>
>> real	0m17.002s
>> user	0m1.064s
>> sys	0m0.375s
> 
> And that solves the incorrectness of the current code in which way?

You mentioned that there wasn't a test case, he gave you one to try.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-20  2:57           ` David Miller
@ 2020-05-20  3:18             ` Eric Dumazet
  2020-05-20  4:36               ` Stephen Hemminger
  2020-05-20 19:37             ` Thomas Gleixner
  1 sibling, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2020-05-20  3:18 UTC (permalink / raw)
  To: David Miller, tglx
  Cc: stephen, a.darwish, peterz, mingo, will, paulmck, bigeasy,
	rostedt, linux-kernel, kuba, netdev



On 5/19/20 7:57 PM, David Miller wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Wed, 20 May 2020 01:42:30 +0200
> 
>> Stephen Hemminger <stephen@networkplumber.org> writes:
>>> On Wed, 20 May 2020 00:23:48 +0200
>>> Thomas Gleixner <tglx@linutronix.de> wrote:
>>>> No. We did not. -ENOTESTCASE
>>>
>>> Please try, it isn't that hard..
>>>
>>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>>>
>>> real	0m17.002s
>>> user	0m1.064s
>>> sys	0m0.375s
>>
>> And that solves the incorrectness of the current code in which way?
> 
> You mentioned that there wasn't a test case, he gave you one to try.
> 

I do not think this would ever use device rename, nor netdev_get_name()

None of this stuff is fast path really.

# time for ((i=1;i<1000;i++)); do ip li add dev dummy$i type dummy; done

real	0m1.127s
user	0m0.270s
sys	0m1.039s

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-20  3:18             ` Eric Dumazet
@ 2020-05-20  4:36               ` Stephen Hemminger
  0 siblings, 0 replies; 38+ messages in thread
From: Stephen Hemminger @ 2020-05-20  4:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, tglx, a.darwish, peterz, mingo, will, paulmck,
	bigeasy, rostedt, linux-kernel, kuba, netdev

On Tue, 19 May 2020 20:18:19 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On 5/19/20 7:57 PM, David Miller wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> > Date: Wed, 20 May 2020 01:42:30 +0200
> >   
> >> Stephen Hemminger <stephen@networkplumber.org> writes:  
> >>> On Wed, 20 May 2020 00:23:48 +0200
> >>> Thomas Gleixner <tglx@linutronix.de> wrote:  
> >>>> No. We did not. -ENOTESTCASE  
> >>>
> >>> Please try, it isn't that hard..
> >>>
> >>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
> >>>
> >>> real	0m17.002s
> >>> user	0m1.064s
> >>> sys	0m0.375s  
> >>
> >> And that solves the incorrectness of the current code in which way?  
> > 
> > You mentioned that there wasn't a test case, he gave you one to try.
> >   
> 
> I do not think this would ever use device rename, nor netdev_get_name()
> 
> None of this stuff is fast path really.
> 
> # time for ((i=1;i<1000;i++)); do ip li add dev dummy$i type dummy; done
> 
> real	0m1.127s
> user	0m0.270s
> sys	0m1.039s

Your right it is a weak test, and most of the overhead is in the syscall
and all netlink events that happen.

It does end up looking up the new name, so would exercise that.
Better test is to use %d syntax or create 1000 dummy's then rename every one.

This is more of a stress test
# for ((i=0;i<1000;i++)); do echo link add dev dummy%d type dummy; done | time ip -batch -
0.00user 0.29system 0:02.11elapsed 13%CPU (0avgtext+0avgdata 2544maxresident)k
0inputs+0outputs (0major+148minor)pagefaults 0swaps

# for ((i=999;i>=0;i--)); do echo link set dummy$i name dummy$((i+1)); done | time ip -batch -
0.00user 0.26system 0:54.98elapsed 0%CPU (0avgtext+0avgdata 2508maxresident)k
0inputs+0outputs (0major+145minor)pagefaults 0swaps


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-20  2:01   ` Eric Dumazet
@ 2020-05-20  6:42     ` Ahmed S. Darwish
  2020-05-20 12:51       ` Eric Dumazet
  0 siblings, 1 reply; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-05-20  6:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, David S. Miller, Jakub Kicinski, netdev

Hello Eric,

On Tue, May 19, 2020 at 07:01:38PM -0700, Eric Dumazet wrote:
>
> On 5/19/20 2:45 PM, Ahmed S. Darwish wrote:
> > Sequence counters write paths are critical sections that must never be
> > preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> >
> > Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> > netdev name retrieval.") handled a deadlock, observed with
> > CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> > infinitely spinning: it got scheduled after the seqcount write side
> > blocked inside its own critical section.
> >
> > To fix that deadlock, among other issues, the commit added a
> > cond_resched() inside the read side section. While this will get the
> > non-preemptible kernel eventually unstuck, the seqcount reader is fully
> > exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> >
> > The fix is also still broken: if the seqcount reader belongs to a
> > real-time scheduling policy, it can spin forever and the kernel will
> > livelock.
> >
> > Disabling preemption over the seqcount write side critical section will
> > not work: inside it are a number of GFP_KERNEL allocations and mutex
> > locking through the drivers/base/ :: device_rename() call chain.
> >
> > From all the above, replace the seqcount with a rwsem.
> >
> > Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
> > Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> > Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
> > Cc: <stable@vger.kernel.org>
> > Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> > Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > ---
> >  net/core/dev.c | 30 ++++++++++++------------------
> >  1 file changed, 12 insertions(+), 18 deletions(-)
> >
>
> Seems fine to me, assuming rwsem prevent starvation of the writer.
>

Thanks for the review.

AFAIK, due to 5cfd92e12e13 ("locking/rwsem: Adaptive disabling of reader
optimistic spinning"), using a rwsem shouldn't lead to writer starvation
in the contended case.

--
Ahmed S. Darwish
Linutronix GmbH

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-20  6:42     ` Ahmed S. Darwish
@ 2020-05-20 12:51       ` Eric Dumazet
  2020-06-03 14:33         ` Ahmed S. Darwish
  0 siblings, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2020-05-20 12:51 UTC (permalink / raw)
  To: Ahmed S. Darwish, Eric Dumazet
  Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, David S. Miller, Jakub Kicinski, netdev



On 5/19/20 11:42 PM, Ahmed S. Darwish wrote:
> Hello Eric,
> 
> On Tue, May 19, 2020 at 07:01:38PM -0700, Eric Dumazet wrote:
>>
>> On 5/19/20 2:45 PM, Ahmed S. Darwish wrote:
>>> Sequence counters write paths are critical sections that must never be
>>> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
>>>
>>> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
>>> netdev name retrieval.") handled a deadlock, observed with
>>> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
>>> infinitely spinning: it got scheduled after the seqcount write side
>>> blocked inside its own critical section.
>>>
>>> To fix that deadlock, among other issues, the commit added a
>>> cond_resched() inside the read side section. While this will get the
>>> non-preemptible kernel eventually unstuck, the seqcount reader is fully
>>> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
>>>
>>> The fix is also still broken: if the seqcount reader belongs to a
>>> real-time scheduling policy, it can spin forever and the kernel will
>>> livelock.
>>>
>>> Disabling preemption over the seqcount write side critical section will
>>> not work: inside it are a number of GFP_KERNEL allocations and mutex
>>> locking through the drivers/base/ :: device_rename() call chain.
>>>
>>> From all the above, replace the seqcount with a rwsem.
>>>
>>> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
>>> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
>>> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
>>> Cc: <stable@vger.kernel.org>
>>> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
>>> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>>> ---
>>>  net/core/dev.c | 30 ++++++++++++------------------
>>>  1 file changed, 12 insertions(+), 18 deletions(-)
>>>
>>
>> Seems fine to me, assuming rwsem prevent starvation of the writer.
>>
> 
> Thanks for the review.
> 
> AFAIK, due to 5cfd92e12e13 ("locking/rwsem: Adaptive disabling of reader
> optimistic spinning"), using a rwsem shouldn't lead to writer starvation
> in the contended case.

Hmm this was in linux-5.3, so very recent stuff.

Has this patch been backported to stable releases ?

With all the Fixes: tags you added, stable teams will backport this networking patch to
all stable versions.

Do we have a way to tune a dedicare rwsem to 'give preference to the (unique in this case) writer" over
a myriad of potential readers ?

Thanks.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-19 21:45 ` [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount Ahmed S. Darwish
  2020-05-19 22:01   ` Stephen Hemminger
  2020-05-20  2:01   ` Eric Dumazet
@ 2020-05-20 14:37   ` Dan Carpenter
  2020-05-25 16:22     ` Ahmed S. Darwish
  2 siblings, 1 reply; 38+ messages in thread
From: Dan Carpenter @ 2020-05-20 14:37 UTC (permalink / raw)
  To: kbuild, Ahmed S. Darwish, Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: lkp, kbuild-all, Thomas Gleixner, Paul E. McKenney,
	Sebastian A. Siewior, Steven Rostedt, LKML, Ahmed S. Darwish,
	Jakub Kicinski, netdev

[-- Attachment #1: Type: text/plain, Size: 2944 bytes --]

Hi "Ahmed,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/locking/core]
[also build test WARNING on nf-next/master nf/master tip/timers/core linus/master v5.7-rc6 next-20200519]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Ahmed-S-Darwish/seqlock-Extend-seqcount-API-with-associated-locks/20200520-055145
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 23b5ae2e8e1326c91b5dfdbb6ebcd5a6820074ae
config: x86_64-defconfig (attached as .config)

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

smatch warnings:
net/core/dev.c:953 netdev_get_name() warn: inconsistent returns 'devnet_rename_sem'.

# https://github.com/0day-ci/linux/commit/2354e271ada778bbb935d7b20113693710905cff
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 2354e271ada778bbb935d7b20113693710905cff
vim +/devnet_rename_sem +953 net/core/dev.c

5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  935  int netdev_get_name(struct net *net, char *name, int ifindex)
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  936  {
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  937  	struct net_device *dev;
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  938  
2354e271ada778b Ahmed S. Darwish 2020-05-19  939  	down_read(&devnet_rename_sem);
                                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

2354e271ada778b Ahmed S. Darwish 2020-05-19  940  
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  941  	rcu_read_lock();
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  942  	dev = dev_get_by_index_rcu(net, ifindex);
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  943  	if (!dev) {
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  944  		rcu_read_unlock();
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  945  		return -ENODEV;
                                                                ^^^^^^^^^^^^^^
We need to drop the new semaphore on error.

5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  946  	}
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  947  
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  948  	strcpy(name, dev->name);
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  949  	rcu_read_unlock();
2354e271ada778b Ahmed S. Darwish 2020-05-19  950  
2354e271ada778b Ahmed S. Darwish 2020-05-19  951  	up_read(&devnet_rename_sem);
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  952  
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 @953  	return 0;
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  954  }

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29111 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-20  2:57           ` David Miller
  2020-05-20  3:18             ` Eric Dumazet
@ 2020-05-20 19:37             ` Thomas Gleixner
  2020-05-20 21:36               ` Stephen Hemminger
  1 sibling, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2020-05-20 19:37 UTC (permalink / raw)
  To: David Miller
  Cc: stephen, a.darwish, peterz, mingo, will, paulmck, bigeasy,
	rostedt, linux-kernel, kuba, netdev

David Miller <davem@davemloft.net> writes:
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Wed, 20 May 2020 01:42:30 +0200
>>> Please try, it isn't that hard..
>>>
>>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>>>
>>> real	0m17.002s
>>> user	0m1.064s
>>> sys	0m0.375s
>> 
>> And that solves the incorrectness of the current code in which way?
>
> You mentioned that there wasn't a test case, he gave you one to try.

If it makes you happy to compare incorrrect code with correct code, here
you go:

5 runs of 1000 device add, 1000 device rename and 1000 device del

CONFIG_PREEMPT_NONE=y

         Base      rwsem
 add     0:05.01   0:05.28
	 0:05.93   0:06.11
	 0:06.52   0:06.26
	 0:06.06   0:05.74
	 0:05.71   0:06.07

 rename  0:32.57   0:33.04
	 0:32.91   0:32.45
	 0:32.72   0:32.53
	 0:39.65   0:34.18
	 0:34.52   0:32.50

 delete  3:48.65   3:48.91
	 3:49.66   3:49.13
	 3:45.29   3:48.26
	 3:47.56   3:46.60
	 3:50.01   3:48.06

 -------------------------

CONFIG_PREEMPT_VOLUNTARY=y

         Base      rwsem
 add     0:06.80   0:06.42
	 0:04.77   0:05.03
	 0:05.74   0:04.62
	 0:05.87   0:04.34
	 0:04.20   0:04.12

 rename  0:33.33   0:42.02
	 0:42.36   0:32.55
	 0:39.58   0:31.60
	 0:33.69   0:35.08
	 0:34.24   0:33.97

 delete  3:47.82   3:44.00
	 3:47.42   3:51.00
	 3:48.52   3:48.88
	 3:48.50   3:48.09
	 3:50.03   3:46.56

 -------------------------

CONFIG_PREEMPT=y

         Base      rwsem

 add     0:07.89   0:07.72
	 0:07.25   0:06.72
	 0:07.42   0:06.51
	 0:06.92   0:06.38
	 0:06.20   0:06.72

 rename  0:41.77   0:32.39
	 0:44.29   0:33.29
	 0:36.19   0:34.86
	 0:33.19   0:35.06
	 0:37.00   0:34.78

 delete  2:36.96   2:39.97
	 2:37.80   2:42.19
	 2:44.66   2:48.40
	 2:39.75   2:41.02
	 2:40.77   2:38.36

The runtime variation is rather large and when running the same in a VM
I got complete random numbers for both base and rwsem. The most amazing
was delete where the time varies from 30s to 6m20s.

Btw, Sebastian noticed that rename spams dmesg:

  netdev_info(dev, "renamed from %s\n", oldname);

which eats about 50% of the Rename run time.

         Base      netdev_info() removed

Rename   0:34.84   0:17.48

That number at least makes tons of sense

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-20 19:37             ` Thomas Gleixner
@ 2020-05-20 21:36               ` Stephen Hemminger
  0 siblings, 0 replies; 38+ messages in thread
From: Stephen Hemminger @ 2020-05-20 21:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Miller, a.darwish, peterz, mingo, will, paulmck, bigeasy,
	rostedt, linux-kernel, kuba, netdev

On Wed, 20 May 2020 21:37:11 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> David Miller <davem@davemloft.net> writes:
> > From: Thomas Gleixner <tglx@linutronix.de>
> > Date: Wed, 20 May 2020 01:42:30 +0200  
> >>> Please try, it isn't that hard..
> >>>
> >>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
> >>>
> >>> real	0m17.002s
> >>> user	0m1.064s
> >>> sys	0m0.375s  
> >> 
> >> And that solves the incorrectness of the current code in which way?  
> >
> > You mentioned that there wasn't a test case, he gave you one to try.  
> 
> If it makes you happy to compare incorrrect code with correct code, here
> you go:
> 
> 5 runs of 1000 device add, 1000 device rename and 1000 device del
> 
> CONFIG_PREEMPT_NONE=y
> 
>          Base      rwsem
>  add     0:05.01   0:05.28
> 	 0:05.93   0:06.11
> 	 0:06.52   0:06.26
> 	 0:06.06   0:05.74
> 	 0:05.71   0:06.07
> 
>  rename  0:32.57   0:33.04
> 	 0:32.91   0:32.45
> 	 0:32.72   0:32.53
> 	 0:39.65   0:34.18
> 	 0:34.52   0:32.50
> 
>  delete  3:48.65   3:48.91
> 	 3:49.66   3:49.13
> 	 3:45.29   3:48.26
> 	 3:47.56   3:46.60
> 	 3:50.01   3:48.06
> 
>  -------------------------
> 
> CONFIG_PREEMPT_VOLUNTARY=y
> 
>          Base      rwsem
>  add     0:06.80   0:06.42
> 	 0:04.77   0:05.03
> 	 0:05.74   0:04.62
> 	 0:05.87   0:04.34
> 	 0:04.20   0:04.12
> 
>  rename  0:33.33   0:42.02
> 	 0:42.36   0:32.55
> 	 0:39.58   0:31.60
> 	 0:33.69   0:35.08
> 	 0:34.24   0:33.97
> 
>  delete  3:47.82   3:44.00
> 	 3:47.42   3:51.00
> 	 3:48.52   3:48.88
> 	 3:48.50   3:48.09
> 	 3:50.03   3:46.56
> 
>  -------------------------
> 
> CONFIG_PREEMPT=y
> 
>          Base      rwsem
> 
>  add     0:07.89   0:07.72
> 	 0:07.25   0:06.72
> 	 0:07.42   0:06.51
> 	 0:06.92   0:06.38
> 	 0:06.20   0:06.72
> 
>  rename  0:41.77   0:32.39
> 	 0:44.29   0:33.29
> 	 0:36.19   0:34.86
> 	 0:33.19   0:35.06
> 	 0:37.00   0:34.78
> 
>  delete  2:36.96   2:39.97
> 	 2:37.80   2:42.19
> 	 2:44.66   2:48.40
> 	 2:39.75   2:41.02
> 	 2:40.77   2:38.36
> 
> The runtime variation is rather large and when running the same in a VM
> I got complete random numbers for both base and rwsem. The most amazing
> was delete where the time varies from 30s to 6m20s.
> 
> Btw, Sebastian noticed that rename spams dmesg:
> 
>   netdev_info(dev, "renamed from %s\n", oldname);
> 
> which eats about 50% of the Rename run time.
> 
>          Base      netdev_info() removed
> 
> Rename   0:34.84   0:17.48
> 
> That number at least makes tons of sense
> 
> Thanks,
> 
>         tglx

Looks good thanks for following through.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-20 14:37   ` Dan Carpenter
@ 2020-05-25 16:22     ` Ahmed S. Darwish
  0 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-05-25 16:22 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: kbuild, Peter Zijlstra, Ingo Molnar, Will Deacon, lkp,
	kbuild-all, Thomas Gleixner, Paul E. McKenney,
	Sebastian A. Siewior, Steven Rostedt, LKML, Jakub Kicinski,
	netdev

On Wed, May 20, 2020 at 05:37:07PM +0300, Dan Carpenter wrote:
...
>
> smatch warnings:
> net/core/dev.c:953 netdev_get_name() warn: inconsistent returns 'devnet_rename_sem'.
>
...
>
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  935  int netdev_get_name(struct net *net, char *name, int ifindex)
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  936  {
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  937  	struct net_device *dev;
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  938
> 2354e271ada778b Ahmed S. Darwish 2020-05-19  939  	down_read(&devnet_rename_sem);
>                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> 2354e271ada778b Ahmed S. Darwish 2020-05-19  940
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  941  	rcu_read_lock();
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  942  	dev = dev_get_by_index_rcu(net, ifindex);
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  943  	if (!dev) {
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  944  		rcu_read_unlock();
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  945  		return -ENODEV;
>                                                               ^^^^^^^^^^^^^^

Oh, shouldn't have missed that. Will fix in v2.

Thanks,

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount
  2020-05-20 12:51       ` Eric Dumazet
@ 2020-06-03 14:33         ` Ahmed S. Darwish
  0 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-06-03 14:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, David S. Miller, Jakub Kicinski, netdev

On Wed, May 20, 2020 at 05:51:27AM -0700, Eric Dumazet wrote:
>
> On 5/19/20 11:42 PM, Ahmed S. Darwish wrote:
> > Hello Eric,
> >
> > On Tue, May 19, 2020 at 07:01:38PM -0700, Eric Dumazet wrote:
> >>
> >> On 5/19/20 2:45 PM, Ahmed S. Darwish wrote:
> >>> Sequence counters write paths are critical sections that must never be
> >>> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> >>>
> >>> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> >>> netdev name retrieval.") handled a deadlock, observed with
> >>> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> >>> infinitely spinning: it got scheduled after the seqcount write side
> >>> blocked inside its own critical section.
> >>>
> >>> To fix that deadlock, among other issues, the commit added a
> >>> cond_resched() inside the read side section. While this will get the
> >>> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> >>> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> >>>
> >>> The fix is also still broken: if the seqcount reader belongs to a
> >>> real-time scheduling policy, it can spin forever and the kernel will
> >>> livelock.
> >>>
> >>> Disabling preemption over the seqcount write side critical section will
> >>> not work: inside it are a number of GFP_KERNEL allocations and mutex
> >>> locking through the drivers/base/ :: device_rename() call chain.
> >>>
> >>> From all the above, replace the seqcount with a rwsem.
> >>>
> >>> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
> >>> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> >>> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
> >>> Cc: <stable@vger.kernel.org>
> >>> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> >>> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> >>> ---
> >>>  net/core/dev.c | 30 ++++++++++++------------------
> >>>  1 file changed, 12 insertions(+), 18 deletions(-)
> >>>
> >>
> >> Seems fine to me, assuming rwsem prevent starvation of the writer.
> >>
> >
> > Thanks for the review.
> >
> > AFAIK, due to 5cfd92e12e13 ("locking/rwsem: Adaptive disabling of reader
> > optimistic spinning"), using a rwsem shouldn't lead to writer starvation
> > in the contended case.
>
> Hmm this was in linux-5.3, so very recent stuff.
>
> Has this patch been backported to stable releases ?
>
> With all the Fixes: tags you added, stable teams will backport this
> networking patch to all stable versions.
>
> Do we have a way to tune a dedicare rwsem to 'give preference to the
> (unique in this case) writer" over a myriad of potential readers ?
>

I was wrong in referencing the commit 5cfd92e12e13 above.

Before and after that commit, once a rwsem writer is blocking, all
subsequent readers will block until that writer makes progress.

Given that behavior, and that the read section is already quite short, I
don't think there's any danger incurred on writers here.

(a v2 will be sent shortly, fixing the error found Dan/kbuild-bot.)

Thanks,

--
Ahmed S. Darwish
Linutronix GmbH

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2 00/18] seqlock: Extend seqcount API with associated locks
  2020-05-19 21:45 [PATCH v1 00/25] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
                   ` (5 preceding siblings ...)
  2020-05-19 21:45 ` [PATCH v1 17/25] xfrm: policy: Use sequence counters with associated lock Ahmed S. Darwish
@ 2020-06-08  0:57 ` Ahmed S. Darwish
  2020-06-08  0:57   ` [PATCH v2 08/18] netfilter: conntrack: Use sequence counter with associated spinlock Ahmed S. Darwish
                     ` (2 more replies)
  2020-06-30  5:44 ` [PATCH v3 00/20] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
  2020-07-20 15:55 ` [PATCH v4 00/24] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
  8 siblings, 3 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-06-08  0:57 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Jonathan Corbet,
	linux-doc, David Airlie, Daniel Vetter, dri-devel,
	David S. Miller, netdev, Jens Axboe, linux-block, Alexander Viro,
	linux-fsdevel

Hi,

This is v2 of the seqlock patch series:

   [PATCH v1 00/25] seqlock: Extend seqcount API with associated locks
   https://lore.kernel.org/lkml/20200519214547.352050-1-a.darwish@linutronix.de

Patches 1=>3 of this v2 series add documentation for the existing
seqlock.h datatypes and APIs. Hopefully they can hit v5.8 -rc2 or -rc3.

Changelog-v2
============

1. Drop, for now, the seqlock v1 patches #7 and #8. These patches added
lockdep non-preemptibility checks to seqcount_t write paths, but they
now depend on on-going work by Peter:

   [PATCH v3 0/5] lockdep: Change IRQ state tracking to use per-cpu variables
   https://lkml.kernel.org/r/20200529213550.683440625@infradead.org

   [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi
   https://lkml.kernel.org/r/20200529212728.795169701@infradead.org

Once Peter's work get merged, I'll send the non-preemptibility checks as
a separate series.

2. Drop the v1 seqcount_t call-sites bugfixes. I've already posted them
in an isolated series. They got merged into their respective trees, and
will hit v5.8-rc1 soon:

   [PATCH v2 0/6] seqlock: seqcount_t call sites bugfixes
   https://lore.kernel.org/lkml/20200603144949.1122421-1-a.darwish@linutronix.de

3. Patch #1: Add a small paragraph explaining that seqcount_t/seqlock_t
cannot be used if the protected data contains pointers. A similar
paragraph already existed in seqlock.h, but got mistakenly dropped.

4. Patch #2: Don't add RST directives inside kernel-doc comments. Peter
doesn't like them :) I've kept the indentation though, and found a
minimal way for Sphinx to properly render these code samples without too
much disruption.

5. Patch #3: Brush up the introduced kernel-doc comments. Make them more
consistent overall, and more concise.

Thanks,

8<--------------

Ahmed S. Darwish (18):
  Documentation: locking: Describe seqlock design and usage
  seqlock: Properly format kernel-doc code samples
  seqlock: Add missing kernel-doc annotations
  seqlock: Extend seqcount API with associated locks
  dma-buf: Remove custom seqcount lockdep class key
  dma-buf: Use sequence counter with associated wound/wait mutex
  sched: tasks: Use sequence counter with associated spinlock
  netfilter: conntrack: Use sequence counter with associated spinlock
  netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
  xfrm: policy: Use sequence counters with associated lock
  timekeeping: Use sequence counter with associated raw spinlock
  vfs: Use sequence counter with associated spinlock
  raid5: Use sequence counter with associated spinlock
  iocost: Use sequence counter with associated spinlock
  NFSv4: Use sequence counter with associated spinlock
  userfaultfd: Use sequence counter with associated spinlock
  kvm/eventfd: Use sequence counter with associated spinlock
  hrtimer: Use sequence counter with associated raw spinlock

 Documentation/locking/index.rst               |   1 +
 Documentation/locking/seqlock.rst             | 242 +++++
 MAINTAINERS                                   |   2 +-
 block/blk-iocost.c                            |   5 +-
 drivers/dma-buf/dma-resv.c                    |  15 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   2 -
 drivers/md/raid5.c                            |   2 +-
 drivers/md/raid5.h                            |   2 +-
 fs/dcache.c                                   |   2 +-
 fs/fs_struct.c                                |   4 +-
 fs/nfs/nfs4_fs.h                              |   2 +-
 fs/nfs/nfs4state.c                            |   2 +-
 fs/userfaultfd.c                              |   4 +-
 include/linux/dcache.h                        |   2 +-
 include/linux/dma-resv.h                      |   4 +-
 include/linux/fs_struct.h                     |   2 +-
 include/linux/hrtimer.h                       |   2 +-
 include/linux/kvm_irqfd.h                     |   2 +-
 include/linux/sched.h                         |   2 +-
 include/linux/seqlock.h                       | 855 ++++++++++++++----
 include/linux/seqlock_types_internal.h        | 187 ++++
 include/net/netfilter/nf_conntrack.h          |   2 +-
 init/init_task.c                              |   3 +-
 kernel/fork.c                                 |   2 +-
 kernel/time/hrtimer.c                         |  13 +-
 kernel/time/timekeeping.c                     |  19 +-
 net/netfilter/nf_conntrack_core.c             |   5 +-
 net/netfilter/nft_set_rbtree.c                |   4 +-
 net/xfrm/xfrm_policy.c                        |  10 +-
 virt/kvm/eventfd.c                            |   2 +-
 30 files changed, 1175 insertions(+), 226 deletions(-)
 create mode 100644 Documentation/locking/seqlock.rst
 create mode 100644 include/linux/seqlock_types_internal.h

base-commit: 3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162
--
2.20.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2 08/18] netfilter: conntrack: Use sequence counter with associated spinlock
  2020-06-08  0:57 ` [PATCH v2 00/18] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
@ 2020-06-08  0:57   ` Ahmed S. Darwish
  2020-06-08  0:57   ` [PATCH v2 09/18] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock Ahmed S. Darwish
  2020-06-08  0:57   ` [PATCH v2 10/18] xfrm: policy: Use sequence counters with associated lock Ahmed S. Darwish
  2 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-06-08  0:57 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	Jakub Kicinski, netfilter-devel, coreteam, netdev

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 include/net/netfilter/nf_conntrack.h | 2 +-
 net/netfilter/nf_conntrack_core.c    | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 90690e37a56f..ea4e2010b246 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -286,7 +286,7 @@ int nf_conntrack_hash_resize(unsigned int hashsize);
 
 extern struct hlist_nulls_head *nf_conntrack_hash;
 extern unsigned int nf_conntrack_htable_size;
-extern seqcount_t nf_conntrack_generation;
+extern seqcount_spinlock_t nf_conntrack_generation;
 extern unsigned int nf_conntrack_max;
 
 /* must be called with rcu read lock held */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index bb72ca5f3999..1f9518569195 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -180,7 +180,7 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 
 unsigned int nf_conntrack_max __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_max);
-seqcount_t nf_conntrack_generation __read_mostly;
+seqcount_spinlock_t nf_conntrack_generation __read_mostly;
 static unsigned int nf_conntrack_hash_rnd __read_mostly;
 
 static u32 hash_conntrack_raw(const struct nf_conntrack_tuple *tuple,
@@ -2589,7 +2589,8 @@ int nf_conntrack_init_start(void)
 	/* struct nf_ct_ext uses u8 to store offsets/size */
 	BUILD_BUG_ON(total_extension_size() > 255u);
 
-	seqcount_init(&nf_conntrack_generation);
+	seqcount_spinlock_init(&nf_conntrack_generation,
+			       &nf_conntrack_locks_all_lock);
 
 	for (i = 0; i < CONNTRACK_LOCKS; i++)
 		spin_lock_init(&nf_conntrack_locks[i]);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 09/18] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
  2020-06-08  0:57 ` [PATCH v2 00/18] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
  2020-06-08  0:57   ` [PATCH v2 08/18] netfilter: conntrack: Use sequence counter with associated spinlock Ahmed S. Darwish
@ 2020-06-08  0:57   ` Ahmed S. Darwish
  2020-06-08  0:57   ` [PATCH v2 10/18] xfrm: policy: Use sequence counters with associated lock Ahmed S. Darwish
  2 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-06-08  0:57 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	Jakub Kicinski, netfilter-devel, coreteam, netdev

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_rwlock_t data type, which allows to associate a
rwlock with the sequence counter. This enables lockdep to verify that
the rwlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 net/netfilter/nft_set_rbtree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 62f416bc0579..9f58261ee4c7 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -18,7 +18,7 @@
 struct nft_rbtree {
 	struct rb_root		root;
 	rwlock_t		lock;
-	seqcount_t		count;
+	seqcount_rwlock_t	count;
 	struct delayed_work	gc_work;
 };
 
@@ -516,7 +516,7 @@ static int nft_rbtree_init(const struct nft_set *set,
 	struct nft_rbtree *priv = nft_set_priv(set);
 
 	rwlock_init(&priv->lock);
-	seqcount_init(&priv->count);
+	seqcount_rwlock_init(&priv->count, &priv->lock);
 	priv->root = RB_ROOT;
 
 	INIT_DEFERRABLE_WORK(&priv->gc_work, nft_rbtree_gc);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 10/18] xfrm: policy: Use sequence counters with associated lock
  2020-06-08  0:57 ` [PATCH v2 00/18] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
  2020-06-08  0:57   ` [PATCH v2 08/18] netfilter: conntrack: Use sequence counter with associated spinlock Ahmed S. Darwish
  2020-06-08  0:57   ` [PATCH v2 09/18] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock Ahmed S. Darwish
@ 2020-06-08  0:57   ` Ahmed S. Darwish
  2 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-06-08  0:57 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Steffen Klassert,
	Herbert Xu, David S. Miller, Jakub Kicinski, netdev

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. If the serialization primitive is
not disabling preemption implicitly, preemption has to be explicitly
disabled before entering the sequence counter write side critical
section.

A plain seqcount_t does not contain the information of which lock must
be held when entering a write side critical section.

Use the new seqcount_spinlock_t and seqcount_mutex_t data types instead,
which allow to associate a lock with the sequence counter. This enables
lockdep to verify that the lock used for writer serialization is held
when the write side critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 net/xfrm/xfrm_policy.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 564aa6492e7c..732a940468b0 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -122,7 +122,7 @@ struct xfrm_pol_inexact_bin {
 	/* list containing '*:*' policies */
 	struct hlist_head hhead;
 
-	seqcount_t count;
+	seqcount_spinlock_t count;
 	/* tree sorted by daddr/prefix */
 	struct rb_root root_d;
 
@@ -155,7 +155,7 @@ static struct xfrm_policy_afinfo const __rcu *xfrm_policy_afinfo[AF_INET6 + 1]
 						__read_mostly;
 
 static struct kmem_cache *xfrm_dst_cache __ro_after_init;
-static __read_mostly seqcount_t xfrm_policy_hash_generation;
+static __read_mostly seqcount_mutex_t xfrm_policy_hash_generation;
 
 static struct rhashtable xfrm_policy_inexact_table;
 static const struct rhashtable_params xfrm_pol_inexact_params;
@@ -719,7 +719,7 @@ xfrm_policy_inexact_alloc_bin(const struct xfrm_policy *pol, u8 dir)
 	INIT_HLIST_HEAD(&bin->hhead);
 	bin->root_d = RB_ROOT;
 	bin->root_s = RB_ROOT;
-	seqcount_init(&bin->count);
+	seqcount_spinlock_init(&bin->count, &net->xfrm.xfrm_policy_lock);
 
 	prev = rhashtable_lookup_get_insert_key(&xfrm_policy_inexact_table,
 						&bin->k, &bin->head,
@@ -1906,7 +1906,7 @@ static int xfrm_policy_match(const struct xfrm_policy *pol,
 
 static struct xfrm_pol_inexact_node *
 xfrm_policy_lookup_inexact_addr(const struct rb_root *r,
-				seqcount_t *count,
+				seqcount_spinlock_t *count,
 				const xfrm_address_t *addr, u16 family)
 {
 	const struct rb_node *parent;
@@ -4153,7 +4153,7 @@ void __init xfrm_init(void)
 {
 	register_pernet_subsys(&xfrm_net_ops);
 	xfrm_dev_init();
-	seqcount_init(&xfrm_policy_hash_generation);
+	seqcount_mutex_init(&xfrm_policy_hash_generation, &hash_resize_mutex);
 	xfrm_input_init();
 
 #ifdef CONFIG_INET_ESPINTCP
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 00/20] seqlock: Extend seqcount API with associated locks
  2020-05-19 21:45 [PATCH v1 00/25] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
                   ` (6 preceding siblings ...)
  2020-06-08  0:57 ` [PATCH v2 00/18] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
@ 2020-06-30  5:44 ` Ahmed S. Darwish
  2020-06-30  5:44   ` [PATCH v3 10/20] netfilter: conntrack: Use sequence counter with associated spinlock Ahmed S. Darwish
                     ` (2 more replies)
  2020-07-20 15:55 ` [PATCH v4 00/24] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
  8 siblings, 3 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-06-30  5:44 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Jonathan Corbet,
	linux-doc, David Airlie, Daniel Vetter, dri-devel,
	David S. Miller, netdev, Jens Axboe, linux-block, Alexander Viro,
	linux-fsdevel

Hi,

This is v3 of the seqlock patch series:

   [PATCH v1 00/25] seqlock: Extend seqcount API with associated locks
   https://lore.kernel.org/lkml/20200519214547.352050-1-a.darwish@linutronix.de

   [PATCH v2 00/18]
   https://lore.kernel.org/lkml/20200608005729.1874024-1-a.darwish@linutronix.de

It's based over:

   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git locking/core

to get Peter's lockdep irqstate tracking series below, which untangles
mainline seqlock.h<=>sched.h 'current->' task_struct circular dependency:

   https://lkml.kernel.org/r/linuxppc-dev/20200623083645.277342609@infradead.org

Changelog-v3:

 - Re-add lockdep non-preemptibility checks on seqcount_t write paths.
   They were removed from v2 due to the circular dependencies mentioned.

 - Slight rebase over the new v5.8-rc1 KCSAN seqlock.h changes

 - Collect seqcount_t call-sites acked-by tags

Thanks,

8<--------------

Ahmed S. Darwish (20):
  Documentation: locking: Describe seqlock design and usage
  seqlock: Properly format kernel-doc code samples
  seqlock: Add missing kernel-doc annotations
  lockdep: Add preemption enabled/disabled assertion APIs
  seqlock: lockdep assert non-preemptibility on seqcount_t write
  seqlock: Extend seqcount API with associated locks
  dma-buf: Remove custom seqcount lockdep class key
  dma-buf: Use sequence counter with associated wound/wait mutex
  sched: tasks: Use sequence counter with associated spinlock
  netfilter: conntrack: Use sequence counter with associated spinlock
  netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
  xfrm: policy: Use sequence counters with associated lock
  timekeeping: Use sequence counter with associated raw spinlock
  vfs: Use sequence counter with associated spinlock
  raid5: Use sequence counter with associated spinlock
  iocost: Use sequence counter with associated spinlock
  NFSv4: Use sequence counter with associated spinlock
  userfaultfd: Use sequence counter with associated spinlock
  kvm/eventfd: Use sequence counter with associated spinlock
  hrtimer: Use sequence counter with associated raw spinlock

 Documentation/locking/index.rst               |   1 +
 Documentation/locking/seqlock.rst             | 242 +++++
 MAINTAINERS                                   |   2 +-
 block/blk-iocost.c                            |   5 +-
 drivers/dma-buf/dma-resv.c                    |  15 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   2 -
 drivers/md/raid5.c                            |   2 +-
 drivers/md/raid5.h                            |   2 +-
 fs/dcache.c                                   |   2 +-
 fs/fs_struct.c                                |   4 +-
 fs/nfs/nfs4_fs.h                              |   2 +-
 fs/nfs/nfs4state.c                            |   2 +-
 fs/userfaultfd.c                              |   4 +-
 include/linux/dcache.h                        |   2 +-
 include/linux/dma-resv.h                      |   4 +-
 include/linux/fs_struct.h                     |   2 +-
 include/linux/hrtimer.h                       |   2 +-
 include/linux/kvm_irqfd.h                     |   2 +-
 include/linux/lockdep.h                       |  18 +
 include/linux/sched.h                         |   2 +-
 include/linux/seqlock.h                       | 872 ++++++++++++++----
 include/linux/seqlock_types_internal.h        | 186 ++++
 include/net/netfilter/nf_conntrack.h          |   2 +-
 init/init_task.c                              |   3 +-
 kernel/fork.c                                 |   2 +-
 kernel/time/hrtimer.c                         |  13 +-
 kernel/time/timekeeping.c                     |  19 +-
 lib/Kconfig.debug                             |   1 +
 net/netfilter/nf_conntrack_core.c             |   5 +-
 net/netfilter/nft_set_rbtree.c                |   4 +-
 net/xfrm/xfrm_policy.c                        |  10 +-
 virt/kvm/eventfd.c                            |   2 +-
 32 files changed, 1211 insertions(+), 225 deletions(-)
 create mode 100644 Documentation/locking/seqlock.rst
 create mode 100644 include/linux/seqlock_types_internal.h

base-commit: 997e89fa345e9006f311cf9f9c8fd9f7d96c240f
--
2.20.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 10/20] netfilter: conntrack: Use sequence counter with associated spinlock
  2020-06-30  5:44 ` [PATCH v3 00/20] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
@ 2020-06-30  5:44   ` Ahmed S. Darwish
  2020-06-30  5:44   ` [PATCH v3 11/20] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock Ahmed S. Darwish
  2020-06-30  5:44   ` [PATCH v3 12/20] xfrm: policy: Use sequence counters with associated lock Ahmed S. Darwish
  2 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-06-30  5:44 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	Jakub Kicinski, netfilter-devel, coreteam, netdev

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 include/net/netfilter/nf_conntrack.h | 2 +-
 net/netfilter/nf_conntrack_core.c    | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 90690e37a56f..ea4e2010b246 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -286,7 +286,7 @@ int nf_conntrack_hash_resize(unsigned int hashsize);
 
 extern struct hlist_nulls_head *nf_conntrack_hash;
 extern unsigned int nf_conntrack_htable_size;
-extern seqcount_t nf_conntrack_generation;
+extern seqcount_spinlock_t nf_conntrack_generation;
 extern unsigned int nf_conntrack_max;
 
 /* must be called with rcu read lock held */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 79cd9dde457b..b8c54d390f93 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -180,7 +180,7 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 
 unsigned int nf_conntrack_max __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_max);
-seqcount_t nf_conntrack_generation __read_mostly;
+seqcount_spinlock_t nf_conntrack_generation __read_mostly;
 static unsigned int nf_conntrack_hash_rnd __read_mostly;
 
 static u32 hash_conntrack_raw(const struct nf_conntrack_tuple *tuple,
@@ -2598,7 +2598,8 @@ int nf_conntrack_init_start(void)
 	/* struct nf_ct_ext uses u8 to store offsets/size */
 	BUILD_BUG_ON(total_extension_size() > 255u);
 
-	seqcount_init(&nf_conntrack_generation);
+	seqcount_spinlock_init(&nf_conntrack_generation,
+			       &nf_conntrack_locks_all_lock);
 
 	for (i = 0; i < CONNTRACK_LOCKS; i++)
 		spin_lock_init(&nf_conntrack_locks[i]);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 11/20] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
  2020-06-30  5:44 ` [PATCH v3 00/20] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
  2020-06-30  5:44   ` [PATCH v3 10/20] netfilter: conntrack: Use sequence counter with associated spinlock Ahmed S. Darwish
@ 2020-06-30  5:44   ` Ahmed S. Darwish
  2020-06-30  5:44   ` [PATCH v3 12/20] xfrm: policy: Use sequence counters with associated lock Ahmed S. Darwish
  2 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-06-30  5:44 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	Jakub Kicinski, netfilter-devel, coreteam, netdev

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_rwlock_t data type, which allows to associate a
rwlock with the sequence counter. This enables lockdep to verify that
the rwlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 net/netfilter/nft_set_rbtree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 62f416bc0579..9f58261ee4c7 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -18,7 +18,7 @@
 struct nft_rbtree {
 	struct rb_root		root;
 	rwlock_t		lock;
-	seqcount_t		count;
+	seqcount_rwlock_t	count;
 	struct delayed_work	gc_work;
 };
 
@@ -516,7 +516,7 @@ static int nft_rbtree_init(const struct nft_set *set,
 	struct nft_rbtree *priv = nft_set_priv(set);
 
 	rwlock_init(&priv->lock);
-	seqcount_init(&priv->count);
+	seqcount_rwlock_init(&priv->count, &priv->lock);
 	priv->root = RB_ROOT;
 
 	INIT_DEFERRABLE_WORK(&priv->gc_work, nft_rbtree_gc);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 12/20] xfrm: policy: Use sequence counters with associated lock
  2020-06-30  5:44 ` [PATCH v3 00/20] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
  2020-06-30  5:44   ` [PATCH v3 10/20] netfilter: conntrack: Use sequence counter with associated spinlock Ahmed S. Darwish
  2020-06-30  5:44   ` [PATCH v3 11/20] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock Ahmed S. Darwish
@ 2020-06-30  5:44   ` Ahmed S. Darwish
  2 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-06-30  5:44 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Steffen Klassert,
	Herbert Xu, David S. Miller, Jakub Kicinski, netdev

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. If the serialization primitive is
not disabling preemption implicitly, preemption has to be explicitly
disabled before entering the sequence counter write side critical
section.

A plain seqcount_t does not contain the information of which lock must
be held when entering a write side critical section.

Use the new seqcount_spinlock_t and seqcount_mutex_t data types instead,
which allow to associate a lock with the sequence counter. This enables
lockdep to verify that the lock used for writer serialization is held
when the write side critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 net/xfrm/xfrm_policy.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 564aa6492e7c..732a940468b0 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -122,7 +122,7 @@ struct xfrm_pol_inexact_bin {
 	/* list containing '*:*' policies */
 	struct hlist_head hhead;
 
-	seqcount_t count;
+	seqcount_spinlock_t count;
 	/* tree sorted by daddr/prefix */
 	struct rb_root root_d;
 
@@ -155,7 +155,7 @@ static struct xfrm_policy_afinfo const __rcu *xfrm_policy_afinfo[AF_INET6 + 1]
 						__read_mostly;
 
 static struct kmem_cache *xfrm_dst_cache __ro_after_init;
-static __read_mostly seqcount_t xfrm_policy_hash_generation;
+static __read_mostly seqcount_mutex_t xfrm_policy_hash_generation;
 
 static struct rhashtable xfrm_policy_inexact_table;
 static const struct rhashtable_params xfrm_pol_inexact_params;
@@ -719,7 +719,7 @@ xfrm_policy_inexact_alloc_bin(const struct xfrm_policy *pol, u8 dir)
 	INIT_HLIST_HEAD(&bin->hhead);
 	bin->root_d = RB_ROOT;
 	bin->root_s = RB_ROOT;
-	seqcount_init(&bin->count);
+	seqcount_spinlock_init(&bin->count, &net->xfrm.xfrm_policy_lock);
 
 	prev = rhashtable_lookup_get_insert_key(&xfrm_policy_inexact_table,
 						&bin->k, &bin->head,
@@ -1906,7 +1906,7 @@ static int xfrm_policy_match(const struct xfrm_policy *pol,
 
 static struct xfrm_pol_inexact_node *
 xfrm_policy_lookup_inexact_addr(const struct rb_root *r,
-				seqcount_t *count,
+				seqcount_spinlock_t *count,
 				const xfrm_address_t *addr, u16 family)
 {
 	const struct rb_node *parent;
@@ -4153,7 +4153,7 @@ void __init xfrm_init(void)
 {
 	register_pernet_subsys(&xfrm_net_ops);
 	xfrm_dev_init();
-	seqcount_init(&xfrm_policy_hash_generation);
+	seqcount_mutex_init(&xfrm_policy_hash_generation, &hash_resize_mutex);
 	xfrm_input_init();
 
 #ifdef CONFIG_INET_ESPINTCP
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v4 00/24] seqlock: Extend seqcount API with associated locks
  2020-05-19 21:45 [PATCH v1 00/25] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
                   ` (7 preceding siblings ...)
  2020-06-30  5:44 ` [PATCH v3 00/20] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
@ 2020-07-20 15:55 ` Ahmed S. Darwish
  2020-07-20 15:55   ` [PATCH v4 14/24] netfilter: conntrack: Use sequence counter with associated spinlock Ahmed S. Darwish
                     ` (3 more replies)
  8 siblings, 4 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-07-20 15:55 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Jonathan Corbet,
	linux-doc, David S. Miller, netdev, Alexander Viro,
	linux-fsdevel

Hi,

This is v4 of the seqlock patch series:

   [PATCH v1 00/25]
   https://lore.kernel.org/lkml/20200519214547.352050-1-a.darwish@linutronix.de

   [PATCH v2 00/06] (bugfixes-only, merged)
   https://lore.kernel.org/lkml/20200603144949.1122421-1-a.darwish@linutronix.de

   [PATCH v2 00/18]
   https://lore.kernel.org/lkml/20200608005729.1874024-1-a.darwish@linutronix.de

   [PATCH v3 00/20]
   https://lore.kernel.org/lkml/20200630054452.3675847-1-a.darwish@linutronix.de

It is based over:

   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git :: locking/core

Changelog
=========

 - Unconditionally use C11 _Generic() expressions for seqcount_locktype_t
   switching. Thanks to Peter, pushing for 6ec4476ac825 ("Raise gcc
   version requirement to 4.9").

 - Compress the new seqcount_locktype_t code code by using generative
   macros, as suggested by Peter here:

   https://lkml.kernel.org/r/20200708122938.GQ4800@hirez.programming.kicks-ass.net

   Keep *all* functions that are to be invoked by call-sites out of such
   generative macros though. This simplifies the generative macros code,
   and (more importantly) make the newly exported seqlock.h API explicit.

 - Make all documentation "RST-lite", for better readability from text
   editors.

 - Add additional clean-ups at the start of the series for better
   overall readability of seqlock.h code, and for future extensibility.

Thanks,

8<--------------

Ahmed S. Darwish (24):
  Documentation: locking: Describe seqlock design and usage
  seqlock: Properly format kernel-doc code samples
  seqlock: seqcount_t latch: End read sections with
    read_seqcount_retry()
  seqlock: Reorder seqcount_t and seqlock_t API definitions
  seqlock: Add kernel-doc for seqcount_t and seqlock_t APIs
  seqlock: Implement raw_seqcount_begin() in terms of
    raw_read_seqcount()
  lockdep: Add preemption enabled/disabled assertion APIs
  seqlock: lockdep assert non-preemptibility on seqcount_t write
  seqlock: Extend seqcount API with associated locks
  seqlock: Align multi-line macros newline escapes at 72 columns
  dma-buf: Remove custom seqcount lockdep class key
  dma-buf: Use sequence counter with associated wound/wait mutex
  sched: tasks: Use sequence counter with associated spinlock
  netfilter: conntrack: Use sequence counter with associated spinlock
  netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
  xfrm: policy: Use sequence counters with associated lock
  timekeeping: Use sequence counter with associated raw spinlock
  vfs: Use sequence counter with associated spinlock
  raid5: Use sequence counter with associated spinlock
  iocost: Use sequence counter with associated spinlock
  NFSv4: Use sequence counter with associated spinlock
  userfaultfd: Use sequence counter with associated spinlock
  kvm/eventfd: Use sequence counter with associated spinlock
  hrtimer: Use sequence counter with associated raw spinlock

 Documentation/locking/index.rst               |    1 +
 Documentation/locking/seqlock.rst             |  222 ++++
 block/blk-iocost.c                            |    5 +-
 drivers/dma-buf/dma-resv.c                    |   15 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |    2 -
 drivers/md/raid5.c                            |    2 +-
 drivers/md/raid5.h                            |    2 +-
 fs/dcache.c                                   |    2 +-
 fs/fs_struct.c                                |    4 +-
 fs/nfs/nfs4_fs.h                              |    2 +-
 fs/nfs/nfs4state.c                            |    2 +-
 fs/userfaultfd.c                              |    4 +-
 include/linux/dcache.h                        |    2 +-
 include/linux/dma-resv.h                      |    4 +-
 include/linux/fs_struct.h                     |    2 +-
 include/linux/hrtimer.h                       |    2 +-
 include/linux/kvm_irqfd.h                     |    2 +-
 include/linux/lockdep.h                       |   19 +
 include/linux/sched.h                         |    2 +-
 include/linux/seqlock.h                       | 1139 +++++++++++++----
 include/net/netfilter/nf_conntrack.h          |    2 +-
 init/init_task.c                              |    3 +-
 kernel/fork.c                                 |    2 +-
 kernel/time/hrtimer.c                         |   13 +-
 kernel/time/timekeeping.c                     |   19 +-
 lib/Kconfig.debug                             |    1 +
 net/netfilter/nf_conntrack_core.c             |    5 +-
 net/netfilter/nft_set_rbtree.c                |    4 +-
 net/xfrm/xfrm_policy.c                        |   10 +-
 virt/kvm/eventfd.c                            |    2 +-
 30 files changed, 1173 insertions(+), 323 deletions(-)
 create mode 100644 Documentation/locking/seqlock.rst

base-commit: a9232dc5607dbada801f2fe83ea307cda762969a
--
2.20.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v4 14/24] netfilter: conntrack: Use sequence counter with associated spinlock
  2020-07-20 15:55 ` [PATCH v4 00/24] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
@ 2020-07-20 15:55   ` Ahmed S. Darwish
  2020-07-20 15:55   ` [PATCH v4 15/24] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock Ahmed S. Darwish
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-07-20 15:55 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	Jakub Kicinski, netfilter-devel, coreteam, netdev

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 include/net/netfilter/nf_conntrack.h | 2 +-
 net/netfilter/nf_conntrack_core.c    | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 90690e37a56f..ea4e2010b246 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -286,7 +286,7 @@ int nf_conntrack_hash_resize(unsigned int hashsize);
 
 extern struct hlist_nulls_head *nf_conntrack_hash;
 extern unsigned int nf_conntrack_htable_size;
-extern seqcount_t nf_conntrack_generation;
+extern seqcount_spinlock_t nf_conntrack_generation;
 extern unsigned int nf_conntrack_max;
 
 /* must be called with rcu read lock held */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 79cd9dde457b..b8c54d390f93 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -180,7 +180,7 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 
 unsigned int nf_conntrack_max __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_max);
-seqcount_t nf_conntrack_generation __read_mostly;
+seqcount_spinlock_t nf_conntrack_generation __read_mostly;
 static unsigned int nf_conntrack_hash_rnd __read_mostly;
 
 static u32 hash_conntrack_raw(const struct nf_conntrack_tuple *tuple,
@@ -2598,7 +2598,8 @@ int nf_conntrack_init_start(void)
 	/* struct nf_ct_ext uses u8 to store offsets/size */
 	BUILD_BUG_ON(total_extension_size() > 255u);
 
-	seqcount_init(&nf_conntrack_generation);
+	seqcount_spinlock_init(&nf_conntrack_generation,
+			       &nf_conntrack_locks_all_lock);
 
 	for (i = 0; i < CONNTRACK_LOCKS; i++)
 		spin_lock_init(&nf_conntrack_locks[i]);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v4 15/24] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
  2020-07-20 15:55 ` [PATCH v4 00/24] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
  2020-07-20 15:55   ` [PATCH v4 14/24] netfilter: conntrack: Use sequence counter with associated spinlock Ahmed S. Darwish
@ 2020-07-20 15:55   ` Ahmed S. Darwish
  2020-07-20 15:55   ` [PATCH v4 16/24] xfrm: policy: Use sequence counters with associated lock Ahmed S. Darwish
  2020-07-20 16:49   ` [PATCH v4 00/24] seqlock: Extend seqcount API with associated locks Eric Biggers
  3 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-07-20 15:55 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	Jakub Kicinski, netfilter-devel, coreteam, netdev

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_rwlock_t data type, which allows to associate a
rwlock with the sequence counter. This enables lockdep to verify that
the rwlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 net/netfilter/nft_set_rbtree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index b6aad3fc46c3..4b2834fd17b2 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -18,7 +18,7 @@
 struct nft_rbtree {
 	struct rb_root		root;
 	rwlock_t		lock;
-	seqcount_t		count;
+	seqcount_rwlock_t	count;
 	struct delayed_work	gc_work;
 };
 
@@ -523,7 +523,7 @@ static int nft_rbtree_init(const struct nft_set *set,
 	struct nft_rbtree *priv = nft_set_priv(set);
 
 	rwlock_init(&priv->lock);
-	seqcount_init(&priv->count);
+	seqcount_rwlock_init(&priv->count, &priv->lock);
 	priv->root = RB_ROOT;
 
 	INIT_DEFERRABLE_WORK(&priv->gc_work, nft_rbtree_gc);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v4 16/24] xfrm: policy: Use sequence counters with associated lock
  2020-07-20 15:55 ` [PATCH v4 00/24] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
  2020-07-20 15:55   ` [PATCH v4 14/24] netfilter: conntrack: Use sequence counter with associated spinlock Ahmed S. Darwish
  2020-07-20 15:55   ` [PATCH v4 15/24] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock Ahmed S. Darwish
@ 2020-07-20 15:55   ` Ahmed S. Darwish
  2020-07-20 16:49   ` [PATCH v4 00/24] seqlock: Extend seqcount API with associated locks Eric Biggers
  3 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-07-20 15:55 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Steffen Klassert,
	Herbert Xu, David S. Miller, Jakub Kicinski, netdev

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. If the serialization primitive is
not disabling preemption implicitly, preemption has to be explicitly
disabled before entering the sequence counter write side critical
section.

A plain seqcount_t does not contain the information of which lock must
be held when entering a write side critical section.

Use the new seqcount_spinlock_t and seqcount_mutex_t data types instead,
which allow to associate a lock with the sequence counter. This enables
lockdep to verify that the lock used for writer serialization is held
when the write side critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 net/xfrm/xfrm_policy.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 564aa6492e7c..732a940468b0 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -122,7 +122,7 @@ struct xfrm_pol_inexact_bin {
 	/* list containing '*:*' policies */
 	struct hlist_head hhead;
 
-	seqcount_t count;
+	seqcount_spinlock_t count;
 	/* tree sorted by daddr/prefix */
 	struct rb_root root_d;
 
@@ -155,7 +155,7 @@ static struct xfrm_policy_afinfo const __rcu *xfrm_policy_afinfo[AF_INET6 + 1]
 						__read_mostly;
 
 static struct kmem_cache *xfrm_dst_cache __ro_after_init;
-static __read_mostly seqcount_t xfrm_policy_hash_generation;
+static __read_mostly seqcount_mutex_t xfrm_policy_hash_generation;
 
 static struct rhashtable xfrm_policy_inexact_table;
 static const struct rhashtable_params xfrm_pol_inexact_params;
@@ -719,7 +719,7 @@ xfrm_policy_inexact_alloc_bin(const struct xfrm_policy *pol, u8 dir)
 	INIT_HLIST_HEAD(&bin->hhead);
 	bin->root_d = RB_ROOT;
 	bin->root_s = RB_ROOT;
-	seqcount_init(&bin->count);
+	seqcount_spinlock_init(&bin->count, &net->xfrm.xfrm_policy_lock);
 
 	prev = rhashtable_lookup_get_insert_key(&xfrm_policy_inexact_table,
 						&bin->k, &bin->head,
@@ -1906,7 +1906,7 @@ static int xfrm_policy_match(const struct xfrm_policy *pol,
 
 static struct xfrm_pol_inexact_node *
 xfrm_policy_lookup_inexact_addr(const struct rb_root *r,
-				seqcount_t *count,
+				seqcount_spinlock_t *count,
 				const xfrm_address_t *addr, u16 family)
 {
 	const struct rb_node *parent;
@@ -4153,7 +4153,7 @@ void __init xfrm_init(void)
 {
 	register_pernet_subsys(&xfrm_net_ops);
 	xfrm_dev_init();
-	seqcount_init(&xfrm_policy_hash_generation);
+	seqcount_mutex_init(&xfrm_policy_hash_generation, &hash_resize_mutex);
 	xfrm_input_init();
 
 #ifdef CONFIG_INET_ESPINTCP
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 00/24] seqlock: Extend seqcount API with associated locks
  2020-07-20 15:55 ` [PATCH v4 00/24] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
                     ` (2 preceding siblings ...)
  2020-07-20 15:55   ` [PATCH v4 16/24] xfrm: policy: Use sequence counters with associated lock Ahmed S. Darwish
@ 2020-07-20 16:49   ` Eric Biggers
  2020-07-20 17:33     ` Ahmed S. Darwish
  3 siblings, 1 reply; 38+ messages in thread
From: Eric Biggers @ 2020-07-20 16:49 UTC (permalink / raw)
  To: Ahmed S. Darwish
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Paul E. McKenney, Sebastian A. Siewior, Steven Rostedt, LKML,
	Jonathan Corbet, linux-doc, David S. Miller, netdev,
	Alexander Viro, linux-fsdevel

On Mon, Jul 20, 2020 at 05:55:06PM +0200, Ahmed S. Darwish wrote:
> Hi,
> 
> This is v4 of the seqlock patch series:
> 
>    [PATCH v1 00/25]
>    https://lore.kernel.org/lkml/20200519214547.352050-1-a.darwish@linutronix.de
> 
>    [PATCH v2 00/06] (bugfixes-only, merged)
>    https://lore.kernel.org/lkml/20200603144949.1122421-1-a.darwish@linutronix.de
> 
>    [PATCH v2 00/18]
>    https://lore.kernel.org/lkml/20200608005729.1874024-1-a.darwish@linutronix.de
> 
>    [PATCH v3 00/20]
>    https://lore.kernel.org/lkml/20200630054452.3675847-1-a.darwish@linutronix.de
> 
> It is based over:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git :: locking/core
> 

Please include an explanation of the patch series in the cover letter.  It looks
like you sent it in v1 and then stopped including it.  That doesn't work;
reviewers shouldn't have to read all previous versions too and try to guess
which parts are still up-to-date.

- Eric

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 00/24] seqlock: Extend seqcount API with associated locks
  2020-07-20 16:49   ` [PATCH v4 00/24] seqlock: Extend seqcount API with associated locks Eric Biggers
@ 2020-07-20 17:33     ` Ahmed S. Darwish
  0 siblings, 0 replies; 38+ messages in thread
From: Ahmed S. Darwish @ 2020-07-20 17:33 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Paul E. McKenney, Sebastian A. Siewior, Steven Rostedt, LKML,
	Jonathan Corbet, linux-doc, David S. Miller, netdev,
	Alexander Viro, linux-fsdevel

On Mon, Jul 20, 2020 at 09:49:12AM -0700, Eric Biggers wrote:
> On Mon, Jul 20, 2020 at 05:55:06PM +0200, Ahmed S. Darwish wrote:
> > Hi,
> >
> > This is v4 of the seqlock patch series:
> >
> >    [PATCH v1 00/25]
> >    https://lore.kernel.org/lkml/20200519214547.352050-1-a.darwish@linutronix.de
> >
> >    [PATCH v2 00/06] (bugfixes-only, merged)
> >    https://lore.kernel.org/lkml/20200603144949.1122421-1-a.darwish@linutronix.de
> >
> >    [PATCH v2 00/18]
> >    https://lore.kernel.org/lkml/20200608005729.1874024-1-a.darwish@linutronix.de
> >
> >    [PATCH v3 00/20]
> >    https://lore.kernel.org/lkml/20200630054452.3675847-1-a.darwish@linutronix.de
> >
> > It is based over:
> >
> >    git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git :: locking/core
> >
>
> Please include an explanation of the patch series in the cover letter.  It looks
> like you sent it in v1 and then stopped including it.  That doesn't work;
> reviewers shouldn't have to read all previous versions too and try to guess
> which parts are still up-to-date.
>

Noted. Thanks for the heads-up.

--
Ahmed S. Darwish
Linutronix GmbH

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2020-07-20 17:33 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-19 21:45 [PATCH v1 00/25] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
2020-05-19 21:45 ` [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount Ahmed S. Darwish
2020-05-19 22:01   ` Stephen Hemminger
2020-05-19 22:23     ` Thomas Gleixner
2020-05-19 23:11       ` Stephen Hemminger
2020-05-19 23:42         ` Thomas Gleixner
2020-05-20  0:06           ` Stephen Hemminger
2020-05-20  1:55             ` Thomas Gleixner
2020-05-20  2:57           ` David Miller
2020-05-20  3:18             ` Eric Dumazet
2020-05-20  4:36               ` Stephen Hemminger
2020-05-20 19:37             ` Thomas Gleixner
2020-05-20 21:36               ` Stephen Hemminger
2020-05-20  2:01   ` Eric Dumazet
2020-05-20  6:42     ` Ahmed S. Darwish
2020-05-20 12:51       ` Eric Dumazet
2020-06-03 14:33         ` Ahmed S. Darwish
2020-05-20 14:37   ` Dan Carpenter
2020-05-25 16:22     ` Ahmed S. Darwish
2020-05-19 21:45 ` [PATCH v1 03/25] net: phy: fixed_phy: Remove unused seqcount Ahmed S. Darwish
2020-05-19 21:45 ` [PATCH v1 05/25] u64_stats: Document writer non-preemptibility requirement Ahmed S. Darwish
2020-05-19 21:45 ` [PATCH v1 15/25] netfilter: conntrack: Use sequence counter with associated spinlock Ahmed S. Darwish
2020-05-19 21:45 ` [PATCH v1 16/25] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock Ahmed S. Darwish
2020-05-19 21:45 ` [PATCH v1 17/25] xfrm: policy: Use sequence counters with associated lock Ahmed S. Darwish
2020-06-08  0:57 ` [PATCH v2 00/18] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
2020-06-08  0:57   ` [PATCH v2 08/18] netfilter: conntrack: Use sequence counter with associated spinlock Ahmed S. Darwish
2020-06-08  0:57   ` [PATCH v2 09/18] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock Ahmed S. Darwish
2020-06-08  0:57   ` [PATCH v2 10/18] xfrm: policy: Use sequence counters with associated lock Ahmed S. Darwish
2020-06-30  5:44 ` [PATCH v3 00/20] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
2020-06-30  5:44   ` [PATCH v3 10/20] netfilter: conntrack: Use sequence counter with associated spinlock Ahmed S. Darwish
2020-06-30  5:44   ` [PATCH v3 11/20] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock Ahmed S. Darwish
2020-06-30  5:44   ` [PATCH v3 12/20] xfrm: policy: Use sequence counters with associated lock Ahmed S. Darwish
2020-07-20 15:55 ` [PATCH v4 00/24] seqlock: Extend seqcount API with associated locks Ahmed S. Darwish
2020-07-20 15:55   ` [PATCH v4 14/24] netfilter: conntrack: Use sequence counter with associated spinlock Ahmed S. Darwish
2020-07-20 15:55   ` [PATCH v4 15/24] netfilter: nft_set_rbtree: Use sequence counter with associated rwlock Ahmed S. Darwish
2020-07-20 15:55   ` [PATCH v4 16/24] xfrm: policy: Use sequence counters with associated lock Ahmed S. Darwish
2020-07-20 16:49   ` [PATCH v4 00/24] seqlock: Extend seqcount API with associated locks Eric Biggers
2020-07-20 17:33     ` Ahmed S. Darwish

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).