linux-mm.kvack.org archive mirror
* [PATCH v1 00/25] seqlock: Extend seqcount API with associated locks
@ 2020-05-19 21:45 Ahmed S. Darwish
From: Ahmed S. Darwish @ 2020-05-19 21:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, David S. Miller,
	Andrew Morton, Jens Axboe, Jonathan Corbet, Alexander Viro,
	David Airlie, Daniel Vetter, netdev, linux-mm, linux-block,
	dri-devel, linux-fsdevel, linux-doc

Hi,

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. If the serialization primitive
does not implicitly disable preemption, preemption has to be explicitly
disabled before entering the write side critical section.
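
For example, if the writer lock is a mutex (which leaves preemption
enabled), the write path today has to look roughly like this (a minimal
sketch; data_lock and data_seq are illustrative names):

	static DEFINE_MUTEX(data_lock);
	static seqcount_t data_seq = SEQCNT_ZERO(data_seq);

	mutex_lock(&data_lock);		/* serialize writers */
	preempt_disable();		/* mutexes do not disable preemption */
	write_seqcount_begin(&data_seq);
	/* ... update the protected data ... */
	write_seqcount_end(&data_seq);
	preempt_enable();
	mutex_unlock(&data_lock);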

There is no built-in debugging mechanism to verify that the lock used
for writer serialization is held and preemption is disabled. Some usage
sites like dma-buf have explicit lockdep checks for the writer-side
lock, but such checks cover only a small portion of the sequence counter
usage in the kernel.
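
dma-buf's open-coded check is roughly of this form (a sketch; the field
names are illustrative, not the exact dma-resv layout):

	lockdep_assert_held(&obj->lock);	/* writer lock must be held */
	write_seqcount_begin(&obj->seq);
	/* ... update the object ... */
	write_seqcount_end(&obj->seq);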

Add new sequence counter types which allow a lock to be associated with
the sequence counter at initialization time. The seqcount API functions
are extended to provide appropriate lockdep assertions depending on the
seqcount/lock type.

For sequence counters with associated locks that do not implicitly
disable preemption, preemption protection is enforced in the sequence
counter write side functions. This removes the need to explicitly add
preempt_disable/enable() around the write side critical sections: the
write_begin/end() functions for these new sequence counter types
automatically do this.

Extend the lockdep API with a macro asserting that preemption is
disabled. Use it to verify that preemption is disabled for all sequence
counter write side critical sections.
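
A minimal sketch of what such an assertion can look like (the exact
name and implementation are in the lockdep patch of this series; the
version below is illustrative only):

	#define lockdep_assert_preemption_disabled()			\
	do {								\
		WARN_ON_ONCE(debug_locks && preemptible());		\
	} while (0)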

If lockdep is disabled, these lock associations and non-preemptibility
checks are compiled out and have neither storage size nor runtime
overhead. If lockdep is enabled, a pointer to the lock is stored in the
seqcount and the write side API functions enable lockdep assertions.
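
For instance, the spinlock variant could be laid out as follows (a
sketch; the series generates these types through macros):

	typedef struct seqcount_spinlock {
		seqcount_t	seqcount;
	#ifdef CONFIG_LOCKDEP
		spinlock_t	*lock;	/* associated lock, debug builds only */
	#endif
	} seqcount_spinlock_t;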

The following seqcount types with associated locks are introduced:

     seqcount_spinlock_t
     seqcount_raw_spinlock_t
     seqcount_rwlock_t
     seqcount_mutex_t
     seqcount_ww_mutex_t
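
Usage then follows this pattern (a minimal sketch; the foo_* names are
illustrative, and the initializer/write-side names below follow the API
proposed in this series):

	static spinlock_t		foo_lock;
	static seqcount_spinlock_t	foo_seq;

	/* init time: associate the lock with the counter */
	spin_lock_init(&foo_lock);
	seqcount_spinlock_init(&foo_seq, &foo_lock);

	/* write side: lockdep asserts that foo_lock is held */
	spin_lock(&foo_lock);
	write_seqcount_begin(&foo_seq);
	/* ... write side critical section ... */
	write_seqcount_end(&foo_seq);
	spin_unlock(&foo_lock);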

This lock association is not only useful for debugging purposes, it
also provides a mechanism for PREEMPT_RT to prevent writer starvation.
On RT kernels spinlocks and rwlocks are substituted with sleeping locks
and the code sections protected by these locks become preemptible,
which creates the same problem as a write side critical section with
preemption enabled on a non-RT kernel. RT utilizes this association by
storing the provided lock pointer; if a reader sees an active writer
(seqcount is odd), it does not spin but blocks on the associated lock,
similar to read_seqbegin_or_lock().
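
Conceptually, an RT reader then behaves like this (pseudocode sketch,
not the actual implementation):

	seq = READ_ONCE(s->seqcount.sequence);
	while (seq & 1) {			/* writer active */
		spin_lock(s->lock);		/* PI-boost and wait for the writer */
		spin_unlock(s->lock);
		seq = READ_ONCE(s->seqcount.sequence);
	}
	/* ... read the data, then retry if the sequence changed ... */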

By using the lockdep debugging mechanisms added in this patch series, a
number of erroneous seqcount call-sites were discovered across the
kernel. The fixes are included at the beginning of the series.

Thanks,

8<--------------

Ahmed S. Darwish (25):
  net: core: device_rename: Use rwsem instead of a seqcount
  mm/swap: Don't abuse the seqcount latching API
  net: phy: fixed_phy: Remove unused seqcount
  block: nr_sects_write(): Disable preemption on seqcount write
  u64_stats: Document writer non-preemptibility requirement
  dma-buf: Remove custom seqcount lockdep class key
  lockdep: Add preemption disabled assertion API
  seqlock: lockdep assert non-preemptibility on seqcount_t write
  Documentation: locking: Describe seqlock design and usage
  seqlock: Add RST directives to kernel-doc code samples and notes
  seqlock: Add missing kernel-doc annotations
  seqlock: Extend seqcount API with associated locks
  dma-buf: Use sequence counter with associated wound/wait mutex
  sched: tasks: Use sequence counter with associated spinlock
  netfilter: conntrack: Use sequence counter with associated spinlock
  netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
  xfrm: policy: Use sequence counters with associated lock
  timekeeping: Use sequence counter with associated raw spinlock
  vfs: Use sequence counter with associated spinlock
  raid5: Use sequence counter with associated spinlock
  iocost: Use sequence counter with associated spinlock
  NFSv4: Use sequence counter with associated spinlock
  userfaultfd: Use sequence counter with associated spinlock
  kvm/eventfd: Use sequence counter with associated spinlock
  hrtimer: Use sequence counter with associated raw spinlock

 Documentation/locking/index.rst               |   1 +
 Documentation/locking/seqlock.rst             | 239 +++++
 MAINTAINERS                                   |   2 +-
 block/blk-iocost.c                            |   5 +-
 block/blk.h                                   |   2 +
 drivers/dma-buf/dma-resv.c                    |  15 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   2 -
 drivers/md/raid5.c                            |   2 +-
 drivers/md/raid5.h                            |   2 +-
 drivers/net/phy/fixed_phy.c                   |  25 +-
 fs/dcache.c                                   |   2 +-
 fs/fs_struct.c                                |   4 +-
 fs/nfs/nfs4_fs.h                              |   2 +-
 fs/nfs/nfs4state.c                            |   2 +-
 fs/userfaultfd.c                              |   4 +-
 include/linux/dcache.h                        |   2 +-
 include/linux/dma-resv.h                      |   4 +-
 include/linux/fs_struct.h                     |   2 +-
 include/linux/hrtimer.h                       |   2 +-
 include/linux/kvm_irqfd.h                     |   2 +-
 include/linux/lockdep.h                       |   9 +
 include/linux/sched.h                         |   2 +-
 include/linux/seqlock.h                       | 882 +++++++++++++++---
 include/linux/seqlock_types_internal.h        | 187 ++++
 include/linux/u64_stats_sync.h                |  38 +-
 include/net/netfilter/nf_conntrack.h          |   2 +-
 init/init_task.c                              |   3 +-
 kernel/fork.c                                 |   2 +-
 kernel/locking/lockdep.c                      |  15 +
 kernel/time/hrtimer.c                         |  13 +-
 kernel/time/timekeeping.c                     |  19 +-
 lib/Kconfig.debug                             |   1 +
 mm/swap.c                                     |  57 +-
 net/core/dev.c                                |  30 +-
 net/netfilter/nf_conntrack_core.c             |   5 +-
 net/netfilter/nft_set_rbtree.c                |   4 +-
 net/xfrm/xfrm_policy.c                        |  10 +-
 virt/kvm/eventfd.c                            |   2 +-
 38 files changed, 1325 insertions(+), 277 deletions(-)
 create mode 100644 Documentation/locking/seqlock.rst
 create mode 100644 include/linux/seqlock_types_internal.h

base-commit: 2ef96a5bb12be62ef75b5828c0aab838ebb29cb8
--
2.20.1



* [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API
From: Ahmed S. Darwish @ 2020-05-19 21:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Ahmed S. Darwish, Andrew Morton,
	Konstantin Khlebnikov, linux-mm

Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls")
implemented an optimization mechanism to exit the to-be-started LRU
drain operation (name it A) if another drain operation *started and
finished* while (A) was blocked on the LRU draining mutex.

This was done through a seqcount latch, which is an abuse of its
semantics:

  1. Seqcount latching should be used for the purpose of switching
     between two storage places with sequence protection to allow
     interruptible, preemptible writer sections. The optimization
     mechanism has absolutely nothing to do with that.

  2. The used raw_write_seqcount_latch() has two smp write memory
     barriers to always ensure one consistent storage place out of the
     two storage places available. This extra smp_wmb() is redundant for
     the optimization use case.

Besides the API abuse, the semantics of a latch sequence counter were
force-fitted into the optimization. What was actually meant is to track
generations of LRU draining operations, where "current lru draining
generation = x" implies that all generations 0 < n <= x are already
*scheduled* for draining.

Remove the conceptually-inappropriate seqcount latch usage and manually
implement the optimization using a counter and SMP memory barriers.

Link: https://lkml.kernel.org/r/CALYGNiPSr-cxV9MX9czaVh6Wz_gzSv3H_8KPvgjBTGbJywUJpA@mail.gmail.com
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 47 insertions(+), 10 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index bf9a79fed62d..d6910eeed43d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
  */
 void lru_add_drain_all(void)
 {
-	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
-	static DEFINE_MUTEX(lock);
+	/*
+	 * lru_drain_gen - Current generation of pages that could be in vectors
+	 *
+	 * (A) Definition: lru_drain_gen = x implies that all generations
+	 *     0 < n <= x are already scheduled for draining.
+	 *
+	 * This is an optimization for the highly-contended use case where a
+	 * user space workload keeps constantly generating a flow of pages
+	 * for each CPU.
+	 */
+	static unsigned int lru_drain_gen;
 	static struct cpumask has_work;
-	int cpu, seq;
+	static DEFINE_MUTEX(lock);
+	int cpu, this_gen;
 
 	/*
 	 * Make sure nobody triggers this path before mm_percpu_wq is fully
@@ -725,21 +735,48 @@ void lru_add_drain_all(void)
 	if (WARN_ON(!mm_percpu_wq))
 		return;
 
-	seq = raw_read_seqcount_latch(&seqcount);
+	/*
+	 * (B) Cache the LRU draining generation number
+	 *
+	 * smp_rmb() ensures that the counter is loaded before the mutex is
+	 * taken. It pairs with the smp_wmb() inside the mutex critical section
+	 * at (D).
+	 */
+	this_gen = READ_ONCE(lru_drain_gen);
+	smp_rmb();
 
 	mutex_lock(&lock);
 
 	/*
-	 * Piggyback on drain started and finished while we waited for lock:
-	 * all pages pended at the time of our enter were drained from vectors.
+	 * (C) Exit the draining operation if a newer generation, from another
+	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
 	 */
-	if (__read_seqcount_retry(&seqcount, seq))
+	if (unlikely(this_gen != lru_drain_gen))
 		goto done;
 
-	raw_write_seqcount_latch(&seqcount);
+	/*
+	 * (D) Increment generation number
+	 *
+	 * Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical
+	 * section.
+	 *
+	 * This pairing must be done here, before the for_each_online_cpu loop
+	 * below which drains the page vectors.
+	 *
+	 * Let x, y, and z represent some system CPU numbers, where x < y < z.
+	 * Assume CPU #z is in the middle of the for_each_online_cpu loop
+	 * below and has already reached CPU #y's per-cpu data. CPU #x comes
+	 * along, adds some pages to its per-cpu vectors, then calls
+	 * lru_add_drain_all().
+	 *
+	 * If the paired smp_wmb() below is done at any later step, e.g. after
+	 * the loop, CPU #x will just exit at (C) and miss flushing out all of
+	 * its added pages.
+	 */
+	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
+	smp_wmb();
 
 	cpumask_clear(&has_work);
-
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
@@ -766,7 +803,7 @@ void lru_add_drain_all(void)
 {
 	lru_add_drain();
 }
-#endif
+#endif /* CONFIG_SMP */
 
 /**
  * release_pages - batched put_page()
-- 
2.20.1




* Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API
From: Hillf Danton @ 2020-05-20 11:24 UTC (permalink / raw)
  To: Ahmed S. Darwish; +Cc: LKML, Konstantin Khlebnikov, linux-mm, Hillf Danton


On Tue, 19 May 2020 23:45:24 +0200 "Ahmed S. Darwish" wrote:
> 
> Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls")
> implemented an optimization mechanism to exit the to-be-started LRU
> drain operation (name it A) if another drain operation *started and
> finished* while (A) was blocked on the LRU draining mutex.
> 
> This was done through a seqcount latch, which is an abuse of its
> semantics:
> 
>   1. Seqcount latching should be used for the purpose of switching
>      between two storage places with sequence protection to allow
>      interruptible, preemptible writer sections. The optimization
>      mechanism has absolutely nothing to do with that.
> 
>   2. The used raw_write_seqcount_latch() has two smp write memory
>      barriers to always ensure one consistent storage place out of the
>      two storage places available. This extra smp_wmb() is redundant for
>      the optimization use case.
> 
> Besides the API abuse, the semantics of a latch sequence counter were
> force-fitted into the optimization. What was actually meant is to track
> generations of LRU draining operations, where "current lru draining
> generation = x" implies that all generations 0 < n <= x are already
> *scheduled* for draining.
> 
> Remove the conceptually-inappropriate seqcount latch usage and manually
> implement the optimization using a counter and SMP memory barriers.
> 
> Link: https://lkml.kernel.org/r/CALYGNiPSr-cxV9MX9czaVh6Wz_gzSv3H_8KPvgjBTGbJywUJpA@mail.gmail.com
> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> ---
>  mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 47 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index bf9a79fed62d..d6910eeed43d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>   */
>  void lru_add_drain_all(void)
>  {
> -	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
> -	static DEFINE_MUTEX(lock);
> +	/*
> +	 * lru_drain_gen - Current generation of pages that could be in vectors
> +	 *
> +	 * (A) Definition: lru_drain_gen = x implies that all generations
> +	 *     0 < n <= x are already scheduled for draining.
> +	 *
> +	 * This is an optimization for the highly-contended use case where a
> +	 * user space workload keeps constantly generating a flow of pages
> +	 * for each CPU.
> +	 */
> +	static unsigned int lru_drain_gen;
>  	static struct cpumask has_work;
> -	int cpu, seq;
> +	static DEFINE_MUTEX(lock);
> +	int cpu, this_gen;
>
>  	/*
>  	 * Make sure nobody triggers this path before mm_percpu_wq is fully
> @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
>  	if (WARN_ON(!mm_percpu_wq))
>  		return;
>
> -	seq = raw_read_seqcount_latch(&seqcount);
> +	/*
> +	 * (B) Cache the LRU draining generation number
> +	 *
> +	 * smp_rmb() ensures that the counter is loaded before the mutex is
> +	 * taken. It pairs with the smp_wmb() inside the mutex critical section
> +	 * at (D).
> +	 */
> +	this_gen = READ_ONCE(lru_drain_gen);
> +	smp_rmb();
>
>  	mutex_lock(&lock);
>
>  	/*
> -	 * Piggyback on drain started and finished while we waited for lock:
> -	 * all pages pended at the time of our enter were drained from vectors.
> +	 * (C) Exit the draining operation if a newer generation, from another
> +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
>  	 */
> -	if (__read_seqcount_retry(&seqcount, seq))
> +	if (unlikely(this_gen != lru_drain_gen))
>  		goto done;
>
> -	raw_write_seqcount_latch(&seqcount);
> +	/*
> +	 * (D) Increment generation number
> +	 *
> +	 * Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical
> +	 * section.
> +	 *
> +	 * This pairing must be done here, before the for_each_online_cpu loop
> +	 * below which drains the page vectors.
> +	 *
> +	 * Let x, y, and z represent some system CPU numbers, where x < y < z.
> +	 * Assume CPU #z is in the middle of the for_each_online_cpu loop
> +	 * below and has already reached CPU #y's per-cpu data. CPU #x comes
> +	 * along, adds some pages to its per-cpu vectors, then calls
> +	 * lru_add_drain_all().
> +	 *
> +	 * If the paired smp_wmb() below is done at any later step, e.g. after
> +	 * the loop, CPU #x will just exit at (C) and miss flushing out all of
> +	 * its added pages.
> +	 */
> +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> +	smp_wmb();

Writer is inside mutex.

>
>  	cpumask_clear(&has_work);
> -
>  	for_each_online_cpu(cpu) {
>  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>
> @@ -766,7 +803,7 @@ void lru_add_drain_all(void)
>  {
>  	lru_add_drain();
>  }
> -#endif
> +#endif /* CONFIG_SMP */
>
>  /**
>   * release_pages - batched put_page()
> --
> 2.20.1


The tiny diff below, inspired by this work, tries to trylock the mutex
by switching the sequence counter's readers and writer across the mutex
boundary.

--- a/mm/swap.c
+++ b/mm/swap.c
@@ -714,10 +714,12 @@ static void lru_add_drain_per_cpu(struct
  */
 void lru_add_drain_all(void)
 {
-	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
+	static unsigned int lru_drain_gen;
 	static DEFINE_MUTEX(lock);
 	static struct cpumask has_work;
-	int cpu, seq;
+	int cpu;
+	int loop = 0;
+	unsigned int seq;
 
 	/*
 	 * Make sure nobody triggers this path before mm_percpu_wq is fully
@@ -726,18 +728,12 @@ void lru_add_drain_all(void)
 	if (WARN_ON(!mm_percpu_wq))
 		return;
 
-	seq = raw_read_seqcount_latch(&seqcount);
+	seq = lru_drain_gen++;
+	smp_mb();
 
-	mutex_lock(&lock);
-
-	/*
-	 * Piggyback on drain started and finished while we waited for lock:
-	 * all pages pended at the time of our enter were drained from vectors.
-	 */
-	if (__read_seqcount_retry(&seqcount, seq))
-		goto done;
-
-	raw_write_seqcount_latch(&seqcount);
+	if (!mutex_trylock(&lock))
+		return;
+more_work:
 
 	cpumask_clear(&has_work);
 
@@ -759,7 +755,11 @@ void lru_add_drain_all(void)
 	for_each_cpu(cpu, &has_work)
 		flush_work(&per_cpu(lru_add_drain_work, cpu));
 
-done:
+	smp_mb();
+	if (++seq != lru_drain_gen && loop++ < 128)
+		goto more_work;
+
+	smp_mb();
 	mutex_unlock(&lock);
 }
 #else
--




* Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API
From: Konstantin Khlebnikov @ 2020-05-20 12:22 UTC (permalink / raw)
  To: Ahmed S. Darwish, Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: Thomas Gleixner, Paul E. McKenney, Sebastian A. Siewior,
	Steven Rostedt, LKML, Andrew Morton, linux-mm

On 20/05/2020 00.45, Ahmed S. Darwish wrote:
> Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls")
> implemented an optimization mechanism to exit the to-be-started LRU
> drain operation (name it A) if another drain operation *started and
> finished* while (A) was blocked on the LRU draining mutex.
> 
> This was done through a seqcount latch, which is an abuse of its
> semantics:
> 
>    1. Seqcount latching should be used for the purpose of switching
>       between two storage places with sequence protection to allow
>       interruptible, preemptible writer sections. The optimization
>       mechanism has absolutely nothing to do with that.
> 
>    2. The used raw_write_seqcount_latch() has two smp write memory
>       barriers to always ensure one consistent storage place out of the
>       two storage places available. This extra smp_wmb() is redundant for
>       the optimization use case.
> 
> Besides the API abuse, the semantics of a latch sequence counter were
> force-fitted into the optimization. What was actually meant is to track
> generations of LRU draining operations, where "current lru draining
> generation = x" implies that all generations 0 < n <= x are already
> *scheduled* for draining.
> 
> Remove the conceptually-inappropriate seqcount latch usage and manually
> implement the optimization using a counter and SMP memory barriers.

Well, I thought it fits perfectly =)

Maybe it's worth adding helpers with appropriate semantics?
This is a pretty common pattern.
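
Something like this, perhaps (hypothetical helpers, names invented for
illustration; nothing of the sort exists today):

	/* snapshot the generation before taking the serializing lock */
	static inline unsigned int gen_snapshot(unsigned int *gen)
	{
		unsigned int g = READ_ONCE(*gen);

		smp_rmb();	/* load the generation before the lock */
		return g;
	}

	/* under the lock: true if a newer generation was already scheduled */
	static inline bool gen_piggybacked(unsigned int *gen, unsigned int snap)
	{
		return READ_ONCE(*gen) != snap;
	}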

> 
> Link: https://lkml.kernel.org/r/CALYGNiPSr-cxV9MX9czaVh6Wz_gzSv3H_8KPvgjBTGbJywUJpA@mail.gmail.com
> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> ---
>   mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++----------
>   1 file changed, 47 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index bf9a79fed62d..d6910eeed43d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>    */
>   void lru_add_drain_all(void)
>   {
> -	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
> -	static DEFINE_MUTEX(lock);
> +	/*
> +	 * lru_drain_gen - Current generation of pages that could be in vectors
> +	 *
> +	 * (A) Definition: lru_drain_gen = x implies that all generations
> +	 *     0 < n <= x are already scheduled for draining.
> +	 *
> +	 * This is an optimization for the highly-contended use case where a
> +	 * user space workload keeps constantly generating a flow of pages
> +	 * for each CPU.
> +	 */
> +	static unsigned int lru_drain_gen;
>   	static struct cpumask has_work;
> -	int cpu, seq;
> +	static DEFINE_MUTEX(lock);
> +	int cpu, this_gen;
>   
>   	/*
>   	 * Make sure nobody triggers this path before mm_percpu_wq is fully
> @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
>   	if (WARN_ON(!mm_percpu_wq))
>   		return;
>   
> -	seq = raw_read_seqcount_latch(&seqcount);
> +	/*
> +	 * (B) Cache the LRU draining generation number
> +	 *
> +	 * smp_rmb() ensures that the counter is loaded before the mutex is
> +	 * taken. It pairs with the smp_wmb() inside the mutex critical section
> +	 * at (D).
> +	 */
> +	this_gen = READ_ONCE(lru_drain_gen);
> +	smp_rmb();
>   
>   	mutex_lock(&lock);
>   
>   	/*
> -	 * Piggyback on drain started and finished while we waited for lock:
> -	 * all pages pended at the time of our enter were drained from vectors.
> +	 * (C) Exit the draining operation if a newer generation, from another
> +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
>   	 */
> -	if (__read_seqcount_retry(&seqcount, seq))
> +	if (unlikely(this_gen != lru_drain_gen))
>   		goto done;
>   
> -	raw_write_seqcount_latch(&seqcount);
> +	/*
> +	 * (D) Increment generation number
> +	 *
> +	 * Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical
> +	 * section.
> +	 *
> +	 * This pairing must be done here, before the for_each_online_cpu loop
> +	 * below which drains the page vectors.
> +	 *
> +	 * Let x, y, and z represent some system CPU numbers, where x < y < z.
> +	 * Assume CPU #z is in the middle of the for_each_online_cpu loop
> +	 * below and has already reached CPU #y's per-cpu data. CPU #x comes
> +	 * along, adds some pages to its per-cpu vectors, then calls
> +	 * lru_add_drain_all().
> +	 *
> +	 * If the paired smp_wmb() below is done at any later step, e.g. after
> +	 * the loop, CPU #x will just exit at (C) and miss flushing out all of
> +	 * its added pages.
> +	 */
> +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> +	smp_wmb();
>   
>   	cpumask_clear(&has_work);
> -
>   	for_each_online_cpu(cpu) {
>   		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>   
> @@ -766,7 +803,7 @@ void lru_add_drain_all(void)
>   {
>   	lru_add_drain();
>   }
> -#endif
> +#endif /* CONFIG_SMP */
>   
>   /**
>    * release_pages - batched put_page()
> 



* Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API
From: Peter Zijlstra @ 2020-05-20 13:05 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Ahmed S. Darwish, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Paul E. McKenney, Sebastian A. Siewior, Steven Rostedt, LKML,
	Andrew Morton, linux-mm

On Wed, May 20, 2020 at 03:22:15PM +0300, Konstantin Khlebnikov wrote:
> On 20/05/2020 00.45, Ahmed S. Darwish wrote:
> > Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls")
> > implemented an optimization mechanism to exit the to-be-started LRU
> > drain operation (name it A) if another drain operation *started and
> > finished* while (A) was blocked on the LRU draining mutex.

That commit is horrible...

> Well, I thought it fits perfectly =)
> 
> Maybe it's worth to add helpers with appropriate semantics?
> This is pretty common pattern.

Where are more sites?

> > @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
> >   	if (WARN_ON(!mm_percpu_wq))
> >   		return;
> > -	seq = raw_read_seqcount_latch(&seqcount);
> >   	mutex_lock(&lock);
> >   	/*
> > -	 * Piggyback on drain started and finished while we waited for lock:
> > -	 * all pages pended at the time of our enter were drained from vectors.
> >   	 */
> > -	if (__read_seqcount_retry(&seqcount, seq))
> >   		goto done;

Since there is no ordering in raw_read_seqcount_latch(), and
mutex_lock() is an ACQUIRE, there's no guarantee the read actually
happens before the mutex is acquired.
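
I.e., the ACQUIRE only forbids later accesses from moving before the
lock; the earlier plain load may still sink into the critical section
(a sketch of the permitted reordering):

	seq = raw_read_seqcount_latch(&seqcount);	/* plain load, no barrier */
	mutex_lock(&lock);				/* ACQUIRE: the load above
							 * may be reordered to
							 * after this point */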

> > -	raw_write_seqcount_latch(&seqcount);
> >   	cpumask_clear(&has_work);



* Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API
From: Peter Zijlstra @ 2020-05-22 14:57 UTC (permalink / raw)
  To: Ahmed S. Darwish
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Paul E. McKenney,
	Sebastian A. Siewior, Steven Rostedt, LKML, Andrew Morton,
	Konstantin Khlebnikov, linux-mm

On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote:
> @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>   */
>  void lru_add_drain_all(void)
>  {

> +	static unsigned int lru_drain_gen;
>  	static struct cpumask has_work;
> +	static DEFINE_MUTEX(lock);
> +	int cpu, this_gen;
>  
>  	/*
>  	 * Make sure nobody triggers this path before mm_percpu_wq is fully
> @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
>  	if (WARN_ON(!mm_percpu_wq))
>  		return;
>  

> +	this_gen = READ_ONCE(lru_drain_gen);
> +	smp_rmb();

	this_gen = smp_load_acquire(&lru_drain_gen);
>  
>  	mutex_lock(&lock);
>  
>  	/*
> +	 * (C) Exit the draining operation if a newer generation, from another
> +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
>  	 */
> +	if (unlikely(this_gen != lru_drain_gen))
>  		goto done;
>  

> +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> +	smp_wmb();

You can leave this smp_wmb() out and rely on the smp_mb() implied by
queue_work_on()'s test_and_set_bit().

>  	cpumask_clear(&has_work);
> -
>  	for_each_online_cpu(cpu) {
>  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>  

While you're here, do:

	s/cpumask_set_cpu/__&/

> @@ -766,7 +803,7 @@ void lru_add_drain_all(void)
>  {
>  	lru_add_drain();
>  }
> -#endif
> +#endif /* CONFIG_SMP */
>  
>  /**
>   * release_pages - batched put_page()



* Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API
From: Sebastian A. Siewior @ 2020-05-22 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ahmed S. Darwish, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Paul E. McKenney, Steven Rostedt, LKML, Andrew Morton,
	Konstantin Khlebnikov, linux-mm

On 2020-05-22 16:57:07 [+0200], Peter Zijlstra wrote:
> > @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
> >  	if (WARN_ON(!mm_percpu_wq))
> >  		return;
> >  
> 
> > +	this_gen = READ_ONCE(lru_drain_gen);
> > +	smp_rmb();
> 
> 	this_gen = smp_load_acquire(&lru_drain_gen);
> >  
> >  	mutex_lock(&lock);
> >  
> >  	/*
> > +	 * (C) Exit the draining operation if a newer generation, from another
> > +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
> >  	 */
> > +	if (unlikely(this_gen != lru_drain_gen))
> >  		goto done;
> >  
> 
> > +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> > +	smp_wmb();
> 
> You can leave this smp_wmb() out and rely on the smp_mb() implied by
> queue_work_on()'s test_and_set_bit().

This is to avoid smp_store_release()?

Sebastian



* Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API
From: Peter Zijlstra @ 2020-05-22 16:23 UTC (permalink / raw)
  To: Sebastian A. Siewior
  Cc: Ahmed S. Darwish, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Paul E. McKenney, Steven Rostedt, LKML, Andrew Morton,
	Konstantin Khlebnikov, linux-mm

On Fri, May 22, 2020 at 05:17:05PM +0200, Sebastian A. Siewior wrote:
> On 2020-05-22 16:57:07 [+0200], Peter Zijlstra wrote:
> > > @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
> > >  	if (WARN_ON(!mm_percpu_wq))
> > >  		return;
> > >  
> > 
> > > +	this_gen = READ_ONCE(lru_drain_gen);
> > > +	smp_rmb();
> > 
> > 	this_gen = smp_load_acquire(&lru_drain_gen);
> > >  
> > >  	mutex_lock(&lock);
> > >  
> > >  	/*
> > > +	 * (C) Exit the draining operation if a newer generation, from another
> > > +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
> > >  	 */
> > > +	if (unlikely(this_gen != lru_drain_gen))
> > >  		goto done;
> > >  
> > 
> > > +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> > > +	smp_wmb();
> > 
> > You can leave this smp_wmb() out and rely on the smp_mb() implied by
> > queue_work_on()'s test_and_set_bit().
> 
> This is to avoid smp_store_release() ?

store_release would have the barrier on the other end. If you read the
comments (I so helpfully cut out) you'll see it wants to order against
later stores, not earlier ones.
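
I.e. (sketch):

	A; smp_store_release(&lru_drain_gen, gen);	/* orders A *before* the store */

	WRITE_ONCE(lru_drain_gen, gen);			/* (D) wants the opposite:    */
	smp_wmb();					/* the store ordered *before* */
	B;						/* the later stores in B      */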



* Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API
From: Ahmed S. Darwish @ 2020-05-25 15:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Paul E. McKenney,
	Sebastian A. Siewior, Steven Rostedt, LKML, Andrew Morton,
	Konstantin Khlebnikov, linux-mm

Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote:
> > @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
> >   */
> >  void lru_add_drain_all(void)
> >  {
>

Re-adding cut-out comment for context:

	/*
	 * lru_drain_gen - Current generation of pages that could be in vectors
	 *
	 * (A) Definition: lru_drain_gen = x implies that all generations
	 *     0 < n <= x are already scheduled for draining.
	 *
	 * This is an optimization for the highly-contended use case where a
	 * user space workload keeps constantly generating a flow of pages
	 * for each CPU.
	 */
> > +	static unsigned int lru_drain_gen;
> >  	static struct cpumask has_work;
> > +	static DEFINE_MUTEX(lock);
> > +	int cpu, this_gen;
> >
> >  	/*
> >  	 * Make sure nobody triggers this path before mm_percpu_wq is fully
> > @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
> >  	if (WARN_ON(!mm_percpu_wq))
> >  		return;
> >
>

Re-adding cut-out comment for context:

	/*
	 * (B) Cache the LRU draining generation number
	 *
	 * smp_rmb() ensures that the counter is loaded before the mutex is
	 * taken. It pairs with the smp_wmb() inside the mutex critical section
	 * at (D).
	 */
> > +	this_gen = READ_ONCE(lru_drain_gen);
> > +	smp_rmb();
>
> 	this_gen = smp_load_acquire(&lru_drain_gen);

ACK. will do.

> >
> >  	mutex_lock(&lock);
> >
> >  	/*
> > +	 * (C) Exit the draining operation if a newer generation, from another
> > +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
> >  	 */
> > +	if (unlikely(this_gen != lru_drain_gen))
> >  		goto done;
> >
>

Re-adding cut-out comment for context:

	/*
	 * (D) Increment generation number
	 *
	 * Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical
	 * section.
	 *
	 * This pairing must be done here, before the for_each_online_cpu loop
	 * below which drains the page vectors.
	 *
	 * Let x, y, and z represent some system CPU numbers, where x < y < z.
	 * Assume CPU #z is in the middle of the for_each_online_cpu loop
	 * below and has already reached CPU #y's per-cpu data. CPU #x comes
	 * along, adds some pages to its per-cpu vectors, then calls
	 * lru_add_drain_all().
	 *
	 * If the paired smp_wmb() below is done at any later step, e.g. after
	 * the loop, CPU #x will just exit at (C) and miss flushing out all of
	 * its added pages.
	 */
> > +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> > +	smp_wmb();
>
> You can leave this smp_wmb() out and rely on the smp_mb() implied by
> queue_work_on()'s test_and_set_bit().
>

Won't this be too implicit?

Isn't it possible that, over the years, queue_work_on()'s implementation
changes and the test_and_set_bit()/smp_mb() gets removed?

If that happens, this commit will get *silently* broken and the local
CPU pages won't be drained.

> >  	cpumask_clear(&has_work);
> > -
> >  	for_each_online_cpu(cpu) {
> >  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> >
>
> While you're here, do:
>
> 	s/cpumask_set_cpu/__&/
>

ACK.

Thanks,

--
Ahmed S. Darwish
Linutronix GmbH



* Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API
From: Peter Zijlstra @ 2020-05-25 15:45 UTC (permalink / raw)
  To: Ahmed S. Darwish
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Paul E. McKenney,
	Sebastian A. Siewior, Steven Rostedt, LKML, Andrew Morton,
	Konstantin Khlebnikov, linux-mm

On Mon, May 25, 2020 at 05:24:01PM +0200, Ahmed S. Darwish wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote:

> > > +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> > > +	smp_wmb();
> >
> > You can leave this smp_wmb() out and rely on the smp_mb() implied by
> > queue_work_on()'s test_and_set_bit().
> >
> 
> Won't this be too implicit?
> 
> Isn't it possible that, over the years, queue_work_on() impementation
> changes and the test_and_set_bit()/smp_mb() gets removed?
> 
> If that happens, this commit will get *silently* broken and the local
> CPU pages won't be drained.

Add a comment to queue_work_on() that points here? That way people are
aware.
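
Something along these lines in kernel/workqueue.c, perhaps
(hypothetical wording):

	/*
	 * The smp_mb() implied by test_and_set_bit(WORK_STRUCT_PENDING_BIT)
	 * is relied upon by mm/swap.c:lru_add_drain_all(). Audit that
	 * caller before weakening or removing it.
	 */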




* Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API
From: John Ogness @ 2020-05-25 16:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ahmed S. Darwish, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Paul E. McKenney, Sebastian A. Siewior, Steven Rostedt, LKML,
	Andrew Morton, Konstantin Khlebnikov, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2799 bytes --]

Hi,

This optimization is broken. The main concern here: Is it possible that
lru_add_drain_all() _would_ have drained pagevec X, but then aborted
because another lru_add_drain_all() is underway and that other task will
_not_ drain pagevec X? I claim the answer is yes!

My suggested changes are inline below.

I attached a litmus test to verify it.

On 2020-05-22, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote:
>> @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>>   */
>>  void lru_add_drain_all(void)
>>  {
>
>> +	static unsigned int lru_drain_gen;
>>  	static struct cpumask has_work;
>> +	static DEFINE_MUTEX(lock);
>> +	int cpu, this_gen;
>>  
>>  	/*
>>  	 * Make sure nobody triggers this path before mm_percpu_wq is fully
>> @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
>>  	if (WARN_ON(!mm_percpu_wq))
>>  		return;
>>  

An smp_mb() is needed here.

	/*
	 * Guarantee the pagevec counter stores visible by
	 * this CPU are visible to other CPUs before loading
	 * the current drain generation.
	 */
	smp_mb();

>> +	this_gen = READ_ONCE(lru_drain_gen);
>> +	smp_rmb();
>
> 	this_gen = smp_load_acquire(&lru_drain_gen);
>>  
>>  	mutex_lock(&lock);
>>  
>>  	/*
>> +	 * (C) Exit the draining operation if a newer generation, from another
>> +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
>>  	 */
>> +	if (unlikely(this_gen != lru_drain_gen))
>>  		goto done;
>>  
>
>> +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
>> +	smp_wmb();

Instead of smp_wmb(), this needs to be a full memory barrier.

	/*
	 * Guarantee the new drain generation is stored before
	 * loading the pagevec counters.
	 */
	smp_mb();

> You can leave this smp_wmb() out and rely on the smp_mb() implied by
> queue_work_on()'s test_and_set_bit().
>
>>  	cpumask_clear(&has_work);
>> -
>>  	for_each_online_cpu(cpu) {
>>  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>>  
>
> While you're here, do:
>
> 	s/cpumask_set_cpu/__&/
>
>> @@ -766,7 +803,7 @@ void lru_add_drain_all(void)
>>  {
>>  	lru_add_drain();
>>  }
>> -#endif
>> +#endif /* CONFIG_SMP */
>>  
>>  /**
>>   * release_pages - batched put_page()

For the litmus test:

1:rx=0             (P1 did not see the pagevec counter)
2:rx=1             (P2 _would_ have seen the pagevec counter)
2:ry1=0 /\ 2:ry2=1 (P2 aborted due to optimization)

Changing the smp_mb() back to smp_wmb() in P1 and removing the smp_mb()
in P2 represents this patch. And it shows that sometimes P2 will abort
even though it would have drained the pagevec and P1 did not drain the
pagevec.

This is ugly as hell. And there may be other memory barrier types to
make it pretty. But as is, memory barriers are missing.

John Ogness


[-- Attachment #2: lru_add_drain_all.litmus --]
[-- Type: text/plain, Size: 1069 bytes --]

C lru_add_drain_all

(*
 * x is a pagevec counter
 * y is @lru_drain_gen
 * z is @lock
 *)

{ }

P0(int *x)
{
	// mark pagevec for draining
	WRITE_ONCE(*x, 1);
}

P1(int *x, int *y, int *z)
{
	int rx;
	int rz;

	// mutex_lock(&lock);
	rz = cmpxchg_acquire(z, 0, 1);
	if (rz == 0) {

		// WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
		WRITE_ONCE(*y, 1);

		// guarantee lru_drain_gen store before loading pagevec
		smp_mb();

		// if (pagevec_count(...))
		rx = READ_ONCE(*x);

		// mutex_unlock(&lock);
		rz = cmpxchg_release(z, 1, 2);
	}
}

P2(int *x, int *y, int *z)
{
	int rx;
	int ry1;
	int ry2;
	int rz;

	// the pagevec counter as visible now to this CPU
	rx = READ_ONCE(*x);

	// guarantee pagevec store before loading lru_drain_gen
	smp_mb();

	// this_gen = READ_ONCE(lru_drain_gen); smp_rmb();
	ry1 = smp_load_acquire(y);

	// mutex_lock(&lock) - acquired after P1
	rz = cmpxchg_acquire(z, 2, 3);
	if (rz == 2) {

		// if (unlikely(this_gen != lru_drain_gen))
		ry2 = READ_ONCE(*y);
	}
}

locations [x; y; z]
exists (1:rx=0 /\ 2:rx=1 /\ 2:ry1=0 /\ 2:ry2=1)


* [PATCH v1 2/8] mm/swap: Do not abuse the seqcount_t latching API
From: Ahmed S. Darwish @ 2020-08-27 11:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Andrew Morton,
	Konstantin Khlebnikov, linux-mm
  Cc: Thomas Gleixner, Sebastian A. Siewior, LKML, Ahmed S. Darwish

Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls")
implemented an optimization mechanism to exit the to-be-started LRU
drain operation (name it A) if another drain operation *started and
finished* while (A) was blocked on the LRU draining mutex.

This was done through a seqcount_t latch, which is an abuse of its
semantics:

  1. seqcount_t latching should be used for the purpose of switching
     between two storage places with sequence protection to allow
     interruptible, preemptible writer sections. The referenced
     optimization mechanism has absolutely nothing to do with that.

  2. The used raw_write_seqcount_latch() has two SMP write memory
     barriers to ensure one consistent storage place out of the two
     storage places available. A full memory barrier is required
     instead: to guarantee that the pagevec counter stores visible by
     local CPU are visible to other CPUs -- before loading the current
     drain generation.

Besides the seqcount_t API abuse, the semantics of a latch sequence
counter were force-fitted into the referenced optimization. What was
meant is to track "generations" of LRU draining operations, where
"global lru draining generation = x" implies that all generations
0 < n <= x are already *scheduled* for draining -- thus nothing needs
to be done if the current generation number n <= x.

Remove the conceptually-inappropriate seqcount_t latch usage. Manually
implement the referenced optimization using a counter and SMP memory
barriers.

Note, while at it, use the non-atomic variant of cpumask_set_cpu(),
__cpumask_set_cpu(), due to the already existing mutex protection.

Link: https://lkml.kernel.org/r/CALYGNiPSr-cxV9MX9czaVh6Wz_gzSv3H_8KPvgjBTGbJywUJpA@mail.gmail.com
Link: https://lkml.kernel.org/r/87y2pg9erj.fsf@vostro.fn.ogness.net
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 mm/swap.c | 65 +++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 54 insertions(+), 11 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index d16d65d9b4e0..a1ec807e325d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -763,10 +763,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
  */
 void lru_add_drain_all(void)
 {
-	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
-	static DEFINE_MUTEX(lock);
+	/*
+	 * lru_drain_gen - Global pages generation number
+	 *
+	 * (A) Definition: global lru_drain_gen = x implies that all generations
+	 *     0 < n <= x are already *scheduled* for draining.
+	 *
+	 * This is an optimization for the highly-contended use case where a
+	 * user space workload keeps constantly generating a flow of pages for
+	 * each CPU.
+	 */
+	static unsigned int lru_drain_gen;
 	static struct cpumask has_work;
-	int cpu, seq;
+	static DEFINE_MUTEX(lock);
+	unsigned cpu, this_gen;
 
 	/*
 	 * Make sure nobody triggers this path before mm_percpu_wq is fully
@@ -775,21 +785,54 @@ void lru_add_drain_all(void)
 	if (WARN_ON(!mm_percpu_wq))
 		return;
 
-	seq = raw_read_seqcount_latch(&seqcount);
+	/*
+	 * Guarantee pagevec counter stores visible by this CPU are visible to
+	 * other CPUs before loading the current drain generation.
+	 */
+	smp_mb();
+
+	/*
+	 * (B) Locally cache global LRU draining generation number
+	 *
+	 * The read barrier ensures that the counter is loaded before the mutex
+	 * is taken. It pairs with smp_mb() inside the mutex critical section
+	 * at (D).
+	 */
+	this_gen = smp_load_acquire(&lru_drain_gen);
 
 	mutex_lock(&lock);
 
 	/*
-	 * Piggyback on drain started and finished while we waited for lock:
-	 * all pages pended at the time of our enter were drained from vectors.
+	 * (C) Exit the draining operation if a newer generation, from another
+	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
 	 */
-	if (__read_seqcount_retry(&seqcount, seq))
+	if (unlikely(this_gen != lru_drain_gen))
 		goto done;
 
-	raw_write_seqcount_latch(&seqcount);
+	/*
+	 * (D) Increment global generation number
+	 *
+	 * Pairs with smp_load_acquire() at (B), outside of the critical
+	 * section. Use a full memory barrier to guarantee that the new global
+	 * drain generation number is stored before loading pagevec counters.
+	 *
+	 * This pairing must be done here, before the for_each_online_cpu loop
+	 * below which drains the page vectors.
+	 *
+	 * Let x, y, and z represent some system CPU numbers, where x < y < z.
+	 * Assume CPU #z is in the middle of the for_each_online_cpu loop
+	 * below and has already reached CPU #y's per-cpu data. CPU #x comes
+	 * along, adds some pages to its per-cpu vectors, then calls
+	 * lru_add_drain_all().
+	 *
+	 * If the paired barrier is done at any later step, e.g. after the
+	 * loop, CPU #x will just exit at (C) and miss flushing out all of its
+	 * added pages.
+	 */
+	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
+	smp_mb();
 
 	cpumask_clear(&has_work);
-
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
@@ -801,7 +844,7 @@ void lru_add_drain_all(void)
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			queue_work_on(cpu, mm_percpu_wq, work);
-			cpumask_set_cpu(cpu, &has_work);
+			__cpumask_set_cpu(cpu, &has_work);
 		}
 	}
 
@@ -816,7 +859,7 @@ void lru_add_drain_all(void)
 {
 	lru_add_drain();
 }
-#endif
+#endif /* CONFIG_SMP */
 
 /**
  * release_pages - batched put_page()
-- 
2.28.0



