linux-kernel.vger.kernel.org archive mirror
* [PATCH tip/core/rcu 0/9] Miscellaneous fixes for v4.15
@ 2017-10-04 21:29 Paul E. McKenney
  2017-10-04 21:29 ` [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays Paul E. McKenney
                   ` (8 more replies)
  0 siblings, 9 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-04 21:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg

Hello!

This series contains the following miscellaneous fixes:

1.	Provide GP ordering in face of migrations and delays.

2.	Fix up pending cbs check in rcu_prepare_for_idle(), courtesy
	of Neeraj Upadhyay.

3.	Create call_rcu_tasks() kthread at boot time.

4.	Map irq_work_queue_on() to irq_work_queue() in !SMP.

5.	Add parameters to SRCU docbook comments.

6.	Make resched_cpu() unconditional.

7.	Pretend ->boost_mtx acquired legitimately for lockdep's benefit.

8.	Add extended-quiescent-state testing advice.

9.	Include rcupdate.h into kernel/rcu/segcblist, courtesy of
	Sebastian Andrzej Siewior.

							Thanx, Paul

------------------------------------------------------------------------

 include/linux/irq_work.h   |    3 ---
 include/linux/srcu.h       |    1 +
 kernel/irq_work.c          |    9 ++++++---
 kernel/rcu/rcu_segcblist.c |    1 +
 kernel/rcu/srcutree.c      |    2 +-
 kernel/rcu/tree.c          |   30 ++++++++++++++++++++++++++++++
 kernel/rcu/tree_plugin.h   |    7 +++++--
 kernel/rcu/update.c        |   34 +++++++++++++++-------------------
 kernel/sched/core.c        |    3 +--
 9 files changed, 60 insertions(+), 30 deletions(-)


* [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-04 21:29 [PATCH tip/core/rcu 0/9] Miscellaneous fixes for v4.15 Paul E. McKenney
@ 2017-10-04 21:29 ` Paul E. McKenney
  2017-10-05  9:41   ` Peter Zijlstra
  2017-10-05 13:17   ` Steven Rostedt
  2017-10-04 21:29 ` [PATCH tip/core/rcu 2/9] rcu: Fix up pending cbs check in rcu_prepare_for_idle Paul E. McKenney
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-04 21:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	Paul E. McKenney

Consider the following admittedly improbable sequence of events:

o	RCU is initially idle.

o	Task A on CPU 0 executes rcu_read_lock().

o	Task B on CPU 1 executes synchronize_rcu(), which must
	wait on Task A:

	o	Task B registers the callback, which starts a new
		grace period, awakening the grace-period kthread
		on CPU 3, which immediately starts a new grace period.

	o	Task B migrates to CPU 2, which provides a quiescent
		state for both CPUs 1 and 2.

	o	Both CPUs 1 and 2 take scheduling-clock interrupts,
		and both invoke RCU_SOFTIRQ, both thus learning of the
		new grace period.

	o	Task B is delayed, perhaps by vCPU preemption on CPU 2.

o	CPUs 2 and 3 pass through quiescent states, which are reported
	to core RCU.

o	Task B is resumed just long enough to be migrated to CPU 3,
	and then is once again delayed.

o	Task A executes rcu_read_unlock(), exiting its RCU read-side
	critical section.

o	CPU 0 passes through a quiescent state, which is reported to
	core RCU.  Only CPU 1 continues to block the grace period.

o	CPU 1 passes through a quiescent state, which is reported to
	core RCU.  This ends the grace period, and CPU 1 therefore
	invokes its callbacks, one of which awakens Task B via
	complete().

o	Task B resumes (still on CPU 3) and starts executing
	wait_for_completion(), which sees that the completion has
	already completed, and thus does not block.  It returns from
	the synchronize_rcu() without any ordering against the
	end of Task A's RCU read-side critical section.

	It can therefore mess up Task A's RCU read-side critical section,
	in theory, anyway.

However, if CPU hotplug ever gets rid of stop_machine(), there will be
more straightforward ways for this sort of thing to happen, so this
commit adds a memory barrier in order to enforce the needed ordering.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/update.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 5033b66d2753..9e599fcdd7bf 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -413,6 +413,16 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
 			wait_for_completion(&rs_array[i].completion);
 		destroy_rcu_head_on_stack(&rs_array[i].head);
 	}
+
+	/*
+	 * If we migrated after we registered a callback, but before the
+	 * corresponding wait_for_completion(), we might now be running
+	 * on a CPU that has not yet noticed that the corresponding grace
+	 * period has ended.  That CPU might not yet be fully ordered
+	 * against the completion of the grace period, so the full memory
+	 * barrier below enforces that ordering via the completion's state.
+	 */
+	smp_mb(); /* ^^^ */
 }
 EXPORT_SYMBOL_GPL(__wait_rcu_gp);
 
-- 
2.5.2


* [PATCH tip/core/rcu 2/9] rcu: Fix up pending cbs check in rcu_prepare_for_idle
  2017-10-04 21:29 [PATCH tip/core/rcu 0/9] Miscellaneous fixes for v4.15 Paul E. McKenney
  2017-10-04 21:29 ` [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays Paul E. McKenney
@ 2017-10-04 21:29 ` Paul E. McKenney
  2017-10-04 21:29 ` [PATCH tip/core/rcu 3/9] rcu: Create call_rcu_tasks() kthread at boot time Paul E. McKenney
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-04 21:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	Neeraj Upadhyay, Paul E. McKenney, stable

From: Neeraj Upadhyay <neeraju@codeaurora.org>

The pending-callbacks check in rcu_prepare_for_idle() is backwards.
It should accelerate if there are pending callbacks, but the check
rather uselessly accelerates only if there are no callbacks.  This commit
therefore inverts this check.

Fixes: 15fecf89e46a ("srcu: Abstract multi-tail callback list handling")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # 4.12.x
---
 kernel/rcu/tree_plugin.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index e012b9be777e..fed95fa941e6 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1507,7 +1507,7 @@ static void rcu_prepare_for_idle(void)
 	rdtp->last_accelerate = jiffies;
 	for_each_rcu_flavor(rsp) {
 		rdp = this_cpu_ptr(rsp->rda);
-		if (rcu_segcblist_pend_cbs(&rdp->cblist))
+		if (!rcu_segcblist_pend_cbs(&rdp->cblist))
 			continue;
 		rnp = rdp->mynode;
 		raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
-- 
2.5.2


* [PATCH tip/core/rcu 3/9] rcu: Create call_rcu_tasks() kthread at boot time
  2017-10-04 21:29 [PATCH tip/core/rcu 0/9] Miscellaneous fixes for v4.15 Paul E. McKenney
  2017-10-04 21:29 ` [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays Paul E. McKenney
  2017-10-04 21:29 ` [PATCH tip/core/rcu 2/9] rcu: Fix up pending cbs check in rcu_prepare_for_idle Paul E. McKenney
@ 2017-10-04 21:29 ` Paul E. McKenney
  2017-10-04 21:29 ` [PATCH tip/core/rcu 4/9] irq_work: Map irq_work_queue_on() to irq_work_queue() in !SMP Paul E. McKenney
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-04 21:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	Paul E. McKenney

Currently the call_rcu_tasks() kthread is created upon first
invocation of call_rcu_tasks().  This has the advantage of avoiding
creation if there are never any invocations of call_rcu_tasks() and of
synchronize_rcu_tasks(), but it requires an unreliable heuristic to
determine when it is safe to create the kthread.  For example, it is
not safe to create the kthread when call_rcu_tasks() is invoked with
a spinlock held, but there is no good way to detect this in !PREEMPT
kernels.

This commit therefore creates this kthread unconditionally at
core_initcall() time.  If you don't want this kthread created, then
build with CONFIG_TASKS_RCU=n.
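
For reference, here is a minimal caller sketch (hypothetical names, not
part of this patch).  With the kthread created at core_initcall() time,
nothing in it depends on the context from which the callback is queued:

	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct my_obj {				/* hypothetical structure */
		struct rcu_head rh;
		int payload;
	};

	static void my_free_cb(struct rcu_head *rhp)
	{
		kfree(container_of(rhp, struct my_obj, rh));
	}

	static void my_retire(struct my_obj *p)
	{
		/*
		 * Safe even where kthread creation would not be, for
		 * example with a raw spinlock held in a !PREEMPT kernel.
		 */
		call_rcu_tasks(&p->rh, my_free_cb);
	}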

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/update.c | 24 +++++-------------------
 1 file changed, 5 insertions(+), 19 deletions(-)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 9e599fcdd7bf..ed5922857384 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -585,7 +585,6 @@ DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
 module_param(rcu_task_stall_timeout, int, 0644);
 
-static void rcu_spawn_tasks_kthread(void);
 static struct task_struct *rcu_tasks_kthread_ptr;
 
 /**
@@ -610,7 +609,6 @@ void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
 {
 	unsigned long flags;
 	bool needwake;
-	bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);
 
 	rhp->next = NULL;
 	rhp->func = func;
@@ -620,11 +618,8 @@ void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
 	rcu_tasks_cbs_tail = &rhp->next;
 	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
 	/* We can't create the thread unless interrupts are enabled. */
-	if ((needwake && havetask) ||
-	    (!havetask && !irqs_disabled_flags(flags))) {
-		rcu_spawn_tasks_kthread();
+	if (needwake && READ_ONCE(rcu_tasks_kthread_ptr))
 		wake_up(&rcu_tasks_cbs_wq);
-	}
 }
 EXPORT_SYMBOL_GPL(call_rcu_tasks);
 
@@ -863,27 +858,18 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 	}
 }
 
-/* Spawn rcu_tasks_kthread() at first call to call_rcu_tasks(). */
-static void rcu_spawn_tasks_kthread(void)
+/* Spawn rcu_tasks_kthread() at core_initcall() time. */
+static int __init rcu_spawn_tasks_kthread(void)
 {
-	static DEFINE_MUTEX(rcu_tasks_kthread_mutex);
 	struct task_struct *t;
 
-	if (READ_ONCE(rcu_tasks_kthread_ptr)) {
-		smp_mb(); /* Ensure caller sees full kthread. */
-		return;
-	}
-	mutex_lock(&rcu_tasks_kthread_mutex);
-	if (rcu_tasks_kthread_ptr) {
-		mutex_unlock(&rcu_tasks_kthread_mutex);
-		return;
-	}
 	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
 	BUG_ON(IS_ERR(t));
 	smp_mb(); /* Ensure others see full kthread. */
 	WRITE_ONCE(rcu_tasks_kthread_ptr, t);
-	mutex_unlock(&rcu_tasks_kthread_mutex);
+	return 0;
 }
+core_initcall(rcu_spawn_tasks_kthread);
 
 /* Do the srcu_read_lock() for the above synchronize_srcu().  */
 void exit_tasks_rcu_start(void)
-- 
2.5.2


* [PATCH tip/core/rcu 4/9] irq_work: Map irq_work_queue_on() to irq_work_queue() in !SMP
  2017-10-04 21:29 [PATCH tip/core/rcu 0/9] Miscellaneous fixes for v4.15 Paul E. McKenney
                   ` (2 preceding siblings ...)
  2017-10-04 21:29 ` [PATCH tip/core/rcu 3/9] rcu: Create call_rcu_tasks() kthread at boot time Paul E. McKenney
@ 2017-10-04 21:29 ` Paul E. McKenney
  2017-10-04 21:29 ` [PATCH tip/core/rcu 5/9] srcu: Add parameters to SRCU docbook comments Paul E. McKenney
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-04 21:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	Paul E. McKenney

Commit 478850160636 ("irq_work: Implement remote queueing") provides
irq_work_queue_on() only for SMP builds.  However, providing it in all
builds simplifies code that submits irq_work to lists of CPUs by
eliminating the !SMP special cases.  This commit therefore maps
irq_work_queue_on() to irq_work_queue() in !SMP builds, while still
validating the specified CPU.
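
For illustration, a hypothetical caller (not part of this patch) that
now works identically in SMP and !SMP builds:

	#include <linux/init.h>
	#include <linux/irq_work.h>
	#include <linux/percpu.h>

	static void my_handler(struct irq_work *work)	/* hypothetical */
	{
		/* ... */
	}

	static DEFINE_PER_CPU(struct irq_work, my_work);

	static void __init my_irq_work_init(void)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			init_irq_work(per_cpu_ptr(&my_work, cpu), my_handler);
	}

	static void my_kick_cpu(int cpu)
	{
		/* No #ifdef CONFIG_SMP special case needed in the caller. */
		irq_work_queue_on(per_cpu_ptr(&my_work, cpu), cpu);
	}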

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/irq_work.h | 3 ---
 kernel/irq_work.c        | 9 ++++++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 47b9ebd4a74f..d8c9876d5da4 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -33,10 +33,7 @@ void init_irq_work(struct irq_work *work, void (*func)(struct irq_work *))
 #define DEFINE_IRQ_WORK(name, _f) struct irq_work name = { .func = (_f), }
 
 bool irq_work_queue(struct irq_work *work);
-
-#ifdef CONFIG_SMP
 bool irq_work_queue_on(struct irq_work *work, int cpu);
-#endif
 
 void irq_work_tick(void);
 void irq_work_sync(struct irq_work *work);
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..9f20f6c72579 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -56,7 +56,6 @@ void __weak arch_irq_work_raise(void)
 	 */
 }
 
-#ifdef CONFIG_SMP
 /*
  * Enqueue the irq_work @work on @cpu unless it's already pending
  * somewhere.
@@ -68,6 +67,8 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	/* All work should have been flushed before going offline */
 	WARN_ON_ONCE(cpu_is_offline(cpu));
 
+#ifdef CONFIG_SMP
+
 	/* Arch remote IPI send/receive backend aren't NMI safe */
 	WARN_ON_ONCE(in_nmi());
 
@@ -78,10 +79,12 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
 		arch_send_call_function_single_ipi(cpu);
 
+#else /* #ifdef CONFIG_SMP */
+	irq_work_queue(work);
+#endif /* #else #ifdef CONFIG_SMP */
+
 	return true;
 }
-EXPORT_SYMBOL_GPL(irq_work_queue_on);
-#endif
 
 /* Enqueue the irq work @work on the current CPU */
 bool irq_work_queue(struct irq_work *work)
-- 
2.5.2


* [PATCH tip/core/rcu 5/9] srcu: Add parameters to SRCU docbook comments
  2017-10-04 21:29 [PATCH tip/core/rcu 0/9] Miscellaneous fixes for v4.15 Paul E. McKenney
                   ` (3 preceding siblings ...)
  2017-10-04 21:29 ` [PATCH tip/core/rcu 4/9] irq_work: Map irq_work_queue_on() to irq_work_queue() in !SMP Paul E. McKenney
@ 2017-10-04 21:29 ` Paul E. McKenney
  2017-10-04 21:29 ` [PATCH tip/core/rcu 6/9] sched: Make resched_cpu() unconditional Paul E. McKenney
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-04 21:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	Paul E. McKenney

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/srcu.h  | 1 +
 kernel/rcu/srcutree.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 39af9bc0f653..62be8966e837 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -78,6 +78,7 @@ void synchronize_srcu(struct srcu_struct *sp);
 
 /**
  * srcu_read_lock_held - might we be in SRCU read-side critical section?
+ * @sp: The srcu_struct structure to check
  *
  * If CONFIG_DEBUG_LOCK_ALLOC is selected, returns nonzero iff in an SRCU
  * read-side critical section.  In absence of CONFIG_DEBUG_LOCK_ALLOC,
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 729a8706751d..6d5880089ff6 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -854,7 +854,7 @@ void __call_srcu(struct srcu_struct *sp, struct rcu_head *rhp,
 /**
  * call_srcu() - Queue a callback for invocation after an SRCU grace period
  * @sp: srcu_struct in queue the callback
- * @head: structure to be used for queueing the SRCU callback.
+ * @rhp: structure to be used for queueing the SRCU callback.
  * @func: function to be invoked after the SRCU grace period
  *
  * The callback function will be invoked some time after a full SRCU
-- 
2.5.2


* [PATCH tip/core/rcu 6/9] sched: Make resched_cpu() unconditional
  2017-10-04 21:29 [PATCH tip/core/rcu 0/9] Miscellaneous fixes for v4.15 Paul E. McKenney
                   ` (4 preceding siblings ...)
  2017-10-04 21:29 ` [PATCH tip/core/rcu 5/9] srcu: Add parameters to SRCU docbook comments Paul E. McKenney
@ 2017-10-04 21:29 ` Paul E. McKenney
  2017-10-04 21:29 ` [PATCH tip/core/rcu 7/9] rcu: Pretend ->boost_mtx acquired legitimately Paul E. McKenney
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-04 21:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	Paul E. McKenney, stable

The current implementation of synchronize_sched_expedited() incorrectly
assumes that resched_cpu() is unconditional, which it is not.  This means
that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
fails as follows (analysis by Neeraj Upadhyay):

o	CPU1 is waiting for expedited wait to complete:

	sync_rcu_exp_select_cpus
	     rdp->exp_dynticks_snap & 0x1   // returns 1 for CPU5
	     IPI sent to CPU5

	synchronize_sched_expedited_wait
		 ret = swait_event_timeout(rsp->expedited_wq,
					   sync_rcu_preempt_exp_done(rnp_root),
					   jiffies_stall);

	expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())

o	CPU5 handles IPI and fails to acquire rq lock.

	Handles IPI
	     sync_sched_exp_handler
		 resched_cpu
		     returns early because the rq->lock trylock fails
		 need_resched is not set

o	CPU5 calls rcu_idle_enter() and, because need_resched is not set, goes
	idle (schedule() is not called).

o	CPU1 reports an RCU stall.

Given that resched_cpu() is now used only by RCU, this commit fixes the
assumption by making resched_cpu() unconditional.
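
For clarity, here is the pre-patch implementation once more, annotated
to show where the resched request is lost (code reproduced from the
diff below, comments added):

	void resched_cpu(int cpu)		/* pre-patch version */
	{
		struct rq *rq = cpu_rq(cpu);
		unsigned long flags;

		if (!raw_spin_trylock_irqsave(&rq->lock, flags))
			return;	/* rq->lock is busy: the request is silently
				 * dropped, need_resched stays clear, and the
				 * expedited grace period can stall. */
		resched_curr(rq);
		raw_spin_unlock_irqrestore(&rq->lock, flags);
	}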

Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Suggested-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
---
 kernel/sched/core.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..8fa7b6f9e19b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -505,8 +505,7 @@ void resched_cpu(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
-	if (!raw_spin_trylock_irqsave(&rq->lock, flags))
-		return;
+	raw_spin_lock_irqsave(&rq->lock, flags);
 	resched_curr(rq);
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
-- 
2.5.2


* [PATCH tip/core/rcu 7/9] rcu: Pretend ->boost_mtx acquired legitimately
  2017-10-04 21:29 [PATCH tip/core/rcu 0/9] Miscellaneous fixes for v4.15 Paul E. McKenney
                   ` (5 preceding siblings ...)
  2017-10-04 21:29 ` [PATCH tip/core/rcu 6/9] sched: Make resched_cpu() unconditional Paul E. McKenney
@ 2017-10-04 21:29 ` Paul E. McKenney
  2017-10-05  9:50   ` Peter Zijlstra
  2017-10-04 21:29 ` [PATCH tip/core/rcu 8/9] rcu: Add extended-quiescent-state testing advice Paul E. McKenney
  2017-10-04 21:29 ` [PATCH tip/core/rcu 9/9] rcu/segcblist: Include rcupdate.h Paul E. McKenney
  8 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-04 21:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	Paul E. McKenney

RCU priority boosting uses rt_mutex_init_proxy_locked() to initialize an
rt_mutex structure in locked state held by some other task.  When that
other task releases it, lockdep complains (quite accurately, but a bit
uselessly) that the other task never acquired it.  This complaint can
suppress other, more helpful, lockdep complaints, and in any case it is
a false positive.

This commit therefore uses the mutex_acquire() macro to make it look
like that other process legitimately acquired the lock, thus suppressing
this lockdep false-positive complaint.

Of course, if lockdep ever learns about rt_mutex_init_proxy_locked(),
this commit will need to be reverted.
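
Schematically, the sequence that confuses lockdep and the workaround
look as follows (simplified sketch of the boost/unboost path, with
"reader_task" standing in for the boosted reader):

	/* Boosting side: initialize ->boost_mtx as if the to-be-boosted
	 * reader already held it.  Lockdep never sees an acquisition. */
	rt_mutex_init_proxy_locked(&rnp->boost_mtx, reader_task);

	/* Reader side, in rcu_read_unlock_special(), when unboosting: */
	mutex_acquire(&rnp->boost_mtx.dep_map, 0, 0, _RET_IP_);
	rt_mutex_unlock(&rnp->boost_mtx);	/* no false positive now */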

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree_plugin.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index fed95fa941e6..60bfb16c9a1a 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -529,8 +529,11 @@ void rcu_read_unlock_special(struct task_struct *t)
 		}
 
 		/* Unboost if we were boosted. */
-		if (IS_ENABLED(CONFIG_RCU_BOOST) && drop_boost_mutex)
+		if (IS_ENABLED(CONFIG_RCU_BOOST) && drop_boost_mutex) {
+			/* For lockdep, pretend we acquired lock honestly. */
+			mutex_acquire(&rnp->boost_mtx.dep_map, 0, 0, _RET_IP_);
 			rt_mutex_unlock(&rnp->boost_mtx);
+		}
 
 		/*
 		 * If this was the last task on the expedited lists,
-- 
2.5.2


* [PATCH tip/core/rcu 8/9] rcu: Add extended-quiescent-state testing advice
  2017-10-04 21:29 [PATCH tip/core/rcu 0/9] Miscellaneous fixes for v4.15 Paul E. McKenney
                   ` (6 preceding siblings ...)
  2017-10-04 21:29 ` [PATCH tip/core/rcu 7/9] rcu: Pretend ->boost_mtx acquired legitimately Paul E. McKenney
@ 2017-10-04 21:29 ` Paul E. McKenney
  2017-10-04 21:29 ` [PATCH tip/core/rcu 9/9] rcu/segcblist: Include rcupdate.h Paul E. McKenney
  8 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-04 21:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	Paul E. McKenney

If you add or remove calls to rcu_idle_enter(), rcu_user_enter(),
rcu_irq_exit(), rcu_irq_exit_irqson(), rcu_idle_exit(), rcu_user_exit(),
rcu_irq_enter(), rcu_irq_enter_irqson(), rcu_nmi_enter(), or
rcu_nmi_exit(), you should run a full set of tests on a kernel built
with CONFIG_RCU_EQS_DEBUG=y.
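
As an illustration, the sort of change this advice is aimed at might
look like the following (hypothetical idle hook, not part of this
patch):

	#include <linux/rcupdate.h>

	static void my_low_level_idle(void)	/* hypothetical; stands in for
						 * the arch's real idle routine */
	{
	}

	static void my_arch_idle(void)		/* hypothetical idle path */
	{
		rcu_idle_enter();	/* CPU enters the extended quiescent state */
		my_low_level_idle();
		rcu_idle_exit();	/* CPU leaves the extended quiescent state */
	}

Adding, removing, or relocating either of those calls is the kind of
change that should be exercised with CONFIG_RCU_EQS_DEBUG=y.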

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0c44c7b42e6d..5186fcea8ba3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -837,6 +837,9 @@ static void rcu_eqs_enter(bool user)
  * We crowbar the ->dynticks_nesting field to zero to allow for
  * the possibility of usermode upcalls having messed up our count
  * of interrupt nesting level during the prior busy period.
+ *
+ * If you add or remove a call to rcu_idle_enter(), be sure to test with
+ * CONFIG_RCU_EQS_DEBUG=y.
  */
 void rcu_idle_enter(void)
 {
@@ -852,6 +855,9 @@ void rcu_idle_enter(void)
  * is permitted between this call and rcu_user_exit(). This way the
  * CPU doesn't need to maintain the tick for RCU maintenance purposes
  * when the CPU runs in userspace.
+ *
+ * If you add or remove a call to rcu_user_enter(), be sure to test with
+ * CONFIG_RCU_EQS_DEBUG=y.
  */
 void rcu_user_enter(void)
 {
@@ -875,6 +881,9 @@ void rcu_user_enter(void)
  * Use things like work queues to work around this limitation.
  *
  * You have been warned.
+ *
+ * If you add or remove a call to rcu_irq_exit(), be sure to test with
+ * CONFIG_RCU_EQS_DEBUG=y.
  */
 void rcu_irq_exit(void)
 {
@@ -899,6 +908,9 @@ void rcu_irq_exit(void)
 
 /*
  * Wrapper for rcu_irq_exit() where interrupts are enabled.
+ *
+ * If you add or remove a call to rcu_irq_exit_irqson(), be sure to test
+ * with CONFIG_RCU_EQS_DEBUG=y.
  */
 void rcu_irq_exit_irqson(void)
 {
@@ -971,6 +983,9 @@ static void rcu_eqs_exit(bool user)
  * allow for the possibility of usermode upcalls messing up our count
  * of interrupt nesting level during the busy period that is just
  * now starting.
+ *
+ * If you add or remove a call to rcu_idle_exit(), be sure to test with
+ * CONFIG_RCU_EQS_DEBUG=y.
  */
 void rcu_idle_exit(void)
 {
@@ -987,6 +1002,9 @@ void rcu_idle_exit(void)
  *
  * Exit RCU idle mode while entering the kernel because it can
  * run a RCU read side critical section anytime.
+ *
+ * If you add or remove a call to rcu_user_exit(), be sure to test with
+ * CONFIG_RCU_EQS_DEBUG=y.
  */
 void rcu_user_exit(void)
 {
@@ -1012,6 +1030,9 @@ void rcu_user_exit(void)
  * Use things like work queues to work around this limitation.
  *
  * You have been warned.
+ *
+ * If you add or remove a call to rcu_irq_enter(), be sure to test with
+ * CONFIG_RCU_EQS_DEBUG=y.
  */
 void rcu_irq_enter(void)
 {
@@ -1037,6 +1058,9 @@ void rcu_irq_enter(void)
 
 /*
  * Wrapper for rcu_irq_enter() where interrupts are enabled.
+ *
+ * If you add or remove a call to rcu_irq_enter_irqson(), be sure to test
+ * with CONFIG_RCU_EQS_DEBUG=y.
  */
 void rcu_irq_enter_irqson(void)
 {
@@ -1055,6 +1079,9 @@ void rcu_irq_enter_irqson(void)
  * that the CPU is active.  This implementation permits nested NMIs, as
  * long as the nesting level does not overflow an int.  (You will probably
  * run out of stack space first.)
+ *
+ * If you add or remove a call to rcu_nmi_enter(), be sure to test
+ * with CONFIG_RCU_EQS_DEBUG=y.
  */
 void rcu_nmi_enter(void)
 {
@@ -1087,6 +1114,9 @@ void rcu_nmi_enter(void)
  * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nmi_nesting
  * to let the RCU grace-period handling know that the CPU is back to
  * being RCU-idle.
+ *
+ * If you add or remove a call to rcu_nmi_exit(), be sure to test
+ * with CONFIG_RCU_EQS_DEBUG=y.
  */
 void rcu_nmi_exit(void)
 {
-- 
2.5.2


* [PATCH tip/core/rcu 9/9] rcu/segcblist: Include rcupdate.h
  2017-10-04 21:29 [PATCH tip/core/rcu 0/9] Miscellaneous fixes for v4.15 Paul E. McKenney
                   ` (7 preceding siblings ...)
  2017-10-04 21:29 ` [PATCH tip/core/rcu 8/9] rcu: Add extended-quiescent-state testing advice Paul E. McKenney
@ 2017-10-04 21:29 ` Paul E. McKenney
  8 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-04 21:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	Sebastian Andrzej Siewior, Paul E. McKenney

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

The RT build on ARM complains that ULONG_CMP_LT is undefined because
rcu_segcblist.c does not include rcupdate.h, which defines it.  This
commit therefore includes rcupdate.h into rcu_segcblist.c.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/rcu_segcblist.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 7649fcd2c4c7..88cba7c2956c 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -23,6 +23,7 @@
 #include <linux/types.h>
 #include <linux/kernel.h>
 #include <linux/interrupt.h>
+#include <linux/rcupdate.h>
 
 #include "rcu_segcblist.h"
 
-- 
2.5.2


* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-04 21:29 ` [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays Paul E. McKenney
@ 2017-10-05  9:41   ` Peter Zijlstra
  2017-10-05 14:55     ` Paul E. McKenney
  2017-10-05 13:17   ` Steven Rostedt
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-10-05  9:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Wed, Oct 04, 2017 at 02:29:27PM -0700, Paul E. McKenney wrote:
> Consider the following admittedly improbable sequence of events:
> 
> o	RCU is initially idle.
> 
> o	Task A on CPU 0 executes rcu_read_lock().
> 
> o	Task B on CPU 1 executes synchronize_rcu(), which must
> 	wait on Task A:
> 
> 	o	Task B registers the callback, which starts a new
> 		grace period, awakening the grace-period kthread
> 		on CPU 3, which immediately starts a new grace period.
> 
> 	o	Task B migrates to CPU 2, which provides a quiescent
> 		state for both CPUs 1 and 2.
> 
> 	o	Both CPUs 1 and 2 take scheduling-clock interrupts,
> 		and both invoke RCU_SOFTIRQ, both thus learning of the
> 		new grace period.
> 
> 	o	Task B is delayed, perhaps by vCPU preemption on CPU 2.
> 
> o	CPUs 2 and 3 pass through quiescent states, which are reported
> 	to core RCU.
> 
> o	Task B is resumed just long enough to be migrated to CPU 3,
> 	and then is once again delayed.
> 
> o	Task A executes rcu_read_unlock(), exiting its RCU read-side
> 	critical section.
> 
> o	CPU 0 passes through a quiescent state, which is reported to
> 	core RCU.  Only CPU 1 continues to block the grace period.
> 
> o	CPU 1 passes through a quiescent state, which is reported to
> 	core RCU.  This ends the grace period, and CPU 1 therefore
> 	invokes its callbacks, one of which awakens Task B via
> 	complete().
> 
> o	Task B resumes (still on CPU 3) and starts executing
> 	wait_for_completion(), which sees that the completion has
> 	already completed, and thus does not block.  It returns from
> 	the synchronize_rcu() without any ordering against the
> 	end of Task A's RCU read-side critical section.
> 
> 	It can therefore mess up Task A's RCU read-side critical section,
> 	in theory, anyway.

I'm not sure I follow, at the very least the wait_for_completion() does
an ACQUIRE such that it observes the state prior to the RELEASE as done
by complete(), no?

And is not CPU0's QS reporting ordered against that complete()?


* Re: [PATCH tip/core/rcu 7/9] rcu: Pretend ->boost_mtx acquired legitimately
  2017-10-04 21:29 ` [PATCH tip/core/rcu 7/9] rcu: Pretend ->boost_mtx acquired legitimately Paul E. McKenney
@ 2017-10-05  9:50   ` Peter Zijlstra
  2017-10-05 15:06     ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-10-05  9:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Wed, Oct 04, 2017 at 02:29:33PM -0700, Paul E. McKenney wrote:
> RCU priority boosting uses rt_mutex_init_proxy_locked() to initialize an
> rt_mutex structure in locked state held by some other task.  When that
> other task releases it, lockdep complains (quite accurately, but a bit
> uselessly) that the other task never acquired it.  This complaint can
> suppress other, more helpful, lockdep complaints, and in any case it is
> a false positive.
> 
> This commit therefore uses the mutex_acquire() macro to make it look
> like that other process legitimately acquired the lock, thus suppressing
> this lockdep false-positive complaint.
> 
> Of course, if lockdep ever learns about rt_mutex_init_proxy_locked(),
> this commit will need to be reverted.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

This is a consequence of me doing:

 f5694788ad8d ("rt_mutex: Add lockdep annotations")

Right?

> ---
>  kernel/rcu/tree_plugin.h | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index fed95fa941e6..60bfb16c9a1a 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -529,8 +529,11 @@ void rcu_read_unlock_special(struct task_struct *t)
>  		}
>  
>  		/* Unboost if we were boosted. */
> -		if (IS_ENABLED(CONFIG_RCU_BOOST) && drop_boost_mutex)
> +		if (IS_ENABLED(CONFIG_RCU_BOOST) && drop_boost_mutex) {
> +			/* For lockdep, pretend we acquired lock honestly. */
> +			mutex_acquire(&rnp->boost_mtx.dep_map, 0, 0, _RET_IP_);
>  			rt_mutex_unlock(&rnp->boost_mtx);
> +		}

So I'm thinking the problem is that you're mixing rt_mutex and PI-futex
primitives here. As per commit:

  5293c2efda37 ("futex,rt_mutex: Provide futex specific rt_mutex API")

these are two separate APIs, that should, ideally, not be mixed.

The 'right' counterpart to rt_mutex_init_proxy_locked() is
rt_mutex_futex_unlock() (which very much does not include lockdep bits).


* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-04 21:29 ` [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays Paul E. McKenney
  2017-10-05  9:41   ` Peter Zijlstra
@ 2017-10-05 13:17   ` Steven Rostedt
  2017-10-05 13:40     ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2017-10-05 13:17 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	fweisbec, oleg

On Wed,  4 Oct 2017 14:29:27 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> Consider the following admittedly improbable sequence of events:
> 
> o	RCU is initially idle.
> 
> o	Task A on CPU 0 executes rcu_read_lock().

A starts rcu_read_lock() critical section.

> 
> o	Task B on CPU 1 executes synchronize_rcu(), which must
> 	wait on Task A:

B waits for A.

> 
> 	o	Task B registers the callback, which starts a new
> 		grace period, awakening the grace-period kthread
> 		on CPU 3, which immediately starts a new grace period.

  [ isn't B blocked (off rq)? How does it migrate? ]

> 
> 	o	Task B migrates to CPU 2, which provides a quiescent
> 		state for both CPUs 1 and 2.
> 
> 	o	Both CPUs 1 and 2 take scheduling-clock interrupts,
> 		and both invoke RCU_SOFTIRQ, both thus learning of the
> 		new grace period.
> 
> 	o	Task B is delayed, perhaps by vCPU preemption on CPU 2.
> 
> o	CPUs 2 and 3 pass through quiescent states, which are reported
> 	to core RCU.
> 
> o	Task B is resumed just long enough to be migrated to CPU 3,
> 	and then is once again delayed.
> 
> o	Task A executes rcu_read_unlock(), exiting its RCU read-side
> 	critical section.

A calls rcu_read_unlock() ending the critical section

> 
> o	CPU 0 passes through a quiescent state, which is reported to
> 	core RCU.  Only CPU 1 continues to block the grace period.
> 
> o	CPU 1 passes through a quiescent state, which is reported to
> 	core RCU.  This ends the grace period, and CPU 1 therefore
> 	invokes its callbacks, one of which awakens Task B via
> 	complete().
> 
> o	Task B resumes (still on CPU 3) and starts executing
> 	wait_for_completion(), which sees that the completion has
> 	already completed, and thus does not block.  It returns from
> 	the synchronize_rcu() without any ordering against the
> 	end of Task A's RCU read-side critical section.

B runs


> 
> 	It can therefore mess up Task A's RCU read-side critical section,
> 	in theory, anyway.

I don't see how B ran during A's critical section.

-- Steve

> 
> However, if CPU hotplug ever gets rid of stop_machine(), there will be
> more straightforward ways for this sort of thing to happen, so this
> commit adds a memory barrier in order to enforce the needed ordering.
> 


* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-05 13:17   ` Steven Rostedt
@ 2017-10-05 13:40     ` Peter Zijlstra
  2017-10-05 14:13       ` Steven Rostedt
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-10-05 13:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, dhowells, edumazet,
	fweisbec, oleg

On Thu, Oct 05, 2017 at 09:17:03AM -0400, Steven Rostedt wrote:
> On Wed,  4 Oct 2017 14:29:27 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > Consider the following admittedly improbable sequence of events:
> > 
> > o	RCU is initially idle.
> > 
> > o	Task A on CPU 0 executes rcu_read_lock().
> 
> A starts rcu_read_lock() critical section.
> 
> > 
> > o	Task B on CPU 1 executes synchronize_rcu(), which must
> > 	wait on Task A:
> 
> B waits for A.
> 
> > 
> > 	o	Task B registers the callback, which starts a new
> > 		grace period, awakening the grace-period kthread
> > 		on CPU 3, which immediately starts a new grace period.
> 
>   [ isn't B blocked (off rq)? How does it migrate? ]

No, it's running synchronize_rcu() but hasn't blocked yet. It would block
on wait_for_completion(), but per the very last point, we'll observe the
complete() before we block.

> > 	o	Task B migrates to CPU 2, which provides a quiescent
> > 		state for both CPUs 1 and 2.
> > 
> > 	o	Both CPUs 1 and 2 take scheduling-clock interrupts,
> > 		and both invoke RCU_SOFTIRQ, both thus learning of the
> > 		new grace period.
> > 
> > 	o	Task B is delayed, perhaps by vCPU preemption on CPU 2.
> > 
> > o	CPUs 2 and 3 pass through quiescent states, which are reported
> > 	to core RCU.
> > 
> > o	Task B is resumed just long enough to be migrated to CPU 3,
> > 	and then is once again delayed.
> > 
> > o	Task A executes rcu_read_unlock(), exiting its RCU read-side
> > 	critical section.
> 
> A calls rcu_read_unlock() ending the critical section

The point is that rcu_read_unlock() doesn't have memory ordering.

> > 
> > o	CPU 0 passes through a quiescent state, which is reported to
> > 	core RCU.  Only CPU 1 continues to block the grace period.
> > 
> > o	CPU 1 passes through a quiescent state, which is reported to
> > 	core RCU.  This ends the grace period, and CPU 1 therefore
> > 	invokes its callbacks, one of which awakens Task B via
> > 	complete().
> > 
> > o	Task B resumes (still on CPU 3) and starts executing
> > 	wait_for_completion(), which sees that the completion has
> > 	already completed, and thus does not block.  It returns from
> > 	the synchronize_rcu() without any ordering against the
> > 	end of Task A's RCU read-side critical section.
> 
> B runs
> 
> 
> > 
> > 	It can therefore mess up Task A's RCU read-side critical section,
> > 	in theory, anyway.
> 
> I don't see how B ran during A's critical section.

It didn't but that doesn't mean the memory ordering agrees. What we
require is B observes (per the memory ordering) everything up to and
including the rcu_read_unlock(). This is not 'time' related.


That said, I don't think it can actually happen, because CPU0's QS state
is ordered against the complete and the wait_for_completion is ordered
against the complete.


* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-05 13:40     ` Peter Zijlstra
@ 2017-10-05 14:13       ` Steven Rostedt
  0 siblings, 0 replies; 30+ messages in thread
From: Steven Rostedt @ 2017-10-05 14:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, dhowells, edumazet,
	fweisbec, oleg

On Thu, 5 Oct 2017 15:40:42 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Oct 05, 2017 at 09:17:03AM -0400, Steven Rostedt wrote:
> > On Wed,  4 Oct 2017 14:29:27 -0700
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >   
> > > Consider the following admittedly improbable sequence of events:
> > > 
> > > o	RCU is initially idle.
> > > 
> > > o	Task A on CPU 0 executes rcu_read_lock().  
> > 
> > A starts rcu_read_lock() critical section.
> >   
> > > 
> > > o	Task B on CPU 1 executes synchronize_rcu(), which must
> > > 	wait on Task A:  
> > 
> > B waits for A.
> >   
> > > 
> > > 	o	Task B registers the callback, which starts a new
> > > 		grace period, awakening the grace-period kthread
> > > 		on CPU 3, which immediately starts a new grace period.  
> > 
> >   [ isn't B blocked (off rq)? How does it migrate? ]  
> 
> No, its running synchronize_rcu() but hasn't blocked yet. It would block
> on wait_for_completion(), but per the very last point, we'll observe the
> complete() before we block.

Hmm, maybe this should be stated as well, as it looked confusing.

> 
> > > 	o	Task B migrates to CPU 2, which provides a quiescent
> > > 		state for both CPUs 1 and 2.
> > > 
> > > 	o	Both CPUs 1 and 2 take scheduling-clock interrupts,
> > > 		and both invoke RCU_SOFTIRQ, both thus learning of the
> > > 		new grace period.
> > > 
> > > 	o	Task B is delayed, perhaps by vCPU preemption on CPU 2.
> > > 
> > > o	CPUs 2 and 3 pass through quiescent states, which are reported
> > > 	to core RCU.
> > > 
> > > o	Task B is resumed just long enough to be migrated to CPU 3,
> > > 	and then is once again delayed.
> > > 
> > > o	Task A executes rcu_read_unlock(), exiting its RCU read-side
> > > 	critical section.  
> > 
> > A calls rcu_read_unlock() ending the critical section  
> 
> The point is that rcu_read_unlock() doesn't have memory ordering.

Ah, that makes a difference. Can we state that in the change log too?

> 
> > > 
> > > o	CPU 0 passes through a quiescent state, which is reported to
> > > 	core RCU.  Only CPU 1 continues to block the grace period.
> > > 
> > > o	CPU 1 passes through a quiescent state, which is reported to
> > > 	core RCU.  This ends the grace period, and CPU 1 therefore
> > > 	invokes its callbacks, one of which awakens Task B via
> > > 	complete().
> > > 
> > > o	Task B resumes (still on CPU 3) and starts executing
> > > 	wait_for_completion(), which sees that the completion has
> > > 	already completed, and thus does not block.  It returns from
> > > 	the synchronize_rcu() without any ordering against the
> > > 	end of Task A's RCU read-side critical section.  
> > 
> > B runs
> > 
> >   
> > > 
> > > 	It can therefore mess up Task A's RCU read-side critical section,
> > > 	in theory, anyway.  
> > 
> > I don't see how B ran during A's critical section.  
> 
> It didn't but that doesn't mean the memory ordering agrees. What we
> require is B observes (per the memory ordering) everything up to and
> including the rcu_read_unlock(). This is not 'time' related.

OK, that makes sense. It was just missing from the change log, so mere
mortals like myself can keep up. (This is why I'm Batman. I'm only
human but I can still compete with superheroes ;-)

> 
> 
> That said, I don't think it can actually happen, because CPU0's QS state
> is ordered against the complete and the wait_for_completion is ordered
> against the complete.

OK, but this doesn't hurt to have. Just a paranoid check, which when it
comes to RCU, leaning toward paranoid is good.

-- Steve


* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-05  9:41   ` Peter Zijlstra
@ 2017-10-05 14:55     ` Paul E. McKenney
  2017-10-05 15:39       ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-05 14:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Thu, Oct 05, 2017 at 11:41:14AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 04, 2017 at 02:29:27PM -0700, Paul E. McKenney wrote:
> > Consider the following admittedly improbable sequence of events:
> > 
> > o	RCU is initially idle.
> > 
> > o	Task A on CPU 0 executes rcu_read_lock().
> > 
> > o	Task B on CPU 1 executes synchronize_rcu(), which must
> > 	wait on Task A:
> > 
> > 	o	Task B registers the callback, which starts a new
> > 		grace period, awakening the grace-period kthread
> > 		on CPU 3, which immediately starts a new grace period.
> > 
> > 	o	Task B migrates to CPU 2, which provides a quiescent
> > 		state for both CPUs 1 and 2.
> > 
> > 	o	Both CPUs 1 and 2 take scheduling-clock interrupts,
> > 		and both invoke RCU_SOFTIRQ, both thus learning of the
> > 		new grace period.
> > 
> > 	o	Task B is delayed, perhaps by vCPU preemption on CPU 2.
> > 
> > o	CPUs 2 and 3 pass through quiescent states, which are reported
> > 	to core RCU.
> > 
> > o	Task B is resumed just long enough to be migrated to CPU 3,
> > 	and then is once again delayed.
> > 
> > o	Task A executes rcu_read_unlock(), exiting its RCU read-side
> > 	critical section.
> > 
> > o	CPU 0 passes through a quiescent state, which is reported to
> > 	core RCU.  Only CPU 1 continues to block the grace period.
> > 
> > o	CPU 1 passes through a quiescent state, which is reported to
> > 	core RCU.  This ends the grace period, and CPU 1 therefore
> > 	invokes its callbacks, one of which awakens Task B via
> > 	complete().
> > 
> > o	Task B resumes (still on CPU 3) and starts executing
> > 	wait_for_completion(), which sees that the completion has
> > 	already completed, and thus does not block.  It returns from
> > 	the synchronize_rcu() without any ordering against the
> > 	end of Task A's RCU read-side critical section.
> > 
> > 	It can therefore mess up Task A's RCU read-side critical section,
> > 	in theory, anyway.
> 
> I'm not sure I follow, at the very least the wait_for_completion() does
> an ACQUIRE such that it observes the state prior to the RELEASE as done
> by complete(), no?

Your point being that both wait_for_completion() and complete() acquire
and release the same lock?  (Yes, I suspect that I was confusing this
with wait_event() and wake_up(), just so you know.)

> And is not CPU0's QS reporting ordered against that complete()?

Mumble mumble mumble powerpc mumble mumble mumble...

OK, I will make this new memory barrier only execute for powerpc.

Or am I missing something else here?

							Thanx, Paul


* Re: [PATCH tip/core/rcu 7/9] rcu: Pretend ->boost_mtx acquired legitimately
  2017-10-05  9:50   ` Peter Zijlstra
@ 2017-10-05 15:06     ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-05 15:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Thu, Oct 05, 2017 at 11:50:59AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 04, 2017 at 02:29:33PM -0700, Paul E. McKenney wrote:
> > RCU priority boosting uses rt_mutex_init_proxy_locked() to initialize an
> > rt_mutex structure in locked state held by some other task.  When that
> > other task releases it, lockdep complains (quite accurately, but a bit
> > uselessly) that the other task never acquired it.  This complaint can
> > suppress other, more helpful, lockdep complaints, and in any case it is
> > a false positive.
> > 
> > This commit therefore uses the mutex_acquire() macro to make it look
> > like that other process legitimately acquired the lock, thus suppressing
> > this lockdep false-positive complaint.
> > 
> > Of course, if lockdep ever learns about rt_mutex_init_proxy_locked(),
> > this commit will need to be reverted.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> This is a consequence of me doing:
> 
>  f5694788ad8d ("rt_mutex: Add lockdep annotations")
> 
> Right?

The timing matches, so I do believe this is the case.

> > ---
> >  kernel/rcu/tree_plugin.h | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index fed95fa941e6..60bfb16c9a1a 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -529,8 +529,11 @@ void rcu_read_unlock_special(struct task_struct *t)
> >  		}
> >  
> >  		/* Unboost if we were boosted. */
> > -		if (IS_ENABLED(CONFIG_RCU_BOOST) && drop_boost_mutex)
> > +		if (IS_ENABLED(CONFIG_RCU_BOOST) && drop_boost_mutex) {
> > +			/* For lockdep, pretend we acquired lock honestly. */
> > +			mutex_acquire(&rnp->boost_mtx.dep_map, 0, 0, _RET_IP_);
> >  			rt_mutex_unlock(&rnp->boost_mtx);
> > +		}
> 
> So I'm thinking the problem is that you're mixing rt_mutex and PI-futex
> primitives here. As per commit:
> 
>   5293c2efda37 ("futex,rt_mutex: Provide futex specific rt_mutex API")
> 
> these are two separate APIs, that should, ideally, not be mixed.
> 
> The 'right' counterpart to rt_mutex_init_proxy_locked() is
> rt_mutex_futex_unlock() (which very much does not include lockdep bits).

OK, will give this a try.  It does at least seem to build, so I guess
that is a good start.  ;-)

							Thanx, Paul


* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-05 14:55     ` Paul E. McKenney
@ 2017-10-05 15:39       ` Peter Zijlstra
  2017-10-05 16:19         ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-10-05 15:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Thu, Oct 05, 2017 at 07:55:13AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 05, 2017 at 11:41:14AM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 04, 2017 at 02:29:27PM -0700, Paul E. McKenney wrote:
> > > Consider the following admittedly improbable sequence of events:
> > > 
> > > o	RCU is initially idle.
> > > 
> > > o	Task A on CPU 0 executes rcu_read_lock().
> > > 
> > > o	Task B on CPU 1 executes synchronize_rcu(), which must
> > > 	wait on Task A:
> > > 
> > > 	o	Task B registers the callback, which starts a new
> > > 		grace period, awakening the grace-period kthread
> > > 		on CPU 3, which immediately starts a new grace period.
> > > 
> > > 	o	Task B migrates to CPU 2, which provides a quiescent
> > > 		state for both CPUs 1 and 2.
> > > 
> > > 	o	Both CPUs 1 and 2 take scheduling-clock interrupts,
> > > 		and both invoke RCU_SOFTIRQ, both thus learning of the
> > > 		new grace period.
> > > 
> > > 	o	Task B is delayed, perhaps by vCPU preemption on CPU 2.
> > > 
> > > o	CPUs 2 and 3 pass through quiescent states, which are reported
> > > 	to core RCU.
> > > 
> > > o	Task B is resumed just long enough to be migrated to CPU 3,
> > > 	and then is once again delayed.
> > > 
> > > o	Task A executes rcu_read_unlock(), exiting its RCU read-side
> > > 	critical section.
> > > 
> > > o	CPU 0 passes through a quiescent state, which is reported to
> > > 	core RCU.  Only CPU 1 continues to block the grace period.
> > > 
> > > o	CPU 1 passes through a quiescent state, which is reported to
> > > 	core RCU.  This ends the grace period, and CPU 1 therefore
> > > 	invokes its callbacks, one of which awakens Task B via
> > > 	complete().
> > > 
> > > o	Task B resumes (still on CPU 3) and starts executing
> > > 	wait_for_completion(), which sees that the completion has
> > > 	already completed, and thus does not block.  It returns from
> > > 	the synchronize_rcu() without any ordering against the
> > > 	end of Task A's RCU read-side critical section.
> > > 
> > > 	It can therefore mess up Task A's RCU read-side critical section,
> > > 	in theory, anyway.
> > 
> > I'm not sure I follow, at the very least the wait_for_completion() does
> > an ACQUIRE such that it observes the state prior to the RELEASE as done
> > by complete(), no?
> 
> Your point being that both wait_for_completion() and complete() acquire
> and release the same lock?  (Yes, I suspect that I was confusing this
> with wait_event() and wake_up(), just so you know.)

Well, fundamentally complete()/wait_for_completion() is a message-pass
and they include a RELEASE/ACQUIRE pair for causal reasons.

Per the implementation they use a spinlock, but any implementation needs
to provide at least that RELEASE/ACQUIRE pair.
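
Roughly like this (illustrative sketch only, not the actual completion
implementation):

	int done;		/* stands in for the completion's ->done */
	int payload;		/* state published before complete() */

	void completer(void)			/* complete() side */
	{
		payload = 42;
		smp_store_release(&done, 1);	/* RELEASE */
	}

	void waiter(void)			/* wait_for_completion() side */
	{
		while (!smp_load_acquire(&done))	/* ACQUIRE */
			cpu_relax();
		/* payload is guaranteed to read as 42 here. */
	}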

> > And is not CPU0's QS reporting ordered against that complete()?
> 
> Mumble mumble mumble powerpc mumble mumble mumble...
> 
> OK, I will make this new memory barrier only execute for powerpc.
> 
> Or am I missing something else here?

So I'm not entirely clear on the required semantics here; why do we need
a full mb? I'm thinking CPU0's QS propagating through the tree and
arriving at the root node is a multi-copy-atomic / transitive thing and
all CPUs will agree the system QS has ended, right?

Whichever CPU establishes the system QS does complete() and the
wait_for_completion() then has the weak-transitive causal relation to
that, ensuring that -- in the above example -- CPU3 must be _after_
CPU0's rcu_read_unlock().


* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-05 15:39       ` Peter Zijlstra
@ 2017-10-05 16:19         ` Paul E. McKenney
  2017-10-05 16:25           ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-05 16:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Thu, Oct 05, 2017 at 05:39:13PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 05, 2017 at 07:55:13AM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 05, 2017 at 11:41:14AM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 04, 2017 at 02:29:27PM -0700, Paul E. McKenney wrote:
> > > > Consider the following admittedly improbable sequence of events:
> > > > 
> > > > o	RCU is initially idle.
> > > > 
> > > > o	Task A on CPU 0 executes rcu_read_lock().
> > > > 
> > > > o	Task B on CPU 1 executes synchronize_rcu(), which must
> > > > 	wait on Task A:
> > > > 
> > > > 	o	Task B registers the callback, which starts a new
> > > > 		grace period, awakening the grace-period kthread
> > > > 		on CPU 3, which immediately starts a new grace period.
> > > > 
> > > > 	o	Task B migrates to CPU 2, which provides a quiescent
> > > > 		state for both CPUs 1 and 2.
> > > > 
> > > > 	o	Both CPUs 1 and 2 take scheduling-clock interrupts,
> > > > 		and both invoke RCU_SOFTIRQ, both thus learning of the
> > > > 		new grace period.
> > > > 
> > > > 	o	Task B is delayed, perhaps by vCPU preemption on CPU 2.
> > > > 
> > > > o	CPUs 2 and 3 pass through quiescent states, which are reported
> > > > 	to core RCU.
> > > > 
> > > > o	Task B is resumed just long enough to be migrated to CPU 3,
> > > > 	and then is once again delayed.
> > > > 
> > > > o	Task A executes rcu_read_unlock(), exiting its RCU read-side
> > > > 	critical section.
> > > > 
> > > > o	CPU 0 passes through a quiescent state, which is reported to
> > > > 	core RCU.  Only CPU 1 continues to block the grace period.
> > > > 
> > > > o	CPU 1 passes through a quiescent state, which is reported to
> > > > 	core RCU.  This ends the grace period, and CPU 1 therefore
> > > > 	invokes its callbacks, one of which awakens Task B via
> > > > 	complete().
> > > > 
> > > > o	Task B resumes (still on CPU 3) and starts executing
> > > > 	wait_for_completion(), which sees that the completion has
> > > > 	already completed, and thus does not block.  It returns from
> > > > 	the synchronize_rcu() without any ordering against the
> > > > 	end of Task A's RCU read-side critical section.
> > > > 
> > > > 	It can therefore mess up Task A's RCU read-side critical section,
> > > > 	in theory, anyway.
> > > 
> > > I'm not sure I follow, at the very least the wait_for_completion() does
> > > an ACQUIRE such that it observes the state prior to the RELEASE as done
> > > by complete(), no?
> > 
> > Your point being that both wait_for_completion() and complete() acquire
> > and release the same lock?  (Yes, I suspect that I was confusing this
> > with wait_event() and wake_up(), just so you know.)
> 
> Well, fundamentally complete()/wait_for_completion() is a message-pass
> and they include a RELEASE/ACQUIRE pair for causal reasons.
> 
> Per the implementation they use a spinlock, but any implementation needs
> to provide at least that RELEASE/ACQUIRE pair.
> 
> > > And is not CPU0's QS reporting ordered against that complete()?
> > 
> > Mumble mumble mumble powerpc mumble mumble mumble...
> > 
> > OK, I will make this new memory barrier only execute for powerpc.
> > 
> > Or am I missing something else here?
> 
> So I'm not entirely clear on the required semantics here; why do we need
> a full mb? I'm thinking CPU0's QS propagating through the tree and
> arriving at the root node is a multi-copy-atomic / transitive thing and
> all CPUs will agree the system QS has ended, right?
> 
> Whichever CPU establishes the system QS does complete() and the
> wait_for_completion() then has the weak-transitive causal relation to
> that, ensuring that -- in the above example -- CPU3 must be _after_
> CPU0's rcu_read_unlock().

Yes, the ordering does need to be visible to uninvolved CPUs, so
release-acquire is not necessarily strong enough.

My current thought is like this:

	if (IS_ENABLED(CONFIG_ARCH_WEAK_RELEASE_ACQUIRE))
		smp_mb();
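
For concreteness, a sketch of where such a check might live (the helper
name and call site below are illustrative only, not the actual patch):

	/*
	 * Illustrative only: upgrade ordering after a completion-based
	 * grace-period wait on architectures whose release-acquire chains
	 * are not visible to uninvolved CPUs.
	 */
	static inline void rcu_weak_release_acquire_mb(void)
	{
		if (IS_ENABLED(CONFIG_ARCH_WEAK_RELEASE_ACQUIRE))
			smp_mb(); /* Full barrier only where needed. */
	}

	/* ...invoked right after the wait_for_completion() that
	 * synchronize_rcu() uses to wait for the grace period. */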

Thoughts?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-05 16:19         ` Paul E. McKenney
@ 2017-10-05 16:25           ` Peter Zijlstra
  2017-10-05 18:22             ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-10-05 16:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Thu, Oct 05, 2017 at 09:19:09AM -0700, Paul E. McKenney wrote:
> 
> Yes, the ordering does need to be visible to uninvolved CPUs, so
> release-acquire is not necessarily strong enough.

Why? Because I'm not at all seeing why it needs more. Your Changelog
doesn't provide enough hints.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-05 16:25           ` Peter Zijlstra
@ 2017-10-05 18:22             ` Paul E. McKenney
  2017-10-06  9:07               ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-05 18:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Thu, Oct 05, 2017 at 06:25:14PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 05, 2017 at 09:19:09AM -0700, Paul E. McKenney wrote:
> > 
> > Yes, the ordering does need to be visible to uninvolved CPUs, so
> > release-acquire is not necessarily strong enough.
> 
> Why? Because I'm not at all seeing why it needs more. Your Changelog
> doesn't provide enough hints.

Hmmm...  Here is what I was worried about:

	C C-PaulEMcKenney-W+RWC4+2017-10-05

	{
	}

	P0(int *a, int *x)
	{
		WRITE_ONCE(*a, 1);
		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
		WRITE_ONCE(*x, 1);
	}

	P1(int *x, int *y, spinlock_t *l)
	{
		r3 = READ_ONCE(*x);
		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
		spin_lock(l); /* Locking in complete(). */
		WRITE_ONCE(*y, 1);
		spin_unlock(l);
	}

	P2(int *y, int *b, spinlock_t *l)
	{
		spin_lock(l); /* Locking in wait_for_completion. */
		r4 = READ_ONCE(*y);
		spin_unlock(l);
		r1 = READ_ONCE(*b);
	}

	P3(int *b, int *a)
	{
		WRITE_ONCE(*b, 1);
		smp_mb();
		r2 = READ_ONCE(*a);
	}

	exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)

My concern was that P2()'s read from b could be pulled into the lock's
critical section and reordered with the read from y.  But even if
you physically rearrange the code, the cycle is forbidden.  Removing
P2()'s lock acquisition and release does allow the cycle.  I get the
same result when replacing spin_lock() with xchg_acquire() and
spin_unlock() with smp_store_release().  I can drop P1()'s smp_mb()
(leaving everything else as in the original above) and the cycle is still
forbidden.
Removing P1()'s lock acquisition and release, but leaving the smp_mb(),
does allow the cycle.
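
Roughly, the lock-free variant takes this form (sketched from the
description above, so treat it as illustrative: the spinlock_t becomes a
plain int, and the extra r5 terms in the "exists" clause restrict
attention to executions in which each xchg_acquire() actually obtained
the lock):

	C C-W+RWC4+xchg-acquire-release

	{
	}

	P0(int *a, int *x)
	{
		WRITE_ONCE(*a, 1);
		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
		WRITE_ONCE(*x, 1);
	}

	P1(int *x, int *y, int *l)
	{
		r3 = READ_ONCE(*x);
		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
		r5 = xchg_acquire(l, 1); /* Was spin_lock(l). */
		WRITE_ONCE(*y, 1);
		smp_store_release(l, 0); /* Was spin_unlock(l). */
	}

	P2(int *y, int *b, int *l)
	{
		r5 = xchg_acquire(l, 1); /* Was spin_lock(l). */
		r4 = READ_ONCE(*y);
		smp_store_release(l, 0); /* Was spin_unlock(l). */
		r1 = READ_ONCE(*b);
	}

	P3(int *b, int *a)
	{
		WRITE_ONCE(*b, 1);
		smp_mb();
		r2 = READ_ONCE(*a);
	}

	exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0 /\ 1:r5=0 /\ 2:r5=0)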

So it looks like I can drop this patch entirely.  Though somewhat
nervously, despite the evidence that I have ample redundancy in
ordering.  ;-)

So, thank you for persisting on this one!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-05 18:22             ` Paul E. McKenney
@ 2017-10-06  9:07               ` Peter Zijlstra
  2017-10-06 19:18                 ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-10-06  9:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Thu, Oct 05, 2017 at 11:22:04AM -0700, Paul E. McKenney wrote:
> Hmmm...  Here is what I was worried about:
> 
> 	C C-PaulEMcKenney-W+RWC4+2017-10-05
> 
> 	{
> 	}
> 
> 	P0(int *a, int *x)
> 	{
> 		WRITE_ONCE(*a, 1);
> 		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> 		WRITE_ONCE(*x, 1);
> 	}
> 
> 	P1(int *x, int *y, spinlock_t *l)
> 	{
> 		r3 = READ_ONCE(*x);
> 		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> 		spin_lock(l); /* Locking in complete(). */
> 		WRITE_ONCE(*y, 1);
> 		spin_unlock(l);
> 	}
> 
> 	P2(int *y, int *b, spinlock_t *l)
> 	{
> 		spin_lock(l); /* Locking in wait_for_completion. */
> 		r4 = READ_ONCE(*y);
> 		spin_unlock(l);
> 		r1 = READ_ONCE(*b);
> 	}
> 
> 	P3(int *b, int *a)
> 	{
> 		WRITE_ONCE(*b, 1);
> 		smp_mb();
> 		r2 = READ_ONCE(*a);
> 	}
> 
> 	exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)


/me goes and installs this herd thing again.. I'm sure I had it running
_somewhere_.. Ah well.


	C C-PaulEMcKenney-W+RWC4+2017-10-05

	{
	}

	P0(int *a, int *x)
	{
		WRITE_ONCE(*a, 1);
		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
		WRITE_ONCE(*x, 1);
	}

	P1(int *x, int *y)
	{
		r3 = READ_ONCE(*x);
		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
		smp_store_release(y, 1);
	}

	P2(int *y, int *b)
	{
		r4 = smp_load_acquire(y);
		r1 = READ_ONCE(*b);
	}

	P3(int *b, int *a)
	{
		WRITE_ONCE(*b, 1);
		smp_mb();
		r2 = READ_ONCE(*a);
	}

	exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)


Is what I was thinking of, I think that is the minimal ordering
complete()/wait_for_completion() need to provide.

(also, that r# numbering confuses the hell out of me, it's not related to
P nor to the variables)

Test C-PaulEMcKenney-W+RWC4+2017-10-05 Allowed
States 15
1:r3=0; 2:r1=0; 2:r4=0; 3:r2=0;
1:r3=0; 2:r1=0; 2:r4=0; 3:r2=1;
1:r3=0; 2:r1=0; 2:r4=1; 3:r2=0;
1:r3=0; 2:r1=0; 2:r4=1; 3:r2=1;
1:r3=0; 2:r1=1; 2:r4=0; 3:r2=0;
1:r3=0; 2:r1=1; 2:r4=0; 3:r2=1;
1:r3=0; 2:r1=1; 2:r4=1; 3:r2=0;
1:r3=0; 2:r1=1; 2:r4=1; 3:r2=1;
1:r3=1; 2:r1=0; 2:r4=0; 3:r2=0;
1:r3=1; 2:r1=0; 2:r4=0; 3:r2=1;
1:r3=1; 2:r1=0; 2:r4=1; 3:r2=1;
1:r3=1; 2:r1=1; 2:r4=0; 3:r2=0;
1:r3=1; 2:r1=1; 2:r4=0; 3:r2=1;
1:r3=1; 2:r1=1; 2:r4=1; 3:r2=0;
1:r3=1; 2:r1=1; 2:r4=1; 3:r2=1;
No
Witnesses
Positive: 0 Negative: 15
Condition exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)
Observation C-PaulEMcKenney-W+RWC4+2017-10-05 Never 0 15
Time C-PaulEMcKenney-W+RWC4+2017-10-05 0.04
Hash=f7f8ad6eab33e90718a394bcb021557d

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-06  9:07               ` Peter Zijlstra
@ 2017-10-06 19:18                 ` Paul E. McKenney
  2017-10-06 20:15                   ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-06 19:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Fri, Oct 06, 2017 at 11:07:23AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 05, 2017 at 11:22:04AM -0700, Paul E. McKenney wrote:
> > Hmmm...  Here is what I was worried about:
> > 
> > 	C C-PaulEMcKenney-W+RWC4+2017-10-05
> > 
> > 	{
> > 	}
> > 
> > 	P0(int *a, int *x)
> > 	{
> > 		WRITE_ONCE(*a, 1);
> > 		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> > 		WRITE_ONCE(*x, 1);
> > 	}
> > 
> > 	P1(int *x, int *y, spinlock_t *l)
> > 	{
> > 		r3 = READ_ONCE(*x);
> > 		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> > 		spin_lock(l); /* Locking in complete(). */
> > 		WRITE_ONCE(*y, 1);
> > 		spin_unlock(l);
> > 	}
> > 
> > 	P2(int *y, int *b, spinlock_t *l)
> > 	{
> > 		spin_lock(l); /* Locking in wait_for_completion. */
> > 		r4 = READ_ONCE(*y);
> > 		spin_unlock(l);
> > 		r1 = READ_ONCE(*b);
> > 	}
> > 
> > 	P3(int *b, int *a)
> > 	{
> > 		WRITE_ONCE(*b, 1);
> > 		smp_mb();
> > 		r2 = READ_ONCE(*a);
> > 	}
> > 
> > 	exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)
> 
> /me goes and installs this herd thing again.. I'm sure I had it running
> _somewhere_.. Ah well.
> 
> 	C C-PaulEMcKenney-W+RWC4+2017-10-05
> 
> 	{
> 	}
> 
> 	P0(int *a, int *x)
> 	{
> 		WRITE_ONCE(*a, 1);
> 		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> 		WRITE_ONCE(*x, 1);
> 	}
> 
> 	P1(int *x, int *y)
> 	{
> 		r3 = READ_ONCE(*x);
> 		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> 		smp_store_release(y, 1);
> 	}
> 
> 	P2(int *y, int *b)
> 	{
> 		r4 = smp_load_acquire(y);
> 		r1 = READ_ONCE(*b);
> 	}
> 
> 	P3(int *b, int *a)
> 	{
> 		WRITE_ONCE(*b, 1);
> 		smp_mb();
> 		r2 = READ_ONCE(*a);
> 	}
> 
> 	exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)
> 
> 
> Is what I was thinking of, I think that is the minimal ordering
> complete()/wait_for_completion() need to provide.

OK, I will bite...  What do the smp_store_release() and the
smp_load_acquire() correspond to?  I see just plain locking in
wait_for_completion() and complete().

> (also, that r# numbering confuses the hell out of me, it's not related to
> P nor to the variables)

Yeah, it is random, sorry!!!

> Test C-PaulEMcKenney-W+RWC4+2017-10-05 Allowed
> States 15
> 1:r3=0; 2:r1=0; 2:r4=0; 3:r2=0;
> 1:r3=0; 2:r1=0; 2:r4=0; 3:r2=1;
> 1:r3=0; 2:r1=0; 2:r4=1; 3:r2=0;
> 1:r3=0; 2:r1=0; 2:r4=1; 3:r2=1;
> 1:r3=0; 2:r1=1; 2:r4=0; 3:r2=0;
> 1:r3=0; 2:r1=1; 2:r4=0; 3:r2=1;
> 1:r3=0; 2:r1=1; 2:r4=1; 3:r2=0;
> 1:r3=0; 2:r1=1; 2:r4=1; 3:r2=1;
> 1:r3=1; 2:r1=0; 2:r4=0; 3:r2=0;
> 1:r3=1; 2:r1=0; 2:r4=0; 3:r2=1;
> 1:r3=1; 2:r1=0; 2:r4=1; 3:r2=1;
> 1:r3=1; 2:r1=1; 2:r4=0; 3:r2=0;
> 1:r3=1; 2:r1=1; 2:r4=0; 3:r2=1;
> 1:r3=1; 2:r1=1; 2:r4=1; 3:r2=0;
> 1:r3=1; 2:r1=1; 2:r4=1; 3:r2=1;
> No
> Witnesses
> Positive: 0 Negative: 15
> Condition exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)
> Observation C-PaulEMcKenney-W+RWC4+2017-10-05 Never 0 15
> Time C-PaulEMcKenney-W+RWC4+2017-10-05 0.04
> Hash=f7f8ad6eab33e90718a394bcb021557d

But yes, looking closer, this corresponds to the rule of thumb about
non-rf relations and full memory barriers.  We have two non-rf relations
(P2->P3 and P3->P0), so we need two full barriers, one each between the
non-rf relations.

So I dropped that patch yesterday.  The main thing I was missing was
that there is no ordering-free fastpath in wait_for_completion() and
complete(): Each unconditionally acquires the lock.  So the smp_mb()
that I was trying to add doesn't need to be there.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-06 19:18                 ` Paul E. McKenney
@ 2017-10-06 20:15                   ` Peter Zijlstra
  2017-10-07  3:31                     ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-10-06 20:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Fri, Oct 06, 2017 at 12:18:22PM -0700, Paul E. McKenney wrote:
> > /me goes and installs this herd thing again.. I'm sure I had it running
> > _somewhere_.. Ah well.
> > 
> > 	C C-PaulEMcKenney-W+RWC4+2017-10-05
> > 
> > 	{
> > 	}
> > 
> > 	P0(int *a, int *x)
> > 	{
> > 		WRITE_ONCE(*a, 1);
> > 		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> > 		WRITE_ONCE(*x, 1);
> > 	}
> > 
> > 	P1(int *x, int *y)
> > 	{
> > 		r3 = READ_ONCE(*x);
> > 		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> > 		smp_store_release(y, 1);
> > 	}
> > 
> > 	P2(int *y, int *b)
> > 	{
> > 		r4 = smp_load_acquire(y);
> > 		r1 = READ_ONCE(*b);
> > 	}
> > 
> > 	P3(int *b, int *a)
> > 	{
> > 		WRITE_ONCE(*b, 1);
> > 		smp_mb();
> > 		r2 = READ_ONCE(*a);
> > 	}
> > 
> > 	exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)
> > 
> > 
> > Is what I was thinking of, I think that is the minimal ordering
> > complete()/wait_for_completion() need to provide.
> 
> OK, I will bite...  What do the smp_store_release() and the
> smp_load_acquire() correspond to?  I see just plain locking in
> wait_for_completion() and complete().

They reflect the concept of complete() / wait_for_completion().
Fundamentally all it needs to do is pass the message of 'completion'.

That is, if we were to go optimize our completion implementation, it
would be impossible to be weaker than this and still correct.
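
In other words, something like the following would already be a correct
(if silly, because it spins) completion; a sketch with made-up names,
not the kernel's implementation:

	struct toy_completion {
		int done;
	};

	static void toy_complete(struct toy_completion *c)
	{
		/* RELEASE: all prior stores become visible to the waiter. */
		smp_store_release(&c->done, 1);
	}

	static void toy_wait_for_completion(struct toy_completion *c)
	{
		/* ACQUIRE: once done==1 is seen, so is everything before it. */
		while (!smp_load_acquire(&c->done))
			cpu_relax();
	}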

> So I dropped that patch yesterday.  The main thing I was missing was
> that there is no ordering-free fastpath in wait_for_completion() and
> complete(): Each unconditionally acquires the lock.  So the smp_mb()
> that I was trying to add doesn't need to be there.

Going by the above, it never needs to be there, even if there was a
lock-free fast-path.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-06 20:15                   ` Peter Zijlstra
@ 2017-10-07  3:31                     ` Paul E. McKenney
  2017-10-07  9:29                       ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-07  3:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Fri, Oct 06, 2017 at 10:15:37PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 06, 2017 at 12:18:22PM -0700, Paul E. McKenney wrote:
> > > /me goes and installs this herd thing again.. I'm sure I had it running
> > > _somewhere_.. Ah well.
> > > 
> > > 	C C-PaulEMcKenney-W+RWC4+2017-10-05
> > > 
> > > 	{
> > > 	}
> > > 
> > > 	P0(int *a, int *x)
> > > 	{
> > > 		WRITE_ONCE(*a, 1);
> > > 		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> > > 		WRITE_ONCE(*x, 1);
> > > 	}
> > > 
> > > 	P1(int *x, int *y)
> > > 	{
> > > 		r3 = READ_ONCE(*x);
> > > 		smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> > > 		smp_store_release(y, 1);
> > > 	}
> > > 
> > > 	P2(int *y, int *b)
> > > 	{
> > > 		r4 = smp_load_acquire(y);
> > > 		r1 = READ_ONCE(*b);
> > > 	}
> > > 
> > > 	P3(int *b, int *a)
> > > 	{
> > > 		WRITE_ONCE(*b, 1);
> > > 		smp_mb();
> > > 		r2 = READ_ONCE(*a);
> > > 	}
> > > 
> > > 	exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)
> > > 
> > > 
> > > Is what I was thinking of, I think that is the minimal ordering
> > > complete()/wait_for_completion() need to provide.
> > 
> > OK, I will bite...  What do the smp_store_release() and the
> > smp_load_acquire() correspond to?  I see just plain locking in
> > wait_for_completion() and complete().
> 
> They reflect the concept of complete() / wait_for_completion().
> Fundamentally all it needs to do is pass the message of 'completion'.
> 
> That is, if we were to go optimize our completion implementation, it
> would be impossible to be weaker than this and still correct.

OK, though the model does not provide spinlocks, and there can be
differences in behavior between spinlocks and release-acquire.
But yes, in this case, it works.

> > So I dropped that patch yesterday.  The main thing I was missing was
> > that there is no ordering-free fastpath in wait_for_completion() and
> > complete(): Each unconditionally acquires the lock.  So the smp_mb()
> > that I was trying to add doesn't need to be there.
> 
> Going by the above, it never needs to be there, even if there was a
> lock-free fast-path.

Given that wait_for_completion()/complete() both acquire the same lock,
yes, and agreed, if it were lockless but provided the release and
acquire ordering, then yes.  But if it was instead structured like
wait_event()/wake_up(), there would be ordering only if the caller
supplied it.

All that aside, paring the ordering down to the bare minimum is not
always the right approach.  Nevertheless, in this particular case,
there is plenty of ordering, so yet again, I have dropped this commit.
Like yesterday.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-07  3:31                     ` Paul E. McKenney
@ 2017-10-07  9:29                       ` Peter Zijlstra
  2017-10-07 18:28                         ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-10-07  9:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Fri, Oct 06, 2017 at 08:31:05PM -0700, Paul E. McKenney wrote:

> > > OK, I will bite...  What do the smp_store_release() and the
> > > smp_load_acquire() correspond to?  I see just plain locking in
> > > wait_for_completion() and complete().
> > 
> > They reflect the concept of complete() / wait_for_completion().
> > Fundamentally all it needs to do is pass the message of 'completion'.
> > 
> > That is, if we were to go optimize our completion implementation, it
> > would be impossible to be weaker than this and still correct.
> 
> OK, though the model does not provide spinlocks, and there can be
> differences in behavior between spinlocks and release-acquire.
> But yes, in this case, it works.

Sure; but the fundamental property here is that if we observe the
complete() we must also observe everything that went before. The exact
means of implementing that is irrelevant.
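
That property is just the message-passing pattern; as a litmus sketch
(made-up variable names):

	C completion-as-message-pass

	{
	}

	P0(int *data, int *done)
	{
		WRITE_ONCE(*data, 1);
		smp_store_release(done, 1); /* Stands in for complete(). */
	}

	P1(int *data, int *done)
	{
		r0 = smp_load_acquire(done); /* Stands in for wait_for_completion(). */
		r1 = READ_ONCE(*data);
	}

	exists (1:r0=1 /\ 1:r1=0)

The exists clause is forbidden: observing the completion implies
observing the store to data that preceded it.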

> > > So I dropped that patch yesterday.  The main thing I was missing was
> > > that there is no ordering-free fastpath in wait_for_completion() and
> > > complete(): Each unconditionally acquires the lock.  So the smp_mb()
> > > that I was trying to add doesn't need to be there.
> > 
> > Going by the above, it never needs to be there, even if there was a
> > lock-free fast-path.
> 
> Given that wait_for_completion()/complete() both acquire the same lock,
> yes, and agreed, if it were lockless but provided the release and
> acquire ordering, then yes.

I'm not sure I got the point across; so I'll try once more. Without
providing this ordering the completion would be fundamentally broken. It
_must_ provide this ordering.

> But if it was instead structured like
> wait_event()/wake_up(), there would be ordering only if the caller
> supplied it.

Right, wait_event()/wake_up() are different in that the 'condition'
variable is external to the abstraction and thus it cannot help.

All wait_event()/wake_up() can guarantee is that IFF it does a wakeup,
the woken thread will observe the prior state of the waker. But given
the actual condition is external and we might not hit the actual sleep
case, there are no guarantees.
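
For instance (a sketch with made-up names; the point is that the
condition accesses are plain loads and stores):

	static DECLARE_WAIT_QUEUE_HEAD(wq);
	static int data, cond;

	static void waker(void)
	{
		WRITE_ONCE(data, 1);
		WRITE_ONCE(cond, 1);	/* Plain store: no release ordering. */
		wake_up(&wq);
	}

	static int waiter(void)
	{
		wait_event(wq, READ_ONCE(cond));	/* Might never sleep. */
		return READ_ONCE(data);			/* Can still observe 0. */
	}

If the waiter sees cond==1 on its first check and never sleeps, nothing
orders its read of data against the waker's store, so the caller has to
supply that ordering itself, for example smp_store_release() on cond
paired with smp_load_acquire() in the condition.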

> All that aside, paring the ordering down to the bare minimum is not
> always the right approach.

Why not? In what sort of cases does it go wobbly?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-07  9:29                       ` Peter Zijlstra
@ 2017-10-07 18:28                         ` Paul E. McKenney
  2017-10-09  8:16                           ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-07 18:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Sat, Oct 07, 2017 at 11:29:19AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 06, 2017 at 08:31:05PM -0700, Paul E. McKenney wrote:
> 
> > > > OK, I will bite...  What do the smp_store_release() and the
> > > > smp_load_acquire() correspond to?  I see just plain locking in
> > > > wait_for_completion() and complete().
> > > 
> > > They reflect the concept of complete() / wait_for_completion().
> > > Fundamentally all it needs to do is pass the message of 'completion'.
> > > 
> > > That is, if we were to go optimize our completion implementation, it
> > > would be impossible to be weaker than this and still correct.
> > 
> > OK, though the model does not provide spinlocks, and there can be

Sigh.  s/not//.  The current model -does- provide spinlocks, though
they are a bit new.  I don't know of any breakage, but I am paranoid
enough so that where feasible I double-check against xchg_acquire()
and smp_store_release().

> > differences in behavior between spinlocks and release-acquire.
> > But yes, in this case, it works.
> 
> Sure; but the fundamental property here is that if we observe the
> complete() we must also observe everything that went before. The exact
> means of implementing that is irrelevant.

Agreed, and that also tends to speed up the running of the model on
the litmus test, so this sort of abstraction is a very good thing for
multiple reasons.

So why did I use spinlocks?  Because the model was small and fast enough,
and using the spinlocks meant that I didn't need to take time to worry
about the code's intent.

But if you are saying that it would be good to have wait_for_completion()
and complete() directly modeled at some point, no argument.  In addition,
I hope that the memory model is applied to other tools that analyze kernel
code.

> > > > So I dropped that patch yesterday.  The main thing I was missing was
> > > > that there is no ordering-free fastpath in wait_for_completion() and
> > > > complete(): Each unconditionally acquires the lock.  So the smp_mb()
> > > > that I was trying to add doesn't need to be there.
> > > 
> > > Going by the above, it never needs to be there, even if there was a
> > > lock-free fast-path.
> > 
> > Given that wait_for_completion()/complete() both acquire the same lock,
> > yes, and agreed, if it were lockless but provided the release and
> > acquire ordering, then yes.
> 
> I'm not sure I got the point across; so I'll try once more. Without
> providing this ordering the completion would be fundamentally broken. It
> _must_ provide this ordering.

OK, I now understand what you are getting at, and I do very much like
that guarantee.

> > But if it was instead structured like
> > wait_event()/wake_up(), there would be ordering only if the caller
> > supplied it.
> 
> Right, wait_event()/wake_up() are different in that the 'condition'
> variable is external to the abstraction and thus it cannot help.
> 
> All wait_event()/wake_up() can guarantee is that IFF it does a wakeup,
> the woken thread will observe the prior state of the waker. But given
> the actual condition is external and we might not hit the actual sleep
> case, there are no guarantees.

Agreed.

> > All that aside, paring the ordering down to the bare minimum is not
> > always the right approach.
> 
> Why not? In what sort of cases does it go wobbly?

For one, when it conflicts with maintainability.  For example, it would
probably be OK for some of RCU's rcu_node ->lock acquisitions to skip the
smp_mb__after_unlock_lock() invocations.  But those are slowpaths, and the
small speedup on only one architecture is just not worth the added pain.
Especially given the nice wrapper functions that you provided.
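
(For reference, the wrapper pattern in question looks roughly like the
following; the exact definition in the tree may differ:)

	/* Acquire an rcu_node ->lock, upgrading to a full barrier as needed. */
	#define raw_spin_lock_rcu_node(p)					\
	do {									\
		raw_spin_lock(&ACCESS_PRIVATE(p, lock));			\
		smp_mb__after_unlock_lock(); /* No-op except on powerpc. */	\
	} while (0)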

But of course if this were instead (say) rcu_read_lock() or common-case
rcu_read_unlock(), I would be willing to undergo much more pain.  On the
other hand, for that exact reason, that common-case code path doesn't
acquire locks in the first place.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-07 18:28                         ` Paul E. McKenney
@ 2017-10-09  8:16                           ` Peter Zijlstra
  2017-10-09 14:37                             ` Andrea Parri
  2017-10-09 23:15                             ` Paul E. McKenney
  0 siblings, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2017-10-09  8:16 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Sat, Oct 07, 2017 at 11:28:57AM -0700, Paul E. McKenney wrote:
> But if you are saying that it would be good to have wait_for_completion()
> and complete() directly modeled at some point, no argument.  In addition,
> I hope that the memory model is applied to other tools that analyze kernel
> code.

> > I'm not sure I got the point across; so I'll try once more. Without
> > providing this ordering the completion would be fundamentally broken. It
> > _must_ provide this ordering.
> 
> OK, I now understand what you are getting at, and I do very much like
> that guarantee.

Right, so maybe we should update the completion comments a bit to call
out this property, because I'm not sure it's there.

Also, with this, I think the smp_store_release() / smp_load_acquire() is
a perfectly fine abstraction of it, I don't think the model needs to be
taught about the completion interface.

> > Why not? In what sort of cases does it go wobbly?
> 
> For one, when it conflicts with maintainability.  For example, it would
> probably be OK for some of RCU's rcu_node ->lock acquisitions to skip the
> smp_mb__after_unlock_lock() invocations.  But those are slowpaths, and the
> small speedup on only one architecture is just not worth the added pain.
> Especially given the nice wrapper functions that you provided.
> 
> But of course if this were instead (say) rcu_read_lock() or common-case
> rcu_read_unlock(), I would be willing to undergo much more pain.  On the
> other hand, for that exact reason, that common-case code path doesn't
> acquire locks in the first place.  ;-)

Ah, so for models I would go absolutely minimal; it helps understand
what the strict requirements are and where we over-provide etc..

For actual code you're entirely right, there's no point in trying to be
cute with the rcu-node locks. Simplicity rules etc..

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-09  8:16                           ` Peter Zijlstra
@ 2017-10-09 14:37                             ` Andrea Parri
  2017-10-09 23:15                             ` Paul E. McKenney
  1 sibling, 0 replies; 30+ messages in thread
From: Andrea Parri @ 2017-10-09 14:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Mon, Oct 09, 2017 at 10:16:37AM +0200, Peter Zijlstra wrote:
> On Sat, Oct 07, 2017 at 11:28:57AM -0700, Paul E. McKenney wrote:
> > But if you are saying that it would be good to have wait_for_completion()
> > and complete() directly modeled at some point, no argument.  In addition,
> > I hope that the memory model is applied to other tools that analyze kernel
> > code.
> 
> > > I'm not sure I got the point across; so I'll try once more. Without
> > > providing this ordering the completion would be fundamentally broken. It
> > > _must_ provide this ordering.
> > 
> > OK, I now understand what you are getting at, and I do very much like
> > that guarantee.
> 
> Right, so maybe we should update the completion comments a bit to call
> out this property, because I'm not sure it's there.
> 
> Also, with this, I think the smp_store_release() / smp_load_acquire() is
> a perfectly fine abstraction of it, I don't think the model needs to be
> taught about the completion interface.
> 
> > > Why not? In what sort of cases does it go wobbly?
> > 
> > For one, when it conflicts with maintainability.  For example, it would
> > probably be OK for some of RCU's rcu_node ->lock acquisitions to skip the
> > smp_mb__after_unlock_lock() invocations.  But those are slowpaths, and the
> > small speedup on only one architecture is just not worth the added pain.
> > Especially given the nice wrapper functions that you provided.
> > 
> > But of course if this were instead (say) rcu_read_lock() or common-case
> > rcu_read_unlock(), I would be willing to undergo much more pain.  On the
> > other hand, for that exact reason, that common-case code path doesn't
> > acquire locks in the first place.  ;-)
> 
> Ah, so for models I would go absolutely minimal; it helps understand
> what the strict requirements are and where we over-provide etc..

Except, maybe, that simplicity and maintainability do apply to "models"
(design) as well... ;-)

As Ingo once put it [1] (referring to the "Linux-kernel memory model"):

  "it's not true that Linux has to offer a barrier and locking model
   that panders to the weakest (and craziest!) memory ordering model
   amongst all the possible Linux platforms - theoretical or real metal.

   Instead what we want to do is to consciously, intelligently _pick_
   a sane, maintainable memory model and offer primitives for that -
   at least as far as generic code is concerned. Each architecture can
   map those primitives to the best of its abilities."

  Andrea

[1] https://marc.info/?l=linux-mm&m=138513336717990&w=2


> 
> For actual code you're entirely right, there's no point in trying to be
> cute with the rcu-node locks. Simplicity rules etc..

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
  2017-10-09  8:16                           ` Peter Zijlstra
  2017-10-09 14:37                             ` Andrea Parri
@ 2017-10-09 23:15                             ` Paul E. McKenney
  1 sibling, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-10-09 23:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg

On Mon, Oct 09, 2017 at 10:16:37AM +0200, Peter Zijlstra wrote:
> On Sat, Oct 07, 2017 at 11:28:57AM -0700, Paul E. McKenney wrote:
> > But if you are saying that it would be good to have wait_for_completion()
> > and complete() directly modeled at some point, no argument.  In addition,
> > I hope that the memory model is applied to other tools that analyze kernel
> > code.
> 
> > > I'm not sure I got the point across; so I'll try once more. Without
> > > providing this ordering the completion would be fundamentally broken. It
> > > _must_ provide this ordering.
> > 
> > OK, I now understand what you are getting at, and I do very much like
> > that guarantee.
> 
> Right, so maybe we should update the completion comments a bit to call
> out this property, because I'm not sure it's there.

That would be very good!

> Also, with this, I think the smp_store_release() / smp_load_acquire() is
> a perfectly fine abstraction of it, I don't think the model needs to be
> taught about the completion interface.

Agreed, if we pull too much stuff into the model it might well become
difficult to maintain in and of itself.  So we do need to choose
carefully based on model performance and ease of use.

For a performance example, note that locking can be easily modeled with
xchg_acquire() for lock acquisition and smp_store_release() for lock
release.  But here is the resulting performance compared to directly
modeling locking within the cat code:

			xchg_acquire() and
	# Processes	smp_store_release()	Direct modeling

	2		        0.061 s		0.007 s

	3		        3.542 s		0.039 s

	4		      569.189 s		0.397 s

	5		> 113,853.000 s		5.187 s

Combinatorial explosion is a real pain...  The problem is that the
xchg_acquire() / smp_store_release() approach requires the herd7 tool
to consider and then reject all the scenarios where the lock critical
sections overlap.  In contrast, when direct modeling, those overlapping
critical-section scenarios are implicitly rejected by the cat-code that
models locking.  Don't get me wrong, there is still combinatorial
explosion, as you can see from the rightmost column of numbers.
After all, the herd7 tool still needs to consider all possible
lock-acquisition orders.  But the explosion is much less explosive.  ;-)

Given that we expect locking to be very heavily used, it makes a lot
of sense to directly model it, which we (mostly Alan, Andrea, and Luc)
have done.  Direct modeling of RCU gets similar speedups over emulation,
plus the fact that accurate emulation of RCU requires some surprisingly
unnatural acts.  ("Ghost" accesses unaffected by normal memory-ordering
primitives, and "ghost-buster memory barriers" that affect both "ghost"
and normal accesses.  Plus the transformation is complex, so much so
that I broke down and wrote a script, which was used to validate the
early versions of the RCU model.)

In contrast, I have no reason to believe that direct modeling could make
wait_for_completion() / complete() go any faster than the equivalent
setup using smp_store_release() / smp_load_acquire(); that is, I don't
see anything that could be optimized away by cat-code.  So the only
reason to directly model wait_for_completion() / complete() would be
to get rid of the need to add the extra term to the "exists" clause.
And agreed, we should wait until people are actually being significantly
inconvenienced by this, which I do not expect any time soon.

> > > Why not? In what sort of cases does it go wobbly?
> > 
> > For one, when it conflicts with maintainability.  For example, it would
> > probably be OK for some of RCU's rcu_node ->lock acquisitions to skip the
> > smp_mb__after_unlock_lock() invocations.  But those are slowpaths, and the
> > small speedup on only one architecture is just not worth the added pain.
> > Especially given the nice wrapper functions that you provided.
> > 
> > But of course if this were instead (say) rcu_read_lock() or common-case
> > rcu_read_unlock(), I would be willing to undergo much more pain.  On the
> > other hand, for that exact reason, that common-case code path doesn't
> > acquire locks in the first place.  ;-)
> 
> Ah, so for models I would go absolutely minimal; it helps understand
> what the strict requirements are and where we over-provide etc..

Agreed, we should be careful how much we promise lest we get locked
into needlessly heavyweight implementations.

> For actual code you're entirely right, there's no point in trying to be
> cute with the rcu-node locks. Simplicity rules etc..

Hear, hear!  ;-)

								Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2017-10-09 23:15 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-04 21:29 [PATCH tip/core/rcu 0/9] Miscellaneous fixes for v4.15 Paul E. McKenney
2017-10-04 21:29 ` [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays Paul E. McKenney
2017-10-05  9:41   ` Peter Zijlstra
2017-10-05 14:55     ` Paul E. McKenney
2017-10-05 15:39       ` Peter Zijlstra
2017-10-05 16:19         ` Paul E. McKenney
2017-10-05 16:25           ` Peter Zijlstra
2017-10-05 18:22             ` Paul E. McKenney
2017-10-06  9:07               ` Peter Zijlstra
2017-10-06 19:18                 ` Paul E. McKenney
2017-10-06 20:15                   ` Peter Zijlstra
2017-10-07  3:31                     ` Paul E. McKenney
2017-10-07  9:29                       ` Peter Zijlstra
2017-10-07 18:28                         ` Paul E. McKenney
2017-10-09  8:16                           ` Peter Zijlstra
2017-10-09 14:37                             ` Andrea Parri
2017-10-09 23:15                             ` Paul E. McKenney
2017-10-05 13:17   ` Steven Rostedt
2017-10-05 13:40     ` Peter Zijlstra
2017-10-05 14:13       ` Steven Rostedt
2017-10-04 21:29 ` [PATCH tip/core/rcu 2/9] rcu: Fix up pending cbs check in rcu_prepare_for_idle Paul E. McKenney
2017-10-04 21:29 ` [PATCH tip/core/rcu 3/9] rcu: Create call_rcu_tasks() kthread at boot time Paul E. McKenney
2017-10-04 21:29 ` [PATCH tip/core/rcu 4/9] irq_work: Map irq_work_on_queue() to irq_work_on() in !SMP Paul E. McKenney
2017-10-04 21:29 ` [PATCH tip/core/rcu 5/9] srcu: Add parameters to SRCU docbook comments Paul E. McKenney
2017-10-04 21:29 ` [PATCH tip/core/rcu 6/9] sched: Make resched_cpu() unconditional Paul E. McKenney
2017-10-04 21:29 ` [PATCH tip/core/rcu 7/9] rcu: Pretend ->boost_mtx acquired legitimately Paul E. McKenney
2017-10-05  9:50   ` Peter Zijlstra
2017-10-05 15:06     ` Paul E. McKenney
2017-10-04 21:29 ` [PATCH tip/core/rcu 8/9] rcu: Add extended-quiescent-state testing advice Paul E. McKenney
2017-10-04 21:29 ` [PATCH tip/core/rcu 9/9] rcu/segcblist: Include rcupdate.h Paul E. McKenney
