RCU Archive on lore.kernel.org
* [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline
@ 2020-03-12 18:16 Paul E. McKenney
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 01/16] sched/core: Add function to sample state of non-running function paulmck
                   ` (17 more replies)
  0 siblings, 18 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-12 18:16 UTC (permalink / raw)
  To: mutt, rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel

Hello!

This series provides two variants of Tasks RCU, a rude variant inspired
by Steven Rostedt's use of schedule_on_each_cpu(), and a tracing variant
requested by the BPF folks and perhaps also of use for other tracing
use cases.

The tracing variant has explicit read-side markers to permit finite grace
periods even given in-kernel loops in PREEMPT=n builds.  It also protects
code in the idle loop, on exception entry/exit paths, and on the various
CPU-hotplug online/offline code paths, thus having protection properties
similar to SRCU.  However, unlike SRCU, this variant avoids expensive
instructions in the read-side primitives, thus having read-side overhead
similar to that of preemptible RCU.

There are of course downsides.  The grace-period code can send IPIs to
CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
It is necessary to scan the full tasklist, much as for Tasks RCU.  There
is a single callback queue guarded by a single lock, again, much as for
Tasks RCU.  If needed, these downsides can be at least partially remedied.
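
The single-queue, single-lock arrangement mentioned above can be pictured
with a rough userspace sketch.  This is not the kernel code: the names
cb_head, cb_enqueue(), cb_dequeue_all(), cb_invoke_all(), and count_cb()
are made up for illustration, and a pthread mutex stands in for the
kernel's raw spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct rcu_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

/* One global list and one lock: every enqueue contends on cbs_lock. */
static struct cb_head *cbs_head;
static struct cb_head **cbs_tail = &cbs_head;
static pthread_mutex_t cbs_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Append a callback.  Returns true if the queue was empty beforehand,
 * which is the case in which a grace-period thread would need a wakeup.
 */
static bool cb_enqueue(struct cb_head *head, void (*func)(struct cb_head *))
{
	bool was_empty;

	assert(head && func);
	head->next = NULL;
	head->func = func;
	pthread_mutex_lock(&cbs_lock);
	was_empty = !cbs_head;
	*cbs_tail = head;		/* Link onto the current tail. */
	cbs_tail = &head->next;		/* New tail is this element's ->next. */
	pthread_mutex_unlock(&cbs_lock);
	return was_empty;
}

/* Detach the whole list in one step, leaving the queue empty. */
static struct cb_head *cb_dequeue_all(void)
{
	struct cb_head *list;

	pthread_mutex_lock(&cbs_lock);
	list = cbs_head;
	cbs_head = NULL;
	cbs_tail = &cbs_head;
	pthread_mutex_unlock(&cbs_lock);
	return list;
}

/* Invoke every callback on a detached list and return how many ran. */
static int cb_invoke_all(struct cb_head *list)
{
	int n = 0;

	while (list) {
		struct cb_head *next = list->next;	/* func may free list. */

		list->func(list);
		list = next;
		n++;
	}
	return n;
}

/* Example callback that only counts its invocations. */
static int cb_invocations;

static void count_cb(struct cb_head *head)
{
	(void)head;
	cb_invocations++;
}
```

Because the list is detached wholesale, callbacks run outside the lock,
but all enqueues still serialize on that one lock, which is the
scalability limit noted above.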

Perhaps most important, this variant of RCU does not affect the vanilla
flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
readers can operate from idle, offline, and exception entry/exit in no
way allows rcu_preempt and rcu_sched readers to also do so.

This effort benefited greatly from off-list discussions of BPF
requirements with Alexei Starovoitov and Andrii Nakryiko, as well as from
numerous on-list discussions, at least some of which are captured in the
"Link:" tags on the patches themselves.

The patches in this series are as follows:

1.	Add function to sample state of non-running function.
	I would guess that the API is still subject to change.  ;-)

2.	Use the above function to add per-task state to RCU CPU stall
	warnings.

3.	Add rcutorture module parameter to produce non-busy-wait task
	stalls, thus allowing the above RCU CPU stall change to be
	exercised.

4.	Move Tasks RCU to its own file.

5.	Create struct to hold RCU-tasks state information.

6.	Reinstate synchronize_rcu_mult(), as there will likely once
	again be a need to wait on multiple flavors of RCU.

7.	Add an rcutorture test for synchronize_rcu_mult().

8.	Refactor RCU-tasks to allow variants to be added.

9.	Add an RCU-tasks rude variant, based on Steven Rostedt's
	use of schedule_on_each_cpu().

10.	Add torture tests for RCU Tasks Rude.

11.	Use unique names for RCU-Tasks kthreads and messages.

12.	Further refactor RCU-tasks to allow adding even more variants.

13.	Code movement to allow even more Tasks RCU variants.

14.	Add an RCU Tasks Trace to simplify protection of tracing hooks,
	including BPF.

15.	Add torture tests for RCU Tasks Trace.

16.	Add stall warnings for RCU Tasks Trace.

The new versions of Tasks RCU pass moderate rcutorture testing, and more
severe testing is in the offing.  They are not yet ready for production
use, however!

							Thanx, Paul

------------------------------------------------------------------------

 Documentation/admin-guide/kernel-parameters.txt             |    5 
 include/linux/rcupdate.h                                    |    9 
 include/linux/rcupdate_trace.h                              |   84 
 include/linux/rcupdate_wait.h                               |   19 
 include/linux/sched.h                                       |    8 
 include/linux/wait.h                                        |    2 
 init/init_task.c                                            |    4 
 kernel/fork.c                                               |    4 
 kernel/rcu/Kconfig                                          |   34 
 kernel/rcu/Kconfig.debug                                    |    4 
 kernel/rcu/rcu.h                                            |    2 
 kernel/rcu/rcutorture.c                                     |   96 
 kernel/rcu/tasks.h                                          | 1730 +++++++++---
 kernel/rcu/tree_stall.h                                     |   38 
 kernel/rcu/update.c                                         |  370 --
 kernel/sched/core.c                                         |   49 
 tools/testing/selftests/rcutorture/configs/rcu/CFLIST       |    2 
 tools/testing/selftests/rcutorture/configs/rcu/RUDE01       |   10 
 tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot  |    1 
 tools/testing/selftests/rcutorture/configs/rcu/TRACE01      |   10 
 tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot |    1 
 21 files changed, 1702 insertions(+), 780 deletions(-)


* [PATCH RFC tip/core/rcu 01/16] sched/core: Add function to sample state of non-running function
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
@ 2020-03-12 18:16 ` paulmck
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 02/16] rcu: Add per-task state to RCU CPU stall warnings paulmck
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-12 18:16 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman

From: "Paul E. McKenney" <paulmck@kernel.org>

A running task's state can be sampled in a consistent manner (for example,
for diagnostic purposes) simply by invoking smp_call_function_single()
on its CPU, which may be obtained using task_cpu(), then having the
IPI handler verify that the desired task is in fact still running.
However, if the task is not running, this sampling can in theory be
done directly.  In practice, the task might start running at any time,
including during the sampling period.  Gaining a consistent sample of
a not-running task therefore requires that something be done to prevent
that task from running.

This commit therefore adds a try_invoke_on_nonrunning_task() function
that invokes a specified function with the specified argument if the
specified task is in a non-running state, returning true if successful.
Otherwise this function simply returns false.  Given that the function
passed to try_invoke_on_nonrunning_task() will be invoked with a runqueue
lock held, that function had better be quite lightweight.
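
The intended contract can be modeled in userspace along the following
lines.  This is only a sketch under simplifying assumptions: struct
model_task, model_try_invoke_on_nonrunning(), model_set_running(), and
model_sample_nesting() are hypothetical names, and a single pthread
mutex stands in for both ->pi_lock and the runqueue lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model of a task; not the kernel's task_struct. */
struct model_task {
	pthread_mutex_t lock;	/* Stands in for ->pi_lock and rq->lock. */
	bool running;		/* True while the task is on a CPU. */
	int nesting;		/* Example state that a caller might sample. */
};

/*
 * Invoke func(arg) only if the task is not running, holding the lock
 * across the call so that the task cannot start running meanwhile.
 * Returns true if func was invoked, false otherwise.
 */
static bool model_try_invoke_on_nonrunning(struct model_task *t,
					   void (*func)(void *arg), void *arg)
{
	bool ret = false;

	assert(t && func);
	pthread_mutex_lock(&t->lock);
	if (!t->running) {
		func(arg);	/* Runs with the lock held: keep it short. */
		ret = true;
	}
	pthread_mutex_unlock(&t->lock);
	return ret;
}

/* Every transition to or from running must take the same lock. */
static void model_set_running(struct model_task *t, bool running)
{
	pthread_mutex_lock(&t->lock);
	t->running = running;
	pthread_mutex_unlock(&t->lock);
}

/* Request/response structure passed through the void pointer. */
struct model_sample {
	struct model_task *t;
	int nesting;
};

/* Lightweight sampling function of the kind the commit log asks for. */
static void model_sample_nesting(void *arg)
{
	struct model_sample *s = arg;

	s->nesting = s->t->nesting;
}
```

A caller that gets false back has learned nothing about the task's state
and needs some fallback, for example reporting less detail.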

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Apply feedback from Peter Zijlstra and Steven Rostedt. ]
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
---
 include/linux/wait.h |  2 ++
 kernel/sched/core.c  | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 3283c8d..6c0f989 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -1148,4 +1148,6 @@ int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, i
 		(wait)->flags = 0;						\
 	} while (0)
 
+bool try_invoke_on_nonrunning_task(struct task_struct *p, void (*func)(void *arg), void *arg);
+
 #endif /* _LINUX_WAIT_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc1dfc0..d64328c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2580,6 +2580,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 *
 	 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
 	 * __schedule().  See the comment for smp_mb__after_spinlock().
+	 *
+	 * A similar smp_rmb() lives in try_invoke_on_nonrunning_task().
 	 */
 	smp_rmb();
 	if (p->on_rq && ttwu_remote(p, wake_flags))
@@ -2654,6 +2656,53 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 }
 
 /**
+ * try_invoke_on_nonrunning_task - Invoke a function for a non-running task
+ * @p: Process for which the function is to be invoked.
+ * @func: Function to invoke.
+ * @arg: Argument to function.
+ *
+ * If the specified task is not running (either sleeping or runnable but
+ * not actually running), arrange to keep it in that state while invoking
+ * @func(@arg).  Given that @func can be invoked with a runqueue lock held,
+ * it had better be quite lightweight.
+ *
+ * Returns:
+ *	@false if the task is running or waking, so that @func was not invoked.
+ *	@true if the task is sleeping or runnable but not running.
+ */
+bool try_invoke_on_nonrunning_task(struct task_struct *p, void (*func)(void *arg), void *arg)
+{
+	bool ret = false;
+	struct rq_flags rf;
+	struct rq *rq;
+
+	lockdep_assert_irqs_enabled();
+	raw_spin_lock_irq(&p->pi_lock);
+	if (p->on_rq) {
+		rq = __task_rq_lock(p, &rf);
+		if (task_rq(p) == rq && !task_curr(p)) {
+			func(arg);
+			ret = true;
+		}
+		rq_unlock(rq, &rf);
+	} else {
+		switch (p->state) {
+		case TASK_RUNNING:
+		case TASK_WAKING:
+			break;
+		default:
+			smp_rmb();
+			if (!p->on_rq) {
+				func(arg);
+				ret = true;
+			}
+		}
+	}
+	raw_spin_unlock_irq(&p->pi_lock);
+	return ret;
+}
+
+/**
  * wake_up_process - Wake up a specific process
  * @p: The process to be woken up.
  *
-- 
2.9.5



* [PATCH RFC tip/core/rcu 02/16] rcu: Add per-task state to RCU CPU stall warnings
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 01/16] sched/core: Add function to sample state of non-running function paulmck
@ 2020-03-12 18:16 ` paulmck
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 03/16] rcutorture: Add flag to produce non-busy-wait task stalls paulmck
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-12 18:16 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

Currently, an RCU-preempt CPU stall warning simply lists the PIDs of
those tasks holding up the current grace period.  This can be helpful,
but more information would be even more helpful.

To this end, this commit adds the nesting level, whether the task
thinks it was preempted in its current RCU read-side critical section,
whether RCU core has asked this task for a quiescent state, whether the
expedited-grace-period hint is set, and whether the task believes that
it is on the blocked-tasks list (it must be, or it would not be printed,
but if things are broken, best not to take too much for granted).
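
The per-task output relies on indexing a two-character string literal
with a 0/1 flag, so that a clear flag prints as '.' and a set flag
prints as its letter.  A standalone sketch of that formatting follows;
struct stall_sample and format_stall_sample() are hypothetical names,
and snprintf() stands in for pr_cont().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical snapshot of the sampled per-task fields. */
struct stall_sample {
	int pid;
	int nesting;		/* RCU read-side nesting level. */
	bool blocked;		/* Task thinks it was preempted in a reader. */
	bool need_qs;		/* RCU core has asked for a quiescent state. */
	bool exp_hint;		/* Expedited-grace-period hint is set. */
	bool on_blkd_list;	/* Task believes it is on the blocked list. */
};

/*
 * Format one task as " P<pid>/<nesting>:<b><q><e><l>".  Each flag
 * indexes a two-character string: index 0 yields '.', index 1 yields
 * the flag's letter.  Returns snprintf()'s return value.
 */
static int format_stall_sample(char *buf, size_t len,
			       const struct stall_sample *s)
{
	assert(buf && s);
	return snprintf(buf, len, " P%d/%d:%c%c%c%c",
			s->pid, s->nesting,
			".b"[s->blocked],
			".q"[s->need_qs],
			".e"[s->exp_hint],
			".l"[s->on_blkd_list]);
}
```

For example, a task with PID 1234 at nesting level 1 that is blocked and
on the blocked-tasks list, with the other two flags clear, formats as
" P1234/1:b..l".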

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tree_stall.h | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 502b4dd..6de5a5b 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -192,14 +192,39 @@ static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp)
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
+// Communicate task state back to the RCU CPU stall warning request.
+struct rcu_stall_chk_rdr {
+	struct task_struct *t;
+	int nesting;
+	union rcu_special rs;
+	bool on_blkd_list;
+};
+
+/*
+ * Report out the state of a not-running task that is stalling the
+ * current RCU grace period.
+ */
+static void check_slow_task(void *arg)
+{
+	struct rcu_node *rnp;
+	struct rcu_stall_chk_rdr *rscrp = arg;
+	struct task_struct *t = rscrp->t;
+
+	rscrp->nesting = t->rcu_read_lock_nesting;
+	rscrp->rs = t->rcu_read_unlock_special;
+	rnp = t->rcu_blocked_node;
+	rscrp->on_blkd_list = !list_empty(&t->rcu_node_entry);
+}
+
 /*
  * Scan the current list of tasks blocked within RCU read-side critical
  * sections, printing out the tid of each.
  */
 static int rcu_print_task_stall(struct rcu_node *rnp)
 {
-	struct task_struct *t;
 	int ndetected = 0;
+	struct rcu_stall_chk_rdr rscr;
+	struct task_struct *t;
 
 	if (!rcu_preempt_blocked_readers_cgp(rnp))
 		return 0;
@@ -208,7 +233,16 @@ static int rcu_print_task_stall(struct rcu_node *rnp)
 	t = list_entry(rnp->gp_tasks->prev,
 		       struct task_struct, rcu_node_entry);
 	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
-		pr_cont(" P%d", t->pid);
+		rscr.t = t;
+		if (!try_invoke_on_nonrunning_task(t, check_slow_task, &rscr))
+			pr_cont(" P%d", t->pid);
+		else
+			pr_cont(" P%d/%d:%c%c%c%c",
+				t->pid, rscr.nesting,
+				".b"[rscr.rs.b.blocked],
+				".q"[rscr.rs.b.need_qs],
+				".e"[rscr.rs.b.exp_hint],
+				".l"[rscr.on_blkd_list]);
 		ndetected++;
 	}
 	pr_cont("\n");
-- 
2.9.5



* [PATCH RFC tip/core/rcu 03/16] rcutorture: Add flag to produce non-busy-wait task stalls
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 01/16] sched/core: Add function to sample state of non-running function paulmck
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 02/16] rcu: Add per-task state to RCU CPU stall warnings paulmck
@ 2020-03-12 18:16 ` paulmck
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 04/16] rcu-tasks: Move Tasks RCU to its own file paulmck
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-12 18:16 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit aids testing of RCU task stall warning messages by adding
an rcutorture.stall_cpu_block module parameter that results in the
induced stall sleeping within the RCU read-side critical section.
Spinning with interrupts disabled is still available via the
rcutorture.stall_cpu_irqsoff module parameter, and specifying neither
of these two module parameters will spin with preemption disabled.

Note that sleeping (as opposed to preemption) results in additional
complaints from RCU at context-switch time, so yet more testing.
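
The way the two module parameters combine can be summarized by a small
standalone sketch that mirrors the branch structure of the
rcu_torture_stall() hunk below.  The names struct stall_plan and
plan_stall() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what an induced stall will do. */
struct stall_plan {
	bool irqs_off;		/* local_irq_disable() around the stall. */
	bool preempt_off;	/* preempt_disable() around the stall. */
	bool sleeps;		/* Loop body sleeps instead of spinning. */
};

/*
 * Derive the plan from the two parameters:
 *   - stall_cpu_irqsoff set: interrupts are disabled.
 *   - neither set: preemption is disabled and the loop spins.
 *   - stall_cpu_block set: the loop sleeps within the reader.
 * The hunk does not treat "both set" specially, and neither does this.
 */
static struct stall_plan plan_stall(int stall_cpu_irqsoff, int stall_cpu_block)
{
	struct stall_plan p = {
		.irqs_off = stall_cpu_irqsoff != 0,
		.preempt_off = !stall_cpu_irqsoff && !stall_cpu_block,
		.sleeps = stall_cpu_block != 0,
	};

	assert(!(p.irqs_off && p.preempt_off));	/* Branches are exclusive. */
	return p;
}
```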

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 Documentation/admin-guide/kernel-parameters.txt |  5 +++++
 kernel/rcu/rcutorture.c                         | 13 ++++++++-----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6d16b78..17eff15 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4161,6 +4161,11 @@
 			Duration of CPU stall (s) to test RCU CPU stall
 			warnings, zero to disable.
 
+	rcutorture.stall_cpu_block= [KNL]
+			Sleep while stalling if set.  This will result
+			in warnings from preemptible RCU in addition
+			to any other stall-related activity.
+
 	rcutorture.stall_cpu_holdoff= [KNL]
 			Time to wait (s) after boot before inducing stall.
 
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index b3301f3..f75d466 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -102,6 +102,7 @@ torture_param(int, stall_cpu, 0, "Stall duration (s), zero to disable.");
 torture_param(int, stall_cpu_holdoff, 10,
 	     "Time to wait before starting stall (s).");
 torture_param(int, stall_cpu_irqsoff, 0, "Disable interrupts while stalling.");
+torture_param(int, stall_cpu_block, 0, "Sleep while stalling.");
 torture_param(int, stat_interval, 60,
 	     "Number of seconds between stats printk()s");
 torture_param(int, stutter, 5, "Number of seconds to run/halt test");
@@ -1599,6 +1600,7 @@ static int rcutorture_booster_init(unsigned int cpu)
  */
 static int rcu_torture_stall(void *args)
 {
+	int idx;
 	unsigned long stop_at;
 
 	VERBOSE_TOROUT_STRING("rcu_torture_stall task started");
@@ -1610,21 +1612,22 @@ static int rcu_torture_stall(void *args)
 	if (!kthread_should_stop()) {
 		stop_at = ktime_get_seconds() + stall_cpu;
 		/* RCU CPU stall is expected behavior in following code. */
-		rcu_read_lock();
+		idx = cur_ops->readlock();
 		if (stall_cpu_irqsoff)
 			local_irq_disable();
-		else
+		else if (!stall_cpu_block)
 			preempt_disable();
 		pr_alert("rcu_torture_stall start on CPU %d.\n",
 			 smp_processor_id());
 		while (ULONG_CMP_LT((unsigned long)ktime_get_seconds(),
 				    stop_at))
-			continue;  /* Induce RCU CPU stall warning. */
+			if (stall_cpu_block)
+				schedule_timeout_uninterruptible(HZ);
 		if (stall_cpu_irqsoff)
 			local_irq_enable();
-		else
+		else if (!stall_cpu_block)
 			preempt_enable();
-		rcu_read_unlock();
+		cur_ops->readunlock(idx);
 		pr_alert("rcu_torture_stall end.\n");
 	}
 	torture_shutdown_absorb("rcu_torture_stall");
-- 
2.9.5



* [PATCH RFC tip/core/rcu 04/16] rcu-tasks: Move Tasks RCU to its own file
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                   ` (2 preceding siblings ...)
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 03/16] rcutorture: Add flag to produce non-busy-wait task stalls paulmck
@ 2020-03-12 18:16 ` paulmck
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 05/16] rcu-tasks: Create struct to hold state information paulmck
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-12 18:16 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This code-movement-only commit is in preparation for adding an additional
flavor of Tasks RCU, which relies on workqueues to detect grace periods.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h  | 370 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/rcu/update.c | 366 +--------------------------------------------------
 2 files changed, 372 insertions(+), 364 deletions(-)
 create mode 100644 kernel/rcu/tasks.h

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
new file mode 100644
index 0000000..be8d179
--- /dev/null
+++ b/kernel/rcu/tasks.h
@@ -0,0 +1,370 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * Task-based RCU implementations.
+ *
+ * Copyright (C) 2020 Paul E. McKenney
+ */
+
+#ifdef CONFIG_TASKS_RCU
+
+/*
+ * Simple variant of RCU whose quiescent states are voluntary context
+ * switch, cond_resched_rcu_qs(), user-space execution, and idle.
+ * As such, grace periods can take one good long time.  There are no
+ * read-side primitives similar to rcu_read_lock() and rcu_read_unlock()
+ * because this implementation is intended to get the system into a safe
+ * state for some of the manipulations involved in tracing and the like.
+ * Finally, this implementation does not support high call_rcu_tasks()
+ * rates from multiple CPUs.  If this is required, per-CPU callback lists
+ * will be needed.
+ */
+
+/* Global list of callbacks and associated lock. */
+static struct rcu_head *rcu_tasks_cbs_head;
+static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
+static DECLARE_WAIT_QUEUE_HEAD(rcu_tasks_cbs_wq);
+static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
+
+/* Track exiting tasks in order to allow them to be waited for. */
+DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
+
+/* Control stall timeouts.  Disable with <= 0, otherwise jiffies till stall. */
+#define RCU_TASK_STALL_TIMEOUT (HZ * 60 * 10)
+static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
+module_param(rcu_task_stall_timeout, int, 0644);
+
+static struct task_struct *rcu_tasks_kthread_ptr;
+
+/**
+ * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
+ * @rhp: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. call_rcu_tasks() assumes
+ * that the read-side critical sections end at a voluntary context
+ * switch (not a preemption!), cond_resched_rcu_qs(), entry into idle,
+ * or transition to usermode execution.  As such, there are no read-side
+ * primitives analogous to rcu_read_lock() and rcu_read_unlock() because
+ * this primitive is intended to determine that all tasks have passed
+ * through a safe state, not so much for data-structure synchronization.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
+ */
+void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
+{
+	unsigned long flags;
+	bool needwake;
+
+	rhp->next = NULL;
+	rhp->func = func;
+	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
+	needwake = !rcu_tasks_cbs_head;
+	WRITE_ONCE(*rcu_tasks_cbs_tail, rhp);
+	rcu_tasks_cbs_tail = &rhp->next;
+	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+	/* We can't create the thread unless interrupts are enabled. */
+	if (needwake && READ_ONCE(rcu_tasks_kthread_ptr))
+		wake_up(&rcu_tasks_cbs_wq);
+}
+EXPORT_SYMBOL_GPL(call_rcu_tasks);
+
+/**
+ * synchronize_rcu_tasks - wait until an rcu-tasks grace period has elapsed.
+ *
+ * Control will return to the caller some time after a full rcu-tasks
+ * grace period has elapsed, in other words after all currently
+ * executing rcu-tasks read-side critical sections have elapsed.  These
+ * read-side critical sections are delimited by calls to schedule(),
+ * cond_resched_tasks_rcu_qs(), idle execution, userspace execution, calls
+ * to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().
+ *
+ * This is a very specialized primitive, intended only for a few uses in
+ * tracing and other situations requiring manipulation of function
+ * preambles and profiling hooks.  The synchronize_rcu_tasks() function
+ * is not (yet) intended for heavy use from multiple CPUs.
+ *
+ * Note that this guarantee implies further memory-ordering guarantees.
+ * On systems with more than one CPU, when synchronize_rcu_tasks() returns,
+ * each CPU is guaranteed to have executed a full memory barrier since the
+ * end of its last RCU-tasks read-side critical section whose beginning
+ * preceded the call to synchronize_rcu_tasks().  In addition, each CPU
+ * having an RCU-tasks read-side critical section that extends beyond
+ * the return from synchronize_rcu_tasks() is guaranteed to have executed
+ * a full memory barrier after the beginning of synchronize_rcu_tasks()
+ * and before the beginning of that RCU-tasks read-side critical section.
+ * Note that these guarantees include CPUs that are offline, idle, or
+ * executing in user mode, as well as CPUs that are executing in the kernel.
+ *
+ * Furthermore, if CPU A invoked synchronize_rcu_tasks(), which returned
+ * to its caller on CPU B, then both CPU A and CPU B are guaranteed
+ * to have executed a full memory barrier during the execution of
+ * synchronize_rcu_tasks() -- even if CPU A and CPU B are the same CPU
+ * (but again only if the system has more than one CPU).
+ */
+void synchronize_rcu_tasks(void)
+{
+	/* Complain if the scheduler has not started.  */
+	RCU_LOCKDEP_WARN(rcu_scheduler_active == RCU_SCHEDULER_INACTIVE,
+			 "synchronize_rcu_tasks called too soon");
+
+	/* Wait for the grace period. */
+	wait_rcu_gp(call_rcu_tasks);
+}
+EXPORT_SYMBOL_GPL(synchronize_rcu_tasks);
+
+/**
+ * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
+ *
+ * Although the current implementation is guaranteed to wait, it is not
+ * obligated to, for example, if there are no pending callbacks.
+ */
+void rcu_barrier_tasks(void)
+{
+	/* There is only one callback queue, so this is easy.  ;-) */
+	synchronize_rcu_tasks();
+}
+EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
+
+/* See if tasks are still holding out, complain if so. */
+static void check_holdout_task(struct task_struct *t,
+			       bool needreport, bool *firstreport)
+{
+	int cpu;
+
+	if (!READ_ONCE(t->rcu_tasks_holdout) ||
+	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||
+	    !READ_ONCE(t->on_rq) ||
+	    (IS_ENABLED(CONFIG_NO_HZ_FULL) &&
+	     !is_idle_task(t) && t->rcu_tasks_idle_cpu >= 0)) {
+		WRITE_ONCE(t->rcu_tasks_holdout, false);
+		list_del_init(&t->rcu_tasks_holdout_list);
+		put_task_struct(t);
+		return;
+	}
+	rcu_request_urgent_qs_task(t);
+	if (!needreport)
+		return;
+	if (*firstreport) {
+		pr_err("INFO: rcu_tasks detected stalls on tasks:\n");
+		*firstreport = false;
+	}
+	cpu = task_cpu(t);
+	pr_alert("%p: %c%c nvcsw: %lu/%lu holdout: %d idle_cpu: %d/%d\n",
+		 t, ".I"[is_idle_task(t)],
+		 "N."[cpu < 0 || !tick_nohz_full_cpu(cpu)],
+		 t->rcu_tasks_nvcsw, t->nvcsw, t->rcu_tasks_holdout,
+		 t->rcu_tasks_idle_cpu, cpu);
+	sched_show_task(t);
+}
+
+/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
+static int __noreturn rcu_tasks_kthread(void *arg)
+{
+	unsigned long flags;
+	struct task_struct *g, *t;
+	unsigned long lastreport;
+	struct rcu_head *list;
+	struct rcu_head *next;
+	LIST_HEAD(rcu_tasks_holdouts);
+	int fract;
+
+	/* Run on housekeeping CPUs by default.  Sysadm can move if desired. */
+	housekeeping_affine(current, HK_FLAG_RCU);
+
+	/*
+	 * Each pass through the following loop makes one check for
+	 * newly arrived callbacks, and, if there are some, waits for
+	 * one RCU-tasks grace period and then invokes the callbacks.
+	 * This loop is terminated by the system going down.  ;-)
+	 */
+	for (;;) {
+
+		/* Pick up any new callbacks. */
+		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
+		list = rcu_tasks_cbs_head;
+		rcu_tasks_cbs_head = NULL;
+		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
+		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+
+		/* If there were none, wait a bit and start over. */
+		if (!list) {
+			wait_event_interruptible(rcu_tasks_cbs_wq,
+						 READ_ONCE(rcu_tasks_cbs_head));
+			if (!rcu_tasks_cbs_head) {
+				WARN_ON(signal_pending(current));
+				schedule_timeout_interruptible(HZ/10);
+			}
+			continue;
+		}
+
+		/*
+		 * Wait for all pre-existing t->on_rq and t->nvcsw
+		 * transitions to complete.  Invoking synchronize_rcu()
+		 * suffices because all these transitions occur with
+		 * interrupts disabled.  Without this synchronize_rcu(),
+		 * a read-side critical section that started before the
+		 * grace period might be incorrectly seen as having started
+		 * after the grace period.
+		 *
+		 * This synchronize_rcu() also dispenses with the
+		 * need for a memory barrier on the first store to
+		 * ->rcu_tasks_holdout, as it forces the store to happen
+		 * after the beginning of the grace period.
+		 */
+		synchronize_rcu();
+
+		/*
+		 * There were callbacks, so we need to wait for an
+		 * RCU-tasks grace period.  Start off by scanning
+		 * the task list for tasks that are not already
+		 * voluntarily blocked.  Mark these tasks and make
+		 * a list of them in rcu_tasks_holdouts.
+		 */
+		rcu_read_lock();
+		for_each_process_thread(g, t) {
+			if (t != current && READ_ONCE(t->on_rq) &&
+			    !is_idle_task(t)) {
+				get_task_struct(t);
+				t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
+				WRITE_ONCE(t->rcu_tasks_holdout, true);
+				list_add(&t->rcu_tasks_holdout_list,
+					 &rcu_tasks_holdouts);
+			}
+		}
+		rcu_read_unlock();
+
+		/*
+		 * Wait for tasks that are in the process of exiting.
+		 * This does only part of the job, ensuring that all
+		 * tasks that were previously exiting reach the point
+		 * where they have disabled preemption, allowing the
+		 * later synchronize_rcu() to finish the job.
+		 */
+		synchronize_srcu(&tasks_rcu_exit_srcu);
+
+		/*
+		 * Each pass through the following loop scans the list
+		 * of holdout tasks, removing any that are no longer
+		 * holdouts.  When the list is empty, we are done.
+		 */
+		lastreport = jiffies;
+
+		/* Start off with HZ/10 wait and slowly back off to 1 HZ wait*/
+		fract = 10;
+
+		for (;;) {
+			bool firstreport;
+			bool needreport;
+			int rtst;
+			struct task_struct *t1;
+
+			if (list_empty(&rcu_tasks_holdouts))
+				break;
+
+			/* Slowly back off waiting for holdouts */
+			schedule_timeout_interruptible(HZ/fract);
+
+			if (fract > 1)
+				fract--;
+
+			rtst = READ_ONCE(rcu_task_stall_timeout);
+			needreport = rtst > 0 &&
+				     time_after(jiffies, lastreport + rtst);
+			if (needreport)
+				lastreport = jiffies;
+			firstreport = true;
+			WARN_ON(signal_pending(current));
+			list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
+						rcu_tasks_holdout_list) {
+				check_holdout_task(t, needreport, &firstreport);
+				cond_resched();
+			}
+		}
+
+		/*
+		 * Because ->on_rq and ->nvcsw are not guaranteed
+		 * to have a full memory barriers prior to them in the
+		 * schedule() path, memory reordering on other CPUs could
+		 * cause their RCU-tasks read-side critical sections to
+		 * extend past the end of the grace period.  However,
+		 * because these ->nvcsw updates are carried out with
+		 * interrupts disabled, we can use synchronize_rcu()
+		 * to force the needed ordering on all such CPUs.
+		 *
+		 * This synchronize_rcu() also confines all
+		 * ->rcu_tasks_holdout accesses to be within the grace
+		 * period, avoiding the need for memory barriers for
+		 * ->rcu_tasks_holdout accesses.
+		 *
+		 * In addition, this synchronize_rcu() waits for exiting
+		 * tasks to complete their final preempt_disable() region
+		 * of execution, cleaning up after the synchronize_srcu()
+		 * above.
+		 */
+		synchronize_rcu();
+
+		/* Invoke the callbacks. */
+		while (list) {
+			next = list->next;
+			local_bh_disable();
+			list->func(list);
+			local_bh_enable();
+			list = next;
+			cond_resched();
+		}
+		/* Paranoid sleep to keep this from entering a tight loop */
+		schedule_timeout_uninterruptible(HZ/10);
+	}
+}
+
+/* Spawn rcu_tasks_kthread() at core_initcall() time. */
+static int __init rcu_spawn_tasks_kthread(void)
+{
+	struct task_struct *t;
+
+	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
+	if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
+		return 0;
+	smp_mb(); /* Ensure others see full kthread. */
+	WRITE_ONCE(rcu_tasks_kthread_ptr, t);
+	return 0;
+}
+core_initcall(rcu_spawn_tasks_kthread);
+
+/* Do the srcu_read_lock() for the above synchronize_srcu().  */
+void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
+{
+	preempt_disable();
+	current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu);
+	preempt_enable();
+}
+
+/* Do the srcu_read_unlock() for the above synchronize_srcu().  */
+void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
+{
+	preempt_disable();
+	__srcu_read_unlock(&tasks_rcu_exit_srcu, current->rcu_tasks_idx);
+	preempt_enable();
+}
+
+#endif /* #ifdef CONFIG_TASKS_RCU */
+
+#ifndef CONFIG_TINY_RCU
+
+/*
+ * Print any non-default Tasks RCU settings.
+ */
+static void __init rcu_tasks_bootup_oddness(void)
+{
+#ifdef CONFIG_TASKS_RCU
+	if (rcu_task_stall_timeout != RCU_TASK_STALL_TIMEOUT)
+		pr_info("\tTasks-RCU CPU stall warnings timeout set to %d (rcu_task_stall_timeout).\n", rcu_task_stall_timeout);
+	else
+		pr_info("\tTasks RCU enabled.\n");
+#endif /* #ifdef CONFIG_TASKS_RCU */
+}
+
+#endif /* #ifndef CONFIG_TINY_RCU */
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index a4ad8e0..1fbeb99 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -489,370 +489,6 @@ int rcu_cpu_stall_suppress_at_boot __read_mostly; // !0 = suppress boot stalls.
 EXPORT_SYMBOL_GPL(rcu_cpu_stall_suppress_at_boot);
 module_param(rcu_cpu_stall_suppress_at_boot, int, 0444);
 
-#ifdef CONFIG_TASKS_RCU
-
-/*
- * Simple variant of RCU whose quiescent states are voluntary context
- * switch, cond_resched_rcu_qs(), user-space execution, and idle.
- * As such, grace periods can take one good long time.  There are no
- * read-side primitives similar to rcu_read_lock() and rcu_read_unlock()
- * because this implementation is intended to get the system into a safe
- * state for some of the manipulations involved in tracing and the like.
- * Finally, this implementation does not support high call_rcu_tasks()
- * rates from multiple CPUs.  If this is required, per-CPU callback lists
- * will be needed.
- */
-
-/* Global list of callbacks and associated lock. */
-static struct rcu_head *rcu_tasks_cbs_head;
-static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
-static DECLARE_WAIT_QUEUE_HEAD(rcu_tasks_cbs_wq);
-static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
-
-/* Track exiting tasks in order to allow them to be waited for. */
-DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
-
-/* Control stall timeouts.  Disable with <= 0, otherwise jiffies till stall. */
-#define RCU_TASK_STALL_TIMEOUT (HZ * 60 * 10)
-static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
-module_param(rcu_task_stall_timeout, int, 0644);
-
-static struct task_struct *rcu_tasks_kthread_ptr;
-
-/**
- * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
- * @rhp: structure to be used for queueing the RCU updates.
- * @func: actual callback function to be invoked after the grace period
- *
- * The callback function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. call_rcu_tasks() assumes
- * that the read-side critical sections end at a voluntary context
- * switch (not a preemption!), cond_resched_rcu_qs(), entry into idle,
- * or transition to usermode execution.  As such, there are no read-side
- * primitives analogous to rcu_read_lock() and rcu_read_unlock() because
- * this primitive is intended to determine that all tasks have passed
- * through a safe state, not so much for data-strcuture synchronization.
- *
- * See the description of call_rcu() for more detailed information on
- * memory ordering guarantees.
- */
-void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
-{
-	unsigned long flags;
-	bool needwake;
-
-	rhp->next = NULL;
-	rhp->func = func;
-	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
-	needwake = !rcu_tasks_cbs_head;
-	WRITE_ONCE(*rcu_tasks_cbs_tail, rhp);
-	rcu_tasks_cbs_tail = &rhp->next;
-	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
-	/* We can't create the thread unless interrupts are enabled. */
-	if (needwake && READ_ONCE(rcu_tasks_kthread_ptr))
-		wake_up(&rcu_tasks_cbs_wq);
-}
-EXPORT_SYMBOL_GPL(call_rcu_tasks);
-
-/**
- * synchronize_rcu_tasks - wait until an rcu-tasks grace period has elapsed.
- *
- * Control will return to the caller some time after a full rcu-tasks
- * grace period has elapsed, in other words after all currently
- * executing rcu-tasks read-side critical sections have elapsed.  These
- * read-side critical sections are delimited by calls to schedule(),
- * cond_resched_tasks_rcu_qs(), idle execution, userspace execution, calls
- * to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().
- *
- * This is a very specialized primitive, intended only for a few uses in
- * tracing and other situations requiring manipulation of function
- * preambles and profiling hooks.  The synchronize_rcu_tasks() function
- * is not (yet) intended for heavy use from multiple CPUs.
- *
- * Note that this guarantee implies further memory-ordering guarantees.
- * On systems with more than one CPU, when synchronize_rcu_tasks() returns,
- * each CPU is guaranteed to have executed a full memory barrier since the
- * end of its last RCU-tasks read-side critical section whose beginning
- * preceded the call to synchronize_rcu_tasks().  In addition, each CPU
- * having an RCU-tasks read-side critical section that extends beyond
- * the return from synchronize_rcu_tasks() is guaranteed to have executed
- * a full memory barrier after the beginning of synchronize_rcu_tasks()
- * and before the beginning of that RCU-tasks read-side critical section.
- * Note that these guarantees include CPUs that are offline, idle, or
- * executing in user mode, as well as CPUs that are executing in the kernel.
- *
- * Furthermore, if CPU A invoked synchronize_rcu_tasks(), which returned
- * to its caller on CPU B, then both CPU A and CPU B are guaranteed
- * to have executed a full memory barrier during the execution of
- * synchronize_rcu_tasks() -- even if CPU A and CPU B are the same CPU
- * (but again only if the system has more than one CPU).
- */
-void synchronize_rcu_tasks(void)
-{
-	/* Complain if the scheduler has not started.  */
-	RCU_LOCKDEP_WARN(rcu_scheduler_active == RCU_SCHEDULER_INACTIVE,
-			 "synchronize_rcu_tasks called too soon");
-
-	/* Wait for the grace period. */
-	wait_rcu_gp(call_rcu_tasks);
-}
-EXPORT_SYMBOL_GPL(synchronize_rcu_tasks);
-
-/**
- * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
- *
- * Although the current implementation is guaranteed to wait, it is not
- * obligated to, for example, if there are no pending callbacks.
- */
-void rcu_barrier_tasks(void)
-{
-	/* There is only one callback queue, so this is easy.  ;-) */
-	synchronize_rcu_tasks();
-}
-EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
-
-/* See if tasks are still holding out, complain if so. */
-static void check_holdout_task(struct task_struct *t,
-			       bool needreport, bool *firstreport)
-{
-	int cpu;
-
-	if (!READ_ONCE(t->rcu_tasks_holdout) ||
-	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||
-	    !READ_ONCE(t->on_rq) ||
-	    (IS_ENABLED(CONFIG_NO_HZ_FULL) &&
-	     !is_idle_task(t) && t->rcu_tasks_idle_cpu >= 0)) {
-		WRITE_ONCE(t->rcu_tasks_holdout, false);
-		list_del_init(&t->rcu_tasks_holdout_list);
-		put_task_struct(t);
-		return;
-	}
-	rcu_request_urgent_qs_task(t);
-	if (!needreport)
-		return;
-	if (*firstreport) {
-		pr_err("INFO: rcu_tasks detected stalls on tasks:\n");
-		*firstreport = false;
-	}
-	cpu = task_cpu(t);
-	pr_alert("%p: %c%c nvcsw: %lu/%lu holdout: %d idle_cpu: %d/%d\n",
-		 t, ".I"[is_idle_task(t)],
-		 "N."[cpu < 0 || !tick_nohz_full_cpu(cpu)],
-		 t->rcu_tasks_nvcsw, t->nvcsw, t->rcu_tasks_holdout,
-		 t->rcu_tasks_idle_cpu, cpu);
-	sched_show_task(t);
-}
-
-/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
-static int __noreturn rcu_tasks_kthread(void *arg)
-{
-	unsigned long flags;
-	struct task_struct *g, *t;
-	unsigned long lastreport;
-	struct rcu_head *list;
-	struct rcu_head *next;
-	LIST_HEAD(rcu_tasks_holdouts);
-	int fract;
-
-	/* Run on housekeeping CPUs by default.  Sysadm can move if desired. */
-	housekeeping_affine(current, HK_FLAG_RCU);
-
-	/*
-	 * Each pass through the following loop makes one check for
-	 * newly arrived callbacks, and, if there are some, waits for
-	 * one RCU-tasks grace period and then invokes the callbacks.
-	 * This loop is terminated by the system going down.  ;-)
-	 */
-	for (;;) {
-
-		/* Pick up any new callbacks. */
-		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
-		list = rcu_tasks_cbs_head;
-		rcu_tasks_cbs_head = NULL;
-		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
-		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
-
-		/* If there were none, wait a bit and start over. */
-		if (!list) {
-			wait_event_interruptible(rcu_tasks_cbs_wq,
-						 READ_ONCE(rcu_tasks_cbs_head));
-			if (!rcu_tasks_cbs_head) {
-				WARN_ON(signal_pending(current));
-				schedule_timeout_interruptible(HZ/10);
-			}
-			continue;
-		}
-
-		/*
-		 * Wait for all pre-existing t->on_rq and t->nvcsw
-		 * transitions to complete.  Invoking synchronize_rcu()
-		 * suffices because all these transitions occur with
-		 * interrupts disabled.  Without this synchronize_rcu(),
-		 * a read-side critical section that started before the
-		 * grace period might be incorrectly seen as having started
-		 * after the grace period.
-		 *
-		 * This synchronize_rcu() also dispenses with the
-		 * need for a memory barrier on the first store to
-		 * ->rcu_tasks_holdout, as it forces the store to happen
-		 * after the beginning of the grace period.
-		 */
-		synchronize_rcu();
-
-		/*
-		 * There were callbacks, so we need to wait for an
-		 * RCU-tasks grace period.  Start off by scanning
-		 * the task list for tasks that are not already
-		 * voluntarily blocked.  Mark these tasks and make
-		 * a list of them in rcu_tasks_holdouts.
-		 */
-		rcu_read_lock();
-		for_each_process_thread(g, t) {
-			if (t != current && READ_ONCE(t->on_rq) &&
-			    !is_idle_task(t)) {
-				get_task_struct(t);
-				t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
-				WRITE_ONCE(t->rcu_tasks_holdout, true);
-				list_add(&t->rcu_tasks_holdout_list,
-					 &rcu_tasks_holdouts);
-			}
-		}
-		rcu_read_unlock();
-
-		/*
-		 * Wait for tasks that are in the process of exiting.
-		 * This does only part of the job, ensuring that all
-		 * tasks that were previously exiting reach the point
-		 * where they have disabled preemption, allowing the
-		 * later synchronize_rcu() to finish the job.
-		 */
-		synchronize_srcu(&tasks_rcu_exit_srcu);
-
-		/*
-		 * Each pass through the following loop scans the list
-		 * of holdout tasks, removing any that are no longer
-		 * holdouts.  When the list is empty, we are done.
-		 */
-		lastreport = jiffies;
-
-		/* Start off with HZ/10 wait and slowly back off to 1 HZ wait*/
-		fract = 10;
-
-		for (;;) {
-			bool firstreport;
-			bool needreport;
-			int rtst;
-			struct task_struct *t1;
-
-			if (list_empty(&rcu_tasks_holdouts))
-				break;
-
-			/* Slowly back off waiting for holdouts */
-			schedule_timeout_interruptible(HZ/fract);
-
-			if (fract > 1)
-				fract--;
-
-			rtst = READ_ONCE(rcu_task_stall_timeout);
-			needreport = rtst > 0 &&
-				     time_after(jiffies, lastreport + rtst);
-			if (needreport)
-				lastreport = jiffies;
-			firstreport = true;
-			WARN_ON(signal_pending(current));
-			list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
-						rcu_tasks_holdout_list) {
-				check_holdout_task(t, needreport, &firstreport);
-				cond_resched();
-			}
-		}
-
-		/*
-		 * Because ->on_rq and ->nvcsw are not guaranteed
-		 * to have a full memory barriers prior to them in the
-		 * schedule() path, memory reordering on other CPUs could
-		 * cause their RCU-tasks read-side critical sections to
-		 * extend past the end of the grace period.  However,
-		 * because these ->nvcsw updates are carried out with
-		 * interrupts disabled, we can use synchronize_rcu()
-		 * to force the needed ordering on all such CPUs.
-		 *
-		 * This synchronize_rcu() also confines all
-		 * ->rcu_tasks_holdout accesses to be within the grace
-		 * period, avoiding the need for memory barriers for
-		 * ->rcu_tasks_holdout accesses.
-		 *
-		 * In addition, this synchronize_rcu() waits for exiting
-		 * tasks to complete their final preempt_disable() region
-		 * of execution, cleaning up after the synchronize_srcu()
-		 * above.
-		 */
-		synchronize_rcu();
-
-		/* Invoke the callbacks. */
-		while (list) {
-			next = list->next;
-			local_bh_disable();
-			list->func(list);
-			local_bh_enable();
-			list = next;
-			cond_resched();
-		}
-		/* Paranoid sleep to keep this from entering a tight loop */
-		schedule_timeout_uninterruptible(HZ/10);
-	}
-}
-
-/* Spawn rcu_tasks_kthread() at core_initcall() time. */
-static int __init rcu_spawn_tasks_kthread(void)
-{
-	struct task_struct *t;
-
-	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
-	if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
-		return 0;
-	smp_mb(); /* Ensure others see full kthread. */
-	WRITE_ONCE(rcu_tasks_kthread_ptr, t);
-	return 0;
-}
-core_initcall(rcu_spawn_tasks_kthread);
-
-/* Do the srcu_read_lock() for the above synchronize_srcu().  */
-void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
-{
-	preempt_disable();
-	current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu);
-	preempt_enable();
-}
-
-/* Do the srcu_read_unlock() for the above synchronize_srcu().  */
-void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
-{
-	preempt_disable();
-	__srcu_read_unlock(&tasks_rcu_exit_srcu, current->rcu_tasks_idx);
-	preempt_enable();
-}
-
-#endif /* #ifdef CONFIG_TASKS_RCU */
-
-#ifndef CONFIG_TINY_RCU
-
-/*
- * Print any non-default Tasks RCU settings.
- */
-static void __init rcu_tasks_bootup_oddness(void)
-{
-#ifdef CONFIG_TASKS_RCU
-	if (rcu_task_stall_timeout != RCU_TASK_STALL_TIMEOUT)
-		pr_info("\tTasks-RCU CPU stall warnings timeout set to %d (rcu_task_stall_timeout).\n", rcu_task_stall_timeout);
-	else
-		pr_info("\tTasks RCU enabled.\n");
-#endif /* #ifdef CONFIG_TASKS_RCU */
-}
-
-#endif /* #ifndef CONFIG_TINY_RCU */
-
 #ifdef CONFIG_PROVE_RCU
 
 /*
@@ -923,6 +559,8 @@ late_initcall(rcu_verify_early_boot_tests);
 void rcu_early_boot_tests(void) {}
 #endif /* CONFIG_PROVE_RCU */
 
+#include "tasks.h"
+
 #ifndef CONFIG_TINY_RCU
 
 /*
-- 
2.9.5


* [PATCH RFC tip/core/rcu 05/16] rcu-tasks: Create struct to hold state information
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                   ` (3 preceding siblings ...)
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 04/16] rcu-tasks: Move Tasks RCU to its own file paulmck
@ 2020-03-12 18:16 ` paulmck
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 06/16] rcu: Reinstate synchronize_rcu_mult() paulmck
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-12 18:16 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit creates an rcu_tasks struct to hold state information for
RCU Tasks.  This is a preparation commit for adding additional flavors
of Tasks RCU, each of which would have its own rcu_tasks struct.
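
The per-flavor state idea can be illustrated with a small userspace
sketch.  It is not the kernel code: cb_head, tasks_state,
DEFINE_TASKS_STATE(), enqueue_cb(), and drain_cbs() are hypothetical
names chosen for illustration, and the lock, wait queue, and kthread are
left out.  It shows only the self-referential static initializer (tail
pointer aimed at the head) and the head/tail list handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct rcu_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

/* Per-flavor state: list head plus a tail pointer for O(1) appends. */
struct tasks_state {
	struct cb_head *cbs_head;
	struct cb_head **cbs_tail;
};

/* Static initializer: the tail pointer starts out referencing the head. */
#define DEFINE_TASKS_STATE(name) \
	static struct tasks_state name = { .cbs_tail = &name.cbs_head }

DEFINE_TASKS_STATE(demo_state);

int demo_invoked;

/* Example callback that just counts its invocations. */
void demo_cb(struct cb_head *head)
{
	(void)head;
	demo_invoked++;
}

/* Append a callback; returns nonzero if the list was previously empty. */
int enqueue_cb(struct tasks_state *sp, struct cb_head *hp,
	       void (*func)(struct cb_head *head))
{
	int needwake = !sp->cbs_head;

	hp->next = NULL;
	hp->func = func;
	*sp->cbs_tail = hp;
	sp->cbs_tail = &hp->next;
	return needwake;
}

/* Detach the whole list, reset the state, then invoke each callback. */
int drain_cbs(struct tasks_state *sp)
{
	struct cb_head *list = sp->cbs_head;
	struct cb_head *next;
	int n = 0;

	sp->cbs_head = NULL;
	sp->cbs_tail = &sp->cbs_head;
	while (list) {
		next = list->next;
		list->func(list);
		list = next;
		n++;
	}
	return n;
}
```

Because all state lives behind one pointer, a second flavor in this
sketch would need only another DEFINE_TASKS_STATE() instance rather than
a second set of globals.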

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 73 ++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 46 insertions(+), 27 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index be8d179..5ccfe0d 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -7,6 +7,30 @@
 
 #ifdef CONFIG_TASKS_RCU
 
+/**
+ * Definition for a Tasks-RCU-like mechanism.
+ * @cbs_head: Head of callback list.
+ * @cbs_tail: Tail pointer for callback list.
+ * @cbs_wq: Wait queue allowing new callback to get kthread's attention.
+ * @cbs_lock: Lock protecting callback list.
+ * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
+ */
+struct rcu_tasks {
+	struct rcu_head *cbs_head;
+	struct rcu_head **cbs_tail;
+	struct wait_queue_head cbs_wq;
+	raw_spinlock_t cbs_lock;
+	struct task_struct *kthread_ptr;
+};
+
+#define DEFINE_RCU_TASKS(name)						\
+static struct rcu_tasks name =						\
+{									\
+	.cbs_tail = &name.cbs_head,					\
+	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(name.cbs_wq),		\
+	.cbs_lock = __RAW_SPIN_LOCK_UNLOCKED(name.cbs_lock),		\
+}
+
 /*
  * Simple variant of RCU whose quiescent states are voluntary context
  * switch, cond_resched_rcu_qs(), user-space execution, and idle.
@@ -18,12 +42,7 @@
  * rates from multiple CPUs.  If this is required, per-CPU callback lists
  * will be needed.
  */
-
-/* Global list of callbacks and associated lock. */
-static struct rcu_head *rcu_tasks_cbs_head;
-static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
-static DECLARE_WAIT_QUEUE_HEAD(rcu_tasks_cbs_wq);
-static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
+DEFINE_RCU_TASKS(rcu_tasks);
 
 /* Track exiting tasks in order to allow them to be waited for. */
 DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
@@ -33,8 +52,6 @@ DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
 module_param(rcu_task_stall_timeout, int, 0644);
 
-static struct task_struct *rcu_tasks_kthread_ptr;
-
 /**
  * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
  * @rhp: structure to be used for queueing the RCU updates.
@@ -57,17 +74,18 @@ void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
 {
 	unsigned long flags;
 	bool needwake;
+	struct rcu_tasks *rtp = &rcu_tasks;
 
 	rhp->next = NULL;
 	rhp->func = func;
-	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
-	needwake = !rcu_tasks_cbs_head;
-	WRITE_ONCE(*rcu_tasks_cbs_tail, rhp);
-	rcu_tasks_cbs_tail = &rhp->next;
-	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+	raw_spin_lock_irqsave(&rtp->cbs_lock, flags);
+	needwake = !rtp->cbs_head;
+	WRITE_ONCE(*rtp->cbs_tail, rhp);
+	rtp->cbs_tail = &rhp->next;
+	raw_spin_unlock_irqrestore(&rtp->cbs_lock, flags);
 	/* We can't create the thread unless interrupts are enabled. */
-	if (needwake && READ_ONCE(rcu_tasks_kthread_ptr))
-		wake_up(&rcu_tasks_cbs_wq);
+	if (needwake && READ_ONCE(rtp->kthread_ptr))
+		wake_up(&rtp->cbs_wq);
 }
 EXPORT_SYMBOL_GPL(call_rcu_tasks);
 
@@ -169,10 +187,12 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 	struct rcu_head *list;
 	struct rcu_head *next;
 	LIST_HEAD(rcu_tasks_holdouts);
+	struct rcu_tasks *rtp = arg;
 	int fract;
 
 	/* Run on housekeeping CPUs by default.  Sysadm can move if desired. */
 	housekeeping_affine(current, HK_FLAG_RCU);
+	WRITE_ONCE(rtp->kthread_ptr, current); // Let GPs start!
 
 	/*
 	 * Each pass through the following loop makes one check for
@@ -183,17 +203,17 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 	for (;;) {
 
 		/* Pick up any new callbacks. */
-		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
-		list = rcu_tasks_cbs_head;
-		rcu_tasks_cbs_head = NULL;
-		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
-		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+		raw_spin_lock_irqsave(&rtp->cbs_lock, flags);
+		list = rtp->cbs_head;
+		rtp->cbs_head = NULL;
+		rtp->cbs_tail = &rtp->cbs_head;
+		raw_spin_unlock_irqrestore(&rtp->cbs_lock, flags);
 
 		/* If there were none, wait a bit and start over. */
 		if (!list) {
-			wait_event_interruptible(rcu_tasks_cbs_wq,
-						 READ_ONCE(rcu_tasks_cbs_head));
-			if (!rcu_tasks_cbs_head) {
+			wait_event_interruptible(rtp->cbs_wq,
+						 READ_ONCE(rtp->cbs_head));
+			if (!rtp->cbs_head) {
 				WARN_ON(signal_pending(current));
 				schedule_timeout_interruptible(HZ/10);
 			}
@@ -211,7 +231,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 		 *
 		 * This synchronize_rcu() also dispenses with the
 		 * need for a memory barrier on the first store to
-		 * ->rcu_tasks_holdout, as it forces the store to happen
+		 * t->rcu_tasks_holdout, as it forces the store to happen
 		 * after the beginning of the grace period.
 		 */
 		synchronize_rcu();
@@ -278,7 +298,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 			firstreport = true;
 			WARN_ON(signal_pending(current));
 			list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
-						rcu_tasks_holdout_list) {
+						 rcu_tasks_holdout_list) {
 				check_holdout_task(t, needreport, &firstreport);
 				cond_resched();
 			}
@@ -325,11 +345,10 @@ static int __init rcu_spawn_tasks_kthread(void)
 {
 	struct task_struct *t;
 
-	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
+	t = kthread_run(rcu_tasks_kthread, &rcu_tasks, "rcu_tasks_kthread");
 	if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
 		return 0;
 	smp_mb(); /* Ensure others see full kthread. */
-	WRITE_ONCE(rcu_tasks_kthread_ptr, t);
 	return 0;
 }
 core_initcall(rcu_spawn_tasks_kthread);
-- 
2.9.5


* [PATCH RFC tip/core/rcu 06/16] rcu: Reinstate synchronize_rcu_mult()
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                   ` (4 preceding siblings ...)
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 05/16] rcu-tasks: Create struct to hold state information paulmck
@ 2020-03-12 18:16 ` paulmck
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 07/16] rcutorture: Add a test for synchronize_rcu_mult() paulmck
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-12 18:16 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

With the advent and likely usage of synchronize_rcu_tasks_rude(), there
is again a need to wait on multiple types of RCU grace periods, for
example, those of call_rcu_tasks() and call_rcu_tasks_rude().  This
commit therefore reinstates synchronize_rcu_mult() in order to allow
these grace periods to be straightforwardly waited on concurrently.
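
The variadic-macro technique behind waiting on several grace-period
types at once can be sketched in userspace as follows.  This is an
illustration only: call_flavor_a(), call_flavor_b(), wait_gp_array(),
and wait_gp_mult() are hypothetical names, and each "grace period" here
completes immediately instead of asynchronously.  The point is how a
list of call functions becomes an array whose entries each get one
completion callback posted before any waiting is done.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*done_fn_t)(void *arg);
typedef void (*call_fn_t)(done_fn_t done, void *arg);

/* Hypothetical flavors; a real one would defer done() to a grace period. */
void call_flavor_a(done_fn_t done, void *arg) { done(arg); }
void call_flavor_b(done_fn_t done, void *arg) { done(arg); }

/* Completion callback: bump the counter that the waiter is watching. */
static void count_done(void *arg)
{
	(*(int *)arg)++;
}

/* Post one completion callback per flavor, then report how many fired. */
int wait_gp_array(int n, call_fn_t *fns)
{
	int completed = 0;
	int i;

	for (i = 0; i < n; i++)
		fns[i](count_done, &completed);
	/* A real implementation would block here until completed == n. */
	return completed;
}

/* Turn the argument list into an array and pass its length along. */
#define wait_gp_mult(...)						\
	wait_gp_array((int)(sizeof((call_fn_t[]){ __VA_ARGS__ }) /	\
			    sizeof(call_fn_t)),				\
		      (call_fn_t[]){ __VA_ARGS__ })
```

In this sketch, wait_gp_mult(call_flavor_a, call_flavor_b) posts to both
flavors before checking either, which is what lets the waits overlap
rather than run back to back.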

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 include/linux/rcupdate_wait.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
index c0578ba..699b938 100644
--- a/include/linux/rcupdate_wait.h
+++ b/include/linux/rcupdate_wait.h
@@ -31,4 +31,23 @@ do {									\
 
 #define wait_rcu_gp(...) _wait_rcu_gp(false, __VA_ARGS__)
 
+/**
+ * synchronize_rcu_mult - Wait concurrently for multiple grace periods
+ * @...: List of call_rcu() functions for different grace periods to wait on
+ *
+ * This macro waits concurrently for multiple types of RCU grace periods.
+ * For example, synchronize_rcu_mult(call_rcu, call_rcu_tasks) would wait
+ * on concurrent RCU and RCU-tasks grace periods.  Waiting on a given SRCU
+ * domain requires you to write a wrapper function for that SRCU domain's
+ * call_srcu() function, with this wrapper supplying the pointer to the
+ * corresponding srcu_struct.
+ *
+ * The first argument tells Tiny RCU's _wait_rcu_gp() not to bother
+ * waiting for RCU.  This is because in Tiny RCU any context in which
+ * synchronize_rcu_mult() can be called is automatically already a full
+ * grace period.
+ */
+#define synchronize_rcu_mult(...) \
+	_wait_rcu_gp(IS_ENABLED(CONFIG_TINY_RCU), __VA_ARGS__)
+
 #endif /* _LINUX_SCHED_RCUPDATE_WAIT_H */
-- 
2.9.5


* [PATCH RFC tip/core/rcu 07/16] rcutorture: Add a test for synchronize_rcu_mult()
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                   ` (5 preceding siblings ...)
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 06/16] rcu: Reinstate synchronize_rcu_mult() paulmck
@ 2020-03-12 18:16 ` paulmck
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 08/16] rcu-tasks: Refactor RCU-tasks to allow variants to be added paulmck
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-12 18:16 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds a crude test for synchronize_rcu_mult().  This is
currently a smoke test rather than a high-quality stress test.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/rcutorture.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index f75d466..1880c5f 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -20,7 +20,7 @@
 #include <linux/err.h>
 #include <linux/spinlock.h>
 #include <linux/smp.h>
-#include <linux/rcupdate.h>
+#include <linux/rcupdate_wait.h>
 #include <linux/interrupt.h>
 #include <linux/sched/signal.h>
 #include <uapi/linux/sched/types.h>
@@ -666,6 +666,11 @@ static void rcu_tasks_torture_deferred_free(struct rcu_torture *p)
 	call_rcu_tasks(&p->rtort_rcu, rcu_torture_cb);
 }
 
+static void synchronize_rcu_mult_test(void)
+{
+	synchronize_rcu_mult(call_rcu_tasks, call_rcu);
+}
+
 static struct rcu_torture_ops tasks_ops = {
 	.ttype		= RCU_TASKS_FLAVOR,
 	.init		= rcu_sync_torture_init,
@@ -675,7 +680,7 @@ static struct rcu_torture_ops tasks_ops = {
 	.get_gp_seq	= rcu_no_completed,
 	.deferred_free	= rcu_tasks_torture_deferred_free,
 	.sync		= synchronize_rcu_tasks,
-	.exp_sync	= synchronize_rcu_tasks,
+	.exp_sync	= synchronize_rcu_mult_test,
 	.call		= call_rcu_tasks,
 	.cb_barrier	= rcu_barrier_tasks,
 	.fqs		= NULL,
-- 
2.9.5


* [PATCH RFC tip/core/rcu 08/16] rcu-tasks: Refactor RCU-tasks to allow variants to be added
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                   ` (6 preceding siblings ...)
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 07/16] rcutorture: Add a test for synchronize_rcu_mult() paulmck
@ 2020-03-12 18:16 ` paulmck
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant paulmck
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-12 18:16 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit splits out generic processing from RCU-tasks-specific
processing in order to allow additional flavors to be added.  It also
adds a def_bool TASKS_RCU_GENERIC to enable the common RCU-tasks
infrastructure code.

This is primarily, but not entirely, a code-movement commit.
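
The generic/flavor-specific split can be pictured with the following
userspace sketch.  All names in it (struct flavor, DEFINE_FLAVOR(),
flavor_call_generic(), flavor_kthread_pass(), and the classic/other
flavors) are hypothetical, and callbacks are reduced to a counter.  It
shows only the shape of the refactoring: shared code handles queueing
and invocation, and reaches flavor-specific behavior through a
grace-period function pointer stored in the per-flavor structure.

```c
#include <assert.h>
#include <stddef.h>

struct flavor;
typedef void (*gp_func_t)(struct flavor *fp);

/* Per-flavor state, including the flavor-specific grace-period hook. */
struct flavor {
	int pending;		/* callbacks queued and awaiting a GP */
	int invoked;		/* callbacks invoked so far */
	int gps;		/* grace periods waited for so far */
	gp_func_t gp_func;	/* this flavor's grace-period wait */
};

#define DEFINE_FLAVOR(name, gp) \
	struct flavor name = { .gp_func = gp }

/* Generic enqueue shared by every flavor. */
void flavor_call_generic(struct flavor *fp)
{
	fp->pending++;
}

/* One pass of the generic loop: grab callbacks, wait for a GP, invoke. */
int flavor_kthread_pass(struct flavor *fp)
{
	int n = fp->pending;

	if (!n)
		return 0;	/* Nothing queued, so no grace period. */
	fp->pending = 0;
	fp->gp_func(fp);	/* Flavor-specific part. */
	fp->invoked += n;
	return n;
}

/* Two hypothetical flavors differing only in how they wait. */
int other_gp_calls;

void classic_wait_gp(struct flavor *fp) { fp->gps++; }
void other_wait_gp(struct flavor *fp) { other_gp_calls++; fp->gps++; }

DEFINE_FLAVOR(classic, classic_wait_gp);
DEFINE_FLAVOR(other, other_wait_gp);

/* Thin flavor-specific wrappers around the generic enqueue. */
void call_classic(void) { flavor_call_generic(&classic); }
void call_other(void) { flavor_call_generic(&other); }
```

With this layout, adding a flavor in the sketch means writing one wait
function, one DEFINE_FLAVOR() line, and one wrapper, leaving the generic
pass untouched.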

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 include/linux/rcupdate.h |   6 +-
 kernel/rcu/Kconfig       |  10 +-
 kernel/rcu/tasks.h       | 491 +++++++++++++++++++++++++----------------------
 kernel/rcu/update.c      |   4 +
 4 files changed, 272 insertions(+), 239 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 2678a37..5523145 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -129,7 +129,7 @@ static inline void rcu_init_nohz(void) { }
  * Note a quasi-voluntary context switch for RCU-tasks's benefit.
  * This is a macro rather than an inline function to avoid #include hell.
  */
-#ifdef CONFIG_TASKS_RCU
+#ifdef CONFIG_TASKS_RCU_GENERIC
 #define rcu_tasks_qs(t) \
 	do { \
 		if (READ_ONCE((t)->rcu_tasks_holdout)) \
@@ -140,14 +140,14 @@ void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
 void synchronize_rcu_tasks(void);
 void exit_tasks_rcu_start(void);
 void exit_tasks_rcu_finish(void);
-#else /* #ifdef CONFIG_TASKS_RCU */
+#else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
 #define rcu_tasks_qs(t)	do { } while (0)
 #define rcu_note_voluntary_context_switch(t) do { } while (0)
 #define call_rcu_tasks call_rcu
 #define synchronize_rcu_tasks synchronize_rcu
 static inline void exit_tasks_rcu_start(void) { }
 static inline void exit_tasks_rcu_finish(void) { }
-#endif /* #else #ifdef CONFIG_TASKS_RCU */
+#endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
 
 /**
  * cond_resched_tasks_rcu_qs - Report potential quiescent states to RCU
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 1cc940f..38475d0 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -70,13 +70,19 @@ config TREE_SRCU
 	help
 	  This option selects the full-fledged version of SRCU.
 
+config TASKS_RCU_GENERIC
+	def_bool TASKS_RCU
+	select SRCU
+	help
+	  This option enables generic infrastructure code supporting
+	  task-based RCU implementations.  Not for manual selection.
+
 config TASKS_RCU
 	def_bool PREEMPTION
-	select SRCU
 	help
 	  This option enables a task-based RCU implementation that uses
 	  only voluntary context switch (not preemption!), idle, and
-	  user-mode execution as quiescent states.
+	  user-mode execution as quiescent states.  Not for manual selection.
 
 config RCU_STALL_COMMON
 	def_bool TREE_RCU
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 5ccfe0d..d77921e 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -5,7 +5,13 @@
  * Copyright (C) 2020 Paul E. McKenney
  */
 
-#ifdef CONFIG_TASKS_RCU
+
+////////////////////////////////////////////////////////////////////////
+//
+// Generic data structures.
+
+struct rcu_tasks;
+typedef void (*rcu_tasks_gp_func_t)(struct rcu_tasks *rtp);
 
 /**
  * Definition for a Tasks-RCU-like mechanism.
@@ -14,6 +20,8 @@
  * @cbs_wq: Wait queue allowing new callback to get kthread's attention.
  * @cbs_lock: Lock protecting callback list.
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
+ * @gp_func: This flavor's grace-period-wait function.
+ * @call_func: This flavor's call_rcu()-equivalent function.
  */
 struct rcu_tasks {
 	struct rcu_head *cbs_head;
@@ -21,29 +29,20 @@ struct rcu_tasks {
 	struct wait_queue_head cbs_wq;
 	raw_spinlock_t cbs_lock;
 	struct task_struct *kthread_ptr;
+	rcu_tasks_gp_func_t gp_func;
+	call_rcu_func_t call_func;
 };
 
-#define DEFINE_RCU_TASKS(name)						\
+#define DEFINE_RCU_TASKS(name, gp, call)				\
 static struct rcu_tasks name =						\
 {									\
 	.cbs_tail = &name.cbs_head,					\
 	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(name.cbs_wq),		\
 	.cbs_lock = __RAW_SPIN_LOCK_UNLOCKED(name.cbs_lock),		\
+	.gp_func = gp,							\
+	.call_func = call,						\
 }
 
-/*
- * Simple variant of RCU whose quiescent states are voluntary context
- * switch, cond_resched_rcu_qs(), user-space execution, and idle.
- * As such, grace periods can take one good long time.  There are no
- * read-side primitives similar to rcu_read_lock() and rcu_read_unlock()
- * because this implementation is intended to get the system into a safe
- * state for some of the manipulations involved in tracing and the like.
- * Finally, this implementation does not support high call_rcu_tasks()
- * rates from multiple CPUs.  If this is required, per-CPU callback lists
- * will be needed.
- */
-DEFINE_RCU_TASKS(rcu_tasks);
-
 /* Track exiting tasks in order to allow them to be waited for. */
 DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 
@@ -52,29 +51,16 @@ DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
 module_param(rcu_task_stall_timeout, int, 0644);
 
-/**
- * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
- * @rhp: structure to be used for queueing the RCU updates.
- * @func: actual callback function to be invoked after the grace period
- *
- * The callback function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. call_rcu_tasks() assumes
- * that the read-side critical sections end at a voluntary context
- * switch (not a preemption!), cond_resched_rcu_qs(), entry into idle,
- * or transition to usermode execution.  As such, there are no read-side
- * primitives analogous to rcu_read_lock() and rcu_read_unlock() because
- * this primitive is intended to determine that all tasks have passed
- * through a safe state, not so much for data-strcuture synchronization.
- *
- * See the description of call_rcu() for more detailed information on
- * memory ordering guarantees.
- */
-void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
+////////////////////////////////////////////////////////////////////////
+//
+// Generic code.
+
+// Enqueue a callback for the specified flavor of Tasks RCU.
+static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
+				   struct rcu_tasks *rtp)
 {
 	unsigned long flags;
 	bool needwake;
-	struct rcu_tasks *rtp = &rcu_tasks;
 
 	rhp->next = NULL;
 	rhp->func = func;
@@ -87,108 +73,25 @@ void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
 	if (needwake && READ_ONCE(rtp->kthread_ptr))
 		wake_up(&rtp->cbs_wq);
 }
-EXPORT_SYMBOL_GPL(call_rcu_tasks);
 
-/**
- * synchronize_rcu_tasks - wait until an rcu-tasks grace period has elapsed.
- *
- * Control will return to the caller some time after a full rcu-tasks
- * grace period has elapsed, in other words after all currently
- * executing rcu-tasks read-side critical sections have elapsed.  These
- * read-side critical sections are delimited by calls to schedule(),
- * cond_resched_tasks_rcu_qs(), idle execution, userspace execution, calls
- * to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().
- *
- * This is a very specialized primitive, intended only for a few uses in
- * tracing and other situations requiring manipulation of function
- * preambles and profiling hooks.  The synchronize_rcu_tasks() function
- * is not (yet) intended for heavy use from multiple CPUs.
- *
- * Note that this guarantee implies further memory-ordering guarantees.
- * On systems with more than one CPU, when synchronize_rcu_tasks() returns,
- * each CPU is guaranteed to have executed a full memory barrier since the
- * end of its last RCU-tasks read-side critical section whose beginning
- * preceded the call to synchronize_rcu_tasks().  In addition, each CPU
- * having an RCU-tasks read-side critical section that extends beyond
- * the return from synchronize_rcu_tasks() is guaranteed to have executed
- * a full memory barrier after the beginning of synchronize_rcu_tasks()
- * and before the beginning of that RCU-tasks read-side critical section.
- * Note that these guarantees include CPUs that are offline, idle, or
- * executing in user mode, as well as CPUs that are executing in the kernel.
- *
- * Furthermore, if CPU A invoked synchronize_rcu_tasks(), which returned
- * to its caller on CPU B, then both CPU A and CPU B are guaranteed
- * to have executed a full memory barrier during the execution of
- * synchronize_rcu_tasks() -- even if CPU A and CPU B are the same CPU
- * (but again only if the system has more than one CPU).
- */
-void synchronize_rcu_tasks(void)
+// Wait for a grace period for the specified flavor of Tasks RCU.
+static void synchronize_rcu_tasks_generic(struct rcu_tasks *rtp)
 {
 	/* Complain if the scheduler has not started.  */
 	RCU_LOCKDEP_WARN(rcu_scheduler_active == RCU_SCHEDULER_INACTIVE,
 			 "synchronize_rcu_tasks called too soon");
 
 	/* Wait for the grace period. */
-	wait_rcu_gp(call_rcu_tasks);
-}
-EXPORT_SYMBOL_GPL(synchronize_rcu_tasks);
-
-/**
- * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
- *
- * Although the current implementation is guaranteed to wait, it is not
- * obligated to, for example, if there are no pending callbacks.
- */
-void rcu_barrier_tasks(void)
-{
-	/* There is only one callback queue, so this is easy.  ;-) */
-	synchronize_rcu_tasks();
-}
-EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
-
-/* See if tasks are still holding out, complain if so. */
-static void check_holdout_task(struct task_struct *t,
-			       bool needreport, bool *firstreport)
-{
-	int cpu;
-
-	if (!READ_ONCE(t->rcu_tasks_holdout) ||
-	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||
-	    !READ_ONCE(t->on_rq) ||
-	    (IS_ENABLED(CONFIG_NO_HZ_FULL) &&
-	     !is_idle_task(t) && t->rcu_tasks_idle_cpu >= 0)) {
-		WRITE_ONCE(t->rcu_tasks_holdout, false);
-		list_del_init(&t->rcu_tasks_holdout_list);
-		put_task_struct(t);
-		return;
-	}
-	rcu_request_urgent_qs_task(t);
-	if (!needreport)
-		return;
-	if (*firstreport) {
-		pr_err("INFO: rcu_tasks detected stalls on tasks:\n");
-		*firstreport = false;
-	}
-	cpu = task_cpu(t);
-	pr_alert("%p: %c%c nvcsw: %lu/%lu holdout: %d idle_cpu: %d/%d\n",
-		 t, ".I"[is_idle_task(t)],
-		 "N."[cpu < 0 || !tick_nohz_full_cpu(cpu)],
-		 t->rcu_tasks_nvcsw, t->nvcsw, t->rcu_tasks_holdout,
-		 t->rcu_tasks_idle_cpu, cpu);
-	sched_show_task(t);
+	wait_rcu_gp(rtp->call_func);
 }
 
 /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
 static int __noreturn rcu_tasks_kthread(void *arg)
 {
 	unsigned long flags;
-	struct task_struct *g, *t;
-	unsigned long lastreport;
 	struct rcu_head *list;
 	struct rcu_head *next;
-	LIST_HEAD(rcu_tasks_holdouts);
 	struct rcu_tasks *rtp = arg;
-	int fract;
 
 	/* Run on housekeeping CPUs by default.  Sysadm can move if desired. */
 	housekeeping_affine(current, HK_FLAG_RCU);
@@ -220,111 +123,8 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 			continue;
 		}
 
-		/*
-		 * Wait for all pre-existing t->on_rq and t->nvcsw
-		 * transitions to complete.  Invoking synchronize_rcu()
-		 * suffices because all these transitions occur with
-		 * interrupts disabled.  Without this synchronize_rcu(),
-		 * a read-side critical section that started before the
-		 * grace period might be incorrectly seen as having started
-		 * after the grace period.
-		 *
-		 * This synchronize_rcu() also dispenses with the
-		 * need for a memory barrier on the first store to
-		 * t->rcu_tasks_holdout, as it forces the store to happen
-		 * after the beginning of the grace period.
-		 */
-		synchronize_rcu();
-
-		/*
-		 * There were callbacks, so we need to wait for an
-		 * RCU-tasks grace period.  Start off by scanning
-		 * the task list for tasks that are not already
-		 * voluntarily blocked.  Mark these tasks and make
-		 * a list of them in rcu_tasks_holdouts.
-		 */
-		rcu_read_lock();
-		for_each_process_thread(g, t) {
-			if (t != current && READ_ONCE(t->on_rq) &&
-			    !is_idle_task(t)) {
-				get_task_struct(t);
-				t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
-				WRITE_ONCE(t->rcu_tasks_holdout, true);
-				list_add(&t->rcu_tasks_holdout_list,
-					 &rcu_tasks_holdouts);
-			}
-		}
-		rcu_read_unlock();
-
-		/*
-		 * Wait for tasks that are in the process of exiting.
-		 * This does only part of the job, ensuring that all
-		 * tasks that were previously exiting reach the point
-		 * where they have disabled preemption, allowing the
-		 * later synchronize_rcu() to finish the job.
-		 */
-		synchronize_srcu(&tasks_rcu_exit_srcu);
-
-		/*
-		 * Each pass through the following loop scans the list
-		 * of holdout tasks, removing any that are no longer
-		 * holdouts.  When the list is empty, we are done.
-		 */
-		lastreport = jiffies;
-
-		/* Start off with HZ/10 wait and slowly back off to 1 HZ wait*/
-		fract = 10;
-
-		for (;;) {
-			bool firstreport;
-			bool needreport;
-			int rtst;
-			struct task_struct *t1;
-
-			if (list_empty(&rcu_tasks_holdouts))
-				break;
-
-			/* Slowly back off waiting for holdouts */
-			schedule_timeout_interruptible(HZ/fract);
-
-			if (fract > 1)
-				fract--;
-
-			rtst = READ_ONCE(rcu_task_stall_timeout);
-			needreport = rtst > 0 &&
-				     time_after(jiffies, lastreport + rtst);
-			if (needreport)
-				lastreport = jiffies;
-			firstreport = true;
-			WARN_ON(signal_pending(current));
-			list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
-						 rcu_tasks_holdout_list) {
-				check_holdout_task(t, needreport, &firstreport);
-				cond_resched();
-			}
-		}
-
-		/*
-		 * Because ->on_rq and ->nvcsw are not guaranteed
-		 * to have a full memory barriers prior to them in the
-		 * schedule() path, memory reordering on other CPUs could
-		 * cause their RCU-tasks read-side critical sections to
-		 * extend past the end of the grace period.  However,
-		 * because these ->nvcsw updates are carried out with
-		 * interrupts disabled, we can use synchronize_rcu()
-		 * to force the needed ordering on all such CPUs.
-		 *
-		 * This synchronize_rcu() also confines all
-		 * ->rcu_tasks_holdout accesses to be within the grace
-		 * period, avoiding the need for memory barriers for
-		 * ->rcu_tasks_holdout accesses.
-		 *
-		 * In addition, this synchronize_rcu() waits for exiting
-		 * tasks to complete their final preempt_disable() region
-		 * of execution, cleaning up after the synchronize_srcu()
-		 * above.
-		 */
-		synchronize_rcu();
+		// Wait for one grace period.
+		rtp->gp_func(rtp);
 
 		/* Invoke the callbacks. */
 		while (list) {
@@ -340,18 +140,16 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 	}
 }
 
-/* Spawn rcu_tasks_kthread() at core_initcall() time. */
-static int __init rcu_spawn_tasks_kthread(void)
+/* Spawn RCU-tasks grace-period kthread, e.g., at core_initcall() time. */
+static void __init rcu_spawn_tasks_kthread_generic(struct rcu_tasks *rtp)
 {
 	struct task_struct *t;
 
-	t = kthread_run(rcu_tasks_kthread, &rcu_tasks, "rcu_tasks_kthread");
+	t = kthread_run(rcu_tasks_kthread, rtp, "rcu_tasks_kthread");
 	if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
-		return 0;
+		return;
 	smp_mb(); /* Ensure others see full kthread. */
-	return 0;
 }
-core_initcall(rcu_spawn_tasks_kthread);
 
 /* Do the srcu_read_lock() for the above synchronize_srcu().  */
 void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
@@ -369,8 +167,6 @@ void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
 	preempt_enable();
 }
 
-#endif /* #ifdef CONFIG_TASKS_RCU */
-
 #ifndef CONFIG_TINY_RCU
 
 /*
@@ -387,3 +183,230 @@ static void __init rcu_tasks_bootup_oddness(void)
 }
 
 #endif /* #ifndef CONFIG_TINY_RCU */
+
+#ifdef CONFIG_TASKS_RCU
+
+////////////////////////////////////////////////////////////////////////
+//
+// Simple variant of RCU whose quiescent states are voluntary context
+// switch, cond_resched_tasks_rcu_qs(), user-space execution, and idle.
+// As such, grace periods can take one good long time.  There are no
+// read-side primitives similar to rcu_read_lock() and rcu_read_unlock()
+// because this implementation is intended to get the system into a safe
+// state for some of the manipulations involved in tracing and the like.
+// Finally, this implementation does not support high call_rcu_tasks()
+// rates from multiple CPUs.  If this is required, per-CPU callback lists
+// will be needed.
+
+/* See if tasks are still holding out, complain if so. */
+static void check_holdout_task(struct task_struct *t,
+			       bool needreport, bool *firstreport)
+{
+	int cpu;
+
+	if (!READ_ONCE(t->rcu_tasks_holdout) ||
+	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||
+	    !READ_ONCE(t->on_rq) ||
+	    (IS_ENABLED(CONFIG_NO_HZ_FULL) &&
+	     !is_idle_task(t) && t->rcu_tasks_idle_cpu >= 0)) {
+		WRITE_ONCE(t->rcu_tasks_holdout, false);
+		list_del_init(&t->rcu_tasks_holdout_list);
+		put_task_struct(t);
+		return;
+	}
+	rcu_request_urgent_qs_task(t);
+	if (!needreport)
+		return;
+	if (*firstreport) {
+		pr_err("INFO: rcu_tasks detected stalls on tasks:\n");
+		*firstreport = false;
+	}
+	cpu = task_cpu(t);
+	pr_alert("%p: %c%c nvcsw: %lu/%lu holdout: %d idle_cpu: %d/%d\n",
+		 t, ".I"[is_idle_task(t)],
+		 "N."[cpu < 0 || !tick_nohz_full_cpu(cpu)],
+		 t->rcu_tasks_nvcsw, t->nvcsw, t->rcu_tasks_holdout,
+		 t->rcu_tasks_idle_cpu, cpu);
+	sched_show_task(t);
+}
+
+/* Wait for one RCU-tasks grace period. */
+static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
+{
+	struct task_struct *g, *t;
+	unsigned long lastreport;
+	LIST_HEAD(rcu_tasks_holdouts);
+	int fract;
+
+	/*
+	 * Wait for all pre-existing t->on_rq and t->nvcsw transitions
+	 * to complete.  Invoking synchronize_rcu() suffices because all
+	 * these transitions occur with interrupts disabled.  Without this
+	 * synchronize_rcu(), a read-side critical section that started
+	 * before the grace period might be incorrectly seen as having
+	 * started after the grace period.
+	 *
+	 * This synchronize_rcu() also dispenses with the need for a
+	 * memory barrier on the first store to t->rcu_tasks_holdout,
+	 * as it forces the store to happen after the beginning of the
+	 * grace period.
+	 */
+	synchronize_rcu();
+
+	/*
+	 * There were callbacks, so we need to wait for an RCU-tasks
+	 * grace period.  Start off by scanning the task list for tasks
+	 * that are not already voluntarily blocked.  Mark these tasks
+	 * and make a list of them in rcu_tasks_holdouts.
+	 */
+	rcu_read_lock();
+	for_each_process_thread(g, t) {
+		if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
+			get_task_struct(t);
+			t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
+			WRITE_ONCE(t->rcu_tasks_holdout, true);
+			list_add(&t->rcu_tasks_holdout_list,
+				 &rcu_tasks_holdouts);
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * Wait for tasks that are in the process of exiting.  This
+	 * does only part of the job, ensuring that all tasks that were
+	 * previously exiting reach the point where they have disabled
+	 * preemption, allowing the later synchronize_rcu() to finish
+	 * the job.
+	 */
+	synchronize_srcu(&tasks_rcu_exit_srcu);
+
+	/*
+	 * Each pass through the following loop scans the list of holdout
+	 * tasks, removing any that are no longer holdouts.  When the list
+	 * is empty, we are done.
+	 */
+	lastreport = jiffies;
+
+	/* Start off with HZ/10 wait and slowly back off to 1 HZ wait. */
+	fract = 10;
+
+	for (;;) {
+		bool firstreport;
+		bool needreport;
+		int rtst;
+		struct task_struct *t1;
+
+		if (list_empty(&rcu_tasks_holdouts))
+			break;
+
+		/* Slowly back off waiting for holdouts */
+		schedule_timeout_interruptible(HZ/fract);
+
+		if (fract > 1)
+			fract--;
+
+		rtst = READ_ONCE(rcu_task_stall_timeout);
+		needreport = rtst > 0 && time_after(jiffies, lastreport + rtst);
+		if (needreport)
+			lastreport = jiffies;
+		firstreport = true;
+		WARN_ON(signal_pending(current));
+		list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
+					 rcu_tasks_holdout_list) {
+			check_holdout_task(t, needreport, &firstreport);
+			cond_resched();
+		}
+	}
+
+	/*
+	 * Because ->on_rq and ->nvcsw are not guaranteed to have full
+	 * memory barriers prior to them in the schedule() path, memory
+	 * reordering on other CPUs could cause their RCU-tasks read-side
+	 * critical sections to extend past the end of the grace period.
+	 * However, because these ->nvcsw updates are carried out with
+	 * interrupts disabled, we can use synchronize_rcu() to force the
+	 * needed ordering on all such CPUs.
+	 *
+	 * This synchronize_rcu() also confines all ->rcu_tasks_holdout
+	 * accesses to be within the grace period, avoiding the need for
+	 * memory barriers for ->rcu_tasks_holdout accesses.
+	 *
+	 * In addition, this synchronize_rcu() waits for exiting tasks
+	 * to complete their final preempt_disable() region of execution,
+	 * cleaning up after the synchronize_srcu() above.
+	 */
+	synchronize_rcu();
+}
+
+void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
+DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks);
+
+/**
+ * call_rcu_tasks() - Queue a callback for a task-based grace period
+ * @rhp: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. call_rcu_tasks() assumes
+ * that the read-side critical sections end at a voluntary context
+ * switch (not a preemption!), cond_resched_tasks_rcu_qs(), entry into idle,
+ * or transition to usermode execution.  As such, there are no read-side
+ * primitives analogous to rcu_read_lock() and rcu_read_unlock() because
+ * this primitive is intended to determine that all tasks have passed
+ * through a safe state, not so much for data-structure synchronization.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
+ */
+void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
+{
+	call_rcu_tasks_generic(rhp, func, &rcu_tasks);
+}
+EXPORT_SYMBOL_GPL(call_rcu_tasks);
+
+/**
+ * synchronize_rcu_tasks - wait until an rcu-tasks grace period has elapsed.
+ *
+ * Control will return to the caller some time after a full rcu-tasks
+ * grace period has elapsed, in other words after all currently
+ * executing rcu-tasks read-side critical sections have elapsed.  These
+ * read-side critical sections are delimited by calls to schedule(),
+ * cond_resched_tasks_rcu_qs(), idle execution, userspace execution, calls
+ * to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().
+ *
+ * This is a very specialized primitive, intended only for a few uses in
+ * tracing and other situations requiring manipulation of function
+ * preambles and profiling hooks.  The synchronize_rcu_tasks() function
+ * is not (yet) intended for heavy use from multiple CPUs.
+ *
+ * See the description of synchronize_rcu() for more detailed information
+ * on memory ordering guarantees.
+ */
+void synchronize_rcu_tasks(void)
+{
+	synchronize_rcu_tasks_generic(&rcu_tasks);
+}
+EXPORT_SYMBOL_GPL(synchronize_rcu_tasks);
+
+/**
+ * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
+ *
+ * Although the current implementation is guaranteed to wait, it is not
+ * obligated to, for example, if there are no pending callbacks.
+ */
+void rcu_barrier_tasks(void)
+{
+	/* There is only one callback queue, so this is easy.  ;-) */
+	synchronize_rcu_tasks();
+}
+EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
+
+static int __init rcu_spawn_tasks_kthread(void)
+{
+	rcu_spawn_tasks_kthread_generic(&rcu_tasks);
+	return 0;
+}
+core_initcall(rcu_spawn_tasks_kthread);
+
+#endif /* #ifdef CONFIG_TASKS_RCU */
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 1fbeb99..b1fa519 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -559,7 +559,11 @@ late_initcall(rcu_verify_early_boot_tests);
 void rcu_early_boot_tests(void) {}
 #endif /* CONFIG_PROVE_RCU */
 
+#ifdef CONFIG_TASKS_RCU_GENERIC
 #include "tasks.h"
+#else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
+static inline void rcu_tasks_bootup_oddness(void) {}
+#endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
 
 #ifndef CONFIG_TINY_RCU
 
-- 
2.9.5



* [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                   ` (7 preceding siblings ...)
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 08/16] rcu-tasks: Refactor RCU-tasks to allow variants to be added paulmck
@ 2020-03-12 18:16 ` paulmck
  2020-03-16 19:47   ` Joel Fernandes
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 10/16] rcutorture: Add torture tests for RCU Tasks Rude paulmck
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 171+ messages in thread
From: paulmck @ 2020-03-12 18:16 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds a "rude" variant of RCU-tasks that has as quiescent
states schedule(), cond_resched_tasks_rcu_qs(), userspace execution,
and (in theory, anyway) cond_resched().  Updates force an IPI and a
context switch on each online CPU.  This variant is useful in some
situations in tracing.
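
For readers who want to see the shape of the flavor dispatch outside the
kernel, here is a minimal userspace sketch.  It is not the patch's code:
struct flavor, flavor_call(), flavor_process(), rude_wait_gp(), and
count_cb() are hypothetical names, there is no locking or kthread, and
the stub grace-period function only counts calls where the real one
invokes schedule_on_each_cpu().  What it does show is that only the
gp_func step differs between flavors.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: these types mimic, but are not, the kernel's. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

struct flavor;
typedef void (*gp_func_t)(struct flavor *fp);

struct flavor {
	struct cb_head *cbs_head;
	struct cb_head **cbs_tail;
	gp_func_t gp_func;	/* Flavor-specific grace-period wait. */
	int gp_count;		/* Hypothetical: number of waits so far. */
};

/* Generic enqueue shared by all flavors (no locking in this model). */
static void flavor_call(struct flavor *fp, struct cb_head *head,
			void (*func)(struct cb_head *head))
{
	head->next = NULL;
	head->func = func;
	*fp->cbs_tail = head;
	fp->cbs_tail = &head->next;
}

/* One pass of the generic loop: detach the list, wait, then invoke. */
static int flavor_process(struct flavor *fp)
{
	struct cb_head *list = fp->cbs_head;
	struct cb_head *next;
	int n = 0;

	assert(fp->gp_func != NULL);
	if (!list)
		return 0;
	fp->cbs_head = NULL;
	fp->cbs_tail = &fp->cbs_head;
	fp->gp_func(fp);	/* Only this step differs between flavors. */
	while (list) {
		next = list->next;	/* Callback may reuse the element. */
		list->func(list);
		list = next;
		n++;
	}
	return n;		/* Number of callbacks invoked. */
}

/* Stand-in for the rude flavor's wait: a stub that only counts calls. */
static void rude_wait_gp(struct flavor *fp)
{
	fp->gp_count++;
}

static struct flavor rude = {
	.cbs_tail = &rude.cbs_head,
	.gp_func = rude_wait_gp,
};

/* Hypothetical callback that records how many times it has run. */
static int invoked;
static void count_cb(struct cb_head *head)
{
	(void)head;
	invoked++;
}
```

Queueing two callbacks with flavor_call(&rude, ...) and then calling
flavor_process(&rude) once waits for a single (stub) grace period and
invokes both callbacks, which is the batching that the real kthread
provides.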

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 include/linux/rcupdate.h |  3 ++
 kernel/rcu/Kconfig       | 12 +++++-
 kernel/rcu/tasks.h       | 99 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 113 insertions(+), 1 deletion(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 5523145..2be97a8 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -37,6 +37,7 @@
 /* Exported common interfaces */
 void call_rcu(struct rcu_head *head, rcu_callback_t func);
 void rcu_barrier_tasks(void);
+void rcu_barrier_tasks_rude(void);
 void synchronize_rcu(void);
 
 #ifdef CONFIG_PREEMPT_RCU
@@ -138,6 +139,8 @@ static inline void rcu_init_nohz(void) { }
 #define rcu_note_voluntary_context_switch(t) rcu_tasks_qs(t)
 void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
 void synchronize_rcu_tasks(void);
+void call_rcu_tasks_rude(struct rcu_head *head, rcu_callback_t func);
+void synchronize_rcu_tasks_rude(void);
 void exit_tasks_rcu_start(void);
 void exit_tasks_rcu_finish(void);
 #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 38475d0..0d43ec1 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -71,7 +71,7 @@ config TREE_SRCU
 	  This option selects the full-fledged version of SRCU.
 
 config TASKS_RCU_GENERIC
-	def_bool TASKS_RCU
+	def_bool TASKS_RCU || TASKS_RUDE_RCU
 	select SRCU
 	help
 	  This option enables generic infrastructure code supporting
@@ -84,6 +84,16 @@ config TASKS_RCU
 	  only voluntary context switch (not preemption!), idle, and
 	  user-mode execution as quiescent states.  Not for manual selection.
 
+config TASKS_RUDE_RCU
+	def_bool 0
+	default n
+	help
+	  This option enables a task-based RCU implementation that uses
+	  only context switch (including preemption) and user-mode
+	  execution as quiescent states.  It forces IPIs and context
+	  switches on all online CPUs, including idle ones, so use
+	  with caution.  Not for manual selection.
+
 config RCU_STALL_COMMON
 	def_bool TREE_RCU
 	help
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index d77921e..1d25c50 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -180,6 +180,9 @@ static void __init rcu_tasks_bootup_oddness(void)
 	else
 		pr_info("\tTasks RCU enabled.\n");
 #endif /* #ifdef CONFIG_TASKS_RCU */
+#ifdef CONFIG_TASKS_RUDE_RCU
+	pr_info("\tRude variant of Tasks RCU enabled.\n");
+#endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
 }
 
 #endif /* #ifndef CONFIG_TINY_RCU */
@@ -410,3 +413,99 @@ static int __init rcu_spawn_tasks_kthread(void)
 core_initcall(rcu_spawn_tasks_kthread);
 
 #endif /* #ifdef CONFIG_TASKS_RCU */
+
+#ifdef CONFIG_TASKS_RUDE_RCU
+
+////////////////////////////////////////////////////////////////////////
+//
+// "Rude" variant of Tasks RCU, inspired by Steve Rostedt's trick of
+// passing an empty function to schedule_on_each_cpu().  This approach
+// provides an asynchronous call_rcu_tasks_rude() API and batching of
+// concurrent calls to the synchronous synchronize_rcu_tasks_rude() API.
+// This sends IPIs far and wide and induces otherwise unnecessary context
+// switches on all online CPUs, whether idle or not.
+
+// Empty function to allow workqueues to force a context switch.
+static void rcu_tasks_be_rude(struct work_struct *work)
+{
+}
+
+// Wait for one rude RCU-tasks grace period.
+static void rcu_tasks_rude_wait_gp(struct rcu_tasks *rtp)
+{
+	schedule_on_each_cpu(rcu_tasks_be_rude);
+}
+EXPORT_SYMBOL_GPL(rcu_tasks_rude_wait_gp);
+
+void call_rcu_tasks_rude(struct rcu_head *rhp, rcu_callback_t func);
+DEFINE_RCU_TASKS(rcu_tasks_rude, rcu_tasks_rude_wait_gp, call_rcu_tasks_rude);
+
+/**
+ * call_rcu_tasks_rude() - Queue a callback for a rude task-based grace period
+ * @rhp: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. call_rcu_tasks_rude()
+ * assumes that the read-side critical sections end at context switch,
+ * cond_resched_tasks_rcu_qs(), or transition to usermode execution.  As such,
+ * there are no read-side primitives analogous to rcu_read_lock() and
+ * rcu_read_unlock() because this primitive is intended to determine
+ * that all tasks have passed through a safe state, not so much for
+ * data-structure synchronization.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
+ */
+void call_rcu_tasks_rude(struct rcu_head *rhp, rcu_callback_t func)
+{
+	call_rcu_tasks_generic(rhp, func, &rcu_tasks_rude);
+}
+EXPORT_SYMBOL_GPL(call_rcu_tasks_rude);
+
+/**
+ * synchronize_rcu_tasks_rude - wait for a rude rcu-tasks grace period
+ *
+ * Control will return to the caller some time after a rude rcu-tasks
+ * grace period has elapsed, in other words after all currently
+ * executing rcu-tasks read-side critical sections have elapsed.  These
+ * read-side critical sections are delimited by calls to schedule(),
+ * cond_resched_tasks_rcu_qs(), userspace execution, and (in theory,
+ * anyway) cond_resched().
+ *
+ * This is a very specialized primitive, intended only for a few uses in
+ * tracing and other situations requiring manipulation of function preambles
+ * and profiling hooks.  The synchronize_rcu_tasks_rude() function is not
+ * (yet) intended for heavy use from multiple CPUs.
+ *
+ * See the description of synchronize_rcu() for more detailed information
+ * on memory ordering guarantees.
+ */
+void synchronize_rcu_tasks_rude(void)
+{
+	synchronize_rcu_tasks_generic(&rcu_tasks_rude);
+}
+EXPORT_SYMBOL_GPL(synchronize_rcu_tasks_rude);
+
+/**
+ * rcu_barrier_tasks_rude - Wait for in-flight call_rcu_tasks_rude() callbacks.
+ *
+ * Although the current implementation is guaranteed to wait, it is not
+ * obligated to, for example, if there are no pending callbacks.
+ */
+void rcu_barrier_tasks_rude(void)
+{
+	/* There is only one callback queue, so this is easy.  ;-) */
+	synchronize_rcu_tasks_rude();
+}
+EXPORT_SYMBOL_GPL(rcu_barrier_tasks_rude);
+
+static int __init rcu_spawn_tasks_rude_kthread(void)
+{
+	rcu_spawn_tasks_kthread_generic(&rcu_tasks_rude);
+	return 0;
+}
+core_initcall(rcu_spawn_tasks_rude_kthread);
+
+#endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
-- 
2.9.5



* [PATCH RFC tip/core/rcu 10/16] rcutorture: Add torture tests for RCU Tasks Rude
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                   ` (8 preceding siblings ...)
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant paulmck
@ 2020-03-12 18:16 ` paulmck
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 11/16] rcu-tasks: Use unique names for RCU-Tasks kthreads and messages paulmck
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-12 18:16 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds the definitions required to torture the rude flavor of
RCU tasks.
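
As a rough illustration of how a boot parameter such as
rcutorture.torture_type=tasks-rude could end up selecting the new ops
structure, here is a userspace sketch of a name-keyed ops table.  The
names torture_ops_model, ops_table, find_ops(), and the two stub sync
functions are hypothetical, and the by-name strcmp() lookup is an
assumption about the selection logic, which is not shown in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace model of an ops structure: a name plus one operation. */
struct torture_ops_model {
	const char *name;
	void (*sync)(void);
};

/* Stub operations; the rude one counts calls so its use is visible. */
static int syncs_rude;
static void sync_tasks_model(void) { }
static void sync_rude_model(void) { syncs_rude++; }

static struct torture_ops_model tasks_model = {
	.name = "tasks",
	.sync = sync_tasks_model,
};

static struct torture_ops_model rude_model = {
	.name = "tasks-rude",
	.sync = sync_rude_model,
};

/* Adding a flavor means adding one entry here, as the patch does. */
static struct torture_ops_model *ops_table[] = {
	&tasks_model, &rude_model,
};

/* Return the ops whose name matches type, or NULL if there is none. */
static struct torture_ops_model *find_ops(const char *type)
{
	size_t i;

	assert(type != NULL);
	for (i = 0; i < sizeof(ops_table) / sizeof(ops_table[0]); i++)
		if (strcmp(type, ops_table[i]->name) == 0)
			return ops_table[i];
	return NULL;
}
```

With this table, find_ops("tasks-rude") returns &rude_model and an
unknown name returns NULL, so the caller can reject a mistyped
torture type.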

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/Kconfig.debug                           |  2 ++
 kernel/rcu/rcu.h                                   |  1 +
 kernel/rcu/rcutorture.c                            | 31 ++++++++++++++++++++--
 .../selftests/rcutorture/configs/rcu/CFLIST        |  1 +
 .../selftests/rcutorture/configs/rcu/RUDE01        | 10 +++++++
 .../selftests/rcutorture/configs/rcu/RUDE01.boot   |  1 +
 6 files changed, 44 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/RUDE01
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot

diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index ec4bb6c..b15a3bd 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -24,6 +24,7 @@ config RCU_PERF_TEST
 	select TORTURE_TEST
 	select SRCU
 	select TASKS_RCU
+	select TASKS_RUDE_RCU
 	default n
 	help
 	  This option provides a kernel module that runs performance
@@ -41,6 +42,7 @@ config RCU_TORTURE_TEST
 	select TORTURE_TEST
 	select SRCU
 	select TASKS_RCU
+	select TASKS_RUDE_RCU
 	default n
 	help
 	  This option provides a kernel module that runs torture tests
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 00ddc92..c574620 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -441,6 +441,7 @@ void rcu_request_urgent_qs_task(struct task_struct *t);
 enum rcutorture_type {
 	RCU_FLAVOR,
 	RCU_TASKS_FLAVOR,
+	RCU_TASKS_RUDE_FLAVOR,
 	RCU_TRIVIAL_FLAVOR,
 	SRCU_FLAVOR,
 	INVALID_RCU_FLAVOR
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 1880c5f..1acafc5 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -731,6 +731,33 @@ static struct rcu_torture_ops trivial_ops = {
 	.name		= "trivial"
 };
 
+/*
+ * Definitions for rude RCU-tasks torture testing.
+ */
+
+static void rcu_tasks_rude_torture_deferred_free(struct rcu_torture *p)
+{
+	call_rcu_tasks_rude(&p->rtort_rcu, rcu_torture_cb);
+}
+
+static struct rcu_torture_ops tasks_rude_ops = {
+	.ttype		= RCU_TASKS_RUDE_FLAVOR,
+	.init		= rcu_sync_torture_init,
+	.readlock	= rcu_torture_read_lock_trivial,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= rcu_torture_read_unlock_trivial,
+	.get_gp_seq	= rcu_no_completed,
+	.deferred_free	= rcu_tasks_rude_torture_deferred_free,
+	.sync		= synchronize_rcu_tasks_rude,
+	.exp_sync	= synchronize_rcu_tasks_rude,
+	.call		= call_rcu_tasks_rude,
+	.cb_barrier	= rcu_barrier_tasks_rude,
+	.fqs		= NULL,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.name		= "tasks-rude"
+};
+
 static unsigned long rcutorture_seq_diff(unsigned long new, unsigned long old)
 {
 	if (!cur_ops->gp_diff)
@@ -740,7 +767,7 @@ static unsigned long rcutorture_seq_diff(unsigned long new, unsigned long old)
 
 static bool __maybe_unused torturing_tasks(void)
 {
-	return cur_ops == &tasks_ops;
+	return cur_ops == &tasks_ops || cur_ops == &tasks_rude_ops;
 }
 
 /*
@@ -2408,7 +2435,7 @@ rcu_torture_init(void)
 	int firsterr = 0;
 	static struct rcu_torture_ops *torture_ops[] = {
 		&rcu_ops, &rcu_busted_ops, &srcu_ops, &srcud_ops,
-		&busted_srcud_ops, &tasks_ops, &trivial_ops,
+		&busted_srcud_ops, &tasks_ops, &tasks_rude_ops, &trivial_ops,
 	};
 
 	if (!torture_init_begin(torture_type, verbose))
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
index c3c1fb5..ec0c72f 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
+++ b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
@@ -14,3 +14,4 @@ TINY02
 TASKS01
 TASKS02
 TASKS03
+RUDE01
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/RUDE01 b/tools/testing/selftests/rcutorture/configs/rcu/RUDE01
new file mode 100644
index 0000000..bafe94c
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/RUDE01
@@ -0,0 +1,10 @@
+CONFIG_SMP=y
+CONFIG_NR_CPUS=2
+CONFIG_HOTPLUG_CPU=y
+CONFIG_PREEMPT_NONE=n
+CONFIG_PREEMPT_VOLUNTARY=n
+CONFIG_PREEMPT=y
+CONFIG_DEBUG_LOCK_ALLOC=y
+CONFIG_PROVE_LOCKING=y
+#CHECK#CONFIG_PROVE_RCU=y
+CONFIG_RCU_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot b/tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot
new file mode 100644
index 0000000..9363708
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot
@@ -0,0 +1 @@
+rcutorture.torture_type=tasks-rude
-- 
2.9.5


* [PATCH RFC tip/core/rcu 11/16] rcu-tasks: Use unique names for RCU-Tasks kthreads and messages
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                   ` (9 preceding siblings ...)
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 10/16] rcutorture: Add torture tests for RCU Tasks Rude paulmck
@ 2020-03-12 18:16 ` paulmck
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 12/16] rcu-tasks: Further refactor RCU-tasks to allow adding more variants paulmck
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-12 18:16 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit causes the flavors of RCU Tasks to use different names
for their kthreads and in their console messages.
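
As a rough userspace illustration (not part of this patch), the naming
scheme can be sketched as below.  The struct flavor type, the
DEFINE_FLAVOR() macro, and flavor_kthread_name() are hypothetical
stand-ins; only the use of "#" stringification for the kthread name and
the "%s_kthread" format string are taken from the diff that follows.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct rcu_tasks, reduced to the name fields. */
struct flavor {
	const char *name;	/* textual name, used in console messages */
	const char *kname;	/* identifier-derived name, used for the kthread */
};

/* As in DEFINE_RCU_TASKS(), "#id" turns the variable name into a string. */
#define DEFINE_FLAVOR(id, n) \
static struct flavor id = { .name = n, .kname = #id }

DEFINE_FLAVOR(rcu_tasks, "RCU Tasks");
DEFINE_FLAVOR(rcu_tasks_rude, "RCU Tasks Rude");

/* Format the kthread name the same way the "%s_kthread" format would. */
static int flavor_kthread_name(const struct flavor *f, char *buf, size_t len)
{
	assert(f && buf);
	return snprintf(buf, len, "%s_kthread", f->kname);
}
```

Deriving the kthread name from the variable name keeps the two from
drifting apart, while the separate textual name remains free-form.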

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 1d25c50..6a5b2b7e 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -22,6 +22,8 @@ typedef void (*rcu_tasks_gp_func_t)(struct rcu_tasks *rtp);
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
  * @gp_func: This flavor's grace-period-wait function.
  * @call_func: This flavor's call_rcu()-equivalent function.
+ * @name: This flavor's textual name.
+ * @kname: This flavor's kthread name.
  */
 struct rcu_tasks {
 	struct rcu_head *cbs_head;
@@ -31,16 +33,20 @@ struct rcu_tasks {
 	struct task_struct *kthread_ptr;
 	rcu_tasks_gp_func_t gp_func;
 	call_rcu_func_t call_func;
+	char *name;
+	char *kname;
 };
 
-#define DEFINE_RCU_TASKS(name, gp, call)				\
-static struct rcu_tasks name =						\
+#define DEFINE_RCU_TASKS(rt_name, gp, call, n)				\
+static struct rcu_tasks rt_name =					\
 {									\
-	.cbs_tail = &name.cbs_head,					\
-	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(name.cbs_wq),		\
-	.cbs_lock = __RAW_SPIN_LOCK_UNLOCKED(name.cbs_lock),		\
+	.cbs_tail = &rt_name.cbs_head,					\
+	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(rt_name.cbs_wq),	\
+	.cbs_lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name.cbs_lock),		\
 	.gp_func = gp,							\
 	.call_func = call,						\
+	.name = n,							\
+	.kname = #rt_name,						\
 }
 
 /* Track exiting tasks in order to allow them to be waited for. */
@@ -145,8 +151,8 @@ static void __init rcu_spawn_tasks_kthread_generic(struct rcu_tasks *rtp)
 {
 	struct task_struct *t;
 
-	t = kthread_run(rcu_tasks_kthread, rtp, "rcu_tasks_kthread");
-	if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
+	t = kthread_run(rcu_tasks_kthread, rtp, "%s_kthread", rtp->kname);
+	if (WARN_ONCE(IS_ERR(t), "%s: Could not start %s grace-period kthread, OOM is now expected behavior\n", __func__, rtp->name))
 		return;
 	smp_mb(); /* Ensure others see full kthread. */
 }
@@ -342,7 +348,7 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 }
 
 void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
-DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks);
+DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
 
 /**
  * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
@@ -438,7 +444,8 @@ static void rcu_tasks_rude_wait_gp(struct rcu_tasks *rtp)
 EXPORT_SYMBOL_GPL(rcu_tasks_rude_wait_gp);
 
 void call_rcu_tasks_rude(struct rcu_head *rhp, rcu_callback_t func);
-DEFINE_RCU_TASKS(rcu_tasks_rude, rcu_tasks_rude_wait_gp, call_rcu_tasks_rude);
+DEFINE_RCU_TASKS(rcu_tasks_rude, rcu_tasks_rude_wait_gp, call_rcu_tasks_rude,
+		 "RCU Tasks Rude");
 
 /**
  * call_rcu_tasks_rude() - Queue a callback rude task-based grace period
-- 
2.9.5


* [PATCH RFC tip/core/rcu 12/16] rcu-tasks: Further refactor RCU-tasks to allow adding more variants
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                   ` (10 preceding siblings ...)
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 11/16] rcu-tasks: Use unique names for RCU-Tasks kthreads and messages paulmck
@ 2020-03-12 18:16 ` paulmck
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 13/16] rcu-tasks: Code movement to allow more Tasks RCU variants paulmck
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-12 18:16 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit refactors RCU tasks to allow variants to be added that
share the tasklist scan and later holdout list processing.
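
The shape of the refactoring can be sketched in userspace roughly as
follows.  This is only an illustration under stated assumptions: struct
task, struct flavor_ops, the demo_*() hooks, and generic_wait_gp() are
all hypothetical, and the real code sleeps between holdout scans rather
than looping immediately.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature task; "running" loosely stands in for t->on_rq. */
struct task {
	int running;
	int holdout;
};

/* Hypothetical per-flavor hooks, one per phase of the grace period. */
struct flavor_ops {
	void (*pregp)(void);
	void (*pertask)(struct task *t, int *nholdouts);
	void (*postscan)(void);
	void (*holdouts)(struct task *tasks, size_t n, int *nholdouts);
	void (*postgp)(void);
};

static char trace[16];		/* one letter per phase, in call order */
static size_t trace_len;

static void note(char c)
{
	if (trace_len < sizeof(trace) - 1)
		trace[trace_len++] = c;
}

static void demo_pregp(void) { note('a'); }
static void demo_postscan(void) { note('b'); }
static void demo_postgp(void) { note('d'); }

/* Mark every running task as a holdout. */
static void demo_pertask(struct task *t, int *nholdouts)
{
	if (t->running) {
		t->holdout = 1;
		(*nholdouts)++;
	}
}

/* Pretend that every holdout has since reached a quiescent state. */
static void demo_holdouts(struct task *tasks, size_t n, int *nholdouts)
{
	size_t i;

	note('c');
	for (i = 0; i < n; i++) {
		if (tasks[i].holdout) {
			tasks[i].holdout = 0;
			(*nholdouts)--;
		}
	}
}

/* Shared driver: the phases are fixed, the hooks vary by flavor. */
static int generic_wait_gp(const struct flavor_ops *ops,
			   struct task *tasks, size_t n)
{
	int nholdouts = 0;
	int passes = 0;
	size_t i;

	ops->pregp();
	for (i = 0; i < n; i++)
		ops->pertask(&tasks[i], &nholdouts);
	ops->postscan();
	while (nholdouts > 0) {
		ops->holdouts(tasks, n, &nholdouts);
		passes++;
	}
	ops->postgp();
	return passes;
}
```

A new flavor then supplies its own five hooks and reuses the driver
unchanged.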

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 166 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 108 insertions(+), 58 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 6a5b2b7e..a19bd92 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -12,6 +12,11 @@
 
 struct rcu_tasks;
 typedef void (*rcu_tasks_gp_func_t)(struct rcu_tasks *rtp);
+typedef void (*pregp_func_t)(void);
+typedef void (*pertask_func_t)(struct task_struct *t, struct list_head *hop);
+typedef void (*postscan_func_t)(void);
+typedef void (*holdouts_func_t)(struct list_head *hop, bool ndrpt, bool *frptp);
+typedef void (*postgp_func_t)(void);
 
 /**
  * Definition for a Tasks-RCU-like mechanism.
@@ -21,6 +26,11 @@ typedef void (*rcu_tasks_gp_func_t)(struct rcu_tasks *rtp);
  * @cbs_lock: Lock protecting callback list.
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
  * @gp_func: This flavor's grace-period-wait function.
+ * @pregp_func: This flavor's pre-grace-period function (optional).
+ * @pertask_func: This flavor's per-task scan function (optional).
+ * @postscan_func: This flavor's post-task scan function (optional).
+ * @holdouts_func: This flavor's holdout-list scan function (optional).
+ * @postgp_func: This flavor's post-grace-period function (optional).
  * @call_func: This flavor's call_rcu()-equivalent function.
  * @name: This flavor's textual name.
  * @kname: This flavor's kthread name.
@@ -32,6 +42,11 @@ struct rcu_tasks {
 	raw_spinlock_t cbs_lock;
 	struct task_struct *kthread_ptr;
 	rcu_tasks_gp_func_t gp_func;
+	pregp_func_t pregp_func;
+	pertask_func_t pertask_func;
+	postscan_func_t postscan_func;
+	holdouts_func_t holdouts_func;
+	postgp_func_t postgp_func;
 	call_rcu_func_t call_func;
 	char *name;
 	char *kname;
@@ -113,6 +128,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 
 		/* Pick up any new callbacks. */
 		raw_spin_lock_irqsave(&rtp->cbs_lock, flags);
+		smp_mb__after_unlock_lock(); // Order updates vs. GP.
 		list = rtp->cbs_head;
 		rtp->cbs_head = NULL;
 		rtp->cbs_tail = &rtp->cbs_head;
@@ -207,6 +223,49 @@ static void __init rcu_tasks_bootup_oddness(void)
 // rates from multiple CPUs.  If this is required, per-CPU callback lists
 // will be needed.
 
+/* Pre-grace-period preparation. */
+static void rcu_tasks_pregp_step(void)
+{
+	/*
+	 * Wait for all pre-existing t->on_rq and t->nvcsw transitions
+	 * to complete.  Invoking synchronize_rcu() suffices because all
+	 * these transitions occur with interrupts disabled.  Without this
+	 * synchronize_rcu(), a read-side critical section that started
+	 * before the grace period might be incorrectly seen as having
+	 * started after the grace period.
+	 *
+	 * This synchronize_rcu() also dispenses with the need for a
+	 * memory barrier on the first store to t->rcu_tasks_holdout,
+	 * as it forces the store to happen after the beginning of the
+	 * grace period.
+	 */
+	synchronize_rcu();
+}
+
+/* Per-task initial processing. */
+static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
+{
+	if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
+		get_task_struct(t);
+		t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
+		WRITE_ONCE(t->rcu_tasks_holdout, true);
+		list_add(&t->rcu_tasks_holdout_list, hop);
+	}
+}
+
+/* Processing between scanning tasklist and draining the holdout list. */
+void rcu_tasks_postscan(void)
+{
+	/*
+	 * Wait for tasks that are in the process of exiting.  This
+	 * does only part of the job, ensuring that all tasks that were
+	 * previously exiting reach the point where they have disabled
+	 * preemption, allowing the later synchronize_rcu() to finish
+	 * the job.
+	 */
+	synchronize_srcu(&tasks_rcu_exit_srcu);
+}
+
 /* See if tasks are still holding out, complain if so. */
 static void check_holdout_task(struct task_struct *t,
 			       bool needreport, bool *firstreport)
@@ -239,55 +298,63 @@ static void check_holdout_task(struct task_struct *t,
 	sched_show_task(t);
 }
 
+/* Scan the holdout lists for tasks no longer holding out. */
+static void check_all_holdout_tasks(struct list_head *hop,
+				    bool needreport, bool *firstreport)
+{
+	struct task_struct *t, *t1;
+
+	list_for_each_entry_safe(t, t1, hop, rcu_tasks_holdout_list) {
+		check_holdout_task(t, needreport, firstreport);
+		cond_resched();
+	}
+}
+
+/* Finish off the Tasks-RCU grace period. */
+static void rcu_tasks_postgp(void)
+{
+	/*
+	 * Because ->on_rq and ->nvcsw are not guaranteed to have full
+	 * memory barriers prior to them in the schedule() path, memory
+	 * reordering on other CPUs could cause their RCU-tasks read-side
+	 * critical sections to extend past the end of the grace period.
+	 * However, because these ->nvcsw updates are carried out with
+	 * interrupts disabled, we can use synchronize_rcu() to force the
+	 * needed ordering on all such CPUs.
+	 *
+	 * This synchronize_rcu() also confines all ->rcu_tasks_holdout
+	 * accesses to be within the grace period, avoiding the need for
+	 * memory barriers for ->rcu_tasks_holdout accesses.
+	 *
+	 * In addition, this synchronize_rcu() waits for exiting tasks
+	 * to complete their final preempt_disable() region of execution,
+	 * cleaning up after the synchronize_srcu() above.
+	 */
+	synchronize_rcu();
+}
+
 /* Wait for one RCU-tasks grace period. */
 static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 {
 	struct task_struct *g, *t;
 	unsigned long lastreport;
-	LIST_HEAD(rcu_tasks_holdouts);
+	LIST_HEAD(holdouts);
 	int fract;
 
-	/*
-	 * Wait for all pre-existing t->on_rq and t->nvcsw transitions
-	 * to complete.  Invoking synchronize_rcu() suffices because all
-	 * these transitions occur with interrupts disabled.  Without this
-	 * synchronize_rcu(), a read-side critical section that started
-	 * before the grace period might be incorrectly seen as having
-	 * started after the grace period.
-	 *
-	 * This synchronize_rcu() also dispenses with the need for a
-	 * memory barrier on the first store to t->rcu_tasks_holdout,
-	 * as it forces the store to happen after the beginning of the
-	 * grace period.
-	 */
-	synchronize_rcu();
+	rtp->pregp_func();
 
 	/*
 	 * There were callbacks, so we need to wait for an RCU-tasks
 	 * grace period.  Start off by scanning the task list for tasks
 	 * that are not already voluntarily blocked.  Mark these tasks
-	 * and make a list of them in rcu_tasks_holdouts.
+	 * and make a list of them in holdouts.
 	 */
 	rcu_read_lock();
-	for_each_process_thread(g, t) {
-		if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
-			get_task_struct(t);
-			t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
-			WRITE_ONCE(t->rcu_tasks_holdout, true);
-			list_add(&t->rcu_tasks_holdout_list,
-				 &rcu_tasks_holdouts);
-		}
-	}
+	for_each_process_thread(g, t)
+		rtp->pertask_func(t, &holdouts);
 	rcu_read_unlock();
 
-	/*
-	 * Wait for tasks that are in the process of exiting.  This
-	 * does only part of the job, ensuring that all tasks that were
-	 * previously exiting reach the point where they have disabled
-	 * preemption, allowing the later synchronize_rcu() to finish
-	 * the job.
-	 */
-	synchronize_srcu(&tasks_rcu_exit_srcu);
+	rtp->postscan_func();
 
 	/*
 	 * Each pass through the following loop scans the list of holdout
@@ -303,9 +370,8 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 		bool firstreport;
 		bool needreport;
 		int rtst;
-		struct task_struct *t1;
 
-		if (list_empty(&rcu_tasks_holdouts))
+		if (list_empty(&holdouts))
 			break;
 
 		/* Slowly back off waiting for holdouts */
@@ -320,31 +386,10 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 			lastreport = jiffies;
 		firstreport = true;
 		WARN_ON(signal_pending(current));
-		list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
-					 rcu_tasks_holdout_list) {
-			check_holdout_task(t, needreport, &firstreport);
-			cond_resched();
-		}
+		rtp->holdouts_func(&holdouts, needreport, &firstreport);
 	}
 
-	/*
-	 * Because ->on_rq and ->nvcsw are not guaranteed to have a full
-	 * memory barriers prior to them in the schedule() path, memory
-	 * reordering on other CPUs could cause their RCU-tasks read-side
-	 * critical sections to extend past the end of the grace period.
-	 * However, because these ->nvcsw updates are carried out with
-	 * interrupts disabled, we can use synchronize_rcu() to force the
-	 * needed ordering on all such CPUs.
-	 *
-	 * This synchronize_rcu() also confines all ->rcu_tasks_holdout
-	 * accesses to be within the grace period, avoiding the need for
-	 * memory barriers for ->rcu_tasks_holdout accesses.
-	 *
-	 * In addition, this synchronize_rcu() waits for exiting tasks
-	 * to complete their final preempt_disable() region of execution,
-	 * cleaning up after the synchronize_srcu() above.
-	 */
-	synchronize_rcu();
+	rtp->postgp_func();
 }
 
 void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
@@ -413,6 +458,11 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
 
 static int __init rcu_spawn_tasks_kthread(void)
 {
+	rcu_tasks.pregp_func = rcu_tasks_pregp_step;
+	rcu_tasks.pertask_func = rcu_tasks_pertask;
+	rcu_tasks.postscan_func = rcu_tasks_postscan;
+	rcu_tasks.holdouts_func = check_all_holdout_tasks;
+	rcu_tasks.postgp_func = rcu_tasks_postgp;
 	rcu_spawn_tasks_kthread_generic(&rcu_tasks);
 	return 0;
 }
-- 
2.9.5


* [PATCH RFC tip/core/rcu 13/16] rcu-tasks: Code movement to allow more Tasks RCU variants
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                   ` (11 preceding siblings ...)
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 12/16] rcu-tasks: Further refactor RCU-tasks to allow adding more variants paulmck
@ 2020-03-12 18:16 ` paulmck
  2020-03-12 18:17 ` [PATCH RFC tip/core/rcu 14/16] rcu: Add an RCU Tasks Trace to simplify protection of tracing hooks paulmck
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-12 18:16 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit does nothing but move rcu_tasks_wait_gp() up to a new section
for common code.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 122 +++++++++++++++++++++++++++--------------------------
 1 file changed, 63 insertions(+), 59 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index a19bd92..b65c45f 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -213,6 +213,69 @@ static void __init rcu_tasks_bootup_oddness(void)
 
 ////////////////////////////////////////////////////////////////////////
 //
+// Shared code between task-list-scanning variants of Tasks RCU.
+
+/* Wait for one RCU-tasks grace period. */
+static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
+{
+	struct task_struct *g, *t;
+	unsigned long lastreport;
+	LIST_HEAD(holdouts);
+	int fract;
+
+	rtp->pregp_func();
+
+	/*
+	 * There were callbacks, so we need to wait for an RCU-tasks
+	 * grace period.  Start off by scanning the task list for tasks
+	 * that are not already voluntarily blocked.  Mark these tasks
+	 * and make a list of them in holdouts.
+	 */
+	rcu_read_lock();
+	for_each_process_thread(g, t)
+		rtp->pertask_func(t, &holdouts);
+	rcu_read_unlock();
+
+	rtp->postscan_func();
+
+	/*
+	 * Each pass through the following loop scans the list of holdout
+	 * tasks, removing any that are no longer holdouts.  When the list
+	 * is empty, we are done.
+	 */
+	lastreport = jiffies;
+
+	/* Start off with HZ/10 wait and slowly back off to 1 HZ wait. */
+	fract = 10;
+
+	for (;;) {
+		bool firstreport;
+		bool needreport;
+		int rtst;
+
+		if (list_empty(&holdouts))
+			break;
+
+		/* Slowly back off waiting for holdouts */
+		schedule_timeout_interruptible(HZ/fract);
+
+		if (fract > 1)
+			fract--;
+
+		rtst = READ_ONCE(rcu_task_stall_timeout);
+		needreport = rtst > 0 && time_after(jiffies, lastreport + rtst);
+		if (needreport)
+			lastreport = jiffies;
+		firstreport = true;
+		WARN_ON(signal_pending(current));
+		rtp->holdouts_func(&holdouts, needreport, &firstreport);
+	}
+
+	rtp->postgp_func();
+}
+
+////////////////////////////////////////////////////////////////////////
+//
 // Simple variant of RCU whose quiescent states are voluntary context
 // switch, cond_resched_rcu_qs(), user-space execution, and idle.
 // As such, grace periods can take one good long time.  There are no
@@ -333,65 +396,6 @@ static void rcu_tasks_postgp(void)
 	synchronize_rcu();
 }
 
-/* Wait for one RCU-tasks grace period. */
-static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
-{
-	struct task_struct *g, *t;
-	unsigned long lastreport;
-	LIST_HEAD(holdouts);
-	int fract;
-
-	rtp->pregp_func();
-
-	/*
-	 * There were callbacks, so we need to wait for an RCU-tasks
-	 * grace period.  Start off by scanning the task list for tasks
-	 * that are not already voluntarily blocked.  Mark these tasks
-	 * and make a list of them in holdouts.
-	 */
-	rcu_read_lock();
-	for_each_process_thread(g, t)
-		rtp->pertask_func(t, &holdouts);
-	rcu_read_unlock();
-
-	rtp->postscan_func();
-
-	/*
-	 * Each pass through the following loop scans the list of holdout
-	 * tasks, removing any that are no longer holdouts.  When the list
-	 * is empty, we are done.
-	 */
-	lastreport = jiffies;
-
-	/* Start off with HZ/10 wait and slowly back off to 1 HZ wait. */
-	fract = 10;
-
-	for (;;) {
-		bool firstreport;
-		bool needreport;
-		int rtst;
-
-		if (list_empty(&holdouts))
-			break;
-
-		/* Slowly back off waiting for holdouts */
-		schedule_timeout_interruptible(HZ/fract);
-
-		if (fract > 1)
-			fract--;
-
-		rtst = READ_ONCE(rcu_task_stall_timeout);
-		needreport = rtst > 0 && time_after(jiffies, lastreport + rtst);
-		if (needreport)
-			lastreport = jiffies;
-		firstreport = true;
-		WARN_ON(signal_pending(current));
-		rtp->holdouts_func(&holdouts, needreport, &firstreport);
-	}
-
-	rtp->postgp_func();
-}
-
 void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
 DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
 
-- 
2.9.5


* [PATCH RFC tip/core/rcu 14/16] rcu: Add an RCU Tasks Trace to simplify protection of tracing hooks
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                   ` (12 preceding siblings ...)
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 13/16] rcu-tasks: Code movement to allow more Tasks RCU variants paulmck
@ 2020-03-12 18:17 ` paulmck
  2020-03-12 18:17 ` [PATCH RFC tip/core/rcu 15/16] rcutorture: Add torture tests for RCU Tasks Trace paulmck
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-12 18:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney,
	Alexei Starovoitov, Andrii Nakryiko

From: "Paul E. McKenney" <paulmck@kernel.org>

Because RCU does not watch exception early-entry/late-exit, idle-loop,
or CPU-hotplug execution, protection of tracing and BPF operations is
needlessly complicated.  This commit therefore adds a variant of
Tasks RCU that:

o	Has explicit read-side markers to allow finite grace periods
	in the face of in-kernel loops for PREEMPT=n builds.

o	Protects code in the idle loop, exception entry/exit, and
	CPU-hotplug code paths, similar to the capabilities of SRCU.

o	Avoids expensive read-side instructions, having overhead similar
	to that of Preemptible RCU.

There are of course downsides.  The grace-period code can send IPIs to
CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
It is necessary to scan the full tasklist, much as for Tasks RCU.  There
is a single callback queue guarded by a single lock, again, much as for
Tasks RCU.  If needed, these downsides can be at least partially remedied.

Perhaps most important, this variant of RCU does not affect the vanilla
flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
readers can operate from idle, offline, and exception entry/exit in no
way allows rcu_preempt and rcu_sched readers to also do so.
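
The read-side fast path can be pictured with the following
single-threaded userspace sketch.  It is an assumption-laden
illustration rather than the implementation: struct reader,
reader_lock(), reader_unlock(), and the plain-int counters are
hypothetical, and the real code uses READ_ONCE()/WRITE_ONCE(), an
atomic_t, and a wait queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the trc_* fields added to task_struct. */
struct reader {
	int nesting;	/* cf. trc_reader_nesting */
	bool need_end;	/* cf. trc_reader_need_end */
};

static int readers_need_end;	/* cf. trc_n_readers_need_end, non-atomic here */
static int wakeups;		/* counts would-be wake_up() calls */

/* Entering a reader only bumps a per-task counter. */
static void reader_lock(struct reader *r)
{
	r->nesting++;
}

/* Leaving a reader does extra work only if a grace period is waiting. */
static void reader_unlock(struct reader *r)
{
	int nesting = --r->nesting;

	assert(nesting >= 0);	/* an unbalanced unlock is a bug */
	if (!r->need_end || nesting)
		return;		/* fast path: nobody waiting, or still nested */
	r->need_end = false;
	if (--readers_need_end == 0)
		wakeups++;	/* last waited-for reader wakes the updater */
}
```

Only the outermost unlock of a reader that the grace period has flagged
takes the slow path, which is why the common case stays cheap.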

This effort benefited greatly from off-list discussions of BPF
requirements with Alexei Starovoitov and Andrii Nakryiko.  At least
some of the on-list discussions are captured in the Link: tags below.

Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/
Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andriin@fb.com>
---
 include/linux/rcupdate_trace.h |  84 ++++++++++
 include/linux/sched.h          |   8 +
 init/init_task.c               |   4 +
 kernel/fork.c                  |   4 +
 kernel/rcu/Kconfig             |  12 +-
 kernel/rcu/tasks.h             | 356 ++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 459 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/rcupdate_trace.h

diff --git a/include/linux/rcupdate_trace.h b/include/linux/rcupdate_trace.h
new file mode 100644
index 0000000..ed97e10
--- /dev/null
+++ b/include/linux/rcupdate_trace.h
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * Read-Copy Update mechanism for mutual exclusion, adapted for tracing.
+ *
+ * Copyright (C) 2020 Paul E. McKenney.
+ */
+
+#ifndef __LINUX_RCUPDATE_TRACE_H
+#define __LINUX_RCUPDATE_TRACE_H
+
+#include <linux/sched.h>
+#include <linux/rcupdate.h>
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+
+extern struct lockdep_map rcu_trace_lock_map;
+
+static inline int rcu_read_lock_trace_held(void)
+{
+	return lock_is_held(&rcu_trace_lock_map);
+}
+
+#else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+static inline int rcu_read_lock_trace_held(void)
+{
+	return 1;
+}
+
+#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+#ifdef CONFIG_TASKS_TRACE_RCU
+
+void rcu_read_unlock_trace_special(struct task_struct *t);
+
+/**
+ * rcu_read_lock_trace - mark beginning of RCU-trace read-side critical section
+ *
+ * When synchronize_rcu_tasks_trace() is invoked by one task, then that
+ * task is guaranteed to block until all other tasks exit their read-side
+ * critical sections.  Similarly, if call_rcu_tasks_trace() is invoked on
+ * one task while other tasks are within RCU read-side critical sections,
+ * invocation of the corresponding RCU callback is deferred until after
+ * all the other tasks exit their critical sections.
+ *
+ * For more details, please see the documentation for rcu_read_lock().
+ */
+static inline void rcu_read_lock_trace(void)
+{
+	struct task_struct *t = current;
+
+	WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + 1);
+	rcu_lock_acquire(&rcu_trace_lock_map);
+}
+
+/**
+ * rcu_read_unlock_trace - mark end of RCU-trace read-side critical section
+ *
+ * Pairs with a preceding call to rcu_read_lock_trace(), and nesting is
+ * allowed.  Invoking a rcu_read_unlock_trace() when there is no matching
+ * rcu_read_lock_trace() is verboten, and will result in lockdep complaints.
+ *
+ * For more details, please see the documentation for rcu_read_unlock().
+ */
+static inline void rcu_read_unlock_trace(void)
+{
+	int nesting;
+	struct task_struct *t = current;
+
+	rcu_lock_release(&rcu_trace_lock_map);
+	nesting = READ_ONCE(t->trc_reader_nesting) - 1;
+	WRITE_ONCE(t->trc_reader_nesting, nesting);
+	if (likely(!READ_ONCE(t->trc_reader_need_end)) || nesting)
+		return;  // We assume shallow reader nesting.
+	rcu_read_unlock_trace_special(t);
+}
+
+void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
+void synchronize_rcu_tasks_trace(void);
+void rcu_barrier_tasks_trace(void);
+
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
+
+#endif /* __LINUX_RCUPDATE_TRACE_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 621e4aa..ef68ae4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -722,6 +722,14 @@ struct task_struct {
 	struct list_head		rcu_tasks_holdout_list;
 #endif /* #ifdef CONFIG_TASKS_RCU */
 
+#ifdef CONFIG_TASKS_TRACE_RCU
+	int				trc_reader_nesting;
+	int				trc_ipi_to_cpu;
+	bool				trc_reader_need_end;
+	bool				trc_reader_checked;
+	struct list_head		trc_holdout_list;
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
+
 	struct sched_info		sched_info;
 
 	struct list_head		tasks;
diff --git a/init/init_task.c b/init/init_task.c
index 096191d..1b9ec3d 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -140,6 +140,10 @@ struct task_struct init_task
 	.rcu_tasks_holdout_list = LIST_HEAD_INIT(init_task.rcu_tasks_holdout_list),
 	.rcu_tasks_idle_cpu = -1,
 #endif
+#ifdef CONFIG_TASKS_TRACE_RCU
+	.trc_reader_nesting = 0,
+	.trc_holdout_list = LIST_HEAD_INIT(init_task.trc_holdout_list),
+#endif
 #ifdef CONFIG_CPUSETS
 	.mems_allowed_seq = SEQCNT_ZERO(init_task.mems_allowed_seq),
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index e592e6f..d0e547c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1685,6 +1685,10 @@ static inline void rcu_copy_process(struct task_struct *p)
 	INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
 	p->rcu_tasks_idle_cpu = -1;
 #endif /* #ifdef CONFIG_TASKS_RCU */
+#ifdef CONFIG_TASKS_TRACE_RCU
+	p->trc_reader_nesting = 0;
+	INIT_LIST_HEAD(&p->trc_holdout_list);
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
 struct pid *pidfd_pid(const struct file *file)
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 0d43ec1..187226b 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -71,7 +71,7 @@ config TREE_SRCU
 	  This option selects the full-fledged version of SRCU.
 
 config TASKS_RCU_GENERIC
-	def_bool TASKS_RCU || TASKS_RUDE_RCU
+	def_bool TASKS_RCU || TASKS_RUDE_RCU || TASKS_TRACE_RCU
 	select SRCU
 	help
 	  This option enables generic infrastructure code supporting
@@ -94,6 +94,16 @@ config TASKS_RUDE_RCU
 	  switches on all online CPUs, including idle ones, so use
 	  with caution.  Not for manual selection.
 
+config TASKS_TRACE_RCU
+	def_bool 0
+	default n
+	help
+	  This option enables a task-based RCU implementation that uses
+	  explicit rcu_read_lock_trace() read-side markers, and allows
+	  these readers to appear in the idle loop as well as on the CPU
+	  hotplug code paths.  It can force IPIs on online CPUs, including
+	  idle ones, so use with caution.  Not for manual selection.
+
 config RCU_STALL_COMMON
 	def_bool TREE_RCU
 	help
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index b65c45f..5d7bd48 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -181,12 +181,17 @@ void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
 	preempt_enable();
 }
 
+static void exit_tasks_rcu_finish_trace(struct task_struct *t);
+
 /* Do the srcu_read_unlock() for the above synchronize_srcu().  */
 void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
 {
+	struct task_struct *t = current;
+
 	preempt_disable();
-	__srcu_read_unlock(&tasks_rcu_exit_srcu, current->rcu_tasks_idx);
+	__srcu_read_unlock(&tasks_rcu_exit_srcu, t->rcu_tasks_idx);
 	preempt_enable();
+	exit_tasks_rcu_finish_trace(t);
 }
 
 #ifndef CONFIG_TINY_RCU
@@ -196,15 +201,19 @@ void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
  */
 static void __init rcu_tasks_bootup_oddness(void)
 {
-#ifdef CONFIG_TASKS_RCU
+#if defined(CONFIG_TASKS_RCU) || defined(CONFIG_TASKS_TRACE_RCU)
 	if (rcu_task_stall_timeout != RCU_TASK_STALL_TIMEOUT)
 		pr_info("\tTasks-RCU CPU stall warnings timeout set to %d (rcu_task_stall_timeout).\n", rcu_task_stall_timeout);
-	else
-		pr_info("\tTasks RCU enabled.\n");
+#endif /* #if defined(CONFIG_TASKS_RCU) || defined(CONFIG_TASKS_TRACE_RCU) */
+#ifdef CONFIG_TASKS_RCU
+	pr_info("\tTrampoline variant of Tasks RCU enabled.\n");
 #endif /* #ifdef CONFIG_TASKS_RCU */
 #ifdef CONFIG_TASKS_RUDE_RCU
 	pr_info("\tRude variant of Tasks RCU enabled.\n");
 #endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
+#ifdef CONFIG_TASKS_TRACE_RCU
+	pr_info("\tTracing variant of Tasks RCU enabled.\n");
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
 #endif /* #ifndef CONFIG_TINY_RCU */
@@ -480,10 +489,10 @@ core_initcall(rcu_spawn_tasks_kthread);
 //
 // "Rude" variant of Tasks RCU, inspired by Steve Rostedt's trick of
 // passing an empty function to schedule_on_each_cpu().  This approach
-// provides an asynchronous call_rcu_rude() API and batching of concurrent
-// calls to the synchronous synchronize_rcu_rude() API.  This sends IPIs
-// far and wide and induces otherwise unnecessary context switches on all
-// online CPUs, whether online or not.
+// provides an asynchronous call_rcu_tasks_rude() API and batching
+// of concurrent calls to the synchronous synchronize_rcu_tasks_rude() API.
+// This sends IPIs far and wide and induces otherwise unnecessary context
+// switches on all online CPUs, whether idle or not.
 
 // Empty function to allow workqueues to force a context switch.
 static void rcu_tasks_be_rude(struct work_struct *work)
@@ -570,3 +579,334 @@ static int __init rcu_spawn_tasks_rude_kthread(void)
 core_initcall(rcu_spawn_tasks_rude_kthread);
 
 #endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
+
+////////////////////////////////////////////////////////////////////////
+//
+// Tracing variant of Tasks RCU.  This variant is designed to be used
+// to protect tracing hooks, including those of BPF.  This variant
+// therefore:
+//
+// 1.	Has explicit read-side markers to allow finite grace periods
+//	in the face of in-kernel loops for PREEMPT=n builds.
+//
+// 2.	Protects code in the idle loop, exception entry/exit, and
+//	CPU-hotplug code paths, similar to the capabilities of SRCU.
+//
+// 3.	Avoids expensive read-side instructions, having overhead similar
+//	to that of Preemptible RCU.
+//
+// There are of course downsides.  The grace-period code can send IPIs to
+// CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
+// It is necessary to scan the full tasklist, much as for Tasks RCU.  There
+// is a single callback queue guarded by a single lock, again, much as for
+// Tasks RCU.  If needed, these downsides can be at least partially remedied.
+//
+// Perhaps most important, this variant of RCU does not affect the vanilla
+// flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
+// readers can operate from idle, offline, and exception entry/exit in no
+// way allows rcu_preempt and rcu_sched readers to also do so.
+
+// The lockdep state must be outside of #ifdef to be useful.
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+static struct lock_class_key rcu_lock_trace_key;
+struct lockdep_map rcu_trace_lock_map =
+	STATIC_LOCKDEP_MAP_INIT("rcu_read_lock_trace", &rcu_lock_trace_key);
+EXPORT_SYMBOL_GPL(rcu_trace_lock_map);
+#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+#ifdef CONFIG_TASKS_TRACE_RCU
+
+atomic_t trc_n_readers_need_end;	// Number of waited-for readers.
+DECLARE_WAIT_QUEUE_HEAD(trc_wait);	// List of holdout tasks.
+
+// Record outstanding IPIs to each CPU.  No point in sending two...
+static DEFINE_PER_CPU(bool, trc_ipi_to_cpu);
+
+/* If we are the last reader, wake up the grace-period kthread. */
+void rcu_read_unlock_trace_special(struct task_struct *t)
+{
+	WRITE_ONCE(t->trc_reader_need_end, false);
+	if (atomic_dec_and_test(&trc_n_readers_need_end))
+		wake_up(&trc_wait);
+}
+EXPORT_SYMBOL_GPL(rcu_read_unlock_trace_special);
+
+/* Add a task to the holdout list, if it is not already on the list. */
+static void trc_add_holdout(struct task_struct *t, struct list_head *bhp)
+{
+	if (list_empty(&t->trc_holdout_list)) {
+		get_task_struct(t);
+		list_add(&t->trc_holdout_list, bhp);
+	}
+}
+
+/* Remove a task from the holdout list, if it is in fact present. */
+static void trc_del_holdout(struct task_struct *t)
+{
+	if (!list_empty(&t->trc_holdout_list)) {
+		list_del_init(&t->trc_holdout_list);
+		put_task_struct(t);
+	}
+}
+
+/* IPI handler to check task state. */
+static void trc_read_check_handler(void *t_in)
+{
+	struct task_struct *t = current;
+	struct task_struct *texp = t_in;
+
+	// If the task is no longer running on this CPU, leave.
+	if (unlikely(texp != t)) {
+		if (WARN_ON_ONCE(atomic_dec_and_test(&trc_n_readers_need_end)))
+			wake_up(&trc_wait);
+		goto reset_ipi; // Already on holdout list, so will check later.
+	}
+
+	// We have the correct task, so mark it as having been checked.
+	WRITE_ONCE(t->trc_reader_checked, true);
+
+	// If the task is not in a read-side critical section, and
+	// if this is the last reader, awaken the grace-period kthread.
+	if (likely(!t->trc_reader_nesting)) {
+		if (WARN_ON_ONCE(atomic_dec_and_test(&trc_n_readers_need_end)))
+			wake_up(&trc_wait);
+		goto reset_ipi;
+	}
+
+	// Get here if the task is in a read-side critical section.  Set
+	// its state so that it will awaken the grace-period kthread upon
+	// exit from that critical section.
+	WARN_ON_ONCE(t->trc_reader_need_end);
+	WRITE_ONCE(t->trc_reader_need_end, true);
+
+reset_ipi:
+	// Allow future IPIs to be sent on CPU and for task.
+	// Also order this IPI handler against any later manipulations of
+	// the intended task.
+	smp_store_release(&per_cpu(trc_ipi_to_cpu, smp_processor_id()), false); // ^^^
+	smp_store_release(&texp->trc_ipi_to_cpu, -1); // ^^^
+}
+
+/* Callback function for scheduler to check (non-running) task. */
+static void trc_inspect_reader_notrunning(void *arg)
+{
+	struct task_struct *t = arg;
+
+	// Mark as checked.  Because this is called from the grace-period
+	// kthread, also remove the task from the holdout list.
+	t->trc_reader_checked = true;
+	trc_del_holdout(t);
+
+	// If the task is in a read-side critical section, set up
+	// its state so that it will awaken the grace-period kthread upon
+	// exit from that critical section.
+	if (unlikely(t->trc_reader_nesting)) {
+		atomic_inc(&trc_n_readers_need_end); // One more to wait on.
+		WARN_ON_ONCE(t->trc_reader_need_end);
+		WRITE_ONCE(t->trc_reader_need_end, true);
+	}
+}
+
+/* Attempt to extract the state for the specified task. */
+static void trc_wait_for_one_reader(struct task_struct *t,
+				    struct list_head *bhp)
+{
+	int cpu;
+
+	// If a previous IPI is still in flight, let it complete.
+	if (smp_load_acquire(&t->trc_ipi_to_cpu) != -1) // Order IPI
+		return;
+
+	// The current task had better be in a quiescent state.
+	if (t == current) {
+		t->trc_reader_checked = true;
+		trc_del_holdout(t);
+		WARN_ON_ONCE(t->trc_reader_nesting);
+		return;
+	}
+
+	// Attempt to nail down the task for inspection.
+	if (try_invoke_on_nonrunning_task(t, trc_inspect_reader_notrunning, t))
+		return;
+
+	// If currently running, send an IPI.  Either way, add to list.
+	trc_add_holdout(t, bhp);
+	if (task_curr(t)) {
+		// The task is currently running, so try IPIing it.
+		cpu = task_cpu(t);
+
+		// If there is already an IPI outstanding, let it happen.
+		if (per_cpu(trc_ipi_to_cpu, cpu) || t->trc_ipi_to_cpu >= 0)
+			return;
+
+		atomic_inc(&trc_n_readers_need_end);
+		per_cpu(trc_ipi_to_cpu, cpu) = true;
+		t->trc_ipi_to_cpu = cpu;
+		if (smp_call_function_single(cpu,
+					     trc_read_check_handler, t, 0)) {
+			per_cpu(trc_ipi_to_cpu, cpu) = false;
+			t->trc_ipi_to_cpu = -1;
+		}
+	}
+}
+
+/* Initialize for a new RCU-tasks-trace grace period. */
+static void rcu_tasks_trace_pregp_step(void)
+{
+	int cpu;
+
+	// Wait for CPU-hotplug paths to complete.
+	cpus_read_lock();
+	cpus_read_unlock();
+
+	// Allow for fast-acting IPIs.
+	atomic_set(&trc_n_readers_need_end, 1);
+
+	// There shouldn't be any old IPIs, but...
+	for_each_possible_cpu(cpu)
+		WARN_ON_ONCE(per_cpu(trc_ipi_to_cpu, cpu));
+}
+
+/* Do first-round processing for the specified task. */
+static void rcu_tasks_trace_pertask(struct task_struct *t,
+				    struct list_head *hop)
+{
+	WRITE_ONCE(t->trc_reader_need_end, false);
+	t->trc_reader_checked = false;
+	t->trc_ipi_to_cpu = -1;
+	trc_wait_for_one_reader(t, hop);
+}
+
+/* Do intermediate processing between task and holdout scans. */
+static void rcu_tasks_trace_postscan(void)
+{
+	// Wait for late-stage exiting tasks to finish exiting.
+	// These might have passed the call to exit_tasks_rcu_finish().
+	synchronize_rcu();
+	// Any tasks that exit after this point will set ->trc_reader_checked.
+}
+
+/* Do one scan of the holdout list. */
+static void check_all_holdout_tasks_trace(struct list_head *hop,
+					  bool ndrpt, bool *frptp)
+{
+	struct task_struct *g, *t;
+
+	list_for_each_entry_safe(t, g, hop, trc_holdout_list) {
+		// If safe and needed, try to check the current task.
+		if (READ_ONCE(t->trc_ipi_to_cpu) == -1 &&
+		    !READ_ONCE(t->trc_reader_checked))
+			trc_wait_for_one_reader(t, hop);
+
+		// If check succeeded, remove this task from the list.
+		if (READ_ONCE(t->trc_reader_checked))
+			trc_del_holdout(t);
+	}
+}
+
+/* Wait for grace period to complete and provide ordering. */
+static void rcu_tasks_trace_postgp(void)
+{
+	// Remove the safety count.
+	smp_mb__before_atomic();  // Order vs. earlier atomics
+	atomic_dec(&trc_n_readers_need_end);
+	smp_mb__after_atomic();  // Order vs. later atomics
+
+	// Wait for readers.
+	wait_event_idle_exclusive(trc_wait,
+				  atomic_read(&trc_n_readers_need_end) == 0);
+
+	smp_mb(); // Caller's code must be ordered after wakeup.
+}
+
+/* Report any needed quiescent state for this exiting task. */
+void exit_tasks_rcu_finish_trace(struct task_struct *t)
+{
+	WRITE_ONCE(t->trc_reader_checked, true);
+	WARN_ON_ONCE(t->trc_reader_nesting);
+	WRITE_ONCE(t->trc_reader_nesting, 0);
+	if (WARN_ON_ONCE(READ_ONCE(t->trc_reader_need_end)))
+		rcu_read_unlock_trace_special(t);
+}
+
+void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
+DEFINE_RCU_TASKS(rcu_tasks_trace, rcu_tasks_wait_gp, call_rcu_tasks_trace,
+		 "RCU Tasks Trace");
+
+/**
+ * call_rcu_tasks_trace() - Queue a callback for a trace task-based grace period
+ * @rhp: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full trace
+ * rcu-tasks grace period elapses, in other words after all currently
+ * executing trace rcu-tasks read-side critical sections have completed.
+ * Unlike the vanilla Tasks RCU flavor, this variant does have explicit
+ * read-side primitives: its read-side critical sections are delimited
+ * by calls to rcu_read_lock_trace() and rcu_read_unlock_trace().  This
+ * primitive is intended to protect tracing hooks, including those of
+ * BPF, rather than to determine that all tasks have passed through a
+ * context switch or similar safe state.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
+ */
+void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func)
+{
+	call_rcu_tasks_generic(rhp, func, &rcu_tasks_trace);
+}
+EXPORT_SYMBOL_GPL(call_rcu_tasks_trace);
+
+/**
+ * synchronize_rcu_tasks_trace - wait for a trace rcu-tasks grace period
+ *
+ * Control will return to the caller some time after a trace rcu-tasks
+ * grace period has elapsed, in other words after all currently
+ * executing trace rcu-tasks read-side critical sections have completed.
+ * These read-side critical sections are delimited by calls to
+ * rcu_read_lock_trace() and rcu_read_unlock_trace(), not by calls to
+ * schedule(), cond_resched_tasks_rcu_qs(), or userspace execution.
+ *
+ * This is a very specialized primitive, intended only for a few uses in
+ * tracing and other situations requiring manipulation of function preambles
+ * and profiling hooks.  The synchronize_rcu_tasks_trace() function is not
+ * (yet) intended for heavy use from multiple CPUs.
+ *
+ * See the description of synchronize_rcu() for more detailed information
+ * on memory ordering guarantees.
+ */
+void synchronize_rcu_tasks_trace(void)
+{
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_trace_lock_map), "Illegal synchronize_rcu_tasks_trace() in RCU Tasks Trace read-side critical section");
+	synchronize_rcu_tasks_generic(&rcu_tasks_trace);
+}
+EXPORT_SYMBOL_GPL(synchronize_rcu_tasks_trace);
+
+/**
+ * rcu_barrier_tasks_trace - Wait for in-flight call_rcu_tasks_trace() callbacks.
+ *
+ * Although the current implementation is guaranteed to wait, it is not
+ * obligated to, for example, if there are no pending callbacks.
+ */
+void rcu_barrier_tasks_trace(void)
+{
+	/* There is only one callback queue, so this is easy.  ;-) */
+	synchronize_rcu_tasks_trace();
+}
+EXPORT_SYMBOL_GPL(rcu_barrier_tasks_trace);
+
+static int __init rcu_spawn_tasks_trace_kthread(void)
+{
+	rcu_tasks_trace.pregp_func = rcu_tasks_trace_pregp_step;
+	rcu_tasks_trace.pertask_func = rcu_tasks_trace_pertask;
+	rcu_tasks_trace.postscan_func = rcu_tasks_trace_postscan;
+	rcu_tasks_trace.holdouts_func = check_all_holdout_tasks_trace;
+	rcu_tasks_trace.postgp_func = rcu_tasks_trace_postgp;
+	rcu_spawn_tasks_kthread_generic(&rcu_tasks_trace);
+	return 0;
+}
+core_initcall(rcu_spawn_tasks_trace_kthread);
+
+#else /* #ifdef CONFIG_TASKS_TRACE_RCU */
+void exit_tasks_rcu_finish_trace(struct task_struct *t) { }
+#endif /* #else #ifdef CONFIG_TASKS_TRACE_RCU */
-- 
2.9.5
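
[Editorial sketch, not part of the patch.]  The "safety count" used by
rcu_tasks_trace_pregp_step() and rcu_tasks_trace_postgp() above can be
modeled in userspace roughly as follows.  This is a simplified,
single-threaded illustration under stated assumptions: the struct
gp_model type and the gp_model_*() helpers are hypothetical names, the
wakeup is reduced to a flag, and the final decrement is folded into the
same helper that readers use (the patch instead does a plain
atomic_dec() followed by wait_event_idle_exclusive()).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

// Hypothetical userspace model of the safety count; not kernel code.
struct gp_model {
	atomic_int readers_need_end;	// Models trc_n_readers_need_end.
	bool gp_kthread_woken;		// Models wake_up(&trc_wait).
};

// Start of grace period: bias the count to 1 so that it cannot reach
// zero while the task scan is still adding readers.
static void gp_model_begin(struct gp_model *gp)
{
	atomic_store(&gp->readers_need_end, 1);
	gp->gp_kthread_woken = false;
}

// The scan found a task that must be waited on.
static void gp_model_add_reader(struct gp_model *gp)
{
	atomic_fetch_add(&gp->readers_need_end, 1);
}

// Drop one reference.  Whoever brings the count to zero records the
// wakeup.  Returns true if this call was the one that reached zero.
static bool gp_model_put(struct gp_model *gp)
{
	int old = atomic_fetch_sub(&gp->readers_need_end, 1);

	assert(old > 0);	// The count must never go negative.
	if (old == 1) {
		gp->gp_kthread_woken = true;
		return true;
	}
	return false;
}
```

With the bias in place, readers that finish while the scan is still in
progress cannot trigger a premature wakeup; only removal of the bias
after the scan (or the last reader after that) can bring the count to
zero.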


* [PATCH RFC tip/core/rcu 15/16] rcutorture: Add torture tests for RCU Tasks Trace
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                   ` (13 preceding siblings ...)
  2020-03-12 18:17 ` [PATCH RFC tip/core/rcu 14/16] rcu: Add an RCU Tasks Trace to simplify protection of tracing hooks paulmck
@ 2020-03-12 18:17 ` paulmck
  2020-03-12 18:17 ` [PATCH RFC tip/core/rcu 16/16] rcu-tasks: Add stall warnings " paulmck
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-12 18:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds the definitions required to torture the tracing flavor
of RCU tasks.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/Kconfig.debug                           |  2 +
 kernel/rcu/rcu.h                                   |  1 +
 kernel/rcu/rcutorture.c                            | 43 +++++++++++++++++++++-
 .../selftests/rcutorture/configs/rcu/CFLIST        |  1 +
 .../selftests/rcutorture/configs/rcu/TRACE01       | 10 +++++
 .../selftests/rcutorture/configs/rcu/TRACE01.boot  |  1 +
 6 files changed, 57 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TRACE01
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot

diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index b15a3bd..a4db41d 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -25,6 +25,7 @@ config RCU_PERF_TEST
 	select SRCU
 	select TASKS_RCU
 	select TASKS_RUDE_RCU
+	select TASKS_TRACE_RCU
 	default n
 	help
 	  This option provides a kernel module that runs performance
@@ -43,6 +44,7 @@ config RCU_TORTURE_TEST
 	select SRCU
 	select TASKS_RCU
 	select TASKS_RUDE_RCU
+	select TASKS_TRACE_RCU
 	default n
 	help
 	  This option provides a kernel module that runs torture tests
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index c574620..72903867 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -442,6 +442,7 @@ enum rcutorture_type {
 	RCU_FLAVOR,
 	RCU_TASKS_FLAVOR,
 	RCU_TASKS_RUDE_FLAVOR,
+	RCU_TASKS_TRACING_FLAVOR,
 	RCU_TRIVIAL_FLAVOR,
 	SRCU_FLAVOR,
 	INVALID_RCU_FLAVOR
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 1acafc5..ae26ea8 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -45,6 +45,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/oom.h>
 #include <linux/tick.h>
+#include <linux/rcupdate_trace.h>
 
 #include "rcu.h"
 
@@ -758,6 +759,44 @@ static struct rcu_torture_ops tasks_rude_ops = {
 	.name		= "tasks-rude"
 };
 
+/*
+ * Definitions for tracing RCU-tasks torture testing.
+ */
+
+static int tasks_tracing_torture_read_lock(void)
+{
+	rcu_read_lock_trace();
+	return 0;
+}
+
+static void tasks_tracing_torture_read_unlock(int idx)
+{
+	rcu_read_unlock_trace();
+}
+
+static void rcu_tasks_tracing_torture_deferred_free(struct rcu_torture *p)
+{
+	call_rcu_tasks_trace(&p->rtort_rcu, rcu_torture_cb);
+}
+
+static struct rcu_torture_ops tasks_tracing_ops = {
+	.ttype		= RCU_TASKS_TRACING_FLAVOR,
+	.init		= rcu_sync_torture_init,
+	.readlock	= tasks_tracing_torture_read_lock,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= tasks_tracing_torture_read_unlock,
+	.get_gp_seq	= rcu_no_completed,
+	.deferred_free	= rcu_tasks_tracing_torture_deferred_free,
+	.sync		= synchronize_rcu_tasks_trace,
+	.exp_sync	= synchronize_rcu_tasks_trace,
+	.call		= call_rcu_tasks_trace,
+	.cb_barrier	= rcu_barrier_tasks_trace,
+	.fqs		= NULL,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.name		= "tasks-tracing"
+};
+
 static unsigned long rcutorture_seq_diff(unsigned long new, unsigned long old)
 {
 	if (!cur_ops->gp_diff)
@@ -1316,6 +1355,7 @@ static bool rcu_torture_one_read(struct torture_random_state *trsp)
 				  rcu_read_lock_bh_held() ||
 				  rcu_read_lock_sched_held() ||
 				  srcu_read_lock_held(srcu_ctlp) ||
+				  rcu_read_lock_trace_held() ||
 				  torturing_tasks());
 	if (p == NULL) {
 		/* Wait for rcu_torture_writer to get underway */
@@ -2435,7 +2475,8 @@ rcu_torture_init(void)
 	int firsterr = 0;
 	static struct rcu_torture_ops *torture_ops[] = {
 		&rcu_ops, &rcu_busted_ops, &srcu_ops, &srcud_ops,
-		&busted_srcud_ops, &tasks_ops, &tasks_rude_ops, &trivial_ops,
+		&busted_srcud_ops, &tasks_ops, &tasks_rude_ops,
+		&tasks_tracing_ops, &trivial_ops,
 	};
 
 	if (!torture_init_begin(torture_type, verbose))
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
index ec0c72f..dfb1817 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
+++ b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
@@ -15,3 +15,4 @@ TASKS01
 TASKS02
 TASKS03
 RUDE01
+TRACE01
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TRACE01 b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01
new file mode 100644
index 0000000..078e2c1
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01
@@ -0,0 +1,10 @@
+CONFIG_SMP=y
+CONFIG_NR_CPUS=4
+CONFIG_HOTPLUG_CPU=y
+CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_VOLUNTARY=n
+CONFIG_PREEMPT=n
+CONFIG_DEBUG_LOCK_ALLOC=y
+CONFIG_PROVE_LOCKING=y
+#CHECK#CONFIG_PROVE_RCU=y
+CONFIG_RCU_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot
new file mode 100644
index 0000000..9675ad6
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot
@@ -0,0 +1 @@
+rcutorture.torture_type=tasks-tracing
-- 
2.9.5
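
[Editorial sketch, not part of the patch.]  The patch above plugs the
new flavor into rcutorture by adding one more ops structure to a table
and selecting it with rcutorture.torture_type=tasks-tracing.  A much
reduced userspace model of that pattern might look as follows.  All
names here (struct flavor_ops, the demo_*() functions) are hypothetical,
and the lookup by strcmp() on ->name is an assumption about how the
selection is done rather than a copy of the rcutorture code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical, much-reduced stand-in for struct rcu_torture_ops.
struct flavor_ops {
	int (*readlock)(void);
	void (*readunlock)(int idx);
	const char *name;
};

static int demo_nesting;	// Models a read-side nesting count.

static int demo_read_lock(void)
{
	demo_nesting++;
	return 0;	// The index is unused by this flavor, as in the patch.
}

static void demo_read_unlock(int idx)
{
	(void)idx;
	assert(demo_nesting > 0);	// Unlock must pair with a lock.
	demo_nesting--;
}

static struct flavor_ops demo_tracing_ops = {
	.readlock	= demo_read_lock,
	.readunlock	= demo_read_unlock,
	.name		= "tasks-tracing",
};

static struct flavor_ops *demo_ops_table[] = { &demo_tracing_ops };

// Return the ops structure whose ->name matches the requested type,
// or NULL if there is no such flavor.
static struct flavor_ops *demo_find_ops(const char *torture_type)
{
	size_t i;
	size_t n = sizeof(demo_ops_table) / sizeof(demo_ops_table[0]);

	for (i = 0; i < n; i++)
		if (strcmp(torture_type, demo_ops_table[i]->name) == 0)
			return demo_ops_table[i];
	return NULL;
}
```

The point of the indirection is that the torture-test core only ever
calls through the selected ops pointer, so adding a flavor needs no
changes to the reader and writer loops themselves.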


* [PATCH RFC tip/core/rcu 16/16] rcu-tasks: Add stall warnings for RCU Tasks Trace
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                   ` (14 preceding siblings ...)
  2020-03-12 18:17 ` [PATCH RFC tip/core/rcu 15/16] rcutorture: Add torture tests for RCU Tasks Trace paulmck
@ 2020-03-12 18:17 ` paulmck
  2020-03-13 14:41 ` [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Frederic Weisbecker
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
  17 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-12 18:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds RCU CPU stall warnings for RCU Tasks Trace.  These
dump out any tasks blocking the current grace period, as well as any
CPUs that have not responded to an IPI request.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 5d7bd48..8cffc1c 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -788,8 +788,9 @@ static void rcu_tasks_trace_postscan(void)
 
 /* Do one scan of the holdout list. */
 static void check_all_holdout_tasks_trace(struct list_head *hop,
-					  bool ndrpt, bool *frptp)
+					  bool needreport, bool *firstreport)
 {
+	int cpu;
 	struct task_struct *g, *t;
 
 	list_for_each_entry_safe(t, g, hop, trc_holdout_list) {
@@ -799,9 +800,32 @@ static void check_all_holdout_tasks_trace(struct list_head *hop,
 			trc_wait_for_one_reader(t, hop);
 
 		// If check succeeded, remove this task from the list.
-		if (READ_ONCE(t->trc_reader_checked))
+		if (READ_ONCE(t->trc_reader_checked)) {
 			trc_del_holdout(t);
+			continue;
+		} else if (!needreport) {
+			continue;
+		}
+		if (*firstreport) {
+			pr_err("INFO: rcu_tasks_trace detected stalls on tasks:\n");
+			*firstreport = false;
+		}
+		cpu = task_cpu(t);
+		pr_alert("%p: %c%c%c nesting: %d%c cpu: %d\n",
+			 t,
+			 ".I"[READ_ONCE(t->trc_ipi_to_cpu) >= 0],
+			 ".i"[is_idle_task(t)],
+			 "N."[cpu < 0 || !tick_nohz_full_cpu(cpu)],
+			 t->trc_reader_nesting,
+			 " N"[!!t->trc_reader_need_end],
+			 cpu);
+		sched_show_task(t);
 	}
+	if (!needreport)
+		return;
+	for_each_possible_cpu(cpu)
+		if (per_cpu(trc_ipi_to_cpu, cpu))
+			pr_alert("\tIPI outstanding to CPU %d\n", cpu);
 }
 
 /* Wait for grace period to complete and provide ordering. */
-- 
2.9.5
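
[Editorial sketch, not part of the patch.]  The stall-warning format
above picks each flag character by indexing a two-character string
literal with a 0-or-1 condition.  The following userspace illustration
shows the idiom in isolation.  The struct demo_task_state type and
demo_format_stall() are hypothetical, and this sketch treats CPU 0 as a
valid IPI target, hence the ">= 0" test on the IPI field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

// Hypothetical snapshot of the per-task fields that get printed.
struct demo_task_state {
	int ipi_to_cpu;		// -1 if no IPI is in flight, else target CPU.
	bool is_idle;		// Models is_idle_task(t).
	bool nohz_full;		// Models tick_nohz_full_cpu(cpu).
	int nesting;		// Models t->trc_reader_nesting.
	bool need_end;		// Models t->trc_reader_need_end.
};

// Format one stall-warning line into buf.  Each "xy"[cond] expression
// yields 'x' when cond is 0 and 'y' when cond is 1.
static int demo_format_stall(char *buf, size_t len,
			     const struct demo_task_state *s, int cpu)
{
	return snprintf(buf, len, "%c%c%c nesting: %d%c cpu: %d",
			".I"[s->ipi_to_cpu >= 0],
			".i"[s->is_idle],
			"N."[cpu < 0 || !s->nohz_full],
			s->nesting,
			" N"[s->need_end],
			cpu);
}
```

Because each condition is a comparison, a logical negation, or a bool,
the index is always 0 or 1 and so never runs past the string literal.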


* Re: [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                   ` (15 preceding siblings ...)
  2020-03-12 18:17 ` [PATCH RFC tip/core/rcu 16/16] rcu-tasks: Add stall warnings " paulmck
@ 2020-03-13 14:41 ` Frederic Weisbecker
  2020-03-13 15:42   ` Paul E. McKenney
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
  17 siblings, 1 reply; 171+ messages in thread
From: Frederic Weisbecker @ 2020-03-13 14:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: mutt, rcu, linux-kernel, kernel-team, mingo, jiangshanlai,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, peterz, rostedt,
	dhowells, edumazet, fweisbec, oleg, joel

On Thu, Mar 12, 2020 at 11:16:18AM -0700, Paul E. McKenney wrote:
> Hello!
> 
> This series provides two variants of Tasks RCU, a rude variant inspired
> by Steven Rostedt's use of schedule_on_each_cpu(), and a tracing variant
> requested by the BPF folks and perhaps also of use for other tracing
> use cases.
> 
> The tracing variant has explicit read-side markers to permit finite grace
> periods even given in-kernel loops in PREEMPT=n builds.  It also protects
> code in the idle loop, on exception entry/exit paths, and on the various
> CPU-hotplug online/offline code paths, thus having protection properties
> similar to SRCU.  However, unlike SRCU, this variant avoids expensive
> instructions in the read-side primitives, thus having read-side overhead
> similar to that of preemptible RCU.
> 
> There are of course downsides.  The grace-period code can send IPIs to
> CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
> It is necessary to scan the full tasklist, much as for Tasks RCU.  There
> is a single callback queue guarded by a single lock, again, much as for
> Tasks RCU.  If needed, these downsides can be at least partially remedied

So what we trade to fix the issues we are having with tracing against extended
grace periods, we lose in CPU isolation. That worries me a bit as tracing can
be thoroughly used with nohz_full and CPU isolation.

* Re: [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline
  2020-03-13 14:41 ` [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Frederic Weisbecker
@ 2020-03-13 15:42   ` Paul E. McKenney
  2020-03-15 17:45     ` Mathieu Desnoyers
  2020-03-16 14:45     ` Frederic Weisbecker
  0 siblings, 2 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-13 15:42 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: mutt, rcu, linux-kernel, kernel-team, mingo, jiangshanlai,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, peterz, rostedt,
	dhowells, edumazet, fweisbec, oleg, joel

On Fri, Mar 13, 2020 at 03:41:46PM +0100, Frederic Weisbecker wrote:
> On Thu, Mar 12, 2020 at 11:16:18AM -0700, Paul E. McKenney wrote:
> > Hello!
> > 
> > This series provides two variants of Tasks RCU, a rude variant inspired
> > by Steven Rostedt's use of schedule_on_each_cpu(), and a tracing variant
> > requested by the BPF folks and perhaps also of use for other tracing
> > use cases.
> > 
> > The tracing variant has explicit read-side markers to permit finite grace
> > periods even given in-kernel loops in PREEMPT=n builds.  It also protects
> > code in the idle loop, on exception entry/exit paths, and on the various
> > CPU-hotplug online/offline code paths, thus having protection properties
> > similar to SRCU.  However, unlike SRCU, this variant avoids expensive
> > instructions in the read-side primitives, thus having read-side overhead
> > similar to that of preemptible RCU.
> > 
> > There are of course downsides.  The grace-period code can send IPIs to
> > CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
> > It is necessary to scan the full tasklist, much as for Tasks RCU.  There
> > is a single callback queue guarded by a single lock, again, much as for
> > Tasks RCU.  If needed, these downsides can be at least partially remedied
> 
> So what we trade to fix the issues we are having with tracing against extended
> grace periods, we lose in CPU isolation. That worries me a bit as tracing can
> be thoroughly used with nohz_full and CPU isolation.

First, disturbing nohz_full CPUs can be avoided by the sysadm simply
refusing to remove tracepoints while sensitive applications are running
on nohz_full CPUs.

Second, for non-CPU-bound real-time programs with mostly-idle CPUs,
I should be able to decrease the likelihood of sending IPIs pretty much
to zero.

Or am I missing something here?

							Thanx, Paul

* Re: [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline
  2020-03-13 15:42   ` Paul E. McKenney
@ 2020-03-15 17:45     ` Mathieu Desnoyers
  2020-03-15 17:59       ` Paul E. McKenney
  2020-03-16 14:45     ` Frederic Weisbecker
  1 sibling, 1 reply; 171+ messages in thread
From: Mathieu Desnoyers @ 2020-03-15 17:45 UTC (permalink / raw)
  To: paulmck
  Cc: Frederic Weisbecker, rcu, linux-kernel, kernel-team, Ingo Molnar,
	Lai Jiangshan, dipankar, Andrew Morton, Josh Triplett,
	Thomas Gleixner, Peter Zijlstra, rostedt, David Howells,
	Eric Dumazet, fweisbec, Oleg Nesterov, Joel Fernandes, Google

----- On Mar 13, 2020, at 11:42 AM, paulmck paulmck@kernel.org wrote:

> On Fri, Mar 13, 2020 at 03:41:46PM +0100, Frederic Weisbecker wrote:
>> On Thu, Mar 12, 2020 at 11:16:18AM -0700, Paul E. McKenney wrote:
>> > Hello!
>> > 
>> > This series provides two variants of Tasks RCU, a rude variant inspired
>> > by Steven Rostedt's use of schedule_on_each_cpu(), and a tracing variant
>> > requested by the BPF folks and perhaps also of use for other tracing
>> > use cases.
>> > 
>> > The tracing variant has explicit read-side markers to permit finite grace
>> > periods even given in-kernel loops in PREEMPT=n builds.  It also protects
>> > code in the idle loop, on exception entry/exit paths, and on the various
>> > CPU-hotplug online/offline code paths, thus having protection properties
>> > similar to SRCU.  However, unlike SRCU, this variant avoids expensive
>> > instructions in the read-side primitives, thus having read-side overhead
>> > similar to that of preemptible RCU.
>> > 
>> > There are of course downsides.  The grace-period code can send IPIs to
>> > CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
>> > It is necessary to scan the full tasklist, much as for Tasks RCU.  There
>> > is a single callback queue guarded by a single lock, again, much as for
>> > Tasks RCU.  If needed, these downsides can be at least partially remedied
>> 
>> So what we trade to fix the issues we are having with tracing against extended
>> grace periods, we lose in CPU isolation. That worries me a bit as tracing can
>> be thoroughly used with nohz_full and CPU isolation.
> 
> First, disturbing nohz_full CPUs can be avoided by the sysadm simply
> refusing to remove tracepoints while sensitive applications are running
> on nohz_full CPUs.

I doubt this approach will survive real-life.

> 
> Second, for non-CPU-bound real-time programs with mostly-idle CPUs,
> I should be able to decrease the likelihood of sending IPIs pretty much
> to zero.
> 
> Or am I missing something here?

I would recommend considering the following alternative for this tracing-rcu
flavor:

- For all CPUs which are not nohz_full:
  - Implement fast RCU read-side which only requires compiler barriers,
  - Use IPIs to each of those CPUs when doing a grace period.

- For all nohz_full CPUs:
  - Dynamically detect CPUs which are nohz_full,
  - Implement slower RCU read-side with memory barriers,
  - No need to issue any IPI to those CPUs when doing the grace period.

This should cover all use-cases: staying fast for the common case, without
disturbing RT workloads.

Thoughts ?

Thanks,

Mathieu




-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline
  2020-03-15 17:45     ` Mathieu Desnoyers
@ 2020-03-15 17:59       ` Paul E. McKenney
  2020-03-16 18:36         ` Steven Rostedt
  0 siblings, 1 reply; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-15 17:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, rcu, linux-kernel, kernel-team, Ingo Molnar,
	Lai Jiangshan, dipankar, Andrew Morton, Josh Triplett,
	Thomas Gleixner, Peter Zijlstra, rostedt, David Howells,
	Eric Dumazet, fweisbec, Oleg Nesterov, Joel Fernandes, Google

On Sun, Mar 15, 2020 at 01:45:05PM -0400, Mathieu Desnoyers wrote:
> ----- On Mar 13, 2020, at 11:42 AM, paulmck paulmck@kernel.org wrote:
> 
> > On Fri, Mar 13, 2020 at 03:41:46PM +0100, Frederic Weisbecker wrote:
> >> On Thu, Mar 12, 2020 at 11:16:18AM -0700, Paul E. McKenney wrote:
> >> > Hello!
> >> > 
> >> > This series provides two variants of Tasks RCU, a rude variant inspired
> >> > by Steven Rostedt's use of schedule_on_each_cpu(), and a tracing variant
> >> > requested by the BPF folks and perhaps also of use for other tracing
> >> > use cases.
> >> > 
> >> > The tracing variant has explicit read-side markers to permit finite grace
> >> > periods even given in-kernel loops in PREEMPT=n builds.  It also protects
> >> > code in the idle loop, on exception entry/exit paths, and on the various
> >> > CPU-hotplug online/offline code paths, thus having protection properties
> >> > similar to SRCU.  However, unlike SRCU, this variant avoids expensive
> >> > instructions in the read-side primitives, thus having read-side overhead
> >> > similar to that of preemptible RCU.
> >> > 
> >> > There are of course downsides.  The grace-period code can send IPIs to
> >> > CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
> >> > It is necessary to scan the full tasklist, much as for Tasks RCU.  There
> >> > is a single callback queue guarded by a single lock, again, much as for
> >> > Tasks RCU.  If needed, these downsides can be at least partially remedied
> >> 
> >> So what we trade to fix the issues we are having with tracing against extended
> >> grace periods, we lose in CPU isolation. That worries me a bit as tracing can
> >> be thoroughly used with nohz_full and CPU isolation.
> > 
> > First, disturbing nohz_full CPUs can be avoided by the sysadm simply
> > refusing to remove tracepoints while sensitive applications are running
> > on nohz_full CPUs.
> 
> I doubt this approach will survive real-life.

Nothing survives real life, at least not indefinitely.  ;-)

> > Second, for non-CPU-bound real-time programs with mostly-idle CPUs,
> > I should be able to decrease the likelihood of sending IPIs pretty much
> > to zero.
> > 
> > Or am I missing something here?
> 
> I would recommend considering the following alternative for this tracing-rcu
> flavor:
> 
> - For all CPUs which are not nohz_full:
>   - Implement fast RCU read-side which only requires compiler barriers,
>   - Use IPIs to each of those CPUs when doing a grace period.
> 
> - For all nohz_full CPUs:
>   - Dynamically detect CPUs which are nohz_full,
>   - Implement slower RCU read-side with memory barriers,
>   - No need to issue any IPI to those CPUs when doing the grace period.
> 
> This should cover all use-cases: staying fast for the common case, without
> disturbing RT workloads.
> 
> Thoughts ?

I will certainly add this to my list of potential solutions, and thank
you for pointing me at it!

							Thanx, Paul

* Re: [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline
  2020-03-13 15:42   ` Paul E. McKenney
  2020-03-15 17:45     ` Mathieu Desnoyers
@ 2020-03-16 14:45     ` Frederic Weisbecker
  2020-03-16 15:39       ` Paul E. McKenney
  1 sibling, 1 reply; 171+ messages in thread
From: Frederic Weisbecker @ 2020-03-16 14:45 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: mutt, rcu, linux-kernel, kernel-team, mingo, jiangshanlai,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, peterz, rostedt,
	dhowells, edumazet, fweisbec, oleg, joel

On Fri, Mar 13, 2020 at 08:42:43AM -0700, Paul E. McKenney wrote:
> On Fri, Mar 13, 2020 at 03:41:46PM +0100, Frederic Weisbecker wrote:
> > On Thu, Mar 12, 2020 at 11:16:18AM -0700, Paul E. McKenney wrote:
> > > Hello!
> > > 
> > > This series provides two variants of Tasks RCU, a rude variant inspired
> > > by Steven Rostedt's use of schedule_on_each_cpu(), and a tracing variant
> > > requested by the BPF folks and perhaps also of use for other tracing
> > > use cases.
> > > 
> > > The tracing variant has explicit read-side markers to permit finite grace
> > > periods even given in-kernel loops in PREEMPT=n builds It also protects
> > > code in the idle loop, on exception entry/exit paths, and on the various
> > > CPU-hotplug online/offline code paths, thus having protection properties
> > > similar to SRCU.  However, unlike SRCU, this variant avoids expensive
> > > instructions in the read-side primitives, thus having read-side overhead
> > > similar to that of preemptible RCU.
> > > 
> > > There are of course downsides.  The grace-period code can send IPIs to
> > > CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
> > > It is necessary to scan the full tasklist, much as for Tasks RCU.  There
> > > is a single callback queue guarded by a single lock, again, much as for
> > > Tasks RCU.  If needed, these downsides can be at least partially remedied
> > 
> > So what we trade to fix the issues we are having with tracing against extended
> > grace periods, we lose in CPU isolation. That worries me a bit as tracing can
> > be thoroughly used with nohz_full and CPU isolation.
> 
> First, disturbing nohz_full CPUs can be avoided by the sysadm simply
> refusing to remove tracepoints while sensitive applications are running
> on nohz_full CPUs.

So, in that case we'll need to modify tools such as perf to avoid
releasing the related buffers until we are ready to do so.

That's possible, but it's kind of an ABI breakage. Also, what if there is a
long-running service on that nohz_full CPU polling the network card...?

Thanks.


* Re: [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline
  2020-03-16 14:45     ` Frederic Weisbecker
@ 2020-03-16 15:39       ` Paul E. McKenney
  0 siblings, 0 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-16 15:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: mutt, rcu, linux-kernel, kernel-team, mingo, jiangshanlai,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, peterz, rostedt,
	dhowells, edumazet, fweisbec, oleg, joel

On Mon, Mar 16, 2020 at 03:45:36PM +0100, Frederic Weisbecker wrote:
> On Fri, Mar 13, 2020 at 08:42:43AM -0700, Paul E. McKenney wrote:
> > On Fri, Mar 13, 2020 at 03:41:46PM +0100, Frederic Weisbecker wrote:
> > > On Thu, Mar 12, 2020 at 11:16:18AM -0700, Paul E. McKenney wrote:
> > > > Hello!
> > > > 
> > > > This series provides two variants of Tasks RCU, a rude variant inspired
> > > > by Steven Rostedt's use of schedule_on_each_cpu(), and a tracing variant
> > > > requested by the BPF folks and perhaps also of use for other tracing
> > > > use cases.
> > > > 
> > > > The tracing variant has explicit read-side markers to permit finite grace
> > > > periods even given in-kernel loops in PREEMPT=n builds It also protects
> > > > code in the idle loop, on exception entry/exit paths, and on the various
> > > > CPU-hotplug online/offline code paths, thus having protection properties
> > > > similar to SRCU.  However, unlike SRCU, this variant avoids expensive
> > > > instructions in the read-side primitives, thus having read-side overhead
> > > > similar to that of preemptible RCU.
> > > > 
> > > > There are of course downsides.  The grace-period code can send IPIs to
> > > > CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
> > > > It is necessary to scan the full tasklist, much as for Tasks RCU.  There
> > > > is a single callback queue guarded by a single lock, again, much as for
> > > > Tasks RCU.  If needed, these downsides can be at least partially remedied
> > > 
> > > So what we trade to fix the issues we are having with tracing against extended
> > > grace periods, we lose in CPU isolation. That worries me a bit as tracing can
> > > be thoroughly used with nohz_full and CPU isolation.
> > 
> > First, disturbing nohz_full CPUs can be avoided by the sysadm simply
> > refusing to remove tracepoints while sensitive applications are running
> > on nohz_full CPUs.
> 
> So, in that case we'll need to modify tools such as perf to avoid
> releasing the related buffers until we are ready to do so.
> 
> That's possible, but it's kind of an ABI breakage. Also, what if there is a
> long-running service on that nohz_full CPU polling the network card...?

In the near term, I do admit that Mathieu's point about using smp_mb()
in readers but only on nohz_full CPUs is attractive.

I have some other ideas, but simplicity has its advantages, and if no
one complains, perhaps those advantages are also good for the long term.
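
A minimal userspace sketch of that idea follows, assuming C11 atomics as
stand-ins for the kernel primitives: atomic_thread_fence() plays the role
of smp_mb() and atomic_signal_fence() the role of barrier().  All names
here (cpu_state, trace_read_lock(), gp_count_ipis(), and so on) are
hypothetical and do not appear in the series.  The model only shows which
CPUs pay for a full barrier on the read side and which CPUs would instead
need an IPI from the grace-period code.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical per-CPU state; not taken from the patch series. */
struct cpu_state {
	bool nohz_full;		/* Must this CPU be left undisturbed? */
	atomic_int nesting;	/* Read-side nesting depth. */
};

static struct cpu_state cpus[NR_CPUS];

/* Enter a reader: full barrier only on nohz_full CPUs. */
static void trace_read_lock(int cpu)
{
	struct cpu_state *cs = &cpus[cpu];

	atomic_fetch_add_explicit(&cs->nesting, 1, memory_order_relaxed);
	if (cs->nohz_full)
		atomic_thread_fence(memory_order_seq_cst); /* models smp_mb() */
	else
		atomic_signal_fence(memory_order_seq_cst); /* models barrier() */
}

/* Exit a reader: order the critical section before the decrement. */
static void trace_read_unlock(int cpu)
{
	struct cpu_state *cs = &cpus[cpu];

	if (cs->nohz_full)
		atomic_thread_fence(memory_order_seq_cst); /* models smp_mb() */
	else
		atomic_signal_fence(memory_order_seq_cst); /* models barrier() */
	atomic_fetch_sub_explicit(&cs->nesting, 1, memory_order_relaxed);
}

/* Updater-side check of one CPU's reader state. */
static bool cpu_is_quiescent(int cpu)
{
	return atomic_load(&cpus[cpu].nesting) == 0;
}

/* Count the CPUs the grace period would have to interrupt. */
static int gp_count_ipis(void)
{
	int cpu, n = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (!cpus[cpu].nohz_full)
			n++;	/* Fast readers: an IPI supplies the ordering. */
	return n;
}
```

In this model the updater can sample a nohz_full CPU's nesting count
remotely because that CPU's readers executed full barriers, while the
remaining CPUs keep barrier-free readers at the cost of receiving an IPI.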

							Thanx, Paul


* Re: [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline
  2020-03-15 17:59       ` Paul E. McKenney
@ 2020-03-16 18:36         ` Steven Rostedt
  2020-03-16 18:52           ` Paul E. McKenney
  0 siblings, 1 reply; 171+ messages in thread
From: Steven Rostedt @ 2020-03-16 18:36 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mathieu Desnoyers, Frederic Weisbecker, rcu, linux-kernel,
	kernel-team, Ingo Molnar, Lai Jiangshan, dipankar, Andrew Morton,
	Josh Triplett, Thomas Gleixner, Peter Zijlstra, David Howells,
	Eric Dumazet, fweisbec, Oleg Nesterov, Joel Fernandes, Google

On Sun, 15 Mar 2020 10:59:21 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> Nothing survives real life

  #coronavirus!

-- Steve


* Re: [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline
  2020-03-16 18:36         ` Steven Rostedt
@ 2020-03-16 18:52           ` Paul E. McKenney
  0 siblings, 0 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-16 18:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Frederic Weisbecker, rcu, linux-kernel,
	kernel-team, Ingo Molnar, Lai Jiangshan, dipankar, Andrew Morton,
	Josh Triplett, Thomas Gleixner, Peter Zijlstra, David Howells,
	Eric Dumazet, fweisbec, Oleg Nesterov, Joel Fernandes, Google

On Mon, Mar 16, 2020 at 02:36:06PM -0400, Steven Rostedt wrote:
> On Sun, 15 Mar 2020 10:59:21 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > Nothing survives real life
> 
>   #coronavirus!

Heh!

But something will eventually do coronavirus in.  Hopefully one of those
things is my immune system.  :-/

							Thanx, Paul


* Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
  2020-03-12 18:16 ` [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant paulmck
@ 2020-03-16 19:47   ` Joel Fernandes
  2020-03-16 20:17     ` Joel Fernandes
  2020-03-16 20:29     ` Paul E. McKenney
  0 siblings, 2 replies; 171+ messages in thread
From: Joel Fernandes @ 2020-03-16 19:47 UTC (permalink / raw)
  To: paulmck
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg

On Thu, Mar 12, 2020 at 11:16:55AM -0700, paulmck@kernel.org wrote:
> From: "Paul E. McKenney" <paulmck@kernel.org>
> 
> This commit adds a "rude" variant of RCU-tasks that has as quiescent
> states schedule(), cond_resched_tasks_rcu_qs(), userspace execution,
> and (in theory, anyway) cond_resched().  Updates make use of IPIs and
> force an IPI and a context switch on each online CPU.  This variant
> is useful in some situations in tracing.

Would it be possible to better clarify that the "rude version" works only
from preempt-disabled regions? Is that also true for the "non-rude" version?

Also, it would be good to clarify in the cover letter how these new
flavors relate to the existing Tasks-RCU implementation.

In the existing one, a quiescent state is a task updating its context switch
counters such that it went to sleep at least once, implying there is no
chance it is on an about-to-be-destroyed trampoline.

However, here we are trying to determine whether a task is no longer on an
RQ (which I gleaned from the first patch). That sounds very similar; would the
context switch counters not help in that determination as well? If it is OK,
it would be good to describe in the cover letter exactly what a quiescent
state is and exactly what a reader section is, for both the non-rude and
rude versions. Thanks!

thanks,

 - Joel



> 
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> ---
>  include/linux/rcupdate.h |  3 ++
>  kernel/rcu/Kconfig       | 12 +++++-
>  kernel/rcu/tasks.h       | 99 ++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 113 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 5523145..2be97a8 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -37,6 +37,7 @@
>  /* Exported common interfaces */
>  void call_rcu(struct rcu_head *head, rcu_callback_t func);
>  void rcu_barrier_tasks(void);
> +void rcu_barrier_tasks_rude(void);
>  void synchronize_rcu(void);
>  
>  #ifdef CONFIG_PREEMPT_RCU
> @@ -138,6 +139,8 @@ static inline void rcu_init_nohz(void) { }
>  #define rcu_note_voluntary_context_switch(t) rcu_tasks_qs(t)
>  void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
>  void synchronize_rcu_tasks(void);
> +void call_rcu_tasks_rude(struct rcu_head *head, rcu_callback_t func);
> +void synchronize_rcu_tasks_rude(void);
>  void exit_tasks_rcu_start(void);
>  void exit_tasks_rcu_finish(void);
>  #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index 38475d0..0d43ec1 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -71,7 +71,7 @@ config TREE_SRCU
>  	  This option selects the full-fledged version of SRCU.
>  
>  config TASKS_RCU_GENERIC
> -	def_bool TASKS_RCU
> +	def_bool TASKS_RCU || TASKS_RUDE_RCU
>  	select SRCU
>  	help
>  	  This option enables generic infrastructure code supporting
> @@ -84,6 +84,16 @@ config TASKS_RCU
>  	  only voluntary context switch (not preemption!), idle, and
>  	  user-mode execution as quiescent states.  Not for manual selection.
>  
> +config TASKS_RUDE_RCU
> +	def_bool 0
> +	default n
> +	help
> +	  This option enables a task-based RCU implementation that uses
> +	  only context switch (including preemption) and user-mode
> +	  execution as quiescent states.  It forces IPIs and context
> +	  switches on all online CPUs, including idle ones, so use
> +	  with caution.  Not for manual selection.
> +
>  config RCU_STALL_COMMON
>  	def_bool TREE_RCU
>  	help
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index d77921e..1d25c50 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -180,6 +180,9 @@ static void __init rcu_tasks_bootup_oddness(void)
>  	else
>  		pr_info("\tTasks RCU enabled.\n");
>  #endif /* #ifdef CONFIG_TASKS_RCU */
> +#ifdef CONFIG_TASKS_RUDE_RCU
> +	pr_info("\tRude variant of Tasks RCU enabled.\n");
> +#endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
>  }
>  
>  #endif /* #ifndef CONFIG_TINY_RCU */
> @@ -410,3 +413,99 @@ static int __init rcu_spawn_tasks_kthread(void)
>  core_initcall(rcu_spawn_tasks_kthread);
>  
>  #endif /* #ifdef CONFIG_TASKS_RCU */
> +
> +#ifdef CONFIG_TASKS_RUDE_RCU
> +
> +////////////////////////////////////////////////////////////////////////
> +//
> +// "Rude" variant of Tasks RCU, inspired by Steve Rostedt's trick of
> +// passing an empty function to schedule_on_each_cpu().  This approach
> +// provides an asynchronous call_rcu_tasks_rude() API and batching of
> +// concurrent calls to the synchronous synchronize_rcu_tasks_rude() API.
> +// This sends IPIs far and wide and induces otherwise unnecessary context
> +// switches on all online CPUs, whether idle or not.
> +
> +// Empty function to allow workqueues to force a context switch.
> +static void rcu_tasks_be_rude(struct work_struct *work)
> +{
> +}
> +
> +// Wait for one rude RCU-tasks grace period.
> +static void rcu_tasks_rude_wait_gp(struct rcu_tasks *rtp)
> +{
> +	schedule_on_each_cpu(rcu_tasks_be_rude);
> +}
> +EXPORT_SYMBOL_GPL(rcu_tasks_rude_wait_gp);
> +
> +void call_rcu_tasks_rude(struct rcu_head *rhp, rcu_callback_t func);
> +DEFINE_RCU_TASKS(rcu_tasks_rude, rcu_tasks_rude_wait_gp, call_rcu_tasks_rude);
> +
> +/**
> + * call_rcu_tasks_rude() - Queue a callback for a rude task-based grace period
> + * @rhp: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all currently executing RCU
> + * read-side critical sections have completed. call_rcu_tasks_rude()
> + * assumes that the read-side critical sections end at context switch,
> + * cond_resched_tasks_rcu_qs(), or transition to usermode execution.  As such,
> + * there are no read-side primitives analogous to rcu_read_lock() and
> + * rcu_read_unlock() because this primitive is intended to determine
> + * that all tasks have passed through a safe state, not so much for
> + * data-structure synchronization.
> + *
> + * See the description of call_rcu() for more detailed information on
> + * memory ordering guarantees.
> + */
> +void call_rcu_tasks_rude(struct rcu_head *rhp, rcu_callback_t func)
> +{
> +	call_rcu_tasks_generic(rhp, func, &rcu_tasks_rude);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_tasks_rude);
> +
> +/**
> + * synchronize_rcu_tasks_rude - wait for a rude rcu-tasks grace period
> + *
> + * Control will return to the caller some time after a rude rcu-tasks
> + * grace period has elapsed, in other words after all currently
> + * executing rcu-tasks read-side critical sections have elapsed.  These
> + * read-side critical sections are delimited by calls to schedule(),
> + * cond_resched_tasks_rcu_qs(), userspace execution, and (in theory,
> + * anyway) cond_resched().
> + *
> + * This is a very specialized primitive, intended only for a few uses in
> + * tracing and other situations requiring manipulation of function preambles
> + * and profiling hooks.  The synchronize_rcu_tasks_rude() function is not
> + * (yet) intended for heavy use from multiple CPUs.
> + *
> + * See the description of synchronize_rcu() for more detailed information
> + * on memory ordering guarantees.
> + */
> +void synchronize_rcu_tasks_rude(void)
> +{
> +	synchronize_rcu_tasks_generic(&rcu_tasks_rude);
> +}
> +EXPORT_SYMBOL_GPL(synchronize_rcu_tasks_rude);
> +
> +/**
> + * rcu_barrier_tasks_rude - Wait for in-flight call_rcu_tasks_rude() callbacks.
> + *
> + * Although the current implementation is guaranteed to wait, it is not
> + * obligated to, for example, if there are no pending callbacks.
> + */
> +void rcu_barrier_tasks_rude(void)
> +{
> +	/* There is only one callback queue, so this is easy.  ;-) */
> +	synchronize_rcu_tasks_rude();
> +}
> +EXPORT_SYMBOL_GPL(rcu_barrier_tasks_rude);
> +
> +static int __init rcu_spawn_tasks_rude_kthread(void)
> +{
> +	rcu_spawn_tasks_kthread_generic(&rcu_tasks_rude);
> +	return 0;
> +}
> +core_initcall(rcu_spawn_tasks_rude_kthread);
> +
> +#endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
> -- 
> 2.9.5
> 


* Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
  2020-03-16 19:47   ` Joel Fernandes
@ 2020-03-16 20:17     ` Joel Fernandes
  2020-03-16 20:32       ` Paul E. McKenney
  2020-03-16 20:29     ` Paul E. McKenney
  1 sibling, 1 reply; 171+ messages in thread
From: Joel Fernandes @ 2020-03-16 20:17 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, LKML, kernel-team@fb.com,,
	Ingo Molnar, Lai Jiangshan, dipankar, Andrew Morton,
	Mathieu Desnoyers, Josh Triplett, Thomas Glexiner,
	Peter Zijlstra, Steven Rostedt, David Howells, Eric Dumazet,
	Frederic Weisbecker, Oleg Nesterov

On Mon, Mar 16, 2020 at 3:47 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Thu, Mar 12, 2020 at 11:16:55AM -0700, paulmck@kernel.org wrote:
> > From: "Paul E. McKenney" <paulmck@kernel.org>
> >
> > This commit adds a "rude" variant of RCU-tasks that has as quiescent
> > states schedule(), cond_resched_tasks_rcu_qs(), userspace execution,
> > and (in theory, anyway) cond_resched().  Updates make use of IPIs and
> > force an IPI and a context switch on each online CPU.  This variant
> > is useful in some situations in tracing.
>
> Would it be possible to better clarify that the "rude version" works only
> from preempt-disabled regions? Is that also true for the "non-rude" version?
>
> Also it would be good to clarify better in cover letter, how these new
> flavors relate to the existing Tasks-RCU implementation.
>
> In the existing one, a quiescent state is a task updating its context switch
> counters such that it went to sleep at least once, implying there is no
> chance it is on an about to be destroyed trampoline.
>
> However, here we are trying to determine if a task state is no longer on an
> RQ (which I gleaned from the first patch). Sounds very similar, would the
> context switch counters not help in that determination as well? If it is Ok,
> it would be good to describe in cover letter about what is exactly is a
> quiescent state and what exactly is a reader section in the cover letter, for
> both non-rude and rude version. Thanks!

Just curious, why is the "rude" version better than SRCU? Seems the
schedule_on_each_cpu() would be much slower than SRCU especially if
there are 1000s of CPUs involved. Is there any reason that is a better
alternative?

thanks,

 - Joel


* Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
  2020-03-16 19:47   ` Joel Fernandes
  2020-03-16 20:17     ` Joel Fernandes
@ 2020-03-16 20:29     ` Paul E. McKenney
  1 sibling, 0 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-16 20:29 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg

On Mon, Mar 16, 2020 at 03:47:54PM -0400, Joel Fernandes wrote:
> On Thu, Mar 12, 2020 at 11:16:55AM -0700, paulmck@kernel.org wrote:
> > From: "Paul E. McKenney" <paulmck@kernel.org>
> > 
> > This commit adds a "rude" variant of RCU-tasks that has as quiescent
> > states schedule(), cond_resched_tasks_rcu_qs(), userspace execution,
> > and (in theory, anyway) cond_resched().  Updates make use of IPIs and
> > force an IPI and a context switch on each online CPU.  This variant
> > is useful in some situations in tracing.
> 
> Would it be possible to better clarify that the "rude version" works only
> from preempt-disabled regions? Is that also true for the "non-rude" version?

The rude version's read-side critical sections are preempt-disabled
regions.

The prior non-rude version's critical sections are not limited to
preempt-disabled regions, but instead extend between voluntary context
switches, as described in the header comment for synchronize_rcu_tasks().
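
The rude variant's rule can be illustrated with a small single-threaded
userspace model.  This is only a sketch under stated assumptions: the
names model_preempt_disable(), rude_gp_start(), and rude_gp_poll() are
hypothetical, and the real code is schedule_on_each_cpu() invoked from
rcu_tasks_rude_wait_gp().  The model captures one property: the empty work
item cannot run on a CPU until the task there is switched out, which
cannot happen inside a preempt-disabled region.

```c
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical model state; nothing here is a kernel API. */
static int preempt_count[NR_CPUS];	/* > 0: CPU is inside a reader. */
static bool work_done[NR_CPUS];		/* Has the empty work item run? */

/* Reader entry and exit, modeling preempt_disable()/preempt_enable(). */
static void model_preempt_disable(int cpu) { preempt_count[cpu]++; }
static void model_preempt_enable(int cpu)  { preempt_count[cpu]--; }

/* Begin a grace period: queue the empty work item on every CPU. */
static void rude_gp_start(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		work_done[cpu] = false;
}

/*
 * Try to run the pending work items.  A work item runs only when its
 * CPU can context-switch, that is, when preemption is enabled there.
 * Returns true once every CPU has run its item, ending the grace period.
 */
static bool rude_gp_poll(void)
{
	bool all_done = true;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!work_done[cpu] && preempt_count[cpu] == 0)
			work_done[cpu] = true;
		all_done = all_done && work_done[cpu];
	}
	return all_done;
}
```

A reader that was already running when the grace period started holds it
open until that reader re-enables preemption, while a reader that starts
on a CPU whose work item has already run does not delay it.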

> Also it would be good to clarify better in cover letter, how these new
> flavors relate to the existing Tasks-RCU implementation.

The cover letter did that for the tracing variant (shown below), but I
can add this information for the rude variant on the next round.

	The tracing variant has explicit read-side markers to permit
	finite grace periods even given in-kernel loops in PREEMPT=n
	builds.  It also protects code in the idle loop, on exception
	entry/exit paths, and on the various CPU-hotplug online/offline
	code paths, thus having protection properties similar to SRCU.
	However, unlike SRCU, this variant avoids expensive instructions
	in the read-side primitives, thus having read-side overhead
	similar to that of preemptible RCU.

> In the existing one, a quiescent state is a task updating its context switch
> counters such that it went to sleep at least once, implying there is no
> chance it is on an about to be destroyed trampoline.

Yes, voluntary context switch.

> However, here we are trying to determine whether a task is no longer on an
> RQ (which I gleaned from the first patch). That sounds very similar; would the
> context switch counters not help in that determination as well? If it is OK,
> it would be good to describe in the cover letter exactly what a quiescent
> state is and exactly what a reader section is, for both the non-rude and
> rude versions. Thanks!

No, for both of the new variants, a task can be in a quiescent state
while still being on a runqueue.  With the rude variant, the task can
have been preempted, and with the tracing variant, the task can be
outside of its explicitly marked read-side critical sections.

The existing non-rude version's quiescent states are already listed in
the docbook header comment for synchronize_rcu_tasks().

							Thanx, Paul



* Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
  2020-03-16 20:17     ` Joel Fernandes
@ 2020-03-16 20:32       ` Paul E. McKenney
  2020-03-16 21:32         ` Steven Rostedt
  0 siblings, 1 reply; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-16 20:32 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, LKML, kernel-team@fb.com,,
	Ingo Molnar, Lai Jiangshan, dipankar, Andrew Morton,
	Mathieu Desnoyers, Josh Triplett, Thomas Glexiner,
	Peter Zijlstra, Steven Rostedt, David Howells, Eric Dumazet,
	Frederic Weisbecker, Oleg Nesterov

On Mon, Mar 16, 2020 at 04:17:51PM -0400, Joel Fernandes wrote:
> On Mon, Mar 16, 2020 at 3:47 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Thu, Mar 12, 2020 at 11:16:55AM -0700, paulmck@kernel.org wrote:
> > > From: "Paul E. McKenney" <paulmck@kernel.org>
> > >
> > > This commit adds a "rude" variant of RCU-tasks that has as quiescent
> > > states schedule(), cond_resched_tasks_rcu_qs(), userspace execution,
> > > and (in theory, anyway) cond_resched().  Updates make use of IPIs and
> > > force an IPI and a context switch on each online CPU.  This variant
> > > is useful in some situations in tracing.
> >
> > Would it be possible to better clarify that the "rude version" works only
> > from preempt-disabled regions? Is that also true for the "non-rude" version?
> >
> > Also it would be good to clarify better in cover letter, how these new
> > flavors relate to the existing Tasks-RCU implementation.
> >
> > In the existing one, a quiescent state is a task updating its context switch
> > counters such that it went to sleep at least once, implying there is no
> > chance it is on an about to be destroyed trampoline.
> >
> > However, here we are trying to determine if a task state is no longer on an
> > RQ (which I gleaned from the first patch). Sounds very similar, would the
> > context switch counters not help in that determination as well? If it is Ok,
> > it would be good to describe in cover letter about what is exactly is a
> > quiescent state and what exactly is a reader section in the cover letter, for
> > both non-rude and rude version. Thanks!
> 
> Just curious, why is the "rude" version better than SRCU? Seems the
> schedule_on_each_cpu() would be much slower than SRCU especially if
> there are 1000s of CPUs involved. Is there any reason that is a better
> alternative?

The rude version has much faster readers, and the story I hear is that
there are not expected to be all that many concurrent updaters.

But to get more detail, why not ask Steven why he chose not to use SRCU?
(I know the story for the BPF guys, and it is because of SRCU's read-side
overhead.)

							Thanx, Paul


* Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
  2020-03-16 20:32       ` Paul E. McKenney
@ 2020-03-16 21:32         ` Steven Rostedt
  2020-03-16 21:45           ` Joel Fernandes
  0 siblings, 1 reply; 171+ messages in thread
From: Steven Rostedt @ 2020-03-16 21:32 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, rcu, LKML, kernel-team@fb.com,
	Ingo Molnar, Lai Jiangshan, dipankar, Andrew Morton,
	Mathieu Desnoyers, Josh Triplett, Thomas Gleixner,
	Peter Zijlstra, David Howells, Eric Dumazet, Frederic Weisbecker,
	Oleg Nesterov

On Mon, 16 Mar 2020 13:32:41 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > Just curious, why is the "rude" version better than SRCU? Seems the
> > schedule_on_each_cpu() would be much slower than SRCU especially if
> > there are 1000s of CPUs involved. Is there any reason that is a better
> > alternative?  
> 
> The rude version has much faster readers, and the story I hear is that
> there are not expected to be all that many concurrent updaters.
> 
> But to get more detail, why not ask Steven why he chose not to use SRCU?
> (I know the story for the BPF guys, and it is because of SRCU's read-side
> overhead.)

Same for the function side (if not even more so). This would require adding
a srcu_read_lock() to all functions that can be traced! That would be a huge
kill in performance. Probably to the point no one would bother even using
function tracer.

-- Steve


* Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
  2020-03-16 21:32         ` Steven Rostedt
@ 2020-03-16 21:45           ` Joel Fernandes
  2020-03-16 22:03             ` Steven Rostedt
  0 siblings, 1 reply; 171+ messages in thread
From: Joel Fernandes @ 2020-03-16 21:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, rcu, LKML, kernel-team@fb.com,
	Ingo Molnar, Lai Jiangshan, dipankar, Andrew Morton,
	Mathieu Desnoyers, Josh Triplett, Thomas Gleixner,
	Peter Zijlstra, David Howells, Eric Dumazet, Frederic Weisbecker,
	Oleg Nesterov

On Mon, Mar 16, 2020 at 5:32 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 16 Mar 2020 13:32:41 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > > Just curious, why is the "rude" version better than SRCU? Seems the
> > > schedule_on_each_cpu() would be much slower than SRCU especially if
> > > there are 1000s of CPUs involved. Is there any reason that is a better
> > > alternative?
> >
> > The rude version has much faster readers, and the story I hear is that
> > there are not expected to be all that many concurrent updaters.
> >
> > But to get more detail, why not ask Steven why he chose not to use SRCU?
> > (I know the story for the BPF guys, and it is because of SRCU's read-side
> > overhead.)
>
> Same for the function side (if not even more so). This would require adding
> a srcu_read_lock() to all functions that can be traced! That would be a huge
> kill in performance. Probably to the point no one would bother even using
> function tracer.

Point well taken! Thanks,

  -Joel


* Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
  2020-03-16 21:45           ` Joel Fernandes
@ 2020-03-16 22:03             ` Steven Rostedt
  2020-03-16 22:11               ` Paul E. McKenney
  2020-05-10  9:59               ` Lai Jiangshan
  0 siblings, 2 replies; 171+ messages in thread
From: Steven Rostedt @ 2020-03-16 22:03 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Paul E. McKenney, rcu, LKML, kernel-team@fb.com,
	Ingo Molnar, Lai Jiangshan, dipankar, Andrew Morton,
	Mathieu Desnoyers, Josh Triplett, Thomas Gleixner,
	Peter Zijlstra, David Howells, Eric Dumazet, Frederic Weisbecker,
	Oleg Nesterov

On Mon, 16 Mar 2020 17:45:40 -0400
Joel Fernandes <joel@joelfernandes.org> wrote:

> >
> > Same for the function side (if not even more so). This would require adding
> > a srcu_read_lock() to all functions that can be traced! That would be a huge
> > kill in performance. Probably to the point no one would bother even using
> > function tracer.  
> 
> Point well taken! Thanks,

Actually, it's worse than that. (We talked about this on IRC but I wanted
it documented here too).

You can't use any type of locking, unless you insert it around all the
callers of the nops (which is unreasonable).

That is, we have gcc -pg -mfentry that creates at the start of all traced
functions:

 <some_func>:
    call __fentry__
    [code for function here]

At boot up (or even by the compiler itself) we convert that to:

 <some_func>:
    nop
    [code for function here]


When we want to trace this function we use text_poke (with current kernels)
and convert it to this:

 <some_func>:
    call trace_trampoline
    [code for function here]


That trace_trampoline can be allocated, which means when it's no longer
needed, it must be freed. But when do we know it's safe to free it? Here's
the issue.


 <some_func>:
    call trace_trampoline  <- interrupt happens just after the jump
    [code for function here]

Now the task has just executed the call to the trace_trampoline, which
means the instruction pointer is set to the start of the trampoline. But it
has yet to execute that trampoline.

Now suppose the task is preempted, and a real-time hog keeps it from
running for minutes at a time (which is possible!). In the meantime,
we are done with that trampoline and free it. What happens when that task
is scheduled back? There's no more trampoline to execute even though its
instruction pointer is set to the first instruction of the trampoline!

I used the analogy of jumping off the cliff expecting a magic carpet to be
there to catch you, and just before you land, it disappears. That would be
a very bad day indeed!

We have no way to add a grace period between the start of a function (can
be *any* function) and the start of the trampoline. Since the problem is
that the task was involuntarily preempted before it could execute the
trampoline, and that trampolines are not allowed (supposedly) to call
schedule, then we have our quiescent state to track (voluntary scheduling).
When all tasks have either voluntarily scheduled, or entered user space
after disconnecting a trampoline from a function, we know that it is safe to
free the trampoline.
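
To make that rule concrete, here is a rough userspace model of it. This is
only a sketch and not kernel code: struct model_task, its fields, and the
two model_gp_*() helpers are names invented for this illustration. A task
stops being a holdout once its voluntary-context-switch count has changed
since the grace period began, or once it is seen in user space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a task; not the kernel's task_struct. */
struct model_task {
	unsigned long nvcsw;      /* number of voluntary context switches */
	bool in_userspace;        /* true while the task runs in user mode */
	unsigned long snap_nvcsw; /* nvcsw sampled when the grace period began */
	bool holdout;             /* true until a quiescent state is observed */
};

/* Begin a grace period: every task currently in the kernel is a holdout. */
static void model_gp_start(struct model_task *tasks, size_t n)
{
	assert(tasks != NULL);
	for (size_t i = 0; i < n; i++) {
		tasks[i].snap_nvcsw = tasks[i].nvcsw;
		tasks[i].holdout = !tasks[i].in_userspace;
	}
}

/*
 * Rescan the holdouts.  Returns true once none remain, which in this
 * model is the point at which the trampoline could safely be freed.
 */
static bool model_gp_done(struct model_task *tasks, size_t n)
{
	bool done = true;

	assert(tasks != NULL);
	for (size_t i = 0; i < n; i++) {
		struct model_task *t = &tasks[i];

		if (!t->holdout)
			continue;
		if (t->nvcsw != t->snap_nvcsw || t->in_userspace)
			t->holdout = false; /* voluntary switch or user space */
		else
			done = false;       /* might still be headed for the trampoline */
	}
	return done;
}
```

Note that an involuntary preemption changes neither field in this model, so
a preempted task stays a holdout no matter how long the real-time hog runs.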

-- Steve


* Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
  2020-03-16 22:03             ` Steven Rostedt
@ 2020-03-16 22:11               ` Paul E. McKenney
  2020-05-10  9:59               ` Lai Jiangshan
  1 sibling, 0 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-16 22:11 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Joel Fernandes, rcu, LKML, kernel-team@fb.com,
	Ingo Molnar, Lai Jiangshan, dipankar, Andrew Morton,
	Mathieu Desnoyers, Josh Triplett, Thomas Gleixner,
	Peter Zijlstra, David Howells, Eric Dumazet, Frederic Weisbecker,
	Oleg Nesterov

On Mon, Mar 16, 2020 at 06:03:52PM -0400, Steven Rostedt wrote:
> On Mon, 16 Mar 2020 17:45:40 -0400
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> > >
> > > Same for the function side (if not even more so). This would require adding
> > > a srcu_read_lock() to all functions that can be traced! That would be a huge
> > > kill in performance. Probably to the point no one would bother even using
> > > function tracer.  
> > 
> > Point well taken! Thanks,
> 
> Actually, it's worse than that. (We talked about this on IRC but I wanted
> it documented here too).
> 
> You can't use any type of locking, unless you insert it around all the
> callers of the nops (which is unreasonable).
> 
> That is, we have gcc -pg -mfentry that creates at the start of all traced
> functions:
> 
>  <some_func>:
>     call __fentry__
>     [code for function here]
> 
> At boot up (or even by the compiler itself) we convert that to:
> 
>  <some_func>:
>     nop
>     [code for function here]
> 
> 
> When we want to trace this function we use text_poke (with current kernels)
> and convert it to this:
> 
>  <some_func>:
>     call trace_trampoline
>     [code for function here]
> 
> 
> That trace_trampoline can be allocated, which means when it's no longer
> needed, it must be freed. But when do we know it's safe to free it? Here's
> the issue.
> 
> 
>  <some_func>:
>     call trace_trampoline  <- interrupt happens just after the jump
>     [code for function here]
> 
> Now the task has just executed the call to the trace_trampoline, which
> means the instruction pointer is set to the start of the trampoline. But it
> has yet to execute that trampoline.
> 
> Now suppose the task is preempted, and a real-time hog keeps it from
> running for minutes at a time (which is possible!). In the meantime,
> we are done with that trampoline and free it. What happens when that task
> is scheduled back? There's no more trampoline to execute even though its
> instruction pointer is set to the first instruction of the trampoline!
> 
> I used the analogy of jumping off the cliff expecting a magic carpet to be
> there to catch you, and just before you land, it disappears. That would be
> a very bad day indeed!

I never have thought of an analogy between Tasks RCU and magic carpets
before.  Maybe time to go watch Aladdin or something.  ;-)

							Thanx, Paul

> We have no way to add a grace period between the start of a function (can
> be *any* function) and the start of the trampoline. Since the problem is
> that the task was involuntarily preempted before it could execute the
> trampoline, and that trampolines are not allowed (supposedly) to call
> schedule, then we have our quiescent state to track (voluntary scheduling).
> When all tasks have either voluntarily scheduled, or entered user space
> after disconnecting a trampoline from a function, we know that it is safe to
> free the trampoline.
> 
> -- Steve


* [PATCH RFC v2 tip/core/rcu 0/22] Prototype RCU usable from idle, exception, offline
  2020-03-12 18:16 [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                   ` (16 preceding siblings ...)
  2020-03-13 14:41 ` [PATCH RFC tip/core/rcu 0/16] Prototype RCU usable from idle, exception, offline Frederic Weisbecker
@ 2020-03-19  0:10 ` Paul E. McKenney
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task paulmck
                     ` (25 more replies)
  17 siblings, 26 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel

Hello!

This series provides two variants of Tasks RCU, a rude variant inspired
by Steven Rostedt's use of schedule_on_each_cpu(), and a tracing variant
requested by the BPF folks and perhaps also of use for other tracing
use cases.

The rude variant uses context switches and offline as its quiescent
states, so that preempt-disabled regions of code executing on online
CPUs form the tasks rude RCU readers.

The tracing variant has explicit read-side markers to permit finite grace
periods even given in-kernel loops in PREEMPT=n builds.  These markers
are rcu_read_lock_trace() and rcu_read_unlock_trace(), so that any code
not under rcu_read_lock_trace() is a quiescent state.  This variant
also protects marked code in the idle loop, on exception entry/exit
paths, and on the various CPU-hotplug online/offline code paths, thus
having protection properties similar to SRCU.  However, unlike SRCU,
this variant avoids expensive instructions in the read-side primitives,
thus having read-side overhead similar to that of preemptible RCU.

There are of course downsides.  The grace-period code can send IPIs to
CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
However, this version enlists the aid of the context-switch hooks,
which eliminates the need for IPIs in context-switch-heavy workloads.
It also prohibits sending of IPIs early in the grace period, which
provides additional opportunity for the hooks to do their job.  Additional
IPI-reduction mechanisms are under development.
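
The holdoff policy can be sketched in userspace roughly as follows.  This
is only a model under stated assumptions: struct model_gp, MODEL_HZ, and
model_may_send_ipi() are names made up for illustration, and the tick rate
is assumed to be 1000 per second.  The half-second default is the one
given elsewhere in this cover letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_HZ 1000UL /* assumed ticks per second for this sketch */

/* Hypothetical grace-period record. */
struct model_gp {
	unsigned long start_jiffies;     /* when the grace period began */
	unsigned long ipi_delay_jiffies; /* minimum age before IPIs are allowed */
};

/*
 * Decide whether a holdout task's CPU may be sent an IPI at time "now".
 * If the context-switch hook has already reported a quiescent state for
 * the task, no IPI is needed at all.
 */
static bool model_may_send_ipi(const struct model_gp *gp, unsigned long now,
			       bool hook_saw_quiescent_state)
{
	assert(gp != NULL);
	if (hook_saw_quiescent_state)
		return false;
	/* Unsigned subtraction remains correct if the tick counter wraps. */
	return now - gp->start_jiffies >= gp->ipi_delay_jiffies;
}
```

With a delay of MODEL_HZ / 2, a workload whose tasks all context-switch
within the first half second of each grace period never triggers an IPI in
this model.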

It is also necessary to scan the full tasklist, much as for Tasks RCU.
There is a single callback queue guarded by a single lock, again, much
as for Tasks RCU.  If needed, these downsides can be at least partially
remedied.

Perhaps most important, this variant of RCU does not affect the vanilla
flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
readers can operate from idle, offline, and exception entry/exit in no
way allows rcu_preempt and rcu_sched readers to also do so.

The RCU tasks trace mechanism is based off of RCU tasks rather than
SRCU because the latter is more complex and also because the latter
uses a CPU-by-CPU approach to tracking quiescent states instead of the
task-by-task approach that is needed.  It is in theory possible to
mash RCU tasks trace into the Tree SRCU implementation, but there
will need to be extremely good reasons for doing so.

This effort benefited greatly from off-list discussions of BPF
requirements with Alexei Starovoitov and Andrii Nakryiko, as well as from
numerous on-list discussions, at least some of which are captured in the
"Link:" tags on the patches themselves.

The patches in this series are as follows, with asterisks indicating
significant change from v1:

1*.	Add function to sample state of a locked-down task.  I would
	guess that the API is still subject to change.  ;-)

2*.	Use the above function to add per-task state to RCU CPU stall
	warnings.  This commit was adapted to the new API.

3.	Add rcutorture module parameter to produce non-busy-wait task
	stalls, thus allowing the above RCU CPU stall change to be
	exercised.

4.	Move Tasks RCU to its own file.

5.	Create struct to hold RCU-tasks state information.

6.	Reinstate synchronize_rcu_mult(), as there will likely once
	again be a need to wait on multiple flavors of RCU.

7.	Add an rcutorture test for synchronize_rcu_mult().

8.	Refactor RCU-tasks to allow variants to be added.

9.	Add an RCU-tasks rude variant, based on Steven Rostedt's
	use of schedule_on_each_cpu().

10.	Add torture tests for RCU Tasks Rude.

11.	Use unique names for RCU-Tasks kthreads and messages.

12.	Further refactor RCU-tasks to allow adding even more variants.

13.	Code movement to allow even more Tasks RCU variants.

14*.	Add an RCU Tasks Trace to simplify protection of tracing hooks,
	including BPF.  This version fixes a number of bugs and adapts
	to the new lockdown-task API.

15.	Add torture tests for RCU Tasks Trace.

16.	Add stall warnings for RCU Tasks Trace.

17*.	Move #ifdef into tasks.h to ease addition of Kconfig-dependent APIs.

18*.	Add RCU-tasks-specific information to rcutorture writer stall
	output, easing debugging of these RCU variants.

19*.	Make the above rcutorture writer stall output include
	grace-period state.

20*.	Cause RCU tasks trace to take advantage of RCU scheduler hooks,
	thus reducing the number of IPIs.

21*.	Record grace-period start time for RCU tasks variants for
	IPI throttling and for debugging.

22*.	Provide a kernel boot parameter to delay IPIs until a given grace
	period reaches the specified age, with this age defaulting to
	half a second.  This further reduces the number of IPIs, to zero
	on context-switch-heavy workloads.

These new versions of Tasks RCU now pass heavy rcutorture testing,
and should thus be fine for experimental use.

Changes since v1:

o	Updated this cover letter to provide more detail, including
	on roads not taken.

o	Updated commit logs based on feedback from v1.

o	Updated the function providing a consistent view of the
	specified non-running task's state to invoke the specified
	function even if the task is currently running.  This will
	be necessary to safely eliminate IPIs for long-term idle and
	userspace execution.  The function may also now return false
	to transmit a failure indication to the caller, for example,
	if the function cannot handle being invoked on a running CPU.
	The function is now passed the relevant task_struct pointer as
	well as the specified argument.

	Changes were of course made to use the new API.

o	Leveraged context-switch hooks to avoid unnecessary IPIs.

o	Held off IPIs for the first half second (by default) of each
	grace period to give the context-switch hooks a better chance
	to do their job.

o	Lots of testing.

o	Fixed a number of bugs.

Todo:

o	Leverage idle entry/exit hooks to reduce IPIing of idle tasks.

o	Switch to read-side memory barriers during idle and userspace
	execution in kernels built for real-time or battery-powered use.
	As currently planned, nohz_full CPUs would be IPIed only during
	long-running loops in the kernel (as in more than half a second
	of such execution by default).

	Although this does add a branch to rcu_read_lock_trace() in
	kernels built to take this approach, the check involves only
	a cache-hot byte.  However, rcu_read_unlock_trace() executes
	the same sequence of instructions as before in the case where
	no memory barrier is required even in kernels built to take
	this approach.

o	Context-switch hooks and delayed IPIs could potentially also be
	applied to reduce the IPI intensity of RCU-tasks rude, but I do
	not yet know of any reason to do this.  If you believe that this
	is needed, please let me know.

o	Lots more testing.

							Thanx, Paul

------------------------------------------------------------------------

 Documentation/admin-guide/kernel-parameters.txt             |   12 
 include/linux/rcupdate.h                                    |   48 
 include/linux/rcupdate_trace.h                              |   84 
 include/linux/rcupdate_wait.h                               |   19 
 include/linux/rcutiny.h                                     |    2 
 include/linux/sched.h                                       |    8 
 include/linux/wait.h                                        |    2 
 init/init_task.c                                            |    4 
 kernel/fork.c                                               |    4 
 kernel/rcu/Kconfig                                          |   34 
 kernel/rcu/Kconfig.debug                                    |    4 
 kernel/rcu/rcu.h                                            |    3 
 kernel/rcu/rcutorture.c                                     |   99 
 kernel/rcu/tasks.h                                          | 1925 +++++++++---
 kernel/rcu/tree_plugin.h                                    |    6 
 kernel/rcu/tree_stall.h                                     |   40 
 kernel/rcu/update.c                                         |  374 --
 kernel/sched/core.c                                         |   48 
 tools/testing/selftests/rcutorture/configs/rcu/CFLIST       |    2 
 tools/testing/selftests/rcutorture/configs/rcu/RUDE01       |   10 
 tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot  |    1 
 tools/testing/selftests/rcutorture/configs/rcu/TRACE01      |   10 
 tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot |    1 
 23 files changed, 1926 insertions(+), 814 deletions(-)


* [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
@ 2020-03-19  0:10   ` paulmck
  2020-03-19 17:22     ` Steven Rostedt
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 02/22] rcu: Add per-task state to RCU CPU stall warnings paulmck
                     ` (24 subsequent siblings)
  25 siblings, 1 reply; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman

From: "Paul E. McKenney" <paulmck@kernel.org>

A running task's state can be sampled in a consistent manner (for example,
for diagnostic purposes) simply by invoking smp_call_function_single()
on its CPU, which may be obtained using task_cpu(), then having the
IPI handler verify that the desired task is in fact still running.
However, if the task is not running, this sampling can in theory be done
immediately and directly.  In practice, the task might start running at
any time, including during the sampling period.  Gaining a consistent
sample of a not-running task therefore requires that something be done
to lock down the target task's state.

This commit therefore adds a try_invoke_on_locked_down_task() function
that invokes a specified function if the specified task can be locked
down, returning true if successful and if the specified function returns
true.  Otherwise this function simply returns false.  Given that the
function passed to try_invoke_on_locked_down_task() might be invoked with
a runqueue lock held, that function had better be quite lightweight.

The function is passed the target task's task_struct pointer and the
argument passed to try_invoke_on_locked_down_task(), allowing easy access
to task state and to a location for further variables to be passed in
and out.

Note that the specified function will be called even if the specified
task is currently running.  The function can use ->on_rq and task_curr()
to quickly and easily determine the task's state, and can return false
if this state is not to the function's liking.  The caller of the
try_invoke_on_locked_down_task() would then see the false return value,
and could take appropriate action, for example, trying again later or
sending an IPI if matters are more urgent.

It is expected that use cases such as the RCU CPU stall warning code will
simply return false if the task is currently running.  However, there are
use cases involving nohz_full CPUs where the specified function might
instead fall back to an alternative sampling scheme that relies on heavier
synchronization (such as memory barriers) in the target task.
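
The calling convention can be illustrated with a small userspace model.
Everything named model_* below is invented for this sketch, and the
locking that the real try_invoke_on_locked_down_task() performs is
reduced to a comment; the model simply assumes that lock-down succeeds
and shows how the callback's return value reaches the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the scheduler's view of a task. */
struct model_sched_task {
	bool on_rq;   /* queued on a runqueue */
	bool on_cpu;  /* currently running, as task_curr() would report */
	int nesting;  /* example per-task state to be sampled */
};

typedef bool (*model_func_t)(struct model_sched_task *t, void *arg);

/* Model: assume the task was locked down, then defer to the callback. */
static bool model_try_invoke(struct model_sched_task *t, model_func_t func,
			     void *arg)
{
	assert(t != NULL && func != NULL);
	/* The real function holds ->pi_lock, and possibly a runqueue lock, here. */
	return func(t, arg);
}

/* Callback in the style described above: decline to sample a running task. */
static bool model_sample_nesting(struct model_sched_task *t, void *arg)
{
	int *out = arg;

	if (t->on_cpu)
		return false; /* caller may retry later or fall back to an IPI */
	*out = t->nesting;
	return true;
}
```

A caller seeing false from model_try_invoke() leaves its output variable
untouched, which mirrors the "try again later or send an IPI" choice
described in this commit log.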

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Apply feedback from Peter Zijlstra and Steven Rostedt. ]
[ paulmck: Invoke if running to handle feedback from Mathieu Desnoyers. ]
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
---
 include/linux/wait.h |  2 ++
 kernel/sched/core.c  | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 3283c8d..e2bb8ed 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -1148,4 +1148,6 @@ int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, i
 		(wait)->flags = 0;						\
 	} while (0)
 
+bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg);
+
 #endif /* _LINUX_WAIT_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc1dfc0..195eba0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2580,6 +2580,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 *
 	 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
 	 * __schedule().  See the comment for smp_mb__after_spinlock().
+	 *
+	 * A similar smp_rmb() lives in try_invoke_on_locked_down_task().
 	 */
 	smp_rmb();
 	if (p->on_rq && ttwu_remote(p, wake_flags))
@@ -2654,6 +2656,52 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 }
 
 /**
+ * try_invoke_on_locked_down_task - Invoke a function on task in fixed state
+ * @p: Process for which the function is to be invoked.
+ * @func: Function to invoke.
+ * @arg: Argument to function.
+ *
+ * If the specified task can be quickly locked into a definite state
+ * (either sleeping or on a given runqueue), arrange to keep it in that
+ * state while invoking @func(@arg).  This function can use ->on_rq and
+ * task_curr() to work out what the state is, if required.  Given that
+ * @func can be invoked with a runqueue lock held, it had better be quite
+ * lightweight.
+ *
+ * Returns:
+ *	@false if the task slipped out from under the locks.
+ *	@true if the task was locked onto a runqueue or is sleeping.
+ *		However, @func can override this by returning @false.
+ */
+bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg)
+{
+	bool ret = false;
+	struct rq_flags rf;
+	struct rq *rq;
+
+	lockdep_assert_irqs_enabled();
+	raw_spin_lock_irq(&p->pi_lock);
+	if (p->on_rq) {
+		rq = __task_rq_lock(p, &rf);
+		if (task_rq(p) == rq)
+			ret = func(p, arg);
+		rq_unlock(rq, &rf);
+	} else {
+		switch (p->state) {
+		case TASK_RUNNING:
+		case TASK_WAKING:
+			break;
+		default:
+			smp_rmb();
+			if (!p->on_rq)
+				ret = func(p, arg);
+		}
+	}
+	raw_spin_unlock_irq(&p->pi_lock);
+	return ret;
+}
+
+/**
  * wake_up_process - Wake up a specific process
  * @p: The process to be woken up.
  *
-- 
2.9.5



* [PATCH RFC v2 tip/core/rcu 02/22] rcu: Add per-task state to RCU CPU stall warnings
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19 17:27     ` Steven Rostedt
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 03/22] rcutorture: Add flag to produce non-busy-wait task stalls paulmck
                     ` (23 subsequent siblings)
  25 siblings, 1 reply; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

Currently, an RCU-preempt CPU stall warning simply lists the PIDs of
those tasks holding up the current grace period.  This can be helpful,
but more information can be even more helpful.

To this end, this commit adds the nesting level, whether the task
thinks it was preempted in its current RCU read-side critical section,
whether RCU core has asked this task for a quiescent state, whether the
expedited-grace-period hint is set, and whether the task believes that
it is on the blocked-tasks list (it must be, or it would not be printed,
but if things are broken, best not to take too much for granted).
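
The per-task output uses the string-indexing idiom ".b"[flag], which
yields '.' when the flag is clear and the letter when it is set.  A
userspace sketch of the same formatting follows; model_format_stall() and
its parameter names are invented for illustration, and snprintf() stands
in for pr_cont().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Format one task entry the way the stall warning does: pid, nesting
 * level, then one character per flag.  Returns snprintf()'s result.
 */
static int model_format_stall(char *buf, size_t len, int pid, int nesting,
			      bool blocked, bool need_qs, bool exp_hint,
			      bool on_blkd_list)
{
	assert(buf != NULL);
	/* Indexing a two-character literal by 0 or 1 selects '.' or the letter. */
	return snprintf(buf, len, " P%d/%d:%c%c%c%c", pid, nesting,
			".b"[blocked], ".q"[need_qs],
			".e"[exp_hint], ".l"[on_blkd_list]);
}
```

For example, a task with PID 1234 at nesting level 1 that was preempted
and is on the blocked-tasks list, with the other two flags clear, is
rendered as " P1234/1:b..l".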

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tree_stall.h | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 502b4dd..e19487d 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -192,14 +192,40 @@ static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp)
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
+// Communicate task state back to the RCU CPU stall warning request.
+struct rcu_stall_chk_rdr {
+	int nesting;
+	union rcu_special rs;
+	bool on_blkd_list;
+};
+
+/*
+ * Report out the state of a not-running task that is stalling the
+ * current RCU grace period.
+ */
+static bool check_slow_task(struct task_struct *t, void *arg)
+{
+	struct rcu_node *rnp;
+	struct rcu_stall_chk_rdr *rscrp = arg;
+
+	if (task_curr(t))
+		return false; // It is running, so decline to inspect it.
+	rscrp->nesting = t->rcu_read_lock_nesting;
+	rscrp->rs = t->rcu_read_unlock_special;
+	rnp = t->rcu_blocked_node;
+	rscrp->on_blkd_list = !list_empty(&t->rcu_node_entry);
+	return true;
+}
+
 /*
  * Scan the current list of tasks blocked within RCU read-side critical
  * sections, printing out the tid of each.
  */
 static int rcu_print_task_stall(struct rcu_node *rnp)
 {
-	struct task_struct *t;
 	int ndetected = 0;
+	struct rcu_stall_chk_rdr rscr;
+	struct task_struct *t;
 
 	if (!rcu_preempt_blocked_readers_cgp(rnp))
 		return 0;
@@ -208,7 +234,15 @@ static int rcu_print_task_stall(struct rcu_node *rnp)
 	t = list_entry(rnp->gp_tasks->prev,
 		       struct task_struct, rcu_node_entry);
 	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
-		pr_cont(" P%d", t->pid);
+		if (!try_invoke_on_locked_down_task(t, check_slow_task, &rscr))
+			pr_cont(" P%d", t->pid);
+		else
+			pr_cont(" P%d/%d:%c%c%c%c",
+				t->pid, rscr.nesting,
+				".b"[rscr.rs.b.blocked],
+				".q"[rscr.rs.b.need_qs],
+				".e"[rscr.rs.b.exp_hint],
+				".l"[rscr.on_blkd_list]);
 		ndetected++;
 	}
 	pr_cont("\n");
-- 
2.9.5



* [PATCH RFC v2 tip/core/rcu 03/22] rcutorture: Add flag to produce non-busy-wait task stalls
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 02/22] rcu: Add per-task state to RCU CPU stall warnings paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 04/22] rcu-tasks: Move Tasks RCU to its own file paulmck
                     ` (22 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit aids testing of RCU task stall warning messages by adding
an rcutorture.stall_cpu_block module parameter that results in the
induced stall sleeping within the RCU read-side critical section.
Spinning with interrupts disabled is still available via the
rcutorture.stall_cpu_irqsoff module parameter, and specifying neither
of these two module parameters will spin with preemption disabled.

Note that sleeping (as opposed to preemption) results in additional
complaints from RCU at context-switch time, so yet more testing.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 Documentation/admin-guide/kernel-parameters.txt |  5 +++++
 kernel/rcu/rcutorture.c                         | 15 +++++++++------
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6d16b78..17eff15 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4161,6 +4161,11 @@
 			Duration of CPU stall (s) to test RCU CPU stall
 			warnings, zero to disable.
 
+	rcutorture.stall_cpu_block= [KNL]
+			Sleep while stalling if set.  This will result
+			in warnings from preemptible RCU in addition
+			to any other stall-related activity.
+
 	rcutorture.stall_cpu_holdoff= [KNL]
 			Time to wait (s) after boot before inducing stall.
 
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index b3301f3..ada5b91 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -102,6 +102,7 @@ torture_param(int, stall_cpu, 0, "Stall duration (s), zero to disable.");
 torture_param(int, stall_cpu_holdoff, 10,
 	     "Time to wait before starting stall (s).");
 torture_param(int, stall_cpu_irqsoff, 0, "Disable interrupts while stalling.");
+torture_param(int, stall_cpu_block, 0, "Sleep while stalling.");
 torture_param(int, stat_interval, 60,
 	     "Number of seconds between stats printk()s");
 torture_param(int, stutter, 5, "Number of seconds to run/halt test");
@@ -1599,6 +1600,7 @@ static int rcutorture_booster_init(unsigned int cpu)
  */
 static int rcu_torture_stall(void *args)
 {
+	int idx;
 	unsigned long stop_at;
 
 	VERBOSE_TOROUT_STRING("rcu_torture_stall task started");
@@ -1610,21 +1612,22 @@ static int rcu_torture_stall(void *args)
 	if (!kthread_should_stop()) {
 		stop_at = ktime_get_seconds() + stall_cpu;
 		/* RCU CPU stall is expected behavior in following code. */
-		rcu_read_lock();
+		idx = cur_ops->readlock();
 		if (stall_cpu_irqsoff)
 			local_irq_disable();
-		else
+		else if (!stall_cpu_block)
 			preempt_disable();
 		pr_alert("rcu_torture_stall start on CPU %d.\n",
-			 smp_processor_id());
+			 raw_smp_processor_id());
 		while (ULONG_CMP_LT((unsigned long)ktime_get_seconds(),
 				    stop_at))
-			continue;  /* Induce RCU CPU stall warning. */
+			if (stall_cpu_block)
+				schedule_timeout_uninterruptible(HZ);
 		if (stall_cpu_irqsoff)
 			local_irq_enable();
-		else
+		else if (!stall_cpu_block)
 			preempt_enable();
-		rcu_read_unlock();
+		cur_ops->readunlock(idx);
 		pr_alert("rcu_torture_stall end.\n");
 	}
 	torture_shutdown_absorb("rcu_torture_stall");
-- 
2.9.5



* [PATCH RFC v2 tip/core/rcu 04/22] rcu-tasks: Move Tasks RCU to its own file
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (2 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 03/22] rcutorture: Add flag to produce non-busy-wait task stalls paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 05/22] rcu-tasks: Create struct to hold state information paulmck
                     ` (21 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This code-movement-only commit prepares for adding another flavor of
Tasks RCU, one that relies on workqueues to detect grace periods.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
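For readers skimming the moved code, the holdout-scan backoff in
rcu_tasks_kthread() can be sketched in user space as follows.  This is
only an illustration under the assumption that each pass waits
HZ/fract jiffies and then decrements fract until it reaches 1.  The
holdout_wait() helper is a hypothetical name, and the HZ value of 1000
mentioned afterwards is an assumption because HZ is configurable.

```c
#include <assert.h>

/*
 * Return the wait (in jiffies) for this pass and advance the backoff
 * state, modeling "schedule_timeout_interruptible(HZ/fract)" followed
 * by "if (fract > 1) fract--".
 */
static unsigned long holdout_wait(unsigned long hz, int *fract)
{
	unsigned long wait;

	assert(fract && *fract >= 1);
	wait = hz / *fract;
	if (*fract > 1)
		(*fract)--;
	return wait;
}
```

Starting from fract = 10 with HZ assumed to be 1000, successive calls
give waits of 100, 111, 125 and so on, leveling off at 1000 jiffies
(one second) from the tenth pass onward.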
 kernel/rcu/tasks.h  | 370 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/rcu/update.c | 366 +--------------------------------------------------
 2 files changed, 372 insertions(+), 364 deletions(-)
 create mode 100644 kernel/rcu/tasks.h

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
new file mode 100644
index 0000000..be8d179
--- /dev/null
+++ b/kernel/rcu/tasks.h
@@ -0,0 +1,370 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * Task-based RCU implementations.
+ *
+ * Copyright (C) 2020 Paul E. McKenney
+ */
+
+#ifdef CONFIG_TASKS_RCU
+
+/*
+ * Simple variant of RCU whose quiescent states are voluntary context
+ * switch, cond_resched_rcu_qs(), user-space execution, and idle.
+ * As such, grace periods can take one good long time.  There are no
+ * read-side primitives similar to rcu_read_lock() and rcu_read_unlock()
+ * because this implementation is intended to get the system into a safe
+ * state for some of the manipulations involved in tracing and the like.
+ * Finally, this implementation does not support high call_rcu_tasks()
+ * rates from multiple CPUs.  If this is required, per-CPU callback lists
+ * will be needed.
+ */
+
+/* Global list of callbacks and associated lock. */
+static struct rcu_head *rcu_tasks_cbs_head;
+static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
+static DECLARE_WAIT_QUEUE_HEAD(rcu_tasks_cbs_wq);
+static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
+
+/* Track exiting tasks in order to allow them to be waited for. */
+DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
+
+/* Control stall timeouts.  Disable with <= 0, otherwise jiffies till stall. */
+#define RCU_TASK_STALL_TIMEOUT (HZ * 60 * 10)
+static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
+module_param(rcu_task_stall_timeout, int, 0644);
+
+static struct task_struct *rcu_tasks_kthread_ptr;
+
+/**
+ * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
+ * @rhp: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. call_rcu_tasks() assumes
+ * that the read-side critical sections end at a voluntary context
+ * switch (not a preemption!), cond_resched_rcu_qs(), entry into idle,
+ * or transition to usermode execution.  As such, there are no read-side
+ * primitives analogous to rcu_read_lock() and rcu_read_unlock() because
+ * this primitive is intended to determine that all tasks have passed
+ * through a safe state, not so much for data-strcuture synchronization.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
+ */
+void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
+{
+	unsigned long flags;
+	bool needwake;
+
+	rhp->next = NULL;
+	rhp->func = func;
+	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
+	needwake = !rcu_tasks_cbs_head;
+	WRITE_ONCE(*rcu_tasks_cbs_tail, rhp);
+	rcu_tasks_cbs_tail = &rhp->next;
+	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+	/* We can't create the thread unless interrupts are enabled. */
+	if (needwake && READ_ONCE(rcu_tasks_kthread_ptr))
+		wake_up(&rcu_tasks_cbs_wq);
+}
+EXPORT_SYMBOL_GPL(call_rcu_tasks);
+
+/**
+ * synchronize_rcu_tasks - wait until an rcu-tasks grace period has elapsed.
+ *
+ * Control will return to the caller some time after a full rcu-tasks
+ * grace period has elapsed, in other words after all currently
+ * executing rcu-tasks read-side critical sections have elapsed.  These
+ * read-side critical sections are delimited by calls to schedule(),
+ * cond_resched_tasks_rcu_qs(), idle execution, userspace execution, calls
+ * to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().
+ *
+ * This is a very specialized primitive, intended only for a few uses in
+ * tracing and other situations requiring manipulation of function
+ * preambles and profiling hooks.  The synchronize_rcu_tasks() function
+ * is not (yet) intended for heavy use from multiple CPUs.
+ *
+ * Note that this guarantee implies further memory-ordering guarantees.
+ * On systems with more than one CPU, when synchronize_rcu_tasks() returns,
+ * each CPU is guaranteed to have executed a full memory barrier since the
+ * end of its last RCU-tasks read-side critical section whose beginning
+ * preceded the call to synchronize_rcu_tasks().  In addition, each CPU
+ * having an RCU-tasks read-side critical section that extends beyond
+ * the return from synchronize_rcu_tasks() is guaranteed to have executed
+ * a full memory barrier after the beginning of synchronize_rcu_tasks()
+ * and before the beginning of that RCU-tasks read-side critical section.
+ * Note that these guarantees include CPUs that are offline, idle, or
+ * executing in user mode, as well as CPUs that are executing in the kernel.
+ *
+ * Furthermore, if CPU A invoked synchronize_rcu_tasks(), which returned
+ * to its caller on CPU B, then both CPU A and CPU B are guaranteed
+ * to have executed a full memory barrier during the execution of
+ * synchronize_rcu_tasks() -- even if CPU A and CPU B are the same CPU
+ * (but again only if the system has more than one CPU).
+ */
+void synchronize_rcu_tasks(void)
+{
+	/* Complain if the scheduler has not started.  */
+	RCU_LOCKDEP_WARN(rcu_scheduler_active == RCU_SCHEDULER_INACTIVE,
+			 "synchronize_rcu_tasks called too soon");
+
+	/* Wait for the grace period. */
+	wait_rcu_gp(call_rcu_tasks);
+}
+EXPORT_SYMBOL_GPL(synchronize_rcu_tasks);
+
+/**
+ * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
+ *
+ * Although the current implementation is guaranteed to wait, it is not
+ * obligated to, for example, if there are no pending callbacks.
+ */
+void rcu_barrier_tasks(void)
+{
+	/* There is only one callback queue, so this is easy.  ;-) */
+	synchronize_rcu_tasks();
+}
+EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
+
+/* See if tasks are still holding out, complain if so. */
+static void check_holdout_task(struct task_struct *t,
+			       bool needreport, bool *firstreport)
+{
+	int cpu;
+
+	if (!READ_ONCE(t->rcu_tasks_holdout) ||
+	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||
+	    !READ_ONCE(t->on_rq) ||
+	    (IS_ENABLED(CONFIG_NO_HZ_FULL) &&
+	     !is_idle_task(t) && t->rcu_tasks_idle_cpu >= 0)) {
+		WRITE_ONCE(t->rcu_tasks_holdout, false);
+		list_del_init(&t->rcu_tasks_holdout_list);
+		put_task_struct(t);
+		return;
+	}
+	rcu_request_urgent_qs_task(t);
+	if (!needreport)
+		return;
+	if (*firstreport) {
+		pr_err("INFO: rcu_tasks detected stalls on tasks:\n");
+		*firstreport = false;
+	}
+	cpu = task_cpu(t);
+	pr_alert("%p: %c%c nvcsw: %lu/%lu holdout: %d idle_cpu: %d/%d\n",
+		 t, ".I"[is_idle_task(t)],
+		 "N."[cpu < 0 || !tick_nohz_full_cpu(cpu)],
+		 t->rcu_tasks_nvcsw, t->nvcsw, t->rcu_tasks_holdout,
+		 t->rcu_tasks_idle_cpu, cpu);
+	sched_show_task(t);
+}
+
+/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
+static int __noreturn rcu_tasks_kthread(void *arg)
+{
+	unsigned long flags;
+	struct task_struct *g, *t;
+	unsigned long lastreport;
+	struct rcu_head *list;
+	struct rcu_head *next;
+	LIST_HEAD(rcu_tasks_holdouts);
+	int fract;
+
+	/* Run on housekeeping CPUs by default.  Sysadm can move if desired. */
+	housekeeping_affine(current, HK_FLAG_RCU);
+
+	/*
+	 * Each pass through the following loop makes one check for
+	 * newly arrived callbacks, and, if there are some, waits for
+	 * one RCU-tasks grace period and then invokes the callbacks.
+	 * This loop is terminated by the system going down.  ;-)
+	 */
+	for (;;) {
+
+		/* Pick up any new callbacks. */
+		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
+		list = rcu_tasks_cbs_head;
+		rcu_tasks_cbs_head = NULL;
+		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
+		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+
+		/* If there were none, wait a bit and start over. */
+		if (!list) {
+			wait_event_interruptible(rcu_tasks_cbs_wq,
+						 READ_ONCE(rcu_tasks_cbs_head));
+			if (!rcu_tasks_cbs_head) {
+				WARN_ON(signal_pending(current));
+				schedule_timeout_interruptible(HZ/10);
+			}
+			continue;
+		}
+
+		/*
+		 * Wait for all pre-existing t->on_rq and t->nvcsw
+		 * transitions to complete.  Invoking synchronize_rcu()
+		 * suffices because all these transitions occur with
+		 * interrupts disabled.  Without this synchronize_rcu(),
+		 * a read-side critical section that started before the
+		 * grace period might be incorrectly seen as having started
+		 * after the grace period.
+		 *
+		 * This synchronize_rcu() also dispenses with the
+		 * need for a memory barrier on the first store to
+		 * ->rcu_tasks_holdout, as it forces the store to happen
+		 * after the beginning of the grace period.
+		 */
+		synchronize_rcu();
+
+		/*
+		 * There were callbacks, so we need to wait for an
+		 * RCU-tasks grace period.  Start off by scanning
+		 * the task list for tasks that are not already
+		 * voluntarily blocked.  Mark these tasks and make
+		 * a list of them in rcu_tasks_holdouts.
+		 */
+		rcu_read_lock();
+		for_each_process_thread(g, t) {
+			if (t != current && READ_ONCE(t->on_rq) &&
+			    !is_idle_task(t)) {
+				get_task_struct(t);
+				t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
+				WRITE_ONCE(t->rcu_tasks_holdout, true);
+				list_add(&t->rcu_tasks_holdout_list,
+					 &rcu_tasks_holdouts);
+			}
+		}
+		rcu_read_unlock();
+
+		/*
+		 * Wait for tasks that are in the process of exiting.
+		 * This does only part of the job, ensuring that all
+		 * tasks that were previously exiting reach the point
+		 * where they have disabled preemption, allowing the
+		 * later synchronize_rcu() to finish the job.
+		 */
+		synchronize_srcu(&tasks_rcu_exit_srcu);
+
+		/*
+		 * Each pass through the following loop scans the list
+		 * of holdout tasks, removing any that are no longer
+		 * holdouts.  When the list is empty, we are done.
+		 */
+		lastreport = jiffies;
+
+		/* Start off with HZ/10 wait and slowly back off to 1 HZ wait*/
+		fract = 10;
+
+		for (;;) {
+			bool firstreport;
+			bool needreport;
+			int rtst;
+			struct task_struct *t1;
+
+			if (list_empty(&rcu_tasks_holdouts))
+				break;
+
+			/* Slowly back off waiting for holdouts */
+			schedule_timeout_interruptible(HZ/fract);
+
+			if (fract > 1)
+				fract--;
+
+			rtst = READ_ONCE(rcu_task_stall_timeout);
+			needreport = rtst > 0 &&
+				     time_after(jiffies, lastreport + rtst);
+			if (needreport)
+				lastreport = jiffies;
+			firstreport = true;
+			WARN_ON(signal_pending(current));
+			list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
+						rcu_tasks_holdout_list) {
+				check_holdout_task(t, needreport, &firstreport);
+				cond_resched();
+			}
+		}
+
+		/*
+		 * Because ->on_rq and ->nvcsw are not guaranteed
+		 * to have a full memory barriers prior to them in the
+		 * schedule() path, memory reordering on other CPUs could
+		 * cause their RCU-tasks read-side critical sections to
+		 * extend past the end of the grace period.  However,
+		 * because these ->nvcsw updates are carried out with
+		 * interrupts disabled, we can use synchronize_rcu()
+		 * to force the needed ordering on all such CPUs.
+		 *
+		 * This synchronize_rcu() also confines all
+		 * ->rcu_tasks_holdout accesses to be within the grace
+		 * period, avoiding the need for memory barriers for
+		 * ->rcu_tasks_holdout accesses.
+		 *
+		 * In addition, this synchronize_rcu() waits for exiting
+		 * tasks to complete their final preempt_disable() region
+		 * of execution, cleaning up after the synchronize_srcu()
+		 * above.
+		 */
+		synchronize_rcu();
+
+		/* Invoke the callbacks. */
+		while (list) {
+			next = list->next;
+			local_bh_disable();
+			list->func(list);
+			local_bh_enable();
+			list = next;
+			cond_resched();
+		}
+		/* Paranoid sleep to keep this from entering a tight loop */
+		schedule_timeout_uninterruptible(HZ/10);
+	}
+}
+
+/* Spawn rcu_tasks_kthread() at core_initcall() time. */
+static int __init rcu_spawn_tasks_kthread(void)
+{
+	struct task_struct *t;
+
+	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
+	if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
+		return 0;
+	smp_mb(); /* Ensure others see full kthread. */
+	WRITE_ONCE(rcu_tasks_kthread_ptr, t);
+	return 0;
+}
+core_initcall(rcu_spawn_tasks_kthread);
+
+/* Do the srcu_read_lock() for the above synchronize_srcu().  */
+void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
+{
+	preempt_disable();
+	current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu);
+	preempt_enable();
+}
+
+/* Do the srcu_read_unlock() for the above synchronize_srcu().  */
+void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
+{
+	preempt_disable();
+	__srcu_read_unlock(&tasks_rcu_exit_srcu, current->rcu_tasks_idx);
+	preempt_enable();
+}
+
+#endif /* #ifdef CONFIG_TASKS_RCU */
+
+#ifndef CONFIG_TINY_RCU
+
+/*
+ * Print any non-default Tasks RCU settings.
+ */
+static void __init rcu_tasks_bootup_oddness(void)
+{
+#ifdef CONFIG_TASKS_RCU
+	if (rcu_task_stall_timeout != RCU_TASK_STALL_TIMEOUT)
+		pr_info("\tTasks-RCU CPU stall warnings timeout set to %d (rcu_task_stall_timeout).\n", rcu_task_stall_timeout);
+	else
+		pr_info("\tTasks RCU enabled.\n");
+#endif /* #ifdef CONFIG_TASKS_RCU */
+}
+
+#endif /* #ifndef CONFIG_TINY_RCU */
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index dd837da..0fb2a9e 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -489,370 +489,6 @@ int rcu_cpu_stall_suppress_at_boot __read_mostly; // !0 = suppress boot stalls.
 EXPORT_SYMBOL_GPL(rcu_cpu_stall_suppress_at_boot);
 module_param(rcu_cpu_stall_suppress_at_boot, int, 0444);
 
-#ifdef CONFIG_TASKS_RCU
-
-/*
- * Simple variant of RCU whose quiescent states are voluntary context
- * switch, cond_resched_rcu_qs(), user-space execution, and idle.
- * As such, grace periods can take one good long time.  There are no
- * read-side primitives similar to rcu_read_lock() and rcu_read_unlock()
- * because this implementation is intended to get the system into a safe
- * state for some of the manipulations involved in tracing and the like.
- * Finally, this implementation does not support high call_rcu_tasks()
- * rates from multiple CPUs.  If this is required, per-CPU callback lists
- * will be needed.
- */
-
-/* Global list of callbacks and associated lock. */
-static struct rcu_head *rcu_tasks_cbs_head;
-static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
-static DECLARE_WAIT_QUEUE_HEAD(rcu_tasks_cbs_wq);
-static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
-
-/* Track exiting tasks in order to allow them to be waited for. */
-DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
-
-/* Control stall timeouts.  Disable with <= 0, otherwise jiffies till stall. */
-#define RCU_TASK_STALL_TIMEOUT (HZ * 60 * 10)
-static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
-module_param(rcu_task_stall_timeout, int, 0644);
-
-static struct task_struct *rcu_tasks_kthread_ptr;
-
-/**
- * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
- * @rhp: structure to be used for queueing the RCU updates.
- * @func: actual callback function to be invoked after the grace period
- *
- * The callback function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. call_rcu_tasks() assumes
- * that the read-side critical sections end at a voluntary context
- * switch (not a preemption!), cond_resched_rcu_qs(), entry into idle,
- * or transition to usermode execution.  As such, there are no read-side
- * primitives analogous to rcu_read_lock() and rcu_read_unlock() because
- * this primitive is intended to determine that all tasks have passed
- * through a safe state, not so much for data-strcuture synchronization.
- *
- * See the description of call_rcu() for more detailed information on
- * memory ordering guarantees.
- */
-void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
-{
-	unsigned long flags;
-	bool needwake;
-
-	rhp->next = NULL;
-	rhp->func = func;
-	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
-	needwake = !rcu_tasks_cbs_head;
-	WRITE_ONCE(*rcu_tasks_cbs_tail, rhp);
-	rcu_tasks_cbs_tail = &rhp->next;
-	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
-	/* We can't create the thread unless interrupts are enabled. */
-	if (needwake && READ_ONCE(rcu_tasks_kthread_ptr))
-		wake_up(&rcu_tasks_cbs_wq);
-}
-EXPORT_SYMBOL_GPL(call_rcu_tasks);
-
-/**
- * synchronize_rcu_tasks - wait until an rcu-tasks grace period has elapsed.
- *
- * Control will return to the caller some time after a full rcu-tasks
- * grace period has elapsed, in other words after all currently
- * executing rcu-tasks read-side critical sections have elapsed.  These
- * read-side critical sections are delimited by calls to schedule(),
- * cond_resched_tasks_rcu_qs(), idle execution, userspace execution, calls
- * to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().
- *
- * This is a very specialized primitive, intended only for a few uses in
- * tracing and other situations requiring manipulation of function
- * preambles and profiling hooks.  The synchronize_rcu_tasks() function
- * is not (yet) intended for heavy use from multiple CPUs.
- *
- * Note that this guarantee implies further memory-ordering guarantees.
- * On systems with more than one CPU, when synchronize_rcu_tasks() returns,
- * each CPU is guaranteed to have executed a full memory barrier since the
- * end of its last RCU-tasks read-side critical section whose beginning
- * preceded the call to synchronize_rcu_tasks().  In addition, each CPU
- * having an RCU-tasks read-side critical section that extends beyond
- * the return from synchronize_rcu_tasks() is guaranteed to have executed
- * a full memory barrier after the beginning of synchronize_rcu_tasks()
- * and before the beginning of that RCU-tasks read-side critical section.
- * Note that these guarantees include CPUs that are offline, idle, or
- * executing in user mode, as well as CPUs that are executing in the kernel.
- *
- * Furthermore, if CPU A invoked synchronize_rcu_tasks(), which returned
- * to its caller on CPU B, then both CPU A and CPU B are guaranteed
- * to have executed a full memory barrier during the execution of
- * synchronize_rcu_tasks() -- even if CPU A and CPU B are the same CPU
- * (but again only if the system has more than one CPU).
- */
-void synchronize_rcu_tasks(void)
-{
-	/* Complain if the scheduler has not started.  */
-	RCU_LOCKDEP_WARN(rcu_scheduler_active == RCU_SCHEDULER_INACTIVE,
-			 "synchronize_rcu_tasks called too soon");
-
-	/* Wait for the grace period. */
-	wait_rcu_gp(call_rcu_tasks);
-}
-EXPORT_SYMBOL_GPL(synchronize_rcu_tasks);
-
-/**
- * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
- *
- * Although the current implementation is guaranteed to wait, it is not
- * obligated to, for example, if there are no pending callbacks.
- */
-void rcu_barrier_tasks(void)
-{
-	/* There is only one callback queue, so this is easy.  ;-) */
-	synchronize_rcu_tasks();
-}
-EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
-
-/* See if tasks are still holding out, complain if so. */
-static void check_holdout_task(struct task_struct *t,
-			       bool needreport, bool *firstreport)
-{
-	int cpu;
-
-	if (!READ_ONCE(t->rcu_tasks_holdout) ||
-	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||
-	    !READ_ONCE(t->on_rq) ||
-	    (IS_ENABLED(CONFIG_NO_HZ_FULL) &&
-	     !is_idle_task(t) && t->rcu_tasks_idle_cpu >= 0)) {
-		WRITE_ONCE(t->rcu_tasks_holdout, false);
-		list_del_init(&t->rcu_tasks_holdout_list);
-		put_task_struct(t);
-		return;
-	}
-	rcu_request_urgent_qs_task(t);
-	if (!needreport)
-		return;
-	if (*firstreport) {
-		pr_err("INFO: rcu_tasks detected stalls on tasks:\n");
-		*firstreport = false;
-	}
-	cpu = task_cpu(t);
-	pr_alert("%p: %c%c nvcsw: %lu/%lu holdout: %d idle_cpu: %d/%d\n",
-		 t, ".I"[is_idle_task(t)],
-		 "N."[cpu < 0 || !tick_nohz_full_cpu(cpu)],
-		 t->rcu_tasks_nvcsw, t->nvcsw, t->rcu_tasks_holdout,
-		 t->rcu_tasks_idle_cpu, cpu);
-	sched_show_task(t);
-}
-
-/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
-static int __noreturn rcu_tasks_kthread(void *arg)
-{
-	unsigned long flags;
-	struct task_struct *g, *t;
-	unsigned long lastreport;
-	struct rcu_head *list;
-	struct rcu_head *next;
-	LIST_HEAD(rcu_tasks_holdouts);
-	int fract;
-
-	/* Run on housekeeping CPUs by default.  Sysadm can move if desired. */
-	housekeeping_affine(current, HK_FLAG_RCU);
-
-	/*
-	 * Each pass through the following loop makes one check for
-	 * newly arrived callbacks, and, if there are some, waits for
-	 * one RCU-tasks grace period and then invokes the callbacks.
-	 * This loop is terminated by the system going down.  ;-)
-	 */
-	for (;;) {
-
-		/* Pick up any new callbacks. */
-		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
-		list = rcu_tasks_cbs_head;
-		rcu_tasks_cbs_head = NULL;
-		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
-		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
-
-		/* If there were none, wait a bit and start over. */
-		if (!list) {
-			wait_event_interruptible(rcu_tasks_cbs_wq,
-						 READ_ONCE(rcu_tasks_cbs_head));
-			if (!rcu_tasks_cbs_head) {
-				WARN_ON(signal_pending(current));
-				schedule_timeout_interruptible(HZ/10);
-			}
-			continue;
-		}
-
-		/*
-		 * Wait for all pre-existing t->on_rq and t->nvcsw
-		 * transitions to complete.  Invoking synchronize_rcu()
-		 * suffices because all these transitions occur with
-		 * interrupts disabled.  Without this synchronize_rcu(),
-		 * a read-side critical section that started before the
-		 * grace period might be incorrectly seen as having started
-		 * after the grace period.
-		 *
-		 * This synchronize_rcu() also dispenses with the
-		 * need for a memory barrier on the first store to
-		 * ->rcu_tasks_holdout, as it forces the store to happen
-		 * after the beginning of the grace period.
-		 */
-		synchronize_rcu();
-
-		/*
-		 * There were callbacks, so we need to wait for an
-		 * RCU-tasks grace period.  Start off by scanning
-		 * the task list for tasks that are not already
-		 * voluntarily blocked.  Mark these tasks and make
-		 * a list of them in rcu_tasks_holdouts.
-		 */
-		rcu_read_lock();
-		for_each_process_thread(g, t) {
-			if (t != current && READ_ONCE(t->on_rq) &&
-			    !is_idle_task(t)) {
-				get_task_struct(t);
-				t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
-				WRITE_ONCE(t->rcu_tasks_holdout, true);
-				list_add(&t->rcu_tasks_holdout_list,
-					 &rcu_tasks_holdouts);
-			}
-		}
-		rcu_read_unlock();
-
-		/*
-		 * Wait for tasks that are in the process of exiting.
-		 * This does only part of the job, ensuring that all
-		 * tasks that were previously exiting reach the point
-		 * where they have disabled preemption, allowing the
-		 * later synchronize_rcu() to finish the job.
-		 */
-		synchronize_srcu(&tasks_rcu_exit_srcu);
-
-		/*
-		 * Each pass through the following loop scans the list
-		 * of holdout tasks, removing any that are no longer
-		 * holdouts.  When the list is empty, we are done.
-		 */
-		lastreport = jiffies;
-
-		/* Start off with HZ/10 wait and slowly back off to 1 HZ wait*/
-		fract = 10;
-
-		for (;;) {
-			bool firstreport;
-			bool needreport;
-			int rtst;
-			struct task_struct *t1;
-
-			if (list_empty(&rcu_tasks_holdouts))
-				break;
-
-			/* Slowly back off waiting for holdouts */
-			schedule_timeout_interruptible(HZ/fract);
-
-			if (fract > 1)
-				fract--;
-
-			rtst = READ_ONCE(rcu_task_stall_timeout);
-			needreport = rtst > 0 &&
-				     time_after(jiffies, lastreport + rtst);
-			if (needreport)
-				lastreport = jiffies;
-			firstreport = true;
-			WARN_ON(signal_pending(current));
-			list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
-						rcu_tasks_holdout_list) {
-				check_holdout_task(t, needreport, &firstreport);
-				cond_resched();
-			}
-		}
-
-		/*
-		 * Because ->on_rq and ->nvcsw are not guaranteed
-		 * to have a full memory barriers prior to them in the
-		 * schedule() path, memory reordering on other CPUs could
-		 * cause their RCU-tasks read-side critical sections to
-		 * extend past the end of the grace period.  However,
-		 * because these ->nvcsw updates are carried out with
-		 * interrupts disabled, we can use synchronize_rcu()
-		 * to force the needed ordering on all such CPUs.
-		 *
-		 * This synchronize_rcu() also confines all
-		 * ->rcu_tasks_holdout accesses to be within the grace
-		 * period, avoiding the need for memory barriers for
-		 * ->rcu_tasks_holdout accesses.
-		 *
-		 * In addition, this synchronize_rcu() waits for exiting
-		 * tasks to complete their final preempt_disable() region
-		 * of execution, cleaning up after the synchronize_srcu()
-		 * above.
-		 */
-		synchronize_rcu();
-
-		/* Invoke the callbacks. */
-		while (list) {
-			next = list->next;
-			local_bh_disable();
-			list->func(list);
-			local_bh_enable();
-			list = next;
-			cond_resched();
-		}
-		/* Paranoid sleep to keep this from entering a tight loop */
-		schedule_timeout_uninterruptible(HZ/10);
-	}
-}
-
-/* Spawn rcu_tasks_kthread() at core_initcall() time. */
-static int __init rcu_spawn_tasks_kthread(void)
-{
-	struct task_struct *t;
-
-	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
-	if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
-		return 0;
-	smp_mb(); /* Ensure others see full kthread. */
-	WRITE_ONCE(rcu_tasks_kthread_ptr, t);
-	return 0;
-}
-core_initcall(rcu_spawn_tasks_kthread);
-
-/* Do the srcu_read_lock() for the above synchronize_srcu().  */
-void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
-{
-	preempt_disable();
-	current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu);
-	preempt_enable();
-}
-
-/* Do the srcu_read_unlock() for the above synchronize_srcu().  */
-void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
-{
-	preempt_disable();
-	__srcu_read_unlock(&tasks_rcu_exit_srcu, current->rcu_tasks_idx);
-	preempt_enable();
-}
-
-#endif /* #ifdef CONFIG_TASKS_RCU */
-
-#ifndef CONFIG_TINY_RCU
-
-/*
- * Print any non-default Tasks RCU settings.
- */
-static void __init rcu_tasks_bootup_oddness(void)
-{
-#ifdef CONFIG_TASKS_RCU
-	if (rcu_task_stall_timeout != RCU_TASK_STALL_TIMEOUT)
-		pr_info("\tTasks-RCU CPU stall warnings timeout set to %d (rcu_task_stall_timeout).\n", rcu_task_stall_timeout);
-	else
-		pr_info("\tTasks RCU enabled.\n");
-#endif /* #ifdef CONFIG_TASKS_RCU */
-}
-
-#endif /* #ifndef CONFIG_TINY_RCU */
-
 #ifdef CONFIG_PROVE_RCU
 
 /*
@@ -923,6 +559,8 @@ late_initcall(rcu_verify_early_boot_tests);
 void rcu_early_boot_tests(void) {}
 #endif /* CONFIG_PROVE_RCU */
 
+#include "tasks.h"
+
 #ifndef CONFIG_TINY_RCU
 
 /*
-- 
2.9.5



* [PATCH RFC v2 tip/core/rcu 05/22] rcu-tasks: Create struct to hold state information
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (3 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 04/22] rcu-tasks: Move Tasks RCU to its own file paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 06/22] rcu: Reinstate synchronize_rcu_mult() paulmck
                     ` (20 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit creates an rcu_tasks struct to hold state information for
RCU Tasks.  This is a preparation commit for adding additional flavors
of Tasks RCU, each of which would have its own rcu_tasks struct.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
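The callback list that this struct wraps is a singly linked list with a
tail pointer.  The following user-space sketch models the enqueue done
by call_rcu_tasks() and the "pick up any new callbacks" step in the
kthread, with all locking and wakeups omitted.  The cb_node, cb_queue,
cb_enqueue() and cb_dequeue_all() names are hypothetical stand-ins for
rcu_head, the cbs_head/cbs_tail pair, and the open-coded list
manipulation, so treat it as an illustration rather than kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct rcu_head. */
struct cb_node {
	struct cb_node *next;
	void (*func)(struct cb_node *);
};

/* Stand-in for the cbs_head/cbs_tail fields of struct rcu_tasks. */
struct cb_queue {
	struct cb_node *head;
	struct cb_node **tail;
};

static void cb_queue_init(struct cb_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;	/* Empty list: tail points at head. */
}

/* Append n; return true if the list was empty, so a wakeup is needed. */
static bool cb_enqueue(struct cb_queue *q, struct cb_node *n,
		       void (*func)(struct cb_node *))
{
	bool needwake;

	assert(q && n);
	n->next = NULL;
	n->func = func;
	needwake = !q->head;
	*q->tail = n;
	q->tail = &n->next;
	return needwake;
}

/* Detach and return the whole list, leaving the queue empty. */
static struct cb_node *cb_dequeue_all(struct cb_queue *q)
{
	struct cb_node *list = q->head;

	q->head = NULL;
	q->tail = &q->head;
	return list;
}
```

Keeping a pointer to the last ->next field makes the append constant
time without a special case for the empty list, which is why the
static initializer in DEFINE_RCU_TASKS() points .cbs_tail at
name.cbs_head.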
 kernel/rcu/tasks.h | 73 ++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 46 insertions(+), 27 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index be8d179..5ccfe0d 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -7,6 +7,30 @@
 
 #ifdef CONFIG_TASKS_RCU
 
+/**
+ * Definition for a Tasks-RCU-like mechanism.
+ * @cbs_head: Head of callback list.
+ * @cbs_tail: Tail pointer for callback list.
+ * @cbs_wq: Wait queue allowing new callback to get kthread's attention.
+ * @cbs_lock: Lock protecting callback list.
+ * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
+ */
+struct rcu_tasks {
+	struct rcu_head *cbs_head;
+	struct rcu_head **cbs_tail;
+	struct wait_queue_head cbs_wq;
+	raw_spinlock_t cbs_lock;
+	struct task_struct *kthread_ptr;
+};
+
+#define DEFINE_RCU_TASKS(name)						\
+static struct rcu_tasks name =						\
+{									\
+	.cbs_tail = &name.cbs_head,					\
+	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(name.cbs_wq),		\
+	.cbs_lock = __RAW_SPIN_LOCK_UNLOCKED(name.cbs_lock),		\
+}
+
 /*
  * Simple variant of RCU whose quiescent states are voluntary context
  * switch, cond_resched_rcu_qs(), user-space execution, and idle.
@@ -18,12 +42,7 @@
  * rates from multiple CPUs.  If this is required, per-CPU callback lists
  * will be needed.
  */
-
-/* Global list of callbacks and associated lock. */
-static struct rcu_head *rcu_tasks_cbs_head;
-static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
-static DECLARE_WAIT_QUEUE_HEAD(rcu_tasks_cbs_wq);
-static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
+DEFINE_RCU_TASKS(rcu_tasks);
 
 /* Track exiting tasks in order to allow them to be waited for. */
 DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
@@ -33,8 +52,6 @@ DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
 module_param(rcu_task_stall_timeout, int, 0644);
 
-static struct task_struct *rcu_tasks_kthread_ptr;
-
 /**
  * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
  * @rhp: structure to be used for queueing the RCU updates.
@@ -57,17 +74,18 @@ void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
 {
 	unsigned long flags;
 	bool needwake;
+	struct rcu_tasks *rtp = &rcu_tasks;
 
 	rhp->next = NULL;
 	rhp->func = func;
-	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
-	needwake = !rcu_tasks_cbs_head;
-	WRITE_ONCE(*rcu_tasks_cbs_tail, rhp);
-	rcu_tasks_cbs_tail = &rhp->next;
-	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+	raw_spin_lock_irqsave(&rtp->cbs_lock, flags);
+	needwake = !rtp->cbs_head;
+	WRITE_ONCE(*rtp->cbs_tail, rhp);
+	rtp->cbs_tail = &rhp->next;
+	raw_spin_unlock_irqrestore(&rtp->cbs_lock, flags);
 	/* We can't create the thread unless interrupts are enabled. */
-	if (needwake && READ_ONCE(rcu_tasks_kthread_ptr))
-		wake_up(&rcu_tasks_cbs_wq);
+	if (needwake && READ_ONCE(rtp->kthread_ptr))
+		wake_up(&rtp->cbs_wq);
 }
 EXPORT_SYMBOL_GPL(call_rcu_tasks);
 
@@ -169,10 +187,12 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 	struct rcu_head *list;
 	struct rcu_head *next;
 	LIST_HEAD(rcu_tasks_holdouts);
+	struct rcu_tasks *rtp = arg;
 	int fract;
 
 	/* Run on housekeeping CPUs by default.  Sysadm can move if desired. */
 	housekeeping_affine(current, HK_FLAG_RCU);
+	WRITE_ONCE(rtp->kthread_ptr, current); // Let GPs start!
 
 	/*
 	 * Each pass through the following loop makes one check for
@@ -183,17 +203,17 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 	for (;;) {
 
 		/* Pick up any new callbacks. */
-		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
-		list = rcu_tasks_cbs_head;
-		rcu_tasks_cbs_head = NULL;
-		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
-		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+		raw_spin_lock_irqsave(&rtp->cbs_lock, flags);
+		list = rtp->cbs_head;
+		rtp->cbs_head = NULL;
+		rtp->cbs_tail = &rtp->cbs_head;
+		raw_spin_unlock_irqrestore(&rtp->cbs_lock, flags);
 
 		/* If there were none, wait a bit and start over. */
 		if (!list) {
-			wait_event_interruptible(rcu_tasks_cbs_wq,
-						 READ_ONCE(rcu_tasks_cbs_head));
-			if (!rcu_tasks_cbs_head) {
+			wait_event_interruptible(rtp->cbs_wq,
+						 READ_ONCE(rtp->cbs_head));
+			if (!rtp->cbs_head) {
 				WARN_ON(signal_pending(current));
 				schedule_timeout_interruptible(HZ/10);
 			}
@@ -211,7 +231,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 		 *
 		 * This synchronize_rcu() also dispenses with the
 		 * need for a memory barrier on the first store to
-		 * ->rcu_tasks_holdout, as it forces the store to happen
+		 * t->rcu_tasks_holdout, as it forces the store to happen
 		 * after the beginning of the grace period.
 		 */
 		synchronize_rcu();
@@ -278,7 +298,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 			firstreport = true;
 			WARN_ON(signal_pending(current));
 			list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
-						rcu_tasks_holdout_list) {
+						 rcu_tasks_holdout_list) {
 				check_holdout_task(t, needreport, &firstreport);
 				cond_resched();
 			}
@@ -325,11 +345,10 @@ static int __init rcu_spawn_tasks_kthread(void)
 {
 	struct task_struct *t;
 
-	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
+	t = kthread_run(rcu_tasks_kthread, &rcu_tasks, "rcu_tasks_kthread");
 	if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
 		return 0;
 	smp_mb(); /* Ensure others see full kthread. */
-	WRITE_ONCE(rcu_tasks_kthread_ptr, t);
 	return 0;
 }
 core_initcall(rcu_spawn_tasks_kthread);
-- 
2.9.5


* [PATCH RFC v2 tip/core/rcu 06/22] rcu: Reinstate synchronize_rcu_mult()
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (4 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 05/22] rcu-tasks: Create struct to hold state information paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 07/22] rcutorture: Add a test for synchronize_rcu_mult() paulmck
                     ` (19 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

With the advent and likely usage of synchronize_rcu_tasks_rude(), there
is again a need to wait on multiple types of RCU grace periods, for
example, those of call_rcu_tasks() and call_rcu_tasks_rude().  This
commit therefore reinstates synchronize_rcu_mult() in order to allow
these grace periods to be straightforwardly waited on concurrently.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
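Reader's note on why waiting concurrently matters: posting one callback
per flavor first and only then waiting for all of them lets the grace
periods overlap, so the total wait is roughly the longest grace period
rather than the sum.  Below is a minimal single-threaded userspace
simulation of that idea, assuming a global tick counter in place of real
grace periods.  Everything in it (toy_flavor, toy_call(), call_toy_a(),
call_toy_b(), toy_tick(), toy_wait_mult(), gp_waiter) is hypothetical
and is not the kernel's _wait_rcu_gp() implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rcu_head. */
struct cb_head {
	void (*func)(struct cb_head *);
};

typedef void (*cb_func_t)(struct cb_head *);
typedef void (*call_fn_t)(struct cb_head *, cb_func_t);

/* Callback plus completion flag; head must stay the first member. */
struct gp_waiter {
	struct cb_head head;
	bool done;
};

static void wakeme(struct cb_head *head)
{
	((struct gp_waiter *)head)->done = true;
}

#define MAX_PENDING 8

struct toy_pending {
	struct cb_head *cb;
	int due;			/* Tick at which the toy GP ends. */
};

struct toy_flavor {
	int gp_ticks;			/* Assumed grace-period length. */
	int npending;
	struct toy_pending pending[MAX_PENDING];
};

static int now;				/* Simulated time, in ticks. */
static struct toy_flavor flavor_a = { .gp_ticks = 3 };
static struct toy_flavor flavor_b = { .gp_ticks = 5 };

/* Queue a callback to run once this flavor's toy grace period ends. */
static void toy_call(struct toy_flavor *f, struct cb_head *head,
		     cb_func_t func)
{
	assert(f->npending < MAX_PENDING);
	head->func = func;
	f->pending[f->npending].cb = head;
	f->pending[f->npending].due = now + f->gp_ticks;
	f->npending++;
}

static void call_toy_a(struct cb_head *h, cb_func_t func)
{
	toy_call(&flavor_a, h, func);
}

static void call_toy_b(struct cb_head *h, cb_func_t func)
{
	toy_call(&flavor_b, h, func);
}

/* Advance time by one tick and invoke every callback that is now due. */
static void toy_tick(void)
{
	struct toy_flavor *fl[] = { &flavor_a, &flavor_b };

	now++;
	for (int i = 0; i < 2; i++) {
		struct toy_flavor *f = fl[i];
		int keep = 0;

		for (int j = 0; j < f->npending; j++) {
			if (f->pending[j].due <= now)
				f->pending[j].cb->func(f->pending[j].cb);
			else
				f->pending[keep++] = f->pending[j];
		}
		f->npending = keep;
	}
}

/* Post to all flavors first, then wait for all.  Returns elapsed ticks. */
static int toy_wait_mult(call_fn_t *calls, int n)
{
	struct gp_waiter w[4];
	int start = now;

	assert(n >= 0 && n <= 4);
	for (int i = 0; i < n; i++) {
		w[i].done = false;
		calls[i](&w[i].head, wakeme);
	}
	for (int i = 0; i < n; i++)
		while (!w[i].done)
			toy_tick();
	return now - start;
}
```

With the assumed lengths of 3 and 5 ticks, waiting on both flavors in one
toy_wait_mult() call takes 5 ticks, whereas two back-to-back single-flavor
waits take 8.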
 include/linux/rcupdate_wait.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
index c0578ba..699b938 100644
--- a/include/linux/rcupdate_wait.h
+++ b/include/linux/rcupdate_wait.h
@@ -31,4 +31,23 @@ do {									\
 
 #define wait_rcu_gp(...) _wait_rcu_gp(false, __VA_ARGS__)
 
+/**
+ * synchronize_rcu_mult - Wait concurrently for multiple grace periods
+ * @...: List of call_rcu() functions for different grace periods to wait on
+ *
+ * This macro waits concurrently for multiple types of RCU grace periods.
+ * For example, synchronize_rcu_mult(call_rcu, call_rcu_tasks) would wait
+ * on concurrent RCU and RCU-tasks grace periods.  Waiting on a given SRCU
+ * domain requires you to write a wrapper function for that SRCU domain's
+ * call_srcu() function, with this wrapper supplying the pointer to the
+ * corresponding srcu_struct.
+ *
+ * The first argument tells Tiny RCU's _wait_rcu_gp() not to
+ * bother waiting for RCU.  The reason for this is because anywhere
+ * synchronize_rcu_mult() can be called is automatically already a full
+ * grace period.
+ */
+#define synchronize_rcu_mult(...) \
+	_wait_rcu_gp(IS_ENABLED(CONFIG_TINY_RCU), __VA_ARGS__)
+
 #endif /* _LINUX_SCHED_RCUPDATE_WAIT_H */
-- 
2.9.5


* [PATCH RFC v2 tip/core/rcu 07/22] rcutorture: Add a test for synchronize_rcu_mult()
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (5 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 06/22] rcu: Reinstate synchronize_rcu_mult() paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 08/22] rcu-tasks: Refactor RCU-tasks to allow variants to be added paulmck
                     ` (18 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds a crude test for synchronize_rcu_mult().  This is
currently a smoke test rather than a high-quality stress test.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/rcutorture.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index ada5b91..88631f5 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -20,7 +20,7 @@
 #include <linux/err.h>
 #include <linux/spinlock.h>
 #include <linux/smp.h>
-#include <linux/rcupdate.h>
+#include <linux/rcupdate_wait.h>
 #include <linux/interrupt.h>
 #include <linux/sched/signal.h>
 #include <uapi/linux/sched/types.h>
@@ -666,6 +666,11 @@ static void rcu_tasks_torture_deferred_free(struct rcu_torture *p)
 	call_rcu_tasks(&p->rtort_rcu, rcu_torture_cb);
 }
 
+static void synchronize_rcu_mult_test(void)
+{
+	synchronize_rcu_mult(call_rcu_tasks, call_rcu);
+}
+
 static struct rcu_torture_ops tasks_ops = {
 	.ttype		= RCU_TASKS_FLAVOR,
 	.init		= rcu_sync_torture_init,
@@ -675,7 +680,7 @@ static struct rcu_torture_ops tasks_ops = {
 	.get_gp_seq	= rcu_no_completed,
 	.deferred_free	= rcu_tasks_torture_deferred_free,
 	.sync		= synchronize_rcu_tasks,
-	.exp_sync	= synchronize_rcu_tasks,
+	.exp_sync	= synchronize_rcu_mult_test,
 	.call		= call_rcu_tasks,
 	.cb_barrier	= rcu_barrier_tasks,
 	.fqs		= NULL,
-- 
2.9.5


* [PATCH RFC v2 tip/core/rcu 08/22] rcu-tasks: Refactor RCU-tasks to allow variants to be added
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (6 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 07/22] rcutorture: Add a test for synchronize_rcu_mult() paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 09/22] rcu-tasks: Add an RCU-tasks rude variant paulmck
                     ` (17 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit splits out generic processing from RCU-tasks-specific
processing in order to allow additional flavors to be added.  It also
adds a def_bool TASKS_RCU_GENERIC to enable the common RCU-tasks
infrastructure code.

This is primarily, but not entirely, a code-movement commit.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
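Reader's note on the refactoring: the generic loop no longer knows how a
grace period is detected.  It detaches the callbacks, calls the flavor's
->gp_func, and then invokes the callbacks.  Below is a minimal userspace
sketch of that dispatch, assuming no locking and one synchronous pass
instead of a kthread loop.  The names flavor, DEFINE_FLAVOR(),
flavor_call(), flavor_pass(), toy_x_wait_gp(), toy_y_wait_gp(), and
demo_cb are hypothetical, and the toy grace-period functions only bump
counters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rcu_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *);
};

struct flavor;
typedef void (*gp_func_t)(struct flavor *fp);

/* Per-flavor state: callback list plus the flavor-specific GP hook. */
struct flavor {
	struct cb_head *cbs_head;
	struct cb_head **cbs_tail;
	gp_func_t gp_func;
	int gps_completed;		/* Demo-only bookkeeping. */
};

#define DEFINE_FLAVOR(name, gp) \
static struct flavor name = { .cbs_tail = &name.cbs_head, .gp_func = gp }

/* Generic enqueue, shared by every flavor. */
static void flavor_call(struct flavor *fp, struct cb_head *rhp,
			void (*func)(struct cb_head *))
{
	rhp->next = NULL;
	rhp->func = func;
	*fp->cbs_tail = rhp;
	fp->cbs_tail = &rhp->next;
}

/* One pass of the generic loop.  Returns the number of callbacks run. */
static int flavor_pass(struct flavor *fp)
{
	struct cb_head *list = fp->cbs_head;
	int n = 0;

	fp->cbs_head = NULL;
	fp->cbs_tail = &fp->cbs_head;
	if (!list)
		return 0;		/* Nothing queued, so no GP needed. */

	fp->gp_func(fp);		/* Flavor-specific grace-period wait. */

	while (list) {
		struct cb_head *next = list->next;

		list->func(list);
		list = next;
		n++;
	}
	return n;
}

/* Two toy grace-period functions that differ only in which counter
 * they bump, standing in for genuinely different GP algorithms. */
static int toy_x_gps, toy_y_gps;
static void toy_x_wait_gp(struct flavor *fp) { toy_x_gps++; fp->gps_completed++; }
static void toy_y_wait_gp(struct flavor *fp) { toy_y_gps++; fp->gps_completed++; }

DEFINE_FLAVOR(flavor_x, toy_x_wait_gp);
DEFINE_FLAVOR(flavor_y, toy_y_wait_gp);

/* Demo callback recording how many GPs had completed when it ran. */
struct demo_cb {
	struct cb_head head;		/* Must stay the first member. */
	struct flavor *fp;
	int gps_seen;
};

static void demo_func(struct cb_head *rhp)
{
	struct demo_cb *d = (struct demo_cb *)rhp;

	d->gps_seen = d->fp->gps_completed;
}
```

Adding a further flavor in this sketch needs only a new grace-period
function and one DEFINE_FLAVOR() line, which is the property the commit
message is after.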
 include/linux/rcupdate.h |   6 +-
 kernel/rcu/Kconfig       |  10 +-
 kernel/rcu/tasks.h       | 491 +++++++++++++++++++++++++----------------------
 kernel/rcu/update.c      |   4 +
 4 files changed, 272 insertions(+), 239 deletions(-)
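Reader's note on the holdout-scan backoff: the loop that this patch moves
into rcu_tasks_wait_gp() sleeps for HZ/fract jiffies per pass, with fract
starting at 10 and decreasing by one per pass until it reaches 1.  The
small userspace sketch below reproduces only that integer arithmetic.
The helper names holdout_waits() and holdout_total() are hypothetical,
and the tick rate is passed in as a parameter because the kernel's HZ is
a build-time setting.

```c
#include <assert.h>

/* Fill waits[0..n-1] with the first n per-pass sleep lengths, in
 * jiffies, for a tick rate of hz.  Mirrors "HZ/fract; if (fract > 1)
 * fract--;" from the holdout loop. */
static void holdout_waits(int hz, int *waits, int n)
{
	int fract = 10;

	assert(hz > 0);
	for (int i = 0; i < n; i++) {
		waits[i] = hz / fract;	/* Integer division truncates. */
		if (fract > 1)
			fract--;
	}
}

/* Total jiffies slept over the first n passes. */
static int holdout_total(int hz, int n)
{
	int fract = 10;
	int total = 0;

	assert(hz > 0);
	for (int i = 0; i < n; i++) {
		total += hz / fract;
		if (fract > 1)
			fract--;
	}
	return total;
}
```

Assuming a tick rate of 1000, the first ten sleeps are 100, 111, 125, 142,
166, 200, 250, 333, 500, and 1000 jiffies, for a total of 2927 jiffies.
Every later pass sleeps for a full 1000 jiffies.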

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 2678a37..5523145 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -129,7 +129,7 @@ static inline void rcu_init_nohz(void) { }
  * Note a quasi-voluntary context switch for RCU-tasks's benefit.
  * This is a macro rather than an inline function to avoid #include hell.
  */
-#ifdef CONFIG_TASKS_RCU
+#ifdef CONFIG_TASKS_RCU_GENERIC
 #define rcu_tasks_qs(t) \
 	do { \
 		if (READ_ONCE((t)->rcu_tasks_holdout)) \
@@ -140,14 +140,14 @@ void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
 void synchronize_rcu_tasks(void);
 void exit_tasks_rcu_start(void);
 void exit_tasks_rcu_finish(void);
-#else /* #ifdef CONFIG_TASKS_RCU */
+#else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
 #define rcu_tasks_qs(t)	do { } while (0)
 #define rcu_note_voluntary_context_switch(t) do { } while (0)
 #define call_rcu_tasks call_rcu
 #define synchronize_rcu_tasks synchronize_rcu
 static inline void exit_tasks_rcu_start(void) { }
 static inline void exit_tasks_rcu_finish(void) { }
-#endif /* #else #ifdef CONFIG_TASKS_RCU */
+#endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
 
 /**
  * cond_resched_tasks_rcu_qs - Report potential quiescent states to RCU
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 1cc940f..38475d0 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -70,13 +70,19 @@ config TREE_SRCU
 	help
 	  This option selects the full-fledged version of SRCU.
 
+config TASKS_RCU_GENERIC
+	def_bool TASKS_RCU
+	select SRCU
+	help
+	  This option enables generic infrastructure code supporting
+	  task-based RCU implementations.  Not for manual selection.
+
 config TASKS_RCU
 	def_bool PREEMPTION
-	select SRCU
 	help
 	  This option enables a task-based RCU implementation that uses
 	  only voluntary context switch (not preemption!), idle, and
-	  user-mode execution as quiescent states.
+	  user-mode execution as quiescent states.  Not for manual selection.
 
 config RCU_STALL_COMMON
 	def_bool TREE_RCU
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 5ccfe0d..d77921e 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -5,7 +5,13 @@
  * Copyright (C) 2020 Paul E. McKenney
  */
 
-#ifdef CONFIG_TASKS_RCU
+
+////////////////////////////////////////////////////////////////////////
+//
+// Generic data structures.
+
+struct rcu_tasks;
+typedef void (*rcu_tasks_gp_func_t)(struct rcu_tasks *rtp);
 
 /**
  * Definition for a Tasks-RCU-like mechanism.
@@ -14,6 +20,8 @@
  * @cbs_wq: Wait queue allowning new callback to get kthread's attention.
  * @cbs_lock: Lock protecting callback list.
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
+ * @gp_func: This flavor's grace-period-wait function.
+ * @call_func: This flavor's call_rcu()-equivalent function.
  */
 struct rcu_tasks {
 	struct rcu_head *cbs_head;
@@ -21,29 +29,20 @@ struct rcu_tasks {
 	struct wait_queue_head cbs_wq;
 	raw_spinlock_t cbs_lock;
 	struct task_struct *kthread_ptr;
+	rcu_tasks_gp_func_t gp_func;
+	call_rcu_func_t call_func;
 };
 
-#define DEFINE_RCU_TASKS(name)						\
+#define DEFINE_RCU_TASKS(name, gp, call)				\
 static struct rcu_tasks name =						\
 {									\
 	.cbs_tail = &name.cbs_head,					\
 	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(name.cbs_wq),		\
 	.cbs_lock = __RAW_SPIN_LOCK_UNLOCKED(name.cbs_lock),		\
+	.gp_func = gp,							\
+	.call_func = call,						\
 }
 
-/*
- * Simple variant of RCU whose quiescent states are voluntary context
- * switch, cond_resched_rcu_qs(), user-space execution, and idle.
- * As such, grace periods can take one good long time.  There are no
- * read-side primitives similar to rcu_read_lock() and rcu_read_unlock()
- * because this implementation is intended to get the system into a safe
- * state for some of the manipulations involved in tracing and the like.
- * Finally, this implementation does not support high call_rcu_tasks()
- * rates from multiple CPUs.  If this is required, per-CPU callback lists
- * will be needed.
- */
-DEFINE_RCU_TASKS(rcu_tasks);
-
 /* Track exiting tasks in order to allow them to be waited for. */
 DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 
@@ -52,29 +51,16 @@ DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
 module_param(rcu_task_stall_timeout, int, 0644);
 
-/**
- * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
- * @rhp: structure to be used for queueing the RCU updates.
- * @func: actual callback function to be invoked after the grace period
- *
- * The callback function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. call_rcu_tasks() assumes
- * that the read-side critical sections end at a voluntary context
- * switch (not a preemption!), cond_resched_rcu_qs(), entry into idle,
- * or transition to usermode execution.  As such, there are no read-side
- * primitives analogous to rcu_read_lock() and rcu_read_unlock() because
- * this primitive is intended to determine that all tasks have passed
- * through a safe state, not so much for data-strcuture synchronization.
- *
- * See the description of call_rcu() for more detailed information on
- * memory ordering guarantees.
- */
-void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
+////////////////////////////////////////////////////////////////////////
+//
+// Generic code.
+
+// Enqueue a callback for the specified flavor of Tasks RCU.
+static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
+				   struct rcu_tasks *rtp)
 {
 	unsigned long flags;
 	bool needwake;
-	struct rcu_tasks *rtp = &rcu_tasks;
 
 	rhp->next = NULL;
 	rhp->func = func;
@@ -87,108 +73,25 @@ void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
 	if (needwake && READ_ONCE(rtp->kthread_ptr))
 		wake_up(&rtp->cbs_wq);
 }
-EXPORT_SYMBOL_GPL(call_rcu_tasks);
 
-/**
- * synchronize_rcu_tasks - wait until an rcu-tasks grace period has elapsed.
- *
- * Control will return to the caller some time after a full rcu-tasks
- * grace period has elapsed, in other words after all currently
- * executing rcu-tasks read-side critical sections have elapsed.  These
- * read-side critical sections are delimited by calls to schedule(),
- * cond_resched_tasks_rcu_qs(), idle execution, userspace execution, calls
- * to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().
- *
- * This is a very specialized primitive, intended only for a few uses in
- * tracing and other situations requiring manipulation of function
- * preambles and profiling hooks.  The synchronize_rcu_tasks() function
- * is not (yet) intended for heavy use from multiple CPUs.
- *
- * Note that this guarantee implies further memory-ordering guarantees.
- * On systems with more than one CPU, when synchronize_rcu_tasks() returns,
- * each CPU is guaranteed to have executed a full memory barrier since the
- * end of its last RCU-tasks read-side critical section whose beginning
- * preceded the call to synchronize_rcu_tasks().  In addition, each CPU
- * having an RCU-tasks read-side critical section that extends beyond
- * the return from synchronize_rcu_tasks() is guaranteed to have executed
- * a full memory barrier after the beginning of synchronize_rcu_tasks()
- * and before the beginning of that RCU-tasks read-side critical section.
- * Note that these guarantees include CPUs that are offline, idle, or
- * executing in user mode, as well as CPUs that are executing in the kernel.
- *
- * Furthermore, if CPU A invoked synchronize_rcu_tasks(), which returned
- * to its caller on CPU B, then both CPU A and CPU B are guaranteed
- * to have executed a full memory barrier during the execution of
- * synchronize_rcu_tasks() -- even if CPU A and CPU B are the same CPU
- * (but again only if the system has more than one CPU).
- */
-void synchronize_rcu_tasks(void)
+// Wait for a grace period for the specified flavor of Tasks RCU.
+static void synchronize_rcu_tasks_generic(struct rcu_tasks *rtp)
 {
 	/* Complain if the scheduler has not started.  */
 	RCU_LOCKDEP_WARN(rcu_scheduler_active == RCU_SCHEDULER_INACTIVE,
 			 "synchronize_rcu_tasks called too soon");
 
 	/* Wait for the grace period. */
-	wait_rcu_gp(call_rcu_tasks);
-}
-EXPORT_SYMBOL_GPL(synchronize_rcu_tasks);
-
-/**
- * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
- *
- * Although the current implementation is guaranteed to wait, it is not
- * obligated to, for example, if there are no pending callbacks.
- */
-void rcu_barrier_tasks(void)
-{
-	/* There is only one callback queue, so this is easy.  ;-) */
-	synchronize_rcu_tasks();
-}
-EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
-
-/* See if tasks are still holding out, complain if so. */
-static void check_holdout_task(struct task_struct *t,
-			       bool needreport, bool *firstreport)
-{
-	int cpu;
-
-	if (!READ_ONCE(t->rcu_tasks_holdout) ||
-	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||
-	    !READ_ONCE(t->on_rq) ||
-	    (IS_ENABLED(CONFIG_NO_HZ_FULL) &&
-	     !is_idle_task(t) && t->rcu_tasks_idle_cpu >= 0)) {
-		WRITE_ONCE(t->rcu_tasks_holdout, false);
-		list_del_init(&t->rcu_tasks_holdout_list);
-		put_task_struct(t);
-		return;
-	}
-	rcu_request_urgent_qs_task(t);
-	if (!needreport)
-		return;
-	if (*firstreport) {
-		pr_err("INFO: rcu_tasks detected stalls on tasks:\n");
-		*firstreport = false;
-	}
-	cpu = task_cpu(t);
-	pr_alert("%p: %c%c nvcsw: %lu/%lu holdout: %d idle_cpu: %d/%d\n",
-		 t, ".I"[is_idle_task(t)],
-		 "N."[cpu < 0 || !tick_nohz_full_cpu(cpu)],
-		 t->rcu_tasks_nvcsw, t->nvcsw, t->rcu_tasks_holdout,
-		 t->rcu_tasks_idle_cpu, cpu);
-	sched_show_task(t);
+	wait_rcu_gp(rtp->call_func);
 }
 
 /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
 static int __noreturn rcu_tasks_kthread(void *arg)
 {
 	unsigned long flags;
-	struct task_struct *g, *t;
-	unsigned long lastreport;
 	struct rcu_head *list;
 	struct rcu_head *next;
-	LIST_HEAD(rcu_tasks_holdouts);
 	struct rcu_tasks *rtp = arg;
-	int fract;
 
 	/* Run on housekeeping CPUs by default.  Sysadm can move if desired. */
 	housekeeping_affine(current, HK_FLAG_RCU);
@@ -220,111 +123,8 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 			continue;
 		}
 
-		/*
-		 * Wait for all pre-existing t->on_rq and t->nvcsw
-		 * transitions to complete.  Invoking synchronize_rcu()
-		 * suffices because all these transitions occur with
-		 * interrupts disabled.  Without this synchronize_rcu(),
-		 * a read-side critical section that started before the
-		 * grace period might be incorrectly seen as having started
-		 * after the grace period.
-		 *
-		 * This synchronize_rcu() also dispenses with the
-		 * need for a memory barrier on the first store to
-		 * t->rcu_tasks_holdout, as it forces the store to happen
-		 * after the beginning of the grace period.
-		 */
-		synchronize_rcu();
-
-		/*
-		 * There were callbacks, so we need to wait for an
-		 * RCU-tasks grace period.  Start off by scanning
-		 * the task list for tasks that are not already
-		 * voluntarily blocked.  Mark these tasks and make
-		 * a list of them in rcu_tasks_holdouts.
-		 */
-		rcu_read_lock();
-		for_each_process_thread(g, t) {
-			if (t != current && READ_ONCE(t->on_rq) &&
-			    !is_idle_task(t)) {
-				get_task_struct(t);
-				t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
-				WRITE_ONCE(t->rcu_tasks_holdout, true);
-				list_add(&t->rcu_tasks_holdout_list,
-					 &rcu_tasks_holdouts);
-			}
-		}
-		rcu_read_unlock();
-
-		/*
-		 * Wait for tasks that are in the process of exiting.
-		 * This does only part of the job, ensuring that all
-		 * tasks that were previously exiting reach the point
-		 * where they have disabled preemption, allowing the
-		 * later synchronize_rcu() to finish the job.
-		 */
-		synchronize_srcu(&tasks_rcu_exit_srcu);
-
-		/*
-		 * Each pass through the following loop scans the list
-		 * of holdout tasks, removing any that are no longer
-		 * holdouts.  When the list is empty, we are done.
-		 */
-		lastreport = jiffies;
-
-		/* Start off with HZ/10 wait and slowly back off to 1 HZ wait*/
-		fract = 10;
-
-		for (;;) {
-			bool firstreport;
-			bool needreport;
-			int rtst;
-			struct task_struct *t1;
-
-			if (list_empty(&rcu_tasks_holdouts))
-				break;
-
-			/* Slowly back off waiting for holdouts */
-			schedule_timeout_interruptible(HZ/fract);
-
-			if (fract > 1)
-				fract--;
-
-			rtst = READ_ONCE(rcu_task_stall_timeout);
-			needreport = rtst > 0 &&
-				     time_after(jiffies, lastreport + rtst);
-			if (needreport)
-				lastreport = jiffies;
-			firstreport = true;
-			WARN_ON(signal_pending(current));
-			list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
-						 rcu_tasks_holdout_list) {
-				check_holdout_task(t, needreport, &firstreport);
-				cond_resched();
-			}
-		}
-
-		/*
-		 * Because ->on_rq and ->nvcsw are not guaranteed
-		 * to have a full memory barriers prior to them in the
-		 * schedule() path, memory reordering on other CPUs could
-		 * cause their RCU-tasks read-side critical sections to
-		 * extend past the end of the grace period.  However,
-		 * because these ->nvcsw updates are carried out with
-		 * interrupts disabled, we can use synchronize_rcu()
-		 * to force the needed ordering on all such CPUs.
-		 *
-		 * This synchronize_rcu() also confines all
-		 * ->rcu_tasks_holdout accesses to be within the grace
-		 * period, avoiding the need for memory barriers for
-		 * ->rcu_tasks_holdout accesses.
-		 *
-		 * In addition, this synchronize_rcu() waits for exiting
-		 * tasks to complete their final preempt_disable() region
-		 * of execution, cleaning up after the synchronize_srcu()
-		 * above.
-		 */
-		synchronize_rcu();
+		// Wait for one grace period.
+		rtp->gp_func(rtp);
 
 		/* Invoke the callbacks. */
 		while (list) {
@@ -340,18 +140,16 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 	}
 }
 
-/* Spawn rcu_tasks_kthread() at core_initcall() time. */
-static int __init rcu_spawn_tasks_kthread(void)
+/* Spawn RCU-tasks grace-period kthread, e.g., at core_initcall() time. */
+static void __init rcu_spawn_tasks_kthread_generic(struct rcu_tasks *rtp)
 {
 	struct task_struct *t;
 
-	t = kthread_run(rcu_tasks_kthread, &rcu_tasks, "rcu_tasks_kthread");
+	t = kthread_run(rcu_tasks_kthread, rtp, "rcu_tasks_kthread");
 	if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
-		return 0;
+		return;
 	smp_mb(); /* Ensure others see full kthread. */
-	return 0;
 }
-core_initcall(rcu_spawn_tasks_kthread);
 
 /* Do the srcu_read_lock() for the above synchronize_srcu().  */
 void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
@@ -369,8 +167,6 @@ void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
 	preempt_enable();
 }
 
-#endif /* #ifdef CONFIG_TASKS_RCU */
-
 #ifndef CONFIG_TINY_RCU
 
 /*
@@ -387,3 +183,230 @@ static void __init rcu_tasks_bootup_oddness(void)
 }
 
 #endif /* #ifndef CONFIG_TINY_RCU */
+
+#ifdef CONFIG_TASKS_RCU
+
+////////////////////////////////////////////////////////////////////////
+//
+// Simple variant of RCU whose quiescent states are voluntary context
+// switch, cond_resched_rcu_qs(), user-space execution, and idle.
+// As such, grace periods can take one good long time.  There are no
+// read-side primitives similar to rcu_read_lock() and rcu_read_unlock()
+// because this implementation is intended to get the system into a safe
+// state for some of the manipulations involved in tracing and the like.
+// Finally, this implementation does not support high call_rcu_tasks()
+// rates from multiple CPUs.  If this is required, per-CPU callback lists
+// will be needed.
+
+/* See if tasks are still holding out, complain if so. */
+static void check_holdout_task(struct task_struct *t,
+			       bool needreport, bool *firstreport)
+{
+	int cpu;
+
+	if (!READ_ONCE(t->rcu_tasks_holdout) ||
+	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||
+	    !READ_ONCE(t->on_rq) ||
+	    (IS_ENABLED(CONFIG_NO_HZ_FULL) &&
+	     !is_idle_task(t) && t->rcu_tasks_idle_cpu >= 0)) {
+		WRITE_ONCE(t->rcu_tasks_holdout, false);
+		list_del_init(&t->rcu_tasks_holdout_list);
+		put_task_struct(t);
+		return;
+	}
+	rcu_request_urgent_qs_task(t);
+	if (!needreport)
+		return;
+	if (*firstreport) {
+		pr_err("INFO: rcu_tasks detected stalls on tasks:\n");
+		*firstreport = false;
+	}
+	cpu = task_cpu(t);
+	pr_alert("%p: %c%c nvcsw: %lu/%lu holdout: %d idle_cpu: %d/%d\n",
+		 t, ".I"[is_idle_task(t)],
+		 "N."[cpu < 0 || !tick_nohz_full_cpu(cpu)],
+		 t->rcu_tasks_nvcsw, t->nvcsw, t->rcu_tasks_holdout,
+		 t->rcu_tasks_idle_cpu, cpu);
+	sched_show_task(t);
+}
+
+/* Wait for one RCU-tasks grace period. */
+static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
+{
+	struct task_struct *g, *t;
+	unsigned long lastreport;
+	LIST_HEAD(rcu_tasks_holdouts);
+	int fract;
+
+	/*
+	 * Wait for all pre-existing t->on_rq and t->nvcsw transitions
+	 * to complete.  Invoking synchronize_rcu() suffices because all
+	 * these transitions occur with interrupts disabled.  Without this
+	 * synchronize_rcu(), a read-side critical section that started
+	 * before the grace period might be incorrectly seen as having
+	 * started after the grace period.
+	 *
+	 * This synchronize_rcu() also dispenses with the need for a
+	 * memory barrier on the first store to t->rcu_tasks_holdout,
+	 * as it forces the store to happen after the beginning of the
+	 * grace period.
+	 */
+	synchronize_rcu();
+
+	/*
+	 * There were callbacks, so we need to wait for an RCU-tasks
+	 * grace period.  Start off by scanning the task list for tasks
+	 * that are not already voluntarily blocked.  Mark these tasks
+	 * and make a list of them in rcu_tasks_holdouts.
+	 */
+	rcu_read_lock();
+	for_each_process_thread(g, t) {
+		if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
+			get_task_struct(t);
+			t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
+			WRITE_ONCE(t->rcu_tasks_holdout, true);
+			list_add(&t->rcu_tasks_holdout_list,
+				 &rcu_tasks_holdouts);
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * Wait for tasks that are in the process of exiting.  This
+	 * does only part of the job, ensuring that all tasks that were
+	 * previously exiting reach the point where they have disabled
+	 * preemption, allowing the later synchronize_rcu() to finish
+	 * the job.
+	 */
+	synchronize_srcu(&tasks_rcu_exit_srcu);
+
+	/*
+	 * Each pass through the following loop scans the list of holdout
+	 * tasks, removing any that are no longer holdouts.  When the list
+	 * is empty, we are done.
+	 */
+	lastreport = jiffies;
+
+	/* Start off with HZ/10 wait and slowly back off to 1 HZ wait. */
+	fract = 10;
+
+	for (;;) {
+		bool firstreport;
+		bool needreport;
+		int rtst;
+		struct task_struct *t1;
+
+		if (list_empty(&rcu_tasks_holdouts))
+			break;
+
+		/* Slowly back off waiting for holdouts */
+		schedule_timeout_interruptible(HZ/fract);
+
+		if (fract > 1)
+			fract--;
+
+		rtst = READ_ONCE(rcu_task_stall_timeout);
+		needreport = rtst > 0 && time_after(jiffies, lastreport + rtst);
+		if (needreport)
+			lastreport = jiffies;
+		firstreport = true;
+		WARN_ON(signal_pending(current));
+		list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
+					 rcu_tasks_holdout_list) {
+			check_holdout_task(t, needreport, &firstreport);
+			cond_resched();
+		}
+	}
+
+	/*
+	 * Because ->on_rq and ->nvcsw are not guaranteed to have a full
+	 * memory barriers prior to them in the schedule() path, memory
+	 * reordering on other CPUs could cause their RCU-tasks read-side
+	 * critical sections to extend past the end of the grace period.
+	 * However, because these ->nvcsw updates are carried out with
+	 * interrupts disabled, we can use synchronize_rcu() to force the
+	 * needed ordering on all such CPUs.
+	 *
+	 * This synchronize_rcu() also confines all ->rcu_tasks_holdout
+	 * accesses to be within the grace period, avoiding the need for
+	 * memory barriers for ->rcu_tasks_holdout accesses.
+	 *
+	 * In addition, this synchronize_rcu() waits for exiting tasks
+	 * to complete their final preempt_disable() region of execution,
+	 * cleaning up after the synchronize_srcu() above.
+	 */
+	synchronize_rcu();
+}
+
+void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
+DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks);
+
+/**
+ * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
+ * @rhp: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. call_rcu_tasks() assumes
+ * that the read-side critical sections end at a voluntary context
+ * switch (not a preemption!), cond_resched_rcu_qs(), entry into idle,
+ * or transition to usermode execution.  As such, there are no read-side
+ * primitives analogous to rcu_read_lock() and rcu_read_unlock() because
+ * this primitive is intended to determine that all tasks have passed
+ * through a safe state, not so much for data-structure synchronization.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
+ */
+void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
+{
+	call_rcu_tasks_generic(rhp, func, &rcu_tasks);
+}
+EXPORT_SYMBOL_GPL(call_rcu_tasks);
+
+/**
+ * synchronize_rcu_tasks - wait until an rcu-tasks grace period has elapsed.
+ *
+ * Control will return to the caller some time after a full rcu-tasks
+ * grace period has elapsed, in other words after all currently
+ * executing rcu-tasks read-side critical sections have elapsed.  These
+ * read-side critical sections are delimited by calls to schedule(),
+ * cond_resched_tasks_rcu_qs(), idle execution, userspace execution, calls
+ * to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().
+ *
+ * This is a very specialized primitive, intended only for a few uses in
+ * tracing and other situations requiring manipulation of function
+ * preambles and profiling hooks.  The synchronize_rcu_tasks() function
+ * is not (yet) intended for heavy use from multiple CPUs.
+ *
+ * See the description of synchronize_rcu() for more detailed information
+ * on memory ordering guarantees.
+ */
+void synchronize_rcu_tasks(void)
+{
+	synchronize_rcu_tasks_generic(&rcu_tasks);
+}
+EXPORT_SYMBOL_GPL(synchronize_rcu_tasks);
+
+/**
+ * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
+ *
+ * Although the current implementation is guaranteed to wait, it is not
+ * obligated to, for example, if there are no pending callbacks.
+ */
+void rcu_barrier_tasks(void)
+{
+	/* There is only one callback queue, so this is easy.  ;-) */
+	synchronize_rcu_tasks();
+}
+EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
+
+static int __init rcu_spawn_tasks_kthread(void)
+{
+	rcu_spawn_tasks_kthread_generic(&rcu_tasks);
+	return 0;
+}
+core_initcall(rcu_spawn_tasks_kthread);
+
+#endif /* #ifdef CONFIG_TASKS_RCU */
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 0fb2a9e..16058a5 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -559,7 +559,11 @@ late_initcall(rcu_verify_early_boot_tests);
 void rcu_early_boot_tests(void) {}
 #endif /* CONFIG_PROVE_RCU */
 
+#ifdef CONFIG_TASKS_RCU_GENERIC
 #include "tasks.h"
+#else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
+static inline void rcu_tasks_bootup_oddness(void) {}
+#endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
 
 #ifndef CONFIG_TINY_RCU
 
-- 
2.9.5



* [PATCH RFC v2 tip/core/rcu 09/22] rcu-tasks: Add an RCU-tasks rude variant
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (7 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 08/22] rcu-tasks: Refactor RCU-tasks to allow variants to be added paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19 19:04     ` Steven Rostedt
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 10/22] rcutorture: Add torture tests for RCU Tasks Rude paulmck
                     ` (16 subsequent siblings)
  25 siblings, 1 reply; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds a "rude" variant of RCU-tasks that has as quiescent
states schedule(), cond_resched_tasks_rcu_qs(), userspace execution,
and (in theory, anyway) cond_resched().  In other words, RCU-tasks rude
readers are regions of code with preemption disabled, but excluding code
early in the CPU-online sequence and late in the CPU-offline sequence.
Updates make use of IPIs and force an IPI and a context switch on each
online CPU.  This variant is useful in some situations in tracing.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
[ paulmck: Apply EXPORT_SYMBOL_GPL() feedback from Qiujun Huang. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 include/linux/rcupdate.h |  3 ++
 kernel/rcu/Kconfig       | 12 +++++-
 kernel/rcu/tasks.h       | 98 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 112 insertions(+), 1 deletion(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 5523145..2be97a8 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -37,6 +37,7 @@
 /* Exported common interfaces */
 void call_rcu(struct rcu_head *head, rcu_callback_t func);
 void rcu_barrier_tasks(void);
+void rcu_barrier_tasks_rude(void);
 void synchronize_rcu(void);
 
 #ifdef CONFIG_PREEMPT_RCU
@@ -138,6 +139,8 @@ static inline void rcu_init_nohz(void) { }
 #define rcu_note_voluntary_context_switch(t) rcu_tasks_qs(t)
 void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
 void synchronize_rcu_tasks(void);
+void call_rcu_tasks_rude(struct rcu_head *head, rcu_callback_t func);
+void synchronize_rcu_tasks_rude(void);
 void exit_tasks_rcu_start(void);
 void exit_tasks_rcu_finish(void);
 #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 38475d0..0d43ec1 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -71,7 +71,7 @@ config TREE_SRCU
 	  This option selects the full-fledged version of SRCU.
 
 config TASKS_RCU_GENERIC
-	def_bool TASKS_RCU
+	def_bool TASKS_RCU || TASKS_RUDE_RCU
 	select SRCU
 	help
 	  This option enables generic infrastructure code supporting
@@ -84,6 +84,16 @@ config TASKS_RCU
 	  only voluntary context switch (not preemption!), idle, and
 	  user-mode execution as quiescent states.  Not for manual selection.
 
+config TASKS_RUDE_RCU
+	def_bool 0
+	default n
+	help
+	  This option enables a task-based RCU implementation that uses
+	  only context switch (including preemption) and user-mode
+	  execution as quiescent states.  It forces IPIs and context
+	  switches on all online CPUs, including idle ones, so use
+	  with caution.  Not for manual selection.
+
 config RCU_STALL_COMMON
 	def_bool TREE_RCU
 	help
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index d77921e..7ba1730 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -180,6 +180,9 @@ static void __init rcu_tasks_bootup_oddness(void)
 	else
 		pr_info("\tTasks RCU enabled.\n");
 #endif /* #ifdef CONFIG_TASKS_RCU */
+#ifdef CONFIG_TASKS_RUDE_RCU
+	pr_info("\tRude variant of Tasks RCU enabled.\n");
+#endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
 }
 
 #endif /* #ifndef CONFIG_TINY_RCU */
@@ -410,3 +413,98 @@ static int __init rcu_spawn_tasks_kthread(void)
 core_initcall(rcu_spawn_tasks_kthread);
 
 #endif /* #ifdef CONFIG_TASKS_RCU */
+
+#ifdef CONFIG_TASKS_RUDE_RCU
+
+////////////////////////////////////////////////////////////////////////
+//
+// "Rude" variant of Tasks RCU, inspired by Steve Rostedt's trick of
+// passing an empty function to schedule_on_each_cpu().  This approach
+// provides an asynchronous call_rcu_tasks_rude() API and batching of
+// concurrent calls to the synchronous synchronize_rcu_tasks_rude() API.
+// This sends IPIs far and wide and induces otherwise unnecessary context
+// switches on all online CPUs, whether idle or not.
+
+// Empty function to allow workqueues to force a context switch.
+static void rcu_tasks_be_rude(struct work_struct *work)
+{
+}
+
+// Wait for one rude RCU-tasks grace period.
+static void rcu_tasks_rude_wait_gp(struct rcu_tasks *rtp)
+{
+	schedule_on_each_cpu(rcu_tasks_be_rude);
+}
+
+void call_rcu_tasks_rude(struct rcu_head *rhp, rcu_callback_t func);
+DEFINE_RCU_TASKS(rcu_tasks_rude, rcu_tasks_rude_wait_gp, call_rcu_tasks_rude);
+
+/**
+ * call_rcu_tasks_rude() - Queue a callback for invocation after a rude task-based grace period
+ * @rhp: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. call_rcu_tasks_rude()
+ * assumes that the read-side critical sections end at context switch,
+ * cond_resched_rcu_qs(), or transition to usermode execution.  As such,
+ * there are no read-side primitives analogous to rcu_read_lock() and
+ * rcu_read_unlock() because this primitive is intended to determine
+ * that all tasks have passed through a safe state, not so much for
+ * data-structure synchronization.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
+ */
+void call_rcu_tasks_rude(struct rcu_head *rhp, rcu_callback_t func)
+{
+	call_rcu_tasks_generic(rhp, func, &rcu_tasks_rude);
+}
+EXPORT_SYMBOL_GPL(call_rcu_tasks_rude);
+
+/**
+ * synchronize_rcu_tasks_rude - wait for a rude rcu-tasks grace period
+ *
+ * Control will return to the caller some time after a rude rcu-tasks
+ * grace period has elapsed, in other words after all currently
+ * executing rcu-tasks read-side critical sections have elapsed.  These
+ * read-side critical sections are delimited by calls to schedule(),
+ * cond_resched_tasks_rcu_qs(), userspace execution, and (in theory,
+ * anyway) cond_resched().
+ *
+ * This is a very specialized primitive, intended only for a few uses in
+ * tracing and other situations requiring manipulation of function preambles
+ * and profiling hooks.  The synchronize_rcu_tasks_rude() function is not
+ * (yet) intended for heavy use from multiple CPUs.
+ *
+ * See the description of synchronize_rcu() for more detailed information
+ * on memory ordering guarantees.
+ */
+void synchronize_rcu_tasks_rude(void)
+{
+	synchronize_rcu_tasks_generic(&rcu_tasks_rude);
+}
+EXPORT_SYMBOL_GPL(synchronize_rcu_tasks_rude);
+
+/**
+ * rcu_barrier_tasks_rude - Wait for in-flight call_rcu_tasks_rude() callbacks.
+ *
+ * Although the current implementation is guaranteed to wait, it is not
+ * obligated to, for example, if there are no pending callbacks.
+ */
+void rcu_barrier_tasks_rude(void)
+{
+	/* There is only one callback queue, so this is easy.  ;-) */
+	synchronize_rcu_tasks_rude();
+}
+EXPORT_SYMBOL_GPL(rcu_barrier_tasks_rude);
+
+static int __init rcu_spawn_tasks_rude_kthread(void)
+{
+	rcu_spawn_tasks_kthread_generic(&rcu_tasks_rude);
+	return 0;
+}
+core_initcall(rcu_spawn_tasks_rude_kthread);
+
+#endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
-- 
2.9.5



* [PATCH RFC v2 tip/core/rcu 10/22] rcutorture: Add torture tests for RCU Tasks Rude
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (8 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 09/22] rcu-tasks: Add an RCU-tasks rude variant paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 11/22] rcu-tasks: Use unique names for RCU-Tasks kthreads and messages paulmck
                     ` (15 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds the definitions required to torture the rude flavor of
RCU tasks.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/Kconfig.debug                           |  2 ++
 kernel/rcu/rcu.h                                   |  1 +
 kernel/rcu/rcutorture.c                            | 31 ++++++++++++++++++++--
 .../selftests/rcutorture/configs/rcu/CFLIST        |  1 +
 .../selftests/rcutorture/configs/rcu/RUDE01        | 10 +++++++
 .../selftests/rcutorture/configs/rcu/RUDE01.boot   |  1 +
 6 files changed, 44 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/RUDE01
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot

diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index ec4bb6c..b15a3bd 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -24,6 +24,7 @@ config RCU_PERF_TEST
 	select TORTURE_TEST
 	select SRCU
 	select TASKS_RCU
+	select TASKS_RUDE_RCU
 	default n
 	help
 	  This option provides a kernel module that runs performance
@@ -41,6 +42,7 @@ config RCU_TORTURE_TEST
 	select TORTURE_TEST
 	select SRCU
 	select TASKS_RCU
+	select TASKS_RUDE_RCU
 	default n
 	help
 	  This option provides a kernel module that runs torture tests
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 00ddc92..c574620 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -441,6 +441,7 @@ void rcu_request_urgent_qs_task(struct task_struct *t);
 enum rcutorture_type {
 	RCU_FLAVOR,
 	RCU_TASKS_FLAVOR,
+	RCU_TASKS_RUDE_FLAVOR,
 	RCU_TRIVIAL_FLAVOR,
 	SRCU_FLAVOR,
 	INVALID_RCU_FLAVOR
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 88631f5..386cd11 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -731,6 +731,33 @@ static struct rcu_torture_ops trivial_ops = {
 	.name		= "trivial"
 };
 
+/*
+ * Definitions for rude RCU-tasks torture testing.
+ */
+
+static void rcu_tasks_rude_torture_deferred_free(struct rcu_torture *p)
+{
+	call_rcu_tasks_rude(&p->rtort_rcu, rcu_torture_cb);
+}
+
+static struct rcu_torture_ops tasks_rude_ops = {
+	.ttype		= RCU_TASKS_RUDE_FLAVOR,
+	.init		= rcu_sync_torture_init,
+	.readlock	= rcu_torture_read_lock_trivial,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= rcu_torture_read_unlock_trivial,
+	.get_gp_seq	= rcu_no_completed,
+	.deferred_free	= rcu_tasks_rude_torture_deferred_free,
+	.sync		= synchronize_rcu_tasks_rude,
+	.exp_sync	= synchronize_rcu_tasks_rude,
+	.call		= call_rcu_tasks_rude,
+	.cb_barrier	= rcu_barrier_tasks_rude,
+	.fqs		= NULL,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.name		= "tasks-rude"
+};
+
 static unsigned long rcutorture_seq_diff(unsigned long new, unsigned long old)
 {
 	if (!cur_ops->gp_diff)
@@ -740,7 +767,7 @@ static unsigned long rcutorture_seq_diff(unsigned long new, unsigned long old)
 
 static bool __maybe_unused torturing_tasks(void)
 {
-	return cur_ops == &tasks_ops;
+	return cur_ops == &tasks_ops || cur_ops == &tasks_rude_ops;
 }
 
 /*
@@ -2408,7 +2435,7 @@ rcu_torture_init(void)
 	int firsterr = 0;
 	static struct rcu_torture_ops *torture_ops[] = {
 		&rcu_ops, &rcu_busted_ops, &srcu_ops, &srcud_ops,
-		&busted_srcud_ops, &tasks_ops, &trivial_ops,
+		&busted_srcud_ops, &tasks_ops, &tasks_rude_ops, &trivial_ops,
 	};
 
 	if (!torture_init_begin(torture_type, verbose))
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
index c3c1fb5..ec0c72f 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
+++ b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
@@ -14,3 +14,4 @@ TINY02
 TASKS01
 TASKS02
 TASKS03
+RUDE01
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/RUDE01 b/tools/testing/selftests/rcutorture/configs/rcu/RUDE01
new file mode 100644
index 0000000..bafe94c
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/RUDE01
@@ -0,0 +1,10 @@
+CONFIG_SMP=y
+CONFIG_NR_CPUS=2
+CONFIG_HOTPLUG_CPU=y
+CONFIG_PREEMPT_NONE=n
+CONFIG_PREEMPT_VOLUNTARY=n
+CONFIG_PREEMPT=y
+CONFIG_DEBUG_LOCK_ALLOC=y
+CONFIG_PROVE_LOCKING=y
+#CHECK#CONFIG_PROVE_RCU=y
+CONFIG_RCU_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot b/tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot
new file mode 100644
index 0000000..9363708
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot
@@ -0,0 +1 @@
+rcutorture.torture_type=tasks-rude
-- 
2.9.5
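
Editorial sketch, not part of the patch: the RUDE01.boot file selects
the new flavor by name ("tasks-rude"), and the patch adds
tasks_rude_ops to the torture_ops[] table and to torturing_tasks().
The userspace model below shows how such a name-keyed table might be
searched.  The sketch_* names are made up, the ops structure is
reduced to its name field, and the linear strcmp() search is an
assumption about how the selection works rather than a copy of
rcu_torture_init().

```c
#include <stddef.h>
#include <string.h>

/* Reduced stand-in for struct rcu_torture_ops: only the name. */
struct sketch_torture_ops {
	const char *name;
};

static struct sketch_torture_ops sketch_tasks_ops = { .name = "tasks" };
static struct sketch_torture_ops sketch_tasks_rude_ops = { .name = "tasks-rude" };
static struct sketch_torture_ops sketch_trivial_ops = { .name = "trivial" };

static struct sketch_torture_ops *sketch_ops_table[] = {
	&sketch_tasks_ops, &sketch_tasks_rude_ops, &sketch_trivial_ops,
};

/* Return the ops whose name matches the requested type, or NULL. */
static struct sketch_torture_ops *sketch_find_ops(const char *type)
{
	size_t n = sizeof(sketch_ops_table) / sizeof(sketch_ops_table[0]);

	for (size_t i = 0; i < n; i++)
		if (strcmp(type, sketch_ops_table[i]->name) == 0)
			return sketch_ops_table[i];
	return NULL;
}

/* Both Tasks flavors count as "torturing tasks", as in the patch. */
static int sketch_torturing_tasks(const struct sketch_torture_ops *cur)
{
	return cur == &sketch_tasks_ops || cur == &sketch_tasks_rude_ops;
}
```

With this table, the string from RUDE01.boot resolves to the rude ops,
and an unknown type yields NULL so that the caller can report an
error.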



* [PATCH RFC v2 tip/core/rcu 11/22] rcu-tasks: Use unique names for RCU-Tasks kthreads and messages
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (9 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 10/22] rcutorture: Add torture tests for RCU Tasks Rude paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 12/22] rcu-tasks: Further refactor RCU-tasks to allow adding more variants paulmck
                     ` (14 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit causes the flavors of RCU Tasks to use different names
for their kthreads and in their console messages.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 7ba1730..ac0f282 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -22,6 +22,8 @@ typedef void (*rcu_tasks_gp_func_t)(struct rcu_tasks *rtp);
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
  * @gp_func: This flavor's grace-period-wait function.
  * @call_func: This flavor's call_rcu()-equivalent function.
+ * @name: This flavor's textual name.
+ * @kname: This flavor's kthread name.
  */
 struct rcu_tasks {
 	struct rcu_head *cbs_head;
@@ -31,16 +33,20 @@ struct rcu_tasks {
 	struct task_struct *kthread_ptr;
 	rcu_tasks_gp_func_t gp_func;
 	call_rcu_func_t call_func;
+	char *name;
+	char *kname;
 };
 
-#define DEFINE_RCU_TASKS(name, gp, call)				\
-static struct rcu_tasks name =						\
+#define DEFINE_RCU_TASKS(rt_name, gp, call, n)				\
+static struct rcu_tasks rt_name =					\
 {									\
-	.cbs_tail = &name.cbs_head,					\
-	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(name.cbs_wq),		\
-	.cbs_lock = __RAW_SPIN_LOCK_UNLOCKED(name.cbs_lock),		\
+	.cbs_tail = &rt_name.cbs_head,					\
+	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(rt_name.cbs_wq),	\
+	.cbs_lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name.cbs_lock),		\
 	.gp_func = gp,							\
 	.call_func = call,						\
+	.name = n,							\
+	.kname = #rt_name,						\
 }
 
 /* Track exiting tasks in order to allow them to be waited for. */
@@ -145,8 +151,8 @@ static void __init rcu_spawn_tasks_kthread_generic(struct rcu_tasks *rtp)
 {
 	struct task_struct *t;
 
-	t = kthread_run(rcu_tasks_kthread, rtp, "rcu_tasks_kthread");
-	if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
+	t = kthread_run(rcu_tasks_kthread, rtp, "%s_kthread", rtp->kname);
+	if (WARN_ONCE(IS_ERR(t), "%s: Could not start %s grace-period kthread, OOM is now expected behavior\n", __func__, rtp->name))
 		return;
 	smp_mb(); /* Ensure others see full kthread. */
 }
@@ -342,7 +348,7 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 }
 
 void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
-DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks);
+DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
 
 /**
  * call_rcu_tasks() - Queue a callback for invocation after a task-based grace period
@@ -437,7 +443,8 @@ static void rcu_tasks_rude_wait_gp(struct rcu_tasks *rtp)
 }
 
 void call_rcu_tasks_rude(struct rcu_head *rhp, rcu_callback_t func);
-DEFINE_RCU_TASKS(rcu_tasks_rude, rcu_tasks_rude_wait_gp, call_rcu_tasks_rude);
+DEFINE_RCU_TASKS(rcu_tasks_rude, rcu_tasks_rude_wait_gp, call_rcu_tasks_rude,
+		 "RCU Tasks Rude");
 
 /**
  * call_rcu_tasks_rude() - Queue a callback for invocation after a rude task-based grace period
-- 
2.9.5



* [PATCH RFC v2 tip/core/rcu 12/22] rcu-tasks: Further refactor RCU-tasks to allow adding more variants
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (10 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 11/22] rcu-tasks: Use unique names for RCU-Tasks kthreads and messages paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 13/22] rcu-tasks: Code movement to allow more Tasks RCU variants paulmck
                     ` (13 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit refactors RCU tasks to allow variants to be added.  These
variants will share the current Tasks-RCU tasklist scan and the holdout
list processing.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 166 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 108 insertions(+), 58 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index ac0f282..d00e772 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -12,6 +12,11 @@
 
 struct rcu_tasks;
 typedef void (*rcu_tasks_gp_func_t)(struct rcu_tasks *rtp);
+typedef void (*pregp_func_t)(void);
+typedef void (*pertask_func_t)(struct task_struct *t, struct list_head *hop);
+typedef void (*postscan_func_t)(void);
+typedef void (*holdouts_func_t)(struct list_head *hop, bool ndrpt, bool *frptp);
+typedef void (*postgp_func_t)(void);
 
 /**
  * Definition for a Tasks-RCU-like mechanism.
@@ -21,6 +26,11 @@ typedef void (*rcu_tasks_gp_func_t)(struct rcu_tasks *rtp);
  * @cbs_lock: Lock protecting callback list.
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
  * @gp_func: This flavor's grace-period-wait function.
+ * @pregp_func: This flavor's pre-grace-period function (optional).
+ * @pertask_func: This flavor's per-task scan function (optional).
+ * @postscan_func: This flavor's post-task scan function (optional).
+ * @holdouts_func: This flavor's holdout-list scan function (optional).
+ * @postgp_func: This flavor's post-grace-period function (optional).
  * @call_func: This flavor's call_rcu()-equivalent function.
  * @name: This flavor's textual name.
  * @kname: This flavor's kthread name.
@@ -32,6 +42,11 @@ struct rcu_tasks {
 	raw_spinlock_t cbs_lock;
 	struct task_struct *kthread_ptr;
 	rcu_tasks_gp_func_t gp_func;
+	pregp_func_t pregp_func;
+	pertask_func_t pertask_func;
+	postscan_func_t postscan_func;
+	holdouts_func_t holdouts_func;
+	postgp_func_t postgp_func;
 	call_rcu_func_t call_func;
 	char *name;
 	char *kname;
@@ -113,6 +128,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 
 		/* Pick up any new callbacks. */
 		raw_spin_lock_irqsave(&rtp->cbs_lock, flags);
+		smp_mb__after_unlock_lock(); // Order updates vs. GP.
 		list = rtp->cbs_head;
 		rtp->cbs_head = NULL;
 		rtp->cbs_tail = &rtp->cbs_head;
@@ -207,6 +223,49 @@ static void __init rcu_tasks_bootup_oddness(void)
 // rates from multiple CPUs.  If this is required, per-CPU callback lists
 // will be needed.
 
+/* Pre-grace-period preparation. */
+static void rcu_tasks_pregp_step(void)
+{
+	/*
+	 * Wait for all pre-existing t->on_rq and t->nvcsw transitions
+	 * to complete.  Invoking synchronize_rcu() suffices because all
+	 * these transitions occur with interrupts disabled.  Without this
+	 * synchronize_rcu(), a read-side critical section that started
+	 * before the grace period might be incorrectly seen as having
+	 * started after the grace period.
+	 *
+	 * This synchronize_rcu() also dispenses with the need for a
+	 * memory barrier on the first store to t->rcu_tasks_holdout,
+	 * as it forces the store to happen after the beginning of the
+	 * grace period.
+	 */
+	synchronize_rcu();
+}
+
+/* Per-task initial processing. */
+static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
+{
+	if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
+		get_task_struct(t);
+		t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
+		WRITE_ONCE(t->rcu_tasks_holdout, true);
+		list_add(&t->rcu_tasks_holdout_list, hop);
+	}
+}
+
+/* Processing between scanning tasklist and draining the holdout list. */
+static void rcu_tasks_postscan(void)
+{
+	/*
+	 * Wait for tasks that are in the process of exiting.  This
+	 * does only part of the job, ensuring that all tasks that were
+	 * previously exiting reach the point where they have disabled
+	 * preemption, allowing the later synchronize_rcu() to finish
+	 * the job.
+	 */
+	synchronize_srcu(&tasks_rcu_exit_srcu);
+}
+
 /* See if tasks are still holding out, complain if so. */
 static void check_holdout_task(struct task_struct *t,
 			       bool needreport, bool *firstreport)
@@ -239,55 +298,63 @@ static void check_holdout_task(struct task_struct *t,
 	sched_show_task(t);
 }
 
+/* Scan the holdout lists for tasks no longer holding out. */
+static void check_all_holdout_tasks(struct list_head *hop,
+				    bool needreport, bool *firstreport)
+{
+	struct task_struct *t, *t1;
+
+	list_for_each_entry_safe(t, t1, hop, rcu_tasks_holdout_list) {
+		check_holdout_task(t, needreport, firstreport);
+		cond_resched();
+	}
+}
+
+/* Finish off the Tasks-RCU grace period. */
+static void rcu_tasks_postgp(void)
+{
+	/*
+	 * Because ->on_rq and ->nvcsw are not guaranteed to have full
+	 * memory barriers prior to them in the schedule() path, memory
+	 * reordering on other CPUs could cause their RCU-tasks read-side
+	 * critical sections to extend past the end of the grace period.
+	 * However, because these ->nvcsw updates are carried out with
+	 * interrupts disabled, we can use synchronize_rcu() to force the
+	 * needed ordering on all such CPUs.
+	 *
+	 * This synchronize_rcu() also confines all ->rcu_tasks_holdout
+	 * accesses to be within the grace period, avoiding the need for
+	 * memory barriers for ->rcu_tasks_holdout accesses.
+	 *
+	 * In addition, this synchronize_rcu() waits for exiting tasks
+	 * to complete their final preempt_disable() region of execution,
+	 * cleaning up after the synchronize_srcu() above.
+	 */
+	synchronize_rcu();
+}
+
 /* Wait for one RCU-tasks grace period. */
 static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 {
 	struct task_struct *g, *t;
 	unsigned long lastreport;
-	LIST_HEAD(rcu_tasks_holdouts);
+	LIST_HEAD(holdouts);
 	int fract;
 
-	/*
-	 * Wait for all pre-existing t->on_rq and t->nvcsw transitions
-	 * to complete.  Invoking synchronize_rcu() suffices because all
-	 * these transitions occur with interrupts disabled.  Without this
-	 * synchronize_rcu(), a read-side critical section that started
-	 * before the grace period might be incorrectly seen as having
-	 * started after the grace period.
-	 *
-	 * This synchronize_rcu() also dispenses with the need for a
-	 * memory barrier on the first store to t->rcu_tasks_holdout,
-	 * as it forces the store to happen after the beginning of the
-	 * grace period.
-	 */
-	synchronize_rcu();
+	rtp->pregp_func();
 
 	/*
 	 * There were callbacks, so we need to wait for an RCU-tasks
 	 * grace period.  Start off by scanning the task list for tasks
 	 * that are not already voluntarily blocked.  Mark these tasks
-	 * and make a list of them in rcu_tasks_holdouts.
+	 * and make a list of them in holdouts.
 	 */
 	rcu_read_lock();
-	for_each_process_thread(g, t) {
-		if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
-			get_task_struct(t);
-			t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
-			WRITE_ONCE(t->rcu_tasks_holdout, true);
-			list_add(&t->rcu_tasks_holdout_list,
-				 &rcu_tasks_holdouts);
-		}
-	}
+	for_each_process_thread(g, t)
+		rtp->pertask_func(t, &holdouts);
 	rcu_read_unlock();
 
-	/*
-	 * Wait for tasks that are in the process of exiting.  This
-	 * does only part of the job, ensuring that all tasks that were
-	 * previously exiting reach the point where they have disabled
-	 * preemption, allowing the later synchronize_rcu() to finish
-	 * the job.
-	 */
-	synchronize_srcu(&tasks_rcu_exit_srcu);
+	rtp->postscan_func();
 
 	/*
 	 * Each pass through the following loop scans the list of holdout
@@ -303,9 +370,8 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 		bool firstreport;
 		bool needreport;
 		int rtst;
-		struct task_struct *t1;
 
-		if (list_empty(&rcu_tasks_holdouts))
+		if (list_empty(&holdouts))
 			break;
 
 		/* Slowly back off waiting for holdouts */
@@ -320,31 +386,10 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 			lastreport = jiffies;
 		firstreport = true;
 		WARN_ON(signal_pending(current));
-		list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
-					 rcu_tasks_holdout_list) {
-			check_holdout_task(t, needreport, &firstreport);
-			cond_resched();
-		}
+		rtp->holdouts_func(&holdouts, needreport, &firstreport);
 	}
 
-	/*
-	 * Because ->on_rq and ->nvcsw are not guaranteed to have full
-	 * memory barriers prior to them in the schedule() path, memory
-	 * reordering on other CPUs could cause their RCU-tasks read-side
-	 * critical sections to extend past the end of the grace period.
-	 * However, because these ->nvcsw updates are carried out with
-	 * interrupts disabled, we can use synchronize_rcu() to force the
-	 * needed ordering on all such CPUs.
-	 *
-	 * This synchronize_rcu() also confines all ->rcu_tasks_holdout
-	 * accesses to be within the grace period, avoiding the need for
-	 * memory barriers for ->rcu_tasks_holdout accesses.
-	 *
-	 * In addition, this synchronize_rcu() waits for exiting tasks
-	 * to complete their final preempt_disable() region of execution,
-	 * cleaning up after the synchronize_srcu() above.
-	 */
-	synchronize_rcu();
+	rtp->postgp_func();
 }
 
 void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
@@ -413,6 +458,11 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
 
 static int __init rcu_spawn_tasks_kthread(void)
 {
+	rcu_tasks.pregp_func = rcu_tasks_pregp_step;
+	rcu_tasks.pertask_func = rcu_tasks_pertask;
+	rcu_tasks.postscan_func = rcu_tasks_postscan;
+	rcu_tasks.holdouts_func = check_all_holdout_tasks;
+	rcu_tasks.postgp_func = rcu_tasks_postgp;
 	rcu_spawn_tasks_kthread_generic(&rcu_tasks);
 	return 0;
 }
-- 
2.9.5



* [PATCH RFC v2 tip/core/rcu 13/22] rcu-tasks: Code movement to allow more Tasks RCU variants
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (11 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 12/22] rcu-tasks: Further refactor RCU-tasks to allow adding more variants paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 14/22] rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks paulmck
                     ` (12 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit does nothing but move rcu_tasks_wait_gp() up to a new section
for common code.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 122 +++++++++++++++++++++++++++--------------------------
 1 file changed, 63 insertions(+), 59 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index d00e772..e959052 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -213,6 +213,69 @@ static void __init rcu_tasks_bootup_oddness(void)
 
 ////////////////////////////////////////////////////////////////////////
 //
+// Shared code between task-list-scanning variants of Tasks RCU.
+
+/* Wait for one RCU-tasks grace period. */
+static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
+{
+	struct task_struct *g, *t;
+	unsigned long lastreport;
+	LIST_HEAD(holdouts);
+	int fract;
+
+	rtp->pregp_func();
+
+	/*
+	 * There were callbacks, so we need to wait for an RCU-tasks
+	 * grace period.  Start off by scanning the task list for tasks
+	 * that are not already voluntarily blocked.  Mark these tasks
+	 * and make a list of them in holdouts.
+	 */
+	rcu_read_lock();
+	for_each_process_thread(g, t)
+		rtp->pertask_func(t, &holdouts);
+	rcu_read_unlock();
+
+	rtp->postscan_func();
+
+	/*
+	 * Each pass through the following loop scans the list of holdout
+	 * tasks, removing any that are no longer holdouts.  When the list
+	 * is empty, we are done.
+	 */
+	lastreport = jiffies;
+
+	/* Start off with HZ/10 wait and slowly back off to 1 HZ wait. */
+	fract = 10;
+
+	for (;;) {
+		bool firstreport;
+		bool needreport;
+		int rtst;
+
+		if (list_empty(&holdouts))
+			break;
+
+		/* Slowly back off waiting for holdouts */
+		schedule_timeout_interruptible(HZ/fract);
+
+		if (fract > 1)
+			fract--;
+
+		rtst = READ_ONCE(rcu_task_stall_timeout);
+		needreport = rtst > 0 && time_after(jiffies, lastreport + rtst);
+		if (needreport)
+			lastreport = jiffies;
+		firstreport = true;
+		WARN_ON(signal_pending(current));
+		rtp->holdouts_func(&holdouts, needreport, &firstreport);
+	}
+
+	rtp->postgp_func();
+}
+
+////////////////////////////////////////////////////////////////////////
+//
 // Simple variant of RCU whose quiescent states are voluntary context
 // switch, cond_resched_rcu_qs(), user-space execution, and idle.
 // As such, grace periods can take one good long time.  There are no
@@ -333,65 +396,6 @@ static void rcu_tasks_postgp(void)
 	synchronize_rcu();
 }
 
-/* Wait for one RCU-tasks grace period. */
-static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
-{
-	struct task_struct *g, *t;
-	unsigned long lastreport;
-	LIST_HEAD(holdouts);
-	int fract;
-
-	rtp->pregp_func();
-
-	/*
-	 * There were callbacks, so we need to wait for an RCU-tasks
-	 * grace period.  Start off by scanning the task list for tasks
-	 * that are not already voluntarily blocked.  Mark these tasks
-	 * and make a list of them in holdouts.
-	 */
-	rcu_read_lock();
-	for_each_process_thread(g, t)
-		rtp->pertask_func(t, &holdouts);
-	rcu_read_unlock();
-
-	rtp->postscan_func();
-
-	/*
-	 * Each pass through the following loop scans the list of holdout
-	 * tasks, removing any that are no longer holdouts.  When the list
-	 * is empty, we are done.
-	 */
-	lastreport = jiffies;
-
-	/* Start off with HZ/10 wait and slowly back off to 1 HZ wait. */
-	fract = 10;
-
-	for (;;) {
-		bool firstreport;
-		bool needreport;
-		int rtst;
-
-		if (list_empty(&holdouts))
-			break;
-
-		/* Slowly back off waiting for holdouts */
-		schedule_timeout_interruptible(HZ/fract);
-
-		if (fract > 1)
-			fract--;
-
-		rtst = READ_ONCE(rcu_task_stall_timeout);
-		needreport = rtst > 0 && time_after(jiffies, lastreport + rtst);
-		if (needreport)
-			lastreport = jiffies;
-		firstreport = true;
-		WARN_ON(signal_pending(current));
-		rtp->holdouts_func(&holdouts, needreport, &firstreport);
-	}
-
-	rtp->postgp_func();
-}
-
 void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
 DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
 
-- 
2.9.5
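
Because rcu_tasks_wait_gp() reaches each flavor only through the
->pregp_func(), ->pertask_func(), ->postscan_func(), ->holdouts_func(),
and ->postgp_func() hooks, moving it to common code lets new variants
reuse it unchanged.  The call order can be sketched in user space as
follows.  Everything named model_* or toy_* is hypothetical, tasks are
reduced to integers, the holdout list to a counter, and the sleeping and
stall-warning logic is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the hook fields of struct rcu_tasks. */
struct model_gp_ops {
	void (*pregp)(void);
	void (*pertask)(int task, int *nholdouts);
	void (*postscan)(void);
	void (*holdouts)(int *nholdouts);
	void (*postgp)(void);
};

/* One grace period: same hook order as rcu_tasks_wait_gp(), no sleeping. */
static int model_wait_gp(const struct model_gp_ops *ops, int ntasks)
{
	int nholdouts = 0;
	int passes = 0;

	ops->pregp();
	for (int task = 0; task < ntasks; task++)
		ops->pertask(task, &nholdouts);
	ops->postscan();
	while (nholdouts > 0) {		/* Rescan until no holdouts remain. */
		ops->holdouts(&nholdouts);
		passes++;
	}
	ops->postgp();
	return passes;
}

/* Toy flavor: odd-numbered tasks are holdouts, each pass clears one. */
static char model_trace[16];		/* Records the order of hook calls. */
static size_t model_trace_len;

static void model_log(char c)
{
	if (model_trace_len < sizeof(model_trace) - 1)
		model_trace[model_trace_len++] = c;
}

static void toy_pregp(void) { model_log('g'); }
static void toy_pertask(int task, int *n) { if (task & 1) (*n)++; }
static void toy_postscan(void) { model_log('s'); }
static void toy_holdouts(int *n) { (*n)--; model_log('h'); }
static void toy_postgp(void) { model_log('p'); }

static const struct model_gp_ops toy_ops = {
	.pregp = toy_pregp,
	.pertask = toy_pertask,
	.postscan = toy_postscan,
	.holdouts = toy_holdouts,
	.postgp = toy_postgp,
};
```

Calling model_wait_gp(&toy_ops, 4) marks tasks 1 and 3 as holdouts and
therefore needs two holdout passes before the post-grace-period hook runs.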



* [PATCH RFC v2 tip/core/rcu 14/22] rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (12 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 13/22] rcu-tasks: Code movement to allow more Tasks RCU variants paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  1:37     ` Joel Fernandes
  2020-03-19 19:42     ` Steven Rostedt
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 15/22] rcutorture: Add torture tests for RCU Tasks Trace paulmck
                     ` (11 subsequent siblings)
  25 siblings, 2 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney,
	Alexei Starovoitov, Andrii Nakryiko

From: "Paul E. McKenney" <paulmck@kernel.org>

Because RCU does not watch exception early-entry/late-exit, idle-loop,
or CPU-hotplug execution, protection of tracing and BPF operations is
needlessly complicated.  This commit therefore adds a variant of
Tasks RCU that:

o	Has explicit read-side markers to allow finite grace periods in
	the face of in-kernel loops for PREEMPT=n builds.  These markers
	are rcu_read_lock_trace() and rcu_read_unlock_trace().

o	Protects code in the idle loop, exception entry/exit, and
	CPU-hotplug code paths.  In this respect, RCU-tasks trace is
	similar to SRCU, but with lighter-weight readers.

o	Avoids expensive read-side instructions, having overhead similar
	to that of Preemptible RCU.

There are of course downsides:

o	The grace-period code can send IPIs to CPUs, even when those CPUs
	are in the idle loop or in nohz_full userspace.  This will be
	addressed by later commits.

o	It is necessary to scan the full tasklist, much as for Tasks RCU.

o	There is a single callback queue guarded by a single lock,
	again, much as for Tasks RCU.  However, those early use cases
	that request multiple grace periods in quick succession are
	expected to do so from a single task, which makes the single
	lock almost irrelevant.  If needed, multiple callback queues
	can be provided using any number of schemes.

Perhaps most important, this variant of RCU does not affect the vanilla
flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
readers can operate from idle, offline, and exception entry/exit in no
way enables rcu_preempt and rcu_sched readers to do so.

This effort benefited greatly from off-list discussions of BPF
requirements with Alexei Starovoitov and Andrii Nakryiko.  At least
some of the on-list discussions are captured in the Link: tags below.
In addition, KCSAN was quite helpful in finding some early bugs.

Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/
Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andriin@fb.com>
---
 include/linux/rcupdate_trace.h |  84 ++++++++++
 include/linux/sched.h          |   8 +
 init/init_task.c               |   4 +
 kernel/fork.c                  |   4 +
 kernel/rcu/Kconfig             |  12 +-
 kernel/rcu/tasks.h             | 359 ++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 462 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/rcupdate_trace.h
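
The explicit read-side markers described above amount to a per-task
nesting counter plus a flag that the grace-period code sets when it needs
the reader to report in.  A single-threaded user-space sketch of that
fast path/slow path split follows.  The model_* names are hypothetical,
the READ_ONCE()/WRITE_ONCE() accesses and lockdep hooks are omitted, and
the slow path merely counts invocations rather than decrementing an
atomic counter and waking a kthread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task state, modeled on the trc_* fields. */
struct model_task {
	int nesting;	/* Like ->trc_reader_nesting. */
	bool need_end;	/* Like ->trc_reader_need_end. */
};

static int model_wakeups;	/* Counts would-be grace-period wakeups. */

static void model_read_lock(struct model_task *t)
{
	t->nesting++;
}

static void model_read_unlock(struct model_task *t)
{
	int nesting = t->nesting - 1;

	assert(nesting >= 0);	/* An unbalanced unlock is a bug. */
	t->nesting = nesting;
	if (!t->need_end || nesting)
		return;		/* Fast path: nobody waits, or still nested. */
	t->need_end = false;	/* Slow path: outermost unlock reports in. */
	model_wakeups++;
}
```

Only the outermost unlock of a flagged reader takes the slow path, which
is why readers that are never flagged pay for nothing beyond a counter
update.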

diff --git a/include/linux/rcupdate_trace.h b/include/linux/rcupdate_trace.h
new file mode 100644
index 0000000..ed97e10
--- /dev/null
+++ b/include/linux/rcupdate_trace.h
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * Read-Copy Update mechanism for mutual exclusion, adapted for tracing.
+ *
+ * Copyright (C) 2020 Paul E. McKenney.
+ */
+
+#ifndef __LINUX_RCUPDATE_TRACE_H
+#define __LINUX_RCUPDATE_TRACE_H
+
+#include <linux/sched.h>
+#include <linux/rcupdate.h>
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+
+extern struct lockdep_map rcu_trace_lock_map;
+
+static inline int rcu_read_lock_trace_held(void)
+{
+	return lock_is_held(&rcu_trace_lock_map);
+}
+
+#else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+static inline int rcu_read_lock_trace_held(void)
+{
+	return 1;
+}
+
+#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+#ifdef CONFIG_TASKS_TRACE_RCU
+
+void rcu_read_unlock_trace_special(struct task_struct *t);
+
+/**
+ * rcu_read_lock_trace - mark beginning of RCU-trace read-side critical section
+ *
+ * When synchronize_rcu_tasks_trace() is invoked by one task, then that
+ * task is guaranteed to block until all other tasks exit their read-side
+ * critical sections.  Similarly, if call_rcu_tasks_trace() is invoked on
+ * one task while other tasks are within RCU read-side critical sections,
+ * invocation of the corresponding RCU callback is deferred until after
+ * all the other tasks exit their critical sections.
+ *
+ * For more details, please see the documentation for rcu_read_lock().
+ */
+static inline void rcu_read_lock_trace(void)
+{
+	struct task_struct *t = current;
+
+	WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + 1);
+	rcu_lock_acquire(&rcu_trace_lock_map);
+}
+
+/**
+ * rcu_read_unlock_trace - mark end of RCU-trace read-side critical section
+ *
+ * Pairs with a preceding call to rcu_read_lock_trace(), and nesting is
+ * allowed.  Invoking a rcu_read_unlock_trace() when there is no matching
+ * rcu_read_lock_trace() is verboten, and will result in lockdep complaints.
+ *
+ * For more details, please see the documentation for rcu_read_unlock().
+ */
+static inline void rcu_read_unlock_trace(void)
+{
+	int nesting;
+	struct task_struct *t = current;
+
+	rcu_lock_release(&rcu_trace_lock_map);
+	nesting = READ_ONCE(t->trc_reader_nesting) - 1;
+	WRITE_ONCE(t->trc_reader_nesting, nesting);
+	if (likely(!READ_ONCE(t->trc_reader_need_end)) || nesting)
+		return;  // We assume shallow reader nesting.
+	rcu_read_unlock_trace_special(t);
+}
+
+void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
+void synchronize_rcu_tasks_trace(void);
+void rcu_barrier_tasks_trace(void);
+
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
+
+#endif /* __LINUX_RCUPDATE_TRACE_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 621e4aa..ef68ae4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -722,6 +722,14 @@ struct task_struct {
 	struct list_head		rcu_tasks_holdout_list;
 #endif /* #ifdef CONFIG_TASKS_RCU */
 
+#ifdef CONFIG_TASKS_TRACE_RCU
+	int				trc_reader_nesting;
+	int				trc_ipi_to_cpu;
+	bool				trc_reader_need_end;
+	bool				trc_reader_checked;
+	struct list_head		trc_holdout_list;
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
+
 	struct sched_info		sched_info;
 
 	struct list_head		tasks;
diff --git a/init/init_task.c b/init/init_task.c
index 096191d..1b9ec3d 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -140,6 +140,10 @@ struct task_struct init_task
 	.rcu_tasks_holdout_list = LIST_HEAD_INIT(init_task.rcu_tasks_holdout_list),
 	.rcu_tasks_idle_cpu = -1,
 #endif
+#ifdef CONFIG_TASKS_TRACE_RCU
+	.trc_reader_nesting = 0,
+	.trc_holdout_list = LIST_HEAD_INIT(init_task.trc_holdout_list),
+#endif
 #ifdef CONFIG_CPUSETS
 	.mems_allowed_seq = SEQCNT_ZERO(init_task.mems_allowed_seq),
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index e592e6f..d0e547c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1685,6 +1685,10 @@ static inline void rcu_copy_process(struct task_struct *p)
 	INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
 	p->rcu_tasks_idle_cpu = -1;
 #endif /* #ifdef CONFIG_TASKS_RCU */
+#ifdef CONFIG_TASKS_TRACE_RCU
+	p->trc_reader_nesting = 0;
+	INIT_LIST_HEAD(&p->trc_holdout_list);
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
 struct pid *pidfd_pid(const struct file *file)
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 0d43ec1..187226b 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -71,7 +71,7 @@ config TREE_SRCU
 	  This option selects the full-fledged version of SRCU.
 
 config TASKS_RCU_GENERIC
-	def_bool TASKS_RCU || TASKS_RUDE_RCU
+	def_bool TASKS_RCU || TASKS_RUDE_RCU || TASKS_TRACE_RCU
 	select SRCU
 	help
 	  This option enables generic infrastructure code supporting
@@ -94,6 +94,16 @@ config TASKS_RUDE_RCU
 	  switches on all online CPUs, including idle ones, so use
 	  with caution.  Not for manual selection.
 
+config TASKS_TRACE_RCU
+	def_bool 0
+	default n
+	help
+	  This option enables a task-based RCU implementation that uses
+	  explicit rcu_read_lock_trace() read-side markers, and allows
+	  these readers to appear in the idle loop as well as on the CPU
+	  hotplug code paths.  It can force IPIs on online CPUs, including
+	  idle ones, so use with caution.  Not for manual selection.
+
 config RCU_STALL_COMMON
 	def_bool TREE_RCU
 	help
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index e959052..24f41ec 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -181,12 +181,17 @@ void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
 	preempt_enable();
 }
 
+static void exit_tasks_rcu_finish_trace(struct task_struct *t);
+
 /* Do the srcu_read_unlock() for the above synchronize_srcu().  */
 void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
 {
+	struct task_struct *t = current;
+
 	preempt_disable();
-	__srcu_read_unlock(&tasks_rcu_exit_srcu, current->rcu_tasks_idx);
+	__srcu_read_unlock(&tasks_rcu_exit_srcu, t->rcu_tasks_idx);
 	preempt_enable();
+	exit_tasks_rcu_finish_trace(t);
 }
 
 #ifndef CONFIG_TINY_RCU
@@ -196,15 +201,19 @@ void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
  */
 static void __init rcu_tasks_bootup_oddness(void)
 {
-#ifdef CONFIG_TASKS_RCU
+#if defined(CONFIG_TASKS_RCU) || defined(CONFIG_TASKS_TRACE_RCU)
 	if (rcu_task_stall_timeout != RCU_TASK_STALL_TIMEOUT)
 		pr_info("\tTasks-RCU CPU stall warnings timeout set to %d (rcu_task_stall_timeout).\n", rcu_task_stall_timeout);
-	else
-		pr_info("\tTasks RCU enabled.\n");
+#endif /* #if defined(CONFIG_TASKS_RCU) || defined(CONFIG_TASKS_TRACE_RCU) */
+#ifdef CONFIG_TASKS_RCU
+	pr_info("\tTrampoline variant of Tasks RCU enabled.\n");
 #endif /* #ifdef CONFIG_TASKS_RCU */
 #ifdef CONFIG_TASKS_RUDE_RCU
 	pr_info("\tRude variant of Tasks RCU enabled.\n");
 #endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
+#ifdef CONFIG_TASKS_TRACE_RCU
+	pr_info("\tTracing variant of Tasks RCU enabled.\n");
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
 #endif /* #ifndef CONFIG_TINY_RCU */
@@ -480,10 +489,10 @@ core_initcall(rcu_spawn_tasks_kthread);
 //
 // "Rude" variant of Tasks RCU, inspired by Steve Rostedt's trick of
 // passing an empty function to schedule_on_each_cpu().  This approach
-// provides an asynchronous call_rcu_rude() API and batching of concurrent
-// calls to the synchronous synchronize_rcu_rude() API.  This sends IPIs
-// far and wide and induces otherwise unnecessary context switches on all
-// online CPUs, whether online or not.
+// provides an asynchronous call_rcu_tasks_rude() API and batching
+// of concurrent calls to the synchronous synchronize_rcu_tasks_rude() API.
+// This sends IPIs far and wide and induces otherwise unnecessary context
+// switches on all online CPUs, whether idle or not.
 
 // Empty function to allow workqueues to force a context switch.
 static void rcu_tasks_be_rude(struct work_struct *work)
@@ -569,3 +578,337 @@ static int __init rcu_spawn_tasks_rude_kthread(void)
 core_initcall(rcu_spawn_tasks_rude_kthread);
 
 #endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
+
+////////////////////////////////////////////////////////////////////////
+//
+// Tracing variant of Tasks RCU.  This variant is designed to be used
+// to protect tracing hooks, including those of BPF.  This variant
+// therefore:
+//
+// 1.	Has explicit read-side markers to allow finite grace periods
+//	in the face of in-kernel loops for PREEMPT=n builds.
+//
+// 2.	Protects code in the idle loop, exception entry/exit, and
+//	CPU-hotplug code paths, similar to the capabilities of SRCU.
+//
+// 3.	Avoids expensive read-side instructions, having overhead similar
+//	to that of Preemptible RCU.
+//
+// There are of course downsides.  The grace-period code can send IPIs to
+// CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
+// It is necessary to scan the full tasklist, much as for Tasks RCU.  There
+// is a single callback queue guarded by a single lock, again, much as for
+// Tasks RCU.  If needed, these downsides can be at least partially remedied.
+//
+// Perhaps most important, this variant of RCU does not affect the vanilla
+// flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
+// readers can operate from idle, offline, and exception entry/exit in no
+// way allows rcu_preempt and rcu_sched readers to also do so.
+
+// The lockdep state must be outside of #ifdef to be useful.
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+static struct lock_class_key rcu_lock_trace_key;
+struct lockdep_map rcu_trace_lock_map =
+	STATIC_LOCKDEP_MAP_INIT("rcu_read_lock_trace", &rcu_lock_trace_key);
+EXPORT_SYMBOL_GPL(rcu_trace_lock_map);
+#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+#ifdef CONFIG_TASKS_TRACE_RCU
+
+atomic_t trc_n_readers_need_end;	// Number of waited-for readers.
+DECLARE_WAIT_QUEUE_HEAD(trc_wait);	// Grace-period kthread waits here.
+
+// Record outstanding IPIs to each CPU.  No point in sending two...
+static DEFINE_PER_CPU(bool, trc_ipi_to_cpu);
+
+/* If we are the last reader, wake up the grace-period kthread. */
+void rcu_read_unlock_trace_special(struct task_struct *t)
+{
+	WRITE_ONCE(t->trc_reader_need_end, false);
+	if (atomic_dec_and_test(&trc_n_readers_need_end))
+		wake_up(&trc_wait);
+}
+EXPORT_SYMBOL_GPL(rcu_read_unlock_trace_special);
+
+/* Add a task to the holdout list, if it is not already on the list. */
+static void trc_add_holdout(struct task_struct *t, struct list_head *bhp)
+{
+	if (list_empty(&t->trc_holdout_list)) {
+		get_task_struct(t);
+		list_add(&t->trc_holdout_list, bhp);
+	}
+}
+
+/* Remove a task from the holdout list, if it is in fact present. */
+static void trc_del_holdout(struct task_struct *t)
+{
+	if (!list_empty(&t->trc_holdout_list)) {
+		list_del_init(&t->trc_holdout_list);
+		put_task_struct(t);
+	}
+}
+
+/* IPI handler to check task state. */
+static void trc_read_check_handler(void *t_in)
+{
+	struct task_struct *t = current;
+	struct task_struct *texp = t_in;
+
+	// If the task is no longer running on this CPU, leave.
+	if (unlikely(texp != t)) {
+		if (WARN_ON_ONCE(atomic_dec_and_test(&trc_n_readers_need_end)))
+			wake_up(&trc_wait);
+		goto reset_ipi; // Already on holdout list, so will check later.
+	}
+
+	// If the task is not in a read-side critical section, and
+	// if this is the last reader, awaken the grace-period kthread.
+	if (likely(!t->trc_reader_nesting)) {
+		if (WARN_ON_ONCE(atomic_dec_and_test(&trc_n_readers_need_end)))
+			wake_up(&trc_wait);
+		// Mark as checked after decrement to avoid false
+		// positives on the above WARN_ON_ONCE().
+		WRITE_ONCE(t->trc_reader_checked, true);
+		goto reset_ipi;
+	}
+	WRITE_ONCE(t->trc_reader_checked, true);
+
+	// Get here if the task is in a read-side critical section.  Set
+	// its state so that it will awaken the grace-period kthread upon
+	// exit from that critical section.
+	WARN_ON_ONCE(t->trc_reader_need_end);
+	WRITE_ONCE(t->trc_reader_need_end, true);
+
+reset_ipi:
+	// Allow future IPIs to be sent on CPU and for task.
+	// Also order this IPI handler against any later manipulations of
+	// the intended task.
+	smp_store_release(&per_cpu(trc_ipi_to_cpu, smp_processor_id()), false); // ^^^
+	smp_store_release(&texp->trc_ipi_to_cpu, -1); // ^^^
+}
+
+/* Callback function for scheduler to check non-running task.  */
+static bool trc_inspect_reader_notrunning(struct task_struct *t, void *arg)
+{
+	if (task_curr(t))
+		return false;  // It is running, so decline to inspect it.
+
+	// Mark as checked.  Because this is called from the grace-period
+	// kthread, also remove the task from the holdout list.
+	t->trc_reader_checked = true;
+	trc_del_holdout(t);
+
+	// If the task is in a read-side critical section, set up its
+	// state so that it will awaken the grace-period kthread upon
+	// exit from that critical section.
+	if (unlikely(t->trc_reader_nesting)) {
+		atomic_inc(&trc_n_readers_need_end); // One more to wait on.
+		WARN_ON_ONCE(t->trc_reader_need_end);
+		WRITE_ONCE(t->trc_reader_need_end, true);
+	}
+	return true;
+}
+
+/* Attempt to extract the state for the specified task. */
+static void trc_wait_for_one_reader(struct task_struct *t,
+				    struct list_head *bhp)
+{
+	int cpu;
+
+	// If a previous IPI is still in flight, let it complete.
+	if (smp_load_acquire(&t->trc_ipi_to_cpu) != -1) // Order IPI
+		return;
+
+	// The current task had better be in a quiescent state.
+	if (t == current) {
+		t->trc_reader_checked = true;
+		trc_del_holdout(t);
+		WARN_ON_ONCE(t->trc_reader_nesting);
+		return;
+	}
+
+	// Attempt to nail down the task for inspection.
+	if (try_invoke_on_locked_down_task(t, trc_inspect_reader_notrunning, t))
+		return;
+
+	// If currently running, send an IPI, either way, add to list.
+	trc_add_holdout(t, bhp);
+	if (task_curr(t)) {
+		// The task is currently running, so try IPIing it.
+		cpu = task_cpu(t);
+
+		// If there is already an IPI outstanding, let it happen.
+		if (per_cpu(trc_ipi_to_cpu, cpu) || t->trc_ipi_to_cpu >= 0)
+			return;
+
+		atomic_inc(&trc_n_readers_need_end);
+		per_cpu(trc_ipi_to_cpu, cpu) = true;
+		t->trc_ipi_to_cpu = cpu;
+		if (smp_call_function_single(cpu,
+					     trc_read_check_handler, t, 0)) {
+			per_cpu(trc_ipi_to_cpu, cpu) = false;
+			t->trc_ipi_to_cpu = -1; // IPI failed, so allow a retry.
+		}
+	}
+}
+
+/* Initialize for a new RCU-tasks-trace grace period. */
+static void rcu_tasks_trace_pregp_step(void)
+{
+	int cpu;
+
+	// Wait for CPU-hotplug paths to complete.
+	cpus_read_lock();
+	cpus_read_unlock();
+
+	// Allow for fast-acting IPIs.
+	atomic_set(&trc_n_readers_need_end, 1);
+
+	// There shouldn't be any old IPIs, but...
+	for_each_possible_cpu(cpu)
+		WARN_ON_ONCE(per_cpu(trc_ipi_to_cpu, cpu));
+}
+
+/* Do first-round processing for the specified task. */
+static void rcu_tasks_trace_pertask(struct task_struct *t,
+				    struct list_head *hop)
+{
+	WRITE_ONCE(t->trc_reader_need_end, false);
+	t->trc_reader_checked = false;
+	t->trc_ipi_to_cpu = -1;
+	trc_wait_for_one_reader(t, hop);
+}
+
+/* Do intermediate processing between task and holdout scans. */
+static void rcu_tasks_trace_postscan(void)
+{
+	// Wait for late-stage exiting tasks to finish exiting.
+	// These might have passed the call to exit_tasks_rcu_finish().
+	synchronize_rcu();
+	// Any tasks that exit after this point will set ->trc_reader_checked.
+}
+
+/* Do one scan of the holdout list. */
+static void check_all_holdout_tasks_trace(struct list_head *hop,
+					  bool ndrpt, bool *frptp)
+{
+	struct task_struct *g, *t;
+
+	list_for_each_entry_safe(t, g, hop, trc_holdout_list) {
+		// If safe and needed, try to check the current task.
+		if (READ_ONCE(t->trc_ipi_to_cpu) == -1 &&
+		    !READ_ONCE(t->trc_reader_checked))
+			trc_wait_for_one_reader(t, hop);
+
+		// If check succeeded, remove this task from the list.
+		if (READ_ONCE(t->trc_reader_checked))
+			trc_del_holdout(t);
+	}
+}
+
+/* Wait for grace period to complete and provide ordering. */
+static void rcu_tasks_trace_postgp(void)
+{
+	// Remove the safety count.
+	smp_mb__before_atomic();  // Order vs. earlier atomics
+	atomic_dec(&trc_n_readers_need_end);
+	smp_mb__after_atomic();  // Order vs. later atomics
+
+	// Wait for readers.
+	wait_event_idle_exclusive(trc_wait,
+				  atomic_read(&trc_n_readers_need_end) == 0);
+
+	smp_mb(); // Caller's code must be ordered after wakeup.
+}
+
+/* Report any needed quiescent state for this exiting task. */
+void exit_tasks_rcu_finish_trace(struct task_struct *t)
+{
+	WRITE_ONCE(t->trc_reader_checked, true);
+	WARN_ON_ONCE(t->trc_reader_nesting);
+	WRITE_ONCE(t->trc_reader_nesting, 0);
+	if (WARN_ON_ONCE(READ_ONCE(t->trc_reader_need_end)))
+		rcu_read_unlock_trace_special(t);
+}
+
+void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
+DEFINE_RCU_TASKS(rcu_tasks_trace, rcu_tasks_wait_gp, call_rcu_tasks_trace,
+		 "RCU Tasks Trace");
+
+/**
+ * call_rcu_tasks_trace() - Queue a callback for a trace task-based grace period
+ * @rhp: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * Tasks Trace read-side critical sections have completed.  Unlike
+ * call_rcu_tasks(), whose read-side critical sections are implicitly
+ * ended by context switch, cond_resched_rcu_qs(), or transition to
+ * usermode execution, the critical sections waited on here are
+ * explicitly delimited by rcu_read_lock_trace() and
+ * rcu_read_unlock_trace(), so that protected code may appear in the
+ * idle loop and on exception-entry/exit and CPU-hotplug paths.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
+ */
+void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func)
+{
+	call_rcu_tasks_generic(rhp, func, &rcu_tasks_trace);
+}
+EXPORT_SYMBOL_GPL(call_rcu_tasks_trace);
+
+/**
+ * synchronize_rcu_tasks_trace - wait for a trace rcu-tasks grace period
+ *
+ * Control will return to the caller some time after a trace rcu-tasks
+ * grace period has elapsed, in other words after all currently
+ * executing trace rcu-tasks read-side critical sections have completed.
+ * These read-side critical sections are delimited by calls to
+ * rcu_read_lock_trace() and rcu_read_unlock_trace(), and nesting
+ * is allowed.
+ *
+ * This is a very specialized primitive, intended only for a few uses in
+ * tracing and other situations requiring manipulation of function preambles
+ * and profiling hooks.  The synchronize_rcu_tasks_trace() function is not
+ * (yet) intended for heavy use from multiple CPUs.
+ *
+ * See the description of synchronize_rcu() for more detailed information
+ * on memory ordering guarantees.
+ */
+void synchronize_rcu_tasks_trace(void)
+{
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_trace_lock_map), "Illegal synchronize_rcu_tasks_trace() in RCU Tasks Trace read-side critical section");
+	synchronize_rcu_tasks_generic(&rcu_tasks_trace);
+}
+EXPORT_SYMBOL_GPL(synchronize_rcu_tasks_trace);
+
+/**
+ * rcu_barrier_tasks_trace - Wait for in-flight call_rcu_tasks_trace() callbacks.
+ *
+ * Although the current implementation is guaranteed to wait, it is not
+ * obligated to, for example, if there are no pending callbacks.
+ */
+void rcu_barrier_tasks_trace(void)
+{
+	/* There is only one callback queue, so this is easy.  ;-) */
+	synchronize_rcu_tasks_trace();
+}
+EXPORT_SYMBOL_GPL(rcu_barrier_tasks_trace);
+
+static int __init rcu_spawn_tasks_trace_kthread(void)
+{
+	rcu_tasks_trace.pregp_func = rcu_tasks_trace_pregp_step;
+	rcu_tasks_trace.pertask_func = rcu_tasks_trace_pertask;
+	rcu_tasks_trace.postscan_func = rcu_tasks_trace_postscan;
+	rcu_tasks_trace.holdouts_func = check_all_holdout_tasks_trace;
+	rcu_tasks_trace.postgp_func = rcu_tasks_trace_postgp;
+	rcu_spawn_tasks_kthread_generic(&rcu_tasks_trace);
+	return 0;
+}
+core_initcall(rcu_spawn_tasks_trace_kthread);
+
+#else /* #ifdef CONFIG_TASKS_TRACE_RCU */
+void exit_tasks_rcu_finish_trace(struct task_struct *t) { }
+#endif /* #else #ifdef CONFIG_TASKS_TRACE_RCU */
-- 
2.9.5
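
The patch above has rcu_tasks_trace_pregp_step() set
trc_n_readers_need_end to 1 and rcu_tasks_trace_postgp() remove that
extra count.  The apparent intent is that a reader finishing while the
task scan is still in progress cannot drive the count to zero and issue
a premature wakeup.  A user-space sketch of this safety-count pattern
with C11 atomics follows.  The model_* names are hypothetical, and the
wait queue is reduced to a boolean return value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int model_need_end;	/* Like trc_n_readers_need_end. */

/* Start of grace period: hold one "safety" count during the scan. */
static void model_pregp(void)
{
	atomic_store(&model_need_end, 1);
}

/* The scan found a reader that must be waited for. */
static void model_note_reader(void)
{
	atomic_fetch_add(&model_need_end, 1);
}

/*
 * Drop one count, either because a reader finished or because the scan
 * is done and the safety count is being removed.  Returns true if this
 * call took the count to zero, in which case nothing remains to wait for.
 */
static bool model_put(void)
{
	int old = atomic_fetch_sub(&model_need_end, 1);

	assert(old > 0);	/* Never drop below zero. */
	return old == 1;
}
```

Whichever party calls model_put() last sees true: if that is the
grace-period side, it need not sleep, and if it is a reader, that reader
is the one that would do the wakeup.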



* [PATCH RFC v2 tip/core/rcu 15/22] rcutorture: Add torture tests for RCU Tasks Trace
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (13 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 14/22] rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 16/22] rcu-tasks: Add stall warnings " paulmck
                     ` (10 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds the definitions required to torture the tracing flavor
of RCU tasks.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/Kconfig.debug                           |  2 +
 kernel/rcu/rcu.h                                   |  1 +
 kernel/rcu/rcutorture.c                            | 44 +++++++++++++++++++++-
 .../selftests/rcutorture/configs/rcu/CFLIST        |  1 +
 .../selftests/rcutorture/configs/rcu/TRACE01       | 10 +++++
 .../selftests/rcutorture/configs/rcu/TRACE01.boot  |  1 +
 6 files changed, 58 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TRACE01
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot
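
The TRACE01.boot file selects the new flavor by name through
rcutorture.torture_type=tasks-tracing, which matches the .name field of
tasks_tracing_ops once that structure is listed in torture_ops[].  The
lookup can be sketched in user space as shown below.  This is an
assumption about the general shape of the name matching, not a copy of
rcutorture: the model_* names are hypothetical and the table holds only
a few illustrative entries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much-reduced stand-in for struct rcu_torture_ops. */
struct model_torture_ops {
	const char *name;
};

/* Illustrative subset of flavors; the real array has more entries. */
static const struct model_torture_ops model_ops[] = {
	{ "tasks" },
	{ "tasks-rude" },
	{ "tasks-tracing" },
};

/* Return the ops whose name matches the requested type, else NULL. */
static const struct model_torture_ops *model_find_ops(const char *type)
{
	size_t n = sizeof(model_ops) / sizeof(model_ops[0]);

	assert(type != NULL);
	for (size_t i = 0; i < n; i++)
		if (strcmp(type, model_ops[i].name) == 0)
			return &model_ops[i];
	return NULL;
}
```

A flavor that is defined but never added to the table simply cannot be
selected, which is why the patch touches both the ops structure and the
torture_ops[] initializer.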

diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index b15a3bd..a4db41d 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -25,6 +25,7 @@ config RCU_PERF_TEST
 	select SRCU
 	select TASKS_RCU
 	select TASKS_RUDE_RCU
+	select TASKS_TRACE_RCU
 	default n
 	help
 	  This option provides a kernel module that runs performance
@@ -43,6 +44,7 @@ config RCU_TORTURE_TEST
 	select SRCU
 	select TASKS_RCU
 	select TASKS_RUDE_RCU
+	select TASKS_TRACE_RCU
 	default n
 	help
 	  This option provides a kernel module that runs torture tests
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index c574620..72903867 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -442,6 +442,7 @@ enum rcutorture_type {
 	RCU_FLAVOR,
 	RCU_TASKS_FLAVOR,
 	RCU_TASKS_RUDE_FLAVOR,
+	RCU_TASKS_TRACING_FLAVOR,
 	RCU_TRIVIAL_FLAVOR,
 	SRCU_FLAVOR,
 	INVALID_RCU_FLAVOR
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 386cd11..bb6daa58 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -45,6 +45,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/oom.h>
 #include <linux/tick.h>
+#include <linux/rcupdate_trace.h>
 
 #include "rcu.h"
 
@@ -758,6 +759,45 @@ static struct rcu_torture_ops tasks_rude_ops = {
 	.name		= "tasks-rude"
 };
 
+/*
+ * Definitions for tracing RCU-tasks torture testing.
+ */
+
+static int tasks_tracing_torture_read_lock(void)
+{
+	rcu_read_lock_trace();
+	return 0;
+}
+
+static void tasks_tracing_torture_read_unlock(int idx)
+{
+	rcu_read_unlock_trace();
+}
+
+static void rcu_tasks_tracing_torture_deferred_free(struct rcu_torture *p)
+{
+	call_rcu_tasks_trace(&p->rtort_rcu, rcu_torture_cb);
+}
+
+static struct rcu_torture_ops tasks_tracing_ops = {
+	.ttype		= RCU_TASKS_TRACING_FLAVOR,
+	.init		= rcu_sync_torture_init,
+	.readlock	= tasks_tracing_torture_read_lock,
+	.read_delay	= srcu_read_delay,  /* just reuse srcu's version. */
+	.readunlock	= tasks_tracing_torture_read_unlock,
+	.get_gp_seq	= rcu_no_completed,
+	.deferred_free	= rcu_tasks_tracing_torture_deferred_free,
+	.sync		= synchronize_rcu_tasks_trace,
+	.exp_sync	= synchronize_rcu_tasks_trace,
+	.call		= call_rcu_tasks_trace,
+	.cb_barrier	= rcu_barrier_tasks_trace,
+	.fqs		= NULL,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.slow_gps	= 1,
+	.name		= "tasks-tracing"
+};
+
 static unsigned long rcutorture_seq_diff(unsigned long new, unsigned long old)
 {
 	if (!cur_ops->gp_diff)
@@ -1316,6 +1356,7 @@ static bool rcu_torture_one_read(struct torture_random_state *trsp)
 				  rcu_read_lock_bh_held() ||
 				  rcu_read_lock_sched_held() ||
 				  srcu_read_lock_held(srcu_ctlp) ||
+				  rcu_read_lock_trace_held() ||
 				  torturing_tasks());
 	if (p == NULL) {
 		/* Wait for rcu_torture_writer to get underway */
@@ -2435,7 +2476,8 @@ rcu_torture_init(void)
 	int firsterr = 0;
 	static struct rcu_torture_ops *torture_ops[] = {
 		&rcu_ops, &rcu_busted_ops, &srcu_ops, &srcud_ops,
-		&busted_srcud_ops, &tasks_ops, &tasks_rude_ops, &trivial_ops,
+		&busted_srcud_ops, &tasks_ops, &tasks_rude_ops,
+		&tasks_tracing_ops, &trivial_ops,
 	};
 
 	if (!torture_init_begin(torture_type, verbose))
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
index ec0c72f..dfb1817 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
+++ b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
@@ -15,3 +15,4 @@ TASKS01
 TASKS02
 TASKS03
 RUDE01
+TRACE01
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TRACE01 b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01
new file mode 100644
index 0000000..078e2c1
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01
@@ -0,0 +1,10 @@
+CONFIG_SMP=y
+CONFIG_NR_CPUS=4
+CONFIG_HOTPLUG_CPU=y
+CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_VOLUNTARY=n
+CONFIG_PREEMPT=n
+CONFIG_DEBUG_LOCK_ALLOC=y
+CONFIG_PROVE_LOCKING=y
+#CHECK#CONFIG_PROVE_RCU=y
+CONFIG_RCU_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot
new file mode 100644
index 0000000..9675ad6
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot
@@ -0,0 +1 @@
+rcutorture.torture_type=tasks-tracing
-- 
2.9.5



* [PATCH RFC v2 tip/core/rcu 16/22] rcu-tasks: Add stall warnings for RCU Tasks Trace
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (14 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 15/22] rcutorture: Add torture tests for RCU Tasks Trace paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 17/22] rcu-tasks: Move #ifdef into tasks.h paulmck
                     ` (9 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds RCU CPU stall warnings for RCU Tasks Trace.  These
dump out any tasks blocking the current grace period, as well as any
CPUs that have not responded to an IPI request.  This happens in two
phases, when initially extracting state from the tasks and later when
waiting for any holdout tasks to check in.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 68 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 24f41ec..e6ef98e 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -788,9 +788,41 @@ static void rcu_tasks_trace_postscan(void)
 	// Any tasks that exit after this point will set ->trc_reader_checked.
 }
 
+/* Show the state of a task stalling the current RCU tasks trace GP. */
+static void show_stalled_task_trace(struct task_struct *t, bool *firstreport)
+{
+	int cpu;
+
+	if (*firstreport) {
+		pr_err("INFO: rcu_tasks_trace detected stalls on tasks:\n");
+		*firstreport = false;
+	}
+	// FIXME: This should attempt to use try_invoke_on_nonrunning_task().
+	cpu = task_cpu(t);
+	pr_alert("P%d: %c%c%c nesting: %d%c cpu: %d\n",
+		 t->pid,
+		 ".I"[READ_ONCE(t->trc_ipi_to_cpu) >= 0],
+		 ".i"[is_idle_task(t)],
+		 ".N"[cpu >= 0 && tick_nohz_full_cpu(cpu)],
+		 t->trc_reader_nesting,
+		 " N"[!!t->trc_reader_need_end],
+		 cpu);
+	sched_show_task(t);
+}
+
+/* List stalled IPIs for RCU tasks trace. */
+static void show_stalled_ipi_trace(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		if (per_cpu(trc_ipi_to_cpu, cpu))
+			pr_alert("\tIPI outstanding to CPU %d\n", cpu);
+}
+
 /* Do one scan of the holdout list. */
 static void check_all_holdout_tasks_trace(struct list_head *hop,
-					  bool ndrpt, bool *frptp)
+					  bool needreport, bool *firstreport)
 {
 	struct task_struct *g, *t;
 
@@ -803,21 +835,53 @@ static void check_all_holdout_tasks_trace(struct list_head *hop,
 		// If check succeeded, remove this task from the list.
 		if (READ_ONCE(t->trc_reader_checked))
 			trc_del_holdout(t);
+		else if (needreport)
+			show_stalled_task_trace(t, firstreport);
+	}
+	if (needreport) {
+		if (*firstreport)
+			pr_err("INFO: rcu_tasks_trace detected stalls?\n");
+		show_stalled_ipi_trace();
 	}
 }
 
 /* Wait for grace period to complete and provide ordering. */
 static void rcu_tasks_trace_postgp(void)
 {
+	bool firstreport;
+	struct task_struct *g, *t;
+	LIST_HEAD(holdouts);
+	long ret;
+
 	// Remove the safety count.
 	smp_mb__before_atomic();  // Order vs. earlier atomics
 	atomic_dec(&trc_n_readers_need_end);
 	smp_mb__after_atomic();  // Order vs. later atomics
 
 	// Wait for readers.
-	wait_event_idle_exclusive(trc_wait,
-				  atomic_read(&trc_n_readers_need_end) == 0);
-
+	for (;;) {
+		ret = wait_event_idle_exclusive_timeout(
+				trc_wait,
+				atomic_read(&trc_n_readers_need_end) == 0,
+				READ_ONCE(rcu_task_stall_timeout));
+		if (ret)
+			break;  // Count reached zero.
+		for_each_process_thread(g, t) {
+			if (READ_ONCE(t->trc_reader_need_end)) {
+				trc_add_holdout(t, &holdouts);
+			}
+		}
+		firstreport = true;
+		list_for_each_entry_safe(t, g, &holdouts, trc_holdout_list)
+			if (READ_ONCE(t->trc_reader_need_end)) {
+				show_stalled_task_trace(t, &firstreport);
+				trc_del_holdout(t);
+			}
+		if (firstreport)
+			pr_err("INFO: rcu_tasks_trace detected stalls?\n");
+		show_stalled_ipi_trace();
+		pr_err("\t%d holdouts\n", atomic_read(&trc_n_readers_need_end));
+	}
 	smp_mb(); // Caller's code must be ordered after wakeup.
 }
 
-- 
2.9.5
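
The stall report in the patch above encodes per-task flags by indexing
two-character string literals with a 0/1 condition.  The following is a
minimal userspace sketch of that idiom, not the kernel code itself: the
struct task_flags type and the format_flags() helper are hypothetical
names introduced here for illustration, and the sketch assumes that CPU
numbers start at 0 and that -1 means "no CPU".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task state that the report prints. */
struct task_flags {
	int ipi_to_cpu;	/* -1 when no IPI is outstanding (assumption) */
	bool idle;	/* true for an idle task */
	int cpu;	/* CPU the task last ran on, -1 if unknown */
	bool nohz_full;	/* true if that CPU is a nohz_full CPU */
};

/*
 * Fill out[] with three flag characters plus a terminating NUL.
 * Each two-character literal is indexed by a condition that evaluates
 * to 0 or 1, so '.' means "flag clear" and the letter means "flag set".
 */
static void format_flags(const struct task_flags *tf, char out[4])
{
	out[0] = ".I"[tf->ipi_to_cpu >= 0];
	out[1] = ".i"[tf->idle];
	out[2] = ".N"[tf->cpu >= 0 && tf->nohz_full];
	out[3] = '\0';
}
```

A caller would pass a four-byte buffer and print it with "%s".  The
condition must evaluate to exactly 0 or 1, which is why the kernel code
applies !! to values such as ->trc_reader_need_end.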



* [PATCH RFC v2 tip/core/rcu 17/22] rcu-tasks: Move #ifdef into tasks.h
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (15 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 16/22] rcu-tasks: Add stall warnings " paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 18/22] rcu-tasks: Add RCU tasks to rcutorture writer stall output paulmck
                     ` (8 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit pushes the #ifdef CONFIG_TASKS_RCU_GENERIC from
kernel/rcu/update.c to kernel/rcu/tasks.h in order to improve
readability as more APIs are added.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h  | 5 +++++
 kernel/rcu/update.c | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index e6ef98e..220f264 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -5,6 +5,7 @@
  * Copyright (C) 2020 Paul E. McKenney
  */
 
+#ifdef CONFIG_TASKS_RCU_GENERIC
 
 ////////////////////////////////////////////////////////////////////////
 //
@@ -976,3 +977,7 @@ core_initcall(rcu_spawn_tasks_trace_kthread);
 #else /* #ifdef CONFIG_TASKS_TRACE_RCU */
 void exit_tasks_rcu_finish_trace(struct task_struct *t) { }
 #endif /* #else #ifdef CONFIG_TASKS_TRACE_RCU */
+
+#else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
+static inline void rcu_tasks_bootup_oddness(void) {}
+#endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 16058a5..0fb2a9e 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -559,11 +559,7 @@ late_initcall(rcu_verify_early_boot_tests);
 void rcu_early_boot_tests(void) {}
 #endif /* CONFIG_PROVE_RCU */
 
-#ifdef CONFIG_TASKS_RCU_GENERIC
 #include "tasks.h"
-#else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
-static inline void rcu_tasks_bootup_oddness(void) {}
-#endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
 
 #ifndef CONFIG_TINY_RCU
 
-- 
2.9.5



* [PATCH RFC v2 tip/core/rcu 18/22] rcu-tasks: Add RCU tasks to rcutorture writer stall output
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (16 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 17/22] rcu-tasks: Move #ifdef into tasks.h paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 19/22] rcu-tasks: Make rcutorture writer stall output include GP state paulmck
                     ` (7 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds state for each RCU-tasks flavor to the rcutorture
writer stall output.  The initial state is minimal, but you have to
start somewhere.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Fixes based on feedback from kbuild test robot. ]
---
 kernel/rcu/rcu.h        |  1 +
 kernel/rcu/tasks.h      | 45 +++++++++++++++++++++++++++++++++++++++++++--
 kernel/rcu/tree_stall.h |  2 +-
 3 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 72903867..e1089fd 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -431,6 +431,7 @@ bool rcu_gp_is_expedited(void);  /* Internal RCU use. */
 void rcu_expedite_gp(void);
 void rcu_unexpedite_gp(void);
 void rcupdate_announce_bootup_oddness(void);
+void show_rcu_tasks_gp_kthreads(void);
 void rcu_request_urgent_qs_task(struct task_struct *t);
 #endif /* #else #ifdef CONFIG_TINY_RCU */
 
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 220f264..1583e45 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -219,6 +219,16 @@ static void __init rcu_tasks_bootup_oddness(void)
 
 #endif /* #ifndef CONFIG_TINY_RCU */
 
+/* Dump out rcutorture-relevant state common to all RCU-tasks flavors. */
+static void show_rcu_tasks_generic_gp_kthread(struct rcu_tasks *rtp, char *s)
+{
+	pr_info("%s %c%c %s\n",
+		rtp->kname,
+		".k"[!!data_race(rtp->kthread_ptr)],
+		".C"[!!data_race(rtp->cbs_head)],
+		s);
+}
+
 #ifdef CONFIG_TASKS_RCU
 
 ////////////////////////////////////////////////////////////////////////
@@ -482,7 +492,14 @@ static int __init rcu_spawn_tasks_kthread(void)
 }
 core_initcall(rcu_spawn_tasks_kthread);
 
-#endif /* #ifdef CONFIG_TASKS_RCU */
+static void show_rcu_tasks_classic_gp_kthread(void)
+{
+	show_rcu_tasks_generic_gp_kthread(&rcu_tasks, "");
+}
+
+#else /* #ifdef CONFIG_TASKS_RCU */
+static void show_rcu_tasks_classic_gp_kthread(void) { }
+#endif /* #else #ifdef CONFIG_TASKS_RCU */
 
 #ifdef CONFIG_TASKS_RUDE_RCU
 
@@ -578,7 +595,14 @@ static int __init rcu_spawn_tasks_rude_kthread(void)
 }
 core_initcall(rcu_spawn_tasks_rude_kthread);
 
-#endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
+static void show_rcu_tasks_rude_gp_kthread(void)
+{
+	show_rcu_tasks_generic_gp_kthread(&rcu_tasks_rude, "");
+}
+
+#else /* #ifdef CONFIG_TASKS_RUDE_RCU */
+static void show_rcu_tasks_rude_gp_kthread(void) {}
+#endif /* #else #ifdef CONFIG_TASKS_RUDE_RCU */
 
 ////////////////////////////////////////////////////////////////////////
 //
@@ -974,10 +998,27 @@ static int __init rcu_spawn_tasks_trace_kthread(void)
 }
 core_initcall(rcu_spawn_tasks_trace_kthread);
 
+static void show_rcu_tasks_trace_gp_kthread(void)
+{
+	char buf[32];
+
+	sprintf(buf, "N%d", atomic_read(&trc_n_readers_need_end));
+	show_rcu_tasks_generic_gp_kthread(&rcu_tasks_trace, buf);
+}
+
 #else /* #ifdef CONFIG_TASKS_TRACE_RCU */
 void exit_tasks_rcu_finish_trace(struct task_struct *t) { }
+static inline void show_rcu_tasks_trace_gp_kthread(void) {}
 #endif /* #else #ifdef CONFIG_TASKS_TRACE_RCU */
 
+void show_rcu_tasks_gp_kthreads(void)
+{
+	show_rcu_tasks_classic_gp_kthread();
+	show_rcu_tasks_rude_gp_kthread();
+	show_rcu_tasks_trace_gp_kthread();
+}
+
 #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
 static inline void rcu_tasks_bootup_oddness(void) {}
+void show_rcu_tasks_gp_kthreads(void) {}
 #endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index e19487d..ec8e985 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -649,7 +649,7 @@ void show_rcu_gp_kthreads(void)
 		if (rcu_segcblist_is_offloaded(&rdp->cblist))
 			show_rcu_nocb_state(rdp);
 	}
-	/* sched_show_task(rcu_state.gp_kthread); */
+	show_rcu_tasks_gp_kthreads();
 }
 EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
 
-- 
2.9.5



* [PATCH RFC v2 tip/core/rcu 19/22] rcu-tasks: Make rcutorture writer stall output include GP state
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (17 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 18/22] rcu-tasks: Add RCU tasks to rcutorture writer stall output paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 20/22] rcu-tasks: Make RCU Tasks Trace make use of RCU scheduler hooks paulmck
                     ` (6 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds grace-period state and time to the rcutorture writer
stall output.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 72 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 1583e45..c7f03c9 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -17,7 +17,7 @@ typedef void (*pregp_func_t)(void);
 typedef void (*pertask_func_t)(struct task_struct *t, struct list_head *hop);
 typedef void (*postscan_func_t)(void);
 typedef void (*holdouts_func_t)(struct list_head *hop, bool ndrpt, bool *frptp);
-typedef void (*postgp_func_t)(void);
+typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
 
 /**
  * Definition for a Tasks-RCU-like mechanism.
@@ -27,6 +27,9 @@ typedef void (*postgp_func_t)(void);
  * @cbs_lock: Lock protecting callback list.
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
  * @gp_func: This flavor's grace-period-wait function.
+ * @gp_state: Grace period's most recent state transition (debugging).
+ * @gp_jiffies: Time of last @gp_state transition.
+ * @gp_start: Most recent grace-period start in jiffies.
  * @pregp_func: This flavor's pre-grace-period function (optional).
  * @pertask_func: This flavor's per-task scan function (optional).
  * @postscan_func: This flavor's post-task scan function (optional).
@@ -41,6 +44,8 @@ struct rcu_tasks {
 	struct rcu_head **cbs_tail;
 	struct wait_queue_head cbs_wq;
 	raw_spinlock_t cbs_lock;
+	int gp_state;
+	unsigned long gp_jiffies;
 	struct task_struct *kthread_ptr;
 	rcu_tasks_gp_func_t gp_func;
 	pregp_func_t pregp_func;
@@ -73,10 +78,56 @@ DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
 module_param(rcu_task_stall_timeout, int, 0644);
 
+/* RCU tasks grace-period state for debugging. */
+#define RTGS_INIT		 0
+#define RTGS_WAIT_WAIT_CBS	 1
+#define RTGS_WAIT_GP		 2
+#define RTGS_PRE_WAIT_GP	 3
+#define RTGS_SCAN_TASKLIST	 4
+#define RTGS_POST_SCAN_TASKLIST	 5
+#define RTGS_WAIT_SCAN_HOLDOUTS	 6
+#define RTGS_SCAN_HOLDOUTS	 7
+#define RTGS_POST_GP		 8
+#define RTGS_WAIT_READERS	 9
+#define RTGS_INVOKE_CBS		10
+#define RTGS_WAIT_CBS		11
+static const char * const rcu_tasks_gp_state_names[] = {
+	"RTGS_INIT",
+	"RTGS_WAIT_WAIT_CBS",
+	"RTGS_WAIT_GP",
+	"RTGS_PRE_WAIT_GP",
+	"RTGS_SCAN_TASKLIST",
+	"RTGS_POST_SCAN_TASKLIST",
+	"RTGS_WAIT_SCAN_HOLDOUTS",
+	"RTGS_SCAN_HOLDOUTS",
+	"RTGS_POST_GP",
+	"RTGS_WAIT_READERS",
+	"RTGS_INVOKE_CBS",
+	"RTGS_WAIT_CBS",
+};
+
 ////////////////////////////////////////////////////////////////////////
 //
 // Generic code.
 
+/* Record grace-period phase and time. */
+static void set_tasks_gp_state(struct rcu_tasks *rtp, int newstate)
+{
+	rtp->gp_state = newstate;
+	rtp->gp_jiffies = jiffies;
+}
+
+/* Return state name. */
+static const char *tasks_gp_state_getname(struct rcu_tasks *rtp)
+{
+	int i = data_race(rtp->gp_state); // Let KCSAN detect update races
+	int j = READ_ONCE(i); // Prevent the compiler from reading twice
+
+	if (j >= ARRAY_SIZE(rcu_tasks_gp_state_names))
+		return "???";
+	return rcu_tasks_gp_state_names[j];
+}
+
 // Enqueue a callback for the specified flavor of Tasks RCU.
 static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
 				   struct rcu_tasks *rtp)
@@ -141,15 +192,18 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 						 READ_ONCE(rtp->cbs_head));
 			if (!rtp->cbs_head) {
 				WARN_ON(signal_pending(current));
+				set_tasks_gp_state(rtp, RTGS_WAIT_WAIT_CBS);
 				schedule_timeout_interruptible(HZ/10);
 			}
 			continue;
 		}
 
 		// Wait for one grace period.
+		set_tasks_gp_state(rtp, RTGS_WAIT_GP);
 		rtp->gp_func(rtp);
 
 		/* Invoke the callbacks. */
+		set_tasks_gp_state(rtp, RTGS_INVOKE_CBS);
 		while (list) {
 			next = list->next;
 			local_bh_disable();
@@ -160,6 +214,8 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 		}
 		/* Paranoid sleep to keep this from entering a tight loop */
 		schedule_timeout_uninterruptible(HZ/10);
+
+		set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
 	}
 }
 
@@ -222,8 +278,11 @@ static void __init rcu_tasks_bootup_oddness(void)
 /* Dump out rcutorture-relevant state common to all RCU-tasks flavors. */
 static void show_rcu_tasks_generic_gp_kthread(struct rcu_tasks *rtp, char *s)
 {
-	pr_info("%s %c%c %s\n",
+	pr_info("%s: %s(%d) since %lu %c%c %s\n",
 		rtp->kname,
+		tasks_gp_state_getname(rtp),
+		data_race(rtp->gp_state),
+		jiffies - data_race(rtp->gp_jiffies),
 		".k"[!!data_race(rtp->kthread_ptr)],
 		".C"[!!data_race(rtp->cbs_head)],
 		s);
@@ -243,6 +302,7 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 	LIST_HEAD(holdouts);
 	int fract;
 
+	set_tasks_gp_state(rtp, RTGS_PRE_WAIT_GP);
 	rtp->pregp_func();
 
 	/*
@@ -251,11 +311,13 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 	 * that are not already voluntarily blocked.  Mark these tasks
 	 * and make a list of them in holdouts.
 	 */
+	set_tasks_gp_state(rtp, RTGS_SCAN_TASKLIST);
 	rcu_read_lock();
 	for_each_process_thread(g, t)
 		rtp->pertask_func(t, &holdouts);
 	rcu_read_unlock();
 
+	set_tasks_gp_state(rtp, RTGS_POST_SCAN_TASKLIST);
 	rtp->postscan_func();
 
 	/*
@@ -277,6 +339,7 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 			break;
 
 		/* Slowly back off waiting for holdouts */
+		set_tasks_gp_state(rtp, RTGS_WAIT_SCAN_HOLDOUTS);
 		schedule_timeout_interruptible(HZ/fract);
 
 		if (fract > 1)
@@ -288,10 +351,12 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 			lastreport = jiffies;
 		firstreport = true;
 		WARN_ON(signal_pending(current));
+		set_tasks_gp_state(rtp, RTGS_SCAN_HOLDOUTS);
 		rtp->holdouts_func(&holdouts, needreport, &firstreport);
 	}
 
-	rtp->postgp_func();
+	set_tasks_gp_state(rtp, RTGS_POST_GP);
+	rtp->postgp_func(rtp);
 }
 
 ////////////////////////////////////////////////////////////////////////
@@ -394,7 +459,7 @@ static void check_all_holdout_tasks(struct list_head *hop,
 }
 
 /* Finish off the Tasks-RCU grace period. */
-static void rcu_tasks_postgp(void)
+static void rcu_tasks_postgp(struct rcu_tasks *rtp)
 {
 	/*
 	 * Because ->on_rq and ->nvcsw are not guaranteed to have a full
@@ -871,7 +936,7 @@ static void check_all_holdout_tasks_trace(struct list_head *hop,
 }
 
 /* Wait for grace period to complete and provide ordering. */
-static void rcu_tasks_trace_postgp(void)
+static void rcu_tasks_trace_postgp(struct rcu_tasks *rtp)
 {
 	bool firstreport;
 	struct task_struct *g, *t;
@@ -884,6 +949,7 @@ static void rcu_tasks_trace_postgp(void)
 	smp_mb__after_atomic();  // Order vs. later atomics
 
 	// Wait for readers.
+	set_tasks_gp_state(rtp, RTGS_WAIT_READERS);
 	for (;;) {
 		ret = wait_event_idle_exclusive_timeout(
 				trc_wait,
@@ -891,6 +957,7 @@ static void rcu_tasks_trace_postgp(void)
 				READ_ONCE(rcu_task_stall_timeout));
 		if (ret)
 			break;  // Count reached zero.
+		// Stall warning time, so make a list of the offenders.
 		for_each_process_thread(g, t) {
 			if (READ_ONCE(t->trc_reader_need_end)) {
 				trc_add_holdout(t, &holdouts);
-- 
2.9.5
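
The state tracking added by the patch above pairs a small integer state
with a table of names and a bounds-checked lookup.  The following is a
hedged userspace sketch of that pattern.  The GS_* constants, struct
gp_tracker, set_gp_state() and gp_state_getname() are hypothetical names
chosen for illustration, and the caller supplies the timestamp because
userspace has no jiffies counter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical, shortened list of grace-period states. */
enum { GS_INIT, GS_WAIT_GP, GS_INVOKE_CBS };

static const char * const gp_state_names[] = {
	[GS_INIT]	= "GS_INIT",
	[GS_WAIT_GP]	= "GS_WAIT_GP",
	[GS_INVOKE_CBS]	= "GS_INVOKE_CBS",
};

/* Hypothetical per-flavor debugging state. */
struct gp_tracker {
	int gp_state;			/* most recent state transition */
	unsigned long gp_ticks;		/* time of that transition */
};

/* Record the new phase together with the time at which it was entered. */
static void set_gp_state(struct gp_tracker *gt, int newstate, unsigned long now)
{
	gt->gp_state = newstate;
	gt->gp_ticks = now;
}

/* Map a state to its name, returning "???" for out-of-range values. */
static const char *gp_state_getname(int state)
{
	if (state < 0 || (size_t)state >= ARRAY_SIZE(gp_state_names))
		return "???";
	return gp_state_names[state];
}
```

A stall report can then print the state name together with the elapsed
time since the last transition, computed as now minus gp_ticks.  The
explicit negative check is a choice made for this sketch; the patch relies
on the comparison against ARRAY_SIZE() alone.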



* [PATCH RFC v2 tip/core/rcu 20/22] rcu-tasks: Make RCU Tasks Trace make use of RCU scheduler hooks
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (18 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 19/22] rcu-tasks: Make rcutorture writer stall output include GP state paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 21/22] rcu-tasks: Add a grace-period start time for throttling and debug paulmck
                     ` (5 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit makes the calls to rcu_tasks_qs() detect and report
quiescent states for RCU tasks trace.  If the task is in a quiescent
state and if ->trc_reader_checked is not yet set, the task sets its own
->trc_reader_checked.  This will cause the grace-period kthread to
remove it from the holdout list if it still remains there.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 include/linux/rcupdate.h | 39 ++++++++++++++++++++++++++++++++-------
 include/linux/rcutiny.h  |  2 +-
 kernel/rcu/tasks.h       |  5 +++--
 kernel/rcu/tree_plugin.h |  6 ++----
 4 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 2be97a8..3598bbb 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -131,12 +131,37 @@ static inline void rcu_init_nohz(void) { }
  * This is a macro rather than an inline function to avoid #include hell.
  */
 #ifdef CONFIG_TASKS_RCU_GENERIC
-#define rcu_tasks_qs(t) \
-	do { \
-		if (READ_ONCE((t)->rcu_tasks_holdout)) \
-			WRITE_ONCE((t)->rcu_tasks_holdout, false); \
+
+# ifdef CONFIG_TASKS_RCU
+# define rcu_tasks_classic_qs(t, preempt)				\
+	do {								\
+		if (!(preempt) && READ_ONCE((t)->rcu_tasks_holdout))	\
+			WRITE_ONCE((t)->rcu_tasks_holdout, false);	\
 	} while (0)
-#define rcu_note_voluntary_context_switch(t) rcu_tasks_qs(t)
+# else
+# define rcu_tasks_classic_qs(t, preempt) do { } while (0)
+# endif
+
+# ifdef CONFIG_TASKS_TRACE_RCU
+# define rcu_tasks_trace_qs(t)						\
+	do {								\
+		if (!likely(READ_ONCE((t)->trc_reader_checked)) &&	\
+		    !unlikely(READ_ONCE((t)->trc_reader_nesting))) {	\
+			smp_store_release(&(t)->trc_reader_checked, true); \
+			smp_mb(); /* Readers partitioned by store. */	\
+		}							\
+	} while (0)
+# else
+# define rcu_tasks_trace_qs(t) do { } while (0)
+# endif
+
+#define rcu_tasks_qs(t, preempt)					\
+do {									\
+	rcu_tasks_classic_qs((t), (preempt));				\
+	rcu_tasks_trace_qs((t));					\
+} while (0)
+
+#define rcu_note_voluntary_context_switch(t) rcu_tasks_qs(t, false)
 void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
 void synchronize_rcu_tasks(void);
 void call_rcu_tasks_rude(struct rcu_head *head, rcu_callback_t func);
@@ -144,7 +169,7 @@ void synchronize_rcu_tasks_rude(void);
 void exit_tasks_rcu_start(void);
 void exit_tasks_rcu_finish(void);
 #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
-#define rcu_tasks_qs(t)	do { } while (0)
+#define rcu_tasks_qs(t, preempt) do { } while (0)
 #define rcu_note_voluntary_context_switch(t) do { } while (0)
 #define call_rcu_tasks call_rcu
 #define synchronize_rcu_tasks synchronize_rcu
@@ -161,7 +186,7 @@ static inline void exit_tasks_rcu_finish(void) { }
  */
 #define cond_resched_tasks_rcu_qs() \
 do { \
-	rcu_tasks_qs(current); \
+	rcu_tasks_qs(current, false); \
 	cond_resched(); \
 } while (0)
 
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 045c28b..d77e111 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -49,7 +49,7 @@ static inline void rcu_softirq_qs(void)
 #define rcu_note_context_switch(preempt) \
 	do { \
 		rcu_qs(); \
-		rcu_tasks_qs(current); \
+		rcu_tasks_qs(current, (preempt)); \
 	} while (0)
 
 static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index c7f03c9..ca5fbde 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -180,7 +180,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 
 		/* Pick up any new callbacks. */
 		raw_spin_lock_irqsave(&rtp->cbs_lock, flags);
-		smp_mb__after_unlock_lock(); // Order updates vs. GP.
+		smp_mb__after_spinlock(); // Order updates vs. GP.
 		list = rtp->cbs_head;
 		rtp->cbs_head = NULL;
 		rtp->cbs_tail = &rtp->cbs_head;
@@ -864,7 +864,7 @@ static void rcu_tasks_trace_pertask(struct task_struct *t,
 				    struct list_head *hop)
 {
 	WRITE_ONCE(t->trc_reader_need_end, false);
-	t->trc_reader_checked = false;
+	WRITE_ONCE(t->trc_reader_checked, false);
 	t->trc_ipi_to_cpu = -1;
 	trc_wait_for_one_reader(t, hop);
 }
@@ -975,6 +975,7 @@ static void rcu_tasks_trace_postgp(struct rcu_tasks *rtp)
 		pr_err("\t%d holdouts\n", atomic_read(&trc_n_readers_need_end));
 	}
 	smp_mb(); // Caller's code must be ordered after wakeup.
+		  // Pairs with pretty much every ordering primitive.
 }
 
 /* Report any needed quiescent state for this exiting task. */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 7cf76e8..9355536 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -331,8 +331,7 @@ void rcu_note_context_switch(bool preempt)
 	rcu_qs();
 	if (rdp->exp_deferred_qs)
 		rcu_report_exp_rdp(rdp);
-	if (!preempt)
-		rcu_tasks_qs(current);
+	rcu_tasks_qs(current, preempt);
 	trace_rcu_utilization(TPS("End context switch"));
 }
 EXPORT_SYMBOL_GPL(rcu_note_context_switch);
@@ -841,8 +840,7 @@ void rcu_note_context_switch(bool preempt)
 	this_cpu_write(rcu_data.rcu_urgent_qs, false);
 	if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs)))
 		rcu_momentary_dyntick_idle();
-	if (!preempt)
-		rcu_tasks_qs(current);
+	rcu_tasks_qs(current, preempt);
 out:
 	trace_rcu_utilization(TPS("End context switch"));
 }
-- 
2.9.5
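
The reworked rcu_tasks_qs() in the patch above now takes the preempt flag
so that each flavor can apply its own rule.  The following is a simplified
single-threaded sketch of that decision logic.  It deliberately omits
READ_ONCE(), WRITE_ONCE(), smp_store_release() and the memory barrier, so
it illustrates only the control flow, not the ordering guarantees.  The
struct fake_task type and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the task_struct fields involved. */
struct fake_task {
	bool rcu_tasks_holdout;		/* classic Tasks RCU still waiting */
	bool trc_reader_checked;	/* Tasks Trace has seen a QS */
	int trc_reader_nesting;		/* depth of trace read-side nesting */
};

/* Classic Tasks RCU: only a voluntary context switch is a quiescent state. */
static void classic_qs(struct fake_task *t, bool preempt)
{
	if (!preempt && t->rcu_tasks_holdout)
		t->rcu_tasks_holdout = false;
}

/* Tasks Trace: any context switch outside a reader is a quiescent state. */
static void trace_qs(struct fake_task *t)
{
	if (!t->trc_reader_checked && !t->trc_reader_nesting)
		t->trc_reader_checked = true;
}

/* Combined hook, mirroring the shape of the two-argument macro. */
static void tasks_qs(struct fake_task *t, bool preempt)
{
	classic_qs(t, preempt);
	trace_qs(t);
}
```

Moving the preempt test out of rcu_note_context_switch() and into the
classic helper is what lets the trace flavor report a quiescent state on
preemption while the classic flavor continues to ignore it.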



* [PATCH RFC v2 tip/core/rcu 21/22] rcu-tasks: Add a grace-period start time for throttling and debug
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (19 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 20/22] rcu-tasks: Make RCU Tasks Trace make use of RCU scheduler hooks paulmck
@ 2020-03-19  0:10   ` paulmck
  2020-03-19  0:11   ` [PATCH RFC v2 tip/core/rcu 22/22] rcu-tasks: Provide boot parameter to delay IPIs until late in grace period paulmck
                     ` (4 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:10 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds a place to record the grace-period start in jiffies.
This will be used by later commits for debugging purposes and to throttle
IPIs early in the grace period.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index ca5fbde..6d237c4 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -46,6 +46,7 @@ struct rcu_tasks {
 	raw_spinlock_t cbs_lock;
 	int gp_state;
 	unsigned long gp_jiffies;
+	unsigned long gp_start;
 	struct task_struct *kthread_ptr;
 	rcu_tasks_gp_func_t gp_func;
 	pregp_func_t pregp_func;
@@ -200,6 +201,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 
 		// Wait for one grace period.
 		set_tasks_gp_state(rtp, RTGS_WAIT_GP);
+		rtp->gp_start = jiffies;
 		rtp->gp_func(rtp);
 
 		/* Invoke the callbacks. */
-- 
2.9.5



* [PATCH RFC v2 tip/core/rcu 22/22] rcu-tasks: Provide boot parameter to delay IPIs until late in grace period
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (20 preceding siblings ...)
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 21/22] rcu-tasks: Add a grace-period start time for throttling and debug paulmck
@ 2020-03-19  0:11   ` paulmck
  2020-03-19 11:31   ` [PATCH RFC v2 tip/core/rcu 0/22] Prototype RCU usable from idle, exception, offline Mathieu Desnoyers
                     ` (3 subsequent siblings)
  25 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-19  0:11 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit provides a rcupdate.rcu_task_ipi_delay kernel boot parameter
that specifies how old the RCU tasks trace grace period must be before
the grace-period kthread starts sending IPIs.  This delay allows more
tasks to pass through rcu_tasks_qs() quiescent states, thus reducing
(or even eliminating) the number of IPIs that must be sent.

On a short rcutorture test, setting this kernel boot parameter to HZ/2
resulted in zero IPIs for all 877 RCU-tasks trace grace periods that
elapsed during that test.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 Documentation/admin-guide/kernel-parameters.txt |  7 +++++++
 kernel/rcu/tasks.h                              | 15 ++++++++++-----
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 17eff15..2865767 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4242,6 +4242,13 @@
 			only normal grace-period primitives.  No effect
 			on CONFIG_TINY_RCU kernels.
 
+	rcupdate.rcu_task_ipi_delay= [KNL]
+			Set time in jiffies during which RCU tasks will
+			avoid sending IPIs, starting with the beginning
+			of a given grace period.  Setting a large
+			number avoids disturbing real-time workloads,
+			but lengthens grace periods.
+
 	rcupdate.rcu_task_stall_timeout= [KNL]
 			Set timeout in jiffies for RCU task stall warning
 			messages.  Disable with a value less than or equal
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 6d237c4..d166797 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -74,6 +74,11 @@ static struct rcu_tasks rt_name =					\
 /* Track exiting tasks in order to allow them to be waited for. */
 DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 
+/* Avoid IPIing CPUs early in the grace period. */
+#define RCU_TASK_IPI_DELAY (HZ / 2)
+static int rcu_task_ipi_delay __read_mostly = RCU_TASK_IPI_DELAY;
+module_param(rcu_task_ipi_delay, int, 0644);
+
 /* Control stall timeouts.  Disable with <= 0, otherwise jiffies till stall. */
 #define RCU_TASK_STALL_TIMEOUT (HZ * 60 * 10)
 static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
@@ -713,6 +718,10 @@ DECLARE_WAIT_QUEUE_HEAD(trc_wait);	// List of holdout tasks.
 // Record outstanding IPIs to each CPU.  No point in sending two...
 static DEFINE_PER_CPU(bool, trc_ipi_to_cpu);
 
+void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
+DEFINE_RCU_TASKS(rcu_tasks_trace, rcu_tasks_wait_gp, call_rcu_tasks_trace,
+		 "RCU Tasks Trace");
+
 /* If we are the last reader, wake up the grace-period kthread. */
 void rcu_read_unlock_trace_special(struct task_struct *t)
 {
@@ -825,7 +834,7 @@ static void trc_wait_for_one_reader(struct task_struct *t,
 
 	// If currently running, send an IPI, either way, add to list.
 	trc_add_holdout(t, bhp);
-	if (task_curr(t)) {
+	if (task_curr(t) && time_after(jiffies, rcu_tasks_trace.gp_start + rcu_task_ipi_delay)) {
 		// The task is currently running, so try IPIing it.
 		cpu = task_cpu(t);
 
@@ -990,10 +999,6 @@ void exit_tasks_rcu_finish_trace(struct task_struct *t)
 		rcu_read_unlock_trace_special(t);
 }
 
-void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
-DEFINE_RCU_TASKS(rcu_tasks_trace, rcu_tasks_wait_gp, call_rcu_tasks_trace,
-		 "RCU Tasks Trace");
-
 /**
  * call_rcu_tasks_trace() - Queue a callback trace task-based grace period
  * @rhp: structure to be used for queueing the RCU updates.
-- 
2.9.5
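
For readers unfamiliar with the jiffies idiom, the check added to
trc_wait_for_one_reader() can be sketched in userspace C.  This is only
an illustration under stated assumptions: jiffies_t, sketch_time_after()
and sketch_may_send_ipi() are hypothetical names that do not appear in
the patch, and sketch_time_after() imitates the wraparound-safe
comparison performed by the kernel's time_after() macro.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the type of the kernel's jiffies counter. */
typedef unsigned long jiffies_t;

/*
 * Wraparound-safe "a is later than b", in the style of time_after():
 * the unsigned difference is reinterpreted as a signed value.
 */
bool sketch_time_after(jiffies_t a, jiffies_t b)
{
	return (long)(b - a) < 0;
}

/*
 * Hypothetical helper mirroring the condition added by the patch:
 * IPI a task only if it is running and the grace period is already
 * older than the configured delay.
 */
bool sketch_may_send_ipi(bool task_is_running, jiffies_t now,
			 jiffies_t gp_start, int ipi_delay)
{
	assert(ipi_delay >= 0);	/* The sketch assumes a non-negative delay. */
	return task_is_running &&
	       sketch_time_after(now, gp_start + (jiffies_t)ipi_delay);
}
```

With a delay of 50 jiffies, a running task is IPIed only once more than
50 jiffies have elapsed since gp_start.  Before that, the task simply
stays on the holdout list in the hope that it reaches a quiescent state
without help.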



* Re: [PATCH RFC v2 tip/core/rcu 14/22] rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 14/22] rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks paulmck
@ 2020-03-19  1:37     ` Joel Fernandes
  2020-03-19  1:58       ` Joel Fernandes
  2020-03-19 19:42     ` Steven Rostedt
  1 sibling, 1 reply; 171+ messages in thread
From: Joel Fernandes @ 2020-03-19  1:37 UTC (permalink / raw)
  To: paulmck
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, Alexei Starovoitov, Andrii Nakryiko

On Wed, Mar 18, 2020 at 05:10:52PM -0700, paulmck@kernel.org wrote:
[...]
> +/* Initialize for a new RCU-tasks-trace grace period. */
> +static void rcu_tasks_trace_pregp_step(void)
> +{
> +	int cpu;
> +
> +	// Wait for CPU-hotplug paths to complete.
> +	cpus_read_lock();
> +	cpus_read_unlock();
> +
> +	// Allow for fast-acting IPIs.
> +	atomic_set(&trc_n_readers_need_end, 1);
> +
> +	// There shouldn't be any old IPIs, but...
> +	for_each_possible_cpu(cpu)
> +		WARN_ON_ONCE(per_cpu(trc_ipi_to_cpu, cpu));
> +}
> +
> +/* Do first-round processing for the specified task. */
> +static void rcu_tasks_trace_pertask(struct task_struct *t,
> +				    struct list_head *hop)
> +{
> +	WRITE_ONCE(t->trc_reader_need_end, false);
> +	t->trc_reader_checked = false;
> +	t->trc_ipi_to_cpu = -1;
> +	trc_wait_for_one_reader(t, hop);
> +}
> +
> +/* Do intermediate processing between task and holdout scans. */
> +static void rcu_tasks_trace_postscan(void)
> +{
> +	// Wait for late-stage exiting tasks to finish exiting.
> +	// These might have passed the call to exit_tasks_rcu_finish().
> +	synchronize_rcu();
> +	// Any tasks that exit after this point will set ->trc_reader_checked.
> +}
> +
> +/* Do one scan of the holdout list. */
> +static void check_all_holdout_tasks_trace(struct list_head *hop,
> +					  bool ndrpt, bool *frptp)
> +{
> +	struct task_struct *g, *t;
> +
> +	list_for_each_entry_safe(t, g, hop, trc_holdout_list) {
> +		// If safe and needed, try to check the current task.
> +		if (READ_ONCE(t->trc_ipi_to_cpu) == -1 &&
> +		    !READ_ONCE(t->trc_reader_checked))
> +			trc_wait_for_one_reader(t, hop);

Just some questions:

1. How are we ensuring on the reader-side that we are executing memory
barriers that are sufficient to ensure that all update-side memory operations
in the reader section are visible to code executing after the grace period?

2. Is it possible that a hold-out task is removed from the hold-out list and is
not waited on in the updater side, before the reader side got a chance to
indirectly execute such memory barriers?

3. If a reader sees updates that were done before the grace period started, it
should not see any updates that happen after the grace period ends. Is that
guaranteed with this RCU-Trace?

If it's OK, it would be nice to mention more about memory ordering aspect in
the changelog.

thanks!

 - Joel


> +
> +		// If check succeeded, remove this task from the list.
> +		if (READ_ONCE(t->trc_reader_checked))
> +			trc_del_holdout(t);
> +	}
> +}
> +
> +/* Wait for grace period to complete and provide ordering. */
> +static void rcu_tasks_trace_postgp(void)
> +{
> +	// Remove the safety count.
> +	smp_mb__before_atomic();  // Order vs. earlier atomics
> +	atomic_dec(&trc_n_readers_need_end);
> +	smp_mb__after_atomic();  // Order vs. later atomics
> +
> +	// Wait for readers.
> +	wait_event_idle_exclusive(trc_wait,
> +				  atomic_read(&trc_n_readers_need_end) == 0);
> +
> +	smp_mb(); // Caller's code must be ordered after wakeup.
> +}
> +
> +/* Report any needed quiescent state for this exiting task. */
> +void exit_tasks_rcu_finish_trace(struct task_struct *t)
> +{
> +	WRITE_ONCE(t->trc_reader_checked, true);
> +	WARN_ON_ONCE(t->trc_reader_nesting);
> +	WRITE_ONCE(t->trc_reader_nesting, 0);
> +	if (WARN_ON_ONCE(READ_ONCE(t->trc_reader_need_end)))
> +		rcu_read_unlock_trace_special(t);
> +}
> +
> +void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
> +DEFINE_RCU_TASKS(rcu_tasks_trace, rcu_tasks_wait_gp, call_rcu_tasks_trace,
> +		 "RCU Tasks Trace");
> +
> +/**
> + * call_rcu_tasks_trace() - Queue a callback trace task-based grace period
> + * @rhp: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all currently executing RCU
> + * read-side critical sections have completed. call_rcu_tasks_trace()
> + * assumes that the read-side critical sections end at context switch,
> + * cond_resched_rcu_qs(), or transition to usermode execution.  As such,
> + * there are no read-side primitives analogous to rcu_read_lock() and
> + * rcu_read_unlock() because this primitive is intended to determine
> + * that all tasks have passed through a safe state, not so much for
> + * data-structure synchronization.
> + *
> + * See the description of call_rcu() for more detailed information on
> + * memory ordering guarantees.
> + */
> +void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func)
> +{
> +	call_rcu_tasks_generic(rhp, func, &rcu_tasks_trace);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_tasks_trace);
> +
> +/**
> + * synchronize_rcu_tasks_trace - wait for a trace rcu-tasks grace period
> + *
> + * Control will return to the caller some time after a trace rcu-tasks
> + * grace period has elapsed, in other words after all currently
> + * executing rcu-tasks read-side critical sections have elapsed.  These
> + * read-side critical sections are delimited by calls to schedule(),
> + * cond_resched_tasks_rcu_qs(), userspace execution, and (in theory,
> + * anyway) cond_resched().
> + *
> + * This is a very specialized primitive, intended only for a few uses in
> + * tracing and other situations requiring manipulation of function preambles
> + * and profiling hooks.  The synchronize_rcu_tasks_trace() function is not
> + * (yet) intended for heavy use from multiple CPUs.
> + *
> + * See the description of synchronize_rcu() for more detailed information
> + * on memory ordering guarantees.
> + */
> +void synchronize_rcu_tasks_trace(void)
> +{
> +	RCU_LOCKDEP_WARN(lock_is_held(&rcu_trace_lock_map), "Illegal synchronize_rcu_tasks_trace() in RCU Tasks Trace read-side critical section");
> +	synchronize_rcu_tasks_generic(&rcu_tasks_trace);
> +}
> +EXPORT_SYMBOL_GPL(synchronize_rcu_tasks_trace);
> +
> +/**
> + * rcu_barrier_tasks_trace - Wait for in-flight call_rcu_tasks_trace() callbacks.
> + *
> + * Although the current implementation is guaranteed to wait, it is not
> + * obligated to, for example, if there are no pending callbacks.
> + */
> +void rcu_barrier_tasks_trace(void)
> +{
> +	/* There is only one callback queue, so this is easy.  ;-) */
> +	synchronize_rcu_tasks_trace();
> +}
> +EXPORT_SYMBOL_GPL(rcu_barrier_tasks_trace);
> +
> +static int __init rcu_spawn_tasks_trace_kthread(void)
> +{
> +	rcu_tasks_trace.pregp_func = rcu_tasks_trace_pregp_step;
> +	rcu_tasks_trace.pertask_func = rcu_tasks_trace_pertask;
> +	rcu_tasks_trace.postscan_func = rcu_tasks_trace_postscan;
> +	rcu_tasks_trace.holdouts_func = check_all_holdout_tasks_trace;
> +	rcu_tasks_trace.postgp_func = rcu_tasks_trace_postgp;
> +	rcu_spawn_tasks_kthread_generic(&rcu_tasks_trace);
> +	return 0;
> +}
> +core_initcall(rcu_spawn_tasks_trace_kthread);
> +
> +#else /* #ifdef CONFIG_TASKS_TRACE_RCU */
> +void exit_tasks_rcu_finish_trace(struct task_struct *t) { }
> +#endif /* #else #ifdef CONFIG_TASKS_TRACE_RCU */
> -- 
> 2.9.5
> 


* Re: [PATCH RFC v2 tip/core/rcu 14/22] rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks
  2020-03-19  1:37     ` Joel Fernandes
@ 2020-03-19  1:58       ` Joel Fernandes
  2020-03-19  3:40         ` Paul E. McKenney
  0 siblings, 1 reply; 171+ messages in thread
From: Joel Fernandes @ 2020-03-19  1:58 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, LKML, kernel-team@fb.com,
	Ingo Molnar, Lai Jiangshan, dipankar, Andrew Morton,
	Mathieu Desnoyers, Josh Triplett, Thomas Gleixner,
	Peter Zijlstra, Steven Rostedt, David Howells, Eric Dumazet,
	Frederic Weisbecker, Oleg Nesterov, Alexei Starovoitov,
	Andrii Nakryiko

On Wed, Mar 18, 2020 at 9:37 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Wed, Mar 18, 2020 at 05:10:52PM -0700, paulmck@kernel.org wrote:
> [...]
> > +/* Initialize for a new RCU-tasks-trace grace period. */
> > +static void rcu_tasks_trace_pregp_step(void)
> > +{
> > +     int cpu;
> > +
> > +     // Wait for CPU-hotplug paths to complete.
> > +     cpus_read_lock();
> > +     cpus_read_unlock();
> > +
> > +     // Allow for fast-acting IPIs.
> > +     atomic_set(&trc_n_readers_need_end, 1);
> > +
> > +     // There shouldn't be any old IPIs, but...
> > +     for_each_possible_cpu(cpu)
> > +             WARN_ON_ONCE(per_cpu(trc_ipi_to_cpu, cpu));
> > +}
> > +
> > +/* Do first-round processing for the specified task. */
> > +static void rcu_tasks_trace_pertask(struct task_struct *t,
> > +                                 struct list_head *hop)
> > +{
> > +     WRITE_ONCE(t->trc_reader_need_end, false);
> > +     t->trc_reader_checked = false;
> > +     t->trc_ipi_to_cpu = -1;
> > +     trc_wait_for_one_reader(t, hop);
> > +}
> > +
> > +/* Do intermediate processing between task and holdout scans. */
> > +static void rcu_tasks_trace_postscan(void)
> > +{
> > +     // Wait for late-stage exiting tasks to finish exiting.
> > +     // These might have passed the call to exit_tasks_rcu_finish().
> > +     synchronize_rcu();
> > +     // Any tasks that exit after this point will set ->trc_reader_checked.
> > +}
> > +
> > +/* Do one scan of the holdout list. */
> > +static void check_all_holdout_tasks_trace(struct list_head *hop,
> > +                                       bool ndrpt, bool *frptp)
> > +{
> > +     struct task_struct *g, *t;
> > +
> > +     list_for_each_entry_safe(t, g, hop, trc_holdout_list) {
> > +             // If safe and needed, try to check the current task.
> > +             if (READ_ONCE(t->trc_ipi_to_cpu) == -1 &&
> > +                 !READ_ONCE(t->trc_reader_checked))
> > +                     trc_wait_for_one_reader(t, hop);
>
> Just some questions:
>
> 1. How are we ensuring on the reader-side that we are executing memory
> barriers that are sufficient to ensure that all update-side memory operations

Apologies, here I meant "update memory operations".

thanks,

 - Joel


> in reader section is visible to code executing after the grace period?


* Re: [PATCH RFC v2 tip/core/rcu 14/22] rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks
  2020-03-19  1:58       ` Joel Fernandes
@ 2020-03-19  3:40         ` Paul E. McKenney
  0 siblings, 0 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-19  3:40 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, LKML, kernel-team@fb.com,
	Ingo Molnar, Lai Jiangshan, dipankar, Andrew Morton,
	Mathieu Desnoyers, Josh Triplett, Thomas Gleixner,
	Peter Zijlstra, Steven Rostedt, David Howells, Eric Dumazet,
	Frederic Weisbecker, Oleg Nesterov, Alexei Starovoitov,
	Andrii Nakryiko

On Wed, Mar 18, 2020 at 09:58:23PM -0400, Joel Fernandes wrote:
> On Wed, Mar 18, 2020 at 9:37 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Wed, Mar 18, 2020 at 05:10:52PM -0700, paulmck@kernel.org wrote:
> > [...]
> > > +/* Initialize for a new RCU-tasks-trace grace period. */
> > > +static void rcu_tasks_trace_pregp_step(void)
> > > +{
> > > +     int cpu;
> > > +
> > > +     // Wait for CPU-hotplug paths to complete.
> > > +     cpus_read_lock();
> > > +     cpus_read_unlock();
> > > +
> > > +     // Allow for fast-acting IPIs.
> > > +     atomic_set(&trc_n_readers_need_end, 1);
> > > +
> > > +     // There shouldn't be any old IPIs, but...
> > > +     for_each_possible_cpu(cpu)
> > > +             WARN_ON_ONCE(per_cpu(trc_ipi_to_cpu, cpu));
> > > +}
> > > +
> > > +/* Do first-round processing for the specified task. */
> > > +static void rcu_tasks_trace_pertask(struct task_struct *t,
> > > +                                 struct list_head *hop)
> > > +{
> > > +     WRITE_ONCE(t->trc_reader_need_end, false);
> > > +     t->trc_reader_checked = false;
> > > +     t->trc_ipi_to_cpu = -1;
> > > +     trc_wait_for_one_reader(t, hop);
> > > +}
> > > +
> > > +/* Do intermediate processing between task and holdout scans. */
> > > +static void rcu_tasks_trace_postscan(void)
> > > +{
> > > +     // Wait for late-stage exiting tasks to finish exiting.
> > > +     // These might have passed the call to exit_tasks_rcu_finish().
> > > +     synchronize_rcu();
> > > +     // Any tasks that exit after this point will set ->trc_reader_checked.
> > > +}
> > > +
> > > +/* Do one scan of the holdout list. */
> > > +static void check_all_holdout_tasks_trace(struct list_head *hop,
> > > +                                       bool ndrpt, bool *frptp)
> > > +{
> > > +     struct task_struct *g, *t;
> > > +
> > > +     list_for_each_entry_safe(t, g, hop, trc_holdout_list) {
> > > +             // If safe and needed, try to check the current task.
> > > +             if (READ_ONCE(t->trc_ipi_to_cpu) == -1 &&
> > > +                 !READ_ONCE(t->trc_reader_checked))
> > > +                     trc_wait_for_one_reader(t, hop);
> >
> > Just some questions:
> >
> > 1. How are we ensuring on the reader-side that we are executing memory
> > barriers that are sufficient to ensure that all update-side memory operations
> 
> Apologies, here I meant "update memory operations".
> 
> > in reader section is visible to code executing after the grace period?

The most pithy response is that in many of the cases, we are not
executing any additional memory barriers on the read side because it is
not necessary to do so.  Another pithy response would be that there are
very likely still memory-ordering bugs, hence the "RFC".  ;-)

Perhaps more constructively...

There are a number of cases, taking things task by task.

1.	The target task is blocked and not in a read-side critical
	section.  In this case, the grace-period kthread's call to
	try_invoke_on_locked_down_task() will acquire the target task's
	->pi_lock.  This lock will have been acquired by the target task
	while sleeping and will be acquired again when it is awakened.
	This set of locks will order prior and subsequent read-side
	critical sections against that point in the grace-period
	process.  The smp_mb__after_spinlock() at the beginning of
	rcu_tasks_kthread() in combination with the lock will order this
	against posting of the callbacks.  The smp_mb() at the end of
	rcu_tasks_trace_postgp() does the same for post-grace-period
	actions.

	Note that the ordering from the callbacks to the subsequent
	readers still holds even if the target task is in a
	read-side critical section.

2.	The target task is runnable and not in a read-side critical
	section.  This proceeds the same as #1 above, except that
	both the ->pi_lock and the runqueue lock will be held when
	checking the target task's state.

	Note that the ordering from the callbacks to the subsequent
	readers still holds even if the target task is in a
	read-side critical section.

3.	The target task is running, we IPI it, and it is not in a
	read-side critical section.  In this case, there is ordering
	from the smp_call_function_single() to the IPI handler
	on the one hand (and thus to later readers), and due to the
	smp_store_release() (from earlier readers) on the other hand.  The
	READ_ONCE(t->trc_ipi_to_cpu) pairs with the smp_store_release()
	in combination with the aforementioned smp_mb().

	Note that the ordering from the callbacks to the subsequent
	readers still holds even if the target task is in a
	read-side critical section.

4.	If the target task passes through a context switch while
	not in a read-side critical section, it also passes
	through rcu_tasks_trace_qs(), which has an smp_mb() and
	smp_store_release() that work not unlike the IPI case above.

	If the task is in a read-side critical section, nothing
	happens, and either this or some other case will apply
	at some later time.

5.	Otherwise, the task's ->trc_reader_special.b.need_qs is
	set, which will cause the rcu_read_unlock_trace() to
	invoke rcu_read_unlock_trace_special().  The chain
	of atomic_dec_and_test() calls will order all of these
	events, and that chain, when combined with
	wait_event_idle_exclusive_timeout()'s read of that same
	variable and its later smp_mb() will force the needed
	ordering with all prior readers.
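
The counting scheme behind case 5 can be sketched in userspace C11.
This is a minimal model, not the kernel implementation: the sketch_*
names are hypothetical, the increment performed when a reader is found
inside a critical section is an assumption about code outside the
quoted hunks, and sequentially consistent C11 atomics stand in for the
kernel's fully ordered atomic_dec_and_test().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for trc_n_readers_need_end. */
static atomic_int sketch_n_readers_need_end;

/* Start of a grace period: install the safety count of one. */
void sketch_pregp(void)
{
	atomic_store(&sketch_n_readers_need_end, 1);
}

/* Assumed step: the scan found a task inside a read-side critical section. */
void sketch_flag_reader(void)
{
	atomic_fetch_add(&sketch_n_readers_need_end, 1);
}

/*
 * Reader exit path.  Returns true if this reader dropped the count to
 * zero, which is the point at which the kernel would wake the
 * grace-period kthread.
 */
bool sketch_reader_unlock_special(void)
{
	int old = atomic_fetch_sub(&sketch_n_readers_need_end, 1);

	assert(old > 0);
	return old == 1;
}

/*
 * End of the scan: remove the safety count.  Returns true if no
 * flagged reader remains, so that the wait would complete at once.
 */
bool sketch_postgp_remove_safety(void)
{
	int old = atomic_fetch_sub(&sketch_n_readers_need_end, 1);

	assert(old > 0);
	return old == 1;
}

/* Current counter value, including the safety count if still present. */
int sketch_count(void)
{
	return atomic_load(&sketch_n_readers_need_end);
}
```

The initial count of one set by sketch_pregp() plays the role of the
safety count in rcu_tasks_trace_pregp_step(): a reader that finishes
quickly cannot drive the counter to zero before the grace-period side
has finished its scan.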

Hey, you asked!

Your turn.  Find the bug(s).

> > 2. Is it possible that a hold-out task is removed from the hold-out list and is
> > not waited on in the updater side, before the reader side got a chance to
> > indirectly execute such memory barriers?

The only purpose of the hold-out task list is to keep track of tasks
that still need one of the above options to be applied.  Once a task is
removed from that list one of the above options has happened or is in
flight, so there is no need for the hold-out task list to be involved
in any ordering.
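
As a rough illustration of that bookkeeping-only role, the sketch below
models a holdout list in userspace C.  It is a simplification under
stated assumptions: struct sketch_task and the sketch_* functions are
hypothetical, the kernel uses its own list primitives, and the retry of
trc_wait_for_one_reader() during the scan is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, pared-down stand-in for the per-task state. */
struct sketch_task {
	bool reader_checked;		/* Plays the role of ->trc_reader_checked. */
	struct sketch_task *next;	/* Holdout-list linkage. */
};

/* Add a task to the front of the holdout list. */
void sketch_add_holdout(struct sketch_task **head, struct sketch_task *t)
{
	assert(head != NULL && t != NULL);
	t->next = *head;
	*head = t;
}

/*
 * One scan of the holdout list: unlink every task whose check has
 * succeeded and return the number of tasks still held out.
 */
size_t sketch_scan_holdouts(struct sketch_task **head)
{
	size_t remaining = 0;
	struct sketch_task **pp = head;

	assert(head != NULL);
	while (*pp != NULL) {
		struct sketch_task *t = *pp;

		if (t->reader_checked) {
			*pp = t->next;	/* Unlink; pp now refers to the successor. */
			t->next = NULL;
		} else {
			remaining++;
			pp = &t->next;
		}
	}
	return remaining;
}
```

Note that the list carries no ordering of its own in this model either:
removal only records that some other mechanism has already dealt with
the task.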

> > 3. If a reader sees updates that were done before the grace period started, it
> > should not see any updates that happen after the grace period ends. Is that
> > guaranteed with this RCU-Trace?

Yes, as above.

> > If it's OK, it would be nice to mention more about memory ordering aspect in
> > the changelog.

I can certainly do that, but in the immortal words of MS-DOS, are you sure?

The problem is that RCU memory ordering isn't a single pairing or
even a chain.  It is instead a net weaving through all the readers
before the end of the grace period, all the readers after the
beginning of the grace period, the grace period itself, as well as
the updates both before and after the grace period.  See for example
Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg.  It would
look sort of like that, though perhaps a little bit less elaborate.

							Thanx, Paul



* Re: [PATCH RFC v2 tip/core/rcu 0/22] Prototype RCU usable from idle, exception, offline
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (21 preceding siblings ...)
  2020-03-19  0:11   ` [PATCH RFC v2 tip/core/rcu 22/22] rcu-tasks: Provide boot parameter to delay IPIs until late in grace period paulmck
@ 2020-03-19 11:31   ` Mathieu Desnoyers
  2020-03-19 13:13     ` Paul E. McKenney
       [not found]   ` <20200319104614.11444-1-hdanton@sina.com>
                     ` (2 subsequent siblings)
  25 siblings, 1 reply; 171+ messages in thread
From: Mathieu Desnoyers @ 2020-03-19 11:31 UTC (permalink / raw)
  To: paulmck
  Cc: rcu, linux-kernel, kernel-team, Ingo Molnar, Lai Jiangshan,
	dipankar, Andrew Morton, Josh Triplett, Thomas Gleixner,
	Peter Zijlstra, rostedt, David Howells, Eric Dumazet, fweisbec,
	Oleg Nesterov, Joel Fernandes, Google

----- On Mar 18, 2020, at 8:10 PM, paulmck paulmck@kernel.org wrote:

> Hello!

Hi Paul,

Thanks for pulling this together! Some comments below (based only on the
cover message),

[...]

> There are of course downsides.  The grace-period code can send IPIs to
> CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
> However, this version enlists the aid of the context-switch hooks,
> which eliminates the need for IPIs in context-switch-heavy workloads.
> It also prohibits sending of IPIs early in the grace period, which
> provides additional opportunity for the hooks to do their job.  Additional
> IPI-reduction mechanisms are under development.

I suspect that on nohz_full cpus, at least some use-cases which really care
about not receiving IPIs will not be doing that many context switches.

What are the possible approaches to have IPI-*elimination* for nohz cpus ?

> 
> The RCU tasks trace mechanism is based off of RCU tasks rather than
> SRCU because the latter is more complex and also because the latter
> uses a CPU-by-CPU approach to tracking quiescent states instead of the
> task-by-task approach that is needed.  It is in theory possible to
> mash RCU tasks trace into the Tree SRCU implementation, but there
> will need to be extremely good reasons for doing so.

I have a hard time buying the "less complexity" argument to justify the
introduction of yet another flavor of RCU when a close match already
exists (SRCU).

The other argument for this task-based RCU (rather than CPU-by-CPU as
done by SRCU) is that "a task-by-task approach is needed". What I
do not get from this explanation is why is such an approach needed ?

Also, another aspect worth discussing here is the use-cases which
need to be covered by tracing-rcu. Is this specific flavor targeting
specifically preempt-off use-cases, or is the goal here to target
use-cases which may trigger major page faults within the read-side
critical section as well ?

Note that doing task-by-task tracking of tracing-rcu rather than
cpu-by-cpu is not free: AFAIU it bloats the task struct (always)
for a use-case which is not always active. My experience with
tracepoints and asm gotos is that we need to be careful not to
slow down the common case (kernel running without any tracing
active, but tracing configured in) if we want to keep distributions
and end users building kernels with introspection facilities in
place.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH RFC v2 tip/core/rcu 03/22] rcutorture: Add flag to produce non-busy-wait task stalls
       [not found]   ` <20200319104614.11444-1-hdanton@sina.com>
@ 2020-03-19 12:38     ` Paul E. McKenney
       [not found]     ` <20200319133947.12172-1-hdanton@sina.com>
  1 sibling, 0 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-19 12:38 UTC (permalink / raw)
  To: Hillf Danton
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel

On Thu, Mar 19, 2020 at 06:46:14PM +0800, Hillf Danton wrote:
> 
> On Wed, 18 Mar 2020 17:10:41 -0700
> > From: "Paul E. McKenney" <paulmck@kernel.org>
> > 
> > This commit aids testing of RCU task stall warning messages by adding
> > an rcutorture.stall_cpu_block module parameter that results in the
> > induced stall sleeping within the RCU read-side critical section.
> > Spinning with interrupts disabled is still available via the
> > rcutorture.stall_cpu_irqsoff module parameter, and specifying neither
> > of these two module parameters will spin with preemption disabled.
> > 
> > Note that sleeping (as opposed to preemption) results in additional
> > complaints from RCU at context-switch time, so yet more testing.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > ---
> >  Documentation/admin-guide/kernel-parameters.txt |  5 +++++
> >  kernel/rcu/rcutorture.c                         | 15 +++++++++------
> >  2 files changed, 14 insertions(+), 6 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 6d16b78..17eff15 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4161,6 +4161,11 @@
> >  			Duration of CPU stall (s) to test RCU CPU stall
> >  			warnings, zero to disable.
> >  
> > +	rcutorture.stall_cpu_block= [KNL]
> > +			Sleep while stalling if set.  This will result
> > +			in warnings from preemptible RCU in addition
> > +			to any other stall-related activity.
> > +
> >  	rcutorture.stall_cpu_holdoff= [KNL]
> >  			Time to wait (s) after boot before inducing stall.
> >  
> > diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> > index b3301f3..ada5b91 100644
> > --- a/kernel/rcu/rcutorture.c
> > +++ b/kernel/rcu/rcutorture.c
> > @@ -102,6 +102,7 @@ torture_param(int, stall_cpu, 0, "Stall duration (s), zero to disable.");
> >  torture_param(int, stall_cpu_holdoff, 10,
> >  	     "Time to wait before starting stall (s).");
> >  torture_param(int, stall_cpu_irqsoff, 0, "Disable interrupts while stalling.");
> > +torture_param(int, stall_cpu_block, 0, "Sleep while stalling.");
> >  torture_param(int, stat_interval, 60,
> >  	     "Number of seconds between stats printk()s");
> >  torture_param(int, stutter, 5, "Number of seconds to run/halt test");
> > @@ -1599,6 +1600,7 @@ static int rcutorture_booster_init(unsigned int cpu)
> >   */
> >  static int rcu_torture_stall(void *args)
> >  {
> > +	int idx;
> >  	unsigned long stop_at;
> >  
> >  	VERBOSE_TOROUT_STRING("rcu_torture_stall task started");
> > @@ -1610,21 +1612,22 @@ static int rcu_torture_stall(void *args)
> >  	if (!kthread_should_stop()) {
> >  		stop_at = ktime_get_seconds() + stall_cpu;
> >  		/* RCU CPU stall is expected behavior in following code. */
> > -		rcu_read_lock();
> > +		idx = cur_ops->readlock();
> >  		if (stall_cpu_irqsoff)
> >  			local_irq_disable();
> > -		else
> > +		else if (!stall_cpu_block)
> >  			preempt_disable();
> >  		pr_alert("rcu_torture_stall start on CPU %d.\n",
> > -			 smp_processor_id());
> > +			 raw_smp_processor_id());
> >  		while (ULONG_CMP_LT((unsigned long)ktime_get_seconds(),
> >  				    stop_at))
> > -			continue;  /* Induce RCU CPU stall warning. */
> > +			if (stall_cpu_block)
> > +				schedule_timeout_uninterruptible(HZ);
> 
> Why is the scheduled-in task so special that it will be running on
> the current CPU with irq disabled?

You lost me on this one.  IRQs are not at all disabled.

Oh, you mean the _uninterruptible suffix?

That only affects accounting during the sleep.  Since this is rcutorture,
the exact suffix is not all that relevant.  So the current state is that
rcutorture tends to use _uninterruptible.

Or am I missing your point?

							Thanx, Paul

> >  		if (stall_cpu_irqsoff)
> >  			local_irq_enable();
> > -		else
> > +		else if (!stall_cpu_block)
> >  			preempt_enable();
> > -		rcu_read_unlock();
> > +		cur_ops->readunlock(idx);
> >  		pr_alert("rcu_torture_stall end.\n");
> >  	}
> >  	torture_shutdown_absorb("rcu_torture_stall");
> > -- 
> > 2.9.5
> 


* Re: [PATCH RFC v2 tip/core/rcu 0/22] Prototype RCU usable from idle, exception, offline
  2020-03-19 11:31   ` [PATCH RFC v2 tip/core/rcu 0/22] Prototype RCU usable from idle, exception, offline Mathieu Desnoyers
@ 2020-03-19 13:13     ` Paul E. McKenney
  0 siblings, 0 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-19 13:13 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: rcu, linux-kernel, kernel-team, Ingo Molnar, Lai Jiangshan,
	dipankar, Andrew Morton, Josh Triplett, Thomas Gleixner,
	Peter Zijlstra, rostedt, David Howells, Eric Dumazet, fweisbec,
	Oleg Nesterov, Joel Fernandes, Google

On Thu, Mar 19, 2020 at 07:31:38AM -0400, Mathieu Desnoyers wrote:
> ----- On Mar 18, 2020, at 8:10 PM, paulmck paulmck@kernel.org wrote:
> 
> > Hello!
> 
> Hi Paul,
> 
> Thanks for pulling this together! Some comments below (based only on the
> cover message),

And thank you for your review and comments!

> [...]
> 
> > There are of course downsides.  The grace-period code can send IPIs to
> > CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
> > However, this version enlists the aid of the context-switch hooks,
> > which eliminates the need for IPIs in context-switch-heavy workloads.
> > It also prohibits sending of IPIs early in the grace period, which
> > provides additional opportunity for the hooks to do their job.  Additional
> > IPI-reduction mechanisms are under development.
> 
> I suspect that on nohz_full cpus, at least some use-cases which really care
> about not receiving IPIs will not be doing that many context switches.
> 
> What are the possible approaches to have IPI-*elimination* for nohz cpus ?

Pretty much as you suggested, actually.  I have some other approaches
that should eliminate read-side overhead, and thus might prove necessary
longer term.  However, from what I can see your suggestion is good and
sufficient, and perhaps indefinitely.  So thank you for that!

In more detail:

Add a per-task flag that tells readers to use smp_mb().  In kernels
built for these workloads (CONFIG_TASKS_TRACE_RCU_READ_MB=y), set
this flag on entry to usermode/idle (that is, the non-offline RCU
extended quiescent states) and clear it upon exit.  The pre-existing
dyntick-counter increment provides the necessary memory ordering.
Then if the update side sees that the task is running on a CPU in a
non-offline extended quiescent state (which just happens to be what the
dynticks counter already indicates), it carries out the checks knowing
that the reader is using memory barriers.
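A minimal userspace sketch of the per-task flag idea described above, using C11 atomics in place of kernel primitives.  Every name here (struct task_model, need_mb, model_read_lock(), and so on) is hypothetical and does not come from the patch series, and atomic_thread_fence() merely stands in for smp_mb():

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-task state; field names are illustrative only. */
struct task_model {
	atomic_int  nesting;	/* read-side nesting depth */
	atomic_bool need_mb;	/* set on entry to usermode/idle, cleared on exit */
	int         mb_count;	/* fences executed, for demonstration only */
};

/* Reader entry: pay for a full fence only when the flag asks for one. */
static void model_read_lock(struct task_model *t)
{
	atomic_fetch_add_explicit(&t->nesting, 1, memory_order_relaxed);
	if (atomic_load_explicit(&t->need_mb, memory_order_relaxed)) {
		atomic_thread_fence(memory_order_seq_cst); /* stands in for smp_mb() */
		t->mb_count++;
	}
}

/* Reader exit: same conditional fence, then drop the nesting count. */
static void model_read_unlock(struct task_model *t)
{
	if (atomic_load_explicit(&t->need_mb, memory_order_relaxed)) {
		atomic_thread_fence(memory_order_seq_cst); /* stands in for smp_mb() */
		t->mb_count++;
	}
	atomic_fetch_sub_explicit(&t->nesting, 1, memory_order_relaxed);
}

/*
 * Update side: if the CPU is in an extended quiescent state and the
 * reader is known to be using fences, the updater can inspect the
 * task's state directly instead of sending an IPI.
 */
static bool model_can_skip_ipi(struct task_model *t, bool cpu_in_eqs)
{
	return cpu_in_eqs &&
	       atomic_load_explicit(&t->need_mb, memory_order_acquire);
}
```

With the flag clear, readers execute no fences at all, which is the common case that the scheme tries to keep cheap.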

The initial state sets CONFIG_TASKS_TRACE_RCU_READ_MB=y when either
CONFIG_PREEMPT_RT=y or when CONFIG_NR_CPUS < 8.  The rationale for
the former is that HPC NO_HZ_FULL workloads probably don't care
all that much about a stray IPI as long as it happens infrequently.
500ms should qualify as "infrequently".  The rationale for the latter
was that I couldn't get any better heuristic than number of CPUs from
my battery-powered contacts.  Yes, 8 is a bit low, especially given
that my own smartphone has 8 CPUs, but I have to start somewhere.
Another option is for battery-powered devices to just "select
CONFIG_TASKS_TRACE_RCU_READ_MB" in their defconfig files.

This work has started in -rcu, but was not to the point where I felt
comfortable sending it in yesterday's series.  And yes, I will add it
to the cover letter, hopefully on the next version of this patch set.

> > The RCU tasks trace mechanism is based off of RCU tasks rather than
> > SRCU because the latter is more complex and also because the latter
> > uses a CPU-by-CPU approach to tracking quiescent states instead of the
> > task-by-task approach that is needed.  It is in theory possible to
> > mash RCU tasks trace into the Tree SRCU implementation, but there
> > will need to be extremely good reasons for doing so.
> 
> I have a hard time buying the "less complexity" argument to justify the
> introduction of yet another flavor of RCU when a close match already
> exists (SRCU).

Tree SRCU is not the simplest thing out there.  And please see below.

> The other argument for this task-based RCU (rather than CPU-by-CPU as
> done by SRCU) is that "a task-by-task approach is needed". What I
> do not get from this explanation is why is such an approach needed ?

Because SRCU's accounting only knows the number of things that are
preventing the current grace period from ending.  In this, it differs
from userspace RCU, which knows exactly which threads are preventing the
current grace period from ending.  In contrast, SRCU has absolutely no
idea which task or CPU is preventing the grace period from ending.
SRCU is therefore incapable of locking down those tasks to encourage
them to report the next quiescent state.
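To illustrate the difference, here is a toy userspace model, not SRCU's or Tasks RCU's actual data structures.  The names count_domain, task_domain, and MODEL_NR_TASKS are made up for this sketch.  The counter-style domain can report only how many readers remain, while the per-task domain can name them:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_TASKS 4	/* arbitrary size for this sketch */

/* Counter-style accounting: knows how many readers, not which ones. */
struct count_domain {
	int nr_readers;
};

/* Per-task accounting: each task records its own reader state. */
struct task_domain {
	bool in_reader[MODEL_NR_TASKS];
};

/* All the counter-style domain can say is "this many are blocking". */
static int count_domain_blockers(const struct count_domain *d)
{
	return d->nr_readers;
}

/* Store the IDs of tasks still in a reader; return how many were found. */
static size_t task_domain_holdouts(const struct task_domain *d,
				   int *ids, size_t max)
{
	size_t n = 0;

	for (int i = 0; i < MODEL_NR_TASKS && n < max; i++)
		if (d->in_reader[i])
			ids[n++] = i;
	return n;
}
```

Only the second form gives the update side a list of specific tasks that it could then lock down or otherwise prod.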

Yes, SRCU could be modified.  Maybe someday that will prove a good
idea.  Today is not that day, nor is that day coming soon.

> Also, another aspect worth discussing here is the use-cases which
> need to be covered by tracing-rcu. Is this specific flavor targeting
> specifically preempt-off use-cases, or is the goal here to target
> use-cases which may trigger major page faults within the read-side
> critical section as well ?

Yes, CONFIG_PREEMPT=n use cases are still extremely important.  Don't get
me wrong, real-time computing does have a warm place in my heart, but
there is still a very large number of CONFIG_PREEMPT=n systems running
out there.  And yes, the ability to handle the occasional page fault is
also important.  So both simultaneously, not just one or the other.

> Note that doing task-by-task tracking of tracing-rcu rather than
> cpu-by-cpu is not free: AFAIU it bloats the task struct (always)
> for a use-case which is not always active. My experience with
> tracepoints and asm gotos is that we need to be careful not to
> slow down the common case (kernel running without any tracing
> active, but tracing configured in) if we want to keep distributions
> and end users building kernels with introspection facilities in
> place.

Which is indeed another reason for not pressing SRCU into service here.
There would have to be a "special" srcu_struct that owned a piece of
the task_struct structure on the one hand, or there would need to be
another set of allocations done at the creation of each task for each
such srcu_struct on the other.  Neither sounds at all attractive.

Please note also that Tasks RCU, which is already used by many forms of
tracing, already added a similar amount to the task_struct structure quite
some time ago.  But you are right that I should add an item to my todo
list to squeeze the state down a bit, and for both variants while I am
at it.  For but one example, I could automatically use a short for the
CPU numbers when CONFIG_NR_CPUS is less than 32,768.  Perhaps a similar
optimization could be applied elsewhere in the task_struct structure.
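A rough sketch of what that size selection might look like, as standalone C rather than actual kernel code.  MODEL_NR_CPUS and model_cpu_t are hypothetical stand-ins for CONFIG_NR_CPUS and whatever type the task_struct fields would end up using:

```c
#include <assert.h>
#include <limits.h>

#ifndef MODEL_NR_CPUS
#define MODEL_NR_CPUS 64	/* stand-in for CONFIG_NR_CPUS */
#endif

/* Use a short for CPU numbers when every valid CPU number fits in one. */
#if MODEL_NR_CPUS < 32768
typedef short model_cpu_t;
#define MODEL_CPU_MAX SHRT_MAX
#else
typedef int model_cpu_t;
#define MODEL_CPU_MAX INT_MAX
#endif

/* The largest CPU number is MODEL_NR_CPUS - 1, and -1 can still mean "none". */
_Static_assert(MODEL_NR_CPUS - 1 <= MODEL_CPU_MAX,
	       "model_cpu_t too small for MODEL_NR_CPUS");

/* Hypothetical per-task fields holding CPU numbers. */
struct model_task_cpus {
	model_cpu_t ipi_to_cpu;
	model_cpu_t last_cpu;
};
```

Whether this actually shrinks task_struct depends on the neighboring fields and their alignment, so the savings would need to be measured rather than assumed.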

							Thanx, Paul


* Re: [PATCH RFC v2 tip/core/rcu 03/22] rcutorture: Add flag to produce non-busy-wait task stalls
       [not found]     ` <20200319133947.12172-1-hdanton@sina.com>
@ 2020-03-19 15:22       ` Paul E. McKenney
       [not found]       ` <20200320040329.9840-1-hdanton@sina.com>
  1 sibling, 0 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-19 15:22 UTC (permalink / raw)
  To: Hillf Danton
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel

On Thu, Mar 19, 2020 at 09:39:47PM +0800, Hillf Danton wrote:
> 
> On Thu, 19 Mar 2020 05:38:12 -0700 "Paul E. McKenney" wrote:
> > 
> > On Thu, Mar 19, 2020 at 06:46:14PM +0800, Hillf Danton wrote:
> > > 
> > > On Wed, 18 Mar 2020 17:10:41 -0700
> > > >  static int rcu_torture_stall(void *args)
> > > >  {
> > > > +	int idx;
> > > >  	unsigned long stop_at;
> > > >  
> > > >  	VERBOSE_TOROUT_STRING("rcu_torture_stall task started");
> > > > @@ -1610,21 +1612,22 @@ static int rcu_torture_stall(void *args)
> > > >  	if (!kthread_should_stop()) {
> > > >  		stop_at = ktime_get_seconds() + stall_cpu;
> > > >  		/* RCU CPU stall is expected behavior in following code. */
> > > > -		rcu_read_lock();
> > > > +		idx = cur_ops->readlock();
> > > >  		if (stall_cpu_irqsoff)
> > > >  			local_irq_disable();
> > > > -		else
> > > > +		else if (!stall_cpu_block)
> > > >  			preempt_disable();
> > > >  		pr_alert("rcu_torture_stall start on CPU %d.\n",
> > > > -			 smp_processor_id());
> > > > +			 raw_smp_processor_id());
> > > >  		while (ULONG_CMP_LT((unsigned long)ktime_get_seconds(),
> > > >  				    stop_at))
> > > > -			continue;  /* Induce RCU CPU stall warning. */
> > > > +			if (stall_cpu_block)
> > > > +				schedule_timeout_uninterruptible(HZ);
> > > 
> > > Why is the scheduled-in task so special that it will be running on
> > > the current CPU with irq disabled?
> > 
> > You lost me on this one.
> 
> Quite likely :)
> 
> > IRQs are not at all disabled.
> > 
> > > >  		if (stall_cpu_irqsoff)
> > > >  			local_irq_enable();
> 
> Local IRQs get enabled here depending on stall_cpu_irqsoff.
> 
> What I was asking is the scheduling case like
> 
> 	local_irq_disable();
> 	schedule_timeout(HZ);
> 	local_irq_enable();
> 
> Is it likely going to be ruled out in this patch?

If an rcutorture run specified both the rcutorture.stall_cpu_irqsoff and
the rcutorture.stall_cpu_block module parameters, then yes, exactly the
sequence you call out should occur.  Can't say that I have tried this,
though.  Nor would I expect to have ever done so without your suggesting
that I do.

But why not try it on current -rcu?

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 12 --duration 3 --configs "TRACE01" --bootargs "rcutorture.stall_cpu=25 rcutorture.stall_cpu_holdoff=30 rcutorture.stall_cpu_block=1 rcupdate.rcu_task_stall_timeout=10000 rcutorture.stall_cpu_irqsoff"

This tells rcutorture to use all 12 hardware threads, to run the kernel for
three minutes, to run only the TRACE01 rcutorture scenario, and to test
RCU CPU stall warnings:

rcutorture.stall_cpu=25: Stall the CPU for 25 seconds.

rcutorture.stall_cpu_holdoff=30: Wait 30 seconds after boot to start stalling.

rcutorture.stall_cpu_block=1: Do the schedule_timeout_uninterruptible()
	while stalling.

rcupdate.rcu_task_stall_timeout=10000: Set the stall-warning timeout
	to 10,000 jiffies, or ten seconds.

rcutorture.stall_cpu_irqsoff: This tells rcutorture to execute the
	local_irq_disable() that you called out above.

And this results in a couple of stall warning messages, as expected
given that you get two ten-second intervals in a 25-second interval.
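The arithmetic behind that expectation can be sketched as follows.  This is a simplification that assumes HZ=1000 (which is what makes 10,000 jiffies equal ten seconds) and assumes one warning per full timeout interval; the helper names are made up for illustration, and the kernel's real stall-warning timing has more moving parts:

```c
#include <assert.h>

/* Assumption: HZ=1000, so that 10,000 jiffies is ten seconds. */
#define MODEL_HZ 1000UL

/* Convert jiffies to whole seconds under the assumed HZ. */
static unsigned long model_jiffies_to_secs(unsigned long j)
{
	return j / MODEL_HZ;
}

/* Number of complete timeout intervals that fit in the stall. */
static unsigned long model_expected_warnings(unsigned long stall_secs,
					     unsigned long timeout_jiffies)
{
	unsigned long timeout_secs = model_jiffies_to_secs(timeout_jiffies);

	if (!timeout_secs)
		return 0;	/* sub-second timeouts are outside this sketch */
	return stall_secs / timeout_secs;
}
```

For rcutorture.stall_cpu=25 and rcupdate.rcu_task_stall_timeout=10000, model_expected_warnings(25, 10000) evaluates to 25 / 10 = 2.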

No other complaints, though.  And of course what happens is that
__schedule() enables interrupts on first call (as it must do) and they
remain enabled past that point.  Then local_irq_enable() redundantly
enables them.

> Or is it anything by design?

So no, not by design.  I don't see any reason to change it.  After all,
if you are running rcutorture and also asking rcutorture to make CPU
stall warnings, you have to expect a bit of noise from the kernel.
Testing that noise is after all the whole point.  ;-)

							Thanx, Paul

> > > > -		else
> > > > +		else if (!stall_cpu_block)
> > > >  			preempt_enable();
> > > > -		rcu_read_unlock();
> > > > +		cur_ops->readunlock(idx);
> > > >  		pr_alert("rcu_torture_stall end.\n");
> > > >  	}
> > > >  	torture_shutdown_absorb("rcu_torture_stall");
> > > > -- 
> > > > 2.9.5
> 


* Re: [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task paulmck
@ 2020-03-19 17:22     ` Steven Rostedt
  2020-03-19 17:35       ` Paul E. McKenney
  0 siblings, 1 reply; 171+ messages in thread
From: Steven Rostedt @ 2020-03-19 17:22 UTC (permalink / raw)
  To: paulmck
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	fweisbec, oleg, joel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman

On Wed, 18 Mar 2020 17:10:39 -0700
paulmck@kernel.org wrote:

> From: "Paul E. McKenney" <paulmck@kernel.org>
> 
> A running task's state can be sampled in a consistent manner (for example,
> for diagnostic purposes) simply by invoking smp_call_function_single()
> on its CPU, which may be obtained using task_cpu(), then having the
> IPI handler verify that the desired task is in fact still running.
> However, if the task is not running, this sampling can in theory be done
> immediately and directly.  In practice, the task might start running at
> any time, including during the sampling period.  Gaining a consistent
> sample of a not-running task therefore requires that something be done
> to lock down the target task's state.
> 
> This commit therefore adds a try_invoke_on_locked_down_task() function
> that invokes a specified function if the specified task can be locked
> down, returning true if successful and if the specified function returns
> true.  Otherwise this function simply returns false.  Given that the
> function passed to try_invoke_on_nonrunning_task() might be invoked with
> a runqueue lock held, that function had better be quite lightweight.
> 
> The function is passed the target task's task_struct pointer and the
> argument passed to try_invoke_on_locked_down_task(), allowing easy access
> to task state and to a location for further variables to be passed in
> and out.
> 
> Note that the specified function will be called even if the specified
> task is currently running.  The function can use ->on_rq and task_curr()
> to quickly and easily determine the task's state, and can return false
> if this state is not to the function's liking.  The caller of teh

  s/teh/the/

> try_invoke_on_locked_down_task() would then see the false return value,
> and could take appropriate action, for example, trying again later or
> sending an IPI if matters are more urgent.
> 
> It is expected that use cases such as the RCU CPU stall warning code will
> simply return false if the task is currently running.  However, there are
> use cases involving nohz_full CPUs where the specified function might
> instead fall back to an alternative sampling scheme that relies on heavier
> synchronization (such as memory barriers) in the target task.
> 
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> [ paulmck: Apply feedback from Peter Zijlstra and Steven Rostedt. ]
> [ paulmck: Invoke if running to handle feedback from Mathieu Desnoyers. ]
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/wait.h |  2 ++
>  kernel/sched/core.c  | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index 3283c8d..e2bb8ed 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -1148,4 +1148,6 @@ int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, i
>  		(wait)->flags = 0;						\
>  	} while (0)
>  
> +bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg);
> +
>  #endif /* _LINUX_WAIT_H */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fc1dfc0..195eba0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2580,6 +2580,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	 *
>  	 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
>  	 * __schedule().  See the comment for smp_mb__after_spinlock().
> +	 *
> +	 * A similar smb_rmb() lives in try_invoke_on_locked_down_task().
>  	 */
>  	smp_rmb();
>  	if (p->on_rq && ttwu_remote(p, wake_flags))
> @@ -2654,6 +2656,52 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  }
>  
>  /**
> + * try_invoke_on_locked_down_task - Invoke a function on task in fixed state
> + * @p: Process for which the function is to be invoked.
> + * @func: Function to invoke.
> + * @arg: Argument to function.
> + *
> + * If the specified task can be quickly locked into a definite state
> + * (either sleeping or on a given runqueue), arrange to keep it in that
> + * state while invoking @func(@arg).  This function can use ->on_rq and
> + * task_curr() to work out what the state is, if required.  Given that
> + * @func can be invoked with a runqueue lock held, it had better be quite
> + * lightweight.
> + *
> + * Returns:
> + *	@false if the task slipped out from under the locks.
> + *	@true if the task was locked onto a runqueue or is sleeping.
> + *		However, @func can override this by returning @false.

Should probably state that it will return false if the state could not be
locked, otherwise it returns the return code of the function.

I'm wondering if we shouldn't have the function return code be something
passed in by the parameter, and have this return either true (locked and
function called), or false (not locked and function wasn't called).
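One possible reading of that alternative, as a userspace toy rather than scheduler code.  The names model_task, model_try_invoke(), and model_check_not_running() are invented for this sketch, and the "lockable" field just simulates whether the task could be locked down:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy task: "lockable" simulates whether the lock-down attempt succeeds. */
struct model_task {
	bool lockable;
	bool running;
};

/* The callback reports its verdict through *result, not a return value. */
typedef void (*model_func_t)(struct model_task *t, void *arg, bool *result);

/*
 * Returns true if the task was locked down and func was called,
 * false if it was not locked and func was not called.  In the false
 * case *result is left untouched.
 */
static bool model_try_invoke(struct model_task *t, model_func_t func,
			     void *arg, bool *result)
{
	if (!t->lockable)
		return false;
	func(t, arg, result);
	return true;
}

/* Example callback: declines to inspect a task that is running. */
static void model_check_not_running(struct model_task *t, void *arg,
				    bool *result)
{
	(void)arg;
	*result = !t->running;
}
```

This lets the caller tell "could not lock" apart from "locked, but the function declined", which the single bool return in the patch folds together.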


> + */
> +bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg)
> +{
> +	bool ret = false;
> +	struct rq_flags rf;
> +	struct rq *rq;
> +
> +	lockdep_assert_irqs_enabled();
> +	raw_spin_lock_irq(&p->pi_lock);
> +	if (p->on_rq) {
> +		rq = __task_rq_lock(p, &rf);
> +		if (task_rq(p) == rq)
> +			ret = func(p, arg);
> +		rq_unlock(rq, &rf);
> +	} else {
> +		switch (p->state) {
> +		case TASK_RUNNING:
> +		case TASK_WAKING:
> +			break;
> +		default:

Don't we need a comment here about why we have a rmb() and where the
matching wmb() is?

-- Steve

> +			smp_rmb();
> +			if (!p->on_rq)
> +				ret = func(p, arg);
> +		}
> +	}
> +	raw_spin_unlock_irq(&p->pi_lock);
> +	return ret;
> +}
> +
> +/**
>   * wake_up_process - Wake up a specific process
>   * @p: The process to be woken up.
>   *



* Re: [PATCH RFC v2 tip/core/rcu 02/22] rcu: Add per-task state to RCU CPU stall warnings
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 02/22] rcu: Add per-task state to RCU CPU stall warnings paulmck
@ 2020-03-19 17:27     ` Steven Rostedt
  2020-03-19 17:41       ` Paul E. McKenney
  0 siblings, 1 reply; 171+ messages in thread
From: Steven Rostedt @ 2020-03-19 17:27 UTC (permalink / raw)
  To: paulmck
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	fweisbec, oleg, joel

On Wed, 18 Mar 2020 17:10:40 -0700
paulmck@kernel.org wrote:

> From: "Paul E. McKenney" <paulmck@kernel.org>
> 
> Currently, an RCU-preempt CPU stall warning simply lists the PIDs of
> those tasks holding up the current grace period.  This can be helpful,
> but more can be even more helpful.
> 
> To this end, this commit adds the nesting level, whether the task
> things it was preempted in its current RCU read-side critical section,

s/things/thinks/

> whether RCU core has asked this task for a quiescent state, whether the
> expedited-grace-period hint is set, and whether the task believes that
> it is on the blocked-tasks list (it must be, or it would not be printed,
> but if things are broken, best not to take too much for granted).
> 
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> ---
>  kernel/rcu/tree_stall.h | 38 ++++++++++++++++++++++++++++++++++++--
>  1 file changed, 36 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> index 502b4dd..e19487d 100644
> --- a/kernel/rcu/tree_stall.h
> +++ b/kernel/rcu/tree_stall.h
> @@ -192,14 +192,40 @@ static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp)
>  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  }
>  
> +// Communicate task state back to the RCU CPU stall warning request.
> +struct rcu_stall_chk_rdr {
> +	int nesting;
> +	union rcu_special rs;
> +	bool on_blkd_list;
> +};
> +
> +/*
> + * Report out the state of a not-running task that is stalling the
> + * current RCU grace period.
> + */
> +static bool check_slow_task(struct task_struct *t, void *arg)
> +{
> +	struct rcu_node *rnp;
> +	struct rcu_stall_chk_rdr *rscrp = arg;
> +
> +	if (task_curr(t))
> +		return false; // It is running, so decline to inspect it.

Since it can be locked on_rq(), should we report that too?

-- Steve

> +	rscrp->nesting = t->rcu_read_lock_nesting;
> +	rscrp->rs = t->rcu_read_unlock_special;
> +	rnp = t->rcu_blocked_node;
> +	rscrp->on_blkd_list = !list_empty(&t->rcu_node_entry);
> +	return true;
> +}
> +
>  /*
>   * Scan the current list of tasks blocked within RCU read-side critical
>   * sections, printing out the tid of each.
>   */
>  static int rcu_print_task_stall(struct rcu_node *rnp)
>  {
> -	struct task_struct *t;
>  	int ndetected = 0;
> +	struct rcu_stall_chk_rdr rscr;
> +	struct task_struct *t;
>  
>  	if (!rcu_preempt_blocked_readers_cgp(rnp))
>  		return 0;
> @@ -208,7 +234,15 @@ static int rcu_print_task_stall(struct rcu_node *rnp)
>  	t = list_entry(rnp->gp_tasks->prev,
>  		       struct task_struct, rcu_node_entry);
>  	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
> -		pr_cont(" P%d", t->pid);
> +		if (!try_invoke_on_locked_down_task(t, check_slow_task, &rscr))
> +			pr_cont(" P%d", t->pid);
> +		else
> +			pr_cont(" P%d/%d:%c%c%c%c",
> +				t->pid, rscr.nesting,
> +				".b"[rscr.rs.b.blocked],
> +				".q"[rscr.rs.b.need_qs],
> +				".e"[rscr.rs.b.exp_hint],
> +				".l"[rscr.on_blkd_list]);
>  		ndetected++;
>  	}
>  	pr_cont("\n");



* Re: [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task
  2020-03-19 17:22     ` Steven Rostedt
@ 2020-03-19 17:35       ` Paul E. McKenney
  2020-03-20  2:49         ` Paul E. McKenney
  0 siblings, 1 reply; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-19 17:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	fweisbec, oleg, joel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman

On Thu, Mar 19, 2020 at 01:22:38PM -0400, Steven Rostedt wrote:
> On Wed, 18 Mar 2020 17:10:39 -0700
> paulmck@kernel.org wrote:
> 
> > From: "Paul E. McKenney" <paulmck@kernel.org>
> > 
> > A running task's state can be sampled in a consistent manner (for example,
> > for diagnostic purposes) simply by invoking smp_call_function_single()
> > on its CPU, which may be obtained using task_cpu(), then having the
> > IPI handler verify that the desired task is in fact still running.
> > However, if the task is not running, this sampling can in theory be done
> > immediately and directly.  In practice, the task might start running at
> > any time, including during the sampling period.  Gaining a consistent
> > sample of a not-running task therefore requires that something be done
> > to lock down the target task's state.
> > 
> > This commit therefore adds a try_invoke_on_locked_down_task() function
> > that invokes a specified function if the specified task can be locked
> > down, returning true if successful and if the specified function returns
> > true.  Otherwise this function simply returns false.  Given that the
> > function passed to try_invoke_on_nonrunning_task() might be invoked with
> > a runqueue lock held, that function had better be quite lightweight.
> > 
> > The function is passed the target task's task_struct pointer and the
> > argument passed to try_invoke_on_locked_down_task(), allowing easy access
> > to task state and to a location for further variables to be passed in
> > and out.
> > 
> > Note that the specified function will be called even if the specified
> > task is currently running.  The function can use ->on_rq and task_curr()
> > to quickly and easily determine the task's state, and can return false
> > if this state is not to the function's liking.  The caller of teh
> 
>   s/teh/the/

Good eyes, will fix!

> > try_invoke_on_locked_down_task() would then see the false return value,
> > and could take appropriate action, for example, trying again later or
> > sending an IPI if matters are more urgent.
> > 
> > It is expected that use cases such as the RCU CPU stall warning code will
> > simply return false if the task is currently running.  However, there are
> > use cases involving nohz_full CPUs where the specified function might
> > instead fall back to an alternative sampling scheme that relies on heavier
> > synchronization (such as memory barriers) in the target task.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > [ paulmck: Apply feedback from Peter Zijlstra and Steven Rostedt. ]
> > [ paulmck: Invoke if running to handle feedback from Mathieu Desnoyers. ]
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > ---
> >  include/linux/wait.h |  2 ++
> >  kernel/sched/core.c  | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 50 insertions(+)
> > 
> > diff --git a/include/linux/wait.h b/include/linux/wait.h
> > index 3283c8d..e2bb8ed 100644
> > --- a/include/linux/wait.h
> > +++ b/include/linux/wait.h
> > @@ -1148,4 +1148,6 @@ int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, i
> >  		(wait)->flags = 0;						\
> >  	} while (0)
> >  
> > +bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg);
> > +
> >  #endif /* _LINUX_WAIT_H */
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index fc1dfc0..195eba0 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2580,6 +2580,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >  	 *
> >  	 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
> >  	 * __schedule().  See the comment for smp_mb__after_spinlock().
> > +	 *
> > +	 * A similar smp_rmb() lives in try_invoke_on_locked_down_task().
> >  	 */
> >  	smp_rmb();
> >  	if (p->on_rq && ttwu_remote(p, wake_flags))
> > @@ -2654,6 +2656,52 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >  }
> >  
> >  /**
> > + * try_invoke_on_locked_down_task - Invoke a function on task in fixed state
> > + * @p: Process for which the function is to be invoked.
> > + * @func: Function to invoke.
> > + * @arg: Argument to function.
> > + *
> > + * If the specified task can be quickly locked into a definite state
> > + * (either sleeping or on a given runqueue), arrange to keep it in that
> > + * state while invoking @func(@arg).  This function can use ->on_rq and
> > + * task_curr() to work out what the state is, if required.  Given that
> > + * @func can be invoked with a runqueue lock held, it had better be quite
> > + * lightweight.
> > + *
> > + * Returns:
> > + *	@false if the task slipped out from under the locks.
> > + *	@true if the task was locked onto a runqueue or is sleeping.
> > + *		However, @func can override this by returning @false.
> 
> Should probably state that it will return false if the state could not be
> locked, otherwise it returns the return code of the function.

So like this?

 * Returns:
 * @false if the task state could not be locked.
 * Otherwise, the return value from @func(arg).

> I'm wondering if we shouldn't have the function return code be something
> passed in by the parameter, and have this return either true (locked and
> function called), or false (not locked and function wasn't called).

I was thinking of this as one of the possible uses of whatever arg
points to, which allows the caller of try_invoke_on_locked_down_task()
and the specified function to communicate whatever they wish.  Then
the specified function could (for example) unconditionally return true
so that the return value from try_invoke_on_locked_down_task() indicated
whether or not the specified function was called.

The current setup is very convenient for the use cases thus far.  It
allows the function to say "Yeah, I was called, but I couldn't do
anything", thus allowing the caller to make exactly one check to know
that corrective action is required.
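
The return-value convention described above can be modeled in plain
userspace C.  This is only a sketch: toy_task, toy_sample,
toy_try_invoke(), and toy_check() are hypothetical stand-ins invented
for illustration, not kernel code, and the "lockable" flag merely
simulates whether the task's state could be locked down.

```c
#include <stdbool.h>

/* Hypothetical stand-in for task_struct; not kernel code. */
struct toy_task {
	bool lockable;	/* Simulates "the task state could be locked". */
	bool running;	/* Simulates task_curr(). */
	int nesting;	/* Some private state worth sampling. */
};

/* Hypothetical block that caller and callback share through arg. */
struct toy_sample {
	bool called;	/* Set whenever the callback ran at all. */
	int nesting;	/* Meaningful only if the callback returned true. */
};

/*
 * Toy analog of try_invoke_on_locked_down_task(): false if the state
 * could not be locked, otherwise whatever func() returns.
 */
static bool toy_try_invoke(struct toy_task *t,
			   bool (*func)(struct toy_task *t, void *arg),
			   void *arg)
{
	if (!t->lockable)
		return false;
	return func(t, arg);
}

/* Callback that declines to sample a running task. */
static bool toy_check(struct toy_task *t, void *arg)
{
	struct toy_sample *s = arg;

	s->called = true;
	if (t->running)
		return false;	/* "I was called, but I couldn't do anything." */
	s->nesting = t->nesting;
	return true;
}
```

With this shape, the caller makes exactly one check of the return value
to learn that corrective action is needed, and can still consult
->called through arg if it cares why.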

> > + */
> > +bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg)
> > +{
> > +	bool ret = false;
> > +	struct rq_flags rf;
> > +	struct rq *rq;
> > +
> > +	lockdep_assert_irqs_enabled();
> > +	raw_spin_lock_irq(&p->pi_lock);
> > +	if (p->on_rq) {
> > +		rq = __task_rq_lock(p, &rf);
> > +		if (task_rq(p) == rq)
> > +			ret = func(p, arg);
> > +		rq_unlock(rq, &rf);
> > +	} else {
> > +		switch (p->state) {
> > +		case TASK_RUNNING:
> > +		case TASK_WAKING:
> > +			break;
> > +		default:
> 
> Don't we need a comment here about why we have a rmb() and where the
> matching wmb() is?

Indeed we do!  I will fix that as well.

							Thanx, Paul

> -- Steve
> 
> > +			smp_rmb();
> > +			if (!p->on_rq)
> > +				ret = func(p, arg);
> > +		}
> > +	}
> > +	raw_spin_unlock_irq(&p->pi_lock);
> > +	return ret;
> > +}
> > +
> > +/**
> >   * wake_up_process - Wake up a specific process
> >   * @p: The process to be woken up.
> >   *
> 

* Re: [PATCH RFC v2 tip/core/rcu 02/22] rcu: Add per-task state to RCU CPU stall warnings
  2020-03-19 17:27     ` Steven Rostedt
@ 2020-03-19 17:41       ` Paul E. McKenney
  0 siblings, 0 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-19 17:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	fweisbec, oleg, joel

On Thu, Mar 19, 2020 at 01:27:31PM -0400, Steven Rostedt wrote:
> On Wed, 18 Mar 2020 17:10:40 -0700
> paulmck@kernel.org wrote:
> 
> > From: "Paul E. McKenney" <paulmck@kernel.org>
> > 
> > Currently, an RCU-preempt CPU stall warning simply lists the PIDs of
> > those tasks holding up the current grace period.  This can be helpful,
> > but more can be even more helpful.
> > 
> > To this end, this commit adds the nesting level, whether the task
> > things it was preempted in its current RCU read-side critical section,
> 
> s/things/thinks/

I thing that was an excellent catch, thank you!  ;-)

> > whether RCU core has asked this task for a quiescent state, whether the
> > expedited-grace-period hint is set, and whether the task believes that
> > it is on the blocked-tasks list (it must be, or it would not be printed,
> > but if things are broken, best not to take too much for granted).
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > ---
> >  kernel/rcu/tree_stall.h | 38 ++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 36 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> > index 502b4dd..e19487d 100644
> > --- a/kernel/rcu/tree_stall.h
> > +++ b/kernel/rcu/tree_stall.h
> > @@ -192,14 +192,40 @@ static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp)
> >  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> >  }
> >  
> > +// Communicate task state back to the RCU CPU stall warning request.
> > +struct rcu_stall_chk_rdr {
> > +	int nesting;
> > +	union rcu_special rs;
> > +	bool on_blkd_list;
> > +};
> > +
> > +/*
> > + * Report out the state of a not-running task that is stalling the
> > + * current RCU grace period.
> > + */
> > +static bool check_slow_task(struct task_struct *t, void *arg)
> > +{
> > +	struct rcu_node *rnp;
> > +	struct rcu_stall_chk_rdr *rscrp = arg;
> > +
> > +	if (task_curr(t))
> > +		return false; // It is running, so decline to inspect it.
> 
> Since it can be locked on_rq(), should we report that too?

If it is locked on_rq() but !task_curr(t), it is runnable but not running.
Because the runqueue lock is held in that case, it cannot start running,
so the remainder of this function can safely inspect its state.  The
runqueue locks will supply the required ordering, ensuring a consistent
snapshot of the task's state.

However, if it is task_curr(t), which implies on_rq() as I understand
it, the task is running and therefore might be changing its state, and
doing so without any sort of attention to synchronization.  After all,
it is the task's private state that it is changing, so we don't want to
be paying the cost of any synchronization anyway.  Hence the return of
false above.

Or am I missing your point?

							Thanx, Paul

> -- Steve
> 
> > +	rscrp->nesting = t->rcu_read_lock_nesting;
> > +	rscrp->rs = t->rcu_read_unlock_special;
> > +	rnp = t->rcu_blocked_node;
> > +	rscrp->on_blkd_list = !list_empty(&t->rcu_node_entry);
> > +	return true;
> > +}
> > +
> >  /*
> >   * Scan the current list of tasks blocked within RCU read-side critical
> >   * sections, printing out the tid of each.
> >   */
> >  static int rcu_print_task_stall(struct rcu_node *rnp)
> >  {
> > -	struct task_struct *t;
> >  	int ndetected = 0;
> > +	struct rcu_stall_chk_rdr rscr;
> > +	struct task_struct *t;
> >  
> >  	if (!rcu_preempt_blocked_readers_cgp(rnp))
> >  		return 0;
> > @@ -208,7 +234,15 @@ static int rcu_print_task_stall(struct rcu_node *rnp)
> >  	t = list_entry(rnp->gp_tasks->prev,
> >  		       struct task_struct, rcu_node_entry);
> >  	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
> > -		pr_cont(" P%d", t->pid);
> > +		if (!try_invoke_on_locked_down_task(t, check_slow_task, &rscr))
> > +			pr_cont(" P%d", t->pid);
> > +		else
> > +			pr_cont(" P%d/%d:%c%c%c%c",
> > +				t->pid, rscr.nesting,
> > +				".b"[rscr.rs.b.blocked],
> > +				".q"[rscr.rs.b.need_qs],
> > +				".e"[rscr.rs.b.exp_hint],
> > +				".l"[rscr.on_blkd_list]);
> >  		ndetected++;
> >  	}
> >  	pr_cont("\n");
> 
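
The ".b"[flag] idiom in the quoted pr_cont() call indexes a two-character
string literal with a 0-or-1 value, printing '.' when the flag is clear
and the letter when it is set.  A minimal userspace sketch follows,
assuming a hypothetical flattened toy_stall_state structure and a
toy_format_stall() helper, neither of which exists in the kernel:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flattened copy of the sampled state; names are illustrative. */
struct toy_stall_state {
	int pid;
	int nesting;
	bool blocked;
	bool need_qs;
	bool exp_hint;
	bool on_blkd_list;
};

/* Format one entry as " P<pid>/<nesting>:<b><q><e><l>". */
static int toy_format_stall(char *buf, size_t len,
			    const struct toy_stall_state *s)
{
	/* Each bool promotes to 0 or 1, selecting '.' or the flag letter. */
	return snprintf(buf, len, " P%d/%d:%c%c%c%c",
			s->pid, s->nesting,
			".b"[s->blocked],
			".q"[s->need_qs],
			".e"[s->exp_hint],
			".l"[s->on_blkd_list]);
}
```

For example, a task with pid 42 at nesting level 1 that is blocked and
on the blocked-tasks list, with the other two flags clear, would be
formatted as " P42/1:b..l".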

* Re: [PATCH RFC v2 tip/core/rcu 09/22] rcu-tasks: Add an RCU-tasks rude variant
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 09/22] rcu-tasks: Add an RCU-tasks rude variant paulmck
@ 2020-03-19 19:04     ` Steven Rostedt
  2020-03-19 23:58       ` Paul E. McKenney
  0 siblings, 1 reply; 171+ messages in thread
From: Steven Rostedt @ 2020-03-19 19:04 UTC (permalink / raw)
  To: paulmck
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	fweisbec, oleg, joel

On Wed, 18 Mar 2020 17:10:47 -0700
paulmck@kernel.org wrote:

> From: "Paul E. McKenney" <paulmck@kernel.org>
> 
> This commit adds a "rude" variant of RCU-tasks that has as quiescent
> states schedule(), cond_resched_tasks_rcu_qs(), userspace execution,
> and (in theory, anyway) cond_resched().  In other words, RCU-tasks rude
> readers are regions of code with preemption disabled, but excluding code
> early in the CPU-online sequence and late in the CPU-offline sequence.
> Updates make use of IPIs and force an IPI and a context switch on each
> online CPU.  This variant is useful in some situations in tracing.
> 
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> [ paulmck: Apply EXPORT_SYMBOL_GPL() feedback from Qiujun Huang. ]
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> ---
>  include/linux/rcupdate.h |  3 ++
>  kernel/rcu/Kconfig       | 12 +++++-
>  kernel/rcu/tasks.h       | 98 ++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 112 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 5523145..2be97a8 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -37,6 +37,7 @@
>  /* Exported common interfaces */
>  void call_rcu(struct rcu_head *head, rcu_callback_t func);
>  void rcu_barrier_tasks(void);
> +void rcu_barrier_tasks_rude(void);
>  void synchronize_rcu(void);
>  
>  #ifdef CONFIG_PREEMPT_RCU
> @@ -138,6 +139,8 @@ static inline void rcu_init_nohz(void) { }
>  #define rcu_note_voluntary_context_switch(t) rcu_tasks_qs(t)
>  void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
>  void synchronize_rcu_tasks(void);
> +void call_rcu_tasks_rude(struct rcu_head *head, rcu_callback_t func);
> +void synchronize_rcu_tasks_rude(void);
>  void exit_tasks_rcu_start(void);
>  void exit_tasks_rcu_finish(void);
>  #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index 38475d0..0d43ec1 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -71,7 +71,7 @@ config TREE_SRCU
>  	  This option selects the full-fledged version of SRCU.
>  
>  config TASKS_RCU_GENERIC
> -	def_bool TASKS_RCU
> +	def_bool TASKS_RCU || TASKS_RUDE_RCU
>  	select SRCU
>  	help
>  	  This option enables generic infrastructure code supporting
> @@ -84,6 +84,16 @@ config TASKS_RCU
>  	  only voluntary context switch (not preemption!), idle, and
>  	  user-mode execution as quiescent states.  Not for manual selection.
>  
> +config TASKS_RUDE_RCU
> +	def_bool 0
> +	default n

No need for "default n" as that's the default without it.

> +	help
> +	  This option enables a task-based RCU implementation that uses
> +	  only context switch (including preemption) and user-mode
> +	  execution as quiescent states.  It forces IPIs and context
> +	  switches on all online CPUs, including idle ones, so use
> +	  with caution.  Not for manual selection.

Really don't need the "Not for manual selection", as not having a prompt
shows that too.

> +
>  config RCU_STALL_COMMON
>  	def_bool TREE_RCU
>  	help
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index d77921e..7ba1730 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -180,6 +180,9 @@ static void __init rcu_tasks_bootup_oddness(void)
>  	else
>  		pr_info("\tTasks RCU enabled.\n");
>  #endif /* #ifdef CONFIG_TASKS_RCU */
> +#ifdef CONFIG_TASKS_RUDE_RCU
> +	pr_info("\tRude variant of Tasks RCU enabled.\n");
> +#endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
>  }
>  
>  #endif /* #ifndef CONFIG_TINY_RCU */
> @@ -410,3 +413,98 @@ static int __init rcu_spawn_tasks_kthread(void)
>  core_initcall(rcu_spawn_tasks_kthread);
>  
>  #endif /* #ifdef CONFIG_TASKS_RCU */
> +
> +#ifdef CONFIG_TASKS_RUDE_RCU
> +
> +////////////////////////////////////////////////////////////////////////
> +//
> +// "Rude" variant of Tasks RCU, inspired by Steve Rostedt's trick of
> +// passing an empty function to schedule_on_each_cpu().  This approach
> +// provides an asynchronous call_rcu_rude() API and batching of concurrent
> +// calls to the synchronous synchronize_rcu_rude() API.  This sends IPIs
> +// far and wide and induces otherwise unnecessary context switches on all
> +// online CPUs, whether online or not.

   "on all online CPUs, whether online or not" ????

-- Steve

> +
> +// Empty function to allow workqueues to force a context switch.
> +static void rcu_tasks_be_rude(struct work_struct *work)
> +{
> +}
> +

* Re: [PATCH RFC v2 tip/core/rcu 14/22] rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks
  2020-03-19  0:10   ` [PATCH RFC v2 tip/core/rcu 14/22] rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks paulmck
  2020-03-19  1:37     ` Joel Fernandes
@ 2020-03-19 19:42     ` Steven Rostedt
  2020-03-20  0:28       ` Paul E. McKenney
  1 sibling, 1 reply; 171+ messages in thread
From: Steven Rostedt @ 2020-03-19 19:42 UTC (permalink / raw)
  To: paulmck
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	fweisbec, oleg, joel, Alexei Starovoitov, Andrii Nakryiko

On Wed, 18 Mar 2020 17:10:52 -0700
paulmck@kernel.org wrote:

> From: "Paul E. McKenney" <paulmck@kernel.org>
> 
> Because RCU does not watch exception early-entry/late-exit, idle-loop,
> or CPU-hotplug execution, protection of tracing and BPF operations is
> needlessly complicated.  This commit therefore adds a variant of
> Tasks RCU that:
> 
> o	Has explicit read-side markers to allow finite grace periods in
> 	the face of in-kernel loops for PREEMPT=n builds.  These markers
> 	are rcu_read_lock_trace() and rcu_read_unlock_trace().
> 
> o	Protects code in the idle loop, exception entry/exit, and
> 	CPU-hotplug code paths.  In this respect, RCU-tasks trace is
> 	similar to SRCU, but with lighter-weight readers.
> 
> o	Avoids expensive read-side instructions, having overhead similar
> 	to that of Preemptible RCU.
> 
> There are of course downsides:
> 
> o	The grace-period code can send IPIs to CPUs, even when those CPUs
> 	are in the idle loop or in nohz_full userspace.  This will be
> 	addressed by later commits.
> 
> o	It is necessary to scan the full tasklist, much as for Tasks RCU.
> 
> o	There is a single callback queue guarded by a single lock,
> 	again, much as for Tasks RCU.  However, those early use cases
> 	that request multiple grace periods in quick succession are
> 	expected to do so from a single task, which makes the single
> 	lock almost irrelevant.  If needed, multiple callback queues
> 	can be provided using any number of schemes.
> 
> Perhaps most important, this variant of RCU does not affect the vanilla
> flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
> readers can operate from idle, offline, and exception entry/exit in no
> way enables rcu_preempt and rcu_sched readers to do so.
> 
> This effort benefited greatly from off-list discussions of BPF
> requirements with Alexei Starovoitov and Andrii Nakryiko.  At least
> some of the on-list discussions are captured in the Link: tags below.
> In addition, KCSAN was quite helpful in finding some early bugs.

If we have this, do we still need to have RCU Tasks Rude flavor?

> 
> Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/
> Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/
> Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Cc: Andrii Nakryiko <andriin@fb.com>
> ---
>  include/linux/rcupdate_trace.h |  84 ++++++++++
>  include/linux/sched.h          |   8 +
>  init/init_task.c               |   4 +
>  kernel/fork.c                  |   4 +
>  kernel/rcu/Kconfig             |  12 +-
>  kernel/rcu/tasks.h             | 359 ++++++++++++++++++++++++++++++++++++++++-
>  6 files changed, 462 insertions(+), 9 deletions(-)
>  create mode 100644 include/linux/rcupdate_trace.h
> 


> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index 0d43ec1..187226b 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -71,7 +71,7 @@ config TREE_SRCU
>  	  This option selects the full-fledged version of SRCU.
>  
>  config TASKS_RCU_GENERIC
> -	def_bool TASKS_RCU || TASKS_RUDE_RCU
> +	def_bool TASKS_RCU || TASKS_RUDE_RCU || TASKS_TRACE_RCU
>  	select SRCU
>  	help
>  	  This option enables generic infrastructure code supporting
> @@ -94,6 +94,16 @@ config TASKS_RUDE_RCU
>  	  switches on all online CPUs, including idle ones, so use
>  	  with caution.  Not for manual selection.
>  
> +config TASKS_TRACE_RCU
> +	def_bool 0
> +	default n

Again, no need for "default n"

> +	help
> +	  This option enables a task-based RCU implementation that uses
> +	  explicit rcu_read_lock_trace() read-side markers, and allows
> +	  these readers to appear in the idle loop as well as on the CPU
> +	  hotplug code paths.  It can force IPIs on online CPUs, including
> +	  idle ones, so use with caution.  Not for manual selection.

And no real need for "Not for manual selection".

> +
>  config RCU_STALL_COMMON
>  	def_bool TREE_RCU
>  	help



> @@ -480,10 +489,10 @@ core_initcall(rcu_spawn_tasks_kthread);
>  //
>  // "Rude" variant of Tasks RCU, inspired by Steve Rostedt's trick of
>  // passing an empty function to schedule_on_each_cpu().  This approach
> -// provides an asynchronous call_rcu_rude() API and batching of concurrent
> -// calls to the synchronous synchronize_rcu_rude() API.  This sends IPIs
> -// far and wide and induces otherwise unnecessary context switches on all
> -// online CPUs, whether online or not.
> +// provides an asynchronous call_rcu_tasks_rude() API and batching
> +// of concurrent calls to the synchronous synchronize_rcu_rude() API.
> +// This sends IPIs far and wide and induces otherwise unnecessary context
> +// switches on all online CPUs, whether online or not.

This looks out of place for this patch. Perhaps you fixed this code and
applied it to the wrong patch?

>  
>  // Empty function to allow workqueues to force a context switch.
>  static void rcu_tasks_be_rude(struct work_struct *work)
> @@ -569,3 +578,337 @@ static int __init rcu_spawn_tasks_rude_kthread(void)
>  core_initcall(rcu_spawn_tasks_rude_kthread);
>  
>  #endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
> +
> +////////////////////////////////////////////////////////////////////////
> +//
> +// Tracing variant of Tasks RCU.  This variant is designed to be used
> +// to protect tracing hooks, including those of BPF.  This variant
> +// therefore:
> +//
> +// 1.	Has explicit read-side markers to allow finite grace periods
> +//	in the face of in-kernel loops for PREEMPT=n builds.
> +//
> +// 2.	Protects code in the idle loop, exception entry/exit, and
> +//	CPU-hotplug code paths, similar to the capabilities of SRCU.
> +//
> +// 3.	Avoids expensive read-side instructions, having overhead similar
> +//	to that of Preemptible RCU.
> +//
> +// There are of course downsides.  The grace-period code can send IPIs to
> +// CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
> +// It is necessary to scan the full tasklist, much as for Tasks RCU.  There
> +// is a single callback queue guarded by a single lock, again, much as for
> +// Tasks RCU.  If needed, these downsides can be at least partially remedied.
> +//
> +// Perhaps most important, this variant of RCU does not affect the vanilla
> +// flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
> +// readers can operate from idle, offline, and exception entry/exit in no
> +// way allows rcu_preempt and rcu_sched readers to also do so.
> +
> +// The lockdep state must be outside of #ifdef to be useful.
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +static struct lock_class_key rcu_lock_trace_key;
> +struct lockdep_map rcu_trace_lock_map =
> +	STATIC_LOCKDEP_MAP_INIT("rcu_read_lock_trace", &rcu_lock_trace_key);
> +EXPORT_SYMBOL_GPL(rcu_trace_lock_map);
> +#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> +
> +#ifdef CONFIG_TASKS_TRACE_RCU
> +
> +atomic_t trc_n_readers_need_end;	// Number of waited-for readers.
> +DECLARE_WAIT_QUEUE_HEAD(trc_wait);	// List of holdout tasks.
> +
> +// Record outstanding IPIs to each CPU.  No point in sending two...
> +static DEFINE_PER_CPU(bool, trc_ipi_to_cpu);
> +
> +/* If we are the last reader, wake up the grace-period kthread. */
> +void rcu_read_unlock_trace_special(struct task_struct *t)
> +{
> +	WRITE_ONCE(t->trc_reader_need_end, false);
> +	if (atomic_dec_and_test(&trc_n_readers_need_end))
> +		wake_up(&trc_wait);

Hmm, this can't be called in places that hold the rq->lock can it?

-- Steve

> +}
> +EXPORT_SYMBOL_GPL(rcu_read_unlock_trace_special);
> +
> +/* Add a task to the holdout list, if it is not already on the list. */
> +static void trc_add_holdout(struct task_struct *t, struct list_head *bhp)
> +{
> +	if (list_empty(&t->trc_holdout_list)) {
> +		get_task_struct(t);
> +		list_add(&t->trc_holdout_list, bhp);
> +	}
> +}
> +#endif /* #else #ifdef CONFIG_TASKS_TRACE_RCU */
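
The last-reader wakeup in the quoted rcu_read_unlock_trace_special() can
be modeled in userspace with C11 atomics.  This is a sketch under stated
assumptions: toy_reader, toy_wait_for_reader(), toy_reader_done(), and
the toy_wakeups counter are hypothetical, and the counter merely stands
in for the wake_up(&trc_wait) call.  The sketch does not address the
question of which locks may be held at the call site.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; not the kernel's data structures. */
static atomic_int toy_n_readers_need_end;	/* Number of waited-for readers. */
static int toy_wakeups;				/* Counts simulated wake_up() calls. */

struct toy_reader {
	atomic_bool need_end;	/* Grace period is waiting on this reader. */
};

/* Grace-period side: start waiting on one reader. */
static void toy_wait_for_reader(struct toy_reader *r)
{
	atomic_store(&r->need_end, true);
	atomic_fetch_add(&toy_n_readers_need_end, 1);
}

/* Reader side: toy analog of rcu_read_unlock_trace_special(). */
static void toy_reader_done(struct toy_reader *r)
{
	atomic_store(&r->need_end, false);
	/* atomic_fetch_sub() returns the old value, so 1 means we were last. */
	if (atomic_fetch_sub(&toy_n_readers_need_end, 1) == 1)
		toy_wakeups++;	/* The kernel code calls wake_up(&trc_wait) here. */
}
```

Only the reader that drops the count to zero triggers the wakeup, so
with two waited-for readers the first call to toy_reader_done() leaves
toy_wakeups untouched and the second increments it.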


* Re: [PATCH RFC v2 tip/core/rcu 09/22] rcu-tasks: Add an RCU-tasks rude variant
  2020-03-19 19:04     ` Steven Rostedt
@ 2020-03-19 23:58       ` Paul E. McKenney
  0 siblings, 0 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-19 23:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	fweisbec, oleg, joel

On Thu, Mar 19, 2020 at 03:04:32PM -0400, Steven Rostedt wrote:
> On Wed, 18 Mar 2020 17:10:47 -0700
> paulmck@kernel.org wrote:
> 
> > From: "Paul E. McKenney" <paulmck@kernel.org>
> > 
> > This commit adds a "rude" variant of RCU-tasks that has as quiescent
> > states schedule(), cond_resched_tasks_rcu_qs(), userspace execution,
> > and (in theory, anyway) cond_resched().  In other words, RCU-tasks rude
> > readers are regions of code with preemption disabled, but excluding code
> > early in the CPU-online sequence and late in the CPU-offline sequence.
> > Updates make use of IPIs and force an IPI and a context switch on each
> > online CPU.  This variant is useful in some situations in tracing.
> > 
> > Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> > [ paulmck: Apply EXPORT_SYMBOL_GPL() feedback from Qiujun Huang. ]
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > ---
> >  include/linux/rcupdate.h |  3 ++
> >  kernel/rcu/Kconfig       | 12 +++++-
> >  kernel/rcu/tasks.h       | 98 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 112 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 5523145..2be97a8 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -37,6 +37,7 @@
> >  /* Exported common interfaces */
> >  void call_rcu(struct rcu_head *head, rcu_callback_t func);
> >  void rcu_barrier_tasks(void);
> > +void rcu_barrier_tasks_rude(void);
> >  void synchronize_rcu(void);
> >  
> >  #ifdef CONFIG_PREEMPT_RCU
> > @@ -138,6 +139,8 @@ static inline void rcu_init_nohz(void) { }
> >  #define rcu_note_voluntary_context_switch(t) rcu_tasks_qs(t)
> >  void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
> >  void synchronize_rcu_tasks(void);
> > +void call_rcu_tasks_rude(struct rcu_head *head, rcu_callback_t func);
> > +void synchronize_rcu_tasks_rude(void);
> >  void exit_tasks_rcu_start(void);
> >  void exit_tasks_rcu_finish(void);
> >  #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
> > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > index 38475d0..0d43ec1 100644
> > --- a/kernel/rcu/Kconfig
> > +++ b/kernel/rcu/Kconfig
> > @@ -71,7 +71,7 @@ config TREE_SRCU
> >  	  This option selects the full-fledged version of SRCU.
> >  
> >  config TASKS_RCU_GENERIC
> > -	def_bool TASKS_RCU
> > +	def_bool TASKS_RCU || TASKS_RUDE_RCU
> >  	select SRCU
> >  	help
> >  	  This option enables generic infrastructure code supporting
> > @@ -84,6 +84,16 @@ config TASKS_RCU
> >  	  only voluntary context switch (not preemption!), idle, and
> >  	  user-mode execution as quiescent states.  Not for manual selection.
> >  
> > +config TASKS_RUDE_RCU
> > +	def_bool 0
> > +	default n
> 
> No need for "default n" as that's the default without it.

Removed!

> > +	help
> > +	  This option enables a task-based RCU implementation that uses
> > +	  only context switch (including preemption) and user-mode
> > +	  execution as quiescent states.  It forces IPIs and context
> > +	  switches on all online CPUs, including idle ones, so use
> > +	  with caution.  Not for manual selection.
> 
> Really don't need the "Not for manual selection", as not having a prompt
> shows that too.

And also removed.

> > +
> >  config RCU_STALL_COMMON
> >  	def_bool TREE_RCU
> >  	help
> > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > index d77921e..7ba1730 100644
> > --- a/kernel/rcu/tasks.h
> > +++ b/kernel/rcu/tasks.h
> > @@ -180,6 +180,9 @@ static void __init rcu_tasks_bootup_oddness(void)
> >  	else
> >  		pr_info("\tTasks RCU enabled.\n");
> >  #endif /* #ifdef CONFIG_TASKS_RCU */
> > +#ifdef CONFIG_TASKS_RUDE_RCU
> > +	pr_info("\tRude variant of Tasks RCU enabled.\n");
> > +#endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
> >  }
> >  
> >  #endif /* #ifndef CONFIG_TINY_RCU */
> > @@ -410,3 +413,98 @@ static int __init rcu_spawn_tasks_kthread(void)
> >  core_initcall(rcu_spawn_tasks_kthread);
> >  
> >  #endif /* #ifdef CONFIG_TASKS_RCU */
> > +
> > +#ifdef CONFIG_TASKS_RUDE_RCU
> > +
> > +////////////////////////////////////////////////////////////////////////
> > +//
> > +// "Rude" variant of Tasks RCU, inspired by Steve Rostedt's trick of
> > +// passing an empty function to schedule_on_each_cpu().  This approach
> > +// provides an asynchronous call_rcu_rude() API and batching of concurrent
> > +// calls to the synchronous synchronize_rcu_rude() API.  This sends IPIs
> > +// far and wide and induces otherwise unnecessary context switches on all
> > +// online CPUs, whether online or not.
> 
>    "on all online CPUs, whether online or not" ????

Good catch!  It should be "whether idle or not".  Fixed.  ;-)

							Thanx, Paul

> -- Steve
> 
> > +
> > +// Empty function to allow workqueues to force a context switch.
> > +static void rcu_tasks_be_rude(struct work_struct *work)
> > +{
> > +}
> > +

* Re: [PATCH RFC v2 tip/core/rcu 14/22] rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks
  2020-03-19 19:42     ` Steven Rostedt
@ 2020-03-20  0:28       ` Paul E. McKenney
  2020-03-20  0:48         ` Steven Rostedt
  0 siblings, 1 reply; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-20  0:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	fweisbec, oleg, joel, Alexei Starovoitov, Andrii Nakryiko

On Thu, Mar 19, 2020 at 03:42:39PM -0400, Steven Rostedt wrote:
> On Wed, 18 Mar 2020 17:10:52 -0700
> paulmck@kernel.org wrote:
> 
> > From: "Paul E. McKenney" <paulmck@kernel.org>
> > 
> > Because RCU does not watch exception early-entry/late-exit, idle-loop,
> > or CPU-hotplug execution, protection of tracing and BPF operations is
> > needlessly complicated.  This commit therefore adds a variant of
> > Tasks RCU that:
> > 
> > o	Has explicit read-side markers to allow finite grace periods in
> > 	the face of in-kernel loops for PREEMPT=n builds.  These markers
> > 	are rcu_read_lock_trace() and rcu_read_unlock_trace().
> > 
> > o	Protects code in the idle loop, exception entry/exit, and
> > 	CPU-hotplug code paths.  In this respect, RCU-tasks trace is
> > 	similar to SRCU, but with lighter-weight readers.
> > 
> > o	Avoids expensive read-side instructions, having overhead similar
> > 	to that of Preemptible RCU.
> > 
> > There are of course downsides:
> > 
> > o	The grace-period code can send IPIs to CPUs, even when those CPUs
> > 	are in the idle loop or in nohz_full userspace.  This will be
> > 	addressed by later commits.
> > 
> > o	It is necessary to scan the full tasklist, much as for Tasks RCU.
> > 
> > o	There is a single callback queue guarded by a single lock,
> > 	again, much as for Tasks RCU.  However, those early use cases
> > 	that request multiple grace periods in quick succession are
> > 	expected to do so from a single task, which makes the single
> > 	lock almost irrelevant.  If needed, multiple callback queues
> > 	can be provided using any number of schemes.
> > 
> > Perhaps most important, this variant of RCU does not affect the vanilla
> > flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
> > readers can operate from idle, offline, and exception entry/exit in no
> > way enables rcu_preempt and rcu_sched readers to do so.
> > 
> > This effort benefited greatly from off-list discussions of BPF
> > requirements with Alexei Starovoitov and Andrii Nakryiko.  At least
> > some of the on-list discussions are captured in the Link: tags below.
> > In addition, KCSAN was quite helpful in finding some early bugs.
> 
> If we have this, do we still need to have RCU Tasks Rude flavor?

I have no idea.

It is not a drop-in replacement for RCU Tasks Rude because
preempt_disable() acts as a reader for Rude but not for Trace, which
needs explicit rcu_read_lock_trace() and rcu_read_unlock_trace() markers.
But it might well be that all of the schedule_on_each_cpu(ftrace_sync)
users could be adapted to use RCU Tasks Trace instead.

I would have to defer to you.  ;-)
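
To make the reader-marking difference concrete, here is a minimal
userspace sketch.  It is only a model: struct model_rtask and every
model_*() helper are hypothetical names invented for illustration, not
kernel APIs.  It assumes that a Rude grace period waits on any
preemption-disabled region, while a Trace grace period waits only on
regions bracketed by the explicit markers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a task; not the kernel's task_struct. */
struct model_rtask {
	int preempt_count;       /* Depth of preemption-disabled regions. */
	int trc_reader_nesting;  /* Depth of explicit Trace read-side markers. */
};

static void model_preempt_disable(struct model_rtask *t)
{
	t->preempt_count++;
}

static void model_preempt_enable(struct model_rtask *t)
{
	assert(t->preempt_count > 0);
	t->preempt_count--;
}

static void model_read_lock_trace(struct model_rtask *t)
{
	t->trc_reader_nesting++;
}

static void model_read_unlock_trace(struct model_rtask *t)
{
	assert(t->trc_reader_nesting > 0);
	t->trc_reader_nesting--;
}

/* Rude model: any preemption-disabled region counts as a reader. */
static bool model_blocks_rude_gp(const struct model_rtask *t)
{
	return t->preempt_count > 0;
}

/* Trace model: only explicitly marked regions count as readers. */
static bool model_blocks_trace_gp(const struct model_rtask *t)
{
	return t->trc_reader_nesting > 0;
}
```

In this model a region protected only by model_preempt_disable() holds
up a Rude grace period but is invisible to a Trace grace period, which
is why converting a Rude user would mean adding explicit markers.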

> > Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/
> > Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/
> > Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> > Cc: Andrii Nakryiko <andriin@fb.com>
> > ---
> >  include/linux/rcupdate_trace.h |  84 ++++++++++
> >  include/linux/sched.h          |   8 +
> >  init/init_task.c               |   4 +
> >  kernel/fork.c                  |   4 +
> >  kernel/rcu/Kconfig             |  12 +-
> >  kernel/rcu/tasks.h             | 359 ++++++++++++++++++++++++++++++++++++++++-
> >  6 files changed, 462 insertions(+), 9 deletions(-)
> >  create mode 100644 include/linux/rcupdate_trace.h
> > 
> 
> 
> > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > index 0d43ec1..187226b 100644
> > --- a/kernel/rcu/Kconfig
> > +++ b/kernel/rcu/Kconfig
> > @@ -71,7 +71,7 @@ config TREE_SRCU
> >  	  This option selects the full-fledged version of SRCU.
> >  
> >  config TASKS_RCU_GENERIC
> > -	def_bool TASKS_RCU || TASKS_RUDE_RCU
> > +	def_bool TASKS_RCU || TASKS_RUDE_RCU || TASKS_TRACE_RCU
> >  	select SRCU
> >  	help
> >  	  This option enables generic infrastructure code supporting
> > @@ -94,6 +94,16 @@ config TASKS_RUDE_RCU
> >  	  switches on all online CPUs, including idle ones, so use
> >  	  with caution.  Not for manual selection.
> >  
> > +config TASKS_TRACE_RCU
> > +	def_bool 0
> > +	default n
> 
> Again, no need for "default n"
> 
> > +	help
> > +	  This option enables a task-based RCU implementation that uses
> > +	  explicit rcu_read_lock_trace() read-side markers, and allows
> > +	  these readers to appear in the idle loop as well as on the CPU
> > +	  hotplug code paths.  It can force IPIs on online CPUs, including
> > +	  idle ones, so use with caution.  Not for manual selection.
> 
> And no real need for "Not for manual selection".

Both removed, thank you!

> > +
> >  config RCU_STALL_COMMON
> >  	def_bool TREE_RCU
> >  	help
> 
> 
> 
> > @@ -480,10 +489,10 @@ core_initcall(rcu_spawn_tasks_kthread);
> >  //
> >  // "Rude" variant of Tasks RCU, inspired by Steve Rostedt's trick of
> >  // passing an empty function to schedule_on_each_cpu().  This approach
> > -// provides an asynchronous call_rcu_rude() API and batching of concurrent
> > -// calls to the synchronous synchronize_rcu_rude() API.  This sends IPIs
> > -// far and wide and induces otherwise unnecessary context switches on all
> > -// online CPUs, whether online or not.
> > +// provides an asynchronous call_rcu_tasks_rude() API and batching
> > +// of concurrent calls to the synchronous synchronize_rcu_rude() API.
> > +// This sends IPIs far and wide and induces otherwise unnecessary context
> > +// switches on all online CPUs, whether online or not.
> 
> This looks out of place for this patch. Perhaps you fixed this code and
> applied it to the wrong patch?

Indeed I did, good catch!  Huh.  I guess the easiest fix is to back out
this change, rebase, then apply it to the earlier commit in a second
rebase.  I guess git -might- figure it out in one pass.  ;-)

> >  // Empty function to allow workqueues to force a context switch.
> >  static void rcu_tasks_be_rude(struct work_struct *work)
> > @@ -569,3 +578,337 @@ static int __init rcu_spawn_tasks_rude_kthread(void)
> >  core_initcall(rcu_spawn_tasks_rude_kthread);
> >  
> >  #endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
> > +
> > +////////////////////////////////////////////////////////////////////////
> > +//
> > +// Tracing variant of Tasks RCU.  This variant is designed to be used
> > +// to protect tracing hooks, including those of BPF.  This variant
> > +// therefore:
> > +//
> > +// 1.	Has explicit read-side markers to allow finite grace periods
> > +//	in the face of in-kernel loops for PREEMPT=n builds.
> > +//
> > +// 2.	Protects code in the idle loop, exception entry/exit, and
> > +//	CPU-hotplug code paths, similar to the capabilities of SRCU.
> > +//
> > +// 3.	Avoids expensive read-side instructions, having overhead similar
> > +//	to that of Preemptible RCU.
> > +//
> > +// There are of course downsides.  The grace-period code can send IPIs to
> > +// CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
> > +// It is necessary to scan the full tasklist, much as for Tasks RCU.  There
> > +// is a single callback queue guarded by a single lock, again, much as for
> > +// Tasks RCU.  If needed, these downsides can be at least partially remedied.
> > +//
> > +// Perhaps most important, this variant of RCU does not affect the vanilla
> > +// flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
> > +// readers can operate from idle, offline, and exception entry/exit in no
> > +// way allows rcu_preempt and rcu_sched readers to also do so.
> > +
> > +// The lockdep state must be outside of #ifdef to be useful.
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > +static struct lock_class_key rcu_lock_trace_key;
> > +struct lockdep_map rcu_trace_lock_map =
> > +	STATIC_LOCKDEP_MAP_INIT("rcu_read_lock_trace", &rcu_lock_trace_key);
> > +EXPORT_SYMBOL_GPL(rcu_trace_lock_map);
> > +#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> > +
> > +#ifdef CONFIG_TASKS_TRACE_RCU
> > +
> > +atomic_t trc_n_readers_need_end;	// Number of waited-for readers.
> > +DECLARE_WAIT_QUEUE_HEAD(trc_wait);	// List of holdout tasks.
> > +
> > +// Record outstanding IPIs to each CPU.  No point in sending two...
> > +static DEFINE_PER_CPU(bool, trc_ipi_to_cpu);
> > +
> > +/* If we are the last reader, wake up the grace-period kthread. */
> > +void rcu_read_unlock_trace_special(struct task_struct *t)
> > +{
> > +	WRITE_ONCE(t->trc_reader_need_end, false);
> > +	if (atomic_dec_and_test(&trc_n_readers_need_end))
> > +		wake_up(&trc_wait);
> 
> Hmm, this can't be called in places that hold the rq->lock can it?

Good point.  If interrupts are disabled, it will need to use some
other mechanism.  One approach is irqwork.  Another is a timer.

Suggestions?

							Thanx, Paul

> -- Steve
> 
> > +}
> > +EXPORT_SYMBOL_GPL(rcu_read_unlock_trace_special);
> > +
> > +/* Add a task to the holdout list, if it is not already on the list. */
> > +static void trc_add_holdout(struct task_struct *t, struct list_head *bhp)
> > +{
> > +	if (list_empty(&t->trc_holdout_list)) {
> > +		get_task_struct(t);
> > +		list_add(&t->trc_holdout_list, bhp);
> > +	}
> > +}
> > +#endif /* #else #ifdef CONFIG_TASKS_TRACE_RCU */
> 

* Re: [PATCH RFC v2 tip/core/rcu 14/22] rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks
  2020-03-20  0:28       ` Paul E. McKenney
@ 2020-03-20  0:48         ` Steven Rostedt
  2020-03-20  2:41           ` Paul E. McKenney
  0 siblings, 1 reply; 171+ messages in thread
From: Steven Rostedt @ 2020-03-20  0:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	fweisbec, oleg, joel, Alexei Starovoitov, Andrii Nakryiko

On Thu, 19 Mar 2020 17:28:13 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> Good point.  If interrupts are disabled, it will need to use some
> other mechanism.  One approach is irqwork.  Another is a timer.
> 
> Suggestions?

Ftrace and perf use irq_work, I would think that should work here too.

-- Steve

* Re: [PATCH RFC v2 tip/core/rcu 14/22] rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks
  2020-03-20  0:48         ` Steven Rostedt
@ 2020-03-20  2:41           ` Paul E. McKenney
  2020-03-28 14:06             ` Joel Fernandes
  0 siblings, 1 reply; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-20  2:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	fweisbec, oleg, joel, Alexei Starovoitov, Andrii Nakryiko

On Thu, Mar 19, 2020 at 08:48:38PM -0400, Steven Rostedt wrote:
> On Thu, 19 Mar 2020 17:28:13 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > Good point.  If interrupts are disabled, it will need to use some
> > other mechanism.  One approach is irqwork.  Another is a timer.
> > 
> > Suggestions?
> 
> Ftrace and perf use irq_work, I would think that should work here too.

Sounds good, will give it a go!  And thank you for catching this!

							Thanx, Paul
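
For reference, the shape of the deferred-wakeup idea can be sketched in
userspace as follows.  This is a model under stated assumptions, not
the eventual patch: the model_* names are hypothetical, the two
counters merely stand in for a direct wake_up() and for queueing an
irq_work handler, and the irqs_disabled argument stands in for whatever
test the real code would use.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Number of readers the (modeled) grace period is still waiting on. */
static atomic_int model_n_readers_need_end;

/* Counters standing in for the two wakeup paths. */
static int model_direct_wakeups;    /* Stand-in for calling wake_up(). */
static int model_deferred_wakeups;  /* Stand-in for queueing irq_work. */

/*
 * Called when a waited-for reader leaves its critical section.
 * Only the last such reader wakes the grace-period waiter, and it
 * defers the wakeup when waking directly would not be safe.
 */
static void model_read_unlock_special(bool irqs_disabled)
{
	/* atomic_fetch_sub() returns the old value: 1 means we are last. */
	if (atomic_fetch_sub(&model_n_readers_need_end, 1) != 1)
		return;
	if (irqs_disabled)
		model_deferred_wakeups++;
	else
		model_direct_wakeups++;
}
```

The point of the split is that the counter update stays on the fast
path while only the final wakeup changes mechanism.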

* Re: [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task
  2020-03-19 17:35       ` Paul E. McKenney
@ 2020-03-20  2:49         ` Paul E. McKenney
  2020-03-20  3:09           ` Steven Rostedt
  2020-03-24  0:06           ` Joel Fernandes
  0 siblings, 2 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-20  2:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	fweisbec, oleg, joel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman

On Thu, Mar 19, 2020 at 10:35:25AM -0700, Paul E. McKenney wrote:
> On Thu, Mar 19, 2020 at 01:22:38PM -0400, Steven Rostedt wrote:
> > On Wed, 18 Mar 2020 17:10:39 -0700
> > paulmck@kernel.org wrote:

[ . . . ]

> > >  /**
> > > + * try_invoke_on_locked_down_task - Invoke a function on task in fixed state
> > > + * @p: Process for which the function is to be invoked.
> > > + * @func: Function to invoke.
> > > + * @arg: Argument to function.
> > > + *
> > > + * If the specified task can be quickly locked into a definite state
> > > + * (either sleeping or on a given runqueue), arrange to keep it in that
> > > + * state while invoking @func(@arg).  This function can use ->on_rq and
> > > + * task_curr() to work out what the state is, if required.  Given that
> > > + * @func can be invoked with a runqueue lock held, it had better be quite
> > > + * lightweight.
> > > + *
> > > + * Returns:
> > > + *	@false if the task slipped out from under the locks.
> > > + *	@true if the task was locked onto a runqueue or is sleeping.
> > > + *		However, @func can override this by returning @false.
> > 
> > Should probably state that it will return false if the state could be
> > locked, otherwise it returns the return code of the function.
> 
> So like this?
> 
>  * Returns:
>  * @false if the task state could not be locked.
>  * Otherwise, the return value from @func(arg).
> 
> > I'm wondering if we shouldn't have the function return code be something
> > passed in by the parameter, and have this return either true (locked and
> > function called), or false (not locked and function wasn't called).
> 
> I was thinking of this as one of the possible uses of whatever arg
> points to, which allows the caller of try_invoke_on_locked_down_task()
> and the specified function to communicate whatever they wish.  Then
> the specified function could (for example) unconditionally return true
> so that the return value from try_invoke_on_locked_down_task() indicated
> whether or not the specified function was called.
> 
> The current setup is very convenient for the use cases thus far.  It
> allows the function to say "Yeah, I was called, but I couldn't do
> anything", thus allowing the caller to make exactly one check to know
> that corrective action is required.

And here is another use case that led me to take this approach.
The trc_inspect_reader_notrunning() function in the patch below is passed
to try_invoke_on_locked_down_task() whose caller can continue testing
just the return value from try_invoke_on_locked_down_task() to work out
what to do next.

Thoughts?  Other use cases?

							Thanx, Paul
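
The return-value convention discussed above can be sketched in
userspace like this.  It is a hedged model only: struct model_ltask,
model_try_invoke(), and model_inspect() are hypothetical names, and the
"lockable" flag stands in for the real locking of the task's state.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a task whose state may or may not be pinned. */
struct model_ltask {
	bool lockable;  /* Can the task's state be locked down right now? */
	bool running;   /* Is the task currently running on some CPU? */
};

typedef bool (*model_func_t)(struct model_ltask *t, void *arg);

/*
 * Return false if the task state could not be locked.
 * Otherwise, return the value from func(t, arg).
 */
static bool model_try_invoke(struct model_ltask *t, model_func_t func,
			     void *arg)
{
	if (!t->lockable)
		return false;
	return func(t, arg);
}

/*
 * Example callback: counts its invocations through arg, and returns
 * false to say "I was called, but I couldn't do anything" when the
 * task is running.
 */
static bool model_inspect(struct model_ltask *t, void *arg)
{
	int *ncalls = arg;

	(*ncalls)++;
	return !t->running;
}
```

A caller that only needs "do I have to take corrective action?" tests
the single return value, while a caller that must distinguish "not
locked" from "called but declined" can do so through arg.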

------------------------------------------------------------------------

commit e26a234c1205bf02b62b62cd7f15f8086fc0b13b
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Thu Mar 19 15:33:12 2020 -0700

    rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built
    
    Systems running CPU-bound real-time tasks do not want IPIs sent to CPUs
    executing nohz_full userspace tasks.  Battery-powered systems don't
    want IPIs sent to idle CPUs in low-power mode.  Unfortunately, RCU tasks
    trace can and will send such IPIs in some cases.
    
    Both of these situations occur only when the target CPU is in RCU
    dyntick-idle mode, in other words, when RCU is not watching the
    target CPU.  This suggests that CPUs in dyntick-idle mode should use
    memory barriers in outermost invocations of rcu_read_lock_trace()
    and rcu_read_unlock_trace(), which would allow the RCU tasks trace
    grace period to directly read out the target CPU's read-side state.
    One challenge is that RCU tasks trace is not targeting a specific
    CPU, but rather a task.  And that task could switch from one CPU to
    another at any time.
    
    This commit therefore uses try_invoke_on_locked_down_task()
    and checks for task_curr() in trc_inspect_reader_notrunning().
    When this condition holds, the target task is running and cannot move.
    If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
    function can be used to check if the specified integer (in this case,
    t->trc_reader_nesting) is zero while the target CPU remains in that same
    dyntick-idle sojourn.  If so, the target task is in a quiescent state.
    If not, trc_read_check_handler() must indicate failure so that the
    grace-period kthread can take appropriate action or retry after an
    appropriate delay, as the case may be.
    
    With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
    CPU remains idle or a given task continues executing in nohz_full mode,
    the RCU tasks trace grace-period kthread will detect this without the
    need to send an IPI.
    
    Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index e1089fd..296f926 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -501,6 +501,7 @@ void srcutorture_get_gp_data(enum rcutorture_type test_type,
 #endif
 
 #ifdef CONFIG_TINY_RCU
+static inline bool rcu_dynticks_zero_in_eqs(int cpu, int *vp) { return false; }
 static inline unsigned long rcu_get_gp_seq(void) { return 0; }
 static inline unsigned long rcu_exp_batches_completed(void) { return 0; }
 static inline unsigned long
@@ -510,6 +511,7 @@ static inline void show_rcu_gp_kthreads(void) { }
 static inline int rcu_get_gp_kthreads_prio(void) { return 0; }
 static inline void rcu_fwd_progress_check(unsigned long j) { }
 #else /* #ifdef CONFIG_TINY_RCU */
+bool rcu_dynticks_zero_in_eqs(int cpu, int *vp);
 unsigned long rcu_get_gp_seq(void);
 unsigned long rcu_exp_batches_completed(void);
 unsigned long srcu_batches_completed(struct srcu_struct *sp);
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index d31ed74..36f03d3 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -802,22 +802,38 @@ static void trc_read_check_handler(void *t_in)
 /* Callback function for scheduler to check non-running) task.  */
 static bool trc_inspect_reader_notrunning(struct task_struct *t, void *arg)
 {
-	if (task_curr(t))
-		return false;  // It is running, so decline to inspect it.
+	int cpu = task_cpu(t);
+	bool in_qs = false;
+
+	if (task_curr(t)) {
+		// If no chance of heavyweight readers, do it the hard way.
+		if (!IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
+			return false;
+
+		// If heavyweight readers are enabled on the remote task,
+		// we can inspect its state despite its currently running.
+		// However, we cannot safely change its state.
+		if (!rcu_dynticks_zero_in_eqs(cpu, &t->trc_reader_nesting))
+			return false; // No quiescent state, do it the hard way.
+		in_qs = true;
+	} else {
+		in_qs = likely(!t->trc_reader_nesting);
+	}
 
 	// Mark as checked.  Because this is called from the grace-period
 	// kthread, also remove the task from the holdout list.
 	t->trc_reader_checked = true;
 	trc_del_holdout(t);
 
-	// If the task is in a read-side critical section, set up its
+	if (in_qs)
+		return true;  // Already in quiescent state, done!!!
+
+	// The task is in a read-side critical section, so set up
 	// its state so that it will awaken the grace-period kthread upon
 	// exit from that critical section.
-	if (unlikely(t->trc_reader_nesting)) {
-		atomic_inc(&trc_n_readers_need_end); // One more to wait on.
-		WARN_ON_ONCE(t->trc_reader_special.b.need_qs);
-		WRITE_ONCE(t->trc_reader_special.b.need_qs, true);
-	}
+	atomic_inc(&trc_n_readers_need_end); // One more to wait on.
+	WARN_ON_ONCE(t->trc_reader_special.b.need_qs);
+	WRITE_ONCE(t->trc_reader_special.b.need_qs, true);
 	return true;
 }
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index de6228a..4eb424e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -239,6 +239,7 @@ static void rcu_dynticks_eqs_enter(void)
 	 * critical sections, and we also must force ordering with the
 	 * next idle sojourn.
 	 */
+	rcu_dynticks_task_trace_enter();  // Before ->dynticks update!
 	seq = atomic_add_return(RCU_DYNTICK_CTRL_CTR, &rdp->dynticks);
 	// RCU is no longer watching.  Better be in extended quiescent state!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
@@ -265,6 +266,7 @@ static void rcu_dynticks_eqs_exit(void)
 	 */
 	seq = atomic_add_return(RCU_DYNTICK_CTRL_CTR, &rdp->dynticks);
 	// RCU is now watching.  Better not be in an extended quiescent state!
+	rcu_dynticks_task_trace_exit();  // After ->dynticks update!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
 		     !(seq & RCU_DYNTICK_CTRL_CTR));
 	if (seq & RCU_DYNTICK_CTRL_MASK) {
@@ -337,6 +339,28 @@ static bool rcu_dynticks_in_eqs_since(struct rcu_data *rdp, int snap)
 }
 
 /*
+ * Return true if the referenced integer is zero while the specified
+ * CPU remains within a single extended quiescent state.
+ */
+bool rcu_dynticks_zero_in_eqs(int cpu, int *vp)
+{
+	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+	int snap;
+
+	// If not quiescent, force back to earlier extended quiescent state.
+	snap = atomic_read(&rdp->dynticks) & ~(RCU_DYNTICK_CTRL_MASK |
+					       RCU_DYNTICK_CTRL_CTR);
+
+	smp_rmb(); // Order ->dynticks and *vp reads.
+	if (READ_ONCE(*vp))
+		return false;  // Non-zero, so report failure;
+	smp_rmb(); // Order *vp read and ->dynticks re-read.
+
+	// If still in the same extended quiescent state, we are good!
+	return snap == (atomic_read(&rdp->dynticks) & ~RCU_DYNTICK_CTRL_MASK);
+}
+
+/*
  * Set the special (bottom) bit of the specified CPU so that it
  * will take special action (such as flushing its TLB) on the
  * next exit from an extended quiescent state.  Returns true if
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 44edd0a..43991a4 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -455,6 +455,8 @@ static void rcu_bind_gp_kthread(void);
 static bool rcu_nohz_full_cpu(void);
 static void rcu_dynticks_task_enter(void);
 static void rcu_dynticks_task_exit(void);
+static void rcu_dynticks_task_trace_enter(void);
+static void rcu_dynticks_task_trace_exit(void);
 
 /* Forward declarations for tree_stall.h */
 static void record_gp_stall_check_time(void);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 9355536..f4a344e 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2553,3 +2553,21 @@ static void rcu_dynticks_task_exit(void)
 	WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
 #endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
 }
+
+/* Turn on heavyweight RCU tasks trace readers on idle/user entry. */
+static void rcu_dynticks_task_trace_enter(void)
+{
+#ifdef CONFIG_TASKS_TRACE_RCU
+	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
+		current->trc_reader_special.b.need_mb = true;
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
+}
+
+/* Turn off heavyweight RCU tasks trace readers on idle/user exit. */
+static void rcu_dynticks_task_trace_exit(void)
+{
+#ifdef CONFIG_TASKS_TRACE_RCU
+	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
+		current->trc_reader_special.b.need_mb = false;
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
+}
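
The snapshot/recheck logic of rcu_dynticks_zero_in_eqs() in the patch
above can be tried out in userspace with the sketch below.  It is a
model, not the kernel code: the MODEL_* constants assume that the low
bit is the special-action bit and that the next bit is set while RCU is
watching, and sequentially consistent C11 loads are swapped in for the
smp_rmb() calls.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_CTRL_MASK 0x1  /* Assumed special-action request bit. */
#define MODEL_CTRL_CTR  0x2  /* Assumed increment; set means "watching". */

/*
 * Return true if *vp is zero while the modeled CPU stays within a
 * single extended quiescent state (EQS).
 */
static bool model_zero_in_eqs(atomic_int *dynticks, atomic_int *vp)
{
	int snap;

	/*
	 * Clearing the CTR bit makes the snapshot look like an EQS
	 * value, so a CPU that is not in an EQS can never match it
	 * on the recheck below.
	 */
	snap = atomic_load(dynticks) & ~(MODEL_CTRL_MASK | MODEL_CTRL_CTR);

	/* Seq-cst loads order the three reads in this model. */
	if (atomic_load(vp))
		return false;  /* Non-zero, so report failure. */

	/* Same EQS as at the snapshot?  Then the zero can be trusted. */
	return snap == (atomic_load(dynticks) & ~MODEL_CTRL_MASK);
}
```

With these assumed bit values, a counter of 4 (in EQS) with a zero
nesting count succeeds, while a counter of 6 (watching) fails no
matter what the nesting count is.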

* Re: [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task
  2020-03-20  2:49         ` Paul E. McKenney
@ 2020-03-20  3:09           ` Steven Rostedt
  2020-03-20 16:27             ` Paul E. McKenney
  2020-03-24  0:06           ` Joel Fernandes
  1 sibling, 1 reply; 171+ messages in thread
From: Steven Rostedt @ 2020-03-20  3:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	fweisbec, oleg, joel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman

On Thu, 19 Mar 2020 19:49:43 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > The current setup is very convenient for the use cases thus far.  It
> > allows the function to say "Yeah, I was called, but I couldn't do
> > anything", thus allowing the caller to make exactly one check to know
> > that corrective action is required.  
> 
> And here is another use case that led me to take this approach.
> The trc_inspect_reader_notrunning() function in the patch below is passed
> to try_invoke_on_locked_down_task() whose caller can continue testing
> just the return value from try_invoke_on_locked_down_task() to work out
> what to do next.
> 
> Thoughts?  Other use cases?

Note, I made this comment before looking at the use cases in the later
patches. I was looking at it for a more generic purpose, but I'm not
sure there is one.

It's fine as is for now.

-- Steve

* Re: [PATCH RFC v2 tip/core/rcu 03/22] rcutorture: Add flag to produce non-busy-wait task stalls
       [not found]       ` <20200320040329.9840-1-hdanton@sina.com>
@ 2020-03-20 16:11         ` Paul E. McKenney
  0 siblings, 0 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-20 16:11 UTC (permalink / raw)
  To: Hillf Danton
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel

On Fri, Mar 20, 2020 at 12:03:29PM +0800, Hillf Danton wrote:
> 
> On Thu, 19 Mar 2020 08:22:53 -0700 "Paul E. McKenney" wrote:
> > 
> > On Thu, Mar 19, 2020 at 09:39:47PM +0800, Hillf Danton wrote:
> > > 
> > > On Thu, 19 Mar 2020 05:38:12 -0700 "Paul E. McKenney" wrote:
> > > > 
> > > > On Thu, Mar 19, 2020 at 06:46:14PM +0800, Hillf Danton wrote:
> > > > > 
> > > > > On Wed, 18 Mar 2020 17:10:41 -0700
> > > > > >  static int rcu_torture_stall(void *args)
> > > > > >  {
> > > > > > +	int idx;
> > > > > >  	unsigned long stop_at;
> > > > > >  
> > > > > >  	VERBOSE_TOROUT_STRING("rcu_torture_stall task started");
> > > > > > @@ -1610,21 +1612,22 @@ static int rcu_torture_stall(void *args)
> > > > > >  	if (!kthread_should_stop()) {
> > > > > >  		stop_at = ktime_get_seconds() + stall_cpu;
> > > > > >  		/* RCU CPU stall is expected behavior in following code. */
> > > > > > -		rcu_read_lock();
> > > > > > +		idx = cur_ops->readlock();
> > > > > >  		if (stall_cpu_irqsoff)
> > > > > >  			local_irq_disable();
> > > > > > -		else
> > > > > > +		else if (!stall_cpu_block)
> > > > > >  			preempt_disable();
> > > > > >  		pr_alert("rcu_torture_stall start on CPU %d.\n",
> > > > > > -			 smp_processor_id());
> > > > > > +			 raw_smp_processor_id());
> > > > > >  		while (ULONG_CMP_LT((unsigned long)ktime_get_seconds(),
> > > > > >  				    stop_at))
> > > > > > -			continue;  /* Induce RCU CPU stall warning. */
> > > > > > +			if (stall_cpu_block)
> > > > > > +				schedule_timeout_uninterruptible(HZ);
> > > > > 
> > > > > Why is the scheduled-in task so special that it will be running on
> > > > > the current CPU with irq disabled?
> > > > 
> > > > You lost me on this one.
> > > 
> > > Quite likely :)
> > > 
> > > > IRQs are not at all disabled.
> > > > 
> > > > > >  		if (stall_cpu_irqsoff)
> > > > > >  			local_irq_enable();
> > > 
> > > Local IRQs get enabled here depending on stall_cpu_irqsoff.
> > > 
> > > What I was asking is the scheduling case like
> > > 
> > > 	local_irq_disable();
> > > 	schedule_timeout(HZ);
> > > 	local_irq_enable();
> > > 
> > > Is it likely going to be ruled out in this patch?
> > 
> > If an rcutorture run specified both the rcutorture.stall_cpu_irqsoff and
> > the rcutorture.stall_cpu_block module parameters, then yes, exactly the
> > sequence you call out should occur.  Can't say that I have tried this,
> > though.  Nor would I expect to have ever done so without your suggesting
> > that I do.
> > 
> > But why not try it on current -rcu?
> > 
> > tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 12 --duration 3 --configs "TRACE01" --bootargs "rcutorture.stall_cpu=25 rcutorture.stall_cpu_holdoff=30 rcutorture.stall_cpu_block=1 rcupdate.rcu_task_stall_timeout=10000 rcutorture.stall_cpu_irqsoff"
> > 
> > This tells rcutorture to use all 12 hardware threads, to run the kernel for
> > three minutes, to run only the TRACE01 rcutorture scenario, and to test
> > RCU CPU stall warnings:
> > 
> > rcutorture.stall_cpu=25: Stall the CPU for 25 seconds.
> > 
> > rcutorture.stall_cpu_holdoff=30: Wait 30 seconds after boot to start stalling.
> > 
> > rcutorture.stall_cpu_block=1: Do the schedule_timeout_uninterruptible()
> > 	while stalling.
> > 
> > rcupdate.rcu_task_stall_timeout=10000: Set the stall-warning timeout
> > 	to 10,000 jiffies, or ten seconds.
> > 
> > rcutorture.stall_cpu_irqsoff: This tells rcutorture to execute the
> > 	local_irq_disable() that you called out above.
> > 
> > And this results in a couple of stall warning messages, as expected
> 
> Given these warning messages,
> 
> > given that you get two ten-second intervals in a 25-second interval.
> 
> I suspect it is likely to induce RCU CPU stall using
> schedule_timeout_uninterruptible(HZ).

Exactly.  In fact, inducing an RCU CPU stall warning is the whole purpose
of those rcutorture module parameters.

But it depends on the flavor of RCU in use.  Feel free to experiment!

And again, I do not expect to be using rcutorture.stall_cpu_block=1 and
rcutorture.stall_cpu_irqsoff at the same time except as a demonstration
of something silly.  So I do not see the need to make a change.  If you
are advocating some change or reporting some bug, I am missing your point.

							Thanx, Paul
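
The "two ten-second intervals in a 25-second interval" arithmetic can
be written down as a tiny sketch.  It assumes HZ=1000, which is what
equating 10,000 jiffies with ten seconds implies, and it treats a
non-positive timeout as "no warnings".  The function name is
hypothetical, and real stall-warning timing will of course be less
exact than integer division suggests.

```c
/*
 * Number of whole stall-warning intervals that fit into a stall.
 * stall_seconds: length of the induced stall, in seconds.
 * hz: jiffies per second (assumed to be 1000 in the example).
 * timeout_jiffies: stall-warning timeout, in jiffies.
 */
static long model_expected_stall_warnings(long stall_seconds, long hz,
					   long timeout_jiffies)
{
	if (timeout_jiffies <= 0)
		return 0;  /* Warnings are disabled in this model. */
	return (stall_seconds * hz) / timeout_jiffies;
}
```

For rcutorture.stall_cpu=25 and rcupdate.rcu_task_stall_timeout=10000
this gives 25000 / 10000, or two warnings.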

* Re: [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task
  2020-03-20  3:09           ` Steven Rostedt
@ 2020-03-20 16:27             ` Paul E. McKenney
  0 siblings, 0 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-20 16:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	fweisbec, oleg, joel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman

On Thu, Mar 19, 2020 at 11:09:45PM -0400, Steven Rostedt wrote:
> On Thu, 19 Mar 2020 19:49:43 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > > The current setup is very convenient for the use cases thus far.  It
> > > allows the function to say "Yeah, I was called, but I couldn't do
> > > anything", thus allowing the caller to make exactly one check to know
> > > that corrective action is required.  
> > 
> > And here is another use case that led me to take this approach.
> > The trc_inspect_reader_notrunning() function in the patch below is passed
> > to try_invoke_on_locked_down_task() whose caller can continue testing
> > just the return value from try_invoke_on_locked_down_task() to work out
> > what to do next.
> > 
> > Thoughts?  Other use cases?
> 
> Note, I made this comment before looking at the use cases in the later
> patches. I was looking at it for a more generic purpose, but I'm not
> sure there is one.
> 
> It's fine as is for now.

Sounds good, and again thank you for looking this over!

							Thanx, Paul

* Re: [PATCH RFC v2 tip/core/rcu 04/22] rcu-tasks: Move Tasks RCU to its own file
       [not found]   ` <20200320071228.9740-1-hdanton@sina.com>
@ 2020-03-20 19:14     ` Paul E. McKenney
  0 siblings, 0 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-20 19:14 UTC (permalink / raw)
  To: Hillf Danton
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel

On Fri, Mar 20, 2020 at 03:12:28PM +0800, Hillf Danton wrote:
> 
> On Wed, 18 Mar 2020 17:10:42 -0700 "Paul E. McKenney" wrote:
> > 
> > +/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
> > +static int __noreturn rcu_tasks_kthread(void *arg)
> > +{
> > +	unsigned long flags;
> > +	struct task_struct *g, *t;
> > +	unsigned long lastreport;
> > +	struct rcu_head *list;
> > +	struct rcu_head *next;
> > +	LIST_HEAD(rcu_tasks_holdouts);
> > +	int fract;
> > +
> > +	/* Run on housekeeping CPUs by default.  Sysadm can move if desired. */
> > +	housekeeping_affine(current, HK_FLAG_RCU);
> > +
> > +	/*
> > +	 * Each pass through the following loop makes one check for
> > +	 * newly arrived callbacks, and, if there are some, waits for
> > +	 * one RCU-tasks grace period and then invokes the callbacks.
> > +	 * This loop is terminated by the system going down.  ;-)
> > +	 */
> > +	for (;;) {
> > +
> > +		/* Pick up any new callbacks. */
> > +		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> > +		list = rcu_tasks_cbs_head;
> > +		rcu_tasks_cbs_head = NULL;
> > +		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> > +		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> > +
> > +		/* If there were none, wait a bit and start over. */
> > +		if (!list) {
> > +			wait_event_interruptible(rcu_tasks_cbs_wq,
> > +						 READ_ONCE(rcu_tasks_cbs_head));
> > +			if (!rcu_tasks_cbs_head) {
> > +				WARN_ON(signal_pending(current));
> > +				schedule_timeout_interruptible(HZ/10);
> > +			}
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * Wait for all pre-existing t->on_rq and t->nvcsw
> > +		 * transitions to complete.  Invoking synchronize_rcu()
> > +		 * suffices because all these transitions occur with
> > +		 * interrupts disabled.  Without this synchronize_rcu(),
> > +		 * a read-side critical section that started before the
> > +		 * grace period might be incorrectly seen as having started
> > +		 * after the grace period.
> > +		 *
> > +		 * This synchronize_rcu() also dispenses with the
> > +		 * need for a memory barrier on the first store to
> > +		 * ->rcu_tasks_holdout, as it forces the store to happen
> > +		 * after the beginning of the grace period.
> > +		 */
> > +		synchronize_rcu();
> > +
> > +		/*
> > +		 * There were callbacks, so we need to wait for an
> > +		 * RCU-tasks grace period.  Start off by scanning
> > +		 * the task list for tasks that are not already
> > +		 * voluntarily blocked.  Mark these tasks and make
> > +		 * a list of them in rcu_tasks_holdouts.
> > +		 */
> > +		rcu_read_lock();
> > +		for_each_process_thread(g, t) {
> > +			if (t != current && READ_ONCE(t->on_rq) &&
> > +			    !is_idle_task(t)) {
> > +				get_task_struct(t);
> > +				t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
> > +				WRITE_ONCE(t->rcu_tasks_holdout, true);
> > +				list_add(&t->rcu_tasks_holdout_list,
> > +					 &rcu_tasks_holdouts);
> > +			}
> > +		}
> 
> Nit: report stall if it would take a jiffy longer than
> rcu_task_stall_timeout to collect the tasks?

Fair point!

That said, the wait time below is in hundreds of milliseconds and the
stall time defaults to ten minutes, so it is not clear that such a check
would constitute non-dead code.
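
To put rough numbers on that gap, here is a back-of-the-envelope userspace
sketch (not kernel code) that totals the HZ/fract backoff waits from the
holdout loop quoted below and counts how many passes fit before the stall
timeout.  The HZ value of 1000 is an assumption made purely for illustration,
and count_waits_until() is a hypothetical helper, not anything in the patch.

```c
#include <stdio.h>

#define HZ 1000	/* Assumed tick rate, for illustration only. */

/*
 * Hypothetical helper: count how many passes of the holdout loop's
 * schedule_timeout_interruptible(HZ/fract) backoff fit within "limit"
 * jiffies.  The total number of jiffies slept is stored in *slept.
 */
static int count_waits_until(long limit, long *slept)
{
	int fract = 10;		/* Start with an HZ/10 wait... */
	int passes = 0;
	long total = 0;

	while (total + HZ / fract <= limit) {
		total += HZ / fract;
		if (fract > 1)
			fract--;	/* ...and back off toward a 1*HZ wait. */
		passes++;
	}
	*slept = total;
	return passes;
}

/* Print the pass count for a ten-minute timeout expressed in jiffies. */
static void show_ten_minute_case(void)
{
	long slept;
	int passes = count_waits_until(600L * HZ, &slept);

	printf("%d passes, %ld jiffies slept\n", passes, slept);
}
```

Under these assumptions the holdout loop gets several hundred passes before
a ten-minute timeout could expire, which is the scale difference referred
to above.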

							Thanx, Paul

> > +		rcu_read_unlock();
> > +
> > +		/*
> > +		 * Wait for tasks that are in the process of exiting.
> > +		 * This does only part of the job, ensuring that all
> > +		 * tasks that were previously exiting reach the point
> > +		 * where they have disabled preemption, allowing the
> > +		 * later synchronize_rcu() to finish the job.
> > +		 */
> > +		synchronize_srcu(&tasks_rcu_exit_srcu);
> > +
> > +		/*
> > +		 * Each pass through the following loop scans the list
> > +		 * of holdout tasks, removing any that are no longer
> > +		 * holdouts.  When the list is empty, we are done.
> > +		 */
> > +		lastreport = jiffies;
> > +
> > +		/* Start off with HZ/10 wait and slowly back off to 1 HZ wait. */
> > +		fract = 10;
> > +
> > +		for (;;) {
> > +			bool firstreport;
> > +			bool needreport;
> > +			int rtst;
> > +			struct task_struct *t1;
> > +
> > +			if (list_empty(&rcu_tasks_holdouts))
> > +				break;
> > +
> > +			/* Slowly back off waiting for holdouts */
> > +			schedule_timeout_interruptible(HZ/fract);
> > +
> > +			if (fract > 1)
> > +				fract--;
> > +
> > +			rtst = READ_ONCE(rcu_task_stall_timeout);
> > +			needreport = rtst > 0 &&
> > +				     time_after(jiffies, lastreport + rtst);
> > +			if (needreport)
> > +				lastreport = jiffies;
> > +			firstreport = true;
> > +			WARN_ON(signal_pending(current));
> > +			list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
> > +						rcu_tasks_holdout_list) {
> > +				check_holdout_task(t, needreport, &firstreport);
> > +				cond_resched();
> > +			}
> > +		}
> > +
> > +		/*
> > +		 * Because ->on_rq and ->nvcsw are not guaranteed
> > +		 * to have full memory barriers prior to them in the
> > +		 * schedule() path, memory reordering on other CPUs could
> > +		 * cause their RCU-tasks read-side critical sections to
> > +		 * extend past the end of the grace period.  However,
> > +		 * because these ->nvcsw updates are carried out with
> > +		 * interrupts disabled, we can use synchronize_rcu()
> > +		 * to force the needed ordering on all such CPUs.
> > +		 *
> > +		 * This synchronize_rcu() also confines all
> > +		 * ->rcu_tasks_holdout accesses to be within the grace
> > +		 * period, avoiding the need for memory barriers for
> > +		 * ->rcu_tasks_holdout accesses.
> > +		 *
> > +		 * In addition, this synchronize_rcu() waits for exiting
> > +		 * tasks to complete their final preempt_disable() region
> > +		 * of execution, cleaning up after the synchronize_srcu()
> > +		 * above.
> > +		 */
> > +		synchronize_rcu();
> > +
> > +		/* Invoke the callbacks. */
> > +		while (list) {
> > +			next = list->next;
> > +			local_bh_disable();
> > +			list->func(list);
> > +			local_bh_enable();
> > +			list = next;
> > +			cond_resched();
> > +		}
> > +		/* Paranoid sleep to keep this from entering a tight loop */
> > +		schedule_timeout_uninterruptible(HZ/10);
> > +	}
> > +}
> 


* Re: [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task
  2020-03-20  2:49         ` Paul E. McKenney
  2020-03-20  3:09           ` Steven Rostedt
@ 2020-03-24  0:06           ` Joel Fernandes
  2020-03-24  0:15             ` Joel Fernandes
  2020-03-24 15:48             ` Paul E. McKenney
  1 sibling, 2 replies; 171+ messages in thread
From: Joel Fernandes @ 2020-03-24  0:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, rcu, linux-kernel, kernel-team, mingo,
	jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, dhowells, edumazet, fweisbec, oleg, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman

On Thu, Mar 19, 2020 at 07:49:43PM -0700, Paul E. McKenney wrote:
[...] 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit e26a234c1205bf02b62b62cd7f15f8086fc0b13b
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date:   Thu Mar 19 15:33:12 2020 -0700
> 
>     rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built
>     
>     Systems running CPU-bound real-time tasks do not want IPIs sent to CPUs
>     executing nohz_full userspace tasks.  Battery-powered systems don't
>     want IPIs sent to idle CPUs in low-power mode.  Unfortunately, RCU tasks
>     trace can and will send such IPIs in some cases.
>     
>     Both of these situations occur only when the target CPU is in RCU
>     dyntick-idle mode, in other words, when RCU is not watching the
>     target CPU.  This suggests that CPUs in dyntick-idle mode should use
>     memory barriers in outermost invocations of rcu_read_lock_trace()
>     and rcu_read_unlock_trace(), which would allow the RCU tasks trace
>     grace period to directly read out the target CPU's read-side state.
>     One challenge is that RCU tasks trace is not targeting a specific
>     CPU, but rather a task.  And that task could switch from one CPU to
>     another at any time.
>     
>     This commit therefore uses try_invoke_on_locked_down_task()
>     and checks for task_curr() in trc_inspect_reader_notrunning().
>     When this condition holds, the target task is running and cannot move.
>     If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
>     function can be used to check if the specified integer (in this case,
>     t->trc_reader_nesting) is zero while the target CPU remains in that same
>     dyntick-idle sojourn.  If so, the target task is in a quiescent state.
>     If not, trc_read_check_handler() must indicate failure so that the
>     grace-period kthread can take appropriate action or retry after an
>     appropriate delay, as the case may be.
>     
>     With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
>     CPU remains idle or a given task continues executing in nohz_full mode,
>     the RCU tasks trace grace-period kthread will detect this without the
>     need to send an IPI.
>     
>     Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index e1089fd..296f926 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -501,6 +501,7 @@ void srcutorture_get_gp_data(enum rcutorture_type test_type,
>  #endif
>  
>  #ifdef CONFIG_TINY_RCU
> +static inline bool rcu_dynticks_zero_in_eqs(int cpu, int *vp) { return false; }
>  static inline unsigned long rcu_get_gp_seq(void) { return 0; }
>  static inline unsigned long rcu_exp_batches_completed(void) { return 0; }
>  static inline unsigned long
> @@ -510,6 +511,7 @@ static inline void show_rcu_gp_kthreads(void) { }
>  static inline int rcu_get_gp_kthreads_prio(void) { return 0; }
>  static inline void rcu_fwd_progress_check(unsigned long j) { }
>  #else /* #ifdef CONFIG_TINY_RCU */
> +bool rcu_dynticks_zero_in_eqs(int cpu, int *vp);
>  unsigned long rcu_get_gp_seq(void);
>  unsigned long rcu_exp_batches_completed(void);
>  unsigned long srcu_batches_completed(struct srcu_struct *sp);
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index d31ed74..36f03d3 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -802,22 +802,38 @@ static void trc_read_check_handler(void *t_in)
>  /* Callback function for scheduler to check non-running task.  */
>  static bool trc_inspect_reader_notrunning(struct task_struct *t, void *arg)

This function name is a bit confusing. The task could be running when this
function is called. Below you are detecting that the task is running, by
calling task_curr().

Maybe just trc_inspect_reader() is better?

[..]

> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 44edd0a..43991a4 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -455,6 +455,8 @@ static void rcu_bind_gp_kthread(void);
>  static bool rcu_nohz_full_cpu(void);
>  static void rcu_dynticks_task_enter(void);
>  static void rcu_dynticks_task_exit(void);
> +static void rcu_dynticks_task_trace_enter(void);
> +static void rcu_dynticks_task_trace_exit(void);
>  
>  /* Forward declarations for tree_stall.h */
>  static void record_gp_stall_check_time(void);
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 9355536..f4a344e 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -2553,3 +2553,21 @@ static void rcu_dynticks_task_exit(void)
>  	WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
>  #endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
>  }
> +
> +/* Turn on heavyweight RCU tasks trace readers on idle/user entry. */
> +static void rcu_dynticks_task_trace_enter(void)
> +{
> +#ifdef CONFIG_TASKS_RCU_TRACE
> +	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
> +		current->trc_reader_special.b.need_mb = true;

If this is ever called from the middle of a reader section (that is, we
transition from IPI-mode to using heavier reader-sections), then is a memory
barrier needed here just to protect the reader section that already started?

thanks,

 - Joel


> +#endif /* #ifdef CONFIG_TASKS_RCU_TRACE */
> +}
> +
> +/* Turn off heavyweight RCU tasks trace readers on idle/user exit. */
> +static void rcu_dynticks_task_trace_exit(void)
> +{
> +#ifdef CONFIG_TASKS_RCU_TRACE
> +	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
> +		current->trc_reader_special.b.need_mb = false;
> +#endif /* #ifdef CONFIG_TASKS_RCU_TRACE */
> +}


* Re: [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task
  2020-03-24  0:06           ` Joel Fernandes
@ 2020-03-24  0:15             ` Joel Fernandes
  2020-03-24 16:26               ` Paul E. McKenney
  2020-03-24 15:48             ` Paul E. McKenney
  1 sibling, 1 reply; 171+ messages in thread
From: Joel Fernandes @ 2020-03-24  0:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, rcu, linux-kernel, kernel-team, mingo,
	jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, dhowells, edumazet, fweisbec, oleg, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman

On Mon, Mar 23, 2020 at 08:06:39PM -0400, Joel Fernandes wrote:
> On Thu, Mar 19, 2020 at 07:49:43PM -0700, Paul E. McKenney wrote:
> [...] 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > commit e26a234c1205bf02b62b62cd7f15f8086fc0b13b
> > Author: Paul E. McKenney <paulmck@kernel.org>
> > Date:   Thu Mar 19 15:33:12 2020 -0700
> > 
> >     rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built
> >     
> >     Systems running CPU-bound real-time tasks do not want IPIs sent to CPUs
> >     executing nohz_full userspace tasks.  Battery-powered systems don't
> >     want IPIs sent to idle CPUs in low-power mode.  Unfortunately, RCU tasks
> >     trace can and will send such IPIs in some cases.
> >     
> >     Both of these situations occur only when the target CPU is in RCU
> >     dyntick-idle mode, in other words, when RCU is not watching the
> >     target CPU.  This suggests that CPUs in dyntick-idle mode should use
> >     memory barriers in outermost invocations of rcu_read_lock_trace()
> >     and rcu_read_unlock_trace(), which would allow the RCU tasks trace
> >     grace period to directly read out the target CPU's read-side state.
> >     One challenge is that RCU tasks trace is not targeting a specific
> >     CPU, but rather a task.  And that task could switch from one CPU to
> >     another at any time.
> >     
> >     This commit therefore uses try_invoke_on_locked_down_task()
> >     and checks for task_curr() in trc_inspect_reader_notrunning().
> >     When this condition holds, the target task is running and cannot move.
> >     If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
> >     function can be used to check if the specified integer (in this case,
> >     t->trc_reader_nesting) is zero while the target CPU remains in that same
> >     dyntick-idle sojourn.  If so, the target task is in a quiescent state.
> >     If not, trc_read_check_handler() must indicate failure so that the
> >     grace-period kthread can take appropriate action or retry after an
> >     appropriate delay, as the case may be.
> >     
> >     With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
> >     CPU remains idle or a given task continues executing in nohz_full mode,
> >     the RCU tasks trace grace-period kthread will detect this without the
> >     need to send an IPI.
> >     
> >     Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > 
> > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > index e1089fd..296f926 100644
> > --- a/kernel/rcu/rcu.h
> > +++ b/kernel/rcu/rcu.h
> > @@ -501,6 +501,7 @@ void srcutorture_get_gp_data(enum rcutorture_type test_type,
> >  #endif
> >  
> >  #ifdef CONFIG_TINY_RCU
> > +static inline bool rcu_dynticks_zero_in_eqs(int cpu, int *vp) { return false; }
> >  static inline unsigned long rcu_get_gp_seq(void) { return 0; }
> >  static inline unsigned long rcu_exp_batches_completed(void) { return 0; }
> >  static inline unsigned long
> > @@ -510,6 +511,7 @@ static inline void show_rcu_gp_kthreads(void) { }
> >  static inline int rcu_get_gp_kthreads_prio(void) { return 0; }
> >  static inline void rcu_fwd_progress_check(unsigned long j) { }
> >  #else /* #ifdef CONFIG_TINY_RCU */
> > +bool rcu_dynticks_zero_in_eqs(int cpu, int *vp);
> >  unsigned long rcu_get_gp_seq(void);
> >  unsigned long rcu_exp_batches_completed(void);
> >  unsigned long srcu_batches_completed(struct srcu_struct *sp);
> > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > index d31ed74..36f03d3 100644
> > --- a/kernel/rcu/tasks.h
> > +++ b/kernel/rcu/tasks.h
> > @@ -802,22 +802,38 @@ static void trc_read_check_handler(void *t_in)
> >  /* Callback function for scheduler to check non-running task.  */
> >  static bool trc_inspect_reader_notrunning(struct task_struct *t, void *arg)
> 
> This function name is a bit confusing. The task could be running when this
> function is called. Below you are detecting that the task is running, by
> calling task_curr().
> 
> Maybe just trc_inspect_reader() is better?
> 
> [..]
> 
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index 44edd0a..43991a4 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -455,6 +455,8 @@ static void rcu_bind_gp_kthread(void);
> >  static bool rcu_nohz_full_cpu(void);
> >  static void rcu_dynticks_task_enter(void);
> >  static void rcu_dynticks_task_exit(void);
> > +static void rcu_dynticks_task_trace_enter(void);
> > +static void rcu_dynticks_task_trace_exit(void);
> >  
> >  /* Forward declarations for tree_stall.h */
> >  static void record_gp_stall_check_time(void);
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 9355536..f4a344e 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -2553,3 +2553,21 @@ static void rcu_dynticks_task_exit(void)
> >  	WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
> >  #endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> >  }
> > +
> > +/* Turn on heavyweight RCU tasks trace readers on idle/user entry. */
> > +static void rcu_dynticks_task_trace_enter(void)
> > +{
> > +#ifdef CONFIG_TASKS_RCU_TRACE
> > +	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
> > +		current->trc_reader_special.b.need_mb = true;
> 
> If this is ever called from the middle of a reader section (that is, we
> transition from IPI-mode to using heavier reader-sections), then is a memory
> barrier needed here just to protect the reader section that already started?

Forgot to add:
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel




* Re: [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task
  2020-03-24  0:06           ` Joel Fernandes
  2020-03-24  0:15             ` Joel Fernandes
@ 2020-03-24 15:48             ` Paul E. McKenney
  2020-03-24 16:52               ` Joel Fernandes
  1 sibling, 1 reply; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-24 15:48 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Steven Rostedt, rcu, linux-kernel, kernel-team, mingo,
	jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, dhowells, edumazet, fweisbec, oleg, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman

On Mon, Mar 23, 2020 at 08:06:39PM -0400, Joel Fernandes wrote:
> On Thu, Mar 19, 2020 at 07:49:43PM -0700, Paul E. McKenney wrote:
> [...] 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > commit e26a234c1205bf02b62b62cd7f15f8086fc0b13b
> > Author: Paul E. McKenney <paulmck@kernel.org>
> > Date:   Thu Mar 19 15:33:12 2020 -0700
> > 
> >     rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built
> >     
> >     Systems running CPU-bound real-time tasks do not want IPIs sent to CPUs
> >     executing nohz_full userspace tasks.  Battery-powered systems don't
> >     want IPIs sent to idle CPUs in low-power mode.  Unfortunately, RCU tasks
> >     trace can and will send such IPIs in some cases.
> >     
> >     Both of these situations occur only when the target CPU is in RCU
> >     dyntick-idle mode, in other words, when RCU is not watching the
> >     target CPU.  This suggests that CPUs in dyntick-idle mode should use
> >     memory barriers in outermost invocations of rcu_read_lock_trace()
> >     and rcu_read_unlock_trace(), which would allow the RCU tasks trace
> >     grace period to directly read out the target CPU's read-side state.
> >     One challenge is that RCU tasks trace is not targeting a specific
> >     CPU, but rather a task.  And that task could switch from one CPU to
> >     another at any time.
> >     
> >     This commit therefore uses try_invoke_on_locked_down_task()
> >     and checks for task_curr() in trc_inspect_reader_notrunning().
> >     When this condition holds, the target task is running and cannot move.
> >     If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
> >     function can be used to check if the specified integer (in this case,
> >     t->trc_reader_nesting) is zero while the target CPU remains in that same
> >     dyntick-idle sojourn.  If so, the target task is in a quiescent state.
> >     If not, trc_read_check_handler() must indicate failure so that the
> >     grace-period kthread can take appropriate action or retry after an
> >     appropriate delay, as the case may be.
> >     
> >     With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
> >     CPU remains idle or a given task continues executing in nohz_full mode,
> >     the RCU tasks trace grace-period kthread will detect this without the
> >     need to send an IPI.
> >     
> >     Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > 
> > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > index e1089fd..296f926 100644
> > --- a/kernel/rcu/rcu.h
> > +++ b/kernel/rcu/rcu.h
> > @@ -501,6 +501,7 @@ void srcutorture_get_gp_data(enum rcutorture_type test_type,
> >  #endif
> >  
> >  #ifdef CONFIG_TINY_RCU
> > +static inline bool rcu_dynticks_zero_in_eqs(int cpu, int *vp) { return false; }
> >  static inline unsigned long rcu_get_gp_seq(void) { return 0; }
> >  static inline unsigned long rcu_exp_batches_completed(void) { return 0; }
> >  static inline unsigned long
> > @@ -510,6 +511,7 @@ static inline void show_rcu_gp_kthreads(void) { }
> >  static inline int rcu_get_gp_kthreads_prio(void) { return 0; }
> >  static inline void rcu_fwd_progress_check(unsigned long j) { }
> >  #else /* #ifdef CONFIG_TINY_RCU */
> > +bool rcu_dynticks_zero_in_eqs(int cpu, int *vp);
> >  unsigned long rcu_get_gp_seq(void);
> >  unsigned long rcu_exp_batches_completed(void);
> >  unsigned long srcu_batches_completed(struct srcu_struct *sp);
> > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > index d31ed74..36f03d3 100644
> > --- a/kernel/rcu/tasks.h
> > +++ b/kernel/rcu/tasks.h
> > @@ -802,22 +802,38 @@ static void trc_read_check_handler(void *t_in)
> >  /* Callback function for scheduler to check non-running task.  */
> >  static bool trc_inspect_reader_notrunning(struct task_struct *t, void *arg)
> 
> This function name is a bit confusing. The task could be running when this
> function is called. Below you are detecting that the task is running, by
> calling task_curr().
> 
> Maybe just trc_inspect_reader() is better?

Sold!  ;-)

> [..]
> 
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index 44edd0a..43991a4 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -455,6 +455,8 @@ static void rcu_bind_gp_kthread(void);
> >  static bool rcu_nohz_full_cpu(void);
> >  static void rcu_dynticks_task_enter(void);
> >  static void rcu_dynticks_task_exit(void);
> > +static void rcu_dynticks_task_trace_enter(void);
> > +static void rcu_dynticks_task_trace_exit(void);
> >  
> >  /* Forward declarations for tree_stall.h */
> >  static void record_gp_stall_check_time(void);
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 9355536..f4a344e 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -2553,3 +2553,21 @@ static void rcu_dynticks_task_exit(void)
> >  	WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
> >  #endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> >  }
> > +
> > +/* Turn on heavyweight RCU tasks trace readers on idle/user entry. */
> > +static void rcu_dynticks_task_trace_enter(void)
> > +{
> > +#ifdef CONFIG_TASKS_RCU_TRACE
> > +	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
> > +		current->trc_reader_special.b.need_mb = true;
> 
> If this is ever called from the middle of a reader section (that is, we
> transition from IPI-mode to using heavier reader-sections), then is a memory
> barrier needed here just to protect the reader section that already started?

That memory barrier is provided by the memory ordering in the callers
of rcu_dynticks_task_trace_enter() and rcu_dynticks_task_trace_exit(),
namely, those callers' atomic_add_return() invocations.  These barriers
pair with the pair of smp_rmb() calls in rcu_dynticks_zero_in_eqs(),
which is in turn invoked from the function formerly known as
trc_inspect_reader_notrunning(), AKA trc_inspect_reader().

This same pair of smp_rmb() calls also pair with the conditional smp_mb()
calls in rcu_read_lock_trace() and rcu_read_unlock_trace().

In your scenario, the calls in rcu_read_lock_trace() and
rcu_read_unlock_trace() wouldn't happen, but in that case the ordering
from atomic_add_return() would suffice.
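
For concreteness, the snapshot/check/recheck pattern described above might
be modeled in userspace roughly as follows.  This is a sketch using C11
atomics, not the kernel's rcu_dynticks_zero_in_eqs().  The counter encoding
(low bit set while RCU is watching the CPU) is an assumption made for the
model, the names model_zero_in_eqs() and MODEL_WATCHING_BIT are hypothetical,
and the acquire fences are only stand-ins for the two smp_rmb() calls.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace model, not the kernel implementation.  Assumption: the low
 * bit of the counter is set while RCU is watching the CPU and clear
 * while the CPU is in an extended quiescent state (EQS), and the
 * counter changes on every EQS entry and exit.
 */
#define MODEL_WATCHING_BIT 1

static bool model_zero_in_eqs(atomic_int *dynticks, atomic_int *vp)
{
	/* Snapshot with the watching bit cleared, that is, as if in an EQS. */
	int snap = atomic_load_explicit(dynticks, memory_order_relaxed) &
		   ~MODEL_WATCHING_BIT;

	/* Stand-in for the first smp_rmb(): counter read, then *vp read. */
	atomic_thread_fence(memory_order_acquire);
	if (atomic_load_explicit(vp, memory_order_relaxed))
		return false;	/* Nonzero nesting, so report failure. */
	/* Stand-in for the second smp_rmb(): *vp read, then counter re-read. */
	atomic_thread_fence(memory_order_acquire);

	/* Success only if the CPU stayed in the same EQS throughout. */
	return snap == atomic_load_explicit(dynticks, memory_order_relaxed);
}
```

In this model, a CPU that was being watched at snapshot time, or that left
its EQS between the two counter reads, makes the final comparison fail, so
the caller falls back to some other means of checking on the reader.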

Does that work?  Or is there an ordering bug in there somewhere?

							Thanx, Paul

> thanks,
> 
>  - Joel
> 
> 
> > +#endif /* #ifdef CONFIG_TASKS_RCU_TRACE */
> > +}
> > +
> > +/* Turn off heavyweight RCU tasks trace readers on idle/user exit. */
> > +static void rcu_dynticks_task_trace_exit(void)
> > +{
> > +#ifdef CONFIG_TASKS_RCU_TRACE
> > +	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
> > +		current->trc_reader_special.b.need_mb = false;
> > +#endif /* #ifdef CONFIG_TASKS_RCU_TRACE */
> > +}


* Re: [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task
  2020-03-24  0:15             ` Joel Fernandes
@ 2020-03-24 16:26               ` Paul E. McKenney
  0 siblings, 0 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-24 16:26 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Steven Rostedt, rcu, linux-kernel, kernel-team, mingo,
	jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, dhowells, edumazet, fweisbec, oleg, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman

On Mon, Mar 23, 2020 at 08:15:49PM -0400, Joel Fernandes wrote:
> On Mon, Mar 23, 2020 at 08:06:39PM -0400, Joel Fernandes wrote:
> > On Thu, Mar 19, 2020 at 07:49:43PM -0700, Paul E. McKenney wrote:
> > [...] 
> > > 							Thanx, Paul
> > > 
> > > ------------------------------------------------------------------------
> > > 
> > > commit e26a234c1205bf02b62b62cd7f15f8086fc0b13b
> > > Author: Paul E. McKenney <paulmck@kernel.org>
> > > Date:   Thu Mar 19 15:33:12 2020 -0700
> > > 
> > >     rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built
> > >     
> > >     Systems running CPU-bound real-time tasks do not want IPIs sent to CPUs
> > >     executing nohz_full userspace tasks.  Battery-powered systems don't
> > >     want IPIs sent to idle CPUs in low-power mode.  Unfortunately, RCU tasks
> > >     trace can and will send such IPIs in some cases.
> > >     
> > >     Both of these situations occur only when the target CPU is in RCU
> > >     dyntick-idle mode, in other words, when RCU is not watching the
> > >     target CPU.  This suggests that CPUs in dyntick-idle mode should use
> > >     memory barriers in outermost invocations of rcu_read_lock_trace()
> > >     and rcu_read_unlock_trace(), which would allow the RCU tasks trace
> > >     grace period to directly read out the target CPU's read-side state.
> > >     One challenge is that RCU tasks trace is not targeting a specific
> > >     CPU, but rather a task.  And that task could switch from one CPU to
> > >     another at any time.
> > >     
> > >     This commit therefore uses try_invoke_on_locked_down_task()
> > >     and checks for task_curr() in trc_inspect_reader_notrunning().
> > >     When this condition holds, the target task is running and cannot move.
> > >     If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
> > >     function can be used to check if the specified integer (in this case,
> > >     t->trc_reader_nesting) is zero while the target CPU remains in that same
> > >     dyntick-idle sojourn.  If so, the target task is in a quiescent state.
> > >     If not, trc_read_check_handler() must indicate failure so that the
> > >     grace-period kthread can take appropriate action or retry after an
> > >     appropriate delay, as the case may be.
> > >     
> > >     With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
> > >     CPU remains idle or a given task continues executing in nohz_full mode,
> > >     the RCU tasks trace grace-period kthread will detect this without the
> > >     need to send an IPI.
> > >     
> > >     Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > >     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > 
> > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > index e1089fd..296f926 100644
> > > --- a/kernel/rcu/rcu.h
> > > +++ b/kernel/rcu/rcu.h
> > > @@ -501,6 +501,7 @@ void srcutorture_get_gp_data(enum rcutorture_type test_type,
> > >  #endif
> > >  
> > >  #ifdef CONFIG_TINY_RCU
> > > +static inline bool rcu_dynticks_zero_in_eqs(int cpu, int *vp) { return false; }
> > >  static inline unsigned long rcu_get_gp_seq(void) { return 0; }
> > >  static inline unsigned long rcu_exp_batches_completed(void) { return 0; }
> > >  static inline unsigned long
> > > @@ -510,6 +511,7 @@ static inline void show_rcu_gp_kthreads(void) { }
> > >  static inline int rcu_get_gp_kthreads_prio(void) { return 0; }
> > >  static inline void rcu_fwd_progress_check(unsigned long j) { }
> > >  #else /* #ifdef CONFIG_TINY_RCU */
> > > +bool rcu_dynticks_zero_in_eqs(int cpu, int *vp);
> > >  unsigned long rcu_get_gp_seq(void);
> > >  unsigned long rcu_exp_batches_completed(void);
> > >  unsigned long srcu_batches_completed(struct srcu_struct *sp);
> > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > > index d31ed74..36f03d3 100644
> > > --- a/kernel/rcu/tasks.h
> > > +++ b/kernel/rcu/tasks.h
> > > @@ -802,22 +802,38 @@ static void trc_read_check_handler(void *t_in)
> > >  /* Callback function for scheduler to check non-running task.  */
> > >  static bool trc_inspect_reader_notrunning(struct task_struct *t, void *arg)
> > 
> > This function name is a bit confusing. The task could be running when this
> > function is called. Below you are detecting that the task is running, by
> > calling task_curr().
> > 
> > Maybe just trc_inspect_reader() is better?
> > 
> > [..]
> > 
> > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > > index 44edd0a..43991a4 100644
> > > --- a/kernel/rcu/tree.h
> > > +++ b/kernel/rcu/tree.h
> > > @@ -455,6 +455,8 @@ static void rcu_bind_gp_kthread(void);
> > >  static bool rcu_nohz_full_cpu(void);
> > >  static void rcu_dynticks_task_enter(void);
> > >  static void rcu_dynticks_task_exit(void);
> > > +static void rcu_dynticks_task_trace_enter(void);
> > > +static void rcu_dynticks_task_trace_exit(void);
> > >  
> > >  /* Forward declarations for tree_stall.h */
> > >  static void record_gp_stall_check_time(void);
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 9355536..f4a344e 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -2553,3 +2553,21 @@ static void rcu_dynticks_task_exit(void)
> > >  	WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
> > >  #endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > >  }
> > > +
> > > +/* Turn on heavyweight RCU tasks trace readers on idle/user entry. */
> > > +static void rcu_dynticks_task_trace_enter(void)
> > > +{
> > > +#ifdef CONFIG_TASKS_RCU_TRACE
> > > +	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
> > > +		current->trc_reader_special.b.need_mb = true;
> > 
> > If this is ever called from the middle of a reader section (that is, we
> > transition from IPI-mode to using heavier reader-sections), then is a memory
> > barrier needed here just to protect the reader section that already started?
> 
> Forgot to add:
> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Applied, thank you!

							Thanx, Paul


* Re: [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task
  2020-03-24 15:48             ` Paul E. McKenney
@ 2020-03-24 16:52               ` Joel Fernandes
  2020-03-24 17:20                 ` Paul E. McKenney
  0 siblings, 1 reply; 171+ messages in thread
From: Joel Fernandes @ 2020-03-24 16:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, rcu, linux-kernel, kernel-team, mingo,
	jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, dhowells, edumazet, fweisbec, oleg, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, vpillai

On Tue, Mar 24, 2020 at 08:48:22AM -0700, Paul E. McKenney wrote:
[..] 
> > 
> > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > > index 44edd0a..43991a4 100644
> > > --- a/kernel/rcu/tree.h
> > > +++ b/kernel/rcu/tree.h
> > > @@ -455,6 +455,8 @@ static void rcu_bind_gp_kthread(void);
> > >  static bool rcu_nohz_full_cpu(void);
> > >  static void rcu_dynticks_task_enter(void);
> > >  static void rcu_dynticks_task_exit(void);
> > > +static void rcu_dynticks_task_trace_enter(void);
> > > +static void rcu_dynticks_task_trace_exit(void);
> > >  
> > >  /* Forward declarations for tree_stall.h */
> > >  static void record_gp_stall_check_time(void);
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 9355536..f4a344e 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -2553,3 +2553,21 @@ static void rcu_dynticks_task_exit(void)
> > >  	WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
> > >  #endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > >  }
> > > +
> > > +/* Turn on heavyweight RCU tasks trace readers on idle/user entry. */
> > > +static void rcu_dynticks_task_trace_enter(void)
> > > +{
> > > +#ifdef CONFIG_TASKS_RCU_TRACE
> > > +	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
> > > +		current->trc_reader_special.b.need_mb = true;
> > 
> > If this is ever called from the middle of a reader section (that is, we
> > transition from IPI-mode to using heavier reader-sections), then is a memory
> > barrier needed here just to protect the reader section that already started?
> 
> That memory barrier is provided by the memory ordering in the callers
> of rcu_dynticks_task_trace_enter() and rcu_dynticks_task_trace_exit(),
> namely, those callers' atomic_add_return() invocations.  These barriers
> pair with the pair of smp_rmb() calls in rcu_dynticks_zero_in_eqs(),
> which is in turn invoked from the function formerly known as
> trc_inspect_reader_notrunning(), AKA trc_inspect_reader().
> 
> This same pair of smp_rmb() calls also pair with the conditional smp_mb()
> calls in rcu_read_lock_trace() and rcu_read_unlock_trace().
> 
> In your scenario, the calls in rcu_read_lock_trace() and
> rcu_read_unlock_trace() wouldn't happen, but in that case the ordering
> from atomic_add_return() would suffice.
> 
> Does that work?  Or is there an ordering bug in there somewhere?

Thanks for explaining. Could the following scenario cause a problem?

If we consider the litmus test:

{
int x = 1;
int *y = &x;
int z = 1;
}

P0(int *x, int *z, int **y)
{
	int *r0;
	int r1;

	dynticks_eqs_trace_enter();

	rcu_read_lock();
	r0 = rcu_dereference(*y);

	dynticks_eqs_trace_exit(); // cut-off reader's mb wings :)

	r1 = READ_ONCE(*r0); // Reordering of this beyond the unlock() is bad.
	rcu_read_unlock();
}

P1(int *x, int *z, int **y)
{
	rcu_assign_pointer(*y, z);
	synchronize_rcu();
	WRITE_ONCE(*x, 0);
}

exists (0:r0=x /\ 0:r1=0)

Then the following situation can happen?

	READER					UPDATER

						y = &z;

	eqs_enter(); // full-mb

	rcu_read_lock(); // full-mb
	// r0 = x;
						// GP-start
						// ..zero_in_eqs() notices eqs, no IPI
	eqs_exit(); // full-mb

	// actual r1 = *x but will reorder

	rcu_read_unlock(); // no-mb
						// GP-finish as notices nesting = 0
						x = 0;
	// reordered r1 = *x = 0;


Basically r0=x /\ r1=0 happened because r1=0. Or did I miss something that
prevents it?

thanks,

 - Joel




> > thanks,
> > 
> >  - Joel
> > 
> > 
> > > +#endif /* #ifdef CONFIG_TASKS_RCU_TRACE */
> > > +}
> > > +
> > > +/* Turn off heavyweight RCU tasks trace readers on idle/user exit. */
> > > +static void rcu_dynticks_task_trace_exit(void)
> > > +{
> > > +#ifdef CONFIG_TASKS_RCU_TRACE
> > > +	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
> > > +		current->trc_reader_special.b.need_mb = false;
> > > +#endif /* #ifdef CONFIG_TASKS_RCU_TRACE */
> > > +}


* Re: [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task
  2020-03-24 16:52               ` Joel Fernandes
@ 2020-03-24 17:20                 ` Paul E. McKenney
  2020-03-24 18:19                   ` Joel Fernandes
  0 siblings, 1 reply; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-24 17:20 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Steven Rostedt, rcu, linux-kernel, kernel-team, mingo,
	jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, dhowells, edumazet, fweisbec, oleg, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, vpillai

On Tue, Mar 24, 2020 at 12:52:55PM -0400, Joel Fernandes wrote:
> On Tue, Mar 24, 2020 at 08:48:22AM -0700, Paul E. McKenney wrote:
> [..] 
> > > 
> > > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > > > index 44edd0a..43991a4 100644
> > > > --- a/kernel/rcu/tree.h
> > > > +++ b/kernel/rcu/tree.h
> > > > @@ -455,6 +455,8 @@ static void rcu_bind_gp_kthread(void);
> > > >  static bool rcu_nohz_full_cpu(void);
> > > >  static void rcu_dynticks_task_enter(void);
> > > >  static void rcu_dynticks_task_exit(void);
> > > > +static void rcu_dynticks_task_trace_enter(void);
> > > > +static void rcu_dynticks_task_trace_exit(void);
> > > >  
> > > >  /* Forward declarations for tree_stall.h */
> > > >  static void record_gp_stall_check_time(void);
> > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > index 9355536..f4a344e 100644
> > > > --- a/kernel/rcu/tree_plugin.h
> > > > +++ b/kernel/rcu/tree_plugin.h
> > > > @@ -2553,3 +2553,21 @@ static void rcu_dynticks_task_exit(void)
> > > >  	WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
> > > >  #endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > > >  }
> > > > +
> > > > +/* Turn on heavyweight RCU tasks trace readers on idle/user entry. */
> > > > +static void rcu_dynticks_task_trace_enter(void)
> > > > +{
> > > > +#ifdef CONFIG_TASKS_RCU_TRACE
> > > > +	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
> > > > +		current->trc_reader_special.b.need_mb = true;
> > > 
> > > If this is ever called from the middle of a reader section (that is, we
> > > transition from IPI-mode to using heavier reader-sections), then is a memory
> > > barrier needed here just to protect the reader section that already started?
> > 
> > That memory barrier is provided by the memory ordering in the callers
> > of rcu_dynticks_task_trace_enter() and rcu_dynticks_task_trace_exit(),
> > namely, those callers' atomic_add_return() invocations.  These barriers
> > pair with the pair of smp_rmb() calls in rcu_dynticks_zero_in_eqs(),
> > which is in turn invoked from the function formerly known as
> > trc_inspect_reader_notrunning(), AKA trc_inspect_reader().
> > 
> > This same pair of smp_rmb() calls also pair with the conditional smp_mb()
> > calls in rcu_read_lock_trace() and rcu_read_unlock_trace().
> > 
> > In your scenario, the calls in rcu_read_lock_trace() and
> > rcu_read_unlock_trace() wouldn't happen, but in that case the ordering
> > from atomic_add_return() would suffice.
> > 
> > Does that work?  Or is there an ordering bug in there somewhere?
> 
> Thanks for explaining. Could the following scenario cause a problem?
> 
> If we consider the litmus test:
> 
> {
> int x = 1;
> int *y = &x;
> int z = 1;
> }
> 
> P0(int *x, int *z, int **y)
> {
> 	int *r0;
> 	int r1;
> 
> 	dynticks_eqs_trace_enter();
> 
> 	rcu_read_lock();
> 	r0 = rcu_dereference(*y);
> 
> 	dynticks_eqs_trace_exit(); // cut-off reader's mb wings :)

RCU Tasks Trace currently assumes that a reader will not start within
idle and end outside of idle.  However, keep in mind that eqs exit
implies a full memory barrier and changes the ->dynticks counter.
The call to rcu_dynticks_task_trace_exit() is not standalone.  Instead,
the atomic_add_return() immediately preceding that call is critically
important.  And ditto for rcu_dynticks_task_trace_enter() and the
atomic_add_return() immediately following it.

The overall effect is similar to that of sequence locks.

> 	r1 = READ_ONCE(*r0); // Reordering of this beyond the unlock() is bad.
> 	rcu_read_unlock();
> }
> 
> P1(int *x, int *z, int **y)
> {
> 	rcu_assign_pointer(*y, z);
> 	synchronize_rcu();
> 	WRITE_ONCE(*x, 0);
> }
> 
> exists (0:r0=x /\ 0:r1=0)
> 
> Then the following situation can happen?
> 
> 	READER					UPDATER
> 
> 						y = &z;
> 
> 	eqs_enter(); // full-mb
> 
> 	rcu_read_lock(); // full-mb
> 	// r0 = x;
> 						// GP-start
> 						// ..zero_in_eqs() notices eqs, no IPI
> 	eqs_exit(); // full-mb
> 
> 	// actual r1 = *x but will reorder
> 
> 	rcu_read_unlock(); // no-mb
> 						// GP-finish as notices nesting = 0
> 						x = 0;

Followed by an smp_rmb() followed by the second read of ->dynticks, which
will see a non-zero bottom bit for ->dynticks, and thus return false.
This in turn will cause the possible zero nesting counter to be ignored.
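
As a rough userspace model of that double read (a sketch, not the kernel
code), the check can be written with C11 atomics as shown below.  The names
cpu_model, model_eqs_transition(), and model_zero_in_eqs() are made up for
illustration, the "odd value means not in eqs" encoding is a simplifying
assumption, and the racing_reader hook exists only so that a single-threaded
run can replay an eqs exit landing between the two ->dynticks reads.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the per-CPU ->dynticks counter.
 * Modeling assumption: an odd value means the CPU is NOT in an
 * extended quiescent state (eqs), an even value means it is.
 */
struct cpu_model {
	atomic_int dynticks;
};

/*
 * Each eqs entry or exit is a fully ordered increment, standing in
 * for the atomic_add_return() discussed in this thread.
 */
static void model_eqs_transition(struct cpu_model *c)
{
	atomic_fetch_add_explicit(&c->dynticks, 1, memory_order_seq_cst);
}

/*
 * Return true only if *nesting was observed to be zero while the CPU
 * stayed in one and the same eqs across both ->dynticks reads.
 * racing_reader is a test-only hook (may be NULL) that runs between
 * the nesting read and the ->dynticks re-read.
 */
static bool model_zero_in_eqs(struct cpu_model *c, atomic_int *nesting,
			      void (*racing_reader)(struct cpu_model *))
{
	/* Snapshot with the low bit cleared, that is, forced back to eqs. */
	int snap = atomic_load_explicit(&c->dynticks,
					memory_order_relaxed) & ~1;

	/* Approximates the first smp_rmb(): ->dynticks read, then nesting. */
	atomic_thread_fence(memory_order_acquire);
	if (atomic_load_explicit(nesting, memory_order_relaxed) != 0)
		return false;

	if (racing_reader)
		racing_reader(c);

	/* Approximates the second smp_rmb(): nesting read, then re-read. */
	atomic_thread_fence(memory_order_acquire);
	return snap == atomic_load_explicit(&c->dynticks,
					    memory_order_relaxed);
}
```

Passing model_eqs_transition as the hook replays the scenario above: the
nesting counter reads as zero, but the re-read of ->dynticks no longer matches
the snapshot, so the function returns false and the zero is ignored.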

> 	// reordered r1 = *x = 0;
> 
> 
> Basically r0=x /\ r1=0 happened because r1=0. Or did I miss something that
> prevents it?

Yes, the change in value of ->dynticks and the full ordering associated
with the atomic_add_return() that makes this change.

							Thanx, Paul

> thanks,
> 
>  - Joel
> 
> 
> 
> 
> > > thanks,
> > > 
> > >  - Joel
> > > 
> > > 
> > > > +#endif /* #ifdef CONFIG_TASKS_RCU_TRACE */
> > > > +}
> > > > +
> > > > +/* Turn off heavyweight RCU tasks trace readers on idle/user exit. */
> > > > +static void rcu_dynticks_task_trace_exit(void)
> > > > +{
> > > > +#ifdef CONFIG_TASKS_RCU_TRACE
> > > > +	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
> > > > +		current->trc_reader_special.b.need_mb = false;
> > > > +#endif /* #ifdef CONFIG_TASKS_RCU_TRACE */
> > > > +}


* Re: [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task
  2020-03-24 17:20                 ` Paul E. McKenney
@ 2020-03-24 18:19                   ` Joel Fernandes
  2020-03-25  0:58                     ` Paul E. McKenney
  0 siblings, 1 reply; 171+ messages in thread
From: Joel Fernandes @ 2020-03-24 18:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, rcu, linux-kernel, kernel-team, mingo,
	jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, dhowells, edumazet, fweisbec, oleg, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, vpillai

On Tue, Mar 24, 2020 at 10:20:26AM -0700, Paul E. McKenney wrote:
> On Tue, Mar 24, 2020 at 12:52:55PM -0400, Joel Fernandes wrote:
> > On Tue, Mar 24, 2020 at 08:48:22AM -0700, Paul E. McKenney wrote:
> > [..] 
> > > > 
> > > > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > > > > index 44edd0a..43991a4 100644
> > > > > --- a/kernel/rcu/tree.h
> > > > > +++ b/kernel/rcu/tree.h
> > > > > @@ -455,6 +455,8 @@ static void rcu_bind_gp_kthread(void);
> > > > >  static bool rcu_nohz_full_cpu(void);
> > > > >  static void rcu_dynticks_task_enter(void);
> > > > >  static void rcu_dynticks_task_exit(void);
> > > > > +static void rcu_dynticks_task_trace_enter(void);
> > > > > +static void rcu_dynticks_task_trace_exit(void);
> > > > >  
> > > > >  /* Forward declarations for tree_stall.h */
> > > > >  static void record_gp_stall_check_time(void);
> > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > index 9355536..f4a344e 100644
> > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > @@ -2553,3 +2553,21 @@ static void rcu_dynticks_task_exit(void)
> > > > >  	WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
> > > > >  #endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > > > >  }
> > > > > +
> > > > > +/* Turn on heavyweight RCU tasks trace readers on idle/user entry. */
> > > > > +static void rcu_dynticks_task_trace_enter(void)
> > > > > +{
> > > > > +#ifdef CONFIG_TASKS_RCU_TRACE
> > > > > +	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
> > > > > +		current->trc_reader_special.b.need_mb = true;
> > > > 
> > > > If this is ever called from the middle of a reader section (that is, we
> > > > transition from IPI-mode to using heavier reader-sections), then is a memory
> > > > barrier needed here just to protect the reader section that already started?
> > > 
> > > That memory barrier is provided by the memory ordering in the callers
> > > of rcu_dynticks_task_trace_enter() and rcu_dynticks_task_trace_exit(),
> > > namely, those callers' atomic_add_return() invocations.  These barriers
> > > pair with the pair of smp_rmb() calls in rcu_dynticks_zero_in_eqs(),
> > > which is in turn invoked from the function formerly known as
> > > trc_inspect_reader_notrunning(), AKA trc_inspect_reader().
> > > 
> > > This same pair of smp_rmb() calls also pair with the conditional smp_mb()
> > > calls in rcu_read_lock_trace() and rcu_read_unlock_trace().
> > > 
> > > In your scenario, the calls in rcu_read_lock_trace() and
> > > rcu_read_unlock_trace() wouldn't happen, but in that case the ordering
> > > from atomic_add_return() would suffice.
> > > 
> > > Does that work?  Or is there an ordering bug in there somewhere?
> > 
> > Thanks for explaining. Could the following scenario cause a problem?
> > 
> > If we consider the litmus test:
> > 
> > {
> > int x = 1;
> > int *y = &x;
> > int z = 1;
> > }
> > 
> > P0(int *x, int *z, int **y)
> > {
> > 	int *r0;
> > 	int r1;
> > 
> > 	dynticks_eqs_trace_enter();
> > 
> > 	rcu_read_lock();
> > 	r0 = rcu_dereference(*y);
> > 
> > 	dynticks_eqs_trace_exit(); // cut-off reader's mb wings :)
> 
> RCU Tasks Trace currently assumes that a reader will not start within
> idle and end outside of idle.  However, keep in mind that eqs exit
> implies a full memory barrier and changes the ->dynticks counter.
> The call to rcu_dynticks_task_trace_exit() is not standalone.  Instead,
> the atomic_add_return() immediately preceding that call is critically
> important.  And ditto for rcu_dynticks_task_trace_enter() and the
> atomic_add_return() immediately following it.
> 
> The overall effect is similar to that of sequence locks.

Yes, sounds good. My corner case did consider the full memory barrier aspect.

> > 	r1 = READ_ONCE(*r0); // Reordering of this beyond the unlock() is bad.
> > 	rcu_read_unlock();
> > }
> > 
> > P1(int *x, int *z, int **y)
> > {
> > 	rcu_assign_pointer(*y, z);
> > 	synchronize_rcu();
> > 	WRITE_ONCE(*x, 0);
> > }
> > 
> > exists (0:r0=x /\ 0:r1=0)
> > 
> > Then the following situation can happen?
> > 
> > 	READER					UPDATER
> > 
> > 						y = &z;
> > 
> > 	eqs_enter(); // full-mb
> > 
> > 	rcu_read_lock(); // full-mb
> > 	// r0 = x;
> > 						// GP-start
> > 						// ..zero_in_eqs() notices eqs, no IPI
> > 	eqs_exit(); // full-mb
> > 
> > 	// actual r1 = *x but will reorder
> > 
> > 	rcu_read_unlock(); // no-mb
> > 						// GP-finish as notices nesting = 0
> > 						x = 0;
> 
> Followed by an smp_rmb() followed by the second read of ->dynticks, which
> will see a non-zero bottom bit for ->dynticks, and thus return false.
> This in turn will cause the possible zero nesting counter to be ignored.

Ah, I see. You are re-reading dynticks to confirm that the case I brought up
does not occur. That sounds good to me :) I drew out all possible (similar)
scenarios, could not break it, and found that the GP ordering guarantees hold :)

thanks,

 - Joel


> > 	// reordered r1 = *x = 0;
> > 
> > 
> > Basically r0=x /\ r1=0 happened because r1=0. Or did I miss something that
> > prevents it?
> 
> Yes, the change in value of ->dynticks and the full ordering associated
> with the atomic_add_return() that makes this change.
> 
> 							Thanx, Paul
> 
> > thanks,
> > 
> >  - Joel
> > 
> > 
> > 
> > 
> > > > thanks,
> > > > 
> > > >  - Joel
> > > > 
> > > > 
> > > > > +#endif /* #ifdef CONFIG_TASKS_RCU_TRACE */
> > > > > +}
> > > > > +
> > > > > +/* Turn off heavyweight RCU tasks trace readers on idle/user exit. */
> > > > > +static void rcu_dynticks_task_trace_exit(void)
> > > > > +{
> > > > > +#ifdef CONFIG_TASKS_RCU_TRACE
> > > > > +	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
> > > > > +		current->trc_reader_special.b.need_mb = false;
> > > > > +#endif /* #ifdef CONFIG_TASKS_RCU_TRACE */
> > > > > +}


* Re: [PATCH RFC v2 tip/core/rcu 01/22] sched/core: Add function to sample state of locked-down task
  2020-03-24 18:19                   ` Joel Fernandes
@ 2020-03-25  0:58                     ` Paul E. McKenney
  0 siblings, 0 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-25  0:58 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Steven Rostedt, rcu, linux-kernel, kernel-team, mingo,
	jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, dhowells, edumazet, fweisbec, oleg, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, vpillai

On Tue, Mar 24, 2020 at 02:19:39PM -0400, Joel Fernandes wrote:
> On Tue, Mar 24, 2020 at 10:20:26AM -0700, Paul E. McKenney wrote:
> > On Tue, Mar 24, 2020 at 12:52:55PM -0400, Joel Fernandes wrote:
> > > On Tue, Mar 24, 2020 at 08:48:22AM -0700, Paul E. McKenney wrote:
> > > [..] 
> > > > > 
> > > > > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > > > > > index 44edd0a..43991a4 100644
> > > > > > --- a/kernel/rcu/tree.h
> > > > > > +++ b/kernel/rcu/tree.h
> > > > > > @@ -455,6 +455,8 @@ static void rcu_bind_gp_kthread(void);
> > > > > >  static bool rcu_nohz_full_cpu(void);
> > > > > >  static void rcu_dynticks_task_enter(void);
> > > > > >  static void rcu_dynticks_task_exit(void);
> > > > > > +static void rcu_dynticks_task_trace_enter(void);
> > > > > > +static void rcu_dynticks_task_trace_exit(void);
> > > > > >  
> > > > > >  /* Forward declarations for tree_stall.h */
> > > > > >  static void record_gp_stall_check_time(void);
> > > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > > index 9355536..f4a344e 100644
> > > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > > @@ -2553,3 +2553,21 @@ static void rcu_dynticks_task_exit(void)
> > > > > >  	WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
> > > > > >  #endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > > > > >  }
> > > > > > +
> > > > > > +/* Turn on heavyweight RCU tasks trace readers on idle/user entry. */
> > > > > > +static void rcu_dynticks_task_trace_enter(void)
> > > > > > +{
> > > > > > +#ifdef CONFIG_TASKS_RCU_TRACE
> > > > > > +	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
> > > > > > +		current->trc_reader_special.b.need_mb = true;
> > > > > 
> > > > > If this is ever called from the middle of a reader section (that is, we
> > > > > transition from IPI-mode to using heavier reader-sections), then is a memory
> > > > > barrier needed here just to protect the reader section that already started?
> > > > 
> > > > That memory barrier is provided by the memory ordering in the callers
> > > > of rcu_dynticks_task_trace_enter() and rcu_dynticks_task_trace_exit(),
> > > > namely, those callers' atomic_add_return() invocations.  These barriers
> > > > pair with the pair of smp_rmb() calls in rcu_dynticks_zero_in_eqs(),
> > > > which is in turn invoked from the function formerly known as
> > > > trc_inspect_reader_notrunning(), AKA trc_inspect_reader().
> > > > 
> > > > This same pair of smp_rmb() calls also pair with the conditional smp_mb()
> > > > calls in rcu_read_lock_trace() and rcu_read_unlock_trace().
> > > > 
> > > > In your scenario, the calls in rcu_read_lock_trace() and
> > > > rcu_read_unlock_trace() wouldn't happen, but in that case the ordering
> > > > from atomic_add_return() would suffice.
> > > > 
> > > > Does that work?  Or is there an ordering bug in there somewhere?
> > > 
> > > Thanks for explaining. Could the following scenario cause a problem?
> > > 
> > > If we consider the litmus test:
> > > 
> > > {
> > > int x = 1;
> > > int *y = &x;
> > > int z = 1;
> > > }
> > > 
> > > P0(int *x, int *z, int **y)
> > > {
> > > 	int *r0;
> > > 	int r1;
> > > 
> > > 	dynticks_eqs_trace_enter();
> > > 
> > > 	rcu_read_lock();
> > > 	r0 = rcu_dereference(*y);
> > > 
> > > 	dynticks_eqs_trace_exit(); // cut-off reader's mb wings :)
> > 
> > RCU Tasks Trace currently assumes that a reader will not start within
> > idle and end outside of idle.  However, keep in mind that eqs exit
> > implies a full memory barrier and changes the ->dynticks counter.
> > The call to rcu_dynticks_task_trace_exit() is not standalone.  Instead,
> > the atomic_add_return() immediately preceding that call is critically
> > important.  And ditto for rcu_dynticks_task_trace_enter() and the
> > atomic_add_return() immediately following it.
> > 
> > The overall effect is similar to that of sequence locks.
> 
> Yes, sounds good. My corner case did consider the full memory barrier aspect.

In your defense, it is not like the memory barriers are conveniently
collected in one place or anything.

> > > 	r1 = READ_ONCE(*r0); // Reordering of this beyond the unlock() is bad.
> > > 	rcu_read_unlock();
> > > }
> > > 
> > > P1(int *x, int *z, int **y)
> > > {
> > > 	rcu_assign_pointer(*y, z);
> > > 	synchronize_rcu();
> > > 	WRITE_ONCE(*x, 0);
> > > }
> > > 
> > > exists (0:r0=x /\ 0:r1=0)
> > > 
> > > Then the following situation can happen?
> > > 
> > > 	READER					UPDATER
> > > 
> > > 						y = &z;
> > > 
> > > 	eqs_enter(); // full-mb
> > > 
> > > 	rcu_read_lock(); // full-mb
> > > 	// r0 = x;
> > > 						// GP-start
> > > 						// ..zero_in_eqs() notices eqs, no IPI
> > > 	eqs_exit(); // full-mb
> > > 
> > > 	// actual r1 = *x but will reorder
> > > 
> > > 	rcu_read_unlock(); // no-mb
> > > 						// GP-finish as notices nesting = 0
> > > 						x = 0;
> > 
> > Followed by an smp_rmb() followed by the second read of ->dynticks, which
> > will see a non-zero bottom bit for ->dynticks, and thus return false.
> > This in turn will cause the possible zero nesting counter to be ignored.
> 
> Ah, I see. You are re-reading dynticks to confirm that the case I brought up
> does not occur. That sounds good to me :) I drew out all possible (similar)
> scenarios, could not break it, and found that the GP ordering guarantees hold :)

Thank you for checking!

							Thanx, Paul

> thanks,
> 
>  - Joel
> 
> 
> > > 	// reordered r1 = *x = 0;
> > > 
> > > 
> > > Basically r0=x /\ r1=0 happened because r1=0. Or did I miss something that
> > > prevents it?
> > 
> > Yes, the change in value of ->dynticks and the full ordering associated
> > with the atomic_add_return() that makes this change.
> > 
> > 							Thanx, Paul
> > 
> > > thanks,
> > > 
> > >  - Joel
> > > 
> > > 
> > > 
> > > 
> > > > > thanks,
> > > > > 
> > > > >  - Joel
> > > > > 
> > > > > 
> > > > > > +#endif /* #ifdef CONFIG_TASKS_RCU_TRACE */
> > > > > > +}
> > > > > > +
> > > > > > +/* Turn off heavyweight RCU tasks trace readers on idle/user exit. */
> > > > > > +static void rcu_dynticks_task_trace_exit(void)
> > > > > > +{
> > > > > > +#ifdef CONFIG_TASKS_RCU_TRACE
> > > > > > +	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
> > > > > > +		current->trc_reader_special.b.need_mb = false;
> > > > > > +#endif /* #ifdef CONFIG_TASKS_RCU_TRACE */
> > > > > > +}


* [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline
  2020-03-19  0:10 ` [PATCH RFC v2 tip/core/rcu 0/22] " Paul E. McKenney
                     ` (24 preceding siblings ...)
       [not found]   ` <20200320071228.9740-1-hdanton@sina.com>
@ 2020-03-27 22:23   ` Paul E. McKenney
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 01/34] sched/core: Add function to sample state of locked-down task paulmck
                       ` (34 more replies)
  25 siblings, 35 replies; 171+ messages in thread
From: Paul E. McKenney @ 2020-03-27 22:23 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel

Hello!

This series provides two variants of Tasks RCU, a rude variant inspired
by Steven Rostedt's use of schedule_on_each_cpu(), and a tracing variant
requested by the BPF folks to be used (for example) to protect BPF
programs that unconditionally access userspace memory, and thus might
occasionally take a page fault, resulting in a voluntary context switch.

The rude variant uses context switches and offline as its quiescent
states, so that preempt-disabled regions of code executing on online
CPUs form the tasks rude RCU readers.

The tracing variant has explicit read-side markers to permit finite grace
periods even given in-kernel loops in PREEMPT=n builds.  These markers
are rcu_read_lock_trace() and rcu_read_unlock_trace(), so that any code
not under rcu_read_lock_trace() is a quiescent state.  This variant
also protects marked code in the idle loop, on exception entry/exit
paths, and on the various CPU-hotplug online/offline code paths, thus
having protection properties similar to SRCU.  However, unlike SRCU,
this variant avoids expensive instructions in the read-side primitives,
thus having read-side overhead similar to that of preemptible RCU.
This difference is important for some BPF programs, according to
benchmarking from Alexei Starovoitov:

https://lore.kernel.org/lkml/20200310014043.4dbagqbr2wsbuarm@ast-mbp/

There are of course downsides.  The grace-period code can send IPIs to
CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
However, this version enlists the aid of the context-switch hooks,
which eliminates the need for IPIs in context-switch-heavy workloads.
It also prohibits sending of IPIs early in the grace period based on a
new rcupdate.rcu_task_ipi_delay kernel boot parameter, which provides
additional opportunity for the hooks to do their job.  Finally, a new
TASKS_TRACE_RCU_READ_MB Kconfig option avoids sending IPIs to tasks
executing userspace or in the idle loop, at the expense of higher overhead
readers during kernel entry/exit code and in the idle loop.
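
The smp_mb()-versus-IPI tradeoff can be illustrated with a small userspace
model, given here as a sketch under stated assumptions rather than as the
actual implementation.  The struct task_model type, the model_*() function
names, and the barriers_run counter are made up for illustration.  Only the
idea comes from this series: a per-task need_mb flag is set on idle/user entry
and cleared on exit, and the read-side markers execute a full barrier only
while that flag is set.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-task state; field names loosely follow the patches. */
struct task_model {
	int trc_reader_nesting;	/* read-side nesting depth */
	bool need_mb;		/* set while the task is idle or in userspace */
	int barriers_run;	/* instrumentation for this sketch only */
};

/* Idle/user entry: readers from here on pay for full barriers. */
static void model_task_trace_enter(struct task_model *t)
{
	t->need_mb = true;
}

/* Idle/user exit: readers go back to the lightweight path. */
static void model_task_trace_exit(struct task_model *t)
{
	t->need_mb = false;
}

static void model_read_lock_trace(struct task_model *t)
{
	t->trc_reader_nesting++;
	if (t->need_mb) {
		/* Stands in for the conditional smp_mb(). */
		atomic_thread_fence(memory_order_seq_cst);
		t->barriers_run++;
	}
}

static void model_read_unlock_trace(struct task_model *t)
{
	if (t->need_mb) {
		/* Stands in for the conditional smp_mb(). */
		atomic_thread_fence(memory_order_seq_cst);
		t->barriers_run++;
	}
	t->trc_reader_nesting--;
}
```

In this model a reader that runs with need_mb clear executes no barriers at
all, which is the case that the grace-period code must instead cover by some
other means, such as an IPI or a context-switch hook.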

It is also necessary to scan the full tasklist, much as for Tasks RCU.
There is a single callback queue guarded by a single lock, again, much
as for Tasks RCU.  If needed, these downsides can be at least partially
remedied.

Perhaps most important, this variant of RCU does not affect the vanilla
flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
readers can operate from idle, offline, and exception entry/exit in no
way allows rcu_preempt and rcu_sched readers to also do so.

The RCU tasks trace mechanism is based on RCU tasks rather than
SRCU because the latter is more complex and also because the latter
uses a CPU-by-CPU approach to tracking quiescent states instead of the
task-by-task approach that is needed.  It is in theory possible to mash
RCU tasks trace into the Tree SRCU implementation, but there will need
to be extremely good reasons for doing so.  The vanilla RCU mechanism
could in theory be used in CONFIG_PREEMPT=y kernels, but fails utterly
in CONFIG_PREEMPT=n kernels.  Tasks RCU does not work because page
faults can result in a voluntary context switch, which prevents it from
protecting a BPF program that page faults.  The new "rude" variant only
protects preempt-disabled regions of code, thus also failing to protect
BPF programs that page fault.

This effort benefited greatly from off-list discussions of BPF
requirements with Alexei Starovoitov and Andrii Nakryiko, as well as from
numerous on-list discussions, at least some of which are captured in the
"Link:" tags on the patches themselves.

The patches in this series are as follows, with asterisks indicating
significant change from v1:

1*.	Add function to sample state of a locked-down task.  Added
	the task_struct argument to the callback function.

2*.	Use the above function to add per-task state to RCU CPU stall
	warnings.  This commit was adapted to the updated API.

3.	Add rcutorture module parameter to produce non-busy-wait task
	stalls, thus allowing the above RCU CPU stall change to be
	exercised.

4.	Move Tasks RCU to its own file.

5.	Create struct to hold RCU-tasks state information.

6.	Reinstate synchronize_rcu_mult(), as there will likely once
	again be a need to wait on multiple flavors of RCU.

7.	Add an rcutorture test for synchronize_rcu_mult().

8.	Refactor RCU-tasks to allow variants to be added.

9*.	Add an RCU-tasks rude variant, based on Steven Rostedt's
	use of schedule_on_each_cpu().  Updated Kconfig default
	to rely on the default value, updated help text, and
	updated the header comment.

10.	Add torture tests for RCU Tasks Rude.

11.	Use unique names for RCU-Tasks kthreads and messages.

12.	Further refactor RCU-tasks to allow adding even more variants.

13.	Code movement to allow even more Tasks RCU variants.

14*.	Add an RCU Tasks Trace to simplify protection of tracing hooks,
	including BPF.  This version fixes even more bugs and adds a
	URL to an email explaining the memory ordering.  It also updates
	the Kconfig default and updates the help text.  Furthermore, it
	moves a misplaced comment update.  Finally, it makes the
	rcu_read_unlock_trace() function safe for scheduler locks,
	interrupt handlers, and NMI handlers.

15.	Add torture tests for RCU Tasks Trace.

16.	Add stall warnings for RCU Tasks Trace.

17.	Move #ifdef into tasks.h to ease addition of Kconfig-dependent APIs.

18.	Add RCU-tasks-specific information to rcutorture writer stall
	output, easing debugging of these RCU variants.

19.	Make the above rcutorture writer stall output include
	grace-period state.

20.	Cause RCU tasks trace to take advantage of RCU scheduler hooks,
	thus reducing the number of IPIs.

21.	Record grace-period start time for RCU tasks variants for
	IPI throttling and for debugging.

22.	Provide a kernel boot parameter to delay IPIs until a given grace
	period reaches the specified age, with this age defaulting to
	half a second, further reducing the number of IPIs, to zero on
	context-switch-heavy workloads.

23*.	Split ->trc_reader_need_end to make room for memory-barrier
	indication.

24*.	Add grace-period and IPI counts to statistics.

25*.	Add Kconfig option to mediate smp_mb() vs. IPI.

26*.	Avoid IPIing userspace/idle tasks if kernel is so built.

27*.	Allow rcu_read_unlock_trace() under scheduler locks.

28*.	Disable CPU hotplug across RCU tasks trace scans.  This enables
	detection of idle tasks for offline CPUs.

29*.	Handle the running-offline idle-task special case.

30*.	Make RCU tasks trace also wait for idle tasks.

31*.	Add rcu_dynticks_zero_in_eqs() effectiveness statistics.

32*.	Add count for idle tasks on offline CPUs.

33*.	Add TRACE02 scenario enabling RCU Tasks Trace IPIs.
	The existing TRACE01 scenario avoids IPIs to userspace
	and idle CPUs.

34*.	Add IPI failure count to statistics.

These new versions of Tasks RCU now pass heavy rcutorture testing, and
should thus be fine for experimental use.  The original Tree RCU went
upstream with less testing than this has seen, but then again those were
simpler times.  ;-)

Changes since v2:

o	Leveraged idle entry/exit hooks to reduce IPIing of idle and
	userspace tasks.

o	Switch to read-side memory barriers during idle and userspace
	execution in kernels built for real-time or battery-powered use,
	mediated by a new TASKS_TRACE_RCU_READ_MB Kconfig option.  Also
	add an rcutorture test scenario for this option.

o	Adjust rcutorture to better test the IPI path.  (Seeing zero IPIs
	might be satisfying to me personally, but it is a lousy test
	strategy!)

o	Added more information to stall warnings and rcutorture
	end-of-test printout.

o	Make rcu_read_unlock_trace() usable when invoked with
	scheduler locks held.

o	Make rcu_read_unlock_trace() usable in interrupt and NMI
	handlers.

o	Fix handling of idle tasks, including those "running" on
	offline CPUs.

o	Fixed a number of other bugs found during testing and responded
	to review feedback.

Changes since v1:

o	Updated this cover letter to provide more detail, including
	on roads not taken.

o	Updated commit logs based on feedback from v1.

o	Updated the function providing a consistent view of the
	specified non-running task's state to invoke the specified
	function even if the task is currently running.  This will
	be necessary to safely eliminate IPIs for long-term idle and
	userspace execution.  The function may also now return false
	to transmit a failure indication to the caller, for example,
	if the function cannot handle being invoked on a running CPU.
	The function is now passed the relevant task_struct pointer as
	well as the specified argument.

	Changes were of course made to use the new API.

o	Leveraged context-switch hooks to avoid unnecessary IPIs.

o	Held off IPIs for the first half second (by default) of each
	grace period to give the context-switch hooks a better chance
	to do their job.

o	Lots of testing.

o	Fixed a number of bugs and responded to v1 feedback.

Todo:

o	Even more testing.

o	If all goes well, post a non-RFC series.

							Thanx, Paul

------------------------------------------------------------------------

 Documentation/admin-guide/kernel-parameters.txt             |   12 
 include/linux/rcupdate.h                                    |   48 
 include/linux/rcupdate_trace.h                              |   98 
 include/linux/rcupdate_wait.h                               |   19 
 include/linux/rcutiny.h                                     |    2 
 include/linux/sched.h                                       |   12 
 include/linux/wait.h                                        |    2 
 init/init_task.c                                            |    5 
 kernel/fork.c                                               |    5 
 kernel/rcu/Kconfig                                          |   50 
 kernel/rcu/Kconfig.debug                                    |    4 
 kernel/rcu/rcu.h                                            |    5 
 kernel/rcu/rcutorture.c                                     |   99 
 kernel/rcu/tasks.h                                          | 2089 +++++++++---
 kernel/rcu/tree.c                                           |   24 
 kernel/rcu/tree.h                                           |    2 
 kernel/rcu/tree_plugin.h                                    |   24 
 kernel/rcu/tree_stall.h                                     |   40 
 kernel/rcu/update.c                                         |  375 --
 kernel/sched/core.c                                         |   48 
 tools/testing/selftests/rcutorture/configs/rcu/CFLIST       |    3 
 tools/testing/selftests/rcutorture/configs/rcu/RUDE01       |   10 
 tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot  |    1 
 tools/testing/selftests/rcutorture/configs/rcu/TRACE01      |   11 
 tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot |    1 
 tools/testing/selftests/rcutorture/configs/rcu/TRACE02      |   11 
 tools/testing/selftests/rcutorture/configs/rcu/TRACE02.boot |    1 
 27 files changed, 2140 insertions(+), 861 deletions(-)


* [PATCH v3 tip/core/rcu 01/34] sched/core: Add function to sample state of locked-down task
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 02/34] rcu: Add per-task state to RCU CPU stall warnings paulmck
                       ` (33 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman

From: "Paul E. McKenney" <paulmck@kernel.org>

A running task's state can be sampled in a consistent manner (for example,
for diagnostic purposes) simply by invoking smp_call_function_single()
on its CPU, which may be obtained using task_cpu(), then having the
IPI handler verify that the desired task is in fact still running.
However, if the task is not running, this sampling can in theory be done
immediately and directly.  In practice, the task might start running at
any time, including during the sampling period.  Gaining a consistent
sample of a not-running task therefore requires that something be done
to lock down the target task's state.

This commit therefore adds a try_invoke_on_locked_down_task() function
that invokes a specified function if the specified task can be locked
down, returning true if successful and if the specified function returns
true.  Otherwise this function simply returns false.  Given that the
function passed to try_invoke_on_locked_down_task() might be invoked with
a runqueue lock held, that function had better be quite lightweight.

The function is passed the target task's task_struct pointer and the
argument passed to try_invoke_on_locked_down_task(), allowing easy access
to task state and to a location for further variables to be passed in
and out.

Note that the specified function will be called even if the specified
task is currently running.  The function can use ->on_rq and task_curr()
to quickly and easily determine the task's state, and can return false
if this state is not to the function's liking.  The caller of the
try_invoke_on_locked_down_task() would then see the false return value,
and could take appropriate action, for example, trying again later or
sending an IPI if matters are more urgent.

It is expected that use cases such as the RCU CPU stall warning code will
simply return false if the task is currently running.  However, there are
use cases involving nohz_full CPUs where the specified function might
instead fall back to an alternative sampling scheme that relies on heavier
synchronization (such as memory barriers) in the target task.
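
The callback contract described above can be sketched in userspace C.
This is only an illustration of the calling convention, not the kernel
implementation: struct mock_task, mock_try_invoke(), and sample_nesting()
are hypothetical names, and the locking that the real function does is
omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's task_struct; not the real layout. */
struct mock_task {
	bool on_rq;	/* Models ->on_rq: task is on a runqueue. */
	bool running;	/* Models task_curr(): task is executing on a CPU. */
	int nesting;	/* Example state that the callback wants to sample. */
};

/* Same shape as the callback type in the patch, using the mock task type. */
typedef bool (*sample_fn)(struct mock_task *t, void *arg);

/*
 * Models only the calling convention of try_invoke_on_locked_down_task():
 * call func(t, arg) and hand its verdict back to the caller.  The real
 * function also acquires ->pi_lock and possibly a runqueue lock, and can
 * itself return false without calling func; none of that is modeled here.
 */
static bool mock_try_invoke(struct mock_task *t, sample_fn func, void *arg)
{
	assert(t != NULL && func != NULL);
	return func(t, arg);
}

/* Callback that declines running tasks, as a stall-warning user might. */
static bool sample_nesting(struct mock_task *t, void *arg)
{
	int *out = arg;

	if (t->running)
		return false;	/* State is unstable, so tell the caller. */
	*out = t->nesting;	/* Pass the sampled state out through arg. */
	return true;
}
```

A caller seeing a false return from mock_try_invoke() would leave its
output variable untouched and could retry later or fall back to some
other sampling scheme.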

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
[ paulmck: Apply feedback from Peter Zijlstra and Steven Rostedt. ]
[ paulmck: Invoke if running to handle feedback from Mathieu Desnoyers. ]
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 include/linux/wait.h |  2 ++
 kernel/sched/core.c  | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 3283c8d..e2bb8ed 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -1148,4 +1148,6 @@ int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, i
 		(wait)->flags = 0;						\
 	} while (0)
 
+bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg);
+
 #endif /* _LINUX_WAIT_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a9983d..c37e99b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2574,6 +2574,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 *
 	 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
 	 * __schedule().  See the comment for smp_mb__after_spinlock().
+	 *
+	 * A similar smp_rmb() lives in try_invoke_on_locked_down_task().
 	 */
 	smp_rmb();
 	if (p->on_rq && ttwu_remote(p, wake_flags))
@@ -2648,6 +2650,52 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 }
 
 /**
+ * try_invoke_on_locked_down_task - Invoke a function on task in fixed state
+ * @p: Process for which the function is to be invoked.
+ * @func: Function to invoke.
+ * @arg: Argument to function.
+ *
+ * If the specified task can be quickly locked into a definite state
+ * (either sleeping or on a given runqueue), arrange to keep it in that
+ * state while invoking @func(@arg).  This function can use ->on_rq and
+ * task_curr() to work out what the state is, if required.  Given that
+ * @func can be invoked with a runqueue lock held, it had better be quite
+ * lightweight.
+ *
+ * Returns:
+ *	@false if the task slipped out from under the locks.
+ *	@true if the task was locked onto a runqueue or is sleeping.
+ *		However, @func can override this by returning @false.
+ */
+bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg)
+{
+	bool ret = false;
+	struct rq_flags rf;
+	struct rq *rq;
+
+	lockdep_assert_irqs_enabled();
+	raw_spin_lock_irq(&p->pi_lock);
+	if (p->on_rq) {
+		rq = __task_rq_lock(p, &rf);
+		if (task_rq(p) == rq)
+			ret = func(p, arg);
+		rq_unlock(rq, &rf);
+	} else {
+		switch (p->state) {
+		case TASK_RUNNING:
+		case TASK_WAKING:
+			break;
+		default:
+			smp_rmb(); // See smp_rmb() comment in try_to_wake_up().
+			if (!p->on_rq)
+				ret = func(p, arg);
+		}
+	}
+	raw_spin_unlock_irq(&p->pi_lock);
+	return ret;
+}
+
+/**
  * wake_up_process - Wake up a specific process
  * @p: The process to be woken up.
  *
-- 
2.9.5



* [PATCH v3 tip/core/rcu 02/34] rcu: Add per-task state to RCU CPU stall warnings
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 01/34] sched/core: Add function to sample state of locked-down task paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 03/34] rcutorture: Add flag to produce non-busy-wait task stalls paulmck
                       ` (32 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

Currently, an RCU-preempt CPU stall warning simply lists the PIDs of
those tasks holding up the current grace period.  This can be helpful,
but more can be even more helpful.

To this end, this commit adds the nesting level, whether the task
thinks it was preempted in its current RCU read-side critical section,
whether RCU core has asked this task for a quiescent state, whether the
expedited-grace-period hint is set, and whether the task believes that
it is on the blocked-tasks list (it must be, or it would not be printed,
but if things are broken, best not to take too much for granted).
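
The per-task output format can be sketched in userspace C as follows.
This is a rough illustration that mirrors the pr_cont() format string
in the patch below; struct stall_sample and format_stall_entry() are
hypothetical names, and snprintf() stands in for pr_cont().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flattened copy of the sampled per-task state. */
struct stall_sample {
	int pid;
	int nesting;		/* RCU read-side nesting level. */
	bool blocked;		/* Task thinks it was preempted in its reader. */
	bool need_qs;		/* RCU core asked for a quiescent state. */
	bool exp_hint;		/* Expedited-grace-period hint is set. */
	bool on_blkd_list;	/* Task believes it is on the blocked-tasks list. */
};

/*
 * Format one stall-warning entry into buf.  If the task could not be
 * sampled, fall back to printing only the PID.  Each flag selects
 * between '.' (clear) and a letter (set) by indexing a two-character
 * string literal with 0 or 1.
 */
static int format_stall_entry(char *buf, size_t len,
			      const struct stall_sample *s, bool sampled)
{
	assert(buf != NULL && s != NULL);
	if (!sampled)
		return snprintf(buf, len, " P%d", s->pid);
	return snprintf(buf, len, " P%d/%d:%c%c%c%c", s->pid, s->nesting,
			".b"[s->blocked], ".q"[s->need_qs],
			".e"[s->exp_hint], ".l"[s->on_blkd_list]);
}
```

For example, an entry of " P42/1:b..l" would describe PID 42 at nesting
level 1 that was preempted within its critical section and that is on
the blocked-tasks list, with neither the quiescent-state request nor
the expedited hint set.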

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tree_stall.h | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 502b4dd..e19487d 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -192,14 +192,40 @@ static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp)
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
+// Communicate task state back to the RCU CPU stall warning request.
+struct rcu_stall_chk_rdr {
+	int nesting;
+	union rcu_special rs;
+	bool on_blkd_list;
+};
+
+/*
+ * Report out the state of a not-running task that is stalling the
+ * current RCU grace period.
+ */
+static bool check_slow_task(struct task_struct *t, void *arg)
+{
+	struct rcu_node *rnp;
+	struct rcu_stall_chk_rdr *rscrp = arg;
+
+	if (task_curr(t))
+		return false; // It is running, so decline to inspect it.
+	rscrp->nesting = t->rcu_read_lock_nesting;
+	rscrp->rs = t->rcu_read_unlock_special;
+	rnp = t->rcu_blocked_node;
+	rscrp->on_blkd_list = !list_empty(&t->rcu_node_entry);
+	return true;
+}
+
 /*
  * Scan the current list of tasks blocked within RCU read-side critical
  * sections, printing out the tid of each.
  */
 static int rcu_print_task_stall(struct rcu_node *rnp)
 {
-	struct task_struct *t;
 	int ndetected = 0;
+	struct rcu_stall_chk_rdr rscr;
+	struct task_struct *t;
 
 	if (!rcu_preempt_blocked_readers_cgp(rnp))
 		return 0;
@@ -208,7 +234,15 @@ static int rcu_print_task_stall(struct rcu_node *rnp)
 	t = list_entry(rnp->gp_tasks->prev,
 		       struct task_struct, rcu_node_entry);
 	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
-		pr_cont(" P%d", t->pid);
+		if (!try_invoke_on_locked_down_task(t, check_slow_task, &rscr))
+			pr_cont(" P%d", t->pid);
+		else
+			pr_cont(" P%d/%d:%c%c%c%c",
+				t->pid, rscr.nesting,
+				".b"[rscr.rs.b.blocked],
+				".q"[rscr.rs.b.need_qs],
+				".e"[rscr.rs.b.exp_hint],
+				".l"[rscr.on_blkd_list]);
 		ndetected++;
 	}
 	pr_cont("\n");
-- 
2.9.5



* [PATCH v3 tip/core/rcu 03/34] rcutorture: Add flag to produce non-busy-wait task stalls
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 01/34] sched/core: Add function to sample state of locked-down task paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 02/34] rcu: Add per-task state to RCU CPU stall warnings paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 04/34] rcu-tasks: Move Tasks RCU to its own file paulmck
                       ` (31 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit aids testing of RCU task stall warning messages by adding
an rcutorture.stall_cpu_block module parameter that results in the
induced stall sleeping within the RCU read-side critical section.
Spinning with interrupts disabled is still available via the
rcutorture.stall_cpu_irqsoff module parameter, and specifying neither
of these two module parameters will spin with preemption disabled.

Note that sleeping (as opposed to preemption) results in additional
complaints from RCU at context-switch time, so yet more testing.
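
How the two module parameters combine can be sketched in userspace C.
This only restates the if/else chain in the patch below as a pure
function; struct stall_plan and plan_stall() are hypothetical names
used for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what the induced stall does. */
struct stall_plan {
	bool irqs_off;		/* Stall runs with interrupts disabled. */
	bool preempt_off;	/* Stall runs with preemption disabled. */
	bool sleeps;		/* Stall loop sleeps instead of spinning. */
};

/*
 * Mirror the parameter handling in the patched rcu_torture_stall():
 * stall_cpu_irqsoff is checked first, preemption is disabled only if
 * neither parameter is set, and the loop sleeps whenever
 * stall_cpu_block is set.
 */
static struct stall_plan plan_stall(int stall_cpu_irqsoff, int stall_cpu_block)
{
	struct stall_plan p;

	p.irqs_off = stall_cpu_irqsoff != 0;
	p.preempt_off = !stall_cpu_irqsoff && !stall_cpu_block;
	p.sleeps = stall_cpu_block != 0;
	/* The if/else chain never disables both interrupts and preemption. */
	assert(!(p.irqs_off && p.preempt_off));
	return p;
}
```

Note that this sketch also reports what happens when both parameters
are set, a combination that the commit log above does not discuss.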

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 Documentation/admin-guide/kernel-parameters.txt |  5 +++++
 kernel/rcu/rcutorture.c                         | 15 +++++++++------
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 1a5ff11..df2baf9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4165,6 +4165,11 @@
 			Duration of CPU stall (s) to test RCU CPU stall
 			warnings, zero to disable.
 
+	rcutorture.stall_cpu_block= [KNL]
+			Sleep while stalling if set.  This will result
+			in warnings from preemptible RCU in addition
+			to any other stall-related activity.
+
 	rcutorture.stall_cpu_holdoff= [KNL]
 			Time to wait (s) after boot before inducing stall.
 
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index b3301f3..ada5b91 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -102,6 +102,7 @@ torture_param(int, stall_cpu, 0, "Stall duration (s), zero to disable.");
 torture_param(int, stall_cpu_holdoff, 10,
 	     "Time to wait before starting stall (s).");
 torture_param(int, stall_cpu_irqsoff, 0, "Disable interrupts while stalling.");
+torture_param(int, stall_cpu_block, 0, "Sleep while stalling.");
 torture_param(int, stat_interval, 60,
 	     "Number of seconds between stats printk()s");
 torture_param(int, stutter, 5, "Number of seconds to run/halt test");
@@ -1599,6 +1600,7 @@ static int rcutorture_booster_init(unsigned int cpu)
  */
 static int rcu_torture_stall(void *args)
 {
+	int idx;
 	unsigned long stop_at;
 
 	VERBOSE_TOROUT_STRING("rcu_torture_stall task started");
@@ -1610,21 +1612,22 @@ static int rcu_torture_stall(void *args)
 	if (!kthread_should_stop()) {
 		stop_at = ktime_get_seconds() + stall_cpu;
 		/* RCU CPU stall is expected behavior in following code. */
-		rcu_read_lock();
+		idx = cur_ops->readlock();
 		if (stall_cpu_irqsoff)
 			local_irq_disable();
-		else
+		else if (!stall_cpu_block)
 			preempt_disable();
 		pr_alert("rcu_torture_stall start on CPU %d.\n",
-			 smp_processor_id());
+			 raw_smp_processor_id());
 		while (ULONG_CMP_LT((unsigned long)ktime_get_seconds(),
 				    stop_at))
-			continue;  /* Induce RCU CPU stall warning. */
+			if (stall_cpu_block)
+				schedule_timeout_uninterruptible(HZ);
 		if (stall_cpu_irqsoff)
 			local_irq_enable();
-		else
+		else if (!stall_cpu_block)
 			preempt_enable();
-		rcu_read_unlock();
+		cur_ops->readunlock(idx);
 		pr_alert("rcu_torture_stall end.\n");
 	}
 	torture_shutdown_absorb("rcu_torture_stall");
-- 
2.9.5



* [PATCH v3 tip/core/rcu 04/34] rcu-tasks: Move Tasks RCU to its own file
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (2 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 03/34] rcutorture: Add flag to produce non-busy-wait task stalls paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 05/34] rcu-tasks: Create struct to hold state information paulmck
                       ` (30 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This code-movement-only commit is in preparation for adding an additional
flavor of Tasks RCU, which relies on workqueues to detect grace periods.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h  | 370 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/rcu/update.c | 366 +--------------------------------------------------
 2 files changed, 372 insertions(+), 364 deletions(-)
 create mode 100644 kernel/rcu/tasks.h

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
new file mode 100644
index 0000000..be8d179
--- /dev/null
+++ b/kernel/rcu/tasks.h
@@ -0,0 +1,370 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * Task-based RCU implementations.
+ *
+ * Copyright (C) 2020 Paul E. McKenney
+ */
+
+#ifdef CONFIG_TASKS_RCU
+
+/*
+ * Simple variant of RCU whose quiescent states are voluntary context
+ * switch, cond_resched_rcu_qs(), user-space execution, and idle.
+ * As such, grace periods can take one good long time.  There are no
+ * read-side primitives similar to rcu_read_lock() and rcu_read_unlock()
+ * because this implementation is intended to get the system into a safe
+ * state for some of the manipulations involved in tracing and the like.
+ * Finally, this implementation does not support high call_rcu_tasks()
+ * rates from multiple CPUs.  If this is required, per-CPU callback lists
+ * will be needed.
+ */
+
+/* Global list of callbacks and associated lock. */
+static struct rcu_head *rcu_tasks_cbs_head;
+static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
+static DECLARE_WAIT_QUEUE_HEAD(rcu_tasks_cbs_wq);
+static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
+
+/* Track exiting tasks in order to allow them to be waited for. */
+DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
+
+/* Control stall timeouts.  Disable with <= 0, otherwise jiffies till stall. */
+#define RCU_TASK_STALL_TIMEOUT (HZ * 60 * 10)
+static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
+module_param(rcu_task_stall_timeout, int, 0644);
+
+static struct task_struct *rcu_tasks_kthread_ptr;
+
+/**
+ * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
+ * @rhp: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. call_rcu_tasks() assumes
+ * that the read-side critical sections end at a voluntary context
+ * switch (not a preemption!), cond_resched_rcu_qs(), entry into idle,
+ * or transition to usermode execution.  As such, there are no read-side
+ * primitives analogous to rcu_read_lock() and rcu_read_unlock() because
+ * this primitive is intended to determine that all tasks have passed
+ * through a safe state, not so much for data-strcuture synchronization.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
+ */
+void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
+{
+	unsigned long flags;
+	bool needwake;
+
+	rhp->next = NULL;
+	rhp->func = func;
+	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
+	needwake = !rcu_tasks_cbs_head;
+	WRITE_ONCE(*rcu_tasks_cbs_tail, rhp);
+	rcu_tasks_cbs_tail = &rhp->next;
+	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+	/* We can't create the thread unless interrupts are enabled. */
+	if (needwake && READ_ONCE(rcu_tasks_kthread_ptr))
+		wake_up(&rcu_tasks_cbs_wq);
+}
+EXPORT_SYMBOL_GPL(call_rcu_tasks);
+
+/**
+ * synchronize_rcu_tasks - wait until an rcu-tasks grace period has elapsed.
+ *
+ * Control will return to the caller some time after a full rcu-tasks
+ * grace period has elapsed, in other words after all currently
+ * executing rcu-tasks read-side critical sections have elapsed.  These
+ * read-side critical sections are delimited by calls to schedule(),
+ * cond_resched_tasks_rcu_qs(), idle execution, userspace execution, calls
+ * to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().
+ *
+ * This is a very specialized primitive, intended only for a few uses in
+ * tracing and other situations requiring manipulation of function
+ * preambles and profiling hooks.  The synchronize_rcu_tasks() function
+ * is not (yet) intended for heavy use from multiple CPUs.
+ *
+ * Note that this guarantee implies further memory-ordering guarantees.
+ * On systems with more than one CPU, when synchronize_rcu_tasks() returns,
+ * each CPU is guaranteed to have executed a full memory barrier since the
+ * end of its last RCU-tasks read-side critical section whose beginning
+ * preceded the call to synchronize_rcu_tasks().  In addition, each CPU
+ * having an RCU-tasks read-side critical section that extends beyond
+ * the return from synchronize_rcu_tasks() is guaranteed to have executed
+ * a full memory barrier after the beginning of synchronize_rcu_tasks()
+ * and before the beginning of that RCU-tasks read-side critical section.
+ * Note that these guarantees include CPUs that are offline, idle, or
+ * executing in user mode, as well as CPUs that are executing in the kernel.
+ *
+ * Furthermore, if CPU A invoked synchronize_rcu_tasks(), which returned
+ * to its caller on CPU B, then both CPU A and CPU B are guaranteed
+ * to have executed a full memory barrier during the execution of
+ * synchronize_rcu_tasks() -- even if CPU A and CPU B are the same CPU
+ * (but again only if the system has more than one CPU).
+ */
+void synchronize_rcu_tasks(void)
+{
+	/* Complain if the scheduler has not started.  */
+	RCU_LOCKDEP_WARN(rcu_scheduler_active == RCU_SCHEDULER_INACTIVE,
+			 "synchronize_rcu_tasks called too soon");
+
+	/* Wait for the grace period. */
+	wait_rcu_gp(call_rcu_tasks);
+}
+EXPORT_SYMBOL_GPL(synchronize_rcu_tasks);
+
+/**
+ * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
+ *
+ * Although the current implementation is guaranteed to wait, it is not
+ * obligated to, for example, if there are no pending callbacks.
+ */
+void rcu_barrier_tasks(void)
+{
+	/* There is only one callback queue, so this is easy.  ;-) */
+	synchronize_rcu_tasks();
+}
+EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
+
+/* See if tasks are still holding out, complain if so. */
+static void check_holdout_task(struct task_struct *t,
+			       bool needreport, bool *firstreport)
+{
+	int cpu;
+
+	if (!READ_ONCE(t->rcu_tasks_holdout) ||
+	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||
+	    !READ_ONCE(t->on_rq) ||
+	    (IS_ENABLED(CONFIG_NO_HZ_FULL) &&
+	     !is_idle_task(t) && t->rcu_tasks_idle_cpu >= 0)) {
+		WRITE_ONCE(t->rcu_tasks_holdout, false);
+		list_del_init(&t->rcu_tasks_holdout_list);
+		put_task_struct(t);
+		return;
+	}
+	rcu_request_urgent_qs_task(t);
+	if (!needreport)
+		return;
+	if (*firstreport) {
+		pr_err("INFO: rcu_tasks detected stalls on tasks:\n");
+		*firstreport = false;
+	}
+	cpu = task_cpu(t);
+	pr_alert("%p: %c%c nvcsw: %lu/%lu holdout: %d idle_cpu: %d/%d\n",
+		 t, ".I"[is_idle_task(t)],
+		 "N."[cpu < 0 || !tick_nohz_full_cpu(cpu)],
+		 t->rcu_tasks_nvcsw, t->nvcsw, t->rcu_tasks_holdout,
+		 t->rcu_tasks_idle_cpu, cpu);
+	sched_show_task(t);
+}
+
+/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
+static int __noreturn rcu_tasks_kthread(void *arg)
+{
+	unsigned long flags;
+	struct task_struct *g, *t;
+	unsigned long lastreport;
+	struct rcu_head *list;
+	struct rcu_head *next;
+	LIST_HEAD(rcu_tasks_holdouts);
+	int fract;
+
+	/* Run on housekeeping CPUs by default.  Sysadm can move if desired. */
+	housekeeping_affine(current, HK_FLAG_RCU);
+
+	/*
+	 * Each pass through the following loop makes one check for
+	 * newly arrived callbacks, and, if there are some, waits for
+	 * one RCU-tasks grace period and then invokes the callbacks.
+	 * This loop is terminated by the system going down.  ;-)
+	 */
+	for (;;) {
+
+		/* Pick up any new callbacks. */
+		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
+		list = rcu_tasks_cbs_head;
+		rcu_tasks_cbs_head = NULL;
+		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
+		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+
+		/* If there were none, wait a bit and start over. */
+		if (!list) {
+			wait_event_interruptible(rcu_tasks_cbs_wq,
+						 READ_ONCE(rcu_tasks_cbs_head));
+			if (!rcu_tasks_cbs_head) {
+				WARN_ON(signal_pending(current));
+				schedule_timeout_interruptible(HZ/10);
+			}
+			continue;
+		}
+
+		/*
+		 * Wait for all pre-existing t->on_rq and t->nvcsw
+		 * transitions to complete.  Invoking synchronize_rcu()
+		 * suffices because all these transitions occur with
+		 * interrupts disabled.  Without this synchronize_rcu(),
+		 * a read-side critical section that started before the
+		 * grace period might be incorrectly seen as having started
+		 * after the grace period.
+		 *
+		 * This synchronize_rcu() also dispenses with the
+		 * need for a memory barrier on the first store to
+		 * ->rcu_tasks_holdout, as it forces the store to happen
+		 * after the beginning of the grace period.
+		 */
+		synchronize_rcu();
+
+		/*
+		 * There were callbacks, so we need to wait for an
+		 * RCU-tasks grace period.  Start off by scanning
+		 * the task list for tasks that are not already
+		 * voluntarily blocked.  Mark these tasks and make
+		 * a list of them in rcu_tasks_holdouts.
+		 */
+		rcu_read_lock();
+		for_each_process_thread(g, t) {
+			if (t != current && READ_ONCE(t->on_rq) &&
+			    !is_idle_task(t)) {
+				get_task_struct(t);
+				t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
+				WRITE_ONCE(t->rcu_tasks_holdout, true);
+				list_add(&t->rcu_tasks_holdout_list,
+					 &rcu_tasks_holdouts);
+			}
+		}
+		rcu_read_unlock();
+
+		/*
+		 * Wait for tasks that are in the process of exiting.
+		 * This does only part of the job, ensuring that all
+		 * tasks that were previously exiting reach the point
+		 * where they have disabled preemption, allowing the
+		 * later synchronize_rcu() to finish the job.
+		 */
+		synchronize_srcu(&tasks_rcu_exit_srcu);
+
+		/*
+		 * Each pass through the following loop scans the list
+		 * of holdout tasks, removing any that are no longer
+		 * holdouts.  When the list is empty, we are done.
+		 */
+		lastreport = jiffies;
+
+		/* Start off with HZ/10 wait and slowly back off to 1 HZ wait*/
+		fract = 10;
+
+		for (;;) {
+			bool firstreport;
+			bool needreport;
+			int rtst;
+			struct task_struct *t1;
+
+			if (list_empty(&rcu_tasks_holdouts))
+				break;
+
+			/* Slowly back off waiting for holdouts */
+			schedule_timeout_interruptible(HZ/fract);
+
+			if (fract > 1)
+				fract--;
+
+			rtst = READ_ONCE(rcu_task_stall_timeout);
+			needreport = rtst > 0 &&
+				     time_after(jiffies, lastreport + rtst);
+			if (needreport)
+				lastreport = jiffies;
+			firstreport = true;
+			WARN_ON(signal_pending(current));
+			list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
+						rcu_tasks_holdout_list) {
+				check_holdout_task(t, needreport, &firstreport);
+				cond_resched();
+			}
+		}
+
+		/*
+		 * Because ->on_rq and ->nvcsw are not guaranteed
+		 * to have a full memory barriers prior to them in the
+		 * schedule() path, memory reordering on other CPUs could
+		 * cause their RCU-tasks read-side critical sections to
+		 * extend past the end of the grace period.  However,
+		 * because these ->nvcsw updates are carried out with
+		 * interrupts disabled, we can use synchronize_rcu()
+		 * to force the needed ordering on all such CPUs.
+		 *
+		 * This synchronize_rcu() also confines all
+		 * ->rcu_tasks_holdout accesses to be within the grace
+		 * period, avoiding the need for memory barriers for
+		 * ->rcu_tasks_holdout accesses.
+		 *
+		 * In addition, this synchronize_rcu() waits for exiting
+		 * tasks to complete their final preempt_disable() region
+		 * of execution, cleaning up after the synchronize_srcu()
+		 * above.
+		 */
+		synchronize_rcu();
+
+		/* Invoke the callbacks. */
+		while (list) {
+			next = list->next;
+			local_bh_disable();
+			list->func(list);
+			local_bh_enable();
+			list = next;
+			cond_resched();
+		}
+		/* Paranoid sleep to keep this from entering a tight loop */
+		schedule_timeout_uninterruptible(HZ/10);
+	}
+}
+
+/* Spawn rcu_tasks_kthread() at core_initcall() time. */
+static int __init rcu_spawn_tasks_kthread(void)
+{
+	struct task_struct *t;
+
+	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
+	if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
+		return 0;
+	smp_mb(); /* Ensure others see full kthread. */
+	WRITE_ONCE(rcu_tasks_kthread_ptr, t);
+	return 0;
+}
+core_initcall(rcu_spawn_tasks_kthread);
+
+/* Do the srcu_read_lock() for the above synchronize_srcu().  */
+void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
+{
+	preempt_disable();
+	current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu);
+	preempt_enable();
+}
+
+/* Do the srcu_read_unlock() for the above synchronize_srcu().  */
+void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
+{
+	preempt_disable();
+	__srcu_read_unlock(&tasks_rcu_exit_srcu, current->rcu_tasks_idx);
+	preempt_enable();
+}
+
+#endif /* #ifdef CONFIG_TASKS_RCU */
+
+#ifndef CONFIG_TINY_RCU
+
+/*
+ * Print any non-default Tasks RCU settings.
+ */
+static void __init rcu_tasks_bootup_oddness(void)
+{
+#ifdef CONFIG_TASKS_RCU
+	if (rcu_task_stall_timeout != RCU_TASK_STALL_TIMEOUT)
+		pr_info("\tTasks-RCU CPU stall warnings timeout set to %d (rcu_task_stall_timeout).\n", rcu_task_stall_timeout);
+	else
+		pr_info("\tTasks RCU enabled.\n");
+#endif /* #ifdef CONFIG_TASKS_RCU */
+}
+
+#endif /* #ifndef CONFIG_TINY_RCU */
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index dd837da..0fb2a9e 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -489,370 +489,6 @@ int rcu_cpu_stall_suppress_at_boot __read_mostly; // !0 = suppress boot stalls.
 EXPORT_SYMBOL_GPL(rcu_cpu_stall_suppress_at_boot);
 module_param(rcu_cpu_stall_suppress_at_boot, int, 0444);
 
-#ifdef CONFIG_TASKS_RCU
-
-/*
- * Simple variant of RCU whose quiescent states are voluntary context
- * switch, cond_resched_rcu_qs(), user-space execution, and idle.
- * As such, grace periods can take one good long time.  There are no
- * read-side primitives similar to rcu_read_lock() and rcu_read_unlock()
- * because this implementation is intended to get the system into a safe
- * state for some of the manipulations involved in tracing and the like.
- * Finally, this implementation does not support high call_rcu_tasks()
- * rates from multiple CPUs.  If this is required, per-CPU callback lists
- * will be needed.
- */
-
-/* Global list of callbacks and associated lock. */
-static struct rcu_head *rcu_tasks_cbs_head;
-static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
-static DECLARE_WAIT_QUEUE_HEAD(rcu_tasks_cbs_wq);
-static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
-
-/* Track exiting tasks in order to allow them to be waited for. */
-DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
-
-/* Control stall timeouts.  Disable with <= 0, otherwise jiffies till stall. */
-#define RCU_TASK_STALL_TIMEOUT (HZ * 60 * 10)
-static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
-module_param(rcu_task_stall_timeout, int, 0644);
-
-static struct task_struct *rcu_tasks_kthread_ptr;
-
-/**
- * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
- * @rhp: structure to be used for queueing the RCU updates.
- * @func: actual callback function to be invoked after the grace period
- *
- * The callback function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. call_rcu_tasks() assumes
- * that the read-side critical sections end at a voluntary context
- * switch (not a preemption!), cond_resched_rcu_qs(), entry into idle,
- * or transition to usermode execution.  As such, there are no read-side
- * primitives analogous to rcu_read_lock() and rcu_read_unlock() because
- * this primitive is intended to determine that all tasks have passed
- * through a safe state, not so much for data-strcuture synchronization.
- *
- * See the description of call_rcu() for more detailed information on
- * memory ordering guarantees.
- */
-void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
-{
-	unsigned long flags;
-	bool needwake;
-
-	rhp->next = NULL;
-	rhp->func = func;
-	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
-	needwake = !rcu_tasks_cbs_head;
-	WRITE_ONCE(*rcu_tasks_cbs_tail, rhp);
-	rcu_tasks_cbs_tail = &rhp->next;
-	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
-	/* We can't create the thread unless interrupts are enabled. */
-	if (needwake && READ_ONCE(rcu_tasks_kthread_ptr))
-		wake_up(&rcu_tasks_cbs_wq);
-}
-EXPORT_SYMBOL_GPL(call_rcu_tasks);
-
-/**
- * synchronize_rcu_tasks - wait until an rcu-tasks grace period has elapsed.
- *
- * Control will return to the caller some time after a full rcu-tasks
- * grace period has elapsed, in other words after all currently
- * executing rcu-tasks read-side critical sections have elapsed.  These
- * read-side critical sections are delimited by calls to schedule(),
- * cond_resched_tasks_rcu_qs(), idle execution, userspace execution, calls
- * to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().
- *
- * This is a very specialized primitive, intended only for a few uses in
- * tracing and other situations requiring manipulation of function
- * preambles and profiling hooks.  The synchronize_rcu_tasks() function
- * is not (yet) intended for heavy use from multiple CPUs.
- *
- * Note that this guarantee implies further memory-ordering guarantees.
- * On systems with more than one CPU, when synchronize_rcu_tasks() returns,
- * each CPU is guaranteed to have executed a full memory barrier since the
- * end of its last RCU-tasks read-side critical section whose beginning
- * preceded the call to synchronize_rcu_tasks().  In addition, each CPU
- * having an RCU-tasks read-side critical section that extends beyond
- * the return from synchronize_rcu_tasks() is guaranteed to have executed
- * a full memory barrier after the beginning of synchronize_rcu_tasks()
- * and before the beginning of that RCU-tasks read-side critical section.
- * Note that these guarantees include CPUs that are offline, idle, or
- * executing in user mode, as well as CPUs that are executing in the kernel.
- *
- * Furthermore, if CPU A invoked synchronize_rcu_tasks(), which returned
- * to its caller on CPU B, then both CPU A and CPU B are guaranteed
- * to have executed a full memory barrier during the execution of
- * synchronize_rcu_tasks() -- even if CPU A and CPU B are the same CPU
- * (but again only if the system has more than one CPU).
- */
-void synchronize_rcu_tasks(void)
-{
-	/* Complain if the scheduler has not started.  */
-	RCU_LOCKDEP_WARN(rcu_scheduler_active == RCU_SCHEDULER_INACTIVE,
-			 "synchronize_rcu_tasks called too soon");
-
-	/* Wait for the grace period. */
-	wait_rcu_gp(call_rcu_tasks);
-}
-EXPORT_SYMBOL_GPL(synchronize_rcu_tasks);
-
-/**
- * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
- *
- * Although the current implementation is guaranteed to wait, it is not
- * obligated to, for example, if there are no pending callbacks.
- */
-void rcu_barrier_tasks(void)
-{
-	/* There is only one callback queue, so this is easy.  ;-) */
-	synchronize_rcu_tasks();
-}
-EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
-
-/* See if tasks are still holding out, complain if so. */
-static void check_holdout_task(struct task_struct *t,
-			       bool needreport, bool *firstreport)
-{
-	int cpu;
-
-	if (!READ_ONCE(t->rcu_tasks_holdout) ||
-	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||
-	    !READ_ONCE(t->on_rq) ||
-	    (IS_ENABLED(CONFIG_NO_HZ_FULL) &&
-	     !is_idle_task(t) && t->rcu_tasks_idle_cpu >= 0)) {
-		WRITE_ONCE(t->rcu_tasks_holdout, false);
-		list_del_init(&t->rcu_tasks_holdout_list);
-		put_task_struct(t);
-		return;
-	}
-	rcu_request_urgent_qs_task(t);
-	if (!needreport)
-		return;
-	if (*firstreport) {
-		pr_err("INFO: rcu_tasks detected stalls on tasks:\n");
-		*firstreport = false;
-	}
-	cpu = task_cpu(t);
-	pr_alert("%p: %c%c nvcsw: %lu/%lu holdout: %d idle_cpu: %d/%d\n",
-		 t, ".I"[is_idle_task(t)],
-		 "N."[cpu < 0 || !tick_nohz_full_cpu(cpu)],
-		 t->rcu_tasks_nvcsw, t->nvcsw, t->rcu_tasks_holdout,
-		 t->rcu_tasks_idle_cpu, cpu);
-	sched_show_task(t);
-}
-
-/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
-static int __noreturn rcu_tasks_kthread(void *arg)
-{
-	unsigned long flags;
-	struct task_struct *g, *t;
-	unsigned long lastreport;
-	struct rcu_head *list;
-	struct rcu_head *next;
-	LIST_HEAD(rcu_tasks_holdouts);
-	int fract;
-
-	/* Run on housekeeping CPUs by default.  Sysadm can move if desired. */
-	housekeeping_affine(current, HK_FLAG_RCU);
-
-	/*
-	 * Each pass through the following loop makes one check for
-	 * newly arrived callbacks, and, if there are some, waits for
-	 * one RCU-tasks grace period and then invokes the callbacks.
-	 * This loop is terminated by the system going down.  ;-)
-	 */
-	for (;;) {
-
-		/* Pick up any new callbacks. */
-		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
-		list = rcu_tasks_cbs_head;
-		rcu_tasks_cbs_head = NULL;
-		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
-		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
-
-		/* If there were none, wait a bit and start over. */
-		if (!list) {
-			wait_event_interruptible(rcu_tasks_cbs_wq,
-						 READ_ONCE(rcu_tasks_cbs_head));
-			if (!rcu_tasks_cbs_head) {
-				WARN_ON(signal_pending(current));
-				schedule_timeout_interruptible(HZ/10);
-			}
-			continue;
-		}
-
-		/*
-		 * Wait for all pre-existing t->on_rq and t->nvcsw
-		 * transitions to complete.  Invoking synchronize_rcu()
-		 * suffices because all these transitions occur with
-		 * interrupts disabled.  Without this synchronize_rcu(),
-		 * a read-side critical section that started before the
-		 * grace period might be incorrectly seen as having started
-		 * after the grace period.
-		 *
-		 * This synchronize_rcu() also dispenses with the
-		 * need for a memory barrier on the first store to
-		 * ->rcu_tasks_holdout, as it forces the store to happen
-		 * after the beginning of the grace period.
-		 */
-		synchronize_rcu();
-
-		/*
-		 * There were callbacks, so we need to wait for an
-		 * RCU-tasks grace period.  Start off by scanning
-		 * the task list for tasks that are not already
-		 * voluntarily blocked.  Mark these tasks and make
-		 * a list of them in rcu_tasks_holdouts.
-		 */
-		rcu_read_lock();
-		for_each_process_thread(g, t) {
-			if (t != current && READ_ONCE(t->on_rq) &&
-			    !is_idle_task(t)) {
-				get_task_struct(t);
-				t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
-				WRITE_ONCE(t->rcu_tasks_holdout, true);
-				list_add(&t->rcu_tasks_holdout_list,
-					 &rcu_tasks_holdouts);
-			}
-		}
-		rcu_read_unlock();
-
-		/*
-		 * Wait for tasks that are in the process of exiting.
-		 * This does only part of the job, ensuring that all
-		 * tasks that were previously exiting reach the point
-		 * where they have disabled preemption, allowing the
-		 * later synchronize_rcu() to finish the job.
-		 */
-		synchronize_srcu(&tasks_rcu_exit_srcu);
-
-		/*
-		 * Each pass through the following loop scans the list
-		 * of holdout tasks, removing any that are no longer
-		 * holdouts.  When the list is empty, we are done.
-		 */
-		lastreport = jiffies;
-
-		/* Start off with HZ/10 wait and slowly back off to 1 HZ wait*/
-		fract = 10;
-
-		for (;;) {
-			bool firstreport;
-			bool needreport;
-			int rtst;
-			struct task_struct *t1;
-
-			if (list_empty(&rcu_tasks_holdouts))
-				break;
-
-			/* Slowly back off waiting for holdouts */
-			schedule_timeout_interruptible(HZ/fract);
-
-			if (fract > 1)
-				fract--;
-
-			rtst = READ_ONCE(rcu_task_stall_timeout);
-			needreport = rtst > 0 &&
-				     time_after(jiffies, lastreport + rtst);
-			if (needreport)
-				lastreport = jiffies;
-			firstreport = true;
-			WARN_ON(signal_pending(current));
-			list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
-						rcu_tasks_holdout_list) {
-				check_holdout_task(t, needreport, &firstreport);
-				cond_resched();
-			}
-		}
-
-		/*
-		 * Because ->on_rq and ->nvcsw are not guaranteed
-		 * to have a full memory barriers prior to them in the
-		 * schedule() path, memory reordering on other CPUs could
-		 * cause their RCU-tasks read-side critical sections to
-		 * extend past the end of the grace period.  However,
-		 * because these ->nvcsw updates are carried out with
-		 * interrupts disabled, we can use synchronize_rcu()
-		 * to force the needed ordering on all such CPUs.
-		 *
-		 * This synchronize_rcu() also confines all
-		 * ->rcu_tasks_holdout accesses to be within the grace
-		 * period, avoiding the need for memory barriers for
-		 * ->rcu_tasks_holdout accesses.
-		 *
-		 * In addition, this synchronize_rcu() waits for exiting
-		 * tasks to complete their final preempt_disable() region
-		 * of execution, cleaning up after the synchronize_srcu()
-		 * above.
-		 */
-		synchronize_rcu();
-
-		/* Invoke the callbacks. */
-		while (list) {
-			next = list->next;
-			local_bh_disable();
-			list->func(list);
-			local_bh_enable();
-			list = next;
-			cond_resched();
-		}
-		/* Paranoid sleep to keep this from entering a tight loop */
-		schedule_timeout_uninterruptible(HZ/10);
-	}
-}
-
-/* Spawn rcu_tasks_kthread() at core_initcall() time. */
-static int __init rcu_spawn_tasks_kthread(void)
-{
-	struct task_struct *t;
-
-	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
-	if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
-		return 0;
-	smp_mb(); /* Ensure others see full kthread. */
-	WRITE_ONCE(rcu_tasks_kthread_ptr, t);
-	return 0;
-}
-core_initcall(rcu_spawn_tasks_kthread);
-
-/* Do the srcu_read_lock() for the above synchronize_srcu().  */
-void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
-{
-	preempt_disable();
-	current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu);
-	preempt_enable();
-}
-
-/* Do the srcu_read_unlock() for the above synchronize_srcu().  */
-void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
-{
-	preempt_disable();
-	__srcu_read_unlock(&tasks_rcu_exit_srcu, current->rcu_tasks_idx);
-	preempt_enable();
-}
-
-#endif /* #ifdef CONFIG_TASKS_RCU */
-
-#ifndef CONFIG_TINY_RCU
-
-/*
- * Print any non-default Tasks RCU settings.
- */
-static void __init rcu_tasks_bootup_oddness(void)
-{
-#ifdef CONFIG_TASKS_RCU
-	if (rcu_task_stall_timeout != RCU_TASK_STALL_TIMEOUT)
-		pr_info("\tTasks-RCU CPU stall warnings timeout set to %d (rcu_task_stall_timeout).\n", rcu_task_stall_timeout);
-	else
-		pr_info("\tTasks RCU enabled.\n");
-#endif /* #ifdef CONFIG_TASKS_RCU */
-}
-
-#endif /* #ifndef CONFIG_TINY_RCU */
-
 #ifdef CONFIG_PROVE_RCU
 
 /*
@@ -923,6 +559,8 @@ late_initcall(rcu_verify_early_boot_tests);
 void rcu_early_boot_tests(void) {}
 #endif /* CONFIG_PROVE_RCU */
 
+#include "tasks.h"
+
 #ifndef CONFIG_TINY_RCU
 
 /*
-- 
2.9.5



* [PATCH v3 tip/core/rcu 05/34] rcu-tasks: Create struct to hold state information
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (3 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 04/34] rcu-tasks: Move Tasks RCU to its own file paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 06/34] rcu: Reinstate synchronize_rcu_mult() paulmck
                       ` (29 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit creates an rcu_tasks struct to hold state information for
RCU Tasks.  This is a preparation commit for adding additional flavors
of Tasks RCU, each of which would have its own rcu_tasks struct.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
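Reviewer's sketch, not part of the patch: the per-flavor state gathered
into struct rcu_tasks centers on a head pointer plus a tail
pointer-to-pointer.  The userspace model below is a minimal illustration
of that queueing pattern only.  The names cb_head, tasks_state,
tasks_enqueue(), tasks_drain(), demo_state, and demo_cb() are
hypothetical, and the lock, wait queue, and kthread are deliberately
left out.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct rcu_head (hypothetical). */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

/* Per-flavor state; the lock, wait queue, and kthread are omitted. */
struct tasks_state {
	struct cb_head *cbs_head;	/* First queued callback, or NULL. */
	struct cb_head **cbs_tail;	/* Where the next callback is linked. */
};

/* Static definition whose tail starts out pointing at its own head. */
#define DEFINE_TASKS_STATE(name) \
	static struct tasks_state name = { .cbs_tail = &name.cbs_head }

/* Append a callback; returns nonzero if the list was previously empty. */
static int tasks_enqueue(struct tasks_state *sp, struct cb_head *chp,
			 void (*func)(struct cb_head *head))
{
	int needwake;

	assert(sp && chp && func);
	chp->next = NULL;
	chp->func = func;
	needwake = !sp->cbs_head;
	*sp->cbs_tail = chp;		/* Link at the current tail. */
	sp->cbs_tail = &chp->next;	/* Advance the tail. */
	return needwake;
}

/* Detach the whole list, reset the queue, invoke in order; returns count. */
static int tasks_drain(struct tasks_state *sp)
{
	struct cb_head *list = sp->cbs_head;
	struct cb_head *next;
	int n = 0;

	sp->cbs_head = NULL;
	sp->cbs_tail = &sp->cbs_head;
	while (list) {
		next = list->next;	/* Fetch before the callback runs. */
		list->func(list);
		list = next;
		n++;
	}
	return n;
}

/* Demo instance and callback, for illustration only. */
DEFINE_TASKS_STATE(demo_state);
static int demo_invoked;

static void demo_cb(struct cb_head *head)
{
	(void)head;
	demo_invoked++;
}
```

In this model, adding a second flavor amounts to a second
DEFINE_TASKS_STATE() instance, which mirrors the motivation given in the
commit log above.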
 kernel/rcu/tasks.h | 73 ++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 46 insertions(+), 27 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index be8d179..5ccfe0d 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -7,6 +7,30 @@
 
 #ifdef CONFIG_TASKS_RCU
 
+/**
+ * Definition for a Tasks-RCU-like mechanism.
+ * @cbs_head: Head of callback list.
+ * @cbs_tail: Tail pointer for callback list.
+ * @cbs_wq: Wait queue allowning new callback to get kthread's attention.
+ * @cbs_lock: Lock protecting callback list.
+ * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
+ */
+struct rcu_tasks {
+	struct rcu_head *cbs_head;
+	struct rcu_head **cbs_tail;
+	struct wait_queue_head cbs_wq;
+	raw_spinlock_t cbs_lock;
+	struct task_struct *kthread_ptr;
+};
+
+#define DEFINE_RCU_TASKS(name)						\
+static struct rcu_tasks name =						\
+{									\
+	.cbs_tail = &name.cbs_head,					\
+	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(name.cbs_wq),		\
+	.cbs_lock = __RAW_SPIN_LOCK_UNLOCKED(name.cbs_lock),		\
+}
+
 /*
  * Simple variant of RCU whose quiescent states are voluntary context
  * switch, cond_resched_rcu_qs(), user-space execution, and idle.
@@ -18,12 +42,7 @@
  * rates from multiple CPUs.  If this is required, per-CPU callback lists
  * will be needed.
  */
-
-/* Global list of callbacks and associated lock. */
-static struct rcu_head *rcu_tasks_cbs_head;
-static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
-static DECLARE_WAIT_QUEUE_HEAD(rcu_tasks_cbs_wq);
-static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
+DEFINE_RCU_TASKS(rcu_tasks);
 
 /* Track exiting tasks in order to allow them to be waited for. */
 DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
@@ -33,8 +52,6 @@ DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
 module_param(rcu_task_stall_timeout, int, 0644);
 
-static struct task_struct *rcu_tasks_kthread_ptr;
-
 /**
  * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
  * @rhp: structure to be used for queueing the RCU updates.
@@ -57,17 +74,18 @@ void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
 {
 	unsigned long flags;
 	bool needwake;
+	struct rcu_tasks *rtp = &rcu_tasks;
 
 	rhp->next = NULL;
 	rhp->func = func;
-	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
-	needwake = !rcu_tasks_cbs_head;
-	WRITE_ONCE(*rcu_tasks_cbs_tail, rhp);
-	rcu_tasks_cbs_tail = &rhp->next;
-	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+	raw_spin_lock_irqsave(&rtp->cbs_lock, flags);
+	needwake = !rtp->cbs_head;
+	WRITE_ONCE(*rtp->cbs_tail, rhp);
+	rtp->cbs_tail = &rhp->next;
+	raw_spin_unlock_irqrestore(&rtp->cbs_lock, flags);
 	/* We can't create the thread unless interrupts are enabled. */
-	if (needwake && READ_ONCE(rcu_tasks_kthread_ptr))
-		wake_up(&rcu_tasks_cbs_wq);
+	if (needwake && READ_ONCE(rtp->kthread_ptr))
+		wake_up(&rtp->cbs_wq);
 }
 EXPORT_SYMBOL_GPL(call_rcu_tasks);
 
@@ -169,10 +187,12 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 	struct rcu_head *list;
 	struct rcu_head *next;
 	LIST_HEAD(rcu_tasks_holdouts);
+	struct rcu_tasks *rtp = arg;
 	int fract;
 
 	/* Run on housekeeping CPUs by default.  Sysadm can move if desired. */
 	housekeeping_affine(current, HK_FLAG_RCU);
+	WRITE_ONCE(rtp->kthread_ptr, current); // Let GPs start!
 
 	/*
 	 * Each pass through the following loop makes one check for
@@ -183,17 +203,17 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 	for (;;) {
 
 		/* Pick up any new callbacks. */
-		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
-		list = rcu_tasks_cbs_head;
-		rcu_tasks_cbs_head = NULL;
-		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
-		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+		raw_spin_lock_irqsave(&rtp->cbs_lock, flags);
+		list = rtp->cbs_head;
+		rtp->cbs_head = NULL;
+		rtp->cbs_tail = &rtp->cbs_head;
+		raw_spin_unlock_irqrestore(&rtp->cbs_lock, flags);
 
 		/* If there were none, wait a bit and start over. */
 		if (!list) {
-			wait_event_interruptible(rcu_tasks_cbs_wq,
-						 READ_ONCE(rcu_tasks_cbs_head));
-			if (!rcu_tasks_cbs_head) {
+			wait_event_interruptible(rtp->cbs_wq,
+						 READ_ONCE(rtp->cbs_head));
+			if (!rtp->cbs_head) {
 				WARN_ON(signal_pending(current));
 				schedule_timeout_interruptible(HZ/10);
 			}
@@ -211,7 +231,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 		 *
 		 * This synchronize_rcu() also dispenses with the
 		 * need for a memory barrier on the first store to
-		 * ->rcu_tasks_holdout, as it forces the store to happen
+		 * t->rcu_tasks_holdout, as it forces the store to happen
 		 * after the beginning of the grace period.
 		 */
 		synchronize_rcu();
@@ -278,7 +298,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 			firstreport = true;
 			WARN_ON(signal_pending(current));
 			list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
-						rcu_tasks_holdout_list) {
+						 rcu_tasks_holdout_list) {
 				check_holdout_task(t, needreport, &firstreport);
 				cond_resched();
 			}
@@ -325,11 +345,10 @@ static int __init rcu_spawn_tasks_kthread(void)
 {
 	struct task_struct *t;
 
-	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
+	t = kthread_run(rcu_tasks_kthread, &rcu_tasks, "rcu_tasks_kthread");
 	if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
 		return 0;
 	smp_mb(); /* Ensure others see full kthread. */
-	WRITE_ONCE(rcu_tasks_kthread_ptr, t);
 	return 0;
 }
 core_initcall(rcu_spawn_tasks_kthread);
-- 
2.9.5



* [PATCH v3 tip/core/rcu 06/34] rcu: Reinstate synchronize_rcu_mult()
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (4 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 05/34] rcu-tasks: Create struct to hold state information paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 07/34] rcutorture: Add a test for synchronize_rcu_mult() paulmck
                       ` (28 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

With the advent and likely usage of synchronize_rcu_tasks_rude(), there is
again a need to wait on multiple types of RCU grace periods, for
example, call_rcu_tasks() and call_rcu_tasks_rude().  This commit
therefore reinstates synchronize_rcu_mult() in order to allow these
grace periods to be straightforwardly waited on concurrently.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
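Reviewer's sketch, not part of the patch: the macro takes a variadic
list of call_rcu()-style functions, posts one callback per flavor, and
only then waits, so the grace periods overlap instead of running back
to back.  The userspace model below illustrates that two-phase pattern
under strong simplifying assumptions.  Every toy_* name is hypothetical,
and the toy flavors complete their "grace period" immediately, which a
real flavor would not do.

```c
#include <assert.h>
#include <stddef.h>

struct toy_head {
	void (*func)(struct toy_head *head);
};

typedef void (*toy_cb_t)(struct toy_head *head);
typedef void (*toy_call_func_t)(struct toy_head *head, toy_cb_t func);

/* One completion record per grace-period flavor being waited on. */
struct toy_synchronize {
	struct toy_head head;	/* Must be first: the callback casts back. */
	int done;
};

/* Callback posted to each flavor: mark that flavor's record complete. */
static void toy_wakeme(struct toy_head *head)
{
	((struct toy_synchronize *)head)->done = 1;
}

static void toy_wait_gp(size_t n, toy_call_func_t *crcu,
			struct toy_synchronize *rs)
{
	size_t i;

	/* Phase 1: post one callback per flavor so the waits overlap. */
	for (i = 0; i < n; i++) {
		rs[i].done = 0;
		crcu[i](&rs[i].head, toy_wakeme);
	}
	/* Phase 2: collect completions (real code would block here). */
	for (i = 0; i < n; i++)
		assert(rs[i].done);
}

/* Build the function array from the argument list, then wait on all. */
#define toy_synchronize_mult(...)					\
do {									\
	toy_call_func_t crcu_array[] = { __VA_ARGS__ };			\
	struct toy_synchronize						\
		rs_array[sizeof(crcu_array) / sizeof(crcu_array[0])];	\
	toy_wait_gp(sizeof(crcu_array) / sizeof(crcu_array[0]),		\
		    crcu_array, rs_array);				\
} while (0)

/* Two toy flavors whose grace periods elapse immediately. */
static int toy_calls_a, toy_calls_b;

static void toy_call_a(struct toy_head *head, toy_cb_t func)
{
	toy_calls_a++;
	head->func = func;
	head->func(head);
}

static void toy_call_b(struct toy_head *head, toy_cb_t func)
{
	toy_calls_b++;
	head->func = func;
	head->func(head);
}
```

A caller in this model would write toy_synchronize_mult(toy_call_a,
toy_call_b), analogous to the synchronize_rcu_mult(call_rcu,
call_rcu_tasks) example in the kernel-doc comment below.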
 include/linux/rcupdate_wait.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
index c0578ba..699b938 100644
--- a/include/linux/rcupdate_wait.h
+++ b/include/linux/rcupdate_wait.h
@@ -31,4 +31,23 @@ do {									\
 
 #define wait_rcu_gp(...) _wait_rcu_gp(false, __VA_ARGS__)
 
+/**
+ * synchronize_rcu_mult - Wait concurrently for multiple grace periods
+ * @...: List of call_rcu() functions for different grace periods to wait on
+ *
+ * This macro waits concurrently for multiple types of RCU grace periods.
+ * For example, synchronize_rcu_mult(call_rcu, call_rcu_tasks) would wait
+ * on concurrent RCU and RCU-tasks grace periods.  Waiting on a given SRCU
+ * domain requires you to write a wrapper function for that SRCU domain's
+ * call_srcu() function, with this wrapper supplying the pointer to the
+ * corresponding srcu_struct.
+ *
+ * The first argument tells Tiny RCU's _wait_rcu_gp() not to
+ * bother waiting for RCU.  The reason for this is because anywhere
+ * synchronize_rcu_mult() can be called is automatically already a full
+ * grace period.
+ */
+#define synchronize_rcu_mult(...) \
+	_wait_rcu_gp(IS_ENABLED(CONFIG_TINY_RCU), __VA_ARGS__)
+
 #endif /* _LINUX_SCHED_RCUPDATE_WAIT_H */
-- 
2.9.5



* [PATCH v3 tip/core/rcu 07/34] rcutorture: Add a test for synchronize_rcu_mult()
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (5 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 06/34] rcu: Reinstate synchronize_rcu_mult() paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 08/34] rcu-tasks: Refactor RCU-tasks to allow variants to be added paulmck
                       ` (27 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds a crude test for synchronize_rcu_mult().  This is
currently a smoke test rather than a high-quality stress test.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/rcutorture.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index ada5b91..88631f5 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -20,7 +20,7 @@
 #include <linux/err.h>
 #include <linux/spinlock.h>
 #include <linux/smp.h>
-#include <linux/rcupdate.h>
+#include <linux/rcupdate_wait.h>
 #include <linux/interrupt.h>
 #include <linux/sched/signal.h>
 #include <uapi/linux/sched/types.h>
@@ -666,6 +666,11 @@ static void rcu_tasks_torture_deferred_free(struct rcu_torture *p)
 	call_rcu_tasks(&p->rtort_rcu, rcu_torture_cb);
 }
 
+static void synchronize_rcu_mult_test(void)
+{
+	synchronize_rcu_mult(call_rcu_tasks, call_rcu);
+}
+
 static struct rcu_torture_ops tasks_ops = {
 	.ttype		= RCU_TASKS_FLAVOR,
 	.init		= rcu_sync_torture_init,
@@ -675,7 +680,7 @@ static struct rcu_torture_ops tasks_ops = {
 	.get_gp_seq	= rcu_no_completed,
 	.deferred_free	= rcu_tasks_torture_deferred_free,
 	.sync		= synchronize_rcu_tasks,
-	.exp_sync	= synchronize_rcu_tasks,
+	.exp_sync	= synchronize_rcu_mult_test,
 	.call		= call_rcu_tasks,
 	.cb_barrier	= rcu_barrier_tasks,
 	.fqs		= NULL,
-- 
2.9.5



* [PATCH v3 tip/core/rcu 08/34] rcu-tasks: Refactor RCU-tasks to allow variants to be added
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (6 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 07/34] rcutorture: Add a test for synchronize_rcu_mult() paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 09/34] rcu-tasks: Add an RCU-tasks rude variant paulmck
                       ` (26 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit splits out generic processing from RCU-tasks-specific
processing in order to allow additional flavors to be added.  It also
adds a def_bool TASKS_RCU_GENERIC to enable the common RCU-tasks
infrastructure code.

This is primarily, but not entirely, a code-movement commit.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
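Reviewer's sketch, not part of the patch: the split into generic and
flavor-specific code relies on a per-flavor function pointer (gp_func
in the patch) that generic code can call without knowing which flavor
it is serving.  The userspace model below is a minimal illustration of
that dispatch idea.  All toy_* names are hypothetical, and the
assumption that the generic pass calls the flavor's grace-period
function before invoking callbacks is an inference from the field
descriptions, not a statement about the patch's exact control flow.

```c
#include <assert.h>
#include <stddef.h>

struct toy_flavor;
typedef void (*toy_gp_func_t)(struct toy_flavor *fp);

enum toy_gp_kind { TOY_GP_NONE, TOY_GP_SCAN, TOY_GP_RUDE };

struct toy_flavor {
	int pending;			/* Callbacks queued, not yet invoked. */
	int invoked;			/* Callbacks invoked so far. */
	int gp_count;			/* Grace periods waited for. */
	enum toy_gp_kind gp_kind;	/* Which wait function ran last. */
	toy_gp_func_t gp_func;		/* Flavor-specific grace-period wait. */
};

#define DEFINE_TOY_FLAVOR(name, gp) \
	static struct toy_flavor name = { .gp_func = gp }

/* Flavor-specific pieces: stand-ins for two ways of waiting. */
static void toy_gp_scan(struct toy_flavor *fp)
{
	fp->gp_kind = TOY_GP_SCAN;
	fp->gp_count++;
}

static void toy_gp_rude(struct toy_flavor *fp)
{
	fp->gp_kind = TOY_GP_RUDE;
	fp->gp_count++;
}

/*
 * Generic piece: if callbacks are pending, wait for one grace period
 * of whatever flavor this is, then account the callbacks as invoked.
 * Returns the number of callbacks handled by this pass.
 */
static int toy_flavor_pass(struct toy_flavor *fp)
{
	int n = fp->pending;

	assert(fp->gp_func);
	if (!n)
		return 0;
	fp->pending = 0;
	fp->gp_func(fp);	/* Flavor-specific dispatch. */
	fp->invoked += n;
	return n;
}

/* Two demo flavors sharing the generic pass function. */
DEFINE_TOY_FLAVOR(toy_classic, toy_gp_scan);
DEFINE_TOY_FLAVOR(toy_rude, toy_gp_rude);
```

The point of the design in this model is that toy_flavor_pass() never
changes when a flavor is added; only a new wait function and a new
DEFINE_TOY_FLAVOR() line are needed.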
 include/linux/rcupdate.h |   6 +-
 kernel/rcu/Kconfig       |  10 +-
 kernel/rcu/tasks.h       | 491 +++++++++++++++++++++++++----------------------
 kernel/rcu/update.c      |   4 +
 4 files changed, 272 insertions(+), 239 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 2678a37..5523145 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -129,7 +129,7 @@ static inline void rcu_init_nohz(void) { }
  * Note a quasi-voluntary context switch for RCU-tasks's benefit.
  * This is a macro rather than an inline function to avoid #include hell.
  */
-#ifdef CONFIG_TASKS_RCU
+#ifdef CONFIG_TASKS_RCU_GENERIC
 #define rcu_tasks_qs(t) \
 	do { \
 		if (READ_ONCE((t)->rcu_tasks_holdout)) \
@@ -140,14 +140,14 @@ void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
 void synchronize_rcu_tasks(void);
 void exit_tasks_rcu_start(void);
 void exit_tasks_rcu_finish(void);
-#else /* #ifdef CONFIG_TASKS_RCU */
+#else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
 #define rcu_tasks_qs(t)	do { } while (0)
 #define rcu_note_voluntary_context_switch(t) do { } while (0)
 #define call_rcu_tasks call_rcu
 #define synchronize_rcu_tasks synchronize_rcu
 static inline void exit_tasks_rcu_start(void) { }
 static inline void exit_tasks_rcu_finish(void) { }
-#endif /* #else #ifdef CONFIG_TASKS_RCU */
+#endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
 
 /**
  * cond_resched_tasks_rcu_qs - Report potential quiescent states to RCU
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 1cc940f..38475d0 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -70,13 +70,19 @@ config TREE_SRCU
 	help
 	  This option selects the full-fledged version of SRCU.
 
+config TASKS_RCU_GENERIC
+	def_bool TASKS_RCU
+	select SRCU
+	help
+	  This option enables generic infrastructure code supporting
+	  task-based RCU implementations.  Not for manual selection.
+
 config TASKS_RCU
 	def_bool PREEMPTION
-	select SRCU
 	help
 	  This option enables a task-based RCU implementation that uses
 	  only voluntary context switch (not preemption!), idle, and
-	  user-mode execution as quiescent states.
+	  user-mode execution as quiescent states.  Not for manual selection.
 
 config RCU_STALL_COMMON
 	def_bool TREE_RCU
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 5ccfe0d..d77921e 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -5,7 +5,13 @@
  * Copyright (C) 2020 Paul E. McKenney
  */
 
-#ifdef CONFIG_TASKS_RCU
+
+////////////////////////////////////////////////////////////////////////
+//
+// Generic data structures.
+
+struct rcu_tasks;
+typedef void (*rcu_tasks_gp_func_t)(struct rcu_tasks *rtp);
 
 /**
  * Definition for a Tasks-RCU-like mechanism.
@@ -14,6 +20,8 @@
  * @cbs_wq: Wait queue allowning new callback to get kthread's attention.
  * @cbs_lock: Lock protecting callback list.
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
+ * @gp_func: This flavor's grace-period-wait function.
+ * @call_func: This flavor's call_rcu()-equivalent function.
  */
 struct rcu_tasks {
 	struct rcu_head *cbs_head;
@@ -21,29 +29,20 @@ struct rcu_tasks {
 	struct wait_queue_head cbs_wq;
 	raw_spinlock_t cbs_lock;
 	struct task_struct *kthread_ptr;
+	rcu_tasks_gp_func_t gp_func;
+	call_rcu_func_t call_func;
 };
 
-#define DEFINE_RCU_TASKS(name)						\
+#define DEFINE_RCU_TASKS(name, gp, call)				\
 static struct rcu_tasks name =						\
 {									\
 	.cbs_tail = &name.cbs_head,					\
 	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(name.cbs_wq),		\
 	.cbs_lock = __RAW_SPIN_LOCK_UNLOCKED(name.cbs_lock),		\
+	.gp_func = gp,							\
+	.call_func = call,						\
 }
 
-/*
- * Simple variant of RCU whose quiescent states are voluntary context
- * switch, cond_resched_rcu_qs(), user-space execution, and idle.
- * As such, grace periods can take one good long time.  There are no
- * read-side primitives similar to rcu_read_lock() and rcu_read_unlock()
- * because this implementation is intended to get the system into a safe
- * state for some of the manipulations involved in tracing and the like.
- * Finally, this implementation does not support high call_rcu_tasks()
- * rates from multiple CPUs.  If this is required, per-CPU callback lists
- * will be needed.
- */
-DEFINE_RCU_TASKS(rcu_tasks);
-
 /* Track exiting tasks in order to allow them to be waited for. */
 DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 
@@ -52,29 +51,16 @@ DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
 module_param(rcu_task_stall_timeout, int, 0644);
 
-/**
- * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
- * @rhp: structure to be used for queueing the RCU updates.
- * @func: actual callback function to be invoked after the grace period
- *
- * The callback function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. call_rcu_tasks() assumes
- * that the read-side critical sections end at a voluntary context
- * switch (not a preemption!), cond_resched_rcu_qs(), entry into idle,
- * or transition to usermode execution.  As such, there are no read-side
- * primitives analogous to rcu_read_lock() and rcu_read_unlock() because
- * this primitive is intended to determine that all tasks have passed
- * through a safe state, not so much for data-strcuture synchronization.
- *
- * See the description of call_rcu() for more detailed information on
- * memory ordering guarantees.
- */
-void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
+////////////////////////////////////////////////////////////////////////
+//
+// Generic code.
+
+// Enqueue a callback for the specified flavor of Tasks RCU.
+static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
+				   struct rcu_tasks *rtp)
 {
 	unsigned long flags;
 	bool needwake;
-	struct rcu_tasks *rtp = &rcu_tasks;
 
 	rhp->next = NULL;
 	rhp->func = func;
@@ -87,108 +73,25 @@ void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
 	if (needwake && READ_ONCE(rtp->kthread_ptr))
 		wake_up(&rtp->cbs_wq);
 }
-EXPORT_SYMBOL_GPL(call_rcu_tasks);
 
-/**
- * synchronize_rcu_tasks - wait until an rcu-tasks grace period has elapsed.
- *
- * Control will return to the caller some time after a full rcu-tasks
- * grace period has elapsed, in other words after all currently
- * executing rcu-tasks read-side critical sections have elapsed.  These
- * read-side critical sections are delimited by calls to schedule(),
- * cond_resched_tasks_rcu_qs(), idle execution, userspace execution, calls
- * to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().
- *
- * This is a very specialized primitive, intended only for a few uses in
- * tracing and other situations requiring manipulation of function
- * preambles and profiling hooks.  The synchronize_rcu_tasks() function
- * is not (yet) intended for heavy use from multiple CPUs.
- *
- * Note that this guarantee implies further memory-ordering guarantees.
- * On systems with more than one CPU, when synchronize_rcu_tasks() returns,
- * each CPU is guaranteed to have executed a full memory barrier since the
- * end of its last RCU-tasks read-side critical section whose beginning
- * preceded the call to synchronize_rcu_tasks().  In addition, each CPU
- * having an RCU-tasks read-side critical section that extends beyond
- * the return from synchronize_rcu_tasks() is guaranteed to have executed
- * a full memory barrier after the beginning of synchronize_rcu_tasks()
- * and before the beginning of that RCU-tasks read-side critical section.
- * Note that these guarantees include CPUs that are offline, idle, or
- * executing in user mode, as well as CPUs that are executing in the kernel.
- *
- * Furthermore, if CPU A invoked synchronize_rcu_tasks(), which returned
- * to its caller on CPU B, then both CPU A and CPU B are guaranteed
- * to have executed a full memory barrier during the execution of
- * synchronize_rcu_tasks() -- even if CPU A and CPU B are the same CPU
- * (but again only if the system has more than one CPU).
- */
-void synchronize_rcu_tasks(void)
+// Wait for a grace period for the specified flavor of Tasks RCU.
+static void synchronize_rcu_tasks_generic(struct rcu_tasks *rtp)
 {
 	/* Complain if the scheduler has not started.  */
 	RCU_LOCKDEP_WARN(rcu_scheduler_active == RCU_SCHEDULER_INACTIVE,
 			 "synchronize_rcu_tasks called too soon");
 
 	/* Wait for the grace period. */
-	wait_rcu_gp(call_rcu_tasks);
-}
-EXPORT_SYMBOL_GPL(synchronize_rcu_tasks);
-
-/**
- * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
- *
- * Although the current implementation is guaranteed to wait, it is not
- * obligated to, for example, if there are no pending callbacks.
- */
-void rcu_barrier_tasks(void)
-{
-	/* There is only one callback queue, so this is easy.  ;-) */
-	synchronize_rcu_tasks();
-}
-EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
-
-/* See if tasks are still holding out, complain if so. */
-static void check_holdout_task(struct task_struct *t,
-			       bool needreport, bool *firstreport)
-{
-	int cpu;
-
-	if (!READ_ONCE(t->rcu_tasks_holdout) ||
-	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||
-	    !READ_ONCE(t->on_rq) ||
-	    (IS_ENABLED(CONFIG_NO_HZ_FULL) &&
-	     !is_idle_task(t) && t->rcu_tasks_idle_cpu >= 0)) {
-		WRITE_ONCE(t->rcu_tasks_holdout, false);
-		list_del_init(&t->rcu_tasks_holdout_list);
-		put_task_struct(t);
-		return;
-	}
-	rcu_request_urgent_qs_task(t);
-	if (!needreport)
-		return;
-	if (*firstreport) {
-		pr_err("INFO: rcu_tasks detected stalls on tasks:\n");
-		*firstreport = false;
-	}
-	cpu = task_cpu(t);
-	pr_alert("%p: %c%c nvcsw: %lu/%lu holdout: %d idle_cpu: %d/%d\n",
-		 t, ".I"[is_idle_task(t)],
-		 "N."[cpu < 0 || !tick_nohz_full_cpu(cpu)],
-		 t->rcu_tasks_nvcsw, t->nvcsw, t->rcu_tasks_holdout,
-		 t->rcu_tasks_idle_cpu, cpu);
-	sched_show_task(t);
+	wait_rcu_gp(rtp->call_func);
 }
 
 /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
 static int __noreturn rcu_tasks_kthread(void *arg)
 {
 	unsigned long flags;
-	struct task_struct *g, *t;
-	unsigned long lastreport;
 	struct rcu_head *list;
 	struct rcu_head *next;
-	LIST_HEAD(rcu_tasks_holdouts);
 	struct rcu_tasks *rtp = arg;
-	int fract;
 
 	/* Run on housekeeping CPUs by default.  Sysadm can move if desired. */
 	housekeeping_affine(current, HK_FLAG_RCU);
@@ -220,111 +123,8 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 			continue;
 		}
 
-		/*
-		 * Wait for all pre-existing t->on_rq and t->nvcsw
-		 * transitions to complete.  Invoking synchronize_rcu()
-		 * suffices because all these transitions occur with
-		 * interrupts disabled.  Without this synchronize_rcu(),
-		 * a read-side critical section that started before the
-		 * grace period might be incorrectly seen as having started
-		 * after the grace period.
-		 *
-		 * This synchronize_rcu() also dispenses with the
-		 * need for a memory barrier on the first store to
-		 * t->rcu_tasks_holdout, as it forces the store to happen
-		 * after the beginning of the grace period.
-		 */
-		synchronize_rcu();
-
-		/*
-		 * There were callbacks, so we need to wait for an
-		 * RCU-tasks grace period.  Start off by scanning
-		 * the task list for tasks that are not already
-		 * voluntarily blocked.  Mark these tasks and make
-		 * a list of them in rcu_tasks_holdouts.
-		 */
-		rcu_read_lock();
-		for_each_process_thread(g, t) {
-			if (t != current && READ_ONCE(t->on_rq) &&
-			    !is_idle_task(t)) {
-				get_task_struct(t);
-				t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
-				WRITE_ONCE(t->rcu_tasks_holdout, true);
-				list_add(&t->rcu_tasks_holdout_list,
-					 &rcu_tasks_holdouts);
-			}
-		}
-		rcu_read_unlock();
-
-		/*
-		 * Wait for tasks that are in the process of exiting.
-		 * This does only part of the job, ensuring that all
-		 * tasks that were previously exiting reach the point
-		 * where they have disabled preemption, allowing the
-		 * later synchronize_rcu() to finish the job.
-		 */
-		synchronize_srcu(&tasks_rcu_exit_srcu);
-
-		/*
-		 * Each pass through the following loop scans the list
-		 * of holdout tasks, removing any that are no longer
-		 * holdouts.  When the list is empty, we are done.
-		 */
-		lastreport = jiffies;
-
-		/* Start off with HZ/10 wait and slowly back off to 1 HZ wait*/
-		fract = 10;
-
-		for (;;) {
-			bool firstreport;
-			bool needreport;
-			int rtst;
-			struct task_struct *t1;
-
-			if (list_empty(&rcu_tasks_holdouts))
-				break;
-
-			/* Slowly back off waiting for holdouts */
-			schedule_timeout_interruptible(HZ/fract);
-
-			if (fract > 1)
-				fract--;
-
-			rtst = READ_ONCE(rcu_task_stall_timeout);
-			needreport = rtst > 0 &&
-				     time_after(jiffies, lastreport + rtst);
-			if (needreport)
-				lastreport = jiffies;
-			firstreport = true;
-			WARN_ON(signal_pending(current));
-			list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
-						 rcu_tasks_holdout_list) {
-				check_holdout_task(t, needreport, &firstreport);
-				cond_resched();
-			}
-		}
-
-		/*
-		 * Because ->on_rq and ->nvcsw are not guaranteed
-		 * to have a full memory barriers prior to them in the
-		 * schedule() path, memory reordering on other CPUs could
-		 * cause their RCU-tasks read-side critical sections to
-		 * extend past the end of the grace period.  However,
-		 * because these ->nvcsw updates are carried out with
-		 * interrupts disabled, we can use synchronize_rcu()
-		 * to force the needed ordering on all such CPUs.
-		 *
-		 * This synchronize_rcu() also confines all
-		 * ->rcu_tasks_holdout accesses to be within the grace
-		 * period, avoiding the need for memory barriers for
-		 * ->rcu_tasks_holdout accesses.
-		 *
-		 * In addition, this synchronize_rcu() waits for exiting
-		 * tasks to complete their final preempt_disable() region
-		 * of execution, cleaning up after the synchronize_srcu()
-		 * above.
-		 */
-		synchronize_rcu();
+		// Wait for one grace period.
+		rtp->gp_func(rtp);
 
 		/* Invoke the callbacks. */
 		while (list) {
@@ -340,18 +140,16 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 	}
 }
 
-/* Spawn rcu_tasks_kthread() at core_initcall() time. */
-static int __init rcu_spawn_tasks_kthread(void)
+/* Spawn RCU-tasks grace-period kthread, e.g., at core_initcall() time. */
+static void __init rcu_spawn_tasks_kthread_generic(struct rcu_tasks *rtp)
 {
 	struct task_struct *t;
 
-	t = kthread_run(rcu_tasks_kthread, &rcu_tasks, "rcu_tasks_kthread");
+	t = kthread_run(rcu_tasks_kthread, rtp, "rcu_tasks_kthread");
 	if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
-		return 0;
+		return;
 	smp_mb(); /* Ensure others see full kthread. */
-	return 0;
 }
-core_initcall(rcu_spawn_tasks_kthread);
 
 /* Do the srcu_read_lock() for the above synchronize_srcu().  */
 void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
@@ -369,8 +167,6 @@ void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
 	preempt_enable();
 }
 
-#endif /* #ifdef CONFIG_TASKS_RCU */
-
 #ifndef CONFIG_TINY_RCU
 
 /*
@@ -387,3 +183,230 @@ static void __init rcu_tasks_bootup_oddness(void)
 }
 
 #endif /* #ifndef CONFIG_TINY_RCU */
+
+#ifdef CONFIG_TASKS_RCU
+
+////////////////////////////////////////////////////////////////////////
+//
+// Simple variant of RCU whose quiescent states are voluntary context
+// switch, cond_resched_rcu_qs(), user-space execution, and idle.
+// As such, grace periods can take one good long time.  There are no
+// read-side primitives similar to rcu_read_lock() and rcu_read_unlock()
+// because this implementation is intended to get the system into a safe
+// state for some of the manipulations involved in tracing and the like.
+// Finally, this implementation does not support high call_rcu_tasks()
+// rates from multiple CPUs.  If this is required, per-CPU callback lists
+// will be needed.
+
+/* See if tasks are still holding out, complain if so. */
+static void check_holdout_task(struct task_struct *t,
+			       bool needreport, bool *firstreport)
+{
+	int cpu;
+
+	if (!READ_ONCE(t->rcu_tasks_holdout) ||
+	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||
+	    !READ_ONCE(t->on_rq) ||
+	    (IS_ENABLED(CONFIG_NO_HZ_FULL) &&
+	     !is_idle_task(t) && t->rcu_tasks_idle_cpu >= 0)) {
+		WRITE_ONCE(t->rcu_tasks_holdout, false);
+		list_del_init(&t->rcu_tasks_holdout_list);
+		put_task_struct(t);
+		return;
+	}
+	rcu_request_urgent_qs_task(t);
+	if (!needreport)
+		return;
+	if (*firstreport) {
+		pr_err("INFO: rcu_tasks detected stalls on tasks:\n");
+		*firstreport = false;
+	}
+	cpu = task_cpu(t);
+	pr_alert("%p: %c%c nvcsw: %lu/%lu holdout: %d idle_cpu: %d/%d\n",
+		 t, ".I"[is_idle_task(t)],
+		 "N."[cpu < 0 || !tick_nohz_full_cpu(cpu)],
+		 t->rcu_tasks_nvcsw, t->nvcsw, t->rcu_tasks_holdout,
+		 t->rcu_tasks_idle_cpu, cpu);
+	sched_show_task(t);
+}
+
+/* Wait for one RCU-tasks grace period. */
+static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
+{
+	struct task_struct *g, *t;
+	unsigned long lastreport;
+	LIST_HEAD(rcu_tasks_holdouts);
+	int fract;
+
+	/*
+	 * Wait for all pre-existing t->on_rq and t->nvcsw transitions
+	 * to complete.  Invoking synchronize_rcu() suffices because all
+	 * these transitions occur with interrupts disabled.  Without this
+	 * synchronize_rcu(), a read-side critical section that started
+	 * before the grace period might be incorrectly seen as having
+	 * started after the grace period.
+	 *
+	 * This synchronize_rcu() also dispenses with the need for a
+	 * memory barrier on the first store to t->rcu_tasks_holdout,
+	 * as it forces the store to happen after the beginning of the
+	 * grace period.
+	 */
+	synchronize_rcu();
+
+	/*
+	 * There were callbacks, so we need to wait for an RCU-tasks
+	 * grace period.  Start off by scanning the task list for tasks
+	 * that are not already voluntarily blocked.  Mark these tasks
+	 * and make a list of them in rcu_tasks_holdouts.
+	 */
+	rcu_read_lock();
+	for_each_process_thread(g, t) {
+		if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
+			get_task_struct(t);
+			t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
+			WRITE_ONCE(t->rcu_tasks_holdout, true);
+			list_add(&t->rcu_tasks_holdout_list,
+				 &rcu_tasks_holdouts);
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * Wait for tasks that are in the process of exiting.  This
+	 * does only part of the job, ensuring that all tasks that were
+	 * previously exiting reach the point where they have disabled
+	 * preemption, allowing the later synchronize_rcu() to finish
+	 * the job.
+	 */
+	synchronize_srcu(&tasks_rcu_exit_srcu);
+
+	/*
+	 * Each pass through the following loop scans the list of holdout
+	 * tasks, removing any that are no longer holdouts.  When the list
+	 * is empty, we are done.
+	 */
+	lastreport = jiffies;
+
+	/* Start off with HZ/10 wait and slowly back off to 1 HZ wait. */
+	fract = 10;
+
+	for (;;) {
+		bool firstreport;
+		bool needreport;
+		int rtst;
+		struct task_struct *t1;
+
+		if (list_empty(&rcu_tasks_holdouts))
+			break;
+
+		/* Slowly back off waiting for holdouts */
+		schedule_timeout_interruptible(HZ/fract);
+
+		if (fract > 1)
+			fract--;
+
+		rtst = READ_ONCE(rcu_task_stall_timeout);
+		needreport = rtst > 0 && time_after(jiffies, lastreport + rtst);
+		if (needreport)
+			lastreport = jiffies;
+		firstreport = true;
+		WARN_ON(signal_pending(current));
+		list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
+					 rcu_tasks_holdout_list) {
+			check_holdout_task(t, needreport, &firstreport);
+			cond_resched();
+		}
+	}
+
+	/*
+	 * Because ->on_rq and ->nvcsw are not guaranteed to have a full
+	 * memory barrier prior to them in the schedule() path, memory
+	 * reordering on other CPUs could cause their RCU-tasks read-side
+	 * critical sections to extend past the end of the grace period.
+	 * However, because these ->nvcsw updates are carried out with
+	 * interrupts disabled, we can use synchronize_rcu() to force the
+	 * needed ordering on all such CPUs.
+	 *
+	 * This synchronize_rcu() also confines all ->rcu_tasks_holdout
+	 * accesses to be within the grace period, avoiding the need for
+	 * memory barriers for ->rcu_tasks_holdout accesses.
+	 *
+	 * In addition, this synchronize_rcu() waits for exiting tasks
+	 * to complete their final preempt_disable() region of execution,
+	 * cleaning up after the synchronize_srcu() above.
+	 */
+	synchronize_rcu();
+}
+
+void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
+DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks);
+
+/**
+ * call_rcu_tasks() - Queue a callback to run after a task-based grace period
+ * @rhp: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. call_rcu_tasks() assumes
+ * that the read-side critical sections end at a voluntary context
+ * switch (not a preemption!), cond_resched_rcu_qs(), entry into idle,
+ * or transition to usermode execution.  As such, there are no read-side
+ * primitives analogous to rcu_read_lock() and rcu_read_unlock() because
+ * this primitive is intended to determine that all tasks have passed
+ * through a safe state, not so much for data-structure synchronization.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
+ */
+void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
+{
+	call_rcu_tasks_generic(rhp, func, &rcu_tasks);
+}
+EXPORT_SYMBOL_GPL(call_rcu_tasks);
+
+/**
+ * synchronize_rcu_tasks - wait until an rcu-tasks grace period has elapsed.
+ *
+ * Control will return to the caller some time after a full rcu-tasks
+ * grace period has elapsed, in other words after all currently
+ * executing rcu-tasks read-side critical sections have elapsed.  These
+ * read-side critical sections are delimited by calls to schedule(),
+ * cond_resched_tasks_rcu_qs(), idle execution, userspace execution, calls
+ * to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().
+ *
+ * This is a very specialized primitive, intended only for a few uses in
+ * tracing and other situations requiring manipulation of function
+ * preambles and profiling hooks.  The synchronize_rcu_tasks() function
+ * is not (yet) intended for heavy use from multiple CPUs.
+ *
+ * See the description of synchronize_rcu() for more detailed information
+ * on memory ordering guarantees.
+ */
+void synchronize_rcu_tasks(void)
+{
+	synchronize_rcu_tasks_generic(&rcu_tasks);
+}
+EXPORT_SYMBOL_GPL(synchronize_rcu_tasks);
+
+/**
+ * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
+ *
+ * Although the current implementation is guaranteed to wait, it is not
+ * obligated to, for example, if there are no pending callbacks.
+ */
+void rcu_barrier_tasks(void)
+{
+	/* There is only one callback queue, so this is easy.  ;-) */
+	synchronize_rcu_tasks();
+}
+EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
+
+static int __init rcu_spawn_tasks_kthread(void)
+{
+	rcu_spawn_tasks_kthread_generic(&rcu_tasks);
+	return 0;
+}
+core_initcall(rcu_spawn_tasks_kthread);
+
+#endif /* #ifdef CONFIG_TASKS_RCU */
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 0fb2a9e..16058a5 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -559,7 +559,11 @@ late_initcall(rcu_verify_early_boot_tests);
 void rcu_early_boot_tests(void) {}
 #endif /* CONFIG_PROVE_RCU */
 
+#ifdef CONFIG_TASKS_RCU_GENERIC
 #include "tasks.h"
+#else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
+static inline void rcu_tasks_bootup_oddness(void) {}
+#endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
 
 #ifndef CONFIG_TINY_RCU
 
-- 
2.9.5
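
The refactoring above reduces each Tasks RCU flavor to a callback queue plus
a flavor-specific grace-period function. As a rough userspace illustration of
that shape (not kernel code: struct flavor, flavor_enqueue(), flavor_process()
and counting_gp() are hypothetical names, and the locking, wakeups, and
kthread of the real implementation are omitted), the tail-pointer queue and
the single indirect grace-period call might be sketched as:

```c
#include <stddef.h>

/* Userspace stand-ins; every name here is hypothetical, not kernel API. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

struct flavor;
typedef void (*gp_func_t)(struct flavor *fp);

struct flavor {
	struct cb_head *cbs_head;	/* oldest queued callback */
	struct cb_head **cbs_tail;	/* where the next callback is linked */
	gp_func_t gp_func;		/* flavor-specific grace-period wait */
	unsigned long gp_count;		/* grace periods waited for so far */
};

/* Enqueue at the tail; returns nonzero if the queue was empty before. */
static int flavor_enqueue(struct flavor *fp, struct cb_head *head,
			  void (*func)(struct cb_head *head))
{
	int was_empty = !fp->cbs_head;

	head->next = NULL;
	head->func = func;
	*fp->cbs_tail = head;
	fp->cbs_tail = &head->next;
	return was_empty;
}

/* One pass of the processing loop: detach, wait, then invoke in order. */
static int flavor_process(struct flavor *fp)
{
	struct cb_head *list = fp->cbs_head;
	struct cb_head *next;
	int invoked = 0;

	if (!list)
		return 0;
	fp->cbs_head = NULL;
	fp->cbs_tail = &fp->cbs_head;
	fp->gp_func(fp);	/* the only flavor-specific step */
	while (list) {
		next = list->next;
		list->func(list);
		list = next;
		invoked++;
	}
	return invoked;
}

/* A trivial grace-period function that only counts its invocations. */
static void counting_gp(struct flavor *fp)
{
	fp->gp_count++;
}

/* Example payload: records the order in which callbacks are invoked. */
struct item {
	struct cb_head head;
	int id;
};

static int order_log[8];
static int order_len;

static void record_cb(struct cb_head *head)
{
	struct item *ip = (struct item *)((char *)head -
					  offsetof(struct item, head));

	if (order_len < 8)
		order_log[order_len++] = ip->id;
}
```

In this sketch, adding another flavor only means supplying a different
gp_func, which is roughly the property the patch is after. All callbacks
detached in one pass share a single grace period.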



* [PATCH v3 tip/core/rcu 09/34] rcu-tasks: Add an RCU-tasks rude variant
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (7 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 08/34] rcu-tasks: Refactor RCU-tasks to allow variants to be added paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 10/34] rcutorture: Add torture tests for RCU Tasks Rude paulmck
                       ` (25 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds a "rude" variant of RCU-tasks that has as quiescent
states schedule(), cond_resched_tasks_rcu_qs(), userspace execution,
and (in theory, anyway) cond_resched().  In other words, RCU-tasks rude
readers are regions of code with preemption disabled, but excluding code
early in the CPU-online sequence and late in the CPU-offline sequence.
Updates force an IPI and a context switch on each online CPU.  This
variant is useful in some situations in tracing.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
[ paulmck: Apply EXPORT_SYMBOL_GPL() feedback from Qiujun Huang. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Apply review feedback from Steve Rostedt. ]
---
 include/linux/rcupdate.h |  3 ++
 kernel/rcu/Kconfig       | 11 +++++-
 kernel/rcu/tasks.h       | 98 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 111 insertions(+), 1 deletion(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 5523145..2be97a8 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -37,6 +37,7 @@
 /* Exported common interfaces */
 void call_rcu(struct rcu_head *head, rcu_callback_t func);
 void rcu_barrier_tasks(void);
+void rcu_barrier_tasks_rude(void);
 void synchronize_rcu(void);
 
 #ifdef CONFIG_PREEMPT_RCU
@@ -138,6 +139,8 @@ static inline void rcu_init_nohz(void) { }
 #define rcu_note_voluntary_context_switch(t) rcu_tasks_qs(t)
 void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
 void synchronize_rcu_tasks(void);
+void call_rcu_tasks_rude(struct rcu_head *head, rcu_callback_t func);
+void synchronize_rcu_tasks_rude(void);
 void exit_tasks_rcu_start(void);
 void exit_tasks_rcu_finish(void);
 #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 38475d0..6ee6372 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -71,7 +71,7 @@ config TREE_SRCU
 	  This option selects the full-fledged version of SRCU.
 
 config TASKS_RCU_GENERIC
-	def_bool TASKS_RCU
+	def_bool TASKS_RCU || TASKS_RUDE_RCU
 	select SRCU
 	help
 	  This option enables generic infrastructure code supporting
@@ -84,6 +84,15 @@ config TASKS_RCU
 	  only voluntary context switch (not preemption!), idle, and
 	  user-mode execution as quiescent states.  Not for manual selection.
 
+config TASKS_RUDE_RCU
+	def_bool 0
+	help
+	  This option enables a task-based RCU implementation that uses
+	  only context switch (including preemption) and user-mode
+	  execution as quiescent states.  It forces IPIs and context
+	  switches on all online CPUs, including idle ones, so use
+	  with caution.
+
 config RCU_STALL_COMMON
 	def_bool TREE_RCU
 	help
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index d77921e..7f9ed20 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -180,6 +180,9 @@ static void __init rcu_tasks_bootup_oddness(void)
 	else
 		pr_info("\tTasks RCU enabled.\n");
 #endif /* #ifdef CONFIG_TASKS_RCU */
+#ifdef CONFIG_TASKS_RUDE_RCU
+	pr_info("\tRude variant of Tasks RCU enabled.\n");
+#endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
 }
 
 #endif /* #ifndef CONFIG_TINY_RCU */
@@ -410,3 +413,98 @@ static int __init rcu_spawn_tasks_kthread(void)
 core_initcall(rcu_spawn_tasks_kthread);
 
 #endif /* #ifdef CONFIG_TASKS_RCU */
+
+#ifdef CONFIG_TASKS_RUDE_RCU
+
+////////////////////////////////////////////////////////////////////////
+//
+// "Rude" variant of Tasks RCU, inspired by Steve Rostedt's trick of
+// passing an empty function to schedule_on_each_cpu().  This approach
+// provides an asynchronous call_rcu_tasks_rude() API and batching
+// of concurrent calls to the synchronous synchronize_rcu_tasks_rude() API.
+// This sends IPIs far and wide and induces otherwise unnecessary context
+// switches on all online CPUs, whether idle or not.
+
+// Empty function to allow workqueues to force a context switch.
+static void rcu_tasks_be_rude(struct work_struct *work)
+{
+}
+
+// Wait for one rude RCU-tasks grace period.
+static void rcu_tasks_rude_wait_gp(struct rcu_tasks *rtp)
+{
+	schedule_on_each_cpu(rcu_tasks_be_rude);
+}
+
+void call_rcu_tasks_rude(struct rcu_head *rhp, rcu_callback_t func);
+DEFINE_RCU_TASKS(rcu_tasks_rude, rcu_tasks_rude_wait_gp, call_rcu_tasks_rude);
+
+/**
+ * call_rcu_tasks_rude() - Queue a callback to run after a rude task-based grace period
+ * @rhp: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. call_rcu_tasks_rude()
+ * assumes that the read-side critical sections end at context switch,
+ * cond_resched_rcu_qs(), or transition to usermode execution.  As such,
+ * there are no read-side primitives analogous to rcu_read_lock() and
+ * rcu_read_unlock() because this primitive is intended to determine
+ * that all tasks have passed through a safe state, not so much for
+ * data-structure synchronization.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
+ */
+void call_rcu_tasks_rude(struct rcu_head *rhp, rcu_callback_t func)
+{
+	call_rcu_tasks_generic(rhp, func, &rcu_tasks_rude);
+}
+EXPORT_SYMBOL_GPL(call_rcu_tasks_rude);
+
+/**
+ * synchronize_rcu_tasks_rude - wait for a rude rcu-tasks grace period
+ *
+ * Control will return to the caller some time after a rude rcu-tasks
+ * grace period has elapsed, in other words after all currently
+ * executing rcu-tasks read-side critical sections have elapsed.  These
+ * read-side critical sections are delimited by calls to schedule(),
+ * cond_resched_tasks_rcu_qs(), userspace execution, and (in theory,
+ * anyway) cond_resched().
+ *
+ * This is a very specialized primitive, intended only for a few uses in
+ * tracing and other situations requiring manipulation of function preambles
+ * and profiling hooks.  The synchronize_rcu_tasks_rude() function is not
+ * (yet) intended for heavy use from multiple CPUs.
+ *
+ * See the description of synchronize_rcu() for more detailed information
+ * on memory ordering guarantees.
+ */
+void synchronize_rcu_tasks_rude(void)
+{
+	synchronize_rcu_tasks_generic(&rcu_tasks_rude);
+}
+EXPORT_SYMBOL_GPL(synchronize_rcu_tasks_rude);
+
+/**
+ * rcu_barrier_tasks_rude - Wait for in-flight call_rcu_tasks_rude() callbacks.
+ *
+ * Although the current implementation is guaranteed to wait, it is not
+ * obligated to, for example, if there are no pending callbacks.
+ */
+void rcu_barrier_tasks_rude(void)
+{
+	/* There is only one callback queue, so this is easy.  ;-) */
+	synchronize_rcu_tasks_rude();
+}
+EXPORT_SYMBOL_GPL(rcu_barrier_tasks_rude);
+
+static int __init rcu_spawn_tasks_rude_kthread(void)
+{
+	rcu_spawn_tasks_kthread_generic(&rcu_tasks_rude);
+	return 0;
+}
+core_initcall(rcu_spawn_tasks_rude_kthread);
+
+#endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
-- 
2.9.5
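
The rude grace period in this patch amounts to running an empty work item on
every online CPU and waiting for all of them to finish. A loose userspace
analogy, assuming pthreads worker threads as stand-ins for CPUs (NWORKERS,
struct worker, run_on_each_worker() and rude_wait_gp() are hypothetical names;
nothing here is kernel API, and it supports only one caller at a time), could
look like this:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NWORKERS 4	/* hypothetical stand-in for the set of online CPUs */

struct worker {
	pthread_t tid;
	pthread_mutex_t lock;
	pthread_cond_t cv;
	void (*work)(void);	/* pending work item, NULL if none */
	bool stop;
	unsigned long passes;	/* work items completed so far */
};

static struct worker workers[NWORKERS];

static void *worker_loop(void *arg)
{
	struct worker *w = arg;

	pthread_mutex_lock(&w->lock);
	for (;;) {
		while (!w->work && !w->stop)
			pthread_cond_wait(&w->cv, &w->lock);
		if (!w->work)
			break;	/* stop requested and nothing pending */
		w->work();
		w->work = NULL;
		w->passes++;
		pthread_cond_broadcast(&w->cv);	/* wake the waiter */
	}
	pthread_mutex_unlock(&w->lock);
	return NULL;
}

/* Empty function: the point is that every worker had to run it. */
static void be_rude(void)
{
}

/* Queue fn on every worker, then wait until each one has run it. */
static void run_on_each_worker(void (*fn)(void))
{
	int i;

	for (i = 0; i < NWORKERS; i++) {
		pthread_mutex_lock(&workers[i].lock);
		workers[i].work = fn;
		pthread_cond_broadcast(&workers[i].cv);
		pthread_mutex_unlock(&workers[i].lock);
	}
	for (i = 0; i < NWORKERS; i++) {
		pthread_mutex_lock(&workers[i].lock);
		while (workers[i].work)
			pthread_cond_wait(&workers[i].cv, &workers[i].lock);
		pthread_mutex_unlock(&workers[i].lock);
	}
}

/* Analog of a rude grace period: disturb everyone, then return. */
static void rude_wait_gp(void)
{
	run_on_each_worker(be_rude);
}

static int start_workers(void)
{
	int i;

	for (i = 0; i < NWORKERS; i++) {
		struct worker *w = &workers[i];

		pthread_mutex_init(&w->lock, NULL);
		pthread_cond_init(&w->cv, NULL);
		w->work = NULL;
		w->stop = false;
		w->passes = 0;
		if (pthread_create(&w->tid, NULL, worker_loop, w))
			return -1;
	}
	return 0;
}

static void stop_workers(void)
{
	int i;

	for (i = 0; i < NWORKERS; i++) {
		pthread_mutex_lock(&workers[i].lock);
		workers[i].stop = true;
		pthread_cond_broadcast(&workers[i].cv);
		pthread_mutex_unlock(&workers[i].lock);
		pthread_join(workers[i].tid, NULL);
	}
}
```

As in the patch, the cost falls on every worker whether or not it was doing
anything relevant, which is why the Kconfig help text says to use this
variant with caution.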



* [PATCH v3 tip/core/rcu 10/34] rcutorture: Add torture tests for RCU Tasks Rude
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (8 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 09/34] rcu-tasks: Add an RCU-tasks rude variant paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 11/34] rcu-tasks: Use unique names for RCU-Tasks kthreads and messages paulmck
                       ` (24 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds the definitions required to torture the rude flavor of
RCU tasks.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/Kconfig.debug                           |  2 ++
 kernel/rcu/rcu.h                                   |  1 +
 kernel/rcu/rcutorture.c                            | 31 ++++++++++++++++++++--
 .../selftests/rcutorture/configs/rcu/CFLIST        |  1 +
 .../selftests/rcutorture/configs/rcu/RUDE01        | 10 +++++++
 .../selftests/rcutorture/configs/rcu/RUDE01.boot   |  1 +
 6 files changed, 44 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/RUDE01
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot

diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index ec4bb6c..b15a3bd 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -24,6 +24,7 @@ config RCU_PERF_TEST
 	select TORTURE_TEST
 	select SRCU
 	select TASKS_RCU
+	select TASKS_RUDE_RCU
 	default n
 	help
 	  This option provides a kernel module that runs performance
@@ -41,6 +42,7 @@ config RCU_TORTURE_TEST
 	select TORTURE_TEST
 	select SRCU
 	select TASKS_RCU
+	select TASKS_RUDE_RCU
 	default n
 	help
 	  This option provides a kernel module that runs torture tests
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 00ddc92..c574620 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -441,6 +441,7 @@ void rcu_request_urgent_qs_task(struct task_struct *t);
 enum rcutorture_type {
 	RCU_FLAVOR,
 	RCU_TASKS_FLAVOR,
+	RCU_TASKS_RUDE_FLAVOR,
 	RCU_TRIVIAL_FLAVOR,
 	SRCU_FLAVOR,
 	INVALID_RCU_FLAVOR
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 88631f5..386cd11 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -731,6 +731,33 @@ static struct rcu_torture_ops trivial_ops = {
 	.name		= "trivial"
 };
 
+/*
+ * Definitions for rude RCU-tasks torture testing.
+ */
+
+static void rcu_tasks_rude_torture_deferred_free(struct rcu_torture *p)
+{
+	call_rcu_tasks_rude(&p->rtort_rcu, rcu_torture_cb);
+}
+
+static struct rcu_torture_ops tasks_rude_ops = {
+	.ttype		= RCU_TASKS_RUDE_FLAVOR,
+	.init		= rcu_sync_torture_init,
+	.readlock	= rcu_torture_read_lock_trivial,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= rcu_torture_read_unlock_trivial,
+	.get_gp_seq	= rcu_no_completed,
+	.deferred_free	= rcu_tasks_rude_torture_deferred_free,
+	.sync		= synchronize_rcu_tasks_rude,
+	.exp_sync	= synchronize_rcu_tasks_rude,
+	.call		= call_rcu_tasks_rude,
+	.cb_barrier	= rcu_barrier_tasks_rude,
+	.fqs		= NULL,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.name		= "tasks-rude"
+};
+
 static unsigned long rcutorture_seq_diff(unsigned long new, unsigned long old)
 {
 	if (!cur_ops->gp_diff)
@@ -740,7 +767,7 @@ static unsigned long rcutorture_seq_diff(unsigned long new, unsigned long old)
 
 static bool __maybe_unused torturing_tasks(void)
 {
-	return cur_ops == &tasks_ops;
+	return cur_ops == &tasks_ops || cur_ops == &tasks_rude_ops;
 }
 
 /*
@@ -2408,7 +2435,7 @@ rcu_torture_init(void)
 	int firsterr = 0;
 	static struct rcu_torture_ops *torture_ops[] = {
 		&rcu_ops, &rcu_busted_ops, &srcu_ops, &srcud_ops,
-		&busted_srcud_ops, &tasks_ops, &trivial_ops,
+		&busted_srcud_ops, &tasks_ops, &tasks_rude_ops, &trivial_ops,
 	};
 
 	if (!torture_init_begin(torture_type, verbose))
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
index c3c1fb5..ec0c72f 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
+++ b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
@@ -14,3 +14,4 @@ TINY02
 TASKS01
 TASKS02
 TASKS03
+RUDE01
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/RUDE01 b/tools/testing/selftests/rcutorture/configs/rcu/RUDE01
new file mode 100644
index 0000000..bafe94c
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/RUDE01
@@ -0,0 +1,10 @@
+CONFIG_SMP=y
+CONFIG_NR_CPUS=2
+CONFIG_HOTPLUG_CPU=y
+CONFIG_PREEMPT_NONE=n
+CONFIG_PREEMPT_VOLUNTARY=n
+CONFIG_PREEMPT=y
+CONFIG_DEBUG_LOCK_ALLOC=y
+CONFIG_PROVE_LOCKING=y
+#CHECK#CONFIG_PROVE_RCU=y
+CONFIG_RCU_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot b/tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot
new file mode 100644
index 0000000..9363708
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot
@@ -0,0 +1 @@
+rcutorture.torture_type=tasks-rude
-- 
2.9.5



* [PATCH v3 tip/core/rcu 11/34] rcu-tasks: Use unique names for RCU-Tasks kthreads and messages
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (9 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 10/34] rcutorture: Add torture tests for RCU Tasks Rude paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 12/34] rcu-tasks: Further refactor RCU-tasks to allow adding more variants paulmck
                       ` (23 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit causes the flavors of RCU Tasks to use different names
for their kthreads and in their console messages.
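
The naming scheme can be tried out in userspace.  What follows is a minimal sketch, not the kernel code: struct flavor, DEFINE_FLAVOR() and flavor_kthread_name() are hypothetical stand-ins for struct rcu_tasks, DEFINE_RCU_TASKS() and the kthread_run() format string.  It shows how stringifying the variable name with # yields a per-flavor kthread name, while the human-readable name is passed in separately.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace stand-in for struct rcu_tasks (naming fields only). */
struct flavor {
	const char *name;	/* Human-readable name for console messages. */
	const char *kname;	/* Name derived from the variable's identifier. */
};

/*
 * #rt_name stringifies the variable name, so kname always tracks the symbol.
 * The first parameter must not be called "name", or the preprocessor would
 * also substitute it into the ".name" designator.
 */
#define DEFINE_FLAVOR(rt_name, n)		\
static struct flavor rt_name = {		\
	.name = n,				\
	.kname = #rt_name,			\
}

DEFINE_FLAVOR(rcu_tasks, "RCU Tasks");
DEFINE_FLAVOR(rcu_tasks_rude, "RCU Tasks Rude");

/* Format the kthread name the way a "%s_kthread" format string would. */
static int flavor_kthread_name(const struct flavor *fp, char *buf, size_t len)
{
	int n = snprintf(buf, len, "%s_kthread", fp->kname);

	assert(n >= 0);
	return n < (int)len ? 0 : -1;	/* -1 means the name was truncated. */
}
```

With these definitions, the two flavors format as "rcu_tasks_kthread" and "rcu_tasks_rude_kthread" respectively.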

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 7f9ed20..9ca83c6 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -22,6 +22,8 @@ typedef void (*rcu_tasks_gp_func_t)(struct rcu_tasks *rtp);
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
  * @gp_func: This flavor's grace-period-wait function.
  * @call_func: This flavor's call_rcu()-equivalent function.
+ * @name: This flavor's textual name.
+ * @kname: This flavor's kthread name.
  */
 struct rcu_tasks {
 	struct rcu_head *cbs_head;
@@ -31,16 +33,20 @@ struct rcu_tasks {
 	struct task_struct *kthread_ptr;
 	rcu_tasks_gp_func_t gp_func;
 	call_rcu_func_t call_func;
+	char *name;
+	char *kname;
 };
 
-#define DEFINE_RCU_TASKS(name, gp, call)				\
-static struct rcu_tasks name =						\
+#define DEFINE_RCU_TASKS(rt_name, gp, call, n)				\
+static struct rcu_tasks rt_name =					\
 {									\
-	.cbs_tail = &name.cbs_head,					\
-	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(name.cbs_wq),		\
-	.cbs_lock = __RAW_SPIN_LOCK_UNLOCKED(name.cbs_lock),		\
+	.cbs_tail = &rt_name.cbs_head,					\
+	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(rt_name.cbs_wq),	\
+	.cbs_lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name.cbs_lock),		\
 	.gp_func = gp,							\
 	.call_func = call,						\
+	.name = n,							\
+	.kname = #rt_name,						\
 }
 
 /* Track exiting tasks in order to allow them to be waited for. */
@@ -145,8 +151,8 @@ static void __init rcu_spawn_tasks_kthread_generic(struct rcu_tasks *rtp)
 {
 	struct task_struct *t;
 
-	t = kthread_run(rcu_tasks_kthread, rtp, "rcu_tasks_kthread");
-	if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
+	t = kthread_run(rcu_tasks_kthread, rtp, "%s_kthread", rtp->kname);
+	if (WARN_ONCE(IS_ERR(t), "%s: Could not start %s grace-period kthread, OOM is now expected behavior\n", __func__, rtp->name))
 		return;
 	smp_mb(); /* Ensure others see full kthread. */
 }
@@ -342,7 +348,7 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 }
 
 void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
-DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks);
+DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
 
 /**
  * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
@@ -437,7 +443,8 @@ static void rcu_tasks_rude_wait_gp(struct rcu_tasks *rtp)
 }
 
 void call_rcu_tasks_rude(struct rcu_head *rhp, rcu_callback_t func);
-DEFINE_RCU_TASKS(rcu_tasks_rude, rcu_tasks_rude_wait_gp, call_rcu_tasks_rude);
+DEFINE_RCU_TASKS(rcu_tasks_rude, rcu_tasks_rude_wait_gp, call_rcu_tasks_rude,
+		 "RCU Tasks Rude");
 
 /**
  * call_rcu_tasks_rude() - Queue a callback rude task-based grace period
-- 
2.9.5



* [PATCH v3 tip/core/rcu 12/34] rcu-tasks: Further refactor RCU-tasks to allow adding more variants
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (10 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 11/34] rcu-tasks: Use unique names for RCU-Tasks kthreads and messages paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 13/34] rcu-tasks: Code movement to allow more Tasks RCU variants paulmck
                       ` (22 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit refactors RCU tasks to allow variants to be added.  These
variants will share the current Tasks-RCU tasklist scan and the holdout
list processing.
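
The shape of the refactoring can be illustrated outside the kernel.  The following is a rough userspace sketch under stated assumptions: struct gp_flavor, wait_gp() and the demo_*() hooks are hypothetical, tasks are plain integers, and every hook takes the flavor pointer (the hooks in this patch take different arguments).  It shows only the control flow, in which a shared driver calls per-flavor hooks before the scan, once per task, after the scan, once per holdout pass, and after the grace period.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOG_MAX 32

/* Hypothetical analogue of the per-flavor hook pointers. */
struct gp_flavor {
	void (*pregp)(struct gp_flavor *fp);
	void (*pertask)(struct gp_flavor *fp, int task);
	void (*postscan)(struct gp_flavor *fp);
	int (*holdouts)(struct gp_flavor *fp);	/* Returns holdouts remaining. */
	void (*postgp)(struct gp_flavor *fp);
	int nholdouts;				/* Stand-in for the holdout list. */
	char log[LOG_MAX];			/* Records the order of hook calls. */
	size_t loglen;
};

static void log_step(struct gp_flavor *fp, char c)
{
	assert(fp->loglen + 1 < LOG_MAX);
	fp->log[fp->loglen++] = c;
	fp->log[fp->loglen] = '\0';
}

static void demo_pregp(struct gp_flavor *fp) { log_step(fp, 'G'); }

static void demo_pertask(struct gp_flavor *fp, int task)
{
	if (task % 2) {		/* Pretend odd-numbered tasks are holdouts. */
		fp->nholdouts++;
		log_step(fp, 't');
	}
}

static void demo_postscan(struct gp_flavor *fp) { log_step(fp, 'S'); }

static int demo_holdouts(struct gp_flavor *fp)
{
	log_step(fp, 'h');
	return --fp->nholdouts;	/* One holdout goes quiescent per pass. */
}

static void demo_postgp(struct gp_flavor *fp) { log_step(fp, 'P'); }

/* Shared driver: the only code that knows the order of the phases. */
static void wait_gp(struct gp_flavor *fp, int ntasks)
{
	int t;

	fp->pregp(fp);
	for (t = 0; t < ntasks; t++)
		fp->pertask(fp, t);
	fp->postscan(fp);
	while (fp->nholdouts > 0)
		fp->holdouts(fp);
	fp->postgp(fp);
}

static struct gp_flavor demo_flavor = {
	.pregp = demo_pregp,
	.pertask = demo_pertask,
	.postscan = demo_postscan,
	.holdouts = demo_holdouts,
	.postgp = demo_postgp,
};
```

Calling wait_gp(&demo_flavor, 4) records one letter per hook invocation in demo_flavor.log, which makes the ordering easy to inspect.  A new flavor would supply its own hook functions and reuse wait_gp() unchanged.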

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 166 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 108 insertions(+), 58 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 9ca83c6..344426e 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -12,6 +12,11 @@
 
 struct rcu_tasks;
 typedef void (*rcu_tasks_gp_func_t)(struct rcu_tasks *rtp);
+typedef void (*pregp_func_t)(void);
+typedef void (*pertask_func_t)(struct task_struct *t, struct list_head *hop);
+typedef void (*postscan_func_t)(void);
+typedef void (*holdouts_func_t)(struct list_head *hop, bool ndrpt, bool *frptp);
+typedef void (*postgp_func_t)(void);
 
 /**
  * Definition for a Tasks-RCU-like mechanism.
@@ -21,6 +26,11 @@ typedef void (*rcu_tasks_gp_func_t)(struct rcu_tasks *rtp);
  * @cbs_lock: Lock protecting callback list.
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
  * @gp_func: This flavor's grace-period-wait function.
+ * @pregp_func: This flavor's pre-grace-period function (optional).
+ * @pertask_func: This flavor's per-task scan function (optional).
+ * @postscan_func: This flavor's post-task scan function (optional).
+ * @holdouts_func: This flavor's holdout-list scan function (optional).
+ * @postgp_func: This flavor's post-grace-period function (optional).
  * @call_func: This flavor's call_rcu()-equivalent function.
  * @name: This flavor's textual name.
  * @kname: This flavor's kthread name.
@@ -32,6 +42,11 @@ struct rcu_tasks {
 	raw_spinlock_t cbs_lock;
 	struct task_struct *kthread_ptr;
 	rcu_tasks_gp_func_t gp_func;
+	pregp_func_t pregp_func;
+	pertask_func_t pertask_func;
+	postscan_func_t postscan_func;
+	holdouts_func_t holdouts_func;
+	postgp_func_t postgp_func;
 	call_rcu_func_t call_func;
 	char *name;
 	char *kname;
@@ -113,6 +128,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 
 		/* Pick up any new callbacks. */
 		raw_spin_lock_irqsave(&rtp->cbs_lock, flags);
+		smp_mb__after_unlock_lock(); // Order updates vs. GP.
 		list = rtp->cbs_head;
 		rtp->cbs_head = NULL;
 		rtp->cbs_tail = &rtp->cbs_head;
@@ -207,6 +223,49 @@ static void __init rcu_tasks_bootup_oddness(void)
 // rates from multiple CPUs.  If this is required, per-CPU callback lists
 // will be needed.
 
+/* Pre-grace-period preparation. */
+static void rcu_tasks_pregp_step(void)
+{
+	/*
+	 * Wait for all pre-existing t->on_rq and t->nvcsw transitions
+	 * to complete.  Invoking synchronize_rcu() suffices because all
+	 * these transitions occur with interrupts disabled.  Without this
+	 * synchronize_rcu(), a read-side critical section that started
+	 * before the grace period might be incorrectly seen as having
+	 * started after the grace period.
+	 *
+	 * This synchronize_rcu() also dispenses with the need for a
+	 * memory barrier on the first store to t->rcu_tasks_holdout,
+	 * as it forces the store to happen after the beginning of the
+	 * grace period.
+	 */
+	synchronize_rcu();
+}
+
+/* Per-task initial processing. */
+static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
+{
+	if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
+		get_task_struct(t);
+		t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
+		WRITE_ONCE(t->rcu_tasks_holdout, true);
+		list_add(&t->rcu_tasks_holdout_list, hop);
+	}
+}
+
+/* Processing between scanning the tasklist and draining the holdout list. */
+void rcu_tasks_postscan(void)
+{
+	/*
+	 * Wait for tasks that are in the process of exiting.  This
+	 * does only part of the job, ensuring that all tasks that were
+	 * previously exiting reach the point where they have disabled
+	 * preemption, allowing the later synchronize_rcu() to finish
+	 * the job.
+	 */
+	synchronize_srcu(&tasks_rcu_exit_srcu);
+}
+
 /* See if tasks are still holding out, complain if so. */
 static void check_holdout_task(struct task_struct *t,
 			       bool needreport, bool *firstreport)
@@ -239,55 +298,63 @@ static void check_holdout_task(struct task_struct *t,
 	sched_show_task(t);
 }
 
+/* Scan the holdout lists for tasks no longer holding out. */
+static void check_all_holdout_tasks(struct list_head *hop,
+				    bool needreport, bool *firstreport)
+{
+	struct task_struct *t, *t1;
+
+	list_for_each_entry_safe(t, t1, hop, rcu_tasks_holdout_list) {
+		check_holdout_task(t, needreport, firstreport);
+		cond_resched();
+	}
+}
+
+/* Finish off the Tasks-RCU grace period. */
+static void rcu_tasks_postgp(void)
+{
+	/*
+	 * Because ->on_rq and ->nvcsw are not guaranteed to have full
+	 * memory barriers prior to them in the schedule() path, memory
+	 * reordering on other CPUs could cause their RCU-tasks read-side
+	 * critical sections to extend past the end of the grace period.
+	 * However, because these ->nvcsw updates are carried out with
+	 * interrupts disabled, we can use synchronize_rcu() to force the
+	 * needed ordering on all such CPUs.
+	 *
+	 * This synchronize_rcu() also confines all ->rcu_tasks_holdout
+	 * accesses to be within the grace period, avoiding the need for
+	 * memory barriers for ->rcu_tasks_holdout accesses.
+	 *
+	 * In addition, this synchronize_rcu() waits for exiting tasks
+	 * to complete their final preempt_disable() region of execution,
+	 * cleaning up after the synchronize_srcu() above.
+	 */
+	synchronize_rcu();
+}
+
 /* Wait for one RCU-tasks grace period. */
 static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 {
 	struct task_struct *g, *t;
 	unsigned long lastreport;
-	LIST_HEAD(rcu_tasks_holdouts);
+	LIST_HEAD(holdouts);
 	int fract;
 
-	/*
-	 * Wait for all pre-existing t->on_rq and t->nvcsw transitions
-	 * to complete.  Invoking synchronize_rcu() suffices because all
-	 * these transitions occur with interrupts disabled.  Without this
-	 * synchronize_rcu(), a read-side critical section that started
-	 * before the grace period might be incorrectly seen as having
-	 * started after the grace period.
-	 *
-	 * This synchronize_rcu() also dispenses with the need for a
-	 * memory barrier on the first store to t->rcu_tasks_holdout,
-	 * as it forces the store to happen after the beginning of the
-	 * grace period.
-	 */
-	synchronize_rcu();
+	rtp->pregp_func();
 
 	/*
 	 * There were callbacks, so we need to wait for an RCU-tasks
 	 * grace period.  Start off by scanning the task list for tasks
 	 * that are not already voluntarily blocked.  Mark these tasks
-	 * and make a list of them in rcu_tasks_holdouts.
+	 * and make a list of them in holdouts.
 	 */
 	rcu_read_lock();
-	for_each_process_thread(g, t) {
-		if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
-			get_task_struct(t);
-			t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
-			WRITE_ONCE(t->rcu_tasks_holdout, true);
-			list_add(&t->rcu_tasks_holdout_list,
-				 &rcu_tasks_holdouts);
-		}
-	}
+	for_each_process_thread(g, t)
+		rtp->pertask_func(t, &holdouts);
 	rcu_read_unlock();
 
-	/*
-	 * Wait for tasks that are in the process of exiting.  This
-	 * does only part of the job, ensuring that all tasks that were
-	 * previously exiting reach the point where they have disabled
-	 * preemption, allowing the later synchronize_rcu() to finish
-	 * the job.
-	 */
-	synchronize_srcu(&tasks_rcu_exit_srcu);
+	rtp->postscan_func();
 
 	/*
 	 * Each pass through the following loop scans the list of holdout
@@ -303,9 +370,8 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 		bool firstreport;
 		bool needreport;
 		int rtst;
-		struct task_struct *t1;
 
-		if (list_empty(&rcu_tasks_holdouts))
+		if (list_empty(&holdouts))
 			break;
 
 		/* Slowly back off waiting for holdouts */
@@ -320,31 +386,10 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 			lastreport = jiffies;
 		firstreport = true;
 		WARN_ON(signal_pending(current));
-		list_for_each_entry_safe(t, t1, &rcu_tasks_holdouts,
-					 rcu_tasks_holdout_list) {
-			check_holdout_task(t, needreport, &firstreport);
-			cond_resched();
-		}
+		rtp->holdouts_func(&holdouts, needreport, &firstreport);
 	}
 
-	/*
-	 * Because ->on_rq and ->nvcsw are not guaranteed to have a full
-	 * memory barriers prior to them in the schedule() path, memory
-	 * reordering on other CPUs could cause their RCU-tasks read-side
-	 * critical sections to extend past the end of the grace period.
-	 * However, because these ->nvcsw updates are carried out with
-	 * interrupts disabled, we can use synchronize_rcu() to force the
-	 * needed ordering on all such CPUs.
-	 *
-	 * This synchronize_rcu() also confines all ->rcu_tasks_holdout
-	 * accesses to be within the grace period, avoiding the need for
-	 * memory barriers for ->rcu_tasks_holdout accesses.
-	 *
-	 * In addition, this synchronize_rcu() waits for exiting tasks
-	 * to complete their final preempt_disable() region of execution,
-	 * cleaning up after the synchronize_srcu() above.
-	 */
-	synchronize_rcu();
+	rtp->postgp_func();
 }
 
 void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
@@ -413,6 +458,11 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
 
 static int __init rcu_spawn_tasks_kthread(void)
 {
+	rcu_tasks.pregp_func = rcu_tasks_pregp_step;
+	rcu_tasks.pertask_func = rcu_tasks_pertask;
+	rcu_tasks.postscan_func = rcu_tasks_postscan;
+	rcu_tasks.holdouts_func = check_all_holdout_tasks;
+	rcu_tasks.postgp_func = rcu_tasks_postgp;
 	rcu_spawn_tasks_kthread_generic(&rcu_tasks);
 	return 0;
 }
-- 
2.9.5



* [PATCH v3 tip/core/rcu 13/34] rcu-tasks: Code movement to allow more Tasks RCU variants
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (11 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 12/34] rcu-tasks: Further refactor RCU-tasks to allow adding more variants paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 14/34] rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks paulmck
                       ` (21 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit does nothing but move rcu_tasks_wait_gp() up to a new section
for common code.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 122 +++++++++++++++++++++++++++--------------------------
 1 file changed, 63 insertions(+), 59 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 344426e..d8b09d5 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -213,6 +213,69 @@ static void __init rcu_tasks_bootup_oddness(void)
 
 ////////////////////////////////////////////////////////////////////////
 //
+// Shared code between task-list-scanning variants of Tasks RCU.
+
+/* Wait for one RCU-tasks grace period. */
+static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
+{
+	struct task_struct *g, *t;
+	unsigned long lastreport;
+	LIST_HEAD(holdouts);
+	int fract;
+
+	rtp->pregp_func();
+
+	/*
+	 * There were callbacks, so we need to wait for an RCU-tasks
+	 * grace period.  Start off by scanning the task list for tasks
+	 * that are not already voluntarily blocked.  Mark these tasks
+	 * and make a list of them in holdouts.
+	 */
+	rcu_read_lock();
+	for_each_process_thread(g, t)
+		rtp->pertask_func(t, &holdouts);
+	rcu_read_unlock();
+
+	rtp->postscan_func();
+
+	/*
+	 * Each pass through the following loop scans the list of holdout
+	 * tasks, removing any that are no longer holdouts.  When the list
+	 * is empty, we are done.
+	 */
+	lastreport = jiffies;
+
+	/* Start off with HZ/10 wait and slowly back off to 1 HZ wait. */
+	fract = 10;
+
+	for (;;) {
+		bool firstreport;
+		bool needreport;
+		int rtst;
+
+		if (list_empty(&holdouts))
+			break;
+
+		/* Slowly back off waiting for holdouts */
+		schedule_timeout_interruptible(HZ/fract);
+
+		if (fract > 1)
+			fract--;
+
+		rtst = READ_ONCE(rcu_task_stall_timeout);
+		needreport = rtst > 0 && time_after(jiffies, lastreport + rtst);
+		if (needreport)
+			lastreport = jiffies;
+		firstreport = true;
+		WARN_ON(signal_pending(current));
+		rtp->holdouts_func(&holdouts, needreport, &firstreport);
+	}
+
+	rtp->postgp_func();
+}
+
+////////////////////////////////////////////////////////////////////////
+//
 // Simple variant of RCU whose quiescent states are voluntary context
 // switch, cond_resched_rcu_qs(), user-space execution, and idle.
 // As such, grace periods can take one good long time.  There are no
@@ -333,65 +396,6 @@ static void rcu_tasks_postgp(void)
 	synchronize_rcu();
 }
 
-/* Wait for one RCU-tasks grace period. */
-static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
-{
-	struct task_struct *g, *t;
-	unsigned long lastreport;
-	LIST_HEAD(holdouts);
-	int fract;
-
-	rtp->pregp_func();
-
-	/*
-	 * There were callbacks, so we need to wait for an RCU-tasks
-	 * grace period.  Start off by scanning the task list for tasks
-	 * that are not already voluntarily blocked.  Mark these tasks
-	 * and make a list of them in holdouts.
-	 */
-	rcu_read_lock();
-	for_each_process_thread(g, t)
-		rtp->pertask_func(t, &holdouts);
-	rcu_read_unlock();
-
-	rtp->postscan_func();
-
-	/*
-	 * Each pass through the following loop scans the list of holdout
-	 * tasks, removing any that are no longer holdouts.  When the list
-	 * is empty, we are done.
-	 */
-	lastreport = jiffies;
-
-	/* Start off with HZ/10 wait and slowly back off to 1 HZ wait. */
-	fract = 10;
-
-	for (;;) {
-		bool firstreport;
-		bool needreport;
-		int rtst;
-
-		if (list_empty(&holdouts))
-			break;
-
-		/* Slowly back off waiting for holdouts */
-		schedule_timeout_interruptible(HZ/fract);
-
-		if (fract > 1)
-			fract--;
-
-		rtst = READ_ONCE(rcu_task_stall_timeout);
-		needreport = rtst > 0 && time_after(jiffies, lastreport + rtst);
-		if (needreport)
-			lastreport = jiffies;
-		firstreport = true;
-		WARN_ON(signal_pending(current));
-		rtp->holdouts_func(&holdouts, needreport, &firstreport);
-	}
-
-	rtp->postgp_func();
-}
-
 void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
 DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
 
-- 
2.9.5



* [PATCH v3 tip/core/rcu 14/34] rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (12 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 13/34] rcu-tasks: Code movement to allow more Tasks RCU variants paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 15/34] rcutorture: Add torture tests for RCU Tasks Trace paulmck
                       ` (20 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney,
	Alexei Starovoitov, Andrii Nakryiko

From: "Paul E. McKenney" <paulmck@kernel.org>

Because RCU does not watch exception early-entry/late-exit, idle-loop,
or CPU-hotplug execution, protection of tracing and BPF operations is
needlessly complicated.  This commit therefore adds a variant of
Tasks RCU that:

o	Has explicit read-side markers to allow finite grace periods in
	the face of in-kernel loops for PREEMPT=n builds.  These markers
	are rcu_read_lock_trace() and rcu_read_unlock_trace().

o	Protects code in the idle loop, exception entry/exit, and
	CPU-hotplug code paths.  In this respect, RCU-tasks trace is
	similar to SRCU, but with lighter-weight readers.

o	Avoids expensive read-side instructions, having overhead similar
	to that of Preemptible RCU.
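
To make the reader-side cost concrete, here is a single-threaded userspace sketch of the nesting-counter idea.  It is an illustration under assumptions, not the kernel implementation: struct reader, gp_mark_reader() and gp_wakeups are hypothetical, and the READ_ONCE()/WRITE_ONCE() accesses, IPIs and memory ordering that the real code depends on are omitted.  The fast path touches only the reader's own counter, and the shared atomic counter is touched only by a reader that the grace period has asked to report.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-task reader state. */
struct reader {
	int nesting;	/* Depth of nested read-side critical sections. */
	bool need_end;	/* Set when a grace period awaits this reader. */
};

static atomic_int n_readers_need_end;	/* Readers the grace period awaits. */
static int gp_wakeups;			/* Counts simulated wakeups. */

/* Fast path: an increment of task-local state, no atomics. */
static void reader_lock(struct reader *r)
{
	r->nesting++;
}

/* Slow path: the last awaited reader "wakes" the grace-period side. */
static void reader_unlock_special(struct reader *r)
{
	r->need_end = false;
	if (atomic_fetch_sub(&n_readers_need_end, 1) == 1)
		gp_wakeups++;
}

static void reader_unlock(struct reader *r)
{
	int nesting = --r->nesting;

	assert(nesting >= 0);		/* Unbalanced unlock is a bug. */
	if (!r->need_end || nesting)
		return;			/* Common case: nothing to report. */
	reader_unlock_special(r);
}

/* Grace-period side: ask a reader that is mid-section to report its exit. */
static void gp_mark_reader(struct reader *r)
{
	if (r->nesting) {
		atomic_fetch_add(&n_readers_need_end, 1);
		r->need_end = true;
	}
}
```

In this sketch a reader that is marked while nested two levels deep reports only when its outermost section ends.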

There are of course downsides:

o	The grace-period code can send IPIs to CPUs, even when those
	CPUs are in the idle loop or in nohz_full userspace.  This is
	mitigated by later commits.

o	It is necessary to scan the full tasklist, much as for Tasks RCU.

o	There is a single callback queue guarded by a single lock,
	again, much as for Tasks RCU.  However, those early use cases
	that request multiple grace periods in quick succession are
	expected to do so from a single task, which makes the single
	lock almost irrelevant.  If needed, multiple callback queues
	can be provided using any number of schemes.
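
For readers unfamiliar with the single-queue arrangement, the sketch below models it in userspace.  The names are hypothetical (struct cb_queue, cb_enqueue(), cb_dequeue_all()), and a C11 atomic_flag spinlock stands in for the kernel's raw spinlock.  The queue is a singly linked list with a tail pointer that initially refers to the head, so enqueue is constant-time and the consumer can detach the whole list in one step.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical callback header, loosely modeled on struct rcu_head. */
struct cb {
	struct cb *next;
	void (*func)(struct cb *cbp);
};

struct cb_queue {
	struct cb *head;
	struct cb **tail;	/* Points at the last ->next, or at head. */
	atomic_flag lock;	/* The single lock guarding the queue. */
};

#define CB_QUEUE_INIT(q) \
	{ .head = NULL, .tail = &(q).head, .lock = ATOMIC_FLAG_INIT }

static struct cb_queue queue = CB_QUEUE_INIT(queue);
static int cb_count;	/* Bumped by the demo callback below. */

static void q_lock(struct cb_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;	/* Spin until the lock is free. */
}

static void q_unlock(struct cb_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Append one callback; every enqueuer contends on the same lock. */
static void cb_enqueue(struct cb_queue *q, struct cb *cbp,
		       void (*func)(struct cb *cbp))
{
	assert(func);
	cbp->next = NULL;
	cbp->func = func;
	q_lock(q);
	*q->tail = cbp;
	q->tail = &cbp->next;
	q_unlock(q);
}

/* Detach the whole list, leaving the queue empty for new arrivals. */
static struct cb *cb_dequeue_all(struct cb_queue *q)
{
	struct cb *list;

	q_lock(q);
	list = q->head;
	q->head = NULL;
	q->tail = &q->head;
	q_unlock(q);
	return list;
}

/* Invoke callbacks in queue order; returns how many were invoked. */
static int cb_invoke_all(struct cb *list)
{
	int n = 0;

	while (list) {
		struct cb *next = list->next;	/* func may reuse the node. */

		list->func(list);
		list = next;
		n++;
	}
	return n;
}

/* Demo callback: just counts invocations. */
static void count_cb(struct cb *cbp)
{
	(void)cbp;
	cb_count++;
}
```

When all enqueues come from one task, as expected for the early use cases, the lock in cb_enqueue() is essentially uncontended, which is why the single lock matters little there.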

Perhaps most important, this variant of RCU does not affect the vanilla
flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
readers can operate from idle, offline, and exception entry/exit in no
way enables rcu_preempt and rcu_sched readers to do so.

The memory ordering was outlined here:
https://lore.kernel.org/lkml/20200319034030.GX3199@paulmck-ThinkPad-P72/

This effort benefited greatly from off-list discussions of BPF
requirements with Alexei Starovoitov and Andrii Nakryiko.  At least
some of the on-list discussions are captured in the Link: tags below.
In addition, KCSAN was quite helpful in finding some early bugs.

Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/
Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Apply feedback from Steve Rostedt and Joel Fernandes. ]
[ paulmck: Decrement trc_n_readers_need_end upon IPI failure. ]
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andriin@fb.com>
---
 include/linux/rcupdate_trace.h |  84 ++++++++++
 include/linux/sched.h          |   8 +
 init/init_task.c               |   4 +
 kernel/fork.c                  |   4 +
 kernel/rcu/Kconfig             |  11 +-
 kernel/rcu/tasks.h             | 357 ++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 463 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/rcupdate_trace.h

diff --git a/include/linux/rcupdate_trace.h b/include/linux/rcupdate_trace.h
new file mode 100644
index 0000000..ed97e10
--- /dev/null
+++ b/include/linux/rcupdate_trace.h
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * Read-Copy Update mechanism for mutual exclusion, adapted for tracing.
+ *
+ * Copyright (C) 2020 Paul E. McKenney.
+ */
+
+#ifndef __LINUX_RCUPDATE_TRACE_H
+#define __LINUX_RCUPDATE_TRACE_H
+
+#include <linux/sched.h>
+#include <linux/rcupdate.h>
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+
+extern struct lockdep_map rcu_trace_lock_map;
+
+static inline int rcu_read_lock_trace_held(void)
+{
+	return lock_is_held(&rcu_trace_lock_map);
+}
+
+#else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+static inline int rcu_read_lock_trace_held(void)
+{
+	return 1;
+}
+
+#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+#ifdef CONFIG_TASKS_TRACE_RCU
+
+void rcu_read_unlock_trace_special(struct task_struct *t);
+
+/**
+ * rcu_read_lock_trace - mark beginning of RCU-trace read-side critical section
+ *
+ * When synchronize_rcu_tasks_trace() is invoked by one task, then that
+ * task is guaranteed to block until all other tasks exit their read-side
+ * critical sections.  Similarly, if call_rcu_tasks_trace() is invoked on
+ * one task while other tasks are within RCU read-side critical sections,
+ * invocation of the corresponding RCU callback is deferred until after
+ * all the other tasks exit their critical sections.
+ *
+ * For more details, please see the documentation for rcu_read_lock().
+ */
+static inline void rcu_read_lock_trace(void)
+{
+	struct task_struct *t = current;
+
+	WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + 1);
+	rcu_lock_acquire(&rcu_trace_lock_map);
+}
+
+/**
+ * rcu_read_unlock_trace - mark end of RCU-trace read-side critical section
+ *
+ * Pairs with a preceding call to rcu_read_lock_trace(), and nesting is
+ * allowed.  Invoking a rcu_read_unlock_trace() when there is no matching
+ * rcu_read_lock_trace() is verboten, and will result in lockdep complaints.
+ *
+ * For more details, please see the documentation for rcu_read_unlock().
+ */
+static inline void rcu_read_unlock_trace(void)
+{
+	int nesting;
+	struct task_struct *t = current;
+
+	rcu_lock_release(&rcu_trace_lock_map);
+	nesting = READ_ONCE(t->trc_reader_nesting) - 1;
+	WRITE_ONCE(t->trc_reader_nesting, nesting);
+	if (likely(!READ_ONCE(t->trc_reader_need_end)) || nesting)
+		return;  // We assume shallow reader nesting.
+	rcu_read_unlock_trace_special(t);
+}
+
+void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
+void synchronize_rcu_tasks_trace(void);
+void rcu_barrier_tasks_trace(void);
+
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
+
+#endif /* __LINUX_RCUPDATE_TRACE_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 621e4aa..ef68ae4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -722,6 +722,14 @@ struct task_struct {
 	struct list_head		rcu_tasks_holdout_list;
 #endif /* #ifdef CONFIG_TASKS_RCU */
 
+#ifdef CONFIG_TASKS_TRACE_RCU
+	int				trc_reader_nesting;
+	int				trc_ipi_to_cpu;
+	bool				trc_reader_need_end;
+	bool				trc_reader_checked;
+	struct list_head		trc_holdout_list;
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
+
 	struct sched_info		sched_info;
 
 	struct list_head		tasks;
diff --git a/init/init_task.c b/init/init_task.c
index 096191d..1b9ec3d 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -140,6 +140,10 @@ struct task_struct init_task
 	.rcu_tasks_holdout_list = LIST_HEAD_INIT(init_task.rcu_tasks_holdout_list),
 	.rcu_tasks_idle_cpu = -1,
 #endif
+#ifdef CONFIG_TASKS_TRACE_RCU
+	.trc_reader_nesting = 0,
+	.trc_holdout_list = LIST_HEAD_INIT(init_task.trc_holdout_list),
+#endif
 #ifdef CONFIG_CPUSETS
 	.mems_allowed_seq = SEQCNT_ZERO(init_task.mems_allowed_seq),
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 00b1cdd..97df86b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1685,6 +1685,10 @@ static inline void rcu_copy_process(struct task_struct *p)
 	INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
 	p->rcu_tasks_idle_cpu = -1;
 #endif /* #ifdef CONFIG_TASKS_RCU */
+#ifdef CONFIG_TASKS_TRACE_RCU
+	p->trc_reader_nesting = 0;
+	INIT_LIST_HEAD(&p->trc_holdout_list);
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
 struct pid *pidfd_pid(const struct file *file)
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 6ee6372..cb1d18e 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -71,7 +71,7 @@ config TREE_SRCU
 	  This option selects the full-fledged version of SRCU.
 
 config TASKS_RCU_GENERIC
-	def_bool TASKS_RCU || TASKS_RUDE_RCU
+	def_bool TASKS_RCU || TASKS_RUDE_RCU || TASKS_TRACE_RCU
 	select SRCU
 	help
 	  This option enables generic infrastructure code supporting
@@ -93,6 +93,15 @@ config TASKS_RUDE_RCU
 	  switches on all online CPUs, including idle ones, so use
 	  with caution.
 
+config TASKS_TRACE_RCU
+	def_bool 0
+	help
+	  This option enables a task-based RCU implementation that uses
+	  explicit rcu_read_lock_trace() read-side markers, and allows
+	  these readers to appear in the idle loop as well as on the CPU
+	  hotplug code paths.  It can force IPIs on online CPUs, including
+	  idle ones, so use with caution.
+
 config RCU_STALL_COMMON
 	def_bool TREE_RCU
 	help
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index d8b09d5..a5ed7e2 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -181,12 +181,17 @@ void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
 	preempt_enable();
 }
 
+static void exit_tasks_rcu_finish_trace(struct task_struct *t);
+
 /* Do the srcu_read_unlock() for the above synchronize_srcu().  */
 void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
 {
+	struct task_struct *t = current;
+
 	preempt_disable();
-	__srcu_read_unlock(&tasks_rcu_exit_srcu, current->rcu_tasks_idx);
+	__srcu_read_unlock(&tasks_rcu_exit_srcu, t->rcu_tasks_idx);
 	preempt_enable();
+	exit_tasks_rcu_finish_trace(t);
 }
 
 #ifndef CONFIG_TINY_RCU
@@ -196,15 +201,19 @@ void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
  */
 static void __init rcu_tasks_bootup_oddness(void)
 {
-#ifdef CONFIG_TASKS_RCU
+#if defined(CONFIG_TASKS_RCU) || defined(CONFIG_TASKS_TRACE_RCU)
 	if (rcu_task_stall_timeout != RCU_TASK_STALL_TIMEOUT)
 		pr_info("\tTasks-RCU CPU stall warnings timeout set to %d (rcu_task_stall_timeout).\n", rcu_task_stall_timeout);
-	else
-		pr_info("\tTasks RCU enabled.\n");
+#endif /* #if defined(CONFIG_TASKS_RCU) || defined(CONFIG_TASKS_TRACE_RCU) */
+#ifdef CONFIG_TASKS_RCU
+	pr_info("\tTrampoline variant of Tasks RCU enabled.\n");
 #endif /* #ifdef CONFIG_TASKS_RCU */
 #ifdef CONFIG_TASKS_RUDE_RCU
 	pr_info("\tRude variant of Tasks RCU enabled.\n");
 #endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
+#ifdef CONFIG_TASKS_TRACE_RCU
+	pr_info("\tTracing variant of Tasks RCU enabled.\n");
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
 #endif /* #ifndef CONFIG_TINY_RCU */
@@ -569,3 +578,343 @@ static int __init rcu_spawn_tasks_rude_kthread(void)
 core_initcall(rcu_spawn_tasks_rude_kthread);
 
 #endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
+
+////////////////////////////////////////////////////////////////////////
+//
+// Tracing variant of Tasks RCU.  This variant is designed to be used
+// to protect tracing hooks, including those of BPF.  This variant
+// therefore:
+//
+// 1.	Has explicit read-side markers to allow finite grace periods
+//	in the face of in-kernel loops for PREEMPT=n builds.
+//
+// 2.	Protects code in the idle loop, exception entry/exit, and
+//	CPU-hotplug code paths, similar to the capabilities of SRCU.
+//
+// 3.	Avoids expensive read-side instructions, having overhead similar
+//	to that of Preemptible RCU.
+//
+// There are of course downsides.  The grace-period code can send IPIs to
+// CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
+// It is necessary to scan the full tasklist, much as for Tasks RCU.  There
+// is a single callback queue guarded by a single lock, again, much as for
+// Tasks RCU.  If needed, these downsides can be at least partially remedied.
+//
+// Perhaps most important, this variant of RCU does not affect the vanilla
+// flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
+// readers can operate from idle, offline, and exception entry/exit in no
+// way allows rcu_preempt and rcu_sched readers to also do so.
+
+// The lockdep state must be outside of #ifdef to be useful.
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+static struct lock_class_key rcu_lock_trace_key;
+struct lockdep_map rcu_trace_lock_map =
+	STATIC_LOCKDEP_MAP_INIT("rcu_read_lock_trace", &rcu_lock_trace_key);
+EXPORT_SYMBOL_GPL(rcu_trace_lock_map);
+#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+#ifdef CONFIG_TASKS_TRACE_RCU
+
+atomic_t trc_n_readers_need_end;	// Number of waited-for readers.
+DECLARE_WAIT_QUEUE_HEAD(trc_wait);	// Grace-period kthread waits here.
+
+// Record outstanding IPIs to each CPU.  No point in sending two...
+static DEFINE_PER_CPU(bool, trc_ipi_to_cpu);
+
+/* If we are the last reader, wake up the grace-period kthread. */
+void rcu_read_unlock_trace_special(struct task_struct *t)
+{
+	WRITE_ONCE(t->trc_reader_need_end, false);
+	if (atomic_dec_and_test(&trc_n_readers_need_end))
+		wake_up(&trc_wait);
+}
+EXPORT_SYMBOL_GPL(rcu_read_unlock_trace_special);
+
+/* Add a task to the holdout list, if it is not already on the list. */
+static void trc_add_holdout(struct task_struct *t, struct list_head *bhp)
+{
+	if (list_empty(&t->trc_holdout_list)) {
+		get_task_struct(t);
+		list_add(&t->trc_holdout_list, bhp);
+	}
+}
+
+/* Remove a task from the holdout list, if it is in fact present. */
+static void trc_del_holdout(struct task_struct *t)
+{
+	if (!list_empty(&t->trc_holdout_list)) {
+		list_del_init(&t->trc_holdout_list);
+		put_task_struct(t);
+	}
+}
+
+/* IPI handler to check task state. */
+static void trc_read_check_handler(void *t_in)
+{
+	struct task_struct *t = current;
+	struct task_struct *texp = t_in;
+
+	// If the task is no longer running on this CPU, leave.
+	if (unlikely(texp != t)) {
+		if (WARN_ON_ONCE(atomic_dec_and_test(&trc_n_readers_need_end)))
+			wake_up(&trc_wait);
+		goto reset_ipi; // Already on holdout list, so will check later.
+	}
+
+	// If the task is not in a read-side critical section, and
+	// if this is the last reader, awaken the grace-period kthread.
+	if (likely(!t->trc_reader_nesting)) {
+		if (WARN_ON_ONCE(atomic_dec_and_test(&trc_n_readers_need_end)))
+			wake_up(&trc_wait);
+		// Mark as checked after decrement to avoid false
+		// positives on the above WARN_ON_ONCE().
+		WRITE_ONCE(t->trc_reader_checked, true);
+		goto reset_ipi;
+	}
+	WRITE_ONCE(t->trc_reader_checked, true);
+
+	// Get here if the task is in a read-side critical section.  Set
+	// its state so that it will awaken the grace-period kthread upon
+	// exit from that critical section.
+	WARN_ON_ONCE(t->trc_reader_need_end);
+	WRITE_ONCE(t->trc_reader_need_end, true);
+
+reset_ipi:
+	// Allow future IPIs to be sent on CPU and for task.
+	// Also order this IPI handler against any later manipulations of
+	// the intended task.
+	smp_store_release(&per_cpu(trc_ipi_to_cpu, smp_processor_id()), false); // ^^^
+	smp_store_release(&texp->trc_ipi_to_cpu, -1); // ^^^
+}
+
+/* Callback function for scheduler to check locked-down task.  */
+static bool trc_inspect_reader(struct task_struct *t, void *arg)
+{
+	if (task_curr(t))
+		return false;  // It is running, so decline to inspect it.
+
+	// Mark as checked.  Because this is called from the grace-period
+	// kthread, also remove the task from the holdout list.
+	t->trc_reader_checked = true;
+	trc_del_holdout(t);
+
+	// If the task is in a read-side critical section, set up
+	// its state so that it will awaken the grace-period kthread upon
+	// exit from that critical section.
+	if (unlikely(t->trc_reader_nesting)) {
+		atomic_inc(&trc_n_readers_need_end); // One more to wait on.
+		WARN_ON_ONCE(t->trc_reader_need_end);
+		WRITE_ONCE(t->trc_reader_need_end, true);
+	}
+	return true;
+}
+
+/* Attempt to extract the state for the specified task. */
+static void trc_wait_for_one_reader(struct task_struct *t,
+				    struct list_head *bhp)
+{
+	int cpu;
+
+	// If a previous IPI is still in flight, let it complete.
+	if (smp_load_acquire(&t->trc_ipi_to_cpu) != -1) // Order IPI
+		return;
+
+	// The current task had better be in a quiescent state.
+	if (t == current) {
+		t->trc_reader_checked = true;
+		trc_del_holdout(t);
+		WARN_ON_ONCE(t->trc_reader_nesting);
+		return;
+	}
+
+	// Attempt to nail down the task for inspection.
+	if (try_invoke_on_locked_down_task(t, trc_inspect_reader, NULL))
+		return;
+
+	// If currently running, send an IPI, either way, add to list.
+	trc_add_holdout(t, bhp);
+	if (task_curr(t)) {
+		// The task is currently running, so try IPIing it.
+		cpu = task_cpu(t);
+
+		// If there is already an IPI outstanding, let it happen.
+		if (per_cpu(trc_ipi_to_cpu, cpu) || t->trc_ipi_to_cpu >= 0)
+			return;
+
+		atomic_inc(&trc_n_readers_need_end);
+		per_cpu(trc_ipi_to_cpu, cpu) = true;
+		t->trc_ipi_to_cpu = cpu;
+		if (smp_call_function_single(cpu,
+					     trc_read_check_handler, t, 0)) {
+			// Just in case there is some other reason for
+			// failure than the target CPU being offline.
+			per_cpu(trc_ipi_to_cpu, cpu) = false;
+			t->trc_ipi_to_cpu = -1; // IPI was not sent, so allow a retry.
+			if (atomic_dec_and_test(&trc_n_readers_need_end)) {
+				WARN_ON_ONCE(1);
+				wake_up(&trc_wait);
+			}
+		}
+	}
+}
+
+/* Initialize for a new RCU-tasks-trace grace period. */
+static void rcu_tasks_trace_pregp_step(void)
+{
+	int cpu;
+
+	// Wait for CPU-hotplug paths to complete.
+	cpus_read_lock();
+	cpus_read_unlock();
+
+	// Allow for fast-acting IPIs.
+	atomic_set(&trc_n_readers_need_end, 1);
+
+	// There shouldn't be any old IPIs, but...
+	for_each_possible_cpu(cpu)
+		WARN_ON_ONCE(per_cpu(trc_ipi_to_cpu, cpu));
+}
+
+/* Do first-round processing for the specified task. */
+static void rcu_tasks_trace_pertask(struct task_struct *t,
+				    struct list_head *hop)
+{
+	WRITE_ONCE(t->trc_reader_need_end, false);
+	t->trc_reader_checked = false;
+	t->trc_ipi_to_cpu = -1;
+	trc_wait_for_one_reader(t, hop);
+}
+
+/* Do intermediate processing between task and holdout scans. */
+static void rcu_tasks_trace_postscan(void)
+{
+	// Wait for late-stage exiting tasks to finish exiting.
+	// These might have passed the call to exit_tasks_rcu_finish().
+	synchronize_rcu();
+	// Any tasks that exit after this point will set ->trc_reader_checked.
+}
+
+/* Do one scan of the holdout list. */
+static void check_all_holdout_tasks_trace(struct list_head *hop,
+					  bool ndrpt, bool *frptp)
+{
+	struct task_struct *g, *t;
+
+	list_for_each_entry_safe(t, g, hop, trc_holdout_list) {
+		// If safe and needed, try to check the current task.
+		if (READ_ONCE(t->trc_ipi_to_cpu) == -1 &&
+		    !READ_ONCE(t->trc_reader_checked))
+			trc_wait_for_one_reader(t, hop);
+
+		// If check succeeded, remove this task from the list.
+		if (READ_ONCE(t->trc_reader_checked))
+			trc_del_holdout(t);
+	}
+}
+
+/* Wait for grace period to complete and provide ordering. */
+static void rcu_tasks_trace_postgp(void)
+{
+	// Remove the safety count.
+	smp_mb__before_atomic();  // Order vs. earlier atomics
+	atomic_dec(&trc_n_readers_need_end);
+	smp_mb__after_atomic();  // Order vs. later atomics
+
+	// Wait for readers.
+	wait_event_idle_exclusive(trc_wait,
+				  atomic_read(&trc_n_readers_need_end) == 0);
+
+	smp_mb(); // Caller's code must be ordered after wakeup.
+}
+
+/* Report any needed quiescent state for this exiting task. */
+void exit_tasks_rcu_finish_trace(struct task_struct *t)
+{
+	WRITE_ONCE(t->trc_reader_checked, true);
+	WARN_ON_ONCE(t->trc_reader_nesting);
+	WRITE_ONCE(t->trc_reader_nesting, 0);
+	if (WARN_ON_ONCE(READ_ONCE(t->trc_reader_need_end)))
+		rcu_read_unlock_trace_special(t);
+}
+
+void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
+DEFINE_RCU_TASKS(rcu_tasks_trace, rcu_tasks_wait_gp, call_rcu_tasks_trace,
+		 "RCU Tasks Trace");
+
+/**
+ * call_rcu_tasks_trace() - Queue a callback for a trace rcu-tasks grace period
+ * @rhp: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full trace
+ * rcu-tasks grace period elapses, in other words after all currently
+ * executing trace rcu-tasks read-side critical sections have completed.
+ * Unlike the critical sections of Tasks RCU, which end at context switch,
+ * cond_resched_rcu_qs(), or transition to usermode execution, these
+ * read-side critical sections are explicitly delimited by calls to
+ * rcu_read_lock_trace() and rcu_read_unlock_trace(), which play the
+ * roles that rcu_read_lock() and rcu_read_unlock() play for vanilla
+ * RCU.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
+ */
+void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func)
+{
+	call_rcu_tasks_generic(rhp, func, &rcu_tasks_trace);
+}
+EXPORT_SYMBOL_GPL(call_rcu_tasks_trace);
+
+/**
+ * synchronize_rcu_tasks_trace - wait for a trace rcu-tasks grace period
+ *
+ * Control will return to the caller some time after a trace rcu-tasks
+ * grace period has elapsed, in other words after all currently
+ * executing trace rcu-tasks read-side critical sections have completed.
+ * These read-side critical sections are delimited by calls to
+ * rcu_read_lock_trace() and rcu_read_unlock_trace(), not by
+ * schedule(), cond_resched_tasks_rcu_qs(), or userspace execution.
+ *
+ * This is a very specialized primitive, intended only for a few uses in
+ * tracing and other situations requiring manipulation of function preambles
+ * and profiling hooks.  The synchronize_rcu_tasks_trace() function is not
+ * (yet) intended for heavy use from multiple CPUs.
+ *
+ * See the description of synchronize_rcu() for more detailed information
+ * on memory ordering guarantees.
+ */
+void synchronize_rcu_tasks_trace(void)
+{
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_trace_lock_map), "Illegal synchronize_rcu_tasks_trace() in RCU Tasks Trace read-side critical section");
+	synchronize_rcu_tasks_generic(&rcu_tasks_trace);
+}
+EXPORT_SYMBOL_GPL(synchronize_rcu_tasks_trace);
+
+/**
+ * rcu_barrier_tasks_trace - Wait for in-flight call_rcu_tasks_trace() callbacks.
+ *
+ * Although the current implementation is guaranteed to wait, it is not
+ * obligated to, for example, if there are no pending callbacks.
+ */
+void rcu_barrier_tasks_trace(void)
+{
+	/* There is only one callback queue, so this is easy.  ;-) */
+	synchronize_rcu_tasks_trace();
+}
+EXPORT_SYMBOL_GPL(rcu_barrier_tasks_trace);
+
+static int __init rcu_spawn_tasks_trace_kthread(void)
+{
+	rcu_tasks_trace.pregp_func = rcu_tasks_trace_pregp_step;
+	rcu_tasks_trace.pertask_func = rcu_tasks_trace_pertask;
+	rcu_tasks_trace.postscan_func = rcu_tasks_trace_postscan;
+	rcu_tasks_trace.holdouts_func = check_all_holdout_tasks_trace;
+	rcu_tasks_trace.postgp_func = rcu_tasks_trace_postgp;
+	rcu_spawn_tasks_kthread_generic(&rcu_tasks_trace);
+	return 0;
+}
+core_initcall(rcu_spawn_tasks_trace_kthread);
+
+#else /* #ifdef CONFIG_TASKS_TRACE_RCU */
+void exit_tasks_rcu_finish_trace(struct task_struct *t) { }
+#endif /* #else #ifdef CONFIG_TASKS_TRACE_RCU */
-- 
2.9.5
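
The reader accounting in the patch above can be hard to follow from the
diff alone, so here is a minimal userspace sketch of how
trc_n_readers_need_end appears to be used.  It starts at 1 (the "safety
count" set in rcu_tasks_trace_pregp_step()), goes up by one for each
reader that must be waited on, and whoever brings it to zero wakes the
grace-period kthread.  The model_* names and the gp_kthread_awakened flag
are made up for illustration, C11 atomics stand in for the kernel's
atomic_t, and the IPI and holdout-list machinery is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these names are kernel APIs. */
static atomic_int n_readers_need_end;	/* Models trc_n_readers_need_end. */
static bool gp_kthread_awakened;	/* Models wake_up(&trc_wait). */

/* Models rcu_tasks_trace_pregp_step(): start with a safety count of 1. */
static void model_pregp(void)
{
	atomic_store(&n_readers_need_end, 1);
	gp_kthread_awakened = false;
}

/* Models finding a task that is inside a read-side critical section. */
static void model_found_reader(void)
{
	atomic_fetch_add(&n_readers_need_end, 1);
}

/* Models rcu_read_unlock_trace_special(): the last one out does the wakeup. */
static void model_reader_done(void)
{
	int old = atomic_fetch_sub(&n_readers_need_end, 1);

	assert(old > 0);	/* The count must never go negative. */
	if (old == 1)
		gp_kthread_awakened = true;
}

/* Models the start of rcu_tasks_trace_postgp(): drop the safety count. */
static bool model_postgp_must_wait(void)
{
	atomic_fetch_sub(&n_readers_need_end, 1);
	return atomic_load(&n_readers_need_end) != 0;
}
```

Because the count starts at 1, readers that finish while the task scan is
still in progress cannot drive it to zero early.  Only after the scan,
once the safety count has been dropped, can the final decrement trigger
the wakeup.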



* [PATCH v3 tip/core/rcu 15/34] rcutorture: Add torture tests for RCU Tasks Trace
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (13 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 14/34] rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 16/34] rcu-tasks: Add stall warnings " paulmck
                       ` (19 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds the definitions required to torture the tracing flavor
of RCU tasks.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
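
For readers unfamiliar with rcutorture, the following userspace sketch
shows the general shape of the ops-table pattern that this patch plugs
into: each flavor supplies a structure of function pointers plus a name,
and the name is what the TRACE01.boot parameter
rcutorture.torture_type=tasks-tracing selects.  The structure here is a
cut-down, hypothetical analogue of struct rcu_torture_ops, all demo_* and
find_ops names are invented, and matching by strcmp() is an assumption
about how the lookup is done rather than a quote from rcutorture.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down analogue of struct rcu_torture_ops. */
struct flavor_ops {
	int (*readlock)(void);
	void (*readunlock)(int idx);
	const char *name;
};

static int nesting;	/* Stand-in for a per-task reader nesting depth. */

static int demo_read_lock(void)
{
	nesting++;
	return 0;	/* The index is unused for this flavor. */
}

static void demo_read_unlock(int idx)
{
	(void)idx;
	assert(nesting > 0);	/* Unlock must pair with an earlier lock. */
	nesting--;
}

static struct flavor_ops demo_tracing_ops = {
	.readlock	= demo_read_lock,
	.readunlock	= demo_read_unlock,
	.name		= "tasks-tracing",
};

static struct flavor_ops *all_ops[] = { &demo_tracing_ops };

/* Return the ops whose name matches the requested type, or NULL. */
static struct flavor_ops *find_ops(const char *type)
{
	size_t i;

	for (i = 0; i < sizeof(all_ops) / sizeof(all_ops[0]); i++)
		if (strcmp(type, all_ops[i]->name) == 0)
			return all_ops[i];
	return NULL;
}
```

Returning an int from readlock and passing it back to readunlock keeps
one signature usable for flavors that need an index (as SRCU does) and
for flavors that ignore it, which is why the tracing wrappers in the
patch simply return 0.
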
 kernel/rcu/Kconfig.debug                           |  2 +
 kernel/rcu/rcu.h                                   |  1 +
 kernel/rcu/rcutorture.c                            | 44 +++++++++++++++++++++-
 .../selftests/rcutorture/configs/rcu/CFLIST        |  1 +
 .../selftests/rcutorture/configs/rcu/TRACE01       | 10 +++++
 .../selftests/rcutorture/configs/rcu/TRACE01.boot  |  1 +
 6 files changed, 58 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TRACE01
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot

diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index b15a3bd..a4db41d 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -25,6 +25,7 @@ config RCU_PERF_TEST
 	select SRCU
 	select TASKS_RCU
 	select TASKS_RUDE_RCU
+	select TASKS_TRACE_RCU
 	default n
 	help
 	  This option provides a kernel module that runs performance
@@ -43,6 +44,7 @@ config RCU_TORTURE_TEST
 	select SRCU
 	select TASKS_RCU
 	select TASKS_RUDE_RCU
+	select TASKS_TRACE_RCU
 	default n
 	help
 	  This option provides a kernel module that runs torture tests
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index c574620..72903867 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -442,6 +442,7 @@ enum rcutorture_type {
 	RCU_FLAVOR,
 	RCU_TASKS_FLAVOR,
 	RCU_TASKS_RUDE_FLAVOR,
+	RCU_TASKS_TRACING_FLAVOR,
 	RCU_TRIVIAL_FLAVOR,
 	SRCU_FLAVOR,
 	INVALID_RCU_FLAVOR
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 386cd11..bb6daa58 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -45,6 +45,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/oom.h>
 #include <linux/tick.h>
+#include <linux/rcupdate_trace.h>
 
 #include "rcu.h"
 
@@ -758,6 +759,45 @@ static struct rcu_torture_ops tasks_rude_ops = {
 	.name		= "tasks-rude"
 };
 
+/*
+ * Definitions for tracing RCU-tasks torture testing.
+ */
+
+static int tasks_tracing_torture_read_lock(void)
+{
+	rcu_read_lock_trace();
+	return 0;
+}
+
+static void tasks_tracing_torture_read_unlock(int idx)
+{
+	rcu_read_unlock_trace();
+}
+
+static void rcu_tasks_tracing_torture_deferred_free(struct rcu_torture *p)
+{
+	call_rcu_tasks_trace(&p->rtort_rcu, rcu_torture_cb);
+}
+
+static struct rcu_torture_ops tasks_tracing_ops = {
+	.ttype		= RCU_TASKS_TRACING_FLAVOR,
+	.init		= rcu_sync_torture_init,
+	.readlock	= tasks_tracing_torture_read_lock,
+	.read_delay	= srcu_read_delay,  /* just reuse srcu's version. */
+	.readunlock	= tasks_tracing_torture_read_unlock,
+	.get_gp_seq	= rcu_no_completed,
+	.deferred_free	= rcu_tasks_tracing_torture_deferred_free,
+	.sync		= synchronize_rcu_tasks_trace,
+	.exp_sync	= synchronize_rcu_tasks_trace,
+	.call		= call_rcu_tasks_trace,
+	.cb_barrier	= rcu_barrier_tasks_trace,
+	.fqs		= NULL,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.slow_gps	= 1,
+	.name		= "tasks-tracing"
+};
+
 static unsigned long rcutorture_seq_diff(unsigned long new, unsigned long old)
 {
 	if (!cur_ops->gp_diff)
@@ -1316,6 +1356,7 @@ static bool rcu_torture_one_read(struct torture_random_state *trsp)
 				  rcu_read_lock_bh_held() ||
 				  rcu_read_lock_sched_held() ||
 				  srcu_read_lock_held(srcu_ctlp) ||
+				  rcu_read_lock_trace_held() ||
 				  torturing_tasks());
 	if (p == NULL) {
 		/* Wait for rcu_torture_writer to get underway */
@@ -2435,7 +2476,8 @@ rcu_torture_init(void)
 	int firsterr = 0;
 	static struct rcu_torture_ops *torture_ops[] = {
 		&rcu_ops, &rcu_busted_ops, &srcu_ops, &srcud_ops,
-		&busted_srcud_ops, &tasks_ops, &tasks_rude_ops, &trivial_ops,
+		&busted_srcud_ops, &tasks_ops, &tasks_rude_ops,
+		&tasks_tracing_ops, &trivial_ops,
 	};
 
 	if (!torture_init_begin(torture_type, verbose))
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
index ec0c72f..dfb1817 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
+++ b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
@@ -15,3 +15,4 @@ TASKS01
 TASKS02
 TASKS03
 RUDE01
+TRACE01
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TRACE01 b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01
new file mode 100644
index 0000000..078e2c1
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01
@@ -0,0 +1,10 @@
+CONFIG_SMP=y
+CONFIG_NR_CPUS=4
+CONFIG_HOTPLUG_CPU=y
+CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_VOLUNTARY=n
+CONFIG_PREEMPT=n
+CONFIG_DEBUG_LOCK_ALLOC=y
+CONFIG_PROVE_LOCKING=y
+#CHECK#CONFIG_PROVE_RCU=y
+CONFIG_RCU_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot
new file mode 100644
index 0000000..9675ad6
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot
@@ -0,0 +1 @@
+rcutorture.torture_type=tasks-tracing
-- 
2.9.5



* [PATCH v3 tip/core/rcu 16/34] rcu-tasks: Add stall warnings for RCU Tasks Trace
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (14 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 15/34] rcutorture: Add torture tests for RCU Tasks Trace paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 17/34] rcu-tasks: Move #ifdef into tasks.h paulmck
                       ` (18 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds RCU CPU stall warnings for RCU Tasks Trace.  These
dump out any tasks blocking the current grace period, as well as any
CPUs that have not responded to an IPI request.  This happens in two
phases, when initially extracting state from the tasks and later when
waiting for any holdout tasks to check in.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
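
The stall report in this patch packs several flags into single
characters by indexing a two-character string literal with a 0-or-1
value.  A small userspace sketch of that idiom follows.  The
format_flags() helper and its parameters are hypothetical, and the sketch
treats any non-negative CPU number (including CPU 0) as "IPI
outstanding", on the grounds that ->trc_ipi_to_cpu uses -1 to mean "no
IPI".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: format three flags in the style of the stall
 * report.  Indexing ".X" with 0 yields '.', and with 1 yields 'X'.
 */
static int format_flags(char *buf, size_t len, int ipi_cpu, bool idle,
			bool nohz)
{
	assert(buf != NULL && len >= 4);	/* Three flags plus NUL. */
	return snprintf(buf, len, "%c%c%c",
			".I"[ipi_cpu >= 0],	/* IPI outstanding? */
			".i"[idle],		/* Idle task? */
			".N"[nohz]);		/* nohz_full CPU? */
}
```

The index expression must evaluate to exactly 0 or 1, which is why the
patch applies !! to values that are not already boolean, such as
t->trc_reader_need_end.
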
 kernel/rcu/tasks.h | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index a5ed7e2..fc7f116 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -794,9 +794,41 @@ static void rcu_tasks_trace_postscan(void)
 	// Any tasks that exit after this point will set ->trc_reader_checked.
 }
 
+/* Show the state of a task stalling the current RCU tasks trace GP. */
+static void show_stalled_task_trace(struct task_struct *t, bool *firstreport)
+{
+	int cpu;
+
+	if (*firstreport) {
+		pr_err("INFO: rcu_tasks_trace detected stalls on tasks:\n");
+		*firstreport = false;
+	}
+	// FIXME: This should attempt to use try_invoke_on_nonrunning_task().
+	cpu = task_cpu(t);
+	pr_alert("P%d: %c%c%c nesting: %d%c cpu: %d\n",
+		 t->pid,
+		 ".I"[READ_ONCE(t->trc_ipi_to_cpu) >= 0], // CPU 0 is a valid target.
+		 ".i"[is_idle_task(t)],
+		 ".N"[cpu >= 0 && tick_nohz_full_cpu(cpu)],
+		 t->trc_reader_nesting,
+		 " N"[!!t->trc_reader_need_end],
+		 cpu);
+	sched_show_task(t);
+}
+
+/* List stalled IPIs for RCU tasks trace. */
+static void show_stalled_ipi_trace(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		if (per_cpu(trc_ipi_to_cpu, cpu))
+			pr_alert("\tIPI outstanding to CPU %d\n", cpu);
+}
+
 /* Do one scan of the holdout list. */
 static void check_all_holdout_tasks_trace(struct list_head *hop,
-					  bool ndrpt, bool *frptp)
+					  bool needreport, bool *firstreport)
 {
 	struct task_struct *g, *t;
 
@@ -809,21 +841,51 @@ static void check_all_holdout_tasks_trace(struct list_head *hop,
 		// If check succeeded, remove this task from the list.
 		if (READ_ONCE(t->trc_reader_checked))
 			trc_del_holdout(t);
+		else if (needreport)
+			show_stalled_task_trace(t, firstreport);
+	}
+	if (needreport) {
+		if (*firstreport) // Test the flag, not the (always non-NULL) pointer.
+			pr_err("INFO: rcu_tasks_trace detected stalls?\n");
+		show_stalled_ipi_trace();
 	}
 }
 
 /* Wait for grace period to complete and provide ordering. */
 static void rcu_tasks_trace_postgp(void)
 {
+	bool firstreport;
+	struct task_struct *g, *t;
+	LIST_HEAD(holdouts);
+	long ret;
+
 	// Remove the safety count.
 	smp_mb__before_atomic();  // Order vs. earlier atomics
 	atomic_dec(&trc_n_readers_need_end);
 	smp_mb__after_atomic();  // Order vs. later atomics
 
 	// Wait for readers.
-	wait_event_idle_exclusive(trc_wait,
-				  atomic_read(&trc_n_readers_need_end) == 0);
-
+	for (;;) {
+		ret = wait_event_idle_exclusive_timeout(
+				trc_wait,
+				atomic_read(&trc_n_readers_need_end) == 0,
+				READ_ONCE(rcu_task_stall_timeout));
+		if (ret)
+			break;  // Count reached zero.
+		for_each_process_thread(g, t)
+			if (READ_ONCE(t->trc_reader_need_end))
+				trc_add_holdout(t, &holdouts);
+		firstreport = true;
+		list_for_each_entry_safe(t, g, &holdouts, trc_holdout_list)
+			if (READ_ONCE(t->trc_reader_need_end)) {
+				show_stalled_task_trace(t, &firstreport);
+				trc_del_holdout(t);
+			}
+		if (firstreport)
+			pr_err("INFO: rcu_tasks_trace detected stalls?\n");
+		show_stalled_ipi_trace();
+		pr_err("\t%d holdouts\n", atomic_read(&trc_n_readers_need_end));
+	}
 	smp_mb(); // Caller's code must be ordered after wakeup.
 }
 
-- 
2.9.5



* [PATCH v3 tip/core/rcu 17/34] rcu-tasks: Move #ifdef into tasks.h
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (15 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 16/34] rcu-tasks: Add stall warnings " paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 18/34] rcu-tasks: Add RCU tasks to rcutorture writer stall output paulmck
                       ` (17 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit pushes the #ifdef CONFIG_TASKS_RCU_GENERIC from
kernel/rcu/update.c to kernel/rcu/tasks.h in order to improve
readability as more APIs are added.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h  | 5 +++++
 kernel/rcu/update.c | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index fc7f116..b52a640 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -5,6 +5,7 @@
  * Copyright (C) 2020 Paul E. McKenney
  */
 
+#ifdef CONFIG_TASKS_RCU_GENERIC
 
 ////////////////////////////////////////////////////////////////////////
 //
@@ -980,3 +981,7 @@ core_initcall(rcu_spawn_tasks_trace_kthread);
 #else /* #ifdef CONFIG_TASKS_TRACE_RCU */
 void exit_tasks_rcu_finish_trace(struct task_struct *t) { }
 #endif /* #else #ifdef CONFIG_TASKS_TRACE_RCU */
+
+#else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
+static inline void rcu_tasks_bootup_oddness(void) {}
+#endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 16058a5..0fb2a9e 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -559,11 +559,7 @@ late_initcall(rcu_verify_early_boot_tests);
 void rcu_early_boot_tests(void) {}
 #endif /* CONFIG_PROVE_RCU */
 
-#ifdef CONFIG_TASKS_RCU_GENERIC
 #include "tasks.h"
-#else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
-static inline void rcu_tasks_bootup_oddness(void) {}
-#endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
 
 #ifndef CONFIG_TINY_RCU
 
-- 
2.9.5



* [PATCH v3 tip/core/rcu 18/34] rcu-tasks: Add RCU tasks to rcutorture writer stall output
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (16 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 17/34] rcu-tasks: Move #ifdef into tasks.h paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 19/34] rcu-tasks: Make rcutorture writer stall output include GP state paulmck
                       ` (16 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds state for each RCU-tasks flavor to the rcutorture
writer stall output.  The initial state is minimal, but you have to
start somewhere.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Fixes based on feedback from kbuild test robot. ]
---
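
The layering in this patch (one generic formatter, plus per-flavor
wrappers that pass a flavor-specific suffix string) can be sketched in
userspace as follows.  Everything named demo_* is hypothetical, the
structure is a cut-down stand-in for struct rcu_tasks, and the output
goes to a caller-supplied buffer instead of pr_info() so that it can be
inspected.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical cut-down analogue of struct rcu_tasks. */
struct demo_tasks {
	const char *kname;
	void *kthread_ptr;
	void *cbs_head;
};

/* Format the state common to all flavors, then the flavor's suffix s. */
static int demo_show_generic(char *out, size_t len,
			     const struct demo_tasks *rtp, const char *s)
{
	assert(out && rtp && s);
	return snprintf(out, len, "%s %c%c %s",
			rtp->kname,
			".k"[!!rtp->kthread_ptr],	/* Kthread spawned? */
			".C"[!!rtp->cbs_head],		/* Callbacks queued? */
			s);
}

/* Trace-flavor wrapper: the suffix is "N" plus the waited-for count. */
static int demo_show_trace(char *out, size_t len,
			   const struct demo_tasks *rtp, int n_readers)
{
	char buf[32];

	snprintf(buf, sizeof(buf), "N%d", n_readers);
	return demo_show_generic(out, len, rtp, buf);
}
```

Flavors with nothing extra to report would pass an empty string, as the
classic and rude wrappers in the patch do.
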
 kernel/rcu/rcu.h        |  1 +
 kernel/rcu/tasks.h      | 45 +++++++++++++++++++++++++++++++++++++++++++--
 kernel/rcu/tree_stall.h |  2 +-
 3 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 72903867..e1089fd 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -431,6 +431,7 @@ bool rcu_gp_is_expedited(void);  /* Internal RCU use. */
 void rcu_expedite_gp(void);
 void rcu_unexpedite_gp(void);
 void rcupdate_announce_bootup_oddness(void);
+void show_rcu_tasks_gp_kthreads(void);
 void rcu_request_urgent_qs_task(struct task_struct *t);
 #endif /* #else #ifdef CONFIG_TINY_RCU */
 
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index b52a640..7ce9a60 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -219,6 +219,16 @@ static void __init rcu_tasks_bootup_oddness(void)
 
 #endif /* #ifndef CONFIG_TINY_RCU */
 
+/* Dump out rcutorture-relevant state common to all RCU-tasks flavors. */
+static void show_rcu_tasks_generic_gp_kthread(struct rcu_tasks *rtp, char *s)
+{
+	pr_info("%s %c%c %s\n",
+		rtp->kname,
+		".k"[!!data_race(rtp->kthread_ptr)],
+		".C"[!!data_race(rtp->cbs_head)],
+		s);
+}
+
 #ifdef CONFIG_TASKS_RCU
 
 ////////////////////////////////////////////////////////////////////////
@@ -482,7 +492,14 @@ static int __init rcu_spawn_tasks_kthread(void)
 }
 core_initcall(rcu_spawn_tasks_kthread);
 
-#endif /* #ifdef CONFIG_TASKS_RCU */
+static void show_rcu_tasks_classic_gp_kthread(void)
+{
+	show_rcu_tasks_generic_gp_kthread(&rcu_tasks, "");
+}
+
+#else /* #ifdef CONFIG_TASKS_RCU */
+static void show_rcu_tasks_classic_gp_kthread(void) { }
+#endif /* #else #ifdef CONFIG_TASKS_RCU */
 
 #ifdef CONFIG_TASKS_RUDE_RCU
 
@@ -578,7 +595,14 @@ static int __init rcu_spawn_tasks_rude_kthread(void)
 }
 core_initcall(rcu_spawn_tasks_rude_kthread);
 
-#endif /* #ifdef CONFIG_TASKS_RUDE_RCU */
+static void show_rcu_tasks_rude_gp_kthread(void)
+{
+	show_rcu_tasks_generic_gp_kthread(&rcu_tasks_rude, "");
+}
+
+#else /* #ifdef CONFIG_TASKS_RUDE_RCU */
+static void show_rcu_tasks_rude_gp_kthread(void) {}
+#endif /* #else #ifdef CONFIG_TASKS_RUDE_RCU */
 
 ////////////////////////////////////////////////////////////////////////
 //
@@ -978,10 +1002,27 @@ static int __init rcu_spawn_tasks_trace_kthread(void)
 }
 core_initcall(rcu_spawn_tasks_trace_kthread);
 
+static void show_rcu_tasks_trace_gp_kthread(void)
+{
+	char buf[32];
+
+	sprintf(buf, "N%d", atomic_read(&trc_n_readers_need_end));
+	show_rcu_tasks_generic_gp_kthread(&rcu_tasks_trace, buf);
+}
+
 #else /* #ifdef CONFIG_TASKS_TRACE_RCU */
 void exit_tasks_rcu_finish_trace(struct task_struct *t) { }
+static inline void show_rcu_tasks_trace_gp_kthread(void) {}
 #endif /* #else #ifdef CONFIG_TASKS_TRACE_RCU */
 
+void show_rcu_tasks_gp_kthreads(void)
+{
+	show_rcu_tasks_classic_gp_kthread();
+	show_rcu_tasks_rude_gp_kthread();
+	show_rcu_tasks_trace_gp_kthread();
+}
+
 #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
 static inline void rcu_tasks_bootup_oddness(void) {}
+void show_rcu_tasks_gp_kthreads(void) {}
 #endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index e19487d..ec8e985 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -649,7 +649,7 @@ void show_rcu_gp_kthreads(void)
 		if (rcu_segcblist_is_offloaded(&rdp->cblist))
 			show_rcu_nocb_state(rdp);
 	}
-	/* sched_show_task(rcu_state.gp_kthread); */
+	show_rcu_tasks_gp_kthreads();
 }
 EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
 
-- 
2.9.5



* [PATCH v3 tip/core/rcu 19/34] rcu-tasks: Make rcutorture writer stall output include GP state
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (17 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 18/34] rcu-tasks: Add RCU tasks to rcutorture writer stall output paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 20/34] rcu-tasks: Make RCU Tasks Trace make use of RCU scheduler hooks paulmck
                       ` (15 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds grace-period state and time to the rcutorture writer
stall output.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
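
The state-name lookup added by this patch follows a common pattern: a
table of strings indexed by state number, with a bounds check that falls
back to "???" for values outside the table.  A userspace sketch under
stated assumptions: the DEMO_* states are a made-up three-entry subset,
DEMO_ARRAY_SIZE stands in for the kernel's ARRAY_SIZE(), and the explicit
size_t cast spells out the unsigned comparison that also rejects negative
values.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical three-state subset of the RTGS_* values. */
enum { DEMO_INIT, DEMO_WAIT_GP, DEMO_INVOKE_CBS };

static const char * const demo_state_names[] = {
	"DEMO_INIT",
	"DEMO_WAIT_GP",
	"DEMO_INVOKE_CBS",
};

/* Catch the table and the state list drifting apart at build time. */
static_assert(DEMO_ARRAY_SIZE(demo_state_names) == DEMO_INVOKE_CBS + 1,
	      "one name per state");

/* Map a state number to its name, or "???" if it is out of range. */
static const char *demo_state_getname(int state)
{
	/* As size_t, a negative state compares as a very large value. */
	if ((size_t)state >= DEMO_ARRAY_SIZE(demo_state_names))
		return "???";
	return demo_state_names[state];
}
```

Keeping the numeric definitions and the name table adjacent, as the patch
does, makes it easier to notice when a new state is added to one but not
the other.
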
 kernel/rcu/tasks.h | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 72 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 7ce9a60..7286582 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -17,7 +17,7 @@ typedef void (*pregp_func_t)(void);
 typedef void (*pertask_func_t)(struct task_struct *t, struct list_head *hop);
 typedef void (*postscan_func_t)(void);
 typedef void (*holdouts_func_t)(struct list_head *hop, bool ndrpt, bool *frptp);
-typedef void (*postgp_func_t)(void);
+typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
 
 /**
  * Definition for a Tasks-RCU-like mechanism.
@@ -27,6 +27,9 @@ typedef void (*postgp_func_t)(void);
  * @cbs_lock: Lock protecting callback list.
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
  * @gp_func: This flavor's grace-period-wait function.
+ * @gp_state: Grace period's most recent state transition (debugging).
+ * @gp_jiffies: Time of last @gp_state transition.
+ * @gp_start: Most recent grace-period start in jiffies.
  * @pregp_func: This flavor's pre-grace-period function (optional).
  * @pertask_func: This flavor's per-task scan function (optional).
  * @postscan_func: This flavor's post-task scan function (optional).
@@ -41,6 +44,8 @@ struct rcu_tasks {
 	struct rcu_head **cbs_tail;
 	struct wait_queue_head cbs_wq;
 	raw_spinlock_t cbs_lock;
+	int gp_state;
+	unsigned long gp_jiffies;
 	struct task_struct *kthread_ptr;
 	rcu_tasks_gp_func_t gp_func;
 	pregp_func_t pregp_func;
@@ -73,10 +78,56 @@ DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
 module_param(rcu_task_stall_timeout, int, 0644);
 
+/* RCU tasks grace-period state for debugging. */
+#define RTGS_INIT		 0
+#define RTGS_WAIT_WAIT_CBS	 1
+#define RTGS_WAIT_GP		 2
+#define RTGS_PRE_WAIT_GP	 3
+#define RTGS_SCAN_TASKLIST	 4
+#define RTGS_POST_SCAN_TASKLIST	 5
+#define RTGS_WAIT_SCAN_HOLDOUTS	 6
+#define RTGS_SCAN_HOLDOUTS	 7
+#define RTGS_POST_GP		 8
+#define RTGS_WAIT_READERS	 9
+#define RTGS_INVOKE_CBS		10
+#define RTGS_WAIT_CBS		11
+static const char * const rcu_tasks_gp_state_names[] = {
+	"RTGS_INIT",
+	"RTGS_WAIT_WAIT_CBS",
+	"RTGS_WAIT_GP",
+	"RTGS_PRE_WAIT_GP",
+	"RTGS_SCAN_TASKLIST",
+	"RTGS_POST_SCAN_TASKLIST",
+	"RTGS_WAIT_SCAN_HOLDOUTS",
+	"RTGS_SCAN_HOLDOUTS",
+	"RTGS_POST_GP",
+	"RTGS_WAIT_READERS",
+	"RTGS_INVOKE_CBS",
+	"RTGS_WAIT_CBS",
+};
+
 ////////////////////////////////////////////////////////////////////////
 //
 // Generic code.
 
+/* Record grace-period phase and time. */
+static void set_tasks_gp_state(struct rcu_tasks *rtp, int newstate)
+{
+	rtp->gp_state = newstate;
+	rtp->gp_jiffies = jiffies;
+}
+
+/* Return state name. */
+static const char *tasks_gp_state_getname(struct rcu_tasks *rtp)
+{
+	int i = data_race(rtp->gp_state); // Let KCSAN detect update races
+	int j = READ_ONCE(i); // Prevent the compiler from reading twice
+
+	if (j >= ARRAY_SIZE(rcu_tasks_gp_state_names))
+		return "???";
+	return rcu_tasks_gp_state_names[j];
+}
+
 // Enqueue a callback for the specified flavor of Tasks RCU.
 static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
 				   struct rcu_tasks *rtp)
@@ -141,15 +192,18 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 						 READ_ONCE(rtp->cbs_head));
 			if (!rtp->cbs_head) {
 				WARN_ON(signal_pending(current));
+				set_tasks_gp_state(rtp, RTGS_WAIT_WAIT_CBS);
 				schedule_timeout_interruptible(HZ/10);
 			}
 			continue;
 		}
 
 		// Wait for one grace period.
+		set_tasks_gp_state(rtp, RTGS_WAIT_GP);
 		rtp->gp_func(rtp);
 
 		/* Invoke the callbacks. */
+		set_tasks_gp_state(rtp, RTGS_INVOKE_CBS);
 		while (list) {
 			next = list->next;
 			local_bh_disable();
@@ -160,6 +214,8 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 		}
 		/* Paranoid sleep to keep this from entering a tight loop */
 		schedule_timeout_uninterruptible(HZ/10);
+
+		set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
 	}
 }
 
@@ -222,8 +278,11 @@ static void __init rcu_tasks_bootup_oddness(void)
 /* Dump out rcutorture-relevant state common to all RCU-tasks flavors. */
 static void show_rcu_tasks_generic_gp_kthread(struct rcu_tasks *rtp, char *s)
 {
-	pr_info("%s %c%c %s\n",
+	pr_info("%s: %s(%d) since %lu %c%c %s\n",
 		rtp->kname,
+		tasks_gp_state_getname(rtp),
+		data_race(rtp->gp_state),
+		jiffies - data_race(rtp->gp_jiffies),
 		".k"[!!data_race(rtp->kthread_ptr)],
 		".C"[!!data_race(rtp->cbs_head)],
 		s);
@@ -243,6 +302,7 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 	LIST_HEAD(holdouts);
 	int fract;
 
+	set_tasks_gp_state(rtp, RTGS_PRE_WAIT_GP);
 	rtp->pregp_func();
 
 	/*
@@ -251,11 +311,13 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 	 * that are not already voluntarily blocked.  Mark these tasks
 	 * and make a list of them in holdouts.
 	 */
+	set_tasks_gp_state(rtp, RTGS_SCAN_TASKLIST);
 	rcu_read_lock();
 	for_each_process_thread(g, t)
 		rtp->pertask_func(t, &holdouts);
 	rcu_read_unlock();
 
+	set_tasks_gp_state(rtp, RTGS_POST_SCAN_TASKLIST);
 	rtp->postscan_func();
 
 	/*
@@ -277,6 +339,7 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 			break;
 
 		/* Slowly back off waiting for holdouts */
+		set_tasks_gp_state(rtp, RTGS_WAIT_SCAN_HOLDOUTS);
 		schedule_timeout_interruptible(HZ/fract);
 
 		if (fract > 1)
@@ -288,10 +351,12 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 			lastreport = jiffies;
 		firstreport = true;
 		WARN_ON(signal_pending(current));
+		set_tasks_gp_state(rtp, RTGS_SCAN_HOLDOUTS);
 		rtp->holdouts_func(&holdouts, needreport, &firstreport);
 	}
 
-	rtp->postgp_func();
+	set_tasks_gp_state(rtp, RTGS_POST_GP);
+	rtp->postgp_func(rtp);
 }
 
 ////////////////////////////////////////////////////////////////////////
@@ -394,7 +459,7 @@ static void check_all_holdout_tasks(struct list_head *hop,
 }
 
 /* Finish off the Tasks-RCU grace period. */
-static void rcu_tasks_postgp(void)
+static void rcu_tasks_postgp(struct rcu_tasks *rtp)
 {
 	/*
 	 * Because ->on_rq and ->nvcsw are not guaranteed to have a full
@@ -877,7 +942,7 @@ static void check_all_holdout_tasks_trace(struct list_head *hop,
 }
 
 /* Wait for grace period to complete and provide ordering. */
-static void rcu_tasks_trace_postgp(void)
+static void rcu_tasks_trace_postgp(struct rcu_tasks *rtp)
 {
 	bool firstreport;
 	struct task_struct *g, *t;
@@ -890,6 +955,7 @@ static void rcu_tasks_trace_postgp(void)
 	smp_mb__after_atomic();  // Order vs. later atomics
 
 	// Wait for readers.
+	set_tasks_gp_state(rtp, RTGS_WAIT_READERS);
 	for (;;) {
 		ret = wait_event_idle_exclusive_timeout(
 				trc_wait,
@@ -897,6 +963,7 @@ static void rcu_tasks_trace_postgp(void)
 				READ_ONCE(rcu_task_stall_timeout));
 		if (ret)
 			break;  // Count reached zero.
+		// Stall warning time, so make a list of the offenders.
 		for_each_process_thread(g, t)
 			if (READ_ONCE(t->trc_reader_need_end))
 				trc_add_holdout(t, &holdouts);
-- 
2.9.5



* [PATCH v3 tip/core/rcu 20/34] rcu-tasks: Make RCU Tasks Trace make use of RCU scheduler hooks
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (18 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 19/34] rcu-tasks: Make rcutorture writer stall output include GP state paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 21/34] rcu-tasks: Add a grace-period start time for throttling and debug paulmck
                       ` (14 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit makes the calls to rcu_tasks_qs() detect and report
quiescent states for RCU tasks trace.  If the task is in a quiescent
state and if ->trc_reader_checked is not yet set, the task sets its own
->trc_reader_checked.  This will cause the grace-period kthread to
remove it from the holdout list if it still remains there.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
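As a rough userspace illustration of the check described above, the
following sketch mirrors the logic of rcu_tasks_trace_qs().  It is not the
kernel code: struct sketch_task and the sketch_*() functions are
hypothetical names, and C11 atomics stand in for READ_ONCE(),
smp_store_release(), and smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two task_struct fields involved. */
struct sketch_task {
	atomic_int  trc_reader_nesting;	/* Nonzero while inside a reader. */
	atomic_bool trc_reader_checked;	/* Set once a quiescent state is seen. */
};

/*
 * If the task has not yet been checked and is not in a read-side
 * critical section, mark it checked.  Returns true if this call
 * reported the quiescent state.
 */
bool sketch_trace_qs(struct sketch_task *t)
{
	if (atomic_load_explicit(&t->trc_reader_checked, memory_order_relaxed))
		return false;	/* Already reported. */
	if (atomic_load_explicit(&t->trc_reader_nesting, memory_order_relaxed))
		return false;	/* Still in a reader, so not quiescent. */
	atomic_store_explicit(&t->trc_reader_checked, true,
			      memory_order_release);
	atomic_thread_fence(memory_order_seq_cst);	/* Stands in for smp_mb(). */
	return true;
}

/* Usage: a task inside a reader reports nothing until it exits. */
void sketch_qs_selftest(void)
{
	struct sketch_task t;

	atomic_init(&t.trc_reader_nesting, 1);
	atomic_init(&t.trc_reader_checked, false);
	assert(!sketch_trace_qs(&t));	/* In a reader: nothing reported. */
	atomic_store(&t.trc_reader_nesting, 0);
	assert(sketch_trace_qs(&t));	/* Quiescent: flag gets set. */
	assert(!sketch_trace_qs(&t));	/* Second call changes nothing. */
}
```

Checking ->trc_reader_checked first keeps the common case (already
checked) down to a single load, which is presumably why the real macro
tests it before ->trc_reader_nesting.
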
 include/linux/rcupdate.h | 39 ++++++++++++++++++++++++++++++++-------
 include/linux/rcutiny.h  |  2 +-
 kernel/rcu/tasks.h       |  5 +++--
 kernel/rcu/tree_plugin.h |  6 ++----
 4 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 2be97a8..3598bbb 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -131,12 +131,37 @@ static inline void rcu_init_nohz(void) { }
  * This is a macro rather than an inline function to avoid #include hell.
  */
 #ifdef CONFIG_TASKS_RCU_GENERIC
-#define rcu_tasks_qs(t) \
-	do { \
-		if (READ_ONCE((t)->rcu_tasks_holdout)) \
-			WRITE_ONCE((t)->rcu_tasks_holdout, false); \
+
+# ifdef CONFIG_TASKS_RCU
+# define rcu_tasks_classic_qs(t, preempt)				\
+	do {								\
+		if (!(preempt) && READ_ONCE((t)->rcu_tasks_holdout))	\
+			WRITE_ONCE((t)->rcu_tasks_holdout, false);	\
 	} while (0)
-#define rcu_note_voluntary_context_switch(t) rcu_tasks_qs(t)
+# else
+# define rcu_tasks_classic_qs(t, preempt) do { } while (0)
+# endif
+
+# ifdef CONFIG_TASKS_TRACE_RCU
+# define rcu_tasks_trace_qs(t)						\
+	do {								\
+		if (!likely(READ_ONCE((t)->trc_reader_checked)) &&	\
+		    !unlikely(READ_ONCE((t)->trc_reader_nesting))) {	\
+			smp_store_release(&(t)->trc_reader_checked, true); \
+			smp_mb(); /* Readers partitioned by store. */	\
+		}							\
+	} while (0)
+# else
+# define rcu_tasks_trace_qs(t) do { } while (0)
+# endif
+
+#define rcu_tasks_qs(t, preempt)					\
+do {									\
+	rcu_tasks_classic_qs((t), (preempt));				\
+	rcu_tasks_trace_qs((t));					\
+} while (0)
+
+#define rcu_note_voluntary_context_switch(t) rcu_tasks_qs(t, false)
 void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
 void synchronize_rcu_tasks(void);
 void call_rcu_tasks_rude(struct rcu_head *head, rcu_callback_t func);
@@ -144,7 +169,7 @@ void synchronize_rcu_tasks_rude(void);
 void exit_tasks_rcu_start(void);
 void exit_tasks_rcu_finish(void);
 #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
-#define rcu_tasks_qs(t)	do { } while (0)
+#define rcu_tasks_qs(t, preempt) do { } while (0)
 #define rcu_note_voluntary_context_switch(t) do { } while (0)
 #define call_rcu_tasks call_rcu
 #define synchronize_rcu_tasks synchronize_rcu
@@ -161,7 +186,7 @@ static inline void exit_tasks_rcu_finish(void) { }
  */
 #define cond_resched_tasks_rcu_qs() \
 do { \
-	rcu_tasks_qs(current); \
+	rcu_tasks_qs(current, false); \
 	cond_resched(); \
 } while (0)
 
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 045c28b..d77e111 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -49,7 +49,7 @@ static inline void rcu_softirq_qs(void)
 #define rcu_note_context_switch(preempt) \
 	do { \
 		rcu_qs(); \
-		rcu_tasks_qs(current); \
+		rcu_tasks_qs(current, (preempt)); \
 	} while (0)
 
 static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 7286582..cbc9905 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -180,7 +180,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 
 		/* Pick up any new callbacks. */
 		raw_spin_lock_irqsave(&rtp->cbs_lock, flags);
-		smp_mb__after_unlock_lock(); // Order updates vs. GP.
+		smp_mb__after_spinlock(); // Order updates vs. GP.
 		list = rtp->cbs_head;
 		rtp->cbs_head = NULL;
 		rtp->cbs_tail = &rtp->cbs_head;
@@ -870,7 +870,7 @@ static void rcu_tasks_trace_pertask(struct task_struct *t,
 				    struct list_head *hop)
 {
 	WRITE_ONCE(t->trc_reader_need_end, false);
-	t->trc_reader_checked = false;
+	WRITE_ONCE(t->trc_reader_checked, false);
 	t->trc_ipi_to_cpu = -1;
 	trc_wait_for_one_reader(t, hop);
 }
@@ -979,6 +979,7 @@ static void rcu_tasks_trace_postgp(struct rcu_tasks *rtp)
 		pr_err("\t%d holdouts\n", atomic_read(&trc_n_readers_need_end));
 	}
 	smp_mb(); // Caller's code must be ordered after wakeup.
+		  // Pairs with pretty much every ordering primitive.
 }
 
 /* Report any needed quiescent state for this exiting task. */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 7cf76e8..9355536 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -331,8 +331,7 @@ void rcu_note_context_switch(bool preempt)
 	rcu_qs();
 	if (rdp->exp_deferred_qs)
 		rcu_report_exp_rdp(rdp);
-	if (!preempt)
-		rcu_tasks_qs(current);
+	rcu_tasks_qs(current, preempt);
 	trace_rcu_utilization(TPS("End context switch"));
 }
 EXPORT_SYMBOL_GPL(rcu_note_context_switch);
@@ -841,8 +840,7 @@ void rcu_note_context_switch(bool preempt)
 	this_cpu_write(rcu_data.rcu_urgent_qs, false);
 	if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs)))
 		rcu_momentary_dyntick_idle();
-	if (!preempt)
-		rcu_tasks_qs(current);
+	rcu_tasks_qs(current, preempt);
 out:
 	trace_rcu_utilization(TPS("End context switch"));
 }
-- 
2.9.5



* [PATCH v3 tip/core/rcu 21/34] rcu-tasks: Add a grace-period start time for throttling and debug
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (19 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 20/34] rcu-tasks: Make RCU Tasks Trace make use of RCU scheduler hooks paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 22/34] rcu-tasks: Provide boot parameter to delay IPIs until late in grace period paulmck
                       ` (13 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds a place to record the grace-period start in jiffies.
This will be used by later commits for debugging purposes and to throttle
IPIs early in the grace period.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index cbc9905..fa9c069 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -46,6 +46,7 @@ struct rcu_tasks {
 	raw_spinlock_t cbs_lock;
 	int gp_state;
 	unsigned long gp_jiffies;
+	unsigned long gp_start;
 	struct task_struct *kthread_ptr;
 	rcu_tasks_gp_func_t gp_func;
 	pregp_func_t pregp_func;
@@ -200,6 +201,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 
 		// Wait for one grace period.
 		set_tasks_gp_state(rtp, RTGS_WAIT_GP);
+		rtp->gp_start = jiffies;
 		rtp->gp_func(rtp);
 
 		/* Invoke the callbacks. */
-- 
2.9.5



* [PATCH v3 tip/core/rcu 22/34] rcu-tasks: Provide boot parameter to delay IPIs until late in grace period
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (20 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 21/34] rcu-tasks: Add a grace-period start time for throttling and debug paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 23/34] rcu-tasks: Split ->trc_reader_need_end paulmck
                       ` (12 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit provides a rcupdate.rcu_task_ipi_delay kernel boot parameter
that specifies how old the RCU tasks trace grace period must be before
the grace-period kthread starts sending IPIs.  This delay allows more
tasks to pass through rcu_tasks_qs() quiescent states, thus reducing
(or even eliminating) the number of IPIs that must be sent.

In a short rcutorture test, setting this kernel boot parameter to HZ/2
resulted in zero IPIs for all 877 RCU-tasks trace grace periods that
elapsed during that test.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
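The IPI gate described above boils down to a wraparound-safe comparison
of tick counts.  The sketch below shows that comparison in userspace C.
sketch_time_after() reproduces the signed-difference idiom used by the
kernel's time_after(); sketch_may_send_ipi() and its parameters are
hypothetical names chosen for illustration, and the conversion to long
assumes the usual two's-complement behavior.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* True if tick count a is later than b, even across counter wraparound. */
bool sketch_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Hypothetical helper: has the grace period that started at gp_start
 * aged by more than delay ticks, so that IPIs may now be sent?
 */
bool sketch_may_send_ipi(unsigned long now, unsigned long gp_start, int delay)
{
	return sketch_time_after(now, gp_start + (unsigned long)delay);
}

/* Usage: exercise both the ordinary case and the wraparound case. */
void sketch_ipi_delay_selftest(void)
{
	/* Exactly delay ticks old is not yet "after". */
	assert(!sketch_may_send_ipi(1500, 1000, 500));
	assert(sketch_may_send_ipi(1501, 1000, 500));

	/* Grace period began just before the counter wrapped. */
	assert(!sketch_may_send_ipi(ULONG_MAX - 5, ULONG_MAX - 10, 500));
	assert(sketch_may_send_ipi(490, ULONG_MAX - 10, 500));
}
```

A plain "now > gp_start + delay" would give the wrong answer in the last
two cases, which is why the subtraction is done first and the sign of
the result is tested.
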
 Documentation/admin-guide/kernel-parameters.txt |  7 +++++++
 kernel/rcu/tasks.h                              | 15 ++++++++++-----
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index df2baf9..6f3b3be 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4246,6 +4246,13 @@
 			only normal grace-period primitives.  No effect
 			on CONFIG_TINY_RCU kernels.
 
+	rcupdate.rcu_task_ipi_delay= [KNL]
+			Set time in jiffies during which RCU tasks will
+			avoid sending IPIs, starting with the beginning
+			of a given grace period.  Setting a large
+			number avoids disturbing real-time workloads,
+			but lengthens grace periods.
+
 	rcupdate.rcu_task_stall_timeout= [KNL]
 			Set timeout in jiffies for RCU task stall warning
 			messages.  Disable with a value less than or equal
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index fa9c069..a034f48 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -74,6 +74,11 @@ static struct rcu_tasks rt_name =					\
 /* Track exiting tasks in order to allow them to be waited for. */
 DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 
+/* Avoid IPIing CPUs early in the grace period. */
+#define RCU_TASK_IPI_DELAY (HZ / 2)
+static int rcu_task_ipi_delay __read_mostly = RCU_TASK_IPI_DELAY;
+module_param(rcu_task_ipi_delay, int, 0644);
+
 /* Control stall timeouts.  Disable with <= 0, otherwise jiffies till stall. */
 #define RCU_TASK_STALL_TIMEOUT (HZ * 60 * 10)
 static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
@@ -713,6 +718,10 @@ DECLARE_WAIT_QUEUE_HEAD(trc_wait);	// List of holdout tasks.
 // Record outstanding IPIs to each CPU.  No point in sending two...
 static DEFINE_PER_CPU(bool, trc_ipi_to_cpu);
 
+void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
+DEFINE_RCU_TASKS(rcu_tasks_trace, rcu_tasks_wait_gp, call_rcu_tasks_trace,
+		 "RCU Tasks Trace");
+
 /* If we are the last reader, wake up the grace-period kthread. */
 void rcu_read_unlock_trace_special(struct task_struct *t)
 {
@@ -825,7 +834,7 @@ static void trc_wait_for_one_reader(struct task_struct *t,
 
 	// If currently running, send an IPI, either way, add to list.
 	trc_add_holdout(t, bhp);
-	if (task_curr(t)) {
+	if (task_curr(t) && time_after(jiffies, rcu_tasks_trace.gp_start + rcu_task_ipi_delay)) {
 		// The task is currently running, so try IPIing it.
 		cpu = task_cpu(t);
 
@@ -994,10 +1003,6 @@ void exit_tasks_rcu_finish_trace(struct task_struct *t)
 		rcu_read_unlock_trace_special(t);
 }
 
-void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
-DEFINE_RCU_TASKS(rcu_tasks_trace, rcu_tasks_wait_gp, call_rcu_tasks_trace,
-		 "RCU Tasks Trace");
-
 /**
  * call_rcu_tasks_trace() - Queue a callback trace task-based grace period
  * @rhp: structure to be used for queueing the RCU updates.
-- 
2.9.5



* [PATCH v3 tip/core/rcu 23/34] rcu-tasks: Split ->trc_reader_need_end
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (21 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 22/34] rcu-tasks: Provide boot parameter to delay IPIs until late in grace period paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 24/34] rcu-tasks: Add grace-period and IPI counts to statistics paulmck
                       ` (11 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit splits ->trc_reader_need_end by using the rcu_special union.
This change permits readers to check to see if a memory barrier is
required without any added overhead in the common case where no such
barrier is required.  This commit also adds the read-side checking.
Later commits will add the machinery to properly set the new
->trc_reader_special.b.need_mb field.

This commit also makes rcu_read_unlock_trace_special() tolerate nested
read-side critical sections within interrupt and NMI handlers.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
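The reason the union avoids added read-side overhead can be seen in a
small userspace sketch.  The layout follows the sched.h hunk in this
patch, but union sketch_special and sketch_fast_path_ok() are
hypothetical names, and plain loads are used where the kernel uses
READ_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace copy of the layout: four byte-sized flags sharing one word. */
union sketch_special {
	struct {
		uint8_t blocked;
		uint8_t need_qs;
		uint8_t exp_hint;
		uint8_t need_mb;	/* Readers need a full barrier. */
	} b;		/* Individual flags. */
	uint32_t s;	/* All four flags viewed as a single word. */
};

static_assert(sizeof(union sketch_special) == sizeof(uint32_t),
	      "flags must fill the word exactly");

/*
 * Fast-path test in the style of rcu_read_unlock_trace(): one load of
 * ->s covers need_qs and need_mb at once.  The slow path is needed
 * only when some flag is set and this is the outermost reader.
 */
bool sketch_fast_path_ok(const union sketch_special *sp, int nesting)
{
	return sp->s == 0 || nesting != 0;
}

/* Usage: either flag alone is enough to force the slow path. */
void sketch_special_selftest(void)
{
	union sketch_special sp = { .s = 0 };

	assert(sketch_fast_path_ok(&sp, 0));
	sp.b.need_mb = 1;
	assert(!sketch_fast_path_ok(&sp, 0));
	assert(sketch_fast_path_ok(&sp, 1));	/* Nested: defer to outer exit. */
	sp.b.need_mb = 0;
	sp.b.need_qs = 1;
	assert(!sketch_fast_path_ok(&sp, 0));
}
```

With a separate bool per condition, the common case would need one load
and one branch per flag; with the union, adding need_mb costs the fast
path nothing.
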
 include/linux/rcupdate_trace.h | 11 +++++++----
 include/linux/sched.h          |  4 ++--
 init/init_task.c               |  1 +
 kernel/fork.c                  |  1 +
 kernel/rcu/tasks.h             | 33 ++++++++++++++++++++-------------
 5 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/include/linux/rcupdate_trace.h b/include/linux/rcupdate_trace.h
index ed97e10..c42b365c 100644
--- a/include/linux/rcupdate_trace.h
+++ b/include/linux/rcupdate_trace.h
@@ -31,7 +31,7 @@ static inline int rcu_read_lock_trace_held(void)
 
 #ifdef CONFIG_TASKS_TRACE_RCU
 
-void rcu_read_unlock_trace_special(struct task_struct *t);
+void rcu_read_unlock_trace_special(struct task_struct *t, int nesting);
 
 /**
  * rcu_read_lock_trace - mark beginning of RCU-trace read-side critical section
@@ -50,6 +50,8 @@ static inline void rcu_read_lock_trace(void)
 	struct task_struct *t = current;
 
 	WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + 1);
+	if (t->trc_reader_special.b.need_mb)
+		smp_mb(); // Pairs with update-side barriers
 	rcu_lock_acquire(&rcu_trace_lock_map);
 }
 
@@ -69,10 +71,11 @@ static inline void rcu_read_unlock_trace(void)
 
 	rcu_lock_release(&rcu_trace_lock_map);
 	nesting = READ_ONCE(t->trc_reader_nesting) - 1;
-	WRITE_ONCE(t->trc_reader_nesting, nesting);
-	if (likely(!READ_ONCE(t->trc_reader_need_end)) || nesting)
+	if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) {
+		WRITE_ONCE(t->trc_reader_nesting, nesting);
 		return;  // We assume shallow reader nesting.
-	rcu_read_unlock_trace_special(t);
+	}
+	rcu_read_unlock_trace_special(t, nesting);
 }
 
 void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ef68ae4..63d359e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -611,7 +611,7 @@ union rcu_special {
 		u8			blocked;
 		u8			need_qs;
 		u8			exp_hint; /* Hint for performance. */
-		u8			pad; /* No garbage from compiler! */
+		u8			need_mb; /* Readers need smp_mb(). */
 	} b; /* Bits. */
 	u32 s; /* Set of bits. */
 };
@@ -725,7 +725,7 @@ struct task_struct {
 #ifdef CONFIG_TASKS_TRACE_RCU
 	int				trc_reader_nesting;
 	int				trc_ipi_to_cpu;
-	bool				trc_reader_need_end;
+	union rcu_special		trc_reader_special;
 	bool				trc_reader_checked;
 	struct list_head		trc_holdout_list;
 #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
diff --git a/init/init_task.c b/init/init_task.c
index 1b9ec3d..4efd819 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -142,6 +142,7 @@ struct task_struct init_task
 #endif
 #ifdef CONFIG_TASKS_TRACE_RCU
 	.trc_reader_nesting = 0,
+	.trc_reader_special.s = 0,
 	.trc_holdout_list = LIST_HEAD_INIT(init_task.trc_holdout_list),
 #endif
 #ifdef CONFIG_CPUSETS
diff --git a/kernel/fork.c b/kernel/fork.c
index 97df86b..2505e21 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1687,6 +1687,7 @@ static inline void rcu_copy_process(struct task_struct *p)
 #endif /* #ifdef CONFIG_TASKS_RCU */
 #ifdef CONFIG_TASKS_TRACE_RCU
 	p->trc_reader_nesting = 0;
+	p->trc_reader_special.s = 0;
 	INIT_LIST_HEAD(&p->trc_holdout_list);
 #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index a034f48..4150f8d 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -723,10 +723,17 @@ DEFINE_RCU_TASKS(rcu_tasks_trace, rcu_tasks_wait_gp, call_rcu_tasks_trace,
 		 "RCU Tasks Trace");
 
 /* If we are the last reader, wake up the grace-period kthread. */
-void rcu_read_unlock_trace_special(struct task_struct *t)
+void rcu_read_unlock_trace_special(struct task_struct *t, int nesting)
 {
-	WRITE_ONCE(t->trc_reader_need_end, false);
-	if (atomic_dec_and_test(&trc_n_readers_need_end))
+	int nq = t->trc_reader_special.b.need_qs;
+
+	if (t->trc_reader_special.b.need_mb)
+		smp_mb(); // Pairs with update-side barriers.
+	// Update .need_qs before ->trc_reader_nesting for irq/NMI handlers.
+	if (nq)
+		WRITE_ONCE(t->trc_reader_special.b.need_qs, false);
+	WRITE_ONCE(t->trc_reader_nesting, nesting);
+	if (nq && atomic_dec_and_test(&trc_n_readers_need_end))
 		wake_up(&trc_wait);
 }
 EXPORT_SYMBOL_GPL(rcu_read_unlock_trace_special);
@@ -777,8 +784,8 @@ static void trc_read_check_handler(void *t_in)
 	// Get here if the task is in a read-side critical section.  Set
 	// its state so that it will awaken the grace-period kthread upon
 	// exit from that critical section.
-	WARN_ON_ONCE(t->trc_reader_need_end);
-	WRITE_ONCE(t->trc_reader_need_end, true);
+	WARN_ON_ONCE(t->trc_reader_special.b.need_qs);
+	WRITE_ONCE(t->trc_reader_special.b.need_qs, true);
 
 reset_ipi:
 	// Allow future IPIs to be sent on CPU and for task.
@@ -804,8 +811,8 @@ static bool trc_inspect_reader(struct task_struct *t, void *arg)
 	// exit from that critical section.
 	if (unlikely(t->trc_reader_nesting)) {
 		atomic_inc(&trc_n_readers_need_end); // One more to wait on.
-		WARN_ON_ONCE(t->trc_reader_need_end);
-		WRITE_ONCE(t->trc_reader_need_end, true);
+		WARN_ON_ONCE(t->trc_reader_special.b.need_qs);
+		WRITE_ONCE(t->trc_reader_special.b.need_qs, true);
 	}
 	return true;
 }
@@ -880,7 +887,7 @@ static void rcu_tasks_trace_pregp_step(void)
 static void rcu_tasks_trace_pertask(struct task_struct *t,
 				    struct list_head *hop)
 {
-	WRITE_ONCE(t->trc_reader_need_end, false);
+	WRITE_ONCE(t->trc_reader_special.b.need_qs, false);
 	WRITE_ONCE(t->trc_reader_checked, false);
 	t->trc_ipi_to_cpu = -1;
 	trc_wait_for_one_reader(t, hop);
@@ -912,7 +919,7 @@ static void show_stalled_task_trace(struct task_struct *t, bool *firstreport)
 		 ".i"[is_idle_task(t)],
 		 ".N"[cpu > 0 && tick_nohz_full_cpu(cpu)],
 		 t->trc_reader_nesting,
-		 " N"[!!t->trc_reader_need_end],
+		 " N"[!!t->trc_reader_special.b.need_qs],
 		 cpu);
 	sched_show_task(t);
 }
@@ -976,11 +983,11 @@ static void rcu_tasks_trace_postgp(struct rcu_tasks *rtp)
 			break;  // Count reached zero.
 		// Stall warning time, so make a list of the offenders.
 		for_each_process_thread(g, t)
-			if (READ_ONCE(t->trc_reader_need_end))
+			if (READ_ONCE(t->trc_reader_special.b.need_qs))
 				trc_add_holdout(t, &holdouts);
 		firstreport = true;
 		list_for_each_entry_safe(t, g, &holdouts, trc_holdout_list)
-			if (READ_ONCE(t->trc_reader_need_end)) {
+			if (READ_ONCE(t->trc_reader_special.b.need_qs)) {
 				show_stalled_task_trace(t, &firstreport);
 				trc_del_holdout(t);
 			}
@@ -999,8 +1006,8 @@ void exit_tasks_rcu_finish_trace(struct task_struct *t)
 	WRITE_ONCE(t->trc_reader_checked, true);
 	WARN_ON_ONCE(t->trc_reader_nesting);
 	WRITE_ONCE(t->trc_reader_nesting, 0);
-	if (WARN_ON_ONCE(READ_ONCE(t->trc_reader_need_end)))
-		rcu_read_unlock_trace_special(t);
+	if (WARN_ON_ONCE(READ_ONCE(t->trc_reader_special.b.need_qs)))
+		rcu_read_unlock_trace_special(t, 0);
 }
 
 /**
-- 
2.9.5



* [PATCH v3 tip/core/rcu 24/34] rcu-tasks: Add grace-period and IPI counts to statistics
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (22 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 23/34] rcu-tasks: Split ->trc_reader_need_end paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 25/34] rcu-tasks: Add Kconfig option to mediate smp_mb() vs. IPI paulmck
                       ` (10 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds a grace-period count and a count of IPIs sent since
boot, both of which are printed in response to rcutorture writer stalls
and at the end of rcutorture testing.  These counts will be used to
evaluate various schemes to reduce the number of IPIs sent.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 4150f8d..f9a828c 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -30,6 +30,8 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
  * @gp_state: Grace period's most recent state transition (debugging).
  * @gp_jiffies: Time of last @gp_state transition.
  * @gp_start: Most recent grace-period start in jiffies.
+ * @n_gps: Number of grace periods completed since boot.
+ * @n_ipis: Number of IPIs sent to encourage grace periods to end.
  * @pregp_func: This flavor's pre-grace-period function (optional).
  * @pertask_func: This flavor's per-task scan function (optional).
  * @postscan_func: This flavor's post-task scan function (optional).
@@ -47,6 +49,8 @@ struct rcu_tasks {
 	int gp_state;
 	unsigned long gp_jiffies;
 	unsigned long gp_start;
+	unsigned long n_gps;
+	unsigned long n_ipis;
 	struct task_struct *kthread_ptr;
 	rcu_tasks_gp_func_t gp_func;
 	pregp_func_t pregp_func;
@@ -208,6 +212,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 		set_tasks_gp_state(rtp, RTGS_WAIT_GP);
 		rtp->gp_start = jiffies;
 		rtp->gp_func(rtp);
+		rtp->n_gps++;
 
 		/* Invoke the callbacks. */
 		set_tasks_gp_state(rtp, RTGS_INVOKE_CBS);
@@ -285,11 +290,12 @@ static void __init rcu_tasks_bootup_oddness(void)
 /* Dump out rcutorture-relevant state common to all RCU-tasks flavors. */
 static void show_rcu_tasks_generic_gp_kthread(struct rcu_tasks *rtp, char *s)
 {
-	pr_info("%s: %s(%d) since %lu %c%c %s\n",
+	pr_info("%s: %s(%d) since %lu g:%lu i:%lu %c%c %s\n",
 		rtp->kname,
 		tasks_gp_state_getname(rtp),
 		data_race(rtp->gp_state),
 		jiffies - data_race(rtp->gp_jiffies),
+		data_race(rtp->n_gps), data_race(rtp->n_ipis),
 		".k"[!!data_race(rtp->kthread_ptr)],
 		".C"[!!data_race(rtp->cbs_head)],
 		s);
@@ -592,6 +598,7 @@ static void rcu_tasks_be_rude(struct work_struct *work)
 // Wait for one rude RCU-tasks grace period.
 static void rcu_tasks_rude_wait_gp(struct rcu_tasks *rtp)
 {
+	rtp->n_ipis += cpumask_weight(cpu_online_mask);
 	schedule_on_each_cpu(rcu_tasks_be_rude);
 }
 
@@ -852,6 +859,7 @@ static void trc_wait_for_one_reader(struct task_struct *t,
 		atomic_inc(&trc_n_readers_need_end);
 		per_cpu(trc_ipi_to_cpu, cpu) = true;
 		t->trc_ipi_to_cpu = cpu;
+		rcu_tasks_trace.n_ipis++;
 		if (smp_call_function_single(cpu,
 					     trc_read_check_handler, t, 0)) {
 			// Just in case there is some other reason for
-- 
2.9.5



* [PATCH v3 tip/core/rcu 25/34] rcu-tasks: Add Kconfig option to mediate smp_mb() vs. IPI
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (23 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 24/34] rcu-tasks: Add grace-period and IPI counts to statistics paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 26/34] rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built paulmck
                       ` (9 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit provides a new TASKS_TRACE_RCU_READ_MB Kconfig option that
enables use of read-side memory barriers by both rcu_read_lock_trace()
and rcu_read_unlock_trace() when they are executed with the
current->trc_reader_special.b.need_mb flag set.  This flag is currently
never set.  Doing that is the subject of a later commit.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
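The gating pattern used here, a build-time constant ANDed with a
run-time flag, can be sketched in userspace as follows.  SKETCH_READ_MB
is a hypothetical macro standing in for
IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB), sketch_reader_barrier() is a
made-up name, and a C11 sequentially consistent fence stands in for
smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical build-time switch; override with -DSKETCH_READ_MB=0. */
#ifndef SKETCH_READ_MB
#define SKETCH_READ_MB 1
#endif

/*
 * Execute a full fence only if the build enables read-side barriers
 * and this task's need_mb flag is set.  Returns true if the fence ran.
 * When SKETCH_READ_MB is 0, the condition is constant-false and the
 * compiler can discard the whole branch, flag load included.
 */
bool sketch_reader_barrier(bool need_mb)
{
	if (SKETCH_READ_MB && need_mb) {
		atomic_thread_fence(memory_order_seq_cst);
		return true;
	}
	return false;
}

/* Usage: the fence never runs unless the flag is set. */
void sketch_read_mb_selftest(void)
{
	assert(!sketch_reader_barrier(false));
	assert(sketch_reader_barrier(true) == (SKETCH_READ_MB != 0));
}
```

Writing the test as an ordinary C condition rather than an #ifdef keeps
both variants visible to the compiler, so the disabled branch is still
syntax- and type-checked.
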
 include/linux/rcupdate_trace.h |  3 ++-
 kernel/rcu/Kconfig             | 18 ++++++++++++++++++
 kernel/rcu/tasks.h             |  3 ++-
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/rcupdate_trace.h b/include/linux/rcupdate_trace.h
index c42b365c..4c25a41 100644
--- a/include/linux/rcupdate_trace.h
+++ b/include/linux/rcupdate_trace.h
@@ -50,7 +50,8 @@ static inline void rcu_read_lock_trace(void)
 	struct task_struct *t = current;
 
 	WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + 1);
-	if (t->trc_reader_special.b.need_mb)
+	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) &&
+	    t->trc_reader_special.b.need_mb)
 		smp_mb(); // Pairs with update-side barriers
 	rcu_lock_acquire(&rcu_trace_lock_map);
 }
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index cb1d18e..0ebe15a 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -234,4 +234,22 @@ config RCU_NOCB_CPU
 	  Say Y here if you want to help to debug reduced OS jitter.
 	  Say N here if you are unsure.
 
+config TASKS_TRACE_RCU_READ_MB
+	bool "Tasks Trace RCU readers use memory barriers in user and idle"
+	depends on RCU_EXPERT
+	default PREEMPT_RT || NR_CPUS < 8
+	help
+	  Use this option to further reduce the number of IPIs sent
+	  to CPUs executing in userspace or idle during tasks trace
+	  RCU grace periods.  Given that a reasonable setting of
+	  the rcupdate.rcu_task_ipi_delay kernel boot parameter
+	  eliminates such IPIs for many workloads, proper setting
+	  of this Kconfig option is important mostly for aggressive
+	  real-time installations and for battery-powered devices,
+	  hence the default chosen above.
+
+	  Say Y here if you hate IPIs.
+	  Say N here if you hate read-side memory barriers.
+	  Take the default if you are unsure.
+
 endmenu # "RCU Subsystem"
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index f9a828c..b18298b 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -734,7 +734,8 @@ void rcu_read_unlock_trace_special(struct task_struct *t, int nesting)
 {
 	int nq = t->trc_reader_special.b.need_qs;
 
-	if (t->trc_reader_special.b.need_mb)
+	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) &&
+	    t->trc_reader_special.b.need_mb)
 		smp_mb(); // Pairs with update-side barriers.
 	// Update .need_qs before ->trc_reader_nesting for irq/NMI handlers.
 	if (nq)
-- 
2.9.5



* [PATCH v3 tip/core/rcu 26/34] rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (24 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 25/34] rcu-tasks: Add Kconfig option to mediate smp_mb() vs. IPI paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 27/34] rcu-tasks: Allow rcu_read_unlock_trace() under scheduler locks paulmck
                       ` (8 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

Systems running CPU-bound real-time tasks do not want IPIs sent to CPUs
executing nohz_full userspace tasks.  Battery-powered systems don't
want IPIs sent to idle CPUs in low-power mode.  Unfortunately, RCU tasks
trace can and will send such IPIs in some cases.

Both of these situations occur only when the target CPU is in RCU
dyntick-idle mode, in other words, when RCU is not watching the
target CPU.  This suggests that CPUs in dyntick-idle mode should use
memory barriers in outermost invocations of rcu_read_lock_trace()
and rcu_read_unlock_trace(), which would allow the RCU tasks trace
grace period to directly read out the target CPU's read-side state.
One challenge is that RCU tasks trace is not targeting a specific
CPU, but rather a task.  And that task could switch from one CPU to
another at any time.

This commit therefore uses try_invoke_on_locked_down_task()
and checks for task_curr() in trc_inspect_reader().
When this condition holds, the target task is running and cannot move.
If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
function can be used to check if the specified integer (in this case,
t->trc_reader_nesting) is zero while the target CPU remains in that same
dyntick-idle sojourn.  If so, the target task is in a quiescent state.
If not, trc_read_check_handler() must indicate failure so that the
grace-period kthread can take appropriate action or retry after an
appropriate delay, as the case may be.
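
The snapshot/check/recheck idea behind rcu_dynticks_zero_in_eqs() can be
sketched in userspace C11 as follows.  This is only a model under stated
assumptions: struct sk_cpu, the SK_* constants, and sk_zero_in_eqs() are
hypothetical names, the counter encoding mirrors the two low-order bits
used in the patch below, and acquire loads are assumed as a stand-in for
the smp_rmb() calls.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SK_CTRL_MASK 0x1 /* models RCU_DYNTICK_CTRL_MASK (special-action bit) */
#define SK_CTRL_CTR  0x2 /* models RCU_DYNTICK_CTRL_CTR (set while RCU is watching) */

/* Simplified per-CPU state: a counter bumped by SK_CTRL_CTR on EQS entry/exit. */
struct sk_cpu {
	atomic_int dynticks;
};

/*
 * Return true if *vp was observed to be zero while the CPU stayed in one
 * and the same extended quiescent state (EQS) across both counter reads.
 */
static bool sk_zero_in_eqs(struct sk_cpu *cp, atomic_int *vp)
{
	int snap;

	assert(cp && vp);

	/*
	 * Clear the "watching" bit in the snapshot.  If the CPU was not in
	 * an EQS at this point, the final comparison cannot succeed.
	 */
	snap = atomic_load_explicit(&cp->dynticks, memory_order_acquire) &
	       ~(SK_CTRL_MASK | SK_CTRL_CTR);

	/* Acquire ordering keeps this read after the snapshot read. */
	if (atomic_load_explicit(vp, memory_order_acquire))
		return false; /* Nonzero nesting: not known to be quiescent. */

	/* Same EQS as at snapshot time?  Ignore only the special-action bit. */
	return snap == (atomic_load_explicit(&cp->dynticks, memory_order_acquire) &
			~SK_CTRL_MASK);
}
```

In this model, a false return says nothing about whether the task is in
a reader.  It only says that the cheap check was inconclusive, so the
caller must fall back to another method.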

With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
CPU remains idle or a given task continues executing in nohz_full mode,
the RCU tasks trace grace-period kthread will detect this without the
need to send an IPI.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/rcu.h         |  2 ++
 kernel/rcu/tasks.h       | 36 ++++++++++++++++++++++++++----------
 kernel/rcu/tree.c        | 24 ++++++++++++++++++++++++
 kernel/rcu/tree.h        |  2 ++
 kernel/rcu/tree_plugin.h | 18 ++++++++++++++++++
 5 files changed, 72 insertions(+), 10 deletions(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index e1089fd..296f926 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -501,6 +501,7 @@ void srcutorture_get_gp_data(enum rcutorture_type test_type,
 #endif
 
 #ifdef CONFIG_TINY_RCU
+static inline bool rcu_dynticks_zero_in_eqs(int cpu, int *vp) { return false; }
 static inline unsigned long rcu_get_gp_seq(void) { return 0; }
 static inline unsigned long rcu_exp_batches_completed(void) { return 0; }
 static inline unsigned long
@@ -510,6 +511,7 @@ static inline void show_rcu_gp_kthreads(void) { }
 static inline int rcu_get_gp_kthreads_prio(void) { return 0; }
 static inline void rcu_fwd_progress_check(unsigned long j) { }
 #else /* #ifdef CONFIG_TINY_RCU */
+bool rcu_dynticks_zero_in_eqs(int cpu, int *vp);
 unsigned long rcu_get_gp_seq(void);
 unsigned long rcu_exp_batches_completed(void);
 unsigned long srcu_batches_completed(struct srcu_struct *sp);
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index b18298b..a7ecde9 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -806,22 +806,38 @@ static void trc_read_check_handler(void *t_in)
 /* Callback function for scheduler to check locked-down task.  */
 static bool trc_inspect_reader(struct task_struct *t, void *arg)
 {
-	if (task_curr(t))
-		return false;  // It is running, so decline to inspect it.
+	int cpu = task_cpu(t);
+	bool in_qs = false;
+
+	if (task_curr(t)) {
+		// If no chance of heavyweight readers, do it the hard way.
+		if (!IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
+			return false;
+
+		// If heavyweight readers are enabled on the remote task,
+		// we can inspect its state despite its currently running.
+		// However, we cannot safely change its state.
+		if (!rcu_dynticks_zero_in_eqs(cpu, &t->trc_reader_nesting))
+			return false; // No quiescent state, do it the hard way.
+		in_qs = true;
+	} else {
+		in_qs = likely(!t->trc_reader_nesting);
+	}
 
 	// Mark as checked.  Because this is called from the grace-period
 	// kthread, also remove the task from the holdout list.
 	t->trc_reader_checked = true;
 	trc_del_holdout(t);
 
-	// If the task is in a read-side critical section, set up its
-	// its state so that it will awaken the grace-period kthread upon
-	// exit from that critical section.
-	if (unlikely(t->trc_reader_nesting)) {
-		atomic_inc(&trc_n_readers_need_end); // One more to wait on.
-		WARN_ON_ONCE(t->trc_reader_special.b.need_qs);
-		WRITE_ONCE(t->trc_reader_special.b.need_qs, true);
-	}
+	if (in_qs)
+		return true;  // Already in quiescent state, done!!!
+
+	// The task is in a read-side critical section, so set up its
+	// state so that it will awaken the grace-period kthread upon exit
+	// from that critical section.
+	atomic_inc(&trc_n_readers_need_end); // One more to wait on.
+	WARN_ON_ONCE(t->trc_reader_special.b.need_qs);
+	WRITE_ONCE(t->trc_reader_special.b.need_qs, true);
 	return true;
 }
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index de6228a..4eb424e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -239,6 +239,7 @@ static void rcu_dynticks_eqs_enter(void)
 	 * critical sections, and we also must force ordering with the
 	 * next idle sojourn.
 	 */
+	rcu_dynticks_task_trace_enter();  // Before ->dynticks update!
 	seq = atomic_add_return(RCU_DYNTICK_CTRL_CTR, &rdp->dynticks);
 	// RCU is no longer watching.  Better be in extended quiescent state!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
@@ -265,6 +266,7 @@ static void rcu_dynticks_eqs_exit(void)
 	 */
 	seq = atomic_add_return(RCU_DYNTICK_CTRL_CTR, &rdp->dynticks);
 	// RCU is now watching.  Better not be in an extended quiescent state!
+	rcu_dynticks_task_trace_exit();  // After ->dynticks update!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
 		     !(seq & RCU_DYNTICK_CTRL_CTR));
 	if (seq & RCU_DYNTICK_CTRL_MASK) {
@@ -337,6 +339,28 @@ static bool rcu_dynticks_in_eqs_since(struct rcu_data *rdp, int snap)
 }
 
 /*
+ * Return true if the referenced integer is zero while the specified
+ * CPU remains within a single extended quiescent state.
+ */
+bool rcu_dynticks_zero_in_eqs(int cpu, int *vp)
+{
+	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+	int snap;
+
+	// If not quiescent, force back to earlier extended quiescent state.
+	snap = atomic_read(&rdp->dynticks) & ~(RCU_DYNTICK_CTRL_MASK |
+					       RCU_DYNTICK_CTRL_CTR);
+
+	smp_rmb(); // Order ->dynticks and *vp reads.
+	if (READ_ONCE(*vp))
+		return false;  // Non-zero, so report failure;
+	smp_rmb(); // Order *vp read and ->dynticks re-read.
+
+	// If still in the same extended quiescent state, we are good!
+	return snap == (atomic_read(&rdp->dynticks) & ~RCU_DYNTICK_CTRL_MASK);
+}
+
+/*
  * Set the special (bottom) bit of the specified CPU so that it
  * will take special action (such as flushing its TLB) on the
  * next exit from an extended quiescent state.  Returns true if
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 44edd0a..43991a4 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -455,6 +455,8 @@ static void rcu_bind_gp_kthread(void);
 static bool rcu_nohz_full_cpu(void);
 static void rcu_dynticks_task_enter(void);
 static void rcu_dynticks_task_exit(void);
+static void rcu_dynticks_task_trace_enter(void);
+static void rcu_dynticks_task_trace_exit(void);
 
 /* Forward declarations for tree_stall.h */
 static void record_gp_stall_check_time(void);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 9355536..f4a344e 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2553,3 +2553,21 @@ static void rcu_dynticks_task_exit(void)
 	WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
 #endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
 }
+
+/* Turn on heavyweight RCU tasks trace readers on idle/user entry. */
+static void rcu_dynticks_task_trace_enter(void)
+{
+#ifdef CONFIG_TASKS_TRACE_RCU
+	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
+		current->trc_reader_special.b.need_mb = true;
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
+}
+
+/* Turn off heavyweight RCU tasks trace readers on idle/user exit. */
+static void rcu_dynticks_task_trace_exit(void)
+{
+#ifdef CONFIG_TASKS_TRACE_RCU
+	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
+		current->trc_reader_special.b.need_mb = false;
+#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
+}
-- 
2.9.5



* [PATCH v3 tip/core/rcu 27/34] rcu-tasks: Allow rcu_read_unlock_trace() under scheduler locks
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (25 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 26/34] rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 28/34] rcu-tasks: Disable CPU hotplug across RCU tasks trace scans paulmck
                       ` (7 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

The rcu_read_unlock_trace() function can invoke rcu_read_unlock_trace_special(),
which in turn can call wake_up().  Therefore, if any scheduler lock is
held across a call to rcu_read_unlock_trace(), self-deadlock can occur.
This commit therefore uses the irq_work facility to defer the wake_up()
to a clean environment where no scheduler locks will be held.
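
The deferral pattern can be modeled in userspace roughly as shown below.
This is a sketch, not the irq_work implementation: struct sk_work,
sk_work_queue(), sk_work_run(), and the sk_sched_lock_held flag are all
hypothetical names invented for illustration.  The unlock path only marks
the work as pending, and the wakeup itself runs later from a context in
which the modeled scheduler lock is not held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of a deferred-work item. */
struct sk_work {
	bool pending;
	void (*func)(struct sk_work *w);
};

static bool sk_sched_lock_held; /* models "some scheduler lock is held" */
static int sk_wakeups;          /* illustration only: counts wakeups done */

/* Queue the work unless it is already pending; returns true if queued. */
static bool sk_work_queue(struct sk_work *w)
{
	if (w->pending)
		return false;
	w->pending = true;
	return true;
}

/* Run the work from a "clean" context, if any is pending. */
static void sk_work_run(struct sk_work *w)
{
	if (!w->pending)
		return;
	w->pending = false;
	w->func(w);
}

/* Deferred handler: the only place where the wakeup actually happens. */
static void sk_wake_handler(struct sk_work *w)
{
	(void)w;
	assert(!sk_sched_lock_held); /* would self-deadlock otherwise */
	sk_wakeups++;
}

static struct sk_work sk_wake_work = { false, sk_wake_handler };

/* Models the last reader leaving: defer the wakeup instead of doing it. */
static void sk_read_unlock_last_reader(void)
{
	sk_work_queue(&sk_wake_work);
}
```

The point of the design is that the unlock path never takes the lock
that the wakeup needs, so it is safe to call regardless of which locks
the caller holds.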

Reported-by: Steven Rostedt <rostedt@goodmis.org>
[ paulmck: Update #includes for m68k per kbuild test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h  | 12 +++++++++++-
 kernel/rcu/update.c |  1 +
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index a7ecde9..2663167e 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -729,6 +729,16 @@ void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
 DEFINE_RCU_TASKS(rcu_tasks_trace, rcu_tasks_wait_gp, call_rcu_tasks_trace,
 		 "RCU Tasks Trace");
 
+/*
+ * This irq_work handler allows rcu_read_unlock_trace() to be invoked
+ * while the scheduler locks are held.
+ */
+static void rcu_read_unlock_iw(struct irq_work *iwp)
+{
+	wake_up(&trc_wait);
+}
+static DEFINE_IRQ_WORK(rcu_tasks_trace_iw, rcu_read_unlock_iw);
+
 /* If we are the last reader, wake up the grace-period kthread. */
 void rcu_read_unlock_trace_special(struct task_struct *t, int nesting)
 {
@@ -742,7 +752,7 @@ void rcu_read_unlock_trace_special(struct task_struct *t, int nesting)
 		WRITE_ONCE(t->trc_reader_special.b.need_qs, false);
 	WRITE_ONCE(t->trc_reader_nesting, nesting);
 	if (nq && atomic_dec_and_test(&trc_n_readers_need_end))
-		wake_up(&trc_wait);
+		irq_work_queue(&rcu_tasks_trace_iw);
 }
 EXPORT_SYMBOL_GPL(rcu_read_unlock_trace_special);
 
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 0fb2a9e..40e3512 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -41,6 +41,7 @@
 #include <linux/sched/isolation.h>
 #include <linux/kprobes.h>
 #include <linux/slab.h>
+#include <linux/irq_work.h>
 
 #define CREATE_TRACE_POINTS
 
-- 
2.9.5



* [PATCH v3 tip/core/rcu 28/34] rcu-tasks: Disable CPU hotplug across RCU tasks trace scans
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (26 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 27/34] rcu-tasks: Allow rcu_read_unlock_trace() under scheduler locks paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 29/34] rcu-tasks: Handle the running-offline idle-task special case paulmck
                       ` (6 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit disables CPU hotplug across RCU tasks trace scans, which
is a first step towards correctly recognizing idle tasks "running" on
offline CPUs.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 2663167e..df6e785 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -906,16 +906,16 @@ static void rcu_tasks_trace_pregp_step(void)
 {
 	int cpu;
 
-	// Wait for CPU-hotplug paths to complete.
-	cpus_read_lock();
-	cpus_read_unlock();
-
 	// Allow for fast-acting IPIs.
 	atomic_set(&trc_n_readers_need_end, 1);
 
 	// There shouldn't be any old IPIs, but...
 	for_each_possible_cpu(cpu)
 		WARN_ON_ONCE(per_cpu(trc_ipi_to_cpu, cpu));
+
+	// Disable CPU hotplug across the tasklist scan.
+	// This also waits for all readers in CPU-hotplug code paths.
+	cpus_read_lock();
 }
 
 /* Do first-round processing for the specified task. */
@@ -931,6 +931,9 @@ static void rcu_tasks_trace_pertask(struct task_struct *t,
 /* Do intermediate processing between task and holdout scans. */
 static void rcu_tasks_trace_postscan(void)
 {
+	// Re-enable CPU hotplug now that the tasklist scan has completed.
+	cpus_read_unlock();
+
 	// Wait for late-stage exiting tasks to finish exiting.
 	// These might have passed the call to exit_tasks_rcu_finish().
 	synchronize_rcu();
@@ -975,6 +978,9 @@ static void check_all_holdout_tasks_trace(struct list_head *hop,
 {
 	struct task_struct *g, *t;
 
+	// Disable CPU hotplug across the holdout list scan.
+	cpus_read_lock();
+
 	list_for_each_entry_safe(t, g, hop, trc_holdout_list) {
 		// If safe and needed, try to check the current task.
 		if (READ_ONCE(t->trc_ipi_to_cpu) == -1 &&
@@ -987,6 +993,10 @@ static void check_all_holdout_tasks_trace(struct list_head *hop,
 		else if (needreport)
 			show_stalled_task_trace(t, firstreport);
 	}
+
+	// Re-enable CPU hotplug now that the holdout list scan has completed.
+	cpus_read_unlock();
+
 	if (needreport) {
 		if (firstreport)
 			pr_err("INFO: rcu_tasks_trace detected stalls?\n");
-- 
2.9.5



* [PATCH v3 tip/core/rcu 29/34] rcu-tasks: Handle the running-offline idle-task special case
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (27 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 28/34] rcu-tasks: Disable CPU hotplug across RCU tasks trace scans paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 30/34] rcu-tasks: Make RCU tasks trace also wait for idle tasks paulmck
                       ` (5 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

The idle task corresponding to an offline CPU can appear to be running
while that CPU is offline.  This commit therefore adds checks for this
situation, treating it as a quiescent state.  Because the tasklist scan
and the holdout-list scan now exclude CPU-hotplug operations, readers
on the CPU-hotplug paths are still waited for.
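
The resulting decision logic can be summarized by the following
userspace sketch.  It is a simplification under stated assumptions: the
enum, the function name sk_inspect(), and the boolean parameters are
hypothetical, and the zero_in_eqs argument stands for the result that
rcu_dynticks_zero_in_eqs() would have returned.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical verdicts for one inspection attempt. */
enum sk_verdict {
	SK_CANNOT_INSPECT, /* running task, cheap check unavailable or inconclusive */
	SK_QUIESCENT,      /* task is known not to be in a reader */
	SK_IN_READER       /* task is blocked or preempted within a reader */
};

static enum sk_verdict sk_inspect(bool running, bool cpu_offline,
				  bool read_mb_enabled, bool zero_in_eqs,
				  int nesting)
{
	assert(nesting >= 0);

	if (running) {
		/* Online CPU and no heavyweight readers: do it the hard way. */
		if (!cpu_offline && !read_mb_enabled)
			return SK_CANNOT_INSPECT;
		/* Online CPU, and the EQS-based check was inconclusive. */
		if (!cpu_offline && !zero_in_eqs)
			return SK_CANNOT_INSPECT;
		/* Offline CPU's idle task, or zero nesting within one EQS. */
		return SK_QUIESCENT;
	}

	/* Not running, so its nesting count can be read directly. */
	return nesting ? SK_IN_READER : SK_QUIESCENT;
}
```

Note that in this model the offline case short-circuits both checks, so
an idle task "running" on an offline CPU counts as quiescent even when
the heavyweight-reader option is disabled.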

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index df6e785..d1fa7715 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -818,16 +818,20 @@ static bool trc_inspect_reader(struct task_struct *t, void *arg)
 {
 	int cpu = task_cpu(t);
 	bool in_qs = false;
+	bool ofl = cpu_is_offline(cpu);
 
 	if (task_curr(t)) {
+		WARN_ON_ONCE(ofl & !is_idle_task(t));
+
 		// If no chance of heavyweight readers, do it the hard way.
-		if (!IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
+		if (!ofl && !IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
 			return false;
 
 		// If heavyweight readers are enabled on the remote task,
 		// we can inspect its state despite its currently running.
 		// However, we cannot safely change its state.
-		if (!rcu_dynticks_zero_in_eqs(cpu, &t->trc_reader_nesting))
+		if (!ofl && // Check for "running" idle tasks on offline CPUs.
+		    !rcu_dynticks_zero_in_eqs(cpu, &t->trc_reader_nesting))
 			return false; // No quiescent state, do it the hard way.
 		in_qs = true;
 	} else {
-- 
2.9.5



* [PATCH v3 tip/core/rcu 30/34] rcu-tasks: Make RCU tasks trace also wait for idle tasks
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (28 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 29/34] rcu-tasks: Handle the running-offline idle-task special case paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 31/34] rcu-tasks: Add rcu_dynticks_zero_in_eqs() effectiveness statistics paulmck
                       ` (4 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit scans the CPUs, adding each CPU's idle task to the list of
tasks that need quiescent states.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index d1fa7715..27d2458 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -15,7 +15,7 @@ struct rcu_tasks;
 typedef void (*rcu_tasks_gp_func_t)(struct rcu_tasks *rtp);
 typedef void (*pregp_func_t)(void);
 typedef void (*pertask_func_t)(struct task_struct *t, struct list_head *hop);
-typedef void (*postscan_func_t)(void);
+typedef void (*postscan_func_t)(struct list_head *hop);
 typedef void (*holdouts_func_t)(struct list_head *hop, bool ndrpt, bool *frptp);
 typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
 
@@ -331,7 +331,7 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 	rcu_read_unlock();
 
 	set_tasks_gp_state(rtp, RTGS_POST_SCAN_TASKLIST);
-	rtp->postscan_func();
+	rtp->postscan_func(&holdouts);
 
 	/*
 	 * Each pass through the following loop scans the list of holdout
@@ -415,7 +415,7 @@ static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
 }
 
 /* Processing between scanning taskslist and draining the holdout list. */
-void rcu_tasks_postscan(void)
+void rcu_tasks_postscan(struct list_head *hop)
 {
 	/*
 	 * Wait for tasks that are in the process of exiting.  This
@@ -932,9 +932,17 @@ static void rcu_tasks_trace_pertask(struct task_struct *t,
 	trc_wait_for_one_reader(t, hop);
 }
 
-/* Do intermediate processing between task and holdout scans. */
-static void rcu_tasks_trace_postscan(void)
+/*
+ * Do intermediate processing between task and holdout scans and
+ * pick up the idle tasks.
+ */
+static void rcu_tasks_trace_postscan(struct list_head *hop)
 {
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		rcu_tasks_trace_pertask(idle_task(cpu), hop);
+
 	// Re-enable CPU hotplug now that the tasklist scan has completed.
 	cpus_read_unlock();
 
-- 
2.9.5



* [PATCH v3 tip/core/rcu 31/34] rcu-tasks: Add rcu_dynticks_zero_in_eqs() effectiveness statistics
  2020-03-27 22:23   ` [PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline Paul E. McKenney
                       ` (29 preceding siblings ...)
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 30/34] rcu-tasks: Make RCU tasks trace also wait for idle tasks paulmck
@ 2020-03-27 22:24     ` paulmck
  2020-03-27 22:24     ` [PATCH v3 tip/core/rcu 32/34] rcu-tasks: Add count for idle tasks on offline CPUs paulmck
                       ` (3 subsequent siblings)
  34 siblings, 0 replies; 171+ messages in thread
From: paulmck @ 2020-03-27 22:24 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit adds counts of the number of calls and number of successful
calls to rcu_dynticks_zero_in_eqs(), which are printed at the end
of rcutorture runs and at stall time.  This allows evaluation of the
effectiveness of rcu_dynticks_zero_in_eqs().
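
A minimal userspace sketch of this attempts-versus-successes accounting
follows.  The names sk_attempts, sk_updates, sk_try_heavy_check(), and
sk_format_stats() are hypothetical, as is the "heavy:" output format,
which is not claimed to match what the kernel prints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

static unsigned long sk_attempts; /* checks attempted */
static unsigned long sk_updates;  /* checks that found a quiescent state */

/*
 * Count one attempt, and count a success only if the check passed.
 * The check_succeeds argument stands in for the real check's result.
 */
static bool sk_try_heavy_check(bool check_succeeds)
{
	sk_attempts++;
	if (!check_succeeds)
		return false;
	sk_updates++;
	return true;
}

/* Format "successes/attempts" into buf; returns snprintf()'s result. */
static int sk_format_stats(char *buf, size_t len)
{
	assert(buf && len > 0);
	return snprintf(buf, len, "heavy:%lu/%lu", sk_updates, sk_attempts);
}
```

A ratio near one would suggest that most running-task checks avoided
the fallback path, while a ratio near zero would suggest that the option
is buying little for the workload at hand.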

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 27d2458..c1a0706 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -725,6 +725,11 @@ DECLARE_WAIT_QUEUE_HEAD(trc_wait);	// List of holdout tasks.
 // Record outstanding IPIs to each CPU.  No point in sending two...
 static DEFINE_PER_CPU(bool, trc_ipi_to_cpu);
 
+// The number of detections of task quiescent state relying on
+// heavyweight readers executing explicit memory barriers.
+unsigned long n_heavy_reader_attempts;
+unsigned long n_heavy_reader_updates;
+
 void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
 DEFINE_RCU_TASKS(rcu_tasks_trace, rcu_tasks_wait_gp, call_rcu_tasks_trace,
 		 "RCU Tasks Trace");
@@ -830,9 +835,11 @@ static bool trc_inspect_reader(struct task_struct *t, void *arg)
 		// If heavyweight readers are enabled on the remote task,
 		// we can inspect its state despite its currently running.
 		// However, we cannot safely change its state.
+		n_heavy_reader_attempts++;
 		if (!ofl && // Check for "running" idle tasks on offline CPUs.
 		    !rcu_dynticks_zero_in_eqs(cpu, &t->trc_reader_nesting))
 			return false; // No quiescent state, do it the hard way.
+		n_heavy_reader_updates++;
 		in_qs = true;
 	} else {
 		in_qs = likely(!t->trc_reader_nesting);
@@ -1143,9 +1150,11 @@ core_initcall(rcu_spawn_tasks_trace_kthread);
 
 static void show_rcu_tasks_trace_gp_kthread(void)
 {
-	char buf[32];
+	char buf[64];
 
-	sprintf(buf, &