All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 tip/core/rcu 0/9
@ 2014-07-31 21:54 Paul E. McKenney
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-07-31 21:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

Hello!

This series provides v3 of a prototype of an RCU-tasks implementation,
which has been requested to assist with tramopoline removal.  This flavor
of RCU is task-based rather than CPU-based, and has voluntary context
switch, usermode execution, and the idle loops as its only quiescent
states.  This selection of quiescent states ensures that at the end
of a grace period, there will no longer be any tasks depending on a
trampoline that was removed before the beginning of that grace period.
This works because such trampolines do not contain function calls,
do not contain voluntary context switches, do not switch to usermode,
and do not switch to idle.

The patches in this series are as follows:

1.	Adds the basic call_rcu_tasks() functionality.

2.	Provides cond_resched_rcu_qs() to force quiescent states, including
	RCU-tasks quiescent states, in long loops.

3.	Adds synchronous APIs: synchronize_rcu_tasks() and
	rcu_barrier_tasks().

4.	Adds GPL exports for the above APIs, courtesy of Steven Rostedt.

5.	Adds rcutorture tests for RCU-tasks.

6.	Adds RCU-tasks test cases to rcutorture scripting.

7.	Adds stall-warning checks for RCU-tasks.

8.	Improves RCU-tasks energy efficiency by replacing polling with
	wait/wakeup.

9.	Document RCU-tasks stall-warning messages.

Changes from v2:

o	Use get_task_struct() instead of do_exit() hooks to synchronize
	with exiting tasks, as suggested by Lai Jiangshan.

o	Add checks of ->on_rq to the grace-period-wait polling, again
	as suggested by Lai Jiangshan.

o	Repositioned synchronize_sched() calls and improved their
	comments.

Changes from v1:

o	The lockdep issue with list locking was finessed by ditching
	list locking in favor of having the list manipulated by a single
	kthread.  This change trimmed about 150 highly concurrent lines
	from the implementation.

o	Get rid of the scheduler hooks in favor of polling the
	per-task count of voluntary context switches, in response
	to Peter Zijlstra's concerns about scheduler overhead.

o	Passes more aggressive rcutorture runs, which indicates that
	an increase in rcutorture's aggression is called for.

o	Handled review comments from Peter Zijlstra, Lai Jiangshan,
	Frederic Weisbecker, and Oleg Nesterov.

o	Added RCU-tasks stall-warning documentation.

Remaining issues include:

o	It is not clear that trampolines in functions called from the
	idle loop are correctly handled.  Or if anyone cares about
	trampolines in functions called from the idle loop.

o	The current implementation does not yet recognize tasks that start
	out executing is usermode.  Instead, it waits for the next
	scheduling-clock tick to note them.

o	As a result, the current implementation does not handle nohz_full=
	CPUs executing tasks running in usermode.  There are a couple of
	possible fixes under consideration.

o	If a task is preempted while executing in usermode, the RCU-tasks
	grace period will not end until that task resumes.  (Is there
	some reasonable way to determine that a given preempted task
	was preempted from usermode execution?)

o	More about RCU-tasks needs to be added to Documentation/RCU.

o	This version creates rcu_tasks_kthread() even if there never will
	be any uses, which is expected to be the common case.  A future
	version might create rcu_tasks_kthread() on demand, as suggested
	off-list by Josh Triplett.

o	There are probably still bugs.

							Thanx, Paul

------------------------------------------------------------------------

 b/Documentation/RCU/stallwarn.txt                             |   33 -
 b/Documentation/kernel-parameters.txt                         |    5 
 b/fs/file.c                                                   |    2 
 b/include/linux/init_task.h                                   |    9 
 b/include/linux/rcupdate.h                                    |   52 +
 b/include/linux/sched.h                                       |   23 
 b/init/Kconfig                                                |   10 
 b/kernel/rcu/rcutorture.c                                     |   44 +
 b/kernel/rcu/tiny.c                                           |    2 
 b/kernel/rcu/tree.c                                           |   14 
 b/kernel/rcu/tree_plugin.h                                    |    2 
 b/kernel/rcu/update.c                                         |  292 +++++++++-
 b/mm/mlock.c                                                  |    2 
 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01      |    7 
 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot |    1 
 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02      |    6 
 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot |    1 
 17 files changed, 460 insertions(+), 45 deletions(-)


^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-31 21:54 [PATCH v3 tip/core/rcu 0/9 Paul E. McKenney
@ 2014-07-31 21:55 ` Paul E. McKenney
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops Paul E. McKenney
                     ` (15 more replies)
  0 siblings, 16 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-07-31 21:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds a new RCU-tasks flavor of RCU, which provides
call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
context switch (not preemption!), userspace execution, and the idle loop.
Note that unlike other RCU flavors, these quiescent states occur in tasks,
not necessarily CPUs.  Includes fixes from Steven Rostedt.

This RCU flavor is assumed to have very infrequent latency-tolerate
updaters.  This assumption permits significant simplifications, including
a single global callback list protected by a single global lock, along
with a single linked list containing all tasks that have not yet passed
through a quiescent state.  If experience shows this assumption to be
incorrect, the required additional complexity will be added.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/init_task.h |   9 +++
 include/linux/rcupdate.h  |  36 ++++++++++
 include/linux/sched.h     |  23 ++++---
 init/Kconfig              |  10 +++
 kernel/rcu/tiny.c         |   2 +
 kernel/rcu/tree.c         |   2 +
 kernel/rcu/update.c       | 171 ++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 242 insertions(+), 11 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6df7f9fe0d01..78715ea7c30c 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -124,6 +124,14 @@ extern struct group_info init_groups;
 #else
 #define INIT_TASK_RCU_PREEMPT(tsk)
 #endif
+#ifdef CONFIG_TASKS_RCU
+#define INIT_TASK_RCU_TASKS(tsk)					\
+	.rcu_tasks_holdout = false,					\
+	.rcu_tasks_holdout_list =					\
+		LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
+#else
+#define INIT_TASK_RCU_TASKS(tsk)
+#endif
 
 extern struct cred init_cred;
 
@@ -231,6 +239,7 @@ extern struct task_group root_task_group;
 	INIT_FTRACE_GRAPH						\
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
+	INIT_TASK_RCU_TASKS(tsk)					\
 	INIT_CPUSET_SEQ(tsk)						\
 	INIT_RT_MUTEXES(tsk)						\
 	INIT_VTIME(tsk)							\
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 6a94cc8b1ca0..829efc99df3e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,
 
 void synchronize_sched(void);
 
+/**
+ * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. call_rcu_tasks() assumes
+ * that the read-side critical sections end at a voluntary context
+ * switch (not a preemption!), entry into idle, or transition to usermode
+ * execution.  As such, there are no read-side primitives analogous to
+ * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
+ * to determine that all tasks have passed through a safe state, not so
+ * much for data-strcuture synchronization.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
+ */
+void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
+
 #ifdef CONFIG_PREEMPT_RCU
 
 void __rcu_read_lock(void);
@@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
 		rcu_irq_exit(); \
 	} while (0)
 
+/*
+ * Note a voluntary context switch for RCU-tasks benefit.  This is a
+ * macro rather than an inline function to avoid #include hell.
+ */
+#ifdef CONFIG_TASKS_RCU
+#define rcu_note_voluntary_context_switch(t) \
+	do { \
+		preempt_disable(); /* Exclude synchronize_sched(); */ \
+		if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
+			ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
+		preempt_enable(); \
+	} while (0)
+#else /* #ifdef CONFIG_TASKS_RCU */
+#define rcu_note_voluntary_context_switch(t)	do { } while (0)
+#endif /* #else #ifdef CONFIG_TASKS_RCU */
+
 #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP)
 bool __rcu_is_watching(void);
 #endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 306f4f0c987a..3cf124389ec7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1273,6 +1273,11 @@ struct task_struct {
 #ifdef CONFIG_RCU_BOOST
 	struct rt_mutex *rcu_boost_mutex;
 #endif /* #ifdef CONFIG_RCU_BOOST */
+#ifdef CONFIG_TASKS_RCU
+	unsigned long rcu_tasks_nvcsw;
+	int rcu_tasks_holdout;
+	struct list_head rcu_tasks_holdout_list;
+#endif /* #ifdef CONFIG_TASKS_RCU */
 
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
 	struct sched_info sched_info;
@@ -1998,31 +2003,27 @@ extern void task_clear_jobctl_pending(struct task_struct *task,
 				      unsigned int mask);
 
 #ifdef CONFIG_PREEMPT_RCU
-
 #define RCU_READ_UNLOCK_BLOCKED (1 << 0) /* blocked while in RCU read-side. */
 #define RCU_READ_UNLOCK_NEED_QS (1 << 1) /* RCU core needs CPU response. */
+#endif /* #ifdef CONFIG_PREEMPT_RCU */
 
 static inline void rcu_copy_process(struct task_struct *p)
 {
+#ifdef CONFIG_PREEMPT_RCU
 	p->rcu_read_lock_nesting = 0;
 	p->rcu_read_unlock_special = 0;
-#ifdef CONFIG_TREE_PREEMPT_RCU
 	p->rcu_blocked_node = NULL;
-#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
 #ifdef CONFIG_RCU_BOOST
 	p->rcu_boost_mutex = NULL;
 #endif /* #ifdef CONFIG_RCU_BOOST */
 	INIT_LIST_HEAD(&p->rcu_node_entry);
+#endif /* #ifdef CONFIG_PREEMPT_RCU */
+#ifdef CONFIG_TASKS_RCU
+	p->rcu_tasks_holdout = false;
+	INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
+#endif /* #ifdef CONFIG_TASKS_RCU */
 }
 
-#else
-
-static inline void rcu_copy_process(struct task_struct *p)
-{
-}
-
-#endif
-
 static inline void tsk_restore_flags(struct task_struct *task,
 				unsigned long orig_flags, unsigned long flags)
 {
diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99af1b9..c56cb62a2df1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -507,6 +507,16 @@ config PREEMPT_RCU
 	  This option enables preemptible-RCU code that is common between
 	  the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.
 
+config TASKS_RCU
+	bool "Task_based RCU implementation using voluntary context switch"
+	default n
+	help
+	  This option enables a task-based RCU implementation that uses
+	  only voluntary context switch (not preemption!), idle, and
+	  user-mode execution as quiescent states.
+
+	  If unsure, say N.
+
 config RCU_STALL_COMMON
 	def_bool ( TREE_RCU || TREE_PREEMPT_RCU || RCU_TRACE )
 	help
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index d9efcc13008c..717f00854fc0 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
 		rcu_sched_qs(cpu);
 	else if (!in_softirq())
 		rcu_bh_qs(cpu);
+	if (user)
+		rcu_note_voluntary_context_switch(current);
 }
 
 /*
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 625d0b0cd75a..f958c52f644d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2413,6 +2413,8 @@ void rcu_check_callbacks(int cpu, int user)
 	rcu_preempt_check_callbacks(cpu);
 	if (rcu_pending(cpu))
 		invoke_rcu_core();
+	if (user)
+		rcu_note_voluntary_context_switch(current);
 	trace_rcu_utilization(TPS("End scheduler-tick"));
 }
 
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index bc7883570530..50453589e3ca 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -47,6 +47,7 @@
 #include <linux/hardirq.h>
 #include <linux/delay.h>
 #include <linux/module.h>
+#include <linux/kthread.h>
 
 #define CREATE_TRACE_POINTS
 
@@ -350,3 +351,173 @@ static int __init check_cpu_stall_init(void)
 early_initcall(check_cpu_stall_init);
 
 #endif /* #ifdef CONFIG_RCU_STALL_COMMON */
+
+#ifdef CONFIG_TASKS_RCU
+
+/*
+ * Simple variant of RCU whose quiescent states are voluntary context switch,
+ * user-space execution, and idle.  As such, grace periods can take one good
+ * long time.  There are no read-side primitives similar to rcu_read_lock()
+ * and rcu_read_unlock() because this implementation is intended to get
+ * the system into a safe state for some of the manipulations involved in
+ * tracing and the like.  Finally, this implementation does not support
+ * high call_rcu_tasks() rates from multiple CPUs.  If this is required,
+ * per-CPU callback lists will be needed.
+ */
+
+/* Lists of tasks that we are still waiting for during this grace period. */
+static LIST_HEAD(rcu_tasks_holdouts);
+
+/* Global list of callbacks and associated lock. */
+static struct rcu_head *rcu_tasks_cbs_head;
+static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
+static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
+
+/* Post an RCU-tasks callback. */
+void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
+{
+	unsigned long flags;
+
+	rhp->next = NULL;
+	rhp->func = func;
+	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
+	*rcu_tasks_cbs_tail = rhp;
+	rcu_tasks_cbs_tail = &rhp->next;
+	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+}
+EXPORT_SYMBOL_GPL(call_rcu_tasks);
+
+/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
+static int __noreturn rcu_tasks_kthread(void *arg)
+{
+	unsigned long flags;
+	struct task_struct *g, *t;
+	struct rcu_head *list;
+	struct rcu_head *next;
+
+	/* FIXME: Add housekeeping affinity. */
+
+	/*
+	 * Each pass through the following loop makes one check for
+	 * newly arrived callbacks, and, if there are some, waits for
+	 * one RCU-tasks grace period and then invokes the callbacks.
+	 * This loop is terminated by the system going down.  ;-)
+	 */
+	for (;;) {
+
+		/* Pick up any new callbacks. */
+		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
+		smp_mb__after_unlock_lock(); /* Enforce GP memory ordering. */
+		list = rcu_tasks_cbs_head;
+		rcu_tasks_cbs_head = NULL;
+		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
+		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+
+		/* If there were none, wait a bit and start over. */
+		if (!list) {
+			schedule_timeout_interruptible(HZ);
+			flush_signals(current);
+			continue;
+		}
+
+		/*
+		 * Wait for all pre-existing t->on_rq and t->nvcsw
+		 * transitions to complete.  Invoking synchronize_sched()
+		 * suffices because all these transitions occur with
+		 * interrupts disabled.  Without this synchronize_sched(),
+		 * a read-side critical section that started before the
+		 * grace period might be incorrectly seen as having started
+		 * after the grace period.
+		 *
+		 * This synchronize_sched() also dispenses with the
+		 * need for a memory barrier on the first store to
+		 * ->rcu_tasks_holdout, as it forces the store to happen
+		 * after the beginning of the grace period.
+		 */
+		synchronize_sched();
+
+		/*
+		 * There were callbacks, so we need to wait for an
+		 * RCU-tasks grace period.  Start off by scanning
+		 * the task list for tasks that are not already
+		 * voluntarily blocked.  Mark these tasks and make
+		 * a list of them in rcu_tasks_holdouts.
+		 */
+		rcu_read_lock();
+		for_each_process_thread(g, t) {
+			if (t != current && ACCESS_ONCE(t->on_rq) &&
+			    !is_idle_task(t)) {
+				get_task_struct(t);
+				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
+				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
+				list_add(&t->rcu_tasks_holdout_list,
+					 &rcu_tasks_holdouts);
+			}
+		}
+		rcu_read_unlock();
+
+		/*
+		 * Each pass through the following loop scans the list
+		 * of holdout tasks, removing any that are no longer
+		 * holdouts.  When the list is empty, we are done.
+		 */
+		while (!list_empty(&rcu_tasks_holdouts)) {
+			schedule_timeout_interruptible(HZ / 10);
+			flush_signals(current);
+			rcu_read_lock();
+			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
+						rcu_tasks_holdout_list) {
+				if (ACCESS_ONCE(t->rcu_tasks_holdout)) {
+					if (t->rcu_tasks_nvcsw ==
+					    ACCESS_ONCE(t->nvcsw) &&
+					    ACCESS_ONCE(t->on_rq))
+						continue;
+					ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
+				}
+				list_del_rcu(&t->rcu_tasks_holdout_list);
+				put_task_struct(t);
+			}
+			rcu_read_unlock();
+		}
+
+		/*
+		 * Because ->on_rq and ->nvcsw are not guaranteed
+		 * to have a full memory barriers prior to them in the
+		 * schedule() path, memory reordering on other CPUs could
+		 * cause their RCU-tasks read-side critical sections to
+		 * extend past the end of the grace period.  However,
+		 * because these ->nvcsw updates are carried out with
+		 * interrupts disabled, we can use synchronize_sched()
+		 * to force the needed ordering on all such CPUs.
+		 *
+		 * This synchronize_sched() also confines all
+		 * ->rcu_tasks_holdout accesses to be within the grace
+		 * period, avoiding the need for memory barriers for
+		 * ->rcu_tasks_holdout accesses.
+		 */
+		synchronize_sched();
+
+		/* Invoke the callbacks. */
+		while (list) {
+			next = list->next;
+			local_bh_disable();
+			list->func(list);
+			local_bh_enable();
+			list = next;
+			cond_resched();
+		}
+	}
+}
+
+/* Spawn rcu_tasks_kthread() at boot time. */
+static int __init rcu_spawn_tasks_kthread(void)
+{
+	struct task_struct __maybe_unused *t;
+
+	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
+	BUG_ON(IS_ERR(t));
+	return 0;
+}
+early_initcall(rcu_spawn_tasks_kthread);
+
+#endif /* #ifdef CONFIG_TASKS_RCU */
-- 
1.8.1.5


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH v3 tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
@ 2014-07-31 21:55   ` Paul E. McKenney
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks Paul E. McKenney
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-07-31 21:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

RCU-tasks requires the occasional voluntary context switch
from CPU-bound in-kernel tasks.  In some cases, this requires
instrumenting cond_resched().  However, there is some reluctance
to countenance unconditionally instrumenting cond_resched() (see
http://lwn.net/Articles/603252/), so this commit creates a separate
cond_resched_rcu_qs() that may be used in place of cond_resched() in
locations prone to long-duration in-kernel looping.

This commit currently instruments only RCU-tasks.  Future possibilities
include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
IPI usage.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 fs/file.c                |  2 +-
 include/linux/rcupdate.h | 13 +++++++++++++
 kernel/rcu/rcutorture.c  |  4 ++--
 kernel/rcu/tree.c        | 12 ++++++------
 kernel/rcu/tree_plugin.h |  2 +-
 mm/mlock.c               |  2 +-
 6 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 66923fe3176e..1cafc4c9275b 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -367,7 +367,7 @@ static struct fdtable *close_files(struct files_struct * files)
 				struct file * file = xchg(&fdt->fd[i], NULL);
 				if (file) {
 					filp_close(file, files);
-					cond_resched();
+					cond_resched_rcu_qs();
 				}
 			}
 			i++;
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 829efc99df3e..ac87f587a1c1 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -330,6 +330,19 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
 #define rcu_note_voluntary_context_switch(t)	do { } while (0)
 #endif /* #else #ifdef CONFIG_TASKS_RCU */
 
+/**
+ * cond_resched_rcu_qs - Report potential quiescent states to RCU
+ *
+ * This macro resembles cond_resched(), except that it is defined to
+ * report potential quiescent states to RCU-tasks even if the cond_resched()
+ * machinery were to be shut off, as some advocate for PREEMPT kernels.
+ */
+#define cond_resched_rcu_qs() \
+do { \
+	rcu_note_voluntary_context_switch(current); \
+	cond_resched(); \
+} while (0)
+
 #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP)
 bool __rcu_is_watching(void);
 #endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 7fa34f86e5ba..febe07062ac5 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -667,7 +667,7 @@ static int rcu_torture_boost(void *arg)
 				}
 				call_rcu_time = jiffies;
 			}
-			cond_resched();
+			cond_resched_rcu_qs();
 			stutter_wait("rcu_torture_boost");
 			if (torture_must_stop())
 				goto checkwait;
@@ -1019,7 +1019,7 @@ rcu_torture_reader(void *arg)
 		__this_cpu_inc(rcu_torture_batch[completed]);
 		preempt_enable();
 		cur_ops->readunlock(idx);
-		cond_resched();
+		cond_resched_rcu_qs();
 		stutter_wait("rcu_torture_reader");
 	} while (!torture_must_stop());
 	if (irqreader && cur_ops->irq_capable) {
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f958c52f644d..645a33efc0d4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1650,7 +1650,7 @@ static int rcu_gp_init(struct rcu_state *rsp)
 		    system_state == SYSTEM_RUNNING)
 			udelay(200);
 #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
-		cond_resched();
+		cond_resched_rcu_qs();
 	}
 
 	mutex_unlock(&rsp->onoff_mutex);
@@ -1739,7 +1739,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 		/* smp_mb() provided by prior unlock-lock pair. */
 		nocb += rcu_future_gp_cleanup(rsp, rnp);
 		raw_spin_unlock_irq(&rnp->lock);
-		cond_resched();
+		cond_resched_rcu_qs();
 	}
 	rnp = rcu_get_root(rsp);
 	raw_spin_lock_irq(&rnp->lock);
@@ -1788,7 +1788,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
 			/* Locking provides needed memory barrier. */
 			if (rcu_gp_init(rsp))
 				break;
-			cond_resched();
+			cond_resched_rcu_qs();
 			flush_signals(current);
 			trace_rcu_grace_period(rsp->name,
 					       ACCESS_ONCE(rsp->gpnum),
@@ -1831,10 +1831,10 @@ static int __noreturn rcu_gp_kthread(void *arg)
 				trace_rcu_grace_period(rsp->name,
 						       ACCESS_ONCE(rsp->gpnum),
 						       TPS("fqsend"));
-				cond_resched();
+				cond_resched_rcu_qs();
 			} else {
 				/* Deal with stray signal. */
-				cond_resched();
+				cond_resched_rcu_qs();
 				flush_signals(current);
 				trace_rcu_grace_period(rsp->name,
 						       ACCESS_ONCE(rsp->gpnum),
@@ -2437,7 +2437,7 @@ static void force_qs_rnp(struct rcu_state *rsp,
 	struct rcu_node *rnp;
 
 	rcu_for_each_leaf_node(rsp, rnp) {
-		cond_resched();
+		cond_resched_rcu_qs();
 		mask = 0;
 		raw_spin_lock_irqsave(&rnp->lock, flags);
 		smp_mb__after_unlock_lock();
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 02ac0fb186b8..a86a363ea453 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1842,7 +1842,7 @@ static int rcu_oom_notify(struct notifier_block *self,
 	get_online_cpus();
 	for_each_online_cpu(cpu) {
 		smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);
-		cond_resched();
+		cond_resched_rcu_qs();
 	}
 	put_online_cpus();
 
diff --git a/mm/mlock.c b/mm/mlock.c
index b1eb53634005..bc386a22d647 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -782,7 +782,7 @@ static int do_mlockall(int flags)
 
 		/* Ignore errors */
 		mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
-		cond_resched();
+		cond_resched_rcu_qs();
 	}
 out:
 	return 0;
-- 
1.8.1.5


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops Paul E. McKenney
@ 2014-07-31 21:55   ` Paul E. McKenney
  2014-08-01 15:09     ` Oleg Nesterov
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 4/9] rcu: Export RCU-tasks APIs to GPL modules Paul E. McKenney
                     ` (13 subsequent siblings)
  15 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-07-31 21:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

It turns out to be easier to add the synchronous grace-period waiting
functions to RCU-tasks than to work around their absense in rcutorture,
so this commit adds them.  The key point is that the existence of
call_rcu_tasks() means that rcutorture needs an rcu_barrier_tasks().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/rcupdate.h |  2 ++
 kernel/rcu/update.c      | 55 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index ac87f587a1c1..1f073af940a5 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -216,6 +216,8 @@ void synchronize_sched(void);
  * memory ordering guarantees.
  */
 void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
+void synchronize_rcu_tasks(void);
+void rcu_barrier_tasks(void);
 
 #ifdef CONFIG_PREEMPT_RCU
 
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 50453589e3ca..fe866848a063 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -387,6 +387,61 @@ void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
 }
 EXPORT_SYMBOL_GPL(call_rcu_tasks);
 
+/**
+ * synchronize_rcu_tasks - wait until an rcu-tasks grace period has elapsed.
+ *
+ * Control will return to the caller some time after a full rcu-tasks
+ * grace period has elapsed, in other words after all currently
+ * executing rcu-tasks read-side critical sections have elapsed.  These
+ * read-side critical sections are delimited by calls to schedule(),
+ * cond_resched_rcu_qs(), idle execution, userspace execution, calls
+ * to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().
+ *
+ * This is a very specialized primitive, intended only for a few uses in
+ * tracing and other situations requiring manipulation of function
+ * preambles and profiling hooks.  The synchronize_rcu_tasks() function
+ * is not (yet) intended for heavy use from multiple CPUs.
+ *
+ * Note that this guarantee implies further memory-ordering guarantees.
+ * On systems with more than one CPU, when synchronize_rcu_tasks() returns,
+ * each CPU is guaranteed to have executed a full memory barrier since the
+ * end of its last RCU-tasks read-side critical section whose beginning
+ * preceded the call to synchronize_rcu_tasks().  In addition, each CPU
+ * having an RCU-tasks read-side critical section that extends beyond
+ * the return from synchronize_rcu_tasks() is guaranteed to have executed
+ * a full memory barrier after the beginning of synchronize_rcu_tasks()
+ * and before the beginning of that RCU-tasks read-side critical section.
+ * Note that these guarantees include CPUs that are offline, idle, or
+ * executing in user mode, as well as CPUs that are executing in the kernel.
+ *
+ * Furthermore, if CPU A invoked synchronize_rcu_tasks(), which returned
+ * to its caller on CPU B, then both CPU A and CPU B are guaranteed
+ * to have executed a full memory barrier during the execution of
+ * synchronize_rcu_tasks() -- even if CPU A and CPU B are the same CPU
+ * (but again only if the system has more than one CPU).
+ */
+void synchronize_rcu_tasks(void)
+{
+	/* Complain if the scheduler has not started.  */
+	rcu_lockdep_assert(!rcu_scheduler_active,
+			   "synchronize_rcu_tasks called too soon");
+
+	/* Wait for the grace period. */
+	wait_rcu_gp(call_rcu_tasks);
+}
+
+/**
+ * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
+ *
+ * Although the current implementation is guaranteed to wait, it is not
+ * obligated to, for example, if there are no pending callbacks.
+ */
+void rcu_barrier_tasks(void)
+{
+	/* There is only one callback queue, so this is easy.  ;-) */
+	synchronize_rcu_tasks();
+}
+
 /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
 static int __noreturn rcu_tasks_kthread(void *arg)
 {
-- 
1.8.1.5


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH v3 tip/core/rcu 4/9] rcu: Export RCU-tasks APIs to GPL modules
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops Paul E. McKenney
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks Paul E. McKenney
@ 2014-07-31 21:55   ` Paul E. McKenney
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 5/9] rcutorture: Add torture tests for RCU-tasks Paul E. McKenney
                     ` (12 subsequent siblings)
  15 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-07-31 21:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: Steven Rostedt <rostedt@goodmis.org>

This commit exports the RCU-tasks APIs, call_rcu_tasks(),
synchronize_rcu_tasks(), and rcu_barrier_tasks(), to GPL-licensed
kernel modules.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcu/update.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index fe866848a063..b7694019e952 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -429,6 +429,7 @@ void synchronize_rcu_tasks(void)
 	/* Wait for the grace period. */
 	wait_rcu_gp(call_rcu_tasks);
 }
+EXPORT_SYMBOL_GPL(synchronize_rcu_tasks);
 
 /**
  * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
@@ -441,6 +442,7 @@ void rcu_barrier_tasks(void)
 	/* There is only one callback queue, so this is easy.  ;-) */
 	synchronize_rcu_tasks();
 }
+EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
 
 /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
 static int __noreturn rcu_tasks_kthread(void *arg)
-- 
1.8.1.5


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH v3 tip/core/rcu 5/9] rcutorture: Add torture tests for RCU-tasks
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (2 preceding siblings ...)
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 4/9] rcu: Export RCU-tasks APIs to GPL modules Paul E. McKenney
@ 2014-07-31 21:55   ` Paul E. McKenney
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 6/9] rcutorture: Add RCU-tasks test cases Paul E. McKenney
                     ` (11 subsequent siblings)
  15 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-07-31 21:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds torture tests for RCU-tasks.  It also fixes a bug that
would segfault for an RCU flavor lacking a callback-barrier function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 include/linux/rcupdate.h |  1 +
 kernel/rcu/rcutorture.c  | 40 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 1f073af940a5..f9d314bbc7b6 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -55,6 +55,7 @@ enum rcutorture_type {
 	RCU_FLAVOR,
 	RCU_BH_FLAVOR,
 	RCU_SCHED_FLAVOR,
+	RCU_TASKS_FLAVOR,
 	SRCU_FLAVOR,
 	INVALID_RCU_FLAVOR
 };
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index febe07062ac5..6d12ab6675bc 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -602,6 +602,42 @@ static struct rcu_torture_ops sched_ops = {
 };
 
 /*
+ * Definitions for RCU-tasks torture testing.
+ */
+
+static int tasks_torture_read_lock(void)
+{
+	return 0;
+}
+
+static void tasks_torture_read_unlock(int idx)
+{
+}
+
+static void rcu_tasks_torture_deferred_free(struct rcu_torture *p)
+{
+	call_rcu_tasks(&p->rtort_rcu, rcu_torture_cb);
+}
+
+static struct rcu_torture_ops tasks_ops = {
+	.ttype		= RCU_TASKS_FLAVOR,
+	.init		= rcu_sync_torture_init,
+	.readlock	= tasks_torture_read_lock,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= tasks_torture_read_unlock,
+	.completed	= rcu_no_completed,
+	.deferred_free	= rcu_tasks_torture_deferred_free,
+	.sync		= synchronize_rcu_tasks,
+	.exp_sync	= synchronize_rcu_tasks,
+	.call		= call_rcu_tasks,
+	.cb_barrier	= rcu_barrier_tasks,
+	.fqs		= NULL,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.name		= "tasks"
+};
+
+/*
  * RCU torture priority-boost testing.  Runs one real-time thread per
  * CPU for moderate bursts, repeatedly registering RCU callbacks and
  * spinning waiting for them to be invoked.  If a given callback takes
@@ -1295,7 +1331,8 @@ static int rcu_torture_barrier_cbs(void *arg)
 		if (atomic_dec_and_test(&barrier_cbs_count))
 			wake_up(&barrier_wq);
 	} while (!torture_must_stop());
-	cur_ops->cb_barrier();
+	if (cur_ops->cb_barrier != NULL)
+		cur_ops->cb_barrier();
 	destroy_rcu_head_on_stack(&rcu);
 	torture_kthread_stopping("rcu_torture_barrier_cbs");
 	return 0;
@@ -1534,6 +1571,7 @@ rcu_torture_init(void)
 	int firsterr = 0;
 	static struct rcu_torture_ops *torture_ops[] = {
 		&rcu_ops, &rcu_bh_ops, &rcu_busted_ops, &srcu_ops, &sched_ops,
+		&tasks_ops,
 	};
 
 	if (!torture_init_begin(torture_type, verbose, &rcutorture_runnable))
-- 
1.8.1.5


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH v3 tip/core/rcu 6/9] rcutorture: Add RCU-tasks test cases
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (3 preceding siblings ...)
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 5/9] rcutorture: Add torture tests for RCU-tasks Paul E. McKenney
@ 2014-07-31 21:55   ` Paul E. McKenney
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 7/9] rcu: Add stall-warning checks for RCU-tasks Paul E. McKenney
                     ` (10 subsequent siblings)
  15 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-07-31 21:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds the TASKS01 and TASKS02 Kconfig fragments, along with
the corresponding TASKS01.boot and TASKS02.boot boot-parameter files
specifying that rcutorture test RCU-tasks instead of the default flavor.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 tools/testing/selftests/rcutorture/configs/rcu/TASKS01      | 7 +++++++
 tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot | 1 +
 tools/testing/selftests/rcutorture/configs/rcu/TASKS02      | 6 ++++++
 tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot | 1 +
 4 files changed, 15 insertions(+)
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TASKS01
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TASKS02
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS01 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01
new file mode 100644
index 000000000000..263a20f01fae
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01
@@ -0,0 +1,7 @@
+CONFIG_SMP=y
+CONFIG_NR_CPUS=2
+CONFIG_HOTPLUG_CPU=y
+CONFIG_PREEMPT_NONE=n
+CONFIG_PREEMPT_VOLUNTARY=n
+CONFIG_PREEMPT=y
+CONFIG_TASKS_RCU=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot
new file mode 100644
index 000000000000..cd2a188eeb6d
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot
@@ -0,0 +1 @@
+rcutorture.torture_type=tasks
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS02 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02
new file mode 100644
index 000000000000..17b669c8833c
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02
@@ -0,0 +1,6 @@
+CONFIG_SMP=n
+CONFIG_HOTPLUG_CPU=y
+CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_VOLUNTARY=n
+CONFIG_PREEMPT=n
+CONFIG_TASKS_RCU=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot
new file mode 100644
index 000000000000..cd2a188eeb6d
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot
@@ -0,0 +1 @@
+rcutorture.torture_type=tasks
-- 
1.8.1.5


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH v3 tip/core/rcu 7/9] rcu: Add stall-warning checks for RCU-tasks
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (4 preceding siblings ...)
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 6/9] rcutorture: Add RCU-tasks test cases Paul E. McKenney
@ 2014-07-31 21:55   ` Paul E. McKenney
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 8/9] rcu: Improve RCU-tasks energy efficiency Paul E. McKenney
                     ` (9 subsequent siblings)
  15 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-07-31 21:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds a three-minute RCU-tasks stall warning.  The actual
time is controlled by the boot/sysfs parameter rcu_task_stall_timeout,
with values less than or equal to zero disabling the stall warnings.
The default value is three minutes, which means that the tasks that
have not yet responded will get their stacks dumped every three minutes,
until they pass through a voluntary context switch.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 Documentation/kernel-parameters.txt |  5 ++++
 kernel/rcu/update.c                 | 50 +++++++++++++++++++++++++++++--------
 2 files changed, 44 insertions(+), 11 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 910c3829f81d..8cdbde7b17f5 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2921,6 +2921,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	rcupdate.rcu_cpu_stall_timeout= [KNL]
 			Set timeout for RCU CPU stall warning messages.
 
+	rcupdate.rcu_task_stall_timeout= [KNL]
+			Set timeout in jiffies for RCU task stall warning
+			messages.  Disable with a value less than or equal
+			to zero.
+
 	rdinit=		[KNL]
 			Format: <full_path>
 			Run specified binary instead of /init from the ramdisk,
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index b7694019e952..e940b86af4e8 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -373,6 +373,10 @@ static struct rcu_head *rcu_tasks_cbs_head;
 static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
 static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
 
+/* Control stall timeouts.  Disable with <= 0, otherwise jiffies till stall. */
+static int rcu_task_stall_timeout __read_mostly = HZ * 60 * 3;
+module_param(rcu_task_stall_timeout, int, 0644);
+
 /* Post an RCU-tasks callback. */
 void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
 {
@@ -444,11 +448,33 @@ void rcu_barrier_tasks(void)
 }
 EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
 
+/* See if tasks are still holding out, complain if so. */
+static void check_holdout_task(struct task_struct *t,
+			       bool needreport, bool *firstreport)
+{
+	if (!ACCESS_ONCE(t->rcu_tasks_holdout) ||
+	    t->rcu_tasks_nvcsw != ACCESS_ONCE(t->nvcsw) ||
+	    !ACCESS_ONCE(t->on_rq)) {
+		ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
+		list_del_rcu(&t->rcu_tasks_holdout_list);
+		put_task_struct(t);
+		return;
+	}
+	if (!needreport)
+		return;
+	if (*firstreport) {
+		pr_err("INFO: rcu_tasks detected stalls on tasks:\n");
+		*firstreport = false;
+	}
+	sched_show_task(current);
+}
+
 /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
 static int __noreturn rcu_tasks_kthread(void *arg)
 {
 	unsigned long flags;
 	struct task_struct *g, *t;
+	unsigned long lastreport;
 	struct rcu_head *list;
 	struct rcu_head *next;
 
@@ -518,22 +544,24 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 		 * of holdout tasks, removing any that are no longer
 		 * holdouts.  When the list is empty, we are done.
 		 */
+		lastreport = jiffies;
 		while (!list_empty(&rcu_tasks_holdouts)) {
+			bool firstreport;
+			bool needreport;
+			int rtst;
+
 			schedule_timeout_interruptible(HZ / 10);
+			rtst = ACCESS_ONCE(rcu_task_stall_timeout);
+			needreport = rtst > 0 &&
+				     time_after(jiffies, lastreport + rtst);
+			if (needreport)
+				lastreport = jiffies;
+			firstreport = true;
 			flush_signals(current);
 			rcu_read_lock();
 			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
-						rcu_tasks_holdout_list) {
-				if (ACCESS_ONCE(t->rcu_tasks_holdout)) {
-					if (t->rcu_tasks_nvcsw ==
-					    ACCESS_ONCE(t->nvcsw) &&
-					    ACCESS_ONCE(t->on_rq))
-						continue;
-					ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
-				}
-				list_del_rcu(&t->rcu_tasks_holdout_list);
-				put_task_struct(t);
-			}
+						rcu_tasks_holdout_list)
+				check_holdout_task(t, needreport, &firstreport);
 			rcu_read_unlock();
 		}
 
-- 
1.8.1.5


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH v3 tip/core/rcu 8/9] rcu: Improve RCU-tasks energy efficiency
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (5 preceding siblings ...)
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 7/9] rcu: Add stall-warning checks for RCU-tasks Paul E. McKenney
@ 2014-07-31 21:55   ` Paul E. McKenney
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 9/9] documentation: Add verbiage on RCU-tasks stall warning messages Paul E. McKenney
                     ` (8 subsequent siblings)
  15 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-07-31 21:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The current RCU-tasks implementation uses strict polling to detect
callback arrivals.  This works quite well, but is not so good for
energy efficiency.  This commit therefore replaces the strict polling
with a wait queue.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/update.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index e940b86af4e8..f14a79d0d6de 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -371,6 +371,7 @@ static LIST_HEAD(rcu_tasks_holdouts);
 /* Global list of callbacks and associated lock. */
 static struct rcu_head *rcu_tasks_cbs_head;
 static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
+static DECLARE_WAIT_QUEUE_HEAD(rcu_tasks_cbs_wq);
 static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
 
 /* Control stall timeouts.  Disable with <= 0, otherwise jiffies till stall. */
@@ -381,13 +382,17 @@ module_param(rcu_task_stall_timeout, int, 0644);
 void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
 {
 	unsigned long flags;
+	bool needwake;
 
 	rhp->next = NULL;
 	rhp->func = func;
 	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
+	needwake = !rcu_tasks_cbs_head;
 	*rcu_tasks_cbs_tail = rhp;
 	rcu_tasks_cbs_tail = &rhp->next;
 	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+	if (needwake)
+		wake_up(&rcu_tasks_cbs_wq);
 }
 EXPORT_SYMBOL_GPL(call_rcu_tasks);
 
@@ -498,8 +503,12 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 
 		/* If there were none, wait a bit and start over. */
 		if (!list) {
-			schedule_timeout_interruptible(HZ);
-			flush_signals(current);
+			wait_event_interruptible(rcu_tasks_cbs_wq,
+						 rcu_tasks_cbs_head);
+			if (!rcu_tasks_cbs_head) {
+				flush_signals(current);
+				schedule_timeout_interruptible(HZ/10);
+			}
 			continue;
 		}
 
@@ -591,6 +600,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 			list = next;
 			cond_resched();
 		}
+		schedule_timeout_uninterruptible(HZ/10);
 	}
 }
 
-- 
1.8.1.5


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH v3 tip/core/rcu 9/9] documentation: Add verbiage on RCU-tasks stall warning messages
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (6 preceding siblings ...)
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 8/9] rcu: Improve RCU-tasks energy efficiency Paul E. McKenney
@ 2014-07-31 21:55   ` Paul E. McKenney
  2014-07-31 23:57   ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Frederic Weisbecker
                     ` (7 subsequent siblings)
  15 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-07-31 21:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit documents RCU-tasks stall warning messages and also describes
when to use the new cond_resched_rcu_qs() API.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 Documentation/RCU/stallwarn.txt | 33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
index 68fe3ad27015..ef5a2fd4ff70 100644
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -56,8 +56,20 @@ RCU_STALL_RAT_DELAY
 	two jiffies.  (This is a cpp macro, not a kernel configuration
 	parameter.)
 
-When a CPU detects that it is stalling, it will print a message similar
-to the following:
+rcupdate.rcu_task_stall_timeout
+
+	This boot/sysfs parameter controls the RCU-tasks stall warning
+	interval.  A value of zero or less suppresses RCU-tasks stall
+	warnings.  A positive value sets the stall-warning interval
+	in jiffies.  An RCU-tasks stall warning starts wtih the line:
+
+		INFO: rcu_tasks detected stalls on tasks:
+
+	And continues with the output of sched_show_task() for each
+	task stalling the current RCU-tasks grace period.
+
+For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling,
+it will print a message similar to the following:
 
 INFO: rcu_sched_state detected stall on CPU 5 (t=2500 jiffies)
 
@@ -174,8 +186,12 @@ o	A CPU looping with preemption disabled.  This condition can
 o	A CPU looping with bottom halves disabled.  This condition can
 	result in RCU-sched and RCU-bh stalls.
 
-o	For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel
-	without invoking schedule().
+o	For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the
+	kernel without invoking schedule().  Note that cond_resched()
+	does not necessarily prevent RCU CPU stall warnings.  Therefore,
+	if the looping in the kernel is really expected and desirable
+	behavior, you might need to replace some of the cond_resched()
+	calls with calls to cond_resched_rcu_qs().
 
 o	A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
 	happen to preempt a low-priority task in the middle of an RCU
@@ -208,11 +224,10 @@ o	A hardware failure.  This is quite unlikely, but has occurred
 	This resulted in a series of RCU CPU stall warnings, eventually
 	leading the realization that the CPU had failed.
 
-The RCU, RCU-sched, and RCU-bh implementations have CPU stall warning.
-SRCU does not have its own CPU stall warnings, but its calls to
-synchronize_sched() will result in RCU-sched detecting RCU-sched-related
-CPU stalls.  Please note that RCU only detects CPU stalls when there is
-a grace period in progress.  No grace period, no CPU stall warnings.
+The RCU, RCU-sched, RCU-bh, and RCU-tasks implementations have CPU stall
+warning.  Note that SRCU does -not- have CPU stall warnings.  Please note
+that RCU only detects CPU stalls when there is a grace period in progress.
+No grace period, no CPU stall warnings.
 
 To diagnose the cause of the stall, inspect the stack traces.
 The offending function will usually be near the top of the stack.
-- 
1.8.1.5


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (7 preceding siblings ...)
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 9/9] documentation: Add verbiage on RCU-tasks stall warning messages Paul E. McKenney
@ 2014-07-31 23:57   ` Frederic Weisbecker
  2014-08-01  2:04     ` Paul E. McKenney
  2014-08-01  1:15   ` Lai Jiangshan
                     ` (6 subsequent siblings)
  15 siblings, 1 reply; 122+ messages in thread
From: Frederic Weisbecker @ 2014-07-31 23:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart, oleg,
	bobby.prani

On Thu, Jul 31, 2014 at 02:55:01PM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> This commit adds a new RCU-tasks flavor of RCU, which provides
> call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
> context switch (not preemption!), userspace execution, and the idle loop.
> Note that unlike other RCU flavors, these quiescent states occur in tasks,
> not necessarily CPUs.  Includes fixes from Steven Rostedt.
> 
> This RCU flavor is assumed to have very infrequent latency-tolerate
> updaters.  This assumption permits significant simplifications, including
> a single global callback list protected by a single global lock, along
> with a single linked list containing all tasks that have not yet passed
> through a quiescent state.  If experience shows this assumption to be
> incorrect, the required additional complexity will be added.
> 
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  include/linux/init_task.h |   9 +++
>  include/linux/rcupdate.h  |  36 ++++++++++
>  include/linux/sched.h     |  23 ++++---
>  init/Kconfig              |  10 +++
>  kernel/rcu/tiny.c         |   2 +
>  kernel/rcu/tree.c         |   2 +
>  kernel/rcu/update.c       | 171 ++++++++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 242 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 6df7f9fe0d01..78715ea7c30c 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -124,6 +124,14 @@ extern struct group_info init_groups;
>  #else
>  #define INIT_TASK_RCU_PREEMPT(tsk)
>  #endif
> +#ifdef CONFIG_TASKS_RCU
> +#define INIT_TASK_RCU_TASKS(tsk)					\
> +	.rcu_tasks_holdout = false,					\
> +	.rcu_tasks_holdout_list =					\
> +		LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
> +#else
> +#define INIT_TASK_RCU_TASKS(tsk)
> +#endif
>  
>  extern struct cred init_cred;
>  
> @@ -231,6 +239,7 @@ extern struct task_group root_task_group;
>  	INIT_FTRACE_GRAPH						\
>  	INIT_TRACE_RECURSION						\
>  	INIT_TASK_RCU_PREEMPT(tsk)					\
> +	INIT_TASK_RCU_TASKS(tsk)					\
>  	INIT_CPUSET_SEQ(tsk)						\
>  	INIT_RT_MUTEXES(tsk)						\
>  	INIT_VTIME(tsk)							\
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 6a94cc8b1ca0..829efc99df3e 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,
>  
>  void synchronize_sched(void);
>  
> +/**
> + * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all currently executing RCU
> + * read-side critical sections have completed. call_rcu_tasks() assumes
> + * that the read-side critical sections end at a voluntary context
> + * switch (not a preemption!), entry into idle, or transition to usermode
> + * execution.  As such, there are no read-side primitives analogous to
> + * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
> + * to determine that all tasks have passed through a safe state, not so
> + * much for data-strcuture synchronization.
> + *
> + * See the description of call_rcu() for more detailed information on
> + * memory ordering guarantees.
> + */
> +void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
> +
>  #ifdef CONFIG_PREEMPT_RCU
>  
>  void __rcu_read_lock(void);
> @@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
>  		rcu_irq_exit(); \
>  	} while (0)
>  
> +/*
> + * Note a voluntary context switch for RCU-tasks benefit.  This is a
> + * macro rather than an inline function to avoid #include hell.
> + */
> +#ifdef CONFIG_TASKS_RCU
> +#define rcu_note_voluntary_context_switch(t) \
> +	do { \
> +		preempt_disable(); /* Exclude synchronize_sched(); */ \
> +		if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
> +			ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
> +		preempt_enable(); \
> +	} while (0)
> +#else /* #ifdef CONFIG_TASKS_RCU */
> +#define rcu_note_voluntary_context_switch(t)	do { } while (0)
> +#endif /* #else #ifdef CONFIG_TASKS_RCU */
> +
>  #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP)
>  bool __rcu_is_watching(void);
>  #endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 306f4f0c987a..3cf124389ec7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1273,6 +1273,11 @@ struct task_struct {
>  #ifdef CONFIG_RCU_BOOST
>  	struct rt_mutex *rcu_boost_mutex;
>  #endif /* #ifdef CONFIG_RCU_BOOST */
> +#ifdef CONFIG_TASKS_RCU
> +	unsigned long rcu_tasks_nvcsw;
> +	int rcu_tasks_holdout;
> +	struct list_head rcu_tasks_holdout_list;
> +#endif /* #ifdef CONFIG_TASKS_RCU */
>  
>  #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
>  	struct sched_info sched_info;
> @@ -1998,31 +2003,27 @@ extern void task_clear_jobctl_pending(struct task_struct *task,
>  				      unsigned int mask);
>  
>  #ifdef CONFIG_PREEMPT_RCU
> -
>  #define RCU_READ_UNLOCK_BLOCKED (1 << 0) /* blocked while in RCU read-side. */
>  #define RCU_READ_UNLOCK_NEED_QS (1 << 1) /* RCU core needs CPU response. */
> +#endif /* #ifdef CONFIG_PREEMPT_RCU */
>  
>  static inline void rcu_copy_process(struct task_struct *p)
>  {
> +#ifdef CONFIG_PREEMPT_RCU
>  	p->rcu_read_lock_nesting = 0;
>  	p->rcu_read_unlock_special = 0;
> -#ifdef CONFIG_TREE_PREEMPT_RCU
>  	p->rcu_blocked_node = NULL;
> -#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
>  #ifdef CONFIG_RCU_BOOST
>  	p->rcu_boost_mutex = NULL;
>  #endif /* #ifdef CONFIG_RCU_BOOST */
>  	INIT_LIST_HEAD(&p->rcu_node_entry);
> +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> +#ifdef CONFIG_TASKS_RCU
> +	p->rcu_tasks_holdout = false;
> +	INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
> +#endif /* #ifdef CONFIG_TASKS_RCU */
>  }
>  
> -#else
> -
> -static inline void rcu_copy_process(struct task_struct *p)
> -{
> -}
> -
> -#endif
> -
>  static inline void tsk_restore_flags(struct task_struct *task,
>  				unsigned long orig_flags, unsigned long flags)
>  {
> diff --git a/init/Kconfig b/init/Kconfig
> index 9d76b99af1b9..c56cb62a2df1 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -507,6 +507,16 @@ config PREEMPT_RCU
>  	  This option enables preemptible-RCU code that is common between
>  	  the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.
>  
> +config TASKS_RCU
> +	bool "Task_based RCU implementation using voluntary context switch"
> +	default n
> +	help
> +	  This option enables a task-based RCU implementation that uses
> +	  only voluntary context switch (not preemption!), idle, and
> +	  user-mode execution as quiescent states.
> +
> +	  If unsure, say N.

I don't remember who said that, but indeed this is a pure internal feature
only. The user doesn't need to select that option ever.

> +
>  config RCU_STALL_COMMON
>  	def_bool ( TREE_RCU || TREE_PREEMPT_RCU || RCU_TRACE )
>  	help
> diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> index d9efcc13008c..717f00854fc0 100644
> --- a/kernel/rcu/tiny.c
> +++ b/kernel/rcu/tiny.c
> @@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
>  		rcu_sched_qs(cpu);
>  	else if (!in_softirq())
>  		rcu_bh_qs(cpu);
> +	if (user)
> +		rcu_note_voluntary_context_switch(current);
>  }
>  
>  /*
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 625d0b0cd75a..f958c52f644d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2413,6 +2413,8 @@ void rcu_check_callbacks(int cpu, int user)
>  	rcu_preempt_check_callbacks(cpu);
>  	if (rcu_pending(cpu))
>  		invoke_rcu_core();
> +	if (user)
> +		rcu_note_voluntary_context_switch(current);
>  	trace_rcu_utilization(TPS("End scheduler-tick"));
>  }
>  
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index bc7883570530..50453589e3ca 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -47,6 +47,7 @@
>  #include <linux/hardirq.h>
>  #include <linux/delay.h>
>  #include <linux/module.h>
> +#include <linux/kthread.h>
>  
>  #define CREATE_TRACE_POINTS
>  
> @@ -350,3 +351,173 @@ static int __init check_cpu_stall_init(void)
>  early_initcall(check_cpu_stall_init);
>  
>  #endif /* #ifdef CONFIG_RCU_STALL_COMMON */
> +
> +#ifdef CONFIG_TASKS_RCU
> +
> +/*
> + * Simple variant of RCU whose quiescent states are voluntary context switch,
> + * user-space execution, and idle.  As such, grace periods can take one good
> + * long time.  There are no read-side primitives similar to rcu_read_lock()
> + * and rcu_read_unlock() because this implementation is intended to get
> + * the system into a safe state for some of the manipulations involved in
> + * tracing and the like.  Finally, this implementation does not support
> + * high call_rcu_tasks() rates from multiple CPUs.  If this is required,
> + * per-CPU callback lists will be needed.
> + */
> +
> +/* Lists of tasks that we are still waiting for during this grace period. */
> +static LIST_HEAD(rcu_tasks_holdouts);
> +
> +/* Global list of callbacks and associated lock. */
> +static struct rcu_head *rcu_tasks_cbs_head;
> +static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> +static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
> +
> +/* Post an RCU-tasks callback. */
> +void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
> +{
> +	unsigned long flags;
> +
> +	rhp->next = NULL;
> +	rhp->func = func;
> +	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> +	*rcu_tasks_cbs_tail = rhp;
> +	rcu_tasks_cbs_tail = &rhp->next;
> +	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_tasks);
> +
> +/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
> +static int __noreturn rcu_tasks_kthread(void *arg)
> +{
> +	unsigned long flags;
> +	struct task_struct *g, *t;
> +	struct rcu_head *list;
> +	struct rcu_head *next;
> +
> +	/* FIXME: Add housekeeping affinity. */
> +
> +	/*
> +	 * Each pass through the following loop makes one check for
> +	 * newly arrived callbacks, and, if there are some, waits for
> +	 * one RCU-tasks grace period and then invokes the callbacks.
> +	 * This loop is terminated by the system going down.  ;-)
> +	 */
> +	for (;;) {
> +
> +		/* Pick up any new callbacks. */
> +		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> +		smp_mb__after_unlock_lock(); /* Enforce GP memory ordering. */

I have no idea which against what this is exactly ordering. GP is a vast thing.
Especially for tricky barriers like __after_unlock_lock() which suggest very
counter-intuitive ordering, a detailed comment is very welcome :)

> +		list = rcu_tasks_cbs_head;
> +		rcu_tasks_cbs_head = NULL;
> +		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> +		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> +
> +		/* If there were none, wait a bit and start over. */
> +		if (!list) {
> +			schedule_timeout_interruptible(HZ);

So this thread is going to poll every second? I guess something prevents it
to run when system is idle somewhere? But I'm not familiar with the whole patchset
yet. But even without that it looks like a very annoying noise. why not use something
wait/wakeup based?

> +			flush_signals(current);
> +			continue;
> +		}
> +
> +		/*
> +		 * Wait for all pre-existing t->on_rq and t->nvcsw
> +		 * transitions to complete.  Invoking synchronize_sched()
> +		 * suffices because all these transitions occur with
> +		 * interrupts disabled.  Without this synchronize_sched(),
> +		 * a read-side critical section that started before the
> +		 * grace period might be incorrectly seen as having started
> +		 * after the grace period.
> +		 *
> +		 * This synchronize_sched() also dispenses with the
> +		 * need for a memory barrier on the first store to
> +		 * ->rcu_tasks_holdout, as it forces the store to happen
> +		 * after the beginning of the grace period.
> +		 */
> +		synchronize_sched();
> +
> +		/*
> +		 * There were callbacks, so we need to wait for an
> +		 * RCU-tasks grace period.  Start off by scanning
> +		 * the task list for tasks that are not already
> +		 * voluntarily blocked.  Mark these tasks and make
> +		 * a list of them in rcu_tasks_holdouts.
> +		 */
> +		rcu_read_lock();
> +		for_each_process_thread(g, t) {
> +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> +			    !is_idle_task(t)) {
> +				get_task_struct(t);
> +				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
> +				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> +				list_add(&t->rcu_tasks_holdout_list,
> +					 &rcu_tasks_holdouts);
> +			}
> +		}
> +		rcu_read_unlock();
> +
> +		/*
> +		 * Each pass through the following loop scans the list
> +		 * of holdout tasks, removing any that are no longer
> +		 * holdouts.  When the list is empty, we are done.
> +		 */
> +		while (!list_empty(&rcu_tasks_holdouts)) {
> +			schedule_timeout_interruptible(HZ / 10);

OTOH here it is not annoying because it should only happen when rcu task
is used, which should be rare.

Thanks.

> +			flush_signals(current);
> +			rcu_read_lock();
> +			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> +						rcu_tasks_holdout_list) {
> +				if (ACCESS_ONCE(t->rcu_tasks_holdout)) {
> +					if (t->rcu_tasks_nvcsw ==
> +					    ACCESS_ONCE(t->nvcsw) &&
> +					    ACCESS_ONCE(t->on_rq))
> +						continue;
> +					ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> +				}
> +				list_del_rcu(&t->rcu_tasks_holdout_list);
> +				put_task_struct(t);
> +			}
> +			rcu_read_unlock();
> +		}
> +
> +		/*
> +		 * Because ->on_rq and ->nvcsw are not guaranteed
> +		 * to have a full memory barriers prior to them in the
> +		 * schedule() path, memory reordering on other CPUs could
> +		 * cause their RCU-tasks read-side critical sections to
> +		 * extend past the end of the grace period.  However,
> +		 * because these ->nvcsw updates are carried out with
> +		 * interrupts disabled, we can use synchronize_sched()
> +		 * to force the needed ordering on all such CPUs.
> +		 *
> +		 * This synchronize_sched() also confines all
> +		 * ->rcu_tasks_holdout accesses to be within the grace
> +		 * period, avoiding the need for memory barriers for
> +		 * ->rcu_tasks_holdout accesses.
> +		 */
> +		synchronize_sched();
> +
> +		/* Invoke the callbacks. */
> +		while (list) {
> +			next = list->next;
> +			local_bh_disable();
> +			list->func(list);
> +			local_bh_enable();
> +			list = next;
> +			cond_resched();
> +		}
> +	}
> +}
> +
> +/* Spawn rcu_tasks_kthread() at boot time. */
> +static int __init rcu_spawn_tasks_kthread(void)
> +{
> +	struct task_struct __maybe_unused *t;
> +
> +	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
> +	BUG_ON(IS_ERR(t));
> +	return 0;
> +}
> +early_initcall(rcu_spawn_tasks_kthread);
> +
> +#endif /* #ifdef CONFIG_TASKS_RCU */
> -- 
> 1.8.1.5
> 

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (8 preceding siblings ...)
  2014-07-31 23:57   ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Frederic Weisbecker
@ 2014-08-01  1:15   ` Lai Jiangshan
  2014-08-01  1:59     ` Paul E. McKenney
  2014-08-01  1:31   ` Lai Jiangshan
                     ` (5 subsequent siblings)
  15 siblings, 1 reply; 122+ messages in thread
From: Lai Jiangshan @ 2014-08-01  1:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, dvhart, fweisbec,
	oleg, bobby.prani

On 08/01/2014 05:55 AM, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> This commit adds a new RCU-tasks flavor of RCU, which provides
> call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
> context switch (not preemption!), userspace execution, and the idle loop.
> Note that unlike other RCU flavors, these quiescent states occur in tasks,
> not necessarily CPUs.  Includes fixes from Steven Rostedt.
> 
> This RCU flavor is assumed to have very infrequent latency-tolerate
> updaters.  This assumption permits significant simplifications, including
> a single global callback list protected by a single global lock, along
> with a single linked list containing all tasks that have not yet passed
> through a quiescent state.  If experience shows this assumption to be
> incorrect, the required additional complexity will be added.
> 
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  include/linux/init_task.h |   9 +++
>  include/linux/rcupdate.h  |  36 ++++++++++
>  include/linux/sched.h     |  23 ++++---
>  init/Kconfig              |  10 +++
>  kernel/rcu/tiny.c         |   2 +
>  kernel/rcu/tree.c         |   2 +
>  kernel/rcu/update.c       | 171 ++++++++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 242 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 6df7f9fe0d01..78715ea7c30c 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -124,6 +124,14 @@ extern struct group_info init_groups;
>  #else
>  #define INIT_TASK_RCU_PREEMPT(tsk)
>  #endif
> +#ifdef CONFIG_TASKS_RCU
> +#define INIT_TASK_RCU_TASKS(tsk)					\
> +	.rcu_tasks_holdout = false,					\
> +	.rcu_tasks_holdout_list =					\
> +		LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
> +#else
> +#define INIT_TASK_RCU_TASKS(tsk)
> +#endif
>  
>  extern struct cred init_cred;
>  
> @@ -231,6 +239,7 @@ extern struct task_group root_task_group;
>  	INIT_FTRACE_GRAPH						\
>  	INIT_TRACE_RECURSION						\
>  	INIT_TASK_RCU_PREEMPT(tsk)					\
> +	INIT_TASK_RCU_TASKS(tsk)					\
>  	INIT_CPUSET_SEQ(tsk)						\
>  	INIT_RT_MUTEXES(tsk)						\
>  	INIT_VTIME(tsk)							\
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 6a94cc8b1ca0..829efc99df3e 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,
>  
>  void synchronize_sched(void);
>  
> +/**
> + * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all currently executing RCU
> + * read-side critical sections have completed. call_rcu_tasks() assumes
> + * that the read-side critical sections end at a voluntary context
> + * switch (not a preemption!), entry into idle, or transition to usermode
> + * execution.  As such, there are no read-side primitives analogous to
> + * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
> + * to determine that all tasks have passed through a safe state, not so
> + * much for data-strcuture synchronization.
> + *
> + * See the description of call_rcu() for more detailed information on
> + * memory ordering guarantees.
> + */
> +void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
> +
>  #ifdef CONFIG_PREEMPT_RCU
>  
>  void __rcu_read_lock(void);
> @@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
>  		rcu_irq_exit(); \
>  	} while (0)
>  
> +/*
> + * Note a voluntary context switch for RCU-tasks benefit.  This is a
> + * macro rather than an inline function to avoid #include hell.
> + */
> +#ifdef CONFIG_TASKS_RCU
> +#define rcu_note_voluntary_context_switch(t) \
> +	do { \
> +		preempt_disable(); /* Exclude synchronize_sched(); */ \
> +		if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
> +			ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
> +		preempt_enable(); \
> +	} while (0)
> +#else /* #ifdef CONFIG_TASKS_RCU */
> +#define rcu_note_voluntary_context_switch(t)	do { } while (0)
> +#endif /* #else #ifdef CONFIG_TASKS_RCU */
> +
>  #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP)
>  bool __rcu_is_watching(void);
>  #endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 306f4f0c987a..3cf124389ec7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1273,6 +1273,11 @@ struct task_struct {
>  #ifdef CONFIG_RCU_BOOST
>  	struct rt_mutex *rcu_boost_mutex;
>  #endif /* #ifdef CONFIG_RCU_BOOST */
> +#ifdef CONFIG_TASKS_RCU
> +	unsigned long rcu_tasks_nvcsw;
> +	int rcu_tasks_holdout;
> +	struct list_head rcu_tasks_holdout_list;
> +#endif /* #ifdef CONFIG_TASKS_RCU */
>  
>  #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
>  	struct sched_info sched_info;
> @@ -1998,31 +2003,27 @@ extern void task_clear_jobctl_pending(struct task_struct *task,
>  				      unsigned int mask);
>  
>  #ifdef CONFIG_PREEMPT_RCU
> -
>  #define RCU_READ_UNLOCK_BLOCKED (1 << 0) /* blocked while in RCU read-side. */
>  #define RCU_READ_UNLOCK_NEED_QS (1 << 1) /* RCU core needs CPU response. */
> +#endif /* #ifdef CONFIG_PREEMPT_RCU */
>  
>  static inline void rcu_copy_process(struct task_struct *p)
>  {
> +#ifdef CONFIG_PREEMPT_RCU
>  	p->rcu_read_lock_nesting = 0;
>  	p->rcu_read_unlock_special = 0;
> -#ifdef CONFIG_TREE_PREEMPT_RCU
>  	p->rcu_blocked_node = NULL;
> -#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
>  #ifdef CONFIG_RCU_BOOST
>  	p->rcu_boost_mutex = NULL;
>  #endif /* #ifdef CONFIG_RCU_BOOST */
>  	INIT_LIST_HEAD(&p->rcu_node_entry);
> +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> +#ifdef CONFIG_TASKS_RCU
> +	p->rcu_tasks_holdout = false;
> +	INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
> +#endif /* #ifdef CONFIG_TASKS_RCU */
>  }
>  
> -#else
> -
> -static inline void rcu_copy_process(struct task_struct *p)
> -{
> -}
> -
> -#endif
> -
>  static inline void tsk_restore_flags(struct task_struct *task,
>  				unsigned long orig_flags, unsigned long flags)
>  {
> diff --git a/init/Kconfig b/init/Kconfig
> index 9d76b99af1b9..c56cb62a2df1 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -507,6 +507,16 @@ config PREEMPT_RCU
>  	  This option enables preemptible-RCU code that is common between
>  	  the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.
>  
> +config TASKS_RCU
> +	bool "Task_based RCU implementation using voluntary context switch"
> +	default n
> +	help
> +	  This option enables a task-based RCU implementation that uses
> +	  only voluntary context switch (not preemption!), idle, and
> +	  user-mode execution as quiescent states.
> +
> +	  If unsure, say N.
> +
>  config RCU_STALL_COMMON
>  	def_bool ( TREE_RCU || TREE_PREEMPT_RCU || RCU_TRACE )
>  	help
> diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> index d9efcc13008c..717f00854fc0 100644
> --- a/kernel/rcu/tiny.c
> +++ b/kernel/rcu/tiny.c
> @@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
>  		rcu_sched_qs(cpu);
>  	else if (!in_softirq())
>  		rcu_bh_qs(cpu);
> +	if (user)
> +		rcu_note_voluntary_context_switch(current);
>  }
>  
>  /*
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 625d0b0cd75a..f958c52f644d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2413,6 +2413,8 @@ void rcu_check_callbacks(int cpu, int user)
>  	rcu_preempt_check_callbacks(cpu);
>  	if (rcu_pending(cpu))
>  		invoke_rcu_core();
> +	if (user)
> +		rcu_note_voluntary_context_switch(current);
>  	trace_rcu_utilization(TPS("End scheduler-tick"));
>  }
>  
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index bc7883570530..50453589e3ca 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -47,6 +47,7 @@
>  #include <linux/hardirq.h>
>  #include <linux/delay.h>
>  #include <linux/module.h>
> +#include <linux/kthread.h>
>  
>  #define CREATE_TRACE_POINTS
>  
> @@ -350,3 +351,173 @@ static int __init check_cpu_stall_init(void)
>  early_initcall(check_cpu_stall_init);
>  
>  #endif /* #ifdef CONFIG_RCU_STALL_COMMON */
> +
> +#ifdef CONFIG_TASKS_RCU
> +
> +/*
> + * Simple variant of RCU whose quiescent states are voluntary context switch,
> + * user-space execution, and idle.  As such, grace periods can take one good
> + * long time.  There are no read-side primitives similar to rcu_read_lock()
> + * and rcu_read_unlock() because this implementation is intended to get
> + * the system into a safe state for some of the manipulations involved in
> + * tracing and the like.  Finally, this implementation does not support
> + * high call_rcu_tasks() rates from multiple CPUs.  If this is required,
> + * per-CPU callback lists will be needed.
> + */
> +
> +/* Lists of tasks that we are still waiting for during this grace period. */
> +static LIST_HEAD(rcu_tasks_holdouts);
> +
> +/* Global list of callbacks and associated lock. */
> +static struct rcu_head *rcu_tasks_cbs_head;
> +static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> +static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
> +
> +/* Post an RCU-tasks callback. */
> +void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
> +{
> +	unsigned long flags;
> +
> +	rhp->next = NULL;
> +	rhp->func = func;
> +	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> +	*rcu_tasks_cbs_tail = rhp;
> +	rcu_tasks_cbs_tail = &rhp->next;
> +	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_tasks);
> +
> +/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
> +static int __noreturn rcu_tasks_kthread(void *arg)
> +{
> +	unsigned long flags;
> +	struct task_struct *g, *t;
> +	struct rcu_head *list;
> +	struct rcu_head *next;
> +
> +	/* FIXME: Add housekeeping affinity. */
> +
> +	/*
> +	 * Each pass through the following loop makes one check for
> +	 * newly arrived callbacks, and, if there are some, waits for
> +	 * one RCU-tasks grace period and then invokes the callbacks.
> +	 * This loop is terminated by the system going down.  ;-)
> +	 */
> +	for (;;) {
> +
> +		/* Pick up any new callbacks. */
> +		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> +		smp_mb__after_unlock_lock(); /* Enforce GP memory ordering. */
> +		list = rcu_tasks_cbs_head;
> +		rcu_tasks_cbs_head = NULL;
> +		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> +		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> +
> +		/* If there were none, wait a bit and start over. */
> +		if (!list) {
> +			schedule_timeout_interruptible(HZ);
> +			flush_signals(current);
> +			continue;
> +		}
> +
> +		/*
> +		 * Wait for all pre-existing t->on_rq and t->nvcsw
> +		 * transitions to complete.  Invoking synchronize_sched()
> +		 * suffices because all these transitions occur with
> +		 * interrupts disabled.  Without this synchronize_sched(),
> +		 * a read-side critical section that started before the
> +		 * grace period might be incorrectly seen as having started
> +		 * after the grace period.
> +		 *
> +		 * This synchronize_sched() also dispenses with the
> +		 * need for a memory barrier on the first store to
> +		 * ->rcu_tasks_holdout, as it forces the store to happen
> +		 * after the beginning of the grace period.
> +		 */
> +		synchronize_sched();
> +
> +		/*
> +		 * There were callbacks, so we need to wait for an
> +		 * RCU-tasks grace period.  Start off by scanning
> +		 * the task list for tasks that are not already
> +		 * voluntarily blocked.  Mark these tasks and make
> +		 * a list of them in rcu_tasks_holdouts.
> +		 */
> +		rcu_read_lock();
> +		for_each_process_thread(g, t) {
> +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> +			    !is_idle_task(t)) {
> +				get_task_struct(t);
> +				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
> +				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> +				list_add(&t->rcu_tasks_holdout_list,
> +					 &rcu_tasks_holdouts);
> +			}
> +		}
> +		rcu_read_unlock();
> +
> +		/*
> +		 * Each pass through the following loop scans the list
> +		 * of holdout tasks, removing any that are no longer
> +		 * holdouts.  When the list is empty, we are done.
> +		 */
> +		while (!list_empty(&rcu_tasks_holdouts)) {
> +			schedule_timeout_interruptible(HZ / 10);
> +			flush_signals(current);
> +			rcu_read_lock();
> +			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> +						rcu_tasks_holdout_list) {
> +				if (ACCESS_ONCE(t->rcu_tasks_holdout)) {
> +					if (t->rcu_tasks_nvcsw ==
> +					    ACCESS_ONCE(t->nvcsw) &&
> +					    ACCESS_ONCE(t->on_rq))
> +						continue;
> +					ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> +				}
> +				list_del_rcu(&t->rcu_tasks_holdout_list);
> +				put_task_struct(t);
> +			}
> +			rcu_read_unlock();

rcu_read_lock() and list opertions of rcu variant are unneeded.

> +		}
> +
> +		/*
> +		 * Because ->on_rq and ->nvcsw are not guaranteed
> +		 * to have a full memory barriers prior to them in the
> +		 * schedule() path, memory reordering on other CPUs could
> +		 * cause their RCU-tasks read-side critical sections to
> +		 * extend past the end of the grace period.  However,
> +		 * because these ->nvcsw updates are carried out with
> +		 * interrupts disabled, we can use synchronize_sched()
> +		 * to force the needed ordering on all such CPUs.
> +		 *
> +		 * This synchronize_sched() also confines all
> +		 * ->rcu_tasks_holdout accesses to be within the grace
> +		 * period, avoiding the need for memory barriers for
> +		 * ->rcu_tasks_holdout accesses.
> +		 */
> +		synchronize_sched();
> +
> +		/* Invoke the callbacks. */
> +		while (list) {
> +			next = list->next;
> +			local_bh_disable();
> +			list->func(list);
> +			local_bh_enable();
> +			list = next;
> +			cond_resched();
> +		}
> +	}
> +}
> +
> +/* Spawn rcu_tasks_kthread() at boot time. */
> +static int __init rcu_spawn_tasks_kthread(void)
> +{
> +	struct task_struct __maybe_unused *t;
> +
> +	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
> +	BUG_ON(IS_ERR(t));
> +	return 0;
> +}
> +early_initcall(rcu_spawn_tasks_kthread);
> +
> +#endif /* #ifdef CONFIG_TASKS_RCU */


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (9 preceding siblings ...)
  2014-08-01  1:15   ` Lai Jiangshan
@ 2014-08-01  1:31   ` Lai Jiangshan
  2014-08-01  2:11     ` Paul E. McKenney
  2014-08-01 14:11   ` Oleg Nesterov
                     ` (4 subsequent siblings)
  15 siblings, 1 reply; 122+ messages in thread
From: Lai Jiangshan @ 2014-08-01  1:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, dvhart, fweisbec,
	oleg, bobby.prani

On 08/01/2014 05:55 AM, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> This commit adds a new RCU-tasks flavor of RCU, which provides
> call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
> context switch (not preemption!), userspace execution, and the idle loop.
> Note that unlike other RCU flavors, these quiescent states occur in tasks,
> not necessarily CPUs.  Includes fixes from Steven Rostedt.
> 
> This RCU flavor is assumed to have very infrequent latency-tolerate
> updaters.  This assumption permits significant simplifications, including
> a single global callback list protected by a single global lock, along
> with a single linked list containing all tasks that have not yet passed
> through a quiescent state.  If experience shows this assumption to be
> incorrect, the required additional complexity will be added.
> 
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  include/linux/init_task.h |   9 +++
>  include/linux/rcupdate.h  |  36 ++++++++++
>  include/linux/sched.h     |  23 ++++---
>  init/Kconfig              |  10 +++
>  kernel/rcu/tiny.c         |   2 +
>  kernel/rcu/tree.c         |   2 +
>  kernel/rcu/update.c       | 171 ++++++++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 242 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 6df7f9fe0d01..78715ea7c30c 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -124,6 +124,14 @@ extern struct group_info init_groups;
>  #else
>  #define INIT_TASK_RCU_PREEMPT(tsk)
>  #endif
> +#ifdef CONFIG_TASKS_RCU
> +#define INIT_TASK_RCU_TASKS(tsk)					\
> +	.rcu_tasks_holdout = false,					\
> +	.rcu_tasks_holdout_list =					\
> +		LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
> +#else
> +#define INIT_TASK_RCU_TASKS(tsk)
> +#endif
>  
>  extern struct cred init_cred;
>  
> @@ -231,6 +239,7 @@ extern struct task_group root_task_group;
>  	INIT_FTRACE_GRAPH						\
>  	INIT_TRACE_RECURSION						\
>  	INIT_TASK_RCU_PREEMPT(tsk)					\
> +	INIT_TASK_RCU_TASKS(tsk)					\
>  	INIT_CPUSET_SEQ(tsk)						\
>  	INIT_RT_MUTEXES(tsk)						\
>  	INIT_VTIME(tsk)							\
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 6a94cc8b1ca0..829efc99df3e 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,
>  
>  void synchronize_sched(void);
>  
> +/**
> + * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all currently executing RCU
> + * read-side critical sections have completed. call_rcu_tasks() assumes
> + * that the read-side critical sections end at a voluntary context
> + * switch (not a preemption!), entry into idle, or transition to usermode
> + * execution.  As such, there are no read-side primitives analogous to
> + * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
> + * to determine that all tasks have passed through a safe state, not so
> + * much for data-strcuture synchronization.
> + *
> + * See the description of call_rcu() for more detailed information on
> + * memory ordering guarantees.
> + */
> +void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
> +
>  #ifdef CONFIG_PREEMPT_RCU
>  
>  void __rcu_read_lock(void);
> @@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
>  		rcu_irq_exit(); \
>  	} while (0)
>  
> +/*
> + * Note a voluntary context switch for RCU-tasks benefit.  This is a
> + * macro rather than an inline function to avoid #include hell.
> + */
> +#ifdef CONFIG_TASKS_RCU
> +#define rcu_note_voluntary_context_switch(t) \
> +	do { \
> +		preempt_disable(); /* Exclude synchronize_sched(); */ \
> +		if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
> +			ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
> +		preempt_enable(); \

Why the preempt_disable() is needed here? The comments in rcu_tasks_kthread()
can't persuade me.  Maybe it could be removed?

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-01  1:15   ` Lai Jiangshan
@ 2014-08-01  1:59     ` Paul E. McKenney
  0 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-01  1:59 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, dvhart, fweisbec,
	oleg, bobby.prani

On Fri, Aug 01, 2014 at 09:15:34AM +0800, Lai Jiangshan wrote:
> On 08/01/2014 05:55 AM, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > 
> > This commit adds a new RCU-tasks flavor of RCU, which provides
> > call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
> > context switch (not preemption!), userspace execution, and the idle loop.
> > Note that unlike other RCU flavors, these quiescent states occur in tasks,
> > not necessarily CPUs.  Includes fixes from Steven Rostedt.
> > 
> > This RCU flavor is assumed to have very infrequent latency-tolerate
> > updaters.  This assumption permits significant simplifications, including
> > a single global callback list protected by a single global lock, along
> > with a single linked list containing all tasks that have not yet passed
> > through a quiescent state.  If experience shows this assumption to be
> > incorrect, the required additional complexity will be added.
> > 
> > Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> >  include/linux/init_task.h |   9 +++
> >  include/linux/rcupdate.h  |  36 ++++++++++
> >  include/linux/sched.h     |  23 ++++---
> >  init/Kconfig              |  10 +++
> >  kernel/rcu/tiny.c         |   2 +
> >  kernel/rcu/tree.c         |   2 +
> >  kernel/rcu/update.c       | 171 ++++++++++++++++++++++++++++++++++++++++++++++
> >  7 files changed, 242 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> > index 6df7f9fe0d01..78715ea7c30c 100644
> > --- a/include/linux/init_task.h
> > +++ b/include/linux/init_task.h
> > @@ -124,6 +124,14 @@ extern struct group_info init_groups;
> >  #else
> >  #define INIT_TASK_RCU_PREEMPT(tsk)
> >  #endif
> > +#ifdef CONFIG_TASKS_RCU
> > +#define INIT_TASK_RCU_TASKS(tsk)					\
> > +	.rcu_tasks_holdout = false,					\
> > +	.rcu_tasks_holdout_list =					\
> > +		LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
> > +#else
> > +#define INIT_TASK_RCU_TASKS(tsk)
> > +#endif
> >  
> >  extern struct cred init_cred;
> >  
> > @@ -231,6 +239,7 @@ extern struct task_group root_task_group;
> >  	INIT_FTRACE_GRAPH						\
> >  	INIT_TRACE_RECURSION						\
> >  	INIT_TASK_RCU_PREEMPT(tsk)					\
> > +	INIT_TASK_RCU_TASKS(tsk)					\
> >  	INIT_CPUSET_SEQ(tsk)						\
> >  	INIT_RT_MUTEXES(tsk)						\
> >  	INIT_VTIME(tsk)							\
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 6a94cc8b1ca0..829efc99df3e 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,
> >  
> >  void synchronize_sched(void);
> >  
> > +/**
> > + * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
> > + * @head: structure to be used for queueing the RCU updates.
> > + * @func: actual callback function to be invoked after the grace period
> > + *
> > + * The callback function will be invoked some time after a full grace
> > + * period elapses, in other words after all currently executing RCU
> > + * read-side critical sections have completed. call_rcu_tasks() assumes
> > + * that the read-side critical sections end at a voluntary context
> > + * switch (not a preemption!), entry into idle, or transition to usermode
> > + * execution.  As such, there are no read-side primitives analogous to
> > + * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
> > + * to determine that all tasks have passed through a safe state, not so
> > + * much for data-strcuture synchronization.
> > + *
> > + * See the description of call_rcu() for more detailed information on
> > + * memory ordering guarantees.
> > + */
> > +void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
> > +
> >  #ifdef CONFIG_PREEMPT_RCU
> >  
> >  void __rcu_read_lock(void);
> > @@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
> >  		rcu_irq_exit(); \
> >  	} while (0)
> >  
> > +/*
> > + * Note a voluntary context switch for RCU-tasks benefit.  This is a
> > + * macro rather than an inline function to avoid #include hell.
> > + */
> > +#ifdef CONFIG_TASKS_RCU
> > +#define rcu_note_voluntary_context_switch(t) \
> > +	do { \
> > +		preempt_disable(); /* Exclude synchronize_sched(); */ \
> > +		if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
> > +			ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
> > +		preempt_enable(); \
> > +	} while (0)
> > +#else /* #ifdef CONFIG_TASKS_RCU */
> > +#define rcu_note_voluntary_context_switch(t)	do { } while (0)
> > +#endif /* #else #ifdef CONFIG_TASKS_RCU */
> > +
> >  #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP)
> >  bool __rcu_is_watching(void);
> >  #endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 306f4f0c987a..3cf124389ec7 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1273,6 +1273,11 @@ struct task_struct {
> >  #ifdef CONFIG_RCU_BOOST
> >  	struct rt_mutex *rcu_boost_mutex;
> >  #endif /* #ifdef CONFIG_RCU_BOOST */
> > +#ifdef CONFIG_TASKS_RCU
> > +	unsigned long rcu_tasks_nvcsw;
> > +	int rcu_tasks_holdout;
> > +	struct list_head rcu_tasks_holdout_list;
> > +#endif /* #ifdef CONFIG_TASKS_RCU */
> >  
> >  #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> >  	struct sched_info sched_info;
> > @@ -1998,31 +2003,27 @@ extern void task_clear_jobctl_pending(struct task_struct *task,
> >  				      unsigned int mask);
> >  
> >  #ifdef CONFIG_PREEMPT_RCU
> > -
> >  #define RCU_READ_UNLOCK_BLOCKED (1 << 0) /* blocked while in RCU read-side. */
> >  #define RCU_READ_UNLOCK_NEED_QS (1 << 1) /* RCU core needs CPU response. */
> > +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> >  
> >  static inline void rcu_copy_process(struct task_struct *p)
> >  {
> > +#ifdef CONFIG_PREEMPT_RCU
> >  	p->rcu_read_lock_nesting = 0;
> >  	p->rcu_read_unlock_special = 0;
> > -#ifdef CONFIG_TREE_PREEMPT_RCU
> >  	p->rcu_blocked_node = NULL;
> > -#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
> >  #ifdef CONFIG_RCU_BOOST
> >  	p->rcu_boost_mutex = NULL;
> >  #endif /* #ifdef CONFIG_RCU_BOOST */
> >  	INIT_LIST_HEAD(&p->rcu_node_entry);
> > +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> > +#ifdef CONFIG_TASKS_RCU
> > +	p->rcu_tasks_holdout = false;
> > +	INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
> > +#endif /* #ifdef CONFIG_TASKS_RCU */
> >  }
> >  
> > -#else
> > -
> > -static inline void rcu_copy_process(struct task_struct *p)
> > -{
> > -}
> > -
> > -#endif
> > -
> >  static inline void tsk_restore_flags(struct task_struct *task,
> >  				unsigned long orig_flags, unsigned long flags)
> >  {
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 9d76b99af1b9..c56cb62a2df1 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -507,6 +507,16 @@ config PREEMPT_RCU
> >  	  This option enables preemptible-RCU code that is common between
> >  	  the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.
> >  
> > +config TASKS_RCU
> > +	bool "Task_based RCU implementation using voluntary context switch"
> > +	default n
> > +	help
> > +	  This option enables a task-based RCU implementation that uses
> > +	  only voluntary context switch (not preemption!), idle, and
> > +	  user-mode execution as quiescent states.
> > +
> > +	  If unsure, say N.
> > +
> >  config RCU_STALL_COMMON
> >  	def_bool ( TREE_RCU || TREE_PREEMPT_RCU || RCU_TRACE )
> >  	help
> > diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> > index d9efcc13008c..717f00854fc0 100644
> > --- a/kernel/rcu/tiny.c
> > +++ b/kernel/rcu/tiny.c
> > @@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
> >  		rcu_sched_qs(cpu);
> >  	else if (!in_softirq())
> >  		rcu_bh_qs(cpu);
> > +	if (user)
> > +		rcu_note_voluntary_context_switch(current);
> >  }
> >  
> >  /*
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 625d0b0cd75a..f958c52f644d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -2413,6 +2413,8 @@ void rcu_check_callbacks(int cpu, int user)
> >  	rcu_preempt_check_callbacks(cpu);
> >  	if (rcu_pending(cpu))
> >  		invoke_rcu_core();
> > +	if (user)
> > +		rcu_note_voluntary_context_switch(current);
> >  	trace_rcu_utilization(TPS("End scheduler-tick"));
> >  }
> >  
> > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > index bc7883570530..50453589e3ca 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -47,6 +47,7 @@
> >  #include <linux/hardirq.h>
> >  #include <linux/delay.h>
> >  #include <linux/module.h>
> > +#include <linux/kthread.h>
> >  
> >  #define CREATE_TRACE_POINTS
> >  
> > @@ -350,3 +351,173 @@ static int __init check_cpu_stall_init(void)
> >  early_initcall(check_cpu_stall_init);
> >  
> >  #endif /* #ifdef CONFIG_RCU_STALL_COMMON */
> > +
> > +#ifdef CONFIG_TASKS_RCU
> > +
> > +/*
> > + * Simple variant of RCU whose quiescent states are voluntary context switch,
> > + * user-space execution, and idle.  As such, grace periods can take one good
> > + * long time.  There are no read-side primitives similar to rcu_read_lock()
> > + * and rcu_read_unlock() because this implementation is intended to get
> > + * the system into a safe state for some of the manipulations involved in
> > + * tracing and the like.  Finally, this implementation does not support
> > + * high call_rcu_tasks() rates from multiple CPUs.  If this is required,
> > + * per-CPU callback lists will be needed.
> > + */
> > +
> > +/* Lists of tasks that we are still waiting for during this grace period. */
> > +static LIST_HEAD(rcu_tasks_holdouts);
> > +
> > +/* Global list of callbacks and associated lock. */
> > +static struct rcu_head *rcu_tasks_cbs_head;
> > +static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> > +static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
> > +
> > +/* Post an RCU-tasks callback. */
> > +void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
> > +{
> > +	unsigned long flags;
> > +
> > +	rhp->next = NULL;
> > +	rhp->func = func;
> > +	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> > +	*rcu_tasks_cbs_tail = rhp;
> > +	rcu_tasks_cbs_tail = &rhp->next;
> > +	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> > +}
> > +EXPORT_SYMBOL_GPL(call_rcu_tasks);
> > +
> > +/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
> > +static int __noreturn rcu_tasks_kthread(void *arg)
> > +{
> > +	unsigned long flags;
> > +	struct task_struct *g, *t;
> > +	struct rcu_head *list;
> > +	struct rcu_head *next;
> > +
> > +	/* FIXME: Add housekeeping affinity. */
> > +
> > +	/*
> > +	 * Each pass through the following loop makes one check for
> > +	 * newly arrived callbacks, and, if there are some, waits for
> > +	 * one RCU-tasks grace period and then invokes the callbacks.
> > +	 * This loop is terminated by the system going down.  ;-)
> > +	 */
> > +	for (;;) {
> > +
> > +		/* Pick up any new callbacks. */
> > +		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> > +		smp_mb__after_unlock_lock(); /* Enforce GP memory ordering. */
> > +		list = rcu_tasks_cbs_head;
> > +		rcu_tasks_cbs_head = NULL;
> > +		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> > +		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> > +
> > +		/* If there were none, wait a bit and start over. */
> > +		if (!list) {
> > +			schedule_timeout_interruptible(HZ);
> > +			flush_signals(current);
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * Wait for all pre-existing t->on_rq and t->nvcsw
> > +		 * transitions to complete.  Invoking synchronize_sched()
> > +		 * suffices because all these transitions occur with
> > +		 * interrupts disabled.  Without this synchronize_sched(),
> > +		 * a read-side critical section that started before the
> > +		 * grace period might be incorrectly seen as having started
> > +		 * after the grace period.
> > +		 *
> > +		 * This synchronize_sched() also dispenses with the
> > +		 * need for a memory barrier on the first store to
> > +		 * ->rcu_tasks_holdout, as it forces the store to happen
> > +		 * after the beginning of the grace period.
> > +		 */
> > +		synchronize_sched();
> > +
> > +		/*
> > +		 * There were callbacks, so we need to wait for an
> > +		 * RCU-tasks grace period.  Start off by scanning
> > +		 * the task list for tasks that are not already
> > +		 * voluntarily blocked.  Mark these tasks and make
> > +		 * a list of them in rcu_tasks_holdouts.
> > +		 */
> > +		rcu_read_lock();
> > +		for_each_process_thread(g, t) {
> > +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> > +			    !is_idle_task(t)) {
> > +				get_task_struct(t);
> > +				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
> > +				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> > +				list_add(&t->rcu_tasks_holdout_list,
> > +					 &rcu_tasks_holdouts);
> > +			}
> > +		}
> > +		rcu_read_unlock();
> > +
> > +		/*
> > +		 * Each pass through the following loop scans the list
> > +		 * of holdout tasks, removing any that are no longer
> > +		 * holdouts.  When the list is empty, we are done.
> > +		 */
> > +		while (!list_empty(&rcu_tasks_holdouts)) {
> > +			schedule_timeout_interruptible(HZ / 10);
> > +			flush_signals(current);
> > +			rcu_read_lock();
> > +			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> > +						rcu_tasks_holdout_list) {
> > +				if (ACCESS_ONCE(t->rcu_tasks_holdout)) {
> > +					if (t->rcu_tasks_nvcsw ==
> > +					    ACCESS_ONCE(t->nvcsw) &&
> > +					    ACCESS_ONCE(t->on_rq))
> > +						continue;
> > +					ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> > +				}
> > +				list_del_rcu(&t->rcu_tasks_holdout_list);
> > +				put_task_struct(t);
> > +			}
> > +			rcu_read_unlock();
> 
> rcu_read_lock() and list opertions of rcu variant are unneeded.

Good point, will change to list_for_each_entry_safe().

							Thanx, Paul

> > +		}
> > +
> > +		/*
> > +		 * Because ->on_rq and ->nvcsw are not guaranteed
> > +		 * to have a full memory barriers prior to them in the
> > +		 * schedule() path, memory reordering on other CPUs could
> > +		 * cause their RCU-tasks read-side critical sections to
> > +		 * extend past the end of the grace period.  However,
> > +		 * because these ->nvcsw updates are carried out with
> > +		 * interrupts disabled, we can use synchronize_sched()
> > +		 * to force the needed ordering on all such CPUs.
> > +		 *
> > +		 * This synchronize_sched() also confines all
> > +		 * ->rcu_tasks_holdout accesses to be within the grace
> > +		 * period, avoiding the need for memory barriers for
> > +		 * ->rcu_tasks_holdout accesses.
> > +		 */
> > +		synchronize_sched();
> > +
> > +		/* Invoke the callbacks. */
> > +		while (list) {
> > +			next = list->next;
> > +			local_bh_disable();
> > +			list->func(list);
> > +			local_bh_enable();
> > +			list = next;
> > +			cond_resched();
> > +		}
> > +	}
> > +}
> > +
> > +/* Spawn rcu_tasks_kthread() at boot time. */
> > +static int __init rcu_spawn_tasks_kthread(void)
> > +{
> > +	struct task_struct __maybe_unused *t;
> > +
> > +	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
> > +	BUG_ON(IS_ERR(t));
> > +	return 0;
> > +}
> > +early_initcall(rcu_spawn_tasks_kthread);
> > +
> > +#endif /* #ifdef CONFIG_TASKS_RCU */
> 


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-31 23:57   ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Frederic Weisbecker
@ 2014-08-01  2:04     ` Paul E. McKenney
  2014-08-01 15:06       ` Frederic Weisbecker
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-01  2:04 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart, oleg,
	bobby.prani

On Fri, Aug 01, 2014 at 01:57:50AM +0200, Frederic Weisbecker wrote:
> On Thu, Jul 31, 2014 at 02:55:01PM -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > 
> > This commit adds a new RCU-tasks flavor of RCU, which provides
> > call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
> > context switch (not preemption!), userspace execution, and the idle loop.
> > Note that unlike other RCU flavors, these quiescent states occur in tasks,
> > not necessarily CPUs.  Includes fixes from Steven Rostedt.
> > 
> > This RCU flavor is assumed to have very infrequent latency-tolerate
> > updaters.  This assumption permits significant simplifications, including
> > a single global callback list protected by a single global lock, along
> > with a single linked list containing all tasks that have not yet passed
> > through a quiescent state.  If experience shows this assumption to be
> > incorrect, the required additional complexity will be added.
> > 
> > Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> >  include/linux/init_task.h |   9 +++
> >  include/linux/rcupdate.h  |  36 ++++++++++
> >  include/linux/sched.h     |  23 ++++---
> >  init/Kconfig              |  10 +++
> >  kernel/rcu/tiny.c         |   2 +
> >  kernel/rcu/tree.c         |   2 +
> >  kernel/rcu/update.c       | 171 ++++++++++++++++++++++++++++++++++++++++++++++
> >  7 files changed, 242 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> > index 6df7f9fe0d01..78715ea7c30c 100644
> > --- a/include/linux/init_task.h
> > +++ b/include/linux/init_task.h
> > @@ -124,6 +124,14 @@ extern struct group_info init_groups;
> >  #else
> >  #define INIT_TASK_RCU_PREEMPT(tsk)
> >  #endif
> > +#ifdef CONFIG_TASKS_RCU
> > +#define INIT_TASK_RCU_TASKS(tsk)					\
> > +	.rcu_tasks_holdout = false,					\
> > +	.rcu_tasks_holdout_list =					\
> > +		LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
> > +#else
> > +#define INIT_TASK_RCU_TASKS(tsk)
> > +#endif
> >  
> >  extern struct cred init_cred;
> >  
> > @@ -231,6 +239,7 @@ extern struct task_group root_task_group;
> >  	INIT_FTRACE_GRAPH						\
> >  	INIT_TRACE_RECURSION						\
> >  	INIT_TASK_RCU_PREEMPT(tsk)					\
> > +	INIT_TASK_RCU_TASKS(tsk)					\
> >  	INIT_CPUSET_SEQ(tsk)						\
> >  	INIT_RT_MUTEXES(tsk)						\
> >  	INIT_VTIME(tsk)							\
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 6a94cc8b1ca0..829efc99df3e 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,
> >  
> >  void synchronize_sched(void);
> >  
> > +/**
> > + * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
> > + * @head: structure to be used for queueing the RCU updates.
> > + * @func: actual callback function to be invoked after the grace period
> > + *
> > + * The callback function will be invoked some time after a full grace
> > + * period elapses, in other words after all currently executing RCU
> > + * read-side critical sections have completed. call_rcu_tasks() assumes
> > + * that the read-side critical sections end at a voluntary context
> > + * switch (not a preemption!), entry into idle, or transition to usermode
> > + * execution.  As such, there are no read-side primitives analogous to
> > + * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
> > + * to determine that all tasks have passed through a safe state, not so
> > + * much for data-strcuture synchronization.
> > + *
> > + * See the description of call_rcu() for more detailed information on
> > + * memory ordering guarantees.
> > + */
> > +void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
> > +
> >  #ifdef CONFIG_PREEMPT_RCU
> >  
> >  void __rcu_read_lock(void);
> > @@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
> >  		rcu_irq_exit(); \
> >  	} while (0)
> >  
> > +/*
> > + * Note a voluntary context switch for RCU-tasks benefit.  This is a
> > + * macro rather than an inline function to avoid #include hell.
> > + */
> > +#ifdef CONFIG_TASKS_RCU
> > +#define rcu_note_voluntary_context_switch(t) \
> > +	do { \
> > +		preempt_disable(); /* Exclude synchronize_sched(); */ \
> > +		if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
> > +			ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
> > +		preempt_enable(); \
> > +	} while (0)
> > +#else /* #ifdef CONFIG_TASKS_RCU */
> > +#define rcu_note_voluntary_context_switch(t)	do { } while (0)
> > +#endif /* #else #ifdef CONFIG_TASKS_RCU */
> > +
> >  #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP)
> >  bool __rcu_is_watching(void);
> >  #endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 306f4f0c987a..3cf124389ec7 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1273,6 +1273,11 @@ struct task_struct {
> >  #ifdef CONFIG_RCU_BOOST
> >  	struct rt_mutex *rcu_boost_mutex;
> >  #endif /* #ifdef CONFIG_RCU_BOOST */
> > +#ifdef CONFIG_TASKS_RCU
> > +	unsigned long rcu_tasks_nvcsw;
> > +	int rcu_tasks_holdout;
> > +	struct list_head rcu_tasks_holdout_list;
> > +#endif /* #ifdef CONFIG_TASKS_RCU */
> >  
> >  #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> >  	struct sched_info sched_info;
> > @@ -1998,31 +2003,27 @@ extern void task_clear_jobctl_pending(struct task_struct *task,
> >  				      unsigned int mask);
> >  
> >  #ifdef CONFIG_PREEMPT_RCU
> > -
> >  #define RCU_READ_UNLOCK_BLOCKED (1 << 0) /* blocked while in RCU read-side. */
> >  #define RCU_READ_UNLOCK_NEED_QS (1 << 1) /* RCU core needs CPU response. */
> > +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> >  
> >  static inline void rcu_copy_process(struct task_struct *p)
> >  {
> > +#ifdef CONFIG_PREEMPT_RCU
> >  	p->rcu_read_lock_nesting = 0;
> >  	p->rcu_read_unlock_special = 0;
> > -#ifdef CONFIG_TREE_PREEMPT_RCU
> >  	p->rcu_blocked_node = NULL;
> > -#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
> >  #ifdef CONFIG_RCU_BOOST
> >  	p->rcu_boost_mutex = NULL;
> >  #endif /* #ifdef CONFIG_RCU_BOOST */
> >  	INIT_LIST_HEAD(&p->rcu_node_entry);
> > +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> > +#ifdef CONFIG_TASKS_RCU
> > +	p->rcu_tasks_holdout = false;
> > +	INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
> > +#endif /* #ifdef CONFIG_TASKS_RCU */
> >  }
> >  
> > -#else
> > -
> > -static inline void rcu_copy_process(struct task_struct *p)
> > -{
> > -}
> > -
> > -#endif
> > -
> >  static inline void tsk_restore_flags(struct task_struct *task,
> >  				unsigned long orig_flags, unsigned long flags)
> >  {
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 9d76b99af1b9..c56cb62a2df1 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -507,6 +507,16 @@ config PREEMPT_RCU
> >  	  This option enables preemptible-RCU code that is common between
> >  	  the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.
> >  
> > +config TASKS_RCU
> > +	bool "Task_based RCU implementation using voluntary context switch"
> > +	default n
> > +	help
> > +	  This option enables a task-based RCU implementation that uses
> > +	  only voluntary context switch (not preemption!), idle, and
> > +	  user-mode execution as quiescent states.
> > +
> > +	  If unsure, say N.
> 
> I don't remember who said that, but indeed this is a pure internal feature
> only. The user doesn't need to select that option ever.

I suspect that you are correct.  This way is convenient for me for testing,
but I expect to make it pure internal before long.

> > +
> >  config RCU_STALL_COMMON
> >  	def_bool ( TREE_RCU || TREE_PREEMPT_RCU || RCU_TRACE )
> >  	help
> > diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> > index d9efcc13008c..717f00854fc0 100644
> > --- a/kernel/rcu/tiny.c
> > +++ b/kernel/rcu/tiny.c
> > @@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
> >  		rcu_sched_qs(cpu);
> >  	else if (!in_softirq())
> >  		rcu_bh_qs(cpu);
> > +	if (user)
> > +		rcu_note_voluntary_context_switch(current);
> >  }
> >  
> >  /*
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 625d0b0cd75a..f958c52f644d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -2413,6 +2413,8 @@ void rcu_check_callbacks(int cpu, int user)
> >  	rcu_preempt_check_callbacks(cpu);
> >  	if (rcu_pending(cpu))
> >  		invoke_rcu_core();
> > +	if (user)
> > +		rcu_note_voluntary_context_switch(current);
> >  	trace_rcu_utilization(TPS("End scheduler-tick"));
> >  }
> >  
> > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > index bc7883570530..50453589e3ca 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -47,6 +47,7 @@
> >  #include <linux/hardirq.h>
> >  #include <linux/delay.h>
> >  #include <linux/module.h>
> > +#include <linux/kthread.h>
> >  
> >  #define CREATE_TRACE_POINTS
> >  
> > @@ -350,3 +351,173 @@ static int __init check_cpu_stall_init(void)
> >  early_initcall(check_cpu_stall_init);
> >  
> >  #endif /* #ifdef CONFIG_RCU_STALL_COMMON */
> > +
> > +#ifdef CONFIG_TASKS_RCU
> > +
> > +/*
> > + * Simple variant of RCU whose quiescent states are voluntary context switch,
> > + * user-space execution, and idle.  As such, grace periods can take one good
> > + * long time.  There are no read-side primitives similar to rcu_read_lock()
> > + * and rcu_read_unlock() because this implementation is intended to get
> > + * the system into a safe state for some of the manipulations involved in
> > + * tracing and the like.  Finally, this implementation does not support
> > + * high call_rcu_tasks() rates from multiple CPUs.  If this is required,
> > + * per-CPU callback lists will be needed.
> > + */
> > +
> > +/* Lists of tasks that we are still waiting for during this grace period. */
> > +static LIST_HEAD(rcu_tasks_holdouts);
> > +
> > +/* Global list of callbacks and associated lock. */
> > +static struct rcu_head *rcu_tasks_cbs_head;
> > +static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> > +static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
> > +
> > +/* Post an RCU-tasks callback. */
> > +void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
> > +{
> > +	unsigned long flags;
> > +
> > +	rhp->next = NULL;
> > +	rhp->func = func;
> > +	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> > +	*rcu_tasks_cbs_tail = rhp;
> > +	rcu_tasks_cbs_tail = &rhp->next;
> > +	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> > +}
> > +EXPORT_SYMBOL_GPL(call_rcu_tasks);
> > +
> > +/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
> > +static int __noreturn rcu_tasks_kthread(void *arg)
> > +{
> > +	unsigned long flags;
> > +	struct task_struct *g, *t;
> > +	struct rcu_head *list;
> > +	struct rcu_head *next;
> > +
> > +	/* FIXME: Add housekeeping affinity. */
> > +
> > +	/*
> > +	 * Each pass through the following loop makes one check for
> > +	 * newly arrived callbacks, and, if there are some, waits for
> > +	 * one RCU-tasks grace period and then invokes the callbacks.
> > +	 * This loop is terminated by the system going down.  ;-)
> > +	 */
> > +	for (;;) {
> > +
> > +		/* Pick up any new callbacks. */
> > +		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> > +		smp_mb__after_unlock_lock(); /* Enforce GP memory ordering. */
> 
> I have no idea which against what this is exactly ordering. GP is a vast thing.
> Especially for tricky barriers like __after_unlock_lock() which suggest very
> counter-intuitive ordering, a detailed comment is very welcome :)

Mostly makes sure that whatever happened before the callback was queued
is seen by everyone to have happened before the grace period started.
Though the synchronize_sched() below may have obsoleted this, will
check.

> > +		list = rcu_tasks_cbs_head;
> > +		rcu_tasks_cbs_head = NULL;
> > +		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> > +		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> > +
> > +		/* If there were none, wait a bit and start over. */
> > +		if (!list) {
> > +			schedule_timeout_interruptible(HZ);
> 
> So this thread is going to poll every second? I guess something prevents it
> to run when system is idle somewhere? But I'm not familiar with the whole patchset
> yet. But even without that it looks like a very annoying noise. why not use something
> wait/wakeup based?

And a later patch does the wait/wakeup thing.  Start stupid, add small
amounts of sophistication incrementally.

> > +			flush_signals(current);
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * Wait for all pre-existing t->on_rq and t->nvcsw
> > +		 * transitions to complete.  Invoking synchronize_sched()
> > +		 * suffices because all these transitions occur with
> > +		 * interrupts disabled.  Without this synchronize_sched(),
> > +		 * a read-side critical section that started before the
> > +		 * grace period might be incorrectly seen as having started
> > +		 * after the grace period.
> > +		 *
> > +		 * This synchronize_sched() also dispenses with the
> > +		 * need for a memory barrier on the first store to
> > +		 * ->rcu_tasks_holdout, as it forces the store to happen
> > +		 * after the beginning of the grace period.
> > +		 */
> > +		synchronize_sched();
> > +
> > +		/*
> > +		 * There were callbacks, so we need to wait for an
> > +		 * RCU-tasks grace period.  Start off by scanning
> > +		 * the task list for tasks that are not already
> > +		 * voluntarily blocked.  Mark these tasks and make
> > +		 * a list of them in rcu_tasks_holdouts.
> > +		 */
> > +		rcu_read_lock();
> > +		for_each_process_thread(g, t) {
> > +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> > +			    !is_idle_task(t)) {
> > +				get_task_struct(t);
> > +				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
> > +				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> > +				list_add(&t->rcu_tasks_holdout_list,
> > +					 &rcu_tasks_holdouts);
> > +			}
> > +		}
> > +		rcu_read_unlock();
> > +
> > +		/*
> > +		 * Each pass through the following loop scans the list
> > +		 * of holdout tasks, removing any that are no longer
> > +		 * holdouts.  When the list is empty, we are done.
> > +		 */
> > +		while (!list_empty(&rcu_tasks_holdouts)) {
> > +			schedule_timeout_interruptible(HZ / 10);
> 
> OTOH here it is not annoying because it should only happen when rcu task
> is used, which should be rare.

Glad you like it!

I will likely also add checks for other things needing the current CPU.

							Thanx, Paul

> Thanks.
> 
> > +			flush_signals(current);
> > +			rcu_read_lock();
> > +			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> > +						rcu_tasks_holdout_list) {
> > +				if (ACCESS_ONCE(t->rcu_tasks_holdout)) {
> > +					if (t->rcu_tasks_nvcsw ==
> > +					    ACCESS_ONCE(t->nvcsw) &&
> > +					    ACCESS_ONCE(t->on_rq))
> > +						continue;
> > +					ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> > +				}
> > +				list_del_rcu(&t->rcu_tasks_holdout_list);
> > +				put_task_struct(t);
> > +			}
> > +			rcu_read_unlock();
> > +		}
> > +
> > +		/*
> > +		 * Because ->on_rq and ->nvcsw are not guaranteed
> > +		 * to have a full memory barriers prior to them in the
> > +		 * schedule() path, memory reordering on other CPUs could
> > +		 * cause their RCU-tasks read-side critical sections to
> > +		 * extend past the end of the grace period.  However,
> > +		 * because these ->nvcsw updates are carried out with
> > +		 * interrupts disabled, we can use synchronize_sched()
> > +		 * to force the needed ordering on all such CPUs.
> > +		 *
> > +		 * This synchronize_sched() also confines all
> > +		 * ->rcu_tasks_holdout accesses to be within the grace
> > +		 * period, avoiding the need for memory barriers for
> > +		 * ->rcu_tasks_holdout accesses.
> > +		 */
> > +		synchronize_sched();
> > +
> > +		/* Invoke the callbacks. */
> > +		while (list) {
> > +			next = list->next;
> > +			local_bh_disable();
> > +			list->func(list);
> > +			local_bh_enable();
> > +			list = next;
> > +			cond_resched();
> > +		}
> > +	}
> > +}
> > +
> > +/* Spawn rcu_tasks_kthread() at boot time. */
> > +static int __init rcu_spawn_tasks_kthread(void)
> > +{
> > +	struct task_struct __maybe_unused *t;
> > +
> > +	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
> > +	BUG_ON(IS_ERR(t));
> > +	return 0;
> > +}
> > +early_initcall(rcu_spawn_tasks_kthread);
> > +
> > +#endif /* #ifdef CONFIG_TASKS_RCU */
> > -- 
> > 1.8.1.5
> > 
> 


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-01  1:31   ` Lai Jiangshan
@ 2014-08-01  2:11     ` Paul E. McKenney
  0 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-01  2:11 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, dvhart, fweisbec,
	oleg, bobby.prani

On Fri, Aug 01, 2014 at 09:31:37AM +0800, Lai Jiangshan wrote:
> On 08/01/2014 05:55 AM, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > 
> > This commit adds a new RCU-tasks flavor of RCU, which provides
> > call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
> > context switch (not preemption!), userspace execution, and the idle loop.
> > Note that unlike other RCU flavors, these quiescent states occur in tasks,
> > not necessarily CPUs.  Includes fixes from Steven Rostedt.
> > 
> > This RCU flavor is assumed to have very infrequent latency-tolerate
> > updaters.  This assumption permits significant simplifications, including
> > a single global callback list protected by a single global lock, along
> > with a single linked list containing all tasks that have not yet passed
> > through a quiescent state.  If experience shows this assumption to be
> > incorrect, the required additional complexity will be added.
> > 
> > Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> >  include/linux/init_task.h |   9 +++
> >  include/linux/rcupdate.h  |  36 ++++++++++
> >  include/linux/sched.h     |  23 ++++---
> >  init/Kconfig              |  10 +++
> >  kernel/rcu/tiny.c         |   2 +
> >  kernel/rcu/tree.c         |   2 +
> >  kernel/rcu/update.c       | 171 ++++++++++++++++++++++++++++++++++++++++++++++
> >  7 files changed, 242 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> > index 6df7f9fe0d01..78715ea7c30c 100644
> > --- a/include/linux/init_task.h
> > +++ b/include/linux/init_task.h
> > @@ -124,6 +124,14 @@ extern struct group_info init_groups;
> >  #else
> >  #define INIT_TASK_RCU_PREEMPT(tsk)
> >  #endif
> > +#ifdef CONFIG_TASKS_RCU
> > +#define INIT_TASK_RCU_TASKS(tsk)					\
> > +	.rcu_tasks_holdout = false,					\
> > +	.rcu_tasks_holdout_list =					\
> > +		LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
> > +#else
> > +#define INIT_TASK_RCU_TASKS(tsk)
> > +#endif
> >  
> >  extern struct cred init_cred;
> >  
> > @@ -231,6 +239,7 @@ extern struct task_group root_task_group;
> >  	INIT_FTRACE_GRAPH						\
> >  	INIT_TRACE_RECURSION						\
> >  	INIT_TASK_RCU_PREEMPT(tsk)					\
> > +	INIT_TASK_RCU_TASKS(tsk)					\
> >  	INIT_CPUSET_SEQ(tsk)						\
> >  	INIT_RT_MUTEXES(tsk)						\
> >  	INIT_VTIME(tsk)							\
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 6a94cc8b1ca0..829efc99df3e 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,
> >  
> >  void synchronize_sched(void);
> >  
> > +/**
> > + * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
> > + * @head: structure to be used for queueing the RCU updates.
> > + * @func: actual callback function to be invoked after the grace period
> > + *
> > + * The callback function will be invoked some time after a full grace
> > + * period elapses, in other words after all currently executing RCU
> > + * read-side critical sections have completed. call_rcu_tasks() assumes
> > + * that the read-side critical sections end at a voluntary context
> > + * switch (not a preemption!), entry into idle, or transition to usermode
> > + * execution.  As such, there are no read-side primitives analogous to
> > + * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
> > + * to determine that all tasks have passed through a safe state, not so
> > + * much for data-strcuture synchronization.
> > + *
> > + * See the description of call_rcu() for more detailed information on
> > + * memory ordering guarantees.
> > + */
> > +void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
> > +
> >  #ifdef CONFIG_PREEMPT_RCU
> >  
> >  void __rcu_read_lock(void);
> > @@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
> >  		rcu_irq_exit(); \
> >  	} while (0)
> >  
> > +/*
> > + * Note a voluntary context switch for RCU-tasks benefit.  This is a
> > + * macro rather than an inline function to avoid #include hell.
> > + */
> > +#ifdef CONFIG_TASKS_RCU
> > +#define rcu_note_voluntary_context_switch(t) \
> > +	do { \
> > +		preempt_disable(); /* Exclude synchronize_sched(); */ \
> > +		if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
> > +			ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
> > +		preempt_enable(); \
> 
> Why the preempt_disable() is needed here? The comments in rcu_tasks_kthread()
> can't persuade me.  Maybe it could be removed?

The synchronize_sched() near the end of the main loop in rcu_tasks_thread()
might well have obsoleted this, will take a closer look.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (10 preceding siblings ...)
  2014-08-01  1:31   ` Lai Jiangshan
@ 2014-08-01 14:11   ` Oleg Nesterov
  2014-08-01 18:28     ` Paul E. McKenney
  2014-08-01 18:57   ` Oleg Nesterov
                     ` (3 subsequent siblings)
  15 siblings, 1 reply; 122+ messages in thread
From: Oleg Nesterov @ 2014-08-01 14:11 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On 07/31, Paul E. McKenney wrote:
>
> +/* Lists of tasks that we are still waiting for during this grace period. */
> +static LIST_HEAD(rcu_tasks_holdouts);

This can be local var in rcu_tasks_kthread()

> +		while (!list_empty(&rcu_tasks_holdouts)) {
> +			schedule_timeout_interruptible(HZ / 10);
> +			flush_signals(current);

Still can't undestand why your paranoia wants flush_signals ;)
This is unneeded and confusing. If you think we can have a bug here,
then we should ot hide it, WARN_ON(signal_pending) would be better.
And if you think signal_pending(current) is possible, why do you
check this only after schedule_interruptible() ?

> +		synchronize_sched();
> +
> +		/* Invoke the callbacks. */
> +		while (list) {
> +			next = list->next;
> +			local_bh_disable();
> +			list->func(list);
> +			local_bh_enable();
> +			list = next;
> +			cond_resched();
> +		}

Not sure this makes any sense, but perhaps we can check for the new
callbacks and start the next gp. IOW, the main loop roughly does

	for (;;) {
		list = rcu_tasks_cbs_head;
		rcu_tasks_cbs_head = NULL;

		if (!list)
			sleep();

		synchronize_sched();

		wait_for_rcu_tasks_holdout();

		synchronize_sched();

		process_callbacks(list);
	}

we can "join" 2 synchronize_sched's and do

	ready_list = NULL;
	for (;;) {
		list = rcu_tasks_cbs_head;
		rcu_tasks_cbs_head = NULL;

		if (!list && !ready_list)
			sleep();

		synchronize_sched();

		if (ready_list) {
			process_callbacks(ready_list);
			ready_list = NULL;
		}

		if (!list)
			continue;

		wait_for_rcu_tasks_holdout();
		ready_list = list;
	}

Oleg.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-01  2:04     ` Paul E. McKenney
@ 2014-08-01 15:06       ` Frederic Weisbecker
  0 siblings, 0 replies; 122+ messages in thread
From: Frederic Weisbecker @ 2014-08-01 15:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart, oleg,
	bobby.prani

On Thu, Jul 31, 2014 at 07:04:16PM -0700, Paul E. McKenney wrote:
> On Fri, Aug 01, 2014 at 01:57:50AM +0200, Frederic Weisbecker wrote:
> > 
> > So this thread is going to poll every second? I guess something prevents it
> > to run when system is idle somewhere? But I'm not familiar with the whole patchset
> > yet. But even without that it looks like a very annoying noise. why not use something
> > wait/wakeup based?
> 
> And a later patch does the wait/wakeup thing.  Start stupid, add small
> amounts of sophistication incrementally.

Aah indeed! :)

> 
> > > +			flush_signals(current);
> > > +			continue;
> > > +		}
> > > +
> > > +		/*
> > > +		 * Wait for all pre-existing t->on_rq and t->nvcsw
> > > +		 * transitions to complete.  Invoking synchronize_sched()
> > > +		 * suffices because all these transitions occur with
> > > +		 * interrupts disabled.  Without this synchronize_sched(),
> > > +		 * a read-side critical section that started before the
> > > +		 * grace period might be incorrectly seen as having started
> > > +		 * after the grace period.
> > > +		 *
> > > +		 * This synchronize_sched() also dispenses with the
> > > +		 * need for a memory barrier on the first store to
> > > +		 * ->rcu_tasks_holdout, as it forces the store to happen
> > > +		 * after the beginning of the grace period.
> > > +		 */
> > > +		synchronize_sched();
> > > +
> > > +		/*
> > > +		 * There were callbacks, so we need to wait for an
> > > +		 * RCU-tasks grace period.  Start off by scanning
> > > +		 * the task list for tasks that are not already
> > > +		 * voluntarily blocked.  Mark these tasks and make
> > > +		 * a list of them in rcu_tasks_holdouts.
> > > +		 */
> > > +		rcu_read_lock();
> > > +		for_each_process_thread(g, t) {
> > > +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> > > +			    !is_idle_task(t)) {
> > > +				get_task_struct(t);
> > > +				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
> > > +				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> > > +				list_add(&t->rcu_tasks_holdout_list,
> > > +					 &rcu_tasks_holdouts);
> > > +			}
> > > +		}
> > > +		rcu_read_unlock();
> > > +
> > > +		/*
> > > +		 * Each pass through the following loop scans the list
> > > +		 * of holdout tasks, removing any that are no longer
> > > +		 * holdouts.  When the list is empty, we are done.
> > > +		 */
> > > +		while (!list_empty(&rcu_tasks_holdouts)) {
> > > +			schedule_timeout_interruptible(HZ / 10);
> > 
> > OTOH here it is not annoying because it should only happen when rcu task
> > is used, which should be rare.
> 
> Glad you like it!
> 
> I will likely also add checks for other things needing the current CPU.

Ok, thanks!

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks Paul E. McKenney
@ 2014-08-01 15:09     ` Oleg Nesterov
  2014-08-01 18:32       ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Oleg Nesterov @ 2014-08-01 15:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On 07/31, Paul E. McKenney wrote:
>
> +void synchronize_rcu_tasks(void)
> +{
> +	/* Complain if the scheduler has not started.  */
> +	rcu_lockdep_assert(!rcu_scheduler_active,
> +			   "synchronize_rcu_tasks called too soon");
> +
> +	/* Wait for the grace period. */
> +	wait_rcu_gp(call_rcu_tasks);
> +}

Btw, what about CONFIG_PREEMPT=n ?

I mean, can't synchronize_rcu_tasks() be synchronize_sched() in this
case?

Oleg.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-01 14:11   ` Oleg Nesterov
@ 2014-08-01 18:28     ` Paul E. McKenney
  2014-08-01 18:40       ` Oleg Nesterov
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-01 18:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On Fri, Aug 01, 2014 at 04:11:44PM +0200, Oleg Nesterov wrote:
> On 07/31, Paul E. McKenney wrote:
> >
> > +/* Lists of tasks that we are still waiting for during this grace period. */
> > +static LIST_HEAD(rcu_tasks_holdouts);

Good point, fixed!

> This can be local var in rcu_tasks_kthread()
> 
> > +		while (!list_empty(&rcu_tasks_holdouts)) {
> > +			schedule_timeout_interruptible(HZ / 10);
> > +			flush_signals(current);
> 
> Still can't undestand why your paranoia wants flush_signals ;)
> This is unneeded and confusing. If you think we can have a bug here,
> then we should ot hide it, WARN_ON(signal_pending) would be better.
> And if you think signal_pending(current) is possible, why do you
> check this only after schedule_interruptible() ?

I can live with WARN_ON(signal_pending(current)).  Fixed!

> > +		synchronize_sched();
> > +
> > +		/* Invoke the callbacks. */
> > +		while (list) {
> > +			next = list->next;
> > +			local_bh_disable();
> > +			list->func(list);
> > +			local_bh_enable();
> > +			list = next;
> > +			cond_resched();
> > +		}
> 
> Not sure this makes any sense, but perhaps we can check for the new
> callbacks and start the next gp. IOW, the main loop roughly does
> 
> 	for (;;) {
> 		list = rcu_tasks_cbs_head;
> 		rcu_tasks_cbs_head = NULL;
> 
> 		if (!list)
> 			sleep();
> 
> 		synchronize_sched();
> 
> 		wait_for_rcu_tasks_holdout();
> 
> 		synchronize_sched();
> 
> 		process_callbacks(list);
> 	}
> 
> we can "join" 2 synchronize_sched's and do
> 
> 	ready_list = NULL;
> 	for (;;) {
> 		list = rcu_tasks_cbs_head;
> 		rcu_tasks_cbs_head = NULL;
> 
> 		if (!list && !ready_list)
> 			sleep();
> 
> 		synchronize_sched();
> 
> 		if (ready_list) {
> 			process_callbacks(ready_list);
> 			ready_list = NULL;
> 		}
> 
> 		if (!list)
> 			continue;
> 
> 		wait_for_rcu_tasks_holdout();
> 		ready_list = list;
> 	}

The lack of barriers for the updates I am checking mean that I really
do need a synchronize_sched() on either side of the grace-period wait.
The grace period needs to guarantee that anything that happened on any
CPU before the start of the grace period happens before anything that
happens on any CPU after the end of the grace period.  If I leave off
either synchronize_sched(), we lose this guarantee.

							Thanx, Paul

> Oleg.
> 


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-01 15:09     ` Oleg Nesterov
@ 2014-08-01 18:32       ` Paul E. McKenney
  2014-08-01 19:44         ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-01 18:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On Fri, Aug 01, 2014 at 05:09:26PM +0200, Oleg Nesterov wrote:
> On 07/31, Paul E. McKenney wrote:
> >
> > +void synchronize_rcu_tasks(void)
> > +{
> > +	/* Complain if the scheduler has not started.  */
> > +	rcu_lockdep_assert(!rcu_scheduler_active,
> > +			   "synchronize_rcu_tasks called too soon");
> > +
> > +	/* Wait for the grace period. */
> > +	wait_rcu_gp(call_rcu_tasks);
> > +}
> 
> Btw, what about CONFIG_PREEMPT=n ?
> 
> I mean, can't synchronize_rcu_tasks() be synchronize_sched() in this
> case?

Excellent point, indeed it can!

And if I do it right, it will make CONFIG_TASKS_RCU=y safe for kernel
tinification.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-01 18:28     ` Paul E. McKenney
@ 2014-08-01 18:40       ` Oleg Nesterov
  2014-08-02 23:00         ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Oleg Nesterov @ 2014-08-01 18:40 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On 08/01, Paul E. McKenney wrote:
>
> On Fri, Aug 01, 2014 at 04:11:44PM +0200, Oleg Nesterov wrote:
> > Not sure this makes any sense, but perhaps we can check for the new
> > callbacks and start the next gp. IOW, the main loop roughly does
> >
> > 	for (;;) {
> > 		list = rcu_tasks_cbs_head;
> > 		rcu_tasks_cbs_head = NULL;
> >
> > 		if (!list)
> > 			sleep();
> >
> > 		synchronize_sched();
> >
> > 		wait_for_rcu_tasks_holdout();
> >
> > 		synchronize_sched();
> >
> > 		process_callbacks(list);
> > 	}
> >
> > we can "join" 2 synchronize_sched's and do
> >
> > 	ready_list = NULL;
> > 	for (;;) {
> > 		list = rcu_tasks_cbs_head;
> > 		rcu_tasks_cbs_head = NULL;
> >
> > 		if (!list && !ready_list)
> > 			sleep();
> >
> > 		synchronize_sched();
> >
> > 		if (ready_list) {
> > 			process_callbacks(ready_list);
> > 			ready_list = NULL;
> > 		}
> >
> > 		if (!list)
> > 			continue;
> >
> > 		wait_for_rcu_tasks_holdout();
> > 		ready_list = list;
> > 	}
>
> The lack of barriers for the updates I am checking mean that I really
> do need a synchronize_sched() on either side of the grace-period wait.

Yes,

> The grace period needs to guarantee that anything that happened on any
> CPU before the start of the grace period happens before anything that
> happens on any CPU after the end of the grace period.  If I leave off
> either synchronize_sched(), we lose this guarantee.

But the 2nd variant still has synchronize_sched() on both sides?

Oleg.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (11 preceding siblings ...)
  2014-08-01 14:11   ` Oleg Nesterov
@ 2014-08-01 18:57   ` Oleg Nesterov
  2014-08-02 22:50     ` Paul E. McKenney
  2014-08-02 14:56   ` Oleg Nesterov
                     ` (2 subsequent siblings)
  15 siblings, 1 reply; 122+ messages in thread
From: Oleg Nesterov @ 2014-08-01 18:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On 07/31, Paul E. McKenney wrote:
>
> +		rcu_read_lock();
> +		for_each_process_thread(g, t) {
> +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> +			    !is_idle_task(t)) {
> +				get_task_struct(t);
> +				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
> +				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> +				list_add(&t->rcu_tasks_holdout_list,
> +					 &rcu_tasks_holdouts);
> +			}
> +		}
> +		rcu_read_unlock();

So let me repeat. for_each_process_thread() above can not (in general) see
the exiting tasks which have already called exit_notify(), because such a
task can be removed from rcu lists at any time.

Now suppose that proc_exit_connector() is probed. Or another function which
can be called after exit_notify(), this doesn't matter.

An exiting task T jumps into trampoline and gets the preemption. It can be
already (auto)reaped and removed from rcu lists.

synchronize_rcu_tasks() can not see this task, so it can return before T gets
a chance to resume.

Oleg.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-01 18:32       ` Paul E. McKenney
@ 2014-08-01 19:44         ` Paul E. McKenney
  2014-08-02 14:47           ` Oleg Nesterov
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-01 19:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On Fri, Aug 01, 2014 at 11:32:51AM -0700, Paul E. McKenney wrote:
> On Fri, Aug 01, 2014 at 05:09:26PM +0200, Oleg Nesterov wrote:
> > On 07/31, Paul E. McKenney wrote:
> > >
> > > +void synchronize_rcu_tasks(void)
> > > +{
> > > +	/* Complain if the scheduler has not started.  */
> > > +	rcu_lockdep_assert(!rcu_scheduler_active,
> > > +			   "synchronize_rcu_tasks called too soon");
> > > +
> > > +	/* Wait for the grace period. */
> > > +	wait_rcu_gp(call_rcu_tasks);
> > > +}
> > 
> > Btw, what about CONFIG_PREEMPT=n ?
> > 
> > I mean, can't synchronize_rcu_tasks() be synchronize_sched() in this
> > case?
> 
> Excellent point, indeed it can!
> 
> And if I do it right, it will make CONFIG_TASKS_RCU=y safe for kernel
> tinification.  ;-)

Unless, that is, we need to wait for trampolines in the idle loop...

Sounds like a question for Steven.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-01 19:44         ` Paul E. McKenney
@ 2014-08-02 14:47           ` Oleg Nesterov
  2014-08-02 22:58             ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Oleg Nesterov @ 2014-08-02 14:47 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On 08/01, Paul E. McKenney wrote:
>
> On Fri, Aug 01, 2014 at 11:32:51AM -0700, Paul E. McKenney wrote:
> > On Fri, Aug 01, 2014 at 05:09:26PM +0200, Oleg Nesterov wrote:
> > > On 07/31, Paul E. McKenney wrote:
> > > >
> > > > +void synchronize_rcu_tasks(void)
> > > > +{
> > > > +	/* Complain if the scheduler has not started.  */
> > > > +	rcu_lockdep_assert(!rcu_scheduler_active,
> > > > +			   "synchronize_rcu_tasks called too soon");
> > > > +
> > > > +	/* Wait for the grace period. */
> > > > +	wait_rcu_gp(call_rcu_tasks);
> > > > +}
> > >
> > > Btw, what about CONFIG_PREEMPT=n ?
> > >
> > > I mean, can't synchronize_rcu_tasks() be synchronize_sched() in this
> > > case?
> >
> > Excellent point, indeed it can!
> >
> > And if I do it right, it will make CONFIG_TASKS_RCU=y safe for kernel
> > tinification.  ;-)
>
> Unless, that is, we need to wait for trampolines in the idle loop...
>
> Sounds like a question for Steven.  ;-)

Sure, but the full blown synchronize_rcu_tasks() can't handle the idle threads
anyway. An idle thread can not be deactivated and for_each_process() can't see
it anyway.

Oleg.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (12 preceding siblings ...)
  2014-08-01 18:57   ` Oleg Nesterov
@ 2014-08-02 14:56   ` Oleg Nesterov
  2014-08-02 22:57     ` Paul E. McKenney
  2014-08-04  1:28   ` Lai Jiangshan
  2014-08-08 19:13   ` Peter Zijlstra
  15 siblings, 1 reply; 122+ messages in thread
From: Oleg Nesterov @ 2014-08-02 14:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On 07/31, Paul E. McKenney wrote:
>
> +		rcu_read_lock();
> +		for_each_process_thread(g, t) {
> +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> +			    !is_idle_task(t)) {

I didn't notice this check before, but it is not needed. for_each_process()
can see the idle threads, there are not on process/thread lists.

But this doesn't really matter, the main problem is that I still think that
for_each_process_thread() can't really work anyway.

Oleg.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-01 18:57   ` Oleg Nesterov
@ 2014-08-02 22:50     ` Paul E. McKenney
  0 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-02 22:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On Fri, Aug 01, 2014 at 08:57:53PM +0200, Oleg Nesterov wrote:
> On 07/31, Paul E. McKenney wrote:
> >
> > +		rcu_read_lock();
> > +		for_each_process_thread(g, t) {
> > +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> > +			    !is_idle_task(t)) {
> > +				get_task_struct(t);
> > +				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
> > +				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> > +				list_add(&t->rcu_tasks_holdout_list,
> > +					 &rcu_tasks_holdouts);
> > +			}
> > +		}
> > +		rcu_read_unlock();
> 
> So let me repeat. for_each_process_thread() above can not (in general) see
> the exiting tasks which have already called exit_notify(), because such a
> task can be removed from rcu lists at any time.
> 
> Now suppose that proc_exit_connector() is probed. Or another function which
> can be called after exit_notify(), this doesn't matter.
> 
> An exiting task T jumps into trampoline and gets the preemption. It can be
> already (auto)reaped and removed from rcu lists.
> 
> synchronize_rcu_tasks() can not see this task, so it can return before T gets
> a chance to resume.

Good catch!!!

So it looks like I will need to hook into do_exit() after all.  Oh well,
can't have everything...

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-02 14:56   ` Oleg Nesterov
@ 2014-08-02 22:57     ` Paul E. McKenney
  2014-08-03 13:33       ` Oleg Nesterov
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-02 22:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On Sat, Aug 02, 2014 at 04:56:16PM +0200, Oleg Nesterov wrote:
> On 07/31, Paul E. McKenney wrote:
> >
> > +		rcu_read_lock();
> > +		for_each_process_thread(g, t) {
> > +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> > +			    !is_idle_task(t)) {
> 
> I didn't notice this check before, but it is not needed. for_each_process()
> can see the idle threads, there are not on process/thread lists.

Good to know.  Any other important tasks I am missing?

I am guessing that I need to do something like this:

	for_each_process(g) {
		/* Do build step. */
		for_each_thread(g, t) {
			if (g == t)
				continue;
			/* Do build step. */
		}
	}

Or is there a better way to handle this?

> But this doesn't really matter, the main problem is that I still think that
> for_each_process_thread() can't really work anyway.

Your point about exiting tasks I get, and I believe I can solve it.
I am hoping that the above sort of construction takes care of the
idle threads.  I might also need to do something to handle changes in
process/thread hierarchy -- but hopefully without having to read-acquire
the task-list lock.

So what else am I missing?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-02 14:47           ` Oleg Nesterov
@ 2014-08-02 22:58             ` Paul E. McKenney
  2014-08-06  0:57               ` Steven Rostedt
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-02 22:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On Sat, Aug 02, 2014 at 04:47:19PM +0200, Oleg Nesterov wrote:
> On 08/01, Paul E. McKenney wrote:
> >
> > On Fri, Aug 01, 2014 at 11:32:51AM -0700, Paul E. McKenney wrote:
> > > On Fri, Aug 01, 2014 at 05:09:26PM +0200, Oleg Nesterov wrote:
> > > > On 07/31, Paul E. McKenney wrote:
> > > > >
> > > > > +void synchronize_rcu_tasks(void)
> > > > > +{
> > > > > +	/* Complain if the scheduler has not started.  */
> > > > > +	rcu_lockdep_assert(!rcu_scheduler_active,
> > > > > +			   "synchronize_rcu_tasks called too soon");
> > > > > +
> > > > > +	/* Wait for the grace period. */
> > > > > +	wait_rcu_gp(call_rcu_tasks);
> > > > > +}
> > > >
> > > > Btw, what about CONFIG_PREEMPT=n ?
> > > >
> > > > I mean, can't synchronize_rcu_tasks() be synchronize_sched() in this
> > > > case?
> > >
> > > Excellent point, indeed it can!
> > >
> > > And if I do it right, it will make CONFIG_TASKS_RCU=y safe for kernel
> > > tinification.  ;-)
> >
> > Unless, that is, we need to wait for trampolines in the idle loop...
> >
> > Sounds like a question for Steven.  ;-)
> 
> Sure, but the full blown synchronize_rcu_tasks() can't handle the idle threads
> anyway. An idle thread can not be deactivated and for_each_process() can't see
> it anyway.

Indeed, if idle threads need to be tracked, their tracking will need to
be at least partially special-cased.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-01 18:40       ` Oleg Nesterov
@ 2014-08-02 23:00         ` Paul E. McKenney
  2014-08-03 12:57           ` Oleg Nesterov
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-02 23:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On Fri, Aug 01, 2014 at 08:40:59PM +0200, Oleg Nesterov wrote:
> On 08/01, Paul E. McKenney wrote:
> >
> > On Fri, Aug 01, 2014 at 04:11:44PM +0200, Oleg Nesterov wrote:
> > > Not sure this makes any sense, but perhaps we can check for the new
> > > callbacks and start the next gp. IOW, the main loop roughly does
> > >
> > > 	for (;;) {
> > > 		list = rcu_tasks_cbs_head;
> > > 		rcu_tasks_cbs_head = NULL;
> > >
> > > 		if (!list)
> > > 			sleep();
> > >
> > > 		synchronize_sched();
> > >
> > > 		wait_for_rcu_tasks_holdout();
> > >
> > > 		synchronize_sched();
> > >
> > > 		process_callbacks(list);
> > > 	}
> > >
> > > we can "join" 2 synchronize_sched's and do
> > >
> > > 	ready_list = NULL;
> > > 	for (;;) {
> > > 		list = rcu_tasks_cbs_head;
> > > 		rcu_tasks_cbs_head = NULL;
> > >
> > > 		if (!list && !ready_list)
> > > 			sleep();
> > >
> > > 		synchronize_sched();
> > >
> > > 		if (ready_list) {
> > > 			process_callbacks(ready_list);
> > > 			ready_list = NULL;
> > > 		}
> > >
> > > 		if (!list)
> > > 			continue;
> > >
> > > 		wait_for_rcu_tasks_holdout();
> > > 		ready_list = list;
> > > 	}
> >
> > The lack of barriers for the updates I am checking mean that I really
> > do need a synchronize_sched() on either side of the grace-period wait.
> 
> Yes,
> 
> > The grace period needs to guarantee that anything that happened on any
> > CPU before the start of the grace period happens before anything that
> > happens on any CPU after the end of the grace period.  If I leave off
> > either synchronize_sched(), we lose this guarantee.
> 
> But the 2nd variant still has synchronize_sched() on both sides?

Your second variant above?  Unless it is in wait_for_rcu_tasks_holdouts(),
I am not seeing it.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-02 23:00         ` Paul E. McKenney
@ 2014-08-03 12:57           ` Oleg Nesterov
  2014-08-03 22:03             ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Oleg Nesterov @ 2014-08-03 12:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On 08/02, Paul E. McKenney wrote:
>
> On Fri, Aug 01, 2014 at 08:40:59PM +0200, Oleg Nesterov wrote:
> > On 08/01, Paul E. McKenney wrote:
> > >
> > > On Fri, Aug 01, 2014 at 04:11:44PM +0200, Oleg Nesterov wrote:
> > > > Not sure this makes any sense, but perhaps we can check for the new
> > > > callbacks and start the next gp. IOW, the main loop roughly does
> > > >
> > > > 	for (;;) {
> > > > 		list = rcu_tasks_cbs_head;
> > > > 		rcu_tasks_cbs_head = NULL;
> > > >
> > > > 		if (!list)
> > > > 			sleep();
> > > >
> > > > 		synchronize_sched();
> > > >
> > > > 		wait_for_rcu_tasks_holdout();
> > > >
> > > > 		synchronize_sched();
> > > >
> > > > 		process_callbacks(list);
> > > > 	}
> > > >
> > > > we can "join" 2 synchronize_sched's and do
> > > >
> > > > 	ready_list = NULL;
> > > > 	for (;;) {
> > > > 		list = rcu_tasks_cbs_head;
> > > > 		rcu_tasks_cbs_head = NULL;
> > > >
> > > > 		if (!list && !ready_list)
> > > > 			sleep();
> > > >
> > > > 		synchronize_sched();
> > > >
> > > > 		if (ready_list) {
> > > > 			process_callbacks(ready_list);
> > > > 			ready_list = NULL;
> > > > 		}
> > > >
> > > > 		if (!list)
> > > > 			continue;
> > > >
> > > > 		wait_for_rcu_tasks_holdout();
> > > > 		ready_list = list;
> > > > 	}
> > >
> > > The lack of barriers for the updates I am checking mean that I really
> > > do need a synchronize_sched() on either side of the grace-period wait.
> >
> > Yes,
> >
> > > The grace period needs to guarantee that anything that happened on any
> > > CPU before the start of the grace period happens before anything that
> > > happens on any CPU after the end of the grace period.  If I leave off
> > > either synchronize_sched(), we lose this guarantee.
> >
> > But the 2nd variant still has synchronize_sched() on both sides?
>
> Your second variant above?  Unless it is in wait_for_rcu_tasks_holdouts(),
> I am not seeing it.

I guess I probably misunderstood you from the very beginning. And now I am
curious what exactly I missed...

The code above doesn't do process_callbacks() after wait_for_rcu_tasks_holdout(),
it does this only after another synchronize_sched(). The only difference is that
we dequeue the next generation of the pending rcu_tasks_cbs_head callbacks.

IOW. Lets look at the current code. Suppose that synchronize_rcu_tasks() is
called when rcu_tasks_kthread() sleeps in wait_for_rcu_tasks_holdout(). In
this case the new wakeme_after_rcu callback will sit in rcu_tasks_cbs_head
until rcu_tasks_kthread() does the 2nd synchronize_sched() + process_callbacks().
Only after that it will be dequeued and rcu_tasks_kthread() will start another gp.

This means that we have 3 synchronize_sched()'s before synchronize_rcu_tasks()
returns.

Do we really need this? With the 2nd variant the new callback will be dequeud
right after wait_for_rcu_tasks_holdout(), and we only have 2 necessary
synchronize_sched()'s around wait_for_rcu_tasks_holdout().

But it seems that I missed something else. Could you please spell?

Oleg.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-02 22:57     ` Paul E. McKenney
@ 2014-08-03 13:33       ` Oleg Nesterov
  2014-08-03 22:05         ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Oleg Nesterov @ 2014-08-03 13:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On 08/02, Paul E. McKenney wrote:
>
> On Sat, Aug 02, 2014 at 04:56:16PM +0200, Oleg Nesterov wrote:
> > On 07/31, Paul E. McKenney wrote:
> > >
> > > +		rcu_read_lock();
> > > +		for_each_process_thread(g, t) {
> > > +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> > > +			    !is_idle_task(t)) {
> >
> > I didn't notice this check before, but it is not needed. for_each_process()
> > can see the idle threads, there are not on process/thread lists.
    ^^^^^^^

argh, I mean't "can't see" of course...

> Good to know.  Any other important tasks I am missing?

Nothing else.

> I am guessing that I need to do something like this:
>
> 	for_each_process(g) {
> 		/* Do build step. */
> 		for_each_thread(g, t) {
> 			if (g == t)
> 				continue;
> 			/* Do build step. */
> 		}
> 	}

Sorry, I don't understand... This is equivalent to

	for_each_process_thread(g, t) {
		/* Do build step. */
	}

> Your point about exiting tasks I get, and I believe I can solve it.
> I am hoping that the above sort of construction takes care of the
> idle threads.

It is simple to find the idle threads, just

	for_each_cpu(cpu) {
		do_something(cpu_rq(cpu)->idle);
	}

but it is not clear to me what else you need to do with the idle threads.
Probably not too much, they mostly run under preempt_disable().

> I might also need to do something to handle changes in
> process/thread hierarchy -- but hopefully without having to read-acquire
> the task-list lock.

It seems that you need another global list, a task should be visible on that
list until exit_rcu().

Oleg.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-03 12:57           ` Oleg Nesterov
@ 2014-08-03 22:03             ` Paul E. McKenney
  2014-08-04 13:29               ` Oleg Nesterov
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-03 22:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On Sun, Aug 03, 2014 at 02:57:58PM +0200, Oleg Nesterov wrote:
> On 08/02, Paul E. McKenney wrote:
> >
> > On Fri, Aug 01, 2014 at 08:40:59PM +0200, Oleg Nesterov wrote:
> > > On 08/01, Paul E. McKenney wrote:
> > > >
> > > > On Fri, Aug 01, 2014 at 04:11:44PM +0200, Oleg Nesterov wrote:
> > > > > Not sure this makes any sense, but perhaps we can check for the new
> > > > > callbacks and start the next gp. IOW, the main loop roughly does
> > > > >
> > > > > 	for (;;) {
> > > > > 		list = rcu_tasks_cbs_head;
> > > > > 		rcu_tasks_cbs_head = NULL;
> > > > >
> > > > > 		if (!list)
> > > > > 			sleep();
> > > > >
> > > > > 		synchronize_sched();
> > > > >
> > > > > 		wait_for_rcu_tasks_holdout();
> > > > >
> > > > > 		synchronize_sched();
> > > > >
> > > > > 		process_callbacks(list);
> > > > > 	}
> > > > >
> > > > > we can "join" 2 synchronize_sched's and do
> > > > >
> > > > > 	ready_list = NULL;
> > > > > 	for (;;) {
> > > > > 		list = rcu_tasks_cbs_head;
> > > > > 		rcu_tasks_cbs_head = NULL;
> > > > >
> > > > > 		if (!list && !ready_list)
> > > > > 			sleep();
> > > > >
> > > > > 		synchronize_sched();
> > > > >
> > > > > 		if (ready_list) {
> > > > > 			process_callbacks(ready_list);
> > > > > 			ready_list = NULL;
> > > > > 		}
> > > > >
> > > > > 		if (!list)
> > > > > 			continue;
> > > > >
> > > > > 		wait_for_rcu_tasks_holdout();
> > > > > 		ready_list = list;
> > > > > 	}
> > > >
> > > > The lack of barriers for the updates I am checking mean that I really
> > > > do need a synchronize_sched() on either side of the grace-period wait.
> > >
> > > Yes,
> > >
> > > > The grace period needs to guarantee that anything that happened on any
> > > > CPU before the start of the grace period happens before anything that
> > > > happens on any CPU after the end of the grace period.  If I leave off
> > > > either synchronize_sched(), we lose this guarantee.
> > >
> > > But the 2nd variant still has synchronize_sched() on both sides?
> >
> > Your second variant above?  Unless it is in wait_for_rcu_tasks_holdouts(),
> > I am not seeing it.
> 
> I guess I probably misunderstood you from the very beginning. And now I am
> curious what exactly I missed...
> 
> The code above doesn't do process_callbacks() after wait_for_rcu_tasks_holdout(),
> it does this only after another synchronize_sched(). The only difference is that
> we dequeue the next generation of the pending rcu_tasks_cbs_head callbacks.
> 
> IOW. Lets look at the current code. Suppose that synchronize_rcu_tasks() is
> called when rcu_tasks_kthread() sleeps in wait_for_rcu_tasks_holdout(). In
> this case the new wakeme_after_rcu callback will sit in rcu_tasks_cbs_head
> until rcu_tasks_kthread() does the 2nd synchronize_sched() + process_callbacks().
> Only after that it will be dequeued and rcu_tasks_kthread() will start another gp.
> 
> This means that we have 3 synchronize_sched()'s before synchronize_rcu_tasks()
> returns.
> 
> Do we really need this? With the 2nd variant the new callback will be dequeud
> right after wait_for_rcu_tasks_holdout(), and we only have 2 necessary
> synchronize_sched()'s around wait_for_rcu_tasks_holdout().
> 
> But it seems that I missed something else. Could you please spell?

You missed nothing.  I missed the fact that you rolled the loop, using
ready_list and list.

If I understand correctly, your goal is to remove a synchronize_sched()
worth of latency from the overall RCU-tasks callback latency.  Or am I
still confused?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-03 13:33       ` Oleg Nesterov
@ 2014-08-03 22:05         ` Paul E. McKenney
  2014-08-04  0:37           ` Lai Jiangshan
  2014-08-04 13:32           ` Oleg Nesterov
  0 siblings, 2 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-03 22:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On Sun, Aug 03, 2014 at 03:33:18PM +0200, Oleg Nesterov wrote:
> On 08/02, Paul E. McKenney wrote:
> >
> > On Sat, Aug 02, 2014 at 04:56:16PM +0200, Oleg Nesterov wrote:
> > > On 07/31, Paul E. McKenney wrote:
> > > >
> > > > +		rcu_read_lock();
> > > > +		for_each_process_thread(g, t) {
> > > > +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> > > > +			    !is_idle_task(t)) {
> > >
> > > I didn't notice this check before, but it is not needed. for_each_process()
> > > can see the idle threads, there are not on process/thread lists.
>     ^^^^^^^
> 
> argh, I mean't "can't see" of course...
> 
> > Good to know.  Any other important tasks I am missing?
> 
> Nothing else.

Whew!  ;-)

> > I am guessing that I need to do something like this:
> >
> > 	for_each_process(g) {
> > 		/* Do build step. */
> > 		for_each_thread(g, t) {
> > 			if (g == t)
> > 				continue;
> > 			/* Do build step. */
> > 		}
> > 	}
> 
> Sorry, I don't understand... This is equivalent to
> 
> 	for_each_process_thread(g, t) {
> 		/* Do build step. */
> 	}

OK, got it.

> > Your point about exiting tasks I get, and I believe I can solve it.
> > I am hoping that the above sort of construction takes care of the
> > idle threads.
> 
> It is simple to find the idle threads, just
> 
> 	for_each_cpu(cpu) {
> 		do_something(cpu_rq(cpu)->idle);
> 	}
> 
> but it is not clear to me what else you need to do with the idle threads.
> Probably not too much, they mostly run under preempt_disable().

OK, looks easy enough.  And yes, one good question is what, if anything,
we need to do with the idle threads.

> > I might also need to do something to handle changes in
> > process/thread hierarchy -- but hopefully without having to read-acquire
> > the task-list lock.
> 
> It seems that you need another global list, a task should be visible on that
> list until exit_rcu().

As in create another global list that all tasks are added to when created
and then remved from when they exit?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-03 22:05         ` Paul E. McKenney
@ 2014-08-04  0:37           ` Lai Jiangshan
  2014-08-04  1:09             ` Paul E. McKenney
  2014-08-04 13:32           ` Oleg Nesterov
  1 sibling, 1 reply; 122+ messages in thread
From: Lai Jiangshan @ 2014-08-04  0:37 UTC (permalink / raw)
  To: paulmck
  Cc: Oleg Nesterov, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

On 08/04/2014 06:05 AM, Paul E. McKenney wrote:
> On Sun, Aug 03, 2014 at 03:33:18PM +0200, Oleg Nesterov wrote:
>> On 08/02, Paul E. McKenney wrote:
>>>
>>> On Sat, Aug 02, 2014 at 04:56:16PM +0200, Oleg Nesterov wrote:
>>>> On 07/31, Paul E. McKenney wrote:
>>>>>
>>>>> +		rcu_read_lock();
>>>>> +		for_each_process_thread(g, t) {
>>>>> +			if (t != current && ACCESS_ONCE(t->on_rq) &&
>>>>> +			    !is_idle_task(t)) {
>>>>
>>>> I didn't notice this check before, but it is not needed. for_each_process()
>>>> can see the idle threads, there are not on process/thread lists.
>>     ^^^^^^^
>>
>> argh, I mean't "can't see" of course...
>>
>>> Good to know.  Any other important tasks I am missing?
>>
>> Nothing else.
> 
> Whew!  ;-)
> 
>>> I am guessing that I need to do something like this:
>>>
>>> 	for_each_process(g) {
>>> 		/* Do build step. */
>>> 		for_each_thread(g, t) {
>>> 			if (g == t)
>>> 				continue;
>>> 			/* Do build step. */
>>> 		}
>>> 	}
>>
>> Sorry, I don't understand... This is equivalent to
>>
>> 	for_each_process_thread(g, t) {
>> 		/* Do build step. */
>> 	}
> 
> OK, got it.
> 
>>> Your point about exiting tasks I get, and I believe I can solve it.
>>> I am hoping that the above sort of construction takes care of the
>>> idle threads.
>>
>> It is simple to find the idle threads, just
>>
>> 	for_each_cpu(cpu) {
>> 		do_something(cpu_rq(cpu)->idle);
>> 	}
>>
>> but it is not clear to me what else you need to do with the idle threads.
>> Probably not too much, they mostly run under preempt_disable().
> 
> OK, looks easy enough.  And yes, one good question is what, if anything,
> we need to do with the idle threads.
> 
>>> I might also need to do something to handle changes in
>>> process/thread hierarchy -- but hopefully without having to read-acquire
>>> the task-list lock.
>>
>> It seems that you need another global list, a task should be visible on that
>> list until exit_rcu().
> 
> As in create another global list that all tasks are added to when created
> and then remved from when they exit?

An alternative solution:
srcu_read_lock() before exit_notify(), srcu_read_unlock() after the last preempt_disable()
in the do_exit, and synchronize_srcu() in rcu_tasks_kthread().





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-04  0:37           ` Lai Jiangshan
@ 2014-08-04  1:09             ` Paul E. McKenney
  2014-08-04 13:25               ` Oleg Nesterov
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-04  1:09 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Oleg Nesterov, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

On Mon, Aug 04, 2014 at 08:37:37AM +0800, Lai Jiangshan wrote:
> On 08/04/2014 06:05 AM, Paul E. McKenney wrote:
> > On Sun, Aug 03, 2014 at 03:33:18PM +0200, Oleg Nesterov wrote:
> >> On 08/02, Paul E. McKenney wrote:
> >>>
> >>> On Sat, Aug 02, 2014 at 04:56:16PM +0200, Oleg Nesterov wrote:
> >>>> On 07/31, Paul E. McKenney wrote:
> >>>>>
> >>>>> +		rcu_read_lock();
> >>>>> +		for_each_process_thread(g, t) {
> >>>>> +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> >>>>> +			    !is_idle_task(t)) {
> >>>>
> >>>> I didn't notice this check before, but it is not needed. for_each_process()
> >>>> can see the idle threads, there are not on process/thread lists.
> >>     ^^^^^^^
> >>
> >> argh, I mean't "can't see" of course...
> >>
> >>> Good to know.  Any other important tasks I am missing?
> >>
> >> Nothing else.
> > 
> > Whew!  ;-)
> > 
> >>> I am guessing that I need to do something like this:
> >>>
> >>> 	for_each_process(g) {
> >>> 		/* Do build step. */
> >>> 		for_each_thread(g, t) {
> >>> 			if (g == t)
> >>> 				continue;
> >>> 			/* Do build step. */
> >>> 		}
> >>> 	}
> >>
> >> Sorry, I don't understand... This is equivalent to
> >>
> >> 	for_each_process_thread(g, t) {
> >> 		/* Do build step. */
> >> 	}
> > 
> > OK, got it.
> > 
> >>> Your point about exiting tasks I get, and I believe I can solve it.
> >>> I am hoping that the above sort of construction takes care of the
> >>> idle threads.
> >>
> >> It is simple to find the idle threads, just
> >>
> >> 	for_each_cpu(cpu) {
> >> 		do_something(cpu_rq(cpu)->idle);
> >> 	}
> >>
> >> but it is not clear to me what else you need to do with the idle threads.
> >> Probably not too much, they mostly run under preempt_disable().
> > 
> > OK, looks easy enough.  And yes, one good question is what, if anything,
> > we need to do with the idle threads.
> > 
> >>> I might also need to do something to handle changes in
> >>> process/thread hierarchy -- but hopefully without having to read-acquire
> >>> the task-list lock.
> >>
> >> It seems that you need another global list, a task should be visible on that
> >> list until exit_rcu().
> > 
> > As in create another global list that all tasks are added to when created
> > and then remved from when they exit?
> 
> An alternative solution:
> srcu_read_lock() before exit_notify(), srcu_read_unlock() after the last preempt_disable()
> in the do_exit, and synchronize_srcu() in rcu_tasks_kthread().

That is a good way to synchronize with the exiting tasks, and I will
probably that that approach.

I -thought- that Oleg was concerned about safely building the list to
start with, though.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (13 preceding siblings ...)
  2014-08-02 14:56   ` Oleg Nesterov
@ 2014-08-04  1:28   ` Lai Jiangshan
  2014-08-04  7:46     ` Peter Zijlstra
  2014-08-08 19:13   ` Peter Zijlstra
  15 siblings, 1 reply; 122+ messages in thread
From: Lai Jiangshan @ 2014-08-04  1:28 UTC (permalink / raw)
  To: Paul E. McKenney, peterz
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On 08/01/2014 05:55 AM, Paul E. McKenney wrote:
> +		rcu_read_lock();
> +		for_each_process_thread(g, t) {
> +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> +			    !is_idle_task(t)) {
> +				get_task_struct(t);
> +				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
> +				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> +				list_add(&t->rcu_tasks_holdout_list,
> +					 &rcu_tasks_holdouts);

This loop will collect all the runnable tasks.  It is too much tasks.
Is it possible to collect only on_cpu tasks or PREEMPT_ACTIVE tasks?
It seems hard to achieve it.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-04  1:28   ` Lai Jiangshan
@ 2014-08-04  7:46     ` Peter Zijlstra
  2014-08-04  8:18       ` Lai Jiangshan
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-04  7:46 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

[-- Attachment #1: Type: text/plain, Size: 772 bytes --]

On Mon, Aug 04, 2014 at 09:28:45AM +0800, Lai Jiangshan wrote:
> On 08/01/2014 05:55 AM, Paul E. McKenney wrote:
> > +		rcu_read_lock();
> > +		for_each_process_thread(g, t) {
> > +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> > +			    !is_idle_task(t)) {
> > +				get_task_struct(t);
> > +				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
> > +				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> > +				list_add(&t->rcu_tasks_holdout_list,
> > +					 &rcu_tasks_holdouts);
> 
> This loop will collect all the runnable tasks.  It is too much tasks.
> Is it possible to collect only on_cpu tasks or PREEMPT_ACTIVE tasks?
> It seems hard to achieve it.

Without taking the rq->lock you cannot do that race-free. And we're not
going to be taking rq->lock here.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-04  7:46     ` Peter Zijlstra
@ 2014-08-04  8:18       ` Lai Jiangshan
  2014-08-04 11:50         ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Lai Jiangshan @ 2014-08-04  8:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On 08/04/2014 03:46 PM, Peter Zijlstra wrote:
> On Mon, Aug 04, 2014 at 09:28:45AM +0800, Lai Jiangshan wrote:
>> On 08/01/2014 05:55 AM, Paul E. McKenney wrote:
>>> +		rcu_read_lock();
>>> +		for_each_process_thread(g, t) {
>>> +			if (t != current && ACCESS_ONCE(t->on_rq) &&
>>> +			    !is_idle_task(t)) {
>>> +				get_task_struct(t);
>>> +				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
>>> +				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
>>> +				list_add(&t->rcu_tasks_holdout_list,
>>> +					 &rcu_tasks_holdouts);
>>
>> This loop will collect all the runnable tasks.  It is too much tasks.
>> Is it possible to collect only on_cpu tasks or PREEMPT_ACTIVE tasks?
>> It seems hard to achieve it.
> 
> Without taking the rq->lock you cannot do that race-free. And we're not
> going to be taking rq->lock here.

It is because we can't fetch task->on_cpu and preempt_count atomically
so that rq->lock is required.

3 bleeding solutions:

1) Allocate one bit in preempt_count to stand for not_on_cpu ( = !task->on_cpu)
2) allocate one bit in nvcsw to stand for on_scheduled (or not_on_scheduled, see next)
3) introduce task->on_scheduled whose semantics is between on_cpu and on_rq,
   on_scheduled = scheduled on cpu or preempted, (not voluntary scheduled out)

But the scheduler doesn't need neither of such things.  So these is still no hope.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-04  8:18       ` Lai Jiangshan
@ 2014-08-04 11:50         ` Paul E. McKenney
  2014-08-04 12:25           ` Peter Zijlstra
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-04 11:50 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Peter Zijlstra, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On Mon, Aug 04, 2014 at 04:18:53PM +0800, Lai Jiangshan wrote:
> On 08/04/2014 03:46 PM, Peter Zijlstra wrote:
> > On Mon, Aug 04, 2014 at 09:28:45AM +0800, Lai Jiangshan wrote:
> >> On 08/01/2014 05:55 AM, Paul E. McKenney wrote:
> >>> +		rcu_read_lock();
> >>> +		for_each_process_thread(g, t) {
> >>> +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> >>> +			    !is_idle_task(t)) {
> >>> +				get_task_struct(t);
> >>> +				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
> >>> +				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> >>> +				list_add(&t->rcu_tasks_holdout_list,
> >>> +					 &rcu_tasks_holdouts);
> >>
> >> This loop will collect all the runnable tasks.  It is too much tasks.
> >> Is it possible to collect only on_cpu tasks or PREEMPT_ACTIVE tasks?
> >> It seems hard to achieve it.
> > 
> > Without taking the rq->lock you cannot do that race-free. And we're not
> > going to be taking rq->lock here.
> 
> It is because we can't fetch task->on_cpu and preempt_count atomically
> so that rq->lock is required.
> 
> 3 bleeding solutions:
> 
> 1) Allocate one bit in preempt_count to stand for not_on_cpu ( = !task->on_cpu)
> 2) allocate one bit in nvcsw to stand for on_scheduled (or not_on_scheduled, see next)
> 3) introduce task->on_scheduled whose semantics is between on_cpu and on_rq,
>    on_scheduled = scheduled on cpu or preempted, (not voluntary scheduled out)
> 
> But the scheduler doesn't need neither of such things.  So these is still no hope.

OK, I will bite...

What kinds of tasks are on a runqueue, but neither ->on_cpu nor
PREEMPT_ACTIVE?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-04 11:50         ` Paul E. McKenney
@ 2014-08-04 12:25           ` Peter Zijlstra
  2014-08-04 12:37             ` Paul E. McKenney
  2014-08-04 14:56             ` Peter Zijlstra
  0 siblings, 2 replies; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-04 12:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

[-- Attachment #1: Type: text/plain, Size: 448 bytes --]

On Mon, Aug 04, 2014 at 04:50:44AM -0700, Paul E. McKenney wrote:
> OK, I will bite...
> 
> What kinds of tasks are on a runqueue, but neither ->on_cpu nor
> PREEMPT_ACTIVE?

Userspace tasks, they don't necessarily get PREEMPT_ACTIVE when
preempted. Now obviously you're not _that_ interested in userspace tasks
for this, so that might be ok.

But the main point was, you cannot use ->on_cpu or PREEMPT_ACTIVE
without holding rq->lock.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-04 12:25           ` Peter Zijlstra
@ 2014-08-04 12:37             ` Paul E. McKenney
  2014-08-04 14:56             ` Peter Zijlstra
  1 sibling, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-04 12:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On Mon, Aug 04, 2014 at 02:25:15PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 04, 2014 at 04:50:44AM -0700, Paul E. McKenney wrote:
> > OK, I will bite...
> > 
> > What kinds of tasks are on a runqueue, but neither ->on_cpu nor
> > PREEMPT_ACTIVE?
> 
> Userspace tasks, they don't necessarily get PREEMPT_ACTIVE when
> preempted. Now obviously you're not _that_ interested in userspace tasks
> for this, so that might be ok.

Right, I detect userspace tasks using RCU's scheduling-clock handler.
(And code to handle the NO_HZ_FULL case is under construction.)

> But the main point was, you cannot use ->on_cpu or PREEMPT_ACTIVE
> without holding rq->lock.

OK, got it.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-04  1:09             ` Paul E. McKenney
@ 2014-08-04 13:25               ` Oleg Nesterov
  2014-08-04 13:51                 ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Oleg Nesterov @ 2014-08-04 13:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

On 08/03, Paul E. McKenney wrote:
>
> On Mon, Aug 04, 2014 at 08:37:37AM +0800, Lai Jiangshan wrote:
> > An alternative solution:
> > srcu_read_lock() before exit_notify(), srcu_read_unlock() after the last preempt_disable()
> > in the do_exit, and synchronize_srcu() in rcu_tasks_kthread().
>
> That is a good way to synchronize with the exiting tasks, and I will
> probably that that approach.
>
> I -thought- that Oleg was concerned about safely building the list to
> start with, though.

But for_each_process_thread() under rcu_read_lock() should work except
it can miss the exiting tasks.

So it seems that the clever Lai's trick should solve the problem.

Oleg.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-03 22:03             ` Paul E. McKenney
@ 2014-08-04 13:29               ` Oleg Nesterov
  2014-08-04 13:48                 ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Oleg Nesterov @ 2014-08-04 13:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On 08/03, Paul E. McKenney wrote:
>
> If I understand correctly, your goal is to remove a synchronize_sched()
> worth of latency from the overall RCU-tasks callback latency.  Or am I
> still confused?

Yes, exactly. But again, I am not sure this minor optimization makes sense,
mostly I tried to check my understanding.

Oleg.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-03 22:05         ` Paul E. McKenney
  2014-08-04  0:37           ` Lai Jiangshan
@ 2014-08-04 13:32           ` Oleg Nesterov
  2014-08-04 19:28             ` Paul E. McKenney
  1 sibling, 1 reply; 122+ messages in thread
From: Oleg Nesterov @ 2014-08-04 13:32 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On 08/03, Paul E. McKenney wrote:
>
> On Sun, Aug 03, 2014 at 03:33:18PM +0200, Oleg Nesterov wrote:
> > It seems that you need another global list, a task should be visible on that
> > list until exit_rcu().
>
> As in create another global list that all tasks are added to when created
> and then remved from when they exit?

This is what I meant, although I hoped there is a better solution. The obvious
problem is locking.

And at least Lai's idea looks certainly better to me.

Oleg.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-04 13:29               ` Oleg Nesterov
@ 2014-08-04 13:48                 ` Paul E. McKenney
  0 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-04 13:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On Mon, Aug 04, 2014 at 03:29:27PM +0200, Oleg Nesterov wrote:
> On 08/03, Paul E. McKenney wrote:
> >
> > If I understand correctly, your goal is to remove a synchronize_sched()
> > worth of latency from the overall RCU-tasks callback latency.  Or am I
> > still confused?
> 
> Yes, exactly. But again, I am not sure this minor optimization makes sense,
> mostly I tried to check my understanding.

No problem with your understanding!  ;-)

I am not going to include this optimization in the first round, but
I will certainly keep it in mind should grace-period latency from
nearly concurrent updates become a problem.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-04 13:25               ` Oleg Nesterov
@ 2014-08-04 13:51                 ` Paul E. McKenney
  2014-08-04 13:52                   ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-04 13:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

On Mon, Aug 04, 2014 at 03:25:25PM +0200, Oleg Nesterov wrote:
> On 08/03, Paul E. McKenney wrote:
> >
> > On Mon, Aug 04, 2014 at 08:37:37AM +0800, Lai Jiangshan wrote:
> > > An alternative solution:
> > > srcu_read_lock() before exit_notify(), srcu_read_unlock() after the last preempt_disable()
> > > in the do_exit, and synchronize_srcu() in rcu_tasks_kthread().
> >
> > That is a good way to synchronize with the exiting tasks, and I will
> > probably that that approach.
> >
> > I -thought- that Oleg was concerned about safely building the list to
> > start with, though.
> 
> But for_each_process_thread() under rcu_read_lock() should work except
> it can miss the exiting tasks.
> 
> So it seems that the clever Lai's trick should solve the problem.

Cool!  Lai's trick seems to be doing well in early testing, so keeping
fingers firmly crossed.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-04 13:51                 ` Paul E. McKenney
@ 2014-08-04 13:52                   ` Paul E. McKenney
  0 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-04 13:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

On Mon, Aug 04, 2014 at 06:51:04AM -0700, Paul E. McKenney wrote:
> On Mon, Aug 04, 2014 at 03:25:25PM +0200, Oleg Nesterov wrote:
> > On 08/03, Paul E. McKenney wrote:
> > >
> > > On Mon, Aug 04, 2014 at 08:37:37AM +0800, Lai Jiangshan wrote:
> > > > An alternative solution:
> > > > srcu_read_lock() before exit_notify(), srcu_read_unlock() after the last preempt_disable()
> > > > in the do_exit, and synchronize_srcu() in rcu_tasks_kthread().
> > >
> > > That is a good way to synchronize with the exiting tasks, and I will
> > > probably that that approach.
> > >
> > > I -thought- that Oleg was concerned about safely building the list to
> > > start with, though.
> > 
> > But for_each_process_thread() under rcu_read_lock() should work except
> > it can miss the exiting tasks.
> > 
> > So it seems that the clever Lai's trick should solve the problem.
> 
> Cool!  Lai's trick seems to be doing well in early testing, so keeping
> fingers firmly crossed.  ;-)

And the horrible thing is that my plan had been to use per-CPU reference
counts for this purpose, which would of course have been sort of like
re-implementing a special case of SRCU.

So we should all be thankful to Lai for his suggestion!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-04 12:25           ` Peter Zijlstra
  2014-08-04 12:37             ` Paul E. McKenney
@ 2014-08-04 14:56             ` Peter Zijlstra
  2014-08-05  0:47               ` Lai Jiangshan
  1 sibling, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-04 14:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

[-- Attachment #1: Type: text/plain, Size: 1113 bytes --]

On Mon, Aug 04, 2014 at 02:25:15PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 04, 2014 at 04:50:44AM -0700, Paul E. McKenney wrote:
> > OK, I will bite...
> > 
> > What kinds of tasks are on a runqueue, but neither ->on_cpu nor
> > PREEMPT_ACTIVE?
> 
> Userspace tasks, they don't necessarily get PREEMPT_ACTIVE when
> preempted. Now obviously you're not _that_ interested in userspace tasks
> for this, so that might be ok.
> 
> But the main point was, you cannot use ->on_cpu or PREEMPT_ACTIVE
> without holding rq->lock.

Hmm, maybe you can, we have the context switch in between setting
->on_cpu and clearing PREEMPT_ACTIVE and vice-versa.

The context switch (obviously) provides a full barrier, so we might be
able to -- with careful consideration -- read these two separate values
and construct something usable from them.

Something like:

	task_preempt_count(tsk) & PREEMPT_ACTIVE
	smp_rmb();
	tsk->on_cpu

And because we set PREEMPT_ACTIVE before clearing on_cpu, this should
race the right way (err towards the inclusive side).

Obviously that wants a big fat comment...

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-04 13:32           ` Oleg Nesterov
@ 2014-08-04 19:28             ` Paul E. McKenney
  2014-08-04 19:32               ` Oleg Nesterov
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-04 19:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On Mon, Aug 04, 2014 at 03:32:21PM +0200, Oleg Nesterov wrote:
> On 08/03, Paul E. McKenney wrote:
> >
> > On Sun, Aug 03, 2014 at 03:33:18PM +0200, Oleg Nesterov wrote:
> > > It seems that you need another global list, a task should be visible on that
> > > list until exit_rcu().
> >
> > As in create another global list that all tasks are added to when created
> > and then remved from when they exit?
> 
> This is what I meant, although I hoped there is a better solution. The obvious
> problem is locking.
> 
> And at least Lai's idea looks certainly better to me.

OK, so I checked out my earlier concern about the group leader going away.
It looks like the group leader now sticks around until all threads in
the group have exited, which is a nice change from the behavior I was
(perhaps incorrectly) recalling!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-04 19:28             ` Paul E. McKenney
@ 2014-08-04 19:32               ` Oleg Nesterov
  0 siblings, 0 replies; 122+ messages in thread
From: Oleg Nesterov @ 2014-08-04 19:32 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On 08/04, Paul E. McKenney wrote:
>
> OK, so I checked out my earlier concern about the group leader going away.
> It looks like the group leader now sticks around until all threads in
> the group have exited, which is a nice change from the behavior I was
> (perhaps incorrectly) recalling!

Ah, I didn't know you were worried about a zombie leader ;)

Yes, this case was always fine. Ignoring the fact while_each_thread() was
always buggy ;) Fortunately for_each_thread() should be fine.

Oleg.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-04 14:56             ` Peter Zijlstra
@ 2014-08-05  0:47               ` Lai Jiangshan
  2014-08-05 21:55                 ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Lai Jiangshan @ 2014-08-05  0:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On 08/04/2014 10:56 PM, Peter Zijlstra wrote:
> On Mon, Aug 04, 2014 at 02:25:15PM +0200, Peter Zijlstra wrote:
>> On Mon, Aug 04, 2014 at 04:50:44AM -0700, Paul E. McKenney wrote:
>>> OK, I will bite...
>>>
>>> What kinds of tasks are on a runqueue, but neither ->on_cpu nor
>>> PREEMPT_ACTIVE?
>>
>> Userspace tasks, they don't necessarily get PREEMPT_ACTIVE when
>> preempted. Now obviously you're not _that_ interested in userspace tasks
>> for this, so that might be ok.
>>
>> But the main point was, you cannot use ->on_cpu or PREEMPT_ACTIVE
>> without holding rq->lock.
> 
> Hmm, maybe you can, we have the context switch in between setting
> ->on_cpu and clearing PREEMPT_ACTIVE and vice-versa.
> 
> The context switch (obviously) provides a full barrier, so we might be
> able to -- with careful consideration -- read these two separate values
> and construct something usable from them.
> 
> Something like:
> 
> 	task_preempt_count(tsk) & PREEMPT_ACTIVE
	the @tsk is running on_cpu, the above result is false.
> 	smp_rmb();
> 	tsk->on_cpu
	now the @tsk is preempted, the above result also is false.

	so it is useless if we fetch the preempt_count and on_cpu in two separated
instructions.  Maybe it would work if we also take tsk->nivcsw in consideration.
(I just noticed that tsk->n[i]vcsw are the version numbers for the tsk->on_cpu)

bool task_on_cpu_or_preempted(tsk)
{
	unsigned long saved_nivcsw;

	saved_nivcsw = task->nivcsw;
	if (tsk->on_cpu)
		return true;

	smp_rmb();

	if (task_preempt_count(tsk) & PREEMPT_ACTIVE)
		return true;

	smp_rmb();

	if (tsk->on_cpu || task->nivcsw != saved_nivcsw)
		return true;

	return false;
}

> 
> And because we set PREEMPT_ACTIVE before clearing on_cpu, this should
> race the right way (err towards the inclusive side).
> 
> Obviously that wants a big fat comment...


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-05  0:47               ` Lai Jiangshan
@ 2014-08-05 21:55                 ` Paul E. McKenney
  2014-08-06  0:27                   ` Lai Jiangshan
                                     ` (2 more replies)
  0 siblings, 3 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-05 21:55 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Peter Zijlstra, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On Tue, Aug 05, 2014 at 08:47:55AM +0800, Lai Jiangshan wrote:
> On 08/04/2014 10:56 PM, Peter Zijlstra wrote:
> > On Mon, Aug 04, 2014 at 02:25:15PM +0200, Peter Zijlstra wrote:
> >> On Mon, Aug 04, 2014 at 04:50:44AM -0700, Paul E. McKenney wrote:
> >>> OK, I will bite...
> >>>
> >>> What kinds of tasks are on a runqueue, but neither ->on_cpu nor
> >>> PREEMPT_ACTIVE?
> >>
> >> Userspace tasks, they don't necessarily get PREEMPT_ACTIVE when
> >> preempted. Now obviously you're not _that_ interested in userspace tasks
> >> for this, so that might be ok.
> >>
> >> But the main point was, you cannot use ->on_cpu or PREEMPT_ACTIVE
> >> without holding rq->lock.
> > 
> > Hmm, maybe you can, we have the context switch in between setting
> > ->on_cpu and clearing PREEMPT_ACTIVE and vice-versa.
> > 
> > The context switch (obviously) provides a full barrier, so we might be
> > able to -- with careful consideration -- read these two separate values
> > and construct something usable from them.
> > 
> > Something like:
> > 
> > 	task_preempt_count(tsk) & PREEMPT_ACTIVE
> 	the @tsk is running on_cpu, the above result is false.
> > 	smp_rmb();
> > 	tsk->on_cpu
> 	now the @tsk is preempted, the above result also is false.
> 
> 	so it is useless if we fetch the preempt_count and on_cpu in two separated
> instructions.  Maybe it would work if we also take tsk->nivcsw in consideration.
> (I just noticed that tsk->n[i]vcsw are the version numbers for the tsk->on_cpu)
> 
> bool task_on_cpu_or_preempted(tsk)
> {
> 	unsigned long saved_nivcsw;
> 
> 	saved_nivcsw = task->nivcsw;
> 	if (tsk->on_cpu)
> 		return true;
> 
> 	smp_rmb();
> 
> 	if (task_preempt_count(tsk) & PREEMPT_ACTIVE)
> 		return true;
> 
> 	smp_rmb();
> 
> 	if (tsk->on_cpu || task->nivcsw != saved_nivcsw)
> 		return true;
> 
> 	return false;
> }
> 
> > 
> > And because we set PREEMPT_ACTIVE before clearing on_cpu, this should
> > race the right way (err towards the inclusive side).
> > 
> > Obviously that wants a big fat comment...

How about the following?  Non-nohz_full userspace tasks are already covered
courtesy of scheduling-clock interrupts, and this handles nohz_full usermode
tasks.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

rcu: Make TASKS_RCU handle nohx_full= CPUs

Currently TASKS_RCU would ignore a CPU running a task in nohz_full=
usermode execution.  There would be neither a context switch nor a
scheduling-clock interrupt to tell TASKS_RCU that the task in question
had passed through a quiescent state.  The grace period would therefore
extend indefinitely.  This commit therefore makes RCU's dyntick-idle
subsystem record the task_struct structure of the task that is running
in dyntick-idle mode on each CPU.  The TASKS_RCU grace period can
then access this information and record a quiescent state on
behalf of any CPU running in dyntick-idle usermode.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index f504f797c9c8..777aac3a34c0 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1140,5 +1140,14 @@ static inline void rcu_sysidle_force_exit(void)
 
 #endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
 
+#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
+struct task_struct *rcu_dynticks_task_cur(int cpu);
+#else /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
+static inline struct task_struct *rcu_dynticks_task_cur(int cpu)
+{
+	return NULL;
+}
+#endif /* #else #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
+
 
 #endif /* __LINUX_RCUPDATE_H */
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 645a33efc0d4..86a0a7d5bbbd 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -526,6 +526,7 @@ static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
 	atomic_inc(&rdtp->dynticks);
 	smp_mb__after_atomic();  /* Force ordering with next sojourn. */
 	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
+	rcu_dynticks_task_enter(rdtp, current);
 
 	/*
 	 * It is illegal to enter an extended quiescent state while
@@ -642,6 +643,7 @@ void rcu_irq_exit(void)
 static void rcu_eqs_exit_common(struct rcu_dynticks *rdtp, long long oldval,
 			       int user)
 {
+	rcu_dynticks_task_exit(rdtp);
 	smp_mb__before_atomic();  /* Force ordering w/previous sojourn. */
 	atomic_inc(&rdtp->dynticks);
 	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 0f69a79c5b7d..1e79fa1b7cbf 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -88,6 +88,9 @@ struct rcu_dynticks {
 				    /* Process level is worth LLONG_MAX/2. */
 	int dynticks_nmi_nesting;   /* Track NMI nesting level. */
 	atomic_t dynticks;	    /* Even value for idle, else odd. */
+#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
+	struct task_struct *dynticks_tsk;
+#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
 #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
 	long long dynticks_idle_nesting;
 				    /* irq/process nesting level from idle. */
@@ -579,6 +582,9 @@ static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
 static void rcu_bind_gp_kthread(void);
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
 static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
+static void rcu_dynticks_task_enter(struct rcu_dynticks *rdtp,
+				    struct task_struct *t);
+static void rcu_dynticks_task_exit(struct rcu_dynticks *rdtp);
 
 #endif /* #ifndef RCU_TREE_NONCORE */
 
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index a86a363ea453..442d62edc564 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2852,3 +2852,29 @@ static void rcu_bind_gp_kthread(void)
 		set_cpus_allowed_ptr(current, cpumask_of(cpu));
 #endif /* #ifdef CONFIG_NO_HZ_FULL */
 }
+
+/* Record the current task on dyntick-idle entry. */
+static void rcu_dynticks_task_enter(struct rcu_dynticks *rdtp,
+				    struct task_struct *t)
+{
+#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
+	ACCESS_ONCE(rdtp->dynticks_tsk) = t;
+#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
+}
+
+/* Record no current task on dyntick-idle exit. */
+static void rcu_dynticks_task_exit(struct rcu_dynticks *rdtp)
+{
+	rcu_dynticks_task_enter(rdtp, NULL);
+}
+
+#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
+struct task_struct *rcu_dynticks_task_cur(int cpu)
+{
+	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
+	struct task_struct *t = ACCESS_ONCE(rdtp->dynticks_tsk);
+
+	smp_read_barrier_depends(); /* Dereferences after fetch of "t". */
+	return t;
+}
+#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index ad2a8df43757..6ad6af2ab028 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -481,6 +481,28 @@ static void check_holdout_task(struct task_struct *t,
 	sched_show_task(current);
 }
 
+/* Check for nohz_full CPUs executing in userspace. */
+static void check_no_hz_full_tasks(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+	int cpu;
+	struct task_struct *t;
+
+	for_each_online_cpu(cpu) {
+		cond_resched();
+		rcu_read_lock();
+		t = rcu_dynticks_task_cur(cpu);
+		if (t == NULL || is_idle_task(t)) {
+			rcu_read_unlock();
+			continue;
+		}
+		if (ACCESS_ONCE(t->rcu_tasks_holdout))
+			ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
+		rcu_read_unlock();
+	}
+#endif /* #ifdef CONFIG_NO_HZ_FULL */
+}
+
 /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
 static int __noreturn rcu_tasks_kthread(void *arg)
 {
@@ -584,6 +606,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 				lastreport = jiffies;
 			firstreport = true;
 			WARN_ON(signal_pending(current));
+			check_no_hz_full_tasks();
 			rcu_read_lock();
 			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
 						rcu_tasks_holdout_list)


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-05 21:55                 ` Paul E. McKenney
@ 2014-08-06  0:27                   ` Lai Jiangshan
  2014-08-06  0:48                     ` Paul E. McKenney
  2014-08-06  0:33                   ` Lai Jiangshan
  2014-08-07  8:49                   ` Peter Zijlstra
  2 siblings, 1 reply; 122+ messages in thread
From: Lai Jiangshan @ 2014-08-06  0:27 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On 08/06/2014 05:55 AM, Paul E. McKenney wrote:
> On Tue, Aug 05, 2014 at 08:47:55AM +0800, Lai Jiangshan wrote:
>> On 08/04/2014 10:56 PM, Peter Zijlstra wrote:
>>> On Mon, Aug 04, 2014 at 02:25:15PM +0200, Peter Zijlstra wrote:
>>>> On Mon, Aug 04, 2014 at 04:50:44AM -0700, Paul E. McKenney wrote:
>>>>> OK, I will bite...
>>>>>
>>>>> What kinds of tasks are on a runqueue, but neither ->on_cpu nor
>>>>> PREEMPT_ACTIVE?
>>>>
>>>> Userspace tasks, they don't necessarily get PREEMPT_ACTIVE when
>>>> preempted. Now obviously you're not _that_ interested in userspace tasks
>>>> for this, so that might be ok.
>>>>
>>>> But the main point was, you cannot use ->on_cpu or PREEMPT_ACTIVE
>>>> without holding rq->lock.
>>>
>>> Hmm, maybe you can, we have the context switch in between setting
>>> ->on_cpu and clearing PREEMPT_ACTIVE and vice-versa.
>>>
>>> The context switch (obviously) provides a full barrier, so we might be
>>> able to -- with careful consideration -- read these two separate values
>>> and construct something usable from them.
>>>
>>> Something like:
>>>
>>> 	task_preempt_count(tsk) & PREEMPT_ACTIVE
>> 	the @tsk is running on_cpu, the above result is false.
>>> 	smp_rmb();
>>> 	tsk->on_cpu
>> 	now the @tsk is preempted, the above result also is false.
>>
>> 	so it is useless if we fetch the preempt_count and on_cpu in two separated
>> instructions.  Maybe it would work if we also take tsk->nivcsw in consideration.
>> (I just noticed that tsk->n[i]vcsw are the version numbers for the tsk->on_cpu)
>>
>> bool task_on_cpu_or_preempted(tsk)
>> {
>> 	unsigned long saved_nivcsw;
>>
>> 	saved_nivcsw = task->nivcsw;
>> 	if (tsk->on_cpu)
>> 		return true;
>>
>> 	smp_rmb();
>>
>> 	if (task_preempt_count(tsk) & PREEMPT_ACTIVE)
>> 		return true;
>>
>> 	smp_rmb();
>>
>> 	if (tsk->on_cpu || task->nivcsw != saved_nivcsw)
>> 		return true;
>>
>> 	return false;
>> }
>>
>>>
>>> And because we set PREEMPT_ACTIVE before clearing on_cpu, this should
>>> race the right way (err towards the inclusive side).
>>>
>>> Obviously that wants a big fat comment...
> 
> How about the following?  Non-nohz_full userspace tasks are already covered
> courtesy of scheduling-clock interrupts, and this handles nohz_full usermode
> tasks.

synchronize_rcu_tasks() is called extremely rare.  Is it right?  So I think it is
acceptable to interrupt the nohz_full usermode CPUs.

> 
> Thoughts?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> rcu: Make TASKS_RCU handle nohx_full= CPUs
> 
> Currently TASKS_RCU would ignore a CPU running a task in nohz_full=
> usermode execution.  There would be neither a context switch nor a
> scheduling-clock interrupt to tell TASKS_RCU that the task in question
> had passed through a quiescent state.  The grace period would therefore
> extend indefinitely.  This commit therefore makes RCU's dyntick-idle
> subsystem record the task_struct structure of the task that is running
> in dyntick-idle mode on each CPU.  The TASKS_RCU grace period can
> then access this information and record a quiescent state on
> behalf of any CPU running in dyntick-idle usermode.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index f504f797c9c8..777aac3a34c0 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1140,5 +1140,14 @@ static inline void rcu_sysidle_force_exit(void)
>  
>  #endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
>  
> +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> +struct task_struct *rcu_dynticks_task_cur(int cpu);
> +#else /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> +static inline struct task_struct *rcu_dynticks_task_cur(int cpu)
> +{
> +	return NULL;
> +}
> +#endif /* #else #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> +
>  
>  #endif /* __LINUX_RCUPDATE_H */
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 645a33efc0d4..86a0a7d5bbbd 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -526,6 +526,7 @@ static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
>  	atomic_inc(&rdtp->dynticks);
>  	smp_mb__after_atomic();  /* Force ordering with next sojourn. */
>  	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> +	rcu_dynticks_task_enter(rdtp, current);
>  
>  	/*
>  	 * It is illegal to enter an extended quiescent state while
> @@ -642,6 +643,7 @@ void rcu_irq_exit(void)
>  static void rcu_eqs_exit_common(struct rcu_dynticks *rdtp, long long oldval,
>  			       int user)
>  {
> +	rcu_dynticks_task_exit(rdtp);
>  	smp_mb__before_atomic();  /* Force ordering w/previous sojourn. */
>  	atomic_inc(&rdtp->dynticks);
>  	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 0f69a79c5b7d..1e79fa1b7cbf 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -88,6 +88,9 @@ struct rcu_dynticks {
>  				    /* Process level is worth LLONG_MAX/2. */
>  	int dynticks_nmi_nesting;   /* Track NMI nesting level. */
>  	atomic_t dynticks;	    /* Even value for idle, else odd. */
> +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> +	struct task_struct *dynticks_tsk;
> +#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
>  #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
>  	long long dynticks_idle_nesting;
>  				    /* irq/process nesting level from idle. */
> @@ -579,6 +582,9 @@ static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
>  static void rcu_bind_gp_kthread(void);
>  static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
>  static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
> +static void rcu_dynticks_task_enter(struct rcu_dynticks *rdtp,
> +				    struct task_struct *t);
> +static void rcu_dynticks_task_exit(struct rcu_dynticks *rdtp);
>  
>  #endif /* #ifndef RCU_TREE_NONCORE */
>  
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index a86a363ea453..442d62edc564 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -2852,3 +2852,29 @@ static void rcu_bind_gp_kthread(void)
>  		set_cpus_allowed_ptr(current, cpumask_of(cpu));
>  #endif /* #ifdef CONFIG_NO_HZ_FULL */
>  }
> +
> +/* Record the current task on dyntick-idle entry. */
> +static void rcu_dynticks_task_enter(struct rcu_dynticks *rdtp,
> +				    struct task_struct *t)
> +{
> +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> +	ACCESS_ONCE(rdtp->dynticks_tsk) = t;
> +#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> +}
> +
> +/* Record no current task on dyntick-idle exit. */
> +static void rcu_dynticks_task_exit(struct rcu_dynticks *rdtp)
> +{
> +	rcu_dynticks_task_enter(rdtp, NULL);
> +}
> +
> +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> +struct task_struct *rcu_dynticks_task_cur(int cpu)
> +{
> +	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> +	struct task_struct *t = ACCESS_ONCE(rdtp->dynticks_tsk);
> +
> +	smp_read_barrier_depends(); /* Dereferences after fetch of "t". */
> +	return t;
> +}
> +#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index ad2a8df43757..6ad6af2ab028 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -481,6 +481,28 @@ static void check_holdout_task(struct task_struct *t,
>  	sched_show_task(current);
>  }
>  
> +/* Check for nohz_full CPUs executing in userspace. */
> +static void check_no_hz_full_tasks(void)
> +{
> +#ifdef CONFIG_NO_HZ_FULL
> +	int cpu;
> +	struct task_struct *t;
> +
> +	for_each_online_cpu(cpu) {
> +		cond_resched();
> +		rcu_read_lock();
> +		t = rcu_dynticks_task_cur(cpu);
> +		if (t == NULL || is_idle_task(t)) {
> +			rcu_read_unlock();
> +			continue;
> +		}
> +		if (ACCESS_ONCE(t->rcu_tasks_holdout))
> +			ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> +		rcu_read_unlock();
> +	}
> +#endif /* #ifdef CONFIG_NO_HZ_FULL */
> +}
> +
>  /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
>  static int __noreturn rcu_tasks_kthread(void *arg)
>  {
> @@ -584,6 +606,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
>  				lastreport = jiffies;
>  			firstreport = true;
>  			WARN_ON(signal_pending(current));
> +			check_no_hz_full_tasks();
>  			rcu_read_lock();
>  			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
>  						rcu_tasks_holdout_list)
> 
> .
> 


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-05 21:55                 ` Paul E. McKenney
  2014-08-06  0:27                   ` Lai Jiangshan
@ 2014-08-06  0:33                   ` Lai Jiangshan
  2014-08-06  0:51                     ` Paul E. McKenney
  2014-08-07  8:49                   ` Peter Zijlstra
  2 siblings, 1 reply; 122+ messages in thread
From: Lai Jiangshan @ 2014-08-06  0:33 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On 08/06/2014 05:55 AM, Paul E. McKenney wrote:
> On Tue, Aug 05, 2014 at 08:47:55AM +0800, Lai Jiangshan wrote:
>> On 08/04/2014 10:56 PM, Peter Zijlstra wrote:
>>> On Mon, Aug 04, 2014 at 02:25:15PM +0200, Peter Zijlstra wrote:
>>>> On Mon, Aug 04, 2014 at 04:50:44AM -0700, Paul E. McKenney wrote:
>>>>> OK, I will bite...
>>>>>
>>>>> What kinds of tasks are on a runqueue, but neither ->on_cpu nor
>>>>> PREEMPT_ACTIVE?
>>>>
>>>> Userspace tasks, they don't necessarily get PREEMPT_ACTIVE when
>>>> preempted. Now obviously you're not _that_ interested in userspace tasks
>>>> for this, so that might be ok.
>>>>
>>>> But the main point was, you cannot use ->on_cpu or PREEMPT_ACTIVE
>>>> without holding rq->lock.
>>>
>>> Hmm, maybe you can, we have the context switch in between setting
>>> ->on_cpu and clearing PREEMPT_ACTIVE and vice-versa.
>>>
>>> The context switch (obviously) provides a full barrier, so we might be
>>> able to -- with careful consideration -- read these two separate values
>>> and construct something usable from them.
>>>
>>> Something like:
>>>
>>> 	task_preempt_count(tsk) & PREEMPT_ACTIVE
>> 	the @tsk is running on_cpu, the above result is false.
>>> 	smp_rmb();
>>> 	tsk->on_cpu
>> 	now the @tsk is preempted, the above result also is false.
>>
>> 	so it is useless if we fetch the preempt_count and on_cpu in two separated
>> instructions.  Maybe it would work if we also take tsk->nivcsw in consideration.
>> (I just noticed that tsk->n[i]vcsw are the version numbers for the tsk->on_cpu)
>>
>> bool task_on_cpu_or_preempted(tsk)
>> {
>> 	unsigned long saved_nivcsw;
>>
>> 	saved_nivcsw = task->nivcsw;
>> 	if (tsk->on_cpu)
>> 		return true;
>>
>> 	smp_rmb();
>>
>> 	if (task_preempt_count(tsk) & PREEMPT_ACTIVE)
>> 		return true;
>>
>> 	smp_rmb();
>>
>> 	if (tsk->on_cpu || task->nivcsw != saved_nivcsw)
>> 		return true;
>>
>> 	return false;
>> }
>>
>>>
>>> And because we set PREEMPT_ACTIVE before clearing on_cpu, this should
>>> race the right way (err towards the inclusive side).
>>>
>>> Obviously that wants a big fat comment...
> 
> How about the following?  Non-nohz_full userspace tasks are already covered
> courtesy of scheduling-clock interrupts, and this handles nohz_full usermode
> tasks.
> 
> Thoughts?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> rcu: Make TASKS_RCU handle nohx_full= CPUs
> 
> Currently TASKS_RCU would ignore a CPU running a task in nohz_full=
> usermode execution.  There would be neither a context switch nor a
> scheduling-clock interrupt to tell TASKS_RCU that the task in question
> had passed through a quiescent state.  The grace period would therefore
> extend indefinitely.  This commit therefore makes RCU's dyntick-idle
> subsystem record the task_struct structure of the task that is running
> in dyntick-idle mode on each CPU.  The TASKS_RCU grace period can
> then access this information and record a quiescent state on
> behalf of any CPU running in dyntick-idle usermode.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index f504f797c9c8..777aac3a34c0 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1140,5 +1140,14 @@ static inline void rcu_sysidle_force_exit(void)
>  
>  #endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
>  
> +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> +struct task_struct *rcu_dynticks_task_cur(int cpu);
> +#else /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> +static inline struct task_struct *rcu_dynticks_task_cur(int cpu)
> +{
> +	return NULL;
> +}
> +#endif /* #else #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> +
>  
>  #endif /* __LINUX_RCUPDATE_H */
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 645a33efc0d4..86a0a7d5bbbd 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -526,6 +526,7 @@ static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
>  	atomic_inc(&rdtp->dynticks);
>  	smp_mb__after_atomic();  /* Force ordering with next sojourn. */
>  	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> +	rcu_dynticks_task_enter(rdtp, current);
>  
>  	/*
>  	 * It is illegal to enter an extended quiescent state while
> @@ -642,6 +643,7 @@ void rcu_irq_exit(void)
>  static void rcu_eqs_exit_common(struct rcu_dynticks *rdtp, long long oldval,
>  			       int user)
>  {
> +	rcu_dynticks_task_exit(rdtp);



What happen when the trampoline happens before rcu_eqs_exit_common()?
synchronize_sched() also skip this CPUs.  I think, for all CPUs,  only
real schedule is reliable.

>  	smp_mb__before_atomic();  /* Force ordering w/previous sojourn. */
>  	atomic_inc(&rdtp->dynticks);
>  	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 0f69a79c5b7d..1e79fa1b7cbf 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -88,6 +88,9 @@ struct rcu_dynticks {
>  				    /* Process level is worth LLONG_MAX/2. */
>  	int dynticks_nmi_nesting;   /* Track NMI nesting level. */
>  	atomic_t dynticks;	    /* Even value for idle, else odd. */
> +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> +	struct task_struct *dynticks_tsk;
> +#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
>  #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
>  	long long dynticks_idle_nesting;
>  				    /* irq/process nesting level from idle. */
> @@ -579,6 +582,9 @@ static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
>  static void rcu_bind_gp_kthread(void);
>  static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
>  static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
> +static void rcu_dynticks_task_enter(struct rcu_dynticks *rdtp,
> +				    struct task_struct *t);
> +static void rcu_dynticks_task_exit(struct rcu_dynticks *rdtp);
>  
>  #endif /* #ifndef RCU_TREE_NONCORE */
>  
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index a86a363ea453..442d62edc564 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -2852,3 +2852,29 @@ static void rcu_bind_gp_kthread(void)
>  		set_cpus_allowed_ptr(current, cpumask_of(cpu));
>  #endif /* #ifdef CONFIG_NO_HZ_FULL */
>  }
> +
> +/* Record the current task on dyntick-idle entry. */
> +static void rcu_dynticks_task_enter(struct rcu_dynticks *rdtp,
> +				    struct task_struct *t)
> +{
> +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> +	ACCESS_ONCE(rdtp->dynticks_tsk) = t;
> +#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> +}
> +
> +/* Record no current task on dyntick-idle exit. */
> +static void rcu_dynticks_task_exit(struct rcu_dynticks *rdtp)
> +{
> +	rcu_dynticks_task_enter(rdtp, NULL);
> +}
> +
> +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> +struct task_struct *rcu_dynticks_task_cur(int cpu)
> +{
> +	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> +	struct task_struct *t = ACCESS_ONCE(rdtp->dynticks_tsk);
> +
> +	smp_read_barrier_depends(); /* Dereferences after fetch of "t". */
> +	return t;
> +}
> +#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index ad2a8df43757..6ad6af2ab028 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -481,6 +481,28 @@ static void check_holdout_task(struct task_struct *t,
>  	sched_show_task(current);
>  }
>  
> +/* Check for nohz_full CPUs executing in userspace. */
> +static void check_no_hz_full_tasks(void)
> +{
> +#ifdef CONFIG_NO_HZ_FULL
> +	int cpu;
> +	struct task_struct *t;
> +
> +	for_each_online_cpu(cpu) {
> +		cond_resched();
> +		rcu_read_lock();
> +		t = rcu_dynticks_task_cur(cpu);
> +		if (t == NULL || is_idle_task(t)) {
> +			rcu_read_unlock();
> +			continue;
> +		}
> +		if (ACCESS_ONCE(t->rcu_tasks_holdout))
> +			ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> +		rcu_read_unlock();
> +	}
> +#endif /* #ifdef CONFIG_NO_HZ_FULL */
> +}
> +
>  /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
>  static int __noreturn rcu_tasks_kthread(void *arg)
>  {
> @@ -584,6 +606,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
>  				lastreport = jiffies;
>  			firstreport = true;
>  			WARN_ON(signal_pending(current));
> +			check_no_hz_full_tasks();
>  			rcu_read_lock();
>  			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
>  						rcu_tasks_holdout_list)
> 
> .
> 


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-06  0:27                   ` Lai Jiangshan
@ 2014-08-06  0:48                     ` Paul E. McKenney
  0 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-06  0:48 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Peter Zijlstra, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On Wed, Aug 06, 2014 at 08:27:51AM +0800, Lai Jiangshan wrote:
> On 08/06/2014 05:55 AM, Paul E. McKenney wrote:
> > On Tue, Aug 05, 2014 at 08:47:55AM +0800, Lai Jiangshan wrote:
> >> On 08/04/2014 10:56 PM, Peter Zijlstra wrote:
> >>> On Mon, Aug 04, 2014 at 02:25:15PM +0200, Peter Zijlstra wrote:
> >>>> On Mon, Aug 04, 2014 at 04:50:44AM -0700, Paul E. McKenney wrote:
> >>>>> OK, I will bite...
> >>>>>
> >>>>> What kinds of tasks are on a runqueue, but neither ->on_cpu nor
> >>>>> PREEMPT_ACTIVE?
> >>>>
> >>>> Userspace tasks, they don't necessarily get PREEMPT_ACTIVE when
> >>>> preempted. Now obviously you're not _that_ interested in userspace tasks
> >>>> for this, so that might be ok.
> >>>>
> >>>> But the main point was, you cannot use ->on_cpu or PREEMPT_ACTIVE
> >>>> without holding rq->lock.
> >>>
> >>> Hmm, maybe you can, we have the context switch in between setting
> >>> ->on_cpu and clearing PREEMPT_ACTIVE and vice-versa.
> >>>
> >>> The context switch (obviously) provides a full barrier, so we might be
> >>> able to -- with careful consideration -- read these two separate values
> >>> and construct something usable from them.
> >>>
> >>> Something like:
> >>>
> >>> 	task_preempt_count(tsk) & PREEMPT_ACTIVE
> >> 	the @tsk is running on_cpu, the above result is false.
> >>> 	smp_rmb();
> >>> 	tsk->on_cpu
> >> 	now the @tsk is preempted, the above result also is false.
> >>
> >> 	so it is useless if we fetch the preempt_count and on_cpu in two separated
> >> instructions.  Maybe it would work if we also take tsk->nivcsw in consideration.
> >> (I just noticed that tsk->n[i]vcsw are the version numbers for the tsk->on_cpu)
> >>
> >> bool task_on_cpu_or_preempted(tsk)
> >> {
> >> 	unsigned long saved_nivcsw;
> >>
> >> 	saved_nivcsw = task->nivcsw;
> >> 	if (tsk->on_cpu)
> >> 		return true;
> >>
> >> 	smp_rmb();
> >>
> >> 	if (task_preempt_count(tsk) & PREEMPT_ACTIVE)
> >> 		return true;
> >>
> >> 	smp_rmb();
> >>
> >> 	if (tsk->on_cpu || task->nivcsw != saved_nivcsw)
> >> 		return true;
> >>
> >> 	return false;
> >> }
> >>
> >>>
> >>> And because we set PREEMPT_ACTIVE before clearing on_cpu, this should
> >>> race the right way (err towards the inclusive side).
> >>>
> >>> Obviously that wants a big fat comment...
> > 
> > How about the following?  Non-nohz_full userspace tasks are already covered
> > courtesy of scheduling-clock interrupts, and this handles nohz_full usermode
> > tasks.
> 
> synchronize_rcu_tasks() is called extremely rare.  Is it right?  So I think it is
> acceptable to interrupt the nohz_full usermode CPUs.

Given that it is pretty easy to avoid interrupting them, why not avoid it?

That said, there is one additional class of usermode tasks, namely
those that remain preempted from usermode for long periods of time.

I do not currently catch those.

							Thanx, Paul

> > Thoughts?
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > rcu: Make TASKS_RCU handle nohx_full= CPUs
> > 
> > Currently TASKS_RCU would ignore a CPU running a task in nohz_full=
> > usermode execution.  There would be neither a context switch nor a
> > scheduling-clock interrupt to tell TASKS_RCU that the task in question
> > had passed through a quiescent state.  The grace period would therefore
> > extend indefinitely.  This commit therefore makes RCU's dyntick-idle
> > subsystem record the task_struct structure of the task that is running
> > in dyntick-idle mode on each CPU.  The TASKS_RCU grace period can
> > then access this information and record a quiescent state on
> > behalf of any CPU running in dyntick-idle usermode.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > 
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index f504f797c9c8..777aac3a34c0 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -1140,5 +1140,14 @@ static inline void rcu_sysidle_force_exit(void)
> >  
> >  #endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
> >  
> > +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> > +struct task_struct *rcu_dynticks_task_cur(int cpu);
> > +#else /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > +static inline struct task_struct *rcu_dynticks_task_cur(int cpu)
> > +{
> > +	return NULL;
> > +}
> > +#endif /* #else #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > +
> >  
> >  #endif /* __LINUX_RCUPDATE_H */
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 645a33efc0d4..86a0a7d5bbbd 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -526,6 +526,7 @@ static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
> >  	atomic_inc(&rdtp->dynticks);
> >  	smp_mb__after_atomic();  /* Force ordering with next sojourn. */
> >  	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> > +	rcu_dynticks_task_enter(rdtp, current);
> >  
> >  	/*
> >  	 * It is illegal to enter an extended quiescent state while
> > @@ -642,6 +643,7 @@ void rcu_irq_exit(void)
> >  static void rcu_eqs_exit_common(struct rcu_dynticks *rdtp, long long oldval,
> >  			       int user)
> >  {
> > +	rcu_dynticks_task_exit(rdtp);
> >  	smp_mb__before_atomic();  /* Force ordering w/previous sojourn. */
> >  	atomic_inc(&rdtp->dynticks);
> >  	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index 0f69a79c5b7d..1e79fa1b7cbf 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -88,6 +88,9 @@ struct rcu_dynticks {
> >  				    /* Process level is worth LLONG_MAX/2. */
> >  	int dynticks_nmi_nesting;   /* Track NMI nesting level. */
> >  	atomic_t dynticks;	    /* Even value for idle, else odd. */
> > +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> > +	struct task_struct *dynticks_tsk;
> > +#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> >  #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
> >  	long long dynticks_idle_nesting;
> >  				    /* irq/process nesting level from idle. */
> > @@ -579,6 +582,9 @@ static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> >  static void rcu_bind_gp_kthread(void);
> >  static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
> >  static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
> > +static void rcu_dynticks_task_enter(struct rcu_dynticks *rdtp,
> > +				    struct task_struct *t);
> > +static void rcu_dynticks_task_exit(struct rcu_dynticks *rdtp);
> >  
> >  #endif /* #ifndef RCU_TREE_NONCORE */
> >  
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index a86a363ea453..442d62edc564 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -2852,3 +2852,29 @@ static void rcu_bind_gp_kthread(void)
> >  		set_cpus_allowed_ptr(current, cpumask_of(cpu));
> >  #endif /* #ifdef CONFIG_NO_HZ_FULL */
> >  }
> > +
> > +/* Record the current task on dyntick-idle entry. */
> > +static void rcu_dynticks_task_enter(struct rcu_dynticks *rdtp,
> > +				    struct task_struct *t)
> > +{
> > +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> > +	ACCESS_ONCE(rdtp->dynticks_tsk) = t;
> > +#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > +}
> > +
> > +/* Record no current task on dyntick-idle exit. */
> > +static void rcu_dynticks_task_exit(struct rcu_dynticks *rdtp)
> > +{
> > +	rcu_dynticks_task_enter(rdtp, NULL);
> > +}
> > +
> > +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> > +struct task_struct *rcu_dynticks_task_cur(int cpu)
> > +{
> > +	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> > +	struct task_struct *t = ACCESS_ONCE(rdtp->dynticks_tsk);
> > +
> > +	smp_read_barrier_depends(); /* Dereferences after fetch of "t". */
> > +	return t;
> > +}
> > +#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > index ad2a8df43757..6ad6af2ab028 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -481,6 +481,28 @@ static void check_holdout_task(struct task_struct *t,
> >  	sched_show_task(current);
> >  }
> >  
> > +/* Check for nohz_full CPUs executing in userspace. */
> > +static void check_no_hz_full_tasks(void)
> > +{
> > +#ifdef CONFIG_NO_HZ_FULL
> > +	int cpu;
> > +	struct task_struct *t;
> > +
> > +	for_each_online_cpu(cpu) {
> > +		cond_resched();
> > +		rcu_read_lock();
> > +		t = rcu_dynticks_task_cur(cpu);
> > +		if (t == NULL || is_idle_task(t)) {
> > +			rcu_read_unlock();
> > +			continue;
> > +		}
> > +		if (ACCESS_ONCE(t->rcu_tasks_holdout))
> > +			ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> > +		rcu_read_unlock();
> > +	}
> > +#endif /* #ifdef CONFIG_NO_HZ_FULL */
> > +}
> > +
> >  /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
> >  static int __noreturn rcu_tasks_kthread(void *arg)
> >  {
> > @@ -584,6 +606,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
> >  				lastreport = jiffies;
> >  			firstreport = true;
> >  			WARN_ON(signal_pending(current));
> > +			check_no_hz_full_tasks();
> >  			rcu_read_lock();
> >  			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> >  						rcu_tasks_holdout_list)
> > 
> > .
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-06  0:33                   ` Lai Jiangshan
@ 2014-08-06  0:51                     ` Paul E. McKenney
  2014-08-06 22:48                       ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-06  0:51 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Peter Zijlstra, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On Wed, Aug 06, 2014 at 08:33:29AM +0800, Lai Jiangshan wrote:
> On 08/06/2014 05:55 AM, Paul E. McKenney wrote:
> > On Tue, Aug 05, 2014 at 08:47:55AM +0800, Lai Jiangshan wrote:
> >> On 08/04/2014 10:56 PM, Peter Zijlstra wrote:
> >>> On Mon, Aug 04, 2014 at 02:25:15PM +0200, Peter Zijlstra wrote:
> >>>> On Mon, Aug 04, 2014 at 04:50:44AM -0700, Paul E. McKenney wrote:
> >>>>> OK, I will bite...
> >>>>>
> >>>>> What kinds of tasks are on a runqueue, but neither ->on_cpu nor
> >>>>> PREEMPT_ACTIVE?
> >>>>
> >>>> Userspace tasks, they don't necessarily get PREEMPT_ACTIVE when
> >>>> preempted. Now obviously you're not _that_ interested in userspace tasks
> >>>> for this, so that might be ok.
> >>>>
> >>>> But the main point was, you cannot use ->on_cpu or PREEMPT_ACTIVE
> >>>> without holding rq->lock.
> >>>
> >>> Hmm, maybe you can, we have the context switch in between setting
> >>> ->on_cpu and clearing PREEMPT_ACTIVE and vice-versa.
> >>>
> >>> The context switch (obviously) provides a full barrier, so we might be
> >>> able to -- with careful consideration -- read these two separate values
> >>> and construct something usable from them.
> >>>
> >>> Something like:
> >>>
> >>> 	task_preempt_count(tsk) & PREEMPT_ACTIVE
> >> 	the @tsk is running on_cpu, the above result is false.
> >>> 	smp_rmb();
> >>> 	tsk->on_cpu
> >> 	now the @tsk is preempted, the above result also is false.
> >>
> >> 	so it is useless if we fetch the preempt_count and on_cpu in two separated
> >> instructions.  Maybe it would work if we also take tsk->nivcsw in consideration.
> >> (I just noticed that tsk->n[i]vcsw are the version numbers for the tsk->on_cpu)
> >>
> >> bool task_on_cpu_or_preempted(tsk)
> >> {
> >> 	unsigned long saved_nivcsw;
> >>
> >> 	saved_nivcsw = task->nivcsw;
> >> 	if (tsk->on_cpu)
> >> 		return true;
> >>
> >> 	smp_rmb();
> >>
> >> 	if (task_preempt_count(tsk) & PREEMPT_ACTIVE)
> >> 		return true;
> >>
> >> 	smp_rmb();
> >>
> >> 	if (tsk->on_cpu || task->nivcsw != saved_nivcsw)
> >> 		return true;
> >>
> >> 	return false;
> >> }
> >>
> >>>
> >>> And because we set PREEMPT_ACTIVE before clearing on_cpu, this should
> >>> race the right way (err towards the inclusive side).
> >>>
> >>> Obviously that wants a big fat comment...
> > 
> > How about the following?  Non-nohz_full userspace tasks are already covered
> > courtesy of scheduling-clock interrupts, and this handles nohz_full usermode
> > tasks.
> > 
> > Thoughts?
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > rcu: Make TASKS_RCU handle nohx_full= CPUs
> > 
> > Currently TASKS_RCU would ignore a CPU running a task in nohz_full=
> > usermode execution.  There would be neither a context switch nor a
> > scheduling-clock interrupt to tell TASKS_RCU that the task in question
> > had passed through a quiescent state.  The grace period would therefore
> > extend indefinitely.  This commit therefore makes RCU's dyntick-idle
> > subsystem record the task_struct structure of the task that is running
> > in dyntick-idle mode on each CPU.  The TASKS_RCU grace period can
> > then access this information and record a quiescent state on
> > behalf of any CPU running in dyntick-idle usermode.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > 
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index f504f797c9c8..777aac3a34c0 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -1140,5 +1140,14 @@ static inline void rcu_sysidle_force_exit(void)
> >  
> >  #endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
> >  
> > +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> > +struct task_struct *rcu_dynticks_task_cur(int cpu);
> > +#else /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > +static inline struct task_struct *rcu_dynticks_task_cur(int cpu)
> > +{
> > +	return NULL;
> > +}
> > +#endif /* #else #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > +
> >  
> >  #endif /* __LINUX_RCUPDATE_H */
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 645a33efc0d4..86a0a7d5bbbd 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -526,6 +526,7 @@ static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
> >  	atomic_inc(&rdtp->dynticks);
> >  	smp_mb__after_atomic();  /* Force ordering with next sojourn. */
> >  	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> > +	rcu_dynticks_task_enter(rdtp, current);
> >  
> >  	/*
> >  	 * It is illegal to enter an extended quiescent state while
> > @@ -642,6 +643,7 @@ void rcu_irq_exit(void)
> >  static void rcu_eqs_exit_common(struct rcu_dynticks *rdtp, long long oldval,
> >  			       int user)
> >  {
> > +	rcu_dynticks_task_exit(rdtp);
> 
> What happen when the trampoline happens before rcu_eqs_exit_common()?
> synchronize_sched() also skip this CPUs.  I think, for all CPUs,  only
> real schedule is reliable.

True, this prohibits tracing the point from the call to
rcu_eqs_enter_common() to the transition to usermode.  I am betting that
this is OK, though.

							Thanx, Paul

> >  	smp_mb__before_atomic();  /* Force ordering w/previous sojourn. */
> >  	atomic_inc(&rdtp->dynticks);
> >  	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index 0f69a79c5b7d..1e79fa1b7cbf 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -88,6 +88,9 @@ struct rcu_dynticks {
> >  				    /* Process level is worth LLONG_MAX/2. */
> >  	int dynticks_nmi_nesting;   /* Track NMI nesting level. */
> >  	atomic_t dynticks;	    /* Even value for idle, else odd. */
> > +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> > +	struct task_struct *dynticks_tsk;
> > +#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> >  #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
> >  	long long dynticks_idle_nesting;
> >  				    /* irq/process nesting level from idle. */
> > @@ -579,6 +582,9 @@ static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> >  static void rcu_bind_gp_kthread(void);
> >  static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
> >  static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
> > +static void rcu_dynticks_task_enter(struct rcu_dynticks *rdtp,
> > +				    struct task_struct *t);
> > +static void rcu_dynticks_task_exit(struct rcu_dynticks *rdtp);
> >  
> >  #endif /* #ifndef RCU_TREE_NONCORE */
> >  
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index a86a363ea453..442d62edc564 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -2852,3 +2852,29 @@ static void rcu_bind_gp_kthread(void)
> >  		set_cpus_allowed_ptr(current, cpumask_of(cpu));
> >  #endif /* #ifdef CONFIG_NO_HZ_FULL */
> >  }
> > +
> > +/* Record the current task on dyntick-idle entry. */
> > +static void rcu_dynticks_task_enter(struct rcu_dynticks *rdtp,
> > +				    struct task_struct *t)
> > +{
> > +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> > +	ACCESS_ONCE(rdtp->dynticks_tsk) = t;
> > +#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > +}
> > +
> > +/* Record no current task on dyntick-idle exit. */
> > +static void rcu_dynticks_task_exit(struct rcu_dynticks *rdtp)
> > +{
> > +	rcu_dynticks_task_enter(rdtp, NULL);
> > +}
> > +
> > +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> > +struct task_struct *rcu_dynticks_task_cur(int cpu)
> > +{
> > +	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> > +	struct task_struct *t = ACCESS_ONCE(rdtp->dynticks_tsk);
> > +
> > +	smp_read_barrier_depends(); /* Dereferences after fetch of "t". */
> > +	return t;
> > +}
> > +#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > index ad2a8df43757..6ad6af2ab028 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -481,6 +481,28 @@ static void check_holdout_task(struct task_struct *t,
> >  	sched_show_task(current);
> >  }
> >  
> > +/* Check for nohz_full CPUs executing in userspace. */
> > +static void check_no_hz_full_tasks(void)
> > +{
> > +#ifdef CONFIG_NO_HZ_FULL
> > +	int cpu;
> > +	struct task_struct *t;
> > +
> > +	for_each_online_cpu(cpu) {
> > +		cond_resched();
> > +		rcu_read_lock();
> > +		t = rcu_dynticks_task_cur(cpu);
> > +		if (t == NULL || is_idle_task(t)) {
> > +			rcu_read_unlock();
> > +			continue;
> > +		}
> > +		if (ACCESS_ONCE(t->rcu_tasks_holdout))
> > +			ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> > +		rcu_read_unlock();
> > +	}
> > +#endif /* #ifdef CONFIG_NO_HZ_FULL */
> > +}
> > +
> >  /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
> >  static int __noreturn rcu_tasks_kthread(void *arg)
> >  {
> > @@ -584,6 +606,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
> >  				lastreport = jiffies;
> >  			firstreport = true;
> >  			WARN_ON(signal_pending(current));
> > +			check_no_hz_full_tasks();
> >  			rcu_read_lock();
> >  			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> >  						rcu_tasks_holdout_list)
> > 
> > .
> > 
> 


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-02 22:58             ` Paul E. McKenney
@ 2014-08-06  0:57               ` Steven Rostedt
  2014-08-06  1:21                 ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Rostedt @ 2014-08-06  0:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	dvhart, fweisbec, bobby.prani

On Sat, 2 Aug 2014 15:58:57 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Sat, Aug 02, 2014 at 04:47:19PM +0200, Oleg Nesterov wrote:
> > On 08/01, Paul E. McKenney wrote:
> > >
> > > On Fri, Aug 01, 2014 at 11:32:51AM -0700, Paul E. McKenney wrote:
> > > > On Fri, Aug 01, 2014 at 05:09:26PM +0200, Oleg Nesterov wrote:
> > > > > On 07/31, Paul E. McKenney wrote:
> > > > > >
> > > > > > +void synchronize_rcu_tasks(void)
> > > > > > +{
> > > > > > +	/* Complain if the scheduler has not started.  */
> > > > > > +	rcu_lockdep_assert(!rcu_scheduler_active,
> > > > > > +			   "synchronize_rcu_tasks called too soon");
> > > > > > +
> > > > > > +	/* Wait for the grace period. */
> > > > > > +	wait_rcu_gp(call_rcu_tasks);
> > > > > > +}
> > > > >
> > > > > Btw, what about CONFIG_PREEMPT=n ?
> > > > >
> > > > > I mean, can't synchronize_rcu_tasks() be synchronize_sched() in this
> > > > > case?
> > > >
> > > > Excellent point, indeed it can!
> > > >
> > > > And if I do it right, it will make CONFIG_TASKS_RCU=y safe for kernel
> > > > tinification.  ;-)
> > >
> > > Unless, that is, we need to wait for trampolines in the idle loop...
> > >
> > > Sounds like a question for Steven.  ;-)
> > 
> > Sure, but the full blown synchronize_rcu_tasks() can't handle the idle threads
> > anyway. An idle thread can not be deactivated and for_each_process() can't see
> > it anyway.
> 
> Indeed, if idle threads need to be tracked, their tracking will need to
> be at least partially special-cased.
> 

Yeah, idle threads can be affected by the trampolines. That is, we can
still hook a trampoline to some function in the idle loop.

But we should be able to make the hardware call that puts the CPU to
sleep a quiescent state too. May need to be arch dependent. :-/

-- Steve

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-06  0:57               ` Steven Rostedt
@ 2014-08-06  1:21                 ` Paul E. McKenney
  2014-08-06  8:47                   ` Peter Zijlstra
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-06  1:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, dhowells, edumazet,
	dvhart, fweisbec, bobby.prani

On Tue, Aug 05, 2014 at 08:57:11PM -0400, Steven Rostedt wrote:
> On Sat, 2 Aug 2014 15:58:57 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Sat, Aug 02, 2014 at 04:47:19PM +0200, Oleg Nesterov wrote:
> > > On 08/01, Paul E. McKenney wrote:
> > > >
> > > > On Fri, Aug 01, 2014 at 11:32:51AM -0700, Paul E. McKenney wrote:
> > > > > On Fri, Aug 01, 2014 at 05:09:26PM +0200, Oleg Nesterov wrote:
> > > > > > On 07/31, Paul E. McKenney wrote:
> > > > > > >
> > > > > > > +void synchronize_rcu_tasks(void)
> > > > > > > +{
> > > > > > > +	/* Complain if the scheduler has not started.  */
> > > > > > > +	rcu_lockdep_assert(!rcu_scheduler_active,
> > > > > > > +			   "synchronize_rcu_tasks called too soon");
> > > > > > > +
> > > > > > > +	/* Wait for the grace period. */
> > > > > > > +	wait_rcu_gp(call_rcu_tasks);
> > > > > > > +}
> > > > > >
> > > > > > Btw, what about CONFIG_PREEMPT=n ?
> > > > > >
> > > > > > I mean, can't synchronize_rcu_tasks() be synchronize_sched() in this
> > > > > > case?
> > > > >
> > > > > Excellent point, indeed it can!
> > > > >
> > > > > And if I do it right, it will make CONFIG_TASKS_RCU=y safe for kernel
> > > > > tinification.  ;-)
> > > >
> > > > Unless, that is, we need to wait for trampolines in the idle loop...
> > > >
> > > > Sounds like a question for Steven.  ;-)
> > > 
> > > Sure, but the full blown synchronize_rcu_tasks() can't handle the idle threads
> > > anyway. An idle thread can not be deactivated and for_each_process() can't see
> > > it anyway.
> > 
> > Indeed, if idle threads need to be tracked, their tracking will need to
> > be at least partially special-cased.
> 
> Yeah, idle threads can be affected by the trampolines. That is, we can
> still hook a trampoline to some function in the idle loop.
> 
> But we should be able to make the hardware call that puts the CPU to
> sleep a quiescent state too. May need to be arch dependent. :-/

OK, my plan for this eventuality is to do the following:

1.	Ignore the ->on_rq field, as idle tasks are always on a runqueue.

2.	Watch the context-switch counter.

3.	Ignore dyntick-idle state for idle tasks.

4.	If there is no quiescent state from a given idle task after
	a few seconds, schedule rcu_tasks_kthread() on top of the
	offending CPU.

Your idea is an interesting one, but does require another set of
dyntick-idle-like functions and counters.  Or moving the current
rcu_idle_enter() and rcu_idle_exit() calls deeper into the idle loop.

Not sure which is a better approach.  Alternatively, we could just
rely on #4 above, on the grounds that battery life should not be
too badly degraded by the occasional RCU-tasks interference.

Note that this is a different situation than NO_HZ_FULL in realtime
environments, where the worst case causes trouble even if it happens
very infrequently.

						Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-06  1:21                 ` Paul E. McKenney
@ 2014-08-06  8:47                   ` Peter Zijlstra
  2014-08-06 12:09                     ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-06  8:47 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

[-- Attachment #1: Type: text/plain, Size: 2059 bytes --]

On Tue, Aug 05, 2014 at 06:21:39PM -0700, Paul E. McKenney wrote:
> > Yeah, idle threads can be affected by the trampolines. That is, we can
> > still hook a trampoline to some function in the idle loop.
> > 
> > But we should be able to make the hardware call that puts the CPU to
> > sleep a quiescent state too. May need to be arch dependent. :-/
> 
> OK, my plan for this eventuality is to do the following:
> 
> 1.	Ignore the ->on_rq field, as idle tasks are always on a runqueue.
> 
> 2.	Watch the context-switch counter.
> 
> 3.	Ignore dyntick-idle state for idle tasks.
> 
> 4.	If there is no quiescent state from a given idle task after
> 	a few seconds, schedule rcu_tasks_kthread() on top of the
> 	offending CPU.
> 
> Your idea is an interesting one, but does require another set of
> dyntick-idle-like functions and counters.  Or moving the current
> rcu_idle_enter() and rcu_idle_exit() calls deeper into the idle loop.
> 
> Not sure which is a better approach.  Alternatively, we could just
> rely on #4 above, on the grounds that battery life should not be
> too badly degraded by the occasional RCU-tasks interference.
> 
> Note that this is a different situation than NO_HZ_FULL in realtime
> environments, where the worst case causes trouble even if it happens
> very infrequently.

Or you could shoot all CPUs with resched_cpu() which would have them
cycle through schedule() even if there's nothing but the idle thread to
run. That guarantees they'll go to sleep again in a !trampoline.

But I still very much hate the polling stuff...

Can't we abuse the preempt notifiers? Say we make it possible to install
preemption notifiers cross-task, then the task-rcu can install a
preempt-out notifier which completes the rcu-task wait.

After all, since we tagged it it was !running, and being scheduled out
means it ran (once) and therefore isn't on a trampoline anymore.

And the tick, which checks to see if the task got to userspace can do
the same, remove the notifier and then complete.



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-06  8:47                   ` Peter Zijlstra
@ 2014-08-06 12:09                     ` Paul E. McKenney
  2014-08-06 16:30                       ` Peter Zijlstra
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-06 12:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

On Wed, Aug 06, 2014 at 10:47:08AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 05, 2014 at 06:21:39PM -0700, Paul E. McKenney wrote:
> > > Yeah, idle threads can be affected by the trampolines. That is, we can
> > > still hook a trampoline to some function in the idle loop.
> > > 
> > > But we should be able to make the hardware call that puts the CPU to
> > > sleep a quiescent state too. May need to be arch dependent. :-/
> > 
> > OK, my plan for this eventuality is to do the following:
> > 
> > 1.	Ignore the ->on_rq field, as idle tasks are always on a runqueue.
> > 
> > 2.	Watch the context-switch counter.
> > 
> > 3.	Ignore dyntick-idle state for idle tasks.
> > 
> > 4.	If there is no quiescent state from a given idle task after
> > 	a few seconds, schedule rcu_tasks_kthread() on top of the
> > 	offending CPU.
> > 
> > Your idea is an interesting one, but does require another set of
> > dyntick-idle-like functions and counters.  Or moving the current
> > rcu_idle_enter() and rcu_idle_exit() calls deeper into the idle loop.
> > 
> > Not sure which is a better approach.  Alternatively, we could just
> > rely on #4 above, on the grounds that battery life should not be
> > too badly degraded by the occasional RCU-tasks interference.
> > 
> > Note that this is a different situation than NO_HZ_FULL in realtime
> > environments, where the worst case causes trouble even if it happens
> > very infrequently.
> 
> Or you could shoot all CPUs with resched_cpu() which would have them
> cycle through schedule() even if there's nothing but the idle thread to
> run. That guarantees they'll go to sleep again in a !trampoline.

Good point, that would be an easier way to handle the idle threads than
messing with rcu_tasks_kthread()'s affinity.  Thank you!

> But I still very much hate the polling stuff...
> 
> Can't we abuse the preempt notifiers? Say we make it possible to install
> preemption notifiers cross-task, then the task-rcu can install a
> preempt-out notifier which completes the rcu-task wait.
> 
> After all, since we tagged it it was !running, and being scheduled out
> means it ran (once) and therefore isn't on a trampoline anymore.

Maybe I am being overly paranoid, but couldn't the task be preempted
in a trampoline, be resumed, execute one instruction (still in the
tramopoline) and be preempted again?

> And the tick, which checks to see if the task got to userspace can do
> the same, remove the notifier and then complete.

My main concern with this sort of approach is that I have to deal
with full-up concurrency (200 CPUs all complete tasks concurrently,
for example), which would make for a much larger and more complex patch.
Now, I do admit that it is quite possible that I will end up there anyway,
for example, if more people start using RCU-tasks, but I see no need to
hurry this process.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-06 12:09                     ` Paul E. McKenney
@ 2014-08-06 16:30                       ` Peter Zijlstra
  2014-08-06 22:45                         ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-06 16:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

[-- Attachment #1: Type: text/plain, Size: 2269 bytes --]

On Wed, Aug 06, 2014 at 05:09:59AM -0700, Paul E. McKenney wrote:

> > Or you could shoot all CPUs with resched_cpu() which would have them
> > cycle through schedule() even if there's nothing but the idle thread to
> > run. That guarantees they'll go to sleep again in a !trampoline.
> 
> Good point, that would be an easier way to handle the idle threads than
> messing with rcu_tasks_kthread()'s affinity.  Thank you!

One issue though, resched_cpu() doesn't wait for that to complete. We'd
need something that would guarantee the remote CPU has actually
completed execution.

> > But I still very much hate the polling stuff...
> > 
> > Can't we abuse the preempt notifiers? Say we make it possible to install
> > preemption notifiers cross-task, then the task-rcu can install a
> > preempt-out notifier which completes the rcu-task wait.
> > 
> > After all, since we tagged it it was !running, and being scheduled out
> > means it ran (once) and therefore isn't on a trampoline anymore.
> 
> Maybe I am being overly paranoid, but couldn't the task be preempted
> in a trampoline, be resumed, execute one instruction (still in the
> tramopoline) and be preempted again?

Ah, what I failed to state was we should check the sleep condition. So
'voluntary' schedule() calls.

Of course if we'd made something specific to the trampoline thing and
not 'task'-rcu we could simply check if the IP was inside a trampoline
or not.

> > And the tick, which checks to see if the task got to userspace can do
> > the same, remove the notifier and then complete.
> 
> My main concern with this sort of approach is that I have to deal
> with full-up concurrency (200 CPUs all complete tasks concurrently,
> for example), which would make for a much larger and more complex patch.
> Now, I do admit that it is quite possible that I will end up there anyway,
> for example, if more people start using RCU-tasks, but I see no need to
> hurry this process.  ;-)

You mean cacheline contention on the struct completion? I'd first make
it simple and only fix it if/when it becomes a problem.

200 CPUs contending on a single cacheline _once_ is annoying, but
probably still lots cheaper than polling state for at least that many
tasks.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-06 16:30                       ` Peter Zijlstra
@ 2014-08-06 22:45                         ` Paul E. McKenney
  2014-08-07  8:45                           ` Peter Zijlstra
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-06 22:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

On Wed, Aug 06, 2014 at 06:30:35PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 06, 2014 at 05:09:59AM -0700, Paul E. McKenney wrote:
> 
> > > Or you could shoot all CPUs with resched_cpu() which would have them
> > > cycle through schedule() even if there's nothing but the idle thread to
> > > run. That guarantees they'll go to sleep again in a !trampoline.
> > 
> > Good point, that would be an easier way to handle the idle threads than
> > messing with rcu_tasks_kthread()'s affinity.  Thank you!
> 
> One issue though, resched_cpu() doesn't wait for that to complete. We'd
> need something that would guarantee the remote CPU has actually
> completed execution.

Not a problem in the idle case.  Forcing the CPU out of the idle loop
will invoke rcu_idle_exit(), which will increment ->dynticks, which
will be visible to RCU-tasks.

> > > But I still very much hate the polling stuff...

The nice thing about the polling approach is minimal overhead in the
common case where RCU-tasks is not in use.

> > > Can't we abuse the preempt notifiers? Say we make it possible to install
> > > preemption notifiers cross-task, then the task-rcu can install a
> > > preempt-out notifier which completes the rcu-task wait.
> > > 
> > > After all, since we tagged it it was !running, and being scheduled out
> > > means it ran (once) and therefore isn't on a trampoline anymore.
> > 
> > Maybe I am being overly paranoid, but couldn't the task be preempted
> > in a trampoline, be resumed, execute one instruction (still in the
> > tramopoline) and be preempted again?
> 
> Ah, what I failed to state was we should check the sleep condition. So
> 'voluntary' schedule() calls.

OK, but I already catch this via the ->nvcsw counter.  See patch below.

> Of course if we'd made something specific to the trampoline thing and
> not 'task'-rcu we could simply check if the IP was inside a trampoline
> or not.

Of course.  I suspect that there are devils in those details as well.  ;-)

> > > And the tick, which checks to see if the task got to userspace can do
> > > the same, remove the notifier and then complete.
> > 
> > My main concern with this sort of approach is that I have to deal
> > with full-up concurrency (200 CPUs all complete tasks concurrently,
> > for example), which would make for a much larger and more complex patch.
> > Now, I do admit that it is quite possible that I will end up there anyway,
> > for example, if more people start using RCU-tasks, but I see no need to
> > hurry this process.  ;-)
> 
> You mean cacheline contention on the struct completion? I'd first make
> it simple and only fix it if/when it becomes a problem.
> 
> 200 CPUs contending on a single cacheline _once_ is annoying, but
> probably still lots cheaper than polling state for at least that many
> tasks.

On larger systems, memory contention can reportedly be more than merely
annoying.

							Thanx, Paul

------------------------------------------------------------------------

 include/linux/rcupdate.h |    6 +++---
 include/linux/rcutiny.h  |    6 +++++-
 include/linux/rcutree.h  |    2 ++
 include/linux/sched.h    |    1 +
 kernel/rcu/tree.c        |   10 ++++++++++
 kernel/rcu/tree.h        |    4 ++--
 kernel/rcu/tree_plugin.h |    4 ++--
 kernel/rcu/update.c      |   27 ++++++++++++++++++++++-----
 8 files changed, 47 insertions(+), 13 deletions(-)

rcu: Make RCU-tasks wait for idle tasks

Because idle-task code may need to be patched, RCU-tasks need to wait
for idle tasks to schedule.  This commit therefore detects this case
via RCU's dyntick-idle counters.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 0294fc180508..70f2b953c392 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1138,14 +1138,14 @@ static inline void rcu_sysidle_force_exit(void)
 
 #endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
 
-#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
+#if defined(CONFIG_TASKS_RCU) && !defined(CONFIG_TINY_RCU)
 struct task_struct *rcu_dynticks_task_cur(int cpu);
-#else /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
+#else /* #if defined(CONFIG_TASKS_RCU) && !defined(CONFIG_TINY_RCU) */
 static inline struct task_struct *rcu_dynticks_task_cur(int cpu)
 {
 	return NULL;
 }
-#endif /* #else #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
+#endif /* #else #if defined(CONFIG_TASKS_RCU) && !defined(CONFIG_TINY_RCU) */
 
 
 #endif /* __LINUX_RCUPDATE_H */
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index d40a6a451330..b882c27cd314 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -154,7 +154,11 @@ static inline bool rcu_is_watching(void)
 	return true;
 }
 
-
 #endif /* #else defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) */
 
+static inline unsigned int rcu_dynticks_ctr(int cpu)
+{
+	return 0;
+}
+
 #endif /* __LINUX_RCUTINY_H */
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 3e2f5d432743..0d8fdfcb4f0b 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -97,4 +97,6 @@ extern int rcu_scheduler_active __read_mostly;
 
 bool rcu_is_watching(void);
 
+unsigned int rcu_dynticks_ctr(int cpu);
+
 #endif /* __LINUX_RCUTREE_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3cf124389ec7..db4e6cb8fb77 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1277,6 +1277,7 @@ struct task_struct {
 	unsigned long rcu_tasks_nvcsw;
 	int rcu_tasks_holdout;
 	struct list_head rcu_tasks_holdout_list;
+	unsigned int rcu_tasks_dynticks;
 #endif /* #ifdef CONFIG_TASKS_RCU */
 
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 86a0a7d5bbbd..6298a66118e5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -493,6 +493,16 @@ cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp)
 }
 
 /*
+ * rcu_dynticks_ctr - return value of the specified CPU's dynticks counter
+ */
+unsigned int rcu_dynticks_ctr(int cpu)
+{
+	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
+
+	return atomic_add_return(0, &rdtp->dynticks);
+}
+
+/*
  * rcu_eqs_enter_common - current CPU is moving towards extended quiescent state
  *
  * If the new value of the ->dynticks_nesting counter now is zero,
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 1e79fa1b7cbf..e373f8ddc60a 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -88,9 +88,9 @@ struct rcu_dynticks {
 				    /* Process level is worth LLONG_MAX/2. */
 	int dynticks_nmi_nesting;   /* Track NMI nesting level. */
 	atomic_t dynticks;	    /* Even value for idle, else odd. */
-#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
+#if defined(CONFIG_TASKS_RCU)
 	struct task_struct *dynticks_tsk;
-#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
+#endif /* #if defined(CONFIG_TASKS_RCU) */
 #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
 	long long dynticks_idle_nesting;
 				    /* irq/process nesting level from idle. */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 442d62edc564..381cb93ad3fa 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2868,7 +2868,7 @@ static void rcu_dynticks_task_exit(struct rcu_dynticks *rdtp)
 	rcu_dynticks_task_enter(rdtp, NULL);
 }
 
-#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
+#if defined(CONFIG_TASKS_RCU) && !defined(CONFIG_TINY_RCU)
 struct task_struct *rcu_dynticks_task_cur(int cpu)
 {
 	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
@@ -2877,4 +2877,4 @@ struct task_struct *rcu_dynticks_task_cur(int cpu)
 	smp_read_barrier_depends(); /* Dereferences after fetch of "t". */
 	return t;
 }
-#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
+#endif /* #if defined(CONFIG_TASKS_RCU) && !defined(CONFIG_TINY_RCU) */
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 069c06fd6a79..6a6a4c80c553 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -48,6 +48,7 @@
 #include <linux/delay.h>
 #include <linux/module.h>
 #include <linux/kthread.h>
+#include "../sched/sched.h" /* cpu_rq()->idle */
 
 #define CREATE_TRACE_POINTS
 
@@ -484,7 +485,6 @@ static void check_holdout_task(struct task_struct *t,
 /* Check for nohz_full CPUs executing in userspace. */
 static void check_no_hz_full_tasks(void)
 {
-#ifdef CONFIG_NO_HZ_FULL
 	int cpu;
 	struct task_struct *t;
 
@@ -492,7 +492,11 @@ static void check_no_hz_full_tasks(void)
 		cond_resched();
 		rcu_read_lock();
 		t = rcu_dynticks_task_cur(cpu);
-		if (t == NULL || is_idle_task(t)) {
+		if (t == NULL ||
+		    (is_idle_task(t) &&
+		     t->rcu_tasks_dynticks == rcu_dynticks_ctr(cpu))) {
+			if (t != NULL)
+				resched_cpu(cpu); /* Kick idle task. */
 			rcu_read_unlock();
 			continue;
 		}
@@ -500,12 +504,12 @@ static void check_no_hz_full_tasks(void)
 			ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
 		rcu_read_unlock();
 	}
-#endif /* #ifdef CONFIG_NO_HZ_FULL */
 }
 
 /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
 static int __noreturn rcu_tasks_kthread(void *arg)
 {
+	int cpu;
 	unsigned long flags;
 	struct task_struct *g, *t;
 	unsigned long lastreport;
@@ -566,8 +570,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 		 */
 		rcu_read_lock();
 		for_each_process_thread(g, t) {
-			if (t != current && ACCESS_ONCE(t->on_rq) &&
-			    !is_idle_task(t)) {
+			if (t != current && ACCESS_ONCE(t->on_rq)) {
 				get_task_struct(t);
 				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
 				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
@@ -577,6 +580,20 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 		}
 		rcu_read_unlock();
 
+		/* Next, queue up any currently running idle tasks. */
+		for_each_online_cpu(cpu) {
+			t = cpu_rq(cpu)->idle;
+			if (t == rcu_dynticks_task_cur(cpu)) {
+				list_add(&t->rcu_tasks_holdout_list,
+					 &rcu_tasks_holdouts);
+				t->rcu_tasks_dynticks = rcu_dynticks_ctr(cpu);
+				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
+				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
+				list_add(&t->rcu_tasks_holdout_list,
+					 &rcu_tasks_holdouts);
+			}
+		}
+
 		/*
 		 * Wait for tasks that are in the process of exiting.
 		 * This does only part of the job, ensuring that all


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-06  0:51                     ` Paul E. McKenney
@ 2014-08-06 22:48                       ` Paul E. McKenney
  0 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-06 22:48 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Peter Zijlstra, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On Tue, Aug 05, 2014 at 05:51:01PM -0700, Paul E. McKenney wrote:
> On Wed, Aug 06, 2014 at 08:33:29AM +0800, Lai Jiangshan wrote:
> > On 08/06/2014 05:55 AM, Paul E. McKenney wrote:
> > > On Tue, Aug 05, 2014 at 08:47:55AM +0800, Lai Jiangshan wrote:
> > >> On 08/04/2014 10:56 PM, Peter Zijlstra wrote:
> > >>> On Mon, Aug 04, 2014 at 02:25:15PM +0200, Peter Zijlstra wrote:
> > >>>> On Mon, Aug 04, 2014 at 04:50:44AM -0700, Paul E. McKenney wrote:
> > >>>>> OK, I will bite...
> > >>>>>
> > >>>>> What kinds of tasks are on a runqueue, but neither ->on_cpu nor
> > >>>>> PREEMPT_ACTIVE?
> > >>>>
> > >>>> Userspace tasks, they don't necessarily get PREEMPT_ACTIVE when
> > >>>> preempted. Now obviously you're not _that_ interested in userspace tasks
> > >>>> for this, so that might be ok.
> > >>>>
> > >>>> But the main point was, you cannot use ->on_cpu or PREEMPT_ACTIVE
> > >>>> without holding rq->lock.
> > >>>
> > >>> Hmm, maybe you can, we have the context switch in between setting
> > >>> ->on_cpu and clearing PREEMPT_ACTIVE and vice-versa.
> > >>>
> > >>> The context switch (obviously) provides a full barrier, so we might be
> > >>> able to -- with careful consideration -- read these two separate values
> > >>> and construct something usable from them.
> > >>>
> > >>> Something like:
> > >>>
> > >>> 	task_preempt_count(tsk) & PREEMPT_ACTIVE
> > >> 	the @tsk is running on_cpu, the above result is false.
> > >>> 	smp_rmb();
> > >>> 	tsk->on_cpu
> > >> 	now the @tsk is preempted, the above result also is false.
> > >>
> > >> 	so it is useless if we fetch the preempt_count and on_cpu in two separated
> > >> instructions.  Maybe it would work if we also take tsk->nivcsw in consideration.
> > >> (I just noticed that tsk->n[i]vcsw are the version numbers for the tsk->on_cpu)
> > >>
> > >> bool task_on_cpu_or_preempted(tsk)
> > >> {
> > >> 	unsigned long saved_nivcsw;
> > >>
> > >> 	saved_nivcsw = task->nivcsw;
> > >> 	if (tsk->on_cpu)
> > >> 		return true;
> > >>
> > >> 	smp_rmb();
> > >>
> > >> 	if (task_preempt_count(tsk) & PREEMPT_ACTIVE)
> > >> 		return true;
> > >>
> > >> 	smp_rmb();
> > >>
> > >> 	if (tsk->on_cpu || task->nivcsw != saved_nivcsw)
> > >> 		return true;
> > >>
> > >> 	return false;
> > >> }
> > >>
> > >>>
> > >>> And because we set PREEMPT_ACTIVE before clearing on_cpu, this should
> > >>> race the right way (err towards the inclusive side).
> > >>>
> > >>> Obviously that wants a big fat comment...
> > > 
> > > How about the following?  Non-nohz_full userspace tasks are already covered
> > > courtesy of scheduling-clock interrupts, and this handles nohz_full usermode
> > > tasks.
> > > 
> > > Thoughts?
> > > 
> > > 							Thanx, Paul
> > > 
> > > ------------------------------------------------------------------------
> > > 
> > > rcu: Make TASKS_RCU handle nohx_full= CPUs
> > > 
> > > Currently TASKS_RCU would ignore a CPU running a task in nohz_full=
> > > usermode execution.  There would be neither a context switch nor a
> > > scheduling-clock interrupt to tell TASKS_RCU that the task in question
> > > had passed through a quiescent state.  The grace period would therefore
> > > extend indefinitely.  This commit therefore makes RCU's dyntick-idle
> > > subsystem record the task_struct structure of the task that is running
> > > in dyntick-idle mode on each CPU.  The TASKS_RCU grace period can
> > > then access this information and record a quiescent state on
> > > behalf of any CPU running in dyntick-idle usermode.
> > > 
> > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > 
> > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > index f504f797c9c8..777aac3a34c0 100644
> > > --- a/include/linux/rcupdate.h
> > > +++ b/include/linux/rcupdate.h
> > > @@ -1140,5 +1140,14 @@ static inline void rcu_sysidle_force_exit(void)
> > >  
> > >  #endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
> > >  
> > > +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> > > +struct task_struct *rcu_dynticks_task_cur(int cpu);
> > > +#else /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > > +static inline struct task_struct *rcu_dynticks_task_cur(int cpu)
> > > +{
> > > +	return NULL;
> > > +}
> > > +#endif /* #else #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > > +
> > >  
> > >  #endif /* __LINUX_RCUPDATE_H */
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 645a33efc0d4..86a0a7d5bbbd 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -526,6 +526,7 @@ static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
> > >  	atomic_inc(&rdtp->dynticks);
> > >  	smp_mb__after_atomic();  /* Force ordering with next sojourn. */
> > >  	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> > > +	rcu_dynticks_task_enter(rdtp, current);
> > >  
> > >  	/*
> > >  	 * It is illegal to enter an extended quiescent state while
> > > @@ -642,6 +643,7 @@ void rcu_irq_exit(void)
> > >  static void rcu_eqs_exit_common(struct rcu_dynticks *rdtp, long long oldval,
> > >  			       int user)
> > >  {
> > > +	rcu_dynticks_task_exit(rdtp);
> > 
> > What happen when the trampoline happens before rcu_eqs_exit_common()?
> > synchronize_sched() also skip this CPUs.  I think, for all CPUs,  only
> > real schedule is reliable.
> 
> True, this prohibits tracing the point from the call to
> rcu_eqs_enter_common() to the transition to usermode.  I am betting that
> this is OK, though.

And if not, it is not all that hard to handle this case.

							Thanx, Paul

> > >  	smp_mb__before_atomic();  /* Force ordering w/previous sojourn. */
> > >  	atomic_inc(&rdtp->dynticks);
> > >  	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > > index 0f69a79c5b7d..1e79fa1b7cbf 100644
> > > --- a/kernel/rcu/tree.h
> > > +++ b/kernel/rcu/tree.h
> > > @@ -88,6 +88,9 @@ struct rcu_dynticks {
> > >  				    /* Process level is worth LLONG_MAX/2. */
> > >  	int dynticks_nmi_nesting;   /* Track NMI nesting level. */
> > >  	atomic_t dynticks;	    /* Even value for idle, else odd. */
> > > +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> > > +	struct task_struct *dynticks_tsk;
> > > +#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > >  #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
> > >  	long long dynticks_idle_nesting;
> > >  				    /* irq/process nesting level from idle. */
> > > @@ -579,6 +582,9 @@ static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> > >  static void rcu_bind_gp_kthread(void);
> > >  static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
> > >  static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
> > > +static void rcu_dynticks_task_enter(struct rcu_dynticks *rdtp,
> > > +				    struct task_struct *t);
> > > +static void rcu_dynticks_task_exit(struct rcu_dynticks *rdtp);
> > >  
> > >  #endif /* #ifndef RCU_TREE_NONCORE */
> > >  
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index a86a363ea453..442d62edc564 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -2852,3 +2852,29 @@ static void rcu_bind_gp_kthread(void)
> > >  		set_cpus_allowed_ptr(current, cpumask_of(cpu));
> > >  #endif /* #ifdef CONFIG_NO_HZ_FULL */
> > >  }
> > > +
> > > +/* Record the current task on dyntick-idle entry. */
> > > +static void rcu_dynticks_task_enter(struct rcu_dynticks *rdtp,
> > > +				    struct task_struct *t)
> > > +{
> > > +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> > > +	ACCESS_ONCE(rdtp->dynticks_tsk) = t;
> > > +#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > > +}
> > > +
> > > +/* Record no current task on dyntick-idle exit. */
> > > +static void rcu_dynticks_task_exit(struct rcu_dynticks *rdtp)
> > > +{
> > > +	rcu_dynticks_task_enter(rdtp, NULL);
> > > +}
> > > +
> > > +#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
> > > +struct task_struct *rcu_dynticks_task_cur(int cpu)
> > > +{
> > > +	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> > > +	struct task_struct *t = ACCESS_ONCE(rdtp->dynticks_tsk);
> > > +
> > > +	smp_read_barrier_depends(); /* Dereferences after fetch of "t". */
> > > +	return t;
> > > +}
> > > +#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
> > > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > > index ad2a8df43757..6ad6af2ab028 100644
> > > --- a/kernel/rcu/update.c
> > > +++ b/kernel/rcu/update.c
> > > @@ -481,6 +481,28 @@ static void check_holdout_task(struct task_struct *t,
> > >  	sched_show_task(current);
> > >  }
> > >  
> > > +/* Check for nohz_full CPUs executing in userspace. */
> > > +static void check_no_hz_full_tasks(void)
> > > +{
> > > +#ifdef CONFIG_NO_HZ_FULL
> > > +	int cpu;
> > > +	struct task_struct *t;
> > > +
> > > +	for_each_online_cpu(cpu) {
> > > +		cond_resched();
> > > +		rcu_read_lock();
> > > +		t = rcu_dynticks_task_cur(cpu);
> > > +		if (t == NULL || is_idle_task(t)) {
> > > +			rcu_read_unlock();
> > > +			continue;
> > > +		}
> > > +		if (ACCESS_ONCE(t->rcu_tasks_holdout))
> > > +			ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> > > +		rcu_read_unlock();
> > > +	}
> > > +#endif /* #ifdef CONFIG_NO_HZ_FULL */
> > > +}
> > > +
> > >  /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
> > >  static int __noreturn rcu_tasks_kthread(void *arg)
> > >  {
> > > @@ -584,6 +606,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
> > >  				lastreport = jiffies;
> > >  			firstreport = true;
> > >  			WARN_ON(signal_pending(current));
> > > +			check_no_hz_full_tasks();
> > >  			rcu_read_lock();
> > >  			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> > >  						rcu_tasks_holdout_list)
> > > 
> > > .
> > > 
> > 


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-06 22:45                         ` Paul E. McKenney
@ 2014-08-07  8:45                           ` Peter Zijlstra
  2014-08-07 15:00                             ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-07  8:45 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

[-- Attachment #1: Type: text/plain, Size: 931 bytes --]

On Wed, Aug 06, 2014 at 03:45:18PM -0700, Paul E. McKenney wrote:
> > > > But I still very much hate the polling stuff...
> 
> The nice thing about the polling approach is minimal overhead in the
> common case where RCU-tasks is not in use.

No, quite the reverse, there is overhead when its not in use, as opposed
to no overhead at all.

I'm still not convinced we need this 'generic' rcu-task stuff and create
yet another kthread with polling semantics, we want to let the system
idle out when there's nothing to do, not keep waking it up.

So do we really need the call_rcu_task() thing and why isn't something
like synchronize_tasks() good enough?

So the thing is, the one proposed user is very rare (*) and for that
you're adding overhead outside of that user (a separate kthread) and
your adding overhead when its not used.

* I'm assuming that, since tracing is 'rare' and this is some tracing
thing.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-05 21:55                 ` Paul E. McKenney
  2014-08-06  0:27                   ` Lai Jiangshan
  2014-08-06  0:33                   ` Lai Jiangshan
@ 2014-08-07  8:49                   ` Peter Zijlstra
  2014-08-07 15:43                     ` Paul E. McKenney
  2 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-07  8:49 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

[-- Attachment #1: Type: text/plain, Size: 797 bytes --]

On Tue, Aug 05, 2014 at 02:55:10PM -0700, Paul E. McKenney wrote:
> +/* Check for nohz_full CPUs executing in userspace. */
> +static void check_no_hz_full_tasks(void)
> +{
> +#ifdef CONFIG_NO_HZ_FULL
> +	int cpu;
> +	struct task_struct *t;
> +
> +	for_each_online_cpu(cpu) {
> +		cond_resched();
> +		rcu_read_lock();
> +		t = rcu_dynticks_task_cur(cpu);
> +		if (t == NULL || is_idle_task(t)) {
> +			rcu_read_unlock();
> +			continue;
> +		}
> +		if (ACCESS_ONCE(t->rcu_tasks_holdout))
> +			ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> +		rcu_read_unlock();
> +	}
> +#endif /* #ifdef CONFIG_NO_HZ_FULL */
> +}

That's not hotplug safe afaict, and I've no idea if someone pointed that
out already because people refuse to trim email and I can't be arsed to
wade through pages and pages of quoting.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-07  8:45                           ` Peter Zijlstra
@ 2014-08-07 15:00                             ` Paul E. McKenney
  2014-08-07 15:26                               ` Peter Zijlstra
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-07 15:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

On Thu, Aug 07, 2014 at 10:45:44AM +0200, Peter Zijlstra wrote:
> On Wed, Aug 06, 2014 at 03:45:18PM -0700, Paul E. McKenney wrote:
> > > > > But I still very much hate the polling stuff...
> > 
> > The nice thing about the polling approach is minimal overhead in the
> > common case where RCU-tasks is not in use.
> 
> No, quite the reverse, there is overhead when its not in use, as opposed
> to no overhead at all.

Say what???

> I'm still not convinced we need this 'generic' rcu-task stuff and create
> yet another kthread with polling semantics, we want to let the system
> idle out when there's nothing to do, not keep waking it up.

Which is exactly what happens.  The kthread is created only at first
use, so if no one uses RCU-tasks, then no kthread is created, see
https://lkml.org/lkml/2014/8/4/630.  Even if a kthread is created, if
there is no more work for it to do, it sleeps indefinitely.  See for
example https://lkml.org/lkml/2014/8/4/629.

> So do we really need the call_rcu_task() thing and why isn't something
> like synchronize_tasks() good enough?

Sounds like a question for Steven.

> So the thing is, the one proposed user is very rare (*) and for that
> you're adding overhead outside of that user (a separate kthread) and
> your adding overhead when its not used.

If that really was the case, that would be bad.  However, in the latest
versions, that is no longer the case.

> * I'm assuming that, since tracing is 'rare' and this is some tracing
> thing.

Another good point for Steven.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-07 15:00                             ` Paul E. McKenney
@ 2014-08-07 15:26                               ` Peter Zijlstra
  2014-08-07 17:27                                 ` Peter Zijlstra
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-07 15:26 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

[-- Attachment #1: Type: text/plain, Size: 1824 bytes --]

On Thu, Aug 07, 2014 at 08:00:31AM -0700, Paul E. McKenney wrote:
> On Thu, Aug 07, 2014 at 10:45:44AM +0200, Peter Zijlstra wrote:
> > On Wed, Aug 06, 2014 at 03:45:18PM -0700, Paul E. McKenney wrote:
> > > > > > But I still very much hate the polling stuff...
> > > 
> > > The nice thing about the polling approach is minimal overhead in the
> > > common case where RCU-tasks is not in use.
> > 
> > No, quite the reverse, there is overhead when its not in use, as opposed
> > to no overhead at all.
> 
> Say what???
> 
> > I'm still not convinced we need this 'generic' rcu-task stuff and create
> > yet another kthread with polling semantics, we want to let the system
> > idle out when there's nothing to do, not keep waking it up.
> 
> Which is exactly what happens.  The kthread is created only at first
> use, so if no one uses RCU-tasks, then no kthread is created, see
> https://lkml.org/lkml/2014/8/4/630.  Even if a kthread is created, if
> there is no more work for it to do, it sleeps indefinitely.  See for
> example https://lkml.org/lkml/2014/8/4/629.

Ah, the 'full' patch I was staring at for reference did an unconditional
poll.

> > So do we really need the call_rcu_task() thing and why isn't something
> > like synchronize_tasks() good enough?
> 
> Sounds like a question for Steven.
> 
> > So the thing is, the one proposed user is very rare (*) and for that
> > you're adding overhead outside of that user (a separate kthread) and
> > your adding overhead when its not used.
> 
> If that really was the case, that would be bad.  However, in the latest
> versions, that is no longer the case.
> 
> > * I'm assuming that, since tracing is 'rare' and this is some tracing
> > thing.
> 
> Another good point for Steven.

Yes.. and he's back now, so please :-)

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-07  8:49                   ` Peter Zijlstra
@ 2014-08-07 15:43                     ` Paul E. McKenney
  2014-08-07 16:32                       ` Peter Zijlstra
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-07 15:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On Thu, Aug 07, 2014 at 10:49:21AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 05, 2014 at 02:55:10PM -0700, Paul E. McKenney wrote:
> > +/* Check for nohz_full CPUs executing in userspace. */
> > +static void check_no_hz_full_tasks(void)
> > +{
> > +#ifdef CONFIG_NO_HZ_FULL
> > +	int cpu;
> > +	struct task_struct *t;
> > +
> > +	for_each_online_cpu(cpu) {
> > +		cond_resched();
> > +		rcu_read_lock();
> > +		t = rcu_dynticks_task_cur(cpu);
> > +		if (t == NULL || is_idle_task(t)) {
> > +			rcu_read_unlock();
> > +			continue;
> > +		}
> > +		if (ACCESS_ONCE(t->rcu_tasks_holdout))
> > +			ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> > +		rcu_read_unlock();
> > +	}
> > +#endif /* #ifdef CONFIG_NO_HZ_FULL */
> > +}
> 
> That's not hotplug safe afaict, and I've no idea if someone pointed that
> out already because people refuse to trim email and I can't be arsed to
> wade through pages and pages of quoting.

Hmmm...  That does look a bit suspicious, now that you mention it.
After all, if a CPU is offline, its idle tasks cannot be on a
trampoline.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-07 15:43                     ` Paul E. McKenney
@ 2014-08-07 16:32                       ` Peter Zijlstra
  2014-08-07 17:48                         ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-07 16:32 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

[-- Attachment #1: Type: text/plain, Size: 1682 bytes --]

On Thu, Aug 07, 2014 at 08:43:58AM -0700, Paul E. McKenney wrote:
> On Thu, Aug 07, 2014 at 10:49:21AM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 05, 2014 at 02:55:10PM -0700, Paul E. McKenney wrote:
> > > +/* Check for nohz_full CPUs executing in userspace. */
> > > +static void check_no_hz_full_tasks(void)
> > > +{
> > > +#ifdef CONFIG_NO_HZ_FULL
> > > +	int cpu;
> > > +	struct task_struct *t;
> > > +
> > > +	for_each_online_cpu(cpu) {
> > > +		cond_resched();
> > > +		rcu_read_lock();
> > > +		t = rcu_dynticks_task_cur(cpu);
> > > +		if (t == NULL || is_idle_task(t)) {
> > > +			rcu_read_unlock();
> > > +			continue;
> > > +		}
> > > +		if (ACCESS_ONCE(t->rcu_tasks_holdout))
> > > +			ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> > > +		rcu_read_unlock();
> > > +	}
> > > +#endif /* #ifdef CONFIG_NO_HZ_FULL */
> > > +}
> > 
> > That's not hotplug safe afaict, and I've no idea if someone pointed that
> > out already because people refuse to trim email and I can't be arsed to
> > wade through pages and pages of quoting.
> 
> Hmmm...  That does look a bit suspicious, now that you mention it.
> After all, if a CPU is offline, its idle tasks cannot be on a
> trampoline.

But what about a CPU that in the process of getting on-line, it started
when there still were trampolines but hasn't quite made it to full
'online' status when you do this check.

Similar for going offline I suppose, we start to go offline when there
were still trampolines and got missed in that check because we already
cleared the 'online' bit but we're not quite dead yet.

I couldn't find any serialization against either on or off lining of
CPUs.


[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-07 15:26                               ` Peter Zijlstra
@ 2014-08-07 17:27                                 ` Peter Zijlstra
  2014-08-07 18:46                                   ` Peter Zijlstra
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-07 17:27 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

[-- Attachment #1: Type: text/plain, Size: 1244 bytes --]

On Thu, Aug 07, 2014 at 05:26:00PM +0200, Peter Zijlstra wrote:
> > > So do we really need the call_rcu_task() thing and why isn't something
> > > like synchronize_tasks() good enough?
> > 
> > Sounds like a question for Steven.
> > 
> > > So the thing is, the one proposed user is very rare (*) and for that
> > > you're adding overhead outside of that user (a separate kthread) and
> > > your adding overhead when its not used.
> > 
> > If that really was the case, that would be bad.  However, in the latest
> > versions, that is no longer the case.
> > 
> > > * I'm assuming that, since tracing is 'rare' and this is some tracing
> > > thing.
> > 
> > Another good point for Steven.
> 
> Yes.. and he's back now, so please :-)

Right, Steve (and Paul) please explain _why_ this is an 'RCU' at all?
_Why_ do we have call_rcu_task(), and why is it entwined in the 'normal'
RCU stuff? We've got SRCU -- which btw started out simple, without
call_srcu() -- and that lives entirely independent. And SRCU is far more
an actual RCU than this thing is, its got read side primitives and
everything.

Also, I cannot think of any other use besides trampolines for this
thing, but that might be my limited imagination.



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-07 16:32                       ` Peter Zijlstra
@ 2014-08-07 17:48                         ` Paul E. McKenney
  0 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-07 17:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On Thu, Aug 07, 2014 at 06:32:28PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 07, 2014 at 08:43:58AM -0700, Paul E. McKenney wrote:
> > On Thu, Aug 07, 2014 at 10:49:21AM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 05, 2014 at 02:55:10PM -0700, Paul E. McKenney wrote:
> > > > +/* Check for nohz_full CPUs executing in userspace. */
> > > > +static void check_no_hz_full_tasks(void)
> > > > +{
> > > > +#ifdef CONFIG_NO_HZ_FULL
> > > > +	int cpu;
> > > > +	struct task_struct *t;
> > > > +
> > > > +	for_each_online_cpu(cpu) {
> > > > +		cond_resched();
> > > > +		rcu_read_lock();
> > > > +		t = rcu_dynticks_task_cur(cpu);
> > > > +		if (t == NULL || is_idle_task(t)) {
> > > > +			rcu_read_unlock();
> > > > +			continue;
> > > > +		}
> > > > +		if (ACCESS_ONCE(t->rcu_tasks_holdout))
> > > > +			ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> > > > +		rcu_read_unlock();
> > > > +	}
> > > > +#endif /* #ifdef CONFIG_NO_HZ_FULL */
> > > > +}
> > > 
> > > That's not hotplug safe afaict, and I've no idea if someone pointed that
> > > out already because people refuse to trim email and I can't be arsed to
> > > wade through pages and pages of quoting.
> > 
> > Hmmm...  That does look a bit suspicious, now that you mention it.
> > After all, if a CPU is offline, its idle tasks cannot be on a
> > trampoline.
> 
> But what about a CPU that in the process of getting on-line, it started
> when there still were trampolines but hasn't quite made it to full
> 'online' status when you do this check.

Well, one easy approach would be to exclude CPU hotplug during that time.

> Similar for going offline I suppose, we start to go offline when there
> were still trampolines and got missed in that check because we already
> cleared the 'online' bit but we're not quite dead yet.

Yep, same issue here.

> I couldn't find any serialization against either on or off lining of
> CPUs.

Yep, I need to fix this.  The most straightforward approach seems to be
use of cpu_maps_update_begin() like in the following patch.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

rcu: Make RCU-tasks wait for idle tasks

Because idle-task code may need to be patched, RCU-tasks need to wait
for idle tasks to schedule.  This commit therefore detects this case
via RCU's dyntick-idle counters.  Block CPU hotplug during this time
to avoid sending IPIs to offline CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 0294fc180508..70f2b953c392 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1138,14 +1138,14 @@ static inline void rcu_sysidle_force_exit(void)
 
 #endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
 
-#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
+#if defined(CONFIG_TASKS_RCU) && !defined(CONFIG_TINY_RCU)
 struct task_struct *rcu_dynticks_task_cur(int cpu);
-#else /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
+#else /* #if defined(CONFIG_TASKS_RCU) && !defined(CONFIG_TINY_RCU) */
 static inline struct task_struct *rcu_dynticks_task_cur(int cpu)
 {
 	return NULL;
 }
-#endif /* #else #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
+#endif /* #else #if defined(CONFIG_TASKS_RCU) && !defined(CONFIG_TINY_RCU) */
 
 
 #endif /* __LINUX_RCUPDATE_H */
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index d40a6a451330..b882c27cd314 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -154,7 +154,11 @@ static inline bool rcu_is_watching(void)
 	return true;
 }
 
-
 #endif /* #else defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) */
 
+static inline unsigned int rcu_dynticks_ctr(int cpu)
+{
+	return 0;
+}
+
 #endif /* __LINUX_RCUTINY_H */
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 3e2f5d432743..0d8fdfcb4f0b 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -97,4 +97,6 @@ extern int rcu_scheduler_active __read_mostly;
 
 bool rcu_is_watching(void);
 
+unsigned int rcu_dynticks_ctr(int cpu);
+
 #endif /* __LINUX_RCUTREE_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3cf124389ec7..db4e6cb8fb77 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1277,6 +1277,7 @@ struct task_struct {
 	unsigned long rcu_tasks_nvcsw;
 	int rcu_tasks_holdout;
 	struct list_head rcu_tasks_holdout_list;
+	unsigned int rcu_tasks_dynticks;
 #endif /* #ifdef CONFIG_TASKS_RCU */
 
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 86a0a7d5bbbd..6298a66118e5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -493,6 +493,16 @@ cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp)
 }
 
 /*
+ * rcu_dynticks_ctr - return value of the specified CPU's dynticks counter
+ */
+unsigned int rcu_dynticks_ctr(int cpu)
+{
+	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
+
+	return atomic_add_return(0, &rdtp->dynticks);
+}
+
+/*
  * rcu_eqs_enter_common - current CPU is moving towards extended quiescent state
  *
  * If the new value of the ->dynticks_nesting counter now is zero,
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 1e79fa1b7cbf..e373f8ddc60a 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -88,9 +88,9 @@ struct rcu_dynticks {
 				    /* Process level is worth LLONG_MAX/2. */
 	int dynticks_nmi_nesting;   /* Track NMI nesting level. */
 	atomic_t dynticks;	    /* Even value for idle, else odd. */
-#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
+#if defined(CONFIG_TASKS_RCU)
 	struct task_struct *dynticks_tsk;
-#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
+#endif /* #if defined(CONFIG_TASKS_RCU) */
 #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
 	long long dynticks_idle_nesting;
 				    /* irq/process nesting level from idle. */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 442d62edc564..381cb93ad3fa 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2868,7 +2868,7 @@ static void rcu_dynticks_task_exit(struct rcu_dynticks *rdtp)
 	rcu_dynticks_task_enter(rdtp, NULL);
 }
 
-#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
+#if defined(CONFIG_TASKS_RCU) && !defined(CONFIG_TINY_RCU)
 struct task_struct *rcu_dynticks_task_cur(int cpu)
 {
 	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
@@ -2877,4 +2877,4 @@ struct task_struct *rcu_dynticks_task_cur(int cpu)
 	smp_read_barrier_depends(); /* Dereferences after fetch of "t". */
 	return t;
 }
-#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
+#endif /* #if defined(CONFIG_TASKS_RCU) && !defined(CONFIG_TINY_RCU) */
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 069c06fd6a79..637f8c7fc0c2 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -48,6 +48,7 @@
 #include <linux/delay.h>
 #include <linux/module.h>
 #include <linux/kthread.h>
+#include "../sched/sched.h" /* cpu_rq()->idle */
 
 #define CREATE_TRACE_POINTS
 
@@ -484,7 +485,6 @@ static void check_holdout_task(struct task_struct *t,
 /* Check for nohz_full CPUs executing in userspace. */
 static void check_no_hz_full_tasks(void)
 {
-#ifdef CONFIG_NO_HZ_FULL
 	int cpu;
 	struct task_struct *t;
 
@@ -492,7 +492,11 @@ static void check_no_hz_full_tasks(void)
 		cond_resched();
 		rcu_read_lock();
 		t = rcu_dynticks_task_cur(cpu);
-		if (t == NULL || is_idle_task(t)) {
+		if (t == NULL ||
+		    (is_idle_task(t) &&
+		     t->rcu_tasks_dynticks == rcu_dynticks_ctr(cpu))) {
+			if (t != NULL)
+				resched_cpu(cpu); /* Kick idle task. */
 			rcu_read_unlock();
 			continue;
 		}
@@ -500,12 +504,12 @@ static void check_no_hz_full_tasks(void)
 			ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
 		rcu_read_unlock();
 	}
-#endif /* #ifdef CONFIG_NO_HZ_FULL */
 }
 
 /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
 static int __noreturn rcu_tasks_kthread(void *arg)
 {
+	int cpu;
 	unsigned long flags;
 	struct task_struct *g, *t;
 	unsigned long lastreport;
@@ -566,8 +570,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 		 */
 		rcu_read_lock();
 		for_each_process_thread(g, t) {
-			if (t != current && ACCESS_ONCE(t->on_rq) &&
-			    !is_idle_task(t)) {
+			if (t != current && ACCESS_ONCE(t->on_rq)) {
 				get_task_struct(t);
 				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
 				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
@@ -578,6 +581,26 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 		rcu_read_unlock();
 
 		/*
+		 * Next, queue up any currently running idle tasks.
+		 * Exclude CPU hotplug during the time we are working
+		 * with idle tasks, as it is considered bad form to
+		 * send IPIs to offline CPUs.
+		 */
+		cpu_maps_update_begin();
+		for_each_online_cpu(cpu) {
+			t = cpu_rq(cpu)->idle;
+			if (t == rcu_dynticks_task_cur(cpu)) {
+				list_add(&t->rcu_tasks_holdout_list,
+					 &rcu_tasks_holdouts);
+				t->rcu_tasks_dynticks = rcu_dynticks_ctr(cpu);
+				t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
+				ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
+				list_add(&t->rcu_tasks_holdout_list,
+					 &rcu_tasks_holdouts);
+			}
+		}
+
+		/*
 		 * Wait for tasks that are in the process of exiting.
 		 * This does only part of the job, ensuring that all
 		 * tasks that were previously exiting reach the point
@@ -613,6 +636,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 				cond_resched();
 			}
 		}
+		cpu_maps_update_done();
 
 		/*
 		 * Because ->on_rq and ->nvcsw are not guaranteed


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-07 17:27                                 ` Peter Zijlstra
@ 2014-08-07 18:46                                   ` Peter Zijlstra
  2014-08-07 19:49                                     ` Steven Rostedt
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-07 18:46 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

[-- Attachment #1: Type: text/plain, Size: 797 bytes --]

On Thu, Aug 07, 2014 at 07:27:53PM +0200, Peter Zijlstra wrote:
> Right, Steve (and Paul) please explain _why_ this is an 'RCU' at all?
> _Why_ do we have call_rcu_task(), and why is it entwined in the 'normal'
> RCU stuff? We've got SRCU -- which btw started out simple, without
> call_srcu() -- and that lives entirely independent. And SRCU is far more
> an actual RCU than this thing is, its got read side primitives and
> everything.
> 
> Also, I cannot think of any other use besides trampolines for this
> thing, but that might be my limited imagination.

Also, trampolines can end up in the return frames, right? So how can you
be sure when to wipe them? Passing through schedule() isn't enough for
that.

Userspace is, but kernel threads typically don't ever end up there.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-07 18:46                                   ` Peter Zijlstra
@ 2014-08-07 19:49                                     ` Steven Rostedt
  2014-08-07 19:53                                       ` Steven Rostedt
  2014-08-07 20:06                                       ` Peter Zijlstra
  0 siblings, 2 replies; 122+ messages in thread
From: Steven Rostedt @ 2014-08-07 19:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

On Thu, 7 Aug 2014 20:46:35 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Aug 07, 2014 at 07:27:53PM +0200, Peter Zijlstra wrote:
> > Right, Steve (and Paul) please explain _why_ this is an 'RCU' at all?
> > _Why_ do we have call_rcu_task(), and why is it entwined in the 'normal'
> > RCU stuff? We've got SRCU -- which btw started out simple, without
> > call_srcu() -- and that lives entirely independent. And SRCU is far more
> > an actual RCU than this thing is, its got read side primitives and
> > everything.
> > 
> > Also, I cannot think of any other use besides trampolines for this
> > thing, but that might be my limited imagination.
> 
> Also, trampolines can end up in the return frames, right? So how can you
> be sure when to wipe them? Passing through schedule() isn't enough for
> that.

Not sure what you mean.

> 
> Userspace is, but kernel threads typically don't ever end up there.

Only voluntary calls to schedule() will be a quiescent state. Preempt
doesn't count. And no, function callbacks to not call schedule(),
function callbacks should be treated even stricter than interrupt
handlers. They should never call schedule() directly or even take any
locks. Heck, they should be stricter than NMIs for that matter.

Hence, once something calls schedule() directly, we know that it is not
on a trampoline, nor is it going to return to one.

-- Steve

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-07 19:49                                     ` Steven Rostedt
@ 2014-08-07 19:53                                       ` Steven Rostedt
  2014-08-07 20:08                                         ` Peter Zijlstra
  2014-08-07 20:06                                       ` Peter Zijlstra
  1 sibling, 1 reply; 122+ messages in thread
From: Steven Rostedt @ 2014-08-07 19:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

On Thu, 7 Aug 2014 15:49:07 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:


> Only voluntary calls to schedule() will be a quiescent state. Preempt
> doesn't count. And no, function callbacks to not call schedule(),
> function callbacks should be treated even stricter than interrupt
> handlers. They should never call schedule() directly or even take any
> locks. Heck, they should be stricter than NMIs for that matter.
> 
> Hence, once something calls schedule() directly, we know that it is not
> on a trampoline, nor is it going to return to one.

I should also be a bit clearer here. It's not just function callbacks,
but anything that adds a trampoline that can be called from any context
(like for kprobes). The point is, these trampolines that can execute
anywhere (including in NMIs), must have strict use cases. These are not
a notifier or other generic operation that normal RCU is fine for.
These are for really specific cases that require the call_rcu_task() to
free.

call_rcu_task() should seldom be used. The only cases really are for
kprobes and function tracing, and perhaps other dynamic callers.

-- Steve

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-07 19:49                                     ` Steven Rostedt
  2014-08-07 19:53                                       ` Steven Rostedt
@ 2014-08-07 20:06                                       ` Peter Zijlstra
  1 sibling, 0 replies; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-07 20:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

On Thu, Aug 07, 2014 at 03:49:07PM -0400, Steven Rostedt wrote:
> On Thu, 7 Aug 2014 20:46:35 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, Aug 07, 2014 at 07:27:53PM +0200, Peter Zijlstra wrote:
> > > Right, Steve (and Paul) please explain _why_ this is an 'RCU' at all?
> > > _Why_ do we have call_rcu_task(), and why is it entwined in the 'normal'
> > > RCU stuff? We've got SRCU -- which btw started out simple, without
> > > call_srcu() -- and that lives entirely independent. And SRCU is far more
> > > an actual RCU than this thing is, its got read side primitives and
> > > everything.
> > > 
> > > Also, I cannot think of any other use besides trampolines for this
> > > thing, but that might be my limited imagination.
> > 
> > Also, trampolines can end up in the return frames, right? So how can you
> > be sure when to wipe them? Passing through schedule() isn't enough for
> > that.
> 
> Not sure what you mean.

void bar()
{
	mutex_lock();
	...
	mutex_unlock();
}

void foo()
{
	bar();
}

Normally that'll give you a stack/return frame like:

 foo()
   bar()
     mutex_lock()
       schedule();

Now suppose there's a trampoline around bar(), that would give:

  foo()
    __trampoline()
      bar()
        mutex_lock()
	  schedule()

so the function return of bar doesn't point to foo, but to the
trampoline. But we call schedule() from mutex_lock() and think we're all
good.

> > Userspace is, but kernel threads typically don't ever end up there.

> Hence, once something calls schedule() directly, we know that it is not
> on a trampoline, nor is it going to return to one.

How can you say its not going to return to one?


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-07 19:53                                       ` Steven Rostedt
@ 2014-08-07 20:08                                         ` Peter Zijlstra
  2014-08-07 21:18                                           ` Steven Rostedt
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-07 20:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

On Thu, Aug 07, 2014 at 03:53:26PM -0400, Steven Rostedt wrote:
> On Thu, 7 Aug 2014 15:49:07 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> 
> > Only voluntary calls to schedule() will be a quiescent state. Preempt
> > doesn't count. And no, function callbacks to not call schedule(),
> > function callbacks should be treated even stricter than interrupt
> > handlers. They should never call schedule() directly or even take any
> > locks. Heck, they should be stricter than NMIs for that matter.
> > 
> > Hence, once something calls schedule() directly, we know that it is not
> > on a trampoline, nor is it going to return to one.
> 
> I should also be a bit clearer here. It's not just function callbacks,
> but anything that adds a trampoline that can be called from any context
> (like for kprobes). The point is, these trampolines that can execute
> anywhere (including in NMIs), must have strict use cases. These are not
> a notifier or other generic operation that normal RCU is fine for.
> These are for really specific cases that require the call_rcu_task() to
> free.
> 
> call_rcu_task() should seldom be used. The only cases really are for
> kprobes and function tracing, and perhaps other dynamic callers.

OK, you've got to start over and start at the beginning, because I'm
really not understanding this..

What is a 'trampoline' and what are you going to use them for.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-07 20:08                                         ` Peter Zijlstra
@ 2014-08-07 21:18                                           ` Steven Rostedt
  2014-08-08  6:40                                             ` Peter Zijlstra
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Rostedt @ 2014-08-07 21:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

On Thu, 7 Aug 2014 22:08:13 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> OK, you've got to start over and start at the beginning, because I'm
> really not understanding this..
> 
> What is a 'trampoline' and what are you going to use them for.

Great question! :-)

The trampoline is some code that is used to jump to and then jump
someplace else. Currently, we use this for kprobes and ftrace. For
ftrace we have the ftrace_caller trampoline, which is static. When
booting, most functions in the kernel call the mcount code which
simply returns without doing anything. This too is a "trampoline". At
boot, we convert these calls to nops (as you already know). When we
enable callbacks from functions, we convert those calls to call
"ftrace_caller" which is a small assembly trampoline that will call
some function that registered with ftrace.

Now why do we need the call_rcu_task() routine?

Right now, if you register multiple callbacks to ftrace, even if they
are not tracing the same routine, ftrace has to change ftrace_caller to
call another trampoline (in C), that does a loop of all ops registered
with ftrace, and compares the function to the ops hash tables to see if
the ops function should be called for that function.

What we want to do is to create a dynamic trampoline that is a copy of
the ftrace_caller code, but instead of calling this list trampoline, it
calls the ops function directly. This way, each ops registered with
ftrace can have its own custom trampoline that when called will only
call the ops function and not have to iterate over a list. This only
happens if the function being traced only has this one ops registered.
For functions with multiple ops attached to it, we need to call the
list anyway. But for the majority of the cases, this is not the case.

The one caveat for this is, how do we free this custom trampoline when
the ops is done with it? Especially for users of ftrace that
dynamically create their own ops (like perf, and ftrace instances).

We need to find a way to free it, but unfortunately, there's no way to
know when it is safe to free it. There's no way to disable preemption
or have some other notifier to let us know if a task has jumped to this
trampoline and has been preempted (sleeping). The only safe way to know
that no task is on the trampoline is to remove the calls to it,
synchronize the CPUS (so the trampolines are not even in the caches),
and then wait for all tasks to go through some quiescent state. This
state happens to be either not running, in userspace, or when it
voluntarily calls schedule. Because nothing that uses this trampoline
should do that, and if the task voluntarily calls schedule, we know
it's not on the trampoline.

Make sense?

-- Steve

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-07 21:18                                           ` Steven Rostedt
@ 2014-08-08  6:40                                             ` Peter Zijlstra
  2014-08-08 14:12                                               ` Steven Rostedt
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-08  6:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani

[-- Attachment #1: Type: text/plain, Size: 3797 bytes --]

On Thu, Aug 07, 2014 at 05:18:23PM -0400, Steven Rostedt wrote:
> On Thu, 7 Aug 2014 22:08:13 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > OK, you've got to start over and start at the beginning, because I'm
> > really not understanding this..
> > 
> > What is a 'trampoline' and what are you going to use them for.
> 
> Great question! :-)
> 
> The trampoline is some code that is used to jump to and then jump
> someplace else. Currently, we use this for kprobes and ftrace. For
> ftrace we have the ftrace_caller trampoline, which is static. When
> booting, most functions in the kernel call the mcount code which
> simply returns without doing anything. This too is a "trampoline". At
> boot, we convert these calls to nops (as you already know). When we
> enable callbacks from functions, we convert those calls to call
> "ftrace_caller" which is a small assembly trampoline that will call
> some function that registered with ftrace.
> 
> Now why do we need the call_rcu_task() routine?
> 
> Right now, if you register multiple callbacks to ftrace, even if they
> are not tracing the same routine, ftrace has to change ftrace_caller to
> call another trampoline (in C), that does a loop of all ops registered
> with ftrace, and compares the function to the ops hash tables to see if
> the ops function should be called for that function.
> 
> What we want to do is to create a dynamic trampoline that is a copy of
> the ftrace_caller code, but instead of calling this list trampoline, it
> calls the ops function directly. This way, each ops registered with
> ftrace can have its own custom trampoline that when called will only
> call the ops function and not have to iterate over a list. This only
> happens if the function being traced only has this one ops registered.
> For functions with multiple ops attached to it, we need to call the
> list anyway. But for the majority of the cases, this is not the case.
> 
> The one caveat for this is, how do we free this custom trampoline when
> the ops is done with it? Especially for users of ftrace that
> dynamically create their own ops (like perf, and ftrace instances).
> 
> We need to find a way to free it, but unfortunately, there's no way to
> know when it is safe to free it. There's no way to disable preemption
> or have some other notifier to let us know if a task has jumped to this
> trampoline and has been preempted (sleeping). The only safe way to know
> that no task is on the trampoline is to remove the calls to it,
> synchronize the CPUS (so the trampolines are not even in the caches),
> and then wait for all tasks to go through some quiescent state. This
> state happens to be either not running, in userspace, or when it
> voluntarily calls schedule. Because nothing that uses this trampoline
> should do that, and if the task voluntarily calls schedule, we know
> it's not on the trampoline.
> 
> Make sense?

Ok, so they're purely used in the function prologue/epilogue callchain.
And you don't want to use synchronize_tasks() because registering a trace
functions is atomic ?

But why would you use dynamic memory allocation for these trampolines at
all? Why not use the one default trampoline for this?

Suppose that thing looks like:

ftrace_mcount_handler()
{
	for_each_hlist_rcu(entry,..)
		entry->func();
}

so why not make it look like:

ftrace_mcount_handler()
{
	asm_volatile_goto("jmp %l[label]" ::: &do_list);
	return;

do_list:
	for_each_hlist_rcu(entry,...)
		entry->func();
}

Then, for:
	no entries -> NOP, 
	one entry -> "CALL $func", 
	more entries -> "JMP &do_list.

No need for extra allocations and fancy means of getting rid of them,
and only a few bytes extra wrt the existing function.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08  6:40                                             ` Peter Zijlstra
@ 2014-08-08 14:12                                               ` Steven Rostedt
  2014-08-08 14:28                                                 ` Paul E. McKenney
  2014-08-08 14:34                                                 ` Peter Zijlstra
  0 siblings, 2 replies; 122+ messages in thread
From: Steven Rostedt @ 2014-08-08 14:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

On Fri, 8 Aug 2014 08:40:20 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Aug 07, 2014 at 05:18:23PM -0400, Steven Rostedt wrote:
> > On Thu, 7 Aug 2014 22:08:13 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > OK, you've got to start over and start at the beginning, because I'm
> > > really not understanding this..
> > > 
> > > What is a 'trampoline' and what are you going to use them for.
> > 
> > Great question! :-)
> > 
> > The trampoline is some code that is used to jump to and then jump
> > someplace else. Currently, we use this for kprobes and ftrace. For
> > ftrace we have the ftrace_caller trampoline, which is static. When
> > booting, most functions in the kernel call the mcount code which
> > simply returns without doing anything. This too is a "trampoline". At
> > boot, we convert these calls to nops (as you already know). When we
> > enable callbacks from functions, we convert those calls to call
> > "ftrace_caller" which is a small assembly trampoline that will call
> > some function that registered with ftrace.
> > 
> > Now why do we need the call_rcu_task() routine?
> > 
> > Right now, if you register multiple callbacks to ftrace, even if they
> > are not tracing the same routine, ftrace has to change ftrace_caller to
> > call another trampoline (in C), that does a loop of all ops registered
> > with ftrace, and compares the function to the ops hash tables to see if
> > the ops function should be called for that function.
> > 
> > What we want to do is to create a dynamic trampoline that is a copy of
> > the ftrace_caller code, but instead of calling this list trampoline, it
> > calls the ops function directly. This way, each ops registered with
> > ftrace can have its own custom trampoline that when called will only
> > call the ops function and not have to iterate over a list. This only
> > happens if the function being traced only has this one ops registered.
> > For functions with multiple ops attached to it, we need to call the
> > list anyway. But for the majority of the cases, this is not the case.
> > 
> > The one caveat for this is, how do we free this custom trampoline when
> > the ops is done with it? Especially for users of ftrace that
> > dynamically create their own ops (like perf, and ftrace instances).
> > 
> > We need to find a way to free it, but unfortunately, there's no way to
> > know when it is safe to free it. There's no way to disable preemption
> > or have some other notifier to let us know if a task has jumped to this
> > trampoline and has been preempted (sleeping). The only safe way to know
> > that no task is on the trampoline is to remove the calls to it,
> > synchronize the CPUS (so the trampolines are not even in the caches),
> > and then wait for all tasks to go through some quiescent state. This
> > state happens to be either not running, in userspace, or when it
> > voluntarily calls schedule. Because nothing that uses this trampoline
> > should do that, and if the task voluntarily calls schedule, we know
> > it's not on the trampoline.
> > 
> > Make sense?
> 
> Ok, so they're purely used in the function prologue/epilogue callchain.

No, they are also used by optimized kprobes. This is why optimized
kprobes depend on !CONFIG_PREEMPT. [ added Masami to the discussion ].

Which reminds me. On !CONFIG_PREEMPT, call_rcu_task() should be
equivalent to call_rcu_sched().

> And you don't want to use synchronize_tasks() because registering a trace
> functions is atomic ?

No. Has nothing to do with registering the trace function. The issue is
that we have no idea when a task happens to be on a trampoline after it
is registered. For example:

ops adds a callback to sys_read:

sys_read() {
 call trampoline ->
    set up regs for function call.
    <interrupt>
      preempt_schedule();

      [ new task runs for long time ]


While this new task is running, we remove the trampoline and want to
free it. Say this new task keeps the other task from running for
minutes! We call synchronize_sched() or any other rcu call, and all
grace periods finish and we free the trampoline. The sys_read() no
longer calls our trampoline. Doesn't matter, because that task is still
on it. Now we schedule that task back. It's on a trampoline that has
just been freed! BOOM. It's executing code that no longer exits.

> 
> But why would you use dynamic memory allocation for these trampolines at
> all? Why not use the one default trampoline for this?

That's what ftrace does today.

> 
> Suppose that thing looks like:
> 
> ftrace_mcount_handler()
> {
> 	for_each_hlist_rcu(entry,..)
> 		entry->func();
> }
> 
> so why not make it look like:
> 
> ftrace_mcount_handler()
> {
> 	asm_volatile_goto("jmp %l[label]" ::: &do_list);
> 	return;
> 
> do_list:
> 	for_each_hlist_rcu(entry,...)
> 		entry->func();
> }
> 
> Then, for:
> 	no entries -> NOP, 
> 	one entry -> "CALL $func", 
> 	more entries -> "JMP &do_list.

Except that we don't use jump labels for this, but just update the
trampoline directly (we've been doing this before jump labels ever
existed, and the trampoline is all in assembly anyway).

> 
> No need for extra allocations and fancy means of getting rid of them,
> and only a few bytes extra wrt the existing function.

This doesn't address the issue we want to solve.

Say we have 1000 functions we want to trace with 1000 different
callbacks. Each of theses functions has one call back. How do you solve
that with your solution? Today, we do the list for every function. That
is, for each of these 1000 functions, we run through 1000 ops looking
for the ops that registered for this function. Not very efficient is it?


What we want to do today, is to create a dynamic trampoline for each of
theses 1000 functions. Each function will call a separate trampoline
that will only call the function that was registered to it. That way,
we can have 1000 different ops registered to 1000 different functions
and still have the same performance.

-- Steve

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 14:12                                               ` Steven Rostedt
@ 2014-08-08 14:28                                                 ` Paul E. McKenney
  2014-08-09 10:56                                                   ` Masami Hiramatsu
  2014-08-08 14:34                                                 ` Peter Zijlstra
  1 sibling, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-08 14:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

On Fri, Aug 08, 2014 at 10:12:21AM -0400, Steven Rostedt wrote:
> On Fri, 8 Aug 2014 08:40:20 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, Aug 07, 2014 at 05:18:23PM -0400, Steven Rostedt wrote:
> > > On Thu, 7 Aug 2014 22:08:13 +0200
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > OK, you've got to start over and start at the beginning, because I'm
> > > > really not understanding this..
> > > > 
> > > > What is a 'trampoline' and what are you going to use them for.
> > > 
> > > Great question! :-)
> > > 
> > > The trampoline is some code that is used to jump to and then jump
> > > someplace else. Currently, we use this for kprobes and ftrace. For
> > > ftrace we have the ftrace_caller trampoline, which is static. When
> > > booting, most functions in the kernel call the mcount code which
> > > simply returns without doing anything. This too is a "trampoline". At
> > > boot, we convert these calls to nops (as you already know). When we
> > > enable callbacks from functions, we convert those calls to call
> > > "ftrace_caller" which is a small assembly trampoline that will call
> > > some function that registered with ftrace.
> > > 
> > > Now why do we need the call_rcu_task() routine?
> > > 
> > > Right now, if you register multiple callbacks to ftrace, even if they
> > > are not tracing the same routine, ftrace has to change ftrace_caller to
> > > call another trampoline (in C), that does a loop of all ops registered
> > > with ftrace, and compares the function to the ops hash tables to see if
> > > the ops function should be called for that function.
> > > 
> > > What we want to do is to create a dynamic trampoline that is a copy of
> > > the ftrace_caller code, but instead of calling this list trampoline, it
> > > calls the ops function directly. This way, each ops registered with
> > > ftrace can have its own custom trampoline that when called will only
> > > call the ops function and not have to iterate over a list. This only
> > > happens if the function being traced only has this one ops registered.
> > > For functions with multiple ops attached to it, we need to call the
> > > list anyway. But for the majority of the cases, this is not the case.
> > > 
> > > The one caveat for this is, how do we free this custom trampoline when
> > > the ops is done with it? Especially for users of ftrace that
> > > dynamically create their own ops (like perf, and ftrace instances).
> > > 
> > > We need to find a way to free it, but unfortunately, there's no way to
> > > know when it is safe to free it. There's no way to disable preemption
> > > or have some other notifier to let us know if a task has jumped to this
> > > trampoline and has been preempted (sleeping). The only safe way to know
> > > that no task is on the trampoline is to remove the calls to it,
> > > synchronize the CPUS (so the trampolines are not even in the caches),
> > > and then wait for all tasks to go through some quiescent state. This
> > > state happens to be either not running, in userspace, or when it
> > > voluntarily calls schedule. Because nothing that uses this trampoline
> > > should do that, and if the task voluntarily calls schedule, we know
> > > it's not on the trampoline.
> > > 
> > > Make sense?
> > 
> > Ok, so they're purely used in the function prologue/epilogue callchain.
> 
> No, they are also used by optimized kprobes. This is why optimized
> kprobes depend on !CONFIG_PREEMPT. [ added Masami to the discussion ].
> 
> Which reminds me. On !CONFIG_PREEMPT, call_rcu_task() should be
> equivalent to call_rcu_sched().

Almost.  One difference is that call_rcu_sched() won't wait for
idle-task execution.  So presumably you are currently prohibited from
putting kprobes in idle tasks.

Oleg slipped this one past me, and for more than a full hour,
(https://lkml.org/lkml/2014/8/2/18), but this time I remembered.  ;-)

							Thanx, Paul

> > And you don't want to use synchronize_tasks() because registering a trace
> > functions is atomic ?
> 
> No. Has nothing to do with registering the trace function. The issue is
> that we have no idea when a task happens to be on a trampoline after it
> is registered. For example:
> 
> ops adds a callback to sys_read:
> 
> sys_read() {
>  call trampoline ->
>     set up regs for function call.
>     <interrupt>
>       preempt_schedule();
> 
>       [ new task runs for long time ]
> 
> 
> While this new task is running, we remove the trampoline and want to
> free it. Say this new task keeps the other task from running for
> minutes! We call synchronize_sched() or any other rcu call, and all
> grace periods finish and we free the trampoline. The sys_read() no
> longer calls our trampoline. Doesn't matter, because that task is still
> on it. Now we schedule that task back. It's on a trampoline that has
> just been freed! BOOM. It's executing code that no longer exits.
> 
> > 
> > But why would you use dynamic memory allocation for these trampolines at
> > all? Why not use the one default trampoline for this?
> 
> That's what ftrace does today.
> 
> > 
> > Suppose that thing looks like:
> > 
> > ftrace_mcount_handler()
> > {
> > 	for_each_hlist_rcu(entry,..)
> > 		entry->func();
> > }
> > 
> > so why not make it look like:
> > 
> > ftrace_mcount_handler()
> > {
> > 	asm_volatile_goto("jmp %l[label]" ::: &do_list);
> > 	return;
> > 
> > do_list:
> > 	for_each_hlist_rcu(entry,...)
> > 		entry->func();
> > }
> > 
> > Then, for:
> > 	no entries -> NOP, 
> > 	one entry -> "CALL $func", 
> > 	more entries -> "JMP &do_list.
> 
> Except that we don't use jump labels for this, but just update the
> trampoline directly (we've been doing this before jump labels ever
> existed, and the trampoline is all in assembly anyway).
> 
> > 
> > No need for extra allocations and fancy means of getting rid of them,
> > and only a few bytes extra wrt the existing function.
> 
> This doesn't address the issue we want to solve.
> 
> Say we have 1000 functions we want to trace with 1000 different
> callbacks. Each of theses functions has one call back. How do you solve
> that with your solution? Today, we do the list for every function. That
> is, for each of these 1000 functions, we run through 1000 ops looking
> for the ops that registered for this function. Not very efficient is it?
> 
> 
> What we want to do today, is to create a dynamic trampoline for each of
> theses 1000 functions. Each function will call a separate trampoline
> that will only call the function that was registered to it. That way,
> we can have 1000 different ops registered to 1000 different functions
> and still have the same performance.
> 
> -- Steve
> 


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 14:12                                               ` Steven Rostedt
  2014-08-08 14:28                                                 ` Paul E. McKenney
@ 2014-08-08 14:34                                                 ` Peter Zijlstra
  2014-08-08 14:58                                                   ` Steven Rostedt
  1 sibling, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-08 14:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

[-- Attachment #1: Type: text/plain, Size: 3028 bytes --]

On Fri, Aug 08, 2014 at 10:12:21AM -0400, Steven Rostedt wrote:
> > Ok, so they're purely used in the function prologue/epilogue callchain.
> 
> No, they are also used by optimized kprobes. This is why optimized
> kprobes depend on !CONFIG_PREEMPT. [ added Masami to the discussion ].

How do those work? Is that one where the INT3 relocates the instruction
stream into an alternative 'text' and that JMPs back into the original
stream at the end?

And what is there to make sure the kprobe itself doesn't do 'funny'?

> Which reminds me. On !CONFIG_PREEMPT, call_rcu_task() should be
> equivalent to call_rcu_sched().

Sure, as long as you make absolutely sure none of that code ends up
calling cond_resched()/might_sleep() etc. Which I think you already said
was true, so no worries there.

> > And you don't want to use synchronize_tasks() because registering a trace
> > functions is atomic ?
> 
> No. Has nothing to do with registering the trace function. The issue is
> that we have no idea when a task happens to be on a trampoline after it
> is registered. For example:
> 
> ops adds a callback to sys_read:
> 
> sys_read() {
>  call trampoline ->
>     set up regs for function call.
>     <interrupt>
>       preempt_schedule();
> 
>       [ new task runs for long time ]
> 
> 
> While this new task is running, we remove the trampoline and want to
> free it. Say this new task keeps the other task from running for
> minutes! We call synchronize_sched() or any other rcu call, and all
> grace periods finish and we free the trampoline. The sys_read() no
> longer calls our trampoline. Doesn't matter, because that task is still
> on it. Now we schedule that task back. It's on a trampoline that has
> just been freed! BOOM. It's executing code that no longer exits.

Sure, I get that part. What I was getting as is _WHY_ you need
call_rcu_task(), why isn't synchronize_tasks() good enough?

> > No need for extra allocations and fancy means of getting rid of them,
> > and only a few bytes extra wrt the existing function.
> 
> This doesn't address the issue we want to solve.
> 
> Say we have 1000 functions we want to trace with 1000 different
> callbacks. Each of theses functions has one call back. How do you solve
> that with your solution? Today, we do the list for every function. That
> is, for each of these 1000 functions, we run through 1000 ops looking
> for the ops that registered for this function. Not very efficient is it?

Ah, but you didn't say that, didn't you :-)

> What we want to do today, is to create a dynamic trampoline for each of
> theses 1000 functions. Each function will call a separate trampoline
> that will only call the function that was registered to it. That way,
> we can have 1000 different ops registered to 1000 different functions
> and still have the same performance.

And how will you limit the amount of memory tied up in this? This looks
like a good way to tie up an immense amount of memory fast.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 14:34                                                 ` Peter Zijlstra
@ 2014-08-08 14:58                                                   ` Steven Rostedt
  2014-08-08 15:16                                                     ` Peter Zijlstra
  2014-08-08 16:27                                                     ` Peter Zijlstra
  0 siblings, 2 replies; 122+ messages in thread
From: Steven Rostedt @ 2014-08-08 14:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

On Fri, 8 Aug 2014 16:34:13 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Aug 08, 2014 at 10:12:21AM -0400, Steven Rostedt wrote:
> > > Ok, so they're purely used in the function prologue/epilogue callchain.
> > 
> > No, they are also used by optimized kprobes. This is why optimized
> > kprobes depend on !CONFIG_PREEMPT. [ added Masami to the discussion ].
> 
> How do those work? Is that one where the INT3 relocates the instruction
> stream into an alternative 'text' and that JMPs back into the original
> stream at the end?

No, it's where we replace the 'int3' with a jump to a trampoline that
simulates an INT3. Speeds things up quite a bit.

> 
> And what is there to make sure the kprobe itself doesn't do 'funny'?

Well, kprobes, like function callbacks are just restricted like
interrupt handlers are. If they break, they break. They should know
better ;-)

> 
> > Which reminds me. On !CONFIG_PREEMPT, call_rcu_task() should be
> > equivalent to call_rcu_sched().
> 
> Sure, as long as you make absolutely sure none of that code ends up
> calling cond_resched()/might_sleep() etc. Which I think you already said
> was true, so no worries there.

Right. There's no guarantees that someone wont do such a stupid thing.
But then, there's no guarantees that someone wont register an NMI
callback with the same code too.

> 
> > > And you don't want to use synchronize_tasks() because registering a trace
> > > functions is atomic ?
> > 
> > No. Has nothing to do with registering the trace function. The issue is
> > that we have no idea when a task happens to be on a trampoline after it
> > is registered. For example:
> > 
> > ops adds a callback to sys_read:
> > 
> > sys_read() {
> >  call trampoline ->
> >     set up regs for function call.
> >     <interrupt>
> >       preempt_schedule();
> > 
> >       [ new task runs for long time ]
> > 
> > 
> > While this new task is running, we remove the trampoline and want to
> > free it. Say this new task keeps the other task from running for
> > minutes! We call synchronize_sched() or any other rcu call, and all
> > grace periods finish and we free the trampoline. The sys_read() no
> > longer calls our trampoline. Doesn't matter, because that task is still
> > on it. Now we schedule that task back. It's on a trampoline that has
> > just been freed! BOOM. It's executing code that no longer exits.
> 
> Sure, I get that part. What I was getting as is _WHY_ you need
> call_rcu_task(), why isn't synchronize_tasks() good enough?

Oh, because that synchronize_tasks() may take minutes. And that means
we wont be able to return for a long time. The only thing I can really
see using call_rcu_task() is something that needs to free its data. Why
wait around when all you're going to do is call free? It's basically
just a garbage collector.

> 
> > > No need for extra allocations and fancy means of getting rid of them,
> > > and only a few bytes extra wrt the existing function.
> > 
> > This doesn't address the issue we want to solve.
> > 
> > Say we have 1000 functions we want to trace with 1000 different
> > callbacks. Each of theses functions has one call back. How do you solve
> > that with your solution? Today, we do the list for every function. That
> > is, for each of these 1000 functions, we run through 1000 ops looking
> > for the ops that registered for this function. Not very efficient is it?
> 
> Ah, but you didn't say that, didn't you :-)

I just thought it was implied ;-)

> 
> > What we want to do today, is to create a dynamic trampoline for each of
> > theses 1000 functions. Each function will call a separate trampoline
> > that will only call the function that was registered to it. That way,
> > we can have 1000 different ops registered to 1000 different functions
> > and still have the same performance.
> 
> And how will you limit the amount of memory tied up in this? This looks
> like a good way to tie up an immense amount of memory fast.

Well, these operations are currently only allowed by root. Thus, it's
the thing that root should be careful about. The trampolines are small,
and it will take a hell of a lot of callbacks to cause issues.

The thing I'm worried about is to make sure they get freed. Otherwise a
leak will cause more issues than anything else. Which also means we
need to have a way to expedite call_rcu_tasks() if need be.

-- Steve

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 14:58                                                   ` Steven Rostedt
@ 2014-08-08 15:16                                                     ` Peter Zijlstra
  2014-08-08 15:39                                                       ` Steven Rostedt
  2014-08-08 16:27                                                     ` Peter Zijlstra
  1 sibling, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-08 15:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

[-- Attachment #1: Type: text/plain, Size: 4090 bytes --]

On Fri, Aug 08, 2014 at 10:58:58AM -0400, Steven Rostedt wrote:
> On Fri, 8 Aug 2014 16:34:13 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Aug 08, 2014 at 10:12:21AM -0400, Steven Rostedt wrote:
> > > > Ok, so they're purely used in the function prologue/epilogue callchain.
> > > 
> > > No, they are also used by optimized kprobes. This is why optimized
> > > kprobes depend on !CONFIG_PREEMPT. [ added Masami to the discussion ].
> > 
> > How do those work? Is that one where the INT3 relocates the instruction
> > stream into an alternative 'text' and that JMPs back into the original
> > stream at the end?
> 
> No, it's where we replace the 'int3' with a jump to a trampoline that
> simulates an INT3. Speeds things up quite a bit.

Ah, ok. 

> > And what is there to make sure the kprobe itself doesn't do 'funny'?
> 
> Well, kprobes, like function callbacks are just restricted like
> interrupt handlers are. If they break, they break. They should know
> better ;-)

But is there debugging infrastructure to test they don't do funny? Or
are we just going to wait for some random runtime weirdness and then
pull our hair out?

So do we run these handlers with preempt_disable() or anything like
that? (maybe only as a debug option).

> > Sure, as long as you make absolutely sure none of that code ends up
> > calling cond_resched()/might_sleep() etc. Which I think you already said
> > was true, so no worries there.
> 
> Right. There's no guarantees that someone wont do such a stupid thing.
> But then, there's no guarantees that someone wont register an NMI
> callback with the same code too.

Well, kprobes is 'special' in that it it almost encourages random non
kernel devs to write modules for it, so it gets extra special creative
crap in.

I'm fairly sure you'll get your ftrace handler right, because that's
only you and maybe a few other people ever touching that code. Not so
with kprobes.

> > Sure, I get that part. What I was getting as is _WHY_ you need
> > call_rcu_task(), why isn't synchronize_tasks() good enough?
> 
> Oh, because that synchronize_tasks() may take minutes. And that means
> we wont be able to return for a long time. The only thing I can really
> see using call_rcu_task() is something that needs to free its data. Why
> wait around when all you're going to do is call free? It's basically
> just a garbage collector.

Well the waiting has the advantage of being a natural throttle on the
amount of memory tied up in the whole scheme.

> > > What we want to do today, is to create a dynamic trampoline for each of
> > > theses 1000 functions. Each function will call a separate trampoline
> > > that will only call the function that was registered to it. That way,
> > > we can have 1000 different ops registered to 1000 different functions
> > > and still have the same performance.
> > 
> > And how will you limit the amount of memory tied up in this? This looks
> > like a good way to tie up an immense amount of memory fast.
> 
> Well, these operations are currently only allowed by root. Thus, it's
> the thing that root should be careful about. The trampolines are small,
> and it will take a hell of a lot of callbacks to cause issues.

You said 10e3 order things, I then give you a small 32bit arm system.

The thing is, even root is a clueless idiot most of the times, yes we'll
let him shoot his foot off and give him enough rope to hang himself, and
we'll even show him how to tie the knot at times, but how is he to know
that this script that 'worked' now causes his machine to OOM and behave
brick like?

> The thing I'm worried about is to make sure they get freed. Otherwise a
> leak will cause more issues than anything else. Which also means we
> need to have a way to expedite call_rcu_tasks() if need be.

No.. for one, that doesn't follow. Things will get freed (eventually)
even without expedite, and secondly implementing expedite would mean
force scheduling tasks etc. And we're just so not going to do that.



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 15:16                                                     ` Peter Zijlstra
@ 2014-08-08 15:39                                                       ` Steven Rostedt
  2014-08-08 16:01                                                         ` Peter Zijlstra
  2014-08-08 16:17                                                         ` Peter Zijlstra
  0 siblings, 2 replies; 122+ messages in thread
From: Steven Rostedt @ 2014-08-08 15:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

On Fri, 8 Aug 2014 17:16:43 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

 
> > Well, kprobes, like function callbacks are just restricted like
> > interrupt handlers are. If they break, they break. They should know
> > better ;-)
> 
> But is there debugging infrastructure to test they don't do funny? Or
> are we just going to wait for some random runtime weirdness and then
> pull our hair out?

Well, currently there isn't many handler. I'm not sure about kprobes.
They only have a few handlers too. Just enabling the handler to run
under interrupt or NMI context would probably be enough to see if they
do something stupid.

> 
> So do we run these handlers with preempt_disable() or anything like
> that? (maybe only as a debug option).

Actually, for dynamically created ftrace_ops (like perf, and
ftrace instances), we call a wrapper function that does disable
preemption. Because we still need to free the "ops" part, which is
passed to the handler. And we don't want some handler being called
without ops, or even the list function.

Any dynamic ops is called by the ftrace list routine always, because it
disables preemption before traversing the list. Here we can't even us
synchronize_sched() because the function tracer can be called outside
the scope of RCU (going into, out of idle or userspace). Thus, when a
dynamically created ftrace ops is unregistered, a call to
schedule_on_each_cpu() is made, and when that returns, we free the ops.
 
> 
> > > Sure, as long as you make absolutely sure none of that code ends up
> > > calling cond_resched()/might_sleep() etc. Which I think you already said
> > > was true, so no worries there.
> > 
> > Right. There's no guarantees that someone wont do such a stupid thing.
> > But then, there's no guarantees that someone wont register an NMI
> > callback with the same code too.
> 
> Well, kprobes is 'special' in that it it almost encourages random non
> kernel devs to write modules for it, so it gets extra special creative
> crap in.

Well, crap gets what crap deserves ;-)

> 
> I'm fairly sure you'll get your ftrace handler right, because that's
> only you and maybe a few other people ever touching that code. Not so
> with kprobes.

Actually, that's becoming less true. What do you think the live
kernel patching is going to use?

Also, anyone can register a function handler (perf, systemtap, and even
LTTng can). But these still get wrapped by that preempt off.

But we can still make a debug option that has the trampoline disable
preemption, and this will make sure the function it calls doesn't do
something stupid.

> 
> > > Sure, I get that part. What I was getting as is _WHY_ you need
> > > call_rcu_task(), why isn't synchronize_tasks() good enough?
> > 
> > Oh, because that synchronize_tasks() may take minutes. And that means
> > we wont be able to return for a long time. The only thing I can really
> > see using call_rcu_task() is something that needs to free its data. Why
> > wait around when all you're going to do is call free? It's basically
> > just a garbage collector.
> 
> Well the waiting has the advantage of being a natural throttle on the
> amount of memory tied up in the whole scheme.

Yeah, but it would suck if you do "echo nop > current_tracer" while
running something that is preventing a task to unload, and have to wait
a few minutes for that to return.

> 
> > > > What we want to do today, is to create a dynamic trampoline for each of
> > > > theses 1000 functions. Each function will call a separate trampoline
> > > > that will only call the function that was registered to it. That way,
> > > > we can have 1000 different ops registered to 1000 different functions
> > > > and still have the same performance.
> > > 
> > > And how will you limit the amount of memory tied up in this? This looks
> > > like a good way to tie up an immense amount of memory fast.
> > 
> > Well, these operations are currently only allowed by root. Thus, it's
> > the thing that root should be careful about. The trampolines are small,
> > and it will take a hell of a lot of callbacks to cause issues.
> 
> You said 10e3 order things, I then give you a small 32bit arm system.
> 
> The thing is, even root is a clueless idiot most of the times, yes we'll
> let him shoot his foot off and give him enough rope to hang himself, and
> we'll even show him how to tie the knot at times, but how is he to know
> that this script that 'worked' now causes his machine to OOM and behave
> brick like?

Honestly, to allocate 1000 different callbacks now, requires allocating
1000 different ops, which are much larger than a single trampoline.
Right now, the only way to do that is to create a 1000 ftrace
instances, and those creates a 1000 directories, would would include
1000 event directories, each having its own dentry for each event
(around 900).

I don't think we need to worry about this use to work on a small arm 32
bit system and it doesn't work now. The difference in memory
consumption is not that much.


> 
> > The thing I'm worried about is to make sure they get freed. Otherwise a
> > leak will cause more issues than anything else. Which also means we
> > need to have a way to expedite call_rcu_tasks() if need be.
> 
> No.. for one, that doesn't follow. Things will get freed (eventually)
> even without expedite, and secondly implementing expedite would mean
> force scheduling tasks etc. And we're just so not going to do that.
> 
> 

Yeah, you are probably right. This should not be used much, and for
now, only by root users (or other privilege users). And the amount of
memory that needs to be freed will be very small. It would probably be
very difficult to DoS the system with this feature even if you had root.

-- Steve

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 15:39                                                       ` Steven Rostedt
@ 2014-08-08 16:01                                                         ` Peter Zijlstra
  2014-08-08 16:10                                                           ` Steven Rostedt
  2014-08-08 16:17                                                         ` Peter Zijlstra
  1 sibling, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-08 16:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

[-- Attachment #1: Type: text/plain, Size: 2110 bytes --]

On Fri, Aug 08, 2014 at 11:39:49AM -0400, Steven Rostedt wrote:
> > > > Sure, I get that part. What I was getting as is _WHY_ you need
> > > > call_rcu_task(), why isn't synchronize_tasks() good enough?
> > > 
> > > Oh, because that synchronize_tasks() may take minutes. And that means
> > > we wont be able to return for a long time. The only thing I can really
> > > see using call_rcu_task() is something that needs to free its data. Why
> > > wait around when all you're going to do is call free? It's basically
> > > just a garbage collector.
> > 
> > Well the waiting has the advantage of being a natural throttle on the
> > amount of memory tied up in the whole scheme.
> 
> Yeah, but it would suck if you do "echo nop > current_tracer" while
> running something that is preventing a task to unload, and have to wait
> a few minutes for that to return.

> > You said 10e3 order things, I then give you a small 32bit arm system.

> Honestly, to allocate 1000 different callbacks now, requires allocating
> 1000 different ops, which are much larger than a single trampoline.
> Right now, the only way to do that is to create a 1000 ftrace
> instances, and those creates a 1000 directories, would would include
> 1000 event directories, each having its own dentry for each event
> (around 900).
> 
> I don't think we need to worry about this use to work on a small arm 32
> bit system and it doesn't work now. The difference in memory
> consumption is not that much.

Well, see the _BIG_ difference is that currently, when I do nop >
current_tracer all that memory is instantly freed again.

With the proposed scheme, if I setup state, reconsider, destroy state,
try again, and generally muck about I can tie up unspecified amounts of
memory.

And being the bumbling idiot that I am, that's actually fairly typical
of how I end up tracing. There's no neat and tidy, I trace something,
look at the trace, script a little, muck about with the settings and
goto 1.


In any case, I think I now fully understand what you're trying to do,
just not sure its all win.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 16:01                                                         ` Peter Zijlstra
@ 2014-08-08 16:10                                                           ` Steven Rostedt
  0 siblings, 0 replies; 122+ messages in thread
From: Steven Rostedt @ 2014-08-08 16:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

On Fri, 8 Aug 2014 18:01:28 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

 
> Well, see the _BIG_ difference is that currently, when I do nop >
> current_tracer all that memory is instantly freed again.
> 
> With the proposed scheme, if I setup state, reconsider, destroy state,
> try again, and generally muck about I can tie up unspecified amounts of
> memory.
> 
> And being the bumbling idiot that I am, that's actually fairly typical
> of how I end up tracing. There's no neat and tidy, I trace something,
> look at the trace, script a little, muck about with the settings and
> goto 1.


It would actually be trivial to make that case never free the
trampoline associated to function tracing. As that is a static ops that
never gets freed, the trampoline it uses doesn't need to be freed
either.

But, if you were to do:

   # cd /sys/kernel/debug/tracing
   # mkdir instances/foo
   # cd instances/foo
1: # echo function > current_tracer
   # echo nop > current_tracer
   goto 1

Then, yeah that could do it.

> 
> 
> In any case, I think I now fully understand what you're trying to do,
> just not sure its all win.

Fair enough.

-- Steve

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 15:39                                                       ` Steven Rostedt
  2014-08-08 16:01                                                         ` Peter Zijlstra
@ 2014-08-08 16:17                                                         ` Peter Zijlstra
  2014-08-08 16:40                                                           ` Steven Rostedt
  1 sibling, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-08 16:17 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

[-- Attachment #1: Type: text/plain, Size: 542 bytes --]

On Fri, Aug 08, 2014 at 11:39:49AM -0400, Steven Rostedt wrote:
> Also, anyone can register a function handler (perf, systemtap, and even
> LTTng can).

So how common is it to have more than 1 function event handler, and how
many of those cases is there significant difference in the actual
functions?

Because that's the case you're optimizing for. Why is that an important
enough case.

I'm thinking I'll never trigger that, I only ever have the one normal
ftrace handler. So no trampolines for me, the code you have now works
just fine.



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 14:58                                                   ` Steven Rostedt
  2014-08-08 15:16                                                     ` Peter Zijlstra
@ 2014-08-08 16:27                                                     ` Peter Zijlstra
  2014-08-08 16:39                                                       ` Paul E. McKenney
                                                                         ` (2 more replies)
  1 sibling, 3 replies; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-08 16:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

[-- Attachment #1: Type: text/plain, Size: 1124 bytes --]

On Fri, Aug 08, 2014 at 10:58:58AM -0400, Steven Rostedt wrote:

> > > No, they are also used by optimized kprobes. This is why optimized
> > > kprobes depend on !CONFIG_PREEMPT. [ added Masami to the discussion ].
> > 
> > How do those work? Is that one where the INT3 relocates the instruction
> > stream into an alternative 'text' and that JMPs back into the original
> > stream at the end?
> 
> No, it's where we replace the 'int3' with a jump to a trampoline that
> simulates an INT3. Speeds things up quite a bit.

OK, so the trivial 'fix' for that is to patch the probe site like:

	preempt_disable();		INC	GS:%__preempt_count
	call trampoline;		CALL	0xDEADBEEF
	preempt_enable();		DEC	GS:%__preempt_count
					JNZ	1f
					CALL	___preempt_schedule
				1f:

At which point the preempt_disable/enable() are the read side primitives
and call_rcu_sched/synchronize_sched are sufficient to release it.

With the per-cpu preempt count stuff we have on x86 that is 4
instructions for the preempt_*() stuff -- they're 'big' instructions
though, since 3 have memops and 2 have a segment prefix.



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 16:27                                                     ` Peter Zijlstra
@ 2014-08-08 16:39                                                       ` Paul E. McKenney
  2014-08-08 16:49                                                         ` Steven Rostedt
  2014-08-08 16:51                                                         ` Peter Zijlstra
  2014-08-08 16:43                                                       ` Steven Rostedt
  2014-08-08 17:27                                                       ` Steven Rostedt
  2 siblings, 2 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-08 16:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

On Fri, Aug 08, 2014 at 06:27:14PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 08, 2014 at 10:58:58AM -0400, Steven Rostedt wrote:
> 
> > > > No, they are also used by optimized kprobes. This is why optimized
> > > > kprobes depend on !CONFIG_PREEMPT. [ added Masami to the discussion ].
> > > 
> > > How do those work? Is that one where the INT3 relocates the instruction
> > > stream into an alternative 'text' and that JMPs back into the original
> > > stream at the end?
> > 
> > No, it's where we replace the 'int3' with a jump to a trampoline that
> > simulates an INT3. Speeds things up quite a bit.
> 
> OK, so the trivial 'fix' for that is to patch the probe site like:
> 
> 	preempt_disable();		INC	GS:%__preempt_count
> 	call trampoline;		CALL	0xDEADBEEF
> 	preempt_enable();		DEC	GS:%__preempt_count
> 					JNZ	1f
> 					CALL	___preempt_schedule
> 				1f:
> 
> At which point the preempt_disable/enable() are the read side primitives
> and call_rcu_sched/synchronize_sched are sufficient to release it.

Unless this is done in idle, at which point RCU-sched is studiously
ignoring any preempt_disable() sections.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 16:17                                                         ` Peter Zijlstra
@ 2014-08-08 16:40                                                           ` Steven Rostedt
  2014-08-08 16:52                                                             ` Peter Zijlstra
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Rostedt @ 2014-08-08 16:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

On Fri, 8 Aug 2014 18:17:09 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Aug 08, 2014 at 11:39:49AM -0400, Steven Rostedt wrote:
> > Also, anyone can register a function handler (perf, systemtap, and even
> > LTTng can).
> 
> So how common is it to have more than 1 function event handler, and how
> many of those cases is there significant difference in the actual
> functions?

For me, it's quite common. Of course, I tend to poweruse function
tracing :-).  I do make instances and trace a few functions here and
there, while tracing all functions.

Note, if you are tracing all functions, all it takes is just adding one
other ops callback to a single function. That is, if you are tracing
all functions, and add another ops, it will make all users use the list
function. That is. Even though only one function has two callbacks
attached to it, all functions will call the list function, which does
make a noticeable difference. I measured it once before, I'll have to
do it again.

Tracing all functions is extremely invasive, and any increase in the
time of the function callback handlers will have a dramatic hit on
performance.

An mcount nop, causes 13% slowdown. Imagine what adding a loop does.


> 
> Because that's the case you're optimizing for. Why is that an important
> enough case.
> 
> I'm thinking I'll never trigger that, I only ever have the one normal
> ftrace handler. So no trampolines for me, the code you have now works
> just fine.
> 


And it will work just fine afterward :-)
 
I will treat any regressions as a significant (high priority) bug.

-- Steve

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 16:27                                                     ` Peter Zijlstra
  2014-08-08 16:39                                                       ` Paul E. McKenney
@ 2014-08-08 16:43                                                       ` Steven Rostedt
  2014-08-08 16:50                                                         ` Peter Zijlstra
  2014-08-08 17:27                                                       ` Steven Rostedt
  2 siblings, 1 reply; 122+ messages in thread
From: Steven Rostedt @ 2014-08-08 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

On Fri, 8 Aug 2014 18:27:14 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Aug 08, 2014 at 10:58:58AM -0400, Steven Rostedt wrote:
> 
> > > > No, they are also used by optimized kprobes. This is why optimized
> > > > kprobes depend on !CONFIG_PREEMPT. [ added Masami to the discussion ].
> > > 
> > > How do those work? Is that one where the INT3 relocates the instruction
> > > stream into an alternative 'text' and that JMPs back into the original
> > > stream at the end?
> > 
> > No, it's where we replace the 'int3' with a jump to a trampoline that
> > simulates an INT3. Speeds things up quite a bit.
> 
> OK, so the trivial 'fix' for that is to patch the probe site like:
> 
> 	preempt_disable();		INC	GS:%__preempt_count
> 	call trampoline;		CALL	0xDEADBEEF
> 	preempt_enable();		DEC	GS:%__preempt_count
> 					JNZ	1f
> 					CALL	___preempt_schedule
> 				1f:
> 
> At which point the preempt_disable/enable() are the read side primitives
> and call_rcu_sched/synchronize_sched are sufficient to release it.
> 
> With the per-cpu preempt count stuff we have on x86 that is 4
> instructions for the preempt_*() stuff -- they're 'big' instructions
> though, since 3 have memops and 2 have a segment prefix.
> 
> 

Now the question is, how do you do that atomically? And safely.
Currently, all we replace at the call sites is a nop that is added by
gcc -pg and us replacing the call mcount with it. That looks much more
complex than our current solution.

-- Steve

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 16:39                                                       ` Paul E. McKenney
@ 2014-08-08 16:49                                                         ` Steven Rostedt
  2014-08-08 16:51                                                         ` Peter Zijlstra
  1 sibling, 0 replies; 122+ messages in thread
From: Steven Rostedt @ 2014-08-08 16:49 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

On Fri, 8 Aug 2014 09:39:05 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:


> Unless this is done in idle, at which point RCU-sched is studiously
> ignoring any preempt_disable() sections.

ftrace doesn't use synchronize_sched() today because of this. It
instead just does a "schedule_on_each_cpu()" and waits for that to
return before going further.

-- Steve

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 16:43                                                       ` Steven Rostedt
@ 2014-08-08 16:50                                                         ` Peter Zijlstra
  0 siblings, 0 replies; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-08 16:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

[-- Attachment #1: Type: text/plain, Size: 2307 bytes --]

On Fri, Aug 08, 2014 at 12:43:40PM -0400, Steven Rostedt wrote:
> On Fri, 8 Aug 2014 18:27:14 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Aug 08, 2014 at 10:58:58AM -0400, Steven Rostedt wrote:
> > 
> > > > > No, they are also used by optimized kprobes. This is why optimized
> > > > > kprobes depend on !CONFIG_PREEMPT. [ added Masami to the discussion ].
> > > > 
> > > > How do those work? Is that one where the INT3 relocates the instruction
> > > > stream into an alternative 'text' and that JMPs back into the original
> > > > stream at the end?
> > > 
> > > No, it's where we replace the 'int3' with a jump to a trampoline that
> > > simulates an INT3. Speeds things up quite a bit.
> > 
> > OK, so the trivial 'fix' for that is to patch the probe site like:
> > 
> > 	preempt_disable();		INC	GS:%__preempt_count
> > 	call trampoline;		CALL	0xDEADBEEF
> > 	preempt_enable();		DEC	GS:%__preempt_count
> > 					JNZ	1f
> > 					CALL	___preempt_schedule
> > 				1f:
> > 
> > At which point the preempt_disable/enable() are the read side primitives
> > and call_rcu_sched/synchronize_sched are sufficient to release it.
> > 
> > With the per-cpu preempt count stuff we have on x86 that is 4
> > instructions for the preempt_*() stuff -- they're 'big' instructions
> > though, since 3 have memops and 2 have a segment prefix.
> > 
> > 
> 
> Now the question is, how do you do that atomically? And safely.
> Currently, all we replace at the call sites is a nop that is added by
> gcc -pg and us replacing the call mcount with it. That looks much more
> complex than our current solution.

Same way kprobes already does it. You can place that kprobe anywhere as
long as the function is long enough. The JMP you write for the optimized
kprobes is often longer than the instruction its patching.

So you start by writing the INT3 which is atomic, after that you 'copy'
the original text you're going to destroy into the tail of the
trampoline, followed by a 'return' JMP.

Then you write the tail end of the above sequence, and finally you
'fixup' the first instruction by removing the INT3.

But yes, I'm not sure that's going to work for the mcount thing, but it
will work for kprobes just fine, since its already doing this afaik.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 16:39                                                       ` Paul E. McKenney
  2014-08-08 16:49                                                         ` Steven Rostedt
@ 2014-08-08 16:51                                                         ` Peter Zijlstra
  2014-08-08 17:09                                                           ` Paul E. McKenney
  1 sibling, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-08 16:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

[-- Attachment #1: Type: text/plain, Size: 1448 bytes --]

On Fri, Aug 08, 2014 at 09:39:05AM -0700, Paul E. McKenney wrote:
> On Fri, Aug 08, 2014 at 06:27:14PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 08, 2014 at 10:58:58AM -0400, Steven Rostedt wrote:
> > 
> > > > > No, they are also used by optimized kprobes. This is why optimized
> > > > > kprobes depend on !CONFIG_PREEMPT. [ added Masami to the discussion ].
> > > > 
> > > > How do those work? Is that one where the INT3 relocates the instruction
> > > > stream into an alternative 'text' and that JMPs back into the original
> > > > stream at the end?
> > > 
> > > No, it's where we replace the 'int3' with a jump to a trampoline that
> > > simulates an INT3. Speeds things up quite a bit.
> > 
> > OK, so the trivial 'fix' for that is to patch the probe site like:
> > 
> > 	preempt_disable();		INC	GS:%__preempt_count
> > 	call trampoline;		CALL	0xDEADBEEF
> > 	preempt_enable();		DEC	GS:%__preempt_count
> > 					JNZ	1f
> > 					CALL	___preempt_schedule
> > 				1f:
> > 
> > At which point the preempt_disable/enable() are the read side primitives
> > and call_rcu_sched/synchronize_sched are sufficient to release it.
> 
> Unless this is done in idle, at which point RCU-sched is studiously
> ignoring any preempt_disable() sections.

Well, given that kprobes is already using it, it 'must' be good ;-) I
suspect much of the idle loop is marked with __kprobe or so, or nobody
has been brave enough to try.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 16:40                                                           ` Steven Rostedt
@ 2014-08-08 16:52                                                             ` Peter Zijlstra
  0 siblings, 0 replies; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-08 16:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

[-- Attachment #1: Type: text/plain, Size: 818 bytes --]

On Fri, Aug 08, 2014 at 12:40:18PM -0400, Steven Rostedt wrote:
> On Fri, 8 Aug 2014 18:17:09 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Aug 08, 2014 at 11:39:49AM -0400, Steven Rostedt wrote:
> > > Also, anyone can register a function handler (perf, systemtap, and even
> > > LTTng can).
> > 
> > So how common is it to have more than 1 function event handler, and how
> > many of those cases is there significant difference in the actual
> > functions?
> 
> For me, it's quite common. Of course, I tend to poweruse function
> tracing :-).  I do make instances and trace a few functions here and
> there, while tracing all functions.

Meh, then you have multiple trace files to combine, which makes the
entire reading/scripting thing just _that_ much harder.

Not useful.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 16:51                                                         ` Peter Zijlstra
@ 2014-08-08 17:09                                                           ` Paul E. McKenney
  0 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-08 17:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

On Fri, Aug 08, 2014 at 06:51:12PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 08, 2014 at 09:39:05AM -0700, Paul E. McKenney wrote:
> > On Fri, Aug 08, 2014 at 06:27:14PM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 08, 2014 at 10:58:58AM -0400, Steven Rostedt wrote:
> > > 
> > > > > > No, they are also used by optimized kprobes. This is why optimized
> > > > > > kprobes depend on !CONFIG_PREEMPT. [ added Masami to the discussion ].
> > > > > 
> > > > > How do those work? Is that one where the INT3 relocates the instruction
> > > > > stream into an alternative 'text' and that JMPs back into the original
> > > > > stream at the end?
> > > > 
> > > > No, it's where we replace the 'int3' with a jump to a trampoline that
> > > > simulates an INT3. Speeds things up quite a bit.
> > > 
> > > OK, so the trivial 'fix' for that is to patch the probe site like:
> > > 
> > > 	preempt_disable();		INC	GS:%__preempt_count
> > > 	call trampoline;		CALL	0xDEADBEEF
> > > 	preempt_enable();		DEC	GS:%__preempt_count
> > > 					JNZ	1f
> > > 					CALL	___preempt_schedule
> > > 				1f:
> > > 
> > > At which point the preempt_disable/enable() are the read side primitives
> > > and call_rcu_sched/synchronize_sched are sufficient to release it.
> > 
> > Unless this is done in idle, at which point RCU-sched is studiously
> > ignoring any preempt_disable() sections.
> 
> Well, given that kprobes is already using it, it 'must' be good ;-) I
> suspect much of the idle loop is marked with __kprobe or so, or nobody
> has been brave enough to try.

Not seeing much in the way of __kprobe, so guessing lack of bravery.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 16:27                                                     ` Peter Zijlstra
  2014-08-08 16:39                                                       ` Paul E. McKenney
  2014-08-08 16:43                                                       ` Steven Rostedt
@ 2014-08-08 17:27                                                       ` Steven Rostedt
  2014-08-09 10:36                                                         ` Masami Hiramatsu
  2 siblings, 1 reply; 122+ messages in thread
From: Steven Rostedt @ 2014-08-08 17:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, tglx, dhowells,
	edumazet, dvhart, fweisbec, bobby.prani, masami.hiramatsu.pt

On Fri, 8 Aug 2014 18:27:14 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Aug 08, 2014 at 10:58:58AM -0400, Steven Rostedt wrote:
> 
> > > > No, they are also used by optimized kprobes. This is why optimized
> > > > kprobes depend on !CONFIG_PREEMPT. [ added Masami to the discussion ].
> > > 
> > > How do those work? Is that one where the INT3 relocates the instruction
> > > stream into an alternative 'text' and that JMPs back into the original
> > > stream at the end?
> > 
> > No, it's where we replace the 'int3' with a jump to a trampoline that
> > simulates an INT3. Speeds things up quite a bit.
> 
> OK, so the trivial 'fix' for that is to patch the probe site like:
> 
> 	preempt_disable();		INC	GS:%__preempt_count
> 	call trampoline;		CALL	0xDEADBEEF
> 	preempt_enable();		DEC	GS:%__preempt_count
> 					JNZ	1f
> 					CALL	___preempt_schedule
> 				1f:
> 
> At which point the preempt_disable/enable() are the read side primitives
> and call_rcu_sched/synchronize_sched are sufficient to release it.
> 
> With the per-cpu preempt count stuff we have on x86 that is 4
> instructions for the preempt_*() stuff -- they're 'big' instructions
> though, since 3 have memops and 2 have a segment prefix.
> 
> 

Well, this looks like it may make kprobes a bit more complex, and even
slow down slightly the optimized probe.

Also note that if we add call_rcu_tasks(), then perf function tracing
can be called directly instead of being added to the trampoline that
disables and enables preemption before calling it.

-- Steve

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (14 preceding siblings ...)
  2014-08-04  1:28   ` Lai Jiangshan
@ 2014-08-08 19:13   ` Peter Zijlstra
  2014-08-08 20:58     ` Paul E. McKenney
  15 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-08 19:13 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani



So I think you can make the entire thing work with
rcu_note_context_switch().

If we have the sync thing do something like:


	for_each_task(t) {
		atomic_inc(&rcu_tasks);
		atomic_or(&t->rcu_attention, RCU_TASK);
		smp_mb__after_atomic();
		if (!t->on_rq) {
			if (atomic_test_and_clear(&t->rcu_attention, RCU_TASK))
				atomic_dec(&rcu_tasks);
		}
	}

	wait_event(&rcu_tasks_wq, !atomic_read(&rcu_tasks));


And then we have rcu_task_note_context_switch() (as called from
rcu_note_context_switch) do:


	/* we want actual context switches, ignore preemption */
	if (preempt_count() & PREEMPT_ACTIVE)
		return;

	/* if not marked for RCU attention, bail */
	if (!(atomic_read(&t->rcu_attention) & RCU_TASK))
		return;

	/* raced with sync_rcu_task(), all done */
	if (!atomic_test_and_clear(&t->rcu_attention, RCU_TASK))
		return;

	/* not the last.. */
	if (!atomic_dec_and_test(&rcu_tasks))
		return;

	wake_up(&rcu_task_rq);


The idea is to share rcu_attention with rcu_preempt, such that we only
touch a single 'extra' cacheline in case RCU doesn't need to pay
attention to this task.

Also, it would be good if we can manage to squeeze this variable in a
cacheline that's already touched by the schedule() so as not to incur
undue overhead.

And on that, you probably should change rcu_sched_rq() to read:

	this_cpu_inc(rcu_sched_data.passed_quiesce);

That avoids touching the per-cpu data offset.

And it would be very good if we could avoid the unconditional IRQ flag
fiddling in rcu_preempt_note_context_switch(), them expensive, this
looks entirely feasibly in the 'normal' case where
t->rcu_read_unlock_special doesn't have RCU_READ_UNLOCK_NEED_QS set.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-08 19:13   ` Peter Zijlstra
@ 2014-08-08 20:58     ` Paul E. McKenney
  2014-08-09  6:15       ` Peter Zijlstra
  2014-08-09 18:33       ` Peter Zijlstra
  0 siblings, 2 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-08 20:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Fri, Aug 08, 2014 at 09:13:26PM +0200, Peter Zijlstra wrote:
> 
> 
> So I think you can make the entire thing work with
> rcu_note_context_switch().
> 
> If we have the sync thing do something like:
> 
> 
> 	for_each_task(t) {
> 		atomic_inc(&rcu_tasks);
> 		atomic_or(&t->rcu_attention, RCU_TASK);
> 		smp_mb__after_atomic();
> 		if (!t->on_rq) {
> 			if (atomic_test_and_clear(&t->rcu_attention, RCU_TASK))
> 				atomic_dec(&rcu_tasks);
> 		}
> 	}
> 
> 	wait_event(&rcu_tasks_wq, !atomic_read(&rcu_tasks));
> 
> 
> And then we have rcu_task_note_context_switch() (as called from
> rcu_note_context_switch) do:
> 
> 
> 	/* we want actual context switches, ignore preemption */
> 	if (preempt_count() & PREEMPT_ACTIVE)
> 		return;
> 
> 	/* if not marked for RCU attention, bail */
> 	if (!(atomic_read(&t->rcu_attention) & RCU_TASK))
> 		return;
> 
> 	/* raced with sync_rcu_task(), all done */
> 	if (!atomic_test_and_clear(&t->rcu_attention, RCU_TASK))
> 		return;
> 
> 	/* not the last.. */
> 	if (!atomic_dec_and_test(&rcu_tasks))
> 		return;
> 
> 	wake_up(&rcu_task_rq);
> 
> 
> The idea is to share rcu_attention with rcu_preempt, such that we only
> touch a single 'extra' cacheline in case RCU doesn't need to pay
> attention to this task.
> 
> Also, it would be good if we can manage to squeeze this variable in a
> cacheline that's already touched by the schedule() so as not to incur
> undue overhead.

This approach does not get me the idle tasks and the NO_HZ_FULL usermode
tasks.  I am pretty sure that I am stuck polling in those cases, so I
might as well poll.

> And on that, you probably should change rcu_sched_rq() to read:
> 
> 	this_cpu_inc(rcu_sched_data.passed_quiesce);
> 
> That avoids touching the per-cpu data offset.

Hmmm...  Interrupts are disabled, so no need to further disable
interrupts.  Storing 1 works fine, no need to increment.  If I followed
the twisty per_cpu passages correctly, my guess is that you would like
me to do something like this:

	__this_cpu_write(rcu_sched_data.passed_quiesce, 1);

Does that work?

> And it would be very good if we could avoid the unconditional IRQ flag
> fiddling in rcu_preempt_note_context_switch(), them expensive, this
> looks entirely feasibly in the 'normal' case where
> t->rcu_read_unlock_special doesn't have RCU_READ_UNLOCK_NEED_QS set.

Agreed, but sometimes RCU_READ_UNLOCK_NEED_QS is set.

That said, I should probably revisit RCU_READ_UNLOCK_NEED_QS.  A lot has
changed since I wrote that code.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-08 20:58     ` Paul E. McKenney
@ 2014-08-09  6:15       ` Peter Zijlstra
  2014-08-09 12:44         ` Steven Rostedt
  2014-08-09 16:01         ` Paul E. McKenney
  2014-08-09 18:33       ` Peter Zijlstra
  1 sibling, 2 replies; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-09  6:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

[-- Attachment #1: Type: text/plain, Size: 2413 bytes --]

On Fri, Aug 08, 2014 at 01:58:26PM -0700, Paul E. McKenney wrote:
> On Fri, Aug 08, 2014 at 09:13:26PM +0200, Peter Zijlstra wrote:
> > 
> > 
> > So I think you can make the entire thing work with
> > rcu_note_context_switch().
> > 
> > If we have the sync thing do something like:
> > 
> > 
> > 	for_each_task(t) {
> > 		atomic_inc(&rcu_tasks);
> > 		atomic_or(&t->rcu_attention, RCU_TASK);
> > 		smp_mb__after_atomic();
> > 		if (!t->on_rq) {
> > 			if (atomic_test_and_clear(&t->rcu_attention, RCU_TASK))
> > 				atomic_dec(&rcu_tasks);
> > 		}
> > 	}
> > 
> > 	wait_event(&rcu_tasks_wq, !atomic_read(&rcu_tasks));
> > 
> > 
> > And then we have rcu_task_note_context_switch() (as called from
> > rcu_note_context_switch) do:
> > 
> > 
> > 	/* we want actual context switches, ignore preemption */
> > 	if (preempt_count() & PREEMPT_ACTIVE)
> > 		return;
> > 
> > 	/* if not marked for RCU attention, bail */
> > 	if (!(atomic_read(&t->rcu_attention) & RCU_TASK))
> > 		return;
> > 
> > 	/* raced with sync_rcu_task(), all done */
> > 	if (!atomic_test_and_clear(&t->rcu_attention, RCU_TASK))
> > 		return;
> > 
> > 	/* not the last.. */
> > 	if (!atomic_dec_and_test(&rcu_tasks))
> > 		return;
> > 
> > 	wake_up(&rcu_task_rq);
> > 
> > 
> > The idea is to share rcu_attention with rcu_preempt, such that we only
> > touch a single 'extra' cacheline in case RCU doesn't need to pay
> > attention to this task.
> > 
> > Also, it would be good if we can manage to squeeze this variable in a
> > cacheline that's already touched by the schedule() so as not to incur
> > undue overhead.
> 
> This approach does not get me the idle tasks and the NO_HZ_FULL usermode
> tasks.  I am pretty sure that I am stuck polling in those cases, so I
> might as well poll.

That's so wrong its not funny. If you need some abortion to deal with
NOHZ_FULL then put it under CONFIG_NOHZ_FULL, don't burden the entire
world with it.

Also, I thought RCU already knew which CPUs were in nohz_full userspace,
so we can insta check that in the sync, together with the !->on_rq test,
if the task is running on a nohz_full cpu in rcu quiescent state, also
clear the task.

As for idle tasks, I'm not sure about those, I think that we should say
NO to anything that would require waking idle CPUs, push the pain to
ftrace/kprobes, we should _not_ be waking idle cpus.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 17:27                                                       ` Steven Rostedt
@ 2014-08-09 10:36                                                         ` Masami Hiramatsu
  0 siblings, 0 replies; 122+ messages in thread
From: Masami Hiramatsu @ 2014-08-09 10:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Paul E. McKenney, Oleg Nesterov, linux-kernel,
	mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	dhowells, edumazet, dvhart, fweisbec, bobby.prani

(2014/08/09 2:27), Steven Rostedt wrote:
> On Fri, 8 Aug 2014 18:27:14 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Fri, Aug 08, 2014 at 10:58:58AM -0400, Steven Rostedt wrote:
>>
>>>>> No, they are also used by optimized kprobes. This is why optimized
>>>>> kprobes depend on !CONFIG_PREEMPT. [ added Masami to the discussion ].
>>>>
>>>> How do those work? Is that one where the INT3 relocates the instruction
>>>> stream into an alternative 'text' and that JMPs back into the original
>>>> stream at the end?
>>>
>>> No, it's where we replace the 'int3' with a jump to a trampoline that
>>> simulates an INT3. Speeds things up quite a bit.
>>
>> OK, so the trivial 'fix' for that is to patch the probe site like:
>>
>> 	preempt_disable();		INC	GS:%__preempt_count
>> 	call trampoline;		CALL	0xDEADBEEF
>> 	preempt_enable();		DEC	GS:%__preempt_count
>> 					JNZ	1f
>> 					CALL	___preempt_schedule
>> 				1f:
>>
>> At which point the preempt_disable/enable() are the read side primitives
>> and call_rcu_sched/synchronize_sched are sufficient to release it.
>>
>> With the per-cpu preempt count stuff we have on x86 that is 4
>> instructions for the preempt_*() stuff -- they're 'big' instructions
>> though, since 3 have memops and 2 have a segment prefix.
>>
>>
> 
> Well, this looks like it may make kprobes a bit more complex, and even
> slow down slightly the optimized probe.

This may not only influence the performance, but also reduce the
applicability of optprobe too much :(
Since the optprobe can replace multiple instructions, it decodes probe
site to find out the basic blocks for avoiding the jump instruction to
across the border of the basic blocks, which can cause another branch
can jump into the jump instruction.
So, patching "big" instruction series for the probe site just reduces
the possibility of jump optimization. I don't think that is practical.

Thank you,

> 
> Also note that if we add call_rcu_tasks(), then perf function tracing
> can be called directly instead of being added to the trampoline that
> disables and enables preemption before calling it.
> 
> -- Steve
> 


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Re: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-08-08 14:28                                                 ` Paul E. McKenney
@ 2014-08-09 10:56                                                   ` Masami Hiramatsu
  0 siblings, 0 replies; 122+ messages in thread
From: Masami Hiramatsu @ 2014-08-09 10:56 UTC (permalink / raw)
  To: paulmck
  Cc: Steven Rostedt, Peter Zijlstra, Oleg Nesterov, linux-kernel,
	mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	dhowells, edumazet, dvhart, fweisbec, bobby.prani,
	yrl.pp-manager.tt

(2014/08/08 23:28), Paul E. McKenney wrote:
> On Fri, Aug 08, 2014 at 10:12:21AM -0400, Steven Rostedt wrote:
>> On Fri, 8 Aug 2014 08:40:20 +0200
>> Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> On Thu, Aug 07, 2014 at 05:18:23PM -0400, Steven Rostedt wrote:
>>>> On Thu, 7 Aug 2014 22:08:13 +0200
>>>> Peter Zijlstra <peterz@infradead.org> wrote:
>>>>
>>>>> OK, you've got to start over and start at the beginning, because I'm
>>>>> really not understanding this..
>>>>>
>>>>> What is a 'trampoline' and what are you going to use them for.
>>>>
>>>> Great question! :-)
>>>>
>>>> The trampoline is some code that is used to jump to and then jump
>>>> someplace else. Currently, we use this for kprobes and ftrace. For
>>>> ftrace we have the ftrace_caller trampoline, which is static. When
>>>> booting, most functions in the kernel call the mcount code which
>>>> simply returns without doing anything. This too is a "trampoline". At
>>>> boot, we convert these calls to nops (as you already know). When we
>>>> enable callbacks from functions, we convert those calls to call
>>>> "ftrace_caller" which is a small assembly trampoline that will call
>>>> some function that registered with ftrace.
>>>>
>>>> Now why do we need the call_rcu_task() routine?
>>>>
>>>> Right now, if you register multiple callbacks to ftrace, even if they
>>>> are not tracing the same routine, ftrace has to change ftrace_caller to
>>>> call another trampoline (in C), that does a loop of all ops registered
>>>> with ftrace, and compares the function to the ops hash tables to see if
>>>> the ops function should be called for that function.
>>>>
>>>> What we want to do is to create a dynamic trampoline that is a copy of
>>>> the ftrace_caller code, but instead of calling this list trampoline, it
>>>> calls the ops function directly. This way, each ops registered with
>>>> ftrace can have its own custom trampoline that when called will only
>>>> call the ops function and not have to iterate over a list. This only
>>>> happens if the function being traced only has this one ops registered.
>>>> For functions with multiple ops attached to it, we need to call the
>>>> list anyway. But for the majority of the cases, this is not the case.
>>>>
>>>> The one caveat for this is, how do we free this custom trampoline when
>>>> the ops is done with it? Especially for users of ftrace that
>>>> dynamically create their own ops (like perf, and ftrace instances).
>>>>
>>>> We need to find a way to free it, but unfortunately, there's no way to
>>>> know when it is safe to free it. There's no way to disable preemption
>>>> or have some other notifier to let us know if a task has jumped to this
>>>> trampoline and has been preempted (sleeping). The only safe way to know
>>>> that no task is on the trampoline is to remove the calls to it,
>>>> synchronize the CPUS (so the trampolines are not even in the caches),
>>>> and then wait for all tasks to go through some quiescent state. This
>>>> state happens to be either not running, in userspace, or when it
>>>> voluntarily calls schedule. Because nothing that uses this trampoline
>>>> should do that, and if the task voluntarily calls schedule, we know
>>>> it's not on the trampoline.
>>>>
>>>> Make sense?
>>>
>>> Ok, so they're purely used in the function prologue/epilogue callchain.
>>
>> No, they are also used by optimized kprobes. This is why optimized
>> kprobes depend on !CONFIG_PREEMPT. [ added Masami to the discussion ].
>>
>> Which reminds me. On !CONFIG_PREEMPT, call_rcu_task() should be
>> equivalent to call_rcu_sched().
> 
> Almost.  One difference is that call_rcu_sched() won't wait for
> idle-task execution.  So presumably you are currently prohibited from
> putting kprobes in idle tasks.

No need to prohibit all kprobes, just prohibit optimizing if the kprobe
is in the idle context (if I can detect it). Since I've already replaced
text-area based __kprobes with list-based NOKPROBE_SYMOBOL in core kernel,
I think it is an option to add NOOPTPROBE_SYMBOL for that purpose.

Thank you,

> Oleg slipped this one past me, and for more than a full hour,
> (https://lkml.org/lkml/2014/8/2/18), but this time I remembered.  ;-)
> 
> 							Thanx, Paul



-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-09  6:15       ` Peter Zijlstra
@ 2014-08-09 12:44         ` Steven Rostedt
  2014-08-09 16:05           ` Paul E. McKenney
  2014-08-09 16:01         ` Paul E. McKenney
  1 sibling, 1 reply; 122+ messages in thread
From: Steven Rostedt @ 2014-08-09 12:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, dhowells, edumazet, dvhart,
	fweisbec, oleg, bobby.prani

On Sat, 9 Aug 2014 08:15:14 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

 
> As for idle tasks, I'm not sure about those, I think that we should say
> NO to anything that would require waking idle CPUs, push the pain to
> ftrace/kprobes, we should _not_ be waking idle cpus.

I agree, but I haven't had a chance to review the patch set (will
probably do that on Monday, just got back from vacation last week and
was inundated by other things). Does the idle waking happen only when
there's something queued in the call_rcu_tasks()? It should definitely
not be waking all the time. That's just wrong.

But if it only wakes when something is queued, it wouldn't be burdening
anything, unless it is needed.

-- Steve

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-09  6:15       ` Peter Zijlstra
  2014-08-09 12:44         ` Steven Rostedt
@ 2014-08-09 16:01         ` Paul E. McKenney
  2014-08-09 18:19           ` Peter Zijlstra
  1 sibling, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-09 16:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Sat, Aug 09, 2014 at 08:15:14AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 08, 2014 at 01:58:26PM -0700, Paul E. McKenney wrote:
> > On Fri, Aug 08, 2014 at 09:13:26PM +0200, Peter Zijlstra wrote:
> > > 
> > > 
> > > So I think you can make the entire thing work with
> > > rcu_note_context_switch().
> > > 
> > > If we have the sync thing do something like:
> > > 
> > > 
> > > 	for_each_task(t) {
> > > 		atomic_inc(&rcu_tasks);
> > > 		atomic_or(&t->rcu_attention, RCU_TASK);
> > > 		smp_mb__after_atomic();
> > > 		if (!t->on_rq) {
> > > 			if (atomic_test_and_clear(&t->rcu_attention, RCU_TASK))
> > > 				atomic_dec(&rcu_tasks);
> > > 		}
> > > 	}
> > > 
> > > 	wait_event(&rcu_tasks_wq, !atomic_read(&rcu_tasks));
> > > 
> > > 
> > > And then we have rcu_task_note_context_switch() (as called from
> > > rcu_note_context_switch) do:
> > > 
> > > 
> > > 	/* we want actual context switches, ignore preemption */
> > > 	if (preempt_count() & PREEMPT_ACTIVE)
> > > 		return;
> > > 
> > > 	/* if not marked for RCU attention, bail */
> > > 	if (!(atomic_read(&t->rcu_attention) & RCU_TASK))
> > > 		return;
> > > 
> > > 	/* raced with sync_rcu_task(), all done */
> > > 	if (!atomic_test_and_clear(&t->rcu_attention, RCU_TASK))
> > > 		return;
> > > 
> > > 	/* not the last.. */
> > > 	if (!atomic_dec_and_test(&rcu_tasks))
> > > 		return;
> > > 
> > > 	wake_up(&rcu_task_rq);
> > > 
> > > 
> > > The idea is to share rcu_attention with rcu_preempt, such that we only
> > > touch a single 'extra' cacheline in case RCU doesn't need to pay
> > > attention to this task.
> > > 
> > > Also, it would be good if we can manage to squeeze this variable in a
> > > cacheline that's already touched by the schedule() so as not to incur
> > > undue overhead.
> > 
> > This approach does not get me the idle tasks and the NO_HZ_FULL usermode
> > tasks.  I am pretty sure that I am stuck polling in those cases, so I
> > might as well poll.
> 
> That's so wrong its not funny. If you need some abortion to deal with
> NOHZ_FULL then put it under CONFIG_NOHZ_FULL, don't burden the entire
> world with it.

Peter, the polling approach actually -reduces- the common-case
per-context-switch burden, as in when RCU-tasks isn't doing anything.
See your own code above.

> Also, I thought RCU already knew which CPUs were in nohz_full userspace,
> so we can insta check that in the sync, together with the !->on_rq test,
> if the task is running on a nohz_full cpu in rcu quiescent state, also
> clear the task.

RCU does know which are in nohz_full userspace, but it needs to handle
the case where they are not nohz_full at the beginning of the RCU-tasks
grace period.  Yes, I could hook into rcu_user_enter(), but that is
backwards from the viewpoint of the common case where there is no
RCU-tasks happening.

> As for idle tasks, I'm not sure about those, I think that we should say
> NO to anything that would require waking idle CPUs, push the pain to
> ftrace/kprobes, we should _not_ be waking idle cpus.

So the current patch set wakes an idle task once per RCU-tasks grace
period, but only when that idle task did not otherwise get awakened.
This is not a real problem.

And it could probably be reduced further, for example, for architectures
where the program counter of sleeping CPUs can be remotely accessed and
where the address of the am-asleep code is known.  I doubt that this
would really be worth it, but it could be done, in theory anyway.  Or, as
Steven suggested earlier, there could be a per-CPU variable that was set
(with approapriate memory ordering) when the CPU was actually sleeping.

So I don't believe that the current wakeup rate is a problem, and it
can be reduced if it proves to be a problem.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-09 12:44         ` Steven Rostedt
@ 2014-08-09 16:05           ` Paul E. McKenney
  0 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-09 16:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, dhowells, edumazet, dvhart,
	fweisbec, oleg, bobby.prani

On Sat, Aug 09, 2014 at 08:44:39AM -0400, Steven Rostedt wrote:
> On Sat, 9 Aug 2014 08:15:14 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> 
> > As for idle tasks, I'm not sure about those, I think that we should say
> > NO to anything that would require waking idle CPUs, push the pain to
> > ftrace/kprobes, we should _not_ be waking idle cpus.
> 
> I agree, but I haven't had a chance to review the patch set (will
> probably do that on Monday, just got back from vacation last week and
> was inundated by other things).

Let me get v5 out first.  I expect to have it out by end of Monday.

>                                 Does the idle waking happen only when
> there's something queued in the call_rcu_tasks()? It should definitely
> not be waking all the time. That's just wrong.
> 
> But if it only wakes when something is queued, it wouldn't be burdening
> anything, unless it is needed.

Indeed, idle waking only happens when there is an RCU-tasks grace period
in progress.  No RCU-tasks grace period, no RCU-tasks idle waking.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-09 16:01         ` Paul E. McKenney
@ 2014-08-09 18:19           ` Peter Zijlstra
  2014-08-09 18:24             ` Peter Zijlstra
  2014-08-10  1:26             ` Paul E. McKenney
  0 siblings, 2 replies; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-09 18:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

[-- Attachment #1: Type: text/plain, Size: 2048 bytes --]

On Sat, Aug 09, 2014 at 09:01:37AM -0700, Paul E. McKenney wrote:
> > That's so wrong its not funny. If you need some abortion to deal with
> > NOHZ_FULL then put it under CONFIG_NOHZ_FULL, don't burden the entire
> > world with it.
> 
> Peter, the polling approach actually -reduces- the common-case
> per-context-switch burden, as in when RCU-tasks isn't doing anything.
> See your own code above.

I'm not seeing it, CONFIG_PREEMPT already touches a per task cacheline
for each context switch. And for !PREEMPT this thing should pretty much
reduce to rcu_sched.

Would not the thing I proposed be a valid rcu_preempt implementation?
one where its rcu read side primitives run from (voluntary) schedule()
to (voluntary) schedule() call and therefore entirely cover smaller
sections.

> > As for idle tasks, I'm not sure about those, I think that we should say
> > NO to anything that would require waking idle CPUs, push the pain to
> > ftrace/kprobes, we should _not_ be waking idle cpus.
> 
> So the current patch set wakes an idle task once per RCU-tasks grace
> period, but only when that idle task did not otherwise get awakened.
> This is not a real problem.

And on the other hand we're trying to reduce random wakeups, so this
sure is a problem. If we don't start, we don't have to fix later.

> And it could probably be reduced further, for example, for architectures
> where the program counter of sleeping CPUs can be remotely accessed and
> where the address of the am-asleep code is known.  I doubt that this
> would really be worth it, but it could be done, in theory anyway.  Or, as
> Steven suggested earlier, there could be a per-CPU variable that was set
> (with approapriate memory ordering) when the CPU was actually sleeping.
> 
> So I don't believe that the current wakeup rate is a problem, and it
> can be reduced if it proves to be a problem.

How about we simply assume 'idle' code, as defined by the rcu idle hooks
are safe? Why do we want to bend over backwards to cover this?

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-09 18:19           ` Peter Zijlstra
@ 2014-08-09 18:24             ` Peter Zijlstra
  2014-08-10  1:29               ` Paul E. McKenney
  2014-08-10  1:26             ` Paul E. McKenney
  1 sibling, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-09 18:24 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

[-- Attachment #1: Type: text/plain, Size: 332 bytes --]

On Sat, Aug 09, 2014 at 08:19:20PM +0200, Peter Zijlstra wrote:
> How about we simply assume 'idle' code, as defined by the rcu idle hooks
> are safe? Why do we want to bend over backwards to cover this?

The thing is, we already have the special rcu trace hooks for tracing
inside this rcu-idle section, so why go beyond this now?

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-08 20:58     ` Paul E. McKenney
  2014-08-09  6:15       ` Peter Zijlstra
@ 2014-08-09 18:33       ` Peter Zijlstra
  2014-08-10  1:38         ` Paul E. McKenney
  1 sibling, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-09 18:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

[-- Attachment #1: Type: text/plain, Size: 2038 bytes --]

On Fri, Aug 08, 2014 at 01:58:26PM -0700, Paul E. McKenney wrote:
> 
> > And on that, you probably should change rcu_sched_rq() to read:
> > 
> > 	this_cpu_inc(rcu_sched_data.passed_quiesce);
> > 
> > That avoids touching the per-cpu data offset.
> 
> Hmmm...  Interrupts are disabled,

No they are not, __schedule()->rcu_note_context_switch()->rcu_sched_qs()
is only called with preemption disabled.

We only disable IRQs later, where we take the rq->lock.

> so no need to further disable
> interrupts.  Storing 1 works fine, no need to increment.  If I followed
> the twisty per_cpu passages correctly, my guess is that you would like
> me to do something like this:
> 
> 	__this_cpu_write(rcu_sched_data.passed_quiesce, 1);
> 
> Does that work?

Yeah, should be more or less similar, the inc might be encoded shorter
due to not requiring an immediate, but who cares :-)

void rcu_sched_qs(int cpu)
{
	if (trace_rcu_grace_period_enabled()) {
		if (!__this_cpu_read(rcu_sched_data.passed_quiesce))
			trace_rcu_grace_period(...);
	}
	__this_cpu_write(rcu_sched_data.passed_quiesce, 1);
}

Would further avoid emitting the conditional in the normal case where
the tracepoint is inactive.

Steve does it make sense to have __DO_TRACE() emit __trace_##name() to
avoid the double static_branch thing?

> > And it would be very good if we could avoid the unconditional IRQ flag
> > fiddling in rcu_preempt_note_context_switch(), them expensive, this
> > looks entirely feasibly in the 'normal' case where
> > t->rcu_read_unlock_special doesn't have RCU_READ_UNLOCK_NEED_QS set.
> 
> Agreed, but sometimes RCU_READ_UNLOCK_NEED_QS is set.
> 
> That said, I should probably revisit RCU_READ_UNLOCK_NEED_QS.  A lot has
> changed since I wrote that code.

Sure, but a conditional testing RCU_READ_UNLOCK_NEED_QS is far cheaper
than poking the IRQ flags. That said, its not entirely clear to me why
that needs IRQs disabled at all, then again I didn't look long and I'm
sure its all subtle.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-09 18:19           ` Peter Zijlstra
  2014-08-09 18:24             ` Peter Zijlstra
@ 2014-08-10  1:26             ` Paul E. McKenney
  2014-08-10  8:12               ` Peter Zijlstra
  1 sibling, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-10  1:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Sat, Aug 09, 2014 at 08:19:20PM +0200, Peter Zijlstra wrote:
> On Sat, Aug 09, 2014 at 09:01:37AM -0700, Paul E. McKenney wrote:
> > > That's so wrong its not funny. If you need some abortion to deal with
> > > NOHZ_FULL then put it under CONFIG_NOHZ_FULL, don't burden the entire
> > > world with it.
> > 
> > Peter, the polling approach actually -reduces- the common-case
> > per-context-switch burden, as in when RCU-tasks isn't doing anything.
> > See your own code above.
> 
> I'm not seeing it, CONFIG_PREEMPT already touches a per task cacheline
> for each context switch. And for !PREEMPT this thing should pretty much
> reduce to rcu_sched.

Except when you do the wakeup operation, in which case you have something
that is either complex, slow and non-scalable, or both.  I am surprised
that you want anything like that on that path.

> Would not the thing I proposed be a valid rcu_preempt implementation?
> one where its rcu read side primitives run from (voluntary) schedule()
> to (voluntary) schedule() call and therefore entirely cover smaller
> sections.

In theory, sure.  In practice, blocking on tasks that are preempted
outside of an RCU read-side critical section would not be a good thing
for normal RCU, which has frequent update operations.  Among other things.

> > > As for idle tasks, I'm not sure about those, I think that we should say
> > > NO to anything that would require waking idle CPUs, push the pain to
> > > ftrace/kprobes, we should _not_ be waking idle cpus.
> > 
> > So the current patch set wakes an idle task once per RCU-tasks grace
> > period, but only when that idle task did not otherwise get awakened.
> > This is not a real problem.
> 
> And on the other hand we're trying to reduce random wakeups, so this
> sure is a problem. If we don't start, we don't have to fix later.

I doubt that a wakeup at the end of certain ftrace operations is going
to be a real problem.

> > And it could probably be reduced further, for example, for architectures
> > where the program counter of sleeping CPUs can be remotely accessed and
> > where the address of the am-asleep code is known.  I doubt that this
> > would really be worth it, but it could be done, in theory anyway.  Or, as
> > Steven suggested earlier, there could be a per-CPU variable that was set
> > (with approapriate memory ordering) when the CPU was actually sleeping.
> > 
> > So I don't believe that the current wakeup rate is a problem, and it
> > can be reduced if it proves to be a problem.
> 
> How about we simply assume 'idle' code, as defined by the rcu idle hooks
> are safe? Why do we want to bend over backwards to cover this?

Steven covered this earlier in this thread.  One addition might be "For
the same reason that event tracing provides the _rcuidle suffix."

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-09 18:24             ` Peter Zijlstra
@ 2014-08-10  1:29               ` Paul E. McKenney
  2014-08-10  8:14                 ` Peter Zijlstra
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-10  1:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Sat, Aug 09, 2014 at 08:24:00PM +0200, Peter Zijlstra wrote:
> On Sat, Aug 09, 2014 at 08:19:20PM +0200, Peter Zijlstra wrote:
> > How about we simply assume 'idle' code, as defined by the rcu idle hooks
> > are safe? Why do we want to bend over backwards to cover this?
> 
> The thing is, we already have the special rcu trace hooks for tracing
> inside this rcu-idle section, so why go beyond this now?

I have to defer to Steven and Masami on this one, but I would guess that
they want the ability to trace the idle loop for the same reasons they
stated earlier.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-09 18:33       ` Peter Zijlstra
@ 2014-08-10  1:38         ` Paul E. McKenney
  2014-08-10 15:00           ` Peter Zijlstra
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-10  1:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Sat, Aug 09, 2014 at 08:33:55PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 08, 2014 at 01:58:26PM -0700, Paul E. McKenney wrote:
> > 
> > > And on that, you probably should change rcu_sched_rq() to read:
> > > 
> > > 	this_cpu_inc(rcu_sched_data.passed_quiesce);
> > > 
> > > That avoids touching the per-cpu data offset.
> > 
> > Hmmm...  Interrupts are disabled,
> 
> No they are not, __schedule()->rcu_note_context_switch()->rcu_sched_qs()
> is only called with preemption disabled.
> 
> We only disable IRQs later, where we take the rq->lock.

You want me not to disable irqs before invoking rcu_preempt_qs() from
rcu_preempt_note_context_switch(), I get that.  But right now, they
really are disabled courtesy of the local_irq_save() before the call
to rcu_preempt_qs() from rcu_preempt_note_context_switch().

> > so no need to further disable
> > interrupts.  Storing 1 works fine, no need to increment.  If I followed
> > the twisty per_cpu passages correctly, my guess is that you would like
> > me to do something like this:
> > 
> > 	__this_cpu_write(rcu_sched_data.passed_quiesce, 1);
> > 
> > Does that work?
> 
> Yeah, should be more or less similar, the inc might be encoded shorter
> due to not requiring an immediate, but who cares :-)
> 
> void rcu_sched_qs(int cpu)
> {
> 	if (trace_rcu_grace_period_enabled()) {
> 		if (!__this_cpu_read(rcu_sched_data.passed_quiesce))
> 			trace_rcu_grace_period(...);
> 	}
> 	__this_cpu_write(rcu_sched_data.passed_quiesce, 1);
> }
> 
> Would further avoid emitting the conditional in the normal case where
> the tracepoint is inactive.

It might be better to avoid storing to rcu_sched_data.passed_quiesce when
it is already 1, though the difference would be quite hard to measure.
In that case, this would work nicely:

static void rcu_preempt_qs(int cpu)
{
	if (rdp->passed_quiesce == 0) {
		trace_rcu_grace_period(TPS("rcu_preempt"), rdp->gpnum, TPS("cpuqs"));
	> 	__this_cpu_write(rcu_sched_data.passed_quiesce, 1);
	}
	current->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_NEED_QS;
}

> Steve does it make sense to have __DO_TRACE() emit __trace_##name() to
> avoid the double static_branch thing?
> 
> > > And it would be very good if we could avoid the unconditional IRQ flag
> > > fiddling in rcu_preempt_note_context_switch(), them expensive, this
> > > looks entirely feasibly in the 'normal' case where
> > > t->rcu_read_unlock_special doesn't have RCU_READ_UNLOCK_NEED_QS set.
> > 
> > Agreed, but sometimes RCU_READ_UNLOCK_NEED_QS is set.
> > 
> > That said, I should probably revisit RCU_READ_UNLOCK_NEED_QS.  A lot has
> > changed since I wrote that code.
> 
> Sure, but a conditional testing RCU_READ_UNLOCK_NEED_QS is far cheaper
> than poking the IRQ flags. That said, its not entirely clear to me why
> that needs IRQs disabled at all, then again I didn't look long and I'm
> sure its all subtle.

This bit gets set from the scheduler-clock interrupt, so disabling
interrupts is the standard approach to avoid confusion.  Might be possible
to avoid it in this case, or make it less frequent, or whatever.  As I
said, I haven't thought much about it since the initial implementation
some years back, so worth worrying about again.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-10  1:26             ` Paul E. McKenney
@ 2014-08-10  8:12               ` Peter Zijlstra
  2014-08-10 16:46                 ` Peter Zijlstra
  2014-08-11  3:23                 ` Paul E. McKenney
  0 siblings, 2 replies; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-10  8:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

[-- Attachment #1: Type: text/plain, Size: 3396 bytes --]

On Sat, Aug 09, 2014 at 06:26:12PM -0700, Paul E. McKenney wrote:
> On Sat, Aug 09, 2014 at 08:19:20PM +0200, Peter Zijlstra wrote:
> > On Sat, Aug 09, 2014 at 09:01:37AM -0700, Paul E. McKenney wrote:
> > > > That's so wrong its not funny. If you need some abortion to deal with
> > > > NOHZ_FULL then put it under CONFIG_NOHZ_FULL, don't burden the entire
> > > > world with it.
> > > 
> > > Peter, the polling approach actually -reduces- the common-case
> > > per-context-switch burden, as in when RCU-tasks isn't doing anything.
> > > See your own code above.
> > 
> > I'm not seeing it, CONFIG_PREEMPT already touches a per task cacheline
> > for each context switch. And for !PREEMPT this thing should pretty much
> > reduce to rcu_sched.
> 
> Except when you do the wakeup operation, in which case you have something
> that is either complex, slow and non-scalable, or both.  I am surprised
> that you want anything like that on that path.

Its a nr_cpus problem at that point, which is massively better than a
nr_tasks problem, and I'm sure we've solved this counter thing many
times, we can do it again.

But for clarity of purpose the single atomic+waitqueue is far easier.

> > Would not the thing I proposed be a valid rcu_preempt implementation?
> > one where its rcu read side primitives run from (voluntary) schedule()
> > to (voluntary) schedule() call and therefore entirely cover smaller
> > sections.
> 
> In theory, sure.  In practice, blocking on tasks that are preempted
> outside of an RCU read-side critical section would not be a good thing
> for normal RCU, which has frequent update operations.  Among other things.

Sure, just looking for parallels and verifying understanding here. By
the very nature of not having read side primitives to limit coverage its
a pessimistic thing.

> > > > As for idle tasks, I'm not sure about those, I think that we should say
> > > > NO to anything that would require waking idle CPUs, push the pain to
> > > > ftrace/kprobes, we should _not_ be waking idle cpus.
> > > 
> > > So the current patch set wakes an idle task once per RCU-tasks grace
> > > period, but only when that idle task did not otherwise get awakened.
> > > This is not a real problem.
> > 
> > And on the other hand we're trying to reduce random wakeups, so this
> > sure is a problem. If we don't start, we don't have to fix later.
> 
> I doubt that a wakeup at the end of certain ftrace operations is going
> to be a real problem.

But its not ftrace, its rcu_task, and if we put it out there, we'll grow
more and more customers, and soon we'll always have users and never let
CPUs sleep.

That's how these things go, so we should really try and push back on
these things, and that's the thing that worries me most in this
discussion, you seem very happy to provide what's asked for without due
consideration of the negatives.

> > > So I don't believe that the current wakeup rate is a problem, and it
> > > can be reduced if it proves to be a problem.
> > 
> > How about we simply assume 'idle' code, as defined by the rcu idle hooks
> > are safe? Why do we want to bend over backwards to cover this?
> 
> Steven covered this earlier in this thread.  One addition might be "For
> the same reason that event tracing provides the _rcuidle suffix."

I really don't think its worth the cost.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-10  1:29               ` Paul E. McKenney
@ 2014-08-10  8:14                 ` Peter Zijlstra
  2014-08-11  3:30                   ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-10  8:14 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

[-- Attachment #1: Type: text/plain, Size: 768 bytes --]

On Sat, Aug 09, 2014 at 06:29:24PM -0700, Paul E. McKenney wrote:
> On Sat, Aug 09, 2014 at 08:24:00PM +0200, Peter Zijlstra wrote:
> > On Sat, Aug 09, 2014 at 08:19:20PM +0200, Peter Zijlstra wrote:
> > > How about we simply assume 'idle' code, as defined by the rcu idle hooks
> > > are safe? Why do we want to bend over backwards to cover this?
> > 
> > The thing is, we already have the special rcu trace hooks for tracing
> > inside this rcu-idle section, so why go beyond this now?
> 
> I have to defer to Steven and Masami on this one, but I would guess that
> they want the ability to trace the idle loop for the same reasons they
> stated earlier.

want want want, I want a damn pony but somehow I'm not getting one. Why
are they getting this?

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-10  1:38         ` Paul E. McKenney
@ 2014-08-10 15:00           ` Peter Zijlstra
  2014-08-11  3:37             ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-10 15:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

[-- Attachment #1: Type: text/plain, Size: 2106 bytes --]

On Sat, Aug 09, 2014 at 06:38:29PM -0700, Paul E. McKenney wrote:
> On Sat, Aug 09, 2014 at 08:33:55PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 08, 2014 at 01:58:26PM -0700, Paul E. McKenney wrote:
> > > 
> > > > And on that, you probably should change rcu_sched_rq() to read:
> > > > 
> > > > 	this_cpu_inc(rcu_sched_data.passed_quiesce);
> > > > 
> > > > That avoids touching the per-cpu data offset.
> > > 
> > > Hmmm...  Interrupts are disabled,
> > 
> > No they are not, __schedule()->rcu_note_context_switch()->rcu_sched_qs()
> > is only called with preemption disabled.
> > 
> > We only disable IRQs later, where we take the rq->lock.
> 
> You want me not to disable irqs before invoking rcu_preempt_qs() from
> rcu_preempt_note_context_switch(), I get that.  But right now, they
> really are disabled courtesy of the local_irq_save() before the call
> to rcu_preempt_qs() from rcu_preempt_note_context_switch().

Ah, confusion there, I said rcu_sched_qs(), you're talking about
rcu_preempt_qs().

Yes the call to rcu_preempt_qs() is unconditionally wrapped in IRQ
disable.

> > void rcu_sched_qs(int cpu)
> > {
> > 	if (trace_rcu_grace_period_enabled()) {
> > 		if (!__this_cpu_read(rcu_sched_data.passed_quiesce))
> > 			trace_rcu_grace_period(...);
> > 	}
> > 	__this_cpu_write(rcu_sched_data.passed_quiesce, 1);
> > }
> > 
> > Would further avoid emitting the conditional in the normal case where
> > the tracepoint is inactive.
> 
> It might be better to avoid storing to rcu_sched_data.passed_quiesce when
> it is already 1, though the difference would be quite hard to measure.
> In that case, this would work nicely:
> 
> static void rcu_preempt_qs(int cpu)
> {
> 	if (rdp->passed_quiesce == 0) {
> 		trace_rcu_grace_period(TPS("rcu_preempt"), rdp->gpnum, TPS("cpuqs"));
> 	> 	__this_cpu_write(rcu_sched_data.passed_quiesce, 1);
> 	}
> 	current->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_NEED_QS;
> }

Yes, that's a consideration, fair enough. Again note the confusion
between sched/preempt. But yes, both can use this 'cleanup'.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-10  8:12               ` Peter Zijlstra
@ 2014-08-10 16:46                 ` Peter Zijlstra
  2014-08-11  3:28                   ` Paul E. McKenney
  2014-08-11  3:23                 ` Paul E. McKenney
  1 sibling, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-10 16:46 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

[-- Attachment #1: Type: text/plain, Size: 3813 bytes --]

On Sun, Aug 10, 2014 at 10:12:54AM +0200, Peter Zijlstra wrote:
> > Steven covered this earlier in this thread.  One addition might be "For
> > the same reason that event tracing provides the _rcuidle suffix."
> 
> I really don't think its worth the cost.

Entirely untested, but something like the below shrinks the
amount of code called under rcu_idle.

Also, why are trace_cpu_idle*() not inside rcu_idle_{enter,exit}() ?
Doesn't seem to make sense to duplicate all that.

Also, the .cpu argument to trace_cpu_idle() seems silly, how is that
_ever_ going to be any other cpu than the current?

Also, the below removes all trace_.*_rcuidle() usage.

---
 arch/x86/kernel/process.c |  2 --
 kernel/sched/idle.c       | 34 ++++++++++++++++++++--------------
 2 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index f804dc935d2a..9fc3fc123887 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -307,9 +307,7 @@ void arch_cpu_idle(void)
  */
 void default_idle(void)
 {
-	trace_cpu_idle_rcuidle(1, smp_processor_id());
 	safe_halt();
-	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
 }
 #ifdef CONFIG_APM_MODULE
 EXPORT_SYMBOL(default_idle);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 9f1608f99819..591c08b0e66a 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -44,13 +44,13 @@ __setup("hlt", cpu_idle_nopoll_setup);
 
 static inline int cpu_idle_poll(void)
 {
+	trace_cpu_idle(0, smp_processor_id());
 	rcu_idle_enter();
-	trace_cpu_idle_rcuidle(0, smp_processor_id());
 	local_irq_enable();
 	while (!tif_need_resched())
 		cpu_relax();
-	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
 	rcu_idle_exit();
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 	return 1;
 }
 
@@ -97,13 +97,6 @@ static void cpuidle_idle_call(void)
 	stop_critical_timings();
 
 	/*
-	 * Tell the RCU framework we are entering an idle section,
-	 * so no more rcu read side critical sections and one more
-	 * step to the grace period
-	 */
-	rcu_idle_enter();
-
-	/*
 	 * Ask the cpuidle framework to choose a convenient idle state.
 	 * Fall back to the default arch idle method on errors.
 	 */
@@ -114,10 +107,15 @@ static void cpuidle_idle_call(void)
 		 * We can't use the cpuidle framework, let's use the default
 		 * idle routine.
 		 */
-		if (current_clr_polling_and_test())
+		if (current_clr_polling_and_test()) {
 			local_irq_enable();
-		else
+		} else {
+			trace_cpu_idle(0, smp_processor_id());
+			rcu_idle_enter();
 			arch_cpu_idle();
+			rcu_idle_exit();
+			trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
+		}
 
 		goto exit_idle;
 	}
@@ -147,7 +145,14 @@ static void cpuidle_idle_call(void)
 	    clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu))
 		goto use_default;
 
-	trace_cpu_idle_rcuidle(next_state, dev->cpu);
+	trace_cpu_idle(next_state, dev->cpu);
+
+	/*
+	 * Tell the RCU framework we are entering an idle section,
+	 * so no more rcu read side critical sections and one more
+	 * step to the grace period
+	 */
+	rcu_idle_enter();
 
 	/*
 	 * Enter the idle state previously returned by the governor decision.
@@ -156,7 +161,9 @@ static void cpuidle_idle_call(void)
 	 */
 	entered_state = cpuidle_enter(drv, dev, next_state);
 
-	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
+	rcu_idle_exit();
+
+	trace_cpu_idle(PWR_EVENT_EXIT, dev->cpu);
 
 	if (broadcast)
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
@@ -175,7 +182,6 @@ static void cpuidle_idle_call(void)
 	if (WARN_ON_ONCE(irqs_disabled()))
 		local_irq_enable();
 
-	rcu_idle_exit();
 	start_critical_timings();
 }
 

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-10  8:12               ` Peter Zijlstra
  2014-08-10 16:46                 ` Peter Zijlstra
@ 2014-08-11  3:23                 ` Paul E. McKenney
  1 sibling, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-11  3:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Sun, Aug 10, 2014 at 10:12:54AM +0200, Peter Zijlstra wrote:
> On Sat, Aug 09, 2014 at 06:26:12PM -0700, Paul E. McKenney wrote:
> > On Sat, Aug 09, 2014 at 08:19:20PM +0200, Peter Zijlstra wrote:
> > > On Sat, Aug 09, 2014 at 09:01:37AM -0700, Paul E. McKenney wrote:
> > > > > That's so wrong its not funny. If you need some abortion to deal with
> > > > > NOHZ_FULL then put it under CONFIG_NOHZ_FULL, don't burden the entire
> > > > > world with it.
> > > > 
> > > > Peter, the polling approach actually -reduces- the common-case
> > > > per-context-switch burden, as in when RCU-tasks isn't doing anything.
> > > > See your own code above.
> > > 
> > > I'm not seeing it, CONFIG_PREEMPT already touches a per task cacheline
> > > for each context switch. And for !PREEMPT this thing should pretty much
> > > reduce to rcu_sched.
> > 
> > Except when you do the wakeup operation, in which case you have something
> > that is either complex, slow and non-scalable, or both.  I am surprised
> > that you want anything like that on that path.
> 
> Its a nr_cpus problem at that point, which is massively better than a
> nr_tasks problem, and I'm sure we've solved this counter thing many
> times, we can do it again.
> 
> But for clarity of purpose the single atomic+waitqueue is far easier.

Either way, it is very likely an NR_CPUS problem after the first scan
of the task list.  Sure, you could have a million preempted tasks on an
eight-CPU system, but if that is the case, an occasional poll loop is
the very least of your worries.

And if we were trying to produce a textbook example, I might agree with
your "clarity of purpose" point.  As it is, sorry, but no.

> > > Would not the thing I proposed be a valid rcu_preempt implementation?
> > > one where its rcu read side primitives run from (voluntary) schedule()
> > > to (voluntary) schedule() call and therefore entirely cover smaller
> > > sections.
> > 
> > In theory, sure.  In practice, blocking on tasks that are preempted
> > outside of an RCU read-side critical section would not be a good thing
> > for normal RCU, which has frequent update operations.  Among other things.
> 
> Sure, just looking for parallels and verifying understanding here. By
> the very nature of not having read side primitives to limit coverage its
> a pessimistic thing.

Fair enough!

> > > > > As for idle tasks, I'm not sure about those, I think that we should say
> > > > > NO to anything that would require waking idle CPUs, push the pain to
> > > > > ftrace/kprobes, we should _not_ be waking idle cpus.
> > > > 
> > > > So the current patch set wakes an idle task once per RCU-tasks grace
> > > > period, but only when that idle task did not otherwise get awakened.
> > > > This is not a real problem.
> > > 
> > > And on the other hand we're trying to reduce random wakeups, so this
> > > sure is a problem. If we don't start, we don't have to fix later.
> > 
> > I doubt that a wakeup at the end of certain ftrace operations is going
> > to be a real problem.
> 
> But its not ftrace, its rcu_task, and if we put it out there, we'll grow
> more and more customers, and soon we'll always have users and never let
> CPUs sleep.
> 
> That's how these things go, so we should really try and push back on
> these things, and that's the thing that worries me most in this
> discussion, you seem very happy to provide what's asked for without due
> consideration of the negatives.

Good point.

So I could make all the entry points static, and call into ftrace and
friends passing them the addresses of the entry points.  I would also
need to pass them to rcutorture.  Then no one uses this stuff without
express permission.

> > > > So I don't believe that the current wakeup rate is a problem, and it
> > > > can be reduced if it proves to be a problem.
> > > 
> > > How about we simply assume 'idle' code, as defined by the rcu idle hooks
> > > are safe? Why do we want to bend over backwards to cover this?
> > 
> > Steven covered this earlier in this thread.  One addition might be "For
> > the same reason that event tracing provides the _rcuidle suffix."
> 
> I really don't think its worth the cost.

That part has been coming in loud and clear for quite some time now.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-10 16:46                 ` Peter Zijlstra
@ 2014-08-11  3:28                   ` Paul E. McKenney
  0 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-11  3:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Sun, Aug 10, 2014 at 06:46:33PM +0200, Peter Zijlstra wrote:
> On Sun, Aug 10, 2014 at 10:12:54AM +0200, Peter Zijlstra wrote:
> > > Steven covered this earlier in this thread.  One addition might be "For
> > > the same reason that event tracing provides the _rcuidle suffix."
> > 
> > I really don't think its worth the cost.
> 
> Entirely untested, but something like the below shrinks the
> amount of code called under rcu_idle.

If this can be shrunk far enough on enough architectures, I would be
happy to drop the idle-wakeup stuff.

> Also, why are trace_cpu_idle*() not inside rcu_idle_{enter,exit}() ?
> Doesn't seem to make sense to duplicate all that.

There was some rumor about some architecture having portions of the
idle-exit path implemented using hardware assist.  No idea if there
was any truth to that rumor, but if there is, that might constrain
where rcu_idle_{enter,exit}() go.  Other than that, it was just a
desire to avoid touching arch-specific code, which is only a desire
rather than any sort of hard requirement.

> Also, the .cpu argument to trace_cpu_idle() seems silly, how is that
> _ever_ going to be any other cpu than the current?

Beat me.

> Also, the below removes all trace_.*_rcuidle() usage.

If doing that works, that could be good!

							Thanx, Paul

> ---
>  arch/x86/kernel/process.c |  2 --
>  kernel/sched/idle.c       | 34 ++++++++++++++++++++--------------
>  2 files changed, 20 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index f804dc935d2a..9fc3fc123887 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -307,9 +307,7 @@ void arch_cpu_idle(void)
>   */
>  void default_idle(void)
>  {
> -	trace_cpu_idle_rcuidle(1, smp_processor_id());
>  	safe_halt();
> -	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
>  }
>  #ifdef CONFIG_APM_MODULE
>  EXPORT_SYMBOL(default_idle);
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 9f1608f99819..591c08b0e66a 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -44,13 +44,13 @@ __setup("hlt", cpu_idle_nopoll_setup);
>  
>  static inline int cpu_idle_poll(void)
>  {
> +	trace_cpu_idle(0, smp_processor_id());
>  	rcu_idle_enter();
> -	trace_cpu_idle_rcuidle(0, smp_processor_id());
>  	local_irq_enable();
>  	while (!tif_need_resched())
>  		cpu_relax();
> -	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
>  	rcu_idle_exit();
> +	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
>  	return 1;
>  }
>  
> @@ -97,13 +97,6 @@ static void cpuidle_idle_call(void)
>  	stop_critical_timings();
>  
>  	/*
> -	 * Tell the RCU framework we are entering an idle section,
> -	 * so no more rcu read side critical sections and one more
> -	 * step to the grace period
> -	 */
> -	rcu_idle_enter();
> -
> -	/*
>  	 * Ask the cpuidle framework to choose a convenient idle state.
>  	 * Fall back to the default arch idle method on errors.
>  	 */
> @@ -114,10 +107,15 @@ static void cpuidle_idle_call(void)
>  		 * We can't use the cpuidle framework, let's use the default
>  		 * idle routine.
>  		 */
> -		if (current_clr_polling_and_test())
> +		if (current_clr_polling_and_test()) {
>  			local_irq_enable();
> -		else
> +		} else {
> +			trace_cpu_idle(0, smp_processor_id());
> +			rcu_idle_enter();
>  			arch_cpu_idle();
> +			rcu_idle_exit();
> +			trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
> +		}
>  
>  		goto exit_idle;
>  	}
> @@ -147,7 +145,14 @@ static void cpuidle_idle_call(void)
>  	    clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu))
>  		goto use_default;
>  
> -	trace_cpu_idle_rcuidle(next_state, dev->cpu);
> +	trace_cpu_idle(next_state, dev->cpu);
> +
> +	/*
> +	 * Tell the RCU framework we are entering an idle section,
> +	 * so no more rcu read side critical sections and one more
> +	 * step to the grace period
> +	 */
> +	rcu_idle_enter();
>  
>  	/*
>  	 * Enter the idle state previously returned by the governor decision.
> @@ -156,7 +161,9 @@ static void cpuidle_idle_call(void)
>  	 */
>  	entered_state = cpuidle_enter(drv, dev, next_state);
>  
> -	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
> +	rcu_idle_exit();
> +
> +	trace_cpu_idle(PWR_EVENT_EXIT, dev->cpu);
>  
>  	if (broadcast)
>  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
> @@ -175,7 +182,6 @@ static void cpuidle_idle_call(void)
>  	if (WARN_ON_ONCE(irqs_disabled()))
>  		local_irq_enable();
>  
> -	rcu_idle_exit();
>  	start_critical_timings();
>  }
>  



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-10  8:14                 ` Peter Zijlstra
@ 2014-08-11  3:30                   ` Paul E. McKenney
  2014-08-11 11:57                     ` Peter Zijlstra
  0 siblings, 1 reply; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-11  3:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Sun, Aug 10, 2014 at 10:14:25AM +0200, Peter Zijlstra wrote:
> On Sat, Aug 09, 2014 at 06:29:24PM -0700, Paul E. McKenney wrote:
> > On Sat, Aug 09, 2014 at 08:24:00PM +0200, Peter Zijlstra wrote:
> > > On Sat, Aug 09, 2014 at 08:19:20PM +0200, Peter Zijlstra wrote:
> > > > How about we simply assume 'idle' code, as defined by the rcu idle hooks
> > > > are safe? Why do we want to bend over backwards to cover this?
> > > 
> > > The thing is, we already have the special rcu trace hooks for tracing
> > > inside this rcu-idle section, so why go beyond this now?
> > 
> > I have to defer to Steven and Masami on this one, but I would guess that
> > they want the ability to trace the idle loop for the same reasons they
> > stated earlier.
> 
> want want want, I want a damn pony but somehow I'm not getting one. Why
> are they getting this?

We can only be glad that my daughters' old My Little Pony toys are
long gone (http://en.wikipedia.org/wiki/My_Little_Pony).  Not sure I
would have been able to resist sending one along.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-10 15:00           ` Peter Zijlstra
@ 2014-08-11  3:37             ` Paul E. McKenney
  0 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-11  3:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Sun, Aug 10, 2014 at 05:00:05PM +0200, Peter Zijlstra wrote:
> On Sat, Aug 09, 2014 at 06:38:29PM -0700, Paul E. McKenney wrote:
> > On Sat, Aug 09, 2014 at 08:33:55PM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 08, 2014 at 01:58:26PM -0700, Paul E. McKenney wrote:
> > > > 
> > > > > And on that, you probably should change rcu_sched_rq() to read:
> > > > > 
> > > > > 	this_cpu_inc(rcu_sched_data.passed_quiesce);
> > > > > 
> > > > > That avoids touching the per-cpu data offset.
> > > > 
> > > > Hmmm...  Interrupts are disabled,
> > > 
> > > No they are not, __schedule()->rcu_note_context_switch()->rcu_sched_qs()
> > > is only called with preemption disabled.
> > > 
> > > We only disable IRQs later, where we take the rq->lock.
> > 
> > You want me not to disable irqs before invoking rcu_preempt_qs() from
> > rcu_preempt_note_context_switch(), I get that.  But right now, they
> > really are disabled courtesy of the local_irq_save() before the call
> > to rcu_preempt_qs() from rcu_preempt_note_context_switch().
> 
> Ah, confusion there, I said rcu_sched_qs(), you're talking about
> rcu_preempt_qs().
> 
> Yes the call to rcu_preempt_qs() is unconditionally wrapped in IRQ
> disable.

Apologies for my confusion!  The rcu_sched_qs() call doesn't need
to interact directly with the scheduling-clock interrupt using
read-modify-write variables, so it gets a pass.

> > > void rcu_sched_qs(int cpu)
> > > {
> > > 	if (trace_rcu_grace_period_enabled()) {
> > > 		if (!__this_cpu_read(rcu_sched_data.passed_quiesce))
> > > 			trace_rcu_grace_period(...);
> > > 	}
> > > 	__this_cpu_write(rcu_sched_data.passed_quiesce, 1);
> > > }
> > > 
> > > Would further avoid emitting the conditional in the normal case where
> > > the tracepoint is inactive.
> > 
> > It might be better to avoid storing to rcu_sched_data.passed_quiesce when
> > it is already 1, though the difference would be quite hard to measure.
> > In that case, this would work nicely:
> > 
> > static void rcu_preempt_qs(int cpu)
> > {
> > 	if (rdp->passed_quiesce == 0) {
> > 		trace_rcu_grace_period(TPS("rcu_preempt"), rdp->gpnum, TPS("cpuqs"));
> > 	> 	__this_cpu_write(rcu_sched_data.passed_quiesce, 1);
> > 	}
> > 	current->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_NEED_QS;
> > }
> 
> Yes, that's a consideration, fair enough. Again note the confusion
> between sched/preempt. But yes, both can use this 'cleanup'.

OK, it is on my list.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-11  3:30                   ` Paul E. McKenney
@ 2014-08-11 11:57                     ` Peter Zijlstra
  2014-08-11 16:15                       ` Paul E. McKenney
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2014-08-11 11:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

[-- Attachment #1: Type: text/plain, Size: 506 bytes --]

On Sun, Aug 10, 2014 at 08:30:48PM -0700, Paul E. McKenney wrote:
> On Sun, Aug 10, 2014 at 10:14:25AM +0200, Peter Zijlstra wrote:

> > want want want, I want a damn pony but somehow I'm not getting one. Why
> > are they getting this?
> 
> We can only be glad that my daughters' old My Little Pony toys are
> long gone (http://en.wikipedia.org/wiki/My_Little_Pony).  Not sure I
> would have been able to resist sending one along.  ;-)

LOL, I _knew_ someone was going to propose doing that ;-)

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-11 11:57                     ` Peter Zijlstra
@ 2014-08-11 16:15                       ` Paul E. McKenney
  0 siblings, 0 replies; 122+ messages in thread
From: Paul E. McKenney @ 2014-08-11 16:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Mon, Aug 11, 2014 at 01:57:06PM +0200, Peter Zijlstra wrote:
> On Sun, Aug 10, 2014 at 08:30:48PM -0700, Paul E. McKenney wrote:
> > On Sun, Aug 10, 2014 at 10:14:25AM +0200, Peter Zijlstra wrote:
> 
> > > want want want, I want a damn pony but somehow I'm not getting one. Why
> > > are they getting this?
> > 
> > We can only be glad that my daughters' old My Little Pony toys are
> > long gone (http://en.wikipedia.org/wiki/My_Little_Pony).  Not sure I
> > would have been able to resist sending one along.  ;-)
> 
> LOL, I _knew_ someone was going to propose doing that ;-)

You know me too well, Peter!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 122+ messages in thread

end of thread, other threads:[~2014-08-11 16:15 UTC | newest]

Thread overview: 122+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-31 21:54 [PATCH v3 tip/core/rcu 0/9 Paul E. McKenney
2014-07-31 21:55 ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops Paul E. McKenney
2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks Paul E. McKenney
2014-08-01 15:09     ` Oleg Nesterov
2014-08-01 18:32       ` Paul E. McKenney
2014-08-01 19:44         ` Paul E. McKenney
2014-08-02 14:47           ` Oleg Nesterov
2014-08-02 22:58             ` Paul E. McKenney
2014-08-06  0:57               ` Steven Rostedt
2014-08-06  1:21                 ` Paul E. McKenney
2014-08-06  8:47                   ` Peter Zijlstra
2014-08-06 12:09                     ` Paul E. McKenney
2014-08-06 16:30                       ` Peter Zijlstra
2014-08-06 22:45                         ` Paul E. McKenney
2014-08-07  8:45                           ` Peter Zijlstra
2014-08-07 15:00                             ` Paul E. McKenney
2014-08-07 15:26                               ` Peter Zijlstra
2014-08-07 17:27                                 ` Peter Zijlstra
2014-08-07 18:46                                   ` Peter Zijlstra
2014-08-07 19:49                                     ` Steven Rostedt
2014-08-07 19:53                                       ` Steven Rostedt
2014-08-07 20:08                                         ` Peter Zijlstra
2014-08-07 21:18                                           ` Steven Rostedt
2014-08-08  6:40                                             ` Peter Zijlstra
2014-08-08 14:12                                               ` Steven Rostedt
2014-08-08 14:28                                                 ` Paul E. McKenney
2014-08-09 10:56                                                   ` Masami Hiramatsu
2014-08-08 14:34                                                 ` Peter Zijlstra
2014-08-08 14:58                                                   ` Steven Rostedt
2014-08-08 15:16                                                     ` Peter Zijlstra
2014-08-08 15:39                                                       ` Steven Rostedt
2014-08-08 16:01                                                         ` Peter Zijlstra
2014-08-08 16:10                                                           ` Steven Rostedt
2014-08-08 16:17                                                         ` Peter Zijlstra
2014-08-08 16:40                                                           ` Steven Rostedt
2014-08-08 16:52                                                             ` Peter Zijlstra
2014-08-08 16:27                                                     ` Peter Zijlstra
2014-08-08 16:39                                                       ` Paul E. McKenney
2014-08-08 16:49                                                         ` Steven Rostedt
2014-08-08 16:51                                                         ` Peter Zijlstra
2014-08-08 17:09                                                           ` Paul E. McKenney
2014-08-08 16:43                                                       ` Steven Rostedt
2014-08-08 16:50                                                         ` Peter Zijlstra
2014-08-08 17:27                                                       ` Steven Rostedt
2014-08-09 10:36                                                         ` Masami Hiramatsu
2014-08-07 20:06                                       ` Peter Zijlstra
2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 4/9] rcu: Export RCU-tasks APIs to GPL modules Paul E. McKenney
2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 5/9] rcutorture: Add torture tests for RCU-tasks Paul E. McKenney
2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 6/9] rcutorture: Add RCU-tasks test cases Paul E. McKenney
2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 7/9] rcu: Add stall-warning checks for RCU-tasks Paul E. McKenney
2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 8/9] rcu: Improve RCU-tasks energy efficiency Paul E. McKenney
2014-07-31 21:55   ` [PATCH v3 tip/core/rcu 9/9] documentation: Add verbiage on RCU-tasks stall warning messages Paul E. McKenney
2014-07-31 23:57   ` [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Frederic Weisbecker
2014-08-01  2:04     ` Paul E. McKenney
2014-08-01 15:06       ` Frederic Weisbecker
2014-08-01  1:15   ` Lai Jiangshan
2014-08-01  1:59     ` Paul E. McKenney
2014-08-01  1:31   ` Lai Jiangshan
2014-08-01  2:11     ` Paul E. McKenney
2014-08-01 14:11   ` Oleg Nesterov
2014-08-01 18:28     ` Paul E. McKenney
2014-08-01 18:40       ` Oleg Nesterov
2014-08-02 23:00         ` Paul E. McKenney
2014-08-03 12:57           ` Oleg Nesterov
2014-08-03 22:03             ` Paul E. McKenney
2014-08-04 13:29               ` Oleg Nesterov
2014-08-04 13:48                 ` Paul E. McKenney
2014-08-01 18:57   ` Oleg Nesterov
2014-08-02 22:50     ` Paul E. McKenney
2014-08-02 14:56   ` Oleg Nesterov
2014-08-02 22:57     ` Paul E. McKenney
2014-08-03 13:33       ` Oleg Nesterov
2014-08-03 22:05         ` Paul E. McKenney
2014-08-04  0:37           ` Lai Jiangshan
2014-08-04  1:09             ` Paul E. McKenney
2014-08-04 13:25               ` Oleg Nesterov
2014-08-04 13:51                 ` Paul E. McKenney
2014-08-04 13:52                   ` Paul E. McKenney
2014-08-04 13:32           ` Oleg Nesterov
2014-08-04 19:28             ` Paul E. McKenney
2014-08-04 19:32               ` Oleg Nesterov
2014-08-04  1:28   ` Lai Jiangshan
2014-08-04  7:46     ` Peter Zijlstra
2014-08-04  8:18       ` Lai Jiangshan
2014-08-04 11:50         ` Paul E. McKenney
2014-08-04 12:25           ` Peter Zijlstra
2014-08-04 12:37             ` Paul E. McKenney
2014-08-04 14:56             ` Peter Zijlstra
2014-08-05  0:47               ` Lai Jiangshan
2014-08-05 21:55                 ` Paul E. McKenney
2014-08-06  0:27                   ` Lai Jiangshan
2014-08-06  0:48                     ` Paul E. McKenney
2014-08-06  0:33                   ` Lai Jiangshan
2014-08-06  0:51                     ` Paul E. McKenney
2014-08-06 22:48                       ` Paul E. McKenney
2014-08-07  8:49                   ` Peter Zijlstra
2014-08-07 15:43                     ` Paul E. McKenney
2014-08-07 16:32                       ` Peter Zijlstra
2014-08-07 17:48                         ` Paul E. McKenney
2014-08-08 19:13   ` Peter Zijlstra
2014-08-08 20:58     ` Paul E. McKenney
2014-08-09  6:15       ` Peter Zijlstra
2014-08-09 12:44         ` Steven Rostedt
2014-08-09 16:05           ` Paul E. McKenney
2014-08-09 16:01         ` Paul E. McKenney
2014-08-09 18:19           ` Peter Zijlstra
2014-08-09 18:24             ` Peter Zijlstra
2014-08-10  1:29               ` Paul E. McKenney
2014-08-10  8:14                 ` Peter Zijlstra
2014-08-11  3:30                   ` Paul E. McKenney
2014-08-11 11:57                     ` Peter Zijlstra
2014-08-11 16:15                       ` Paul E. McKenney
2014-08-10  1:26             ` Paul E. McKenney
2014-08-10  8:12               ` Peter Zijlstra
2014-08-10 16:46                 ` Peter Zijlstra
2014-08-11  3:28                   ` Paul E. McKenney
2014-08-11  3:23                 ` Paul E. McKenney
2014-08-09 18:33       ` Peter Zijlstra
2014-08-10  1:38         ` Paul E. McKenney
2014-08-10 15:00           ` Peter Zijlstra
2014-08-11  3:37             ` Paul E. McKenney

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.