* [PATCH tip/core/rcu 0/9] RCU-tasks implementation
@ 2014-07-28 22:55 Paul E. McKenney
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-28 22:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

Hello!

This series provides a prototype of an RCU-tasks implementation, which has
been requested to assist with trampoline removal.  This flavor of RCU
is task-based rather than CPU-based, and has voluntary context switch,
usermode execution, and the idle loop as its only quiescent states.
This selection of quiescent states ensures that at the end of a grace
period, there will no longer be any tasks depending on a trampoline that
was removed before the beginning of that grace period.  This works because
such trampolines do not contain function calls, do not contain voluntary
context switches, do not switch to usermode, and do not switch to idle.
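
For example, a tracer that has just unhooked a dynamically allocated
trampoline might defer its freeing roughly as in the sketch below.
This is purely illustrative: struct my_trampoline, my_trampoline_free(),
and my_trampoline_retire() are hypothetical and not part of this series;
only call_rcu_tasks() itself comes from patch 1.

	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct my_trampoline {
		void *insns;		/* dynamically generated code */
		struct rcu_head rh;	/* for call_rcu_tasks() */
	};

	/* Runs only after no task can still be executing in tp->insns. */
	static void my_trampoline_free(struct rcu_head *rhp)
	{
		struct my_trampoline *tp =
			container_of(rhp, struct my_trampoline, rh);

		kfree(tp->insns);	/* assumes insns was kmalloc()ed */
		kfree(tp);
	}

	/* Caller has already removed all references to the trampoline. */
	static void my_trampoline_retire(struct my_trampoline *tp)
	{
		call_rcu_tasks(&tp->rh, my_trampoline_free);
	}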

The patches in this series are as follows:

1.	Adds the basic call_rcu_tasks() functionality.

2.	Provides cond_resched_rcu_qs() to force quiescent states, including
	RCU-tasks quiescent states, in long loops.

3.	Adds synchronous APIs: synchronize_rcu_tasks() and
	rcu_barrier_tasks().

4.	Adds GPL exports for the above APIs, courtesy of Steven Rostedt.

5.	Adds rcutorture tests for RCU-tasks.

6.	Adds RCU-tasks test cases to rcutorture scripting.

7.	Adds stall-warning checks for RCU-tasks.

8.	Adds synchronization with exiting tasks, preventing RCU-tasks from
	waiting on exited tasks.

9.	Improves RCU-tasks energy efficiency by replacing polling with
	wait/wakeup.

Remaining issues include:

o	The list locking is not yet set up for lockdep.

o	It is not clear that trampolines in functions called from the
	idle loop are correctly handled.  Or if anyone cares about
	trampolines in functions called from the idle loop.

o	The current implementation does not yet recognize tasks that start
	out executing in usermode.  Instead, it waits for the next
	scheduling-clock tick to note them.

o	As a result, the current implementation does not handle nohz_full=
	CPUs executing tasks running in usermode.  There are a couple of
	possible fixes under consideration.

o	If a task is preempted while executing in usermode, the RCU-tasks
	grace period will not end until that task resumes.  (Is there
	some reasonable way to determine that a given preempted task
	was preempted from usermode execution?)

o	RCU-tasks needs to be added to Documentation/RCU.

o	There are probably still bugs.

							Thanx, Paul

------------------------------------------------------------------------

 b/Documentation/kernel-parameters.txt                         |    5 
 b/fs/file.c                                                   |    2 
 b/include/linux/init_task.h                                   |   14 
 b/include/linux/rcupdate.h                                    |   60 
 b/include/linux/rcutiny.h                                     |    1 
 b/include/linux/sched.h                                       |   13 
 b/init/Kconfig                                                |   10 
 b/kernel/rcu/rcutorture.c                                     |   44 
 b/kernel/rcu/tiny.c                                           |    2 
 b/kernel/rcu/tree.c                                           |   14 
 b/kernel/rcu/tree_plugin.h                                    |    4 
 b/kernel/rcu/update.c                                         |  616 +++++++++-
 b/kernel/sched/core.c                                         |    2 
 b/mm/mlock.c                                                  |    2 
 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01      |    7 
 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot |    1 
 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02      |    6 
 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot |    1 
 18 files changed, 749 insertions(+), 55 deletions(-)



* [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-28 22:55 [PATCH tip/core/rcu 0/9] RCU-tasks implementation Paul E. McKenney
@ 2014-07-28 22:56 ` Paul E. McKenney
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops Paul E. McKenney
                     ` (14 more replies)
  0 siblings, 15 replies; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-28 22:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds a new RCU-tasks flavor of RCU, which provides
call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
context switch (not preemption!), userspace execution, and the idle loop.
Note that unlike other RCU flavors, these quiescent states occur in tasks,
not necessarily CPUs.  Includes fixes from Steven Rostedt.

This RCU flavor is assumed to have very infrequent latency-tolerant
updaters.  This assumption permits significant simplifications, including
a single global callback list protected by a single global lock, along
with a single linked list containing all tasks that have not yet passed
through a quiescent state.  If experience shows this assumption to be
incorrect, the required additional complexity will be added.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/init_task.h |   9 +++
 include/linux/rcupdate.h  |  36 +++++++++++
 include/linux/sched.h     |   8 +++
 init/Kconfig              |  10 +++
 kernel/rcu/tiny.c         |   2 +
 kernel/rcu/tree.c         |   2 +
 kernel/rcu/update.c       | 158 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c       |   2 +
 8 files changed, 227 insertions(+)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6df7f9fe0d01..78715ea7c30c 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -124,6 +124,14 @@ extern struct group_info init_groups;
 #else
 #define INIT_TASK_RCU_PREEMPT(tsk)
 #endif
+#ifdef CONFIG_TASKS_RCU
+#define INIT_TASK_RCU_TASKS(tsk)					\
+	.rcu_tasks_holdout = false,					\
+	.rcu_tasks_holdout_list =					\
+		LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
+#else
+#define INIT_TASK_RCU_TASKS(tsk)
+#endif
 
 extern struct cred init_cred;
 
@@ -231,6 +239,7 @@ extern struct task_group root_task_group;
 	INIT_FTRACE_GRAPH						\
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
+	INIT_TASK_RCU_TASKS(tsk)					\
 	INIT_CPUSET_SEQ(tsk)						\
 	INIT_RT_MUTEXES(tsk)						\
 	INIT_VTIME(tsk)							\
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 6a94cc8b1ca0..05656b504295 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,
 
 void synchronize_sched(void);
 
+/**
+ * call_rcu_tasks() - Queue an RCU callback for invocation after a task-based grace period
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. call_rcu_tasks() assumes
+ * that the read-side critical sections end at a voluntary context
+ * switch (not a preemption!), entry into idle, or transition to usermode
+ * execution.  As such, there are no read-side primitives analogous to
+ * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
+ * to determine that all tasks have passed through a safe state, not so
+ * much for data-structure synchronization.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
+ */
+void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
+
 #ifdef CONFIG_PREEMPT_RCU
 
 void __rcu_read_lock(void);
@@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
 		rcu_irq_exit(); \
 	} while (0)
 
+/*
+ * Note a voluntary context switch for RCU-tasks benefit.  This is a
+ * macro rather than an inline function to avoid #include hell.
+ */
+#ifdef CONFIG_TASKS_RCU
+#define rcu_note_voluntary_context_switch(t) \
+	do { \
+		preempt_disable(); /* Exclude synchronize_sched(); */ \
+		if ((t)->rcu_tasks_holdout) \
+			smp_store_release(&(t)->rcu_tasks_holdout, 0); \
+		preempt_enable(); \
+	} while (0)
+#else /* #ifdef CONFIG_TASKS_RCU */
+#define rcu_note_voluntary_context_switch(t)	do { } while (0)
+#endif /* #else #ifdef CONFIG_TASKS_RCU */
+
 #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP)
 bool __rcu_is_watching(void);
 #endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 306f4f0c987a..3e18b7bbe4df 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1273,6 +1273,10 @@ struct task_struct {
 #ifdef CONFIG_RCU_BOOST
 	struct rt_mutex *rcu_boost_mutex;
 #endif /* #ifdef CONFIG_RCU_BOOST */
+#ifdef CONFIG_TASKS_RCU
+	int rcu_tasks_holdout;
+	struct list_head rcu_tasks_holdout_list;
+#endif /* #ifdef CONFIG_TASKS_RCU */
 
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
 	struct sched_info sched_info;
@@ -2013,6 +2017,10 @@ static inline void rcu_copy_process(struct task_struct *p)
 	p->rcu_boost_mutex = NULL;
 #endif /* #ifdef CONFIG_RCU_BOOST */
 	INIT_LIST_HEAD(&p->rcu_node_entry);
+#ifdef CONFIG_TASKS_RCU
+	p->rcu_tasks_holdout = false;
+	INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
+#endif /* #ifdef CONFIG_TASKS_RCU */
 }
 
 #else
diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99af1b9..c56cb62a2df1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -507,6 +507,16 @@ config PREEMPT_RCU
 	  This option enables preemptible-RCU code that is common between
 	  the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.
 
+config TASKS_RCU
+	bool "Task_based RCU implementation using voluntary context switch"
+	default n
+	help
+	  This option enables a task-based RCU implementation that uses
+	  only voluntary context switch (not preemption!), idle, and
+	  user-mode execution as quiescent states.
+
+	  If unsure, say N.
+
 config RCU_STALL_COMMON
 	def_bool ( TREE_RCU || TREE_PREEMPT_RCU || RCU_TRACE )
 	help
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index d9efcc13008c..717f00854fc0 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
 		rcu_sched_qs(cpu);
 	else if (!in_softirq())
 		rcu_bh_qs(cpu);
+	if (user)
+		rcu_note_voluntary_context_switch(current);
 }
 
 /*
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 625d0b0cd75a..f958c52f644d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2413,6 +2413,8 @@ void rcu_check_callbacks(int cpu, int user)
 	rcu_preempt_check_callbacks(cpu);
 	if (rcu_pending(cpu))
 		invoke_rcu_core();
+	if (user)
+		rcu_note_voluntary_context_switch(current);
 	trace_rcu_utilization(TPS("End scheduler-tick"));
 }
 
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index bc7883570530..2bcb1e611460 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -47,6 +47,7 @@
 #include <linux/hardirq.h>
 #include <linux/delay.h>
 #include <linux/module.h>
+#include <linux/kthread.h>
 
 #define CREATE_TRACE_POINTS
 
@@ -350,3 +351,160 @@ static int __init check_cpu_stall_init(void)
 early_initcall(check_cpu_stall_init);
 
 #endif /* #ifdef CONFIG_RCU_STALL_COMMON */
+
+#ifdef CONFIG_TASKS_RCU
+
+/*
+ * Simple variant of RCU whose quiescent states are voluntary context switch,
+ * user-space execution, and idle.  As such, grace periods can take one good
+ * long time.  There are no read-side primitives similar to rcu_read_lock()
+ * and rcu_read_unlock() because this implementation is intended to get
+ * the system into a safe state for some of the manipulations involved in
+ * tracing and the like.  Finally, this implementation does not support
+ * high call_rcu_tasks() rates from multiple CPUs.  If this is required,
+ * per-CPU callback lists will be needed.
+ */
+
+/* Lists of tasks that we are still waiting for during this grace period. */
+static LIST_HEAD(rcu_tasks_holdouts);
+
+/* Global list of callbacks and associated lock. */
+static struct rcu_head *rcu_tasks_cbs_head;
+static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
+static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
+
+/* Post an RCU-tasks callback. */
+void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
+{
+	unsigned long flags;
+
+	rhp->next = NULL;
+	rhp->func = func;
+	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
+	*rcu_tasks_cbs_tail = rhp;
+	rcu_tasks_cbs_tail = &rhp->next;
+	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+}
+EXPORT_SYMBOL_GPL(call_rcu_tasks);
+
+/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
+static int __noreturn rcu_tasks_kthread(void *arg)
+{
+	unsigned long flags;
+	struct task_struct *g, *t;
+	struct rcu_head *list;
+	struct rcu_head *next;
+
+	/* FIXME: Add housekeeping affinity. */
+
+	/*
+	 * Each pass through the following loop makes one check for
+	 * newly arrived callbacks, and, if there are some, waits for
+	 * one RCU-tasks grace period and then invokes the callbacks.
+	 * This loop is terminated by the system going down.  ;-)
+	 */
+	for (;;) {
+
+		/* Pick up any new callbacks. */
+		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
+		smp_mb__after_unlock_lock(); /* Enforce GP memory ordering. */
+		list = rcu_tasks_cbs_head;
+		rcu_tasks_cbs_head = NULL;
+		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
+		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+
+		/* If there were none, wait a bit and start over. */
+		if (!list) {
+			schedule_timeout_interruptible(HZ);
+			flush_signals(current);
+			continue;
+		}
+
+		/*
+		 * There were callbacks, so we need to wait for an
+		 * RCU-tasks grace period.  Start off by scanning
+		 * the task list for tasks that are not already
+		 * voluntarily blocked.  Mark these tasks and make
+		 * a list of them in rcu_tasks_holdouts.
+		 */
+		rcu_read_lock();
+		do_each_thread(g, t) {
+			if (t != current && ACCESS_ONCE(t->on_rq) &&
+			    !is_idle_task(t)) {
+				t->rcu_tasks_holdout = 1;
+				list_add(&t->rcu_tasks_holdout_list,
+					 &rcu_tasks_holdouts);
+			}
+		} while_each_thread(g, t);
+		rcu_read_unlock();
+
+		/*
+		 * The "t != current" and "!is_idle_task()" comparisons
+		 * above are stable, but the "t->on_rq" value could
+		 * change at any time, and is generally unordered.
+		 * Therefore, we need some ordering.  The trick is
+		 * that t->on_rq is updated with a runqueue lock held,
+		 * and thus with interrupts disabled.  So the following
+		 * synchronize_sched() provides the needed ordering by:
+		 * (1) Waiting for all interrupts-disabled code sections
+		 * to complete and (2) The synchronize_sched() ordering
+		 * guarantees, which provide for a memory barrier on each
+		 * CPU since the completion of its last read-side critical
+		 * section, including interrupt-disabled code sections.
+		 */
+		synchronize_sched();
+
+		/*
+		 * Each pass through the following loop scans the list
+		 * of holdout tasks, removing any that are no longer
+		 * holdouts.  When the list is empty, we are done.
+		 */
+		while (!list_empty(&rcu_tasks_holdouts)) {
+			schedule_timeout_interruptible(HZ / 10);
+			flush_signals(current);
+			rcu_read_lock();
+			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
+						rcu_tasks_holdout_list) {
+				if (smp_load_acquire(&t->rcu_tasks_holdout))
+					continue;
+				list_del_init(&t->rcu_tasks_holdout_list);
+				/* @@@ need to check for usermode on CPU. */
+			}
+			rcu_read_unlock();
+		}
+
+		/*
+		 * Most implementations of RCU would need to force a
+		 * memory barrier on all non-idle non-userspace CPUs at
+		 * this point.  The reason that this implementation does
+		 * not is that all of the callbacks are invoked by this
+		 * kthread.  Therefore, the quiescent states on the other
+		 * CPUs need only be ordered against this one kthread,
+		 * and the smp_store_release() and smp_load_acquire()
+		 * on ->rcu_tasks_holdout suffices for this ordering.
+		 */
+
+		/* Invoke the callbacks. */
+		while (list) {
+			next = list->next;
+			local_bh_disable();
+			list->func(list);
+			local_bh_enable();
+			list = next;
+			cond_resched();
+		}
+	}
+}
+
+/* Spawn rcu_tasks_kthread() at boot time. */
+static int __init rcu_spawn_tasks_kthread(void)
+{
+	struct task_struct __maybe_unused *t;
+
+	t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
+	BUG_ON(IS_ERR(t));
+	return 0;
+}
+early_initcall(rcu_spawn_tasks_kthread);
+
+#endif /* #ifdef CONFIG_TASKS_RCU */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bc1638b33449..a0d2f3a03566 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2762,6 +2762,7 @@ need_resched:
 		} else {
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
+			rcu_note_voluntary_context_switch(prev);
 
 			/*
 			 * If a worker went to sleep, notify and ask workqueue
@@ -2828,6 +2829,7 @@ asmlinkage __visible void __sched schedule(void)
 	struct task_struct *tsk = current;
 
 	sched_submit_work(tsk);
+	rcu_note_voluntary_context_switch(tsk);
 	__schedule();
 }
 EXPORT_SYMBOL(schedule);
-- 
1.8.1.5



* [PATCH RFC tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
@ 2014-07-28 22:56   ` Paul E. McKenney
  2014-07-29  7:55     ` Peter Zijlstra
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks Paul E. McKenney
                     ` (13 subsequent siblings)
  14 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-28 22:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

RCU-tasks requires the occasional voluntary context switch
from CPU-bound in-kernel tasks.  In some cases, this requires
instrumenting cond_resched().  However, there is some reluctance
to countenance unconditionally instrumenting cond_resched() (see
http://lwn.net/Articles/603252/), so this commit creates a separate
cond_resched_rcu_qs() that may be used in place of cond_resched() in
locations prone to long-duration in-kernel looping.

This commit currently instruments only RCU-tasks.  Future possibilities
include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
IPI usage.
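
As a rough illustration (struct my_item and process_one_item() are
hypothetical; only cond_resched_rcu_qs() comes from this patch), a
CPU-bound in-kernel loop would be converted as follows:

	#include <linux/rcupdate.h>

	static void process_all_items(struct my_item *items, int n)
	{
		int i;

		for (i = 0; i < n; i++) {
			process_one_item(&items[i]);	/* hypothetical work */
			cond_resched_rcu_qs();		/* was: cond_resched() */
		}
	}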

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 fs/file.c                |  2 +-
 include/linux/rcupdate.h | 13 +++++++++++++
 kernel/rcu/rcutorture.c  |  4 ++--
 kernel/rcu/tree.c        | 12 ++++++------
 kernel/rcu/tree_plugin.h |  2 +-
 mm/mlock.c               |  2 +-
 6 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 66923fe3176e..1cafc4c9275b 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -367,7 +367,7 @@ static struct fdtable *close_files(struct files_struct * files)
 				struct file * file = xchg(&fdt->fd[i], NULL);
 				if (file) {
 					filp_close(file, files);
-					cond_resched();
+					cond_resched_rcu_qs();
 				}
 			}
 			i++;
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 05656b504295..3299ff98ad03 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -330,6 +330,19 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
 #define rcu_note_voluntary_context_switch(t)	do { } while (0)
 #endif /* #else #ifdef CONFIG_TASKS_RCU */
 
+/**
+ * cond_resched_rcu_qs - Report potential quiescent states to RCU
+ *
+ * This macro resembles cond_resched(), except that it is defined to
+ * report potential quiescent states to RCU-tasks even if the cond_resched()
+ * machinery were to be shut off, as some advocate for PREEMPT kernels.
+ */
+#define cond_resched_rcu_qs() \
+do { \
+	rcu_note_voluntary_context_switch(current); \
+	cond_resched(); \
+} while (0)
+
 #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP)
 bool __rcu_is_watching(void);
 #endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 7fa34f86e5ba..febe07062ac5 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -667,7 +667,7 @@ static int rcu_torture_boost(void *arg)
 				}
 				call_rcu_time = jiffies;
 			}
-			cond_resched();
+			cond_resched_rcu_qs();
 			stutter_wait("rcu_torture_boost");
 			if (torture_must_stop())
 				goto checkwait;
@@ -1019,7 +1019,7 @@ rcu_torture_reader(void *arg)
 		__this_cpu_inc(rcu_torture_batch[completed]);
 		preempt_enable();
 		cur_ops->readunlock(idx);
-		cond_resched();
+		cond_resched_rcu_qs();
 		stutter_wait("rcu_torture_reader");
 	} while (!torture_must_stop());
 	if (irqreader && cur_ops->irq_capable) {
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f958c52f644d..645a33efc0d4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1650,7 +1650,7 @@ static int rcu_gp_init(struct rcu_state *rsp)
 		    system_state == SYSTEM_RUNNING)
 			udelay(200);
 #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
-		cond_resched();
+		cond_resched_rcu_qs();
 	}
 
 	mutex_unlock(&rsp->onoff_mutex);
@@ -1739,7 +1739,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 		/* smp_mb() provided by prior unlock-lock pair. */
 		nocb += rcu_future_gp_cleanup(rsp, rnp);
 		raw_spin_unlock_irq(&rnp->lock);
-		cond_resched();
+		cond_resched_rcu_qs();
 	}
 	rnp = rcu_get_root(rsp);
 	raw_spin_lock_irq(&rnp->lock);
@@ -1788,7 +1788,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
 			/* Locking provides needed memory barrier. */
 			if (rcu_gp_init(rsp))
 				break;
-			cond_resched();
+			cond_resched_rcu_qs();
 			flush_signals(current);
 			trace_rcu_grace_period(rsp->name,
 					       ACCESS_ONCE(rsp->gpnum),
@@ -1831,10 +1831,10 @@ static int __noreturn rcu_gp_kthread(void *arg)
 				trace_rcu_grace_period(rsp->name,
 						       ACCESS_ONCE(rsp->gpnum),
 						       TPS("fqsend"));
-				cond_resched();
+				cond_resched_rcu_qs();
 			} else {
 				/* Deal with stray signal. */
-				cond_resched();
+				cond_resched_rcu_qs();
 				flush_signals(current);
 				trace_rcu_grace_period(rsp->name,
 						       ACCESS_ONCE(rsp->gpnum),
@@ -2437,7 +2437,7 @@ static void force_qs_rnp(struct rcu_state *rsp,
 	struct rcu_node *rnp;
 
 	rcu_for_each_leaf_node(rsp, rnp) {
-		cond_resched();
+		cond_resched_rcu_qs();
 		mask = 0;
 		raw_spin_lock_irqsave(&rnp->lock, flags);
 		smp_mb__after_unlock_lock();
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 02ac0fb186b8..a86a363ea453 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1842,7 +1842,7 @@ static int rcu_oom_notify(struct notifier_block *self,
 	get_online_cpus();
 	for_each_online_cpu(cpu) {
 		smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);
-		cond_resched();
+		cond_resched_rcu_qs();
 	}
 	put_online_cpus();
 
diff --git a/mm/mlock.c b/mm/mlock.c
index b1eb53634005..bc386a22d647 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -782,7 +782,7 @@ static int do_mlockall(int flags)
 
 		/* Ignore errors */
 		mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
-		cond_resched();
+		cond_resched_rcu_qs();
 	}
 out:
 	return 0;
-- 
1.8.1.5



* [PATCH RFC tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops Paul E. McKenney
@ 2014-07-28 22:56   ` Paul E. McKenney
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 4/9] rcu: Export RCU-tasks APIs to GPL modules Paul E. McKenney
                     ` (12 subsequent siblings)
  14 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-28 22:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

It turns out to be easier to add the synchronous grace-period waiting
functions to RCU-tasks than to work around their absence in rcutorture,
so this commit adds them.  The key point is that the existence of
call_rcu_tasks() means that rcutorture needs an rcu_barrier_tasks().
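
For update-side code that prefers to block rather than post a callback,
the usage looks roughly like the sketch below.  The unhook helper and
the insns pointer are hypothetical; only synchronize_rcu_tasks() and
rcu_barrier_tasks() are added by this commit.

	static void my_trampoline_retire_sync(void *insns)
	{
		my_trampoline_unhook(insns);	/* hypothetical unhook step */
		synchronize_rcu_tasks();	/* wait for tasks to leave insns */
		kfree(insns);			/* assumes insns was kmalloc()ed */
	}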

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/rcupdate.h |  2 ++
 kernel/rcu/update.c      | 55 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 3299ff98ad03..17c7e25c38be 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -216,6 +216,8 @@ void synchronize_sched(void);
  * memory ordering guarantees.
  */
 void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
+void synchronize_rcu_tasks(void);
+void rcu_barrier_tasks(void);
 
 #ifdef CONFIG_PREEMPT_RCU
 
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 2bcb1e611460..310ecf4f8e74 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -387,6 +387,61 @@ void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
 }
 EXPORT_SYMBOL_GPL(call_rcu_tasks);
 
+/**
+ * synchronize_rcu_tasks - wait until an rcu-tasks grace period has elapsed.
+ *
+ * Control will return to the caller some time after a full rcu-tasks
+ * grace period has elapsed, in other words after all currently
+ * executing rcu-tasks read-side critical sections have elapsed.  These
+ * read-side critical sections are delimited by calls to schedule(),
+ * cond_resched_rcu_qs(), idle execution, userspace execution, calls
+ * to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().
+ *
+ * This is a very specialized primitive, intended only for a few uses in
+ * tracing and other situations requiring manipulation of function
+ * preambles and profiling hooks.  The synchronize_rcu_tasks() function
+ * is not (yet) intended for heavy use from multiple CPUs.
+ *
+ * Note that this guarantee implies further memory-ordering guarantees.
+ * On systems with more than one CPU, when synchronize_rcu_tasks() returns,
+ * each CPU is guaranteed to have executed a full memory barrier since the
+ * end of its last RCU-tasks read-side critical section whose beginning
+ * preceded the call to synchronize_rcu_tasks().  In addition, each CPU
+ * having an RCU-tasks read-side critical section that extends beyond
+ * the return from synchronize_rcu_tasks() is guaranteed to have executed
+ * a full memory barrier after the beginning of synchronize_rcu_tasks()
+ * and before the beginning of that RCU-tasks read-side critical section.
+ * Note that these guarantees include CPUs that are offline, idle, or
+ * executing in user mode, as well as CPUs that are executing in the kernel.
+ *
+ * Furthermore, if CPU A invoked synchronize_rcu_tasks(), which returned
+ * to its caller on CPU B, then both CPU A and CPU B are guaranteed
+ * to have executed a full memory barrier during the execution of
+ * synchronize_rcu_tasks() -- even if CPU A and CPU B are the same CPU
+ * (but again only if the system has more than one CPU).
+ */
+void synchronize_rcu_tasks(void)
+{
+	/* Complain if the scheduler has not started.  */
+	rcu_lockdep_assert(rcu_scheduler_active,
+			   "synchronize_rcu_tasks called too soon");
+
+	/* Wait for the grace period. */
+	wait_rcu_gp(call_rcu_tasks);
+}
+
+/**
+ * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
+ *
+ * Although the current implementation is guaranteed to wait, it is not
+ * obligated to, for example, if there are no pending callbacks.
+ */
+void rcu_barrier_tasks(void)
+{
+	/* There is only one callback queue, so this is easy.  ;-) */
+	synchronize_rcu_tasks();
+}
+
 /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
 static int __noreturn rcu_tasks_kthread(void *arg)
 {
-- 
1.8.1.5



* [PATCH RFC tip/core/rcu 4/9] rcu: Export RCU-tasks APIs to GPL modules
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops Paul E. McKenney
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks Paul E. McKenney
@ 2014-07-28 22:56   ` Paul E. McKenney
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 5/9] rcutorture: Add torture tests for RCU-tasks Paul E. McKenney
                     ` (11 subsequent siblings)
  14 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-28 22:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: Steven Rostedt <rostedt@goodmis.org>

This commit exports the RCU-tasks APIs, call_rcu_tasks(),
synchronize_rcu_tasks(), and rcu_barrier_tasks(), to GPL-licensed
kernel modules.
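
A GPL-licensed module that posts RCU-tasks callbacks must also wait for
them before unloading, so its exit path might look like the following
sketch (my_tracer_shutdown() is hypothetical; the exported APIs are the
ones named above):

	static void __exit my_tracer_exit(void)
	{
		my_tracer_shutdown();	/* hypothetical: stop posting callbacks */
		rcu_barrier_tasks();	/* wait for in-flight call_rcu_tasks() CBs */
	}
	module_exit(my_tracer_exit);

	MODULE_LICENSE("GPL");	/* required, since the exports are EXPORT_SYMBOL_GPL */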

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/update.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 310ecf4f8e74..583f6778f330 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -429,6 +429,7 @@ void synchronize_rcu_tasks(void)
 	/* Wait for the grace period. */
 	wait_rcu_gp(call_rcu_tasks);
 }
+EXPORT_SYMBOL_GPL(synchronize_rcu_tasks);
 
 /**
  * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
@@ -441,6 +442,7 @@ void rcu_barrier_tasks(void)
 	/* There is only one callback queue, so this is easy.  ;-) */
 	synchronize_rcu_tasks();
 }
+EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
 
 /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
 static int __noreturn rcu_tasks_kthread(void *arg)
-- 
1.8.1.5



* [PATCH RFC tip/core/rcu 5/9] rcutorture: Add torture tests for RCU-tasks
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (2 preceding siblings ...)
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 4/9] rcu: Export RCU-tasks APIs to GPL modules Paul E. McKenney
@ 2014-07-28 22:56   ` Paul E. McKenney
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 6/9] rcutorture: Add RCU-tasks test cases Paul E. McKenney
                     ` (10 subsequent siblings)
  14 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-28 22:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds torture tests for RCU-tasks.  It also fixes a bug that
would segfault for an RCU flavor lacking a callback-barrier function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/rcupdate.h |  1 +
 kernel/rcu/rcutorture.c  | 40 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 17c7e25c38be..ecb2198849e0 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -55,6 +55,7 @@ enum rcutorture_type {
 	RCU_FLAVOR,
 	RCU_BH_FLAVOR,
 	RCU_SCHED_FLAVOR,
+	RCU_TASKS_FLAVOR,
 	SRCU_FLAVOR,
 	INVALID_RCU_FLAVOR
 };
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index febe07062ac5..6d12ab6675bc 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -602,6 +602,42 @@ static struct rcu_torture_ops sched_ops = {
 };
 
 /*
+ * Definitions for RCU-tasks torture testing.
+ */
+
+static int tasks_torture_read_lock(void)
+{
+	return 0;
+}
+
+static void tasks_torture_read_unlock(int idx)
+{
+}
+
+static void rcu_tasks_torture_deferred_free(struct rcu_torture *p)
+{
+	call_rcu_tasks(&p->rtort_rcu, rcu_torture_cb);
+}
+
+static struct rcu_torture_ops tasks_ops = {
+	.ttype		= RCU_TASKS_FLAVOR,
+	.init		= rcu_sync_torture_init,
+	.readlock	= tasks_torture_read_lock,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= tasks_torture_read_unlock,
+	.completed	= rcu_no_completed,
+	.deferred_free	= rcu_tasks_torture_deferred_free,
+	.sync		= synchronize_rcu_tasks,
+	.exp_sync	= synchronize_rcu_tasks,
+	.call		= call_rcu_tasks,
+	.cb_barrier	= rcu_barrier_tasks,
+	.fqs		= NULL,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.name		= "tasks"
+};
+
+/*
  * RCU torture priority-boost testing.  Runs one real-time thread per
  * CPU for moderate bursts, repeatedly registering RCU callbacks and
  * spinning waiting for them to be invoked.  If a given callback takes
@@ -1295,7 +1331,8 @@ static int rcu_torture_barrier_cbs(void *arg)
 		if (atomic_dec_and_test(&barrier_cbs_count))
 			wake_up(&barrier_wq);
 	} while (!torture_must_stop());
-	cur_ops->cb_barrier();
+	if (cur_ops->cb_barrier != NULL)
+		cur_ops->cb_barrier();
 	destroy_rcu_head_on_stack(&rcu);
 	torture_kthread_stopping("rcu_torture_barrier_cbs");
 	return 0;
@@ -1534,6 +1571,7 @@ rcu_torture_init(void)
 	int firsterr = 0;
 	static struct rcu_torture_ops *torture_ops[] = {
 		&rcu_ops, &rcu_bh_ops, &rcu_busted_ops, &srcu_ops, &sched_ops,
+		&tasks_ops,
 	};
 
 	if (!torture_init_begin(torture_type, verbose, &rcutorture_runnable))
-- 
1.8.1.5



* [PATCH RFC tip/core/rcu 6/9] rcutorture: Add RCU-tasks test cases
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (3 preceding siblings ...)
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 5/9] rcutorture: Add torture tests for RCU-tasks Paul E. McKenney
@ 2014-07-28 22:56   ` Paul E. McKenney
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 7/9] rcu: Add stall-warning checks for RCU-tasks Paul E. McKenney
                     ` (9 subsequent siblings)
  14 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-28 22:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds the TASKS01 and TASKS02 Kconfig fragments, along with
the corresponding TASKS01.boot and TASKS02.boot boot-parameter files
specifying that rcutorture test RCU-tasks instead of the default flavor.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 tools/testing/selftests/rcutorture/configs/rcu/TASKS01      | 7 +++++++
 tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot | 1 +
 tools/testing/selftests/rcutorture/configs/rcu/TASKS02      | 6 ++++++
 tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot | 1 +
 4 files changed, 15 insertions(+)
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TASKS01
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TASKS02
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS01 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01
new file mode 100644
index 000000000000..263a20f01fae
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01
@@ -0,0 +1,7 @@
+CONFIG_SMP=y
+CONFIG_NR_CPUS=2
+CONFIG_HOTPLUG_CPU=y
+CONFIG_PREEMPT_NONE=n
+CONFIG_PREEMPT_VOLUNTARY=n
+CONFIG_PREEMPT=y
+CONFIG_TASKS_RCU=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot
new file mode 100644
index 000000000000..cd2a188eeb6d
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot
@@ -0,0 +1 @@
+rcutorture.torture_type=tasks
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS02 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02
new file mode 100644
index 000000000000..17b669c8833c
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02
@@ -0,0 +1,6 @@
+CONFIG_SMP=n
+CONFIG_HOTPLUG_CPU=y
+CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_VOLUNTARY=n
+CONFIG_PREEMPT=n
+CONFIG_TASKS_RCU=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot
new file mode 100644
index 000000000000..cd2a188eeb6d
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot
@@ -0,0 +1 @@
+rcutorture.torture_type=tasks
-- 
1.8.1.5



* [PATCH RFC tip/core/rcu 7/9] rcu: Add stall-warning checks for RCU-tasks
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (4 preceding siblings ...)
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 6/9] rcutorture: Add RCU-tasks test cases Paul E. McKenney
@ 2014-07-28 22:56   ` Paul E. McKenney
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 8/9] rcu: Make RCU-tasks track exiting tasks Paul E. McKenney
                     ` (8 subsequent siblings)
  14 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-28 22:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds a three-minute RCU-tasks stall warning.  The actual
time is controlled by the boot/sysfs parameter rcu_task_stall_timeout,
with values less than or equal to zero disabling the stall warnings.
The default value is three minutes, which means that the tasks that
have not yet responded will get their stacks dumped every three minutes,
until they pass through a voluntary context switch.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Conflicts:
	kernel/rcu/update.c
---
 Documentation/kernel-parameters.txt |  5 +++++
 kernel/rcu/update.c                 | 42 +++++++++++++++++++++++++++++++------
 2 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 910c3829f81d..8cdbde7b17f5 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2921,6 +2921,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	rcupdate.rcu_cpu_stall_timeout= [KNL]
 			Set timeout for RCU CPU stall warning messages.
 
+	rcupdate.rcu_task_stall_timeout= [KNL]
+			Set timeout in jiffies for RCU task stall warning
+			messages.  Disable with a value less than or equal
+			to zero.
+
 	rdinit=		[KNL]
 			Format: <full_path>
 			Run specified binary instead of /init from the ramdisk,
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 583f6778f330..0fd68871a191 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -373,6 +373,10 @@ static struct rcu_head *rcu_tasks_cbs_head;
 static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
 static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
 
+/* Control stall timeouts.  Disable with <= 0, otherwise jiffies till stall. */
+static int rcu_task_stall_timeout __read_mostly = HZ * 60 * 3;
+module_param(rcu_task_stall_timeout, int, 0644);
+
 /* Post an RCU-tasks callback. */
 void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
 {
@@ -444,11 +448,30 @@ void rcu_barrier_tasks(void)
 }
 EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
 
+/* See if tasks are still holding out, complain if so. */
+static void check_holdout_task(struct task_struct *t,
+			       bool needreport, bool *firstreport)
+{
+	if (!smp_load_acquire(&t->rcu_tasks_holdout)) {
+		/* @@@ need to check for usermode on CPU. */
+		list_del_rcu(&t->rcu_tasks_holdout_list);
+		return;
+	}
+	if (!needreport)
+		return;
+	if (*firstreport) {
+		pr_err("INFO: rcu_tasks detected stalls on tasks:\n");
+		*firstreport = false;
+	}
+	sched_show_task(t);
+}
+
 /* RCU-tasks kthread that detects grace periods and invokes callbacks. */
 static int __noreturn rcu_tasks_kthread(void *arg)
 {
 	unsigned long flags;
 	struct task_struct *g, *t;
+	unsigned long lastreport;
 	struct rcu_head *list;
 	struct rcu_head *next;
 
@@ -516,17 +539,24 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 		 * of holdout tasks, removing any that are no longer
 		 * holdouts.  When the list is empty, we are done.
 		 */
+		lastreport = jiffies;
 		while (!list_empty(&rcu_tasks_holdouts)) {
+			bool firstreport;
+			bool needreport;
+			int rtst;
+
 			schedule_timeout_interruptible(HZ / 10);
+			rtst = ACCESS_ONCE(rcu_task_stall_timeout);
+			needreport = rtst > 0 &&
+				     time_after(jiffies, lastreport + rtst);
+			if (needreport)
+				lastreport = jiffies;
+			firstreport = true;
 			flush_signals(current);
 			rcu_read_lock();
 			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
-						rcu_tasks_holdout_list) {
-				if (smp_load_acquire(&t->rcu_tasks_holdout))
-					continue;
-				list_del_init(&t->rcu_tasks_holdout_list);
-				/* @@@ need to check for usermode on CPU. */
-			}
+						rcu_tasks_holdout_list)
+				check_holdout_task(t, needreport, &firstreport);
 			rcu_read_unlock();
 		}
 
-- 
1.8.1.5



* [PATCH RFC tip/core/rcu 8/9] rcu: Make RCU-tasks track exiting tasks
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (5 preceding siblings ...)
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 7/9] rcu: Add stall-warning checks for RCU-tasks Paul E. McKenney
@ 2014-07-28 22:56   ` Paul E. McKenney
  2014-07-30 17:04     ` Oleg Nesterov
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 9/9] rcu: Improve RCU-tasks energy efficiency Paul E. McKenney
                     ` (7 subsequent siblings)
  14 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-28 22:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds synchronization with exiting tasks, so that RCU-tasks
avoids waiting on tasks that no longer exist.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Conflicts:
	kernel/rcu/update.c
---
 include/linux/init_task.h |   5 +-
 include/linux/rcupdate.h  |   8 ++
 include/linux/rcutiny.h   |   1 +
 include/linux/sched.h     |   5 +-
 kernel/rcu/tree_plugin.h  |   2 +
 kernel/rcu/update.c       | 345 +++++++++++++++++++++++++++++++++++++++++-----
 6 files changed, 331 insertions(+), 35 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 78715ea7c30c..dd9b2d471270 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -127,8 +127,9 @@ extern struct group_info init_groups;
 #ifdef CONFIG_TASKS_RCU
 #define INIT_TASK_RCU_TASKS(tsk)					\
 	.rcu_tasks_holdout = false,					\
-	.rcu_tasks_holdout_list =					\
-		LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
+	.rcu_tasks_holdout_list.prev = LIST_POISON2,			\
+	.rcu_tasks_lock = __SPIN_LOCK_UNLOCKED(tsk.rcu_tasks_lock),	\
+	.rcu_tasks_exiting = 0,
 #else
 #define INIT_TASK_RCU_TASKS(tsk)
 #endif
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index ecb2198849e0..1c0c286b97df 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -292,6 +292,14 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
 					 struct task_struct *next) { }
 #endif /* CONFIG_RCU_USER_QS */
 
+#ifdef CONFIG_TASKS_RCU
+void exit_rcu_tasks(void);
+#else /* #ifdef CONFIG_TASKS_RCU */
+static inline void exit_rcu_tasks(void)
+{
+}
+#endif /* #else #ifdef CONFIG_TASKS_RCU */
+
 /**
  * RCU_NONIDLE - Indicate idle-loop code that needs RCU readers
  * @a: Code that RCU needs to pay attention to.
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index d40a6a451330..326cd54d0f34 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -129,6 +129,7 @@ static inline void rcu_cpu_stall_reset(void)
 
 static inline void exit_rcu(void)
 {
+	exit_rcu_tasks();
 }
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3e18b7bbe4df..f896b93b29f6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1276,6 +1276,8 @@ struct task_struct {
 #ifdef CONFIG_TASKS_RCU
 	int rcu_tasks_holdout;
 	struct list_head rcu_tasks_holdout_list;
+	spinlock_t rcu_tasks_lock;
+	int rcu_tasks_exiting;
 #endif /* #ifdef CONFIG_TASKS_RCU */
 
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
@@ -2019,7 +2021,8 @@ static inline void rcu_copy_process(struct task_struct *p)
 	INIT_LIST_HEAD(&p->rcu_node_entry);
 #ifdef CONFIG_TASKS_RCU
 	p->rcu_tasks_holdout = false;
-	INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
+	p->rcu_tasks_holdout_list.prev = LIST_POISON2;
+	spin_lock_init(&p->rcu_tasks_lock);
 #endif /* #ifdef CONFIG_TASKS_RCU */
 }
 
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index a86a363ea453..420f4e852c93 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -943,6 +943,7 @@ void exit_rcu(void)
 	barrier();
 	t->rcu_read_unlock_special = RCU_READ_UNLOCK_BLOCKED;
 	__rcu_read_unlock();
+	exit_rcu_tasks();
 }
 
 #else /* #ifdef CONFIG_TREE_PREEMPT_RCU */
@@ -1093,6 +1094,7 @@ static void __init __rcu_init_preempt(void)
  */
 void exit_rcu(void)
 {
+	exit_rcu_tasks();
 }
 
 #endif /* #else #ifdef CONFIG_TREE_PREEMPT_RCU */
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 0fd68871a191..7d3e56f7b623 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -367,6 +367,7 @@ early_initcall(check_cpu_stall_init);
 
 /* Lists of tasks that we are still waiting for during this grace period. */
 static LIST_HEAD(rcu_tasks_holdouts);
+static DEFINE_SPINLOCK(rcu_tasks_global_lock);
 
 /* Global list of callbacks and associated lock. */
 static struct rcu_head *rcu_tasks_cbs_head;
@@ -448,13 +449,303 @@ void rcu_barrier_tasks(void)
 }
 EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
 
+/*
+ * Given a ->rcu_tasks_holdout_list list_head structure, return the
+ * corresponding lock.  For the list header, this is whatever default
+ * is passed in, while for task_struct structures, this is the
+ * ->rcu_tasks_lock field.
+ *
+ * The dflt should be NULL if the caller already holds rcu_tasks_global_lock,
+ * or &rcu_tasks_global_lock otherwise.
+ */
+static spinlock_t *list_to_lock(struct list_head *lhp, spinlock_t *dflt)
+{
+	struct task_struct *tp;
+
+	if (lhp == &rcu_tasks_holdouts)
+		return dflt;
+	tp = container_of(lhp, struct task_struct, rcu_tasks_holdout_list);
+	return &tp->rcu_tasks_lock;
+}
+
+/*
+ * A "list lock" structure is a doubly linked list with a per-element
+ * lock and a global lock protecting the list header.  When combined
+ * with RCU traversal, we get a list with serialized addition, but
+ * where independent deletions can proceed concurrently with each other
+ * and with addition.  It is the caller's responsibility to avoid the
+ * ABA problem, where an element is recycled onto the list so that a
+ * given operation might see both the old and the new version of that
+ * element.  This use case avoids this problem by operating in phases,
+ * so that the list must be seen by all as completely empty between
+ * phases.  A given item may only be added once during a given phase.
+ *
+ * To add an element, you must hold the locks of the two adjacent elements
+ * as well as of the element being added.  To remove an element, you must
+ * hold that element's lock as well as those of its two neighbors.
+ */
+
+/*
+ * Add the specified task_struct structure's ->rcu_tasks_holdout_list
+ * field to the rcu_tasks_holdouts list.
+ *
+ * Please note that this function relies on the fact that it is adding
+ * to one end of the list.  A function adding to the middle of the list
+ * would be much more complex.
+ */
+static void list_lock_add_rcu(struct task_struct *t)
+{
+	spinlock_t *sp;
+
+	spin_lock(&rcu_tasks_global_lock);
+	spin_lock(&t->rcu_tasks_lock);
+	/* Because we hold the global lock, last item cannot change. */
+	BUG_ON(rcu_tasks_holdouts.prev->next != &rcu_tasks_holdouts);
+	sp = list_to_lock(rcu_tasks_holdouts.prev, NULL);
+	if (sp)
+		spin_lock(sp);
+	list_add_tail_rcu(&t->rcu_tasks_holdout_list, &rcu_tasks_holdouts);
+	if (sp)
+		spin_unlock(sp);
+	spin_unlock(&t->rcu_tasks_lock);
+	spin_unlock(&rcu_tasks_global_lock);
+}
+
+/*
+ * Remove the specified task_struct structure's ->rcu_tasks_holdout_list
+ * field from the rcu_tasks_holdouts list.  This function is somewhat more
+ * complex due to the need to avoid deadlock and due to the need to handle
+ * concurrent deletion and addition operations.
+ */
+static void list_lock_del_rcu(struct task_struct *t)
+{
+	struct list_head *prevlh;
+	struct task_struct *prev;
+	struct task_struct *prevck;
+	spinlock_t *prevlock;
+	struct task_struct *next;
+	struct task_struct *nextck;
+	spinlock_t *nextlock;
+	int i = 0;
+	bool gbl = false;
+
+	/*
+	 * First, use trylock primitives to acquire the needed locks.
+	 * These are inherently immune to deadlock.
+	 */
+	for (;;) {
+		/* Avoid having tasks freed out from under us. */
+		rcu_read_lock();
+
+		/* Identify locks to acquire. */
+		prevlh = t->rcu_tasks_holdout_list.prev;
+		if (prevlh == LIST_POISON2)
+			goto rcu_ret; /* Already deleted:  Our work is done! */
+		prev = container_of(prevlh,
+				    struct task_struct, rcu_tasks_holdout_list);
+		prevlock = list_to_lock(&prev->rcu_tasks_holdout_list,
+					&rcu_tasks_global_lock);
+		next = container_of(t->rcu_tasks_holdout_list.next,
+				    struct task_struct, rcu_tasks_holdout_list);
+		nextlock = list_to_lock(&next->rcu_tasks_holdout_list,
+					&rcu_tasks_global_lock);
+		if (nextlock == prevlock)
+			nextlock = NULL; /* Last task, don't deadlock. */
+
+		/* Check for malformed list. */
+		BUG_ON(prevlock == &t->rcu_tasks_lock ||
+		       nextlock == &t->rcu_tasks_lock);
+
+		/* Attempt to acquire the locks identified above. */
+		if (!spin_trylock(prevlock))
+			goto retry_prep;
+		if (!spin_trylock(&t->rcu_tasks_lock))
+			goto retry_unlock_1;
+		if (nextlock && !spin_trylock(nextlock))
+			goto retry_unlock_2;
+
+		/* Did the list change while we were acquiring locks? */
+		prevck = container_of(t->rcu_tasks_holdout_list.prev,
+				      struct task_struct,
+				      rcu_tasks_holdout_list);
+		nextck = container_of(t->rcu_tasks_holdout_list.next,
+				      struct task_struct,
+				      rcu_tasks_holdout_list);
+		if (prevck == prev && nextck == next)
+			goto del_return_unlock; /* No, go delete! */
+
+		/*
+		 * List changed or lock acquisition failed, so drop locks
+		 * and retry.
+		 */
+		if (nextlock)
+			spin_unlock(nextlock);
+retry_unlock_2:
+		spin_unlock(&t->rcu_tasks_lock);
+retry_unlock_1:
+		spin_unlock(prevlock);
+retry_prep:
+		rcu_read_unlock();
+
+		/* Allow some time for locks to be released, then retry. */
+		udelay(3);
+		if (++i > 10)
+			break;
+	}
+
+	/*
+	 * Conditionally acquiring locks failed too many times, so no
+	 * more Mr. Nice Guy.  First acquire the global lock, then
+	 * unconditionally acquire the locks we think that we need.
+	 * Because only the holder of the global lock unconditionally
+	 * acquires the per-task_struct locks, deadlock is avoided.
+	 * (I first heard of this trick from Doug Lea.)
+	 *
+	 * Of course, the list might still change, so we still have
+	 * to check and possibly retry.  Can't have everything!
+	 */
+	gbl = true;
+	spin_lock(&rcu_tasks_global_lock);
+	for (;;) {
+		/* Prevent task_struct from being freed. */
+		rcu_read_lock();
+
+		/*
+		 * Identify locks.  We already hold the global lock, hence
+		 * the NULL dflt argument to list_to_lock().
+		 */
+		prevlh = t->rcu_tasks_holdout_list.prev;
+		if (prevlh == LIST_POISON2)
+			goto unlock_gbl_ret; /* Already deleted! */
+		prev = container_of(prevlh,
+				    struct task_struct, rcu_tasks_holdout_list);
+		prevlock = list_to_lock(&prev->rcu_tasks_holdout_list, NULL);
+		next = container_of(t->rcu_tasks_holdout_list.next,
+				    struct task_struct, rcu_tasks_holdout_list);
+		nextlock = list_to_lock(&next->rcu_tasks_holdout_list, NULL);
+
+		/* Acquire the identified locks. */
+		if (prevlock)
+			spin_lock(prevlock);
+		spin_lock(&t->rcu_tasks_lock);
+		if (nextlock)
+			spin_lock(nextlock);
+
+		/* Check to see if the list changed during lock acquisition. */
+		prevck = container_of(t->rcu_tasks_holdout_list.prev,
+				      struct task_struct,
+				      rcu_tasks_holdout_list);
+		nextck = container_of(t->rcu_tasks_holdout_list.next,
+				      struct task_struct,
+				      rcu_tasks_holdout_list);
+		if (prevck == prev && nextck == next)
+			break;  /* No list changes, go do removal. */
+
+		/* Release the locks, wait a bit, and go retry. */
+		if (nextlock)
+			spin_unlock(nextlock);
+		spin_unlock(&t->rcu_tasks_lock);
+		if (prevlock)
+			spin_unlock(prevlock);
+		rcu_read_unlock();
+		udelay(3);
+	}
+
+	/* We get here once we succeed in acquiring the needed locks. */
+del_return_unlock:
+	/* Remove the element from the list. */
+	list_del_rcu(&t->rcu_tasks_holdout_list);
+
+	/* Release the locks, exit the RCU read-side critical section, done! */
+	if (nextlock)
+		spin_unlock(nextlock);
+	spin_unlock(&t->rcu_tasks_lock);
+	if (prevlock)
+		spin_unlock(prevlock);
+unlock_gbl_ret:
+	if (gbl)
+		spin_unlock(&rcu_tasks_global_lock);
+rcu_ret:
+	rcu_read_unlock();
+}
+
+/*
+ * Build the list of tasks that must be waited on for this RCU-tasks
+ * grace period.  Note that we must wait for pre-existing exiting tasks
+ * to finish exiting in order to avoid the ABA problem.
+ */
+static void rcu_tasks_build_list(void)
+{
+	struct task_struct *g, *t;
+	int n_exiting = 0;
+
+	/*
+	 * Wait for all pre-existing t->on_rq transitions to complete.
+	 * Invoking synchronize_sched() suffices because all t->on_rq
+	 * transitions occur with interrupts disabled.
+	 */
+	synchronize_sched();
+
+	/*
+	 * Scan the task list under RCU protection, accumulating
+	 * tasks that are currently running or preempted that are
+	 * not also in the process of exiting.
+	 */
+	rcu_read_lock();
+	do_each_thread(g, t) {
+		/* Acquire this thread's lock to synchronize with exit. */
+		spin_lock(&t->rcu_tasks_lock);
+		if (t->rcu_tasks_exiting) {
+			/*
+			 * Task is exiting, so don't add to list.  Instead,
+			 * set up to wait for its exiting to complete.
+			 */
+			n_exiting++;
+			t->rcu_tasks_exiting = 2;
+			spin_unlock(&t->rcu_tasks_lock);
+			goto next_thread;
+		}
+
+		/* Assume that we must wait for this task. */
+		ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
+		spin_unlock(&t->rcu_tasks_lock);
+		smp_mb();  /* Order ->rcu_tasks_holdout store before "if". */
+		if (t == current || !ACCESS_ONCE(t->on_rq) || is_idle_task(t)) {
+			smp_store_release(&t->rcu_tasks_holdout, 0);
+			goto next_thread;
+		}
+		list_lock_add_rcu(t);
+next_thread:;
+	} while_each_thread(g, t);
+	rcu_read_unlock();
+
+	/*
+	 * OK, we have our candidate list of threads.  Now wait for
+	 * the threads that were in the process of exiting to finish
+	 * doing so.
+	 */
+	while (n_exiting) {
+		n_exiting = 0;
+		rcu_read_lock();
+		do_each_thread(g, t) {
+			if (ACCESS_ONCE(t->rcu_tasks_exiting) == 2) {
+				n_exiting++;
+				goto wait_exit_again;
+			}
+		} while_each_thread(g, t);
+wait_exit_again:
+		rcu_read_unlock();
+		schedule_timeout_uninterruptible(1);
+	}
+}
+
 /* See if tasks are still holding out, complain if so. */
 static void check_holdout_task(struct task_struct *t,
 			       bool needreport, bool *firstreport)
 {
 	if (!smp_load_acquire(&t->rcu_tasks_holdout)) {
 		/* @@@ need to check for usermode on CPU. */
-		list_del_rcu(&t->rcu_tasks_holdout_list);
+		list_lock_del_rcu(t);
 		return;
 	}
 	if (!needreport)
@@ -470,7 +761,7 @@ static void check_holdout_task(struct task_struct *t,
 static int __noreturn rcu_tasks_kthread(void *arg)
 {
 	unsigned long flags;
-	struct task_struct *g, *t;
+	struct task_struct *t;
 	unsigned long lastreport;
 	struct rcu_head *list;
 	struct rcu_head *next;
@@ -502,37 +793,10 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 
 		/*
 		 * There were callbacks, so we need to wait for an
-		 * RCU-tasks grace period.  Start off by scanning
-		 * the task list for tasks that are not already
-		 * voluntarily blocked.  Mark these tasks and make
-		 * a list of them in rcu_tasks_holdouts.
+		 * RCU-tasks grace period.  Go build the list of
+		 * tasks that must be waited for.
 		 */
-		rcu_read_lock();
-		do_each_thread(g, t) {
-			if (t != current && ACCESS_ONCE(t->on_rq) &&
-			    !is_idle_task(t)) {
-				t->rcu_tasks_holdout = 1;
-				list_add(&t->rcu_tasks_holdout_list,
-					 &rcu_tasks_holdouts);
-			}
-		} while_each_thread(g, t);
-		rcu_read_unlock();
-
-		/*
-		 * The "t != current" and "!is_idle_task()" comparisons
-		 * above are stable, but the "t->on_rq" value could
-		 * change at any time, and is generally unordered.
-		 * Therefore, we need some ordering.  The trick is
-		 * that t->on_rq is updated with a runqueue lock held,
-		 * and thus with interrupts disabled.  So the following
-		 * synchronize_sched() provides the needed ordering by:
-		 * (1) Waiting for all interrupts-disabled code sections
-		 * to complete and (2) The synchronize_sched() ordering
-		 * guarantees, which provide for a memory barrier on each
-		 * CPU since the completion of its last read-side critical
-		 * section, including interrupt-disabled code sections.
-		 */
-		synchronize_sched();
+		rcu_tasks_build_list();
 
 		/*
 		 * Each pass through the following loop scans the list
@@ -594,4 +858,21 @@ static int __init rcu_spawn_tasks_kthread(void)
 }
 early_initcall(rcu_spawn_tasks_kthread);
 
+/*
+ * RCU-tasks hook for exiting tasks.  This hook prevents the current
+ * task from being added to the RCU-tasks list, and also ensures that
+ * any future RCU-tasks grace period will wait for the current task
+ * to finish exiting.
+ */
+void exit_rcu_tasks(void)
+{
+	struct task_struct *t = current;
+
+	cond_resched();
+	spin_lock(&t->rcu_tasks_lock);
+	t->rcu_tasks_exiting = t->rcu_tasks_holdout + 1;
+	spin_unlock(&t->rcu_tasks_lock);
+	list_lock_del_rcu(t);
+}
+
 #endif /* #ifdef CONFIG_TASKS_RCU */
-- 
1.8.1.5



* [PATCH RFC tip/core/rcu 9/9] rcu: Improve RCU-tasks energy efficiency
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (6 preceding siblings ...)
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 8/9] rcu: Make RCU-tasks track exiting tasks Paul E. McKenney
@ 2014-07-28 22:56   ` Paul E. McKenney
  2014-07-29  7:50   ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Peter Zijlstra
                     ` (6 subsequent siblings)
  14 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-28 22:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The current RCU-tasks implementation uses strict polling to detect
callback arrivals.  This works quite well, but is not so good for
energy efficiency.  This commit therefore replaces the strict polling
with a wait queue.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Conflicts:
	kernel/rcu/update.c
---
 kernel/rcu/update.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 7d3e56f7b623..34625ef543f6 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -372,6 +372,7 @@ static DEFINE_SPINLOCK(rcu_tasks_global_lock);
 /* Global list of callbacks and associated lock. */
 static struct rcu_head *rcu_tasks_cbs_head;
 static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
+static DECLARE_WAIT_QUEUE_HEAD(rcu_tasks_cbs_wq);
 static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
 
 /* Control stall timeouts.  Disable with <= 0, otherwise jiffies till stall. */
@@ -382,13 +383,17 @@ module_param(rcu_task_stall_timeout, int, 0644);
 void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
 {
 	unsigned long flags;
+	bool needwake;
 
 	rhp->next = NULL;
 	rhp->func = func;
 	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
+	needwake = !rcu_tasks_cbs_head;
 	*rcu_tasks_cbs_tail = rhp;
 	rcu_tasks_cbs_tail = &rhp->next;
 	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+	if (needwake)
+		wake_up(&rcu_tasks_cbs_wq);
 }
 EXPORT_SYMBOL_GPL(call_rcu_tasks);
 
@@ -786,8 +791,12 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 
 		/* If there were none, wait a bit and start over. */
 		if (!list) {
-			schedule_timeout_interruptible(HZ);
-			flush_signals(current);
+			wait_event_interruptible(rcu_tasks_cbs_wq,
+						 rcu_tasks_cbs_head);
+			if (!rcu_tasks_cbs_head) {
+				flush_signals(current);
+				schedule_timeout_interruptible(HZ/10);
+			}
 			continue;
 		}
 
@@ -844,6 +853,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 			list = next;
 			cond_resched();
 		}
+		schedule_timeout_uninterruptible(HZ/10);
 	}
 }
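
For illustration, here is the same wake-only-on-empty-to-non-empty pattern in
isolation (invented example_* names, not part of the patch): the enqueuer only
issues a wakeup when the list transitions from empty to non-empty, and the
kthread sleeps on the wait queue until the first item shows up.

#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/wait.h>

static LIST_HEAD(example_cbs);
static DEFINE_SPINLOCK(example_lock);
static DECLARE_WAIT_QUEUE_HEAD(example_wq);

/* Enqueue one item; wake the kthread only on the empty->non-empty transition. */
static void example_enqueue(struct list_head *item)
{
	bool needwake;

	spin_lock(&example_lock);
	needwake = list_empty(&example_cbs);
	list_add_tail(item, &example_cbs);
	spin_unlock(&example_lock);
	if (needwake)
		wake_up(&example_wq);
}

/* Consumer: sleep until at least one item is queued, then go drain the list. */
static int example_kthread(void *arg)
{
	while (!kthread_should_stop()) {
		wait_event_interruptible(example_wq, !list_empty(&example_cbs));
		/* ... splice example_cbs onto a local list under example_lock ... */
	}
	return 0;
}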
 
-- 
1.8.1.5



* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (7 preceding siblings ...)
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 9/9] rcu: Improve RCU-tasks energy efficiency Paul E. McKenney
@ 2014-07-29  7:50   ` Peter Zijlstra
  2014-07-29 15:57     ` Paul E. McKenney
  2014-07-29  8:12   ` Peter Zijlstra
                     ` (5 subsequent siblings)
  14 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-07-29  7:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani


On Mon, Jul 28, 2014 at 03:56:12PM -0700, Paul E. McKenney wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bc1638b33449..a0d2f3a03566 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2762,6 +2762,7 @@ need_resched:
>  		} else {
>  			deactivate_task(rq, prev, DEQUEUE_SLEEP);
>  			prev->on_rq = 0;
> +			rcu_note_voluntary_context_switch(prev);
>  
>  			/*
>  			 * If a worker went to sleep, notify and ask workqueue
> @@ -2828,6 +2829,7 @@ asmlinkage __visible void __sched schedule(void)
>  	struct task_struct *tsk = current;
>  
>  	sched_submit_work(tsk);
> +	rcu_note_voluntary_context_switch(tsk);
>  	__schedule();
>  }

Yeah, not entirely happy with that, you add two calls into one of the
hottest paths of the kernel.

And I'm still not entirely sure why, your 0/x babbled something about
trampolines, but I'm not sure I understand how those lead to this.



* Re: [PATCH RFC tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops Paul E. McKenney
@ 2014-07-29  7:55     ` Peter Zijlstra
  2014-07-29 16:22       ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-07-29  7:55 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani


On Mon, Jul 28, 2014 at 03:56:13PM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> RCU-tasks requires the occasional voluntary context switch
> from CPU-bound in-kernel tasks.  In some cases, this requires
> instrumenting cond_resched().  However, there is some reluctance
> to countenance unconditionally instrumenting cond_resched() (see
> http://lwn.net/Articles/603252/),

No, if it's a good reason mention it, if not ignore it.

> so this commit creates a separate
> cond_resched_rcu_qs() that may be used in place of cond_resched() in
> locations prone to long-duration in-kernel looping.

Sounds like a pain and a recipe for mistakes. How is joe kernel hacker
supposed to 1) know about this new api, and 2) decide which to use?

Heck, even I wouldn't know, and I just read the damn patch.



* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (8 preceding siblings ...)
  2014-07-29  7:50   ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Peter Zijlstra
@ 2014-07-29  8:12   ` Peter Zijlstra
  2014-07-29 16:36     ` Paul E. McKenney
  2014-07-29  8:12   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  14 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-07-29  8:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Mon, Jul 28, 2014 at 03:56:12PM -0700, Paul E. McKenney wrote:
> +		/*
> +		 * Each pass through the following loop scans the list
> +		 * of holdout tasks, removing any that are no longer
> +		 * holdouts.  When the list is empty, we are done.
> +		 */
> +		while (!list_empty(&rcu_tasks_holdouts)) {
> +			schedule_timeout_interruptible(HZ / 10);
> +			flush_signals(current);
> +			rcu_read_lock();
> +			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> +						rcu_tasks_holdout_list) {
> +				if (smp_load_acquire(&t->rcu_tasks_holdout))
> +					continue;
> +				list_del_init(&t->rcu_tasks_holdout_list);
> +				/* @@@ need to check for usermode on CPU. */
> +			}
> +			rcu_read_unlock();
> +		}

That's a potential CPU runtime sink.. imagine having to scan 100k tasks
10 times a second. Polling O(nr_tasks) is not good.


* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (9 preceding siblings ...)
  2014-07-29  8:12   ` Peter Zijlstra
@ 2014-07-29  8:12   ` Peter Zijlstra
  2014-07-29  8:14   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  14 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2014-07-29  8:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Mon, Jul 28, 2014 at 03:56:12PM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> This commit adds a new RCU-tasks flavor of RCU, which provides
> call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
> context switch (not preemption!), userspace execution, and the idle loop.
> Note that unlike other RCU flavors, these quiescent states occur in tasks,
> not necessarily CPUs.  Includes fixes from Steven Rostedt.

I've not found the userspace transition part, where that at?


* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (10 preceding siblings ...)
  2014-07-29  8:12   ` Peter Zijlstra
@ 2014-07-29  8:14   ` Peter Zijlstra
  2014-07-29 17:23     ` Paul E. McKenney
  2014-07-30  6:52   ` Lai Jiangshan
                     ` (2 subsequent siblings)
  14 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-07-29  8:14 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Mon, Jul 28, 2014 at 03:56:12PM -0700, Paul E. McKenney wrote:
> @@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
>  		rcu_sched_qs(cpu);
>  	else if (!in_softirq())
>  		rcu_bh_qs(cpu);
> +	if (user)
> +		rcu_note_voluntary_context_switch(current);
>  }

There's nothing like sending email saying you can't find something... :-)


* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-29  7:50   ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Peter Zijlstra
@ 2014-07-29 15:57     ` Paul E. McKenney
  2014-07-29 16:07       ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-29 15:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 09:50:55AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 28, 2014 at 03:56:12PM -0700, Paul E. McKenney wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index bc1638b33449..a0d2f3a03566 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2762,6 +2762,7 @@ need_resched:
> >  		} else {
> >  			deactivate_task(rq, prev, DEQUEUE_SLEEP);
> >  			prev->on_rq = 0;
> > +			rcu_note_voluntary_context_switch(prev);
> >  
> >  			/*
> >  			 * If a worker went to sleep, notify and ask workqueue
> > @@ -2828,6 +2829,7 @@ asmlinkage __visible void __sched schedule(void)
> >  	struct task_struct *tsk = current;
> >  
> >  	sched_submit_work(tsk);
> > +	rcu_note_voluntary_context_switch(tsk);
> >  	__schedule();
> >  }
> 
> Yeah, not entirely happy with that, you add two calls into one of the
> hottest paths of the kernel.

I did look into leveraging counters, but cannot remember why I decided
that this was a bad idea.  I guess it is time to recheck...

The ->nvcsw field in the task_struct structure looks promising:

o	Looks like it does in fact get incremented in __schedule() via
	the switch_count pointer.

o	Looks like it is unconditionally compiled in.

o	There are no memory barriers, but a synchronize_sched()
	should take care of that, given that this counter is
	incremented with interrupts disabled.

So I should be able to snapshot the task_struct structure's ->nvcsw
field and avoid the added code in the fastpaths.

Seem plausible, or am I confused about the role of ->nvcsw?
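
For concreteness, a rough sketch of that snapshot-and-compare idea (hypothetical
helpers, not part of the posted series) might look like:

#include <linux/compiler.h>
#include <linux/sched.h>
#include <linux/types.h>

/* Hypothetical per-holdout snapshot of the voluntary-context-switch count. */
struct rcu_tasks_snap {
	struct task_struct *task;
	unsigned long nvcsw_snap;
};

/* Record ->nvcsw when the grace period starts. */
static void rcu_tasks_snap_task(struct rcu_tasks_snap *rts, struct task_struct *t)
{
	rts->task = t;
	rts->nvcsw_snap = ACCESS_ONCE(t->nvcsw);
}

/* A changed ->nvcsw means the task voluntarily scheduled since the snapshot. */
static bool rcu_tasks_task_quiesced(struct rcu_tasks_snap *rts)
{
	return ACCESS_ONCE(rts->task->nvcsw) != rts->nvcsw_snap;
}

Whether observing the changed counter is sufficiently ordered is the
synchronize_sched() point above.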

> And I'm still not entirely sure why, your 0/x babbled something about
> trampolines, but I'm not sure I understand how those lead to this.

Steven Rostedt sent an email recently giving more detail.  And of course
now I am having trouble finding it.  Maybe he will take pity on us and
send along a pointer to it.  ;-)

							Thanx, Paul



* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-29 15:57     ` Paul E. McKenney
@ 2014-07-29 16:07       ` Peter Zijlstra
  2014-07-29 16:33         ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-07-29 16:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 08:57:47AM -0700, Paul E. McKenney wrote:
> On Tue, Jul 29, 2014 at 09:50:55AM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 28, 2014 at 03:56:12PM -0700, Paul E. McKenney wrote:
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index bc1638b33449..a0d2f3a03566 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -2762,6 +2762,7 @@ need_resched:
> > >  		} else {
> > >  			deactivate_task(rq, prev, DEQUEUE_SLEEP);
> > >  			prev->on_rq = 0;
> > > +			rcu_note_voluntary_context_switch(prev);
> > >  
> > >  			/*
> > >  			 * If a worker went to sleep, notify and ask workqueue
> > > @@ -2828,6 +2829,7 @@ asmlinkage __visible void __sched schedule(void)
> > >  	struct task_struct *tsk = current;
> > >  
> > >  	sched_submit_work(tsk);
> > > +	rcu_note_voluntary_context_switch(tsk);
> > >  	__schedule();
> > >  }
> > 
> > Yeah, not entirely happy with that, you add two calls into one of the
> > hottest paths of the kernel.
> 
> I did look into leveraging counters, but cannot remember why I decided
> that this was a bad idea.  I guess it is time to recheck...
> 
> The ->nvcsw field in the task_struct structure looks promising:
> 
> o	Looks like it does in fact get incremented in __schedule() via
> 	the switch_count pointer.
> 
> o	Looks like it is unconditionally compiled in.
> 
> o	There are no memory barriers, but a synchronize_sched()
> 	should take care of that, given that this counter is
> 	incremented with interrupts disabled.

Well, there's obviously the actual context switch, which should imply an
actual MB such that tasks are self ordered even when execution continues
on another cpu etc..

> So I should be able to snapshot the task_struct structure's ->nvcsw
> field and avoid the added code in the fastpaths.
> 
> Seem plausible, or am I confused about the role of ->nvcsw?

Nope, that's the 'I scheduled to go to sleep' counter.

There is of course the 'polling' issue I raised in a further email...

> > And I'm still not entirely sure why, your 0/x babbled something about
> > trampolines, but I'm not sure I understand how those lead to this.
> 
> Steven Rostedt sent an email recently giving more detail.  And of course
> now I am having trouble finding it.  Maybe he will take pity on us and
> send along a pointer to it.  ;-)

Yah, would make good Changelog material that ;-)


* Re: [PATCH RFC tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops
  2014-07-29  7:55     ` Peter Zijlstra
@ 2014-07-29 16:22       ` Paul E. McKenney
  2014-07-29 17:25         ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-29 16:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 09:55:36AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 28, 2014 at 03:56:13PM -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > 
> > RCU-tasks requires the occasional voluntary context switch
> > from CPU-bound in-kernel tasks.  In some cases, this requires
> > instrumenting cond_resched().  However, there is some reluctance
> > to countenance unconditionally instrumenting cond_resched() (see
> > http://lwn.net/Articles/603252/),
> 
> No, if it's a good reason mention it, if not ignore it.

Fair enough.  ;-)

> > so this commit creates a separate
> > cond_resched_rcu_qs() that may be used in place of cond_resched() in
> > locations prone to long-duration in-kernel looping.
> 
> Sounds like a pain and a recipe for mistakes. How is joe kernel hacker
> supposed to 1) know about this new api, and 2) decide which to use?
> 
> Heck, even I wouldn't know, and I just read the damn patch.

When Joe Hacker gets stall warning messages due to loops in the kernel
that contain cond_resched(), that is a hint that cond_resched_rcu_qs()
is required.  These stall warnings can occur when using RCU-tasks and when
using normal RCU in NO_HZ_FULL kernels in cases where the scheduling-clock
interrupt is left off while executing a long code path in the kernel.
(Of course, in both cases, another eminently reasonable fix is to shorten
the offending code path in the kernel.)

I should add words to that effect to Documentation/RCU/stallwarn.txt,
shouldn't I?  Done.
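
For example (purely illustrative; the function and data here are invented, and
cond_resched_rcu_qs() is the API from patch 2), the intended usage in a
long-running in-kernel loop would be:

#include <linux/rcupdate.h>
#include <linux/sched.h>

/* Sum a large array, offering a quiescent state on each pass. */
static unsigned long example_sum(const unsigned long *vals, unsigned long n)
{
	unsigned long i, sum = 0;

	for (i = 0; i < n; i++) {
		sum += vals[i];		/* stand-in for real per-item work */
		cond_resched_rcu_qs();	/* voluntary QS for both RCU and RCU-tasks */
	}
	return sum;
}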

							Thanx, Paul



* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-29 16:07       ` Peter Zijlstra
@ 2014-07-29 16:33         ` Paul E. McKenney
  2014-07-29 17:31           ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-29 16:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 06:07:54PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 29, 2014 at 08:57:47AM -0700, Paul E. McKenney wrote:
> > On Tue, Jul 29, 2014 at 09:50:55AM +0200, Peter Zijlstra wrote:
> > > On Mon, Jul 28, 2014 at 03:56:12PM -0700, Paul E. McKenney wrote:
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index bc1638b33449..a0d2f3a03566 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -2762,6 +2762,7 @@ need_resched:
> > > >  		} else {
> > > >  			deactivate_task(rq, prev, DEQUEUE_SLEEP);
> > > >  			prev->on_rq = 0;
> > > > +			rcu_note_voluntary_context_switch(prev);
> > > >  
> > > >  			/*
> > > >  			 * If a worker went to sleep, notify and ask workqueue
> > > > @@ -2828,6 +2829,7 @@ asmlinkage __visible void __sched schedule(void)
> > > >  	struct task_struct *tsk = current;
> > > >  
> > > >  	sched_submit_work(tsk);
> > > > +	rcu_note_voluntary_context_switch(tsk);
> > > >  	__schedule();
> > > >  }
> > > 
> > > Yeah, not entirely happy with that, you add two calls into one of the
> > > hottest paths of the kernel.
> > 
> > I did look into leveraging counters, but cannot remember why I decided
> > that this was a bad idea.  I guess it is time to recheck...
> > 
> > The ->nvcsw field in the task_struct structure looks promising:
> > 
> > o	Looks like it does in fact get incremented in __schedule() via
> > 	the switch_count pointer.
> > 
> > o	Looks like it is unconditionally compiled in.
> > 
> > o	There are no memory barriers, but a synchronize_sched()
> > 	should take care of that, given that this counter is
> > 	incremented with interrupts disabled.
> 
> Well, there's obviously the actual context switch, which should imply an
> actual MB such that tasks are self ordered even when execution continues
> on another cpu etc..

True enough, except that it appears that the context switch happens
after the ->nvcsw increment, which means that it doesn't help RCU-tasks
guarantee that if it has seen the increment, then all prior processing
has completed.  There might be enough stuff prior to the increment, but I
don't see anything that I feel comfortable relying on.  Am I missing
some ordering?

> > So I should be able to snapshot the task_struct structure's ->nvcsw
> > field and avoid the added code in the fastpaths.
> > 
> > Seem plausible, or am I confused about the role of ->nvcsw?
> 
> Nope, that's the 'I scheduled to go to sleep' counter.

I am assuming that the "Nope" goes with "am I confused" rather than
"Seem plausible" -- if not, please let me know.  ;-)

> There is of course the 'polling' issue I raised in a further email...

Yep, and other flavors of RCU go to lengths to avoid scanning the
task_struct lists.  Steven said that updates will be rare and that it
is OK for them to have high latency and overhead.  Thus far, I am taking
him at his word.  ;-)

I considered interrupting the task_struct polling loop periodically,
and would add that if needed.  That said, this requires nailing down the
task_struct at which the vacation is taken.  Here "nailing down" does not
simply mean "prevent from being freed", but rather "prevent from being
removed from the lists traversed by do_each_thread/while_each_thread."

Of course, if there is some easy way of doing this, please let me know!

> > > And I'm still not entirely sure why, your 0/x babbled something about
> > > trampolines, but I'm not sure I understand how those lead to this.
> > 
> > Steven Rostedt sent an email recently giving more detail.  And of course
> > now I am having trouble finding it.  Maybe he will take pity on us and
> > send along a pointer to it.  ;-)
> 
> Yah, would make good Changelog material that ;-)

;-) ;-) ;-)

							Thanx, Paul



* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-29  8:12   ` Peter Zijlstra
@ 2014-07-29 16:36     ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-29 16:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 10:12:11AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 28, 2014 at 03:56:12PM -0700, Paul E. McKenney wrote:
> > +		/*
> > +		 * Each pass through the following loop scans the list
> > +		 * of holdout tasks, removing any that are no longer
> > +		 * holdouts.  When the list is empty, we are done.
> > +		 */
> > +		while (!list_empty(&rcu_tasks_holdouts)) {
> > +			schedule_timeout_interruptible(HZ / 10);
> > +			flush_signals(current);
> > +			rcu_read_lock();
> > +			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> > +						rcu_tasks_holdout_list) {
> > +				if (smp_load_acquire(&t->rcu_tasks_holdout))
> > +					continue;
> > +				list_del_init(&t->rcu_tasks_holdout_list);
> > +				/* @@@ need to check for usermode on CPU. */
> > +			}
> > +			rcu_read_unlock();
> > +		}
> 
> That's a potential CPU runtime sink.. imagine having to scan 100k tasks
> 10 times a second. Polling O(nr_tasks) is not good.

This only scans those tasks that are blocking the RCU-tasks grace period,
and this list should get shorter reasonably quickly as each task does
a voluntary context switch.

Of course, there is the do_each_thread() / while_each_thread() loop
that builds this list, and yes, that does look at each and every task,
as does the subsequent loop that waits for pre-existing partially exited
tasks to disappear from the list.  As noted in an earlier email, I am
taking Steven at his word when he said that updates are very infrequent
and that he doesn't care about the latency and overhead of the updates.

							Thanx, Paul



* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-29  8:14   ` Peter Zijlstra
@ 2014-07-29 17:23     ` Paul E. McKenney
  2014-07-29 17:33       ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-29 17:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 10:14:16AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 28, 2014 at 03:56:12PM -0700, Paul E. McKenney wrote:
> > @@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
> >  		rcu_sched_qs(cpu);
> >  	else if (!in_softirq())
> >  		rcu_bh_qs(cpu);
> > +	if (user)
> > +		rcu_note_voluntary_context_switch(current);
> >  }
> 
> There's nothing like sending email you can't find something... :-)

Well, this is unfortunately only a partial solution.  It does not handle
the NO_HZ_FULL scheduling-clock-free usermode execution.  I have ink on
paper indicating a couple of ways to do that, but figured I should get
feedback on this stuff before going too much farther.

							Thanx, Paul



* Re: [PATCH RFC tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops
  2014-07-29 16:22       ` Paul E. McKenney
@ 2014-07-29 17:25         ` Peter Zijlstra
  2014-07-29 17:33           ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-07-29 17:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 09:22:36AM -0700, Paul E. McKenney wrote:
> On Tue, Jul 29, 2014 at 09:55:36AM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 28, 2014 at 03:56:13PM -0700, Paul E. McKenney wrote:
> > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > 
> > > RCU-tasks requires the occasional voluntary context switch
> > > from CPU-bound in-kernel tasks.  In some cases, this requires
> > > instrumenting cond_resched().  However, there is some reluctance
> > > to countenance unconditionally instrumenting cond_resched() (see
> > > http://lwn.net/Articles/603252/),
> > 
> > No, if it's a good reason mention it, if not ignore it.
> 
> Fair enough.  ;-)
> 
> > > so this commit creates a separate
> > > cond_resched_rcu_qs() that may be used in place of cond_resched() in
> > > locations prone to long-duration in-kernel looping.
> > 
> > Sounds like a pain and a recipe for mistakes. How is joe kernel hacker
> > supposed to 1) know about this new api, and 2) decide which to use?
> > 
> > Heck, even I wouldn't know, and I just read the damn patch.
> 
> When Joe Hacker gets stall warning messages due to loops in the kernel
> that contain cond_resched(), that is a hint that cond_resched_rcu_qs()
> is required.  These stall warnings can occur when using RCU-tasks and when
> using normal RCU in NO_HZ_FULL kernels in cases where the scheduling-clock
> interrupt is left off while executing a long code path in the kernel.
> (Of course, in both cases, another eminently reasonable fix is to shorten
> the offending code path in the kernel.)
> 
> I should add words to that effect to Documentation/RCU/stallwarn.txt,
> shouldn't I?  Done.

No, but why can't we make the regular cond_resched() do this?


* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-29 16:33         ` Paul E. McKenney
@ 2014-07-29 17:31           ` Peter Zijlstra
  2014-07-29 18:19             ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-07-29 17:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 09:33:12AM -0700, Paul E. McKenney wrote:
> > > o	There are no memory barriers, but a synchronize_sched()
> > > 	should take care of that, given that this counter is
> > > 	incremented with interrupts disabled.
> > 
> > Well, there's obviously the actual context switch, which should imply an
> > actual MB such that tasks are self ordered even when execution continues
> > on another cpu etc..
> 
> True enough, except that it appears that the context switch happens
> after the ->nvcsw increment, which means that it doesn't help RCU-tasks
> guarantee that if it has seen the increment, then all prior processing
> > has completed.  There might be enough stuff prior to the increment, but I
> don't see anything that I feel comfortable relying on.  Am I missing
> some ordering?

There's the smp_mb__before_spinlock() + raw_spin_lock_irq(), but that's
about it I suppose.

> > > So I should be able to snapshot the task_struct structure's ->nvcsw
> > > field and avoid the added code in the fastpaths.
> > > 
> > > Seem plausible, or am I confused about the role of ->nvcsw?
> > 
> > Nope, that's the 'I scheduled to go to sleep' counter.
> 
> I am assuming that the "Nope" goes with "am I confused" rather than
> "Seem plausible" -- if not, please let me know.  ;-)

Yah :-)

> > There is of course the 'polling' issue I raised in a further email...
> 
> Yep, and other flavors of RCU go to lengths to avoid scanning the
> task_struct lists.  Steven said that updates will be rare and that it
> is OK for them to have high latency and overhead.  Thus far, I am taking
> him at his word.  ;-)
> 
> I considered interrupting the task_struct polling loop periodically,
> and would add that if needed.  That said, this requires nailing down the
> task_struct at which the vacation is taken.  Here "nailing down" does not
> simply mean "prevent from being freed", but rather "prevent from being
> removed from the lists traversed by do_each_thread/while_each_thread."
> 
> Of course, if there is some easy way of doing this, please let me know!

Well, one reason I'm not liking this is because it's in an anonymous
context (kthread), which makes accounting for it hard.

I feel we're doing far too much async stuff already and it keeps getting
worse and worse. Ideally we'd be able to account every cycle of kernel
'overhead' to a specific user action.

Another reason is that I fundamentally dislike polling stuff.. but yes,
I'm not really seeing how to do this differently, partly because I'm not
entirely sure why we need this to begin with. I'm not sure what problem
we're solving.


* Re: [PATCH RFC tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops
  2014-07-29 17:25         ` Peter Zijlstra
@ 2014-07-29 17:33           ` Paul E. McKenney
  2014-07-29 17:36             ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-29 17:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 07:25:42PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 29, 2014 at 09:22:36AM -0700, Paul E. McKenney wrote:
> > On Tue, Jul 29, 2014 at 09:55:36AM +0200, Peter Zijlstra wrote:
> > > On Mon, Jul 28, 2014 at 03:56:13PM -0700, Paul E. McKenney wrote:
> > > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > > 
> > > > RCU-tasks requires the occasional voluntary context switch
> > > > from CPU-bound in-kernel tasks.  In some cases, this requires
> > > > instrumenting cond_resched().  However, there is some reluctance
> > > > to countenance unconditionally instrumenting cond_resched() (see
> > > > http://lwn.net/Articles/603252/),
> > > 
> > > No, if it's a good reason mention it, if not ignore it.
> > 
> > Fair enough.  ;-)
> > 
> > > > so this commit creates a separate
> > > > cond_resched_rcu_qs() that may be used in place of cond_resched() in
> > > > locations prone to long-duration in-kernel looping.
> > > 
> > > Sounds like a pain and a recipe for mistakes. How is joe kernel hacker
> > > supposed to 1) know about this new api, and 2) decide which to use?
> > > 
> > > Heck, even I wouldn't know, and I just read the damn patch.
> > 
> > When Joe Hacker gets stall warning messages due to loops in the kernel
> > that contain cond_resched(), that is a hint that cond_resched_rcu_qs()
> > is required.  These stall warnings can occur when using RCU-tasks and when
> > using normal RCU in NO_HZ_FULL kernels in cases where the scheduling-clock
> > interrupt is left off while executing a long code path in the kernel.
> > (Of course, in both cases, another eminently reasonable fix is to shorten
> > the offending code path in the kernel.)
> > 
> > I should add words to that effect to Documentation/RCU/stallwarn.txt,
> > shouldn't I?  Done.
> 
> No, but why can't we make the regular cond_resched() do this?

Well, I got a lot of grief when I tried it a few weeks ago.

But from what I can see, you are the maintainer of cond_resched(), so
if you are good with making the normal cond_resched() do this, I am
more than happy to make it so!  ;-)

							Thanx, Paul



* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-29 17:23     ` Paul E. McKenney
@ 2014-07-29 17:33       ` Peter Zijlstra
  2014-07-29 18:06         ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-07-29 17:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 10:23:04AM -0700, Paul E. McKenney wrote:
> On Tue, Jul 29, 2014 at 10:14:16AM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 28, 2014 at 03:56:12PM -0700, Paul E. McKenney wrote:
> > > @@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
> > >  		rcu_sched_qs(cpu);
> > >  	else if (!in_softirq())
> > >  		rcu_bh_qs(cpu);
> > > +	if (user)
> > > +		rcu_note_voluntary_context_switch(current);
> > >  }
> > 
> > There's nothing like sending email you can't find something... :-)
> 
> Well, this is unfortunately only a partial solution.  It does not handle
> the NO_HZ_FULL scheduling-clock-free usermode execution.  I have ink on
> paper indicating a couple of ways to do that, but figured I should get
> feedback on this stuff before going too much farther.

Yah, so the nohz_full already has the horrid overhead of user<->kernel
switches, so you can 'trivially' hook into those.

FWIW it's _the_ thing that makes nohz_full uninteresting for me. The
required overhead is insane. But yes there are people willing to pay
that etc..


* Re: [PATCH RFC tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops
  2014-07-29 17:33           ` Paul E. McKenney
@ 2014-07-29 17:36             ` Peter Zijlstra
  2014-07-29 17:37               ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-07-29 17:36 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 10:33:18AM -0700, Paul E. McKenney wrote:
> > No, but why can't we make the regular cond_resched() do this?
> 
> Well, I got a lot of grief when I tried it a few weeks ago.
> 
> But from what I can see, you are the maintainer of cond_resched(), so
> if you are good with making the normal cond_resched() do this, I am
> more than happy to make it so!  ;-)

Well, it's the 'obvious' thing to do. But clearly I haven't tried, so I'm
blissfully unaware of any problems. And the Changelog didn't inform me
either (you had a link in there, which I didn't read :-)


* Re: [PATCH RFC tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops
  2014-07-29 17:36             ` Peter Zijlstra
@ 2014-07-29 17:37               ` Peter Zijlstra
  2014-07-29 17:55                 ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-07-29 17:37 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 07:36:00PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 29, 2014 at 10:33:18AM -0700, Paul E. McKenney wrote:
> > > No, but why can't we make the regular cond_resched() do this?
> > 
> > Well, I got a lot of grief when I tried it a few weeks ago.
> > 
> > But from what I can see, you are the maintainer of cond_resched(), so
> > if you are good with making the normal cond_resched() do this, I am
> > more than happy to make it so!  ;-)
> 
> Well, it's the 'obvious' thing to do. But clearly I haven't tried, so I'm
> blissfully unaware of any problems. And the Changelog didn't inform me
> either (you had a link in there, which I didn't read :-)

Then again, last time we touched cond_resched() we had a scalability
issue or somesuch, or am I misremembering things?


* Re: [PATCH RFC tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops
  2014-07-29 17:37               ` Peter Zijlstra
@ 2014-07-29 17:55                 ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-29 17:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 07:37:33PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 29, 2014 at 07:36:00PM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 29, 2014 at 10:33:18AM -0700, Paul E. McKenney wrote:
> > > > No, but why can't we make the regular cond_resched() do this?
> > > 
> > > Well, I got a lot of grief when I tried it a few weeks ago.
> > > 
> > > But from what I can see, you are the maintainer of cond_resched(), so
> > > if you are good with making the normal cond_resched() do this, I am
> > > more than happy to make it so!  ;-)
> > 
> > Well, it's the 'obvious' thing to do. But clearly I haven't tried, so I'm
> > blissfully unaware of any problems. And the Changelog didn't inform me
> > either (you had a link in there, which I didn't read :-)
> 
> Then again, last time we touched cond_resched() we had a scalability
> issue or somesuch, or am I misremembering things?

More overhead than scalability, but yes.  That said, that was a much
heavier-weight touch.  A later version with only an access to a per-CPU
variable turned out to have overhead below what could be measured.
But I am comfortable with the current approach that does not touch
cond_resched() as well.
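
Purely for illustration (this is not that earlier patch), the shape of such a
lighter-weight hook, with only a per-CPU flag read on the fast path and using
the rcu_note_voluntary_context_switch() from patch 1, would be roughly:

#include <linux/compiler.h>
#include <linux/percpu.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

/* Hypothetical flag that RCU would set when it needs a QS from this CPU. */
static DEFINE_PER_CPU(int, example_rcu_qs_requested);

/* Cheap check intended to sit on the cond_resched() fast path. */
static inline void example_note_rcu_qs(void)
{
	if (unlikely(__this_cpu_read(example_rcu_qs_requested))) {
		__this_cpu_write(example_rcu_qs_requested, 0);
		rcu_note_voluntary_context_switch(current);
	}
}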

							Thanx, Paul



* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-29 17:33       ` Peter Zijlstra
@ 2014-07-29 18:06         ` Paul E. McKenney
  2014-07-30 13:23           ` Mike Galbraith
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-29 18:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 07:33:32PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 29, 2014 at 10:23:04AM -0700, Paul E. McKenney wrote:
> > On Tue, Jul 29, 2014 at 10:14:16AM +0200, Peter Zijlstra wrote:
> > > On Mon, Jul 28, 2014 at 03:56:12PM -0700, Paul E. McKenney wrote:
> > > > @@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
> > > >  		rcu_sched_qs(cpu);
> > > >  	else if (!in_softirq())
> > > >  		rcu_bh_qs(cpu);
> > > > +	if (user)
> > > > +		rcu_note_voluntary_context_switch(current);
> > > >  }
> > > 
> > > There's nothing like sending email you can't find something... :-)
> > 
> > Well, this is unfortunately only a partial solution.  It does not handle
> > the NO_HZ_FULL scheduling-clock-free usermode execution.  I have ink on
> > paper indicating a couple of ways to do that, but figured I should get
> > feedback on this stuff before going too much farther.
> 
> Yah, so the nohz_full already has the horrid overhead of user<->kernel
> switches, so you can 'trivially' hook into those.

Yep, the plan is to use RCU's dyntick-idle code as the hook.

> FWIW its _the_ thing that makes nohz_full uninteresting for me. The
> required overhead is insane. But yes there are people willing to pay
> that etc..

It would indeed be good to reduce the overhead.  I could imagine all sorts
of insane approaches involving assuming that CPU write buffers flush in
bounded time, though CPU vendors seem unwilling to make guarantees in
this area.  ;-)

Or is something other than rcu_user_enter() and rcu_user_exit() causing
the pain here?

							Thanx, Paul



* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-29 17:31           ` Peter Zijlstra
@ 2014-07-29 18:19             ` Paul E. McKenney
  2014-07-29 19:25               ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-29 18:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 07:31:21PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 29, 2014 at 09:33:12AM -0700, Paul E. McKenney wrote:
> > > > o	There are no memory barriers, but a synchronize_sched()
> > > > 	should take care of that, given that this counter is
> > > > 	incremented with interrupts disabled.
> > > 
> > > Well, there's obviously the actual context switch, which should imply an
> > > actual MB such that tasks are self ordered even when execution continues
> > > on another cpu etc..
> > 
> > True enough, except that it appears that the context switch happens
> > after the ->nvcsw increment, which means that it doesn't help RCU-tasks
> > guarantee that if it has seen the increment, then all prior processing
> > has completed.  There might be enough stuff prior to the increment, but I
> > don't see anything that I feel comfortable relying on.  Am I missing
> > some ordering?
> 
> There's the smp_mb__before_spinlock() + raw_spin_lock_irq(), but that's
> about it I suppose.

Which does only smp_wmb(), correct?  :-(

> > > > So I should be able to snapshot the task_struct structure's ->nvcsw
> > > > field and avoid the added code in the fastpaths.
> > > > 
> > > > Seem plausible, or am I confused about the role of ->nvcsw?
> > > 
> > > Nope, that's the 'I scheduled to go to sleep' counter.
> > 
> > I am assuming that the "Nope" goes with "am I confused" rather than
> > "Seem plausible" -- if not, please let me know.  ;-)
> 
> Yah :-)
> 
> > > There is of course the 'polling' issue I raised in a further email...
> > 
> > Yep, and other flavors of RCU go to lengths to avoid scanning the
> > task_struct lists.  Steven said that updates will be rare and that it
> > is OK for them to have high latency and overhead.  Thus far, I am taking
> > him at his word.  ;-)
> > 
> > I considered interrupting the task_struct polling loop periodically,
> > and would add that if needed.  That said, this requires nailing down the
> > task_struct at which the vacation is taken.  Here "nailing down" does not
> > simply mean "prevent from being freed", but rather "prevent from being
> > removed from the lists traversed by do_each_thread/while_each_thread."
> > 
> > Of course, if there is some easy way of doing this, please let me know!
> 
> Well, one reason I'm not liking this is because it's in an anonymous
> context (kthread), which makes accounting for it hard.
> 
> I feel we're doing far too much async stuff already and it keeps getting
> worse and worse. Ideally we'd be able to account every cycle of kernel
> 'overhead' to a specific user action.

Hmmm...

In theory, we could transfer the overhead of the kthread for a given grace
period to the task invoking the corresponding synchronize_rcu_tasks().
In practice, the overhead might need to be parceled out among several
tasks that concurrently invoked synchronize_rcu_tasks().  Or I suppose
that the overhead could be assigned to the first such task that woke
up, on the theory that things would even out over time.

So exactly how annoyed are you about the lack of accounting?  ;-)

> Another reason is that I fundamentally dislike polling stuff.. but yes,
> I'm not really seeing how to do this differently, partly because I'm not
> entirely sure why we need this to begin with. I'm not sure what problem
> we're solving.

As I recall it...

Steven is working on some sort of tracing infrastructure that involves
dynamically allocated trampolines being inserted into some/all functions.
The trampoline code can be preempted, but never does voluntary context
switches, and presumably never calls anything that does voluntary
context switches.

Easy to insert a trampoline, but the trick is removing them.

The thought is to restore the instructions at the beginning of the
function in question, wait for an RCU-tasks grace period, then dispose
of the trampoline.

Of course, you could imagine disabling preemption or otherwise entering
an RCU read-side critical section before transferring to the trampoline,
but this was apparently a no-go due to the overhead for small functions.
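
In other words (names invented just to make the sequence concrete, with the
wait provided by this series' synchronize_rcu_tasks()), the teardown path
would look roughly like:

#include <linux/rcupdate.h>

/* Hypothetical trampoline descriptor, for illustration only. */
struct example_trampoline {
	void *patch_site;	/* start of the patched function */
	void *text;		/* the trampoline code itself */
};

/* Hypothetical text-patching helpers. */
void example_restore_original_text(void *patch_site);
void example_free_trampoline_text(void *text);

static void example_remove_trampoline(struct example_trampoline *tramp)
{
	/* 1. Unpatch: put the original instructions back at the call site. */
	example_restore_original_text(tramp->patch_site);

	/*
	 * 2. Wait for every task that might still be executing inside the
	 *    trampoline to do a voluntary context switch, run in usermode,
	 *    or go idle.
	 */
	synchronize_rcu_tasks();

	/* 3. No task can still be running in the trampoline; free its text. */
	example_free_trampoline_text(tramp->text);
}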

							Thanx, Paul



* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-29 18:19             ` Paul E. McKenney
@ 2014-07-29 19:25               ` Peter Zijlstra
  2014-07-29 20:11                 ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-07-29 19:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 11:19:49AM -0700, Paul E. McKenney wrote:
> > I feel we're doing far too much async stuff already and it keeps getting
> > worse and worse. Ideally we'd be able to account every cycle of kernel
> > 'overhead' to a specific user action.
> 
> Hmmm...
> 
> In theory, we could transfer the overhead of the kthread for a given grace
> period to the task invoking the corresponding synchronize_rcu_tasks().
> In practice, the overhead might need to be parceled out among several
> tasks that concurrently invoked synchronize_rcu_tasks().  Or I suppose
> that the overhead could be assigned to the first such task that woke
> up, on the theory that things would even out over time.
> 
> So exactly how annoyed are you about the lack of accounting?  ;-)

It's a general annoyance that people don't seem to consider this at all.

And RCU isn't the largest offender by a long shot.

> > Another reason is that I fundamentally dislike polling stuff.. but yes,
> > I'm not really seeing how to do this differently, partly because I'm not
> > entirely sure why we need this to begin with. I'm not sure what problem
> > we're solving.
> 
> As I recall it...
> 
> Steven is working on some sort of tracing infrastructure that involves
> dynamically allocated trampolines being inserted into some/all functions.
> The trampoline code can be preempted, but never does voluntary context
> switches, and presumably never calls anything that does voluntary
> context switches.
> 
> Easy to insert a trampoline, but the trick is removing them.
> 
> The thought is to restore the instructions at the beginning of the
> function in question, wait for an RCU-tasks grace period, then dispose
> of the trampoline.
> 
> Of course, you could imagine disabling preemption or otherwise entering
> an RCU read-side critical section before transferring to the trampoline,
> but this was apparently a no-go due to the overhead for small functions.

So why not use the freezer to get the kernel into a known good state and
then remove them trampolines? That would mean a more noticeable
disruption of service, but it might be ok for something like disabling a
tracer or so. Dunno.

Kernel threads are the problem here, lemme ponder this for a bit.


* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-29 19:25               ` Peter Zijlstra
@ 2014-07-29 20:11                 ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-29 20:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

On Tue, Jul 29, 2014 at 09:25:04PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 29, 2014 at 11:19:49AM -0700, Paul E. McKenney wrote:
> > > I feel we're doing far too much async stuff already and it keeps getting
> > > worse and worse. Ideally we'd be able to account every cycle of kernel
> > > 'overhead' to a specific user action.
> > 
> > Hmmm...
> > 
> > In theory, we could transfer the overhead of the kthread for a given grace
> > period to the task invoking the corresponding synchronize_rcu_tasks().
> > In practice, the overhead might need to be parceled out among several
> > tasks that concurrently invoked synchronize_rcu_tasks().  Or I suppose
> > that the overhead could be assigned to the first such task that woke
> > up, on the theory that things would even out over time.
> > 
> > So exactly how annoyed are you about the lack of accounting?  ;-)
> 
> It's a general annoyance that people don't seem to consider this at all.
> 
> And RCU isn't the largest offender by a long shot.

A challenge!  ;-)

> > > Another reason is that I fundamentally dislike polling stuff.. but yes,
> > > I'm not really seeing how to do this differently, partly because I'm not
> > > entirely sure why we need this to begin with. I'm not sure what problem
> > > we're solving.
> > 
> > As I recall it...
> > 
> > Steven is working on some sort of tracing infrastructure that involves
> > dynamically allocated trampolines being inserted into some/all functions.
> > The trampoline code can be preempted, but never does voluntary context
> > switches, and presumably never calls anything that does voluntary
> > context switches.
> > 
> > Easy to insert a trampoline, but the trick is removing them.
> > 
> > The thought is to restore the instructions at the beginning of the
> > function in question, wait for an RCU-tasks grace period, then dispose
> > of the trampoline.
> > 
> > Of course, you could imagine disabling preemption or otherwise entering
> > an RCU read-side critical section before transferring to the trampoline,
> > but this was apparently a no-go due to the overhead for small functions.
> 
> So why not use the freezer to get the kernel into a known good state and
> then remove them trampolines? That would mean a more noticeable
> disruption of service, but it might be ok for something like disabling a
> tracer or so. Dunno.
> 
> Kernel threads are the problem here, lemme ponder this for a bit.

There was a debate about what points in a kernel thread were "safe
points" a few months back, which might be related.

							Thanx, Paul



* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (11 preceding siblings ...)
  2014-07-29  8:14   ` Peter Zijlstra
@ 2014-07-30  6:52   ` Lai Jiangshan
  2014-07-30 15:07     ` Paul E. McKenney
  2014-07-30 13:41   ` Frederic Weisbecker
  2014-07-30 15:49   ` Oleg Nesterov
  14 siblings, 1 reply; 46+ messages in thread
From: Lai Jiangshan @ 2014-07-30  6:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, dvhart, fweisbec,
	oleg, bobby.prani

On 07/29/2014 06:56 AM, Paul E. McKenney wrote:

> +		/*
> +		 * Each pass through the following loop scans the list
> +		 * of holdout tasks, removing any that are no longer
> +		 * holdouts.  When the list is empty, we are done.
> +		 */
> +		while (!list_empty(&rcu_tasks_holdouts)) {
> +			schedule_timeout_interruptible(HZ / 10);
> +			flush_signals(current);
> +			rcu_read_lock();
> +			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> +						rcu_tasks_holdout_list) {
> +				if (smp_load_acquire(&t->rcu_tasks_holdout))
> +					continue;
> +				list_del_init(&t->rcu_tasks_holdout_list);
> +				/* @@@ need to check for usermode on CPU. */
> +			}
> +			rcu_read_unlock();

Maybe I missed something. The task @t may have already exited, and we would
access stale memory here without patch 8/9.

^ permalink raw reply	[flat|nested] 46+ messages in thread
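
One generic way to make the holdout scan above safe against task exit,
sketched purely for illustration (patch 8/9 takes a different approach), is
to pin each task_struct with a reference for as long as it sits on the
holdout list.  Only the two places shown would change:

	/* When a task is added to the holdout list: */
	if (t != current && ACCESS_ONCE(t->on_rq) && !is_idle_task(t)) {
		get_task_struct(t);	/* pin t until it leaves the list */
		t->rcu_tasks_holdout = 1;
		list_add(&t->rcu_tasks_holdout_list, &rcu_tasks_holdouts);
	}

	/* When a task is removed from the holdout list: */
	if (!smp_load_acquire(&t->rcu_tasks_holdout)) {
		list_del_init(&t->rcu_tasks_holdout_list);
		put_task_struct(t);	/* may now free the task_struct */
	}

Note that pinning only prevents the use-after-free; synchronizing with the
exit path itself, so that the grace period does not wait on tasks that no
longer exist, is what patch 8/9 is about.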

* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-29 18:06         ` Paul E. McKenney
@ 2014-07-30 13:23           ` Mike Galbraith
  2014-07-30 14:23             ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Galbraith @ 2014-07-30 13:23 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On Tue, 2014-07-29 at 11:06 -0700, Paul E. McKenney wrote: 
> On Tue, Jul 29, 2014 at 07:33:32PM +0200, Peter Zijlstra wrote:

> > FWIW its _the_ thing that makes nohz_full uninteresting for me. The
> > required overhead is insane. But yes there are people willing to pay
> > that etc..
> 
> It would indeed be good to reduce the overhead.  I could imagine all sorts
> of insane approaches involving assuming that CPU write buffers flush in
> bounded time, though CPU vendors seem unwilling to make guarantees in
> this area.  ;-)
> 
> Or is something other than rcu_user_enter() and rcu_user_exit() causing
> the pain here?

Border guards stamping visas.  Note native_sched_clock().

echo 0 > sched_wakeup_granularity_ns
taskset -c 3 pipe-test 1

    CONFIG_NO_HZ_FULL=y 604.2 KHz           CONFIG_NO_HZ_FULL=y, nohz_full=3 303.5 KHz
    10.45%   __schedule                     8.74%   native_sched_clock
    10.03%   system_call                    5.63%   __schedule
     4.86%   _raw_spin_lock_irqsave         4.75%   _raw_spin_lock
     4.51%   __switch_to                    4.35%   reschedule_interrupt
     4.31%   copy_user_generic_string       3.91%   _raw_spin_unlock_irqrestore
     3.50%   pipe_read                      3.35%   system_call
     3.02%   pipe_write                     2.73%   context_tracking_user_exit
     2.76%   mutex_lock                     2.30%   _raw_spin_lock_irqsave
     2.30%   native_sched_clock             2.08%   context_tracking_user_enter
     2.27%   copy_page_to_iter_iovec        1.94%   __switch_to
     2.16%   mutex_unlock                   1.88%   copy_user_generic_string
     2.15%   _raw_spin_unlock_irqrestore    1.80%   account_system_time
     1.86%   copy_page_from_iter_iovec      1.77%   rcu_eqs_enter_common.isra.42
     1.85%   vfs_write                      1.60%   pipe_read
     1.67%   new_sync_read                  1.58%   pipe_write
     1.61%   new_sync_write                 1.39%   mutex_lock
     1.49%   vfs_read                       1.37%   enqueue_task_fair
     1.47%   fsnotify                       1.25%   rcu_eqs_exit_common.isra.43
     1.43%   __fget_light                   1.14%   get_vtime_delta
     1.36%   enqueue_task_fair              1.11%   flat_send_IPI_mask
     1.28%   finish_task_switch             1.07%   tracesys
     1.26%   dequeue_task_fair              1.03%   dequeue_task_fair
     1.25%   __sb_start_write               1.01%   copy_page_to_iter_iovec
     1.22%   _raw_spin_lock_irq             1.01%   int_check_syscall_exit_work
     1.20%   try_to_wake_up                 0.97%   vfs_write
     1.16%   update_curr                    0.94%   __context_tracking_task_switch
     1.05%   __fsnotify_parent              0.93%   mutex_unlock
     1.03%   pick_next_task_fair            0.88%   copy_page_from_iter_iovec
     1.02%   sys_write                      0.87%   new_sync_write
     1.01%   sys_read                       0.86%   __fget_light
     1.00%   __wake_up_sync_key             0.85%   __sb_start_write
     0.93%   __wake_up_common               0.85%   int_ret_from_sys_call
     0.92%   copy_page_to_iter              0.83%   syscall_trace_leave
     0.90%   check_preempt_wakeup           0.78%   new_sync_read
     0.90%   __srcu_read_lock               0.78%   account_user_time
     0.89%   put_prev_task_fair             0.76%   update_curr
     0.88%   copy_page_from_iter            0.74%   fsnotify
     0.82%   __sb_end_write                 0.73%   try_to_wake_up
     0.76%   __percpu_counter_add           0.71%   finish_task_switch
     0.74%   prepare_to_wait                0.70%   _raw_spin_lock_irq
     0.72%   touch_atime                    0.69%   __wake_up_sync_key
     0.71%   pipe_wait                      0.69%   __tick_nohz_task_switch

pinned endless stat("/", &buf)

    CONFIG_NO_HZ_FULL=y                     CONFIG_NO_HZ_FULL=y, nohz_full=3
    17.13%   system_call                    8.78%   system_call
    11.20%   kmem_cache_alloc               8.52%   native_sched_clock
     7.14%   lockref_get_not_dead           6.02%   context_tracking_user_exit
     7.10%   kmem_cache_free                4.53%   kmem_cache_alloc
     6.42%   path_init                      4.46%   _raw_spin_lock
     5.69%   copy_user_generic_string       4.13%   copy_user_generic_string
     5.25%   lockref_put_or_lock            4.01%   kmem_cache_free
     4.14%   strncpy_from_user              3.36%   context_tracking_user_enter
     3.99%   path_lookupat                  3.25%   lockref_get_not_dead
     3.12%   complete_walk                  3.25%   lockref_put_or_lock
     2.91%   getname_flags                  2.86%   rcu_eqs_enter_common.isra.42
     2.88%   cp_new_stat                    2.84%   path_init
     2.79%   vfs_fstatat                    2.56%   rcu_eqs_exit_common.isra.43
     2.59%   user_path_at_empty             2.52%   int_check_syscall_exit_work
     1.93%   link_path_walk                 2.51%   tracesys
     1.81%   generic_fillattr               2.08%   cp_new_stat
     1.75%   dput                           2.00%   syscall_trace_leave
     1.71%   filename_lookup.isra.50        1.75%   complete_walk
     1.66%   mntput                         1.69%   path_lookupat
     1.45%   vfs_getattr_nosec              1.58%   strncpy_from_user
     1.04%   final_putname                  1.56%   get_vtime_delta
     1.02%   SYSC_newstat                   1.34%   int_with_check

     CONFIG_NO_HZ_FULL=y, nohz_full=3
-    8.53%  [kernel]         [k] native_sched_clock
   - native_sched_clock
      - 96.76% local_clock
         - get_vtime_delta
            - 51.95% vtime_account_user
                 99.96% context_tracking_user_exit
                    syscall_trace_enter
                    tracesys
                    __xstat64
                    __libc_start_main
            - 48.05% __vtime_account_system
                 99.96% vtime_user_enter
                    context_tracking_user_enter
                    syscall_trace_leave
                    int_check_syscall_exit_work
                    __xstat64
                    __libc_start_main
      - 3.23% get_vtime_delta
           52.96% vtime_account_user
              context_tracking_user_exit
              syscall_trace_enter
              tracesys
              __xstat64
              __libc_start_main
           47.04% __vtime_account_system
              vtime_user_enter
              context_tracking_user_enter
              syscall_trace_leave
              int_check_syscall_exit_work
              __xstat64
              __libc_start_main


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (12 preceding siblings ...)
  2014-07-30  6:52   ` Lai Jiangshan
@ 2014-07-30 13:41   ` Frederic Weisbecker
  2014-07-30 16:10     ` Paul E. McKenney
  2014-07-30 15:49   ` Oleg Nesterov
  14 siblings, 1 reply; 46+ messages in thread
From: Frederic Weisbecker @ 2014-07-30 13:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart, oleg,
	bobby.prani

On Mon, Jul 28, 2014 at 03:56:12PM -0700, Paul E. McKenney wrote:
> +/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
> +static int __noreturn rcu_tasks_kthread(void *arg)
> +{
> +	unsigned long flags;
> +	struct task_struct *g, *t;
> +	struct rcu_head *list;
> +	struct rcu_head *next;
> +
> +	/* FIXME: Add housekeeping affinity. */
> +
> +	/*
> +	 * Each pass through the following loop makes one check for
> +	 * newly arrived callbacks, and, if there are some, waits for
> +	 * one RCU-tasks grace period and then invokes the callbacks.
> +	 * This loop is terminated by the system going down.  ;-)
> +	 */
> +	for (;;) {
> +
> +		/* Pick up any new callbacks. */
> +		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> +		smp_mb__after_unlock_lock(); /* Enforce GP memory ordering. */
> +		list = rcu_tasks_cbs_head;
> +		rcu_tasks_cbs_head = NULL;
> +		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> +		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> +
> +		/* If there were none, wait a bit and start over. */
> +		if (!list) {
> +			schedule_timeout_interruptible(HZ);
> +			flush_signals(current);
> +			continue;
> +		}
> +
> +		/*
> +		 * There were callbacks, so we need to wait for an
> +		 * RCU-tasks grace period.  Start off by scanning
> +		 * the task list for tasks that are not already
> +		 * voluntarily blocked.  Mark these tasks and make
> +		 * a list of them in rcu_tasks_holdouts.
> +		 */
> +		rcu_read_lock();
> +		do_each_thread(g, t) {
> +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> +			    !is_idle_task(t)) {
> +				t->rcu_tasks_holdout = 1;
> +				list_add(&t->rcu_tasks_holdout_list,
> +					 &rcu_tasks_holdouts);
> +			}
> +		} while_each_thread(g, t);
> +		rcu_read_unlock();

I think you need for_each_process_thread() to be RCU-safe.
while_each_thread() has been shown to be racy against exec even with rcu_read_lock().

^ permalink raw reply	[flat|nested] 46+ messages in thread
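
Assuming for_each_process_thread() is available, as Frederic suggests, the
scan in rcu_tasks_kthread() quoted above would become something like the
following sketch (same locals g and t as in the quoted code):

	rcu_read_lock();
	for_each_process_thread(g, t) {
		if (t != current && ACCESS_ONCE(t->on_rq) &&
		    !is_idle_task(t)) {
			t->rcu_tasks_holdout = 1;
			list_add(&t->rcu_tasks_holdout_list,
				 &rcu_tasks_holdouts);
		}
	}
	rcu_read_unlock();

The newer iterators walk a dedicated per-signal thread list rather than
relying on the thread-group leader, which is what makes while_each_thread()
fragile when the leader changes across exec.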

* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-30 13:23           ` Mike Galbraith
@ 2014-07-30 14:23             ` Paul E. McKenney
  2014-07-31  7:37               ` Mike Galbraith
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-30 14:23 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On Wed, Jul 30, 2014 at 03:23:39PM +0200, Mike Galbraith wrote:
> On Tue, 2014-07-29 at 11:06 -0700, Paul E. McKenney wrote: 
> > On Tue, Jul 29, 2014 at 07:33:32PM +0200, Peter Zijlstra wrote:
> 
> > > FWIW its _the_ thing that makes nohz_full uninteresting for me. The
> > > required overhead is insane. But yes there are people willing to pay
> > > that etc..
> > 
> > It would indeed be good to reduce the overhead.  I could imagine all sorts
> > of insane approaches involving assuming that CPU write buffers flush in
> > bounded time, though CPU vendors seem unwilling to make guarantees in
> > this area.  ;-)
> > 
> > Or is something other than rcu_user_enter() and rcu_user_exit() causing
> > the pain here?
> 
> Border guards stamping visas.  Note native_sched_clock().

Thank you for running this!

So the delta accounting is much of the pain.  Hmmm...

							Thanx, Paul

> echo 0 > sched_wakeup_granularity_ns
> taskset -c 3 pipe-test 1
> 
>     CONFIG_NO_HZ_FULL=y 604.2 KHz           CONFIG_NO_HZ_FULL=y, nohz_full=3 303.5 KHz
>     10.45%   __schedule                     8.74%   native_sched_clock
>     10.03%   system_call                    5.63%   __schedule
>      4.86%   _raw_spin_lock_irqsave         4.75%   _raw_spin_lock
>      4.51%   __switch_to                    4.35%   reschedule_interrupt
>      4.31%   copy_user_generic_string       3.91%   _raw_spin_unlock_irqrestore
>      3.50%   pipe_read                      3.35%   system_call
>      3.02%   pipe_write                     2.73%   context_tracking_user_exit
>      2.76%   mutex_lock                     2.30%   _raw_spin_lock_irqsave
>      2.30%   native_sched_clock             2.08%   context_tracking_user_enter
>      2.27%   copy_page_to_iter_iovec        1.94%   __switch_to
>      2.16%   mutex_unlock                   1.88%   copy_user_generic_string
>      2.15%   _raw_spin_unlock_irqrestore    1.80%   account_system_time
>      1.86%   copy_page_from_iter_iovec      1.77%   rcu_eqs_enter_common.isra.42
>      1.85%   vfs_write                      1.60%   pipe_read
>      1.67%   new_sync_read                  1.58%   pipe_write
>      1.61%   new_sync_write                 1.39%   mutex_lock
>      1.49%   vfs_read                       1.37%   enqueue_task_fair
>      1.47%   fsnotify                       1.25%   rcu_eqs_exit_common.isra.43
>      1.43%   __fget_light                   1.14%   get_vtime_delta
>      1.36%   enqueue_task_fair              1.11%   flat_send_IPI_mask
>      1.28%   finish_task_switch             1.07%   tracesys
>      1.26%   dequeue_task_fair              1.03%   dequeue_task_fair
>      1.25%   __sb_start_write               1.01%   copy_page_to_iter_iovec
>      1.22%   _raw_spin_lock_irq             1.01%   int_check_syscall_exit_work
>      1.20%   try_to_wake_up                 0.97%   vfs_write
>      1.16%   update_curr                    0.94%   __context_tracking_task_switch
>      1.05%   __fsnotify_parent              0.93%   mutex_unlock
>      1.03%   pick_next_task_fair            0.88%   copy_page_from_iter_iovec
>      1.02%   sys_write                      0.87%   new_sync_write
>      1.01%   sys_read                       0.86%   __fget_light
>      1.00%   __wake_up_sync_key             0.85%   __sb_start_write
>      0.93%   __wake_up_common               0.85%   int_ret_from_sys_call
>      0.92%   copy_page_to_iter              0.83%   syscall_trace_leave
>      0.90%   check_preempt_wakeup           0.78%   new_sync_read
>      0.90%   __srcu_read_lock               0.78%   account_user_time
>      0.89%   put_prev_task_fair             0.76%   update_curr
>      0.88%   copy_page_from_iter            0.74%   fsnotify
>      0.82%   __sb_end_write                 0.73%   try_to_wake_up
>      0.76%   __percpu_counter_add           0.71%   finish_task_switch
>      0.74%   prepare_to_wait                0.70%   _raw_spin_lock_irq
>      0.72%   touch_atime                    0.69%   __wake_up_sync_key
>      0.71%   pipe_wait                      0.69%   __tick_nohz_task_switch
> 
> pinned endless stat("/", &buf)
> 
>     CONFIG_NO_HZ_FULL=y                     CONFIG_NO_HZ_FULL=y, nohz_full=3
>     17.13%   system_call                    8.78%   system_call
>     11.20%   kmem_cache_alloc               8.52%   native_sched_clock
>      7.14%   lockref_get_not_dead           6.02%   context_tracking_user_exit
>      7.10%   kmem_cache_free                4.53%   kmem_cache_alloc
>      6.42%   path_init                      4.46%   _raw_spin_lock
>      5.69%   copy_user_generic_string       4.13%   copy_user_generic_string
>      5.25%   lockref_put_or_lock            4.01%   kmem_cache_free
>      4.14%   strncpy_from_user              3.36%   context_tracking_user_enter
>      3.99%   path_lookupat                  3.25%   lockref_get_not_dead
>      3.12%   complete_walk                  3.25%   lockref_put_or_lock
>      2.91%   getname_flags                  2.86%   rcu_eqs_enter_common.isra.42
>      2.88%   cp_new_stat                    2.84%   path_init
>      2.79%   vfs_fstatat                    2.56%   rcu_eqs_exit_common.isra.43
>      2.59%   user_path_at_empty             2.52%   int_check_syscall_exit_work
>      1.93%   link_path_walk                 2.51%   tracesys
>      1.81%   generic_fillattr               2.08%   cp_new_stat
>      1.75%   dput                           2.00%   syscall_trace_leave
>      1.71%   filename_lookup.isra.50        1.75%   complete_walk
>      1.66%   mntput                         1.69%   path_lookupat
>      1.45%   vfs_getattr_nosec              1.58%   strncpy_from_user
>      1.04%   final_putname                  1.56%   get_vtime_delta
>      1.02%   SYSC_newstat                   1.34%   int_with_check
> 
>      CONFIG_NO_HZ_FULL=y, nohz_full=3
> -    8.53%  [kernel]         [k] native_sched_clock
>    - native_sched_clock
>       - 96.76% local_clock
>          - get_vtime_delta
>             - 51.95% vtime_account_user
>                  99.96% context_tracking_user_exit
>                     syscall_trace_enter
>                     tracesys
>                     __xstat64
>                     __libc_start_main
>             - 48.05% __vtime_account_system
>                  99.96% vtime_user_enter
>                     context_tracking_user_enter
>                     syscall_trace_leave
>                     int_check_syscall_exit_work
>                     __xstat64
>                     __libc_start_main
>       - 3.23% get_vtime_delta
>            52.96% vtime_account_user
>               context_tracking_user_exit
>               syscall_trace_enter
>               tracesys
>               __xstat64
>               __libc_start_main
>            47.04% __vtime_account_system
>               vtime_user_enter
>               context_tracking_user_enter
>               syscall_trace_leave
>               int_check_syscall_exit_work
>               __xstat64
>               __libc_start_main
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-30  6:52   ` Lai Jiangshan
@ 2014-07-30 15:07     ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-30 15:07 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, dvhart, fweisbec,
	oleg, bobby.prani

On Wed, Jul 30, 2014 at 02:52:41PM +0800, Lai Jiangshan wrote:
> On 07/29/2014 06:56 AM, Paul E. McKenney wrote:
> 
> > +		/*
> > +		 * Each pass through the following loop scans the list
> > +		 * of holdout tasks, removing any that are no longer
> > +		 * holdouts.  When the list is empty, we are done.
> > +		 */
> > +		while (!list_empty(&rcu_tasks_holdouts)) {
> > +			schedule_timeout_interruptible(HZ / 10);
> > +			flush_signals(current);
> > +			rcu_read_lock();
> > +			list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> > +						rcu_tasks_holdout_list) {
> > +				if (smp_load_acquire(&t->rcu_tasks_holdout))
> > +					continue;
> > +				list_del_init(&t->rcu_tasks_holdout_list);
> > +				/* @@@ need to check for usermode on CPU. */
> > +			}
> > +			rcu_read_unlock();
> 
> Maybe I missed something. The task @t may have already exited, and we would
> access stale memory here without patch 8/9.

Yep, patch 8/9 is not optional.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
                     ` (13 preceding siblings ...)
  2014-07-30 13:41   ` Frederic Weisbecker
@ 2014-07-30 15:49   ` Oleg Nesterov
  2014-07-30 16:08     ` Paul E. McKenney
  14 siblings, 1 reply; 46+ messages in thread
From: Oleg Nesterov @ 2014-07-30 15:49 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On 07/28, Paul E. McKenney wrote:
>
> This commit adds a new RCU-tasks flavor of RCU, which provides
> call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
> context switch (not preemption!), userspace execution, and the idle loop.
> Note that unlike other RCU flavors, these quiescent states occur in tasks,
> not necessarily CPUs.  Includes fixes from Steven Rostedt.

I still hope I will read this series later. Not that I really hope I will
understand it ;)

Just one question for now,

> +static int __noreturn rcu_tasks_kthread(void *arg)
> +{
> +	unsigned long flags;
> +	struct task_struct *g, *t;
> +	struct rcu_head *list;
> +	struct rcu_head *next;
> +
> +	/* FIXME: Add housekeeping affinity. */
> +
> +	/*
> +	 * Each pass through the following loop makes one check for
> +	 * newly arrived callbacks, and, if there are some, waits for
> +	 * one RCU-tasks grace period and then invokes the callbacks.
> +	 * This loop is terminated by the system going down.  ;-)
> +	 */
> +	for (;;) {
> +
> +		/* Pick up any new callbacks. */
> +		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> +		smp_mb__after_unlock_lock(); /* Enforce GP memory ordering. */
> +		list = rcu_tasks_cbs_head;
> +		rcu_tasks_cbs_head = NULL;
> +		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> +		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> +
> +		/* If there were none, wait a bit and start over. */
> +		if (!list) {
> +			schedule_timeout_interruptible(HZ);
> +			flush_signals(current);

Why? And I see more flush_signals() in the current kernel/rcu/ code. Unless
a kthread does allow_signal() it can't have a pending signal?

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-30 15:49   ` Oleg Nesterov
@ 2014-07-30 16:08     ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-30 16:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On Wed, Jul 30, 2014 at 05:49:49PM +0200, Oleg Nesterov wrote:
> On 07/28, Paul E. McKenney wrote:
> >
> > This commit adds a new RCU-tasks flavor of RCU, which provides
> > call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
> > context switch (not preemption!), userspace execution, and the idle loop.
> > Note that unlike other RCU flavors, these quiescent states occur in tasks,
> > not necessarily CPUs.  Includes fixes from Steven Rostedt.
> 
> I still hope I will read this series later. Not that I really hope I will
> understand it ;)

Well, don't put too much time into it just now.  Bozo here has been doing
concurrent programming so long that he sometimes misses opportunities
for single-threaded programming.  Hence the locked-list stuff.  :-/

> Just one question for now,
> 
> > +static int __noreturn rcu_tasks_kthread(void *arg)
> > +{
> > +	unsigned long flags;
> > +	struct task_struct *g, *t;
> > +	struct rcu_head *list;
> > +	struct rcu_head *next;
> > +
> > +	/* FIXME: Add housekeeping affinity. */
> > +
> > +	/*
> > +	 * Each pass through the following loop makes one check for
> > +	 * newly arrived callbacks, and, if there are some, waits for
> > +	 * one RCU-tasks grace period and then invokes the callbacks.
> > +	 * This loop is terminated by the system going down.  ;-)
> > +	 */
> > +	for (;;) {
> > +
> > +		/* Pick up any new callbacks. */
> > +		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> > +		smp_mb__after_unlock_lock(); /* Enforce GP memory ordering. */
> > +		list = rcu_tasks_cbs_head;
> > +		rcu_tasks_cbs_head = NULL;
> > +		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> > +		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> > +
> > +		/* If there were none, wait a bit and start over. */
> > +		if (!list) {
> > +			schedule_timeout_interruptible(HZ);
> > +			flush_signals(current);
> 
> Why? And I see more flush_signals() in the current kernel/rcu/ code. Unless
> a kthread does allow_signal() it can't have a pending signal?

Because I am overly paranoid.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread
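
As background to Oleg's point, a minimal sketch of the two kthread patterns
(generic illustrations, not code from this series): a kthread receives no
signals unless it explicitly opts in with allow_signal(), so flush_signals()
only ever matters in the second form below.

	#include <linux/kthread.h>
	#include <linux/sched.h>
	#include <linux/signal.h>

	/* Never calls allow_signal(): no signal can become pending. */
	static int plain_kthread(void *arg)
	{
		while (!kthread_should_stop())
			schedule_timeout_interruptible(HZ);
		return 0;
	}

	/* Opts in to a signal, so it must flush to avoid busy-looping. */
	static int signal_aware_kthread(void *arg)
	{
		allow_signal(SIGHUP);
		while (!kthread_should_stop()) {
			schedule_timeout_interruptible(HZ);
			if (signal_pending(current))
				flush_signals(current);
		}
		return 0;
	}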

* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-30 13:41   ` Frederic Weisbecker
@ 2014-07-30 16:10     ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-30 16:10 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart, oleg,
	bobby.prani

On Wed, Jul 30, 2014 at 03:41:54PM +0200, Frederic Weisbecker wrote:
> On Mon, Jul 28, 2014 at 03:56:12PM -0700, Paul E. McKenney wrote:
> > +/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
> > +static int __noreturn rcu_tasks_kthread(void *arg)
> > +{
> > +	unsigned long flags;
> > +	struct task_struct *g, *t;
> > +	struct rcu_head *list;
> > +	struct rcu_head *next;
> > +
> > +	/* FIXME: Add housekeeping affinity. */
> > +
> > +	/*
> > +	 * Each pass through the following loop makes one check for
> > +	 * newly arrived callbacks, and, if there are some, waits for
> > +	 * one RCU-tasks grace period and then invokes the callbacks.
> > +	 * This loop is terminated by the system going down.  ;-)
> > +	 */
> > +	for (;;) {
> > +
> > +		/* Pick up any new callbacks. */
> > +		raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> > +		smp_mb__after_unlock_lock(); /* Enforce GP memory ordering. */
> > +		list = rcu_tasks_cbs_head;
> > +		rcu_tasks_cbs_head = NULL;
> > +		rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> > +		raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> > +
> > +		/* If there were none, wait a bit and start over. */
> > +		if (!list) {
> > +			schedule_timeout_interruptible(HZ);
> > +			flush_signals(current);
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * There were callbacks, so we need to wait for an
> > +		 * RCU-tasks grace period.  Start off by scanning
> > +		 * the task list for tasks that are not already
> > +		 * voluntarily blocked.  Mark these tasks and make
> > +		 * a list of them in rcu_tasks_holdouts.
> > +		 */
> > +		rcu_read_lock();
> > +		do_each_thread(g, t) {
> > +			if (t != current && ACCESS_ONCE(t->on_rq) &&
> > +			    !is_idle_task(t)) {
> > +				t->rcu_tasks_holdout = 1;
> > +				list_add(&t->rcu_tasks_holdout_list,
> > +					 &rcu_tasks_holdouts);
> > +			}
> > +		} while_each_thread(g, t);
> > +		rcu_read_unlock();
> 
> I think you need for_each_process_thread() to be RCU-safe.
> while_each_thread() has been shown to be racy against exec even with rcu_read_lock().

Good catch, thank you!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RFC tip/core/rcu 8/9] rcu: Make RCU-tasks track exiting tasks
  2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 8/9] rcu: Make RCU-tasks track exiting tasks Paul E. McKenney
@ 2014-07-30 17:04     ` Oleg Nesterov
  2014-07-30 18:24       ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Oleg Nesterov @ 2014-07-30 17:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On 07/28, Paul E. McKenney wrote:
>
> This commit adds synchronization with exiting tasks, so that RCU-tasks
> avoids waiting on tasks that no longer exist.

I don't understand this patch yet, but it seems that it adds more than
just synchronization with exiting tasks?

> +		ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> +		spin_unlock(&t->rcu_tasks_lock);
> +		smp_mb();  /* Order ->rcu_tasks_holdout store before "if". */
> +		if (t == current || !ACCESS_ONCE(t->on_rq) || is_idle_task(t)) {
> +			smp_store_release(&t->rcu_tasks_holdout, 0);
> +			goto next_thread;
> +		}

This should avoid the race with schedule()->rcu_note_voluntary_context_switch(),
right?

> -		rcu_read_lock();
> -		do_each_thread(g, t) {
> -			if (t != current && ACCESS_ONCE(t->on_rq) &&
> -			    !is_idle_task(t)) {
> -				t->rcu_tasks_holdout = 1;

Because before this patch the code looks obviously racy: a task can do
sleep(FOREVER) and block rcu_tasks_kthread() if the kthread reads ->on_rq == 1
after rcu_note_voluntary_context_switch() was already called.

However, I am not sure this race is actually closed even after this
change... why can rcu_note_voluntary_context_switch() not miss
->rcu_tasks_holdout != 0?

OK, it seems that you are going to send the next version anyway, so
please ignore.

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RFC tip/core/rcu 8/9] rcu: Make RCU-tasks track exiting tasks
  2014-07-30 17:04     ` Oleg Nesterov
@ 2014-07-30 18:24       ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-30 18:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, tglx, peterz, rostedt, dhowells, edumazet, dvhart,
	fweisbec, bobby.prani

On Wed, Jul 30, 2014 at 07:04:42PM +0200, Oleg Nesterov wrote:
> On 07/28, Paul E. McKenney wrote:
> >
> > This commit adds synchronization with exiting tasks, so that RCU-tasks
> > avoids waiting on tasks that no longer exist.
> 
> I don't understand this patch yet, but it seems that it adds more than
> just synchronization with exiting tasks?

There was also a bit of code reorganization to keep indentation level
down to a dull roar.

> > +		ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> > +		spin_unlock(&t->rcu_tasks_lock);
> > +		smp_mb();  /* Order ->rcu_tasks_holdout store before "if". */
> > +		if (t == current || !ACCESS_ONCE(t->on_rq) || is_idle_task(t)) {
> > +			smp_store_release(&t->rcu_tasks_holdout, 0);
> > +			goto next_thread;
> > +		}
> 
> This should avoid the race with schedule()->rcu_note_voluntary_context_switch(),
> right?
> 
> > -		rcu_read_lock();
> > -		do_each_thread(g, t) {
> > -			if (t != current && ACCESS_ONCE(t->on_rq) &&
> > -			    !is_idle_task(t)) {
> > -				t->rcu_tasks_holdout = 1;
> 
> Because before this patch the code looks obviously racy: a task can do
> sleep(FOREVER) and block rcu_tasks_kthread() if the kthread reads ->on_rq == 1
> after rcu_note_voluntary_context_switch() was already called.
>
> However, I am not sure this race is actually closed even after this
> change... why can rcu_note_voluntary_context_switch() not miss
> ->rcu_tasks_holdout != 0?

Good point, I need to add a !ACCESS_ONCE(t->on_rq) check when scanning the list
of tasks blocking the grace period.  I also need to handle NO_HZ_FULL,
but that comes later.

							Thanx, Paul

> OK, it seems that you are going to send the next version anyway, so
> please ignore.
> 
> Oleg.
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread
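
Paul's proposed fix, as a sketch of the holdout-scan check only (not the
exact code that will appear in the next version): a task is dropped from the
holdout list either once it has noted a voluntary context switch or once it
is observed off the runqueue, which covers Oleg's sleep(FOREVER) case.

	list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
				rcu_tasks_holdout_list) {
		if (smp_load_acquire(&t->rcu_tasks_holdout) &&
		    ACCESS_ONCE(t->on_rq))
			continue;	/* still a holdout */
		ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
		list_del_init(&t->rcu_tasks_holdout_list);
	}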

* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-30 14:23             ` Paul E. McKenney
@ 2014-07-31  7:37               ` Mike Galbraith
  2014-07-31 16:38                 ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Galbraith @ 2014-07-31  7:37 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On Wed, 2014-07-30 at 07:23 -0700, Paul E. McKenney wrote:

> So the delta accounting is much of the pain.  Hmmm...

(overhead picture was incomplete, just fixing that...) 

executive summary:
nohz_full=NA cpu=3      604.2 KHz  1.000
nohz_full=3, cpu=3      303.5 KHz   .502
nohz_full=3, cpu=2      460.4 KHz   .761

boring details:
    nohz_full=NA, pipe-test cpu=3           nohz_full=3, pipe-test cpu=3               nohz_full=3, pipe-test cpu=2
    10.45%   __schedule                     8.74%   native_sched_clock                 9.22%   __schedule
    10.03%   system_call                    5.63%   __schedule                         5.29%   system_call
     4.86%   _raw_spin_lock_irqsave         4.75%   _raw_spin_lock                     4.79%   context_tracking_user_exit
     4.51%   __switch_to                    4.35%   reschedule_interrupt               3.81%   _raw_spin_lock_irqsave
     4.31%   copy_user_generic_string       3.91%   _raw_spin_unlock_irqrestore        3.57%   __switch_to
     3.50%   pipe_read                      3.35%   system_call                        3.45%   copy_user_generic_string
     3.02%   pipe_write                     2.73%   context_tracking_user_exit         2.90%   context_tracking_user_enter
     2.76%   mutex_lock                     2.30%   _raw_spin_lock_irqsave             2.86%   pipe_read
     2.30%   native_sched_clock             2.08%   context_tracking_user_enter        2.33%   mutex_lock
     2.27%   copy_page_to_iter_iovec        1.94%   __switch_to                        2.14%   pipe_write
     2.16%   mutex_unlock                   1.88%   copy_user_generic_string           1.89%   copy_page_to_iter_iovec
     2.15%   _raw_spin_unlock_irqrestore    1.80%   account_system_time                1.88%   tracesys
     1.86%   copy_page_from_iter_iovec      1.77%   rcu_eqs_enter_common.isra.42       1.78%   native_sched_clock
     1.85%   vfs_write                      1.60%   pipe_read                          1.70%   mutex_unlock
     1.67%   new_sync_read                  1.58%   pipe_write                         1.68%   _raw_spin_unlock_irqrestore
     1.61%   new_sync_write                 1.39%   mutex_lock                         1.67%   int_check_syscall_exit_work
     1.49%   vfs_read                       1.37%   enqueue_task_fair                  1.54%   __context_tracking_task_switch
     1.47%   fsnotify                       1.25%   rcu_eqs_exit_common.isra.43        1.39%   copy_page_from_iter_iovec
     1.43%   __fget_light                   1.14%   get_vtime_delta                    1.38%   new_sync_read
     1.36%   enqueue_task_fair              1.11%   flat_send_IPI_mask                 1.38%   __tick_nohz_task_switch
     1.28%   finish_task_switch             1.07%   tracesys                           1.37%   syscall_trace_leave
     1.26%   dequeue_task_fair              1.03%   dequeue_task_fair                  1.35%   vfs_write
     1.25%   __sb_start_write               1.01%   copy_page_to_iter_iovec            1.34%   new_sync_write
     1.22%   _raw_spin_lock_irq             1.01%   int_check_syscall_exit_work        1.31%   int_ret_from_sys_call
     1.20%   try_to_wake_up                 0.97%   vfs_write                          1.30%   enqueue_task_fair
     1.16%   update_curr                    0.94%   __context_tracking_task_switch     1.23%   fsnotify
     1.05%   __fsnotify_parent              0.93%   mutex_unlock                       1.22%   finish_task_switch
     1.03%   pick_next_task_fair            0.88%   copy_page_from_iter_iovec          1.14%   vfs_read
     1.02%   sys_write                      0.87%   new_sync_write                     1.12%   _raw_spin_lock_irq
     1.01%   sys_read                       0.86%   __fget_light                       1.08%   dequeue_task_fair
     1.00%   __wake_up_sync_key             0.85%   __sb_start_write                   1.06%   sys_read
     0.93%   __wake_up_common               0.85%   int_ret_from_sys_call              1.04%   int_with_check
     0.92%   copy_page_to_iter              0.83%   syscall_trace_leave                1.02%   update_curr
     0.90%   check_preempt_wakeup           0.78%   new_sync_read                      0.99%   syscall_trace_enter
     0.90%   __srcu_read_lock               0.78%   account_user_time                  0.96%   __fget_light
     0.89%   put_prev_task_fair             0.76%   update_curr                        0.93%   __sb_start_write
     0.88%   copy_page_from_iter            0.74%   fsnotify                           0.89%   copy_page_to_iter
     0.82%   __sb_end_write                 0.73%   try_to_wake_up                     0.87%   try_to_wake_up
     0.76%   __percpu_counter_add           0.71%   finish_task_switch                 0.86%   check_preempt_wakeup
     0.74%   prepare_to_wait                0.70%   _raw_spin_lock_irq                 0.86%   sys_write
     0.72%   touch_atime                    0.69%   __wake_up_sync_key                 0.83%   __fsnotify_parent
     0.71%   pipe_wait                      0.69%   __tick_nohz_task_switch            0.81%   __wake_up_sync_key



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-31  7:37               ` Mike Galbraith
@ 2014-07-31 16:38                 ` Paul E. McKenney
  2014-08-01  2:59                   ` Mike Galbraith
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2014-07-31 16:38 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On Thu, Jul 31, 2014 at 09:37:08AM +0200, Mike Galbraith wrote:
> On Wed, 2014-07-30 at 07:23 -0700, Paul E. McKenney wrote:
> 
> > So the delta accounting is much of the pain.  Hmmm...
> 
> (overhead picture was incomplete, just fixing that...) 

Thank you again!!!

And I have to ask...

Does building with CONFIG_NO_HZ_FULL_SYSIDLE=y slow things down even more?
If so, that would give me a rough idea of the cost of RCU's dyntick-idle
handling.

							Thanx, Paul

> executive summary:
> nohz_full=NA cpu=3      604.2 KHz  1.000
> nohz_full=3, cpu=3      303.5 KHz   .502
> nohz_full=3, cpu=2      460.4 KHz   .761
> 
> boring details:
>     nohz_full=NA, pipe-test cpu=3           nohz_full=3, pipe-test cpu=3               nohz_full=3, pipe-test cpu=2
>     10.45%   __schedule                     8.74%   native_sched_clock                 9.22%   __schedule
>     10.03%   system_call                    5.63%   __schedule                         5.29%   system_call
>      4.86%   _raw_spin_lock_irqsave         4.75%   _raw_spin_lock                     4.79%   context_tracking_user_exit
>      4.51%   __switch_to                    4.35%   reschedule_interrupt               3.81%   _raw_spin_lock_irqsave
>      4.31%   copy_user_generic_string       3.91%   _raw_spin_unlock_irqrestore        3.57%   __switch_to
>      3.50%   pipe_read                      3.35%   system_call                        3.45%   copy_user_generic_string
>      3.02%   pipe_write                     2.73%   context_tracking_user_exit         2.90%   context_tracking_user_enter
>      2.76%   mutex_lock                     2.30%   _raw_spin_lock_irqsave             2.86%   pipe_read
>      2.30%   native_sched_clock             2.08%   context_tracking_user_enter        2.33%   mutex_lock
>      2.27%   copy_page_to_iter_iovec        1.94%   __switch_to                        2.14%   pipe_write
>      2.16%   mutex_unlock                   1.88%   copy_user_generic_string           1.89%   copy_page_to_iter_iovec
>      2.15%   _raw_spin_unlock_irqrestore    1.80%   account_system_time                1.88%   tracesys
>      1.86%   copy_page_from_iter_iovec      1.77%   rcu_eqs_enter_common.isra.42       1.78%   native_sched_clock
>      1.85%   vfs_write                      1.60%   pipe_read                          1.70%   mutex_unlock
>      1.67%   new_sync_read                  1.58%   pipe_write                         1.68%   _raw_spin_unlock_irqrestore
>      1.61%   new_sync_write                 1.39%   mutex_lock                         1.67%   int_check_syscall_exit_work
>      1.49%   vfs_read                       1.37%   enqueue_task_fair                  1.54%   __context_tracking_task_switch
>      1.47%   fsnotify                       1.25%   rcu_eqs_exit_common.isra.43        1.39%   copy_page_from_iter_iovec
>      1.43%   __fget_light                   1.14%   get_vtime_delta                    1.38%   new_sync_read
>      1.36%   enqueue_task_fair              1.11%   flat_send_IPI_mask                 1.38%   __tick_nohz_task_switch
>      1.28%   finish_task_switch             1.07%   tracesys                           1.37%   syscall_trace_leave
>      1.26%   dequeue_task_fair              1.03%   dequeue_task_fair                  1.35%   vfs_write
>      1.25%   __sb_start_write               1.01%   copy_page_to_iter_iovec            1.34%   new_sync_write
>      1.22%   _raw_spin_lock_irq             1.01%   int_check_syscall_exit_work        1.31%   int_ret_from_sys_call
>      1.20%   try_to_wake_up                 0.97%   vfs_write                          1.30%   enqueue_task_fair
>      1.16%   update_curr                    0.94%   __context_tracking_task_switch     1.23%   fsnotify
>      1.05%   __fsnotify_parent              0.93%   mutex_unlock                       1.22%   finish_task_switch
>      1.03%   pick_next_task_fair            0.88%   copy_page_from_iter_iovec          1.14%   vfs_read
>      1.02%   sys_write                      0.87%   new_sync_write                     1.12%   _raw_spin_lock_irq
>      1.01%   sys_read                       0.86%   __fget_light                       1.08%   dequeue_task_fair
>      1.00%   __wake_up_sync_key             0.85%   __sb_start_write                   1.06%   sys_read
>      0.93%   __wake_up_common               0.85%   int_ret_from_sys_call              1.04%   int_with_check
>      0.92%   copy_page_to_iter              0.83%   syscall_trace_leave                1.02%   update_curr
>      0.90%   check_preempt_wakeup           0.78%   new_sync_read                      0.99%   syscall_trace_enter
>      0.90%   __srcu_read_lock               0.78%   account_user_time                  0.96%   __fget_light
>      0.89%   put_prev_task_fair             0.76%   update_curr                        0.93%   __sb_start_write
>      0.88%   copy_page_from_iter            0.74%   fsnotify                           0.89%   copy_page_to_iter
>      0.82%   __sb_end_write                 0.73%   try_to_wake_up                     0.87%   try_to_wake_up
>      0.76%   __percpu_counter_add           0.71%   finish_task_switch                 0.86%   check_preempt_wakeup
>      0.74%   prepare_to_wait                0.70%   _raw_spin_lock_irq                 0.86%   sys_write
>      0.72%   touch_atime                    0.69%   __wake_up_sync_key                 0.83%   __fsnotify_parent
>      0.71%   pipe_wait                      0.69%   __tick_nohz_task_switch            0.81%   __wake_up_sync_key
> 
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-07-31 16:38                 ` Paul E. McKenney
@ 2014-08-01  2:59                   ` Mike Galbraith
  2014-08-01 15:16                     ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Galbraith @ 2014-08-01  2:59 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On Thu, 2014-07-31 at 09:38 -0700, Paul E. McKenney wrote:

> Does building with CONFIG_NO_HZ_FULL_SYSIDLE=y slow things down even more?
> If so, that would give me a rough idea of the cost of RCU's dyntick-idle
> handling.

Nope.  Deltas are all down in the statistical frog hair.

-Mike


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks()
  2014-08-01  2:59                   ` Mike Galbraith
@ 2014-08-01 15:16                     ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2014-08-01 15:16 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, bobby.prani

On Fri, Aug 01, 2014 at 04:59:35AM +0200, Mike Galbraith wrote:
> On Thu, 2014-07-31 at 09:38 -0700, Paul E. McKenney wrote:
> 
> > Does building with CONFIG_NO_HZ_FULL_SYSIDLE=y slow things down even more?
> > If so, that would give me a rough idea of the cost of RCU's dyntick-idle
> > handling.
> 
> Nope.  Deltas are all down in the statistical frog hair.

OK, then I guess that trying to squeeze down RCU's dyntick-idle detection
won't be much help.  Worth a thought, though, and thank you for running
the tests!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread

Thread overview: 46+ messages
2014-07-28 22:55 [PATCH tip/core/rcu 0/9] RCU-tasks implementation Paul E. McKenney
2014-07-28 22:56 ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Paul E. McKenney
2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops Paul E. McKenney
2014-07-29  7:55     ` Peter Zijlstra
2014-07-29 16:22       ` Paul E. McKenney
2014-07-29 17:25         ` Peter Zijlstra
2014-07-29 17:33           ` Paul E. McKenney
2014-07-29 17:36             ` Peter Zijlstra
2014-07-29 17:37               ` Peter Zijlstra
2014-07-29 17:55                 ` Paul E. McKenney
2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks Paul E. McKenney
2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 4/9] rcu: Export RCU-tasks APIs to GPL modules Paul E. McKenney
2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 5/9] rcutorture: Add torture tests for RCU-tasks Paul E. McKenney
2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 6/9] rcutorture: Add RCU-tasks test cases Paul E. McKenney
2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 7/9] rcu: Add stall-warning checks for RCU-tasks Paul E. McKenney
2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 8/9] rcu: Make RCU-tasks track exiting tasks Paul E. McKenney
2014-07-30 17:04     ` Oleg Nesterov
2014-07-30 18:24       ` Paul E. McKenney
2014-07-28 22:56   ` [PATCH RFC tip/core/rcu 9/9] rcu: Improve RCU-tasks energy efficiency Paul E. McKenney
2014-07-29  7:50   ` [PATCH RFC tip/core/rcu 1/9] rcu: Add call_rcu_tasks() Peter Zijlstra
2014-07-29 15:57     ` Paul E. McKenney
2014-07-29 16:07       ` Peter Zijlstra
2014-07-29 16:33         ` Paul E. McKenney
2014-07-29 17:31           ` Peter Zijlstra
2014-07-29 18:19             ` Paul E. McKenney
2014-07-29 19:25               ` Peter Zijlstra
2014-07-29 20:11                 ` Paul E. McKenney
2014-07-29  8:12   ` Peter Zijlstra
2014-07-29 16:36     ` Paul E. McKenney
2014-07-29  8:12   ` Peter Zijlstra
2014-07-29  8:14   ` Peter Zijlstra
2014-07-29 17:23     ` Paul E. McKenney
2014-07-29 17:33       ` Peter Zijlstra
2014-07-29 18:06         ` Paul E. McKenney
2014-07-30 13:23           ` Mike Galbraith
2014-07-30 14:23             ` Paul E. McKenney
2014-07-31  7:37               ` Mike Galbraith
2014-07-31 16:38                 ` Paul E. McKenney
2014-08-01  2:59                   ` Mike Galbraith
2014-08-01 15:16                     ` Paul E. McKenney
2014-07-30  6:52   ` Lai Jiangshan
2014-07-30 15:07     ` Paul E. McKenney
2014-07-30 13:41   ` Frederic Weisbecker
2014-07-30 16:10     ` Paul E. McKenney
2014-07-30 15:49   ` Oleg Nesterov
2014-07-30 16:08     ` Paul E. McKenney
