linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6] Optimize the cpu hotplug locking -v2
@ 2013-10-08 10:25 Peter Zijlstra
  2013-10-08 10:25 ` [PATCH 1/6] hotplug: Optimize {get,put}_online_cpus() Peter Zijlstra
                   ` (7 more replies)
  0 siblings, 8 replies; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-08 10:25 UTC (permalink / raw)
  To: Oleg Nesterov, Paul McKenney
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Steven Rostedt, Linus Torvalds, linux-kernel, Peter Zijlstra

The current cpu hotplug lock is a single global lock; therefore excluding
hotplug is a very expensive proposition even though it is rare occurrence under
normal operation.

There is a desire for a more light weight implementation of
{get,put}_online_cpus() from both the NUMA scheduling as well as the -RT side.

The current hotplug lock is a full reader preference lock -- and thus supports
reader recursion. However since we're making the read side lock much cheaper it
is the expectation that it will also be used far more. Which in turn would lead
to writer starvation.

Therefore the new lock proposed is completely fair; albeit somewhat expensive
on the write side. This in turn means that we need a per-task nesting count to
support reader recursion.

This patch-set is in 3 parts;

 1) The new hotplug lock implementation, very fast on the read side,
    very expensive on the write side; patch 1

 2) Some new rcu_sync primitive by Oleg; patches 2,4-6

 3) Employment of the rcy_sync thingy to optimize the write-side of the
    new hotplug lock; patch 3


Barring further comments/objections I'll stuff this lot into -tip and go look
at getting the numa bits into there too.

Changes since -v1:
 - Addressed all comments from Paul
 - Actually made sure to send the version that builds
 - Added a few patches from Oleg extending the rcu_sync primitive
 - Added Reviewed-by for Oleg to patches 1,3 -- please holler if you disagree!


---
 include/linux/rcusync.h   |   57 +++++++++++
 kernel/rcusync.c          |  152 +++++++++++++++++++++++++++++++
 include/linux/cpu.h       |   68 +++++++++++++-
 include/linux/sched.h     |    3 
 kernel/Makefile           |    3 
 kernel/cpu.c              |  223 +++++++++++++++++++++++++++++++++-------------
 kernel/sched/core.c       |    2 
 7 files changed, 444 insertions(+), 64 deletions(-)


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 1/6] hotplug: Optimize {get,put}_online_cpus()
  2013-10-08 10:25 [PATCH 0/6] Optimize the cpu hotplug locking -v2 Peter Zijlstra
@ 2013-10-08 10:25 ` Peter Zijlstra
  2013-10-08 15:08   ` Rik van Riel
  2013-10-10  5:47   ` Andrew Morton
  2013-10-08 10:25 ` [PATCH 2/6] rcu: Create rcu_sync infrastructure Peter Zijlstra
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-08 10:25 UTC (permalink / raw)
  To: Oleg Nesterov, Paul McKenney
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Steven Rostedt, Linus Torvalds, linux-kernel, Peter Zijlstra

[-- Attachment #1: peterz-per-cpu-hotplug.patch --]
[-- Type: text/plain, Size: 12126 bytes --]

The current implementation of get_online_cpus() is global of nature
and thus not suited for any kind of common usage.

Re-implement the current recursive r/w cpu hotplug lock such that the
read side locks are as light as possible.

The current cpu hotplug lock is entirely reader biased; but since
readers are expensive there aren't a lot of them about and writer
starvation isn't a particular problem.

However by making the reader side more usable there is a fair chance
it will get used more and thus the starvation issue becomes a real
possibility.

Therefore this new implementation is fair, alternating readers and
writers; this however requires per-task state to allow the reader
recursion.

Many comments are contributed by Paul McKenney, and many previous
attempts were shown to be inadequate by both Paul and Oleg; many
thanks to them for persisting to poke holes in my attempts.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/cpu.h   |   67 ++++++++++++++
 include/linux/sched.h |    3 
 kernel/cpu.c          |  229 ++++++++++++++++++++++++++++++++++++--------------
 kernel/sched/core.c   |    2 
 4 files changed, 238 insertions(+), 63 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,8 @@
 #include <linux/node.h>
 #include <linux/compiler.h>
 #include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/sched.h>
 
 struct device;
 
@@ -173,10 +175,69 @@ extern struct bus_type cpu_subsys;
 #ifdef CONFIG_HOTPLUG_CPU
 /* Stop CPUs going up and down. */
 
+extern void cpu_hotplug_init_task(struct task_struct *p);
+
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern int __cpuhp_state;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+	might_sleep();
+
+	/* Support reader recursion */
+	/* The value was >= 1 and remains so, reordering causes no harm. */
+	if (current->cpuhp_ref++)
+		return;
+
+	preempt_disable();
+	/*
+	 * We are in an RCU-sched read-side critical section, so the writer
+	 * cannot both change __cpuhp_state from readers_fast and start
+	 * checking counters while we are here. So if we see !__cpuhp_state,
+	 * we know that the writer won't be checking until we past the
+	 * preempt_enable() and that once the synchronize_sched() is done, the
+	 * writer will see anything we did within this RCU-sched read-side
+	 * critical section.
+	 */
+	if (likely(!__cpuhp_state))
+		__this_cpu_inc(__cpuhp_refcount);
+	else
+		__get_online_cpus(); /* Unconditional memory barrier. */
+	preempt_enable();
+	/*
+	 * The barrier() from preempt_enable() prevents the compiler from
+	 * bleeding the critical section out.
+	 */
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+	/* The value was >= 1 and remains so, reordering causes no harm. */
+	if (--current->cpuhp_ref)
+		return;
+
+	/*
+	 * The barrier() in preempt_disable() prevents the compiler from
+	 * bleeding the critical section out.
+	 */
+	preempt_disable();
+	/*
+	 * Same as in get_online_cpus().
+	 */
+	if (likely(!__cpuhp_state))
+		__this_cpu_dec(__cpuhp_refcount);
+	else
+		__put_online_cpus(); /* Unconditional memory barrier. */
+	preempt_enable();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
@@ -200,6 +261,8 @@ static inline void cpu_hotplug_driver_un
 
 #else		/* CONFIG_HOTPLUG_CPU */
 
+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
 static inline void cpu_hotplug_begin(void) {}
 static inline void cpu_hotplug_done(void) {}
 #define get_online_cpus()	do { } while (0)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1039,6 +1039,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
+#ifdef CONFIG_HOTPLUG_CPU
+	int cpuhp_ref;
+#endif
 	struct task_struct *last_wakee;
 	unsigned long wakee_flips;
 	unsigned long wakee_flip_decay_ts;
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,195 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-	struct task_struct *active_writer;
-	struct mutex lock; /* Synchronizes accesses to refcount, */
-	/*
-	 * Also blocks the new readers during
-	 * an ongoing cpu hotplug operation.
-	 */
-	int refcount;
-} cpu_hotplug = {
-	.active_writer = NULL,
-	.lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
-	.refcount = 0,
-};
+enum { readers_fast = 0, readers_slow, readers_block };
 
-void get_online_cpus(void)
+int __cpuhp_state;
+EXPORT_SYMBOL_GPL(__cpuhp_state);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static atomic_t cpuhp_waitcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
+
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+	p->cpuhp_ref = 0;
+}
+
+void __get_online_cpus(void)
 {
-	might_sleep();
-	if (cpu_hotplug.active_writer == current)
+again:
+	__this_cpu_inc(__cpuhp_refcount);
+
+	/*
+	 * Due to having preemption disabled the decrement happens on
+	 * the same CPU as the increment, avoiding the
+	 * increment-on-one-CPU-and-decrement-on-another problem.
+	 *
+	 * And yes, if the reader misses the writer's assignment of
+	 * readers_block to __cpuhp_state, then the writer is
+	 * guaranteed to see the reader's increment.  Conversely, any
+	 * readers that increment their __cpuhp_refcount after the
+	 * writer looks are guaranteed to see the readers_block value,
+	 * which in turn means that they are guaranteed to immediately
+	 * decrement their __cpuhp_refcount, so that it doesn't matter
+	 * that the writer missed them.
+	 */
+
+	smp_mb(); /* A matches D */
+
+	if (likely(__cpuhp_state != readers_block))
 		return;
-	mutex_lock(&cpu_hotplug.lock);
-	cpu_hotplug.refcount++;
-	mutex_unlock(&cpu_hotplug.lock);
 
+	/*
+	 * Make sure an outgoing writer sees the waitcount to ensure we
+	 * make progress.
+	 */
+	atomic_inc(&cpuhp_waitcount);
+
+	/*
+	 * Per the above comment; we still have preemption disabled and
+	 * will thus decrement on the same CPU as we incremented.
+	 */
+	__put_online_cpus();
+
+	/*
+	 * We either call schedule() in the wait, or we'll fall through
+	 * and reschedule on the preempt_enable() in get_online_cpus().
+	 */
+	preempt_enable_no_resched();
+	__wait_event(cpuhp_readers, __cpuhp_state != readers_block);
+	preempt_disable();
+
+	/*
+	 * Given we've still got preempt_disabled and new cpu_hotplug_begin()
+	 * must do a synchronize_sched() we're guaranteed a successfull
+	 * acquisition this time -- even if we wake the current
+	 * cpu_hotplug_end() now.
+	 */
+	if (atomic_dec_and_test(&cpuhp_waitcount))
+		wake_up(&cpuhp_writer);
+
+	goto again;
 }
-EXPORT_SYMBOL_GPL(get_online_cpus);
+EXPORT_SYMBOL_GPL(__get_online_cpus);
 
-void put_online_cpus(void)
+void __put_online_cpus(void)
 {
-	if (cpu_hotplug.active_writer == current)
-		return;
-	mutex_lock(&cpu_hotplug.lock);
+	smp_mb(); /* B matches C */
+	/*
+	 * In other words, if they see our decrement (presumably to aggregate
+	 * zero, as that is the only time it matters) they will also see our
+	 * critical section.
+	 */
+	this_cpu_dec(__cpuhp_refcount);
 
-	if (WARN_ON(!cpu_hotplug.refcount))
-		cpu_hotplug.refcount++; /* try to fix things up */
+	/* Prod writer to recheck readers_active */
+	wake_up(&cpuhp_writer);
+}
+EXPORT_SYMBOL_GPL(__put_online_cpus);
 
-	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
-		wake_up_process(cpu_hotplug.active_writer);
-	mutex_unlock(&cpu_hotplug.lock);
+#define per_cpu_sum(var)						\
+({ 									\
+ 	typeof(var) __sum = 0;						\
+ 	int cpu;							\
+ 	for_each_possible_cpu(cpu)					\
+ 		__sum += per_cpu(var, cpu);				\
+ 	__sum;								\
+})
 
+/*
+ * Return true if the modular sum of the __cpuhp_refcount per-CPU variables is
+ * zero.  If this sum is zero, then it is stable due to the fact that if any
+ * newly arriving readers increment a given counter, they will immediately
+ * decrement that same counter.
+ */
+static bool cpuhp_readers_active_check(void)
+{
+	if (per_cpu_sum(__cpuhp_refcount) != 0)
+		return false;
+
+	/*
+	 * If we observed the decrement; ensure we see the entire critical
+	 * section.
+	 */
+
+	smp_mb(); /* C matches B */
+
+	return true;
 }
-EXPORT_SYMBOL_GPL(put_online_cpus);
 
 /*
- * This ensures that the hotplug operation can begin only when the
- * refcount goes to zero.
- *
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
- * Since cpu_hotplug_begin() is always called after invoking
- * cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- *   writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- *   non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
+ * This will notify new readers to block and wait for all active readers to
+ * complete.
  */
 void cpu_hotplug_begin(void)
 {
-	cpu_hotplug.active_writer = current;
+	/*
+	 * Since cpu_hotplug_begin() is always called after invoking
+	 * cpu_maps_update_begin(), we can be sure that only one writer is
+	 * active.
+	 */
+	lockdep_assert_held(&cpu_add_remove_lock);
+
+	/* Allow reader-in-writer recursion. */
+	current->cpuhp_ref++;
+
+	/* Notify readers to take the slow path. */
+	__cpuhp_state = readers_slow;
+
+	/* See percpu_down_write(); guarantees all readers take the slow path */
+	synchronize_sched();
+
+	/*
+	 * Notify new readers to block; up until now, and thus throughout the
+	 * longish synchronize_sched() above, new readers could still come in.
+	 */
+	__cpuhp_state = readers_block;
 
-	for (;;) {
-		mutex_lock(&cpu_hotplug.lock);
-		if (likely(!cpu_hotplug.refcount))
-			break;
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-		mutex_unlock(&cpu_hotplug.lock);
-		schedule();
-	}
+	smp_mb(); /* D matches A */
+
+	/*
+	 * If they don't see our writer of readers_block to __cpuhp_state,
+	 * then we are guaranteed to see their __cpuhp_refcount increment, and
+	 * therefore will wait for them.
+	 */
+
+	/* Wait for all now active readers to complete. */
+	wait_event(cpuhp_writer, cpuhp_readers_active_check());
 }
 
 void cpu_hotplug_done(void)
 {
-	cpu_hotplug.active_writer = NULL;
-	mutex_unlock(&cpu_hotplug.lock);
+	/*
+	 * Signal the writer is done, no fast path yet.
+	 *
+	 * One reason that we cannot just immediately flip to readers_fast is
+	 * that new readers might fail to see the results of this writer's
+	 * critical section.
+	 */
+	__cpuhp_state = readers_slow;
+	wake_up_all(&cpuhp_readers);
+
+	/*
+	 * The wait_event()/wake_up_all() prevents the race where the readers
+	 * are delayed between fetching __cpuhp_state and blocking.
+	 */
+
+	/* See percpu_up_write(); readers will no longer attempt to block. */
+	synchronize_sched();
+
+	/* Let 'em rip */
+	__cpuhp_state = readers_fast;
+	current->cpuhp_ref--;
+
+	/*
+	 * Wait for any pending readers to be running. This ensures readers
+	 * after writer and avoids writers starving readers.
+	 */
+	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
 }
 
 /*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1635,6 +1635,8 @@ static void __sched_fork(struct task_str
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_work.next = &p->numa_work;
 #endif /* CONFIG_NUMA_BALANCING */
+
+	cpu_hotplug_init_task(p);
 }
 
 #ifdef CONFIG_NUMA_BALANCING



^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 2/6] rcu: Create rcu_sync infrastructure
  2013-10-08 10:25 [PATCH 0/6] Optimize the cpu hotplug locking -v2 Peter Zijlstra
  2013-10-08 10:25 ` [PATCH 1/6] hotplug: Optimize {get,put}_online_cpus() Peter Zijlstra
@ 2013-10-08 10:25 ` Peter Zijlstra
  2013-10-08 20:40   ` Jonathan Corbet
                     ` (2 more replies)
  2013-10-08 10:25 ` [PATCH 3/6] hotplug: Optimize cpu_hotplug_{begin,done}() using rcu_sync Peter Zijlstra
                   ` (5 subsequent siblings)
  7 siblings, 3 replies; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-08 10:25 UTC (permalink / raw)
  To: Oleg Nesterov, Paul McKenney
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Steven Rostedt, Linus Torvalds, linux-kernel, Peter Zijlstra

[-- Attachment #1: oleg-sync_sched.patch --]
[-- Type: text/plain, Size: 6084 bytes --]

From: Oleg Nesterov <oleg@redhat.com>

It is functionally equivalent to

        struct rcu_sync_struct {
                atomic_t counter;
        };

        static inline bool rcu_sync_is_idle(struct rcu_sync_struct *xxx)
        {
                return atomic_read(&xxx->counter) == 0;
        }

        static inline void rcu_sync_enter(struct rcu_sync_struct *xxx)
        {
                atomic_inc(&xxx->counter);
                synchronize_sched();
        }

        static inline void rcu_sync_enter(struct rcu_sync_struct *xxx)
        {
                synchronize_sched();
                atomic_dec(&xxx->counter);
        }

except: it records the state and synchronize_sched() is only called by
rcu_sync_enter() and only if necessary.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130929183634.GA15563@redhat.com
---
 include/linux/rcusync.h |   64 ++++++++++++++++++++++++++++
 kernel/Makefile         |    3 -
 kernel/rcusync.c        |  108 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 174 insertions(+), 1 deletion(-)

--- /dev/null
+++ b/include/linux/rcusync.h
@@ -0,0 +1,64 @@
+#ifndef _LINUX_RCUSYNC_H_
+#define _LINUX_RCUSYNC_H_
+
+#include <linux/wait.h>
+#include <linux/rcupdate.h>
+
+struct rcu_sync_struct {
+	int			gp_state;
+	int			gp_count;
+	wait_queue_head_t	gp_wait;
+
+	int			cb_state;
+	struct rcu_head		cb_head;
+
+	void (*sync)(void);
+	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
+};
+
+#define ___RCU_SYNC_INIT(name)						\
+	.gp_state = 0,							\
+	.gp_count = 0,							\
+	.gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait),		\
+	.cb_state = 0
+
+#define __RCU_SCHED_SYNC_INIT(name) {					\
+	___RCU_SYNC_INIT(name),						\
+	.sync = synchronize_sched,					\
+	.call = call_rcu_sched,						\
+}
+
+#define __RCU_BH_SYNC_INIT(name) {					\
+	___RCU_SYNC_INIT(name),						\
+	.sync = synchronize_rcu_bh,					\
+	.call = call_rcu_bh,						\
+}
+
+#define __RCU_SYNC_INIT(name) {						\
+	___RCU_SYNC_INIT(name),						\
+	.sync = synchronize_rcu,					\
+	.call = call_rcu,						\
+}
+
+#define DEFINE_RCU_SCHED_SYNC(name)					\
+	struct rcu_sync_struct name = __RCU_SCHED_SYNC_INIT(name)
+
+#define DEFINE_RCU_BH_SYNC(name)					\
+	struct rcu_sync_struct name = __RCU_BH_SYNC_INIT(name)
+
+#define DEFINE_RCU_SYNC(name)						\
+	struct rcu_sync_struct name = __RCU_SYNC_INIT(name)
+
+static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
+{
+	return !rss->gp_state; /* GP_IDLE */
+}
+
+enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
+
+extern void rcu_sync_init(struct rcu_sync_struct *, enum rcu_sync_type);
+extern void rcu_sync_enter(struct rcu_sync_struct *);
+extern void rcu_sync_exit(struct rcu_sync_struct *);
+
+#endif /* _LINUX_RCUSYNC_H_ */
+
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -10,7 +10,8 @@ obj-y     = fork.o exec_domain.o panic.o
 	    kthread.o wait.o sys_ni.o posix-cpu-timers.o mutex.o \
 	    hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
 	    notifier.o ksysfs.o cred.o reboot.o \
-	    async.o range.o groups.o lglock.o smpboot.o
+	    async.o range.o groups.o lglock.o smpboot.o \
+	    rcusync.o
 
 ifdef CONFIG_FUNCTION_TRACER
 # Do not trace debug files and internal ftrace files
--- /dev/null
+++ b/kernel/rcusync.c
@@ -0,0 +1,108 @@
+
+#include <linux/rcusync.h>
+#include <linux/sched.h>
+
+enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
+enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
+
+#define	rss_lock	gp_wait.lock
+
+void rcu_sync_init(struct rcu_sync_struct *rss, enum rcu_sync_type type)
+{
+	memset(rss, 0, sizeof(*rss));
+	init_waitqueue_head(&rss->gp_wait);
+
+	switch (type) {
+	case RCU_SYNC:
+		rss->sync = synchronize_rcu;
+		rss->call = call_rcu;
+		break;
+
+	case RCU_SCHED_SYNC:
+		rss->sync = synchronize_sched;
+		rss->call = call_rcu_sched;
+		break;
+
+	case RCU_BH_SYNC:
+		rss->sync = synchronize_rcu_bh;
+		rss->call = call_rcu_bh;
+		break;
+	}
+}
+
+void rcu_sync_enter(struct rcu_sync_struct *rss)
+{
+	bool need_wait, need_sync;
+
+	spin_lock_irq(&rss->rss_lock);
+	need_wait = rss->gp_count++;
+	need_sync = rss->gp_state == GP_IDLE;
+	if (need_sync)
+		rss->gp_state = GP_PENDING;
+	spin_unlock_irq(&rss->rss_lock);
+
+	BUG_ON(need_wait && need_sync);
+
+	if (need_sync) {
+		rss->sync();
+		rss->gp_state = GP_PASSED;
+		wake_up_all(&rss->gp_wait);
+	} else if (need_wait) {
+		wait_event(rss->gp_wait, rss->gp_state == GP_PASSED);
+	} else {
+		/*
+		 * Possible when there's a pending CB from a rcu_sync_exit().
+		 * Nobody has yet been allowed the 'fast' path and thus we can
+		 * avoid doing any sync(). The callback will get 'dropped'.
+		 */
+		BUG_ON(rss->gp_state != GP_PASSED);
+	}
+}
+
+static void rcu_sync_func(struct rcu_head *rcu)
+{
+	struct rcu_sync_struct *rss =
+		container_of(rcu, struct rcu_sync_struct, cb_head);
+	unsigned long flags;
+
+
+	BUG_ON(rss->gp_state != GP_PASSED);
+	BUG_ON(rss->cb_state == CB_IDLE);
+
+	spin_lock_irqsave(&rss->rss_lock, flags);
+	if (rss->gp_count) {
+		/*
+		 * A new rcu_sync_begin() has happened; drop the callback.
+		 */
+		rss->cb_state = CB_IDLE;
+	} else if (rss->cb_state == CB_REPLAY) {
+		/*
+		 * A new rcu_sync_exit() has happened; requeue the callback
+		 * to catch a later GP.
+		 */
+		rss->cb_state = CB_PENDING;
+		rss->call(&rss->cb_head, rcu_sync_func);
+	} else {
+		/*
+		 * We're at least a GP after rcu_sync_exit(); eveybody will now
+		 * have observed the write side critical section. Let 'em rip!.
+		 */
+		rss->cb_state = CB_IDLE;
+		rss->gp_state = GP_IDLE;
+	}
+	spin_unlock_irqrestore(&rss->rss_lock, flags);
+}
+
+void rcu_sync_exit(struct rcu_sync_struct *rss)
+{
+	spin_lock_irq(&rss->rss_lock);
+	if (!--rss->gp_count) {
+		if (rss->cb_state == CB_IDLE) {
+			rss->cb_state = CB_PENDING;
+			rss->call(&rss->cb_head, rcu_sync_func);
+		} else if (rss->cb_state == CB_PENDING) {
+			rss->cb_state = CB_REPLAY;
+		}
+	}
+	spin_unlock_irq(&rss->rss_lock);
+}



^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 3/6] hotplug: Optimize cpu_hotplug_{begin,done}() using rcu_sync
  2013-10-08 10:25 [PATCH 0/6] Optimize the cpu hotplug locking -v2 Peter Zijlstra
  2013-10-08 10:25 ` [PATCH 1/6] hotplug: Optimize {get,put}_online_cpus() Peter Zijlstra
  2013-10-08 10:25 ` [PATCH 2/6] rcu: Create rcu_sync infrastructure Peter Zijlstra
@ 2013-10-08 10:25 ` Peter Zijlstra
  2013-10-08 16:28   ` Paul E. McKenney
  2013-10-08 10:25 ` [PATCH 4/6] rcusync: Introduce struct rcu_sync_ops Peter Zijlstra
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-08 10:25 UTC (permalink / raw)
  To: Oleg Nesterov, Paul McKenney
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Steven Rostedt, Linus Torvalds, linux-kernel, Peter Zijlstra

[-- Attachment #1: peterz-cpu-hotplug-2.patch --]
[-- Type: text/plain, Size: 5707 bytes --]

Use the fancy new rcu_sync bits from Oleg to optimize the fancy new
hotplug lock implementation.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/cpu.h |    7 +++---
 kernel/cpu.c        |   54 +++++++++++++++++++++++-----------------------------
 2 files changed, 28 insertions(+), 33 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -18,6 +18,7 @@
 #include <linux/cpumask.h>
 #include <linux/percpu.h>
 #include <linux/sched.h>
+#include <linux/rcusync.h>
 
 struct device;
 
@@ -180,7 +181,7 @@ extern void cpu_hotplug_init_task(struct
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
 
-extern int __cpuhp_state;
+extern struct rcu_sync_struct __cpuhp_rss;
 DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
 
 extern void __get_online_cpus(void);
@@ -204,7 +205,7 @@ static inline void get_online_cpus(void)
 	 * writer will see anything we did within this RCU-sched read-side
 	 * critical section.
 	 */
-	if (likely(!__cpuhp_state))
+	if (likely(rcu_sync_is_idle(&__cpuhp_rss)))
 		__this_cpu_inc(__cpuhp_refcount);
 	else
 		__get_online_cpus(); /* Unconditional memory barrier. */
@@ -231,7 +232,7 @@ static inline void put_online_cpus(void)
 	/*
 	 * Same as in get_online_cpus().
 	 */
-	if (likely(!__cpuhp_state))
+	if (likely(rcu_sync_is_idle(&__cpuhp_rss)))
 		__this_cpu_dec(__cpuhp_refcount);
 	else
 		__put_online_cpus(); /* Unconditional memory barrier. */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,14 +49,15 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-enum { readers_fast = 0, readers_slow, readers_block };
+enum { readers_slow, readers_block };
 
-int __cpuhp_state;
-EXPORT_SYMBOL_GPL(__cpuhp_state);
+DEFINE_RCU_SCHED_SYNC(__cpuhp_rss);
+EXPORT_SYMBOL_GPL(__cpuhp_rss);
 
 DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
 EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
 
+static int cpuhp_state = readers_slow;
 static atomic_t cpuhp_waitcount;
 static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
 static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
@@ -68,7 +69,6 @@ void cpu_hotplug_init_task(struct task_s
 
 void __get_online_cpus(void)
 {
-again:
 	__this_cpu_inc(__cpuhp_refcount);
 
 	/*
@@ -77,7 +77,7 @@ void __get_online_cpus(void)
 	 * increment-on-one-CPU-and-decrement-on-another problem.
 	 *
 	 * And yes, if the reader misses the writer's assignment of
-	 * readers_block to __cpuhp_state, then the writer is
+	 * readers_block to cpuhp_state, then the writer is
 	 * guaranteed to see the reader's increment.  Conversely, any
 	 * readers that increment their __cpuhp_refcount after the
 	 * writer looks are guaranteed to see the readers_block value,
@@ -88,7 +88,7 @@ void __get_online_cpus(void)
 
 	smp_mb(); /* A matches D */
 
-	if (likely(__cpuhp_state != readers_block))
+	if (likely(cpuhp_state != readers_block))
 		return;
 
 	/*
@@ -108,19 +108,19 @@ void __get_online_cpus(void)
 	 * and reschedule on the preempt_enable() in get_online_cpus().
 	 */
 	preempt_enable_no_resched();
-	__wait_event(cpuhp_readers, __cpuhp_state != readers_block);
+	__wait_event(cpuhp_readers, cpuhp_state != readers_block);
 	preempt_disable();
 
+	__this_cpu_inc(__cpuhp_refcount);
+
 	/*
-	 * Given we've still got preempt_disabled and new cpu_hotplug_begin()
-	 * must do a synchronize_sched() we're guaranteed a successfull
-	 * acquisition this time -- even if we wake the current
-	 * cpu_hotplug_end() now.
+	 * cpu_hotplug_done() waits until all pending readers are gone;
+	 * this means that a new cpu_hotplug_begin() must observe our
+	 * refcount increment and wait for it to go away.
 	 */
-	if (atomic_dec_and_test(&cpuhp_waitcount))
-		wake_up(&cpuhp_writer);
 
-	goto again;
+	if (atomic_dec_and_test(&cpuhp_waitcount)) /* A */
+		wake_up(&cpuhp_writer);
 }
 EXPORT_SYMBOL_GPL(__get_online_cpus);
 
@@ -186,21 +186,18 @@ void cpu_hotplug_begin(void)
 	current->cpuhp_ref++;
 
 	/* Notify readers to take the slow path. */
-	__cpuhp_state = readers_slow;
-
-	/* See percpu_down_write(); guarantees all readers take the slow path */
-	synchronize_sched();
+	rcu_sync_enter(&__cpuhp_rss);
 
 	/*
 	 * Notify new readers to block; up until now, and thus throughout the
-	 * longish synchronize_sched() above, new readers could still come in.
+	 * longish rcu_sync_enter() above, new readers could still come in.
 	 */
-	__cpuhp_state = readers_block;
+	cpuhp_state = readers_block;
 
 	smp_mb(); /* D matches A */
 
 	/*
-	 * If they don't see our writer of readers_block to __cpuhp_state,
+	 * If they don't see our writer of readers_block to cpuhp_state,
 	 * then we are guaranteed to see their __cpuhp_refcount increment, and
 	 * therefore will wait for them.
 	 */
@@ -218,26 +215,23 @@ void cpu_hotplug_done(void)
 	 * that new readers might fail to see the results of this writer's
 	 * critical section.
 	 */
-	__cpuhp_state = readers_slow;
+	cpuhp_state = readers_slow;
 	wake_up_all(&cpuhp_readers);
 
 	/*
 	 * The wait_event()/wake_up_all() prevents the race where the readers
-	 * are delayed between fetching __cpuhp_state and blocking.
+	 * are delayed between fetching cpuhp_state and blocking.
 	 */
 
-	/* See percpu_up_write(); readers will no longer attempt to block. */
-	synchronize_sched();
-
-	/* Let 'em rip */
-	__cpuhp_state = readers_fast;
 	current->cpuhp_ref--;
 
 	/*
-	 * Wait for any pending readers to be running. This ensures readers
-	 * after writer and avoids writers starving readers.
+	 * Wait for any pending readers to be running. This avoids writers
+	 * starving readers.
 	 */
 	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
+
+	rcu_sync_exit(&__cpuhp_rss);
 }
 
 /*



^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 4/6] rcusync: Introduce struct rcu_sync_ops
  2013-10-08 10:25 [PATCH 0/6] Optimize the cpu hotplug locking -v2 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2013-10-08 10:25 ` [PATCH 3/6] hotplug: Optimize cpu_hotplug_{begin,done}() using rcu_sync Peter Zijlstra
@ 2013-10-08 10:25 ` Peter Zijlstra
  2013-10-08 16:30   ` Paul E. McKenney
  2013-10-17  2:07   ` Lai Jiangshan
  2013-10-08 10:25 ` [PATCH 5/6] rcusync: Add the CONFIG_PROVE_RCU checks Peter Zijlstra
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-08 10:25 UTC (permalink / raw)
  To: Oleg Nesterov, Paul McKenney
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Steven Rostedt, Linus Torvalds, linux-kernel, Peter Zijlstra

[-- Attachment #1: oleg_nesterov-1_rcusync-introduce_struct_rcu_sync_ops.patch --]
[-- Type: text/plain, Size: 4990 bytes --]

From: Oleg Nesterov <oleg@redhat.com>

Add the new struct rcu_sync_ops which holds sync/call methods, and
turn the function pointers in rcu_sync_struct into an array of struct
rcu_sync_ops.

This simplifies the "init" helpers, and this way it is simpler to add
the new methods we need, especially ifdef'ed.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131005171746.GA17664@redhat.com
---

diff --git a/include/linux/rcusync.h b/include/linux/rcusync.h
index 7858491..988ec33 100644
--- a/include/linux/rcusync.h
+++ b/include/linux/rcusync.h
@@ -4,6 +4,8 @@
 #include <linux/wait.h>
 #include <linux/rcupdate.h>
 
+enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
+
 struct rcu_sync_struct {
 	int			gp_state;
 	int			gp_count;
@@ -12,53 +14,37 @@ struct rcu_sync_struct {
 	int			cb_state;
 	struct rcu_head		cb_head;
 
-	void (*sync)(void);
-	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
+	enum rcu_sync_type	gp_type;
 };
 
-#define ___RCU_SYNC_INIT(name)						\
-	.gp_state = 0,							\
-	.gp_count = 0,							\
-	.gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait),		\
-	.cb_state = 0
-
-#define __RCU_SCHED_SYNC_INIT(name) {					\
-	___RCU_SYNC_INIT(name),						\
-	.sync = synchronize_sched,					\
-	.call = call_rcu_sched,						\
-}
-
-#define __RCU_BH_SYNC_INIT(name) {					\
-	___RCU_SYNC_INIT(name),						\
-	.sync = synchronize_rcu_bh,					\
-	.call = call_rcu_bh,						\
-}
-
-#define __RCU_SYNC_INIT(name) {						\
-	___RCU_SYNC_INIT(name),						\
-	.sync = synchronize_rcu,					\
-	.call = call_rcu,						\
-}
-
-#define DEFINE_RCU_SCHED_SYNC(name)					\
-	struct rcu_sync_struct name = __RCU_SCHED_SYNC_INIT(name)
-
-#define DEFINE_RCU_BH_SYNC(name)					\
-	struct rcu_sync_struct name = __RCU_BH_SYNC_INIT(name)
-
-#define DEFINE_RCU_SYNC(name)						\
-	struct rcu_sync_struct name = __RCU_SYNC_INIT(name)
-
 static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
 {
 	return !rss->gp_state; /* GP_IDLE */
 }
 
-enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
-
 extern void rcu_sync_init(struct rcu_sync_struct *, enum rcu_sync_type);
 extern void rcu_sync_enter(struct rcu_sync_struct *);
 extern void rcu_sync_exit(struct rcu_sync_struct *);
 
+#define __RCU_SYNC_INITIALIZER(name, type) {				\
+		.gp_state = 0,						\
+		.gp_count = 0,						\
+		.gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait),	\
+		.cb_state = 0,						\
+		.gp_type = type,					\
+	}
+
+#define	__DEFINE_RCU_SYNC(name, type)	\
+	struct rcu_sync_struct name = __RCU_SYNC_INITIALIZER(name, type)
+
+#define DEFINE_RCU_SYNC(name)		\
+	__DEFINE_RCU_SYNC(name, RCU_SYNC)
+
+#define DEFINE_RCU_SCHED_SYNC(name)	\
+	__DEFINE_RCU_SYNC(name, RCU_SCHED_SYNC)
+
+#define DEFINE_RCU_BH_SYNC(name)	\
+	__DEFINE_RCU_SYNC(name, RCU_BH_SYNC)
+
 #endif /* _LINUX_RCUSYNC_H_ */
 
diff --git a/kernel/rcusync.c b/kernel/rcusync.c
index f84176a..99051b7 100644
--- a/kernel/rcusync.c
+++ b/kernel/rcusync.c
@@ -1,7 +1,24 @@
-
 #include <linux/rcusync.h>
 #include <linux/sched.h>
 
+static const struct {
+	void (*sync)(void);
+	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
+} gp_ops[] = {
+	[RCU_SYNC] = {
+		.sync = synchronize_rcu,
+		.call = call_rcu,
+	},
+	[RCU_SCHED_SYNC] = {
+		.sync = synchronize_sched,
+		.call = call_rcu_sched,
+	},
+	[RCU_BH_SYNC] = {
+		.sync = synchronize_rcu_bh,
+		.call = call_rcu_bh,
+	},
+};
+
 enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
 enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
 
@@ -11,23 +28,7 @@ void rcu_sync_init(struct rcu_sync_struct *rss, enum rcu_sync_type type)
 {
 	memset(rss, 0, sizeof(*rss));
 	init_waitqueue_head(&rss->gp_wait);
-
-	switch (type) {
-	case RCU_SYNC:
-		rss->sync = synchronize_rcu;
-		rss->call = call_rcu;
-		break;
-
-	case RCU_SCHED_SYNC:
-		rss->sync = synchronize_sched;
-		rss->call = call_rcu_sched;
-		break;
-
-	case RCU_BH_SYNC:
-		rss->sync = synchronize_rcu_bh;
-		rss->call = call_rcu_bh;
-		break;
-	}
+	rss->gp_type = type;
 }
 
 void rcu_sync_enter(struct rcu_sync_struct *rss)
@@ -44,7 +45,7 @@ void rcu_sync_enter(struct rcu_sync_struct *rss)
 	BUG_ON(need_wait && need_sync);
 
 	if (need_sync) {
-		rss->sync();
+		gp_ops[rss->gp_type].sync();
 		rss->gp_state = GP_PASSED;
 		wake_up_all(&rss->gp_wait);
 	} else if (need_wait) {
@@ -81,7 +82,7 @@ static void rcu_sync_func(struct rcu_head *rcu)
 		 * to catch a later GP.
 		 */
 		rss->cb_state = CB_PENDING;
-		rss->call(&rss->cb_head, rcu_sync_func);
+		gp_ops[rss->gp_type].call(&rss->cb_head, rcu_sync_func);
 	} else {
 		/*
 		 * We're at least a GP after rcu_sync_exit(); eveybody will now
@@ -99,7 +100,7 @@ void rcu_sync_exit(struct rcu_sync_struct *rss)
 	if (!--rss->gp_count) {
 		if (rss->cb_state == CB_IDLE) {
 			rss->cb_state = CB_PENDING;
-			rss->call(&rss->cb_head, rcu_sync_func);
+			gp_ops[rss->gp_type].call(&rss->cb_head, rcu_sync_func);
 		} else if (rss->cb_state == CB_PENDING) {
 			rss->cb_state = CB_REPLAY;
 		}




^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 5/6] rcusync: Add the CONFIG_PROVE_RCU checks
  2013-10-08 10:25 [PATCH 0/6] Optimize the cpu hotplug locking -v2 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2013-10-08 10:25 ` [PATCH 4/6] rcusync: Introduce struct rcu_sync_ops Peter Zijlstra
@ 2013-10-08 10:25 ` Peter Zijlstra
  2013-10-08 16:30   ` Paul E. McKenney
  2013-10-08 10:25 ` [PATCH 6/6] rcusync: Introduce rcu_sync_dtor() Peter Zijlstra
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-08 10:25 UTC (permalink / raw)
  To: Oleg Nesterov, Paul McKenney
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Steven Rostedt, Linus Torvalds, linux-kernel, Peter Zijlstra

[-- Attachment #1: oleg_nesterov-2_rcusync-add_the_config_prove_rcu_checks.patch --]
[-- Type: text/plain, Size: 2346 bytes --]

From: Oleg Nesterov <oleg@redhat.com>

It would be nice to validate that the caller of rcu_sync_is_idle()
holds the corresponding type of RCU read-side lock. Add the new
rcu_sync_ops->held() method and change rcu_sync_is_idle() to
WARN() if it returns false.

This obviously penalizes the readers (fast-path), but only if
CONFIG_PROVE_RCU.

Suggested-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131004184633.GA17557@redhat.com
---
 include/linux/rcusync.h |    6 ++++++
 kernel/rcusync.c        |   21 +++++++++++++++++++++
 2 files changed, 27 insertions(+)

--- a/include/linux/rcusync.h
+++ b/include/linux/rcusync.h
@@ -17,9 +17,15 @@ struct rcu_sync_struct {
 	enum rcu_sync_type	gp_type;
 };
 
+extern bool __rcu_sync_is_idle(struct rcu_sync_struct *);
+
 static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
 {
+#ifdef CONFIG_PROVE_RCU
+	return __rcu_sync_is_idle(rss);
+#else
 	return !rss->gp_state; /* GP_IDLE */
+#endif
 }
 
 extern void rcu_sync_init(struct rcu_sync_struct *, enum rcu_sync_type);
--- a/kernel/rcusync.c
+++ b/kernel/rcusync.c
@@ -1,21 +1,33 @@
 #include <linux/rcusync.h>
 #include <linux/sched.h>
 
+#ifdef CONFIG_PROVE_RCU
+#define __INIT_HELD(func)	.held = func,
+#else
+#define __INIT_HELD(func)
+#endif
+
 static const struct {
 	void (*sync)(void);
 	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
+#ifdef CONFIG_PROVE_RCU
+	int  (*held)(void);
+#endif
 } gp_ops[] = {
 	[RCU_SYNC] = {
 		.sync = synchronize_rcu,
 		.call = call_rcu,
+		__INIT_HELD(rcu_read_lock_held)
 	},
 	[RCU_SCHED_SYNC] = {
 		.sync = synchronize_sched,
 		.call = call_rcu_sched,
+		__INIT_HELD(rcu_read_lock_sched_held)
 	},
 	[RCU_BH_SYNC] = {
 		.sync = synchronize_rcu_bh,
 		.call = call_rcu_bh,
+		__INIT_HELD(rcu_read_lock_bh_held)
 	},
 };
 
@@ -24,6 +36,15 @@ enum { CB_IDLE = 0, CB_PENDING, CB_REPLA
 
 #define	rss_lock	gp_wait.lock
 
+#ifdef CONFIG_PROVE_RCU
+bool __rcu_sync_is_idle(struct rcu_sync_struct *rss)
+{
+	WARN_ON(!gp_ops[rss->gp_type].held());
+	return rss->gp_state == GP_IDLE;
+}
+EXPORT_SYMBOL_GPL(__rcu_sync_is_idle);
+#endif
+
 void rcu_sync_init(struct rcu_sync_struct *rss, enum rcu_sync_type type)
 {
 	memset(rss, 0, sizeof(*rss));



^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 6/6] rcusync: Introduce rcu_sync_dtor()
  2013-10-08 10:25 [PATCH 0/6] Optimize the cpu hotplug locking -v2 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2013-10-08 10:25 ` [PATCH 5/6] rcusync: Add the CONFIG_PROVE_RCU checks Peter Zijlstra
@ 2013-10-08 10:25 ` Peter Zijlstra
  2013-10-08 16:32   ` Paul E. McKenney
  2013-10-08 15:27 ` [PATCH 0/6] Optimize the cpu hotplug locking -v2 Oleg Nesterov
  2013-10-10  5:50 ` Andrew Morton
  7 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-08 10:25 UTC (permalink / raw)
  To: Oleg Nesterov, Paul McKenney
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Steven Rostedt, Linus Torvalds, linux-kernel, Peter Zijlstra

[-- Attachment #1: oleg_nesterov-3_rcusync-introduce_rcu_sync_dtor.patch --]
[-- Type: text/plain, Size: 2186 bytes --]

From: Oleg Nesterov <oleg@redhat.com>

Add the new rcu_sync_ops->wait() method and the new helper,
rcu_sync_dtor().

It is needed if you are going to, say, kfree(rcu_sync_object).
It simply calls ops->wait() to "flush" the potentially pending
rcu callback.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131004184636.GA17560@redhat.com
---
 include/linux/rcusync.h |    1 +
 kernel/rcusync.c        |   22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+)

--- a/include/linux/rcusync.h
+++ b/include/linux/rcusync.h
@@ -31,6 +31,7 @@ static inline bool rcu_sync_is_idle(stru
 extern void rcu_sync_init(struct rcu_sync_struct *, enum rcu_sync_type);
 extern void rcu_sync_enter(struct rcu_sync_struct *);
 extern void rcu_sync_exit(struct rcu_sync_struct *);
+extern void rcu_sync_dtor(struct rcu_sync_struct *);
 
 #define __RCU_SYNC_INITIALIZER(name, type) {				\
 		.gp_state = 0,						\
--- a/kernel/rcusync.c
+++ b/kernel/rcusync.c
@@ -10,6 +10,7 @@
 static const struct {
 	void (*sync)(void);
 	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
+	void (*wait)(void);
 #ifdef CONFIG_PROVE_RCU
 	int  (*held)(void);
 #endif
@@ -17,16 +18,19 @@ static const struct {
 	[RCU_SYNC] = {
 		.sync = synchronize_rcu,
 		.call = call_rcu,
+		.wait = rcu_barrier,
 		__INIT_HELD(rcu_read_lock_held)
 	},
 	[RCU_SCHED_SYNC] = {
 		.sync = synchronize_sched,
 		.call = call_rcu_sched,
+		.wait = rcu_barrier_sched,
 		__INIT_HELD(rcu_read_lock_sched_held)
 	},
 	[RCU_BH_SYNC] = {
 		.sync = synchronize_rcu_bh,
 		.call = call_rcu_bh,
+		.wait = rcu_barrier_bh,
 		__INIT_HELD(rcu_read_lock_bh_held)
 	},
 };
@@ -128,3 +132,21 @@ void rcu_sync_exit(struct rcu_sync_struc
 	}
 	spin_unlock_irq(&rss->rss_lock);
 }
+
+void rcu_sync_dtor(struct rcu_sync_struct *rss)
+{
+	int cb_state;
+
+	BUG_ON(rss->gp_count);
+
+	spin_lock_irq(&rss->rss_lock);
+	if (rss->cb_state == CB_REPLAY)
+		rss->cb_state = CB_PENDING;
+	cb_state = rss->cb_state;
+	spin_unlock_irq(&rss->rss_lock);
+
+	if (cb_state != CB_IDLE) {
+		gp_ops[rss->gp_type].wait();
+		BUG_ON(rss->cb_state != CB_IDLE);
+	}
+}



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] hotplug: Optimize {get,put}_online_cpus()
  2013-10-08 10:25 ` [PATCH 1/6] hotplug: Optimize {get,put}_online_cpus() Peter Zijlstra
@ 2013-10-08 15:08   ` Rik van Riel
  2013-10-10  5:47   ` Andrew Morton
  1 sibling, 0 replies; 76+ messages in thread
From: Rik van Riel @ 2013-10-08 15:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Paul McKenney, Mel Gorman, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Steven Rostedt, Linus Torvalds, linux-kernel

On 10/08/2013 06:25 AM, Peter Zijlstra wrote:

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-08 10:25 [PATCH 0/6] Optimize the cpu hotplug locking -v2 Peter Zijlstra
                   ` (5 preceding siblings ...)
  2013-10-08 10:25 ` [PATCH 6/6] rcusync: Introduce rcu_sync_dtor() Peter Zijlstra
@ 2013-10-08 15:27 ` Oleg Nesterov
  2013-10-08 15:38   ` Peter Zijlstra
  2013-10-10  5:50 ` Andrew Morton
  7 siblings, 1 reply; 76+ messages in thread
From: Oleg Nesterov @ 2013-10-08 15:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Steven Rostedt, Linus Torvalds, linux-kernel

On 10/08, Peter Zijlstra wrote:
>
>  - Added Reviewed-by for Oleg to patches 1,3 -- please holler if you disagree!

Thanks ;)

And of course, feel free to add my sob to 4/6, although this doesn't
matter.

Oleg.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-08 15:27 ` [PATCH 0/6] Optimize the cpu hotplug locking -v2 Oleg Nesterov
@ 2013-10-08 15:38   ` Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-08 15:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Steven Rostedt, Linus Torvalds, linux-kernel

On Tue, Oct 08, 2013 at 05:27:30PM +0200, Oleg Nesterov wrote:
> On 10/08, Peter Zijlstra wrote:
> >
> >  - Added Reviewed-by for Oleg to patches 1,3 -- please holler if you disagree!
> 
> Thanks ;)
> 
> And of course, feel free to add my sob to 4/6, although this doesn't
> matter.

Thanks, done!

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 3/6] hotplug: Optimize cpu_hotplug_{begin,done}() using rcu_sync
  2013-10-08 10:25 ` [PATCH 3/6] hotplug: Optimize cpu_hotplug_{begin,done}() using rcu_sync Peter Zijlstra
@ 2013-10-08 16:28   ` Paul E. McKenney
  0 siblings, 0 replies; 76+ messages in thread
From: Paul E. McKenney @ 2013-10-08 16:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Steven Rostedt, Linus Torvalds, linux-kernel

On Tue, Oct 08, 2013 at 12:25:08PM +0200, Peter Zijlstra wrote:
> Use the fancy new rcu_sync bits from Oleg to optimize the fancy new
> hotplug lock implementation.
> 
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> ---
>  include/linux/cpu.h |    7 +++---
>  kernel/cpu.c        |   54 +++++++++++++++++++++++-----------------------------
>  2 files changed, 28 insertions(+), 33 deletions(-)
> 
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -18,6 +18,7 @@
>  #include <linux/cpumask.h>
>  #include <linux/percpu.h>
>  #include <linux/sched.h>
> +#include <linux/rcusync.h>
> 
>  struct device;
> 
> @@ -180,7 +181,7 @@ extern void cpu_hotplug_init_task(struct
>  extern void cpu_hotplug_begin(void);
>  extern void cpu_hotplug_done(void);
> 
> -extern int __cpuhp_state;
> +extern struct rcu_sync_struct __cpuhp_rss;
>  DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> 
>  extern void __get_online_cpus(void);
> @@ -204,7 +205,7 @@ static inline void get_online_cpus(void)
>  	 * writer will see anything we did within this RCU-sched read-side
>  	 * critical section.
>  	 */
> -	if (likely(!__cpuhp_state))
> +	if (likely(rcu_sync_is_idle(&__cpuhp_rss)))
>  		__this_cpu_inc(__cpuhp_refcount);
>  	else
>  		__get_online_cpus(); /* Unconditional memory barrier. */
> @@ -231,7 +232,7 @@ static inline void put_online_cpus(void)
>  	/*
>  	 * Same as in get_online_cpus().
>  	 */
> -	if (likely(!__cpuhp_state))
> +	if (likely(rcu_sync_is_idle(&__cpuhp_rss)))
>  		__this_cpu_dec(__cpuhp_refcount);
>  	else
>  		__put_online_cpus(); /* Unconditional memory barrier. */
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -49,14 +49,15 @@ static int cpu_hotplug_disabled;
> 
>  #ifdef CONFIG_HOTPLUG_CPU
> 
> -enum { readers_fast = 0, readers_slow, readers_block };
> +enum { readers_slow, readers_block };
> 
> -int __cpuhp_state;
> -EXPORT_SYMBOL_GPL(__cpuhp_state);
> +DEFINE_RCU_SCHED_SYNC(__cpuhp_rss);
> +EXPORT_SYMBOL_GPL(__cpuhp_rss);
> 
>  DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
>  EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
> 
> +static int cpuhp_state = readers_slow;
>  static atomic_t cpuhp_waitcount;
>  static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
>  static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
> @@ -68,7 +69,6 @@ void cpu_hotplug_init_task(struct task_s
> 
>  void __get_online_cpus(void)
>  {
> -again:
>  	__this_cpu_inc(__cpuhp_refcount);
> 
>  	/*
> @@ -77,7 +77,7 @@ void __get_online_cpus(void)
>  	 * increment-on-one-CPU-and-decrement-on-another problem.
>  	 *
>  	 * And yes, if the reader misses the writer's assignment of
> -	 * readers_block to __cpuhp_state, then the writer is
> +	 * readers_block to cpuhp_state, then the writer is
>  	 * guaranteed to see the reader's increment.  Conversely, any
>  	 * readers that increment their __cpuhp_refcount after the
>  	 * writer looks are guaranteed to see the readers_block value,
> @@ -88,7 +88,7 @@ void __get_online_cpus(void)
> 
>  	smp_mb(); /* A matches D */
> 
> -	if (likely(__cpuhp_state != readers_block))
> +	if (likely(cpuhp_state != readers_block))
>  		return;
> 
>  	/*
> @@ -108,19 +108,19 @@ void __get_online_cpus(void)
>  	 * and reschedule on the preempt_enable() in get_online_cpus().
>  	 */
>  	preempt_enable_no_resched();
> -	__wait_event(cpuhp_readers, __cpuhp_state != readers_block);
> +	__wait_event(cpuhp_readers, cpuhp_state != readers_block);
>  	preempt_disable();
> 
> +	__this_cpu_inc(__cpuhp_refcount);
> +
>  	/*
> -	 * Given we've still got preempt_disabled and new cpu_hotplug_begin()
> -	 * must do a synchronize_sched() we're guaranteed a successfull
> -	 * acquisition this time -- even if we wake the current
> -	 * cpu_hotplug_end() now.
> +	 * cpu_hotplug_done() waits until all pending readers are gone;
> +	 * this means that a new cpu_hotplug_begin() must observe our
> +	 * refcount increment and wait for it to go away.
>  	 */
> -	if (atomic_dec_and_test(&cpuhp_waitcount))
> -		wake_up(&cpuhp_writer);
> 
> -	goto again;
> +	if (atomic_dec_and_test(&cpuhp_waitcount)) /* A */
> +		wake_up(&cpuhp_writer);
>  }
>  EXPORT_SYMBOL_GPL(__get_online_cpus);
> 
> @@ -186,21 +186,18 @@ void cpu_hotplug_begin(void)
>  	current->cpuhp_ref++;
> 
>  	/* Notify readers to take the slow path. */
> -	__cpuhp_state = readers_slow;
> -
> -	/* See percpu_down_write(); guarantees all readers take the slow path */
> -	synchronize_sched();
> +	rcu_sync_enter(&__cpuhp_rss);
> 
>  	/*
>  	 * Notify new readers to block; up until now, and thus throughout the
> -	 * longish synchronize_sched() above, new readers could still come in.
> +	 * longish rcu_sync_enter() above, new readers could still come in.
>  	 */
> -	__cpuhp_state = readers_block;
> +	cpuhp_state = readers_block;
> 
>  	smp_mb(); /* D matches A */
> 
>  	/*
> -	 * If they don't see our writer of readers_block to __cpuhp_state,
> +	 * If they don't see our writer of readers_block to cpuhp_state,
>  	 * then we are guaranteed to see their __cpuhp_refcount increment, and
>  	 * therefore will wait for them.
>  	 */
> @@ -218,26 +215,23 @@ void cpu_hotplug_done(void)
>  	 * that new readers might fail to see the results of this writer's
>  	 * critical section.
>  	 */
> -	__cpuhp_state = readers_slow;
> +	cpuhp_state = readers_slow;
>  	wake_up_all(&cpuhp_readers);
> 
>  	/*
>  	 * The wait_event()/wake_up_all() prevents the race where the readers
> -	 * are delayed between fetching __cpuhp_state and blocking.
> +	 * are delayed between fetching cpuhp_state and blocking.
>  	 */
> 
> -	/* See percpu_up_write(); readers will no longer attempt to block. */
> -	synchronize_sched();
> -
> -	/* Let 'em rip */
> -	__cpuhp_state = readers_fast;
>  	current->cpuhp_ref--;
> 
>  	/*
> -	 * Wait for any pending readers to be running. This ensures readers
> -	 * after writer and avoids writers starving readers.
> +	 * Wait for any pending readers to be running. This avoids writers
> +	 * starving readers.
>  	 */
>  	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> +
> +	rcu_sync_exit(&__cpuhp_rss);
>  }
> 
>  /*
> 
> 


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 4/6] rcusync: Introduce struct rcu_sync_ops
  2013-10-08 10:25 ` [PATCH 4/6] rcusync: Introduce struct rcu_sync_ops Peter Zijlstra
@ 2013-10-08 16:30   ` Paul E. McKenney
  2013-10-17  2:07   ` Lai Jiangshan
  1 sibling, 0 replies; 76+ messages in thread
From: Paul E. McKenney @ 2013-10-08 16:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Steven Rostedt, Linus Torvalds, linux-kernel

On Tue, Oct 08, 2013 at 12:25:09PM +0200, Peter Zijlstra wrote:
> From: Oleg Nesterov <oleg@redhat.com>
> 
> Add the new struct rcu_sync_ops which holds sync/call methods, and
> turn the function pointers in rcu_sync_struct into an array of struct
> rcu_sync_ops.
> 
> This simplifies the "init" helpers, and this way it is simpler to add
> the new methods we need, especially ifdef'ed.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/20131005171746.GA17664@redhat.com

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> ---
> 
> diff --git a/include/linux/rcusync.h b/include/linux/rcusync.h
> index 7858491..988ec33 100644
> --- a/include/linux/rcusync.h
> +++ b/include/linux/rcusync.h
> @@ -4,6 +4,8 @@
>  #include <linux/wait.h>
>  #include <linux/rcupdate.h>
> 
> +enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
> +
>  struct rcu_sync_struct {
>  	int			gp_state;
>  	int			gp_count;
> @@ -12,53 +14,37 @@ struct rcu_sync_struct {
>  	int			cb_state;
>  	struct rcu_head		cb_head;
> 
> -	void (*sync)(void);
> -	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
> +	enum rcu_sync_type	gp_type;
>  };
> 
> -#define ___RCU_SYNC_INIT(name)						\
> -	.gp_state = 0,							\
> -	.gp_count = 0,							\
> -	.gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait),		\
> -	.cb_state = 0
> -
> -#define __RCU_SCHED_SYNC_INIT(name) {					\
> -	___RCU_SYNC_INIT(name),						\
> -	.sync = synchronize_sched,					\
> -	.call = call_rcu_sched,						\
> -}
> -
> -#define __RCU_BH_SYNC_INIT(name) {					\
> -	___RCU_SYNC_INIT(name),						\
> -	.sync = synchronize_rcu_bh,					\
> -	.call = call_rcu_bh,						\
> -}
> -
> -#define __RCU_SYNC_INIT(name) {						\
> -	___RCU_SYNC_INIT(name),						\
> -	.sync = synchronize_rcu,					\
> -	.call = call_rcu,						\
> -}
> -
> -#define DEFINE_RCU_SCHED_SYNC(name)					\
> -	struct rcu_sync_struct name = __RCU_SCHED_SYNC_INIT(name)
> -
> -#define DEFINE_RCU_BH_SYNC(name)					\
> -	struct rcu_sync_struct name = __RCU_BH_SYNC_INIT(name)
> -
> -#define DEFINE_RCU_SYNC(name)						\
> -	struct rcu_sync_struct name = __RCU_SYNC_INIT(name)
> -
>  static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
>  {
>  	return !rss->gp_state; /* GP_IDLE */
>  }
> 
> -enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
> -
>  extern void rcu_sync_init(struct rcu_sync_struct *, enum rcu_sync_type);
>  extern void rcu_sync_enter(struct rcu_sync_struct *);
>  extern void rcu_sync_exit(struct rcu_sync_struct *);
> 
> +#define __RCU_SYNC_INITIALIZER(name, type) {				\
> +		.gp_state = 0,						\
> +		.gp_count = 0,						\
> +		.gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait),	\
> +		.cb_state = 0,						\
> +		.gp_type = type,					\
> +	}
> +
> +#define	__DEFINE_RCU_SYNC(name, type)	\
> +	struct rcu_sync_struct name = __RCU_SYNC_INITIALIZER(name, type)
> +
> +#define DEFINE_RCU_SYNC(name)		\
> +	__DEFINE_RCU_SYNC(name, RCU_SYNC)
> +
> +#define DEFINE_RCU_SCHED_SYNC(name)	\
> +	__DEFINE_RCU_SYNC(name, RCU_SCHED_SYNC)
> +
> +#define DEFINE_RCU_BH_SYNC(name)	\
> +	__DEFINE_RCU_SYNC(name, RCU_BH_SYNC)
> +
>  #endif /* _LINUX_RCUSYNC_H_ */
> 
> diff --git a/kernel/rcusync.c b/kernel/rcusync.c
> index f84176a..99051b7 100644
> --- a/kernel/rcusync.c
> +++ b/kernel/rcusync.c
> @@ -1,7 +1,24 @@
> -
>  #include <linux/rcusync.h>
>  #include <linux/sched.h>
> 
> +static const struct {
> +	void (*sync)(void);
> +	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
> +} gp_ops[] = {
> +	[RCU_SYNC] = {
> +		.sync = synchronize_rcu,
> +		.call = call_rcu,
> +	},
> +	[RCU_SCHED_SYNC] = {
> +		.sync = synchronize_sched,
> +		.call = call_rcu_sched,
> +	},
> +	[RCU_BH_SYNC] = {
> +		.sync = synchronize_rcu_bh,
> +		.call = call_rcu_bh,
> +	},
> +};
> +
>  enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
>  enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
> 
> @@ -11,23 +28,7 @@ void rcu_sync_init(struct rcu_sync_struct *rss, enum rcu_sync_type type)
>  {
>  	memset(rss, 0, sizeof(*rss));
>  	init_waitqueue_head(&rss->gp_wait);
> -
> -	switch (type) {
> -	case RCU_SYNC:
> -		rss->sync = synchronize_rcu;
> -		rss->call = call_rcu;
> -		break;
> -
> -	case RCU_SCHED_SYNC:
> -		rss->sync = synchronize_sched;
> -		rss->call = call_rcu_sched;
> -		break;
> -
> -	case RCU_BH_SYNC:
> -		rss->sync = synchronize_rcu_bh;
> -		rss->call = call_rcu_bh;
> -		break;
> -	}
> +	rss->gp_type = type;
>  }
> 
>  void rcu_sync_enter(struct rcu_sync_struct *rss)
> @@ -44,7 +45,7 @@ void rcu_sync_enter(struct rcu_sync_struct *rss)
>  	BUG_ON(need_wait && need_sync);
> 
>  	if (need_sync) {
> -		rss->sync();
> +		gp_ops[rss->gp_type].sync();
>  		rss->gp_state = GP_PASSED;
>  		wake_up_all(&rss->gp_wait);
>  	} else if (need_wait) {
> @@ -81,7 +82,7 @@ static void rcu_sync_func(struct rcu_head *rcu)
>  		 * to catch a later GP.
>  		 */
>  		rss->cb_state = CB_PENDING;
> -		rss->call(&rss->cb_head, rcu_sync_func);
> +		gp_ops[rss->gp_type].call(&rss->cb_head, rcu_sync_func);
>  	} else {
>  		/*
>  		 * We're at least a GP after rcu_sync_exit(); eveybody will now
> @@ -99,7 +100,7 @@ void rcu_sync_exit(struct rcu_sync_struct *rss)
>  	if (!--rss->gp_count) {
>  		if (rss->cb_state == CB_IDLE) {
>  			rss->cb_state = CB_PENDING;
> -			rss->call(&rss->cb_head, rcu_sync_func);
> +			gp_ops[rss->gp_type].call(&rss->cb_head, rcu_sync_func);
>  		} else if (rss->cb_state == CB_PENDING) {
>  			rss->cb_state = CB_REPLAY;
>  		}
> 
> 
> 


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5/6] rcusync: Add the CONFIG_PROVE_RCU checks
  2013-10-08 10:25 ` [PATCH 5/6] rcusync: Add the CONFIG_PROVE_RCU checks Peter Zijlstra
@ 2013-10-08 16:30   ` Paul E. McKenney
  0 siblings, 0 replies; 76+ messages in thread
From: Paul E. McKenney @ 2013-10-08 16:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Steven Rostedt, Linus Torvalds, linux-kernel

On Tue, Oct 08, 2013 at 12:25:10PM +0200, Peter Zijlstra wrote:
> From: Oleg Nesterov <oleg@redhat.com>
> 
> It would be nice to validate that the caller of rcu_sync_is_idle()
> holds the corresponding type of RCU read-side lock. Add the new
> rcu_sync_ops->held() method and change rcu_sync_is_idle() to
> WARN() if it returns false.
> 
> This obviously penalizes the readers (fast-path), but only if
> CONFIG_PROVE_RCU.
> 
> Suggested-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Link: http://lkml.kernel.org/r/20131004184633.GA17557@redhat.com
> ---
>  include/linux/rcusync.h |    6 ++++++
>  kernel/rcusync.c        |   21 +++++++++++++++++++++
>  2 files changed, 27 insertions(+)
> 
> --- a/include/linux/rcusync.h
> +++ b/include/linux/rcusync.h
> @@ -17,9 +17,15 @@ struct rcu_sync_struct {
>  	enum rcu_sync_type	gp_type;
>  };
> 
> +extern bool __rcu_sync_is_idle(struct rcu_sync_struct *);
> +
>  static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
>  {
> +#ifdef CONFIG_PROVE_RCU
> +	return __rcu_sync_is_idle(rss);
> +#else
>  	return !rss->gp_state; /* GP_IDLE */
> +#endif
>  }
> 
>  extern void rcu_sync_init(struct rcu_sync_struct *, enum rcu_sync_type);
> --- a/kernel/rcusync.c
> +++ b/kernel/rcusync.c
> @@ -1,21 +1,33 @@
>  #include <linux/rcusync.h>
>  #include <linux/sched.h>
> 
> +#ifdef CONFIG_PROVE_RCU
> +#define __INIT_HELD(func)	.held = func,
> +#else
> +#define __INIT_HELD(func)
> +#endif
> +
>  static const struct {
>  	void (*sync)(void);
>  	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
> +#ifdef CONFIG_PROVE_RCU
> +	int  (*held)(void);
> +#endif
>  } gp_ops[] = {
>  	[RCU_SYNC] = {
>  		.sync = synchronize_rcu,
>  		.call = call_rcu,
> +		__INIT_HELD(rcu_read_lock_held)
>  	},
>  	[RCU_SCHED_SYNC] = {
>  		.sync = synchronize_sched,
>  		.call = call_rcu_sched,
> +		__INIT_HELD(rcu_read_lock_sched_held)
>  	},
>  	[RCU_BH_SYNC] = {
>  		.sync = synchronize_rcu_bh,
>  		.call = call_rcu_bh,
> +		__INIT_HELD(rcu_read_lock_bh_held)
>  	},
>  };
> 
> @@ -24,6 +36,15 @@ enum { CB_IDLE = 0, CB_PENDING, CB_REPLA
> 
>  #define	rss_lock	gp_wait.lock
> 
> +#ifdef CONFIG_PROVE_RCU
> +bool __rcu_sync_is_idle(struct rcu_sync_struct *rss)
> +{
> +	WARN_ON(!gp_ops[rss->gp_type].held());
> +	return rss->gp_state == GP_IDLE;
> +}
> +EXPORT_SYMBOL_GPL(__rcu_sync_is_idle);
> +#endif
> +
>  void rcu_sync_init(struct rcu_sync_struct *rss, enum rcu_sync_type type)
>  {
>  	memset(rss, 0, sizeof(*rss));
> 
> 


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 6/6] rcusync: Introduce rcu_sync_dtor()
  2013-10-08 10:25 ` [PATCH 6/6] rcusync: Introduce rcu_sync_dtor() Peter Zijlstra
@ 2013-10-08 16:32   ` Paul E. McKenney
  0 siblings, 0 replies; 76+ messages in thread
From: Paul E. McKenney @ 2013-10-08 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Steven Rostedt, Linus Torvalds, linux-kernel

On Tue, Oct 08, 2013 at 12:25:11PM +0200, Peter Zijlstra wrote:
> From: Oleg Nesterov <oleg@redhat.com>
> 
> Add the new rcu_sync_ops->wait() method and the new helper,
> rcu_sync_dtor().
> 
> It is needed if you are going to, say, kfree(rcu_sync_object).
> It simply calls ops->wait() to "flush" the potentially pending
> rcu callback.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Link: http://lkml.kernel.org/r/20131004184636.GA17560@redhat.com
> ---
>  include/linux/rcusync.h |    1 +
>  kernel/rcusync.c        |   22 ++++++++++++++++++++++
>  2 files changed, 23 insertions(+)
> 
> --- a/include/linux/rcusync.h
> +++ b/include/linux/rcusync.h
> @@ -31,6 +31,7 @@ static inline bool rcu_sync_is_idle(stru
>  extern void rcu_sync_init(struct rcu_sync_struct *, enum rcu_sync_type);
>  extern void rcu_sync_enter(struct rcu_sync_struct *);
>  extern void rcu_sync_exit(struct rcu_sync_struct *);
> +extern void rcu_sync_dtor(struct rcu_sync_struct *);
> 
>  #define __RCU_SYNC_INITIALIZER(name, type) {				\
>  		.gp_state = 0,						\
> --- a/kernel/rcusync.c
> +++ b/kernel/rcusync.c
> @@ -10,6 +10,7 @@
>  static const struct {
>  	void (*sync)(void);
>  	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
> +	void (*wait)(void);
>  #ifdef CONFIG_PROVE_RCU
>  	int  (*held)(void);
>  #endif
> @@ -17,16 +18,19 @@ static const struct {
>  	[RCU_SYNC] = {
>  		.sync = synchronize_rcu,
>  		.call = call_rcu,
> +		.wait = rcu_barrier,
>  		__INIT_HELD(rcu_read_lock_held)
>  	},
>  	[RCU_SCHED_SYNC] = {
>  		.sync = synchronize_sched,
>  		.call = call_rcu_sched,
> +		.wait = rcu_barrier_sched,
>  		__INIT_HELD(rcu_read_lock_sched_held)
>  	},
>  	[RCU_BH_SYNC] = {
>  		.sync = synchronize_rcu_bh,
>  		.call = call_rcu_bh,
> +		.wait = rcu_barrier_bh,
>  		__INIT_HELD(rcu_read_lock_bh_held)
>  	},
>  };
> @@ -128,3 +132,21 @@ void rcu_sync_exit(struct rcu_sync_struc
>  	}
>  	spin_unlock_irq(&rss->rss_lock);
>  }
> +
> +void rcu_sync_dtor(struct rcu_sync_struct *rss)
> +{
> +	int cb_state;
> +
> +	BUG_ON(rss->gp_count);
> +
> +	spin_lock_irq(&rss->rss_lock);
> +	if (rss->cb_state == CB_REPLAY)
> +		rss->cb_state = CB_PENDING;
> +	cb_state = rss->cb_state;
> +	spin_unlock_irq(&rss->rss_lock);
> +
> +	if (cb_state != CB_IDLE) {
> +		gp_ops[rss->gp_type].wait();
> +		BUG_ON(rss->cb_state != CB_IDLE);
> +	}
> +}
> 
> 


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/6] rcu: Create rcu_sync infrastructure
  2013-10-08 10:25 ` [PATCH 2/6] rcu: Create rcu_sync infrastructure Peter Zijlstra
@ 2013-10-08 20:40   ` Jonathan Corbet
  2013-10-09 19:52     ` Peter Zijlstra
  2013-10-17  2:56   ` Lai Jiangshan
  2013-10-17 10:36   ` Srikar Dronamraju
  2 siblings, 1 reply; 76+ messages in thread
From: Jonathan Corbet @ 2013-10-08 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

OK, so this is a real nit, but...in the changelog:

>         static inline void rcu_sync_enter(struct rcu_sync_struct *xxx)
>         {
>                 atomic_inc(&xxx->counter);
>                 synchronize_sched();
>         }
> 
>         static inline void rcu_sync_enter(struct rcu_sync_struct *xxx)
>         {
>                 synchronize_sched();
>                 atomic_dec(&xxx->counter);
>         }

That second rcu_sync_enter should really be rcu_sync_exit.

Is the "xxx" there to test spamassassin configurations? :)

jon

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/6] rcu: Create rcu_sync infrastructure
  2013-10-08 20:40   ` Jonathan Corbet
@ 2013-10-09 19:52     ` Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-09 19:52 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Oleg Nesterov, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On Tue, Oct 08, 2013 at 02:40:45PM -0600, Jonathan Corbet wrote:
> OK, so this is a real nit, but...in the changelog:
> 
> >         static inline void rcu_sync_enter(struct rcu_sync_struct *xxx)
> >         {
> >                 atomic_inc(&xxx->counter);
> >                 synchronize_sched();
> >         }
> > 
> >         static inline void rcu_sync_enter(struct rcu_sync_struct *xxx)
> >         {
> >                 synchronize_sched();
> >                 atomic_dec(&xxx->counter);
> >         }
> 
> That second rcu_sync_enter should really be rcu_sync_exit.
> 
> Is the "xxx" there to test spamassassin configurations? :)

Uhh, no, its a search-n-replace fail, Oleg's original email on the
subject called the entire primitive xxx. I replaced (some) of them with
rcu_sync.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] hotplug: Optimize {get,put}_online_cpus()
  2013-10-08 10:25 ` [PATCH 1/6] hotplug: Optimize {get,put}_online_cpus() Peter Zijlstra
  2013-10-08 15:08   ` Rik van Riel
@ 2013-10-10  5:47   ` Andrew Morton
  2013-10-10 11:06     ` Oleg Nesterov
  1 sibling, 1 reply; 76+ messages in thread
From: Andrew Morton @ 2013-10-10  5:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On Tue, 08 Oct 2013 12:25:06 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> The current implementation of get_online_cpus() is global of nature
> and thus not suited for any kind of common usage.
> 
> Re-implement the current recursive r/w cpu hotplug lock such that the
> read side locks are as light as possible.
> 
> The current cpu hotplug lock is entirely reader biased; but since
> readers are expensive there aren't a lot of them about and writer
> starvation isn't a particular problem.
> 
> However by making the reader side more usable there is a fair chance
> it will get used more and thus the starvation issue becomes a real
> possibility.
> 
> Therefore this new implementation is fair, alternating readers and
> writers; this however requires per-task state to allow the reader
> recursion.

Obvious question: can't we adapt lglocks for this?  It would need the
counter in task_struct to permit reader nesting, but what else is
needed?


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-08 10:25 [PATCH 0/6] Optimize the cpu hotplug locking -v2 Peter Zijlstra
                   ` (6 preceding siblings ...)
  2013-10-08 15:27 ` [PATCH 0/6] Optimize the cpu hotplug locking -v2 Oleg Nesterov
@ 2013-10-10  5:50 ` Andrew Morton
  2013-10-10  6:27   ` Ingo Molnar
  2013-10-10 12:19   ` Peter Zijlstra
  7 siblings, 2 replies; 76+ messages in thread
From: Andrew Morton @ 2013-10-10  5:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On Tue, 08 Oct 2013 12:25:05 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> The current cpu hotplug lock is a single global lock; therefore excluding
> hotplug is a very expensive proposition even though it is rare occurrence under
> normal operation.
> 
> There is a desire for a more light weight implementation of
> {get,put}_online_cpus() from both the NUMA scheduling as well as the -RT side.
> 
> The current hotplug lock is a full reader preference lock -- and thus supports
> reader recursion. However since we're making the read side lock much cheaper it
> is the expectation that it will also be used far more. Which in turn would lead
> to writer starvation.
> 
> Therefore the new lock proposed is completely fair; albeit somewhat expensive
> on the write side. This in turn means that we need a per-task nesting count to
> support reader recursion.

This is a lot of code and a lot of new complexity.  It needs some pretty
convincing performance numbers to justify its inclusion, no?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10  5:50 ` Andrew Morton
@ 2013-10-10  6:27   ` Ingo Molnar
  2013-10-10  6:34     ` Andrew Morton
  2013-10-10 12:19   ` Peter Zijlstra
  1 sibling, 1 reply; 76+ messages in thread
From: Ingo Molnar @ 2013-10-10  6:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Oleg Nesterov, Paul McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 08 Oct 2013 12:25:05 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > The current cpu hotplug lock is a single global lock; therefore 
> > excluding hotplug is a very expensive proposition even though it is 
> > rare occurrence under normal operation.
> > 
> > There is a desire for a more light weight implementation of 
> > {get,put}_online_cpus() from both the NUMA scheduling as well as the 
> > -RT side.
> > 
> > The current hotplug lock is a full reader preference lock -- and thus 
> > supports reader recursion. However since we're making the read side 
> > lock much cheaper it is the expectation that it will also be used far 
> > more. Which in turn would lead to writer starvation.
> > 
> > Therefore the new lock proposed is completely fair; albeit somewhat 
> > expensive on the write side. This in turn means that we need a 
> > per-task nesting count to support reader recursion.
> 
> This is a lot of code and a lot of new complexity.  It needs some pretty 
> convincing performance numbers to justify its inclusion, no?

Should be fairly straightforward to test: the sys_sched_getaffinity() and 
sys_sched_setaffinity() syscalls both make use of 
get_online_cpus()/put_online_cpus(), so a testcase frobbing affinities on 
N CPUs in parallel ought to demonstrate scalability improvements pretty 
nicely.

[ It's not just about affinities: in particular sys_sched_getaffinity() 
  also gets used as a NR_CPUS runtime detection method in apps, so it 
  matters to regular non-affine loads as well. ]

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10  6:27   ` Ingo Molnar
@ 2013-10-10  6:34     ` Andrew Morton
  2013-10-10  7:27       ` Ingo Molnar
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Morton @ 2013-10-10  6:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Oleg Nesterov, Paul McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On Thu, 10 Oct 2013 08:27:41 +0200 Ingo Molnar <mingo@kernel.org> wrote:

> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Tue, 08 Oct 2013 12:25:05 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > The current cpu hotplug lock is a single global lock; therefore 
> > > excluding hotplug is a very expensive proposition even though it is 
> > > rare occurrence under normal operation.
> > > 
> > > There is a desire for a more light weight implementation of 
> > > {get,put}_online_cpus() from both the NUMA scheduling as well as the 
> > > -RT side.
> > > 
> > > The current hotplug lock is a full reader preference lock -- and thus 
> > > supports reader recursion. However since we're making the read side 
> > > lock much cheaper it is the expectation that it will also be used far 
> > > more. Which in turn would lead to writer starvation.
> > > 
> > > Therefore the new lock proposed is completely fair; albeit somewhat 
> > > expensive on the write side. This in turn means that we need a 
> > > per-task nesting count to support reader recursion.
> > 
> > This is a lot of code and a lot of new complexity.  It needs some pretty 
> > convincing performance numbers to justify its inclusion, no?
> 
> Should be fairly straightforward to test: the sys_sched_getaffinity() and 
> sys_sched_setaffinity() syscalls both make use of 
> get_online_cpus()/put_online_cpus(), so a testcase frobbing affinities on 
> N CPUs in parallel ought to demonstrate scalability improvements pretty 
> nicely.

Well, an in-kernel microbenchmark which camps in a loop doing get/put
would measure this as well.

But neither approach answers the question "how useful is this patchset".

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10  6:34     ` Andrew Morton
@ 2013-10-10  7:27       ` Ingo Molnar
  2013-10-10  7:33         ` Andrew Morton
  0 siblings, 1 reply; 76+ messages in thread
From: Ingo Molnar @ 2013-10-10  7:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Oleg Nesterov, Paul McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 10 Oct 2013 08:27:41 +0200 Ingo Molnar <mingo@kernel.org> wrote:
> 
> > * Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > > On Tue, 08 Oct 2013 12:25:05 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > The current cpu hotplug lock is a single global lock; therefore 
> > > > excluding hotplug is a very expensive proposition even though it is 
> > > > rare occurrence under normal operation.
> > > > 
> > > > There is a desire for a more light weight implementation of 
> > > > {get,put}_online_cpus() from both the NUMA scheduling as well as the 
> > > > -RT side.
> > > > 
> > > > The current hotplug lock is a full reader preference lock -- and thus 
> > > > supports reader recursion. However since we're making the read side 
> > > > lock much cheaper it is the expectation that it will also be used far 
> > > > more. Which in turn would lead to writer starvation.
> > > > 
> > > > Therefore the new lock proposed is completely fair; albeit somewhat 
> > > > expensive on the write side. This in turn means that we need a 
> > > > per-task nesting count to support reader recursion.
> > > 
> > > This is a lot of code and a lot of new complexity.  It needs some pretty 
> > > convincing performance numbers to justify its inclusion, no?
> > 
> > Should be fairly straightforward to test: the sys_sched_getaffinity() 
> > and sys_sched_setaffinity() syscalls both make use of 
> > get_online_cpus()/put_online_cpus(), so a testcase frobbing affinities 
> > on N CPUs in parallel ought to demonstrate scalability improvements 
> > pretty nicely.
> 
> Well, an in-kernel microbenchmark which camps in a loop doing get/put 
> would measure this as well.
> 
> But neither approach answers the question "how useful is this patchset".

Even ignoring all the other reasons cited, sys_sched_getaffinity() / 
sys_sched_setaffinity() are prime time system calls, and as long as the 
patches are correct, speeding them up is worthwhile.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10  7:27       ` Ingo Molnar
@ 2013-10-10  7:33         ` Andrew Morton
  2013-10-10  7:45           ` Ingo Molnar
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Morton @ 2013-10-10  7:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Oleg Nesterov, Paul McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On Thu, 10 Oct 2013 09:27:57 +0200 Ingo Molnar <mingo@kernel.org> wrote:

> > > Should be fairly straightforward to test: the sys_sched_getaffinity() 
> > > and sys_sched_setaffinity() syscalls both make use of 
> > > get_online_cpus()/put_online_cpus(), so a testcase frobbing affinities 
> > > on N CPUs in parallel ought to demonstrate scalability improvements 
> > > pretty nicely.
> > 
> > Well, an in-kernel microbenchmark which camps in a loop doing get/put 
> > would measure this as well.
> > 
> > But neither approach answers the question "how useful is this patchset".
> 
> Even ignoring all the other reasons cited, sys_sched_getaffinity() / 
> sys_sched_setaffinity() are prime time system calls, and as long as the 
> patches are correct, speeding them up is worthwhile.

That I would not have guessed.  What's the use case for calling
get/set_affinity at high frequency?


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10  7:33         ` Andrew Morton
@ 2013-10-10  7:45           ` Ingo Molnar
  0 siblings, 0 replies; 76+ messages in thread
From: Ingo Molnar @ 2013-10-10  7:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Oleg Nesterov, Paul McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 10 Oct 2013 09:27:57 +0200 Ingo Molnar <mingo@kernel.org> wrote:
> 
> > > > Should be fairly straightforward to test: the sys_sched_getaffinity() 
> > > > and sys_sched_setaffinity() syscalls both make use of 
> > > > get_online_cpus()/put_online_cpus(), so a testcase frobbing affinities 
> > > > on N CPUs in parallel ought to demonstrate scalability improvements 
> > > > pretty nicely.
> > > 
> > > Well, an in-kernel microbenchmark which camps in a loop doing get/put 
> > > would measure this as well.
> > > 
> > > But neither approach answers the question "how useful is this patchset".
> > 
> > Even ignoring all the other reasons cited, sys_sched_getaffinity() / 
> > sys_sched_setaffinity() are prime time system calls, and as long as 
> > the patches are correct, speeding them up is worthwhile.
> 
> That I would not have guessed.  What's the use case for calling 
> get/set_affinity at high frequency?

I don't think high-freq usage is common (at all).

It could happen in AIM7-like workloads that start up a ton of binaries in 
parallel, which can easily create parallel sched_getaffinity() calls 
during process startup?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] hotplug: Optimize {get,put}_online_cpus()
  2013-10-10  5:47   ` Andrew Morton
@ 2013-10-10 11:06     ` Oleg Nesterov
  2013-10-10 14:55       ` Paul E. McKenney
  0 siblings, 1 reply; 76+ messages in thread
From: Oleg Nesterov @ 2013-10-10 11:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On 10/09, Andrew Morton wrote:
>
> On Tue, 08 Oct 2013 12:25:06 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
>
> > The current implementation of get_online_cpus() is global of nature
> > and thus not suited for any kind of common usage.
> >
> > Re-implement the current recursive r/w cpu hotplug lock such that the
> > read side locks are as light as possible.
> >
> > The current cpu hotplug lock is entirely reader biased; but since
> > readers are expensive there aren't a lot of them about and writer
> > starvation isn't a particular problem.
> >
> > However by making the reader side more usable there is a fair chance
> > it will get used more and thus the starvation issue becomes a real
> > possibility.
> >
> > Therefore this new implementation is fair, alternating readers and
> > writers; this however requires per-task state to allow the reader
> > recursion.
>
> Obvious question: can't we adapt lglocks for this?  It would need the
> counter in task_struct to permit reader nesting, but what else is
> needed?

Unlikely. If nothing else, get_online_cpus() is might_sleep().

But we can joing this with percpu_rw_semaphore (and I am going to try
to do this). Ignoring the counter in task_struct this is the same thing,
but get_online_cpus() is also optimized for the case when the writer
is pending (percpu_down_read() uses down_read() in this case).

Oleg.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10  5:50 ` Andrew Morton
  2013-10-10  6:27   ` Ingo Molnar
@ 2013-10-10 12:19   ` Peter Zijlstra
  2013-10-10 14:57     ` Ingo Molnar
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-10 12:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On Wed, Oct 09, 2013 at 10:50:06PM -0700, Andrew Morton wrote:
> On Tue, 08 Oct 2013 12:25:05 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > The current cpu hotplug lock is a single global lock; therefore excluding
> > hotplug is a very expensive proposition even though it is rare occurrence under
> > normal operation.
> > 
> > There is a desire for a more light weight implementation of
> > {get,put}_online_cpus() from both the NUMA scheduling as well as the -RT side.
> > 
> > The current hotplug lock is a full reader preference lock -- and thus supports
> > reader recursion. However since we're making the read side lock much cheaper it
> > is the expectation that it will also be used far more. Which in turn would lead
> > to writer starvation.
> > 
> > Therefore the new lock proposed is completely fair; albeit somewhat expensive
> > on the write side. This in turn means that we need a per-task nesting count to
> > support reader recursion.
> 
> This is a lot of code and a lot of new complexity.  It needs some pretty
> convincing performance numbers to justify its inclusion, no?

And here I thought it was generally understood to be unwise to bash
global state on anything like a regular manner from every cpu.

The NUMA bits really ought to use get_online_cpus()/put_online_cpus() on
every balance pass; which is about once a second on every cpu.

RT -- which has some quite horrible hotplug hacks due to this --
basically takes get_online_cpus() for every spin_lock/spin_unlock in the
kernel.

But the thing is; our sense of NR_CPUS has shifted, where it used to be
ok to do something like:

  for_each_cpu()

With preemption disabled; it gets to be less and less sane to do so,
simply because 'common' hardware has 256+ CPUs these days. If we cannot
rely on preempt disable to exclude hotplug, we must use
get_online_cpus(), but get_online_cpus() is global state and thus cannot
be used at any sort of frequency.



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] hotplug: Optimize {get,put}_online_cpus()
  2013-10-10 11:06     ` Oleg Nesterov
@ 2013-10-10 14:55       ` Paul E. McKenney
  0 siblings, 0 replies; 76+ messages in thread
From: Paul E. McKenney @ 2013-10-10 14:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On Thu, Oct 10, 2013 at 01:06:16PM +0200, Oleg Nesterov wrote:
> On 10/09, Andrew Morton wrote:
> >
> > On Tue, 08 Oct 2013 12:25:06 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > The current implementation of get_online_cpus() is global of nature
> > > and thus not suited for any kind of common usage.
> > >
> > > Re-implement the current recursive r/w cpu hotplug lock such that the
> > > read side locks are as light as possible.
> > >
> > > The current cpu hotplug lock is entirely reader biased; but since
> > > readers are expensive there aren't a lot of them about and writer
> > > starvation isn't a particular problem.
> > >
> > > However by making the reader side more usable there is a fair chance
> > > it will get used more and thus the starvation issue becomes a real
> > > possibility.
> > >
> > > Therefore this new implementation is fair, alternating readers and
> > > writers; this however requires per-task state to allow the reader
> > > recursion.
> >
> > Obvious question: can't we adapt lglocks for this?  It would need the
> > counter in task_struct to permit reader nesting, but what else is
> > needed?
> 
> Unlikely. If nothing else, get_online_cpus() is might_sleep().
> 
> But we can joing this with percpu_rw_semaphore (and I am going to try
> to do this). Ignoring the counter in task_struct this is the same thing,
> but get_online_cpus() is also optimized for the case when the writer
> is pending (percpu_down_read() uses down_read() in this case).

To Andrew's overall question, I believe that by the time we apply
this in the various places where it can help, it will have simplified
things a bit -- and made them faster and more scalable.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 12:19   ` Peter Zijlstra
@ 2013-10-10 14:57     ` Ingo Molnar
  2013-10-10 15:21       ` Peter Zijlstra
  2013-10-10 15:26       ` Oleg Nesterov
  0 siblings, 2 replies; 76+ messages in thread
From: Ingo Molnar @ 2013-10-10 14:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Oleg Nesterov, Paul McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel


* Peter Zijlstra <peterz@infradead.org> wrote:

> But the thing is; our sense of NR_CPUS has shifted, where it used to be 
> ok to do something like:
> 
>   for_each_cpu()
> 
> With preemption disabled; it gets to be less and less sane to do so, 
> simply because 'common' hardware has 256+ CPUs these days. If we cannot 
> rely on preempt disable to exclude hotplug, we must use 
> get_online_cpus(), but get_online_cpus() is global state and thus cannot 
> be used at any sort of frequency.

So ... why not make it _really_ cheap, i.e. the read lock costing nothing, 
and tie CPU hotplug to freezing all tasks in the system?

Actual CPU hot unplugging and repluggin is _ridiculously_ rare in a 
system, I don't understand how we tolerate _any_ overhead from this utter 
slowpath.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 14:57     ` Ingo Molnar
@ 2013-10-10 15:21       ` Peter Zijlstra
  2013-10-10 15:36         ` Oleg Nesterov
  2013-10-10 16:50         ` Ingo Molnar
  2013-10-10 15:26       ` Oleg Nesterov
  1 sibling, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-10 15:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Oleg Nesterov, Paul McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On Thu, Oct 10, 2013 at 04:57:38PM +0200, Ingo Molnar wrote:
> So ... why not make it _really_ cheap, i.e. the read lock costing nothing, 
> and tie CPU hotplug to freezing all tasks in the system?

Such that we freeze regular tasks in userspace and kernel tasks in their
special freezer callback so as to guarantee minimal state?

If you can get people to agree with that I'm all for it.

That said, Oleg wants to use the same scheme for percpu_rwsem, and I'm
not sure his write-side (something in uprobes) wants to do the same.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 14:57     ` Ingo Molnar
  2013-10-10 15:21       ` Peter Zijlstra
@ 2013-10-10 15:26       ` Oleg Nesterov
  2013-10-10 16:00         ` Andrew Morton
  2013-10-10 18:52         ` Srivatsa S. Bhat
  1 sibling, 2 replies; 76+ messages in thread
From: Oleg Nesterov @ 2013-10-10 15:26 UTC (permalink / raw)
  To: Ingo Molnar, Srivatsa S. Bhat
  Cc: Peter Zijlstra, Andrew Morton, Paul McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On 10/10, Ingo Molnar wrote:
>
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
> > But the thing is; our sense of NR_CPUS has shifted, where it used to be
> > ok to do something like:
> >
> >   for_each_cpu()
> >
> > With preemption disabled; it gets to be less and less sane to do so,
> > simply because 'common' hardware has 256+ CPUs these days. If we cannot
> > rely on preempt disable to exclude hotplug, we must use
> > get_online_cpus(), but get_online_cpus() is global state and thus cannot
> > be used at any sort of frequency.
>
> So ... why not make it _really_ cheap, i.e. the read lock costing nothing,
> and tie CPU hotplug to freezing all tasks in the system?
>
> Actual CPU hot unplugging and repluggin is _ridiculously_ rare in a
> system, I don't understand how we tolerate _any_ overhead from this utter
> slowpath.

Well, iirc Srivatsa (cc'ed) pointed out that some systems do cpu_down/up
quite often to save the power.

Oleg.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 15:21       ` Peter Zijlstra
@ 2013-10-10 15:36         ` Oleg Nesterov
  2013-10-10 16:50         ` Ingo Molnar
  1 sibling, 0 replies; 76+ messages in thread
From: Oleg Nesterov @ 2013-10-10 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Paul McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On 10/10, Peter Zijlstra wrote:
>
> That said, Oleg wants to use the same scheme for percpu_rwsem,

Yes, and later then (I hope) get_online_cpus() will be just
current->cpuhp_ref++ || percpu_down_read().

(just in case, we only need ->cpuhp_ref to ensure that the readers
 can't starve the writers as it currently possible. iow to block the
 new readers but ensure that the recursive get/down_read can't block)

And please look at sb_wait_write/sb_start_write. It uses the same
logic except:

	- it doesn't have the "true" fast-path for readers

	- its "slow" mode is slower than necessary

	- it is buggy (afaics)

> and I'm
> not sure his write-side (something in uprobes) wants to do the same.

surely not ;)

Oleg.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 15:26       ` Oleg Nesterov
@ 2013-10-10 16:00         ` Andrew Morton
  2013-10-10 16:36           ` Steven Rostedt
                             ` (2 more replies)
  2013-10-10 18:52         ` Srivatsa S. Bhat
  1 sibling, 3 replies; 76+ messages in thread
From: Andrew Morton @ 2013-10-10 16:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Srivatsa S. Bhat, Peter Zijlstra, Paul McKenney,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On Thu, 10 Oct 2013 17:26:12 +0200 Oleg Nesterov <oleg@redhat.com> wrote:

> On 10/10, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > But the thing is; our sense of NR_CPUS has shifted, where it used to be
> > > ok to do something like:
> > >
> > >   for_each_cpu()
> > >
> > > With preemption disabled; it gets to be less and less sane to do so,
> > > simply because 'common' hardware has 256+ CPUs these days. If we cannot
> > > rely on preempt disable to exclude hotplug, we must use
> > > get_online_cpus(), but get_online_cpus() is global state and thus cannot
> > > be used at any sort of frequency.
> >
> > So ... why not make it _really_ cheap, i.e. the read lock costing nothing,
> > and tie CPU hotplug to freezing all tasks in the system?
> >
> > Actual CPU hot unplugging and repluggin is _ridiculously_ rare in a
> > system, I don't understand how we tolerate _any_ overhead from this utter
> > slowpath.
> 
> Well, iirc Srivatsa (cc'ed) pointed out that some systems do cpu_down/up
> quite often to save the power.

cpu hotremove already uses stop_machine, so such an approach shouldn't
actually worsen things (a lot) for them?

It's been ages since I looked at this stuff :( Although it isn't used
much, memory hotplug manages to use stop_machine() on the add/remove
(ie, "writer") side and nothing at all on the "reader" side.  Is there
anything which fundamentally prevents cpu hotplug from doing the same?


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 16:00         ` Andrew Morton
@ 2013-10-10 16:36           ` Steven Rostedt
  2013-10-10 16:43             ` Andrew Morton
  2013-10-10 16:52           ` Ingo Molnar
  2013-10-10 16:54           ` Oleg Nesterov
  2 siblings, 1 reply; 76+ messages in thread
From: Steven Rostedt @ 2013-10-10 16:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Ingo Molnar, Srivatsa S. Bhat, Peter Zijlstra,
	Paul McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, linux-kernel

On Thu, 10 Oct 2013 09:00:44 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> It's been ages since I looked at this stuff :( Although it isn't used
> much, memory hotplug manages to use stop_machine() on the add/remove
> (ie, "writer") side and nothing at all on the "reader" side.  Is there
> anything which fundamentally prevents cpu hotplug from doing the same?


I would think that memory hotplug may require stop machine as all CPUs
may touch that memory. But we would like to remove stomp machine from
CPU hotplug. Why prevent all CPUs from running when we want to remove
one?

-- Steve

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 16:36           ` Steven Rostedt
@ 2013-10-10 16:43             ` Andrew Morton
  2013-10-10 16:53               ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Morton @ 2013-10-10 16:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Ingo Molnar, Srivatsa S. Bhat, Peter Zijlstra,
	Paul McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, linux-kernel

On Thu, 10 Oct 2013 12:36:31 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 10 Oct 2013 09:00:44 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > It's been ages since I looked at this stuff :( Although it isn't used
> > much, memory hotplug manages to use stop_machine() on the add/remove
> > (ie, "writer") side and nothing at all on the "reader" side.  Is there
> > anything which fundamentally prevents cpu hotplug from doing the same?
> 
> 
> I would think that memory hotplug may require stop machine as all CPUs
> may touch that memory.

Sure.

> But we would like to remove stomp machine from
> CPU hotplug.

We do?  That's news.  It wasn't mentioned in the changelog and should
have been.  Why?  

> Why prevent all CPUs from running when we want to remove
> one?

So get_online_cpus() goes away.  Nothing is more scalable than nothing!

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 15:21       ` Peter Zijlstra
  2013-10-10 15:36         ` Oleg Nesterov
@ 2013-10-10 16:50         ` Ingo Molnar
  2013-10-10 17:13           ` Paul E. McKenney
  1 sibling, 1 reply; 76+ messages in thread
From: Ingo Molnar @ 2013-10-10 16:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Oleg Nesterov, Paul McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Oct 10, 2013 at 04:57:38PM +0200, Ingo Molnar wrote:
>
> > So ... why not make it _really_ cheap, i.e. the read lock costing 
> > nothing, and tie CPU hotplug to freezing all tasks in the system?
> 
> Such that we freeze regular tasks in userspace and kernel tasks in their 
> special freezer callback so as to guarantee minimal state?

Yes, absolutely.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 16:00         ` Andrew Morton
  2013-10-10 16:36           ` Steven Rostedt
@ 2013-10-10 16:52           ` Ingo Molnar
  2013-10-10 17:44             ` Paul E. McKenney
  2013-10-10 16:54           ` Oleg Nesterov
  2 siblings, 1 reply; 76+ messages in thread
From: Ingo Molnar @ 2013-10-10 16:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Srivatsa S. Bhat, Peter Zijlstra, Paul McKenney,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 10 Oct 2013 17:26:12 +0200 Oleg Nesterov <oleg@redhat.com> wrote:
> 
> > On 10/10, Ingo Molnar wrote:
> > >
> > > * Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > > But the thing is; our sense of NR_CPUS has shifted, where it used to be
> > > > ok to do something like:
> > > >
> > > >   for_each_cpu()
> > > >
> > > > With preemption disabled; it gets to be less and less sane to do 
> > > > so, simply because 'common' hardware has 256+ CPUs these days. If 
> > > > we cannot rely on preempt disable to exclude hotplug, we must use 
> > > > get_online_cpus(), but get_online_cpus() is global state and thus 
> > > > cannot be used at any sort of frequency.
> > >
> > > So ... why not make it _really_ cheap, i.e. the read lock costing 
> > > nothing, and tie CPU hotplug to freezing all tasks in the system?
> > >
> > > Actual CPU hot unplugging and repluggin is _ridiculously_ rare in a 
> > > system, I don't understand how we tolerate _any_ overhead from this 
> > > utter slowpath.
> > 
> > Well, iirc Srivatsa (cc'ed) pointed out that some systems do 
> > cpu_down/up quite often to save the power.
> 
> cpu hotremove already uses stop_machine, so such an approach shouldn't 
> actually worsen things (a lot) for them?

Also, using CPU hotremove to save power, instead of implementing proper 
power aware scheduling, is very broken to begin with.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 16:43             ` Andrew Morton
@ 2013-10-10 16:53               ` Peter Zijlstra
  2013-10-10 17:13                 ` Steven Rostedt
  2013-10-10 17:39                 ` [PATCH 0/6] Optimize the cpu hotplug locking -v2 Oleg Nesterov
  0 siblings, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-10 16:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Steven Rostedt, Oleg Nesterov, Ingo Molnar, Srivatsa S. Bhat,
	Paul McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, linux-kernel

On Thu, Oct 10, 2013 at 09:43:55AM -0700, Andrew Morton wrote:
> > But we would like to remove stomp machine from
> > CPU hotplug.
> 
> We do?  That's news.  It wasn't mentioned in the changelog and should
> have been.  Why?  

It would be an unrelated change to this and unrelated to the reasons as
to why I want a faster get_online_cpus().

> > Why prevent all CPUs from running when we want to remove
> > one?
> 
> So get_online_cpus() goes away.  Nothing is more scalable than nothing!

Very much agreed; now stop_machine() wouldn't actually work for hotplug
because it will instantly preempt everybody, including someone who might
be in the middle of using per-cpu state of the cpu we're about to
remove.

The freeze suggestion from Ingo would actually work because we freeze
tasks at known good points (userspace and kthread_freeze() points) where
we know they're not fiddling with per-cpu state.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 16:00         ` Andrew Morton
  2013-10-10 16:36           ` Steven Rostedt
  2013-10-10 16:52           ` Ingo Molnar
@ 2013-10-10 16:54           ` Oleg Nesterov
  2013-10-10 19:04             ` Srivatsa S. Bhat
  2 siblings, 1 reply; 76+ messages in thread
From: Oleg Nesterov @ 2013-10-10 16:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Srivatsa S. Bhat, Peter Zijlstra, Paul McKenney,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On 10/10, Andrew Morton wrote:
>
> On Thu, 10 Oct 2013 17:26:12 +0200 Oleg Nesterov <oleg@redhat.com> wrote:
>
> > On 10/10, Ingo Molnar wrote:
> > >
> > > So ... why not make it _really_ cheap, i.e. the read lock costing nothing,
> > > and tie CPU hotplug to freezing all tasks in the system?
> > >
> > > Actual CPU hot unplugging and repluggin is _ridiculously_ rare in a
> > > system, I don't understand how we tolerate _any_ overhead from this utter
> > > slowpath.
> >
> > Well, iirc Srivatsa (cc'ed) pointed out that some systems do cpu_down/up
> > quite often to save the power.
>
> cpu hotremove already uses stop_machine,

And Srivatsa wants to remove it from cpu_down().

> so such an approach shouldn't
> actually worsen things (a lot) for them?

this depends on what this "freezing all tasks" actually means.
I understood it as try_to_freeze_tasks/etc, looks too heavy...

But my only point was, I am not sure we can assume that cpu_down/up
is extremly rare and its cost does not matter.

> use stop_machine() on the add/remove
> (ie, "writer") side and nothing at all on the "reader" side.  Is there
> anything which fundamentally prevents cpu hotplug from doing the same?

Well, then we actually need to park all tasks in system, I guess.
IOW, freezer.

Oleg.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 16:53               ` Peter Zijlstra
@ 2013-10-10 17:13                 ` Steven Rostedt
  2013-10-10 17:48                   ` Andrew Morton
  2013-10-10 17:39                 ` [PATCH 0/6] Optimize the cpu hotplug locking -v2 Oleg Nesterov
  1 sibling, 1 reply; 76+ messages in thread
From: Steven Rostedt @ 2013-10-10 17:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Oleg Nesterov, Ingo Molnar, Srivatsa S. Bhat,
	Paul McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, linux-kernel

On Thu, 10 Oct 2013 18:53:37 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Oct 10, 2013 at 09:43:55AM -0700, Andrew Morton wrote:
> > > But we would like to remove stomp machine from
> > > CPU hotplug.
> > 
> > We do?  That's news.  It wasn't mentioned in the changelog and should
> > have been.  Why?  
> 
> It would be an unrelated change to this and unrelated to the reasons as
> to why I want a faster get_online_cpus().

Yeah, sorry for the confusion. My comment wasn't really about this
change set but about stop machine and hotplug in general. Needing stop
machine for hotplug has been a complaint by many, but off topic for
this particular change set.

> 
> > > Why prevent all CPUs from running when we want to remove
> > > one?
> > 
> > So get_online_cpus() goes away.  Nothing is more scalable than nothing!
> 
> Very much agreed; now stop_machine() wouldn't actually work for hotplug
> because it will instantly preempt everybody, including someone who might
> be in the middle of using per-cpu state of the cpu we're about to
> remove.

Well, stop machine doesn't instantly preempt everybody. Only those that
don't have preemption disabled. Using per_cpu without preemption
disabled can be dangerous. Except for the migrate disable we want to
add for -rt. Then we can't rely on migrate disable and stop machine
making sure per_cpu data isn't being used. Oh, and for threads that use
per_cpu that are bound to a CPU. But then they need to be taken care of
too for their CPU going off line.

But this would also require the "get_online_cpu()s" to disable
preemption as well. Not quite a "nothing" we are looking for.


-- Steve

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 16:50         ` Ingo Molnar
@ 2013-10-10 17:13           ` Paul E. McKenney
  2013-10-10 17:35             ` Ingo Molnar
  2013-10-10 18:35             ` Peter Zijlstra
  0 siblings, 2 replies; 76+ messages in thread
From: Paul E. McKenney @ 2013-10-10 17:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andrew Morton, Oleg Nesterov, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On Thu, Oct 10, 2013 at 06:50:46PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, Oct 10, 2013 at 04:57:38PM +0200, Ingo Molnar wrote:
> >
> > > So ... why not make it _really_ cheap, i.e. the read lock costing 
> > > nothing, and tie CPU hotplug to freezing all tasks in the system?
> > 
> > Such that we freeze regular tasks in userspace and kernel tasks in their 
> > special freezer callback so as to guarantee minimal state?
> 
> Yes, absolutely.

That does add quite a bit of latency to the hotplug operations, which
IIRC slows down things like boot, suspend, and resume.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 17:13           ` Paul E. McKenney
@ 2013-10-10 17:35             ` Ingo Molnar
  2013-10-10 18:35             ` Peter Zijlstra
  1 sibling, 0 replies; 76+ messages in thread
From: Ingo Molnar @ 2013-10-10 17:35 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Andrew Morton, Oleg Nesterov, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel


* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> On Thu, Oct 10, 2013 at 06:50:46PM +0200, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Thu, Oct 10, 2013 at 04:57:38PM +0200, Ingo Molnar wrote:
> > >
> > > > So ... why not make it _really_ cheap, i.e. the read lock costing 
> > > > nothing, and tie CPU hotplug to freezing all tasks in the system?
> > > 
> > > Such that we freeze regular tasks in userspace and kernel tasks in 
> > > their special freezer callback so as to guarantee minimal state?
> > 
> > Yes, absolutely.
> 
> That does add quite a bit of latency to the hotplug operations, which 
> IIRC slows down things like boot, suspend, and resume.

I think it should _speed up_ suspend: instead of bringing down each CPU 
one by one, they could just be all stopped.

Same for bootup - it's not really a hotplug operation in the traditional 
sense, as all CPUs are 'present' at once, and they could be added in one 
group operation instead of serializing the bootup.

Also, most CPU bootup sequences are serialized.

It would slow down individual CPU hotplug/hotunplug - and that is what is 
a really rare operation outside broken schemes to save power ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 16:53               ` Peter Zijlstra
  2013-10-10 17:13                 ` Steven Rostedt
@ 2013-10-10 17:39                 ` Oleg Nesterov
  1 sibling, 0 replies; 76+ messages in thread
From: Oleg Nesterov @ 2013-10-10 17:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Steven Rostedt, Ingo Molnar, Srivatsa S. Bhat,
	Paul McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, linux-kernel

On 10/10, Peter Zijlstra wrote:
>
> The freeze suggestion from Ingo would actually work because we freeze
> tasks at known good points

Not really known/good, we have more and more freezable_schedule's.
But probably this is fine, nobody should do this under get_online_cpus().

> (userspace and kthread_freeze() points) where
> we know they're not fiddling with per-cpu state.

We should also take care of PF_NOFREEZE kthreads.

Oleg.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 16:52           ` Ingo Molnar
@ 2013-10-10 17:44             ` Paul E. McKenney
  0 siblings, 0 replies; 76+ messages in thread
From: Paul E. McKenney @ 2013-10-10 17:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Oleg Nesterov, Srivatsa S. Bhat, Peter Zijlstra,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On Thu, Oct 10, 2013 at 06:52:29PM +0200, Ingo Molnar wrote:
> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Thu, 10 Oct 2013 17:26:12 +0200 Oleg Nesterov <oleg@redhat.com> wrote:
> > 
> > > On 10/10, Ingo Molnar wrote:
> > > >
> > > > * Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > > But the thing is; our sense of NR_CPUS has shifted, where it used to be
> > > > > ok to do something like:
> > > > >
> > > > >   for_each_cpu()
> > > > >
> > > > > With preemption disabled; it gets to be less and less sane to do 
> > > > > so, simply because 'common' hardware has 256+ CPUs these days. If 
> > > > > we cannot rely on preempt disable to exclude hotplug, we must use 
> > > > > get_online_cpus(), but get_online_cpus() is global state and thus 
> > > > > cannot be used at any sort of frequency.
> > > >
> > > > So ... why not make it _really_ cheap, i.e. the read lock costing 
> > > > nothing, and tie CPU hotplug to freezing all tasks in the system?
> > > >
> > > > Actual CPU hot unplugging and repluggin is _ridiculously_ rare in a 
> > > > system, I don't understand how we tolerate _any_ overhead from this 
> > > > utter slowpath.
> > > 
> > > Well, iirc Srivatsa (cc'ed) pointed out that some systems do 
> > > cpu_down/up quite often to save the power.
> > 
> > cpu hotremove already uses stop_machine, so such an approach shouldn't 
> > actually worsen things (a lot) for them?
> 
> Also, using CPU hotremove to save power, instead of implementing proper 
> power aware scheduling, is very broken to begin with.

In many cases, agreed.  Particularly on devices where there is an easy
way to send out regular updates and bug fixes, and especially if they
are connected to AC power.

But if I was in charge of creating a battery-powered multicore embedded
device that was to ship in million-unit quantities, especially in the
not-uncommon case where more than one CPU is needed only for specific
actions, I would offline the extra CPUs except when I was sure they were
needed.  This would of course still rely on power-aware scheduling when
more than one CPU was in use.  The power-aware scheduling conserves energy
for short-term variations (in which case offlining really is useless),
but longer term "keep this CPU -off-" is provided by offlining.

Yes, the Linux kernel code quality is quite good, but bugs still happen,
and bugs can easily cause CPUs to power up unnecessarily.  This doesn't
have to happen very often to kill your battery.  If the extra CPUs are
offline, a large class of those bugs are rendered harmless.  If you are
going to have a very large number of difficult-to-update battery-powered
devices in the field, defending against such bugs is worthwhile.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 17:13                 ` Steven Rostedt
@ 2013-10-10 17:48                   ` Andrew Morton
  2013-10-10 18:10                     ` Linus Torvalds
  2013-10-10 18:34                     ` Peter Zijlstra
  0 siblings, 2 replies; 76+ messages in thread
From: Andrew Morton @ 2013-10-10 17:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Oleg Nesterov, Ingo Molnar, Srivatsa S. Bhat,
	Paul McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, linux-kernel

On Thu, 10 Oct 2013 13:13:05 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:

> > 
> > > > Why prevent all CPUs from running when we want to remove
> > > > one?
> > > 
> > > So get_online_cpus() goes away.  Nothing is more scalable than nothing!
> > 
> > Very much agreed; now stop_machine() wouldn't actually work for hotplug
> > because it will instantly preempt everybody, including someone who might
> > be in the middle of using per-cpu state of the cpu we're about to
> > remove.
> 
> Well, stop machine doesn't instantly preempt everybody. Only those that
> don't have preemption disabled. Using per_cpu without preemption
> disabled can be dangerous.

Yes, I'd have thought that the cases where a CPU is fiddling with
another CPU's percpu data with preemption enabled would be rather rare.

I can't actually think of any off the top.  Are there examples we can
look at?


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 17:48                   ` Andrew Morton
@ 2013-10-10 18:10                     ` Linus Torvalds
  2013-10-10 18:43                       ` Steven Rostedt
  2013-10-10 18:46                       ` Peter Zijlstra
  2013-10-10 18:34                     ` Peter Zijlstra
  1 sibling, 2 replies; 76+ messages in thread
From: Linus Torvalds @ 2013-10-10 18:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Steven Rostedt, Peter Zijlstra, Oleg Nesterov, Ingo Molnar,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Thu, Oct 10, 2013 at 10:48 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> Yes, I'd have thought that the cases where a CPU is fiddling with
> another CPU's percpu data with preemption enabled would be rather rare.
>
> I can't actually think of any off the top.  Are there examples we can
> look at?

You can't do that right now - since you have to get the cpu list. So
it may not be with "preemption enabled", but it should always be under
the locking provided by get_online_cpus().. That one allows sleeping,
though.

I personally would *love* to make CPU hotplug be a lockless thing
entirely. But I detest stop-machine too, because it has these really
annoying properties.

So if we want to make it zero-cost to look at online CPU data, can we
avoid even the stop-machine synchronization, instead saying that the
cpu hotplug bitmap is updated completely locklessly, but if you see a
bit set, the data associated with that CPU is guaranteed to still be
available.

IOW, just use "RCU semantics" on a per-bit level. When we offline a CPU, we do

     clear_bit(cpu, cpu_online_mask);
     rcu_synchronize();
     .. now we can free all the percpu data and kill the CPU ..

without any locking anywhere - not stop-machine, not anything. If
somebody is doing a "for_each_cpu()" (under just a regular
rcu_read_lock()) and they see the bit set while it's going down, who
cares? The CPU is still there, the data is accessible..

I'm sure there's some reason the above wouldn't work, but the above
would seem to be pretty optimal. Why do we really force this big
locking thing? The new patches make that locking _smarter_, but it's
still a damn big lock. Could we possibly go _beyond_ the lock?

                 Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 17:48                   ` Andrew Morton
  2013-10-10 18:10                     ` Linus Torvalds
@ 2013-10-10 18:34                     ` Peter Zijlstra
  2013-10-10 18:49                       ` Linus Torvalds
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-10 18:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Steven Rostedt, Oleg Nesterov, Ingo Molnar, Srivatsa S. Bhat,
	Paul McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, linux-kernel

On Thu, Oct 10, 2013 at 10:48:56AM -0700, Andrew Morton wrote:
> > > Very much agreed; now stop_machine() wouldn't actually work for hotplug
> > > because it will instantly preempt everybody, including someone who might
> > > be in the middle of using per-cpu state of the cpu we're about to
> > > remove.
> > 
> > Well, stop machine doesn't instantly preempt everybody. Only those that
> > don't have preemption disabled. Using per_cpu without preemption
> > disabled can be dangerous.

Can be, but there's more of that around than you want. Also that's not
the entire issue.

> Yes, I'd have thought that the cases where a CPU is fiddling with
> another CPU's percpu data with preemption enabled would be rather rare.

Look at kernel/events/core.c:swevent_hlist_get() it does something like:

  get_online_cpus();
  for_each_online_cpu() {
  	allocate_hash_table();
  }
  put_online_cpus();

Surely you don't want to do that with preemption disabled?

But my point is that even though there aren't many of these today; with
the growing number of cpus in 'commodity' hardware you want to move away
from using preempt_disable() as hotplug lock.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 17:13           ` Paul E. McKenney
  2013-10-10 17:35             ` Ingo Molnar
@ 2013-10-10 18:35             ` Peter Zijlstra
  1 sibling, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-10 18:35 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Ingo Molnar, Andrew Morton, Oleg Nesterov, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On Thu, Oct 10, 2013 at 10:13:53AM -0700, Paul E. McKenney wrote:
> That does add quite a bit of latency to the hotplug operations, which
> IIRC slows down things like boot, suspend, and resume.

Guess what suspend does before it unplugs all these cpus?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 18:10                     ` Linus Torvalds
@ 2013-10-10 18:43                       ` Steven Rostedt
  2013-10-10 18:50                         ` Peter Zijlstra
  2013-10-10 19:00                         ` Linus Torvalds
  2013-10-10 18:46                       ` Peter Zijlstra
  1 sibling, 2 replies; 76+ messages in thread
From: Steven Rostedt @ 2013-10-10 18:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Peter Zijlstra, Oleg Nesterov, Ingo Molnar,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Thu, 10 Oct 2013 11:10:35 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
     .. now we can free all the percpu data and kill the CPU ..
> 
> without any locking anywhere - not stop-machine, not anything. If
> somebody is doing a "for_each_cpu()" (under just a regular
> rcu_read_lock()) and they see the bit set while it's going down, who
> cares? The CPU is still there, the data is accessible..

The problem is that rcu_read_lock() requires preemption disabled unless
you are using the preemptable rcu tree version. There's always
srcu_read_lock() but that's not so free. It's basically the same as
what Peter is doing.

There's places in the kernel that does for_each_cpu() that I'm sure you
don't want to disable preemption for. Especially when you start having
4096 CPU machines!

-- Steve

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 18:10                     ` Linus Torvalds
  2013-10-10 18:43                       ` Steven Rostedt
@ 2013-10-10 18:46                       ` Peter Zijlstra
  1 sibling, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-10 18:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Steven Rostedt, Oleg Nesterov, Ingo Molnar,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Thu, Oct 10, 2013 at 11:10:35AM -0700, Linus Torvalds wrote:
> You can't do that right now - since you have to get the cpu list. So
> it may not be with "preemption enabled", but it should always be under
> the locking provided by get_online_cpus().. That one allows sleeping,
> though.
> 
> I personally would *love* to make CPU hotplug be a lockless thing
> entirely. But I detest stop-machine too, because it has these really
> annoying properties.
> 
> So if we want to make it zero-cost to look at online CPU data, can we
> avoid even the stop-machine synchronization, instead saying that the
> cpu hotplug bitmap is updated completely locklessly, but if you see a
> bit set, the data associated with that CPU is guaranteed to still be
> available.
> 
> IOW, just use "RCU semantics" on a per-bit level. When we offline a CPU, we do
> 
>      clear_bit(cpu, cpu_online_mask);
>      rcu_synchronize();
>      .. now we can free all the percpu data and kill the CPU ..
> 
> without any locking anywhere - not stop-machine, not anything. If
> somebody is doing a "for_each_cpu()" (under just a regular
> rcu_read_lock()) and they see the bit set while it's going down, who
> cares? The CPU is still there, the data is accessible..
> 
> I'm sure there's some reason the above wouldn't work, but the above
> would seem to be pretty optimal. Why do we really force this big
> locking thing? The new patches make that locking _smarter_, but it's
> still a damn big lock. Could we possibly go _beyond_ the lock?

The only down-side to doing this is that you cannot actually allocate
memory under rcu_read_lock() because it might not allow preemption.

That said; I like the idea. I'll go try and audit the get_online_cpus()
sites to see if there's any that really need full exclusion.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 18:34                     ` Peter Zijlstra
@ 2013-10-10 18:49                       ` Linus Torvalds
  2013-10-10 19:04                         ` Steven Rostedt
  2013-10-11 12:38                         ` Peter Zijlstra
  0 siblings, 2 replies; 76+ messages in thread
From: Linus Torvalds @ 2013-10-10 18:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Steven Rostedt, Oleg Nesterov, Ingo Molnar,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Thu, Oct 10, 2013 at 11:34 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> But my point is that even though there aren't many of these today; with
> the growing number of cpus in 'commodity' hardware you want to move away
> from using preempt_disable() as hotplug lock.

Umm.

Wasn't this pretty much the argument for the COMPLETELY F*CKED UP
change to make the vma locking use a semaphore?

The whole "it's more convenient to use sleeping locks" argument is
PURE AND UTTER SHIT when it comes to really core code. We are *much*
better off saying "this is core, we care so deeply about performance
that you had better use your brain before you do this".

Seriously. Your argument is bad, but more importantly, it is
*dangerously* bad. It's crap that results in bad code: and the bad
code is almost impossible to fix up later, because once you encourage
people to do the easy thing, they'll do so.

Yes, preempt_disable() is harder to use than sleeping locks. You need
to do pre-allocation etc. But it is much *much* more efficient.

And in the kernel, we care. We have the resources. Plus, we can also
say "if you can't handle it, don't do it". We don't need new features
so badly that we are willing to screw up core code.

As to your kernel/events/core.c example, things like that are
generally pretty easy to fix. The good thing about preemption-disable
is:

 - it's pretty easily debuggable, because you do get big warnings of
"you screwed up exactly _here_"

 - it's so cheap that especially when it comes to things that can
sleep, it's perfectly ok to drop the preemption count, do the sleeping
thing, re-disable preemption and then check if we still needed the
data.

Oh, and I'm sure there are several users that currently depend on
being able to sleep over get_online_cpu's.  But I'm pretty sure it's
"several", not "hundreds", and I think we could fix them up.

                 Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 18:43                       ` Steven Rostedt
@ 2013-10-10 18:50                         ` Peter Zijlstra
  2013-10-10 19:15                           ` Paul E. McKenney
  2013-10-10 19:00                         ` Linus Torvalds
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-10 18:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Andrew Morton, Oleg Nesterov, Ingo Molnar,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Thu, Oct 10, 2013 at 02:43:27PM -0400, Steven Rostedt wrote:
> On Thu, 10 Oct 2013 11:10:35 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>      .. now we can free all the percpu data and kill the CPU ..
> > 
> > without any locking anywhere - not stop-machine, not anything. If
> > somebody is doing a "for_each_cpu()" (under just a regular
> > rcu_read_lock()) and they see the bit set while it's going down, who
> > cares? The CPU is still there, the data is accessible..
> 
> The problem is that rcu_read_lock() requires preemption disabled unless
> you are using the preemptable rcu tree version. There's always
> srcu_read_lock() but that's not so free. It's basically the same as
> what Peter is doing.

No, srcu is actually more expensive in the fast path. Although possibly
we could make SRCU more complex ;-)

> There's places in the kernel that does for_each_cpu() that I'm sure you
> don't want to disable preemption for. Especially when you start having
> 4096 CPU machines!

:-)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 15:26       ` Oleg Nesterov
  2013-10-10 16:00         ` Andrew Morton
@ 2013-10-10 18:52         ` Srivatsa S. Bhat
  1 sibling, 0 replies; 76+ messages in thread
From: Srivatsa S. Bhat @ 2013-10-10 18:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Paul McKenney,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel, Rafael J. Wysocki

On 10/10/2013 08:56 PM, Oleg Nesterov wrote:
> On 10/10, Ingo Molnar wrote:
>>
>> * Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> But the thing is; our sense of NR_CPUS has shifted, where it used to be
>>> ok to do something like:
>>>
>>>   for_each_cpu()
>>>
>>> With preemption disabled; it gets to be less and less sane to do so,
>>> simply because 'common' hardware has 256+ CPUs these days. If we cannot
>>> rely on preempt disable to exclude hotplug, we must use
>>> get_online_cpus(), but get_online_cpus() is global state and thus cannot
>>> be used at any sort of frequency.
>>
>> So ... why not make it _really_ cheap, i.e. the read lock costing nothing,
>> and tie CPU hotplug to freezing all tasks in the system?
>>
>> Actual CPU hot unplugging and repluggin is _ridiculously_ rare in a
>> system, I don't understand how we tolerate _any_ overhead from this utter
>> slowpath.
> 
> Well, iirc Srivatsa (cc'ed) pointed out that some systems do cpu_down/up
> quite often to save the power.
> 

Yes, I've heard of such systems and so I might have brought them up during
discussions about CPU hotplug. But unfortunately, I have been misquoted quite
often, leading to the wrong impression that I have such a usecase or that I
recommend/support using CPU hotplug for power management. So let me clarify
that part, while I have the chance. (And I don't blame anyone for that. I
work on power-management related areas, and I've worked on improving/optimizing
CPU hotplug; so its pretty natural to make a connection between the two
and assume that I tried to optimize CPU hotplug keeping power management
in mind. But that's not the case, as I explain below.)

I started out trying to make suspend/resume more reliable, scalable and fast.
And suspend/resume uses CPU hotplug underneath and that's a pretty valid usecase.
So with that, I started looking at CPU hotplug and soon realized the mess it
had become. So I started working on cleaning up that mess, like rethinking the
whole notifier scheme[1], and removing the ridiculous stop_machine() from the
cpu_down path[2] etc. But the intention behind all this work was just to make
CPU hotplug cleaner/saner/bug-free and possibly speed up suspend/resume. IOW,
I didn't have any explicit intention to make it easier for people to use it
for power management, although I understood that some of this work might
help those poor souls who don't have any other choice, for whatever reason.
And fortunately, (IIUC) the number of systems/people relying on CPU hotplug for
power management has reduced quite a bit in the recent times, which is a very
good thing.

So, to reiterate, I totally agree that power-aware scheduler is the right way
to do CPU power management; CPU hotplug is simply not the tool to use for that.
No question about that. Also, system shutdown used to depend on CPU hotplug to
disable the non-boot CPUs, but we don't do that any more after commit cf7df378a,
which is a very welcome change. And in future if we can somehow do suspend/resume
without using CPU hotplug, that would be absolutely wonderful as well. (There
have been discussions in the past around this, but nobody has a solution yet).
The other valid usecases that I can think of, for using CPU hotplug, is for RAS
reasons and for DLPAR (Dynamic Logical Partitioning) operations on powerpc, both
of which are not performance-sensitive, AFAIK.


[1]. Reverse invocation of CPU hotplug notifiers
     http://lwn.net/Articles/508072/

[2]. Stop-machine()-free CPU hotplug
     http://lwn.net/Articles/538819/ (v6)
     http://lwn.net/Articles/556727/

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 18:43                       ` Steven Rostedt
  2013-10-10 18:50                         ` Peter Zijlstra
@ 2013-10-10 19:00                         ` Linus Torvalds
  1 sibling, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2013-10-10 19:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Morton, Peter Zijlstra, Oleg Nesterov, Ingo Molnar,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Thu, Oct 10, 2013 at 11:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> There's places in the kernel that does for_each_cpu() that I'm sure you
> don't want to disable preemption for. Especially when you start having
> 4096 CPU machines!

We could _easily_ add preemption points in for_each_cpu() for the
MAX_SMP case. We can even do it in the macro itself.

So I wouldn't worry about things like that. I'd expect the "Oh damn,
we have to allocate" case to be more painful, but I think that is
fairly easily handled by having something like a "redo_cpu()" macro
that you can call inside for_each_cpu(). So the pattern for the
(nopefully-not-all-that-common) thing would be something like

   sometype *prealloc = NULL;

   for_each_cpu() {
         if (!prealloc) {
               rcu_read_unlock();  // Or whatever we want to call it
               prealloc = kmalloc(..);
               rcu_read_lock();
               redo_cpu();
          }
   }
   kfree(prealloc);

which is certainly more work than a sleeping lock is, but this looks
like a initialization pattern.

Looking at the things that are actually performance-critical, they
tend to be things like "gather all the statistics for all CPUs".

Honesty in advertizing: I only grepped a couple of users of
get_online_cpus(), and they all seemed like they'd be perfectly happy
with preemption-off. But maybe I was just lucky in my few samples: I
have anecdotes, not data.

              Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 16:54           ` Oleg Nesterov
@ 2013-10-10 19:04             ` Srivatsa S. Bhat
  0 siblings, 0 replies; 76+ messages in thread
From: Srivatsa S. Bhat @ 2013-10-10 19:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Ingo Molnar, Peter Zijlstra, Paul McKenney,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On 10/10/2013 10:24 PM, Oleg Nesterov wrote:
> On 10/10, Andrew Morton wrote:
>>
>> On Thu, 10 Oct 2013 17:26:12 +0200 Oleg Nesterov <oleg@redhat.com> wrote:
>>
>>> On 10/10, Ingo Molnar wrote:
>>>>
>>>> So ... why not make it _really_ cheap, i.e. the read lock costing nothing,
>>>> and tie CPU hotplug to freezing all tasks in the system?
>>>>
>>>> Actual CPU hot unplugging and repluggin is _ridiculously_ rare in a
>>>> system, I don't understand how we tolerate _any_ overhead from this utter
>>>> slowpath.
>>>
>>> Well, iirc Srivatsa (cc'ed) pointed out that some systems do cpu_down/up
>>> quite often to save the power.
>>
>> cpu hotremove already uses stop_machine,
> 
> And Srivatsa wants to remove it from cpu_down().
> 

Yes, I have worked on several designs to remove stop_machine() from the cpu_down
path.

http://lwn.net/Articles/538819/
http://lwn.net/Articles/556727/

>> so such an approach shouldn't
>> actually worsen things (a lot) for them?
> 
> this depends on what this "freezing all tasks" actually means.
> I understood it as try_to_freeze_tasks/etc, looks too heavy...
> 
> But my only point was, I am not sure we can assume that cpu_down/up
> is extremly rare and its cost does not matter.
> 
>> use stop_machine() on the add/remove
>> (ie, "writer") side and nothing at all on the "reader" side.  Is there
>> anything which fundamentally prevents cpu hotplug from doing the same?
> 

Its certainly possible to remove stop_machine() from CPU hotplug, as I've
demonstrated in the patchsets mentioned above. And there were pretty good
performance improvements too, arising out of that change, as described here:

http://article.gmane.org/gmane.linux.ports.ppc.embedded/56122

Regards,
Srivatsa S. Bhat




^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 18:49                       ` Linus Torvalds
@ 2013-10-10 19:04                         ` Steven Rostedt
  2013-10-10 19:16                           ` Linus Torvalds
  2013-10-11 12:38                         ` Peter Zijlstra
  1 sibling, 1 reply; 76+ messages in thread
From: Steven Rostedt @ 2013-10-10 19:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, Oleg Nesterov, Ingo Molnar,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Thu, 10 Oct 2013 11:49:15 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
 
> Oh, and I'm sure there are several users that currently depend on
> being able to sleep over get_online_cpu's.  But I'm pretty sure it's
> "several", not "hundreds", and I think we could fix them up.

I'm wondering if we can have a for_each_cpu() that only disables
preemption in the loop. That is, each iteration will enable preemption,
but the loop itself will guarantee that the current cpu to process
still exists.

	rcu_read_lock();
	for_each_cpu(cpu, mask) {
		rcu_read_unlock();
		rcu_read_lock();
		if (!cpu_online(cpu))
			continue;
		[...]
	}
	rcu_read_unlock();

That way expensive loops wont stop the current CPU to process all
online CPUs.

Of course, it will miss a CPU that comes online. But it would still
need to handle the case of a CPU coming online after the final
put_online_cpus(), so I'm not sure that's a problem.

-- Steve

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 18:50                         ` Peter Zijlstra
@ 2013-10-10 19:15                           ` Paul E. McKenney
  0 siblings, 0 replies; 76+ messages in thread
From: Paul E. McKenney @ 2013-10-10 19:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Linus Torvalds, Andrew Morton, Oleg Nesterov,
	Ingo Molnar, Srivatsa S. Bhat, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Thu, Oct 10, 2013 at 08:50:32PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 10, 2013 at 02:43:27PM -0400, Steven Rostedt wrote:
> > On Thu, 10 Oct 2013 11:10:35 -0700
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >      .. now we can free all the percpu data and kill the CPU ..
> > > 
> > > without any locking anywhere - not stop-machine, not anything. If
> > > somebody is doing a "for_each_cpu()" (under just a regular
> > > rcu_read_lock()) and they see the bit set while it's going down, who
> > > cares? The CPU is still there, the data is accessible..
> > 
> > The problem is that rcu_read_lock() requires preemption disabled unless
> > you are using the preemptable rcu tree version. There's always
> > srcu_read_lock() but that's not so free. It's basically the same as
> > what Peter is doing.
> 
> No, srcu is actually more expensive in the fast path. Although possibly
> we could make SRCU more complex ;-)

Indeed.  Especially if we wanted to get rid of the memory barriers in
srcu_read_lock() and srcu_read_unlock() and still allow SRCU to be used
from idle and offline CPUs!  ;-)

							Thanx, Paul

> > There's places in the kernel that does for_each_cpu() that I'm sure you
> > don't want to disable preemption for. Especially when you start having
> > 4096 CPU machines!
> 
> :-)

But shouldn't you only have to disable over each separate pass through
the loop rather across the entire loop?  "I guarantee that the CPU I
just handed you will remain online through the body of the loop" rather
than "I guarantee that no CPU is going anywhere through the entire loop."

							Thanx, Paul


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 19:04                         ` Steven Rostedt
@ 2013-10-10 19:16                           ` Linus Torvalds
  2013-10-10 19:34                             ` Peter Zijlstra
                                               ` (2 more replies)
  0 siblings, 3 replies; 76+ messages in thread
From: Linus Torvalds @ 2013-10-10 19:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Andrew Morton, Oleg Nesterov, Ingo Molnar,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Thu, Oct 10, 2013 at 12:04 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I'm wondering if we can have a for_each_cpu() that only disables
> preemption in the loop.

I think we'd generally want to have it be something the loop asks for.

If the loop is just some kind of "gather statistics" thing, I don't
think it's required. The cost per loop is so low (usually adding up a
couple of words) that the downside drowns the upside.

And we could easily look at MAXSMP (or NR_CPUS) at compile-time, and
not do it for common small values (although it looks like Fedora
defaults to 128 CPU's for their distro kernels, which seems a bit
excessive - too many by far for normal people, too few for the crazy
big ones).

                Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 19:16                           ` Linus Torvalds
@ 2013-10-10 19:34                             ` Peter Zijlstra
  2013-10-10 19:34                             ` Steven Rostedt
  2013-10-11  6:09                             ` Ingo Molnar
  2 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-10 19:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Andrew Morton, Oleg Nesterov, Ingo Molnar,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Thu, Oct 10, 2013 at 12:16:16PM -0700, Linus Torvalds wrote:
> On Thu, Oct 10, 2013 at 12:04 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I'm wondering if we can have a for_each_cpu() that only disables
> > preemption in the loop.
> 
> I think we'd generally want to have it be something the loop asks for.
> 
> If the loop is just some kind of "gather statistics" thing, I don't
> think it's required. The cost per loop is so low (usually adding up a
> couple of words) that the downside drowns the upside.
> 
> And we could easily look at MAXSMP (or NR_CPUS) at compile-time, and
> not do it for common small values (although it looks like Fedora
> defaults to 128 CPU's for their distro kernels, which seems a bit
> excessive - too many by far for normal people, too few for the crazy
> big ones).

Alternatively we could write it something like:

	rcu_read_lock();
	for_each_online_node(node) {
		for_each_cpu(cpu, cpumask_of_node(node)) {
			...
		}
		cond_resched_rcu();
	}
	rcu_read_unlock();

But yes, I think this pattern (and variations) should work for most
cases.

The one I'm struggling with atm is kernel/cpuset.c:
rebuild_sched_domains_locked(). Although I'm thinking we should do that
seqlock style, just build it and verify if its still valid, if not, try
again -- although I'm sure it'll be 'fun' to get correct.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 19:16                           ` Linus Torvalds
  2013-10-10 19:34                             ` Peter Zijlstra
@ 2013-10-10 19:34                             ` Steven Rostedt
  2013-10-11  6:09                             ` Ingo Molnar
  2 siblings, 0 replies; 76+ messages in thread
From: Steven Rostedt @ 2013-10-10 19:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, Oleg Nesterov, Ingo Molnar,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Thu, 10 Oct 2013 12:16:16 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Oct 10, 2013 at 12:04 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I'm wondering if we can have a for_each_cpu() that only disables
> > preemption in the loop.
> 
> I think we'd generally want to have it be something the loop asks for.

Yeah, I was thinking of adding a new macro, not update for_each_cpu()
itself, so that only the locations that would require it would use it.

> 
> If the loop is just some kind of "gather statistics" thing, I don't
> think it's required. The cost per loop is so low (usually adding up a
> couple of words) that the downside drowns the upside.
> 
> And we could easily look at MAXSMP (or NR_CPUS) at compile-time, and
> not do it for common small values (although it looks like Fedora
> defaults to 128 CPU's for their distro kernels, which seems a bit
> excessive - too many by far for normal people, too few for the crazy
> big ones).

We could always make it a boot time option, and use alternates to
change the code as we do with spin_locks() and other code.

-- Steve (the lover of self modifying code)


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 19:16                           ` Linus Torvalds
  2013-10-10 19:34                             ` Peter Zijlstra
  2013-10-10 19:34                             ` Steven Rostedt
@ 2013-10-11  6:09                             ` Ingo Molnar
  2 siblings, 0 replies; 76+ messages in thread
From: Ingo Molnar @ 2013-10-11  6:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Peter Zijlstra, Andrew Morton, Oleg Nesterov,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Oct 10, 2013 at 12:04 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I'm wondering if we can have a for_each_cpu() that only disables 
> > preemption in the loop.
> 
> I think we'd generally want to have it be something the loop asks for.
> 
> If the loop is just some kind of "gather statistics" thing, I don't 
> think it's required. The cost per loop is so low (usually adding up a 
> couple of words) that the downside drowns the upside.
> 
> And we could easily look at MAXSMP (or NR_CPUS) at compile-time, and not 
> do it for common small values (although it looks like Fedora defaults to 
> 128 CPU's for their distro kernels, which seems a bit excessive - too 
> many by far for normal people, too few for the crazy big ones).

Ubuntu has it at 256, so I guess Fedora is even a bit conservative ;-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-10 18:49                       ` Linus Torvalds
  2013-10-10 19:04                         ` Steven Rostedt
@ 2013-10-11 12:38                         ` Peter Zijlstra
  2013-10-11 18:25                           ` Oleg Nesterov
  2013-10-17 16:49                           ` [tip:sched/core] sched: Remove get_online_cpus() usage tip-bot for Peter Zijlstra
  1 sibling, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-11 12:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Steven Rostedt, Oleg Nesterov, Ingo Molnar,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Thu, Oct 10, 2013 at 11:49:15AM -0700, Linus Torvalds wrote:
> On Thu, Oct 10, 2013 at 11:34 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > But my point is that even though there aren't many of these today; with
> > the growing number of cpus in 'commodity' hardware you want to move away
> > from using preempt_disable() as hotplug lock.
> 
> Umm.
> 
> Wasn't this pretty much the argument for the COMPLETELY F*CKED UP
> change to make the vma locking use a semaphore?

Not entirely; but fair enough, I did screw up there.

> The whole "it's more convenient to use sleeping locks" argument is
> PURE AND UTTER SHIT when it comes to really core code. We are *much*
> better off saying "this is core, we care so deeply about performance
> that you had better use your brain before you do this".
> 
> Seriously. Your argument is bad, but more importantly, it is
> *dangerously* bad. It's crap that results in bad code: and the bad
> code is almost impossible to fix up later, because once you encourage
> people to do the easy thing, they'll do so.

Right, I fell face first into this trap. The existence of
{get,put}_online_cpus() made me stop thinking and use it.

As a penance I'll start by removing all get_online_cpus() usage from the
scheduler.

---
Subject: sched: Remove get_online_cpus() usage

Remove get_online_cpus() usage from the scheduler; there's 4 sites that
use it:

 - sched_init_smp(); where its completely superfluous since we're in
   'early' boot and there simply cannot be any hotplugging.

 - sched_getaffinity(); we already take a raw spinlock to protect the
   task cpus_allowed mask, this disables preemption and therefore
   also stabilizes cpu_online_mask as that's modified using
   stop_machine. However switch to active mask for symmetry with
   sched_setaffinity()/set_cpus_allowed_ptr(). We guarantee active
   mask stability by inserting sync_rcu/sched() into _cpu_down.

 - sched_setaffinity(); we don't appear to need get_online_cpus()
   either, there's two sites where hotplug appears relevant:
    * cpuset_cpus_allowed(); for the !cpuset case we use possible_mask,
      for the cpuset case we hold task_lock, which is a spinlock and
      thus for mainline disables preemption (might cause pain on RT).
    * set_cpus_allowed_ptr(); Holds all scheduler locks and thus has
      preemption properly disabled; also it already deals with hotplug
      races explicitly where it releases them.

 - migrate_swap(); we can make stop_two_cpus() do the heavy lifting for
   us with a little trickery. By adding a sync_sched/rcu() after the
   CPU_DOWN_PREPARE notifier we can provide preempt/rcu guarantees for
   cpu_active_mask. Use these to validate that both our cpus are active
   when queueing the stop work before we queue the stop_machine works
   for take_cpu_down().

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/cpu.c          | 17 +++++++++++++++++
 kernel/sched/core.c   | 20 ++++++++++----------
 kernel/stop_machine.c | 26 +++++++++++++++++++++-----
 3 files changed, 48 insertions(+), 15 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index d7f07a2da5a6..63aa50d7ce1e 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -308,6 +308,23 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
 	}
 	smpboot_park_threads(cpu);
 
+	/*
+	 * By now we've cleared cpu_active_mask, wait for all preempt-disabled
+	 * and RCU users of this state to go away such that all new such users
+	 * will observe it.
+	 *
+	 * For CONFIG_PREEMPT we have preemptible RCU and its sync_rcu() might
+	 * not imply sync_sched(), so explicitly call both.
+	 */
+#ifdef CONFIG_PREEMPT
+	synchronize_sched();
+#endif
+	synchronize_rcu();
+
+	/*
+	 * So now all preempt/rcu users must observe !cpu_active().
+	 */
+
 	err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
 	if (err) {
 		/* CPU didn't die: tell everyone.  Can't complain. */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c3feebcf112..498a5e5a53f5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1081,8 +1081,6 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p)
 	struct migration_swap_arg arg;
 	int ret = -EINVAL;
 
-	get_online_cpus();
-
 	arg = (struct migration_swap_arg){
 		.src_task = cur,
 		.src_cpu = task_cpu(cur),
@@ -1093,6 +1091,10 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p)
 	if (arg.src_cpu == arg.dst_cpu)
 		goto out;
 
+	/*
+	 * These three tests are all lockless; this is OK since all of them
+	 * will be re-checked with proper locks held further down the line.
+	 */
 	if (!cpu_active(arg.src_cpu) || !cpu_active(arg.dst_cpu))
 		goto out;
 
@@ -1105,7 +1107,6 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p)
 	ret = stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
 
 out:
-	put_online_cpus();
 	return ret;
 }
 
@@ -3706,7 +3707,6 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	struct task_struct *p;
 	int retval;
 
-	get_online_cpus();
 	rcu_read_lock();
 
 	p = find_process_by_pid(pid);
@@ -3769,7 +3769,6 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	free_cpumask_var(cpus_allowed);
 out_put_task:
 	put_task_struct(p);
-	put_online_cpus();
 	return retval;
 }
 
@@ -3814,7 +3813,6 @@ long sched_getaffinity(pid_t pid, struct cpumask *mask)
 	unsigned long flags;
 	int retval;
 
-	get_online_cpus();
 	rcu_read_lock();
 
 	retval = -ESRCH;
@@ -3827,12 +3825,11 @@ long sched_getaffinity(pid_t pid, struct cpumask *mask)
 		goto out_unlock;
 
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	cpumask_and(mask, &p->cpus_allowed, cpu_online_mask);
+	cpumask_and(mask, &p->cpus_allowed, cpu_active_mask);
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
 out_unlock:
 	rcu_read_unlock();
-	put_online_cpus();
 
 	return retval;
 }
@@ -6490,14 +6487,17 @@ void __init sched_init_smp(void)
 
 	sched_init_numa();
 
-	get_online_cpus();
+	/*
+	 * There's no userspace yet to cause hotplug operations; hence all the
+	 * cpu masks are stable and all blatant races in the below code cannot
+	 * happen.
+	 */
 	mutex_lock(&sched_domains_mutex);
 	init_sched_domains(cpu_active_mask);
 	cpumask_andnot(non_isolated_cpus, cpu_possible_mask, cpu_isolated_map);
 	if (cpumask_empty(non_isolated_cpus))
 		cpumask_set_cpu(smp_processor_id(), non_isolated_cpus);
 	mutex_unlock(&sched_domains_mutex);
-	put_online_cpus();
 
 	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
 	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 32a6c44d8f78..a6eb6d519284 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -234,11 +234,13 @@ static void irq_cpu_stop_queue_work(void *arg)
  */
 int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg)
 {
-	int call_cpu;
 	struct cpu_stop_done done;
 	struct cpu_stop_work work1, work2;
 	struct irq_cpu_stop_queue_work_info call_args;
-	struct multi_stop_data msdata = {
+	struct multi_stop_data msdata;
+
+	preempt_disable();
+	msdata = (struct multi_stop_date){
 		.fn = fn,
 		.data = arg,
 		.num_threads = 2,
@@ -262,16 +264,30 @@ int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *
 	set_state(&msdata, MULTI_STOP_PREPARE);
 
 	/*
+	 * If we observe both CPUs active we know _cpu_down() cannot yet have
+	 * queued its stop_machine works and therefore ours will get executed
+	 * first. Or its not either one of our CPUs that's getting unplugged,
+	 * in which case we don't care.
+	 *
+	 * This relies on the stopper workqueues to be FIFO.
+	 */
+	if (!cpu_active(cpu1) || !cpu_active(cpu2)) {
+		preempt_enable();
+		return -ENOENT;
+	}
+
+	/*
 	 * Queuing needs to be done by the lowest numbered CPU, to ensure
 	 * that works are always queued in the same order on every CPU.
 	 * This prevents deadlocks.
 	 */
-	call_cpu = min(cpu1, cpu2);
-
-	smp_call_function_single(call_cpu, &irq_cpu_stop_queue_work,
+	smp_call_function_single(min(cpu1, cpu2),
+				 &irq_cpu_stop_queue_work,
 				 &call_args, 0);
+	preempt_enable();
 
 	wait_for_completion(&done.completion);
+
 	return done.executed ? done.ret : -ENOENT;
 }
 

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-11 12:38                         ` Peter Zijlstra
@ 2013-10-11 18:25                           ` Oleg Nesterov
  2013-10-11 20:48                             ` Peter Zijlstra
  2013-10-17 16:49                           ` [tip:sched/core] sched: Remove get_online_cpus() usage tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 76+ messages in thread
From: Oleg Nesterov @ 2013-10-11 18:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Steven Rostedt, Ingo Molnar,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On 10/11, Peter Zijlstra wrote:
>
> As a penance I'll start by removing all get_online_cpus() usage from the
> scheduler.

I only looked at the change in setaffinity,

> @@ -3706,7 +3707,6 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
>  	struct task_struct *p;
>  	int retval;
>
> -	get_online_cpus();
>  	rcu_read_lock();

Hmm. In theory task_rq_lock() doesn't imply rcu-lock, so
set_cpus_allowed_ptr() can miss the change in cpu_active_mask. But this
is probably fine, CPU_DYING does __migrate_task().

However. This means that sched_setaffinity() can fail if it races with
the failing cpu_down() (say, __cpu_notify(CPU_DOWN_PREPARE) fails).
Probably we do not really care, just this looks a bit confusing.

> @@ -3814,7 +3813,6 @@ long sched_getaffinity(pid_t pid, struct cpumask *mask)
>  	unsigned long flags;
>  	int retval;
>
> -	get_online_cpus();

This change is probably fine in any case?

>  	rcu_read_lock();
>
>  	retval = -ESRCH;
> @@ -3827,12 +3825,11 @@ long sched_getaffinity(pid_t pid, struct cpumask *mask)
>  		goto out_unlock;
>
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
> -	cpumask_and(mask, &p->cpus_allowed, cpu_online_mask);
> +	cpumask_and(mask, &p->cpus_allowed, cpu_active_mask);

But I am just curious, is this change is strictly needed?

Afaics we do not care if we race with set_cpu_online(true/false).

Oleg.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-11 18:25                           ` Oleg Nesterov
@ 2013-10-11 20:48                             ` Peter Zijlstra
  2013-10-12 17:06                               ` Oleg Nesterov
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-11 20:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Steven Rostedt, Ingo Molnar,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Fri, Oct 11, 2013 at 08:25:07PM +0200, Oleg Nesterov wrote:
> On 10/11, Peter Zijlstra wrote:
> >
> > As a penance I'll start by removing all get_online_cpus() usage from the
> > scheduler.
> 
> I only looked at the change in setaffinity,
> 
> > @@ -3706,7 +3707,6 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
> >  	struct task_struct *p;
> >  	int retval;
> >
> > -	get_online_cpus();
> >  	rcu_read_lock();
> 
> Hmm. In theory task_rq_lock() doesn't imply rcu-lock, so
> set_cpus_allowed_ptr() can miss the change in cpu_active_mask. But this
> is probably fine, CPU_DYING does __migrate_task().

I'm fine with always doing sync_sched(); sync_rcu(); if that makes you
feel better. But I thought that assuming that !PREEMPT sync_rcu() would
imply sync_sched() was ok. I think the comment there even says as much.

And task_rq_lock() will very much disable preemption; and thus we get
what we want, right?

In any case; the goal was to make either RCU or preempt-disable
sufficient.

> However. This means that sched_setaffinity() can fail if it races with
> the failing cpu_down() (say, __cpu_notify(CPU_DOWN_PREPARE) fails).
> Probably we do not really care, just this looks a bit confusing.

Couldn't be bothered; failing hotplug will have side-effects any which
way.

> > @@ -3814,7 +3813,6 @@ long sched_getaffinity(pid_t pid, struct cpumask *mask)
> >  	unsigned long flags;
> >  	int retval;
> >
> > -	get_online_cpus();
> 
> This change is probably fine in any case?

Yes.

> >  	rcu_read_lock();
> >
> >  	retval = -ESRCH;
> > @@ -3827,12 +3825,11 @@ long sched_getaffinity(pid_t pid, struct cpumask *mask)
> >  		goto out_unlock;
> >
> >  	raw_spin_lock_irqsave(&p->pi_lock, flags);
> > -	cpumask_and(mask, &p->cpus_allowed, cpu_online_mask);
> > +	cpumask_and(mask, &p->cpus_allowed, cpu_active_mask);
> 
> But I am just curious, is this change is strictly needed?

No; we could do without. It really doesn't matter much if anything. I
only did it because sched_setaffinity()->set_cpus_allowed_ptr() checks
against active, not online. And had a sudden urge to make get/set
symmetric -- totally pointless otherwise.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-11 20:48                             ` Peter Zijlstra
@ 2013-10-12 17:06                               ` Oleg Nesterov
  2013-10-14  9:05                                 ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Oleg Nesterov @ 2013-10-12 17:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Steven Rostedt, Ingo Molnar,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On 10/11, Peter Zijlstra wrote:
>
> On Fri, Oct 11, 2013 at 08:25:07PM +0200, Oleg Nesterov wrote:
> > On 10/11, Peter Zijlstra wrote:
> > >
> > > As a penance I'll start by removing all get_online_cpus() usage from the
> > > scheduler.
> >
> > I only looked at the change in setaffinity,
> >
> > > @@ -3706,7 +3707,6 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
> > >  	struct task_struct *p;
> > >  	int retval;
> > >
> > > -	get_online_cpus();
> > >  	rcu_read_lock();
> >
> > Hmm. In theory task_rq_lock() doesn't imply rcu-lock, so
> > set_cpus_allowed_ptr() can miss the change in cpu_active_mask. But this
> > is probably fine, CPU_DYING does __migrate_task().
>
> I'm fine with always doing sync_sched(); sync_rcu(); if that makes you
> feel better.

No, I was just curious. iow, I am asking, not arguing.

> But I thought that assuming that !PREEMPT sync_rcu() would
> imply sync_sched() was ok. I think the comment there even says as much.
>
> And task_rq_lock() will very much disable preemption; and thus we get
> what we want, right?

it even disables irqs, so this should always imply rcu_read_lock() with
any implementation, I guess. However I was told we should not rely on
this, and say posix_timer_event() does rcu_read_lock() even it is always
called under spin_lock_irq().

But what I actually tried to say, it seems that this particular change
looks fine even if cpu_down() doesn't do sync_sched/rcu at all, because
we can rely on __migrate_task(). IOW, if we race with DOWN_PREPARE and
miss set_cpu_active(false) we can pretend that this CPU goes down later.

> In any case; the goal was to make either RCU or preempt-disable
> sufficient.
>
> > However. This means that sched_setaffinity() can fail if it races with
> > the failing cpu_down() (say, __cpu_notify(CPU_DOWN_PREPARE) fails).
> > Probably we do not really care, just this looks a bit confusing.
>
> Couldn't be bothered; failing hotplug will have side-effects any which
> way.

OK.

> > > @@ -3827,12 +3825,11 @@ long sched_getaffinity(pid_t pid, struct cpumask *mask)
> > >  		goto out_unlock;
> > >
> > >  	raw_spin_lock_irqsave(&p->pi_lock, flags);
> > > -	cpumask_and(mask, &p->cpus_allowed, cpu_online_mask);
> > > +	cpumask_and(mask, &p->cpus_allowed, cpu_active_mask);
> >
> > But I am just curious, is this change is strictly needed?
>
> No; we could do without. It really doesn't matter much if anything. I
> only did it because sched_setaffinity()->set_cpus_allowed_ptr() checks
> against active, not online. And had a sudden urge to make get/set
> symmetric -- totally pointless otherwise.

OK, thanks, I was just curious.

In fact I do not even understand why getaffinity() doesn't simply
return ->cpus_allowed, but this is off-topic.

Oleg.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-12 17:06                               ` Oleg Nesterov
@ 2013-10-14  9:05                                 ` Peter Zijlstra
  2013-10-14  9:23                                   ` Paul E. McKenney
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2013-10-14  9:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Steven Rostedt, Ingo Molnar,
	Srivatsa S. Bhat, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Sat, Oct 12, 2013 at 07:06:56PM +0200, Oleg Nesterov wrote:
> it even disables irqs, so this should always imply rcu_read_lock() with
> any implementation, 

Not so; I could make an RCU implementation that drives the state machine
from rcu_read_unlock(). Such an implementation doesn't need the
interrupt driven poll-state driver we currently have and could thus
subvert that assumption :-)

Then again, there's a good reason PaulMck didn't pick this
implementation.

> In fact I do not even understand why getaffinity() doesn't simply
> return ->cpus_allowed, but this is off-topic.

Yeah, me neither :-(, it always surprises me. But changing it is likely
to break stuff so there we are.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-14  9:05                                 ` Peter Zijlstra
@ 2013-10-14  9:23                                   ` Paul E. McKenney
  2013-10-15  1:01                                     ` Paul E. McKenney
  0 siblings, 1 reply; 76+ messages in thread
From: Paul E. McKenney @ 2013-10-14  9:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Linus Torvalds, Andrew Morton, Steven Rostedt,
	Ingo Molnar, Srivatsa S. Bhat, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Mon, Oct 14, 2013 at 11:05:08AM +0200, Peter Zijlstra wrote:
> On Sat, Oct 12, 2013 at 07:06:56PM +0200, Oleg Nesterov wrote:
> > it even disables irqs, so this should always imply rcu_read_lock() with
> > any implementation, 
> 
> Not so; I could make an RCU implementation that drives the state machine
> from rcu_read_unlock(). Such an implementation doesn't need the
> interrupt driven poll-state driver we currently have and could thus
> subvert that assumption :-)
> 
> Then again, there's a good reason PaulMck didn't pick this
> implementation.

True enough, but there really are some out-of-tree RCU implementations
that do take this approach and where disabling interrupts would not
block preemptible RCU.  So please do not rely on this implementation
detail.  You never know...

> > In fact I do not even understand why getaffinity() doesn't simply
> > return ->cpus_allowed, but this is off-topic.
> 
> Yeah, me neither :-(, it always surprises me. But changing it is likely
> to break stuff so there we are.

I know that feeling...

							Thanx, Paul


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-14  9:23                                   ` Paul E. McKenney
@ 2013-10-15  1:01                                     ` Paul E. McKenney
  0 siblings, 0 replies; 76+ messages in thread
From: Paul E. McKenney @ 2013-10-15  1:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Linus Torvalds, Andrew Morton, Steven Rostedt,
	Ingo Molnar, Srivatsa S. Bhat, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linux Kernel Mailing List

On Mon, Oct 14, 2013 at 02:23:55AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 14, 2013 at 11:05:08AM +0200, Peter Zijlstra wrote:
> > On Sat, Oct 12, 2013 at 07:06:56PM +0200, Oleg Nesterov wrote:
> > > it even disables irqs, so this should always imply rcu_read_lock() with
> > > any implementation, 
> > 
> > Not so; I could make an RCU implementation that drives the state machine
> > from rcu_read_unlock(). Such an implementation doesn't need the
> > interrupt driven poll-state driver we currently have and could thus
> > subvert that assumption :-)
> > 
> > Then again, there's a good reason PaulMck didn't pick this
> > implementation.
> 
> True enough, but there really are some out-of-tree RCU implementations
> that do take this approach and where disabling interrupts would not
> block preemptible RCU.  So please do not rely on this implementation
> detail.  You never know...

Actually, the current implementation of SRCU is not blocked by CPUs
disabling interrupts!

							Thanx, Paul

> > > In fact I do not even understand why getaffinity() doesn't simply
> > > return ->cpus_allowed, but this is off-topic.
> > 
> > Yeah, me neither :-(, it always surprises me. But changing it is likely
> > to break stuff so there we are.
> 
> I know that feeling...
> 
> 							Thanx, Paul


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 4/6] rcusync: Introduce struct rcu_sync_ops
  2013-10-08 10:25 ` [PATCH 4/6] rcusync: Introduce struct rcu_sync_ops Peter Zijlstra
  2013-10-08 16:30   ` Paul E. McKenney
@ 2013-10-17  2:07   ` Lai Jiangshan
       [not found]     ` <20131017154228.GL4553@linux.vnet.ibm.com>
  1 sibling, 1 reply; 76+ messages in thread
From: Lai Jiangshan @ 2013-10-17  2:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On 10/08/2013 06:25 PM, Peter Zijlstra wrote:
> From: Oleg Nesterov <oleg@redhat.com>
> 
> Add the new struct rcu_sync_ops which holds sync/call methods, and
> turn the function pointers in rcu_sync_struct into an array of struct
> rcu_sync_ops.


Hi, Paul

I think this work should be done in rcupdate.[ch] side by introducing struct rcu_flavor.

Thanks,
Lai

> 
> This simplifies the "init" helpers, and this way it is simpler to add
> the new methods we need, especially ifdef'ed.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/20131005171746.GA17664@redhat.com
> ---
> 
> diff --git a/include/linux/rcusync.h b/include/linux/rcusync.h
> index 7858491..988ec33 100644
> --- a/include/linux/rcusync.h
> +++ b/include/linux/rcusync.h
> @@ -4,6 +4,8 @@
>  #include <linux/wait.h>
>  #include <linux/rcupdate.h>
>  
> +enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
> +
>  struct rcu_sync_struct {
>  	int			gp_state;
>  	int			gp_count;
> @@ -12,53 +14,37 @@ struct rcu_sync_struct {
>  	int			cb_state;
>  	struct rcu_head		cb_head;
>  
> -	void (*sync)(void);
> -	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
> +	enum rcu_sync_type	gp_type;
>  };
>  
> -#define ___RCU_SYNC_INIT(name)						\
> -	.gp_state = 0,							\
> -	.gp_count = 0,							\
> -	.gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait),		\
> -	.cb_state = 0
> -
> -#define __RCU_SCHED_SYNC_INIT(name) {					\
> -	___RCU_SYNC_INIT(name),						\
> -	.sync = synchronize_sched,					\
> -	.call = call_rcu_sched,						\
> -}
> -
> -#define __RCU_BH_SYNC_INIT(name) {					\
> -	___RCU_SYNC_INIT(name),						\
> -	.sync = synchronize_rcu_bh,					\
> -	.call = call_rcu_bh,						\
> -}
> -
> -#define __RCU_SYNC_INIT(name) {						\
> -	___RCU_SYNC_INIT(name),						\
> -	.sync = synchronize_rcu,					\
> -	.call = call_rcu,						\
> -}
> -
> -#define DEFINE_RCU_SCHED_SYNC(name)					\
> -	struct rcu_sync_struct name = __RCU_SCHED_SYNC_INIT(name)
> -
> -#define DEFINE_RCU_BH_SYNC(name)					\
> -	struct rcu_sync_struct name = __RCU_BH_SYNC_INIT(name)
> -
> -#define DEFINE_RCU_SYNC(name)						\
> -	struct rcu_sync_struct name = __RCU_SYNC_INIT(name)
> -
>  static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
>  {
>  	return !rss->gp_state; /* GP_IDLE */
>  }
>  
> -enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
> -
>  extern void rcu_sync_init(struct rcu_sync_struct *, enum rcu_sync_type);
>  extern void rcu_sync_enter(struct rcu_sync_struct *);
>  extern void rcu_sync_exit(struct rcu_sync_struct *);
>  
> +#define __RCU_SYNC_INITIALIZER(name, type) {				\
> +		.gp_state = 0,						\
> +		.gp_count = 0,						\
> +		.gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait),	\
> +		.cb_state = 0,						\
> +		.gp_type = type,					\
> +	}
> +
> +#define	__DEFINE_RCU_SYNC(name, type)	\
> +	struct rcu_sync_struct name = __RCU_SYNC_INITIALIZER(name, type)
> +
> +#define DEFINE_RCU_SYNC(name)		\
> +	__DEFINE_RCU_SYNC(name, RCU_SYNC)
> +
> +#define DEFINE_RCU_SCHED_SYNC(name)	\
> +	__DEFINE_RCU_SYNC(name, RCU_SCHED_SYNC)
> +
> +#define DEFINE_RCU_BH_SYNC(name)	\
> +	__DEFINE_RCU_SYNC(name, RCU_BH_SYNC)
> +
>  #endif /* _LINUX_RCUSYNC_H_ */
>  
> diff --git a/kernel/rcusync.c b/kernel/rcusync.c
> index f84176a..99051b7 100644
> --- a/kernel/rcusync.c
> +++ b/kernel/rcusync.c
> @@ -1,7 +1,24 @@
> -
>  #include <linux/rcusync.h>
>  #include <linux/sched.h>
>  
> +static const struct {
> +	void (*sync)(void);
> +	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
> +} gp_ops[] = {
> +	[RCU_SYNC] = {
> +		.sync = synchronize_rcu,
> +		.call = call_rcu,
> +	},
> +	[RCU_SCHED_SYNC] = {
> +		.sync = synchronize_sched,
> +		.call = call_rcu_sched,
> +	},
> +	[RCU_BH_SYNC] = {
> +		.sync = synchronize_rcu_bh,
> +		.call = call_rcu_bh,
> +	},
> +};
> +
>  enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
>  enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
>  
> @@ -11,23 +28,7 @@ void rcu_sync_init(struct rcu_sync_struct *rss, enum rcu_sync_type type)
>  {
>  	memset(rss, 0, sizeof(*rss));
>  	init_waitqueue_head(&rss->gp_wait);
> -
> -	switch (type) {
> -	case RCU_SYNC:
> -		rss->sync = synchronize_rcu;
> -		rss->call = call_rcu;
> -		break;
> -
> -	case RCU_SCHED_SYNC:
> -		rss->sync = synchronize_sched;
> -		rss->call = call_rcu_sched;
> -		break;
> -
> -	case RCU_BH_SYNC:
> -		rss->sync = synchronize_rcu_bh;
> -		rss->call = call_rcu_bh;
> -		break;
> -	}
> +	rss->gp_type = type;
>  }
>  
>  void rcu_sync_enter(struct rcu_sync_struct *rss)
> @@ -44,7 +45,7 @@ void rcu_sync_enter(struct rcu_sync_struct *rss)
>  	BUG_ON(need_wait && need_sync);
>  
>  	if (need_sync) {
> -		rss->sync();
> +		gp_ops[rss->gp_type].sync();
>  		rss->gp_state = GP_PASSED;
>  		wake_up_all(&rss->gp_wait);
>  	} else if (need_wait) {
> @@ -81,7 +82,7 @@ static void rcu_sync_func(struct rcu_head *rcu)
>  		 * to catch a later GP.
>  		 */
>  		rss->cb_state = CB_PENDING;
> -		rss->call(&rss->cb_head, rcu_sync_func);
> +		gp_ops[rss->gp_type].call(&rss->cb_head, rcu_sync_func);
>  	} else {
>  		/*
>  		 * We're at least a GP after rcu_sync_exit(); eveybody will now
> @@ -99,7 +100,7 @@ void rcu_sync_exit(struct rcu_sync_struct *rss)
>  	if (!--rss->gp_count) {
>  		if (rss->cb_state == CB_IDLE) {
>  			rss->cb_state = CB_PENDING;
> -			rss->call(&rss->cb_head, rcu_sync_func);
> +			gp_ops[rss->gp_type].call(&rss->cb_head, rcu_sync_func);
>  		} else if (rss->cb_state == CB_PENDING) {
>  			rss->cb_state = CB_REPLAY;
>  		}
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/6] rcu: Create rcu_sync infrastructure
  2013-10-08 10:25 ` [PATCH 2/6] rcu: Create rcu_sync infrastructure Peter Zijlstra
  2013-10-08 20:40   ` Jonathan Corbet
@ 2013-10-17  2:56   ` Lai Jiangshan
  2013-10-17 10:36   ` Srikar Dronamraju
  2 siblings, 0 replies; 76+ messages in thread
From: Lai Jiangshan @ 2013-10-17  2:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Paul McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On 10/08/2013 06:25 PM, Peter Zijlstra wrote:
> From: Oleg Nesterov <oleg@redhat.com>
> 
> It is functionally equivalent to
> 
>         struct rcu_sync_struct {
>                 atomic_t counter;
>         };
> 
>         static inline bool rcu_sync_is_idle(struct rcu_sync_struct *xxx)
>         {
>                 return atomic_read(&xxx->counter) == 0;
>         }
> 
>         static inline void rcu_sync_enter(struct rcu_sync_struct *xxx)
>         {
>                 atomic_inc(&xxx->counter);
>                 synchronize_sched();
>         }
> 
>         static inline void rcu_sync_enter(struct rcu_sync_struct *xxx)
>         {
>                 synchronize_sched();
>                 atomic_dec(&xxx->counter);
>         }
> 
> except: it records the state and synchronize_sched() is only called by
> rcu_sync_enter() and only if necessary.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Reviewed-by: Paul McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/20130929183634.GA15563@redhat.com
> ---
>  include/linux/rcusync.h |   64 ++++++++++++++++++++++++++++
>  kernel/Makefile         |    3 -
>  kernel/rcusync.c        |  108 ++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 174 insertions(+), 1 deletion(-)
> 
> --- /dev/null
> +++ b/include/linux/rcusync.h
> @@ -0,0 +1,64 @@
> +#ifndef _LINUX_RCUSYNC_H_
> +#define _LINUX_RCUSYNC_H_
> +
> +#include <linux/wait.h>
> +#include <linux/rcupdate.h>
> +
> +struct rcu_sync_struct {
> +	int			gp_state;
> +	int			gp_count;
> +	wait_queue_head_t	gp_wait;
> +
> +	int			cb_state;
> +	struct rcu_head		cb_head;
> +
> +	void (*sync)(void);
> +	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
> +};
> +
> +#define ___RCU_SYNC_INIT(name)						\
> +	.gp_state = 0,							\
> +	.gp_count = 0,							\
> +	.gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait),		\
> +	.cb_state = 0
> +
> +#define __RCU_SCHED_SYNC_INIT(name) {					\
> +	___RCU_SYNC_INIT(name),						\
> +	.sync = synchronize_sched,					\
> +	.call = call_rcu_sched,						\
> +}
> +
> +#define __RCU_BH_SYNC_INIT(name) {					\
> +	___RCU_SYNC_INIT(name),						\
> +	.sync = synchronize_rcu_bh,					\
> +	.call = call_rcu_bh,						\
> +}
> +
> +#define __RCU_SYNC_INIT(name) {						\
> +	___RCU_SYNC_INIT(name),						\
> +	.sync = synchronize_rcu,					\
> +	.call = call_rcu,						\
> +}
> +
> +#define DEFINE_RCU_SCHED_SYNC(name)					\
> +	struct rcu_sync_struct name = __RCU_SCHED_SYNC_INIT(name)
> +
> +#define DEFINE_RCU_BH_SYNC(name)					\
> +	struct rcu_sync_struct name = __RCU_BH_SYNC_INIT(name)
> +
> +#define DEFINE_RCU_SYNC(name)						\
> +	struct rcu_sync_struct name = __RCU_SYNC_INIT(name)
> +
> +static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
> +{
> +	return !rss->gp_state; /* GP_IDLE */
> +}

Hi, All

We may need to use ACCESS_ONCE() here to avoid the compiler access it multi-times.

it would be better: return ACCESS_ONCE(rss->gp_state) == GP_IDLE;


-my comment continues until it reaches a "Thanks".-

> +
> +enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
> +
> +extern void rcu_sync_init(struct rcu_sync_struct *, enum rcu_sync_type);
> +extern void rcu_sync_enter(struct rcu_sync_struct *);
> +extern void rcu_sync_exit(struct rcu_sync_struct *);
> +
> +#endif /* _LINUX_RCUSYNC_H_ */
> +
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -10,7 +10,8 @@ obj-y     = fork.o exec_domain.o panic.o
>  	    kthread.o wait.o sys_ni.o posix-cpu-timers.o mutex.o \
>  	    hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
>  	    notifier.o ksysfs.o cred.o reboot.o \
> -	    async.o range.o groups.o lglock.o smpboot.o
> +	    async.o range.o groups.o lglock.o smpboot.o \
> +	    rcusync.o
>  
>  ifdef CONFIG_FUNCTION_TRACER
>  # Do not trace debug files and internal ftrace files
> --- /dev/null
> +++ b/kernel/rcusync.c
> @@ -0,0 +1,108 @@
> +
> +#include <linux/rcusync.h>
> +#include <linux/sched.h>
> +
> +enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
> +enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
> +
> +#define	rss_lock	gp_wait.lock
> +
> +void rcu_sync_init(struct rcu_sync_struct *rss, enum rcu_sync_type type)
> +{
> +	memset(rss, 0, sizeof(*rss));
> +	init_waitqueue_head(&rss->gp_wait);
> +
> +	switch (type) {
> +	case RCU_SYNC:
> +		rss->sync = synchronize_rcu;
> +		rss->call = call_rcu;
> +		break;
> +
> +	case RCU_SCHED_SYNC:
> +		rss->sync = synchronize_sched;
> +		rss->call = call_rcu_sched;
> +		break;
> +
> +	case RCU_BH_SYNC:
> +		rss->sync = synchronize_rcu_bh;
> +		rss->call = call_rcu_bh;
> +		break;
> +	}
> +}
> +
> +void rcu_sync_enter(struct rcu_sync_struct *rss)
> +{
> +	bool need_wait, need_sync;
> +
> +	spin_lock_irq(&rss->rss_lock);
> +	need_wait = rss->gp_count++;
> +	need_sync = rss->gp_state == GP_IDLE;

I suggest that "need_wait = rss->gp_state == GP_PENDING;"



> +	if (need_sync)
> +		rss->gp_state = GP_PENDING;
> +	spin_unlock_irq(&rss->rss_lock);
> +
> +	BUG_ON(need_wait && need_sync);
> +
> +	if (need_sync) {
> +		rss->sync();
> +		rss->gp_state = GP_PASSED;
> +		wake_up_all(&rss->gp_wait);
> +	} else if (need_wait) {
> +		wait_event(rss->gp_wait, rss->gp_state == GP_PASSED);

memory_barrier is required here in case rss->gp_state is read without rss_lock held.

CPU1				CPU2				CPU3
								rcu_read_lock()
								rcu_sync_is_idle()
rcu_sync_enter()
  rss->sync()			rcu_sync_enter()
				/* XXXX */
								rcu_read_unlock()
  rss->sync() returns
  rss->gp_state = GP_PASSED;
				  wait_event()(fastpath)
				/* code here maybe run on XXX by CPU reorder */
				


> +	} else {
> +		/*
> +		 * Possible when there's a pending CB from a rcu_sync_exit().
> +		 * Nobody has yet been allowed the 'fast' path and thus we can
> +		 * avoid doing any sync(). The callback will get 'dropped'.
> +		 */
> +		BUG_ON(rss->gp_state != GP_PASSED);

memory_barrier is required here like above.


> +	}
> +}
> +
> +static void rcu_sync_func(struct rcu_head *rcu)
> +{
> +	struct rcu_sync_struct *rss =
> +		container_of(rcu, struct rcu_sync_struct, cb_head);
> +	unsigned long flags;
> +
> +
> +	BUG_ON(rss->gp_state != GP_PASSED);
> +	BUG_ON(rss->cb_state == CB_IDLE);
> +
> +	spin_lock_irqsave(&rss->rss_lock, flags);
> +	if (rss->gp_count) {
> +		/*
> +		 * A new rcu_sync_begin() has happened; drop the callback.
> +		 */
> +		rss->cb_state = CB_IDLE;
> +	} else if (rss->cb_state == CB_REPLAY) {
> +		/*
> +		 * A new rcu_sync_exit() has happened; requeue the callback
> +		 * to catch a later GP.
> +		 */
> +		rss->cb_state = CB_PENDING;
> +		rss->call(&rss->cb_head, rcu_sync_func);
> +	} else {
> +		/*
> +		 * We're at least a GP after rcu_sync_exit(); eveybody will now

s/eveybody/everybody/



Please add
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>

Thanks,
Lai

> +		 * have observed the write side critical section. Let 'em rip!.
> +		 */
> +		rss->cb_state = CB_IDLE;
> +		rss->gp_state = GP_IDLE;
> +	}
> +	spin_unlock_irqrestore(&rss->rss_lock, flags);
> +}
> +
> +void rcu_sync_exit(struct rcu_sync_struct *rss)
> +{
> +	spin_lock_irq(&rss->rss_lock);
> +	if (!--rss->gp_count) {
> +		if (rss->cb_state == CB_IDLE) {
> +			rss->cb_state = CB_PENDING;
> +			rss->call(&rss->cb_head, rcu_sync_func);
> +		} else if (rss->cb_state == CB_PENDING) {
> +			rss->cb_state = CB_REPLAY;
> +		}
> +	}
> +	spin_unlock_irq(&rss->rss_lock);
> +}
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/6] rcu: Create rcu_sync infrastructure
  2013-10-08 10:25 ` [PATCH 2/6] rcu: Create rcu_sync infrastructure Peter Zijlstra
  2013-10-08 20:40   ` Jonathan Corbet
  2013-10-17  2:56   ` Lai Jiangshan
@ 2013-10-17 10:36   ` Srikar Dronamraju
  2 siblings, 0 replies; 76+ messages in thread
From: Srikar Dronamraju @ 2013-10-17 10:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Paul McKenney, Mel Gorman, Rik van Riel,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Steven Rostedt, Linus Torvalds, linux-kernel

* Peter Zijlstra <peterz@infradead.org> [2013-10-08 12:25:07]:

> From: Oleg Nesterov <oleg@redhat.com>
> 
> It is functionally equivalent to
> 
>         struct rcu_sync_struct {
>                 atomic_t counter;
>         };
> 
>         static inline bool rcu_sync_is_idle(struct rcu_sync_struct *xxx)
>         {
>                 return atomic_read(&xxx->counter) == 0;
>         }
> 
>         static inline void rcu_sync_enter(struct rcu_sync_struct *xxx)
>         {
>                 atomic_inc(&xxx->counter);
>                 synchronize_sched();
>         }
> 
>         static inline void rcu_sync_enter(struct rcu_sync_struct *xxx)

Guess you meant rcu_sync_exit

>         {
>                 synchronize_sched();
>                 atomic_dec(&xxx->counter);
>         }
> 
> except: it records the state and synchronize_sched() is only called by
> rcu_sync_enter() and only if necessary.
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [tip:sched/core] sched: Remove get_online_cpus() usage
  2013-10-11 12:38                         ` Peter Zijlstra
  2013-10-11 18:25                           ` Oleg Nesterov
@ 2013-10-17 16:49                           ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 76+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-17 16:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, torvalds, peterz, riel, rostedt, akpm, aarcange, oleg,
	tglx, hpa, linux-kernel, hannes, srivatsa.bhat, paulmck, srikar,
	mgorman

Commit-ID:  6acce3ef84520537f8a09a12c9ddbe814a584dd2
Gitweb:     http://git.kernel.org/tip/6acce3ef84520537f8a09a12c9ddbe814a584dd2
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 11 Oct 2013 14:38:20 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 16 Oct 2013 14:22:16 +0200

sched: Remove get_online_cpus() usage

Remove get_online_cpus() usage from the scheduler; there's 4 sites that
use it:

 - sched_init_smp(); where its completely superfluous since we're in
   'early' boot and there simply cannot be any hotplugging.

 - sched_getaffinity(); we already take a raw spinlock to protect the
   task cpus_allowed mask, this disables preemption and therefore
   also stabilizes cpu_online_mask as that's modified using
   stop_machine. However switch to active mask for symmetry with
   sched_setaffinity()/set_cpus_allowed_ptr(). We guarantee active
   mask stability by inserting sync_rcu/sched() into _cpu_down.

 - sched_setaffinity(); we don't appear to need get_online_cpus()
   either, there's two sites where hotplug appears relevant:
    * cpuset_cpus_allowed(); for the !cpuset case we use possible_mask,
      for the cpuset case we hold task_lock, which is a spinlock and
      thus for mainline disables preemption (might cause pain on RT).
    * set_cpus_allowed_ptr(); Holds all scheduler locks and thus has
      preemption properly disabled; also it already deals with hotplug
      races explicitly where it releases them.

 - migrate_swap(); we can make stop_two_cpus() do the heavy lifting for
   us with a little trickery. By adding a sync_sched/rcu() after the
   CPU_DOWN_PREPARE notifier we can provide preempt/rcu guarantees for
   cpu_active_mask. Use these to validate that both our cpus are active
   when queueing the stop work before we queue the stop_machine works
   for take_cpu_down().

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20131011123820.GV3081@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/cpu.c          | 17 +++++++++++++++++
 kernel/sched/core.c   | 20 ++++++++++----------
 kernel/stop_machine.c | 26 +++++++++++++++++++++-----
 3 files changed, 48 insertions(+), 15 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index d7f07a2..63aa50d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -308,6 +308,23 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
 	}
 	smpboot_park_threads(cpu);
 
+	/*
+	 * By now we've cleared cpu_active_mask, wait for all preempt-disabled
+	 * and RCU users of this state to go away such that all new such users
+	 * will observe it.
+	 *
+	 * For CONFIG_PREEMPT we have preemptible RCU and its sync_rcu() might
+	 * not imply sync_sched(), so explicitly call both.
+	 */
+#ifdef CONFIG_PREEMPT
+	synchronize_sched();
+#endif
+	synchronize_rcu();
+
+	/*
+	 * So now all preempt/rcu users must observe !cpu_active().
+	 */
+
 	err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
 	if (err) {
 		/* CPU didn't die: tell everyone.  Can't complain. */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a972acd..c06b8d3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1085,8 +1085,6 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p)
 	struct migration_swap_arg arg;
 	int ret = -EINVAL;
 
-	get_online_cpus();
-
 	arg = (struct migration_swap_arg){
 		.src_task = cur,
 		.src_cpu = task_cpu(cur),
@@ -1097,6 +1095,10 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p)
 	if (arg.src_cpu == arg.dst_cpu)
 		goto out;
 
+	/*
+	 * These three tests are all lockless; this is OK since all of them
+	 * will be re-checked with proper locks held further down the line.
+	 */
 	if (!cpu_active(arg.src_cpu) || !cpu_active(arg.dst_cpu))
 		goto out;
 
@@ -1109,7 +1111,6 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p)
 	ret = stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
 
 out:
-	put_online_cpus();
 	return ret;
 }
 
@@ -3710,7 +3711,6 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	struct task_struct *p;
 	int retval;
 
-	get_online_cpus();
 	rcu_read_lock();
 
 	p = find_process_by_pid(pid);
@@ -3773,7 +3773,6 @@ out_free_cpus_allowed:
 	free_cpumask_var(cpus_allowed);
 out_put_task:
 	put_task_struct(p);
-	put_online_cpus();
 	return retval;
 }
 
@@ -3818,7 +3817,6 @@ long sched_getaffinity(pid_t pid, struct cpumask *mask)
 	unsigned long flags;
 	int retval;
 
-	get_online_cpus();
 	rcu_read_lock();
 
 	retval = -ESRCH;
@@ -3831,12 +3829,11 @@ long sched_getaffinity(pid_t pid, struct cpumask *mask)
 		goto out_unlock;
 
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	cpumask_and(mask, &p->cpus_allowed, cpu_online_mask);
+	cpumask_and(mask, &p->cpus_allowed, cpu_active_mask);
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
 out_unlock:
 	rcu_read_unlock();
-	put_online_cpus();
 
 	return retval;
 }
@@ -6494,14 +6491,17 @@ void __init sched_init_smp(void)
 
 	sched_init_numa();
 
-	get_online_cpus();
+	/*
+	 * There's no userspace yet to cause hotplug operations; hence all the
+	 * cpu masks are stable and all blatant races in the below code cannot
+	 * happen.
+	 */
 	mutex_lock(&sched_domains_mutex);
 	init_sched_domains(cpu_active_mask);
 	cpumask_andnot(non_isolated_cpus, cpu_possible_mask, cpu_isolated_map);
 	if (cpumask_empty(non_isolated_cpus))
 		cpumask_set_cpu(smp_processor_id(), non_isolated_cpus);
 	mutex_unlock(&sched_domains_mutex);
-	put_online_cpus();
 
 	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
 	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 32a6c44..c530bc5 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -234,11 +234,13 @@ static void irq_cpu_stop_queue_work(void *arg)
  */
 int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg)
 {
-	int call_cpu;
 	struct cpu_stop_done done;
 	struct cpu_stop_work work1, work2;
 	struct irq_cpu_stop_queue_work_info call_args;
-	struct multi_stop_data msdata = {
+	struct multi_stop_data msdata;
+
+	preempt_disable();
+	msdata = (struct multi_stop_data){
 		.fn = fn,
 		.data = arg,
 		.num_threads = 2,
@@ -262,16 +264,30 @@ int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *
 	set_state(&msdata, MULTI_STOP_PREPARE);
 
 	/*
+	 * If we observe both CPUs active we know _cpu_down() cannot yet have
+	 * queued its stop_machine works and therefore ours will get executed
+	 * first. Or its not either one of our CPUs that's getting unplugged,
+	 * in which case we don't care.
+	 *
+	 * This relies on the stopper workqueues to be FIFO.
+	 */
+	if (!cpu_active(cpu1) || !cpu_active(cpu2)) {
+		preempt_enable();
+		return -ENOENT;
+	}
+
+	/*
 	 * Queuing needs to be done by the lowest numbered CPU, to ensure
 	 * that works are always queued in the same order on every CPU.
 	 * This prevents deadlocks.
 	 */
-	call_cpu = min(cpu1, cpu2);
-
-	smp_call_function_single(call_cpu, &irq_cpu_stop_queue_work,
+	smp_call_function_single(min(cpu1, cpu2),
+				 &irq_cpu_stop_queue_work,
 				 &call_args, 0);
+	preempt_enable();
 
 	wait_for_completion(&done.completion);
+
 	return done.executed ? done.ret : -ENOENT;
 }
 

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 4/6] rcusync: Introduce struct rcu_sync_ops
       [not found]     ` <20131017154228.GL4553@linux.vnet.ibm.com>
@ 2013-10-18  1:23       ` Lai Jiangshan
  2013-10-18 12:10         ` Oleg Nesterov
  0 siblings, 1 reply; 76+ messages in thread
From: Lai Jiangshan @ 2013-10-18  1:23 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Oleg Nesterov, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On 10/17/2013 11:42 PM, Paul E. McKenney wrote:
> On Thu, Oct 17, 2013 at 10:07:15AM +0800, Lai Jiangshan wrote:
>> On 10/08/2013 06:25 PM, Peter Zijlstra wrote:
>>> From: Oleg Nesterov <oleg@redhat.com>
>>>
>>> Add the new struct rcu_sync_ops which holds sync/call methods, and
>>> turn the function pointers in rcu_sync_struct into an array of struct
>>> rcu_sync_ops.
>>
>> Hi, Paul
>>
>> I think this work should be done in rcupdate.[ch] side by introducing
>> struct rcu_flavor.
> 
> I -do- have on my list to add an rcutorture test for rcu_sync, but
> what do you have in mind by adding struct rcu_flavor?  I am guessing
> that you do not mean to try to create an rcu_state and a set of

No.
The thing what I need is just as same as Oleg Nesterov implemented.
It is just a structure with several function pointers for different RCU variants.
But it would be better if we implement in rcupdate.[ch],
and name it to struct rcu_flavor like the URCU.

After we have struct rcu_flavor, we can replace the following code
to a pointer to struct rcu_flavor.

struct rcu_state:
	void (*call)(struct rcu_head *head,	/* call_rcu() flavor. */
		     void (*func)(struct rcu_head *head));

struct rcu_torture_ops {
	int (*readlock)(void);
	void (*readunlock)(int idx);
	void (*sync)(void);
	void (*exp_sync)(void);
	void (*call)(struct rcu_head *head, void (*func)(struct rcu_head *rcu));
	void (*cb_barrier)(void);
};

and struct rcu_sync {
	...
};


Thanks,
Lai.



> rcu_data structures for rcu_sync -- the write-side optimizations
> would not fit well into rcu_state/rcu_data.
> 
> You might mean to simply put this code into rcupdate.[ch], which
> would be OK, but rcu_sync seems to me to be sufficiently different
> to deserve its own .c and .h files.
> 
> So what did you have in mind?

I mean we need something like struct rcu_flavor as the URCU.


> 
> 							Thanx, Paul
> 
>> Thanks,
>> Lai
>>
>>>
>>> This simplifies the "init" helpers, and this way it is simpler to add
>>> the new methods we need, especially ifdef'ed.
>>>
>>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>>> Link: http://lkml.kernel.org/r/20131005171746.GA17664@redhat.com
>>> ---
>>>
>>> diff --git a/include/linux/rcusync.h b/include/linux/rcusync.h
>>> index 7858491..988ec33 100644
>>> --- a/include/linux/rcusync.h
>>> +++ b/include/linux/rcusync.h
>>> @@ -4,6 +4,8 @@
>>>  #include <linux/wait.h>
>>>  #include <linux/rcupdate.h>
>>>  
>>> +enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
>>> +
>>>  struct rcu_sync_struct {
>>>  	int			gp_state;
>>>  	int			gp_count;
>>> @@ -12,53 +14,37 @@ struct rcu_sync_struct {
>>>  	int			cb_state;
>>>  	struct rcu_head		cb_head;
>>>  
>>> -	void (*sync)(void);
>>> -	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
>>> +	enum rcu_sync_type	gp_type;
>>>  };
>>>  
>>> -#define ___RCU_SYNC_INIT(name)						\
>>> -	.gp_state = 0,							\
>>> -	.gp_count = 0,							\
>>> -	.gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait),		\
>>> -	.cb_state = 0
>>> -
>>> -#define __RCU_SCHED_SYNC_INIT(name) {					\
>>> -	___RCU_SYNC_INIT(name),						\
>>> -	.sync = synchronize_sched,					\
>>> -	.call = call_rcu_sched,						\
>>> -}
>>> -
>>> -#define __RCU_BH_SYNC_INIT(name) {					\
>>> -	___RCU_SYNC_INIT(name),						\
>>> -	.sync = synchronize_rcu_bh,					\
>>> -	.call = call_rcu_bh,						\
>>> -}
>>> -
>>> -#define __RCU_SYNC_INIT(name) {						\
>>> -	___RCU_SYNC_INIT(name),						\
>>> -	.sync = synchronize_rcu,					\
>>> -	.call = call_rcu,						\
>>> -}
>>> -
>>> -#define DEFINE_RCU_SCHED_SYNC(name)					\
>>> -	struct rcu_sync_struct name = __RCU_SCHED_SYNC_INIT(name)
>>> -
>>> -#define DEFINE_RCU_BH_SYNC(name)					\
>>> -	struct rcu_sync_struct name = __RCU_BH_SYNC_INIT(name)
>>> -
>>> -#define DEFINE_RCU_SYNC(name)						\
>>> -	struct rcu_sync_struct name = __RCU_SYNC_INIT(name)
>>> -
>>>  static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
>>>  {
>>>  	return !rss->gp_state; /* GP_IDLE */
>>>  }
>>>  
>>> -enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
>>> -
>>>  extern void rcu_sync_init(struct rcu_sync_struct *, enum rcu_sync_type);
>>>  extern void rcu_sync_enter(struct rcu_sync_struct *);
>>>  extern void rcu_sync_exit(struct rcu_sync_struct *);
>>>  
>>> +#define __RCU_SYNC_INITIALIZER(name, type) {				\
>>> +		.gp_state = 0,						\
>>> +		.gp_count = 0,						\
>>> +		.gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait),	\
>>> +		.cb_state = 0,						\
>>> +		.gp_type = type,					\
>>> +	}
>>> +
>>> +#define	__DEFINE_RCU_SYNC(name, type)	\
>>> +	struct rcu_sync_struct name = __RCU_SYNC_INITIALIZER(name, type)
>>> +
>>> +#define DEFINE_RCU_SYNC(name)		\
>>> +	__DEFINE_RCU_SYNC(name, RCU_SYNC)
>>> +
>>> +#define DEFINE_RCU_SCHED_SYNC(name)	\
>>> +	__DEFINE_RCU_SYNC(name, RCU_SCHED_SYNC)
>>> +
>>> +#define DEFINE_RCU_BH_SYNC(name)	\
>>> +	__DEFINE_RCU_SYNC(name, RCU_BH_SYNC)
>>> +
>>>  #endif /* _LINUX_RCUSYNC_H_ */
>>>  
>>> diff --git a/kernel/rcusync.c b/kernel/rcusync.c
>>> index f84176a..99051b7 100644
>>> --- a/kernel/rcusync.c
>>> +++ b/kernel/rcusync.c
>>> @@ -1,7 +1,24 @@
>>> -
>>>  #include <linux/rcusync.h>
>>>  #include <linux/sched.h>
>>>  
>>> +static const struct {
>>> +	void (*sync)(void);
>>> +	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
>>> +} gp_ops[] = {
>>> +	[RCU_SYNC] = {
>>> +		.sync = synchronize_rcu,
>>> +		.call = call_rcu,
>>> +	},
>>> +	[RCU_SCHED_SYNC] = {
>>> +		.sync = synchronize_sched,
>>> +		.call = call_rcu_sched,
>>> +	},
>>> +	[RCU_BH_SYNC] = {
>>> +		.sync = synchronize_rcu_bh,
>>> +		.call = call_rcu_bh,
>>> +	},
>>> +};
>>> +
>>>  enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
>>>  enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
>>>  
>>> @@ -11,23 +28,7 @@ void rcu_sync_init(struct rcu_sync_struct *rss, enum rcu_sync_type type)
>>>  {
>>>  	memset(rss, 0, sizeof(*rss));
>>>  	init_waitqueue_head(&rss->gp_wait);
>>> -
>>> -	switch (type) {
>>> -	case RCU_SYNC:
>>> -		rss->sync = synchronize_rcu;
>>> -		rss->call = call_rcu;
>>> -		break;
>>> -
>>> -	case RCU_SCHED_SYNC:
>>> -		rss->sync = synchronize_sched;
>>> -		rss->call = call_rcu_sched;
>>> -		break;
>>> -
>>> -	case RCU_BH_SYNC:
>>> -		rss->sync = synchronize_rcu_bh;
>>> -		rss->call = call_rcu_bh;
>>> -		break;
>>> -	}
>>> +	rss->gp_type = type;
>>>  }
>>>  
>>>  void rcu_sync_enter(struct rcu_sync_struct *rss)
>>> @@ -44,7 +45,7 @@ void rcu_sync_enter(struct rcu_sync_struct *rss)
>>>  	BUG_ON(need_wait && need_sync);
>>>  
>>>  	if (need_sync) {
>>> -		rss->sync();
>>> +		gp_ops[rss->gp_type].sync();
>>>  		rss->gp_state = GP_PASSED;
>>>  		wake_up_all(&rss->gp_wait);
>>>  	} else if (need_wait) {
>>> @@ -81,7 +82,7 @@ static void rcu_sync_func(struct rcu_head *rcu)
>>>  		 * to catch a later GP.
>>>  		 */
>>>  		rss->cb_state = CB_PENDING;
>>> -		rss->call(&rss->cb_head, rcu_sync_func);
>>> +		gp_ops[rss->gp_type].call(&rss->cb_head, rcu_sync_func);
>>>  	} else {
>>>  		/*
>>>  		 * We're at least a GP after rcu_sync_exit(); eveybody will now
>>> @@ -99,7 +100,7 @@ void rcu_sync_exit(struct rcu_sync_struct *rss)
>>>  	if (!--rss->gp_count) {
>>>  		if (rss->cb_state == CB_IDLE) {
>>>  			rss->cb_state = CB_PENDING;
>>> -			rss->call(&rss->cb_head, rcu_sync_func);
>>> +			gp_ops[rss->gp_type].call(&rss->cb_head, rcu_sync_func);
>>>  		} else if (rss->cb_state == CB_PENDING) {
>>>  			rss->cb_state = CB_REPLAY;
>>>  		}
>>>
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
>>>
>>
> 


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 4/6] rcusync: Introduce struct rcu_sync_ops
  2013-10-18  1:23       ` Lai Jiangshan
@ 2013-10-18 12:10         ` Oleg Nesterov
  2013-10-20 16:58           ` Paul E. McKenney
  0 siblings, 1 reply; 76+ messages in thread
From: Oleg Nesterov @ 2013-10-18 12:10 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: paulmck, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On 10/18, Lai Jiangshan wrote:
>
> On 10/17/2013 11:42 PM, Paul E. McKenney wrote:
> > On Thu, Oct 17, 2013 at 10:07:15AM +0800, Lai Jiangshan wrote:
> >> On 10/08/2013 06:25 PM, Peter Zijlstra wrote:
> >>> From: Oleg Nesterov <oleg@redhat.com>
> >>>
> >>> Add the new struct rcu_sync_ops which holds sync/call methods, and
> >>> turn the function pointers in rcu_sync_struct into an array of struct
> >>> rcu_sync_ops.
> >>
> >> Hi, Paul
> >>
> >> I think this work should be done in rcupdate.[ch] side by introducing
> >> struct rcu_flavor.
> >
> > I -do- have on my list to add an rcutorture test for rcu_sync, but
> > what do you have in mind by adding struct rcu_flavor?  I am guessing
> > that you do not mean to try to create an rcu_state and a set of
>
> No.
> The thing what I need is just as same as Oleg Nesterov implemented.
> It is just a structure with several function pointers for different RCU variants.
> But it would be better if we implement in rcupdate.[ch],
> and name it to struct rcu_flavor like the URCU.
>
> After we have struct rcu_flavor, we can replace the following code
> to a pointer to struct rcu_flavor.
>
> struct rcu_state:
> 	void (*call)(struct rcu_head *head,	/* call_rcu() flavor. */
> 		     void (*func)(struct rcu_head *head));
>
> struct rcu_torture_ops {
> 	int (*readlock)(void);
> 	void (*readunlock)(int idx);
> 	void (*sync)(void);
> 	void (*exp_sync)(void);
> 	void (*call)(struct rcu_head *head, void (*func)(struct rcu_head *rcu));
> 	void (*cb_barrier)(void);
> };

Yes, probably. But it is not clear how/when this rcu_sync will be merged.
(I hope it will be merged anyway, if nothing else I'll resend these patches
 for percpu_rw_semaphore with other updates in percpu-rwsem.c).

Until then, perhaps you can add rcu_flavor/whatever in rcupdate.* ? Then
rcu_sync can be triviallly updated to use the ops we have in rcupdate.
And rcutorture.c of course.

IOW, I think that this should be a separate change, before or after
rcu_sync.

Oleg.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 4/6] rcusync: Introduce struct rcu_sync_ops
  2013-10-18 12:10         ` Oleg Nesterov
@ 2013-10-20 16:58           ` Paul E. McKenney
  0 siblings, 0 replies; 76+ messages in thread
From: Paul E. McKenney @ 2013-10-20 16:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lai Jiangshan, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Thomas Gleixner, Steven Rostedt, Linus Torvalds,
	linux-kernel

On Fri, Oct 18, 2013 at 02:10:15PM +0200, Oleg Nesterov wrote:
> On 10/18, Lai Jiangshan wrote:
> >
> > On 10/17/2013 11:42 PM, Paul E. McKenney wrote:
> > > On Thu, Oct 17, 2013 at 10:07:15AM +0800, Lai Jiangshan wrote:
> > >> On 10/08/2013 06:25 PM, Peter Zijlstra wrote:
> > >>> From: Oleg Nesterov <oleg@redhat.com>
> > >>>
> > >>> Add the new struct rcu_sync_ops which holds sync/call methods, and
> > >>> turn the function pointers in rcu_sync_struct into an array of struct
> > >>> rcu_sync_ops.
> > >>
> > >> Hi, Paul
> > >>
> > >> I think this work should be done in rcupdate.[ch] side by introducing
> > >> struct rcu_flavor.
> > >
> > > I -do- have on my list to add an rcutorture test for rcu_sync, but
> > > what do you have in mind by adding struct rcu_flavor?  I am guessing
> > > that you do not mean to try to create an rcu_state and a set of
> >
> > No.
> > The thing what I need is just as same as Oleg Nesterov implemented.
> > It is just a structure with several function pointers for different RCU variants.
> > But it would be better if we implement in rcupdate.[ch],
> > and name it to struct rcu_flavor like the URCU.
> >
> > After we have struct rcu_flavor, we can replace the following code
> > to a pointer to struct rcu_flavor.
> >
> > struct rcu_state:
> > 	void (*call)(struct rcu_head *head,	/* call_rcu() flavor. */
> > 		     void (*func)(struct rcu_head *head));
> >
> > struct rcu_torture_ops {
> > 	int (*readlock)(void);
> > 	void (*readunlock)(int idx);
> > 	void (*sync)(void);
> > 	void (*exp_sync)(void);
> > 	void (*call)(struct rcu_head *head, void (*func)(struct rcu_head *rcu));
> > 	void (*cb_barrier)(void);
> > };
> 
> Yes, probably. But it is not clear how/when this rcu_sync will be merged.
> (I hope it will be merged anyway, if nothing else I'll resend these patches
>  for percpu_rw_semaphore with other updates in percpu-rwsem.c).
> 
> Until then, perhaps you can add rcu_flavor/whatever in rcupdate.* ? Then
> rcu_sync can be triviallly updated to use the ops we have in rcupdate.
> And rcutorture.c of course.
> 
> IOW, I think that this should be a separate change, before or after
> rcu_sync.

Perhaps put it in the shiny new kernel/rcu directory in -tip?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-11 12:45 ` Steven Rostedt
@ 2013-10-11 13:51   ` George Spelvin
  0 siblings, 0 replies; 76+ messages in thread
From: George Spelvin @ 2013-10-11 13:51 UTC (permalink / raw)
  To: linux, rostedt; +Cc: linux-kernel, paulmck, peterz, srivatsa.bhat, torvalds

> The problem is that the scheduler doesn't see that the current task has
> preemption disabled. It only looks at the priorities of the current
> task, and if it can preempt it, it will. It sets the NEED_RESCHED to the
> current task and waits for the preemption to schedule it out.

Ah, got it.  It's a priority inversion problem, and while there is a
standard solution (priority inheritance), if I were to suggest such a
heavyweight solution, Linus would start yelling again and Sarah would
get unhappy, and we all know how that ends up.

Thank you.  I will now cease displaying my ignorance in public.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
  2013-10-11 12:14 George Spelvin
@ 2013-10-11 12:45 ` Steven Rostedt
  2013-10-11 13:51   ` George Spelvin
  0 siblings, 1 reply; 76+ messages in thread
From: Steven Rostedt @ 2013-10-11 12:45 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-kernel, paulmck, peterz, srivatsa.bhat, torvalds

On 11 Oct 2013 08:14:57 -0400
"George Spelvin" <linux@horizon.com> wrote:

> > There's places in the kernel that does for_each_cpu() that I'm sure you
> > don't want to disable preemption for. Especially when you start having
> > 4096 CPU machines!
> 
> Er... why not?
> 
> Seriously.  If I have 4096 processors, and preemption is disabled on
> *one* of them for a long time, can't an urgent task just find a different
> processor to preempt?

The problem is that the scheduler doesn't see that the current task has
preemption disabled. It only looks at the priorities of the current
task, and if it can preempt it, it will. It sets the NEED_RESCHED to the
current task and waits for the preemption to schedule it out.

If an RT task is woken on a CPU that's running a SCHED_OTHER task
that's doing this loop. It will have to wait for the loop to finish
before it can schedule in. After doing the wake up, the scheduling
algorithm is happy that it got an RT task on a CPU quickly, when in
reality it has to wait a long time till preemption is enabled again.

-- Steve



> 
> This seems like a non-problem.  Or am I misunderstanding something about
> processor affinity?


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Optimize the cpu hotplug locking -v2
@ 2013-10-11 12:14 George Spelvin
  2013-10-11 12:45 ` Steven Rostedt
  0 siblings, 1 reply; 76+ messages in thread
From: George Spelvin @ 2013-10-11 12:14 UTC (permalink / raw)
  To: rostedt; +Cc: linux, linux-kernel, paulmck, peterz, srivatsa.bhat, torvalds

> There's places in the kernel that does for_each_cpu() that I'm sure you
> don't want to disable preemption for. Especially when you start having
> 4096 CPU machines!

Er... why not?

Seriously.  If I have 4096 processors, and preemption is disabled on
*one* of them for a long time, can't an urgent task just find a different
processor to preempt?

This seems like a non-problem.  Or am I misunderstanding something about
processor affinity?

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2013-10-20 16:58 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-08 10:25 [PATCH 0/6] Optimize the cpu hotplug locking -v2 Peter Zijlstra
2013-10-08 10:25 ` [PATCH 1/6] hotplug: Optimize {get,put}_online_cpus() Peter Zijlstra
2013-10-08 15:08   ` Rik van Riel
2013-10-10  5:47   ` Andrew Morton
2013-10-10 11:06     ` Oleg Nesterov
2013-10-10 14:55       ` Paul E. McKenney
2013-10-08 10:25 ` [PATCH 2/6] rcu: Create rcu_sync infrastructure Peter Zijlstra
2013-10-08 20:40   ` Jonathan Corbet
2013-10-09 19:52     ` Peter Zijlstra
2013-10-17  2:56   ` Lai Jiangshan
2013-10-17 10:36   ` Srikar Dronamraju
2013-10-08 10:25 ` [PATCH 3/6] hotplug: Optimize cpu_hotplug_{begin,done}() using rcu_sync Peter Zijlstra
2013-10-08 16:28   ` Paul E. McKenney
2013-10-08 10:25 ` [PATCH 4/6] rcusync: Introduce struct rcu_sync_ops Peter Zijlstra
2013-10-08 16:30   ` Paul E. McKenney
2013-10-17  2:07   ` Lai Jiangshan
     [not found]     ` <20131017154228.GL4553@linux.vnet.ibm.com>
2013-10-18  1:23       ` Lai Jiangshan
2013-10-18 12:10         ` Oleg Nesterov
2013-10-20 16:58           ` Paul E. McKenney
2013-10-08 10:25 ` [PATCH 5/6] rcusync: Add the CONFIG_PROVE_RCU checks Peter Zijlstra
2013-10-08 16:30   ` Paul E. McKenney
2013-10-08 10:25 ` [PATCH 6/6] rcusync: Introduce rcu_sync_dtor() Peter Zijlstra
2013-10-08 16:32   ` Paul E. McKenney
2013-10-08 15:27 ` [PATCH 0/6] Optimize the cpu hotplug locking -v2 Oleg Nesterov
2013-10-08 15:38   ` Peter Zijlstra
2013-10-10  5:50 ` Andrew Morton
2013-10-10  6:27   ` Ingo Molnar
2013-10-10  6:34     ` Andrew Morton
2013-10-10  7:27       ` Ingo Molnar
2013-10-10  7:33         ` Andrew Morton
2013-10-10  7:45           ` Ingo Molnar
2013-10-10 12:19   ` Peter Zijlstra
2013-10-10 14:57     ` Ingo Molnar
2013-10-10 15:21       ` Peter Zijlstra
2013-10-10 15:36         ` Oleg Nesterov
2013-10-10 16:50         ` Ingo Molnar
2013-10-10 17:13           ` Paul E. McKenney
2013-10-10 17:35             ` Ingo Molnar
2013-10-10 18:35             ` Peter Zijlstra
2013-10-10 15:26       ` Oleg Nesterov
2013-10-10 16:00         ` Andrew Morton
2013-10-10 16:36           ` Steven Rostedt
2013-10-10 16:43             ` Andrew Morton
2013-10-10 16:53               ` Peter Zijlstra
2013-10-10 17:13                 ` Steven Rostedt
2013-10-10 17:48                   ` Andrew Morton
2013-10-10 18:10                     ` Linus Torvalds
2013-10-10 18:43                       ` Steven Rostedt
2013-10-10 18:50                         ` Peter Zijlstra
2013-10-10 19:15                           ` Paul E. McKenney
2013-10-10 19:00                         ` Linus Torvalds
2013-10-10 18:46                       ` Peter Zijlstra
2013-10-10 18:34                     ` Peter Zijlstra
2013-10-10 18:49                       ` Linus Torvalds
2013-10-10 19:04                         ` Steven Rostedt
2013-10-10 19:16                           ` Linus Torvalds
2013-10-10 19:34                             ` Peter Zijlstra
2013-10-10 19:34                             ` Steven Rostedt
2013-10-11  6:09                             ` Ingo Molnar
2013-10-11 12:38                         ` Peter Zijlstra
2013-10-11 18:25                           ` Oleg Nesterov
2013-10-11 20:48                             ` Peter Zijlstra
2013-10-12 17:06                               ` Oleg Nesterov
2013-10-14  9:05                                 ` Peter Zijlstra
2013-10-14  9:23                                   ` Paul E. McKenney
2013-10-15  1:01                                     ` Paul E. McKenney
2013-10-17 16:49                           ` [tip:sched/core] sched: Remove get_online_cpus() usage tip-bot for Peter Zijlstra
2013-10-10 17:39                 ` [PATCH 0/6] Optimize the cpu hotplug locking -v2 Oleg Nesterov
2013-10-10 16:52           ` Ingo Molnar
2013-10-10 17:44             ` Paul E. McKenney
2013-10-10 16:54           ` Oleg Nesterov
2013-10-10 19:04             ` Srivatsa S. Bhat
2013-10-10 18:52         ` Srivatsa S. Bhat
2013-10-11 12:14 George Spelvin
2013-10-11 12:45 ` Steven Rostedt
2013-10-11 13:51   ` George Spelvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).