* [RFC PATCH v4 0/9] CPU hotplug: stop_machine()-free CPU hotplug
@ 2012-12-11 14:03 Srivatsa S. Bhat
  2012-12-11 14:04 ` [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context Srivatsa S. Bhat
                   ` (8 more replies)
  0 siblings, 9 replies; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-11 14:03 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

Hi,

This patchset removes CPU hotplug's dependence on stop_machine() from the CPU
offline path and provides an alternative set of APIs to preempt_disable(), which
can be invoked from atomic context, to prevent CPUs from going offline.

This is an RFC patchset with only a few call-sites of preempt_disable()
converted to the new APIs for now. The main goal is to get feedback on the
design of the new atomic APIs and to see whether they are a viable foundation
for stop_machine()-free CPU hotplug. A brief description of the algorithm is
available in the "Changes in vN" sections below.

Overview of the patches:
-----------------------

Patch 1 introduces the new APIs that can be used from atomic context, to
prevent CPUs from going offline.

Patch 2 is a cleanup; it converts preprocessor macros to static inline
functions.

Patches 3 to 8 convert various call-sites to use the new APIs.

Patch 9 is the one which actually removes stop_machine() from the CPU
offline path.

Changes in v4:
--------------
  The synchronization scheme has been simplified quite a bit compared to the
  previous versions. Some highlights:

* Implicit ACKs:

  The earlier design required the readers to explicitly ACK the writer's
  signal. The new design uses implicit ACKs instead: a reader switching
  over to the rwlock implicitly tells the writer to stop waiting for it.

* No atomic operations:

  Since we got rid of explicit ACKs, we no longer have the need for a reader
  and a writer to update the same counter. So we can get rid of atomic ops
  too.

Changes in v3:
--------------
* Dropped the _light() and _full() variants of the APIs. Provided a single
  interface: get/put_online_cpus_atomic().

* Completely redesigned the synchronization mechanism again, to make it
  fast and scalable at the reader-side in the fast-path (when no hotplug
  writers are active). This new scheme also ensures that there is no
  possibility of deadlocks due to circular locking dependency.
  In summary, this provides the scalability and speed of per-cpu rwlocks
  (without actually using them), while avoiding the downside (deadlock
  possibilities) which is inherent in any per-cpu locking scheme that is
  meant to compete with preempt_disable()/enable() in terms of flexibility.

  The problem with using per-cpu locking to replace preempt_disable()/enable()
  was explained here:
  https://lkml.org/lkml/2012/12/6/290

  Basically we use per-cpu counters (for scalability) when no writers are
  active, and then switch to global rwlocks (for lock-safety) when a writer
  becomes active. It is a slightly complex scheme, but it is based on
  standard principles of distributed algorithms.
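
  In simplified pseudo-code (omitting the IRQ handling and the race
  re-checks that the actual implementation in patch 1 performs), the
  reader side looks roughly like this:

	get_online_cpus_atomic()
	{
		preempt_disable();
		this_cpu_inc(reader_percpu_refcnt);

		/* Switch only at the start of a fresh, non-nested section */
		if (!reader_nested_percpu() && writer_active()) {
			read_lock(&hotplug_rwlock);
			this_cpu_dec(reader_percpu_refcnt);
		}
	}

  and the writer, before taking the CPU down, sets writer_signal, waits
  until every CPU's reader_percpu_refcnt drops to zero, and only then
  takes hotplug_rwlock for write.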

Changes in v2:
-------------
* Completely redesigned the synchronization scheme to avoid using any extra
  cpumasks.

* Provided APIs for 2 types of atomic hotplug readers: "light" (for
  light-weight) and "full". We wish to have more "light" readers than
  the "full" ones, to avoid indirectly inducing the "stop_machine effect"
  without even actually using stop_machine().

  And the patches show that it _is_ generally true: 5 patches deal with
  "light" readers, whereas only 1 patch deals with a "full" reader.

  Also, the "light" readers happen to be in very hot paths. So it makes a
  lot of sense to have such a distinction and a corresponding light-weight
  API.

Links to previous versions:
v3: https://lkml.org/lkml/2012/12/7/287
v2: https://lkml.org/lkml/2012/12/5/322
v1: https://lkml.org/lkml/2012/12/4/88

Comments and suggestions welcome!

--
 Paul E. McKenney (1):
      cpu: No more __stop_machine() in _cpu_down()

Srivatsa S. Bhat (8):
      CPU hotplug: Provide APIs to prevent CPU offline from atomic context
      CPU hotplug: Convert preprocessor macros to static inline functions
      smp, cpu hotplug: Fix smp_call_function_*() to prevent CPU offline properly
      smp, cpu hotplug: Fix on_each_cpu_*() to prevent CPU offline properly
      sched, cpu hotplug: Use stable online cpus in try_to_wake_up() & select_task_rq()
      kick_process(), cpu-hotplug: Prevent offlining of target CPU properly
      yield_to(), cpu-hotplug: Prevent offlining of other CPUs properly
      kvm, vmx: Add atomic synchronization with CPU Hotplug


  arch/x86/kvm/vmx.c  |    8 +-
 include/linux/cpu.h |    8 +-
 kernel/cpu.c        |  206 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/core.c |   22 +++++
 kernel/smp.c        |   63 ++++++++++------
 5 files changed, 273 insertions(+), 34 deletions(-)



Thanks,
Srivatsa S. Bhat
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-11 14:03 [RFC PATCH v4 0/9] CPU hotplug: stop_machine()-free CPU hotplug Srivatsa S. Bhat
@ 2012-12-11 14:04 ` Srivatsa S. Bhat
  2012-12-12 17:17   ` Oleg Nesterov
  2012-12-11 14:04 ` [RFC PATCH v4 2/9] CPU hotplug: Convert preprocessor macros to static inline functions Srivatsa S. Bhat
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-11 14:04 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

There are places where preempt_disable() is used to prevent any CPU from
going offline during the critical section. Let us call them "atomic
hotplug readers" ("atomic" because they run in atomic contexts).

Today, preempt_disable() works because the writer uses stop_machine().
But once stop_machine() is gone, the readers won't be able to prevent
CPUs from going offline using preempt_disable().

The intent of this patch is to provide synchronization APIs for such
atomic hotplug readers, to prevent (any) CPUs from going offline, without
depending on stop_machine() at the writer-side. The new APIs will look
something like this:  get/put_online_cpus_atomic()

Some important design requirements and considerations:
-----------------------------------------------------

1. Scalable synchronization at the reader-side, especially in the fast-path

   Any synchronization at the atomic hotplug readers' side must be highly
   scalable - avoid global single-holder locks/counters etc. - because these
   paths currently use the extremely fast preempt_disable(). Our replacement
   for preempt_disable() should not become ridiculously costly, and should
   not needlessly serialize the readers among themselves.

   At a minimum, the new APIs must be extremely fast at the reader side in
   the fast-path, i.e., when no CPU offline writers are active.

2. preempt_disable() was recursive. The replacement should also be recursive.

3. No (new) lock-ordering restrictions

   preempt_disable() was super-flexible. It didn't impose any ordering
   restrictions or rules for nesting. Our replacement should also be equally
   flexible and usable.

4. No deadlock possibilities

   Regular per-cpu locking is not the way to go if we want to have relaxed
   rules for lock-ordering, because we can end up with circular locking
   dependencies, as explained in https://lkml.org/lkml/2012/12/6/290

   So, avoid the usual per-cpu locking schemes (per-cpu locks/per-cpu atomic
   counters with spin-on-contention etc) as much as possible.


Implementation of the design:
----------------------------

We use global rwlocks for synchronization, because then we won't get into
lock-ordering related problems (unlike per-cpu locks). However, global
rwlocks lead to unnecessary cache-line bouncing even when there are no
hotplug writers present, which can slow down the system needlessly.

Per-cpu counters can help solve the cache-line bouncing problem. So we
actually use the best of both: per-cpu counters (no-waiting) at the reader
side in the fast-path, and global rwlocks in the slowpath.

[ Fastpath = no writer is active; Slowpath = a writer is active ]

IOW, the hotplug readers just increment/decrement their per-cpu refcounts
when no writer is active. When a writer becomes active, it signals all
readers to switch to the global rwlock for the duration of the CPU offline
operation. The readers switch over when it is safe for them (i.e., when they
are about to start a fresh, non-nested read-side critical section) and
start using (holding) the global rwlock for read in their subsequent critical
sections.

The hotplug writer waits for every reader to switch, and then acquires
the global rwlock for write and takes the CPU offline. Then the writer
signals all readers that the CPU offline is done, and that they can go back
to using their per-cpu refcounts again.

Note that the lock-safety (despite the per-cpu scheme) comes from the fact
that the readers can *choose* _when_ to switch to rwlocks upon the writer's
signal, and that the readers never wait on anybody based on the per-cpu
counters. The only true synchronization that involves waiting at the
reader-side in this scheme is the one arising from the global rwlock, which
is safe from the circular locking dependency problems mentioned above
(because it is global).

Reader-writer locks and per-cpu counters are recursive, so they can be
used in a nested fashion in the reader path. Also, this design of switching
the synchronization scheme ensures that you can safely nest and call these
APIs in any way you want, just like preempt_disable()/enable().

Together, these satisfy all the requirements mentioned above.
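
For illustration, a (hypothetical) atomic hotplug reader would use the new
APIs just like preempt_disable()/enable(), even from IRQ context and with
nesting (do_something() below is just a placeholder, not from this patch):

	/* Hypothetical usage example */
	get_online_cpus_atomic();

	for_each_online_cpu(cpu) {
		/*
		 * 'cpu' cannot go offline in this section, and nesting
		 * another get/put_online_cpus_atomic() in here (e.g. from
		 * an interrupt handler) is perfectly fine.
		 */
		do_something(cpu);
	}

	put_online_cpus_atomic();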

I'm indebted to Michael Wang and Xiao Guangrong for their numerous thoughtful
suggestions and ideas, which inspired and influenced many of the decisions in
this as well as previous designs. Thanks a lot Michael and Xiao!

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/cpu.h |    4 +
 kernel/cpu.c        |  204 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 205 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index ce7a074..cf24da1 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -175,6 +175,8 @@ extern struct bus_type cpu_subsys;
 
 extern void get_online_cpus(void);
 extern void put_online_cpus(void);
+extern void get_online_cpus_atomic(void);
+extern void put_online_cpus_atomic(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
 #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
@@ -198,6 +200,8 @@ static inline void cpu_hotplug_driver_unlock(void)
 
 #define get_online_cpus()	do { } while (0)
 #define put_online_cpus()	do { } while (0)
+#define get_online_cpus_atomic()	do { } while (0)
+#define put_online_cpus_atomic()	do { } while (0)
 #define hotcpu_notifier(fn, pri)	do { (void)(fn); } while (0)
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 42bd331..5a63296 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -133,6 +133,119 @@ static void cpu_hotplug_done(void)
 	mutex_unlock(&cpu_hotplug.lock);
 }
 
+/*
+ * Reader-writer lock to synchronize between atomic hotplug readers
+ * and the CPU offline hotplug writer.
+ */
+static DEFINE_RWLOCK(hotplug_rwlock);
+
+static DEFINE_PER_CPU(int, reader_percpu_refcnt);
+static DEFINE_PER_CPU(bool, writer_signal);
+
+
+#define reader_uses_percpu_refcnt(cpu)					\
+			(ACCESS_ONCE(per_cpu(reader_percpu_refcnt, cpu)))
+
+#define reader_nested_percpu()						\
+				(__this_cpu_read(reader_percpu_refcnt) > 1)
+
+#define writer_active()							\
+				(__this_cpu_read(writer_signal))
+
+
+
+/*
+ * Invoked by hotplug reader, to prevent CPUs from going offline.
+ *
+ * If there are no CPU offline writers active, just increment the
+ * per-cpu counter 'reader_percpu_refcnt' and proceed.
+ *
+ * If a CPU offline hotplug writer is active, we'll need to switch from
+ * per-cpu refcounts to the global rwlock, when the time is right.
+ *
+ * It is not safe to switch the synchronization scheme when we are
+ * already in a read-side critical section which uses per-cpu refcounts.
+ * Also, we don't want to allow heterogeneous readers to nest inside
+ * each other, to avoid complications in put_online_cpus_atomic().
+ *
+ * Once you switch, keep using the rwlocks for synchronization, until
+ * the writer signals the end of CPU offline.
+ *
+ * You can call this recursively, without fear of locking problems.
+ *
+ * Returns with preemption disabled.
+ */
+void get_online_cpus_atomic(void)
+{
+	unsigned long flags;
+
+	preempt_disable();
+
+	if (cpu_hotplug.active_writer == current)
+		return;
+
+	local_irq_save(flags);
+
+	/*
+	 * Use the percpu refcounts by default. Switch over to rwlock (if
+	 * necessary) later on. This helps avoid several race conditions
+	 * as well.
+	 */
+	__this_cpu_inc(reader_percpu_refcnt);
+
+	smp_rmb(); /* Paired with smp_mb() in announce_cpu_offline_begin(). */
+
+	/*
+	 * We must not allow heterogeneous nesting of readers (ie., readers
+	 * using percpu refcounts to nest with readers using rwlocks).
+	 * So don't switch the synchronization scheme if we are currently
+	 * using percpu refcounts.
+	 */
+	if (!reader_nested_percpu() && unlikely(writer_active())) {
+
+		read_lock(&hotplug_rwlock);
+
+		/*
+		 * We might have raced with a writer going inactive before we
+		 * took the read-lock. So re-evaluate whether we still need to
+		 * use the rwlock or if we can switch back to percpu refcounts.
+		 * (This also helps avoid heterogeneous nesting of readers).
+		 */
+		if (writer_active())
+			__this_cpu_dec(reader_percpu_refcnt);
+		else
+			read_unlock(&hotplug_rwlock);
+	}
+
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(get_online_cpus_atomic);
+
+void put_online_cpus_atomic(void)
+{
+	unsigned long flags;
+
+	if (cpu_hotplug.active_writer == current)
+		goto out;
+
+	local_irq_save(flags);
+
+	/*
+	 * We never allow heterogeneous nesting of readers. So it is trivial
+	 * to find out the kind of reader we are, and undo the operation
+	 * done by our corresponding get_online_cpus_atomic().
+	 */
+	if (__this_cpu_read(reader_percpu_refcnt))
+		__this_cpu_dec(reader_percpu_refcnt);
+	else
+		read_unlock(&hotplug_rwlock);
+
+	local_irq_restore(flags);
+out:
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(put_online_cpus_atomic);
+
 #else /* #if CONFIG_HOTPLUG_CPU */
 static void cpu_hotplug_begin(void) {}
 static void cpu_hotplug_done(void) {}
@@ -237,6 +350,61 @@ static inline void check_for_tasks(int cpu)
 	write_unlock_irq(&tasklist_lock);
 }
 
+static inline void raise_writer_signal(unsigned int cpu)
+{
+	per_cpu(writer_signal, cpu) = true;
+}
+
+static inline void drop_writer_signal(unsigned int cpu)
+{
+	per_cpu(writer_signal, cpu) = false;
+}
+
+static void announce_cpu_offline_begin(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		raise_writer_signal(cpu);
+
+	smp_mb();
+}
+
+static void announce_cpu_offline_end(unsigned int dead_cpu)
+{
+	unsigned int cpu;
+
+	drop_writer_signal(dead_cpu);
+
+	for_each_online_cpu(cpu)
+		drop_writer_signal(cpu);
+
+	smp_mb();
+}
+
+/*
+ * Wait for the reader to see the writer's signal and switch from percpu
+ * refcounts to global rwlock.
+ *
+ * If the reader is still using percpu refcounts, wait for him to switch.
+ * Else, we can safely go ahead, because either the reader has already
+ * switched over, or the next atomic hotplug reader who comes along on this
+ * CPU will notice the writer's signal and will switch over to the rwlock.
+ */
+static inline void sync_atomic_reader(unsigned int cpu)
+{
+	while (reader_uses_percpu_refcnt(cpu))
+		cpu_relax();
+}
+
+static void sync_all_readers(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		sync_atomic_reader(cpu);
+}
+
 struct take_cpu_down_param {
 	unsigned long mod;
 	void *hcpu;
@@ -246,15 +414,45 @@ struct take_cpu_down_param {
 static int __ref take_cpu_down(void *_param)
 {
 	struct take_cpu_down_param *param = _param;
-	int err;
+	unsigned long flags;
+	unsigned int cpu = (long)(param->hcpu);
+	int err = 0;
+
+
+	/*
+	 * Inform all atomic readers that we are going to offline a CPU
+	 * and that they need to switch from per-cpu refcounts to the
+	 * global hotplug_rwlock.
+	 */
+	announce_cpu_offline_begin();
+
+	/* Wait for every reader to notice the announcement and switch over */
+	sync_all_readers();
+
+	/*
+	 * Now all the readers have switched to using the global hotplug_rwlock.
+	 * So now is our chance, go bring down the CPU!
+	 */
+
+	write_lock_irqsave(&hotplug_rwlock, flags);
 
 	/* Ensure this CPU doesn't handle any more interrupts. */
 	err = __cpu_disable();
 	if (err < 0)
-		return err;
+		goto out;
 
 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
-	return 0;
+
+out:
+	/*
+	 * Inform all atomic readers that we are done with the CPU offline
+	 * operation, so that they can switch back to their per-cpu refcounts.
+	 * (We don't need to wait for them to see it).
+	 */
+	announce_cpu_offline_end(cpu);
+
+	write_unlock_irqrestore(&hotplug_rwlock, flags);
+	return err;
 }
 
 /* Requires cpu_add_remove_lock to be held */


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH v4 2/9] CPU hotplug: Convert preprocessor macros to static inline functions
  2012-12-11 14:03 [RFC PATCH v4 0/9] CPU hotplug: stop_machine()-free CPU hotplug Srivatsa S. Bhat
  2012-12-11 14:04 ` [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context Srivatsa S. Bhat
@ 2012-12-11 14:04 ` Srivatsa S. Bhat
  2012-12-11 14:04 ` [RFC PATCH v4 3/9] smp, cpu hotplug: Fix smp_call_function_*() to prevent CPU offline properly Srivatsa S. Bhat
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-11 14:04 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/05/2012 06:10 AM, Andrew Morton wrote:
"static inline C functions would be preferred if possible.  Feel free to
fix up the wrong crufty surrounding code as well ;-)"

Convert the macros in the CPU hotplug code to static inline C functions.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/cpu.h |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index cf24da1..eb79f47 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -198,10 +198,10 @@ static inline void cpu_hotplug_driver_unlock(void)
 
 #else		/* CONFIG_HOTPLUG_CPU */
 
-#define get_online_cpus()	do { } while (0)
-#define put_online_cpus()	do { } while (0)
-#define get_online_cpus_atomic()	do { } while (0)
-#define put_online_cpus_atomic()	do { } while (0)
+static inline void get_online_cpus(void) {}
+static inline void put_online_cpus(void) {}
+static inline void get_online_cpus_atomic(void) {}
+static inline void put_online_cpus_atomic(void) {}
 #define hotcpu_notifier(fn, pri)	do { (void)(fn); } while (0)
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH v4 3/9] smp, cpu hotplug: Fix smp_call_function_*() to prevent CPU offline properly
  2012-12-11 14:03 [RFC PATCH v4 0/9] CPU hotplug: stop_machine()-free CPU hotplug Srivatsa S. Bhat
  2012-12-11 14:04 ` [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context Srivatsa S. Bhat
  2012-12-11 14:04 ` [RFC PATCH v4 2/9] CPU hotplug: Convert preprocessor macros to static inline functions Srivatsa S. Bhat
@ 2012-12-11 14:04 ` Srivatsa S. Bhat
  2012-12-11 14:04 ` [RFC PATCH v4 4/9] smp, cpu hotplug: Fix on_each_cpu_*() " Srivatsa S. Bhat
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-11 14:04 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

Once stop_machine() is gone from the CPU offline path, we won't be able to
depend on preempt_disable() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_atomic() APIs, which can be invoked from atomic
context, to prevent CPUs from going offline.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/smp.c |   38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 29dd40a..ce1a866 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -310,7 +310,8 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 	 * prevent preemption and reschedule on another processor,
 	 * as well as CPU removal
 	 */
-	this_cpu = get_cpu();
+	get_online_cpus_atomic();
+	this_cpu = smp_processor_id();
 
 	/*
 	 * Can deadlock when called with interrupts disabled.
@@ -342,7 +343,7 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 		}
 	}
 
-	put_cpu();
+	put_online_cpus_atomic();
 
 	return err;
 }
@@ -371,8 +372,10 @@ int smp_call_function_any(const struct cpumask *mask,
 	const struct cpumask *nodemask;
 	int ret;
 
+	get_online_cpus_atomic();
 	/* Try for same CPU (cheapest) */
-	cpu = get_cpu();
+	cpu = smp_processor_id();
+
 	if (cpumask_test_cpu(cpu, mask))
 		goto call;
 
@@ -388,7 +391,7 @@ int smp_call_function_any(const struct cpumask *mask,
 	cpu = cpumask_any_and(mask, cpu_online_mask);
 call:
 	ret = smp_call_function_single(cpu, func, info, wait);
-	put_cpu();
+	put_online_cpus_atomic();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(smp_call_function_any);
@@ -409,14 +412,17 @@ void __smp_call_function_single(int cpu, struct call_single_data *data,
 	unsigned int this_cpu;
 	unsigned long flags;
 
-	this_cpu = get_cpu();
+	get_online_cpus_atomic();
+
+	this_cpu = smp_processor_id();
+
 	/*
 	 * Can deadlock when called with interrupts disabled.
 	 * We allow cpu's that are not yet online though, as no one else can
 	 * send smp call function interrupt to this cpu and as such deadlocks
 	 * can't happen.
 	 */
-	WARN_ON_ONCE(cpu_online(smp_processor_id()) && wait && irqs_disabled()
+	WARN_ON_ONCE(cpu_online(this_cpu) && wait && irqs_disabled()
 		     && !oops_in_progress);
 
 	if (cpu == this_cpu) {
@@ -427,7 +433,7 @@ void __smp_call_function_single(int cpu, struct call_single_data *data,
 		csd_lock(data);
 		generic_exec_single(cpu, data, wait);
 	}
-	put_cpu();
+	put_online_cpus_atomic();
 }
 
 /**
@@ -451,6 +457,8 @@ void smp_call_function_many(const struct cpumask *mask,
 	unsigned long flags;
 	int refs, cpu, next_cpu, this_cpu = smp_processor_id();
 
+	get_online_cpus_atomic();
+
 	/*
 	 * Can deadlock when called with interrupts disabled.
 	 * We allow cpu's that are not yet online though, as no one else can
@@ -467,17 +475,18 @@ void smp_call_function_many(const struct cpumask *mask,
 
 	/* No online cpus?  We're done. */
 	if (cpu >= nr_cpu_ids)
-		return;
+		goto out_unlock;
 
 	/* Do we have another CPU which isn't us? */
 	next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
 	if (next_cpu == this_cpu)
-		next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
+		next_cpu = cpumask_next_and(next_cpu, mask,
+						cpu_online_mask);
 
 	/* Fastpath: do that cpu by itself. */
 	if (next_cpu >= nr_cpu_ids) {
 		smp_call_function_single(cpu, func, info, wait);
-		return;
+		goto out_unlock;
 	}
 
 	data = &__get_cpu_var(cfd_data);
@@ -523,7 +532,7 @@ void smp_call_function_many(const struct cpumask *mask,
 	/* Some callers race with other cpus changing the passed mask */
 	if (unlikely(!refs)) {
 		csd_unlock(&data->csd);
-		return;
+		goto out_unlock;
 	}
 
 	raw_spin_lock_irqsave(&call_function.lock, flags);
@@ -554,6 +563,9 @@ void smp_call_function_many(const struct cpumask *mask,
 	/* Optionally wait for the CPUs to complete */
 	if (wait)
 		csd_lock_wait(&data->csd);
+
+out_unlock:
+	put_online_cpus_atomic();
 }
 EXPORT_SYMBOL(smp_call_function_many);
 
@@ -574,9 +586,9 @@ EXPORT_SYMBOL(smp_call_function_many);
  */
 int smp_call_function(smp_call_func_t func, void *info, int wait)
 {
-	preempt_disable();
+	get_online_cpus_atomic();
 	smp_call_function_many(cpu_online_mask, func, info, wait);
-	preempt_enable();
+	put_online_cpus_atomic();
 
 	return 0;
 }


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH v4 4/9] smp, cpu hotplug: Fix on_each_cpu_*() to prevent CPU offline properly
  2012-12-11 14:03 [RFC PATCH v4 0/9] CPU hotplug: stop_machine()-free CPU hotplug Srivatsa S. Bhat
                   ` (2 preceding siblings ...)
  2012-12-11 14:04 ` [RFC PATCH v4 3/9] smp, cpu hotplug: Fix smp_call_function_*() to prevent CPU offline properly Srivatsa S. Bhat
@ 2012-12-11 14:04 ` Srivatsa S. Bhat
  2012-12-11 14:05 ` [RFC PATCH v4 5/9] sched, cpu hotplug: Use stable online cpus in try_to_wake_up() & select_task_rq() Srivatsa S. Bhat
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-11 14:04 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

Once stop_machine() is gone from the CPU offline path, we won't be able to
depend on preempt_disable() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_atomic() APIs, which can be invoked from atomic
context, to prevent CPUs from going offline.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/smp.c |   25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index ce1a866..0031000 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -688,12 +688,12 @@ int on_each_cpu(void (*func) (void *info), void *info, int wait)
 	unsigned long flags;
 	int ret = 0;
 
-	preempt_disable();
+	get_online_cpus_atomic();
 	ret = smp_call_function(func, info, wait);
 	local_irq_save(flags);
 	func(info);
 	local_irq_restore(flags);
-	preempt_enable();
+	put_online_cpus_atomic();
 	return ret;
 }
 EXPORT_SYMBOL(on_each_cpu);
@@ -715,7 +715,11 @@ EXPORT_SYMBOL(on_each_cpu);
 void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
 			void *info, bool wait)
 {
-	int cpu = get_cpu();
+	int cpu;
+
+	get_online_cpus_atomic();
+
+	cpu = smp_processor_id();
 
 	smp_call_function_many(mask, func, info, wait);
 	if (cpumask_test_cpu(cpu, mask)) {
@@ -723,7 +727,7 @@ void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
 		func(info);
 		local_irq_enable();
 	}
-	put_cpu();
+	put_online_cpus_atomic();
 }
 EXPORT_SYMBOL(on_each_cpu_mask);
 
@@ -748,8 +752,9 @@ EXPORT_SYMBOL(on_each_cpu_mask);
  * The function might sleep if the GFP flags indicates a non
  * atomic allocation is allowed.
  *
- * Preemption is disabled to protect against CPUs going offline but not online.
- * CPUs going online during the call will not be seen or sent an IPI.
+ * We use get/put_online_cpus_atomic() to prevent CPUs from going
+ * offline in-between our operation. CPUs coming online during the
+ * call will not be seen or sent an IPI.
  *
  * You must not call this function with disabled interrupts or
  * from a hardware interrupt handler or from a bottom half handler.
@@ -764,26 +769,26 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 	might_sleep_if(gfp_flags & __GFP_WAIT);
 
 	if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
-		preempt_disable();
+		get_online_cpus_atomic();
 		for_each_online_cpu(cpu)
 			if (cond_func(cpu, info))
 				cpumask_set_cpu(cpu, cpus);
 		on_each_cpu_mask(cpus, func, info, wait);
-		preempt_enable();
+		put_online_cpus_atomic();
 		free_cpumask_var(cpus);
 	} else {
 		/*
 		 * No free cpumask, bother. No matter, we'll
 		 * just have to IPI them one by one.
 		 */
-		preempt_disable();
+		get_online_cpus_atomic();
 		for_each_online_cpu(cpu)
 			if (cond_func(cpu, info)) {
 				ret = smp_call_function_single(cpu, func,
 								info, wait);
 				WARN_ON_ONCE(!ret);
 			}
-		preempt_enable();
+		put_online_cpus_atomic();
 	}
 }
 EXPORT_SYMBOL(on_each_cpu_cond);


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH v4 5/9] sched, cpu hotplug: Use stable online cpus in try_to_wake_up() & select_task_rq()
  2012-12-11 14:03 [RFC PATCH v4 0/9] CPU hotplug: stop_machine()-free CPU hotplug Srivatsa S. Bhat
                   ` (3 preceding siblings ...)
  2012-12-11 14:04 ` [RFC PATCH v4 4/9] smp, cpu hotplug: Fix on_each_cpu_*() " Srivatsa S. Bhat
@ 2012-12-11 14:05 ` Srivatsa S. Bhat
  2012-12-11 14:05 ` [RFC PATCH v4 6/9] kick_process(), cpu-hotplug: Prevent offlining of target CPU properly Srivatsa S. Bhat
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-11 14:05 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

Once stop_machine() is gone from the CPU offline path, we won't be able to
depend on preempt_disable() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_atomic() APIs, which can be invoked from atomic
context, to prevent CPUs from going offline.

Scheduler functions such as try_to_wake_up() and select_task_rq() (and even
select_fallback_rq()) deal with picking new CPUs to run tasks. So they need
to synchronize with CPU offline operations.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/sched/core.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..f51e0aa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1103,6 +1103,10 @@ EXPORT_SYMBOL_GPL(kick_process);
 #ifdef CONFIG_SMP
 /*
  * ->cpus_allowed is protected by both rq->lock and p->pi_lock
+ *
+ *  Must be called under get/put_online_cpus_atomic() or
+ *  equivalent, to avoid CPUs from going offline from underneath
+ *  us.
  */
 static int select_fallback_rq(int cpu, struct task_struct *p)
 {
@@ -1166,6 +1170,9 @@ out:
 
 /*
  * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
+ *
+ * Must be called under get/put_online_cpus_atomic(), to prevent
+ * CPUs from going offline from underneath us.
  */
 static inline
 int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
@@ -1406,6 +1413,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	int cpu, success = 0;
 
 	smp_wmb();
+	get_online_cpus_atomic();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	if (!(p->state & state))
 		goto out;
@@ -1446,6 +1454,7 @@ stat:
 	ttwu_stat(p, cpu, wake_flags);
 out:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	put_online_cpus_atomic();
 
 	return success;
 }
@@ -1624,6 +1633,7 @@ void wake_up_new_task(struct task_struct *p)
 	unsigned long flags;
 	struct rq *rq;
 
+	get_online_cpus_atomic();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 #ifdef CONFIG_SMP
 	/*
@@ -1644,6 +1654,7 @@ void wake_up_new_task(struct task_struct *p)
 		p->sched_class->task_woken(rq, p);
 #endif
 	task_rq_unlock(rq, p, &flags);
+	put_online_cpus_atomic();
 }
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -2541,6 +2552,7 @@ void sched_exec(void)
 	unsigned long flags;
 	int dest_cpu;
 
+	get_online_cpus_atomic();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
 	if (dest_cpu == smp_processor_id())
@@ -2550,11 +2562,13 @@ void sched_exec(void)
 		struct migration_arg arg = { p, dest_cpu };
 
 		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		put_online_cpus_atomic();
 		stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
 		return;
 	}
 unlock:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	put_online_cpus_atomic();
 }
 
 #endif


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH v4 6/9] kick_process(), cpu-hotplug: Prevent offlining of target CPU properly
  2012-12-11 14:03 [RFC PATCH v4 0/9] CPU hotplug: stop_machine()-free CPU hotplug Srivatsa S. Bhat
                   ` (4 preceding siblings ...)
  2012-12-11 14:05 ` [RFC PATCH v4 5/9] sched, cpu hotplug: Use stable online cpus in try_to_wake_up() & select_task_rq() Srivatsa S. Bhat
@ 2012-12-11 14:05 ` Srivatsa S. Bhat
  2012-12-11 14:05 ` [RFC PATCH v4 7/9] yield_to(), cpu-hotplug: Prevent offlining of other CPUs properly Srivatsa S. Bhat
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-11 14:05 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

Once stop_machine() is gone from the CPU offline path, we won't be able to
depend on preempt_disable() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_atomic() APIs, which can be invoked from atomic
context, to prevent CPUs from going offline.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/sched/core.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f51e0aa..cff7656 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1091,11 +1091,11 @@ void kick_process(struct task_struct *p)
 {
 	int cpu;
 
-	preempt_disable();
+	get_online_cpus_atomic();
 	cpu = task_cpu(p);
 	if ((cpu != smp_processor_id()) && task_curr(p))
 		smp_send_reschedule(cpu);
-	preempt_enable();
+	put_online_cpus_atomic();
 }
 EXPORT_SYMBOL_GPL(kick_process);
 #endif /* CONFIG_SMP */


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH v4 7/9] yield_to(), cpu-hotplug: Prevent offlining of other CPUs properly
  2012-12-11 14:03 [RFC PATCH v4 0/9] CPU hotplug: stop_machine()-free CPU hotplug Srivatsa S. Bhat
                   ` (5 preceding siblings ...)
  2012-12-11 14:05 ` [RFC PATCH v4 6/9] kick_process(), cpu-hotplug: Prevent offlining of target CPU properly Srivatsa S. Bhat
@ 2012-12-11 14:05 ` Srivatsa S. Bhat
  2012-12-11 14:05 ` [RFC PATCH v4 8/9] kvm, vmx: Add atomic synchronization with CPU Hotplug Srivatsa S. Bhat
  2012-12-11 14:05 ` [RFC PATCH v4 9/9] cpu: No more __stop_machine() in _cpu_down() Srivatsa S. Bhat
  8 siblings, 0 replies; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-11 14:05 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

Once stop_machine() is gone from the CPU offline path, we won't be able to
depend on local_irq_save() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_atomic() APIs, which can be invoked from atomic
context, to prevent CPUs from going offline.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/sched/core.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cff7656..4b982bf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4312,6 +4312,7 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 	unsigned long flags;
 	bool yielded = 0;
 
+	get_online_cpus_atomic();
 	local_irq_save(flags);
 	rq = this_rq();
 
@@ -4339,13 +4340,14 @@ again:
 		 * Make p's CPU reschedule; pick_next_entity takes care of
 		 * fairness.
 		 */
-		if (preempt && rq != p_rq)
+		if (preempt && rq != p_rq && cpu_online(task_cpu(p)))
 			resched_task(p_rq->curr);
 	}
 
 out:
 	double_rq_unlock(rq, p_rq);
 	local_irq_restore(flags);
+	put_online_cpus_atomic();
 
 	if (yielded)
 		schedule();


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH v4 8/9] kvm, vmx: Add atomic synchronization with CPU Hotplug
  2012-12-11 14:03 [RFC PATCH v4 0/9] CPU hotplug: stop_machine()-free CPU hotplug Srivatsa S. Bhat
                   ` (6 preceding siblings ...)
  2012-12-11 14:05 ` [RFC PATCH v4 7/9] yield_to(), cpu-hotplug: Prevent offlining of other CPUs properly Srivatsa S. Bhat
@ 2012-12-11 14:05 ` Srivatsa S. Bhat
  2012-12-11 14:05 ` [RFC PATCH v4 9/9] cpu: No more __stop_machine() in _cpu_down() Srivatsa S. Bhat
  8 siblings, 0 replies; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-11 14:05 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

preempt_disable() will no longer help prevent CPUs from going offline, once
stop_machine() gets removed from the CPU offline path. So use
get/put_online_cpus_atomic() in vmx_vcpu_load() to prevent CPUs from
going offline while clearing vmcs.

Reported-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Debugged-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 arch/x86/kvm/vmx.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f858159..d8a4cf1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1519,10 +1519,14 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
 
-	if (!vmm_exclusive)
+	if (!vmm_exclusive) {
 		kvm_cpu_vmxon(phys_addr);
-	else if (vmx->loaded_vmcs->cpu != cpu)
+	} else if (vmx->loaded_vmcs->cpu != cpu) {
+		/* Prevent any CPU from going offline */
+		get_online_cpus_atomic();
 		loaded_vmcs_clear(vmx->loaded_vmcs);
+		put_online_cpus_atomic();
+	}
 
 	if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
 		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH v4 9/9] cpu: No more __stop_machine() in _cpu_down()
  2012-12-11 14:03 [RFC PATCH v4 0/9] CPU hotplug: stop_machine()-free CPU hotplug Srivatsa S. Bhat
                   ` (7 preceding siblings ...)
  2012-12-11 14:05 ` [RFC PATCH v4 8/9] kvm, vmx: Add atomic synchronization with CPU Hotplug Srivatsa S. Bhat
@ 2012-12-11 14:05 ` Srivatsa S. Bhat
  8 siblings, 0 replies; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-11 14:05 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

From: Paul E. McKenney <paul.mckenney@linaro.org>

The _cpu_down() function invoked as part of the CPU-hotplug offlining
process currently invokes __stop_machine(), which is slow and inflicts
substantial real-time latencies on the entire system.  This patch
substitutes stop_one_cpu() for __stop_machine() in order to improve
both performance and real-time latency.

This is currently unsafe, because there are a number of uses of
preempt_disable() that are intended to block CPU-hotplug offlining.
These will be fixed by using get/put_online_cpus_atomic(), but in the
meantime, this commit is one way to help locate them.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ srivatsa.bhat@linux.vnet.ibm.com: Refer to the new sync primitives for
  readers (in the changelog), and s/stop_cpus/stop_one_cpu ]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/cpu.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 5a63296..3f9498e 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -484,7 +484,7 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
 	}
 	smpboot_park_threads(cpu);
 
-	err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
+	err = stop_one_cpu(cpu, take_cpu_down, &tcd_param);
 	if (err) {
 		/* CPU didn't die: tell everyone.  Can't complain. */
 		smpboot_unpark_threads(cpu);


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-11 14:04 ` [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context Srivatsa S. Bhat
@ 2012-12-12 17:17   ` Oleg Nesterov
  2012-12-12 17:24     ` Oleg Nesterov
  2012-12-12 17:53     ` Srivatsa S. Bhat
  0 siblings, 2 replies; 40+ messages in thread
From: Oleg Nesterov @ 2012-12-12 17:17 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/11, Srivatsa S. Bhat wrote:
>
> IOW, the hotplug readers just increment/decrement their per-cpu refcounts
> when no writer is active.

plus cli/sti ;) and increment/decrement are atomic.

At first glance looks correct to me, but I'll try to read it carefully
later.

A couple of minor nits,

> +static DEFINE_PER_CPU(bool, writer_signal);

Why it needs to be per-cpu? It can be global and __read_mostly to avoid
the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal
into a single cacheline...

> +void get_online_cpus_atomic(void)
> +{
> +	unsigned long flags;
> +
> +	preempt_disable();
> +
> +	if (cpu_hotplug.active_writer == current)
> +		return;
> +
> +	local_irq_save(flags);

Yes... this is still needed, we are going to increment reader_percpu_refcnt
unconditionally and this makes reader_nested_percpu() == T.

But,

> +void put_online_cpus_atomic(void)
> +{
> +	unsigned long flags;
> +
> +	if (cpu_hotplug.active_writer == current)
> +		goto out;
> +
> +	local_irq_save(flags);
> +
> +	/*
> +	 * We never allow heterogeneous nesting of readers. So it is trivial
> +	 * to find out the kind of reader we are, and undo the operation
> +	 * done by our corresponding get_online_cpus_atomic().
> +	 */
> +	if (__this_cpu_read(reader_percpu_refcnt))
> +		__this_cpu_dec(reader_percpu_refcnt);
> +	else
> +		read_unlock(&hotplug_rwlock);
> +
> +	local_irq_restore(flags);
> +out:
> +	preempt_enable();
> +}

Do we really need local_irq_save/restore in put_ ?

Oleg.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-12 17:17   ` Oleg Nesterov
@ 2012-12-12 17:24     ` Oleg Nesterov
  2012-12-12 18:11       ` Srivatsa S. Bhat
  2012-12-12 17:53     ` Srivatsa S. Bhat
  1 sibling, 1 reply; 40+ messages in thread
From: Oleg Nesterov @ 2012-12-12 17:24 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/12, Oleg Nesterov wrote:
>
> On 12/11, Srivatsa S. Bhat wrote:
> >
> > IOW, the hotplug readers just increment/decrement their per-cpu refcounts
> > when no writer is active.
>
> plus cli/sti ;) and increment/decrement are atomic.
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

OOPS, sorry I was going to say "adds mb()".

And when I look at get_online_cpus_atomic() again it uses rmb(). This
doesn't look correct, we need the full barrier between this_cpu_inc()
and writer_active().

At the same time reader_nested_percpu() can be checked before mb().

Oleg.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-12 17:17   ` Oleg Nesterov
  2012-12-12 17:24     ` Oleg Nesterov
@ 2012-12-12 17:53     ` Srivatsa S. Bhat
  2012-12-12 18:02       ` Oleg Nesterov
  1 sibling, 1 reply; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-12 17:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/12/2012 10:47 PM, Oleg Nesterov wrote:
> On 12/11, Srivatsa S. Bhat wrote:
>>
>> IOW, the hotplug readers just increment/decrement their per-cpu refcounts
>> when no writer is active.
> 
> plus cli/sti ;)

Of course, forgot to mention it, again! :)

> and increment/decrement are atomic.
> 
> At first glance looks correct to me, but I'll try to read it carefully
> later.
> 
> A couple of minor nits,
> 
>> +static DEFINE_PER_CPU(bool, writer_signal);
> 
> Why it needs to be per-cpu? It can be global and __read_mostly to avoid
> the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal
> into a single cacheline...
> 

Even I realized this (that we could use a global) after posting out the
series.. But do you think that it would be better to retain the per-cpu
variant itself, due to the cache effects?

>> +void get_online_cpus_atomic(void)
>> +{
>> +	unsigned long flags;
>> +
>> +	preempt_disable();
>> +
>> +	if (cpu_hotplug.active_writer == current)
>> +		return;
>> +
>> +	local_irq_save(flags);
> 
> Yes... this is still needed, we are going to increment reader_percpu_refcnt
> unconditionally and this makes reader_nested_percpu() == T.
> 
> But,
> 
>> +void put_online_cpus_atomic(void)
>> +{
>> +	unsigned long flags;
>> +
>> +	if (cpu_hotplug.active_writer == current)
>> +		goto out;
>> +
>> +	local_irq_save(flags);
>> +
>> +	/*
>> +	 * We never allow heterogeneous nesting of readers. So it is trivial
>> +	 * to find out the kind of reader we are, and undo the operation
>> +	 * done by our corresponding get_online_cpus_atomic().
>> +	 */
>> +	if (__this_cpu_read(reader_percpu_refcnt))
>> +		__this_cpu_dec(reader_percpu_refcnt);
>> +	else
>> +		read_unlock(&hotplug_rwlock);
>> +
>> +	local_irq_restore(flags);
>> +out:
>> +	preempt_enable();
>> +}
> 
> Do we really need local_irq_save/restore in put_ ?
>

Hmm.. good point! I don't think we need it.
 

Regards,
Srivatsa S. Bhat



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-12 17:53     ` Srivatsa S. Bhat
@ 2012-12-12 18:02       ` Oleg Nesterov
  2012-12-12 18:30         ` Srivatsa S. Bhat
  2012-12-12 19:36         ` Oleg Nesterov
  0 siblings, 2 replies; 40+ messages in thread
From: Oleg Nesterov @ 2012-12-12 18:02 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/12, Srivatsa S. Bhat wrote:
>
> On 12/12/2012 10:47 PM, Oleg Nesterov wrote:
> >
> > Why it needs to be per-cpu? It can be global and __read_mostly to avoid
> > the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal
> > into a single cacheline...
>
> Even I realized this (that we could use a global) after posting out the
> series.. But do you think that it would be better to retain the per-cpu
> variant itself, due to the cache effects?

I don't really know, up to you. This was the question ;)

> > Do we really need local_irq_save/restore in put_ ?
> >
>
> Hmm.. good point! I don't think we need it.

And _perhaps_ get_ can avoid it too?

I didn't really try to think, probably this is not right, but can't
something like this work?

	#define XXXX	(1 << 16)
	#define MASK	(XXXX -1)

	void get_online_cpus_atomic(void)
	{
		preempt_disable();

		// only for writer
		__this_cpu_add(reader_percpu_refcnt, XXXX);

		if (__this_cpu_read(reader_percpu_refcnt) & MASK) {
			__this_cpu_inc(reader_percpu_refcnt);
		} else {
			smp_wmb();
			if (writer_active()) {
				...
			}
		}

		__this_cpu_sub(reader_percpu_refcnt, XXXX);
	}

	void put_online_cpus_atomic(void)
	{
		if (__this_cpu_read(reader_percpu_refcnt) & MASK)
			__this_cpu_dec(reader_percpu_refcnt);
		else
			read_unlock(&hotplug_rwlock);
		preempt_enable();
	}

Oleg.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-12 17:24     ` Oleg Nesterov
@ 2012-12-12 18:11       ` Srivatsa S. Bhat
  2012-12-12 18:23         ` Oleg Nesterov
  0 siblings, 1 reply; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-12 18:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/12/2012 10:54 PM, Oleg Nesterov wrote:
> On 12/12, Oleg Nesterov wrote:
>>
>> On 12/11, Srivatsa S. Bhat wrote:
>>>
>>> IOW, the hotplug readers just increment/decrement their per-cpu refcounts
>>> when no writer is active.
>>
>> plus cli/sti ;) and increment/decrement are atomic.
>                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> OOPS, sorry I was going to say "adds mb()".
> 

Ok, got it :)

> And when I look at get_online_cpus_atomic() again it uses rmb(). This
> doesn't look correct, we need the full barrier between this_cpu_inc()
> and writer_active().
> 

Hmm..

> At the same time reader_nested_percpu() can be checked before mb().
> 

I thought that since the increment and the check (reader_nested_percpu)
act on the same memory location, they will naturally be run in the given
order, without any need for barriers. Am I wrong?

(I referred Documentation/memory-barriers.txt again to verify this, and
the second point under the "Guarantees" section looked like it said the
same thing : "Overlapping loads and stores within a particular CPU will
appear to be ordered within that CPU").

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-12 18:11       ` Srivatsa S. Bhat
@ 2012-12-12 18:23         ` Oleg Nesterov
  2012-12-12 18:42           ` Srivatsa S. Bhat
  0 siblings, 1 reply; 40+ messages in thread
From: Oleg Nesterov @ 2012-12-12 18:23 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/12, Srivatsa S. Bhat wrote:
>
> On 12/12/2012 10:54 PM, Oleg Nesterov wrote:
>
> > And when I look at get_online_cpus_atomic() again it uses rmb(). This
> > doesn't look correct, we need the full barrier between this_cpu_inc()
> > and writer_active().
>
> Hmm..
>
> > At the same time reader_nested_percpu() can be checked before mb().
>
> I thought that since the increment and the check (reader_nested_percpu)
> act on the same memory location, they will naturally be run in the given
> order, without any need for barriers. Am I wrong?

And this is what I meant, you do not need a barrier before
reader_nested_percpu().

But you need to ensure that WRITE(reader_percpu_refcnt) and READ(writer_signal)
can't be reordered, so you need mb() in between. rmb() can serialize LOADs and
STOREs.

Or I misunderstood?

Oleg.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-12 18:02       ` Oleg Nesterov
@ 2012-12-12 18:30         ` Srivatsa S. Bhat
  2012-12-12 18:48           ` Oleg Nesterov
  2012-12-12 19:36         ` Oleg Nesterov
  1 sibling, 1 reply; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-12 18:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/12/2012 11:32 PM, Oleg Nesterov wrote:
> On 12/12, Srivatsa S. Bhat wrote:
>>
>> On 12/12/2012 10:47 PM, Oleg Nesterov wrote:
>>>
>>> Why it needs to be per-cpu? It can be global and __read_mostly to avoid
>>> the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal
>>> into a single cacheline...
>>
>> Even I realized this (that we could use a global) after posting out the
>> series.. But do you think that it would be better to retain the per-cpu
>> variant itself, due to the cache effects?
> 
> I don't really know, up to you. This was the question ;)

OK :-)

> 
>>> Do we really need local_irq_save/restore in put_ ?
>>>
>>
>> Hmm.. good point! I don't think we need it.
> 
> And _perhaps_ get_ can avoid it too?
> 
> I didn't really try to think, probably this is not right, but can't
> something like this work?
> 
> 	#define XXXX	(1 << 16)
> 	#define MASK	(XXXX -1)
> 
> 	void get_online_cpus_atomic(void)
> 	{
> 		preempt_disable();
> 
> 		// only for writer
> 		__this_cpu_add(reader_percpu_refcnt, XXXX);
> 
> 		if (__this_cpu_read(reader_percpu_refcnt) & MASK) {
> 			__this_cpu_inc(reader_percpu_refcnt);
> 		} else {
> 			smp_wmb();
> 			if (writer_active()) {
> 				...
> 			}
> 		}
> 
> 		__this_cpu_dec(reader_percpu_refcnt, XXXX);
> 	}
> 

Sorry, may be I'm too blind to see, but I didn't understand the logic
of how the mask helps us avoid disabling interrupts.. Can you kindly
elaborate?

> 	void put_online_cpus_atomic(void)
> 	{
> 		if (__this_cpu_read(reader_percpu_refcnt) & MASK)
> 			__this_cpu_dec(reader_percpu_refcnt);
> 		else
> 			read_unlock(&hotplug_rwlock);
> 		preempt_enable();
> 	}
> 

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-12 18:23         ` Oleg Nesterov
@ 2012-12-12 18:42           ` Srivatsa S. Bhat
  0 siblings, 0 replies; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-12 18:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/12/2012 11:53 PM, Oleg Nesterov wrote:
> On 12/12, Srivatsa S. Bhat wrote:
>>
>> On 12/12/2012 10:54 PM, Oleg Nesterov wrote:
>>
>>> And when I look at get_online_cpus_atomic() again it uses rmb(). This
>>> doesn't look correct, we need the full barrier between this_cpu_inc()
>>> and writer_active().
>>
>> Hmm..
>>
>>> At the same time reader_nested_percpu() can be checked before mb().
>>
>> I thought that since the increment and the check (reader_nested_percpu)
>> act on the same memory location, they will naturally be run in the given
>> order, without any need for barriers. Am I wrong?
> 
> And this is what I meant, you do not need a barrier before
> reader_nested_percpu().
> 

Ah, ok!

> But you need to ensure that WRITE(reader_percpu_refcnt) and READ(writer_signal)
> can't be reordered, so you need mb() in between. rmb() can serialize LOADs and
> STOREs.
> 

OK, got it. (I know you meant s/can/can't).

I'm trying to see if we can somehow exploit the fact that the writer can
potentially tolerate if a reader ignores his signal (to switch to rwlocks)
for a while... and use this to get rid of barriers in the reader path (without
using synchronize_sched() at the writer, of course). And perhaps also take advantage
of the fact that the read_lock() acts as a one-way barrier..

I don't know, maybe its not possible after all.. :-/
 
Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-12 18:30         ` Srivatsa S. Bhat
@ 2012-12-12 18:48           ` Oleg Nesterov
  2012-12-12 19:12             ` Srivatsa S. Bhat
  0 siblings, 1 reply; 40+ messages in thread
From: Oleg Nesterov @ 2012-12-12 18:48 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/13, Srivatsa S. Bhat wrote:
>
> On 12/12/2012 11:32 PM, Oleg Nesterov wrote:
> > And _perhaps_ get_ can avoid it too?
> >
> > I didn't really try to think, probably this is not right, but can't
> > something like this work?
> >
> > 	#define XXXX	(1 << 16)
> > 	#define MASK	(XXXX -1)
> >
> > 	void get_online_cpus_atomic(void)
> > 	{
> > 		preempt_disable();
> >
> > 		// only for writer
> > 		__this_cpu_add(reader_percpu_refcnt, XXXX);
> >
> > 		if (__this_cpu_read(reader_percpu_refcnt) & MASK) {
> > 			__this_cpu_inc(reader_percpu_refcnt);
> > 		} else {
> > 			smp_wmb();
> > 			if (writer_active()) {
> > 				...
> > 			}
> > 		}
> >
> > 		__this_cpu_dec(reader_percpu_refcnt, XXXX);
> > 	}
> >
>
> Sorry, may be I'm too blind to see, but I didn't understand the logic
> of how the mask helps us avoid disabling interrupts..

Why do we need cli/sti at all? We should prevent the following race:

	- the writer already holds hotplug_rwlock, so get_ must not
	  succeed.

	- the new reader comes, it increments reader_percpu_refcnt,
	  but before it checks writer_active() ...

	- irq handler does get_online_cpus_atomic() and sees
	  reader_nested_percpu() == T, so it simply increments
	  reader_percpu_refcnt and succeeds.

OTOH, why do we need to increment the reader_percpu_refcnt counter
in advance? To ensure that either we see writer_active() or the
writer sees reader_percpu_refcnt != 0 (and that is why they
should write/read in reverse order).

The code above tries to avoid this race using the lower 16 bits
as a "nested-counter", and the upper bits to avoid the race with
the writer.

	// only for writer
	__this_cpu_add(reader_percpu_refcnt, XXXX);

If irq comes and does get_online_cpus_atomic(), it won't be confused
by __this_cpu_add(XXXX), it will check the lower bits and switch to
the "slow path".


But once again, so far I didn't really try to think. It is quite
possible I missed something.

Oleg.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-12 18:48           ` Oleg Nesterov
@ 2012-12-12 19:12             ` Srivatsa S. Bhat
  2012-12-13 15:26               ` Srivatsa S. Bhat
  0 siblings, 1 reply; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-12 19:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/13/2012 12:18 AM, Oleg Nesterov wrote:
> On 12/13, Srivatsa S. Bhat wrote:
>>
>> On 12/12/2012 11:32 PM, Oleg Nesterov wrote:
>>> And _perhaps_ get_ can avoid it too?
>>>
>>> I didn't really try to think, probably this is not right, but can't
>>> something like this work?
>>>
>>> 	#define XXXX	(1 << 16)
>>> 	#define MASK	(XXXX -1)
>>>
>>> 	void get_online_cpus_atomic(void)
>>> 	{
>>> 		preempt_disable();
>>>
>>> 		// only for writer
>>> 		__this_cpu_add(reader_percpu_refcnt, XXXX);
>>>
>>> 		if (__this_cpu_read(reader_percpu_refcnt) & MASK) {
>>> 			__this_cpu_inc(reader_percpu_refcnt);
>>> 		} else {
>>> 			smp_wmb();
>>> 			if (writer_active()) {
>>> 				...
>>> 			}
>>> 		}
>>>
>>> 		__this_cpu_dec(reader_percpu_refcnt, XXXX);
>>> 	}
>>>
>>
>> Sorry, may be I'm too blind to see, but I didn't understand the logic
>> of how the mask helps us avoid disabling interrupts..
> 
> Why do we need cli/sti at all? We should prevent the following race:
> 
> 	- the writer already holds hotplug_rwlock, so get_ must not
> 	  succeed.
> 
> 	- the new reader comes, it increments reader_percpu_refcnt,
> 	  but before it checks writer_active() ...
> 
> 	- irq handler does get_online_cpus_atomic() and sees
> 	  reader_nested_percpu() == T, so it simply increments
> 	  reader_percpu_refcnt and succeeds.
> 
> OTOH, why do we need to increment reader_percpu_refcnt the counter
> in advance? To ensure that either we see writer_active() or the
> writer should see reader_percpu_refcnt != 0 (and that is why they
> should write/read in reverse order).
> 
> The code above tries to avoid this race using the lower 16 bits
> as a "nested-counter", and the upper bits to avoid the race with
> the writer.
> 
> 	// only for writer
> 	__this_cpu_add(reader_percpu_refcnt, XXXX);
> 
> If irq comes and does get_online_cpus_atomic(), it won't be confused
> by __this_cpu_add(XXXX), it will check the lower bits and switch to
> the "slow path".
> 

This is a very clever scheme indeed! :-) Thanks a lot for explaining
it in detail.

> 
> But once again, so far I didn't really try to think. It is quite
> possible I missed something.
> 

Even I don't spot anything wrong with it. But I'll give it some more
thought..

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-12 18:02       ` Oleg Nesterov
  2012-12-12 18:30         ` Srivatsa S. Bhat
@ 2012-12-12 19:36         ` Oleg Nesterov
  2012-12-12 19:43           ` Srivatsa S. Bhat
  1 sibling, 1 reply; 40+ messages in thread
From: Oleg Nesterov @ 2012-12-12 19:36 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/12, Oleg Nesterov wrote:
>
> On 12/12, Srivatsa S. Bhat wrote:
> >
> > On 12/12/2012 10:47 PM, Oleg Nesterov wrote:
> > >
> > > Why it needs to be per-cpu? It can be global and __read_mostly to avoid
> > > the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal
> > > into a single cacheline...
> >
> > Even I realized this (that we could use a global) after posting out the
> > series.. But do you think that it would be better to retain the per-cpu
> > variant itself, due to the cache effects?
>
> I don't really know, up to you. This was the question ;)

But perhaps there is another reason to make it per-cpu...

It seems we can avoid cpu_hotplug.active_writer == current check in
get/put.

take_cpu_down() can clear this_cpu(writer_signal) right after it takes
hotplug_rwlock for writing. It runs with irqs and preemption disabled,
nobody else will ever look at writer_signal on its CPU.
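
Roughly (just a sketch of the idea, not the actual patch; writer_signal is
the per-cpu flag discussed above):

	static int take_cpu_down(void *_param)
	{
		/* ... */
		write_lock(&hotplug_rwlock);

		/*
		 * Irqs and preemption are disabled here, so nothing else
		 * running on this CPU can look at writer_signal.  Clearing
		 * it lets get/put_online_cpus_atomic() on this CPU fall
		 * back to the per-cpu refcount instead of read_lock()ing
		 * the rwlock we now hold for write.
		 */
		__this_cpu_write(writer_signal, 0);

		/* ... rest of the offline work ... */
		return 0;
	}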

Oleg.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-12 19:36         ` Oleg Nesterov
@ 2012-12-12 19:43           ` Srivatsa S. Bhat
  2012-12-12 21:10             ` Oleg Nesterov
  0 siblings, 1 reply; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-12 19:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/13/2012 01:06 AM, Oleg Nesterov wrote:
> On 12/12, Oleg Nesterov wrote:
>>
>> On 12/12, Srivatsa S. Bhat wrote:
>>>
>>> On 12/12/2012 10:47 PM, Oleg Nesterov wrote:
>>>>
>>>> Why it needs to be per-cpu? It can be global and __read_mostly to avoid
>>>> the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal
>>>> into a single cacheline...
>>>
>>> Even I realized this (that we could use a global) after posting out the
>>> series.. But do you think that it would be better to retain the per-cpu
>>> variant itself, due to the cache effects?
>>
>> I don't really know, up to you. This was the question ;)
> 
> But perhaps there is another reason to make it per-cpu...
> 
> It seems we can avoid cpu_hotplug.active_writer == current check in
> get/put.
> 
> take_cpu_down() can clear this_cpu(writer_signal) right after it takes
> hotplug_rwlock for writing. It runs with irqs and preemption disabled,
> nobody else will ever look at writer_signal on its CPU.
> 

Hmm.. And then the get/put_ on that CPU will increment/decrement the per-cpu
refcount, but we don't care.. because we only need to ensure that they don't
deadlock by taking the rwlock for read.

This sounds great!

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-12 19:43           ` Srivatsa S. Bhat
@ 2012-12-12 21:10             ` Oleg Nesterov
  0 siblings, 0 replies; 40+ messages in thread
From: Oleg Nesterov @ 2012-12-12 21:10 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/13, Srivatsa S. Bhat wrote:
>
> On 12/13/2012 01:06 AM, Oleg Nesterov wrote:
> >
> > But perhaps there is another reason to make it per-cpu...

Actually this is not the reason, please see below. But let me repeat,
it is not that I suggest to remove "per-cpu".

> > It seems we can avoid cpu_hotplug.active_writer == current check in
> > get/put.
> >
> > take_cpu_down() can clear this_cpu(writer_signal) right after it takes
> > hotplug_rwlock for writing. It runs with irqs and preemption disabled,
> > nobody else will ever look at writer_signal on its CPU.
> >
>
> Hmm.. And then the get/put_ on that CPU will increment/decrement the per-cpu
> refcount, but we don't care.. because we only need to ensure that they don't
> deadlock by taking the rwlock for read.

Yes, but...

Probably it would be cleaner to simply do this_cpu_inc(reader_percpu_refcnt)
after write_lock(hotplug_rwlock). This will have the same effect for get/put,
and we can still make writer_signal global (if we want).

And note that this will also simplify the lockdep annotations which we (imho)
should add later.

Ignoring all complications get_online_cpus_atomic() does:

	if (this_cpu_read(reader_percpu_refcnt))
		this_cpu_inc(reader_percpu_refcnt);
	else if (!writer_signal)
		this_cpu_inc(reader_percpu_refcnt);	// same as above
	else
		read_lock(&hotplug_rwlock);

But for lockdep it should do:

	if (this_cpu_read(reader_percpu_refcnt))
		this_cpu_inc(reader_percpu_refcnt);
	else if (!writer_signal) {
		this_cpu_inc(reader_percpu_refcnt);
		// pretend we take hotplug_rwlock for lockdep
		rwlock_acquire_read(&hotplug_rwlock.dep_map, 0, 0, _RET_IP_);
	}
	else
		read_lock(&hotplug_rwlock);

And we need to ensure that rwlock_acquire_read() is not called under
write_lock(hotplug_rwlock).

If we use reader_percpu_refcnt to fool get/put, we should not worry.

Oleg.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-12 19:12             ` Srivatsa S. Bhat
@ 2012-12-13 15:26               ` Srivatsa S. Bhat
  2012-12-13 16:17                 ` Oleg Nesterov
  0 siblings, 1 reply; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-13 15:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/13/2012 12:42 AM, Srivatsa S. Bhat wrote:
> On 12/13/2012 12:18 AM, Oleg Nesterov wrote:
>> On 12/13, Srivatsa S. Bhat wrote:
>>>
>>> On 12/12/2012 11:32 PM, Oleg Nesterov wrote:
>>>> And _perhaps_ get_ can avoid it too?
>>>>
>>>> I didn't really try to think, probably this is not right, but can't
>>>> something like this work?
>>>>
>>>> 	#define XXXX	(1 << 16)
>>>> 	#define MASK	(XXXX -1)
>>>>
>>>> 	void get_online_cpus_atomic(void)
>>>> 	{
>>>> 		preempt_disable();
>>>>
>>>> 		// only for writer
>>>> 		__this_cpu_add(reader_percpu_refcnt, XXXX);
>>>>
>>>> 		if (__this_cpu_read(reader_percpu_refcnt) & MASK) {
>>>> 			__this_cpu_inc(reader_percpu_refcnt);
>>>> 		} else {
>>>> 			smp_wmb();
>>>> 			if (writer_active()) {
>>>> 				...
>>>> 			}
>>>> 		}
>>>>
>>>> 		__this_cpu_dec(reader_percpu_refcnt, XXXX);
>>>> 	}
>>>>
>>>
>>> Sorry, may be I'm too blind to see, but I didn't understand the logic
>>> of how the mask helps us avoid disabling interrupts..
>>
>> Why do we need cli/sti at all? We should prevent the following race:
>>
>> 	- the writer already holds hotplug_rwlock, so get_ must not
>> 	  succeed.
>>
>> 	- the new reader comes, it increments reader_percpu_refcnt,
>> 	  but before it checks writer_active() ...
>>
>> 	- irq handler does get_online_cpus_atomic() and sees
>> 	  reader_nested_percpu() == T, so it simply increments
>> 	  reader_percpu_refcnt and succeeds.
>>
>> OTOH, why do we need to increment reader_percpu_refcnt the counter
>> in advance? To ensure that either we see writer_active() or the
>> writer should see reader_percpu_refcnt != 0 (and that is why they
>> should write/read in reverse order).
>>
>> The code above tries to avoid this race using the lower 16 bits
>> as a "nested-counter", and the upper bits to avoid the race with
>> the writer.
>>
>> 	// only for writer
>> 	__this_cpu_add(reader_percpu_refcnt, XXXX);
>>
>> If irq comes and does get_online_cpus_atomic(), it won't be confused
>> by __this_cpu_add(XXXX), it will check the lower bits and switch to
>> the "slow path".
>>
> 
> This is a very clever scheme indeed! :-) Thanks a lot for explaining
> it in detail.
> 
>>
>> But once again, so far I didn't really try to think. It is quite
>> possible I missed something.
>>
> 
> Even I don't spot anything wrong with it. But I'll give it some more
> thought..

Since an interrupt handler can also run get_online_cpus_atomic(), we
cannot use the __this_cpu_* versions for modifying reader_percpu_refcnt,
right?

To maintain the integrity of the update itself, we will have to use the
this_cpu_* variant, which basically plays spoil-sport on this whole
scheme... :-(

But still, this scheme is better, because the reader doesn't have to spin
on the read_lock() with interrupts disabled. That way, interrupt handlers
that are not hotplug readers can continue to run on this CPU while taking
another CPU offline.

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-13 15:26               ` Srivatsa S. Bhat
@ 2012-12-13 16:17                 ` Oleg Nesterov
  2012-12-13 16:32                   ` Srivatsa S. Bhat
  2012-12-13 16:32                   ` Tejun Heo
  0 siblings, 2 replies; 40+ messages in thread
From: Oleg Nesterov @ 2012-12-13 16:17 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/13, Srivatsa S. Bhat wrote:
>
> On 12/13/2012 12:42 AM, Srivatsa S. Bhat wrote:
> >
> > Even I don't spot anything wrong with it. But I'll give it some more
> > thought..
>
> Since an interrupt handler can also run get_online_cpus_atomic(), we
> cannot use the __this_cpu_* versions for modifying reader_percpu_refcnt,
> right?

Hmm. I thought that __this_cpu_* must be safe under preempt_disable().
IOW, I thought that, say, this_cpu_inc() is "equal" to preempt_disable +
__this_cpu_inc() correctness-wise.

And. I thought that this_cpu_inc() is safe wrt interrupt, like local_t.

But when I try to read the comments in percpu.h, I am starting to think that
even this_cpu_inc() is not safe if irq handler can do the same?

Confused...

I am shy to ask... will, say, DEFINE_PER_CPU(local_t) and
local_inc(__this_cpu_ptr(...)) work??

> But still, this scheme is better, because the reader doesn't have to spin
> on the read_lock() with interrupts disabled.

Yes, but my main concern is that irq_disable/enable itself is not that cheap.

Oleg.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-13 16:17                 ` Oleg Nesterov
@ 2012-12-13 16:32                   ` Srivatsa S. Bhat
  2012-12-14 18:03                     ` Oleg Nesterov
  2012-12-13 16:32                   ` Tejun Heo
  1 sibling, 1 reply; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-13 16:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/13/2012 09:47 PM, Oleg Nesterov wrote:
> On 12/13, Srivatsa S. Bhat wrote:
>>
>> On 12/13/2012 12:42 AM, Srivatsa S. Bhat wrote:
>>>
>>> Even I don't spot anything wrong with it. But I'll give it some more
>>> thought..
>>
>> Since an interrupt handler can also run get_online_cpus_atomic(), we
>> cannot use the __this_cpu_* versions for modifying reader_percpu_refcnt,
>> right?
> 
> Hmm. I thought that __this_cpu_* must be safe under preempt_disable().
> IOW, I thought that, say, this_cpu_inc() is "equal" to preempt_disable +
> __this_cpu_inc() correctness-wise.
>
> And. I thought that this_cpu_inc() is safe wrt interrupt, like local_t.
> 
> But when I try to read the comments percpu.h, I am starting to think that
> even this_cpu_inc() is not safe if irq handler can do the same?
> 

The comment seems to say that it's not safe wrt interrupts. But looking at
the code in include/linux/percpu.h, IIUC, that is true only about
this_cpu_read() because it only disables preemption.

However, this_cpu_inc() looks safe wrt interrupts because it wraps the
increment within raw_local_irqsave()/restore().

> Confused...
> 
> I am shy to ask... will, say, DEFINE_PER_CPU(local_t) and
> local_inc(__this_cpu_ptr(...)) work??
> 
>> But still, this scheme is better, because the reader doesn't have to spin
>> on the read_lock() with interrupts disabled.
> 
> Yes, but my main concern is that irq_disable/enable itself is not that cheap.
> 

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-13 16:17                 ` Oleg Nesterov
  2012-12-13 16:32                   ` Srivatsa S. Bhat
@ 2012-12-13 16:32                   ` Tejun Heo
  1 sibling, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2012-12-13 16:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srivatsa S. Bhat, tglx, peterz, paulmck, rusty, mingo, akpm,
	namhyung, vincent.guittot, sbw, amit.kucheria, rostedt, rjw,
	wangyun, xiaoguangrong, nikunj, linux-pm, linux-kernel

Hello, Oleg.

On Thu, Dec 13, 2012 at 05:17:09PM +0100, Oleg Nesterov wrote:
> Hmm. I thought that __this_cpu_* must be safe under preempt_disable().
> IOW, I thought that, say, this_cpu_inc() is "equal" to preempt_disable +
> __this_cpu_inc() correctness-wise.

this_cpu_inc() equals local_irq_save() + __this_cpu_inc().

> And. I thought that this_cpu_inc() is safe wrt interrupt, like local_t.

Yes, it is safe.

> But when I try to read the comments percpu.h, I am starting to think that
> even this_cpu_inc() is not safe if irq handler can do the same?
> 
> Confused...

Yeah, the comment is confusing and the way these macros are defined
doesn't help.  There used to be three variants and it looks like we
didn't update the comment while removing the preempt safe ones.  Gotta
clean those up.  Anyways, yes, this_cpu_*() are safe against irqs.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-13 16:32                   ` Srivatsa S. Bhat
@ 2012-12-14 18:03                     ` Oleg Nesterov
  2012-12-18 15:53                       ` Srivatsa S. Bhat
  0 siblings, 1 reply; 40+ messages in thread
From: Oleg Nesterov @ 2012-12-14 18:03 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/13, Srivatsa S. Bhat wrote:
>
> On 12/13/2012 09:47 PM, Oleg Nesterov wrote:
> > On 12/13, Srivatsa S. Bhat wrote:
> >>
> >> On 12/13/2012 12:42 AM, Srivatsa S. Bhat wrote:
> >>>
> >>> Even I don't spot anything wrong with it. But I'll give it some more
> >>> thought..
> >>
> >> Since an interrupt handler can also run get_online_cpus_atomic(), we
> >> cannot use the __this_cpu_* versions for modifying reader_percpu_refcnt,
> >> right?
> >
> > Hmm. I thought that __this_cpu_* must be safe under preempt_disable().
> > IOW, I thought that, say, this_cpu_inc() is "equal" to preempt_disable +
> > __this_cpu_inc() correctness-wise.
> >
> > And. I thought that this_cpu_inc() is safe wrt interrupt, like local_t.
> >
> > But when I try to read the comments percpu.h, I am starting to think that
> > even this_cpu_inc() is not safe if irq handler can do the same?
> >
>
> The comment seems to say that its not safe wrt interrupts. But looking at
> the code in include/linux/percpu.h, IIUC, that is true only about
> this_cpu_read() because it only disables preemption.
>
> However, this_cpu_inc() looks safe wrt interrupts because it wraps the
> increment within raw_local_irqsave()/restore().

You mean _this_cpu_generic_to_op() I guess. So yes, I think you are right,
this_cpu_* should be irq-safe, but __this_cpu_* is not.

Thanks.

At least on x86 there is no difference between this_ and __this_, both do
percpu_add_op() without local_irq_disable/enable. But it seems that most
architectures use the generic code.
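
For reference, the generic fallback is (paraphrasing include/linux/percpu.h,
not quoting it exactly) the plain op wrapped in irq disable/enable, which is
what makes this_cpu_add() irq-safe on those architectures:

	#define _this_cpu_generic_to_op(pcp, val, op)			\
	do {								\
		unsigned long flags;					\
		raw_local_irq_save(flags);				\
		*__this_cpu_ptr(&(pcp)) op val;				\
		raw_local_irq_restore(flags);				\
	} while (0)

while percpu_add_op() on x86 is a single instruction on the per-cpu segment,
hence irq-safe without the cli/sti.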

Oleg.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-14 18:03                     ` Oleg Nesterov
@ 2012-12-18 15:53                       ` Srivatsa S. Bhat
  2012-12-18 19:43                         ` Oleg Nesterov
  0 siblings, 1 reply; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-18 15:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/14/2012 11:33 PM, Oleg Nesterov wrote:
> On 12/13, Srivatsa S. Bhat wrote:
>>
>> On 12/13/2012 09:47 PM, Oleg Nesterov wrote:
>>> On 12/13, Srivatsa S. Bhat wrote:
>>>>
>>>> On 12/13/2012 12:42 AM, Srivatsa S. Bhat wrote:
>>>>>
>>>>> Even I don't spot anything wrong with it. But I'll give it some more
>>>>> thought..
>>>>
>>>> Since an interrupt handler can also run get_online_cpus_atomic(), we
>>>> cannot use the __this_cpu_* versions for modifying reader_percpu_refcnt,
>>>> right?
>>>
>>> Hmm. I thought that __this_cpu_* must be safe under preempt_disable().
>>> IOW, I thought that, say, this_cpu_inc() is "equal" to preempt_disable +
>>> __this_cpu_inc() correctness-wise.
>>>
>>> And. I thought that this_cpu_inc() is safe wrt interrupt, like local_t.
>>>
>>> But when I try to read the comments percpu.h, I am starting to think that
>>> even this_cpu_inc() is not safe if irq handler can do the same?
>>>
>>
>> The comment seems to say that its not safe wrt interrupts. But looking at
>> the code in include/linux/percpu.h, IIUC, that is true only about
>> this_cpu_read() because it only disables preemption.
>>
>> However, this_cpu_inc() looks safe wrt interrupts because it wraps the
>> increment within raw_local_irqsave()/restore().
> 
> You mean _this_cpu_generic_to_op() I guess. So yes, I think you are right,
> this_cpu_* should be irq-safe, but __this_cpu_* is not.
> 

Yes.

> Thanks.
> 
> At least on x86 there is no difference between this_ and __this_, both do
> percpu_add_op() without local_irq_disable/enable. But it seems that most
> of architectures use generic code.
> 

So now that we can't avoid disabling and enabling interrupts, I was
wondering if we could exploit this to avoid the smp_mb()..

Maybe this is a stupid question, but I'll shoot it anyway...
Does local_irq_disable()/enable provide any ordering guarantees by any chance?
I think the answer is no, but if it is yes, I guess we can do as shown
below to ensure that STORE(reader_percpu_refcnt) happens before
LOAD(writer_signal).

void get_online_cpus_atomic(void)
{
	unsigned long flags;

	preempt_disable();

	//only for writer
	local_irq_save(flags);
	__this_cpu_add(reader_percpu_refcnt, XXXX);
	local_irq_restore(flags);

	//no need of an explicit smp_mb()

	if (__this_cpu_read(reader_percpu_refcnt) & MASK) {
		this_cpu_inc(reader_percpu_refcnt);
	} else if (writer_active()) {
		...
	}

	this_cpu_sub(reader_percpu_refcnt, XXXX);

}

I tried thinking about other ways to avoid that smp_mb() in the reader,
but was unsuccessful. So if the above assumption is wrong, I guess we'll
just have to go with the version that uses synchronize_sched() at the
writer-side.

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-18 15:53                       ` Srivatsa S. Bhat
@ 2012-12-18 19:43                         ` Oleg Nesterov
  2012-12-18 20:06                           ` Srivatsa S. Bhat
  0 siblings, 1 reply; 40+ messages in thread
From: Oleg Nesterov @ 2012-12-18 19:43 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/18, Srivatsa S. Bhat wrote:
>
> So now that we can't avoid disabling and enabling interrupts,

Still I think it would be better to not use local_irq_save/restore
directly. And,

> I was
> wondering if we could exploit this to avoid the smp_mb()..
>
> Maybe this is a stupid question, but I'll shoot it anyway...
> Does local_irq_disable()/enable provide any ordering guarantees by any chance?

Oh, I do not know.

But please look at the comment above prepare_to_wait(). It assumes
that even spin_unlock_irqrestore() is not the full barrier.

In any case. get_online_cpus_atomic() has to use irq_restore, not
irq_enable. And _restore does nothing "special" if irqs were already
disabled, so I think we can't rely on sti.
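
I.e. a caller like this is perfectly legal, and the inner restore leaves irqs
disabled, so there is no sti to piggy-back on (sketch, names illustrative):

	static void example_caller(void)
	{
		unsigned long flags;

		local_irq_save(flags);		/* irqs already off here */

		get_online_cpus_atomic();	/* its local_irq_restore()     */
						/* keeps irqs disabled, no sti */
		/* ... use cpu_online_mask ... */
		put_online_cpus_atomic();

		local_irq_restore(flags);
	}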

> I tried thinking about other ways to avoid that smp_mb() in the reader,

Just in case, I think there is no way to avoid mb() in _get (although
perhaps it can be "implicit").

The writer changes cpu_online_mask and drops the lock. We need to ensure
that the reader sees the change in cpu_online_mask after _get returns.

> but was unsuccessful. So if the above assumption is wrong, I guess we'll
> just have to go with the version that uses synchronize_sched() at the
> writer-side.

In this case we can also convert get_online_cpus() to use percpu_rwsem
and avoid mutex_lock(&cpu_hotplug.lock), but this is minor I guess.
I do not think get_online_cpus() is called too often.
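
(Something along these lines, sketch only, using the existing
percpu_rw_semaphore primitives:

	static struct percpu_rw_semaphore hotplug_rwsem;

	void get_online_cpus(void)
	{
		might_sleep();
		percpu_down_read(&hotplug_rwsem);
	}

	void put_online_cpus(void)
	{
		percpu_up_read(&hotplug_rwsem);
	}

with cpu_hotplug_begin()/cpu_hotplug_done() taking it for write via
percpu_down_write()/percpu_up_write().)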

Oleg.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-18 19:43                         ` Oleg Nesterov
@ 2012-12-18 20:06                           ` Srivatsa S. Bhat
  2012-12-19 16:39                             ` Oleg Nesterov
  0 siblings, 1 reply; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-18 20:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/19/2012 01:13 AM, Oleg Nesterov wrote:
> On 12/18, Srivatsa S. Bhat wrote:
>>
>> So now that we can't avoid disabling and enabling interrupts,
> 
> Still I think it would be better to not use local_irq_save/restore
> directly.

Sure, we can use this_cpu_add() itself. I explicitly used
local_irq_save/restore here just to explain my question.

> And,
> 
>> I was
>> wondering if we could exploit this to avoid the smp_mb()..
>>
>> Maybe this is a stupid question, but I'll shoot it anyway...
>> Does local_irq_disable()/enable provide any ordering guarantees by any chance?
> 
> Oh, I do not know.
> 
> But please look at the comment above prepare_to_wait(). It assumes
> that even spin_unlock_irqrestore() is not the full barrier.
> 

Semi-permeable barrier.. Hmm.. 

> In any case. get_online_cpus_atomic() has to use irq_restore, not
> irq_enable. And _restore does nothing "special" if irqs were already
> disabled, so I think we can't rely on sti.
> 

Right, I forgot about the _restore part.

>> I tried thinking about other ways to avoid that smp_mb() in the reader,
> 
> Just in case, I think there is no way to avoid mb() in _get (although
> perhaps it can be "implicit").
> 

Actually, I was trying to avoid mb() in the _fastpath_, when there is no
active writer. I missed stating that clearly, sorry.

> The writer changes cpu_online_mask and drops the lock. We need to ensure
> that the reader sees the change in cpu_online_mask after _get returns.
> 

The write_unlock() will ensure the completion of the update to cpu_online_mask,
using the same semi-permeable logic that you pointed out above. So readers will
see the update as soon as the writer releases the lock, right?

>> but was unsuccessful. So if the above assumption is wrong, I guess we'll
>> just have to go with the version that uses synchronize_sched() at the
>> writer-side.
> 
> In this case we can also convert get_online_cpus() to use percpu_rwsem
> and avoid mutex_lock(&cpu_hotplug.lock), but this is minor I guess.
> I do not think get_online_cpus() is called too often.
> 

Yes, we could do that as well. I remember you saying that you had some
patches for percpu_rwsem to help use it in cpu hotplug code (to make it
recursive, IIRC).

So, I guess we'll go with the synchronize_sched() approach for percpu rwlocks
then. Tejun, it is still worthwhile to expose this as a generic percpu rwlock
and then use it inside cpu hotplug code, right?


Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-18 20:06                           ` Srivatsa S. Bhat
@ 2012-12-19 16:39                             ` Oleg Nesterov
  2012-12-19 18:16                               ` Srivatsa S. Bhat
  0 siblings, 1 reply; 40+ messages in thread
From: Oleg Nesterov @ 2012-12-19 16:39 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

(sorry if you see this email twice)

On 12/19, Srivatsa S. Bhat wrote:
>
> On 12/19/2012 01:13 AM, Oleg Nesterov wrote:
>
> >> I tried thinking about other ways to avoid that smp_mb() in the reader,
> >
> > Just in case, I think there is no way to avoid mb() in _get (although
> > perhaps it can be "implicit").
> >
>
> Actually, I was trying to avoid mb() in the _fastpath_, when there is no
> active writer. I missed stating that clearly, sorry.

Yes, I meant the fastpath too,

> > The writer changes cpu_online_mask and drops the lock. We need to ensure
> > that the reader sees the change in cpu_online_mask after _get returns.
> >
>
> The write_unlock() will ensure the completion of the update to cpu_online_mask,
> using the same semi-permeable logic that you pointed above. So readers will
> see the update as soon as the writer releases the lock, right?

Why?

Once again, the writer (say) removes CPU_1 from cpu_online_mask, and
sets writer_signal[0] = 0, this "enables" fast path on CPU_0.

But, without a barrier, there is no guarantee that CPU_0 will see the
change in cpu_online_mask after get_online_cpus_atomic() checks
writer_signal[0] and returns.

Hmm. And this means that the last version lacks the additional rmb()
(at least) if writer_active() == F.

Oleg.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-19 16:39                             ` Oleg Nesterov
@ 2012-12-19 18:16                               ` Srivatsa S. Bhat
  2012-12-19 19:14                                 ` Oleg Nesterov
  0 siblings, 1 reply; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-19 18:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/19/2012 10:09 PM, Oleg Nesterov wrote:
> (sorry if you see this email twice)
> 
> On 12/19, Srivatsa S. Bhat wrote:
>>
>> On 12/19/2012 01:13 AM, Oleg Nesterov wrote:
>>
>>>> I tried thinking about other ways to avoid that smp_mb() in the reader,
>>>
>>> Just in case, I think there is no way to avoid mb() in _get (although
>>> perhaps it can be "implicit").
>>>
>>
>> Actually, I was trying to avoid mb() in the _fastpath_, when there is no
>> active writer. I missed stating that clearly, sorry.
> 
> Yes, I meant the fastpath too,
> 
>>> The writer changes cpu_online_mask and drops the lock. We need to ensure
>>> that the reader sees the change in cpu_online_mask after _get returns.
>>>
>>
>> The write_unlock() will ensure the completion of the update to cpu_online_mask,
>> using the same semi-permeable logic that you pointed above. So readers will
>> see the update as soon as the writer releases the lock, right?
> 
> Why?
> 
> Once again, the writer (say) removes CPU_1 from cpu_online_mask, and
> sets writer_signal[0] = 0, this "enables" fast path on CPU_0.
> 
> But, without a barrier, there is no guarantee that CPU_0 will see the
> change in cpu_online_mask after get_online_cpus_atomic() checks
> writer_signal[0] and returns.
> 

Hmm, because a reader's code can get reordered to something like:

read cpu_online_mask

get_online_cpus_atomic()

You are right, I missed that.

> Hmm. And this means that the last version lacks the additional rmb()
> (at least) if writer_active() == F.
> 

Yes, the fast path needs that smp_rmb(), whereas the slow-path is fine,
because it takes the read_lock().

BTW, there is a small problem with the synchronize_sched() approach:
We can't generalize the synchronization scheme as a generic percpu rwlock
because the writer's role is split into 2, the first part (the one having
synchronize_sched()) which should be run in process context, and the
second part (the rest of it), which can be run in atomic context.

That is, needing the writer to be able to sleep (due to synchronize_sched())
will make the locking scheme unusable (in other usecases) I think. And
the split (blocking and non-blocking part) itself seems hard to express
as a single writer API.

Hmmm.. so what do we do? Shall we say "We anyway have to do smp_rmb() in the
reader in the fast path. Instead let's do a full smp_mb() and get rid of
the synchronize_sched() in the writer, and thereby expose this synchronization
scheme as generic percpu rwlocks?" Or should we rather use synchronize_sched()
and keep this synchronization scheme restricted to CPU hotplug only?

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-19 18:16                               ` Srivatsa S. Bhat
@ 2012-12-19 19:14                                 ` Oleg Nesterov
  2012-12-19 19:49                                   ` Srivatsa S. Bhat
  0 siblings, 1 reply; 40+ messages in thread
From: Oleg Nesterov @ 2012-12-19 19:14 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/19, Srivatsa S. Bhat wrote:
>
> BTW, there is a small problem with the synchronize_sched() approach:
> We can't generalize the synchronization scheme as a generic percpu rwlock
> because the writer's role is split into 2, the first part (the one having
> synchronize_sched()) which should be run in process context, and the
> second part (the rest of it), which can be run in atomic context.

Yes,

> That is, needing the writer to be able to sleep (due to synchronize_sched())
> will make the locking scheme unusable (in other usecases) I think. And
> the split (blocking and non-blocking part) itself seems hard to express
> as a single writer API.

I do not think this is the problem.

We need 2 helpers for writer, the 1st one does synchronize_sched() and the
2nd one takes rwlock. A generic percpu_write_lock() simply calls them both.

In fact I think that the 1st helper should not do synchronize_sched(),
and (say) cpu_hotplug_begin() should call it itself. Because if we also
change cpu_hotplug_begin() to use percpu_rw_sem we do not want to do
synchronize_sched() twice. But lets ignore this for now.
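
Something like this, say (a very rough sketch; the helper names are made up):

	/* helper 1: process context only, may sleep */
	void percpu_writer_sync(void)
	{
		announce_writer_presence();	/* set writer_signal, hypothetical */
		synchronize_sched();		/* wait out pre-existing fast-path readers */
	}

	/* helper 2: takes the lock, can run in atomic context */
	void percpu_writer_lock(void)
	{
		write_lock(&hotplug_rwlock);
	}

	/* the generic API simply calls both */
	void percpu_write_lock(void)
	{
		might_sleep();
		percpu_writer_sync();
		percpu_writer_lock();
	}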

But,

> Hmmm.. so what do we do? Shall we say "We anyway have to do smp_rmb() in the
> reader in the fast path. Instead let's do a full smp_mb() and get rid of
> the synchronize_sched() in the writer, and thereby expose this synchronization
> scheme as generic percpu rwlocks?" Or should we rather use synchronize_sched()
> and keep this synchronization scheme restricted to CPU hotplug only?

Oh, I do not know ;)

To me, the main question is: can we use synchronize_sched() in cpu_down?
It is slow.

Oleg.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-19 19:14                                 ` Oleg Nesterov
@ 2012-12-19 19:49                                   ` Srivatsa S. Bhat
  2012-12-20 13:42                                     ` Oleg Nesterov
  0 siblings, 1 reply; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-19 19:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/20/2012 12:44 AM, Oleg Nesterov wrote:
> On 12/19, Srivatsa S. Bhat wrote:
>>
>> BTW, there is a small problem with the synchronize_sched() approach:
>> We can't generalize the synchronization scheme as a generic percpu rwlock
>> because the writer's role is split into 2, the first part (the one having
>> synchronize_sched()) which should be run in process context, and the
>> second part (the rest of it), which can be run in atomic context.
> 
> Yes,
> 
>> That is, needing the writer to be able to sleep (due to synchronize_sched())
>> will make the locking scheme unusable (in other usecases) I think. And
>> the split (blocking and non-blocking part) itself seems hard to express
>> as a single writer API.
> 
> I do not think this is the problem.
> 
> We need 2 helpers for writer, the 1st one does synchronize_sched() and the
> 2nd one takes rwlock. A generic percpu_write_lock() simply calls them both.
>

Ah, that's the problem no? Users of reader-writer locks expect to run in
atomic context (ie., they don't want to sleep). We can't expose an API that
can make the task go to sleep under the covers!
(And that too, definitely not with a name such as "percpu_write_lock_irqsave()"
... because we are going to be interrupt-safe as well, in the second part...).
 
> In fact I think that the 1st helper should not do synchronize_sched(),
> and (say) cpu_hotplug_begin() should call it itself. Because if we also
> change cpu_hotplug_begin() to use percpu_rw_sem we do not want to do
> synchronize_sched() twice. But lets ignore this for now.
> 

Yeah, let's ignore this for now, else I'll get all confused ;-)

> But,
> 
>> Hmmm.. so what do we do? Shall we say "We anyway have to do smp_rmb() in the
>> reader in the fast path. Instead let's do a full smp_mb() and get rid of
>> the synchronize_sched() in the writer, and thereby expose this synchronization
>> scheme as generic percpu rwlocks?" Or should we rather use synchronize_sched()
>> and keep this synchronization scheme restricted to CPU hotplug only?
> 
> Oh, I do not know ;)
> 
> To me, the main question is: can we use synchronize_sched() in cpu_down?
> It is slow.
>

Haha :-) So we don't want smp_mb() in the reader, *and* also don't want
synchronize_sched() in the writer! Sounds like saying we want to have the cake
and eat it too ;-) :P

But seriously speaking, I understand that its a bit of a hard choice to make.
On one side, we can avoid synchronize_sched() at the writer, but have to bear
the burden of smp_mb() at the reader even in the fastpath, when no writer is active.
On the other side, we can make the reader avoid smp_mb(), but the writer has to
suffer a full synchronize_sched(). But an important point is that, at the writer
side, we already wait for so many mutex locks and stuff, like in the CPU_DOWN_PREPARE
stage (as determined by the callbacks). So adding a synchronize_sched() to the mix
shouldn't make it noticeably bad, should it?

(I'm not arguing in favor of using synchronize_sched(). I'm just trying to point
out that using it might not be as bad as we think it is, in the CPU hotplug case.
And moreover, since I'm still not convinced about the writer API part if we use
synchronize_sched(), I'd rather avoid synchronize_sched().)

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-19 19:49                                   ` Srivatsa S. Bhat
@ 2012-12-20 13:42                                     ` Oleg Nesterov
  2012-12-20 14:06                                       ` Srivatsa S. Bhat
  2012-12-22 20:17                                       ` Srivatsa S. Bhat
  0 siblings, 2 replies; 40+ messages in thread
From: Oleg Nesterov @ 2012-12-20 13:42 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/20, Srivatsa S. Bhat wrote:
>
> On 12/20/2012 12:44 AM, Oleg Nesterov wrote:
> >
> > We need 2 helpers for writer, the 1st one does synchronize_sched() and the
> > 2nd one takes rwlock. A generic percpu_write_lock() simply calls them both.
> >
>
> Ah, that's the problem no? Users of reader-writer locks expect to run in
> atomic context (ie., they don't want to sleep).

Ah, I misunderstood.

Sure, percpu_write_lock() should be might_sleep(), and this is not
symmetric to percpu_read_lock().

> We can't expose an API that
> can make the task go to sleep under the covers!

Why? Just this should be documented. However I would not worry until we
find another user. Until then we do not even need to add percpu_write_lock
or try to generalize this code too much.

> > To me, the main question is: can we use synchronize_sched() in cpu_down?
> > It is slow.
> >
>
> Haha :-) So we don't want smp_mb() in the reader,

We need mb() + rmb(). Plus cli/sti unless this arch has optimized
this_cpu_add() like x86 (as you pointed out).

> *and* also don't want
> synchronize_sched() in the writer! Sounds like saying we want to have the cake
> and eat it too ;-) :P

Personally I'd vote for synchronize_sched() but I am not sure. And I do
not really understand the problem space.

> And moreover, since I'm still not convinced about the writer API part if use
> synchronize_sched(), I'd rather avoid synchronize_sched().)

Understand.

And yes, synchronize_sched() adds more problems. For example, where should
we call it? I do not think _cpu_down() should do this, in this case, say,
disable_nonboot_cpus() needs num_online_cpus() synchronize_sched's.

So probably cpu_down() should call it before cpu_maps_update_begin(), this
makes the locking even less obvious.
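
I.e. roughly (sketch only; the real cpu_down() has more checks in it):

	int cpu_down(unsigned int cpu)
	{
		int err;

		/*
		 * Do the sleeping part of the writer's work (the
		 * synchronize_sched()) here, in process context, before
		 * taking the hotplug locks.
		 */
		percpu_writer_sync();	/* hypothetical helper doing synchronize_sched() */

		cpu_maps_update_begin();
		err = _cpu_down(cpu, 0);
		cpu_maps_update_done();

		return err;
	}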

In short. What I am trying to say is, don't ask me I do not know ;)

Oleg.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-20 13:42                                     ` Oleg Nesterov
@ 2012-12-20 14:06                                       ` Srivatsa S. Bhat
  2012-12-22 20:17                                       ` Srivatsa S. Bhat
  1 sibling, 0 replies; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-20 14:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/20/2012 07:12 PM, Oleg Nesterov wrote:
> On 12/20, Srivatsa S. Bhat wrote:
>>
>> On 12/20/2012 12:44 AM, Oleg Nesterov wrote:
>>>
>>> We need 2 helpers for writer, the 1st one does synchronize_sched() and the
>>> 2nd one takes rwlock. A generic percpu_write_lock() simply calls them both.
>>>
>>
>> Ah, that's the problem no? Users of reader-writer locks expect to run in
>> atomic context (ie., they don't want to sleep).
> 
> Ah, I misunderstood.
> 
> Sure, percpu_write_lock() should be might_sleep(), and this is not
> symmetric to percpu_read_lock().
> 
>> We can't expose an API that
>> can make the task go to sleep under the covers!
> 
> Why? Just this should be documented. However I would not worry until we
> find another user. Until then we do not even need to add percpu_write_lock
> or try to generalize this code too much.
> 

Hmm.. But considering the disable_nonboot_cpus() case you mentioned below, I'm
only getting farther away from using synchronize_sched() ;-) And that also makes
it easier to expose a generic percpu rwlock API, like Tejun was suggesting.
So I'll give it a shot.

>>> To me, the main question is: can we use synchronize_sched() in cpu_down?
>>> It is slow.
>>>
>>
>> Haha :-) So we don't want smp_mb() in the reader,
> 
> We need mb() + rmb(). Plust cli/sti unless this arch has optimized
> this_cpu_add() like x86 (as you pointed out).
> 
>> *and* also don't want
>> synchronize_sched() in the writer! Sounds like saying we want to have the cake
>> and eat it too ;-) :P
> 
> Personally I'd vote for synchronize_sched() but I am not sure. And I do
> not really understand the problem space.
> 
>> And moreover, since I'm still not convinced about the writer API part if use
>> synchronize_sched(), I'd rather avoid synchronize_sched().)
> 
> Understand.
> 
> And yes, synchronize_sched() adds more problems. For example, where should
> we call it? I do not this _cpu_down() should do this, in this case, say,
> disable_nonboot_cpus() needs num_online_cpus() synchronize_sched's.
> 

Ouch! I should have seen that coming!

> So probably cpu_down() should call it before cpu_maps_update_begin(), this
> makes the locking even less obvious.
> 

True.

> In short. What I am trying to say is, don't ask me I do not know ;)
>

OK then, I'll go with what I believe is a reasonably good way (not necessarily
the best way) to deal with this:

I'll avoid the use of synchronize_sched(), expose a decent-looking percpu
rwlock implementation, use it in CPU hotplug and get rid of stop_machine().
That would certainly be a good starting base, IMHO.

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-20 13:42                                     ` Oleg Nesterov
  2012-12-20 14:06                                       ` Srivatsa S. Bhat
@ 2012-12-22 20:17                                       ` Srivatsa S. Bhat
  2012-12-23 16:42                                         ` Oleg Nesterov
  1 sibling, 1 reply; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-22 20:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/20/2012 07:12 PM, Oleg Nesterov wrote:
> On 12/20, Srivatsa S. Bhat wrote:
>>
>> On 12/20/2012 12:44 AM, Oleg Nesterov wrote:
>>>
>>> We need 2 helpers for writer, the 1st one does synchronize_sched() and the
>>> 2nd one takes rwlock. A generic percpu_write_lock() simply calls them both.
>>>
>>
>> Ah, that's the problem no? Users of reader-writer locks expect to run in
>> atomic context (ie., they don't want to sleep).
> 
> Ah, I misunderstood.
> 
> Sure, percpu_write_lock() should be might_sleep(), and this is not
> symmetric to percpu_read_lock().
> 
>> We can't expose an API that
>> can make the task go to sleep under the covers!
> 
> Why? Just this should be documented. However I would not worry until we
> find another user. Until then we do not even need to add percpu_write_lock
> or try to generalize this code too much.
> 
>>> To me, the main question is: can we use synchronize_sched() in cpu_down?
>>> It is slow.
>>>
>>
>> Haha :-) So we don't want smp_mb() in the reader,
> 
> We need mb() + rmb(). Plust cli/sti unless this arch has optimized
> this_cpu_add() like x86 (as you pointed out).
> 

Hey, IIUC, we actually don't need mb() in the reader!! Just an rmb() will do.

This is the reader code I have so far:

#define reader_nested_percpu()						\
	     (__this_cpu_read(reader_percpu_refcnt) & READER_REFCNT_MASK)

#define writer_active()							\
				(__this_cpu_read(writer_signal))


#define READER_PRESENT		(1UL << 16)
#define READER_REFCNT_MASK	(READER_PRESENT - 1)

void get_online_cpus_atomic(void)
{
	preempt_disable();

	/*
	 * First and foremost, make your presence known to the writer.
	 */
	this_cpu_add(reader_percpu_refcnt, READER_PRESENT);

	/*
	 * If we are already using per-cpu refcounts, it is not safe to switch
	 * the synchronization scheme. So continue using the refcounts.
	 */
	if (reader_nested_percpu()) {
		this_cpu_inc(reader_percpu_refcnt);
	} else {
		smp_rmb();
		if (unlikely(writer_active())) {
			... //take hotplug_rwlock
		}
	}

	...

	/* Prevent reordering of any subsequent reads of cpu_online_mask. */
	smp_rmb();
}

The smp_rmb() before writer_active() ensures that LOAD(writer_signal) follows
LOAD(reader_percpu_refcnt) (at the 'if' condition). And in turn, that load is
automatically going to follow the STORE(reader_percpu_refcnt) (at this_cpu_add())
due to the data dependency. So it is something like a transitive relation.

So, the result is that, we mark ourselves as active in reader_percpu_refcnt before
we check writer_signal. This is exactly what we wanted to do right?
And luckily, due to the dependency, we can achieve it without using the heavy
smp_mb(). And, we can't crib about the smp_rmb() because it is unavoidable anyway
(because we want to prevent reordering of the reads to cpu_online_mask, like you
pointed out earlier).

I hope I'm not missing anything...

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-22 20:17                                       ` Srivatsa S. Bhat
@ 2012-12-23 16:42                                         ` Oleg Nesterov
  2012-12-24 15:50                                           ` Srivatsa S. Bhat
  0 siblings, 1 reply; 40+ messages in thread
From: Oleg Nesterov @ 2012-12-23 16:42 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/23, Srivatsa S. Bhat wrote:
>
> On 12/20/2012 07:12 PM, Oleg Nesterov wrote:
> >
> > We need mb() + rmb(). Plust cli/sti unless this arch has optimized
> > this_cpu_add() like x86 (as you pointed out).
> >
>
> Hey, IIUC, we actually don't need mb() in the reader!! Just an rmb() will do.

Well. I don't think so. But when it comes to the barriers I am never sure
until Paul confirms my understanding ;)

> #define reader_nested_percpu()						\
> 	     (__this_cpu_read(reader_percpu_refcnt) & READER_REFCNT_MASK)
>
> #define writer_active()							\
> 				(__this_cpu_read(writer_signal))
>
>
> #define READER_PRESENT		(1UL << 16)
> #define READER_REFCNT_MASK	(READER_PRESENT - 1)
>
> void get_online_cpus_atomic(void)
> {
> 	preempt_disable();
>
> 	/*
> 	 * First and foremost, make your presence known to the writer.
> 	 */
> 	this_cpu_add(reader_percpu_refcnt, READER_PRESENT);
>
> 	/*
> 	 * If we are already using per-cpu refcounts, it is not safe to switch
> 	 * the synchronization scheme. So continue using the refcounts.
> 	 */
> 	if (reader_nested_percpu()) {
> 		this_cpu_inc(reader_percpu_refcnt);
> 	} else {
> 		smp_rmb();
> 		if (unlikely(writer_active())) {
> 			... //take hotplug_rwlock
> 		}
> 	}
>
> 	...
>
> 	/* Prevent reordering of any subsequent reads of cpu_online_mask. */
> 	smp_rmb();
> }
>
> The smp_rmb() before writer_active() ensures that LOAD(writer_signal) follows
> LOAD(reader_percpu_refcnt) (at the 'if' condition). And in turn, that load is
> automatically going to follow the STORE(reader_percpu_refcnt)

But why should this STORE be visible on another CPU before we LOAD(writer_signal)?

Let's discuss a simple and artificial example. Suppose we have

	int X, Y;

	int func(void)
	{
		X = 1;	// suppose that nobody else can change it
		mb();
		return Y;
	}

Now you are saying that we can change it and avoid the costly mb():

	int func(void)
	{
		X = 1;

		if (X != 1)
			BUG();
	
		rmb();
		return Y;
	}

I doubt. rmb() can only guarantee that the preceding LOAD's should be
completed. Without mb() it is possible that this CPU won't write X to
memory at all.

Oleg.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  2012-12-23 16:42                                         ` Oleg Nesterov
@ 2012-12-24 15:50                                           ` Srivatsa S. Bhat
  0 siblings, 0 replies; 40+ messages in thread
From: Srivatsa S. Bhat @ 2012-12-24 15:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

On 12/23/2012 10:12 PM, Oleg Nesterov wrote:
> On 12/23, Srivatsa S. Bhat wrote:
>>
>> On 12/20/2012 07:12 PM, Oleg Nesterov wrote:
>>>
>>> We need mb() + rmb(). Plust cli/sti unless this arch has optimized
>>> this_cpu_add() like x86 (as you pointed out).
>>>
>>
>> Hey, IIUC, we actually don't need mb() in the reader!! Just an rmb() will do.
> 
> Well. I don't think so. But when it comes to the barriers I am never sure
> until Paul confirms my understanding ;)
> 
>> #define reader_nested_percpu()						\
>> 	     (__this_cpu_read(reader_percpu_refcnt) & READER_REFCNT_MASK)
>>
>> #define writer_active()							\
>> 				(__this_cpu_read(writer_signal))
>>
>>
>> #define READER_PRESENT		(1UL << 16)
>> #define READER_REFCNT_MASK	(READER_PRESENT - 1)
>>
>> void get_online_cpus_atomic(void)
>> {
>> 	preempt_disable();
>>
>> 	/*
>> 	 * First and foremost, make your presence known to the writer.
>> 	 */
>> 	this_cpu_add(reader_percpu_refcnt, READER_PRESENT);
>>
>> 	/*
>> 	 * If we are already using per-cpu refcounts, it is not safe to switch
>> 	 * the synchronization scheme. So continue using the refcounts.
>> 	 */
>> 	if (reader_nested_percpu()) {
>> 		this_cpu_inc(reader_percpu_refcnt);
>> 	} else {
>> 		smp_rmb();
>> 		if (unlikely(writer_active())) {
>> 			... //take hotplug_rwlock
>> 		}
>> 	}
>>
>> 	...
>>
>> 	/* Prevent reordering of any subsequent reads of cpu_online_mask. */
>> 	smp_rmb();
>> }
>>
>> The smp_rmb() before writer_active() ensures that LOAD(writer_signal) follows
>> LOAD(reader_percpu_refcnt) (at the 'if' condition). And in turn, that load is
>> automatically going to follow the STORE(reader_percpu_refcnt)
> 
> But why this STORE should be visible on another CPU before we LOAD(writer_signal)?
> 
> Lets discuss the simple and artificial example. Suppose we have
> 
> 	int X, Y;
> 
> 	int func(void)
> 	{
> 		X = 1;	// suppose that nobody else can change it
> 		mb();
> 		return Y;
> 	}
> 
> Now you are saying that we can change it and avoid the costly mb():
> 
> 	int func(void)
> 	{
> 		X = 1;
> 
> 		if (X != 1)
> 			BUG();
> 	
> 		rmb();
> 		return Y;
> 	}
> 
> I doubt. rmb() can only guarantee that the preceding LOAD's should be
> completed. Without mb() it is possible that this CPU won't write X to
> memory at all.
> 

Oh, ok :-( Thanks for correcting me and for the detailed explanation!
For a moment, I really thought we had it solved at last! ;-(

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread

Thread overview: 40+ messages
2012-12-11 14:03 [RFC PATCH v4 0/9] CPU hotplug: stop_machine()-free CPU hotplug Srivatsa S. Bhat
2012-12-11 14:04 ` [RFC PATCH v4 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context Srivatsa S. Bhat
2012-12-12 17:17   ` Oleg Nesterov
2012-12-12 17:24     ` Oleg Nesterov
2012-12-12 18:11       ` Srivatsa S. Bhat
2012-12-12 18:23         ` Oleg Nesterov
2012-12-12 18:42           ` Srivatsa S. Bhat
2012-12-12 17:53     ` Srivatsa S. Bhat
2012-12-12 18:02       ` Oleg Nesterov
2012-12-12 18:30         ` Srivatsa S. Bhat
2012-12-12 18:48           ` Oleg Nesterov
2012-12-12 19:12             ` Srivatsa S. Bhat
2012-12-13 15:26               ` Srivatsa S. Bhat
2012-12-13 16:17                 ` Oleg Nesterov
2012-12-13 16:32                   ` Srivatsa S. Bhat
2012-12-14 18:03                     ` Oleg Nesterov
2012-12-18 15:53                       ` Srivatsa S. Bhat
2012-12-18 19:43                         ` Oleg Nesterov
2012-12-18 20:06                           ` Srivatsa S. Bhat
2012-12-19 16:39                             ` Oleg Nesterov
2012-12-19 18:16                               ` Srivatsa S. Bhat
2012-12-19 19:14                                 ` Oleg Nesterov
2012-12-19 19:49                                   ` Srivatsa S. Bhat
2012-12-20 13:42                                     ` Oleg Nesterov
2012-12-20 14:06                                       ` Srivatsa S. Bhat
2012-12-22 20:17                                       ` Srivatsa S. Bhat
2012-12-23 16:42                                         ` Oleg Nesterov
2012-12-24 15:50                                           ` Srivatsa S. Bhat
2012-12-13 16:32                   ` Tejun Heo
2012-12-12 19:36         ` Oleg Nesterov
2012-12-12 19:43           ` Srivatsa S. Bhat
2012-12-12 21:10             ` Oleg Nesterov
2012-12-11 14:04 ` [RFC PATCH v4 2/9] CPU hotplug: Convert preprocessor macros to static inline functions Srivatsa S. Bhat
2012-12-11 14:04 ` [RFC PATCH v4 3/9] smp, cpu hotplug: Fix smp_call_function_*() to prevent CPU offline properly Srivatsa S. Bhat
2012-12-11 14:04 ` [RFC PATCH v4 4/9] smp, cpu hotplug: Fix on_each_cpu_*() " Srivatsa S. Bhat
2012-12-11 14:05 ` [RFC PATCH v4 5/9] sched, cpu hotplug: Use stable online cpus in try_to_wake_up() & select_task_rq() Srivatsa S. Bhat
2012-12-11 14:05 ` [RFC PATCH v4 6/9] kick_process(), cpu-hotplug: Prevent offlining of target CPU properly Srivatsa S. Bhat
2012-12-11 14:05 ` [RFC PATCH v4 7/9] yield_to(), cpu-hotplug: Prevent offlining of other CPUs properly Srivatsa S. Bhat
2012-12-11 14:05 ` [RFC PATCH v4 8/9] kvm, vmx: Add atomic synchronization with CPU Hotplug Srivatsa S. Bhat
2012-12-11 14:05 ` [RFC PATCH v4 9/9] cpu: No more __stop_machine() in _cpu_down() Srivatsa S. Bhat
