* [v7 0/8] Reduce cross CPU IPI interference
@ 2012-01-26 10:01 Gilad Ben-Yossef
  2012-01-26 10:01 ` [v7 1/8] smp: introduce a generic on_each_cpu_mask function Gilad Ben-Yossef
                   ` (8 more replies)
  0 siblings, 9 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-26 10:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gilad Ben-Yossef, Christoph Lameter, Chris Metcalf,
	Peter Zijlstra, Frederic Weisbecker, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman,
	Andrew Morton, Alexander Viro, Avi Kivity, Michal Nazarewicz,
	Kosaki Motohiro, Milton Miller

We have lots of infrastructure in place to partition multi-core systems
such that we have a group of CPUs that are dedicated to a specific task:
cgroups, scheduler and interrupt affinity, and the isolcpus= boot parameter.
Still, kernel code will at times interrupt all CPUs in the system via IPIs
for various needs. These IPIs are useful and cannot be avoided altogether,
but in certain cases it is possible to interrupt only specific CPUs that
have useful work to do and not the entire system.

This patch set, inspired by discussions with Peter Zijlstra and Frederic
Weisbecker when testing the nohz task patch set, is a first stab at trying
to explore doing this by locating the places where such global IPI calls
are being made and turning the global IPI into an IPI for a specific group
of CPUs. The purpose of the patch set is to get feedback on whether this
is the right way to deal with this issue and indeed, whether the issue is
even worth dealing with at all. Based on the feedback from this patch set
I plan to offer further patches that address similar issues in other code
paths.

The patch set creates an on_each_cpu_mask and on_each_cpu_cond infrastructure
API (the former derived from existing arch specific versions in Tile and
ARM) and uses them to turn several global IPI invocations into per CPU
group invocations.
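
For reference, these are the prototypes of the two interfaces as
introduced by patches 1 and 4 of the series:

void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
		void *info, bool wait);

void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
		smp_call_func_t func, void *info, bool wait,
		gfp_t gfpflags);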

This 7th iteration includes the following changes, all based on feedback 
from Milton Miller:

- Use a static cpumask_t to track CPUs with pcps in drain_all_pages.
- Fix logic bug sending an IPI based on state of pcps in last zone only
  (and re-run tests to make sure we still see the benefits).
- Use bool and smp_call_func_t for on_each_cpu_cond prototype.
- Accept a GFP flags parameter in on_each_cpu_cond for cpumask allocation.
- Disable preemption around for_each_online_cpu in on_each_cpu_cond to avoid
  racing with hotplug events.
- Use bool and smp_call_func_t for the on_each_cpu_mask prototype.
- Multiple documentation and description fixes and improvements.

The patch set is also available from the ipi_noise_v7 branch at
git://github.com/gby/linux.git

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
CC: Christoph Lameter <cl@linux.com>
CC: Chris Metcalf <cmetcalf@tilera.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: linux-mm@kvack.org
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Avi Kivity <avi@redhat.com>
CC: Michal Nazarewicz <mina86@mina86.com>
CC: Kosaki Motohiro <kosaki.motohiro@gmail.com>
CC: Milton Miller <miltonm@bga.com>

Gilad Ben-Yossef (8):
  smp: introduce a generic on_each_cpu_mask function
  arm: move arm over to generic on_each_cpu_mask
  tile: move tile to use generic on_each_cpu_mask
  smp: add func to IPI cpus based on parameter func
  slub: only IPI CPUs that have per cpu obj to flush
  fs: only send IPI to invalidate LRU BH when needed
  mm: only IPI CPUs to drain local pages if they exist
  mm: add vmstat counters for tracking PCP drains

 arch/arm/kernel/smp_tlb.c     |   20 ++-------
 arch/tile/include/asm/smp.h   |    7 ---
 arch/tile/kernel/smp.c        |   19 ---------
 fs/buffer.c                   |   15 +++++++-
 include/linux/smp.h           |   41 +++++++++++++++++++
 include/linux/vm_event_item.h |    1 +
 kernel/smp.c                  |   87 +++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c               |   36 ++++++++++++++++-
 mm/slub.c                     |   10 ++++-
 mm/vmstat.c                   |    2 +
 10 files changed, 194 insertions(+), 44 deletions(-)



* [v7 1/8] smp: introduce a generic on_each_cpu_mask function
  2012-01-26 10:01 [v7 0/8] Reduce cross CPU IPI interference Gilad Ben-Yossef
@ 2012-01-26 10:01 ` Gilad Ben-Yossef
  2012-01-29 12:24   ` Gilad Ben-Yossef
  2012-01-26 10:01 ` [v7 2/8] arm: move arm over to generic on_each_cpu_mask Gilad Ben-Yossef
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-26 10:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gilad Ben-Yossef, Chris Metcalf, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Pekka Enberg,
	Matt Mackall, Rik van Riel, Andi Kleen, Sasha Levin, Mel Gorman,
	Andrew Morton, Alexander Viro, linux-fsdevel, Avi Kivity,
	Michal Nazarewicz, Kosaki Motohiro, Milton Miller

on_each_cpu_mask calls a function on processors specified by
cpumask, which may or may not include the local processor.

You must not call this function with disabled interrupts or
from a hardware interrupt handler or from a bottom half handler.
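
As a minimal usage sketch (hypothetical code for illustration only;
my_stat, my_stat_cpus and the reset functions are invented names, not
taken from this patch), a caller that tracks which CPUs hold per-cpu
state can now IPI just those CPUs instead of broadcasting with
on_each_cpu():

#include <linux/smp.h>
#include <linux/cpumask.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, my_stat);
static struct cpumask my_stat_cpus;	/* set when a CPU updates my_stat */

static void reset_my_stat(void *unused)
{
	/* Runs with interrupts disabled on each CPU in the mask. */
	this_cpu_write(my_stat, 0);
}

static void reset_all_stats(void)
{
	/* Only the CPUs in my_stat_cpus receive an IPI. */
	on_each_cpu_mask(&my_stat_cpus, reset_my_stat, NULL, true);
}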

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
CC: Chris Metcalf <cmetcalf@tilera.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: linux-mm@kvack.org
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
CC: Avi Kivity <avi@redhat.com>
CC: Michal Nazarewicz <mina86@mina86.org>
CC: Kosaki Motohiro <kosaki.motohiro@gmail.com>
CC: Milton Miller <miltonm@bga.com>
---
 include/linux/smp.h |   22 ++++++++++++++++++++++
 kernel/smp.c        |   29 +++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 8cc38d3..d0adb78 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -102,6 +102,13 @@ static inline void call_function_init(void) { }
 int on_each_cpu(smp_call_func_t func, void *info, int wait);
 
 /*
+ * Call a function on processors specified by mask, which might include
+ * the local one.
+ */
+void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
+		void *info, bool wait);
+
+/*
  * Mark the boot cpu "online" so that it can call console drivers in
  * printk() and can access its per-cpu storage.
  */
@@ -132,6 +139,21 @@ static inline int up_smp_call_function(smp_call_func_t func, void *info)
 		local_irq_enable();		\
 		0;				\
 	})
+/*
+ * Note we still need to test the mask even for UP
+ * because we actually can get an empty mask from
+ * code that on SMP might call us without the local
+ * CPU in the mask.
+ */
+#define on_each_cpu_mask(mask, func, info, wait) \
+	do {						\
+		if (cpumask_test_cpu(0, (mask))) {	\
+			local_irq_disable();		\
+			(func)(info);			\
+			local_irq_enable();		\
+		}					\
+	} while (0)
+
 static inline void smp_send_reschedule(int cpu) { }
 #define num_booting_cpus()			1
 #define smp_prepare_boot_cpu()			do {} while (0)
diff --git a/kernel/smp.c b/kernel/smp.c
index db197d6..a081e6c 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -701,3 +701,32 @@ int on_each_cpu(void (*func) (void *info), void *info, int wait)
 	return ret;
 }
 EXPORT_SYMBOL(on_each_cpu);
+
+/**
+ * on_each_cpu_mask(): Run a function on processors specified by
+ * cpumask, which may include the local processor.
+ * @mask: The set of cpus to run on (only runs on online subset).
+ * @func: The function to run. This must be fast and non-blocking.
+ * @info: An arbitrary pointer to pass to the function.
+ * @wait: If true, wait (atomically) until function has completed
+ *        on other CPUs.
+ *
+ * If @wait is true, then returns once @func has returned.
+ *
+ * You must not call this function with disabled interrupts or
+ * from a hardware interrupt handler or from a bottom half handler.
+ */
+void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
+			void *info, bool wait)
+{
+	int cpu = get_cpu();
+
+	smp_call_function_many(mask, func, info, wait);
+	if (cpumask_test_cpu(cpu, mask)) {
+		local_irq_disable();
+		func(info);
+		local_irq_enable();
+	}
+	put_cpu();
+}
+EXPORT_SYMBOL(on_each_cpu_mask);
-- 
1.7.0.4



* [v7 2/8] arm: move arm over to generic on_each_cpu_mask
  2012-01-26 10:01 [v7 0/8] Reduce cross CPU IPI interference Gilad Ben-Yossef
  2012-01-26 10:01 ` [v7 1/8] smp: introduce a generic on_each_cpu_mask function Gilad Ben-Yossef
@ 2012-01-26 10:01 ` Gilad Ben-Yossef
  2012-01-26 10:01 ` [v7 3/8] tile: move tile to use " Gilad Ben-Yossef
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-26 10:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gilad Ben-Yossef, Peter Zijlstra, Frederic Weisbecker,
	Russell King, Christoph Lameter, Chris Metcalf, linux-mm,
	Pekka Enberg, Matt Mackall, Rik van Riel, Andi Kleen,
	Sasha Levin, Mel Gorman, Andrew Morton, Alexander Viro,
	linux-fsdevel, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

Note that the generic version is a little different than the ARM one:

1. It has the mask as first parameter
2. It calls the function on the calling CPU with interrupts disabled,
   but this should be OK since the function is called on the other CPUs
   with interrupts disabled anyway.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Christoph Lameter <cl@linux.com>
CC: Chris Metcalf <cmetcalf@tilera.com>
CC: linux-mm@kvack.org
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
CC: Avi Kivity <avi@redhat.com>
CC: Michal Nazarewicz <mina86@mina86.org>
CC: Kosaki Motohiro <kosaki.motohiro@gmail.com>
CC: Milton Miller <miltonm@bga.com>
---
 arch/arm/kernel/smp_tlb.c |   20 +++++---------------
 1 files changed, 5 insertions(+), 15 deletions(-)

diff --git a/arch/arm/kernel/smp_tlb.c b/arch/arm/kernel/smp_tlb.c
index 7dcb352..02c5d2c 100644
--- a/arch/arm/kernel/smp_tlb.c
+++ b/arch/arm/kernel/smp_tlb.c
@@ -13,18 +13,6 @@
 #include <asm/smp_plat.h>
 #include <asm/tlbflush.h>
 
-static void on_each_cpu_mask(void (*func)(void *), void *info, int wait,
-	const struct cpumask *mask)
-{
-	preempt_disable();
-
-	smp_call_function_many(mask, func, info, wait);
-	if (cpumask_test_cpu(smp_processor_id(), mask))
-		func(info);
-
-	preempt_enable();
-}
-
 /**********************************************************************/
 
 /*
@@ -87,7 +75,7 @@ void flush_tlb_all(void)
 void flush_tlb_mm(struct mm_struct *mm)
 {
 	if (tlb_ops_need_broadcast())
-		on_each_cpu_mask(ipi_flush_tlb_mm, mm, 1, mm_cpumask(mm));
+		on_each_cpu_mask(mm_cpumask(mm), ipi_flush_tlb_mm, mm, 1);
 	else
 		local_flush_tlb_mm(mm);
 }
@@ -98,7 +86,8 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
 		struct tlb_args ta;
 		ta.ta_vma = vma;
 		ta.ta_start = uaddr;
-		on_each_cpu_mask(ipi_flush_tlb_page, &ta, 1, mm_cpumask(vma->vm_mm));
+		on_each_cpu_mask(mm_cpumask(vma->vm_mm), ipi_flush_tlb_page,
+					&ta, 1);
 	} else
 		local_flush_tlb_page(vma, uaddr);
 }
@@ -121,7 +110,8 @@ void flush_tlb_range(struct vm_area_struct *vma,
 		ta.ta_vma = vma;
 		ta.ta_start = start;
 		ta.ta_end = end;
-		on_each_cpu_mask(ipi_flush_tlb_range, &ta, 1, mm_cpumask(vma->vm_mm));
+		on_each_cpu_mask(mm_cpumask(vma->vm_mm), ipi_flush_tlb_range,
+					&ta, 1);
 	} else
 		local_flush_tlb_range(vma, start, end);
 }
-- 
1.7.0.4



* [v7 3/8] tile: move tile to use generic on_each_cpu_mask
  2012-01-26 10:01 [v7 0/8] Reduce cross CPU IPI interference Gilad Ben-Yossef
  2012-01-26 10:01 ` [v7 1/8] smp: introduce a generic on_each_cpu_mask function Gilad Ben-Yossef
  2012-01-26 10:01 ` [v7 2/8] arm: move arm over to generic on_each_cpu_mask Gilad Ben-Yossef
@ 2012-01-26 10:01 ` Gilad Ben-Yossef
  2012-01-26 10:01 ` [v7 4/8] smp: add func to IPI cpus based on parameter func Gilad Ben-Yossef
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-26 10:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gilad Ben-Yossef, Peter Zijlstra, Frederic Weisbecker,
	Russell King, linux-mm, Christoph Lameter, Pekka Enberg,
	Matt Mackall, Rik van Riel, Andi Kleen, Sasha Levin, Mel Gorman,
	Andrew Morton, Alexander Viro, linux-fsdevel, Avi Kivity,
	Michal Nazarewicz, Kosaki Motohiro

The API is the same as the tile private one, but the generic version
also calls the function locally, with interrupts disabled, in the UP
case.

This is OK since the function is called on the other CPUs
with interrupts disabled anyway.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: linux-mm@kvack.org
CC: Christoph Lameter <cl@linux-foundation.org>
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
CC: Avi Kivity <avi@redhat.com>
CC: Michal Nazarewicz <mina86@mina86.org>
CC: Kosaki Motohiro <kosaki.motohiro@gmail.com>
---
 arch/tile/include/asm/smp.h |    7 -------
 arch/tile/kernel/smp.c      |   19 -------------------
 2 files changed, 0 insertions(+), 26 deletions(-)

diff --git a/arch/tile/include/asm/smp.h b/arch/tile/include/asm/smp.h
index 532124a..1aa759a 100644
--- a/arch/tile/include/asm/smp.h
+++ b/arch/tile/include/asm/smp.h
@@ -43,10 +43,6 @@ void evaluate_message(int tag);
 /* Boot a secondary cpu */
 void online_secondary(void);
 
-/* Call a function on a specified set of CPUs (may include this one). */
-extern void on_each_cpu_mask(const struct cpumask *mask,
-			     void (*func)(void *), void *info, bool wait);
-
 /* Topology of the supervisor tile grid, and coordinates of boot processor */
 extern HV_Topology smp_topology;
 
@@ -91,9 +87,6 @@ void print_disabled_cpus(void);
 
 #else /* !CONFIG_SMP */
 
-#define on_each_cpu_mask(mask, func, info, wait)		\
-  do { if (cpumask_test_cpu(0, (mask))) func(info); } while (0)
-
 #define smp_master_cpu		0
 #define smp_height		1
 #define smp_width		1
diff --git a/arch/tile/kernel/smp.c b/arch/tile/kernel/smp.c
index c52224d..a44e103 100644
--- a/arch/tile/kernel/smp.c
+++ b/arch/tile/kernel/smp.c
@@ -87,25 +87,6 @@ void send_IPI_allbutself(int tag)
 	send_IPI_many(&mask, tag);
 }
 
-
-/*
- * Provide smp_call_function_mask, but also run function locally
- * if specified in the mask.
- */
-void on_each_cpu_mask(const struct cpumask *mask, void (*func)(void *),
-		      void *info, bool wait)
-{
-	int cpu = get_cpu();
-	smp_call_function_many(mask, func, info, wait);
-	if (cpumask_test_cpu(cpu, mask)) {
-		local_irq_disable();
-		func(info);
-		local_irq_enable();
-	}
-	put_cpu();
-}
-
-
 /*
  * Functions related to starting/stopping cpus.
  */
-- 
1.7.0.4



* [v7 4/8] smp: add func to IPI cpus based on parameter func
  2012-01-26 10:01 [v7 0/8] Reduce cross CPU IPI interference Gilad Ben-Yossef
                   ` (2 preceding siblings ...)
  2012-01-26 10:01 ` [v7 3/8] tile: move tile to use " Gilad Ben-Yossef
@ 2012-01-26 10:01 ` Gilad Ben-Yossef
  2012-01-27 23:57   ` Andrew Morton
  2012-01-26 10:01 ` [v7 5/8] slub: only IPI CPUs that have per cpu obj to flush Gilad Ben-Yossef
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-26 10:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gilad Ben-Yossef, Chris Metcalf, Christoph Lameter,
	Peter Zijlstra, Frederic Weisbecker, Russell King, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Alexander Viro, linux-fsdevel, Avi Kivity,
	Michal Nazarewicz, Kosaki Motohiro, Andrew Morton, Milton Miller

Add the on_each_cpu_cond() function that wraps on_each_cpu_mask()
and calculates the cpumask of cpus to IPI by calling a function supplied
as a parameter in order to determine whether to IPI each specific cpu.

The function works around allocation failure of the cpumask variable in
CONFIG_CPUMASK_OFFSTACK=y by iterating over the cpus, sending an IPI to
each one at a time via smp_call_function_single().

The function is useful since it allows separating the specific
code that decides in each case whether to IPI a specific cpu for
a specific request from the common boilerplate code of creating
the mask, handling failures etc.
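
As a hypothetical sketch of the resulting division of labour (my_cache
and the function names below are invented for illustration, not taken
from the kernel), a subsystem supplies only the predicate and the IPI
handler, while on_each_cpu_cond() owns all the cpumask handling:

#include <linux/smp.h>
#include <linux/percpu.h>
#include <linux/gfp.h>

struct my_cache {
	int count;
};
static DEFINE_PER_CPU(struct my_cache, my_cache);

static void flush_my_cache(void *info)
{
	/* The IPI handler: runs with interrupts disabled. */
	this_cpu_write(my_cache.count, 0);
}

static bool my_cache_nonempty(int cpu, void *info)
{
	/*
	 * The predicate: called with preemption disabled. The check
	 * is racy, but a spurious or missed IPI is harmless here.
	 */
	return per_cpu(my_cache, cpu).count != 0;
}

static void flush_all_my_caches(void)
{
	/*
	 * Mask allocation, the allocation-failure fallback and the
	 * iteration over online CPUs all happen inside the helper.
	 */
	on_each_cpu_cond(my_cache_nonempty, flush_my_cache, NULL,
			 true, GFP_KERNEL);
}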

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
CC: Chris Metcalf <cmetcalf@tilera.com>
CC: Christoph Lameter <cl@linux-foundation.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: linux-mm@kvack.org
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
CC: Avi Kivity <avi@redhat.com>
CC: Michal Nazarewicz <mina86@mina86.com>
CC: Kosaki Motohiro <kosaki.motohiro@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Milton Miller <miltonm@bga.com>
---
 include/linux/smp.h |   19 ++++++++++++++++
 kernel/smp.c        |   58 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+), 0 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index d0adb78..e1ea702 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -109,6 +109,15 @@ void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
 		void *info, bool wait);
 
 /*
+ * Call a function on each processor for which the supplied function
+ * cond_func returns true. This may include the local
+ * processor.
+ */
+void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+		smp_call_func_t func, void *info, bool wait,
+		gfp_t gfpflags);
+
+/*
  * Mark the boot cpu "online" so that it can call console drivers in
  * printk() and can access its per-cpu storage.
  */
@@ -153,6 +162,16 @@ static inline int up_smp_call_function(smp_call_func_t func, void *info)
 			local_irq_enable();		\
 		}					\
 	} while (0)
+#define on_each_cpu_cond(cond_func, func, info, wait, gfpflags) \
+	do {						\
+		preempt_disable();			\
+		if (cond_func(0, info)) {		\
+			local_irq_disable();		\
+			(func)(info);			\
+			local_irq_enable();		\
+		}					\
+		preempt_enable();			\
+	} while (0)
 
 static inline void smp_send_reschedule(int cpu) { }
 #define num_booting_cpus()			1
diff --git a/kernel/smp.c b/kernel/smp.c
index a081e6c..fa0912a 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -730,3 +730,61 @@ void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
 	put_cpu();
 }
 EXPORT_SYMBOL(on_each_cpu_mask);
+
+/*
+ * on_each_cpu_cond(): Call a function on each processor for which
+ * the supplied function cond_func returns true, optionally waiting
+ * for all the required CPUs to finish. This may include the local
+ * processor.
+ * @cond_func:	A callback function that is passed a cpu id and
+ *		the info parameter. The function is called
+ *		with preemption disabled. The function should
+ *		return a boolean value indicating whether to IPI
+ *		the specified CPU.
+ * @func:	The function to run on all applicable CPUs.
+ *		This must be fast and non-blocking.
+ * @info:	An arbitrary pointer to pass to both functions.
+ * @wait:	If true, wait (atomically) until function has
+ *		completed on other CPUs.
+ * @gfpflags:	GFP flags to use when allocating the cpumask
+ *		used internally by the function.
+ *
+ * The function might sleep if the GFP flags indicate a
+ * non-atomic allocation is allowed.
+ *
+ * You must not call this function with disabled interrupts or
+ * from a hardware interrupt handler or from a bottom half handler.
+ */
+void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+			smp_call_func_t func, void *info, bool wait,
+			gfp_t gfpflags)
+{
+	cpumask_var_t cpus;
+	int cpu, ret;
+
+	might_sleep_if(gfpflags & __GFP_WAIT);
+
+	if (likely(zalloc_cpumask_var(&cpus, (gfpflags|__GFP_NOWARN)))) {
+		preempt_disable();
+		for_each_online_cpu(cpu)
+			if (cond_func(cpu, info))
+				cpumask_set_cpu(cpu, cpus);
+		on_each_cpu_mask(cpus, func, info, wait);
+		preempt_enable();
+		free_cpumask_var(cpus);
+	} else {
+		/*
+		 * No free cpumask, bother. No matter, we'll
+		 * just have to IPI them one by one.
+		 */
+		preempt_disable();
+		for_each_online_cpu(cpu)
+			if (cond_func(cpu, info)) {
+				ret = smp_call_function_single(cpu, func,
+								info, wait);
+				WARN_ON_ONCE(ret);
+			}
+		preempt_enable();
+	}
+}
+EXPORT_SYMBOL(on_each_cpu_cond);
-- 
1.7.0.4



* [v7 5/8] slub: only IPI CPUs that have per cpu obj to flush
  2012-01-26 10:01 [v7 0/8] Reduce cross CPU IPI interference Gilad Ben-Yossef
                   ` (3 preceding siblings ...)
  2012-01-26 10:01 ` [v7 4/8] smp: add func to IPI cpus based on parameter func Gilad Ben-Yossef
@ 2012-01-26 10:01 ` Gilad Ben-Yossef
  2012-01-26 15:09   ` Christoph Lameter
  2012-01-26 10:01 ` [v7 6/8] fs: only send IPI to invalidate LRU BH when needed Gilad Ben-Yossef
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-26 10:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gilad Ben-Yossef, Christoph Lameter, Chris Metcalf,
	Peter Zijlstra, Frederic Weisbecker, Russell King, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	linux-fsdevel, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

flush_all() is called for each kmem_cache_destroy(). So every cache
being destroyed dynamically ends up sending an IPI to each CPU in the
system, regardless of whether the cache has ever been used there.

For example, if you close the Infiniband ipath driver char device file,
the close file ops calls kmem_cache_destroy(). So running some
infiniband config tool on a single CPU dedicated to system tasks
might interrupt the other 127 CPUs dedicated to some CPU
intensive or latency sensitive task.

I suspect there is a good chance that every line in the output of "git
grep kmem_cache_destroy linux/ | grep '\->'" has a similar scenario.

This patch attempts to rectify this issue by sending an IPI to flush
the per cpu objects back to the free lists only to CPUs that seem to
have such objects.

The check of which CPUs to IPI is racy, but we don't care since asking
a CPU without per cpu objects to flush does no damage, and as far as I
can tell flush_all by itself is racy against allocs on remote
CPUs anyway, so if you require flush_all to be deterministic, you
have to arrange for locking regardless.

Without this patch the following artificial test case:

$ cd /sys/kernel/slab
$ for DIR in *; do cat $DIR/alloc_calls > /dev/null; done

produces 166 IPIs on a cpuset-isolated CPU. With it, it produces none.

The memory allocation failure code path for the CONFIG_CPUMASK_OFFSTACK=y
config was tested using the fault injection framework.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
CC: Christoph Lameter <cl@linux.com>
CC: Chris Metcalf <cmetcalf@tilera.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: linux-mm@kvack.org
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
CC: Avi Kivity <avi@redhat.com>
CC: Michal Nazarewicz <mina86@mina86.org>
CC: Kosaki Motohiro <kosaki.motohiro@gmail.com>
CC: Milton Miller <miltonm@bga.com>
---
 mm/slub.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 4907563..3d75f89 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2018,9 +2018,17 @@ static void flush_cpu_slab(void *d)
 	__flush_cpu_slab(s, smp_processor_id());
 }
 
+static bool has_cpu_slab(int cpu, void *info)
+{
+	struct kmem_cache *s = info;
+	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+
+	return !!(c->page);
+}
+
 static void flush_all(struct kmem_cache *s)
 {
-	on_each_cpu(flush_cpu_slab, s, 1);
+	on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1, GFP_ATOMIC);
 }
 
 /*
-- 
1.7.0.4



* [v7 6/8] fs: only send IPI to invalidate LRU BH when needed
  2012-01-26 10:01 [v7 0/8] Reduce cross CPU IPI interference Gilad Ben-Yossef
                   ` (4 preceding siblings ...)
  2012-01-26 10:01 ` [v7 5/8] slub: only IPI CPUs that have per cpu obj to flush Gilad Ben-Yossef
@ 2012-01-26 10:01 ` Gilad Ben-Yossef
  2012-01-26 10:02 ` [v7 7/8] mm: only IPI CPUs to drain local pages if they exist Gilad Ben-Yossef
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-26 10:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gilad Ben-Yossef, Christoph Lameter, Chris Metcalf,
	Peter Zijlstra, Frederic Weisbecker, Russell King, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	linux-fsdevel, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

In several code paths, such as when unmounting a file system (but
not only), we send an IPI to ask each cpu to invalidate its local
LRU BHs.

On multi-core systems, many cpus may not have any LRU BHs, either
because they are idle or because they have not performed any file
system accesses since the last invalidation (e.g. CPUs crunching
numbers on high performance computing nodes that write results to
shared memory, or that only use filesystems which do not use the
bh layer). On such systems this can lead to a loss of performance
each time someone switches the KVM (the virtual keyboard and screen
type, not the hypervisor) if it has a USB storage device plugged in.

This patch attempts to only send an IPI to cpus that have LRU BHs.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
CC: Christoph Lameter <cl@linux.com>
CC: Chris Metcalf <cmetcalf@tilera.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: linux-mm@kvack.org
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
CC: Avi Kivity <avi@redhat.com>
CC: Michal Nazarewicz <mina86@mina86.com>
CC: Kosaki Motohiro <kosaki.motohiro@gmail.com>
CC: Milton Miller <miltonm@bga.com>
---
 fs/buffer.c |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 1a30db7..baa075e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1384,10 +1384,23 @@ static void invalidate_bh_lru(void *arg)
 	}
 	put_cpu_var(bh_lrus);
 }
+
+static bool has_bh_in_lru(int cpu, void *dummy)
+{
+	struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
+	int i;
+
+	for (i = 0; i < BH_LRU_SIZE; i++) {
+		if (b->bhs[i])
+			return 1;
+	}
+
+	return 0;
+}
+
 void invalidate_bh_lrus(void)
 {
-	on_each_cpu(invalidate_bh_lru, NULL, 1);
+	on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1, GFP_KERNEL);
 }
 EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
 
-- 
1.7.0.4



* [v7 7/8] mm: only IPI CPUs to drain local pages if they exist
  2012-01-26 10:01 [v7 0/8] Reduce cross CPU IPI interference Gilad Ben-Yossef
                   ` (5 preceding siblings ...)
  2012-01-26 10:01 ` [v7 6/8] fs: only send IPI to invalidate LRU BH when needed Gilad Ben-Yossef
@ 2012-01-26 10:02 ` Gilad Ben-Yossef
  2012-01-26 15:13   ` Christoph Lameter
                     ` (2 more replies)
  2012-01-26 10:02 ` [v7 8/8] mm: add vmstat counters for tracking PCP drains Gilad Ben-Yossef
  2012-01-26 15:19 ` [v7 0/8] Reduce cross CPU IPI interference Peter Zijlstra
  8 siblings, 3 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-26 10:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gilad Ben-Yossef, Mel Gorman, KOSAKI Motohiro, Christoph Lameter,
	Chris Metcalf, Peter Zijlstra, Frederic Weisbecker, Russell King,
	linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Andrew Morton, Alexander Viro, linux-fsdevel,
	Avi Kivity, Michal Nazarewicz, Milton Miller

Calculate a cpumask of CPUs with per-cpu pages in any zone
and only send an IPI requesting CPUs to drain these pages
to the buddy allocator if they actually have pages when
asked to flush.

This patch saves 85%+ of the IPIs asking to drain per-cpu
pages in case of severe memory pressure that leads
to OOM, since in these cases multiple, possibly concurrent,
allocation requests end up in the direct reclaim code
path. The per-cpu pages end up reclaimed on the first
allocation failure, so for most of the subsequent allocation
attempts, until the memory pressure is off (possibly via
the OOM killer), there are no per-cpu pages on most CPUs
(and there can easily be hundreds of them).

This also has the side effect of shortening the average
latency of direct reclaim by one or more orders of magnitude,
since waiting for all the CPUs to ACK the IPI takes a
long time.

Tested by running "hackbench 400" on an 8 CPU x86 VM and
observing the difference between the number of direct
reclaim attempts that end up in drain_all_pages() and
those where more than 1/2 of the online CPUs had any per-cpu
pages in them, using the vmstat counters introduced
in the next patch in the series and /proc/interrupts.

In the test scenario, this was seen to save around 3600 global
IPIs after triggering an OOM on a concurrent workload:

$ cat /proc/vmstat | tail -n 2
pcp_global_drain 0
pcp_global_ipi_saved 0

$ cat /proc/interrupts | grep CAL
CAL:          1          2          1          2
          2          2          2          2   Function call interrupts

$ hackbench 400
[OOM messages snipped]

$ cat /proc/vmstat | tail -n 2
pcp_global_drain 3647
pcp_global_ipi_saved 3642

$ cat /proc/interrupts | grep CAL
CAL:          6         13          6          3
          3          3         12          7   Function call interrupts

Please note that if the global drain is removed from the
direct reclaim path, as a patch from Mel Gorman currently
suggests, this should be replaced with an on_each_cpu_cond
invocation.
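
For what it's worth, such a conversion could look roughly like this
(a hypothetical sketch only; cpu_has_pcps is an invented name and the
GFP flags choice would depend on the calling context):

static bool cpu_has_pcps(int cpu, void *info)
{
	struct zone *zone;

	/* Same racy-but-harmless check as in the loop below. */
	for_each_populated_zone(zone)
		if (per_cpu_ptr(zone->pageset, cpu)->pcp.count)
			return true;
	return false;
}

void drain_all_pages(void)
{
	on_each_cpu_cond(cpu_has_pcps, drain_local_pages, NULL,
			 true, GFP_KERNEL);
}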

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Christoph Lameter <cl@linux.com>
CC: Chris Metcalf <cmetcalf@tilera.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: linux-mm@kvack.org
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
CC: Avi Kivity <avi@redhat.com>
CC: Michal Nazarewicz <mina86@mina86.com>
CC: Milton Miller <miltonm@bga.com>
---
 mm/page_alloc.c |   31 ++++++++++++++++++++++++++++++-
 1 files changed, 30 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d2186ec..4135983 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1165,7 +1165,36 @@ void drain_local_pages(void *arg)
  */
 void drain_all_pages(void)
 {
-	on_each_cpu(drain_local_pages, NULL, 1);
+	int cpu;
+	struct per_cpu_pageset *pcp;
+	struct zone *zone;
+
+	/* Allocate in the BSS so we won't require allocation in
+	 * the direct reclaim path for CONFIG_CPUMASK_OFFSTACK=y
+	 */
+	static cpumask_t cpus_with_pcps;
+
+	/*
+	 * We don't care about racing with CPU hotplug event
+	 * as offline notification will cause the notified
+	 * cpu to drain that CPU pcps and on_each_cpu_mask
+	 * disables preemption as part of its processing
+	 */
+	for_each_online_cpu(cpu) {
+		bool has_pcps = false;
+		for_each_populated_zone(zone) {
+			pcp = per_cpu_ptr(zone->pageset, cpu);
+			if (pcp->pcp.count) {
+				has_pcps = true;
+				break;
+			}
+		}
+		if (has_pcps)
+			cpumask_set_cpu(cpu, &cpus_with_pcps);
+		else
+			cpumask_clear_cpu(cpu, &cpus_with_pcps);
+	}
+	on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1);
 }
 
 #ifdef CONFIG_HIBERNATION
-- 
1.7.0.4



* [v7 8/8] mm: add vmstat counters for tracking PCP drains
  2012-01-26 10:01 [v7 0/8] Reduce cross CPU IPI interference Gilad Ben-Yossef
                   ` (6 preceding siblings ...)
  2012-01-26 10:02 ` [v7 7/8] mm: only IPI CPUs to drain local pages if they exist Gilad Ben-Yossef
@ 2012-01-26 10:02 ` Gilad Ben-Yossef
  2012-01-26 15:19 ` [v7 0/8] Reduce cross CPU IPI interference Peter Zijlstra
  8 siblings, 0 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-26 10:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gilad Ben-Yossef, Christoph Lameter, Chris Metcalf,
	Peter Zijlstra, Frederic Weisbecker, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman,
	Andrew Morton, Alexander Viro, Avi Kivity, Michal Nazarewicz,
	Kosaki Motohiro, Milton Miller

This patch introduces two new vmstat counters for testing purposes:
pcp_global_drain, which counts the number of times a per-cpu pages
global drain was requested, and pcp_global_ipi_saved, which counts
the number of times the number of CPUs with per-cpu pages in any
zone was less than 1/2 of the number of online CPUs.

The purpose of the patch is to show the usefulness of only sending an IPI
asking to drain per-cpu pages to CPUs that actually have them
instead of a blind global IPI. It is probably not useful by itself.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
CC: Christoph Lameter <cl@linux.com>
CC: Chris Metcalf <cmetcalf@tilera.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: linux-mm@kvack.org
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Avi Kivity <avi@redhat.com>
CC: Michal Nazarewicz <mina86@mina86.com>
CC: Kosaki Motohiro <kosaki.motohiro@gmail.com>
CC: Milton Miller <miltonm@bga.com>
---
 include/linux/vm_event_item.h |    1 +
 mm/page_alloc.c               |    5 +++++
 mm/vmstat.c                   |    2 ++
 3 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03b90cd..3657f6f 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -58,6 +58,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_COLLAPSE_ALLOC_FAILED,
 		THP_SPLIT,
 #endif
+		PCP_GLOBAL_DRAIN, PCP_GLOBAL_IPI_SAVED,
 		NR_VM_EVENT_ITEMS
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4135983..1aa9031 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1195,6 +1195,11 @@ void drain_all_pages(void)
 			cpumask_clear_cpu(cpu, &cpus_with_pcps);
 	}
 	on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1);
+
+	count_vm_event(PCP_GLOBAL_DRAIN);
+	if (cpumask_weight(&cpus_with_pcps) <
+	   (cpumask_weight(cpu_online_mask) / 2))
+		count_vm_event(PCP_GLOBAL_IPI_SAVED);
 }
 
 #ifdef CONFIG_HIBERNATION
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f600557..3ee5f99 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -786,6 +786,8 @@ const char * const vmstat_text[] = {
 	"thp_collapse_alloc_failed",
 	"thp_split",
 #endif
+	"pcp_global_drain",
+	"pcp_global_ipi_saved"
 
 #endif /* CONFIG_VM_EVENTS_COUNTERS */
 };
-- 
1.7.0.4



* Re: [v7 5/8] slub: only IPI CPUs that have per cpu obj to flush
  2012-01-26 10:01 ` [v7 5/8] slub: only IPI CPUs that have per cpu obj to flush Gilad Ben-Yossef
@ 2012-01-26 15:09   ` Christoph Lameter
  0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2012-01-26 15:09 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, Chris Metcalf, Peter Zijlstra, Frederic Weisbecker,
	Russell King, linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin,
	Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, linux-fsdevel, Avi Kivity, Michal Nazarewicz,
	Kosaki Motohiro, Milton Miller

On Thu, 26 Jan 2012, Gilad Ben-Yossef wrote:

> produces 166 IPIs on a cpuset-isolated CPU. With it, it produces none.

Acked-by: Christoph Lameter <cl@linux.com>



* Re: [v7 7/8] mm: only IPI CPUs to drain local pages if they exist
  2012-01-26 10:02 ` [v7 7/8] mm: only IPI CPUs to drain local pages if they exist Gilad Ben-Yossef
@ 2012-01-26 15:13   ` Christoph Lameter
  2012-01-28  0:12   ` Andrew Morton
  2012-01-30 14:59   ` Mel Gorman
  2 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2012-01-26 15:13 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, Mel Gorman, KOSAKI Motohiro, Chris Metcalf,
	Peter Zijlstra, Frederic Weisbecker, Russell King, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Andrew Morton, Alexander Viro, linux-fsdevel,
	Avi Kivity, Michal Nazarewicz, Milton Miller

On Thu, 26 Jan 2012, Gilad Ben-Yossef wrote:

> Calculate a cpumask of CPUs with per-cpu pages in any zone
> and only send an IPI requesting CPUs to drain these pages
> to the buddy allocator if they actually have pages when
> asked to flush.

Acked-by: Christoph Lameter <cl@linux.com>



* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-01-26 10:01 [v7 0/8] Reduce cross CPU IPI interference Gilad Ben-Yossef
                   ` (7 preceding siblings ...)
  2012-01-26 10:02 ` [v7 8/8] mm: add vmstat counters for tracking PCP drains Gilad Ben-Yossef
@ 2012-01-26 15:19 ` Peter Zijlstra
  2012-01-29  8:25   ` Gilad Ben-Yossef
  8 siblings, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2012-01-26 15:19 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, Christoph Lameter, Chris Metcalf,
	Frederic Weisbecker, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Thu, 2012-01-26 at 12:01 +0200, Gilad Ben-Yossef wrote:
> Gilad Ben-Yossef (8):
>   smp: introduce a generic on_each_cpu_mask function
>   arm: move arm over to generic on_each_cpu_mask
>   tile: move tile to use generic on_each_cpu_mask
>   smp: add func to IPI cpus based on parameter func
>   slub: only IPI CPUs that have per cpu obj to flush
>   fs: only send IPI to invalidate LRU BH when needed
>   mm: only IPI CPUs to drain local pages if they exist

These patches look very nice!

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>


>   mm: add vmstat counters for tracking PCP drains
> 
I understood from previous postings that this patch wasn't meant for
inclusion; if it is, note that cpumask_weight() is a potentially very
expensive operation.


* Re: [v7 4/8] smp: add func to IPI cpus based on parameter func
  2012-01-26 10:01 ` [v7 4/8] smp: add func to IPI cpus based on parameter func Gilad Ben-Yossef
@ 2012-01-27 23:57   ` Andrew Morton
  2012-01-29 12:04     ` Gilad Ben-Yossef
  0 siblings, 1 reply; 77+ messages in thread
From: Andrew Morton @ 2012-01-27 23:57 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, Chris Metcalf, Christoph Lameter, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen,
	Alexander Viro, linux-fsdevel, Avi Kivity, Michal Nazarewicz,
	Kosaki Motohiro, Milton Miller

On Thu, 26 Jan 2012 12:01:57 +0200
Gilad Ben-Yossef <gilad@benyossef.com> wrote:

> Add the on_each_cpu_cond() function that wraps on_each_cpu_mask()
> and calculates the cpumask of cpus to IPI by calling a function supplied
> as a parameter in order to determine whether to IPI each specific cpu.
> 
> The function works around allocation failure of the cpumask variable in
> CONFIG_CPUMASK_OFFSTACK=y by iterating over the cpus, sending an IPI to
> each one at a time via smp_call_function_single().
> 
> The function is useful since it allows separating the specific
> code that decides in each case whether to IPI a specific cpu for
> a specific request from the common boilerplate code of creating
> the mask, handling failures etc.
> 
> ...
>
> @@ -153,6 +162,16 @@ static inline int up_smp_call_function(smp_call_func_t func, void *info)
>  			local_irq_enable();		\
>  		}					\
>  	} while (0)
> +#define on_each_cpu_cond(cond_func, func, info, wait, gfpflags) \
> +	do {						\
> +		preempt_disable();			\
> +		if (cond_func(0, info)) {		\
> +			local_irq_disable();		\
> +			(func)(info);			\
> +			local_irq_enable();		\

Ordinarily, local_irq_enable() in such a low-level thing is dangerous,
because it can cause horrid bugs when called from local_irq_disable()d
code.

However I think we're OK here because it is a bug to call on_each_cpu()
and friends with local irqs disabled, yes?

Do we have any warnings printks if someone calls the ipi-sending
functions with local interrupts disabled?  I didn't see any, but didn't
look very hard.

If my above claims are correct then why does on_each_cpu() use
local_irq_save()?  hrm.

> +		}					\
> +		preempt_enable();			\
> +	} while (0)
>  
>  static inline void smp_send_reschedule(int cpu) { }
>  #define num_booting_cpus()			1
> diff --git a/kernel/smp.c b/kernel/smp.c
> index a081e6c..fa0912a 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -730,3 +730,61 @@ void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
>  	put_cpu();
>  }
>  EXPORT_SYMBOL(on_each_cpu_mask);
> +
> +/*
> + * on_each_cpu_cond(): Call a function on each processor for which
> + * the supplied function cond_func returns true, optionally waiting
> + * for all the required CPUs to finish. This may include the local
> + * processor.
> + * @cond_func:	A callback function that is passed a cpu id and
> + *		the info parameter. The function is called
> + *		with preemption disabled. The function should
> + *		return a boolean value indicating whether to IPI
> + *		the specified CPU.
> + * @func:	The function to run on all applicable CPUs.
> + *		This must be fast and non-blocking.
> + * @info:	An arbitrary pointer to pass to both functions.
> + * @wait:	If true, wait (atomically) until function has
> + *		completed on other CPUs.
> + * @gfpflags:	GFP flags to use when allocating the cpumask
> + *		used internally by the function.
> + *
> + * The function might sleep if the GFP flags indicate a
> + * non-atomic allocation is allowed.
> + *
> + * You must not call this function with disabled interrupts or
> + * from a hardware interrupt handler or from a bottom half handler.
> + */
> +void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
> +			smp_call_func_t func, void *info, bool wait,
> +			gfp_t gfpflags)

bah.

z:/usr/src/linux-3.3-rc1> grep -r gfpflags . | wc -l
78
z:/usr/src/linux-3.3-rc1> grep -r gfp_flags . | wc -l 
548

> +{
> +	cpumask_var_t cpus;
> +	int cpu, ret;
> +
> +	might_sleep_if(gfpflags & __GFP_WAIT);

For the zalloc_cpumask_var(), it seems.  I expect there are
might_sleep() elsewhere in the memory allocation paths, but putting one
here will detect bugs even if CONFIG_CPUMASK_OFFSTACK=n.

> +	if (likely(zalloc_cpumask_var(&cpus, (gfpflags|__GFP_NOWARN)))) {
> +		preempt_disable();
> +		for_each_online_cpu(cpu)
> +			if (cond_func(cpu, info))
> +				cpumask_set_cpu(cpu, cpus);
> +		on_each_cpu_mask(cpus, func, info, wait);
> +		preempt_enable();
> +		free_cpumask_var(cpus);
> +	} else {
> +		/*
> +		 * No free cpumask, bother. No matter, we'll
> +		 * just have to IPI them one by one.
> +		 */
> +		preempt_disable();
> +		for_each_online_cpu(cpu)
> +			if (cond_func(cpu, info)) {
> +				ret = smp_call_function_single(cpu, func,
> +								info, wait);
> +				WARN_ON_ONCE(ret);
> +			}
> +		preempt_enable();
> +	}
> +}
> +EXPORT_SYMBOL(on_each_cpu_cond);

I assume the preempt_disable()s here are to suspend CPU hotplug?



* Re: [v7 7/8] mm: only IPI CPUs to drain local pages if they exist
  2012-01-26 10:02 ` [v7 7/8] mm: only IPI CPUs to drain local pages if they exist Gilad Ben-Yossef
  2012-01-26 15:13   ` Christoph Lameter
@ 2012-01-28  0:12   ` Andrew Morton
  2012-01-29 12:18     ` Gilad Ben-Yossef
  2012-01-30 14:59   ` Mel Gorman
  2 siblings, 1 reply; 77+ messages in thread
From: Andrew Morton @ 2012-01-28  0:12 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, Mel Gorman, KOSAKI Motohiro, Christoph Lameter,
	Chris Metcalf, Peter Zijlstra, Frederic Weisbecker, Russell King,
	linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Alexander Viro, linux-fsdevel, Avi Kivity,
	Michal Nazarewicz, Milton Miller

On Thu, 26 Jan 2012 12:02:00 +0200
Gilad Ben-Yossef <gilad@benyossef.com> wrote:

> Calculate a cpumask of CPUs with per-cpu pages in any zone
> and only send an IPI requesting CPUs to drain these pages
> to the buddy allocator if they actually have pages when
> asked to flush.
> 
> This patch saves 85%+ of the IPIs asking to drain per-cpu
> pages in case of severe memory pressure that leads
> to OOM, since in these cases multiple, possibly concurrent,
> allocation requests end up in the direct reclaim code
> path. The per-cpu pages end up reclaimed on the first
> allocation failure, so for most of the subsequent allocation
> attempts, until the memory pressure is off (possibly via
> the OOM killer), there are no per-cpu pages on most CPUs
> (and there can easily be hundreds of them).
> 
> This also has the side effect of shortening the average
> latency of direct reclaim by one or more orders of magnitude,
> since waiting for all the CPUs to ACK the IPI takes a
> long time.
> 
> Tested by running "hackbench 400" on an 8 CPU x86 VM and
> observing the difference between the number of direct
> reclaim attempts that end up in drain_all_pages() and
> those where more than 1/2 of the online CPUs had any per-cpu
> pages in them, using the vmstat counters introduced
> in the next patch in the series and /proc/interrupts.
> 
> In the test scenario, this was seen to save around 3600 global
> IPIs after triggering an OOM on a concurrent workload:
> 
>
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1165,7 +1165,36 @@ void drain_local_pages(void *arg)
>   */
>  void drain_all_pages(void)
>  {
> -	on_each_cpu(drain_local_pages, NULL, 1);
> +	int cpu;
> +	struct per_cpu_pageset *pcp;
> +	struct zone *zone;
> +
> +	/* Allocate in the BSS so we won't require allocation in
> +	 * the direct reclaim path for CONFIG_CPUMASK_OFFSTACK=y
> +	 */
> +	static cpumask_t cpus_with_pcps;
> +
> +	/*
> +	 * We don't care about racing with CPU hotplug event
> +	 * as offline notification will cause the notified
> +	 * cpu to drain that CPU pcps and on_each_cpu_mask
> +	 * disables preemption as part of its processing
> +	 */

hmmm.

> +	for_each_online_cpu(cpu) {
> +		bool has_pcps = false;
> +		for_each_populated_zone(zone) {
> +			pcp = per_cpu_ptr(zone->pageset, cpu);
> +			if (pcp->pcp.count) {
> +				has_pcps = true;
> +				break;
> +			}
> +		}
> +		if (has_pcps)
> +			cpumask_set_cpu(cpu, &cpus_with_pcps);
> +		else
> +			cpumask_clear_cpu(cpu, &cpus_with_pcps);
> +	}
> +	on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1);
>  }

Can we end up sending an IPI to a now-unplugged CPU?  That won't work
very well if that CPU is now sitting on its sysadmin's desk.

There's also the case of CPU online.  We could end up failing to IPI a
CPU which now has some percpu pages.  That's not at all serious - 90%
is good enough in page reclaim.  But this thinking merits a mention in
the comment.  Or we simply make this code hotplug-safe.



* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-01-26 15:19 ` [v7 0/8] Reduce cross CPU IPI interference Peter Zijlstra
@ 2012-01-29  8:25   ` Gilad Ben-Yossef
  2012-02-01 17:04     ` Frederic Weisbecker
  2012-02-01 17:35     ` Peter Zijlstra
  0 siblings, 2 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-29  8:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Christoph Lameter, Chris Metcalf,
	Frederic Weisbecker, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Thu, Jan 26, 2012 at 5:19 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> On Thu, 2012-01-26 at 12:01 +0200, Gilad Ben-Yossef wrote:
> > Gilad Ben-Yossef (8):
> >   smp: introduce a generic on_each_cpu_mask function
> >   arm: move arm over to generic on_each_cpu_mask
> >   tile: move tile to use generic on_each_cpu_mask
> >   smp: add func to IPI cpus based on parameter func
> >   slub: only IPI CPUs that have per cpu obj to flush
> >   fs: only send IPI to invalidate LRU BH when needed
> >   mm: only IPI CPUs to drain local pages if they exist
>
> These patches look very nice!
>
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>

Thank you :-)

If this is of interest, I keep a list tracking sources of global IPIs
and global task scheduling in the core kernel here:
https://github.com/gby/linux/wiki.

I plan to visit all these potential interference sources to see if
something can be done to lower their effect on
isolated CPUs over time.

>
> >   mm: add vmstat counters for tracking PCP drains
> >
> I understood from previous postings this patch wasn't meant for
> inclusion, if it is, note that cpumask_weight() is a potentially very
> expensive operation.

Right. The only purpose of the patch is to show the usefulness
of the previous patch in the series. It is not meant for mainline.

Thanks,
Gilad



--
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML


* Re: [v7 4/8] smp: add func to IPI cpus based on parameter func
  2012-01-27 23:57   ` Andrew Morton
@ 2012-01-29 12:04     ` Gilad Ben-Yossef
  0 siblings, 0 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-29 12:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Chris Metcalf, Christoph Lameter, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen,
	Alexander Viro, linux-fsdevel, Avi Kivity, Michal Nazarewicz,
	Kosaki Motohiro, Milton Miller

On Sat, Jan 28, 2012 at 1:57 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Thu, 26 Jan 2012 12:01:57 +0200
> Gilad Ben-Yossef <gilad@benyossef.com> wrote:
...
>>
>> @@ -153,6 +162,16 @@ static inline int up_smp_call_function(smp_call_func_t func, void *info)
>>                       local_irq_enable();             \
>>               }                                       \
>>       } while (0)
>> +#define on_each_cpu_cond(cond_func, func, info, wait, gfpflags) \
>> +     do {                                            \
>> +             preempt_disable();                      \
>> +             if (cond_func(0, info)) {               \
>> +                     local_irq_disable();            \
>> +                     (func)(info);                   \
>> +                     local_irq_enable();             \
>
> Ordinarily, local_irq_enable() in such a low-level thing is dangerous,
> because it can cause horrid bugs when called from local_irq_disable()d
> code.
>
> However I think we're OK here because it is a bug to call on_each_cpu()
> and friends with local irqs disabled, yes?

Yes, that is my understanding and this way the function gets called in
the same conditions in UP and SMP.

> Do we have any warnings printks if someone calls the ipi-sending
> functions with local interrupts disabled?  I didn't see any, but didn't
> look very hard.

There is this check in smp_call_function_many():

        WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
                     && !oops_in_progress && !early_boot_irqs_disabled);

Only catches SMP offenders though.


> If my above claims are correct then why does on_each_cpu() use
> local_irq_save()?  hrm.

The comment in on_each_cpu() in kernel.smp.c says: "May be
used during early boot while early_boot_irqs_disabled is set.  Use
local_irq_save/restore() instead of local_irq_disable/enable()."


>
>> +             }                                       \
>> +             preempt_enable();                       \
>> +     } while (0)
>>
>>  static inline void smp_send_reschedule(int cpu) { }
>>  #define num_booting_cpus()                   1
>> diff --git a/kernel/smp.c b/kernel/smp.c
>> index a081e6c..fa0912a 100644
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -730,3 +730,61 @@ void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
>>       put_cpu();
>>  }
>>  EXPORT_SYMBOL(on_each_cpu_mask);
>> +
>> +/*
>> + * on_each_cpu_cond(): Call a function on each processor for which
>> + * the supplied function cond_func returns true, optionally waiting
>> + * for all the required CPUs to finish. This may include the local
>> + * processor.
>> + * @cond_func:       A callback function that is passed a cpu id and
>> + *           the info parameter. The function is called
>> + *           with preemption disabled. The function should
>> + *           return a boolean value indicating whether to IPI
>> + *           the specified CPU.
>> + * @func:    The function to run on all applicable CPUs.
>> + *           This must be fast and non-blocking.
>> + * @info:    An arbitrary pointer to pass to both functions.
>> + * @wait:    If true, wait (atomically) until function has
>> + *           completed on other CPUs.
>> + * @gfpflags:        GFP flags to use when allocating the cpumask
>> + *           used internally by the function.
>> + *
>> + * The function might sleep if the GFP flags indicate a
>> + * non-atomic allocation is allowed.
>> + *
>> + * You must not call this function with disabled interrupts or
>> + * from a hardware interrupt handler or from a bottom half handler.
>> + */
>> +void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
>> +                     smp_call_func_t func, void *info, bool wait,
>> +                     gfp_t gfpflags)
>
> bah.
>
> z:/usr/src/linux-3.3-rc1> grep -r gfpflags . | wc -l
> 78
> z:/usr/src/linux-3.3-rc1> grep -r gfp_flags . | wc -l
> 548
>

I have no specific preference. Should I switch?

>> +{
>> +     cpumask_var_t cpus;
>> +     int cpu, ret;
>> +
>> +     might_sleep_if(gfpflags & __GFP_WAIT);
>
> For the zalloc_cpumask_var(), it seems.  I expect there are
> might_sleep() elsewhere in the memory allocation paths, but putting one
> here will detect bugs even if CONFIG_CPUMASK_OFFSTACK=n.

Well, yes, although I didn't think about that :-)

>
>> +     if (likely(zalloc_cpumask_var(&cpus, (gfpflags|__GFP_NOWARN)))) {
>> +             preempt_disable();
>> +             for_each_online_cpu(cpu)
>> +                     if (cond_func(cpu, info))
>> +                             cpumask_set_cpu(cpu, cpus);
>> +             on_each_cpu_mask(cpus, func, info, wait);
>> +             preempt_enable();
>> +             free_cpumask_var(cpus);
>> +     } else {
>> +             /*
>> +              * No free cpumask, bother. No matter, we'll
>> +              * just have to IPI them one by one.
>> +              */
>> +             preempt_disable();
>> +             for_each_online_cpu(cpu)
>> +                     if (cond_func(cpu, info)) {
>> +                             ret = smp_call_function_single(cpu, func,
>> +                                                             info, wait);
>> +                             WARN_ON_ONCE(ret);
>> +                     }
>> +             preempt_enable();
>> +     }
>> +}
>> +EXPORT_SYMBOL(on_each_cpu_cond);
>
> I assume the preempt_disable()s here are to suspend CPU hotplug?

Yes. Also, I figured that since the original code disabled preemption for
the entire on_each_cpu() run time, including waiting for all the CPUs to
ack the IPI, and since we (hopefully) wait for fewer CPUs, the overall
runtime with preemption disabled will usually be lower than with the
original code, and we get a more robust interface.
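
To make the intended usage concrete, patch 7's drain logic could map onto
this interface roughly like so (a sketch, not code from the series;
cpu_has_pcps() is a name I made up):

        static bool cpu_has_pcps(int cpu, void *info)
        {
                struct zone *zone;

                for_each_populated_zone(zone)
                        if (per_cpu_ptr(zone->pageset, cpu)->pcp.count)
                                return true;
                return false;
        }

        /* IPI only the CPUs that actually hold per-cpu pages */
        on_each_cpu_cond(cpu_has_pcps, drain_local_pages, NULL, true,
                         GFP_ATOMIC);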

Thanks,
Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 7/8] mm: only IPI CPUs to drain local pages if they exist
  2012-01-28  0:12   ` Andrew Morton
@ 2012-01-29 12:18     ` Gilad Ben-Yossef
  2012-01-30 21:49       ` Andrew Morton
  0 siblings, 1 reply; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-29 12:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Mel Gorman, KOSAKI Motohiro, Christoph Lameter,
	Chris Metcalf, Peter Zijlstra, Frederic Weisbecker, Russell King,
	linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Alexander Viro, linux-fsdevel, Avi Kivity,
	Michal Nazarewicz, Milton Miller

On Sat, Jan 28, 2012 at 2:12 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Thu, 26 Jan 2012 12:02:00 +0200
> Gilad Ben-Yossef <gilad@benyossef.com> wrote:
>
>> Calculate a cpumask of CPUs with per-cpu pages in any zone
>> and only send an IPI requesting CPUs to drain these pages
>> to the buddy allocator if they actually have pages when
>> asked to flush.
>>
...
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1165,7 +1165,36 @@ void drain_local_pages(void *arg)
>>   */
>>  void drain_all_pages(void)
>>  {
>> -     on_each_cpu(drain_local_pages, NULL, 1);
>> +     int cpu;
>> +     struct per_cpu_pageset *pcp;
>> +     struct zone *zone;
>> +
>> +     /* Allocate in the BSS so we won't require allocation in
>> +      * the direct reclaim path for CONFIG_CPUMASK_OFFSTACK=y
>> +      */
>> +     static cpumask_t cpus_with_pcps;
>> +
>> +     /*
>> +      * We don't care about racing with CPU hotplug events
>> +      * as offline notification will cause the notified
>> +      * cpu to drain its pcps, and on_each_cpu_mask
>> +      * disables preemption as part of its processing
>> +      */
>
> hmmm.
>
>> +     for_each_online_cpu(cpu) {
>> +             bool has_pcps = false;
>> +             for_each_populated_zone(zone) {
>> +                     pcp = per_cpu_ptr(zone->pageset, cpu);
>> +                     if (pcp->pcp.count) {
>> +                             has_pcps = true;
>> +                             break;
>> +                     }
>> +             }
>> +             if (has_pcps)
>> +                     cpumask_set_cpu(cpu, &cpus_with_pcps);
>> +             else
>> +                     cpumask_clear_cpu(cpu, &cpus_with_pcps);
>> +     }
>> +     on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1);
>>  }
>
> Can we end up sending an IPI to a now-unplugged CPU?  That won't work
> very well if that CPU is now sitting on its sysadmin's desk.

Nope. on_each_cpu_mask() disables preemption and calls smp_call_function_many(),
which then checks the mask against cpu_online_mask.
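
A minimal sketch of that path, paraphrasing patch 1 rather than quoting it:

        void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
                              void *info, bool wait)
        {
                int cpu = get_cpu();    /* pins us and blocks hotplug */

                /* only IPIs CPUs that are both in mask and online */
                smp_call_function_many(mask, func, info, wait);
                if (cpumask_test_cpu(cpu, mask)) {
                        local_irq_disable();
                        func(info);
                        local_irq_enable();
                }
                put_cpu();
        }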

> There's also the case of CPU online.  We could end up failing to IPI a
> CPU which now has some percpu pages.  That's not at all serious - 90%
> is good enough in page reclaim.  But this thinking merits a mention in
> the comment.  Or we simply make this code hotplug-safe.

hmm.. I'm probably daft but I don't see how to make the code hotplug safe for
the CPU online case. I mean, let's say we disable preemption throughout the
entire ordeal and then the CPU goes online and gets itself some percpu pages
*after* we've calculated the masks, sent the IPIs and waited for the whole
thing to finish, but before we've returned...

I might be missing something here, but I think that unless you have some other
means to stop newly hotplugged CPUs from grabbing per-cpu pages, there is
nothing you can do in this code to prevent it. Maybe make the race window
shorter, that's all.

Would adding a comment such as the following be OK?

"This code is protected against sending an IPI to an offline CPU but does not
guarantee sending an IPI to newly hotplugged CPUs"


Thanks,
Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 1/8] smp: introduce a generic on_each_cpu_mask function
  2012-01-26 10:01 ` [v7 1/8] smp: introduce a generic on_each_cpu_mask function Gilad Ben-Yossef
@ 2012-01-29 12:24   ` Gilad Ben-Yossef
  2012-01-30 21:52     ` Andrew Morton
  0 siblings, 1 reply; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-29 12:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gilad Ben-Yossef, Chris Metcalf, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Pekka Enberg,
	Matt Mackall, Rik van Riel, Andi Kleen, Sasha Levin, Mel Gorman,
	Alexander Viro, linux-fsdevel, Avi Kivity, Michal Nazarewicz,
	Kosaki Motohiro, Milton Miller, linux-kernel

On Thu, Jan 26, 2012 at 12:01 PM, Gilad Ben-Yossef <gilad@benyossef.com> wrote:
> on_each_cpu_mask calls a function on processors specified by
> cpumask, which may or may not include the local processor.
>
> You must not call this function with disabled interrupts or
> from a hardware interrupt handler or from a bottom half handler.
>
> Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
> Reviewed-by: Christoph Lameter <cl@linux.com>
> CC: Chris Metcalf <cmetcalf@tilera.com>
...
> ---
>  include/linux/smp.h |   22 ++++++++++++++++++++++
>  kernel/smp.c        |   29 +++++++++++++++++++++++++++++
>  2 files changed, 51 insertions(+), 0 deletions(-)
>


Milton made the very sensible comment that while adding on_each_cpu_mask() in
the arch-generic code and removing the two arch-specific instances from tile
and arm in separate patches is good for review, it will break bisect.

He suggested I squash them into a single commit when it goes in.

Since you picked the patch set into linux-mm, would now be a good time for that?

Thanks,
Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 7/8] mm: only IPI CPUs to drain local pages if they exist
  2012-01-26 10:02 ` [v7 7/8] mm: only IPI CPUs to drain local pages if they exist Gilad Ben-Yossef
  2012-01-26 15:13   ` Christoph Lameter
  2012-01-28  0:12   ` Andrew Morton
@ 2012-01-30 14:59   ` Mel Gorman
  2012-01-30 15:14     ` Gilad Ben-Yossef
  2 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2012-01-30 14:59 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, KOSAKI Motohiro, Christoph Lameter, Chris Metcalf,
	Peter Zijlstra, Frederic Weisbecker, Russell King, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Andrew Morton, Alexander Viro, linux-fsdevel,
	Avi Kivity, Michal Nazarewicz, Milton Miller

On Thu, Jan 26, 2012 at 12:02:00PM +0200, Gilad Ben-Yossef wrote:
> Calculate a cpumask of CPUs with per-cpu pages in any zone
> and only send an IPI requesting CPUs to drain these pages
> to the buddy allocator if they actually have pages when
> asked to flush.
> 
> This patch saves 85%+ of the IPIs asking to drain per-cpu
> pages in case of severe memory pressure that leads
> to OOM, since in these cases multiple, possibly concurrent,
> allocation requests end up in the direct reclaim code
> path. The per-cpu pages get reclaimed on the first
> allocation failure, so for most of the subsequent
> allocation attempts, until the memory pressure is off
> (possibly via the OOM killer), there are no per-cpu pages
> on most CPUs (and there can easily be hundreds of them).
> 
> This also has the side effect of shortening the average
> latency of direct reclaim by one or more orders of
> magnitude, since waiting for all the CPUs to ACK the
> IPI takes a long time.
> 
> Tested by running "hackbench 400" on an 8 CPU x86 VM and
> observing the difference between the number of direct
> reclaim attempts that end up in drain_all_pages() and
> those where more than 1/2 of the online CPUs had any
> per-cpu pages in them, using the vmstat counters introduced
> in the next patch in the series and using /proc/interrupts.
> 
> In the test scenario, this was seen to save around 3600 global
> IPIs after triggering an OOM on a concurrent workload:
> 
> $ cat /proc/vmstat | tail -n 2
> pcp_global_drain 0
> pcp_global_ipi_saved 0
> 
> $ cat /proc/interrupts | grep CAL
> CAL:          1          2          1          2
>           2          2          2          2   Function call interrupts
> 
> $ hackbench 400
> [OOM messages snipped]
> 
> $ cat /proc/vmstat | tail -n 2
> pcp_global_drain 3647
> pcp_global_ipi_saved 3642
> 
> $ cat /proc/interrupts | grep CAL
> CAL:          6         13          6          3
>           3          3         12          7   Function call interrupts
> 
> Please note that if the global drain is removed from the
> direct reclaim path as a patch from Mel Gorman currently
> suggests this should be replaced with an on_each_cpu_cond
> invocation.
> 
> Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
> CC: Mel Gorman <mel@csn.ul.ie>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> CC: Christoph Lameter <cl@linux.com>
> CC: Chris Metcalf <cmetcalf@tilera.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> CC: Russell King <linux@arm.linux.org.uk>
> CC: linux-mm@kvack.org
> CC: Pekka Enberg <penberg@kernel.org>
> CC: Matt Mackall <mpm@selenic.com>
> CC: Sasha Levin <levinsasha928@gmail.com>
> CC: Rik van Riel <riel@redhat.com>
> CC: Andi Kleen <andi@firstfloor.org>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Alexander Viro <viro@zeniv.linux.org.uk>
> CC: linux-fsdevel@vger.kernel.org
> CC: Avi Kivity <avi@redhat.com>
> CC: Michal Nazarewicz <mina86@mina86.com>
> CC: Milton Miller <miltonm@bga.com>
> ---
>  mm/page_alloc.c |   31 ++++++++++++++++++++++++++++++-
>  1 files changed, 30 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d2186ec..4135983 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1165,7 +1165,36 @@ void drain_local_pages(void *arg)
>   */
>  void drain_all_pages(void)
>  {
> -	on_each_cpu(drain_local_pages, NULL, 1);
> +	int cpu;
> +	struct per_cpu_pageset *pcp;
> +	struct zone *zone;
> +
> +	/* Allocate in the BSS so we won't require allocation in
> +	 * the direct reclaim path for CONFIG_CPUMASK_OFFSTACK=y
> +	 */
> +	static cpumask_t cpus_with_pcps;
> +
> +	/*
> +	 * We don't care about racing with CPU hotplug events
> +	 * as offline notification will cause the notified
> +	 * cpu to drain its pcps, and on_each_cpu_mask
> +	 * disables preemption as part of its processing
> +	 */
> +	for_each_online_cpu(cpu) {
> +		bool has_pcps = false;
> +		for_each_populated_zone(zone) {
> +			pcp = per_cpu_ptr(zone->pageset, cpu);
> +			if (pcp->pcp.count) {
> +				has_pcps = true;
> +				break;
> +			}
> +		}
> +		if (has_pcps)
> +			cpumask_set_cpu(cpu, &cpus_with_pcps);
> +		else
> +			cpumask_clear_cpu(cpu, &cpus_with_pcps);
> +	}

Let's take two CPUs running this code at the same time. CPU 1 has per-cpu
pages in all zones. CPU 2 has no per-cpu pages in any zone. If both run
at the same time, CPU 2 can be clearing the mask for CPU 1 before it has
had a chance to send the IPI. This means we'll miss sending IPIs to CPUs
that we intended to. As I was willing to send no IPIs at all:

Acked-by: Mel Gorman <mel@csn.ul.ie>

But if this gets another revision, add a comment saying that two CPUs
can interfere with each other running at the same time but we don't
care.

> +	on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1);
>  }
>  
>  #ifdef CONFIG_HIBERNATION
> -- 
> 1.7.0.4
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 7/8] mm: only IPI CPUs to drain local pages if they exist
  2012-01-30 14:59   ` Mel Gorman
@ 2012-01-30 15:14     ` Gilad Ben-Yossef
  2012-01-30 15:44       ` Mel Gorman
  0 siblings, 1 reply; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-30 15:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, KOSAKI Motohiro, Christoph Lameter, Chris Metcalf,
	Peter Zijlstra, Frederic Weisbecker, Russell King, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Andrew Morton, Alexander Viro, linux-fsdevel,
	Avi Kivity, Michal Nazarewicz, Milton Miller

On Mon, Jan 30, 2012 at 4:59 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Thu, Jan 26, 2012 at 12:02:00PM +0200, Gilad Ben-Yossef wrote:
>> Calculate a cpumask of CPUs with per-cpu pages in any zone
>> and only send an IPI requesting CPUs to drain these pages
>> to the buddy allocator if they actually have pages when
>> asked to flush.
>>
...
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d2186ec..4135983 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1165,7 +1165,36 @@ void drain_local_pages(void *arg)
>>   */
>>  void drain_all_pages(void)
>>  {
>> -     on_each_cpu(drain_local_pages, NULL, 1);
>> +     int cpu;
>> +     struct per_cpu_pageset *pcp;
>> +     struct zone *zone;
>> +
>> +     /* Allocate in the BSS so we won't require allocation in
>> +      * the direct reclaim path for CONFIG_CPUMASK_OFFSTACK=y
>> +      */
>> +     static cpumask_t cpus_with_pcps;
>> +
>> +     /*
>> +      * We don't care about racing with CPU hotplug events
>> +      * as offline notification will cause the notified
>> +      * cpu to drain its pcps, and on_each_cpu_mask
>> +      * disables preemption as part of its processing
>> +      */
>> +     for_each_online_cpu(cpu) {
>> +             bool has_pcps = false;
>> +             for_each_populated_zone(zone) {
>> +                     pcp = per_cpu_ptr(zone->pageset, cpu);
>> +                     if (pcp->pcp.count) {
>> +                             has_pcps = true;
>> +                             break;
>> +                     }
>> +             }
>> +             if (has_pcps)
>> +                     cpumask_set_cpu(cpu, &cpus_with_pcps);
>> +             else
>> +                     cpumask_clear_cpu(cpu, &cpus_with_pcps);
>> +     }
>
> Let's take two CPUs running this code at the same time. CPU 1 has per-cpu
> pages in all zones. CPU 2 has no per-cpu pages in any zone. If both run
> at the same time, CPU 2 can be clearing the mask for CPU 1 before it has
> had a chance to send the IPI. This means we'll miss sending IPIs to CPUs
> that we intended to.

I'm confused. You seem to be assuming that each CPU is looking at its own pcps
only (per zone). Assuming no change in the state of the pcps when both CPUs
run this code at the same time, both of them should mark the bit for their
respective CPUs the same way, unless one of them raced and managed to send the
IPI to clear pcps from the other, at which point you might see one of them
send a spurious IPI to drain pcps that have already been drained - but that
isn't bad.

At least, that is what I meant the code to do and what I believe it does.
What have I missed?

> As I was willing to send no IPIs at all:
>
> Acked-by: Mel Gorman <mel@csn.ul.ie>

Thank you for the review and the ACK :-)
>
> But if this gets another revision, add a comment saying that two CPUs
> can interfere with each other running at the same time but we don't
> care.
>
>> +     on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1);
>>  }
>>

Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 7/8] mm: only IPI CPUs to drain local pages if they exist
  2012-01-30 15:14     ` Gilad Ben-Yossef
@ 2012-01-30 15:44       ` Mel Gorman
  0 siblings, 0 replies; 77+ messages in thread
From: Mel Gorman @ 2012-01-30 15:44 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, KOSAKI Motohiro, Christoph Lameter, Chris Metcalf,
	Peter Zijlstra, Frederic Weisbecker, Russell King, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Andrew Morton, Alexander Viro, linux-fsdevel,
	Avi Kivity, Michal Nazarewicz, Milton Miller

On Mon, Jan 30, 2012 at 05:14:37PM +0200, Gilad Ben-Yossef wrote:
> >> +     for_each_online_cpu(cpu) {
> >> +             bool has_pcps = false;
> >> +             for_each_populated_zone(zone) {
> >> +                     pcp = per_cpu_ptr(zone->pageset, cpu);
> >> +                     if (pcp->pcp.count) {
> >> +                             has_pcps = true;
> >> +                             break;
> >> +                     }
> >> +             }
> >> +             if (has_pcps)
> >> +                     cpumask_set_cpu(cpu, &cpus_with_pcps);
> >> +             else
> >> +                     cpumask_clear_cpu(cpu, &cpus_with_pcps);
> >> +     }
> >
> > Let's take two CPUs running this code at the same time. CPU 1 has per-cpu
> > pages in all zones. CPU 2 has no per-cpu pages in any zone. If both run
> > at the same time, CPU 2 can be clearing the mask for CPU 1 before it has
> > had a chance to send the IPI. This means we'll miss sending IPIs to CPUs
> > that we intended to.
> 
> I'm confused. You seem to be assuming that each CPU is looking at its own pcps
> only (per zone).

/me slaps self

I was assuming exactly this.

> Assuming no change in the state of the pcps when both CPUs
> run this code at the same time, both of them should mark the bit for their
> respective CPUs the same way, unless one of them raced and managed to send
> the IPI to clear pcps from the other, at which point you might see one of
> them send a spurious IPI to drain pcps that have already been drained - but
> that isn't bad.
> 

Indeed, the race is tiny and the consequences are not important.

> At least, that is what I meant the code to do and what I believe it does.
> What have I missed?
> 

Nothing, the problem was on my side. Sorry for the noise.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 7/8] mm: only IPI CPUs to drain local pages if they exist
  2012-01-29 12:18     ` Gilad Ben-Yossef
@ 2012-01-30 21:49       ` Andrew Morton
  2012-01-31  6:32         ` Gilad Ben-Yossef
  0 siblings, 1 reply; 77+ messages in thread
From: Andrew Morton @ 2012-01-30 21:49 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, Mel Gorman, KOSAKI Motohiro, Christoph Lameter,
	Chris Metcalf, Peter Zijlstra, Frederic Weisbecker, Russell King,
	linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Alexander Viro, linux-fsdevel, Avi Kivity,
	Michal Nazarewicz, Milton Miller

On Sun, 29 Jan 2012 14:18:32 +0200
Gilad Ben-Yossef <gilad@benyossef.com> wrote:

> On Sat, Jan 28, 2012 at 2:12 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Thu, 26 Jan 2012 12:02:00 +0200
> > Gilad Ben-Yossef <gilad@benyossef.com> wrote:
> >
> >> Calculate a cpumask of CPUs with per-cpu pages in any zone
> >> and only send an IPI requesting CPUs to drain these pages
> >> to the buddy allocator if they actually have pages when
> >> asked to flush.
> >>
>
> ...
>
> > Can we end up sending an IPI to a now-unplugged CPU?  That won't work
> > very well if that CPU is now sitting on its sysadmin's desk.
> 
> Nope. on_each_cpu_mask() disables preemption and calls smp_call_function_many()
> which then checks the mask against the cpu_online_mask

OK.

General rule of thumb: if a reviewer asked something then it is likely
that others will wonder the same thing when reading the code later on. 
So consider reviewer questions as a sign that the code needs additional
comments!

> > There's also the case of CPU online.  We could end up failing to IPI a
> > CPU which now has some percpu pages.  That's not at all serious - 90%
> > is good enough in page reclaim.  But this thinking merits a mention in
> > the comment.  Or we simply make this code hotplug-safe.
> 
> hmm.. I'm probably daft but I don't see how to make the code hotplug safe for
> the CPU online case. I mean, let's say we disable preemption throughout the
> entire ordeal and then the CPU goes online and gets itself some percpu pages
> *after* we've calculated the masks, sent the IPIs and waited for the whole
> thing to finish, but before we've returned...

This is inherent to the whole drain-pages design - it's only a
best-effort thing and there's nothing to prevent other CPUs from
undoing your work 2 nanoseconds later.

The exception to this is the case of suspend, which drains the queues
when all tasks (and, hopefully, IRQs) have been frozen.  This is the
only way to make draining 100% "reliable".

> I might be missing something here, but I think that unless you have some other
> means to stop newly hotplugged CPUs from grabbing per-cpu pages, there is
> nothing you can do in this code to prevent it. Maybe make the race window
> shorter, that's all.
> 
> Would adding a comment such as the following be OK?
> 
> "This code is protected against sending an IPI to an offline CPU but does not
> guarantee sending an IPI to newly hotplugged CPUs"

Looks OK.  I'd mention *how* this protection comes about:
on_each_cpu_mask() blocks hotplug and won't talk to offlined CPUs.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 1/8] smp: introduce a generic on_each_cpu_mask function
  2012-01-29 12:24   ` Gilad Ben-Yossef
@ 2012-01-30 21:52     ` Andrew Morton
  2012-01-31  6:33       ` Gilad Ben-Yossef
  0 siblings, 1 reply; 77+ messages in thread
From: Andrew Morton @ 2012-01-30 21:52 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Chris Metcalf, Peter Zijlstra, Frederic Weisbecker, Russell King,
	linux-mm, Pekka Enberg, Matt Mackall, Rik van Riel, Andi Kleen,
	Sasha Levin, Mel Gorman, Alexander Viro, linux-fsdevel,
	Avi Kivity, Michal Nazarewicz, Kosaki Motohiro, Milton Miller,
	linux-kernel

On Sun, 29 Jan 2012 14:24:16 +0200
Gilad Ben-Yossef <gilad@benyossef.com> wrote:

> On Thu, Jan 26, 2012 at 12:01 PM, Gilad Ben-Yossef <gilad@benyossef.com> wrote:
> > on_each_cpu_mask calls a function on processors specified by
> > cpumask, which may or may not include the local processor.
> >
> > You must not call this function with disabled interrupts or
> > from a hardware interrupt handler or from a bottom half handler.
> >
> > Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
> > Reviewed-by: Christoph Lameter <cl@linux.com>
> > CC: Chris Metcalf <cmetcalf@tilera.com>
> ...
> > ---
> >  include/linux/smp.h |   22 ++++++++++++++++++++++
> >  kernel/smp.c        |   29 +++++++++++++++++++++++++++++
> >  2 files changed, 51 insertions(+), 0 deletions(-)
> >
> 
> 
> Milton made the very sensible comment that while adding on_each_cpu_mask() in
> the arch-generic code and removing the two arch-specific instances from tile
> and arm in separate patches is good for review, it will break bisect.
> 
> He suggested I squash them into a single commit when it goes in.
> 
> Since you picked the patch set into linux-mm, would now be a good time for that?

I can fold the patches together - I do that all the time.

Please identify exactly which patches you're referring to here.

arm-move-arm-over-to-generic-on_each_cpu_mask and
tile-move-tile-to-use-generic-on_each_cpu_mask should be folded into
smp-introduce-a-generic-on_each_cpu_mask-function, yes?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 7/8] mm: only IPI CPUs to drain local pages if they exist
  2012-01-30 21:49       ` Andrew Morton
@ 2012-01-31  6:32         ` Gilad Ben-Yossef
  0 siblings, 0 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-31  6:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Mel Gorman, KOSAKI Motohiro, Christoph Lameter,
	Chris Metcalf, Peter Zijlstra, Frederic Weisbecker, Russell King,
	linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Alexander Viro, linux-fsdevel, Avi Kivity,
	Michal Nazarewicz, Milton Miller

On Mon, Jan 30, 2012 at 11:49 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Sun, 29 Jan 2012 14:18:32 +0200
> Gilad Ben-Yossef <gilad@benyossef.com> wrote:
>
>> On Sat, Jan 28, 2012 at 2:12 AM, Andrew Morton
>> <akpm@linux-foundation.org> wrote:
>> > On Thu, 26 Jan 2012 12:02:00 +0200
>> > Gilad Ben-Yossef <gilad@benyossef.com> wrote:
>> >
>> >> Calculate a cpumask of CPUs with per-cpu pages in any zone
>> >> and only send an IPI requesting CPUs to drain these pages
>> >> to the buddy allocator if they actually have pages when
>> >> asked to flush.
>> >>
>>
>> ...
>>
>> > Can we end up sending an IPI to a now-unplugged CPU?  That won't work
>> > very well if that CPU is now sitting on its sysadmin's desk.
>>
>> Nope. on_each_cpu_mask() disables preemption and calls smp_call_function_many()
>> which then checks the mask against the cpu_online_mask
>
> OK.
>
> General rule of thumb: if a reviewer asked something then it is likely
> that others will wonder the same thing when reading the code later on.
> So consider reviewer questions as a sign that the code needs additional
> comments!

Right, point taken.

>
>> > There's also the case of CPU online.  We could end up failing to IPI a
>> > CPU which now has some percpu pages.  That's not at all serious - 90%
>> > is good enough in page reclaim.  But this thinking merits a mention in
>> > the comment.  Or we simply make this code hotplug-safe.
>>
>> hmm.. I'm probably daft but I don't see how to make the code hotplug safe for
>> the CPU online case. I mean, let's say we disable preemption throughout the
>> entire ordeal and then the CPU goes online and gets itself some percpu pages
>> *after* we've calculated the masks, sent the IPIs and waited for the whole
>> thing to finish, but before we've returned...
>
> This is inherent to the whole drain-pages design - it's only a
> best-effort thing and there's nothing to prevent other CPUs from
> undoing your work 2 nanoseconds later.
>
> The exception to this is the case of suspend, which drains the queues
> when all tasks (and, hopefully, IRQs) have been frozen.  This is the
> only way to make draining 100% "reliable".
>
>> I might be missing something here, but I think that unless you have some other
>> means to stop newly hotplugged CPUs from grabbing per-cpu pages, there is
>> nothing you can do in this code to prevent it. Maybe make the race window
>> shorter, that's all.
>>
>> Would adding a comment such as the following be OK?
>>
>> "This code is protected against sending an IPI to an offline CPU but does not
>> guarantee sending an IPI to newly hotplugged CPUs"
>
> Looks OK.  I'd mention *how* this protection comes about:
> on_each_cpu_mask() blocks hotplug and won't talk to offlined CPUs.

Good. I'll send an updated patch set.

Thanks :-)
Gilad



-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 1/8] smp: introduce a generic on_each_cpu_mask function
  2012-01-30 21:52     ` Andrew Morton
@ 2012-01-31  6:33       ` Gilad Ben-Yossef
  0 siblings, 0 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-31  6:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chris Metcalf, Peter Zijlstra, Frederic Weisbecker, Russell King,
	linux-mm, Pekka Enberg, Matt Mackall, Rik van Riel, Andi Kleen,
	Sasha Levin, Mel Gorman, Alexander Viro, linux-fsdevel,
	Avi Kivity, Michal Nazarewicz, Kosaki Motohiro, Milton Miller,
	linux-kernel

On Mon, Jan 30, 2012 at 11:52 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Sun, 29 Jan 2012 14:24:16 +0200
> Gilad Ben-Yossef <gilad@benyossef.com> wrote:
>
>> On Thu, Jan 26, 2012 at 12:01 PM, Gilad Ben-Yossef <gilad@benyossef.com> wrote:
>> > on_each_cpu_mask calls a function on processors specified by
>> > cpumask, which may or may not include the local processor.
>> >
>> > You must not call this function with disabled interrupts or
>> > from a hardware interrupt handler or from a bottom half handler.
>> >
>> > Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
>> > Reviewed-by: Christoph Lameter <cl@linux.com>
>> > CC: Chris Metcalf <cmetcalf@tilera.com>
>> ...
>> > ---
>> >  include/linux/smp.h |   22 ++++++++++++++++++++++
>> >  kernel/smp.c        |   29 +++++++++++++++++++++++++++++
>> >  2 files changed, 51 insertions(+), 0 deletions(-)
>> >
>>
>>
>> Milton made the very sensible comment that while adding on_each_cpu_mask() in
>> the arch-generic code and removing the two arch-specific instances from tile
>> and arm in separate patches is good for review, it will break bisect.
>>
>> He suggested I squash them into a single commit when it goes in.
>>
>> Since you picked the patch set into linux-mm, would now be a good time for that?
>
> I can fold the patches together - I do that all the time.
>
> Please identify exactly which patches you're referring to here.
>
> arm-move-arm-over-to-generic-on_each_cpu_mask and
> tile-move-tile-to-use-generic-on_each_cpu_mask should be folded into
> smp-introduce-a-generic-on_each_cpu_mask-function, yes?

Yes. Thank you.

Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-01-29  8:25   ` Gilad Ben-Yossef
@ 2012-02-01 17:04     ` Frederic Weisbecker
  2012-02-02  8:46       ` Gilad Ben-Yossef
  2012-02-01 17:35     ` Peter Zijlstra
  1 sibling, 1 reply; 77+ messages in thread
From: Frederic Weisbecker @ 2012-02-01 17:04 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Peter Zijlstra, linux-kernel, Christoph Lameter, Chris Metcalf,
	linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Avi Kivity, Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On Sun, Jan 29, 2012 at 10:25:46AM +0200, Gilad Ben-Yossef wrote:
> On Thu, Jan 26, 2012 at 5:19 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > On Thu, 2012-01-26 at 12:01 +0200, Gilad Ben-Yossef wrote:
> > > Gilad Ben-Yossef (8):
> > >   smp: introduce a generic on_each_cpu_mask function
> > >   arm: move arm over to generic on_each_cpu_mask
> > >   tile: move tile to use generic on_each_cpu_mask
> > >   smp: add func to IPI cpus based on parameter func
> > >   slub: only IPI CPUs that have per cpu obj to flush
> > >   fs: only send IPI to invalidate LRU BH when needed
> > >   mm: only IPI CPUs to drain local pages if they exist
> >
> > These patches look very nice!
> >
> > Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> >
> 
> Thank you :-)
> 
> If this is of interest, I keep a list tracking global IPI and global
> task schedulers sources in the core kernel here:
> https://github.com/gby/linux/wiki.
> 
> I plan to visit all these potential interference source to see if
> something can be done to lower their effect on
> isolated CPUs over time.

Very nice especially as many people seem to be interested in
CPU isolation.

When we get the adaptive tickless feature in place, perhaps we'll
also need to think about some way to have more control on the
CPU affinity of some non pinned timers to avoid disturbing
adaptive tickless CPUs. We still need to consider their cache affinity
though.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-01-29  8:25   ` Gilad Ben-Yossef
  2012-02-01 17:04     ` Frederic Weisbecker
@ 2012-02-01 17:35     ` Peter Zijlstra
  2012-02-01 17:57       ` Peter Zijlstra
  2012-02-01 18:40       ` Paul E. McKenney
  1 sibling, 2 replies; 77+ messages in thread
From: Peter Zijlstra @ 2012-02-01 17:35 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, Christoph Lameter, Chris Metcalf,
	Frederic Weisbecker, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller, paulmck

On Sun, 2012-01-29 at 10:25 +0200, Gilad Ben-Yossef wrote:
> 
> If this is of interest, I keep a list tracking global IPI and global
> task schedulers sources in the core kernel here:
> https://github.com/gby/linux/wiki. 

You can add synchronize_.*_expedited() to the list, it does its best to
bash the entire machine in order to try and make RCU grace periods
happen fast.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-01 17:35     ` Peter Zijlstra
@ 2012-02-01 17:57       ` Peter Zijlstra
  2012-02-02  9:42         ` Gilad Ben-Yossef
  2012-02-01 18:40       ` Paul E. McKenney
  1 sibling, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2012-02-01 17:57 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, Christoph Lameter, Chris Metcalf,
	Frederic Weisbecker, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller, paulmck

On Wed, 2012-02-01 at 18:35 +0100, Peter Zijlstra wrote:
> On Sun, 2012-01-29 at 10:25 +0200, Gilad Ben-Yossef wrote:
> > 
> > If this is of interest, I keep a list tracking global IPI and global
> > task schedulers sources in the core kernel here:
> > https://github.com/gby/linux/wiki. 
> 
> You can add synchronize_.*_expedited() to the list, it does its best to
> bash the entire machine in order to try and make RCU grace periods
> happen fast.

Also anything using stop_machine, such as module unload, cpu hot-unplug
and text_poke().

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-01 17:35     ` Peter Zijlstra
  2012-02-01 17:57       ` Peter Zijlstra
@ 2012-02-01 18:40       ` Paul E. McKenney
  2012-02-01 20:06         ` Christoph Lameter
  1 sibling, 1 reply; 77+ messages in thread
From: Paul E. McKenney @ 2012-02-01 18:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben-Yossef, linux-kernel, Christoph Lameter, Chris Metcalf,
	Frederic Weisbecker, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Wed, Feb 01, 2012 at 06:35:22PM +0100, Peter Zijlstra wrote:
> On Sun, 2012-01-29 at 10:25 +0200, Gilad Ben-Yossef wrote:
> > 
> > If this is of interest, I keep a list tracking global IPI and global
> > task schedulers sources in the core kernel here:
> > https://github.com/gby/linux/wiki. 
> 
> You can add synchronize_.*_expedited() to the list, it does its best to
> bash the entire machine in order to try and make RCU grace periods
> happen fast.

I have duly added "Make synchronize_sched_expedited() avoid IPIing idle
CPUs" to http://kernel.org/pub/linux/kernel/people/paulmck/rcutodo.html.

This should not be hard once I have built up some trust in the new
RCU idle-detection code.  It would also automatically apply to
Frederic's dyntick-idle userspace work.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-01 18:40       ` Paul E. McKenney
@ 2012-02-01 20:06         ` Christoph Lameter
  2012-02-01 20:13           ` Paul E. McKenney
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2012-02-01 20:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Gilad Ben-Yossef, linux-kernel, Chris Metcalf,
	Frederic Weisbecker, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Wed, 1 Feb 2012, Paul E. McKenney wrote:

> On Wed, Feb 01, 2012 at 06:35:22PM +0100, Peter Zijlstra wrote:
> > On Sun, 2012-01-29 at 10:25 +0200, Gilad Ben-Yossef wrote:
> > >
> > > If this is of interest, I keep a list tracking global IPI and global
> > > task schedulers sources in the core kernel here:
> > > https://github.com/gby/linux/wiki.
> >
> > You can add synchronize_.*_expedited() to the list, it does its best to
> > bash the entire machine in order to try and make RCU grace periods
> > happen fast.
>
> I have duly added "Make synchronize_sched_expedited() avoid IPIing idle
> CPUs" to http://kernel.org/pub/linux/kernel/people/paulmck/rcutodo.html.
>
> This should not be hard once I have built up some trust in the new
> RCU idle-detection code.  It would also automatically apply to
> Frederic's dyntick-idle userspace work.

Could we also apply the same approach to processors busy doing
computational work? In that case the OS is also not needed. Interrupting
these activities impacts performance and latency.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-01 20:06         ` Christoph Lameter
@ 2012-02-01 20:13           ` Paul E. McKenney
  2012-02-02  9:34             ` Avi Kivity
  0 siblings, 1 reply; 77+ messages in thread
From: Paul E. McKenney @ 2012-02-01 20:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Gilad Ben-Yossef, linux-kernel, Chris Metcalf,
	Frederic Weisbecker, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Wed, Feb 01, 2012 at 02:06:07PM -0600, Christoph Lameter wrote:
> On Wed, 1 Feb 2012, Paul E. McKenney wrote:
> 
> > On Wed, Feb 01, 2012 at 06:35:22PM +0100, Peter Zijlstra wrote:
> > > On Sun, 2012-01-29 at 10:25 +0200, Gilad Ben-Yossef wrote:
> > > >
> > > > If this is of interest, I keep a list tracking global IPI and global
> > > > task schedulers sources in the core kernel here:
> > > > https://github.com/gby/linux/wiki.
> > >
> > > You can add synchronize_.*_expedited() to the list, it does its best to
> > > bash the entire machine in order to try and make RCU grace periods
> > > happen fast.
> >
> > I have duly added "Make synchronize_sched_expedited() avoid IPIing idle
> > CPUs" to http://kernel.org/pub/linux/kernel/people/paulmck/rcutodo.html.
> >
> > This should not be hard once I have built up some trust in the new
> > RCU idle-detection code.  It would also automatically apply to
> > Frederic's dyntick-idle userspace work.
> 
> Could we also apply the same approach to processors busy doing
> computational work? In that case the OS is also not needed. Interrupting
> these activities impacts performance and latency.

Yep, that is in fact what Frederic's dyntick-idle userspace work does.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-01 17:04     ` Frederic Weisbecker
@ 2012-02-02  8:46       ` Gilad Ben-Yossef
  2012-02-02 15:41         ` Chris Metcalf
  2012-02-02 16:24         ` Frederic Weisbecker
  0 siblings, 2 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-02-02  8:46 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, linux-kernel, Christoph Lameter, Chris Metcalf,
	linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Avi Kivity, Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On Wed, Feb 1, 2012 at 7:04 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>
> On Sun, Jan 29, 2012 at 10:25:46AM +0200, Gilad Ben-Yossef wrote:
>
> > If this is of interest, I keep a list tracking global IPI and global
> > task schedulers sources in the core kernel here:
> > https://github.com/gby/linux/wiki.
> >
> > I plan to visit all these potential interference source to see if
> > something can be done to lower their effect on
> > isolated CPUs over time.
>
> Very nice especially as many people seem to be interested in
> CPU isolation.


Yes, that is what drives me as well. I have a bare metal program
I'm trying to kill here. I researched CPU isolation and ran into your
nohz patch set and asked myself: "OK, if we disable the tick, what else
is in the way?"

>
>
> When we get the adaptive tickless feature in place, perhaps we'll
> also need to think about some way to have more control on the
> CPU affinity of some non pinned timers to avoid disturbing
> adaptive tickless CPUs. We still need to consider their cache affinity
> though.


Right. I'm thinking we can treat a CPU going into adaptive tick mode in a
similar fashion to a CPU going offline for the purpose of timer migration.

Some pinned timers might be able to get special treatment as well - take for
example the vmstat work being scheduled every second: what should we do with
it for CPU isolation?

It makes sense to me to have that stop scheduling itself when we have the tick
disabled for both idle and a nohz task.

A similar thing can be said for the clocksource watchdog for example - we might
consider having it not trigger stuff on idle or nohz task CPUs

Maybe we can have some notification mechanism for when a task goes into nohz
mode and back, to let stuff disable itself and re-enable if it makes sense.
It seems more sensible than having all these individual pieces check whether
this CPU or another is in idle or nohz task mode.

The question for nohz task then is when does the notification need to go out?
Only when a task managed to go into nohz mode, or when we add a cpu to an
adaptive tick cpuset? Because for stuff like vmstat, the very existence of the
runnable workqueue thread can keep a task from going into nohz mode. Bah,
maybe we need two notifications...
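
A sketch of the sort of thing I have in mind, using the existing notifier
chain machinery (everything here is made up for illustration; no such
interface exists yet):

        #include <linux/notifier.h>

        /* hypothetical events */
        #define NOHZ_TASK_ENTER 0       /* a CPU entered nohz task mode */
        #define NOHZ_TASK_EXIT  1       /* a CPU left nohz task mode */

        static ATOMIC_NOTIFIER_HEAD(nohz_task_chain);

        int register_nohz_task_notifier(struct notifier_block *nb)
        {
                return atomic_notifier_chain_register(&nohz_task_chain, nb);
        }

        /* called by the nohz core; subscribers like vmstat could then
         * park or re-arm their per-cpu work based on the event */
        static void nohz_task_notify(unsigned long event, int cpu)
        {
                atomic_notifier_call_chain(&nohz_task_chain, event,
                                           (void *)(long)cpu);
        }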


Thanks!
Gilad
--
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-01 20:13           ` Paul E. McKenney
@ 2012-02-02  9:34             ` Avi Kivity
  2012-02-02 15:34               ` Paul E. McKenney
  0 siblings, 1 reply; 77+ messages in thread
From: Avi Kivity @ 2012-02-02  9:34 UTC (permalink / raw)
  To: paulmck
  Cc: Christoph Lameter, Peter Zijlstra, Gilad Ben-Yossef,
	linux-kernel, Chris Metcalf, Frederic Weisbecker, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On 02/01/2012 10:13 PM, Paul E. McKenney wrote:
> > 
> > Could we also apply the same approach to processors busy doing
> > computational work? In that case the OS is also not needed. Interrupting
> > these activities impacts performance and latency.
>
> Yep, that is in fact what Frederic's dyntick-idle userspace work does.
>

Running in a guest is a special case of running in userspace, so we'd
need to extend this work to kvm as well.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-01 17:57       ` Peter Zijlstra
@ 2012-02-02  9:42         ` Gilad Ben-Yossef
  0 siblings, 0 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-02-02  9:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Christoph Lameter, Chris Metcalf,
	Frederic Weisbecker, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller, paulmck

On Wed, Feb 1, 2012 at 7:57 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2012-02-01 at 18:35 +0100, Peter Zijlstra wrote:
>> On Sun, 2012-01-29 at 10:25 +0200, Gilad Ben-Yossef wrote:
>> >
>> > If this is of interest, I keep a list tracking global IPI and global
>> > task schedulers sources in the core kernel here:
>> > https://github.com/gby/linux/wiki.
>>
>> You can add synchronize_.*_expedited() to the list, it does its best to
>> bash the entire machine in order to try and make RCU grace periods
>> happen fast.
>
> Also anything using stop_machine, such as module unload, cpu hot-unplug
> and text_poke().

Thanks! I've added it to the list together with the clocksource watchdog,
which registers a timer on each cpu in a cyclic fashion.

Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-02  9:34             ` Avi Kivity
@ 2012-02-02 15:34               ` Paul E. McKenney
  2012-02-02 16:14                 ` Avi Kivity
  0 siblings, 1 reply; 77+ messages in thread
From: Paul E. McKenney @ 2012-02-02 15:34 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christoph Lameter, Peter Zijlstra, Gilad Ben-Yossef,
	linux-kernel, Chris Metcalf, Frederic Weisbecker, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On Thu, Feb 02, 2012 at 11:34:25AM +0200, Avi Kivity wrote:
> On 02/01/2012 10:13 PM, Paul E. McKenney wrote:
> > > 
> > > Could we also apply the same approach to processors busy doing
> > > computational work? In that case the OS is also not needed. Interrupting
> > > these activities impacts performance and latency.
> >
> > Yep, that is in fact what Frederic's dyntick-idle userspace work does.
> 
> Running in a guest is a special case of running in userspace, so we'd
> need to extend this work to kvm as well.

As long as rcu_idle_enter() is called at the appropriate time, RCU will
happily ignore the CPU.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-02  8:46       ` Gilad Ben-Yossef
@ 2012-02-02 15:41         ` Chris Metcalf
  2012-02-05 11:46           ` Gilad Ben-Yossef
                             ` (2 more replies)
  2012-02-02 16:24         ` Frederic Weisbecker
  1 sibling, 3 replies; 77+ messages in thread
From: Chris Metcalf @ 2012-02-02 15:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben-Yossef, Peter Zijlstra, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On 2/2/2012 3:46 AM, Gilad Ben-Yossef wrote:
> On Wed, Feb 1, 2012 at 7:04 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>> Very nice especially as many people seem to be interested in CPU isolation.

Indeed!

> Yes, that is what drives me as well. I have a bare metal program
> I'm trying to kill here. I researched CPU isolation and ran into your
> nohz patch set and asked myself: "OK, if we disable the tick, what else
> is in the way?"

At Tilera we have been supporting a "dataplane" mode (aka Zero Overhead
Linux - the marketing name).  This is configured on a per-cpu basis, and in
addition to setting isolcpus for those nodes, also suppresses various
things that might otherwise run (soft lockup detection, vmstat work,
etc.).  The claim is that you need to specify these kinds of things
per-core since it's not always possible for the kernel to know that you
really don't want the scheduler or any other interrupt source to touch the
core, as opposed to the case where you just happen to have a single process
scheduled on the core and you don't mind occasional interrupts.  But
there's definitely appeal in having the kernel do it adaptively too,
particularly if it can be made to work just as well as configuring it
statically.

We also have a set_dataplane() syscall that a task can make to allow it to
request some additional semantics from the kernel, such as various
debugging modes, a flag to request populating the page table fully, and a
flag to request that all pending kernel timer ticks, etc., happen while the
task spins in the kernel before actually returning to userspace from a
syscall (so you don't get unexpected interrupts once you're back in
userspace).  I've appended the relevant bits of <asm/dataplane.h> for more
details.

We've been planning to start working with the community on upstreaming this,
but since fiddling with the scheduler is pretty tricky stuff and it wasn't
clear there was a lot of interest, we've been deferring it in favor of
other activities.  But seeing more about Frederic Weisbecker's and Gilad
Ben-Yossef's work makes me think that it might be a good time for us to
start that process.  For a start I'll see about putting up a git branch on
kernel.org that has our dataplane stuff in it, for reference.

/*
 * Quiesce the timer interrupt before returning to user space after a
 * system call.  Normally if a task on a dataplane core makes a
 * syscall, the system will run one or more timer ticks after the
 * syscall has completed, causing unexpected interrupts in userspace.
 * Setting DP_QUIESCE avoids that problem by having the kernel "hold"
 * the task in kernel mode until the timer ticks are complete.  This
 * will make syscalls dramatically slower.
 *
 * If multiple dataplane tasks are scheduled on a single core, this
 * in effect silently disables DP_QUIESCE, which allows the tasks to make
 * progress, but without actually disabling the timer tick.
 */
#define DP_QUIESCE      0x1

/*
 * Disallow the application from entering the kernel in any way,
 * unless it calls set_dataplane() again without this bit set.
 * Issuing any other syscall or causing a page fault would generate a
 * kernel message, and "kill -9" the process.
 *
 * Setting this flag automatically sets DP_QUIESCE as well.
 */
#define DP_STRICT       0x2

/*
 * Debug dataplane interrupts, so that if any interrupt source
 * attempts to involve a dataplane cpu, a kernel message and stack
 * backtrace will be generated on the console.  As this warning is a
 * slow event, it may make sense to avoid this mode in production code
 * to avoid making any possible interrupts even more heavyweight.
 *
 * Setting this flag automatically sets DP_QUIESCE as well.
 */
#define DP_DEBUG        0x4

/*
 * Cause all memory mappings to be populated in the page table.
 * Specifying this when entering dataplane mode ensures that no future
 * page fault events will occur to cause interrupts into the Linux
 * kernel, as long as no new mappings are installed by mmap(), etc.
 * Note that since the hardware TLB is of finite size, there will
 * still be the potential for TLB misses that the hypervisor handles,
 * either via its software TLB cache (fast path) or by walking the
 * kernel page tables (slow path), so touching large amounts of memory
 * will still incur hypervisor interrupt overhead.
 */
#define DP_POPULATE     0x8
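
As a rough usage sketch (the exact set_dataplane() prototype isn't shown
here, so the single-flags-argument call shape and the error convention are
assumptions on my part):

        /* hypothetical userspace usage: pre-populate the page table and
         * quiesce pending ticks before the latency-sensitive loop */
        if (set_dataplane(DP_QUIESCE | DP_POPULATE) < 0) {
                perror("set_dataplane");
                return 1;
        }

        /* ... time-critical userspace loop runs undisturbed here ... */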


-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-02 15:34               ` Paul E. McKenney
@ 2012-02-02 16:14                 ` Avi Kivity
  2012-02-02 17:01                   ` Paul E. McKenney
  0 siblings, 1 reply; 77+ messages in thread
From: Avi Kivity @ 2012-02-02 16:14 UTC (permalink / raw)
  To: paulmck
  Cc: Christoph Lameter, Peter Zijlstra, Gilad Ben-Yossef,
	linux-kernel, Chris Metcalf, Frederic Weisbecker, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On 02/02/2012 05:34 PM, Paul E. McKenney wrote:
> On Thu, Feb 02, 2012 at 11:34:25AM +0200, Avi Kivity wrote:
> > On 02/01/2012 10:13 PM, Paul E. McKenney wrote:
> > > > 
> > > > Could we also apply the same approach to processors busy doing
> > > > computational work? In that case the OS is also not needed. Interrupting
> > > > these activities impacts performance and latency.
> > >
> > > Yep, that is in fact what Frederic's dyntick-idle userspace work does.
> > 
> > Running in a guest is a special case of running in userspace, so we'd
> > need to extend this work to kvm as well.
>
> As long as rcu_idle_enter() is called at the appropriate time, RCU will
> happily ignore the CPU.  ;-)
>

It's not called (since the cpu is not idle).  Instead we call
rcu_virt_note_context_switch().
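
Roughly, the idea on the kvm side (paraphrased, not a verbatim quote of the
guest-entry code) is:

        /*
         * On guest entry: the CPU holds no RCU read-side references once
         * it switches to guest mode, so note a quiescent state, much like
         * exiting to userspace, letting RCU leave this CPU alone.
         */
        rcu_virt_note_context_switch(smp_processor_id());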

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-02  8:46       ` Gilad Ben-Yossef
  2012-02-02 15:41         ` Chris Metcalf
@ 2012-02-02 16:24         ` Frederic Weisbecker
  2012-02-02 16:29           ` Christoph Lameter
  1 sibling, 1 reply; 77+ messages in thread
From: Frederic Weisbecker @ 2012-02-02 16:24 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Peter Zijlstra, linux-kernel, Christoph Lameter, Chris Metcalf,
	linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Avi Kivity, Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On Thu, Feb 02, 2012 at 10:46:32AM +0200, Gilad Ben-Yossef wrote:
> On Wed, Feb 1, 2012 at 7:04 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> >
> > On Sun, Jan 29, 2012 at 10:25:46AM +0200, Gilad Ben-Yossef wrote:
> >
> > > If this is of interest, I keep a list tracking global IPI and global
> > > task schedulers sources in the core kernel here:
> > > https://github.com/gby/linux/wiki.
> > >
> > > I plan to visit all these potential interference source to see if
> > > something can be done to lower their effect on
> > > isolated CPUs over time.
> >
> > Very nice especially as many people seem to be interested in
> > CPU isolation.
> 
> 
> Yes, that is what drives me as well. I have a bare metal program
> I'm trying to kill here, I researched CPU isolation and ran into your
> nohz patch set and asked myself: "OK, if we disable the tick what else
> is in the way?"
> 
> >
> >
> > When we get the adaptive tickless feature in place, perhaps we'll
> > also need to think about some way to have more control on the
> > CPU affinity of some non pinned timers to avoid disturbing
> > adaptive tickless CPUs. We still need to consider their cache affinity
> > though.
> 
> 
> Right. I'm thinking we can treat a CPU going into adaptive tick mode in a similar
> fashion to a CPU going offline for the purpose of timer migration.
> 
> Some pinned timers might be able to get special treatment as well - take for
> example the vmstat work being scheduled every second - what should we do with
> it for CPU isolation?

Right, I remember these vmstat timers getting in my way when I tried to get 0
interrupts on a CPU.

I think all these timers need to be carefully reviewed before doing anything.
But we certainly shouldn't adopt the behaviour of migrating timers by default.

Some timers really need to stay on the expected CPU. Note that some
timers may be shut down by CPU hotplug callbacks. Those wouldn't be migrated
in case of CPU offlining. We need to keep them.

> It makes sense to me to have that stop scheduling itself when we have the tick
> disabled for both idle and a nohz task.

We have deferrable timers; their semantics are to not fire while the CPU is
idle. But being idle and being adaptive tickless are not the same thing. On an
adaptive tickless CPU, the CPU is busy doing things that might be relevant to
these deferrable timers.

So I don't think we can apply the same logic.
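
For reference, a minimal sketch of how such a deferrable timer is set up
(the names are made up, only the timer API calls are real); the deferrable
semantic is exactly the one that does not carry over to a busy
adaptive-tickless CPU:

#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list stats_timer;

static void stats_timer_fn(unsigned long data)
{
        /* periodic housekeeping would go here; re-arm one second out */
        mod_timer(&stats_timer, jiffies + HZ);
}

static void stats_timer_start(void)
{
        /* deferrable: an idle CPU is not woken up just to fire this */
        init_timer_deferrable(&stats_timer);
        stats_timer.function = stats_timer_fn;
        stats_timer.data = 0;
        mod_timer(&stats_timer, jiffies + HZ);
}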


> 
> A similar thing can be said for the clocksource watchdog for example - we might
> consider having it not trigger stuff on idle or nohz task CPUs

This one is a special case and is only armed when the TSC is unstable (IIUC).
I guess we shouldn't worry about it - it's a corner case.

> Maybe we can have some notification mechanism when a task goes into nohz
> mode and back, to let stuff disable and re-enable itself if that makes sense.
> It seems more sensible than having all these individual pieces check whether
> this or another CPU is in idle or nohz task mode.
> 
> The question for nohz task then is when the notification needs to go out:
> only when a task manages to go into nohz mode, or when we add a cpu to an
> adaptive tick cpuset? Because for stuff like vmstat, the very existence of the
> runnable workqueue thread can keep a task from going into nohz mode. Bah.
> Maybe we need two notifications...

I think we really need to explore these timers and workqueues case by case,
and maybe set up a way to affine them to particular cpusets if needed.
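
Something like this rough sketch of your notification idea, maybe, modeled
on the kernel's existing notifier chains (everything here except the
notifier API itself is a made-up name):

#include <linux/notifier.h>

static ATOMIC_NOTIFIER_HEAD(nohz_task_chain);

int register_nohz_task_notifier(struct notifier_block *nb)
{
        return atomic_notifier_chain_register(&nohz_task_chain, nb);
}

/* called by the nohz core when a CPU enters or leaves nohz task mode */
void nohz_task_notify(unsigned long enter, int cpu)
{
        atomic_notifier_call_chain(&nohz_task_chain, enter,
                                   (void *)(long)cpu);
}

vmstat, the clocksource watchdog and friends could then each register a
callback that parks or restarts their per-cpu work.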

> 
> 
> Thanks!
> Gilad
> --
> Gilad Ben-Yossef
> Chief Coffee Drinker
> gilad@benyossef.com
> Israel Cell: +972-52-8260388
> US Cell: +1-973-8260388
> http://benyossef.com
> 
> "If you take a class in large-scale robotics, can you end up in a
> situation where the homework eats your dog?"
>  -- Jean-Baptiste Queru

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-02 16:24         ` Frederic Weisbecker
@ 2012-02-02 16:29           ` Christoph Lameter
  2012-02-09 15:52             ` Frederic Weisbecker
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2012-02-02 16:29 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben-Yossef, Peter Zijlstra, linux-kernel, Chris Metcalf,
	linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Avi Kivity, Michal Nazarewicz, Kosaki Motohiro, Milton Miller

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1072 bytes --]

On Thu, 2 Feb 2012, Frederic Weisbecker wrote:

> > Some pinned timers might be able to get special treatment as well - take for
> > example the vmstat work being scheduled every second - what should we do with
> > it for CPU isolation?
>
> Right, I remember these vmstat timers getting in my way when I tried to get 0
> interrupts on a CPU.
>
> I think all these timers need to be carefully reviewed before doing anything.
> But we certainly shouldn't adopt the behaviour of migrating timers by default.
>
> Some timers really need to stay on the expected CPU. Note that some
> timers may be shut down by CPU hotplug callbacks. Those wouldn't be migrated
> in case of CPU offlining. We need to keep them.
>
> > It makes sense to me to have that stop scheduling itself when we have the tick
> > disabled for both idle and a nohz task.

The vmstat timer only makes sense when the OS is doing something on the
processor. Otherwise if no counters are incremented and the page and slab
allocator caches are empty then there is no need to run the vmstat timer.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-02 16:14                 ` Avi Kivity
@ 2012-02-02 17:01                   ` Paul E. McKenney
  2012-02-02 17:23                     ` Avi Kivity
  2012-02-02 17:25                     ` Christoph Lameter
  0 siblings, 2 replies; 77+ messages in thread
From: Paul E. McKenney @ 2012-02-02 17:01 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christoph Lameter, Peter Zijlstra, Gilad Ben-Yossef,
	linux-kernel, Chris Metcalf, Frederic Weisbecker, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On Thu, Feb 02, 2012 at 06:14:36PM +0200, Avi Kivity wrote:
> On 02/02/2012 05:34 PM, Paul E. McKenney wrote:
> > On Thu, Feb 02, 2012 at 11:34:25AM +0200, Avi Kivity wrote:
> > > On 02/01/2012 10:13 PM, Paul E. McKenney wrote:
> > > > > 
> > > > > Could we also apply the same approach to processors busy doing
> > > > > computational work? In that case the OS is also not needed. Interrupting
> > > > > these activities is impacting on performance and latency.
> > > >
> > > > Yep, that is in fact what Frederic's dyntick-idle userspace work does.
> > > 
> > > Running in a guest is a special case of running in userspace, so we'd
> > > need to extend this work to kvm as well.
> >
> > As long as rcu_idle_enter() is called at the appropriate time, RCU will
> > happily ignore the CPU.  ;-)
> >
> 
> It's not called (since the cpu is not idle).  Instead we call
> rcu_virt_note_context_switch().

Frederic's work checks to see if there is only one runnable user task
on a given CPU.  If there is only one, then the scheduling-clock interrupt
is turned off for that CPU, and RCU is told to ignore it while it is
executing in user space.  Not sure whether this covers KVM guests.

In any case, this is not yet in mainline.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-02 17:01                   ` Paul E. McKenney
@ 2012-02-02 17:23                     ` Avi Kivity
  2012-02-02 17:51                       ` Paul E. McKenney
  2012-02-02 17:25                     ` Christoph Lameter
  1 sibling, 1 reply; 77+ messages in thread
From: Avi Kivity @ 2012-02-02 17:23 UTC (permalink / raw)
  To: paulmck
  Cc: Christoph Lameter, Peter Zijlstra, Gilad Ben-Yossef,
	linux-kernel, Chris Metcalf, Frederic Weisbecker, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On 02/02/2012 07:01 PM, Paul E. McKenney wrote:
> > 
> > It's not called (since the cpu is not idle).  Instead we call
> > rcu_virt_note_context_switch().
>
> Frederic's work checks to see if there is only one runnable user task
> on a given CPU.  If there is only one, then the scheduling-clock interrupt
> is turned off for that CPU, and RCU is told to ignore it while it is
> executing in user space.  Not sure whether this covers KVM guests.

Conceptually it's the same.  Maybe it needs adjustments, since kvm
enters a guest in a different way than the kernel exits to userspace.

> In any case, this is not yet in mainline.

Let me know when it's in, and I'll have a look.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-02 17:01                   ` Paul E. McKenney
  2012-02-02 17:23                     ` Avi Kivity
@ 2012-02-02 17:25                     ` Christoph Lameter
  2012-02-05 12:06                       ` Gilad Ben-Yossef
  1 sibling, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2012-02-02 17:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Avi Kivity, Peter Zijlstra, Gilad Ben-Yossef, linux-kernel,
	Chris Metcalf, Frederic Weisbecker, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman,
	Andrew Morton, Alexander Viro, Michal Nazarewicz,
	Kosaki Motohiro, Milton Miller

On Thu, 2 Feb 2012, Paul E. McKenney wrote:

> Frederic's work checks to see if there is only one runnable user task
> on a given CPU.  If there is only one, then the scheduling-clock interrupt
> is turned off for that CPU, and RCU is told to ignore it while it is
> executing in user space.  Not sure whether this covers KVM guests.
>
> In any case, this is not yet in mainline.

Sounds great. Is there any plan on when to merge it? Where are the most up
to date patches vs mainstream?


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-02 17:23                     ` Avi Kivity
@ 2012-02-02 17:51                       ` Paul E. McKenney
  2012-02-05 12:16                         ` Avi Kivity
  0 siblings, 1 reply; 77+ messages in thread
From: Paul E. McKenney @ 2012-02-02 17:51 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christoph Lameter, Peter Zijlstra, Gilad Ben-Yossef,
	linux-kernel, Chris Metcalf, Frederic Weisbecker, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On Thu, Feb 02, 2012 at 07:23:39PM +0200, Avi Kivity wrote:
> On 02/02/2012 07:01 PM, Paul E. McKenney wrote:
> > > 
> > > It's not called (since the cpu is not idle).  Instead we call
> > > rcu_virt_note_context_switch().
> >
> > Frederic's work checks to see if there is only one runnable user task
> > on a given CPU.  If there is only one, then the scheduling-clock interrupt
> > is turned off for that CPU, and RCU is told to ignore it while it is
> > executing in user space.  Not sure whether this covers KVM guests.
> 
> Conceptually it's the same.  Maybe it needs adjustments, since kvm
> enters a guest in a different way than the kernel exits to userspace.
> 
> > In any case, this is not yet in mainline.
> 
> Let me know when it's in, and I'll have a look.

Could you please touch base with Frederic Weisbecker to make sure that
what he is doing works for you?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-02 15:41         ` Chris Metcalf
@ 2012-02-05 11:46           ` Gilad Ben-Yossef
  2012-02-10 18:39             ` Peter Zijlstra
  2012-02-10 18:33           ` Peter Zijlstra
  2012-02-10 18:38           ` Peter Zijlstra
  2 siblings, 1 reply; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-02-05 11:46 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Frederic Weisbecker, Peter Zijlstra, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Thu, Feb 2, 2012 at 5:41 PM, Chris Metcalf <cmetcalf@tilera.com> wrote:
> On 2/2/2012 3:46 AM, Gilad Ben-Yossef wrote:
>
>> Yes, that is what drives me as well. I have a bare metal program
>> I'm trying to kill here, I researched CPU isolation and ran into your
>> nohz patch set and asked myself: "OK, if we disable the tick what else
>> is in the way?"
>
> At Tilera we have been supporting a "dataplane" mode (aka Zero Overhead
> Linux - the marketing name).  This is configured on a per-cpu basis, and in
> addition to setting isolcpus for those nodes, also suppresses various
> things that might otherwise run (soft lockup detection, vmstat work,
> etc.).  The claim is that you need to specify these kinds of things
> per-core since it's not always possible for the kernel to know that you
> really don't want the scheduler or any other interrupt source to touch the
> core, as opposed to the case where you just happen to have a single process
> scheduled on the core and you don't mind occasional interrupts.  But
> there's definitely appeal in having the kernel do it adaptively too,
> particularly if it can be made to work just as well as configuring it
> statically.

Currently adaptive tick needs to be enabled as a cpuset property in order to
apply, but once enabled it is activated automatically when feasible.

The combination of per-cpuset enabling and automatic activation makes sense to
me, since cpusets are the way to go for isolating cpus for specific tasks
going forward.
>
> We also have a set_dataplane() syscall that a task can make to allow it to
> request some additional semantics from the kernel, such as various
> debugging modes, a flag to request populating the page table fully, and a
> flag to request that all pending kernel timer ticks, etc., happen while the
> task spins in the kernel before actually returning to userspace from a
> syscall (so you don't get unexpected interrupts once you're back in
> userspace).

Oohh.. I like that :-)

> I've appended the relevant bits of <asm/dataplane.h> for more
> details.
>
> We've been planning to start working with the community on upstreaming this,
> but since fiddling with the scheduler is pretty tricky stuff and it wasn't
> clear there was a lot of interest, we've been deferring it in favor of
> other activities.  But seeing more about Frederic Weisbecker's and Gilad
> Ben-Yossef's work makes me think that it might be a good time for us to
> start that process.  For a start I'll see about putting up a git branch on
> kernel.org that has our dataplane stuff in it, for reference.
>

This sounds very interesting. Thank you!

I for one will be delighted to see that tree as a reference. There is nothing
I hate more than re-inventing the wheel... :-)

> /*
>  * Quiesce the timer interrupt before returning to user space after a
>  * system call.  Normally if a task on a dataplane core makes a
>  * syscall, the system will run one or more timer ticks after the
>  * syscall has completed, causing unexpected interrupts in userspace.
>  * Setting DP_QUIESCE avoids that problem by having the kernel "hold"
>  * the task in kernel mode until the timer ticks are complete.  This
>  * will make syscalls dramatically slower.
>  *
>  * If multiple dataplane tasks are scheduled on a single core, this
>  * in effect silently disables DP_QUIESCE, which allows the tasks to make
>  * progress, but without actually disabling the timer tick.
>  */
> #define DP_QUIESCE      0x1
>
> /*
>  * Disallow the application from entering the kernel in any way,
>  * unless it calls set_dataplane() again without this bit set.
>  * Issuing any other syscall or causing a page fault would generate a
>  * kernel message, and "kill -9" the process.
>  *
>  * Setting this flag automatically sets DP_QUIESCE as well.
>  */
> #define DP_STRICT       0x2
>
> /*
>  * Debug dataplane interrupts, so that if any interrupt source
>  * attempts to involve a dataplane cpu, a kernel message and stack
>  * backtrace will be generated on the console.  As this warning is a
>  * slow event, it may make sense to avoid this mode in production code
>  * to avoid making any possible interrupts even more heavyweight.
>  *
>  * Setting this flag automatically sets DP_QUIESCE as well.
>  */
> #define DP_DEBUG        0x4
>
> /*
>  * Cause all memory mappings to be populated in the page table.
>  * Specifying this when entering dataplane mode ensures that no future
>  * page fault events will occur to cause interrupts into the Linux
>  * kernel, as long as no new mappings are installed by mmap(), etc.
>  * Note that since the hardware TLB is of finite size, there will
>  * still be the potential for TLB misses that the hypervisor handles,
>  * either via its software TLB cache (fast path) or by walking the
>  * kernel page tables (slow path), so touching large amounts of memory
>  * will still incur hypervisor interrupt overhead.
>  */
> #define DP_POPULATE     0x8

hmm... I've probably missed something, but doesn't this replicate
mlockall (MCL_CURRENT|MCL_FUTURE) ?
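
For reference, the userspace pattern I mean - a minimal, self-contained
sketch:

#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
        /* fault in and pin all current and future mappings */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
                perror("mlockall");
                return 1;
        }
        /* ... latency-sensitive work runs here ... */
        return 0;
}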

Thanks!
Gilad




-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-02 17:25                     ` Christoph Lameter
@ 2012-02-05 12:06                       ` Gilad Ben-Yossef
  2012-02-06 18:19                         ` Christoph Lameter
  0 siblings, 1 reply; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-02-05 12:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Avi Kivity, Peter Zijlstra, linux-kernel,
	Chris Metcalf, Frederic Weisbecker, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman,
	Andrew Morton, Alexander Viro, Michal Nazarewicz,
	Kosaki Motohiro, Milton Miller

On Thu, Feb 2, 2012 at 7:25 PM, Christoph Lameter <cl@linux.com> wrote:
> On Thu, 2 Feb 2012, Paul E. McKenney wrote:
>
>> Frederic's work checks to see if there is only one runnable user task
>> on a given CPU.  If there is only one, then the scheduling-clock interrupt
>> is turned off for that CPU, and RCU is told to ignore it while it is
>> executing in user space.  Not sure whether this covers KVM guests.
>>
>> In any case, this is not yet in mainline.
>
> Sounds great. Is there any plan on when to merge it? Where are the most up
> to date patches vs mainstream?
>


Frederic has the latest version in a git tree here:

git://github.com/fweisbec/linux-dynticks.git
       nohz/cpuset-v2-pre-20120117

It's on top of the latest rcu/core.

I've been playing with it for some time now. It works very well, considering the
early state - there are a couple of TODO items listed here:
https://tglx.de/~fweisbec/TODO-nohz-cpusets and I've seen an assert from
the RCU code once.

Also, there is some system stuff "in the way", so to speak, of getting the
full benefits:

I had to disable the clock source watchdog (I'm testing in a KVM VM, so I
guess the TSC is not stable), disable the vmstat work on that CPU, and (try
to) fix what looks like a bug in the NOHZ timer code.

But the good news is that with these hacks applied I managed to run a 100%
CPU task with zero interrupts (ticks or otherwise) on an isolated cpu.

Disregarding TLB overhead, you get bare metal performance with Linux user
space manageability and debug capabilities. Pretty magical really: It's like
eating your cake and having it too :-)

Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-02 17:51                       ` Paul E. McKenney
@ 2012-02-05 12:16                         ` Avi Kivity
  2012-02-05 16:59                           ` Paul E. McKenney
  0 siblings, 1 reply; 77+ messages in thread
From: Avi Kivity @ 2012-02-05 12:16 UTC (permalink / raw)
  To: paulmck
  Cc: Christoph Lameter, Peter Zijlstra, Gilad Ben-Yossef,
	linux-kernel, Chris Metcalf, Frederic Weisbecker, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On 02/02/2012 07:51 PM, Paul E. McKenney wrote:
> On Thu, Feb 02, 2012 at 07:23:39PM +0200, Avi Kivity wrote:
> > On 02/02/2012 07:01 PM, Paul E. McKenney wrote:
> > > > 
> > > > It's not called (since the cpu is not idle).  Instead we call
> > > > rcu_virt_note_context_switch().
> > >
> > > Frederic's work checks to see if there is only one runnable user task
> > > on a given CPU.  If there is only one, then the scheduling-clock interrupt
> > > is turned off for that CPU, and RCU is told to ignore it while it is
> > > executing in user space.  Not sure whether this covers KVM guests.
> > 
> > Conceptually it's the same.  Maybe it needs adjustments, since kvm
> > enters a guest in a different way than the kernel exits to userspace.
> > 
> > > In any case, this is not yet in mainline.
> > 
> > Let me know when it's in, and I'll have a look.
>
> Could you please touch base with Frederic Weisbecker to make sure that
> what he is doing works for you?
>

Looks like there are new rcu_user_enter() and rcu_user_exit() APIs which
we can use.  Hopefully they subsume rcu_virt_note_context_switch() so we
only need one set of APIs.
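
Something like this, presumably; the rcu_user_enter()/rcu_user_exit() names
follow Frederic's tree, and the call site below is purely illustrative, not
actual kvm code:

#include <linux/kvm_host.h>
#include <linux/rcupdate.h>

static void vcpu_run_guest(struct kvm_vcpu *vcpu)
{
        rcu_user_enter();       /* RCU stops watching this CPU */
        enter_guest(vcpu);      /* made-up name for the guest-entry path */
        rcu_user_exit();        /* CPU takes part in grace periods again */
}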

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-05 12:16                         ` Avi Kivity
@ 2012-02-05 16:59                           ` Paul E. McKenney
  2012-02-09 15:22                             ` Frederic Weisbecker
  0 siblings, 1 reply; 77+ messages in thread
From: Paul E. McKenney @ 2012-02-05 16:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christoph Lameter, Peter Zijlstra, Gilad Ben-Yossef,
	linux-kernel, Chris Metcalf, Frederic Weisbecker, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On Sun, Feb 05, 2012 at 02:16:17PM +0200, Avi Kivity wrote:
> On 02/02/2012 07:51 PM, Paul E. McKenney wrote:
> > On Thu, Feb 02, 2012 at 07:23:39PM +0200, Avi Kivity wrote:
> > > On 02/02/2012 07:01 PM, Paul E. McKenney wrote:
> > > > > 
> > > > > It's not called (since the cpu is not idle).  Instead we call
> > > > > rcu_virt_note_context_switch().
> > > >
> > > > Frederic's work checks to see if there is only one runnable user task
> > > > on a given CPU.  If there is only one, then the scheduling-clock interrupt
> > > > is turned off for that CPU, and RCU is told to ignore it while it is
> > > > executing in user space.  Not sure whether this covers KVM guests.
> > > 
> > > Conceptually it's the same.  Maybe it needs adjustments, since kvm
> > > enters a guest in a different way than the kernel exits to userspace.
> > > 
> > > > In any case, this is not yet in mainline.
> > > 
> > > Let me know when it's in, and I'll have a look.
> >
> > Could you please touch base with Frederic Weisbecker to make sure that
> > what he is doing works for you?
> 
> Looks like there are new rcu_user_enter() and rcu_user_exit() APIs which
> we can use.  Hopefully they subsume rcu_virt_note_context_switch() so we
> only need one set of APIs.

Now that you mention it, that is a good goal.  However, it requires
coordination with Frederic's code as well, so some investigation
is required.  Bad things happen if you tell RCU you are idle when you
really are not and vice versa!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-05 12:06                       ` Gilad Ben-Yossef
@ 2012-02-06 18:19                         ` Christoph Lameter
  2012-02-09 15:37                           ` Frederic Weisbecker
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2012-02-06 18:19 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Paul E. McKenney, Avi Kivity, Peter Zijlstra, linux-kernel,
	Chris Metcalf, Frederic Weisbecker, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman,
	Andrew Morton, Alexander Viro, Michal Nazarewicz,
	Kosaki Motohiro, Milton Miller

On Sun, 5 Feb 2012, Gilad Ben-Yossef wrote

>
> Frederic has the latest version in a git tree here:
>
> git://github.com/fweisbec/linux-dynticks.git
>        nohz/cpuset-v2-pre-20120117
>
> > It's on top of the latest rcu/core.

Hmmm.. Pulling it on top of upstream leads to lots of conflicts.


> But the good news is that with these hacks applied I managed to run a 100%
> CPU task with zero interrupts (ticks or otherwise) on an isolated cpu.

Cool.

> Disregarding TLB overhead, you get bare metal performance with Linux user
> space manageability and debug capabilities. Pretty magical really: It's like
> eating your cake and having it too :-)

We definitely need that.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-05 16:59                           ` Paul E. McKenney
@ 2012-02-09 15:22                             ` Frederic Weisbecker
  2012-02-09 16:05                               ` Avi Kivity
  0 siblings, 1 reply; 77+ messages in thread
From: Frederic Weisbecker @ 2012-02-09 15:22 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Avi Kivity, Christoph Lameter, Peter Zijlstra, Gilad Ben-Yossef,
	linux-kernel, Chris Metcalf, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman,
	Andrew Morton, Alexander Viro, Michal Nazarewicz,
	Kosaki Motohiro, Milton Miller

On Sun, Feb 05, 2012 at 08:59:27AM -0800, Paul E. McKenney wrote:
> On Sun, Feb 05, 2012 at 02:16:17PM +0200, Avi Kivity wrote:
> > On 02/02/2012 07:51 PM, Paul E. McKenney wrote:
> > > On Thu, Feb 02, 2012 at 07:23:39PM +0200, Avi Kivity wrote:
> > > > On 02/02/2012 07:01 PM, Paul E. McKenney wrote:
> > > > > > 
> > > > > > It's not called (since the cpu is not idle).  Instead we call
> > > > > > rcu_virt_note_context_switch().
> > > > >
> > > > > Frederic's work checks to see if there is only one runnable user task
> > > > > on a given CPU.  If there is only one, then the scheduling-clock interrupt
> > > > > is turned off for that CPU, and RCU is told to ignore it while it is
> > > > > executing in user space.  Not sure whether this covers KVM guests.
> > > > 
> > > > Conceptually it's the same.  Maybe it needs adjustments, since kvm
> > > > enters a guest in a different way than the kernel exits to userspace.
> > > > 
> > > > > In any case, this is not yet in mainline.
> > > > 
> > > > Let me know when it's in, and I'll have a look.
> > >
> > > Could you please touch base with Frederic Weisbecker to make sure that
> > > what he is doing works for you?
> > 
> > Looks like there are new rcu_user_enter() and rcu_user_exit() APIs which
> > we can use.  Hopefully they subsume rcu_virt_note_context_switch() so we
> > only need one set of APIs.
> 
> Now that you mention it, that is a good goal.  However, it requires
> coordination with Frederic's code as well, so some investigation
> is required.  Bad things happen if you tell RCU you are idle when you
> really are not and vice versa!
> 
> 							Thanx, Paul
> 

Right. Avi I need to know more about what you need. rcu_virt_note_context_switch()
notes a quiescent state while rcu_user_enter() shuts down RCU (it's in fact the same
thing as rcu_idle_enter() minus the is_idle_cpu() checks).

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-06 18:19                         ` Christoph Lameter
@ 2012-02-09 15:37                           ` Frederic Weisbecker
  0 siblings, 0 replies; 77+ messages in thread
From: Frederic Weisbecker @ 2012-02-09 15:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben-Yossef, Paul E. McKenney, Avi Kivity, Peter Zijlstra,
	linux-kernel, Chris Metcalf, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman,
	Andrew Morton, Alexander Viro, Michal Nazarewicz,
	Kosaki Motohiro, Milton Miller

On Mon, Feb 06, 2012 at 12:19:44PM -0600, Christoph Lameter wrote:
> On Sun, 5 Feb 2012, Gilad Ben-Yossef wrote
> 
> >
> > Frederic has the latest version in a git tree here:
> >
> > git://github.com/fweisbec/linux-dynticks.git
> >        nohz/cpuset-v2-pre-20120117
> >
> > > It's on top of the latest rcu/core.
> 
> Hmmm.. Pulling it on top of upstream leads to lots of conflicts.

Yeah, it's based on the latest rcu/core branch that was pulled on the last
-rc1. Better to check it out with git than to pull it.

 
> 
> > But the good news is that with these hacks applied I managed to run a 100%
> > CPU task with zero interrupts (ticks or otherwise) on an isolated cpu.
> 
> Cool.
> 
> > Disregarding TLB overhead, you get bare metal performance with Linux user
> > space manageability and debug capabilities. Pretty magical really: It's like
> > eating your cake and having it too :-)
> 
> We definitely need that.
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-02 16:29           ` Christoph Lameter
@ 2012-02-09 15:52             ` Frederic Weisbecker
  2012-02-09 15:59               ` Chris Metcalf
  2012-02-09 16:26               ` Christoph Lameter
  0 siblings, 2 replies; 77+ messages in thread
From: Frederic Weisbecker @ 2012-02-09 15:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben-Yossef, Peter Zijlstra, linux-kernel, Chris Metcalf,
	linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Avi Kivity, Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On Thu, Feb 02, 2012 at 10:29:57AM -0600, Christoph Lameter wrote:
> On Thu, 2 Feb 2012, Frederic Weisbecker wrote:
> 
> > > Some pinned timers might be able to get special treatment as well - take for
> > > example the vmstat work being scheduled every second - what should we do with
> > > it for CPU isolation?
> >
> > Right, I remember these vmstat timers getting in my way when I tried to get 0
> > interrupts on a CPU.
> >
> > I think all these timers need to be carefully reviewed before doing anything.
> > But we certainly shouldn't adopt the behaviour of migrating timers by default.
> >
> > Some timers really need to stay on the expected CPU. Note that some
> > timers may be shut down by CPU hotplug callbacks. Those wouldn't be migrated
> > in case of CPU offlining. We need to keep them.
> >
> > > It makes sense to me to have that stop scheduling itself when we have the tick
> > > disabled for both idle and a nohz task.
> 
> The vmstat timer only makes sense when the OS is doing something on the
> processor. Otherwise if no counters are incremented and the page and slab
> allocator caches are empty then there is no need to run the vmstat timer.

So this is a typical example of a timer we want to shut down when the CPU is idle
but we want to keep it running when we run in adaptive tickless mode (ie: shut down
the tick while the CPU is busy).

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-09 15:52             ` Frederic Weisbecker
@ 2012-02-09 15:59               ` Chris Metcalf
  2012-02-09 18:11                 ` Frederic Weisbecker
  2012-02-09 16:26               ` Christoph Lameter
  1 sibling, 1 reply; 77+ messages in thread
From: Chris Metcalf @ 2012-02-09 15:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, Gilad Ben-Yossef, Peter Zijlstra,
	linux-kernel, linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin,
	Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On 2/9/2012 10:52 AM, Frederic Weisbecker wrote:
> On Thu, Feb 02, 2012 at 10:29:57AM -0600, Christoph Lameter wrote:
>> On Thu, 2 Feb 2012, Frederic Weisbecker wrote:
>>
>>>> Some pinned timers might be able to get special treatment as well - take for
>>>> example the vmstat work being scheduled every second - what should we do with
>>>> it for CPU isolation?
>>> Right, I remember these vmstat timers getting in my way when I tried to get 0
>>> interrupts on a CPU.
>>>
>>> I think all these timers need to be carefully reviewed before doing anything.
>>> But we certainly shouldn't adopt the behaviour of migrating timers by default.
>>>
>>> Some timers really need to stay on the expected CPU. Note that some
>>> timers may be shut down by CPU hotplug callbacks. Those wouldn't be migrated
>>> in case of CPU offlining. We need to keep them.
>>>
>>>> It makes sense to me to have that stop scheduling itself when we have the tick
>>>> disabled for both idle and a nohz task.
>> The vmstat timer only makes sense when the OS is doing something on the
>> processor. Otherwise if no counters are incremented and the page and slab
>> allocator caches are empty then there is no need to run the vmstat timer.
> So this is a typical example of a timer we want to shut down when the CPU is idle
> but we want to keep it running when we run in adaptive tickless mode (ie: shut down
> the tick while the CPU is busy).

We would want to stop the timer as long as the processor is running
exclusively userspace code.  Christoph's point is that in either the idle
case or the userspace-only case, the vmstats won't be incrementing anyway. 
Presumably you'd restart the timer when you enter the kernel, then when it
fires and does its work, you might notice that when you return to userspace
no further information will be collected, so stop the timer at that point. 
Or, perhaps you could just proactively call refresh_cpu_vm_stats() just
before stopping the tick and returning "permanently" to userspace, to make
sure the stats are all properly updated.
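
A sketch of that last idea, written as if it lived in mm/vmstat.c
(vmstat_work is static there); refresh_cpu_vm_stats() is real, but this
helper and whatever would call it are hypothetical:

/*
 * Flush this CPU's vm counters and park its vmstat timer just before
 * the tick is stopped for a userspace-only CPU.
 */
static void vmstat_quiesce_cpu(int cpu)
{
        refresh_cpu_vm_stats(cpu);  /* fold per-cpu diffs into globals */
        cancel_delayed_work(&per_cpu(vmstat_work, cpu));
}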

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-09 15:22                             ` Frederic Weisbecker
@ 2012-02-09 16:05                               ` Avi Kivity
  2012-02-09 18:22                                 ` Frederic Weisbecker
  0 siblings, 1 reply; 77+ messages in thread
From: Avi Kivity @ 2012-02-09 16:05 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Christoph Lameter, Peter Zijlstra,
	Gilad Ben-Yossef, linux-kernel, Chris Metcalf, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On 02/09/2012 05:22 PM, Frederic Weisbecker wrote:
> > > 
> > > Looks like there are new rcu_user_enter() and rcu_user_exit() APIs which
> > > we can use.  Hopefully they subsume rcu_virt_note_context_switch() so we
> > > only need one set of APIs.
> > 
> > Now that you mention it, that is a good goal.  However, it requires
> > coordination with Frederic's code as well, so some investigation
> > is required.  Bad things happen if you tell RCU you are idle when you
> > really are not and vice versa!
> > 
> > 							Thanx, Paul
> > 
>
> Right. Avi I need to know more about what you need. rcu_virt_note_context_switch()
> notes a quiescent state while rcu_user_enter() shuts down RCU (it's in fact the same
> thing as rcu_idle_enter() minus the is_idle_cpu() checks).

I don't know enough about RCU to say if it's okay or not (I typically
peek at the quick quiz answers).  However, switching to guest mode is
very similar to exiting to user mode: we're guaranteed not to be in an
rcu critical section, and to remain so until the guest exits back to
us.  What guarantees does rcu_user_enter() provide?  With luck guest
entry satisfies them all.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-09 15:52             ` Frederic Weisbecker
  2012-02-09 15:59               ` Chris Metcalf
@ 2012-02-09 16:26               ` Christoph Lameter
  2012-02-09 18:32                 ` Frederic Weisbecker
  1 sibling, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2012-02-09 16:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben-Yossef, Peter Zijlstra, linux-kernel, Chris Metcalf,
	linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Avi Kivity, Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On Thu, 9 Feb 2012, Frederic Weisbecker wrote:

> > The vmstat timer only makes sense when the OS is doing something on the
> > processor. Otherwise if no counters are incremented and the page and slab
> > allocator caches are empty then there is no need to run the vmstat timer.
>
> So this is a typical example of a timer we want to shut down when the CPU is idle
> but we want to keep it running when we run in adaptive tickless mode (ie: shut down
> the tick while the CPU is busy).

You can also shut it down when the cpu is busy and not doing any system
calls. If the percpu differentials are all zero (because, for example, you
just ran the timer) and there are no system activities that would change the
counters, then there is no point in running the vmstat timer.
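
In other words, something like this hypothetical helper (a sketch against
today's zone pageset fields; no such function exists in the tree):

#include <linux/mm.h>
#include <linux/mmzone.h>

/* true if this CPU still has per-cpu counter differentials to fold */
static bool vmstat_update_needed(int cpu)
{
        struct zone *zone;
        int i;

        for_each_populated_zone(zone) {
                struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);

                for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
                        if (p->vm_stat_diff[i])
                                return true;
        }
        return false;
}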


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-09 15:59               ` Chris Metcalf
@ 2012-02-09 18:11                 ` Frederic Weisbecker
  0 siblings, 0 replies; 77+ messages in thread
From: Frederic Weisbecker @ 2012-02-09 18:11 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Christoph Lameter, Gilad Ben-Yossef, Peter Zijlstra,
	linux-kernel, linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin,
	Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Thu, Feb 09, 2012 at 10:59:45AM -0500, Chris Metcalf wrote:
> On 2/9/2012 10:52 AM, Frederic Weisbecker wrote:
> > On Thu, Feb 02, 2012 at 10:29:57AM -0600, Christoph Lameter wrote:
> >> On Thu, 2 Feb 2012, Frederic Weisbecker wrote:
> >>
> >>>> Some pinned timers might be able to get special treatment as well - take for
> >>>> example the vmstat work being scheduled every second - what should we do with
> >>>> it for CPU isolation?
> >>> Right, I remember these vmstat timers getting in my way when I tried to get 0
> >>> interrupts on a CPU.
> >>>
> >>> I think all these timers need to be carefully reviewed before doing anything.
> >>> But we certainly shouldn't adopt the behaviour of migrating timers by default.
> >>>
> >>> Some timers really need to stay on the expected CPU. Note that some
> >>> timers may be shut down by CPU hotplug callbacks. Those wouldn't be migrated
> >>> in case of CPU offlining. We need to keep them.
> >>>
> >>>> It makes sense to me to have that stop scheduling itself when we have the tick
> >>>> disabled for both idle and a nohz task.
> >> The vmstat timer only makes sense when the OS is doing something on the
> >> processor. Otherwise if no counters are incremented and the page and slab
> >> allocator caches are empty then there is no need to run the vmstat timer.
> > So this is a typical example of a timer we want to shut down when the CPU is idle
> > but we want to keep it running when we run in adaptive tickless mode (ie: shut down
> > the tick while the CPU is busy).
> 
> We would want to stop the timer as long as the processor is running
> exclusively userspace code.  Christoph's point is that in either the idle
> case or the userspace-only case, the vmstats won't be incrementing anyway. 
> Presumably you'd restart the timer when you enter the kernel, then when it
> fires and does its work, you might notice that when you return to userspace
> no further information will be collected, so stop the timer at that point. 
> Or, perhaps you could just proactively call refresh_cpu_vm_stats() just
> before stopping the tick and returning "permanently" to userspace, to make
> sure the stats are all properly updated.

Ah, good point. I believe that many timers considered deferrable during idle
may also be considered that way for userspace.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-09 16:05                               ` Avi Kivity
@ 2012-02-09 18:22                                 ` Frederic Weisbecker
  2012-02-09 23:41                                   ` Paul E. McKenney
  0 siblings, 1 reply; 77+ messages in thread
From: Frederic Weisbecker @ 2012-02-09 18:22 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Paul E. McKenney, Christoph Lameter, Peter Zijlstra,
	Gilad Ben-Yossef, linux-kernel, Chris Metcalf, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On Thu, Feb 09, 2012 at 06:05:07PM +0200, Avi Kivity wrote:
> On 02/09/2012 05:22 PM, Frederic Weisbecker wrote:
> > > > 
> > > > Looks like there are new rcu_user_enter() and rcu_user_exit() APIs which
> > > > we can use.  Hopefully they subsume rcu_virt_note_context_switch() so we
> > > > only need one set of APIs.
> > > 
> > > Now that you mention it, that is a good goal.  However, it requires
> > > coordination with Frederic's code as well, so some investigation
> > > is required.  Bad things happen if you tell RCU you are idle when you
> > > really are not and vice versa!
> > > 
> > > 							Thanx, Paul
> > > 
> >
> > Right. Avi I need to know more about what you need. rcu_virt_note_context_switch()
> > notes a quiescent state while rcu_user_enter() shuts down RCU (it's in fact the same
> > thing as rcu_idle_enter() minus the is_idle_cpu() checks).
> 
> I don't know enough about RCU to say if it's okay or not (I typically
> peek at the quick quiz answers).  However, switching to guest mode is
> very similar to exiting to user mode: we're guaranteed not to be in an
> rcu critical section, and to remain so until the guest exits back to
> us.

Awesome!

> What guarantees does rcu_user_enter() provide?  With luck guest
> entry satisfies them all.

So rcu_user_enter() puts the CPU into RCU idle mode, which means the CPU
won't need to be part of the global RCU grace period completion. This
prevents it from depending on the timer tick (although for now you keep it)
and from completing RCU-specific work during the tick.

Paul, do you think that would be a win?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-09 16:26               ` Christoph Lameter
@ 2012-02-09 18:32                 ` Frederic Weisbecker
  0 siblings, 0 replies; 77+ messages in thread
From: Frederic Weisbecker @ 2012-02-09 18:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben-Yossef, Peter Zijlstra, linux-kernel, Chris Metcalf,
	linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Avi Kivity, Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On Thu, Feb 09, 2012 at 10:26:02AM -0600, Christoph Lameter wrote:
> On Thu, 9 Feb 2012, Frederic Weisbecker wrote:
> 
> > > The vmstat timer only makes sense when the OS is doing something on the
> > > processor. Otherwise if no counters are incremented and the page and slab
> > > allocator caches are empty then there is no need to run the vmstat timer.
> >
> > So this is a typical example of a timer we want to shut down when the CPU is idle
> > but we want to keep it running when we run in adaptive tickless mode (ie: shut down
> > the tick while the CPU is busy).
> 
> You can also shut it down when the cpu is busy and not doing any system
> calls. If the percpu differentials are all zero (because you just ran the
> timer f.e.) and there are no system activities that would change the
> counters then there is no point in running the vmstat timer.

Yep. I believe we can probably find that timer pattern elsewhere as well:
a class of userspace/idle deferrable timers.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-09 18:22                                 ` Frederic Weisbecker
@ 2012-02-09 23:41                                   ` Paul E. McKenney
  2012-02-10  1:39                                     ` Frederic Weisbecker
  0 siblings, 1 reply; 77+ messages in thread
From: Paul E. McKenney @ 2012-02-09 23:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Avi Kivity, Christoph Lameter, Peter Zijlstra, Gilad Ben-Yossef,
	linux-kernel, Chris Metcalf, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman,
	Andrew Morton, Alexander Viro, Michal Nazarewicz,
	Kosaki Motohiro, Milton Miller

On Thu, Feb 09, 2012 at 07:22:19PM +0100, Frederic Weisbecker wrote:
> On Thu, Feb 09, 2012 at 06:05:07PM +0200, Avi Kivity wrote:
> > On 02/09/2012 05:22 PM, Frederic Weisbecker wrote:
> > > > > 
> > > > > Looks like there are new rcu_user_enter() and rcu_user_exit() APIs which
> > > > > we can use.  Hopefully they subsume rcu_virt_note_context_switch() so we
> > > > > only need one set of APIs.
> > > > 
> > > > Now that you mention it, that is a good goal.  However, it requires
> > > > coordination with Frederic's code as well, so some investigation
> > > > is required.  Bad things happen if you tell RCU you are idle when you
> > > > really are not and vice versa!
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > >
> > > Right. Avi I need to know more about what you need. rcu_virt_note_context_switch()
> > > notes a quiescent state while rcu_user_enter() shuts down RCU (it's in fact the same
> > > thing as rcu_idle_enter() minus the is_idle_cpu() checks).
> > 
> > I don't know enough about RCU to say if it's okay or not (I typically
> > peek at the quick quiz answers).  However, switching to guest mode is
> > very similar to exiting to user mode: we're guaranteed not to be in an
> > rcu critical section, and to remain so until the guest exits back to
> > us.
> 
> Awesome!
> 
> > What guarantees does rcu_user_enter() provide?  With luck guest
> > entry satisfies them all.
> 
> So rcu_user_enter() puts the CPU into RCU idle mode, which means the CPU
> won't need to be part of the global RCU grace period completion. This
> prevents it from depending on the timer tick (although for now you keep it)
> and from completing RCU-specific work during the tick.
> 
> Paul, do you think that would be a win?

As long as the code doesn't enter RCU read-side critical sections in
the time between rcu_idle_enter() and rcu_idle_exit(), this should
work fine.
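
That is, the shared API must never allow a pattern like this (a sketch;
struct foo, gp and do_something_with() are made-up placeholders):

#include <linux/rcupdate.h>

struct foo;                           /* made-up placeholder type */
extern struct foo *gp;                /* made-up RCU-protected pointer */
extern void do_something_with(struct foo *p);

static void broken(void)
{
        struct foo *p;

        rcu_idle_enter();
        rcu_read_lock();              /* BUG: RCU is ignoring this CPU, */
        p = rcu_dereference(gp);      /* so grace periods won't wait here */
        do_something_with(p);
        rcu_read_unlock();
        rcu_idle_exit();
}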

							Thanx, Paul


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-09 23:41                                   ` Paul E. McKenney
@ 2012-02-10  1:39                                     ` Frederic Weisbecker
  2012-02-14 13:18                                       ` Avi Kivity
  0 siblings, 1 reply; 77+ messages in thread
From: Frederic Weisbecker @ 2012-02-10  1:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Avi Kivity, Christoph Lameter, Peter Zijlstra, Gilad Ben-Yossef,
	linux-kernel, Chris Metcalf, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman,
	Andrew Morton, Alexander Viro, Michal Nazarewicz,
	Kosaki Motohiro, Milton Miller

On Thu, Feb 09, 2012 at 03:41:45PM -0800, Paul E. McKenney wrote:
> On Thu, Feb 09, 2012 at 07:22:19PM +0100, Frederic Weisbecker wrote:
> > On Thu, Feb 09, 2012 at 06:05:07PM +0200, Avi Kivity wrote:
> > > On 02/09/2012 05:22 PM, Frederic Weisbecker wrote:
> > > > > > 
> > > > > > Looks like there are new rcu_user_enter() and rcu_user_exit() APIs which
> > > > > > we can use.  Hopefully they subsume rcu_virt_note_context_switch() so we
> > > > > > only need one set of APIs.
> > > > > 
> > > > > Now that you mention it, that is a good goal.  However, it requires
> > > > > coordination with Frederic's code as well, so some investigation
> > > > > is required.  Bad things happen if you tell RCU you are idle when you
> > > > > really are not and vice versa!
> > > > > 
> > > > > 							Thanx, Paul
> > > > > 
> > > >
> > > > Right. Avi I need to know more about what you need. rcu_virt_note_context_switch()
> > > > notes a quiescent state while rcu_user_enter() shuts down RCU (it's in fact the same
> > > > thing as rcu_idle_enter() minus the is_idle_cpu() checks).
> > > 
> > > I don't know enough about RCU to say if it's okay or not (I typically
> > > peek at the quick quiz answers).  However, switching to guest mode is
> > > very similar to exiting to user mode: we're guaranteed not to be in an
> > > rcu critical section, and to remain so until the guest exits back to
> > > us.
> > 
> > Awesome!
> > 
> > > What guarantees does rcu_user_enter() provide?  With luck guest
> > > entry satisfies them all.
> > 
> > So rcu_user_enter() puts the CPU into RCU idle mode, which means the CPU
> > won't need to be part of the global RCU grace period completion. This
> > prevents it from depending on the timer tick (although for now you keep it)
> > and from completing RCU-specific work during the tick.
> > 
> > Paul, do you think that would be a win?
> 
> As long as the code doesn't enter RCU read-side critical sections in
> the time between rcu_idle_enter() and rcu_idle_exit(), this should
> work fine.

This should work fine, yeah, but beyond correctness, I wonder if this
is going to be a win.

We use rcu_idle_enter() in idle to avoid keeping the tick for RCU. But
what about falling into guest mode? I guess the tick is kept there, so is
using rcu_idle_enter() going to be a win in throughput or something?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-02 15:41         ` Chris Metcalf
  2012-02-05 11:46           ` Gilad Ben-Yossef
@ 2012-02-10 18:33           ` Peter Zijlstra
  2012-02-10 20:33             ` Gilad Ben-Yossef
  2012-02-15 21:50             ` Chris Metcalf
  2012-02-10 18:38           ` Peter Zijlstra
  2 siblings, 2 replies; 77+ messages in thread
From: Peter Zijlstra @ 2012-02-10 18:33 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Frederic Weisbecker, Gilad Ben-Yossef, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Thu, 2012-02-02 at 10:41 -0500, Chris Metcalf wrote:
> At Tilera we have been supporting a "dataplane" mode (aka Zero Overhead
> Linux - the marketing name).  This is configured on a per-cpu basis, and in
> addition to setting isolcpus for those nodes, also suppresses various
> things that might otherwise run (soft lockup detection, vmstat work,
> etc.).  

See that's wrong.. it starts being wrong by depending on cpuisol and
goes from there.

> The claim is that you need to specify these kinds of things
> per-core since it's not always possible for the kernel to know that you
> really don't want the scheduler or any other interrupt source to touch the
> core, as opposed to the case where you just happen to have a single process
> scheduled on the core and you don't mind occasional interrupts.

Right, so that claim is proven false I think.

>   But
> there's definitely appeal in having the kernel do it adaptively too,
> particularly if it can be made to work just as well as configuring it
> statically. 

I see no reason why it shouldn't work as well or even better.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-02 15:41         ` Chris Metcalf
  2012-02-05 11:46           ` Gilad Ben-Yossef
  2012-02-10 18:33           ` Peter Zijlstra
@ 2012-02-10 18:38           ` Peter Zijlstra
  2012-02-10 20:24             ` Gilad Ben-Yossef
  2 siblings, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2012-02-10 18:38 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Frederic Weisbecker, Gilad Ben-Yossef, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Thu, 2012-02-02 at 10:41 -0500, Chris Metcalf wrote:
> 
> /*
>  * Quiesce the timer interrupt before returning to user space after a
>  * system call.  Normally if a task on a dataplane core makes a
>  * syscall, the system will run one or more timer ticks after the
>  * syscall has completed, causing unexpected interrupts in userspace.
>  * Setting DP_QUIESCE avoids that problem by having the kernel "hold"
>  * the task in kernel mode until the timer ticks are complete.  This
>  * will make syscalls dramatically slower.
>  *
>  * If multiple dataplane tasks are scheduled on a single core, this
>  * in effect silently disables DP_QUIESCE, which allows the tasks to make
>  * progress, but without actually disabling the timer tick.
>  */
> #define DP_QUIESCE      0x1

This is what Frederic's work does.

> 
> /*
>  * Disallow the application from entering the kernel in any way,
>  * unless it calls set_dataplane() again without this bit set.
>  * Issuing any other syscall or causing a page fault would generate a
>  * kernel message, and "kill -9" the process.
>  *
>  * Setting this flag automatically sets DP_QUIESCE as well.
>  */
> #define DP_STRICT       0x2

This is a debug feature.. you'd better know what your own software does.

> 
> /*
>  * Debug dataplane interrupts, so that if any interrupt source
>  * attempts to involve a dataplane cpu, a kernel message and stack
>  * backtrace will be generated on the console.  As this warning is a
>  * slow event, it may make sense to avoid this mode in production code
>  * to avoid making any possible interrupts even more heavyweight.
>  *
>  * Setting this flag automatically sets DP_QUIESCE as well.
>  */
> #define DP_DEBUG        0x4

This too is a debug feature, one that doesn't cover all possible
scenarios.

> /*
>  * Cause all memory mappings to be populated in the page table.
>  * Specifying this when entering dataplane mode ensures that no future
>  * page fault events will occur to cause interrupts into the Linux
>  * kernel, as long as no new mappings are installed by mmap(), etc.
>  * Note that since the hardware TLB is of finite size, there will
>  * still be the potential for TLB misses that the hypervisor handles,
>  * either via its software TLB cache (fast path) or by walking the
>  * kernel page tables (slow path), so touching large amounts of memory
>  * will still incur hypervisor interrupt overhead.
>  */
> #define DP_POPULATE     0x8 

mmap()'s MAP_POPULATE will pre-populate the mappings for you, as will
mlock(); the latter will (mostly) ensure they stay around.
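
Roughly, in userspace (a sketch):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

static void *alloc_prefaulted(size_t len)
{
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

        if (p == MAP_FAILED)
                return NULL;
        if (mlock(p, len) != 0) {       /* keep the pages resident */
                munmap(p, len);
                return NULL;
        }
        return p;
}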


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-05 11:46           ` Gilad Ben-Yossef
@ 2012-02-10 18:39             ` Peter Zijlstra
  2012-02-10 20:13               ` Gilad Ben-Yossef
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2012-02-10 18:39 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Chris Metcalf, Frederic Weisbecker, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Sun, 2012-02-05 at 13:46 +0200, Gilad Ben-Yossef wrote:
> > /*
> >  * Cause all memory mappings to be populated in the page table.
> >  * Specifying this when entering dataplane mode ensures that no future
> >  * page fault events will occur to cause interrupts into the Linux
> >  * kernel, as long as no new mappings are installed by mmap(), etc.
> >  * Note that since the hardware TLB is of finite size, there will
> >  * still be the potential for TLB misses that the hypervisor handles,
> >  * either via its software TLB cache (fast path) or by walking the
> >  * kernel page tables (slow path), so touching large amounts of memory
> >  * will still incur hypervisor interrupt overhead.
> >  */
> > #define DP_POPULATE     0x8
> 
> hmm... I've probably missed something, but doesn't this replicate
> mlockall (MCL_CURRENT|MCL_FUTURE) ? 

Never use mlockall(); it's a sign you're doing it wrong. Also, his comment
seems to imply MCL_FUTURE isn't required.



* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-10 18:39             ` Peter Zijlstra
@ 2012-02-10 20:13               ` Gilad Ben-Yossef
  2012-02-10 20:29                 ` Peter Zijlstra
  0 siblings, 1 reply; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-02-10 20:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Frederic Weisbecker, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Fri, Feb 10, 2012 at 8:39 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Sun, 2012-02-05 at 13:46 +0200, Gilad Ben-Yossef wrote:
>> > /*
>> >  * Cause all memory mappings to be populated in the page table.
>> >  * Specifying this when entering dataplane mode ensures that no future
>> >  * page fault events will occur to cause interrupts into the Linux
>> >  * kernel, as long as no new mappings are installed by mmap(), etc.
>> >  * Note that since the hardware TLB is of finite size, there will
>> >  * still be the potential for TLB misses that the hypervisor handles,
>> >  * either via its software TLB cache (fast path) or by walking the
>> >  * kernel page tables (slow path), so touching large amounts of memory
>> >  * will still incur hypervisor interrupt overhead.
>> >  */
>> > #define DP_POPULATE     0x8
>>
>> hmm... I've probably missed something, but doesn't this replicate
>> mlockall (MCL_CURRENT|MCL_FUTURE) ?
>
> Never use mlockall(); it's a sign you're doing it wrong. Also, his comment
> seems to imply MCL_FUTURE isn't required.
>


My current understanding is that if I have a real-time task and wish it
to have deterministic performance, I should call mlockall() to lock the
program data and text into physical memory, so that a less often taken
branch or an access to a new data region will not result in a page fault.

You still have to worry about TLB misses on architectures without a
hardware page table walker, but at least everything is in the page tables.

If there is a better way to do this, I'm always happy to learn new
ways to do things. :-)


Thanks,
Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-10 18:38           ` Peter Zijlstra
@ 2012-02-10 20:24             ` Gilad Ben-Yossef
  2012-02-15 15:11               ` Peter Zijlstra
  2012-02-15 21:51               ` Chris Metcalf
  0 siblings, 2 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-02-10 20:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Frederic Weisbecker, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Fri, Feb 10, 2012 at 8:38 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2012-02-02 at 10:41 -0500, Chris Metcalf wrote:
>>
>> /*
>>  * Quiesce the timer interrupt before returning to user space after a
>>  * system call.  Normally if a task on a dataplane core makes a
>>  * syscall, the system will run one or more timer ticks after the
>>  * syscall has completed, causing unexpected interrupts in userspace.
>>  * Setting DP_QUIESCE avoids that problem by having the kernel "hold"
>>  * the task in kernel mode until the timer ticks are complete.  This
>>  * will make syscalls dramatically slower.
>>  *
>>  * If multiple dataplane tasks are scheduled on a single core, this
>>  * in effect silently disables DP_QUIESCE, which allows the tasks to make
>>  * progress, but without actually disabling the timer tick.
>>  */
>> #define DP_QUIESCE      0x1
>
> This is what Frederic's work does

Actually, it's not quite the same, I believe. Frederic's patch set
disables the tick for an adaptive-tick task at some timer tick after
the system call, but the task doesn't know when that happens; so if a
timing determinism guarantee is what you are after, you don't quite
know in your user task whether it's safe to start doing real time
stuff or not.

If I understood Chris's quote correctly, they hold the task in kernel
space until the timer tick is off, so that when the user space task
continues to run after the system call it can assume the tick is off.

Of course, nothing stops something (an IPI?) from re-enabling it later,
but I guess it at least lets you start in a known state.

I think the concept of giving the task some way to know if the tick is
disabled or not is nice. I'm not sure the exact feature, and surely not
the interface, are what we should adopt - maybe allow registering to
receive a signal at the end of the tick when it is disabled and
re-enabled?

>
>>
>> /*
>>  * Disallow the application from entering the kernel in any way,
>>  * unless it calls set_dataplane() again without this bit set.
>>  * Issuing any other syscall or causing a page fault would generate a
>>  * kernel message, and "kill -9" the process.
>>  *
>>  * Setting this flag automatically sets DP_QUIESCE as well.
>>  */
>> #define DP_STRICT       0x2
>
> This is a debug feature.. you'd better know what your own software does.
>
>>
>> /*
>>  * Debug dataplane interrupts, so that if any interrupt source
>>  * attempts to involve a dataplane cpu, a kernel message and stack
>>  * backtrace will be generated on the console.  As this warning is a
>>  * slow event, it may make sense to avoid this mode in production code
>>  * to avoid making any possible interrupts even more heavyweight.
>>  *
>>  * Setting this flag automatically sets DP_QUIESCE as well.
>>  */
>> #define DP_DEBUG        0x4
>
> This too is a debug feature, one that doesn't cover all possible
> scenarios.

I like the idea of these but suspect a trace event is more suitable to
provide the same information.

>
>> /*
>>  * Cause all memory mappings to be populated in the page table.
>>  * Specifying this when entering dataplane mode ensures that no future
>>  * page fault events will occur to cause interrupts into the Linux
>>  * kernel, as long as no new mappings are installed by mmap(), etc.
>>  * Note that since the hardware TLB is of finite size, there will
>>  * still be the potential for TLB misses that the hypervisor handles,
>>  * either via its software TLB cache (fast path) or by walking the
>>  * kernel page tables (slow path), so touching large amounts of memory
>>  * will still incur hypervisor interrupt overhead.
>>  */
>> #define DP_POPULATE     0x8
>
> mmap()'s MAP_POPULATE will pre-populate the stuff for you, as will
> mlock(); the latter will (mostly) ensure they stay around.
>

Yeap.


I think it's a collection of nice ideas that somehow got grouped together
under a specific interface; I guess we should adopt the ideas but not
necessarily the implementation or interface:

- trace events for the debug notifications
- recommend mlock & friends for the VM stuff
- Perhaps a signal handler for tick notification?

Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-10 20:13               ` Gilad Ben-Yossef
@ 2012-02-10 20:29                 ` Peter Zijlstra
  2012-02-10 20:39                   ` Gilad Ben-Yossef
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2012-02-10 20:29 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Chris Metcalf, Frederic Weisbecker, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Fri, 2012-02-10 at 22:13 +0200, Gilad Ben-Yossef wrote:
> My current understanding is that if I have a real-time task and wish it
> to have deterministic performance, I should call mlockall() to lock the
> program data and text into physical memory, so that a less often taken
> branch or an access to a new data region will not result in a page fault.
>
> You still have to worry about TLB misses on architectures without a
> hardware page table walker, but at least everything is in the page tables.
>
> If there is a better way to do this, I'm always happy to learn new
> ways to do things. :-)

An rt application usually consists of a lot of non-rt parts and a
usually relatively small rt part. Using mlockall() pins the entire
application into memory, which, while on the safe side, is very often
entirely too much.

The alternative method is to only mlock the text and data used by the rt
part. You need to be aware of what text runs in your rt part anyway,
since you need to make sure it is in fact deterministic code.

One of the ways of achieving this is using a special linker section for
your vetted rt code and mlock()'ing only that text section.

On thread creation, provide a custom allocated (and mlock()'ed) stack,
etc.

Basically, if you can't tell a priori what code is part of your rt part,
you don't have an rt part ;-)
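
As a concrete sketch of that approach - the section name and symbols
below are illustrative, relying on GCC/binutils generating
__start_<sec>/__stop_<sec> symbols for sections whose names are valid
C identifiers:

#include <stdio.h>
#include <sys/mman.h>

/* Place the vetted rt code in its own section so it can be mlock()'ed
 * by itself, without pinning the rest of the application. */
#define RT_TEXT __attribute__((section("rt_text")))

extern char __start_rt_text[], __stop_rt_text[];

RT_TEXT static int rt_hot_path(const char *buf, int len)
{
        /* deterministic work only: no malloc, no syscalls */
        int sum = 0;
        for (int i = 0; i < len; i++)
                sum += buf[i];
        return sum;
}

int main(void)
{
        /* Lock just the rt text section; mlock() rounds to pages. */
        if (mlock(__start_rt_text, __stop_rt_text - __start_rt_text))
                perror("mlock rt_text");
        return rt_hot_path("data", 4);
}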



* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-10 18:33           ` Peter Zijlstra
@ 2012-02-10 20:33             ` Gilad Ben-Yossef
  2012-02-15 21:50             ` Chris Metcalf
  1 sibling, 0 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-02-10 20:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Frederic Weisbecker, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

[Sent originally to Peter only by some weird gmail quirk. Re sending to all]

On Fri, Feb 10, 2012 at 8:33 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2012-02-02 at 10:41 -0500, Chris Metcalf wrote:
>> At Tilera we have been supporting a "dataplane" mode (aka Zero Overhead
>> Linux - the marketing name).  This is configured on a per-cpu basis, and in
>> addition to setting isolcpus for those nodes, also suppresses various
>> things that might otherwise run (soft lockup detection, vmstat work,
>> etc.).
>
> See that's wrong.. it starts being wrong by depending on cpuisol and
> goes from there.

Actually, correct me if I'm wrong Chris, but I don't think the idea is
to adopt the Tilera dataplane mode into mainline, but rather to treat
it as a reference -

It was developed to answer a specific need, to scratch a personal itch
if you will, and was probably never designed for mass mainline
consumption, and it shows.

At the same time it's code doing something similar in spirit to what we
aim to do, and it has real world users. We would be foolish to ignore it.

So, a good reference (for the good and the bad), not a merge request.
Right Chris? :-)

Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-10 20:29                 ` Peter Zijlstra
@ 2012-02-10 20:39                   ` Gilad Ben-Yossef
  0 siblings, 0 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-02-10 20:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Frederic Weisbecker, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Fri, Feb 10, 2012 at 10:29 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Fri, 2012-02-10 at 22:13 +0200, Gilad Ben-Yossef wrote:
>> My current understanding is that if I have a real-time task and wish it
>> to have deterministic performance, I should call mlockall() to lock the
>> program data and text into physical memory, so that a less often taken
>> branch or an access to a new data region will not result in a page fault.
>>
>> You still have to worry about TLB misses on architectures without a
>> hardware page table walker, but at least everything is in the page tables.
>>
>> If there is a better way to do this, I'm always happy to learn new
>> ways to do things. :-)
>
> An rt application usually consists of a lot of non-rt parts and a
> usually relatively small rt part. Using mlockall() pins the entire
> application into memory, which, while on the safe side, is very often
> entirely too much.
>
> The alternative method is to only mlock the text and data used by the rt
> part. You need to be aware of what text runs in your rt part anyway,
> since you need to make sure it is in fact deterministic code.
>
> One of the ways of achieving this is using a special linker section for
> your vetted rt code and mlock()'ing only that text section.
>
> On thread creation, provide a custom allocated (and mlock()'ed) stack,
> etc.
>
> Basically, if you can't tell a priori what code is part of your rt part,
> you don't have an rt part ;-)
>

That I can totally agree with.

I guess mlockall() is still useful as a kind of hack for lazy people,
although if you say that this kind of laziness does not really mix well
with real time programming, I will tend to agree... :-)

Gilad



-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-10  1:39                                     ` Frederic Weisbecker
@ 2012-02-14 13:18                                       ` Avi Kivity
  2012-02-21  0:02                                         ` Frederic Weisbecker
  0 siblings, 1 reply; 77+ messages in thread
From: Avi Kivity @ 2012-02-14 13:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Christoph Lameter, Peter Zijlstra,
	Gilad Ben-Yossef, linux-kernel, Chris Metcalf, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On 02/10/2012 03:39 AM, Frederic Weisbecker wrote:
> > 
> > As long as the code doesn't enter RCU read-side critical sections in
> > the time between rcu_idle_enter() and rcu_idle_exit(), this should
> > work fine.
>
> This should work fine, yeah, but beyond the correctness, I wonder if this
> is going to be a win.
>
> We use rcu_idle_enter() in idle to avoid keeping the tick for RCU. But
> what about falling into guest mode? I guess the tick is kept there,
> so is it going to be a win in throughput or something to use rcu_idle_enter()?

We could disable the tick while in guest mode as well.  Interrupts in
guest mode are even more expensive than interrupts in user mode.
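
Something like the following, presumably (a kernel-side sketch using
the RCU hooks named above; the real guest entry path has many more
constraints, and no RCU read-side section may run in between):

        rcu_idle_enter();       /* RCU stops needing the tick on this cpu */
        /* ... enter and run the guest ... */
        rcu_idle_exit();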

-- 
error compiling committee.c: too many arguments to function



* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-10 20:24             ` Gilad Ben-Yossef
@ 2012-02-15 15:11               ` Peter Zijlstra
  2012-02-15 15:19                 ` Gilad Ben-Yossef
  2012-02-15 21:51               ` Chris Metcalf
  1 sibling, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2012-02-15 15:11 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Chris Metcalf, Frederic Weisbecker, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Fri, 2012-02-10 at 22:24 +0200, Gilad Ben-Yossef wrote:
> I think the concept of giving the task some way to know if the tick is
> disabled or not is nice. I'm not sure the exact feature, and surely not
> the interface, are what we should adopt - maybe allow registering to
> receive a signal at the end of the tick when it is disabled and
> re-enabled?

Fair enough, I indeed missed that property. And yes, that makes sense.

It might be a tad tricky to implement as things currently stand, because
AFAICR Frederic's stuff re-enables the tick on kernel entry (syscall), so
things like signal delivery or a blocking wait for it might be 'fun'.

But I'll have to defer to Frederic; it's been too long since I've seen
his patches to remember most details.


* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-15 15:11               ` Peter Zijlstra
@ 2012-02-15 15:19                 ` Gilad Ben-Yossef
  0 siblings, 0 replies; 77+ messages in thread
From: Gilad Ben-Yossef @ 2012-02-15 15:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Frederic Weisbecker, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Wed, Feb 15, 2012 at 5:11 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> On Fri, 2012-02-10 at 22:24 +0200, Gilad Ben-Yossef wrote:
> > I think the concept of giving the task some way to know if the tick is
> > disabled or not is nice. I'm not sure the exact feature, and surely not
> > the interface, are what we should adopt - maybe allow registering to
> > receive a signal at the end of the tick when it is disabled and
> > re-enabled?
>
> Fair enough, I indeed missed that property. And yes, that makes sense.
>
> It might be a tad tricky to implement as things currently stand, because
> AFAICR Frederic's stuff re-enables the tick on kernel entry (syscall), so
> things like signal delivery or a blocking wait for it might be 'fun'.
>
> But I'll have to defer to Frederic; it's been too long since I've seen
> his patches to remember most details.

Yes. What I had in mind is that since Frederic's patch set always
disables the tick from inside the (last) timer tick, we can have the
return to user space from the special timer ticks (the last one before
the tick is disabled, or the first one after it is re-enabled) go to a
signal handler - basically, have the timer code make the signal pending
from inside the tick itself.

I don't know if what I wrote above actually makes sense or not :-)
I'll try to hack something up and see.
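
Purely as a strawman of what the userspace side could look like - the
signal choice and semantics here are hypothetical, no such kernel
interface exists:

#include <signal.h>

static volatile sig_atomic_t tick_is_off;

static void tick_note(int sig)
{
        /* Hypothetically delivered on the last tick before the tick
         * is disabled and on the first tick after it is re-enabled. */
        tick_is_off = !tick_is_off;
}

static int register_tick_signal(void)
{
        struct sigaction sa;

        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sa.sa_handler = tick_note;
        /* SIGRTMIN is an arbitrary stand-in for a real interface. */
        return sigaction(SIGRTMIN, &sa, NULL);
}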

Thanks,
Gilad

--
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-10 18:33           ` Peter Zijlstra
  2012-02-10 20:33             ` Gilad Ben-Yossef
@ 2012-02-15 21:50             ` Chris Metcalf
  2012-02-15 22:15               ` Christoph Lameter
  2012-02-21  1:34               ` Frederic Weisbecker
  1 sibling, 2 replies; 77+ messages in thread
From: Chris Metcalf @ 2012-02-15 21:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Gilad Ben-Yossef, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On 2/10/2012 1:33 PM, Peter Zijlstra wrote:
> On Thu, 2012-02-02 at 10:41 -0500, Chris Metcalf wrote:
>> At Tilera we have been supporting a "dataplane" mode (aka Zero Overhead
>> Linux - the marketing name).  This is configured on a per-cpu basis, and in
>> addition to setting isolcpus for those nodes, also suppresses various
>> things that might otherwise run (soft lockup detection, vmstat work,
>> etc.).  
> See that's wrong.. it starts being wrong by depending on cpuisol and
> goes from there.
>
>> The claim is that you need to specify these kinds of things
>> per-core since it's not always possible for the kernel to know that you
>> really don't want the scheduler or any other interrupt source to touch the
>> core, as opposed to the case where you just happen to have a single process
>> scheduled on the core and you don't mind occasional interrupts.
> Right, so that claim is proven false I think.
>
>> But
>> there's definitely appeal in having the kernel do it adaptively too,
>> particularly if it can be made to work just as well as configuring it
>> statically. 
> I see no reason why it shouldn't work as well or even better.

Thanks for the feedback.  To echo Gilad's guess in a later email, the code
as-is is not intended as a patch planned for a merge.  The code is in use
by our customers, who have found it useful, but what I'd really like to do
is to make sure to integrate all the functionality that's useful in our
"dataplane" mode into Frederic's ongoing work with nohz cpusets.

The goal of the work we've done is to provide a way for customers to ensure
they reliably have zero jitter on cpus that are trying to process real-time
or otherwise low-latency events.  A good example is 10 Gb network traffic,
where at min-packet sizes you have only 50-odd cpu cycles to dispatch the
packet to one of our 64 cores, and each core then has a budget of only a
few thousand cycles to deal with the packet.  A kernel interrupt would mean
dropping packets on the floor.  Similarly, for something like
high-frequency trading, you'd want guaranteed low-latency response.
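
(For reference, the arithmetic behind those budgets, assuming a clock
in the ~750 MHz range - an assumption, actual parts vary: a minimum-size
Ethernet frame occupies 84 bytes on the wire including preamble and
inter-frame gap, so 10 Gb/s is 10e9 / (84 * 8) ~= 14.88M packets/sec,
leaving 750e6 / 14.88e6 ~= 50 cycles per packet for the dispatcher;
spreading the stream across 64 cores then gives each core roughly
64 * 50 ~= 3200 cycles per packet.)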

The Tilera dataplane code is available on the "dataplane" branch (off of
3.3-rc3 at the moment):

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git

I'm still looking at Frederic's git tree, but I'm assuming the following
are all true of tasks that are running on a nohz cpuset core:

- The core will not run the global scheduler to do work stealing, since
otherwise you can't guarantee that only tasks that care about userspace
nohz get to run there.  (I suppose you could loosen this such that the core
would do work stealing as long as no task was pinned to that core by
affinity, at which point the pinned task would become the only runnable task.)

- Kernel "background" tasks are disabled on that core, at least while
userspace nohz tasks are running: softlockup watchdog, slab reap timer,
vmstat thread, etc.

I've appended a FAQ for the Tilera dataplane mode, intended for our
customers.  This covers the use of the set_dataplane() API and the various
DP_XXX flags.  The DP_DEBUG and DP_STRICT flags that Peter objected to are,
of course, for internal/development use, though I think convenient in that
capacity.  The point of DP_STRICT is to basically catch userspace bugs where
the application calls syscalls accidentally, etc., and you'd like to find
out about it.  The point of DP_DEBUG is mostly just that the kernel support
is new, so the flag helps us track down bugs by adding checks at all the
various interrupt and IPI sources; but it also allows us to point the
finger at drivers that arrange to deliver interrupts to dataplane cores
because they don't know any better.  It may not be the right API to set in
stone for long-term support, but it was easy to cook up for the time
being.  (I am curious, by the way, as to why you say it doesn't cover all
possible scenarios.)

As for DP_POPULATE, it's just a weaker mlockall(MCL_CURRENT); we are not
planning to keep it as part of the supported API.

However, I do think mlockall() is the right thing to do.  In a later email
you suggest that an RT application has a small RT part and a lot of non-RT
parts.  This is true, but in our experience, the application writer is much
better served by having the RT parts and non-RT parts in separate
processes.  The Tilera FAQ I appended below discusses this in more detail. 
The upshot is that doing mlockall() on the RT process(es) and not on the
non-RT process(es) that compose the application turns out to be a better model.
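
In its simplest form that model is just the standard POSIX call below
in the RT process(es), and nothing at all in the non-RT ones:

#include <stdio.h>
#include <sys/mman.h>

static int pin_rt_process(void)
{
        if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
                perror("mlockall");
                return -1;
        }
        return 0;
}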

One last (unfortunate) practicality is that I'm taking off for vacation
tomorrow, so I may not be able to reply to any email until I'm back on Feb
27th.


Tilera dataplane FAQ follows.

What is "dataplane" (Zero Overhead Linux, ZOL) mode?

"Dataplane" mode is a per-cpu setting that marks cpus as primarily
intended to run jitter-sensitive or very-low-latency userspace code.
Dataplane cpus will, generally speaking, not be interrupted by kernel
bookkeeping work and will be able to run just userspace code.


How do you configure dataplane mode?

You enable cores for dataplane mode by specifying the "dataplane="
boot argument to the kernel, using the normal kernel list syntax,
i.e. comma-separated cpus or hyphen-separated ranges of cpus.  (If you
specify nonexistent cpus, they will be silently ignored.)  You must
leave at least one non-dataplane cpu configured.
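
For example, on a 64-core part you might boot with:

    dataplane=1-63

leaving only cpu 0 as a non-dataplane cpu.  (The exact split shown is
just an illustration.)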

Simulator images are not generally configured with dataplane enabled;
the exception is the "gx8036-dataplane" image, which has all cores
except core 0 configured as dataplane.

It's not currently possible to change cores from dataplane to
non-dataplane or vice versa after booting the system.


When does a core go in or out of dataplane mode?

The kernel places a task in dataplane mode if it is running on a core
that is enabled for dataplane and the task meets a number of criteria:

- No other runnable task is on the same core.  In general, you should
  use the standard kernel affinity mechanism to bind only a single
  task (process or thread) to a single dataplane core at a time.  The
  cores do not attempt to steal work from other cores to load-balance.

- No kernel events are pending for that core.

A task leaves dataplane mode when it enters the kernel for any
reason.  The kernel will place the task back into dataplane mode on a
subsequent return to user space, once the criteria above hold again.


What is different about a dataplane core from a standard Linux core?

Dataplane cores suppress a number of standard kernel book-keeping
tasks to avoid interrupting user space:

- The kernel soft lockup watchdog is disabled, so soft lockups (when
  the kernel blocks internally and stops scheduling new tasks) are
  not detected on dataplane cores.

- vmstat update is skipped on dataplane cores, along with the usual
  draining of per-cpu free pages back to the global allocator.

- TLB flush events that target kernel pages are suppressed when
  they target cores running dataplane tasks.  Instead, the tasks
  will flush all the kernel TLB entries when they finally enter the
  kernel.

Dataplane cores also try to stop the scheduler tick if possible
whenever they leave the kernel to return to userspace, which makes
syscalls somewhat slower than they are on non-dataplane cores.


What can cause a dataplane task to be interrupted?

A task can get interrupts from kernel book-keeping for a few ticks
after its last return to userspace, as the kernel cleans up various
bits of internal state (for example, the kernel's RCU locking
mechanism may require waiting a timer tick duration before a cpu
is no longer involved in the RCU locking protocol).  However, see
the DP_QUIESCE flag below.

Note that a task obviously enters the kernel when it makes a system
call, but it also will enter the kernel when it touches a mapped
page of memory for the first time and has to create a suitable page
table entry.  However, see the DP_POPULATE flag below.

If a task sets up a request for an interrupt, obviously the kernel
will track that and deliver the interrupt as scheduled; for example,
see the alarm() and timer_create() kernel APIs.  Due to hardware
timer limitations, this may require multiple brief interrupts where
the kernel resets the hardware timer to a bit further out each time;
the Tilera timers only count down over a period of a few seconds, so
multiple interrupts might be required to get a single specified
time requested of the kernel.
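
For reference, a standard POSIX timer armed as below is exactly such a
request; nothing in this snippet is Tilera-specific:

#include <signal.h>
#include <time.h>

static timer_t arm_timer(void)
{
        timer_t t;
        struct sigevent ev = { 0 };
        struct itimerspec its = { .it_value = { .tv_sec = 5 } };

        ev.sigev_notify = SIGEV_SIGNAL;
        ev.sigev_signo = SIGALRM;

        /* The kernel must interrupt the dataplane core to deliver
         * this, possibly several times given the countdown limit
         * described above. */
        timer_create(CLOCK_MONOTONIC, &ev, &t);
        timer_settime(t, 0, &its, NULL);
        return t;
}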

User-space TLB flushes will interrupt all threads that share
a given address space.  These can be caused by munmap(), which
in turn can be caused by free().  It can also be caused by threads
migrated to different cpus if kernel homecaching is enabled and
the threads are configured to home the cache for their stacks on the
current cpu, as is true by default; in this case the kernel has
to do a TLB flush to all the other threads so they know to reload
their TLBs if they have to access the migrating task's stack.
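
As a hedged illustration of the free()-to-munmap() path (this reflects
default glibc behavior, where allocations above the mmap threshold of
roughly 128 KiB are mmap()-backed; thresholds vary):

#include <stdlib.h>

void tlb_flush_trigger(void)
{
        void *big = malloc(1 << 20);    /* likely mmap()-backed */
        free(big);                      /* likely munmap(): TLB-flush
                                         * IPI to every thread that
                                         * shares the address space */
}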

Other global kernel events will also cause interrupts, such as
kernel module unload; a list is currently being maintained at
https://github.com/gby/linux/wiki with sources of CPU interference
in core Linux code.


What programming models are recommended for dataplane?

In general, the dataplane cores should run standalone processes,
or else threads of a process that does not run threads on
non-dataplane cores.  Otherwise, the threads running on non-dataplane
cores can cause TLB flushes to the dataplane threads somewhat
unpredictably.  While it's possible to make this work, it's a
difficult discipline and not recommended.

To communicate between dataplane and non-dataplane processes,
the best mechanism is typically some form of shared memory: for
example, shared anonymous memory if one process is forked from
the other, or else a shared memory-mapped file.

Using locks of any kind directly between dataplane and non-dataplane
processes is not recommended.  Pthread locks use kernel support
(futexes) which will cause dataplane tasks to enter the kernel,
which is generally not appropriate.  However, memory-based spin locks
can cause problems when run on non-dataplane cores, since if the
lock is taken and the lock holder is scheduled out by the kernel for
any reason, there can be very long latencies imposed on any other
task (dataplane or non-dataplane) trying to acquire the lock.

A good model is a shared memory circular queue, such as the
<tmc/queue.h> API.  In this case, locking is done separately at each
end of the queue, so dataplane processes only contend with each other.
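
For reference, here is a minimal sketch of that single-producer/
single-consumer idea.  This is not the actual <tmc/queue.h> API; the
names, the size, and the use of C11 atomics are illustrative:

#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 1024                 /* must be a power of two */

struct spsc_ring {
        _Atomic uint32_t head;          /* written by producer only */
        _Atomic uint32_t tail;          /* written by consumer only */
        void *slot[RING_SLOTS];
};

static int ring_push(struct spsc_ring *r, void *p)
{
        uint32_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t t = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (h - t == RING_SLOTS)
                return -1;              /* full */
        r->slot[h & (RING_SLOTS - 1)] = p;
        atomic_store_explicit(&r->head, h + 1, memory_order_release);
        return 0;
}

static void *ring_pop(struct spsc_ring *r)
{
        uint32_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t h = atomic_load_explicit(&r->head, memory_order_acquire);
        void *p;

        if (t == h)
                return NULL;            /* empty */
        p = r->slot[t & (RING_SLOTS - 1)];
        atomic_store_explicit(&r->tail, t + 1, memory_order_release);
        return p;
}

Each end writes only its own index, so the dataplane side never spins
on a lock that a descheduled non-dataplane task might be holding.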


How does the set_dataplane() API help with dataplane programming?

The <sys/dataplane.h> header provides the set_dataplane() API;
a set_dataplane(2) man page is provided with Tilera Linux as well.

The DP_QUIESCE flag quiesces the timer interrupt before returning to
user space after a system call. Normally if a task on a dataplane core
makes a syscall, the system will run one or more timer ticks after the
syscall has completed, causing unexpected interrupts in
userspace. Setting DP_QUIESCE avoids that problem by having the kernel
"hold" the task in kernel mode until the timer ticks are
complete. This will make syscalls dramatically slower. If multiple
dataplane tasks are scheduled on a single core, this in effect
silently disables DP_QUIESCE, which allows the tasks to make progress,
but without actually disabling the timer tick.

DP_STRICT disallows the application from entering the kernel in any
way, unless it calls set_dataplane() again without this bit
set. Issuing any other syscall or causing a page fault would generate
a kernel message, and "kill -9" the process.  Setting this flag
automatically sets DP_QUIESCE as well, to hold the process in kernel
space until any timer ticks due to the set_dataplane() call have
completed.  This is essentially a development debugging aid.

DP_DEBUG provides support to debug dataplane interrupts, so that if
any interrupt source attempts to involve a dataplane cpu, a kernel
message and stack backtrace will be generated on the console. As this
warning is a slow event, it may make sense to avoid this mode in
production code to avoid making any possible interrupts even more
heavyweight. Setting this flag automatically sets DP_QUIESCE as
well.  This is also intended as a development debugging aid, though
in this case its primary use is to uncover kernel or driver bugs.

Finally, DP_POPULATE causes all memory mappings to be populated in the
page table.  Specifying this when entering dataplane mode ensures that
no future page fault events will occur to cause interrupts into the
Linux kernel, as long as no new mappings are installed by mmap(),
etc.  Note that since the hardware TLB is of finite size, there will
still be the potential for TLB misses that the hypervisor handles,
either via its software TLB cache (fast path) or by walking the kernel
page tables (slow path), so touching large amounts of memory will
still incur hypervisor interrupt overhead.  This is essentially the
same as mlockall(MCL_CURRENT), but without the pages being locked
into memory; this API may be deprecated in the future in favor of
simply requiring mlockall() to be used.
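
A sketch of how an application might put the API together (Tilera Linux
only; the zero-on-success return convention is an assumption on my
part):

#include <sys/dataplane.h>

static int enter_quiet_mode(void)
{
        /* Pre-populate the page tables and quiesce the tick before
         * entering the latency-critical loop. */
        return set_dataplane(DP_QUIESCE | DP_POPULATE);
}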


-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com



* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-10 20:24             ` Gilad Ben-Yossef
  2012-02-15 15:11               ` Peter Zijlstra
@ 2012-02-15 21:51               ` Chris Metcalf
  1 sibling, 0 replies; 77+ messages in thread
From: Chris Metcalf @ 2012-02-15 21:51 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Peter Zijlstra, Frederic Weisbecker, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On 2/10/2012 3:24 PM, Gilad Ben-Yossef wrote:
> On Fri, Feb 10, 2012 at 8:38 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> On Thu, 2012-02-02 at 10:41 -0500, Chris Metcalf wrote:
>>> /*
>>>  * Quiesce the timer interrupt before returning to user space after a
>>>  * system call.  Normally if a task on a dataplane core makes a
>>>  * syscall, the system will run one or more timer ticks after the
>>>  * syscall has completed, causing unexpected interrupts in userspace.
>>>  * Setting DP_QUIESCE avoids that problem by having the kernel "hold"
>>>  * the task in kernel mode until the timer ticks are complete.  This
>>>  * will make syscalls dramatically slower.
>>>  *
>>>  * If multiple dataplane tasks are scheduled on a single core, this
>>>  * in effect silently disables DP_QUIESCE, which allows the tasks to make
>>>  * progress, but without actually disabling the timer tick.
>>>  */
>>> #define DP_QUIESCE      0x1
>> This is what Frederic's work does
> Actually, it's not quite the same, I believe. Frederic's patch set
> disables the tick for an adaptive-tick task at some timer tick after
> the system call, but the task doesn't know when that happens; so if a
> timing determinism guarantee is what you are after, you don't quite
> know in your user task whether it's safe to start doing real time
> stuff or not.
>
> If I understood Chris's quote correctly, they hold the task in kernel
> space until the timer tick is off, so that when the user space task
> continues to run after the system call it can assume the tick is off.
>
> Of course, nothing stops something (an IPI?) from re-enabling it later,
> but I guess it at least lets you start in a known state.
>
> I think the concept of giving the task some way to know if the tick is
> disabled or not is nice. I'm not sure the exact feature, and surely not
> the interface, are what we should adopt - maybe allow registering to
> receive a signal at the end of the tick when it is disabled and
> re-enabled?

The problem with that is that by receiving a signal, you are back where you
started: returning from the kernel to userspace, and potentially leaving
open the possibility that the kernel will still need a scheduler tick or
two to quiesce.

An alternative we considered was to pass in a memory location that the
kernel would update with the current state of the process (tick disabled or
not), and you could then spin reading that location until the kernel
stopped interrupting you and disabled the tick.  But it seemed silly when
we could essentially put that code in the kernel once and get it right.
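
The user-space half of that rejected alternative would have looked
something like this (hypothetical - the kernel never gained such a
flag-update interface):

#include <stdatomic.h>

_Atomic int tick_state = 1;     /* the kernel would store 0 here once
                                 * the tick is finally off */

static void wait_for_tick_off(void)
{
        while (atomic_load_explicit(&tick_state, memory_order_acquire))
                ;               /* busy-waiting is fine on a core
                                 * dedicated to this task */
}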

And note that the "IPI to re-enable it" is an event that is probably a bug
either in your application or in the kernel, which we track with the
DP_DEBUG flag; you wouldn't expect that to happen once your application was
up and running.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com



* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-15 21:50             ` Chris Metcalf
@ 2012-02-15 22:15               ` Christoph Lameter
  2012-02-15 23:44                 ` Chris Metcalf
  2012-02-21  1:34               ` Frederic Weisbecker
  1 sibling, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2012-02-15 22:15 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Peter Zijlstra, Frederic Weisbecker, Gilad Ben-Yossef,
	linux-kernel, linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin,
	Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Wed, 15 Feb 2012, Chris Metcalf wrote:

> The Tilera dataplane code is available on the "dataplane" branch (off of
> 3.3-rc3 at the moment):
>
> git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git

Looks like that patch is only for the tile architecture. Is there a
x86 version?


* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-15 22:15               ` Christoph Lameter
@ 2012-02-15 23:44                 ` Chris Metcalf
  0 siblings, 0 replies; 77+ messages in thread
From: Chris Metcalf @ 2012-02-15 23:44 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Frederic Weisbecker, Gilad Ben-Yossef,
	linux-kernel, linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin,
	Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On 2/15/2012 5:15 PM, Christoph Lameter wrote:
> On Wed, 15 Feb 2012, Chris Metcalf wrote:
>
>> The Tilera dataplane code is available on the "dataplane" branch (off of
>> 3.3-rc3 at the moment):
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git
> Looks like that patch is only for the tile architecture. Is there a
> x86 version?

No, we haven't looked at doing that yet.  Part of that would be moving
things that are now in the arch-specific area (like the <asm/dataplane.h>
header) to include/linux/, etc.; since this patch isn't ready for merge
yet, there are plenty of cleanups like that we'd want to do.  Probably the
next step is to figure out how to integrate what Tilera has done with the
nohz cpuset stuff that Frederic has so we retain the best of both worlds.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com



* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-14 13:18                                       ` Avi Kivity
@ 2012-02-21  0:02                                         ` Frederic Weisbecker
  0 siblings, 0 replies; 77+ messages in thread
From: Frederic Weisbecker @ 2012-02-21  0:02 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Paul E. McKenney, Christoph Lameter, Peter Zijlstra,
	Gilad Ben-Yossef, linux-kernel, Chris Metcalf, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton, Alexander Viro,
	Michal Nazarewicz, Kosaki Motohiro, Milton Miller

On Tue, Feb 14, 2012 at 03:18:03PM +0200, Avi Kivity wrote:
> On 02/10/2012 03:39 AM, Frederic Weisbecker wrote:
> > > 
> > > As long as the code doesn't enter RCU read-side critical sections in
> > > the time between rcu_idle_enter() and rcu_idle_exit(), this should
> > > work fine.
> >
> > This should work fine, yeah, but beyond the correctness, I wonder if this
> > is going to be a win.
> >
> > We use rcu_idle_enter() in idle to avoid keeping the tick for RCU. But
> > what about falling into guest mode? I guess the tick is kept there,
> > so is it going to be a win in throughput or something to use rcu_idle_enter()?
> 
> We could disable the tick while in guest mode as well.  Interrupts in
> guest mode are even more expensive than interrupts in user mode.

Right, that's definitely something I need to explore with the adaptive tickless
thing.


* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-15 21:50             ` Chris Metcalf
  2012-02-15 22:15               ` Christoph Lameter
@ 2012-02-21  1:34               ` Frederic Weisbecker
  2012-03-01 18:27                 ` Chris Metcalf
  1 sibling, 1 reply; 77+ messages in thread
From: Frederic Weisbecker @ 2012-02-21  1:34 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Peter Zijlstra, Gilad Ben-Yossef, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

On Wed, Feb 15, 2012 at 04:50:39PM -0500, Chris Metcalf wrote:
> On 2/10/2012 1:33 PM, Peter Zijlstra wrote:
> > On Thu, 2012-02-02 at 10:41 -0500, Chris Metcalf wrote:
> >> At Tilera we have been supporting a "dataplane" mode (aka Zero Overhead
> >> Linux - the marketing name).  This is configured on a per-cpu basis, and in
> >> addition to setting isolcpus for those nodes, also suppresses various
> >> things that might otherwise run (soft lockup detection, vmstat work,
> >> etc.).  
> > See that's wrong.. it starts being wrong by depending on cpuisol and
> > goes from there.
> >
> >> The claim is that you need to specify these kinds of things
> >> per-core since it's not always possible for the kernel to know that you
> >> really don't want the scheduler or any other interrupt source to touch the
> >> core, as opposed to the case where you just happen to have a single process
> >> scheduled on the core and you don't mind occasional interrupts.
> > Right, so that claim is proven false I think.
> >
> >> But
> >> there's definitely appeal in having the kernel do it adaptively too,
> >> particularly if it can be made to work just as well as configuring it
> >> statically. 
> > I see no reason why it shouldn't work as well or even better.
> 
> Thanks for the feedback.  To echo Gilad's guess in a later email, the code
> as-is is not intended as a patch planned for a merge.  The code is in use
> by our customers, who have found it useful, but what I'd really like to do
> is to make sure to integrate all the functionality that's useful in our
> "dataplane" mode into Frederic's ongoing work with nohz cpusets.
> 
> The goal of the work we've done is to provide a way for customers to ensure
> they reliably have zero jitter on cpus that are trying to process real-time
> or otherwise low-latency events.  A good example is 10 Gb network traffic,
> where at min-packet sizes you have only 50-odd cpu cycles to dispatch the
> packet to one of our 64 cores, and each core then has a budget of only a
> few thousand cycles to deal with the packet.  A kernel interrupt would mean
> dropping packets on the floor.  Similarly, for something like
> high-frequency trading, you'd want guaranteed low-latency response.
> 
> The Tilera dataplane code is available on the "dataplane" branch (off of
> 3.3-rc3 at the moment):
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git
> 
> I'm still looking at Frederic's git tree, but I'm assuming the following
> are all true of tasks that are running on a nohz cpuset core:
> 
> - The core will not run the global scheduler to do work stealing, since
> otherwise you can't guarantee that only tasks that care about userspace
> nohz get to run there.  (I suppose you could loosen this such that the core
> would do work stealing as long as no task was pinned to that core by
> affinity, at which point the pinned task would become the only runnable task.)

A nohz cpuset doesn't really control that. It actually reacts to the
scheduler's actions: it tries to stop the tick if there is only one task
on the runqueue, and restarts it when we have more.

Ensuring the CPU doesn't get distracted is rather the role of the user by
setting the right cpusets to get the desired affinity. And if we still have
noise with workqueues or something, this is something we need to look at
and fix on a case by case basis.


> - Kernel "background" tasks are disabled on that core, at least while
> userspace nohz tasks are running: softlockup watchdog, slab reap timer,
> vmstat thread, etc.

Yeah, those are examples of "noisy" things. They are in fact a separate
issue that nohz cpusets don't touch. nohz cpusets are really only about
trying to shut down the periodic tick, or defer it as far as possible
into the future.

Now the nohz cpuset uses some user/kernel entry/exit hooks that we can
extend to cover some of these cases. We may want to make some timers
"user-deferrable", i.e. deactivate them when returning to userspace and
reactivate them on kernel entry.

That needs some thinking though; this may not always be a win for every
workload. But those that are userspace-mostly can profit.


* Re: [v7 0/8] Reduce cross CPU IPI interference
  2012-02-21  1:34               ` Frederic Weisbecker
@ 2012-03-01 18:27                 ` Chris Metcalf
  0 siblings, 0 replies; 77+ messages in thread
From: Chris Metcalf @ 2012-03-01 18:27 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Gilad Ben-Yossef, linux-kernel,
	Christoph Lameter, linux-mm, Pekka Enberg, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, Mel Gorman, Andrew Morton,
	Alexander Viro, Avi Kivity, Michal Nazarewicz, Kosaki Motohiro,
	Milton Miller

(Sorry, away on vacation for a while, and just getting back to this thread.)

On 2/20/2012 8:34 PM, Frederic Weisbecker wrote:
> On Wed, Feb 15, 2012 at 04:50:39PM -0500, Chris Metcalf wrote:
>> The Tilera dataplane code is available on the "dataplane" branch (off of
>> 3.3-rc3 at the moment):
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git
>>
>> I'm still looking at Frederic's git tree, but I'm assuming the following
>> are all true of tasks that are running on a nohz cpuset core:
>>
>> - The core will not run the global scheduler to do work stealing, since
>> otherwise you can't guarantee that only tasks that care about userspace
>> nohz get to run there.  (I suppose you could loosen this such that the core
>> would do work stealing as long as no task was pinned to that core by
>> affinity, at which point the pinned task would become the only runnable task.)
> A nohz cpuset doesn't really control that. It actually reacts to the scheduler
> actions. Like try to stop the tick if there is only one task on the runqueue,
> restart it when we have more.
>
> Ensuring the CPU doesn't get distracted is rather the role of the user by
> setting the right cpusets to get the desired affinity. And if we still have
> noise with workqueues or something, this is something we need to look at
> and fix on a case by case basis.

So won't it still be the case that the nohz cpus will try to run the global
scheduler to do load balancing?  Or are you relying on something like the
idle load balancer functionality to do the load balancing externally?  The
reason for isolcpus in the Tilera code is just to avoid having the
dataplane cpus ever end up having to schedule a tick just so they can do
load balancing work.

Frederic, do you have a design document, or anything else I can read other
than the code in your git tree?  I still haven't found time to do that,
though I'd definitely like to start figuring out how I can integrate your
stuff and the Tilera stuff into a single thing that both meets our
customers' needs, AND is actually integrated into the kernel.org master :-)

>> - Kernel "background" tasks are disabled on that core, at least while
>> userspace nohz tasks are running: softlockup watchdog, slab reap timer,
>> vmstat thread, etc.
> Yeah that's examples of "noisy" things. Those are in fact a seperate issues
> that nohz cpusets don't touch. nohz cpuset are really only about trying to
> shut down the periodic tick, or defer it for a far as possible in the future.
>
> Now the nohz cpuset uses some user/kernel entry/exit hooks that we can extend
> to cover some of these cases. We may want to make some timers "user-deferrable",
> ie: deactivate, reactivate them on kernel entry and exit.
>
> That need some thinking though, this may not always be a win for every workload.
> But those that are userspace-mostly can profit.

Yes.  The workload we (along with Gilad and some others) are focused on
is just the very simple one where we want to be able to have something
go into userspace and get 100.000% of the cpu until the task itself
takes some action that requires kernel support.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


