linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 0/5] Reduce cross CPU IPI interference
@ 2011-11-22 11:08 Gilad Ben-Yossef
  2011-11-22 11:08 ` [PATCH v4 1/5] smp: Introduce a generic on_each_cpu_mask function Gilad Ben-Yossef
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Gilad Ben-Yossef @ 2011-11-22 11:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gilad Ben-Yossef, Chris Metcalf, Christoph Lameter,
	Peter Zijlstra, Frederic Weisbecker, Russell King, linux-mm,
	Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen

We have lots of infrastructure in place to partition a multi-core system such that we have a group of CPUs that are dedicated to specific tasks: cgroups, scheduler and interrupt affinity, and the isolcpus boot parameter. Still, kernel code will sometimes interrupt all CPUs in the system via IPIs for various needs. These IPIs are useful and cannot be avoided altogether, but in certain cases it is possible to interrupt only specific CPUs that have useful work to
do and not the entire system.

This patch set, inspired by discussions with Peter Zijlstra and Frederic Weisbecker when testing the nohz task patch set, is a first stab at exploring this: it locates the places where such global IPI calls are made and turns a global IPI into an IPI for a specific group of CPUs. The purpose of the patch set is to get feedback on whether this is the right way to deal with the issue and, indeed, whether the issue is even worth dealing with at all. Based on the feedback from this patch set I plan to offer further patches that address similar issues in other code paths.

The patch set creates an on_each_cpu_mask infrastructure API (derived from the existing arch-specific versions in tile and arm) and uses it to turn two global
IPI invocations into per-CPU-group invocations.
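
To illustrate the shape of the change at a call site (a sketch only; cpus stands in for a cpumask computed by the caller, as done in patches 4 and 5):

-	on_each_cpu(flush_cpu_slab, s, 1);
+	on_each_cpu_mask(cpus, flush_cpu_slab, s, 1);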

This 4th version incorporates three efficiency and code style changes based on feedback from Christoph Lameter:

- Removed an unneeded check for the per-cpu cpu_slab pointer;
  it is guaranteed to always be there.

- Renamed the pageset variable to pcp to match the code style.

- Iterate over CPUs first, then zones in the inner loop.
  This stands a better chance of good memory locality and thus a better
  cache access pattern.

The patch set was compiled for arm and boot-tested on x86 in UP and SMP modes, with and
without CONFIG_CPUMASK_OFFSTACK, and was further tested by running hackbench on x86 in
SMP mode in a 4-CPU VM, with no obvious regressions.

I also artificially exercised SLUB's flush_all via the debug interface and observed
the difference in IPI count across processors with and without the patch: from
an IPI on all processors but one without the patch, to a subset (and often no IPI
at all) with the patch.

I further used the fault injection framework to force cpumask allocation failures for
the CPUMASK_OFFSTACK=y case, triggering the code via the SLUB sysfs debug interface
and by running ./hackbench 1000 for page_alloc, with no critical failures.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
CC: Chris Metcalf <cmetcalf@tilera.com>
CC: Christoph Lameter <cl@linux-foundation.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: linux-mm@kvack.org
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>

Gilad Ben-Yossef (5):
  smp: Introduce a generic on_each_cpu_mask function
  arm: Move arm over to generic on_each_cpu_mask
  tile: Move tile to use generic on_each_cpu_mask
  slub: Only IPI CPUs that have per cpu obj to flush
  mm: Only IPI CPUs to drain local pages if they exist

 arch/arm/kernel/smp_tlb.c   |   20 +++++---------------
 arch/tile/include/asm/smp.h |    7 -------
 arch/tile/kernel/smp.c      |   19 -------------------
 include/linux/smp.h         |   16 ++++++++++++++++
 kernel/smp.c                |   20 ++++++++++++++++++++
 mm/page_alloc.c             |   18 +++++++++++++++++-
 mm/slub.c                   |   15 ++++++++++++++-
 7 files changed, 72 insertions(+), 43 deletions(-)



* [PATCH v4 1/5] smp: Introduce a generic on_each_cpu_mask function
  2011-11-22 11:08 [PATCH v4 0/5] Reduce cross CPU IPI interference Gilad Ben-Yossef
@ 2011-11-22 11:08 ` Gilad Ben-Yossef
  2011-11-22 11:08 ` [PATCH v4 2/5] arm: Move arm over to generic on_each_cpu_mask Gilad Ben-Yossef
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Gilad Ben-Yossef @ 2011-11-22 11:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gilad Ben-Yossef, Chris Metcalf, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Pekka Enberg,
	Matt Mackall, Rik van Riel, Andi Kleen, Sasha Levin

on_each_cpu_mask() calls a function on the processors specified by the cpumask, which may include the local processor.

All the limitations specified in smp_call_function_many() apply.
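
A minimal usage sketch (do_flush and mask are hypothetical placeholders, not part of this patch):

	static void do_flush(void *info)
	{
		/* runs on every CPU set in mask, with IRQs disabled */
	}

	/* wait == 1: block until do_flush has run on all CPUs in mask */
	on_each_cpu_mask(mask, do_flush, NULL, 1);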

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
CC: Chris Metcalf <cmetcalf@tilera.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: linux-mm@kvack.org
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Sasha Levin <levinsasha928@gmail.com>
---
 include/linux/smp.h |   16 ++++++++++++++++
 kernel/smp.c        |   20 ++++++++++++++++++++
 2 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 8cc38d3..60628d7 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -102,6 +102,13 @@ static inline void call_function_init(void) { }
 int on_each_cpu(smp_call_func_t func, void *info, int wait);
 
 /*
+ * Call a function on processors specified by mask, which might include
+ * the local one.
+ */
+void on_each_cpu_mask(const struct cpumask *mask, void (*func)(void *),
+		void *info, bool wait);
+
+/*
  * Mark the boot cpu "online" so that it can call console drivers in
  * printk() and can access its per-cpu storage.
  */
@@ -132,6 +139,15 @@ static inline int up_smp_call_function(smp_call_func_t func, void *info)
 		local_irq_enable();		\
 		0;				\
 	})
+#define on_each_cpu_mask(mask, func, info, wait) \
+	do {						\
+		if (cpumask_test_cpu(0, (mask))) {	\
+			local_irq_disable();		\
+			(func)(info);			\
+			local_irq_enable();		\
+		}					\
+	} while (0)
+
 static inline void smp_send_reschedule(int cpu) { }
 #define num_booting_cpus()			1
 #define smp_prepare_boot_cpu()			do {} while (0)
diff --git a/kernel/smp.c b/kernel/smp.c
index db197d6..7c0cbd7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -701,3 +701,23 @@ int on_each_cpu(void (*func) (void *info), void *info, int wait)
 	return ret;
 }
 EXPORT_SYMBOL(on_each_cpu);
+
+/*
+ * Call a function on processors specified by cpumask, which may include
+ * the local processor. All the limitation specified in smp_call_function_many
+ * apply.
+ */
+void on_each_cpu_mask(const struct cpumask *mask, void (*func)(void *),
+			void *info, bool wait)
+{
+	int cpu = get_cpu();
+
+	smp_call_function_many(mask, func, info, wait);
+	if (cpumask_test_cpu(cpu, mask)) {
+		local_irq_disable();
+		func(info);
+		local_irq_enable();
+	}
+	put_cpu();
+}
+EXPORT_SYMBOL(on_each_cpu_mask);
-- 
1.7.0.4



* [PATCH v4 2/5] arm: Move arm over to generic on_each_cpu_mask
  2011-11-22 11:08 [PATCH v4 0/5] Reduce cross CPU IPI interference Gilad Ben-Yossef
  2011-11-22 11:08 ` [PATCH v4 1/5] smp: Introduce a generic on_each_cpu_mask function Gilad Ben-Yossef
@ 2011-11-22 11:08 ` Gilad Ben-Yossef
  2011-11-22 21:00   ` Russell King - ARM Linux
  2011-11-22 11:08 ` [PATCH v4 3/5] tile: Move tile to use " Gilad Ben-Yossef
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Gilad Ben-Yossef @ 2011-11-22 11:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gilad Ben-Yossef, Peter Zijlstra, Frederic Weisbecker,
	Russell King, Christoph Lameter, Chris Metcalf, linux-mm,
	Pekka Enberg, Matt Mackall, Rik van Riel, Andi Kleen,
	Sasha Levin

Note that the generic version takes the mask as its first parameter.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Christoph Lameter <cl@linux.com>
CC: Chris Metcalf <cmetcalf@tilera.com>
CC: linux-mm@kvack.org
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Sasha Levin <levinsasha928@gmail.com>
---
 arch/arm/kernel/smp_tlb.c |   20 +++++---------------
 1 files changed, 5 insertions(+), 15 deletions(-)

diff --git a/arch/arm/kernel/smp_tlb.c b/arch/arm/kernel/smp_tlb.c
index 7dcb352..02c5d2c 100644
--- a/arch/arm/kernel/smp_tlb.c
+++ b/arch/arm/kernel/smp_tlb.c
@@ -13,18 +13,6 @@
 #include <asm/smp_plat.h>
 #include <asm/tlbflush.h>
 
-static void on_each_cpu_mask(void (*func)(void *), void *info, int wait,
-	const struct cpumask *mask)
-{
-	preempt_disable();
-
-	smp_call_function_many(mask, func, info, wait);
-	if (cpumask_test_cpu(smp_processor_id(), mask))
-		func(info);
-
-	preempt_enable();
-}
-
 /**********************************************************************/
 
 /*
@@ -87,7 +75,7 @@ void flush_tlb_all(void)
 void flush_tlb_mm(struct mm_struct *mm)
 {
 	if (tlb_ops_need_broadcast())
-		on_each_cpu_mask(ipi_flush_tlb_mm, mm, 1, mm_cpumask(mm));
+		on_each_cpu_mask(mm_cpumask(mm), ipi_flush_tlb_mm, mm, 1);
 	else
 		local_flush_tlb_mm(mm);
 }
@@ -98,7 +86,8 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
 		struct tlb_args ta;
 		ta.ta_vma = vma;
 		ta.ta_start = uaddr;
-		on_each_cpu_mask(ipi_flush_tlb_page, &ta, 1, mm_cpumask(vma->vm_mm));
+		on_each_cpu_mask(mm_cpumask(vma->vm_mm), ipi_flush_tlb_page,
+					&ta, 1);
 	} else
 		local_flush_tlb_page(vma, uaddr);
 }
@@ -121,7 +110,8 @@ void flush_tlb_range(struct vm_area_struct *vma,
 		ta.ta_vma = vma;
 		ta.ta_start = start;
 		ta.ta_end = end;
-		on_each_cpu_mask(ipi_flush_tlb_range, &ta, 1, mm_cpumask(vma->vm_mm));
+		on_each_cpu_mask(mm_cpumask(vma->vm_mm), ipi_flush_tlb_range,
+					&ta, 1);
 	} else
 		local_flush_tlb_range(vma, start, end);
 }
-- 
1.7.0.4



* [PATCH v4 3/5] tile: Move tile to use generic on_each_cpu_mask
  2011-11-22 11:08 [PATCH v4 0/5] Reduce cross CPU IPI interference Gilad Ben-Yossef
  2011-11-22 11:08 ` [PATCH v4 1/5] smp: Introduce a generic on_each_cpu_mask function Gilad Ben-Yossef
  2011-11-22 11:08 ` [PATCH v4 2/5] arm: Move arm over to generic on_each_cpu_mask Gilad Ben-Yossef
@ 2011-11-22 11:08 ` Gilad Ben-Yossef
  2011-11-22 11:08 ` [PATCH v4 4/5] slub: Only IPI CPUs that have per cpu obj to flush Gilad Ben-Yossef
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Gilad Ben-Yossef @ 2011-11-22 11:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gilad Ben-Yossef, Peter Zijlstra, Frederic Weisbecker,
	Russell King, linux-mm, Christoph Lameter, Pekka Enberg,
	Matt Mackall, Rik van Riel, Andi Kleen, Sasha Levin

The API is the same as tile's private one, so just remove the private versions of the functions.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: linux-mm@kvack.org
CC: Christoph Lameter <cl@linux-foundation.org>
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Sasha Levin <levinsasha928@gmail.com>
---
 arch/tile/include/asm/smp.h |    7 -------
 arch/tile/kernel/smp.c      |   19 -------------------
 2 files changed, 0 insertions(+), 26 deletions(-)

diff --git a/arch/tile/include/asm/smp.h b/arch/tile/include/asm/smp.h
index 532124a..1aa759a 100644
--- a/arch/tile/include/asm/smp.h
+++ b/arch/tile/include/asm/smp.h
@@ -43,10 +43,6 @@ void evaluate_message(int tag);
 /* Boot a secondary cpu */
 void online_secondary(void);
 
-/* Call a function on a specified set of CPUs (may include this one). */
-extern void on_each_cpu_mask(const struct cpumask *mask,
-			     void (*func)(void *), void *info, bool wait);
-
 /* Topology of the supervisor tile grid, and coordinates of boot processor */
 extern HV_Topology smp_topology;
 
@@ -91,9 +87,6 @@ void print_disabled_cpus(void);
 
 #else /* !CONFIG_SMP */
 
-#define on_each_cpu_mask(mask, func, info, wait)		\
-  do { if (cpumask_test_cpu(0, (mask))) func(info); } while (0)
-
 #define smp_master_cpu		0
 #define smp_height		1
 #define smp_width		1
diff --git a/arch/tile/kernel/smp.c b/arch/tile/kernel/smp.c
index c52224d..a44e103 100644
--- a/arch/tile/kernel/smp.c
+++ b/arch/tile/kernel/smp.c
@@ -87,25 +87,6 @@ void send_IPI_allbutself(int tag)
 	send_IPI_many(&mask, tag);
 }
 
-
-/*
- * Provide smp_call_function_mask, but also run function locally
- * if specified in the mask.
- */
-void on_each_cpu_mask(const struct cpumask *mask, void (*func)(void *),
-		      void *info, bool wait)
-{
-	int cpu = get_cpu();
-	smp_call_function_many(mask, func, info, wait);
-	if (cpumask_test_cpu(cpu, mask)) {
-		local_irq_disable();
-		func(info);
-		local_irq_enable();
-	}
-	put_cpu();
-}
-
-
 /*
  * Functions related to starting/stopping cpus.
  */
-- 
1.7.0.4



* [PATCH v4 4/5] slub: Only IPI CPUs that have per cpu obj to flush
  2011-11-22 11:08 [PATCH v4 0/5] Reduce cross CPU IPI interference Gilad Ben-Yossef
                   ` (2 preceding siblings ...)
  2011-11-22 11:08 ` [PATCH v4 3/5] tile: Move tile to use " Gilad Ben-Yossef
@ 2011-11-22 11:08 ` Gilad Ben-Yossef
  2011-11-23  6:23   ` Pekka Enberg
  2011-11-22 11:08 ` [PATCH v4 5/5] mm: Only IPI CPUs to drain local pages if they exist Gilad Ben-Yossef
  2011-11-23  1:36 ` [PATCH v4 0/5] Reduce cross CPU IPI interference Chris Metcalf
  5 siblings, 1 reply; 29+ messages in thread
From: Gilad Ben-Yossef @ 2011-11-22 11:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gilad Ben-Yossef, Chris Metcalf, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen

flush_all() is called for each kmem_cache_destroy(), so every cache being destroyed dynamically ends up sending an IPI to each CPU in the system, regardless of whether the cache has ever been used there.

For example, if you close the Infiniband ipath driver char device file, the close file ops calls kmem_cache_destroy(). So running some Infiniband config tool on a single CPU dedicated to system tasks might interrupt the rest of the 127 CPUs I dedicated to some CPU-intensive task.

I suspect there is a good chance that every line in the output of "git grep kmem_cache_destroy linux/ | grep '\->'" has a similar scenario.

This patch attempts to rectify this issue by sending the IPI that flushes the per-cpu objects back to the free lists only to CPUs that seem to have such objects.

The check of which CPUs to IPI is racy, but we don't care: asking a CPU without per-cpu objects to flush does no damage, and as far as I can tell flush_all is itself racy against allocs on remote CPUs anyway, so if you meant flush_all to be deterministic, you had to arrange for locking regardless.

Without this patch the following artificial test case:

$ cd /sys/kernel/slab
$ for DIR in *; do cat $DIR/alloc_calls > /dev/null; done

produces 166 IPIs on a cpuset-isolated CPU. With it, it produces none.

The memory allocation failure code path for the CPUMASK_OFFSTACK=y config was tested using the fault injection framework.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: Chris Metcalf <cmetcalf@tilera.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: linux-mm@kvack.org
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>
---
 mm/slub.c |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 7d2a996..1f18006 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2006,7 +2006,20 @@ static void flush_cpu_slab(void *d)
 
 static void flush_all(struct kmem_cache *s)
 {
-	on_each_cpu(flush_cpu_slab, s, 1);
+	cpumask_var_t cpus;
+	struct kmem_cache_cpu *c;
+	int cpu;
+
+	if (likely(zalloc_cpumask_var(&cpus, GFP_ATOMIC))) {
+		for_each_online_cpu(cpu) {
+			c = per_cpu_ptr(s->cpu_slab, cpu);
+			if (c->page)
+				cpumask_set_cpu(cpu, cpus);
+		}
+		on_each_cpu_mask(cpus, flush_cpu_slab, s, 1);
+		free_cpumask_var(cpus);
+	} else
+		on_each_cpu(flush_cpu_slab, s, 1);
 }
 
 /*
-- 
1.7.0.4



* [PATCH v4 5/5] mm: Only IPI CPUs to drain local pages if they exist
  2011-11-22 11:08 [PATCH v4 0/5] Reduce cross CPU IPI interference Gilad Ben-Yossef
                   ` (3 preceding siblings ...)
  2011-11-22 11:08 ` [PATCH v4 4/5] slub: Only IPI CPUs that have per cpu obj to flush Gilad Ben-Yossef
@ 2011-11-22 11:08 ` Gilad Ben-Yossef
  2011-11-23  7:45   ` Pekka Enberg
  2011-12-23 10:28   ` Mel Gorman
  2011-11-23  1:36 ` [PATCH v4 0/5] Reduce cross CPU IPI interference Chris Metcalf
  5 siblings, 2 replies; 29+ messages in thread
From: Gilad Ben-Yossef @ 2011-11-22 11:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gilad Ben-Yossef, Chris Metcalf, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen

Calculate a cpumask of CPUs that have per-cpu pages in any zone, and send the IPI requesting that these pages be drained to the buddy allocator only to those CPUs, rather than to every CPU in the system.

The memory allocation failure code path for the CPUMASK_OFFSTACK=y config was tested using the fault injection framework.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: Chris Metcalf <cmetcalf@tilera.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: linux-mm@kvack.org
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>
---
 mm/page_alloc.c |   18 +++++++++++++++++-
 1 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d..a3efdf1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1119,7 +1119,23 @@ void drain_local_pages(void *arg)
  */
 void drain_all_pages(void)
 {
-	on_each_cpu(drain_local_pages, NULL, 1);
+	int cpu;
+	struct zone *zone;
+	cpumask_var_t cpus;
+	struct per_cpu_pageset *pcp;
+
+	if (likely(zalloc_cpumask_var(&cpus, GFP_ATOMIC))) {
+		for_each_online_cpu(cpu) {
+			for_each_populated_zone(zone) {
+				pcp = per_cpu_ptr(zone->pageset, cpu);
+				if (pcp->pcp.count)
+					cpumask_set_cpu(cpu, cpus);
+		}
+	}
+		on_each_cpu_mask(cpus, drain_local_pages, NULL, 1);
+		free_cpumask_var(cpus);
+	} else
+		on_each_cpu(drain_local_pages, NULL, 1);
 }
 
 #ifdef CONFIG_HIBERNATION
-- 
1.7.0.4



* Re: [PATCH v4 2/5] arm: Move arm over to generic on_each_cpu_mask
  2011-11-22 11:08 ` [PATCH v4 2/5] arm: Move arm over to generic on_each_cpu_mask Gilad Ben-Yossef
@ 2011-11-22 21:00   ` Russell King - ARM Linux
  2011-11-23  6:47     ` Gilad Ben-Yossef
  0 siblings, 1 reply; 29+ messages in thread
From: Russell King - ARM Linux @ 2011-11-22 21:00 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, Peter Zijlstra, Frederic Weisbecker,
	Christoph Lameter, Chris Metcalf, linux-mm, Pekka Enberg,
	Matt Mackall, Rik van Riel, Andi Kleen, Sasha Levin

On Tue, Nov 22, 2011 at 01:08:45PM +0200, Gilad Ben-Yossef wrote:
> -static void on_each_cpu_mask(void (*func)(void *), void *info, int wait,
> -	const struct cpumask *mask)
> -{
> -	preempt_disable();
> -
> -	smp_call_function_many(mask, func, info, wait);
> -	if (cpumask_test_cpu(smp_processor_id(), mask))
> -		func(info);
> -
> -	preempt_enable();
> -}

What hasn't been said in the descriptions (I couldn't find it) is that
there's a semantic change between the new generic version and this version -
that is, we run the function with IRQs disabled on the local CPU, whereas
the version above runs it with IRQs potentially enabled.

Luckily, for TLB flushing this is probably not a problem, but it's
something that should've been pointed out in the patch description.


* Re: [PATCH v4 0/5] Reduce cross CPU IPI interference
  2011-11-22 11:08 [PATCH v4 0/5] Reduce cross CPU IPI interference Gilad Ben-Yossef
                   ` (4 preceding siblings ...)
  2011-11-22 11:08 ` [PATCH v4 5/5] mm: Only IPI CPUs to drain local pages if they exist Gilad Ben-Yossef
@ 2011-11-23  1:36 ` Chris Metcalf
  2011-11-23  6:52   ` Gilad Ben-Yossef
  5 siblings, 1 reply; 29+ messages in thread
From: Chris Metcalf @ 2011-11-23  1:36 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, Christoph Lameter, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen

On 11/22/2011 6:08 AM, Gilad Ben-Yossef wrote:
> We have lots of infrastructure in place to partition a multi-core system such that we have a group of CPUs that are dedicated to specific tasks: cgroups, scheduler and interrupt affinity, and the isolcpus boot parameter. Still, kernel code will sometimes interrupt all CPUs in the system via IPIs for various needs. These IPIs are useful and cannot be avoided altogether, but in certain cases it is possible to interrupt only specific CPUs that have useful work to
> do and not the entire system.
>
> This patch set, inspired by discussions with Peter Zijlstra and Frederic Weisbecker when testing the nohz task patch set, is a first stab at exploring this: it locates the places where such global IPI calls are made and turns a global IPI into an IPI for a specific group of CPUs. The purpose of the patch set is to get feedback on whether this is the right way to deal with the issue and, indeed, whether the issue is even worth dealing with at all. Based on the feedback from this patch set I plan to offer further patches that address similar issues in other code paths.
>
> The patch set creates an on_each_cpu_mask infrastructure API (derived from the existing arch-specific versions in tile and arm) and uses it to turn two global
> IPI invocations into per-CPU-group invocations.

Acked-by: Chris Metcalf <cmetcalf@tilera.com>

I think this kind of work is very important as more and more processing
moves to isolated cpus that need protection from miscellaneous kernel
interrupts.  Keep at it! :-)

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com



* Re: [PATCH v4 4/5] slub: Only IPI CPUs that have per cpu obj to flush
  2011-11-22 11:08 ` [PATCH v4 4/5] slub: Only IPI CPUs that have per cpu obj to flush Gilad Ben-Yossef
@ 2011-11-23  6:23   ` Pekka Enberg
  2011-11-23  6:57     ` Pekka Enberg
  2012-01-01 12:41     ` Avi Kivity
  0 siblings, 2 replies; 29+ messages in thread
From: Pekka Enberg @ 2011-11-23  6:23 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, Chris Metcalf, Peter Zijlstra, Frederic Weisbecker,
	Russell King, linux-mm, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, apkm

On Tue, 22 Nov 2011, Gilad Ben-Yossef wrote:
> static void flush_all(struct kmem_cache *s)
> {
> -	on_each_cpu(flush_cpu_slab, s, 1);
> +	cpumask_var_t cpus;
> +	struct kmem_cache_cpu *c;
> +	int cpu;
> +
> +	if (likely(zalloc_cpumask_var(&cpus, GFP_ATOMIC))) {

__GFP_NOWARN too maybe?

> +		for_each_online_cpu(cpu) {
> +			c = per_cpu_ptr(s->cpu_slab, cpu);
> +			if (c->page)
> +				cpumask_set_cpu(cpu, cpus);
> +		}
> +		on_each_cpu_mask(cpus, flush_cpu_slab, s, 1);
> +		free_cpumask_var(cpus);
> +	} else
> +		on_each_cpu(flush_cpu_slab, s, 1);
> }

Acked-by: Pekka Enberg <penberg@kernel.org>

I can't take the patch because it depends on a new API introduced in the 
first patch.

I'm CC'ing Andrew.

 			Pekka


* Re: [PATCH v4 2/5] arm: Move arm over to generic on_each_cpu_mask
  2011-11-22 21:00   ` Russell King - ARM Linux
@ 2011-11-23  6:47     ` Gilad Ben-Yossef
  0 siblings, 0 replies; 29+ messages in thread
From: Gilad Ben-Yossef @ 2011-11-23  6:47 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: linux-kernel, Peter Zijlstra, Frederic Weisbecker,
	Christoph Lameter, Chris Metcalf, linux-mm, Pekka Enberg,
	Matt Mackall, Rik van Riel, Andi Kleen, Sasha Levin

Hi,

On Tue, Nov 22, 2011 at 11:00 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Tue, Nov 22, 2011 at 01:08:45PM +0200, Gilad Ben-Yossef wrote:
>> -static void on_each_cpu_mask(void (*func)(void *), void *info, int wait,
>> -     const struct cpumask *mask)
>> -{
>> -     preempt_disable();
>> -
>> -     smp_call_function_many(mask, func, info, wait);
>> -     if (cpumask_test_cpu(smp_processor_id(), mask))
>> -             func(info);
>> -
>> -     preempt_enable();
>> -}
>
> What hasn't been said in the descriptions (I couldn't find it) is that
> there's a semantic change between the new generic version and this version -
> that is, we run the function with IRQs disabled on the local CPU, whereas
> the version above runs it with IRQs potentially enabled.
>
> Luckily, for TLB flushing this is probably not a problem, but it's
> something that should've been pointed out in the patch description.

Thank you for the review!

You are very right that I should have mentioned it in the description.
My apologies for missing that bit.

My reasoning for why the change is OK is that the function passed must
already be safe to run with interrupts disabled, because that is how it
will be called on all the other CPUs through the IPI handler. This is
also how the generic on_each_cpu() handles it.

Have I missed something? If not, would you like me to update the patch
description and re-send?

Thanks,
Gilad




-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML


* Re: [PATCH v4 0/5] Reduce cross CPU IPI interference
  2011-11-23  1:36 ` [PATCH v4 0/5] Reduce cross CPU IPI interference Chris Metcalf
@ 2011-11-23  6:52   ` Gilad Ben-Yossef
  0 siblings, 0 replies; 29+ messages in thread
From: Gilad Ben-Yossef @ 2011-11-23  6:52 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: linux-kernel, Christoph Lameter, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen

Hi

On Wed, Nov 23, 2011 at 3:36 AM, Chris Metcalf <cmetcalf@tilera.com> wrote:
> On 11/22/2011 6:08 AM, Gilad Ben-Yossef wrote:
>> We have lots of infrastructure in place to partition a multi-core system such that we have a group of CPUs that are dedicated to
>
> Acked-by: Chris Metcalf <cmetcalf@tilera.com>
>
> I think this kind of work is very important as more and more processing
> moves to isolated cpus that need protection from miscellaneous kernel
> interrupts.  Keep at it! :-)

Thank you very much. I believe it also has some small contribution to
keeping idle CPUs idle in a multi-core system. I consider this my
personal carbon offset effort :-)

My current plan is to take a look at each invocation of on_each_cpu in
core kernel code and see if it makes sense to give it a similar
treatment.

Thanks,
Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML


* Re: [PATCH v4 4/5] slub: Only IPI CPUs that have per cpu obj to flush
  2011-11-23  6:23   ` Pekka Enberg
@ 2011-11-23  6:57     ` Pekka Enberg
  2011-11-23  7:52       ` Gilad Ben-Yossef
  2012-01-01 12:41     ` Avi Kivity
  1 sibling, 1 reply; 29+ messages in thread
From: Pekka Enberg @ 2011-11-23  6:57 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, Chris Metcalf, Peter Zijlstra, Frederic Weisbecker,
	Russell King, linux-mm, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Andrew Morton

On Wed, Nov 23, 2011 at 8:23 AM, Pekka Enberg <penberg@kernel.org> wrote:
> On Tue, 22 Nov 2011, Gilad Ben-Yossef wrote:
>>
>> static void flush_all(struct kmem_cache *s)
>> {
>> -       on_each_cpu(flush_cpu_slab, s, 1);
>> +       cpumask_var_t cpus;
>> +       struct kmem_cache_cpu *c;
>> +       int cpu;
>> +
>> +       if (likely(zalloc_cpumask_var(&cpus, GFP_ATOMIC))) {
>
> __GFP_NOWARN too maybe?
>
>> +               for_each_online_cpu(cpu) {
>> +                       c = per_cpu_ptr(s->cpu_slab, cpu);
>> +                       if (c->page)
>> +                               cpumask_set_cpu(cpu, cpus);
>> +               }
>> +               on_each_cpu_mask(cpus, flush_cpu_slab, s, 1);
>> +               free_cpumask_var(cpus);
>> +       } else
>> +               on_each_cpu(flush_cpu_slab, s, 1);
>> }
>
> Acked-by: Pekka Enberg <penberg@kernel.org>
>
> I can't take the patch because it depends on a new API introduced in the
> first patch.
>
> I'm CC'ing Andrew.

...this time with the right email address.


* Re: [PATCH v4 5/5] mm: Only IPI CPUs to drain local pages if they exist
  2011-11-22 11:08 ` [PATCH v4 5/5] mm: Only IPI CPUs to drain local pages if they exist Gilad Ben-Yossef
@ 2011-11-23  7:45   ` Pekka Enberg
  2011-12-23 10:28   ` Mel Gorman
  1 sibling, 0 replies; 29+ messages in thread
From: Pekka Enberg @ 2011-11-23  7:45 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, Chris Metcalf, Peter Zijlstra, Frederic Weisbecker,
	Russell King, linux-mm, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Mel Gorman, Andrew Morton

On Tue, Nov 22, 2011 at 1:08 PM, Gilad Ben-Yossef <gilad@benyossef.com> wrote:
> Calculate a cpumask of CPUs that have per-cpu pages in any zone, and send the IPI requesting that these pages be drained to the buddy allocator only to those CPUs, rather than to every CPU in the system.
>
> The memory allocation failure code path for the CPUMASK_OFFSTACK=y config was tested using the fault injection framework.
>
> Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> CC: Chris Metcalf <cmetcalf@tilera.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> CC: Russell King <linux@arm.linux.org.uk>
> CC: linux-mm@kvack.org
> CC: Pekka Enberg <penberg@kernel.org>
> CC: Matt Mackall <mpm@selenic.com>
> CC: Sasha Levin <levinsasha928@gmail.com>
> CC: Rik van Riel <riel@redhat.com>
> CC: Andi Kleen <andi@firstfloor.org>

I'm adding Mel and Andrew to CC.

> ---
>  mm/page_alloc.c |   18 +++++++++++++++++-
>  1 files changed, 17 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9dd443d..a3efdf1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1119,7 +1119,23 @@ void drain_local_pages(void *arg)
>  */
>  void drain_all_pages(void)
>  {
> -       on_each_cpu(drain_local_pages, NULL, 1);
> +       int cpu;
> +       struct zone *zone;
> +       cpumask_var_t cpus;
> +       struct per_cpu_pageset *pcp;
> +
> +       if (likely(zalloc_cpumask_var(&cpus, GFP_ATOMIC))) {

__GFP_NOWARN

> +               for_each_online_cpu(cpu) {
> +                       for_each_populated_zone(zone) {
> +                               pcp = per_cpu_ptr(zone->pageset, cpu);
> +                               if (pcp->pcp.count)
> +                                       cpumask_set_cpu(cpu, cpus);
> +               }
> +       }
> +               on_each_cpu_mask(cpus, drain_local_pages, NULL, 1);
> +               free_cpumask_var(cpus);
> +       } else
> +               on_each_cpu(drain_local_pages, NULL, 1);
>  }
>
>  #ifdef CONFIG_HIBERNATION

Acked-by: Pekka Enberg <penberg@kernel.org>


* Re: [PATCH v4 4/5] slub: Only IPI CPUs that have per cpu obj to flush
  2011-11-23  6:57     ` Pekka Enberg
@ 2011-11-23  7:52       ` Gilad Ben-Yossef
  0 siblings, 0 replies; 29+ messages in thread
From: Gilad Ben-Yossef @ 2011-11-23  7:52 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-kernel, Chris Metcalf, Peter Zijlstra, Frederic Weisbecker,
	Russell King, linux-mm, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen, Andrew Morton

On Wed, Nov 23, 2011 at 8:57 AM, Pekka Enberg <penberg@kernel.org> wrote:
> On Wed, Nov 23, 2011 at 8:23 AM, Pekka Enberg <penberg@kernel.org> wrote:
>> On Tue, 22 Nov 2011, Gilad Ben-Yossef wrote:
>>>
>>> static void flush_all(struct kmem_cache *s)
>>> {
>>> -       on_each_cpu(flush_cpu_slab, s, 1);
>>> +       cpumask_var_t cpus;
>>> +       struct kmem_cache_cpu *c;
>>> +       int cpu;
>>> +
>>> +       if (likely(zalloc_cpumask_var(&cpus, GFP_ATOMIC))) {
>>
>> __GFP_NOWARN too maybe?
>>

Right, the allocation failure here is harmless. I should probably do
the same for the page_alloc.c case as well.
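
i.e., presumably something like this (sketch):

	if (likely(zalloc_cpumask_var(&cpus, GFP_ATOMIC | __GFP_NOWARN))) {
		...
	} else
		on_each_cpu(flush_cpu_slab, s, 1);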

>>> +               for_each_online_cpu(cpu) {
>>> +                       c = per_cpu_ptr(s->cpu_slab, cpu);
>>> +                       if (c->page)
>>> +                               cpumask_set_cpu(cpu, cpus);
>>> +               }
>>> +               on_each_cpu_mask(cpus, flush_cpu_slab, s, 1);
>>> +               free_cpumask_var(cpus);
>>> +       } else
>>> +               on_each_cpu(flush_cpu_slab, s, 1);
>>> }
>>
>> Acked-by: Pekka Enberg <penberg@kernel.org>
>>
>> I can't take the patch because it depends on a new API introduced in the
>> first patch.
>>
>> I'm CC'ing Andrew.

Thanks!

There's a git tree of these over at: git://github.com/gby/linux.git
branch ipi_noise_v4 in case that helps.

I will send v5 (and create a new git branch for it) with the above
changes and the ARM patch description update once I get an Ack from
Russell K.

Cheers,
Gilad

>



-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML


* Re: [PATCH v4 5/5] mm: Only IPI CPUs to drain local pages if they exist
  2011-11-22 11:08 ` [PATCH v4 5/5] mm: Only IPI CPUs to drain local pages if they exist Gilad Ben-Yossef
  2011-11-23  7:45   ` Pekka Enberg
@ 2011-12-23 10:28   ` Mel Gorman
  2011-12-25  9:39     ` Gilad Ben-Yossef
  1 sibling, 1 reply; 29+ messages in thread
From: Mel Gorman @ 2011-12-23 10:28 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, Chris Metcalf, Peter Zijlstra, Frederic Weisbecker,
	Russell King, linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin,
	Rik van Riel, Andi Kleen

On Tue, Nov 22, 2011 at 01:08:48PM +0200, Gilad Ben-Yossef wrote:
> Calculate a cpumask of CPUs that have per-cpu pages in any zone, and send the IPI requesting that these pages be drained to the buddy allocator only to those CPUs, rather than to every CPU in the system.
> 
> The memory allocation failure code path for the CPUMASK_OFFSTACK=y config was tested using the fault injection framework.
> 
> Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> CC: Chris Metcalf <cmetcalf@tilera.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> CC: Russell King <linux@arm.linux.org.uk>
> CC: linux-mm@kvack.org
> CC: Pekka Enberg <penberg@kernel.org>
> CC: Matt Mackall <mpm@selenic.com>
> CC: Sasha Levin <levinsasha928@gmail.com>
> CC: Rik van Riel <riel@redhat.com>
> CC: Andi Kleen <andi@firstfloor.org>
> ---
>  mm/page_alloc.c |   18 +++++++++++++++++-
>  1 files changed, 17 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9dd443d..a3efdf1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1119,7 +1119,23 @@ void drain_local_pages(void *arg)
>   */
>  void drain_all_pages(void)
>  {
> -	on_each_cpu(drain_local_pages, NULL, 1);
> +	int cpu;
> +	struct zone *zone;
> +	cpumask_var_t cpus;
> +	struct per_cpu_pageset *pcp;
> +
> +	if (likely(zalloc_cpumask_var(&cpus, GFP_ATOMIC))) {
> +		for_each_online_cpu(cpu) {
> +			for_each_populated_zone(zone) {
> +				pcp = per_cpu_ptr(zone->pageset, cpu);
> +				if (pcp->pcp.count)
> +					cpumask_set_cpu(cpu, cpus);
> +		}
> +	}
> +		on_each_cpu_mask(cpus, drain_local_pages, NULL, 1);
> +		free_cpumask_var(cpus);

The indenting there is very weird but easily fixed.

A greater concern is that we are calling zalloc_cpumask_var() from the
direct reclaim path when we are already under memory pressure. How often
is this path hit and how often does the allocation fail?

Related to that, calling into the page allocator again for
zalloc_cpumask_var is not cheap.  Does reducing the number of IPIs
offset the cost of calling into the allocator again? How often does it
offset the cost and how often does it end up costing more? I guess that
would heavily depend on the number of CPUs and how many of them have
pages in their per-cpu buffer. Basically, sometimes we *might* save but
it comes at a definite cost of calling into the page allocator again.

The patch looks ok functionally but I'm skeptical that it really helps
performance.

> +	} else
> +		on_each_cpu(drain_local_pages, NULL, 1);
>  }
>  
>  #ifdef CONFIG_HIBERNATION

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v4 5/5] mm: Only IPI CPUs to drain local pages if they exist
  2011-12-23 10:28   ` Mel Gorman
@ 2011-12-25  9:39     ` Gilad Ben-Yossef
  2011-12-30 15:04       ` Mel Gorman
  0 siblings, 1 reply; 29+ messages in thread
From: Gilad Ben-Yossef @ 2011-12-25  9:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, Chris Metcalf, Peter Zijlstra, Frederic Weisbecker,
	Russell King, linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin,
	Rik van Riel, Andi Kleen

On Fri, Dec 23, 2011 at 12:28 PM, Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Nov 22, 2011 at 01:08:48PM +0200, Gilad Ben-Yossef wrote:
> > Calculate a cpumask of CPUs that have per-cpu pages in any zone, and send the IPI requesting that these pages be drained to the buddy allocator only to those CPUs, rather than to every CPU in the system.
> >
> > The memory allocation failure code path for the CPUMASK_OFFSTACK=y config was tested using the fault injection framework.
> >
> > Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
> > Acked-by: Christoph Lameter <cl@linux.com>
> > CC: Chris Metcalf <cmetcalf@tilera.com>
> > CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > CC: Frederic Weisbecker <fweisbec@gmail.com>
> > CC: Russell King <linux@arm.linux.org.uk>
> > CC: linux-mm@kvack.org
> > CC: Pekka Enberg <penberg@kernel.org>
> > CC: Matt Mackall <mpm@selenic.com>
> > CC: Sasha Levin <levinsasha928@gmail.com>
> > CC: Rik van Riel <riel@redhat.com>
> > CC: Andi Kleen <andi@firstfloor.org>
> > ---
> >  mm/page_alloc.c |   18 +++++++++++++++++-
> >  1 files changed, 17 insertions(+), 1 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9dd443d..a3efdf1 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1119,7 +1119,23 @@ void drain_local_pages(void *arg)
> >   */
> >  void drain_all_pages(void)
> >  {
> > -     on_each_cpu(drain_local_pages, NULL, 1);
> > +     int cpu;
> > +     struct zone *zone;
> > +     cpumask_var_t cpus;
> > +     struct per_cpu_pageset *pcp;
> > +
> > +     if (likely(zalloc_cpumask_var(&cpus, GFP_ATOMIC))) {
> > +             for_each_online_cpu(cpu) {
> > +                     for_each_populated_zone(zone) {
> > +                             pcp = per_cpu_ptr(zone->pageset, cpu);
> > +                             if (pcp->pcp.count)
> > +                                     cpumask_set_cpu(cpu, cpus);
> > +             }
> > +     }
> > +             on_each_cpu_mask(cpus, drain_local_pages, NULL, 1);
> > +             free_cpumask_var(cpus);
>
First off, thank you for the review.

>
> The indenting there is very weird but easily fixed.


Hmm... I missed that. Sorry. I will fix it with the next iteration.
>
>
> A greater concern is that we are calling zalloc_cpumask_var() from the
> direct reclaim path when we are already under memory pressure. How often
> is this path hit and how often does the allocation fail?


Yes, this scenario worried me too. In fact, I worried that we might end
up in an infinite loop of direct reclaim => cpumask var allocation =>
direct reclaim. Luckily this can't happen. It did cause me to test the
allocation failure case with the fault injection framework to make sure
it fails gracefully.

I'll try to explain why I believe we end up succeeding in most cases:

For the CONFIG_CPUMASK_OFFSTACK=n case there is no allocation, so there
is no problem.

For CONFIG_CPUMASK_OFFSTACK=y, when we get to drain_all_pages from the
memory hotplug or the memory failure code path (the other code paths
that call drain_all_pages), there is no inherent memory pressure, so we
should be OK.

If we did get to drain_all_pages from direct reclaim, but the cpumask
slab has a free object (or partial pages in the case of SLUB), then we
never hit the page allocator, so all is well, I believe.

So this leaves us with being called from the direct reclaim path when
the cpumask slab has no objects or partial pages and needs to hit the
page allocator. If we hit direct page reclaim, the original allocation
was not atomic, otherwise we would not have hit direct page reclaim.
The cpumask allocation, however, is atomic, so we have broader
allocation options - allocate high, allocate outside our cpuset, etc. -
and there is a reasonable chance the cpumask allocation can succeed
even if the original allocation ended up in direct reclaim.

So we end up failing to allocate the cpumask var only if the memory
pressure is really a global system memory shortage, as opposed to, for
example, some user space failure to page in some heap space in a
cpuset. Even then, we fail gracefully.


> Related to that, calling into the page allocator again for
> zalloc_cpumask_var is not cheap.  Does reducing the number of IPIs
> offset the cost of calling into the allocator again? How often does it
> offset the cost and how often does it end up costing more? I guess that
> would heavily depend on the number of CPUs and how many of them have
> pages in their per-cpu buffer. Basically, sometimes we *might* save but
> it comes at a definite cost of calling into the page allocator again.
>

Good point and I totally agree it depends on the number of CPUs.

The thing is, if you are at CPUMASK_OFFSTACK=y, you are saying that you
optimize for the large-number-of-CPUs case, otherwise it doesn't make
sense - you can represent 32 CPUs in the space it takes to hold the
pointer to the cpumask (on a 32-bit system), etc.

If you are at CPUMASK_OFFSTACK=n you (almost) don't pay anything.

The way I see it, the use cases where you end up profiting from the code
are the same places where you also pay. Having lots of CPUs is what
forced you to use CPUMASK_OFFSTACK and pay that extra allocation, but
then it is exactly when you have lots of CPUs that the code pays off.

>
> The patch looks ok functionally but I'm skeptical that it really helps
> performance.


Thanks! It is good to hear it is not completely off the wall :-)

I think of it more as a CPU isolation feature than pure performance.
If you have a system with a few dozen CPUs (Tilera, SGI, Cavium
or the various virtual NUMA folks) you tend to want to break up the system
into sets of CPUs that work on separate tasks.

It is very annoying when a memory allocation failure in a task allocated to a
small set of 4 CPUs yanks all the rest of your 4,092 CPUs working on
something completely different out of their cache-warm happy existence into
the cache-cold reality of an IPI, or worse yet yanks all those CPUs from the
nirvana of power-saving idle C-states just to discover that no, they don't
actually have anything to do. :-)

I do believe the overhead for the cases without a lot of CPUs is quite minimal.

Thanks again for taking the time to review the patch and for reading
this long response.

> +     } else
> +             on_each_cpu(drain_local_pages, NULL, 1);
>  }
>
>  #ifdef CONFIG_HIBERNATION

I do have a new version of the patch with the logic of building the
cpumask abstracted into a service function, but otherwise the same.
I'll send it out if you figure the general approach is worthwhile.


--
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML


* Re: [PATCH v4 5/5] mm: Only IPI CPUs to drain local pages if they exist
  2011-12-25  9:39     ` Gilad Ben-Yossef
@ 2011-12-30 15:04       ` Mel Gorman
  2011-12-30 15:25         ` Chris Metcalf
  2011-12-30 20:16         ` Gilad Ben-Yossef
  0 siblings, 2 replies; 29+ messages in thread
From: Mel Gorman @ 2011-12-30 15:04 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: linux-kernel, Chris Metcalf, Peter Zijlstra, Frederic Weisbecker,
	Russell King, linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin,
	Rik van Riel, Andi Kleen

On Sun, Dec 25, 2011 at 11:39:59AM +0200, Gilad Ben-Yossef wrote:
> Hmm... I missed that. Sorry. I will fix it with the next iteration.
> >
> >
> > A greater concern is that we are calling zalloc_cpumask_var() from the
> > direct reclaim path when we are already under memory pressure. How often
> > is this path hit and how often does the allocation fail?
> 
> 
> Yes, this scenario worried me too. In fact, I worried that we might end
> up in an infinite loop of direct reclaim => cpumask var allocation =>
> direct reclaim. Luckily this can't happen.

Indeed.

> It did cause me to test the allocation failure case with the fault
> injection framework to make sure it fails gracefully.
> 
> I'll try to explain why I believe we end up succeeding in most cases:
> 
> For the CONFIG_CPUMASK_OFFSTACK=n case there is no allocation, so there
> is no problem.
> 

CONFIG_CPUMASK_OFFSTACK is force-enabled if CONFIG_MAXSMP is set on x86.
This may be the case for some server-orientated distributions. I know
SLES enables this option for x86-64 at least. Debian does not but
might in the future. I don't know about RHEL but it should be checked.
Either way, we cannot depend on CONFIG_CPUMASK_OFFSTACK being disabled
(it's enabled on my laptop for example due to the .config it is based
on). That said, breaking the link between MAXSMP and OFFSTACK may be
an option.

Stack usage may also be an issue. If NR_CPUS is set to 4096 then
this will allocate 512 bytes on the stack within the page allocator
which can be called from some heavy stack usage paths. However, as
we have already used direct reclaim and that is a heavy stack user,
it probably is not as serious a problem.

> For CONFIG_CPUMASK_OFFSTACK=y, when we get to drain_all_pages from the
> memory hotplug or the memory failure code path (the other code paths
> that call drain_all_pages), there is no inherent memory pressure, so we
> should be OK.
> 

It's the memory failure code path after direct reclaim failed. How
can you say there is no inherent memory pressure?

> If we did get to drain_all_pages from direct reclaim, but the cpumask
> slab has a free object (or partial pages in the case of SLUB), then we
> never hit the page allocator, so all is well, I believe.
> 

Unless the slab is empty or we are competing heavily with other
alloc_cpumask_var users. I would expect it is rare but that is not
the same as "never" hitting the page allocator. However, I accept
that it is likely to be rare.

> So this leaves us with being called from the direct reclaim path when
> the cpumask slab has no objects or partial pages and needs to hit the
> page allocator. If we hit direct page reclaim, the original allocation
> was not atomic, otherwise we would not have hit direct page reclaim.
> The cpumask allocation, however, is atomic, so we have broader
> allocation options - allocate high, allocate outside our cpuset, etc. -
> and there is a reasonable chance the cpumask allocation can succeed
> even if the original allocation ended up in direct reclaim.
> 

Ok, the atomic allocation can dip further into the PFMEMALLOC reserves
meaning that it may succeed an allocation even when an ordinarily
allocation would fail. However, this is applying more memory pressure
at a time we are already under memory pressure. In an extreme case,
allocating for the cpumask will be causing other allocations to fail
and stalling processes for longer. Considering that the processes are
already stalled for direct reclaim and the resulting allocation will
end up as a slab page, the additional delay is likely insignificant.

> So we end up failing to allocate the cpumask var only if the memory
> pressure is really a global system memory shortage, as opposed to, for
> example, some user space failure to page in some heap space in a
> cpuset. Even then, we fail gracefully.
> 

Ok, I accept that it will fail gracefully.

> 
> > Related to that, calling into the page allocator again for
> > zalloc_cpumask_var is not cheap.  Does reducing the number of IPIs
> > offset the cost of calling into the allocator again? How often does it
> > offset the cost and how often does it end up costing more? I guess that
> > would heavily depend on the number of CPUs and how many of them have
> > pages in their per-cpu buffer. Basically, sometimes we *might* save but
> > it comes at a definite cost of calling into the page allocator again.
> >
> 
> Good point and I totally agree it depends on the number of CPUs.
> 
> The thing is, if you are at CPUMASK_OFFSTACK=y, you are saying that you
> optimize for the large-number-of-CPUs case, otherwise it doesn't make
> sense - you can represent 32 CPUs in the space it takes to hold the
> pointer to the cpumask (on a 32-bit system), etc.
> 
> If you are at CPUMASK_OFFSTACK=n you (almost) don't pay anything.
> 

It's the CPUMASK_OFFSTACK=y case I worry about as it is enabled on
at least one server-orientated distribution and probably more.

> The way I see it, the use cases where you end up profiting from the code
> are the same places where you also pay. Having lots of CPUs is what
> forced you to use CPUMASK_OFFSTACK and pay that extra allocation, but
> then it is exactly when you have lots of CPUs that the code pays off.
> 
> >
> > The patch looks ok functionally but I'm skeptical that it really helps
> > performance.
> 
> 
> Thanks! It is good to hear it is not completely off the wall :-)
> 
> I think of it more as a CPU isolation feature than pure performance.
> If you have a system with a few dozen CPUs (Tilera, SGI, Cavium
> or the various virtual NUMA folks) you tend to want to break up the system
> into sets of CPUs that work on separate tasks.
> 

Even with the CPUs isolated, how often is it the case that many of
the CPUs have 0 pages on their per-cpu lists? I checked a bunch of
random machines just there and in every case all CPUs had at least
one page on their per-cpu list. In other words I believe that in
many cases the exact same number of IPIs will be used but with the
additional cost of allocating a cpumask.

> It is very annoying when a memory allocation failure in a task allocated to a
> small set of 4 CPUs yanks all the rest of your 4,092 CPUs working on
> something completely different out of their cache-warm happy existence into
> the cache-cold reality of an IPI, or worse yet yanks all those CPUs from the
> nirvana of power-saving idle C-states just to discover that no, they don't
> actually have anything to do. :-)
> 
> I do believe the overhead for the cases without a lot of CPUs is quite minimal.
> 

I'm still generally uncomfortable with the allocator allocating memory
while it is known memory is tight.

As a way of mitigating that, I would suggest this is done in two
passes. The first would check if at least 50% of the CPUs have no pages
on their per-cpu list. Then and only then allocate the per-cpu mask to
limit the IPIs. Use a separate patch that counts in /proc/vmstat how
many times the per-cpu mask was allocated as an approximate measure of
how often this logic really reduces the number of IPI calls in practice
and report that number with the patch - i.e. this patch reduces the
number of times IPIs are globally transmitted by X% for some workload.
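
To make the first pass concrete, something like the following check (a sketch
only; the helper name and the exact threshold are illustrative):

	/* true if at least half the online CPUs have empty per-cpu lists */
	static bool drain_mask_worthwhile(void)
	{
		struct per_cpu_pageset *pcp;
		struct zone *zone;
		int cpu, empty = 0;

		for_each_online_cpu(cpu) {
			bool has_pages = false;

			for_each_populated_zone(zone) {
				pcp = per_cpu_ptr(zone->pageset, cpu);
				if (pcp->pcp.count) {
					has_pages = true;
					break;
				}
			}
			if (!has_pages)
				empty++;
		}
		return empty * 2 >= num_online_cpus();
	}

with zalloc_cpumask_var() only attempted when this returns true, falling
back to plain on_each_cpu() otherwise.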

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v4 5/5] mm: Only IPI CPUs to drain local pages if they exist
  2011-12-30 15:04       ` Mel Gorman
@ 2011-12-30 15:25         ` Chris Metcalf
  2011-12-30 16:08           ` Mel Gorman
  2011-12-30 20:16         ` Gilad Ben-Yossef
  1 sibling, 1 reply; 29+ messages in thread
From: Chris Metcalf @ 2011-12-30 15:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Gilad Ben-Yossef, linux-kernel, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen

On 12/30/2011 10:04 AM, Mel Gorman wrote:
> On Sun, Dec 25, 2011 at 11:39:59AM +0200, Gilad Ben-Yossef wrote:
>> I think of it more as a CPU isolation feature than pure performance.
>> If you have a system with a few dozen CPUs (Tilera, SGI, Cavium
>> or the various virtual NUMA folks) you tend to want to break up the system
>> into sets of CPUs that work on separate tasks.
> Even with the CPUs isolated, how often is it the case that many of
> the CPUs have 0 pages on their per-cpu lists? I checked a bunch of
> random machines just there and in every case all CPUs had at least
> one page on their per-cpu list. In other words I believe that in
> many cases the exact same number of IPIs will be used but with the
> additional cost of allocating a cpumask.

Just to chime in here from the Tilera point of view, it is indeed important
to handle the case when all the CPUs have 0 pages in their per-cpu lists,
because it may be that the isolated CPUs are not running kernel code at
all, but instead running very jitter-sensitive user-space code doing direct
MMIO.  So it's worth the effort, in that scenario, to avoid perturbing the
cpus with no per-cpu pages.

And in general, as multi-core continues to scale up, it will be
increasingly worth the effort to avoid doing operations that hit every
single cpu "just in case".

> As a way of mitigating that, I would suggest this is done in two passes.
> The first would check if at least 50% of the CPUs have no pages on their
> per-cpu list. Then and only then allocate the per-cpu mask to limit the
> IPIs. Use a separate patch that counts in /proc/vmstat how many times the
> per-cpu mask was allocated as an approximate measure of how often this
> logic really reduces the number of IPI calls in practice and report that
> number with the patch - i.e. this patch reduces the number of times IPIs
> are globally transmitted by X% for some workload. 

It's really a question of how important it is to avoid the IPIs, even to a
relatively small number of cpus.  If only a few cpus are running
jitter-sensitive code, you may still want to take steps to avoid
interrupting them.  I do think the right strategy is to optimistically try
to allocate the cpumask, and only if it fails, send IPIs to every cpu.

Alternately, since we really don't want more than one cpu running the drain
code anyway, you could imagine using a static cpumask, along with a lock to
serialize attempts to drain all the pages.  (Locking here would be tricky,
since we need to run on_each_cpu with interrupts enabled, but there's
probably some reasonable way to make it work.)
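
Something along these lines, as an untested sketch (assumes the
on_each_cpu_mask() from patch 1/5 and hand-waves the per-cpu page walk
and cpu hotplug):

static cpumask_t pcp_drain_cpus;     /* no allocation needed */
static DEFINE_MUTEX(pcp_drain_lock); /* one drainer at a time */

void drain_all_pages(void)
{
    /*
     * Callers run with interrupts enabled, so a sleeping lock is
     * usable here and on_each_cpu_mask() can be called safely.
     */
    mutex_lock(&pcp_drain_lock);
    cpumask_clear(&pcp_drain_cpus);
    /* ... set a bit in pcp_drain_cpus for each cpu that has
       per-cpu pages, exactly as the patch already does ... */
    on_each_cpu_mask(&pcp_drain_cpus, drain_local_pages, NULL, 1);
    mutex_unlock(&pcp_drain_lock);
}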

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 5/5] mm: Only IPI CPUs to drain local pages if they exist
  2011-12-30 15:25         ` Chris Metcalf
@ 2011-12-30 16:08           ` Mel Gorman
  2011-12-30 20:29             ` Gilad Ben-Yossef
  0 siblings, 1 reply; 29+ messages in thread
From: Mel Gorman @ 2011-12-30 16:08 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben-Yossef, linux-kernel, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Pekka Enberg,
	Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen

On Fri, Dec 30, 2011 at 10:25:46AM -0500, Chris Metcalf wrote:
> On 12/30/2011 10:04 AM, Mel Gorman wrote:
> > On Sun, Dec 25, 2011 at 11:39:59AM +0200, Gilad Ben-Yossef wrote:
> >> I think of it more as a CPU isolation feature than pure performance.
> >> If you have a system with a couple of dozen CPUs (Tilera, SGI, Cavium
> >> or the various virtual NUMA folks) you tend to want to break up the system
> >> into sets of CPUs that work on separate tasks.
> > Even with the CPUs isolated, how often is it the case that many of
> > the CPUs have 0 pages on their per-cpu lists? I checked a bunch of
> > random machines just there and in every case all CPUs had at least
> > one page on their per-cpu list. In other words I believe that in
> > many cases the exact same number of IPIs will be used but with the
> > additional cost of allocating a cpumask.
> 
> Just to chime in here from the Tilera point of view, it is indeed important
> to handle the case when all the CPUs have 0 pages in their per-cpu lists,
> because it may be that the isolated CPUs are not running kernel code at
> all, but instead running very jitter-sensitive user-space code doing direct
> MMIO. 

This is far from being the common case.

> So it's worth the effort, in that scenario, to avoid perturbing the
> cpus with no per-cpu pages.
> 
> And in general, as multi-core continues to scale up, it will be
> increasingly worth the effort to avoid doing operations that hit every
> single cpu "just in case".
> 

I agree with the sentiment. My concern is that for the vast majority
of cases the per-cpu lists will contain at least 1 page, so the number
of IPIs is not reduced by this patch but the cost of an allocation
(be it stack or slab) is still incurred.

> > As a way of mitigating that, I would suggest this is done in two passes.
> > The first would check if at least 50% of the CPUs have no pages on their
> > per-cpu list. Then and only then allocate the per-cpu mask to limit the
> > IPIs. Use a separate patch that counts in /proc/vmstat how many times the
> > per-cpu mask was allocated as an approximate measure of how often this
> > logic really reduces the number of IPI calls in practice and report that
> > number with the patch - i.e. this patch reduces the number of times IPIs
> > are globally transmitted by X% for some workload. 
> 
> It's really a question of how important it is to avoid the IPIs, even to a
> relatively small number of cpus.  If only a few cpus are running
> jitter-sensitive code, you may still want to take steps to avoid
> interrupting them. 

Maybe, but as even one page results in an IPI, it's still not clear
to me that this patch really helps reduce IPIs.

> I do think the right strategy is to optimistically try
> to allocate the cpumask, and only if it fails, send IPIs to every cpu.
> 
> Alternately, since we really don't want more than one cpu running the drain
> code anyway, you could imagine using a static cpumask, along with a lock to
> serialize attempts to drain all the pages.  (Locking here would be tricky,
> since we need to run on_each_cpu with interrupts enabled, but there's
> probably some reasonable way to make it work.)
> 

Good suggestion, that would at least shut up my complaining
about allocation costs! A statically-declared mutex similar
to hugetlb_instantiation_mutex should do it. The context that
drain_all_pages is called from will have interrupts enabled.

Serialising processes entering direct reclaim may result in some stalls
but overall I think the impact of that would be less than increasing
memory pressure when low on memory.

It would still be nice to have some data on how much the IPI count is
reduced in practice to confirm the patch really helps.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 5/5] mm: Only IPI CPUs to drain local pages if they exist
  2011-12-30 15:04       ` Mel Gorman
  2011-12-30 15:25         ` Chris Metcalf
@ 2011-12-30 20:16         ` Gilad Ben-Yossef
  1 sibling, 0 replies; 29+ messages in thread
From: Gilad Ben-Yossef @ 2011-12-30 20:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, Chris Metcalf, Peter Zijlstra, Frederic Weisbecker,
	Russell King, linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin,
	Rik van Riel, Andi Kleen

On Fri, Dec 30, 2011 at 5:04 PM, Mel Gorman <mgorman@suse.de> wrote:
>
> On Sun, Dec 25, 2011 at 11:39:59AM +0200, Gilad Ben-Yossef wrote:
>
>
>
> CONFIG_CPUMASK_OFFSTACK is force enabled if CONFIG_MAXSMP on x86. This
> may be the case for some server-orientated distributions. I know
> SLES enables this option for x86-64 at least. Debian does not but
> might in the future. I don't know about RHEL but it should be checked.
> Either way, we cannot depend on CONFIG_CPUMASK_OFFSTACK being disabled
> (it's enabled on my laptop for example due to the .config it is based
> on). That said, breaking the link between MAXSMP and OFFSTACK may be
> an option.
>

Yes, I know, and it is enabled for RHEL as well, I believe.
The point is, MAXSMP is enabled in the enterprise distributions in
order to support massively multi-core systems. Reducing cross CPU
interference is important to these very systems.

In fact, since CONFIG_CPUMASK_OFFSTACK has a price on its own, the fact
that distros enable it (via MAXSMP) is proof in my eyes that the distros find
massively multi-core systems important :-)

That being said, the patch only has value if it actually reduces cross
CPU IPIs and does not incur a bigger price; otherwise, of course, it
should be dropped.


>
> > For CONFIG_CPUMASK_OFFSTACK=y, when we get to drain_all_pages from
> > the memory hotplug or the memory failure code path (the other code
> > paths that call drain_all_pages), there is no inherent memory
> > pressure, so we should be OK.
> >
>
> It's the memory failure code path after direct reclaim failed. How
> can you say there is no inherent memory pressure?
>
Bah.. you are right. Memory allocation will cause memory migration to
the remaining active memory areas, so yes, there is memory pressure.
Point taken. My bad.

>
> > The thing is, if you are at CPUMASK_OFFSTACK=y, you are saying
> > that you optimize for the large-number-of-CPUs case, otherwise it doesn't
> > make sense - you can represent 32 CPUs in the space it takes to
> > hold the pointer to the cpumask (on a 32-bit system) etc.
> >
> > If you are at CPUMASK_OFFSTACK=n you (almost) pay nothing.
> >
> <snip>


>
> It's the CPUMASK_OFFSTACK=y case I worry about as it is enabled on
> at least one server-orientated distribution and probably more.
>
Sure, because they care about performance (or even just plain working) on
massively multi-core systems - something this patch set aims to make work
better.



>
> > I think of it more as a CPU isolation feature than pure performance.
> > If you have a system with a couple of dozen CPUs (Tilera, SGI, Cavium
> > or the various virtual NUMA folks) you tend to want to break up the system
> > into sets of CPUs that work on separate tasks.
> >
>
> Even with the CPUs isolated, how often is it the case that many of
> the CPUs have 0 pages on their per-cpu lists? I checked a bunch of
> random machines just there and in every case all CPUs had at least
> one page on their per-cpu list. In other words I believe that in
> many cases the exact same number of IPIs will be used but with the
> additional cost of allocating a cpumask.
>

A common usage scenario on systems with lots of cores is to isolate
a group of cores and run an (almost) totally CPU bound task on each CPU
of the set. Those tasks rarely call into the kernel, they just crunch
numbers, and they end up with an empty per-cpu set more often than you
think.

But you are right that it is a specific use case. The question is what is the
cost in other use cases.

>
> <snip>


>
> I'm still generally uncomfortable with the allocator allocating memory
> while it is known memory is tight.
>

Got you.

>
> As a way of mitigating that, I would suggest this is done in two
> passes. The first would check if at least 50% of the CPUs have no pages
> on their per-cpu list. Then and only then allocate the per-cpu mask to
> limit the IPIs. Use a separate patch that counts in /proc/vmstat how
> many times the per-cpu mask was allocated as an approximate measure of
> how often this logic really reduces the number of IPI calls in practice
> and report that number with the patch - i.e. this patch reduces the
> number of times IPIs are globally transmitted by X% for some workload.
>
Great idea. I like it - and I guess the 50% could be configurable.
Will do and report.

Gilad
>
> --
> Mel Gorman
> SUSE Labs




--
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 5/5] mm: Only IPI CPUs to drain local pages if they exist
  2011-12-30 16:08           ` Mel Gorman
@ 2011-12-30 20:29             ` Gilad Ben-Yossef
  2012-01-01  8:03               ` Gilad Ben-Yossef
  0 siblings, 1 reply; 29+ messages in thread
From: Gilad Ben-Yossef @ 2011-12-30 20:29 UTC (permalink / raw)
  To: Mel Gorman, Chris Metcalf
  Cc: linux-kernel, Peter Zijlstra, Frederic Weisbecker, Russell King,
	linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen

On Fri, Dec 30, 2011 at 6:08 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Fri, Dec 30, 2011 at 10:25:46AM -0500, Chris Metcalf wrote:

>> Alternately, since we really don't want more than one cpu running the drain
>> code anyway, you could imagine using a static cpumask, along with a lock to
>> serialize attempts to drain all the pages.  (Locking here would be tricky,
>> since we need to run on_each_cpu with interrupts enabled, but there's
>> probably some reasonable way to make it work.)
>>
>
> Good suggestion, that would at least shut up my complaining
> about allocation costs! A statically-declared mutex similar
> to hugetlb_instantiation_mutex should do it. The context that
> drain_all_pages is called from will have interrupts enabled.
>
> Serialising processes entering direct reclaim may result in some stalls
> but overall I think the impact of that would be less than increasing
> memory pressure when low on memory.
>

Chris, I like the idea :-)

Actually, assuming for a second that on_each_cpu* and the underlying code
won't mind if the cpumask changes mid call (I know it does, just thinking out
loud), you could say you don't even need the lock if you're careful in how you
set/unset the per cpu bits of the cpumask, since they track the same thing...

Of course, it'll still cause a load of cache line bouncing, so maybe
it's not worth it.

> It would still be nice to have some data on how much the IPI count is
> reduced in practice to confirm the patch really helps.

I agree. I'll prepare the patch and will present the data.

Thanks!
Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 5/5] mm: Only IPI CPUs to drain local pages if they exist
  2011-12-30 20:29             ` Gilad Ben-Yossef
@ 2012-01-01  8:03               ` Gilad Ben-Yossef
  0 siblings, 0 replies; 29+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-01  8:03 UTC (permalink / raw)
  To: Mel Gorman, Chris Metcalf
  Cc: linux-kernel, Peter Zijlstra, Frederic Weisbecker, Russell King,
	linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel,
	Andi Kleen

On Fri, Dec 30, 2011 at 10:29 PM, Gilad Ben-Yossef <gilad@benyossef.com> wrote:
> On Fri, Dec 30, 2011 at 6:08 PM, Mel Gorman <mgorman@suse.de> wrote:
>> On Fri, Dec 30, 2011 at 10:25:46AM -0500, Chris Metcalf wrote:
>
>>> Alternately, since we really don't want more than one cpu running the drain
>>> code anyway, you could imagine using a static cpumask, along with a lock to
>>> serialize attempts to drain all the pages.  (Locking here would be tricky,
>>> since we need to run on_each_cpu with interrupts enabled, but there's
>>> probably some reasonable way to make it work.)
>>>
>>
>> Good suggestion, that would at least shut up my complaining
>> about allocation costs! A statically-declared mutex similar
>> to hugetlb_instantiation_mutex should do it. The context that
>> drain_all_pages is called from will have interrupts enabled.
>>
>> Serialising processes entering direct reclaim may result in some stalls
>> but overall I think the impact of that would be less than increasing
>> memory pressure when low on memory.
>>
>
> Chris, I like the idea :-)
>
> Actually, assuming for a second that on_each_cpu* and the underlying code
> won't mind if the cpumask changes mid call (I know it does, just thinking out
> loud), you could say you don't even need the lock if you're careful in how you
> set/unset the per cpu bits of the cpumask, since they track the same thing...

I took a look and smp_call_function_many is actually fine with the
passed cpumask getting changed in mid call.

I think this means we can make do with a single global cpumask without
any locking, and the cost becomes the allocation space for the single
cpumask plus the cache line bouncing from concurrent updates of the
cpumask when drain_all_pages races against itself on other cpus.

I'll spin a patch based on this idea.
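
Roughly like this (untested sketch; setting or clearing each cpu's bit
keeps the shared mask tracking reality even when two drains race):

/*
 * One static mask shared by all callers: cpumask bit ops are atomic
 * and smp_call_function_many copes with the mask changing under it,
 * so concurrent callers only cost us cache line bouncing.
 */
static cpumask_t cpus_with_pcps;

void drain_all_pages(void)
{
    int cpu;
    struct zone *zone;

    for_each_online_cpu(cpu) {
        bool has_pcps = false;

        for_each_populated_zone(zone) {
            if (per_cpu_ptr(zone->pageset, cpu)->pcp.count)
                has_pcps = true;
        }
        if (has_pcps)
            cpumask_set_cpu(cpu, &cpus_with_pcps);
        else
            cpumask_clear_cpu(cpu, &cpus_with_pcps);
    }
    on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1);
}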

Happy new year :-)
Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/5] slub: Only IPI CPUs that have per cpu obj to flush
  2011-11-23  6:23   ` Pekka Enberg
  2011-11-23  6:57     ` Pekka Enberg
@ 2012-01-01 12:41     ` Avi Kivity
  2012-01-01 16:12       ` Gilad Ben-Yossef
  1 sibling, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2012-01-01 12:41 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Gilad Ben-Yossef, linux-kernel, Chris Metcalf, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, apkm

On 11/23/2011 08:23 AM, Pekka Enberg wrote:
> On Tue, 22 Nov 2011, Gilad Ben-Yossef wrote:
>> static void flush_all(struct kmem_cache *s)
>> {
>> -    on_each_cpu(flush_cpu_slab, s, 1);
>> +    cpumask_var_t cpus;
>> +    struct kmem_cache_cpu *c;
>> +    int cpu;
>> +
>> +    if (likely(zalloc_cpumask_var(&cpus, GFP_ATOMIC))) {
>
> __GFP_NOWARN too maybe?
>
>> +        for_each_online_cpu(cpu) {
>> +            c = per_cpu_ptr(s->cpu_slab, cpu);
>> +            if (c->page)
>> +                cpumask_set_cpu(cpu, cpus);
>> +        }
>> +        on_each_cpu_mask(cpus, flush_cpu_slab, s, 1);
>> +        free_cpumask_var(cpus);
>> +    } else
>> +        on_each_cpu(flush_cpu_slab, s, 1);
>> }
>

Since this seems to be a common pattern, how about:

   zalloc_cpumask_var_or_all_online_cpus(&cpus, GFP_ATOMIC);
   ...
   free_cpumask_var(cpus);

The long-named function at the top of the block either returns a newly
allocated zeroed cpumask, or a static cpumask with all online cpus set. 
The code in the middle is only allowed to set bits in the cpumask
(should be the common usage).  free_cpumask_var() needs to check whether
the freed object is the static variable.
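
Rough sketch, for the CONFIG_CPUMASK_OFFSTACK=y case only (with =n the
allocation cannot fail, so nothing changes there); the checking free is
shown as a separate helper just for clarity:

static DECLARE_BITMAP(all_online_bits, NR_CPUS);

static inline void
zalloc_cpumask_var_or_all_online_cpus(cpumask_var_t *maskp, gfp_t flags)
{
    if (likely(zalloc_cpumask_var(maskp, flags)))
        return;
    /*
     * Fall back to the shared static mask.  Callers may only *set*
     * bits, so refreshing it from cpu_online_mask here is enough.
     */
    cpumask_copy(to_cpumask(all_online_bits), cpu_online_mask);
    *maskp = to_cpumask(all_online_bits);
}

static inline void free_cpumask_var_checked(cpumask_var_t mask)
{
    if (mask != to_cpumask(all_online_bits))
        free_cpumask_var(mask);
}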

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/5] slub: Only IPI CPUs that have per cpu obj to flush
  2012-01-01 12:41     ` Avi Kivity
@ 2012-01-01 16:12       ` Gilad Ben-Yossef
  2012-01-01 16:50         ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-01 16:12 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Pekka Enberg, linux-kernel, Chris Metcalf, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, apkm

On Sun, Jan 1, 2012 at 2:41 PM, Avi Kivity <avi@redhat.com> wrote:
> On 11/23/2011 08:23 AM, Pekka Enberg wrote:
>> On Tue, 22 Nov 2011, Gilad Ben-Yossef wrote:
>>> static void flush_all(struct kmem_cache *s)
>>> {
>>> -    on_each_cpu(flush_cpu_slab, s, 1);
>>> +    cpumask_var_t cpus;
>>> +    struct kmem_cache_cpu *c;
>>> +    int cpu;
>>> +
>>> +    if (likely(zalloc_cpumask_var(&cpus, GFP_ATOMIC))) {
>>
>> __GFP_NOWARN too maybe?
>>
>>> +        for_each_online_cpu(cpu) {
>>> +            c = per_cpu_ptr(s->cpu_slab, cpu);
>>> +            if (c->page)
>>> +                cpumask_set_cpu(cpu, cpus);
>>> +        }
>>> +        on_each_cpu_mask(cpus, flush_cpu_slab, s, 1);
>>> +        free_cpumask_var(cpus);
>>> +    } else
>>> +        on_each_cpu(flush_cpu_slab, s, 1);
>>> }
>>
>
> Since this seems to be a common pattern, how about:
>
>   zalloc_cpumask_var_or_all_online_cpus(&cpus, GFP_ATOMIC);
>   ...
>   free_cpumask_var(cpus);
>
> The long-named function at the top of the block either returns a newly
> allocated zeroed cpumask, or a static cpumask with all online cpus set.
> The code in the middle is only allowed to set bits in the cpumask
> (should be the common usage).  free_cpumask_var() needs to check whether
> the freed object is the static variable.

Thanks for the feedback and advice! I totally agree the repeating
pattern needs abstracting.

I ended up choosing to try a different abstraction though - basically a
wrapper, on_each_cpu_cond, that gets a predicate function to run per CPU
to build the mask to send the IPI to. It seems cleaner to me not having
to mess with free_cpumask_var, and it abstracts more of the general
pattern.
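
The shape of it is roughly this (untested sketch - the final signature
may still change; this version falls back to a global IPI when the mask
allocation fails):

/* Call func on each online cpu for which cond_func returns true. */
void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
                      smp_call_func_t func, void *info, bool wait)
{
    cpumask_var_t cpus;
    int cpu;

    if (likely(zalloc_cpumask_var(&cpus, GFP_ATOMIC | __GFP_NOWARN))) {
        for_each_online_cpu(cpu)
            if (cond_func(cpu, info))
                cpumask_set_cpu(cpu, cpus);
        on_each_cpu_mask(cpus, func, info, wait);
        free_cpumask_var(cpus);
    } else {
        /* No mask to be had - IPI everyone. */
        on_each_cpu(func, info, wait);
    }
}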

I intend to run the new code through some more testing tomorrow and send out V5
of the patch set. I'd be delighted if you can have a look through it then.

Thanks!
Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/5] slub: Only IPI CPUs that have per cpu obj to flush
  2012-01-01 16:12       ` Gilad Ben-Yossef
@ 2012-01-01 16:50         ` Avi Kivity
  2012-01-02 11:59           ` Gilad Ben-Yossef
  0 siblings, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2012-01-01 16:50 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Pekka Enberg, linux-kernel, Chris Metcalf, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, apkm

On 01/01/2012 06:12 PM, Gilad Ben-Yossef wrote:
> >
> > Since this seems to be a common pattern, how about:
> >
> >   zalloc_cpumask_var_or_all_online_cpus(&cpus, GFP_ATOMIC);
> >   ...
> >   free_cpumask_var(cpus);
> >
> > The long-named function at the top of the block either returns a newly
> > allocated zeroed cpumask, or a static cpumask with all online cpus set.
> > The code in the middle is only allowed to set bits in the cpumask
> > (should be the common usage).  free_cpumask_var() needs to check whether
> > the freed object is the static variable.
>
> Thanks for the feedback and advice! I totally agree the repeating
> pattern needs abstracting.
>
> I ended up choosing to try a different abstraction though - basically a wrapper
> on_each_cpu_cond that gets a predicate function to run per CPU to
> build the mask
> to send the IPI to. It seems cleaner to me not having to mess with
> free_cpumask_var
> and it abstracts more of the general pattern.
>

This converts the algorithm to O(NR_CPUS) from a potentially lower
complexity algorithm.  Also, the existing algorithm may not like to be
driven by cpu number.  Both are true for kvm.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/5] slub: Only IPI CPUs that have per cpu obj to flush
  2012-01-01 16:50         ` Avi Kivity
@ 2012-01-02 11:59           ` Gilad Ben-Yossef
  2012-01-02 13:30             ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-02 11:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Pekka Enberg, linux-kernel, Chris Metcalf, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, apkm

On Sun, Jan 1, 2012 at 6:50 PM, Avi Kivity <avi@redhat.com> wrote:
> On 01/01/2012 06:12 PM, Gilad Ben-Yossef wrote:
>> >
>> > Since this seems to be a common pattern, how about:
>> >
>> >   zalloc_cpumask_var_or_all_online_cpus(&cpus, GFP_ATOMIC);
>> >   ...
>> >   free_cpumask_var(cpus);
>> >
>> > The long-named function at the top of the block either returns a newly
>> > allocated zeroed cpumask, or a static cpumask with all online cpus set.
>> > The code in the middle is only allowed to set bits in the cpumask
>> > (should be the common usage).  free_cpumask_var() needs to check whether
>> > the freed object is the static variable.
>>
>> Thanks for the feedback and advice! I totally agree the repeating
>> pattern needs abstracting.
>>
>> I ended up choosing to try a different abstraction though - basically a wrapper
>> on_each_cpu_cond that gets a predicate function to run per CPU to
>> build the mask
>> to send the IPI to. It seems cleaner to me not having to mess with
>> free_cpumask_var
>> and it abstracts more of the general pattern.
>>
>
> This converts the algorithm to O(NR_CPUS) from a potentially lower
> complexity algorithm.  Also, the existing algorithm may not like to be
> driven by cpu number.  Both are true for kvm.
>

Right, I was only thinking of my own uses, which are O(NR_CPUS) by nature.

I wonder if it would be better to create a safe_cpumask_var type with
its own alloc, free, and set_cpu functions but no clear_cpu function, so
that the compiler will catch cases of trying to clear bits off of such
a cpumask?

It seems safer and also makes handling the free function easier.
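
Something like this hypothetical sketch (OFFSTACK=y flavour, all names
invented here):

/*
 * A cpumask whose API deliberately has no clear_cpu operation, so a
 * static "all online cpus" fallback can never lose bits.
 */
typedef struct {
    struct cpumask *mask;
    bool is_static;
} safe_cpumask_var_t;

static DECLARE_BITMAP(safe_all_online, NR_CPUS);

static inline void zalloc_safe_cpumask_var(safe_cpumask_var_t *v, gfp_t flags)
{
    v->mask = kzalloc(cpumask_size(), flags);
    v->is_static = !v->mask;
    if (v->is_static) {
        cpumask_copy(to_cpumask(safe_all_online), cpu_online_mask);
        v->mask = to_cpumask(safe_all_online);
    }
}

static inline void safe_cpumask_set_cpu(int cpu, safe_cpumask_var_t *v)
{
    cpumask_set_cpu(cpu, v->mask);
}

/* No safe_cpumask_clear_cpu() - clearing bits must not compile. */

static inline void free_safe_cpumask_var(safe_cpumask_var_t *v)
{
    if (!v->is_static)
        kfree(v->mask);
}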

Does that make sense or am I over-engineering it? :-)

Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/5] slub: Only IPI CPUs that have per cpu obj to flush
  2012-01-02 11:59           ` Gilad Ben-Yossef
@ 2012-01-02 13:30             ` Avi Kivity
  2012-01-08 16:13               ` Gilad Ben-Yossef
  0 siblings, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2012-01-02 13:30 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Pekka Enberg, linux-kernel, Chris Metcalf, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, apkm

On 01/02/2012 01:59 PM, Gilad Ben-Yossef wrote:
> On Sun, Jan 1, 2012 at 6:50 PM, Avi Kivity <avi@redhat.com> wrote:
> > On 01/01/2012 06:12 PM, Gilad Ben-Yossef wrote:
> >> >
> >> > Since this seems to be a common pattern, how about:
> >> >
> >> >   zalloc_cpumask_var_or_all_online_cpus(&cpus, GFP_ATOMIC);
> >> >   ...
> >> >   free_cpumask_var(cpus);
> >> >
> >> > The long-named function at the top of the block either returns a newly
> >> > allocated zeroed cpumask, or a static cpumask with all online cpus set.
> >> > The code in the middle is only allowed to set bits in the cpumask
> >> > (should be the common usage).  free_cpumask_var() needs to check whether
> >> > the freed object is the static variable.
> >>
> >> Thanks for the feedback and advice! I totally agree the repeating
> >> pattern needs abstracting.
> >>
> >> I ended up choosing to try a different abstraction though - basically a wrapper
> >> on_each_cpu_cond that gets a predicate function to run per CPU to
> >> build the mask
> >> to send the IPI to. It seems cleaner to me not having to mess with
> >> free_cpumask_var
> >> and it abstracts more of the general pattern.
> >>
> >
> > This converts the algorithm to O(NR_CPUS) from a potentially lower
> > complexity algorithm.  Also, the existing algorithm may not like to be
> > driven by cpu number.  Both are true for kvm.
> >
>
> Right, I was only thinking of my own uses, which are O(NR_CPUS) by nature.
>
> I wonder if it would be better to create a safe_cpumask_var type with
> its own alloc, free, and set_cpu functions but no clear_cpu function, so
> that the compiler will catch cases of trying to clear bits off of such
> a cpumask?
>
> It seems safer and also makes handling the free function easier.
>
> Does that make sense or am I over-engineering it? :-)

It makes sense.  Depends on the number of call sites, really.  If there
are several, consolidation helps, also makes it easier to further refactor.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/5] slub: Only IPI CPUs that have per cpu obj to flush
  2012-01-02 13:30             ` Avi Kivity
@ 2012-01-08 16:13               ` Gilad Ben-Yossef
  2012-01-08 16:15                 ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-08 16:13 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Pekka Enberg, linux-kernel, Chris Metcalf, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, apkm

On Mon, Jan 2, 2012 at 3:30 PM, Avi Kivity <avi@redhat.com> wrote:
> On 01/02/2012 01:59 PM, Gilad Ben-Yossef wrote:
>> On Sun, Jan 1, 2012 at 6:50 PM, Avi Kivity <avi@redhat.com> wrote:
>> > On 01/01/2012 06:12 PM, Gilad Ben-Yossef wrote:
>> >> >
>> >> > Since this seems to be a common pattern, how about:
>> >> >
>> >> >   zalloc_cpumask_var_or_all_online_cpus(&cpus, GFP_ATOMIC);
>> >> >   ...
>> >> >   free_cpumask_var(cpus);
>> >> >
>> >> > The long-named function at the top of the block either returns a newly
>> >> > allocated zeroed cpumask, or a static cpumask with all online cpus set.
>> >> > The code in the middle is only allowed to set bits in the cpumask
>> >> > (should be the common usage).  free_cpumask_var() needs to check whether
>> >> > the freed object is the static variable.
>> >>
>> >> Thanks for the feedback and advice! I totally agree the repeating
>> >> pattern needs abstracting.
>> >>
>> >> I ended up choosing to try a different abstraction though - basically a wrapper
>> >> on_each_cpu_cond that gets a predicate function to run per CPU to
>> >> build the mask
>> >> to send the IPI to. It seems cleaner to me not having to mess with
>> >> free_cpumask_var
>> >> and it abstracts more of the general pattern.
>> >>
>> >
>> > This converts the algorithm to O(NR_CPUS) from a potentially lower
>> > complexity algorithm.  Also, the existing algorithm may not like to be
>> > driven by cpu number.  Both are true for kvm.
>> >
>>
>> Right, I was only thinking of my own uses, which are O(NR_CPUS) by nature.
>>
>> I wonder if it would be better to create a safe_cpumask_var type with
>> its own alloc, free, and set_cpu functions but no clear_cpu function, so
>> that the compiler will catch cases of trying to clear bits off of such
>> a cpumask?
>>
>> It seems safer and also makes handling the free function easier.
>>
>> Does that make sense or am I over-engineering it? :-)
>
> It makes sense.  Depends on the number of call sites, really.  If there
> are several, consolidation helps, also makes it easier to further refactor.

As Andrew pointed out, code that usually calls just the CPU you wanted
but sometimes (under memory pressure) might call other CPUs is a problem
waiting to happen, so I took his advice and refactored to use
smp_call_function_single to IPI each CPU individually in case of alloc failure.
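
So the alloc-failure path of on_each_cpu_cond now looks roughly like
this (sketch; the helper name is mine, in the real patch it is just the
else branch):

static void ipi_cpus_one_by_one(bool (*cond_func)(int cpu, void *info),
                                smp_call_func_t func, void *info, bool wait)
{
    int cpu;

    /* IPI each selected cpu individually instead of broadcasting */
    for_each_online_cpu(cpu) {
        if (cond_func(cpu, info)) {
            preempt_disable();
            smp_call_function_single(cpu, func, info, wait);
            preempt_enable();
        }
    }
}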

I don't know if that applies to the KVM use case. I'm guessing it
probably doesn't from what you wrote here.

Cheers,
Gilad

>
> --
> error compiling committee.c: too many arguments to function
>



-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/5] slub: Only IPI CPUs that have per cpu obj to flush
  2012-01-08 16:13               ` Gilad Ben-Yossef
@ 2012-01-08 16:15                 ` Avi Kivity
  0 siblings, 0 replies; 29+ messages in thread
From: Avi Kivity @ 2012-01-08 16:15 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Pekka Enberg, linux-kernel, Chris Metcalf, Peter Zijlstra,
	Frederic Weisbecker, Russell King, linux-mm, Matt Mackall,
	Sasha Levin, Rik van Riel, Andi Kleen, apkm

On 01/08/2012 06:13 PM, Gilad Ben-Yossef wrote:
> On Mon, Jan 2, 2012 at 3:30 PM, Avi Kivity <avi@redhat.com> wrote:
> > On 01/02/2012 01:59 PM, Gilad Ben-Yossef wrote:
> >> On Sun, Jan 1, 2012 at 6:50 PM, Avi Kivity <avi@redhat.com> wrote:
> >> > On 01/01/2012 06:12 PM, Gilad Ben-Yossef wrote:
> >> >> >
> >> >> > Since this seems to be a common pattern, how about:
> >> >> >
> >> >> >   zalloc_cpumask_var_or_all_online_cpus(&cpus, GFP_ATOMIC);
> >> >> >   ...
> >> >> >   free_cpumask_var(cpus);
> >> >> >
> >> >> > The long-named function at the top of the block either returns a newly
> >> >> > allocated zeroed cpumask, or a static cpumask with all online cpus set.
> >> >> > The code in the middle is only allowed to set bits in the cpumask
> >> >> > (should be the common usage).  free_cpumask_var() needs to check whether
> >> >> > the freed object is the static variable.
> >> >>
> >> >> Thanks for the feedback and advice! I totally agree the repeating
> >> >> pattern needs abstracting.
> >> >>
> >> >> I ended up choosing to try a different abstraction though - basically a wrapper
> >> >> on_each_cpu_cond that gets a predicate function to run per CPU to
> >> >> build the mask
> >> >> to send the IPI to. It seems cleaner to me not having to mess with
> >> >> free_cpumask_var
> >> >> and it abstracts more of the general pattern.
> >> >>
> >> >
> >> > This converts the algorithm to O(NR_CPUS) from a potentially lower
> >> > complexity algorithm.  Also, the existing algorithm may not like to be
> >> > driven by cpu number.  Both are true for kvm.
> >> >
> >>
> >> Right, I was only thinking of my own uses, which are O(NR_CPUS) by nature.
> >>
> >> I wonder if it would be better to create a safe_cpumask_var type with
> >> its own alloc, free, and set_cpu functions but no clear_cpu function, so
> >> that the compiler will catch cases of trying to clear bits off of such
> >> a cpumask?
> >>
> >> It seems safer and also makes handling the free function easier.
> >>
> >> Does that make sense or am I over-engineering it? :-)
> >
> > It makes sense.  Depends on the number of call sites, really.  If there
> > are several, consolidation helps, also makes it easier to further refactor.
>
> As Andrew pointed out, code that usually calls just the CPU you wanted
> but sometimes (under memory pressure) might call other CPUs is a problem
> waiting to happen, so I took his advice and refactored to use
> smp_call_function_single to IPI each CPU individually in case of alloc failure.
>
> I don't know if that applies to the KVM use case. I'm guessing it
> probably doesn't from what you wrote here.

No, it doesn't.  But note it isn't the case you're covering anyway.  kvm
already only IPIs relevant cpus, it just suffers from that ugly
allocation failure case.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2012-01-08 16:15 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-22 11:08 [PATCH v4 0/5] Reduce cross CPU IPI interference Gilad Ben-Yossef
2011-11-22 11:08 ` [PATCH v4 1/5] smp: Introduce a generic on_each_cpu_mask function Gilad Ben-Yossef
2011-11-22 11:08 ` [PATCH v4 2/5] arm: Move arm over to generic on_each_cpu_mask Gilad Ben-Yossef
2011-11-22 21:00   ` Russell King - ARM Linux
2011-11-23  6:47     ` Gilad Ben-Yossef
2011-11-22 11:08 ` [PATCH v4 3/5] tile: Move tile to use " Gilad Ben-Yossef
2011-11-22 11:08 ` [PATCH v4 4/5] slub: Only IPI CPUs that have per cpu obj to flush Gilad Ben-Yossef
2011-11-23  6:23   ` Pekka Enberg
2011-11-23  6:57     ` Pekka Enberg
2011-11-23  7:52       ` Gilad Ben-Yossef
2012-01-01 12:41     ` Avi Kivity
2012-01-01 16:12       ` Gilad Ben-Yossef
2012-01-01 16:50         ` Avi Kivity
2012-01-02 11:59           ` Gilad Ben-Yossef
2012-01-02 13:30             ` Avi Kivity
2012-01-08 16:13               ` Gilad Ben-Yossef
2012-01-08 16:15                 ` Avi Kivity
2011-11-22 11:08 ` [PATCH v4 5/5] mm: Only IPI CPUs to drain local pages if they exist Gilad Ben-Yossef
2011-11-23  7:45   ` Pekka Enberg
2011-12-23 10:28   ` Mel Gorman
2011-12-25  9:39     ` Gilad Ben-Yossef
2011-12-30 15:04       ` Mel Gorman
2011-12-30 15:25         ` Chris Metcalf
2011-12-30 16:08           ` Mel Gorman
2011-12-30 20:29             ` Gilad Ben-Yossef
2012-01-01  8:03               ` Gilad Ben-Yossef
2011-12-30 20:16         ` Gilad Ben-Yossef
2011-11-23  1:36 ` [PATCH v4 0/5] Reduce cross CPU IPI interference Chris Metcalf
2011-11-23  6:52   ` Gilad Ben-Yossef
