* [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
From: Mel Gorman @ 2009-02-24 12:16 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

Still a work in progress, but enough has changed that I want to show what it
currently looks like. Performance is still improved a little, but some large
pieces of low-hanging fruit are outstanding:

1. Improving free_pcppages_bulk(); it does a lot of looping and could maybe be done better
2. gfp_zone() is still using a cache line for data. I wasn't able to translate
   Kamezawa-san's suggestion into usable code

The following two items should be picked up in a second or third pass at
improving the page allocator

1. Working out whether knowing if pages are cold or hot on free is worth it
   or not
2. Precalculating zonelists for cpusets (Andi described how it could be done;
   it's straightforward and will just take time, but it doesn't affect the
   majority of users)

Changes since V1
  o Remove the ifdef CONFIG_CPUSETS from inside get_page_from_freelist()
  o Use non-lock bit operations for clearing the mlock flag
  o Factor out alloc_flags calculation so it is only done once (Peter)
  o Make gfp.h a bit prettier and clear-cut (Peter)
  o Instead of deleting a debugging check, replace page_count() in the
    free path with a version that does not check for compound pages (Nick)
  o Drop the alteration for hot/cold page freeing until we know if it
    helps or not

The complexity of the page allocator has been increasing for some time
and it has now reached the point where the SLUB allocator is doing strange
tricks to avoid the page allocator. This is obviously bad as it may encourage
other subsystems to try avoiding the page allocator as well.

This series of patches is intended to reduce the cost of the page
allocator by doing the following.

Patches 1-3 iron out the entry paths slightly and remove stupid sanity
checks from the fast path.

Patch 4 uses a lookup table instead of a number of branches to decide what
zones are usable given the GFP flags.

Patch 5 tidies up some flags

Patch 6 avoids repeated checks of the zonelist

Patch 7 breaks the allocator up into a fast and slow path where the fast
path later becomes one long inlined function.

Patches 8-12 avoid calculating the same things repeatedly and instead
calculate them once.

Patches 13-14 inline parts of the allocator fast path

Patch 15 avoids calling get_pageblock_migratetype() potentially twice on
every page free (a rough sketch of this compute-once idea follows the patch
summaries).

Patch 16 reduces the number of times interrupts are disabled by reworking
what free_page_mlock() does and not using locked versions of bit operations.

Patch 17 avoids using the zonelist cache on non-NUMA machines

Patch 18 simplifies some debugging checks made during alloc and free.

Patch 19 avoids a list search in the allocator fast path.
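
As an illustration of the pattern behind patches 8-12 and 15, below is a
rough, self-contained sketch of the compute-once idea: the expensive lookup
is done a single time at the entry point and the result is passed down to
the helpers, instead of each helper re-deriving it. Everything in it (the
fake page struct, lookup_migratetype() and the helper names) is made up for
the example and is not the kernel code.

#include <stdio.h>

/* Illustrative stand-ins only; these are not the kernel's types. */
struct fake_page {
	unsigned long pfn;
};

/* Pretend this walks the pageblock bitmap, i.e. it is not free. */
static int lookup_migratetype(const struct fake_page *page)
{
	printf("expensive lookup for pfn %lu\n", page->pfn);
	return (int)(page->pfn % 3);
}

/* Before: the helper re-derives the migratetype itself. */
static void free_one_before(struct fake_page *page)
{
	int mt = lookup_migratetype(page);	/* second lookup per free */
	printf("freeing pfn %lu to list %d\n", page->pfn, mt);
}

/* After: the caller determines the migratetype once and passes it down. */
static void free_one_after(struct fake_page *page, int mt)
{
	printf("freeing pfn %lu to list %d\n", page->pfn, mt);
}

int main(void)
{
	struct fake_page page = { .pfn = 42 };
	int mt;

	/* Old shape: the lookup runs in the entry point and again in the helper. */
	mt = lookup_migratetype(&page);
	free_one_before(&page);

	/* New shape: one lookup, the result is handed to the helper. */
	mt = lookup_migratetype(&page);
	free_one_after(&page, mt);
	return 0;
}

The allocation-side patches 8-12 follow the same shape: values that stay
constant for the duration of an allocation are worked out once up front
instead of on every iteration.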

Running all of these through a profiler shows me the cost of page allocation
and freeing is reduced by a nice amount without drastically altering how the
allocator actually works. Excluding the cost of zeroing pages, the cost of
allocation is reduced by 25% and the cost of freeing by 12%.  Again excluding
zeroing a page, much of the remaining cost is due to counters, debugging
checks and interrupt disabling.  Of course when a page has to be zeroed,
the dominant cost of a page allocation is zeroing it.

These patches reduce the text size of the kernel by 180 bytes on the one
x86-64 machine I checked.

Range of results (positive is good) on 7 machines that completed tests.

o Kernbench elapsed time	-0.04%	to	0.79%
o Kernbench system time		0%	to	3.74%
o tbench			-2.85%	to	5.52%
o Hackbench-sockets		all differences within noise
o Hackbench-pipes		-2.98%	to	9.11%
o Sysbench			-0.04%	to	5.50%

With hackbench-pipes, only 2 machines out of 7 showed results outside of
the noise. In almost all cases the standard deviation between runs of
hackbench-pipes was reduced with the patches.

I still haven't run a page-allocator micro-benchmark to see what sort of
figures it gives.

 arch/ia64/hp/common/sba_iommu.c   |    2 
 arch/ia64/kernel/mca.c            |    3 
 arch/ia64/kernel/uncached.c       |    3 
 arch/ia64/sn/pci/pci_dma.c        |    3 
 arch/powerpc/platforms/cell/ras.c |    2 
 arch/x86/kvm/vmx.c                |    2 
 drivers/misc/sgi-gru/grufile.c    |    2 
 drivers/misc/sgi-xp/xpc_uv.c      |    2 
 include/linux/cpuset.h            |    2 
 include/linux/gfp.h               |   62 +--
 include/linux/mm.h                |    1 
 include/linux/mmzone.h            |    8 
 init/main.c                       |    1 
 kernel/profile.c                  |    8 
 mm/filemap.c                      |    2 
 mm/hugetlb.c                      |    4 
 mm/internal.h                     |   11 
 mm/mempolicy.c                    |    2 
 mm/migrate.c                      |    2 
 mm/page_alloc.c                   |  642 +++++++++++++++++++++++++-------------
 mm/slab.c                         |    4 
 mm/slob.c                         |    4 
 mm/vmalloc.c                      |    1 
 23 files changed, 490 insertions(+), 283 deletions(-)


* [PATCH 01/19] Replace __alloc_pages_internal() with __alloc_pages_nodemask()
From: Mel Gorman @ 2009-02-24 12:16 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

__alloc_pages_internal() is the core page allocator function, but it is
essentially an alias of __alloc_pages_nodemask(). Naming a publicly
available and exported function "internal" is also a bit ugly. This
patch renames __alloc_pages_internal() to __alloc_pages_nodemask() and
deletes the old nodemask wrapper.

Warning - This patch renames an exported symbol. No in-kernel driver is
affected, but external drivers calling __alloc_pages_internal() should
change the call to __alloc_pages_nodemask() without any alteration of
parameters.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/gfp.h |   12 ++----------
 mm/page_alloc.c     |    4 ++--
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index dd20cd7..dcf0ab8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -168,24 +168,16 @@ static inline void arch_alloc_page(struct page *page, int order) { }
 #endif
 
 struct page *
-__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		       struct zonelist *zonelist, nodemask_t *nodemask);
 
 static inline struct page *
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
 		struct zonelist *zonelist)
 {
-	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+	return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
 }
 
-static inline struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist, nodemask_t *nodemask)
-{
-	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
-}
-
-
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5675b30..61051d5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1464,7 +1464,7 @@ try_next_zone:
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page *
-__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
@@ -1670,7 +1670,7 @@ nopage:
 got_pg:
 	return page;
 }
-EXPORT_SYMBOL(__alloc_pages_internal);
+EXPORT_SYMBOL(__alloc_pages_nodemask);
 
 /*
  * Common helper functions.
-- 
1.5.6.5



* [PATCH 02/19] Do not sanity check order in the fast path
From: Mel Gorman @ 2009-02-24 12:16 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

No user of the allocator API should be passing in an order >= MAX_ORDER
but we check for it on each and every allocation. Delete this check and
make it a VM_BUG_ON check further down the call path.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/gfp.h |    6 ------
 mm/page_alloc.c     |    2 ++
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index dcf0ab8..8736047 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -181,9 +181,6 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
-	if (unlikely(order >= MAX_ORDER))
-		return NULL;
-
 	/* Unknown node is current node */
 	if (nid < 0)
 		nid = numa_node_id();
@@ -197,9 +194,6 @@ extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
 static inline struct page *
 alloc_pages(gfp_t gfp_mask, unsigned int order)
 {
-	if (unlikely(order >= MAX_ORDER))
-		return NULL;
-
 	return alloc_pages_current(gfp_mask, order);
 }
 extern struct page *alloc_page_vma(gfp_t gfp_mask,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 61051d5..c3842f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1407,6 +1407,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 
 	classzone_idx = zone_idx(preferred_zone);
 
+	VM_BUG_ON(order >= MAX_ORDER);
+
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
-- 
1.5.6.5



* [PATCH 03/19] Do not check NUMA node ID when the caller knows the node is valid
From: Mel Gorman @ 2009-02-24 12:16 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

Callers of alloc_pages_node() can optionally specify -1 as a node to mean
"allocate from the current node". However, a number of the callers in fast
paths know for a fact their node is valid. To avoid a comparison and branch,
this patch adds alloc_pages_exact_node() that only checks the nid with
VM_BUG_ON(). Callers that know their node is valid are then converted.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 arch/ia64/hp/common/sba_iommu.c   |    2 +-
 arch/ia64/kernel/mca.c            |    3 +--
 arch/ia64/kernel/uncached.c       |    3 ++-
 arch/ia64/sn/pci/pci_dma.c        |    3 ++-
 arch/powerpc/platforms/cell/ras.c |    2 +-
 arch/x86/kvm/vmx.c                |    2 +-
 drivers/misc/sgi-gru/grufile.c    |    2 +-
 drivers/misc/sgi-xp/xpc_uv.c      |    2 +-
 include/linux/gfp.h               |    9 +++++++++
 include/linux/mm.h                |    1 -
 kernel/profile.c                  |    8 ++++----
 mm/filemap.c                      |    2 +-
 mm/hugetlb.c                      |    4 ++--
 mm/mempolicy.c                    |    2 +-
 mm/migrate.c                      |    2 +-
 mm/slab.c                         |    4 ++--
 mm/slob.c                         |    4 ++--
 mm/vmalloc.c                      |    1 -
 18 files changed, 32 insertions(+), 24 deletions(-)

diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 6d5e6c5..66a3257 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -1116,7 +1116,7 @@ sba_alloc_coherent (struct device *dev, size_t size, dma_addr_t *dma_handle, gfp
 #ifdef CONFIG_NUMA
 	{
 		struct page *page;
-		page = alloc_pages_node(ioc->node == MAX_NUMNODES ?
+		page = alloc_pages_exact_node(ioc->node == MAX_NUMNODES ?
 		                        numa_node_id() : ioc->node, flags,
 		                        get_order(size));
 
diff --git a/arch/ia64/kernel/mca.c b/arch/ia64/kernel/mca.c
index bab1de2..2e614bd 100644
--- a/arch/ia64/kernel/mca.c
+++ b/arch/ia64/kernel/mca.c
@@ -1829,8 +1829,7 @@ ia64_mca_cpu_init(void *cpu_data)
 			data = mca_bootmem();
 			first_time = 0;
 		} else
-			data = page_address(alloc_pages_node(numa_node_id(),
-					GFP_KERNEL, get_order(sz)));
+			data = __get_free_pages(GFP_KERNEL, get_order(sz));
 		if (!data)
 			panic("Could not allocate MCA memory for cpu %d\n",
 					cpu);
diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index 8eff8c1..6ba72ab 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -98,7 +98,8 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)
 
 	/* attempt to allocate a granule's worth of cached memory pages */
 
-	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+	page = alloc_pages_exact_node(nid,
+				GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				IA64_GRANULE_SHIFT-PAGE_SHIFT);
 	if (!page) {
 		mutex_unlock(&uc_pool->add_chunk_mutex);
diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c
index 863f501..2aa52de 100644
--- a/arch/ia64/sn/pci/pci_dma.c
+++ b/arch/ia64/sn/pci/pci_dma.c
@@ -91,7 +91,8 @@ void *sn_dma_alloc_coherent(struct device *dev, size_t size,
 	 */
 	node = pcibus_to_node(pdev->bus);
 	if (likely(node >=0)) {
-		struct page *p = alloc_pages_node(node, flags, get_order(size));
+		struct page *p = alloc_pages_exact_node(node,
+						flags, get_order(size));
 
 		if (likely(p))
 			cpuaddr = page_address(p);
diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c
index 5f961c4..16ba671 100644
--- a/arch/powerpc/platforms/cell/ras.c
+++ b/arch/powerpc/platforms/cell/ras.c
@@ -122,7 +122,7 @@ static int __init cbe_ptcal_enable_on_node(int nid, int order)
 
 	area->nid = nid;
 	area->order = order;
-	area->pages = alloc_pages_node(area->nid, GFP_KERNEL, area->order);
+	area->pages = alloc_pages_exact_node(area->nid, GFP_KERNEL, area->order);
 
 	if (!area->pages)
 		goto out_free_area;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7611af5..cca119a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1244,7 +1244,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
 	struct page *pages;
 	struct vmcs *vmcs;
 
-	pages = alloc_pages_node(node, GFP_KERNEL, vmcs_config.order);
+	pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
 	if (!pages)
 		return NULL;
 	vmcs = page_address(pages);
diff --git a/drivers/misc/sgi-gru/grufile.c b/drivers/misc/sgi-gru/grufile.c
index 6509838..52d4160 100644
--- a/drivers/misc/sgi-gru/grufile.c
+++ b/drivers/misc/sgi-gru/grufile.c
@@ -309,7 +309,7 @@ static int gru_init_tables(unsigned long gru_base_paddr, void *gru_base_vaddr)
 		pnode = uv_node_to_pnode(nid);
 		if (gru_base[bid])
 			continue;
-		page = alloc_pages_node(nid, GFP_KERNEL, order);
+		page = alloc_pages_exact_node(nid, GFP_KERNEL, order);
 		if (!page)
 			goto fail;
 		gru_base[bid] = page_address(page);
diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
index 29c0502..0563350 100644
--- a/drivers/misc/sgi-xp/xpc_uv.c
+++ b/drivers/misc/sgi-xp/xpc_uv.c
@@ -184,7 +184,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
 	mq->mmr_blade = uv_cpu_to_blade_id(cpu);
 
 	nid = cpu_to_node(cpu);
-	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+	page = alloc_pages_exact_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				pg_order);
 	if (page == NULL) {
 		dev_err(xpc_part, "xpc_create_gru_mq_uv() failed to alloc %d "
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 8736047..59eb093 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -4,6 +4,7 @@
 #include <linux/mmzone.h>
 #include <linux/stddef.h>
 #include <linux/linkage.h>
+#include <linux/mmdebug.h>
 
 struct vm_area_struct;
 
@@ -188,6 +189,14 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
 
+static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
+						unsigned int order)
+{
+	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+
+	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+}
+
 #ifdef CONFIG_NUMA
 extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7dc04ff..954e945 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -7,7 +7,6 @@
 
 #include <linux/gfp.h>
 #include <linux/list.h>
-#include <linux/mmdebug.h>
 #include <linux/mmzone.h>
 #include <linux/rbtree.h>
 #include <linux/prio_tree.h>
diff --git a/kernel/profile.c b/kernel/profile.c
index 7724e04..62e08db 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -371,7 +371,7 @@ static int __cpuinit profile_cpu_callback(struct notifier_block *info,
 		node = cpu_to_node(cpu);
 		per_cpu(cpu_profile_flip, cpu) = 0;
 		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
-			page = alloc_pages_node(node,
+			page = alloc_pages_exact_node(node,
 					GFP_KERNEL | __GFP_ZERO,
 					0);
 			if (!page)
@@ -379,7 +379,7 @@ static int __cpuinit profile_cpu_callback(struct notifier_block *info,
 			per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
 		}
 		if (!per_cpu(cpu_profile_hits, cpu)[0]) {
-			page = alloc_pages_node(node,
+			page = alloc_pages_exact_node(node,
 					GFP_KERNEL | __GFP_ZERO,
 					0);
 			if (!page)
@@ -570,14 +570,14 @@ static int create_hash_tables(void)
 		int node = cpu_to_node(cpu);
 		struct page *page;
 
-		page = alloc_pages_node(node,
+		page = alloc_pages_exact_node(node,
 				GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				0);
 		if (!page)
 			goto out_cleanup;
 		per_cpu(cpu_profile_hits, cpu)[1]
 				= (struct profile_hit *)page_address(page);
-		page = alloc_pages_node(node,
+		page = alloc_pages_exact_node(node,
 				GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				0);
 		if (!page)
diff --git a/mm/filemap.c b/mm/filemap.c
index 23acefe..2523d95 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -519,7 +519,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
 {
 	if (cpuset_do_page_mem_spread()) {
 		int n = cpuset_mem_spread_node();
-		return alloc_pages_node(n, gfp, 0);
+		return alloc_pages_exact_node(n, gfp, 0);
 	}
 	return alloc_pages(gfp, 0);
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 107da3d..1e99997 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -630,7 +630,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 	if (h->order >= MAX_ORDER)
 		return NULL;
 
-	page = alloc_pages_node(nid,
+	page = alloc_pages_exact_node(nid,
 		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
 						__GFP_REPEAT|__GFP_NOWARN,
 		huge_page_order(h));
@@ -649,7 +649,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
  * Use a helper variable to find the next node and then
  * copy it back to hugetlb_next_nid afterwards:
  * otherwise there's a window in which a racer might
- * pass invalid nid MAX_NUMNODES to alloc_pages_node.
+ * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
  * But we don't need to use a spin_lock here: it really
  * doesn't matter if occasionally a racer chooses the
  * same nid as we do.  Move nid forward in the mask even
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3eb4a6f..341fbca 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -767,7 +767,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
 
 static struct page *new_node_page(struct page *page, unsigned long node, int **x)
 {
-	return alloc_pages_node(node, GFP_HIGHUSER_MOVABLE, 0);
+	return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE, 0);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
index a9eff3f..6bda9c2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -802,7 +802,7 @@ static struct page *new_page_node(struct page *p, unsigned long private,
 
 	*result = &pm->status;
 
-	return alloc_pages_node(pm->node,
+	return alloc_pages_exact_node(pm->node,
 				GFP_HIGHUSER_MOVABLE | GFP_THISNODE, 0);
 }
 
diff --git a/mm/slab.c b/mm/slab.c
index 4d00855..e7f1ded 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1680,7 +1680,7 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		flags |= __GFP_RECLAIMABLE;
 
-	page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+	page = alloc_pages_exact_node(nodeid, flags, cachep->gfporder);
 	if (!page)
 		return NULL;
 
@@ -3210,7 +3210,7 @@ retry:
 		if (local_flags & __GFP_WAIT)
 			local_irq_enable();
 		kmem_flagcheck(cache, flags);
-		obj = kmem_getpages(cache, local_flags, -1);
+		obj = kmem_getpages(cache, local_flags, numa_node_id());
 		if (local_flags & __GFP_WAIT)
 			local_irq_disable();
 		if (obj) {
diff --git a/mm/slob.c b/mm/slob.c
index 52bc8a2..d646a4c 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -46,7 +46,7 @@
  * NUMA support in SLOB is fairly simplistic, pushing most of the real
  * logic down to the page allocator, and simply doing the node accounting
  * on the upper levels. In the event that a node id is explicitly
- * provided, alloc_pages_node() with the specified node id is used
+ * provided, alloc_pages_exact_node() with the specified node id is used
  * instead. The common case (or when the node id isn't explicitly provided)
  * will default to the current node, as per numa_node_id().
  *
@@ -236,7 +236,7 @@ static void *slob_new_page(gfp_t gfp, int order, int node)
 
 #ifdef CONFIG_NUMA
 	if (node != -1)
-		page = alloc_pages_node(node, gfp, order);
+		page = alloc_pages_exact_node(node, gfp, order);
 	else
 #endif
 		page = alloc_pages(gfp, order);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 75f49d3..76fab2d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1322,7 +1322,6 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 			page = alloc_page(gfp_mask);
 		else
 			page = alloc_pages_node(node, gfp_mask, 0);
-
 		if (unlikely(!page)) {
 			/* Successfully allocated i pages, free them in __vunmap() */
 			area->nr_pages = i;
-- 
1.5.6.5



* [PATCH 04/19] Convert gfp_zone() to use a table of precalculated values
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

Every page allocation uses gfp_zone() to calculate what the highest zone
allowed by a combination of GFP flags is. This is a large number of branches
to have in a fast path. This patch replaces the branches with a lookup
table that is calculated at boot-time and stored in the read-mostly section
so it can be shared. This requires __GFP_MOVABLE to be redefined, but it is
debatable whether it should be considered a zone modifier or not.
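
As a rough, standalone illustration of the idea (the real change is in the
hunks below, and the zone list here is simplified to a plain enum): the four
zone-modifier bits form a small index, every possible combination is
resolved to a zone once at start-up, and the fast path becomes a single
masked array load.

#include <stdio.h>

/* Zone modifier bits as laid out by this patch. */
#define __GFP_DMA	0x01u
#define __GFP_HIGHMEM	0x02u
#define __GFP_DMA32	0x04u
#define __GFP_MOVABLE	0x08u
#define GFP_ZONEMASK	(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32 | __GFP_MOVABLE)

enum zone { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };
static const char *zone_name[] = { "DMA", "DMA32", "NORMAL", "HIGHMEM", "MOVABLE" };

static enum zone zone_table[GFP_ZONEMASK + 1];

/* The decision tree that currently runs on every allocation. */
static enum zone flags_to_zone(unsigned int flags)
{
	if (flags & __GFP_DMA)
		return ZONE_DMA;
	if (flags & __GFP_DMA32)
		return ZONE_DMA32;
	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
			(__GFP_HIGHMEM | __GFP_MOVABLE))
		return ZONE_MOVABLE;
	if (flags & __GFP_HIGHMEM)
		return ZONE_HIGHMEM;
	return ZONE_NORMAL;
}

int main(void)
{
	unsigned int flags;

	/* Boot-time: run the branches once per combination... */
	for (flags = 0; flags <= GFP_ZONEMASK; flags++)
		zone_table[flags] = flags_to_zone(flags);

	/* ...fast path: index the table with the zone bits of the request. */
	flags = __GFP_HIGHMEM | __GFP_MOVABLE;	/* zone bits of GFP_HIGHUSER_MOVABLE */
	printf("index 0x%02x -> ZONE_%s\n", flags & GFP_ZONEMASK,
	       zone_name[zone_table[flags & GFP_ZONEMASK]]);
	return 0;
}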

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/gfp.h |   28 +++++++++++-----------------
 init/main.c         |    1 +
 mm/page_alloc.c     |   36 +++++++++++++++++++++++++++++++++++-
 3 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 59eb093..581f8a9 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -16,6 +16,10 @@ struct vm_area_struct;
  * Do not put any conditional on these. If necessary modify the definitions
  * without the underscores and use the consistently. The definitions here may
  * be used in bit comparisons.
+ *
+ * Note that __GFP_MOVABLE uses the next available bit but it is not
+ * a zone modifier. It uses the fourth bit so that the calculation of
+ * gfp_zone() can use a table rather than a series of comparisons
  */
 #define __GFP_DMA	((__force gfp_t)0x01u)
 #define __GFP_HIGHMEM	((__force gfp_t)0x02u)
@@ -50,7 +54,7 @@ struct vm_area_struct;
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
 #define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
-#define __GFP_MOVABLE	((__force gfp_t)0x100000u)  /* Page is movable */
+#define __GFP_MOVABLE	((__force gfp_t)0x08u)  /* Page is movable */
 
 #define __GFP_BITS_SHIFT 21	/* Room for 21 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -77,6 +81,9 @@ struct vm_area_struct;
 #define GFP_THISNODE	((__force gfp_t)0)
 #endif
 
+/* This is a mask of all modifiers affecting gfp_zonemask() */
+#define GFP_ZONEMASK (__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32 | __GFP_MOVABLE)
+
 /* This mask makes up all the page movable related flags */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
 
@@ -112,24 +119,11 @@ static inline int allocflags_to_migratetype(gfp_t gfp_flags)
 		((gfp_flags & __GFP_RECLAIMABLE) != 0);
 }
 
+extern int gfp_zone_table[GFP_ZONEMASK];
+void init_gfp_zone_table(void);
 static inline enum zone_type gfp_zone(gfp_t flags)
 {
-#ifdef CONFIG_ZONE_DMA
-	if (flags & __GFP_DMA)
-		return ZONE_DMA;
-#endif
-#ifdef CONFIG_ZONE_DMA32
-	if (flags & __GFP_DMA32)
-		return ZONE_DMA32;
-#endif
-	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
-			(__GFP_HIGHMEM | __GFP_MOVABLE))
-		return ZONE_MOVABLE;
-#ifdef CONFIG_HIGHMEM
-	if (flags & __GFP_HIGHMEM)
-		return ZONE_HIGHMEM;
-#endif
-	return ZONE_NORMAL;
+	return gfp_zone_table[flags & GFP_ZONEMASK];
 }
 
 /*
diff --git a/init/main.c b/init/main.c
index 8442094..08a5663 100644
--- a/init/main.c
+++ b/init/main.c
@@ -573,6 +573,7 @@ asmlinkage void __init start_kernel(void)
 	 * fragile until we cpu_idle() for the first time.
 	 */
 	preempt_disable();
+	init_gfp_zone_table();
 	build_all_zonelists();
 	page_alloc_init();
 	printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3842f8..7cc4932 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -70,6 +70,7 @@ EXPORT_SYMBOL(node_states);
 unsigned long totalram_pages __read_mostly;
 unsigned long totalreserve_pages __read_mostly;
 unsigned long highest_memmap_pfn __read_mostly;
+int gfp_zone_table[GFP_ZONEMASK] __read_mostly;
 int percpu_pagelist_fraction;
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
@@ -4373,7 +4374,7 @@ static void setup_per_zone_inactive_ratio(void)
  * 8192MB:	11584k
  * 16384MB:	16384k
  */
-static int __init init_per_zone_pages_min(void)
+static int init_per_zone_pages_min(void)
 {
 	unsigned long lowmem_kbytes;
 
@@ -4391,6 +4392,39 @@ static int __init init_per_zone_pages_min(void)
 }
 module_init(init_per_zone_pages_min)
 
+static inline int __init gfp_flags_to_zone(gfp_t flags)
+{
+#ifdef CONFIG_ZONE_DMA
+	if (flags & __GFP_DMA)
+		return ZONE_DMA;
+#endif
+#ifdef CONFIG_ZONE_DMA32
+	if (flags & __GFP_DMA32)
+		return ZONE_DMA32;
+#endif
+	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
+			(__GFP_HIGHMEM | __GFP_MOVABLE))
+		return ZONE_MOVABLE;
+#ifdef CONFIG_HIGHMEM
+	if (flags & __GFP_HIGHMEM)
+		return ZONE_HIGHMEM;
+#endif
+	return ZONE_NORMAL;
+}
+
+/*
+ * For each possible combination of zone modifier flags, we calculate
+ * what zone it should be using. This consumes a cache line in most
+ * cases but avoids a number of branches in the allocator fast path
+ */
+void __init init_gfp_zone_table(void)
+{
+	gfp_t gfp_flags;
+
+	for (gfp_flags = 0; gfp_flags < GFP_ZONEMASK; gfp_flags++)
+		gfp_zone_table[gfp_flags] = gfp_flags_to_zone(gfp_flags);
+}
+
 /*
  * min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so 
  *	that we can call two helper functions whenever min_free_kbytes
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 04/19] Convert gfp_zone() to use a table of precalculated values
@ 2009-02-24 12:17   ` Mel Gorman
  0 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

Every page allocation uses gfp_zone() to calcuate what the highest zone
allowed by a combination of GFP flags is. This is a large number of branches
to have in a fast path. This patch replaces the branches with a lookup
table that is calculated at boot-time and stored in the read-mostly section
so it can be shared. This requires __GFP_MOVABLE to be redefined but it's
debatable as to whether it should be considered a zone modifier or not.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/gfp.h |   28 +++++++++++-----------------
 init/main.c         |    1 +
 mm/page_alloc.c     |   36 +++++++++++++++++++++++++++++++++++-
 3 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 59eb093..581f8a9 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -16,6 +16,10 @@ struct vm_area_struct;
  * Do not put any conditional on these. If necessary modify the definitions
  * without the underscores and use the consistently. The definitions here may
  * be used in bit comparisons.
+ *
+ * Note that __GFP_MOVABLE uses the next available bit but it is not
+ * a zone modifier. It uses the fourth bit so that the calculation of
+ * gfp_zone() can use a table rather than a series of comparisons
  */
 #define __GFP_DMA	((__force gfp_t)0x01u)
 #define __GFP_HIGHMEM	((__force gfp_t)0x02u)
@@ -50,7 +54,7 @@ struct vm_area_struct;
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
 #define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
-#define __GFP_MOVABLE	((__force gfp_t)0x100000u)  /* Page is movable */
+#define __GFP_MOVABLE	((__force gfp_t)0x08u)  /* Page is movable */
 
 #define __GFP_BITS_SHIFT 21	/* Room for 21 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -77,6 +81,9 @@ struct vm_area_struct;
 #define GFP_THISNODE	((__force gfp_t)0)
 #endif
 
+/* This is a mask of all modifiers affecting gfp_zonemask() */
+#define GFP_ZONEMASK (__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32 | __GFP_MOVABLE)
+
 /* This mask makes up all the page movable related flags */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
 
@@ -112,24 +119,11 @@ static inline int allocflags_to_migratetype(gfp_t gfp_flags)
 		((gfp_flags & __GFP_RECLAIMABLE) != 0);
 }
 
+extern int gfp_zone_table[GFP_ZONEMASK];
+void init_gfp_zone_table(void);
 static inline enum zone_type gfp_zone(gfp_t flags)
 {
-#ifdef CONFIG_ZONE_DMA
-	if (flags & __GFP_DMA)
-		return ZONE_DMA;
-#endif
-#ifdef CONFIG_ZONE_DMA32
-	if (flags & __GFP_DMA32)
-		return ZONE_DMA32;
-#endif
-	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
-			(__GFP_HIGHMEM | __GFP_MOVABLE))
-		return ZONE_MOVABLE;
-#ifdef CONFIG_HIGHMEM
-	if (flags & __GFP_HIGHMEM)
-		return ZONE_HIGHMEM;
-#endif
-	return ZONE_NORMAL;
+	return gfp_zone_table[flags & GFP_ZONEMASK];
 }
 
 /*
diff --git a/init/main.c b/init/main.c
index 8442094..08a5663 100644
--- a/init/main.c
+++ b/init/main.c
@@ -573,6 +573,7 @@ asmlinkage void __init start_kernel(void)
 	 * fragile until we cpu_idle() for the first time.
 	 */
 	preempt_disable();
+	init_gfp_zone_table();
 	build_all_zonelists();
 	page_alloc_init();
 	printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3842f8..7cc4932 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -70,6 +70,7 @@ EXPORT_SYMBOL(node_states);
 unsigned long totalram_pages __read_mostly;
 unsigned long totalreserve_pages __read_mostly;
 unsigned long highest_memmap_pfn __read_mostly;
+int gfp_zone_table[GFP_ZONEMASK] __read_mostly;
 int percpu_pagelist_fraction;
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
@@ -4373,7 +4374,7 @@ static void setup_per_zone_inactive_ratio(void)
  * 8192MB:	11584k
  * 16384MB:	16384k
  */
-static int __init init_per_zone_pages_min(void)
+static int init_per_zone_pages_min(void)
 {
 	unsigned long lowmem_kbytes;
 
@@ -4391,6 +4392,39 @@ static int __init init_per_zone_pages_min(void)
 }
 module_init(init_per_zone_pages_min)
 
+static inline int __init gfp_flags_to_zone(gfp_t flags)
+{
+#ifdef CONFIG_ZONE_DMA
+	if (flags & __GFP_DMA)
+		return ZONE_DMA;
+#endif
+#ifdef CONFIG_ZONE_DMA32
+	if (flags & __GFP_DMA32)
+		return ZONE_DMA32;
+#endif
+	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
+			(__GFP_HIGHMEM | __GFP_MOVABLE))
+		return ZONE_MOVABLE;
+#ifdef CONFIG_HIGHMEM
+	if (flags & __GFP_HIGHMEM)
+		return ZONE_HIGHMEM;
+#endif
+	return ZONE_NORMAL;
+}
+
+/*
+ * For each possible combination of zone modifier flags, we calculate
+ * what zone it should be using. This consumes a cache line in most
+ * cases but avoids a number of branches in the allocator fast path
+ */
+void __init init_gfp_zone_table(void)
+{
+	gfp_t gfp_flags;
+
+	for (gfp_flags = 0; gfp_flags <= GFP_ZONEMASK; gfp_flags++)
+		gfp_zone_table[gfp_flags] = gfp_flags_to_zone(gfp_flags);
+}
+
 /*
  * min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so 
  *	that we can call two helper functions whenever min_free_kbytes
-- 
1.5.6.5
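
As a rough standalone illustration of the table approach, the sketch below
builds the same kind of lookup in userspace. It is not the kernel code: it
assumes a configuration where ZONE_DMA, ZONE_DMA32 and ZONE_HIGHMEM all
exist, and it sizes the table GFP_ZONEMASK + 1 so that every value of
(flags & GFP_ZONEMASK) has an entry.

/*
 * Userspace sketch of the table-based gfp_zone() lookup. Names mirror
 * the kernel ones but this is an illustration only, assuming DMA,
 * DMA32 and HIGHMEM are all configured in.
 */
#include <stdio.h>

#define __GFP_DMA	0x01u
#define __GFP_HIGHMEM	0x02u
#define __GFP_DMA32	0x04u
#define __GFP_MOVABLE	0x08u	/* fourth bit, so the modifiers pack into one nibble */
#define GFP_ZONEMASK	(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32 | __GFP_MOVABLE)

enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

/* One entry for every possible combination of zone modifiers */
static enum zone_type gfp_zone_table[GFP_ZONEMASK + 1];

/* The branchy version, only used once at init time to fill the table */
static enum zone_type gfp_flags_to_zone(unsigned int flags)
{
	if (flags & __GFP_DMA)
		return ZONE_DMA;
	if (flags & __GFP_DMA32)
		return ZONE_DMA32;
	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
			(__GFP_HIGHMEM | __GFP_MOVABLE))
		return ZONE_MOVABLE;
	if (flags & __GFP_HIGHMEM)
		return ZONE_HIGHMEM;
	return ZONE_NORMAL;
}

static void init_gfp_zone_table(void)
{
	unsigned int flags;

	for (flags = 0; flags <= GFP_ZONEMASK; flags++)
		gfp_zone_table[flags] = gfp_flags_to_zone(flags);
}

/* Fast-path lookup: one mask and one table load, no branches */
static enum zone_type gfp_zone(unsigned int flags)
{
	return gfp_zone_table[flags & GFP_ZONEMASK];
}

int main(void)
{
	init_gfp_zone_table();
	printf("HIGHMEM|MOVABLE -> zone %d (ZONE_MOVABLE is %d)\n",
	       gfp_zone(__GFP_HIGHMEM | __GFP_MOVABLE), ZONE_MOVABLE);
	printf("no modifier     -> zone %d (ZONE_NORMAL is %d)\n",
	       gfp_zone(0), ZONE_NORMAL);
	return 0;
}

Running it shows __GFP_HIGHMEM|__GFP_MOVABLE resolving to ZONE_MOVABLE with
nothing more than a mask and a table load in gfp_zone().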


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 05/19] Re-sort GFP flags and fix whitespace alignment for easier reading.
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-24 12:17   ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

Re-sort the GFP flags after __GFP_MOVABLE got redefined so that how the
bits are used is a bit clearer.

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/gfp.h |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 581f8a9..8f7d176 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -25,6 +25,8 @@ struct vm_area_struct;
 #define __GFP_HIGHMEM	((__force gfp_t)0x02u)
 #define __GFP_DMA32	((__force gfp_t)0x04u)
 
+#define __GFP_MOVABLE	((__force gfp_t)0x08u)  /* Page is movable */
+
 /*
  * Action modifiers - doesn't change the zoning
  *
@@ -50,11 +52,10 @@ struct vm_area_struct;
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* See above */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
-#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
-#define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
-#define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_NOMEMALLOC  ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
+#define __GFP_HARDWALL    ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_THISNODE	  ((__force gfp_t)0x40000u) /* No fallback, no policies */
 #define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
-#define __GFP_MOVABLE	((__force gfp_t)0x08u)  /* Page is movable */
 
 #define __GFP_BITS_SHIFT 21	/* Room for 21 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
-- 
1.5.6.5
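
For anyone who wants to convince themselves that the re-sorted layout keeps
all four zone modifiers in the low nibble (which is what the gfp_zone()
table indexing relies on), a throwaway check along the following lines can
be compiled. BUILD_CHECK is a local helper for this sketch, not a kernel API.

/* Standalone check that the zone modifiers occupy bits 0-3 only. */
#include <stdio.h>

#define __GFP_DMA	0x01u
#define __GFP_HIGHMEM	0x02u
#define __GFP_DMA32	0x04u
#define __GFP_MOVABLE	0x08u
#define GFP_ZONEMASK	(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32 | __GFP_MOVABLE)

/* Fails to compile if cond is false (negative array size trick) */
#define BUILD_CHECK(cond) ((void)sizeof(char[1 - 2 * !(cond)]))

int main(void)
{
	BUILD_CHECK(GFP_ZONEMASK == 0x0fu);		/* contiguous low nibble */
	BUILD_CHECK((__GFP_MOVABLE & ~0x0fu) == 0);	/* nothing above bit 3 */
	printf("GFP_ZONEMASK = 0x%02x\n", GFP_ZONEMASK);
	return 0;
}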


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 06/19] Check only once if the zonelist is suitable for the allocation
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-24 12:17   ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

It is possible with __GFP_THISNODE that no zones are suitable. This
patch makes sure the check is only made once.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7cc4932..99fd538 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1487,9 +1487,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (should_fail_alloc_page(gfp_mask, order))
 		return NULL;
 
-restart:
-	z = zonelist->_zonerefs;  /* the list of zones suitable for gfp_mask */
-
+	/* the list of zones suitable for gfp_mask */
+	z = zonelist->_zonerefs;
 	if (unlikely(!z->zone)) {
 		/*
 		 * Happens if we have an empty zonelist as a result of
@@ -1498,6 +1497,7 @@ restart:
 		return NULL;
 	}
 
+restart:
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 07/19] Break up the allocator entry point into fast and slow paths
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-24 12:17   ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

The core of the page allocator is one giant function which allocates
memory on the stack and makes calculations that may not be needed for every
allocation. This patch breaks up the allocator path into fast and slow paths.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |  345 ++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 216 insertions(+), 129 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 99fd538..503d692 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1463,45 +1463,170 @@ try_next_zone:
 	return page;
 }
 
-/*
- * This is the 'heart' of the zoned buddy allocator.
- */
+int
+should_alloc_retry(gfp_t gfp_mask, unsigned int order,
+				unsigned long pages_reclaimed)
+{
+	/* Do not loop if specifically requested */
+	if (gfp_mask & __GFP_NORETRY)
+		return 0;
+	
+	/*
+	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
+	 * means __GFP_NOFAIL, but that may not be true in other
+	 * implementations.
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return 1;
+
+	/*
+	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
+	 * specified, then we retry until we no longer reclaim any pages
+	 * (above), or we've reclaimed an order of pages at least as
+	 * large as the allocation's order. In both cases, if the
+	 * allocation still fails, we stop retrying.
+	 */
+	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
+		return 1;
+
+	/*
+	 * Don't let big-order allocations loop unless the caller
+	 * explicitly requests that. 
+	 */
+	if (gfp_mask & __GFP_NOFAIL)
+		return 1;
+
+	return 0;
+}
+
 struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-			struct zonelist *zonelist, nodemask_t *nodemask)
+__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask)
 {
-	const gfp_t wait = gfp_mask & __GFP_WAIT;
-	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
-	struct zoneref *z;
-	struct zone *zone;
 	struct page *page;
+
+	/* Acquire the OOM killer lock for the zones in zonelist */
+	if (!try_set_zone_oom(zonelist, gfp_mask)) {
+		schedule_timeout_uninterruptible(1);
+		return NULL;
+	}
+
+	/*
+	 * Go through the zonelist yet one more time, keep very high watermark
+	 * here, this is only to catch a parallel oom killing, we must fail if
+	 * we're still under heavy pressure.
+	 */
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+		order, zonelist, high_zoneidx,
+		ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+	if (page)
+		goto out;
+
+	/* The OOM killer will not help higher order allocs */
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		goto out;
+
+	/* Exhausted what can be done so it's blamo time */
+	out_of_memory(zonelist, gfp_mask, order);
+
+out:
+	clear_zonelist_oom(zonelist, gfp_mask);
+	return page;
+}
+
+/* The really slow allocator path where we enter direct reclaim */
+struct page *
+__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress)
+{
+	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
-	int do_retry;
-	int alloc_flags;
-	unsigned long did_some_progress;
-	unsigned long pages_reclaimed = 0;
 
-	might_sleep_if(wait);
+	cond_resched();
 
-	if (should_fail_alloc_page(gfp_mask, order))
-		return NULL;
+	/* We now go into synchronous reclaim */
+	cpuset_memory_pressure_bump();
 
-	/* the list of zones suitable for gfp_mask */
-	z = zonelist->_zonerefs;
-	if (unlikely(!z->zone)) {
-		/*
-		 * Happens if we have an empty zonelist as a result of
-		 * GFP_THISNODE being used on a memoryless node
-		 */
-		return NULL;
-	}
+	/*
+	 * The task's cpuset might have expanded its set of allowable nodes
+	 */
+	cpuset_update_task_memory_state();
+	p->flags |= PF_MEMALLOC;
+	reclaim_state.reclaimed_slab = 0;
+	p->reclaim_state = &reclaim_state;
 
-restart:
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
-	if (page)
-		goto got_pg;
+	*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
+
+	p->reclaim_state = NULL;
+	p->flags &= ~PF_MEMALLOC;
+
+	cond_resched();
+
+	if (order != 0)
+		drain_all_pages();
+
+	if (likely(*did_some_progress))
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
+					zonelist, high_zoneidx, alloc_flags);
+	return page;
+}
+
+static inline int is_allocation_high_priority(struct task_struct *p,
+							gfp_t gfp_mask)
+{
+	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+			&& !in_interrupt())
+		if (!(gfp_mask & __GFP_NOMEMALLOC))
+			return 1;
+	return 0;
+}
+
+/*
+ * This is called in the allocator slow-path if the allocation request is of
+ * sufficient urgency to ignore watermarks and take other desperate measures
+ */
+struct page *
+__alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask)
+{
+	struct page *page;
+
+	do {
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
+			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
+
+		if (!page && gfp_mask & __GFP_NOFAIL)
+			congestion_wait(WRITE, HZ/50);
+	} while (!page && (gfp_mask & __GFP_NOFAIL));
+
+	return page;
+}
+
+static inline
+void wake_all_kswapd(unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx)
+{
+	struct zoneref *z;
+	struct zone *zone;
+
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+		wakeup_kswapd(zone, order);
+}
+
+static struct page * noinline
+__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask)
+{
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	struct page *page = NULL;
+	int alloc_flags;
+	unsigned long pages_reclaimed = 0;
+	unsigned long did_some_progress;
+	struct task_struct *p = current;
 
 	/*
 	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
@@ -1514,8 +1639,7 @@ restart:
 	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
 		goto nopage;
 
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		wakeup_kswapd(zone, order);
+	wake_all_kswapd(order, zonelist, high_zoneidx);
 
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
@@ -1535,6 +1659,7 @@ restart:
 	if (wait)
 		alloc_flags |= ALLOC_CPUSET;
 
+restart:
 	/*
 	 * Go through the zonelist again. Let __GFP_HIGH and allocations
 	 * coming from realtime tasks go deeper into reserves.
@@ -1548,118 +1673,47 @@ restart:
 	if (page)
 		goto got_pg;
 
-	/* This allocation should allow future memory freeing. */
-
-rebalance:
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
-		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
-nofail_alloc:
-			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, nodemask, order,
-				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
-			if (page)
-				goto got_pg;
-			if (gfp_mask & __GFP_NOFAIL) {
-				congestion_wait(WRITE, HZ/50);
-				goto nofail_alloc;
-			}
-		}
-		goto nopage;
-	}
+	/* Allocate without watermarks if the context allows */
+	if (is_allocation_high_priority(p, gfp_mask))
+		page = __alloc_pages_high_priority(gfp_mask, order,
+			zonelist, high_zoneidx, nodemask);
+	if (page)
+		goto got_pg;
 
 	/* Atomic allocations - we can't balance anything */
 	if (!wait)
 		goto nopage;
 
-	cond_resched();
+	/* Try direct reclaim and then allocating */
+	page = __alloc_pages_direct_reclaim(gfp_mask, order,
+					zonelist, high_zoneidx,
+					nodemask,
+					alloc_flags, &did_some_progress);
+	if (page)
+		goto got_pg;
 
-	/* We now go into synchronous reclaim */
-	cpuset_memory_pressure_bump();
 	/*
-	 * The task's cpuset might have expanded its set of allowable nodes
+	 * If we failed to make any progress reclaiming, then we are
+	 * running out of options and have to consider going OOM
 	 */
-	cpuset_update_task_memory_state();
-	p->flags |= PF_MEMALLOC;
-	reclaim_state.reclaimed_slab = 0;
-	p->reclaim_state = &reclaim_state;
-
-	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
-
-	p->reclaim_state = NULL;
-	p->flags &= ~PF_MEMALLOC;
-
-	cond_resched();
-
-	if (order != 0)
-		drain_all_pages();
+	if (!did_some_progress) {
+		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
+			page = __alloc_pages_may_oom(gfp_mask, order,
+					zonelist, high_zoneidx,
+					nodemask);
+			if (page)
+				goto got_pg;
 
-	if (likely(did_some_progress)) {
-		page = get_page_from_freelist(gfp_mask, nodemask, order,
-					zonelist, high_zoneidx, alloc_flags);
-		if (page)
-			goto got_pg;
-	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
-		if (!try_set_zone_oom(zonelist, gfp_mask)) {
-			schedule_timeout_uninterruptible(1);
 			goto restart;
 		}
-
-		/*
-		 * Go through the zonelist yet one more time, keep
-		 * very high watermark here, this is only to catch
-		 * a parallel oom killing, we must fail if we're still
-		 * under heavy pressure.
-		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
-			order, zonelist, high_zoneidx,
-			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
-		if (page) {
-			clear_zonelist_oom(zonelist, gfp_mask);
-			goto got_pg;
-		}
-
-		/* The OOM killer will not help higher order allocs so fail */
-		if (order > PAGE_ALLOC_COSTLY_ORDER) {
-			clear_zonelist_oom(zonelist, gfp_mask);
-			goto nopage;
-		}
-
-		out_of_memory(zonelist, gfp_mask, order);
-		clear_zonelist_oom(zonelist, gfp_mask);
-		goto restart;
 	}
 
-	/*
-	 * Don't let big-order allocations loop unless the caller explicitly
-	 * requests that.  Wait for some write requests to complete then retry.
-	 *
-	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
-	 * means __GFP_NOFAIL, but that may not be true in other
-	 * implementations.
-	 *
-	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
-	 * specified, then we retry until we no longer reclaim any pages
-	 * (above), or we've reclaimed an order of pages at least as
-	 * large as the allocation's order. In both cases, if the
-	 * allocation still fails, we stop retrying.
-	 */
+	/* Check if we should retry the allocation */
 	pages_reclaimed += did_some_progress;
-	do_retry = 0;
-	if (!(gfp_mask & __GFP_NORETRY)) {
-		if (order <= PAGE_ALLOC_COSTLY_ORDER) {
-			do_retry = 1;
-		} else {
-			if (gfp_mask & __GFP_REPEAT &&
-				pages_reclaimed < (1 << order))
-					do_retry = 1;
-		}
-		if (gfp_mask & __GFP_NOFAIL)
-			do_retry = 1;
-	}
-	if (do_retry) {
+	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
+		/* Wait for some write requests to complete then retry */
 		congestion_wait(WRITE, HZ/50);
-		goto rebalance;
+		goto restart;
 	}
 
 nopage:
@@ -1672,6 +1726,39 @@ nopage:
 	}
 got_pg:
 	return page;
+
+}
+
+/*
+ * This is the 'heart' of the zoned buddy allocator.
+ */
+struct page *
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask)
+{
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	struct page *page;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+
+	if (should_fail_alloc_page(gfp_mask, order))
+		return NULL;
+
+	/*
+	 * Check the zones suitable for the gfp_mask contain at least one
+	 * valid zone. It's possible to have an empty zonelist as a result
+	 * of GFP_THISNODE and a memoryless node
+	 */
+	if (unlikely(!zonelist->_zonerefs->zone))
+		return NULL;
+
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
+			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+	if (unlikely(!page))
+		page = __alloc_pages_slowpath(gfp_mask, order,
+				zonelist, high_zoneidx, nodemask);
+
+	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
 
-- 
1.5.6.5
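
The shape of the split can be sketched away from the allocator: keep a tiny,
inlinable function for the common case and push the rare, heavyweight work
behind a noinline helper so the hot path stays small. The object cache,
sizes and names below are invented for the illustration; only the
fast-path/slow-path structure mirrors the patch.

/*
 * Sketch of a fast-path/slow-path split for a trivial object cache.
 * The fast path is small enough to inline; the slow path is marked
 * noinline so its larger body does not bloat every call site.
 */
#include <stdio.h>
#include <stdlib.h>

#define OBJ_SIZE	64
#define CACHE_SLOTS	16

static void *cache[CACHE_SLOTS];
static int cached;

/* Rare case: the cache is empty, refill it from the underlying allocator */
static __attribute__((noinline)) void *obj_alloc_slowpath(void)
{
	while (cached < CACHE_SLOTS) {
		void *p = malloc(OBJ_SIZE);
		if (!p)
			break;
		cache[cached++] = p;
	}
	return cached ? cache[--cached] : NULL;
}

/* Common case: pop from the cache and return immediately */
static inline void *obj_alloc(void)
{
	if (cached)
		return cache[--cached];
	return obj_alloc_slowpath();
}

int main(void)
{
	void *a = obj_alloc();	/* first call takes the slow path and refills */
	void *b = obj_alloc();	/* later calls stay on the fast path */

	printf("a=%p b=%p cached=%d\n", a, b, cached);

	free(a);
	free(b);
	while (cached)
		free(cache[--cached]);
	return 0;
}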


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 08/19] Simplify the check on whether cpusets are a factor or not
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-24 12:17   ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

The check on whether cpuset constraints need to be applied or not is complex
and often repeated.  This patch makes the check once in advance so that the
per-zone comparison is simpler to compute.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/cpuset.h |    2 ++
 mm/page_alloc.c        |   13 +++++++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 90c6074..6051082 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -83,6 +83,8 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
 
 #else /* !CONFIG_CPUSETS */
 
+#define number_of_cpusets (0)
+
 static inline int cpuset_init_early(void) { return 0; }
 static inline int cpuset_init(void) { return 0; }
 static inline void cpuset_init_smp(void) {}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 503d692..405cd8c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1136,7 +1136,11 @@ failed:
 #define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
+#ifdef CONFIG_CPUSETS
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+#else
+#define ALLOC_CPUSET		0x00
+#endif /* CONFIG_CPUSETS */
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
@@ -1400,6 +1404,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
+	int alloc_cpuset = 0;
 
 	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
 							&preferred_zone);
@@ -1410,6 +1415,10 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 
 	VM_BUG_ON(order >= MAX_ORDER);
 
+	/* Determine in advance if the cpuset checks will be needed */
+	if ((alloc_flags & ALLOC_CPUSET) && unlikely(number_of_cpusets > 1))
+		alloc_cpuset = 1;
+
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -1420,8 +1429,8 @@ zonelist_scan:
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
-		if ((alloc_flags & ALLOC_CPUSET) &&
-			!cpuset_zone_allowed_softwall(zone, gfp_mask))
+		if (alloc_cpuset)
+			if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				goto try_next_zone;
 
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
-- 
1.5.6.5
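
Two separate things make the fast path cheaper here: ALLOC_CPUSET becomes 0
when cpusets are configured out, so the test is folded away by the compiler,
and when cpusets are built in the combined condition is evaluated once
before the zonelist scan rather than for every zone. A minimal sketch of
both tricks, with invented names and a stand-in for
cpuset_zone_allowed_softwall(), might look like this:

/*
 * Sketch of a flag that compiles to 0 when a feature is disabled plus a
 * per-call check hoisted out of the scan loop. Flip SIMULATE_CPUSETS to
 * mimic CONFIG_CPUSETS=y; everything here is illustrative only.
 */
#include <stdio.h>

#define SIMULATE_CPUSETS 0

#if SIMULATE_CPUSETS
#define ALLOC_CPUSET	0x40
static int number_of_cpusets = 2;
#else
#define ALLOC_CPUSET	0x00	/* (alloc_flags & 0) is removed by the compiler */
static int number_of_cpusets;
#endif

/* Stand-in for cpuset_zone_allowed_softwall(): pretend odd zones are allowed */
static int zone_allowed(int zone)
{
	return zone & 1;
}

static int scan_zones(int alloc_flags, int nr_zones)
{
	int zone;
	/* Decide once, up front, whether the per-zone check is needed at all */
	int check_cpuset = (alloc_flags & ALLOC_CPUSET) && number_of_cpusets > 1;

	for (zone = 0; zone < nr_zones; zone++) {
		if (check_cpuset && !zone_allowed(zone))
			continue;	/* skip zones this "cpuset" forbids */
		return zone;		/* first acceptable zone wins */
	}
	return -1;
}

int main(void)
{
	printf("picked zone %d\n", scan_zones(ALLOC_CPUSET, 4));
	return 0;
}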


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 09/19] Move check for disabled anti-fragmentation out of fastpath
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-24 12:17   ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

On low-memory systems, anti-fragmentation gets disabled as there is nothing
it can do and it would just incur overhead shuffling pages between lists
constantly. Currently the check is made in the free page fast path for every
page. This patch moves it to a slow path. On machines with low memory,
there will be a small amount of additional overhead as pages get shuffled
between lists but it should quickly settle.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    3 ---
 mm/page_alloc.c        |    4 ++++
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 09c14e2..6089393 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -50,9 +50,6 @@ extern int page_group_by_mobility_disabled;
 
 static inline int get_pageblock_migratetype(struct page *page)
 {
-	if (unlikely(page_group_by_mobility_disabled))
-		return MIGRATE_UNMOVABLE;
-
 	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 405cd8c..6f26944 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -172,6 +172,10 @@ int page_group_by_mobility_disabled __read_mostly;
 
 static void set_pageblock_migratetype(struct page *page, int migratetype)
 {
+
+	if (unlikely(page_group_by_mobility_disabled))
+		migratetype = MIGRATE_UNMOVABLE;
+
 	set_pageblock_flags_group(page, (unsigned long)migratetype,
 					PB_migrate, PB_migrate_end);
 }
-- 
1.5.6.5
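
This is the usual transformation of normalising data when it is written
rather than every time it is read: set_pageblock_migratetype() is called
rarely, while get_pageblock_migratetype() runs on every page free. A toy
version of the pattern, with invented names, looks like the sketch below.

/*
 * Toy version of moving a check from the hot read side to the rare
 * write side. grouping_disabled plays the role of
 * page_group_by_mobility_disabled; everything else is invented.
 */
#include <stdio.h>

enum { TYPE_UNMOVABLE, TYPE_RECLAIMABLE, TYPE_MOVABLE };

static int grouping_disabled = 1;	/* e.g. a low-memory machine */
static int block_type[8];

/* Write side: called rarely, so the clamp costs almost nothing */
static void set_block_type(int block, int type)
{
	if (grouping_disabled)
		type = TYPE_UNMOVABLE;
	block_type[block] = type;
}

/* Read side: hot path, now a plain load with no extra branch */
static int get_block_type(int block)
{
	return block_type[block];
}

int main(void)
{
	set_block_type(3, TYPE_MOVABLE);
	printf("block 3 type = %d\n", get_block_type(3));	/* clamped on write */
	return 0;
}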


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 10/19] Calculate the preferred zone for allocation only once
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-24 12:17   ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

get_page_from_freelist() can be called multiple times for an allocation.
Part of this calculates the preferred_zone which is the first usable
zone in the zonelist. This patch calculates preferred_zone once.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   53 ++++++++++++++++++++++++++++++++---------------------
 1 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6f26944..074f9a6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1399,24 +1399,19 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
  */
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
-		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
+		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
+		struct zone *preferred_zone)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
 	int classzone_idx;
-	struct zone *zone, *preferred_zone;
+	struct zone *zone;
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 	int alloc_cpuset = 0;
 
-	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
-							&preferred_zone);
-	if (!preferred_zone)
-		return NULL;
-
 	classzone_idx = zone_idx(preferred_zone);
-
 	VM_BUG_ON(order >= MAX_ORDER);
 
 	/* Determine in advance if the cpuset checks will be needed */
@@ -1515,7 +1510,7 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask)
+	nodemask_t *nodemask, struct zone *preferred_zone)
 {
 	struct page *page;
 
@@ -1532,7 +1527,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	 */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
-		ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
+		preferred_zone);
 	if (page)
 		goto out;
 
@@ -1552,7 +1548,8 @@ out:
 struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress)
+	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+	unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
@@ -1583,7 +1580,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	if (likely(*did_some_progress))
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
-					zonelist, high_zoneidx, alloc_flags);
+					zonelist, high_zoneidx,
+					alloc_flags, preferred_zone);
 	return page;
 }
 
@@ -1604,13 +1602,14 @@ static inline int is_allocation_high_priority(struct task_struct *p,
 struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask)
+	nodemask_t *nodemask, struct zone *preferred_zone)
 {
 	struct page *page;
 
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
-			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
+			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
+			preferred_zone);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			congestion_wait(WRITE, HZ/50);
@@ -1632,7 +1631,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist, enum zone_ty
 static struct page * noinline
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask)
+	nodemask_t *nodemask, struct zone *preferred_zone)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
@@ -1682,14 +1681,15 @@ restart:
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
-						high_zoneidx, alloc_flags);
+						high_zoneidx, alloc_flags,
+						preferred_zone);
 	if (page)
 		goto got_pg;
 
 	/* Allocate without watermarks if the context allows */
 	if (is_allocation_high_priority(p, gfp_mask))
 		page = __alloc_pages_high_priority(gfp_mask, order,
-			zonelist, high_zoneidx, nodemask);
+			zonelist, high_zoneidx, nodemask, preferred_zone);
 	if (page)
 		goto got_pg;
 
@@ -1701,7 +1701,8 @@ restart:
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask,
-					alloc_flags, &did_some_progress);
+					alloc_flags, preferred_zone,
+					&did_some_progress);
 	if (page)
 		goto got_pg;
 
@@ -1713,7 +1714,7 @@ restart:
 		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
-					nodemask);
+					nodemask, preferred_zone);
 			if (page)
 				goto got_pg;
 
@@ -1750,6 +1751,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	struct zone *preferred_zone;
 	struct page *page;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
@@ -1765,11 +1767,20 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(!zonelist->_zonerefs->zone))
 		return NULL;
 
+	/* The preferred zone is used for statistics later */
+	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
+							&preferred_zone);
+	if (!preferred_zone)
+		return NULL;
+
+	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
+			preferred_zone);
 	if (unlikely(!page))
 		page = __alloc_pages_slowpath(gfp_mask, order,
-				zonelist, high_zoneidx, nodemask);
+				zonelist, high_zoneidx, nodemask,
+				preferred_zone);
 
 	return page;
 }
-- 
1.5.6.5
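
The general shape of the change, computing a value once at the entry point
and threading it through as a parameter instead of recomputing it in every
helper, can be sketched generically. expensive_lookup() below stands in for
first_zones_zonelist(); the names, costs and call counts are invented for
the illustration.

/*
 * Sketch of hoisting a repeated lookup out of the helpers and into the
 * caller. The counter only exists to make the saving visible.
 */
#include <stdio.h>

static int lookups;

static int expensive_lookup(int key)
{
	lookups++;		/* pretend this walks a zonelist */
	return key * 10;
}

/* Before: every helper recomputes the value it needs */
static int helper_before(int key)
{
	int preferred = expensive_lookup(key);
	return preferred + 1;
}

/* After: the caller computes it once and passes it down */
static int helper_after(int preferred)
{
	return preferred + 1;
}

int main(void)
{
	int i, preferred, result = 0;

	lookups = 0;
	for (i = 0; i < 3; i++)		/* e.g. the retries in a slow path */
		result += helper_before(5);
	printf("before: %d lookups\n", lookups);

	lookups = 0;
	preferred = expensive_lookup(5);	/* once, at the entry point */
	for (i = 0; i < 3; i++)
		result += helper_after(preferred);
	printf("after:  %d lookups (result %d)\n", lookups, result);

	return 0;
}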


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 11/19] Calculate the migratetype for allocation only once
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-24 12:17   ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

The GFP mask is converted into a migratetype when deciding which pagelist to
take a page from. However, this conversion currently happens multiple times
per allocation, at least once per zone traversed. Calculate it once per
allocation instead and pass it down.
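
For reference, the conversion being hoisted looks roughly like the sketch
below. This is a from-memory illustration of allocflags_to_migratetype(),
not part of the diff:

    /* Sketch: map the GFP mobility bits onto a migratetype index */
    static inline int allocflags_to_migratetype(gfp_t gfp_flags)
    {
            if (unlikely(page_group_by_mobility_disabled))
                    return MIGRATE_UNMOVABLE;

            /* group based on mobility: movable, reclaimable or unmovable */
            return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
                    ((gfp_flags & __GFP_RECLAIMABLE) != 0);
    }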

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   43 ++++++++++++++++++++++++++-----------------
 1 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 074f9a6..8437291 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1068,13 +1068,13 @@ void split_page(struct page *page, unsigned int order)
  * or two.
  */
 static struct page *buffered_rmqueue(struct zone *preferred_zone,
-			struct zone *zone, int order, gfp_t gfp_flags)
+			struct zone *zone, int order, gfp_t gfp_flags,
+			int migratetype)
 {
 	unsigned long flags;
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
 	int cpu;
-	int migratetype = allocflags_to_migratetype(gfp_flags);
 
 again:
 	cpu  = get_cpu();
@@ -1400,7 +1400,7 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone)
+		struct zone *preferred_zone, int migratetype)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
@@ -1448,7 +1448,8 @@ zonelist_scan:
 			}
 		}
 
-		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
+		page = buffered_rmqueue(preferred_zone, zone, order,
+						gfp_mask, migratetype);
 		if (page)
 			break;
 this_zone_full:
@@ -1510,7 +1511,8 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, struct zone *preferred_zone)
+	nodemask_t *nodemask, struct zone *preferred_zone,
+	int migratetype)
 {
 	struct page *page;
 
@@ -1528,7 +1530,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
 		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
-		preferred_zone);
+		preferred_zone, migratetype);
 	if (page)
 		goto out;
 
@@ -1549,7 +1551,7 @@ struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	unsigned long *did_some_progress)
+	int migratetype, unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
@@ -1581,7 +1583,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	if (likely(*did_some_progress))
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
-					alloc_flags, preferred_zone);
+					alloc_flags, preferred_zone,
+					migratetype);
 	return page;
 }
 
@@ -1602,14 +1605,15 @@ static inline int is_allocation_high_priority(struct task_struct *p,
 struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, struct zone *preferred_zone)
+	nodemask_t *nodemask, struct zone *preferred_zone,
+	int migratetype)
 {
 	struct page *page;
 
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
-			preferred_zone);
+			preferred_zone, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			congestion_wait(WRITE, HZ/50);
@@ -1631,7 +1635,8 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist, enum zone_ty
 static struct page * noinline
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, struct zone *preferred_zone)
+	nodemask_t *nodemask, struct zone *preferred_zone,
+	int migratetype)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
@@ -1682,14 +1687,16 @@ restart:
 	 */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 						high_zoneidx, alloc_flags,
-						preferred_zone);
+						preferred_zone,
+						migratetype);
 	if (page)
 		goto got_pg;
 
 	/* Allocate without watermarks if the context allows */
 	if (is_allocation_high_priority(p, gfp_mask))
 		page = __alloc_pages_high_priority(gfp_mask, order,
-			zonelist, high_zoneidx, nodemask, preferred_zone);
+			zonelist, high_zoneidx, nodemask, preferred_zone,
+			migratetype);
 	if (page)
 		goto got_pg;
 
@@ -1702,7 +1709,7 @@ restart:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					&did_some_progress);
+					migratetype, &did_some_progress);
 	if (page)
 		goto got_pg;
 
@@ -1714,7 +1721,8 @@ restart:
 		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
-					nodemask, preferred_zone);
+					nodemask, preferred_zone,
+					migratetype);
 			if (page)
 				goto got_pg;
 
@@ -1753,6 +1761,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct zone *preferred_zone;
 	struct page *page;
+	int migratetype = allocflags_to_migratetype(gfp_mask);
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
@@ -1776,11 +1785,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
-			preferred_zone);
+			preferred_zone, migratetype);
 	if (unlikely(!page))
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone);
+				preferred_zone, migratetype);
 
 	return page;
 }
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 12/19] Calculate the alloc_flags for allocation only once
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-24 12:17   ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

Factor out the mapping from the GFP mask to the allocator's internal
alloc_flags. Once factored out, the flags only need to be calculated once per
allocation attempt, but some care must be taken.

[neilb@suse.de says]
As the test:

-       if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-                       && !in_interrupt()) {
-               if (!(gfp_mask & __GFP_NOMEMALLOC)) {

has been replaced with a slightly weaker one:

+       if (alloc_flags & ALLOC_NO_WATERMARKS) {

we need to ensure we don't recurse when PF_MEMALLOC is set.
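
To make the recursion concern concrete, the danger is roughly the following
(illustrative sketch with a hypothetical name; the real guard is the explicit
PF_MEMALLOC check added in the hunk below):

    /*
     * Direct reclaim marks the task PF_MEMALLOC while try_to_free_pages()
     * runs, and reclaim may itself allocate.  If that nested allocation were
     * allowed back into direct reclaim, the task could recurse, hence the
     * explicit bail-out in the slow path.
     */
    static unsigned long direct_reclaim_sketch(struct zonelist *zonelist,
                                    int order, gfp_t gfp_mask)
    {
            unsigned long progress;

            current->flags |= PF_MEMALLOC;
            progress = try_to_free_pages(zonelist, order, gfp_mask);
            current->flags &= ~PF_MEMALLOC;

            return progress;
    }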

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 mm/page_alloc.c |   90 +++++++++++++++++++++++++++++++-----------------------
 1 files changed, 52 insertions(+), 38 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8437291..ead73fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1588,16 +1588,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-static inline int is_allocation_high_priority(struct task_struct *p,
-							gfp_t gfp_mask)
-{
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt())
-		if (!(gfp_mask & __GFP_NOMEMALLOC))
-			return 1;
-	return 0;
-}
-
 /*
  * This is called in the allocator slow-path if the allocation request is of
  * sufficient urgency to ignore watermarks and take other desperate measures
@@ -1632,6 +1622,44 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist, enum zone_ty
 		wakeup_kswapd(zone, order);
 }
 
+/*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+static int gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+	struct task_struct *p = current;
+	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+	/*
+	 * The caller may dip into page reserves a bit more if the caller
+	 * cannot run direct reclaim, or if the caller has realtime scheduling
+	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
+	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+	 */
+	if (gfp_mask & __GFP_HIGH)
+		alloc_flags |= ALLOC_HIGH;
+
+	if (!wait) {
+		alloc_flags |= ALLOC_HARDER;
+		/*
+		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+		 */
+		alloc_flags &= ~ALLOC_CPUSET;
+	} else if (unlikely(rt_task(p)) && !in_interrupt())
+		alloc_flags |= ALLOC_HARDER;
+
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (!in_interrupt() &&
+		    ((p->flags & PF_MEMALLOC) ||
+		     unlikely(test_thread_flag(TIF_MEMDIE))))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+	}
+
+	return alloc_flags;
+}
+
 static struct page * noinline
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -1662,48 +1690,34 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
-	 *
-	 * The caller may dip into page reserves a bit more if the caller
-	 * cannot run direct reclaim, or if the caller has realtime scheduling
-	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 	 */
-	alloc_flags = ALLOC_WMARK_MIN;
-	if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
-		alloc_flags |= ALLOC_HARDER;
-	if (gfp_mask & __GFP_HIGH)
-		alloc_flags |= ALLOC_HIGH;
-	if (wait)
-		alloc_flags |= ALLOC_CPUSET;
+	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
 restart:
-	/*
-	 * Go through the zonelist again. Let __GFP_HIGH and allocations
-	 * coming from realtime tasks go deeper into reserves.
-	 *
-	 * This is the last chance, in general, before the goto nopage.
-	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
-	 */
+	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
-						high_zoneidx, alloc_flags,
-						preferred_zone,
-						migratetype);
+			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
+			preferred_zone, migratetype);
 	if (page)
 		goto got_pg;
 
 	/* Allocate without watermarks if the context allows */
-	if (is_allocation_high_priority(p, gfp_mask))
+	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 		page = __alloc_pages_high_priority(gfp_mask, order,
-			zonelist, high_zoneidx, nodemask, preferred_zone,
-			migratetype);
-	if (page)
-		goto got_pg;
+				zonelist, high_zoneidx, nodemask,
+				preferred_zone, migratetype);
+		if (page)
+			goto got_pg;
+	}
 
 	/* Atomic allocations - we can't balance anything */
 	if (!wait)
 		goto nopage;
 
+	/* Avoid recursion of direct reclaim */
+	if (p->flags & PF_MEMALLOC)
+		goto nopage;
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 13/19] Inline __rmqueue_smallest()
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-24 12:17   ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

Inline __rmqueue_smallest() by altering the flow very slightly so that there
is only one call site. This allows the function to be inlined without
incurring additional text bloat.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   23 ++++++++++++++++++-----
 1 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ead73fd..51eedfa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -665,7 +665,8 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
  * Go through the free lists for the given migratetype and remove
  * the smallest available page from the freelists
  */
-static struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
+static inline
+struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 						int migratetype)
 {
 	unsigned int current_order;
@@ -835,24 +836,36 @@ static struct page *__rmqueue_fallback(struct zone *zone, int order,
 		}
 	}
 
-	/* Use MIGRATE_RESERVE rather than fail an allocation */
-	return __rmqueue_smallest(zone, order, MIGRATE_RESERVE);
+	return NULL;
 }
 
 /*
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order,
+static inline
+struct page *__rmqueue(struct zone *zone, unsigned int order,
 						int migratetype)
 {
 	struct page *page;
 
+retry_reserve:
 	page = __rmqueue_smallest(zone, order, migratetype);
 
-	if (unlikely(!page))
+	if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
 		page = __rmqueue_fallback(zone, order, migratetype);
 
+		/*
+		 * Use MIGRATE_RESERVE rather than fail an allocation. goto
+		 * is used because __rmqueue_smallest is an inline function
+		 * and we want just one call site
+		 */
+		if (!page) {
+			migratetype = MIGRATE_RESERVE;
+			goto retry_reserve;
+		}
+	}
+
 	return page;
 }
 
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 14/19] Inline buffered_rmqueue()
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-24 12:17   ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

buffered_rmqueue() is in the fast path, so inline it. This incurs some text
bloat, as there is now a copy in both the fast and slow paths, but the cost
of the function call was noticeable in profiles of the fast path.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 51eedfa..1786542 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1080,7 +1080,8 @@ void split_page(struct page *page, unsigned int order)
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
  */
-static struct page *buffered_rmqueue(struct zone *preferred_zone,
+static inline
+struct page *buffered_rmqueue(struct zone *preferred_zone,
 			struct zone *zone, int order, gfp_t gfp_flags,
 			int migratetype)
 {
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 15/19] Do not call get_pageblock_migratetype() more than necessary
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-24 12:17   ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

get_pageblock_migratetype() is potentially called twice for every page free:
once when the page is freed to the pcp lists and once when it is freed back
to the buddy lists. When freeing from the pcp lists, the pageblock type was
already known at the time the page entered the list, so use that value rather
than rechecking. In low-memory situations under memory pressure this might
skew anti-fragmentation decisions slightly, but the interference is minimal
and decisions that fragment memory are being made anyway.
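
The other half of the change, only partially visible in the hunks below, is to
stash the migratetype when a page enters the pcp list. A minimal sketch of the
idea, using hypothetical helper names:

    /* record the pageblock type in page->private when freeing to the pcp list */
    static inline void pcp_stash_migratetype(struct page *page)
    {
            set_page_private(page, get_pageblock_migratetype(page));
    }

    /* read it back in free_pages_bulk() instead of a second pageblock lookup */
    static inline int pcp_stashed_migratetype(struct page *page)
    {
            return page_private(page);
    }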

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   26 ++++++++++++++++----------
 1 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1786542..1aeb5b0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -77,7 +77,8 @@ int percpu_pagelist_fraction;
 int pageblock_order __read_mostly;
 #endif
 
-static void __free_pages_ok(struct page *page, unsigned int order);
+static void __free_pages_ok(struct page *page, unsigned int order,
+					int migratetype);
 
 /*
  * results with 256, 32 in the lowmem_reserve sysctl:
@@ -283,7 +284,7 @@ out:
 
 static void free_compound_page(struct page *page)
 {
-	__free_pages_ok(page, compound_order(page));
+	__free_pages_ok(page, compound_order(page), -1);
 }
 
 void prep_compound_page(struct page *page, unsigned long order)
@@ -456,16 +457,19 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
  */
 
 static inline void __free_one_page(struct page *page,
-		struct zone *zone, unsigned int order)
+		struct zone *zone, unsigned int order,
+		int migratetype)
 {
 	unsigned long page_idx;
 	int order_size = 1 << order;
-	int migratetype = get_pageblock_migratetype(page);
 
 	if (unlikely(PageCompound(page)))
 		if (unlikely(destroy_compound_page(page, order)))
 			return;
 
+	if (migratetype == -1)
+		migratetype = get_pageblock_migratetype(page);
+
 	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
 
 	VM_BUG_ON(page_idx & (order_size - 1));
@@ -534,21 +538,23 @@ static void free_pages_bulk(struct zone *zone, int count,
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_one_page list manipulates */
 		list_del(&page->lru);
-		__free_one_page(page, zone, order);
+		__free_one_page(page, zone, order, page_private(page));
 	}
 	spin_unlock(&zone->lock);
 }
 
-static void free_one_page(struct zone *zone, struct page *page, int order)
+static void free_one_page(struct zone *zone, struct page *page, int order,
+				int migratetype)
 {
 	spin_lock(&zone->lock);
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
-	__free_one_page(page, zone, order);
+	__free_one_page(page, zone, order, migratetype);
 	spin_unlock(&zone->lock);
 }
 
-static void __free_pages_ok(struct page *page, unsigned int order)
+static void __free_pages_ok(struct page *page, unsigned int order,
+				int migratetype)
 {
 	unsigned long flags;
 	int i;
@@ -569,7 +575,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	free_one_page(page_zone(page), page, order);
+	free_one_page(page_zone(page), page, order, migratetype);
 	local_irq_restore(flags);
 }
 
@@ -1869,7 +1875,7 @@ void __free_pages(struct page *page, unsigned int order)
 		if (order == 0)
 			free_hot_page(page);
 		else
-			__free_pages_ok(page, order);
+			__free_pages_ok(page, order, -1);
 	}
 }
 
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 16/19] Do not disable interrupts in free_page_mlock()
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-24 12:17   ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

free_page_mlock() tests and clears PG_mlocked using locked versions of the
bit operations. If the bit was set, it disables interrupts to update the
counters, and this happens on every page free even though the free path
disables interrupts again very shortly afterwards. This is wasteful.

This patch splits what free_page_mlock() does. The bit check is still made,
but the counter update is delayed until interrupts are already disabled and
the non-locked version of the bit operation is used to clear the flag. One
potential weirdness with this split is that the counters do not get updated
if the bad_page() check triggers, but a system showing bad pages is getting
screwed already.
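
The resulting ordering on the caller side, pulled together from the two hunks
below into one hedged sketch (hypothetical function name):

    /*
     * Test PageMlocked() cheaply while interrupts are still enabled, then do
     * the non-atomic clear and the counter updates inside the irq-off section
     * the free path needs anyway.  __ClearPageMlocked() is safe here because
     * the page is being freed and has no other users.
     */
    static void free_path_sketch(struct page *page, unsigned int order,
                                    int migratetype)
    {
            unsigned long flags;
            int clearMlocked = PageMlocked(page);

            local_irq_save(flags);
            if (clearMlocked)
                    free_page_mlock(page);
            __count_vm_events(PGFREE, 1 << order);
            free_one_page(page_zone(page), page, order, migratetype);
            local_irq_restore(flags);
    }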

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/internal.h   |   11 +++--------
 mm/page_alloc.c |    8 +++++++-
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 478223b..7f775a1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -155,14 +155,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
  */
 static inline void free_page_mlock(struct page *page)
 {
-	if (unlikely(TestClearPageMlocked(page))) {
-		unsigned long flags;
-
-		local_irq_save(flags);
-		__dec_zone_page_state(page, NR_MLOCK);
-		__count_vm_event(UNEVICTABLE_MLOCKFREED);
-		local_irq_restore(flags);
-	}
+	__ClearPageMlocked(page);
+	__dec_zone_page_state(page, NR_MLOCK);
+	__count_vm_event(UNEVICTABLE_MLOCKFREED);
 }
 
 #else /* CONFIG_UNEVICTABLE_LRU */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1aeb5b0..73cf205 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -501,7 +501,6 @@ static inline void __free_one_page(struct page *page,
 
 static inline int free_pages_check(struct page *page)
 {
-	free_page_mlock(page);
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
 		(page_count(page) != 0)  |
@@ -559,6 +558,7 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	unsigned long flags;
 	int i;
 	int bad = 0;
+	int clearMlocked = PageMlocked(page);
 
 	for (i = 0 ; i < (1 << order) ; ++i)
 		bad += free_pages_check(page + i);
@@ -574,6 +574,8 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	kernel_map_pages(page, 1 << order, 0);
 
 	local_irq_save(flags);
+	if (clearMlocked)
+		free_page_mlock(page);
 	__count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, order, migratetype);
 	local_irq_restore(flags);
@@ -1023,6 +1025,7 @@ static void free_hot_cold_page(struct page *page, int cold)
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
+	int clearMlocked = PageMlocked(page);
 
 	if (PageAnon(page))
 		page->mapping = NULL;
@@ -1039,6 +1042,9 @@ static void free_hot_cold_page(struct page *page, int cold)
 	pcp = &zone_pcp(zone, get_cpu())->pcp;
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);
+	if (clearMlocked)
+		free_page_mlock(page);
+
 	if (cold)
 		list_add_tail(&page->lru, &pcp->list);
 	else
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 17/19] Do not setup zonelist cache when there is only one node
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-24 12:17   ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

There is a zonelist cache which is used to track zones that are not in the
allowed cpuset or were recently found to be full. This reduces cache footprint
on large machines. On smaller machines, it just incurs cost for no gain. This
patch only sets up the zonelist cache when there is more than one online node.
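
For context, the information the zonelist cache remembers is roughly the
following; this is a simplified illustration rather than the kernel's exact
definition:

    /* per-zonelist scratch data: which zones to skip on the next scan */
    struct zlc_sketch {
            DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST); /* recently-full zones */
            unsigned long last_full_zap;    /* when the "full" bits were last cleared */
    };

On a single-node machine there is little worth remembering between scans, so
the setup cost is pure overhead.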

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 73cf205..e598da8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1483,9 +1483,15 @@ this_zone_full:
 			zlc_mark_zone_full(zonelist, z);
 try_next_zone:
 		if (NUMA_BUILD && !did_zlc_setup) {
-			/* we do zlc_setup after the first zone is tried */
-			allowednodes = zlc_setup(zonelist, alloc_flags);
-			zlc_active = 1;
+			/*
+			 * we do zlc_setup after the first zone is tried
+			 * but only if there are multiple nodes to make
+			 * it worthwhile
+			 */
+			if (num_online_nodes() > 1) {
+				allowednodes = zlc_setup(zonelist, alloc_flags);
+				zlc_active = 1;
+			}
 			did_zlc_setup = 1;
 		}
 	}
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 18/19] Do not check for compound pages during the page allocator sanity checks
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-24 12:17   ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

A number of sanity checks are made on each page allocation and free,
including a check that the page count is zero. page_count() checks whether
the page is compound and, if so, reads the count of the head page instead.
However, in these paths we do not care whether the page is compound or not,
as the count of each tail page should also be zero.

This patch makes two changes to the use of page_count() in the free path. It
converts one check of page_count() to a VM_BUG_ON(), as the count should have
been unconditionally checked earlier in the free path. It also avoids the
compound-page check by reading the count directly.
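
For reference, the indirection being skipped looks roughly like this sketch of
page_count() (from memory, with a hypothetical name; see include/linux/mm.h
for the real definition):

    static inline int page_count_sketch(struct page *page)
    {
            /* compound tail pages redirect the count to the head page */
            if (unlikely(PageTail(page)))
                    page = page->first_page;
            return atomic_read(&page->_count);
    }

In the alloc/free sanity checks the page must not be part of anything, so the
redirect is unnecessary and atomic_read(&page->_count) is used directly.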

[mel@csn.ul.ie: Wrote changelog]
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
---
 mm/page_alloc.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e598da8..8a8db71 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -426,7 +426,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 		return 0;
 
 	if (PageBuddy(buddy) && page_order(buddy) == order) {
-		BUG_ON(page_count(buddy) != 0);
+		VM_BUG_ON(page_count(buddy) != 0);
 		return 1;
 	}
 	return 0;
@@ -503,7 +503,7 @@ static inline int free_pages_check(struct page *page)
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(atomic_read(&page->_count) != 0) |
 		(page->flags & PAGE_FLAGS_CHECK_AT_FREE))) {
 		bad_page(page);
 		return 1;
@@ -648,7 +648,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(atomic_read(&page->_count) != 0)  |
 		(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
 		bad_page(page);
 		return 1;
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 18/19] Do not check for compound pages during the page allocator sanity checks
@ 2009-02-24 12:17   ` Mel Gorman
  0 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

A number of sanity checks are made on each page allocation and free
including that the page count is zero. page_count() checks for
compound pages and checks the count of the head page if true. However,
in these paths, we do not care if the page is compound or not as the
count of each tail page should also be zero.

This patch makes two changes to the use of page_count() in the free path. It
converts one check of page_count() to a VM_BUG_ON() as the count should
have been unconditionally checked earlier in the free path. It also avoids
checking for compound pages.

[mel@csn.ul.ie: Wrote changelog]
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
---
 mm/page_alloc.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e598da8..8a8db71 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -426,7 +426,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 		return 0;
 
 	if (PageBuddy(buddy) && page_order(buddy) == order) {
-		BUG_ON(page_count(buddy) != 0);
+		VM_BUG_ON(page_count(buddy) != 0);
 		return 1;
 	}
 	return 0;
@@ -503,7 +503,7 @@ static inline int free_pages_check(struct page *page)
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(atomic_read(&page->_count) != 0) |
 		(page->flags & PAGE_FLAGS_CHECK_AT_FREE))) {
 		bad_page(page);
 		return 1;
@@ -648,7 +648,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(atomic_read(&page->_count) != 0)  |
 		(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
 		bad_page(page);
 		return 1;
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 19/19] Split per-cpu list into one-list-per-migrate-type
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-24 12:17   ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

Currently the per-cpu page allocator searches the PCP list for pages of the
correct migrate-type to reduce the possibility of pages being inappropriately
placed from a fragmentation perspective. This search is potentially expensive
in a fast path and undesirable. Splitting the per-cpu list into multiple
lists increases the size of the per-cpu structure, which was potentially
a major problem at the time the search was introduced. That problem has since
been mitigated because only the necessary number of structures is now
allocated for the running system.

This patch replaces a list search in the per-cpu allocator with one list
per migrate type.
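
As a rough, self-contained sketch of the difference (the types and helpers
below are made up for illustration, not kernel structures): with one mixed
list the hot path must walk entries looking for the wanted migrate type,
while with one list per type the head of the matching list is always usable:

#include <stdio.h>

#define NR_PCPTYPES 3	/* unmovable, reclaimable, movable (illustrative) */

struct fake_page {
	int migratetype;
	struct fake_page *next;
};

/* Old scheme: one mixed list, search it for the wanted type */
static struct fake_page *take_mixed(struct fake_page **mixed, int type)
{
	struct fake_page **pp;

	for (pp = mixed; *pp; pp = &(*pp)->next) {	/* O(list length) */
		if ((*pp)->migratetype == type) {
			struct fake_page *page = *pp;
			*pp = page->next;
			return page;
		}
	}
	return NULL;
}

/* New scheme: one list per type, the head is always the right page */
static struct fake_page *take_typed(struct fake_page *lists[], int type)
{
	struct fake_page *page = lists[type];		/* O(1) */

	if (page)
		lists[type] = page->next;
	return page;
}

int main(void)
{
	struct fake_page a = { 0, NULL }, b = { 2, NULL }, c = { 1, NULL };
	struct fake_page *mixed, *typed[NR_PCPTYPES] = { NULL };

	/* mixed list 0 -> 2 -> 1; finding a type-2 page means walking past 'a' */
	a.next = &b; b.next = &c; c.next = NULL; mixed = &a;
	printf("searched mixed list, got type %d\n",
	       take_mixed(&mixed, 2)->migratetype);

	/* the same pages kept as one short list per type instead */
	a.next = b.next = c.next = NULL;
	typed[0] = &a; typed[2] = &b; typed[1] = &c;
	printf("took head of typed list, type %d\n",
	       take_typed(typed, 2)->migratetype);
	return 0;
}
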

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    5 ++-
 mm/page_alloc.c        |   80 +++++++++++++++++++++++++++++------------------
 2 files changed, 53 insertions(+), 32 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6089393..2a7349a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -38,6 +38,7 @@
 #define MIGRATE_UNMOVABLE     0
 #define MIGRATE_RECLAIMABLE   1
 #define MIGRATE_MOVABLE       2
+#define MIGRATE_PCPTYPES      3 /* the number of types on the pcp lists */
 #define MIGRATE_RESERVE       3
 #define MIGRATE_ISOLATE       4 /* can't allocate from here */
 #define MIGRATE_TYPES         5
@@ -167,7 +168,9 @@ struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
-	struct list_head list;	/* the list of pages */
+
+	/* Lists of pages, one per migrate type stored on the pcp-lists */
+	struct list_head lists[MIGRATE_TYPES];
 };
 
 struct per_cpu_pageset {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8a8db71..c77ca1b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -514,7 +514,7 @@ static inline int free_pages_check(struct page *page)
 }
 
 /*
- * Frees a list of pages. 
+ * Frees a number of pages from the PCP lists
  * Assumes all pages on list are in same zone, and of same order.
  * count is the number of pages to free.
  *
@@ -524,20 +524,30 @@ static inline int free_pages_check(struct page *page)
  * And clear the zone's pages_scanned counter, to hold off the "all pages are
  * pinned" detection logic.
  */
-static void free_pages_bulk(struct zone *zone, int count,
-					struct list_head *list, int order)
+static void free_pcppages_bulk(struct zone *zone, int count,
+					 struct per_cpu_pages *pcp)
 {
+	int migratetype = 0;
+
 	spin_lock(&zone->lock);
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
 	while (count--) {
 		struct page *page;
-
-		VM_BUG_ON(list_empty(list));
+		struct list_head *list;
+
+		/* Remove pages from lists in a round-robin fashion */
+		do {
+			if (migratetype == MIGRATE_PCPTYPES)
+				migratetype = 0;
+			list = &pcp->lists[migratetype];
+			migratetype++;
+		} while (list_empty(list));
+		
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_one_page list manipulates */
 		list_del(&page->lru);
-		__free_one_page(page, zone, order, page_private(page));
+		__free_one_page(page, zone, 0, page_private(page));
 	}
 	spin_unlock(&zone->lock);
 }
@@ -930,7 +940,7 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 		to_drain = pcp->batch;
 	else
 		to_drain = pcp->count;
-	free_pages_bulk(zone, to_drain, &pcp->list, 0);
+	free_pcppages_bulk(zone, to_drain, pcp);
 	pcp->count -= to_drain;
 	local_irq_restore(flags);
 }
@@ -959,7 +969,7 @@ static void drain_pages(unsigned int cpu)
 
 		pcp = &pset->pcp;
 		local_irq_save(flags);
-		free_pages_bulk(zone, pcp->count, &pcp->list, 0);
+		free_pcppages_bulk(zone, pcp->count, pcp);
 		pcp->count = 0;
 		local_irq_restore(flags);
 	}
@@ -1025,6 +1035,7 @@ static void free_hot_cold_page(struct page *page, int cold)
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
+	int migratetype;
 	int clearMlocked = PageMlocked(page);
 
 	if (PageAnon(page))
@@ -1045,16 +1056,31 @@ static void free_hot_cold_page(struct page *page, int cold)
 	if (clearMlocked)
 		free_page_mlock(page);
 
+	/*
+	 * Only store unmovable, reclaimable and movable on pcp lists.
+	 * The one concern is that if the minimum number of free pages is not
+	 * aligned to a pageblock-boundary that allocations/frees from the
+	 * MIGRATE_RESERVE pageblocks may call free_one_page() excessively
+	 */
+	migratetype = get_pageblock_migratetype(page);
+	if (migratetype >= MIGRATE_PCPTYPES) {
+		free_one_page(zone, page, 0, migratetype);
+		goto out;
+	}
+
+	/* Record the migratetype and place on the lists */
+	set_page_private(page, migratetype);
 	if (cold)
-		list_add_tail(&page->lru, &pcp->list);
+		list_add_tail(&page->lru, &pcp->lists[migratetype]);
 	else
-		list_add(&page->lru, &pcp->list);
-	set_page_private(page, get_pageblock_migratetype(page));
+		list_add(&page->lru, &pcp->lists[migratetype]);
+
 	pcp->count++;
 	if (pcp->count >= pcp->high) {
-		free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
+		free_pcppages_bulk(zone, pcp->batch, pcp);
 		pcp->count -= pcp->batch;
 	}
+out:
 	local_irq_restore(flags);
 	put_cpu();
 }
@@ -1109,29 +1135,19 @@ again:
 
 		pcp = &zone_pcp(zone, cpu)->pcp;
 		local_irq_save(flags);
-		if (!pcp->count) {
-			pcp->count = rmqueue_bulk(zone, 0,
-					pcp->batch, &pcp->list, migratetype);
-			if (unlikely(!pcp->count))
+		if (list_empty(&pcp->lists[migratetype])) {
+			pcp->count += rmqueue_bulk(zone, 0, pcp->batch,
+				&pcp->lists[migratetype], migratetype);
+			if (unlikely(list_empty(&pcp->lists[migratetype])))
 				goto failed;
 		}
 
-		/* Find a page of the appropriate migrate type */
 		if (cold) {
-			list_for_each_entry_reverse(page, &pcp->list, lru)
-				if (page_private(page) == migratetype)
-					break;
+			page = list_entry(pcp->lists[migratetype].prev,
+							struct page, lru);
 		} else {
-			list_for_each_entry(page, &pcp->list, lru)
-				if (page_private(page) == migratetype)
-					break;
-		}
-
-		/* Allocate more to the pcp list if necessary */
-		if (unlikely(&page->lru == &pcp->list)) {
-			pcp->count += rmqueue_bulk(zone, 0,
-					pcp->batch, &pcp->list, migratetype);
-			page = list_entry(pcp->list.next, struct page, lru);
+			page = list_entry(pcp->lists[migratetype].next,
+							struct page, lru);
 		}
 
 		list_del(&page->lru);
@@ -2889,6 +2905,7 @@ static int zone_batchsize(struct zone *zone)
 static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
 {
 	struct per_cpu_pages *pcp;
+	int migratetype;
 
 	memset(p, 0, sizeof(*p));
 
@@ -2896,7 +2913,8 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
 	pcp->count = 0;
 	pcp->high = 6 * batch;
 	pcp->batch = max(1UL, 1 * batch);
-	INIT_LIST_HEAD(&pcp->list);
+	for (migratetype = 0; migratetype < MIGRATE_TYPES; migratetype++)
+		INIT_LIST_HEAD(&pcp->lists[migratetype]);
 }
 
 /*
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 19/19] Split per-cpu list into one-list-per-migrate-type
@ 2009-02-24 12:17   ` Mel Gorman
  0 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 12:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra

Currently the per-cpu page allocator searches the PCP list for pages of the
correct migrate-type to reduce the possibility of pages being inappropriately
placed from a fragmentation perspective. This search is potentially expensive
in a fast path and undesirable. Splitting the per-cpu list into multiple
lists increases the size of the per-cpu structure, which was potentially
a major problem at the time the search was introduced. That problem has since
been mitigated because only the necessary number of structures is now
allocated for the running system.

This patch replaces a list search in the per-cpu allocator with one list
per migrate type.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    5 ++-
 mm/page_alloc.c        |   80 +++++++++++++++++++++++++++++------------------
 2 files changed, 53 insertions(+), 32 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6089393..2a7349a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -38,6 +38,7 @@
 #define MIGRATE_UNMOVABLE     0
 #define MIGRATE_RECLAIMABLE   1
 #define MIGRATE_MOVABLE       2
+#define MIGRATE_PCPTYPES      3 /* the number of types on the pcp lists */
 #define MIGRATE_RESERVE       3
 #define MIGRATE_ISOLATE       4 /* can't allocate from here */
 #define MIGRATE_TYPES         5
@@ -167,7 +168,9 @@ struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
-	struct list_head list;	/* the list of pages */
+
+	/* Lists of pages, one per migrate type stored on the pcp-lists */
+	struct list_head lists[MIGRATE_TYPES];
 };
 
 struct per_cpu_pageset {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8a8db71..c77ca1b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -514,7 +514,7 @@ static inline int free_pages_check(struct page *page)
 }
 
 /*
- * Frees a list of pages. 
+ * Frees a number of pages from the PCP lists
  * Assumes all pages on list are in same zone, and of same order.
  * count is the number of pages to free.
  *
@@ -524,20 +524,30 @@ static inline int free_pages_check(struct page *page)
  * And clear the zone's pages_scanned counter, to hold off the "all pages are
  * pinned" detection logic.
  */
-static void free_pages_bulk(struct zone *zone, int count,
-					struct list_head *list, int order)
+static void free_pcppages_bulk(struct zone *zone, int count,
+					 struct per_cpu_pages *pcp)
 {
+	int migratetype = 0;
+
 	spin_lock(&zone->lock);
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
 	while (count--) {
 		struct page *page;
-
-		VM_BUG_ON(list_empty(list));
+		struct list_head *list;
+
+		/* Remove pages from lists in a round-robin fashion */
+		do {
+			if (migratetype == MIGRATE_PCPTYPES)
+				migratetype = 0;
+			list = &pcp->lists[migratetype];
+			migratetype++;
+		} while (list_empty(list));
+		
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_one_page list manipulates */
 		list_del(&page->lru);
-		__free_one_page(page, zone, order, page_private(page));
+		__free_one_page(page, zone, 0, page_private(page));
 	}
 	spin_unlock(&zone->lock);
 }
@@ -930,7 +940,7 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 		to_drain = pcp->batch;
 	else
 		to_drain = pcp->count;
-	free_pages_bulk(zone, to_drain, &pcp->list, 0);
+	free_pcppages_bulk(zone, to_drain, pcp);
 	pcp->count -= to_drain;
 	local_irq_restore(flags);
 }
@@ -959,7 +969,7 @@ static void drain_pages(unsigned int cpu)
 
 		pcp = &pset->pcp;
 		local_irq_save(flags);
-		free_pages_bulk(zone, pcp->count, &pcp->list, 0);
+		free_pcppages_bulk(zone, pcp->count, pcp);
 		pcp->count = 0;
 		local_irq_restore(flags);
 	}
@@ -1025,6 +1035,7 @@ static void free_hot_cold_page(struct page *page, int cold)
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
+	int migratetype;
 	int clearMlocked = PageMlocked(page);
 
 	if (PageAnon(page))
@@ -1045,16 +1056,31 @@ static void free_hot_cold_page(struct page *page, int cold)
 	if (clearMlocked)
 		free_page_mlock(page);
 
+	/*
+	 * Only store unmovable, reclaimable and movable on pcp lists.
+	 * The one concern is that if the minimum number of free pages is not
+	 * aligned to a pageblock-boundary that allocations/frees from the
+	 * MIGRATE_RESERVE pageblocks may call free_one_page() excessively
+	 */
+	migratetype = get_pageblock_migratetype(page);
+	if (migratetype >= MIGRATE_PCPTYPES) {
+		free_one_page(zone, page, 0, migratetype);
+		goto out;
+	}
+
+	/* Record the migratetype and place on the lists */
+	set_page_private(page, migratetype);
 	if (cold)
-		list_add_tail(&page->lru, &pcp->list);
+		list_add_tail(&page->lru, &pcp->lists[migratetype]);
 	else
-		list_add(&page->lru, &pcp->list);
-	set_page_private(page, get_pageblock_migratetype(page));
+		list_add(&page->lru, &pcp->lists[migratetype]);
+
 	pcp->count++;
 	if (pcp->count >= pcp->high) {
-		free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
+		free_pcppages_bulk(zone, pcp->batch, pcp);
 		pcp->count -= pcp->batch;
 	}
+out:
 	local_irq_restore(flags);
 	put_cpu();
 }
@@ -1109,29 +1135,19 @@ again:
 
 		pcp = &zone_pcp(zone, cpu)->pcp;
 		local_irq_save(flags);
-		if (!pcp->count) {
-			pcp->count = rmqueue_bulk(zone, 0,
-					pcp->batch, &pcp->list, migratetype);
-			if (unlikely(!pcp->count))
+		if (list_empty(&pcp->lists[migratetype])) {
+			pcp->count += rmqueue_bulk(zone, 0, pcp->batch,
+				&pcp->lists[migratetype], migratetype);
+			if (unlikely(list_empty(&pcp->lists[migratetype])))
 				goto failed;
 		}
 
-		/* Find a page of the appropriate migrate type */
 		if (cold) {
-			list_for_each_entry_reverse(page, &pcp->list, lru)
-				if (page_private(page) == migratetype)
-					break;
+			page = list_entry(pcp->lists[migratetype].prev,
+							struct page, lru);
 		} else {
-			list_for_each_entry(page, &pcp->list, lru)
-				if (page_private(page) == migratetype)
-					break;
-		}
-
-		/* Allocate more to the pcp list if necessary */
-		if (unlikely(&page->lru == &pcp->list)) {
-			pcp->count += rmqueue_bulk(zone, 0,
-					pcp->batch, &pcp->list, migratetype);
-			page = list_entry(pcp->list.next, struct page, lru);
+			page = list_entry(pcp->lists[migratetype].next,
+							struct page, lru);
 		}
 
 		list_del(&page->lru);
@@ -2889,6 +2905,7 @@ static int zone_batchsize(struct zone *zone)
 static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
 {
 	struct per_cpu_pages *pcp;
+	int migratetype;
 
 	memset(p, 0, sizeof(*p));
 
@@ -2896,7 +2913,8 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
 	pcp->count = 0;
 	pcp->high = 6 * batch;
 	pcp->batch = max(1UL, 1 * batch);
-	INIT_LIST_HEAD(&pcp->list);
+	for (migratetype = 0; migratetype < MIGRATE_TYPES; migratetype++)
+		INIT_LIST_HEAD(&pcp->lists[migratetype]);
 }
 
 /*
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* Re: [PATCH 04/19] Convert gfp_zone() to use a table of precalculated values
  2009-02-24 12:17   ` Mel Gorman
@ 2009-02-24 16:43     ` Christoph Lameter
  -1 siblings, 0 replies; 118+ messages in thread
From: Christoph Lameter @ 2009-02-24 16:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

On Tue, 24 Feb 2009, Mel Gorman wrote:

>  static inline enum zone_type gfp_zone(gfp_t flags)
>  {
> -#ifdef CONFIG_ZONE_DMA
> -	if (flags & __GFP_DMA)
> -		return ZONE_DMA;
> -#endif
> -#ifdef CONFIG_ZONE_DMA32
> -	if (flags & __GFP_DMA32)
> -		return ZONE_DMA32;
> -#endif
> -	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
> -			(__GFP_HIGHMEM | __GFP_MOVABLE))
> -		return ZONE_MOVABLE;
> -#ifdef CONFIG_HIGHMEM
> -	if (flags & __GFP_HIGHMEM)
> -		return ZONE_HIGHMEM;
> -#endif
> -	return ZONE_NORMAL;
> +	return gfp_zone_table[flags & GFP_ZONEMASK];
>  }

Assume

GFP_DMA		= 0x01
GFP_DMA32	= 0x02
GFP_MOVABLE	= 0x04
GFP_HIGHMEM	= 0x08

ZONE_NORMAL	= 0
ZONE_DMA	= 1
ZONE_DMA32	= 2
ZONE_MOVABLE	= 3
ZONE_HIGHMEM	= 4

then we could implement gfp_zone simply as:

static inline enum zone_type gfp_zone(gfp_t flags)
{
	return ffs(flags & 0xf);
}

However, this would return ZONE_MOVABLE if only GFP_MOVABLE would be
set but not GFP_HIGHMEM.

If we could make sure that GFP_MOVABLE always includes GFP_HIGHMEM then
this would not be a problem.
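
For reference, a minimal userspace sketch of the table approach the patch
takes; the flag bits, zone numbering and table contents below are assumptions
for illustration only, not the kernel's definitions:

#include <stdio.h>

/* Illustrative GFP zone-selector bits and zone numbering, assumed for this
 * sketch only; the real kernel values depend on the configuration */
#define __GFP_DMA      0x01u
#define __GFP_HIGHMEM  0x02u
#define __GFP_DMA32    0x04u
#define __GFP_MOVABLE  0x08u
#define GFP_ZONEMASK   0x0fu

enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

static enum zone_type gfp_zone_table[GFP_ZONEMASK + 1];

/* Evaluate the branches once for every combination of the zone-selector bits
 * so the allocator fast path becomes a single indexed load */
static void build_gfp_zone_table(void)
{
	unsigned int flags;

	for (flags = 0; flags <= GFP_ZONEMASK; flags++) {
		enum zone_type zone = ZONE_NORMAL;

		if (flags & __GFP_DMA)
			zone = ZONE_DMA;
		else if (flags & __GFP_DMA32)
			zone = ZONE_DMA32;
		else if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
			 (__GFP_HIGHMEM | __GFP_MOVABLE))
			zone = ZONE_MOVABLE;
		else if (flags & __GFP_HIGHMEM)
			zone = ZONE_HIGHMEM;

		gfp_zone_table[flags] = zone;
	}
}

static enum zone_type gfp_zone(unsigned int flags)
{
	return gfp_zone_table[flags & GFP_ZONEMASK];	/* no branches */
}

int main(void)
{
	build_gfp_zone_table();
	printf("HIGHMEM|MOVABLE -> zone %d\n",
	       gfp_zone(__GFP_HIGHMEM | __GFP_MOVABLE));	/* 4 = ZONE_MOVABLE */
	printf("MOVABLE only    -> zone %d\n",
	       gfp_zone(__GFP_MOVABLE));			/* 2 = ZONE_NORMAL */
	return 0;
}

With the precalculated table, __GFP_MOVABLE on its own still maps to
ZONE_NORMAL, which is the case the ffs() trick only handles if __GFP_MOVABLE
always implies __GFP_HIGHMEM.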

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 04/19] Convert gfp_zone() to use a table of precalculated values
@ 2009-02-24 16:43     ` Christoph Lameter
  0 siblings, 0 replies; 118+ messages in thread
From: Christoph Lameter @ 2009-02-24 16:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

On Tue, 24 Feb 2009, Mel Gorman wrote:

>  static inline enum zone_type gfp_zone(gfp_t flags)
>  {
> -#ifdef CONFIG_ZONE_DMA
> -	if (flags & __GFP_DMA)
> -		return ZONE_DMA;
> -#endif
> -#ifdef CONFIG_ZONE_DMA32
> -	if (flags & __GFP_DMA32)
> -		return ZONE_DMA32;
> -#endif
> -	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
> -			(__GFP_HIGHMEM | __GFP_MOVABLE))
> -		return ZONE_MOVABLE;
> -#ifdef CONFIG_HIGHMEM
> -	if (flags & __GFP_HIGHMEM)
> -		return ZONE_HIGHMEM;
> -#endif
> -	return ZONE_NORMAL;
> +	return gfp_zone_table[flags & GFP_ZONEMASK];
>  }

Assume

GFP_DMA		= 0x01
GFP_DMA32	= 0x02
GFP_MOVABLE	= 0x04
GFP_HIGHMEM	= 0x08

ZONE_NORMAL	= 0
ZONE_DMA	= 1
ZONE_DMA32	= 2
ZONE_MOVABLE	= 3
ZONE_HIGHMEM	= 4

then we could implement gfp_zone simply as:

static inline enum zone_type gfp_zone(gfp_t flags)
{
	return ffs(flags & 0xf);
}

However, this would return ZONE_MOVABLE if only GFP_MOVABLE would be
set but not GFP_HIGHMEM.

If we could make sure that GFP_MOVABLE always includes GFP_HIGHMEM then
this would not be a problem.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 04/19] Convert gfp_zone() to use a table of precalculated values
  2009-02-24 16:43     ` Christoph Lameter
@ 2009-02-24 17:07       ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 17:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

On Tue, Feb 24, 2009 at 11:43:29AM -0500, Christoph Lameter wrote:
> On Tue, 24 Feb 2009, Mel Gorman wrote:
> 
> >  static inline enum zone_type gfp_zone(gfp_t flags)
> >  {
> > -#ifdef CONFIG_ZONE_DMA
> > -	if (flags & __GFP_DMA)
> > -		return ZONE_DMA;
> > -#endif
> > -#ifdef CONFIG_ZONE_DMA32
> > -	if (flags & __GFP_DMA32)
> > -		return ZONE_DMA32;
> > -#endif
> > -	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
> > -			(__GFP_HIGHMEM | __GFP_MOVABLE))
> > -		return ZONE_MOVABLE;
> > -#ifdef CONFIG_HIGHMEM
> > -	if (flags & __GFP_HIGHMEM)
> > -		return ZONE_HIGHMEM;
> > -#endif
> > -	return ZONE_NORMAL;
> > +	return gfp_zone_table[flags & GFP_ZONEMASK];
> >  }
> 
> Assume
> 
> GFP_DMA		= 0x01
> GFP_DMA32	= 0x02
> GFP_MOVABLE	= 0x04
> GFP_HIGHMEM	= 0x08
> 
> ZONE_NORMAL	= 0
> ZONE_DMA	= 1
> ZONE_DMA32	= 2
> ZONE_MOVABLE	= 3
> ZONE_HIGHMEM	= 4
> 
> then we could implement gfp_zone simply as:
> 
> static inline enum zone_type gfp_zone(gfp_t flags)
> {
> 	return ffs(flags & 0xf);
> }
> 

A few points immediately spring to mind

o What's the cost of ffs?
o The altering of zone order is not without consequence. The zonelist
  walkers, for example, make the assumption that a higher zone index means a
  "higher" zone, i.e. NORMAL has a bigger index than DMA, HIGHMEM has a
  bigger index than NORMAL, etc.
o I think movable ends up the wrong "side" of highmem in terms of zone
  order with that scheme and you'd need to redo how the movable zone is
  created.

There are probably other consequences too that I haven't thought of yet.
Summary - this would not be a trivial way of fixing anything.

> However, this would return ZONE_MOVABLE if only GFP_MOVABLE would be
> set but not GFP_HIGHMEM.
> 
> If we could make sure that GFP_MOVABLE always includes GFP_HIGHMEM then
> this would not be a problem.
> 

But it wouldn't be right either. It's ok to specify __GFP_MOVABLE without
specifying __GFP_HIGHMEM. Quick grep shows it's not amazingly common but
it's allowed.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 04/19] Convert gfp_zone() to use a table of precalculated values
@ 2009-02-24 17:07       ` Mel Gorman
  0 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 17:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

On Tue, Feb 24, 2009 at 11:43:29AM -0500, Christoph Lameter wrote:
> On Tue, 24 Feb 2009, Mel Gorman wrote:
> 
> >  static inline enum zone_type gfp_zone(gfp_t flags)
> >  {
> > -#ifdef CONFIG_ZONE_DMA
> > -	if (flags & __GFP_DMA)
> > -		return ZONE_DMA;
> > -#endif
> > -#ifdef CONFIG_ZONE_DMA32
> > -	if (flags & __GFP_DMA32)
> > -		return ZONE_DMA32;
> > -#endif
> > -	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
> > -			(__GFP_HIGHMEM | __GFP_MOVABLE))
> > -		return ZONE_MOVABLE;
> > -#ifdef CONFIG_HIGHMEM
> > -	if (flags & __GFP_HIGHMEM)
> > -		return ZONE_HIGHMEM;
> > -#endif
> > -	return ZONE_NORMAL;
> > +	return gfp_zone_table[flags & GFP_ZONEMASK];
> >  }
> 
> Assume
> 
> GFP_DMA		= 0x01
> GFP_DMA32	= 0x02
> GFP_MOVABLE	= 0x04
> GFP_HIGHMEM	= 0x08
> 
> ZONE_NORMAL	= 0
> ZONE_DMA	= 1
> ZONE_DMA32	= 2
> ZONE_MOVABLE	= 3
> ZONE_HIGHMEM	= 4
> 
> then we could implement gfp_zone simply as:
> 
> static inline enum zone_type gfp_zone(gfp_t flags)
> {
> 	return ffs(flags & 0xf);
> }
> 

A few points immediately spring to mind

o What's the cost of ffs?
o The altering of zone order is not without consequence. The zonelist
  walkers, for example, make the assumption that a higher zone index means a
  "higher" zone, i.e. NORMAL has a bigger index than DMA, HIGHMEM has a
  bigger index than NORMAL, etc.
o I think movable ends up the wrong "side" of highmem in terms of zone
  order with that scheme and you'd need to redo how the movable zone is
  created.

There are probably other consequences too that I haven't thought of yet.
Summary - this would not be a trivial way of fixing anything.

> However, this would return ZONE_MOVABLE if only GFP_MOVABLE would be
> set but not GFP_HIGHMEM.
> 
> If we could make sure that GFP_MOVABLE always includes GFP_HIGHMEM then
> this would not be a problem.
> 

But it wouldn't be right either. It's ok to specify __GFP_MOVABLE without
specifying __GFP_HIGHMEM. Quick grep shows it's not amazingly common but
it's allowed.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 03/19] Do not check NUMA node ID when the caller knows the node is valid
  2009-02-24 12:16   ` Mel Gorman
@ 2009-02-24 17:17     ` Christoph Lameter
  -1 siblings, 0 replies; 118+ messages in thread
From: Christoph Lameter @ 2009-02-24 17:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

This is certainly reducing the number of branches that are inlined into
the kernel code.


Reviewed-by: Christoph Lameter <cl@linux-foundation.org>



^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 03/19] Do not check NUMA node ID when the caller knows the node is valid
@ 2009-02-24 17:17     ` Christoph Lameter
  0 siblings, 0 replies; 118+ messages in thread
From: Christoph Lameter @ 2009-02-24 17:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

This is certainly reducing the number of branches that are inlined into
the kernel code.


Reviewed-by: Christoph Lameter <cl@linux-foundation.org>



^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 06/19] Check only once if the zonelist is suitable for the allocation
  2009-02-24 12:17   ` Mel Gorman
@ 2009-02-24 17:24     ` Christoph Lameter
  -1 siblings, 0 replies; 118+ messages in thread
From: Christoph Lameter @ 2009-02-24 17:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

On Tue, 24 Feb 2009, Mel Gorman wrote:

> It is possible with __GFP_THISNODE that no zones are suitable. This
> patch makes sure the check is only made once.

GFP_THISNODE is only a performance factor if SLAB is the slab allocator.
The restart logic in __alloc_pages_internal() is mainly used by OOM
processing.

But the patch looks okay regardless...

Reviewed-by: Christoph Lameter <cl@linux-foundation.org>

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 06/19] Check only once if the zonelist is suitable for the allocation
@ 2009-02-24 17:24     ` Christoph Lameter
  0 siblings, 0 replies; 118+ messages in thread
From: Christoph Lameter @ 2009-02-24 17:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

On Tue, 24 Feb 2009, Mel Gorman wrote:

> It is possible with __GFP_THISNODE that no zones are suitable. This
> patch makes sure the check is only made once.

GFP_THISNODE is only a performance factor if SLAB is the slab allocator.
The restart logic in __alloc_pages_internal() is mainly used by OOM
processing.

But the patch looks okay regardless...

Reviewed-by: Christoph Lameter <cl@linux-foundation.org>


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/19] Simplify the check on whether cpusets are a factor or not
  2009-02-24 12:17   ` Mel Gorman
@ 2009-02-24 17:27     ` Christoph Lameter
  -1 siblings, 0 replies; 118+ messages in thread
From: Christoph Lameter @ 2009-02-24 17:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

On Tue, 24 Feb 2009, Mel Gorman wrote:

> @@ -1420,8 +1429,8 @@ zonelist_scan:
>  		if (NUMA_BUILD && zlc_active &&
>  			!zlc_zone_worth_trying(zonelist, z, allowednodes))
>  				continue;
> -		if ((alloc_flags & ALLOC_CPUSET) &&
> -			!cpuset_zone_allowed_softwall(zone, gfp_mask))
> +		if (alloc_cpuset)
> +			if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
>  				goto try_next_zone;

Hmmm... Why remove the && here? Looks more confusing to me.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/19] Simplify the check on whether cpusets are a factor or not
@ 2009-02-24 17:27     ` Christoph Lameter
  0 siblings, 0 replies; 118+ messages in thread
From: Christoph Lameter @ 2009-02-24 17:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

On Tue, 24 Feb 2009, Mel Gorman wrote:

> @@ -1420,8 +1429,8 @@ zonelist_scan:
>  		if (NUMA_BUILD && zlc_active &&
>  			!zlc_zone_worth_trying(zonelist, z, allowednodes))
>  				continue;
> -		if ((alloc_flags & ALLOC_CPUSET) &&
> -			!cpuset_zone_allowed_softwall(zone, gfp_mask))
> +		if (alloc_cpuset)
> +			if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
>  				goto try_next_zone;

Hmmm... Why remove the && here? Looks more confusing to me.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 10/19] Calculate the preferred zone for allocation only once
  2009-02-24 12:17   ` Mel Gorman
@ 2009-02-24 17:31     ` Christoph Lameter
  -1 siblings, 0 replies; 118+ messages in thread
From: Christoph Lameter @ 2009-02-24 17:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

On Tue, 24 Feb 2009, Mel Gorman wrote:

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6f26944..074f9a6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1399,24 +1399,19 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
>   */
>  static struct page *
>  get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
> -		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
> +		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
> +		struct zone *preferred_zone)
>  {

This gets into quite a number of parameters now. Pass a structure like in
vmscan.c? Or simplify things to be able to run get_page_from_freelist with
fewer parameters? The number of parameters seems to be too high for a
fastpath function.
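
A minimal sketch of the kind of context structure being suggested; the name,
fields and stubbed types below are hypothetical, loosely modelled on
struct scan_control in vmscan.c rather than any existing kernel API:

#include <stdio.h>

/* Stand-ins for the real kernel types, only so the sketch is self-contained */
typedef unsigned int gfp_t;
struct zone { int idx; };
struct zonelist { struct zone *zones; };
typedef struct { unsigned long bits[1]; } nodemask_t;

/* Hypothetical bundle computed once per allocation and then passed by
 * pointer, instead of pushing seven scalars on every call */
struct alloc_context {
	gfp_t gfp_mask;
	nodemask_t *nodemask;
	unsigned int order;
	struct zonelist *zonelist;
	int high_zoneidx;
	int alloc_flags;
	struct zone *preferred_zone;
};

static void *get_page_from_freelist(const struct alloc_context *ac)
{
	/* The real function walks ac->zonelist; only the calling shape is shown */
	printf("order=%u high_zoneidx=%d preferred zone index=%d\n",
	       ac->order, ac->high_zoneidx,
	       ac->preferred_zone ? ac->preferred_zone->idx : -1);
	return NULL;
}

int main(void)
{
	struct zone normal = { .idx = 2 };
	struct alloc_context ac = {
		.order = 0,
		.high_zoneidx = 2,
		.preferred_zone = &normal,
	};

	get_page_from_freelist(&ac);
	return 0;
}

Whether the extra indirection on every field access inside the hot loop costs
more than pushing the individual arguments is the trade-off weighed in the
reply below.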


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 10/19] Calculate the preferred zone for allocation only once
@ 2009-02-24 17:31     ` Christoph Lameter
  0 siblings, 0 replies; 118+ messages in thread
From: Christoph Lameter @ 2009-02-24 17:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

On Tue, 24 Feb 2009, Mel Gorman wrote:

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6f26944..074f9a6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1399,24 +1399,19 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
>   */
>  static struct page *
>  get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
> -		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
> +		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
> +		struct zone *preferred_zone)
>  {

This gets into quite a number of parameters now. Pass a structure like in
vmscan.c? Or simplify things to be able to run get_page_from_freelist with
fewer parameters? The number of parameters seems to be too high for a
fastpath function.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 10/19] Calculate the preferred zone for allocation only once
  2009-02-24 17:31     ` Christoph Lameter
@ 2009-02-24 17:53       ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 17:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

On Tue, Feb 24, 2009 at 12:31:41PM -0500, Christoph Lameter wrote:
> On Tue, 24 Feb 2009, Mel Gorman wrote:
> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6f26944..074f9a6 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1399,24 +1399,19 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
> >   */
> >  static struct page *
> >  get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
> > -		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
> > +		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
> > +		struct zone *preferred_zone)
> >  {
> 
> This gets into quite a number of parameters now. Pass a structure like in
> vmscan.c?

I considered it, but thought that multiple offsets into structures might
exceed the cost of pushing the parameters onto the stack. I never
actually looked at the generated assembly though to make a proper
assessment.

> Or simplify things to be able to run get_page_from_freelist with
> fewer parameters? The number of parameters seems to be too high for a
> fastpath function.
> 

Which is why I ended up inlining get_page_from_freelist() in V1. It's a rock
and a hard place basically. Passing parameters is expensive, but calculating
the information multiple times is too.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 10/19] Calculate the preferred zone for allocation only once
@ 2009-02-24 17:53       ` Mel Gorman
  0 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 17:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

On Tue, Feb 24, 2009 at 12:31:41PM -0500, Christoph Lameter wrote:
> On Tue, 24 Feb 2009, Mel Gorman wrote:
> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6f26944..074f9a6 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1399,24 +1399,19 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
> >   */
> >  static struct page *
> >  get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
> > -		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
> > +		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
> > +		struct zone *preferred_zone)
> >  {
> 
> This gets into quite a number of parameters now. Pass a structure like in
> vmscan.c?

I considered it, but thought that multiple offsets into structures might
exceed the cost of pushing the parameters onto the stack. I never
actually looked at the generated assembly though to make a proper
assessment.

> Or simplify things to be able to run get_page_from_freelist with
> fewer parameters? The number of parameters seems to be too high for a
> fastpath function.
> 

Which is why I ended up inlining get_page_from_freelist() in V1. It's a rock
and a hard place basically. Passing parameters is expensive, but calculating
the information multiple times is too.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/19] Simplify the check on whether cpusets are a factor or not
  2009-02-24 17:27     ` Christoph Lameter
@ 2009-02-24 17:55       ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 17:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

On Tue, Feb 24, 2009 at 12:27:02PM -0500, Christoph Lameter wrote:
> On Tue, 24 Feb 2009, Mel Gorman wrote:
> 
> > @@ -1420,8 +1429,8 @@ zonelist_scan:
> >  		if (NUMA_BUILD && zlc_active &&
> >  			!zlc_zone_worth_trying(zonelist, z, allowednodes))
> >  				continue;
> > -		if ((alloc_flags & ALLOC_CPUSET) &&
> > -			!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > +		if (alloc_cpuset)
> > +			if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> >  				goto try_next_zone;
> 
> Hmmm... Why remove the && here? Looks more confusing to me.
> 

At the time, just because it was what I was splitting out. Chances are
it makes no difference to the assembly. I'll double check and if not,
switch it back.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/19] Simplify the check on whether cpusets are a factor or not
@ 2009-02-24 17:55       ` Mel Gorman
  0 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-24 17:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra

On Tue, Feb 24, 2009 at 12:27:02PM -0500, Christoph Lameter wrote:
> On Tue, 24 Feb 2009, Mel Gorman wrote:
> 
> > @@ -1420,8 +1429,8 @@ zonelist_scan:
> >  		if (NUMA_BUILD && zlc_active &&
> >  			!zlc_zone_worth_trying(zonelist, z, allowednodes))
> >  				continue;
> > -		if ((alloc_flags & ALLOC_CPUSET) &&
> > -			!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > +		if (alloc_cpuset)
> > +			if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> >  				goto try_next_zone;
> 
> Hmmm... Why remove the && here? Looks more confusing to me.
> 

At the time, just because it was what I was splitting out. Chances are
it makes no difference to the assembly. I'll double check and if not,
switch it back.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-02-24 12:16 ` Mel Gorman
@ 2009-02-26  9:10   ` Lin Ming
  -1 siblings, 0 replies; 118+ messages in thread
From: Lin Ming @ 2009-02-26  9:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Zhang Yanmin, Peter Zijlstra

We tested this v2 patch series with 2.6.29-rc6 on different machines.

		4P qual-core	2P qual-core	2P qual-core HT
		tigerton	stockley	Nehalem
		------------------------------------------------
tbench		+3%		+2%		0%
oltp		-2%		0%		0%
aim7		0%		0%		0%
specjbb2005	+3%		0%		0%
hackbench	0%		0%		0%	

netperf:
TCP-S-112k	0%		-1%		0%
TCP-S-64k	0%		-1%		+1%
TCP-RR-1	0%		0%		+1%
UDP-U-4k	-2%		0%		-2%
UDP-U-1k	+3%		0%		0%
UDP-RR-1	0%		0%		0%
UDP-RR-512	-1%		0%		+1%

Lin Ming

On Tue, 2009-02-24 at 20:16 +0800, Mel Gorman wrote:
> Still a work in progress but enough has changed that I want to show what
> it current looks like. Performance is still improved a little but there are
> some large outstanding pieces of fruit
> 
> 1. Improving free_pcppages_bulk() does a lot of looping, maybe could be better
> 2. gfp_zone() is still using a cache line for data. I wasn't able to translate
>    Kamezawa-sans suggestion into usable code
> 
> The following two items should be picked up in a second or third pass at
> improving the page allocator
> 
> 1. Working out if knowing whether pages are cold/hot on free is worth it or
>    not
> 2. Precalculating zonelists for cpusets (Andi described how it could be done,
>    it's straight-forward, just will take time but it doesn't affect the
>    majority of users)
> 
> Changes since V1
>   o Remove the ifdef CONFIG_CPUSETS from inside get_page_from_freelist()
>   o Use non-lock bit operations for clearing the mlock flag
>   o Factor out alloc_flags calculation so it is only done once (Peter)
>   o Make gfp.h a bit prettier and clear-cut (Peter)
>   o Instead of deleting a debugging check, replace page_count() in the
>     free path with a version that does not check for compound pages (Nick)
>   o Drop the alteration for hot/cold page freeing until we know if it
>     helps or not
> 
> The complexity of the page allocator has been increasing for some time
> and it has now reached the point where the SLUB allocator is doing strange
> tricks to avoid the page allocator. This is obviously bad as it may encourage
> other subsystems to try avoiding the page allocator as well.
> 
> This series of patches is intended to reduce the cost of the page
> allocator by doing the following.
> 
> Patches 1-3 iron out the entry paths slightly and remove stupid sanity
> checks from the fast path.
> 
> Patch 4 uses a lookup table instead of a number of branches to decide what
> zones are usable given the GFP flags.
> 
> Patch 5 tidies up some flags
> 
> Patch 6 avoids repeated checks of the zonelist
> 
> Patch 7 breaks the allocator up into a fast and slow path where the fast
> path later becomes one long inlined function.
> 
> Patches 8-12 avoids calculating the same things repeatedly and instead
> calculates them once.
> 
> Patches 13-14 inline parts of the allocator fast path
> 
> Patch 15 avoids calling get_pageblock_migratetype() potentially twice on
> every page free
> 
> Patch 16 reduces the number of times interrupts are disabled by reworking
> what free_page_mlock() does and not using locked versions of bit operations.
> 
> Patch 17 avoids using the zonelist cache on non-NUMA machines
> 
> Patch 18 simplifies some debugging checks made during alloc and free.
> 
> Patch 19 avoids a list search in the allocator fast path.
> 
> Running all of these through a profiler shows me the cost of page allocation
> and freeing is reduced by a nice amount without drastically altering how the
> allocator actually works. Excluding the cost of zeroing pages, the cost of
> allocation is reduced by 25% and the cost of freeing by 12%.  Again excluding
> zeroing a page, much of the remaining cost is due to counters, debugging
> checks and interrupt disabling.  Of course when a page has to be zeroed,
> the dominant cost of a page allocation is zeroing it.
> 
> These patches reduce the text size of the kernel by 180 bytes on the one
> x86-64 machine I checked.
> 
> Range of results (positive is good) on 7 machines that completed tests.
> 
> o Kernbench elapsed time	-0.04	to	0.79%
> o Kernbench system time		0 	to	3.74%
> o tbench			-2.85%  to	5.52%
> o Hackbench-sockets		all differences within  noise
> o Hackbench-pipes		-2.98%  to	9.11%
> o Sysbench			-0.04%  to	5.50%
> 
> With hackbench-pipes, only 2 machines out of 7 showed results outside of
> the noise. In almost all cases the strandard deviation between runs of
> hackbench-pipes was reduced with the patches.
> 
> I still haven't run a page-allocator micro-benchmark to see what sort of
> figures that gives.
> 
>  arch/ia64/hp/common/sba_iommu.c   |    2 
>  arch/ia64/kernel/mca.c            |    3 
>  arch/ia64/kernel/uncached.c       |    3 
>  arch/ia64/sn/pci/pci_dma.c        |    3 
>  arch/powerpc/platforms/cell/ras.c |    2 
>  arch/x86/kvm/vmx.c                |    2 
>  drivers/misc/sgi-gru/grufile.c    |    2 
>  drivers/misc/sgi-xp/xpc_uv.c      |    2 
>  include/linux/cpuset.h            |    2 
>  include/linux/gfp.h               |   62 +--
>  include/linux/mm.h                |    1 
>  include/linux/mmzone.h            |    8 
>  init/main.c                       |    1 
>  kernel/profile.c                  |    8 
>  mm/filemap.c                      |    2 
>  mm/hugetlb.c                      |    4 
>  mm/internal.h                     |   11 
>  mm/mempolicy.c                    |    2 
>  mm/migrate.c                      |    2 
>  mm/page_alloc.c                   |  642 +++++++++++++++++++++++++-------------
>  mm/slab.c                         |    4 
>  mm/slob.c                         |    4 
>  mm/vmalloc.c                      |    1 
>  23 files changed, 490 insertions(+), 283 deletions(-)


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
@ 2009-02-26  9:10   ` Lin Ming
  0 siblings, 0 replies; 118+ messages in thread
From: Lin Ming @ 2009-02-26  9:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Zhang Yanmin, Peter Zijlstra

We tested this v2 patch series with 2.6.29-rc6 on different machines.

		4P qual-core	2P qual-core	2P qual-core HT
		tigerton	stockley	Nehalem
		------------------------------------------------
tbench		+3%		+2%		0%
oltp		-2%		0%		0%
aim7		0%		0%		0%
specjbb2005	+3%		0%		0%
hackbench	0%		0%		0%	

netperf:
TCP-S-112k	0%		-1%		0%
TCP-S-64k	0%		-1%		+1%
TCP-RR-1	0%		0%		+1%
UDP-U-4k	-2%		0%		-2%
UDP-U-1k	+3%		0%		0%
UDP-RR-1	0%		0%		0%
UDP-RR-512	-1%		0%		+1%

Lin Ming

On Tue, 2009-02-24 at 20:16 +0800, Mel Gorman wrote:
> Still a work in progress but enough has changed that I want to show what
> it current looks like. Performance is still improved a little but there are
> some large outstanding pieces of fruit
> 
> 1. Improving free_pcppages_bulk() does a lot of looping, maybe could be better
> 2. gfp_zone() is still using a cache line for data. I wasn't able to translate
>    Kamezawa-sans suggestion into usable code
> 
> The following two items should be picked up in a second or third pass at
> improving the page allocator
> 
> 1. Working out if knowing whether pages are cold/hot on free is worth it or
>    not
> 2. Precalculating zonelists for cpusets (Andi described how it could be done,
>    it's straight-forward, just will take time but it doesn't affect the
>    majority of users)
> 
> Changes since V1
>   o Remove the ifdef CONFIG_CPUSETS from inside get_page_from_freelist()
>   o Use non-lock bit operations for clearing the mlock flag
>   o Factor out alloc_flags calculation so it is only done once (Peter)
>   o Make gfp.h a bit prettier and clear-cut (Peter)
>   o Instead of deleting a debugging check, replace page_count() in the
>     free path with a version that does not check for compound pages (Nick)
>   o Drop the alteration for hot/cold page freeing until we know if it
>     helps or not
> 
> The complexity of the page allocator has been increasing for some time
> and it has now reached the point where the SLUB allocator is doing strange
> tricks to avoid the page allocator. This is obviously bad as it may encourage
> other subsystems to try avoiding the page allocator as well.
> 
> This series of patches is intended to reduce the cost of the page
> allocator by doing the following.
> 
> Patches 1-3 iron out the entry paths slightly and remove stupid sanity
> checks from the fast path.
> 
> Patch 4 uses a lookup table instead of a number of branches to decide what
> zones are usable given the GFP flags.
> 
> Patch 5 tidies up some flags
> 
> Patch 6 avoids repeated checks of the zonelist
> 
> Patch 7 breaks the allocator up into a fast and slow path where the fast
> path later becomes one long inlined function.
> 
> Patches 8-12 avoids calculating the same things repeatedly and instead
> calculates them once.
> 
> Patches 13-14 inline parts of the allocator fast path
> 
> Patch 15 avoids calling get_pageblock_migratetype() potentially twice on
> every page free
> 
> Patch 16 reduces the number of times interrupts are disabled by reworking
> what free_page_mlock() does and not using locked versions of bit operations.
> 
> Patch 17 avoids using the zonelist cache on non-NUMA machines
> 
> Patch 18 simplifies some debugging checks made during alloc and free.
> 
> Patch 19 avoids a list search in the allocator fast path.
> 
> Running all of these through a profiler shows me the cost of page allocation
> and freeing is reduced by a nice amount without drastically altering how the
> allocator actually works. Excluding the cost of zeroing pages, the cost of
> allocation is reduced by 25% and the cost of freeing by 12%.  Again excluding
> zeroing a page, much of the remaining cost is due to counters, debugging
> checks and interrupt disabling.  Of course when a page has to be zeroed,
> the dominant cost of a page allocation is zeroing it.
> 
> These patches reduce the text size of the kernel by 180 bytes on the one
> x86-64 machine I checked.
> 
> Range of results (positive is good) on 7 machines that completed tests.
> 
> o Kernbench elapsed time	-0.04	to	0.79%
> o Kernbench system time		0 	to	3.74%
> o tbench			-2.85%  to	5.52%
> o Hackbench-sockets		all differences within  noise
> o Hackbench-pipes		-2.98%  to	9.11%
> o Sysbench			-0.04%  to	5.50%
> 
> With hackbench-pipes, only 2 machines out of 7 showed results outside of
> the noise. In almost all cases the strandard deviation between runs of
> hackbench-pipes was reduced with the patches.
> 
> I still haven't run a page-allocator micro-benchmark to see what sort of
> figures that gives.
> 
>  arch/ia64/hp/common/sba_iommu.c   |    2 
>  arch/ia64/kernel/mca.c            |    3 
>  arch/ia64/kernel/uncached.c       |    3 
>  arch/ia64/sn/pci/pci_dma.c        |    3 
>  arch/powerpc/platforms/cell/ras.c |    2 
>  arch/x86/kvm/vmx.c                |    2 
>  drivers/misc/sgi-gru/grufile.c    |    2 
>  drivers/misc/sgi-xp/xpc_uv.c      |    2 
>  include/linux/cpuset.h            |    2 
>  include/linux/gfp.h               |   62 +--
>  include/linux/mm.h                |    1 
>  include/linux/mmzone.h            |    8 
>  init/main.c                       |    1 
>  kernel/profile.c                  |    8 
>  mm/filemap.c                      |    2 
>  mm/hugetlb.c                      |    4 
>  mm/internal.h                     |   11 
>  mm/mempolicy.c                    |    2 
>  mm/migrate.c                      |    2 
>  mm/page_alloc.c                   |  642 +++++++++++++++++++++++++-------------
>  mm/slab.c                         |    4 
>  mm/slob.c                         |    4 
>  mm/vmalloc.c                      |    1 
>  23 files changed, 490 insertions(+), 283 deletions(-)


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-02-26  9:10   ` Lin Ming
@ 2009-02-26  9:26     ` Pekka Enberg
  -1 siblings, 0 replies; 118+ messages in thread
From: Pekka Enberg @ 2009-02-26  9:26 UTC (permalink / raw)
  To: Lin Ming
  Cc: Mel Gorman, Linux Memory Management List, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Zhang Yanmin, Peter Zijlstra

On Thu, Feb 26, 2009 at 11:10 AM, Lin Ming <ming.m.lin@intel.com> wrote:
> We tested this v2 patch series with 2.6.29-rc6 on different machines.

What .config is this? Specifically, is SLUB or SLAB used here?

>
>                4P qual-core    2P qual-core    2P qual-core HT
>                tigerton        stockley        Nehalem
>                ------------------------------------------------
> tbench          +3%             +2%             0%
> oltp            -2%             0%              0%
> aim7            0%              0%              0%
> specjbb2005     +3%             0%              0%
> hackbench       0%              0%              0%
>
> netperf:
> TCP-S-112k      0%              -1%             0%
> TCP-S-64k       0%              -1%             +1%
> TCP-RR-1        0%              0%              +1%
> UDP-U-4k        -2%             0%              -2%
> UDP-U-1k        +3%             0%              0%
> UDP-RR-1        0%              0%              0%
> UDP-RR-512      -1%             0%              +1%

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
@ 2009-02-26  9:26     ` Pekka Enberg
  0 siblings, 0 replies; 118+ messages in thread
From: Pekka Enberg @ 2009-02-26  9:26 UTC (permalink / raw)
  To: Lin Ming
  Cc: Mel Gorman, Linux Memory Management List, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Zhang Yanmin, Peter Zijlstra

On Thu, Feb 26, 2009 at 11:10 AM, Lin Ming <ming.m.lin@intel.com> wrote:
> We tested this v2 patch series with 2.6.29-rc6 on different machines.

What .config is this? Specifically, is SLUB or SLAB used here?

>
>                4P qual-core    2P qual-core    2P qual-core HT
>                tigerton        stockley        Nehalem
>                ------------------------------------------------
> tbench          +3%             +2%             0%
> oltp            -2%             0%              0%
> aim7            0%              0%              0%
> specjbb2005     +3%             0%              0%
> hackbench       0%              0%              0%
>
> netperf:
> TCP-S-112k      0%              -1%             0%
> TCP-S-64k       0%              -1%             +1%
> TCP-RR-1        0%              0%              +1%
> UDP-U-4k        -2%             0%              -2%
> UDP-U-1k        +3%             0%              0%
> UDP-RR-1        0%              0%              0%
> UDP-RR-512      -1%             0%              +1%
N‹§²æìr¸›zǧu©ž²Æ {\b­†éì¹»\x1c®&Þ–)îÆi¢žØ^n‡r¶‰šŽŠÝ¢j$½§$¢¸\x05¢¹¨­è§~Š'.)îÄÃ,yèm¶ŸÿÃ\f%Š{±šj+ƒðèž×¦j)Z†·Ÿ

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-02-26  9:26     ` Pekka Enberg
@ 2009-02-26  9:27       ` Lin Ming
  -1 siblings, 0 replies; 118+ messages in thread
From: Lin Ming @ 2009-02-26  9:27 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Mel Gorman, Linux Memory Management List, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Zhang Yanmin, Peter Zijlstra

On Thu, 2009-02-26 at 17:26 +0800, Pekka Enberg wrote:
> On Thu, Feb 26, 2009 at 11:10 AM, Lin Ming <ming.m.lin@intel.com> wrote:
> > We tested this v2 patch series with 2.6.29-rc6 on different machines.
> 
> What .config is this? Specifically, is SLUB or SLAB used here?

SLUB used.
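
For reference, the selected allocator is also visible to kernel C code through
the usual Kconfig-generated defines shown in the .config below (CONFIG_SLUB=y
here). A minimal sketch, assuming only the standard CONFIG_* preprocessor
symbols and not anything specific to this series:

/* Minimal sketch: pick a name for the configured slab allocator at build
 * time via the Kconfig-generated defines (CONFIG_SLUB is set in this .config). */
#if defined(CONFIG_SLUB)
# define SLAB_ALLOCATOR "SLUB"
#elif defined(CONFIG_SLAB)
# define SLAB_ALLOCATOR "SLAB"
#else
# define SLAB_ALLOCATOR "SLOB"
#endif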

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.29-rc6
# Wed Feb 25 00:38:09 2009
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
# CONFIG_KTIME_SCALAR is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION="-mg-v2"
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# CONFIG_TASK_XACCT is not set
# CONFIG_AUDIT is not set

#
# RCU Subsystem
#
CONFIG_CLASSIC_RCU=y
# CONFIG_TREE_RCU is not set
# CONFIG_PREEMPT_RCU is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_USER_SCHED=y
# CONFIG_CGROUP_SCHED is not set
# CONFIG_CGROUPS is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
# CONFIG_NET_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_COMPAT_BRK=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
# CONFIG_MARKERS is not set
CONFIG_OPROFILE=y
# CONFIG_OPROFILE_IBS is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_BLK_DEV_BSG is not set
# CONFIG_BLK_DEV_INTEGRITY is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_FREEZER=y

#
# Processor type and features
#
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
# CONFIG_SPARSE_IRQ is not set
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_VSMP is not set
CONFIG_SCHED_OMIT_FRAME_POINTER=y
# CONFIG_PARAVIRT_GUEST is not set
# CONFIG_MEMTEST is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_CPU=y
CONFIG_X86_L1_CACHE_BYTES=128
CONFIG_X86_INTERNODE_CACHE_BYTES=128
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=7
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR_64=y
CONFIG_X86_DS=y
CONFIG_X86_PTRACE_BTS=y
CONFIG_HPET_TIMER=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
# CONFIG_AMD_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
# CONFIG_IOMMU_API is not set
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
# CONFIG_I8K is not set
# CONFIG_MICROCODE is not set
# CONFIG_X86_MSR is not set
# CONFIG_X86_CPUID is not set
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
# CONFIG_NUMA is not set
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_FLATMEM_MANUAL is not set
# CONFIG_DISCONTIGMEM_MANUAL is not set
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_UNEVICTABLE_LRU=y
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_X86_RESERVE_LOW_64K=y
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
# CONFIG_X86_PAT is not set
# CONFIG_EFI is not set
# CONFIG_SECCOMP is not set
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
# CONFIG_SCHED_HRTICK is not set
CONFIG_KEXEC=y
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x200000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x200000
CONFIG_HOTPLUG_CPU=y
CONFIG_COMPAT_VDSO=y
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

#
# Power management and ACPI options
#
CONFIG_PM=y
# CONFIG_PM_DEBUG is not set
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM_SLEEP=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
# CONFIG_HIBERNATION is not set
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROCFS_POWER=y
CONFIG_ACPI_SYSFS_POWER=y
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
# CONFIG_ACPI_SBS is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_DEBUG=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_STAT_DETAILS=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y

#
# CPUFreq processor drivers
#
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_POWERNOW_K8=y
CONFIG_X86_POWERNOW_K8_ACPI=y
CONFIG_X86_SPEEDSTEP_CENTRINO=y
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# CONFIG_X86_SPEEDSTEP_LIB is not set
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y

#
# Memory power savings
#
# CONFIG_I7300_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
# CONFIG_DMAR is not set
# CONFIG_INTR_REMAP is not set
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_PCIEAER=y
# CONFIG_PCIEASPM is not set
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
CONFIG_PCI_LEGACY=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_STUB is not set
CONFIG_HT_IRQ=y
CONFIG_ISA_DMA_API=y
CONFIG_K8_NB=y
CONFIG_PCCARD=y
# CONFIG_PCMCIA_DEBUG is not set
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_PCMCIA_IOCTL=y
CONFIG_CARDBUS=y

#
# PC-card bridges
#
CONFIG_YENTA=y
CONFIG_YENTA_O2=y
CONFIG_YENTA_RICOH=y
CONFIG_YENTA_TI=y
CONFIG_YENTA_ENE_TUNE=y
CONFIG_YENTA_TOSHIBA=y
# CONFIG_PD6729 is not set
# CONFIG_I82092 is not set
CONFIG_PCCARD_NONSTATIC=y
CONFIG_HOTPLUG_PCI=y
# CONFIG_HOTPLUG_PCI_FAKE is not set
CONFIG_HOTPLUG_PCI_ACPI=y
# CONFIG_HOTPLUG_PCI_ACPI_IBM is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=y
CONFIG_IA32_EMULATION=y
CONFIG_IA32_AOUT=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_COMPAT_NET_DEV_OPS=y
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=y
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
# CONFIG_XFRM_STATISTICS is not set
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
CONFIG_INET_TUNNEL=y
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
CONFIG_INET_XFRM_MODE_BEET=y
# CONFIG_INET_LRO is not set
# CONFIG_INET_DIAG is not set
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=y
# CONFIG_TCP_CONG_CUBIC is not set
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
CONFIG_DEFAULT_BIC=y
# CONFIG_DEFAULT_CUBIC is not set
# CONFIG_DEFAULT_HTCP is not set
# CONFIG_DEFAULT_VEGAS is not set
# CONFIG_DEFAULT_WESTWOOD is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="bic"
# CONFIG_TCP_MD5SIG is not set
CONFIG_IPV6=y
# CONFIG_IPV6_PRIVACY is not set
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
CONFIG_INET6_XFRM_MODE_TRANSPORT=y
CONFIG_INET6_XFRM_MODE_TUNNEL=y
CONFIG_INET6_XFRM_MODE_BEET=y
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=y
CONFIG_IPV6_NDISC_NODETYPE=y
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_IPV6_MROUTE is not set
CONFIG_NETWORK_SECMARK=y
# CONFIG_NETFILTER is not set
# CONFIG_IP_DCCP is not set
CONFIG_IP_SCTP=y
# CONFIG_SCTP_DBG_MSG is not set
# CONFIG_SCTP_DBG_OBJCNT is not set
# CONFIG_SCTP_HMAC_NONE is not set
# CONFIG_SCTP_HMAC_SHA1 is not set
CONFIG_SCTP_HMAC_MD5=y
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_BRIDGE is not set
# CONFIG_NET_DSA is not set
CONFIG_VLAN_8021Q=y
# CONFIG_VLAN_8021Q_GVRP is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_NET_SCHED is not set
# CONFIG_DCB is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
# CONFIG_PHONET is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
# CONFIG_CFG80211 is not set
CONFIG_WIRELESS_OLD_REGULATORY=y
# CONFIG_WIRELESS_EXT is not set
# CONFIG_LIB80211 is not set
# CONFIG_MAC80211 is not set
# CONFIG_WIMAX is not set
# CONFIG_RFKILL is not set
# CONFIG_NET_9P is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_MTD is not set
# CONFIG_PARPORT is not set
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=16384
# CONFIG_BLK_DEV_XIP is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_BLK_DEV_HD is not set
CONFIG_MISC_DEVICES=y
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_SGI_XP is not set
# CONFIG_HP_ILO is not set
# CONFIG_SGI_GRU is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_93CX6 is not set
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_RAID_ATTRS=y
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
# CONFIG_SCSI_TGT is not set
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
# CONFIG_SCSI_SCAN_ASYNC is not set
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=y
CONFIG_SCSI_FC_ATTRS=y
CONFIG_SCSI_ISCSI_ATTRS=y
CONFIG_SCSI_SAS_ATTRS=y
CONFIG_SCSI_SAS_LIBSAS=y
# CONFIG_SCSI_SAS_ATA is not set
CONFIG_SCSI_SAS_HOST_SMP=y
# CONFIG_SCSI_SAS_LIBSAS_DEBUG is not set
# CONFIG_SCSI_SRP_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
# CONFIG_SCSI_CXGB3_ISCSI is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
CONFIG_SCSI_ACARD=y
CONFIG_SCSI_AACRAID=y
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=4
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
# CONFIG_AIC7XXX_DEBUG_ENABLE is not set
CONFIG_AIC7XXX_DEBUG_MASK=0
# CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC7XXX_OLD=y
CONFIG_SCSI_AIC79XX=y
CONFIG_AIC79XX_CMDS_PER_DEVICE=4
CONFIG_AIC79XX_RESET_DELAY_MS=15000
# CONFIG_AIC79XX_DEBUG_ENABLE is not set
CONFIG_AIC79XX_DEBUG_MASK=0
# CONFIG_AIC79XX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC94XX=y
# CONFIG_AIC94XX_DEBUG is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
CONFIG_MEGARAID_NEWGEN=y
CONFIG_MEGARAID_MM=y
CONFIG_MEGARAID_MAILBOX=y
CONFIG_MEGARAID_LEGACY=y
CONFIG_MEGARAID_SAS=y
CONFIG_SCSI_HPTIOP=y
CONFIG_SCSI_BUSLOGIC=y
# CONFIG_LIBFC is not set
# CONFIG_FCOE is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
CONFIG_SCSI_GDTH=y
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
CONFIG_SCSI_QLOGIC_1280=y
CONFIG_SCSI_QLA_FC=y
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_SRP is not set
# CONFIG_SCSI_LOWLEVEL_PCMCIA is not set
# CONFIG_SCSI_DH is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y
CONFIG_SATA_AHCI=y
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y
# CONFIG_SATA_SVW is not set
CONFIG_ATA_PIIX=y
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_PATA_ACPI is not set
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PCMCIA is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set
# CONFIG_PATA_SCH is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=y
CONFIG_MD_RAID456=y
CONFIG_MD_RAID5_RESHAPE=y
CONFIG_MD_MULTIPATH=y
CONFIG_MD_FAULTY=y
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_DEBUG is not set
CONFIG_DM_CRYPT=y
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_MIRROR=y
CONFIG_DM_ZERO=y
CONFIG_DM_MULTIPATH=y
# CONFIG_DM_DELAY is not set
# CONFIG_DM_UEVENT is not set
CONFIG_FUSION=y
CONFIG_FUSION_SPI=y
CONFIG_FUSION_FC=y
CONFIG_FUSION_SAS=y
CONFIG_FUSION_MAX_SGE=40
CONFIG_FUSION_CTL=y
# CONFIG_FUSION_LOGGING is not set

#
# IEEE 1394 (FireWire) support
#

#
# Enable only one of the two stacks, unless you know what you are doing
#
# CONFIG_FIREWIRE is not set
# CONFIG_IEEE1394 is not set
# CONFIG_I2O is not set
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
# CONFIG_DUMMY is not set
# CONFIG_BONDING is not set
# CONFIG_MACVLAN is not set
# CONFIG_EQUALIZER is not set
# CONFIG_TUN is not set
# CONFIG_VETH is not set
# CONFIG_NET_SB1000 is not set
# CONFIG_ARCNET is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
# CONFIG_MARVELL_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_QSEMI_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_ICPLUS_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_FIXED_PHY is not set
# CONFIG_MDIO_BITBANG is not set
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NET_VENDOR_3COM is not set
# CONFIG_NET_TULIP is not set
# CONFIG_HP100 is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set
# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set
# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set
CONFIG_NET_PCI=y
# CONFIG_PCNET32 is not set
# CONFIG_AMD8111_ETH is not set
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_B44 is not set
# CONFIG_FORCEDETH is not set
CONFIG_E100=y
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
# CONFIG_R6040 is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SMSC9420 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
# CONFIG_VIA_RHINE is not set
# CONFIG_SC92031 is not set
# CONFIG_ATL2 is not set
CONFIG_NETDEV_1000=y
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
CONFIG_E1000=y
CONFIG_E1000E=y
# CONFIG_IP1000 is not set
CONFIG_IGB=y
# CONFIG_IGB_LRO is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_TIGON3=y
CONFIG_BNX2=y
# CONFIG_QLA3XXX is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_JME is not set
CONFIG_NETDEV_10000=y
# CONFIG_CHELSIO_T1 is not set
CONFIG_CHELSIO_T3_DEPENDS=y
# CONFIG_CHELSIO_T3 is not set
# CONFIG_ENIC is not set
# CONFIG_IXGBE is not set
CONFIG_IXGB=y
# CONFIG_S2IO is not set
# CONFIG_MYRI10GE is not set
# CONFIG_NETXEN_NIC is not set
# CONFIG_NIU is not set
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
# CONFIG_TEHUTI is not set
# CONFIG_BNX2X is not set
# CONFIG_QLGE is not set
# CONFIG_SFC is not set
# CONFIG_TR is not set

#
# Wireless LAN
#
# CONFIG_WLAN_PRE80211 is not set
# CONFIG_WLAN_80211 is not set
# CONFIG_IWLWIFI_LEDS is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_NET_PCMCIA is not set
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
# CONFIG_NET_FC is not set
# CONFIG_NETCONSOLE is not set
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set
# CONFIG_ISDN is not set
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
# CONFIG_INPUT_POLLDEV is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_SERIAL=y
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
# CONFIG_INPUT_UINPUT is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_DEVKMEM=y
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_COMPUTONE is not set
# CONFIG_ROCKETPORT is not set
# CONFIG_CYCLADES is not set
# CONFIG_DIGIEPCA is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_ISI is not set
# CONFIG_SYNCLINK is not set
# CONFIG_SYNCLINKMP is not set
# CONFIG_SYNCLINK_GT is not set
# CONFIG_N_HDLC is not set
# CONFIG_RISCOM8 is not set
# CONFIG_SPECIALIX is not set
# CONFIG_SX is not set
# CONFIG_RIO is not set
# CONFIG_STALDRV is not set
# CONFIG_NOZOMI is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_CS is not set
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
# CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set
# CONFIG_LEGACY_PTYS is not set
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_INTEL=y
CONFIG_HW_RANDOM_AMD=y
CONFIG_NVRAM=y
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
# CONFIG_CARDMAN_4000 is not set
# CONFIG_CARDMAN_4040 is not set
# CONFIG_IPWIRELESS is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
# CONFIG_I2C_CHARDEV is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_SIMTEC is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Graphics adapter I2C/DDC channel drivers
#
# CONFIG_I2C_VOODOO3 is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_STUB is not set

#
# Miscellaneous I2C Chip support
#
# CONFIG_DS1682 is not set
# CONFIG_SENSORS_PCF8574 is not set
# CONFIG_PCF8575 is not set
# CONFIG_SENSORS_PCA9539 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_SENSORS_MAX6875 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set
# CONFIG_SPI is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_BATTERY_DS2760 is not set
# CONFIG_BATTERY_BQ27x00 is not set
# CONFIG_HWMON is not set
CONFIG_THERMAL=y
CONFIG_WATCHDOG=y
# CONFIG_WATCHDOG_NOWAYOUT is not set

#
# Watchdog Device Drivers
#
CONFIG_SOFT_WATCHDOG=y
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_SC520_WDT is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
# CONFIG_ITCO_WDT is not set
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_60XX_WDT is not set
# CONFIG_SBC8360_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83697HF_WDT is not set
# CONFIG_W83697UG_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set

#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_REGULATOR is not set

#
# Multimedia devices
#

#
# Multimedia core support
#
CONFIG_VIDEO_DEV=y
CONFIG_VIDEO_V4L2_COMMON=y
CONFIG_VIDEO_ALLOW_V4L1=y
CONFIG_VIDEO_V4L1_COMPAT=y
# CONFIG_DVB_CORE is not set
CONFIG_VIDEO_MEDIA=y

#
# Multimedia drivers
#
# CONFIG_MEDIA_ATTACH is not set
CONFIG_MEDIA_TUNER=y
# CONFIG_MEDIA_TUNER_CUSTOMIZE is not set
CONFIG_MEDIA_TUNER_SIMPLE=y
CONFIG_MEDIA_TUNER_TDA8290=y
CONFIG_MEDIA_TUNER_TDA9887=y
CONFIG_MEDIA_TUNER_TEA5761=y
CONFIG_MEDIA_TUNER_TEA5767=y
CONFIG_MEDIA_TUNER_MT20XX=y
CONFIG_MEDIA_TUNER_XC2028=y
CONFIG_MEDIA_TUNER_XC5000=y
CONFIG_VIDEO_V4L2=y
CONFIG_VIDEO_V4L1=y
# CONFIG_VIDEO_CAPTURE_DRIVERS is not set
# CONFIG_RADIO_ADAPTERS is not set
# CONFIG_DAB is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
CONFIG_AGP_VIA=y
CONFIG_DRM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
# CONFIG_DRM_RADEON is not set
# CONFIG_DRM_I810 is not set
# CONFIG_DRM_I830 is not set
# CONFIG_DRM_I915 is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
CONFIG_VGASTATE=y
CONFIG_VIDEO_OUTPUT_CONTROL=m
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_DDC=y
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
# CONFIG_FB_SYS_FILLRECT is not set
# CONFIG_FB_SYS_COPYAREA is not set
# CONFIG_FB_SYS_IMAGEBLIT is not set
# CONFIG_FB_FOREIGN_ENDIAN is not set
# CONFIG_FB_SYS_FOPS is not set
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
CONFIG_FB_VGA16=y
# CONFIG_FB_UVESA is not set
CONFIG_FB_VESA=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_LE80578 is not set
CONFIG_FB_INTEL=y
# CONFIG_FB_INTEL_DEBUG is not set
CONFIG_FB_INTEL_I2C=y
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
CONFIG_BACKLIGHT_GENERIC=y
# CONFIG_BACKLIGHT_PROGEAR is not set
# CONFIG_BACKLIGHT_MBP_NVIDIA is not set
# CONFIG_BACKLIGHT_SAHARA is not set

#
# Display device support
#
# CONFIG_DISPLAY_SUPPORT is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY is not set
CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
# CONFIG_SOUND is not set
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
# CONFIG_HID_DEBUG is not set
# CONFIG_HIDRAW is not set

#
# USB Input Devices
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y

#
# Special HID drivers
#
CONFIG_HID_COMPAT=y
CONFIG_HID_A4TECH=y
CONFIG_HID_APPLE=y
CONFIG_HID_BELKIN=y
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
CONFIG_HID_CYPRESS=y
CONFIG_HID_EZKEY=y
CONFIG_HID_GYRATION=y
CONFIG_HID_LOGITECH=y
CONFIG_LOGITECH_FF=y
# CONFIG_LOGIRUMBLEPAD2_FF is not set
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
CONFIG_HID_NTRIG=y
CONFIG_HID_PANTHERLORD=y
# CONFIG_PANTHERLORD_FF is not set
CONFIG_HID_PETALYNX=y
CONFIG_HID_SAMSUNG=y
CONFIG_HID_SONY=y
CONFIG_HID_SUNPLUS=y
# CONFIG_GREENASIA_FF is not set
CONFIG_HID_TOPSEED=y
CONFIG_THRUSTMASTER_FF=y
# CONFIG_ZEROPLUS_FF is not set
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
# CONFIG_USB_ANNOUNCE_NEW_DEVICES is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
CONFIG_USB_DEVICE_CLASS=y
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_OTG is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB is not set
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_WHCI_HCD is not set
# CONFIG_USB_HWA_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may also be needed;
#

#
# see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=y
# CONFIG_USB_STORAGE_DEBUG is not set
CONFIG_USB_STORAGE_DATAFAB=y
CONFIG_USB_STORAGE_FREECOM=y
CONFIG_USB_STORAGE_ISD200=y
CONFIG_USB_STORAGE_USBAT=y
CONFIG_USB_STORAGE_SDDR09=y
CONFIG_USB_STORAGE_SDDR55=y
CONFIG_USB_STORAGE_JUMPSHOT=y
CONFIG_USB_STORAGE_ALAUDA=y
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
CONFIG_USB_LIBUSUAL=y

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB port drivers
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_BERRY_CHARGE is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGET is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_VST is not set
# CONFIG_USB_GADGET is not set

#
# OTG and related infrastructure
#
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
# CONFIG_NEW_LEDS is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC=y

#
# Reporting subsystems
#
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_MM_EDAC=y
CONFIG_EDAC_E752X=y
# CONFIG_EDAC_I82975X is not set
# CONFIG_EDAC_I3000 is not set
# CONFIG_EDAC_X38 is not set
# CONFIG_EDAC_I5400 is not set
# CONFIG_EDAC_I5000 is not set
# CONFIG_EDAC_I5100 is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8581 is not set

#
# SPI RTC drivers
#

#
# Platform RTC drivers
#
# CONFIG_RTC_DRV_CMOS is not set
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_V3020 is not set

#
# on-CPU RTC drivers
#
# CONFIG_DMADEVICES is not set
# CONFIG_UIO is not set
# CONFIG_STAGING is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_MSI_LAPTOP is not set
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_COMPAL_LAPTOP is not set
# CONFIG_SONY_LAPTOP is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
# CONFIG_ACPI_WMI is not set
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_TOSHIBA is not set

#
# Firmware Drivers
#
CONFIG_EDD=y
# CONFIG_EDD_OFF is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DELL_RBU=y
CONFIG_DCDBAS=y
CONFIG_DMIID=y
# CONFIG_ISCSI_IBFT_FIND is not set

#
# File systems
#
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
CONFIG_EXT2_FS_XIP=y
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
# CONFIG_EXT4_FS is not set
CONFIG_FS_XIP=y
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=y
# CONFIG_REISERFS_CHECK is not set
CONFIG_REISERFS_PROC_INFO=y
CONFIG_REISERFS_FS_XATTR=y
CONFIG_REISERFS_FS_POSIX_ACL=y
CONFIG_REISERFS_FS_SECURITY=y
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_FILE_LOCKING=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_BTRFS_FS is not set
CONFIG_DNOTIFY=y
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_QUOTA=y
# CONFIG_QUOTA_NETLINK_INTERFACE is not set
CONFIG_PRINT_QUOTA_WARNING=y
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_AUTOFS_FS=y
CONFIG_AUTOFS4_FS=y
CONFIG_FUSE_FS=y

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="ascii"
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=y
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
CONFIG_ROMFS_FS=y
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
# CONFIG_SUNRPC_REGISTER_V4 is not set
CONFIG_RPCSEC_GSS_KRB5=y
# CONFIG_RPCSEC_GSS_SPKM3 is not set
# CONFIG_SMB_FS is not set
CONFIG_CIFS=y
# CONFIG_CIFS_STATS is not set
CONFIG_CIFS_WEAK_PW_HASH=y
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
# CONFIG_CIFS_DEBUG2 is not set
# CONFIG_CIFS_EXPERIMENTAL is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
# CONFIG_LDM_PARTITION is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
CONFIG_NLS_UTF8=y
# CONFIG_DLM is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
# CONFIG_DETECT_SOFTLOCKUP is not set
CONFIG_SCHED_DEBUG=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_TIMER_STATS is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
# CONFIG_FRAME_POINTER is not set
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_CPU_STALL_DETECTOR is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_LKDTM is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_LATENCYTOP is not set
CONFIG_SYSCTL_SYSCALL_CHECK=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_HW_BRANCH_TRACER=y
CONFIG_RING_BUFFER=y
CONFIG_TRACING=y

#
# Tracers
#
# CONFIG_FUNCTION_TRACER is not set
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_SYSPROF_TRACER is not set
# CONFIG_SCHED_TRACER is not set
# CONFIG_CONTEXT_SWITCH_TRACER is not set
# CONFIG_BOOT_TRACER is not set
# CONFIG_TRACE_BRANCH_PROFILING is not set
# CONFIG_POWER_TRACER is not set
# CONFIG_STACK_TRACER is not set
# CONFIG_HW_BRANCH_TRACER is not set
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_DYNAMIC_PRINTK_DEBUG is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
# CONFIG_STRICT_DEVMEM is not set
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
# CONFIG_X86_PTDUMP is not set
# CONFIG_DEBUG_RODATA is not set
# CONFIG_DEBUG_NX_TEST is not set
# CONFIG_IOMMU_DEBUG is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
# CONFIG_OPTIMIZE_INLINING is not set

#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY is not set
# CONFIG_SECURITYFS is not set
# CONFIG_SECURITY_FILE_CAPABILITIES is not set
CONFIG_XOR_BLOCKS=y
CONFIG_ASYNC_CORE=y
CONFIG_ASYNC_MEMCPY=y
CONFIG_ASYNC_XOR=y
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
# CONFIG_CRYPTO_FIPS is not set
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_GF128MUL is not set
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_CRYPTD is not set
# CONFIG_CRYPTO_AUTHENC is not set
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
# CONFIG_CRYPTO_ECB is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set

#
# Ciphers
#
# CONFIG_CRYPTO_AES is not set
# CONFIG_CRYPTO_AES_X86_64 is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
CONFIG_CRYPTO_DES=y
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
# CONFIG_CRYPTO_LZO is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_HIFN_795X is not set
CONFIG_HAVE_KVM=y
CONFIG_VIRTUALIZATION=y
# CONFIG_KVM is not set
# CONFIG_VIRTIO_PCI is not set
# CONFIG_VIRTIO_BALLOON is not set

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_FIND_LAST_BIT=y
# CONFIG_CRC_CCITT is not set
# CONFIG_CRC16 is not set
# CONFIG_CRC_T10DIF is not set
CONFIG_CRC_ITU_T=y
CONFIG_CRC32=y
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
CONFIG_ZLIB_INFLATE=y
CONFIG_PLIST=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y


> 
> >
> >                4P quad-core    2P quad-core    2P quad-core HT
> >                tigerton        stockley        Nehalem
> >                ------------------------------------------------
> > tbench          +3%             +2%             0%
> > oltp            -2%             0%              0%
> > aim7            0%              0%              0%
> > specjbb2005     +3%             0%              0%
> > hackbench       0%              0%              0%
> >
> > netperf:
> > TCP-S-112k      0%              -1%             0%
> > TCP-S-64k       0%              -1%             +1%
> > TCP-RR-1        0%              0%              +1%
> > UDP-U-4k        -2%             0%              -2%
> > UDP-U-1k        +3%             0%              0%
> > UDP-RR-1        0%              0%              0%
> > UDP-RR-512      -1%             0%              +1%


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
@ 2009-02-26  9:27       ` Lin Ming
  0 siblings, 0 replies; 118+ messages in thread
From: Lin Ming @ 2009-02-26  9:27 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Mel Gorman, Linux Memory Management List, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Zhang Yanmin, Peter Zijlstra

On Thu, 2009-02-26 at 17:26 +0800, Pekka Enberg wrote:
> On Thu, Feb 26, 2009 at 11:10 AM, Lin Ming <ming.m.lin@intel.com> wrote:
> > We tested this v2 patch series with 2.6.29-rc6 on different machines.
> 
> What .config is this? Specifically, is SLUB or SLAB used here?

SLUB used.

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.29-rc6
# Wed Feb 25 00:38:09 2009
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
# CONFIG_KTIME_SCALAR is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION="-mg-v2"
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# CONFIG_TASK_XACCT is not set
# CONFIG_AUDIT is not set

#
# RCU Subsystem
#
CONFIG_CLASSIC_RCU=y
# CONFIG_TREE_RCU is not set
# CONFIG_PREEMPT_RCU is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_USER_SCHED=y
# CONFIG_CGROUP_SCHED is not set
# CONFIG_CGROUPS is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
# CONFIG_NET_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_COMPAT_BRK=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
# CONFIG_MARKERS is not set
CONFIG_OPROFILE=y
# CONFIG_OPROFILE_IBS is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_BLK_DEV_BSG is not set
# CONFIG_BLK_DEV_INTEGRITY is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_FREEZER=y

#
# Processor type and features
#
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
# CONFIG_SPARSE_IRQ is not set
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_VSMP is not set
CONFIG_SCHED_OMIT_FRAME_POINTER=y
# CONFIG_PARAVIRT_GUEST is not set
# CONFIG_MEMTEST is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_CPU=y
CONFIG_X86_L1_CACHE_BYTES=128
CONFIG_X86_INTERNODE_CACHE_BYTES=128
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=7
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR_64=y
CONFIG_X86_DS=y
CONFIG_X86_PTRACE_BTS=y
CONFIG_HPET_TIMER=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
# CONFIG_AMD_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
# CONFIG_IOMMU_API is not set
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
# CONFIG_I8K is not set
# CONFIG_MICROCODE is not set
# CONFIG_X86_MSR is not set
# CONFIG_X86_CPUID is not set
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
# CONFIG_NUMA is not set
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_FLATMEM_MANUAL is not set
# CONFIG_DISCONTIGMEM_MANUAL is not set
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_UNEVICTABLE_LRU=y
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_X86_RESERVE_LOW_64K=y
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
# CONFIG_X86_PAT is not set
# CONFIG_EFI is not set
# CONFIG_SECCOMP is not set
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
# CONFIG_SCHED_HRTICK is not set
CONFIG_KEXEC=y
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x200000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x200000
CONFIG_HOTPLUG_CPU=y
CONFIG_COMPAT_VDSO=y
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

#
# Power management and ACPI options
#
CONFIG_PM=y
# CONFIG_PM_DEBUG is not set
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM_SLEEP=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
# CONFIG_HIBERNATION is not set
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROCFS_POWER=y
CONFIG_ACPI_SYSFS_POWER=y
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
# CONFIG_ACPI_SBS is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_DEBUG=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_STAT_DETAILS=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y

#
# CPUFreq processor drivers
#
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_POWERNOW_K8=y
CONFIG_X86_POWERNOW_K8_ACPI=y
CONFIG_X86_SPEEDSTEP_CENTRINO=y
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# CONFIG_X86_SPEEDSTEP_LIB is not set
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y

#
# Memory power savings
#
# CONFIG_I7300_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
# CONFIG_DMAR is not set
# CONFIG_INTR_REMAP is not set
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_PCIEAER=y
# CONFIG_PCIEASPM is not set
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
CONFIG_PCI_LEGACY=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_STUB is not set
CONFIG_HT_IRQ=y
CONFIG_ISA_DMA_API=y
CONFIG_K8_NB=y
CONFIG_PCCARD=y
# CONFIG_PCMCIA_DEBUG is not set
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_PCMCIA_IOCTL=y
CONFIG_CARDBUS=y

#
# PC-card bridges
#
CONFIG_YENTA=y
CONFIG_YENTA_O2=y
CONFIG_YENTA_RICOH=y
CONFIG_YENTA_TI=y
CONFIG_YENTA_ENE_TUNE=y
CONFIG_YENTA_TOSHIBA=y
# CONFIG_PD6729 is not set
# CONFIG_I82092 is not set
CONFIG_PCCARD_NONSTATIC=y
CONFIG_HOTPLUG_PCI=y
# CONFIG_HOTPLUG_PCI_FAKE is not set
CONFIG_HOTPLUG_PCI_ACPI=y
# CONFIG_HOTPLUG_PCI_ACPI_IBM is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=y
CONFIG_IA32_EMULATION=y
CONFIG_IA32_AOUT=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_COMPAT_NET_DEV_OPS=y
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=y
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
# CONFIG_XFRM_STATISTICS is not set
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
CONFIG_INET_TUNNEL=y
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
CONFIG_INET_XFRM_MODE_BEET=y
# CONFIG_INET_LRO is not set
# CONFIG_INET_DIAG is not set
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=y
# CONFIG_TCP_CONG_CUBIC is not set
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
CONFIG_DEFAULT_BIC=y
# CONFIG_DEFAULT_CUBIC is not set
# CONFIG_DEFAULT_HTCP is not set
# CONFIG_DEFAULT_VEGAS is not set
# CONFIG_DEFAULT_WESTWOOD is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="bic"
# CONFIG_TCP_MD5SIG is not set
CONFIG_IPV6=y
# CONFIG_IPV6_PRIVACY is not set
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
CONFIG_INET6_XFRM_MODE_TRANSPORT=y
CONFIG_INET6_XFRM_MODE_TUNNEL=y
CONFIG_INET6_XFRM_MODE_BEET=y
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=y
CONFIG_IPV6_NDISC_NODETYPE=y
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_IPV6_MROUTE is not set
CONFIG_NETWORK_SECMARK=y
# CONFIG_NETFILTER is not set
# CONFIG_IP_DCCP is not set
CONFIG_IP_SCTP=y
# CONFIG_SCTP_DBG_MSG is not set
# CONFIG_SCTP_DBG_OBJCNT is not set
# CONFIG_SCTP_HMAC_NONE is not set
# CONFIG_SCTP_HMAC_SHA1 is not set
CONFIG_SCTP_HMAC_MD5=y
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_BRIDGE is not set
# CONFIG_NET_DSA is not set
CONFIG_VLAN_8021Q=y
# CONFIG_VLAN_8021Q_GVRP is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_NET_SCHED is not set
# CONFIG_DCB is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
# CONFIG_PHONET is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
# CONFIG_CFG80211 is not set
CONFIG_WIRELESS_OLD_REGULATORY=y
# CONFIG_WIRELESS_EXT is not set
# CONFIG_LIB80211 is not set
# CONFIG_MAC80211 is not set
# CONFIG_WIMAX is not set
# CONFIG_RFKILL is not set
# CONFIG_NET_9P is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_MTD is not set
# CONFIG_PARPORT is not set
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=16384
# CONFIG_BLK_DEV_XIP is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_BLK_DEV_HD is not set
CONFIG_MISC_DEVICES=y
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_SGI_XP is not set
# CONFIG_HP_ILO is not set
# CONFIG_SGI_GRU is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_93CX6 is not set
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_RAID_ATTRS=y
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
# CONFIG_SCSI_TGT is not set
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
# CONFIG_SCSI_SCAN_ASYNC is not set
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=y
CONFIG_SCSI_FC_ATTRS=y
CONFIG_SCSI_ISCSI_ATTRS=y
CONFIG_SCSI_SAS_ATTRS=y
CONFIG_SCSI_SAS_LIBSAS=y
# CONFIG_SCSI_SAS_ATA is not set
CONFIG_SCSI_SAS_HOST_SMP=y
# CONFIG_SCSI_SAS_LIBSAS_DEBUG is not set
# CONFIG_SCSI_SRP_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
# CONFIG_SCSI_CXGB3_ISCSI is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
CONFIG_SCSI_ACARD=y
CONFIG_SCSI_AACRAID=y
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=4
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
# CONFIG_AIC7XXX_DEBUG_ENABLE is not set
CONFIG_AIC7XXX_DEBUG_MASK=0
# CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC7XXX_OLD=y
CONFIG_SCSI_AIC79XX=y
CONFIG_AIC79XX_CMDS_PER_DEVICE=4
CONFIG_AIC79XX_RESET_DELAY_MS=15000
# CONFIG_AIC79XX_DEBUG_ENABLE is not set
CONFIG_AIC79XX_DEBUG_MASK=0
# CONFIG_AIC79XX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC94XX=y
# CONFIG_AIC94XX_DEBUG is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
CONFIG_MEGARAID_NEWGEN=y
CONFIG_MEGARAID_MM=y
CONFIG_MEGARAID_MAILBOX=y
CONFIG_MEGARAID_LEGACY=y
CONFIG_MEGARAID_SAS=y
CONFIG_SCSI_HPTIOP=y
CONFIG_SCSI_BUSLOGIC=y
# CONFIG_LIBFC is not set
# CONFIG_FCOE is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
CONFIG_SCSI_GDTH=y
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
CONFIG_SCSI_QLOGIC_1280=y
CONFIG_SCSI_QLA_FC=y
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_SRP is not set
# CONFIG_SCSI_LOWLEVEL_PCMCIA is not set
# CONFIG_SCSI_DH is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y
CONFIG_SATA_AHCI=y
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y
# CONFIG_SATA_SVW is not set
CONFIG_ATA_PIIX=y
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_PATA_ACPI is not set
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PCMCIA is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set
# CONFIG_PATA_SCH is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=y
CONFIG_MD_RAID456=y
CONFIG_MD_RAID5_RESHAPE=y
CONFIG_MD_MULTIPATH=y
CONFIG_MD_FAULTY=y
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_DEBUG is not set
CONFIG_DM_CRYPT=y
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_MIRROR=y
CONFIG_DM_ZERO=y
CONFIG_DM_MULTIPATH=y
# CONFIG_DM_DELAY is not set
# CONFIG_DM_UEVENT is not set
CONFIG_FUSION=y
CONFIG_FUSION_SPI=y
CONFIG_FUSION_FC=y
CONFIG_FUSION_SAS=y
CONFIG_FUSION_MAX_SGE=40
CONFIG_FUSION_CTL=y
# CONFIG_FUSION_LOGGING is not set

#
# IEEE 1394 (FireWire) support
#

#
# Enable only one of the two stacks, unless you know what you are doing
#
# CONFIG_FIREWIRE is not set
# CONFIG_IEEE1394 is not set
# CONFIG_I2O is not set
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
# CONFIG_DUMMY is not set
# CONFIG_BONDING is not set
# CONFIG_MACVLAN is not set
# CONFIG_EQUALIZER is not set
# CONFIG_TUN is not set
# CONFIG_VETH is not set
# CONFIG_NET_SB1000 is not set
# CONFIG_ARCNET is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
# CONFIG_MARVELL_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_QSEMI_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_ICPLUS_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_FIXED_PHY is not set
# CONFIG_MDIO_BITBANG is not set
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NET_VENDOR_3COM is not set
# CONFIG_NET_TULIP is not set
# CONFIG_HP100 is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set
# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set
# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set
CONFIG_NET_PCI=y
# CONFIG_PCNET32 is not set
# CONFIG_AMD8111_ETH is not set
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_B44 is not set
# CONFIG_FORCEDETH is not set
CONFIG_E100=y
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
# CONFIG_R6040 is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SMSC9420 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
# CONFIG_VIA_RHINE is not set
# CONFIG_SC92031 is not set
# CONFIG_ATL2 is not set
CONFIG_NETDEV_1000=y
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
CONFIG_E1000=y
CONFIG_E1000E=y
# CONFIG_IP1000 is not set
CONFIG_IGB=y
# CONFIG_IGB_LRO is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_TIGON3=y
CONFIG_BNX2=y
# CONFIG_QLA3XXX is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_JME is not set
CONFIG_NETDEV_10000=y
# CONFIG_CHELSIO_T1 is not set
CONFIG_CHELSIO_T3_DEPENDS=y
# CONFIG_CHELSIO_T3 is not set
# CONFIG_ENIC is not set
# CONFIG_IXGBE is not set
CONFIG_IXGB=y
# CONFIG_S2IO is not set
# CONFIG_MYRI10GE is not set
# CONFIG_NETXEN_NIC is not set
# CONFIG_NIU is not set
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
# CONFIG_TEHUTI is not set
# CONFIG_BNX2X is not set
# CONFIG_QLGE is not set
# CONFIG_SFC is not set
# CONFIG_TR is not set

#
# Wireless LAN
#
# CONFIG_WLAN_PRE80211 is not set
# CONFIG_WLAN_80211 is not set
# CONFIG_IWLWIFI_LEDS is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_NET_PCMCIA is not set
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
# CONFIG_NET_FC is not set
# CONFIG_NETCONSOLE is not set
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set
# CONFIG_ISDN is not set
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
# CONFIG_INPUT_POLLDEV is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_SERIAL=y
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
# CONFIG_INPUT_UINPUT is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_DEVKMEM=y
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_COMPUTONE is not set
# CONFIG_ROCKETPORT is not set
# CONFIG_CYCLADES is not set
# CONFIG_DIGIEPCA is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_ISI is not set
# CONFIG_SYNCLINK is not set
# CONFIG_SYNCLINKMP is not set
# CONFIG_SYNCLINK_GT is not set
# CONFIG_N_HDLC is not set
# CONFIG_RISCOM8 is not set
# CONFIG_SPECIALIX is not set
# CONFIG_SX is not set
# CONFIG_RIO is not set
# CONFIG_STALDRV is not set
# CONFIG_NOZOMI is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_CS is not set
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
# CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set
# CONFIG_LEGACY_PTYS is not set
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_INTEL=y
CONFIG_HW_RANDOM_AMD=y
CONFIG_NVRAM=y
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
# CONFIG_CARDMAN_4000 is not set
# CONFIG_CARDMAN_4040 is not set
# CONFIG_IPWIRELESS is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
# CONFIG_I2C_CHARDEV is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_SIMTEC is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Graphics adapter I2C/DDC channel drivers
#
# CONFIG_I2C_VOODOO3 is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_STUB is not set

#
# Miscellaneous I2C Chip support
#
# CONFIG_DS1682 is not set
# CONFIG_SENSORS_PCF8574 is not set
# CONFIG_PCF8575 is not set
# CONFIG_SENSORS_PCA9539 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_SENSORS_MAX6875 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set
# CONFIG_SPI is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_BATTERY_DS2760 is not set
# CONFIG_BATTERY_BQ27x00 is not set
# CONFIG_HWMON is not set
CONFIG_THERMAL=y
CONFIG_WATCHDOG=y
# CONFIG_WATCHDOG_NOWAYOUT is not set

#
# Watchdog Device Drivers
#
CONFIG_SOFT_WATCHDOG=y
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_SC520_WDT is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
# CONFIG_ITCO_WDT is not set
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_60XX_WDT is not set
# CONFIG_SBC8360_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83697HF_WDT is not set
# CONFIG_W83697UG_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set

#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_REGULATOR is not set

#
# Multimedia devices
#

#
# Multimedia core support
#
CONFIG_VIDEO_DEV=y
CONFIG_VIDEO_V4L2_COMMON=y
CONFIG_VIDEO_ALLOW_V4L1=y
CONFIG_VIDEO_V4L1_COMPAT=y
# CONFIG_DVB_CORE is not set
CONFIG_VIDEO_MEDIA=y

#
# Multimedia drivers
#
# CONFIG_MEDIA_ATTACH is not set
CONFIG_MEDIA_TUNER=y
# CONFIG_MEDIA_TUNER_CUSTOMIZE is not set
CONFIG_MEDIA_TUNER_SIMPLE=y
CONFIG_MEDIA_TUNER_TDA8290=y
CONFIG_MEDIA_TUNER_TDA9887=y
CONFIG_MEDIA_TUNER_TEA5761=y
CONFIG_MEDIA_TUNER_TEA5767=y
CONFIG_MEDIA_TUNER_MT20XX=y
CONFIG_MEDIA_TUNER_XC2028=y
CONFIG_MEDIA_TUNER_XC5000=y
CONFIG_VIDEO_V4L2=y
CONFIG_VIDEO_V4L1=y
# CONFIG_VIDEO_CAPTURE_DRIVERS is not set
# CONFIG_RADIO_ADAPTERS is not set
# CONFIG_DAB is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
CONFIG_AGP_VIA=y
CONFIG_DRM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
# CONFIG_DRM_RADEON is not set
# CONFIG_DRM_I810 is not set
# CONFIG_DRM_I830 is not set
# CONFIG_DRM_I915 is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
CONFIG_VGASTATE=y
CONFIG_VIDEO_OUTPUT_CONTROL=m
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_DDC=y
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
# CONFIG_FB_SYS_FILLRECT is not set
# CONFIG_FB_SYS_COPYAREA is not set
# CONFIG_FB_SYS_IMAGEBLIT is not set
# CONFIG_FB_FOREIGN_ENDIAN is not set
# CONFIG_FB_SYS_FOPS is not set
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
CONFIG_FB_VGA16=y
# CONFIG_FB_UVESA is not set
CONFIG_FB_VESA=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_LE80578 is not set
CONFIG_FB_INTEL=y
# CONFIG_FB_INTEL_DEBUG is not set
CONFIG_FB_INTEL_I2C=y
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
CONFIG_BACKLIGHT_GENERIC=y
# CONFIG_BACKLIGHT_PROGEAR is not set
# CONFIG_BACKLIGHT_MBP_NVIDIA is not set
# CONFIG_BACKLIGHT_SAHARA is not set

#
# Display device support
#
# CONFIG_DISPLAY_SUPPORT is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY is not set
CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
# CONFIG_SOUND is not set
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
# CONFIG_HID_DEBUG is not set
# CONFIG_HIDRAW is not set

#
# USB Input Devices
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y

#
# Special HID drivers
#
CONFIG_HID_COMPAT=y
CONFIG_HID_A4TECH=y
CONFIG_HID_APPLE=y
CONFIG_HID_BELKIN=y
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
CONFIG_HID_CYPRESS=y
CONFIG_HID_EZKEY=y
CONFIG_HID_GYRATION=y
CONFIG_HID_LOGITECH=y
CONFIG_LOGITECH_FF=y
# CONFIG_LOGIRUMBLEPAD2_FF is not set
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
CONFIG_HID_NTRIG=y
CONFIG_HID_PANTHERLORD=y
# CONFIG_PANTHERLORD_FF is not set
CONFIG_HID_PETALYNX=y
CONFIG_HID_SAMSUNG=y
CONFIG_HID_SONY=y
CONFIG_HID_SUNPLUS=y
# CONFIG_GREENASIA_FF is not set
CONFIG_HID_TOPSEED=y
CONFIG_THRUSTMASTER_FF=y
# CONFIG_ZEROPLUS_FF is not set
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
# CONFIG_USB_ANNOUNCE_NEW_DEVICES is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
CONFIG_USB_DEVICE_CLASS=y
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_OTG is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB is not set
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_WHCI_HCD is not set
# CONFIG_USB_HWA_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may also be needed;
#

#
# see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=y
# CONFIG_USB_STORAGE_DEBUG is not set
CONFIG_USB_STORAGE_DATAFAB=y
CONFIG_USB_STORAGE_FREECOM=y
CONFIG_USB_STORAGE_ISD200=y
CONFIG_USB_STORAGE_USBAT=y
CONFIG_USB_STORAGE_SDDR09=y
CONFIG_USB_STORAGE_SDDR55=y
CONFIG_USB_STORAGE_JUMPSHOT=y
CONFIG_USB_STORAGE_ALAUDA=y
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
CONFIG_USB_LIBUSUAL=y

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB port drivers
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_BERRY_CHARGE is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGET is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_VST is not set
# CONFIG_USB_GADGET is not set

#
# OTG and related infrastructure
#
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
# CONFIG_NEW_LEDS is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC=y

#
# Reporting subsystems
#
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_MM_EDAC=y
CONFIG_EDAC_E752X=y
# CONFIG_EDAC_I82975X is not set
# CONFIG_EDAC_I3000 is not set
# CONFIG_EDAC_X38 is not set
# CONFIG_EDAC_I5400 is not set
# CONFIG_EDAC_I5000 is not set
# CONFIG_EDAC_I5100 is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8581 is not set

#
# SPI RTC drivers
#

#
# Platform RTC drivers
#
# CONFIG_RTC_DRV_CMOS is not set
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_V3020 is not set

#
# on-CPU RTC drivers
#
# CONFIG_DMADEVICES is not set
# CONFIG_UIO is not set
# CONFIG_STAGING is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_MSI_LAPTOP is not set
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_COMPAL_LAPTOP is not set
# CONFIG_SONY_LAPTOP is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
# CONFIG_ACPI_WMI is not set
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_TOSHIBA is not set

#
# Firmware Drivers
#
CONFIG_EDD=y
# CONFIG_EDD_OFF is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DELL_RBU=y
CONFIG_DCDBAS=y
CONFIG_DMIID=y
# CONFIG_ISCSI_IBFT_FIND is not set

#
# File systems
#
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
CONFIG_EXT2_FS_XIP=y
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
# CONFIG_EXT4_FS is not set
CONFIG_FS_XIP=y
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=y
# CONFIG_REISERFS_CHECK is not set
CONFIG_REISERFS_PROC_INFO=y
CONFIG_REISERFS_FS_XATTR=y
CONFIG_REISERFS_FS_POSIX_ACL=y
CONFIG_REISERFS_FS_SECURITY=y
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_FILE_LOCKING=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_BTRFS_FS is not set
CONFIG_DNOTIFY=y
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_QUOTA=y
# CONFIG_QUOTA_NETLINK_INTERFACE is not set
CONFIG_PRINT_QUOTA_WARNING=y
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_AUTOFS_FS=y
CONFIG_AUTOFS4_FS=y
CONFIG_FUSE_FS=y

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="ascii"
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=y
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
CONFIG_ROMFS_FS=y
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
# CONFIG_SUNRPC_REGISTER_V4 is not set
CONFIG_RPCSEC_GSS_KRB5=y
# CONFIG_RPCSEC_GSS_SPKM3 is not set
# CONFIG_SMB_FS is not set
CONFIG_CIFS=y
# CONFIG_CIFS_STATS is not set
CONFIG_CIFS_WEAK_PW_HASH=y
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
# CONFIG_CIFS_DEBUG2 is not set
# CONFIG_CIFS_EXPERIMENTAL is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
# CONFIG_LDM_PARTITION is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
CONFIG_NLS_UTF8=y
# CONFIG_DLM is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
# CONFIG_DETECT_SOFTLOCKUP is not set
CONFIG_SCHED_DEBUG=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_TIMER_STATS is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
# CONFIG_FRAME_POINTER is not set
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_CPU_STALL_DETECTOR is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_LKDTM is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_LATENCYTOP is not set
CONFIG_SYSCTL_SYSCALL_CHECK=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_HW_BRANCH_TRACER=y
CONFIG_RING_BUFFER=y
CONFIG_TRACING=y

#
# Tracers
#
# CONFIG_FUNCTION_TRACER is not set
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_SYSPROF_TRACER is not set
# CONFIG_SCHED_TRACER is not set
# CONFIG_CONTEXT_SWITCH_TRACER is not set
# CONFIG_BOOT_TRACER is not set
# CONFIG_TRACE_BRANCH_PROFILING is not set
# CONFIG_POWER_TRACER is not set
# CONFIG_STACK_TRACER is not set
# CONFIG_HW_BRANCH_TRACER is not set
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_DYNAMIC_PRINTK_DEBUG is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
# CONFIG_STRICT_DEVMEM is not set
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
# CONFIG_X86_PTDUMP is not set
# CONFIG_DEBUG_RODATA is not set
# CONFIG_DEBUG_NX_TEST is not set
# CONFIG_IOMMU_DEBUG is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
# CONFIG_OPTIMIZE_INLINING is not set

#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY is not set
# CONFIG_SECURITYFS is not set
# CONFIG_SECURITY_FILE_CAPABILITIES is not set
CONFIG_XOR_BLOCKS=y
CONFIG_ASYNC_CORE=y
CONFIG_ASYNC_MEMCPY=y
CONFIG_ASYNC_XOR=y
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
# CONFIG_CRYPTO_FIPS is not set
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_GF128MUL is not set
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_CRYPTD is not set
# CONFIG_CRYPTO_AUTHENC is not set
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
# CONFIG_CRYPTO_ECB is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set

#
# Ciphers
#
# CONFIG_CRYPTO_AES is not set
# CONFIG_CRYPTO_AES_X86_64 is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
CONFIG_CRYPTO_DES=y
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
# CONFIG_CRYPTO_LZO is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_HIFN_795X is not set
CONFIG_HAVE_KVM=y
CONFIG_VIRTUALIZATION=y
# CONFIG_KVM is not set
# CONFIG_VIRTIO_PCI is not set
# CONFIG_VIRTIO_BALLOON is not set

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_FIND_LAST_BIT=y
# CONFIG_CRC_CCITT is not set
# CONFIG_CRC16 is not set
# CONFIG_CRC_T10DIF is not set
CONFIG_CRC_ITU_T=y
CONFIG_CRC32=y
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
CONFIG_ZLIB_INFLATE=y
CONFIG_PLIST=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y


> 
> >
> >                4P quad-core    2P quad-core    2P quad-core HT
> >                tigerton        stockley        Nehalem
> >                ------------------------------------------------
> > tbench          +3%             +2%             0%
> > oltp            -2%             0%              0%
> > aim7            0%              0%              0%
> > specjbb2005     +3%             0%              0%
> > hackbench       0%              0%              0%
> >
> > netperf:
> > TCP-S-112k      0%              -1%             0%
> > TCP-S-64k       0%              -1%             +1%
> > TCP-RR-1        0%              0%              +1%
> > UDP-U-4k        -2%             0%              -2%
> > UDP-U-1k        +3%             0%              0%
> > UDP-RR-1        0%              0%              0%
> > UDP-RR-512      -1%             0%              +1%



* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-02-26  9:10   ` Lin Ming
@ 2009-02-26 11:03     ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-26 11:03 UTC (permalink / raw)
  To: Lin Ming
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Zhang Yanmin, Peter Zijlstra

On Thu, Feb 26, 2009 at 05:10:27PM +0800, Lin Ming wrote:
> We tested this v2 patch series with 2.6.29-rc6 on different machines.
> 

Wonderful, thanks.

> 		4P quad-core	2P quad-core	2P quad-core HT
> 		tigerton	stockley	Nehalem
> 		------------------------------------------------
> tbench		+3%		+2%		0%

Nice.

> oltp		-2%		0%		0%

This is a big disappointment, and it is somewhat confusing that it is so
severe. For sysbench, I was seeing the following on six different machines:

	50834.14        51763.08    1.79%
	61852.08        61966.58    0.18%
	5935.98         5980.06     0.74%
	29227.78        30167.72    3.12%
	66702.67        66534.76   -0.25%
	26643.18        26542.59   -0.38%
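
(For anyone parsing the unlabelled columns: each row looks like a before/after
throughput pair, and the percentage appears to be computed against the second
figure, e.g. (51763.08 - 50834.14) / 51763.08 is roughly +1.79%.)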

So, two smallish regressions but mainly gains. Then again, I'm becoming
more and more convinced that sysbench doesn't really represent a proper
OLTP workload.

I'd like to understand better how the page allocator, at least, was being used
during your tests. Would it be possible to get a full profile (including an
instruction-level profile if possible, and the vmlinux file) for both kernels, please?

If you can get the profiles, confirm the regression is still there as
sometimes profiling can alter the outcome. Even if this happens, the
profile will tell me where time is being spent.

> aim7		0%		0%		0%
> specjbb2005	+3%		0%		0%
> hackbench	0%		0%		0%	
> 
> netperf:
> TCP-S-112k	0%		-1%		0%
> TCP-S-64k	0%		-1%		+1%
> TCP-RR-1	0%		0%		+1%
> UDP-U-4k	-2%		0%		-2%

Pekka, for this test was SLUB or the page allocator handling the 4K
allocations?

> UDP-U-1k	+3%		0%		0%
> UDP-RR-1	0%		0%		0%
> UDP-RR-512	-1%		0%		+1%
> 
> Lin Ming
> 

Thanks a million for testing.

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-02-26 11:03     ` Mel Gorman
@ 2009-02-26 11:18       ` Pekka Enberg
  -1 siblings, 0 replies; 118+ messages in thread
From: Pekka Enberg @ 2009-02-26 11:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lin Ming, Linux Memory Management List, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Zhang Yanmin, Peter Zijlstra

On Thu, 2009-02-26 at 11:03 +0000, Mel Gorman wrote:
> On Thu, Feb 26, 2009 at 05:10:27PM +0800, Lin Ming wrote:
> > We tested this v2 patch series with 2.6.29-rc6 on different machines.
> > 
> 
> Wonderful, thanks.
> 
> > 		4P quad-core	2P quad-core	2P quad-core HT
> > 		tigerton	stockley	Nehalem
> > 		------------------------------------------------
> > tbench		+3%		+2%		0%
> 
> Nice.
> 
> > oltp		-2%		0%		0%
> 
> This is a big disappointment, and it is somewhat confusing that it is so
> severe. For sysbench, I was seeing the following on six different machines:
> 
> 	50834.14        51763.08    1.79%
> 	61852.08        61966.58    0.18%
> 	5935.98         5980.06     0.74%
> 	29227.78        30167.72    3.12%
> 	66702.67        66534.76   -0.25%
> 	26643.18        26542.59   -0.38%
> 
> So, two smallish regressions but mainly gains. Then again, I'm becoming
> more and more convinced that sysbench doesn't really represent a proper
> OLTP workload.
> 
> I'd like to understand better how the page allocator, at least, was being used
> during your tests. Would it be possible to get a full profile (including an
> instruction-level profile if possible, and the vmlinux file) for both kernels, please?
> 
> If you can get the profiles, confirm the regression is still there as
> sometimes profiling can alter the outcome. Even if this happens, the
> profile will tell me where time is being spent.
> 
> > aim7		0%		0%		0%
> > specjbb2005	+3%		0%		0%
> > hackbench	0%		0%		0%	
> > 
> > netperf:
> > TCP-S-112k	0%		-1%		0%
> > TCP-S-64k	0%		-1%		+1%
> > TCP-RR-1	0%		0%		+1%
> > UDP-U-4k	-2%		0%		-2%
> 
> Pekka, for this test was SLUB or the page allocator handling the 4K
> allocations?

The page allocator. The pass-through revert is not in 2.6.29-rc6 and I
won't be sending it until 2.6.30 opens up.
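
To make that concrete for anyone skimming the thread: "pass-through" here means
SLUB handing sufficiently large kmalloc() requests straight to the page allocator
instead of serving them from a slab cache, so with the revert absent these 4K
buffers exercise exactly the page-allocator paths this series is tuning. The
fragment below is only a rough sketch of that idea using generic kernel
allocation calls; the PAGE_SIZE threshold and the helper names are illustrative
and are not SLUB's actual implementation.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>

/*
 * Illustrative sketch only: a "pass-through" style helper. Requests of
 * a page or more skip the slab layer and go straight to the page
 * allocator; smaller requests are served by kmalloc(). The threshold
 * and the function names are made up for the example.
 */
static void *passthrough_alloc(size_t size, gfp_t flags)
{
	if (size >= PAGE_SIZE)
		return (void *)__get_free_pages(flags, get_order(size));
	return kmalloc(size, flags);
}

static void passthrough_free(void *obj, size_t size)
{
	if (size >= PAGE_SIZE)
		free_pages((unsigned long)obj, get_order(size));
	else
		kfree(obj);
}

With a split like this, which allocator the 4K netperf buffers hit depends
entirely on that threshold check, which matches the point above that in
2.6.29-rc6 they are served by the page allocator.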

> 
> > UDP-U-1k	+3%		0%		0%
> > UDP-RR-1	0%		0%		0%
> > UDP-RR-512	-1%		0%		+1%
> > 
> > Lin Ming
> > 
> 
> Thanks a million for testing.
> 


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-02-26 11:18       ` Pekka Enberg
@ 2009-02-26 11:22         ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-02-26 11:22 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Lin Ming, Linux Memory Management List, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Zhang Yanmin, Peter Zijlstra

On Thu, Feb 26, 2009 at 01:18:59PM +0200, Pekka Enberg wrote:
> On Thu, 2009-02-26 at 11:03 +0000, Mel Gorman wrote:
> > On Thu, Feb 26, 2009 at 05:10:27PM +0800, Lin Ming wrote:
> > > We tested this v2 patch series with 2.6.29-rc6 on different machines.
> > > 
> > 
> > Wonderful, thanks.
> > 
> > > 		4P quad-core	2P quad-core	2P quad-core HT
> > > 		tigerton	stockley	Nehalem
> > > 		------------------------------------------------
> > > tbench		+3%		+2%		0%
> > 
> > Nice.
> > 
> > > oltp		-2%		0%		0%
> > 
> > This is a big disappointment and somewhat confusing that it is so
> > severe. For sysbench I was seeing on six different machines;
> > 
> > 	50834.14        51763.08    1.79%
> > 	61852.08        61966.58    0.18%
> > 	5935.98         5980.06     0.74%
> > 	29227.78        30167.72    3.12%
> > 	66702.67        66534.76   -0.25%
> > 	26643.18        26542.59   -0.38%
> > 
> > So, two smallish regressions but mainly gains. Then again, I'm becoming
> > more and more convinced that sysbench doesn't really represent a proper
> > OLTP workload.
> > 
> > I'd like to understand more how the page allocator at least was being used
> > during your tests. Would it be possible to get a full profile (including
> > instruction if possible and the vmlinux file) for both kernels please?
> > 
> > If you can get the profiles, confirm the regression is still there as
> > sometimes profiling can alter the outcome. Even if this happens, the
> > profile will tell me where time is being spent.
> > 
> > > aim7		0%		0%		0%
> > > specjbb2005	+3%		0%		0%
> > > hackbench	0%		0%		0%	
> > > 
> > > netperf:
> > > TCP-S-112k	0%		-1%		0%
> > > TCP-S-64k	0%		-1%		+1%
> > > TCP-RR-1	0%		0%		+1%
> > > UDP-U-4k	-2%		0%		-2%
> > 
> > Pekka, for this test was SLUB or the page allocator handling the 4K
> > allocations?
> 
> The page allocator. The pass-through revert is not in 2.6.29-rc6 and I
> won't be sending it until 2.6.30 opens up.
> 

In that case, Lin, could I also get the profiles for UDP-U-4K please so I
can see how time is being spent and why it might have gotten worse?

Thanks

> > 
> > > UDP-U-1k	+3%		0%		0%
> > > UDP-RR-1	0%		0%		0%
> > > UDP-RR-512	-1%		0%		+1%
> > > 
> > > Lin Ming
> > > 
> > 
> > Thanks a million for testing.
> > 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-02-26 11:22         ` Mel Gorman
@ 2009-02-26 12:27           ` Lin Ming
  -1 siblings, 0 replies; 118+ messages in thread
From: Lin Ming @ 2009-02-26 12:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Pekka Enberg, Lin Ming, Linux Memory Management List,
	Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Zhang Yanmin, Peter Zijlstra

On Thu, Feb 26, 2009 at 7:22 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> can see how time is being spent and why it might have gotten worse?

OK.
I'll do profiling tomorrow when I get back to work.

Lin Ming

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-02-26 11:18       ` Pekka Enberg
@ 2009-02-26 16:28         ` Christoph Lameter
  -1 siblings, 0 replies; 118+ messages in thread
From: Christoph Lameter @ 2009-02-26 16:28 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Mel Gorman, Lin Ming, Linux Memory Management List, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Zhang Yanmin, Peter Zijlstra

On Thu, 26 Feb 2009, Pekka Enberg wrote:

> > > UDP-U-4k	-2%		0%		-2%
> >
> > Pekka, for this test was SLUB or the page allocator handling the 4K
> > allocations?
>
> The page allocator. The pass-through revert is not in 2.6.29-rc6 and I
> won't be sending it until 2.6.30 opens up.

The page allocator will handle allocs >4k. 4k itself is already buffered
since we saw tbench regressions if we passed 4k through.
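
For illustration, the cutover being described works roughly like the
sketch below (simplified, not SLUB's actual code; kmalloc_sketch() and
slab_alloc_local() are made-up names for the two paths):

/*
 * Rough sketch only: requests strictly larger than one page go straight
 * to the page allocator, while order-0 (4k) requests stay buffered in
 * the slab layer.
 */
static void *kmalloc_sketch(size_t size, gfp_t flags)
{
        if (size > PAGE_SIZE)
                return (void *)__get_free_pages(flags, get_order(size));

        return slab_alloc_local(size, flags);   /* hypothetical slab-side helper */
}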


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-02-26 11:22         ` Mel Gorman
@ 2009-02-27  8:44           ` Lin Ming
  -1 siblings, 0 replies; 118+ messages in thread
From: Lin Ming @ 2009-02-27  8:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Pekka Enberg, Linux Memory Management List, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Zhang Yanmin, Peter Zijlstra

On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote: 
> In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> can see how time is being spent and why it might have gotten worse?

I have done the profiling (oltp and UDP-U-4K) with and without your v2
patches applied to 2.6.29-rc6.
I also enabled CONFIG_DEBUG_INFO so you can translate address to source
line with addr2line.
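
For reference, that translation step is just the stock binutils tool run
against the matching vmlinux, for example with one of the addresses that
appears later in this thread:

        addr2line -f -e vmlinux ffffffff802808a0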

You can download the oprofile data and vmlinux from below link,
http://www.filefactory.com/file/af2330b/

Lin Ming




^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-02-27  8:44           ` Lin Ming
@ 2009-03-02 11:21             ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-03-02 11:21 UTC (permalink / raw)
  To: Lin Ming
  Cc: Pekka Enberg, Linux Memory Management List, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Zhang Yanmin, Peter Zijlstra,
	Ingo Molnar

(Added Ingo as a second scheduler guy as there are queries on tg_shares_up)

On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote: 
> > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > can see how time is being spent and why it might have gotten worse?
> 
> I have done the profiling (oltp and UDP-U-4K) with and without your v2
> patches applied to 2.6.29-rc6.
> I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> line with addr2line.
> 
> You can download the oprofile data and vmlinux from below link,
> http://www.filefactory.com/file/af2330b/
> 

Perfect, thanks a lot for profiling this. It is a big help in figuring out
how the allocator is actually being used for your workloads.

The OLTP results had the following things to say about the page allocator.

Samples in the free path
	vanilla:	6207
	mg-v2:		4911
Samples in the allocation path
	vanilla		19948
	mg-v2:		14238

This is based on glancing at the following graphs and not counting the VM
counters as it can't be determined which samples are due to the allocator
and which are due to the rest of the VM accounting.

http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-vanilla-oltp.png
http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-mgv2-oltp.png

So the path costs are reduced in both cases. Whatever caused the regression
there doesn't appear to be in time spent in the allocator but due to
something else I haven't imagined yet. Other oddness

o According to the profile, something like 45% of time is spent entering
  the __alloc_pages_nodemask() function. Function entry costs but not
  that much. Another significant part appears to be in checking a simple
  mask. That doesn't make much sense to me so I don't know what to do with
  that information yet.

o In get_page_from_freelist(), 9% of the time is spent deleting a page
  from the freelist.

Neither of these make sense, we're not spending time where I would expect
to at all. One of two things are happening. Something like cache misses or
bounces are dominating for some reason that is specific to this machine. Cache
misses are one possibility that I'll check out. The other is that the sample
rate is too low and the profile counts are hence misleading.

Question 1: Would it be possible to increase the sample rate and track cache
misses as well please?

Another interesting fact is that about 15% of the overall time is spent in
tg_shares_up() for both kernels, but the vanilla kernel recorded 977348
samples and the patched kernel recorded 514576 samples. We are spending
less time in the kernel and it's not obvious why, or whether that is a good
thing or not. You'd think less time in the kernel is good, but it might
mean we are doing less work overall.

As a total aside from the page allocator, I checked what we were doing in
tg_shares_up(), where the vast majority of the time is being spent. This has
something to do with CONFIG_FAIR_GROUP_SCHED.

Question 2: Scheduler guys, can you think of what it means to be spending
less time in tg_shares_up please?

I don't know enough of how it works to guess why we are in there. FWIW,
we appear to be spending the most time in the following lines:

                weight = tg->cfs_rq[i]->load.weight;
                if (!weight)
                        weight = NICE_0_LOAD;

                tg->cfs_rq[i]->rq_weight = weight;
                rq_weight += weight;
                shares += tg->cfs_rq[i]->shares;

So.... cfs_rq is SMP aligned, but we iterate through it with for_each_cpu()
and we're writing to it. How often is this function run by multiple CPUs? If
the answer is "lots", does that not mean we are cache line bouncing in
here like mad? Another crazy amount of time is spent accessing tg->se when
validating. Basically, any access of the task_group appears to incur huge
costs and cache line bounces would be the obvious explanation.
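
To make the suspected pattern concrete, here is a small userspace toy
model (nothing to do with the real scheduler structures): every thread
walks the whole per-CPU-style array and writes each slot, the way the
for_each_cpu() iteration described above does, so the lines ping-pong
between caches even though each slot sits on its own cacheline.

#include <pthread.h>
#include <stdio.h>

#define NSLOTS          16              /* stand-in for NR_CPUS */
#define NTHREADS        4               /* concurrent callers of the update */
#define ITERS           100000

struct slot {
        long rq_weight;
} __attribute__((aligned(64)));         /* one cacheline per slot */

static struct slot slots[NSLOTS];

static void *update_all_slots(void *arg)
{
        int it, i;

        for (it = 0; it < ITERS; it++)
                for (i = 0; i < NSLOTS; i++)
                        slots[i].rq_weight++;   /* every thread writes every slot */
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, update_all_slots, NULL);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);

        /* increments are racy; the point is the cacheline traffic, not the count */
        printf("slot 0 ended at %ld\n", slots[0].rq_weight);
        return 0;
}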

More stupid poking around. We appear to update these share things on each
fork().

Question 3: Scheduler guys, if the database or clients being used for OLTP are
fork-based instead of thread-based, then we are going to be balancing a lot,
right? What does that mean, how can it be avoided?

Question 4: Lin, this is unrelated to the page allocator but do you know
what the performance difference between vanilla-with-group-sched and
vanilla-without-group-sched is?

The UDP results are screwy as the profiles are not matching up to the
images. For example

oltp.oprofile.2.6.29-rc6:           ffffffff802808a0 11022     0.1727  get_page_from_freelist
oltp.oprofile.2.6.29-rc6-mg-v2:     ffffffff80280610 7958      0.2403  get_page_from_freelist
UDP-U-4K.oprofile.2.6.29-rc6:       ffffffff802808a0 29914     1.2866  get_page_from_freelist
UDP-U-4K.oprofile.2.6.29-rc6-mg-v2: ffffffff802808a0 28153     1.1708  get_page_from_freelist

Look at the addresses. UDP-U-4K.oprofile.2.6.29-rc6-mg-v2 has the address
for UDP-U-4K.oprofile.2.6.29-rc6 so I have no idea what I'm looking at here
for the patched kernel :(.

Question 5: Lin, would it be possible to get whatever script you use for
running netperf so I can try reproducing it?

Going by the vanilla kernel, a *large* amount of time is spent doing
high-order allocations. Over 25% of the cost of buffered_rmqueue() is in
the branch dealing with high-order allocations. Does UDP-U-4K mean that 8K
pages are required for the packets? That means high-order allocations and
high contention on the zone lock. That is obviously bad and has implications
for the SLUB-passthru patch because whether 8K allocations are handled by
SL*B or the page allocator has a big impact on locking.

Next, a little over 50% of the cost of get_page_from_freelist() is being spent
acquiring the zone spinlock. The implication is that the SL*B allocators
passing in order-1 allocations to the page allocator are currently going to
hit scalability problems in a big way. The solution may be to extend the
per-cpu allocator to handle magazines up to PAGE_ALLOC_COSTLY_ORDER. I'll
check it out.
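
As a very rough sketch of that direction (all names and the layout here
are invented for illustration; sk_zone_pcp() stands in for however the
current CPU's magazine would be looked up, and this is not a patch):

/*
 * Per-cpu free lists per order up to PAGE_ALLOC_COSTLY_ORDER (3), so
 * that order-1 to order-3 requests can usually avoid zone->lock.
 */
#define SK_MAX_PCP_ORDER        3       /* PAGE_ALLOC_COSTLY_ORDER */

struct sk_per_cpu_pages {
        int count[SK_MAX_PCP_ORDER + 1];
        int high[SK_MAX_PCP_ORDER + 1];         /* when to drain back to buddy */
        int batch[SK_MAX_PCP_ORDER + 1];        /* bulk refill/drain size */
        struct list_head lists[SK_MAX_PCP_ORDER + 1];
};

static struct page *sk_rmqueue_pcp(struct zone *zone, unsigned int order)
{
        struct sk_per_cpu_pages *pcp = sk_zone_pcp(zone);      /* hypothetical */
        struct page *page;

        if (order > SK_MAX_PCP_ORDER)
                return NULL;            /* caller falls back to the zone->lock path */

        if (list_empty(&pcp->lists[order]))
                return NULL;            /* real code would bulk-refill under zone->lock */

        page = list_entry(pcp->lists[order].next, struct page, lru);
        list_del(&page->lru);
        pcp->count[order]--;
        return page;
}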

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-02 11:21             ` Mel Gorman
@ 2009-03-02 11:39               ` Nick Piggin
  -1 siblings, 0 replies; 118+ messages in thread
From: Nick Piggin @ 2009-03-02 11:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lin Ming, Pekka Enberg, Linux Memory Management List,
	Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Linux Kernel Mailing List, Zhang Yanmin,
	Peter Zijlstra, Ingo Molnar

On Mon, Mar 02, 2009 at 11:21:22AM +0000, Mel Gorman wrote:
> (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
> 
> On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote: 
> > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > can see how time is being spent and why it might have gotten worse?
> > 
> > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > patches applied to 2.6.29-rc6.
> > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > line with addr2line.
> > 
> > You can download the oprofile data and vmlinux from below link,
> > http://www.filefactory.com/file/af2330b/
> > 
> 
> Perfect, thanks a lot for profiling this. It is a big help in figuring out
> how the allocator is actually being used for your workloads.
> 
> The OLTP results had the following things to say about the page allocator.

Is this OLTP, or UDP-U-4K?

 
> Samples in the free path
> 	vanilla:	6207
> 	mg-v2:		4911
> Samples in the allocation path
> 	vanilla		19948
> 	mg-v2:		14238
> 
> This is based on glancing at the following graphs and not counting the VM
> counters as it can't be determined which samples are due to the allocator
> and which are due to the rest of the VM accounting.
> 
> http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-vanilla-oltp.png
> http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-mgv2-oltp.png
> 
> So the path costs are reduced in both cases. Whatever caused the regression
> there doesn't appear to be in time spent in the allocator but due to
> something else I haven't imagined yet. Other oddness
> 
> o According to the profile, something like 45% of time is spent entering
>   the __alloc_pages_nodemask() function. Function entry costs but not
>   that much. Another significant part appears to be in checking a simple
>   mask. That doesn't make much sense to me so I don't know what to do with
>   that information yet.
> 
> o In get_page_from_freelist(), 9% of the time is spent deleting a page
>   from the freelist.
> 
> Neither of these make sense, we're not spending time where I would expect
> to at all. One of two things are happening. Something like cache misses or
> bounces are dominating for some reason that is specific to this machine. Cache
> misses are one possibility that I'll check out. The other is that the sample
> rate is too low and the profile counts are hence misleading.
> 
> Question 1: Would it be possible to increase the sample rate and track cache
> misses as well please?

If the events are constantly biased, I don't think sample rate will
help. I don't know how the internals of profiling counters work exactly,
but you would expect that, yes, cache misses and stalls from any number of
different resources could put results in funny places.

Intel's OLTP workload is very sensitive to cacheline footprint of the
kernel, and if you touch some extra cachelines at point A, it can just
result in profile hits getting distributed all over the place. Profiling
cache misses might help, but probably see a similar phenomenon.

I can't remember, does your latest patchset include any patches that change
the possible order in which pages move around? Or is it just made up of
straight-line performance improvement of existing implementation?



^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-02 11:39               ` Nick Piggin
@ 2009-03-02 12:16                 ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-03-02 12:16 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Lin Ming, Pekka Enberg, Linux Memory Management List,
	Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Linux Kernel Mailing List, Zhang Yanmin,
	Peter Zijlstra, Ingo Molnar

On Mon, Mar 02, 2009 at 12:39:36PM +0100, Nick Piggin wrote:
> On Mon, Mar 02, 2009 at 11:21:22AM +0000, Mel Gorman wrote:
> > (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
> > 
> > On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote: 
> > > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > > can see how time is being spent and why it might have gotten worse?
> > > 
> > > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > > patches applied to 2.6.29-rc6.
> > > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > > line with addr2line.
> > > 
> > > You can download the oprofile data and vmlinux from below link,
> > > http://www.filefactory.com/file/af2330b/
> > > 
> > 
> > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > how the allocator is actually being used for your workloads.
> > 
> > The OLTP results had the following things to say about the page allocator.
> 
> Is this OLTP, or UDP-U-4K?
> 

OLTP. I didn't do a comparison for UDP due to uncertainty of what I was
looking at other than to note that high-order allocations may be a
bigger deal there.

>  
> > Samples in the free path
> > 	vanilla:	6207
> > 	mg-v2:		4911
> > Samples in the allocation path
> > 	vanilla		19948
> > 	mg-v2:		14238
> > 
> > This is based on glancing at the following graphs and not counting the VM
> > counters as it can't be determined which samples are due to the allocator
> > and which are due to the rest of the VM accounting.
> > 
> > http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-vanilla-oltp.png
> > http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-mgv2-oltp.png
> > 
> > So the path costs are reduced in both cases. Whatever caused the regression
> > there doesn't appear to be in time spent in the allocator but due to
> > something else I haven't imagined yet. Other oddness
> > 
> > o According to the profile, something like 45% of time is spent entering
> >   the __alloc_pages_nodemask() function. Function entry costs but not
> >   that much. Another significant part appears to be in checking a simple
> >   mask. That doesn't make much sense to me so I don't know what to do with
> >   that information yet.
> > 
> > o In get_page_from_freelist(), 9% of the time is spent deleting a page
> >   from the freelist.
> > 
> > Neither of these make sense, we're not spending time where I would expect
> > to at all. One of two things are happening. Something like cache misses or
> > bounces are dominating for some reason that is specific to this machine. Cache
> > misses are one possibility that I'll check out. The other is that the sample
> > rate is too low and the profile counts are hence misleading.
> > 
> > Question 1: Would it be possible to increase the sample rate and track cache
> > misses as well please?
> 
> If the events are constantly biased, I don't think sample rate will
> help. I don't know how the internals of profiling counters work exactly,
> but you would expect that, yes, cache misses and stalls from any number of
> different resources could put results in funny places.
> 

Ok, if it's stalls that are the real factor then yes, increasing the
sample rate might not help. However, the sample rates for instructions
were so low, I thought it might be a combination of both low sample
count and stalls happening at particular places. A profile of cache
misses will still be useful as it'll say in general if there is a marked
increase overall or not.

> Intel's OLTP workload is very sensitive to cacheline footprint of the
> kernel, and if you touch some extra cachelines at point A, it can just
> result in profile hits getting distributed all over the place. Profiling
> cache misses might help, but probably see a similar phenomenon.
> 

Interesting, this might put a hole in replacing the gfp_zone() with a
version that uses an additional (or maybe two depending on alignment)
cacheline.
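
For what it's worth, the table variant under discussion amounts to
something like the sketch below: the zone-selecting GFP bits index a
small read-only array, trading the branch chain for one extra shared
data cacheline. The flag values and table contents here are made up and
simplified; the real kernel encoding differs and depends on the config.

#define SK_DMA          0x1u
#define SK_HIGHMEM      0x2u
#define SK_MOVABLE      0x4u

enum sk_zone { SK_ZONE_DMA, SK_ZONE_NORMAL, SK_ZONE_HIGHMEM, SK_ZONE_MOVABLE };

/* roughly mirrors the if-chain in the branch-based gfp_zone() */
static const unsigned char sk_zone_table[8] = {
        [0]                                = SK_ZONE_NORMAL,
        [SK_DMA]                           = SK_ZONE_DMA,
        [SK_HIGHMEM]                       = SK_ZONE_HIGHMEM,
        [SK_MOVABLE]                       = SK_ZONE_NORMAL,
        [SK_DMA | SK_HIGHMEM]              = SK_ZONE_DMA,
        [SK_DMA | SK_MOVABLE]              = SK_ZONE_DMA,
        [SK_HIGHMEM | SK_MOVABLE]          = SK_ZONE_MOVABLE,
        [SK_DMA | SK_HIGHMEM | SK_MOVABLE] = SK_ZONE_DMA,
};

static inline enum sk_zone sk_gfp_zone(unsigned int flags)
{
        return sk_zone_table[flags & 0x7u];
}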

> I can't remember, does your latest patchset include any patches that change
> the possible order in which pages move around? Or is it just made up of
> straight-line performance improvement of existing implementation?
> 

It shouldn't affect order. I did a test a while ago to make sure pages
were still coming back in contiguous order as some IO cards depend on this
behaviour for performance. The intention for the first pass is a straight-line
performance improvement.
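
The kind of check referred to can be as simple as the sketch below
(illustrative only, not the actual test that was run, and it assumes a
kernel-module context):

/* Allocate a burst of order-0 pages and count how many consecutive
 * allocations were also physically adjacent, then free them. */
static void check_alloc_contiguity(int nr)
{
        struct page **pages;
        int i, contig = 0;

        pages = kcalloc(nr, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return;

        for (i = 0; i < nr; i++) {
                pages[i] = alloc_page(GFP_KERNEL);
                if (!pages[i])
                        break;
                if (i) {
                        unsigned long pfn = page_to_pfn(pages[i]);
                        unsigned long prev = page_to_pfn(pages[i - 1]);

                        if (pfn == prev + 1 || prev == pfn + 1)
                                contig++;
                }
        }

        printk(KERN_INFO "%d of %d consecutive allocations were adjacent\n",
               contig, i);

        while (i--)
                __free_page(pages[i]);
        kfree(pages);
}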

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-02 12:16                 ` Mel Gorman
@ 2009-03-03  4:42                   ` Nick Piggin
  -1 siblings, 0 replies; 118+ messages in thread
From: Nick Piggin @ 2009-03-03  4:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lin Ming, Pekka Enberg, Linux Memory Management List,
	Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Linux Kernel Mailing List, Zhang Yanmin,
	Peter Zijlstra, Ingo Molnar

On Mon, Mar 02, 2009 at 12:16:33PM +0000, Mel Gorman wrote:
> On Mon, Mar 02, 2009 at 12:39:36PM +0100, Nick Piggin wrote:
> > > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > > how the allocator is actually being used for your workloads.
> > > 
> > > The OLTP results had the following things to say about the page allocator.
> > 
> > Is this OLTP, or UDP-U-4K?
> > 
> 
> OLTP. I didn't do a comparison for UDP due to uncertainty of what I was
> looking at other than to note that high-order allocations may be a
> bigger deal there.

OK.


> > > Question 1: Would it be possible to increase the sample rate and track cache
> > > misses as well please?
> > 
> > If the events are constantly biased, I don't think sample rate will
> > help. I don't know how the internals of profiling counters work exactly,
> > but you would expect that, yes, cache misses and stalls from any number of
> > different resources could put results in funny places.
> > 
> 
> Ok, if it's stalls that are the real factor then yes, increasing the
> sample rate might not help. However, the sample rates for instructions
> were so low, I thought it might be a combination of both low sample
> count and stalls happening at particular places. A profile of cache
> misses will still be useful as it'll say in general if there is a marked
> increase overall or not.

OK.


> > Intel's OLTP workload is very sensitive to cacheline footprint of the
> > kernel, and if you touch some extra cachelines at point A, it can just
> > result in profile hits getting distributed all over the place. Profiling
> > cache misses might help, but probably see a similar phenomenon.
> > 
> 
> Interesting, this might put a hole in replacing the gfp_zone() with a
> version that uses an additional (or maybe two depending on alignment)
> cacheline.

Well... I still think it is probably a good idea. Firstly, it probably
saves a line of icache too. Secondly, I guess adding a
*single* extra readonly cacheline is probably not such a problem
even for this workload. I was more thinking of if you changed the
pattern in which pages are allocated (ie. like the hot/cold thing),
or if some change resulted in more cross-cpu operations then it
could result in worse cache efficiency.

But you never know, it might be one patch to look at.


> > I can't remember, does your latest patchset include any patches that change
> > the possible order in which pages move around? Or is it just made up of
> > straight-line performance improvement of existing implementation?
> > 
> 
> It shouldn't affect order. I did a test a while ago to make sure pages
> were still coming back in contiguous order as some IO cards depend on this
> behaviour for performance. The intention for the first pass is a straight-line
> performance improvement.

OK, but the dynamic behaviour too. Free page A, free page B, allocate page
A allocate page B etc.

The hot/cold removal would be an obvious example of what I mean, although
that wasn't included in this recent patchset anyway.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-03  4:42                   ` Nick Piggin
@ 2009-03-03  8:25                     ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-03-03  8:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Lin Ming, Pekka Enberg, Linux Memory Management List,
	Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Linux Kernel Mailing List, Zhang Yanmin,
	Peter Zijlstra, Ingo Molnar

On Tue, Mar 03, 2009 at 05:42:40AM +0100, Nick Piggin wrote:
> On Mon, Mar 02, 2009 at 12:16:33PM +0000, Mel Gorman wrote:
> > On Mon, Mar 02, 2009 at 12:39:36PM +0100, Nick Piggin wrote:
> > > > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > > > how the allocator is actually being used for your workloads.
> > > > 
> > > > The OLTP results had the following things to say about the page allocator.
> > > 
> > > Is this OLTP, or UDP-U-4K?
> > > 
> > 
> > OLTP. I didn't do a comparison for UDP due to uncertainty of what I was
> > looking at other than to note that high-order allocations may be a
> > bigger deal there.
> 
> OK.
> 
> > > > Question 1: Would it be possible to increase the sample rate and track cache
> > > > misses as well please?
> > > 
> > > If the events are constantly biased, I don't think sample rate will
> > > help. I don't know how the internals of profiling counters work exactly,
> > > but you would expect that, yes, cache misses and stalls from any number of
> > > different resources could put results in funny places.
> > > 
> > 
> > Ok, if it's stalls that are the real factor then yes, increasing the
> > sample rate might not help. However, the sample rates for instructions
> > were so low, I thought it might be a combination of both low sample
> > count and stalls happening at particular places. A profile of cache
> > misses will still be useful as it'll say in general if there is a marked
> > increase overall or not.
> 
> OK.
> 

As it turns out, my own tests here are showing increased cache misses so
I'm checking out why. One possibility is that the per-cpu structures are
increased in size to avoid a list search during allocation.

> 
> > > Intel's OLTP workload is very sensitive to cacheline footprint of the
> > > kernel, and if you touch some extra cachelines at point A, it can just
> > > result in profile hits getting distributed all over the place. Profiling
> > > cache misses might help, but probably see a similar phenomenon.
> > > 
> > 
> > Interesting, this might put a hole in replacing the gfp_zone() with a
> > version that uses an additional (or maybe two depending on alignment)
> > cacheline.
> 
> Well... I still think it is probably a good idea. Firstly is that
> it probably saves a line of icache too. Secondly, I guess adding a
> *single* extra readonly cacheline is probably not such a problem
> even for this workload. I was more thinking of if you changed the
> pattern in which pages are allocated (ie. like the hot/cold thing),

I need to think about it again, but I think the allocation/free pattern
should be more or less the same.

> or if some change resulted in more cross-cpu operations then it
> could result in worse cache efficiency.
> 

It occurred to me before sleeping last night that there could be a lot
of cross-cpu operations taking place in the buddy allocator itself. When
bulk-freeing pages, we have to examine all the buddies and merge them. In
the case of a freshly booted system, many of the pages of interest will be
within the same MAX_ORDER blocks. If multiple CPUs bulk free their pages,
they'll bounce the struct pages between each other a lot as we are writing
those cache lines. However, this cost would be incurred with or without
my patches.

It's an old observation about the buddy allocator that it can spend a lot
of its time merging and splitting buddies, but that was seen as a plain
computational cost. I'm not sure if the potential SMP-badness of it was
also considered.
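
As a rough illustration of where the writes land, here is a toy model of
the coalescing step (simplified, nothing like as careful as the real
__free_one_page()):

#define TOY_MAX_ORDER 11

struct toy_page {
	int free;
	unsigned int order;		/* valid only while the page is free */
};

/*
 * Merge the block at 'idx' upwards as far as possible.  Every merge step
 * reads and writes the buddy's struct page, and those are the cache lines
 * that get bounced when several CPUs bulk-free pages from the same
 * MAX_ORDER block.
 */
static unsigned int toy_free_one_page(struct toy_page *map,
				unsigned long idx, unsigned int order)
{
	while (order < TOY_MAX_ORDER - 1) {
		unsigned long buddy = idx ^ (1UL << order);

		if (!map[buddy].free || map[buddy].order != order)
			break;			/* buddy not free at this order */

		map[buddy].free = 0;		/* write: dirties the buddy's line */
		idx &= buddy;			/* index of the combined block */
		order++;
	}
	map[idx].free = 1;			/* write: head of the merged block */
	map[idx].order = order;
	return order;
}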

> But you never know, it might be one patch to look at.
> 

I'm shuffling the patches that might affect cache behaviour like this
towards the end of the series where they'll be easier to bisect.

> > > I can't remember, does your latest patchset include any patches that change
> > > the possible order in which pages move around? Or is it just made up of
> > > straight-line performance improvement of existing implementation?
> > > 
> > 
> > It shouldn't affect order. I did a test a while ago to make sure pages
> > were still coming back in contiguous order as some IO cards depend on this
> > behaviour for performance. The intention for the first pass is a straight-line
> > performance improvement.
> 
> OK, but the dynamic behaviour too. Free page A, free page B, allocate page
> A allocate page B etc.
> 
> The hot/cold removal would be an obvious example of what I mean, although
> that wasn't included in this recent patchset anyway.
> 

I get your point though, I'll keep it in mind. I've gone from plain
"reduce the clock cycles" to "reduce the cache misses"; if OLTP is
sensitive to this, it has to be addressed as well.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-03  8:25                     ` Mel Gorman
@ 2009-03-03  9:04                       ` Nick Piggin
  -1 siblings, 0 replies; 118+ messages in thread
From: Nick Piggin @ 2009-03-03  9:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lin Ming, Pekka Enberg, Linux Memory Management List,
	Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Linux Kernel Mailing List, Zhang Yanmin,
	Peter Zijlstra, Ingo Molnar

On Tue, Mar 03, 2009 at 08:25:12AM +0000, Mel Gorman wrote:
> On Tue, Mar 03, 2009 at 05:42:40AM +0100, Nick Piggin wrote:
> > or if some change resulted in more cross-cpu operations then it
> > could result in worse cache efficiency.
> > 
> 
> It occured to me before sleeping last night that there could be a lot
> of cross-cpu operations taking place in the buddy allocator itself. When
> bulk-freeing pages, we have to examine all the buddies and merge them. In
> the case of a freshly booted system, many of the pages of interest will be
> within the same MAX_ORDER blocks. If multiple CPUs bulk free their pages,
> they'll bounce the struct pages between each other a lot as we are writing
> those cache lines. However, this would be incurring with or without my patches.

Oh yes it would definitely be a factor I think.

 
> > OK, but the dynamic behaviour too. Free page A, free page B, allocate page
> > A allocate page B etc.
> > 
> > The hot/cold removal would be an obvious example of what I mean, although
> > that wasn't included in this recent patchset anyway.
> > 
> 
> I get your point though, I'll keep it in mind. I've gone from plain
> "reduce the clock cycles" to "reduce the cache misses" as if OLTP is
> sensitive to this it has to be addressed as well.

OK cool. The patchset did look pretty good for reducing clock cycles
though, so hopefully it turns out to be something simple.


* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-03  9:04                       ` Nick Piggin
@ 2009-03-03 13:51                         ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-03-03 13:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Lin Ming, Pekka Enberg, Linux Memory Management List,
	Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Linux Kernel Mailing List, Zhang Yanmin,
	Peter Zijlstra, Ingo Molnar

On Tue, Mar 03, 2009 at 10:04:42AM +0100, Nick Piggin wrote:
> On Tue, Mar 03, 2009 at 08:25:12AM +0000, Mel Gorman wrote:
> > On Tue, Mar 03, 2009 at 05:42:40AM +0100, Nick Piggin wrote:
> > > or if some change resulted in more cross-cpu operations then it
> > > could result in worse cache efficiency.
> > > 
> > 
> > It occured to me before sleeping last night that there could be a lot
> > of cross-cpu operations taking place in the buddy allocator itself. When
> > bulk-freeing pages, we have to examine all the buddies and merge them. In
> > the case of a freshly booted system, many of the pages of interest will be
> > within the same MAX_ORDER blocks. If multiple CPUs bulk free their pages,
> > they'll bounce the struct pages between each other a lot as we are writing
> > those cache lines. However, this would be incurring with or without my patches.
> 
> Oh yes it would definitely be a factor I think.
> 

It's on the list for a second or third pass to investigate.

>  
> > > OK, but the dynamic behaviour too. Free page A, free page B, allocate page
> > > A allocate page B etc.
> > > 
> > > The hot/cold removal would be an obvious example of what I mean, although
> > > that wasn't included in this recent patchset anyway.
> > > 
> > 
> > I get your point though, I'll keep it in mind. I've gone from plain
> > "reduce the clock cycles" to "reduce the cache misses" as if OLTP is
> > sensitive to this it has to be addressed as well.
> 
> OK cool. The patchset did look pretty good for reducing clock cycles
> though, so hopefully it turns out to be something simple.
> 

I'm hoping it is. I noticed a few oddities where we use more cache than we
need to, which I cleaned up. However, the strongest candidate for the
problem is actually the patch that removes the list search for a page of a
given migratetype in the allocation path. That fix simplifies the allocation
path but increases the complexity of the bulk-free path by quite a bit and
increases the number of cache lines that are accessed. Worse, it grows
the per-cpu structure from one cache line to two on x86-64 NUMA machines,
which I think is significant. I'm testing that at the moment but I might
end up dropping the patch from the first pass as a result and confining
the set to "obvious" wins.
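
For reference, a sketch of the size trade-off (an assumed field layout
for illustration, not the exact structures):

struct list_head { struct list_head *next, *prev; };

#define MIGRATE_TYPES 5

/*
 * One list: about 32 bytes, comfortably inside a single 64-byte cache
 * line, but allocation has to search it for a page of the wanted
 * migratetype.
 */
struct per_cpu_pages_single {
	int count;			/* pages on the list */
	int high;			/* high watermark before draining */
	int batch;			/* chunk size for buddy add/remove */
	struct list_head list;
};

/*
 * One list per migratetype: the search goes away, but at 5 * 16 bytes of
 * list heads the structure is roughly 96 bytes and spills into a second
 * cache line on x86-64.
 */
struct per_cpu_pages_split {
	int count;
	int high;
	int batch;
	struct list_head lists[MIGRATE_TYPES];
};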


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-02 11:21             ` Mel Gorman
@ 2009-03-03 16:31               ` Christoph Lameter
  -1 siblings, 0 replies; 118+ messages in thread
From: Christoph Lameter @ 2009-03-03 16:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lin Ming, Pekka Enberg, Linux Memory Management List,
	Rik van Riel, KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Zhang Yanmin, Peter Zijlstra,
	Ingo Molnar

On Mon, 2 Mar 2009, Mel Gorman wrote:

> Going by the vanilla kernel, a *large* amount of time is spent doing
> high-order allocations. Over 25% of the cost of buffered_rmqueue() is in
> the branch dealing with high-order allocations. Does UDP-U-4K mean that 8K
> pages are required for the packets? That means high-order allocations and
> high contention on the zone-list. That is bad obviously and has implications
> for the SLUB-passthru patch because whether 8K allocations are handled by
> SL*B or the page allocator has a big impact on locking.
>
> Next, a little over 50% of the cost get_page_from_freelist() is being spent
> acquiring the zone spinlock. The implication is that the SL*B allocators
> passing in order-1 allocations to the page allocator are currently going to
> hit scalability problems in a big way. The solution may be to extend the
> per-cpu allocator to handle magazines up to PAGE_ALLOC_COSTLY_ORDER. I'll
> check it out.

Then we are increasing the number of queues dramatically in the page
allocator. More of a memory sink. Less cache hotness.



* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-03 16:31               ` Christoph Lameter
@ 2009-03-03 21:48                 ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-03-03 21:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Lin Ming, Pekka Enberg, Linux Memory Management List,
	Rik van Riel, KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Zhang Yanmin, Peter Zijlstra,
	Ingo Molnar

On Tue, Mar 03, 2009 at 11:31:46AM -0500, Christoph Lameter wrote:
> On Mon, 2 Mar 2009, Mel Gorman wrote:
> 
> > Going by the vanilla kernel, a *large* amount of time is spent doing
> > high-order allocations. Over 25% of the cost of buffered_rmqueue() is in
> > the branch dealing with high-order allocations. Does UDP-U-4K mean that 8K
> > pages are required for the packets? That means high-order allocations and
> > high contention on the zone-list. That is bad obviously and has implications
> > for the SLUB-passthru patch because whether 8K allocations are handled by
> > SL*B or the page allocator has a big impact on locking.
> >
> > Next, a little over 50% of the cost get_page_from_freelist() is being spent
> > acquiring the zone spinlock. The implication is that the SL*B allocators
> > passing in order-1 allocations to the page allocator are currently going to
> > hit scalability problems in a big way. The solution may be to extend the
> > per-cpu allocator to handle magazines up to PAGE_ALLOC_COSTLY_ORDER. I'll
> > check it out.
> 
> Then we are increasing the number of queues dramatically in the page
> allocator. More of a memory sink. Less cache hotness.
> 

It doesn't have to be more queues. Based on a quick instrumentation,
networking is doing order-1 allocations, so we might be justified in doing
this to avoid contending excessively on the zone lock.

Without the patchset, we do a search of the pcp lists for a page of the
appropriate migrate type. There is a patch that removes this search at
the cost of one cache line per CPU and it works reasonably well.

However, if the search were left in place, you could add pages of other
orders and just search for those, which should be a lot less costly. Yes,
the search is unfortunate, but you avoid acquiring the zone lock without
increasing the size of the per-cpu structure. The search will touch cache
lines, but that's probably cheaper than acquiring the zone lock and going
through the whole buddy allocator for order-1 pages.
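
A sketch of what that search might look like (toy types, an assumed
illustration rather than an actual patch):

#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

struct toy_page {
	struct list_head lru;	/* first member, so the cast below is valid */
	unsigned int order;
	unsigned int migratetype;
};

/*
 * Walk the per-cpu list looking for a free page of the requested order
 * and migratetype.  A hit means the zone lock is never taken; a miss
 * falls back to the buddy lists under zone->lock as happens today.
 */
static struct toy_page *pcp_search(struct list_head *pcp_list,
				unsigned int order, unsigned int migratetype)
{
	struct list_head *pos;

	for (pos = pcp_list->next; pos != pcp_list; pos = pos->next) {
		struct toy_page *page = (struct toy_page *)pos;

		if (page->order == order && page->migratetype == migratetype)
			return page;
	}
	return NULL;
}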


* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-02 11:21             ` Mel Gorman
@ 2009-03-04  2:05               ` Zhang, Yanmin
  -1 siblings, 0 replies; 118+ messages in thread
From: Zhang, Yanmin @ 2009-03-04  2:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lin Ming, Pekka Enberg, Linux Memory Management List,
	Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Peter Zijlstra, Ingo Molnar

On Mon, 2009-03-02 at 11:21 +0000, Mel Gorman wrote:
> (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
> 
> On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote: 
> > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > can see how time is being spent and why it might have gotten worse?
> > 
> > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > patches applied to 2.6.29-rc6.
> > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > line with addr2line.
> > 
> > You can download the oprofile data and vmlinux from below link,
> > http://www.filefactory.com/file/af2330b/
> > 
> 
> Perfect, thanks a lot for profiling this. It is a big help in figuring out
> how the allocator is actually being used for your workloads.
> 
> The OLTP results had the following things to say about the page allocator.
In case we mislead you guys, I want to clarify that here OLTP means
sysbench (oltp)+mysql, not the famous OLTP benchmark which needs lots of
disks and big memory.

Ma Chinang, another Intel guy, works on running the famous OLTP benchmark.

> 
> Samples in the free path
> 	vanilla:	6207
> 	mg-v2:		4911
> Samples in the allocation path
> 	vanilla		19948
> 	mg-v2:		14238
> 
> This is based on glancing at the following graphs and not counting the VM
> counters as it can't be determined which samples are due to the allocator
> and which are due to the rest of the VM accounting.
> 
> http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-vanilla-oltp.png
> http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-mgv2-oltp.png
> 
> So the path costs are reduced in both cases. Whatever caused the regression
> there doesn't appear to be in time spent in the allocator but due to
> something else I haven't imagined yet. Other oddness
> 
> o According to the profile, something like 45% of time is spent entering
>   the __alloc_pages_nodemask() function. Function entry costs but not
>   that much. Another significant part appears to be in checking a simple
>   mask. That doesn't make much sense to me so I don't know what to do with
>   that information yet.
> 
> o In get_page_from_freelist(), 9% of the time is spent deleting a page
>   from the freelist.
> 
> Neither of these make sense, we're not spending time where I would expect
> to at all. One of two things are happening. Something like cache misses or
> bounces are dominating for some reason that is specific to this machine. Cache
> misses are one possibility that I'll check out. The other is that the sample
> rate is too low and the profile counts are hence misleading.
> 
> Question 1: Would it be possible to increase the sample rate and track cache
> misses as well please?
I will try to capture cache misses with oprofile.

> 
> Another interesting fact is that we are spending about 15% of the overall
> time is spent in tg_shares_up() for both kernels but the vanilla kernel
> recorded 977348 samples and the patched kernel recorded 514576 samples. We
> are spending less time in the kernel and it's not obvious why or if that is
> a good thing or not. You'd think less time in kernel is good but it might
> mean we are doing less work overall.
> 
> Total aside from the page allocator, I checked what we were doing
> in tg_shares_up where the vast amount of time is being spent. This has
> something to do with CONFIG_FAIR_GROUP_SCHED. 
> 
> Question 2: Scheduler guys, can you think of what it means to be spending
> less time in tg_shares_up please?
> 
> I don't know enough of how it works to guess why we are in there. FWIW,
> we are appear to be spending the most time in the following lines
> 
>                 weight = tg->cfs_rq[i]->load.weight;
>                 if (!weight)
>                         weight = NICE_0_LOAD;
> 
>                 tg->cfs_rq[i]->rq_weight = weight;
>                 rq_weight += weight;
>                 shares += tg->cfs_rq[i]->shares;
> 
> So.... cfs_rq is SMP aligned, but we iterate though it with for_each_cpu()
> and we're writing to it. How often is this function run by multiple CPUs? If
> the answer is "lots", does that not mean we are cache line bouncing in
> here like mad? Another crazy amount of time is spent accessing tg->se when
> validating. Basically, any access of the task_group appears to incur huge
> costs and cache line bounces would be the obvious explanation.
FAIR_GROUP_SCHED is a feature to support configurable cpu weight for different users.
We did find it takes lots of time to check/update the share weights, which might create
lots of cache ping-pong. With sysbench(oltp)+mysql, that becomes more severe because
mysql runs as user mysql and sysbench runs as another regular user. When starting
the test with 1 thread on the command line, there are 2 mysql threads and 1 sysbench
thread active.

> 
> More stupid poking around. We appear to update these share things on each
> fork().
> 
> Question 3: Scheduler guys, If the database or clients being used for OLTP is
> fork-based instead of thread-based, then we are going to be balancing a lot,
> right? What does that mean, how can it be avoided?
> 
> Question 4: Lin, this is unrelated to the page allocator but do you know
> what the performance difference between vanilla-with-group-sched and
> vanilla-without-group-sched is?
When FAIR_GROUP_SCHED first appeared in the kernel, we did a lot of such testing.
There is another thread discussing it at http://lkml.org/lkml/2008/9/10/214.

Setting sched_shares_ratelimit to a large value could reduce the regression.

The scheduler guys keep improving it.

> 
> The UDP results are screwy as the profiles are not matching up to the
> images. For example
Mostly, it's caused by not cleaning up old oprofile data when starting
new sampling.

I will retry.

> 
> oltp.oprofile.2.6.29-rc6:           ffffffff802808a0 11022     0.1727  get_page_from_freelist
> oltp.oprofile.2.6.29-rc6-mg-v2:     ffffffff80280610 7958      0.2403  get_page_from_freelist
> UDP-U-4K.oprofile.2.6.29-rc6:       ffffffff802808a0 29914     1.2866  get_page_from_freelist
> UDP-U-4K.oprofile.2.6.29-rc6-mg-v2: ffffffff802808a0 28153     1.1708  get_page_from_freelist
> 
> Look at the addresses. UDP-U-4K.oprofile.2.6.29-rc6-mg-v2 has the address
> for UDP-U-4K.oprofile.2.6.29-rc6 so I have no idea what I'm looking at here
> for the patched kernel :(.
> 
> Question 5: Lin, would it be possible to get whatever script you use for
> running netperf so I can try reproducing it?
Below is a simple script. For formal testing, we add the parameters "-i 50,3 -I 99,5"
to get a more stable result.

PROG_DIR=/home/ymzhang/test/netperf/src
taskset -c 0 ${PROG_DIR}/netserver
sleep 2
taskset -c 7 ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096
killall netserver


Basically, we start 1 client and bind the client and server to different physical cpus.

> 
> Going by the vanilla kernel, a *large* amount of time is spent doing
> high-order allocations. Over 25% of the cost of buffered_rmqueue() is in
> the branch dealing with high-order allocations. Does UDP-U-4K mean that 8K
> pages are required for the packets? That means high-order allocations and
> high contention on the zone-list. That is bad obviously and has implications
> for the SLUB-passthru patch because whether 8K allocations are handled by
> SL*B or the page allocator has a big impact on locking.
> 
> Next, a little over 50% of the cost get_page_from_freelist() is being spent
> acquiring the zone spinlock. The implication is that the SL*B allocators
> passing in order-1 allocations to the page allocator are currently going to
> hit scalability problems in a big way. The solution may be to extend the
> per-cpu allocator to handle magazines up to PAGE_ALLOC_COSTLY_ORDER. I'll
> check it out.
> 



* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-04  2:05               ` Zhang, Yanmin
@ 2009-03-04  7:23                 ` Peter Zijlstra
  -1 siblings, 0 replies; 118+ messages in thread
From: Peter Zijlstra @ 2009-03-04  7:23 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Mel Gorman, Lin Ming, Pekka Enberg, Linux Memory Management List,
	Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Ingo Molnar

On Wed, 2009-03-04 at 10:05 +0800, Zhang, Yanmin wrote:
> FAIR_GROUP_SCHED is a feature to support configurable cpu weight for different users.
> We did find it takes lots of time to check/update the share weight which might create
> lots of cache ping-pang. With sysbench(oltp)+mysql, that becomes more severe because
> mysql runs as user mysql and sysbench runs as another regular user. When starting
> the testing with 1 thread in command line, there are 2 mysql threads and 1 sysbench
> thread are proactive.

cgroup-based group scheduling doesn't bother with users. So unless you
create sched-cgroups, you should all be in the same (root) group.



* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-04  7:23                 ` Peter Zijlstra
@ 2009-03-04  8:31                   ` Zhang, Yanmin
  -1 siblings, 0 replies; 118+ messages in thread
From: Zhang, Yanmin @ 2009-03-04  8:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Lin Ming, Pekka Enberg, Linux Memory Management List,
	Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Ingo Molnar

On Wed, 2009-03-04 at 08:23 +0100, Peter Zijlstra wrote:
> On Wed, 2009-03-04 at 10:05 +0800, Zhang, Yanmin wrote:
> > FAIR_GROUP_SCHED is a feature to support configurable cpu weight for different users.
> > We did find it takes lots of time to check/update the share weight which might create
> > lots of cache ping-pang. With sysbench(oltp)+mysql, that becomes more severe because
> > mysql runs as user mysql and sysbench runs as another regular user. When starting
> > the testing with 1 thread in command line, there are 2 mysql threads and 1 sysbench
> > thread are proactive.
> 
> cgroup based group scheduling doesn't bother with users. So unless you
> create sched-cgroups your should all be in the same (root) group.

I disabled CGROUP but enabled GROUP_SCHED and USER_SCHED. My config is
inherited from old config files.

CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_USER_SCHED=y
# CONFIG_CGROUP_SCHED is not set

I checked the x86-64 defconfig of 2.6.28 and it does enable CGROUP and disable USER_SCHED.

Perhaps I need to change my latest config file to the defaults for the sched options.




* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-04  2:05               ` Zhang, Yanmin
@ 2009-03-04  9:07                 ` Nick Piggin
  -1 siblings, 0 replies; 118+ messages in thread
From: Nick Piggin @ 2009-03-04  9:07 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Mel Gorman, Lin Ming, Pekka Enberg, Linux Memory Management List,
	Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Linux Kernel Mailing List, Peter Zijlstra,
	Ingo Molnar

On Wed, Mar 04, 2009 at 10:05:07AM +0800, Zhang, Yanmin wrote:
> On Mon, 2009-03-02 at 11:21 +0000, Mel Gorman wrote:
> > (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
> > 
> > On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote: 
> > > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > > can see how time is being spent and why it might have gotten worse?
> > > 
> > > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > > patches applied to 2.6.29-rc6.
> > > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > > line with addr2line.
> > > 
> > > You can download the oprofile data and vmlinux from below link,
> > > http://www.filefactory.com/file/af2330b/
> > > 
> > 
> > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > how the allocator is actually being used for your workloads.
> > 
> > The OLTP results had the following things to say about the page allocator.
> In case we might mislead you guys, I want to clarify that here OLTP is
> sysbench (oltp)+mysql, not the famous OLTP which needs lots of disks and big
> memory.
> 
> Ma Chinang, another Intel guy, does work on the famous OLTP running.

OK, so my comments WRT cache sensitivity probably don't apply here, but
cache hotness of pages coming out of the allocator might still be
important for this one.

How many runs of these tests are you doing? Do you have fairly high
confidence that the changes are significant?



* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-04  2:05               ` Zhang, Yanmin
@ 2009-03-04 18:04                 ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-03-04 18:04 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Lin Ming, Pekka Enberg, Linux Memory Management List,
	Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Peter Zijlstra, Ingo Molnar

On Wed, Mar 04, 2009 at 10:05:07AM +0800, Zhang, Yanmin wrote:
> On Mon, 2009-03-02 at 11:21 +0000, Mel Gorman wrote:
> > (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
> > 
> > On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote: 
> > > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > > can see how time is being spent and why it might have gotten worse?
> > > 
> > > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > > patches applied to 2.6.29-rc6.
> > > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > > line with addr2line.
> > > 
> > > You can download the oprofile data and vmlinux from below link,
> > > http://www.filefactory.com/file/af2330b/
> > > 
> > 
> > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > how the allocator is actually being used for your workloads.
> > 
> > The OLTP results had the following things to say about the page allocator.
>
> In case we might mislead you guys, I want to clarify that here OLTP is
> sysbench (oltp)+mysql, not the famous OLTP which needs lots of disks and big
> memory.
> 

Ah good. I'm testing with sysbench+postgres and I've seen similar
regressions on some machines so I have something to investigate.

> Ma Chinang, another Intel guy, does work on the famous OLTP running.
> 

Good to know. It's too early to test anywhere remotely near there, but when
this is ready for merging, a run on that setup would be really nice if time
was available.

> > <SNIP>
> > Question 1: Would it be possible to increase the sample rate and track cache
> > misses as well please?
>
> I will try to capture cache miss with oprofile.
> 

Great, thanks. I did a cache miss capture for one of the machines and
noted that cache misses increased, but it'd still be good to know.
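
Something along these lines should do for the capture (the event name and
sample count are only placeholders as both are CPU-specific; the --reset
avoids mixing in stale samples from earlier runs):

    opcontrol --reset
    opcontrol --setup --vmlinux=/path/to/vmlinux --event=LLC_MISSES:6000
    opcontrol --start
    # run the benchmark here
    opcontrol --stop
    opcontrol --dump
    opreport -l /path/to/vmlinux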

> > Another interesting fact is that we are spending about 15% of the overall
> > time in tg_shares_up() for both kernels, but the vanilla kernel
> > recorded 977348 samples and the patched kernel recorded 514576 samples. We
> > are spending less time in the kernel and it's not obvious why or if that is
> > a good thing or not. You'd think less time in the kernel is good but it might
> > mean we are doing less work overall.
> > 
> > As a total aside from the page allocator, I checked what we were doing
> > in tg_shares_up() where the vast majority of the time is being spent. This has
> > something to do with CONFIG_FAIR_GROUP_SCHED.
> > 
> > Question 2: Scheduler guys, can you think of what it means to be spending
> > less time in tg_shares_up please?
> > 
> > I don't know enough about how it works to guess why we are in there. FWIW,
> > we appear to be spending the most time in the following lines
> > 
> >                 weight = tg->cfs_rq[i]->load.weight;
> >                 if (!weight)
> >                         weight = NICE_0_LOAD;
> > 
> >                 tg->cfs_rq[i]->rq_weight = weight;
> >                 rq_weight += weight;
> >                 shares += tg->cfs_rq[i]->shares;
> > 
> > So.... cfs_rq is SMP aligned, but we iterate through it with for_each_cpu()
> > and we're writing to it. How often is this function run by multiple CPUs? If
> > the answer is "lots", does that not mean we are cache line bouncing in
> > here like mad? Another crazy amount of time is spent accessing tg->se when
> > validating. Basically, any access of the task_group appears to incur huge
> > costs and cache line bounces would be the obvious explanation.
>
> FAIR_GROUP_SCHED is a feature to support configurable cpu weight for different users.
> We did find it takes lots of time to check/update the share weight, which might create
> lots of cache ping-pong. With sysbench(oltp)+mysql, that becomes more severe because
> mysql runs as user mysql and sysbench runs as another regular user. When starting
> the testing with 1 thread on the command line, there are 2 mysql threads and 1 sysbench
> thread active.
> 

Very interesting, I don't think this will affect the page allocator but
I'll keep it in mind when worrying about the workload as a whole instead
of just one corner of it.

> > 
> > 
> > More stupid poking around. We appear to update these share things on each
> > fork().
> > 
> > Question 3: Scheduler guys, If the database or clients being used for OLTP is
> > fork-based instead of thread-based, then we are going to be balancing a lot,
> > right? What does that mean, how can it be avoided?
> > 
> > Question 4: Lin, this is unrelated to the page allocator but do you know
> > what the performance difference between vanilla-with-group-sched and
> > vanilla-without-group-sched is?
>
> When FAIR_GROUP_SCHED first appeared in the kernel, we did a lot of such testing.
> There is another thread discussing it at http://lkml.org/lkml/2008/9/10/214.
> 
> Setting sched_shares_ratelimit to a large value could reduce the regression.
> 
> Scheduler guys keep improving it.
> 

Good to know. I haven't read the thread yet but it's now on my TODO
list.

> > The UDP results are screwy as the profiles are not matching up to the
> > images. For example
> Mostly, it's caused by not cleaning up old oprofile data when starting
> new sampling.
> 
> I will retry.
> 

Thanks
> > 
> > oltp.oprofile.2.6.29-rc6:           ffffffff802808a0 11022     0.1727  get_page_from_freelist
> > oltp.oprofile.2.6.29-rc6-mg-v2:     ffffffff80280610 7958      0.2403  get_page_from_freelist
> > UDP-U-4K.oprofile.2.6.29-rc6:       ffffffff802808a0 29914     1.2866  get_page_from_freelist
> > UDP-U-4K.oprofile.2.6.29-rc6-mg-v2: ffffffff802808a0 28153     1.1708  get_page_from_freelist
> > 
> > Look at the addresses. UDP-U-4K.oprofile.2.6.29-rc6-mg-v2 has the address
> > for UDP-U-4K.oprofile.2.6.29-rc6 so I have no idea what I'm looking at here
> > for the patched kernel :(.
> > 
> > Question 5: Lin, would it be possible to get whatever script you use for
> > running netperf so I can try reproducing it?

> Below is a simple script. As for formal testing, we add the parameters "-i 50,3 -I 99,5"
> to get a more stable result.
> 
> PROG_DIR=/home/ymzhang/test/netperf/src
> taskset -c 0 ${PROG_DIR}/netserver
> sleep 2
> taskset -c 7 ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096
> killall netserver
> 

Thanks, simple is good enough to start with. Just have to get around to
wrapping the automation around it.
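
Something like the following sweep is roughly what I have in mind; the
message sizes are illustrative and the rest is lifted straight from your
script, plus the -i/-I confidence options you mentioned for stability:

    #!/bin/bash
    # sweep UDP_STREAM message sizes with client/server on different CPUs
    PROG_DIR=/home/ymzhang/test/netperf/src
    taskset -c 0 ${PROG_DIR}/netserver
    sleep 2
    for size in 64 1024 4096 8192 16384; do
        taskset -c 7 ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \
            -i 50,3 -I 99,5 -- -P 15895 12391 -s 32768 -S 32768 -m ${size}
    done
    killall netserver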

> Basically, we start 1 client and bind client/server to different physical cpu.
> 
> > 
> > Going by the vanilla kernel, a *large* amount of time is spent doing
> > high-order allocations. Over 25% of the cost of buffered_rmqueue() is in
> > the branch dealing with high-order allocations. Does UDP-U-4K mean that 8K
> > pages are required for the packets? That means high-order allocations and
> > high contention on the zone-list. That is bad obviously and has implications
> > for the SLUB-passthru patch because whether 8K allocations are handled by
> > SL*B or the page allocator has a big impact on locking.
> > 
> > Next, a little over 50% of the cost of get_page_from_freelist() is being spent
> > acquiring the zone spinlock. The implication is that the SL*B allocators
> > passing in order-1 allocations to the page allocator are currently going to
> > hit scalability problems in a big way. The solution may be to extend the
> > per-cpu allocator to handle magazines up to PAGE_ALLOC_COSTLY_ORDER. I'll
> > check it out.
> > 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-04  9:07                 ` Nick Piggin
@ 2009-03-05  1:56                   ` Zhang, Yanmin
  -1 siblings, 0 replies; 118+ messages in thread
From: Zhang, Yanmin @ 2009-03-05  1:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mel Gorman, Lin Ming, Pekka Enberg, Linux Memory Management List,
	Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Linux Kernel Mailing List, Peter Zijlstra,
	Ingo Molnar

On Wed, 2009-03-04 at 10:07 +0100, Nick Piggin wrote:
> On Wed, Mar 04, 2009 at 10:05:07AM +0800, Zhang, Yanmin wrote:
> > On Mon, 2009-03-02 at 11:21 +0000, Mel Gorman wrote:
> > > (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
> > > 
> > > On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > > > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote: 
> > > > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > > > can see how time is being spent and why it might have gotten worse?
> > > > 
> > > > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > > > patches applied to 2.6.29-rc6.
> > > > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > > > line with addr2line.
> > > > 
> > > > You can download the oprofile data and vmlinux from below link,
> > > > http://www.filefactory.com/file/af2330b/
> > > > 
> > > 
> > > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > > how the allocator is actually being used for your workloads.
> > > 
> > > The OLTP results had the following things to say about the page allocator.
> > In case we might mislead you guys, I want to clarify that here OLTP is
> > sysbench (oltp)+mysql, not the famous OLTP which needs lots of disks and big
> > memory.
> > 
> > Ma Chinang, another Intel guy, does work on the famous OLTP running.
> 
> OK, so my comments WRT cache sensitivity probably don't apply here,
> but probably cache hotness of pages coming out of the allocator
> might still be important for this one.
Yes. We need to check it.

> 
> How many runs are you doing of these tests?
We start sysbench with different thread numbers, for example 8 12 16 32 64 128 for
4*4 tigerton, then get an average value in case there might be a scalability issue.
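
Roughly like this (just a sketch; the mysql connection options and the rest
of our sysbench oltp parameters are omitted here):

    for t in 8 12 16 32 64 128; do
        sysbench --test=oltp --num-threads=$t run
    done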

As for this sysbench oltp testing, we reran it 7 times on tigerton this week and
found the results have fluctuations. Now we can only say there is a trend that
the result with the patches is a little worse than the one without the patches.

>  Do you have a fairly high
> confidence that the changes are significant?
2% isn't significant on sysbench oltp.

yanmin



^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-05  1:56                   ` Zhang, Yanmin
@ 2009-03-05 10:34                     ` Ingo Molnar
  -1 siblings, 0 replies; 118+ messages in thread
From: Ingo Molnar @ 2009-03-05 10:34 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Nick Piggin, Mel Gorman, Lin Ming, Pekka Enberg,
	Linux Memory Management List, Rik van Riel, KOSAKI Motohiro,
	Christoph Lameter, Johannes Weiner, Linux Kernel Mailing List,
	Peter Zijlstra


* Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:

> On Wed, 2009-03-04 at 10:07 +0100, Nick Piggin wrote:
> > On Wed, Mar 04, 2009 at 10:05:07AM +0800, Zhang, Yanmin wrote:
> > > On Mon, 2009-03-02 at 11:21 +0000, Mel Gorman wrote:
> > > > (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
> > > > 
> > > > On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > > > > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote: 
> > > > > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > > > > can see how time is being spent and why it might have gotten worse?
> > > > > 
> > > > > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > > > > patches applied to 2.6.29-rc6.
> > > > > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > > > > line with addr2line.
> > > > > 
> > > > > You can download the oprofile data and vmlinux from below link,
> > > > > http://www.filefactory.com/file/af2330b/
> > > > > 
> > > > 
> > > > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > > > how the allocator is actually being used for your workloads.
> > > > 
> > > > The OLTP results had the following things to say about the page allocator.
> > > In case we might mislead you guys, I want to clarify that here OLTP is
> > > sysbench (oltp)+mysql, not the famous OLTP which needs lots of disks and big
> > > memory.
> > > 
> > > Ma Chinang, another Intel guy, does work on the famous OLTP running.
> > 
> > OK, so my comments WRT cache sensitivity probably don't apply here,
> > but probably cache hotness of pages coming out of the allocator
> > might still be important for this one.
> Yes. We need to check it.
> 
> > 
> > How many runs are you doing of these tests?
> We start sysbench with different thread number, for example, 8 12 16 32 64 128 for
> 4*4 tigerton, then get an average value in case there might be a scalability issue.
> 
> As for this sysbench oltp testing, we reran it 7 times on 
> tigerton this week and found the results have fluctuations. 
> Now we can only say there is a trend that the result with 
> the patches is a little worse than the one without the 
> patches.

Could you try "perfstat -s" perhaps and see whether any other of 
the metrics outside of tx/sec has less natural noise?

I think a more invariant number might be the ratio of "LLC 
cachemisses" divided by "CPU migrations".

The fluctuation in tx/sec comes from threads bouncing - but you 
can normalize that away by using the cachemisses/migrations 
ratio.

Perhaps. It's definitely a difficult thing to measure.

	Ingo

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-05 10:34                     ` Ingo Molnar
@ 2009-03-06  8:33                       ` Lin Ming
  -1 siblings, 0 replies; 118+ messages in thread
From: Lin Ming @ 2009-03-06  8:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zhang, Yanmin, Nick Piggin, Mel Gorman, Pekka Enberg,
	Linux Memory Management List, Rik van Riel, KOSAKI Motohiro,
	Christoph Lameter, Johannes Weiner, Linux Kernel Mailing List,
	Peter Zijlstra

On Thu, 2009-03-05 at 18:34 +0800, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> 
> > On Wed, 2009-03-04 at 10:07 +0100, Nick Piggin wrote:
> > > On Wed, Mar 04, 2009 at 10:05:07AM +0800, Zhang, Yanmin wrote:
> > > > On Mon, 2009-03-02 at 11:21 +0000, Mel Gorman wrote:
> > > > > (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
> > > > > 
> > > > > On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > > > > > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote: 
> > > > > > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > > > > > can see how time is being spent and why it might have gotten worse?
> > > > > > 
> > > > > > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > > > > > patches applied to 2.6.29-rc6.
> > > > > > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > > > > > line with addr2line.
> > > > > > 
> > > > > > You can download the oprofile data and vmlinux from below link,
> > > > > > http://www.filefactory.com/file/af2330b/
> > > > > > 
> > > > > 
> > > > > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > > > > how the allocator is actually being used for your workloads.
> > > > > 
> > > > > The OLTP results had the following things to say about the page allocator.
> > > > In case we might mislead you guys, I want to clarify that here OLTP is
> > > > sysbench (oltp)+mysql, not the famous OLTP which needs lots of disks and big
> > > > memory.
> > > > 
> > > > Ma Chinang, another Intel guy, does work on the famous OLTP running.
> > > 
> > > OK, so my comments WRT cache sensitivity probably don't apply here,
> > > but probably cache hotness of pages coming out of the allocator
> > > might still be important for this one.
> > Yes. We need to check it.
> > 
> > > 
> > > How many runs are you doing of these tests?
> > We start sysbench with different thread number, for example, 8 12 16 32 64 128 for
> > 4*4 tigerton, then get an average value in case there might be a scalability issue.
> > 
> > As for this sysbench oltp testing, we reran it 7 times on 
> > tigerton this week and found the results have fluctuations. 
> > Now we can only say there is a trend that the result with 
> > the patches is a little worse than the one without the 
> > patches.
> 
> Could you try "perfstat -s" perhaps and see whether any other of 
> the metrics outside of tx/sec has less natural noise?

Thanks, I have used "perfstat -s" to collect cache misses data.

2.6.29-rc7-tip: tip/perfcounters/core (b5e8acf)
2.6.29-rc7-tip-mg2: v2 patches applied to tip/perfcounters/core

I collected netperf UDP-U-4k data 5 times with and without the mg-v2 patches
applied to tip/perfcounters/core on a 4p quad-core tigerton machine, as
below. "value" means the UDP-U-4k test result.

2.6.29-rc7-tip
---------------
value           cache misses    CPU migrations  cachemisses/migrations
5329.71          391094656       1710            228710
5641.59          239552767       2138            112045
5580.87          132474745       2172            60992
5547.19          86911457        2099            41406
5626.38          196751217       2050            95976

2.6.29-rc7-tip-mg2
-------------------
value           cache misses    CPU migrations  cachemisses/migrations
4749.80          649929463       1132            574142
4327.06          484100170       1252            386661
4649.51          374201508       1489            251310
5655.82          405511551       1848            219432
5571.58          90222256        2159            41788
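
The last column is simply the cache misses divided by the CPU migrations,
e.g. for the first 2.6.29-rc7-tip run:

    echo $((391094656 / 1710))    # prints 228710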

Lin Ming

> 
> I think a more invariant number might be the ratio of "LLC 
> cachemisses" divided by "CPU migrations".
> 
> The fluctuation in tx/sec comes from threads bouncing - but you 
> can normalize that away by using the cachemisses/migrations 
> ratio.
> 
> Perhaps. It's definitely a difficult thing to measure.
> 
> 	Ingo


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-06  8:33                       ` Lin Ming
@ 2009-03-06  9:39                         ` Ingo Molnar
  -1 siblings, 0 replies; 118+ messages in thread
From: Ingo Molnar @ 2009-03-06  9:39 UTC (permalink / raw)
  To: Lin Ming
  Cc: Zhang, Yanmin, Nick Piggin, Mel Gorman, Pekka Enberg,
	Linux Memory Management List, Rik van Riel, KOSAKI Motohiro,
	Christoph Lameter, Johannes Weiner, Linux Kernel Mailing List,
	Peter Zijlstra


* Lin Ming <ming.m.lin@intel.com> wrote:

> Thanks, I have used "perfstat -s" to collect cache misses 
> data.
> 
> 2.6.29-rc7-tip: tip/perfcounters/core (b5e8acf)
> 2.6.29-rc7-tip-mg2: v2 patches applied to tip/perfcounters/core
> 
> I collected 5 times netperf UDP-U-4k data with and without 
> mg-v2 patches applied to tip/perfcounters/core on a 4p 
> quad-core tigerton machine, as below "value" means UDP-U-4k 
> test result.
> 
> 2.6.29-rc7-tip
> ---------------
> value           cache misses    CPU migrations  cachemisses/migrations
> 5329.71          391094656       1710            228710
> 5641.59          239552767       2138            112045
> 5580.87          132474745       2172            60992
> 5547.19          86911457        2099            41406
> 5626.38          196751217       2050            95976
> 
> 2.6.29-rc7-tip-mg2
> -------------------
> value           cache misses    CPU migrations  cachemisses/migrations
> 4749.80          649929463       1132            574142
> 4327.06          484100170       1252            386661
> 4649.51          374201508       1489            251310
> 5655.82          405511551       1848            219432
> 5571.58          90222256        2159            41788
> 
> Lin Ming

Hm, these numbers look really interesting and give us insight 
into this workload. The workload is fluctuating but by measuring 
3 metrics at once instead of just one we see the following 
patterns:

 - Fewer CPU migrations mean more cache misses and lower 
   performance.

The lowest-score runs had the lowest CPU migrations count, 
coupled with a high amount of cachemisses.

This _probably_ means that in this workload migrations are 
desired: the sooner two related tasks migrate to the same CPU 
the better. If they stay separate (migration count is low) then 
they interact with each other from different CPUs, creating a 
lot of cachemisses and reducing performance.

You can reduce the migration barrier of the system by enabling 
CONFIG_SCHED_DEBUG=y and setting sched_migration_cost to zero:

   echo 0 > /proc/sys/kernel/sched_migration_cost

This will hurt other workloads - but if this improves the 
numbers then it proves that what this particular workload wants 
is easy migrations.

Now the question is, why does the mg2 patchset reduce the number 
of migrations? It might not be an inherent property of the mg2 
patches: maybe just unlucky timings push the workload across 
sched_migration_cost.

Setting sched_migration_cost to either zero or to a very high 
value and repeating the test will eliminate this source of noise 
and will tell us about other properties of the mg2 patchset.
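
A quick sketch of that experiment (the high value is arbitrary, and the 
wrapper script name is only a placeholder for however the UDP-U-4K run is 
driven):

   old=$(cat /proc/sys/kernel/sched_migration_cost)

   echo 0 > /proc/sys/kernel/sched_migration_cost
   ./run-udp-u-4k.sh                  # placeholder for the netperf run

   echo 100000000 > /proc/sys/kernel/sched_migration_cost
   ./run-udp-u-4k.sh

   echo $old > /proc/sys/kernel/sched_migration_cost   # restore the default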

There might be other effects i'm missing. For example what kind 
of UDP transport is used - localhost networking? That means that 
sender and receiver really want to be coupled strongly, and what 
controls this workload is whether such a 'pair' of tasks can 
properly migrate to the same CPU.

	Ingo

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-06  9:39                         ` Ingo Molnar
@ 2009-03-06 13:03                           ` Mel Gorman
  -1 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-03-06 13:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lin Ming, Zhang, Yanmin, Nick Piggin, Pekka Enberg,
	Linux Memory Management List, Rik van Riel, KOSAKI Motohiro,
	Christoph Lameter, Johannes Weiner, Linux Kernel Mailing List,
	Peter Zijlstra

On Fri, Mar 06, 2009 at 10:39:18AM +0100, Ingo Molnar wrote:
> 
> * Lin Ming <ming.m.lin@intel.com> wrote:
> 
> > Thanks, I have used "perfstat -s" to collect cache misses 
> > data.
> > 
> > 2.6.29-rc7-tip: tip/perfcounters/core (b5e8acf)
> > 2.6.29-rc7-tip-mg2: v2 patches applied to tip/perfcounters/core
> > 
> > I collected 5 times netperf UDP-U-4k data with and without 
> > mg-v2 patches applied to tip/perfcounters/core on a 4p 
> > quad-core tigerton machine, as below "value" means UDP-U-4k 
> > test result.
> > 
> > 2.6.29-rc7-tip
> > ---------------
> > value           cache misses    CPU migrations  cachemisses/migrations
> > 5329.71          391094656       1710            228710
> > 5641.59          239552767       2138            112045
> > 5580.87          132474745       2172            60992
> > 5547.19          86911457        2099            41406
> > 5626.38          196751217       2050            95976
> > 
> > 2.6.29-rc7-tip-mg2
> > -------------------
> > value           cache misses    CPU migrations  cachemisses/migrations
> > 4749.80          649929463       1132            574142
> > 4327.06          484100170       1252            386661
> > 4649.51          374201508       1489            251310
> > 5655.82          405511551       1848            219432
> > 5571.58          90222256        2159            41788
> > 
> > Lin Ming
> 
> Hm, these numbers look really interesting and give us insight 
> into this workload. The workload is fluctuating but by measuring 
> 3 metrics at once instead of just one we see the following 
> patterns:
> 
>  - Fewer CPU migrations mean more cache misses and lower 
>    performance.
> 

I also happen to know that V2 was cache unfriendly in a number of
respects. I've been trying to address that in V3, but netperf performance
in general is still proving very tricky even though profiles tell me
the page allocator is lighter and incurring fewer cache misses.

(aside, thanks for saying how you were running netperf. It allowed me to
take shortcuts writing the automation as I knew what parameters to use)

Here are the results from one x86-64 machine running an unreleased version
of the patchset

Netperf UDP_STREAM Comparison
----------------------------
                                    clean      opt-palloc   diff
UDP_STREAM-64                       68.63           73.15    6.18%
UDP_STREAM-128                     149.77          144.33   -3.77%
UDP_STREAM-256                     264.06          280.18    5.75%
UDP_STREAM-1024                   1037.81         1058.61    1.96%
UDP_STREAM-2048                   1790.33         1906.53    6.09%
UDP_STREAM-3312                   2671.34         2744.38    2.66%
UDP_STREAM-4096                   2722.92         2910.65    6.45%
UDP_STREAM-8192                   4280.14         4314.00    0.78%
UDP_STREAM-16384                  5384.13         5606.83    3.97%
Netperf TCP_STREAM Comparison
----------------------------
                                    clean      opt-palloc   diff
TCP_STREAM-64                      180.09          204.59   11.98%
TCP_STREAM-128                     297.45          812.22   63.38%
TCP_STREAM-256                    1315.20         1432.74    8.20%
TCP_STREAM-1024                   2544.73         3043.22   16.38%
TCP_STREAM-2048                   4157.76         4351.28    4.45%
TCP_STREAM-3312                   4254.53         4790.56   11.19%
TCP_STREAM-4096                   4773.22         4932.61    3.23%
TCP_STREAM-8192                   4937.03         5453.58    9.47%
TCP_STREAM-16384                  6003.46         6183.74    2.92%

WOooo, more or less awesome. Then here are the results of a second x86-64
machine

Netperf UDP_STREAM Comparison
----------------------------
                                    clean      opt-palloc   diff
UDP_STREAM-64                      106.50          106.98    0.45%
UDP_STREAM-128                     216.39          212.48   -1.84%
UDP_STREAM-256                     425.29          419.12   -1.47%
UDP_STREAM-1024                   1433.21         1449.20    1.10%
UDP_STREAM-2048                   2569.67         2503.73   -2.63%
UDP_STREAM-3312                   3685.30         3603.15   -2.28%
UDP_STREAM-4096                   4019.05         4252.53    5.49%
UDP_STREAM-8192                   6278.44         6315.58    0.59%
UDP_STREAM-16384                  7389.78         7162.91   -3.17%
Netperf TCP_STREAM Comparison
----------------------------
                                    clean      opt-palloc   diff
TCP_STREAM-64                      694.90          674.47   -3.03%
TCP_STREAM-128                    1160.13         1159.26   -0.08%
TCP_STREAM-256                    2016.35         2018.03    0.08%
TCP_STREAM-1024                   4619.41         4562.86   -1.24%
TCP_STREAM-2048                   5001.08         5096.51    1.87%
TCP_STREAM-3312                   5235.22         5276.18    0.78%
TCP_STREAM-4096                   5832.15         5844.42    0.21%
TCP_STREAM-8192                   6247.71         6287.93    0.64%
TCP_STREAM-16384                  7987.68         7896.17   -1.16%

Much less awesome and the cause of much frowny face and contemplation as to
whether I'd be much better off hitting the bar for a tasty beverage or 10.

I'm trying to pin down why there are such large differences between machines,
but it's something about the machines themselves as the results between runs
are fairly consistent. Annoyingly, the second machine showed good results
for kernbench (allocator heavy) and sysbench (not allocator heavy), was more
or less the same for hackbench, but regressed tbench and netperf even though
the page allocator overhead was less. I'm doing something screwy with cache
but don't know what it is yet.

netperf is being run on different CPUs and is possibly maximising the amount
of cache bounces incurred by the page allocator as it splits and merges
buddies. I'm experimenting with the idea of delaying bulk PCP frees but it's
also possible the network layer is having trouble with cache line bounces when
the workload is run over localhost and my modifications are changing timings.

> The lowest-score runs had the lowest CPU migrations count, 
> coupled with a high amount of cachemisses.
> 
> This _probably_ means that in this workload migrations are 
> desired: the sooner two related tasks migrate to the same CPU 
> the better. If they stay separate (migration count is low) then 
> they interact with each other from different CPUs, creating a 
> lot of cachemisses and reducing performance.
> 
> You can reduce the migration barrier of the system by enabling 
> CONFIG_SCHED_DEBUG=y and setting sched_migration_cost to zero:
> 
>    echo 0 > /proc/sys/kernel/sched_migration_cost
> 
> This will hurt other workloads - but if this improves the 
> numbers then it proves that what this particular workload wants 
> is easy migrations.
> 
> Now the question is, why does the mg2 patchset reduce the number 
> of migrations? It might not be an inherent property of the mg2 
> patches: maybe just unlucky timings push the workload across 
> sched_migration_cost.
> 
> Setting sched_migration_cost to either zero or to a very high 
> value and repeating the test will eliminate this source of noise 
> and will tell us about other properties of the mg2 patchset.
> 
> There might be other effects i'm missing. For example what kind 
> of UDP transport is used - localhost networking? That means that 
> sender and receiver really want to be coupled strongly, and what 
> controls this workload is whether such a 'pair' of tasks can 
> properly migrate to the same CPU.
> 
> 	Ingo
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
@ 2009-03-06 13:03                           ` Mel Gorman
  0 siblings, 0 replies; 118+ messages in thread
From: Mel Gorman @ 2009-03-06 13:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lin Ming, Zhang, Yanmin, Nick Piggin, Pekka Enberg,
	Linux Memory Management List, Rik van Riel, KOSAKI Motohiro,
	Christoph Lameter, Johannes Weiner, Linux Kernel Mailing List,
	Peter Zijlstra

On Fri, Mar 06, 2009 at 10:39:18AM +0100, Ingo Molnar wrote:
> 
> * Lin Ming <ming.m.lin@intel.com> wrote:
> 
> > Thanks, I have used "perfstat -s" to collect cache misses 
> > data.
> > 
> > 2.6.29-rc7-tip: tip/perfcounters/core (b5e8acf)
> > 2.6.29-rc7-tip-mg2: v2 patches applied to tip/perfcounters/core
> > 
> > I collected 5 times netperf UDP-U-4k data with and without 
> > mg-v2 patches applied to tip/perfcounters/core on a 4p 
> > quad-core tigerton machine, as below "value" means UDP-U-4k 
> > test result.
> > 
> > 2.6.29-rc7-tip
> > ---------------
> > value           cache misses    CPU migrations  cachemisses/migrations
> > 5329.71          391094656       1710            228710
> > 5641.59          239552767       2138            112045
> > 5580.87          132474745       2172            60992
> > 5547.19          86911457        2099            41406
> > 5626.38          196751217       2050            95976
> > 
> > 2.6.29-rc7-tip-mg2
> > -------------------
> > value           cache misses    CPU migrations  cachemisses/migrations
> > 4749.80          649929463       1132            574142
> > 4327.06          484100170       1252            386661
> > 4649.51          374201508       1489            251310
> > 5655.82          405511551       1848            219432
> > 5571.58          90222256        2159            41788
> > 
> > Lin Ming
> 
> Hm, these numbers look really interesting and give us insight 
> into this workload. The workload is fluctuating but by measuring 
> 3 metrics at once instead of just one we see the following 
> patterns:
> 
>  - Less CPU migrations means more cache misses and less 
>    performance.
> 

I also happen to know that V2 was cache unfriendly in a number of
respects. I've been trying to address it in V3 but still the netperf
performance in general is being very tricky even though profiles tell me
the page allocator is lighter and incurring fewer cache misses.

(aside, thanks for saying how you were running netperf. It allowed me to
take shortcuts writing the automation as I knew what parameters to use)

Here is the results from one x86-64 machine running an unreleased version
of the patchset

Netperf UDP_STREAM Comparison
----------------------------
                                    clean      opt-palloc   diff
UDP_STREAM-64                       68.63           73.15    6.18%
UDP_STREAM-128                     149.77          144.33   -3.77%
UDP_STREAM-256                     264.06          280.18    5.75%
UDP_STREAM-1024                   1037.81         1058.61    1.96%
UDP_STREAM-2048                   1790.33         1906.53    6.09%
UDP_STREAM-3312                   2671.34         2744.38    2.66%
UDP_STREAM-4096                   2722.92         2910.65    6.45%
UDP_STREAM-8192                   4280.14         4314.00    0.78%
UDP_STREAM-16384                  5384.13         5606.83    3.97%
Netperf TCP_STREAM Comparison
----------------------------
                                    clean      opt-palloc   diff
TCP_STREAM-64                      180.09          204.59   11.98%
TCP_STREAM-128                     297.45          812.22   63.38%
TCP_STREAM-256                    1315.20         1432.74    8.20%
TCP_STREAM-1024                   2544.73         3043.22   16.38%
TCP_STREAM-2048                   4157.76         4351.28    4.45%
TCP_STREAM-3312                   4254.53         4790.56   11.19%
TCP_STREAM-4096                   4773.22         4932.61    3.23%
TCP_STREAM-8192                   4937.03         5453.58    9.47%
TCP_STREAM-16384                  6003.46         6183.74    2.92%

WOooo, more or less awesome. Then here are the results of a second x86-64
machine

Netperf UDP_STREAM Comparison
----------------------------
                                    clean      opt-palloc   diff
UDP_STREAM-64                      106.50          106.98    0.45%
UDP_STREAM-128                     216.39          212.48   -1.84%
UDP_STREAM-256                     425.29          419.12   -1.47%
UDP_STREAM-1024                   1433.21         1449.20    1.10%
UDP_STREAM-2048                   2569.67         2503.73   -2.63%
UDP_STREAM-3312                   3685.30         3603.15   -2.28%
UDP_STREAM-4096                   4019.05         4252.53    5.49%
UDP_STREAM-8192                   6278.44         6315.58    0.59%
UDP_STREAM-16384                  7389.78         7162.91   -3.17%
Netperf TCP_STREAM Comparison
----------------------------
                                    clean      opt-palloc   diff
TCP_STREAM-64                      694.90          674.47   -3.03%
TCP_STREAM-128                    1160.13         1159.26   -0.08%
TCP_STREAM-256                    2016.35         2018.03    0.08%
TCP_STREAM-1024                   4619.41         4562.86   -1.24%
TCP_STREAM-2048                   5001.08         5096.51    1.87%
TCP_STREAM-3312                   5235.22         5276.18    0.78%
TCP_STREAM-4096                   5832.15         5844.42    0.21%
TCP_STREAM-8192                   6247.71         6287.93    0.64%
TCP_STREAM-16384                  7987.68         7896.17   -1.16%

Much less awesome and the cause of much frowny face and contemplation as to
whether I'd be much better off hitting the bar for a tasty beverage or 10.

I'm trying to pin down why there are such large differences between machines
but it's something with the machines themselves as the results between runs
are fairly consistent. Annoyingly, the second machine showed good results
for kernbench (allocator heavy) and sysbench (not allocator heavy), was more
or less the same for hackbench, but regressed tbench and netperf even though
the page allocator overhead was less. I'm doing something screwy with cache
but don't know what it is yet.

netperf is being run on different CPUs and is possibly maximising the amount
of cache bounces incurred by the page allocator as it splits and merges
buddies. I'm experimenting with the idea of delaying bulk PCP frees but it's
also possible the network layer is having trouble with cache line bounces when
the workload is run over localhost and my modifications are changing timings.
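
For reference, "run on different CPUs" above means the client and server are
pinned to separate cores, roughly along these lines (the core numbers are
illustrative only, not what the automation actually picks):

  taskset -c 0 ./netserver
  taskset -c 2 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- \
      -P 15888,12384 -s 32768 -S 32768 -m 4096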

> The lowest-score runs had the lowest CPU migrations count, 
> coupled with a high amount of cachemisses.
> 
> This _probably_ means that in this workload migrations are 
> desired: the sooner two related tasks migrate to the same CPU 
> the better. If they stay separate (migration count is low) then 
> they interact with each other from different CPUs, creating a 
> lot of cachemisses and reducing performance.
> 
> You can reduce the migration barrier of the system by enabling 
> CONFIG_SCHED_DEBUG=y and setting sched_migration_cost to zero:
> 
>    echo 0 > /proc/sys/kernel/sched_migration_cost
> 
> This will hurt other workloads - but if this improves the 
> numbers then it proves that what this particular workload wants 
> is easy migrations.
> 
> Now the question is, why does the mg2 patchset reduce the number 
> of migrations? It might not be an inherent property of the mg2 
> patches: maybe just unlucky timings push the workload across 
> sched_migration_cost.
> 
> Setting sched_migration_cost to either zero or to a very high 
> value and repeating the test will eliminate this source of noise 
> and will tell us about other properties of the mg2 patchset.
> 
> There might be other effects i'm missing. For example what kind 
> of UDP transport is used - localhost networking? That means that 
> sender and receiver really wants to be coupled strongly and what 
> controls this workload is whether such a 'pair' of tasks can 
> properly migrate to the same CPU.
> 
> 	Ingo
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-06 13:03                           ` Mel Gorman
@ 2009-03-09  1:50                             ` Zhang, Yanmin
  -1 siblings, 0 replies; 118+ messages in thread
From: Zhang, Yanmin @ 2009-03-09  1:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Lin Ming, Nick Piggin, Pekka Enberg,
	Linux Memory Management List, Rik van Riel, KOSAKI Motohiro,
	Christoph Lameter, Johannes Weiner, Linux Kernel Mailing List,
	Peter Zijlstra

On Fri, 2009-03-06 at 13:03 +0000, Mel Gorman wrote:
> On Fri, Mar 06, 2009 at 10:39:18AM +0100, Ingo Molnar wrote:
> > 
> > * Lin Ming <ming.m.lin@intel.com> wrote:
> > 
> > > Thanks, I have used "perfstat -s" to collect cache misses 
> > > data.
> > > 
> > > 2.6.29-rc7-tip: tip/perfcounters/core (b5e8acf)
> > > 2.6.29-rc7-tip-mg2: v2 patches applied to tip/perfcounters/core
> > > 
> > > I collected 5 times netperf UDP-U-4k data with and without 
> > > mg-v2 patches applied to tip/perfcounters/core on a 4p 
> > > quad-core tigerton machine, as below "value" means UDP-U-4k 
> > > test result.
> > > 
> > > 2.6.29-rc7-tip
> > > ---------------
> > > value           cache misses    CPU migrations  cachemisses/migrations
> > > 5329.71          391094656       1710            228710
> > > 5641.59          239552767       2138            112045
> > > 5580.87          132474745       2172            60992
> > > 5547.19          86911457        2099            41406
> > > 5626.38          196751217       2050            95976
> > > 
> > > 2.6.29-rc7-tip-mg2
> > > -------------------
> > > value           cache misses    CPU migrations  cachemisses/migrations
> > > 4749.80          649929463       1132            574142
> > > 4327.06          484100170       1252            386661
> > > 4649.51          374201508       1489            251310
> > > 5655.82          405511551       1848            219432
> > > 5571.58          90222256        2159            41788
> > > 
> > > Lin Ming
> > 
> > Hm, these numbers look really interesting and give us insight 
> > into this workload. The workload is fluctuating but by measuring 
> > 3 metrics at once instead of just one we see the following 
> > patterns:
> > 
> >  - Less CPU migrations means more cache misses and less 
> >    performance.
> > 
> 
> I also happen to know that V2 was cache unfriendly in a number of
> respects. I've been trying to address it in V3 but still the netperf
> performance in general is being very tricky even though profiles tell me
> the page allocator is lighter and incurring fewer cache misses.
> 
> (aside, thanks for saying how you were running netperf. It allowed me to
> take shortcuts writing the automation as I knew what parameters to use)
The script chooses to bind the client/server to cores of different physical CPUs.
You could also try (see the sketch below):
1) no binding;
2) starting CPU_NUM clients.
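
Something like the below, a rough sketch reusing Lin Ming's netperf parameters
(the fixed port pair is dropped in the parallel case so the clients do not
collide):

  # 1) no binding: start the pair and let the scheduler place them
  ./netserver
  ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -s 32768 -S 32768 -m 4096

  # 2) start CPU_NUM clients in parallel
  CPU_NUM=$(grep -c ^processor /proc/cpuinfo)
  for i in $(seq 1 $CPU_NUM); do
      ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -s 32768 -S 32768 -m 4096 &
  done
  wait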

> 
> Here are the results from one x86-64 machine running an unreleased version
> of the patchset
> 
> Netperf UDP_STREAM Comparison
> ----------------------------
>                                     clean      opt-palloc   diff
> UDP_STREAM-64                       68.63           73.15    6.18%
> UDP_STREAM-128                     149.77          144.33   -3.77%
> UDP_STREAM-256                     264.06          280.18    5.75%
> UDP_STREAM-1024                   1037.81         1058.61    1.96%
> UDP_STREAM-2048                   1790.33         1906.53    6.09%
> UDP_STREAM-3312                   2671.34         2744.38    2.66%
> UDP_STREAM-4096                   2722.92         2910.65    6.45%
> UDP_STREAM-8192                   4280.14         4314.00    0.78%
> UDP_STREAM-16384                  5384.13         5606.83    3.97%
> Netperf TCP_STREAM Comparison
> ----------------------------
>                                     clean      opt-palloc   diff
> TCP_STREAM-64                      180.09          204.59   11.98%
> TCP_STREAM-128                     297.45          812.22   63.38%
> TCP_STREAM-256                    1315.20         1432.74    8.20%
> TCP_STREAM-1024                   2544.73         3043.22   16.38%
> TCP_STREAM-2048                   4157.76         4351.28    4.45%
> TCP_STREAM-3312                   4254.53         4790.56   11.19%
> TCP_STREAM-4096                   4773.22         4932.61    3.23%
> TCP_STREAM-8192                   4937.03         5453.58    9.47%
> TCP_STREAM-16384                  6003.46         6183.74    2.92%
> 
> WOooo, more or less awesome. Then here are the results of a second x86-64
> machine
> 
> Netperf UDP_STREAM Comparison
> ----------------------------
>                                     clean      opt-palloc   diff
> UDP_STREAM-64                      106.50          106.98    0.45%
> UDP_STREAM-128                     216.39          212.48   -1.84%
> UDP_STREAM-256                     425.29          419.12   -1.47%
> UDP_STREAM-1024                   1433.21         1449.20    1.10%
> UDP_STREAM-2048                   2569.67         2503.73   -2.63%
> UDP_STREAM-3312                   3685.30         3603.15   -2.28%
> UDP_STREAM-4096                   4019.05         4252.53    5.49%
> UDP_STREAM-8192                   6278.44         6315.58    0.59%
> UDP_STREAM-16384                  7389.78         7162.91   -3.17%
> Netperf TCP_STREAM Comparison
> ----------------------------
>                                     clean      opt-palloc   diff
> TCP_STREAM-64                      694.90          674.47   -3.03%
> TCP_STREAM-128                    1160.13         1159.26   -0.08%
> TCP_STREAM-256                    2016.35         2018.03    0.08%
> TCP_STREAM-1024                   4619.41         4562.86   -1.24%
> TCP_STREAM-2048                   5001.08         5096.51    1.87%
> TCP_STREAM-3312                   5235.22         5276.18    0.78%
> TCP_STREAM-4096                   5832.15         5844.42    0.21%
> TCP_STREAM-8192                   6247.71         6287.93    0.64%
> TCP_STREAM-16384                  7987.68         7896.17   -1.16%
> 
> Much less awesome and the cause of much frowny face and contemplation as to
> whether I'd be much better off hitting the bar for a tasty beverage or 10.
> 
> I'm trying to pin down why there are such large differences between machines
> but it's something with the machines themselves as the results between runs
> are fairly consistent. Annoyingly, the second machine showed good results
> for kernbench (allocator heavy) and sysbench (not allocator heavy), was more
> or less the same for hackbench, but regressed tbench and netperf even though
> the page allocator overhead was less. I'm doing something screwy with cache
> but don't know what it is yet.
> 
> netperf is being run on different CPUs and is possibly maximising the amount
> of cache bounces incurred by the page allocator as it splits and merges
> buddies. I'm experimenting with the idea of delaying bulk PCP frees but it's
> also possible the network layer is having trouble with cache line bounces when
> the workload is run over localhost and my modifications are changing timings.
Ingo's analysis is on the right track. Both netperf and tbench have a dependency on
the process scheduler. Perhaps V2 has some impact on the scheduler?

> 
> > The lowest-score runs had the lowest CPU migrations count, 
> > coupled with a high amount of cachemisses.
> > 
> > This _probably_ means that in this workload migrations are 
> > desired: the sooner two related tasks migrate to the same CPU 
> > the better. If they stay separate (migration count is low) then 
> > they interact with each other from different CPUs, creating a 
> > lot of cachemisses and reducing performance.
> > 
> > You can reduce the migration barrier of the system by enabling 
> > CONFIG_SCHED_DEBUG=y and setting sched_migration_cost to zero:
> > 
> >    echo 0 > /proc/sys/kernel/sched_migration_cost
> > 
> > This will hurt other workloads - but if this improves the 
> > numbers then it proves that what this particular workload wants 
> > is easy migrations.
> > 
> > Now the question is, why does the mg2 patchset reduce the number 
> > of migrations? It might not be an inherent property of the mg2 
> > patches: maybe just unlucky timings push the workload across 
> > sched_migration_cost.
> > 
> > Setting sched_migration_cost to either zero or to a very high 
> > value and repeating the test will eliminate this source of noise 
> > and will tell us about other properties of the mg2 patchset.
> > 
> > There might be other effects i'm missing. For example what kind 
> > of UDP transport is used - localhost networking? That means that 
> > sender and receiver really wants to be coupled strongly and what 
> > controls this workload is whether such a 'pair' of tasks can 
> > properly migrate to the same CPU.



^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-06  8:33                       ` Lin Ming
@ 2009-03-09  7:03                         ` Lin Ming
  -1 siblings, 0 replies; 118+ messages in thread
From: Lin Ming @ 2009-03-09  7:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Zhang, Yanmin, Nick Piggin, Pekka Enberg,
	Linux Memory Management List, Rik van Riel, KOSAKI Motohiro,
	Christoph Lameter, Johannes Weiner, Linux Kernel Mailing List,
	Peter Zijlstra

On Fri, 2009-03-06 at 16:33 +0800, Lin Ming wrote:
> On Thu, 2009-03-05 at 18:34 +0800, Ingo Molnar wrote:
> > * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> > 
> > > On Wed, 2009-03-04 at 10:07 +0100, Nick Piggin wrote:
> > > > On Wed, Mar 04, 2009 at 10:05:07AM +0800, Zhang, Yanmin wrote:
> > > > > On Mon, 2009-03-02 at 11:21 +0000, Mel Gorman wrote:
> > > > > > (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
> > > > > > 
> > > > > > On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > > > > > > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote: 
> > > > > > > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > > > > > > can see how time is being spent and why it might have gotten worse?
> > > > > > > 
> > > > > > > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > > > > > > patches applied to 2.6.29-rc6.
> > > > > > > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > > > > > > line with addr2line.
> > > > > > > 
> > > > > > > You can download the oprofile data and vmlinux from below link,
> > > > > > > http://www.filefactory.com/file/af2330b/
> > > > > > > 
> > > > > > 
> > > > > > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > > > > > how the allocator is actually being used for your workloads.
> > > > > > 
> > > > > > The OLTP results had the following things to say about the page allocator.
> > > > > In case we might mislead you guys, I want to clarify that here OLTP is
> > > > > sysbench (oltp)+mysql, not the famous OLTP which needs lots of disks and big
> > > > > memory.
> > > > > 
> > > > > Ma Chinang, another Intel guy, does work on the famous OLTP running.
> > > > 
> > > > OK, so my comments WRT cache sensitivity probably don't apply here,
> > > > but probably cache hotness of pages coming out of the allocator
> > > > might still be important for this one.
> > > Yes. We need to check it.
> > > 
> > > > 
> > > > How many runs are you doing of these tests?
> > > We start sysbench with different thread number, for example, 8 12 16 32 64 128 for
> > > 4*4 tigerton, then get an average value in case there might be a scalability issue.
> > > 
> > > As for this sysbench oltp testing, we reran it for 7 times on 
> > > tigerton this week and found the results have fluctuations. 
> > > Now we could only say there is a trend that the result with 
> > > the patches is a little worse than the one without the 
> > > patches.
> > 
> > Could you try "perfstat -s" perhaps and see whether any other of 
> > the metrics outside of tx/sec has less natural noise?
> 
> Thanks, I have used "perfstat -s" to collect cache misses data.
> 
> 2.6.29-rc7-tip: tip/perfcounters/core (b5e8acf)
> 2.6.29-rc7-tip-mg2: v2 patches applied to tip/perfcounters/core
> 
> I collected 5 times netperf UDP-U-4k data with and without mg-v2 patches
> applied to tip/perfcounters/core on a 4p quad-core tigerton machine, as
> below
> "value" means UDP-U-4k test result.

I forgot to mention that below are the results without the client/server
bound to different CPUs.

./netserver
./netperf -t UDP_STREAM -l 60 -H 127.0.0.1  -- -P 15888,12384 -s 32768 -S 32768 -m 4096

> 
> 2.6.29-rc7-tip
> ---------------
> value           cache misses    CPU migrations  cachemisses/migrations
> 5329.71          391094656       1710            228710
> 5641.59          239552767       2138            112045
> 5580.87          132474745       2172            60992
> 5547.19          86911457        2099            41406
> 5626.38          196751217       2050            95976
> 
> 2.6.29-rc7-tip-mg2
> -------------------
> value           cache misses    CPU migrations  cachemisses/migrations
> 4749.80          649929463       1132            574142
> 4327.06          484100170       1252            386661
> 4649.51          374201508       1489            251310
> 5655.82          405511551       1848            219432
> 5571.58          90222256        2159            41788
> 
> Lin Ming
> 
> > 
> > I think a more invariant number might be the ratio of "LLC 
> > cachemisses" divided by "CPU migrations".
> > 
> > The fluctuation in tx/sec comes from threads bouncing - but you 
> > can normalize that away by using the cachemisses/migrations 
> > ratio.
> > 
> > Perhaps. It's definitely a difficult thing to measure.
> > 
> > 	Ingo


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2
  2009-03-06  9:39                         ` Ingo Molnar
@ 2009-03-09  7:31                           ` Lin Ming
  -1 siblings, 0 replies; 118+ messages in thread
From: Lin Ming @ 2009-03-09  7:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zhang, Yanmin, Nick Piggin, Mel Gorman, Pekka Enberg,
	Linux Memory Management List, Rik van Riel, KOSAKI Motohiro,
	Christoph Lameter, Johannes Weiner, Linux Kernel Mailing List,
	Peter Zijlstra

On Fri, 2009-03-06 at 17:39 +0800, Ingo Molnar wrote:
> * Lin Ming <ming.m.lin@intel.com> wrote:
> 
> > Thanks, I have used "perfstat -s" to collect cache misses 
> > data.
> > 
> > 2.6.29-rc7-tip: tip/perfcounters/core (b5e8acf)
> > 2.6.29-rc7-tip-mg2: v2 patches applied to tip/perfcounters/core
> > 
> > I collected 5 times netperf UDP-U-4k data with and without 
> > mg-v2 patches applied to tip/perfcounters/core on a 4p 
> > quad-core tigerton machine, as below "value" means UDP-U-4k 
> > test result.
> > 
> > 2.6.29-rc7-tip
> > ---------------
> > value           cache misses    CPU migrations  cachemisses/migrations
> > 5329.71          391094656       1710            228710
> > 5641.59          239552767       2138            112045
> > 5580.87          132474745       2172            60992
> > 5547.19          86911457        2099            41406
> > 5626.38          196751217       2050            95976
> > 
> > 2.6.29-rc7-tip-mg2
> > -------------------
> > value           cache misses    CPU migrations  cachemisses/migrations
> > 4749.80          649929463       1132            574142
> > 4327.06          484100170       1252            386661
> > 4649.51          374201508       1489            251310
> > 5655.82          405511551       1848            219432
> > 5571.58          90222256        2159            41788
> > 
> > Lin Ming
> 
> Hm, these numbers look really interesting and give us insight 
> into this workload. The workload is fluctuating but by measuring 
> 3 metrics at once instead of just one we see the following 
> patterns:
> 
>  - Less CPU migrations means more cache misses and less 
>    performance.
> 
> The lowest-score runs had the lowest CPU migrations count, 
> coupled with a high amount of cachemisses.
> 
> This _probably_ means that in this workload migrations are 
> desired: the sooner two related tasks migrate to the same CPU 
> the better. If they stay separate (migration count is low) then 
> they interact with each other from different CPUs, creating a 
> lot of cachemisses and reducing performance.
> 
> You can reduce the migration barrier of the system by enabling 
> CONFIG_SCHED_DEBUG=y and setting sched_migration_cost to zero:
> 
>    echo 0 > /proc/sys/kernel/sched_migration_cost
> 
> This will hurt other workloads - but if this improves the 
> numbers then it proves that what this particular workload wants 
> is easy migrations.

Again, I don't bind the client/server to different CPUs.
./netserver
./netperf -t UDP_STREAM -l 60 -H 127.0.0.1  -- -P 15888,12384 -s 32768 -S 32768 -m 4096

2.6.29-rc7-tip-mg2
-------------------
echo 0 > /proc/sys/kernel/sched_migration_cost
value           cache misses    CPU migrations  cachemisses/migrations
2867.62          880055866       117             7521845
2920.08          884482955       122             7249860
2903.16          905450628       127             7129532
2930.94          877616337       104             8438618
5224.02          1428643167      133             10741677

If sysctl_sched_migration_cost is set to zero,
the sender/receiver will have less chance to do sync wakeups (fewer migrations).

wake_affine(...) {
	...
	/*
	 * With sched_migration_cost set to 0, any non-zero avg_overlap
	 * makes this test true and the sync wakeup hint is dropped.
	 */
	if (sync && (curr->se.avg_overlap > sysctl_sched_migration_cost ||
		     p->se.avg_overlap > sysctl_sched_migration_cost))
		sync = 0;
	...
}

Echoing -1 to sched_migration_cost can improve the numbers (more migrations).

echo -1 > /proc/sys/kernel/sched_migration_cost
value           cache misses    CPU migrations  cachemisses/migrations
5524.52          97137973        2331            41672
5454.54          92589648        2542            36423
5458.63          96943477        3968            24431
5524.40          89298489        2574            34692
5493.64          87080343        2490            34972

> 
> Now the question is, why does the mg2 patchset reduce the number 
> of migrations? It might not be an inherent property of the mg2 
> patches: maybe just unlucky timings push the workload across 
> sched_migration_cost.
> 
> Setting sched_migration_cost to either zero or to a very high 
> value and repeating the test will eliminate this source of noise 
> and will tell us about other properties of the mg2 patchset.
> 
> There might be other effects i'm missing. For example what kind 
> of UDP transport is used - localhost networking? That means that 

Yes, localhost networking.

Lin Ming

> sender and receiver really wants to be coupled strongly and what 
> controls this workload is whether such a 'pair' of tasks can 
> properly migrate to the same CPU.
> 
> 	Ingo


^ permalink raw reply	[flat|nested] 118+ messages in thread

end of thread, other threads:[~2009-03-09  7:38 UTC | newest]

Thread overview: 118+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-24 12:16 [RFC PATCH 00/19] Cleanup and optimise the page allocator V2 Mel Gorman
2009-02-24 12:16 ` Mel Gorman
2009-02-24 12:16 ` [PATCH 01/19] Replace __alloc_pages_internal() with __alloc_pages_nodemask() Mel Gorman
2009-02-24 12:16   ` Mel Gorman
2009-02-24 12:16 ` [PATCH 02/19] Do not sanity check order in the fast path Mel Gorman
2009-02-24 12:16   ` Mel Gorman
2009-02-24 12:16 ` [PATCH 03/19] Do not check NUMA node ID when the caller knows the node is valid Mel Gorman
2009-02-24 12:16   ` Mel Gorman
2009-02-24 17:17   ` Christoph Lameter
2009-02-24 17:17     ` Christoph Lameter
2009-02-24 12:17 ` [PATCH 04/19] Convert gfp_zone() to use a table of precalculated values Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-24 16:43   ` Christoph Lameter
2009-02-24 16:43     ` Christoph Lameter
2009-02-24 17:07     ` Mel Gorman
2009-02-24 17:07       ` Mel Gorman
2009-02-24 12:17 ` [PATCH 05/19] Re-sort GFP flags and fix whitespace alignment for easier reading Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-24 12:17 ` [PATCH 06/19] Check only once if the zonelist is suitable for the allocation Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-24 17:24   ` Christoph Lameter
2009-02-24 17:24     ` Christoph Lameter
2009-02-24 12:17 ` [PATCH 07/19] Break up the allocator entry point into fast and slow paths Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-24 12:17 ` [PATCH 08/19] Simplify the check on whether cpusets are a factor or not Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-24 17:27   ` Christoph Lameter
2009-02-24 17:27     ` Christoph Lameter
2009-02-24 17:55     ` Mel Gorman
2009-02-24 17:55       ` Mel Gorman
2009-02-24 12:17 ` [PATCH 09/19] Move check for disabled anti-fragmentation out of fastpath Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-24 12:17 ` [PATCH 10/19] Calculate the preferred zone for allocation only once Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-24 17:31   ` Christoph Lameter
2009-02-24 17:31     ` Christoph Lameter
2009-02-24 17:53     ` Mel Gorman
2009-02-24 17:53       ` Mel Gorman
2009-02-24 12:17 ` [PATCH 11/19] Calculate the migratetype " Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-24 12:17 ` [PATCH 12/19] Calculate the alloc_flags " Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-24 12:17 ` [PATCH 13/19] Inline __rmqueue_smallest() Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-24 12:17 ` [PATCH 14/19] Inline buffered_rmqueue() Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-24 12:17 ` [PATCH 15/19] Do not call get_pageblock_migratetype() more than necessary Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-24 12:17 ` [PATCH 16/19] Do not disable interrupts in free_page_mlock() Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-24 12:17 ` [PATCH 17/19] Do not setup zonelist cache when there is only one node Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-24 12:17 ` [PATCH 18/19] Do not check for compound pages during the page allocator sanity checks Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-24 12:17 ` [PATCH 19/19] Split per-cpu list into one-list-per-migrate-type Mel Gorman
2009-02-24 12:17   ` Mel Gorman
2009-02-26  9:10 ` [RFC PATCH 00/19] Cleanup and optimise the page allocator V2 Lin Ming
2009-02-26  9:10   ` Lin Ming
2009-02-26  9:26   ` Pekka Enberg
2009-02-26  9:26     ` Pekka Enberg
2009-02-26  9:27     ` Lin Ming
2009-02-26  9:27       ` Lin Ming
2009-02-26 11:03   ` Mel Gorman
2009-02-26 11:03     ` Mel Gorman
2009-02-26 11:18     ` Pekka Enberg
2009-02-26 11:18       ` Pekka Enberg
2009-02-26 11:22       ` Mel Gorman
2009-02-26 11:22         ` Mel Gorman
2009-02-26 12:27         ` Lin Ming
2009-02-26 12:27           ` Lin Ming
2009-02-27  8:44         ` Lin Ming
2009-02-27  8:44           ` Lin Ming
2009-03-02 11:21           ` Mel Gorman
2009-03-02 11:21             ` Mel Gorman
2009-03-02 11:39             ` Nick Piggin
2009-03-02 11:39               ` Nick Piggin
2009-03-02 12:16               ` Mel Gorman
2009-03-02 12:16                 ` Mel Gorman
2009-03-03  4:42                 ` Nick Piggin
2009-03-03  4:42                   ` Nick Piggin
2009-03-03  8:25                   ` Mel Gorman
2009-03-03  8:25                     ` Mel Gorman
2009-03-03  9:04                     ` Nick Piggin
2009-03-03  9:04                       ` Nick Piggin
2009-03-03 13:51                       ` Mel Gorman
2009-03-03 13:51                         ` Mel Gorman
2009-03-03 16:31             ` Christoph Lameter
2009-03-03 16:31               ` Christoph Lameter
2009-03-03 21:48               ` Mel Gorman
2009-03-03 21:48                 ` Mel Gorman
2009-03-04  2:05             ` Zhang, Yanmin
2009-03-04  2:05               ` Zhang, Yanmin
2009-03-04  7:23               ` Peter Zijlstra
2009-03-04  7:23                 ` Peter Zijlstra
2009-03-04  8:31                 ` Zhang, Yanmin
2009-03-04  8:31                   ` Zhang, Yanmin
2009-03-04  9:07               ` Nick Piggin
2009-03-04  9:07                 ` Nick Piggin
2009-03-05  1:56                 ` Zhang, Yanmin
2009-03-05  1:56                   ` Zhang, Yanmin
2009-03-05 10:34                   ` Ingo Molnar
2009-03-05 10:34                     ` Ingo Molnar
2009-03-06  8:33                     ` Lin Ming
2009-03-06  8:33                       ` Lin Ming
2009-03-06  9:39                       ` Ingo Molnar
2009-03-06  9:39                         ` Ingo Molnar
2009-03-06 13:03                         ` Mel Gorman
2009-03-06 13:03                           ` Mel Gorman
2009-03-09  1:50                           ` Zhang, Yanmin
2009-03-09  1:50                             ` Zhang, Yanmin
2009-03-09  7:31                         ` Lin Ming
2009-03-09  7:31                           ` Lin Ming
2009-03-09  7:03                       ` Lin Ming
2009-03-09  7:03                         ` Lin Ming
2009-03-04 18:04               ` Mel Gorman
2009-03-04 18:04                 ` Mel Gorman
2009-02-26 16:28       ` Christoph Lameter
2009-02-26 16:28         ` Christoph Lameter
