* [PATCH 00/22] Cleanup and optimise the page allocator V7
@ 2009-04-22 13:53 ` Mel Gorman
  0 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

Here is V7 of the cleanup and optimisation of the page allocator and it
should be ready for wider testing. Please consider it as a candidate for
merging as Pass 1 at making the page allocator faster. Other passes will
follow once this one has had a bit of exercise. This patchset is based
on mmotm-2009-04-17 and I've tested it successfully on a small number of
machines.

The performance improvements cover a wide range depending on the exact
machine, but the results I've seen so far are approximately:

kernbench:	0	to	 0.12% (elapsed time)
		0.49%	to	 3.20% (sys time)
aim9:		-4%	to	30% (for page_test and brk_test)
tbench:		-1%	to	 4%
hackbench:	-2.5%	to	 3.45% (mostly within the noise though)
netperf-udp	-1.34%  to	 4.06% (varies between machines a bit)
netperf-tcp	-0.44%  to	 5.22% (varies between machines a bit)

I don't have sysbench figures at hand, but previously they were within the
-0.5% to 2% range.

On netperf, the client and server were bound to opposite number CPUs to
maximise the problems with cache line bouncing of the struct pages so I
expect different people to report different results for netperf depending
on their exact machine and how they ran the test (different machines, same
cpus client/server, shared cache but two threads client/server, different
socket client/server etc).

I also measured the vmlinux sizes for a single x86-based config with
CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM. The core of the .config
is based on the Debian Lenny kernel config so I expect it to be reasonably
typical.

   text	kernel
3355726 mmotm-20090417
3355718 0001-Replace-__alloc_pages_internal-with-__alloc_pages_.patch
3355622 0002-Do-not-sanity-check-order-in-the-fast-path.patch
3355574 0003-Do-not-check-NUMA-node-ID-when-the-caller-knows-the.patch
3355574 0004-Check-only-once-if-the-zonelist-is-suitable-for-the.patch
3355526 0005-Break-up-the-allocator-entry-point-into-fast-and-slo.patch
3355420 0006-Move-check-for-disabled-anti-fragmentation-out-of-fa.patch
3355452 0007-Calculate-the-preferred-zone-for-allocation-only-onc.patch
3355452 0008-Calculate-the-migratetype-for-allocation-only-once.patch
3355436 0009-Calculate-the-alloc_flags-for-allocation-only-once.patch
3355436 0010-Remove-a-branch-by-assuming-__GFP_HIGH-ALLOC_HIGH.patch
3355420 0011-Inline-__rmqueue_smallest.patch
3355420 0012-Inline-buffered_rmqueue.patch
3355420 0013-Inline-__rmqueue_fallback.patch
3355404 0014-Do-not-call-get_pageblock_migratetype-more-than-ne.patch
3355300 0015-Do-not-disable-interrupts-in-free_page_mlock.patch
3355300 0016-Do-not-setup-zonelist-cache-when-there-is-only-one-n.patch
3355188 0017-Do-not-check-for-compound-pages-during-the-page-allo.patch
3355161 0018-Use-allocation-flags-as-an-index-to-the-zone-waterma.patch
3355129 0019-Update-NR_FREE_PAGES-only-as-necessary.patch
3355129 0020-Get-the-pageblock-migratetype-without-disabling-inte.patch
3355129 0021-Use-a-pre-calculated-value-instead-of-num_online_nod.patch

Some patches were dropped in this revision because, while I believe they
improved performance, they also increased the text size, so they need to
be revisited in isolation to show whether they actually help and by how
much. Other than that, the biggest changes were cleaning up accidental
functional changes identified by Kosaki Motohiro. Massive credit to him for a
very detailed review of V6, to Christoph Lameter who reviewed earlier versions
quite heavily, and to Pekka Enberg who went through V6 in quite a lot of detail.

 arch/ia64/hp/common/sba_iommu.c   |    2 
 arch/ia64/kernel/mca.c            |    3 
 arch/ia64/kernel/uncached.c       |    3 
 arch/ia64/sn/pci/pci_dma.c        |    3 
 arch/powerpc/platforms/cell/ras.c |    2 
 arch/x86/kvm/vmx.c                |    2 
 drivers/misc/sgi-gru/grufile.c    |    2 
 drivers/misc/sgi-xp/xpc_uv.c      |    2 
 include/linux/gfp.h               |   27 -
 include/linux/mm.h                |    1 
 include/linux/mmzone.h            |   11 
 include/linux/nodemask.h          |   15 -
 kernel/profile.c                  |    8 
 mm/filemap.c                      |    2 
 mm/hugetlb.c                      |    8 
 mm/internal.h                     |   11 
 mm/mempolicy.c                    |    2 
 mm/migrate.c                      |    2 
 mm/page_alloc.c                   |  555 ++++++++++++++++++++++++--------------
 mm/slab.c                         |   11 
 mm/slob.c                         |    4 
 mm/slub.c                         |    2 
 net/sunrpc/svc.c                  |    2 
 23 files changed, 424 insertions(+), 256 deletions(-)

Changes since V6
  o Remove unintentional functional changes when splitting into fast and slow paths
  o Drop patch 7 for zonelist filtering as it changed when zlc_setup() is called
    for the wrong reasons. The patch that avoids calling it on non-NUMA machines,
    which has the bulk of the saving, is still there; the cpusets part is
    relatively small
  o Drop an unnecessary check for in_interrupt() in gfp_to_alloc_flags()
  o Clarify comment on __GFP_HIGH == ALLOC_HIGH
  o Redefine the watermark mask to be expressed in terms of ALLOC_MARK_NOWATERMARK
  o Use BUILD_BUG_ON for checking __GFP_HIGH == ALLOC_HIGH (see the sketch
    after this list)
  o Drop some patches that were not reducing text sizes as expected
  o Remove numa_platform from slab
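
As a rough sketch of the BUILD_BUG_ON item above: because __GFP_HIGH and
ALLOC_HIGH have the same bit value, the __GFP_HIGH bit can be copied into
alloc_flags with a mask instead of a conditional, and the assumption is
enforced at compile time. The fragment below only illustrates the pattern
inside gfp_to_alloc_flags(); the exact casts and context in the real patch
may differ:

	/* sketch: relies on __GFP_HIGH and ALLOC_HIGH sharing a value */
	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);

	/* branchless: the __GFP_HIGH bit becomes the ALLOC_HIGH bit */
	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);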

Changes since V5
  o Rebase to mmotm-2009-04-17

Changes since V4
  o Drop the more controversial patches for now and focus on the "obvious win"
    material.
  o Add reviewed-by notes
  o Fix changelog entry to say __rmqueue_fallback instead of __rmqueue
  o Add unlikely() for the clearMlocked check
  o Change where PGFREE is accounted in free_hot_cold_page() to have symmetry
    with __free_pages_ok()
  o Convert num_online_nodes() to use a static value so that callers do
    not have to be individually updated
  o Rebase to mmotm-2009-03-13

Changes since V3
  o Drop the more controversial patches for now and focus on the "obvious win"
    material
  o Add reviewed-by notes
  o Fix changelog entry to say __rmqueue_fallback instead of __rmqueue
  o Add unlikely() for the clearMlocked check
  o Change where PGFREE is accounted in free_hot_cold_page() to have symmetry
    with __free_pages_ok()

Changes since V2
  o Remove branches by treating watermark flags as array indices
  o Remove branch by assuming __GFP_HIGH == ALLOC_HIGH
  o Do not check for compound on every page free
  o Remove branch by always ensuring the migratetype is known on free
  o Simplify buffered_rmqueue further
  o Reintroduce improved version of batched bulk free of pcp pages
  o Use allocation flags as an index to zone watermarks (see the sketch after
    this list)
  o Work out __GFP_COLD only once
  o Reduce the number of times zone stats are updated
  o Do not dump reserve pages back into the allocator. Instead treat them
    as MOVABLE so that MIGRATE_RESERVE gets used on the max-order-overlapped
    boundaries without causing trouble
  o Allow pages up to PAGE_ALLOC_COSTLY_ORDER to use the per-cpu allocator.
    order-1 allocations in particular are frequent enough to justify this
  o Rearrange inlining such that the hot-path is inlined but not in a way
    that increases the text size of the page allocator
  o Make the check for needing additional zonelist filtering due to NUMA
    or cpusets as light as possible
  o Do not destroy compound pages going to the PCP lists
  o Delay the merging of buddies until a high-order allocation needs them
    or anti-fragmentation is being forced to fallback
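
The watermark-index idea referenced above can be sketched as below. This is
illustrative rather than the literal patch: the flag values match what the
series aims for, but the structure and helper names are invented for the
example. The point is that the ALLOC_WMARK_* flag doubles as an array index,
so selecting the min/low/high watermark needs no branches:

	/* exactly one of MIN/LOW/HIGH is set in alloc_flags at a time */
	#define ALLOC_WMARK_MIN		0x00
	#define ALLOC_WMARK_LOW		0x01
	#define ALLOC_WMARK_HIGH	0x02
	#define ALLOC_WMARK_MASK	0x03	/* extracts the index */

	struct zone_marks {			/* stand-in for the zone fields */
		unsigned long watermark[3];	/* min, low, high */
	};

	static unsigned long zone_mark(const struct zone_marks *z, int alloc_flags)
	{
		/* the flag is the index: no if/else chain on ALLOC_WMARK_* */
		return z->watermark[alloc_flags & ALLOC_WMARK_MASK];
	}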

Changes since V1
  o Remove the ifdef CONFIG_CPUSETS from inside get_page_from_freelist()
  o Use non-lock bit operations for clearing the mlock flag
  o Factor out alloc_flags calculation so it is only done once (Peter)
  o Make gfp.h a bit prettier and clear-cut (Peter)
  o Instead of deleting a debugging check, replace page_count() in the
    free path with a version that does not check for compound pages (Nick)
  o Drop the alteration for hot/cold page freeing until we know if it
    helps or not



* [PATCH 01/22] Replace __alloc_pages_internal() with __alloc_pages_nodemask()
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

__alloc_pages_internal is the core page allocator function but
essentially it is an alias of __alloc_pages_nodemask. Naming a publicly
available and exported function "internal" is also a bit ugly. This
patch renames __alloc_pages_internal() to __alloc_pages_nodemask() and
deletes the old nodemask function.

Warning - This patch renames an exported symbol. No kernel driver is
affected, but external drivers calling __alloc_pages_internal() should
change the call to __alloc_pages_nodemask() without any alteration of
parameters.
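
For an out-of-tree caller, the conversion is a straight rename. The helper
below is hypothetical and exists only to show the before and after; the
zonelist lookup mirrors what the in-kernel wrappers do:

	/* hypothetical external caller, shown only to illustrate the rename */
	static struct page *grab_pages(gfp_t gfp_mask, unsigned int order)
	{
		struct zonelist *zl = node_zonelist(numa_node_id(), gfp_mask);

		/* was: return __alloc_pages_internal(gfp_mask, order, zl, NULL); */
		return __alloc_pages_nodemask(gfp_mask, order, zl, NULL);
	}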

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 include/linux/gfp.h |   12 ++----------
 mm/page_alloc.c     |    4 ++--
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0bbc15f..556c840 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -169,24 +169,16 @@ static inline void arch_alloc_page(struct page *page, int order) { }
 #endif
 
 struct page *
-__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		       struct zonelist *zonelist, nodemask_t *nodemask);
 
 static inline struct page *
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
 		struct zonelist *zonelist)
 {
-	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+	return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
 }
 
-static inline struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist, nodemask_t *nodemask)
-{
-	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
-}
-
-
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e4ea469..dcc4f05 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ try_next_zone:
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page *
-__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
@@ -1671,7 +1671,7 @@ nopage:
 got_pg:
 	return page;
 }
-EXPORT_SYMBOL(__alloc_pages_internal);
+EXPORT_SYMBOL(__alloc_pages_nodemask);
 
 /*
  * Common helper functions.
-- 
1.5.6.5



* [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

No user of the allocator API should be passing in an order >= MAX_ORDER
but we check for it on each and every allocation. Delete this check and
make it a VM_BUG_ON check further down the call path.
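
For reference, the reason this costs nothing in production builds is that
VM_BUG_ON() compiles away unless CONFIG_DEBUG_VM is set, roughly as in
include/linux/mmdebug.h:

	#ifdef CONFIG_DEBUG_VM
	#define VM_BUG_ON(cond)		BUG_ON(cond)
	#else
	#define VM_BUG_ON(cond)		do { } while (0)
	#endif

So the order check is still made when debugging the VM but disappears from
the hot path of a normal build.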

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 include/linux/gfp.h |    6 ------
 mm/page_alloc.c     |    2 ++
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 556c840..760f6c0 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -182,9 +182,6 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
-	if (unlikely(order >= MAX_ORDER))
-		return NULL;
-
 	/* Unknown node is current node */
 	if (nid < 0)
 		nid = numa_node_id();
@@ -198,9 +195,6 @@ extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
 static inline struct page *
 alloc_pages(gfp_t gfp_mask, unsigned int order)
 {
-	if (unlikely(order >= MAX_ORDER))
-		return NULL;
-
 	return alloc_pages_current(gfp_mask, order);
 }
 extern struct page *alloc_page_vma(gfp_t gfp_mask,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dcc4f05..5028f40 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1405,6 +1405,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 
 	classzone_idx = zone_idx(preferred_zone);
 
+	VM_BUG_ON(order >= MAX_ORDER);
+
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
-- 
1.5.6.5



* [PATCH 03/22] Do not check NUMA node ID when the caller knows the node is valid
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

Callers of alloc_pages_node() can optionally specify -1 as a node to mean
"allocate from the current node". However, a number of the callers in fast
paths know for a fact their node is valid. To avoid a comparison and branch,
this patch adds alloc_pages_exact_node() that only checks the nid with
VM_BUG_ON(). Callers that know their node is valid are then converted.
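
The difference boils down to the two inline functions below, condensed from
the gfp.h hunks in this patch (alloc_pages_node() shown as it looks at this
point in the series): alloc_pages_exact_node() drops the nid < 0 fallback
branch and keeps only a debug-build sanity check:

	static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
							unsigned int order)
	{
		/* unknown node (-1) falls back to the current node:
		 * a test and branch on every call */
		if (nid < 0)
			nid = numa_node_id();

		return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
	}

	static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
							unsigned int order)
	{
		/* debug-only check: no cost without CONFIG_DEBUG_VM */
		VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);

		return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
	}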

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>

(for the SLOB NUMA bits)
Acked-by: Paul Mundt <lethal@linux-sh.org>
---
 arch/ia64/hp/common/sba_iommu.c   |    2 +-
 arch/ia64/kernel/mca.c            |    3 +--
 arch/ia64/kernel/uncached.c       |    3 ++-
 arch/ia64/sn/pci/pci_dma.c        |    3 ++-
 arch/powerpc/platforms/cell/ras.c |    2 +-
 arch/x86/kvm/vmx.c                |    2 +-
 drivers/misc/sgi-gru/grufile.c    |    2 +-
 drivers/misc/sgi-xp/xpc_uv.c      |    2 +-
 include/linux/gfp.h               |    9 +++++++++
 include/linux/mm.h                |    1 -
 kernel/profile.c                  |    8 ++++----
 mm/filemap.c                      |    2 +-
 mm/hugetlb.c                      |    4 ++--
 mm/mempolicy.c                    |    2 +-
 mm/migrate.c                      |    2 +-
 mm/slab.c                         |    4 ++--
 mm/slob.c                         |    4 ++--
 17 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 56ceb68..fe63b2d 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -1131,7 +1131,7 @@ sba_alloc_coherent (struct device *dev, size_t size, dma_addr_t *dma_handle, gfp
 #ifdef CONFIG_NUMA
 	{
 		struct page *page;
-		page = alloc_pages_node(ioc->node == MAX_NUMNODES ?
+		page = alloc_pages_exact_node(ioc->node == MAX_NUMNODES ?
 		                        numa_node_id() : ioc->node, flags,
 		                        get_order(size));
 
diff --git a/arch/ia64/kernel/mca.c b/arch/ia64/kernel/mca.c
index 8f33a88..5b17bd4 100644
--- a/arch/ia64/kernel/mca.c
+++ b/arch/ia64/kernel/mca.c
@@ -1829,8 +1829,7 @@ ia64_mca_cpu_init(void *cpu_data)
 			data = mca_bootmem();
 			first_time = 0;
 		} else
-			data = page_address(alloc_pages_node(numa_node_id(),
-					GFP_KERNEL, get_order(sz)));
+			data = __get_free_pages(GFP_KERNEL, get_order(sz));
 		if (!data)
 			panic("Could not allocate MCA memory for cpu %d\n",
 					cpu);
diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index 8eff8c1..6ba72ab 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -98,7 +98,8 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)
 
 	/* attempt to allocate a granule's worth of cached memory pages */
 
-	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+	page = alloc_pages_exact_node(nid,
+				GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				IA64_GRANULE_SHIFT-PAGE_SHIFT);
 	if (!page) {
 		mutex_unlock(&uc_pool->add_chunk_mutex);
diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c
index d876423..98b6849 100644
--- a/arch/ia64/sn/pci/pci_dma.c
+++ b/arch/ia64/sn/pci/pci_dma.c
@@ -90,7 +90,8 @@ static void *sn_dma_alloc_coherent(struct device *dev, size_t size,
 	 */
 	node = pcibus_to_node(pdev->bus);
 	if (likely(node >=0)) {
-		struct page *p = alloc_pages_node(node, flags, get_order(size));
+		struct page *p = alloc_pages_exact_node(node,
+						flags, get_order(size));
 
 		if (likely(p))
 			cpuaddr = page_address(p);
diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c
index 5f961c4..16ba671 100644
--- a/arch/powerpc/platforms/cell/ras.c
+++ b/arch/powerpc/platforms/cell/ras.c
@@ -122,7 +122,7 @@ static int __init cbe_ptcal_enable_on_node(int nid, int order)
 
 	area->nid = nid;
 	area->order = order;
-	area->pages = alloc_pages_node(area->nid, GFP_KERNEL, area->order);
+	area->pages = alloc_pages_exact_node(area->nid, GFP_KERNEL, area->order);
 
 	if (!area->pages)
 		goto out_free_area;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c6997c0..eaa149f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1258,7 +1258,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
 	struct page *pages;
 	struct vmcs *vmcs;
 
-	pages = alloc_pages_node(node, GFP_KERNEL, vmcs_config.order);
+	pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
 	if (!pages)
 		return NULL;
 	vmcs = page_address(pages);
diff --git a/drivers/misc/sgi-gru/grufile.c b/drivers/misc/sgi-gru/grufile.c
index 9ba90f3..796ac70 100644
--- a/drivers/misc/sgi-gru/grufile.c
+++ b/drivers/misc/sgi-gru/grufile.c
@@ -306,7 +306,7 @@ static int gru_init_tables(unsigned long gru_base_paddr, void *gru_base_vaddr)
 		pnode = uv_node_to_pnode(nid);
 		if (bid < 0 || gru_base[bid])
 			continue;
-		page = alloc_pages_node(nid, GFP_KERNEL, order);
+		page = alloc_pages_exact_node(nid, GFP_KERNEL, order);
 		if (!page)
 			goto fail;
 		gru_base[bid] = page_address(page);
diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
index 9172fcd..c76677a 100644
--- a/drivers/misc/sgi-xp/xpc_uv.c
+++ b/drivers/misc/sgi-xp/xpc_uv.c
@@ -232,7 +232,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
 	mq->mmr_blade = uv_cpu_to_blade_id(cpu);
 
 	nid = cpu_to_node(cpu);
-	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+	page = alloc_pages_exact_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				pg_order);
 	if (page == NULL) {
 		dev_err(xpc_part, "xpc_create_gru_mq_uv() failed to alloc %d "
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 760f6c0..c7429b8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -5,6 +5,7 @@
 #include <linux/stddef.h>
 #include <linux/linkage.h>
 #include <linux/topology.h>
+#include <linux/mmdebug.h>
 
 struct vm_area_struct;
 
@@ -189,6 +190,14 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
 
+static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
+						unsigned int order)
+{
+	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+
+	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+}
+
 #ifdef CONFIG_NUMA
 extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9c916e4..2d86bc8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -7,7 +7,6 @@
 
 #include <linux/gfp.h>
 #include <linux/list.h>
-#include <linux/mmdebug.h>
 #include <linux/mmzone.h>
 #include <linux/rbtree.h>
 #include <linux/prio_tree.h>
diff --git a/kernel/profile.c b/kernel/profile.c
index 7724e04..62e08db 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -371,7 +371,7 @@ static int __cpuinit profile_cpu_callback(struct notifier_block *info,
 		node = cpu_to_node(cpu);
 		per_cpu(cpu_profile_flip, cpu) = 0;
 		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
-			page = alloc_pages_node(node,
+			page = alloc_pages_exact_node(node,
 					GFP_KERNEL | __GFP_ZERO,
 					0);
 			if (!page)
@@ -379,7 +379,7 @@ static int __cpuinit profile_cpu_callback(struct notifier_block *info,
 			per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
 		}
 		if (!per_cpu(cpu_profile_hits, cpu)[0]) {
-			page = alloc_pages_node(node,
+			page = alloc_pages_exact_node(node,
 					GFP_KERNEL | __GFP_ZERO,
 					0);
 			if (!page)
@@ -570,14 +570,14 @@ static int create_hash_tables(void)
 		int node = cpu_to_node(cpu);
 		struct page *page;
 
-		page = alloc_pages_node(node,
+		page = alloc_pages_exact_node(node,
 				GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				0);
 		if (!page)
 			goto out_cleanup;
 		per_cpu(cpu_profile_hits, cpu)[1]
 				= (struct profile_hit *)page_address(page);
-		page = alloc_pages_node(node,
+		page = alloc_pages_exact_node(node,
 				GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				0);
 		if (!page)
diff --git a/mm/filemap.c b/mm/filemap.c
index 5256582..0ad1eb1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -521,7 +521,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
 {
 	if (cpuset_do_page_mem_spread()) {
 		int n = cpuset_mem_spread_node();
-		return alloc_pages_node(n, gfp, 0);
+		return alloc_pages_exact_node(n, gfp, 0);
 	}
 	return alloc_pages(gfp, 0);
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 28c655b..1234486 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -630,7 +630,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 	if (h->order >= MAX_ORDER)
 		return NULL;
 
-	page = alloc_pages_node(nid,
+	page = alloc_pages_exact_node(nid,
 		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
 						__GFP_REPEAT|__GFP_NOWARN,
 		huge_page_order(h));
@@ -649,7 +649,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
  * Use a helper variable to find the next node and then
  * copy it back to hugetlb_next_nid afterwards:
  * otherwise there's a window in which a racer might
- * pass invalid nid MAX_NUMNODES to alloc_pages_node.
+ * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
  * But we don't need to use a spin_lock here: it really
  * doesn't matter if occasionally a racer chooses the
  * same nid as we do.  Move nid forward in the mask even
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8a5d2b8..c32bc15 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -796,7 +796,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
 
 static struct page *new_node_page(struct page *page, unsigned long node, int **x)
 {
-	return alloc_pages_node(node, GFP_HIGHUSER_MOVABLE, 0);
+	return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE, 0);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
index 068655d..5a24923 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -802,7 +802,7 @@ static struct page *new_page_node(struct page *p, unsigned long private,
 
 	*result = &pm->status;
 
-	return alloc_pages_node(pm->node,
+	return alloc_pages_exact_node(pm->node,
 				GFP_HIGHUSER_MOVABLE | GFP_THISNODE, 0);
 }
 
diff --git a/mm/slab.c b/mm/slab.c
index 3da2640..1c680e8 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1699,7 +1699,7 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		flags |= __GFP_RECLAIMABLE;
 
-	page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+	page = alloc_pages_exact_node(nodeid, flags, cachep->gfporder);
 	if (!page)
 		return NULL;
 
@@ -3254,7 +3254,7 @@ retry:
 		if (local_flags & __GFP_WAIT)
 			local_irq_enable();
 		kmem_flagcheck(cache, flags);
-		obj = kmem_getpages(cache, local_flags, -1);
+		obj = kmem_getpages(cache, local_flags, numa_node_id());
 		if (local_flags & __GFP_WAIT)
 			local_irq_disable();
 		if (obj) {
diff --git a/mm/slob.c b/mm/slob.c
index 3e7acbc..ffeb218 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -46,7 +46,7 @@
  * NUMA support in SLOB is fairly simplistic, pushing most of the real
  * logic down to the page allocator, and simply doing the node accounting
  * on the upper levels. In the event that a node id is explicitly
- * provided, alloc_pages_node() with the specified node id is used
+ * provided, alloc_pages_exact_node() with the specified node id is used
  * instead. The common case (or when the node id isn't explicitly provided)
  * will default to the current node, as per numa_node_id().
  *
@@ -243,7 +243,7 @@ static void *slob_new_pages(gfp_t gfp, int order, int node)
 
 #ifdef CONFIG_NUMA
 	if (node != -1)
-		page = alloc_pages_node(node, gfp, order);
+		page = alloc_pages_exact_node(node, gfp, order);
 	else
 #endif
 		page = alloc_pages(gfp, order);
-- 
1.5.6.5



* [PATCH 04/22] Check only once if the zonelist is suitable for the allocation
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

It is possible with __GFP_THISNODE that no zones are suitable. This
patch makes sure the check is only made once.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 mm/page_alloc.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5028f40..3bed856 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1486,9 +1486,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (should_fail_alloc_page(gfp_mask, order))
 		return NULL;
 
-restart:
-	z = zonelist->_zonerefs;  /* the list of zones suitable for gfp_mask */
-
+	/* the list of zones suitable for gfp_mask */
+	z = zonelist->_zonerefs;
 	if (unlikely(!z->zone)) {
 		/*
 		 * Happens if we have an empty zonelist as a result of
@@ -1497,6 +1496,7 @@ restart:
 		return NULL;
 	}
 
+restart:
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
-- 
1.5.6.5



* [PATCH 05/22] Break up the allocator entry point into fast and slow paths
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

The core of the page allocator is one giant function which allocates memory
on the stack and makes calculations that may not be needed for every
allocation. This patch breaks up the allocator path into fast and slow
paths for clarity. Note that the slow paths are still inlined, but the entry
to them is marked unlikely. If they were not inlined, the text size would
actually increase because there is only one call site.
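
To show the resulting structure without wading through the diff, a standalone
sketch is below. The helper names are placeholders for illustration only and
are not the real mm/page_alloc.c symbols.

#include <stddef.h>

#define unlikely(x)     __builtin_expect(!!(x), 0)

struct page;    /* opaque here; only pointers are passed around */

/* Placeholder for the common-case freelist attempt. */
static struct page *fast_attempt(unsigned int order)
{
        (void)order;
        return NULL;
}

/* All of the rare work (kswapd wakeup, reclaim, OOM) would live here. */
static inline struct page *slowpath(unsigned int order)
{
        (void)order;
        return NULL;
}

struct page *alloc_sketch(unsigned int order)
{
        struct page *page = fast_attempt(order);

        /* the entry to the slow path is what gets marked unlikely */
        if (unlikely(!page))
                page = slowpath(order);
        return page;
}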

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 mm/page_alloc.c |  353 +++++++++++++++++++++++++++++++++++--------------------
 1 files changed, 228 insertions(+), 125 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3bed856..1e09f26 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1460,47 +1460,171 @@ try_next_zone:
 	return page;
 }
 
-/*
- * This is the 'heart' of the zoned buddy allocator.
- */
-struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-			struct zonelist *zonelist, nodemask_t *nodemask)
+static inline int
+should_alloc_retry(gfp_t gfp_mask, unsigned int order,
+				unsigned long pages_reclaimed)
 {
-	const gfp_t wait = gfp_mask & __GFP_WAIT;
-	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
-	struct zoneref *z;
-	struct zone *zone;
-	struct page *page;
-	struct reclaim_state reclaim_state;
-	struct task_struct *p = current;
-	int do_retry;
-	int alloc_flags;
-	unsigned long did_some_progress;
-	unsigned long pages_reclaimed = 0;
+	/* Do not loop if specifically requested */
+	if (gfp_mask & __GFP_NORETRY)
+		return 0;
 
-	lockdep_trace_alloc(gfp_mask);
+	/*
+	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
+	 * means __GFP_NOFAIL, but that may not be true in other
+	 * implementations.
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return 1;
 
-	might_sleep_if(wait);
+	/*
+	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
+	 * specified, then we retry until we no longer reclaim any pages
+	 * (above), or we've reclaimed an order of pages at least as
+	 * large as the allocation's order. In both cases, if the
+	 * allocation still fails, we stop retrying.
+	 */
+	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
+		return 1;
 
-	if (should_fail_alloc_page(gfp_mask, order))
-		return NULL;
+	/*
+	 * Don't let big-order allocations loop unless the caller
+	 * explicitly requests that.
+	 */
+	if (gfp_mask & __GFP_NOFAIL)
+		return 1;
 
-	/* the list of zones suitable for gfp_mask */
-	z = zonelist->_zonerefs;
-	if (unlikely(!z->zone)) {
-		/*
-		 * Happens if we have an empty zonelist as a result of
-		 * GFP_THISNODE being used on a memoryless node
-		 */
+	return 0;
+}
+
+static inline struct page *
+__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask)
+{
+	struct page *page;
+
+	/* Acquire the OOM killer lock for the zones in zonelist */
+	if (!try_set_zone_oom(zonelist, gfp_mask)) {
+		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
 
-restart:
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+	/*
+	 * Go through the zonelist yet one more time, keep very high watermark
+	 * here, this is only to catch a parallel oom killing, we must fail if
+	 * we're still under heavy pressure.
+	 */
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+		order, zonelist, high_zoneidx,
+		ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 	if (page)
-		goto got_pg;
+		goto out;
+
+	/* The OOM killer will not help higher order allocs */
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		goto out;
+
+	/* Exhausted what can be done so it's blamo time */
+	out_of_memory(zonelist, gfp_mask, order);
+
+out:
+	clear_zonelist_oom(zonelist, gfp_mask);
+	return page;
+}
+
+/* The really slow allocator path where we enter direct reclaim */
+static inline struct page *
+__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress)
+{
+	struct page *page = NULL;
+	struct reclaim_state reclaim_state;
+	struct task_struct *p = current;
+
+	cond_resched();
+
+	/* We now go into synchronous reclaim */
+	cpuset_memory_pressure_bump();
+
+	/*
+	 * The task's cpuset might have expanded its set of allowable nodes
+	 */
+	p->flags |= PF_MEMALLOC;
+	lockdep_set_current_reclaim_state(gfp_mask);
+	reclaim_state.reclaimed_slab = 0;
+	p->reclaim_state = &reclaim_state;
+
+	*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
+
+	p->reclaim_state = NULL;
+	lockdep_clear_current_reclaim_state();
+	p->flags &= ~PF_MEMALLOC;
+
+	cond_resched();
+
+	if (order != 0)
+		drain_all_pages();
+
+	if (likely(*did_some_progress))
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
+					zonelist, high_zoneidx, alloc_flags);
+	return page;
+}
+
+static inline int
+is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
+{
+	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+			&& !in_interrupt())
+		return 1;
+	return 0;
+}
+
+/*
+ * This is called in the allocator slow-path if the allocation request is of
+ * sufficient urgency to ignore watermarks and take other desperate measures
+ */
+static inline struct page *
+__alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask)
+{
+	struct page *page;
+
+	do {
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
+			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
+
+		if (!page && gfp_mask & __GFP_NOFAIL)
+			congestion_wait(WRITE, HZ/50);
+	} while (!page && (gfp_mask & __GFP_NOFAIL));
+
+	return page;
+}
+
+static inline
+void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
+						enum zone_type high_zoneidx)
+{
+	struct zoneref *z;
+	struct zone *zone;
+
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+		wakeup_kswapd(zone, order);
+}
+
+static inline struct page *
+__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask)
+{
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	struct page *page = NULL;
+	int alloc_flags;
+	unsigned long pages_reclaimed = 0;
+	unsigned long did_some_progress;
+	struct task_struct *p = current;
 
 	/*
 	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
@@ -1513,8 +1637,7 @@ restart:
 	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
 		goto nopage;
 
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		wakeup_kswapd(zone, order);
+	wake_all_kswapd(order, zonelist, high_zoneidx);
 
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
@@ -1534,6 +1657,7 @@ restart:
 	if (wait)
 		alloc_flags |= ALLOC_CPUSET;
 
+restart:
 	/*
 	 * Go through the zonelist again. Let __GFP_HIGH and allocations
 	 * coming from realtime tasks go deeper into reserves.
@@ -1547,23 +1671,18 @@ restart:
 	if (page)
 		goto got_pg;
 
-	/* This allocation should allow future memory freeing. */
-
 rebalance:
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
+	/* Allocate without watermarks if the context allows */
+	if (is_allocation_high_priority(p, gfp_mask)) {
+		/* Do not dip into emergency reserves if specified */
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
-nofail_alloc:
-			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, nodemask, order,
-				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
+			page = __alloc_pages_high_priority(gfp_mask, order,
+				zonelist, high_zoneidx, nodemask);
 			if (page)
 				goto got_pg;
-			if (gfp_mask & __GFP_NOFAIL) {
-				congestion_wait(WRITE, HZ/50);
-				goto nofail_alloc;
-			}
 		}
+
+		/* Ensure no recursion into the allocator */
 		goto nopage;
 	}
 
@@ -1571,93 +1690,42 @@ nofail_alloc:
 	if (!wait)
 		goto nopage;
 
-	cond_resched();
-
-	/* We now go into synchronous reclaim */
-	cpuset_memory_pressure_bump();
-
-	p->flags |= PF_MEMALLOC;
-
-	lockdep_set_current_reclaim_state(gfp_mask);
-	reclaim_state.reclaimed_slab = 0;
-	p->reclaim_state = &reclaim_state;
-
-	did_some_progress = try_to_free_pages(zonelist, order,
-						gfp_mask, nodemask);
-
-	p->reclaim_state = NULL;
-	lockdep_clear_current_reclaim_state();
-	p->flags &= ~PF_MEMALLOC;
+	/* Try direct reclaim and then allocating */
+	page = __alloc_pages_direct_reclaim(gfp_mask, order,
+					zonelist, high_zoneidx,
+					nodemask,
+					alloc_flags, &did_some_progress);
+	if (page)
+		goto got_pg;
 
-	cond_resched();
+	/*
+	 * If we failed to make any progress reclaiming, then we are
+	 * running out of options and have to consider going OOM
+	 */
+	if (!did_some_progress) {
+		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
+			page = __alloc_pages_may_oom(gfp_mask, order,
+					zonelist, high_zoneidx,
+					nodemask);
+			if (page)
+				goto got_pg;
 
-	if (order != 0)
-		drain_all_pages();
+			/*
+			 * The OOM killer does not trigger for high-order allocations
+			 * but if no progress is being made, there are no other
+			 * options and retrying is unlikely to help
+			 */
+			if (order > PAGE_ALLOC_COSTLY_ORDER)
+				goto nopage;
 
-	if (likely(did_some_progress)) {
-		page = get_page_from_freelist(gfp_mask, nodemask, order,
-					zonelist, high_zoneidx, alloc_flags);
-		if (page)
-			goto got_pg;
-	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
-		if (!try_set_zone_oom(zonelist, gfp_mask)) {
-			schedule_timeout_uninterruptible(1);
 			goto restart;
 		}
-
-		/*
-		 * Go through the zonelist yet one more time, keep
-		 * very high watermark here, this is only to catch
-		 * a parallel oom killing, we must fail if we're still
-		 * under heavy pressure.
-		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
-			order, zonelist, high_zoneidx,
-			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
-		if (page) {
-			clear_zonelist_oom(zonelist, gfp_mask);
-			goto got_pg;
-		}
-
-		/* The OOM killer will not help higher order allocs so fail */
-		if (order > PAGE_ALLOC_COSTLY_ORDER) {
-			clear_zonelist_oom(zonelist, gfp_mask);
-			goto nopage;
-		}
-
-		out_of_memory(zonelist, gfp_mask, order);
-		clear_zonelist_oom(zonelist, gfp_mask);
-		goto restart;
 	}
 
-	/*
-	 * Don't let big-order allocations loop unless the caller explicitly
-	 * requests that.  Wait for some write requests to complete then retry.
-	 *
-	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
-	 * means __GFP_NOFAIL, but that may not be true in other
-	 * implementations.
-	 *
-	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
-	 * specified, then we retry until we no longer reclaim any pages
-	 * (above), or we've reclaimed an order of pages at least as
-	 * large as the allocation's order. In both cases, if the
-	 * allocation still fails, we stop retrying.
-	 */
+	/* Check if we should retry the allocation */
 	pages_reclaimed += did_some_progress;
-	do_retry = 0;
-	if (!(gfp_mask & __GFP_NORETRY)) {
-		if (order <= PAGE_ALLOC_COSTLY_ORDER) {
-			do_retry = 1;
-		} else {
-			if (gfp_mask & __GFP_REPEAT &&
-				pages_reclaimed < (1 << order))
-					do_retry = 1;
-		}
-		if (gfp_mask & __GFP_NOFAIL)
-			do_retry = 1;
-	}
-	if (do_retry) {
+	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
+		/* Wait for some write requests to complete then retry */
 		congestion_wait(WRITE, HZ/50);
 		goto rebalance;
 	}
@@ -1672,6 +1740,41 @@ nopage:
 	}
 got_pg:
 	return page;
+
+}
+
+/*
+ * This is the 'heart' of the zoned buddy allocator.
+ */
+struct page *
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask)
+{
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	struct page *page;
+
+	lockdep_trace_alloc(gfp_mask);
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+
+	if (should_fail_alloc_page(gfp_mask, order))
+		return NULL;
+
+	/*
+	 * Check the zones suitable for the gfp_mask contain at least one
+	 * valid zone. It's possible to have an empty zonelist as a result
+	 * of GFP_THISNODE and a memoryless node
+	 */
+	if (unlikely(!zonelist->_zonerefs->zone))
+		return NULL;
+
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
+			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+	if (unlikely(!page))
+		page = __alloc_pages_slowpath(gfp_mask, order,
+				zonelist, high_zoneidx, nodemask);
+
+	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
 
-- 
1.5.6.5


* [PATCH 06/22] Move check for disabled anti-fragmentation out of fastpath
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On low-memory systems, anti-fragmentation gets disabled as there is nothing
it can do and it would just incur overhead shuffling pages between lists
constantly. Currently the check is made in the free page fast path for every
page. This patch moves it to a slow path. On machines with low memory,
there will be a small amount of additional overhead as pages get shuffled
between lists, but it should settle quickly.
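
Put another way, the per-page read side loses its branch and the rare write
side clamps the value instead. A rough standalone sketch of the idea, using
simplified stand-in types rather than the real mmzone.h interfaces:

enum { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE };

static int grouping_disabled;           /* set once on low-memory systems */
static int pageblock_type[1024];        /* stand-in for the pageblock bitmap */

/* Write side: runs rarely, so it absorbs the check. */
void set_block_type(int block, int migratetype)
{
        if (grouping_disabled)
                migratetype = MIGRATE_UNMOVABLE;
        pageblock_type[block] = migratetype;
}

/* Read side: runs for every page freed and is now branch-free. */
int get_block_type(int block)
{
        return pageblock_type[block];
}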

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/mmzone.h |    3 ---
 mm/page_alloc.c        |    4 ++++
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 186ec6a..f82bdba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -50,9 +50,6 @@ extern int page_group_by_mobility_disabled;
 
 static inline int get_pageblock_migratetype(struct page *page)
 {
-	if (unlikely(page_group_by_mobility_disabled))
-		return MIGRATE_UNMOVABLE;
-
 	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1e09f26..b1ae435 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -172,6 +172,10 @@ int page_group_by_mobility_disabled __read_mostly;
 
 static void set_pageblock_migratetype(struct page *page, int migratetype)
 {
+
+	if (unlikely(page_group_by_mobility_disabled))
+		migratetype = MIGRATE_UNMOVABLE;
+
 	set_pageblock_flags_group(page, (unsigned long)migratetype,
 					PB_migrate, PB_migrate_end);
 }
-- 
1.5.6.5


* [PATCH 07/22] Calculate the preferred zone for allocation only once
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

get_page_from_freelist() can be called multiple times for an allocation.
Part of this calculates the preferred_zone, which is the first usable zone
in the zonelist, but that zone depends only on the GFP flags specified at
the beginning of the allocation call. This patch calculates preferred_zone
once per allocation. It is safe to do this because if preferred_zone is
NULL at the start of the call, no amount of direct reclaim or other action
will change the fact that the allocation will fail.
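
The shape of the change is "compute once in the entry point, pass the result
down as a parameter". A minimal standalone sketch with invented names; in the
real patch the value threaded through is struct zone *preferred_zone.

#include <stddef.h>

struct zone_sketch { int classzone_idx; };

/* Invented lookup standing in for first_zones_zonelist(). */
static struct zone_sketch *first_usable_zone(void)
{
        static struct zone_sketch z = { .classzone_idx = 0 };
        return &z;
}

/* Every attempt reuses the zone computed by the caller. */
static void *attempt(unsigned int order, struct zone_sketch *preferred)
{
        (void)order;
        (void)preferred;
        return NULL;
}

void *alloc_sketch(unsigned int order)
{
        struct zone_sketch *preferred = first_usable_zone();
        void *page;

        if (!preferred)
                return NULL;    /* no amount of reclaim changes this */

        page = attempt(order, preferred);               /* fast path */
        if (!page)
                page = attempt(order, preferred);       /* slow-path retry */
        return page;
}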

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 mm/page_alloc.c |   53 ++++++++++++++++++++++++++++++++---------------------
 1 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b1ae435..e073fa3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1392,23 +1392,18 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
  */
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
-		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
+		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
+		struct zone *preferred_zone)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
 	int classzone_idx;
-	struct zone *zone, *preferred_zone;
+	struct zone *zone;
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
-	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
-							&preferred_zone);
-	if (!preferred_zone)
-		return NULL;
-
 	classzone_idx = zone_idx(preferred_zone);
-
 	VM_BUG_ON(order >= MAX_ORDER);
 
 zonelist_scan:
@@ -1503,7 +1498,7 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask)
+	nodemask_t *nodemask, struct zone *preferred_zone)
 {
 	struct page *page;
 
@@ -1520,7 +1515,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	 */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
-		ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
+		preferred_zone);
 	if (page)
 		goto out;
 
@@ -1540,7 +1536,8 @@ out:
 static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress)
+	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+	unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
@@ -1572,7 +1569,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	if (likely(*did_some_progress))
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
-					zonelist, high_zoneidx, alloc_flags);
+					zonelist, high_zoneidx,
+					alloc_flags, preferred_zone);
 	return page;
 }
 
@@ -1592,13 +1590,14 @@ is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
 static inline struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask)
+	nodemask_t *nodemask, struct zone *preferred_zone)
 {
 	struct page *page;
 
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
-			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
+			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
+			preferred_zone);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			congestion_wait(WRITE, HZ/50);
@@ -1621,7 +1620,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask)
+	nodemask_t *nodemask, struct zone *preferred_zone)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
@@ -1671,7 +1670,8 @@ restart:
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
-						high_zoneidx, alloc_flags);
+						high_zoneidx, alloc_flags,
+						preferred_zone);
 	if (page)
 		goto got_pg;
 
@@ -1681,7 +1681,7 @@ rebalance:
 		/* Do not dip into emergency reserves if specified */
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 			page = __alloc_pages_high_priority(gfp_mask, order,
-				zonelist, high_zoneidx, nodemask);
+				zonelist, high_zoneidx, nodemask, preferred_zone);
 			if (page)
 				goto got_pg;
 		}
@@ -1698,7 +1698,8 @@ rebalance:
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask,
-					alloc_flags, &did_some_progress);
+					alloc_flags, preferred_zone,
+					&did_some_progress);
 	if (page)
 		goto got_pg;
 
@@ -1710,7 +1711,7 @@ rebalance:
 		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
-					nodemask);
+					nodemask, preferred_zone);
 			if (page)
 				goto got_pg;
 
@@ -1755,6 +1756,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	struct zone *preferred_zone;
 	struct page *page;
 
 	lockdep_trace_alloc(gfp_mask);
@@ -1772,11 +1774,20 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(!zonelist->_zonerefs->zone))
 		return NULL;
 
+	/* The preferred zone is used for statistics later */
+	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
+							&preferred_zone);
+	if (!preferred_zone)
+		return NULL;
+
+	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
+			preferred_zone);
 	if (unlikely(!page))
 		page = __alloc_pages_slowpath(gfp_mask, order,
-				zonelist, high_zoneidx, nodemask);
+				zonelist, high_zoneidx, nodemask,
+				preferred_zone);
 
 	return page;
 }
-- 
1.5.6.5


* [PATCH 08/22] Calculate the migratetype for allocation only once
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

The GFP mask is converted into a migratetype when deciding which pagelist
to take a page from. However, this conversion happens multiple times per
allocation, at least once per zone traversed. This patch calculates it once.
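
As with the preferred zone, the conversion moves out of the per-zone loop and
into the entry point. A standalone sketch of the pattern follows; the flag bit
and helpers below are invented for illustration, not the kernel's definitions.

#include <stddef.h>

#define SK_GFP_MOVABLE  0x08    /* invented flag bit for the sketch */

enum { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE };

static int gfp_to_migratetype(unsigned int gfp_flags)
{
        return (gfp_flags & SK_GFP_MOVABLE) ? MIGRATE_MOVABLE : MIGRATE_UNMOVABLE;
}

/* Invented per-zone attempt that takes the precomputed migratetype. */
static void *try_zone_sketch(int zone, int migratetype)
{
        (void)zone;
        (void)migratetype;
        return NULL;
}

void *alloc_sketch(unsigned int gfp_flags, int nr_zones)
{
        /* before: recomputed inside the loop, once per zone tried */
        int migratetype = gfp_to_migratetype(gfp_flags);
        void *page = NULL;
        int i;

        for (i = 0; i < nr_zones && !page; i++)
                page = try_zone_sketch(i, migratetype);

        return page;
}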

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   43 ++++++++++++++++++++++++++-----------------
 1 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e073fa3..eb1548c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1065,13 +1065,13 @@ void split_page(struct page *page, unsigned int order)
  * or two.
  */
 static struct page *buffered_rmqueue(struct zone *preferred_zone,
-			struct zone *zone, int order, gfp_t gfp_flags)
+			struct zone *zone, int order, gfp_t gfp_flags,
+			int migratetype)
 {
 	unsigned long flags;
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
 	int cpu;
-	int migratetype = allocflags_to_migratetype(gfp_flags);
 
 again:
 	cpu  = get_cpu();
@@ -1393,7 +1393,7 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone)
+		struct zone *preferred_zone, int migratetype)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
@@ -1436,7 +1436,8 @@ zonelist_scan:
 			}
 		}
 
-		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
+		page = buffered_rmqueue(preferred_zone, zone, order,
+						gfp_mask, migratetype);
 		if (page)
 			break;
 this_zone_full:
@@ -1498,7 +1499,8 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, struct zone *preferred_zone)
+	nodemask_t *nodemask, struct zone *preferred_zone,
+	int migratetype)
 {
 	struct page *page;
 
@@ -1516,7 +1518,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
 		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
-		preferred_zone);
+		preferred_zone, migratetype);
 	if (page)
 		goto out;
 
@@ -1537,7 +1539,7 @@ static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	unsigned long *did_some_progress)
+	int migratetype, unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
@@ -1570,7 +1572,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	if (likely(*did_some_progress))
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
-					alloc_flags, preferred_zone);
+					alloc_flags, preferred_zone,
+					migratetype);
 	return page;
 }
 
@@ -1590,14 +1593,15 @@ is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
 static inline struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, struct zone *preferred_zone)
+	nodemask_t *nodemask, struct zone *preferred_zone,
+	int migratetype)
 {
 	struct page *page;
 
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
-			preferred_zone);
+			preferred_zone, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			congestion_wait(WRITE, HZ/50);
@@ -1620,7 +1624,8 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, struct zone *preferred_zone)
+	nodemask_t *nodemask, struct zone *preferred_zone,
+	int migratetype)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
@@ -1671,7 +1676,8 @@ restart:
 	 */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 						high_zoneidx, alloc_flags,
-						preferred_zone);
+						preferred_zone,
+						migratetype);
 	if (page)
 		goto got_pg;
 
@@ -1681,7 +1687,8 @@ rebalance:
 		/* Do not dip into emergency reserves if specified */
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 			page = __alloc_pages_high_priority(gfp_mask, order,
-				zonelist, high_zoneidx, nodemask, preferred_zone);
+				zonelist, high_zoneidx, nodemask, preferred_zone,
+				migratetype);
 			if (page)
 				goto got_pg;
 		}
@@ -1699,7 +1706,7 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					&did_some_progress);
+					migratetype, &did_some_progress);
 	if (page)
 		goto got_pg;
 
@@ -1711,7 +1718,8 @@ rebalance:
 		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
-					nodemask, preferred_zone);
+					nodemask, preferred_zone,
+					migratetype);
 			if (page)
 				goto got_pg;
 
@@ -1758,6 +1766,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct zone *preferred_zone;
 	struct page *page;
+	int migratetype = allocflags_to_migratetype(gfp_mask);
 
 	lockdep_trace_alloc(gfp_mask);
 
@@ -1783,11 +1792,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
-			preferred_zone);
+			preferred_zone, migratetype);
 	if (unlikely(!page))
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone);
+				preferred_zone, migratetype);
 
 	return page;
 }
-- 
1.5.6.5


* [PATCH 09/22] Calculate the alloc_flags for allocation only once
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

Factor out the mapping between the GFP flags and alloc_flags into its own
function. Once factored out, the alloc_flags only need to be calculated once
per allocation, but some care must be taken.

[neilb@suse.de says]
As the test:

-       if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-                       && !in_interrupt()) {
-               if (!(gfp_mask & __GFP_NOMEMALLOC)) {

has been replaced with a slightly weaker one:

+       if (alloc_flags & ALLOC_NO_WATERMARKS) {

Without care, this would allow recursion into the allocator via direct
reclaim. This patch ensures we do not recurse when PF_MEMALLOC is set
but TIF_MEMDIE callers are now allowed to enter direct reclaim where they
would have been prevented in the past.

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 mm/page_alloc.c |   94 +++++++++++++++++++++++++++++-------------------------
 1 files changed, 50 insertions(+), 44 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eb1548c..0d23795 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1577,15 +1577,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-static inline int
-is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
-{
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt())
-		return 1;
-	return 0;
-}
-
 /*
  * This is called in the allocator slow-path if the allocation request is of
  * sufficient urgency to ignore watermarks and take other desperate measures
@@ -1621,6 +1612,42 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
 		wakeup_kswapd(zone, order);
 }
 
+static inline int
+gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+	struct task_struct *p = current;
+	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+	/*
+	 * The caller may dip into page reserves a bit more if the caller
+	 * cannot run direct reclaim, or if the caller has realtime scheduling
+	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
+	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+	 */
+	if (gfp_mask & __GFP_HIGH)
+		alloc_flags |= ALLOC_HIGH;
+
+	if (!wait) {
+		alloc_flags |= ALLOC_HARDER;
+		/*
+		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+		 */
+		alloc_flags &= ~ALLOC_CPUSET;
+	} else if (unlikely(rt_task(p)))
+		alloc_flags |= ALLOC_HARDER;
+
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (!in_interrupt() &&
+		    ((p->flags & PF_MEMALLOC) ||
+		     unlikely(test_thread_flag(TIF_MEMDIE))))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+	}
+
+	return alloc_flags;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -1651,56 +1678,35 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
-	 *
-	 * The caller may dip into page reserves a bit more if the caller
-	 * cannot run direct reclaim, or if the caller has realtime scheduling
-	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 	 */
-	alloc_flags = ALLOC_WMARK_MIN;
-	if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
-		alloc_flags |= ALLOC_HARDER;
-	if (gfp_mask & __GFP_HIGH)
-		alloc_flags |= ALLOC_HIGH;
-	if (wait)
-		alloc_flags |= ALLOC_CPUSET;
+	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
 restart:
-	/*
-	 * Go through the zonelist again. Let __GFP_HIGH and allocations
-	 * coming from realtime tasks go deeper into reserves.
-	 *
-	 * This is the last chance, in general, before the goto nopage.
-	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
-	 */
+	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
-						high_zoneidx, alloc_flags,
-						preferred_zone,
-						migratetype);
+			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
+			preferred_zone, migratetype);
 	if (page)
 		goto got_pg;
 
 rebalance:
 	/* Allocate without watermarks if the context allows */
-	if (is_allocation_high_priority(p, gfp_mask)) {
-		/* Do not dip into emergency reserves if specified */
-		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
-			page = __alloc_pages_high_priority(gfp_mask, order,
-				zonelist, high_zoneidx, nodemask, preferred_zone,
-				migratetype);
-			if (page)
-				goto got_pg;
-		}
-
-		/* Ensure no recursion into the allocator */
-		goto nopage;
+	if (alloc_flags & ALLOC_NO_WATERMARKS) {
+		page = __alloc_pages_high_priority(gfp_mask, order,
+				zonelist, high_zoneidx, nodemask,
+				preferred_zone, migratetype);
+		if (page)
+			goto got_pg;
 	}
 
 	/* Atomic allocations - we can't balance anything */
 	if (!wait)
 		goto nopage;
 
+	/* Avoid recursion of direct reclaim */
+	if (p->flags & PF_MEMALLOC)
+		goto nopage;
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,
-- 
1.5.6.5
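
As a rough illustration of the factoring, here is a minimal standalone C
sketch of the same idea: translate the request flags into allocator-internal
flags exactly once, then let the slow path test the precomputed result while
still refusing direct reclaim for a PF_MEMALLOC-style caller. All identifiers
(gfpx_*, ALLOCX_*, struct taskx) and the bit values are invented stand-ins
for the example, not the kernel's definitions.

#include <stdio.h>

/* Invented request bits (stand-ins for GFP flags) */
#define GFPX_WAIT        0x01u  /* caller may sleep / enter direct reclaim */
#define GFPX_HIGH        0x02u  /* high-priority request */
#define GFPX_NOMEMALLOC  0x04u  /* never touch emergency reserves */

/* Invented allocator-internal flags (stand-ins for ALLOC_* flags) */
#define ALLOCX_WMARK_MIN      0x01
#define ALLOCX_HARDER         0x10
#define ALLOCX_HIGH           0x20
#define ALLOCX_CPUSET         0x40
#define ALLOCX_NO_WATERMARKS  0x80

#define PFX_MEMALLOC     0x01u  /* task is itself allocating for reclaim */

struct taskx { unsigned int flags; };

/* Map request bits to internal flags once per allocation attempt. */
static int gfpx_to_alloc_flags(unsigned int gfp, const struct taskx *p)
{
	int alloc_flags = ALLOCX_WMARK_MIN | ALLOCX_CPUSET;

	if (gfp & GFPX_HIGH)
		alloc_flags |= ALLOCX_HIGH;

	if (!(gfp & GFPX_WAIT)) {
		/* atomic-style request: try harder, ignore cpuset limits */
		alloc_flags |= ALLOCX_HARDER;
		alloc_flags &= ~ALLOCX_CPUSET;
	}

	if (!(gfp & GFPX_NOMEMALLOC) && (p->flags & PFX_MEMALLOC))
		alloc_flags |= ALLOCX_NO_WATERMARKS;

	return alloc_flags;
}

int main(void)
{
	struct taskx reclaimer = { .flags = PFX_MEMALLOC };
	int alloc_flags = gfpx_to_alloc_flags(GFPX_WAIT, &reclaimer);

	/* Later checks reuse the precomputed flags... */
	if (alloc_flags & ALLOCX_NO_WATERMARKS)
		printf("would allocate while ignoring watermarks\n");

	/* ...but recursion into reclaim is still refused explicitly,
	 * mirroring the extra PF_MEMALLOC check in the patch. */
	if (reclaimer.flags & PFX_MEMALLOC)
		printf("would skip direct reclaim to avoid recursion\n");

	return 0;
}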


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 10/22] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

Allocations that specify __GFP_HIGH get the ALLOC_HIGH flag. If these
flags are equal to each other, we can eliminate a branch.

[akpm@linux-foundation.org: Suggested the hack]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/page_alloc.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d23795..4f9cdaa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1619,14 +1619,16 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 
+	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
+	BUILD_BUG_ON(__GFP_HIGH != ALLOC_HIGH);
+
 	/*
 	 * The caller may dip into page reserves a bit more if the caller
 	 * cannot run direct reclaim, or if the caller has realtime scheduling
 	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
 	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 	 */
-	if (gfp_mask & __GFP_HIGH)
-		alloc_flags |= ALLOC_HIGH;
+	alloc_flags |= (gfp_mask & __GFP_HIGH);
 
 	if (!wait) {
 		alloc_flags |= ALLOC_HARDER;
-- 
1.5.6.5
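
The trick relies on the two flags sharing the same bit value, so the
translation becomes a mask rather than a test-and-set. A minimal standalone C
sketch of the pattern follows; the names and values are invented for the
example, with static_assert standing in for the kernel's BUILD_BUG_ON().

#include <assert.h>
#include <stdio.h>

#define REQX_HIGH    0x20u   /* bit in the invented request mask */
#define ALLOCX_HIGH  0x20u   /* same bit value in the internal flags */

/* Compile-time guard: the branchless copy below is only valid while
 * the two definitions keep the same value. */
static_assert(REQX_HIGH == ALLOCX_HIGH, "flag values must match");

static unsigned int to_alloc_flags(unsigned int req)
{
	unsigned int alloc_flags = 0;

	/* Copy the shared bit directly instead of branching on it. */
	alloc_flags |= (req & REQX_HIGH);
	return alloc_flags;
}

int main(void)
{
	printf("0x%x\n", to_alloc_flags(REQX_HIGH));  /* prints 0x20 */
	printf("0x%x\n", to_alloc_flags(0));          /* prints 0x0  */
	return 0;
}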


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 11/22] Inline __rmqueue_smallest()
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

Inline __rmqueue_smallest() by altering the flow very slightly so that there
is only one call site. With a single call site, the function can be inlined
without causing text bloat. On an x86-based config, this patch reduces text
by 16 bytes.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/page_alloc.c |   20 ++++++++++++++++----
 1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f9cdaa..8bfced9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -665,7 +665,8 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
  * Go through the free lists for the given migratetype and remove
  * the smallest available page from the freelists
  */
-static struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
+static inline
+struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 						int migratetype)
 {
 	unsigned int current_order;
@@ -835,8 +836,7 @@ static struct page *__rmqueue_fallback(struct zone *zone, int order,
 		}
 	}
 
-	/* Use MIGRATE_RESERVE rather than fail an allocation */
-	return __rmqueue_smallest(zone, order, MIGRATE_RESERVE);
+	return NULL;
 }
 
 /*
@@ -848,11 +848,23 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
 {
 	struct page *page;
 
+retry_reserve:
 	page = __rmqueue_smallest(zone, order, migratetype);
 
-	if (unlikely(!page))
+	if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
 		page = __rmqueue_fallback(zone, order, migratetype);
 
+		/*
+		 * Use MIGRATE_RESERVE rather than fail an allocation. goto
+		 * is used because __rmqueue_smallest is an inline function
+		 * and we want just one call site
+		 */
+		if (!page) {
+			migratetype = MIGRATE_RESERVE;
+			goto retry_reserve;
+		}
+	}
+
 	return page;
 }
 
-- 
1.5.6.5
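
The structure of the change is the classic "retry with a fallback argument
via goto" shape, which keeps the helper at a single call site so the compiler
can inline it. Below is a minimal standalone C sketch of that shape; the
names (take_from_pool, POOL_RESERVE) are invented for the example.

#include <stdio.h>

enum pool { POOL_NORMAL, POOL_RESERVE };

static int pool_pages[2] = { 0 /* normal exhausted */, 4 /* reserve */ };

/* Single call site below, so the compiler can inline this freely. */
static inline int take_from_pool(enum pool p)
{
	if (pool_pages[p] > 0) {
		pool_pages[p]--;
		return 1;          /* got a page */
	}
	return 0;
}

static int allocate(void)
{
	enum pool p = POOL_NORMAL;

retry_reserve:
	if (take_from_pool(p))
		return 1;

	/* Fall back to the reserve pool via goto rather than a second
	 * direct call, keeping take_from_pool() at one call site. */
	if (p != POOL_RESERVE) {
		p = POOL_RESERVE;
		goto retry_reserve;
	}
	return 0;
}

int main(void)
{
	printf("%s\n", allocate() ? "allocated" : "failed");
	return 0;
}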


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 12/22] Inline buffered_rmqueue()
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

buffered_rmqueue() is in the fast path, so inline it. Because it has only one
call site, it can be inlined without causing text bloat. On an x86-based
config, it made no difference as the savings were padded out by NOP
instructions. Mileage varies, but text will either decrease in size or remain
static.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/page_alloc.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8bfced9..cb57382 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1076,7 +1076,8 @@ void split_page(struct page *page, unsigned int order)
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
  */
-static struct page *buffered_rmqueue(struct zone *preferred_zone,
+static inline
+struct page *buffered_rmqueue(struct zone *preferred_zone,
 			struct zone *zone, int order, gfp_t gfp_flags,
 			int migratetype)
 {
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 13/22] Inline __rmqueue_fallback()
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

__rmqueue_fallback() is in the slow path but has only one call site, so it
can be inlined without causing text bloat. On an x86-based config, it made no
difference as the savings were padded out by NOP instructions. Mileage
varies, but text will either decrease in size or remain static.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb57382..e0e9e67 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -775,8 +775,8 @@ static int move_freepages_block(struct zone *zone, struct page *page,
 }
 
 /* Remove an element from the buddy allocator from the fallback list */
-static struct page *__rmqueue_fallback(struct zone *zone, int order,
-						int start_migratetype)
+static inline struct page *
+__rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 {
 	struct free_area * area;
 	int current_order;
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 14/22] Do not call get_pageblock_migratetype() more than necessary
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

get_pageblock_migratetype() is potentially called twice for every page free:
once when the page is freed to the pcp lists and once when it is freed back
to the buddy lists. When freeing from the pcp lists, the pageblock type at
the time of the original free is already known, so use it rather than
rechecking. In low-memory situations this might skew anti-fragmentation
decisions slightly, but the interference is minimal and decisions that
fragment memory are being made anyway.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/page_alloc.c |   16 ++++++++++------
 1 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e0e9e67..67cafd0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -456,16 +456,18 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
  */
 
 static inline void __free_one_page(struct page *page,
-		struct zone *zone, unsigned int order)
+		struct zone *zone, unsigned int order,
+		int migratetype)
 {
 	unsigned long page_idx;
 	int order_size = 1 << order;
-	int migratetype = get_pageblock_migratetype(page);
 
 	if (unlikely(PageCompound(page)))
 		if (unlikely(destroy_compound_page(page, order)))
 			return;
 
+	VM_BUG_ON(migratetype == -1);
+
 	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
 
 	VM_BUG_ON(page_idx & (order_size - 1));
@@ -534,17 +536,18 @@ static void free_pages_bulk(struct zone *zone, int count,
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_one_page list manipulates */
 		list_del(&page->lru);
-		__free_one_page(page, zone, order);
+		__free_one_page(page, zone, order, page_private(page));
 	}
 	spin_unlock(&zone->lock);
 }
 
-static void free_one_page(struct zone *zone, struct page *page, int order)
+static void free_one_page(struct zone *zone, struct page *page, int order,
+				int migratetype)
 {
 	spin_lock(&zone->lock);
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
-	__free_one_page(page, zone, order);
+	__free_one_page(page, zone, order, migratetype);
 	spin_unlock(&zone->lock);
 }
 
@@ -569,7 +572,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	free_one_page(page_zone(page), page, order);
+	free_one_page(page_zone(page), page, order,
+					get_pageblock_migratetype(page));
 	local_irq_restore(flags);
 }
 
-- 
1.5.6.5
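
In other words, the expensive lookup is done once when the page enters the
free path and the result is stashed in the page (via page_private()) for the
later buddy free. A minimal standalone C sketch of that caching pattern
follows; struct toy_page, lookup_block_type() and the other names are
invented for the example.

#include <stdio.h>

/* A toy "page": the block type is expensive to look up, so it is
 * cached in a private field when the page first enters the free path. */
struct toy_page {
	int private_type;   /* stands in for page_private() */
};

static int lookup_block_type(int page_index)
{
	/* Pretend this walks a bitmap, as get_pageblock_migratetype() does. */
	return page_index % 3;
}

static void free_to_percpu_list(struct toy_page *page, int page_index)
{
	/* Look the type up once, at the point where it is needed anyway... */
	page->private_type = lookup_block_type(page_index);
}

static void free_to_buddy(struct toy_page *page)
{
	/* ...and reuse the cached value instead of a second lookup. */
	printf("freeing with cached type %d\n", page->private_type);
}

int main(void)
{
	struct toy_page page;

	free_to_percpu_list(&page, 7);
	free_to_buddy(&page);
	return 0;
}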


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 15/22] Do not disable interrupts in free_page_mlock()
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

free_page_mlock() tests and clears PG_mlocked using locked versions of the
bit operations. If the bit is set, it disables interrupts to update the
counters, and this happens on every page free even though interrupts are
disabled again very shortly afterwards. This is wasteful.

This patch splits what free_page_mlock() does. The bit check is still made,
but the counter update is delayed until interrupts are already disabled and
the non-locked version of the bit clear is used. One potential weirdness with
this split is that the counters do not get updated if the bad_page() check
triggers, but a system showing bad pages is in serious trouble already.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/internal.h   |   11 +++--------
 mm/page_alloc.c |    8 +++++++-
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 987bb03..58ec1bc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
  */
 static inline void free_page_mlock(struct page *page)
 {
-	if (unlikely(TestClearPageMlocked(page))) {
-		unsigned long flags;
-
-		local_irq_save(flags);
-		__dec_zone_page_state(page, NR_MLOCK);
-		__count_vm_event(UNEVICTABLE_MLOCKFREED);
-		local_irq_restore(flags);
-	}
+	__ClearPageMlocked(page);
+	__dec_zone_page_state(page, NR_MLOCK);
+	__count_vm_event(UNEVICTABLE_MLOCKFREED);
 }
 
 #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 67cafd0..7f45de1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -499,7 +499,6 @@ static inline void __free_one_page(struct page *page,
 
 static inline int free_pages_check(struct page *page)
 {
-	free_page_mlock(page);
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
 		(page_count(page) != 0)  |
@@ -556,6 +555,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	unsigned long flags;
 	int i;
 	int bad = 0;
+	int clearMlocked = PageMlocked(page);
 
 	for (i = 0 ; i < (1 << order) ; ++i)
 		bad += free_pages_check(page + i);
@@ -571,6 +571,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	kernel_map_pages(page, 1 << order, 0);
 
 	local_irq_save(flags);
+	if (unlikely(clearMlocked))
+		free_page_mlock(page);
 	__count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, order,
 					get_pageblock_migratetype(page));
@@ -1017,6 +1019,7 @@ static void free_hot_cold_page(struct page *page, int cold)
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
+	int clearMlocked = PageMlocked(page);
 
 	if (PageAnon(page))
 		page->mapping = NULL;
@@ -1032,7 +1035,10 @@ static void free_hot_cold_page(struct page *page, int cold)
 
 	pcp = &zone_pcp(zone, get_cpu())->pcp;
 	local_irq_save(flags);
+	if (unlikely(clearMlocked))
+		free_page_mlock(page);
 	__count_vm_event(PGFREE);
+
 	if (cold)
 		list_add_tail(&page->lru, &pcp->list);
 	else
-- 
1.5.6.5
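
The shape of the change is: test the bit cheaply up front, then fold the
counter update and the non-atomic clear into the interrupt-off section the
free path needs anyway. A minimal standalone C sketch of that pattern
follows; irq_save()/irq_restore() are empty stand-ins for
local_irq_save()/local_irq_restore() and the other names are invented.

#include <stdbool.h>
#include <stdio.h>

static unsigned long mlock_freed_count;

struct toy_page { bool mlocked; };

static void irq_save(void)    { /* stand-in for local_irq_save()    */ }
static void irq_restore(void) { /* stand-in for local_irq_restore() */ }

/* Cheap variant: only called with "interrupts" already disabled, so a
 * plain (non-atomic) clear and counter update are enough. */
static void free_page_mlock_cheap(struct toy_page *page)
{
	page->mlocked = false;
	mlock_freed_count++;
}

static void free_one(struct toy_page *page)
{
	bool clear_mlocked = page->mlocked;  /* test outside the section */

	irq_save();                          /* needed by the free path anyway */
	if (clear_mlocked)
		free_page_mlock_cheap(page); /* piggy-back on the section */
	/* ... rest of the free path ... */
	irq_restore();
}

int main(void)
{
	struct toy_page page = { .mlocked = true };

	free_one(&page);
	printf("mlocked pages freed: %lu\n", mlock_freed_count);
	return 0;
}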


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 16/22] Do not setup zonelist cache when there is only one node
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

There is a zonelist cache which is used to track zones that are not in
the allowed cpuset or found to be recently full. This is to reduce cache
footprint on large machines. On smaller machines, it just incurs cost
for no gain. This patch only sets up and uses the zonelist cache when more
than one node is online.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 mm/page_alloc.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7f45de1..e59bb80 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1467,8 +1467,11 @@ this_zone_full:
 		if (NUMA_BUILD)
 			zlc_mark_zone_full(zonelist, z);
 try_next_zone:
-		if (NUMA_BUILD && !did_zlc_setup) {
-			/* we do zlc_setup after the first zone is tried */
+		if (NUMA_BUILD && !did_zlc_setup && num_online_nodes() > 1) {
+			/*
+			 * we do zlc_setup after the first zone is tried but only
+			 * if there are multiple nodes make it worthwhile
+			 */
 			allowednodes = zlc_setup(zonelist, alloc_flags);
 			zlc_active = 1;
 			did_zlc_setup = 1;
-- 
1.5.6.5
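
The pattern here is simply lazy initialisation guarded by a "will it pay off"
check. A minimal standalone C sketch, with invented names (zlc_setup_toy,
online_nodes) rather than the kernel's:

#include <stdbool.h>
#include <stdio.h>

static int online_nodes = 1;      /* pretend single-node machine */
static bool zlc_ready;

static void zlc_setup_toy(void)
{
	/* Pretend this allocates and fills a per-zonelist cache. */
	zlc_ready = true;
	puts("zonelist cache initialised");
}

static void scan_zones(void)
{
	for (int zone = 0; zone < 3; zone++) {
		/* ... try the zone ... */

		/* Only pay for the cache when several nodes make it useful. */
		if (!zlc_ready && online_nodes > 1)
			zlc_setup_toy();
	}
}

int main(void)
{
	scan_zones();                 /* no cache set up on one node */
	printf("cache ready: %d\n", (int)zlc_ready);
	return 0;
}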


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 17/22] Do not check for compound pages during the page allocator sanity checks
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

A number of sanity checks are made on each page allocation and free,
including a check that the page count is zero. page_count() checks for
compound pages and, if the page is compound, checks the count of the head
page instead. However, in these paths we do not care whether the page is
compound or not, as the count of each tail page should also be zero.

This patch makes two changes to the use of page_count() in the free path. It
converts one check of page_count() to a VM_BUG_ON() as the count should
have been unconditionally checked earlier in the free path. It also avoids
checking for compound pages.

[mel@csn.ul.ie: Wrote changelog]
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 mm/page_alloc.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e59bb80..b174f2c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -425,7 +425,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 		return 0;
 
 	if (PageBuddy(buddy) && page_order(buddy) == order) {
-		BUG_ON(page_count(buddy) != 0);
+		VM_BUG_ON(page_count(buddy) != 0);
 		return 1;
 	}
 	return 0;
@@ -501,7 +501,7 @@ static inline int free_pages_check(struct page *page)
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(atomic_read(&page->_count) != 0) |
 		(page->flags & PAGE_FLAGS_CHECK_AT_FREE))) {
 		bad_page(page);
 		return 1;
@@ -646,7 +646,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(atomic_read(&page->_count) != 0)  |
 		(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
 		bad_page(page);
 		return 1;
-- 
1.5.6.5
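
The distinction is between the general accessor, which follows the head page
for compound pages, and a raw read of the count field when every page in the
path, head or tail, must have a count of zero anyway. A minimal standalone C
sketch with an invented struct toy_page:

#include <stdio.h>

struct toy_page {
	int count;
	struct toy_page *head;   /* non-NULL for a "tail" page */
};

/* General accessor: follows the head pointer, much as page_count()
 * does for compound pages. */
static int toy_page_count(const struct toy_page *page)
{
	if (page->head)
		page = page->head;
	return page->count;
}

static int free_check(const struct toy_page *page)
{
	/* In the free path every page, head or tail, must have a zero
	 * count of its own, so read the field directly and skip the
	 * compound-page handling. */
	return page->count != 0;
}

int main(void)
{
	struct toy_page head = { .count = 0, .head = NULL };
	struct toy_page tail = { .count = 0, .head = &head };

	printf("general: %d, free-path: %d\n",
	       toy_page_count(&tail), free_check(&tail));
	return 0;
}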


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 18/22] Use allocation flags as an index to the zone watermark
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determine whether
pages_min, pages_low or pages_high is used as the zone watermark when
allocating pages. Two branches in the allocator hotpath decide which
watermark to use. This patch places the three watermarks in a union with an
array so that the watermark flags can be used directly as an array index,
reducing the branches taken.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/mmzone.h |    8 +++++++-
 mm/page_alloc.c        |   20 ++++++++++----------
 2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f82bdba..c1fa208 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -275,7 +275,13 @@ struct zone_reclaim_stat {
 
 struct zone {
 	/* Fields commonly accessed by the page allocator */
-	unsigned long		pages_min, pages_low, pages_high;
+	union {
+		struct {
+			unsigned long	pages_min, pages_low, pages_high;
+		};
+		unsigned long pages_mark[3];
+	};
+
 	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b174f2c..6030f49 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1154,10 +1154,15 @@ failed:
 	return NULL;
 }
 
-#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
+/* The WMARK bits are used as an index zone->pages_mark */
+#define ALLOC_WMARK_MIN		0x00 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW		0x01 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH	0x02 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */
+
+/* Mask to get the watermark bits */
+#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
+
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
@@ -1445,12 +1450,7 @@ zonelist_scan:
 
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
 			unsigned long mark;
-			if (alloc_flags & ALLOC_WMARK_MIN)
-				mark = zone->pages_min;
-			else if (alloc_flags & ALLOC_WMARK_LOW)
-				mark = zone->pages_low;
-			else
-				mark = zone->pages_high;
+			mark = zone->pages_mark[alloc_flags & ALLOC_WMARK_MASK];
 			if (!zone_watermark_ok(zone, order, mark,
 				    classzone_idx, alloc_flags)) {
 				if (!zone_reclaim_mode ||
-- 
1.5.6.5
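
The core trick is the anonymous-struct/array union plus a mask, so the low
watermark bits of alloc_flags index the storage directly. A minimal
standalone C sketch of the technique (C11 for the anonymous members; the
names and values are invented for the example):

#include <stdio.h>

/* Low bits select the watermark; the next bit means "no watermark". */
#define WMARKX_MIN   0x0
#define WMARKX_LOW   0x1
#define WMARKX_HIGH  0x2
#define NO_WMARKSX   0x4
#define WMARKX_MASK  (NO_WMARKSX - 1)

struct toy_zone {
	union {
		struct {
			unsigned long pages_min, pages_low, pages_high;
		};
		unsigned long pages_mark[3];   /* same storage, indexable */
	};
};

int main(void)
{
	struct toy_zone zone;

	zone.pages_min  = 10;
	zone.pages_low  = 20;
	zone.pages_high = 30;

	/* One indexed load replaces an if/else-if chain on the flags. */
	for (int flags = WMARKX_MIN; flags <= WMARKX_HIGH; flags++)
		printf("mark for flags 0x%x: %lu\n",
		       flags, zone.pages_mark[flags & WMARKX_MASK]);

	return 0;
}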


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 19/22] Update NR_FREE_PAGES only as necessary
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

When pages are being freed to the buddy allocator, the zone
NR_FREE_PAGES counter must be updated. In the case of bulk per-cpu page
freeing, it is currently updated once per page, which touches the counter's
cache line more often than necessary. Update the counter once per per-cpu
bulk free instead.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/page_alloc.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6030f49..6494e13 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -460,7 +460,6 @@ static inline void __free_one_page(struct page *page,
 		int migratetype)
 {
 	unsigned long page_idx;
-	int order_size = 1 << order;
 
 	if (unlikely(PageCompound(page)))
 		if (unlikely(destroy_compound_page(page, order)))
@@ -470,10 +469,9 @@ static inline void __free_one_page(struct page *page,
 
 	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
 
-	VM_BUG_ON(page_idx & (order_size - 1));
+	VM_BUG_ON(page_idx & ((1 << order) - 1));
 	VM_BUG_ON(bad_range(zone, page));
 
-	__mod_zone_page_state(zone, NR_FREE_PAGES, order_size);
 	while (order < MAX_ORDER-1) {
 		unsigned long combined_idx;
 		struct page *buddy;
@@ -528,6 +526,8 @@ static void free_pages_bulk(struct zone *zone, int count,
 	spin_lock(&zone->lock);
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
+
+	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
 	while (count--) {
 		struct page *page;
 
@@ -546,6 +546,8 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 	spin_lock(&zone->lock);
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
+
+	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 	__free_one_page(page, zone, order, migratetype);
 	spin_unlock(&zone->lock);
 }
@@ -690,7 +692,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 		list_del(&page->lru);
 		rmv_page_order(page);
 		area->nr_free--;
-		__mod_zone_page_state(zone, NR_FREE_PAGES, - (1UL << order));
 		expand(zone, page, order, current_order, area, migratetype);
 		return page;
 	}
@@ -830,8 +831,6 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 			/* Remove the page from the freelists */
 			list_del(&page->lru);
 			rmv_page_order(page);
-			__mod_zone_page_state(zone, NR_FREE_PAGES,
-							-(1UL << order));
 
 			if (current_order == pageblock_order)
 				set_pageblock_migratetype(page,
@@ -904,6 +903,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		set_page_private(page, migratetype);
 		list = &page->lru;
 	}
+	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
 	spin_unlock(&zone->lock);
 	return i;
 }
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 186+ messages in thread
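
The win in the patch above is simply that the NR_FREE_PAGES cache line is
touched once per batch rather than once per page. For readers less familiar
with the pattern, here is a minimal userspace sketch of the same idea; the
names (nr_free, free_one, batch_free) are invented for the example and only
stand in for the zone counter, __free_one_page() and free_pages_bulk():

#include <stdio.h>

static long nr_free;                    /* stand-in for the NR_FREE_PAGES counter */

static void free_one(int item)          /* stand-in for the per-page buddy work */
{
        (void)item;                     /* merging etc. would happen here; note
                                         * there is no counter update per item */
}

/* Free 'count' items but adjust the shared counter only once. */
static void batch_free(int count)
{
        nr_free += count;               /* one cache-line touch per batch */
        while (count--)
                free_one(count);
}

int main(void)
{
        batch_free(16);
        printf("nr_free = %ld\n", nr_free);
        return 0;
}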

* [PATCH 20/22] Get the pageblock migratetype without disabling interrupts
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

Local interrupts are disabled when freeing pages to the PCP list. Part of
that path looks up the migratetype of the pageblock the page belongs to, but
it does so with interrupts disabled, and interrupts should never be disabled
longer than necessary. This patch looks up the migratetype with interrupts
enabled, with the impact that it is possible a page is freed to the wrong
list when a pageblock changes type underneath it. As such a block is already
considered mixed from an anti-fragmentation perspective, this is not of
vital importance.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6494e13..ba41551 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1034,6 +1034,7 @@ static void free_hot_cold_page(struct page *page, int cold)
 	kernel_map_pages(page, 1, 0);
 
 	pcp = &zone_pcp(zone, get_cpu())->pcp;
+	set_page_private(page, get_pageblock_migratetype(page));
 	local_irq_save(flags);
 	if (unlikely(clearMlocked))
 		free_page_mlock(page);
@@ -1043,7 +1044,6 @@ static void free_hot_cold_page(struct page *page, int cold)
 		list_add_tail(&page->lru, &pcp->list);
 	else
 		list_add(&page->lru, &pcp->list);
-	set_page_private(page, get_pageblock_migratetype(page));
 	pcp->count++;
 	if (pcp->count >= pcp->high) {
 		free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 186+ messages in thread
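
Interrupts cannot be manipulated from userspace, so the following is only a
rough analogy for the change above: a mutex stands in for the interrupts-off
region, the (comparatively) expensive classification is done before entering
it, and the cached result may be slightly stale by the time it is used, which
is exactly the trade-off the changelog accepts. All names are invented for
the sketch (compile with -pthread):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int freelist_count[3];           /* one "free list" per migratetype */

static int classify(int item)           /* stand-in for get_pageblock_migratetype() */
{
        return item % 3;                /* pretend this walks a bitmap and is costly */
}

static void free_item(int item)
{
        int type = classify(item);      /* looked up outside the critical section */

        pthread_mutex_lock(&lock);      /* stand-in for local_irq_save() */
        freelist_count[type]++;         /* may be "wrong" if the type just changed */
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        int i;

        for (i = 0; i < 9; i++)
                free_item(i);
        printf("%d %d %d\n", freelist_count[0], freelist_count[1],
               freelist_count[2]);
        return 0;
}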

* [PATCH 21/22] Use a pre-calculated value instead of num_online_nodes() in fast paths
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

From: Christoph Lameter <cl@linux-foundation.org>

num_online_nodes() is called in a number of places but most often by the
page allocator when deciding whether the zonelist needs to be filtered based
on cpusets or the zonelist cache. This is actually a heavy function and
touches a number of cache lines.

This patch stores the number of online nodes at boot time and updates the
value when nodes get onlined and offlined. The value is then used in a
number of important paths in place of num_online_nodes().

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/nodemask.h |   15 ++++++++++++++-
 mm/hugetlb.c             |    4 ++--
 mm/page_alloc.c          |   10 ++++++----
 mm/slub.c                |    2 +-
 net/sunrpc/svc.c         |    2 +-
 5 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 848025c..474e73e 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -408,6 +408,19 @@ static inline int num_node_state(enum node_states state)
 #define next_online_node(nid)	next_node((nid), node_states[N_ONLINE])
 
 extern int nr_node_ids;
+extern int nr_online_nodes;
+
+static inline void node_set_online(int nid)
+{
+	node_set_state(nid, N_ONLINE);
+	nr_online_nodes = num_node_state(N_ONLINE);
+}
+
+static inline void node_set_offline(int nid)
+{
+	node_clear_state(nid, N_ONLINE);
+	nr_online_nodes = num_node_state(N_ONLINE);
+}
 #else
 
 static inline int node_state(int node, enum node_states state)
@@ -434,7 +447,7 @@ static inline int num_node_state(enum node_states state)
 #define first_online_node	0
 #define next_online_node(nid)	(MAX_NUMNODES)
 #define nr_node_ids		1
-
+#define nr_online_nodes		1
 #endif
 
 #define node_online_map 	node_states[N_ONLINE]
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1234486..c9a404c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -875,7 +875,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 	 * can no longer free unreserved surplus pages. This occurs when
 	 * the nodes with surplus pages have no free pages.
 	 */
-	unsigned long remaining_iterations = num_online_nodes();
+	unsigned long remaining_iterations = nr_online_nodes;
 
 	/* Uncommit the reservation */
 	h->resv_huge_pages -= unused_resv_pages;
@@ -904,7 +904,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 			h->surplus_huge_pages--;
 			h->surplus_huge_pages_node[nid]--;
 			nr_pages--;
-			remaining_iterations = num_online_nodes();
+			remaining_iterations = nr_online_nodes;
 		}
 	}
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ba41551..1464aca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -165,7 +165,9 @@ static unsigned long __meminitdata dma_reserve;
 
 #if MAX_NUMNODES > 1
 int nr_node_ids __read_mostly = MAX_NUMNODES;
+int nr_online_nodes __read_mostly = 1;
 EXPORT_SYMBOL(nr_node_ids);
+EXPORT_SYMBOL(nr_online_nodes);
 #endif
 
 int page_group_by_mobility_disabled __read_mostly;
@@ -1467,7 +1469,7 @@ this_zone_full:
 		if (NUMA_BUILD)
 			zlc_mark_zone_full(zonelist, z);
 try_next_zone:
-		if (NUMA_BUILD && !did_zlc_setup && num_online_nodes() > 1) {
+		if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
 			/*
 			 * we do zlc_setup after the first zone is tried but only
 			 * if there are multiple nodes make it worthwhile
@@ -2267,7 +2269,7 @@ int numa_zonelist_order_handler(ctl_table *table, int write,
 }
 
 
-#define MAX_NODE_LOAD (num_online_nodes())
+#define MAX_NODE_LOAD (nr_online_nodes)
 static int node_load[MAX_NUMNODES];
 
 /**
@@ -2476,7 +2478,7 @@ static void build_zonelists(pg_data_t *pgdat)
 
 	/* NUMA-aware ordering of nodes */
 	local_node = pgdat->node_id;
-	load = num_online_nodes();
+	load = nr_online_nodes;
 	prev_node = local_node;
 	nodes_clear(used_mask);
 
@@ -2627,7 +2629,7 @@ void build_all_zonelists(void)
 
 	printk("Built %i zonelists in %s order, mobility grouping %s.  "
 		"Total pages: %ld\n",
-			num_online_nodes(),
+			nr_online_nodes,
 			zonelist_order_name[current_zonelist_order],
 			page_group_by_mobility_disabled ? "off" : "on",
 			vm_total_pages);
diff --git a/mm/slub.c b/mm/slub.c
index 93f5fb0..8e181e2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3714,7 +3714,7 @@ static int list_locations(struct kmem_cache *s, char *buf,
 						 to_cpumask(l->cpus));
 		}
 
-		if (num_online_nodes() > 1 && !nodes_empty(l->nodes) &&
+		if (nr_online_nodes > 1 && !nodes_empty(l->nodes) &&
 				len < PAGE_SIZE - 60) {
 			len += sprintf(buf + len, " nodes=");
 			len += nodelist_scnprintf(buf + len, PAGE_SIZE - len - 50,
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 8847add..5ed8931 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -124,7 +124,7 @@ svc_pool_map_choose_mode(void)
 {
 	unsigned int node;
 
-	if (num_online_nodes() > 1) {
+	if (nr_online_nodes > 1) {
 		/*
 		 * Actually have multiple NUMA nodes,
 		 * so split pools on NUMA node boundaries
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 186+ messages in thread
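
The pattern in the patch above is to keep a cached count beside the
authoritative node bitmap and recompute it only at the rare points where the
bitmap changes, so that hot paths read a plain integer. A rough userspace
sketch, with a 64-node limit and helper bodies invented purely for the
example:

#include <stdio.h>

static unsigned long long node_online_map;      /* authoritative state (bitmap) */
static int nr_online_nodes;                     /* cached, cheap-to-read count */

static int count_bits(unsigned long long map)   /* the "heavy" num_online_nodes() */
{
        int n = 0;

        for (; map; map &= map - 1)
                n++;
        return n;
}

static void node_set_online(int nid)
{
        node_online_map |= 1ULL << nid;
        nr_online_nodes = count_bits(node_online_map);  /* recompute on change only */
}

static void node_set_offline(int nid)
{
        node_online_map &= ~(1ULL << nid);
        nr_online_nodes = count_bits(node_online_map);
}

int main(void)
{
        node_set_online(0);
        node_set_online(2);
        /* hot path: a plain load instead of counting bits every time */
        if (nr_online_nodes > 1)
                printf("NUMA: %d nodes online\n", nr_online_nodes);
        node_set_offline(2);
        printf("%d node(s) online\n", nr_online_nodes);
        return 0;
}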

* [PATCH 22/22] slab: Use nr_online_nodes to check for a NUMA platform
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-22 13:53   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 13:53 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

SLAB currently avoids checking a node bitmap repeatedly by checking it once
and storing the result in a flag. With the addition of nr_online_nodes as a
cheaper alternative to num_online_nodes(), that flag can be replaced by a
check of nr_online_nodes.

(Christoph did a patch that this is lifted almost verbatim from, hence the
first Signed-off-by. Christoph, can you confirm you're ok with that?)

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/slab.c |    7 ++-----
 1 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 1c680e8..4d38dc2 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -892,7 +892,6 @@ static void __slab_error(const char *function, struct kmem_cache *cachep,
   */
 
 static int use_alien_caches __read_mostly = 1;
-static int numa_platform __read_mostly = 1;
 static int __init noaliencache_setup(char *s)
 {
 	use_alien_caches = 0;
@@ -1453,10 +1452,8 @@ void __init kmem_cache_init(void)
 	int order;
 	int node;
 
-	if (num_possible_nodes() == 1) {
+	if (num_possible_nodes() == 1)
 		use_alien_caches = 0;
-		numa_platform = 0;
-	}
 
 	for (i = 0; i < NUM_INIT_LISTS; i++) {
 		kmem_list3_init(&initkmem_list3[i]);
@@ -3579,7 +3576,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp)
 	 * variable to skip the call, which is mostly likely to be present in
 	 * the cache.
 	 */
-	if (numa_platform && cache_free_alien(cachep, objp))
+	if (nr_online_nodes > 1 && cache_free_alien(cachep, objp))
 		return;
 
 	if (likely(ac->avail < ac->limit)) {
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 186+ messages in thread
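
One point worth noting when reading the diff above: the old flag was derived
from num_possible_nodes() once at kmem_cache_init() time, whereas the
replacement reads nr_online_nodes at each free, so it also follows later node
online/offline events. A toy userspace before/after sketch of that
difference, with invented names:

#include <stdio.h>

static int nr_online_nodes = 1;         /* kept up to date elsewhere (see 21/22) */

/* Before: a flag computed once at init time and stored. */
static int numa_platform;
static void init_once(void)
{
        numa_platform = (nr_online_nodes > 1);
}

/* After: read the cached count directly; no extra flag to maintain. */
static int is_numa(void)
{
        return nr_online_nodes > 1;
}

int main(void)
{
        init_once();
        nr_online_nodes = 2;            /* e.g. a node came online later */
        printf("flag says %d, direct check says %d\n", numa_platform, is_numa());
        return 0;
}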

* Re: [PATCH 22/22] slab: Use nr_online_nodes to check for a NUMA  platform
  2009-04-22 13:53   ` Mel Gorman
@ 2009-04-22 14:37     ` Pekka Enberg
  -1 siblings, 0 replies; 186+ messages in thread
From: Pekka Enberg @ 2009-04-22 14:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Wed, Apr 22, 2009 at 4:53 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> SLAB currently avoids checking a bitmap repeatedly by checking once and
> storing a flag. When the addition of nr_online_nodes as a cheaper version
> of num_online_nodes(), this check can be replaced by nr_online_nodes.
>
> (Christoph did a patch that this is lifted almost verbatim from, hence the
> first Signed-off-by. Christoph, can you confirm you're ok with that?)
>
> Signed-off-by: Christoph Lameter <cl@linux.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-22 13:53   ` Mel Gorman
@ 2009-04-22 16:13     ` Dave Hansen
  -1 siblings, 0 replies; 186+ messages in thread
From: Dave Hansen @ 2009-04-22 16:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> No user of the allocator API should be passing in an order >= MAX_ORDER
> but we check for it on each and every allocation. Delete this check and
> make it a VM_BUG_ON check further down the call path.

Should we get the check re-added to some of the upper-level functions,
then?  Perhaps __get_free_pages() or things like alloc_pages_exact()? 

I'm selfishly thinking of what I did in profile_init().  Can I slab
alloc it?  Nope.  Page allocator?  Nope.  Oh, well, try vmalloc():

        prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL);
        if (prof_buffer)
                return 0;

        prof_buffer = alloc_pages_exact(buffer_bytes, GFP_KERNEL|__GFP_ZERO);
        if (prof_buffer)
                return 0;

        prof_buffer = vmalloc(buffer_bytes);
        if (prof_buffer)
                return 0;

        free_cpumask_var(prof_cpu_mask);
        return -ENOMEM;

Same thing in __kmalloc_section_memmap():

        page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
        if (page)
                goto got_map_page;

        ret = vmalloc(memmap_size);
        if (ret)
                goto got_map_ptr;

I depend on the allocator to tell me when I've fed it too high of an
order.  If we really need this, perhaps we should do an audit and then
add a WARN_ON() for a few releases to catch the stragglers.

-- Dave


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-22 16:13     ` Dave Hansen
@ 2009-04-22 17:11       ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 17:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, Apr 22, 2009 at 09:13:11AM -0700, Dave Hansen wrote:
> On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > No user of the allocator API should be passing in an order >= MAX_ORDER
> > but we check for it on each and every allocation. Delete this check and
> > make it a VM_BUG_ON check further down the call path.
> 
> Should we get the check re-added to some of the upper-level functions,
> then?  Perhaps __get_free_pages() or things like alloc_pages_exact()? 
> 

I don't think so, no. It just moves the source of the text bloat, and
only for the sake of the few callers that are asking for something that
will never succeed.

> I'm selfishly thinking of what I did in profile_init().  Can I slab
> alloc it?  Nope.  Page allocator?  Nope.  Oh, well, try vmalloc():
> 
>         prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL);
>         if (prof_buffer)
>                 return 0;
> 
>         prof_buffer = alloc_pages_exact(buffer_bytes, GFP_KERNEL|__GFP_ZERO);
>         if (prof_buffer)
>                 return 0;
> 
>         prof_buffer = vmalloc(buffer_bytes);
>         if (prof_buffer)
>                 return 0;
> 
>         free_cpumask_var(prof_cpu_mask);
>         return -ENOMEM;
> 

Can this ever actually be asking for an order larger than MAX_ORDER
though? If so, you're condemning it to always behave poorly.

> Same thing in __kmalloc_section_memmap():
> 
>         page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
>         if (page)
>                 goto got_map_page;
> 
>         ret = vmalloc(memmap_size);
>         if (ret)
>                 goto got_map_ptr;
> 

If I'm reading that right, the order will never be a stupid order. It can fail
for higher orders, in which case it falls back to vmalloc(). For example, to
hit that limit on a kernel with 4K pages and a maximum usable order of 10,
the section size would need to be 256MB (assuming a struct page size of 64
bytes). I don't think it's ever that size and, if it were, the allocation
would always take the sub-optimal path, which is a poor choice to make.

> I depend on the allocator to tell me when I've fed it too high of an
> order.  If we really need this, perhaps we should do an audit and then
> add a WARN_ON() for a few releases to catch the stragglers.
> 

I consider it buggy to ask for something so large that you always end up
with the worst option - vmalloc(). How about leaving it as a VM_BUG_ON
to get as many reports as possible on who is depending on this odd
behaviour?

If there are users with good reasons, then we could convert this to WARN_ON
to fix up the callers. I suspect that the allocator can already cope with
receiving a stupid order silently but slowly. It should go all the way to the
bottom and just never find anything useful and return NULL.  zone_watermark_ok
is the most dangerous looking part but even it should never get to MAX_ORDER
because it should always find there are not enough free pages and return
before it overruns.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark
  2009-04-22 13:53   ` Mel Gorman
@ 2009-04-22 17:11     ` Dave Hansen
  -1 siblings, 0 replies; 186+ messages in thread
From: Dave Hansen @ 2009-04-22 17:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
>  struct zone {
>         /* Fields commonly accessed by the page allocator */
> -       unsigned long           pages_min, pages_low, pages_high;
> +       union {
> +               struct {
> +                       unsigned long   pages_min, pages_low, pages_high;
> +               };
> +               unsigned long pages_mark[3];
> +       };

Why the union?  It's a bit obfuscated for me.  Why not just have a
couple of these:

static inline unsigned long zone_pages_min(struct zone *zone)
{
	return zone->pages_mark[ALLOC_WMARK_MIN];
}

and s/zone->pages_min/zone_pages_min(zone)/

?

-- Dave


^ permalink raw reply	[flat|nested] 186+ messages in thread
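
For anyone wondering what the union in the patch buys: the anonymous struct
and the array overlay the same storage, so zone->pages_low and
zone->pages_mark[1] are the same word, which is what allows alloc_flags to be
used as an index in one place while everything else keeps the familiar field
names. A standalone sketch of just that aliasing (anonymous struct/union
members need C11 or the GNU extension, and the assumption that three unsigned
longs lay out like an array is the same one the patch relies on):

#include <stdio.h>

struct zone_marks {
        union {
                struct {
                        unsigned long pages_min, pages_low, pages_high;
                };
                unsigned long pages_mark[3];
        };
};

int main(void)
{
        struct zone_marks z;

        z.pages_mark[0] = 10;           /* written via the array ... */
        z.pages_mark[1] = 20;
        z.pages_mark[2] = 30;

        /* ... read back via the named fields: same storage, two spellings */
        printf("pages_min=%lu pages_low=%lu pages_high=%lu\n",
               z.pages_min, z.pages_low, z.pages_high);
        return 0;
}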

* Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark
  2009-04-22 17:11     ` Dave Hansen
@ 2009-04-22 17:14       ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-22 17:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, Apr 22, 2009 at 10:11:53AM -0700, Dave Hansen wrote:
> On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> >  struct zone {
> >         /* Fields commonly accessed by the page allocator */
> > -       unsigned long           pages_min, pages_low, pages_high;
> > +       union {
> > +               struct {
> > +                       unsigned long   pages_min, pages_low, pages_high;
> > +               };
> > +               unsigned long pages_mark[3];
> > +       };
> 
> Why the union?  It's a bit obfuscated for me.  Why not just have a
> couple of these:
> 
> static inline unsigned long zone_pages_min(struct zone *zone)
> {
> 	return zone->pages_mark[ALLOC_WMARK_MIN];
> }
> 
> and s/zone->pages_min/zone_pages_min(zone)/
> 
> ?
> 

Preference of taste really. When I started a conversion to accessors, it
changed something recognised to something new that looked uglier to me.
Only one place cares about the union enough to access it via an array, so
why spread it everywhere.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-22 17:11       ` Mel Gorman
@ 2009-04-22 17:30         ` Dave Hansen
  -1 siblings, 0 replies; 186+ messages in thread
From: Dave Hansen @ 2009-04-22 17:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, 2009-04-22 at 18:11 +0100, Mel Gorman wrote:
> On Wed, Apr 22, 2009 at 09:13:11AM -0700, Dave Hansen wrote:
> > On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > > No user of the allocator API should be passing in an order >= MAX_ORDER
> > > but we check for it on each and every allocation. Delete this check and
> > > make it a VM_BUG_ON check further down the call path.
> > 
> > Should we get the check re-added to some of the upper-level functions,
> > then?  Perhaps __get_free_pages() or things like alloc_pages_exact()? 
> 
> I don't think so, no. It just moves the source of the text bloat and
> for the few callers that are asking for something that will never
> succeed.

Well, it's a matter of figuring out when it can succeed.  Some of this
stuff, we can figure out at compile-time.  Others are a bit harder.

> > I'm selfishly thinking of what I did in profile_init().  Can I slab
> > alloc it?  Nope.  Page allocator?  Nope.  Oh, well, try vmalloc():
> > 
> >         prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL);
> >         if (prof_buffer)
> >                 return 0;
> > 
> >         prof_buffer = alloc_pages_exact(buffer_bytes, GFP_KERNEL|__GFP_ZERO);
> >         if (prof_buffer)
> >                 return 0;
> > 
> >         prof_buffer = vmalloc(buffer_bytes);
> >         if (prof_buffer)
> >                 return 0;
> > 
> >         free_cpumask_var(prof_cpu_mask);
> >         return -ENOMEM;
> > 
> 
> Can this ever actually be asking for an order larger than MAX_ORDER
> though? If so, you're condemning it to always behave poorly.

Yeah.  It is based on text size.  Smaller kernels with trimmed configs
and no modules have no problem fitting under MAX_ORDER, as do kernels
with larger base page sizes.  

> > Same thing in __kmalloc_section_memmap():
> > 
> >         page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
> >         if (page)
> >                 goto got_map_page;
> > 
> >         ret = vmalloc(memmap_size);
> >         if (ret)
> >                 goto got_map_ptr;
> > 
> 
> If I'm reading that right, the order will never be a stupid order. It can fail
> for higher orders in which case it falls back to vmalloc() .  For example,
> to hit that limit, the section size for a 4K kernel, maximum usable order
> of 10, the section size would need to be 256MB (assuming struct page size
> of 64 bytes). I don't think it's ever that size and if so, it'll always be
> sub-optimal which is a poor choice to make.

I think the section size default used to be 512M on x86 because we
concentrate on removing whole DIMMs.  

> > I depend on the allocator to tell me when I've fed it too high of an
> > order.  If we really need this, perhaps we should do an audit and then
> > add a WARN_ON() for a few releases to catch the stragglers.
> 
> I consider it buggy to ask for something so large that you always end up
> with the worst option - vmalloc(). How about leaving it as a VM_BUG_ON
> to get as many reports as possible on who is depending on this odd
> behaviour?
> 
> If there are users with good reasons, then we could convert this to WARN_ON
> to fix up the callers. I suspect that the allocator can already cope with
> recieving a stupid order silently but slowly. It should go all the way to the
> bottom and just never find anything useful and return NULL.  zone_watermark_ok
> is the most dangerous looking part but even it should never get to MAX_ORDER
> because it should always find there are not enough free pages and return
> before it overruns.

Whatever we do, I'd agree that it's fine that this is a degenerate case
that gets handled very slowly and as far out of hot paths as possible.
Anybody who can fall back to a vmalloc is not doing these things very
often.

-- Dave


^ permalink raw reply	[flat|nested] 186+ messages in thread
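
The arithmetic being debated above is easy to check: with 4K pages, a maximum
usable order of MAX_ORDER-1 = 10 and a 64-byte struct page, the largest buddy
allocation (4MB) covers the memmap of a 256MB section, while a 512MB section
needs an order-11 memmap and will always take the vmalloc() fallback. A quick
userspace sketch of that calculation, with get_order() re-implemented locally
and the page and struct page sizes assumed as stated:

#include <stdio.h>

#define PAGE_SHIFT      12              /* 4K pages assumed */
#define PAGE_SIZE       (1UL << PAGE_SHIFT)
#define MAX_ORDER       11
#define STRUCT_PAGE_SZ  64UL            /* assumed size of struct page */

/* Same semantics as the kernel helper: order of pages needed for 'size' bytes */
static int get_order(unsigned long size)
{
        int order = 0;

        size = (size - 1) >> PAGE_SHIFT;
        for (; size; size >>= 1)
                order++;
        return order;
}

int main(void)
{
        unsigned long mb = 1024 * 1024;
        unsigned long sections[] = { 128 * mb, 256 * mb, 512 * mb };
        int i;

        for (i = 0; i < 3; i++) {
                unsigned long memmap = (sections[i] / PAGE_SIZE) * STRUCT_PAGE_SZ;
                int order = get_order(memmap);

                printf("%4luMB section: memmap %4luKB, order %2d -> %s\n",
                       sections[i] / mb, memmap / 1024, order,
                       order < MAX_ORDER ? "page allocator" : "vmalloc fallback");
        }
        return 0;
}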

* Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark
  2009-04-22 17:14       ` Mel Gorman
@ 2009-04-22 17:47         ` Dave Hansen
  -1 siblings, 0 replies; 186+ messages in thread
From: Dave Hansen @ 2009-04-22 17:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, 2009-04-22 at 18:14 +0100, Mel Gorman wrote:
> Preference of taste really. When I started a conversion to accessors, it
> changed something recognised to something new that looked uglier to me.
> Only one place cares about the union enough to access is via an array so
> why spread it everywhere.

Personally, I'd say for consistency.  Someone looking at both forms
wouldn't necessarily know that they refer to the same variables unless
they know about the union.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark
  2009-04-22 13:53   ` Mel Gorman
@ 2009-04-22 20:06     ` David Rientjes
  -1 siblings, 0 replies; 186+ messages in thread
From: David Rientjes @ 2009-04-22 20:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, 22 Apr 2009, Mel Gorman wrote:

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b174f2c..6030f49 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1154,10 +1154,15 @@ failed:
>  	return NULL;
>  }
>  
> -#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
> -#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
> -#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
> -#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
> +/* The WMARK bits are used as an index zone->pages_mark */
> +#define ALLOC_WMARK_MIN		0x00 /* use pages_min watermark */
> +#define ALLOC_WMARK_LOW		0x01 /* use pages_low watermark */
> +#define ALLOC_WMARK_HIGH	0x02 /* use pages_high watermark */
> +#define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */
> +
> +/* Mask to get the watermark bits */
> +#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
> +
>  #define ALLOC_HARDER		0x10 /* try to alloc harder */
>  #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
>  #define ALLOC_CPUSET		0x40 /* check for correct cpuset */

The watermark flags should probably be members of an anonymous enum since 
they're being used as an index into an array.  If another watermark were 
ever to be added it would require a value of 0x03, for instance.

	enum {
		ALLOC_WMARK_MIN,
		ALLOC_WMARK_LOW,
		ALLOC_WMARK_HIGH,

		ALLOC_WMARK_MASK = 0xf	/* no more than 16 possible watermarks */
	};

This eliminates ALLOC_NO_WATERMARKS and the caller that uses it would 
simply pass 0.

> @@ -1445,12 +1450,7 @@ zonelist_scan:
>  
>  		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {

This would become

	if (alloc_flags & ALLOC_WMARK_MASK)
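
As a minimal sketch of the indexing this enables, assuming the
zone->pages_mark[] array named in the quoted diff (the helper itself is
hypothetical and not part of the patch):

static inline unsigned long selected_wmark(struct zone *z, int alloc_flags)
{
	/* the low bits of alloc_flags pick the watermark directly */
	return z->pages_mark[alloc_flags & ALLOC_WMARK_MASK];
}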

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-22 17:11       ` Mel Gorman
@ 2009-04-22 20:11         ` David Rientjes
  -1 siblings, 0 replies; 186+ messages in thread
From: David Rientjes @ 2009-04-22 20:11 UTC (permalink / raw)
  To: Mel Gorman, Pekka Enberg
  Cc: Dave Hansen, Linux Memory Management List, KOSAKI Motohiro,
	Christoph Lameter, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra, Andrew Morton

On Wed, 22 Apr 2009, Mel Gorman wrote:

> If there are users with good reasons, then we could convert this to WARN_ON
> to fix up the callers. I suspect that the allocator can already cope with
> receiving a stupid order silently but slowly. It should go all the way to the
> bottom and just never find anything useful and return NULL.  zone_watermark_ok
> is the most dangerous looking part but even it should never get to MAX_ORDER
> because it should always find there are not enough free pages and return
> before it overruns.
> 

slub: enforce MAX_ORDER

slub_max_order may not be equal to or greater than MAX_ORDER.

Additionally, if a single object cannot be placed in a slab of
slub_max_order, it still must allocate slabs below MAX_ORDER.

Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/slub.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1909,7 +1909,7 @@ static inline int calculate_order(int size)
 	 * Doh this slab cannot be placed using slub_max_order.
 	 */
 	order = slab_order(size, 1, MAX_ORDER, 1);
-	if (order <= MAX_ORDER)
+	if (order < MAX_ORDER)
 		return order;
 	return -ENOSYS;
 }
@@ -2522,6 +2522,7 @@ __setup("slub_min_order=", setup_slub_min_order);
 static int __init setup_slub_max_order(char *str)
 {
 	get_option(&str, &slub_max_order);
+	slub_max_order = min(slub_max_order, MAX_ORDER - 1);
 
 	return 1;
 }

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-22 20:11         ` David Rientjes
@ 2009-04-22 20:20           ` Christoph Lameter
  -1 siblings, 0 replies; 186+ messages in thread
From: Christoph Lameter @ 2009-04-22 20:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Pekka Enberg, Dave Hansen,
	Linux Memory Management List, KOSAKI Motohiro, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton




Acked-by: Christoph Lameter <cl@linux-foundation.org>



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 16/22] Do not setup zonelist cache when there is only one node
  2009-04-22 13:53   ` Mel Gorman
@ 2009-04-22 20:24     ` David Rientjes
  -1 siblings, 0 replies; 186+ messages in thread
From: David Rientjes @ 2009-04-22 20:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, 22 Apr 2009, Mel Gorman wrote:

> There is a zonelist cache which is used to track zones that are not in
> the allowed cpuset or found to be recently full. This is to reduce cache
> footprint on large machines. On smaller machines, it just incurs cost
> for no gain. This patch only uses the zonelist cache when there are NUMA
> nodes.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> ---
>  mm/page_alloc.c |    7 +++++--
>  1 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7f45de1..e59bb80 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1467,8 +1467,11 @@ this_zone_full:
>  		if (NUMA_BUILD)
>  			zlc_mark_zone_full(zonelist, z);

If zonelist caching is never used for UMA machines, why should they ever 
call zlc_mark_zone_full()?  It will always dereference 
zonelist->zlcache_ptr and immediately return without doing anything.

Wouldn't it be better to just add

	if (num_online_nodes() == 1)
		continue;

right before this call to zlc_mark_zone_full()?  This should compile out 
the remainder of the loop for !CONFIG_NUMA kernels anyway.
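
A minimal sketch of that suggestion as a hypothetical wrapper (the wrapper and
its name are not in the patch; zlc_mark_zone_full() and the types are the
existing ones, and only the zlc_mark_zone_full() half of the idea is shown
rather than the full continue):

static inline void maybe_mark_zone_full(struct zonelist *zonelist,
					struct zoneref *z)
{
	/* single-node systems never benefit from the zonelist cache */
	if (!NUMA_BUILD || num_online_nodes() == 1)
		return;
	zlc_mark_zone_full(zonelist, z);
}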

>  try_next_zone:
> -		if (NUMA_BUILD && !did_zlc_setup) {
> -			/* we do zlc_setup after the first zone is tried */
> +		if (NUMA_BUILD && !did_zlc_setup && num_online_nodes() > 1) {
> +			/*
> +			 * we do zlc_setup after the first zone is tried but only
> +			 * if there are multiple nodes make it worthwhile
> +			 */
>  			allowednodes = zlc_setup(zonelist, alloc_flags);
>  			zlc_active = 1;
>  			did_zlc_setup = 1;
> -- 
> 1.5.6.5

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 16/22] Do not setup zonelist cache when there is only one node
  2009-04-22 20:24     ` David Rientjes
@ 2009-04-22 20:32       ` Lee Schermerhorn
  -1 siblings, 0 replies; 186+ messages in thread
From: Lee Schermerhorn @ 2009-04-22 20:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Linux Memory Management List, KOSAKI Motohiro,
	Christoph Lameter, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra, Pekka Enberg,
	Andrew Morton

On Wed, 2009-04-22 at 13:24 -0700, David Rientjes wrote:
> On Wed, 22 Apr 2009, Mel Gorman wrote:
> 
> > There is a zonelist cache which is used to track zones that are not in
> > the allowed cpuset or found to be recently full. This is to reduce cache
> > footprint on large machines. On smaller machines, it just incurs cost
> > for no gain. This patch only uses the zonelist cache when there are NUMA
> > nodes.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> > ---
> >  mm/page_alloc.c |    7 +++++--
> >  1 files changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 7f45de1..e59bb80 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1467,8 +1467,11 @@ this_zone_full:
> >  		if (NUMA_BUILD)
> >  			zlc_mark_zone_full(zonelist, z);
> 
> If zonelist caching is never used for UMA machines, why should they ever 
> call zlc_mark_zone_full()?  It will always dereference 
> zonelist->zlcache_ptr and immediately return without doing anything.
> 
> Wouldn't it be better to just add
> 
> 	if (num_online_nodes() == 1)
> 		continue;
> 
> right before this call to zlc_mark_zone_full()?  This should compile out 
> the remainder of the loop for !CONFIG_NUMA kernels anyway.

Shouldn't it already do that?  NUMA_BUILD is defined as 0 when
!CONFIG_NUMA to avoid #ifdef's in the code while still allowing compiler
error checking in the dead code.
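
Roughly the definition being referred to (paraphrased from
include/linux/kernel.h):

#ifdef CONFIG_NUMA
#define NUMA_BUILD 1
#else
#define NUMA_BUILD 0
#endif

With NUMA_BUILD a plain constant 0, the compiler still parses and type-checks
the NUMA-only branches but then eliminates them as dead code.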

Lee

> 
> >  try_next_zone:
> > -		if (NUMA_BUILD && !did_zlc_setup) {
> > -			/* we do zlc_setup after the first zone is tried */
> > +		if (NUMA_BUILD && !did_zlc_setup && num_online_nodes() > 1) {
> > +			/*
> > +			 * we do zlc_setup after the first zone is tried but only
> > +			 * if there are multiple nodes make it worthwhile
> > +			 */
> >  			allowednodes = zlc_setup(zonelist, alloc_flags);
> >  			zlc_active = 1;
> >  			did_zlc_setup = 1;
> > -- 
> > 1.5.6.5
> 


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 16/22] Do not setup zonelist cache when there is only one node
  2009-04-22 20:32       ` Lee Schermerhorn
@ 2009-04-22 20:34         ` David Rientjes
  -1 siblings, 0 replies; 186+ messages in thread
From: David Rientjes @ 2009-04-22 20:34 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, Linux Memory Management List, KOSAKI Motohiro,
	Christoph Lameter, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra, Pekka Enberg,
	Andrew Morton

On Wed, 22 Apr 2009, Lee Schermerhorn wrote:

> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 7f45de1..e59bb80 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1467,8 +1467,11 @@ this_zone_full:
> > >  		if (NUMA_BUILD)
> > >  			zlc_mark_zone_full(zonelist, z);
> > 
> > If zonelist caching is never used for UMA machines, why should they ever 
> > call zlc_mark_zone_full()?  It will always dereference 
> > zonelist->zlcache_ptr and immediately return without doing anything.
> > 
> > Wouldn't it be better to just add
> > 
> > 	if (num_online_nodes() == 1)
> > 		continue;
> > 
> > right before this call to zlc_mark_zone_full()?  This should compile out 
> > the remainder of the loop for !CONFIG_NUMA kernels anyway.
> 
> Shouldn't it already do that?  NUMA_BUILD is defined as 0 when
> !CONFIG_NUMA to avoid #ifdef's in the code while still allowing compiler
> error checking in the dead code.
> 

Yeah, but adding the check on num_online_nodes() also prevents needlessly 
calling zlc_mark_zone_full() on CONFIG_NUMA kernels when running on an UMA 
machine.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 21/22] Use a pre-calculated value instead of num_online_nodes() in fast paths
  2009-04-22 13:53   ` Mel Gorman
@ 2009-04-22 23:04     ` David Rientjes
  -1 siblings, 0 replies; 186+ messages in thread
From: David Rientjes @ 2009-04-22 23:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, 22 Apr 2009, Mel Gorman wrote:

> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index 848025c..474e73e 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -408,6 +408,19 @@ static inline int num_node_state(enum node_states state)
>  #define next_online_node(nid)	next_node((nid), node_states[N_ONLINE])
>  
>  extern int nr_node_ids;
> +extern int nr_online_nodes;
> +
> +static inline void node_set_online(int nid)
> +{
> +	node_set_state(nid, N_ONLINE);
> +	nr_online_nodes = num_node_state(N_ONLINE);
> +}
> +
> +static inline void node_set_offline(int nid)
> +{
> +	node_clear_state(nid, N_ONLINE);
> +	nr_online_nodes = num_node_state(N_ONLINE);
> +}
>  #else
>  
>  static inline int node_state(int node, enum node_states state)

The later #define's of node_set_online() and node_set_offline() in 
include/linux/nodemask.h should probably be removed now.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 16/22] Do not setup zonelist cache when there is only one node
  2009-04-22 20:34         ` David Rientjes
@ 2009-04-23  0:11           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 186+ messages in thread
From: KOSAKI Motohiro @ 2009-04-23  0:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Lee Schermerhorn, Mel Gorman,
	Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> On Wed, 22 Apr 2009, Lee Schermerhorn wrote:
> 
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 7f45de1..e59bb80 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -1467,8 +1467,11 @@ this_zone_full:
> > > >  		if (NUMA_BUILD)
> > > >  			zlc_mark_zone_full(zonelist, z);
> > > 
> > > If zonelist caching is never used for UMA machines, why should they ever 
> > > call zlc_mark_zone_full()?  It will always dereference 
> > > zonelist->zlcache_ptr and immediately return without doing anything.
> > > 
> > > Wouldn't it be better to just add
> > > 
> > > 	if (num_online_nodes() == 1)
> > > 		continue;
> > > 
> > > right before this call to zlc_mark_zone_full()?  This should compile out 
> > > the remainder of the loop for !CONFIG_NUMA kernels anyway.
> > 
> > Shouldn't it already do that?  NUMA_BUILD is defined as 0 when
> > !CONFIG_NUMA to avoid #ifdef's in the code while still allowing compiler
> > error checking in the dead code.
> > 
> 
> Yeah, but adding the check on num_online_nodes() also prevents needlessly 
> calling zlc_mark_zone_full() on CONFIG_NUMA kernels when running on an UMA 
> machine.

I don't like this idea...

In a UMA system, zlc_mark_zone_full() isn't so expensive, but in a large
system adding one more branch is often costly.




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-22 17:30         ` Dave Hansen
@ 2009-04-23  0:13           ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-23  0:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, Apr 22, 2009 at 10:30:15AM -0700, Dave Hansen wrote:
> On Wed, 2009-04-22 at 18:11 +0100, Mel Gorman wrote:
> > On Wed, Apr 22, 2009 at 09:13:11AM -0700, Dave Hansen wrote:
> > > On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > > > No user of the allocator API should be passing in an order >= MAX_ORDER
> > > > but we check for it on each and every allocation. Delete this check and
> > > > make it a VM_BUG_ON check further down the call path.
> > > 
> > > Should we get the check re-added to some of the upper-level functions,
> > > then?  Perhaps __get_free_pages() or things like alloc_pages_exact()? 
> > 
> > I don't think so, no. It just moves the source of the text bloat, and
> > does so only for the few callers that are asking for something that will
> > never succeed.
> 
> Well, it's a matter of figuring out when it can succeed.  Some of this
> stuff, we can figure out at compile-time.  Others are a bit harder.
> 

What do you suggest then? Some sort of constant that tells callers the
maximum size they can request, for those that think they might ever ask
for too much?

Shuffling the check around to other top-level helpers seems pointless to
me because as I said, it just moves text bloat from one place to the
next.

> > > I'm selfishly thinking of what I did in profile_init().  Can I slab
> > > alloc it?  Nope.  Page allocator?  Nope.  Oh, well, try vmalloc():
> > > 
> > >         prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL);
> > >         if (prof_buffer)
> > >                 return 0;
> > > 
> > >         prof_buffer = alloc_pages_exact(buffer_bytes, GFP_KERNEL|__GFP_ZERO);
> > >         if (prof_buffer)
> > >                 return 0;
> > > 
> > >         prof_buffer = vmalloc(buffer_bytes);
> > >         if (prof_buffer)
> > >                 return 0;
> > > 
> > >         free_cpumask_var(prof_cpu_mask);
> > >         return -ENOMEM;
> > > 
> > 
> > Can this ever actually be asking for an order larger than MAX_ORDER
> > though? If so, you're condemning it to always behave poorly.
> 
> Yeah.  It is based on text size.  Smaller kernels with trimmed configs
> and no modules have no problem fitting under MAX_ORDER, as do kernels
> with larger base page sizes.  
> 

It would seem that the right thing to have done here in the first place
then was

if (buffer_bytes > PAGE_SIZE << (MAX_ORDER - 1))
	return vmalloc(...)

kzalloc attempt

alloc_pages_exact attempt
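
Spelled out a little further, a sketch only, reusing the calls from the quoted
profile_init() snippet (the helper name and the up-front size check are the
only new pieces):

static void *prof_buffer_alloc(size_t buffer_bytes)
{
	void *buf;

	/* a request that can never fit under MAX_ORDER goes straight to
	 * vmalloc() instead of failing through the other allocators first */
	if (buffer_bytes > (PAGE_SIZE << (MAX_ORDER - 1)))
		return vmalloc(buffer_bytes);

	buf = kzalloc(buffer_bytes, GFP_KERNEL);
	if (buf)
		return buf;

	buf = alloc_pages_exact(buffer_bytes, GFP_KERNEL | __GFP_ZERO);
	if (buf)
		return buf;

	return vmalloc(buffer_bytes);
}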

> > > Same thing in __kmalloc_section_memmap():
> > > 
> > >         page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
> > >         if (page)
> > >                 goto got_map_page;
> > > 
> > >         ret = vmalloc(memmap_size);
> > >         if (ret)
> > >                 goto got_map_ptr;
> > > 
> > 
> > If I'm reading that right, the order will never be a stupid order. It can fail
> > for higher orders, in which case it falls back to vmalloc().  For example,
> > to hit that limit on a 4K-page kernel with a maximum usable order of 10,
> > the section size would need to be 256MB (assuming a struct page size of
> > 64 bytes). I don't think it's ever that size and, if it were, it would
> > always be sub-optimal, which is a poor choice to make.
> 
> I think the section size default used to be 512M on x86 because we
> concentrate on removing whole DIMMs.  
> 

It was a poor choice then, as their sections always ended up in
vmalloc(), or else it was using the bootmem allocator, in which case it
doesn't matter what the core page allocator was doing.

> > > I depend on the allocator to tell me when I've fed it too high of an
> > > order.  If we really need this, perhaps we should do an audit and then
> > > add a WARN_ON() for a few releases to catch the stragglers.
> > 
> > I consider it buggy to ask for something so large that you always end up
> > with the worst option - vmalloc(). How about leaving it as a VM_BUG_ON
> > to get as many reports as possible on who is depending on this odd
> > behaviour?
> > 
> > If there are users with good reasons, then we could convert this to WARN_ON
> > to fix up the callers. I suspect that the allocator can already cope with
> > receiving a stupid order silently but slowly. It should go all the way to the
> > bottom and just never find anything useful and return NULL.  zone_watermark_ok
> > is the most dangerous looking part but even it should never get to MAX_ORDER
> > because it should always find there are not enough free pages and return
> > before it overruns.
> 
> Whatever we do, I'd agree that it's fine that this is a degenerate case
> that gets handled very slowly and as far out of hot paths as possible.
> Anybody who can fall back to a vmalloc is not doing these things very
> often.
> 

If that's the case, the simplest course might be to just drop the VM_BUG_ON()
in a separate patch, after asserting that it's safe to call into the page
allocator with too large an order, with the consequence that it becomes a
relatively expensive call considering it can never succeed.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 16/22] Do not setup zonelist cache when there is only one node
  2009-04-22 20:24     ` David Rientjes
@ 2009-04-23  0:19       ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-23  0:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, Apr 22, 2009 at 01:24:26PM -0700, David Rientjes wrote:
> On Wed, 22 Apr 2009, Mel Gorman wrote:
> 
> > There is a zonelist cache which is used to track zones that are not in
> > the allowed cpuset or found to be recently full. This is to reduce cache
> > footprint on large machines. On smaller machines, it just incurs cost
> > for no gain. This patch only uses the zonelist cache when there are NUMA
> > nodes.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> > ---
> >  mm/page_alloc.c |    7 +++++--
> >  1 files changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 7f45de1..e59bb80 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1467,8 +1467,11 @@ this_zone_full:
> >  		if (NUMA_BUILD)
> >  			zlc_mark_zone_full(zonelist, z);
> 
> If zonelist caching is never used for UMA machines, why should they ever 
> call zlc_mark_zone_full()?  It will always dereference 
> zonelist->zlcache_ptr and immediately return without doing anything.
> 
> Wouldn't it be better to just add
> 
> 	if (num_online_nodes() == 1)
> 		continue;
> 

num_online_nodes() is actually a really heavy function. It calls hweight
on a bitmap, which is probably why that check isn't happening already.
There is a nr_online_nodes later in the patchset though. With
nr_online_nodes it's a good idea to avoid the function call, so I've taken
note to do that patch in pass 2.
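
For illustration only (not patch code), the cost difference being described:
num_online_nodes() expands to num_node_state(N_ONLINE), which recounts the
online nodemask with a bitmap weight on every call, while nr_online_nodes is
a plain integer kept up to date when nodes go online or offline:

static bool multiple_nodes_slow(void)
{
	return num_online_nodes() > 1;	/* hweight over the whole nodemask */
}

static bool multiple_nodes_fast(void)
{
	return nr_online_nodes > 1;	/* single load of a cached integer */
}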

Thanks

> right before this call to zlc_mark_zone_full()?  This should compile out 
> the remainder of the loop for !CONFIG_NUMA kernels anyway.
> 
> >  try_next_zone:
> > -		if (NUMA_BUILD && !did_zlc_setup) {
> > -			/* we do zlc_setup after the first zone is tried */
> > +		if (NUMA_BUILD && !did_zlc_setup && num_online_nodes() > 1) {
> > +			/*
> > +			 * we do zlc_setup after the first zone is tried but only
> > +			 * if there are multiple nodes make it worthwhile
> > +			 */
> >  			allowednodes = zlc_setup(zonelist, alloc_flags);
> >  			zlc_active = 1;
> >  			did_zlc_setup = 1;
> > -- 
> > 1.5.6.5
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark
  2009-04-22 17:47         ` Dave Hansen
@ 2009-04-23  0:27           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 186+ messages in thread
From: KOSAKI Motohiro @ 2009-04-23  0:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kosaki.motohiro, Mel Gorman, Linux Memory Management List,
	Christoph Lameter, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra, Pekka Enberg,
	Andrew Morton

> On Wed, 2009-04-22 at 18:14 +0100, Mel Gorman wrote:
> > Preference of taste really. When I started a conversion to accessors, it
> > changed something recognised to something new that looked uglier to me.
> > Only one place cares about the union enough to access it via an array, so
> > why spread it everywhere?
> 
> Personally, I'd say for consistency.  Someone looking at both forms
> wouldn't necessarily know that they refer to the same variables unless
> they know about the union.

Just for clarification...

AFAIK, the C language specification doesn't guarantee that the two forms
refer to the same storage. The compiler may insert padding between struct
members, but it may not insert padding into an array.

However, no gcc version actually does that, I think, though perhaps I
missed some minor gcc release.


So I also like Dave's idea, but that's only a personal feeling.
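
The kind of overlay being debated, sketched with illustrative names rather
than the actual struct zone layout:

union watermarks_sketch {
	struct {
		unsigned long min;
		unsigned long low;
		unsigned long high;
	};
	unsigned long marks[3];
};

The concern is that the standard permits padding between the three named
members but not inside marks[], so the two views are only guaranteed to line
up by how gcc actually lays the struct out.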




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark
  2009-04-22 20:06     ` David Rientjes
@ 2009-04-23  0:29       ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-23  0:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, Apr 22, 2009 at 01:06:07PM -0700, David Rientjes wrote:
> On Wed, 22 Apr 2009, Mel Gorman wrote:
> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index b174f2c..6030f49 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1154,10 +1154,15 @@ failed:
> >  	return NULL;
> >  }
> >  
> > -#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
> > -#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
> > -#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
> > -#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
> > +/* The WMARK bits are used as an index zone->pages_mark */
> > +#define ALLOC_WMARK_MIN		0x00 /* use pages_min watermark */
> > +#define ALLOC_WMARK_LOW		0x01 /* use pages_low watermark */
> > +#define ALLOC_WMARK_HIGH	0x02 /* use pages_high watermark */
> > +#define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */
> > +
> > +/* Mask to get the watermark bits */
> > +#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
> > +
> >  #define ALLOC_HARDER		0x10 /* try to alloc harder */
> >  #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
> >  #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
> 
> The watermark flags should probably be members of an anonymous enum since 
> they're being used as an index into an array.  If another watermark were 
> ever to be added it would require a value of 0x03, for instance.
> 
> 	enum {
> 		ALLOC_WMARK_MIN,
> 		ALLOC_WMARK_LOW,
> 		ALLOC_WMARK_HIGH,
> 
> 		ALLOC_WMARK_MASK = 0xf	/* no more than 16 possible watermarks */
> 	};
> 
> This eliminates ALLOC_NO_WATERMARKS and the caller that uses it would 
> simply pass 0.
> 

I'm missing something here. If ALLOC_NO_WATERMARKS was defined as zero
then things like this break.

        if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
                if (!in_interrupt() &&
                    ((p->flags & PF_MEMALLOC) ||
                     unlikely(test_thread_flag(TIF_MEMDIE))))
                        alloc_flags |= ALLOC_NO_WATERMARKS;
        }

Also, the ALLOC_HARDER and other alloc flags need to be redefined for
ALLOC_WMARK_MASK == 0xf. I know what you are getting at but it's a bit more
involved than you're making out and I'm not seeing an advantage.

> > @@ -1445,12 +1450,7 @@ zonelist_scan:
> >  
> >  		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
> 
> This would become
> 
> 	if (alloc_flags & ALLOC_WMARK_MASK)
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 21/22] Use a pre-calculated value instead of num_online_nodes() in fast paths
  2009-04-22 23:04     ` David Rientjes
@ 2009-04-23  0:44       ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-23  0:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, Apr 22, 2009 at 04:04:47PM -0700, David Rientjes wrote:
> On Wed, 22 Apr 2009, Mel Gorman wrote:
> 
> > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> > index 848025c..474e73e 100644
> > --- a/include/linux/nodemask.h
> > +++ b/include/linux/nodemask.h
> > @@ -408,6 +408,19 @@ static inline int num_node_state(enum node_states state)
> >  #define next_online_node(nid)	next_node((nid), node_states[N_ONLINE])
> >  
> >  extern int nr_node_ids;
> > +extern int nr_online_nodes;
> > +
> > +static inline void node_set_online(int nid)
> > +{
> > +	node_set_state(nid, N_ONLINE);
> > +	nr_online_nodes = num_node_state(N_ONLINE);
> > +}
> > +
> > +static inline void node_set_offline(int nid)
> > +{
> > +	node_clear_state(nid, N_ONLINE);
> > +	nr_online_nodes = num_node_state(N_ONLINE);
> > +}
> >  #else
> >  
> >  static inline int node_state(int node, enum node_states state)
> 
> The later #define's of node_set_online() and node_set_offline() in 
> include/linux/nodemask.h should probably be removed now.
> 

You'd think, but you can enable memory hotplug without NUMA and
node_set_online() is called when adding memory. Even though those
functions are nops on !NUMA, they're necessary.
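
For anyone skimming, here is a rough userspace model (names and types are
illustrative, not the kernel API) of the caching idea behind nr_online_nodes:
the count is refreshed only when node state changes, so fast paths read a
plain integer instead of recounting bits on every call.

#include <stdio.h>

static unsigned long online_map;	/* stand-in for node_states[N_ONLINE] */
static int nr_online_nodes;		/* cached count, refreshed on hotplug */

/* What the fast paths used to do: a bitmap weight on every call. */
static int count_online_nodes(void)
{
	return __builtin_popcountl(online_map);
}

static void node_set_online(int nid)
{
	online_map |= 1UL << nid;
	nr_online_nodes = count_online_nodes();
}

int main(void)
{
	node_set_online(0);
	node_set_online(2);
	/* Hot path: a single load instead of a popcount per allocation. */
	printf("%d online nodes (cached %d)\n",
	       count_online_nodes(), nr_online_nodes);
	return 0;
}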

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-23  0:13           ` Mel Gorman
@ 2009-04-23  1:34             ` Dave Hansen
  -1 siblings, 0 replies; 186+ messages in thread
From: Dave Hansen @ 2009-04-23  1:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Thu, 2009-04-23 at 01:13 +0100, Mel Gorman wrote:
> On Wed, Apr 22, 2009 at 10:30:15AM -0700, Dave Hansen wrote:
> > On Wed, 2009-04-22 at 18:11 +0100, Mel Gorman wrote:
> > > On Wed, Apr 22, 2009 at 09:13:11AM -0700, Dave Hansen wrote:
> > > > On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > > > > No user of the allocator API should be passing in an order >= MAX_ORDER
> > > > > but we check for it on each and every allocation. Delete this check and
> > > > > make it a VM_BUG_ON check further down the call path.
> > > > 
> > > > Should we get the check re-added to some of the upper-level functions,
> > > > then?  Perhaps __get_free_pages() or things like alloc_pages_exact()? 
> > > 
> > > I don't think so, no. It just moves the source of the text bloat and
> > > for the few callers that are asking for something that will never
> > > succeed.
> > 
> > Well, it's a matter of figuring out when it can succeed.  Some of this
> > stuff, we can figure out at compile-time.  Others are a bit harder.
> > 
> 
> What do you suggest then? Some sort of constant that tells you the
> maximum size you can call for callers that think they might ever request
> too much?
> 
> Shuffling the check around to other top-level helpers seems pointless to
> me because as I said, it just moves text bloat from one place to the
> next.

Do any of the actual fast paths pass 'order' in as a real variable?  If
not, the compiler should be able to just take care of it.  From a quick
scan, it does appear that at least a third of the direct alloc_pages()
users pass an explicit '0'.  That should get optimized away
*immediately*.

> > > > I'm selfishly thinking of what I did in profile_init().  Can I slab
> > > > alloc it?  Nope.  Page allocator?  Nope.  Oh, well, try vmalloc():
> > > > 
> > > >         prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL);
> > > >         if (prof_buffer)
> > > >                 return 0;
> > > > 
> > > >         prof_buffer = alloc_pages_exact(buffer_bytes, GFP_KERNEL|__GFP_ZERO);
> > > >         if (prof_buffer)
> > > >                 return 0;
> > > > 
> > > >         prof_buffer = vmalloc(buffer_bytes);
> > > >         if (prof_buffer)
> > > >                 return 0;
> > > > 
> > > >         free_cpumask_var(prof_cpu_mask);
> > > >         return -ENOMEM;
> > > > 
> > > 
> > > Can this ever actually be asking for an order larger than MAX_ORDER
> > > though? If so, you're condemning it to always behave poorly.
> > 
> > Yeah.  It is based on text size.  Smaller kernels with trimmed configs
> > and no modules have no problem fitting under MAX_ORDER, as do kernels
> > with larger base page sizes.  
> > 
> 
> It would seem that the right thing to have done here in the first place
> then was
> 
> if (buffer_bytes > PAGE_SIZE << (MAX_ORDER-1))
> 	return vmalloc(...)
> 
> kzalloc attempt
> 
> alloc_pages_exact attempt

Yeah, but honestly, I don't expect most users to get that
"(buffer_bytes > PAGE_SIZE << (MAX_ORDER-1))" right.  It seems like
*exactly* the kind of thing we should be wrapping up in common code.

Perhaps we do need an alloc_pages_nocheck() for the users that do have a
true non-compile-time-constant 'order' and still know they don't need
the check.
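
A rough sketch (userspace C with assumed PAGE_SHIFT/MAX_ORDER values; the
helper name is made up, not from the patchset) of the kind of common-code
size check being suggested, so a caller like profile_init() could go
straight to vmalloc() when the request can never fit:

#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT	12			/* assumed 4K pages */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define MAX_ORDER	11			/* assumed, as on x86 */

/* True if 'bytes' is within what the page allocator could ever satisfy. */
static bool fits_in_page_allocator(size_t bytes)
{
	return bytes <= (PAGE_SIZE << (MAX_ORDER - 1));
}

A caller with a text-size-dependent buffer would then do the vmalloc()
fallback up front instead of relying on the allocator to reject the order.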

> > > > Same thing in __kmalloc_section_memmap():
> > > > 
> > > >         page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
> > > >         if (page)
> > > >                 goto got_map_page;
> > > > 
> > > >         ret = vmalloc(memmap_size);
> > > >         if (ret)
> > > >                 goto got_map_ptr;
> > > > 
> > > 
> > > If I'm reading that right, the order will never be a stupid order. It can fail
> > > for higher orders in which case it falls back to vmalloc() .  For example,
> > > to hit that limit, the section size for a 4K kernel, maximum usable order
> > > of 10, the section size would need to be 256MB (assuming struct page size
> > > of 64 bytes). I don't think it's ever that size and if so, it'll always be
> > > sub-optimal which is a poor choice to make.
> > 
> > I think the section size default used to be 512M on x86 because we
> > concentrate on removing whole DIMMs.  
> > 
> 
> It was a poor choice then as their sections always ended up in
> vmalloc() or else it was using the bootmem allocator in which case it
> doesn't matter what the core page allocator was doing.

True, but we tried to code that sucker to work anywhere and to be as
optimal as possible (which vmalloc() is not) when we could.

> > > > I depend on the allocator to tell me when I've fed it too high of an
> > > > order.  If we really need this, perhaps we should do an audit and then
> > > > add a WARN_ON() for a few releases to catch the stragglers.
> > > 
> > > I consider it buggy to ask for something so large that you always end up
> > > with the worst option - vmalloc(). How about leaving it as a VM_BUG_ON
> > > to get as many reports as possible on who is depending on this odd
> > > behaviour?
> > > 
> > > If there are users with good reasons, then we could convert this to WARN_ON
> > > to fix up the callers. I suspect that the allocator can already cope with
> > > receiving a stupid order silently but slowly. It should go all the way to the
> > > bottom and just never find anything useful and return NULL.  zone_watermark_ok
> > > is the most dangerous looking part but even it should never get to MAX_ORDER
> > > because it should always find there are not enough free pages and return
> > > before it overruns.
> > 
> > Whatever we do, I'd agree that it's fine that this is a degenerate case
> > that gets handled very slowly and as far out of hot paths as possible.
> > Anybody who can fall back to a vmalloc is not doing these things very
> > often.
> 
> If that's the case, the simplest course might be to just drop the VM_BUG_ON()
> as a separate patch after asserting it's safe to call into the page
> allocator with too large an order with the consequence of it being a
> relatively expensive call considering it can never succeed.

__rmqueue_smallest() seems to do the right thing and it is awfully deep
in the allocator.

How about this:  I'll go and audit the use of order in page_alloc.c to
make sure that having an order>MAX_ORDER-1 floating around is OK and
won't break anything.  I'll also go and see what the actual .text size
changes are from this patch both for alloc_pages() and
alloc_pages_node() separately to make sure what we're dealing with here.
Does this check even *exist* in the optimized code very often?  

Deal? :)

-- Dave


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-22 20:11         ` David Rientjes
@ 2009-04-23  7:44           ` Pekka Enberg
  -1 siblings, 0 replies; 186+ messages in thread
From: Pekka Enberg @ 2009-04-23  7:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Dave Hansen, Linux Memory Management List,
	KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Wed, 2009-04-22 at 13:11 -0700, David Rientjes wrote:
> On Wed, 22 Apr 2009, Mel Gorman wrote:
> 
> > If there are users with good reasons, then we could convert this to WARN_ON
> > to fix up the callers. I suspect that the allocator can already cope with
> > receiving a stupid order silently but slowly. It should go all the way to the
> > bottom and just never find anything useful and return NULL.  zone_watermark_ok
> > is the most dangerous looking part but even it should never get to MAX_ORDER
> > because it should always find there are not enough free pages and return
> > before it overruns.

> slub: enforce MAX_ORDER
> 
> slub_max_order may not be equal to or greater than MAX_ORDER.
> 
> Additionally, if a single object cannot be placed in a slab of
> slub_max_order, it still must allocate slabs below MAX_ORDER.
> 
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Signed-off-by: David Rientjes <rientjes@google.com>

Applied, thanks!


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-23  1:34             ` Dave Hansen
@ 2009-04-23  9:58               ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-23  9:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, Apr 22, 2009 at 06:34:07PM -0700, Dave Hansen wrote:
> On Thu, 2009-04-23 at 01:13 +0100, Mel Gorman wrote:
> > On Wed, Apr 22, 2009 at 10:30:15AM -0700, Dave Hansen wrote:
> > > On Wed, 2009-04-22 at 18:11 +0100, Mel Gorman wrote:
> > > > On Wed, Apr 22, 2009 at 09:13:11AM -0700, Dave Hansen wrote:
> > > > > On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > > > > > No user of the allocator API should be passing in an order >= MAX_ORDER
> > > > > > but we check for it on each and every allocation. Delete this check and
> > > > > > make it a VM_BUG_ON check further down the call path.
> > > > > 
> > > > > Should we get the check re-added to some of the upper-level functions,
> > > > > then?  Perhaps __get_free_pages() or things like alloc_pages_exact()? 
> > > > 
> > > > I don't think so, no. It just moves the source of the text bloat and
> > > > for the few callers that are asking for something that will never
> > > > succeed.
> > > 
> > > Well, it's a matter of figuring out when it can succeed.  Some of this
> > > stuff, we can figure out at compile-time.  Others are a bit harder.
> > > 
> > 
> > What do you suggest then? Some sort of constant that tells you the
> > maximum size you can call for callers that think they might ever request
> > too much?
> > 
> > Shuffling the check around to other top-level helpers seems pointless to
> > me because as I said, it just moves text bloat from one place to the
> > next.
> 
> Do any of the actual fast paths pass 'order' in as a real variable?  If
> not, the compiler should be able to just take care of it.  From a quick
> scan, it does appear that at least a third of the direct alloc_pages()
> users pass an explicit '0'.  That should get optimized away
> *immediately*.
> 

You'd think so, but here are the vmlinux figures:

3355718 0001-Replace-__alloc_pages_internal-with-__alloc_pages_.patch
3355622 0002-Do-not-sanity-check-order-in-the-fast-path.patch

That's a mostly-modules kernel configuration and it's still making a
difference to the text size. Enough to care about, and one of the
things I was trying to keep in mind was branches in the fast paths.

> > > > > I'm selfishly thinking of what I did in profile_init().  Can I slab
> > > > > alloc it?  Nope.  Page allocator?  Nope.  Oh, well, try vmalloc():
> > > > > 
> > > > >         prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL);
> > > > >         if (prof_buffer)
> > > > >                 return 0;
> > > > > 
> > > > >         prof_buffer = alloc_pages_exact(buffer_bytes, GFP_KERNEL|__GFP_ZERO);
> > > > >         if (prof_buffer)
> > > > >                 return 0;
> > > > > 
> > > > >         prof_buffer = vmalloc(buffer_bytes);
> > > > >         if (prof_buffer)
> > > > >                 return 0;
> > > > > 
> > > > >         free_cpumask_var(prof_cpu_mask);
> > > > >         return -ENOMEM;
> > > > > 
> > > > 
> > > > Can this ever actually be asking for an order larger than MAX_ORDER
> > > > though? If so, you're condemning it to always behave poorly.
> > > 
> > > Yeah.  It is based on text size.  Smaller kernels with trimmed configs
> > > and no modules have no problem fitting under MAX_ORDER, as do kernels
> > > with larger base page sizes.  
> > > 
> > 
> > It would seem that the right thing to have done here in the first place
> > then was
> > 
> > if (buffer_bytes > PAGE_SIZE << (MAX_ORDER-1))
> > 	return vmalloc(...)
> > 
> > kzalloc attempt
> > 
> > alloc_pages_exact attempt
> 
> Yeah, but honestly, I don't expect most users to get that
> "(buffer_bytes > PAGE_SIZE << (MAX_ORDER-1))" right.  It seems like
> *exactly* the kind of thing we should be wrapping up in common code.
> 

No, I wouldn't expect them to get it right. I'd expect some sort of
helper to exist all right if there were enough of these callers to be
fixed up.

> Perhaps we do need an alloc_pages_nocheck() for the users that do have a
> true non-compile-time-constant 'order' and still know they don't need
> the check.
> 

I'd do the opposite if we were doing this - alloc_pages_checksize() - because
there are far more callers allocating sensible sizes than the opposite. That
said, making sure the allocator can handle a busted order (even if we call all
the way down to __rmqueue_smallest) seems the best compromise between not
having checks high in the fast path and callers that want to chance their arm
calling in and falling back to vmalloc() if it doesn't work out.
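
A hypothetical shape for the alloc_pages_checksize() idea floated above -
purely a sketch of the trade-off, not anything from the patchset: callers
with a variable, possibly-oversized order pay for the check, everyone else
keeps the unchecked fast path.

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Sketch only: callers that know their order might be out of range use this
 * wrapper; everyone else keeps calling alloc_pages() with no extra branch.
 */
static inline struct page *alloc_pages_checksize(gfp_t gfp_mask,
						 unsigned int order)
{
	if (order >= MAX_ORDER)
		return NULL;
	return alloc_pages(gfp_mask, order);
}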

> > > > > Same thing in __kmalloc_section_memmap():
> > > > > 
> > > > >         page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
> > > > >         if (page)
> > > > >                 goto got_map_page;
> > > > > 
> > > > >         ret = vmalloc(memmap_size);
> > > > >         if (ret)
> > > > >                 goto got_map_ptr;
> > > > > 
> > > > 
> > > > If I'm reading that right, the order will never be a stupid order. It can fail
> > > > for higher orders in which case it falls back to vmalloc() .  For example,
> > > > to hit that limit, the section size for a 4K kernel, maximum usable order
> > > > of 10, the section size would need to be 256MB (assuming struct page size
> > > > of 64 bytes). I don't think it's ever that size and if so, it'll always be
> > > > sub-optimal which is a poor choice to make.
> > > 
> > > I think the section size default used to be 512M on x86 because we
> > > concentrate on removing whole DIMMs.  
> > > 
> > 
> > It was a poor choice then as their sections always ended up in
> > vmalloc() or else it was using the bootmem allocator in which case it
> > doesn't matter what the core page allocator was doing.
> 
> True, but we tried to code that sucker to work anywhere and to be as
> optimal as possible (which vmalloc() is not) when we could.
> 

Ok, seems fair.

> > > > > I depend on the allocator to tell me when I've fed it too high of an
> > > > > order.  If we really need this, perhaps we should do an audit and then
> > > > > add a WARN_ON() for a few releases to catch the stragglers.
> > > > 
> > > > I consider it buggy to ask for something so large that you always end up
> > > > with the worst option - vmalloc(). How about leaving it as a VM_BUG_ON
> > > > to get as many reports as possible on who is depending on this odd
> > > > behaviour?
> > > > 
> > > > If there are users with good reasons, then we could convert this to WARN_ON
> > > > to fix up the callers. I suspect that the allocator can already cope with
> > > > receiving a stupid order silently but slowly. It should go all the way to the
> > > > bottom and just never find anything useful and return NULL.  zone_watermark_ok
> > > > is the most dangerous looking part but even it should never get to MAX_ORDER
> > > > because it should always find there are not enough free pages and return
> > > > before it overruns.
> > > 
> > > Whatever we do, I'd agree that it's fine that this is a degenerate case
> > > that gets handled very slowly and as far out of hot paths as possible.
> > > Anybody who can fall back to a vmalloc is not doing these things very
> > > often.
> > 
> > If that's the case, the simplest course might be to just drop the VM_BUG_ON()
> > as a separate patch after asserting it's safe to call into the page
> > allocator with too large an order with the consequence of it being a
> > relatively expensive call considering it can never succeed.
> 
> __rmqueue_smallest() seems to do the right thing and it is awfully deep
> in the allocator.
> 

And I believe zone_watermark_ok() does the right thing as well by
deciding that there are not enough free pages before it overflows the
buffer.

> How about this:  I'll go and audit the use of order in page_alloc.c to
> make sure that having an order>MAX_ORDER-1 floating around is OK and
> won't break anything. 

Great. Right now, I think it's ok but I haven't audited for this
explicitly and a second set of eyes never hurts.

> I'll also go and see what the actual .text size
> changes are from this patch both for alloc_pages() and
> alloc_pages_node() separately to make sure what we're dealing with here.

It's .config specific of course, but check the leader where I posted vmlinux
sizes for each patch. scripts/bloat-o-meter by Matt Mackall also rules most
mightily for identifying where text increased or decreased between patches
in case you're not aware of it already.

> Does this check even *exist* in the optimized code very often?  
> 

Enough that it shrunk text size on my .config anyway.

> Deal? :)
> 

Deal

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark
  2009-04-23  0:27           ` KOSAKI Motohiro
@ 2009-04-23 10:03             ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-23 10:03 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Dave Hansen, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Thu, Apr 23, 2009 at 09:27:15AM +0900, KOSAKI Motohiro wrote:
> > On Wed, 2009-04-22 at 18:14 +0100, Mel Gorman wrote:
> > > Preference of taste really. When I started a conversion to accessors, it
> > > changed something recognised to something new that looked uglier to me.
> > > Only one place cares about the union enough to access it via an array,
> > > so why spread it everywhere.
> > 
> > Personally, I'd say for consistency.  Someone looking at both forms
> > wouldn't necessarily know that they refer to the same variables unless
> > they know about the union.
> 
> Just for clarification...
> 
> AFAIK, the C language specification doesn't guarantee that they refer to
> the same storage. The compiler can insert padding between struct members,
> but not into an array.
> 

Considering that the struct members and the array elements are all the same
type, I didn't think padding would ever be a problem.
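
As a standalone illustration of the assumption being debated: if a compiler
ever did insert padding between the same-typed members, a build-time check
along these lines would catch it (the field names are just placeholders).

#include <assert.h>

struct wmarks {
	union {
		struct {
			unsigned long min;
			unsigned long low;
			unsigned long high;
		};
		unsigned long mark[3];	/* indexed by the ALLOC_WMARK_* value */
	};
};

/* Any padding between min/low/high would make the union bigger than the array. */
static_assert(sizeof(struct wmarks) == sizeof(unsigned long[3]),
	      "unexpected padding between watermark members");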

> However, no gcc version actually does that, I think, though perhaps I
> missed some minor gcc release.
> 
> So I also like Dave's idea, but it's only a personal feeling.
> 

The tide is against me on this one :).

How about I roll a patch on top of this set that replaces the union and
converts all the call sites? I figure that patch will go through a few
revisions before people are happy with the final API. However, as the patch
wouldn't change functionality, I'd like to see this series get wider testing
if possible. The replace-union-with-single-array patch can easily be folded
in when it settles.
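
For reference, a hedged sketch of the two shapes being debated; all struct,
field and macro names here are assumptions drawn from the discussion, not
the actual patch.

/*
 * Roughly the shape the series appears to use: named fields overlaid with
 * an array so the one hot path can index by watermark.
 */
struct zone_wmarks_union {
	union {
		struct {
			unsigned long pages_min;
			unsigned long pages_low;
			unsigned long pages_high;
		};
		unsigned long pages_mark[3];	/* indexed by ALLOC_WMARK_* */
	};
};

/*
 * Shape being argued for: a single array everywhere, with accessors for
 * readability at the existing call sites.
 */
struct zone_wmarks_array {
	unsigned long pages_mark[3];
};

#define min_wmark_pages(z)	((z)->pages_mark[0])	/* ALLOC_WMARK_MIN */
#define low_wmark_pages(z)	((z)->pages_mark[1])	/* ALLOC_WMARK_LOW */
#define high_wmark_pages(z)	((z)->pages_mark[2])	/* ALLOC_WMARK_HIGH */

Either way the watermark check stays an array index; the argument is only
about whether the named fields survive.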

Sound like a plan?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-23  9:58               ` Mel Gorman
@ 2009-04-23 17:36                 ` Dave Hansen
  -1 siblings, 0 replies; 186+ messages in thread
From: Dave Hansen @ 2009-04-23 17:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Thu, 2009-04-23 at 10:58 +0100, Mel Gorman wrote:
> > How about this:  I'll go and audit the use of order in page_alloc.c to
> > make sure that having an order>MAX_ORDER-1 floating around is OK and
> > won't break anything. 
> 
> Great. Right now, I think it's ok but I haven't audited for this
> explicily and a second set of eyes never hurts.

OK, after looking through this, I have a couple of ideas.  One is that
we do the MAX_ORDER check in __alloc_pages_internal(), but *after* the
first call to get_page_from_freelist().  That's because I'm worried about
what happens if we ever get into the reclaim code with an 'order' >= MAX_ORDER.
Such as:

void wakeup_kswapd(struct zone *zone, int order)
{
...
        if (pgdat->kswapd_max_order < order)
                pgdat->kswapd_max_order = order;
        if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
                return;
        if (!waitqueue_active(&pgdat->kswapd_wait))
                return;
        wake_up_interruptible(&pgdat->kswapd_wait);
}

unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                gfp_t gfp_mask, nodemask_t *nodemask)
{
        struct scan_control sc = {
...
                .order = order,
                .mem_cgroup = NULL,
                .isolate_pages = isolate_pages_global,
                .nodemask = nodemask,
        };

        return do_try_to_free_pages(zonelist, &sc);
}

This will keep us only checking 'order' once for each
__alloc_pages_internal() call.  It is an extra branch, but it is out of
the really, really hot path since we're about to start reclaim here
anyway.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2f2699..1e3a01e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1498,6 +1498,13 @@ restart:
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
+	/*
+	 * We're out of the rocket-hot area above, so do a quick sanity
+	 * check.  We do this here to avoid ever trying to do any reclaim
+	 * of >=MAX_ORDER areas which can never succeed, of course.
+	 */
+	if (order >= MAX_ORDER)
+		goto nopage;
 
 	/*
 	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and


-- Dave


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-23  1:34             ` Dave Hansen
@ 2009-04-23 19:26               ` Dave Hansen
  -1 siblings, 0 replies; 186+ messages in thread
From: Dave Hansen @ 2009-04-23 19:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, 2009-04-22 at 18:34 -0700, Dave Hansen wrote:
> I'll also go and see what the actual .text size
> changes are from this patch both for alloc_pages() and
> alloc_pages_node() separately to make sure what we're dealing with
> here.  Does this check even *exist* in the optimized code very
> often?  

While this isn't definitive by any means, I did get some interesting
results.  Pulling the check out of alloc_pages() had no effect at *all*
on text size because I'm testing with CONFIG_NUMA=n.

$ size i386-T41-laptop.{0,1}/vmlinux
   text	   data	    bss	    dec	    hex	filename
4348625	 286560	 860160	5495345	 53da31	i386-T41-laptop.0/vmlinux
4348625	 286560	 860160	5495345	 53da31	i386-T41-laptop.1/vmlinux

We get a slightly different result when pulling the check out of
alloc_pages_node():

$ size i386-T41-laptop.{1,2}/vmlinux
   text	   data	    bss	    dec	    hex	filename
4348625	 286560	 860160	5495345	 53da31	i386-T41-laptop.1/vmlinux
4348601	 286560	 860160	5495321	 53da19	i386-T41-laptop.2/vmlinux

$ bloat-o-meter i386-T41-laptop.1/vmlinux i386-T41-laptop.2/vmlinux 
add/remove: 0/0 grow/shrink: 9/7 up/down: 78/-107 (-29)
function                                     old     new   delta
__get_user_pages                             717     751     +34
st_read                                     1936    1944      +8
shmem_truncate_range                        1660    1667      +7
pci_create_slot                              410     417      +7
sg_build_indirect                            449     455      +6
n_tty_read                                  1336    1342      +6
find_vma_prepare                             103     108      +5
as_update_iohist                             617     621      +4
ntfs_readdir                                3426    3427      +1
enlarge_buffer                               343     341      -2
__get_free_pages                              36      33      -3
dma_generic_alloc_coherent                   207     202      -5
mempool_alloc_pages                           33      17     -16
futex_lock_pi                               2120    2104     -16
kallsyms_lookup_name                         102      82     -20
cache_alloc_refill                          1171    1126     -45

I'm going to retry this with a NUMA config.  

-- Dave


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 21/22] Use a pre-calculated value instead of num_online_nodes() in fast paths
  2009-04-23  0:44       ` Mel Gorman
@ 2009-04-23 19:29         ` David Rientjes
  -1 siblings, 0 replies; 186+ messages in thread
From: David Rientjes @ 2009-04-23 19:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Thu, 23 Apr 2009, Mel Gorman wrote:

> > > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> > > index 848025c..474e73e 100644
> > > --- a/include/linux/nodemask.h
> > > +++ b/include/linux/nodemask.h
> > > @@ -408,6 +408,19 @@ static inline int num_node_state(enum node_states state)
> > >  #define next_online_node(nid)	next_node((nid), node_states[N_ONLINE])
> > >  
> > >  extern int nr_node_ids;
> > > +extern int nr_online_nodes;
> > > +
> > > +static inline void node_set_online(int nid)
> > > +{
> > > +	node_set_state(nid, N_ONLINE);
> > > +	nr_online_nodes = num_node_state(N_ONLINE);
> > > +}
> > > +
> > > +static inline void node_set_offline(int nid)
> > > +{
> > > +	node_clear_state(nid, N_ONLINE);
> > > +	nr_online_nodes = num_node_state(N_ONLINE);
> > > +}
> > >  #else
> > >  
> > >  static inline int node_state(int node, enum node_states state)
> > 
> > The later #define's of node_set_online() and node_set_offline() in 
> > include/linux/nodemask.h should probably be removed now.
> > 
> 
> You'd think, but you can enable memory hotplug without NUMA and
> node_set_online() is called when adding memory. Even though those
> functions are nops on !NUMA, they're necessary.
> 

The problem is that your new functions above are never used because 
node_set_online and node_set_offline are macro defined later in this 
header for all cases, not just !CONFIG_NUMA.
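
As a rough, self-contained illustration of that shadowing (a made-up
userspace toy, not code from nodemask.h): once a function-like macro of
the same name is defined later in the header, every caller compiled after
that point expands the macro, and the earlier static inline is silently
never used.

#include <stdio.h>

static inline void node_set_online(int nid)
{
	/* the nr_online_nodes update would live here */
	printf("inline version, nid=%d\n", nid);
}

/* ... later in the same header ... */
#define node_set_online(nid)	printf("macro version, nid=%d\n", (nid))

int main(void)
{
	node_set_online(3);	/* expands the macro, prints "macro version, nid=3" */
	return 0;
}

Which is why the nr_online_nodes bookkeeping in the new inlines never
runs without the fix below.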

You need this.
---
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -448,6 +448,9 @@ static inline int num_node_state(enum node_states state)
 #define next_online_node(nid)	(MAX_NUMNODES)
 #define nr_node_ids		1
 #define nr_online_nodes		1
+
+#define node_set_online(node)	   node_set_state((node), N_ONLINE)
+#define node_set_offline(node)	   node_clear_state((node), N_ONLINE)
 #endif
 
 #define node_online_map 	node_states[N_ONLINE]
@@ -467,9 +470,6 @@ static inline int num_node_state(enum node_states state)
 #define node_online(node)	node_state((node), N_ONLINE)
 #define node_possible(node)	node_state((node), N_POSSIBLE)
 
-#define node_set_online(node)	   node_set_state((node), N_ONLINE)
-#define node_set_offline(node)	   node_clear_state((node), N_ONLINE)
-
 #define for_each_node(node)	   for_each_node_state(node, N_POSSIBLE)
 #define for_each_online_node(node) for_each_node_state(node, N_ONLINE)
 

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-23 19:26               ` Dave Hansen
@ 2009-04-23 19:45                 ` Dave Hansen
  -1 siblings, 0 replies; 186+ messages in thread
From: Dave Hansen @ 2009-04-23 19:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Thu, 2009-04-23 at 12:26 -0700, Dave Hansen wrote:
> I'm going to retry this with a NUMA config.  

Oddly enough, removing the alloc_pages() check didn't do anything to
text on a NUMA=y config, either: 

dave@kernel:~/work/mhp-build$ size i386-numaq-sparse.?/vmlinux
   text	   data	    bss	    dec	    hex	filename
3460792	 449336	1626112	5536240	 5479f0	i386-numaq-sparse.0/vmlinux
3460792	 449336	1626112	5536240	 5479f0	i386-numaq-sparse.1/vmlinux
3460776	 449336	1626112	5536224	 5479e0	i386-numaq-sparse.2/vmlinux

Here's the bloat-o-meter output from removing the alloc_pages_node() check:

dave@kernel:~/work/mhp-build$ ../linux-2.6.git/scripts/bloat-o-meter
i386-numaq-sparse.{1,2}/vmlinux 
add/remove: 0/0 grow/shrink: 9/16 up/down: 81/-99 (-18)
function                                     old     new   delta
st_int_ioctl                                2600    2624     +24
tcp_sendmsg                                 2153    2169     +16
diskstats_show                               739     753     +14
iov_shorten                                   49      58      +9
unmap_vmas                                  1653    1661      +8
sg_build_indirect                            449     455      +6
ahc_linux_biosparam                          251     253      +2
nlmclnt_call                                 557     558      +1
do_mount                                    1533    1534      +1
skb_icv_walk                                 420     419      -1
nfs_readpage_truncate_uninitialised_page     288     287      -1
find_first_zero_bit                           67      66      -1
enlarge_buffer                               339     337      -2
ahc_parse_msg                               2340    2338      -2
__get_free_pages                              36      33      -3
flock_lock_file_wait                         484     480      -4
find_vma_prepare                             108     103      -5
arp_ignore                                   104      99      -5
__udp4_lib_err                               312     307      -5
alloc_buddy_huge_page                        170     163      -7
pbus_size_mem                                907     899      -8
dma_generic_alloc_coherent                   245     236      -9
cache_alloc_refill                          1168    1158     -10
mempool_alloc_pages                           33      17     -16
do_mremap                                   1208    1188     -20



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-22 17:11       ` Mel Gorman
@ 2009-04-23 22:44         ` Andrew Morton
  -1 siblings, 0 replies; 186+ messages in thread
From: Andrew Morton @ 2009-04-23 22:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: dave, linux-mm, kosaki.motohiro, cl, npiggin, linux-kernel,
	ming.m.lin, yanmin_zhang, peterz, penberg

On Wed, 22 Apr 2009 18:11:51 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> > I depend on the allocator to tell me when I've fed it too high of an
> > order.  If we really need this, perhaps we should do an audit and then
> > add a WARN_ON() for a few releases to catch the stragglers.
> > 
> 
> I consider it buggy to ask for something so large that you always end up
> with the worst option - vmalloc().

Nevertheless, it's a pretty common pattern for initialisation code all
over the kernel to do

	while (allocate(huge_amount) == NULL)
		huge_amount /= 2;

and the proposed change will convert that from "works" to "either goes
BUG or mysteriously overindexes zone->free_area[] in
__rmqueue_smallest()", the latter of which is really nasty.

> How about leaving it as a VM_BUG_ON
> to get as many reports as possible on who is depending on this odd
> behaviour?

That would be quite disruptive.  Even emitting a trace for each call
would be irritating.  How's about this:

--- a/mm/page_alloc.c~page-allocator-do-not-sanity-check-order-in-the-fast-path-fix
+++ a/mm/page_alloc.c
@@ -1405,7 +1405,8 @@ get_page_from_freelist(gfp_t gfp_mask, n
 
 	classzone_idx = zone_idx(preferred_zone);
 
-	VM_BUG_ON(order >= MAX_ORDER);
+	if (WARN_ON_ONCE(order >= MAX_ORDER))
+		return NULL;
 
 zonelist_scan:
 	/*
_


and then we revisit later?

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/22] Calculate the preferred zone for allocation only once
  2009-04-22 13:53   ` Mel Gorman
@ 2009-04-23 22:48     ` Andrew Morton
  -1 siblings, 0 replies; 186+ messages in thread
From: Andrew Morton @ 2009-04-23 22:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: mel, linux-mm, kosaki.motohiro, cl, npiggin, linux-kernel,
	ming.m.lin, yanmin_zhang, peterz, penberg

On Wed, 22 Apr 2009 14:53:12 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> get_page_from_freelist() can be called multiple times for an allocation.
> Part of this calculates the preferred_zone which is the first usable zone
> in the zonelist but the zone depends on the GFP flags specified at the
> beginning of the allocation call. This patch calculates preferred_zone
> once. It's safe to do this because if preferred_zone is NULL at the start
> of the call, no amount of direct reclaim or other actions will change the
> fact the allocation will fail.
> 
>
> ...
>
> -	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
> -
>							&preferred_zone);
> ...  
>
> +	/* The preferred zone is used for statistics later */
> +	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,

Let's quietly zap that dopey cast.

--- a/mm/page_alloc.c~page-allocator-calculate-the-preferred-zone-for-allocation-only-once-fix
+++ a/mm/page_alloc.c
@@ -1775,8 +1775,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
 		return NULL;
 
 	/* The preferred zone is used for statistics later */
-	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
-							&preferred_zone);
+	first_zones_zonelist(zonelist, high_zoneidx, nodemask, &preferred_zone);
 	if (!preferred_zone)
 		return NULL;
 
_


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 09/22] Calculate the alloc_flags for allocation only once
  2009-04-22 13:53   ` Mel Gorman
@ 2009-04-23 22:52     ` Andrew Morton
  -1 siblings, 0 replies; 186+ messages in thread
From: Andrew Morton @ 2009-04-23 22:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: mel, linux-mm, kosaki.motohiro, cl, npiggin, linux-kernel,
	ming.m.lin, yanmin_zhang, peterz, penberg

On Wed, 22 Apr 2009 14:53:14 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> Factor out the mapping between GFP and alloc_flags only once. Once factored
> out, it only needs to be calculated once but some care must be taken.
> 
> [neilb@suse.de says]
> As the test:
> 
> -       if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> -                       && !in_interrupt()) {
> -               if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> 
> has been replaced with a slightly weaker one:
> 
> +       if (alloc_flags & ALLOC_NO_WATERMARKS) {
> 
> Without care, this would allow recursion into the allocator via direct
> reclaim. This patch ensures we do not recurse when PF_MEMALLOC is set
> but TF_MEMDIE callers are now allowed to directly reclaim where they
> would have been prevented in the past.
> 
> ...
>
> +static inline int
> +gfp_to_alloc_flags(gfp_t gfp_mask)
> +{
> +	struct task_struct *p = current;
> +	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> +	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +
> +	/*
> +	 * The caller may dip into page reserves a bit more if the caller
> +	 * cannot run direct reclaim, or if the caller has realtime scheduling
> +	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> +	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
> +	 */
> +	if (gfp_mask & __GFP_HIGH)
> +		alloc_flags |= ALLOC_HIGH;
> +
> +	if (!wait) {
> +		alloc_flags |= ALLOC_HARDER;
> +		/*
> +		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
> +		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> +		 */
> +		alloc_flags &= ~ALLOC_CPUSET;
> +	} else if (unlikely(rt_task(p)))
> +		alloc_flags |= ALLOC_HARDER;
> +
> +	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> +		if (!in_interrupt() &&
> +		    ((p->flags & PF_MEMALLOC) ||
> +		     unlikely(test_thread_flag(TIF_MEMDIE))))
> +			alloc_flags |= ALLOC_NO_WATERMARKS;
> +	}
> +
> +	return alloc_flags;
> +}

hm.  Was there a particular reason for the explicit inline?

It's OK as it stands, but might become suboptimal if we later add a
second caller?


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 15/22] Do not disable interrupts in free_page_mlock()
  2009-04-22 13:53   ` Mel Gorman
@ 2009-04-23 22:59     ` Andrew Morton
  -1 siblings, 0 replies; 186+ messages in thread
From: Andrew Morton @ 2009-04-23 22:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: mel, linux-mm, kosaki.motohiro, cl, npiggin, linux-kernel,
	ming.m.lin, yanmin_zhang, peterz, penberg

On Wed, 22 Apr 2009 14:53:20 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> free_page_mlock() tests and clears PG_mlocked using locked versions of the
> bit operations. If set, it disables interrupts to update counters and this
> happens on every page free even though interrupts are disabled very shortly
> afterwards a second time.  This is wasteful.

Well.  It's only wasteful if the page was mlocked, which is rare.

> This patch splits what free_page_mlock() does. The bit check is still
> made. However, the update of counters is delayed until the interrupts are
> disabled and the non-lock version for clearing the bit is used. One potential
> weirdness with this split is that the counters do not get updated if the
> bad_page() check is triggered but a system showing bad pages is getting
> screwed already.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/internal.h   |   11 +++--------
>  mm/page_alloc.c |    8 +++++++-
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 987bb03..58ec1bc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
>   */
>  static inline void free_page_mlock(struct page *page)
>  {
> -	if (unlikely(TestClearPageMlocked(page))) {
> -		unsigned long flags;
> -
> -		local_irq_save(flags);
> -		__dec_zone_page_state(page, NR_MLOCK);
> -		__count_vm_event(UNEVICTABLE_MLOCKFREED);
> -		local_irq_restore(flags);
> -	}
> +	__ClearPageMlocked(page);
> +	__dec_zone_page_state(page, NR_MLOCK);
> +	__count_vm_event(UNEVICTABLE_MLOCKFREED);
>  }

The conscientious reviewer runs around and checks for free_page_mlock()
callers in other .c files which might be affected.

Only there are no such callers.

The reviewer's job would be reduced if free_page_mlock() wasn't
needlessly placed in a header file!

>  #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 67cafd0..7f45de1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -499,7 +499,6 @@ static inline void __free_one_page(struct page *page,
>  
>  static inline int free_pages_check(struct page *page)
>  {
> -	free_page_mlock(page);
>  	if (unlikely(page_mapcount(page) |
>  		(page->mapping != NULL)  |
>  		(page_count(page) != 0)  |
> @@ -556,6 +555,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>  	unsigned long flags;
>  	int i;
>  	int bad = 0;
> +	int clearMlocked = PageMlocked(page);
>  
>  	for (i = 0 ; i < (1 << order) ; ++i)
>  		bad += free_pages_check(page + i);
> @@ -571,6 +571,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>  	kernel_map_pages(page, 1 << order, 0);
>  
>  	local_irq_save(flags);
> +	if (unlikely(clearMlocked))
> +		free_page_mlock(page);

I wonder what the compiler does in the case
CONFIG_HAVE_MLOCKED_PAGE_BIT=n.  If it is dumb, this patch would cause
additional code generation.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 19/22] Update NR_FREE_PAGES only as necessary
  2009-04-23 23:06     ` Andrew Morton
@ 2009-04-23 23:04       ` Christoph Lameter
  -1 siblings, 0 replies; 186+ messages in thread
From: Christoph Lameter @ 2009-04-23 23:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linux-mm, kosaki.motohiro, npiggin, linux-kernel,
	ming.m.lin, yanmin_zhang, peterz, penberg

On Thu, 23 Apr 2009, Andrew Morton wrote:

> If not, what _is_ it asserting?

That the alignment of an order N page in the max order block is proper?
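
A worked toy example of that invariant, with MAX_ORDER assumed to be the
usual 11 (standalone sketch, not kernel code):

#include <stdio.h>

#define MAX_ORDER 11	/* assumed typical value */

/* A free buddy of a given order must sit on an order-aligned boundary
 * within its MAX_ORDER block; that is all the VM_BUG_ON checks. */
static int order_aligned(unsigned long pfn, unsigned int order)
{
	unsigned long page_idx = pfn & ((1UL << MAX_ORDER) - 1);

	return !(page_idx & ((1UL << order) - 1));
}

int main(void)
{
	printf("%d\n", order_aligned(0x12348, 3));	/* 1: 8-page aligned, fine */
	printf("%d\n", order_aligned(0x1234c, 3));	/* 0: would trip the VM_BUG_ON */
	return 0;
}

So it is a buddy-alignment assertion on the page being freed, not a bound
on order.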



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 19/22] Update NR_FREE_PAGES only as necessary
  2009-04-22 13:53   ` Mel Gorman
@ 2009-04-23 23:06     ` Andrew Morton
  -1 siblings, 0 replies; 186+ messages in thread
From: Andrew Morton @ 2009-04-23 23:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: mel, linux-mm, kosaki.motohiro, cl, npiggin, linux-kernel,
	ming.m.lin, yanmin_zhang, peterz, penberg

On Wed, 22 Apr 2009 14:53:24 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> When pages are being freed to the buddy allocator, the zone
> NR_FREE_PAGES counter must be updated. In the case of bulk per-cpu page
> freeing, it's updated once per page. This retouches cache lines more
> than necessary. Update the counters once per per-cpu bulk free.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -460,7 +460,6 @@ static inline void __free_one_page(struct page *page,
>  		int migratetype)
>  {
>  	unsigned long page_idx;
> -	int order_size = 1 << order;
>  
>  	if (unlikely(PageCompound(page)))
>  		if (unlikely(destroy_compound_page(page, order)))
> @@ -470,10 +469,9 @@ static inline void __free_one_page(struct page *page,
>  
>  	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
>  
> -	VM_BUG_ON(page_idx & (order_size - 1));
> +	VM_BUG_ON(page_idx & ((1 << order) - 1));
>  	VM_BUG_ON(bad_range(zone, page));
>  

<head spins>

Is this all a slow and obscure way of doing

	VM_BUG_ON(order > MAX_ORDER);

?

If not, what _is_ it asserting?

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 15/22] Do not disable interrupts in free_page_mlock()
  2009-04-23 22:59     ` Andrew Morton
@ 2009-04-24  0:07       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 186+ messages in thread
From: KOSAKI Motohiro @ 2009-04-24  0:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, Mel Gorman, linux-mm, cl, npiggin, linux-kernel,
	ming.m.lin, yanmin_zhang, peterz, penberg

> > @@ -556,6 +555,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> >  	unsigned long flags;
> >  	int i;
> >  	int bad = 0;
> > +	int clearMlocked = PageMlocked(page);
> >  
> >  	for (i = 0 ; i < (1 << order) ; ++i)
> >  		bad += free_pages_check(page + i);
> > @@ -571,6 +571,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> >  	kernel_map_pages(page, 1 << order, 0);
> >  
> >  	local_irq_save(flags);
> > +	if (unlikely(clearMlocked))
> > +		free_page_mlock(page);
> 
> I wonder what the compiler does in the case
> CONFIG_HAVE_MLOCKED_PAGE_BIT=n.  If it is dumb, this patch would cause
> additional code generation.

If CONFIG_HAVE_MLOCKED_PAGE_BIT=n, PageMlocked() is just {return 0;}, so
gcc can remove the following code, I think.
	if (0) 
		free_page_mlock(page)
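
A quick standalone way to check that (userspace analogue; the names are
borrowed from the kernel but the program itself is made up):

#include <stdio.h>

struct page { unsigned long flags; };

/* mirrors the !CONFIG_HAVE_MLOCKED_PAGE_BIT stub: a compile-time 0 */
static inline int PageMlocked(struct page *page) { (void)page; return 0; }

static void free_page_mlock(struct page *page)
{
	/* stands in for the counter updates; never reached below */
	printf("mlock counters updated for %p\n", (void *)page);
}

static void free_one(struct page *page)
{
	int clearMlocked = PageMlocked(page);	/* constant-folds to 0 */

	if (clearMlocked)			/* provably dead branch */
		free_page_mlock(page);
}

int main(void)
{
	struct page p = { 0 };

	free_one(&p);
	return 0;
}

At -O2, gcc folds the constant and drops both the branch and the call, so
free_one() compiles down to nothing extra, which matches the expectation
above.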




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 15/22] Do not disable interrupts in free_page_mlock()
  2009-04-23 22:59     ` Andrew Morton
@ 2009-04-24  0:33       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 186+ messages in thread
From: KOSAKI Motohiro @ 2009-04-24  0:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, Mel Gorman, linux-mm, cl, npiggin, linux-kernel,
	ming.m.lin, yanmin_zhang, peterz, penberg, Lee Schermerhorn,
	Dave Hansen

> > @@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> >   */
> >  static inline void free_page_mlock(struct page *page)
> >  {
> > -	if (unlikely(TestClearPageMlocked(page))) {
> > -		unsigned long flags;
> > -
> > -		local_irq_save(flags);
> > -		__dec_zone_page_state(page, NR_MLOCK);
> > -		__count_vm_event(UNEVICTABLE_MLOCKFREED);
> > -		local_irq_restore(flags);
> > -	}
> > +	__ClearPageMlocked(page);
> > +	__dec_zone_page_state(page, NR_MLOCK);
> > +	__count_vm_event(UNEVICTABLE_MLOCKFREED);
> >  }
> 
> The conscientious reviewer runs around and checks for free_page_mlock()
> callers in other .c files which might be affected.
> 
> Only there are no such callers.
> 
> The reviewer's job would be reduced if free_page_mlock() wasn't
> needlessly placed in a header file!

very sorry.

How about this?

=============================================
Subject: [PATCH] move free_page_mlock() to page_alloc.c

Currently, free_page_mlock() is only called from page_alloc.c.
Thus, we can move it to page_alloc.c.

Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/internal.h   |   18 ------------------
 mm/page_alloc.c |   21 +++++++++++++++++++++
 2 files changed, 21 insertions(+), 18 deletions(-)

Index: b/mm/internal.h
===================================================================
--- a/mm/internal.h	2009-04-24 09:12:03.000000000 +0900
+++ b/mm/internal.h	2009-04-24 09:12:10.000000000 +0900
@@ -150,23 +150,6 @@ static inline void mlock_migrate_page(st
 	}
 }
 
-/*
- * free_page_mlock() -- clean up attempts to free and mlocked() page.
- * Page should not be on lru, so no need to fix that up.
- * free_pages_check() will verify...
- */
-static inline void free_page_mlock(struct page *page)
-{
-	if (unlikely(TestClearPageMlocked(page))) {
-		unsigned long flags;
-
-		local_irq_save(flags);
-		__dec_zone_page_state(page, NR_MLOCK);
-		__count_vm_event(UNEVICTABLE_MLOCKFREED);
-		local_irq_restore(flags);
-	}
-}
-
 #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
 static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
 {
@@ -175,7 +158,6 @@ static inline int is_mlocked_vma(struct 
 static inline void clear_page_mlock(struct page *page) { }
 static inline void mlock_vma_page(struct page *page) { }
 static inline void mlock_migrate_page(struct page *new, struct page *old) { }
-static inline void free_page_mlock(struct page *page) { }
 
 #endif /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
 
Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c	2009-04-24 09:12:03.000000000 +0900
+++ b/mm/page_alloc.c	2009-04-24 09:13:25.000000000 +0900
@@ -491,6 +491,27 @@ static inline void __free_one_page(struc
 	zone->free_area[order].nr_free++;
 }
 
+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+/*
+ * free_page_mlock() -- clean up attempts to free and mlocked() page.
+ * Page should not be on lru, so no need to fix that up.
+ * free_pages_check() will verify...
+ */
+static void free_page_mlock(struct page *page)
+{
+	if (unlikely(TestClearPageMlocked(page))) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		__dec_zone_page_state(page, NR_MLOCK);
+		__count_vm_event(UNEVICTABLE_MLOCKFREED);
+		local_irq_restore(flags);
+	}
+}
+#else
+static void free_page_mlock(struct page *page) { }
+#endif
+
 static inline int free_pages_check(struct page *page)
 {
 	free_page_mlock(page);



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-23 17:36                 ` Dave Hansen
@ 2009-04-24  2:57                   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 186+ messages in thread
From: KOSAKI Motohiro @ 2009-04-24  2:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kosaki.motohiro, Mel Gorman, Linux Memory Management List,
	Christoph Lameter, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra, Pekka Enberg,
	Andrew Morton

> On Thu, 2009-04-23 at 10:58 +0100, Mel Gorman wrote:
> > > How about this:  I'll go and audit the use of order in page_alloc.c to
> > > make sure that having an order>MAX_ORDER-1 floating around is OK and
> > > won't break anything. 
> > 
> > Great. Right now, I think it's ok but I haven't audited for this
> > explicily and a second set of eyes never hurts.
> 
> OK, after looking through this, I have a couple of ideas.  One is that
> we do the MAX_ORDER check in __alloc_pages_internal(), but *after* the
> first call to get_page_from_freelist().  That's because I'm worried if
> we ever got into the reclaim code with a >MAX_ORDER 'order'.  Such as:
> 
> void wakeup_kswapd(struct zone *zone, int order)
> {
> ...
>         if (pgdat->kswapd_max_order < order)
>                 pgdat->kswapd_max_order = order;
>         if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>                 return;
>         if (!waitqueue_active(&pgdat->kswapd_wait))
>                 return;
>         wake_up_interruptible(&pgdat->kswapd_wait);
> }
> 
> unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>                                 gfp_t gfp_mask, nodemask_t *nodemask)
> {
>         struct scan_control sc = {
> ...
>                 .order = order,
>                 .mem_cgroup = NULL,
>                 .isolate_pages = isolate_pages_global,
>                 .nodemask = nodemask,
>         };
> 
>         return do_try_to_free_pages(zonelist, &sc);
> }
> 
> This will keep us only checking 'order' once for each
> alloc_pages_internal() call.  It is an extra branch, but it is out of
> the really, really hot path since we're about to start reclaim here
> anyway.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e2f2699..1e3a01e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1498,6 +1498,13 @@ restart:
>  			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
>  	if (page)
>  		goto got_pg;
> +	/*
> +	 * We're out of the rocket-hot area above, so do a quick sanity
> +	 * check.  We do this here to avoid ever trying to do any reclaim
> +	 * of >=MAX_ORDER areas which can never succeed, of course.
> +	 */
> +	if (order >= MAX_ORDER)
> +		goto nopage;
>  
>  	/*
>  	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and

Good point.
Would if (WARN_ON_ONCE(order >= MAX_ORDER)) be better?
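
Combining that with Dave's placement, the hunk would presumably end up
looking something like this (just a sketch of the two suggestions merged,
not a patch that was posted):

	if (page)
		goto got_pg;

	/*
	 * Off the rocket-hot path now, so warn once and bail out for
	 * impossible orders instead of failing silently.
	 */
	if (WARN_ON_ONCE(order >= MAX_ORDER))
		goto nopage;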




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark
  2009-04-23 10:03             ` Mel Gorman
@ 2009-04-24  6:41               ` KOSAKI Motohiro
  -1 siblings, 0 replies; 186+ messages in thread
From: KOSAKI Motohiro @ 2009-04-24  6:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Dave Hansen, Linux Memory Management List,
	Christoph Lameter, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin, Peter Zijlstra, Pekka Enberg,
	Andrew Morton

> Considering that they are the same type for elements and arrays, I
> didn't think padding would ever be a problem.
> 
> > However, all gcc versions don't do that, I think, but perhaps I missed
> > some minor gcc release..
> > 
> > So, I also like Dave's idea, but it's only a personal feeling.
> > 
> 
> The tide is against me on this one :).
> 
> How about I roll a patch on top of this set that replaces the union by
> calling all sites? I figure that patch will go through a few revisions before
> people are happy with the final API. However, as the patch wouldn't change
> functionality, I'd like to see this series getting wider testing if possible. The
> replace-union-with-single-array patch can be easily folded in then when
> it settles.
> 
> Sound like a plan?

Yeah, I agree testing is more important than the ugliness discussion :)
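
For anyone reading along without patch 18 handy, the structure being
argued over is roughly of this shape (a reconstruction of the idea, not
the actual hunk, and the struct name here is made up): the named
watermark fields are aliased onto an array so the fast path can index it
with the ALLOC_WMARK_* bits of alloc_flags.

struct zone_watermarks_sketch {
	union {
		struct {
			unsigned long pages_min;
			unsigned long pages_low;
			unsigned long pages_high;
		};
		unsigned long pages_mark[3];	/* indexed by the ALLOC_WMARK_* bits */
	};
};

Mel's point above is that with every member being the same unsigned long,
padding cannot make the anonymous struct and the array views disagree;
the pushback in the thread is about the union's ugliness rather than its
correctness.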




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-23 19:45                 ` Dave Hansen
@ 2009-04-24  9:21                   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-24  9:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Thu, Apr 23, 2009 at 12:45:30PM -0700, Dave Hansen wrote:
> On Thu, 2009-04-23 at 12:26 -0700, Dave Hansen wrote:
> > I'm going to retry this with a NUMA config.  
> 
> Oddly enough, removing the alloc_pages() check didn't do anything to
> text on a NUMA=y config, either: 
> 
> dave@kernel:~/work/mhp-build$ size i386-numaq-sparse.?/vmlinux
>    text	   data	    bss	    dec	    hex	filename
> 3460792	 449336	1626112	5536240	 5479f0	i386-numaq-sparse.0/vmlinux
> 3460792	 449336	1626112	5536240	 5479f0	i386-numaq-sparse.1/vmlinux
> 3460776	 449336	1626112	5536224	 5479e0	i386-numaq-sparse.2/vmlinux
> 
> Here's bloatmeter from removing the alloc_pages_node() check:
> 
> dave@kernel:~/work/mhp-build$ ../linux-2.6.git/scripts/bloat-o-meter
> i386-numaq-sparse.{1,2}/vmlinux 
> add/remove: 0/0 grow/shrink: 9/16 up/down: 81/-99 (-18)
> function                                     old     new   delta
> st_int_ioctl                                2600    2624     +24
> tcp_sendmsg                                 2153    2169     +16
> diskstats_show                               739     753     +14
> iov_shorten                                   49      58      +9
> unmap_vmas                                  1653    1661      +8
> sg_build_indirect                            449     455      +6
> ahc_linux_biosparam                          251     253      +2
> nlmclnt_call                                 557     558      +1
> do_mount                                    1533    1534      +1

It doesn't make sense at all that text increased in size. Did you make clean
between each .config change and patch application? Are you using distcc
or anything else that might cause confusion? I found I had to eliminate
distcc and run make clean after each patch application because
net/ipv4/udp.c would sometimes generate different assembly when
accessing struct zone. Not really sure what was going on there.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-23 17:36                 ` Dave Hansen
@ 2009-04-24 10:34                   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-24 10:34 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Thu, Apr 23, 2009 at 10:36:50AM -0700, Dave Hansen wrote:
> On Thu, 2009-04-23 at 10:58 +0100, Mel Gorman wrote:
> > > How about this:  I'll go and audit the use of order in page_alloc.c to
> > > make sure that having an order>MAX_ORDER-1 floating around is OK and
> > > won't break anything. 
> > 
> > Great. Right now, I think it's ok but I haven't audited for this
> > explicily and a second set of eyes never hurts.
> 
> OK, after looking through this, I have a couple of ideas.  One is that
> we do the MAX_ORDER check in __alloc_pages_internal(), but *after* the
> first call to get_page_from_freelist().  That's because I'm worried if
> we ever got into the reclaim code with a >MAX_ORDER 'order'.  Such as:
> 
> void wakeup_kswapd(struct zone *zone, int order)
> {
> ...
>         if (pgdat->kswapd_max_order < order)
>                 pgdat->kswapd_max_order = order;
>         if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>                 return;
>         if (!waitqueue_active(&pgdat->kswapd_wait))
>                 return;
>         wake_up_interruptible(&pgdat->kswapd_wait);
> }
> 
> unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>                                 gfp_t gfp_mask, nodemask_t *nodemask)
> {
>         struct scan_control sc = {
> ...
>                 .order = order,
>                 .mem_cgroup = NULL,
>                 .isolate_pages = isolate_pages_global,
>                 .nodemask = nodemask,
>         };
> 
>         return do_try_to_free_pages(zonelist, &sc);
> }
> 

That is perfectly rational.

> This will keep us only checking 'order' once for each
> alloc_pages_internal() call.  It is an extra branch, but it is out of
> the really, really hot path since we're about to start reclaim here
> anyway.
> 

I combined yours and Andrew's suggestions into a patch that applies on
top of the series. Dave, as it's basically your patch it needs your
sign-off if you agree it's ok.

I tested this with a bodge that allocates with increasing orders up to a
very large number.  As you'd expect, it worked until it hit order-11 on an
x86 machine and failed for every higher order by returning NULL.

It reports a warning once. We'll probably drop the warning in time, but for
now it gives us a chance to catch callers that really are being stupid, as
opposed to callers that are just trying to get the best buffer for the job.
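
For anyone who wants to reproduce it, a throwaway module along these lines
(a sketch of such a bodge, not the exact code used) would do the same job:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static int __init order_probe_init(void)
{
        unsigned int order;

        /* walk well past MAX_ORDER and check the allocator fails cleanly */
        for (order = 0; order < MAX_ORDER + 4; order++) {
                struct page *page = alloc_pages(GFP_KERNEL | __GFP_NOWARN,
                                                order);

                printk(KERN_INFO "order-%u: %s\n",
                                order, page ? "allocated" : "NULL");
                if (page)
                        __free_pages(page, order);
        }

        return 0;
}
module_init(order_probe_init);
MODULE_LICENSE("GPL");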

=====
Sanity check order in the page allocator slow path

Callers may speculatively call different allocators in order of preference
trying to allocate a buffer of a given size. The order needed to allocate
this may be larger than what the page allocator can normally handle. While
the allocator mostly does the right thing, it should not enter direct reclaim
or wake up kswapd with a bogus order. This patch sanity checks the order in
the slow path and returns NULL if it is too large.

Needs-signed-off-by from Dave Hansen here before merging. Based on his
not-signed-off-by patch.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 mm/page_alloc.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1464aca..1c60141 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1434,7 +1434,6 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
 	classzone_idx = zone_idx(preferred_zone);
-	VM_BUG_ON(order >= MAX_ORDER);
 
 zonelist_scan:
 	/*
@@ -1692,6 +1691,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct task_struct *p = current;
 
 	/*
+	 * In the slowpath, we sanity check order to avoid ever trying to
+	 * reclaim >= MAX_ORDER areas which will never succeed. Callers may
+	 * be using allocators in order of preference for an area that is
+	 * too large. 
+	 */
+	if (WARN_ON_ONCE(order >= MAX_ORDER))
+		return NULL;
+
+	/*
 	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
 	 * __GFP_NOWARN set) should not cause reclaim since the subsystem
 	 * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim

^ permalink raw reply related	[flat|nested] 186+ messages in thread

* Re: [PATCH 09/22] Calculate the alloc_flags for allocation only once
  2009-04-23 22:52     ` Andrew Morton
@ 2009-04-24 10:47       ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-24 10:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, kosaki.motohiro, cl, npiggin, linux-kernel, ming.m.lin,
	yanmin_zhang, peterz, penberg

On Thu, Apr 23, 2009 at 03:52:16PM -0700, Andrew Morton wrote:
> On Wed, 22 Apr 2009 14:53:14 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Factor out the mapping between GFP and alloc_flags only once. Once factored
> > out, it only needs to be calculated once but some care must be taken.
> > 
> > [neilb@suse.de says]
> > As the test:
> > 
> > -       if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> > -                       && !in_interrupt()) {
> > -               if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> > 
> > has been replaced with a slightly weaker one:
> > 
> > +       if (alloc_flags & ALLOC_NO_WATERMARKS) {
> > 
> > Without care, this would allow recursion into the allocator via direct
> > reclaim. This patch ensures we do not recurse when PF_MEMALLOC is set
> > but TF_MEMDIE callers are now allowed to directly reclaim where they
> > would have been prevented in the past.
> > 
> > ...
> >
> > +static inline int
> > +gfp_to_alloc_flags(gfp_t gfp_mask)
> > +{
> > +	struct task_struct *p = current;
> > +	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> > +	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > +
> > +	/*
> > +	 * The caller may dip into page reserves a bit more if the caller
> > +	 * cannot run direct reclaim, or if the caller has realtime scheduling
> > +	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> > +	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
> > +	 */
> > +	if (gfp_mask & __GFP_HIGH)
> > +		alloc_flags |= ALLOC_HIGH;
> > +
> > +	if (!wait) {
> > +		alloc_flags |= ALLOC_HARDER;
> > +		/*
> > +		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
> > +		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> > +		 */
> > +		alloc_flags &= ~ALLOC_CPUSET;
> > +	} else if (unlikely(rt_task(p)))
> > +		alloc_flags |= ALLOC_HARDER;
> > +
> > +	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> > +		if (!in_interrupt() &&
> > +		    ((p->flags & PF_MEMALLOC) ||
> > +		     unlikely(test_thread_flag(TIF_MEMDIE))))
> > +			alloc_flags |= ALLOC_NO_WATERMARKS;
> > +	}
> > +
> > +	return alloc_flags;
> > +}
> 
> hm.  Was there a particular reason for the explicit inline?
> 

Only because it was known there was only one caller.

> It's OK as it stands, but might become suboptimal if we later add a
> second caller?
> 

This is true. As it is also in the slowpath and only called once, the
following patch should make no difference to performance but potentially
avoid a mistake later.

=======
Uninline gfp_to_alloc_flags() in the page allocator slow path

gfp_to_alloc_flags() is in the slowpath but inlined. While there is only one
caller now, a future second call site would add surprising text bloat. Uninline
it now to avoid surprises later, as it should have no performance impact.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c60141..f08b4cb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1639,7 +1639,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
 		wakeup_kswapd(zone, order);
 }
 
-static inline int
+static int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
 	struct task_struct *p = current;

^ permalink raw reply related	[flat|nested] 186+ messages in thread

* Re: [PATCH 15/22] Do not disable interrupts in free_page_mlock()
  2009-04-23 22:59     ` Andrew Morton
@ 2009-04-24 11:18       ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-24 11:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, kosaki.motohiro, cl, npiggin, linux-kernel, ming.m.lin,
	yanmin_zhang, peterz, penberg

On Thu, Apr 23, 2009 at 03:59:51PM -0700, Andrew Morton wrote:
> On Wed, 22 Apr 2009 14:53:20 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > free_page_mlock() tests and clears PG_mlocked using locked versions of the
> > bit operations. If set, it disables interrupts to update counters and this
> > happens on every page free even though interrupts are disabled very shortly
> > afterwards a second time.  This is wasteful.
> 
> Well.  It's only wasteful if the page was mlocked, which is rare.
> 

True. mlocked pages are only going to be torn down during process exit or
munmap, both of which you'd expect to be rare when mlock is involved.

s/This is wasteful/While rare, this is unnecessary./

?

> > This patch splits what free_page_mlock() does. The bit check is still
> > made. However, the update of counters is delayed until the interrupts are
> > disabled and the non-lock version for clearing the bit is used. One potential
> > weirdness with this split is that the counters do not get updated if the
> > bad_page() check is triggered but a system showing bad pages is getting
> > screwed already.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> > Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
> > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > ---
> >  mm/internal.h   |   11 +++--------
> >  mm/page_alloc.c |    8 +++++++-
> >  2 files changed, 10 insertions(+), 9 deletions(-)
> > 
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 987bb03..58ec1bc 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> >   */
> >  static inline void free_page_mlock(struct page *page)
> >  {
> > -	if (unlikely(TestClearPageMlocked(page))) {
> > -		unsigned long flags;
> > -
> > -		local_irq_save(flags);
> > -		__dec_zone_page_state(page, NR_MLOCK);
> > -		__count_vm_event(UNEVICTABLE_MLOCKFREED);
> > -		local_irq_restore(flags);
> > -	}
> > +	__ClearPageMlocked(page);
> > +	__dec_zone_page_state(page, NR_MLOCK);
> > +	__count_vm_event(UNEVICTABLE_MLOCKFREED);
> >  }
> 
> The conscientious reviewer runs around and checks for free_page_mlock()
> callers in other .c files which might be affected.
> 

The patch author should have done the same thing :/

> Only there are no such callers.
> 
> The reviewer's job would be reduced if free_page_mlock() wasn't
> needlessly placed in a header file!
> 
> >  #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 67cafd0..7f45de1 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -499,7 +499,6 @@ static inline void __free_one_page(struct page *page,
> >  
> >  static inline int free_pages_check(struct page *page)
> >  {
> > -	free_page_mlock(page);
> >  	if (unlikely(page_mapcount(page) |
> >  		(page->mapping != NULL)  |
> >  		(page_count(page) != 0)  |
> > @@ -556,6 +555,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> >  	unsigned long flags;
> >  	int i;
> >  	int bad = 0;
> > +	int clearMlocked = PageMlocked(page);
> >  
> >  	for (i = 0 ; i < (1 << order) ; ++i)
> >  		bad += free_pages_check(page + i);
> > @@ -571,6 +571,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> >  	kernel_map_pages(page, 1 << order, 0);
> >  
> >  	local_irq_save(flags);
> > +	if (unlikely(clearMlocked))
> > +		free_page_mlock(page);
> 
> I wonder what the compiler does in the case
> CONFIG_HAVE_MLOCKED_PAGE_BIT=n.  If it is dumb, this patch would cause
> additional code generation.
> 

PageMlocked becomes

static inline int PageMlocked(struct page *page)
	{ return 0; }

so the compiler should be able to spot that clearMlocked will always be 0. Even
if it didn't, it should at least have spotted that free_page_mlock(page)
is an empty inline function in this case.

Still, I double-checked and did not see the branch generated with
CONFIG_HAVE_MLOCKED_PAGE_BIT=n.
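
If anyone wants to double check outside the kernel tree, the constant
folding is easy to see in a stand-alone program (userspace sketch, not
kernel code; build with gcc -O2 and compare objdump -d output with the
stub returning 0 versus 1):

#include <stdio.h>

struct page { unsigned long flags; };

#define unlikely(x) __builtin_expect(!!(x), 0)

/* stand-ins for the CONFIG_HAVE_MLOCKED_PAGE_BIT=n stubs */
static inline int PageMlocked(struct page *page) { return 0; }
static inline void free_page_mlock(struct page *page) { }

void free_one(struct page *page)
{
        int clearMlocked = PageMlocked(page);   /* constant 0 */

        if (unlikely(clearMlocked))             /* folded away at -O2 */
                free_page_mlock(page);
}

int main(void)
{
        struct page p = { 0 };

        free_one(&p);
        printf("disassemble free_one() to confirm there is no branch\n");
        return 0;
}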

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 15/22] Do not disable interrupts in free_page_mlock()
  2009-04-24  0:33       ` KOSAKI Motohiro
@ 2009-04-24 11:33         ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-24 11:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, linux-mm, cl, npiggin, linux-kernel, ming.m.lin,
	yanmin_zhang, peterz, penberg, Lee Schermerhorn, Dave Hansen

On Fri, Apr 24, 2009 at 09:33:50AM +0900, KOSAKI Motohiro wrote:
> > > @@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> > >   */
> > >  static inline void free_page_mlock(struct page *page)
> > >  {
> > > -	if (unlikely(TestClearPageMlocked(page))) {
> > > -		unsigned long flags;
> > > -
> > > -		local_irq_save(flags);
> > > -		__dec_zone_page_state(page, NR_MLOCK);
> > > -		__count_vm_event(UNEVICTABLE_MLOCKFREED);
> > > -		local_irq_restore(flags);
> > > -	}
> > > +	__ClearPageMlocked(page);
> > > +	__dec_zone_page_state(page, NR_MLOCK);
> > > +	__count_vm_event(UNEVICTABLE_MLOCKFREED);
> > >  }
> > 
> > The conscientious reviewer runs around and checks for free_page_mlock()
> > callers in other .c files which might be affected.
> > 
> > Only there are no such callers.
> > 
> > The reviewer's job would be reduced if free_page_mlock() wasn't
> > needlessly placed in a header file!
> 
> very sorry.
> 
> How about this?
> 
> =============================================
> Subject: [PATCH] move free_page_mlock() to page_alloc.c
> 
> Currently, free_page_mlock() is only called from page_alloc.c.
> Thus, we can move it to page_alloc.c.
> 

Looks good, but here is a version rebased on top of the patch series
where it would be easier to merge with "Do not disable interrupts in
free_page_mlock()".

I do see why it might be in the header though - it keeps all the
CONFIG_HAVE_MLOCKED_PAGE_BIT-related helper functions together, making them
easier to find. Lee, was that the intention?

=======
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Move free_page_mlock() from mm/internal.h to mm/page_alloc.c

Currently, free_page_mlock() is only called from page_alloc.c. This patch
moves it from mm/internal.h to mm/page_alloc.c.

[mel@csn.ul.ie: Rebase on top of page allocator patches]
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/internal.h   |   13 -------------
 mm/page_alloc.c |   16 ++++++++++++++++
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 58ec1bc..4b1672a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -150,18 +150,6 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 	}
 }
 
-/*
- * free_page_mlock() -- clean up attempts to free and mlocked() page.
- * Page should not be on lru, so no need to fix that up.
- * free_pages_check() will verify...
- */
-static inline void free_page_mlock(struct page *page)
-{
-	__ClearPageMlocked(page);
-	__dec_zone_page_state(page, NR_MLOCK);
-	__count_vm_event(UNEVICTABLE_MLOCKFREED);
-}
-
 #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
 static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
 {
@@ -170,7 +158,6 @@ static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
 static inline void clear_page_mlock(struct page *page) { }
 static inline void mlock_vma_page(struct page *page) { }
 static inline void mlock_migrate_page(struct page *new, struct page *old) { }
-static inline void free_page_mlock(struct page *page) { }
 
 #endif /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f08b4cb..3db5f57 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -433,6 +433,22 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 	return 0;
 }
 
+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+/*
+ * free_page_mlock() -- clean up attempts to free and mlocked() page.
+ * Page should not be on lru, so no need to fix that up.
+ * free_pages_check() will verify...
+ */
+static inline void free_page_mlock(struct page *page)
+{
+	__ClearPageMlocked(page);
+	__dec_zone_page_state(page, NR_MLOCK);
+	__count_vm_event(UNEVICTABLE_MLOCKFREED);
+}
+#else
+static inline void free_page_mlock(struct page *page) { }
+#endif /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
+
 /*
  * Freeing function for a buddy system allocator.
  *

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply related	[flat|nested] 186+ messages in thread

* Re: [PATCH 15/22] Do not disable interrupts in free_page_mlock()
  2009-04-24 11:33         ` Mel Gorman
@ 2009-04-24 11:52           ` Lee Schermerhorn
  -1 siblings, 0 replies; 186+ messages in thread
From: Lee Schermerhorn @ 2009-04-24 11:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, linux-mm, cl, npiggin,
	linux-kernel, ming.m.lin, yanmin_zhang, peterz, penberg,
	Dave Hansen

On Fri, 2009-04-24 at 12:33 +0100, Mel Gorman wrote: 
> On Fri, Apr 24, 2009 at 09:33:50AM +0900, KOSAKI Motohiro wrote:
> > > > @@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> > > >   */
> > > >  static inline void free_page_mlock(struct page *page)
> > > >  {
> > > > -	if (unlikely(TestClearPageMlocked(page))) {
> > > > -		unsigned long flags;
> > > > -
> > > > -		local_irq_save(flags);
> > > > -		__dec_zone_page_state(page, NR_MLOCK);
> > > > -		__count_vm_event(UNEVICTABLE_MLOCKFREED);
> > > > -		local_irq_restore(flags);
> > > > -	}
> > > > +	__ClearPageMlocked(page);
> > > > +	__dec_zone_page_state(page, NR_MLOCK);
> > > > +	__count_vm_event(UNEVICTABLE_MLOCKFREED);
> > > >  }
> > > 
> > > The conscientious reviewer runs around and checks for free_page_mlock()
> > > callers in other .c files which might be affected.
> > > 
> > > Only there are no such callers.
> > > 
> > > The reviewer's job would be reduced if free_page_mlock() wasn't
> > > needlessly placed in a header file!
> > 
> > very sorry.
> > 
> > How about this?
> > 
> > =============================================
> > Subject: [PATCH] move free_page_mlock() to page_alloc.c
> > 
> > Currently, free_page_mlock() is only called from page_alloc.c.
> > Thus, we can move it to page_alloc.c.
> > 
> 
> Looks good, but here is a version rebased on top of the patch series
> where it would be easier to merge with "Do not disable interrupts in
> free_page_mlock()".
> 
> I do note why it might be in the header though - it keeps all the
> CONFIG_HAVE_MLOCKED_PAGE_BIT-related helper functions together making it
> easier to find them. Lee, was that the intention?

Yes.  Having been dinged one too many times for adding extraneous
#ifdef's to .c's I try to avoid that...

Note that in page-flags.h, we define MLOCK_PAGES as 0 or 1 based on
CONFIG_HAVE_MLOCKED_PAGE_BIT, so you could test that in
free_page_mlock() to eliminate the #ifdef in page_alloc.c as we do in
try_to_unmap_*() over in rmap.c.  Guess I could have done that when I
added MLOCK_PAGES.  Got lost in the heat of the battle.
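
Something like the following is what I have in mind -- an untested sketch,
and it assumes the Mlocked page-flag helpers still resolve to no-op stubs
when CONFIG_HAVE_MLOCKED_PAGE_BIT is unset, which would need checking:

static inline void free_page_mlock(struct page *page)
{
        if (!MLOCK_PAGES)
                return;

        __ClearPageMlocked(page);
        __dec_zone_page_state(page, NR_MLOCK);
        __count_vm_event(UNEVICTABLE_MLOCKFREED);
}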

Lee


> 
> =======
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> Move free_page_mlock() from mm/internal.h to mm/page_alloc.c
> 
> Currently, free_page_mlock() is only called from page_alloc.c. This patch
> moves it from a header to to page_alloc.c.
> 
> [mel@csn.ul.ie: Rebase on top of page allocator patches]
> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Pekka Enberg <penberg@cs.helsinki.fi>
> Cc: Dave Hansen <dave@linux.vnet.ibm.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  mm/internal.h   |   13 -------------
>  mm/page_alloc.c |   16 ++++++++++++++++
>  2 files changed, 16 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 58ec1bc..4b1672a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -150,18 +150,6 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
>  	}
>  }
>  
> -/*
> - * free_page_mlock() -- clean up attempts to free and mlocked() page.
> - * Page should not be on lru, so no need to fix that up.
> - * free_pages_check() will verify...
> - */
> -static inline void free_page_mlock(struct page *page)
> -{
> -	__ClearPageMlocked(page);
> -	__dec_zone_page_state(page, NR_MLOCK);
> -	__count_vm_event(UNEVICTABLE_MLOCKFREED);
> -}
> -
>  #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
>  static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
>  {
> @@ -170,7 +158,6 @@ static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
>  static inline void clear_page_mlock(struct page *page) { }
>  static inline void mlock_vma_page(struct page *page) { }
>  static inline void mlock_migrate_page(struct page *new, struct page *old) { }
> -static inline void free_page_mlock(struct page *page) { }
>  
>  #endif /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f08b4cb..3db5f57 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -433,6 +433,22 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
>  	return 0;
>  }
>  
> +#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
> +/*
> + * free_page_mlock() -- clean up attempts to free and mlocked() page.
> + * Page should not be on lru, so no need to fix that up.
> + * free_pages_check() will verify...
> + */
> +static inline void free_page_mlock(struct page *page)
> +{
> +	__ClearPageMlocked(page);
> +	__dec_zone_page_state(page, NR_MLOCK);
> +	__count_vm_event(UNEVICTABLE_MLOCKFREED);
> +}
> +#else
> +static inline void free_page_mlock(struct page *page) { }
> +#endif /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
> +
>  /*
>   * Freeing function for a buddy system allocator.
>   *
> 


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 19/22] Update NR_FREE_PAGES only as necessary
  2009-04-23 23:06     ` Andrew Morton
@ 2009-04-24 13:06       ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-24 13:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, kosaki.motohiro, cl, npiggin, linux-kernel, ming.m.lin,
	yanmin_zhang, peterz, penberg

On Thu, Apr 23, 2009 at 04:06:10PM -0700, Andrew Morton wrote:
> On Wed, 22 Apr 2009 14:53:24 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > When pages are being freed to the buddy allocator, the zone
> > NR_FREE_PAGES counter must be updated. In the case of bulk per-cpu page
> > freeing, it's updated once per page. This retouches cache lines more
> > than necessary. Update the counters one per per-cpu bulk free.
> > 
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -460,7 +460,6 @@ static inline void __free_one_page(struct page *page,
> >  		int migratetype)
> >  {
> >  	unsigned long page_idx;
> > -	int order_size = 1 << order;
> >  
> >  	if (unlikely(PageCompound(page)))
> >  		if (unlikely(destroy_compound_page(page, order)))
> > @@ -470,10 +469,9 @@ static inline void __free_one_page(struct page *page,
> >  
> >  	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
> >  
> > -	VM_BUG_ON(page_idx & (order_size - 1));
> > +	VM_BUG_ON(page_idx & ((1 << order) - 1));
> >  	VM_BUG_ON(bad_range(zone, page));
> >  
> 
> <head spins>
> 
> Is this all a slow and obscure way of doing
> 
> 	VM_BUG_ON(order > MAX_ORDER);
> 
> ?

Nope.

> 
> If not, what _is_ it asserting?
> 

That the page is properly placed. The start of the block being freed has to
be aligned to its own size (1 << order pages). If it isn't, the calculation
that works out where the buddy is will break.
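
Concretely, the buddy of a block is found by flipping the order bit of its
index, something like this illustrative helper (the in-tree calculation is
the same XOR):

/* only lands on the start of another order-sized block if page_idx
 * itself is aligned to (1 << order) -- which is what the VM_BUG_ON
 * asserts */
static unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
        return page_idx ^ (1UL << order);
}

/* e.g. order 2: buddy_index(8, 2) == 12, a valid block start, while
 * buddy_index(9, 2) == 13 is not the start of any order-2 block */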

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* [PATCH] Do not override definition of node_set_online() with macro
  2009-04-23 19:29         ` David Rientjes
@ 2009-04-24 13:31           ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-24 13:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Thu, Apr 23, 2009 at 12:29:27PM -0700, David Rientjes wrote:
> On Thu, 23 Apr 2009, Mel Gorman wrote:
> 
> > > > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> > > > index 848025c..474e73e 100644
> > > > --- a/include/linux/nodemask.h
> > > > +++ b/include/linux/nodemask.h
> > > > @@ -408,6 +408,19 @@ static inline int num_node_state(enum node_states state)
> > > >  #define next_online_node(nid)	next_node((nid), node_states[N_ONLINE])
> > > >  
> > > >  extern int nr_node_ids;
> > > > +extern int nr_online_nodes;
> > > > +
> > > > +static inline void node_set_online(int nid)
> > > > +{
> > > > +	node_set_state(nid, N_ONLINE);
> > > > +	nr_online_nodes = num_node_state(N_ONLINE);
> > > > +}
> > > > +
> > > > +static inline void node_set_offline(int nid)
> > > > +{
> > > > +	node_clear_state(nid, N_ONLINE);
> > > > +	nr_online_nodes = num_node_state(N_ONLINE);
> > > > +}
> > > >  #else
> > > >  
> > > >  static inline int node_state(int node, enum node_states state)
> > > 
> > > The later #define's of node_set_online() and node_set_offline() in 
> > > include/linux/nodemask.h should probably be removed now.
> > > 
> > 
> > You'd think, but you can enable memory hotplug without NUMA and
> > node_set_online() is called when adding memory. Even though those
> > functions are nops on !NUMA, they're necessary.
> > 
> 
> The problem is that your new functions above are never used because 
> node_set_online and node_set_offline are macro defined later in this 
> header for all cases, not just !CONFIG_NUMA.
> 
> You need this.

You're absolutely correct, well spotted. I mistook what the #endif was
closing. Here is the patch again with a changelog. Thanks very much

==== CUT HERE ====
From: David Rientjes <rientjes@google.com>

Do not override definition of node_set_online() with macro

node_set_online() updates node_states[] and the value of nr_online_nodes.
However, its definition is accidentally overridden by a macro definition that
is only intended for the !CONFIG_NUMA case. This patch fixes the problem by
moving that macro definition inside the !CONFIG_NUMA block.

This should be considered a fix to the patch 
page-allocator-use-a-pre-calculated-value-instead-of-num_online_nodes-in-fast-paths.patch

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 include/linux/nodemask.h |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 474e73e..829b94b 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -448,6 +448,9 @@ static inline int num_node_state(enum node_states state)
 #define next_online_node(nid)	(MAX_NUMNODES)
 #define nr_node_ids		1
 #define nr_online_nodes		1
+
+#define node_set_online(node)	   node_set_state((node), N_ONLINE)
+#define node_set_offline(node)	   node_clear_state((node), N_ONLINE)
 #endif
 
 #define node_online_map 	node_states[N_ONLINE]
@@ -467,9 +470,6 @@ static inline int num_node_state(enum node_states state)
 #define node_online(node)	node_state((node), N_ONLINE)
 #define node_possible(node)	node_state((node), N_POSSIBLE)
 
-#define node_set_online(node)	   node_set_state((node), N_ONLINE)
-#define node_set_offline(node)	   node_clear_state((node), N_ONLINE)
-
 #define for_each_node(node)	   for_each_node_state(node, N_POSSIBLE)
 #define for_each_online_node(node) for_each_node_state(node, N_ONLINE)
 

^ permalink raw reply related	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-24 10:34                   ` Mel Gorman
@ 2009-04-24 14:16                     ` Dave Hansen
  -1 siblings, 0 replies; 186+ messages in thread
From: Dave Hansen @ 2009-04-24 14:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Fri, 2009-04-24 at 11:34 +0100, Mel Gorman wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1464aca..1c60141 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1434,7 +1434,6 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
>  	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
> 
>  	classzone_idx = zone_idx(preferred_zone);
> -	VM_BUG_ON(order >= MAX_ORDER);
> 
>  zonelist_scan:
>  	/*
> @@ -1692,6 +1691,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	struct task_struct *p = current;
> 
>  	/*
> +	 * In the slowpath, we sanity check order to avoid ever trying to
> +	 * reclaim >= MAX_ORDER areas which will never succeed. Callers may
> +	 * be using allocators in order of preference for an area that is
> +	 * too large. 
> +	 */
> +	if (WARN_ON_ONCE(order >= MAX_ORDER))
> +		return NULL;
> +
> +	/*
>  	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
>  	 * __GFP_NOWARN set) should not cause reclaim since the subsystem
>  	 * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim



Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>

-- Dave


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/22] Do not sanity check order in the fast path
  2009-04-24  9:21                   ` Mel Gorman
@ 2009-04-24 14:25                     ` Dave Hansen
  -1 siblings, 0 replies; 186+ messages in thread
From: Dave Hansen @ 2009-04-24 14:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Fri, 2009-04-24 at 10:21 +0100, Mel Gorman wrote:
> > dave@kernel:~/work/mhp-build$ ../linux-2.6.git/scripts/bloat-o-meter
> > i386-numaq-sparse.{1,2}/vmlinux 
> > add/remove: 0/0 grow/shrink: 9/16 up/down: 81/-99 (-18)
> > function                                     old     new   delta
> > st_int_ioctl                                2600    2624     +24
> > tcp_sendmsg                                 2153    2169     +16
> > diskstats_show                               739     753     +14
> > iov_shorten                                   49      58      +9
> > unmap_vmas                                  1653    1661      +8
> > sg_build_indirect                            449     455      +6
> > ahc_linux_biosparam                          251     253      +2
> > nlmclnt_call                                 557     558      +1
> > do_mount                                    1533    1534      +1
> 
> It doesn't make sense at all that text increased in size. Did you make clean
> between each .config change and patch application? Are you using distcc
> or anything else that might cause confusion? I found I had to eliminate
> distcc and clean after each patch application because net/ipv4/udp.c
> would sometimes generate different assembly when accessing struct zone.
> Not really sure what was going on there.

There's a ccache in there but no distcc.

I'll redo with clean builds.

-- Dave


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 09/22] Calculate the alloc_flags for allocation only once
  2009-04-24 10:47       ` Mel Gorman
@ 2009-04-24 17:51         ` Andrew Morton
  -1 siblings, 0 replies; 186+ messages in thread
From: Andrew Morton @ 2009-04-24 17:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, kosaki.motohiro, cl, npiggin, linux-kernel, ming.m.lin,
	yanmin_zhang, peterz, penberg

On Fri, 24 Apr 2009 11:47:17 +0100 Mel Gorman <mel@csn.ul.ie> wrote:

> Uninline gfp_to_alloc_flags() in the page allocator slow path

Well, there are <boggle> 37 inlines in page_alloc.c, so uninlining a
single function (and leaving its layout mucked up ;)) is a bit random.

Perhaps sometime you could take a look at "[patch] page allocator:
rationalise inlining"?

I'm kind of in two minds about it.  Do we really know that all approved
versions of gcc will do the most desirable thing in all circumstances
on all architectures?
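
As an added aside on that question: where the decision really matters, the
kernel already takes it away from gcc with explicit annotations. A minimal
sketch using those attributes, with userspace fallback defines so it compiles
standalone (the helper names are made up):

#include <stdio.h>

#ifndef noinline
#define noinline        __attribute__((noinline))
#endif
#ifndef __always_inline
#define __always_inline inline __attribute__((always_inline))
#endif

/* Cold path: keep the body out of every caller's text. */
static noinline unsigned long cold_helper(unsigned long x)
{
        return x * 3;
}

/* Hot path: inline it no matter what the compiler's heuristics say. */
static __always_inline unsigned long hot_helper(unsigned long x)
{
        return x + 1;
}

int main(void)
{
        printf("%lu\n", hot_helper(cold_helper(7)));
        return 0;
}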


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 00/22] Cleanup and optimise the page allocator V7
  2009-04-22 13:53 ` Mel Gorman
@ 2009-04-27  7:58   ` Zhang, Yanmin
  -1 siblings, 0 replies; 186+ messages in thread
From: Zhang, Yanmin @ 2009-04-27  7:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> Here is V7 of the cleanup and optimisation of the page allocator and
> it should be ready for wider testing. Please consider a possibility for
> merging as a Pass 1 at making the page allocator faster. Other passes will
> occur later when this one has had a bit of exercise. This patchset is based
> on mmotm-2009-04-17 and I've tested it successfully on a small number of
> machines.
We ran some performance benchmarks against the V7 patch on top of 2.6.30-rc3.
It seems some counters in the kernel become incorrect after we run ffsb (a disk
I/O benchmark) and swap-cp (a simple swap-memory test that uses cp on tmpfs).
Free memory is reported as larger than total memory.

[ymzhang@lkp-st02-x8664 ~]$ uname -a
Linux lkp-st02-x8664 2.6.30-rc3-mgpage #1 SMP Thu Apr 23 16:09:43 CST 2009 x86_64 x86_64 x86_64 GNU/Linux
[ymzhang@lkp-st02-x8664 ~]$ free
             total       used       free     shared    buffers     cached
Mem:       8166564 18014398497625640   20022908          0    2364424     247224
-/+ buffers/cache: 18014398495013992   22634556
Swap:            0          0          0
[ymzhang@lkp-st02-x8664 ~]$ cat /proc/meminfo 
MemTotal:        8166564 kB
MemFree:        20022916 kB
Buffers:         2364424 kB
Cached:           247224 kB
SwapCached:            0 kB
Active:          2414520 kB
Inactive:         206168 kB
Active(anon):       4316 kB
Inactive(anon):     4932 kB
Active(file):    2410204 kB
Inactive(file):   201236 kB



[ymzhang@lkp-ne01 ~]$ uname -a
Linux lkp-ne01 2.6.30-rc3-mgpage #1 SMP Thu Apr 23 15:04:27 CST 2009 x86_64 x86_64 x86_64 GNU/Linux
[ymzhang@lkp-ne01 ~]$ free
             total       used       free     shared    buffers     cached
Mem:       6116356 18014398509340432    6257908          0     609804    1053512
-/+ buffers/cache: 18014398507677116    7921224
Swap:     15631204          0   15631204
[ymzhang@lkp-ne01 ~]$ cat /proc/meminfo 
MemTotal:        6116356 kB
MemFree:         6257948 kB
Buffers:          609804 kB
Cached:          1053512 kB
SwapCached:            0 kB
Active:           723152 kB



Or a simple kernel source cp/rm/cp:
[ymzhang@lkp-ne01 linux-2.6.30-rc3_melgorman]$ uname -a
Linux lkp-ne01 2.6.30-rc3-mgpage #1 SMP Thu Apr 23 15:04:27 CST 2009 x86_64 x86_64 x86_64 GNU/Linux
[ymzhang@lkp-ne01 linux-2.6.30-rc3_melgorman]$ free
             total       used       free     shared    buffers     cached
Mem:       6116356    1309940    4806416          0      82184    1259072
-/+ buffers/cache: 18014398509450668    6147672
Swap:     15631204          0   15631204
[ymzhang@lkp-ne01 linux-2.6.30-rc3_melgorman]$ cat /proc/meminfo 
MemTotal:        6116356 kB
MemFree:         4806724 kB
Buffers:           82184 kB
Cached:          1259072 kB
SwapCached:            0 kB
Active:           477396 kB
Inactive:         872388 kB
Active(anon):       8704 kB
Inactive(anon):        0 kB
Active(file):     468692 kB
Inactive(file):   872388 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      15631204 kB
SwapFree:       15631204 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:          8632 kB
Mapped:             4140 kB
Slab:             174504 kB
SReclaimable:     154976 kB
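
A note added for readers of the archive: the impossible "used" column above is
plain unsigned wrap-around once MemFree drifts past MemTotal. A minimal sketch,
assuming the reporting tool does the subtraction in unsigned 64-bit byte units,
reproduces the exact figure from the first report:

#include <stdio.h>

int main(void)
{
        /* kB figures taken from the first `free` output above. */
        unsigned long long total_kb = 8166564ULL;
        unsigned long long free_kb  = 20022908ULL;

        /* Unsigned wrap-around: free > total, so "used" becomes roughly 2^54 kB. */
        unsigned long long used_kb = (total_kb * 1024ULL - free_kb * 1024ULL) / 1024ULL;

        printf("used = %llu kB\n", used_kb);    /* 18014398497625640, as reported */
        return 0;
}

That the free count itself is too high points at NR_FREE_PAGES being
over-credited, which is what the follow-up fix below addresses.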


> 
> The performance improvements are in a wide range depending on the exact
> machine but the results I've seen so far are approximately;
> 
> kernbench:	0	to	 0.12% (elapsed time)
> 		0.49%	to	 3.20% (sys time)
> aim9:		-4%	to	30% (for page_test and brk_test)
> tbench:		-1%	to	 4%
> hackbench:	-2.5%	to	 3.45% (mostly within the noise though)
> netperf-udp	-1.34%  to	 4.06% (varies between machines a bit)
> netperf-tcp	-0.44%  to	 5.22% (varies between machines a bit)
> 
> I haven't sysbench figures at hand, but previously they were within the -0.5%
> to 2% range.
> 
> On netperf, the client and server were bound to opposite number CPUs to
> maximise the problems with cache line bouncing of the struct pages so I
> expect different people to report different results for netperf depending
> on their exact machine and how they ran the test (different machines, same
> cpus client/server, shared cache but two threads client/server, different
> socket client/server etc).
> 
> I also measured the vmlinux sizes for a single x86-based config with
> CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM. The core of the .config
> is based on the Debian Lenny kernel config so I expect it to be reasonably
> typical.
> 
>    text	kernel
> 3355726 mmotm-20090417
> 3355718 0001-Replace-__alloc_pages_internal-with-__alloc_pages_.patch
> 3355622 0002-Do-not-sanity-check-order-in-the-fast-path.patch
> 3355574 0003-Do-not-check-NUMA-node-ID-when-the-caller-knows-the.patch
> 3355574 0004-Check-only-once-if-the-zonelist-is-suitable-for-the.patch
> 3355526 0005-Break-up-the-allocator-entry-point-into-fast-and-slo.patch
> 3355420 0006-Move-check-for-disabled-anti-fragmentation-out-of-fa.patch
> 3355452 0007-Calculate-the-preferred-zone-for-allocation-only-onc.patch
> 3355452 0008-Calculate-the-migratetype-for-allocation-only-once.patch
> 3355436 0009-Calculate-the-alloc_flags-for-allocation-only-once.patch
> 3355436 0010-Remove-a-branch-by-assuming-__GFP_HIGH-ALLOC_HIGH.patch
> 3355420 0011-Inline-__rmqueue_smallest.patch
> 3355420 0012-Inline-buffered_rmqueue.patch
> 3355420 0013-Inline-__rmqueue_fallback.patch
> 3355404 0014-Do-not-call-get_pageblock_migratetype-more-than-ne.patch
> 3355300 0015-Do-not-disable-interrupts-in-free_page_mlock.patch
> 3355300 0016-Do-not-setup-zonelist-cache-when-there-is-only-one-n.patch
> 3355188 0017-Do-not-check-for-compound-pages-during-the-page-allo.patch
> 3355161 0018-Use-allocation-flags-as-an-index-to-the-zone-waterma.patch
> 3355129 0019-Update-NR_FREE_PAGES-only-as-necessary.patch
> 3355129 0020-Get-the-pageblock-migratetype-without-disabling-inte.patch
> 3355129 0021-Use-a-pre-calculated-value-instead-of-num_online_nod.patch
> 
> Some patches were dropped in this revision because while I believe they
> improved performance, they also increase the text size so they need to
> be revisited in isolation to show they actually help things and by how
> much. Other than that, the biggest changes were cleaning up accidental
> functional changes identified by Kosaki Motohiro. Massive credit to him for a
> very detailed review of V6, to Christoph Lameter who reviewed earlier versions
> quite heavily and Pekka who kicked through V6 in quite a lot of detail.
> 
>  arch/ia64/hp/common/sba_iommu.c   |    2 
>  arch/ia64/kernel/mca.c            |    3 
>  arch/ia64/kernel/uncached.c       |    3 
>  arch/ia64/sn/pci/pci_dma.c        |    3 
>  arch/powerpc/platforms/cell/ras.c |    2 
>  arch/x86/kvm/vmx.c                |    2 
>  drivers/misc/sgi-gru/grufile.c    |    2 
>  drivers/misc/sgi-xp/xpc_uv.c      |    2 
>  include/linux/gfp.h               |   27 -
>  include/linux/mm.h                |    1 
>  include/linux/mmzone.h            |   11 
>  include/linux/nodemask.h          |   15 -
>  kernel/profile.c                  |    8 
>  mm/filemap.c                      |    2 
>  mm/hugetlb.c                      |    8 
>  mm/internal.h                     |   11 
>  mm/mempolicy.c                    |    2 
>  mm/migrate.c                      |    2 
>  mm/page_alloc.c                   |  555 ++++++++++++++++++++++++--------------
>  mm/slab.c                         |   11 
>  mm/slob.c                         |    4 
>  mm/slub.c                         |    2 
>  net/sunrpc/svc.c                  |    2 
>  23 files changed, 424 insertions(+), 256 deletions(-)
> 
> Changes since V6
>   o Remove unintentional functional changes when splitting into fast and slow paths
>   o Drop patch 7 for zonelist filtering as it modified when zlc_setup() is called
>     for the wrong reasons. The patch that avoids calling it for non-NUMA machines is
>     still there which has the bulk of the saving. cpusets is relatively small
>   o Drop an unnecessary check for in_interrupt() in gfp_to_alloc_flags()
>   o Clarify comment on __GFP_HIGH == ALLOC_HIGH
>   o Redefine the watermark mask to be expressed in terms of ALLOC_MARK_NOWATERMARK
>   o Use BUILD_BUG_ON for checking __GFP_HIGH == ALLOC_HIGH
>   o Drop some patches that were not reducing text sizes as expected
>   o Remove numa_platform from slab
> 
> Change since V5
>   o Rebase to mmotm-2009-04-17
> 
> Changes since V4
>   o Drop the more controversial patches for now and focus on the "obvious win"
>     material.
>   o Add reviewed-by notes
>   o Fix changelog entry to say __rmqueue_fallback instead of __rmqueue
>   o Add unlikely() for the clearMlocked check
>   o Change where PGFREE is accounted in free_hot_cold_page() to have symmetry
>     with __free_pages_ok()
>   o Convert num_online_nodes() to use a static value so that callers do
>     not have to be individually updated
>   o Rebase to mmotm-2009-03-13
> 
> Changes since V3
>   o Drop the more controversial patches for now and focus on the "obvious win"
>     material
>   o Add reviewed-by notes
>   o Fix changelog entry to say __rmqueue_fallback instead of __rmqueue
>   o Add unlikely() for the clearMlocked check
>   o Change where PGFREE is accounted in free_hot_cold_page() to have symmetry
>     with __free_pages_ok()
> 
> Changes since V2
>   o Remove branches by treating watermark flags as array indices
>   o Remove branch by assuming __GFP_HIGH == ALLOC_HIGH
>   o Do not check for compound on every page free
>   o Remove branch by always ensuring the migratetype is known on free
>   o Simplify buffered_rmqueue further
>   o Reintroduce improved version of batched bulk free of pcp pages
>   o Use allocation flags as an index to zone watermarks
>   o Work out __GFP_COLD only once
>   o Reduce the number of times zone stats are updated
>   o Do not dump reserve pages back into the allocator. Instead treat them
>     as MOVABLE so that MIGRATE_RESERVE gets used on the max-order-overlapped
>     boundaries without causing trouble
>   o Allow pages up to PAGE_ALLOC_COSTLY_ORDER to use the per-cpu allocator.
>     order-1 allocations in particular are frequent enough to justify this
>   o Rearrange inlining such that the hot-path is inlined but not in a way
>     that increases the text size of the page allocator
>   o Make the check for needing additional zonelist filtering due to NUMA
>     or cpusets as light as possible
>   o Do not destroy compound pages going to the PCP lists
>   o Delay the merging of buddies until a high-order allocation needs them
>     or anti-fragmentation is being forced to fallback
> 
> Changes since V1
>   o Remove the ifdef CONFIG_CPUSETS from inside get_page_from_freelist()
>   o Use non-lock bit operations for clearing the mlock flag
>   o Factor out alloc_flags calculation so it is only done once (Peter)
>   o Make gfp.h a bit prettier and clear-cut (Peter)
>   o Instead of deleting a debugging check, replace page_count() in the
>     free path with a version that does not check for compound pages (Nick)
>   o Drop the alteration for hot/cold page freeing until we know if it
>     helps or not
> 


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 00/22] Cleanup and optimise the page allocator V7
  2009-04-27  7:58   ` Zhang, Yanmin
@ 2009-04-27 14:38     ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-27 14:38 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Mon, Apr 27, 2009 at 03:58:39PM +0800, Zhang, Yanmin wrote:
> On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > Here is V7 of the cleanup and optimisation of the page allocator and
> > it should be ready for wider testing. Please consider a possibility for
> > merging as a Pass 1 at making the page allocator faster. Other passes will
> > occur later when this one has had a bit of exercise. This patchset is based
> > on mmotm-2009-04-17 and I've tested it successfully on a small number of
> > machines.
> We ran some performance benchmarks against the V7 patch on top of 2.6.30-rc3.
> It seems some counters in the kernel become incorrect after we run ffsb (a disk
> I/O benchmark) and swap-cp (a simple swap-memory test that uses cp on tmpfs).
> Free memory is reported as larger than total memory.
> 

oops. Can you try this patch please?

==== CUT HERE ====

Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()

free_pages_bulk() updates the number of free pages in the zone, but it assumes
that the pages being freed are order-0. While this is currently always true,
it is wrong to rely on the order being 0. This patch fixes the problem.

buffered_rmqueue() does not update NR_FREE_PAGES when allocating pages with
__rmqueue(). This means that high-order allocations leave the free page count
too high, eventually leading to a situation where free pages appear to exceed
available RAM. This patch accounts for those allocated pages properly.

This is a candidate fix to the patch
page-allocator-update-nr_free_pages-only-as-necessary.patch. It has yet to be
verified as fixing a problem where the free pages count is getting corrupted.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3db5f57..dd69593 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -545,7 +545,7 @@ static void free_pages_bulk(struct zone *zone, int count,
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
 
-	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, count << order);
 	while (count--) {
 		struct page *page;
 
@@ -1151,6 +1151,7 @@ again:
 	} else {
 		spin_lock_irqsave(&zone->lock, flags);
 		page = __rmqueue(zone, order, migratetype);
+		__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
 		spin_unlock(&zone->lock);
 		if (!page)
 			goto failed;

^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [RFC] Replace the watermark-related union in struct zone with a watermark[] array
  2009-04-22 20:06     ` David Rientjes
@ 2009-04-27 17:00       ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-27 17:00 UTC (permalink / raw)
  To: Linux Memory Management List
  Cc: David Rientjes, KOSAKI Motohiro, Dave Hansen, Linux Kernel Mailing List

Just build- and boot-tested. This is to illustrate what a patch replacing the
union with an array looks like, and to see whether people are happy with it or
think it should look like something else.
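
To make the shape of the change concrete before the patch itself, here is an
added side-by-side sketch (field layout only, heavily trimmed from the real
struct zone; the _before/_after names are illustrative):

/* Before: the -mm union, where pages_mark[] aliases the three named fields. */
struct zone_before {
        union {
                struct {
                        unsigned long pages_min, pages_low, pages_high;
                };
                unsigned long pages_mark[3];
        };
};

/* After this RFC: a single array indexed by the WMARK_* enum. */
enum zone_watermarks { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

struct zone_after {
        unsigned long watermark[NR_WMARK];
};

/* Call sites change from zone->pages_high to zone->watermark[WMARK_HIGH]. */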

==== CUT HERE ====

Patch page-allocator-use-allocation-flags-as-an-index-to-the-zone-watermark
from -mm added a union to struct zone where the watermarks could be
accessed with either zone->pages_* or a pages_mark array. The concern
was that this aliasing caused more confusion than it helped.

This patch replaces the union with a watermark array indexed by the WMARK_*
values and updates all sites that use zone->pages_*.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 Documentation/sysctl/vm.txt |   11 ++++++-----
 Documentation/vm/balance    |   18 +++++++++---------
 arch/m32r/mm/discontig.c    |    6 +++---
 include/linux/mmzone.h      |   16 ++++++++++------
 mm/page_alloc.c             |   41 +++++++++++++++++++++--------------------
 mm/vmscan.c                 |   41 +++++++++++++++++++++++------------------
 mm/vmstat.c                 |    6 +++---
 7 files changed, 75 insertions(+), 64 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 97c4b32..f79e12b 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -231,8 +231,8 @@ These protections are added to score to judge whether this zone should be used
 for page allocation or should be reclaimed.
 
 In this example, if normal pages (index=2) are required to this DMA zone and
-pages_high is used for watermark, the kernel judges this zone should not be
-used because pages_free(1355) is smaller than watermark + protection[2]
+watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should
+not be used because pages_free(1355) is smaller than watermark + protection[2]
 (4 + 2004 = 2008). If this protection value is 0, this zone would be used for
 normal page requirement. If requirement is DMA zone(index=0), protection[0]
 (=0) is used.
@@ -278,9 +278,10 @@ The default value is 65536.
 min_free_kbytes:
 
 This is used to force the Linux VM to keep a minimum number
-of kilobytes free.  The VM uses this number to compute a pages_min
-value for each lowmem zone in the system.  Each lowmem zone gets
-a number of reserved free pages based proportionally on its size.
+of kilobytes free.  The VM uses this number to compute a
+watermark[WMARK_MIN] value for each lowmem zone in the system.
+Each lowmem zone gets a number of reserved free pages based
+proportionally on its size.
 
 Some minimal amount of memory is needed to satisfy PF_MEMALLOC
 allocations; if you set this to lower than 1024KB, your system will
diff --git a/Documentation/vm/balance b/Documentation/vm/balance
index bd3d31b..c46e68c 100644
--- a/Documentation/vm/balance
+++ b/Documentation/vm/balance
@@ -75,15 +75,15 @@ Page stealing from process memory and shm is done if stealing the page would
 alleviate memory pressure on any zone in the page's node that has fallen below
 its watermark.
 
-pages_min/pages_low/pages_high/low_on_memory/zone_wake_kswapd: These are 
-per-zone fields, used to determine when a zone needs to be balanced. When
-the number of pages falls below pages_min, the hysteric field low_on_memory
-gets set. This stays set till the number of free pages becomes pages_high.
-When low_on_memory is set, page allocation requests will try to free some
-pages in the zone (providing GFP_WAIT is set in the request). Orthogonal
-to this, is the decision to poke kswapd to free some zone pages. That
-decision is not hysteresis based, and is done when the number of free
-pages is below pages_low; in which case zone_wake_kswapd is also set.
+watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These
+are per-zone fields, used to determine when a zone needs to be balanced. When
+the number of pages falls below watermark[WMARK_MIN], the hysteric field
+low_on_memory gets set. This stays set till the number of free pages becomes
+watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will
+try to free some pages in the zone (providing GFP_WAIT is set in the request).
+Orthogonal to this, is the decision to poke kswapd to free some zone pages.
+That decision is not hysteresis based, and is done when the number of free
+pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set.
 
 
 (Good) Ideas that I have heard:
diff --git a/arch/m32r/mm/discontig.c b/arch/m32r/mm/discontig.c
index 7daf897..b7a78ad 100644
--- a/arch/m32r/mm/discontig.c
+++ b/arch/m32r/mm/discontig.c
@@ -154,9 +154,9 @@ unsigned long __init zone_sizes_init(void)
 	 *  Use all area of internal RAM.
 	 *  see __alloc_pages()
 	 */
-	NODE_DATA(1)->node_zones->pages_min = 0;
-	NODE_DATA(1)->node_zones->pages_low = 0;
-	NODE_DATA(1)->node_zones->pages_high = 0;
+	NODE_DATA(1)->node_zones->watermark[WMARK_MIN] = 0;
+	NODE_DATA(1)->node_zones->watermark[WMARK_LOW] = 0;
+	NODE_DATA(1)->node_zones->watermark[WMARK_HIGH] = 0;
 
 	return holes;
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c1fa208..1ff59fd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,13 @@ static inline int is_unevictable_lru(enum lru_list l)
 #endif
 }
 
+enum zone_watermarks {
+	WMARK_MIN,
+	WMARK_LOW,
+	WMARK_HIGH,
+	NR_WMARK
+};
+
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
@@ -275,12 +282,9 @@ struct zone_reclaim_stat {
 
 struct zone {
 	/* Fields commonly accessed by the page allocator */
-	union {
-		struct {
-			unsigned long	pages_min, pages_low, pages_high;
-		};
-		unsigned long pages_mark[3];
-	};
+
+	/* zone watermarks, indexed with WMARK_LOW, WMARK_MIN and WMARK_HIGH */
+	unsigned long watermark[NR_WMARK];
 
 	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f5d5a63..5dd2d59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1172,10 +1172,10 @@ failed:
 	return NULL;
 }
 
-/* The WMARK bits are used as an index zone->pages_mark */
-#define ALLOC_WMARK_MIN		0x00 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW		0x01 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH	0x02 /* use pages_high watermark */
+/* The ALLOC_WMARK bits are used as an index to zone->watermark */
+#define ALLOC_WMARK_MIN		WMARK_MIN
+#define ALLOC_WMARK_LOW		WMARK_LOW
+#define ALLOC_WMARK_HIGH	WMARK_HIGH
 #define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */
 
 /* Mask to get the watermark bits */
@@ -1464,9 +1464,10 @@ zonelist_scan:
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				goto try_next_zone;
 
+		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
 			unsigned long mark;
-			mark = zone->pages_mark[alloc_flags & ALLOC_WMARK_MASK];
+			mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 			if (!zone_watermark_ok(zone, order, mark,
 				    classzone_idx, alloc_flags)) {
 				if (!zone_reclaim_mode ||
@@ -1987,7 +1988,7 @@ static unsigned int nr_free_zone_pages(int offset)
 
 	for_each_zone_zonelist(zone, z, zonelist, offset) {
 		unsigned long size = zone->present_pages;
-		unsigned long high = zone->pages_high;
+		unsigned long high = zone->watermark[WMARK_HIGH];
 		if (size > high)
 			sum += size - high;
 	}
@@ -2124,9 +2125,9 @@ void show_free_areas(void)
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
-			K(zone->pages_min),
-			K(zone->pages_low),
-			K(zone->pages_high),
+			K(zone->watermark[WMARK_MIN]),
+			K(zone->watermark[WMARK_LOW]),
+			K(zone->watermark[WMARK_HIGH]),
 			K(zone_page_state(zone, NR_ACTIVE_ANON)),
 			K(zone_page_state(zone, NR_INACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ACTIVE_FILE)),
@@ -2730,8 +2731,8 @@ static inline unsigned long wait_table_bits(unsigned long size)
 
 /*
  * Mark a number of pageblocks as MIGRATE_RESERVE. The number
- * of blocks reserved is based on zone->pages_min. The memory within the
- * reserve will tend to store contiguous free pages. Setting min_free_kbytes
+ * of blocks reserved is based on zone->watermark[WMARK_MIN]. The memory within
+ * the reserve will tend to store contiguous free pages. Setting min_free_kbytes
  * higher will lead to a bigger reserve which will get freed as contiguous
  * blocks as reclaim kicks in
  */
@@ -2744,7 +2745,7 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 	/* Get the start pfn, end pfn and the number of blocks to reserve */
 	start_pfn = zone->zone_start_pfn;
 	end_pfn = start_pfn + zone->spanned_pages;
-	reserve = roundup(zone->pages_min, pageblock_nr_pages) >>
+	reserve = roundup(zone->watermark[WMARK_MIN], pageblock_nr_pages) >>
 							pageblock_order;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
@@ -4400,8 +4401,8 @@ static void calculate_totalreserve_pages(void)
 					max = zone->lowmem_reserve[j];
 			}
 
-			/* we treat pages_high as reserved pages. */
-			max += zone->pages_high;
+			/* we treat WMARK_HIGH as reserved pages. */
+			max += zone->watermark[WMARK_HIGH];
 
 			if (max > zone->present_pages)
 				max = zone->present_pages;
@@ -4481,7 +4482,7 @@ void setup_per_zone_pages_min(void)
 			 * need highmem pages, so cap pages_min to a small
 			 * value here.
 			 *
-			 * The (pages_high-pages_low) and (pages_low-pages_min)
+			 * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
 			 * deltas controls asynch page reclaim, and so should
 			 * not be capped for highmem.
 			 */
@@ -4492,17 +4493,17 @@ void setup_per_zone_pages_min(void)
 				min_pages = SWAP_CLUSTER_MAX;
 			if (min_pages > 128)
 				min_pages = 128;
-			zone->pages_min = min_pages;
+			zone->watermark[WMARK_MIN] = min_pages;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
-			zone->pages_min = tmp;
+			zone->watermark[WMARK_MIN] = tmp;
 		}
 
-		zone->pages_low   = zone->pages_min + (tmp >> 2);
-		zone->pages_high  = zone->pages_min + (tmp >> 1);
+		zone->watermark[WMARK_LOW]  = zone->watermark[WMARK_MIN] + (tmp >> 2);
+		zone->watermark[WMARK_HIGH] = zone->watermark[WMARK_MIN] + (tmp >> 1);
 		setup_zone_migrate_reserve(zone);
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
@@ -4647,7 +4648,7 @@ int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
  *	whenever sysctl_lowmem_reserve_ratio changes.
  *
  * The reserve ratio obviously has absolutely no relation with the
- * pages_min watermarks. The lowmem reserve ratio can only make sense
+ * watermark[WMARK_MIN] watermarks. The lowmem reserve ratio can only make sense
  * if in function of the boot time zone sizes.
  */
 int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cef1801..371447c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1406,7 +1406,7 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
 		free  = zone_page_state(zone, NR_FREE_PAGES);
 		/* If we have very few page cache pages,
 		   force-scan anon pages. */
-		if (unlikely(file + free <= zone->pages_high)) {
+		if (unlikely(file + free <= zone->watermark[WMARK_HIGH])) {
 			percent[0] = 100;
 			percent[1] = 0;
 			return;
@@ -1538,11 +1538,13 @@ static void shrink_zone(int priority, struct zone *zone,
  * try to reclaim pages from zones which will satisfy the caller's allocation
  * request.
  *
- * We reclaim from a zone even if that zone is over pages_high.  Because:
+ * We reclaim from a zone even if that zone is over watermark[WMARK_HIGH].
+ * Because:
  * a) The caller may be trying to free *extra* pages to satisfy a higher-order
  *    allocation or
- * b) The zones may be over pages_high but they must go *over* pages_high to
- *    satisfy the `incremental min' zone defense algorithm.
+ * b) The zones may be over watermark[WMARK_HIGH] but they must go *over*
+ *    watermark[WMARK_HIGH] to satisfy the `incremental min' zone defense
+ *    algorithm.
  *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
@@ -1748,7 +1750,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 
 /*
  * For kswapd, balance_pgdat() will work across all this node's zones until
- * they are all at pages_high.
+ * they are all at watermark[WMARK_HIGH].
  *
  * Returns the number of pages which were actually freed.
  *
@@ -1761,11 +1763,11 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
  * the zone for when the problem goes away.
  *
  * kswapd scans the zones in the highmem->normal->dma direction.  It skips
- * zones which have free_pages > pages_high, but once a zone is found to have
- * free_pages <= pages_high, we scan that zone and the lower zones regardless
- * of the number of free pages in the lower zones.  This interoperates with
- * the page allocator fallback scheme to ensure that aging of pages is balanced
- * across the zones.
+ * zones which have free_pages > watermark[WMARK_HIGH], but once a zone is
+ * found to have free_pages <= watermark[WMARK_HIGH], we scan that zone and the
+ * lower zones regardless of the number of free pages in the lower zones. This
+ * interoperates with the page allocator fallback scheme to ensure that aging
+ * of pages is balanced across the zones.
  */
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 {
@@ -1786,7 +1788,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 	};
 	/*
 	 * temp_priority is used to remember the scanning priority at which
-	 * this zone was successfully refilled to free_pages == pages_high.
+	 * this zone was successfully refilled to
+	 * free_pages == watermark[WMARK_HIGH].
 	 */
 	int temp_priority[MAX_NR_ZONES];
 
@@ -1831,8 +1834,8 @@ loop_again:
 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
 							&sc, priority, 0);
 
-			if (!zone_watermark_ok(zone, order, zone->pages_high,
-					       0, 0)) {
+			if (!zone_watermark_ok(zone, order,
+					zone->watermark[WMARK_HIGH], 0, 0)) {
 				end_zone = i;
 				break;
 			}
@@ -1866,8 +1869,9 @@ loop_again:
 					priority != DEF_PRIORITY)
 				continue;
 
-			if (!zone_watermark_ok(zone, order, zone->pages_high,
-					       end_zone, 0))
+			if (!zone_watermark_ok(zone, order,
+					zone->watermark[WMARK_HIGH],
+					end_zone, 0))
 				all_zones_ok = 0;
 			temp_priority[i] = priority;
 			sc.nr_scanned = 0;
@@ -1876,8 +1880,9 @@ loop_again:
 			 * We put equal pressure on every zone, unless one
 			 * zone has way too many pages free already.
 			 */
-			if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
-						end_zone, 0))
+			if (!zone_watermark_ok(zone, order,
+					8*zone->watermark[WMARK_HIGH],
+					end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
@@ -2043,7 +2048,7 @@ void wakeup_kswapd(struct zone *zone, int order)
 		return;
 
 	pgdat = zone->zone_pgdat;
-	if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0))
+	if (zone_watermark_ok(zone, order, zone->watermark[WMARK_LOW], 0, 0))
 		return;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 66f6130..17f2abb 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -725,9 +725,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
-		   zone->pages_min,
-		   zone->pages_low,
-		   zone->pages_high,
+		   zone->watermark[WMARK_MIN],
+		   zone->watermark[WMARK_LOW],
+		   zone->watermark[WMARK_HIGH],
 		   zone->pages_scanned,
 		   zone->lru[LRU_ACTIVE_ANON].nr_scan,
 		   zone->lru[LRU_INACTIVE_ANON].nr_scan,

^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [RFC] Replace the watermark-related union in struct zone with a watermark[] array
@ 2009-04-27 17:00       ` Mel Gorman
  0 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-27 17:00 UTC (permalink / raw)
  To: Linux Memory Management List
  Cc: David Rientjes, KOSAKI Motohiro, Dave Hansen, Linux Kernel Mailing List

Just build and boot-tested. This is to illustrate what the patch looks like
to replace the union with an array and to see if people are happy with
it or think it should look like something else.

==== CUT HERE ====

Patch page-allocator-use-allocation-flags-as-an-index-to-the-zone-watermark
from -mm added a union to struct zone where the watermarks could be
accessed with either zone->pages_* or a pages_mark array. The concern
was that this aliasing caused more confusion than it helped.

This patch replaces the union with a watermark array that is indexed with
WMARK_* defines and updates all sites that talk about zone->pages_*.
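
As a rough illustration of the end result (sketch only, not part of the patch;
the struct name below is made up for the example), access changes along these
lines:

enum zone_watermarks {
	WMARK_MIN,
	WMARK_LOW,
	WMARK_HIGH,
	NR_WMARK
};

struct zone_sketch {
	/* replaces the pages_min, pages_low and pages_high fields */
	unsigned long watermark[NR_WMARK];
};

/* old code: return free_pages < zone->pages_low; */
static inline int below_low_watermark(struct zone_sketch *zone,
				      unsigned long free_pages)
{
	return free_pages < zone->watermark[WMARK_LOW];
}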

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 Documentation/sysctl/vm.txt |   11 ++++++-----
 Documentation/vm/balance    |   18 +++++++++---------
 arch/m32r/mm/discontig.c    |    6 +++---
 include/linux/mmzone.h      |   16 ++++++++++------
 mm/page_alloc.c             |   41 +++++++++++++++++++++--------------------
 mm/vmscan.c                 |   41 +++++++++++++++++++++++------------------
 mm/vmstat.c                 |    6 +++---
 7 files changed, 75 insertions(+), 64 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 97c4b32..f79e12b 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -231,8 +231,8 @@ These protections are added to score to judge whether this zone should be used
 for page allocation or should be reclaimed.
 
 In this example, if normal pages (index=2) are required to this DMA zone and
-pages_high is used for watermark, the kernel judges this zone should not be
-used because pages_free(1355) is smaller than watermark + protection[2]
+watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should
+not be used because pages_free(1355) is smaller than watermark + protection[2]
 (4 + 2004 = 2008). If this protection value is 0, this zone would be used for
 normal page requirement. If requirement is DMA zone(index=0), protection[0]
 (=0) is used.
@@ -278,9 +278,10 @@ The default value is 65536.
 min_free_kbytes:
 
 This is used to force the Linux VM to keep a minimum number
-of kilobytes free.  The VM uses this number to compute a pages_min
-value for each lowmem zone in the system.  Each lowmem zone gets
-a number of reserved free pages based proportionally on its size.
+of kilobytes free.  The VM uses this number to compute a
+watermark[WMARK_MIN] value for each lowmem zone in the system.
+Each lowmem zone gets a number of reserved free pages based
+proportionally on its size.
 
 Some minimal amount of memory is needed to satisfy PF_MEMALLOC
 allocations; if you set this to lower than 1024KB, your system will
diff --git a/Documentation/vm/balance b/Documentation/vm/balance
index bd3d31b..c46e68c 100644
--- a/Documentation/vm/balance
+++ b/Documentation/vm/balance
@@ -75,15 +75,15 @@ Page stealing from process memory and shm is done if stealing the page would
 alleviate memory pressure on any zone in the page's node that has fallen below
 its watermark.
 
-pages_min/pages_low/pages_high/low_on_memory/zone_wake_kswapd: These are 
-per-zone fields, used to determine when a zone needs to be balanced. When
-the number of pages falls below pages_min, the hysteric field low_on_memory
-gets set. This stays set till the number of free pages becomes pages_high.
-When low_on_memory is set, page allocation requests will try to free some
-pages in the zone (providing GFP_WAIT is set in the request). Orthogonal
-to this, is the decision to poke kswapd to free some zone pages. That
-decision is not hysteresis based, and is done when the number of free
-pages is below pages_low; in which case zone_wake_kswapd is also set.
+watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These
+are per-zone fields, used to determine when a zone needs to be balanced. When
+the number of pages falls below watermark[WMARK_MIN], the hysteric field
+low_on_memory gets set. This stays set till the number of free pages becomes
+watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will
+try to free some pages in the zone (providing GFP_WAIT is set in the request).
+Orthogonal to this, is the decision to poke kswapd to free some zone pages.
+That decision is not hysteresis based, and is done when the number of free
+pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set.
 
 
 (Good) Ideas that I have heard:
diff --git a/arch/m32r/mm/discontig.c b/arch/m32r/mm/discontig.c
index 7daf897..b7a78ad 100644
--- a/arch/m32r/mm/discontig.c
+++ b/arch/m32r/mm/discontig.c
@@ -154,9 +154,9 @@ unsigned long __init zone_sizes_init(void)
 	 *  Use all area of internal RAM.
 	 *  see __alloc_pages()
 	 */
-	NODE_DATA(1)->node_zones->pages_min = 0;
-	NODE_DATA(1)->node_zones->pages_low = 0;
-	NODE_DATA(1)->node_zones->pages_high = 0;
+	NODE_DATA(1)->node_zones->watermark[WMARK_MIN] = 0;
+	NODE_DATA(1)->node_zones->watermark[WMARK_LOW] = 0;
+	NODE_DATA(1)->node_zones->watermark[WMARK_HIGH] = 0;
 
 	return holes;
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c1fa208..1ff59fd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,13 @@ static inline int is_unevictable_lru(enum lru_list l)
 #endif
 }
 
+enum zone_watermarks {
+	WMARK_MIN,
+	WMARK_LOW,
+	WMARK_HIGH,
+	NR_WMARK
+};
+
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
@@ -275,12 +282,9 @@ struct zone_reclaim_stat {
 
 struct zone {
 	/* Fields commonly accessed by the page allocator */
-	union {
-		struct {
-			unsigned long	pages_min, pages_low, pages_high;
-		};
-		unsigned long pages_mark[3];
-	};
+
+	/* zone watermarks, indexed with WMARK_LOW, WMARK_MIN and WMARK_HIGH */
+	unsigned long watermark[NR_WMARK];
 
 	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f5d5a63..5dd2d59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1172,10 +1172,10 @@ failed:
 	return NULL;
 }
 
-/* The WMARK bits are used as an index zone->pages_mark */
-#define ALLOC_WMARK_MIN		0x00 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW		0x01 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH	0x02 /* use pages_high watermark */
+/* The ALLOC_WMARK bits are used as an index to zone->watermark */
+#define ALLOC_WMARK_MIN		WMARK_MIN
+#define ALLOC_WMARK_LOW		WMARK_LOW
+#define ALLOC_WMARK_HIGH	WMARK_HIGH
 #define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */
 
 /* Mask to get the watermark bits */
@@ -1464,9 +1464,10 @@ zonelist_scan:
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				goto try_next_zone;
 
+		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
 			unsigned long mark;
-			mark = zone->pages_mark[alloc_flags & ALLOC_WMARK_MASK];
+			mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 			if (!zone_watermark_ok(zone, order, mark,
 				    classzone_idx, alloc_flags)) {
 				if (!zone_reclaim_mode ||
@@ -1987,7 +1988,7 @@ static unsigned int nr_free_zone_pages(int offset)
 
 	for_each_zone_zonelist(zone, z, zonelist, offset) {
 		unsigned long size = zone->present_pages;
-		unsigned long high = zone->pages_high;
+		unsigned long high = zone->watermark[WMARK_HIGH];
 		if (size > high)
 			sum += size - high;
 	}
@@ -2124,9 +2125,9 @@ void show_free_areas(void)
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
-			K(zone->pages_min),
-			K(zone->pages_low),
-			K(zone->pages_high),
+			K(zone->watermark[WMARK_MIN]),
+			K(zone->watermark[WMARK_LOW]),
+			K(zone->watermark[WMARK_HIGH]),
 			K(zone_page_state(zone, NR_ACTIVE_ANON)),
 			K(zone_page_state(zone, NR_INACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ACTIVE_FILE)),
@@ -2730,8 +2731,8 @@ static inline unsigned long wait_table_bits(unsigned long size)
 
 /*
  * Mark a number of pageblocks as MIGRATE_RESERVE. The number
- * of blocks reserved is based on zone->pages_min. The memory within the
- * reserve will tend to store contiguous free pages. Setting min_free_kbytes
+ * of blocks reserved is based on zone->watermark[WMARK_MIN]. The memory within
+ * the reserve will tend to store contiguous free pages. Setting min_free_kbytes
  * higher will lead to a bigger reserve which will get freed as contiguous
  * blocks as reclaim kicks in
  */
@@ -2744,7 +2745,7 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 	/* Get the start pfn, end pfn and the number of blocks to reserve */
 	start_pfn = zone->zone_start_pfn;
 	end_pfn = start_pfn + zone->spanned_pages;
-	reserve = roundup(zone->pages_min, pageblock_nr_pages) >>
+	reserve = roundup(zone->watermark[WMARK_MIN], pageblock_nr_pages) >>
 							pageblock_order;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
@@ -4400,8 +4401,8 @@ static void calculate_totalreserve_pages(void)
 					max = zone->lowmem_reserve[j];
 			}
 
-			/* we treat pages_high as reserved pages. */
-			max += zone->pages_high;
+			/* we treat WMARK_HIGH as reserved pages. */
+			max += zone->watermark[WMARK_HIGH];
 
 			if (max > zone->present_pages)
 				max = zone->present_pages;
@@ -4481,7 +4482,7 @@ void setup_per_zone_pages_min(void)
 			 * need highmem pages, so cap pages_min to a small
 			 * value here.
 			 *
-			 * The (pages_high-pages_low) and (pages_low-pages_min)
+			 * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
 			 * deltas controls asynch page reclaim, and so should
 			 * not be capped for highmem.
 			 */
@@ -4492,17 +4493,17 @@ void setup_per_zone_pages_min(void)
 				min_pages = SWAP_CLUSTER_MAX;
 			if (min_pages > 128)
 				min_pages = 128;
-			zone->pages_min = min_pages;
+			zone->watermark[WMARK_MIN] = min_pages;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
-			zone->pages_min = tmp;
+			zone->watermark[WMARK_MIN] = tmp;
 		}
 
-		zone->pages_low   = zone->pages_min + (tmp >> 2);
-		zone->pages_high  = zone->pages_min + (tmp >> 1);
+		zone->watermark[WMARK_LOW]  = zone->watermark[WMARK_MIN] + (tmp >> 2);
+		zone->watermark[WMARK_HIGH] = zone->watermark[WMARK_MIN] + (tmp >> 1);
 		setup_zone_migrate_reserve(zone);
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
@@ -4647,7 +4648,7 @@ int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
  *	whenever sysctl_lowmem_reserve_ratio changes.
  *
  * The reserve ratio obviously has absolutely no relation with the
- * pages_min watermarks. The lowmem reserve ratio can only make sense
+ * watermark[WMARK_MIN] watermarks. The lowmem reserve ratio can only make sense
  * if in function of the boot time zone sizes.
  */
 int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cef1801..371447c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1406,7 +1406,7 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
 		free  = zone_page_state(zone, NR_FREE_PAGES);
 		/* If we have very few page cache pages,
 		   force-scan anon pages. */
-		if (unlikely(file + free <= zone->pages_high)) {
+		if (unlikely(file + free <= zone->watermark[WMARK_HIGH])) {
 			percent[0] = 100;
 			percent[1] = 0;
 			return;
@@ -1538,11 +1538,13 @@ static void shrink_zone(int priority, struct zone *zone,
  * try to reclaim pages from zones which will satisfy the caller's allocation
  * request.
  *
- * We reclaim from a zone even if that zone is over pages_high.  Because:
+ * We reclaim from a zone even if that zone is over watermark[WMARK_HIGH].
+ * Because:
  * a) The caller may be trying to free *extra* pages to satisfy a higher-order
  *    allocation or
- * b) The zones may be over pages_high but they must go *over* pages_high to
- *    satisfy the `incremental min' zone defense algorithm.
+ * b) The zones may be over watermark[WMARK_HIGH] but they must go *over*
+ *    watermark[WMARK_HIGH] to satisfy the `incremental min' zone defense
+ *    algorithm.
  *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
@@ -1748,7 +1750,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 
 /*
  * For kswapd, balance_pgdat() will work across all this node's zones until
- * they are all at pages_high.
+ * they are all at watermark[WMARK_HIGH].
  *
  * Returns the number of pages which were actually freed.
  *
@@ -1761,11 +1763,11 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
  * the zone for when the problem goes away.
  *
  * kswapd scans the zones in the highmem->normal->dma direction.  It skips
- * zones which have free_pages > pages_high, but once a zone is found to have
- * free_pages <= pages_high, we scan that zone and the lower zones regardless
- * of the number of free pages in the lower zones.  This interoperates with
- * the page allocator fallback scheme to ensure that aging of pages is balanced
- * across the zones.
+ * zones which have free_pages > watermark[WMARK_HIGH], but once a zone is
+ * found to have free_pages <= watermark[WMARK_HIGH], we scan that zone and the
+ * lower zones regardless of the number of free pages in the lower zones. This
+ * interoperates with the page allocator fallback scheme to ensure that aging
+ * of pages is balanced across the zones.
  */
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 {
@@ -1786,7 +1788,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 	};
 	/*
 	 * temp_priority is used to remember the scanning priority at which
-	 * this zone was successfully refilled to free_pages == pages_high.
+	 * this zone was successfully refilled to
+	 * free_pages == watermark[WMARK_HIGH].
 	 */
 	int temp_priority[MAX_NR_ZONES];
 
@@ -1831,8 +1834,8 @@ loop_again:
 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
 							&sc, priority, 0);
 
-			if (!zone_watermark_ok(zone, order, zone->pages_high,
-					       0, 0)) {
+			if (!zone_watermark_ok(zone, order,
+					zone->watermark[WMARK_HIGH], 0, 0)) {
 				end_zone = i;
 				break;
 			}
@@ -1866,8 +1869,9 @@ loop_again:
 					priority != DEF_PRIORITY)
 				continue;
 
-			if (!zone_watermark_ok(zone, order, zone->pages_high,
-					       end_zone, 0))
+			if (!zone_watermark_ok(zone, order,
+					zone->watermark[WMARK_HIGH],
+					end_zone, 0))
 				all_zones_ok = 0;
 			temp_priority[i] = priority;
 			sc.nr_scanned = 0;
@@ -1876,8 +1880,9 @@ loop_again:
 			 * We put equal pressure on every zone, unless one
 			 * zone has way too many pages free already.
 			 */
-			if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
-						end_zone, 0))
+			if (!zone_watermark_ok(zone, order,
+					8*zone->watermark[WMARK_HIGH],
+					end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
@@ -2043,7 +2048,7 @@ void wakeup_kswapd(struct zone *zone, int order)
 		return;
 
 	pgdat = zone->zone_pgdat;
-	if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0))
+	if (zone_watermark_ok(zone, order, zone->watermark[WMARK_LOW], 0, 0))
 		return;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 66f6130..17f2abb 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -725,9 +725,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
-		   zone->pages_min,
-		   zone->pages_low,
-		   zone->pages_high,
+		   zone->watermark[WMARK_MIN],
+		   zone->watermark[WMARK_LOW],
+		   zone->watermark[WMARK_HIGH],
 		   zone->pages_scanned,
 		   zone->lru[LRU_ACTIVE_ANON].nr_scan,
 		   zone->lru[LRU_INACTIVE_ANON].nr_scan,


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array
  2009-04-27 17:00       ` Mel Gorman
@ 2009-04-27 20:48         ` David Rientjes
  -1 siblings, 0 replies; 186+ messages in thread
From: David Rientjes @ 2009-04-27 20:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Dave Hansen,
	Linux Kernel Mailing List

On Mon, 27 Apr 2009, Mel Gorman wrote:

> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index c1fa208..1ff59fd 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -163,6 +163,13 @@ static inline int is_unevictable_lru(enum lru_list l)
>  #endif
>  }
>  
> +enum zone_watermarks {
> +	WMARK_MIN,
> +	WMARK_LOW,
> +	WMARK_HIGH,
> +	NR_WMARK
> +};
> +
>  struct per_cpu_pages {
>  	int count;		/* number of pages in the list */
>  	int high;		/* high watermark, emptying needed */
> @@ -275,12 +282,9 @@ struct zone_reclaim_stat {
>  
>  struct zone {
>  	/* Fields commonly accessed by the page allocator */
> -	union {
> -		struct {
> -			unsigned long	pages_min, pages_low, pages_high;
> -		};
> -		unsigned long pages_mark[3];
> -	};
> +
> +	/* zone watermarks, indexed with WMARK_LOW, WMARK_MIN and WMARK_HIGH */
> +	unsigned long watermark[NR_WMARK];
>  
>  	/*
>  	 * We don't know if the memory that we're going to allocate will be freeable

I thought the suggestion was for something like

	#define zone_wmark_min(z)	(z->pages_mark[WMARK_MIN])
	...

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array
@ 2009-04-27 20:48         ` David Rientjes
  0 siblings, 0 replies; 186+ messages in thread
From: David Rientjes @ 2009-04-27 20:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Dave Hansen,
	Linux Kernel Mailing List

On Mon, 27 Apr 2009, Mel Gorman wrote:

> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index c1fa208..1ff59fd 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -163,6 +163,13 @@ static inline int is_unevictable_lru(enum lru_list l)
>  #endif
>  }
>  
> +enum zone_watermarks {
> +	WMARK_MIN,
> +	WMARK_LOW,
> +	WMARK_HIGH,
> +	NR_WMARK
> +};
> +
>  struct per_cpu_pages {
>  	int count;		/* number of pages in the list */
>  	int high;		/* high watermark, emptying needed */
> @@ -275,12 +282,9 @@ struct zone_reclaim_stat {
>  
>  struct zone {
>  	/* Fields commonly accessed by the page allocator */
> -	union {
> -		struct {
> -			unsigned long	pages_min, pages_low, pages_high;
> -		};
> -		unsigned long pages_mark[3];
> -	};
> +
> +	/* zone watermarks, indexed with WMARK_LOW, WMARK_MIN and WMARK_HIGH */
> +	unsigned long watermark[NR_WMARK];
>  
>  	/*
>  	 * We don't know if the memory that we're going to allocate will be freeable

I thought the suggestion was for something like

	#define zone_wmark_min(z)	(z->pages_mark[WMARK_MIN])
	...


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array
  2009-04-27 20:54           ` Mel Gorman
@ 2009-04-27 20:51             ` Christoph Lameter
  -1 siblings, 0 replies; 186+ messages in thread
From: Christoph Lameter @ 2009-04-27 20:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Linux Memory Management List, KOSAKI Motohiro,
	Dave Hansen, Linux Kernel Mailing List

On Mon, 27 Apr 2009, Mel Gorman wrote:

> Would people prefer a getter/setter version?

I'd say leave it as is. It unifies the usage of the watermark array.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array
@ 2009-04-27 20:51             ` Christoph Lameter
  0 siblings, 0 replies; 186+ messages in thread
From: Christoph Lameter @ 2009-04-27 20:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Linux Memory Management List, KOSAKI Motohiro,
	Dave Hansen, Linux Kernel Mailing List

On Mon, 27 Apr 2009, Mel Gorman wrote:

> Would people prefer a getter/setter version?

I'd say leave it as is. It unifies the usage of the watermark array.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array
  2009-04-27 20:48         ` David Rientjes
@ 2009-04-27 20:54           ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-27 20:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linux Memory Management List, KOSAKI Motohiro, Dave Hansen,
	Linux Kernel Mailing List

On Mon, Apr 27, 2009 at 01:48:47PM -0700, David Rientjes wrote:
> On Mon, 27 Apr 2009, Mel Gorman wrote:
> 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index c1fa208..1ff59fd 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -163,6 +163,13 @@ static inline int is_unevictable_lru(enum lru_list l)
> >  #endif
> >  }
> >  
> > +enum zone_watermarks {
> > +	WMARK_MIN,
> > +	WMARK_LOW,
> > +	WMARK_HIGH,
> > +	NR_WMARK
> > +};
> > +
> >  struct per_cpu_pages {
> >  	int count;		/* number of pages in the list */
> >  	int high;		/* high watermark, emptying needed */
> > @@ -275,12 +282,9 @@ struct zone_reclaim_stat {
> >  
> >  struct zone {
> >  	/* Fields commonly accessed by the page allocator */
> > -	union {
> > -		struct {
> > -			unsigned long	pages_min, pages_low, pages_high;
> > -		};
> > -		unsigned long pages_mark[3];
> > -	};
> > +
> > +	/* zone watermarks, indexed with WMARK_LOW, WMARK_MIN and WMARK_HIGH */
> > +	unsigned long watermark[NR_WMARK];
> >  
> >  	/*
> >  	 * We don't know if the memory that we're going to allocate will be freeable
> 
> I thought the suggestion was for something like
> 
> 	#define zone_wmark_min(z)	(z->pages_mark[WMARK_MIN])
> 	...

Was it the only suggestion? I thought just replacing the union with an
array would be an option as well.

The #define approach also requires setter versions like

static inline void set_zone_wmark_min(struct zone *z, unsigned long val)
{
	z->pages_mark[WMARK_MIN] = val;
}

and you need one of those for each watermark if you are to avoid weirdness like

zone_wmark_min(z) = val;

which looks all wrong. I felt this approach would look neater and be closer
in appearance to the code that is already there and therefore less surprising.
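
For comparison, a minimal sketch of the full getter/setter pair being talked
about would be something like this (illustrative only, using the pages_mark
name from the suggestion above):

static inline unsigned long zone_wmark_min(struct zone *z)
{
	return z->pages_mark[WMARK_MIN];
}

static inline void set_zone_wmark_min(struct zone *z, unsigned long val)
{
	z->pages_mark[WMARK_MIN] = val;
}

with similar helpers repeated for WMARK_LOW and WMARK_HIGH.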

Would people prefer a getter/setter version?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array
@ 2009-04-27 20:54           ` Mel Gorman
  0 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-27 20:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linux Memory Management List, KOSAKI Motohiro, Dave Hansen,
	Linux Kernel Mailing List

On Mon, Apr 27, 2009 at 01:48:47PM -0700, David Rientjes wrote:
> On Mon, 27 Apr 2009, Mel Gorman wrote:
> 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index c1fa208..1ff59fd 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -163,6 +163,13 @@ static inline int is_unevictable_lru(enum lru_list l)
> >  #endif
> >  }
> >  
> > +enum zone_watermarks {
> > +	WMARK_MIN,
> > +	WMARK_LOW,
> > +	WMARK_HIGH,
> > +	NR_WMARK
> > +};
> > +
> >  struct per_cpu_pages {
> >  	int count;		/* number of pages in the list */
> >  	int high;		/* high watermark, emptying needed */
> > @@ -275,12 +282,9 @@ struct zone_reclaim_stat {
> >  
> >  struct zone {
> >  	/* Fields commonly accessed by the page allocator */
> > -	union {
> > -		struct {
> > -			unsigned long	pages_min, pages_low, pages_high;
> > -		};
> > -		unsigned long pages_mark[3];
> > -	};
> > +
> > +	/* zone watermarks, indexed with WMARK_LOW, WMARK_MIN and WMARK_HIGH */
> > +	unsigned long watermark[NR_WMARK];
> >  
> >  	/*
> >  	 * We don't know if the memory that we're going to allocate will be freeable
> 
> I thought the suggestion was for something like
> 
> 	#define zone_wmark_min(z)	(z->pages_mark[WMARK_MIN])
> 	...

Was it the only suggestion? I thought just replacing the union with an
array would be an option as well.

The #define approach also requires setter versions like

static inline void set_zone_wmark_min(struct zone *z, unsigned long val)
{
	z->pages_mark[WMARK_MIN] = val;
}

and you need one of those for each watermark if you are to avoid weirdness like

zone_wmark_min(z) = val;

which looks all wrong. I felt this approach would look neater and be closer
in appearance to the code that is already there and therefore less surprising.

Would people prefer a getter/setter version?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array
  2009-04-27 20:54           ` Mel Gorman
@ 2009-04-27 21:04             ` David Rientjes
  -1 siblings, 0 replies; 186+ messages in thread
From: David Rientjes @ 2009-04-27 21:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Dave Hansen,
	Linux Kernel Mailing List

On Mon, 27 Apr 2009, Mel Gorman wrote:

> > I thought the suggestion was for something like
> > 
> > 	#define zone_wmark_min(z)	(z->pages_mark[WMARK_MIN])
> > 	...
> 
> Was it the only suggestion? I thought just replacing the union with an
> array would be an option as well.
> 
> The #define approach also requires setter versions like
> 
> static inline void set_zone_wmark_min(struct zone *z, unsigned long val)
> {
> 	z->pages_mark[WMARK_MIN] = val;
> }
> 
> and you need one of those for each watermark if you are to avoid weirdness like
> 
> zone_wmark_min(z) = val;
> 
> which looks all wrong.

Agreed, but we only set watermarks in a couple of different locations and 
they really have no reason to change otherwise, so I don't think it's 
necessary to care too much about how the setter looks.

Adding individual get/set functions for each watermark seems like 
overkill.

I personally had no problem with the union struct aliasing the array, I 
think ->pages_min, ->pages_low, etc. are already very familiar.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array
@ 2009-04-27 21:04             ` David Rientjes
  0 siblings, 0 replies; 186+ messages in thread
From: David Rientjes @ 2009-04-27 21:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Dave Hansen,
	Linux Kernel Mailing List

On Mon, 27 Apr 2009, Mel Gorman wrote:

> > I thought the suggestion was for something like
> > 
> > 	#define zone_wmark_min(z)	(z->pages_mark[WMARK_MIN])
> > 	...
> 
> Was it the only suggestion? I thought just replacing the union with an
> array would be an option as well.
> 
> The #define approach also requires setter versions like
> 
> static inline void set_zone_wmark_min(struct zone *z, unsigned long val)
> {
> 	z->pages_mark[WMARK_MIN] = val;
> }
> 
> and you need one of those for each watermark if you are to avoid weirdness like
> 
> zone_wmark_min(z) = val;
> 
> which looks all wrong.

Agreed, but we only set watermarks in a couple of different locations and 
they really have no reason to change otherwise, so I don't think it's 
necessary to care too much about how the setter looks.

Adding individual get/set functions for each watermark seems like 
overkill.

I personally had no problem with the union struct aliasing the array, I 
think ->pages_min, ->pages_low, etc. are already very familiar.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 00/22] Cleanup and optimise the page allocator V7
  2009-04-27 14:38     ` Mel Gorman
@ 2009-04-28  1:59       ` Zhang, Yanmin
  -1 siblings, 0 replies; 186+ messages in thread
From: Zhang, Yanmin @ 2009-04-28  1:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Mon, 2009-04-27 at 15:38 +0100, Mel Gorman wrote:
> On Mon, Apr 27, 2009 at 03:58:39PM +0800, Zhang, Yanmin wrote:
> > On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > > Here is V7 of the cleanup and optimisation of the page allocator and
> > > it should be ready for wider testing. Please consider a possibility for
> > > merging as a Pass 1 at making the page allocator faster. Other passes will
> > > occur later when this one has had a bit of exercise. This patchset is based
> > > on mmotm-2009-04-17 and I've tested it successfully on a small number of
> > > machines.
> > We ran some performance benchmarks against V7 patch on top of 2.6.30-rc3.
> > It seems some counters in kernel are incorrect after we run some ffsb (disk I/O benchmark)
> > and swap-cp (a simple swap memory testing by cp on tmpfs). Free memory is bigger than
> > total memory.
> > 
> 
> oops. Can you try this patch please?
> 
> ==== CUT HERE ====
> 
> Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()
> 
> free_pages_bulk() updates the number of free pages in the zone but it is
> assuming that the pages being freed are order-0. While this is currently
> always true, it's wrong to assume the order is 0. This patch fixes the
> problem.
> 
> buffered_rmqueue() is not updating NR_FREE_PAGES when allocating pages with
> __rmqueue(). This means that any high-order allocation will appear to increase
> the number of free pages leading to the situation where free pages appears to
> exceed available RAM. This patch accounts for those allocated pages properly.
> 
> This is a candidate fix to the patch
> page-allocator-update-nr_free_pages-only-as-necessary.patch. It has yet to be
> verified as fixing a problem where the free pages count is getting corrupted.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/page_alloc.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3db5f57..dd69593 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -545,7 +545,7 @@ static void free_pages_bulk(struct zone *zone, int count,
>  	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
>  	zone->pages_scanned = 0;
>  
> -	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
> +	__mod_zone_page_state(zone, NR_FREE_PAGES, count << order);
>  	while (count--) {
>  		struct page *page;
>  
> @@ -1151,6 +1151,7 @@ again:
>  	} else {
>  		spin_lock_irqsave(&zone->lock, flags);
>  		page = __rmqueue(zone, order, migratetype);
> +		__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
Here 'i' should be 1?

>  		spin_unlock(&zone->lock);
>  		if (!page)
>  			goto failed;
I ran a cp kernel source files and swap-cp workload and didn't find
bad counter now.



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 00/22] Cleanup and optimise the page allocator V7
@ 2009-04-28  1:59       ` Zhang, Yanmin
  0 siblings, 0 replies; 186+ messages in thread
From: Zhang, Yanmin @ 2009-04-28  1:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Mon, 2009-04-27 at 15:38 +0100, Mel Gorman wrote:
> On Mon, Apr 27, 2009 at 03:58:39PM +0800, Zhang, Yanmin wrote:
> > On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > > Here is V7 of the cleanup and optimisation of the page allocator and
> > > it should be ready for wider testing. Please consider a possibility for
> > > merging as a Pass 1 at making the page allocator faster. Other passes will
> > > occur later when this one has had a bit of exercise. This patchset is based
> > > on mmotm-2009-04-17 and I've tested it successfully on a small number of
> > > machines.
> > We ran some performance benchmarks against V7 patch on top of 2.6.30-rc3.
> > It seems some counters in kernel are incorrect after we run some ffsb (disk I/O benchmark)
> > and swap-cp (a simple swap memory testing by cp on tmpfs). Free memory is bigger than
> > total memory.
> > 
> 
> oops. Can you try this patch please?
> 
> ==== CUT HERE ====
> 
> Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()
> 
> free_pages_bulk() updates the number of free pages in the zone but it is
> assuming that the pages being freed are order-0. While this is currently
> always true, it's wrong to assume the order is 0. This patch fixes the
> problem.
> 
> buffered_rmqueue() is not updating NR_FREE_PAGES when allocating pages with
> __rmqueue(). This means that any high-order allocation will appear to increase
> the number of free pages leading to the situation where free pages appears to
> exceed available RAM. This patch accounts for those allocated pages properly.
> 
> This is a candidate fix to the patch
> page-allocator-update-nr_free_pages-only-as-necessary.patch. It has yet to be
> verified as fixing a problem where the free pages count is getting corrupted.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/page_alloc.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3db5f57..dd69593 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -545,7 +545,7 @@ static void free_pages_bulk(struct zone *zone, int count,
>  	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
>  	zone->pages_scanned = 0;
>  
> -	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
> +	__mod_zone_page_state(zone, NR_FREE_PAGES, count << order);
>  	while (count--) {
>  		struct page *page;
>  
> @@ -1151,6 +1151,7 @@ again:
>  	} else {
>  		spin_lock_irqsave(&zone->lock, flags);
>  		page = __rmqueue(zone, order, migratetype);
> +		__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
Here 'i' should be 1?

>  		spin_unlock(&zone->lock);
>  		if (!page)
>  			goto failed;
I ran a cp kernel source files and swap-cp workload and didn't find
bad counter now.



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 00/22] Cleanup and optimise the page allocator V7
  2009-04-28  1:59       ` Zhang, Yanmin
@ 2009-04-28 10:27         ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-28 10:27 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Tue, Apr 28, 2009 at 09:59:17AM +0800, Zhang, Yanmin wrote:
> On Mon, 2009-04-27 at 15:38 +0100, Mel Gorman wrote:
> > On Mon, Apr 27, 2009 at 03:58:39PM +0800, Zhang, Yanmin wrote:
> > > On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > > > Here is V7 of the cleanup and optimisation of the page allocator and
> > > > it should be ready for wider testing. Please consider a possibility for
> > > > merging as a Pass 1 at making the page allocator faster. Other passes will
> > > > occur later when this one has had a bit of exercise. This patchset is based
> > > > on mmotm-2009-04-17 and I've tested it successfully on a small number of
> > > > machines.
> > > We ran some performance benchmarks against V7 patch on top of 2.6.30-rc3.
> > > It seems some counters in kernel are incorrect after we run some ffsb (disk I/O benchmark)
> > > and swap-cp (a simple swap memory testing by cp on tmpfs). Free memory is bigger than
> > > total memory.
> > > 
> > 
> > oops. Can you try this patch please?
> > 
> > ==== CUT HERE ====
> > 
> > Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()
> > 
> > free_pages_bulk() updates the number of free pages in the zone but it is
> > assuming that the pages being freed are order-0. While this is currently
> > always true, it's wrong to assume the order is 0. This patch fixes the
> > problem.
> > 
> > buffered_rmqueue() is not updating NR_FREE_PAGES when allocating pages with
> > __rmqueue(). This means that any high-order allocation will appear to increase
> > the number of free pages leading to the situation where free pages appears to
> > exceed available RAM. This patch accounts for those allocated pages properly.
> > 
> > This is a candidate fix to the patch
> > page-allocator-update-nr_free_pages-only-as-necessary.patch. It has yet to be
> > verified as fixing a problem where the free pages count is getting corrupted.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/page_alloc.c |    3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3db5f57..dd69593 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -545,7 +545,7 @@ static void free_pages_bulk(struct zone *zone, int count,
> >  	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
> >  	zone->pages_scanned = 0;
> >  
> > -	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
> > +	__mod_zone_page_state(zone, NR_FREE_PAGES, count << order);
> >  	while (count--) {
> >  		struct page *page;
> >  
> > @@ -1151,6 +1151,7 @@ again:
> >  	} else {
> >  		spin_lock_irqsave(&zone->lock, flags);
> >  		page = __rmqueue(zone, order, migratetype);
> > +		__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> Here 'i' should be 1?

1UL even. Not sure how I managed to send a version with 'i' after a build +
boot test.

> 
> >  		spin_unlock(&zone->lock);
> >  		if (!page)
> >  			goto failed;
>
> I ran a cp kernel source files and swap-cp workload and didn't find
> bad counter now.
> 

I'm assuming you mean that it worked with s/i/1/. I'll send out an updated
version.

Thanks a lot.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 00/22] Cleanup and optimise the page allocator V7
@ 2009-04-28 10:27         ` Mel Gorman
  0 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-28 10:27 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Tue, Apr 28, 2009 at 09:59:17AM +0800, Zhang, Yanmin wrote:
> On Mon, 2009-04-27 at 15:38 +0100, Mel Gorman wrote:
> > On Mon, Apr 27, 2009 at 03:58:39PM +0800, Zhang, Yanmin wrote:
> > > On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > > > Here is V7 of the cleanup and optimisation of the page allocator and
> > > > it should be ready for wider testing. Please consider a possibility for
> > > > merging as a Pass 1 at making the page allocator faster. Other passes will
> > > > occur later when this one has had a bit of exercise. This patchset is based
> > > > on mmotm-2009-04-17 and I've tested it successfully on a small number of
> > > > machines.
> > > We ran some performance benchmarks against V7 patch on top of 2.6.30-rc3.
> > > It seems some counters in kernel are incorrect after we run some ffsb (disk I/O benchmark)
> > > and swap-cp (a simple swap memory testing by cp on tmpfs). Free memory is bigger than
> > > total memory.
> > > 
> > 
> > oops. Can you try this patch please?
> > 
> > ==== CUT HERE ====
> > 
> > Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()
> > 
> > free_pages_bulk() updates the number of free pages in the zone but it is
> > assuming that the pages being freed are order-0. While this is currently
> > always true, it's wrong to assume the order is 0. This patch fixes the
> > problem.
> > 
> > buffered_rmqueue() is not updating NR_FREE_PAGES when allocating pages with
> > __rmqueue(). This means that any high-order allocation will appear to increase
> > the number of free pages leading to the situation where free pages appears to
> > exceed available RAM. This patch accounts for those allocated pages properly.
> > 
> > This is a candidate fix to the patch
> > page-allocator-update-nr_free_pages-only-as-necessary.patch. It has yet to be
> > verified as fixing a problem where the free pages count is getting corrupted.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/page_alloc.c |    3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3db5f57..dd69593 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -545,7 +545,7 @@ static void free_pages_bulk(struct zone *zone, int count,
> >  	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
> >  	zone->pages_scanned = 0;
> >  
> > -	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
> > +	__mod_zone_page_state(zone, NR_FREE_PAGES, count << order);
> >  	while (count--) {
> >  		struct page *page;
> >  
> > @@ -1151,6 +1151,7 @@ again:
> >  	} else {
> >  		spin_lock_irqsave(&zone->lock, flags);
> >  		page = __rmqueue(zone, order, migratetype);
> > +		__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> Here 'i' should be 1?

1UL even. Not sure how I managed to send a version with 'i' after a build +
boot test.

> 
> >  		spin_unlock(&zone->lock);
> >  		if (!page)
> >  			goto failed;
>
> I ran a cp kernel source files and swap-cp workload and didn't find
> bad counter now.
> 

I'm assuming you mean that it worked with s/i/1/. I'll send out an updated
version.

Thanks a lot.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()
  2009-04-28  1:59       ` Zhang, Yanmin
@ 2009-04-28 10:31         ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-28 10:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Peter Zijlstra,
	Pekka Enberg, Zhang, Yanmin

This patch fixes two problems with one patch in the page allocator
optimisation patchset.

free_pages_bulk() updates the number of free pages in the zone but it is
assuming that the pages being freed are order-0. While this is currently
always true, it's wrong to assume the order is 0. This patch fixes the
problem.

buffered_rmqueue() is not updating NR_FREE_PAGES when allocating pages with
__rmqueue(). As a result, a high-order allocation will leave an elevated
free page count value leading to the situation where the free page count
exceeds available RAM. This patch accounts for those allocated pages properly.

This is a fix for page-allocator-update-nr_free_pages-only-as-necessary.patch.
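
As a quick standalone illustration of the arithmetic (not kernel code): freeing
16 order-2 blocks returns 16 << 2 = 64 base pages to the zone, so NR_FREE_PAGES
must move by 64, not 16, and a single order-2 allocation removes 1 << 2 = 4
pages.

#include <stdio.h>

int main(void)
{
	unsigned long count = 16, order = 2;

	printf("buggy free delta:   %lu\n", count);          /* 16 */
	printf("correct free delta: %lu\n", count << order); /* 64 */
	printf("alloc delta:        -%lu\n", 1UL << order);  /* 4 */
	return 0;
}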

Reported-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 mm/page_alloc.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dd2d59..59eb2e1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -545,7 +545,7 @@ static void free_pages_bulk(struct zone *zone, int count,
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
 
-	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, count << order);
 	while (count--) {
 		struct page *page;
 
@@ -1151,6 +1151,7 @@ again:
 	} else {
 		spin_lock_irqsave(&zone->lock, flags);
 		page = __rmqueue(zone, order, migratetype);
+		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
 		spin_unlock(&zone->lock);
 		if (!page)
 			goto failed;

^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()
@ 2009-04-28 10:31         ` Mel Gorman
  0 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-28 10:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Peter Zijlstra,
	Pekka Enberg, Zhang, Yanmin

This patch fixes two problems with one patch in the page allocator
optimisation patchset.

free_pages_bulk() updates the number of free pages in the zone but it is
assuming that the pages being freed are order-0. While this is currently
always true, it's wrong to assume the order is 0. This patch fixes the
problem.

buffered_rmqueue() is not updating NR_FREE_PAGES when allocating pages with
__rmqueue(). As a result, a high-order allocation will leave an elevated
free page count value leading to the situation where the free page count
exceeds available RAM. This patch accounts for those allocated pages properly.

This is a fix for page-allocator-update-nr_free_pages-only-as-necessary.patch.

Reported-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 mm/page_alloc.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dd2d59..59eb2e1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -545,7 +545,7 @@ static void free_pages_bulk(struct zone *zone, int count,
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
 
-	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, count << order);
 	while (count--) {
 		struct page *page;
 
@@ -1151,6 +1151,7 @@ again:
 	} else {
 		spin_lock_irqsave(&zone->lock, flags);
 		page = __rmqueue(zone, order, migratetype);
+		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
 		spin_unlock(&zone->lock);
 		if (!page)
 			goto failed;


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* Re: [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()
  2009-04-28 10:31         ` Mel Gorman
@ 2009-04-28 16:37           ` Christoph Lameter
  -1 siblings, 0 replies; 186+ messages in thread
From: Christoph Lameter @ 2009-04-28 16:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux Memory Management List, KOSAKI Motohiro,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Peter Zijlstra,
	Pekka Enberg, Zhang, Yanmin

On Tue, 28 Apr 2009, Mel Gorman wrote:

> @@ -1151,6 +1151,7 @@ again:
>  	} else {
>  		spin_lock_irqsave(&zone->lock, flags);
>  		page = __rmqueue(zone, order, migratetype);
> +		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
>  		spin_unlock(&zone->lock);
>  		if (!page)
>  			goto failed;

__mod_zone_page_state takes a signed integer argument. Not sure what is
won by the UL suffix here.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()
  2009-04-28 16:37           ` Christoph Lameter
@ 2009-04-28 16:51             ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-28 16:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Linux Memory Management List, KOSAKI Motohiro,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Peter Zijlstra,
	Pekka Enberg, Zhang, Yanmin

On Tue, Apr 28, 2009 at 12:37:22PM -0400, Christoph Lameter wrote:
> On Tue, 28 Apr 2009, Mel Gorman wrote:
> 
> > @@ -1151,6 +1151,7 @@ again:
> >  	} else {
> >  		spin_lock_irqsave(&zone->lock, flags);
> >  		page = __rmqueue(zone, order, migratetype);
> > +		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> >  		spin_unlock(&zone->lock);
> >  		if (!page)
> >  			goto failed;
> 
> __mod_zone_page_state takes a signed integer argument. Not sure what is
> won by the UL suffix here.
> 

Matches other call sites such as in __offline_isolated_pages(), habit when
using shifts like this and matches other locations, paranoia, doesn't hurt.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()
  2009-04-28 16:51             ` Mel Gorman
@ 2009-04-28 17:15               ` Hugh Dickins
  -1 siblings, 0 replies; 186+ messages in thread
From: Hugh Dickins @ 2009-04-28 17:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, Andrew Morton, Linux Memory Management List,
	KOSAKI Motohiro, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Peter Zijlstra, Pekka Enberg, Zhang, Yanmin

On Tue, 28 Apr 2009, Mel Gorman wrote:
> On Tue, Apr 28, 2009 at 12:37:22PM -0400, Christoph Lameter wrote:
> > On Tue, 28 Apr 2009, Mel Gorman wrote:
> > 
> > > @@ -1151,6 +1151,7 @@ again:
> > >  	} else {
> > >  		spin_lock_irqsave(&zone->lock, flags);
> > >  		page = __rmqueue(zone, order, migratetype);
> > > +		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> > >  		spin_unlock(&zone->lock);
> > >  		if (!page)
> > >  			goto failed;
> > 
> > __mod_zone_page_state takes a signed integer argument. Not sure what is
> > won by the UL suffix here.
> > 
> 
> Matches other call sites such as in __offline_isolated_pages(), habit when
> using shifts like this and matches other locations, paranoia, doesn't hurt.

Well, not a big deal, but I'd say that it does hurt: by wasting people's
time (mine!), wondering what that "UL" is for - I'd have had to ask, if
Christoph hadn't cleared this up first (thank you).  And elsewhere,
you're using an int << order to update NR_FREE_PAGES, so I don't see
the consistency argument.
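
For what it's worth, the two spellings are interchangeable for any order the
buddy allocator deals with, since the shift stays well below the width of int.
A minimal userspace sketch of that point (illustrative only, not kernel code):

	#include <assert.h>

	/* Illustration only: for the small orders the allocator uses,
	 * 1 << order and 1UL << order end up as the same int delta.
	 */
	static void check_order(unsigned int order)
	{
		assert(order < 31);		/* buddy orders are far below this */
		assert((1 << order) == (int)(1UL << order));
	}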

Hugh

^ permalink raw reply	[flat|nested] 186+ messages in thread

* [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue() V2
  2009-04-28 17:15               ` Hugh Dickins
@ 2009-04-28 18:07                 ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-28 18:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Linux Memory Management List, KOSAKI Motohiro,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Peter Zijlstra,
	Pekka Enberg, Zhang, Yanmin, Hugh Dickins

free_pages_bulk() updates the number of free pages in the zone but it is
assuming that the pages being freed are order-0. While this is currently
always true, it's wrong to assume the order is 0. This patch fixes the problem.

buffered_rmqueue() is not updating NR_FREE_PAGES when allocating pages with
__rmqueue(). As a result, a high-order allocation will appear to increase
the number of free pages, leading to a situation where the free page count
exceeds available RAM. This patch accounts for those allocated pages properly.

This is a fix for page-allocator-update-nr_free_pages-only-as-necessary.patch.

Changelog since V1
  o Change 1UL to 1 as it's unnecessary in this case to be unsigned long

Reported-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 mm/page_alloc.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dd2d59..59eb2e1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -545,7 +545,7 @@ static void free_pages_bulk(struct zone *zone, int count,
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
 
-	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, count << order);
 	while (count--) {
 		struct page *page;
 
@@ -1151,6 +1151,7 @@ again:
 	} else {
 		spin_lock_irqsave(&zone->lock, flags);
 		page = __rmqueue(zone, order, migratetype);
+		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
 		spin_unlock(&zone->lock);
 		if (!page)
 			goto failed;

^ permalink raw reply related	[flat|nested] 186+ messages in thread

* Re: [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue() V2
  2009-04-28 18:07                 ` Mel Gorman
@ 2009-04-28 18:25                   ` Hugh Dickins
  -1 siblings, 0 replies; 186+ messages in thread
From: Hugh Dickins @ 2009-04-28 18:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christoph Lameter, Linux Memory Management List,
	KOSAKI Motohiro, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Peter Zijlstra, Pekka Enberg, Zhang, Yanmin

On Tue, 28 Apr 2009, Mel Gorman wrote:

> free_pages_bulk() updates the number of free pages in the zone but it is
> assuming that the pages being freed are order-0. While this is currently
> always true, it's wrong to assume the order is 0. This patch fixes the problem.
> 
> buffered_rmqueue() is not updating NR_FREE_PAGES when allocating pages with
> __rmqueue(). As a result, a high-order allocation will appear to increase
> the number of free pages, leading to a situation where the free page count
> exceeds available RAM. This patch accounts for those allocated pages properly.
> 
> This is a fix for page-allocator-update-nr_free_pages-only-as-necessary.patch.
> 
> Changelog since V1
>   o Change 1UL to 1 as it's unnecessary in this case to be unsigned long
> 
> Reported-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Thanks, Mel!  (And I don't need to test it, since it's now
remarkably similar to a patch I was preparing yesterday.)

Acked-by: Hugh Dickins <hugh@veritas.com>

> --- 
>  mm/page_alloc.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5dd2d59..59eb2e1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -545,7 +545,7 @@ static void free_pages_bulk(struct zone *zone, int count,
>  	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
>  	zone->pages_scanned = 0;
>  
> -	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
> +	__mod_zone_page_state(zone, NR_FREE_PAGES, count << order);
>  	while (count--) {
>  		struct page *page;
>  
> @@ -1151,6 +1151,7 @@ again:
>  	} else {
>  		spin_lock_irqsave(&zone->lock, flags);
>  		page = __rmqueue(zone, order, migratetype);
> +		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
>  		spin_unlock(&zone->lock);
>  		if (!page)
>  			goto failed;

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()
  2009-04-28 17:15               ` Hugh Dickins
@ 2009-04-28 18:36                 ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-28 18:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Christoph Lameter, Andrew Morton, Linux Memory Management List,
	KOSAKI Motohiro, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Peter Zijlstra, Pekka Enberg, Zhang, Yanmin

On Tue, Apr 28, 2009 at 06:15:30PM +0100, Hugh Dickins wrote:
> On Tue, 28 Apr 2009, Mel Gorman wrote:
> > On Tue, Apr 28, 2009 at 12:37:22PM -0400, Christoph Lameter wrote:
> > > On Tue, 28 Apr 2009, Mel Gorman wrote:
> > > 
> > > > @@ -1151,6 +1151,7 @@ again:
> > > >  	} else {
> > > >  		spin_lock_irqsave(&zone->lock, flags);
> > > >  		page = __rmqueue(zone, order, migratetype);
> > > > +		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> > > >  		spin_unlock(&zone->lock);
> > > >  		if (!page)
> > > >  			goto failed;
> > > 
> > > __mod_zone_page_state takes a signed integer argument. Not sure what is
> > > won by the UL suffix here.
> > > 
> > 
> > Matches other call sites such as in __offline_isolated_pages(), habit when
> > using shifts like this and matches other locations, paranoia, doesn't hurt.
> 
> Well, not a big deal, but I'd say that it does hurt: by wasting people's
> time (mine!), wondering what that "UL" is for - I'd have had to ask, if
> Christoph hadn't cleared this up first (thank you).  And elsewhere,
> you're using an int << order to update NR_FREE_PAGES, so I don't see
> the consistency argument.
> 

Ok, I'll concede things are not super-consistent. In this case, I felt it
"didn't hurt" but I didn't take into account that it needs to be worked out
each time, so I posted a second revision that should be clearer.

As an aside, order changes from being signed in some places to being unsigned
in others. There is probably a cleanup pass worth doing just for this sort of
inconsistency.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array
  2009-04-27 21:04             ` David Rientjes
@ 2009-04-30 13:35               ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-04-30 13:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linux Memory Management List, KOSAKI Motohiro, Dave Hansen,
	Linux Kernel Mailing List

On Mon, Apr 27, 2009 at 02:04:03PM -0700, David Rientjes wrote:
> On Mon, 27 Apr 2009, Mel Gorman wrote:
> 
> > > I thought the suggestion was for something like
> > > 
> > > 	#define zone_wmark_min(z)	(z->pages_mark[WMARK_MIN])
> > > 	...
> > 
> > Was it the only suggestion? I thought just replacing the union with an
> > array would be an option as well.
> > 
> > The #define approach also requires setter versions like
> > 
> > static inline set_zone_wmark_min(struct zone *z, unsigned long val)
> > {
> > 	z->pages_mark[WMARK_MIN] = val;
> > }
> > 
> > and you need one of those for each watermark if you are to avoid weirdness like
> > 
> > zone_wmark_min(z) = val;
> > 
> > which looks all wrong.
> 
> Agreed, but we only set watermarks in a couple of different locations and 
> they really have no reason to change otherwise, so I don't think it's 
> necessary to care too much about how the setter looks.
> 
> Adding individual get/set functions for each watermark seems like 
> overkill.
> 

I think what you're saying is that you'd be ok with

zone_wmark_min(z)
zone_wmark_low(z)
zone_wmark_high(z)

and z->pages_mark[WMARK_MIN] =
and z->pages_mark[WMARK_LOW] =
and z->pages_mark[WMARK_HIGH] =

?

Is that a significant improvement over what the patch currently does? To
me, it seems more verbose.

> I personally had no problem with the union struct aliasing the array, I 
> think ->pages_min, ->pages_low, etc. are already very familiar.
> 

Can the people who do have a problem with the union make some sort of
comment on how they think it should look?

Obviously, I'm pro-the-current-patch :/

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array
  2009-04-30 13:35               ` Mel Gorman
@ 2009-04-30 13:48                 ` Dave Hansen
  -1 siblings, 0 replies; 186+ messages in thread
From: Dave Hansen @ 2009-04-30 13:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Linux Memory Management List, KOSAKI Motohiro,
	Linux Kernel Mailing List

On Thu, 2009-04-30 at 14:35 +0100, Mel Gorman wrote:
> I think what you're saying is that you'd be ok with
> 
> zone_wmark_min(z)
> zone_wmark_low(z)
> zone_wmark_high(z)
> 
> and z->pages_mark[WMARK_MIN] =
> and z->pages_mark[WMARK_LOW] =
> and z->pages_mark[WMARK_HIGH] =
> 
> ?
> 
> Is that a significant improvement over what the patch currently does? To
> me, it seems more verbose.

Either way, there are _relatively_ few users.  From a quick cscope, it
appears setup_per_zone_pages_min() is really the heaviest user assigning
them.

Personally, I do like having the 'wmark' or something similar in the
function or structure member names.  But I also like having the units
in there.  There's probably not room for both, though.  I'm fine
with the naming you have above.  The only thing I might consider is
removing 'zone_' from the function names since it's implied by the
variable name:

	min_wmark_pages(z)

The 'z->pages_mark[WMARK_*]' form is ugly, but it should be basically
restricted to use in setup_per_zone_pages_min().  I think that means we
don't need set_foo() functions because of a lack of use sites.
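
A minimal sketch of that arrangement, assuming struct zone grows a
watermark[NR_WMARK] array (which is what the V2 patch later in this thread
does): macro getters for the readers, direct array indexing for the few
writers:

	enum zone_watermarks {
		WMARK_MIN,
		WMARK_LOW,
		WMARK_HIGH,
		NR_WMARK
	};

	#define min_wmark_pages(z)	(z->watermark[WMARK_MIN])
	#define low_wmark_pages(z)	(z->watermark[WMARK_LOW])
	#define high_wmark_pages(z)	(z->watermark[WMARK_HIGH])

	/* readers go through the helpers, e.g.:
	 *	mark = low_wmark_pages(zone);
	 * writers index the array directly in the setup path, e.g.:
	 *	zone->watermark[WMARK_MIN] = min_pages;
	 */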

-- Dave


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [RFC] Replace the watermark-related union in struct zone with a watermark[] array V2
  2009-04-30 13:48                 ` Dave Hansen
@ 2009-05-12 14:13                   ` Mel Gorman
  -1 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-05-12 14:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David Rientjes, Linux Memory Management List, KOSAKI Motohiro,
	Linux Kernel Mailing List

Changelog since V1
  o Use N_wmark_pages accessors instead of array accesses

Patch page-allocator-use-allocation-flags-as-an-index-to-the-zone-watermark
from -mm added a union to struct zone where the watermarks could be accessed
with either zone->pages_* or a pages_mark array. The concern was that this
aliasing caused more confusion than it helped.

This patch replaces the union with a watermark array that is indexed with
WMARK_* defines accessed via helpers. All call sites that use zone->pages_*
are updated to use the helpers for accessing the values and the array
offsets for setting.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 Documentation/sysctl/vm.txt |   11 ++++++-----
 Documentation/vm/balance    |   18 +++++++++---------
 arch/m32r/mm/discontig.c    |    6 +++---
 include/linux/mmzone.h      |   20 ++++++++++++++------
 mm/page_alloc.c             |   41 +++++++++++++++++++++--------------------
 mm/vmscan.c                 |   39 +++++++++++++++++++++------------------
 mm/vmstat.c                 |    6 +++---
 7 files changed, 77 insertions(+), 64 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 3fddcf3..b0237be 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -235,8 +235,8 @@ These protections are added to score to judge whether this zone should be used
 for page allocation or should be reclaimed.
 
 In this example, if normal pages (index=2) are required to this DMA zone and
-pages_high is used for watermark, the kernel judges this zone should not be
-used because pages_free(1355) is smaller than watermark + protection[2]
+watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should
+not be used because pages_free(1355) is smaller than watermark + protection[2]
 (4 + 2004 = 2008). If this protection value is 0, this zone would be used for
 normal page requirement. If requirement is DMA zone(index=0), protection[0]
 (=0) is used.
@@ -282,9 +282,10 @@ The default value is 65536.
 min_free_kbytes:
 
 This is used to force the Linux VM to keep a minimum number
-of kilobytes free.  The VM uses this number to compute a pages_min
-value for each lowmem zone in the system.  Each lowmem zone gets
-a number of reserved free pages based proportionally on its size.
+of kilobytes free.  The VM uses this number to compute a
+watermark[WMARK_MIN] value for each lowmem zone in the system.
+Each lowmem zone gets a number of reserved free pages based
+proportionally on its size.
 
 Some minimal amount of memory is needed to satisfy PF_MEMALLOC
 allocations; if you set this to lower than 1024KB, your system will
diff --git a/Documentation/vm/balance b/Documentation/vm/balance
index bd3d31b..c46e68c 100644
--- a/Documentation/vm/balance
+++ b/Documentation/vm/balance
@@ -75,15 +75,15 @@ Page stealing from process memory and shm is done if stealing the page would
 alleviate memory pressure on any zone in the page's node that has fallen below
 its watermark.
 
-pages_min/pages_low/pages_high/low_on_memory/zone_wake_kswapd: These are 
-per-zone fields, used to determine when a zone needs to be balanced. When
-the number of pages falls below pages_min, the hysteric field low_on_memory
-gets set. This stays set till the number of free pages becomes pages_high.
-When low_on_memory is set, page allocation requests will try to free some
-pages in the zone (providing GFP_WAIT is set in the request). Orthogonal
-to this, is the decision to poke kswapd to free some zone pages. That
-decision is not hysteresis based, and is done when the number of free
-pages is below pages_low; in which case zone_wake_kswapd is also set.
+watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These
+are per-zone fields, used to determine when a zone needs to be balanced. When
+the number of pages falls below watermark[WMARK_MIN], the hysteric field
+low_on_memory gets set. This stays set till the number of free pages becomes
+watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will
+try to free some pages in the zone (providing GFP_WAIT is set in the request).
+Orthogonal to this, is the decision to poke kswapd to free some zone pages.
+That decision is not hysteresis based, and is done when the number of free
+pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set.
 
 
 (Good) Ideas that I have heard:
diff --git a/arch/m32r/mm/discontig.c b/arch/m32r/mm/discontig.c
index 7daf897..b7a78ad 100644
--- a/arch/m32r/mm/discontig.c
+++ b/arch/m32r/mm/discontig.c
@@ -154,9 +154,9 @@ unsigned long __init zone_sizes_init(void)
 	 *  Use all area of internal RAM.
 	 *  see __alloc_pages()
 	 */
-	NODE_DATA(1)->node_zones->pages_min = 0;
-	NODE_DATA(1)->node_zones->pages_low = 0;
-	NODE_DATA(1)->node_zones->pages_high = 0;
+	NODE_DATA(1)->node_zones->watermark[WMARK_MIN] = 0;
+	NODE_DATA(1)->node_zones->watermark[WMARK_LOW] = 0;
+	NODE_DATA(1)->node_zones->watermark[WMARK_HIGH] = 0;
 
 	return holes;
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 95a9257..4f99501 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,17 @@ static inline int is_unevictable_lru(enum lru_list l)
 #endif
 }
 
+enum zone_watermarks {
+	WMARK_MIN,
+	WMARK_LOW,
+	WMARK_HIGH,
+	NR_WMARK
+};
+
+#define min_wmark_pages(z) (z->watermark[WMARK_MIN])
+#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
+#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])
+
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
@@ -275,12 +286,9 @@ struct zone_reclaim_stat {
 
 struct zone {
 	/* Fields commonly accessed by the page allocator */
-	union {
-		struct {
-			unsigned long	pages_min, pages_low, pages_high;
-		};
-		unsigned long pages_mark[3];
-	};
+
+	/* zone watermarks, access with *_wmark_pages(zone) macros */
+	unsigned long watermark[NR_WMARK];
 
 	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d8272a..4ba56d9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1154,10 +1154,10 @@ failed:
 	return NULL;
 }
 
-/* The WMARK bits are used as an index zone->pages_mark */
-#define ALLOC_WMARK_MIN		0x00 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW		0x01 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH	0x02 /* use pages_high watermark */
+/* The ALLOC_WMARK bits are used as an index to zone->watermark */
+#define ALLOC_WMARK_MIN		WMARK_MIN
+#define ALLOC_WMARK_LOW		WMARK_LOW
+#define ALLOC_WMARK_HIGH	WMARK_HIGH
 #define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */
 
 /* Mask to get the watermark bits */
@@ -1446,9 +1446,10 @@ zonelist_scan:
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				goto try_next_zone;
 
+		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
 			unsigned long mark;
-			mark = zone->pages_mark[alloc_flags & ALLOC_WMARK_MASK];
+			mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 			if (!zone_watermark_ok(zone, order, mark,
 				    classzone_idx, alloc_flags)) {
 				if (!zone_reclaim_mode ||
@@ -1969,7 +1970,7 @@ static unsigned int nr_free_zone_pages(int offset)
 
 	for_each_zone_zonelist(zone, z, zonelist, offset) {
 		unsigned long size = zone->present_pages;
-		unsigned long high = zone->pages_high;
+		unsigned long high = high_wmark_pages(zone);
 		if (size > high)
 			sum += size - high;
 	}
@@ -2106,9 +2107,9 @@ void show_free_areas(void)
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
-			K(zone->pages_min),
-			K(zone->pages_low),
-			K(zone->pages_high),
+			K(min_wmark_pages(zone)),
+			K(low_wmark_pages(zone)),
+			K(high_wmark_pages(zone)),
 			K(zone_page_state(zone, NR_ACTIVE_ANON)),
 			K(zone_page_state(zone, NR_INACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ACTIVE_FILE)),
@@ -2712,8 +2713,8 @@ static inline unsigned long wait_table_bits(unsigned long size)
 
 /*
  * Mark a number of pageblocks as MIGRATE_RESERVE. The number
- * of blocks reserved is based on zone->pages_min. The memory within the
- * reserve will tend to store contiguous free pages. Setting min_free_kbytes
+ * of blocks reserved is based on min_wmark_pages(zone). The memory within
+ * the reserve will tend to store contiguous free pages. Setting min_free_kbytes
  * higher will lead to a bigger reserve which will get freed as contiguous
  * blocks as reclaim kicks in
  */
@@ -2726,7 +2727,7 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 	/* Get the start pfn, end pfn and the number of blocks to reserve */
 	start_pfn = zone->zone_start_pfn;
 	end_pfn = start_pfn + zone->spanned_pages;
-	reserve = roundup(zone->pages_min, pageblock_nr_pages) >>
+	reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages) >>
 							pageblock_order;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
@@ -4400,8 +4401,8 @@ static void calculate_totalreserve_pages(void)
 					max = zone->lowmem_reserve[j];
 			}
 
-			/* we treat pages_high as reserved pages. */
-			max += zone->pages_high;
+			/* we treat the high watermark as reserved pages. */
+			max += high_wmark_pages(zone);
 
 			if (max > zone->present_pages)
 				max = zone->present_pages;
@@ -4481,7 +4482,7 @@ void setup_per_zone_pages_min(void)
 			 * need highmem pages, so cap pages_min to a small
 			 * value here.
 			 *
-			 * The (pages_high-pages_low) and (pages_low-pages_min)
+			 * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
 			 * deltas controls asynch page reclaim, and so should
 			 * not be capped for highmem.
 			 */
@@ -4492,17 +4493,17 @@ void setup_per_zone_pages_min(void)
 				min_pages = SWAP_CLUSTER_MAX;
 			if (min_pages > 128)
 				min_pages = 128;
-			zone->pages_min = min_pages;
+			zone->watermark[WMARK_MIN] = min_pages;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
-			zone->pages_min = tmp;
+			zone->watermark[WMARK_MIN] = tmp;
 		}
 
-		zone->pages_low   = zone->pages_min + (tmp >> 2);
-		zone->pages_high  = zone->pages_min + (tmp >> 1);
+		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
+		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
 		setup_zone_migrate_reserve(zone);
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
@@ -4646,7 +4647,7 @@ int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
  *	whenever sysctl_lowmem_reserve_ratio changes.
  *
  * The reserve ratio obviously has absolutely no relation with the
- * pages_min watermarks. The lowmem reserve ratio can only make sense
+ * minimum watermarks. The lowmem reserve ratio can only make sense
  * if in function of the boot time zone sizes.
  */
 int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 822056c..462ced1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1437,7 +1437,7 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
 		free  = zone_page_state(zone, NR_FREE_PAGES);
 		/* If we have very few page cache pages,
 		   force-scan anon pages. */
-		if (unlikely(file + free <= zone->pages_high)) {
+		if (unlikely(file + free <= high_wmark_pages(zone))) {
 			percent[0] = 100;
 			percent[1] = 0;
 			return;
@@ -1586,11 +1586,13 @@ static void shrink_zone(int priority, struct zone *zone,
  * try to reclaim pages from zones which will satisfy the caller's allocation
  * request.
  *
- * We reclaim from a zone even if that zone is over pages_high.  Because:
+ * We reclaim from a zone even if that zone is over high_wmark_pages(zone).
+ * Because:
  * a) The caller may be trying to free *extra* pages to satisfy a higher-order
  *    allocation or
- * b) The zones may be over pages_high but they must go *over* pages_high to
- *    satisfy the `incremental min' zone defense algorithm.
+ * b) The target zone may be at high_wmark_pages(zone) but the lower zones
+ *    must go *over* high_wmark_pages(zone) to satisfy the `incremental min'
+ *    zone defense algorithm.
  *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
@@ -1796,7 +1798,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 
 /*
  * For kswapd, balance_pgdat() will work across all this node's zones until
- * they are all at pages_high.
+ * they are all at high_wmark_pages(zone).
  *
  * Returns the number of pages which were actually freed.
  *
@@ -1809,11 +1811,11 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
  * the zone for when the problem goes away.
  *
  * kswapd scans the zones in the highmem->normal->dma direction.  It skips
- * zones which have free_pages > pages_high, but once a zone is found to have
- * free_pages <= pages_high, we scan that zone and the lower zones regardless
- * of the number of free pages in the lower zones.  This interoperates with
- * the page allocator fallback scheme to ensure that aging of pages is balanced
- * across the zones.
+ * zones which have free_pages > high_wmark_pages(zone), but once a zone is
+ * found to have free_pages <= high_wmark_pages(zone), we scan that zone and the
+ * lower zones regardless of the number of free pages in the lower zones. This
+ * interoperates with the page allocator fallback scheme to ensure that aging
+ * of pages is balanced across the zones.
  */
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 {
@@ -1834,7 +1836,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 	};
 	/*
 	 * temp_priority is used to remember the scanning priority at which
-	 * this zone was successfully refilled to free_pages == pages_high.
+	 * this zone was successfully refilled to
+	 * free_pages == high_wmark_pages(zone).
 	 */
 	int temp_priority[MAX_NR_ZONES];
 
@@ -1879,8 +1882,8 @@ loop_again:
 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
 							&sc, priority, 0);
 
-			if (!zone_watermark_ok(zone, order, zone->pages_high,
-					       0, 0)) {
+			if (!zone_watermark_ok(zone, order,
+					high_wmark_pages(zone), 0, 0)) {
 				end_zone = i;
 				break;
 			}
@@ -1914,8 +1917,8 @@ loop_again:
 					priority != DEF_PRIORITY)
 				continue;
 
-			if (!zone_watermark_ok(zone, order, zone->pages_high,
-					       end_zone, 0))
+			if (!zone_watermark_ok(zone, order,
+					high_wmark_pages(zone), end_zone, 0))
 				all_zones_ok = 0;
 			temp_priority[i] = priority;
 			sc.nr_scanned = 0;
@@ -1924,8 +1927,8 @@ loop_again:
 			 * We put equal pressure on every zone, unless one
 			 * zone has way too many pages free already.
 			 */
-			if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
-						end_zone, 0))
+			if (!zone_watermark_ok(zone, order,
+					8*high_wmark_pages(zone), end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
@@ -2091,7 +2094,7 @@ void wakeup_kswapd(struct zone *zone, int order)
 		return;
 
 	pgdat = zone->zone_pgdat;
-	if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0))
+	if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
 		return;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b42d06c..5e101d7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -725,9 +725,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
-		   zone->pages_min,
-		   zone->pages_low,
-		   zone->pages_high,
+		   min_wmark_pages(zone),
+		   low_wmark_pages(zone),
+		   high_wmark_pages(zone),
 		   zone->pages_scanned,
 		   zone->spanned_pages,
 		   zone->present_pages);

^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [RFC] Replace the watermark-related union in struct zone with a watermark[] array V2
@ 2009-05-12 14:13                   ` Mel Gorman
  0 siblings, 0 replies; 186+ messages in thread
From: Mel Gorman @ 2009-05-12 14:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David Rientjes, Linux Memory Management List, KOSAKI Motohiro,
	Linux Kernel Mailing List

Changelog since V1
  o Use N_wmark_pages accessors instead of array accesses

Patch page-allocator-use-allocation-flags-as-an-index-to-the-zone-watermark
from -mm added a union to struct zone where the watermarks could be accessed
with either zone->pages_* or a pages_mark array. The concern was that this
aliasing caused more confusion than it helped.

This patch replaces the union with a watermark array that is indexed with
WMARK_* defines accessed via helpers. All call sites that use zone->pages_*
are updated to use the helpers for accessing the values and the array
offsets for setting.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 Documentation/sysctl/vm.txt |   11 ++++++-----
 Documentation/vm/balance    |   18 +++++++++---------
 arch/m32r/mm/discontig.c    |    6 +++---
 include/linux/mmzone.h      |   20 ++++++++++++++------
 mm/page_alloc.c             |   41 +++++++++++++++++++++--------------------
 mm/vmscan.c                 |   39 +++++++++++++++++++++------------------
 mm/vmstat.c                 |    6 +++---
 7 files changed, 77 insertions(+), 64 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 3fddcf3..b0237be 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -235,8 +235,8 @@ These protections are added to score to judge whether this zone should be used
 for page allocation or should be reclaimed.
 
 In this example, if normal pages (index=2) are required to this DMA zone and
-pages_high is used for watermark, the kernel judges this zone should not be
-used because pages_free(1355) is smaller than watermark + protection[2]
+watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should
+not be used because pages_free(1355) is smaller than watermark + protection[2]
 (4 + 2004 = 2008). If this protection value is 0, this zone would be used for
 normal page requirement. If requirement is DMA zone(index=0), protection[0]
 (=0) is used.
@@ -282,9 +282,10 @@ The default value is 65536.
 min_free_kbytes:
 
 This is used to force the Linux VM to keep a minimum number
-of kilobytes free.  The VM uses this number to compute a pages_min
-value for each lowmem zone in the system.  Each lowmem zone gets
-a number of reserved free pages based proportionally on its size.
+of kilobytes free.  The VM uses this number to compute a
+watermark[WMARK_MIN] value for each lowmem zone in the system.
+Each lowmem zone gets a number of reserved free pages based
+proportionally on its size.
 
 Some minimal amount of memory is needed to satisfy PF_MEMALLOC
 allocations; if you set this to lower than 1024KB, your system will
diff --git a/Documentation/vm/balance b/Documentation/vm/balance
index bd3d31b..c46e68c 100644
--- a/Documentation/vm/balance
+++ b/Documentation/vm/balance
@@ -75,15 +75,15 @@ Page stealing from process memory and shm is done if stealing the page would
 alleviate memory pressure on any zone in the page's node that has fallen below
 its watermark.
 
-pages_min/pages_low/pages_high/low_on_memory/zone_wake_kswapd: These are 
-per-zone fields, used to determine when a zone needs to be balanced. When
-the number of pages falls below pages_min, the hysteric field low_on_memory
-gets set. This stays set till the number of free pages becomes pages_high.
-When low_on_memory is set, page allocation requests will try to free some
-pages in the zone (providing GFP_WAIT is set in the request). Orthogonal
-to this, is the decision to poke kswapd to free some zone pages. That
-decision is not hysteresis based, and is done when the number of free
-pages is below pages_low; in which case zone_wake_kswapd is also set.
+watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These
+are per-zone fields, used to determine when a zone needs to be balanced. When
+the number of pages falls below watermark[WMARK_MIN], the hysteric field
+low_on_memory gets set. This stays set till the number of free pages becomes
+watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will
+try to free some pages in the zone (providing GFP_WAIT is set in the request).
+Orthogonal to this, is the decision to poke kswapd to free some zone pages.
+That decision is not hysteresis based, and is done when the number of free
+pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set.
 
 
 (Good) Ideas that I have heard:
diff --git a/arch/m32r/mm/discontig.c b/arch/m32r/mm/discontig.c
index 7daf897..b7a78ad 100644
--- a/arch/m32r/mm/discontig.c
+++ b/arch/m32r/mm/discontig.c
@@ -154,9 +154,9 @@ unsigned long __init zone_sizes_init(void)
 	 *  Use all area of internal RAM.
 	 *  see __alloc_pages()
 	 */
-	NODE_DATA(1)->node_zones->pages_min = 0;
-	NODE_DATA(1)->node_zones->pages_low = 0;
-	NODE_DATA(1)->node_zones->pages_high = 0;
+	NODE_DATA(1)->node_zones->watermark[WMARK_MIN] = 0;
+	NODE_DATA(1)->node_zones->watermark[WMARK_LOW] = 0;
+	NODE_DATA(1)->node_zones->watermark[WMARK_HIGH] = 0;
 
 	return holes;
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 95a9257..4f99501 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,17 @@ static inline int is_unevictable_lru(enum lru_list l)
 #endif
 }
 
+enum zone_watermarks {
+	WMARK_MIN,
+	WMARK_LOW,
+	WMARK_HIGH,
+	NR_WMARK
+};
+
+#define min_wmark_pages(z) (z->watermark[WMARK_MIN])
+#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
+#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])
+
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
@@ -275,12 +286,9 @@ struct zone_reclaim_stat {
 
 struct zone {
 	/* Fields commonly accessed by the page allocator */
-	union {
-		struct {
-			unsigned long	pages_min, pages_low, pages_high;
-		};
-		unsigned long pages_mark[3];
-	};
+
+	/* zone watermarks, access with *_wmark_pages(zone) macros */
+	unsigned long watermark[NR_WMARK];
 
 	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d8272a..4ba56d9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1154,10 +1154,10 @@ failed:
 	return NULL;
 }
 
-/* The WMARK bits are used as an index zone->pages_mark */
-#define ALLOC_WMARK_MIN		0x00 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW		0x01 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH	0x02 /* use pages_high watermark */
+/* The ALLOC_WMARK bits are used as an index to zone->watermark */
+#define ALLOC_WMARK_MIN		WMARK_MIN
+#define ALLOC_WMARK_LOW		WMARK_LOW
+#define ALLOC_WMARK_HIGH	WMARK_HIGH
 #define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */
 
 /* Mask to get the watermark bits */
@@ -1446,9 +1446,10 @@ zonelist_scan:
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				goto try_next_zone;
 
+		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
 			unsigned long mark;
-			mark = zone->pages_mark[alloc_flags & ALLOC_WMARK_MASK];
+			mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 			if (!zone_watermark_ok(zone, order, mark,
 				    classzone_idx, alloc_flags)) {
 				if (!zone_reclaim_mode ||
@@ -1969,7 +1970,7 @@ static unsigned int nr_free_zone_pages(int offset)
 
 	for_each_zone_zonelist(zone, z, zonelist, offset) {
 		unsigned long size = zone->present_pages;
-		unsigned long high = zone->pages_high;
+		unsigned long high = high_wmark_pages(zone);
 		if (size > high)
 			sum += size - high;
 	}
@@ -2106,9 +2107,9 @@ void show_free_areas(void)
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
-			K(zone->pages_min),
-			K(zone->pages_low),
-			K(zone->pages_high),
+			K(min_wmark_pages(zone)),
+			K(low_wmark_pages(zone)),
+			K(high_wmark_pages(zone)),
 			K(zone_page_state(zone, NR_ACTIVE_ANON)),
 			K(zone_page_state(zone, NR_INACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ACTIVE_FILE)),
@@ -2712,8 +2713,8 @@ static inline unsigned long wait_table_bits(unsigned long size)
 
 /*
  * Mark a number of pageblocks as MIGRATE_RESERVE. The number
- * of blocks reserved is based on zone->pages_min. The memory within the
- * reserve will tend to store contiguous free pages. Setting min_free_kbytes
+ * of blocks reserved is based on min_wmark_pages(zone). The memory within
+ * the reserve will tend to store contiguous free pages. Setting min_free_kbytes
  * higher will lead to a bigger reserve which will get freed as contiguous
  * blocks as reclaim kicks in
  */
@@ -2726,7 +2727,7 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 	/* Get the start pfn, end pfn and the number of blocks to reserve */
 	start_pfn = zone->zone_start_pfn;
 	end_pfn = start_pfn + zone->spanned_pages;
-	reserve = roundup(zone->pages_min, pageblock_nr_pages) >>
+	reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages) >>
 							pageblock_order;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
@@ -4400,8 +4401,8 @@ static void calculate_totalreserve_pages(void)
 					max = zone->lowmem_reserve[j];
 			}
 
-			/* we treat pages_high as reserved pages. */
-			max += zone->pages_high;
+			/* we treat the high watermark as reserved pages. */
+			max += high_wmark_pages(zone);
 
 			if (max > zone->present_pages)
 				max = zone->present_pages;
@@ -4481,7 +4482,7 @@ void setup_per_zone_pages_min(void)
 			 * need highmem pages, so cap pages_min to a small
 			 * value here.
 			 *
-			 * The (pages_high-pages_low) and (pages_low-pages_min)
+			 * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
 			 * deltas controls asynch page reclaim, and so should
 			 * not be capped for highmem.
 			 */
@@ -4492,17 +4493,17 @@ void setup_per_zone_pages_min(void)
 				min_pages = SWAP_CLUSTER_MAX;
 			if (min_pages > 128)
 				min_pages = 128;
-			zone->pages_min = min_pages;
+			zone->watermark[WMARK_MIN] = min_pages;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
-			zone->pages_min = tmp;
+			zone->watermark[WMARK_MIN] = tmp;
 		}
 
-		zone->pages_low   = zone->pages_min + (tmp >> 2);
-		zone->pages_high  = zone->pages_min + (tmp >> 1);
+		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
+		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
 		setup_zone_migrate_reserve(zone);
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
@@ -4646,7 +4647,7 @@ int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
  *	whenever sysctl_lowmem_reserve_ratio changes.
  *
  * The reserve ratio obviously has absolutely no relation with the
- * pages_min watermarks. The lowmem reserve ratio can only make sense
+ * minimum watermarks. The lowmem reserve ratio can only make sense
  * if in function of the boot time zone sizes.
  */
 int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 822056c..462ced1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1437,7 +1437,7 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
 		free  = zone_page_state(zone, NR_FREE_PAGES);
 		/* If we have very few page cache pages,
 		   force-scan anon pages. */
-		if (unlikely(file + free <= zone->pages_high)) {
+		if (unlikely(file + free <= high_wmark_pages(zone))) {
 			percent[0] = 100;
 			percent[1] = 0;
 			return;
@@ -1586,11 +1586,13 @@ static void shrink_zone(int priority, struct zone *zone,
  * try to reclaim pages from zones which will satisfy the caller's allocation
  * request.
  *
- * We reclaim from a zone even if that zone is over pages_high.  Because:
+ * We reclaim from a zone even if that zone is over high_wmark_pages(zone).
+ * Because:
  * a) The caller may be trying to free *extra* pages to satisfy a higher-order
  *    allocation or
- * b) The zones may be over pages_high but they must go *over* pages_high to
- *    satisfy the `incremental min' zone defense algorithm.
+ * b) The target zone may be at high_wmark_pages(zone) but the lower zones
+ *    must go *over* high_wmark_pages(zone) to satisfy the `incremental min'
+ *    zone defense algorithm.
  *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
@@ -1796,7 +1798,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 
 /*
  * For kswapd, balance_pgdat() will work across all this node's zones until
- * they are all at pages_high.
+ * they are all at high_wmark_pages(zone).
  *
  * Returns the number of pages which were actually freed.
  *
@@ -1809,11 +1811,11 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
  * the zone for when the problem goes away.
  *
  * kswapd scans the zones in the highmem->normal->dma direction.  It skips
- * zones which have free_pages > pages_high, but once a zone is found to have
- * free_pages <= pages_high, we scan that zone and the lower zones regardless
- * of the number of free pages in the lower zones.  This interoperates with
- * the page allocator fallback scheme to ensure that aging of pages is balanced
- * across the zones.
+ * zones which have free_pages > high_wmark_pages(zone), but once a zone is
+ * found to have free_pages <= high_wmark_pages(zone), we scan that zone and the
+ * lower zones regardless of the number of free pages in the lower zones. This
+ * interoperates with the page allocator fallback scheme to ensure that aging
+ * of pages is balanced across the zones.
  */
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 {
@@ -1834,7 +1836,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 	};
 	/*
 	 * temp_priority is used to remember the scanning priority at which
-	 * this zone was successfully refilled to free_pages == pages_high.
+	 * this zone was successfully refilled to
+	 * free_pages == high_wmark_pages(zone).
 	 */
 	int temp_priority[MAX_NR_ZONES];
 
@@ -1879,8 +1882,8 @@ loop_again:
 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
 							&sc, priority, 0);
 
-			if (!zone_watermark_ok(zone, order, zone->pages_high,
-					       0, 0)) {
+			if (!zone_watermark_ok(zone, order,
+					high_wmark_pages(zone), 0, 0)) {
 				end_zone = i;
 				break;
 			}
@@ -1914,8 +1917,8 @@ loop_again:
 					priority != DEF_PRIORITY)
 				continue;
 
-			if (!zone_watermark_ok(zone, order, zone->pages_high,
-					       end_zone, 0))
+			if (!zone_watermark_ok(zone, order,
+					high_wmark_pages(zone), end_zone, 0))
 				all_zones_ok = 0;
 			temp_priority[i] = priority;
 			sc.nr_scanned = 0;
@@ -1924,8 +1927,8 @@ loop_again:
 			 * We put equal pressure on every zone, unless one
 			 * zone has way too many pages free already.
 			 */
-			if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
-						end_zone, 0))
+			if (!zone_watermark_ok(zone, order,
+					8*high_wmark_pages(zone), end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
@@ -2091,7 +2094,7 @@ void wakeup_kswapd(struct zone *zone, int order)
 		return;
 
 	pgdat = zone->zone_pgdat;
-	if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0))
+	if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
 		return;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b42d06c..5e101d7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -725,9 +725,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
-		   zone->pages_min,
-		   zone->pages_low,
-		   zone->pages_high,
+		   min_wmark_pages(zone),
+		   low_wmark_pages(zone),
+		   high_wmark_pages(zone),
 		   zone->pages_scanned,
 		   zone->spanned_pages,
 		   zone->present_pages);
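
For readers skimming the hunk at the top of this diff, the detail worth calling out is that the low bits of alloc_flags double as an index into zone->watermark[], and the BUILD_BUG_ON only has to guarantee that ALLOC_NO_WATERMARKS lies above NR_WMARK so a flags value can never alias a real index. Below is a self-contained sketch of that indexing scheme; the flag values and the struct zone layout are illustrative stand-ins, not the kernel definitions.

#include <stdio.h>

/* Watermark indices, mirroring the WMARK_* enum the patch introduces. */
enum zone_watermarks { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

/*
 * Illustrative allocation flags: the low bits select a watermark and
 * ALLOC_NO_WATERMARKS sits in the first bit above them, so masking with
 * ALLOC_WMARK_MASK always yields a valid array index.
 */
#define ALLOC_NO_WATERMARKS	0x04
#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS - 1)
#define ALLOC_WMARK_MIN		WMARK_MIN
#define ALLOC_WMARK_LOW		WMARK_LOW
#define ALLOC_WMARK_HIGH	WMARK_HIGH

struct zone { unsigned long watermark[NR_WMARK]; };

static unsigned long wmark_for_flags(struct zone *zone, int alloc_flags)
{
	/* Compile-time check in the spirit of the BUILD_BUG_ON above. */
	_Static_assert(ALLOC_NO_WATERMARKS >= NR_WMARK,
		       "flag bit would alias a watermark index");

	/* Select the watermark straight from the flags, no branching. */
	return zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
}

int main(void)
{
	struct zone zone = { { 128, 160, 192 } };	/* made-up min/low/high */

	printf("low  watermark: %lu\n", wmark_for_flags(&zone, ALLOC_WMARK_LOW));
	printf("high watermark: %lu\n", wmark_for_flags(&zone, ALLOC_WMARK_HIGH));
	return 0;
}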


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array V2
  2009-05-12 14:13                   ` Mel Gorman
@ 2009-05-12 15:05                     ` Dave Hansen
  -1 siblings, 0 replies; 186+ messages in thread
From: Dave Hansen @ 2009-05-12 15:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Linux Memory Management List, KOSAKI Motohiro,
	Linux Kernel Mailing List

On Tue, 2009-05-12 at 15:13 +0100, Mel Gorman wrote:
> Changelog since V1
>   o Use N_wmark_pages accessors instead of array accesses
> 
> Patch page-allocator-use-allocation-flags-as-an-index-to-the-zone-watermark
> from -mm added a union to struct zone where the watermarks could be accessed
> with either zone->pages_* or a pages_mark array. The concern was that this
> aliasing caused more confusion than it helped.
> 
> This patch replaces the union with a watermark array that is indexed with
> WMARK_* defines accessed via helpers. All call sites that use zone->pages_*
> are updated to use the helpers for accessing the values and the array
> offsets for setting.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> --- 
>  Documentation/sysctl/vm.txt |   11 ++++++-----
>  Documentation/vm/balance    |   18 +++++++++---------
>  arch/m32r/mm/discontig.c    |    6 +++---
>  include/linux/mmzone.h      |   20 ++++++++++++++------
>  mm/page_alloc.c             |   41 +++++++++++++++++++++--------------------
>  mm/vmscan.c                 |   39 +++++++++++++++++++++------------------
>  mm/vmstat.c                 |    6 +++---
>  7 files changed, 77 insertions(+), 64 deletions(-)
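
To make the quoted changelog concrete: readers go through the small *_wmark_pages() helpers, while the setup path writes the array slots directly. What follows is a minimal standalone sketch of that pattern assembled from the names used in the patch; the struct zone layout and the value of tmp are made up for illustration.

#include <stdio.h>

/* One array indexed by WMARK_* replaces pages_min/pages_low/pages_high. */
enum zone_watermarks { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

struct zone { unsigned long watermark[NR_WMARK]; };

/* Read accessors, as used throughout the converted call sites. */
#define min_wmark_pages(z)  ((z)->watermark[WMARK_MIN])
#define low_wmark_pages(z)  ((z)->watermark[WMARK_LOW])
#define high_wmark_pages(z) ((z)->watermark[WMARK_HIGH])

int main(void)
{
	struct zone zone = { { 0 } };
	unsigned long tmp = 1024;	/* stand-in for this zone's share of min_free_kbytes */

	/* Writes use the array offsets, as in setup_per_zone_pages_min(). */
	zone.watermark[WMARK_MIN]  = tmp;
	zone.watermark[WMARK_LOW]  = min_wmark_pages(&zone) + (tmp >> 2);
	zone.watermark[WMARK_HIGH] = min_wmark_pages(&zone) + (tmp >> 1);

	printf("min=%lu low=%lu high=%lu\n", min_wmark_pages(&zone),
	       low_wmark_pages(&zone), high_wmark_pages(&zone));
	return 0;
}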

Looks nice.  It net adds a few lines of code, but that's mostly from the
#defines, not from added complexity or lots of line wrapping.

Acked-by: Dave Hansen <dave@linux.vnet.ibm.com>

-- Dave


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array V2
  2009-05-12 14:13                   ` Mel Gorman
@ 2009-05-13  8:31                     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 186+ messages in thread
From: KOSAKI Motohiro @ 2009-05-13  8:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Dave Hansen, David Rientjes,
	Linux Memory Management List, Linux Kernel Mailing List

> Changelog since V1
>   o Use N_wmark_pages accessors instead of array accesses
> 
> Patch page-allocator-use-allocation-flags-as-an-index-to-the-zone-watermark
> from -mm added a union to struct zone where the watermarks could be accessed
> with either zone->pages_* or a pages_mark array. The concern was that this
> aliasing caused more confusion than it helped.
> 
> This patch replaces the union with a watermark array that is indexed with
> WMARK_* defines accessed via helpers. All call sites that use zone->pages_*
> are updated to use the helpers for accessing the values and the array
> offsets for setting.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

looks good to me :)





^ permalink raw reply	[flat|nested] 186+ messages in thread

end of thread, other threads:[~2009-05-13  8:32 UTC | newest]

Thread overview: 186+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-22 13:53 [PATCH 00/22] Cleanup and optimise the page allocator V7 Mel Gorman
2009-04-22 13:53 ` Mel Gorman
2009-04-22 13:53 ` [PATCH 01/22] Replace __alloc_pages_internal() with __alloc_pages_nodemask() Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 13:53 ` [PATCH 02/22] Do not sanity check order in the fast path Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 16:13   ` Dave Hansen
2009-04-22 16:13     ` Dave Hansen
2009-04-22 17:11     ` Mel Gorman
2009-04-22 17:11       ` Mel Gorman
2009-04-22 17:30       ` Dave Hansen
2009-04-22 17:30         ` Dave Hansen
2009-04-23  0:13         ` Mel Gorman
2009-04-23  0:13           ` Mel Gorman
2009-04-23  1:34           ` Dave Hansen
2009-04-23  1:34             ` Dave Hansen
2009-04-23  9:58             ` Mel Gorman
2009-04-23  9:58               ` Mel Gorman
2009-04-23 17:36               ` Dave Hansen
2009-04-23 17:36                 ` Dave Hansen
2009-04-24  2:57                 ` KOSAKI Motohiro
2009-04-24  2:57                   ` KOSAKI Motohiro
2009-04-24 10:34                 ` Mel Gorman
2009-04-24 10:34                   ` Mel Gorman
2009-04-24 14:16                   ` Dave Hansen
2009-04-24 14:16                     ` Dave Hansen
2009-04-23 19:26             ` Dave Hansen
2009-04-23 19:26               ` Dave Hansen
2009-04-23 19:45               ` Dave Hansen
2009-04-23 19:45                 ` Dave Hansen
2009-04-24  9:21                 ` Mel Gorman
2009-04-24  9:21                   ` Mel Gorman
2009-04-24 14:25                   ` Dave Hansen
2009-04-24 14:25                     ` Dave Hansen
2009-04-22 20:11       ` David Rientjes
2009-04-22 20:11         ` David Rientjes
2009-04-22 20:20         ` Christoph Lameter
2009-04-22 20:20           ` Christoph Lameter
2009-04-23  7:44         ` Pekka Enberg
2009-04-23  7:44           ` Pekka Enberg
2009-04-23 22:44       ` Andrew Morton
2009-04-23 22:44         ` Andrew Morton
2009-04-22 13:53 ` [PATCH 03/22] Do not check NUMA node ID when the caller knows the node is valid Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 13:53 ` [PATCH 04/22] Check only once if the zonelist is suitable for the allocation Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 13:53 ` [PATCH 05/22] Break up the allocator entry point into fast and slow paths Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 13:53 ` [PATCH 06/22] Move check for disabled anti-fragmentation out of fastpath Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 13:53 ` [PATCH 07/22] Calculate the preferred zone for allocation only once Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-23 22:48   ` Andrew Morton
2009-04-23 22:48     ` Andrew Morton
2009-04-22 13:53 ` [PATCH 08/22] Calculate the migratetype " Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 13:53 ` [PATCH 09/22] Calculate the alloc_flags " Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-23 22:52   ` Andrew Morton
2009-04-23 22:52     ` Andrew Morton
2009-04-24 10:47     ` Mel Gorman
2009-04-24 10:47       ` Mel Gorman
2009-04-24 17:51       ` Andrew Morton
2009-04-24 17:51         ` Andrew Morton
2009-04-22 13:53 ` [PATCH 10/22] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 13:53 ` [PATCH 11/22] Inline __rmqueue_smallest() Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 13:53 ` [PATCH 12/22] Inline buffered_rmqueue() Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 13:53 ` [PATCH 13/22] Inline __rmqueue_fallback() Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 13:53 ` [PATCH 14/22] Do not call get_pageblock_migratetype() more than necessary Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 13:53 ` [PATCH 15/22] Do not disable interrupts in free_page_mlock() Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-23 22:59   ` Andrew Morton
2009-04-23 22:59     ` Andrew Morton
2009-04-24  0:07     ` KOSAKI Motohiro
2009-04-24  0:07       ` KOSAKI Motohiro
2009-04-24  0:33     ` KOSAKI Motohiro
2009-04-24  0:33       ` KOSAKI Motohiro
2009-04-24 11:33       ` Mel Gorman
2009-04-24 11:33         ` Mel Gorman
2009-04-24 11:52         ` Lee Schermerhorn
2009-04-24 11:52           ` Lee Schermerhorn
2009-04-24 11:18     ` Mel Gorman
2009-04-24 11:18       ` Mel Gorman
2009-04-22 13:53 ` [PATCH 16/22] Do not setup zonelist cache when there is only one node Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 20:24   ` David Rientjes
2009-04-22 20:24     ` David Rientjes
2009-04-22 20:32     ` Lee Schermerhorn
2009-04-22 20:32       ` Lee Schermerhorn
2009-04-22 20:34       ` David Rientjes
2009-04-22 20:34         ` David Rientjes
2009-04-23  0:11         ` KOSAKI Motohiro
2009-04-23  0:11           ` KOSAKI Motohiro
2009-04-23  0:19     ` Mel Gorman
2009-04-23  0:19       ` Mel Gorman
2009-04-22 13:53 ` [PATCH 17/22] Do not check for compound pages during the page allocator sanity checks Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 13:53 ` [PATCH 18/22] Use allocation flags as an index to the zone watermark Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 17:11   ` Dave Hansen
2009-04-22 17:11     ` Dave Hansen
2009-04-22 17:14     ` Mel Gorman
2009-04-22 17:14       ` Mel Gorman
2009-04-22 17:47       ` Dave Hansen
2009-04-22 17:47         ` Dave Hansen
2009-04-23  0:27         ` KOSAKI Motohiro
2009-04-23  0:27           ` KOSAKI Motohiro
2009-04-23 10:03           ` Mel Gorman
2009-04-23 10:03             ` Mel Gorman
2009-04-24  6:41             ` KOSAKI Motohiro
2009-04-24  6:41               ` KOSAKI Motohiro
2009-04-22 20:06   ` David Rientjes
2009-04-22 20:06     ` David Rientjes
2009-04-23  0:29     ` Mel Gorman
2009-04-23  0:29       ` Mel Gorman
2009-04-27 17:00     ` [RFC] Replace the watermark-related union in struct zone with a watermark[] array Mel Gorman
2009-04-27 17:00       ` Mel Gorman
2009-04-27 20:48       ` David Rientjes
2009-04-27 20:48         ` David Rientjes
2009-04-27 20:54         ` Mel Gorman
2009-04-27 20:54           ` Mel Gorman
2009-04-27 20:51           ` Christoph Lameter
2009-04-27 20:51             ` Christoph Lameter
2009-04-27 21:04           ` David Rientjes
2009-04-27 21:04             ` David Rientjes
2009-04-30 13:35             ` Mel Gorman
2009-04-30 13:35               ` Mel Gorman
2009-04-30 13:48               ` Dave Hansen
2009-04-30 13:48                 ` Dave Hansen
2009-05-12 14:13                 ` [RFC] Replace the watermark-related union in struct zone with a watermark[] array V2 Mel Gorman
2009-05-12 14:13                   ` Mel Gorman
2009-05-12 15:05                   ` [RFC] Replace the watermark-related union in struct zone with a watermark[] " Dave Hansen
2009-05-12 15:05                     ` Dave Hansen
2009-05-13  8:31                   ` [RFC] Replace the watermark-related union in struct zone with a watermark[] " KOSAKI Motohiro
2009-05-13  8:31                     ` KOSAKI Motohiro
2009-04-22 13:53 ` [PATCH 19/22] Update NR_FREE_PAGES only as necessary Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-23 23:06   ` Andrew Morton
2009-04-23 23:06     ` Andrew Morton
2009-04-23 23:04     ` Christoph Lameter
2009-04-23 23:04       ` Christoph Lameter
2009-04-24 13:06     ` Mel Gorman
2009-04-24 13:06       ` Mel Gorman
2009-04-22 13:53 ` [PATCH 20/22] Get the pageblock migratetype without disabling interrupts Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 13:53 ` [PATCH 21/22] Use a pre-calculated value instead of num_online_nodes() in fast paths Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 23:04   ` David Rientjes
2009-04-22 23:04     ` David Rientjes
2009-04-23  0:44     ` Mel Gorman
2009-04-23  0:44       ` Mel Gorman
2009-04-23 19:29       ` David Rientjes
2009-04-23 19:29         ` David Rientjes
2009-04-24 13:31         ` [PATCH] Do not override definition of node_set_online() with macro Mel Gorman
2009-04-24 13:31           ` Mel Gorman
2009-04-22 13:53 ` [PATCH 22/22] slab: Use nr_online_nodes to check for a NUMA platform Mel Gorman
2009-04-22 13:53   ` Mel Gorman
2009-04-22 14:37   ` Pekka Enberg
2009-04-22 14:37     ` Pekka Enberg
2009-04-27  7:58 ` [PATCH 00/22] Cleanup and optimise the page allocator V7 Zhang, Yanmin
2009-04-27  7:58   ` Zhang, Yanmin
2009-04-27 14:38   ` Mel Gorman
2009-04-27 14:38     ` Mel Gorman
2009-04-28  1:59     ` Zhang, Yanmin
2009-04-28  1:59       ` Zhang, Yanmin
2009-04-28 10:27       ` Mel Gorman
2009-04-28 10:27         ` Mel Gorman
2009-04-28 10:31       ` [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue() Mel Gorman
2009-04-28 10:31         ` Mel Gorman
2009-04-28 16:37         ` Christoph Lameter
2009-04-28 16:37           ` Christoph Lameter
2009-04-28 16:51           ` Mel Gorman
2009-04-28 16:51             ` Mel Gorman
2009-04-28 17:15             ` Hugh Dickins
2009-04-28 17:15               ` Hugh Dickins
2009-04-28 18:07               ` [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue() V2 Mel Gorman
2009-04-28 18:07                 ` Mel Gorman
2009-04-28 18:25                 ` Hugh Dickins
2009-04-28 18:25                   ` Hugh Dickins
2009-04-28 18:36               ` [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue() Mel Gorman
2009-04-28 18:36                 ` Mel Gorman
