* [RFC PATCH 00/20] Cleanup and optimise the page allocator
@ 2009-02-22 23:17 ` Mel Gorman
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

The complexity of the page allocator has been increasing for some time
and it has now reached the point where the SLUB allocator is doing strange
tricks to avoid the page allocator. This is obviously bad as it may encourage
other subsystems to try avoiding the page allocator as well.

This series of patches is intended to reduce the cost of the page
allocator by doing the following.

Patches 1-3 iron out the entry paths slightly and remove stupid sanity
checks from the fast path.

Patch 4 uses a lookup table instead of a number of branches to decide what
zones are usable given the GFP flags.
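
As a quick illustration (this is the shape it takes in patch 4 below, so
see that patch for the real thing), the per-allocation branches collapse
into a single array lookup indexed by the zone modifier bits:

	/* table filled once at boot by init_gfp_zone_table() */
	extern int gfp_zone_table[GFP_ZONEMASK];

	static inline enum zone_type gfp_zone(gfp_t flags)
	{
		return gfp_zone_table[flags & GFP_ZONEMASK];
	}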

Patch 5 avoids repeated checks of the zonelist

Patch 6 breaks the allocator up into a fast and slow path where the fast
path later becomes one long inlined function.

Patches 7-10 avoid calculating the same things repeatedly and instead
calculate them once.

Patches 11-13 inline the whole allocator fast path

Patch 14 avoids calling get_pageblock_migratetype() potentially twice on
every page free
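
One plausible shape for this (a sketch of the idea rather than
necessarily what patch 14 does) is to look the migratetype up once when
the page is freed and stash it in page->private for the later buddy-free
step to reuse:

	/* at initial free time */
	migratetype = get_pageblock_migratetype(page);
	set_page_private(page, migratetype);

	/*
	 * ... later, where the page is actually handed back to the buddy
	 * lists, read page_private(page) instead of calling
	 * get_pageblock_migratetype() a second time ...
	 */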

Patch 15 reduces the number of times interrupts are disabled by reworking
what free_page_mlock() does. However, I notice that the cost of calling
TestClearPageMlocked() is still quite high and I'm guessing it's because
it's a locked bit operation. It would be nice if it could be established
whether it's safe to use an unlocked version here. Rik, can you comment?
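
For reference, the locked/unlocked distinction is the one between the
atomic and non-atomic bitops. In terms of the generic bit operations
(rather than the page-flag wrapper itself), the question is whether this
path could safely go from the first form below to the second:

	/* atomic: a lock-prefixed read-modify-write on x86 */
	was_mlocked = test_and_clear_bit(PG_mlocked, &page->flags);

	/* non-atomic: no lock prefix, but only safe if no other CPU can
	 * be modifying page->flags at this point in the free path */
	was_mlocked = __test_and_clear_bit(PG_mlocked, &page->flags);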

Patch 16 avoids using the zonelist cache on non-NUMA machines

Patch 17 removes an expensive and excessively paranoid check in the
allocator fast path

Patch 18 avoids a list search in the allocator fast path.

Patch 19 avoids repeated checking of an empty list.

Patch 20 gets rid of hot/cold freeing of pages because it incurs cost for
what I believe to be very dubious gain. I'm not sure we currently gain
anything by it but it's further discussed in the patch itself.

Running all of these through a profiler shows me the cost of page allocation
and freeing is reduced by a nice amount without drastically altering how the
allocator actually works. Excluding the cost of zeroing pages, the cost of
allocation is reduced by 25% and the cost of freeing by 12%.  Again excluding
zeroing a page, much of the remaining cost is due to counters, debugging
checks and interrupt disabling.  Of course when a page has to be zeroed,
the dominant cost of a page allocation is zeroing it.

Counters are surprisingly expensive; we spend a good chunk of our time in
functions like __dec_zone_page_state and __dec_zone_state. In a profiled
run of kernbench, the time spent in __dec_zone_state was roughly equal to
the combined cost of the rest of the page free path. A quick check showed
that almost half of the time in that function is spent on line 233 alone,
which for me is:

	(*p)--;

That's worth a separate investigation, but it might be the case that
manipulating an int8_t on the machine I was using for profiling is
unusually expensive. Converting this to an int might be faster but the
increased memory consumption and cache footprint might be a problem.
Opinions?
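
For anyone reading along, the function in question looks roughly like
this (paraphrased from mm/vmstat.c of this era, so treat the exact field
and helper names as approximate). The deltas are kept as signed 8-bit
per-cpu values and only folded into the global counter when they cross
stat_threshold:

	void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
	{
		struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
		s8 *p = pcp->vm_stat_diff + item;

		(*p)--;		/* the line the profile points at */

		if (unlikely(*p < -pcp->stat_threshold)) {
			int overstep = pcp->stat_threshold / 2;

			zone_page_state_add(*p - overstep, zone, item);
			*p = overstep;
		}
	}

Widening vm_stat_diff[] from s8 to int would grow that array four-fold
in every per-cpu pageset, which is the memory and cache footprint
concern above.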

The downside is that the patches do increase text size because of the
splitting of the fast path into one inlined blob and the slow path into a
number of other functions. On my test machine, text increased by 1.2K, so
I might revisit that and see how much of a difference the inlining really
makes.

That all said, I'm seeing good results on actual benchmarks with these
patches.

o On many machines, I'm seeing a 0-2% improvement on kernbench. The dominant
  cost in kernbench is the compiler and zeroing allocated pages for
  pagetables.

o For tbench, I have seen an 8-12% improvement on two x86-64 machines (elm3b6
  on test.kernel.org gained 8%) but generally it was less dramatic on
  x86-64, in the range of 0-4%. On one PPC64, the difference was also in the
  range of 0-4%. Generally there were gains, but one specific ppc64 showed a
  regression of 7% for one client and a negligible difference for 8 clients.
  It's not clear why this machine regressed and others didn't.

o hackbench is harder to conclude anything from. Most machines showed
  performance gains in the 5-11% range but one machine in particular showed
  a mix of gains and losses depending on the number of clients. Might be
  a caching thing.

o One machine in particular was a major surprise for sysbench, with gains
  of 4-8% which were drastically higher than I was expecting. However, on
  other machines it was in the more reasonable 0-4% range, which is still
  pretty respectable. It's not guaranteed though. While most machines showed
  some sort of gain, one ppc64 showed no difference at all.

So, by and large it's an improvement of some sort.

I haven't run a page-allocator micro-benchmark to see what sort of figures
that gives. Christoph, I recall you had some sort of page allocator
micro-benchmark. Do you want to give it a shot or remind me how to use
it please?

All other reviews, comments and alternative benchmark reports are welcome.

 arch/ia64/hp/common/sba_iommu.c   |    2 +-
 arch/ia64/kernel/mca.c            |    3 +-
 arch/ia64/kernel/uncached.c       |    3 +-
 arch/ia64/sn/pci/pci_dma.c        |    3 +-
 arch/powerpc/platforms/cell/ras.c |    2 +-
 arch/x86/kvm/vmx.c                |    2 +-
 drivers/misc/sgi-gru/grufile.c    |    2 +-
 drivers/misc/sgi-xp/xpc_uv.c      |    2 +-
 fs/afs/write.c                    |    4 +-
 fs/btrfs/compression.c            |    2 +-
 fs/btrfs/extent_io.c              |    4 +-
 fs/btrfs/ordered-data.c           |    2 +-
 fs/cifs/file.c                    |    4 +-
 fs/gfs2/ops_address.c             |    2 +-
 fs/hugetlbfs/inode.c              |    2 +-
 fs/nfs/dir.c                      |    2 +-
 fs/ntfs/file.c                    |    2 +-
 fs/ramfs/file-nommu.c             |    2 +-
 fs/xfs/linux-2.6/xfs_aops.c       |    4 +-
 include/linux/gfp.h               |   58 ++--
 include/linux/mm.h                |    1 -
 include/linux/mmzone.h            |    8 +-
 include/linux/pagemap.h           |    2 +-
 include/linux/pagevec.h           |    4 +-
 include/linux/swap.h              |    2 +-
 init/main.c                       |    1 +
 kernel/profile.c                  |    8 +-
 mm/filemap.c                      |    4 +-
 mm/hugetlb.c                      |    4 +-
 mm/internal.h                     |   10 +-
 mm/mempolicy.c                    |    2 +-
 mm/migrate.c                      |    2 +-
 mm/page-writeback.c               |    2 +-
 mm/page_alloc.c                   |  646 ++++++++++++++++++++++-----------
 mm/slab.c                         |    4 +-
 mm/slob.c                         |    4 +-
 mm/slub.c                         |    5 +-
 mm/swap.c                         |   12 +-
 mm/swap_state.c                   |    2 +-
 mm/truncate.c                     |    6 +-
 mm/vmalloc.c                      |    6 +-
 mm/vmscan.c                       |    8 +-
 42 files changed, 517 insertions(+), 333 deletions(-)


* [PATCH 01/20] Replace __alloc_pages_internal() with __alloc_pages_nodemask()
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

__alloc_pages_internal is the core page allocator function but
essentially it is an alias of __alloc_pages_nodemask. Naming a publicly
available and exported function "internal" is also a bit ugly. This
patch renames __alloc_pages_internal() to __alloc_pages_nodemask() and
deletes the old nodemask function.

Warning - This patch renames an exported symbol. No in-tree kernel driver
is affected, but external drivers calling __alloc_pages_internal() should
change the call to __alloc_pages_nodemask() without any alteration of
parameters.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/gfp.h |   12 ++----------
 mm/page_alloc.c     |    4 ++--
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index dd20cd7..dcf0ab8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -168,24 +168,16 @@ static inline void arch_alloc_page(struct page *page, int order) { }
 #endif
 
 struct page *
-__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		       struct zonelist *zonelist, nodemask_t *nodemask);
 
 static inline struct page *
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
 		struct zonelist *zonelist)
 {
-	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+	return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
 }
 
-static inline struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist, nodemask_t *nodemask)
-{
-	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
-}
-
-
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5675b30..61051d5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1464,7 +1464,7 @@ try_next_zone:
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page *
-__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
@@ -1670,7 +1670,7 @@ nopage:
 got_pg:
 	return page;
 }
-EXPORT_SYMBOL(__alloc_pages_internal);
+EXPORT_SYMBOL(__alloc_pages_nodemask);
 
 /*
  * Common helper functions.
-- 
1.5.6.5


* [PATCH 02/20] Do not sanity check order in the fast path
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

No user of the allocator API should be passing in an order >= MAX_ORDER
but we check for it on each and every allocation. Delete this check and
make it a VM_BUG_ON check further down the call path.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/gfp.h |    6 ------
 mm/page_alloc.c     |    2 ++
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index dcf0ab8..8736047 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -181,9 +181,6 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
-	if (unlikely(order >= MAX_ORDER))
-		return NULL;
-
 	/* Unknown node is current node */
 	if (nid < 0)
 		nid = numa_node_id();
@@ -197,9 +194,6 @@ extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
 static inline struct page *
 alloc_pages(gfp_t gfp_mask, unsigned int order)
 {
-	if (unlikely(order >= MAX_ORDER))
-		return NULL;
-
 	return alloc_pages_current(gfp_mask, order);
 }
 extern struct page *alloc_page_vma(gfp_t gfp_mask,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 61051d5..c3842f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1407,6 +1407,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 
 	classzone_idx = zone_idx(preferred_zone);
 
+	VM_BUG_ON(order >= MAX_ORDER);
+
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
-- 
1.5.6.5


* [PATCH 03/20] Do not check NUMA node ID when the caller knows the node is valid
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

Callers of alloc_pages_node() can optionally specify -1 as a node to mean
"allocate from the current node". However, a number of the callers in fast
paths know for a fact their node is valid. To avoid a comparison and branch,
this patch adds alloc_pages_exact_node() that only checks the nid with
VM_BUG_ON(). Callers that know their node is valid are then converted.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 arch/ia64/hp/common/sba_iommu.c   |    2 +-
 arch/ia64/kernel/mca.c            |    3 +--
 arch/ia64/kernel/uncached.c       |    3 ++-
 arch/ia64/sn/pci/pci_dma.c        |    3 ++-
 arch/powerpc/platforms/cell/ras.c |    2 +-
 arch/x86/kvm/vmx.c                |    2 +-
 drivers/misc/sgi-gru/grufile.c    |    2 +-
 drivers/misc/sgi-xp/xpc_uv.c      |    2 +-
 include/linux/gfp.h               |    9 +++++++++
 include/linux/mm.h                |    1 -
 kernel/profile.c                  |    8 ++++----
 mm/filemap.c                      |    2 +-
 mm/hugetlb.c                      |    4 ++--
 mm/mempolicy.c                    |    2 +-
 mm/migrate.c                      |    2 +-
 mm/slab.c                         |    4 ++--
 mm/slob.c                         |    4 ++--
 mm/slub.c                         |    5 +----
 mm/vmalloc.c                      |    6 +-----
 19 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 6d5e6c5..66a3257 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -1116,7 +1116,7 @@ sba_alloc_coherent (struct device *dev, size_t size, dma_addr_t *dma_handle, gfp
 #ifdef CONFIG_NUMA
 	{
 		struct page *page;
-		page = alloc_pages_node(ioc->node == MAX_NUMNODES ?
+		page = alloc_pages_exact_node(ioc->node == MAX_NUMNODES ?
 		                        numa_node_id() : ioc->node, flags,
 		                        get_order(size));
 
diff --git a/arch/ia64/kernel/mca.c b/arch/ia64/kernel/mca.c
index bab1de2..2e614bd 100644
--- a/arch/ia64/kernel/mca.c
+++ b/arch/ia64/kernel/mca.c
@@ -1829,8 +1829,7 @@ ia64_mca_cpu_init(void *cpu_data)
 			data = mca_bootmem();
 			first_time = 0;
 		} else
-			data = page_address(alloc_pages_node(numa_node_id(),
-					GFP_KERNEL, get_order(sz)));
+			data = __get_free_pages(GFP_KERNEL, get_order(sz));
 		if (!data)
 			panic("Could not allocate MCA memory for cpu %d\n",
 					cpu);
diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index 8eff8c1..6ba72ab 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -98,7 +98,8 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)
 
 	/* attempt to allocate a granule's worth of cached memory pages */
 
-	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+	page = alloc_pages_exact_node(nid,
+				GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				IA64_GRANULE_SHIFT-PAGE_SHIFT);
 	if (!page) {
 		mutex_unlock(&uc_pool->add_chunk_mutex);
diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c
index 863f501..2aa52de 100644
--- a/arch/ia64/sn/pci/pci_dma.c
+++ b/arch/ia64/sn/pci/pci_dma.c
@@ -91,7 +91,8 @@ void *sn_dma_alloc_coherent(struct device *dev, size_t size,
 	 */
 	node = pcibus_to_node(pdev->bus);
 	if (likely(node >=0)) {
-		struct page *p = alloc_pages_node(node, flags, get_order(size));
+		struct page *p = alloc_pages_exact_node(node,
+						flags, get_order(size));
 
 		if (likely(p))
 			cpuaddr = page_address(p);
diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c
index 5f961c4..16ba671 100644
--- a/arch/powerpc/platforms/cell/ras.c
+++ b/arch/powerpc/platforms/cell/ras.c
@@ -122,7 +122,7 @@ static int __init cbe_ptcal_enable_on_node(int nid, int order)
 
 	area->nid = nid;
 	area->order = order;
-	area->pages = alloc_pages_node(area->nid, GFP_KERNEL, area->order);
+	area->pages = alloc_pages_exact_node(area->nid, GFP_KERNEL, area->order);
 
 	if (!area->pages)
 		goto out_free_area;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7611af5..cca119a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1244,7 +1244,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
 	struct page *pages;
 	struct vmcs *vmcs;
 
-	pages = alloc_pages_node(node, GFP_KERNEL, vmcs_config.order);
+	pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
 	if (!pages)
 		return NULL;
 	vmcs = page_address(pages);
diff --git a/drivers/misc/sgi-gru/grufile.c b/drivers/misc/sgi-gru/grufile.c
index 6509838..52d4160 100644
--- a/drivers/misc/sgi-gru/grufile.c
+++ b/drivers/misc/sgi-gru/grufile.c
@@ -309,7 +309,7 @@ static int gru_init_tables(unsigned long gru_base_paddr, void *gru_base_vaddr)
 		pnode = uv_node_to_pnode(nid);
 		if (gru_base[bid])
 			continue;
-		page = alloc_pages_node(nid, GFP_KERNEL, order);
+		page = alloc_pages_exact_node(nid, GFP_KERNEL, order);
 		if (!page)
 			goto fail;
 		gru_base[bid] = page_address(page);
diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
index 29c0502..0563350 100644
--- a/drivers/misc/sgi-xp/xpc_uv.c
+++ b/drivers/misc/sgi-xp/xpc_uv.c
@@ -184,7 +184,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
 	mq->mmr_blade = uv_cpu_to_blade_id(cpu);
 
 	nid = cpu_to_node(cpu);
-	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+	page = alloc_pages_exact_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				pg_order);
 	if (page == NULL) {
 		dev_err(xpc_part, "xpc_create_gru_mq_uv() failed to alloc %d "
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 8736047..59eb093 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -4,6 +4,7 @@
 #include <linux/mmzone.h>
 #include <linux/stddef.h>
 #include <linux/linkage.h>
+#include <linux/mmdebug.h>
 
 struct vm_area_struct;
 
@@ -188,6 +189,14 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
 
+static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
+						unsigned int order)
+{
+	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+
+	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+}
+
 #ifdef CONFIG_NUMA
 extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7dc04ff..954e945 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -7,7 +7,6 @@
 
 #include <linux/gfp.h>
 #include <linux/list.h>
-#include <linux/mmdebug.h>
 #include <linux/mmzone.h>
 #include <linux/rbtree.h>
 #include <linux/prio_tree.h>
diff --git a/kernel/profile.c b/kernel/profile.c
index 7724e04..62e08db 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -371,7 +371,7 @@ static int __cpuinit profile_cpu_callback(struct notifier_block *info,
 		node = cpu_to_node(cpu);
 		per_cpu(cpu_profile_flip, cpu) = 0;
 		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
-			page = alloc_pages_node(node,
+			page = alloc_pages_exact_node(node,
 					GFP_KERNEL | __GFP_ZERO,
 					0);
 			if (!page)
@@ -379,7 +379,7 @@ static int __cpuinit profile_cpu_callback(struct notifier_block *info,
 			per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
 		}
 		if (!per_cpu(cpu_profile_hits, cpu)[0]) {
-			page = alloc_pages_node(node,
+			page = alloc_pages_exact_node(node,
 					GFP_KERNEL | __GFP_ZERO,
 					0);
 			if (!page)
@@ -570,14 +570,14 @@ static int create_hash_tables(void)
 		int node = cpu_to_node(cpu);
 		struct page *page;
 
-		page = alloc_pages_node(node,
+		page = alloc_pages_exact_node(node,
 				GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				0);
 		if (!page)
 			goto out_cleanup;
 		per_cpu(cpu_profile_hits, cpu)[1]
 				= (struct profile_hit *)page_address(page);
-		page = alloc_pages_node(node,
+		page = alloc_pages_exact_node(node,
 				GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				0);
 		if (!page)
diff --git a/mm/filemap.c b/mm/filemap.c
index 23acefe..2523d95 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -519,7 +519,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
 {
 	if (cpuset_do_page_mem_spread()) {
 		int n = cpuset_mem_spread_node();
-		return alloc_pages_node(n, gfp, 0);
+		return alloc_pages_exact_node(n, gfp, 0);
 	}
 	return alloc_pages(gfp, 0);
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 107da3d..1e99997 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -630,7 +630,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 	if (h->order >= MAX_ORDER)
 		return NULL;
 
-	page = alloc_pages_node(nid,
+	page = alloc_pages_exact_node(nid,
 		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
 						__GFP_REPEAT|__GFP_NOWARN,
 		huge_page_order(h));
@@ -649,7 +649,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
  * Use a helper variable to find the next node and then
  * copy it back to hugetlb_next_nid afterwards:
  * otherwise there's a window in which a racer might
- * pass invalid nid MAX_NUMNODES to alloc_pages_node.
+ * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
  * But we don't need to use a spin_lock here: it really
  * doesn't matter if occasionally a racer chooses the
  * same nid as we do.  Move nid forward in the mask even
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3eb4a6f..341fbca 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -767,7 +767,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
 
 static struct page *new_node_page(struct page *page, unsigned long node, int **x)
 {
-	return alloc_pages_node(node, GFP_HIGHUSER_MOVABLE, 0);
+	return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE, 0);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
index a9eff3f..6bda9c2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -802,7 +802,7 @@ static struct page *new_page_node(struct page *p, unsigned long private,
 
 	*result = &pm->status;
 
-	return alloc_pages_node(pm->node,
+	return alloc_pages_exact_node(pm->node,
 				GFP_HIGHUSER_MOVABLE | GFP_THISNODE, 0);
 }
 
diff --git a/mm/slab.c b/mm/slab.c
index 4d00855..e7f1ded 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1680,7 +1680,7 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		flags |= __GFP_RECLAIMABLE;
 
-	page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+	page = alloc_pages_exact_node(nodeid, flags, cachep->gfporder);
 	if (!page)
 		return NULL;
 
@@ -3210,7 +3210,7 @@ retry:
 		if (local_flags & __GFP_WAIT)
 			local_irq_enable();
 		kmem_flagcheck(cache, flags);
-		obj = kmem_getpages(cache, local_flags, -1);
+		obj = kmem_getpages(cache, local_flags, numa_node_id());
 		if (local_flags & __GFP_WAIT)
 			local_irq_disable();
 		if (obj) {
diff --git a/mm/slob.c b/mm/slob.c
index 52bc8a2..d646a4c 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -46,7 +46,7 @@
  * NUMA support in SLOB is fairly simplistic, pushing most of the real
  * logic down to the page allocator, and simply doing the node accounting
  * on the upper levels. In the event that a node id is explicitly
- * provided, alloc_pages_node() with the specified node id is used
+ * provided, alloc_pages_exact_node() with the specified node id is used
  * instead. The common case (or when the node id isn't explicitly provided)
  * will default to the current node, as per numa_node_id().
  *
@@ -236,7 +236,7 @@ static void *slob_new_page(gfp_t gfp, int order, int node)
 
 #ifdef CONFIG_NUMA
 	if (node != -1)
-		page = alloc_pages_node(node, gfp, order);
+		page = alloc_pages_exact_node(node, gfp, order);
 	else
 #endif
 		page = alloc_pages(gfp, order);
diff --git a/mm/slub.c b/mm/slub.c
index 0280eee..ecb6d28 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1068,10 +1068,7 @@ static inline struct page *alloc_slab_page(gfp_t flags, int node,
 {
 	int order = oo_order(oo);
 
-	if (node == -1)
-		return alloc_pages(flags, order);
-	else
-		return alloc_pages_node(node, flags, order);
+	return alloc_pages_node(node, flags, order);
 }
 
 static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 75f49d3..6566c9e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1318,11 +1318,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	for (i = 0; i < area->nr_pages; i++) {
 		struct page *page;
 
-		if (node < 0)
-			page = alloc_page(gfp_mask);
-		else
-			page = alloc_pages_node(node, gfp_mask, 0);
-
+		page = alloc_pages_node(node, gfp_mask, 0);
 		if (unlikely(!page)) {
 			/* Successfully allocated i pages, free them in __vunmap() */
 			area->nr_pages = i;
-- 
1.5.6.5


* [PATCH 04/20] Convert gfp_zone() to use a table of precalculated values
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

Every page allocation uses gfp_zone() to calculate the highest zone
allowed by the combination of GFP flags. This is a large number of branches
to have in a fast path. This patch replaces the branches with a lookup
table that is calculated at boot-time and stored in the read-mostly section
so it can be shared. This requires __GFP_MOVABLE to be redefined but it's
debatable whether it should be considered a zone modifier or not.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/gfp.h |   28 +++++++++++-----------------
 init/main.c         |    1 +
 mm/page_alloc.c     |   36 +++++++++++++++++++++++++++++++++++-
 3 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 59eb093..581f8a9 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -16,6 +16,10 @@ struct vm_area_struct;
  * Do not put any conditional on these. If necessary modify the definitions
  * without the underscores and use the consistently. The definitions here may
  * be used in bit comparisons.
+ *
+ * Note that __GFP_MOVABLE uses the next available bit but it is not
+ * a zone modifier. It uses the fourth bit so that the calculation of
+ * gfp_zone() can use a table rather than a series of comparisons
  */
 #define __GFP_DMA	((__force gfp_t)0x01u)
 #define __GFP_HIGHMEM	((__force gfp_t)0x02u)
@@ -50,7 +54,7 @@ struct vm_area_struct;
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
 #define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
-#define __GFP_MOVABLE	((__force gfp_t)0x100000u)  /* Page is movable */
+#define __GFP_MOVABLE	((__force gfp_t)0x08u)  /* Page is movable */
 
 #define __GFP_BITS_SHIFT 21	/* Room for 21 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -77,6 +81,9 @@ struct vm_area_struct;
 #define GFP_THISNODE	((__force gfp_t)0)
 #endif
 
+/* This is a mask of all modifiers affecting gfp_zonemask() */
+#define GFP_ZONEMASK (__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32 | __GFP_MOVABLE)
+
 /* This mask makes up all the page movable related flags */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
 
@@ -112,24 +119,11 @@ static inline int allocflags_to_migratetype(gfp_t gfp_flags)
 		((gfp_flags & __GFP_RECLAIMABLE) != 0);
 }
 
+extern int gfp_zone_table[GFP_ZONEMASK];
+void init_gfp_zone_table(void);
 static inline enum zone_type gfp_zone(gfp_t flags)
 {
-#ifdef CONFIG_ZONE_DMA
-	if (flags & __GFP_DMA)
-		return ZONE_DMA;
-#endif
-#ifdef CONFIG_ZONE_DMA32
-	if (flags & __GFP_DMA32)
-		return ZONE_DMA32;
-#endif
-	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
-			(__GFP_HIGHMEM | __GFP_MOVABLE))
-		return ZONE_MOVABLE;
-#ifdef CONFIG_HIGHMEM
-	if (flags & __GFP_HIGHMEM)
-		return ZONE_HIGHMEM;
-#endif
-	return ZONE_NORMAL;
+	return gfp_zone_table[flags & GFP_ZONEMASK];
 }
 
 /*
diff --git a/init/main.c b/init/main.c
index 8442094..08a5663 100644
--- a/init/main.c
+++ b/init/main.c
@@ -573,6 +573,7 @@ asmlinkage void __init start_kernel(void)
 	 * fragile until we cpu_idle() for the first time.
 	 */
 	preempt_disable();
+	init_gfp_zone_table();
 	build_all_zonelists();
 	page_alloc_init();
 	printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3842f8..7cc4932 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -70,6 +70,7 @@ EXPORT_SYMBOL(node_states);
 unsigned long totalram_pages __read_mostly;
 unsigned long totalreserve_pages __read_mostly;
 unsigned long highest_memmap_pfn __read_mostly;
+int gfp_zone_table[GFP_ZONEMASK] __read_mostly;
 int percpu_pagelist_fraction;
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
@@ -4373,7 +4374,7 @@ static void setup_per_zone_inactive_ratio(void)
  * 8192MB:	11584k
  * 16384MB:	16384k
  */
-static int __init init_per_zone_pages_min(void)
+static int init_per_zone_pages_min(void)
 {
 	unsigned long lowmem_kbytes;
 
@@ -4391,6 +4392,39 @@ static int __init init_per_zone_pages_min(void)
 }
 module_init(init_per_zone_pages_min)
 
+static inline int __init gfp_flags_to_zone(gfp_t flags)
+{
+#ifdef CONFIG_ZONE_DMA
+	if (flags & __GFP_DMA)
+		return ZONE_DMA;
+#endif
+#ifdef CONFIG_ZONE_DMA32
+	if (flags & __GFP_DMA32)
+		return ZONE_DMA32;
+#endif
+	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
+			(__GFP_HIGHMEM | __GFP_MOVABLE))
+		return ZONE_MOVABLE;
+#ifdef CONFIG_HIGHMEM
+	if (flags & __GFP_HIGHMEM)
+		return ZONE_HIGHMEM;
+#endif
+	return ZONE_NORMAL;
+}
+
+/*
+ * For each possible combination of zone modifier flags, we calculate
+ * what zone it should be using. This consumes a cache line in most
+ * cases but avoids a number of branches in the allocator fast path
+ */
+void __init init_gfp_zone_table(void)
+{
+	gfp_t gfp_flags;
+
+	for (gfp_flags = 0; gfp_flags <= GFP_ZONEMASK; gfp_flags++)
+		gfp_zone_table[gfp_flags] = gfp_flags_to_zone(gfp_flags);
+}
+
 /*
  * min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so 
  *	that we can call two helper functions whenever min_free_kbytes
-- 
1.5.6.5
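
As a concrete illustration of the table-lookup idea, here is a minimal,
self-contained C sketch. The flag bits, zone names and function names below
are simplified stand-ins invented for illustration; they are not the kernel
definitions:

#include <stdio.h>

/* Invented stand-ins for the four zone modifier bits */
#define SKETCH_DMA	0x01u
#define SKETCH_HIGHMEM	0x02u
#define SKETCH_DMA32	0x04u
#define SKETCH_MOVABLE	0x08u
#define SKETCH_ZONEMASK \
	(SKETCH_DMA | SKETCH_HIGHMEM | SKETCH_DMA32 | SKETCH_MOVABLE)

enum sketch_zone { Z_DMA, Z_DMA32, Z_NORMAL, Z_HIGHMEM, Z_MOVABLE };

static int zone_table[SKETCH_ZONEMASK + 1];

/* The branchy conversion, run once per table slot at init time */
static int flags_to_zone(unsigned int flags)
{
	if (flags & SKETCH_DMA)
		return Z_DMA;
	if (flags & SKETCH_DMA32)
		return Z_DMA32;
	if ((flags & (SKETCH_HIGHMEM | SKETCH_MOVABLE)) ==
			(SKETCH_HIGHMEM | SKETCH_MOVABLE))
		return Z_MOVABLE;
	if (flags & SKETCH_HIGHMEM)
		return Z_HIGHMEM;
	return Z_NORMAL;
}

static void init_zone_table(void)
{
	unsigned int flags;

	for (flags = 0; flags <= SKETCH_ZONEMASK; flags++)
		zone_table[flags] = flags_to_zone(flags);
}

/* The fast-path lookup: a mask and a load, no branches */
static int sketch_zone(unsigned int flags)
{
	return zone_table[flags & SKETCH_ZONEMASK];
}

int main(void)
{
	init_zone_table();
	printf("DMA             -> zone %d\n", sketch_zone(SKETCH_DMA));
	printf("HIGHMEM|MOVABLE -> zone %d\n",
			sketch_zone(SKETCH_HIGHMEM | SKETCH_MOVABLE));
	printf("no modifier     -> zone %d\n", sketch_zone(0));
	return 0;
}

The branchy conversion still exists, but it only runs once per table slot
at boot; every later lookup is a mask and an array load.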


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 05/20] Check only once if the zonelist is suitable for the allocation
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

It is possible with __GFP_THISNODE that no zones are suitable. This
patch makes sure the check is only made once.
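
In miniature, the change amounts to hoisting a check whose result cannot
change across retries out of the restart loop. A toy sketch of that pattern,
using invented names rather than the allocator code:

#include <stdio.h>

static int zonelist_empty;	/* fixed for the lifetime of the call */
static int tries;

static int attempt(void)
{
	return ++tries >= 3;	/* pretend the third try succeeds */
}

int main(void)
{
	/* Checked once up front instead of after every restart */
	if (zonelist_empty)
		return 1;

	while (!attempt())
		;	/* reclaim, wait, retry ... */

	printf("succeeded after %d tries\n", tries);
	return 0;
}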

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7cc4932..99fd538 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1487,9 +1487,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (should_fail_alloc_page(gfp_mask, order))
 		return NULL;
 
-restart:
-	z = zonelist->_zonerefs;  /* the list of zones suitable for gfp_mask */
-
+	/* the list of zones suitable for gfp_mask */
+	z = zonelist->_zonerefs;
 	if (unlikely(!z->zone)) {
 		/*
 		 * Happens if we have an empty zonelist as a result of
@@ -1498,6 +1497,7 @@ restart:
 		return NULL;
 	}
 
+restart:
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 06/20] Break up the allocator entry point into fast and slow paths
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

The core of the page allocator is one giant function which allocates
memory on the stack and makes calculations that may not be needed for every
allocation. This patch breaks up the allocator path into fast and slow paths.
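
A rough, self-contained sketch of the structure this moves towards: a small
fast path that is cheap to inline and only drops into an out-of-line slow
path when the first attempt fails. The names below are invented for
illustration and are not the real kernel helpers:

#include <stdio.h>
#include <stdlib.h>

static int fastpath_succeeds;	/* simulate a full or empty freelist */

/* Cheap first attempt, standing in for get_page_from_freelist() */
static void *try_freelists(unsigned int order)
{
	return fastpath_succeeds ? malloc((size_t)4096 << order) : NULL;
}

/* Cold path kept out of line so it does not bloat the inlined fast path */
static __attribute__((noinline)) void *alloc_slowpath(unsigned int order)
{
	/* stands in for: wake kswapd, retry, direct reclaim, maybe OOM */
	return malloc((size_t)4096 << order);
}

/* The fast path: one attempt, then hand everything else off */
static inline void *alloc_pages_sketch(unsigned int order)
{
	void *page = try_freelists(order);

	if (page == NULL)
		page = alloc_slowpath(order);
	return page;
}

int main(void)
{
	void *page = alloc_pages_sketch(0);

	printf("allocated %p via the %s path\n", page,
			fastpath_succeeds ? "fast" : "slow");
	free(page);
	return 0;
}

Keeping the slow path out of line means the cold code is not pulled into
every caller once the fast path itself is inlined later in the series.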

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |  345 ++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 216 insertions(+), 129 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 99fd538..503d692 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1463,45 +1463,170 @@ try_next_zone:
 	return page;
 }
 
-/*
- * This is the 'heart' of the zoned buddy allocator.
- */
+int
+should_alloc_retry(gfp_t gfp_mask, unsigned int order,
+				unsigned long pages_reclaimed)
+{
+	/* Do not loop if specifically requested */
+	if (gfp_mask & __GFP_NORETRY)
+		return 0;
+	
+	/*
+	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
+	 * means __GFP_NOFAIL, but that may not be true in other
+	 * implementations.
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return 1;
+
+	/*
+	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
+	 * specified, then we retry until we no longer reclaim any pages
+	 * (above), or we've reclaimed an order of pages at least as
+	 * large as the allocation's order. In both cases, if the
+	 * allocation still fails, we stop retrying.
+	 */
+	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
+		return 1;
+
+	/*
+	 * Don't let big-order allocations loop unless the caller
+	 * explicitly requests that. 
+	 */
+	if (gfp_mask & __GFP_NOFAIL)
+		return 1;
+
+	return 0;
+}
+
 struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-			struct zonelist *zonelist, nodemask_t *nodemask)
+__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask)
 {
-	const gfp_t wait = gfp_mask & __GFP_WAIT;
-	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
-	struct zoneref *z;
-	struct zone *zone;
 	struct page *page;
+
+	/* Acquire the OOM killer lock for the zones in zonelist */
+	if (!try_set_zone_oom(zonelist, gfp_mask)) {
+		schedule_timeout_uninterruptible(1);
+		return NULL;
+	}
+
+	/*
+	 * Go through the zonelist yet one more time, keep very high watermark
+	 * here, this is only to catch a parallel oom killing, we must fail if
+	 * we're still under heavy pressure.
+	 */
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+		order, zonelist, high_zoneidx,
+		ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+	if (page)
+		goto out;
+
+	/* The OOM killer will not help higher order allocs */
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		goto out;
+
+	/* Exhausted what can be done so it's blamo time */
+	out_of_memory(zonelist, gfp_mask, order);
+
+out:
+	clear_zonelist_oom(zonelist, gfp_mask);
+	return page;
+}
+
+/* The really slow allocator path where we enter direct reclaim */
+struct page *
+__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress)
+{
+	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
-	int do_retry;
-	int alloc_flags;
-	unsigned long did_some_progress;
-	unsigned long pages_reclaimed = 0;
 
-	might_sleep_if(wait);
+	cond_resched();
 
-	if (should_fail_alloc_page(gfp_mask, order))
-		return NULL;
+	/* We now go into synchronous reclaim */
+	cpuset_memory_pressure_bump();
 
-	/* the list of zones suitable for gfp_mask */
-	z = zonelist->_zonerefs;
-	if (unlikely(!z->zone)) {
-		/*
-		 * Happens if we have an empty zonelist as a result of
-		 * GFP_THISNODE being used on a memoryless node
-		 */
-		return NULL;
-	}
+	/*
+	 * The task's cpuset might have expanded its set of allowable nodes
+	 */
+	cpuset_update_task_memory_state();
+	p->flags |= PF_MEMALLOC;
+	reclaim_state.reclaimed_slab = 0;
+	p->reclaim_state = &reclaim_state;
 
-restart:
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
-	if (page)
-		goto got_pg;
+	*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
+
+	p->reclaim_state = NULL;
+	p->flags &= ~PF_MEMALLOC;
+
+	cond_resched();
+
+	if (order != 0)
+		drain_all_pages();
+
+	if (likely(*did_some_progress))
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
+					zonelist, high_zoneidx, alloc_flags);
+	return page;
+}
+
+static inline int is_allocation_high_priority(struct task_struct *p,
+							gfp_t gfp_mask)
+{
+	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+			&& !in_interrupt())
+		if (!(gfp_mask & __GFP_NOMEMALLOC))
+			return 1;
+	return 0;
+}
+
+/*
+ * This is called in the allocator slow-path if the allocation request is of
+ * sufficient urgency to ignore watermarks and take other desperate measures
+ */
+struct page *
+__alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask)
+{
+	struct page *page;
+
+	do {
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
+			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
+
+		if (!page && gfp_mask & __GFP_NOFAIL)
+			congestion_wait(WRITE, HZ/50);
+	} while (!page && (gfp_mask & __GFP_NOFAIL));
+
+	return page;
+}
+
+static inline
+void wake_all_kswapd(unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx)
+{
+	struct zoneref *z;
+	struct zone *zone;
+
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+		wakeup_kswapd(zone, order);
+}
+
+static struct page * noinline
+__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask)
+{
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	struct page *page = NULL;
+	int alloc_flags;
+	unsigned long pages_reclaimed = 0;
+	unsigned long did_some_progress;
+	struct task_struct *p = current;
 
 	/*
 	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
@@ -1514,8 +1639,7 @@ restart:
 	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
 		goto nopage;
 
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		wakeup_kswapd(zone, order);
+	wake_all_kswapd(order, zonelist, high_zoneidx);
 
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
@@ -1535,6 +1659,7 @@ restart:
 	if (wait)
 		alloc_flags |= ALLOC_CPUSET;
 
+restart:
 	/*
 	 * Go through the zonelist again. Let __GFP_HIGH and allocations
 	 * coming from realtime tasks go deeper into reserves.
@@ -1548,118 +1673,47 @@ restart:
 	if (page)
 		goto got_pg;
 
-	/* This allocation should allow future memory freeing. */
-
-rebalance:
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
-		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
-nofail_alloc:
-			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, nodemask, order,
-				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
-			if (page)
-				goto got_pg;
-			if (gfp_mask & __GFP_NOFAIL) {
-				congestion_wait(WRITE, HZ/50);
-				goto nofail_alloc;
-			}
-		}
-		goto nopage;
-	}
+	/* Allocate without watermarks if the context allows */
+	if (is_allocation_high_priority(p, gfp_mask))
+		page = __alloc_pages_high_priority(gfp_mask, order,
+			zonelist, high_zoneidx, nodemask);
+	if (page)
+		goto got_pg;
 
 	/* Atomic allocations - we can't balance anything */
 	if (!wait)
 		goto nopage;
 
-	cond_resched();
+	/* Try direct reclaim and then allocating */
+	page = __alloc_pages_direct_reclaim(gfp_mask, order,
+					zonelist, high_zoneidx,
+					nodemask,
+					alloc_flags, &did_some_progress);
+	if (page)
+		goto got_pg;
 
-	/* We now go into synchronous reclaim */
-	cpuset_memory_pressure_bump();
 	/*
-	 * The task's cpuset might have expanded its set of allowable nodes
+	 * If we failed to make any progress reclaiming, then we are
+	 * running out of options and have to consider going OOM
 	 */
-	cpuset_update_task_memory_state();
-	p->flags |= PF_MEMALLOC;
-	reclaim_state.reclaimed_slab = 0;
-	p->reclaim_state = &reclaim_state;
-
-	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
-
-	p->reclaim_state = NULL;
-	p->flags &= ~PF_MEMALLOC;
-
-	cond_resched();
-
-	if (order != 0)
-		drain_all_pages();
+	if (!did_some_progress) {
+		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
+			page = __alloc_pages_may_oom(gfp_mask, order,
+					zonelist, high_zoneidx,
+					nodemask);
+			if (page)
+				goto got_pg;
 
-	if (likely(did_some_progress)) {
-		page = get_page_from_freelist(gfp_mask, nodemask, order,
-					zonelist, high_zoneidx, alloc_flags);
-		if (page)
-			goto got_pg;
-	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
-		if (!try_set_zone_oom(zonelist, gfp_mask)) {
-			schedule_timeout_uninterruptible(1);
 			goto restart;
 		}
-
-		/*
-		 * Go through the zonelist yet one more time, keep
-		 * very high watermark here, this is only to catch
-		 * a parallel oom killing, we must fail if we're still
-		 * under heavy pressure.
-		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
-			order, zonelist, high_zoneidx,
-			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
-		if (page) {
-			clear_zonelist_oom(zonelist, gfp_mask);
-			goto got_pg;
-		}
-
-		/* The OOM killer will not help higher order allocs so fail */
-		if (order > PAGE_ALLOC_COSTLY_ORDER) {
-			clear_zonelist_oom(zonelist, gfp_mask);
-			goto nopage;
-		}
-
-		out_of_memory(zonelist, gfp_mask, order);
-		clear_zonelist_oom(zonelist, gfp_mask);
-		goto restart;
 	}
 
-	/*
-	 * Don't let big-order allocations loop unless the caller explicitly
-	 * requests that.  Wait for some write requests to complete then retry.
-	 *
-	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
-	 * means __GFP_NOFAIL, but that may not be true in other
-	 * implementations.
-	 *
-	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
-	 * specified, then we retry until we no longer reclaim any pages
-	 * (above), or we've reclaimed an order of pages at least as
-	 * large as the allocation's order. In both cases, if the
-	 * allocation still fails, we stop retrying.
-	 */
+	/* Check if we should retry the allocation */
 	pages_reclaimed += did_some_progress;
-	do_retry = 0;
-	if (!(gfp_mask & __GFP_NORETRY)) {
-		if (order <= PAGE_ALLOC_COSTLY_ORDER) {
-			do_retry = 1;
-		} else {
-			if (gfp_mask & __GFP_REPEAT &&
-				pages_reclaimed < (1 << order))
-					do_retry = 1;
-		}
-		if (gfp_mask & __GFP_NOFAIL)
-			do_retry = 1;
-	}
-	if (do_retry) {
+	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
+		/* Wait for some write requests to complete then retry */
 		congestion_wait(WRITE, HZ/50);
-		goto rebalance;
+		goto restart;
 	}
 
 nopage:
@@ -1672,6 +1726,39 @@ nopage:
 	}
 got_pg:
 	return page;
+
+}
+
+/*
+ * This is the 'heart' of the zoned buddy allocator.
+ */
+struct page *
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask)
+{
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	struct page *page;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+
+	if (should_fail_alloc_page(gfp_mask, order))
+		return NULL;
+
+	/*
+	 * Check the zones suitable for the gfp_mask contain at least one
+	 * valid zone. It's possible to have an empty zonelist as a result
+	 * of GFP_THISNODE and a memoryless node
+	 */
+	if (unlikely(!zonelist->_zonerefs->zone))
+		return NULL;
+
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
+			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+	if (unlikely(!page))
+		page = __alloc_pages_slowpath(gfp_mask, order,
+				zonelist, high_zoneidx, nodemask);
+
+	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
 
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 07/20] Simplify the check on whether cpusets are a factor or not
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

The check for whether cpuset constraints need to be applied is complex
and often repeated.  This patch makes the check once in advance so that the
comparison in the fast path is simpler to compute.
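
Reduced to a toy example with invented names, the pattern is to evaluate the
parts of the per-zone test that cannot change during the scan once, before
the loop, so each iteration only pays for the part that genuinely varies:

#include <stdio.h>

#define NZONES 4

/* Invented stand-ins for the cpuset state the real check consults */
static int number_of_cpusets = 1;
static int alloc_flags_cpuset = 1;

static int zone_allowed(int zone)
{
	return (zone & 1) == 0;	/* pretend odd zones are outside the cpuset */
}

int main(void)
{
	int zone, usable = 0;

	/* Invariant across the whole scan: decide once, not per zone */
	int check_cpuset = alloc_flags_cpuset && number_of_cpusets > 1;

	for (zone = 0; zone < NZONES; zone++) {
		if (check_cpuset && !zone_allowed(zone))
			continue;
		usable++;
	}
	printf("%d of %d zones usable\n", usable, NZONES);
	return 0;
}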

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 503d692..dc50c47 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1400,6 +1400,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
+	int alloc_cpuset = 0;
 
 	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
 							&preferred_zone);
@@ -1410,6 +1411,12 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 
 	VM_BUG_ON(order >= MAX_ORDER);
 
+#ifdef CONFIG_CPUSETS
+	/* Determine in advance if the cpuset checks will be needed */
+	if ((alloc_flags & ALLOC_CPUSET) && unlikely(number_of_cpusets > 1))
+		alloc_cpuset = 1;
+#endif
+
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -1420,8 +1427,8 @@ zonelist_scan:
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
-		if ((alloc_flags & ALLOC_CPUSET) &&
-			!cpuset_zone_allowed_softwall(zone, gfp_mask))
+		if (alloc_cpuset)
+			if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				goto try_next_zone;
 
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 08/20] Move check for disabled anti-fragmentation out of fastpath
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

On low-memory systems, anti-fragmentation gets disabled as there is nothing
it can do and it would just incur overhead shuffling pages between lists
constantly. Currently the check is made in the free page fast path for every
page. This patch moves it to a slow path. On machines with low memory,
there will be a small amount of additional overhead as pages get shuffled
between lists but it should quickly settle.
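
Reduced to a toy sketch with invented names rather than the kernel API, the
test moves from the getter, which runs on every page free, to the setter,
which runs rarely, by storing the canonical value up front:

#include <stdio.h>

enum { MT_UNMOVABLE, MT_RECLAIMABLE, MT_MOVABLE };

static int mobility_grouping_disabled = 1;	/* e.g. a low-memory system */
static int block_migratetype;

/* Slow path: normalise once, when the value is stored */
static void set_block_migratetype(int mt)
{
	if (mobility_grouping_disabled)
		mt = MT_UNMOVABLE;
	block_migratetype = mt;
}

/* Fast path: a plain load, no per-call branch on the disabled flag */
static int get_block_migratetype(void)
{
	return block_migratetype;
}

int main(void)
{
	set_block_migratetype(MT_MOVABLE);
	printf("stored migratetype: %d\n", get_block_migratetype());
	return 0;
}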

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    3 ---
 mm/page_alloc.c        |    4 ++++
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 09c14e2..6089393 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -50,9 +50,6 @@ extern int page_group_by_mobility_disabled;
 
 static inline int get_pageblock_migratetype(struct page *page)
 {
-	if (unlikely(page_group_by_mobility_disabled))
-		return MIGRATE_UNMOVABLE;
-
 	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dc50c47..eaa0ab7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -172,6 +172,10 @@ int page_group_by_mobility_disabled __read_mostly;
 
 static void set_pageblock_migratetype(struct page *page, int migratetype)
 {
+
+	if (unlikely(page_group_by_mobility_disabled))
+		migratetype = MIGRATE_UNMOVABLE;
+
 	set_pageblock_flags_group(page, (unsigned long)migratetype,
 					PB_migrate, PB_migrate_end);
 }
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 09/20] Calculate the preferred zone for allocation only once
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

get_page_from_freelist() can be called multiple times for an allocation.
Part of this calculates the preferred_zone which is the first usable
zone in the zonelist. This patch calculates preferred_zone once.
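
Schematically, with toy names and data rather than the kernel interfaces,
the caller works out the preferred zone once and hands it down to a helper
that may be called several times:

#include <stdio.h>

#define NZONES 3

static const char *zone_names[NZONES] = { "DMA", "Normal", "HighMem" };
static int zone_populated[NZONES] = { 0, 1, 1 };

/* Done once per allocation, standing in for first_zones_zonelist() */
static int first_usable_zone(void)
{
	int z;

	for (z = 0; z < NZONES; z++)
		if (zone_populated[z])
			return z;
	return -1;
}

/* May run several times: fast path, after reclaim, OOM retry ... */
static int try_allocate(int preferred_zone, int attempt)
{
	printf("attempt %d: preferred zone %s\n",
			attempt, zone_names[preferred_zone]);
	return attempt == 2;	/* pretend the second attempt succeeds */
}

int main(void)
{
	int attempt;
	int preferred_zone = first_usable_zone();	/* calculated once */

	if (preferred_zone < 0)
		return 1;
	for (attempt = 1; !try_allocate(preferred_zone, attempt); attempt++)
		;
	return 0;
}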

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   53 ++++++++++++++++++++++++++++++++---------------------
 1 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eaa0ab7..bd7b2c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1395,24 +1395,19 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
  */
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
-		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
+		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
+		struct zone *preferred_zone)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
 	int classzone_idx;
-	struct zone *zone, *preferred_zone;
+	struct zone *zone;
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 	int alloc_cpuset = 0;
 
-	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
-							&preferred_zone);
-	if (!preferred_zone)
-		return NULL;
-
 	classzone_idx = zone_idx(preferred_zone);
-
 	VM_BUG_ON(order >= MAX_ORDER);
 
 #ifdef CONFIG_CPUSETS
@@ -1513,7 +1508,7 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask)
+	nodemask_t *nodemask, struct zone *preferred_zone)
 {
 	struct page *page;
 
@@ -1530,7 +1525,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	 */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
-		ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
+		preferred_zone);
 	if (page)
 		goto out;
 
@@ -1550,7 +1546,8 @@ out:
 struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress)
+	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+	unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
@@ -1581,7 +1578,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	if (likely(*did_some_progress))
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
-					zonelist, high_zoneidx, alloc_flags);
+					zonelist, high_zoneidx,
+					alloc_flags, preferred_zone);
 	return page;
 }
 
@@ -1602,13 +1600,14 @@ static inline int is_allocation_high_priority(struct task_struct *p,
 struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask)
+	nodemask_t *nodemask, struct zone *preferred_zone)
 {
 	struct page *page;
 
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
-			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
+			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
+			preferred_zone);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			congestion_wait(WRITE, HZ/50);
@@ -1630,7 +1629,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist, enum zone_ty
 static struct page * noinline
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask)
+	nodemask_t *nodemask, struct zone *preferred_zone)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
@@ -1680,14 +1679,15 @@ restart:
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
-						high_zoneidx, alloc_flags);
+						high_zoneidx, alloc_flags,
+						preferred_zone);
 	if (page)
 		goto got_pg;
 
 	/* Allocate without watermarks if the context allows */
 	if (is_allocation_high_priority(p, gfp_mask))
 		page = __alloc_pages_high_priority(gfp_mask, order,
-			zonelist, high_zoneidx, nodemask);
+			zonelist, high_zoneidx, nodemask, preferred_zone);
 	if (page)
 		goto got_pg;
 
@@ -1699,7 +1699,8 @@ restart:
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask,
-					alloc_flags, &did_some_progress);
+					alloc_flags, preferred_zone,
+					&did_some_progress);
 	if (page)
 		goto got_pg;
 
@@ -1711,7 +1712,7 @@ restart:
 		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
-					nodemask);
+					nodemask, preferred_zone);
 			if (page)
 				goto got_pg;
 
@@ -1748,6 +1749,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	struct zone *preferred_zone;
 	struct page *page;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
@@ -1763,11 +1765,20 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(!zonelist->_zonerefs->zone))
 		return NULL;
 
+	/* The preferred zone is used for statistics later */
+	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
+							&preferred_zone);
+	if (!preferred_zone)
+		return NULL;
+
+	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
+			preferred_zone);
 	if (unlikely(!page))
 		page = __alloc_pages_slowpath(gfp_mask, order,
-				zonelist, high_zoneidx, nodemask);
+				zonelist, high_zoneidx, nodemask,
+				preferred_zone);
 
 	return page;
 }
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 10/20] Calculate the migratetype for allocation only once
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

The GFP mask is converted into a migratetype when deciding which pagelist to
take a page from. However, this conversion currently happens multiple times
per allocation, at least once per zone traversed. Calculate it once and pass
it down.
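
For illustration, here is a minimal userspace sketch of the idea; the names
(flags_to_migratetype, try_zone, alloc_page_like) are made up and only stand
in for the real kernel functions. The conversion is done once at the entry
point and threaded through as a parameter instead of being redone per zone.

#include <stdio.h>

enum migratetype { MT_UNMOVABLE, MT_MOVABLE };

/* stand-in for the flags-to-migratetype conversion done on every allocation */
static enum migratetype flags_to_migratetype(unsigned int flags)
{
	return (flags & 0x1) ? MT_MOVABLE : MT_UNMOVABLE;
}

/* inner helper takes the precomputed type instead of the raw flags */
static int try_zone(int zone, enum migratetype mt)
{
	return zone == 2 && mt == MT_MOVABLE; /* pretend only zone 2 can satisfy it */
}

/* entry point: convert once, reuse the result for every zone traversed */
static int alloc_page_like(unsigned int flags, int nr_zones)
{
	enum migratetype mt = flags_to_migratetype(flags);
	int zone;

	for (zone = 0; zone < nr_zones; zone++)
		if (try_zone(zone, mt))
			return zone;
	return -1;
}

int main(void)
{
	printf("allocated from zone %d\n", alloc_page_like(0x1, 3));
	return 0;
}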

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   43 ++++++++++++++++++++++++++-----------------
 1 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bd7b2c6..d0d8c07 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1068,13 +1068,13 @@ void split_page(struct page *page, unsigned int order)
  * or two.
  */
 static struct page *buffered_rmqueue(struct zone *preferred_zone,
-			struct zone *zone, int order, gfp_t gfp_flags)
+			struct zone *zone, int order, gfp_t gfp_flags,
+			int migratetype)
 {
 	unsigned long flags;
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
 	int cpu;
-	int migratetype = allocflags_to_migratetype(gfp_flags);
 
 again:
 	cpu  = get_cpu();
@@ -1396,7 +1396,7 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone)
+		struct zone *preferred_zone, int migratetype)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
@@ -1446,7 +1446,8 @@ zonelist_scan:
 			}
 		}
 
-		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
+		page = buffered_rmqueue(preferred_zone, zone, order,
+						gfp_mask, migratetype);
 		if (page)
 			break;
 this_zone_full:
@@ -1508,7 +1509,8 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, struct zone *preferred_zone)
+	nodemask_t *nodemask, struct zone *preferred_zone,
+	int migratetype)
 {
 	struct page *page;
 
@@ -1526,7 +1528,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
 		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
-		preferred_zone);
+		preferred_zone, migratetype);
 	if (page)
 		goto out;
 
@@ -1547,7 +1549,7 @@ struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	unsigned long *did_some_progress)
+	int migratetype, unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
@@ -1579,7 +1581,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	if (likely(*did_some_progress))
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
-					alloc_flags, preferred_zone);
+					alloc_flags, preferred_zone,
+					migratetype);
 	return page;
 }
 
@@ -1600,14 +1603,15 @@ static inline int is_allocation_high_priority(struct task_struct *p,
 struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, struct zone *preferred_zone)
+	nodemask_t *nodemask, struct zone *preferred_zone,
+	int migratetype)
 {
 	struct page *page;
 
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
-			preferred_zone);
+			preferred_zone, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			congestion_wait(WRITE, HZ/50);
@@ -1629,7 +1633,8 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist, enum zone_ty
 static struct page * noinline
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, struct zone *preferred_zone)
+	nodemask_t *nodemask, struct zone *preferred_zone,
+	int migratetype)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
@@ -1680,14 +1685,16 @@ restart:
 	 */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 						high_zoneidx, alloc_flags,
-						preferred_zone);
+						preferred_zone,
+						migratetype);
 	if (page)
 		goto got_pg;
 
 	/* Allocate without watermarks if the context allows */
 	if (is_allocation_high_priority(p, gfp_mask))
 		page = __alloc_pages_high_priority(gfp_mask, order,
-			zonelist, high_zoneidx, nodemask, preferred_zone);
+			zonelist, high_zoneidx, nodemask, preferred_zone,
+			migratetype);
 	if (page)
 		goto got_pg;
 
@@ -1700,7 +1707,7 @@ restart:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					&did_some_progress);
+					migratetype, &did_some_progress);
 	if (page)
 		goto got_pg;
 
@@ -1712,7 +1719,8 @@ restart:
 		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
-					nodemask, preferred_zone);
+					nodemask, preferred_zone,
+					migratetype);
 			if (page)
 				goto got_pg;
 
@@ -1751,6 +1759,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct zone *preferred_zone;
 	struct page *page;
+	int migratetype = allocflags_to_migratetype(gfp_mask);
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
@@ -1774,11 +1783,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
-			preferred_zone);
+			preferred_zone, migratetype);
 	if (unlikely(!page))
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone);
+				preferred_zone, migratetype);
 
 	return page;
 }
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 11/20] Inline get_page_from_freelist() in the fast-path
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

In the best-case scenario, use an inlined version of
get_page_from_freelist(). This increases the size of the text but avoids
time spent pushing arguments onto the stack.
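
As a rough sketch of the pattern with made-up names (not the kernel code):
the body lives in a static inline function that the one hot caller uses
directly, while a noinline wrapper keeps the single out-of-line copy shared
by the slow-path callers. GCC-style attributes are assumed here.

#include <stdio.h>

/* the real work; expanded directly into the hot caller */
static inline int __lookup(int key)
{
	return key * 2 + 1;
}

/* one shared out-of-line copy for everything off the fast path */
static __attribute__((noinline)) int lookup(int key)
{
	return __lookup(key);
}

int main(void)
{
	int hot = __lookup(21);		/* fast path: inlined, no call overhead */
	int cold = lookup(21);		/* slow path: ordinary function call */

	printf("%d %d\n", hot, cold);
	return 0;
}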

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   19 +++++++++++++++----
 1 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0d8c07..36d30f3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1393,8 +1393,8 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
  */
-static struct page *
-get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
+static inline struct page *
+__get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
 		struct zone *preferred_zone, int migratetype)
 {
@@ -1470,6 +1470,17 @@ try_next_zone:
 	return page;
 }
 
+/* Non-inline version of __get_page_from_freelist() */
+static struct page * noinline
+get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
+		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
+		struct zone *preferred_zone, int migratetype)
+{
+	return __get_page_from_freelist(gfp_mask, nodemask, order,
+			zonelist, high_zoneidx, alloc_flags,
+			preferred_zone, migratetype);
+}
+
 int
 should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 				unsigned long pages_reclaimed)
@@ -1780,8 +1791,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (!preferred_zone)
 		return NULL;
 
-	/* First allocation attempt */
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
+	/* First allocation attempt. Fastpath uses inlined version */
+	page = __get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
 			preferred_zone, migratetype);
 	if (unlikely(!page))
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 12/20] Inline __rmqueue_smallest()
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

Inline __rmqueue_smallest() by altering the flow very slightly so that there
is only one call site. This allows the function to be inlined without
additional text bloat.
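
The retry shape can be sketched in isolation (userspace, invented names such
as take_smallest and MT_RESERVE): instead of calling the helper from two
places, the reserve fallback jumps back to the same call site, which is what
lets the helper stay inline without duplicating its body.

#include <stdio.h>

enum { MT_MOVABLE, MT_RESERVE };

/* pretend only the reserve pool has a page left */
static inline int take_smallest(int mt)
{
	return mt == MT_RESERVE ? 42 : 0;
}

/* pretend the fallback search also comes up empty */
static int take_fallback(int mt)
{
	(void)mt;
	return 0;
}

static int take_page(int mt)
{
	int page;

retry_reserve:
	page = take_smallest(mt);	/* the only call site, so inlining is cheap */
	if (!page && mt != MT_RESERVE) {
		page = take_fallback(mt);
		if (!page) {
			mt = MT_RESERVE;	/* last resort: use the reserve */
			goto retry_reserve;
		}
	}
	return page;
}

int main(void)
{
	printf("got page id %d\n", take_page(MT_MOVABLE));
	return 0;
}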

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   23 ++++++++++++++++++-----
 1 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 36d30f3..d8a6828 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -665,7 +665,8 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
  * Go through the free lists for the given migratetype and remove
  * the smallest available page from the freelists
  */
-static struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
+static inline
+struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 						int migratetype)
 {
 	unsigned int current_order;
@@ -835,24 +836,36 @@ static struct page *__rmqueue_fallback(struct zone *zone, int order,
 		}
 	}
 
-	/* Use MIGRATE_RESERVE rather than fail an allocation */
-	return __rmqueue_smallest(zone, order, MIGRATE_RESERVE);
+	return NULL;
 }
 
 /*
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order,
+static inline
+struct page *__rmqueue(struct zone *zone, unsigned int order,
 						int migratetype)
 {
 	struct page *page;
 
+retry_reserve:
 	page = __rmqueue_smallest(zone, order, migratetype);
 
-	if (unlikely(!page))
+	if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
 		page = __rmqueue_fallback(zone, order, migratetype);
 
+		/*
+		 * Use MIGRATE_RESERVE rather than fail an allocation. goto
+		 * is used because __rmqueue_smallest is an inline function
+		 * and we want just one call site
+		 */
+		if (!page) {
+			migratetype = MIGRATE_RESERVE;
+			goto retry_reserve;
+		}
+	}
+
 	return page;
 }
 
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 13/20] Inline buffered_rmqueue()
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

buffered_rmqueue() is in the fast path, so inline it. This incurs text
bloat, as there is now a copy in both the fast and slow paths, but the cost
of the function call was noticeable in profiles of the fast path.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d8a6828..2383147 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1080,7 +1080,8 @@ void split_page(struct page *page, unsigned int order)
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
  */
-static struct page *buffered_rmqueue(struct zone *preferred_zone,
+static inline
+struct page *buffered_rmqueue(struct zone *preferred_zone,
 			struct zone *zone, int order, gfp_t gfp_flags,
 			int migratetype)
 {
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 14/20] Do not call get_pageblock_migratetype() more than necessary
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

get_pageblock_migratetype() is potentially called twice for every page
free: once when the page is freed to the pcp lists and once when it is
freed back to the buddy allocator. When freeing from the pcp lists, the
pageblock type recorded at the time of the original free is already known,
so use it rather than rechecking. In low-memory situations under memory
pressure, this might skew anti-fragmentation slightly, but the interference
is minimal and decisions that fragment memory are being made anyway.
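
The calling convention can be sketched with invented names (lookup_block_type,
free_one): a caller that already knows the migrate type passes it in, and -1
acts as a sentinel meaning 'look it up here'.

#include <stdio.h>

/* stand-in for the relatively expensive per-pageblock lookup */
static int lookup_block_type(int page)
{
	return page % 3;
}

/* -1 means the caller does not know the type, so look it up once here */
static void free_one(int page, int migratetype)
{
	if (migratetype == -1)
		migratetype = lookup_block_type(page);
	printf("freeing page %d to free list %d\n", page, migratetype);
}

int main(void)
{
	int cached = lookup_block_type(7);	/* known when the page hit the pcp list */

	free_one(7, cached);	/* hot path: reuse the cached value */
	free_one(8, -1);	/* cold path: let the callee do the lookup */
	return 0;
}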

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   26 ++++++++++++++++----------
 1 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2383147..a9e9466 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -77,7 +77,8 @@ int percpu_pagelist_fraction;
 int pageblock_order __read_mostly;
 #endif
 
-static void __free_pages_ok(struct page *page, unsigned int order);
+static void __free_pages_ok(struct page *page, unsigned int order,
+					int migratetype);
 
 /*
  * results with 256, 32 in the lowmem_reserve sysctl:
@@ -283,7 +284,7 @@ out:
 
 static void free_compound_page(struct page *page)
 {
-	__free_pages_ok(page, compound_order(page));
+	__free_pages_ok(page, compound_order(page), -1);
 }
 
 void prep_compound_page(struct page *page, unsigned long order)
@@ -456,16 +457,19 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
  */
 
 static inline void __free_one_page(struct page *page,
-		struct zone *zone, unsigned int order)
+		struct zone *zone, unsigned int order,
+		int migratetype)
 {
 	unsigned long page_idx;
 	int order_size = 1 << order;
-	int migratetype = get_pageblock_migratetype(page);
 
 	if (unlikely(PageCompound(page)))
 		if (unlikely(destroy_compound_page(page, order)))
 			return;
 
+	if (migratetype == -1)
+		migratetype = get_pageblock_migratetype(page);
+
 	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
 
 	VM_BUG_ON(page_idx & (order_size - 1));
@@ -534,21 +538,23 @@ static void free_pages_bulk(struct zone *zone, int count,
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_one_page list manipulates */
 		list_del(&page->lru);
-		__free_one_page(page, zone, order);
+		__free_one_page(page, zone, order, page_private(page));
 	}
 	spin_unlock(&zone->lock);
 }
 
-static void free_one_page(struct zone *zone, struct page *page, int order)
+static void free_one_page(struct zone *zone, struct page *page, int order,
+				int migratetype)
 {
 	spin_lock(&zone->lock);
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
-	__free_one_page(page, zone, order);
+	__free_one_page(page, zone, order, migratetype);
 	spin_unlock(&zone->lock);
 }
 
-static void __free_pages_ok(struct page *page, unsigned int order)
+static void __free_pages_ok(struct page *page, unsigned int order,
+				int migratetype)
 {
 	unsigned long flags;
 	int i;
@@ -569,7 +575,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	free_one_page(page_zone(page), page, order);
+	free_one_page(page_zone(page), page, order, migratetype);
 	local_irq_restore(flags);
 }
 
@@ -1864,7 +1870,7 @@ void __free_pages(struct page *page, unsigned int order)
 		if (order == 0)
 			free_hot_page(page);
 		else
-			__free_pages_ok(page, order);
+			__free_pages_ok(page, order, -1);
 	}
 }
 
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 15/20] Do not disable interrupts in free_page_mlock()
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

free_page_mlock() tests and clears PG_mlocked. If the bit was set, it
disables interrupts to update counters, and this happens on every page free
even though interrupts are disabled again very shortly afterwards anyway.
This is wasteful.

This patch splits what free_page_mlock() does. The bit is still tested and
cleared, but the update of the counters is delayed until interrupts have
been disabled for the rest of the free path. One potential weirdness with
this split is that the counters do not get updated if the bad_page() check
triggers, but a system showing bad pages is getting screwed already.
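
Structurally, the change looks like the following userspace sketch (the
fake_irq_* helpers and the page struct are inventions, not the kernel
primitives): the flag is tested and cleared up front, but the counter update
waits for the section where interrupts are already disabled for the rest of
the free, so no second disable/enable pair is needed.

#include <stdio.h>
#include <stdbool.h>

static unsigned long nr_mlock = 10;
static bool fake_irqs_off;

static void fake_irq_save(void)    { fake_irqs_off = true; }
static void fake_irq_restore(void) { fake_irqs_off = false; }

struct fake_page {
	bool mlocked;
};

/* counter update only; the caller guarantees "interrupts" are already off */
static void count_mlock_freed(void)
{
	if (fake_irqs_off)
		nr_mlock--;
}

static void free_page_like(struct fake_page *page)
{
	/* test and clear the bit early, outside any critical section */
	bool was_mlocked = page->mlocked;

	page->mlocked = false;

	fake_irq_save();		/* disabled once for the whole free */
	if (was_mlocked)
		count_mlock_freed();	/* folded into the existing section */
	/* ... the rest of the free path would run here ... */
	fake_irq_restore();
}

int main(void)
{
	struct fake_page p = { .mlocked = true };

	free_page_like(&p);
	printf("nr_mlock is now %lu\n", nr_mlock);
	return 0;
}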

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/internal.h   |   10 ++--------
 mm/page_alloc.c |    8 +++++++-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 478223b..b52bf86 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -155,14 +155,8 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
  */
 static inline void free_page_mlock(struct page *page)
 {
-	if (unlikely(TestClearPageMlocked(page))) {
-		unsigned long flags;
-
-		local_irq_save(flags);
-		__dec_zone_page_state(page, NR_MLOCK);
-		__count_vm_event(UNEVICTABLE_MLOCKFREED);
-		local_irq_restore(flags);
-	}
+	__dec_zone_page_state(page, NR_MLOCK);
+	__count_vm_event(UNEVICTABLE_MLOCKFREED);
 }
 
 #else /* CONFIG_UNEVICTABLE_LRU */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a9e9466..9adafba 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -501,7 +501,6 @@ static inline void __free_one_page(struct page *page,
 
 static inline int free_pages_check(struct page *page)
 {
-	free_page_mlock(page);
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
 		(page_count(page) != 0)  |
@@ -559,6 +558,7 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	unsigned long flags;
 	int i;
 	int bad = 0;
+	int clearMlocked = TestClearPageMlocked(page);
 
 	for (i = 0 ; i < (1 << order) ; ++i)
 		bad += free_pages_check(page + i);
@@ -574,6 +574,8 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	kernel_map_pages(page, 1 << order, 0);
 
 	local_irq_save(flags);
+	if (clearMlocked)
+		free_page_mlock(page);
 	__count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, order, migratetype);
 	local_irq_restore(flags);
@@ -1023,6 +1025,7 @@ static void free_hot_cold_page(struct page *page, int cold)
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
+	int clearMlocked = TestClearPageMlocked(page);
 
 	if (PageAnon(page))
 		page->mapping = NULL;
@@ -1039,6 +1042,9 @@ static void free_hot_cold_page(struct page *page, int cold)
 	pcp = &zone_pcp(zone, get_cpu())->pcp;
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);
+	if (clearMlocked)
+		free_page_mlock(page);
+
 	if (cold)
 		list_add_tail(&page->lru, &pcp->list);
 	else
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 16/20] Do not setup zonelist cache when there is only one node
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

There is a zonelist cache which is used to track zones that are not in the
allowed cpuset or were found to be recently full. This is to reduce cache
footprint on large machines. On smaller machines, it just incurs cost for
no gain. This patch only sets up the zonelist cache when more than one NUMA
node is online.
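
The guard itself is simple; a minimal sketch with invented helpers
(num_nodes_stub, build_zone_cache) shows the setup being done lazily, at
most once, and only when more than one node exists.

#include <stdio.h>

static int num_nodes_stub(void)
{
	return 1;	/* pretend this is a single-node machine */
}

static int build_zone_cache(void)
{
	printf("building zonelist cache\n");
	return 1;
}

int main(void)
{
	int cache_built = 0;
	int did_setup = 0;
	int attempt;

	for (attempt = 0; attempt < 3; attempt++) {
		if (!did_setup) {
			/* pay the setup cost at most once, and only if it can help */
			if (num_nodes_stub() > 1)
				cache_built = build_zone_cache();
			did_setup = 1;
		}
		/* ... zone scanning would happen here ... */
	}
	printf("cache built: %d\n", cache_built);
	return 0;
}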

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9adafba..9e16aec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1481,9 +1481,15 @@ this_zone_full:
 			zlc_mark_zone_full(zonelist, z);
 try_next_zone:
 		if (NUMA_BUILD && !did_zlc_setup) {
-			/* we do zlc_setup after the first zone is tried */
-			allowednodes = zlc_setup(zonelist, alloc_flags);
-			zlc_active = 1;
+			/*
+			 * we do zlc_setup after the first zone is tried
+			 * but only if there are multiple nodes to make
+			 * it worthwhile
+			 */
+			if (num_online_nodes() > 1) {
+				allowednodes = zlc_setup(zonelist, alloc_flags);
+				zlc_active = 1;
+			}
 			did_zlc_setup = 1;
 		}
 	}
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 17/20] Do not double sanity check page attributes during allocation
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

On every page free, free_pages_check() sanity-checks the page details,
including some atomic operations. On page allocation, the same checks are
made again. This is excessively paranoid as it will only catch severe
memory corruption bugs that are going to manifest in a variety of fun and
entertaining ways with or without this check. This patch removes the
overhead of double-checking the page state on every allocation.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    8 --------
 1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9e16aec..452f708 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -646,14 +646,6 @@ static inline void expand(struct zone *zone, struct page *page,
  */
 static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 {
-	if (unlikely(page_mapcount(page) |
-		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
-		(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
-		bad_page(page);
-		return 1;
-	}
-
 	set_page_private(page, 0);
 	set_page_refcounted(page);
 
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 18/20] Split per-cpu list into one-list-per-migrate-type
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

Currently the per-cpu page allocator searches the PCP list for pages of the
correct migrate type to reduce the possibility of pages being inappropriately
placed from a fragmentation perspective. This search is potentially expensive
in the fast path and undesirable. Splitting the per-cpu list into multiple
lists increases the size of the per-cpu structure, and this was potentially
a major problem at the time the search was introduced. That problem has
since been mitigated as only the necessary number of structures is allocated
for the running system.

This patch replaces a list search in the per-cpu allocator with one list
per migrate type.
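
The data-structure change can be sketched in userspace with invented types
(fake_page, pcp_lists): an array of lists indexed by migrate type replaces
the single searched list, so a free becomes an O(1) push, and bulk freeing
walks the lists round-robin much as the new free_pcppages_bulk() does.

#include <stdio.h>
#include <stdlib.h>

enum { MT_UNMOVABLE, MT_RECLAIMABLE, MT_MOVABLE, MT_PCPTYPES };

struct fake_page {
	int id;
	struct fake_page *next;
};

/* one list head per migrate type instead of one searched list */
struct pcp_lists {
	struct fake_page *lists[MT_PCPTYPES];
	int count;
};

static void pcp_free(struct pcp_lists *pcp, struct fake_page *page, int mt)
{
	page->next = pcp->lists[mt];	/* O(1): no search for a matching type */
	pcp->lists[mt] = page;
	pcp->count++;
}

/* drain up to 'count' pages, taking from the types in round-robin order */
static void pcp_drain(struct pcp_lists *pcp, int count)
{
	int mt = 0;

	while (count-- && pcp->count) {
		struct fake_page *page;

		while (pcp->lists[mt] == NULL)
			mt = (mt + 1) % MT_PCPTYPES;	/* skip empty lists */
		page = pcp->lists[mt];
		pcp->lists[mt] = page->next;
		pcp->count--;
		printf("drained page %d from list %d\n", page->id, mt);
		free(page);
		mt = (mt + 1) % MT_PCPTYPES;
	}
}

int main(void)
{
	struct pcp_lists pcp = { { NULL, NULL, NULL }, 0 };
	int i;

	for (i = 0; i < 6; i++) {
		struct fake_page *p = malloc(sizeof(*p));

		p->id = i;
		pcp_free(&pcp, p, i % MT_PCPTYPES);
	}
	pcp_drain(&pcp, 6);
	return 0;
}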

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    5 ++-
 mm/page_alloc.c        |   80 +++++++++++++++++++++++++++++------------------
 2 files changed, 53 insertions(+), 32 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6089393..2a7349a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -38,6 +38,7 @@
 #define MIGRATE_UNMOVABLE     0
 #define MIGRATE_RECLAIMABLE   1
 #define MIGRATE_MOVABLE       2
+#define MIGRATE_PCPTYPES      3 /* the number of types on the pcp lists */
 #define MIGRATE_RESERVE       3
 #define MIGRATE_ISOLATE       4 /* can't allocate from here */
 #define MIGRATE_TYPES         5
@@ -167,7 +168,9 @@ struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
-	struct list_head list;	/* the list of pages */
+
+	/* Lists of pages, one per migrate type stored on the pcp-lists */
+	struct list_head lists[MIGRATE_TYPES];
 };
 
 struct per_cpu_pageset {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 452f708..50e2fdc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -514,7 +514,7 @@ static inline int free_pages_check(struct page *page)
 }
 
 /*
- * Frees a list of pages. 
+ * Frees a number of pages from the PCP lists
  * Assumes all pages on list are in same zone, and of same order.
  * count is the number of pages to free.
  *
@@ -524,20 +524,30 @@ static inline int free_pages_check(struct page *page)
  * And clear the zone's pages_scanned counter, to hold off the "all pages are
  * pinned" detection logic.
  */
-static void free_pages_bulk(struct zone *zone, int count,
-					struct list_head *list, int order)
+static void free_pcppages_bulk(struct zone *zone, int count,
+					 struct per_cpu_pages *pcp)
 {
+	int migratetype = 0;
+
 	spin_lock(&zone->lock);
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
 	while (count--) {
 		struct page *page;
-
-		VM_BUG_ON(list_empty(list));
+		struct list_head *list;
+
+		/* Remove pages from lists in a round-robin fashion */
+		do {
+			if (migratetype == MIGRATE_PCPTYPES)
+				migratetype = 0;
+			list = &pcp->lists[migratetype];
+			migratetype++;
+		} while (list_empty(list));
+		
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_one_page list manipulates */
 		list_del(&page->lru);
-		__free_one_page(page, zone, order, page_private(page));
+		__free_one_page(page, zone, 0, page_private(page));
 	}
 	spin_unlock(&zone->lock);
 }
@@ -922,7 +932,7 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 		to_drain = pcp->batch;
 	else
 		to_drain = pcp->count;
-	free_pages_bulk(zone, to_drain, &pcp->list, 0);
+	free_pcppages_bulk(zone, to_drain, pcp);
 	pcp->count -= to_drain;
 	local_irq_restore(flags);
 }
@@ -951,7 +961,7 @@ static void drain_pages(unsigned int cpu)
 
 		pcp = &pset->pcp;
 		local_irq_save(flags);
-		free_pages_bulk(zone, pcp->count, &pcp->list, 0);
+		free_pcppages_bulk(zone, pcp->count, pcp);
 		pcp->count = 0;
 		local_irq_restore(flags);
 	}
@@ -1017,6 +1027,7 @@ static void free_hot_cold_page(struct page *page, int cold)
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
+	int migratetype;
 	int clearMlocked = TestClearPageMlocked(page);
 
 	if (PageAnon(page))
@@ -1037,16 +1048,31 @@ static void free_hot_cold_page(struct page *page, int cold)
 	if (clearMlocked)
 		free_page_mlock(page);
 
+	/*
+	 * Only store unreclaimable, reclaimable and movable on pcp lists.
+	 * The one concern is that if the minimum number of free pages is not
+	 * aligned to a pageblock-boundary that allocations/frees from the
+	 * MIGRATE_RESERVE pageblocks may call free_one_page() excessively
+	 */
+	migratetype = get_pageblock_migratetype(page);
+	if (migratetype >= MIGRATE_PCPTYPES) {
+		free_one_page(zone, page, 0, migratetype);
+		goto out;
+	}
+
+	/* Record the migratetype and place on the lists */
+	set_page_private(page, migratetype);
 	if (cold)
-		list_add_tail(&page->lru, &pcp->list);
+		list_add_tail(&page->lru, &pcp->lists[migratetype]);
 	else
-		list_add(&page->lru, &pcp->list);
-	set_page_private(page, get_pageblock_migratetype(page));
+		list_add(&page->lru, &pcp->lists[migratetype]);
+
 	pcp->count++;
 	if (pcp->count >= pcp->high) {
-		free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
+		free_pcppages_bulk(zone, pcp->batch, pcp);
 		pcp->count -= pcp->batch;
 	}
+out:
 	local_irq_restore(flags);
 	put_cpu();
 }
@@ -1101,29 +1127,19 @@ again:
 
 		pcp = &zone_pcp(zone, cpu)->pcp;
 		local_irq_save(flags);
-		if (!pcp->count) {
-			pcp->count = rmqueue_bulk(zone, 0,
-					pcp->batch, &pcp->list, migratetype);
-			if (unlikely(!pcp->count))
+		if (list_empty(&pcp->lists[migratetype])) {
+			pcp->count += rmqueue_bulk(zone, 0, pcp->batch,
+				&pcp->lists[migratetype], migratetype);
+			if (unlikely(list_empty(&pcp->lists[migratetype])))
 				goto failed;
 		}
 
-		/* Find a page of the appropriate migrate type */
 		if (cold) {
-			list_for_each_entry_reverse(page, &pcp->list, lru)
-				if (page_private(page) == migratetype)
-					break;
+			page = list_entry(pcp->lists[migratetype].prev,
+							struct page, lru);
 		} else {
-			list_for_each_entry(page, &pcp->list, lru)
-				if (page_private(page) == migratetype)
-					break;
-		}
-
-		/* Allocate more to the pcp list if necessary */
-		if (unlikely(&page->lru == &pcp->list)) {
-			pcp->count += rmqueue_bulk(zone, 0,
-					pcp->batch, &pcp->list, migratetype);
-			page = list_entry(pcp->list.next, struct page, lru);
+			page = list_entry(pcp->lists[migratetype].next,
+							struct page, lru);
 		}
 
 		list_del(&page->lru);
@@ -2876,6 +2892,7 @@ static int zone_batchsize(struct zone *zone)
 static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
 {
 	struct per_cpu_pages *pcp;
+	int migratetype;
 
 	memset(p, 0, sizeof(*p));
 
@@ -2883,7 +2900,8 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
 	pcp->count = 0;
 	pcp->high = 6 * batch;
 	pcp->batch = max(1UL, 1 * batch);
-	INIT_LIST_HEAD(&pcp->list);
+	for (migratetype = 0; migratetype < MIGRATE_TYPES; migratetype++)
+		INIT_LIST_HEAD(&pcp->lists[migratetype]);
 }
 
 /*
-- 
1.5.6.5


* [PATCH 19/20] Batch free pages from migratetype per-cpu lists
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

When the PCP lists are too large, a number of pages are freed in bulk.
Currently the free lists are examined in a round-robin fashion, but it's
not unusual for only pages of one type to be on the PCP lists, so a
significant amount of time can be spent checking empty lists. This patch
still frees pages in a round-robin fashion, but multiple pages are freed
for each migratetype at a time.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   36 ++++++++++++++++++++++++------------
 1 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 50e2fdc..627837c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -532,22 +532,34 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	spin_lock(&zone->lock);
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
-	while (count--) {
+
+	/* Remove pages from lists in a semi-round-robin fashion */
+	while (count) {
 		struct page *page;
 		struct list_head *list;
+		int batch;
 
-		/* Remove pages from lists in a round-robin fashion */
-		do {
-			if (migratetype == MIGRATE_PCPTYPES)
-				migratetype = 0;
-			list = &pcp->lists[migratetype];
-			migratetype++;
-		} while (list_empty(list));
+		if (++migratetype == MIGRATE_PCPTYPES)
+			migratetype = 0;
+		list = &pcp->lists[migratetype];
 		
-		page = list_entry(list->prev, struct page, lru);
-		/* have to delete it as __free_one_page list manipulates */
-		list_del(&page->lru);
-		__free_one_page(page, zone, 0, page_private(page));
+		/*
+		 * Free from the lists in batches of 8. Batching avoids
+		 * the case where the pcp lists contain mainly pages of
+		 * one type and constantly cycling around checking empty
+		 * lists. The choice of 8 is somewhat arbitrary but based
+		 * on the expected maximum size of the PCP lists
+		 */
+		for (batch = 0; batch < 8 && count; batch++) {
+			if (list_empty(list))
+				break;
+			page = list_entry(list->prev, struct page, lru);
+
+			/* have to delete as __free_one_page list manipulates */
+			list_del(&page->lru);
+			__free_one_page(page, zone, 0, page_private(page));
+			count--;
+		}
 	}
 	spin_unlock(&zone->lock);
 }
-- 
1.5.6.5


* [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:17   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-22 23:17 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

Currently an effort is made to determine if a page is hot or cold when
it is being freed so that cache-hot pages can be allocated to callers if
possible. However, the reasoning used to decide whether to mark something
hot or cold is a bit spurious. A profile run of kernbench showed that
"cold" pages were never freed, so either it doesn't happen in general or
it is so rare as to be barely measurable.

It's dubious whether pages are being correctly marked hot and cold
anyway. Things like page cache and pages being truncated are considered
"hot", but there is no guarantee that these pages have been recently used
and are cache hot. Pages being reclaimed from the LRU are considered
cold, which is logical because they cannot have been referenced recently,
but if the system is reclaiming pages then we have entered the allocator
slowpaths and are not going to notice any potential performance boost
from a "hot" page being freed.

This patch deletes the concept of freeing hot or cold pages and frees
them all as hot.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 fs/afs/write.c              |    4 ++--
 fs/btrfs/compression.c      |    2 +-
 fs/btrfs/extent_io.c        |    4 ++--
 fs/btrfs/ordered-data.c     |    2 +-
 fs/cifs/file.c              |    4 ++--
 fs/gfs2/ops_address.c       |    2 +-
 fs/hugetlbfs/inode.c        |    2 +-
 fs/nfs/dir.c                |    2 +-
 fs/ntfs/file.c              |    2 +-
 fs/ramfs/file-nommu.c       |    2 +-
 fs/xfs/linux-2.6/xfs_aops.c |    4 ++--
 include/linux/gfp.h         |    3 +--
 include/linux/pagemap.h     |    2 +-
 include/linux/pagevec.h     |    4 +---
 include/linux/swap.h        |    2 +-
 mm/filemap.c                |    2 +-
 mm/page-writeback.c         |    2 +-
 mm/page_alloc.c             |   21 ++++++---------------
 mm/swap.c                   |   12 ++++++------
 mm/swap_state.c             |    2 +-
 mm/truncate.c               |    6 +++---
 mm/vmscan.c                 |    8 ++++----
 22 files changed, 41 insertions(+), 53 deletions(-)

diff --git a/fs/afs/write.c b/fs/afs/write.c
index 3fb36d4..172f8ae 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -285,7 +285,7 @@ static void afs_kill_pages(struct afs_vnode *vnode, bool error,
 	_enter("{%x:%u},%lx-%lx",
 	       vnode->fid.vid, vnode->fid.vnode, first, last);
 
-	pagevec_init(&pv, 0);
+	pagevec_init(&pv);
 
 	do {
 		_debug("kill %lx-%lx", first, last);
@@ -621,7 +621,7 @@ void afs_pages_written_back(struct afs_vnode *vnode, struct afs_call *call)
 
 	ASSERT(wb != NULL);
 
-	pagevec_init(&pv, 0);
+	pagevec_init(&pv);
 
 	do {
 		_debug("done %lx-%lx", first, last);
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ab07627..e141e59 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -462,7 +462,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 
 	end_index = (i_size_read(inode) - 1) >> PAGE_CACHE_SHIFT;
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	while (last_offset < compressed_end) {
 		page_index = last_offset >> PAGE_CACHE_SHIFT;
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ebe6b29..f3cad4b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2375,7 +2375,7 @@ static int extent_write_cache_pages(struct extent_io_tree *tree,
 	int scanned = 0;
 	int range_whole = 0;
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	if (wbc->range_cyclic) {
 		index = mapping->writeback_index; /* Start from prev offset */
 		end = -1;
@@ -2576,7 +2576,7 @@ int extent_readpages(struct extent_io_tree *tree,
 	struct pagevec pvec;
 	unsigned long bio_flags = 0;
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_entry(pages->prev, struct page, lru);
 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 77c2411..5d8bed2 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -695,7 +695,7 @@ int btrfs_wait_on_page_writeback_range(struct address_space *mapping,
 	if (end < start)
 		return 0;
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	index = start;
 	while ((index <= end) &&
 			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 12bb656..7552ae8 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1265,7 +1265,7 @@ static int cifs_writepages(struct address_space *mapping,
 
 	xid = GetXid();
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	if (wbc->range_cyclic) {
 		index = mapping->writeback_index; /* Start from prev offset */
 		end = -1;
@@ -1838,7 +1838,7 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 	cifs_sb = CIFS_SB(file->f_path.dentry->d_sb);
 	pTcon = cifs_sb->tcon;
 
-	pagevec_init(&lru_pvec, 0);
+	pagevec_init(&lru_pvec);
 	cFYI(DBG2, ("rpages: num pages %d", num_pages));
 	for (i = 0; i < num_pages; ) {
 		unsigned contig_pages;
diff --git a/fs/gfs2/ops_address.c b/fs/gfs2/ops_address.c
index 4ddab67..0821b4b 100644
--- a/fs/gfs2/ops_address.c
+++ b/fs/gfs2/ops_address.c
@@ -355,7 +355,7 @@ static int gfs2_write_cache_jdata(struct address_space *mapping,
 		return 0;
 	}
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	if (wbc->range_cyclic) {
 		index = mapping->writeback_index; /* Start from prev offset */
 		end = -1;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 9b800d9..ee68edc 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -356,7 +356,7 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
 	pgoff_t next;
 	int i, freed = 0;
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	next = start;
 	while (1) {
 		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index e35c819..d4de5ac 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1514,7 +1514,7 @@ static int nfs_symlink(struct inode *dir, struct dentry *dentry, const char *sym
 	 * No big deal if we can't add this page to the page cache here.
 	 * READLINK will get the missing page from the server if needed.
 	 */
-	pagevec_init(&lru_pvec, 0);
+	pagevec_init(&lru_pvec);
 	if (!add_to_page_cache(page, dentry->d_inode->i_mapping, 0,
 							GFP_KERNEL)) {
 		pagevec_add(&lru_pvec, page);
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 3140a44..355c821 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -1911,7 +1911,7 @@ static ssize_t ntfs_file_buffered_write(struct kiocb *iocb,
 			}
 		}
 	}
-	pagevec_init(&lru_pvec, 0);
+	pagevec_init(&lru_pvec);
 	written = 0;
 	/*
 	 * If the write starts beyond the initialized size, extend it up to the
diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c
index b9b567a..294feb0 100644
--- a/fs/ramfs/file-nommu.c
+++ b/fs/ramfs/file-nommu.c
@@ -103,7 +103,7 @@ int ramfs_nommu_expand_for_mapping(struct inode *inode, size_t newsize)
 	memset(data, 0, newsize);
 
 	/* attach all the pages to the inode's address space */
-	pagevec_init(&lru_pvec, 0);
+	pagevec_init(&lru_pvec);
 	for (loop = 0; loop < npages; loop++) {
 		struct page *page = pages + loop;
 
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index de3a198..bc8ee83 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -691,7 +691,7 @@ xfs_probe_cluster(
 	/* Prune this back to avoid pathological behavior */
 	tloff = min(tlast, startpage->index + 64);
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	while (!done && tindex <= tloff) {
 		unsigned len = min_t(pgoff_t, PAGEVEC_SIZE, tlast - tindex + 1);
 
@@ -922,7 +922,7 @@ xfs_cluster_write(
 	struct pagevec		pvec;
 	int			done = 0, i;
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	while (!done && tindex <= tlast) {
 		unsigned len = min_t(pgoff_t, PAGEVEC_SIZE, tlast - tindex + 1);
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 581f8a9..c6d70f3 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -222,8 +222,7 @@ void free_pages_exact(void *virt, size_t size);
 
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
-extern void free_hot_page(struct page *page);
-extern void free_cold_page(struct page *page);
+extern void free_zerocount_page(struct page *page);
 
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr),0)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 01ca085..6782dc9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -90,7 +90,7 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 
 #define page_cache_get(page)		get_page(page)
 #define page_cache_release(page)	put_page(page)
-void release_pages(struct page **pages, int nr, int cold);
+void release_pages(struct page **pages, int nr);
 
 /*
  * speculatively take a reference to a page.
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 7b2886f..001913a 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -16,7 +16,6 @@ struct address_space;
 
 struct pagevec {
 	unsigned long nr;
-	unsigned long cold;
 	struct page *pages[PAGEVEC_SIZE];
 };
 
@@ -31,10 +30,9 @@ unsigned pagevec_lookup_tag(struct pagevec *pvec,
 		struct address_space *mapping, pgoff_t *index, int tag,
 		unsigned nr_pages);
 
-static inline void pagevec_init(struct pagevec *pvec, int cold)
+static inline void pagevec_init(struct pagevec *pvec)
 {
 	pvec->nr = 0;
-	pvec->cold = cold;
 }
 
 static inline void pagevec_reinit(struct pagevec *pvec)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d302155..762fe08 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -363,7 +363,7 @@ static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
 #define free_page_and_swap_cache(page) \
 	page_cache_release(page)
 #define free_pages_and_swap_cache(pages, nr) \
-	release_pages((pages), (nr), 0);
+	release_pages((pages), (nr));
 
 static inline void show_swap_cache_info(void)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 2523d95..7c0f78c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -274,7 +274,7 @@ int wait_on_page_writeback_range(struct address_space *mapping,
 	if (end < start)
 		return 0;
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	index = start;
 	while ((index <= end) &&
 			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3c84128..fa7c000 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -951,7 +951,7 @@ int write_cache_pages(struct address_space *mapping,
 		return 0;
 	}
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	if (wbc->range_cyclic) {
 		writeback_index = mapping->writeback_index; /* prev offset */
 		index = writeback_index;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 627837c..b3906db 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1034,7 +1034,7 @@ void mark_free_pages(struct zone *zone)
 /*
  * Free a 0-order page
  */
-static void free_hot_cold_page(struct page *page, int cold)
+static void free_pcp_page(struct page *page)
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
@@ -1074,11 +1074,7 @@ static void free_hot_cold_page(struct page *page, int cold)
 
 	/* Record the migratetype and place on the lists */
 	set_page_private(page, migratetype);
-	if (cold)
-		list_add_tail(&page->lru, &pcp->lists[migratetype]);
-	else
-		list_add(&page->lru, &pcp->lists[migratetype]);
-
+	list_add(&page->lru, &pcp->lists[migratetype]);
 	pcp->count++;
 	if (pcp->count >= pcp->high) {
 		free_pcppages_bulk(zone, pcp->batch, pcp);
@@ -1089,14 +1085,9 @@ out:
 	put_cpu();
 }
 
-void free_hot_page(struct page *page)
-{
-	free_hot_cold_page(page, 0);
-}
-	
-void free_cold_page(struct page *page)
+void free_zerocount_page(struct page *page)
 {
-	free_hot_cold_page(page, 1);
+	free_pcp_page(page);
 }
 
 /*
@@ -1893,14 +1884,14 @@ void __pagevec_free(struct pagevec *pvec)
 	int i = pagevec_count(pvec);
 
 	while (--i >= 0)
-		free_hot_cold_page(pvec->pages[i], pvec->cold);
+		free_pcp_page(pvec->pages[i]);
 }
 
 void __free_pages(struct page *page, unsigned int order)
 {
 	if (put_page_testzero(page)) {
 		if (order == 0)
-			free_hot_page(page);
+			free_pcp_page(page);
 		else
 			__free_pages_ok(page, order, -1);
 	}
diff --git a/mm/swap.c b/mm/swap.c
index 8adb9fe..0fa30bc 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -55,7 +55,7 @@ static void __page_cache_release(struct page *page)
 		del_page_from_lru(zone, page);
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
-	free_hot_page(page);
+	free_zerocount_page(page);
 }
 
 static void put_compound_page(struct page *page)
@@ -126,7 +126,7 @@ static void pagevec_move_tail(struct pagevec *pvec)
 	if (zone)
 		spin_unlock(&zone->lru_lock);
 	__count_vm_events(PGROTATED, pgmoved);
-	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
 
@@ -324,14 +324,14 @@ int lru_add_drain_all(void)
  * grabbed the page via the LRU.  If it did, give up: shrink_inactive_list()
  * will free it.
  */
-void release_pages(struct page **pages, int nr, int cold)
+void release_pages(struct page **pages, int nr)
 {
 	int i;
 	struct pagevec pages_to_free;
 	struct zone *zone = NULL;
 	unsigned long uninitialized_var(flags);
 
-	pagevec_init(&pages_to_free, cold);
+	pagevec_init(&pages_to_free);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
@@ -390,7 +390,7 @@ void release_pages(struct page **pages, int nr, int cold)
 void __pagevec_release(struct pagevec *pvec)
 {
 	lru_add_drain();
-	release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
+	release_pages(pvec->pages, pagevec_count(pvec));
 	pagevec_reinit(pvec);
 }
 
@@ -432,7 +432,7 @@ void ____pagevec_lru_add(struct pagevec *pvec, enum lru_list lru)
 	}
 	if (zone)
 		spin_unlock_irq(&zone->lru_lock);
-	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3ecea98..a0ad9ec 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -236,7 +236,7 @@ void free_pages_and_swap_cache(struct page **pages, int nr)
 
 		for (i = 0; i < todo; i++)
 			free_swap_cache(pagep[i]);
-		release_pages(pagep, todo, 0);
+		release_pages(pagep, todo);
 		pagep += todo;
 		nr -= todo;
 	}
diff --git a/mm/truncate.c b/mm/truncate.c
index 1229211..4d05520 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -174,7 +174,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
 	end = (lend >> PAGE_CACHE_SHIFT);
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	next = start;
 	while (next <= end &&
 	       pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
@@ -275,7 +275,7 @@ unsigned long __invalidate_mapping_pages(struct address_space *mapping,
 	unsigned long ret = 0;
 	int i;
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	while (next <= end &&
 			pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
 		for (i = 0; i < pagevec_count(&pvec); i++) {
@@ -397,7 +397,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	int did_range_unmap = 0;
 	int wrapped = 0;
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	next = start;
 	while (next <= end && !wrapped &&
 		pagevec_lookup(&pvec, mapping, next,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9a27c44..9cadc27 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -584,7 +584,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 	cond_resched();
 
-	pagevec_init(&freed_pvec, 1);
+	pagevec_init(&freed_pvec);
 	while (!list_empty(page_list)) {
 		struct address_space *mapping;
 		struct page *page;
@@ -1050,7 +1050,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
-	pagevec_init(&pvec, 1);
+	pagevec_init(&pvec);
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
@@ -1261,7 +1261,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	/*
 	 * Move the pages to the [file or anon] inactive list.
 	 */
-	pagevec_init(&pvec, 1);
+	pagevec_init(&pvec);
 	pgmoved = 0;
 	lru = LRU_BASE + file * LRU_FILE;
 
@@ -2488,7 +2488,7 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)
 	if (mapping->nrpages == 0)
 		return;
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	while (next < end &&
 		pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
 		int i;
-- 
1.5.6.5


* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-22 23:57   ` Andi Kleen
  -1 siblings, 0 replies; 190+ messages in thread
From: Andi Kleen @ 2009-02-22 23:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

Mel Gorman <mel@csn.ul.ie> writes:

> The complexity of the page allocator has been increasing for some time
> and it has now reached the point where the SLUB allocator is doing strange
> tricks to avoid the page allocator. This is obviously bad as it may encourage
> other subsystems to try avoiding the page allocator as well.

Congratulations! That was long overdue. Haven't read the patches yet though.

> Patch 15 reduces the number of times interrupts are disabled by reworking
> what free_page_mlock() does. However, I notice that the cost of calling
> TestClearPageMlocked() is still quite high and I'm guessing it's because
> it's a locked bit operation. It's be nice if it could be established if
> it's safe to use an unlocked version here. Rik, can you comment?

What machine was that again?

> Patch 16 avoids using the zonelist cache on non-NUMA machines

My suspicion is that it can even be dropped on most small (all?) NUMA systems.

> Patch 20 gets rid of hot/cold freeing of pages because it incurs cost for
> what I believe to be very dubious gain. I'm not sure we currently gain
> anything by it but it's further discussed in the patch itself.

Yes, the hot/cold thing was always quite dubious.

> Counters are surprising expensive, we spent a good chuck of our time in
> functions like __dec_zone_page_state and __dec_zone_state. In a profiled
> run of kernbench, the time spent in __dec_zone_state was roughly equal to
> the combined cost of the rest of the page free path. A quick check showed
> that almost half of the time in that function is spent on line 233 alone
> which for me is;
>
> 	(*p)--;
>
> That's worth a separate investigation but it might be a case that
> manipulating int8_t on the machine I was using for profiling is unusually
> expensive. 

What machine was that?

In general I wouldn't expect that to be so expensive even on a system
with slow char operations. It sounds more like a cache miss or a
cache line bounce. You could possibly confirm that by using appropriate
performance counters.

> Converting this to an int might be faster but the increased
> memory consumption and cache footprint might be a problem. Opinions?

One possibility would be to move the zone statistics to allocated
per cpu data. Or perhaps just stop counting per zone at all and
only count per cpu.
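
As a rough illustration of the per-cpu-only variant (a sketch, not code
from any posted patch here; the counter and function names are made up):

	#include <linux/percpu.h>
	#include <linux/cpumask.h>

	/* hypothetical per-cpu-only counter; no zone-wide field is touched */
	static DEFINE_PER_CPU(long, pcpu_pages_freed);

	static inline void count_page_freed(void)
	{
		get_cpu_var(pcpu_pages_freed)++;	/* stays on the local cacheline */
		put_cpu_var(pcpu_pages_freed);
	}

	/* fold the per-cpu values together only when someone reads the stat */
	static long pages_freed_total(void)
	{
		long sum = 0;
		int cpu;

		for_each_online_cpu(cpu)
			sum += per_cpu(pcpu_pages_freed, cpu);
		return sum;
	}

The write side never leaves the local CPU's data, so the cost currently
seen in __dec_zone_state would be traded for a rarer, more expensive
read side.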

> The downside is that the patches do increase text size because of the
> splitting of the fast path into one inlined blob and the slow path into a
> number of other functions. On my test machine, text increased by 1.2K so
> I might revisit that again and see how much of a difference it really made.
>
> That all said, I'm seeing good results on actual benchmarks with these
> patches.
>
> o On many machines, I'm seeing a 0-2% improvement on kernbench. The dominant

Neat.

> So, by and large it's an improvement of some sort.

That seems like an understatement.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-23  0:02   ` Andi Kleen
  -1 siblings, 0 replies; 190+ messages in thread
From: Andi Kleen @ 2009-02-23  0:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

Mel Gorman <mel@csn.ul.ie> writes:


BTW one additional tuning opportunity would be to change cpusets to
always precompute zonelists out of line and then avoid doing
all these checks in the fast path.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: [PATCH 07/20] Simplify the check on whether cpusets are a factor or not
  2009-02-22 23:17   ` Mel Gorman
@ 2009-02-23  7:14     ` Pekka J Enberg
  -1 siblings, 0 replies; 190+ messages in thread
From: Pekka J Enberg @ 2009-02-23  7:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Rik van Riel, KOSAKI Motohiro,
	Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Sun, 22 Feb 2009, Mel Gorman wrote:
> The check whether cpuset contraints need to be checked or not is complex
> and often repeated.  This patch makes the check in advance to the comparison
> is simplier to compute.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

You can do that in a cleaner way by defining ALLOC_CPUSET to be zero when
CONFIG_CPUSETS is disabled. Something like the following untested patch:

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5675b30..18b687d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1135,7 +1135,12 @@ failed:
 #define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
+
+#ifdef CONFIG_CPUSETS
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+#else
+#define ALLOC_CPUSET		0x00
+#endif
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 

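The nice side-effect is that with ALLOC_CPUSET defined to 0, the cpuset
test in the zone loop reduces to "alloc_flags & 0" and the compiler can
drop the whole branch. Roughly speaking, the check looks like this
(paraphrased from get_page_from_freelist(), so treat it as a sketch
rather than the exact code):

	if ((alloc_flags & ALLOC_CPUSET) &&
	    !cpuset_zone_allowed_softwall(zone, gfp_mask))
		goto try_next_zone;

With CONFIG_CPUSETS disabled the condition is constant-false, so no
cpuset code is reachable from the fast path at all.
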
* Re: [PATCH 11/20] Inline get_page_from_freelist() in the fast-path
  2009-02-22 23:17   ` Mel Gorman
@ 2009-02-23  7:21     ` Pekka Enberg
  -1 siblings, 0 replies; 190+ messages in thread
From: Pekka Enberg @ 2009-02-23  7:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Rik van Riel, KOSAKI Motohiro,
	Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 1:17 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> In the best-case scenario, use an inlined version of
> get_page_from_freelist(). This increases the size of the text but avoids
> time spent pushing arguments onto the stack.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

It's not obvious to me why this would be a huge win, so I suppose this
patch description could use numbers. Note: we used to do tricks like
these in slab.c but got rid of most of them to reduce kernel text size,
which is probably why the patch seems a bit backwards to me.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 13/20] Inline buffered_rmqueue()
  2009-02-22 23:17   ` Mel Gorman
@ 2009-02-23  7:24     ` Pekka Enberg
  -1 siblings, 0 replies; 190+ messages in thread
From: Pekka Enberg @ 2009-02-23  7:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Rik van Riel, KOSAKI Motohiro,
	Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 1:17 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> buffered_rmqueue() is in the fast path so inline it. This incurs text
> bloat as there is now a copy in the fast and slow paths but the cost of
> the function call was noticeable in profiles of the fast path.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/page_alloc.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d8a6828..2383147 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1080,7 +1080,8 @@ void split_page(struct page *page, unsigned int order)
>  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
>  * or two.
>  */
> -static struct page *buffered_rmqueue(struct zone *preferred_zone,
> +static inline
> +struct page *buffered_rmqueue(struct zone *preferred_zone,
>                        struct zone *zone, int order, gfp_t gfp_flags,
>                        int migratetype)
>  {

I'm not sure if this has changed now but at least in the past, you had
to use __always_inline to force GCC to do the inlining for all
configurations.
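
A minimal sketch of what that would look like for the buffered_rmqueue()
change, assuming __always_inline is wanted; declaration only, not the
posted patch:

	static __always_inline
	struct page *buffered_rmqueue(struct zone *preferred_zone,
				struct zone *zone, int order, gfp_t gfp_flags,
				int migratetype)
	{
		/* ... body exactly as in the posted patch ... */
	}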

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-23  7:29   ` Pekka Enberg
  -1 siblings, 0 replies; 190+ messages in thread
From: Pekka Enberg @ 2009-02-23  7:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Rik van Riel, KOSAKI Motohiro,
	Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 1:17 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> The complexity of the page allocator has been increasing for some time
> and it has now reached the point where the SLUB allocator is doing strange
> tricks to avoid the page allocator. This is obviously bad as it may encourage
> other subsystems to try avoiding the page allocator as well.

I'm not an expert on the page allocator but the series looks sane to me.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>

Yanmin, it would be interesting to know if we still need 8K kmalloc
caches with these patches applied. :-)

                               Pekka

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
  2009-02-23  7:29   ` Pekka Enberg
@ 2009-02-23  8:34     ` Zhang, Yanmin
  -1 siblings, 0 replies; 190+ messages in thread
From: Zhang, Yanmin @ 2009-02-23  8:34 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Mel Gorman, Linux Memory Management List, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming

On Mon, 2009-02-23 at 09:29 +0200, Pekka Enberg wrote:
> On Mon, Feb 23, 2009 at 1:17 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > The complexity of the page allocator has been increasing for some time
> > and it has now reached the point where the SLUB allocator is doing strange
> > tricks to avoid the page allocator. This is obviously bad as it may encourage
> > other subsystems to try avoiding the page allocator as well.
> 
> I'm not an expert on the page allocator but the series looks sane to me.
> 
> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
> 
> Yanmin, it would be interesting to know if we still need 8K kmalloc
> caches with these patches applied. :-)
We are running tests against the patch series on top of 2.6.29-rc5 and
will keep you posted on the results.



^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 07/20] Simplify the check on whether cpusets are a factor or not
  2009-02-23  7:14     ` Pekka J Enberg
@ 2009-02-23  9:07       ` Peter Zijlstra
  -1 siblings, 0 replies; 190+ messages in thread
From: Peter Zijlstra @ 2009-02-23  9:07 UTC (permalink / raw)
  To: Pekka J Enberg
  Cc: Mel Gorman, Linux Memory Management List, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, 2009-02-23 at 09:14 +0200, Pekka J Enberg wrote:
> On Sun, 22 Feb 2009, Mel Gorman wrote:
> > The check on whether cpuset constraints need to be applied is complex
> > and often repeated.  This patch makes the check in advance so that the
> > comparison is simpler to compute.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> You can do that in a cleaner way by defining ALLOC_CPUSET to be zero when 
> CONFIG_CPUSETS is disabled. Something like following untested patch:
> 
> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
> ---
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5675b30..18b687d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1135,7 +1135,12 @@ failed:
>  #define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
>  #define ALLOC_HARDER		0x10 /* try to alloc harder */
>  #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
> +
> +#ifdef CONFIG_CPUSETS
>  #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
> +#else
> +#define ALLOC_CPUSET		0x00
> +#endif
>  

Mel's patch however even avoids the code when cpusets are configured but
not actively used (the most common case for distro kernels).


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
  2009-02-23  7:29   ` Pekka Enberg
@ 2009-02-23  9:10     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 190+ messages in thread
From: KOSAKI Motohiro @ 2009-02-23  9:10 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: kosaki.motohiro, Mel Gorman, Linux Memory Management List,
	Rik van Riel, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

> On Mon, Feb 23, 2009 at 1:17 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > The complexity of the page allocator has been increasing for some time
> > and it has now reached the point where the SLUB allocator is doing strange
> > tricks to avoid the page allocator. This is obviously bad as it may encourage
> > other subsystems to try avoiding the page allocator as well.
> 
> I'm not an expert on the page allocator but the series looks sane to me.

Yeah!
I also strongly like this patch series.

Unfortunately, I don't have enough time for patch review this week,
but I expect I can review and test it next week.

thanks.


> 
> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
> 
> Yanmin, it would be interesting to know if we still need 8K kmalloc
> caches with these patches applied. :-)
> 
>                                Pekka




^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 07/20] Simplify the check on whether cpusets are a factor or not
  2009-02-23  9:07       ` Peter Zijlstra
@ 2009-02-23  9:13         ` Pekka Enberg
  -1 siblings, 0 replies; 190+ messages in thread
From: Pekka Enberg @ 2009-02-23  9:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Linux Memory Management List, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, 2009-02-23 at 10:07 +0100, Peter Zijlstra wrote:
> On Mon, 2009-02-23 at 09:14 +0200, Pekka J Enberg wrote:
> > On Sun, 22 Feb 2009, Mel Gorman wrote:
> > > The check on whether cpuset constraints need to be applied is complex
> > > and often repeated.  This patch makes the check in advance so that the
> > > comparison is simpler to compute.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > 
> > You can do that in a cleaner way by defining ALLOC_CPUSET to be zero when 
> > CONFIG_CPUSETS is disabled. Something like following untested patch:
> > 
> > Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
> > ---
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 5675b30..18b687d 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1135,7 +1135,12 @@ failed:
> >  #define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
> >  #define ALLOC_HARDER		0x10 /* try to alloc harder */
> >  #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
> > +
> > +#ifdef CONFIG_CPUSETS
> >  #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
> > +#else
> > +#define ALLOC_CPUSET		0x00
> > +#endif
> >  
> 
> Mel's patch however even avoids the code when cpusets are configured but
> not actively used (the most common case for distro kernels).

Right. Combining both patches is probably the best solution then as we
get rid of the #ifdef in get_page_from_freelist().

			Pekka


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 07/20] Simplify the check on whether cpusets are a factor or not
  2009-02-22 23:17   ` Mel Gorman
@ 2009-02-23  9:14     ` Li Zefan
  -1 siblings, 0 replies; 190+ messages in thread
From: Li Zefan @ 2009-02-23  9:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

> +#ifdef CONFIG_CPUSETS
> +	/* Determine in advance if the cpuset checks will be needed */
> +	if ((alloc_flags & ALLOC_CPUSET) && unlikely(number_of_cpusets > 1))
> +		alloc_cpuset = 1;
> +#endif
> +
>  zonelist_scan:
>  	/*
>  	 * Scan zonelist, looking for a zone with enough free.
> @@ -1420,8 +1427,8 @@ zonelist_scan:
>  		if (NUMA_BUILD && zlc_active &&
>  			!zlc_zone_worth_trying(zonelist, z, allowednodes))
>  				continue;
> -		if ((alloc_flags & ALLOC_CPUSET) &&
> -			!cpuset_zone_allowed_softwall(zone, gfp_mask))
> +		if (alloc_cpuset)
> +			if (!cpuset_zone_allowed_softwall(zone, gfp_mask))

I think you can call __cpuset_zone_allowed_softwall(), which won't
check number_of_cpusets; note you should also define an empty
no-op __xxx() stub for !CONFIG_CPUSETS.

>  				goto try_next_zone;
>  
>  		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
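
A rough sketch of the suggested stub for the !CONFIG_CPUSETS side of
include/linux/cpuset.h, mirroring the existing cpuset_zone_allowed_softwall()
no-op; untested, for illustration only:

	/* include/linux/cpuset.h, !CONFIG_CPUSETS side */
	static inline int __cpuset_zone_allowed_softwall(struct zone *z,
							  gfp_t gfp_mask)
	{
		return 1;	/* no cpusets, so every zone is allowed */
	}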

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 15/20] Do not disable interrupts in free_page_mlock()
  2009-02-22 23:17   ` Mel Gorman
@ 2009-02-23  9:19     ` Peter Zijlstra
  -1 siblings, 0 replies; 190+ messages in thread
From: Peter Zijlstra @ 2009-02-23  9:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Sun, 2009-02-22 at 23:17 +0000, Mel Gorman wrote:
> free_page_mlock() tests and clears PG_mlocked. If set, it disables interrupts
> to update counters and this happens on every page free even though interrupts
> are disabled very shortly afterwards a second time.  This is wasteful.
> 
> This patch splits what free_page_mlock() does. The bit check is still
> made. However, the update of counters is delayed until the interrupts are
> disabled. One potential weirdness with this split is that the counters do
> not get updated if the bad_page() check is triggered but a system showing
> bad pages is getting screwed already.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/internal.h   |   10 ++--------
>  mm/page_alloc.c |    8 +++++++-
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 478223b..b52bf86 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -155,14 +155,8 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
>   */
>  static inline void free_page_mlock(struct page *page)
>  {
> -	if (unlikely(TestClearPageMlocked(page))) {
> -		unsigned long flags;
> -
> -		local_irq_save(flags);
> -		__dec_zone_page_state(page, NR_MLOCK);
> -		__count_vm_event(UNEVICTABLE_MLOCKFREED);
> -		local_irq_restore(flags);
> -	}
> +	__dec_zone_page_state(page, NR_MLOCK);
> +	__count_vm_event(UNEVICTABLE_MLOCKFREED);
>  }

It's not actually clearing PG_mlocked anymore, so the name is now a tad
misleading.

That said, since we're freeing the page, there ought not to be another
reference to the page, in which case it appears to me we could safely
use the unlocked variant of TestClear*().
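
A minimal sketch of that suggestion, assuming the freed page really has no
other users (which is exactly the open question here); the call-site shape
below is illustrative only and not the posted patch:

	/*
	 * Non-atomic variant: safe only if nobody else can observe
	 * page->flags at this point.  If no __TestClearPageMlocked()
	 * wrapper is available, spell it with the generic bit helper.
	 */
	if (unlikely(__test_and_clear_bit(PG_mlocked, &page->flags)))
		free_page_mlock(page);	/* with patch 15, only updates counters */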



^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-22 23:17   ` Mel Gorman
@ 2009-02-23  9:37     ` Andrew Morton
  -1 siblings, 0 replies; 190+ messages in thread
From: Andrew Morton @ 2009-02-23  9:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Sun, 22 Feb 2009 23:17:29 +0000 Mel Gorman <mel@csn.ul.ie> wrote:

> Currently an effort is made to determine if a page is hot or cold when
> it is being freed so that cache hot pages can be allocated to callers if
> possible. However, the reasoning used whether to mark something hot or
> cold is a bit spurious. A profile run of kernbench showed that "cold"
> pages were never freed so it either doesn't happen generally or is so
> rare, it's barely measurable.
> 
> It's dubious as to whether pages are being correctly marked hot and cold
> anyway. Things like page cache and pages being truncated are are considered
> "hot" but there is no guarantee that these pages have been recently used
> and are cache hot. Pages being reclaimed from the LRU are considered
> cold which is logical because they cannot have been referenced recently
> but if the system is reclaiming pages, then we have entered allocator
> slowpaths and are not going to notice any potential performance boost
> because a "hot" page was freed.
> 
> This patch just deletes the concept of freeing hot or cold pages and
> just frees them all as hot.
> 

Well yes.  We waffled for months over whether to merge that code originally.

What tipped the balance was a dopey microbenchmark which I wrote which
sat in a loop extending (via write()) and then truncating the same file
by 32 kbytes (or thereabouts).  Its performance was increased by a lot
(2x or more, iirc) and no actual regressions were demonstrable, so we
merged it.

Could you check that please?  I'd suggest trying various values of 32k,
too.
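
A rough reconstruction of such a microbenchmark, for anyone who wants to
re-run it; the original program is not in this thread, so the file name,
chunk size and iteration count below are guesses:

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		size_t chunk = (argc > 1) ? strtoul(argv[1], NULL, 0) : 32 * 1024;
		char *buf = calloc(1, chunk);
		int fd = open("hotcold-bench.tmp", O_CREAT | O_TRUNC | O_RDWR, 0600);
		long i;

		if (fd < 0 || buf == NULL) {
			perror("setup");
			return 1;
		}

		for (i = 0; i < 1000000; i++) {
			/* extend the file by 'chunk' bytes... */
			if (write(fd, buf, chunk) != (ssize_t)chunk) {
				perror("write");
				return 1;
			}
			/* ...then truncate it back to zero and rewind */
			if (ftruncate(fd, 0) < 0 || lseek(fd, 0, SEEK_SET) < 0) {
				perror("truncate");
				return 1;
			}
		}

		close(fd);
		unlink("hotcold-bench.tmp");
		free(buf);
		return 0;
	}

Timing the loop with and without patch 20 applied, for a few chunk sizes
around 32k, would show whether the hot/cold distinction still buys anything.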

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 07/20] Simplify the check on whether cpusets are a factor or not
  2009-02-23  9:13         ` Pekka Enberg
@ 2009-02-23 11:39           ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 11:39 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Peter Zijlstra, Linux Memory Management List, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 11:13:23AM +0200, Pekka Enberg wrote:
> On Mon, 2009-02-23 at 10:07 +0100, Peter Zijlstra wrote:
> > On Mon, 2009-02-23 at 09:14 +0200, Pekka J Enberg wrote:
> > > On Sun, 22 Feb 2009, Mel Gorman wrote:
> > > > The check on whether cpuset constraints need to be applied is complex
> > > > and often repeated.  This patch makes the check in advance so that the
> > > > comparison is simpler to compute.
> > > > 
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > 
> > > You can do that in a cleaner way by defining ALLOC_CPUSET to be zero when 
> > > CONFIG_CPUSETS is disabled. Something like following untested patch:
> > > 
> > > Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
> > > ---
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 5675b30..18b687d 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1135,7 +1135,12 @@ failed:
> > >  #define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
> > >  #define ALLOC_HARDER		0x10 /* try to alloc harder */
> > >  #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
> > > +
> > > +#ifdef CONFIG_CPUSETS
> > >  #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
> > > +#else
> > > +#define ALLOC_CPUSET		0x00
> > > +#endif
> > >  
> > 
> > Mel's patch however even avoids the code when cpusets are configured but
> > not actively used (the most common case for distro kernels).
> 
> Right. Combining both patches is probably the best solution then as we
> get rid of the #ifdef in get_page_from_freelist().
> 

An #ifdef in a function is ugly all right. Here is a slightly different
version based on your suggestion. Note the definition of number_of_cpusets
in the !CONFIG_CPUSETS case. I didn't call cpuset_zone_allowed_softwall()
for the preferred zone in case it wasn't in the cpuset for some reason and
we incorrectly disabled the cpuset check.

=====
Simplify the check on whether cpusets are a factor or not

The check on whether cpuset constraints need to be applied is complex
and often repeated.  This patch makes the check in advance so that the
comparison is simpler to compute.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 90c6074..6051082 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -83,6 +83,8 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
 
 #else /* !CONFIG_CPUSETS */
 
+#define number_of_cpusets (0)
+
 static inline int cpuset_init_early(void) { return 0; }
 static inline int cpuset_init(void) { return 0; }
 static inline void cpuset_init_smp(void) {}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 503d692..405cd8c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1136,7 +1136,11 @@ failed:
 #define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
+#ifdef CONFIG_CPUSETS
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+#else
+#define ALLOC_CPUSET		0x00
+#endif /* CONFIG_CPUSETS */
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
@@ -1400,6 +1404,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
+	int alloc_cpuset = 0;
 
 	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
 							&preferred_zone);
@@ -1410,6 +1415,10 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 
 	VM_BUG_ON(order >= MAX_ORDER);
 
+	/* Determine in advance if the cpuset checks will be needed */
+	if ((alloc_flags & ALLOC_CPUSET) && unlikely(number_of_cpusets > 1))
+		alloc_cpuset = 1;
+
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -1420,8 +1429,8 @@ zonelist_scan:
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
-		if ((alloc_flags & ALLOC_CPUSET) &&
-			!cpuset_zone_allowed_softwall(zone, gfp_mask))
+		if (alloc_cpuset)
+			if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				goto try_next_zone;
 
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [PATCH 11/20] Inline get_page_from_freelist() in the fast-path
  2009-02-23  7:21     ` Pekka Enberg
@ 2009-02-23 11:42       ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 11:42 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linux Memory Management List, Rik van Riel, KOSAKI Motohiro,
	Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 09:21:09AM +0200, Pekka Enberg wrote:
> On Mon, Feb 23, 2009 at 1:17 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > In the best-case scenario, use an inlined version of
> > get_page_from_freelist(). This increases the size of the text but avoids
> > time spent pushing arguments onto the stack.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> It's not obvious to me why this would be a huge win so I suppose this
> patch description could use numbers.

I don't have the exact numbers from the profiles any more, but the
function entry and exit accounted for about 1/20th of the cost of the path
when zeroing pages is not taken into account.

> Note: we used to do tricks like
> these in slab.c but got rid of most of them to reduce kernel text size
> which is probably why the patch seems a bit backwards to me.
> 

I'll be rechecking this patch in particular because it's likely the
biggest source of text bloat in the entire series.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 13/20] Inline buffered_rmqueue()
  2009-02-23  7:24     ` Pekka Enberg
@ 2009-02-23 11:44       ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 11:44 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linux Memory Management List, Rik van Riel, KOSAKI Motohiro,
	Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 09:24:19AM +0200, Pekka Enberg wrote:
> On Mon, Feb 23, 2009 at 1:17 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > buffered_rmqueue() is in the fast path so inline it. This incurs text
> > bloat as there is now a copy in the fast and slow paths but the cost of
> > the function call was noticeable in profiles of the fast path.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/page_alloc.c |    3 ++-
> >  1 files changed, 2 insertions(+), 1 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index d8a6828..2383147 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1080,7 +1080,8 @@ void split_page(struct page *page, unsigned int order)
> >  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
> >  * or two.
> >  */
> > -static struct page *buffered_rmqueue(struct zone *preferred_zone,
> > +static inline
> > +struct page *buffered_rmqueue(struct zone *preferred_zone,
> >                        struct zone *zone, int order, gfp_t gfp_flags,
> >                        int migratetype)
> >  {
> 
> I'm not sure if this has changed now but at least in the past, you had
> to use __always_inline to force GCC to do the inlining for all
> configurations.
> 

Hmm, as there is only one call-site, I would expect gcc to inline it. I
can force it though. Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* [PATCH] mm: clean up __GFP_* flags a bit
  2009-02-22 23:17   ` Mel Gorman
@ 2009-02-23 11:55     ` Peter Zijlstra
  -1 siblings, 0 replies; 190+ messages in thread
From: Peter Zijlstra @ 2009-02-23 11:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

Subject: mm: clean up __GFP_* flags a bit
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon Feb 23 12:28:33 CET 2009

re-sort them and poke at some whitespace alignment for easier reading.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/gfp.h |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -25,6 +25,8 @@ struct vm_area_struct;
 #define __GFP_HIGHMEM	((__force gfp_t)0x02u)
 #define __GFP_DMA32	((__force gfp_t)0x04u)
 
+#define __GFP_MOVABLE	((__force gfp_t)0x08u)  /* Page is movable */
+
 /*
  * Action modifiers - doesn't change the zoning
  *
@@ -50,16 +52,15 @@ struct vm_area_struct;
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* See above */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
-#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
-#define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
-#define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_NOMEMALLOC  ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
+#define __GFP_HARDWALL    ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_THISNODE	  ((__force gfp_t)0x40000u) /* No fallback, no policies */
 #define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
-#define __GFP_MOVABLE	((__force gfp_t)0x08u)  /* Page is movable */
 
 #ifdef CONFIG_KMEMCHECK
-#define __GFP_NOTRACK	((__force gfp_t)0x200000u)  /* Don't track with kmemcheck */
+#define __GFP_NOTRACK	  ((__force gfp_t)0x100000u) /* Don't track with kmemcheck */
 #else
-#define __GFP_NOTRACK	((__force gfp_t)0)
+#define __GFP_NOTRACK	  ((__force gfp_t)0)
 #endif
 
 #define __GFP_BITS_SHIFT 22	/* Room for 22 __GFP_FOO bits */



^ permalink raw reply	[flat|nested] 190+ messages in thread

* [PATCH] mm: gfp_to_alloc_flags()
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-23 11:55   ` Peter Zijlstra
  -1 siblings, 0 replies; 190+ messages in thread
From: Peter Zijlstra @ 2009-02-23 11:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

I've always found the below a clean-up, respun it on top of your changes.
Test box still boots ;-)

---
Subject: mm: gfp_to_alloc_flags()
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon Feb 23 12:46:36 CET 2009

Clean up the code by factoring out the gfp to alloc_flags mapping.

[neilb@suse.de says]
As the test:

-       if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-                       && !in_interrupt()) {
-               if (!(gfp_mask & __GFP_NOMEMALLOC)) {

has been replaced with a slightly weaker one:

+       if (alloc_flags & ALLOC_NO_WATERMARKS) {

we need to ensure we don't recurse when PF_MEMALLOC is set

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page_alloc.c |   90 ++++++++++++++++++++++++++++++++------------------------
 1 file changed, 52 insertions(+), 38 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1658,16 +1658,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
 	return page;
 }
 
-static inline int is_allocation_high_priority(struct task_struct *p,
-							gfp_t gfp_mask)
-{
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt())
-		if (!(gfp_mask & __GFP_NOMEMALLOC))
-			return 1;
-	return 0;
-}
-
 /*
  * This is called in the allocator slow-path if the allocation request is of
  * sufficient urgency to ignore watermarks and take other desperate measures
@@ -1702,6 +1692,44 @@ void wake_all_kswapd(unsigned int order,
 		wakeup_kswapd(zone, order);
 }
 
+/*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+static int gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+	struct task_struct *p = current;
+	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+	/*
+	 * The caller may dip into page reserves a bit more if the caller
+	 * cannot run direct reclaim, or if the caller has realtime scheduling
+	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
+	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+	 */
+	if (gfp_mask & __GFP_HIGH)
+		alloc_flags |= ALLOC_HIGH;
+
+	if (!wait) {
+		alloc_flags |= ALLOC_HARDER;
+		/*
+		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+		 */
+		alloc_flags &= ~ALLOC_CPUSET;
+	} else if (unlikely(rt_task(p)) && !in_interrupt())
+		alloc_flags |= ALLOC_HARDER;
+
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (!in_interrupt() &&
+		    ((p->flags & PF_MEMALLOC) ||
+		     unlikely(test_thread_flag(TIF_MEMDIE))))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+	}
+
+	return alloc_flags;
+}
+
 static struct page * noinline
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -1732,48 +1760,34 @@ __alloc_pages_slowpath(gfp_t gfp_mask, u
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
-	 *
-	 * The caller may dip into page reserves a bit more if the caller
-	 * cannot run direct reclaim, or if the caller has realtime scheduling
-	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 	 */
-	alloc_flags = ALLOC_WMARK_MIN;
-	if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
-		alloc_flags |= ALLOC_HARDER;
-	if (gfp_mask & __GFP_HIGH)
-		alloc_flags |= ALLOC_HIGH;
-	if (wait)
-		alloc_flags |= ALLOC_CPUSET;
+	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
 restart:
-	/*
-	 * Go through the zonelist again. Let __GFP_HIGH and allocations
-	 * coming from realtime tasks go deeper into reserves.
-	 *
-	 * This is the last chance, in general, before the goto nopage.
-	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
-	 */
+	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
-						high_zoneidx, alloc_flags,
-						preferred_zone,
-						migratetype);
+			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
+			preferred_zone, migratetype);
 	if (page)
 		goto got_pg;
 
 	/* Allocate without watermarks if the context allows */
-	if (is_allocation_high_priority(p, gfp_mask))
+	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 		page = __alloc_pages_high_priority(gfp_mask, order,
-			zonelist, high_zoneidx, nodemask, preferred_zone,
-			migratetype);
-	if (page)
-		goto got_pg;
+				zonelist, high_zoneidx, nodemask,
+				preferred_zone, migratetype);
+		if (page)
+			goto got_pg;
+	}
 
 	/* Atomic allocations - we can't balance anything */
 	if (!wait)
 		goto nopage;
 
+	/* Avoid recursion of direct reclaim */
+	if (p->flags & PF_MEMALLOC)
+		goto nopage;
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,




* Re: [PATCH 15/20] Do not disable interrupts in free_page_mlock()
  2009-02-23  9:19     ` Peter Zijlstra
@ 2009-02-23 12:23       ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 12:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 10:19:00AM +0100, Peter Zijlstra wrote:
> On Sun, 2009-02-22 at 23:17 +0000, Mel Gorman wrote:
> > free_page_mlock() tests and clears PG_mlocked. If set, it disables interrupts
> > to update counters and this happens on every page free even though interrupts
> > are disabled very shortly afterwards a second time.  This is wasteful.
> > 
> > This patch splits what free_page_mlock() does. The bit check is still
> > made. However, the update of counters is delayed until the interrupts are
> > disabled. One potential weirdness with this split is that the counters do
> > not get updated if the bad_page() check is triggered but a system showing
> > bad pages is getting screwed already.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/internal.h   |   10 ++--------
> >  mm/page_alloc.c |    8 +++++++-
> >  2 files changed, 9 insertions(+), 9 deletions(-)
> > 
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 478223b..b52bf86 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -155,14 +155,8 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> >   */
> >  static inline void free_page_mlock(struct page *page)
> >  {
> > -	if (unlikely(TestClearPageMlocked(page))) {
> > -		unsigned long flags;
> > -
> > -		local_irq_save(flags);
> > -		__dec_zone_page_state(page, NR_MLOCK);
> > -		__count_vm_event(UNEVICTABLE_MLOCKFREED);
> > -		local_irq_restore(flags);
> > -	}
> > +	__dec_zone_page_state(page, NR_MLOCK);
> > +	__count_vm_event(UNEVICTABLE_MLOCKFREED);
> >  }
> 
> Its not actually clearing PG_mlocked anymore, so the name is now a tad
> misleading.
> 

Really? I see the following

#ifdef CONFIG_UNEVICTABLE_LRU
PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable)
        TESTCLEARFLAG(Unevictable, unevictable)

#define MLOCK_PAGES 1
PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked)
        TESTSCFLAG(Mlocked, mlocked)

#else

#define MLOCK_PAGES 0
PAGEFLAG_FALSE(Mlocked)
        SETPAGEFLAG_NOOP(Mlocked) TESTCLEARFLAG_FALSE(Mlocked)

PAGEFLAG_FALSE(Unevictable) TESTCLEARFLAG_FALSE(Unevictable)
        SETPAGEFLAG_NOOP(Unevictable) CLEARPAGEFLAG_NOOP(Unevictable)
        __CLEARPAGEFLAG_NOOP(Unevictable)
#endif

So there is a PG_mlocked bit once UNEVICTABLE_LRU is set which was the
case on the tests I was running. I'm probably missing something silly.
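For reference, those macros expand to the usual page-flag accessors; roughly
(paraphrasing include/linux/page-flags.h, where the real definitions are
macro-generated):

/* PAGEFLAG(Mlocked, mlocked) provides PageMlocked(), SetPageMlocked() and
 * ClearPageMlocked(); __CLEARPAGEFLAG() adds the non-atomic
 * __ClearPageMlocked(); TESTSCFLAG() adds the atomic
 * TestSetPageMlocked()/TestClearPageMlocked(). */
static inline int PageMlocked(struct page *page)
{
	return test_bit(PG_mlocked, &page->flags);
}

static inline void __ClearPageMlocked(struct page *page)
{
	__clear_bit(PG_mlocked, &page->flags);
}

static inline int TestClearPageMlocked(struct page *page)
{
	return test_and_clear_bit(PG_mlocked, &page->flags);
}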

> That said, since we're freeing the page, there ought to not be another
> reference to the page, in which case it appears to me we could safely
> use the unlocked variant of TestClear*().
> 

Regrettably, unlocked variants do not appear to be defined as such but
the following should do the job, right? It applies on top of the current
change.

diff --git a/mm/internal.h b/mm/internal.h
index b52bf86..7f775a1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -155,6 +155,7 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
  */
 static inline void free_page_mlock(struct page *page)
 {
+	__ClearPageMlocked(page);
 	__dec_zone_page_state(page, NR_MLOCK);
 	__count_vm_event(UNEVICTABLE_MLOCKFREED);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index edac673..8bd0533 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -580,7 +580,7 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	unsigned long flags;
 	int i;
 	int bad = 0;
-	int clearMlocked = TestClearPageMlocked(page);
+	int clearMlocked = PageMlocked(page);
 
 	for (i = 0 ; i < (1 << order) ; ++i)
 		bad += free_pages_check(page + i);
@@ -1040,7 +1040,7 @@ static void free_pcp_page(struct page *page)
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
 	int migratetype;
-	int clearMlocked = TestClearPageMlocked(page);
+	int clearMlocked = PageMlocked(page);
 
 	if (PageAnon(page))
 		page->mapping = NULL;

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
  2009-02-22 23:57   ` Andi Kleen
@ 2009-02-23 12:34     ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 12:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 12:57:37AM +0100, Andi Kleen wrote:
> Mel Gorman <mel@csn.ul.ie> writes:
> 
> > The complexity of the page allocator has been increasing for some time
> > and it has now reached the point where the SLUB allocator is doing strange
> > tricks to avoid the page allocator. This is obviously bad as it may encourage
> > other subsystems to try avoiding the page allocator as well.
> 
> Congratulations! That was long overdue. Haven't read the patches yet though.
> 

Thanks

> > Patch 15 reduces the number of times interrupts are disabled by reworking
> > what free_page_mlock() does. However, I notice that the cost of calling
> > TestClearPageMlocked() is still quite high and I'm guessing it's because
> > it's a locked bit operation. It's be nice if it could be established if
> > it's safe to use an unlocked version here. Rik, can you comment?
> 
> What machine was that again?
> 

It's an AMD Phenom 9950 quad core.

> > Patch 16 avoids using the zonelist cache on non-NUMA machines
> 
> My suspicion is that it can be even dropped on most small (all?) NUMA systems.
> 

I'm assuming it should not be dropped for all. My vague memory was that this
was introduced for large IA-64 machines and that they were able to show a
clear gain when scanning large numbers of zones. Patch 16 disables zonelist
caching if there is only one NUMA node but maybe it should be disabled for
more than that.
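To make the single-node case concrete: the zonelist cache only has something
to do when there is more than one node to filter between, so the whole zlc
path can be skipped with a test along these lines (illustrative sketch only,
the helper name is made up; the real change lives in the zlc_* helpers in
mm/page_alloc.c):

/* Sketch: consulting the zonelist cache is pointless unless there is
 * more than one online node to filter between. */
static inline int zlc_worth_using(void)
{
	return num_online_nodes() > 1;
}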

> > Patch 20 gets rid of hot/cold freeing of pages because it incurs cost for
> > what I believe to be very dubious gain. I'm not sure we currently gain
> > anything by it but it's further discussed in the patch itself.
> 
> Yes the hot/cold thing was always quite dubious.
> 

Andrew mentioned a micro-benchmark so I will be digging that up to see
what it can show.

> > Counters are surprising expensive, we spent a good chuck of our time in
> > functions like __dec_zone_page_state and __dec_zone_state. In a profiled
> > run of kernbench, the time spent in __dec_zone_state was roughly equal to
> > the combined cost of the rest of the page free path. A quick check showed
> > that almost half of the time in that function is spent on line 233 alone
> > which for me is;
> >
> > 	(*p)--;
> >
> > That's worth a separate investigation but it might be a case that
> > manipulating int8_t on the machine I was using for profiling is unusually
> > expensive. 
> 
> What machine was that?
> 

This is the AMD Phenom again but I might be mistaken on the line causing
the problem. A second profile run shows all the cost in the function entry
so it might just be a coincidence that the sampling happened to trigger on
that particular line. It's high on the profiles simply because it's called
a lot. The assembler doesn't look particularly bad or anything.
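For anyone not looking at mm/vmstat.c, the decrement in question is of a
small per-cpu signed delta that only occasionally spills into the shared
zone counter; an illustrative, self-contained sketch of that scheme (names
and types below are not the kernel's):

/* Illustrative sketch of a per-cpu differential counter: the common case
 * touches only the local s8 delta; the rare case folds the accumulated
 * delta into the shared counter. */
static void dec_diff_counter(s8 *diff, atomic_long_t *global, int threshold)
{
	(*diff)--;
	if (*diff < -threshold) {
		atomic_long_add(*diff, global);
		*diff = 0;
	}
}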

> In general I wouldn't expect even on a system with slow char
> operations to be that expensive. It sounds more like a cache miss or a
> cache line bounce. You could possibly confirm by using appropriate
> performance counters.
> 

I'll check for cache line misses.

> > Converting this to an int might be faster but the increased
> > memory consumption and cache footprint might be a problem. Opinions?
> 
> One possibility would be to move the zone statistics to allocated
> per cpu data. Or perhaps just stop counting per zone at all and
> only count per cpu.
> 
> > The downside is that the patches do increase text size because of the
> > splitting of the fast path into one inlined blob and the slow path into a
> > number of other functions. On my test machine, text increased by 1.2K so
> > I might revisit that again and see how much of a difference it really made.
> >
> > That all said, I'm seeing good results on actual benchmarks with these
> > patches.
> >
> > o On many machines, I'm seeing a 0-2% improvement on kernbench. The dominant
> 
> Neat.
> 
> > So, by and large it's an improvement of some sort.
> 
> That seems like an understatement.
> 

It'll all depend on what other peoples machines turn up :)

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 15/20] Do not disable interrupts in free_page_mlock()
  2009-02-23 12:23       ` Mel Gorman
@ 2009-02-23 12:44         ` Peter Zijlstra
  -1 siblings, 0 replies; 190+ messages in thread
From: Peter Zijlstra @ 2009-02-23 12:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, 2009-02-23 at 12:23 +0000, Mel Gorman wrote:
> On Mon, Feb 23, 2009 at 10:19:00AM +0100, Peter Zijlstra wrote:
> > On Sun, 2009-02-22 at 23:17 +0000, Mel Gorman wrote:

> > >  static inline void free_page_mlock(struct page *page)
> > >  {
> > > -	if (unlikely(TestClearPageMlocked(page))) {
> > > -		unsigned long flags;
> > > -
> > > -		local_irq_save(flags);
> > > -		__dec_zone_page_state(page, NR_MLOCK);
> > > -		__count_vm_event(UNEVICTABLE_MLOCKFREED);
> > > -		local_irq_restore(flags);
> > > -	}
> > > +	__dec_zone_page_state(page, NR_MLOCK);
> > > +	__count_vm_event(UNEVICTABLE_MLOCKFREED);
> > >  }
> > 
> > Its not actually clearing PG_mlocked anymore, so the name is now a tad
> > misleading.
> > 
> 
> Really? I see the following

> So there is a PG_mlocked bit once UNEVICTABLE_LRU is set which was the
> case on the tests I was running. I'm probably missing something silly.

What I was trying to say was that free_page_mlock() doesn't change the
page-state after your change, hence the 'free' part of its name is
misleading.

> > That said, since we're freeing the page, there ought to not be another
> > reference to the page, in which case it appears to me we could safely
> > use the unlocked variant of TestClear*().
> > 
> 
> Regrettably, unlocked variants do not appear to be defined as such but
> the following should do the job, right? It applies on top of the current
> change.
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index b52bf86..7f775a1 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -155,6 +155,7 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
>   */
>  static inline void free_page_mlock(struct page *page)
>  {
	VM_BUG_ON(!PageMlocked(page)); ?

> +	__ClearPageMlocked(page);
>  	__dec_zone_page_state(page, NR_MLOCK);
>  	__count_vm_event(UNEVICTABLE_MLOCKFREED);
>  }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index edac673..8bd0533 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -580,7 +580,7 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>  	unsigned long flags;
>  	int i;
>  	int bad = 0;
> -	int clearMlocked = TestClearPageMlocked(page);
> +	int clearMlocked = PageMlocked(page);
>  
>  	for (i = 0 ; i < (1 << order) ; ++i)
>  		bad += free_pages_check(page + i);
> @@ -1040,7 +1040,7 @@ static void free_pcp_page(struct page *page)
>  	struct per_cpu_pages *pcp;
>  	unsigned long flags;
>  	int migratetype;
> -	int clearMlocked = TestClearPageMlocked(page);
> +	int clearMlocked = PageMlocked(page);
>  
>  	if (PageAnon(page))
>  		page->mapping = NULL;

Right, that should do.



* Re: [PATCH 07/20] Simplify the check on whether cpusets are a factor or not
  2009-02-23 11:39           ` Mel Gorman
@ 2009-02-23 13:19             ` Pekka Enberg
  -1 siblings, 0 replies; 190+ messages in thread
From: Pekka Enberg @ 2009-02-23 13:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Linux Memory Management List, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

Hi Mel,

On Mon, 2009-02-23 at 11:39 +0000, Mel Gorman wrote:
> An #ifdef in a function is ugly all right. Here is a slightly
> different
> version based on your suggestion. Note the definition of number_of_cpusets
> in the !CONFIG_CPUSETS case. I didn't call cpuset_zone_allowed_softwall()
> for the preferred zone in case it wasn't in the cpuset for some reason and
> we incorrectly disabled the cpuset check.
> 
> =====
> Simplify the check on whether cpusets are a factor or not
> 
> The check whether cpuset constraints need to be checked or not is complex
> and often repeated.  This patch makes the check in advance so the comparison
> is simpler to compute.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Looks good to me!
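For the archive, the shape of the simplification is roughly the following
(condensed and illustrative, not the patch text itself; the helper name is
made up, while number_of_cpusets, NUMA_BUILD and ALLOC_CPUSET are existing
names):

/* Illustrative: decide once whether cpuset constraints can matter at all,
 * so the per-zone loop tests a single ALLOC_CPUSET flag instead of
 * re-deriving the answer for every zone. */
static inline int cpuset_checks_needed(void)
{
	return NUMA_BUILD && number_of_cpusets > 1;
}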



* Re: [PATCH] mm: gfp_to_alloc_flags()
  2009-02-23 11:55   ` Peter Zijlstra
@ 2009-02-23 14:00     ` Pekka Enberg
  -1 siblings, 0 replies; 190+ messages in thread
From: Pekka Enberg @ 2009-02-23 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Linux Memory Management List, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, 2009-02-23 at 12:55 +0100, Peter Zijlstra wrote:
> Subject: mm: gfp_to_alloc_flags()
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Mon Feb 23 12:46:36 CET 2009
> 
> Clean up the code by factoring out the gfp to alloc_flags mapping.
> 
> [neilb@suse.de says]
> As the test:
> 
> -       if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> -                       && !in_interrupt()) {
> -               if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> 
> has been replaced with a slightly weaker one:
> 
> +       if (alloc_flags & ALLOC_NO_WATERMARKS) {
> 
> we need to ensure we don't recurse when PF_MEMALLOC is set
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>



* Re: [PATCH 15/20] Do not disable interrupts in free_page_mlock()
  2009-02-23 12:44         ` Peter Zijlstra
@ 2009-02-23 14:25           ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 14:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 01:44:18PM +0100, Peter Zijlstra wrote:
> On Mon, 2009-02-23 at 12:23 +0000, Mel Gorman wrote:
> > On Mon, Feb 23, 2009 at 10:19:00AM +0100, Peter Zijlstra wrote:
> > > On Sun, 2009-02-22 at 23:17 +0000, Mel Gorman wrote:
> 
> > > >  static inline void free_page_mlock(struct page *page)
> > > >  {
> > > > -	if (unlikely(TestClearPageMlocked(page))) {
> > > > -		unsigned long flags;
> > > > -
> > > > -		local_irq_save(flags);
> > > > -		__dec_zone_page_state(page, NR_MLOCK);
> > > > -		__count_vm_event(UNEVICTABLE_MLOCKFREED);
> > > > -		local_irq_restore(flags);
> > > > -	}
> > > > +	__dec_zone_page_state(page, NR_MLOCK);
> > > > +	__count_vm_event(UNEVICTABLE_MLOCKFREED);
> > > >  }
> > > 
> > > Its not actually clearing PG_mlocked anymore, so the name is now a tad
> > > misleading.
> > > 
> > 
> > Really? I see the following
> 
> > So there is a PG_mlocked bit once UNEVICTABLE_LRU is set which was the
> > case on the tests I was running. I'm probably missing something silly.
> 
> What I was trying to say was that free_page_mlock() doesn't change the
> page-state after your change, hence the 'free' part of its name is
> misleading.
> 

Ah right. As you pointed out by mail, the newer version clears the bit
again, so while the name was misleading, it's sort of ok again now. Mind you,
it's not freeing a page as such. A name like account_freed_mlock() might be
better. As it is only used in page_alloc.c, it could also be taken out of
the header file altogether.
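Concretely, the rename/move being suggested would look something like this
(a sketch of the idea, not a patch that was posted):

/* Sketch: same body as the reworked free_page_mlock(), renamed to reflect
 * that it now only clears the bit and does the accounting, and kept local
 * to mm/page_alloc.c. */
static inline void account_freed_mlock(struct page *page)
{
	__ClearPageMlocked(page);
	__dec_zone_page_state(page, NR_MLOCK);
	__count_vm_event(UNEVICTABLE_MLOCKFREED);
}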

> > > That said, since we're freeing the page, there ought to not be another
> > > reference to the page, in which case it appears to me we could safely
> > > use the unlocked variant of TestClear*().
> > > 
> > 
> > Regrettably, unlocked variants do not appear to be defined as such but
> > the following should do the job, right? It applies on top of the current
> > change.
> > 
> > diff --git a/mm/internal.h b/mm/internal.h
> > index b52bf86..7f775a1 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -155,6 +155,7 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> >   */
> >  static inline void free_page_mlock(struct page *page)
> >  {
> 	VM_BUG_ON(!PageMlocked(page)); ?
> 
> > +	__ClearPageMlocked(page);
> >  	__dec_zone_page_state(page, NR_MLOCK);
> >  	__count_vm_event(UNEVICTABLE_MLOCKFREED);
> >  }
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index edac673..8bd0533 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -580,7 +580,7 @@ static void __free_pages_ok(struct page *page, unsigned int order,
> >  	unsigned long flags;
> >  	int i;
> >  	int bad = 0;
> > -	int clearMlocked = TestClearPageMlocked(page);
> > +	int clearMlocked = PageMlocked(page);
> >  
> >  	for (i = 0 ; i < (1 << order) ; ++i)
> >  		bad += free_pages_check(page + i);
> > @@ -1040,7 +1040,7 @@ static void free_pcp_page(struct page *page)
> >  	struct per_cpu_pages *pcp;
> >  	unsigned long flags;
> >  	int migratetype;
> > -	int clearMlocked = TestClearPageMlocked(page);
> > +	int clearMlocked = PageMlocked(page);
> >  
> >  	if (PageAnon(page))
> >  		page->mapping = NULL;
> 
> Right, that should do.
> 

Nice, thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
  2009-02-23  0:02   ` Andi Kleen
@ 2009-02-23 14:32     ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 14:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 01:02:59AM +0100, Andi Kleen wrote:
> Mel Gorman <mel@csn.ul.ie> writes:
> 
> 
> BTW one additional tuning opportunity would be to change cpusets to
> always precompute zonelists out of line and then avoid doing
> all these checks in the fast path.
> 

hmm, it would be ideal but I haven't looked too closely at how it could
be implemented. I thought first you could just associate a zonelist with
the cpuset but you'd need one for each node allowed by the cpuset so it
could get quite large. Then again, it might be worthwhile if cpusets
were expected to be very long lived.
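To put a shape on the size concern: precomputing would mean carrying
something like one filtered zonelist per allowed node with each cpuset,
e.g. (purely hypothetical structure, nothing like it exists in the tree):

/* Hypothetical layout for precomputed, cpuset-filtered zonelists: one
 * zonelist per node the cpuset allows, which is exactly what makes the
 * memory footprint a concern for short-lived cpusets. */
struct cpuset_zonelists {
	nodemask_t	mems_allowed;
	struct zonelist	zonelists[MAX_NUMNODES];
};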

If there are any users of cpusets watching, would you be interested in
profiling with cpusets enabled and see how much time we spend in that
code?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-23 14:38   ` Christoph Lameter
  -1 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-02-23 14:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Sun, 22 Feb 2009, Mel Gorman wrote:

> I haven't run a page-allocator micro-benchmark to see what sort of figures
> that gives. Christoph, I recall you had some sort of page allocator
> micro-benchmark. Do you want to give it a shot or remind me how to use
> it please?

The page allocator / slab allocator microbenchmarks are in my VM
development git tree. The branch is named tests.

http://git.kernel.org/?p=linux/kernel/git/christoph/vm.git;a=shortlog;h=tests




* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
  2009-02-22 23:17 ` Mel Gorman
@ 2009-02-23 14:46   ` Nick Piggin
  -1 siblings, 0 replies; 190+ messages in thread
From: Nick Piggin @ 2009-02-23 14:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin


Hi Mel,
Seems like a nice patchset.
On Monday 23 February 2009 10:17:09 Mel Gorman wrote:
> The complexity of the page allocator has been increasing for some time
> and it has now reached the point where the SLUB allocator is doing strange
> tricks to avoid the page allocator. This is obviously bad as it may
> encourage other subsystems to try avoiding the page allocator as well.
>
> This series of patches is intended to reduce the cost of the page
> allocator by doing the following.
>
> Patches 1-3 iron out the entry paths slightly and remove stupid sanity
> checks from the fast path.
>
> Patch 4 uses a lookup table instead of a number of branches to decide what
> zones are usable given the GFP flags.
>
> Patch 5 avoids repeated checks of the zonelist
>
> Patch 6 breaks the allocator up into a fast and slow path where the fast
> path later becomes one long inlined function.
>
> Patches 7-10 avoids calculating the same things repeatedly and instead
> calculates them once.
>
> Patches 11-13 inline the whole allocator fast path
>
> Patch 14 avoids calling get_pageblock_migratetype() potentially twice on
> every page free
>
> Patch 15 reduces the number of times interrupts are disabled by reworking
> what free_page_mlock() does. However, I notice that the cost of calling
> TestClearPageMlocked() is still quite high and I'm guessing it's because
> it's a locked bit operation. It's be nice if it could be established if
> it's safe to use an unlocked version here. Rik, can you comment?
Yes, it can. page flags are owned entirely by the owner of the page.
free_page_mlock shouldn't really be in free_pages_check, but oh well.

> Patch 16 avoids using the zonelist cache on non-NUMA machines>> Patch 17 removes an expensive and excessively paranoid check in the> allocator fast path
I would be careful of removing useful debug checks completely likethis. What is the cost? Obviously non-zero, but it is also a checkI have seen trigger on quite a lot of occasions (due to kernel bugsand hardware bugs, and in each case it is better to warn than not,even if many other situations can go undetected).
One problem is that some of the calls we're making in page_alloc.cdo the compound_head() thing, wheras we know that we only want tolook at this page. I've attached a patch which cuts out about 150bytes of text and several branches from these paths.

> Patch 18 avoids a list search in the allocator fast path.
Ah, this was badly needed :)

> o On many machines, I'm seeing a 0-2% improvement on kernbench. The> dominant cost in kernbench is the compiler and zeroing allocated pages for> pagetables.
zeroing is a factor, but IIRC page faults and page allocator are amongthe top of the profiles.
> o For tbench, I have seen an 8-12% improvement on two x86-64 machines> (elm3b6 on test.kernel.org gained 8%) but generally it was less dramatic on> x86-64 in the range of 0-4%. On one PPC64, the different was also in the> range of 0-4%. Generally there were gains, but one specific ppc64 showed a> regression of 7% for one client but a negligible difference for 8 clients.> It's not clear why this machine regressed and others didn't.
Did you bisect your patchset? It could have been random or pointed toeg the hot/cold removal?
> o hackbench is harder to conclude anything from. Most machines showed>   performance gains in the 5-11% range but one machine in particular showed>   a mix of gains and losses depending on the number of clients. Might be>   a caching thing.>> o One machine in particular was a major surprise for sysbench with gains>   of 4-8% there which was drastically higher than I was expecting. However,>   on other machines, it was in the more reasonable 0-4% range, still pretty>   respectable. It's not guaranteed though. While most machines showed some>   sort of gain, one ppc64 showed no difference at all.>> So, by and large it's an improvement of some sort.

Most of these benchmarks *really* need to be run quite a few times to geta reasonable confidence.
But it sounds pretty positive.--- mm/page_alloc.c |   10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
Index: linux-2.6/mm/page_alloc.c===================================================================--- linux-2.6.orig/mm/page_alloc.c+++ linux-2.6/mm/page_alloc.c@@ -420,7 +420,7 @@ static inline int page_is_buddy(struct p 		return 0;  	if (PageBuddy(buddy) && page_order(buddy) == order) {-		BUG_ON(page_count(buddy) != 0);+		VM_BUG_ON(page_count(buddy) != 0); 		return 1; 	} 	return 0;@@ -493,9 +493,9 @@ static inline void __free_one_page(struc static inline int free_pages_check(struct page *page) { 	free_page_mlock(page);-	if (unlikely(page_mapcount(page) |+	if (unlikely((atomic_read(&page->_mapcount) != -1) | 		(page->mapping != NULL)  |-		(page_count(page) != 0)  |+		(atomic_read(&page->_count) != 0) | 		(page->flags & PAGE_FLAGS_CHECK_AT_FREE))) { 		bad_page(page); 		return 1;@@ -633,9 +633,9 @@ static inline void expand(struct zone *z  */ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) {-	if (unlikely(page_mapcount(page) |+	if (unlikely((atomic_read(&page->_mapcount) != -1) | 		(page->mapping != NULL)  |-		(page_count(page) != 0)  |+		(atomic_read(&page->_count) != 0)  | 		(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) { 		bad_page(page); 		return 1;\0ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
@ 2009-02-23 14:46   ` Nick Piggin
  0 siblings, 0 replies; 190+ messages in thread
From: Nick Piggin @ 2009-02-23 14:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 5493 bytes --]

Hi Mel,

Seems like a nice patchset.

On Monday 23 February 2009 10:17:09 Mel Gorman wrote:
> The complexity of the page allocator has been increasing for some time
> and it has now reached the point where the SLUB allocator is doing strange
> tricks to avoid the page allocator. This is obviously bad as it may
> encourage other subsystems to try avoiding the page allocator as well.
>
> This series of patches is intended to reduce the cost of the page
> allocator by doing the following.
>
> Patches 1-3 iron out the entry paths slightly and remove stupid sanity
> checks from the fast path.
>
> Patch 4 uses a lookup table instead of a number of branches to decide what
> zones are usable given the GFP flags.
>
> Patch 5 avoids repeated checks of the zonelist
>
> Patch 6 breaks the allocator up into a fast and slow path where the fast
> path later becomes one long inlined function.
>
> Patches 7-10 avoids calculating the same things repeatedly and instead
> calculates them once.
>
> Patches 11-13 inline the whole allocator fast path
>
> Patch 14 avoids calling get_pageblock_migratetype() potentially twice on
> every page free
>
> Patch 15 reduces the number of times interrupts are disabled by reworking
> what free_page_mlock() does. However, I notice that the cost of calling
> TestClearPageMlocked() is still quite high and I'm guessing it's because
> it's a locked bit operation. It's be nice if it could be established if
> it's safe to use an unlocked version here. Rik, can you comment?

Yes, it can. page flags are owned entirely by the owner of the page.

free_page_mlock shouldn't really be in free_pages_check, but oh well.
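
For illustration, a minimal sketch of the unlocked variant being suggested.
__TestClearPageMlocked() is assumed here to be the non-atomic test-and-clear
produced by the page-flags macros; this is a sketch, not the actual patch:

static inline void free_page_mlock(struct page *page)
{
	/*
	 * Sketch only: the page is on its way back to the buddy allocator,
	 * so its flags have a single owner and a plain (unlocked)
	 * read-modify-write is sufficient; this avoids the locked bus
	 * cycle implied by TestClearPageMlocked().
	 */
	if (__TestClearPageMlocked(page)) {
		__dec_zone_page_state(page, NR_MLOCK);
		__count_vm_event(UNEVICTABLE_MLOCKFREED);
	}
}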


> Patch 16 avoids using the zonelist cache on non-NUMA machines
>
> Patch 17 removes an expensive and excessively paranoid check in the
> allocator fast path

I would be careful of removing useful debug checks completely like
this. What is the cost? Obviously non-zero, but it is also a check
I have seen trigger on quite a lot of occasions (due to kernel bugs
and hardware bugs, and in each case it is better to warn than not,
even if many other situations can go undetected).

One problem is that some of the calls we're making in page_alloc.c
do the compound_head() thing, whereas we know that we only want to
look at this page. I've attached a patch which cuts out about 150
bytes of text and several branches from these paths.


> Patch 18 avoids a list search in the allocator fast path.

Ah, this was badly needed :)


> o On many machines, I'm seeing a 0-2% improvement on kernbench. The
> dominant cost in kernbench is the compiler and zeroing allocated pages for
> pagetables.

zeroing is a factor, but IIRC page faults and page allocator are among
the top of the profiles.

> o For tbench, I have seen an 8-12% improvement on two x86-64 machines
> (elm3b6 on test.kernel.org gained 8%) but generally it was less dramatic on
> x86-64 in the range of 0-4%. On one PPC64, the different was also in the
> range of 0-4%. Generally there were gains, but one specific ppc64 showed a
> regression of 7% for one client but a negligible difference for 8 clients.
> It's not clear why this machine regressed and others didn't.

Did you bisect your patchset? It could have been random or pointed to
eg the hot/cold removal?

> o hackbench is harder to conclude anything from. Most machines showed
>   performance gains in the 5-11% range but one machine in particular showed
>   a mix of gains and losses depending on the number of clients. Might be
>   a caching thing.
>
> o One machine in particular was a major surprise for sysbench with gains
>   of 4-8% there which was drastically higher than I was expecting. However,
>   on other machines, it was in the more reasonable 0-4% range, still pretty
>   respectable. It's not guaranteed though. While most machines showed some
>   sort of gain, one ppc64 showed no difference at all.
>
> So, by and large it's an improvement of some sort.


Most of these benchmarks *really* need to be run quite a few times to get
a reasonable confidence.

But it sounds pretty positive.
---
 mm/page_alloc.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -420,7 +420,7 @@ static inline int page_is_buddy(struct p
 		return 0;
 
 	if (PageBuddy(buddy) && page_order(buddy) == order) {
-		BUG_ON(page_count(buddy) != 0);
+		VM_BUG_ON(page_count(buddy) != 0);
 		return 1;
 	}
 	return 0;
@@ -493,9 +493,9 @@ static inline void __free_one_page(struc
 static inline int free_pages_check(struct page *page)
 {
 	free_page_mlock(page);
-	if (unlikely(page_mapcount(page) |
+	if (unlikely((atomic_read(&page->_mapcount) != -1) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(atomic_read(&page->_count) != 0) |
 		(page->flags & PAGE_FLAGS_CHECK_AT_FREE))) {
 		bad_page(page);
 		return 1;
@@ -633,9 +633,9 @@ static inline void expand(struct zone *z
  */
 static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 {
-	if (unlikely(page_mapcount(page) |
+	if (unlikely((atomic_read(&page->_mapcount) != -1) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(atomic_read(&page->_count) != 0)  |
 		(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
 		bad_page(page);
 		return 1;

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
  2009-02-23 14:46   ` Nick Piggin
@ 2009-02-23 15:00     ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 15:00 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Tue, Feb 24, 2009 at 01:46:01AM +1100, Nick Piggin wrote:
> Hi Mel,
> 
> Seems like a nice patchset.
> 

Thanks

> On Monday 23 February 2009 10:17:09 Mel Gorman wrote:
> > The complexity of the page allocator has been increasing for some time
> > and it has now reached the point where the SLUB allocator is doing strange
> > tricks to avoid the page allocator. This is obviously bad as it may
> > encourage other subsystems to try avoiding the page allocator as well.
> >
> > This series of patches is intended to reduce the cost of the page
> > allocator by doing the following.
> >
> > Patches 1-3 iron out the entry paths slightly and remove stupid sanity
> > checks from the fast path.
> >
> > Patch 4 uses a lookup table instead of a number of branches to decide what
> > zones are usable given the GFP flags.
> >
> > Patch 5 avoids repeated checks of the zonelist
> >
> > Patch 6 breaks the allocator up into a fast and slow path where the fast
> > path later becomes one long inlined function.
> >
> > Patches 7-10 avoids calculating the same things repeatedly and instead
> > calculates them once.
> >
> > Patches 11-13 inline the whole allocator fast path
> >
> > Patch 14 avoids calling get_pageblock_migratetype() potentially twice on
> > every page free
> >
> > Patch 15 reduces the number of times interrupts are disabled by reworking
> > what free_page_mlock() does. However, I notice that the cost of calling
> > TestClearPageMlocked() is still quite high and I'm guessing it's because
> > it's a locked bit operation. It's be nice if it could be established if
> > it's safe to use an unlocked version here. Rik, can you comment?
> 
> Yes, it can. page flags are owned entirely by the owner of the page.
> 

I figured that was the case but hadn't convinced myself 100%. I wanted a
second opinion but I'm sure it's safe now.

> free_page_mlock shouldn't really be in free_pages_check, but oh well.
> 

Agreed, I took it out of there. The name alone implies it's debugging
that could be optionally disabled if you really had to.

> 
> > Patch 16 avoids using the zonelist cache on non-NUMA machines
> >
> > Patch 17 removes an expensive and excessively paranoid check in the
> > allocator fast path
> 
> I would be careful of removing useful debug checks completely like
> this. What is the cost? Obviously non-zero, but it is also a check

The cost was something like 1/10th the cost of the path. There are atomic
operations in there that are causing the problems.

> I have seen trigger on quite a lot of occasions (due to kernel bugs
> and hardware bugs, and in each case it is better to warn than not,
> even if many other situations can go undetected).
> 

Have you really seen it trigger for the allocation path or did it
trigger in the free path? Essentially we are making the same check on
every allocation and free, which is why I considered it excessively
paranoid.

> One problem is that some of the calls we're making in page_alloc.c
> do the compound_head() thing, wheras we know that we only want to
> look at this page. I've attached a patch which cuts out about 150
> bytes of text and several branches from these paths.
> 

Nice, I should have spotted that. I'm going to fold this into the series
if that is ok with you? I'll replace patch 17 with it and see whether it
still shows up on profiles.

> 
> > Patch 18 avoids a list search in the allocator fast path.
> 
> Ah, this was badly needed :)
> 
> 
> > o On many machines, I'm seeing a 0-2% improvement on kernbench. The
> > dominant cost in kernbench is the compiler and zeroing allocated pages for
> > pagetables.
> 
> zeroing is a factor, but IIRC page faults and page allocator are among
> the top of the profiles.
> 

kernbench is also very fork heavy. That means lots of pagetable
allocations with lots of zeroing. I tried various ways of reducing the
zeroing cost, including having exiting processes zero the pages as they
are freed, but I couldn't make it go any faster.

> > o For tbench, I have seen an 8-12% improvement on two x86-64 machines
> > (elm3b6 on test.kernel.org gained 8%) but generally it was less dramatic on
> > x86-64 in the range of 0-4%. On one PPC64, the different was also in the
> > range of 0-4%. Generally there were gains, but one specific ppc64 showed a
> > regression of 7% for one client but a negligible difference for 8 clients.
> > It's not clear why this machine regressed and others didn't.
> 
> Did you bisect your patchset? It could have been random or pointed to
> eg the hot/cold removal?
> 

I didn't bisect, but I probably should to see whether this can be pinned down. I
should run one kernel for each patch to see what exactly is helping.
When I was writing the patches, I was just running kernbench and reading
profiles.

> > o hackbench is harder to conclude anything from. Most machines showed
> >   performance gains in the 5-11% range but one machine in particular showed
> >   a mix of gains and losses depending on the number of clients. Might be
> >   a caching thing.
> >
> > o One machine in particular was a major surprise for sysbench with gains
> >   of 4-8% there which was drastically higher than I was expecting. However,
> >   on other machines, it was in the more reasonable 0-4% range, still pretty
> >   respectable. It's not guaranteed though. While most machines showed some
> >   sort of gain, one ppc64 showed no difference at all.
> >
> > So, by and large it's an improvement of some sort.
> 
> Most of these benchmarks *really* need to be run quite a few times to get
> a reasonable confidence.
> 

Most are run repeatedly and an average taken but I should double check
what is going on. It's irritating that gains/regressions are
inconsistent between different machine types but that is nothing new.

> But it sounds pretty positive.
> ---
>  mm/page_alloc.c |   10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -420,7 +420,7 @@ static inline int page_is_buddy(struct p
>  		return 0;
>  
>  	if (PageBuddy(buddy) && page_order(buddy) == order) {
> -		BUG_ON(page_count(buddy) != 0);
> +		VM_BUG_ON(page_count(buddy) != 0);
>  		return 1;
>  	}
>  	return 0;
> @@ -493,9 +493,9 @@ static inline void __free_one_page(struc
>  static inline int free_pages_check(struct page *page)
>  {
>  	free_page_mlock(page);
> -	if (unlikely(page_mapcount(page) |
> +	if (unlikely((atomic_read(&page->_mapcount) != -1) |
>  		(page->mapping != NULL)  |
> -		(page_count(page) != 0)  |
> +		(atomic_read(&page->_count) != 0) |
>  		(page->flags & PAGE_FLAGS_CHECK_AT_FREE))) {
>  		bad_page(page);
>  		return 1;
> @@ -633,9 +633,9 @@ static inline void expand(struct zone *z
>   */
>  static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
>  {
> -	if (unlikely(page_mapcount(page) |
> +	if (unlikely((atomic_read(&page->_mapcount) != -1) |
>  		(page->mapping != NULL)  |
> -		(page_count(page) != 0)  |
> +		(atomic_read(&page->_count) != 0)  |
>  		(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
>  		bad_page(page);
>  		return 1;
> 
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
@ 2009-02-23 15:00     ` Mel Gorman
  0 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 15:00 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Tue, Feb 24, 2009 at 01:46:01AM +1100, Nick Piggin wrote:
> Hi Mel,
> 
> Seems like a nice patchset.
> 

Thanks

> On Monday 23 February 2009 10:17:09 Mel Gorman wrote:
> > The complexity of the page allocator has been increasing for some time
> > and it has now reached the point where the SLUB allocator is doing strange
> > tricks to avoid the page allocator. This is obviously bad as it may
> > encourage other subsystems to try avoiding the page allocator as well.
> >
> > This series of patches is intended to reduce the cost of the page
> > allocator by doing the following.
> >
> > Patches 1-3 iron out the entry paths slightly and remove stupid sanity
> > checks from the fast path.
> >
> > Patch 4 uses a lookup table instead of a number of branches to decide what
> > zones are usable given the GFP flags.
> >
> > Patch 5 avoids repeated checks of the zonelist
> >
> > Patch 6 breaks the allocator up into a fast and slow path where the fast
> > path later becomes one long inlined function.
> >
> > Patches 7-10 avoids calculating the same things repeatedly and instead
> > calculates them once.
> >
> > Patches 11-13 inline the whole allocator fast path
> >
> > Patch 14 avoids calling get_pageblock_migratetype() potentially twice on
> > every page free
> >
> > Patch 15 reduces the number of times interrupts are disabled by reworking
> > what free_page_mlock() does. However, I notice that the cost of calling
> > TestClearPageMlocked() is still quite high and I'm guessing it's because
> > it's a locked bit operation. It's be nice if it could be established if
> > it's safe to use an unlocked version here. Rik, can you comment?
> 
> Yes, it can. page flags are owned entirely by the owner of the page.
> 

I figured that was the case but hadn't convinced myself 100%. I wanted a
second opinion but I'm sure it's safe now.

> free_page_mlock shouldn't really be in free_pages_check, but oh well.
> 

Agreed, I took it out of there. The name alone implies it's debugging
that could be optionally disabled if you really had to.

> 
> > Patch 16 avoids using the zonelist cache on non-NUMA machines
> >
> > Patch 17 removes an expensive and excessively paranoid check in the
> > allocator fast path
> 
> I would be careful of removing useful debug checks completely like
> this. What is the cost? Obviously non-zero, but it is also a check

The cost was something like 1/10th the cost of the path. There are atomic
operations in there that are causing the problems.

> I have seen trigger on quite a lot of occasions (due to kernel bugs
> and hardware bugs, and in each case it is better to warn than not,
> even if many other situations can go undetected).
> 

Have you really seen it trigger for the allocation path or did it
trigger in the free path? Essentially we are making the same check on
every allocation and free, which is why I considered it excessively
paranoid.

> One problem is that some of the calls we're making in page_alloc.c
> do the compound_head() thing, wheras we know that we only want to
> look at this page. I've attached a patch which cuts out about 150
> bytes of text and several branches from these paths.
> 

Nice, I should have spotted that. I'm going to fold this into the series
if that is ok with you? I'll replace patch 17 with it and see whether it
still shows up on profiles.

> 
> > Patch 18 avoids a list search in the allocator fast path.
> 
> Ah, this was badly needed :)
> 
> 
> > o On many machines, I'm seeing a 0-2% improvement on kernbench. The
> > dominant cost in kernbench is the compiler and zeroing allocated pages for
> > pagetables.
> 
> zeroing is a factor, but IIRC page faults and page allocator are among
> the top of the profiles.
> 

kernbench is also very fork heavy. That means lots of pagetable
allocations with lots of zeroing. I tried various ways of reducing the
zeroing cost, including having exiting processes zero the pages as they
are freed, but I couldn't make it go any faster.

> > o For tbench, I have seen an 8-12% improvement on two x86-64 machines
> > (elm3b6 on test.kernel.org gained 8%) but generally it was less dramatic on
> > x86-64 in the range of 0-4%. On one PPC64, the different was also in the
> > range of 0-4%. Generally there were gains, but one specific ppc64 showed a
> > regression of 7% for one client but a negligible difference for 8 clients.
> > It's not clear why this machine regressed and others didn't.
> 
> Did you bisect your patchset? It could have been random or pointed to
> eg the hot/cold removal?
> 

I didn't bisect, but I probably should to see whether this can be pinned down. I
should run one kernel for each patch to see what exactly is helping.
When I was writing the patches, I was just running kernbench and reading
profiles.

> > o hackbench is harder to conclude anything from. Most machines showed
> >   performance gains in the 5-11% range but one machine in particular showed
> >   a mix of gains and losses depending on the number of clients. Might be
> >   a caching thing.
> >
> > o One machine in particular was a major surprise for sysbench with gains
> >   of 4-8% there which was drastically higher than I was expecting. However,
> >   on other machines, it was in the more reasonable 0-4% range, still pretty
> >   respectable. It's not guaranteed though. While most machines showed some
> >   sort of gain, one ppc64 showed no difference at all.
> >
> > So, by and large it's an improvement of some sort.
> 
> Most of these benchmarks *really* need to be run quite a few times to get
> a reasonable confidence.
> 

Most are run repeatedly and an average taken but I should double check
what is going on. It's irritating that gains/regressions are
inconsistent between different machine types but that is nothing new.

> But it sounds pretty positive.
> ---
>  mm/page_alloc.c |   10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -420,7 +420,7 @@ static inline int page_is_buddy(struct p
>  		return 0;
>  
>  	if (PageBuddy(buddy) && page_order(buddy) == order) {
> -		BUG_ON(page_count(buddy) != 0);
> +		VM_BUG_ON(page_count(buddy) != 0);
>  		return 1;
>  	}
>  	return 0;
> @@ -493,9 +493,9 @@ static inline void __free_one_page(struc
>  static inline int free_pages_check(struct page *page)
>  {
>  	free_page_mlock(page);
> -	if (unlikely(page_mapcount(page) |
> +	if (unlikely((atomic_read(&page->_mapcount) != -1) |
>  		(page->mapping != NULL)  |
> -		(page_count(page) != 0)  |
> +		(atomic_read(&page->_count) != 0) |
>  		(page->flags & PAGE_FLAGS_CHECK_AT_FREE))) {
>  		bad_page(page);
>  		return 1;
> @@ -633,9 +633,9 @@ static inline void expand(struct zone *z
>   */
>  static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
>  {
> -	if (unlikely(page_mapcount(page) |
> +	if (unlikely((atomic_read(&page->_mapcount) != -1) |
>  		(page->mapping != NULL)  |
> -		(page_count(page) != 0)  |
> +		(atomic_read(&page->_count) != 0)  |
>  		(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
>  		bad_page(page);
>  		return 1;

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 03/20] Do not check NUMA node ID when the caller knows the node is valid
  2009-02-22 23:17   ` Mel Gorman
@ 2009-02-23 15:01     ` Christoph Lameter
  -1 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-02-23 15:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Sun, 22 Feb 2009, Mel Gorman wrote:

> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 75f49d3..6566c9e 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1318,11 +1318,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	for (i = 0; i < area->nr_pages; i++) {
>  		struct page *page;
>
> -		if (node < 0)
> -			page = alloc_page(gfp_mask);
> -		else
> -			page = alloc_pages_node(node, gfp_mask, 0);
> -
> +		page = alloc_pages_node(node, gfp_mask, 0);
>  		if (unlikely(!page)) {
>  			/* Successfully allocated i pages, free them in __vunmap() */
>  			area->nr_pages = i;
>

That won't work. alloc_pages() obeys memory policies. alloc_pages_node()
does not.
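
In other words, the node < 0 case is not redundant: it is the path that
applies the caller's mempolicy. A sketch of keeping both behaviours behind
one helper (the helper name is made up for illustration, not from the series):

static inline struct page *alloc_page_on_node(int node, gfp_t gfp_mask)
{
	/*
	 * Sketch only: alloc_page() applies the task's NUMA memory policy
	 * (interleave, bind, preferred) on NUMA kernels, while
	 * alloc_pages_node() always uses the given node's zonelist and
	 * ignores the policy.
	 */
	if (node < 0)
		return alloc_page(gfp_mask);
	return alloc_pages_node(node, gfp_mask, 0);
}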



^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 03/20] Do not check NUMA node ID when the caller knows the node is valid
@ 2009-02-23 15:01     ` Christoph Lameter
  0 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-02-23 15:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Sun, 22 Feb 2009, Mel Gorman wrote:

> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 75f49d3..6566c9e 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1318,11 +1318,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	for (i = 0; i < area->nr_pages; i++) {
>  		struct page *page;
>
> -		if (node < 0)
> -			page = alloc_page(gfp_mask);
> -		else
> -			page = alloc_pages_node(node, gfp_mask, 0);
> -
> +		page = alloc_pages_node(node, gfp_mask, 0);
>  		if (unlikely(!page)) {
>  			/* Successfully allocated i pages, free them in __vunmap() */
>  			area->nr_pages = i;
>

That won't work. alloc_pages() obeys memory policies. alloc_pages_node()
does not.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
  2009-02-23 15:00     ` Mel Gorman
@ 2009-02-23 15:22       ` Nick Piggin
  -1 siblings, 0 replies; 190+ messages in thread
From: Nick Piggin @ 2009-02-23 15:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Tuesday 24 February 2009 02:00:56 Mel Gorman wrote:
> On Tue, Feb 24, 2009 at 01:46:01AM +1100, Nick Piggin wrote:

> > free_page_mlock shouldn't really be in free_pages_check, but oh well.
>
> Agreed, I took it out of there.

Oh good. I didn't notice that.


> > > Patch 16 avoids using the zonelist cache on non-NUMA machines
> > >
> > > Patch 17 removes an expensive and excessively paranoid check in the
> > > allocator fast path
> >
> > I would be careful of removing useful debug checks completely like
> > this. What is the cost? Obviously non-zero, but it is also a check
>
> The cost was something like 1/10th the cost of the path. There are atomic
> operations in there that are causing the problems.

The only atomic memory operations in there should be atomic loads of
word or atomic_t sized and aligned locations, which should just be
normal loads on any architecture?
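
For reference, atomic_read() on the common architectures of this era reduces
to an ordinary load of the counter field, so the remaining checks cost a load
and a compare rather than a locked operation. Roughly:

/* approximately what atomic_read() expands to, e.g. on x86 */
#define atomic_read(v)	((v)->counter)

/* so a check such as this compiles to a plain load plus compare */
if (atomic_read(&page->_count) != 0)
	bad_page(page);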

The only atomic RMW you might see in that function would come from
free_page_mlock (which you moved out of there, and anyway can be
made non-atomic).

I'd like you to just reevaluate it after your patchset, after the
patch to make mlock non-atomic, and my patch I just sent.


> > I have seen trigger on quite a lot of occasions (due to kernel bugs
> > and hardware bugs, and in each case it is better to warn than not,
> > even if many other situations can go undetected).
>
> Have you really seen it trigger for the allocation path or did it
> trigger in teh free path? Essentially we are making the same check on
> every allocation and free which is why I considered it excessivly
> paranoid.

Yes I've seen it trigger in the allocation path. Kernel memory scribbles
or RAM errors.


> > One problem is that some of the calls we're making in page_alloc.c
> > do the compound_head() thing, wheras we know that we only want to
> > look at this page. I've attached a patch which cuts out about 150
> > bytes of text and several branches from these paths.
>
> Nice, I should have spotted that. I'm going to fold this into the series
> if that is ok with you? I'll replace patch 17 with it and see does it
> still show up on profiles.

Great! Sure fold it in (and put SOB: me on there if you like).


> > > So, by and large it's an improvement of some sort.
> >
> > Most of these benchmarks *really* need to be run quite a few times to get
> > a reasonable confidence.
>
> Most are run repeatedly and an average taken but I should double check
> what is going on. It's irritating that gains/regressions are
> inconsistent between different machine types but that is nothing new.

Yeah. Cache behaviour maybe. One thing you might try is to size the struct
page out to 64 bytes if it isn't already. This could bring down any skews
if one kernel is lucky to get a nice packing of pages, or another is unlucky
to get lots of struct pages spread over 2 cachelines. Maybe I'm just
thinking wishfully :)

I think with many of your changes, common sense will tell us that it is a
better code sequence. Sometimes it's just impossible to really get
"scientific proof" :)


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
@ 2009-02-23 15:22       ` Nick Piggin
  0 siblings, 0 replies; 190+ messages in thread
From: Nick Piggin @ 2009-02-23 15:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Tuesday 24 February 2009 02:00:56 Mel Gorman wrote:
> On Tue, Feb 24, 2009 at 01:46:01AM +1100, Nick Piggin wrote:

> > free_page_mlock shouldn't really be in free_pages_check, but oh well.
>
> Agreed, I took it out of there.

Oh good. I didn't notice that.


> > > Patch 16 avoids using the zonelist cache on non-NUMA machines
> > >
> > > Patch 17 removes an expensive and excessively paranoid check in the
> > > allocator fast path
> >
> > I would be careful of removing useful debug checks completely like
> > this. What is the cost? Obviously non-zero, but it is also a check
>
> The cost was something like 1/10th the cost of the path. There are atomic
> operations in there that are causing the problems.

The only atomic memory operations in there should be atomic loads of
word or atomic_t sized and aligned locations, which should just be
normal loads on any architecture?

The only atomic RMW you might see in that function would come from
free_page_mlock (which you moved out of there, and anyway can be
made non-atomic).

I'd like you to just reevaluate it after your patchset, after the
patch to make mlock non-atomic, and my patch I just sent.


> > I have seen trigger on quite a lot of occasions (due to kernel bugs
> > and hardware bugs, and in each case it is better to warn than not,
> > even if many other situations can go undetected).
>
> Have you really seen it trigger for the allocation path or did it
> trigger in teh free path? Essentially we are making the same check on
> every allocation and free which is why I considered it excessivly
> paranoid.

Yes I've seen it trigger in the allocation path. Kernel memory scribbles
or RAM errors.


> > One problem is that some of the calls we're making in page_alloc.c
> > do the compound_head() thing, wheras we know that we only want to
> > look at this page. I've attached a patch which cuts out about 150
> > bytes of text and several branches from these paths.
>
> Nice, I should have spotted that. I'm going to fold this into the series
> if that is ok with you? I'll replace patch 17 with it and see does it
> still show up on profiles.

Great! Sure fold it in (and put SOB: me on there if you like).


> > > So, by and large it's an improvement of some sort.
> >
> > Most of these benchmarks *really* need to be run quite a few times to get
> > a reasonable confidence.
>
> Most are run repeatedly and an average taken but I should double check
> what is going on. It's irritating that gains/regressions are
> inconsistent between different machine types but that is nothing new.

Yeah. Cache behaviour maybe. One thing you might try is to size the struct
page out to 64 bytes if it isn't already. This could bring down any skews
if one kernel is lucky to get a nice packing of pages, or another is unlucky
to get lots of struct pages spread over 2 cachelines. Maybe I'm just
thinking wishfully :)

I think with many of your changes, common sense will tell us that it is a
better code sequence. Sometimes it's just impossible to really get
"scientific proof" :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 04/20] Convert gfp_zone() to use a table of precalculated values
  2009-02-22 23:17   ` Mel Gorman
@ 2009-02-23 15:23     ` Christoph Lameter
  -1 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-02-23 15:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Sun, 22 Feb 2009, Mel Gorman wrote:

> Every page allocation uses gfp_zone() to calcuate what the highest zone
> allowed by a combination of GFP flags is. This is a large number of branches
> to have in a fast path. This patch replaces the branches with a lookup
> table that is calculated at boot-time and stored in the read-mostly section
> so it can be shared. This requires __GFP_MOVABLE to be redefined but it's
> debatable as to whether it should be considered a zone modifier or not.

Are you sure that this is a benefit? Jumps are forward and pretty short
and the compiler is optimizing a branch away in the current code.


0xffffffff8027bde8 <try_to_free_pages+95>:      mov    %esi,-0x58(%rbp)
0xffffffff8027bdeb <try_to_free_pages+98>:      movq   $0xffffffff80278cd0,-0x48(%rbp)
0xffffffff8027bdf3 <try_to_free_pages+106>:     test   $0x1,%r8b
0xffffffff8027bdf7 <try_to_free_pages+110>:     mov    0x620(%rax),%rax
0xffffffff8027bdfe <try_to_free_pages+117>:     mov    %rax,-0x88(%rbp)
0xffffffff8027be05 <try_to_free_pages+124>:     jne    0xffffffff8027be2c <try_to_free_pages+163>
0xffffffff8027be07 <try_to_free_pages+126>:     mov    $0x1,%r14d
0xffffffff8027be0d <try_to_free_pages+132>:     test   $0x4,%r8b
0xffffffff8027be11 <try_to_free_pages+136>:     jne    0xffffffff8027be2c <try_to_free_pages+163>
0xffffffff8027be13 <try_to_free_pages+138>:     xor    %r14d,%r14d
0xffffffff8027be16 <try_to_free_pages+141>:     and    $0x100002,%r8d
0xffffffff8027be1d <try_to_free_pages+148>:     cmp    $0x100002,%r8d
0xffffffff8027be24 <try_to_free_pages+155>:     sete   %r14b
0xffffffff8027be28 <try_to_free_pages+159>:     add    $0x2,%r14d
0xffffffff8027be2c <try_to_free_pages+163>:     mov    %gs:0x8,%rdx


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 04/20] Convert gfp_zone() to use a table of precalculated values
@ 2009-02-23 15:23     ` Christoph Lameter
  0 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-02-23 15:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Sun, 22 Feb 2009, Mel Gorman wrote:

> Every page allocation uses gfp_zone() to calcuate what the highest zone
> allowed by a combination of GFP flags is. This is a large number of branches
> to have in a fast path. This patch replaces the branches with a lookup
> table that is calculated at boot-time and stored in the read-mostly section
> so it can be shared. This requires __GFP_MOVABLE to be redefined but it's
> debatable as to whether it should be considered a zone modifier or not.

Are you sure that this is a benefit? Jumps are forward and pretty short
and the compiler is optimizing a branch away in the current code.


0xffffffff8027bde8 <try_to_free_pages+95>:      mov    %esi,-0x58(%rbp)
0xffffffff8027bdeb <try_to_free_pages+98>:      movq   $0xffffffff80278cd0,-0x48(%rbp)
0xffffffff8027bdf3 <try_to_free_pages+106>:     test   $0x1,%r8b
0xffffffff8027bdf7 <try_to_free_pages+110>:     mov    0x620(%rax),%rax
0xffffffff8027bdfe <try_to_free_pages+117>:     mov    %rax,-0x88(%rbp)
0xffffffff8027be05 <try_to_free_pages+124>:     jne    0xffffffff8027be2c <try_to_free_pages+163>
0xffffffff8027be07 <try_to_free_pages+126>:     mov    $0x1,%r14d
0xffffffff8027be0d <try_to_free_pages+132>:     test   $0x4,%r8b
0xffffffff8027be11 <try_to_free_pages+136>:     jne    0xffffffff8027be2c <try_to_free_pages+163>
0xffffffff8027be13 <try_to_free_pages+138>:     xor    %r14d,%r14d
0xffffffff8027be16 <try_to_free_pages+141>:     and    $0x100002,%r8d
0xffffffff8027be1d <try_to_free_pages+148>:     cmp    $0x100002,%r8d
0xffffffff8027be24 <try_to_free_pages+155>:     sete   %r14b
0xffffffff8027be28 <try_to_free_pages+159>:     add    $0x2,%r14d
0xffffffff8027be2c <try_to_free_pages+163>:     mov    %gs:0x8,%rdx

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 11/20] Inline get_page_from_freelist() in the fast-path
  2009-02-22 23:17   ` Mel Gorman
@ 2009-02-23 15:32     ` Nick Piggin
  -1 siblings, 0 replies; 190+ messages in thread
From: Nick Piggin @ 2009-02-23 15:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Monday 23 February 2009 10:17:20 Mel Gorman wrote:
> In the best-case scenario, use an inlined version of
> get_page_from_freelist(). This increases the size of the text but avoids
> time spent pushing arguments onto the stack.

I'm quite fond of inlining ;) But it can increase register pressure as
well as icache footprint. x86-64 isn't spilling a lot more
registers to stack after these changes, is it?

Also,


> @@ -1780,8 +1791,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int
> order, if (!preferred_zone)
>  		return NULL;
>
> -	/* First allocation attempt */
> -	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> +	/* First allocation attempt. Fastpath uses inlined version */
> +	page = __get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
>  			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
>  			preferred_zone, migratetype);
>  	if (unlikely(!page))

I think in a common case where there is background reclaim going on,
it will be quite common to fail this, won't it? (I haven't run
statistics though).

In which case you will get extra icache footprint. What speedup does
it give in the cache-hot microbenchmark case?
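
For context, the split being questioned follows the usual shape of an inlined
fast attempt backed by an out-of-line slow path. A rough sketch with trimmed
argument lists and made-up names, not the actual patch:

/* cold path kept out of line so the hot caller stays small */
static struct page *__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order);

static inline struct page *alloc_pages_fastpath(gfp_t gfp_mask,
						unsigned int order)
{
	struct page *page;

	/* single, watermark-low attempt inlined into the caller */
	page = __get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order);
	if (likely(page))
		return page;

	/* wake kswapd, relax watermarks, reclaim, etc. */
	return __alloc_pages_slowpath(gfp_mask, order);
}

The trade-off raised above is that if the inlined first attempt fails often,
for example while background reclaim is keeping the zone below the low
watermark, the extra inlined text is icache cost with no saving.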


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 11/20] Inline get_page_from_freelist() in the fast-path
@ 2009-02-23 15:32     ` Nick Piggin
  0 siblings, 0 replies; 190+ messages in thread
From: Nick Piggin @ 2009-02-23 15:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Monday 23 February 2009 10:17:20 Mel Gorman wrote:
> In the best-case scenario, use an inlined version of
> get_page_from_freelist(). This increases the size of the text but avoids
> time spent pushing arguments onto the stack.

I'm quite fond of inlining ;) But it can increase register pressure as
well as icache footprint. x86-64 isn't spilling a lot more
registers to stack after these changes, is it?

Also,


> @@ -1780,8 +1791,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int
> order, if (!preferred_zone)
>  		return NULL;
>
> -	/* First allocation attempt */
> -	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> +	/* First allocation attempt. Fastpath uses inlined version */
> +	page = __get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
>  			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
>  			preferred_zone, migratetype);
>  	if (unlikely(!page))

I think in a common case where there is background reclaim going on,
it will be quite common to fail this, won't it? (I haven't run
statistics though).

In which case you will get extra icache footprint. What speedup does
it give in the cache-hot microbenchmark case?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocato
  2009-02-22 23:57   ` Andi Kleen
@ 2009-02-23 15:34     ` Christoph Lameter
  -1 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-02-23 15:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mel Gorman, Linux Memory Management List, Pekka Enberg,
	Rik van Riel, KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, 23 Feb 2009, Andi Kleen wrote:

> > Counters are surprising expensive, we spent a good chuck of our time in
> > functions like __dec_zone_page_state and __dec_zone_state. In a profiled
> > run of kernbench, the time spent in __dec_zone_state was roughly equal to
> > the combined cost of the rest of the page free path. A quick check showed
> > that almost half of the time in that function is spent on line 233 alone
> > which for me is;
> >
> > 	(*p)--;
> >
> > That's worth a separate investigation but it might be a case that
> > manipulating int8_t on the machine I was using for profiling is unusually
> > expensive.
>
> What machine was that?
>
> In general I wouldn't expect even on a system with slow char
> operations to be that expensive. It sounds more like a cache miss or a
> cache line bounce. You could possibly confirm by using appropiate
> performance counters.

I have seen similar things occur with some processors. 16 bit or 8 bit
arithmetic can be a problem.

> > Converting this to an int might be faster but the increased
> > memory consumption and cache footprint might be a problem. Opinions?
>
> One possibility would be to move the zone statistics to allocated
> per cpu data. Or perhaps just stop counting per zone at all and
> only count per cpu.

Statistics are in a structure allocated and dedicated for a certain cpu.
It cannot be per cpu data as long as the per cpu allocator has not been
merged. Cache line footprint is reduced with the per cpu allocator.
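
For reference, the hot path being profiled has roughly this shape; the
structure and helper names below are approximations of the mm/vmstat.c code
of the time, not a verbatim copy:

struct per_cpu_zonestat {			/* approximates per_cpu_pageset */
	s8 stat_threshold;
	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];	/* 8-bit per-cpu deltas */
};

void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
{
	struct per_cpu_zonestat *pcp = this_cpu_zonestat(zone);	/* assumed helper */
	s8 *p = pcp->vm_stat_diff + item;

	(*p)--;					/* the expensive 8-bit decrement */
	if (unlikely(*p < -pcp->stat_threshold)) {
		/* fold the local delta into the global zone counter */
		zone_page_state_add(*p, zone, item);
		*p = 0;
	}
}

Widening vm_stat_diff from s8 to int, as floated in the quoted text, would
avoid any 8-bit arithmetic penalty at the cost of roughly four times the
per-cpu counter footprint.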

> > So, by and large it's an improvement of some sort.
>
> That seems like an understatement.

Ack. There is certainly more work to be done on the page allocator. Looks
like a good start to me though.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocato
@ 2009-02-23 15:34     ` Christoph Lameter
  0 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-02-23 15:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mel Gorman, Linux Memory Management List, Pekka Enberg,
	Rik van Riel, KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, 23 Feb 2009, Andi Kleen wrote:

> > Counters are surprising expensive, we spent a good chuck of our time in
> > functions like __dec_zone_page_state and __dec_zone_state. In a profiled
> > run of kernbench, the time spent in __dec_zone_state was roughly equal to
> > the combined cost of the rest of the page free path. A quick check showed
> > that almost half of the time in that function is spent on line 233 alone
> > which for me is;
> >
> > 	(*p)--;
> >
> > That's worth a separate investigation but it might be a case that
> > manipulating int8_t on the machine I was using for profiling is unusually
> > expensive.
>
> What machine was that?
>
> In general I wouldn't expect even on a system with slow char
> operations to be that expensive. It sounds more like a cache miss or a
> cache line bounce. You could possibly confirm by using appropiate
> performance counters.

I have seen similar things occur with some processors. 16 bit or 8 bit
arithmetic can be a problem.

> > Converting this to an int might be faster but the increased
> > memory consumption and cache footprint might be a problem. Opinions?
>
> One possibility would be to move the zone statistics to allocated
> per cpu data. Or perhaps just stop counting per zone at all and
> only count per cpu.

Statistics are in a structure allocated and dedicated for a certain cpu.
It cannot be per cpu data as long as the per cpu allocator has not been
merged. Cache line footprint is reduced with the per cpu allocator.

> > So, by and large it's an improvement of some sort.
>
> That seems like an understatement.

Ack. There is certainly more work to be done on the page allocator. Looks
like a good start to me though.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 04/20] Convert gfp_zone() to use a table of precalculated values
  2009-02-23 15:23     ` Christoph Lameter
@ 2009-02-23 15:41       ` Nick Piggin
  -1 siblings, 0 replies; 190+ messages in thread
From: Nick Piggin @ 2009-02-23 15:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Linux Memory Management List, Pekka Enberg,
	Rik van Riel, KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Tuesday 24 February 2009 02:23:52 Christoph Lameter wrote:
> On Sun, 22 Feb 2009, Mel Gorman wrote:
> > Every page allocation uses gfp_zone() to calcuate what the highest zone
> > allowed by a combination of GFP flags is. This is a large number of
> > branches to have in a fast path. This patch replaces the branches with a
> > lookup table that is calculated at boot-time and stored in the
> > read-mostly section so it can be shared. This requires __GFP_MOVABLE to
> > be redefined but it's debatable as to whether it should be considered a
> > zone modifier or not.
>
> Are you sure that this is a benefit? Jumps are forward and pretty short
> and the compiler is optimizing a branch away in the current code.

Pretty easy to mispredict there, though, especially as you tend
to get allocations interleaved between kernel and movable (or simply
if the branch predictor is cold there are a lot of branches on x86-64).

I would be interested to know if there is a measured improvement. It
adds an extra dcache line to the footprint, but OTOH the instructions
you quote is more than one icache line, and presumably Mel's code will
be a lot shorter.

>
> 0xffffffff8027bde8 <try_to_free_pages+95>:      mov    %esi,-0x58(%rbp)


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 04/20] Convert gfp_zone() to use a table of precalculated values
@ 2009-02-23 15:41       ` Nick Piggin
  0 siblings, 0 replies; 190+ messages in thread
From: Nick Piggin @ 2009-02-23 15:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Linux Memory Management List, Pekka Enberg,
	Rik van Riel, KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Tuesday 24 February 2009 02:23:52 Christoph Lameter wrote:
> On Sun, 22 Feb 2009, Mel Gorman wrote:
> > Every page allocation uses gfp_zone() to calcuate what the highest zone
> > allowed by a combination of GFP flags is. This is a large number of
> > branches to have in a fast path. This patch replaces the branches with a
> > lookup table that is calculated at boot-time and stored in the
> > read-mostly section so it can be shared. This requires __GFP_MOVABLE to
> > be redefined but it's debatable as to whether it should be considered a
> > zone modifier or not.
>
> Are you sure that this is a benefit? Jumps are forward and pretty short
> and the compiler is optimizing a branch away in the current code.

Pretty easy to mispredict there, though, especially as you tend
to get allocations interleaved between kernel and movable (or simply
if the branch predictor is cold there are a lot of branches on x86-64).

I would be interested to know if there is a measured improvement. It
adds an extra dcache line to the footprint, but OTOH the instructions
you quote is more than one icache line, and presumably Mel's code will
be a lot shorter.

>
> 0xffffffff8027bde8 <try_to_free_pages+95>:      mov    %esi,-0x58(%rbp)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 04/20] Convert gfp_zone() to use a table of precalculated value
  2009-02-23 15:41       ` Nick Piggin
@ 2009-02-23 15:43         ` Christoph Lameter
  -1 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-02-23 15:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mel Gorman, Linux Memory Management List, Pekka Enberg,
	Rik van Riel, KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Tue, 24 Feb 2009, Nick Piggin wrote:

> > Are you sure that this is a benefit? Jumps are forward and pretty short
> > and the compiler is optimizing a branch away in the current code.
>
> Pretty easy to mispredict there, though, especially as you can tend
> to get allocations interleaved between kernel and movable (or simply
> if the branch predictor is cold there are a lot of branches on x86-64).
>
> I would be interested to know if there is a measured improvement. It
> adds an extra dcache line to the footprint, but OTOH the instructions
> you quote is more than one icache line, and presumably Mel's code will
> be a lot shorter.

Maybe we can come up with a version of gfp_zone that has no branches and
no lookup?
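
One shape that would satisfy "no branches and no memory lookup" is to pack
the zone number for each combination of the zone-selecting GFP bits into a
compile-time constant and extract it with a shift and a mask. A sketch,
assuming the four zone bits sit in the low nibble (which is what redefining
__GFP_MOVABLE buys); the macro names here are illustrative, not from the
posted patch:

#define GFP_ZONE_BITS	(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32 | __GFP_MOVABLE)

/* ZONES_SHIFT bits per combination, ORed into one constant at compile time */
#define ZONE_TABLE_ENTRY(gfp, zone) \
	((unsigned long)(zone) << ((gfp) * ZONES_SHIFT))

#define GFP_ZONE_TABLE						\
	( ZONE_TABLE_ENTRY(0, ZONE_NORMAL)			\
	| ZONE_TABLE_ENTRY(__GFP_DMA, ZONE_DMA)			\
	| ZONE_TABLE_ENTRY(__GFP_HIGHMEM, ZONE_HIGHMEM)		\
	| ZONE_TABLE_ENTRY(__GFP_DMA32, ZONE_DMA32)		\
	/* ... remaining combinations elided ... */		)

static inline enum zone_type gfp_zone(gfp_t flags)
{
	int bit = flags & GFP_ZONE_BITS;

	/* a shift and a mask of an immediate: no branch, no table in memory */
	return (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) &
		((1 << ZONES_SHIFT) - 1);
}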


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 04/20] Convert gfp_zone() to use a table of precalculated value
@ 2009-02-23 15:43         ` Christoph Lameter
  0 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-02-23 15:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mel Gorman, Linux Memory Management List, Pekka Enberg,
	Rik van Riel, KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Tue, 24 Feb 2009, Nick Piggin wrote:

> > Are you sure that this is a benefit? Jumps are forward and pretty short
> > and the compiler is optimizing a branch away in the current code.
>
> Pretty easy to mispredict there, though, especially as you can tend
> to get allocations interleaved between kernel and movable (or simply
> if the branch predictor is cold there are a lot of branches on x86-64).
>
> I would be interested to know if there is a measured improvement. It
> adds an extra dcache line to the footprint, but OTOH the instructions
> you quote is more than one icache line, and presumably Mel's code will
> be a lot shorter.

Maybe we can come up with a version of gfp_zone that has no branches and
no lookup?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 03/20] Do not check NUMA node ID when the caller knows the node is valid
  2009-02-23 15:01     ` Christoph Lameter
@ 2009-02-23 16:24       ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 16:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 10:01:35AM -0500, Christoph Lameter wrote:
> On Sun, 22 Feb 2009, Mel Gorman wrote:
> 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 75f49d3..6566c9e 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1318,11 +1318,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >  	for (i = 0; i < area->nr_pages; i++) {
> >  		struct page *page;
> >
> > -		if (node < 0)
> > -			page = alloc_page(gfp_mask);
> > -		else
> > -			page = alloc_pages_node(node, gfp_mask, 0);
> > -
> > +		page = alloc_pages_node(node, gfp_mask, 0);
> >  		if (unlikely(!page)) {
> >  			/* Successfully allocated i pages, free them in __vunmap() */
> >  			area->nr_pages = i;
> >
> 
> That wont work. alloc_pages() obeys memory policies. alloc_pages_node()
> does not.
> 

Correct, I failed to take policies into account. The same comment
applies to the slub modification. I've reverted this part. Good spot.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 04/20] Convert gfp_zone() to use a table of precalculated value
  2009-02-23 16:33       ` Mel Gorman
@ 2009-02-23 16:33         ` Christoph Lameter
  -1 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-02-23 16:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, 23 Feb 2009, Mel Gorman wrote:

> I was concerned with mispredictions here rather than the actual assembly
> and gfp_zone is inlined so it's a lot of branches introduced in a lot of paths.

The amount of speculation that can be done by the processor is pretty
limited to a few instructions. So the impact of a misprediction also
should be minimal. The decoder is likely to have sucked in the following
code anyways.


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 04/20] Convert gfp_zone() to use a table of precalculated values
  2009-02-23 15:23     ` Christoph Lameter
@ 2009-02-23 16:33       ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 16:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 10:23:52AM -0500, Christoph Lameter wrote:
> On Sun, 22 Feb 2009, Mel Gorman wrote:
> 
> > Every page allocation uses gfp_zone() to calculate what the highest zone
> > allowed by a combination of GFP flags is. This is a large number of branches
> > to have in a fast path. This patch replaces the branches with a lookup
> > table that is calculated at boot-time and stored in the read-mostly section
> > so it can be shared. This requires __GFP_MOVABLE to be redefined but it's
> > debatable as to whether it should be considered a zone modifier or not.
> 
> Are you sure that this is a benefit?

I haven't proved it, but my thinking was that this was a crapload of
branches in fast paths that will be mispredicted. I tried profiling the
RETIRED_MISPREDICTED_BRANCH_INSTRUCTIONS event and noticed there were large
numbers in this general area. However, the profile counters for this event
seem to be very difficult to isolate to a specific area of code so it's
difficult to be 100% certain.

> Jumps are forward and pretty short
> and the compiler is optimizing a branch away in the current code.
> 

I was concerned with mispredictions here rather than the actual assembly
and gfp_zone is inlined so it's a lot of branches introduced in a lot of paths.
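
For illustration, a minimal sketch of the table approach the changelog above
describes; the names GFP_ZONE_BITS and gfp_zone_table are invented for this
example and the posted patch may differ in detail. With __GFP_MOVABLE
redefined to 0x08u the zone-related bits sit in the low four bits, so a
16-entry read-mostly table indexed by them is enough:

#define GFP_ZONE_BITS	0x0fu	/* __GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32 | __GFP_MOVABLE */

/* defined in page_alloc.c, filled in once during boot, then only ever read */
extern enum zone_type gfp_zone_table[GFP_ZONE_BITS + 1];

static inline enum zone_type gfp_zone(gfp_t flags)
{
	/* one load from a shared read-mostly cache line, no branches */
	return gfp_zone_table[(__force unsigned int)flags & GFP_ZONE_BITS];
}

The table itself would be filled by walking all 16 bit combinations in an
early init path, which is where the boot-time calculation mentioned in the
changelog would live.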

> 
> 0xffffffff8027bde8 <try_to_free_pages+95>:      mov    %esi,-0x58(%rbp)
> 0xffffffff8027bdeb <try_to_free_pages+98>:      movq   $0xffffffff80278cd0,-0x48(%rbp)
> 0xffffffff8027bdf3 <try_to_free_pages+106>:     test   $0x1,%r8b
> 0xffffffff8027bdf7 <try_to_free_pages+110>:     mov    0x620(%rax),%rax
> 0xffffffff8027bdfe <try_to_free_pages+117>:     mov    %rax,-0x88(%rbp)
> 0xffffffff8027be05 <try_to_free_pages+124>:     jne    0xffffffff8027be2c <try_to_free_pages+163>
> 0xffffffff8027be07 <try_to_free_pages+126>:     mov    $0x1,%r14d
> 0xffffffff8027be0d <try_to_free_pages+132>:     test   $0x4,%r8b
> 0xffffffff8027be11 <try_to_free_pages+136>:     jne    0xffffffff8027be2c <try_to_free_pages+163>
> 0xffffffff8027be13 <try_to_free_pages+138>:     xor    %r14d,%r14d
> 0xffffffff8027be16 <try_to_free_pages+141>:     and    $0x100002,%r8d
> 0xffffffff8027be1d <try_to_free_pages+148>:     cmp    $0x100002,%r8d
> 0xffffffff8027be24 <try_to_free_pages+155>:     sete   %r14b
> 0xffffffff8027be28 <try_to_free_pages+159>:     add    $0x2,%r14d
> 0xffffffff8027be2c <try_to_free_pages+163>:     mov    %gs:0x8,%rdx
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 04/20] Convert gfp_zone() to use a table of precalculated value
  2009-02-23 15:43         ` Christoph Lameter
@ 2009-02-23 16:40           ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 16:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Linux Memory Management List, Pekka Enberg,
	Rik van Riel, KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 10:43:20AM -0500, Christoph Lameter wrote:
> On Tue, 24 Feb 2009, Nick Piggin wrote:
> 
> > > Are you sure that this is a benefit? Jumps are forward and pretty short
> > > and the compiler is optimizing a branch away in the current code.
> >
> > Pretty easy to mispredict there, though, especially as you can tend
> > to get allocations interleaved between kernel and movable (or simply
> > if the branch predictor is cold there are a lot of branches on x86-64).
> >
> > I would be interested to know if there is a measured improvement.

Not in kernbench at least, but that is no surprise. It's a small
percentage of the overall cost. It'll appear in the noise for anything
other than micro-benchmarks.

> > It
> > adds an extra dcache line to the footprint, but OTOH the instructions
> > you quote are more than one icache line, and presumably Mel's code will
> > be a lot shorter.
> 

Yes, it's an index lookup of a shared read-only cache line versus a lot
of code with branches to mispredict. I wasn't happy with the cache line
consumption but it was the first obvious alternative.

> Maybe we can come up with a version of gfp_zone that has no branches and
> no lookup?
> 

Ideally, yes, but I didn't spot any obvious way of figuring it out at
compile time then or now. Suggestions?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 04/20] Convert gfp_zone() to use a table of precalculated value
  2009-02-23 16:40           ` Mel Gorman
@ 2009-02-23 17:03             ` Christoph Lameter
  -1 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-02-23 17:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Linux Memory Management List, Pekka Enberg,
	Rik van Riel, KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, 23 Feb 2009, Mel Gorman wrote:

> > Maybe we can come up with a version of gfp_zone that has no branches and
> > no lookup?
> >
>
> Ideally, yes, but I didn't spot any obvious way of figuring it out at
> compile time then or now. Suggestions?

Can we just mask the relevant bits and then find the highest set bit? With
some rearrangement of gfp flags this may work.
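
Something like the sketch below, perhaps, though only under a hypothetical
renumbering of the zone modifier bits; with the flags as they stand the
highest set bit does not map onto the zone index, so treat this purely as an
illustration of the mask-plus-fls() idea (HYPOTHETICAL_ZONE_MASK is an
invented name):

static inline enum zone_type gfp_zone_nobranch(gfp_t flags)
{
	/* assumes the modifiers were renumbered so the modifier for zone N
	 * is bit N-1, with no modifier at all meaning the default zone 0 */
	unsigned int zone_bits = (__force unsigned int)flags & HYPOTHETICAL_ZONE_MASK;

	/* fls() (<linux/bitops.h>) returns the 1-based index of the highest
	 * set bit, or 0 when no bit is set */
	return fls(zone_bits);
}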


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 04/20] Convert gfp_zone() to use a table of precalculated value
  2009-02-23 16:33         ` Christoph Lameter
@ 2009-02-23 17:41           ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 17:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 11:33:00AM -0500, Christoph Lameter wrote:
> On Mon, 23 Feb 2009, Mel Gorman wrote:
> 
> > I was concerned with mispredictions here rather than the actual assembly
> > and gfp_zone is inlined so it's a lot of branches introduced in a lot of paths.
> 
> The amount of speculation that can be done by the processor is pretty
> limited to a few instructions. So the impact of a misprediction also
> should be minimal.

It really is quite a bit of code overall.

   text	   data	    bss	    dec	    hex	filename
4071245	 823620	 741180	5636045	 55ffcd	linux-2.6.29-rc5-vanilla/vmlinux
4070872	 823684	 741180	5635736	 55fe98 linux-2.6.29-rc5-convert-gfpzone/vmlinux

That's 373 bytes of text with oodles of branches. I don't know what the
cost of misprediction is going to be but surely this is having some impact
on the branch prediction tables?

> The decoder is likely to have sucked in the following
> code anyways.
> 

Probably. To be honest, measuring this would likely be trickier, but this
is fewer branches and less code in a fast path. The question is if a
cache line of data is justified or not. Right now, I think it is but
I'll go with the general consensus if we can find one.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
  2009-02-23 14:32     ` Mel Gorman
@ 2009-02-23 17:49       ` Andi Kleen
  -1 siblings, 0 replies; 190+ messages in thread
From: Andi Kleen @ 2009-02-23 17:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, Linux Memory Management List, Pekka Enberg,
	Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

> hmm, it would be ideal but I haven't looked too closely at how it could
> be implemented. I thought first you could just associate a zonelist with

Yes like that. This was actually discussed during the initial cpuset
implementation. I thought back then it would be better to do it
elsewhere, but changed my mind later when I saw the impact on the
fast path.

> the cpuset but you'd need one for each node allowed by the cpuset so it
> could get quite large. Then again, it might be worthwhile if cpusets

Yes you would need one per node, but that's not a big problem because
systems with lots of nodes are also expected to have lots of memory.
Most systems have a very small number of nodes.
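
To make the shape of that concrete, a very rough sketch of what the
per-cpuset data might look like; the struct and field names are invented and
a real implementation would no doubt differ:

/*
 * One pre-filtered zonelist per node the cpuset may allocate from, so the
 * fast path can walk a list that already honours mems_allowed instead of
 * calling cpuset_zone_allowed() for every candidate zone.
 */
struct cpuset_node_zonelists {
	nodemask_t	mems_allowed;
	/* indexed by node id; only entries for allowed nodes are built */
	struct zonelist	zonelists[MAX_NUMNODES];
};

The MAX_NUMNODES-sized array is exactly the size concern raised above.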

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] mm: clean up __GFP_* flags a bit
  2009-02-23 11:55     ` Peter Zijlstra
@ 2009-02-23 18:01       ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 18:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 12:55:01PM +0100, Peter Zijlstra wrote:
> Subject: mm: clean up __GFP_* flags a bit
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Mon Feb 23 12:28:33 CET 2009
> 
> re-sort them and poke at some whitespace alignment for easier reading.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

It didn't apply because we are working off different trees. I was on
git-latest from last Wednesday and this looks to be -mm based on the presense
of CONFIG_KMEMCHECK. I rebased and ended up with the patch below. Thanks

====
>From 37990226c22063d6304106cdd5aae9b73f484d76 Mon Sep 17 00:00:00 2001
From: Mel Gorman <mel@csn.ul.ie>
Date: Mon, 23 Feb 2009 17:56:22 +0000
Subject: [PATCH] Re-sort GFP flags and fix whitespace alignment for easier reading.

Re-sort the GFP flags after __GFP_MOVABLE got redefined so that how the
bits are used is a bit clearer.

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 581f8a9..8f7d176 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -25,6 +25,8 @@ struct vm_area_struct;
 #define __GFP_HIGHMEM	((__force gfp_t)0x02u)
 #define __GFP_DMA32	((__force gfp_t)0x04u)
 
+#define __GFP_MOVABLE	((__force gfp_t)0x08u)  /* Page is movable */
+
 /*
  * Action modifiers - doesn't change the zoning
  *
@@ -50,11 +52,10 @@ struct vm_area_struct;
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* See above */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
-#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
-#define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
-#define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_NOMEMALLOC  ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
+#define __GFP_HARDWALL    ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_THISNODE	  ((__force gfp_t)0x40000u) /* No fallback, no policies */
 #define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
-#define __GFP_MOVABLE	((__force gfp_t)0x08u)  /* Page is movable */
 
 #define __GFP_BITS_SHIFT 21	/* Room for 21 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [PATCH] mm: gfp_to_alloc_flags()
  2009-02-23 11:55   ` Peter Zijlstra
@ 2009-02-23 18:17     ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 18:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 12:55:03PM +0100, Peter Zijlstra wrote:
> I've always found the below a clean-up, respun it on top of your changes.
> Test box still boots ;-)
> 
> ---
> Subject: mm: gfp_to_alloc_flags()
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Mon Feb 23 12:46:36 CET 2009
> 
> Clean up the code by factoring out the gfp to alloc_flags mapping.
> 
> [neilb@suse.de says]
> As the test:
> 
> -       if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> -                       && !in_interrupt()) {
> -               if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> 

At what point was this code deleted?

If it still exists, then I like the idea of this patch anyway. It takes
more code out of the loop. We end up checking if __GFP_WAIT is set twice,
but no major harm in that.

> has been replaced with a slightly weaker one:
> 
> +       if (alloc_flags & ALLOC_NO_WATERMARKS) {
> 
> we need to ensure we don't recurse when PF_MEMALLOC is set
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  mm/page_alloc.c |   90 ++++++++++++++++++++++++++++++++------------------------
>  1 file changed, 52 insertions(+), 38 deletions(-)
> 
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -1658,16 +1658,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>  	return page;
>  }
>  
> -static inline int is_allocation_high_priority(struct task_struct *p,
> -							gfp_t gfp_mask)
> -{
> -	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> -			&& !in_interrupt())
> -		if (!(gfp_mask & __GFP_NOMEMALLOC))
> -			return 1;
> -	return 0;
> -}
> -
>  /*
>   * This is called in the allocator slow-path if the allocation request is of
>   * sufficient urgency to ignore watermarks and take other desperate measures
> @@ -1702,6 +1692,44 @@ void wake_all_kswapd(unsigned int order,
>  		wakeup_kswapd(zone, order);
>  }
>  
> +/*
> + * get the deepest reaching allocation flags for the given gfp_mask
> + */
> +static int gfp_to_alloc_flags(gfp_t gfp_mask)
> +{
> +	struct task_struct *p = current;
> +	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> +	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +
> +	/*
> +	 * The caller may dip into page reserves a bit more if the caller
> +	 * cannot run direct reclaim, or if the caller has realtime scheduling
> +	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> +	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
> +	 */
> +	if (gfp_mask & __GFP_HIGH)
> +		alloc_flags |= ALLOC_HIGH;
> +
> +	if (!wait) {
> +		alloc_flags |= ALLOC_HARDER;
> +		/*
> +		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
> +		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> +		 */
> +		alloc_flags &= ~ALLOC_CPUSET;
> +	} else if (unlikely(rt_task(p)) && !in_interrupt())
> +		alloc_flags |= ALLOC_HARDER;
> +
> +	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> +		if (!in_interrupt() &&
> +		    ((p->flags & PF_MEMALLOC) ||
> +		     unlikely(test_thread_flag(TIF_MEMDIE))))
> +			alloc_flags |= ALLOC_NO_WATERMARKS;
> +	}
> +
> +	return alloc_flags;
> +}
> +
>  static struct page * noinline
>  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> @@ -1732,48 +1760,34 @@ __alloc_pages_slowpath(gfp_t gfp_mask, u
>  	 * OK, we're below the kswapd watermark and have kicked background
>  	 * reclaim. Now things get more complex, so set up alloc_flags according
>  	 * to how we want to proceed.
> -	 *
> -	 * The caller may dip into page reserves a bit more if the caller
> -	 * cannot run direct reclaim, or if the caller has realtime scheduling
> -	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> -	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
>  	 */
> -	alloc_flags = ALLOC_WMARK_MIN;
> -	if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
> -		alloc_flags |= ALLOC_HARDER;
> -	if (gfp_mask & __GFP_HIGH)
> -		alloc_flags |= ALLOC_HIGH;
> -	if (wait)
> -		alloc_flags |= ALLOC_CPUSET;
> +	alloc_flags = gfp_to_alloc_flags(gfp_mask);
>  
>  restart:
> -	/*
> -	 * Go through the zonelist again. Let __GFP_HIGH and allocations
> -	 * coming from realtime tasks go deeper into reserves.
> -	 *
> -	 * This is the last chance, in general, before the goto nopage.
> -	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
> -	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> -	 */
> +	/* This is the last chance, in general, before the goto nopage. */
>  	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
> -						high_zoneidx, alloc_flags,
> -						preferred_zone,
> -						migratetype);
> +			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
> +			preferred_zone, migratetype);
>  	if (page)
>  		goto got_pg;
>  
>  	/* Allocate without watermarks if the context allows */
> -	if (is_allocation_high_priority(p, gfp_mask))
> +	if (alloc_flags & ALLOC_NO_WATERMARKS) {
>  		page = __alloc_pages_high_priority(gfp_mask, order,
> -			zonelist, high_zoneidx, nodemask, preferred_zone,
> -			migratetype);
> -	if (page)
> -		goto got_pg;
> +				zonelist, high_zoneidx, nodemask,
> +				preferred_zone, migratetype);
> +		if (page)
> +			goto got_pg;
> +	}
>  
>  	/* Atomic allocations - we can't balance anything */
>  	if (!wait)
>  		goto nopage;
>  
> +	/* Avoid recursion of direct reclaim */
> +	if (p->flags & PF_MEMALLOC)
> +		goto nopage;
> +
>  	/* Try direct reclaim and then allocating */
>  	page = __alloc_pages_direct_reclaim(gfp_mask, order,
>  					zonelist, high_zoneidx,
> 

Looks good eyeballing it here at least. I'll slot it in and see what the
end result looks like but I think it'll be good.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] mm: gfp_to_alloc_flags()
  2009-02-23 18:17     ` Mel Gorman
@ 2009-02-23 20:09       ` Peter Zijlstra
  -1 siblings, 0 replies; 190+ messages in thread
From: Peter Zijlstra @ 2009-02-23 20:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, 2009-02-23 at 18:17 +0000, Mel Gorman wrote:

> > 
> > -       if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> > -                       && !in_interrupt()) {
> > -               if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> > 
> 
> At what point was this code deleted?

You moved it around a bit, but it ended up here:

> > -static inline int is_allocation_high_priority(struct task_struct *p,
> > -							gfp_t gfp_mask)
> > -{
> > -	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> > -			&& !in_interrupt())
> > -		if (!(gfp_mask & __GFP_NOMEMALLOC))
> > -			return 1;
> > -	return 0;
> > -}


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
  2009-02-23 15:22       ` Nick Piggin
@ 2009-02-23 20:26         ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 20:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Tue, Feb 24, 2009 at 02:22:03AM +1100, Nick Piggin wrote:
> On Tuesday 24 February 2009 02:00:56 Mel Gorman wrote:
> > On Tue, Feb 24, 2009 at 01:46:01AM +1100, Nick Piggin wrote:
> 
> > > free_page_mlock shouldn't really be in free_pages_check, but oh well.
> >
> > Agreed, I took it out of there.
> 
> Oh good. I didn't notice that.
> 
> > > > Patch 16 avoids using the zonelist cache on non-NUMA machines
> > > >
> > > > Patch 17 removes an expensive and excessively paranoid check in the
> > > > allocator fast path
> > >
> > > I would be careful of removing useful debug checks completely like
> > > this. What is the cost? Obviously non-zero, but it is also a check
> >
> > The cost was something like 1/10th the cost of the path. There are atomic
> > operations in there that are causing the problems.
> 
> The only atomic memory operations in there should be atomic loads of
> word or atomic_t sized and aligned locations, which should just be
> normal loads on any architecture?
> 
> The only atomic RMW you might see in that function would come from
> free_page_mlock (which you moved out of there, and anyway can be
> made non-atomic).
> 

You're right, they're normal loads. I wasn't looking at the resulting
assembly closely enough.  I saw a lock and branch in that general area of
free_pages_check(), remembered that it was an atomic read, conflated the
lock-bit-clear with the atomic read and went astray from there. Silly.

> I'd like you to just reevaluate it after your patchset, after the
> patch to make mlock non-atomic, and my patch I just sent.
> 

I re-evaluated with your patch in place of the check being dropped. With
the mlock bit clear moved out of the way, the assembly looks grand and the
amount of time being spent in that check is ok according to profiles

  o roughly 70 samples out of 398 in __free_pages_ok()
  o 2354 samples out of 31295 in free_pcp_pages()
  o 859 samples out of 35362 in get_page_from_freelist

I guess it's 7.5% of the free_pcp_pages() path but it would probably cause
more hassle with hard-to-debug problems if the check was removed.

I was momentarily concerned about the compound aspect of page_count. We can
have compound pages in the __free_pages_ok() path and we'll end up checking
the count for each of the sub-pages instead of the head page. It shouldn't be
a problem as the count should be zero for each of the tail pages. A positive
count is a bug and will now trigger where in fact we would have missed it
before. I convinced myself that this change is ok but if anyone can spot a
problem with this reasoning, please shout now.

Is the page_mapcount() change in your patch really necessary? Unlike
page_count(), it does not check for a compound page so it's not branching
like page_count() is. I don't think it is so I dropped that part of the
patch for the moment.

> 
> > > I have seen it trigger on quite a lot of occasions (due to kernel bugs
> > > and hardware bugs, and in each case it is better to warn than not,
> > > even if many other situations can go undetected).
> >
> > Have you really seen it trigger for the allocation path or did it
> > trigger in the free path? Essentially we are making the same check on
> > every allocation and free, which is why I considered it excessively
> > paranoid.
> 
> Yes I've seen it trigger in the allocation path. Kernel memory scribbles
> or RAM errors.
> 

That's the type of situation I expected it to occur in, but I felt that the
free path check would be sufficient. However, I'm convinced now to leave it in place,
particularly as its cost is not as excessive as I initially believed.

> 
> > > One problem is that some of the calls we're making in page_alloc.c
> > > do the compound_head() thing, wheras we know that we only want to
> > > look at this page. I've attached a patch which cuts out about 150
> > > bytes of text and several branches from these paths.
> >
> > Nice, I should have spotted that. I'm going to fold this into the series
> > if that is ok with you? I'll replace patch 17 with it and see does it
> > still show up on profiles.
> 
> Great! Sure fold it in (and put SOB: me on there if you like).
> 

Done, thanks. The version I'm currently using is below.

> 
> > > > So, by and large it's an improvement of some sort.
> > >
> > > Most of these benchmarks *really* need to be run quite a few times to get
> > > a reasonable confidence.
> >
> > Most are run repeatedly and an average taken but I should double check
> > what is going on. It's irritating that gains/regressions are
> > inconsistent between different machine types but that is nothing new.
> 
> Yeah. Cache behaviour maybe. One thing you might try is to size the struct
> page out to 64 bytes if it isn't already. This could bring down any skews
> if one kernel is lucky to get a nice packing of pages, or another is unlucky
> to get lots of struct pages spread over 2 cachelines. Maybe I'm just
> thinking wishfully :)
> 

It's worth an investigate :)
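
If anyone does try it, a build-time assertion is probably the simplest way to
confirm the assumption for a given config; a sketch, to be dropped into any
early init function:

	/* BUILD_BUG_ON() is from <linux/kernel.h>; the build fails if
	 * struct page is not exactly one 64-byte cache line */
	BUILD_BUG_ON(sizeof(struct page) != 64);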

> I think with many of your changes, common sense will tell us that it is a
> better code sequence. Sometimes it's just impossible to really get
> "scientific proof" :)
> 

Sounds good to me, but I'm hoping that it'll be possible to show gains in
a few benchmarks on a few machines without large regressions showing up.

The replacement patch now looks like

=====
    Do not check for compound pages during the page allocator sanity checks
    
    A number of sanity checks are made on each page allocation and free
    including that the page count is zero. page_count() checks for
    compound pages and checks the count of the head page if true. However,
    in these paths, we do not care if the page is compound or not as the
    count of each tail page should also be zero.
    
    This patch makes two changes to the use of page_count() in the free path. It
    converts one check of page_count() to a VM_BUG_ON() as the count should
    have been unconditionally checked earlier in the free path. It also avoids
    checking for compound pages.
    
    [mel@csn.ul.ie: Wrote changelog]
    Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e598da8..8a8db71 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -426,7 +426,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 		return 0;
 
 	if (PageBuddy(buddy) && page_order(buddy) == order) {
-		BUG_ON(page_count(buddy) != 0);
+		VM_BUG_ON(page_count(buddy) != 0);
 		return 1;
 	}
 	return 0;
@@ -503,7 +503,7 @@ static inline int free_pages_check(struct page *page)
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(atomic_read(&page->_count) != 0) |
 		(page->flags & PAGE_FLAGS_CHECK_AT_FREE))) {
 		bad_page(page);
 		return 1;
@@ -648,7 +648,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(atomic_read(&page->_count) != 0)  |
 		(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
 		bad_page(page);
 		return 1;

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
@ 2009-02-23 20:26         ` Mel Gorman
  0 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 20:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Tue, Feb 24, 2009 at 02:22:03AM +1100, Nick Piggin wrote:
> On Tuesday 24 February 2009 02:00:56 Mel Gorman wrote:
> > On Tue, Feb 24, 2009 at 01:46:01AM +1100, Nick Piggin wrote:
> 
> > > free_page_mlock shouldn't really be in free_pages_check, but oh well.
> >
> > Agreed, I took it out of there.
> 
> Oh good. I didn't notice that.
> 
> > > > Patch 16 avoids using the zonelist cache on non-NUMA machines
> > > >
> > > > Patch 17 removes an expensive and excessively paranoid check in the
> > > > allocator fast path
> > >
> > > I would be careful of removing useful debug checks completely like
> > > this. What is the cost? Obviously non-zero, but it is also a check
> >
> > The cost was something like 1/10th the cost of the path. There are atomic
> > operations in there that are causing the problems.
> 
> The only atomic memory operations in there should be atomic loads of
> word or atomic_t sized and aligned locations, which should just be
> normal loads on any architecture?
> 
> The only atomic RMW you might see in that function would come from
> free_page_mlock (which you moved out of there, and anyway can be
> made non-atomic).
> 

You're right, they're normal loads. I wasn't looking at the resulting
assembly closely enough.  I saw a lock and branch in that general area of
free_pages_check(), remembered that it was an atomic read, conflated the
lock-bit-clear with the atomic read and went astray from there. Silly.

> I'd like you to just reevaluate it after your patchset, after the
> patch to make mlock non-atomic, and my patch I just sent.
> 

I re-evaluated with your patch in place of the check being dropped. With
the mlock bit clear moved out of the way, the assembly looks grand and the
amount of time being spent in that check is ok according to profiles

  o roughly 70 samples out of 398 in __free_pages_ok()
  o 2354 samples out of 31295 in free_pcp_pages()
  o 859 samples out of 35362 get_page_from_freelist 

I guess it's 7.5% of the free_pcp_pages() path but it would probably cause
more hassle with hard-to-debug problems the check was removed.

I was momentarily concerned about the compound aspect of page_count. We can
have compound pages in the __free_pages_ok() path and we'll end up checking
the count for each of the sub-pages instead of the head page. It shouldn't be
a problem as the count should be zero for each of the tail pages. A positive
count is a bug and will now trigger where in fact we would have missed it
before. I convinced myself that this change is ok but if anyone can spot a
problem with this reasoning, please shout now.

Is the page_mapcount() change in your patch really necessary? Unlikely
page_count(), it does not check for a compound page so it's not branching
like page_count() is. I don't think it is so I dropped that part of the
patch for the moment.

> 
> > > I have seen trigger on quite a lot of occasions (due to kernel bugs
> > > and hardware bugs, and in each case it is better to warn than not,
> > > even if many other situations can go undetected).
> >
> > Have you really seen it trigger for the allocation path or did it
> > trigger in teh free path? Essentially we are making the same check on
> > every allocation and free which is why I considered it excessivly
> > paranoid.
> 
> Yes I've seen it trigger in the allocation path. Kernel memory scribbles
> or RAM errors.
> 

That's the type of situation I expected it to occur but felt that the free
path would be sufficient. However, I'm convinced now to leave it in place,
particularly as its cost is not as excessive as I initially believed.

> 
> > > One problem is that some of the calls we're making in page_alloc.c
> > > do the compound_head() thing, wheras we know that we only want to
> > > look at this page. I've attached a patch which cuts out about 150
> > > bytes of text and several branches from these paths.
> >
> > Nice, I should have spotted that. I'm going to fold this into the series
> > if that is ok with you? I'll replace patch 17 with it and see does it
> > still show up on profiles.
> 
> Great! Sure fold it in (and put SOB: me on there if you like).
> 

Done, thanks. The version I'm currently using is below.

> 
> > > > So, by and large it's an improvement of some sort.
> > >
> > > Most of these benchmarks *really* need to be run quite a few times to get
> > > a reasonable confidence.
> >
> > Most are run repeatedly and an average taken but I should double check
> > what is going on. It's irritating that gains/regressions are
> > inconsistent between different machine types but that is nothing new.
> 
> Yeah. Cache behaviour maybe. One thing you might try is to size the struct
> page out to 64 bytes if it isn't already. This could bring down any skews
> if one kernel is lucky to get a nice packing of pages, or another is unlucky
> to get lots of struct pages spread over 2 cachelines. Maybe I'm just
> thinking wishfully :)
> 

It's worth an investigate :)

> I think with many of your changes, common sense will tell us that it is a
> better code sequence. Sometimes it's just impossible to really get
> "scientific proof" :)
> 

Sounds good to me but I'm hoping that it'll be possible to show a gains in
a few benchmarks on a few machines without large regressions showing up.

The replacement patch now looks like

=====
    Do not check for compound pages during the page allocator sanity checks
    
    A number of sanity checks are made on each page allocation and free
    including that the page count is zero. page_count() checks for
    compound pages and checks the count of the head page if true. However,
    in these paths, we do not care if the page is compound or not as the
    count of each tail page should also be zero.
    
    This patch makes two changes to the use of page_count() in the free path. It
    converts one check of page_count() to a VM_BUG_ON() as the count should
    have been unconditionally checked earlier in the free path. It also avoids
    checking for compound pages.
    
    [mel@csn.ul.ie: Wrote changelog]
    Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e598da8..8a8db71 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -426,7 +426,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 		return 0;
 
 	if (PageBuddy(buddy) && page_order(buddy) == order) {
-		BUG_ON(page_count(buddy) != 0);
+		VM_BUG_ON(page_count(buddy) != 0);
 		return 1;
 	}
 	return 0;
@@ -503,7 +503,7 @@ static inline int free_pages_check(struct page *page)
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(atomic_read(&page->_count) != 0) |
 		(page->flags & PAGE_FLAGS_CHECK_AT_FREE))) {
 		bad_page(page);
 		return 1;
@@ -648,7 +648,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(atomic_read(&page->_count) != 0)  |
 		(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
 		bad_page(page);
 		return 1;


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [PATCH] mm: clean up __GFP_* flags a bit
  2009-02-23 18:01       ` Mel Gorman
@ 2009-02-23 20:27         ` Vegard Nossum
  -1 siblings, 0 replies; 190+ messages in thread
From: Vegard Nossum @ 2009-02-23 20:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Linux Memory Management List, Pekka Enberg,
	Rik van Riel, KOSAKI Motohiro, Christoph Lameter,
	Johannes Weiner, Nick Piggin, Linux Kernel Mailing List,
	Lin Ming, Zhang Yanmin

2009/2/23 Mel Gorman <mel@csn.ul.ie>:
> On Mon, Feb 23, 2009 at 12:55:01PM +0100, Peter Zijlstra wrote:
>> Subject: mm: clean up __GFP_* flags a bit
>> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
>> Date: Mon Feb 23 12:28:33 CET 2009
>>
>> re-sort them and poke at some whitespace alignment for easier reading.
>>
>> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
> It didn't apply because we are working off different trees. I was on
> git-latest from last Wednesday and this looks to be -mm based on the presence
> of CONFIG_KMEMCHECK. I rebased and ended up with the patch below. Thanks

I will take the remaining parts and apply them to the kmemcheck tree. Thanks!


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] mm: gfp_to_alloc_flags()
  2009-02-23 11:55   ` Peter Zijlstra
@ 2009-02-23 22:59     ` Andrew Morton
  -1 siblings, 0 replies; 190+ messages in thread
From: Andrew Morton @ 2009-02-23 22:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mel, linux-mm, penberg, riel, kosaki.motohiro, cl, hannes,
	npiggin, linux-kernel, ming.m.lin, yanmin_zhang

On Mon, 23 Feb 2009 12:55:03 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> +static int gfp_to_alloc_flags(gfp_t gfp_mask)
> +{
> +	struct task_struct *p = current;
> +	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> +	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +
> +	/*
> +	 * The caller may dip into page reserves a bit more if the caller
> +	 * cannot run direct reclaim, or if the caller has realtime scheduling
> +	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> +	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
> +	 */
> +	if (gfp_mask & __GFP_HIGH)
> +		alloc_flags |= ALLOC_HIGH;

This could be sped up by making ALLOC_HIGH==__GFP_HIGH (hack)

> +	if (!wait) {
> +		alloc_flags |= ALLOC_HARDER;
> +		/*
> +		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
> +		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> +		 */
> +		alloc_flags &= ~ALLOC_CPUSET;
> +	} else if (unlikely(rt_task(p)) && !in_interrupt())
> +		alloc_flags |= ALLOC_HARDER;
> +
> +	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> +		if (!in_interrupt() &&
> +		    ((p->flags & PF_MEMALLOC) ||
> +		     unlikely(test_thread_flag(TIF_MEMDIE))))
> +			alloc_flags |= ALLOC_NO_WATERMARKS;
> +	}
> +	return alloc_flags;
> +}


But really, the whole function can be elided on the fastpath.  Try the
allocation with the current flags (and __GFP_NOWARN) and only if it
failed will we try altering the flags to try harder?
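
To make the ALLOC_HIGH == __GFP_HIGH suggestion concrete, here is a
stand-alone sketch. The flag values are illustrative rather than taken from
the tree, and a kernel version of this would want a build-time check that
the two constants really are equal.

#include <assert.h>

/* Illustrative values; the hack only works while they stay equal. */
#define __GFP_HIGH	0x20u
#define ALLOC_HIGH	0x20u

static unsigned int add_alloc_high(unsigned int gfp_mask, unsigned int alloc_flags)
{
	/* replaces: if (gfp_mask & __GFP_HIGH) alloc_flags |= ALLOC_HIGH; */
	return alloc_flags | (gfp_mask & __GFP_HIGH);
}

int main(void)
{
	assert(add_alloc_high(__GFP_HIGH, 0) == ALLOC_HIGH);
	assert(add_alloc_high(0, 0x40u) == 0x40u);	/* other flags untouched */
	return 0;
}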


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-23  9:37     ` Andrew Morton
@ 2009-02-23 23:30       ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-23 23:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 01:37:23AM -0800, Andrew Morton wrote:
> On Sun, 22 Feb 2009 23:17:29 +0000 Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Currently an effort is made to determine if a page is hot or cold when
> > it is being freed so that cache hot pages can be allocated to callers if
> > possible. However, the reasoning used whether to mark something hot or
> > cold is a bit spurious. A profile run of kernbench showed that "cold"
> > pages were never freed so it either doesn't happen generally or is so
> > rare, it's barely measurable.
> > 
> > It's dubious as to whether pages are being correctly marked hot and cold
> > anyway. Things like page cache and pages being truncated are considered
> > "hot" but there is no guarantee that these pages have been recently used
> > and are cache hot. Pages being reclaimed from the LRU are considered
> > cold which is logical because they cannot have been referenced recently
> > but if the system is reclaiming pages, then we have entered allocator
> > slowpaths and are not going to notice any potential performance boost
> > because a "hot" page was freed.
> > 
> > This patch just deletes the concept of freeing hot or cold pages and
> > just frees them all as hot.
> > 
> 
> Well yes.  We waffled for months over whether to merge that code originally.
> 
> What tipped the balance was a dopey microbenchmark which I wrote which
> sat in a loop extending (via write()) and then truncating the same file
> by 32 kbytes (or thereabouts).  Its performance was increased by a lot
> (2x or more, iirc) and no actual regressions were demonstrable, so we
> merged it.
> 
> Could you check that please?  I'd suggest trying various values of 32k,
> too.
> 

I dug around the archives but hadn't much luck finding the original
discussion. I saw some results from around the 2.5.40-mm timeframe that talked
about ~60% difference with this benchmark (http://lkml.org/lkml/2002/10/6/174)
but didn't find the source. The more solid benchmark report was
https://lwn.net/Articles/14761/ where you talked about 1-2% kernel compile
improvements, good SpecWEB and a big hike in performance with SDET.

It's not clearcut. I tried reproducing your original benchmark rather than
whinging about not finding yours :) . The source is below so maybe you can
tell me if it's equivalent? I only ran it on one CPU which also may be a
factor. The results were

    size      with   without difference
      64  0.216033  0.558803 -158.67%
     128  0.158551  0.150673   4.97%
     256  0.153240  0.153488  -0.16%
     512  0.156502  0.158769  -1.45%
    1024  0.162146  0.163302  -0.71%
    2048  0.167001  0.169573  -1.54%
    4096  0.175376  0.178882  -2.00%
    8192  0.237618  0.243385  -2.43%
   16384  0.735053  0.351040  52.24%
   32768  0.524731  0.583863 -11.27%
   65536  1.149310  1.227855  -6.83%
  131072  2.160248  2.084981   3.48%
  262144  3.858264  4.046389  -4.88%
  524288  8.228358  8.259957  -0.38%
 1048576 16.228190 16.288308  -0.37%

with    == Using hot/cold information to place pages at the front or end of
        the freelist
without == Consider all pages being freed as hot
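
(Reading the table, the difference column works out to
(with - without) / with * 100; for the 128 row, for example,
(0.158551 - 0.150673) / 0.158551 = 4.97%. A positive figure therefore means
the kernel that tracks hot/cold frees took longer for that size.)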

The results are a bit all over the place, mostly negative, but nowhere near
a 60% difference, so the benchmark might be wrong. Oddly, 64 shows a massive
regression while 16384 shows a massive improvement. With profiling enabled, it's

      64  0.214873  0.196666   8.47%
     128  0.166807  0.162612   2.51%
     256  0.170776  0.161861   5.22%
     512  0.175772  0.164903   6.18%
    1024  0.178835  0.168695   5.67%
    2048  0.183769  0.174317   5.14%
    4096  0.191877  0.183343   4.45%
    8192  0.262511  0.254148   3.19%
   16384  0.388201  0.371461   4.31%
   32768  0.655402  0.611528   6.69%
   65536  1.325445  1.193961   9.92%
  131072  2.218135  2.209091   0.41%
  262144  4.117233  4.116681   0.01%
  524288  8.514915  8.590700  -0.89%
 1048576 16.657330 16.708367  -0.31%

Almost the opposite with steady improvements almost all the way through.

With the patch applied, we are still using hot/cold information on the
allocation side so I'm somewhat surprised the patch even makes much of a
difference. I'd have expected the pages being freed to be mostly hot.

Kernbench was no help figuring this out either.

with:    Elapsed: 74.1625s User: 253.85s System: 27.1s CPU: 378.5%
without: Elapsed: 74.0525s User: 252.9s System: 27.3675s CPU: 378.25%

Improvements on elapsed and user time but a regression on system time.

The issue is sufficiently cloudy that I'm just going to drop the patch
for now. Hopefully the rest of the patchset is more clear-cut. I'll pick
it up again at a later time.

Here is the microbenchmark I used

Thanks.

/*
 * write-truncate.c
 * Microbenchmark that tests the speed of write/truncate of small files.
 * 
 * Suggested by Andrew Morton
 * Written by Mel Gorman 2009
 */
#include <stdio.h>
#include <limits.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/time.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>

#define TESTFILE "./write-truncate-testfile.dat"
#define ITERATIONS 10000
#define STARTSIZE 32
#define SIZES 15

#ifndef MIN
#define MIN(x,y) ((x)<(y)?(x):(y))
#endif
#ifndef MAX
#define MAX(x,y) ((x)>(y)?(x):(y))
#endif

double whattime()
{
        struct timeval tp;
        int i;

	if (gettimeofday(&tp,NULL) == -1) {
		perror("gettimeofday");
		exit(EXIT_FAILURE);
	}

        return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}

int main(void)
{
	int fd;
	int bufsize, sizes, iteration;
	char *buf;
	double t;

	/* Create test file */
	fd = open(TESTFILE, O_RDWR|O_CREAT|O_EXCL);
	if (fd == -1) {
		perror("open");
		exit(EXIT_FAILURE);
	}

	/* Unlink now for cleanup */
	if (unlink(TESTFILE) == -1) {
		perror("unlinke");
		exit(EXIT_FAILURE);
	}

	/* Go through a series of sizes */
	bufsize = STARTSIZE;
	for (sizes = 1; sizes <= SIZES; sizes++) {
		bufsize *= 2;
		buf = malloc(bufsize);
		if (buf == NULL) {
			printf("ERROR: Malloc failed\n");
			exit(EXIT_FAILURE);
		}
		memset(buf, 0xE0, bufsize);

		t = whattime();
		for (iteration = 0; iteration < ITERATIONS; iteration++) {
			size_t written = 0, thiswrite;
			
			while (written != bufsize) {
				thiswrite = write(fd, buf, bufsize);
				if (thiswrite == -1) {
					perror("write");
					exit(EXIT_FAILURE);
				}
				written += thiswrite;
			}

			if (ftruncate(fd, 0) == -1) {
				perror("ftruncate");
				exit(EXIT_FAILURE);
			}

			if (lseek(fd, 0, SEEK_SET) != 0) {
				perror("lseek");
				exit(EXIT_FAILURE);
			}
		}
		t = whattime() - t;
		free(buf);

		printf("%d %f\n", bufsize, t);
	}

	if (close(fd) == -1) {
		perror("close");
		exit(EXIT_FAILURE);
	}

	exit(EXIT_SUCCESS);
}
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-23 23:30       ` Mel Gorman
@ 2009-02-23 23:53         ` Andrew Morton
  -1 siblings, 0 replies; 190+ messages in thread
From: Andrew Morton @ 2009-02-23 23:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, penberg, riel, kosaki.motohiro, cl, hannes, npiggin,
	linux-kernel, ming.m.lin, yanmin_zhang

On Mon, 23 Feb 2009 23:30:30 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> On Mon, Feb 23, 2009 at 01:37:23AM -0800, Andrew Morton wrote:
> > On Sun, 22 Feb 2009 23:17:29 +0000 Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > Currently an effort is made to determine if a page is hot or cold when
> > > it is being freed so that cache hot pages can be allocated to callers if
> > > possible. However, the reasoning used whether to mark something hot or
> > > cold is a bit spurious. A profile run of kernbench showed that "cold"
> > > pages were never freed so it either doesn't happen generally or is so
> > > rare, it's barely measurable.
> > > 
> > > It's dubious as to whether pages are being correctly marked hot and cold
> > > anyway. Things like page cache and pages being truncated are considered
> > > "hot" but there is no guarantee that these pages have been recently used
> > > and are cache hot. Pages being reclaimed from the LRU are considered
> > > cold which is logical because they cannot have been referenced recently
> > > but if the system is reclaiming pages, then we have entered allocator
> > > slowpaths and are not going to notice any potential performance boost
> > > because a "hot" page was freed.
> > > 
> > > This patch just deletes the concept of freeing hot or cold pages and
> > > just frees them all as hot.
> > > 
> > 
> > Well yes.  We waffled for months over whether to merge that code originally.
> > 
> > What tipped the balance was a dopey microbenchmark which I wrote which
> > sat in a loop extending (via write()) and then truncating the same file
> > by 32 kbytes (or thereabouts).  Its performance was increased by a lot
> > (2x or more, iirc) and no actual regressions were demonstrable, so we
> > merged it.
> > 
> > Could you check that please?  I'd suggest trying various values of 32k,
> > too.
> > 
> 
> I dug around the archives but hadn't much luck finding the original
> discussion. I saw some results from around the 2.5.40-mm timeframe that talked
> about ~60% difference with this benchmark (http://lkml.org/lkml/2002/10/6/174)
> but didn't find the source. The more solid benchmark report was
> https://lwn.net/Articles/14761/ where you talked about 1-2% kernel compile
> improvements, good SpecWEB and a big hike in performance with SDET.
> 
> It's not clearcut. I tried reproducing your original benchmark rather than
> whinging about not finding yours :) . The source is below so maybe you can
> tell me if it's equivalent? I only ran it on one CPU which also may be a
> factor. The results were
> 
>     size      with   without difference
>       64  0.216033  0.558803 -158.67%
>      128  0.158551  0.150673   4.97%
>      256  0.153240  0.153488  -0.16%
>      512  0.156502  0.158769  -1.45%
>     1024  0.162146  0.163302  -0.71%
>     2048  0.167001  0.169573  -1.54%
>     4096  0.175376  0.178882  -2.00%
>     8192  0.237618  0.243385  -2.43%
>    16384  0.735053  0.351040  52.24%
>    32768  0.524731  0.583863 -11.27%
>    65536  1.149310  1.227855  -6.83%
>   131072  2.160248  2.084981   3.48%
>   262144  3.858264  4.046389  -4.88%
>   524288  8.228358  8.259957  -0.38%
>  1048576 16.228190 16.288308  -0.37%
> 
> with    == Using hot/cold information to place pages at the front or end of
>         the freelist
> without == Consider all pages being freed as hot

My head is spinning.  Smaller is better, right?  So for 16384-byte
writes, current mainline is slower?

That's odd.

> The results are a bit all over the place, mostly negative, but nowhere near
> a 60% difference, so the benchmark might be wrong. Oddly, 64 shows a massive
> regression while 16384 shows a massive improvement. With profiling enabled, it's
> 
>       64  0.214873  0.196666   8.47%
>      128  0.166807  0.162612   2.51%
>      256  0.170776  0.161861   5.22%
>      512  0.175772  0.164903   6.18%
>     1024  0.178835  0.168695   5.67%
>     2048  0.183769  0.174317   5.14%
>     4096  0.191877  0.183343   4.45%
>     8192  0.262511  0.254148   3.19%
>    16384  0.388201  0.371461   4.31%
>    32768  0.655402  0.611528   6.69%
>    65536  1.325445  1.193961   9.92%
>   131072  2.218135  2.209091   0.41%
>   262144  4.117233  4.116681   0.01%
>   524288  8.514915  8.590700  -0.89%
>  1048576 16.657330 16.708367  -0.31%
> 
> Almost the opposite with steady improvements almost all the way through.
> 
> With the patch applied, we are still using hot/cold information on the
> allocation side so I'm somewhat surprised the patch even makes much of a
> difference. I'd have expected the pages being freed to be mostly hot.

Oh yeah.  Back in the ancient days, hot-cold-pages was using separate
magazines for hot and cold pages.  Then Christoph went and mucked with
it, using a single queue.  That might have affected things.

It would be interesting to go back to a suitably-early kernel to see if
we broke it sometime after the early quantitative testing.  But I could
understand you not being so terribly interested ;)

> Kernbench was no help figuring this out either.
> 
> with:    Elapsed: 74.1625s User: 253.85s System: 27.1s CPU: 378.5%
> without: Elapsed: 74.0525s User: 252.9s System: 27.3675s CPU: 378.25%
> 
> Improvements on elapsed and user time but a regression on system time.
> 
> The issue is sufficiently cloudy that I'm just going to drop the patch
> for now. Hopefully the rest of the patchset is more clear-cut. I'll pick
> it up again at a later time.

Well...  if the benefits of the existing code are dubious then we
should default to deleting it.

> Here is the microbenchmark I used
> 
> Thanks.
> 
> /*
>  * write-truncate.c
>  * Microbenchmark that tests the speed of write/truncate of small files.
>  * 
>  * Suggested by Andrew Morton
>  * Written by Mel Gorman 2009
>  */
> #include <stdio.h>
> #include <limits.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/time.h>
> #include <fcntl.h>
> #include <stdlib.h>
> #include <string.h>
> 
> #define TESTFILE "./write-truncate-testfile.dat"
> #define ITERATIONS 10000
> #define STARTSIZE 32
> #define SIZES 15
> 
> #ifndef MIN
> #define MIN(x,y) ((x)<(y)?(x):(y))
> #endif
> #ifndef MAX
> #define MAX(x,y) ((x)>(y)?(x):(y))
> #endif
> 
> double whattime()
> {
>         struct timeval tp;
>         int i;
> 
> 	if (gettimeofday(&tp,NULL) == -1) {
> 		perror("gettimeofday");
> 		exit(EXIT_FAILURE);
> 	}
> 
>         return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
> }
> 
> int main(void)
> {
> 	int fd;
> 	int bufsize, sizes, iteration;
> 	char *buf;
> 	double t;
> 
> 	/* Create test file */
> 	fd = open(TESTFILE, O_RDWR|O_CREAT|O_EXCL);
> 	if (fd == -1) {
> 		perror("open");
> 		exit(EXIT_FAILURE);
> 	}
> 
> 	/* Unlink now for cleanup */
> 	if (unlink(TESTFILE) == -1) {
> 		perror("unlink");
> 		exit(EXIT_FAILURE);
> 	}
> 
> 	/* Go through a series of sizes */
> 	bufsize = STARTSIZE;
> 	for (sizes = 1; sizes <= SIZES; sizes++) {
> 		bufsize *= 2;
> 		buf = malloc(bufsize);
> 		if (buf == NULL) {
> 			printf("ERROR: Malloc failed\n");
> 			exit(EXIT_FAILURE);
> 		}
> 		memset(buf, 0xE0, bufsize);
> 
> 		t = whattime();
> 		for (iteration = 0; iteration < ITERATIONS; iteration++) {
> 			size_t written = 0, thiswrite;
> 			
> 			while (written != bufsize) {
> 				thiswrite = write(fd, buf, bufsize);

(it should write bufsize-written ;))

> 				if (thiswrite == -1) {
> 					perror("write");
> 					exit(EXIT_FAILURE);
> 				}
> 				written += thiswrite;
> 			}
> 
> 			if (ftruncate(fd, 0) == -1) {
> 				perror("ftruncate");
> 				exit(EXIT_FAILURE);
> 			}
> 
> 			if (lseek(fd, 0, SEEK_SET) != 0) {
> 				perror("lseek");
> 				exit(EXIT_FAILURE);
> 			}
> 		}

yup, I think that captures the same idea.

> 		t = whattime() - t;
> 		free(buf);
> 
> 		printf("%d %f\n", bufsize, t);
> 	}
> 
> 	if (close(fd) == -1) {
> 		perror("close");
> 		exit(EXIT_FAILURE);
> 	}
> 
> 	exit(EXIT_SUCCESS);
> }
> -- 
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
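
Following up on the bufsize-written remark in the reply above, a corrected
inner loop for write-truncate.c would look something like this sketch (same
variable names as Mel's program; thiswrite also becomes ssize_t so that the
comparison with -1 is well defined):

		size_t written = 0;
		ssize_t thiswrite;

		while (written != bufsize) {
			/* write only the bytes not yet written, after those already written */
			thiswrite = write(fd, buf + written, bufsize - written);
			if (thiswrite == -1) {
				perror("write");
				exit(EXIT_FAILURE);
			}
			written += thiswrite;
		}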

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 04/20] Convert gfp_zone() to use a table of precalculated value
  2009-02-23 16:40           ` Mel Gorman
@ 2009-02-24  1:32             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 190+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-24  1:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, Nick Piggin, Linux Memory Management List,
	Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Johannes Weiner,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, 23 Feb 2009 16:40:47 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> On Mon, Feb 23, 2009 at 10:43:20AM -0500, Christoph Lameter wrote:
> > On Tue, 24 Feb 2009, Nick Piggin wrote:
> > 
> > > > Are you sure that this is a benefit? Jumps are forward and pretty short
> > > > and the compiler is optimizing a branch away in the current code.
> > >
> > > Pretty easy to mispredict there, though, especially as you can tend
> > > to get allocations interleaved between kernel and movable (or simply
> > > if the branch predictor is cold there are a lot of branches on x86-64).
> > >
> > > I would be interested to know if there is a measured improvement.
> 
> Not in kernbench at least, but that is no surprise. It's a small
> percentage of the overall cost. It'll appear in the noise for anything
> other than micro-benchmarks.
> 
> > > It
> > > adds an extra dcache line to the footprint, but OTOH the instructions
> > > you quote is more than one icache line, and presumably Mel's code will
> > > be a lot shorter.
> > 
> 
> Yes, it's an index lookup of a shared read-only cache line versus a lot
> of code with branches to mispredict. I wasn't happy with the cache line
> consumption but it was the first obvious alternative.
> 
> > Maybe we can come up with a version of gfp_zone that has no branches and
> > no lookup?
> > 
> 
> Ideally, yes, but I didn't spot any obvious way of figuring it out at
> compile time then or now. Suggestions?
> 


Assume
  ZONE_DMA=0
  ZONE_DMA32=1
  ZONE_NORMAL=2
  ZONE_HIGHMEM=3
  ZONE_MOVABLE=4

#define __GFP_DMA       ((__force gfp_t)0x01u)
#define __GFP_DMA32     ((__force gfp_t)0x02u)
#define __GFP_HIGHMEM   ((__force gfp_t)0x04u)
#define __GFP_MOVABLE   ((__force gfp_t)0x08u)

#define GFP_MAGIC 0400030102	/* octal; depends on config */

gfp_zone(mask) = (GFP_MAGIC >> ((mask & 0xf) * 3)) & 0x7


Thx
-Kame
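
The octal constant can be sanity-checked outside the kernel. This stand-alone
sketch uses the zone numbering and flag values assumed in the mail above;
combinations of zone flags are not handled here.

#include <assert.h>

enum { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

#define __GFP_DMA	0x01u
#define __GFP_DMA32	0x02u
#define __GFP_HIGHMEM	0x04u
#define __GFP_MOVABLE	0x08u

/* one octal digit (3 bits) of zone index per value of the low gfp nibble */
#define GFP_MAGIC	0400030102UL

static unsigned int gfp_zone(unsigned int mask)
{
	return (GFP_MAGIC >> ((mask & 0xf) * 3)) & 0x7;
}

int main(void)
{
	assert(gfp_zone(0)             == ZONE_NORMAL);
	assert(gfp_zone(__GFP_DMA)     == ZONE_DMA);
	assert(gfp_zone(__GFP_DMA32)   == ZONE_DMA32);
	assert(gfp_zone(__GFP_HIGHMEM) == ZONE_HIGHMEM);
	assert(gfp_zone(__GFP_MOVABLE) == ZONE_MOVABLE);
	return 0;
}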





^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 04/20] Convert gfp_zone() to use a table of precalculated value
  2009-02-24  1:32             ` KAMEZAWA Hiroyuki
@ 2009-02-24  3:59               ` Nick Piggin
  -1 siblings, 0 replies; 190+ messages in thread
From: Nick Piggin @ 2009-02-24  3:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Christoph Lameter, Linux Memory Management List,
	Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Johannes Weiner,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Tuesday 24 February 2009 12:32:26 KAMEZAWA Hiroyuki wrote:
> On Mon, 23 Feb 2009 16:40:47 +0000
>
> Mel Gorman <mel@csn.ul.ie> wrote:
> > On Mon, Feb 23, 2009 at 10:43:20AM -0500, Christoph Lameter wrote:
> > > On Tue, 24 Feb 2009, Nick Piggin wrote:
> > > > > Are you sure that this is a benefit? Jumps are forward and pretty
> > > > > short and the compiler is optimizing a branch away in the current
> > > > > code.
> > > >
> > > > Pretty easy to mispredict there, though, especially as you can tend
> > > > to get allocations interleaved between kernel and movable (or simply
> > > > if the branch predictor is cold there are a lot of branches on
> > > > x86-64).
> > > >
> > > > I would be interested to know if there is a measured improvement.
> >
> > Not in kernbench at least, but that is no surprise. It's a small
> > percentage of the overall cost. It'll appear in the noise for anything
> > other than micro-benchmarks.
> >
> > > > It
> > > > adds an extra dcache line to the footprint, but OTOH the instructions
> > > > you quote is more than one icache line, and presumably Mel's code
> > > > will be a lot shorter.
> >
> > Yes, it's an index lookup of a shared read-only cache line versus a lot
> > of code with branches to mispredict. I wasn't happy with the cache line
> > consumption but it was the first obvious alternative.
> >
> > > Maybe we can come up with a version of gfp_zone that has no branches
> > > and no lookup?
> >
> > Ideally, yes, but I didn't spot any obvious way of figuring it out at
> > compile time then or now. Suggestions?
>
> Assume
>   ZONE_DMA=0
>   ZONE_DMA32=1
>   ZONE_NORMAL=2
>   ZONE_HIGHMEM=3
>   ZONE_MOVABLE=4
>
> #define __GFP_DMA       ((__force gfp_t)0x01u)
> #define __GFP_DMA32     ((__force gfp_t)0x02u)
> #define __GFP_HIGHMEM   ((__force gfp_t)0x04u)
> #define __GFP_MOVABLE   ((__force gfp_t)0x08u)
>
> #define GFP_MAGIC 0400030102	/* octal; depends on config */
>
> gfp_zone(mask) = (GFP_MAGIC >> ((mask & 0xf) * 3)) & 0x7

Clever!

But I wonder if it is even valid to perform bitwise operations on
the zone bits of the gfp mask? Hmm, I see a few places doing it,
but if we stamped that out, we could just have a simple zone mask
that takes the zone idx out of the gfp, which would be slightly
simpler again and more extendible.

But if it's too hard to avoid the bitwise operations, then your idea
is pretty cool ;)
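
A sketch of the alternative Nick describes, purely illustrative: encode the
zone index itself in the low gfp bits so that gfp_zone() is a plain mask with
no table and no branches. The values below are hypothetical, and the follow-up
below spells out the catch, namely that no __GFP_NORMAL flag exists today.

#include <assert.h>

enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

/* Hypothetical encoding: the low three gfp bits carry a zone index. */
#define __GFP_ZONEMASK	0x7u
#define __GFP_DMA	((unsigned int)ZONE_DMA)
#define __GFP_DMA32	((unsigned int)ZONE_DMA32)
#define __GFP_NORMAL	((unsigned int)ZONE_NORMAL)	/* would have to exist */
#define __GFP_HIGHMEM	((unsigned int)ZONE_HIGHMEM)
#define __GFP_MOVABLE	((unsigned int)ZONE_MOVABLE)
#define __GFP_SOMEFLAG	0x10u	/* stand-in for any non-zone flag */

static enum zone_type gfp_zone(unsigned int gfp_flags)
{
	return (enum zone_type)(gfp_flags & __GFP_ZONEMASK);
}

int main(void)
{
	assert(gfp_zone(__GFP_NORMAL | __GFP_SOMEFLAG) == ZONE_NORMAL);
	assert(gfp_zone(__GFP_MOVABLE)                 == ZONE_MOVABLE);
	return 0;
}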


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 04/20] Convert gfp_zone() to use a table of precalculated value
  2009-02-24  3:59               ` Nick Piggin
@ 2009-02-24  5:20                 ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 190+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-24  5:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mel Gorman, Christoph Lameter, Linux Memory Management List,
	Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Johannes Weiner,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Tue, 24 Feb 2009 14:59:34 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Tuesday 24 February 2009 12:32:26 KAMEZAWA Hiroyuki wrote:
> > On Mon, 23 Feb 2009 16:40:47 +0000
> >
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > > On Mon, Feb 23, 2009 at 10:43:20AM -0500, Christoph Lameter wrote:
> > > > On Tue, 24 Feb 2009, Nick Piggin wrote:
> > > > > > Are you sure that this is a benefit? Jumps are forward and pretty
> > > > > > short and the compiler is optimizing a branch away in the current
> > > > > > code.
> > > > >
> > > > > Pretty easy to mispredict there, though, especially as you can tend
> > > > > to get allocations interleaved between kernel and movable (or simply
> > > > > if the branch predictor is cold there are a lot of branches on
> > > > > x86-64).
> > > > >
> > > > > I would be interested to know if there is a measured improvement.
> > >
> > > Not in kernbench at least, but that is no surprise. It's a small
> > > percentage of the overall cost. It'll appear in the noise for anything
> > > other than micro-benchmarks.
> > >
> > > > > It
> > > > > adds an extra dcache line to the footprint, but OTOH the instructions
> > > > > you quote is more than one icache line, and presumably Mel's code
> > > > > will be a lot shorter.
> > >
> > > Yes, it's an index lookup of a shared read-only cache line versus a lot
> > > of code with branches to mispredict. I wasn't happy with the cache line
> > > consumption but it was the first obvious alternative.
> > >
> > > > Maybe we can come up with a version of gfp_zone that has no branches
> > > > and no lookup?
> > >
> > > Ideally, yes, but I didn't spot any obvious way of figuring it out at
> > > compile time then or now. Suggestions?
> >
> > Assume
> >   ZONE_DMA=0
> >   ZONE_DMA32=1
> >   ZONE_NORMAL=2
> >   ZONE_HIGHMEM=3
> >   ZONE_MOVABLE=4
> >
> > #define __GFP_DMA       ((__force gfp_t)0x01u)
> > #define __GFP_DMA32     ((__force gfp_t)0x02u)
> > #define __GFP_HIGHMEM   ((__force gfp_t)0x04u)
> > #define __GFP_MOVABLE   ((__force gfp_t)0x08u)
> >
> > #define GFP_MAGIC 0400030102	/* octal; depends on config */
> >
> > gfp_zone(mask) = (GFP_MAGIC >> ((mask & 0xf) * 3)) & 0x7
> 
> Clever!
> 
> But I wonder if it is even valid to perform bitwise operations on
> the zone bits of the gfp mask? Hmm, I see a few places doing it,
> but if we stamped that out, we could just have a simple zone mask
> that takes the zone idx out of the gfp, which would be slightly
> simpler again and more extendible.
> 
IIRC, __GFP_MOVABLE works as a flag.

And one trouble is that there is no __GFP_NORMAL flag.

I wrote the following in the old days (before ZONE_MOVABLE). Assume ZONE_NORMAL=2.

// translate gfp_mask to zone_idx
#define __GFP_DMA	(2)
#define __GFP_DMA32	(3)
#define __GFP_HIGHMEM	(1)
#define __GFP_MOVABLE	(6)

gfp_zone(mask)	= (mask & 0x7) ^ 0x2	// ZONE_NORMAL == 2

But this doesn't work: ZONE_NORMAL can be 0, 1 or 2
(and ppc doesn't have ZONE_NORMAL)


Thanks,
-Kame
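
(Worked through with ZONE_NORMAL == 2, the XOR trick does check out for the
cases it was built for: 0^2 = 2 (NORMAL), 2^2 = 0 (DMA), 3^2 = 1 (DMA32),
1^2 = 3 (HIGHMEM) and 6^2 = 4 (MOVABLE); it only falls apart when ZONE_NORMAL
is not 2, which is the point being made above.)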


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] mm: gfp_to_alloc_flags()
  2009-02-23 22:59     ` Andrew Morton
@ 2009-02-24  8:59       ` Peter Zijlstra
  -1 siblings, 0 replies; 190+ messages in thread
From: Peter Zijlstra @ 2009-02-24  8:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mel, linux-mm, penberg, riel, kosaki.motohiro, cl, hannes,
	npiggin, linux-kernel, ming.m.lin, yanmin_zhang

On Mon, 2009-02-23 at 14:59 -0800, Andrew Morton wrote:
> On Mon, 23 Feb 2009 12:55:03 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > +static int gfp_to_alloc_flags(gfp_t gfp_mask)
> > +{
> > +	struct task_struct *p = current;
> > +	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> > +	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > +
> > +	/*
> > +	 * The caller may dip into page reserves a bit more if the caller
> > +	 * cannot run direct reclaim, or if the caller has realtime scheduling
> > +	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> > +	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
> > +	 */
> > +	if (gfp_mask & __GFP_HIGH)
> > +		alloc_flags |= ALLOC_HIGH;
> 
> This could be sped up by making ALLOC_HIGH==__GFP_HIGH (hack)

:-) 

> But really, the whole function can be elided on the fastpath.  Try the
> allocation with the current flags (and __GFP_NOWARN) and only if it
> failed will we try altering the flags to try harder?

It is slowpath only.

After Mel's patches the fast path looks like so:

        page = __get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
                        zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
                        preferred_zone, migratetype);
        if (unlikely(!page))
                page = __alloc_pages_slowpath(gfp_mask, order,
                                zonelist, high_zoneidx, nodemask,
                                preferred_zone, migratetype);


and gfp_to_alloc_flags() is only used in __alloc_pages_slowpath().
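
As a sketch of the hack mentioned above (assuming ALLOC_HIGH were redefined
to share __GFP_HIGH's bit value, which is not the case today), the branch
would collapse into a plain mask-and-or:

	/*
	 * Hypothetical: only valid if ALLOC_HIGH == __GFP_HIGH; the __force
	 * cast keeps sparse quiet about mixing gfp_t with a plain int.
	 */
	alloc_flags |= (__force int)(gfp_mask & __GFP_HIGH);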

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 04/20] Convert gfp_zone() to use a table of precalculated value
  2009-02-24  1:32             ` KAMEZAWA Hiroyuki
@ 2009-02-24 11:36               ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-24 11:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Lameter, Nick Piggin, Linux Memory Management List,
	Pekka Enberg, Rik van Riel, KOSAKI Motohiro, Johannes Weiner,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Tue, Feb 24, 2009 at 10:32:26AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 23 Feb 2009 16:40:47 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > On Mon, Feb 23, 2009 at 10:43:20AM -0500, Christoph Lameter wrote:
> > > On Tue, 24 Feb 2009, Nick Piggin wrote:
> > > 
> > > > > Are you sure that this is a benefit? Jumps are forward and pretty short
> > > > > and the compiler is optimizing a branch away in the current code.
> > > >
> > > > Pretty easy to mispredict there, though, especially as you can tend
> > > > to get allocations interleaved between kernel and movable (or simply
> > > > if the branch predictor is cold there are a lot of branches on x86-64).
> > > >
> > > > I would be interested to know if there is a measured improvement.
> > 
> > Not in kernbench at least, but that is no surprise. It's a small
> > percentage of the overall cost. It'll appear in the noise for anything
> > other than micro-benchmarks.
> > 
> > > > It
> > > > adds an extra dcache line to the footprint, but OTOH the instructions
> > > > you quote is more than one icache line, and presumably Mel's code will
> > > > be a lot shorter.
> > > 
> > 
> > Yes, it's an index lookup of a shared read-only cache line versus a lot
> > of code with branches to mispredict. I wasn't happy with the cache line
> > consumption but it was the first obvious alternative.
> > 
> > > Maybe we can come up with a version of gfp_zone that has no branches and
> > > no lookup?
> > > 
> > 
> > Ideally, yes, but I didn't spot any obvious way of figuring it out at
> > compile time then or now. Suggestions?
> > 
> 
> 
> Assume
>   ZONE_DMA=0
>   ZONE_DMA32=1
>   ZONE_NORMAL=2
>   ZONE_HIGHMEM=3
>   ZONE_MOVABLE=4
> 
> #define __GFP_DMA       ((__force gfp_t)0x01u)
> #define __GFP_DMA32     ((__force gfp_t)0x02u)
> #define __GFP_HIGHMEM   ((__force gfp_t)0x04u)
> #define __GFP_MOVABLE   ((__force gfp_t)0x08u)
> 
> #define GFP_MAGIC (0400030102)	/* depends on config */
> 
> gfp_zone(mask) = (GFP_MAGIC >> ((mask & 0xf)*3)) & 0x7
> 

Clever. I can see how this can be made to work for __GFP_DMA, __GFP_DMA32 and
__GFP_HIGHMEM. However, I'm not currently seeing how __GFP_MOVABLE can be dealt
with properly and quickly. In the above scheme, __GFP_MOVABLE alone would return
zone 4, which appears right but isn't: only __GFP_MOVABLE|__GFP_HIGHMEM
should return 4.

To make that work, you end up with something like the following;

#define GFP_DMA_ZONEMAGIC       0000000100
#define GFP_DMA32_ZONEMAGIC     0000010000
#define GFP_NORMAL_ZONEMAGIC    0000000002
#define GFP_HIGHMEM_ZONEMAGIC   0000000200
#define GFP_MOVABLE_ZONEMAGIC   040000000000ULL
#define GFP_MAGIC (GFP_DMA_ZONEMAGIC|GFP_DMA32_ZONEMAGIC|GFP_NORMAL_ZONEMAGIC|GFP_HIGHMEM_ZONEMAGIC|GFP_MOVABLE_ZONEMAGIC)

static inline int new_gfp_zone(gfp_t flags)
{
	if ((flags & (__GFP_MOVABLE|__GFP_HIGHMEM)) == __GFP_MOVABLE)
		flags &= ~__GFP_MOVABLE;
	return (GFP_MAGIC >> ((flags & 0xf)*3)) & 0x7;
}

so we end up back with branches and mask checks again. Mind you, I also
ended up with a different GFP magic value when actually implementing this,
so I might be missing something else about your suggestion and how it works.
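
A rough user-space sketch of the direction I was experimenting with, using
made-up local names (the real flag bits and zone numbers depend on the kernel
config), which builds the magic constant at compile time and applies the
__GFP_MOVABLE fixup:

#include <stdio.h>

/* Hypothetical stand-ins for the real GFP bits and zone numbers */
#define T_GFP_DMA	0x1u
#define T_GFP_DMA32	0x2u
#define T_GFP_HIGHMEM	0x4u
#define T_GFP_MOVABLE	0x8u

enum { T_ZONE_DMA, T_ZONE_DMA32, T_ZONE_NORMAL, T_ZONE_HIGHMEM, T_ZONE_MOVABLE };

/* Each 3-bit slot, indexed by the low four GFP bits, holds a zone number */
#define T_ZONE_SLOT(gfp, zone)	((unsigned long long)(zone) << ((gfp) * 3))
#define T_ZONE_MAGIC ( \
	T_ZONE_SLOT(0,				T_ZONE_NORMAL)	| \
	T_ZONE_SLOT(T_GFP_DMA,			T_ZONE_DMA)	| \
	T_ZONE_SLOT(T_GFP_DMA32,		T_ZONE_DMA32)	| \
	T_ZONE_SLOT(T_GFP_HIGHMEM,		T_ZONE_HIGHMEM)	| \
	T_ZONE_SLOT(T_GFP_HIGHMEM|T_GFP_MOVABLE, T_ZONE_MOVABLE))

static int sketch_gfp_zone(unsigned int flags)
{
	/* __GFP_MOVABLE only selects ZONE_MOVABLE together with __GFP_HIGHMEM */
	if ((flags & (T_GFP_MOVABLE|T_GFP_HIGHMEM)) == T_GFP_MOVABLE)
		flags &= ~T_GFP_MOVABLE;
	return (T_ZONE_MAGIC >> ((flags & 0xf) * 3)) & 0x7;
}

int main(void)
{
	printf("none            -> %d\n", sketch_gfp_zone(0));
	printf("DMA             -> %d\n", sketch_gfp_zone(T_GFP_DMA));
	printf("HIGHMEM         -> %d\n", sketch_gfp_zone(T_GFP_HIGHMEM));
	printf("HIGHMEM|MOVABLE -> %d\n",
	       sketch_gfp_zone(T_GFP_HIGHMEM|T_GFP_MOVABLE));
	printf("MOVABLE alone   -> %d\n", sketch_gfp_zone(T_GFP_MOVABLE));
	return 0;
}

With the fixup in place, __GFP_MOVABLE on its own falls back to the base zone
while __GFP_MOVABLE|__GFP_HIGHMEM still maps to 4, but the branch is exactly
what we were trying to get rid of.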

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-23 23:53         ` Andrew Morton
@ 2009-02-24 11:51           ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-24 11:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, penberg, riel, kosaki.motohiro, cl, hannes, npiggin,
	linux-kernel, ming.m.lin, yanmin_zhang

On Mon, Feb 23, 2009 at 03:53:13PM -0800, Andrew Morton wrote:
> On Mon, 23 Feb 2009 23:30:30 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > On Mon, Feb 23, 2009 at 01:37:23AM -0800, Andrew Morton wrote:
> > > On Sun, 22 Feb 2009 23:17:29 +0000 Mel Gorman <mel@csn.ul.ie> wrote:
> > > 
> > > > Currently an effort is made to determine if a page is hot or cold when
> > > > it is being freed so that cache hot pages can be allocated to callers if
> > > > possible. However, the reasoning used whether to mark something hot or
> > > > cold is a bit spurious. A profile run of kernbench showed that "cold"
> > > > pages were never freed so it either doesn't happen generally or is so
> > > > rare, it's barely measurable.
> > > > 
> > > > It's dubious as to whether pages are being correctly marked hot and cold
> > > > anyway. Things like page cache and pages being truncated are are considered
> > > > "hot" but there is no guarantee that these pages have been recently used
> > > > and are cache hot. Pages being reclaimed from the LRU are considered
> > > > cold which is logical because they cannot have been referenced recently
> > > > but if the system is reclaiming pages, then we have entered allocator
> > > > slowpaths and are not going to notice any potential performance boost
> > > > because a "hot" page was freed.
> > > > 
> > > > This patch just deletes the concept of freeing hot or cold pages and
> > > > just frees them all as hot.
> > > > 
> > > 
> > > Well yes.  We waffled for months over whether to merge that code originally.
> > > 
> > > What tipped the balance was a dopey microbenchmark which I wrote which
> > > sat in a loop extending (via write()) and then truncating the same file
> > > by 32 kbytes (or thereabouts).  Its performance was increased by a lot
> > > (2x or more, iirc) and no actual regressions were demonstrable, so we
> > > merged it.
> > > 
> > > Could you check that please?  I'd suggest trying various values of 32k,
> > > too.
> > > 
> > 
> > I dug around the archives but hadn't much luck finding the original
> > discussion. I saw some results from around the 2.5.40-mm timeframe that talked
> > about ~60% difference with this benchmark (http://lkml.org/lkml/2002/10/6/174)
> > but didn't find the source. The most solid benchmark report was
> > https://lwn.net/Articles/14761/ where you talked about 1-2% kernel compile
> > improvements, good SpecWEB and a big hike on performance with SDET.
> > 
> > It's not clear-cut. I tried reproducing your original benchmark rather than
> > whinging about not being able to find it :). The source is below so maybe you can
> > tell me if it's equivalent? I only ran it on one CPU which also may be a
> > factor. The results were
> > 
> >     size      with   without difference
> >       64  0.216033  0.558803 -158.67%
> >      128  0.158551  0.150673   4.97%
> >      256  0.153240  0.153488  -0.16%
> >      512  0.156502  0.158769  -1.45%
> >     1024  0.162146  0.163302  -0.71%
> >     2048  0.167001  0.169573  -1.54%
> >     4096  0.175376  0.178882  -2.00%
> >     8192  0.237618  0.243385  -2.43%
> >    16384  0.735053  0.351040  52.24%
> >    32768  0.524731  0.583863 -11.27%
> >    65536  1.149310  1.227855  -6.83%
> >   131072  2.160248  2.084981   3.48%
> >   262144  3.858264  4.046389  -4.88%
> >   524288  8.228358  8.259957  -0.38%
> >  1048576 16.228190 16.288308  -0.37%
> > 
> > with    == Using hot/cold information to place pages at the front or end of
> >         the freelist
> > without == Consider all pages being freed as hot
> 
> My head is spinning.  Smaller is better, right? 

Right. It's measured in time so smaller == faster == better.

> So for 16384-byte
> writes, current mainline is slower?
> 
> That's odd.
> 

Indeed.

> > The results are a bit all over the place, mostly negative, but nowhere near
> > a 60% difference, so the benchmark might be wrong. Oddly, 64 shows a massive
> > regression but 16384 shows a massive improvement. With profiling enabled, it's
> > 
> >       64  0.214873  0.196666   8.47%
> >      128  0.166807  0.162612   2.51%
> >      256  0.170776  0.161861   5.22%
> >      512  0.175772  0.164903   6.18%
> >     1024  0.178835  0.168695   5.67%
> >     2048  0.183769  0.174317   5.14%
> >     4096  0.191877  0.183343   4.45%
> >     8192  0.262511  0.254148   3.19%
> >    16384  0.388201  0.371461   4.31%
> >    32768  0.655402  0.611528   6.69%
> >    65536  1.325445  1.193961   9.92%
> >   131072  2.218135  2.209091   0.41%
> >   262144  4.117233  4.116681   0.01%
> >   524288  8.514915  8.590700  -0.89%
> >  1048576 16.657330 16.708367  -0.31%
> > 
> > Almost the opposite with steady improvements almost all the way through.
> > 
> > With the patch applied, we are still using hot/cold information on the
> > allocation side so I'm somewhat surprised the patch even makes much of a
> > difference. I'd have expected the pages being freed to be mostly hot.
> 
> Oh yeah.  Back in the ancient days, hot-cold-pages was using separate
> magazines for hot and cold pages.  Then Christoph went and mucked with
> it, using a single queue.  That might have affected things.
> 

It might have. The impact is that requests for cold pages can get hot pages
if there are not enough cold pages in the queue so readahead could prevent
an active process getting cache hot pages. I don't think that would have
showed up in the microbenchmark though.

> It would be interesting to go back to a suitably-early kernel to see if
> we broke it sometime after the early quantitative testing.  But I could
> understand you not being so terribly interested ;)
> 

I'm interested, but I now want to put it off for a second or third pass at
giving the page allocator "go faster stripes". It's pure chicken, but the other
patches are lower-hanging fruit, yet hefty enough to deal with on their own.

> > Kernbench was no help figuring this out either.
> > 
> > with:    Elapsed: 74.1625s User: 253.85s System: 27.1s CPU: 378.5%
> > without: Elapsed: 74.0525s User: 252.9s System: 27.3675s CPU: 378.25%
> > 
> > Improvements on elapsed and user time but a regression on system time.
> > 
> > The issue is sufficiently cloudy that I'm just going to drop the patch
> > for now. Hopefully the rest of the patchset is more clear-cut. I'll pick
> > it up again at a later time.
> 
> Well...  if the benefits of the existing code are dubious then we
> should default to deleting it.
> 

Some things haven't changed since 2002 - I can't convince myself it's
either good or bad at the moment so am leaning towards leaving it alone
for now.

> > Here is the microbenchmark I used
> > 
> > Thanks.
> > 
> > /*
> >  * write-truncate.c
> >  * Microbenchmark that tests the speed of write/truncate of small files.
> >  * 
> >  * Suggested by Andrew Morton
> >  * Written by Mel Gorman 2009
> >  */
> > #include <stdio.h>
> > #include <limits.h>
> > #include <unistd.h>
> > #include <sys/types.h>
> > #include <sys/time.h>
> > #include <fcntl.h>
> > #include <stdlib.h>
> > #include <string.h>
> > 
> > #define TESTFILE "./write-truncate-testfile.dat"
> > #define ITERATIONS 10000
> > #define STARTSIZE 32
> > #define SIZES 15
> > 
> > #ifndef MIN
> > #define MIN(x,y) ((x)<(y)?(x):(y))
> > #endif
> > #ifndef MAX
> > #define MAX(x,y) ((x)>(y)?(x):(y))
> > #endif
> > 
> > double whattime()
> > {
> >         struct timeval tp;
> > 
> > 	if (gettimeofday(&tp,NULL) == -1) {
> > 		perror("gettimeofday");
> > 		exit(EXIT_FAILURE);
> > 	}
> > 
> >         return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
> > }
> > 
> > int main(void)
> > {
> > 	int fd;
> > 	int bufsize, sizes, iteration;
> > 	char *buf;
> > 	double t;
> > 
> > 	/* Create test file */
> > 	fd = open(TESTFILE, O_RDWR|O_CREAT|O_EXCL, 0600);
> > 	if (fd == -1) {
> > 		perror("open");
> > 		exit(EXIT_FAILURE);
> > 	}
> > 
> > 	/* Unlink now for cleanup */
> > 	if (unlink(TESTFILE) == -1) {
> > 		perror("unlink");
> > 		exit(EXIT_FAILURE);
> > 	}
> > 
> > 	/* Go through a series of sizes */
> > 	bufsize = STARTSIZE;
> > 	for (sizes = 1; sizes <= SIZES; sizes++) {
> > 		bufsize *= 2;
> > 		buf = malloc(bufsize);
> > 		if (buf == NULL) {
> > 			printf("ERROR: Malloc failed\n");
> > 			exit(EXIT_FAILURE);
> > 		}
> > 		memset(buf, 0xE0, bufsize);
> > 
> > 		t = whattime();
> > 		for (iteration = 0; iteration < ITERATIONS; iteration++) {
> > 			size_t written = 0; ssize_t thiswrite;
> > 			
> > 			while (written != bufsize) {
> > 				thiswrite = write(fd, buf, bufsize);
> 
> (it should write bufsize-written ;))
> 

D'oh - and write the buffer from buf + written. Not that it matters in this
case as it's all just fake data. I wonder did this mistake cause the spike
at 16384 hmmm....
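
For reference, the corrected inner loop would look something like this
(untested sketch):

	while (written != bufsize) {
		/* resume each partial write where the last one stopped */
		ssize_t thiswrite = write(fd, buf + written, bufsize - written);

		if (thiswrite < 0) {
			perror("write");
			exit(EXIT_FAILURE);
		}
		written += thiswrite;
	}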

> > 				if (thiswrite == -1) {
> > 					perror("write");
> > 					exit(EXIT_FAILURE);
> > 				}
> > 				written += thiswrite;
> > 			}
> > 
> > 			if (ftruncate(fd, 0) == -1) {
> > 				perror("ftruncate");
> > 				exit(EXIT_FAILURE);
> > 			}
> > 
> > 			if (lseek(fd, 0, SEEK_SET) != 0) {
> > 				perror("lseek");
> > 				exit(EXIT_FAILURE);
> > 			}
> > 		}
> 
> yup, I think that captures the same idea.
> 
> > 		t = whattime() - t;
> > 		free(buf);
> > 
> > 		printf("%d %f\n", bufsize, t);
> > 	}
> > 
> > 	if (close(fd) == -1) {
> > 		perror("close");
> > 		exit(EXIT_FAILURE);
> > 	}
> > 
> > 	exit(EXIT_SUCCESS);
> > }

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 11/20] Inline get_page_from_freelist() in the fast-path
  2009-02-23 15:32     ` Nick Piggin
@ 2009-02-24 13:32       ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-24 13:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Tue, Feb 24, 2009 at 02:32:37AM +1100, Nick Piggin wrote:
> On Monday 23 February 2009 10:17:20 Mel Gorman wrote:
> > In the best-case scenario, use an inlined version of
> > get_page_from_freelist(). This increases the size of the text but avoids
> > time spent pushing arguments onto the stack.
> 
> I'm quite fond of inlining ;) But it can increase register pressure as
> well as icache footprint as well. x86-64 isn't spilling a lot more
> registers to stack after these changes, is it?
> 

I didn't actually check that closely so I don't know for sure. Is there a
handier way of figuring it out than eyeballing the assembly? In the end
I dropped the inline of this function anyway. It means the patches
reduce rather than increase text size which is a bit more clear-cut.

> Also,
> 
> 
> > @@ -1780,8 +1791,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int
> > order, if (!preferred_zone)
> >  		return NULL;
> >
> > -	/* First allocation attempt */
> > -	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> > +	/* First allocation attempt. Fastpath uses inlined version */
> > +	page = __get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> >  			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
> >  			preferred_zone, migratetype);
> >  	if (unlikely(!page))
> 
> I think in a common case where there is background reclaim going on,
> it will be quite common to fail this, won't it? (I haven't run
> statistics though).
> 

Good question. It would be common to fail when background reclaim has
been kicked off for the first time, but once we are back over the low
watermark, background reclaim will continue even though we are
allocating pages. I recall that there is a profile likely/unlikely debug
option. I don't recall using it before but now might be a good time to
fire it up.

> In which case you will get extra icache footprint. What speedup does
> it give in the cache-hot microbenchmark case?
> 

I wasn't measuring with a microbenchmark at the time of writing so I don't
know. I was going entirely by profile counts running kernbench and the
time spent running the benchmark.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 11/20] Inline get_page_from_freelist() in the fast-path
  2009-02-24 13:32       ` Mel Gorman
@ 2009-02-24 14:08         ` Nick Piggin
  -1 siblings, 0 replies; 190+ messages in thread
From: Nick Piggin @ 2009-02-24 14:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Wednesday 25 February 2009 00:32:53 Mel Gorman wrote:
> On Tue, Feb 24, 2009 at 02:32:37AM +1100, Nick Piggin wrote:
> > On Monday 23 February 2009 10:17:20 Mel Gorman wrote:
> > > In the best-case scenario, use an inlined version of
> > > get_page_from_freelist(). This increases the size of the text but
> > > avoids time spent pushing arguments onto the stack.
> >
> > I'm quite fond of inlining ;) But it can increase register pressure as
> > well as icache footprint as well. x86-64 isn't spilling a lot more
> > registers to stack after these changes, is it?
>
> I didn't actually check that closely so I don't know for sure. Is there a
> handier way of figuring it out than eyeballing the assembly? In the end

I guess the 5 second check is to look at how much stack the function
uses. OTOH I think gcc does do a reasonable job at register allocation.


> I dropped the inline of this function anyway. It means the patches
> reduce rather than increase text size which is a bit more clear-cut.

Cool, clear cut patches for round 1 should help to get things moving.


> > In which case you will get extra icache footprint. What speedup does
> > it give in the cache-hot microbenchmark case?
>
> I wasn't measuring with a microbenchmark at the time of writing so I don't
> know. I was going entirely by profile counts running kernbench and the
> time spent running the benchmark.

OK. Well seeing as you have dropped this for the moment, let's not
dwell on it ;)


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC PATCH 00/20] Cleanup and optimise the page allocator
  2009-02-23 17:49       ` Andi Kleen
@ 2009-02-24 14:32         ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-24 14:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Mon, Feb 23, 2009 at 06:49:48PM +0100, Andi Kleen wrote:
> > hmm, it would be ideal but I haven't looked too closely at how it could
> > be implemented. I thought first you could just associate a zonelist with
> 
> Yes like that. This was actually discussed during the initial cpuset
> implementation. I thought back then it would be better to do it
> elsewhere, but changed my mind later when I saw the impact on the
> fast path.
> 

Back then there would have been other anomalies as well, such as
MPOL_BIND using zones in the wrong order. Zeroing would still have
dominated the cost of the allocation and slab would hide other details.
Hindsight is 20/20 and all that.

Right now, I don't think cpusets are a dominant factor for most setups but
I'm open to being convinced otherwise. For now, I'm happy if it's just shoved
a bit more to the side in the non-cpuset case. Like the CPU cache hot/cold
path, it might be best to leave it for a second or third pass and tackle
the low-hanging fruit in the first pass.

> > the cpuset but you'd need one for each node allowed by the cpuset so it
> > could get quite large. Then again, it might be worthwhile if cpusets
> 
> Yes you would need one per node, but that's not a big problem because
> systems with lots of nodes are also expected to have lots of memory.
> Most systems have a very small number of nodes.
> 

That's a fair point on the memory consumption. There might be issues
with the cache consumption but if the cpuset is being heavily used for an
allocation-intensive workload then it probably will not be noticeable.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 11/20] Inline get_page_from_freelist() in the fast-path
  2009-02-24 14:08         ` Nick Piggin
@ 2009-02-24 15:03           ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-24 15:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management List, Pekka Enberg, Rik van Riel,
	KOSAKI Motohiro, Christoph Lameter, Johannes Weiner, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin

On Wed, Feb 25, 2009 at 01:08:10AM +1100, Nick Piggin wrote:
> On Wednesday 25 February 2009 00:32:53 Mel Gorman wrote:
> > On Tue, Feb 24, 2009 at 02:32:37AM +1100, Nick Piggin wrote:
> > > On Monday 23 February 2009 10:17:20 Mel Gorman wrote:
> > > > In the best-case scenario, use an inlined version of
> > > > get_page_from_freelist(). This increases the size of the text but
> > > > avoids time spent pushing arguments onto the stack.
> > >
> > > I'm quite fond of inlining ;) But it can increase register pressure as
> > > well as icache footprint as well. x86-64 isn't spilling a lot more
> > > registers to stack after these changes, is it?
> >
> > I didn't actually check that closely so I don't know for sure. Is there a
> > handier way of figuring it out than eyeballing the assembly? In the end
> 
> I guess the 5 second check is to look at how much stack the function
> uses. OTOH I think gcc does do a reasonable job at register allocation.
> 

FWIW, from a glance at the assembly, 6 registers get pushed onto the stack
by the calling function. According to the profile, about 7% of the cost of
the get_page_from_freelist() function is incurred by setting up and making
the function call. This is 2755 samples out of 35266. To compare, the cost
of zeroing was 192574 samples.

So, it's a good chunk of time, but in the grand scheme of things, time is
better spent optimising elsewhere for now.

> 
> > I dropped the inline of this function anyway. It means the patches
> > reduce rather than increase text size which is a bit more clear-cut.
> 
> Cool, clear cut patches for round 1 should help to get things moving.
> 

Indeed

> 
> > > In which case you will get extra icache footprint. What speedup does
> > > it give in the cache-hot microbenchmark case?
> >
> > I wasn't measuring with a microbenchmark at the time of writing so I don't
> > know. I was going entirely by profile counts running kernbench and the
> > time spent running the benchmark.
> 
> OK. Well seeing as you have dropped this for the moment, let's not
> dwell on it ;)
> 

Agreed.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-24 11:51           ` Mel Gorman
@ 2009-02-25  0:01             ` Andrew Morton
  -1 siblings, 0 replies; 190+ messages in thread
From: Andrew Morton @ 2009-02-25  0:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, penberg, riel, kosaki.motohiro, cl, hannes, npiggin,
	linux-kernel, ming.m.lin, yanmin_zhang

On Tue, 24 Feb 2009 11:51:26 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> > > Almost the opposite with steady improvements almost all the way through.
> > > 
> > > With the patch applied, we are still using hot/cold information on the
> > > allocation side so I'm somewhat surprised the patch even makes much of a
> > > difference. I'd have expected the pages being freed to be mostly hot.
> > 
> > Oh yeah.  Back in the ancient days, hot-cold-pages was using separate
> > magazines for hot and cold pages.  Then Christoph went and mucked with
> > it, using a single queue.  That might have affected things.
> > 
> 
> It might have. The impact is that requests for cold pages can get hot pages
> if there are not enough cold pages in the queue so readahead could prevent
> an active process getting cache hot pages. I don't think that would have
> showed up in the microbenchmark though.

We switched to doing non-temporal stores in copy_from_user(), didn't
we?  That would rub out the benefit which that microbenchmark
demonstrated?


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-25  0:01             ` Andrew Morton
@ 2009-02-25 16:01               ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-25 16:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, penberg, riel, kosaki.motohiro, cl, hannes, npiggin,
	linux-kernel, ming.m.lin, yanmin_zhang

On Tue, Feb 24, 2009 at 04:01:03PM -0800, Andrew Morton wrote:
> On Tue, 24 Feb 2009 11:51:26 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > > > Almost the opposite with steady improvements almost all the way through.
> > > > 
> > > > With the patch applied, we are still using hot/cold information on the
> > > > allocation side so I'm somewhat surprised the patch even makes much of a
> > > > difference. I'd have expected the pages being freed to be mostly hot.
> > > 
> > > Oh yeah.  Back in the ancient days, hot-cold-pages was using separate
> > > magazines for hot and cold pages.  Then Christoph went and mucked with
> > > it, using a single queue.  That might have affected things.
> > > 
> > 
> > It might have. The impact is that requests for cold pages can get hot pages
> > if there are not enough cold pages in the queue so readahead could prevent
> > an active process getting cache hot pages. I don't think that would have
> > showed up in the microbenchmark though.
> 
> We switched to doing non-temporal stores in copy_from_user(), didn't
> we? 

We do? I would have missed something like that but luckily I took a profile
of the microbenchmark and, what do you know, we spent 17053 profile samples in
__copy_user_nocache(). It's not quite copy_from_user() but it's close. Thanks
for pointing that out!

For anyone watching, copy_from_user() itself and the functions it calls do
not use non-temporal stores. At least, I am not seeing the nt variants of
mov in the assembly I looked at. __copy_user_nocache() on the other hand
uses movnt and it's called in the generic_file_buffered_write() path which
this micro-benchmark is optimising.
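
To show what I mean by non-temporal stores, here is a rough userspace sketch
using the SSE2 intrinsics. copy_nocache() is a made-up name and this only
illustrates the movnt idea; it is not the kernel's hand-written
__copy_user_nocache() assembly:

	#include <emmintrin.h>	/* SSE2: _mm_loadu_si128, _mm_stream_si128 */
	#include <stddef.h>

	/*
	 * Copy len bytes (len a multiple of 16, dst 16-byte aligned) with
	 * non-temporal stores so the destination lines bypass the cache
	 * instead of displacing whatever is currently hot in it.
	 */
	static void copy_nocache(void *dst, const void *src, size_t len)
	{
		__m128i *d = dst;
		const __m128i *s = src;
		size_t i;

		for (i = 0; i < len / 16; i++)
			_mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
		_mm_sfence();	/* order the streaming stores */
	}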

> That would rub out the benefit which that microbenchmark
> demonstrated?
> 

It'd impact it for sure. Due to the non-temporal stores, I'm surprised
there is any measurable impact from the patch.  This has likely been the
case since commit 0812a579c92fefa57506821fa08e90f47cb6dbdd. My reading of
this (someone correct/enlighten) is that even if the data was cache hot,
it is pushed out as a result of the non-temporal access.

The changelog doesn't give the reasoning for using uncached accesses, but
maybe it's because, for filesystem writes, the data is not expected to be
accessed by the CPU again and the storage device driver has less work to do
to ensure the data in memory is not dirty in cache. (This is speculation; I
don't know for sure what the expected benefit is meant to be, but it might
be in the manual, I'll check later.)

Thinking of alternative microbenchmarks that might show this up....

Repeated setup, populate and teardown of pagetables might show up something,
as it should benefit if the pages were cache hot, but the cost of faulting
and zeroing might hide it.

Maybe a microbenchmark that creates/deletes many small (or empty) files
or reads large directories might benefit from cache hotness as the slab
pages would have to be allocated, populated and then freed back to the
allocator. I'll give it a shot but alternative suggestions are welcome.
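
For the create/delete case, something as simple as the following rough
sketch, run under time(1) with profiling enabled, is what I have in mind
(the file count and names are arbitrary):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	/*
	 * Hammer the dentry/inode slabs by creating and unlinking many empty
	 * files in the current directory.
	 */
	int main(void)
	{
		char name[64];
		int i, fd;

		for (i = 0; i < 100000; i++) {
			snprintf(name, sizeof(name), "bench-%d.tmp", i);
			fd = open(name, O_CREAT | O_WRONLY, 0600);
			if (fd < 0) {
				perror("open");
				return 1;
			}
			close(fd);
			unlink(name);
		}
		return 0;
	}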

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-25 16:01               ` Mel Gorman
@ 2009-02-25 16:19                 ` Andrew Morton
  -1 siblings, 0 replies; 190+ messages in thread
From: Andrew Morton @ 2009-02-25 16:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, penberg, riel, kosaki.motohiro, cl, hannes, npiggin,
	linux-kernel, ming.m.lin, yanmin_zhang

On Wed, 25 Feb 2009 16:01:25 +0000 Mel Gorman <mel@csn.ul.ie> wrote:

> ...
>
> > That would rub out the benefit which that microbenchmark
> > demonstrated?
> > 
> 
> It'd impact it for sure. Due to the non-temporal stores, I'm surprised
> there is any measurable impact from the patch.  This has likely been the
> case since commit 0812a579c92fefa57506821fa08e90f47cb6dbdd. My reading of
> this (someone correct/enlighten) is that even if the data was cache hot,
> it is pushed out as a result of the non-temporal access.

yup, that's my understanding.

> The changelog doesn't give the reasoning for using uncached accesses but maybe
> it's because for filesystem writes, it is not expected that the data will be
> accessed by the CPU any more and the storage device driver has less work to
> do to ensure the data in memory is not dirty in cache (this is speculation,
> I don't know for sure what the expected benefit is meant to be but it might
> be in the manual, I'll check later).
> 
> Thinking of alternative microbenchmarks that might show this up....

Well, 0812a579c92fefa57506821fa08e90f47cb6dbdd is being actively
discussed over in the "Performance regression in write() syscall"
thread.  There are patches there which disable the movnt for
less-than-PAGE_SIZE copies.  Perhaps adapt those to disable movnt
altogether and then see whether the use of movnt broke the advantages
which hot-cold-pages gave us?

argh.

Sorry to be pushing all this kernel archeology at you.  Sometimes I
think we're insufficiently careful about removing old stuff - it can
turn out that it wasn't that bad after all!  (cf slab.c...)

> Repeated setup, populate and teardown of pagetables might show up something
> as it should benefit if the pages were cache hot but the cost of faulting
> and zeroing might hide it.

Well, pagetables have been churning for years, with special-cased
magazining, quicklists, magazining of known-to-be-zeroed pages, etc.

I've always felt that we're doing that wrong, or at least awkwardly. 
If the magazining which the core page allocator does is up-to-snuff
then it _should_ be usable for pagetables.

The magazining of known-to-be-zeroed pages is a new requirement.  But
it absolutely should not be done private to the pagetable page
allocator (if it still is - I forget), because there are other
callsites in the kernel which want cache-hot zeroed pages, and there
are probably other places which free up known-to-be-zeroed pages.


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-25 16:01               ` Mel Gorman
@ 2009-02-25 18:33                 ` Christoph Lameter
  -1 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-02-25 18:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, penberg, riel, kosaki.motohiro, hannes,
	npiggin, linux-kernel, ming.m.lin, yanmin_zhang

On Wed, 25 Feb 2009, Mel Gorman wrote:

> It'd impact it for sure. Due to the non-temporal stores, I'm surprised
> there is any measurable impact from the patch.  This has likely been the
> case since commit 0812a579c92fefa57506821fa08e90f47cb6dbdd. My reading of
> this (someone correct/enlighten) is that even if the data was cache hot,
> it is pushed out as a result of the non-temporal access.

A nontemporal store simply does not set the used flag for the cacheline,
so the CPU cache LRU will evict the cacheline sooner. That's at least how
it works on IA-64.


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-25 16:19                 ` Andrew Morton
@ 2009-02-26 16:37                   ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-26 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, penberg, riel, kosaki.motohiro, cl, hannes, npiggin,
	linux-kernel, ming.m.lin, yanmin_zhang

On Wed, Feb 25, 2009 at 08:19:54AM -0800, Andrew Morton wrote:
> On Wed, 25 Feb 2009 16:01:25 +0000 Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > ...
> >
> > > That would rub out the benefit which that microbenchmark
> > > demonstrated?
> > > 
> > 
> > It'd impact it for sure. Due to the non-temporal stores, I'm surprised
> > there is any measurable impact from the patch.  This has likely been the
> > case since commit 0812a579c92fefa57506821fa08e90f47cb6dbdd. My reading of
> > this (someone correct/enlighten) is that even if the data was cache hot,
> > it is pushed out as a result of the non-temporal access.
> 
> yup, that's my understanding.
> 
> > The changelog doesn't give the reasoning for using uncached accesses but maybe
> > it's because for filesystem writes, it is not expected that the data will be
> > accessed by the CPU any more and the storage device driver has less work to
> > do to ensure the data in memory is not dirty in cache (this is speculation,
> > I don't know for sure what the expected benefit is meant to be but it might
> > be in the manual, I'll check later).
> > 
> > Thinking of alternative microbenchmarks that might show this up....
> 
> Well, 0812a579c92fefa57506821fa08e90f47cb6dbdd is being actively
> discussed over in the "Performance regression in write() syscall"
> thread.  There are patches there which disable the movnt for
> less-than-PAGE_SIZE copies.  Perhaps adapt those to disable movnt
> altogether to then see whether the use of movnt broke the advantages
> which hot-cold-pages gave us?
> 

I checked just what that patch was doing with write-truncate and the results
show that using temporal access for small files appeared to have a huge
positive difference for the microbenchmark. It also showed that hot/cold
freeing (i.e. the current code) was a gain when temporal accesses were used
but then I saw a big problem with the benchmark.

The deviations between runs are huge - really huge - and I had missed that
before. I redid the test to run a larger number of iterations, and then ran
it 20 times in a row on a kernel with hot/cold freeing, and I got:

size          avg   stddev
      64 3.337564 0.619085
     128 2.753963 0.461398
     256 2.556934 0.461848
     512 2.736831 0.475484
    1024 2.561668 0.470887
    2048 2.719766 0.478039
    4096 2.963039 0.407311
    8192 4.043475 0.236713
   16384 6.098094 0.249132
   32768 9.439190 0.143978

where size is the size of the write/truncate, avg is the average time and
stddev is the standard deviation. For small sizes, the deviation is too
large to draw any reasonable conclusion from the microbenchmark. Factors
like scheduling, whether a sync happened and a host of other issues muck up
the results.

More importantly, I then checked how many times we freed cold pages during
the test and the answer is ..... *never*. They were all hot page releases,
which is what my patch originally forced, and the profiles agreed because
they showed no samples in the "if (cold)" branch. Cold pages were only freed
if I made kswapd kick off, which matched my original expectation: a system
that is reclaiming is already polluting the cache with scanning, so cache
hotness is not important there.

Based on that nugget, the patch makes common sense because we never take the
cold branch at a time we care about. Common sense also tells me the patch
should be an improvement because the pagevec is smaller. Proving it's a good
change is not working out very well at all.

> argh.
> 
> Sorry to be pushing all this kernel archeology at you.

Don't be. This was time well spent in my opinion.

> Sometimes I
> think we're insufficiently careful about removing old stuff - it can
> turn out that it wasn't that bad after all!  (cf slab.c...)
> 

Agreed. Better safe than sorry.

> > Repeated setup, populate and teardown of pagetables might show up something
> > as it should benefit if the pages were cache hot but the cost of faulting
> > and zeroing might hide it.
> 
> Well, pagetables have been churning for years, with special-cased
> magazining, quicklists, magazining of known-to-be-zeroed pages, etc.
> 

The known-to-be-zeroed pages idea is interesting and something I tried but
didn't get far enough with. One patch I did but didn't release would zero
pages on the free path if the process was exiting or if it was kswapd. It
tracked whether the page was zero by using page->index to record the order
of the zeroed page. On allocation, it would check page->index and, if the
order matched, would not zero a second time. I got this working for order-0
pages reliably but it didn't gain anything because we were zeroing even more
than we had to in the free path.
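
Roughly, the idea looked like the following. This is a simplified sketch
rather than the patch itself, the helper names are made up, and the real
thing also has to make sure page->index holds a sentinel whenever the page
contents are unknown:

	#include <linux/mm.h>
	#include <linux/highmem.h>

	/* free path: the caller (exiting process, kswapd) zeroed the page */
	static inline void mark_free_page_zeroed(struct page *page,
						 unsigned int order)
	{
		page->index = order;
	}

	static inline bool free_page_is_zeroed(struct page *page,
					       unsigned int order)
	{
		return page->index == order;
	}

	/* allocation path: only pay for the clear when we have to */
	static void prep_zeroed_page(struct page *page, unsigned int order,
				     gfp_t gfp_flags)
	{
		int i;

		if ((gfp_flags & __GFP_ZERO) && !free_page_is_zeroed(page, order))
			for (i = 0; i < (1 << order); i++)
				clear_highpage(page + i);
	}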

I should have gone at the pagetable pages as a source of zeroed pages that
required no additional work and said "screw it, I'll release what I have
and see what happens".

> I've always felt that we're doing that wrong, or at least awkwardly. 
> If the magazining which the core page allocator does is up-to-snuff
> then it _should_ be usable for pagetables.
> 

If pagetable pages were known to be zero and handed back to an allocator
that remembered zeroed pages, I bet we'd get a win.

> The magazining of known-to-be-zeroed pages is a new requirement. 

They don't even need a separate magazine. Put them back on the lists and
record if they are zero with page->index. Granted, it means a caller will
sometimes get pages that are zeroed when they don't need to be, but I think
it'd be better than larger structures or searches.

> But
> it absolutely should not be done private to the pagetable page
> allocator (if it still is - I forget), because there are other
> callsites in the kernel which want cache-hot zeroed pages, and there
> are probably other places which free up known-to-be-zeroed pages.
> 

Agreed. I believe we can do it in the allocator too using page->index if I
am understanding you properly.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-26 16:37                   ` Mel Gorman
@ 2009-02-26 17:00                     ` Christoph Lameter
  -1 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-02-26 17:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, penberg, riel, kosaki.motohiro, hannes,
	npiggin, linux-kernel, ming.m.lin, yanmin_zhang

On Thu, 26 Feb 2009, Mel Gorman wrote:

> The known-to-be-zeroed pages idea is interesting and something I tried but
> didn't get far enough with. One patch I did but didn't release would zero
> pages on the free path if the process was exiting or if it was kswapd. It
> tracked whether the page was zero by using page->index to record the order
> of the zeroed page. On allocation, it would check page->index and, if the
> order matched, would not zero a second time. I got this working for order-0
> pages reliably but it didn't gain anything because we were zeroing even more
> than we had to in the free path.

I tried the general use of a pool of zeroed pages back in 2005. Zeroing
made sense only if the code allocating the page did not immediately touch
the cachelines of the page. The more cachelines were touched the less the
benefit. If the page is written to immediately afterwards then the zeroing
simply warms up the caches.

Page table pages are different: we may only write to a few cachelines in
the page. There it makes sense, and that is why we have the special
quicklists there.

> If pagetable pages were known to be zero and handed back to an allocator
> that remembered zeroed pages, I bet we'd get a win.

We have quicklists that do this on various platforms.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-26 17:00                     ` Christoph Lameter
@ 2009-02-26 17:15                       ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-02-26 17:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-mm, penberg, riel, kosaki.motohiro, hannes,
	npiggin, linux-kernel, ming.m.lin, yanmin_zhang

On Thu, Feb 26, 2009 at 12:00:22PM -0500, Christoph Lameter wrote:
> On Thu, 26 Feb 2009, Mel Gorman wrote:
> 
> > The known-to-be-zeroed pages idea is interesting and something I tried but
> > didn't get far enough with. One patch I did but didn't release would zero
> > pages on the free path if the process was exiting or if it was kswapd. It
> > tracked whether the page was zero by using page->index to record the order
> > of the zeroed page. On allocation, it would check page->index and, if the
> > order matched, would not zero a second time. I got this working for order-0
> > pages reliably but it didn't gain anything because we were zeroing even more
> > than we had to in the free path.
> 
> I tried the general use of a pool of zeroed pages back in 2005. Zeroing
> made sense only if the code allocating the page did not immediately touch
> the cachelines of the page.

Any feeling as to how often this was the case?

> The more cachelines were touched the less the
> benefit. If the page is written to immediately afterwards then the zeroing
> simply warms up the caches.
> 
> page table pages are different. We may only write to a few cachelines in
> the page. There it makes sense and that is why we have the special
> quicklists there.
> 
> > If pagetable pages were known to be zero and handed back to an allocator
> > that remembered zeroed pages, I bet we'd get a win.
> 
> We have quicklists that do this on various platforms.
> 

Indeed, any gain, if it existed, would come from avoiding zeroing the pages
used by userspace. The cleanup would be in reducing the amount of
architecture-specific code.

I reckon it's worth investigating but there is still other lower-hanging
fruit.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-26 17:15                       ` Mel Gorman
@ 2009-02-26 17:30                         ` Christoph Lameter
  -1 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-02-26 17:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, penberg, riel, kosaki.motohiro, hannes,
	npiggin, linux-kernel, ming.m.lin, yanmin_zhang

On Thu, 26 Feb 2009, Mel Gorman wrote:

> > I tried the general use of a pool of zeroed pages back in 2005. Zeroing
> > made sense only if the code allocating the page did not immediately touch
> > the cachelines of the page.
>
> Any feeling as to how often this was the case?

Not often enough to justify the merging of my patches at the time. This
was publicly discussed on lkml:

http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/0482.html

> Indeed, any gain if it existed would be avoiding zeroing the pages used
> by userspace. The cleanup would be reducing the amount of
> architecture-specific code.
>
> I reckon it's worth an investigate but there is still other lower-lying
> fruit.

I hope we can get rid of various ugly elements of the quicklists if the
page allocator would offer some sort of support. I would think that the page
allocator's slow allocation and freeing behavior is also a factor that makes
quicklists advantageous. The quicklist page lists are simply a linked list
of pages and a page can simply be dequeued and used.
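
The core of it is nothing more than the following (an illustrative userspace
sketch of the concept with made-up names, not the code in
include/linux/quicklist.h):

	#include <stdlib.h>

	#define PAGE_SIZE	4096

	/*
	 * Free "pages" are chained through their first word, so allocation
	 * is a LIFO dequeue and the page comes back cache hot.  For page
	 * tables the page is still zeroed apart from that first word, which
	 * is cleared again on allocation.
	 */
	struct quicklist {
		void *head;
		int count;
	};

	static void *ql_alloc(struct quicklist *ql)
	{
		void *page = ql->head;

		if (!page)
			return calloc(1, PAGE_SIZE);	/* fall back to the allocator */
		ql->head = *(void **)page;
		*(void **)page = NULL;	/* restore the word used as a link */
		ql->count--;
		return page;
	}

	static void ql_free(struct quicklist *ql, void *page)
	{
		*(void **)page = ql->head;
		ql->head = page;
		ql->count++;
	}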



^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-26 17:30                         ` Christoph Lameter
@ 2009-02-27 11:33                           ` Nick Piggin
  -1 siblings, 0 replies; 190+ messages in thread
From: Nick Piggin @ 2009-02-27 11:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Andrew Morton, linux-mm, penberg, riel,
	kosaki.motohiro, hannes, linux-kernel, ming.m.lin, yanmin_zhang

On Thu, Feb 26, 2009 at 12:30:45PM -0500, Christoph Lameter wrote:
> On Thu, 26 Feb 2009, Mel Gorman wrote:
> 
> > > I tried the general use of a pool of zeroed pages back in 2005. Zeroing
> > > made sense only if the code allocating the page did not immediately touch
> > > the cachelines of the page.
> >
> > Any feeling as to how often this was the case?
> 
> Not often enough to justify the merging of my patches at the time. This
> was publicly discussed on lkml:
> 
> http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/0482.html
> 
> > Indeed, any gain if it existed would be avoiding zeroing the pages used
> > by userspace. The cleanup would be reducing the amount of
> > architecture-specific code.
> >
> > I reckon it's worth an investigate but there is still other lower-lying
> > fruit.
> 
> I hope we can get rid of various ugly elements of the quicklists if the
> page allocator would offer some sort of support. I would think that the

Only if it provides significant advantages over existing quicklists or
adds *no* extra overhead to the page allocator common cases. :)
 

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-26 17:15                       ` Mel Gorman
@ 2009-02-27 11:38                         ` Nick Piggin
  -1 siblings, 0 replies; 190+ messages in thread
From: Nick Piggin @ 2009-02-27 11:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, Andrew Morton, linux-mm, penberg, riel,
	kosaki.motohiro, hannes, linux-kernel, ming.m.lin, yanmin_zhang

On Thu, Feb 26, 2009 at 05:15:49PM +0000, Mel Gorman wrote:
> On Thu, Feb 26, 2009 at 12:00:22PM -0500, Christoph Lameter wrote:
> > I tried the general use of a pool of zeroed pages back in 2005. Zeroing
> > made sense only if the code allocating the page did not immediately touch
> > the cachelines of the page.
> 
> Any feeling as to how often this was the case?

IMO background zeroing or anything like that is only going to
become less attractive. Heat and energy considerations are
relatively increasing, so doing speculative work in the kernel
is going to become relatively more costly. Especially in this
case where you use nontemporal stores or otherwise reduce the
efficiency of the CPU caches (and increase activity on bus and
memory).

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-27 11:33                           ` Nick Piggin
@ 2009-02-27 15:40                             ` Christoph Lameter
  -1 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-02-27 15:40 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mel Gorman, Andrew Morton, linux-mm, penberg, riel,
	kosaki.motohiro, hannes, linux-kernel, ming.m.lin, yanmin_zhang

On Fri, 27 Feb 2009, Nick Piggin wrote:

> > I hope we can get rid of various ugly elements of the quicklists if the
> > page allocator would offer some sort of support. I would think that the
>
> Only if it provides significant advantages over existing quicklists or
> adds *no* extra overhead to the page allocator common cases. :)

And only if the page allocator gets fast enough to be usable for
allocs instead of quicklists.


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-27 11:38                         ` Nick Piggin
@ 2009-03-01 10:37                           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 190+ messages in thread
From: KOSAKI Motohiro @ 2009-03-01 10:37 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, Mel Gorman, Christoph Lameter, Andrew Morton,
	linux-mm, penberg, riel, hannes, linux-kernel, ming.m.lin,
	yanmin_zhang

> On Thu, Feb 26, 2009 at 05:15:49PM +0000, Mel Gorman wrote:
> > On Thu, Feb 26, 2009 at 12:00:22PM -0500, Christoph Lameter wrote:
> > > I tried the general use of a pool of zeroed pages back in 2005. Zeroing
> > > made sense only if the code allocating the page did not immediately touch
> > > the cachelines of the page.
> > 
> > Any feeling as to how often this was the case?
> 
> IMO background zeroing or anything like that is only going to
> become less attractive. Heat and energy considerations are
> relatively increasing, so doing speculative work in the kernel
> is going to become relatively more costly. 

IMHO..

In general, the value of any speculative approach depends on the
forecast hit rate.

e.g. readahead is very effective for sequential read workloads, but
not effective for random access workloads.

If zeroed pages were always consumed via alloc_pages(GFP_ZERO), pre-zeroing
would still be valuable even nowadays, but that cannot be guaranteed.

Then, (IMHO) the remaining problems are:
  - What statistic is a good measurement for forecasting future zero-page
    demand? (a rough, purely illustrative sketch follows below)
  - How do we implement it?


I have no idea yet..
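
(Just to make the question concrete, one purely hypothetical statistic would
be a decaying average of the recent GFP_ZERO allocation rate, sampled
periodically. The names below are made up.)

	/*
	 * Purely hypothetical: track an exponentially decaying average of
	 * GFP_ZERO allocations per sampling interval.  Background zeroing
	 * could compare this against the number of already-zeroed free
	 * pages to decide whether pre-zeroing is worthwhile.
	 */
	#define DEMAND_SHIFT	3	/* a new sample gets 1/8 weight */

	struct zero_demand {
		unsigned long avg;	/* decaying average of allocs per interval */
		unsigned long window;	/* GFP_ZERO allocs in the current interval */
	};

	static void zero_demand_note_alloc(struct zero_demand *zd)
	{
		zd->window++;
	}

	/* called once per sampling interval, e.g. from a timer */
	static void zero_demand_sample(struct zero_demand *zd)
	{
		zd->avg -= zd->avg >> DEMAND_SHIFT;
		zd->avg += zd->window >> DEMAND_SHIFT;
		zd->window = 0;
	}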


> Especially in this
> case where you use nontemporal stores or otherwise reduce the
> efficiency of the CPU caches (and increase activity on bus and
> memory).




^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-02-27 15:40                             ` Christoph Lameter
@ 2009-03-03 13:52                               ` Mel Gorman
  -1 siblings, 0 replies; 190+ messages in thread
From: Mel Gorman @ 2009-03-03 13:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Andrew Morton, linux-mm, penberg, riel,
	kosaki.motohiro, hannes, linux-kernel, ming.m.lin, yanmin_zhang

On Fri, Feb 27, 2009 at 10:40:17AM -0500, Christoph Lameter wrote:
> On Fri, 27 Feb 2009, Nick Piggin wrote:
> 
> > > I hope we can get rid of various ugly elements of the quicklists if the
> > > page allocator would offer some sort of support. I would think that the
> >
> > Only if it provides significant advantages over existing quicklists or
> > adds *no* extra overhead to the page allocator common cases. :)
> 
> And only if the page allocator gets fast enough to be usable for
> allocs instead of quicklists.
> 

It appears the x86 doesn't even use the quicklists. I know patches for
i386 support used to exist; what happened to them?

That aside, I think we could win slightly by just knowing when a page is
zeroed and being freed back to the allocator such as when the quicklists
are being drained. I wrote a patch along those lines but it started
getting really messy on x86 so I'm postponing it for the moment.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
  2009-03-03 13:52                               ` Mel Gorman
@ 2009-03-03 18:53                                 ` Christoph Lameter
  -1 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2009-03-03 18:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Andrew Morton, linux-mm, penberg, riel,
	kosaki.motohiro, hannes, linux-kernel, ming.m.lin, yanmin_zhang

On Tue, 3 Mar 2009, Mel Gorman wrote:

> > And only if the page allocator gets fast enough to be usable for
> > allocs instead of quicklists.
> It appears that x86 doesn't even use the quicklists. I know patches for
> i386 support used to exist; what happened to them?

The x86 patches were not applied because of an issue with early NUMA
freeing. The problem has been fixed but the x86 patches were left
unmerged. There was also an issue with the quicklists growing too large.

> That aside, I think we could win slightly just by knowing when a page being
> freed back to the allocator is already zeroed, such as when the quicklists
> are being drained. I wrote a patch along those lines, but it started
> getting really messy on x86, so I'm postponing it for the moment.

Quicklists tie into the TLB freeing logic. The TLB freeing logic could
itself keep a list of zeroed pages, which may be cleaner.
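
Purely as an illustration of that direction (the structure, field and
helper names below are invented for the sketch, not existing kernel
interfaces): the gather code that already batches page-table pages could
stash known-zero ones in a small cache and hand them back before going
to the page allocator. Something like:

#define NR_GATHER_ZERO	16		/* arbitrary batch size for the sketch */

struct gather_zero_cache {
	unsigned int	nr;
	struct page	*pages[NR_GATHER_ZERO];
};

/* Called by the TLB gather code for a page-table page it knows is zero. */
static void gather_stash_zeroed(struct gather_zero_cache *zc,
				struct page *page)
{
	if (zc->nr < NR_GATHER_ZERO)
		zc->pages[zc->nr++] = page;	/* keep for reuse, still zero */
	else
		__free_page(page);		/* cache full, free normally */
}

/* Page-table allocation tries the cache before hitting the page allocator. */
static struct page *gather_get_zeroed(struct gather_zero_cache *zc)
{
	return zc->nr ? zc->pages[--zc->nr] : NULL;
}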


^ permalink raw reply	[flat|nested] 190+ messages in thread

end of thread, other threads:[~2009-03-03 19:04 UTC | newest]

Thread overview: 190+ messages
2009-02-22 23:17 [RFC PATCH 00/20] Cleanup and optimise the page allocator Mel Gorman
2009-02-22 23:17 ` Mel Gorman
2009-02-22 23:17 ` [PATCH 01/20] Replace __alloc_pages_internal() with __alloc_pages_nodemask() Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-22 23:17 ` [PATCH 02/20] Do not sanity check order in the fast path Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-22 23:17 ` [PATCH 03/20] Do not check NUMA node ID when the caller knows the node is valid Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-23 15:01   ` Christoph Lameter
2009-02-23 15:01     ` Christoph Lameter
2009-02-23 16:24     ` Mel Gorman
2009-02-23 16:24       ` Mel Gorman
2009-02-22 23:17 ` [PATCH 04/20] Convert gfp_zone() to use a table of precalculated values Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-23 11:55   ` [PATCH] mm: clean up __GFP_* flags a bit Peter Zijlstra
2009-02-23 11:55     ` Peter Zijlstra
2009-02-23 18:01     ` Mel Gorman
2009-02-23 18:01       ` Mel Gorman
2009-02-23 20:27       ` Vegard Nossum
2009-02-23 20:27         ` Vegard Nossum
2009-02-23 15:23   ` [PATCH 04/20] Convert gfp_zone() to use a table of precalculated values Christoph Lameter
2009-02-23 15:23     ` Christoph Lameter
2009-02-23 15:41     ` Nick Piggin
2009-02-23 15:41       ` Nick Piggin
2009-02-23 15:43       ` [PATCH 04/20] Convert gfp_zone() to use a table of precalculated value Christoph Lameter
2009-02-23 15:43         ` Christoph Lameter
2009-02-23 16:40         ` Mel Gorman
2009-02-23 16:40           ` Mel Gorman
2009-02-23 17:03           ` Christoph Lameter
2009-02-23 17:03             ` Christoph Lameter
2009-02-24  1:32           ` KAMEZAWA Hiroyuki
2009-02-24  1:32             ` KAMEZAWA Hiroyuki
2009-02-24  3:59             ` Nick Piggin
2009-02-24  3:59               ` Nick Piggin
2009-02-24  5:20               ` KAMEZAWA Hiroyuki
2009-02-24  5:20                 ` KAMEZAWA Hiroyuki
2009-02-24 11:36             ` Mel Gorman
2009-02-24 11:36               ` Mel Gorman
2009-02-23 16:33     ` [PATCH 04/20] Convert gfp_zone() to use a table of precalculated values Mel Gorman
2009-02-23 16:33       ` Mel Gorman
2009-02-23 16:33       ` [PATCH 04/20] Convert gfp_zone() to use a table of precalculated value Christoph Lameter
2009-02-23 16:33         ` Christoph Lameter
2009-02-23 17:41         ` Mel Gorman
2009-02-23 17:41           ` Mel Gorman
2009-02-22 23:17 ` [PATCH 05/20] Check only once if the zonelist is suitable for the allocation Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-22 23:17 ` [PATCH 06/20] Break up the allocator entry point into fast and slow paths Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-22 23:17 ` [PATCH 07/20] Simplify the check on whether cpusets are a factor or not Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-23  7:14   ` Pekka J Enberg
2009-02-23  7:14     ` Pekka J Enberg
2009-02-23  9:07     ` Peter Zijlstra
2009-02-23  9:07       ` Peter Zijlstra
2009-02-23  9:13       ` Pekka Enberg
2009-02-23  9:13         ` Pekka Enberg
2009-02-23 11:39         ` Mel Gorman
2009-02-23 11:39           ` Mel Gorman
2009-02-23 13:19           ` Pekka Enberg
2009-02-23 13:19             ` Pekka Enberg
2009-02-23  9:14   ` Li Zefan
2009-02-23  9:14     ` Li Zefan
2009-02-22 23:17 ` [PATCH 08/20] Move check for disabled anti-fragmentation out of fastpath Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-22 23:17 ` [PATCH 09/20] Calculate the preferred zone for allocation only once Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-22 23:17 ` [PATCH 10/20] Calculate the migratetype " Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-22 23:17 ` [PATCH 11/20] Inline get_page_from_freelist() in the fast-path Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-23  7:21   ` Pekka Enberg
2009-02-23  7:21     ` Pekka Enberg
2009-02-23 11:42     ` Mel Gorman
2009-02-23 11:42       ` Mel Gorman
2009-02-23 15:32   ` Nick Piggin
2009-02-23 15:32     ` Nick Piggin
2009-02-24 13:32     ` Mel Gorman
2009-02-24 13:32       ` Mel Gorman
2009-02-24 14:08       ` Nick Piggin
2009-02-24 14:08         ` Nick Piggin
2009-02-24 15:03         ` Mel Gorman
2009-02-24 15:03           ` Mel Gorman
2009-02-22 23:17 ` [PATCH 12/20] Inline __rmqueue_smallest() Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-22 23:17 ` [PATCH 13/20] Inline buffered_rmqueue() Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-23  7:24   ` Pekka Enberg
2009-02-23  7:24     ` Pekka Enberg
2009-02-23 11:44     ` Mel Gorman
2009-02-23 11:44       ` Mel Gorman
2009-02-22 23:17 ` [PATCH 14/20] Do not call get_pageblock_migratetype() more than necessary Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-22 23:17 ` [PATCH 15/20] Do not disable interrupts in free_page_mlock() Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-23  9:19   ` Peter Zijlstra
2009-02-23  9:19     ` Peter Zijlstra
2009-02-23 12:23     ` Mel Gorman
2009-02-23 12:23       ` Mel Gorman
2009-02-23 12:44       ` Peter Zijlstra
2009-02-23 12:44         ` Peter Zijlstra
2009-02-23 14:25         ` Mel Gorman
2009-02-23 14:25           ` Mel Gorman
2009-02-22 23:17 ` [PATCH 16/20] Do not setup zonelist cache when there is only one node Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-22 23:17 ` [PATCH 17/20] Do not double sanity check page attributes during allocation Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-22 23:17 ` [PATCH 18/20] Split per-cpu list into one-list-per-migrate-type Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-22 23:17 ` [PATCH 19/20] Batch free pages from migratetype per-cpu lists Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-22 23:17 ` [PATCH 20/20] Get rid of the concept of hot/cold page freeing Mel Gorman
2009-02-22 23:17   ` Mel Gorman
2009-02-23  9:37   ` Andrew Morton
2009-02-23  9:37     ` Andrew Morton
2009-02-23 23:30     ` Mel Gorman
2009-02-23 23:30       ` Mel Gorman
2009-02-23 23:53       ` Andrew Morton
2009-02-23 23:53         ` Andrew Morton
2009-02-24 11:51         ` Mel Gorman
2009-02-24 11:51           ` Mel Gorman
2009-02-25  0:01           ` Andrew Morton
2009-02-25  0:01             ` Andrew Morton
2009-02-25 16:01             ` Mel Gorman
2009-02-25 16:01               ` Mel Gorman
2009-02-25 16:19               ` Andrew Morton
2009-02-25 16:19                 ` Andrew Morton
2009-02-26 16:37                 ` Mel Gorman
2009-02-26 16:37                   ` Mel Gorman
2009-02-26 17:00                   ` Christoph Lameter
2009-02-26 17:00                     ` Christoph Lameter
2009-02-26 17:15                     ` Mel Gorman
2009-02-26 17:15                       ` Mel Gorman
2009-02-26 17:30                       ` Christoph Lameter
2009-02-26 17:30                         ` Christoph Lameter
2009-02-27 11:33                         ` Nick Piggin
2009-02-27 11:33                           ` Nick Piggin
2009-02-27 15:40                           ` Christoph Lameter
2009-02-27 15:40                             ` Christoph Lameter
2009-03-03 13:52                             ` Mel Gorman
2009-03-03 13:52                               ` Mel Gorman
2009-03-03 18:53                               ` Christoph Lameter
2009-03-03 18:53                                 ` Christoph Lameter
2009-02-27 11:38                       ` Nick Piggin
2009-02-27 11:38                         ` Nick Piggin
2009-03-01 10:37                         ` KOSAKI Motohiro
2009-03-01 10:37                           ` KOSAKI Motohiro
2009-02-25 18:33               ` Christoph Lameter
2009-02-25 18:33                 ` Christoph Lameter
2009-02-22 23:57 ` [RFC PATCH 00/20] Cleanup and optimise the page allocator Andi Kleen
2009-02-22 23:57   ` Andi Kleen
2009-02-23 12:34   ` Mel Gorman
2009-02-23 12:34     ` Mel Gorman
2009-02-23 15:34   ` [RFC PATCH 00/20] Cleanup and optimise the page allocato Christoph Lameter
2009-02-23 15:34     ` Christoph Lameter
2009-02-23  0:02 ` [RFC PATCH 00/20] Cleanup and optimise the page allocator Andi Kleen
2009-02-23  0:02   ` Andi Kleen
2009-02-23 14:32   ` Mel Gorman
2009-02-23 14:32     ` Mel Gorman
2009-02-23 17:49     ` Andi Kleen
2009-02-23 17:49       ` Andi Kleen
2009-02-24 14:32       ` Mel Gorman
2009-02-24 14:32         ` Mel Gorman
2009-02-23  7:29 ` Pekka Enberg
2009-02-23  7:29   ` Pekka Enberg
2009-02-23  8:34   ` Zhang, Yanmin
2009-02-23  8:34     ` Zhang, Yanmin
2009-02-23  9:10   ` KOSAKI Motohiro
2009-02-23  9:10     ` KOSAKI Motohiro
2009-02-23 11:55 ` [PATCH] mm: gfp_to_alloc_flags() Peter Zijlstra
2009-02-23 11:55   ` Peter Zijlstra
2009-02-23 14:00   ` Pekka Enberg
2009-02-23 14:00     ` Pekka Enberg
2009-02-23 18:17   ` Mel Gorman
2009-02-23 18:17     ` Mel Gorman
2009-02-23 20:09     ` Peter Zijlstra
2009-02-23 20:09       ` Peter Zijlstra
2009-02-23 22:59   ` Andrew Morton
2009-02-23 22:59     ` Andrew Morton
2009-02-24  8:59     ` Peter Zijlstra
2009-02-24  8:59       ` Peter Zijlstra
2009-02-23 14:38 ` [RFC PATCH 00/20] Cleanup and optimise the page allocator Christoph Lameter
2009-02-23 14:38   ` Christoph Lameter
2009-02-23 14:46 ` Nick Piggin
2009-02-23 14:46   ` Nick Piggin
2009-02-23 15:00   ` Mel Gorman
2009-02-23 15:00     ` Mel Gorman
2009-02-23 15:22     ` Nick Piggin
2009-02-23 15:22       ` Nick Piggin
2009-02-23 20:26       ` Mel Gorman
2009-02-23 20:26         ` Mel Gorman
