All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/34] Memory management performance backports for -stable V2
@ 2012-07-23 13:38 ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

Changelog since V1
  o Expand some of the notes					(jrnieder)
  o Correct upstream commit SHA1				(hugh)

This series is related to the new addition to stable_kernel_rules.txt

 - Serious issues as reported by a user of a distribution kernel may also
   be considered if they fix a notable performance or interactivity issue.
   As these fixes are not as obvious and have a higher risk of a subtle
   regression they should only be submitted by a distribution kernel
   maintainer and include an addendum linking to a bugzilla entry if it
   exists and additional information on the user-visible impact.

All of these patches have been backported to a distribution kernel and
address some sort of performance issue in the VM. As they are not all
obvious, I've added a "Stable note" to the top of each patch giving
additional information on why the patch was backported. Lets see where
the boundaries lie on how this new rule is interpreted in practice :).

Patch 1	Performance fix for tmpfs
Patch 2 Memory hotadd fix
Patch 3 Reduce boot time on large machines
Patches 4-5 Reduce stalls for wait_iff_congested
Patches 6-8 Reduce excessive reclaim of slab objects which for some workloads
	will reduce the amount of IO required
Patches 9-10 limits the amount of page reclaim when THP/Compaction is active.
	Excessive reclaim in low memory situations can lead to stalls some
	of which are user visible.
Patches 11-19 reduce the amount of churn of the LRU lists. Poor reclaim
	decisions can impair workloads in different ways and there have
	been complaints recently the reclaim decisions of modern kernels
	are worse than older ones.
Patches 20-21 reduce the amount of CPU kswapd uses in some cases. This
	is harder to trigger but were developed due to bug reports about
	100% CPU usage from kswapd.
Patches 22-25 are mostly related to interactivity when THP is enabled.
Patches 26-30 are also related to page reclaim decisions, particularly
	the residency of mapped pages.
Patches 31-34 fix a major page allocator performance regression

All of the patches will apply to 3.0-stable but the ordering of the
patches is such that applying them to 3.2-stable and 3.4-stable should
be straight-forward.

I am bending or breaking the rules in places that needs examination.

1. Not all these patches have a bugzilla entry because in many cases I was
   doing the investigation based on my own testing. By rights, I should
   have been creating bugzilla entries for each of them but there only
   are so many hours in the day.
2. I will be duplicated in the signed-offs because I may both the author
   of the patch and now part of the submission path to -stable. I don't think
   there is anything wrong with this but it might look weird to some people.
3. Some patches are in the series only because they make later patches
   easier to backport.
4. Patch 30 may be violating the rules. The upstream patch accidentally
   fixes a problem and was found through bisection but the full patch and
   the series itself is not a good -stable candidate. Patch 30 is a substitute
   for an upstream commit.

The patches are based on 3.0.36 but there should not be problems applying
the series to later stable releases.

Alex,Shi (2):
  kswapd: avoid unnecessary rebalance after an unsuccessful balancing
  kswapd: assign new_order and new_classzone_idx after wakeup in
    sleeping

Dave Chinner (3):
  vmscan: add shrink_slab tracepoints
  vmscan: shrinker->nr updates race and go wrong
  vmscan: reduce wind up shrinker->nr when shrinker can't do work

David Rientjes (2):
  cpusets: avoid looping when storing to mems_allowed if one node
    remains set
  cpusets: stall when updating mems_allowed for mempolicy or disjoint
    nodemask

Dimitri Sivanich (1):
  mm: vmstat: cache align vm_stat

Hugh Dickins (1):
  mm: test PageSwapBacked in lumpy reclaim

Johannes Weiner (1):
  mm: vmscan: fix force-scanning small targets without swap

Konstantin Khlebnikov (3):
  vmscan: promote shared file mapped pages
  vmscan: activate executable pages after first usage
  mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma

Mel Gorman (14):
  mm: memory hotplug: Check if pages are correctly reserved on a
    per-section basis
  mm: Reduce the amount of work done when updating min_free_kbytes
  mm: Abort reclaim/compaction if compaction can proceed
  mm: migration: clean up unmap_and_move()
  mm: compaction: Allow compaction to isolate dirty pages
  mm: compaction: Determine if dirty pages can be migrated without
    blocking within ->migratepage
  mm: page allocator: Do not call direct reclaim for THP allocations
    while compaction is deferred
  mm: compaction: make isolate_lru_page() filter-aware again
  mm: compaction: Introduce sync-light migration for use by compaction
  mm: vmscan: When reclaiming for compaction, ensure there are
    sufficient free pages available
  mm: vmscan: Do not OOM if aborting reclaim to start compaction
  mm: vmscan: Check if reclaim should really abort even if
    compaction_ready() is true for one zone
  mm: vmscan: Do not force kswapd to scan small targets
  cpuset: mm: Reduce large amounts of memory barrier related damage v3

Minchan Kim (5):
  mm: compaction: trivial clean up in acct_isolated()
  mm: change isolate mode from #define to bitwise type
  mm: compaction: make isolate_lru_page() filter-aware
  mm: zone_reclaim: make isolate_lru_page() filter-aware
  mm/vmscan.c: consider swap space when deciding whether to continue
    reclaim

Rik van Riel (1):
  mm: limit direct reclaim for higher order allocations

Shaohua Li (1):
  vmscan: clear ZONE_CONGESTED for zone with good watermark

 .../trace/postprocess/trace-vmscan-postprocess.pl  |    8 +-
 drivers/base/memory.c                              |   58 ++--
 fs/btrfs/disk-io.c                                 |    5 +-
 fs/hugetlbfs/inode.c                               |    3 +-
 fs/nfs/internal.h                                  |    2 +-
 fs/nfs/write.c                                     |    4 +-
 include/linux/cpuset.h                             |   45 ++--
 include/linux/fs.h                                 |   11 +-
 include/linux/init_task.h                          |    8 +
 include/linux/memcontrol.h                         |    3 +-
 include/linux/migrate.h                            |   23 +-
 include/linux/mmzone.h                             |   14 +
 include/linux/sched.h                              |    2 +-
 include/linux/swap.h                               |    7 +-
 include/trace/events/vmscan.h                      |   85 +++++-
 kernel/cpuset.c                                    |   63 ++---
 kernel/fork.c                                      |    3 +
 mm/compaction.c                                    |   26 +-
 mm/filemap.c                                       |   11 +-
 mm/hugetlb.c                                       |   13 +-
 mm/memcontrol.c                                    |    3 +-
 mm/memory-failure.c                                |    2 +-
 mm/memory_hotplug.c                                |    2 +-
 mm/mempolicy.c                                     |   30 ++-
 mm/migrate.c                                       |  224 ++++++++++------
 mm/page_alloc.c                                    |  113 +++++---
 mm/slab.c                                          |   13 +-
 mm/slub.c                                          |   39 ++-
 mm/vmscan.c                                        |  280 ++++++++++++++++----
 mm/vmstat.c                                        |    2 +-
 30 files changed, 772 insertions(+), 330 deletions(-)

-- 
1.7.9.2


^ permalink raw reply	[flat|nested] 125+ messages in thread

* [PATCH 00/34] Memory management performance backports for -stable V2
@ 2012-07-23 13:38 ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

Changelog since V1
  o Expand some of the notes					(jrnieder)
  o Correct upstream commit SHA1				(hugh)

This series is related to the new addition to stable_kernel_rules.txt

 - Serious issues as reported by a user of a distribution kernel may also
   be considered if they fix a notable performance or interactivity issue.
   As these fixes are not as obvious and have a higher risk of a subtle
   regression they should only be submitted by a distribution kernel
   maintainer and include an addendum linking to a bugzilla entry if it
   exists and additional information on the user-visible impact.

All of these patches have been backported to a distribution kernel and
address some sort of performance issue in the VM. As they are not all
obvious, I've added a "Stable note" to the top of each patch giving
additional information on why the patch was backported. Lets see where
the boundaries lie on how this new rule is interpreted in practice :).

Patch 1	Performance fix for tmpfs
Patch 2 Memory hotadd fix
Patch 3 Reduce boot time on large machines
Patches 4-5 Reduce stalls for wait_iff_congested
Patches 6-8 Reduce excessive reclaim of slab objects which for some workloads
	will reduce the amount of IO required
Patches 9-10 limits the amount of page reclaim when THP/Compaction is active.
	Excessive reclaim in low memory situations can lead to stalls some
	of which are user visible.
Patches 11-19 reduce the amount of churn of the LRU lists. Poor reclaim
	decisions can impair workloads in different ways and there have
	been complaints recently the reclaim decisions of modern kernels
	are worse than older ones.
Patches 20-21 reduce the amount of CPU kswapd uses in some cases. This
	is harder to trigger but were developed due to bug reports about
	100% CPU usage from kswapd.
Patches 22-25 are mostly related to interactivity when THP is enabled.
Patches 26-30 are also related to page reclaim decisions, particularly
	the residency of mapped pages.
Patches 31-34 fix a major page allocator performance regression

All of the patches will apply to 3.0-stable but the ordering of the
patches is such that applying them to 3.2-stable and 3.4-stable should
be straight-forward.

I am bending or breaking the rules in places that needs examination.

1. Not all these patches have a bugzilla entry because in many cases I was
   doing the investigation based on my own testing. By rights, I should
   have been creating bugzilla entries for each of them but there only
   are so many hours in the day.
2. I will be duplicated in the signed-offs because I may both the author
   of the patch and now part of the submission path to -stable. I don't think
   there is anything wrong with this but it might look weird to some people.
3. Some patches are in the series only because they make later patches
   easier to backport.
4. Patch 30 may be violating the rules. The upstream patch accidentally
   fixes a problem and was found through bisection but the full patch and
   the series itself is not a good -stable candidate. Patch 30 is a substitute
   for an upstream commit.

The patches are based on 3.0.36 but there should not be problems applying
the series to later stable releases.

Alex,Shi (2):
  kswapd: avoid unnecessary rebalance after an unsuccessful balancing
  kswapd: assign new_order and new_classzone_idx after wakeup in
    sleeping

Dave Chinner (3):
  vmscan: add shrink_slab tracepoints
  vmscan: shrinker->nr updates race and go wrong
  vmscan: reduce wind up shrinker->nr when shrinker can't do work

David Rientjes (2):
  cpusets: avoid looping when storing to mems_allowed if one node
    remains set
  cpusets: stall when updating mems_allowed for mempolicy or disjoint
    nodemask

Dimitri Sivanich (1):
  mm: vmstat: cache align vm_stat

Hugh Dickins (1):
  mm: test PageSwapBacked in lumpy reclaim

Johannes Weiner (1):
  mm: vmscan: fix force-scanning small targets without swap

Konstantin Khlebnikov (3):
  vmscan: promote shared file mapped pages
  vmscan: activate executable pages after first usage
  mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma

Mel Gorman (14):
  mm: memory hotplug: Check if pages are correctly reserved on a
    per-section basis
  mm: Reduce the amount of work done when updating min_free_kbytes
  mm: Abort reclaim/compaction if compaction can proceed
  mm: migration: clean up unmap_and_move()
  mm: compaction: Allow compaction to isolate dirty pages
  mm: compaction: Determine if dirty pages can be migrated without
    blocking within ->migratepage
  mm: page allocator: Do not call direct reclaim for THP allocations
    while compaction is deferred
  mm: compaction: make isolate_lru_page() filter-aware again
  mm: compaction: Introduce sync-light migration for use by compaction
  mm: vmscan: When reclaiming for compaction, ensure there are
    sufficient free pages available
  mm: vmscan: Do not OOM if aborting reclaim to start compaction
  mm: vmscan: Check if reclaim should really abort even if
    compaction_ready() is true for one zone
  mm: vmscan: Do not force kswapd to scan small targets
  cpuset: mm: Reduce large amounts of memory barrier related damage v3

Minchan Kim (5):
  mm: compaction: trivial clean up in acct_isolated()
  mm: change isolate mode from #define to bitwise type
  mm: compaction: make isolate_lru_page() filter-aware
  mm: zone_reclaim: make isolate_lru_page() filter-aware
  mm/vmscan.c: consider swap space when deciding whether to continue
    reclaim

Rik van Riel (1):
  mm: limit direct reclaim for higher order allocations

Shaohua Li (1):
  vmscan: clear ZONE_CONGESTED for zone with good watermark

 .../trace/postprocess/trace-vmscan-postprocess.pl  |    8 +-
 drivers/base/memory.c                              |   58 ++--
 fs/btrfs/disk-io.c                                 |    5 +-
 fs/hugetlbfs/inode.c                               |    3 +-
 fs/nfs/internal.h                                  |    2 +-
 fs/nfs/write.c                                     |    4 +-
 include/linux/cpuset.h                             |   45 ++--
 include/linux/fs.h                                 |   11 +-
 include/linux/init_task.h                          |    8 +
 include/linux/memcontrol.h                         |    3 +-
 include/linux/migrate.h                            |   23 +-
 include/linux/mmzone.h                             |   14 +
 include/linux/sched.h                              |    2 +-
 include/linux/swap.h                               |    7 +-
 include/trace/events/vmscan.h                      |   85 +++++-
 kernel/cpuset.c                                    |   63 ++---
 kernel/fork.c                                      |    3 +
 mm/compaction.c                                    |   26 +-
 mm/filemap.c                                       |   11 +-
 mm/hugetlb.c                                       |   13 +-
 mm/memcontrol.c                                    |    3 +-
 mm/memory-failure.c                                |    2 +-
 mm/memory_hotplug.c                                |    2 +-
 mm/mempolicy.c                                     |   30 ++-
 mm/migrate.c                                       |  224 ++++++++++------
 mm/page_alloc.c                                    |  113 +++++---
 mm/slab.c                                          |   13 +-
 mm/slub.c                                          |   39 ++-
 mm/vmscan.c                                        |  280 ++++++++++++++++----
 mm/vmstat.c                                        |    2 +-
 30 files changed, 772 insertions(+), 330 deletions(-)

-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* [PATCH 01/34] mm: vmstat: cache align vm_stat
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Dimitri Sivanich <sivanich@sgi.com>

commit a1cb2c60ddc98ff4e5246f410558805401ceee67 upstream.

Stable note: Not tracked on Bugzilla. This patch is known to make a big
	difference to tmpfs performance on larger machines.

Avoid false sharing of the vm_stat array. This was found to adversely
affect tmpfs I/O performance.

Tests run on a 640 cpu UV system.

With 120 threads doing parallel writes, each to different tmpfs mounts:
No patch:		~300 MB/sec
With vm_stat alignment:	~430 MB/sec

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmstat.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 20c18b7..6559013 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -78,7 +78,7 @@ void vm_events_fold_cpu(int cpu)
  *
  * vm_stat contains the global counters
  */
-atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
+atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
 EXPORT_SYMBOL(vm_stat);
 
 #ifdef CONFIG_SMP
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 01/34] mm: vmstat: cache align vm_stat
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Dimitri Sivanich <sivanich@sgi.com>

commit a1cb2c60ddc98ff4e5246f410558805401ceee67 upstream.

Stable note: Not tracked on Bugzilla. This patch is known to make a big
	difference to tmpfs performance on larger machines.

Avoid false sharing of the vm_stat array. This was found to adversely
affect tmpfs I/O performance.

Tests run on a 640 cpu UV system.

With 120 threads doing parallel writes, each to different tmpfs mounts:
No patch:		~300 MB/sec
With vm_stat alignment:	~430 MB/sec

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmstat.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 20c18b7..6559013 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -78,7 +78,7 @@ void vm_events_fold_cpu(int cpu)
  *
  * vm_stat contains the global counters
  */
-atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
+atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
 EXPORT_SYMBOL(vm_stat);
 
 #ifdef CONFIG_SMP
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 02/34] mm: memory hotplug: Check if pages are correctly reserved on a per-section basis
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit 2bbcb8788311a40714b585fc11b51da6ffa2ab92 upstream.

Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=721039 .
	Without the patch, memory hot-add can fail for kernel configurations
	that do not set CONFIG_SPARSEMEM_VMEMMAP.

It is expected that memory being brought online is PageReserved
similar to what happens when the page allocator is being brought up.
Memory is onlined in "memory blocks" which consist of one or more
sections. Unfortunately, the code that verifies PageReserved is
currently assuming that the memmap backing all these pages is virtually
contiguous which is only the case when CONFIG_SPARSEMEM_VMEMMAP is set.

This patch updates the PageReserved check to lookup struct page once per
section to guarantee the correct struct page is being checked.

[Check pages within sections properly: rientjes@google.com]
[original patch by: nfont@linux.vnet.ibm.com]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
 drivers/base/memory.c |   58 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 40 insertions(+), 18 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 45d7c8f..5fb6aae 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -224,13 +224,48 @@ int memory_isolate_notify(unsigned long val, void *v)
 }
 
 /*
+ * The probe routines leave the pages reserved, just as the bootmem code does.
+ * Make sure they're still that way.
+ */
+static bool pages_correctly_reserved(unsigned long start_pfn,
+					unsigned long nr_pages)
+{
+	int i, j;
+	struct page *page;
+	unsigned long pfn = start_pfn;
+
+	/*
+	 * memmap between sections is not contiguous except with
+	 * SPARSEMEM_VMEMMAP. We lookup the page once per section
+	 * and assume memmap is contiguous within each section
+	 */
+	for (i = 0; i < sections_per_block; i++, pfn += PAGES_PER_SECTION) {
+		if (WARN_ON_ONCE(!pfn_valid(pfn)))
+			return false;
+		page = pfn_to_page(pfn);
+
+		for (j = 0; j < PAGES_PER_SECTION; j++) {
+			if (PageReserved(page + j))
+				continue;
+
+			printk(KERN_WARNING "section number %ld page number %d "
+				"not reserved, was it already online?\n",
+				pfn_to_section_nr(pfn), j);
+
+			return false;
+		}
+	}
+
+	return true;
+}
+
+/*
  * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
  * OK to have direct references to sparsemem variables in here.
  */
 static int
 memory_block_action(unsigned long phys_index, unsigned long action)
 {
-	int i;
 	unsigned long start_pfn, start_paddr;
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
 	struct page *first_page;
@@ -238,26 +273,13 @@ memory_block_action(unsigned long phys_index, unsigned long action)
 
 	first_page = pfn_to_page(phys_index << PFN_SECTION_SHIFT);
 
-	/*
-	 * The probe routines leave the pages reserved, just
-	 * as the bootmem code does.  Make sure they're still
-	 * that way.
-	 */
-	if (action == MEM_ONLINE) {
-		for (i = 0; i < nr_pages; i++) {
-			if (PageReserved(first_page+i))
-				continue;
-
-			printk(KERN_WARNING "section number %ld page number %d "
-				"not reserved, was it already online?\n",
-				phys_index, i);
-			return -EBUSY;
-		}
-	}
-
 	switch (action) {
 		case MEM_ONLINE:
 			start_pfn = page_to_pfn(first_page);
+
+			if (!pages_correctly_reserved(start_pfn, nr_pages))
+				return -EBUSY;
+
 			ret = online_pages(start_pfn, nr_pages);
 			break;
 		case MEM_OFFLINE:
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 02/34] mm: memory hotplug: Check if pages are correctly reserved on a per-section basis
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit 2bbcb8788311a40714b585fc11b51da6ffa2ab92 upstream.

Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=721039 .
	Without the patch, memory hot-add can fail for kernel configurations
	that do not set CONFIG_SPARSEMEM_VMEMMAP.

It is expected that memory being brought online is PageReserved
similar to what happens when the page allocator is being brought up.
Memory is onlined in "memory blocks" which consist of one or more
sections. Unfortunately, the code that verifies PageReserved is
currently assuming that the memmap backing all these pages is virtually
contiguous which is only the case when CONFIG_SPARSEMEM_VMEMMAP is set.

This patch updates the PageReserved check to lookup struct page once per
section to guarantee the correct struct page is being checked.

[Check pages within sections properly: rientjes@google.com]
[original patch by: nfont@linux.vnet.ibm.com]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
 drivers/base/memory.c |   58 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 40 insertions(+), 18 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 45d7c8f..5fb6aae 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -224,13 +224,48 @@ int memory_isolate_notify(unsigned long val, void *v)
 }
 
 /*
+ * The probe routines leave the pages reserved, just as the bootmem code does.
+ * Make sure they're still that way.
+ */
+static bool pages_correctly_reserved(unsigned long start_pfn,
+					unsigned long nr_pages)
+{
+	int i, j;
+	struct page *page;
+	unsigned long pfn = start_pfn;
+
+	/*
+	 * memmap between sections is not contiguous except with
+	 * SPARSEMEM_VMEMMAP. We lookup the page once per section
+	 * and assume memmap is contiguous within each section
+	 */
+	for (i = 0; i < sections_per_block; i++, pfn += PAGES_PER_SECTION) {
+		if (WARN_ON_ONCE(!pfn_valid(pfn)))
+			return false;
+		page = pfn_to_page(pfn);
+
+		for (j = 0; j < PAGES_PER_SECTION; j++) {
+			if (PageReserved(page + j))
+				continue;
+
+			printk(KERN_WARNING "section number %ld page number %d "
+				"not reserved, was it already online?\n",
+				pfn_to_section_nr(pfn), j);
+
+			return false;
+		}
+	}
+
+	return true;
+}
+
+/*
  * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
  * OK to have direct references to sparsemem variables in here.
  */
 static int
 memory_block_action(unsigned long phys_index, unsigned long action)
 {
-	int i;
 	unsigned long start_pfn, start_paddr;
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
 	struct page *first_page;
@@ -238,26 +273,13 @@ memory_block_action(unsigned long phys_index, unsigned long action)
 
 	first_page = pfn_to_page(phys_index << PFN_SECTION_SHIFT);
 
-	/*
-	 * The probe routines leave the pages reserved, just
-	 * as the bootmem code does.  Make sure they're still
-	 * that way.
-	 */
-	if (action == MEM_ONLINE) {
-		for (i = 0; i < nr_pages; i++) {
-			if (PageReserved(first_page+i))
-				continue;
-
-			printk(KERN_WARNING "section number %ld page number %d "
-				"not reserved, was it already online?\n",
-				phys_index, i);
-			return -EBUSY;
-		}
-	}
-
 	switch (action) {
 		case MEM_ONLINE:
 			start_pfn = page_to_pfn(first_page);
+
+			if (!pages_correctly_reserved(start_pfn, nr_pages))
+				return -EBUSY;
+
 			ret = online_pages(start_pfn, nr_pages);
 			break;
 		case MEM_OFFLINE:
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 03/34] mm: Reduce the amount of work done when updating min_free_kbytes
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit 938929f14cb595f43cd1a4e63e22d36cab1e4a1f upstream.

Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=726210 .
	Large machines with 1TB or more of RAM take a long time to boot
	without this patch and may spew out soft lockup warnings.

When min_free_kbytes is updated blocks marked MIGRATE_RESERVE are
updated. Ordinarily, this work is unnoticable as it happens early
in boot. However, on large machines with 1TB of memory, this can take
a considerable time when NUMA distances are taken into account. The bulk
of the work is done by pageblock_is_reserved() which examines the
metadata for almost every page in the system. Currently, we are doing
this far more than necessary as it is only required while there are
still blocks to be marked MIGRATE_RESERVE. This patch significantly
reduces the amount of work done by setup_zone_migrate_reserve()
improving boot times on 1TB machines.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |   35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 947a7e9..e568b80 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3418,25 +3418,28 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 		if (page_to_nid(page) != zone_to_nid(zone))
 			continue;
 
-		/* Blocks with reserved pages will never free, skip them. */
-		block_end_pfn = min(pfn + pageblock_nr_pages, end_pfn);
-		if (pageblock_is_reserved(pfn, block_end_pfn))
-			continue;
-
 		block_migratetype = get_pageblock_migratetype(page);
 
-		/* If this block is reserved, account for it */
-		if (reserve > 0 && block_migratetype == MIGRATE_RESERVE) {
-			reserve--;
-			continue;
-		}
+		/* Only test what is necessary when the reserves are not met */
+		if (reserve > 0) {
+			/* Blocks with reserved pages will never free, skip them. */
+			block_end_pfn = min(pfn + pageblock_nr_pages, end_pfn);
+			if (pageblock_is_reserved(pfn, block_end_pfn))
+				continue;
 
-		/* Suitable for reserving if this block is movable */
-		if (reserve > 0 && block_migratetype == MIGRATE_MOVABLE) {
-			set_pageblock_migratetype(page, MIGRATE_RESERVE);
-			move_freepages_block(zone, page, MIGRATE_RESERVE);
-			reserve--;
-			continue;
+			/* If this block is reserved, account for it */
+			if (block_migratetype == MIGRATE_RESERVE) {
+				reserve--;
+				continue;
+			}
+
+			/* Suitable for reserving if this block is movable */
+			if (block_migratetype == MIGRATE_MOVABLE) {
+				set_pageblock_migratetype(page, MIGRATE_RESERVE);
+				move_freepages_block(zone, page, MIGRATE_RESERVE);
+				reserve--;
+				continue;
+			}
 		}
 
 		/*
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 03/34] mm: Reduce the amount of work done when updating min_free_kbytes
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit 938929f14cb595f43cd1a4e63e22d36cab1e4a1f upstream.

Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=726210 .
	Large machines with 1TB or more of RAM take a long time to boot
	without this patch and may spew out soft lockup warnings.

When min_free_kbytes is updated blocks marked MIGRATE_RESERVE are
updated. Ordinarily, this work is unnoticable as it happens early
in boot. However, on large machines with 1TB of memory, this can take
a considerable time when NUMA distances are taken into account. The bulk
of the work is done by pageblock_is_reserved() which examines the
metadata for almost every page in the system. Currently, we are doing
this far more than necessary as it is only required while there are
still blocks to be marked MIGRATE_RESERVE. This patch significantly
reduces the amount of work done by setup_zone_migrate_reserve()
improving boot times on 1TB machines.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |   35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 947a7e9..e568b80 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3418,25 +3418,28 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 		if (page_to_nid(page) != zone_to_nid(zone))
 			continue;
 
-		/* Blocks with reserved pages will never free, skip them. */
-		block_end_pfn = min(pfn + pageblock_nr_pages, end_pfn);
-		if (pageblock_is_reserved(pfn, block_end_pfn))
-			continue;
-
 		block_migratetype = get_pageblock_migratetype(page);
 
-		/* If this block is reserved, account for it */
-		if (reserve > 0 && block_migratetype == MIGRATE_RESERVE) {
-			reserve--;
-			continue;
-		}
+		/* Only test what is necessary when the reserves are not met */
+		if (reserve > 0) {
+			/* Blocks with reserved pages will never free, skip them. */
+			block_end_pfn = min(pfn + pageblock_nr_pages, end_pfn);
+			if (pageblock_is_reserved(pfn, block_end_pfn))
+				continue;
 
-		/* Suitable for reserving if this block is movable */
-		if (reserve > 0 && block_migratetype == MIGRATE_MOVABLE) {
-			set_pageblock_migratetype(page, MIGRATE_RESERVE);
-			move_freepages_block(zone, page, MIGRATE_RESERVE);
-			reserve--;
-			continue;
+			/* If this block is reserved, account for it */
+			if (block_migratetype == MIGRATE_RESERVE) {
+				reserve--;
+				continue;
+			}
+
+			/* Suitable for reserving if this block is movable */
+			if (block_migratetype == MIGRATE_MOVABLE) {
+				set_pageblock_migratetype(page, MIGRATE_RESERVE);
+				move_freepages_block(zone, page, MIGRATE_RESERVE);
+				reserve--;
+				continue;
+			}
 		}
 
 		/*
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 04/34] mm: vmscan: fix force-scanning small targets without swap
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Johannes Weiner <jweiner@redhat.com>

commit a4d3e9e76337059406fcf3ead288c0df22a790e9 upstream.

Stable note: Not tracked in Bugzilla. This patch augments an earlier commit
	that avoids scanning priority being artificially raised. The older
	fix was particularly important for small memcgs to avoid calling
	wait_iff_congested() unnecessarily.

Without swap, anonymous pages are not scanned.  As such, they should not
count when considering force-scanning a small target if there is no swap.

Otherwise, targets are not force-scanned even when their effective scan
number is zero and the other conditions--kswapd/memcg--apply.

This fixes 246e87a93934 ("memcg: fix get_scan_count() for small
targets").

[akpm@linux-foundation.org: fix comment]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 769935d..bdfdec3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1747,23 +1747,15 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 	u64 fraction[2], denominator;
 	enum lru_list l;
 	int noswap = 0;
-	int force_scan = 0;
+	bool force_scan = false;
 	unsigned long nr_force_scan[2];
 
-
-	anon  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
-		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
-	file  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
-		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
-
-	if (((anon + file) >> priority) < SWAP_CLUSTER_MAX) {
-		/* kswapd does zone balancing and need to scan this zone */
-		if (scanning_global_lru(sc) && current_is_kswapd())
-			force_scan = 1;
-		/* memcg may have small limit and need to avoid priority drop */
-		if (!scanning_global_lru(sc))
-			force_scan = 1;
-	}
+	/* kswapd does zone balancing and needs to scan this zone */
+	if (scanning_global_lru(sc) && current_is_kswapd())
+		force_scan = true;
+	/* memcg may have small limit and need to avoid priority drop */
+	if (!scanning_global_lru(sc))
+		force_scan = true;
 
 	/* If we have no swap space, do not bother scanning anon pages. */
 	if (!sc->may_swap || (nr_swap_pages <= 0)) {
@@ -1776,6 +1768,11 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 		goto out;
 	}
 
+	anon  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
+		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
+	file  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
+		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
+
 	if (scanning_global_lru(sc)) {
 		free  = zone_page_state(zone, NR_FREE_PAGES);
 		/* If we have very few page cache pages,
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 04/34] mm: vmscan: fix force-scanning small targets without swap
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Johannes Weiner <jweiner@redhat.com>

commit a4d3e9e76337059406fcf3ead288c0df22a790e9 upstream.

Stable note: Not tracked in Bugzilla. This patch augments an earlier commit
	that avoids scanning priority being artificially raised. The older
	fix was particularly important for small memcgs to avoid calling
	wait_iff_congested() unnecessarily.

Without swap, anonymous pages are not scanned.  As such, they should not
count when considering force-scanning a small target if there is no swap.

Otherwise, targets are not force-scanned even when their effective scan
number is zero and the other conditions--kswapd/memcg--apply.

This fixes 246e87a93934 ("memcg: fix get_scan_count() for small
targets").

[akpm@linux-foundation.org: fix comment]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 769935d..bdfdec3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1747,23 +1747,15 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 	u64 fraction[2], denominator;
 	enum lru_list l;
 	int noswap = 0;
-	int force_scan = 0;
+	bool force_scan = false;
 	unsigned long nr_force_scan[2];
 
-
-	anon  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
-		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
-	file  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
-		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
-
-	if (((anon + file) >> priority) < SWAP_CLUSTER_MAX) {
-		/* kswapd does zone balancing and need to scan this zone */
-		if (scanning_global_lru(sc) && current_is_kswapd())
-			force_scan = 1;
-		/* memcg may have small limit and need to avoid priority drop */
-		if (!scanning_global_lru(sc))
-			force_scan = 1;
-	}
+	/* kswapd does zone balancing and needs to scan this zone */
+	if (scanning_global_lru(sc) && current_is_kswapd())
+		force_scan = true;
+	/* memcg may have small limit and need to avoid priority drop */
+	if (!scanning_global_lru(sc))
+		force_scan = true;
 
 	/* If we have no swap space, do not bother scanning anon pages. */
 	if (!sc->may_swap || (nr_swap_pages <= 0)) {
@@ -1776,6 +1768,11 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 		goto out;
 	}
 
+	anon  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
+		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
+	file  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
+		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
+
 	if (scanning_global_lru(sc)) {
 		free  = zone_page_state(zone, NR_FREE_PAGES);
 		/* If we have very few page cache pages,
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 05/34] vmscan: clear ZONE_CONGESTED for zone with good watermark
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Shaohua Li <shaohua.li@intel.com>

commit 439423f6894aa0dec22187526827456f5004baed upstream.

Stable note: Not tracked in Bugzilla. kswapd is responsible for clearing
	ZONE_CONGESTED after it balances a zone and this patch fixes a bug
	where that was failing to happen. Without this patch, processes
	can stall in wait_iff_congested unnecessarily. For users, this can
	look like an interactivity stall but some workloads would see it
	as sudden drop in throughput.

ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any
task.  It's possible ZONE_CONGESTED isn't cleared in some cases:

 1. the zone is already balanced just entering balance_pgdat() for
    order-0 because concurrent tasks free memory.  In this case, later
    check will skip the zone as it's balanced so the flag isn't cleared.

 2. high order balance fallbacks to order-0.  quote from Mel: At the
    end of balance_pgdat(), kswapd uses the following logic;

	If reclaiming at high order {
		for each zone {
			if all_unreclaimable
				skip
			if watermark is not met
				order = 0
				loop again

			/* watermark is met */
			clear congested
		}
	}

    i.e. it clears ZONE_CONGESTED if it the zone is balanced.  if not,
    it restarts balancing at order-0.  However, if the higher zones are
    balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as
    that only happens after a zone is shrunk.  This can mean that
    wait_iff_congested() stalls unnecessarily.

This patch makes kswapd clear ZONE_CONGESTED during its initial
highmem->dma scan for zones that are already balanced.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bdfdec3..72340b84 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2456,6 +2456,9 @@ loop_again:
 					high_wmark_pages(zone), 0, 0)) {
 				end_zone = i;
 				break;
+			} else {
+				/* If balanced, clear the congested flag */
+				zone_clear_flag(zone, ZONE_CONGESTED);
 			}
 		}
 		if (i < 0)
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 05/34] vmscan: clear ZONE_CONGESTED for zone with good watermark
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Shaohua Li <shaohua.li@intel.com>

commit 439423f6894aa0dec22187526827456f5004baed upstream.

Stable note: Not tracked in Bugzilla. kswapd is responsible for clearing
	ZONE_CONGESTED after it balances a zone and this patch fixes a bug
	where that was failing to happen. Without this patch, processes
	can stall in wait_iff_congested unnecessarily. For users, this can
	look like an interactivity stall but some workloads would see it
	as sudden drop in throughput.

ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any
task.  It's possible ZONE_CONGESTED isn't cleared in some cases:

 1. the zone is already balanced just entering balance_pgdat() for
    order-0 because concurrent tasks free memory.  In this case, later
    check will skip the zone as it's balanced so the flag isn't cleared.

 2. high order balance fallbacks to order-0.  quote from Mel: At the
    end of balance_pgdat(), kswapd uses the following logic;

	If reclaiming at high order {
		for each zone {
			if all_unreclaimable
				skip
			if watermark is not met
				order = 0
				loop again

			/* watermark is met */
			clear congested
		}
	}

    i.e. it clears ZONE_CONGESTED if it the zone is balanced.  if not,
    it restarts balancing at order-0.  However, if the higher zones are
    balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as
    that only happens after a zone is shrunk.  This can mean that
    wait_iff_congested() stalls unnecessarily.

This patch makes kswapd clear ZONE_CONGESTED during its initial
highmem->dma scan for zones that are already balanced.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bdfdec3..72340b84 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2456,6 +2456,9 @@ loop_again:
 					high_wmark_pages(zone), 0, 0)) {
 				end_zone = i;
 				break;
+			} else {
+				/* If balanced, clear the congested flag */
+				zone_clear_flag(zone, ZONE_CONGESTED);
 			}
 		}
 		if (i < 0)
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 06/34] vmscan: add shrink_slab tracepoints
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Dave Chinner <dchinner@redhat.com>

commit 095760730c1047c69159ce88021a7fa3833502c8 upstream.

Stable note: This patch makes later patches easier to apply but otherwise
	has little to justify it. It is a diagnostic patch that was part
	of a series addressing excessive slab shrinking after GFP_NOFS
	failures. There is detailed information on the series' motivation
	at https://lkml.org/lkml/2011/6/2/42 .

It is impossible to understand what the shrinkers are actually doing
without instrumenting the code, so add a some tracepoints to allow
insight to be gained.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/trace/events/vmscan.h |   77 +++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                   |    8 ++++-
 2 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index b2c33bd..36851f7 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -179,6 +179,83 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re
 	TP_ARGS(nr_reclaimed)
 );
 
+TRACE_EVENT(mm_shrink_slab_start,
+	TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
+		long nr_objects_to_shrink, unsigned long pgs_scanned,
+		unsigned long lru_pgs, unsigned long cache_items,
+		unsigned long long delta, unsigned long total_scan),
+
+	TP_ARGS(shr, sc, nr_objects_to_shrink, pgs_scanned, lru_pgs,
+		cache_items, delta, total_scan),
+
+	TP_STRUCT__entry(
+		__field(struct shrinker *, shr)
+		__field(void *, shrink)
+		__field(long, nr_objects_to_shrink)
+		__field(gfp_t, gfp_flags)
+		__field(unsigned long, pgs_scanned)
+		__field(unsigned long, lru_pgs)
+		__field(unsigned long, cache_items)
+		__field(unsigned long long, delta)
+		__field(unsigned long, total_scan)
+	),
+
+	TP_fast_assign(
+		__entry->shr = shr;
+		__entry->shrink = shr->shrink;
+		__entry->nr_objects_to_shrink = nr_objects_to_shrink;
+		__entry->gfp_flags = sc->gfp_mask;
+		__entry->pgs_scanned = pgs_scanned;
+		__entry->lru_pgs = lru_pgs;
+		__entry->cache_items = cache_items;
+		__entry->delta = delta;
+		__entry->total_scan = total_scan;
+	),
+
+	TP_printk("%pF %p: objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
+		__entry->shrink,
+		__entry->shr,
+		__entry->nr_objects_to_shrink,
+		show_gfp_flags(__entry->gfp_flags),
+		__entry->pgs_scanned,
+		__entry->lru_pgs,
+		__entry->cache_items,
+		__entry->delta,
+		__entry->total_scan)
+);
+
+TRACE_EVENT(mm_shrink_slab_end,
+	TP_PROTO(struct shrinker *shr, int shrinker_retval,
+		long unused_scan_cnt, long new_scan_cnt),
+
+	TP_ARGS(shr, shrinker_retval, unused_scan_cnt, new_scan_cnt),
+
+	TP_STRUCT__entry(
+		__field(struct shrinker *, shr)
+		__field(void *, shrink)
+		__field(long, unused_scan)
+		__field(long, new_scan)
+		__field(int, retval)
+		__field(long, total_scan)
+	),
+
+	TP_fast_assign(
+		__entry->shr = shr;
+		__entry->shrink = shr->shrink;
+		__entry->unused_scan = unused_scan_cnt;
+		__entry->new_scan = new_scan_cnt;
+		__entry->retval = shrinker_retval;
+		__entry->total_scan = new_scan_cnt - unused_scan_cnt;
+	),
+
+	TP_printk("%pF %p: unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
+		__entry->shrink,
+		__entry->shr,
+		__entry->unused_scan,
+		__entry->new_scan,
+		__entry->total_scan,
+		__entry->retval)
+);
 
 DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 72340b84..d875058 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -250,6 +250,7 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		unsigned long long delta;
 		unsigned long total_scan;
 		unsigned long max_pass;
+		int shrink_ret = 0;
 
 		max_pass = do_shrinker_shrink(shrinker, shrink, 0);
 		delta = (4 * nr_pages_scanned) / shrinker->seeks;
@@ -274,9 +275,12 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		total_scan = shrinker->nr;
 		shrinker->nr = 0;
 
+		trace_mm_shrink_slab_start(shrinker, shrink, total_scan,
+					nr_pages_scanned, lru_pages,
+					max_pass, delta, total_scan);
+
 		while (total_scan >= SHRINK_BATCH) {
 			long this_scan = SHRINK_BATCH;
-			int shrink_ret;
 			int nr_before;
 
 			nr_before = do_shrinker_shrink(shrinker, shrink, 0);
@@ -293,6 +297,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		}
 
 		shrinker->nr += total_scan;
+		trace_mm_shrink_slab_end(shrinker, shrink_ret, total_scan,
+					 shrinker->nr);
 	}
 	up_read(&shrinker_rwsem);
 out:
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 06/34] vmscan: add shrink_slab tracepoints
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Dave Chinner <dchinner@redhat.com>

commit 095760730c1047c69159ce88021a7fa3833502c8 upstream.

Stable note: This patch makes later patches easier to apply but otherwise
	has little to justify it. It is a diagnostic patch that was part
	of a series addressing excessive slab shrinking after GFP_NOFS
	failures. There is detailed information on the series' motivation
	at https://lkml.org/lkml/2011/6/2/42 .

It is impossible to understand what the shrinkers are actually doing
without instrumenting the code, so add a some tracepoints to allow
insight to be gained.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/trace/events/vmscan.h |   77 +++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                   |    8 ++++-
 2 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index b2c33bd..36851f7 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -179,6 +179,83 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re
 	TP_ARGS(nr_reclaimed)
 );
 
+TRACE_EVENT(mm_shrink_slab_start,
+	TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
+		long nr_objects_to_shrink, unsigned long pgs_scanned,
+		unsigned long lru_pgs, unsigned long cache_items,
+		unsigned long long delta, unsigned long total_scan),
+
+	TP_ARGS(shr, sc, nr_objects_to_shrink, pgs_scanned, lru_pgs,
+		cache_items, delta, total_scan),
+
+	TP_STRUCT__entry(
+		__field(struct shrinker *, shr)
+		__field(void *, shrink)
+		__field(long, nr_objects_to_shrink)
+		__field(gfp_t, gfp_flags)
+		__field(unsigned long, pgs_scanned)
+		__field(unsigned long, lru_pgs)
+		__field(unsigned long, cache_items)
+		__field(unsigned long long, delta)
+		__field(unsigned long, total_scan)
+	),
+
+	TP_fast_assign(
+		__entry->shr = shr;
+		__entry->shrink = shr->shrink;
+		__entry->nr_objects_to_shrink = nr_objects_to_shrink;
+		__entry->gfp_flags = sc->gfp_mask;
+		__entry->pgs_scanned = pgs_scanned;
+		__entry->lru_pgs = lru_pgs;
+		__entry->cache_items = cache_items;
+		__entry->delta = delta;
+		__entry->total_scan = total_scan;
+	),
+
+	TP_printk("%pF %p: objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
+		__entry->shrink,
+		__entry->shr,
+		__entry->nr_objects_to_shrink,
+		show_gfp_flags(__entry->gfp_flags),
+		__entry->pgs_scanned,
+		__entry->lru_pgs,
+		__entry->cache_items,
+		__entry->delta,
+		__entry->total_scan)
+);
+
+TRACE_EVENT(mm_shrink_slab_end,
+	TP_PROTO(struct shrinker *shr, int shrinker_retval,
+		long unused_scan_cnt, long new_scan_cnt),
+
+	TP_ARGS(shr, shrinker_retval, unused_scan_cnt, new_scan_cnt),
+
+	TP_STRUCT__entry(
+		__field(struct shrinker *, shr)
+		__field(void *, shrink)
+		__field(long, unused_scan)
+		__field(long, new_scan)
+		__field(int, retval)
+		__field(long, total_scan)
+	),
+
+	TP_fast_assign(
+		__entry->shr = shr;
+		__entry->shrink = shr->shrink;
+		__entry->unused_scan = unused_scan_cnt;
+		__entry->new_scan = new_scan_cnt;
+		__entry->retval = shrinker_retval;
+		__entry->total_scan = new_scan_cnt - unused_scan_cnt;
+	),
+
+	TP_printk("%pF %p: unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
+		__entry->shrink,
+		__entry->shr,
+		__entry->unused_scan,
+		__entry->new_scan,
+		__entry->total_scan,
+		__entry->retval)
+);
 
 DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 72340b84..d875058 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -250,6 +250,7 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		unsigned long long delta;
 		unsigned long total_scan;
 		unsigned long max_pass;
+		int shrink_ret = 0;
 
 		max_pass = do_shrinker_shrink(shrinker, shrink, 0);
 		delta = (4 * nr_pages_scanned) / shrinker->seeks;
@@ -274,9 +275,12 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		total_scan = shrinker->nr;
 		shrinker->nr = 0;
 
+		trace_mm_shrink_slab_start(shrinker, shrink, total_scan,
+					nr_pages_scanned, lru_pages,
+					max_pass, delta, total_scan);
+
 		while (total_scan >= SHRINK_BATCH) {
 			long this_scan = SHRINK_BATCH;
-			int shrink_ret;
 			int nr_before;
 
 			nr_before = do_shrinker_shrink(shrinker, shrink, 0);
@@ -293,6 +297,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		}
 
 		shrinker->nr += total_scan;
+		trace_mm_shrink_slab_end(shrinker, shrink_ret, total_scan,
+					 shrinker->nr);
 	}
 	up_read(&shrinker_rwsem);
 out:
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 07/34] vmscan: shrinker->nr updates race and go wrong
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Dave Chinner <dchinner@redhat.com>

commit acf92b485cccf028177f46918e045c0c4e80ee10 upstream.

Stable note: Not tracked in Bugzilla. This patch reduces excessive
	reclaim of slab objects reducing the amount of information
	that has to be brought back in from disk.

shrink_slab() allows shrinkers to be called in parallel so the
struct shrinker can be updated concurrently. It does not provide any
exclusion for such updates, so we can get the shrinker->nr value
increasing or decreasing incorrectly.

As a result, when a shrinker repeatedly returns a value of -1 (e.g.
a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
sometimes updating with the scan count that wasn't used, sometimes
losing it altogether. Worse is when a shrinker does work and that
update is lost due to racy updates, which means the shrinker will do
the work again!

Fix this by making the total_scan calculations independent of
shrinker->nr, and making the shrinker->nr updates atomic w.r.t. to
other updates via cmpxchg loops.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   45 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d875058..31b551e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -251,17 +251,29 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		unsigned long total_scan;
 		unsigned long max_pass;
 		int shrink_ret = 0;
+		long nr;
+		long new_nr;
 
+		/*
+		 * copy the current shrinker scan count into a local variable
+		 * and zero it so that other concurrent shrinker invocations
+		 * don't also do this scanning work.
+		 */
+		do {
+			nr = shrinker->nr;
+		} while (cmpxchg(&shrinker->nr, nr, 0) != nr);
+
+		total_scan = nr;
 		max_pass = do_shrinker_shrink(shrinker, shrink, 0);
 		delta = (4 * nr_pages_scanned) / shrinker->seeks;
 		delta *= max_pass;
 		do_div(delta, lru_pages + 1);
-		shrinker->nr += delta;
-		if (shrinker->nr < 0) {
+		total_scan += delta;
+		if (total_scan < 0) {
 			printk(KERN_ERR "shrink_slab: %pF negative objects to "
 			       "delete nr=%ld\n",
-			       shrinker->shrink, shrinker->nr);
-			shrinker->nr = max_pass;
+			       shrinker->shrink, total_scan);
+			total_scan = max_pass;
 		}
 
 		/*
@@ -269,13 +281,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		 * never try to free more than twice the estimate number of
 		 * freeable entries.
 		 */
-		if (shrinker->nr > max_pass * 2)
-			shrinker->nr = max_pass * 2;
-
-		total_scan = shrinker->nr;
-		shrinker->nr = 0;
+		if (total_scan > max_pass * 2)
+			total_scan = max_pass * 2;
 
-		trace_mm_shrink_slab_start(shrinker, shrink, total_scan,
+		trace_mm_shrink_slab_start(shrinker, shrink, nr,
 					nr_pages_scanned, lru_pages,
 					max_pass, delta, total_scan);
 
@@ -296,9 +305,19 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 			cond_resched();
 		}
 
-		shrinker->nr += total_scan;
-		trace_mm_shrink_slab_end(shrinker, shrink_ret, total_scan,
-					 shrinker->nr);
+		/*
+		 * move the unused scan count back into the shrinker in a
+		 * manner that handles concurrent updates. If we exhausted the
+		 * scan, there is no need to do an update.
+		 */
+		do {
+			nr = shrinker->nr;
+			new_nr = total_scan + nr;
+			if (total_scan <= 0)
+				break;
+		} while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);
+
+		trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
 	}
 	up_read(&shrinker_rwsem);
 out:
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 07/34] vmscan: shrinker->nr updates race and go wrong
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Dave Chinner <dchinner@redhat.com>

commit acf92b485cccf028177f46918e045c0c4e80ee10 upstream.

Stable note: Not tracked in Bugzilla. This patch reduces excessive
	reclaim of slab objects reducing the amount of information
	that has to be brought back in from disk.

shrink_slab() allows shrinkers to be called in parallel so the
struct shrinker can be updated concurrently. It does not provide any
exclusion for such updates, so we can get the shrinker->nr value
increasing or decreasing incorrectly.

As a result, when a shrinker repeatedly returns a value of -1 (e.g.
a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
sometimes updating with the scan count that wasn't used, sometimes
losing it altogether. Worse is when a shrinker does work and that
update is lost due to racy updates, which means the shrinker will do
the work again!

Fix this by making the total_scan calculations independent of
shrinker->nr, and making the shrinker->nr updates atomic w.r.t. to
other updates via cmpxchg loops.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   45 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d875058..31b551e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -251,17 +251,29 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		unsigned long total_scan;
 		unsigned long max_pass;
 		int shrink_ret = 0;
+		long nr;
+		long new_nr;
 
+		/*
+		 * copy the current shrinker scan count into a local variable
+		 * and zero it so that other concurrent shrinker invocations
+		 * don't also do this scanning work.
+		 */
+		do {
+			nr = shrinker->nr;
+		} while (cmpxchg(&shrinker->nr, nr, 0) != nr);
+
+		total_scan = nr;
 		max_pass = do_shrinker_shrink(shrinker, shrink, 0);
 		delta = (4 * nr_pages_scanned) / shrinker->seeks;
 		delta *= max_pass;
 		do_div(delta, lru_pages + 1);
-		shrinker->nr += delta;
-		if (shrinker->nr < 0) {
+		total_scan += delta;
+		if (total_scan < 0) {
 			printk(KERN_ERR "shrink_slab: %pF negative objects to "
 			       "delete nr=%ld\n",
-			       shrinker->shrink, shrinker->nr);
-			shrinker->nr = max_pass;
+			       shrinker->shrink, total_scan);
+			total_scan = max_pass;
 		}
 
 		/*
@@ -269,13 +281,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		 * never try to free more than twice the estimate number of
 		 * freeable entries.
 		 */
-		if (shrinker->nr > max_pass * 2)
-			shrinker->nr = max_pass * 2;
-
-		total_scan = shrinker->nr;
-		shrinker->nr = 0;
+		if (total_scan > max_pass * 2)
+			total_scan = max_pass * 2;
 
-		trace_mm_shrink_slab_start(shrinker, shrink, total_scan,
+		trace_mm_shrink_slab_start(shrinker, shrink, nr,
 					nr_pages_scanned, lru_pages,
 					max_pass, delta, total_scan);
 
@@ -296,9 +305,19 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 			cond_resched();
 		}
 
-		shrinker->nr += total_scan;
-		trace_mm_shrink_slab_end(shrinker, shrink_ret, total_scan,
-					 shrinker->nr);
+		/*
+		 * move the unused scan count back into the shrinker in a
+		 * manner that handles concurrent updates. If we exhausted the
+		 * scan, there is no need to do an update.
+		 */
+		do {
+			nr = shrinker->nr;
+			new_nr = total_scan + nr;
+			if (total_scan <= 0)
+				break;
+		} while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);
+
+		trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
 	}
 	up_read(&shrinker_rwsem);
 out:
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 08/34] vmscan: reduce wind up shrinker->nr when shrinker can't do work
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Dave Chinner <dchinner@redhat.com>

commit 3567b59aa80ac4417002bf58e35dce5c777d4164 upstream.

Stable note: Not tracked in Bugzilla. This patch reduces excessive
	reclaim of slab objects reducing the amount of information that
	has to be brought back in from disk. The third and fourth paragram
	in the series describes the impact.

When a shrinker returns -1 to shrink_slab() to indicate it cannot do
any work given the current memory reclaim requirements, it adds the
entire total_scan count to shrinker->nr. The idea behind this is that
when the shrinker is next called and can do work, it will do the work
of the previously aborted shrinker call as well.

However, if a filesystem is doing lots of allocation with GFP_NOFS
set, then we get many, many more aborts from the shrinkers than we
do successful calls. The result is that shrinker->nr winds up to
it's maximum permissible value (twice the current cache size) and
then when the next shrinker call that can do work is issued, it
has enough scan count built up to free the entire cache twice over.

This manifests itself in the cache going from full to empty in a
matter of seconds, even when only a small part of the cache is
needed to be emptied to free sufficient memory.

Under metadata intensive workloads on ext4 and XFS, I'm seeing the
VFS caches increase memory consumption up to 75% of memory (no page
cache pressure) over a period of 30-60s, and then the shrinker
empties them down to zero in the space of 2-3s. This cycle repeats
over and over again, with the shrinker completely trashing the inode
and dentry caches every minute or so the workload continues.

This behaviour was made obvious by the shrink_slab tracepoints added
earlier in the series, and made worse by the patch that corrected
the concurrent accounting of shrinker->nr.

To avoid this problem, stop repeated small increments of the total
scan value from winding shrinker->nr up to a value that can cause
the entire cache to be freed. We still need to allow it to wind up,
so use the delta as the "large scan" threshold check - if the delta
is more than a quarter of the entire cache size, then it is a large
scan and allowed to cause lots of windup because we are clearly
needing to free lots of memory.

If it isn't a large scan then limit the total scan to half the size
of the cache so that windup never increases to consume the whole
cache. Reducing the total scan limit further does not allow enough
wind-up to maintain the current levels of performance, whilst a
higher threshold does not prevent the windup from freeing the entire
cache under sustained workloads.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 31b551e..8ca1cd5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -277,6 +277,21 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		}
 
 		/*
+		 * We need to avoid excessive windup on filesystem shrinkers
+		 * due to large numbers of GFP_NOFS allocations causing the
+		 * shrinkers to return -1 all the time. This results in a large
+		 * nr being built up so when a shrink that can do some work
+		 * comes along it empties the entire cache due to nr >>>
+		 * max_pass.  This is bad for sustaining a working set in
+		 * memory.
+		 *
+		 * Hence only allow the shrinker to scan the entire cache when
+		 * a large delta change is calculated directly.
+		 */
+		if (delta < max_pass / 4)
+			total_scan = min(total_scan, max_pass / 2);
+
+		/*
 		 * Avoid risking looping forever due to too large nr value:
 		 * never try to free more than twice the estimate number of
 		 * freeable entries.
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 08/34] vmscan: reduce wind up shrinker->nr when shrinker can't do work
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Dave Chinner <dchinner@redhat.com>

commit 3567b59aa80ac4417002bf58e35dce5c777d4164 upstream.

Stable note: Not tracked in Bugzilla. This patch reduces excessive
	reclaim of slab objects reducing the amount of information that
	has to be brought back in from disk. The third and fourth paragram
	in the series describes the impact.

When a shrinker returns -1 to shrink_slab() to indicate it cannot do
any work given the current memory reclaim requirements, it adds the
entire total_scan count to shrinker->nr. The idea behind this is that
when the shrinker is next called and can do work, it will do the work
of the previously aborted shrinker call as well.

However, if a filesystem is doing lots of allocation with GFP_NOFS
set, then we get many, many more aborts from the shrinkers than we
do successful calls. The result is that shrinker->nr winds up to
it's maximum permissible value (twice the current cache size) and
then when the next shrinker call that can do work is issued, it
has enough scan count built up to free the entire cache twice over.

This manifests itself in the cache going from full to empty in a
matter of seconds, even when only a small part of the cache is
needed to be emptied to free sufficient memory.

Under metadata intensive workloads on ext4 and XFS, I'm seeing the
VFS caches increase memory consumption up to 75% of memory (no page
cache pressure) over a period of 30-60s, and then the shrinker
empties them down to zero in the space of 2-3s. This cycle repeats
over and over again, with the shrinker completely trashing the inode
and dentry caches every minute or so the workload continues.

This behaviour was made obvious by the shrink_slab tracepoints added
earlier in the series, and made worse by the patch that corrected
the concurrent accounting of shrinker->nr.

To avoid this problem, stop repeated small increments of the total
scan value from winding shrinker->nr up to a value that can cause
the entire cache to be freed. We still need to allow it to wind up,
so use the delta as the "large scan" threshold check - if the delta
is more than a quarter of the entire cache size, then it is a large
scan and allowed to cause lots of windup because we are clearly
needing to free lots of memory.

If it isn't a large scan then limit the total scan to half the size
of the cache so that windup never increases to consume the whole
cache. Reducing the total scan limit further does not allow enough
wind-up to maintain the current levels of performance, whilst a
higher threshold does not prevent the windup from freeing the entire
cache under sustained workloads.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 31b551e..8ca1cd5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -277,6 +277,21 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		}
 
 		/*
+		 * We need to avoid excessive windup on filesystem shrinkers
+		 * due to large numbers of GFP_NOFS allocations causing the
+		 * shrinkers to return -1 all the time. This results in a large
+		 * nr being built up so when a shrink that can do some work
+		 * comes along it empties the entire cache due to nr >>>
+		 * max_pass.  This is bad for sustaining a working set in
+		 * memory.
+		 *
+		 * Hence only allow the shrinker to scan the entire cache when
+		 * a large delta change is calculated directly.
+		 */
+		if (delta < max_pass / 4)
+			total_scan = min(total_scan, max_pass / 2);
+
+		/*
 		 * Avoid risking looping forever due to too large nr value:
 		 * never try to free more than twice the estimate number of
 		 * freeable entries.
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 09/34] mm: limit direct reclaim for higher order allocations
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

commit e0887c19b2daa140f20ca8104bdc5740f39dbb86 upstream.

Stable note: Not tracked on Bugzilla. THP and compaction was found to
	aggressively reclaim pages and stall systems under different
	situations that was addressed piecemeal over time. Paragraph
	3 of this changelog is the motivation for this patch.

When suffering from memory fragmentation due to unfreeable pages, THP page
faults will repeatedly try to compact memory.  Due to the unfreeable
pages, compaction fails.

Needless to say, at that point page reclaim also fails to create free
contiguous 2MB areas.  However, that doesn't stop the current code from
trying, over and over again, and freeing a minimum of 4MB (2UL <<
sc->order pages) at every single invocation.

This resulted in my 12GB system having 2-3GB free memory, a corresponding
amount of used swap and very sluggish response times.

This can be avoided by having the direct reclaim code not reclaim from
zones that already have plenty of free memory available for compaction.

If compaction still fails due to unmovable memory, doing additional
reclaim will only hurt the system, not help.

[jweiner@redhat.com: change comment to explain the order check]
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8ca1cd5..d11b6c4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2059,6 +2059,22 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 				continue;
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
+			if (COMPACTION_BUILD) {
+				/*
+				 * If we already have plenty of memory
+				 * free for compaction, don't free any
+				 * more.  Even though compaction is
+				 * invoked for any non-zero order,
+				 * only frequent costly order
+				 * reclamation is disruptive enough to
+				 * become a noticable problem, like
+				 * transparent huge page allocations.
+				 */
+				if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
+					(compaction_suitable(zone, sc->order) ||
+					 compaction_deferred(zone)))
+					continue;
+			}
 			/*
 			 * This steals pages from memory cgroups over softlimit
 			 * and returns the number of reclaimed pages and
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 09/34] mm: limit direct reclaim for higher order allocations
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

commit e0887c19b2daa140f20ca8104bdc5740f39dbb86 upstream.

Stable note: Not tracked on Bugzilla. THP and compaction was found to
	aggressively reclaim pages and stall systems under different
	situations that was addressed piecemeal over time. Paragraph
	3 of this changelog is the motivation for this patch.

When suffering from memory fragmentation due to unfreeable pages, THP page
faults will repeatedly try to compact memory.  Due to the unfreeable
pages, compaction fails.

Needless to say, at that point page reclaim also fails to create free
contiguous 2MB areas.  However, that doesn't stop the current code from
trying, over and over again, and freeing a minimum of 4MB (2UL <<
sc->order pages) at every single invocation.

This resulted in my 12GB system having 2-3GB free memory, a corresponding
amount of used swap and very sluggish response times.

This can be avoided by having the direct reclaim code not reclaim from
zones that already have plenty of free memory available for compaction.

If compaction still fails due to unmovable memory, doing additional
reclaim will only hurt the system, not help.

[jweiner@redhat.com: change comment to explain the order check]
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8ca1cd5..d11b6c4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2059,6 +2059,22 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 				continue;
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
+			if (COMPACTION_BUILD) {
+				/*
+				 * If we already have plenty of memory
+				 * free for compaction, don't free any
+				 * more.  Even though compaction is
+				 * invoked for any non-zero order,
+				 * only frequent costly order
+				 * reclamation is disruptive enough to
+				 * become a noticable problem, like
+				 * transparent huge page allocations.
+				 */
+				if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
+					(compaction_suitable(zone, sc->order) ||
+					 compaction_deferred(zone)))
+					continue;
+			}
 			/*
 			 * This steals pages from memory cgroups over softlimit
 			 * and returns the number of reclaimed pages and
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 10/34] mm: Abort reclaim/compaction if compaction can proceed
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit e0c23279c9f800c403f37511484d9014ac83adec upstream.

Stable note: Not tracked on Bugzilla. THP and compaction was found to
	aggressively reclaim pages and stall systems under different
	situations that was addressed piecemeal over time.

If compaction can proceed, shrink_zones() stops doing any work but its
callers still call shrink_slab() which raises the priority and potentially
sleeps.  This is unnecessary and wasteful so this patch aborts direct
reclaim/compaction entirely if compaction can proceed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/vmscan.c |   32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d11b6c4..65388ac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2037,14 +2037,19 @@ restart:
  *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
+ *
+ * This function returns true if a zone is being reclaimed for a costly
+ * high-order allocation and compaction is either ready to begin or deferred.
+ * This indicates to the caller that it should retry the allocation or fail.
  */
-static void shrink_zones(int priority, struct zonelist *zonelist,
+static bool shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
 	struct zoneref *z;
 	struct zone *zone;
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
+	bool should_abort_reclaim = false;
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2061,19 +2066,20 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 				continue;	/* Let kswapd poll it */
 			if (COMPACTION_BUILD) {
 				/*
-				 * If we already have plenty of memory
-				 * free for compaction, don't free any
-				 * more.  Even though compaction is
-				 * invoked for any non-zero order,
-				 * only frequent costly order
-				 * reclamation is disruptive enough to
-				 * become a noticable problem, like
-				 * transparent huge page allocations.
+				 * If we already have plenty of memory free for
+				 * compaction in this zone, don't free any more.
+				 * Even though compaction is invoked for any
+				 * non-zero order, only frequent costly order
+				 * reclamation is disruptive enough to become a
+				 * noticable problem, like transparent huge page
+				 * allocations.
 				 */
 				if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
 					(compaction_suitable(zone, sc->order) ||
-					 compaction_deferred(zone)))
+					 compaction_deferred(zone))) {
+					should_abort_reclaim = true;
 					continue;
+				}
 			}
 			/*
 			 * This steals pages from memory cgroups over softlimit
@@ -2092,6 +2098,8 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 
 		shrink_zone(priority, zone, sc);
 	}
+
+	return should_abort_reclaim;
 }
 
 static bool zone_reclaimable(struct zone *zone)
@@ -2156,7 +2164,9 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		sc->nr_scanned = 0;
 		if (!priority)
 			disable_swap_token(sc->mem_cgroup);
-		shrink_zones(priority, zonelist, sc);
+		if (shrink_zones(priority, zonelist, sc))
+			break;
+
 		/*
 		 * Don't shrink slabs when reclaiming memory from
 		 * over limit cgroups
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 10/34] mm: Abort reclaim/compaction if compaction can proceed
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit e0c23279c9f800c403f37511484d9014ac83adec upstream.

Stable note: Not tracked on Bugzilla. THP and compaction was found to
	aggressively reclaim pages and stall systems under different
	situations that was addressed piecemeal over time.

If compaction can proceed, shrink_zones() stops doing any work but its
callers still call shrink_slab() which raises the priority and potentially
sleeps.  This is unnecessary and wasteful so this patch aborts direct
reclaim/compaction entirely if compaction can proceed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/vmscan.c |   32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d11b6c4..65388ac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2037,14 +2037,19 @@ restart:
  *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
+ *
+ * This function returns true if a zone is being reclaimed for a costly
+ * high-order allocation and compaction is either ready to begin or deferred.
+ * This indicates to the caller that it should retry the allocation or fail.
  */
-static void shrink_zones(int priority, struct zonelist *zonelist,
+static bool shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
 	struct zoneref *z;
 	struct zone *zone;
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
+	bool should_abort_reclaim = false;
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2061,19 +2066,20 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 				continue;	/* Let kswapd poll it */
 			if (COMPACTION_BUILD) {
 				/*
-				 * If we already have plenty of memory
-				 * free for compaction, don't free any
-				 * more.  Even though compaction is
-				 * invoked for any non-zero order,
-				 * only frequent costly order
-				 * reclamation is disruptive enough to
-				 * become a noticable problem, like
-				 * transparent huge page allocations.
+				 * If we already have plenty of memory free for
+				 * compaction in this zone, don't free any more.
+				 * Even though compaction is invoked for any
+				 * non-zero order, only frequent costly order
+				 * reclamation is disruptive enough to become a
+				 * noticable problem, like transparent huge page
+				 * allocations.
 				 */
 				if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
 					(compaction_suitable(zone, sc->order) ||
-					 compaction_deferred(zone)))
+					 compaction_deferred(zone))) {
+					should_abort_reclaim = true;
 					continue;
+				}
 			}
 			/*
 			 * This steals pages from memory cgroups over softlimit
@@ -2092,6 +2098,8 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 
 		shrink_zone(priority, zone, sc);
 	}
+
+	return should_abort_reclaim;
 }
 
 static bool zone_reclaimable(struct zone *zone)
@@ -2156,7 +2164,9 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		sc->nr_scanned = 0;
 		if (!priority)
 			disable_swap_token(sc->mem_cgroup);
-		shrink_zones(priority, zonelist, sc);
+		if (shrink_zones(priority, zonelist, sc))
+			break;
+
 		/*
 		 * Don't shrink slabs when reclaiming memory from
 		 * over limit cgroups
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 11/34] mm: compaction: trivial clean up in acct_isolated()
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Minchan Kim <minchan.kim@gmail.com>

commit b9e84ac1536d35aee03b2601f19694949f0bd506 upstream.

Stable note: Not tracked in Bugzilla. This patch makes later patches
	easier to apply but has no other impact.

acct_isolated of compaction uses page_lru_base_type which returns
only base type of LRU list so it never returns LRU_ACTIVE_ANON or
LRU_ACTIVE_FILE.  In addition, cc->nr_[anon|file] is used in only
acct_isolated so it doesn't have fields in compact_control.

This patch removes fields from compact_control and makes clear function of
acct_issolated which counts the number of anon|file pages isolated.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/compaction.c |   18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index c4bc5ac..d8c023e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -35,10 +35,6 @@ struct compact_control {
 	unsigned long migrate_pfn;	/* isolate_migratepages search base */
 	bool sync;			/* Synchronous migration */
 
-	/* Account for isolated anon and file pages */
-	unsigned long nr_anon;
-	unsigned long nr_file;
-
 	unsigned int order;		/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
@@ -223,17 +219,13 @@ static void isolate_freepages(struct zone *zone,
 static void acct_isolated(struct zone *zone, struct compact_control *cc)
 {
 	struct page *page;
-	unsigned int count[NR_LRU_LISTS] = { 0, };
+	unsigned int count[2] = { 0, };
 
-	list_for_each_entry(page, &cc->migratepages, lru) {
-		int lru = page_lru_base_type(page);
-		count[lru]++;
-	}
+	list_for_each_entry(page, &cc->migratepages, lru)
+		count[!!page_is_file_cache(page)]++;
 
-	cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
-	cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
-	__mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
 }
 
 /* Similar to reclaim, but different enough that they don't share logic */
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 11/34] mm: compaction: trivial clean up in acct_isolated()
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Minchan Kim <minchan.kim@gmail.com>

commit b9e84ac1536d35aee03b2601f19694949f0bd506 upstream.

Stable note: Not tracked in Bugzilla. This patch makes later patches
	easier to apply but has no other impact.

acct_isolated of compaction uses page_lru_base_type which returns
only base type of LRU list so it never returns LRU_ACTIVE_ANON or
LRU_ACTIVE_FILE.  In addition, cc->nr_[anon|file] is used in only
acct_isolated so it doesn't have fields in compact_control.

This patch removes fields from compact_control and makes clear function of
acct_issolated which counts the number of anon|file pages isolated.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/compaction.c |   18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index c4bc5ac..d8c023e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -35,10 +35,6 @@ struct compact_control {
 	unsigned long migrate_pfn;	/* isolate_migratepages search base */
 	bool sync;			/* Synchronous migration */
 
-	/* Account for isolated anon and file pages */
-	unsigned long nr_anon;
-	unsigned long nr_file;
-
 	unsigned int order;		/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
@@ -223,17 +219,13 @@ static void isolate_freepages(struct zone *zone,
 static void acct_isolated(struct zone *zone, struct compact_control *cc)
 {
 	struct page *page;
-	unsigned int count[NR_LRU_LISTS] = { 0, };
+	unsigned int count[2] = { 0, };
 
-	list_for_each_entry(page, &cc->migratepages, lru) {
-		int lru = page_lru_base_type(page);
-		count[lru]++;
-	}
+	list_for_each_entry(page, &cc->migratepages, lru)
+		count[!!page_is_file_cache(page)]++;
 
-	cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
-	cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
-	__mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
 }
 
 /* Similar to reclaim, but different enough that they don't share logic */
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 12/34] mm: change isolate mode from #define to bitwise type
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Minchan Kim <minchan.kim@gmail.com>

commit 4356f21d09283dc6d39a6f7287a65ddab61e2808 upstream.

Stable note: Not tracked in Bugzilla. This patch makes later patches
	easier to apply but has no other impact.

Change ISOLATE_XXX macro with bitwise isolate_mode_t type.  Normally,
macro isn't recommended as it's type-unsafe and making debugging harder as
symbol cannot be passed throught to the debugger.

Quote from Johannes
" Hmm, it would probably be cleaner to fully convert the isolation mode
into independent flags.  INACTIVE, ACTIVE, BOTH is currently a
tri-state among flags, which is a bit ugly."

This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |    8 ++---
 include/linux/memcontrol.h                         |    3 +-
 include/linux/mmzone.h                             |    8 +++++
 include/linux/swap.h                               |    7 +---
 include/trace/events/vmscan.h                      |    8 ++---
 mm/compaction.c                                    |    3 +-
 mm/memcontrol.c                                    |    3 +-
 mm/vmscan.c                                        |   37 +++++++++++---------
 8 files changed, 43 insertions(+), 34 deletions(-)

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
index 12cecc8..4a37c47 100644
--- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -379,10 +379,10 @@ EVENT_PROCESS:
 
 			# To closer match vmstat scanning statistics, only count isolate_both
 			# and isolate_inactive as scanning. isolate_active is rotation
-			# isolate_inactive == 0
-			# isolate_active   == 1
-			# isolate_both     == 2
-			if ($isolate_mode != 1) {
+			# isolate_inactive == 1
+			# isolate_active   == 2
+			# isolate_both     == 3
+			if ($isolate_mode != 2) {
 				$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
 			}
 			$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 313a00e..4a8da84 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,7 +35,8 @@ enum mem_cgroup_page_stat_item {
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
-					int mode, struct zone *z,
+					isolate_mode_t mode,
+					struct zone *z,
 					struct mem_cgroup *mem_cont,
 					int active, int file);
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9f7c3eb..5a5286d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -158,6 +158,14 @@ static inline int is_unevictable_lru(enum lru_list l)
 	return (l == LRU_UNEVICTABLE);
 }
 
+/* Isolate inactive pages */
+#define ISOLATE_INACTIVE	((__force isolate_mode_t)0x1)
+/* Isolate active pages */
+#define ISOLATE_ACTIVE		((__force isolate_mode_t)0x2)
+
+/* LRU Isolation modes. */
+typedef unsigned __bitwise__ isolate_mode_t;
+
 enum zone_watermarks {
 	WMARK_MIN,
 	WMARK_LOW,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a273468..e73799d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -243,11 +243,6 @@ static inline void lru_cache_add_file(struct page *page)
 	__lru_cache_add(page, LRU_INACTIVE_FILE);
 }
 
-/* LRU Isolation modes. */
-#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
-#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
-#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
-
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
@@ -259,7 +254,7 @@ extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						unsigned int swappiness,
 						struct zone *zone,
 						unsigned long *nr_scanned);
-extern int __isolate_lru_page(struct page *page, int mode, int file);
+extern int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 36851f7..edc4b3d 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -266,7 +266,7 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 		unsigned long nr_lumpy_taken,
 		unsigned long nr_lumpy_dirty,
 		unsigned long nr_lumpy_failed,
-		int isolate_mode),
+		isolate_mode_t isolate_mode),
 
 	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode),
 
@@ -278,7 +278,7 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 		__field(unsigned long, nr_lumpy_taken)
 		__field(unsigned long, nr_lumpy_dirty)
 		__field(unsigned long, nr_lumpy_failed)
-		__field(int, isolate_mode)
+		__field(isolate_mode_t, isolate_mode)
 	),
 
 	TP_fast_assign(
@@ -312,7 +312,7 @@ DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_lru_isolate,
 		unsigned long nr_lumpy_taken,
 		unsigned long nr_lumpy_dirty,
 		unsigned long nr_lumpy_failed,
-		int isolate_mode),
+		isolate_mode_t isolate_mode),
 
 	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode)
 
@@ -327,7 +327,7 @@ DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_memcg_isolate,
 		unsigned long nr_lumpy_taken,
 		unsigned long nr_lumpy_dirty,
 		unsigned long nr_lumpy_failed,
-		int isolate_mode),
+		isolate_mode_t isolate_mode),
 
 	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode)
 
diff --git a/mm/compaction.c b/mm/compaction.c
index d8c023e..4fbbbd0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -371,7 +371,8 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 		}
 
 		/* Try isolate the page */
-		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) != 0)
+		if (__isolate_lru_page(page,
+				ISOLATE_ACTIVE|ISOLATE_INACTIVE, 0) != 0)
 			continue;
 
 		VM_BUG_ON(PageTransCompound(page));
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ffb99b4..57cdf5a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1251,7 +1251,8 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
-					int mode, struct zone *z,
+					isolate_mode_t mode,
+					struct zone *z,
 					struct mem_cgroup *mem_cont,
 					int active, int file)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 65388ac..4bb2010 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1012,23 +1012,27 @@ keep_lumpy:
  *
  * returns 0 on success, -ve errno on failure.
  */
-int __isolate_lru_page(struct page *page, int mode, int file)
+int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file)
 {
+	bool all_lru_mode;
 	int ret = -EINVAL;
 
 	/* Only take pages on the LRU. */
 	if (!PageLRU(page))
 		return ret;
 
+	all_lru_mode = (mode & (ISOLATE_ACTIVE|ISOLATE_INACTIVE)) ==
+		(ISOLATE_ACTIVE|ISOLATE_INACTIVE);
+
 	/*
 	 * When checking the active state, we need to be sure we are
 	 * dealing with comparible boolean values.  Take the logical not
 	 * of each.
 	 */
-	if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
+	if (!all_lru_mode && !PageActive(page) != !(mode & ISOLATE_ACTIVE))
 		return ret;
 
-	if (mode != ISOLATE_BOTH && page_is_file_cache(page) != file)
+	if (!all_lru_mode && !!page_is_file_cache(page) != file)
 		return ret;
 
 	/*
@@ -1076,7 +1080,8 @@ int __isolate_lru_page(struct page *page, int mode, int file)
  */
 static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		struct list_head *src, struct list_head *dst,
-		unsigned long *scanned, int order, int mode, int file)
+		unsigned long *scanned, int order, isolate_mode_t mode,
+		int file)
 {
 	unsigned long nr_taken = 0;
 	unsigned long nr_lumpy_taken = 0;
@@ -1201,8 +1206,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 static unsigned long isolate_pages_global(unsigned long nr,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
-					int mode, struct zone *z,
-					int active, int file)
+					isolate_mode_t mode,
+					struct zone *z,	int active, int file)
 {
 	int lru = LRU_BASE;
 	if (active)
@@ -1448,6 +1453,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	unsigned long nr_taken;
 	unsigned long nr_anon;
 	unsigned long nr_file;
+	isolate_mode_t reclaim_mode = ISOLATE_INACTIVE;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1458,15 +1464,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	}
 
 	set_reclaim_mode(priority, sc, false);
+	if (sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM)
+		reclaim_mode |= ISOLATE_ACTIVE;
+
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 
 	if (scanning_global_lru(sc)) {
-		nr_taken = isolate_pages_global(nr_to_scan,
-			&page_list, &nr_scanned, sc->order,
-			sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
-					ISOLATE_BOTH : ISOLATE_INACTIVE,
-			zone, 0, file);
+		nr_taken = isolate_pages_global(nr_to_scan, &page_list,
+			&nr_scanned, sc->order, reclaim_mode, zone, 0, file);
 		zone->pages_scanned += nr_scanned;
 		if (current_is_kswapd())
 			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
@@ -1475,12 +1481,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 			__count_zone_vm_events(PGSCAN_DIRECT, zone,
 					       nr_scanned);
 	} else {
-		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
-			&page_list, &nr_scanned, sc->order,
-			sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
-					ISOLATE_BOTH : ISOLATE_INACTIVE,
-			zone, sc->mem_cgroup,
-			0, file);
+		nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list,
+			&nr_scanned, sc->order, reclaim_mode, zone,
+			sc->mem_cgroup, 0, file);
 		/*
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 12/34] mm: change isolate mode from #define to bitwise type
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Minchan Kim <minchan.kim@gmail.com>

commit 4356f21d09283dc6d39a6f7287a65ddab61e2808 upstream.

Stable note: Not tracked in Bugzilla. This patch makes later patches
	easier to apply but has no other impact.

Change ISOLATE_XXX macro with bitwise isolate_mode_t type.  Normally,
macro isn't recommended as it's type-unsafe and making debugging harder as
symbol cannot be passed throught to the debugger.

Quote from Johannes
" Hmm, it would probably be cleaner to fully convert the isolation mode
into independent flags.  INACTIVE, ACTIVE, BOTH is currently a
tri-state among flags, which is a bit ugly."

This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |    8 ++---
 include/linux/memcontrol.h                         |    3 +-
 include/linux/mmzone.h                             |    8 +++++
 include/linux/swap.h                               |    7 +---
 include/trace/events/vmscan.h                      |    8 ++---
 mm/compaction.c                                    |    3 +-
 mm/memcontrol.c                                    |    3 +-
 mm/vmscan.c                                        |   37 +++++++++++---------
 8 files changed, 43 insertions(+), 34 deletions(-)

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
index 12cecc8..4a37c47 100644
--- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -379,10 +379,10 @@ EVENT_PROCESS:
 
 			# To closer match vmstat scanning statistics, only count isolate_both
 			# and isolate_inactive as scanning. isolate_active is rotation
-			# isolate_inactive == 0
-			# isolate_active   == 1
-			# isolate_both     == 2
-			if ($isolate_mode != 1) {
+			# isolate_inactive == 1
+			# isolate_active   == 2
+			# isolate_both     == 3
+			if ($isolate_mode != 2) {
 				$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
 			}
 			$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 313a00e..4a8da84 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,7 +35,8 @@ enum mem_cgroup_page_stat_item {
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
-					int mode, struct zone *z,
+					isolate_mode_t mode,
+					struct zone *z,
 					struct mem_cgroup *mem_cont,
 					int active, int file);
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9f7c3eb..5a5286d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -158,6 +158,14 @@ static inline int is_unevictable_lru(enum lru_list l)
 	return (l == LRU_UNEVICTABLE);
 }
 
+/* Isolate inactive pages */
+#define ISOLATE_INACTIVE	((__force isolate_mode_t)0x1)
+/* Isolate active pages */
+#define ISOLATE_ACTIVE		((__force isolate_mode_t)0x2)
+
+/* LRU Isolation modes. */
+typedef unsigned __bitwise__ isolate_mode_t;
+
 enum zone_watermarks {
 	WMARK_MIN,
 	WMARK_LOW,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a273468..e73799d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -243,11 +243,6 @@ static inline void lru_cache_add_file(struct page *page)
 	__lru_cache_add(page, LRU_INACTIVE_FILE);
 }
 
-/* LRU Isolation modes. */
-#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
-#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
-#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
-
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
@@ -259,7 +254,7 @@ extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						unsigned int swappiness,
 						struct zone *zone,
 						unsigned long *nr_scanned);
-extern int __isolate_lru_page(struct page *page, int mode, int file);
+extern int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 36851f7..edc4b3d 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -266,7 +266,7 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 		unsigned long nr_lumpy_taken,
 		unsigned long nr_lumpy_dirty,
 		unsigned long nr_lumpy_failed,
-		int isolate_mode),
+		isolate_mode_t isolate_mode),
 
 	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode),
 
@@ -278,7 +278,7 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 		__field(unsigned long, nr_lumpy_taken)
 		__field(unsigned long, nr_lumpy_dirty)
 		__field(unsigned long, nr_lumpy_failed)
-		__field(int, isolate_mode)
+		__field(isolate_mode_t, isolate_mode)
 	),
 
 	TP_fast_assign(
@@ -312,7 +312,7 @@ DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_lru_isolate,
 		unsigned long nr_lumpy_taken,
 		unsigned long nr_lumpy_dirty,
 		unsigned long nr_lumpy_failed,
-		int isolate_mode),
+		isolate_mode_t isolate_mode),
 
 	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode)
 
@@ -327,7 +327,7 @@ DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_memcg_isolate,
 		unsigned long nr_lumpy_taken,
 		unsigned long nr_lumpy_dirty,
 		unsigned long nr_lumpy_failed,
-		int isolate_mode),
+		isolate_mode_t isolate_mode),
 
 	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode)
 
diff --git a/mm/compaction.c b/mm/compaction.c
index d8c023e..4fbbbd0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -371,7 +371,8 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 		}
 
 		/* Try isolate the page */
-		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) != 0)
+		if (__isolate_lru_page(page,
+				ISOLATE_ACTIVE|ISOLATE_INACTIVE, 0) != 0)
 			continue;
 
 		VM_BUG_ON(PageTransCompound(page));
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ffb99b4..57cdf5a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1251,7 +1251,8 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
-					int mode, struct zone *z,
+					isolate_mode_t mode,
+					struct zone *z,
 					struct mem_cgroup *mem_cont,
 					int active, int file)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 65388ac..4bb2010 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1012,23 +1012,27 @@ keep_lumpy:
  *
  * returns 0 on success, -ve errno on failure.
  */
-int __isolate_lru_page(struct page *page, int mode, int file)
+int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file)
 {
+	bool all_lru_mode;
 	int ret = -EINVAL;
 
 	/* Only take pages on the LRU. */
 	if (!PageLRU(page))
 		return ret;
 
+	all_lru_mode = (mode & (ISOLATE_ACTIVE|ISOLATE_INACTIVE)) ==
+		(ISOLATE_ACTIVE|ISOLATE_INACTIVE);
+
 	/*
 	 * When checking the active state, we need to be sure we are
 	 * dealing with comparible boolean values.  Take the logical not
 	 * of each.
 	 */
-	if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
+	if (!all_lru_mode && !PageActive(page) != !(mode & ISOLATE_ACTIVE))
 		return ret;
 
-	if (mode != ISOLATE_BOTH && page_is_file_cache(page) != file)
+	if (!all_lru_mode && !!page_is_file_cache(page) != file)
 		return ret;
 
 	/*
@@ -1076,7 +1080,8 @@ int __isolate_lru_page(struct page *page, int mode, int file)
  */
 static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		struct list_head *src, struct list_head *dst,
-		unsigned long *scanned, int order, int mode, int file)
+		unsigned long *scanned, int order, isolate_mode_t mode,
+		int file)
 {
 	unsigned long nr_taken = 0;
 	unsigned long nr_lumpy_taken = 0;
@@ -1201,8 +1206,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 static unsigned long isolate_pages_global(unsigned long nr,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
-					int mode, struct zone *z,
-					int active, int file)
+					isolate_mode_t mode,
+					struct zone *z,	int active, int file)
 {
 	int lru = LRU_BASE;
 	if (active)
@@ -1448,6 +1453,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	unsigned long nr_taken;
 	unsigned long nr_anon;
 	unsigned long nr_file;
+	isolate_mode_t reclaim_mode = ISOLATE_INACTIVE;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1458,15 +1464,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	}
 
 	set_reclaim_mode(priority, sc, false);
+	if (sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM)
+		reclaim_mode |= ISOLATE_ACTIVE;
+
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 
 	if (scanning_global_lru(sc)) {
-		nr_taken = isolate_pages_global(nr_to_scan,
-			&page_list, &nr_scanned, sc->order,
-			sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
-					ISOLATE_BOTH : ISOLATE_INACTIVE,
-			zone, 0, file);
+		nr_taken = isolate_pages_global(nr_to_scan, &page_list,
+			&nr_scanned, sc->order, reclaim_mode, zone, 0, file);
 		zone->pages_scanned += nr_scanned;
 		if (current_is_kswapd())
 			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
@@ -1475,12 +1481,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 			__count_zone_vm_events(PGSCAN_DIRECT, zone,
 					       nr_scanned);
 	} else {
-		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
-			&page_list, &nr_scanned, sc->order,
-			sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
-					ISOLATE_BOTH : ISOLATE_INACTIVE,
-			zone, sc->mem_cgroup,
-			0, file);
+		nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list,
+			&nr_scanned, sc->order, reclaim_mode, zone,
+			sc->mem_cgroup, 0, file);
 		/*
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 13/34] mm: compaction: make isolate_lru_page() filter-aware
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Minchan Kim <minchan.kim@gmail.com>

commit 39deaf8585152f1a35c1676d3d7dc6ae0fb65967 upstream.

Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU
	list leading to poor reclaim decisions which has a variable
	performance impact.

In async mode, compaction doesn't migrate dirty or writeback pages.  So,
it's meaningless to pick the page and re-add it to lru list.

Of course, when we isolate the page in compaction, the page might be dirty
or writeback but when we try to migrate the page, the page would be not
dirty, writeback.  So it could be migrated.  But it's very unlikely as
isolate and migration cycle is much faster than writeout.

So, this patch helps cpu overhead and prevent unnecessary LRU churning.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |    2 ++
 mm/compaction.c        |    7 +++++--
 mm/vmscan.c            |    3 +++
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5a5286d..632107e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -162,6 +162,8 @@ static inline int is_unevictable_lru(enum lru_list l)
 #define ISOLATE_INACTIVE	((__force isolate_mode_t)0x1)
 /* Isolate active pages */
 #define ISOLATE_ACTIVE		((__force isolate_mode_t)0x2)
+/* Isolate clean file */
+#define ISOLATE_CLEAN		((__force isolate_mode_t)0x4)
 
 /* LRU Isolation modes. */
 typedef unsigned __bitwise__ isolate_mode_t;
diff --git a/mm/compaction.c b/mm/compaction.c
index 4fbbbd0..61e68a5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -261,6 +261,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	unsigned long last_pageblock_nr = 0, pageblock_nr;
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct list_head *migratelist = &cc->migratepages;
+	isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
 
 	/* Do not scan outside zone boundaries */
 	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
@@ -370,9 +371,11 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 			continue;
 		}
 
+		if (!cc->sync)
+			mode |= ISOLATE_CLEAN;
+
 		/* Try isolate the page */
-		if (__isolate_lru_page(page,
-				ISOLATE_ACTIVE|ISOLATE_INACTIVE, 0) != 0)
+		if (__isolate_lru_page(page, mode, 0) != 0)
 			continue;
 
 		VM_BUG_ON(PageTransCompound(page));
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4bb2010..032f35e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1045,6 +1045,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file)
 
 	ret = -EBUSY;
 
+	if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page)))
+		return ret;
+
 	if (likely(get_page_unless_zero(page))) {
 		/*
 		 * Be careful not to clear PageLRU until after we're
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 13/34] mm: compaction: make isolate_lru_page() filter-aware
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Minchan Kim <minchan.kim@gmail.com>

commit 39deaf8585152f1a35c1676d3d7dc6ae0fb65967 upstream.

Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU
	list leading to poor reclaim decisions which has a variable
	performance impact.

In async mode, compaction doesn't migrate dirty or writeback pages.  So,
it's meaningless to pick the page and re-add it to lru list.

Of course, when we isolate the page in compaction, the page might be dirty
or writeback but when we try to migrate the page, the page would be not
dirty, writeback.  So it could be migrated.  But it's very unlikely as
isolate and migration cycle is much faster than writeout.

So, this patch helps cpu overhead and prevent unnecessary LRU churning.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |    2 ++
 mm/compaction.c        |    7 +++++--
 mm/vmscan.c            |    3 +++
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5a5286d..632107e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -162,6 +162,8 @@ static inline int is_unevictable_lru(enum lru_list l)
 #define ISOLATE_INACTIVE	((__force isolate_mode_t)0x1)
 /* Isolate active pages */
 #define ISOLATE_ACTIVE		((__force isolate_mode_t)0x2)
+/* Isolate clean file */
+#define ISOLATE_CLEAN		((__force isolate_mode_t)0x4)
 
 /* LRU Isolation modes. */
 typedef unsigned __bitwise__ isolate_mode_t;
diff --git a/mm/compaction.c b/mm/compaction.c
index 4fbbbd0..61e68a5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -261,6 +261,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	unsigned long last_pageblock_nr = 0, pageblock_nr;
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct list_head *migratelist = &cc->migratepages;
+	isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
 
 	/* Do not scan outside zone boundaries */
 	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
@@ -370,9 +371,11 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 			continue;
 		}
 
+		if (!cc->sync)
+			mode |= ISOLATE_CLEAN;
+
 		/* Try isolate the page */
-		if (__isolate_lru_page(page,
-				ISOLATE_ACTIVE|ISOLATE_INACTIVE, 0) != 0)
+		if (__isolate_lru_page(page, mode, 0) != 0)
 			continue;
 
 		VM_BUG_ON(PageTransCompound(page));
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4bb2010..032f35e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1045,6 +1045,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file)
 
 	ret = -EBUSY;
 
+	if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page)))
+		return ret;
+
 	if (likely(get_page_unless_zero(page))) {
 		/*
 		 * Be careful not to clear PageLRU until after we're
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 14/34] mm: zone_reclaim: make isolate_lru_page() filter-aware
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Minchan Kim <minchan.kim@gmail.com>

commit f80c0673610e36ae29d63e3297175e22f70dde5f upstream.

Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU list
	leading to poor reclaim decisions which has a variable performance
	impact.

In __zone_reclaim case, we don't want to shrink mapped page.  Nonetheless,
we have isolated mapped page and re-add it into LRU's head.  It's
unnecessary CPU overhead and makes LRU churning.

Of course, when we isolate the page, the page might be mapped but when we
try to migrate the page, the page would be not mapped.  So it could be
migrated.  But race is rare and although it happens, it's no big deal.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |    2 ++
 mm/vmscan.c            |   20 ++++++++++++++++++--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 632107e..951ed81 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -164,6 +164,8 @@ static inline int is_unevictable_lru(enum lru_list l)
 #define ISOLATE_ACTIVE		((__force isolate_mode_t)0x2)
 /* Isolate clean file */
 #define ISOLATE_CLEAN		((__force isolate_mode_t)0x4)
+/* Isolate unmapped file */
+#define ISOLATE_UNMAPPED	((__force isolate_mode_t)0x8)
 
 /* LRU Isolation modes. */
 typedef unsigned __bitwise__ isolate_mode_t;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 032f35e..9aa75e9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1048,6 +1048,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file)
 	if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page)))
 		return ret;
 
+	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
+		return ret;
+
 	if (likely(get_page_unless_zero(page))) {
 		/*
 		 * Be careful not to clear PageLRU until after we're
@@ -1471,6 +1474,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		reclaim_mode |= ISOLATE_ACTIVE;
 
 	lru_add_drain();
+
+	if (!sc->may_unmap)
+		reclaim_mode |= ISOLATE_UNMAPPED;
+	if (!sc->may_writepage)
+		reclaim_mode |= ISOLATE_CLEAN;
+
 	spin_lock_irq(&zone->lru_lock);
 
 	if (scanning_global_lru(sc)) {
@@ -1588,19 +1597,26 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	struct page *page;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 	unsigned long nr_rotated = 0;
+	isolate_mode_t reclaim_mode = ISOLATE_ACTIVE;
 
 	lru_add_drain();
+
+	if (!sc->may_unmap)
+		reclaim_mode |= ISOLATE_UNMAPPED;
+	if (!sc->may_writepage)
+		reclaim_mode |= ISOLATE_CLEAN;
+
 	spin_lock_irq(&zone->lru_lock);
 	if (scanning_global_lru(sc)) {
 		nr_taken = isolate_pages_global(nr_pages, &l_hold,
 						&pgscanned, sc->order,
-						ISOLATE_ACTIVE, zone,
+						reclaim_mode, zone,
 						1, file);
 		zone->pages_scanned += pgscanned;
 	} else {
 		nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
 						&pgscanned, sc->order,
-						ISOLATE_ACTIVE, zone,
+						reclaim_mode, zone,
 						sc->mem_cgroup, 1, file);
 		/*
 		 * mem_cgroup_isolate_pages() keeps track of
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 14/34] mm: zone_reclaim: make isolate_lru_page() filter-aware
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Minchan Kim <minchan.kim@gmail.com>

commit f80c0673610e36ae29d63e3297175e22f70dde5f upstream.

Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU list
	leading to poor reclaim decisions which has a variable performance
	impact.

In __zone_reclaim case, we don't want to shrink mapped page.  Nonetheless,
we have isolated mapped page and re-add it into LRU's head.  It's
unnecessary CPU overhead and makes LRU churning.

Of course, when we isolate the page, the page might be mapped but when we
try to migrate the page, the page would be not mapped.  So it could be
migrated.  But race is rare and although it happens, it's no big deal.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |    2 ++
 mm/vmscan.c            |   20 ++++++++++++++++++--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 632107e..951ed81 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -164,6 +164,8 @@ static inline int is_unevictable_lru(enum lru_list l)
 #define ISOLATE_ACTIVE		((__force isolate_mode_t)0x2)
 /* Isolate clean file */
 #define ISOLATE_CLEAN		((__force isolate_mode_t)0x4)
+/* Isolate unmapped file */
+#define ISOLATE_UNMAPPED	((__force isolate_mode_t)0x8)
 
 /* LRU Isolation modes. */
 typedef unsigned __bitwise__ isolate_mode_t;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 032f35e..9aa75e9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1048,6 +1048,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file)
 	if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page)))
 		return ret;
 
+	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
+		return ret;
+
 	if (likely(get_page_unless_zero(page))) {
 		/*
 		 * Be careful not to clear PageLRU until after we're
@@ -1471,6 +1474,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		reclaim_mode |= ISOLATE_ACTIVE;
 
 	lru_add_drain();
+
+	if (!sc->may_unmap)
+		reclaim_mode |= ISOLATE_UNMAPPED;
+	if (!sc->may_writepage)
+		reclaim_mode |= ISOLATE_CLEAN;
+
 	spin_lock_irq(&zone->lru_lock);
 
 	if (scanning_global_lru(sc)) {
@@ -1588,19 +1597,26 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	struct page *page;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 	unsigned long nr_rotated = 0;
+	isolate_mode_t reclaim_mode = ISOLATE_ACTIVE;
 
 	lru_add_drain();
+
+	if (!sc->may_unmap)
+		reclaim_mode |= ISOLATE_UNMAPPED;
+	if (!sc->may_writepage)
+		reclaim_mode |= ISOLATE_CLEAN;
+
 	spin_lock_irq(&zone->lru_lock);
 	if (scanning_global_lru(sc)) {
 		nr_taken = isolate_pages_global(nr_pages, &l_hold,
 						&pgscanned, sc->order,
-						ISOLATE_ACTIVE, zone,
+						reclaim_mode, zone,
 						1, file);
 		zone->pages_scanned += pgscanned;
 	} else {
 		nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
 						&pgscanned, sc->order,
-						ISOLATE_ACTIVE, zone,
+						reclaim_mode, zone,
 						sc->mem_cgroup, 1, file);
 		/*
 		 * mem_cgroup_isolate_pages() keeps track of
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 15/34] mm: migration: clean up unmap_and_move()
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit 0dabec93de633a87adfbbe1d800a4c56cd19d73b upstream.

Stable note: Not tracked in Bugzilla. This patch makes later patches
	easier to apply but has no other impact.

unmap_and_move() is one a big messy function.  Clean it up.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/migrate.c |   59 ++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 33 insertions(+), 26 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 14d0a6a..e58ab66 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -621,38 +621,18 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 	return rc;
 }
 
-/*
- * Obtain the lock on page, remove all ptes and migrate the page
- * to the newly allocated page in newpage.
- */
-static int unmap_and_move(new_page_t get_new_page, unsigned long private,
-			struct page *page, int force, bool offlining, bool sync)
+static int __unmap_and_move(struct page *page, struct page *newpage,
+				int force, bool offlining, bool sync)
 {
-	int rc = 0;
-	int *result = NULL;
-	struct page *newpage = get_new_page(page, private, &result);
+	int rc = -EAGAIN;
 	int remap_swapcache = 1;
 	int charge = 0;
 	struct mem_cgroup *mem;
 	struct anon_vma *anon_vma = NULL;
 
-	if (!newpage)
-		return -ENOMEM;
-
-	if (page_count(page) == 1) {
-		/* page was freed from under us. So we are done. */
-		goto move_newpage;
-	}
-	if (unlikely(PageTransHuge(page)))
-		if (unlikely(split_huge_page(page)))
-			goto move_newpage;
-
-	/* prepare cgroup just returns 0 or -ENOMEM */
-	rc = -EAGAIN;
-
 	if (!trylock_page(page)) {
 		if (!force || !sync)
-			goto move_newpage;
+			goto out;
 
 		/*
 		 * It's not safe for direct compaction to call lock_page.
@@ -668,7 +648,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 		 * altogether.
 		 */
 		if (current->flags & PF_MEMALLOC)
-			goto move_newpage;
+			goto out;
 
 		lock_page(page);
 	}
@@ -785,8 +765,35 @@ uncharge:
 		mem_cgroup_end_migration(mem, page, newpage, rc == 0);
 unlock:
 	unlock_page(page);
+out:
+	return rc;
+}
 
-move_newpage:
+/*
+ * Obtain the lock on page, remove all ptes and migrate the page
+ * to the newly allocated page in newpage.
+ */
+static int unmap_and_move(new_page_t get_new_page, unsigned long private,
+			struct page *page, int force, bool offlining, bool sync)
+{
+	int rc = 0;
+	int *result = NULL;
+	struct page *newpage = get_new_page(page, private, &result);
+
+	if (!newpage)
+		return -ENOMEM;
+
+	if (page_count(page) == 1) {
+		/* page was freed from under us. So we are done. */
+		goto out;
+	}
+
+	if (unlikely(PageTransHuge(page)))
+		if (unlikely(split_huge_page(page)))
+			goto out;
+
+	rc = __unmap_and_move(page, newpage, force, offlining, sync);
+out:
 	if (rc != -EAGAIN) {
  		/*
  		 * A page that has been migrated has all references
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 15/34] mm: migration: clean up unmap_and_move()
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit 0dabec93de633a87adfbbe1d800a4c56cd19d73b upstream.

Stable note: Not tracked in Bugzilla. This patch makes later patches
	easier to apply but has no other impact.

unmap_and_move() is one a big messy function.  Clean it up.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/migrate.c |   59 ++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 33 insertions(+), 26 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 14d0a6a..e58ab66 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -621,38 +621,18 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 	return rc;
 }
 
-/*
- * Obtain the lock on page, remove all ptes and migrate the page
- * to the newly allocated page in newpage.
- */
-static int unmap_and_move(new_page_t get_new_page, unsigned long private,
-			struct page *page, int force, bool offlining, bool sync)
+static int __unmap_and_move(struct page *page, struct page *newpage,
+				int force, bool offlining, bool sync)
 {
-	int rc = 0;
-	int *result = NULL;
-	struct page *newpage = get_new_page(page, private, &result);
+	int rc = -EAGAIN;
 	int remap_swapcache = 1;
 	int charge = 0;
 	struct mem_cgroup *mem;
 	struct anon_vma *anon_vma = NULL;
 
-	if (!newpage)
-		return -ENOMEM;
-
-	if (page_count(page) == 1) {
-		/* page was freed from under us. So we are done. */
-		goto move_newpage;
-	}
-	if (unlikely(PageTransHuge(page)))
-		if (unlikely(split_huge_page(page)))
-			goto move_newpage;
-
-	/* prepare cgroup just returns 0 or -ENOMEM */
-	rc = -EAGAIN;
-
 	if (!trylock_page(page)) {
 		if (!force || !sync)
-			goto move_newpage;
+			goto out;
 
 		/*
 		 * It's not safe for direct compaction to call lock_page.
@@ -668,7 +648,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 		 * altogether.
 		 */
 		if (current->flags & PF_MEMALLOC)
-			goto move_newpage;
+			goto out;
 
 		lock_page(page);
 	}
@@ -785,8 +765,35 @@ uncharge:
 		mem_cgroup_end_migration(mem, page, newpage, rc == 0);
 unlock:
 	unlock_page(page);
+out:
+	return rc;
+}
 
-move_newpage:
+/*
+ * Obtain the lock on page, remove all ptes and migrate the page
+ * to the newly allocated page in newpage.
+ */
+static int unmap_and_move(new_page_t get_new_page, unsigned long private,
+			struct page *page, int force, bool offlining, bool sync)
+{
+	int rc = 0;
+	int *result = NULL;
+	struct page *newpage = get_new_page(page, private, &result);
+
+	if (!newpage)
+		return -ENOMEM;
+
+	if (page_count(page) == 1) {
+		/* page was freed from under us. So we are done. */
+		goto out;
+	}
+
+	if (unlikely(PageTransHuge(page)))
+		if (unlikely(split_huge_page(page)))
+			goto out;
+
+	rc = __unmap_and_move(page, newpage, force, offlining, sync);
+out:
 	if (rc != -EAGAIN) {
  		/*
  		 * A page that has been migrated has all references
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 16/34] mm: compaction: Allow compaction to isolate dirty pages
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit a77ebd333cd810d7b680d544be88c875131c2bd3 upstream.

Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
	information by reducing LRU list churning had the side-effect of
	reducing THP allocation success rates. This was part of a series
	to restore the success rates while preserving the reclaim fix.

Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
noted that compaction does not migrate dirty or writeback pages and
that is was meaningless to pick the page and re-add it to the LRU list.

What was missed during review is that asynchronous migration moves
dirty pages if their ->migratepage callback is migrate_page() because
these can be moved without blocking. This potentially impacted
hugepage allocation success rates by a factor depending on how many
dirty pages are in the system.

This patch partially reverts 39deaf85 to allow migration to isolate
dirty pages again. This increases how much compaction disrupts the
LRU but that is addressed later in the series.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 mm/compaction.c |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 61e68a5..afdc416 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -371,9 +371,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 			continue;
 		}
 
-		if (!cc->sync)
-			mode |= ISOLATE_CLEAN;
-
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, mode, 0) != 0)
 			continue;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 16/34] mm: compaction: Allow compaction to isolate dirty pages
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit a77ebd333cd810d7b680d544be88c875131c2bd3 upstream.

Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
	information by reducing LRU list churning had the side-effect of
	reducing THP allocation success rates. This was part of a series
	to restore the success rates while preserving the reclaim fix.

Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
noted that compaction does not migrate dirty or writeback pages and
that is was meaningless to pick the page and re-add it to the LRU list.

What was missed during review is that asynchronous migration moves
dirty pages if their ->migratepage callback is migrate_page() because
these can be moved without blocking. This potentially impacted
hugepage allocation success rates by a factor depending on how many
dirty pages are in the system.

This patch partially reverts 39deaf85 to allow migration to isolate
dirty pages again. This increases how much compaction disrupts the
LRU but that is addressed later in the series.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 mm/compaction.c |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 61e68a5..afdc416 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -371,9 +371,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 			continue;
 		}
 
-		if (!cc->sync)
-			mode |= ISOLATE_CLEAN;
-
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, mode, 0) != 0)
 			continue;
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 17/34] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit b969c4ab9f182a6e1b2a0848be349f99714947b0 upstream.

Stable note: Not tracked in Bugzilla. A fix aimed at preserving page
	aging information by reducing LRU list churning had the side-effect
	of reducing THP allocation success rates. This was part of a series
	to restore the success rates while preserving the reclaim fix.

Asynchronous compaction is used when allocating transparent hugepages
to avoid blocking for long periods of time. Due to reports of
stalling, there was a debate on disabling synchronous compaction
but this severely impacted allocation success rates. Part of the
reason was because when deciding whether to migrate dirty pages,
the following check is made;

	if (PageDirty(page) && !sync &&
		mapping->a_ops->migratepage != migrate_page)
			rc = -EBUSY;

This skips over all pages using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking. This
patch updates the ->migratepage callback with a "sync" parameter. It
is the resposibility of the callback to gracefully fail migration of
the page if it would block.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/btrfs/disk-io.c      |    4 +-
 fs/nfs/internal.h       |    2 +-
 fs/nfs/write.c          |    4 +-
 include/linux/fs.h      |    9 ++--
 include/linux/migrate.h |    2 +-
 mm/migrate.c            |  129 +++++++++++++++++++++++++++++++++--------------
 6 files changed, 104 insertions(+), 46 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1ac8db5d..522cb2a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -801,7 +801,7 @@ static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
 
 #ifdef CONFIG_MIGRATION
 static int btree_migratepage(struct address_space *mapping,
-			struct page *newpage, struct page *page)
+			struct page *newpage, struct page *page, bool sync)
 {
 	/*
 	 * we can't safely write a btree page from here,
@@ -816,7 +816,7 @@ static int btree_migratepage(struct address_space *mapping,
 	if (page_has_private(page) &&
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
-	return migrate_page(mapping, newpage, page);
+	return migrate_page(mapping, newpage, page, sync);
 }
 #endif
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 2a55347..a74442a 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -315,7 +315,7 @@ void nfs_commit_release_pages(struct nfs_write_data *data);
 
 #ifdef CONFIG_MIGRATION
 extern int nfs_migrate_page(struct address_space *,
-		struct page *, struct page *);
+		struct page *, struct page *, bool);
 #else
 #define nfs_migrate_page NULL
 #endif
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index f2f80c0..22a48fd 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1662,7 +1662,7 @@ out_error:
 
 #ifdef CONFIG_MIGRATION
 int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
-		struct page *page)
+		struct page *page, bool sync)
 {
 	/*
 	 * If PagePrivate is set, then the page is currently associated with
@@ -1677,7 +1677,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
 
 	nfs_fscache_release_page(page, GFP_KERNEL);
 
-	return migrate_page(mapping, newpage, page);
+	return migrate_page(mapping, newpage, page, sync);
 }
 #endif
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 96b1035..09ddec9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -607,9 +607,12 @@ struct address_space_operations {
 			loff_t offset, unsigned long nr_segs);
 	int (*get_xip_mem)(struct address_space *, pgoff_t, int,
 						void **, unsigned long *);
-	/* migrate the contents of a page to the specified target */
+	/*
+	 * migrate the contents of a page to the specified target. If sync
+	 * is false, it must not block.
+	 */
 	int (*migratepage) (struct address_space *,
-			struct page *, struct page *);
+			struct page *, struct page *, bool);
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
@@ -2478,7 +2481,7 @@ extern int generic_check_addressable(unsigned, u64);
 
 #ifdef CONFIG_MIGRATION
 extern int buffer_migrate_page(struct address_space *,
-				struct page *, struct page *);
+				struct page *, struct page *, bool);
 #else
 #define buffer_migrate_page NULL
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e39aeec..14e6d2a 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -11,7 +11,7 @@ typedef struct page *new_page_t(struct page *, unsigned long private, int **);
 
 extern void putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
-			struct page *, struct page *);
+			struct page *, struct page *, bool);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
 			bool sync);
diff --git a/mm/migrate.c b/mm/migrate.c
index e58ab66..fb8d1ae 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -220,6 +220,55 @@ out:
 	pte_unmap_unlock(ptep, ptl);
 }
 
+#ifdef CONFIG_BLOCK
+/* Returns true if all buffers are successfully locked */
+static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
+{
+	struct buffer_head *bh = head;
+
+	/* Simple case, sync compaction */
+	if (sync) {
+		do {
+			get_bh(bh);
+			lock_buffer(bh);
+			bh = bh->b_this_page;
+
+		} while (bh != head);
+
+		return true;
+	}
+
+	/* async case, we cannot block on lock_buffer so use trylock_buffer */
+	do {
+		get_bh(bh);
+		if (!trylock_buffer(bh)) {
+			/*
+			 * We failed to lock the buffer and cannot stall in
+			 * async migration. Release the taken locks
+			 */
+			struct buffer_head *failed_bh = bh;
+			put_bh(failed_bh);
+			bh = head;
+			while (bh != failed_bh) {
+				unlock_buffer(bh);
+				put_bh(bh);
+				bh = bh->b_this_page;
+			}
+			return false;
+		}
+
+		bh = bh->b_this_page;
+	} while (bh != head);
+	return true;
+}
+#else
+static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
+								bool sync)
+{
+	return true;
+}
+#endif /* CONFIG_BLOCK */
+
 /*
  * Replace the page in the mapping.
  *
@@ -229,7 +278,8 @@ out:
  * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
  */
 static int migrate_page_move_mapping(struct address_space *mapping,
-		struct page *newpage, struct page *page)
+		struct page *newpage, struct page *page,
+		struct buffer_head *head, bool sync)
 {
 	int expected_count;
 	void **pslot;
@@ -259,6 +309,19 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 	}
 
 	/*
+	 * In the async migration case of moving a page with buffers, lock the
+	 * buffers using trylock before the mapping is moved. If the mapping
+	 * was moved, we later failed to lock the buffers and could not move
+	 * the mapping back due to an elevated page count, we would have to
+	 * block waiting on other references to be dropped.
+	 */
+	if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
+		page_unfreeze_refs(page, expected_count);
+		spin_unlock_irq(&mapping->tree_lock);
+		return -EAGAIN;
+	}
+
+	/*
 	 * Now we know that no one else is looking at the page.
 	 */
 	get_page(newpage);	/* add cache reference */
@@ -415,13 +478,13 @@ EXPORT_SYMBOL(fail_migrate_page);
  * Pages are locked upon entry and exit.
  */
 int migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page)
+		struct page *newpage, struct page *page, bool sync)
 {
 	int rc;
 
 	BUG_ON(PageWriteback(page));	/* Writeback must be complete */
 
-	rc = migrate_page_move_mapping(mapping, newpage, page);
+	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, sync);
 
 	if (rc)
 		return rc;
@@ -438,28 +501,28 @@ EXPORT_SYMBOL(migrate_page);
  * exist.
  */
 int buffer_migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page)
+		struct page *newpage, struct page *page, bool sync)
 {
 	struct buffer_head *bh, *head;
 	int rc;
 
 	if (!page_has_buffers(page))
-		return migrate_page(mapping, newpage, page);
+		return migrate_page(mapping, newpage, page, sync);
 
 	head = page_buffers(page);
 
-	rc = migrate_page_move_mapping(mapping, newpage, page);
+	rc = migrate_page_move_mapping(mapping, newpage, page, head, sync);
 
 	if (rc)
 		return rc;
 
-	bh = head;
-	do {
-		get_bh(bh);
-		lock_buffer(bh);
-		bh = bh->b_this_page;
-
-	} while (bh != head);
+	/*
+	 * In the async case, migrate_page_move_mapping locked the buffers
+	 * with an IRQ-safe spinlock held. In the sync case, the buffers
+	 * need to be locked no
+	 */
+	if (sync)
+		BUG_ON(!buffer_migrate_lock_buffers(head, sync));
 
 	ClearPagePrivate(page);
 	set_page_private(newpage, page_private(page));
@@ -536,10 +599,13 @@ static int writeout(struct address_space *mapping, struct page *page)
  * Default handling if a filesystem does not provide a migration function.
  */
 static int fallback_migrate_page(struct address_space *mapping,
-	struct page *newpage, struct page *page)
+	struct page *newpage, struct page *page, bool sync)
 {
-	if (PageDirty(page))
+	if (PageDirty(page)) {
+		if (!sync)
+			return -EBUSY;
 		return writeout(mapping, page);
+	}
 
 	/*
 	 * Buffers may be managed in a filesystem specific way.
@@ -549,7 +615,7 @@ static int fallback_migrate_page(struct address_space *mapping,
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
 
-	return migrate_page(mapping, newpage, page);
+	return migrate_page(mapping, newpage, page, sync);
 }
 
 /*
@@ -585,29 +651,18 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 
 	mapping = page_mapping(page);
 	if (!mapping)
-		rc = migrate_page(mapping, newpage, page);
-	else {
+		rc = migrate_page(mapping, newpage, page, sync);
+	else if (mapping->a_ops->migratepage)
 		/*
-		 * Do not writeback pages if !sync and migratepage is
-		 * not pointing to migrate_page() which is nonblocking
-		 * (swapcache/tmpfs uses migratepage = migrate_page).
+		 * Most pages have a mapping and most filesystems provide a
+		 * migratepage callback. Anonymous pages are part of swap
+		 * space which also has its own migratepage callback. This
+		 * is the most common path for page migration.
 		 */
-		if (PageDirty(page) && !sync &&
-		    mapping->a_ops->migratepage != migrate_page)
-			rc = -EBUSY;
-		else if (mapping->a_ops->migratepage)
-			/*
-			 * Most pages have a mapping and most filesystems
-			 * should provide a migration function. Anonymous
-			 * pages are part of swap space which also has its
-			 * own migration function. This is the most common
-			 * path for page migration.
-			 */
-			rc = mapping->a_ops->migratepage(mapping,
-							newpage, page);
-		else
-			rc = fallback_migrate_page(mapping, newpage, page);
-	}
+		rc = mapping->a_ops->migratepage(mapping,
+						newpage, page, sync);
+	else
+		rc = fallback_migrate_page(mapping, newpage, page, sync);
 
 	if (rc) {
 		newpage->mapping = NULL;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 17/34] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit b969c4ab9f182a6e1b2a0848be349f99714947b0 upstream.

Stable note: Not tracked in Bugzilla. A fix aimed at preserving page
	aging information by reducing LRU list churning had the side-effect
	of reducing THP allocation success rates. This was part of a series
	to restore the success rates while preserving the reclaim fix.

Asynchronous compaction is used when allocating transparent hugepages
to avoid blocking for long periods of time. Due to reports of
stalling, there was a debate on disabling synchronous compaction
but this severely impacted allocation success rates. Part of the
reason was because when deciding whether to migrate dirty pages,
the following check is made;

	if (PageDirty(page) && !sync &&
		mapping->a_ops->migratepage != migrate_page)
			rc = -EBUSY;

This skips over all pages using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking. This
patch updates the ->migratepage callback with a "sync" parameter. It
is the resposibility of the callback to gracefully fail migration of
the page if it would block.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/btrfs/disk-io.c      |    4 +-
 fs/nfs/internal.h       |    2 +-
 fs/nfs/write.c          |    4 +-
 include/linux/fs.h      |    9 ++--
 include/linux/migrate.h |    2 +-
 mm/migrate.c            |  129 +++++++++++++++++++++++++++++++++--------------
 6 files changed, 104 insertions(+), 46 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1ac8db5d..522cb2a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -801,7 +801,7 @@ static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
 
 #ifdef CONFIG_MIGRATION
 static int btree_migratepage(struct address_space *mapping,
-			struct page *newpage, struct page *page)
+			struct page *newpage, struct page *page, bool sync)
 {
 	/*
 	 * we can't safely write a btree page from here,
@@ -816,7 +816,7 @@ static int btree_migratepage(struct address_space *mapping,
 	if (page_has_private(page) &&
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
-	return migrate_page(mapping, newpage, page);
+	return migrate_page(mapping, newpage, page, sync);
 }
 #endif
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 2a55347..a74442a 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -315,7 +315,7 @@ void nfs_commit_release_pages(struct nfs_write_data *data);
 
 #ifdef CONFIG_MIGRATION
 extern int nfs_migrate_page(struct address_space *,
-		struct page *, struct page *);
+		struct page *, struct page *, bool);
 #else
 #define nfs_migrate_page NULL
 #endif
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index f2f80c0..22a48fd 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1662,7 +1662,7 @@ out_error:
 
 #ifdef CONFIG_MIGRATION
 int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
-		struct page *page)
+		struct page *page, bool sync)
 {
 	/*
 	 * If PagePrivate is set, then the page is currently associated with
@@ -1677,7 +1677,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
 
 	nfs_fscache_release_page(page, GFP_KERNEL);
 
-	return migrate_page(mapping, newpage, page);
+	return migrate_page(mapping, newpage, page, sync);
 }
 #endif
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 96b1035..09ddec9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -607,9 +607,12 @@ struct address_space_operations {
 			loff_t offset, unsigned long nr_segs);
 	int (*get_xip_mem)(struct address_space *, pgoff_t, int,
 						void **, unsigned long *);
-	/* migrate the contents of a page to the specified target */
+	/*
+	 * migrate the contents of a page to the specified target. If sync
+	 * is false, it must not block.
+	 */
 	int (*migratepage) (struct address_space *,
-			struct page *, struct page *);
+			struct page *, struct page *, bool);
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
@@ -2478,7 +2481,7 @@ extern int generic_check_addressable(unsigned, u64);
 
 #ifdef CONFIG_MIGRATION
 extern int buffer_migrate_page(struct address_space *,
-				struct page *, struct page *);
+				struct page *, struct page *, bool);
 #else
 #define buffer_migrate_page NULL
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e39aeec..14e6d2a 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -11,7 +11,7 @@ typedef struct page *new_page_t(struct page *, unsigned long private, int **);
 
 extern void putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
-			struct page *, struct page *);
+			struct page *, struct page *, bool);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
 			bool sync);
diff --git a/mm/migrate.c b/mm/migrate.c
index e58ab66..fb8d1ae 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -220,6 +220,55 @@ out:
 	pte_unmap_unlock(ptep, ptl);
 }
 
+#ifdef CONFIG_BLOCK
+/* Returns true if all buffers are successfully locked */
+static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
+{
+	struct buffer_head *bh = head;
+
+	/* Simple case, sync compaction */
+	if (sync) {
+		do {
+			get_bh(bh);
+			lock_buffer(bh);
+			bh = bh->b_this_page;
+
+		} while (bh != head);
+
+		return true;
+	}
+
+	/* async case, we cannot block on lock_buffer so use trylock_buffer */
+	do {
+		get_bh(bh);
+		if (!trylock_buffer(bh)) {
+			/*
+			 * We failed to lock the buffer and cannot stall in
+			 * async migration. Release the taken locks
+			 */
+			struct buffer_head *failed_bh = bh;
+			put_bh(failed_bh);
+			bh = head;
+			while (bh != failed_bh) {
+				unlock_buffer(bh);
+				put_bh(bh);
+				bh = bh->b_this_page;
+			}
+			return false;
+		}
+
+		bh = bh->b_this_page;
+	} while (bh != head);
+	return true;
+}
+#else
+static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
+								bool sync)
+{
+	return true;
+}
+#endif /* CONFIG_BLOCK */
+
 /*
  * Replace the page in the mapping.
  *
@@ -229,7 +278,8 @@ out:
  * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
  */
 static int migrate_page_move_mapping(struct address_space *mapping,
-		struct page *newpage, struct page *page)
+		struct page *newpage, struct page *page,
+		struct buffer_head *head, bool sync)
 {
 	int expected_count;
 	void **pslot;
@@ -259,6 +309,19 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 	}
 
 	/*
+	 * In the async migration case of moving a page with buffers, lock the
+	 * buffers using trylock before the mapping is moved. If the mapping
+	 * was moved, we later failed to lock the buffers and could not move
+	 * the mapping back due to an elevated page count, we would have to
+	 * block waiting on other references to be dropped.
+	 */
+	if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
+		page_unfreeze_refs(page, expected_count);
+		spin_unlock_irq(&mapping->tree_lock);
+		return -EAGAIN;
+	}
+
+	/*
 	 * Now we know that no one else is looking at the page.
 	 */
 	get_page(newpage);	/* add cache reference */
@@ -415,13 +478,13 @@ EXPORT_SYMBOL(fail_migrate_page);
  * Pages are locked upon entry and exit.
  */
 int migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page)
+		struct page *newpage, struct page *page, bool sync)
 {
 	int rc;
 
 	BUG_ON(PageWriteback(page));	/* Writeback must be complete */
 
-	rc = migrate_page_move_mapping(mapping, newpage, page);
+	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, sync);
 
 	if (rc)
 		return rc;
@@ -438,28 +501,28 @@ EXPORT_SYMBOL(migrate_page);
  * exist.
  */
 int buffer_migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page)
+		struct page *newpage, struct page *page, bool sync)
 {
 	struct buffer_head *bh, *head;
 	int rc;
 
 	if (!page_has_buffers(page))
-		return migrate_page(mapping, newpage, page);
+		return migrate_page(mapping, newpage, page, sync);
 
 	head = page_buffers(page);
 
-	rc = migrate_page_move_mapping(mapping, newpage, page);
+	rc = migrate_page_move_mapping(mapping, newpage, page, head, sync);
 
 	if (rc)
 		return rc;
 
-	bh = head;
-	do {
-		get_bh(bh);
-		lock_buffer(bh);
-		bh = bh->b_this_page;
-
-	} while (bh != head);
+	/*
+	 * In the async case, migrate_page_move_mapping locked the buffers
+	 * with an IRQ-safe spinlock held. In the sync case, the buffers
+	 * need to be locked no
+	 */
+	if (sync)
+		BUG_ON(!buffer_migrate_lock_buffers(head, sync));
 
 	ClearPagePrivate(page);
 	set_page_private(newpage, page_private(page));
@@ -536,10 +599,13 @@ static int writeout(struct address_space *mapping, struct page *page)
  * Default handling if a filesystem does not provide a migration function.
  */
 static int fallback_migrate_page(struct address_space *mapping,
-	struct page *newpage, struct page *page)
+	struct page *newpage, struct page *page, bool sync)
 {
-	if (PageDirty(page))
+	if (PageDirty(page)) {
+		if (!sync)
+			return -EBUSY;
 		return writeout(mapping, page);
+	}
 
 	/*
 	 * Buffers may be managed in a filesystem specific way.
@@ -549,7 +615,7 @@ static int fallback_migrate_page(struct address_space *mapping,
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
 
-	return migrate_page(mapping, newpage, page);
+	return migrate_page(mapping, newpage, page, sync);
 }
 
 /*
@@ -585,29 +651,18 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 
 	mapping = page_mapping(page);
 	if (!mapping)
-		rc = migrate_page(mapping, newpage, page);
-	else {
+		rc = migrate_page(mapping, newpage, page, sync);
+	else if (mapping->a_ops->migratepage)
 		/*
-		 * Do not writeback pages if !sync and migratepage is
-		 * not pointing to migrate_page() which is nonblocking
-		 * (swapcache/tmpfs uses migratepage = migrate_page).
+		 * Most pages have a mapping and most filesystems provide a
+		 * migratepage callback. Anonymous pages are part of swap
+		 * space which also has its own migratepage callback. This
+		 * is the most common path for page migration.
 		 */
-		if (PageDirty(page) && !sync &&
-		    mapping->a_ops->migratepage != migrate_page)
-			rc = -EBUSY;
-		else if (mapping->a_ops->migratepage)
-			/*
-			 * Most pages have a mapping and most filesystems
-			 * should provide a migration function. Anonymous
-			 * pages are part of swap space which also has its
-			 * own migration function. This is the most common
-			 * path for page migration.
-			 */
-			rc = mapping->a_ops->migratepage(mapping,
-							newpage, page);
-		else
-			rc = fallback_migrate_page(mapping, newpage, page);
-	}
+		rc = mapping->a_ops->migratepage(mapping,
+						newpage, page, sync);
+	else
+		rc = fallback_migrate_page(mapping, newpage, page, sync);
 
 	if (rc) {
 		newpage->mapping = NULL;
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 18/34] mm: page allocator: Do not call direct reclaim for THP allocations while compaction is deferred
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit 66199712e9eef5aede09dbcd9dfff87798a66917 upstream.

Stable note: Not tracked in Buzilla. This was part of a series that
	reduced interactivity stalls experienced when THP was enabled.

If compaction is deferred, direct reclaim is used to try free enough
pages for the allocation to succeed. For small high-orders, this has
a reasonable chance of success. However, if the caller has specified
__GFP_NO_KSWAPD to limit the disruption to the system, it makes more
sense to fail the allocation rather than stall the caller in direct
reclaim. This patch skips direct reclaim if compaction is deferred
and the caller specifies __GFP_NO_KSWAPD.

Async compaction only considers a subset of pages so it is possible for
compaction to be deferred prematurely and not enter direct reclaim even
in cases where it should. To compensate for this, this patch also defers
compaction only if sync compaction failed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel<riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/page_alloc.c |   45 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e568b80..257acae 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1897,14 +1897,20 @@ static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, unsigned long *did_some_progress,
-	bool sync_migration)
+	int migratetype, bool sync_migration,
+	bool *deferred_compaction,
+	unsigned long *did_some_progress)
 {
 	struct page *page;
 
-	if (!order || compaction_deferred(preferred_zone))
+	if (!order)
 		return NULL;
 
+	if (compaction_deferred(preferred_zone)) {
+		*deferred_compaction = true;
+		return NULL;
+	}
+
 	current->flags |= PF_MEMALLOC;
 	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
 						nodemask, sync_migration);
@@ -1932,7 +1938,13 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		 * but not enough to satisfy watermarks.
 		 */
 		count_vm_event(COMPACTFAIL);
-		defer_compaction(preferred_zone);
+
+		/*
+		 * As async compaction considers a subset of pageblocks, only
+		 * defer if the failure was a sync compaction failure.
+		 */
+		if (sync_migration)
+			defer_compaction(preferred_zone);
 
 		cond_resched();
 	}
@@ -1944,8 +1956,9 @@ static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, unsigned long *did_some_progress,
-	bool sync_migration)
+	int migratetype, bool sync_migration,
+	bool *deferred_compaction,
+	unsigned long *did_some_progress)
 {
 	return NULL;
 }
@@ -2095,6 +2108,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned long pages_reclaimed = 0;
 	unsigned long did_some_progress;
 	bool sync_migration = false;
+	bool deferred_compaction = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2175,12 +2189,22 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					migratetype, &did_some_progress,
-					sync_migration);
+					migratetype, sync_migration,
+					&deferred_compaction,
+					&did_some_progress);
 	if (page)
 		goto got_pg;
 	sync_migration = true;
 
+	/*
+	 * If compaction is deferred for high-order allocations, it is because
+	 * sync compaction recently failed. In this is the case and the caller
+	 * has requested the system not be heavily disrupted, fail the
+	 * allocation now instead of entering direct reclaim
+	 */
+	if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
+		goto nopage;
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,
@@ -2243,8 +2267,9 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					migratetype, &did_some_progress,
-					sync_migration);
+					migratetype, sync_migration,
+					&deferred_compaction,
+					&did_some_progress);
 		if (page)
 			goto got_pg;
 	}
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 18/34] mm: page allocator: Do not call direct reclaim for THP allocations while compaction is deferred
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit 66199712e9eef5aede09dbcd9dfff87798a66917 upstream.

Stable note: Not tracked in Buzilla. This was part of a series that
	reduced interactivity stalls experienced when THP was enabled.

If compaction is deferred, direct reclaim is used to try free enough
pages for the allocation to succeed. For small high-orders, this has
a reasonable chance of success. However, if the caller has specified
__GFP_NO_KSWAPD to limit the disruption to the system, it makes more
sense to fail the allocation rather than stall the caller in direct
reclaim. This patch skips direct reclaim if compaction is deferred
and the caller specifies __GFP_NO_KSWAPD.

Async compaction only considers a subset of pages so it is possible for
compaction to be deferred prematurely and not enter direct reclaim even
in cases where it should. To compensate for this, this patch also defers
compaction only if sync compaction failed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel<riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/page_alloc.c |   45 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e568b80..257acae 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1897,14 +1897,20 @@ static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, unsigned long *did_some_progress,
-	bool sync_migration)
+	int migratetype, bool sync_migration,
+	bool *deferred_compaction,
+	unsigned long *did_some_progress)
 {
 	struct page *page;
 
-	if (!order || compaction_deferred(preferred_zone))
+	if (!order)
 		return NULL;
 
+	if (compaction_deferred(preferred_zone)) {
+		*deferred_compaction = true;
+		return NULL;
+	}
+
 	current->flags |= PF_MEMALLOC;
 	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
 						nodemask, sync_migration);
@@ -1932,7 +1938,13 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		 * but not enough to satisfy watermarks.
 		 */
 		count_vm_event(COMPACTFAIL);
-		defer_compaction(preferred_zone);
+
+		/*
+		 * As async compaction considers a subset of pageblocks, only
+		 * defer if the failure was a sync compaction failure.
+		 */
+		if (sync_migration)
+			defer_compaction(preferred_zone);
 
 		cond_resched();
 	}
@@ -1944,8 +1956,9 @@ static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, unsigned long *did_some_progress,
-	bool sync_migration)
+	int migratetype, bool sync_migration,
+	bool *deferred_compaction,
+	unsigned long *did_some_progress)
 {
 	return NULL;
 }
@@ -2095,6 +2108,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned long pages_reclaimed = 0;
 	unsigned long did_some_progress;
 	bool sync_migration = false;
+	bool deferred_compaction = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2175,12 +2189,22 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					migratetype, &did_some_progress,
-					sync_migration);
+					migratetype, sync_migration,
+					&deferred_compaction,
+					&did_some_progress);
 	if (page)
 		goto got_pg;
 	sync_migration = true;
 
+	/*
+	 * If compaction is deferred for high-order allocations, it is because
+	 * sync compaction recently failed. In this is the case and the caller
+	 * has requested the system not be heavily disrupted, fail the
+	 * allocation now instead of entering direct reclaim
+	 */
+	if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
+		goto nopage;
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,
@@ -2243,8 +2267,9 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					migratetype, &did_some_progress,
-					sync_migration);
+					migratetype, sync_migration,
+					&deferred_compaction,
+					&did_some_progress);
 		if (page)
 			goto got_pg;
 	}
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 19/34] mm: compaction: make isolate_lru_page() filter-aware again
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit c82449352854ff09e43062246af86bdeb628f0c3 upstream.

Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
	information by reducing LRU list churning had the side-effect of
	reducing THP allocation success rates. This was part of a series
	to restore the success rates while preserving the reclaim fix.

Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
noted that compaction does not migrate dirty or writeback pages and
that is was meaningless to pick the page and re-add it to the LRU list.
This had to be partially reverted because some dirty pages can be
migrated by compaction without blocking.

This patch updates "mm: compaction: make isolate_lru_page" by skipping
over pages that migration has no possibility of migrating to minimise
LRU disruption.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel<riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |    2 ++
 mm/compaction.c        |    3 +++
 mm/vmscan.c            |   35 +++++++++++++++++++++++++++++++++--
 3 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 951ed81..80caa71 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -166,6 +166,8 @@ static inline int is_unevictable_lru(enum lru_list l)
 #define ISOLATE_CLEAN		((__force isolate_mode_t)0x4)
 /* Isolate unmapped file */
 #define ISOLATE_UNMAPPED	((__force isolate_mode_t)0x8)
+/* Isolate for asynchronous migration */
+#define ISOLATE_ASYNC_MIGRATE	((__force isolate_mode_t)0x10)
 
 /* LRU Isolation modes. */
 typedef unsigned __bitwise__ isolate_mode_t;
diff --git a/mm/compaction.c b/mm/compaction.c
index afdc416..76bdd65 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -371,6 +371,9 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 			continue;
 		}
 
+		if (!cc->sync)
+			mode |= ISOLATE_ASYNC_MIGRATE;
+
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, mode, 0) != 0)
 			continue;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9aa75e9..aa75861 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1045,8 +1045,39 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file)
 
 	ret = -EBUSY;
 
-	if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page)))
-		return ret;
+	/*
+	 * To minimise LRU disruption, the caller can indicate that it only
+	 * wants to isolate pages it will be able to operate on without
+	 * blocking - clean pages for the most part.
+	 *
+	 * ISOLATE_CLEAN means that only clean pages should be isolated. This
+	 * is used by reclaim when it is cannot write to backing storage
+	 *
+	 * ISOLATE_ASYNC_MIGRATE is used to indicate that it only wants to pages
+	 * that it is possible to migrate without blocking
+	 */
+	if (mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE)) {
+		/* All the caller can do on PageWriteback is block */
+		if (PageWriteback(page))
+			return ret;
+
+		if (PageDirty(page)) {
+			struct address_space *mapping;
+
+			/* ISOLATE_CLEAN means only clean pages */
+			if (mode & ISOLATE_CLEAN)
+				return ret;
+
+			/*
+			 * Only pages without mappings or that have a
+			 * ->migratepage callback are possible to migrate
+			 * without blocking
+			 */
+			mapping = page_mapping(page);
+			if (mapping && !mapping->a_ops->migratepage)
+				return ret;
+		}
+	}
 
 	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
 		return ret;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 19/34] mm: compaction: make isolate_lru_page() filter-aware again
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit c82449352854ff09e43062246af86bdeb628f0c3 upstream.

Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
	information by reducing LRU list churning had the side-effect of
	reducing THP allocation success rates. This was part of a series
	to restore the success rates while preserving the reclaim fix.

Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
noted that compaction does not migrate dirty or writeback pages and
that is was meaningless to pick the page and re-add it to the LRU list.
This had to be partially reverted because some dirty pages can be
migrated by compaction without blocking.

This patch updates "mm: compaction: make isolate_lru_page" by skipping
over pages that migration has no possibility of migrating to minimise
LRU disruption.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel<riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |    2 ++
 mm/compaction.c        |    3 +++
 mm/vmscan.c            |   35 +++++++++++++++++++++++++++++++++--
 3 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 951ed81..80caa71 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -166,6 +166,8 @@ static inline int is_unevictable_lru(enum lru_list l)
 #define ISOLATE_CLEAN		((__force isolate_mode_t)0x4)
 /* Isolate unmapped file */
 #define ISOLATE_UNMAPPED	((__force isolate_mode_t)0x8)
+/* Isolate for asynchronous migration */
+#define ISOLATE_ASYNC_MIGRATE	((__force isolate_mode_t)0x10)
 
 /* LRU Isolation modes. */
 typedef unsigned __bitwise__ isolate_mode_t;
diff --git a/mm/compaction.c b/mm/compaction.c
index afdc416..76bdd65 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -371,6 +371,9 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 			continue;
 		}
 
+		if (!cc->sync)
+			mode |= ISOLATE_ASYNC_MIGRATE;
+
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, mode, 0) != 0)
 			continue;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9aa75e9..aa75861 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1045,8 +1045,39 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file)
 
 	ret = -EBUSY;
 
-	if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page)))
-		return ret;
+	/*
+	 * To minimise LRU disruption, the caller can indicate that it only
+	 * wants to isolate pages it will be able to operate on without
+	 * blocking - clean pages for the most part.
+	 *
+	 * ISOLATE_CLEAN means that only clean pages should be isolated. This
+	 * is used by reclaim when it is cannot write to backing storage
+	 *
+	 * ISOLATE_ASYNC_MIGRATE is used to indicate that it only wants to pages
+	 * that it is possible to migrate without blocking
+	 */
+	if (mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE)) {
+		/* All the caller can do on PageWriteback is block */
+		if (PageWriteback(page))
+			return ret;
+
+		if (PageDirty(page)) {
+			struct address_space *mapping;
+
+			/* ISOLATE_CLEAN means only clean pages */
+			if (mode & ISOLATE_CLEAN)
+				return ret;
+
+			/*
+			 * Only pages without mappings or that have a
+			 * ->migratepage callback are possible to migrate
+			 * without blocking
+			 */
+			mapping = page_mapping(page);
+			if (mapping && !mapping->a_ops->migratepage)
+				return ret;
+		}
+	}
 
 	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
 		return ret;
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 20/34] kswapd: avoid unnecessary rebalance after an unsuccessful balancing
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: "Alex,Shi" <alex.shi@intel.com>

commit d2ebd0f6b89567eb93ead4e2ca0cbe03021f344b upstream.

Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019. This
	patch reduces kswapd CPU usage.

In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully") , Mel Gorman said kswapd is better to sleep
after a unsuccessful balancing if there is tighter reclaim request pending
in the balancing.  But in the following scenario, kswapd do something that
is not matched our expectation.  The patch fixes this issue.

1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
4, balance_pgdat() returns but failed since returned order = 0
5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
   While the expectation behavior of kswapd should try to sleep.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Pádraig Brady <P@draigBrady.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index aa75861..bf85e4d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2841,7 +2841,9 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 static int kswapd(void *p)
 {
 	unsigned long order, new_order;
+	unsigned balanced_order;
 	int classzone_idx, new_classzone_idx;
+	int balanced_classzone_idx;
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;
 
@@ -2872,7 +2874,9 @@ static int kswapd(void *p)
 	set_freezable();
 
 	order = new_order = 0;
+	balanced_order = 0;
 	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
+	balanced_classzone_idx = classzone_idx;
 	for ( ; ; ) {
 		int ret;
 
@@ -2881,7 +2885,8 @@ static int kswapd(void *p)
 		 * new request of a similar or harder type will succeed soon
 		 * so consider going to sleep on the basis we reclaimed at
 		 */
-		if (classzone_idx >= new_classzone_idx && order == new_order) {
+		if (balanced_classzone_idx >= new_classzone_idx &&
+					balanced_order == new_order) {
 			new_order = pgdat->kswapd_max_order;
 			new_classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order =  0;
@@ -2896,7 +2901,8 @@ static int kswapd(void *p)
 			order = new_order;
 			classzone_idx = new_classzone_idx;
 		} else {
-			kswapd_try_to_sleep(pgdat, order, classzone_idx);
+			kswapd_try_to_sleep(pgdat, balanced_order,
+						balanced_classzone_idx);
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order = 0;
@@ -2913,7 +2919,9 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			order = balance_pgdat(pgdat, order, &classzone_idx);
+			balanced_classzone_idx = classzone_idx;
+			balanced_order = balance_pgdat(pgdat, order,
+						&balanced_classzone_idx);
 		}
 	}
 	return 0;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 20/34] kswapd: avoid unnecessary rebalance after an unsuccessful balancing
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: "Alex,Shi" <alex.shi@intel.com>

commit d2ebd0f6b89567eb93ead4e2ca0cbe03021f344b upstream.

Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019. This
	patch reduces kswapd CPU usage.

In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully") , Mel Gorman said kswapd is better to sleep
after a unsuccessful balancing if there is tighter reclaim request pending
in the balancing.  But in the following scenario, kswapd do something that
is not matched our expectation.  The patch fixes this issue.

1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
4, balance_pgdat() returns but failed since returned order = 0
5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
   While the expectation behavior of kswapd should try to sleep.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: PA!draig Brady <P@draigBrady.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index aa75861..bf85e4d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2841,7 +2841,9 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 static int kswapd(void *p)
 {
 	unsigned long order, new_order;
+	unsigned balanced_order;
 	int classzone_idx, new_classzone_idx;
+	int balanced_classzone_idx;
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;
 
@@ -2872,7 +2874,9 @@ static int kswapd(void *p)
 	set_freezable();
 
 	order = new_order = 0;
+	balanced_order = 0;
 	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
+	balanced_classzone_idx = classzone_idx;
 	for ( ; ; ) {
 		int ret;
 
@@ -2881,7 +2885,8 @@ static int kswapd(void *p)
 		 * new request of a similar or harder type will succeed soon
 		 * so consider going to sleep on the basis we reclaimed at
 		 */
-		if (classzone_idx >= new_classzone_idx && order == new_order) {
+		if (balanced_classzone_idx >= new_classzone_idx &&
+					balanced_order == new_order) {
 			new_order = pgdat->kswapd_max_order;
 			new_classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order =  0;
@@ -2896,7 +2901,8 @@ static int kswapd(void *p)
 			order = new_order;
 			classzone_idx = new_classzone_idx;
 		} else {
-			kswapd_try_to_sleep(pgdat, order, classzone_idx);
+			kswapd_try_to_sleep(pgdat, balanced_order,
+						balanced_classzone_idx);
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order = 0;
@@ -2913,7 +2919,9 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			order = balance_pgdat(pgdat, order, &classzone_idx);
+			balanced_classzone_idx = classzone_idx;
+			balanced_order = balance_pgdat(pgdat, order,
+						&balanced_classzone_idx);
 		}
 	}
 	return 0;
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 21/34] kswapd: assign new_order and new_classzone_idx after wakeup in sleeping
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: "Alex,Shi" <alex.shi@intel.com>

commit f0dfcde099453aa4c0dc42473828d15a6d492936 upstream.

Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019. This
	patch reduces kswapd CPU usage.

There 2 places to read pgdat in kswapd.  One is return from a successful
balance, another is waked up from kswapd sleeping.  The new_order and
new_classzone_idx represent the balance input order and classzone_idx.

But current new_order and new_classzone_idx are not assigned after
kswapd_try_to_sleep(), that will cause a bug in the following scenario.

1: after a successful balance, kswapd goes to sleep, and new_order = 0;
   new_classzone_idx = __MAX_NR_ZONES - 1;

2: kswapd waked up with order = 3 and classzone_idx = ZONE_NORMAL

3: in the balance_pgdat() running, a new balance wakeup happened with
   order = 5, and classzone_idx = ZONE_NORMAL

4: the first wakeup(order = 3) finished successufly, return order = 3
   but, the new_order is still 0, so, this balancing will be treated as a
   failed balance.  And then the second tighter balancing will be missed.

So, to avoid the above problem, the new_order and new_classzone_idx need
to be assigned for later successful comparison.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bf85e4d..b8c1fc0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2905,6 +2905,8 @@ static int kswapd(void *p)
 						balanced_classzone_idx);
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
+			new_order = order;
+			new_classzone_idx = classzone_idx;
 			pgdat->kswapd_max_order = 0;
 			pgdat->classzone_idx = pgdat->nr_zones - 1;
 		}
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 21/34] kswapd: assign new_order and new_classzone_idx after wakeup in sleeping
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: "Alex,Shi" <alex.shi@intel.com>

commit f0dfcde099453aa4c0dc42473828d15a6d492936 upstream.

Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019. This
	patch reduces kswapd CPU usage.

There 2 places to read pgdat in kswapd.  One is return from a successful
balance, another is waked up from kswapd sleeping.  The new_order and
new_classzone_idx represent the balance input order and classzone_idx.

But current new_order and new_classzone_idx are not assigned after
kswapd_try_to_sleep(), that will cause a bug in the following scenario.

1: after a successful balance, kswapd goes to sleep, and new_order = 0;
   new_classzone_idx = __MAX_NR_ZONES - 1;

2: kswapd waked up with order = 3 and classzone_idx = ZONE_NORMAL

3: in the balance_pgdat() running, a new balance wakeup happened with
   order = 5, and classzone_idx = ZONE_NORMAL

4: the first wakeup(order = 3) finished successufly, return order = 3
   but, the new_order is still 0, so, this balancing will be treated as a
   failed balance.  And then the second tighter balancing will be missed.

So, to avoid the above problem, the new_order and new_classzone_idx need
to be assigned for later successful comparison.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Tested-by: PA!draig Brady <P@draigBrady.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bf85e4d..b8c1fc0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2905,6 +2905,8 @@ static int kswapd(void *p)
 						balanced_classzone_idx);
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
+			new_order = order;
+			new_classzone_idx = classzone_idx;
 			pgdat->kswapd_max_order = 0;
 			pgdat->classzone_idx = pgdat->nr_zones - 1;
 		}
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 22/34] mm: compaction: Introduce sync-light migration for use by compaction
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit a6bc32b899223a877f595ef9ddc1e89ead5072b8 upstream.

Stable note: Not tracked in Buzilla. This was part of a series that
	reduced interactivity stalls experienced when THP was enabled.
	These stalls were particularly noticable when copying data
	to a USB stick but the experiences for users varied a lot.

This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage. Async
compaction maps to MIGRATE_ASYNC while sync compaction maps to
MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory
hotplug, MIGRATE_SYNC is used.

This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be
a large number of dirty pages backed by a filesystem that does not
support ->writepages.

[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/btrfs/disk-io.c      |    3 +-
 fs/hugetlbfs/inode.c    |    3 +-
 fs/nfs/internal.h       |    2 +-
 fs/nfs/write.c          |    2 +-
 include/linux/fs.h      |    6 ++--
 include/linux/migrate.h |   23 +++++++++++---
 mm/compaction.c         |    2 +-
 mm/memory-failure.c     |    2 +-
 mm/memory_hotplug.c     |    2 +-
 mm/mempolicy.c          |    2 +-
 mm/migrate.c            |   78 ++++++++++++++++++++++++++---------------------
 11 files changed, 75 insertions(+), 50 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 522cb2a..c57d202 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -801,7 +801,8 @@ static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
 
 #ifdef CONFIG_MIGRATION
 static int btree_migratepage(struct address_space *mapping,
-			struct page *newpage, struct page *page, bool sync)
+			struct page *newpage, struct page *page,
+			enum migrate_mode sync)
 {
 	/*
 	 * we can't safely write a btree page from here,
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 8b0c875..6327a06 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -568,7 +568,8 @@ static int hugetlbfs_set_page_dirty(struct page *page)
 }
 
 static int hugetlbfs_migrate_page(struct address_space *mapping,
-				struct page *newpage, struct page *page)
+				struct page *newpage, struct page *page,
+				enum migrate_mode mode)
 {
 	int rc;
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index a74442a..4f10d81 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -315,7 +315,7 @@ void nfs_commit_release_pages(struct nfs_write_data *data);
 
 #ifdef CONFIG_MIGRATION
 extern int nfs_migrate_page(struct address_space *,
-		struct page *, struct page *, bool);
+		struct page *, struct page *, enum migrate_mode);
 #else
 #define nfs_migrate_page NULL
 #endif
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 22a48fd..a5fcc65 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1662,7 +1662,7 @@ out_error:
 
 #ifdef CONFIG_MIGRATION
 int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
-		struct page *page, bool sync)
+		struct page *page, enum migrate_mode sync)
 {
 	/*
 	 * If PagePrivate is set, then the page is currently associated with
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 09ddec9..212ea7b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -523,6 +523,7 @@ enum positive_aop_returns {
 struct page;
 struct address_space;
 struct writeback_control;
+enum migrate_mode;
 
 struct iov_iter {
 	const struct iovec *iov;
@@ -612,7 +613,7 @@ struct address_space_operations {
 	 * is false, it must not block.
 	 */
 	int (*migratepage) (struct address_space *,
-			struct page *, struct page *, bool);
+			struct page *, struct page *, enum migrate_mode);
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
@@ -2481,7 +2482,8 @@ extern int generic_check_addressable(unsigned, u64);
 
 #ifdef CONFIG_MIGRATION
 extern int buffer_migrate_page(struct address_space *,
-				struct page *, struct page *, bool);
+				struct page *, struct page *,
+				enum migrate_mode);
 #else
 #define buffer_migrate_page NULL
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 14e6d2a..775787c 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -6,18 +6,31 @@
 
 typedef struct page *new_page_t(struct page *, unsigned long private, int **);
 
+/*
+ * MIGRATE_ASYNC means never block
+ * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
+ *	on most operations but not ->writepage as the potential stall time
+ *	is too significant
+ * MIGRATE_SYNC will block when migrating pages
+ */
+enum migrate_mode {
+	MIGRATE_ASYNC,
+	MIGRATE_SYNC_LIGHT,
+	MIGRATE_SYNC,
+};
+
 #ifdef CONFIG_MIGRATION
 #define PAGE_MIGRATION 1
 
 extern void putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
-			struct page *, struct page *, bool);
+			struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
-			bool sync);
+			enum migrate_mode sync);
 extern int migrate_huge_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
-			bool sync);
+			enum migrate_mode sync);
 
 extern int fail_migrate_page(struct address_space *,
 			struct page *, struct page *);
@@ -36,10 +49,10 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, bool offlining,
-		bool sync) { return -ENOSYS; }
+		enum migrate_mode sync) { return -ENOSYS; }
 static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
 		unsigned long private, bool offlining,
-		bool sync) { return -ENOSYS; }
+		enum migrate_mode sync) { return -ENOSYS; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
diff --git a/mm/compaction.c b/mm/compaction.c
index 76bdd65..0d43bb9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -577,7 +577,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		nr_migrate = cc->nr_migratepages;
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				(unsigned long)cc, false,
-				cc->sync);
+				cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC);
 		update_nr_listpages(cc);
 		nr_remaining = cc->nr_migratepages;
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 740c4f5..6496748 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1464,7 +1464,7 @@ int soft_offline_page(struct page *page, int flags)
 					    page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
-								0, true);
+							0, MIGRATE_SYNC);
 		if (ret) {
 			putback_lru_pages(&pagelist);
 			pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c46887b..ae5a3f2 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -747,7 +747,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		}
 		/* this function returns # of failed pages */
 		ret = migrate_pages(&source, hotremove_migrate_alloc, 0,
-								true, true);
+							true, MIGRATE_SYNC);
 		if (ret)
 			putback_lru_pages(&source);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3dac2d1..dd5f874 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -926,7 +926,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_node_page, dest,
-								false, true);
+							false, MIGRATE_SYNC);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index fb8d1ae..132063e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -222,12 +222,13 @@ out:
 
 #ifdef CONFIG_BLOCK
 /* Returns true if all buffers are successfully locked */
-static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
+static bool buffer_migrate_lock_buffers(struct buffer_head *head,
+							enum migrate_mode mode)
 {
 	struct buffer_head *bh = head;
 
 	/* Simple case, sync compaction */
-	if (sync) {
+	if (mode != MIGRATE_ASYNC) {
 		do {
 			get_bh(bh);
 			lock_buffer(bh);
@@ -263,7 +264,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
 }
 #else
 static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
-								bool sync)
+							enum migrate_mode mode)
 {
 	return true;
 }
@@ -279,7 +280,7 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
  */
 static int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page,
-		struct buffer_head *head, bool sync)
+		struct buffer_head *head, enum migrate_mode mode)
 {
 	int expected_count;
 	void **pslot;
@@ -315,7 +316,8 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 	 * the mapping back due to an elevated page count, we would have to
 	 * block waiting on other references to be dropped.
 	 */
-	if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
+	if (mode == MIGRATE_ASYNC && head &&
+			!buffer_migrate_lock_buffers(head, mode)) {
 		page_unfreeze_refs(page, expected_count);
 		spin_unlock_irq(&mapping->tree_lock);
 		return -EAGAIN;
@@ -478,13 +480,14 @@ EXPORT_SYMBOL(fail_migrate_page);
  * Pages are locked upon entry and exit.
  */
 int migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page, bool sync)
+		struct page *newpage, struct page *page,
+		enum migrate_mode mode)
 {
 	int rc;
 
 	BUG_ON(PageWriteback(page));	/* Writeback must be complete */
 
-	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, sync);
+	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode);
 
 	if (rc)
 		return rc;
@@ -501,17 +504,17 @@ EXPORT_SYMBOL(migrate_page);
  * exist.
  */
 int buffer_migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page, bool sync)
+		struct page *newpage, struct page *page, enum migrate_mode mode)
 {
 	struct buffer_head *bh, *head;
 	int rc;
 
 	if (!page_has_buffers(page))
-		return migrate_page(mapping, newpage, page, sync);
+		return migrate_page(mapping, newpage, page, mode);
 
 	head = page_buffers(page);
 
-	rc = migrate_page_move_mapping(mapping, newpage, page, head, sync);
+	rc = migrate_page_move_mapping(mapping, newpage, page, head, mode);
 
 	if (rc)
 		return rc;
@@ -521,8 +524,8 @@ int buffer_migrate_page(struct address_space *mapping,
 	 * with an IRQ-safe spinlock held. In the sync case, the buffers
 	 * need to be locked no
 	 */
-	if (sync)
-		BUG_ON(!buffer_migrate_lock_buffers(head, sync));
+	if (mode != MIGRATE_ASYNC)
+		BUG_ON(!buffer_migrate_lock_buffers(head, mode));
 
 	ClearPagePrivate(page);
 	set_page_private(newpage, page_private(page));
@@ -599,10 +602,11 @@ static int writeout(struct address_space *mapping, struct page *page)
  * Default handling if a filesystem does not provide a migration function.
  */
 static int fallback_migrate_page(struct address_space *mapping,
-	struct page *newpage, struct page *page, bool sync)
+	struct page *newpage, struct page *page, enum migrate_mode mode)
 {
 	if (PageDirty(page)) {
-		if (!sync)
+		/* Only writeback pages in full synchronous migration */
+		if (mode != MIGRATE_SYNC)
 			return -EBUSY;
 		return writeout(mapping, page);
 	}
@@ -615,7 +619,7 @@ static int fallback_migrate_page(struct address_space *mapping,
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
 
-	return migrate_page(mapping, newpage, page, sync);
+	return migrate_page(mapping, newpage, page, mode);
 }
 
 /*
@@ -630,7 +634,7 @@ static int fallback_migrate_page(struct address_space *mapping,
  *  == 0 - success
  */
 static int move_to_new_page(struct page *newpage, struct page *page,
-					int remap_swapcache, bool sync)
+				int remap_swapcache, enum migrate_mode mode)
 {
 	struct address_space *mapping;
 	int rc;
@@ -651,7 +655,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 
 	mapping = page_mapping(page);
 	if (!mapping)
-		rc = migrate_page(mapping, newpage, page, sync);
+		rc = migrate_page(mapping, newpage, page, mode);
 	else if (mapping->a_ops->migratepage)
 		/*
 		 * Most pages have a mapping and most filesystems provide a
@@ -660,9 +664,9 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		 * is the most common path for page migration.
 		 */
 		rc = mapping->a_ops->migratepage(mapping,
-						newpage, page, sync);
+						newpage, page, mode);
 	else
-		rc = fallback_migrate_page(mapping, newpage, page, sync);
+		rc = fallback_migrate_page(mapping, newpage, page, mode);
 
 	if (rc) {
 		newpage->mapping = NULL;
@@ -677,7 +681,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 }
 
 static int __unmap_and_move(struct page *page, struct page *newpage,
-				int force, bool offlining, bool sync)
+			int force, bool offlining, enum migrate_mode mode)
 {
 	int rc = -EAGAIN;
 	int remap_swapcache = 1;
@@ -686,7 +690,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
-		if (!force || !sync)
+		if (!force || mode == MIGRATE_ASYNC)
 			goto out;
 
 		/*
@@ -732,10 +736,12 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 
 	if (PageWriteback(page)) {
 		/*
-		 * For !sync, there is no point retrying as the retry loop
-		 * is expected to be too short for PageWriteback to be cleared
+		 * Only in the case of a full syncronous migration is it
+		 * necessary to wait for PageWriteback. In the async case,
+		 * the retry loop is too short and in the sync-light case,
+		 * the overhead of stalling is too much
 		 */
-		if (!sync) {
+		if (mode != MIGRATE_SYNC) {
 			rc = -EBUSY;
 			goto uncharge;
 		}
@@ -806,7 +812,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 
 skip_unmap:
 	if (!page_mapped(page))
-		rc = move_to_new_page(newpage, page, remap_swapcache, sync);
+		rc = move_to_new_page(newpage, page, remap_swapcache, mode);
 
 	if (rc && remap_swapcache)
 		remove_migration_ptes(page, page);
@@ -829,7 +835,8 @@ out:
  * to the newly allocated page in newpage.
  */
 static int unmap_and_move(new_page_t get_new_page, unsigned long private,
-			struct page *page, int force, bool offlining, bool sync)
+			struct page *page, int force, bool offlining,
+			enum migrate_mode mode)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -847,7 +854,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 		if (unlikely(split_huge_page(page)))
 			goto out;
 
-	rc = __unmap_and_move(page, newpage, force, offlining, sync);
+	rc = __unmap_and_move(page, newpage, force, offlining, mode);
 out:
 	if (rc != -EAGAIN) {
  		/*
@@ -897,7 +904,8 @@ out:
  */
 static int unmap_and_move_huge_page(new_page_t get_new_page,
 				unsigned long private, struct page *hpage,
-				int force, bool offlining, bool sync)
+				int force, bool offlining,
+				enum migrate_mode mode)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -910,7 +918,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	rc = -EAGAIN;
 
 	if (!trylock_page(hpage)) {
-		if (!force || !sync)
+		if (!force || mode != MIGRATE_SYNC)
 			goto out;
 		lock_page(hpage);
 	}
@@ -921,7 +929,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(hpage))
-		rc = move_to_new_page(new_hpage, hpage, 1, sync);
+		rc = move_to_new_page(new_hpage, hpage, 1, mode);
 
 	if (rc)
 		remove_migration_ptes(hpage, hpage);
@@ -964,7 +972,7 @@ out:
  */
 int migrate_pages(struct list_head *from,
 		new_page_t get_new_page, unsigned long private, bool offlining,
-		bool sync)
+		enum migrate_mode mode)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -985,7 +993,7 @@ int migrate_pages(struct list_head *from,
 
 			rc = unmap_and_move(get_new_page, private,
 						page, pass > 2, offlining,
-						sync);
+						mode);
 
 			switch(rc) {
 			case -ENOMEM:
@@ -1015,7 +1023,7 @@ out:
 
 int migrate_huge_pages(struct list_head *from,
 		new_page_t get_new_page, unsigned long private, bool offlining,
-		bool sync)
+		enum migrate_mode mode)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -1032,7 +1040,7 @@ int migrate_huge_pages(struct list_head *from,
 
 			rc = unmap_and_move_huge_page(get_new_page,
 					private, page, pass > 2, offlining,
-					sync);
+					mode);
 
 			switch(rc) {
 			case -ENOMEM:
@@ -1161,7 +1169,7 @@ set_status:
 	err = 0;
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_page_node,
-				(unsigned long)pm, 0, true);
+				(unsigned long)pm, 0, MIGRATE_SYNC);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 22/34] mm: compaction: Introduce sync-light migration for use by compaction
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit a6bc32b899223a877f595ef9ddc1e89ead5072b8 upstream.

Stable note: Not tracked in Buzilla. This was part of a series that
	reduced interactivity stalls experienced when THP was enabled.
	These stalls were particularly noticable when copying data
	to a USB stick but the experiences for users varied a lot.

This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage. Async
compaction maps to MIGRATE_ASYNC while sync compaction maps to
MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory
hotplug, MIGRATE_SYNC is used.

This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be
a large number of dirty pages backed by a filesystem that does not
support ->writepages.

[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/btrfs/disk-io.c      |    3 +-
 fs/hugetlbfs/inode.c    |    3 +-
 fs/nfs/internal.h       |    2 +-
 fs/nfs/write.c          |    2 +-
 include/linux/fs.h      |    6 ++--
 include/linux/migrate.h |   23 +++++++++++---
 mm/compaction.c         |    2 +-
 mm/memory-failure.c     |    2 +-
 mm/memory_hotplug.c     |    2 +-
 mm/mempolicy.c          |    2 +-
 mm/migrate.c            |   78 ++++++++++++++++++++++++++---------------------
 11 files changed, 75 insertions(+), 50 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 522cb2a..c57d202 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -801,7 +801,8 @@ static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
 
 #ifdef CONFIG_MIGRATION
 static int btree_migratepage(struct address_space *mapping,
-			struct page *newpage, struct page *page, bool sync)
+			struct page *newpage, struct page *page,
+			enum migrate_mode sync)
 {
 	/*
 	 * we can't safely write a btree page from here,
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 8b0c875..6327a06 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -568,7 +568,8 @@ static int hugetlbfs_set_page_dirty(struct page *page)
 }
 
 static int hugetlbfs_migrate_page(struct address_space *mapping,
-				struct page *newpage, struct page *page)
+				struct page *newpage, struct page *page,
+				enum migrate_mode mode)
 {
 	int rc;
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index a74442a..4f10d81 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -315,7 +315,7 @@ void nfs_commit_release_pages(struct nfs_write_data *data);
 
 #ifdef CONFIG_MIGRATION
 extern int nfs_migrate_page(struct address_space *,
-		struct page *, struct page *, bool);
+		struct page *, struct page *, enum migrate_mode);
 #else
 #define nfs_migrate_page NULL
 #endif
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 22a48fd..a5fcc65 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1662,7 +1662,7 @@ out_error:
 
 #ifdef CONFIG_MIGRATION
 int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
-		struct page *page, bool sync)
+		struct page *page, enum migrate_mode sync)
 {
 	/*
 	 * If PagePrivate is set, then the page is currently associated with
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 09ddec9..212ea7b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -523,6 +523,7 @@ enum positive_aop_returns {
 struct page;
 struct address_space;
 struct writeback_control;
+enum migrate_mode;
 
 struct iov_iter {
 	const struct iovec *iov;
@@ -612,7 +613,7 @@ struct address_space_operations {
 	 * is false, it must not block.
 	 */
 	int (*migratepage) (struct address_space *,
-			struct page *, struct page *, bool);
+			struct page *, struct page *, enum migrate_mode);
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
@@ -2481,7 +2482,8 @@ extern int generic_check_addressable(unsigned, u64);
 
 #ifdef CONFIG_MIGRATION
 extern int buffer_migrate_page(struct address_space *,
-				struct page *, struct page *, bool);
+				struct page *, struct page *,
+				enum migrate_mode);
 #else
 #define buffer_migrate_page NULL
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 14e6d2a..775787c 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -6,18 +6,31 @@
 
 typedef struct page *new_page_t(struct page *, unsigned long private, int **);
 
+/*
+ * MIGRATE_ASYNC means never block
+ * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
+ *	on most operations but not ->writepage as the potential stall time
+ *	is too significant
+ * MIGRATE_SYNC will block when migrating pages
+ */
+enum migrate_mode {
+	MIGRATE_ASYNC,
+	MIGRATE_SYNC_LIGHT,
+	MIGRATE_SYNC,
+};
+
 #ifdef CONFIG_MIGRATION
 #define PAGE_MIGRATION 1
 
 extern void putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
-			struct page *, struct page *, bool);
+			struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
-			bool sync);
+			enum migrate_mode sync);
 extern int migrate_huge_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
-			bool sync);
+			enum migrate_mode sync);
 
 extern int fail_migrate_page(struct address_space *,
 			struct page *, struct page *);
@@ -36,10 +49,10 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, bool offlining,
-		bool sync) { return -ENOSYS; }
+		enum migrate_mode sync) { return -ENOSYS; }
 static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
 		unsigned long private, bool offlining,
-		bool sync) { return -ENOSYS; }
+		enum migrate_mode sync) { return -ENOSYS; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
diff --git a/mm/compaction.c b/mm/compaction.c
index 76bdd65..0d43bb9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -577,7 +577,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		nr_migrate = cc->nr_migratepages;
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				(unsigned long)cc, false,
-				cc->sync);
+				cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC);
 		update_nr_listpages(cc);
 		nr_remaining = cc->nr_migratepages;
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 740c4f5..6496748 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1464,7 +1464,7 @@ int soft_offline_page(struct page *page, int flags)
 					    page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
-								0, true);
+							0, MIGRATE_SYNC);
 		if (ret) {
 			putback_lru_pages(&pagelist);
 			pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c46887b..ae5a3f2 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -747,7 +747,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		}
 		/* this function returns # of failed pages */
 		ret = migrate_pages(&source, hotremove_migrate_alloc, 0,
-								true, true);
+							true, MIGRATE_SYNC);
 		if (ret)
 			putback_lru_pages(&source);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3dac2d1..dd5f874 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -926,7 +926,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_node_page, dest,
-								false, true);
+							false, MIGRATE_SYNC);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index fb8d1ae..132063e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -222,12 +222,13 @@ out:
 
 #ifdef CONFIG_BLOCK
 /* Returns true if all buffers are successfully locked */
-static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
+static bool buffer_migrate_lock_buffers(struct buffer_head *head,
+							enum migrate_mode mode)
 {
 	struct buffer_head *bh = head;
 
 	/* Simple case, sync compaction */
-	if (sync) {
+	if (mode != MIGRATE_ASYNC) {
 		do {
 			get_bh(bh);
 			lock_buffer(bh);
@@ -263,7 +264,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
 }
 #else
 static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
-								bool sync)
+							enum migrate_mode mode)
 {
 	return true;
 }
@@ -279,7 +280,7 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
  */
 static int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page,
-		struct buffer_head *head, bool sync)
+		struct buffer_head *head, enum migrate_mode mode)
 {
 	int expected_count;
 	void **pslot;
@@ -315,7 +316,8 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 	 * the mapping back due to an elevated page count, we would have to
 	 * block waiting on other references to be dropped.
 	 */
-	if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
+	if (mode == MIGRATE_ASYNC && head &&
+			!buffer_migrate_lock_buffers(head, mode)) {
 		page_unfreeze_refs(page, expected_count);
 		spin_unlock_irq(&mapping->tree_lock);
 		return -EAGAIN;
@@ -478,13 +480,14 @@ EXPORT_SYMBOL(fail_migrate_page);
  * Pages are locked upon entry and exit.
  */
 int migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page, bool sync)
+		struct page *newpage, struct page *page,
+		enum migrate_mode mode)
 {
 	int rc;
 
 	BUG_ON(PageWriteback(page));	/* Writeback must be complete */
 
-	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, sync);
+	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode);
 
 	if (rc)
 		return rc;
@@ -501,17 +504,17 @@ EXPORT_SYMBOL(migrate_page);
  * exist.
  */
 int buffer_migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page, bool sync)
+		struct page *newpage, struct page *page, enum migrate_mode mode)
 {
 	struct buffer_head *bh, *head;
 	int rc;
 
 	if (!page_has_buffers(page))
-		return migrate_page(mapping, newpage, page, sync);
+		return migrate_page(mapping, newpage, page, mode);
 
 	head = page_buffers(page);
 
-	rc = migrate_page_move_mapping(mapping, newpage, page, head, sync);
+	rc = migrate_page_move_mapping(mapping, newpage, page, head, mode);
 
 	if (rc)
 		return rc;
@@ -521,8 +524,8 @@ int buffer_migrate_page(struct address_space *mapping,
 	 * with an IRQ-safe spinlock held. In the sync case, the buffers
 	 * need to be locked no
 	 */
-	if (sync)
-		BUG_ON(!buffer_migrate_lock_buffers(head, sync));
+	if (mode != MIGRATE_ASYNC)
+		BUG_ON(!buffer_migrate_lock_buffers(head, mode));
 
 	ClearPagePrivate(page);
 	set_page_private(newpage, page_private(page));
@@ -599,10 +602,11 @@ static int writeout(struct address_space *mapping, struct page *page)
  * Default handling if a filesystem does not provide a migration function.
  */
 static int fallback_migrate_page(struct address_space *mapping,
-	struct page *newpage, struct page *page, bool sync)
+	struct page *newpage, struct page *page, enum migrate_mode mode)
 {
 	if (PageDirty(page)) {
-		if (!sync)
+		/* Only writeback pages in full synchronous migration */
+		if (mode != MIGRATE_SYNC)
 			return -EBUSY;
 		return writeout(mapping, page);
 	}
@@ -615,7 +619,7 @@ static int fallback_migrate_page(struct address_space *mapping,
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
 
-	return migrate_page(mapping, newpage, page, sync);
+	return migrate_page(mapping, newpage, page, mode);
 }
 
 /*
@@ -630,7 +634,7 @@ static int fallback_migrate_page(struct address_space *mapping,
  *  == 0 - success
  */
 static int move_to_new_page(struct page *newpage, struct page *page,
-					int remap_swapcache, bool sync)
+				int remap_swapcache, enum migrate_mode mode)
 {
 	struct address_space *mapping;
 	int rc;
@@ -651,7 +655,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 
 	mapping = page_mapping(page);
 	if (!mapping)
-		rc = migrate_page(mapping, newpage, page, sync);
+		rc = migrate_page(mapping, newpage, page, mode);
 	else if (mapping->a_ops->migratepage)
 		/*
 		 * Most pages have a mapping and most filesystems provide a
@@ -660,9 +664,9 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		 * is the most common path for page migration.
 		 */
 		rc = mapping->a_ops->migratepage(mapping,
-						newpage, page, sync);
+						newpage, page, mode);
 	else
-		rc = fallback_migrate_page(mapping, newpage, page, sync);
+		rc = fallback_migrate_page(mapping, newpage, page, mode);
 
 	if (rc) {
 		newpage->mapping = NULL;
@@ -677,7 +681,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 }
 
 static int __unmap_and_move(struct page *page, struct page *newpage,
-				int force, bool offlining, bool sync)
+			int force, bool offlining, enum migrate_mode mode)
 {
 	int rc = -EAGAIN;
 	int remap_swapcache = 1;
@@ -686,7 +690,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
-		if (!force || !sync)
+		if (!force || mode == MIGRATE_ASYNC)
 			goto out;
 
 		/*
@@ -732,10 +736,12 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 
 	if (PageWriteback(page)) {
 		/*
-		 * For !sync, there is no point retrying as the retry loop
-		 * is expected to be too short for PageWriteback to be cleared
+		 * Only in the case of a full syncronous migration is it
+		 * necessary to wait for PageWriteback. In the async case,
+		 * the retry loop is too short and in the sync-light case,
+		 * the overhead of stalling is too much
 		 */
-		if (!sync) {
+		if (mode != MIGRATE_SYNC) {
 			rc = -EBUSY;
 			goto uncharge;
 		}
@@ -806,7 +812,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 
 skip_unmap:
 	if (!page_mapped(page))
-		rc = move_to_new_page(newpage, page, remap_swapcache, sync);
+		rc = move_to_new_page(newpage, page, remap_swapcache, mode);
 
 	if (rc && remap_swapcache)
 		remove_migration_ptes(page, page);
@@ -829,7 +835,8 @@ out:
  * to the newly allocated page in newpage.
  */
 static int unmap_and_move(new_page_t get_new_page, unsigned long private,
-			struct page *page, int force, bool offlining, bool sync)
+			struct page *page, int force, bool offlining,
+			enum migrate_mode mode)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -847,7 +854,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 		if (unlikely(split_huge_page(page)))
 			goto out;
 
-	rc = __unmap_and_move(page, newpage, force, offlining, sync);
+	rc = __unmap_and_move(page, newpage, force, offlining, mode);
 out:
 	if (rc != -EAGAIN) {
  		/*
@@ -897,7 +904,8 @@ out:
  */
 static int unmap_and_move_huge_page(new_page_t get_new_page,
 				unsigned long private, struct page *hpage,
-				int force, bool offlining, bool sync)
+				int force, bool offlining,
+				enum migrate_mode mode)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -910,7 +918,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	rc = -EAGAIN;
 
 	if (!trylock_page(hpage)) {
-		if (!force || !sync)
+		if (!force || mode != MIGRATE_SYNC)
 			goto out;
 		lock_page(hpage);
 	}
@@ -921,7 +929,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(hpage))
-		rc = move_to_new_page(new_hpage, hpage, 1, sync);
+		rc = move_to_new_page(new_hpage, hpage, 1, mode);
 
 	if (rc)
 		remove_migration_ptes(hpage, hpage);
@@ -964,7 +972,7 @@ out:
  */
 int migrate_pages(struct list_head *from,
 		new_page_t get_new_page, unsigned long private, bool offlining,
-		bool sync)
+		enum migrate_mode mode)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -985,7 +993,7 @@ int migrate_pages(struct list_head *from,
 
 			rc = unmap_and_move(get_new_page, private,
 						page, pass > 2, offlining,
-						sync);
+						mode);
 
 			switch(rc) {
 			case -ENOMEM:
@@ -1015,7 +1023,7 @@ out:
 
 int migrate_huge_pages(struct list_head *from,
 		new_page_t get_new_page, unsigned long private, bool offlining,
-		bool sync)
+		enum migrate_mode mode)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -1032,7 +1040,7 @@ int migrate_huge_pages(struct list_head *from,
 
 			rc = unmap_and_move_huge_page(get_new_page,
 					private, page, pass > 2, offlining,
-					sync);
+					mode);
 
 			switch(rc) {
 			case -ENOMEM:
@@ -1161,7 +1169,7 @@ set_status:
 	err = 0;
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_page_node,
-				(unsigned long)pm, 0, true);
+				(unsigned long)pm, 0, MIGRATE_SYNC);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 23/34] mm: vmscan: When reclaiming for compaction, ensure there are sufficient free pages available
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit fe4b1b244bdb96136855f2c694071cb09d140766 upstream.

Stable note: Not tracked on Bugzilla. THP and compaction was found to
	aggressively reclaim pages and stall systems under different
	situations that was addressed piecemeal over time. This patch
	addresses a problem where the fix regressed THP allocation
	success rates.

In commit [e0887c19: vmscan: limit direct reclaim for higher order
allocations], Rik noted that reclaim was too aggressive when THP was
enabled. In his initial patch he used the number of free pages to
decide if reclaim should abort for compaction. My feedback was that
reclaim and compaction should be using the same logic when deciding if
reclaim should be aborted.

Unfortunately, this had the effect of reducing THP success rates when
the workload included something like streaming reads that continually
allocated pages. The window during which compaction could run and return
a THP was too small.

This patch combines Rik's two patches together. compaction_suitable()
is still used to decide if reclaim should be aborted to allow
compaction is used. However, it will also ensure that there is a
reasonable buffer of free pages available. This improves upon the
THP allocation success rates but bounds the number of pages that are
freed for compaction.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel<riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   44 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 39 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b8c1fc0..e85abfd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2075,6 +2075,42 @@ restart:
 	throttle_vm_writeout(sc->gfp_mask);
 }
 
+/* Returns true if compaction should go ahead for a high-order request */
+static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
+{
+	unsigned long balance_gap, watermark;
+	bool watermark_ok;
+
+	/* Do not consider compaction for orders reclaim is meant to satisfy */
+	if (sc->order <= PAGE_ALLOC_COSTLY_ORDER)
+		return false;
+
+	/*
+	 * Compaction takes time to run and there are potentially other
+	 * callers using the pages just freed. Continue reclaiming until
+	 * there is a buffer of free pages available to give compaction
+	 * a reasonable chance of completing and allocating the page
+	 */
+	balance_gap = min(low_wmark_pages(zone),
+		(zone->present_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
+			KSWAPD_ZONE_BALANCE_GAP_RATIO);
+	watermark = high_wmark_pages(zone) + balance_gap + (2UL << sc->order);
+	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0);
+
+	/*
+	 * If compaction is deferred, reclaim up to a point where
+	 * compaction will have a chance of success when re-enabled
+	 */
+	if (compaction_deferred(zone))
+		return watermark_ok;
+
+	/* If compaction is not ready to start, keep reclaiming */
+	if (!compaction_suitable(zone, sc->order))
+		return false;
+
+	return watermark_ok;
+}
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -2092,8 +2128,8 @@ restart:
  * scan then give up on it.
  *
  * This function returns true if a zone is being reclaimed for a costly
- * high-order allocation and compaction is either ready to begin or deferred.
- * This indicates to the caller that it should retry the allocation or fail.
+ * allocation and compaction is ready to begin. This indicates to the caller
+ * that it should retry the allocation or fail.
  */
 static bool shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
@@ -2127,9 +2163,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 				 * noticable problem, like transparent huge page
 				 * allocations.
 				 */
-				if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
-					(compaction_suitable(zone, sc->order) ||
-					 compaction_deferred(zone))) {
+				if (compaction_ready(zone, sc)) {
 					should_abort_reclaim = true;
 					continue;
 				}
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 23/34] mm: vmscan: When reclaiming for compaction, ensure there are sufficient free pages available
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit fe4b1b244bdb96136855f2c694071cb09d140766 upstream.

Stable note: Not tracked on Bugzilla. THP and compaction was found to
	aggressively reclaim pages and stall systems under different
	situations that was addressed piecemeal over time. This patch
	addresses a problem where the fix regressed THP allocation
	success rates.

In commit [e0887c19: vmscan: limit direct reclaim for higher order
allocations], Rik noted that reclaim was too aggressive when THP was
enabled. In his initial patch he used the number of free pages to
decide if reclaim should abort for compaction. My feedback was that
reclaim and compaction should be using the same logic when deciding if
reclaim should be aborted.

Unfortunately, this had the effect of reducing THP success rates when
the workload included something like streaming reads that continually
allocated pages. The window during which compaction could run and return
a THP was too small.

This patch combines Rik's two patches together. compaction_suitable()
is still used to decide if reclaim should be aborted to allow
compaction is used. However, it will also ensure that there is a
reasonable buffer of free pages available. This improves upon the
THP allocation success rates but bounds the number of pages that are
freed for compaction.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel<riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   44 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 39 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b8c1fc0..e85abfd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2075,6 +2075,42 @@ restart:
 	throttle_vm_writeout(sc->gfp_mask);
 }
 
+/* Returns true if compaction should go ahead for a high-order request */
+static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
+{
+	unsigned long balance_gap, watermark;
+	bool watermark_ok;
+
+	/* Do not consider compaction for orders reclaim is meant to satisfy */
+	if (sc->order <= PAGE_ALLOC_COSTLY_ORDER)
+		return false;
+
+	/*
+	 * Compaction takes time to run and there are potentially other
+	 * callers using the pages just freed. Continue reclaiming until
+	 * there is a buffer of free pages available to give compaction
+	 * a reasonable chance of completing and allocating the page
+	 */
+	balance_gap = min(low_wmark_pages(zone),
+		(zone->present_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
+			KSWAPD_ZONE_BALANCE_GAP_RATIO);
+	watermark = high_wmark_pages(zone) + balance_gap + (2UL << sc->order);
+	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0);
+
+	/*
+	 * If compaction is deferred, reclaim up to a point where
+	 * compaction will have a chance of success when re-enabled
+	 */
+	if (compaction_deferred(zone))
+		return watermark_ok;
+
+	/* If compaction is not ready to start, keep reclaiming */
+	if (!compaction_suitable(zone, sc->order))
+		return false;
+
+	return watermark_ok;
+}
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -2092,8 +2128,8 @@ restart:
  * scan then give up on it.
  *
  * This function returns true if a zone is being reclaimed for a costly
- * high-order allocation and compaction is either ready to begin or deferred.
- * This indicates to the caller that it should retry the allocation or fail.
+ * allocation and compaction is ready to begin. This indicates to the caller
+ * that it should retry the allocation or fail.
  */
 static bool shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
@@ -2127,9 +2163,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 				 * noticable problem, like transparent huge page
 				 * allocations.
 				 */
-				if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
-					(compaction_suitable(zone, sc->order) ||
-					 compaction_deferred(zone))) {
+				if (compaction_ready(zone, sc)) {
 					should_abort_reclaim = true;
 					continue;
 				}
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 24/34] mm: vmscan: Do not OOM if aborting reclaim to start compaction
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit 7335084d446b83cbcb15da80497d03f0c1dc9e21 upstream.

Stable note: Not tracked in Bugzilla. This patch makes later patches
	easier to apply but otherwise has little to justify it. The
	problem it fixes was never observed but the source of the
	theoretical problem did not exist for very long.

When direct reclaim is entered is is possible that reclaim will be
aborted so that compaction can be attempted to satisfy a high-order
allocation.  If this decision is made before any pages are reclaimed,
it is possible for 0 to be returned to the page allocator potentially
triggering an OOM. This has not been observed but it is a possibility
so this patch addresses it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e85abfd..f109f2d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2240,6 +2240,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	struct zoneref *z;
 	struct zone *zone;
 	unsigned long writeback_threshold;
+	bool should_abort_reclaim;
 
 	get_mems_allowed();
 	delayacct_freepages_start();
@@ -2251,7 +2252,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		sc->nr_scanned = 0;
 		if (!priority)
 			disable_swap_token(sc->mem_cgroup);
-		if (shrink_zones(priority, zonelist, sc))
+		should_abort_reclaim = shrink_zones(priority, zonelist, sc);
+		if (should_abort_reclaim)
 			break;
 
 		/*
@@ -2318,6 +2320,10 @@ out:
 	if (oom_killer_disabled)
 		return 0;
 
+	/* Aborting reclaim to try compaction? don't OOM, then */
+	if (should_abort_reclaim)
+		return 1;
+
 	/* top priority shrink_zones still had more to do? don't OOM, then */
 	if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
 		return 1;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 24/34] mm: vmscan: Do not OOM if aborting reclaim to start compaction
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit 7335084d446b83cbcb15da80497d03f0c1dc9e21 upstream.

Stable note: Not tracked in Bugzilla. This patch makes later patches
	easier to apply but otherwise has little to justify it. The
	problem it fixes was never observed but the source of the
	theoretical problem did not exist for very long.

When direct reclaim is entered is is possible that reclaim will be
aborted so that compaction can be attempted to satisfy a high-order
allocation.  If this decision is made before any pages are reclaimed,
it is possible for 0 to be returned to the page allocator potentially
triggering an OOM. This has not been observed but it is a possibility
so this patch addresses it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e85abfd..f109f2d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2240,6 +2240,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	struct zoneref *z;
 	struct zone *zone;
 	unsigned long writeback_threshold;
+	bool should_abort_reclaim;
 
 	get_mems_allowed();
 	delayacct_freepages_start();
@@ -2251,7 +2252,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		sc->nr_scanned = 0;
 		if (!priority)
 			disable_swap_token(sc->mem_cgroup);
-		if (shrink_zones(priority, zonelist, sc))
+		should_abort_reclaim = shrink_zones(priority, zonelist, sc);
+		if (should_abort_reclaim)
 			break;
 
 		/*
@@ -2318,6 +2320,10 @@ out:
 	if (oom_killer_disabled)
 		return 0;
 
+	/* Aborting reclaim to try compaction? don't OOM, then */
+	if (should_abort_reclaim)
+		return 1;
+
 	/* top priority shrink_zones still had more to do? don't OOM, then */
 	if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
 		return 1;
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 25/34] mm: vmscan: Check if reclaim should really abort even if compaction_ready() is true for one zone
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit 0cee34fd72c582b4f8ad8ce00645b75fb4168199 upstream.

Stable note: Not tracked on Bugzilla. THP and compaction was found to
	aggressively reclaim pages and stall systems under different
	situations that was addressed piecemeal over time.

If compaction can proceed for a given zone, shrink_zones() does not
reclaim any more pages from it. After commit [e0c2327: vmscan: abort
reclaim/compaction if compaction can proceed], do_try_to_free_pages()
tries to finish as soon as possible once one zone can compact.

This was intended to prevent slabs being shrunk unnecessarily but
there are side-effects. One is that a small zone that is ready for
compaction will abort reclaim even if the chances of successfully
allocating a THP from that zone is small. It also means that reclaim
can return too early even though sc->nr_to_reclaim pages were not
reclaimed.

This partially reverts the commit until it is proven that slabs are
really being shrunk unnecessarily but preserves the check to return
1 to avoid OOM if reclaim was aborted prematurely.

[aarcange@redhat.com: This patch replaces a revert from Andrea]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f109f2d..bc31f32 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2129,7 +2129,8 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
  *
  * This function returns true if a zone is being reclaimed for a costly
  * allocation and compaction is ready to begin. This indicates to the caller
- * that it should retry the allocation or fail.
+ * that it should consider retrying the allocation instead of
+ * further reclaim.
  */
 static bool shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
@@ -2138,7 +2139,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 	struct zone *zone;
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
-	bool should_abort_reclaim = false;
+	bool aborted_reclaim = false;
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2164,7 +2165,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 				 * allocations.
 				 */
 				if (compaction_ready(zone, sc)) {
-					should_abort_reclaim = true;
+					aborted_reclaim = true;
 					continue;
 				}
 			}
@@ -2186,7 +2187,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 		shrink_zone(priority, zone, sc);
 	}
 
-	return should_abort_reclaim;
+	return aborted_reclaim;
 }
 
 static bool zone_reclaimable(struct zone *zone)
@@ -2240,7 +2241,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	struct zoneref *z;
 	struct zone *zone;
 	unsigned long writeback_threshold;
-	bool should_abort_reclaim;
+	bool aborted_reclaim;
 
 	get_mems_allowed();
 	delayacct_freepages_start();
@@ -2252,9 +2253,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		sc->nr_scanned = 0;
 		if (!priority)
 			disable_swap_token(sc->mem_cgroup);
-		should_abort_reclaim = shrink_zones(priority, zonelist, sc);
-		if (should_abort_reclaim)
-			break;
+		aborted_reclaim = shrink_zones(priority, zonelist, sc);
 
 		/*
 		 * Don't shrink slabs when reclaiming memory from
@@ -2320,8 +2319,8 @@ out:
 	if (oom_killer_disabled)
 		return 0;
 
-	/* Aborting reclaim to try compaction? don't OOM, then */
-	if (should_abort_reclaim)
+	/* Aborted reclaim to try compaction? don't OOM, then */
+	if (aborted_reclaim)
 		return 1;
 
 	/* top priority shrink_zones still had more to do? don't OOM, then */
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 25/34] mm: vmscan: Check if reclaim should really abort even if compaction_ready() is true for one zone
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit 0cee34fd72c582b4f8ad8ce00645b75fb4168199 upstream.

Stable note: Not tracked on Bugzilla. THP and compaction was found to
	aggressively reclaim pages and stall systems under different
	situations that was addressed piecemeal over time.

If compaction can proceed for a given zone, shrink_zones() does not
reclaim any more pages from it. After commit [e0c2327: vmscan: abort
reclaim/compaction if compaction can proceed], do_try_to_free_pages()
tries to finish as soon as possible once one zone can compact.

This was intended to prevent slabs being shrunk unnecessarily but
there are side-effects. One is that a small zone that is ready for
compaction will abort reclaim even if the chances of successfully
allocating a THP from that zone is small. It also means that reclaim
can return too early even though sc->nr_to_reclaim pages were not
reclaimed.

This partially reverts the commit until it is proven that slabs are
really being shrunk unnecessarily but preserves the check to return
1 to avoid OOM if reclaim was aborted prematurely.

[aarcange@redhat.com: This patch replaces a revert from Andrea]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f109f2d..bc31f32 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2129,7 +2129,8 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
  *
  * This function returns true if a zone is being reclaimed for a costly
  * allocation and compaction is ready to begin. This indicates to the caller
- * that it should retry the allocation or fail.
+ * that it should consider retrying the allocation instead of
+ * further reclaim.
  */
 static bool shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
@@ -2138,7 +2139,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 	struct zone *zone;
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
-	bool should_abort_reclaim = false;
+	bool aborted_reclaim = false;
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2164,7 +2165,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 				 * allocations.
 				 */
 				if (compaction_ready(zone, sc)) {
-					should_abort_reclaim = true;
+					aborted_reclaim = true;
 					continue;
 				}
 			}
@@ -2186,7 +2187,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 		shrink_zone(priority, zone, sc);
 	}
 
-	return should_abort_reclaim;
+	return aborted_reclaim;
 }
 
 static bool zone_reclaimable(struct zone *zone)
@@ -2240,7 +2241,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	struct zoneref *z;
 	struct zone *zone;
 	unsigned long writeback_threshold;
-	bool should_abort_reclaim;
+	bool aborted_reclaim;
 
 	get_mems_allowed();
 	delayacct_freepages_start();
@@ -2252,9 +2253,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		sc->nr_scanned = 0;
 		if (!priority)
 			disable_swap_token(sc->mem_cgroup);
-		should_abort_reclaim = shrink_zones(priority, zonelist, sc);
-		if (should_abort_reclaim)
-			break;
+		aborted_reclaim = shrink_zones(priority, zonelist, sc);
 
 		/*
 		 * Don't shrink slabs when reclaiming memory from
@@ -2320,8 +2319,8 @@ out:
 	if (oom_killer_disabled)
 		return 0;
 
-	/* Aborting reclaim to try compaction? don't OOM, then */
-	if (should_abort_reclaim)
+	/* Aborted reclaim to try compaction? don't OOM, then */
+	if (aborted_reclaim)
 		return 1;
 
 	/* top priority shrink_zones still had more to do? don't OOM, then */
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 26/34] vmscan: promote shared file mapped pages
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Konstantin Khlebnikov <khlebnikov@openvz.org>

commit 34dbc67a644f11ab3475d822d72e25409911e760 upstream.

Stable note: Not tracked in Bugzilla. There were reports of shared
	mapped pages being unfairly reclaimed in comparison to older kernels.
	This is being addressed over time. The specific workload being
	addressed here in described in paragraph four and while paragraph
	five says it did not help performance as such, it made a difference
	to major page faults. I'm aware of at least one bug for a large
	vendor that was due to increased major faults.

Commit 645747462435 ("vmscan: detect mapped file pages used only once")
greatly decreases lifetime of single-used mapped file pages.
Unfortunately it also decreases life time of all shared mapped file
pages.  Because after commit bf3f3bc5e7347 ("mm: don't mark_page_accessed
in fault path") page-fault handler does not mark page active or even
referenced.

Thus page_check_references() activates file page only if it was used twice
while it stays in inactive list, meanwhile it activates anon pages after
first access.  Inactive list can be small enough, this way reclaimer can
accidentally throw away any widely used page if it wasn't used twice in
short period.

After this patch page_check_references() also activate file mapped page at
first inactive list scan if this page is already used multiple times via
several ptes.

I found this while trying to fix degragation in rhel6 (~2.6.32) from rhel5
(~2.6.18).  There a complete mess with >100 web/mail/spam/ftp containers,
they share all their files but there a lot of anonymous pages: ~500mb
shared file mapped memory and 15-20Gb non-shared anonymous memory.  In
this situation major-pagefaults are very costly, because all containers
share the same page.  In my load kernel created a disproportionate
pressure on the file memory, compared with the anonymous, they equaled
only if I raise swappiness up to 150 =)

These patches actually wasn't helped a lot in my problem, but I saw
noticable (10-20 times) reduce in count and average time of
major-pagefault in file-mapped areas.

Actually both patches are fixes for commit v2.6.33-5448-g6457474, because
it was aimed at one scenario (singly used pages), but it breaks the logic
in other scenarios (shared and/or executable pages)

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bc31f32..7edaaac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -723,7 +723,7 @@ static enum page_references page_check_references(struct page *page,
 		 */
 		SetPageReferenced(page);
 
-		if (referenced_page)
+		if (referenced_page || referenced_ptes > 1)
 			return PAGEREF_ACTIVATE;
 
 		return PAGEREF_KEEP;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 26/34] vmscan: promote shared file mapped pages
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Konstantin Khlebnikov <khlebnikov@openvz.org>

commit 34dbc67a644f11ab3475d822d72e25409911e760 upstream.

Stable note: Not tracked in Bugzilla. There were reports of shared
	mapped pages being unfairly reclaimed in comparison to older kernels.
	This is being addressed over time. The specific workload being
	addressed here in described in paragraph four and while paragraph
	five says it did not help performance as such, it made a difference
	to major page faults. I'm aware of at least one bug for a large
	vendor that was due to increased major faults.

Commit 645747462435 ("vmscan: detect mapped file pages used only once")
greatly decreases lifetime of single-used mapped file pages.
Unfortunately it also decreases life time of all shared mapped file
pages.  Because after commit bf3f3bc5e7347 ("mm: don't mark_page_accessed
in fault path") page-fault handler does not mark page active or even
referenced.

Thus page_check_references() activates file page only if it was used twice
while it stays in inactive list, meanwhile it activates anon pages after
first access.  Inactive list can be small enough, this way reclaimer can
accidentally throw away any widely used page if it wasn't used twice in
short period.

After this patch page_check_references() also activate file mapped page at
first inactive list scan if this page is already used multiple times via
several ptes.

I found this while trying to fix degragation in rhel6 (~2.6.32) from rhel5
(~2.6.18).  There a complete mess with >100 web/mail/spam/ftp containers,
they share all their files but there a lot of anonymous pages: ~500mb
shared file mapped memory and 15-20Gb non-shared anonymous memory.  In
this situation major-pagefaults are very costly, because all containers
share the same page.  In my load kernel created a disproportionate
pressure on the file memory, compared with the anonymous, they equaled
only if I raise swappiness up to 150 =)

These patches actually wasn't helped a lot in my problem, but I saw
noticable (10-20 times) reduce in count and average time of
major-pagefault in file-mapped areas.

Actually both patches are fixes for commit v2.6.33-5448-g6457474, because
it was aimed at one scenario (singly used pages), but it breaks the logic
in other scenarios (shared and/or executable pages)

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bc31f32..7edaaac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -723,7 +723,7 @@ static enum page_references page_check_references(struct page *page,
 		 */
 		SetPageReferenced(page);
 
-		if (referenced_page)
+		if (referenced_page || referenced_ptes > 1)
 			return PAGEREF_ACTIVATE;
 
 		return PAGEREF_KEEP;
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 27/34] vmscan: activate executable pages after first usage
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Konstantin Khlebnikov <khlebnikov@openvz.org>

commit c909e99364c8b6ca07864d752950b6b4ecf6bef4 upstream.

Stable note: Not tracked in Bugzilla. There were reports of shared
	mapped pages being unfairly reclaimed in comparison to older kernels.
	This is being addressed over time.

Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages
the first class citizen") was noticeably weakened in commit
645747462435d84 ("vmscan: detect mapped file pages used only once").

Currently these pages can become "first class citizens" only after second
usage.  After this patch page_check_references() will activate they after
first usage, and executable code gets yet better chance to stay in memory.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7edaaac..8b98a75 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -726,6 +726,12 @@ static enum page_references page_check_references(struct page *page,
 		if (referenced_page || referenced_ptes > 1)
 			return PAGEREF_ACTIVATE;
 
+		/*
+		 * Activate file-backed executable pages after first usage.
+		 */
+		if (vm_flags & VM_EXEC)
+			return PAGEREF_ACTIVATE;
+
 		return PAGEREF_KEEP;
 	}
 
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 27/34] vmscan: activate executable pages after first usage
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Konstantin Khlebnikov <khlebnikov@openvz.org>

commit c909e99364c8b6ca07864d752950b6b4ecf6bef4 upstream.

Stable note: Not tracked in Bugzilla. There were reports of shared
	mapped pages being unfairly reclaimed in comparison to older kernels.
	This is being addressed over time.

Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages
the first class citizen") was noticeably weakened in commit
645747462435d84 ("vmscan: detect mapped file pages used only once").

Currently these pages can become "first class citizens" only after second
usage.  After this patch page_check_references() will activate they after
first usage, and executable code gets yet better chance to stay in memory.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7edaaac..8b98a75 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -726,6 +726,12 @@ static enum page_references page_check_references(struct page *page,
 		if (referenced_page || referenced_ptes > 1)
 			return PAGEREF_ACTIVATE;
 
+		/*
+		 * Activate file-backed executable pages after first usage.
+		 */
+		if (vm_flags & VM_EXEC)
+			return PAGEREF_ACTIVATE;
+
 		return PAGEREF_KEEP;
 	}
 
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 28/34] mm/vmscan.c: consider swap space when deciding whether to continue reclaim
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Minchan Kim <minchan@kernel.org>

commit 86cfd3a45042ab242d47f3935a02811a402beab6 upstream.

Stable note: Not tracked in Bugzilla. This patch reduces kswapd CPU
	usage on swapless systems with high anonymous memory usage.

It's pointless to continue reclaiming when we have no swap space and lots
of anon pages in the inactive list.

Without this patch, it is possible when swap is disabled to continue
trying to reclaim when there are only anonymous pages in the system even
though that will not make any progress.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b98a75..da195c2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2008,8 +2008,9 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-	inactive_lru_pages = zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON) +
-				zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
+	inactive_lru_pages = zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
+	if (nr_swap_pages > 0)
+		inactive_lru_pages += zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 28/34] mm/vmscan.c: consider swap space when deciding whether to continue reclaim
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Minchan Kim <minchan@kernel.org>

commit 86cfd3a45042ab242d47f3935a02811a402beab6 upstream.

Stable note: Not tracked in Bugzilla. This patch reduces kswapd CPU
	usage on swapless systems with high anonymous memory usage.

It's pointless to continue reclaiming when we have no swap space and lots
of anon pages in the inactive list.

Without this patch, it is possible when swap is disabled to continue
trying to reclaim when there are only anonymous pages in the system even
though that will not make any progress.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b98a75..da195c2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2008,8 +2008,9 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-	inactive_lru_pages = zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON) +
-				zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
+	inactive_lru_pages = zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
+	if (nr_swap_pages > 0)
+		inactive_lru_pages += zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 29/34] mm: test PageSwapBacked in lumpy reclaim
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Hugh Dickins <hughd@google.com>

commit 043bcbe5ec51e0478ef2b44acef17193e01d7f70 upstream.

Stable note: Not tracked in Bugzilla. There were reports of shared
	mapped pages being unfairly reclaimed in comparison to older kernels.
	This is being addressed over time. Even though the subject
	refers to lumpy reclaim, it impacts compaction as well.

Lumpy reclaim does well to stop at a PageAnon when there's no swap, but
better is to stop at any PageSwapBacked, which includes shmem/tmpfs too.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index da195c2..e5382ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1199,7 +1199,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 			 * anon page which don't already have a swap slot is
 			 * pointless.
 			 */
-			if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
+			if (nr_swap_pages <= 0 && PageSwapBacked(cursor_page) &&
 			    !PageSwapCache(cursor_page))
 				break;
 
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 29/34] mm: test PageSwapBacked in lumpy reclaim
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Hugh Dickins <hughd@google.com>

commit 043bcbe5ec51e0478ef2b44acef17193e01d7f70 upstream.

Stable note: Not tracked in Bugzilla. There were reports of shared
	mapped pages being unfairly reclaimed in comparison to older kernels.
	This is being addressed over time. Even though the subject
	refers to lumpy reclaim, it impacts compaction as well.

Lumpy reclaim does well to stop at a PageAnon when there's no swap, but
better is to stop at any PageSwapBacked, which includes shmem/tmpfs too.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index da195c2..e5382ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1199,7 +1199,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 			 * anon page which don't already have a swap slot is
 			 * pointless.
 			 */
-			if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
+			if (nr_swap_pages <= 0 && PageSwapBacked(cursor_page) &&
 			    !PageSwapCache(cursor_page))
 				break;
 
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 30/34] mm: vmscan: Do not force kswapd to scan small targets
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit ad2b8e601099a23dffffb53f91c18d874fe98854 upstream - WARNING: this is a substitute patch.

Stable note: Not tracked in Bugzilla. This is a substitute for an
	upstream commit addressing a completely different issue that
	accidentally contained an important fix. The workload this patch
	helps was memcached when IO is started in the background. memcached
	should stay resident but without this patch it gets swapped more
	than it should. Sometimes this manifests as a drop in throughput
	but mostly it was observed through /proc/vmstat.

Commit [246e87a9: memcg: fix get_scan_count() for small targets] was
meant to fix a problem whereby small scan targets on memcg were ignored
causing priority to raise too sharply. It forced scanning to take place
if the target was small, memcg or kswapd.

>From the time it was introduced it cause excessive reclaim by kswapd
with workloads being pushed to swap that previously would have stayed
resident. This was accidentally fixed by commit [ad2b8e60: mm: memcg:
remove optimization of keeping the root_mem_cgroup LRU lists empty] but
that patchset is not suitable for backporting.

The original patch came with no information on what workloads it benefits
but the cost of it is obvious in that it forces scanning to take place
on lists that would otherwise have been ignored such as small anonymous
inactive lists. This patch partially reverts 246e87a9 so that small lists
are not force scanned which means that IO-intensive workloads with small
amounts of anonymous memory will not be swapped.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e5382ad..49d8547 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1849,9 +1849,6 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 	bool force_scan = false;
 	unsigned long nr_force_scan[2];
 
-	/* kswapd does zone balancing and needs to scan this zone */
-	if (scanning_global_lru(sc) && current_is_kswapd())
-		force_scan = true;
 	/* memcg may have small limit and need to avoid priority drop */
 	if (!scanning_global_lru(sc))
 		force_scan = true;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 30/34] mm: vmscan: Do not force kswapd to scan small targets
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit ad2b8e601099a23dffffb53f91c18d874fe98854 upstream - WARNING: this is a substitute patch.

Stable note: Not tracked in Bugzilla. This is a substitute for an
	upstream commit addressing a completely different issue that
	accidentally contained an important fix. The workload this patch
	helps was memcached when IO is started in the background. memcached
	should stay resident but without this patch it gets swapped more
	than it should. Sometimes this manifests as a drop in throughput
	but mostly it was observed through /proc/vmstat.

Commit [246e87a9: memcg: fix get_scan_count() for small targets] was
meant to fix a problem whereby small scan targets on memcg were ignored
causing priority to raise too sharply. It forced scanning to take place
if the target was small, memcg or kswapd.

>From the time it was introduced it cause excessive reclaim by kswapd
with workloads being pushed to swap that previously would have stayed
resident. This was accidentally fixed by commit [ad2b8e60: mm: memcg:
remove optimization of keeping the root_mem_cgroup LRU lists empty] but
that patchset is not suitable for backporting.

The original patch came with no information on what workloads it benefits
but the cost of it is obvious in that it forces scanning to take place
on lists that would otherwise have been ignored such as small anonymous
inactive lists. This patch partially reverts 246e87a9 so that small lists
are not force scanned which means that IO-intensive workloads with small
amounts of anonymous memory will not be swapped.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e5382ad..49d8547 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1849,9 +1849,6 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 	bool force_scan = false;
 	unsigned long nr_force_scan[2];
 
-	/* kswapd does zone balancing and needs to scan this zone */
-	if (scanning_global_lru(sc) && current_is_kswapd())
-		force_scan = true;
 	/* memcg may have small limit and need to avoid priority drop */
 	if (!scanning_global_lru(sc))
 		force_scan = true;
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 31/34] cpusets: avoid looping when storing to mems_allowed if one node remains set
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: David Rientjes <rientjes@google.com>

commit 89e8a244b97e48f1f30e898b6f32acca477f2a13 upstream.

Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is
	extremely expensive and severely impacted page allocator performance.
	This is part of a series of patches that reduce page allocator
	overhead.

{get,put}_mems_allowed() exist so that general kernel code may locklessly
access a task's set of allowable nodes without having the chance that a
concurrent write will cause the nodemask to be empty on configurations
where MAX_NUMNODES > BITS_PER_LONG.

This could incur a significant delay, however, especially in low memory
conditions because the page allocator is blocking and reclaim requires
get_mems_allowed() itself.  It is not atypical to see writes to
cpuset.mems take over 2 seconds to complete, for example.  In low memory
conditions, this is problematic because it's one of the most imporant
times to change cpuset.mems in the first place!

The only way a task's set of allowable nodes may change is through cpusets
by writing to cpuset.mems and when attaching a task to a generic code is
not reading the nodemask with get_mems_allowed() at the same time, and
then clearing all the old nodes.  This prevents the possibility that a
reader will see an empty nodemask at the same time the writer is storing a
new nodemask.

If at least one node remains unchanged, though, it's possible to simply
set all new nodes and then clear all the old nodes.  Changing a task's
nodemask is protected by cgroup_mutex so it's guaranteed that two threads
are not changing the same task's nodemask at the same time, so the
nodemask is guaranteed to be stored before another thread changes it and
determines whether a node remains set or not.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Paul Menage <paul@paulmenage.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/cpuset.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 9c9b754..a995893 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -949,6 +949,8 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
 static void cpuset_change_task_nodemask(struct task_struct *tsk,
 					nodemask_t *newmems)
 {
+	bool masks_disjoint = !nodes_intersects(*newmems, tsk->mems_allowed);
+
 repeat:
 	/*
 	 * Allow tasks that have access to memory reserves because they have
@@ -963,7 +965,6 @@ repeat:
 	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
 	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
 
-
 	/*
 	 * ensure checking ->mems_allowed_change_disable after setting all new
 	 * allowed nodes.
@@ -980,9 +981,11 @@ repeat:
 
 	/*
 	 * Allocation of memory is very fast, we needn't sleep when waiting
-	 * for the read-side.
+	 * for the read-side.  No wait is necessary, however, if at least one
+	 * node remains unchanged.
 	 */
-	while (ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
+	while (masks_disjoint &&
+			ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
 		task_unlock(tsk);
 		if (!task_curr(tsk))
 			yield();
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 31/34] cpusets: avoid looping when storing to mems_allowed if one node remains set
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: David Rientjes <rientjes@google.com>

commit 89e8a244b97e48f1f30e898b6f32acca477f2a13 upstream.

Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is
	extremely expensive and severely impacted page allocator performance.
	This is part of a series of patches that reduce page allocator
	overhead.

{get,put}_mems_allowed() exist so that general kernel code may locklessly
access a task's set of allowable nodes without having the chance that a
concurrent write will cause the nodemask to be empty on configurations
where MAX_NUMNODES > BITS_PER_LONG.

This could incur a significant delay, however, especially in low memory
conditions because the page allocator is blocking and reclaim requires
get_mems_allowed() itself.  It is not atypical to see writes to
cpuset.mems take over 2 seconds to complete, for example.  In low memory
conditions, this is problematic because it's one of the most imporant
times to change cpuset.mems in the first place!

The only way a task's set of allowable nodes may change is through cpusets
by writing to cpuset.mems and when attaching a task to a generic code is
not reading the nodemask with get_mems_allowed() at the same time, and
then clearing all the old nodes.  This prevents the possibility that a
reader will see an empty nodemask at the same time the writer is storing a
new nodemask.

If at least one node remains unchanged, though, it's possible to simply
set all new nodes and then clear all the old nodes.  Changing a task's
nodemask is protected by cgroup_mutex so it's guaranteed that two threads
are not changing the same task's nodemask at the same time, so the
nodemask is guaranteed to be stored before another thread changes it and
determines whether a node remains set or not.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Paul Menage <paul@paulmenage.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/cpuset.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 9c9b754..a995893 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -949,6 +949,8 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
 static void cpuset_change_task_nodemask(struct task_struct *tsk,
 					nodemask_t *newmems)
 {
+	bool masks_disjoint = !nodes_intersects(*newmems, tsk->mems_allowed);
+
 repeat:
 	/*
 	 * Allow tasks that have access to memory reserves because they have
@@ -963,7 +965,6 @@ repeat:
 	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
 	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
 
-
 	/*
 	 * ensure checking ->mems_allowed_change_disable after setting all new
 	 * allowed nodes.
@@ -980,9 +981,11 @@ repeat:
 
 	/*
 	 * Allocation of memory is very fast, we needn't sleep when waiting
-	 * for the read-side.
+	 * for the read-side.  No wait is necessary, however, if at least one
+	 * node remains unchanged.
 	 */
-	while (ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
+	while (masks_disjoint &&
+			ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
 		task_unlock(tsk);
 		if (!task_curr(tsk))
 			yield();
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 32/34] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: David Rientjes <rientjes@google.com>

commit b246272ecc5ac68c743b15c9e41a2275f7ce70e2 upstream.

Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
	expensive and severely impacted page allocator performance. This is
	part of a series of patches that reduce page allocator overhead.

Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
new set of allowed cpuset nodes where the two nodemasks, as a result of
the remap, are now disjoint.

c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
nodes from changing for a thread.  This causes any update to a set of
allowed nodes to stall until put_mems_allowed() is called.

This stall is unncessary, however, if at least one node remains unchanged
in the update to the set of allowed nodes.  This was addressed by
89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
node remains set"), but it's still possible that an empty nodemask may be
read from a mempolicy because the old nodemask may be remapped to the new
nodemask during rebind.  To prevent this, only avoid the stall if there is
no mempolicy for the thread being changed.

This is a temporary solution until all reads from mempolicy nodemasks can
be guaranteed to not be empty without the get_mems_allowed()
synchronization.

Also moves the check for nodemask intersection inside task_lock() so that
tsk->mems_allowed cannot change.  This ensures that nothing can set this
tsk's mems_allowed out from under us and also protects tsk->mempolicy.

Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/cpuset.c |   29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index a995893..28d0bbd 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -123,6 +123,19 @@ static inline struct cpuset *task_cs(struct task_struct *task)
 			    struct cpuset, css);
 }
 
+#ifdef CONFIG_NUMA
+static inline bool task_has_mempolicy(struct task_struct *task)
+{
+	return task->mempolicy;
+}
+#else
+static inline bool task_has_mempolicy(struct task_struct *task)
+{
+	return false;
+}
+#endif
+
+
 /* bits in struct cpuset flags field */
 typedef enum {
 	CS_CPU_EXCLUSIVE,
@@ -949,7 +962,7 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
 static void cpuset_change_task_nodemask(struct task_struct *tsk,
 					nodemask_t *newmems)
 {
-	bool masks_disjoint = !nodes_intersects(*newmems, tsk->mems_allowed);
+	bool need_loop;
 
 repeat:
 	/*
@@ -962,6 +975,14 @@ repeat:
 		return;
 
 	task_lock(tsk);
+	/*
+	 * Determine if a loop is necessary if another thread is doing
+	 * get_mems_allowed().  If at least one node remains unchanged and
+	 * tsk does not have a mempolicy, then an empty nodemask will not be
+	 * possible when mems_allowed is larger than a word.
+	 */
+	need_loop = task_has_mempolicy(tsk) ||
+			!nodes_intersects(*newmems, tsk->mems_allowed);
 	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
 	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
 
@@ -981,11 +1002,9 @@ repeat:
 
 	/*
 	 * Allocation of memory is very fast, we needn't sleep when waiting
-	 * for the read-side.  No wait is necessary, however, if at least one
-	 * node remains unchanged.
+	 * for the read-side.
 	 */
-	while (masks_disjoint &&
-			ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
+	while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
 		task_unlock(tsk);
 		if (!task_curr(tsk))
 			yield();
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 32/34] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: David Rientjes <rientjes@google.com>

commit b246272ecc5ac68c743b15c9e41a2275f7ce70e2 upstream.

Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
	expensive and severely impacted page allocator performance. This is
	part of a series of patches that reduce page allocator overhead.

Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
new set of allowed cpuset nodes where the two nodemasks, as a result of
the remap, are now disjoint.

c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
nodes from changing for a thread.  This causes any update to a set of
allowed nodes to stall until put_mems_allowed() is called.

This stall is unncessary, however, if at least one node remains unchanged
in the update to the set of allowed nodes.  This was addressed by
89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
node remains set"), but it's still possible that an empty nodemask may be
read from a mempolicy because the old nodemask may be remapped to the new
nodemask during rebind.  To prevent this, only avoid the stall if there is
no mempolicy for the thread being changed.

This is a temporary solution until all reads from mempolicy nodemasks can
be guaranteed to not be empty without the get_mems_allowed()
synchronization.

Also moves the check for nodemask intersection inside task_lock() so that
tsk->mems_allowed cannot change.  This ensures that nothing can set this
tsk's mems_allowed out from under us and also protects tsk->mempolicy.

Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/cpuset.c |   29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index a995893..28d0bbd 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -123,6 +123,19 @@ static inline struct cpuset *task_cs(struct task_struct *task)
 			    struct cpuset, css);
 }
 
+#ifdef CONFIG_NUMA
+static inline bool task_has_mempolicy(struct task_struct *task)
+{
+	return task->mempolicy;
+}
+#else
+static inline bool task_has_mempolicy(struct task_struct *task)
+{
+	return false;
+}
+#endif
+
+
 /* bits in struct cpuset flags field */
 typedef enum {
 	CS_CPU_EXCLUSIVE,
@@ -949,7 +962,7 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
 static void cpuset_change_task_nodemask(struct task_struct *tsk,
 					nodemask_t *newmems)
 {
-	bool masks_disjoint = !nodes_intersects(*newmems, tsk->mems_allowed);
+	bool need_loop;
 
 repeat:
 	/*
@@ -962,6 +975,14 @@ repeat:
 		return;
 
 	task_lock(tsk);
+	/*
+	 * Determine if a loop is necessary if another thread is doing
+	 * get_mems_allowed().  If at least one node remains unchanged and
+	 * tsk does not have a mempolicy, then an empty nodemask will not be
+	 * possible when mems_allowed is larger than a word.
+	 */
+	need_loop = task_has_mempolicy(tsk) ||
+			!nodes_intersects(*newmems, tsk->mems_allowed);
 	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
 	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
 
@@ -981,11 +1002,9 @@ repeat:
 
 	/*
 	 * Allocation of memory is very fast, we needn't sleep when waiting
-	 * for the read-side.  No wait is necessary, however, if at least one
-	 * node remains unchanged.
+	 * for the read-side.
 	 */
-	while (masks_disjoint &&
-			ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
+	while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
 		task_unlock(tsk);
 		if (!task_curr(tsk))
 			yield();
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 33/34] cpuset: mm: Reduce large amounts of memory barrier related damage v3
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit cc9a6c8776615f9c194ccf0b63a0aa5628235545 upstream.

Stable note:  Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
	expensive and severely impacted page allocator performance. This
	is part of a series of patches that reduce page allocator overhead.

Changelog since V2
  o Documentation						(akpm)
  o Do not retry hugetlb allocations in the event of an error	(akpm)

Changelog since V1
  o Use seqcount with rmb instead of atomics (Peter, Christoph)

Commit [c0ff7453: cpuset,mm: fix no node to alloc memory when changing
cpuset's mems] wins a super prize for the largest number of memory
barriers entered into fast paths for one commit. [get|put]_mems_allowed
is incredibly heavy with pairs of full memory barriers inserted into a
number of hot paths. This was detected while investigating at large page
allocator slowdown introduced some time after 2.6.32. The largest portion
of this overhead was shown by oprofile to be at an mfence introduced by
this commit into the page allocator hot path.

For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.

This patch replaces the full memory barriers on both read and write sides
with a sequence counter with just read barriers on the fast path side.
This is much cheaper on some architectures, including x86.  The main bulk
of the patch is the retry logic if the nodemask changes in a manner that
can cause a false failure.

While updating the nodemask, a check is made to see if a false failure is
a risk. If it is, the sequence number gets bumped and parallel allocators
will briefly stall while the nodemask update takes place.

In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual
results were

                         3.3.0-rc3          3.3.0-rc3
                         rc3-vanilla        nobarrier-v2r1
Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds)             135.68    132.17
User+Sys Time Running Test (seconds)         164.2    160.13
Total Elapsed Time (seconds)                123.46    120.87

The overall improvement is small but the System CPU time is much improved
and roughly in correlation to what oprofile reported (these performance
figures are without profiling so skew is expected). The actual number of
page faults is noticeably improved.

For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.

To test the actual bug the commit fixed I opened two terminals. The first
ran within a cpuset and continually ran a small program that faulted 100M
of anonymous data. In a second window, the nodemask of the cpuset was
continually randomised in a loop. Without the commit, the program would
fail every so often (usually within 10 seconds) and obviously with the
commit everything worked fine. With this patch applied, it also worked
fine so the fix should be functionally equivalent.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/cpuset.h    |   45 ++++++++++++++++++---------------------------
 include/linux/init_task.h |    8 ++++++++
 include/linux/sched.h     |    2 +-
 kernel/cpuset.c           |   43 ++++++++-----------------------------------
 kernel/fork.c             |    3 +++
 mm/filemap.c              |   11 +++++++----
 mm/hugetlb.c              |   15 +++++++++++----
 mm/mempolicy.c            |   28 +++++++++++++++++++++-------
 mm/page_alloc.c           |   33 +++++++++++++++++++++++----------
 mm/slab.c                 |   13 ++++++++-----
 mm/slub.c                 |   39 +++++++++++++++++++++++++--------------
 mm/vmscan.c               |    2 --
 12 files changed, 133 insertions(+), 109 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index e9eaec5..8f15695 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -89,36 +89,25 @@ extern void rebuild_sched_domains(void);
 extern void cpuset_print_task_mems_allowed(struct task_struct *p);
 
 /*
- * reading current mems_allowed and mempolicy in the fastpath must protected
- * by get_mems_allowed()
+ * get_mems_allowed is required when making decisions involving mems_allowed
+ * such as during page allocation. mems_allowed can be updated in parallel
+ * and depending on the new value an operation can fail potentially causing
+ * process failure. A retry loop with get_mems_allowed and put_mems_allowed
+ * prevents these artificial failures.
  */
-static inline void get_mems_allowed(void)
+static inline unsigned int get_mems_allowed(void)
 {
-	current->mems_allowed_change_disable++;
-
-	/*
-	 * ensure that reading mems_allowed and mempolicy happens after the
-	 * update of ->mems_allowed_change_disable.
-	 *
-	 * the write-side task finds ->mems_allowed_change_disable is not 0,
-	 * and knows the read-side task is reading mems_allowed or mempolicy,
-	 * so it will clear old bits lazily.
-	 */
-	smp_mb();
+	return read_seqcount_begin(&current->mems_allowed_seq);
 }
 
-static inline void put_mems_allowed(void)
+/*
+ * If this returns false, the operation that took place after get_mems_allowed
+ * may have failed. It is up to the caller to retry the operation if
+ * appropriate.
+ */
+static inline bool put_mems_allowed(unsigned int seq)
 {
-	/*
-	 * ensure that reading mems_allowed and mempolicy before reducing
-	 * mems_allowed_change_disable.
-	 *
-	 * the write-side task will know that the read-side task is still
-	 * reading mems_allowed or mempolicy, don't clears old bits in the
-	 * nodemask.
-	 */
-	smp_mb();
-	--ACCESS_ONCE(current->mems_allowed_change_disable);
+	return !read_seqcount_retry(&current->mems_allowed_seq, seq);
 }
 
 static inline void set_mems_allowed(nodemask_t nodemask)
@@ -234,12 +223,14 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 {
 }
 
-static inline void get_mems_allowed(void)
+static inline unsigned int get_mems_allowed(void)
 {
+	return 0;
 }
 
-static inline void put_mems_allowed(void)
+static inline bool put_mems_allowed(unsigned int seq)
 {
+	return true;
 }
 
 #endif /* !CONFIG_CPUSETS */
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 580f70c..5e41a8e 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -30,6 +30,13 @@ extern struct fs_struct init_fs;
 #define INIT_THREADGROUP_FORK_LOCK(sig)
 #endif
 
+#ifdef CONFIG_CPUSETS
+#define INIT_CPUSET_SEQ							\
+	.mems_allowed_seq = SEQCNT_ZERO,
+#else
+#define INIT_CPUSET_SEQ
+#endif
+
 #define INIT_SIGNALS(sig) {						\
 	.nr_threads	= 1,						\
 	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ -193,6 +200,7 @@ extern struct cred init_cred;
 	INIT_FTRACE_GRAPH						\
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
+	INIT_CPUSET_SEQ							\
 }
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4ef452b..443ec43 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1484,7 +1484,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_CPUSETS
 	nodemask_t mems_allowed;	/* Protected by alloc_lock */
-	int mems_allowed_change_disable;
+	seqcount_t mems_allowed_seq;	/* Seqence no to catch updates */
 	int cpuset_mem_spread_rotor;
 	int cpuset_slab_spread_rotor;
 #endif
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 28d0bbd..b2e84bd 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -964,7 +964,6 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
 {
 	bool need_loop;
 
-repeat:
 	/*
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
@@ -983,45 +982,19 @@ repeat:
 	 */
 	need_loop = task_has_mempolicy(tsk) ||
 			!nodes_intersects(*newmems, tsk->mems_allowed);
-	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
-	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
 
-	/*
-	 * ensure checking ->mems_allowed_change_disable after setting all new
-	 * allowed nodes.
-	 *
-	 * the read-side task can see an nodemask with new allowed nodes and
-	 * old allowed nodes. and if it allocates page when cpuset clears newly
-	 * disallowed ones continuous, it can see the new allowed bits.
-	 *
-	 * And if setting all new allowed nodes is after the checking, setting
-	 * all new allowed nodes and clearing newly disallowed ones will be done
-	 * continuous, and the read-side task may find no node to alloc page.
-	 */
-	smp_mb();
+	if (need_loop)
+		write_seqcount_begin(&tsk->mems_allowed_seq);
 
-	/*
-	 * Allocation of memory is very fast, we needn't sleep when waiting
-	 * for the read-side.
-	 */
-	while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
-		task_unlock(tsk);
-		if (!task_curr(tsk))
-			yield();
-		goto repeat;
-	}
-
-	/*
-	 * ensure checking ->mems_allowed_change_disable before clearing all new
-	 * disallowed nodes.
-	 *
-	 * if clearing newly disallowed bits before the checking, the read-side
-	 * task may find no node to alloc page.
-	 */
-	smp_mb();
+	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
+	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
 
 	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);
 	tsk->mems_allowed = *newmems;
+
+	if (need_loop)
+		write_seqcount_end(&tsk->mems_allowed_seq);
+
 	task_unlock(tsk);
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 4712e3e..3d42aa3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -985,6 +985,9 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 #ifdef CONFIG_CGROUPS
 	init_rwsem(&sig->threadgroup_fork_lock);
 #endif
+#ifdef CONFIG_CPUSETS
+	seqcount_init(&tsk->mems_allowed_seq);
+#endif
 
 	sig->oom_adj = current->signal->oom_adj;
 	sig->oom_score_adj = current->signal->oom_score_adj;
diff --git a/mm/filemap.c b/mm/filemap.c
index b7d8603..10481eb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -516,10 +516,13 @@ struct page *__page_cache_alloc(gfp_t gfp)
 	struct page *page;
 
 	if (cpuset_do_page_mem_spread()) {
-		get_mems_allowed();
-		n = cpuset_mem_spread_node();
-		page = alloc_pages_exact_node(n, gfp, 0);
-		put_mems_allowed();
+		unsigned int cpuset_mems_cookie;
+		do {
+			cpuset_mems_cookie = get_mems_allowed();
+			n = cpuset_mem_spread_node();
+			page = alloc_pages_exact_node(n, gfp, 0);
+		} while (!put_mems_allowed(cpuset_mems_cookie) && !page);
+
 		return page;
 	}
 	return alloc_pages(gfp, 0);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 05f8fd4..64f2b7a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -454,14 +454,16 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve)
 {
-	struct page *page = NULL;
+	struct page *page;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
 	struct zonelist *zonelist;
 	struct zone *zone;
 	struct zoneref *z;
+	unsigned int cpuset_mems_cookie;
 
-	get_mems_allowed();
+retry_cpuset:
+	cpuset_mems_cookie = get_mems_allowed();
 	zonelist = huge_zonelist(vma, address,
 					htlb_alloc_mask, &mpol, &nodemask);
 	/*
@@ -488,10 +490,15 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 			}
 		}
 	}
-err:
+
 	mpol_cond_put(mpol);
-	put_mems_allowed();
+	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		goto retry_cpuset;
 	return page;
+
+err:
+	mpol_cond_put(mpol);
+	return NULL;
 }
 
 static void update_and_free_page(struct hstate *h, struct page *page)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index dd5f874..cff919f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1810,18 +1810,24 @@ struct page *
 alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		unsigned long addr, int node)
 {
-	struct mempolicy *pol = get_vma_policy(current, vma, addr);
+	struct mempolicy *pol;
 	struct zonelist *zl;
 	struct page *page;
+	unsigned int cpuset_mems_cookie;
+
+retry_cpuset:
+	pol = get_vma_policy(current, vma, addr);
+	cpuset_mems_cookie = get_mems_allowed();
 
-	get_mems_allowed();
 	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
 		unsigned nid;
 
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
 		mpol_cond_put(pol);
 		page = alloc_page_interleave(gfp, order, nid);
-		put_mems_allowed();
+		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+			goto retry_cpuset;
+
 		return page;
 	}
 	zl = policy_zonelist(gfp, pol, node);
@@ -1832,7 +1838,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		struct page *page =  __alloc_pages_nodemask(gfp, order,
 						zl, policy_nodemask(gfp, pol));
 		__mpol_put(pol);
-		put_mems_allowed();
+		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+			goto retry_cpuset;
 		return page;
 	}
 	/*
@@ -1840,7 +1847,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	 */
 	page = __alloc_pages_nodemask(gfp, order, zl,
 				      policy_nodemask(gfp, pol));
-	put_mems_allowed();
+	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		goto retry_cpuset;
 	return page;
 }
 
@@ -1867,11 +1875,14 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
 	struct mempolicy *pol = current->mempolicy;
 	struct page *page;
+	unsigned int cpuset_mems_cookie;
 
 	if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
 		pol = &default_policy;
 
-	get_mems_allowed();
+retry_cpuset:
+	cpuset_mems_cookie = get_mems_allowed();
+
 	/*
 	 * No reference counting needed for current->mempolicy
 	 * nor system default_policy
@@ -1882,7 +1893,10 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 		page = __alloc_pages_nodemask(gfp, order,
 				policy_zonelist(gfp, pol, numa_node_id()),
 				policy_nodemask(gfp, pol));
-	put_mems_allowed();
+
+	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		goto retry_cpuset;
+
 	return page;
 }
 EXPORT_SYMBOL(alloc_pages_current);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 257acae..a1744f5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2293,8 +2293,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct zone *preferred_zone;
-	struct page *page;
+	struct page *page = NULL;
 	int migratetype = allocflags_to_migratetype(gfp_mask);
+	unsigned int cpuset_mems_cookie;
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2313,15 +2314,15 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(!zonelist->_zonerefs->zone))
 		return NULL;
 
-	get_mems_allowed();
+retry_cpuset:
+	cpuset_mems_cookie = get_mems_allowed();
+
 	/* The preferred zone is used for statistics later */
 	first_zones_zonelist(zonelist, high_zoneidx,
 				nodemask ? : &cpuset_current_mems_allowed,
 				&preferred_zone);
-	if (!preferred_zone) {
-		put_mems_allowed();
-		return NULL;
-	}
+	if (!preferred_zone)
+		goto out;
 
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
@@ -2331,9 +2332,19 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
-	put_mems_allowed();
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
+
+out:
+	/*
+	 * When updating a task's mems_allowed, it is possible to race with
+	 * parallel threads in such a way that an allocation can fail while
+	 * the mask is being updated. If a page allocation is about to fail,
+	 * check if the cpuset changed during allocation and if so, retry.
+	 */
+	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		goto retry_cpuset;
+
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -2557,13 +2568,15 @@ void si_meminfo_node(struct sysinfo *val, int nid)
 bool skip_free_areas_node(unsigned int flags, int nid)
 {
 	bool ret = false;
+	unsigned int cpuset_mems_cookie;
 
 	if (!(flags & SHOW_MEM_FILTER_NODES))
 		goto out;
 
-	get_mems_allowed();
-	ret = !node_isset(nid, cpuset_current_mems_allowed);
-	put_mems_allowed();
+	do {
+		cpuset_mems_cookie = get_mems_allowed();
+		ret = !node_isset(nid, cpuset_current_mems_allowed);
+	} while (!put_mems_allowed(cpuset_mems_cookie));
 out:
 	return ret;
 }
diff --git a/mm/slab.c b/mm/slab.c
index d96e223..a67f812 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3218,12 +3218,10 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
 	if (in_interrupt() || (flags & __GFP_THISNODE))
 		return NULL;
 	nid_alloc = nid_here = numa_mem_id();
-	get_mems_allowed();
 	if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
 		nid_alloc = cpuset_slab_spread_node();
 	else if (current->mempolicy)
 		nid_alloc = slab_node(current->mempolicy);
-	put_mems_allowed();
 	if (nid_alloc != nid_here)
 		return ____cache_alloc_node(cachep, flags, nid_alloc);
 	return NULL;
@@ -3246,14 +3244,17 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *obj = NULL;
 	int nid;
+	unsigned int cpuset_mems_cookie;
 
 	if (flags & __GFP_THISNODE)
 		return NULL;
 
-	get_mems_allowed();
-	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
 	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
 
+retry_cpuset:
+	cpuset_mems_cookie = get_mems_allowed();
+	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
+
 retry:
 	/*
 	 * Look through allowed nodes for objects available
@@ -3306,7 +3307,9 @@ retry:
 			}
 		}
 	}
-	put_mems_allowed();
+
+	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj))
+		goto retry_cpuset;
 	return obj;
 }
 
diff --git a/mm/slub.c b/mm/slub.c
index 10ab233..00ccf2c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1457,6 +1457,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	struct page *page;
+	unsigned int cpuset_mems_cookie;
 
 	/*
 	 * The defrag ratio allows a configuration of the tradeoffs between
@@ -1480,22 +1481,32 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 			get_cycles() % 1024 > s->remote_node_defrag_ratio)
 		return NULL;
 
-	get_mems_allowed();
-	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-		struct kmem_cache_node *n;
-
-		n = get_node(s, zone_to_nid(zone));
-
-		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
-				n->nr_partial > s->min_partial) {
-			page = get_partial_node(n);
-			if (page) {
-				put_mems_allowed();
-				return page;
+	do {
+		cpuset_mems_cookie = get_mems_allowed();
+		zonelist = node_zonelist(slab_node(current->mempolicy), flags);
+		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+			struct kmem_cache_node *n;
+
+			n = get_node(s, zone_to_nid(zone));
+
+			if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
+					n->nr_partial > s->min_partial) {
+				page = get_partial_node(n);
+				if (page) {
+					/*
+					 * Return the object even if
+					 * put_mems_allowed indicated that
+					 * the cpuset mems_allowed was
+					 * updated in parallel. It's a
+					 * harmless race between the alloc
+					 * and the cpuset update.
+					 */
+					put_mems_allowed(cpuset_mems_cookie);
+					return page;
+				}
 			}
 		}
-	}
+	} while (!put_mems_allowed(cpuset_mems_cookie));
 	put_mems_allowed();
 #endif
 	return NULL;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 49d8547..1682835 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2247,7 +2247,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	unsigned long writeback_threshold;
 	bool aborted_reclaim;
 
-	get_mems_allowed();
 	delayacct_freepages_start();
 
 	if (scanning_global_lru(sc))
@@ -2310,7 +2309,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 out:
 	delayacct_freepages_end();
-	put_mems_allowed();
 
 	if (sc->nr_reclaimed)
 		return sc->nr_reclaimed;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 33/34] cpuset: mm: Reduce large amounts of memory barrier related damage v3
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

commit cc9a6c8776615f9c194ccf0b63a0aa5628235545 upstream.

Stable note:  Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
	expensive and severely impacted page allocator performance. This
	is part of a series of patches that reduce page allocator overhead.

Changelog since V2
  o Documentation						(akpm)
  o Do not retry hugetlb allocations in the event of an error	(akpm)

Changelog since V1
  o Use seqcount with rmb instead of atomics (Peter, Christoph)

Commit [c0ff7453: cpuset,mm: fix no node to alloc memory when changing
cpuset's mems] wins a super prize for the largest number of memory
barriers entered into fast paths for one commit. [get|put]_mems_allowed
is incredibly heavy with pairs of full memory barriers inserted into a
number of hot paths. This was detected while investigating at large page
allocator slowdown introduced some time after 2.6.32. The largest portion
of this overhead was shown by oprofile to be at an mfence introduced by
this commit into the page allocator hot path.

For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.

This patch replaces the full memory barriers on both read and write sides
with a sequence counter with just read barriers on the fast path side.
This is much cheaper on some architectures, including x86.  The main bulk
of the patch is the retry logic if the nodemask changes in a manner that
can cause a false failure.

While updating the nodemask, a check is made to see if a false failure is
a risk. If it is, the sequence number gets bumped and parallel allocators
will briefly stall while the nodemask update takes place.

In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual
results were

                         3.3.0-rc3          3.3.0-rc3
                         rc3-vanilla        nobarrier-v2r1
Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds)             135.68    132.17
User+Sys Time Running Test (seconds)         164.2    160.13
Total Elapsed Time (seconds)                123.46    120.87

The overall improvement is small but the System CPU time is much improved
and roughly in correlation to what oprofile reported (these performance
figures are without profiling so skew is expected). The actual number of
page faults is noticeably improved.

For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.

To test the actual bug the commit fixed I opened two terminals. The first
ran within a cpuset and continually ran a small program that faulted 100M
of anonymous data. In a second window, the nodemask of the cpuset was
continually randomised in a loop. Without the commit, the program would
fail every so often (usually within 10 seconds) and obviously with the
commit everything worked fine. With this patch applied, it also worked
fine so the fix should be functionally equivalent.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/cpuset.h    |   45 ++++++++++++++++++---------------------------
 include/linux/init_task.h |    8 ++++++++
 include/linux/sched.h     |    2 +-
 kernel/cpuset.c           |   43 ++++++++-----------------------------------
 kernel/fork.c             |    3 +++
 mm/filemap.c              |   11 +++++++----
 mm/hugetlb.c              |   15 +++++++++++----
 mm/mempolicy.c            |   28 +++++++++++++++++++++-------
 mm/page_alloc.c           |   33 +++++++++++++++++++++++----------
 mm/slab.c                 |   13 ++++++++-----
 mm/slub.c                 |   39 +++++++++++++++++++++++++--------------
 mm/vmscan.c               |    2 --
 12 files changed, 133 insertions(+), 109 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index e9eaec5..8f15695 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -89,36 +89,25 @@ extern void rebuild_sched_domains(void);
 extern void cpuset_print_task_mems_allowed(struct task_struct *p);
 
 /*
- * reading current mems_allowed and mempolicy in the fastpath must protected
- * by get_mems_allowed()
+ * get_mems_allowed is required when making decisions involving mems_allowed
+ * such as during page allocation. mems_allowed can be updated in parallel
+ * and depending on the new value an operation can fail potentially causing
+ * process failure. A retry loop with get_mems_allowed and put_mems_allowed
+ * prevents these artificial failures.
  */
-static inline void get_mems_allowed(void)
+static inline unsigned int get_mems_allowed(void)
 {
-	current->mems_allowed_change_disable++;
-
-	/*
-	 * ensure that reading mems_allowed and mempolicy happens after the
-	 * update of ->mems_allowed_change_disable.
-	 *
-	 * the write-side task finds ->mems_allowed_change_disable is not 0,
-	 * and knows the read-side task is reading mems_allowed or mempolicy,
-	 * so it will clear old bits lazily.
-	 */
-	smp_mb();
+	return read_seqcount_begin(&current->mems_allowed_seq);
 }
 
-static inline void put_mems_allowed(void)
+/*
+ * If this returns false, the operation that took place after get_mems_allowed
+ * may have failed. It is up to the caller to retry the operation if
+ * appropriate.
+ */
+static inline bool put_mems_allowed(unsigned int seq)
 {
-	/*
-	 * ensure that reading mems_allowed and mempolicy before reducing
-	 * mems_allowed_change_disable.
-	 *
-	 * the write-side task will know that the read-side task is still
-	 * reading mems_allowed or mempolicy, don't clears old bits in the
-	 * nodemask.
-	 */
-	smp_mb();
-	--ACCESS_ONCE(current->mems_allowed_change_disable);
+	return !read_seqcount_retry(&current->mems_allowed_seq, seq);
 }
 
 static inline void set_mems_allowed(nodemask_t nodemask)
@@ -234,12 +223,14 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 {
 }
 
-static inline void get_mems_allowed(void)
+static inline unsigned int get_mems_allowed(void)
 {
+	return 0;
 }
 
-static inline void put_mems_allowed(void)
+static inline bool put_mems_allowed(unsigned int seq)
 {
+	return true;
 }
 
 #endif /* !CONFIG_CPUSETS */
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 580f70c..5e41a8e 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -30,6 +30,13 @@ extern struct fs_struct init_fs;
 #define INIT_THREADGROUP_FORK_LOCK(sig)
 #endif
 
+#ifdef CONFIG_CPUSETS
+#define INIT_CPUSET_SEQ							\
+	.mems_allowed_seq = SEQCNT_ZERO,
+#else
+#define INIT_CPUSET_SEQ
+#endif
+
 #define INIT_SIGNALS(sig) {						\
 	.nr_threads	= 1,						\
 	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ -193,6 +200,7 @@ extern struct cred init_cred;
 	INIT_FTRACE_GRAPH						\
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
+	INIT_CPUSET_SEQ							\
 }
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4ef452b..443ec43 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1484,7 +1484,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_CPUSETS
 	nodemask_t mems_allowed;	/* Protected by alloc_lock */
-	int mems_allowed_change_disable;
+	seqcount_t mems_allowed_seq;	/* Seqence no to catch updates */
 	int cpuset_mem_spread_rotor;
 	int cpuset_slab_spread_rotor;
 #endif
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 28d0bbd..b2e84bd 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -964,7 +964,6 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
 {
 	bool need_loop;
 
-repeat:
 	/*
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
@@ -983,45 +982,19 @@ repeat:
 	 */
 	need_loop = task_has_mempolicy(tsk) ||
 			!nodes_intersects(*newmems, tsk->mems_allowed);
-	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
-	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
 
-	/*
-	 * ensure checking ->mems_allowed_change_disable after setting all new
-	 * allowed nodes.
-	 *
-	 * the read-side task can see an nodemask with new allowed nodes and
-	 * old allowed nodes. and if it allocates page when cpuset clears newly
-	 * disallowed ones continuous, it can see the new allowed bits.
-	 *
-	 * And if setting all new allowed nodes is after the checking, setting
-	 * all new allowed nodes and clearing newly disallowed ones will be done
-	 * continuous, and the read-side task may find no node to alloc page.
-	 */
-	smp_mb();
+	if (need_loop)
+		write_seqcount_begin(&tsk->mems_allowed_seq);
 
-	/*
-	 * Allocation of memory is very fast, we needn't sleep when waiting
-	 * for the read-side.
-	 */
-	while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
-		task_unlock(tsk);
-		if (!task_curr(tsk))
-			yield();
-		goto repeat;
-	}
-
-	/*
-	 * ensure checking ->mems_allowed_change_disable before clearing all new
-	 * disallowed nodes.
-	 *
-	 * if clearing newly disallowed bits before the checking, the read-side
-	 * task may find no node to alloc page.
-	 */
-	smp_mb();
+	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
+	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
 
 	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);
 	tsk->mems_allowed = *newmems;
+
+	if (need_loop)
+		write_seqcount_end(&tsk->mems_allowed_seq);
+
 	task_unlock(tsk);
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 4712e3e..3d42aa3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -985,6 +985,9 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 #ifdef CONFIG_CGROUPS
 	init_rwsem(&sig->threadgroup_fork_lock);
 #endif
+#ifdef CONFIG_CPUSETS
+	seqcount_init(&tsk->mems_allowed_seq);
+#endif
 
 	sig->oom_adj = current->signal->oom_adj;
 	sig->oom_score_adj = current->signal->oom_score_adj;
diff --git a/mm/filemap.c b/mm/filemap.c
index b7d8603..10481eb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -516,10 +516,13 @@ struct page *__page_cache_alloc(gfp_t gfp)
 	struct page *page;
 
 	if (cpuset_do_page_mem_spread()) {
-		get_mems_allowed();
-		n = cpuset_mem_spread_node();
-		page = alloc_pages_exact_node(n, gfp, 0);
-		put_mems_allowed();
+		unsigned int cpuset_mems_cookie;
+		do {
+			cpuset_mems_cookie = get_mems_allowed();
+			n = cpuset_mem_spread_node();
+			page = alloc_pages_exact_node(n, gfp, 0);
+		} while (!put_mems_allowed(cpuset_mems_cookie) && !page);
+
 		return page;
 	}
 	return alloc_pages(gfp, 0);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 05f8fd4..64f2b7a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -454,14 +454,16 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve)
 {
-	struct page *page = NULL;
+	struct page *page;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
 	struct zonelist *zonelist;
 	struct zone *zone;
 	struct zoneref *z;
+	unsigned int cpuset_mems_cookie;
 
-	get_mems_allowed();
+retry_cpuset:
+	cpuset_mems_cookie = get_mems_allowed();
 	zonelist = huge_zonelist(vma, address,
 					htlb_alloc_mask, &mpol, &nodemask);
 	/*
@@ -488,10 +490,15 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 			}
 		}
 	}
-err:
+
 	mpol_cond_put(mpol);
-	put_mems_allowed();
+	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		goto retry_cpuset;
 	return page;
+
+err:
+	mpol_cond_put(mpol);
+	return NULL;
 }
 
 static void update_and_free_page(struct hstate *h, struct page *page)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index dd5f874..cff919f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1810,18 +1810,24 @@ struct page *
 alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		unsigned long addr, int node)
 {
-	struct mempolicy *pol = get_vma_policy(current, vma, addr);
+	struct mempolicy *pol;
 	struct zonelist *zl;
 	struct page *page;
+	unsigned int cpuset_mems_cookie;
+
+retry_cpuset:
+	pol = get_vma_policy(current, vma, addr);
+	cpuset_mems_cookie = get_mems_allowed();
 
-	get_mems_allowed();
 	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
 		unsigned nid;
 
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
 		mpol_cond_put(pol);
 		page = alloc_page_interleave(gfp, order, nid);
-		put_mems_allowed();
+		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+			goto retry_cpuset;
+
 		return page;
 	}
 	zl = policy_zonelist(gfp, pol, node);
@@ -1832,7 +1838,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		struct page *page =  __alloc_pages_nodemask(gfp, order,
 						zl, policy_nodemask(gfp, pol));
 		__mpol_put(pol);
-		put_mems_allowed();
+		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+			goto retry_cpuset;
 		return page;
 	}
 	/*
@@ -1840,7 +1847,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	 */
 	page = __alloc_pages_nodemask(gfp, order, zl,
 				      policy_nodemask(gfp, pol));
-	put_mems_allowed();
+	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		goto retry_cpuset;
 	return page;
 }
 
@@ -1867,11 +1875,14 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
 	struct mempolicy *pol = current->mempolicy;
 	struct page *page;
+	unsigned int cpuset_mems_cookie;
 
 	if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
 		pol = &default_policy;
 
-	get_mems_allowed();
+retry_cpuset:
+	cpuset_mems_cookie = get_mems_allowed();
+
 	/*
 	 * No reference counting needed for current->mempolicy
 	 * nor system default_policy
@@ -1882,7 +1893,10 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 		page = __alloc_pages_nodemask(gfp, order,
 				policy_zonelist(gfp, pol, numa_node_id()),
 				policy_nodemask(gfp, pol));
-	put_mems_allowed();
+
+	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		goto retry_cpuset;
+
 	return page;
 }
 EXPORT_SYMBOL(alloc_pages_current);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 257acae..a1744f5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2293,8 +2293,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct zone *preferred_zone;
-	struct page *page;
+	struct page *page = NULL;
 	int migratetype = allocflags_to_migratetype(gfp_mask);
+	unsigned int cpuset_mems_cookie;
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2313,15 +2314,15 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(!zonelist->_zonerefs->zone))
 		return NULL;
 
-	get_mems_allowed();
+retry_cpuset:
+	cpuset_mems_cookie = get_mems_allowed();
+
 	/* The preferred zone is used for statistics later */
 	first_zones_zonelist(zonelist, high_zoneidx,
 				nodemask ? : &cpuset_current_mems_allowed,
 				&preferred_zone);
-	if (!preferred_zone) {
-		put_mems_allowed();
-		return NULL;
-	}
+	if (!preferred_zone)
+		goto out;
 
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
@@ -2331,9 +2332,19 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
-	put_mems_allowed();
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
+
+out:
+	/*
+	 * When updating a task's mems_allowed, it is possible to race with
+	 * parallel threads in such a way that an allocation can fail while
+	 * the mask is being updated. If a page allocation is about to fail,
+	 * check if the cpuset changed during allocation and if so, retry.
+	 */
+	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		goto retry_cpuset;
+
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -2557,13 +2568,15 @@ void si_meminfo_node(struct sysinfo *val, int nid)
 bool skip_free_areas_node(unsigned int flags, int nid)
 {
 	bool ret = false;
+	unsigned int cpuset_mems_cookie;
 
 	if (!(flags & SHOW_MEM_FILTER_NODES))
 		goto out;
 
-	get_mems_allowed();
-	ret = !node_isset(nid, cpuset_current_mems_allowed);
-	put_mems_allowed();
+	do {
+		cpuset_mems_cookie = get_mems_allowed();
+		ret = !node_isset(nid, cpuset_current_mems_allowed);
+	} while (!put_mems_allowed(cpuset_mems_cookie));
 out:
 	return ret;
 }
diff --git a/mm/slab.c b/mm/slab.c
index d96e223..a67f812 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3218,12 +3218,10 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
 	if (in_interrupt() || (flags & __GFP_THISNODE))
 		return NULL;
 	nid_alloc = nid_here = numa_mem_id();
-	get_mems_allowed();
 	if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
 		nid_alloc = cpuset_slab_spread_node();
 	else if (current->mempolicy)
 		nid_alloc = slab_node(current->mempolicy);
-	put_mems_allowed();
 	if (nid_alloc != nid_here)
 		return ____cache_alloc_node(cachep, flags, nid_alloc);
 	return NULL;
@@ -3246,14 +3244,17 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *obj = NULL;
 	int nid;
+	unsigned int cpuset_mems_cookie;
 
 	if (flags & __GFP_THISNODE)
 		return NULL;
 
-	get_mems_allowed();
-	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
 	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
 
+retry_cpuset:
+	cpuset_mems_cookie = get_mems_allowed();
+	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
+
 retry:
 	/*
 	 * Look through allowed nodes for objects available
@@ -3306,7 +3307,9 @@ retry:
 			}
 		}
 	}
-	put_mems_allowed();
+
+	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj))
+		goto retry_cpuset;
 	return obj;
 }
 
diff --git a/mm/slub.c b/mm/slub.c
index 10ab233..00ccf2c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1457,6 +1457,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	struct page *page;
+	unsigned int cpuset_mems_cookie;
 
 	/*
 	 * The defrag ratio allows a configuration of the tradeoffs between
@@ -1480,22 +1481,32 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 			get_cycles() % 1024 > s->remote_node_defrag_ratio)
 		return NULL;
 
-	get_mems_allowed();
-	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-		struct kmem_cache_node *n;
-
-		n = get_node(s, zone_to_nid(zone));
-
-		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
-				n->nr_partial > s->min_partial) {
-			page = get_partial_node(n);
-			if (page) {
-				put_mems_allowed();
-				return page;
+	do {
+		cpuset_mems_cookie = get_mems_allowed();
+		zonelist = node_zonelist(slab_node(current->mempolicy), flags);
+		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+			struct kmem_cache_node *n;
+
+			n = get_node(s, zone_to_nid(zone));
+
+			if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
+					n->nr_partial > s->min_partial) {
+				page = get_partial_node(n);
+				if (page) {
+					/*
+					 * Return the object even if
+					 * put_mems_allowed indicated that
+					 * the cpuset mems_allowed was
+					 * updated in parallel. It's a
+					 * harmless race between the alloc
+					 * and the cpuset update.
+					 */
+					put_mems_allowed(cpuset_mems_cookie);
+					return page;
+				}
 			}
 		}
-	}
+	} while (!put_mems_allowed(cpuset_mems_cookie));
 	put_mems_allowed();
 #endif
 	return NULL;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 49d8547..1682835 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2247,7 +2247,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	unsigned long writeback_threshold;
 	bool aborted_reclaim;
 
-	get_mems_allowed();
 	delayacct_freepages_start();
 
 	if (scanning_global_lru(sc))
@@ -2310,7 +2309,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 out:
 	delayacct_freepages_end();
-	put_mems_allowed();
 
 	if (sc->nr_reclaimed)
 		return sc->nr_reclaimed;
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 34/34] mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-23 13:38   ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Konstantin Khlebnikov <khlebnikov@openvz.org>

commit b1c12cbcd0a02527c180a862e8971e249d3b347d upstream.

Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
	expensive and severely impacted page allocator performance. This
	is part of a series of patches that reduce page allocator overhead.

Fix a gcc warning (and bug?) introduced in cc9a6c877 ("cpuset: mm: reduce
large amounts of memory barrier related damage v3")

Local variable "page" can be uninitialized if the nodemask from vma policy
does not intersects with nodemask from cpuset.  Even if it doesn't happens
it is better to initialize this variable explicitly than to introduce
a kernel oops in a weird corner case.

mm/hugetlb.c: In function `alloc_huge_page':
mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/hugetlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 64f2b7a..ae60a53 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -454,7 +454,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve)
 {
-	struct page *page;
+	struct page *page = NULL;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
 	struct zonelist *zonelist;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 34/34] mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma
@ 2012-07-23 13:38   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23 13:38 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM, LKML, Mel Gorman

From: Konstantin Khlebnikov <khlebnikov@openvz.org>

commit b1c12cbcd0a02527c180a862e8971e249d3b347d upstream.

Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
	expensive and severely impacted page allocator performance. This
	is part of a series of patches that reduce page allocator overhead.

Fix a gcc warning (and bug?) introduced in cc9a6c877 ("cpuset: mm: reduce
large amounts of memory barrier related damage v3")

Local variable "page" can be uninitialized if the nodemask from vma policy
does not intersects with nodemask from cpuset.  Even if it doesn't happens
it is better to initialize this variable explicitly than to introduce
a kernel oops in a weird corner case.

mm/hugetlb.c: In function `alloc_huge_page':
mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/hugetlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 64f2b7a..ae60a53 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -454,7 +454,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve)
 {
-	struct page *page;
+	struct page *page = NULL;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
 	struct zonelist *zonelist;
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-24  5:58   ` Mike Galbraith
  -1 siblings, 0 replies; 125+ messages in thread
From: Mike Galbraith @ 2012-07-24  5:58 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Mon, 2012-07-23 at 14:38 +0100, Mel Gorman wrote: 
> Changelog since V1
>   o Expand some of the notes					(jrnieder)
>   o Correct upstream commit SHA1				(hugh)
> 
> This series is related to the new addition to stable_kernel_rules.txt
> 
>  - Serious issues as reported by a user of a distribution kernel may also
>    be considered if they fix a notable performance or interactivity issue.
>    As these fixes are not as obvious and have a higher risk of a subtle
>    regression they should only be submitted by a distribution kernel
>    maintainer and include an addendum linking to a bugzilla entry if it
>    exists and additional information on the user-visible impact.
> 
> All of these patches have been backported to a distribution kernel and
> address some sort of performance issue in the VM. As they are not all
> obvious, I've added a "Stable note" to the top of each patch giving
> additional information on why the patch was backported. Lets see where
> the boundaries lie on how this new rule is interpreted in practice :).

FWIW, I'm all for performance backports.  They do have a downside though
(other than the risk of bugs slipping in, or triggering latent bugs).

When the next enterprise kernel is built, marketeers ask for numbers to
make potential customers drool over, and you _can't produce any_ because
you wedged all the spiffy performance stuff into the crusty old kernel.

-Mike


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
@ 2012-07-24  5:58   ` Mike Galbraith
  0 siblings, 0 replies; 125+ messages in thread
From: Mike Galbraith @ 2012-07-24  5:58 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Mon, 2012-07-23 at 14:38 +0100, Mel Gorman wrote: 
> Changelog since V1
>   o Expand some of the notes					(jrnieder)
>   o Correct upstream commit SHA1				(hugh)
> 
> This series is related to the new addition to stable_kernel_rules.txt
> 
>  - Serious issues as reported by a user of a distribution kernel may also
>    be considered if they fix a notable performance or interactivity issue.
>    As these fixes are not as obvious and have a higher risk of a subtle
>    regression they should only be submitted by a distribution kernel
>    maintainer and include an addendum linking to a bugzilla entry if it
>    exists and additional information on the user-visible impact.
> 
> All of these patches have been backported to a distribution kernel and
> address some sort of performance issue in the VM. As they are not all
> obvious, I've added a "Stable note" to the top of each patch giving
> additional information on why the patch was backported. Lets see where
> the boundaries lie on how this new rule is interpreted in practice :).

FWIW, I'm all for performance backports.  They do have a downside though
(other than the risk of bugs slipping in, or triggering latent bugs).

When the next enterprise kernel is built, marketeers ask for numbers to
make potential customers drool over, and you _can't produce any_ because
you wedged all the spiffy performance stuff into the crusty old kernel.

-Mike

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
  2012-07-24  5:58   ` Mike Galbraith
@ 2012-07-24  8:10     ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-24  8:10 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 07:58:51AM +0200, Mike Galbraith wrote:
> On Mon, 2012-07-23 at 14:38 +0100, Mel Gorman wrote: 
> > Changelog since V1
> >   o Expand some of the notes					(jrnieder)
> >   o Correct upstream commit SHA1				(hugh)
> > 
> > This series is related to the new addition to stable_kernel_rules.txt
> > 
> >  - Serious issues as reported by a user of a distribution kernel may also
> >    be considered if they fix a notable performance or interactivity issue.
> >    As these fixes are not as obvious and have a higher risk of a subtle
> >    regression they should only be submitted by a distribution kernel
> >    maintainer and include an addendum linking to a bugzilla entry if it
> >    exists and additional information on the user-visible impact.
> > 
> > All of these patches have been backported to a distribution kernel and
> > address some sort of performance issue in the VM. As they are not all
> > obvious, I've added a "Stable note" to the top of each patch giving
> > additional information on why the patch was backported. Lets see where
> > the boundaries lie on how this new rule is interpreted in practice :).
> 
> FWIW, I'm all for performance backports.  They do have a downside though
> (other than the risk of bugs slipping in, or triggering latent bugs).
> 
> When the next enterprise kernel is built, marketeers ask for numbers to
> make potential customers drool over, and you _can't produce any_ because
> you wedged all the spiffy performance stuff into the crusty old kernel.
> 

I'm not a marketing person but I expect the performance figures they
really care about are between versions of the product which includes more
than the kernel. The are not going to be comparisons between the upstream
kernel and the distribution kernel so they'll still are able to produce
the drool-inducing figures. By backporting certain performance figures,
data from regression testing major kernel releases is more valuable to
the distribution vendor when considering a change of kernel version.

There is also the lag factor to consider. Distribution kernels will carry
fixes for the functional and performance regression fixes from the time
of discovery and supply temporary kernels to their users to minimise the
lifetime of a bug. It could be weeks if not months before the same fixes
bubble their way up to -stable. They might never bubble up if the developer
is pressed for time or the patch unsuitable for -stable for some reason.

None of that takes into account the fact that distribution kernels are
backed by quality support and developer teams that can diagnose and fix
a range of problems encountered in the field. This is true whether it is
an distribution that directly sells support as part of the software or
is a distribution with a lot of developers that are also contractors.
The same guarantees do not necessarily apply to upstream kernels where
support is conditional on getting the attention of the right people.

These backports are not going to destroy the value proposition of
distribution kernels :)

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
@ 2012-07-24  8:10     ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-24  8:10 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 07:58:51AM +0200, Mike Galbraith wrote:
> On Mon, 2012-07-23 at 14:38 +0100, Mel Gorman wrote: 
> > Changelog since V1
> >   o Expand some of the notes					(jrnieder)
> >   o Correct upstream commit SHA1				(hugh)
> > 
> > This series is related to the new addition to stable_kernel_rules.txt
> > 
> >  - Serious issues as reported by a user of a distribution kernel may also
> >    be considered if they fix a notable performance or interactivity issue.
> >    As these fixes are not as obvious and have a higher risk of a subtle
> >    regression they should only be submitted by a distribution kernel
> >    maintainer and include an addendum linking to a bugzilla entry if it
> >    exists and additional information on the user-visible impact.
> > 
> > All of these patches have been backported to a distribution kernel and
> > address some sort of performance issue in the VM. As they are not all
> > obvious, I've added a "Stable note" to the top of each patch giving
> > additional information on why the patch was backported. Lets see where
> > the boundaries lie on how this new rule is interpreted in practice :).
> 
> FWIW, I'm all for performance backports.  They do have a downside though
> (other than the risk of bugs slipping in, or triggering latent bugs).
> 
> When the next enterprise kernel is built, marketeers ask for numbers to
> make potential customers drool over, and you _can't produce any_ because
> you wedged all the spiffy performance stuff into the crusty old kernel.
> 

I'm not a marketing person but I expect the performance figures they
really care about are between versions of the product which includes more
than the kernel. The are not going to be comparisons between the upstream
kernel and the distribution kernel so they'll still are able to produce
the drool-inducing figures. By backporting certain performance figures,
data from regression testing major kernel releases is more valuable to
the distribution vendor when considering a change of kernel version.

There is also the lag factor to consider. Distribution kernels will carry
fixes for the functional and performance regression fixes from the time
of discovery and supply temporary kernels to their users to minimise the
lifetime of a bug. It could be weeks if not months before the same fixes
bubble their way up to -stable. They might never bubble up if the developer
is pressed for time or the patch unsuitable for -stable for some reason.

None of that takes into account the fact that distribution kernels are
backed by quality support and developer teams that can diagnose and fix
a range of problems encountered in the field. This is true whether it is
an distribution that directly sells support as part of the software or
is a distribution with a lot of developers that are also contractors.
The same guarantees do not necessarily apply to upstream kernels where
support is conditional on getting the attention of the right people.

These backports are not going to destroy the value proposition of
distribution kernels :)

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
  2012-07-24  5:58   ` Mike Galbraith
@ 2012-07-24 13:18     ` Hillf Danton
  -1 siblings, 0 replies; 125+ messages in thread
From: Hillf Danton @ 2012-07-24 13:18 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Mel Gorman, Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 1:58 PM, Mike Galbraith <efault@gmx.de> wrote:
> FWIW, I'm all for performance backports.  They do have a downside though
> (other than the risk of bugs slipping in, or triggering latent bugs).
>
> When the next enterprise kernel is built, marketeers ask for numbers to
> make potential customers drool over, and you _can't produce any_ because
> you wedged all the spiffy performance stuff into the crusty old kernel.
>
Well do your job please.

	Suse 11 SP1 kernel panic on HP hardware
	https://lkml.org/lkml/2012/7/24/136

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
@ 2012-07-24 13:18     ` Hillf Danton
  0 siblings, 0 replies; 125+ messages in thread
From: Hillf Danton @ 2012-07-24 13:18 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Mel Gorman, Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 1:58 PM, Mike Galbraith <efault@gmx.de> wrote:
> FWIW, I'm all for performance backports.  They do have a downside though
> (other than the risk of bugs slipping in, or triggering latent bugs).
>
> When the next enterprise kernel is built, marketeers ask for numbers to
> make potential customers drool over, and you _can't produce any_ because
> you wedged all the spiffy performance stuff into the crusty old kernel.
>
Well do your job please.

	Suse 11 SP1 kernel panic on HP hardware
	https://lkml.org/lkml/2012/7/24/136

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
  2012-07-24 13:18     ` Hillf Danton
@ 2012-07-24 13:27       ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-24 13:27 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Mike Galbraith, Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 09:18:16PM +0800, Hillf Danton wrote:
> On Tue, Jul 24, 2012 at 1:58 PM, Mike Galbraith <efault@gmx.de> wrote:
> > FWIW, I'm all for performance backports.  They do have a downside though
> > (other than the risk of bugs slipping in, or triggering latent bugs).
> >
> > When the next enterprise kernel is built, marketeers ask for numbers to
> > make potential customers drool over, and you _can't produce any_ because
> > you wedged all the spiffy performance stuff into the crusty old kernel.
> >
> Well do your job please.
> 

I would suggest the user in question use the normal support channels for
resolving a potentially SLES-specific bug.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
@ 2012-07-24 13:27       ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-24 13:27 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Mike Galbraith, Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 09:18:16PM +0800, Hillf Danton wrote:
> On Tue, Jul 24, 2012 at 1:58 PM, Mike Galbraith <efault@gmx.de> wrote:
> > FWIW, I'm all for performance backports.  They do have a downside though
> > (other than the risk of bugs slipping in, or triggering latent bugs).
> >
> > When the next enterprise kernel is built, marketeers ask for numbers to
> > make potential customers drool over, and you _can't produce any_ because
> > you wedged all the spiffy performance stuff into the crusty old kernel.
> >
> Well do your job please.
> 

I would suggest the user in question use the normal support channels for
resolving a potentially SLES-specific bug.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
  2012-07-24 13:27       ` Mel Gorman
@ 2012-07-24 13:34         ` Hillf Danton
  -1 siblings, 0 replies; 125+ messages in thread
From: Hillf Danton @ 2012-07-24 13:34 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Mike Galbraith, Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 9:27 PM, Mel Gorman <mgorman@suse.de> wrote:
> I would suggest the user in question use the normal support channels for
> resolving a potentially SLES-specific bug.
>
Thanks, Mel.

Is Mike busy in other fairs?

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
@ 2012-07-24 13:34         ` Hillf Danton
  0 siblings, 0 replies; 125+ messages in thread
From: Hillf Danton @ 2012-07-24 13:34 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Mike Galbraith, Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 9:27 PM, Mel Gorman <mgorman@suse.de> wrote:
> I would suggest the user in question use the normal support channels for
> resolving a potentially SLES-specific bug.
>
Thanks, Mel.

Is Mike busy in other fairs?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
  2012-07-24 13:18     ` Hillf Danton
@ 2012-07-24 13:52       ` Mike Galbraith
  -1 siblings, 0 replies; 125+ messages in thread
From: Mike Galbraith @ 2012-07-24 13:52 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Mel Gorman, Stable, Linux-MM, LKML

On Tue, 2012-07-24 at 21:18 +0800, Hillf Danton wrote: 
> On Tue, Jul 24, 2012 at 1:58 PM, Mike Galbraith <efault@gmx.de> wrote:
> > FWIW, I'm all for performance backports.  They do have a downside though
> > (other than the risk of bugs slipping in, or triggering latent bugs).
> >
> > When the next enterprise kernel is built, marketeers ask for numbers to
> > make potential customers drool over, and you _can't produce any_ because
> > you wedged all the spiffy performance stuff into the crusty old kernel.
> >
> Well do your job please.
> 
> 	Suse 11 SP1 kernel panic on HP hardware
> 	https://lkml.org/lkml/2012/7/24/136

Last time I looked, handling SUSE support issues on LKML was not in my
job description.  I don't recall seeing anything about taking direction
from random LKML subscribers either.

-Mike


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
@ 2012-07-24 13:52       ` Mike Galbraith
  0 siblings, 0 replies; 125+ messages in thread
From: Mike Galbraith @ 2012-07-24 13:52 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Mel Gorman, Stable, Linux-MM, LKML

On Tue, 2012-07-24 at 21:18 +0800, Hillf Danton wrote: 
> On Tue, Jul 24, 2012 at 1:58 PM, Mike Galbraith <efault@gmx.de> wrote:
> > FWIW, I'm all for performance backports.  They do have a downside though
> > (other than the risk of bugs slipping in, or triggering latent bugs).
> >
> > When the next enterprise kernel is built, marketeers ask for numbers to
> > make potential customers drool over, and you _can't produce any_ because
> > you wedged all the spiffy performance stuff into the crusty old kernel.
> >
> Well do your job please.
> 
> 	Suse 11 SP1 kernel panic on HP hardware
> 	https://lkml.org/lkml/2012/7/24/136

Last time I looked, handling SUSE support issues on LKML was not in my
job description.  I don't recall seeing anything about taking direction
from random LKML subscribers either.

-Mike

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
  2012-07-24 13:34         ` Hillf Danton
@ 2012-07-24 13:53           ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-24 13:53 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Mike Galbraith, Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 09:34:56PM +0800, Hillf Danton wrote:
> On Tue, Jul 24, 2012 at 9:27 PM, Mel Gorman <mgorman@suse.de> wrote:
> > I would suggest the user in question use the normal support channels for
> > resolving a potentially SLES-specific bug.
> >
> Thanks, Mel.
> 
> Is Mike busy in other fairs?

It's not for me whether to say whether he is or not. SUSE already provide
excellent support channel for handling bugs like the one. If the user uses
them, they are very likely to find that this particular bug was resolved
in February 2011 by Mike without you stamping your foot on LKML.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
@ 2012-07-24 13:53           ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-24 13:53 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Mike Galbraith, Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 09:34:56PM +0800, Hillf Danton wrote:
> On Tue, Jul 24, 2012 at 9:27 PM, Mel Gorman <mgorman@suse.de> wrote:
> > I would suggest the user in question use the normal support channels for
> > resolving a potentially SLES-specific bug.
> >
> Thanks, Mel.
> 
> Is Mike busy in other fairs?

It's not for me whether to say whether he is or not. SUSE already provide
excellent support channel for handling bugs like the one. If the user uses
them, they are very likely to find that this particular bug was resolved
in February 2011 by Mike without you stamping your foot on LKML.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
  2012-07-24 13:53           ` Mel Gorman
@ 2012-07-24 14:11             ` Hillf Danton
  -1 siblings, 0 replies; 125+ messages in thread
From: Hillf Danton @ 2012-07-24 14:11 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Mike Galbraith, Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 9:53 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Tue, Jul 24, 2012 at 09:34:56PM +0800, Hillf Danton wrote:
>> On Tue, Jul 24, 2012 at 9:27 PM, Mel Gorman <mgorman@suse.de> wrote:
>> > I would suggest the user in question use the normal support channels for
>> > resolving a potentially SLES-specific bug.
>> >
>> Thanks, Mel.
>>
>> Is Mike busy in other fairs?
>
> It's not for me whether to say whether he is or not. SUSE already provide
> excellent support channel for handling bugs like the one. If the user uses
> them, they are very likely to find that this particular bug was resolved
> in February 2011 by Mike without you stamping your foot on LKML.
>
If you pay for good products and service, you see why
I forwarded the message to you, the SUSE gurus.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
@ 2012-07-24 14:11             ` Hillf Danton
  0 siblings, 0 replies; 125+ messages in thread
From: Hillf Danton @ 2012-07-24 14:11 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Mike Galbraith, Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 9:53 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Tue, Jul 24, 2012 at 09:34:56PM +0800, Hillf Danton wrote:
>> On Tue, Jul 24, 2012 at 9:27 PM, Mel Gorman <mgorman@suse.de> wrote:
>> > I would suggest the user in question use the normal support channels for
>> > resolving a potentially SLES-specific bug.
>> >
>> Thanks, Mel.
>>
>> Is Mike busy in other fairs?
>
> It's not for me whether to say whether he is or not. SUSE already provide
> excellent support channel for handling bugs like the one. If the user uses
> them, they are very likely to find that this particular bug was resolved
> in February 2011 by Mike without you stamping your foot on LKML.
>
If you pay for good products and service, you see why
I forwarded the message to you, the SUSE gurus.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
  2012-07-24 13:52       ` Mike Galbraith
@ 2012-07-24 14:18         ` Hillf Danton
  -1 siblings, 0 replies; 125+ messages in thread
From: Hillf Danton @ 2012-07-24 14:18 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Mel Gorman, Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 9:52 PM, Mike Galbraith <efault@gmx.de> wrote:
> Last time I looked, handling SUSE support issues on LKML was not in my
> job description.  I don't recall seeing anything about taking direction
> from random LKML subscribers either.
>
End users pay for SUSE products/service, right?

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
@ 2012-07-24 14:18         ` Hillf Danton
  0 siblings, 0 replies; 125+ messages in thread
From: Hillf Danton @ 2012-07-24 14:18 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Mel Gorman, Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 9:52 PM, Mike Galbraith <efault@gmx.de> wrote:
> Last time I looked, handling SUSE support issues on LKML was not in my
> job description.  I don't recall seeing anything about taking direction
> from random LKML subscribers either.
>
End users pay for SUSE products/service, right?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
  2012-07-24 14:18         ` Hillf Danton
@ 2012-07-24 14:41           ` Mike Galbraith
  -1 siblings, 0 replies; 125+ messages in thread
From: Mike Galbraith @ 2012-07-24 14:41 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Mel Gorman, Stable, Linux-MM, LKML

On Tue, 2012-07-24 at 22:18 +0800, Hillf Danton wrote: 
> On Tue, Jul 24, 2012 at 9:52 PM, Mike Galbraith <efault@gmx.de> wrote:
> > Last time I looked, handling SUSE support issues on LKML was not in my
> > job description.  I don't recall seeing anything about taking direction
> > from random LKML subscribers either.
> >
> End users pay for SUSE products/service, right?

Hohum.  Have a nice life, and goodbye.

-Mike



^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
@ 2012-07-24 14:41           ` Mike Galbraith
  0 siblings, 0 replies; 125+ messages in thread
From: Mike Galbraith @ 2012-07-24 14:41 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Mel Gorman, Stable, Linux-MM, LKML

On Tue, 2012-07-24 at 22:18 +0800, Hillf Danton wrote: 
> On Tue, Jul 24, 2012 at 9:52 PM, Mike Galbraith <efault@gmx.de> wrote:
> > Last time I looked, handling SUSE support issues on LKML was not in my
> > job description.  I don't recall seeing anything about taking direction
> > from random LKML subscribers either.
> >
> End users pay for SUSE products/service, right?

Hohum.  Have a nice life, and goodbye.

-Mike


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 03/34] mm: Reduce the amount of work done when updating min_free_kbytes
  2012-07-23 13:38   ` Mel Gorman
@ 2012-07-24 22:47     ` Greg KH
  -1 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-24 22:47 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Mon, Jul 23, 2012 at 02:38:16PM +0100, Mel Gorman wrote:
> commit 938929f14cb595f43cd1a4e63e22d36cab1e4a1f upstream.
> 
> Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=726210 .
> 	Large machines with 1TB or more of RAM take a long time to boot
> 	without this patch and may spew out soft lockup warnings.

In comparing this with the upstream version, you have a few different
coding style differences, but no real content difference.  Why?

> 
> When min_free_kbytes is updated blocks marked MIGRATE_RESERVE are
> updated. Ordinarily, this work is unnoticable as it happens early
> in boot. However, on large machines with 1TB of memory, this can take
> a considerable time when NUMA distances are taken into account. The bulk
> of the work is done by pageblock_is_reserved() which examines the
> metadata for almost every page in the system. Currently, we are doing
> this far more than necessary as it is only required while there are
> still blocks to be marked MIGRATE_RESERVE. This patch significantly
> reduces the amount of work done by setup_zone_migrate_reserve()
> improving boot times on 1TB machines.
> 
> [akpm@linux-foundation.org: coding-style fixes]

I'm guessing you didn't pick these up?

Anyway, I've taken it now as the original one from Linus's tree,
hopefully this doesn't burn me later in the series...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 03/34] mm: Reduce the amount of work done when updating min_free_kbytes
@ 2012-07-24 22:47     ` Greg KH
  0 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-24 22:47 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Mon, Jul 23, 2012 at 02:38:16PM +0100, Mel Gorman wrote:
> commit 938929f14cb595f43cd1a4e63e22d36cab1e4a1f upstream.
> 
> Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=726210 .
> 	Large machines with 1TB or more of RAM take a long time to boot
> 	without this patch and may spew out soft lockup warnings.

In comparing this with the upstream version, you have a few different
coding style differences, but no real content difference.  Why?

> 
> When min_free_kbytes is updated blocks marked MIGRATE_RESERVE are
> updated. Ordinarily, this work is unnoticable as it happens early
> in boot. However, on large machines with 1TB of memory, this can take
> a considerable time when NUMA distances are taken into account. The bulk
> of the work is done by pageblock_is_reserved() which examines the
> metadata for almost every page in the system. Currently, we are doing
> this far more than necessary as it is only required while there are
> still blocks to be marked MIGRATE_RESERVE. This patch significantly
> reduces the amount of work done by setup_zone_migrate_reserve()
> improving boot times on 1TB machines.
> 
> [akpm@linux-foundation.org: coding-style fixes]

I'm guessing you didn't pick these up?

Anyway, I've taken it now as the original one from Linus's tree,
hopefully this doesn't burn me later in the series...

thanks,

greg k-h

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 03/34] mm: Reduce the amount of work done when updating min_free_kbytes
  2012-07-24 22:47     ` Greg KH
@ 2012-07-25  7:57       ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-25  7:57 UTC (permalink / raw)
  To: Greg KH; +Cc: Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 03:47:12PM -0700, Greg KH wrote:
> On Mon, Jul 23, 2012 at 02:38:16PM +0100, Mel Gorman wrote:
> > commit 938929f14cb595f43cd1a4e63e22d36cab1e4a1f upstream.
> > 
> > Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=726210 .
> > 	Large machines with 1TB or more of RAM take a long time to boot
> > 	without this patch and may spew out soft lockup warnings.
> 
> In comparing this with the upstream version, you have a few different
> coding style differences, but no real content difference.  Why?
> 

This was a mistake in my workflow that needs a bit of ironing out.

The mistake is that I took the patch from the distribution kernel which was
merged at a time before the coding style fixes were applied. The upstream
commit and signed-off lines were "fixed" but I failed to refresh the
patch and missed that it differed from upstream. Thanks for catching this.
I'll adjust my workflow and assistant scripts to watch for this sort of
problem in the future.

> > 
> > When min_free_kbytes is updated blocks marked MIGRATE_RESERVE are
> > updated. Ordinarily, this work is unnoticable as it happens early
> > in boot. However, on large machines with 1TB of memory, this can take
> > a considerable time when NUMA distances are taken into account. The bulk
> > of the work is done by pageblock_is_reserved() which examines the
> > metadata for almost every page in the system. Currently, we are doing
> > this far more than necessary as it is only required while there are
> > still blocks to be marked MIGRATE_RESERVE. This patch significantly
> > reduces the amount of work done by setup_zone_migrate_reserve()
> > improving boot times on 1TB machines.
> > 
> > [akpm@linux-foundation.org: coding-style fixes]
> 
> I'm guessing you didn't pick these up?
> 

Correct but due to a mistake, not for any good reason.

> Anyway, I've taken it now as the original one from Linus's tree,
> hopefully this doesn't burn me later in the series...
> 

I hope it didn't.

Thanks Greg.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 03/34] mm: Reduce the amount of work done when updating min_free_kbytes
@ 2012-07-25  7:57       ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-25  7:57 UTC (permalink / raw)
  To: Greg KH; +Cc: Stable, Linux-MM, LKML

On Tue, Jul 24, 2012 at 03:47:12PM -0700, Greg KH wrote:
> On Mon, Jul 23, 2012 at 02:38:16PM +0100, Mel Gorman wrote:
> > commit 938929f14cb595f43cd1a4e63e22d36cab1e4a1f upstream.
> > 
> > Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=726210 .
> > 	Large machines with 1TB or more of RAM take a long time to boot
> > 	without this patch and may spew out soft lockup warnings.
> 
> In comparing this with the upstream version, you have a few different
> coding style differences, but no real content difference.  Why?
> 

This was a mistake in my workflow that needs a bit of ironing out.

The mistake is that I took the patch from the distribution kernel which was
merged at a time before the coding style fixes were applied. The upstream
commit and signed-off lines were "fixed" but I failed to refresh the
patch and missed that it differed from upstream. Thanks for catching this.
I'll adjust my workflow and assistant scripts to watch for this sort of
problem in the future.

> > 
> > When min_free_kbytes is updated blocks marked MIGRATE_RESERVE are
> > updated. Ordinarily, this work is unnoticable as it happens early
> > in boot. However, on large machines with 1TB of memory, this can take
> > a considerable time when NUMA distances are taken into account. The bulk
> > of the work is done by pageblock_is_reserved() which examines the
> > metadata for almost every page in the system. Currently, we are doing
> > this far more than necessary as it is only required while there are
> > still blocks to be marked MIGRATE_RESERVE. This patch significantly
> > reduces the amount of work done by setup_zone_migrate_reserve()
> > improving boot times on 1TB machines.
> > 
> > [akpm@linux-foundation.org: coding-style fixes]
> 
> I'm guessing you didn't pick these up?
> 

Correct but due to a mistake, not for any good reason.

> Anyway, I've taken it now as the original one from Linus's tree,
> hopefully this doesn't burn me later in the series...
> 

I hope it didn't.

Thanks Greg.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 15/34] mm: migration: clean up unmap_and_move()
  2012-07-23 13:38   ` Mel Gorman
@ 2012-07-25 15:45     ` Greg KH
  -1 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-25 15:45 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Mon, Jul 23, 2012 at 02:38:28PM +0100, Mel Gorman wrote:
> commit 0dabec93de633a87adfbbe1d800a4c56cd19d73b upstream.
> 
> Stable note: Not tracked in Bugzilla. This patch makes later patches
> 	easier to apply but has no other impact.
> 
> unmap_and_move() is one a big messy function.  Clean it up.
> 
> Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  mm/migrate.c |   59 ++++++++++++++++++++++++++++++++--------------------------
>  1 file changed, 33 insertions(+), 26 deletions(-)

Mel, you didn't sign-off-on this patch.  Any reason why?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 15/34] mm: migration: clean up unmap_and_move()
@ 2012-07-25 15:45     ` Greg KH
  0 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-25 15:45 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Mon, Jul 23, 2012 at 02:38:28PM +0100, Mel Gorman wrote:
> commit 0dabec93de633a87adfbbe1d800a4c56cd19d73b upstream.
> 
> Stable note: Not tracked in Bugzilla. This patch makes later patches
> 	easier to apply but has no other impact.
> 
> unmap_and_move() is one a big messy function.  Clean it up.
> 
> Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  mm/migrate.c |   59 ++++++++++++++++++++++++++++++++--------------------------
>  1 file changed, 33 insertions(+), 26 deletions(-)

Mel, you didn't sign-off-on this patch.  Any reason why?

thanks,

greg k-h

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 16/34] mm: compaction: Allow compaction to isolate dirty pages
  2012-07-23 13:38   ` Mel Gorman
@ 2012-07-25 15:47     ` Greg KH
  -1 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-25 15:47 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Mon, Jul 23, 2012 at 02:38:29PM +0100, Mel Gorman wrote:
> commit a77ebd333cd810d7b680d544be88c875131c2bd3 upstream.
> 
> Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
> 	information by reducing LRU list churning had the side-effect of
> 	reducing THP allocation success rates. This was part of a series
> 	to restore the success rates while preserving the reclaim fix.
> 
> Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
> noted that compaction does not migrate dirty or writeback pages and
> that is was meaningless to pick the page and re-add it to the LRU list.
> 
> What was missed during review is that asynchronous migration moves
> dirty pages if their ->migratepage callback is migrate_page() because
> these can be moved without blocking. This potentially impacted
> hugepage allocation success rates by a factor depending on how many
> dirty pages are in the system.
> 
> This patch partially reverts 39deaf85 to allow migration to isolate
> dirty pages again. This increases how much compaction disrupts the
> LRU but that is addressed later in the series.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Note, the changelog here differs from what is in Linus's tree by a LOT.
I took the version in Linus's tree instead.

greg k-h

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 16/34] mm: compaction: Allow compaction to isolate dirty pages
@ 2012-07-25 15:47     ` Greg KH
  0 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-25 15:47 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Mon, Jul 23, 2012 at 02:38:29PM +0100, Mel Gorman wrote:
> commit a77ebd333cd810d7b680d544be88c875131c2bd3 upstream.
> 
> Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
> 	information by reducing LRU list churning had the side-effect of
> 	reducing THP allocation success rates. This was part of a series
> 	to restore the success rates while preserving the reclaim fix.
> 
> Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
> noted that compaction does not migrate dirty or writeback pages and
> that is was meaningless to pick the page and re-add it to the LRU list.
> 
> What was missed during review is that asynchronous migration moves
> dirty pages if their ->migratepage callback is migrate_page() because
> these can be moved without blocking. This potentially impacted
> hugepage allocation success rates by a factor depending on how many
> dirty pages are in the system.
> 
> This patch partially reverts 39deaf85 to allow migration to isolate
> dirty pages again. This increases how much compaction disrupts the
> LRU but that is addressed later in the series.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Note, the changelog here differs from what is in Linus's tree by a LOT.
I took the version in Linus's tree instead.

greg k-h

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 15/34] mm: migration: clean up unmap_and_move()
  2012-07-25 15:45     ` Greg KH
@ 2012-07-25 16:04       ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-25 16:04 UTC (permalink / raw)
  To: Greg KH; +Cc: Stable, Linux-MM, LKML

On Wed, Jul 25, 2012 at 08:45:26AM -0700, Greg KH wrote:
> On Mon, Jul 23, 2012 at 02:38:28PM +0100, Mel Gorman wrote:
> > commit 0dabec93de633a87adfbbe1d800a4c56cd19d73b upstream.
> > 
> > Stable note: Not tracked in Bugzilla. This patch makes later patches
> > 	easier to apply but has no other impact.
> > 
> > unmap_and_move() is one a big messy function.  Clean it up.
> > 
> > Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Michal Hocko <mhocko@suse.cz>
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> > ---
> >  mm/migrate.c |   59 ++++++++++++++++++++++++++++++++--------------------------
> >  1 file changed, 33 insertions(+), 26 deletions(-)
> 
> Mel, you didn't sign-off-on this patch.  Any reason why?
> 

Another patch that was merged to the distribution kernel before picked
up by mainline. In this case, I copied across the signed-off-bys and
missed my own

Signed-off-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 15/34] mm: migration: clean up unmap_and_move()
@ 2012-07-25 16:04       ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-25 16:04 UTC (permalink / raw)
  To: Greg KH; +Cc: Stable, Linux-MM, LKML

On Wed, Jul 25, 2012 at 08:45:26AM -0700, Greg KH wrote:
> On Mon, Jul 23, 2012 at 02:38:28PM +0100, Mel Gorman wrote:
> > commit 0dabec93de633a87adfbbe1d800a4c56cd19d73b upstream.
> > 
> > Stable note: Not tracked in Bugzilla. This patch makes later patches
> > 	easier to apply but has no other impact.
> > 
> > unmap_and_move() is one a big messy function.  Clean it up.
> > 
> > Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Michal Hocko <mhocko@suse.cz>
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> > ---
> >  mm/migrate.c |   59 ++++++++++++++++++++++++++++++++--------------------------
> >  1 file changed, 33 insertions(+), 26 deletions(-)
> 
> Mel, you didn't sign-off-on this patch.  Any reason why?
> 

Another patch that was merged to the distribution kernel before picked
up by mainline. In this case, I copied across the signed-off-bys and
missed my own

Signed-off-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 16/34] mm: compaction: Allow compaction to isolate dirty pages
  2012-07-25 15:47     ` Greg KH
@ 2012-07-25 16:07       ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-25 16:07 UTC (permalink / raw)
  To: Greg KH; +Cc: Stable, Linux-MM, LKML

On Wed, Jul 25, 2012 at 08:47:45AM -0700, Greg KH wrote:
> On Mon, Jul 23, 2012 at 02:38:29PM +0100, Mel Gorman wrote:
> > commit a77ebd333cd810d7b680d544be88c875131c2bd3 upstream.
> > 
> > Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
> > 	information by reducing LRU list churning had the side-effect of
> > 	reducing THP allocation success rates. This was part of a series
> > 	to restore the success rates while preserving the reclaim fix.
> > 
> > Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
> > noted that compaction does not migrate dirty or writeback pages and
> > that is was meaningless to pick the page and re-add it to the LRU list.
> > 
> > What was missed during review is that asynchronous migration moves
> > dirty pages if their ->migratepage callback is migrate_page() because
> > these can be moved without blocking. This potentially impacted
> > hugepage allocation success rates by a factor depending on how many
> > dirty pages are in the system.
> > 
> > This patch partially reverts 39deaf85 to allow migration to isolate
> > dirty pages again. This increases how much compaction disrupts the
> > LRU but that is addressed later in the series.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> > Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> 
> Note, the changelog here differs from what is in Linus's tree by a LOT.
> I took the version in Linus's tree instead.
> 

Yet another case of where the distribution kernel got the patch first
and I mucked up the transfer back.  In this case the mainline changelog
includes the patch leader with a lot of additional information.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 16/34] mm: compaction: Allow compaction to isolate dirty pages
@ 2012-07-25 16:07       ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-25 16:07 UTC (permalink / raw)
  To: Greg KH; +Cc: Stable, Linux-MM, LKML

On Wed, Jul 25, 2012 at 08:47:45AM -0700, Greg KH wrote:
> On Mon, Jul 23, 2012 at 02:38:29PM +0100, Mel Gorman wrote:
> > commit a77ebd333cd810d7b680d544be88c875131c2bd3 upstream.
> > 
> > Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
> > 	information by reducing LRU list churning had the side-effect of
> > 	reducing THP allocation success rates. This was part of a series
> > 	to restore the success rates while preserving the reclaim fix.
> > 
> > Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
> > noted that compaction does not migrate dirty or writeback pages and
> > that is was meaningless to pick the page and re-add it to the LRU list.
> > 
> > What was missed during review is that asynchronous migration moves
> > dirty pages if their ->migratepage callback is migrate_page() because
> > these can be moved without blocking. This potentially impacted
> > hugepage allocation success rates by a factor depending on how many
> > dirty pages are in the system.
> > 
> > This patch partially reverts 39deaf85 to allow migration to isolate
> > dirty pages again. This increases how much compaction disrupts the
> > LRU but that is addressed later in the series.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> > Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> 
> Note, the changelog here differs from what is in Linus's tree by a LOT.
> I took the version in Linus's tree instead.
> 

Yet another case of where the distribution kernel got the patch first
and I mucked up the transfer back.  In this case the mainline changelog
includes the patch leader with a lot of additional information.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 15/34] mm: migration: clean up unmap_and_move()
  2012-07-25 16:04       ` Mel Gorman
@ 2012-07-25 18:03         ` Greg KH
  -1 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-25 18:03 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Wed, Jul 25, 2012 at 05:04:34PM +0100, Mel Gorman wrote:
> On Wed, Jul 25, 2012 at 08:45:26AM -0700, Greg KH wrote:
> > On Mon, Jul 23, 2012 at 02:38:28PM +0100, Mel Gorman wrote:
> > > commit 0dabec93de633a87adfbbe1d800a4c56cd19d73b upstream.
> > > 
> > > Stable note: Not tracked in Bugzilla. This patch makes later patches
> > > 	easier to apply but has no other impact.
> > > 
> > > unmap_and_move() is one a big messy function.  Clean it up.
> > > 
> > > Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> > > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > Cc: Mel Gorman <mgorman@suse.de>
> > > Cc: Rik van Riel <riel@redhat.com>
> > > Cc: Michal Hocko <mhocko@suse.cz>
> > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> > > ---
> > >  mm/migrate.c |   59 ++++++++++++++++++++++++++++++++--------------------------
> > >  1 file changed, 33 insertions(+), 26 deletions(-)
> > 
> > Mel, you didn't sign-off-on this patch.  Any reason why?
> > 
> 
> Another patch that was merged to the distribution kernel before picked
> up by mainline. In this case, I copied across the signed-off-bys and
> missed my own
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Thanks, I've now added it.

greg k-h

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 15/34] mm: migration: clean up unmap_and_move()
@ 2012-07-25 18:03         ` Greg KH
  0 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-25 18:03 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Wed, Jul 25, 2012 at 05:04:34PM +0100, Mel Gorman wrote:
> On Wed, Jul 25, 2012 at 08:45:26AM -0700, Greg KH wrote:
> > On Mon, Jul 23, 2012 at 02:38:28PM +0100, Mel Gorman wrote:
> > > commit 0dabec93de633a87adfbbe1d800a4c56cd19d73b upstream.
> > > 
> > > Stable note: Not tracked in Bugzilla. This patch makes later patches
> > > 	easier to apply but has no other impact.
> > > 
> > > unmap_and_move() is one a big messy function.  Clean it up.
> > > 
> > > Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> > > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > Cc: Mel Gorman <mgorman@suse.de>
> > > Cc: Rik van Riel <riel@redhat.com>
> > > Cc: Michal Hocko <mhocko@suse.cz>
> > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> > > ---
> > >  mm/migrate.c |   59 ++++++++++++++++++++++++++++++++--------------------------
> > >  1 file changed, 33 insertions(+), 26 deletions(-)
> > 
> > Mel, you didn't sign-off-on this patch.  Any reason why?
> > 
> 
> Another patch that was merged to the distribution kernel before picked
> up by mainline. In this case, I copied across the signed-off-bys and
> missed my own
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Thanks, I've now added it.

greg k-h

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 25/34] mm: vmscan: Check if reclaim should really abort even if compaction_ready() is true for one zone
  2012-07-23 13:38   ` Mel Gorman
@ 2012-07-25 19:51     ` Greg KH
  -1 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-25 19:51 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Mon, Jul 23, 2012 at 02:38:38PM +0100, Mel Gorman wrote:
> commit 0cee34fd72c582b4f8ad8ce00645b75fb4168199 upstream.
> 
> Stable note: Not tracked on Bugzilla. THP and compaction was found to
> 	aggressively reclaim pages and stall systems under different
> 	situations that was addressed piecemeal over time.
> 
> If compaction can proceed for a given zone, shrink_zones() does not
> reclaim any more pages from it. After commit [e0c2327: vmscan: abort
> reclaim/compaction if compaction can proceed], do_try_to_free_pages()
> tries to finish as soon as possible once one zone can compact.
> 
> This was intended to prevent slabs being shrunk unnecessarily but
> there are side-effects. One is that a small zone that is ready for
> compaction will abort reclaim even if the chances of successfully
> allocating a THP from that zone is small. It also means that reclaim
> can return too early even though sc->nr_to_reclaim pages were not
> reclaimed.
> 
> This partially reverts the commit until it is proven that slabs are
> really being shrunk unnecessarily but preserves the check to return
> 1 to avoid OOM if reclaim was aborted prematurely.
> 
> [aarcange@redhat.com: This patch replaces a revert from Andrea]
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Dave Jones <davej@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Andy Isaacson <adi@hexapodia.org>
> Cc: Nai Xia <nai.xia@gmail.com>
> Cc: Johannes Weiner <jweiner@redhat.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |   19 +++++++++----------
>  1 file changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f109f2d..bc31f32 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2129,7 +2129,8 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
>   *
>   * This function returns true if a zone is being reclaimed for a costly
>   * allocation and compaction is ready to begin. This indicates to the caller
> - * that it should retry the allocation or fail.
> + * that it should consider retrying the allocation instead of
> + * further reclaim.
>   */
>  static bool shrink_zones(int priority, struct zonelist *zonelist,
>  					struct scan_control *sc)

This hunk didn't apply (the original commit from Linus's tree also
didn't apply due to some context changes in the rest of the patch.)  So
I took the original comment changes from Linus's tree, and the context
changes from this one and applied that.

Franken-patches, the story of my life...

greg k-h

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 25/34] mm: vmscan: Check if reclaim should really abort even if compaction_ready() is true for one zone
@ 2012-07-25 19:51     ` Greg KH
  0 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-25 19:51 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Mon, Jul 23, 2012 at 02:38:38PM +0100, Mel Gorman wrote:
> commit 0cee34fd72c582b4f8ad8ce00645b75fb4168199 upstream.
> 
> Stable note: Not tracked on Bugzilla. THP and compaction was found to
> 	aggressively reclaim pages and stall systems under different
> 	situations that was addressed piecemeal over time.
> 
> If compaction can proceed for a given zone, shrink_zones() does not
> reclaim any more pages from it. After commit [e0c2327: vmscan: abort
> reclaim/compaction if compaction can proceed], do_try_to_free_pages()
> tries to finish as soon as possible once one zone can compact.
> 
> This was intended to prevent slabs being shrunk unnecessarily but
> there are side-effects. One is that a small zone that is ready for
> compaction will abort reclaim even if the chances of successfully
> allocating a THP from that zone is small. It also means that reclaim
> can return too early even though sc->nr_to_reclaim pages were not
> reclaimed.
> 
> This partially reverts the commit until it is proven that slabs are
> really being shrunk unnecessarily but preserves the check to return
> 1 to avoid OOM if reclaim was aborted prematurely.
> 
> [aarcange@redhat.com: This patch replaces a revert from Andrea]
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Dave Jones <davej@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Andy Isaacson <adi@hexapodia.org>
> Cc: Nai Xia <nai.xia@gmail.com>
> Cc: Johannes Weiner <jweiner@redhat.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |   19 +++++++++----------
>  1 file changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f109f2d..bc31f32 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2129,7 +2129,8 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
>   *
>   * This function returns true if a zone is being reclaimed for a costly
>   * allocation and compaction is ready to begin. This indicates to the caller
> - * that it should retry the allocation or fail.
> + * that it should consider retrying the allocation instead of
> + * further reclaim.
>   */
>  static bool shrink_zones(int priority, struct zonelist *zonelist,
>  					struct scan_control *sc)

This hunk didn't apply (the original commit from Linus's tree also
didn't apply due to some context changes in the rest of the patch.)  So
I took the original comment changes from Linus's tree, and the context
changes from this one and applied that.

Franken-patches, the story of my life...

greg k-h

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 30/34] mm: vmscan: Do not force kswapd to scan small targets
  2012-07-23 13:38   ` Mel Gorman
@ 2012-07-25 19:59     ` Greg KH
  -1 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-25 19:59 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Mon, Jul 23, 2012 at 02:38:43PM +0100, Mel Gorman wrote:
> commit ad2b8e601099a23dffffb53f91c18d874fe98854 upstream - WARNING: this is a substitute patch.
> 
> Stable note: Not tracked in Bugzilla. This is a substitute for an
> 	upstream commit addressing a completely different issue that
> 	accidentally contained an important fix. The workload this patch
> 	helps was memcached when IO is started in the background. memcached
> 	should stay resident but without this patch it gets swapped more
> 	than it should. Sometimes this manifests as a drop in throughput
> 	but mostly it was observed through /proc/vmstat.
> 
> Commit [246e87a9: memcg: fix get_scan_count() for small targets] was
> meant to fix a problem whereby small scan targets on memcg were ignored
> causing priority to raise too sharply. It forced scanning to take place
> if the target was small, memcg or kswapd.
> 
> >From the time it was introduced it cause excessive reclaim by kswapd
> with workloads being pushed to swap that previously would have stayed
> resident. This was accidentally fixed by commit [ad2b8e60: mm: memcg:
> remove optimization of keeping the root_mem_cgroup LRU lists empty] but
> that patchset is not suitable for backporting.
> 
> The original patch came with no information on what workloads it benefits
> but the cost of it is obvious in that it forces scanning to take place
> on lists that would otherwise have been ignored such as small anonymous
> inactive lists. This patch partially reverts 246e87a9 so that small lists
> are not force scanned which means that IO-intensive workloads with small
> amounts of anonymous memory will not be swapped.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |    3 ---
>  1 file changed, 3 deletions(-)

I don't understand this patch.  The original
ad2b8e601099a23dffffb53f91c18d874fe98854 commit touched the file
mm/memcontrol.c and seemed to do something quite different from what you
have done below.

I'm all for fixing things in a different way than what was done in
Linus's tree, IF there is a reason to, but the comparison between these
two patches (yours and upstream) are not making any sense at all.

confused,

greg k-h

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 30/34] mm: vmscan: Do not force kswapd to scan small targets
@ 2012-07-25 19:59     ` Greg KH
  0 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-25 19:59 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Mon, Jul 23, 2012 at 02:38:43PM +0100, Mel Gorman wrote:
> commit ad2b8e601099a23dffffb53f91c18d874fe98854 upstream - WARNING: this is a substitute patch.
> 
> Stable note: Not tracked in Bugzilla. This is a substitute for an
> 	upstream commit addressing a completely different issue that
> 	accidentally contained an important fix. The workload this patch
> 	helps was memcached when IO is started in the background. memcached
> 	should stay resident but without this patch it gets swapped more
> 	than it should. Sometimes this manifests as a drop in throughput
> 	but mostly it was observed through /proc/vmstat.
> 
> Commit [246e87a9: memcg: fix get_scan_count() for small targets] was
> meant to fix a problem whereby small scan targets on memcg were ignored
> causing priority to raise too sharply. It forced scanning to take place
> if the target was small, memcg or kswapd.
> 
> >From the time it was introduced it cause excessive reclaim by kswapd
> with workloads being pushed to swap that previously would have stayed
> resident. This was accidentally fixed by commit [ad2b8e60: mm: memcg:
> remove optimization of keeping the root_mem_cgroup LRU lists empty] but
> that patchset is not suitable for backporting.
> 
> The original patch came with no information on what workloads it benefits
> but the cost of it is obvious in that it forces scanning to take place
> on lists that would otherwise have been ignored such as small anonymous
> inactive lists. This patch partially reverts 246e87a9 so that small lists
> are not force scanned which means that IO-intensive workloads with small
> amounts of anonymous memory will not be swapped.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |    3 ---
>  1 file changed, 3 deletions(-)

I don't understand this patch.  The original
ad2b8e601099a23dffffb53f91c18d874fe98854 commit touched the file
mm/memcontrol.c and seemed to do something quite different from what you
have done below.

I'm all for fixing things in a different way than what was done in
Linus's tree, IF there is a reason to, but the comparison between these
two patches (yours and upstream) are not making any sense at all.

confused,

greg k-h

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 30/34] mm: vmscan: Do not force kswapd to scan small targets
  2012-07-25 19:59     ` Greg KH
@ 2012-07-25 21:35       ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-25 21:35 UTC (permalink / raw)
  To: Greg KH; +Cc: Stable, Linux-MM, LKML

On Wed, Jul 25, 2012 at 12:59:48PM -0700, Greg KH wrote:
> On Mon, Jul 23, 2012 at 02:38:43PM +0100, Mel Gorman wrote:
> > commit ad2b8e601099a23dffffb53f91c18d874fe98854 upstream - WARNING: this is a substitute patch.
> > 
> > Stable note: Not tracked in Bugzilla. This is a substitute for an
> > 	upstream commit addressing a completely different issue that
> > 	accidentally contained an important fix. The workload this patch
> > 	helps was memcached when IO is started in the background. memcached
> > 	should stay resident but without this patch it gets swapped more
> > 	than it should. Sometimes this manifests as a drop in throughput
> > 	but mostly it was observed through /proc/vmstat.
> > 
> > Commit [246e87a9: memcg: fix get_scan_count() for small targets] was
> > meant to fix a problem whereby small scan targets on memcg were ignored
> > causing priority to raise too sharply. It forced scanning to take place
> > if the target was small, memcg or kswapd.
> > 
> > >From the time it was introduced it cause excessive reclaim by kswapd
> > with workloads being pushed to swap that previously would have stayed
> > resident. This was accidentally fixed by commit [ad2b8e60: mm: memcg:
> > remove optimization of keeping the root_mem_cgroup LRU lists empty] but
> > that patchset is not suitable for backporting.
> > 
> > The original patch came with no information on what workloads it benefits
> > but the cost of it is obvious in that it forces scanning to take place
> > on lists that would otherwise have been ignored such as small anonymous
> > inactive lists. This patch partially reverts 246e87a9 so that small lists
> > are not force scanned which means that IO-intensive workloads with small
> > amounts of anonymous memory will not be swapped.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  mm/vmscan.c |    3 ---
> >  1 file changed, 3 deletions(-)
> 
> I don't understand this patch.  The original
> ad2b8e601099a23dffffb53f91c18d874fe98854 commit touched the file
> mm/memcontrol.c and seemed to do something quite different from what you
> have done below.
> 

The main problem is I'm an idiot and "missed" when copying&paste and followed
through with the mistake. The actual commit of interest was the one after it
[b95a2f2d: mm: vmscan: convert global reclaim to per-memcg LRU lists]

That patch has this hunk in it

@@ -1886,7 +1886,7 @@ static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
         * latencies, so it's better to scan a minimum amount there as
         * well.
         */
-       if (current_is_kswapd())
+       if (current_is_kswapd() && mz->zone->all_unreclaimable)
                force_scan = true;
        if (!global_reclaim(sc))
                force_scan = true;

This change makes it very difficult for kswapd to force scan which was
the fix I was interested in but the series is not suitable for backport.
This has changed again since in 3.5-rc1 due to commit [90126375: mm/vmscan:
push lruvec pointer into get_scan_count()] where this check became

	if (current_is_kswapd() && zone->all_unreclaimable)

Superficially that looks ok to backport, but it's not due to a subtle
difference in how zone is looked up in the new context.

Can you use this patch as a replacement? It is functionally much closer
to what happens upstream while still backporting the actual fix of
interest.

---8<---
mm: vmscan: Do not force kswapd to scan small targets

commit b95a2f2d486d0d768a92879c023a03757b9c7e58 upstream - WARNING: this is a substitute patch.

Stable note: Not tracked in Bugzilla. This is a partial backport of an
        upstream commit addressing a completely different issue
        that accidentally contained an important fix. The workload
        this patch helps was memcached when IO is started in the
        background. memcached should stay resident but without this patch
        it gets swapped. Sometimes this manifests as a drop in throughput
        but mostly it was observed through /proc/vmstat.

Commit [246e87a9: memcg: fix get_scan_count() for small targets] was meant
to fix a problem whereby small scan targets on memcg were ignored causing
priority to raise too sharply. It forced scanning to take place if the
target was small, memcg or kswapd.

>From the time it was introduced it caused excessive reclaim by kswapd
with workloads being pushed to swap that previously would have stayed
resident. This was accidentally fixed in commit [b95a2f2d: mm: vmscan:
convert global reclaim to per-memcg LRU lists] by making it harder for
kswapd to force scan small targets but that patchset is not suitable for
backporting. This was later changed again by commit [90126375: mm/vmscan:
push lruvec pointer into get_scan_count()] into a format that looks
like it would be a straight-forward backport but there is a subtle
difference due to the use of lruvecs.

The impact of the accidental fix is to make it harder for kswapd to force
scan small targets by taking zone->all_unreclaimable into account. This
patch is the closest equivalent available based on what is backported.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42d2a5e..e0afff3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1850,7 +1850,8 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 	unsigned long nr_force_scan[2];
 
 	/* kswapd does zone balancing and needs to scan this zone */
-	if (scanning_global_lru(sc) && current_is_kswapd())
+	if (scanning_global_lru(sc) && current_is_kswapd() &&
+	    zone->all_unreclaimable)
 		force_scan = true;
 	/* memcg may have small limit and need to avoid priority drop */
 	if (!scanning_global_lru(sc))

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* Re: [PATCH 30/34] mm: vmscan: Do not force kswapd to scan small targets
@ 2012-07-25 21:35       ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-25 21:35 UTC (permalink / raw)
  To: Greg KH; +Cc: Stable, Linux-MM, LKML

On Wed, Jul 25, 2012 at 12:59:48PM -0700, Greg KH wrote:
> On Mon, Jul 23, 2012 at 02:38:43PM +0100, Mel Gorman wrote:
> > commit ad2b8e601099a23dffffb53f91c18d874fe98854 upstream - WARNING: this is a substitute patch.
> > 
> > Stable note: Not tracked in Bugzilla. This is a substitute for an
> > 	upstream commit addressing a completely different issue that
> > 	accidentally contained an important fix. The workload this patch
> > 	helps was memcached when IO is started in the background. memcached
> > 	should stay resident but without this patch it gets swapped more
> > 	than it should. Sometimes this manifests as a drop in throughput
> > 	but mostly it was observed through /proc/vmstat.
> > 
> > Commit [246e87a9: memcg: fix get_scan_count() for small targets] was
> > meant to fix a problem whereby small scan targets on memcg were ignored
> > causing priority to raise too sharply. It forced scanning to take place
> > if the target was small, memcg or kswapd.
> > 
> > >From the time it was introduced it cause excessive reclaim by kswapd
> > with workloads being pushed to swap that previously would have stayed
> > resident. This was accidentally fixed by commit [ad2b8e60: mm: memcg:
> > remove optimization of keeping the root_mem_cgroup LRU lists empty] but
> > that patchset is not suitable for backporting.
> > 
> > The original patch came with no information on what workloads it benefits
> > but the cost of it is obvious in that it forces scanning to take place
> > on lists that would otherwise have been ignored such as small anonymous
> > inactive lists. This patch partially reverts 246e87a9 so that small lists
> > are not force scanned which means that IO-intensive workloads with small
> > amounts of anonymous memory will not be swapped.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  mm/vmscan.c |    3 ---
> >  1 file changed, 3 deletions(-)
> 
> I don't understand this patch.  The original
> ad2b8e601099a23dffffb53f91c18d874fe98854 commit touched the file
> mm/memcontrol.c and seemed to do something quite different from what you
> have done below.
> 

The main problem is I'm an idiot and "missed" when copying&paste and followed
through with the mistake. The actual commit of interest was the one after it
[b95a2f2d: mm: vmscan: convert global reclaim to per-memcg LRU lists]

That patch has this hunk in it

@@ -1886,7 +1886,7 @@ static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
         * latencies, so it's better to scan a minimum amount there as
         * well.
         */
-       if (current_is_kswapd())
+       if (current_is_kswapd() && mz->zone->all_unreclaimable)
                force_scan = true;
        if (!global_reclaim(sc))
                force_scan = true;

This change makes it very difficult for kswapd to force scan which was
the fix I was interested in but the series is not suitable for backport.
This has changed again since in 3.5-rc1 due to commit [90126375: mm/vmscan:
push lruvec pointer into get_scan_count()] where this check became

	if (current_is_kswapd() && zone->all_unreclaimable)

Superficially that looks ok to backport, but it's not due to a subtle
difference in how zone is looked up in the new context.

Can you use this patch as a replacement? It is functionally much closer
to what happens upstream while still backporting the actual fix of
interest.

---8<---
mm: vmscan: Do not force kswapd to scan small targets

commit b95a2f2d486d0d768a92879c023a03757b9c7e58 upstream - WARNING: this is a substitute patch.

Stable note: Not tracked in Bugzilla. This is a partial backport of an
        upstream commit addressing a completely different issue
        that accidentally contained an important fix. The workload
        this patch helps was memcached when IO is started in the
        background. memcached should stay resident but without this patch
        it gets swapped. Sometimes this manifests as a drop in throughput
        but mostly it was observed through /proc/vmstat.

Commit [246e87a9: memcg: fix get_scan_count() for small targets] was meant
to fix a problem whereby small scan targets on memcg were ignored causing
priority to raise too sharply. It forced scanning to take place if the
target was small, memcg or kswapd.

>From the time it was introduced it caused excessive reclaim by kswapd
with workloads being pushed to swap that previously would have stayed
resident. This was accidentally fixed in commit [b95a2f2d: mm: vmscan:
convert global reclaim to per-memcg LRU lists] by making it harder for
kswapd to force scan small targets but that patchset is not suitable for
backporting. This was later changed again by commit [90126375: mm/vmscan:
push lruvec pointer into get_scan_count()] into a format that looks
like it would be a straight-forward backport but there is a subtle
difference due to the use of lruvecs.

The impact of the accidental fix is to make it harder for kswapd to force
scan small targets by taking zone->all_unreclaimable into account. This
patch is the closest equivalent available based on what is backported.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42d2a5e..e0afff3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1850,7 +1850,8 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 	unsigned long nr_force_scan[2];
 
 	/* kswapd does zone balancing and needs to scan this zone */
-	if (scanning_global_lru(sc) && current_is_kswapd())
+	if (scanning_global_lru(sc) && current_is_kswapd() &&
+	    zone->all_unreclaimable)
 		force_scan = true;
 	/* memcg may have small limit and need to avoid priority drop */
 	if (!scanning_global_lru(sc))

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* Re: [PATCH 30/34] mm: vmscan: Do not force kswapd to scan small targets
  2012-07-25 21:35       ` Mel Gorman
@ 2012-07-25 21:44         ` Greg KH
  -1 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-25 21:44 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Wed, Jul 25, 2012 at 10:35:08PM +0100, Mel Gorman wrote:
> On Wed, Jul 25, 2012 at 12:59:48PM -0700, Greg KH wrote:
> > On Mon, Jul 23, 2012 at 02:38:43PM +0100, Mel Gorman wrote:
> > > commit ad2b8e601099a23dffffb53f91c18d874fe98854 upstream - WARNING: this is a substitute patch.
> > > 
> > > Stable note: Not tracked in Bugzilla. This is a substitute for an
> > > 	upstream commit addressing a completely different issue that
> > > 	accidentally contained an important fix. The workload this patch
> > > 	helps was memcached when IO is started in the background. memcached
> > > 	should stay resident but without this patch it gets swapped more
> > > 	than it should. Sometimes this manifests as a drop in throughput
> > > 	but mostly it was observed through /proc/vmstat.
> > > 
> > > Commit [246e87a9: memcg: fix get_scan_count() for small targets] was
> > > meant to fix a problem whereby small scan targets on memcg were ignored
> > > causing priority to raise too sharply. It forced scanning to take place
> > > if the target was small, memcg or kswapd.
> > > 
> > > >From the time it was introduced it cause excessive reclaim by kswapd
> > > with workloads being pushed to swap that previously would have stayed
> > > resident. This was accidentally fixed by commit [ad2b8e60: mm: memcg:
> > > remove optimization of keeping the root_mem_cgroup LRU lists empty] but
> > > that patchset is not suitable for backporting.
> > > 
> > > The original patch came with no information on what workloads it benefits
> > > but the cost of it is obvious in that it forces scanning to take place
> > > on lists that would otherwise have been ignored such as small anonymous
> > > inactive lists. This patch partially reverts 246e87a9 so that small lists
> > > are not force scanned which means that IO-intensive workloads with small
> > > amounts of anonymous memory will not be swapped.
> > > 
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > ---
> > >  mm/vmscan.c |    3 ---
> > >  1 file changed, 3 deletions(-)
> > 
> > I don't understand this patch.  The original
> > ad2b8e601099a23dffffb53f91c18d874fe98854 commit touched the file
> > mm/memcontrol.c and seemed to do something quite different from what you
> > have done below.
> > 
> 
> The main problem is I'm an idiot and "missed" when copying&paste and followed
> through with the mistake. The actual commit of interest was the one after it
> [b95a2f2d: mm: vmscan: convert global reclaim to per-memcg LRU lists]
> 
> That patch has this hunk in it
> 
> @@ -1886,7 +1886,7 @@ static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
>          * latencies, so it's better to scan a minimum amount there as
>          * well.
>          */
> -       if (current_is_kswapd())
> +       if (current_is_kswapd() && mz->zone->all_unreclaimable)
>                 force_scan = true;
>         if (!global_reclaim(sc))
>                 force_scan = true;
> 
> This change makes it very difficult for kswapd to force scan which was
> the fix I was interested in but the series is not suitable for backport.
> This has changed again since in 3.5-rc1 due to commit [90126375: mm/vmscan:
> push lruvec pointer into get_scan_count()] where this check became
> 
> 	if (current_is_kswapd() && zone->all_unreclaimable)
> 
> Superficially that looks ok to backport, but it's not due to a subtle
> difference in how zone is looked up in the new context.
> 
> Can you use this patch as a replacement? It is functionally much closer
> to what happens upstream while still backporting the actual fix of
> interest.

Yes, that makes more sense as that is what the patch you included does
:)

I'll go queue it up now, thanks for the backport.

greg k-h

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 30/34] mm: vmscan: Do not force kswapd to scan small targets
@ 2012-07-25 21:44         ` Greg KH
  0 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-25 21:44 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Wed, Jul 25, 2012 at 10:35:08PM +0100, Mel Gorman wrote:
> On Wed, Jul 25, 2012 at 12:59:48PM -0700, Greg KH wrote:
> > On Mon, Jul 23, 2012 at 02:38:43PM +0100, Mel Gorman wrote:
> > > commit ad2b8e601099a23dffffb53f91c18d874fe98854 upstream - WARNING: this is a substitute patch.
> > > 
> > > Stable note: Not tracked in Bugzilla. This is a substitute for an
> > > 	upstream commit addressing a completely different issue that
> > > 	accidentally contained an important fix. The workload this patch
> > > 	helps was memcached when IO is started in the background. memcached
> > > 	should stay resident but without this patch it gets swapped more
> > > 	than it should. Sometimes this manifests as a drop in throughput
> > > 	but mostly it was observed through /proc/vmstat.
> > > 
> > > Commit [246e87a9: memcg: fix get_scan_count() for small targets] was
> > > meant to fix a problem whereby small scan targets on memcg were ignored
> > > causing priority to raise too sharply. It forced scanning to take place
> > > if the target was small, memcg or kswapd.
> > > 
> > > >From the time it was introduced it cause excessive reclaim by kswapd
> > > with workloads being pushed to swap that previously would have stayed
> > > resident. This was accidentally fixed by commit [ad2b8e60: mm: memcg:
> > > remove optimization of keeping the root_mem_cgroup LRU lists empty] but
> > > that patchset is not suitable for backporting.
> > > 
> > > The original patch came with no information on what workloads it benefits
> > > but the cost of it is obvious in that it forces scanning to take place
> > > on lists that would otherwise have been ignored such as small anonymous
> > > inactive lists. This patch partially reverts 246e87a9 so that small lists
> > > are not force scanned which means that IO-intensive workloads with small
> > > amounts of anonymous memory will not be swapped.
> > > 
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > ---
> > >  mm/vmscan.c |    3 ---
> > >  1 file changed, 3 deletions(-)
> > 
> > I don't understand this patch.  The original
> > ad2b8e601099a23dffffb53f91c18d874fe98854 commit touched the file
> > mm/memcontrol.c and seemed to do something quite different from what you
> > have done below.
> > 
> 
> The main problem is I'm an idiot and "missed" when copying&paste and followed
> through with the mistake. The actual commit of interest was the one after it
> [b95a2f2d: mm: vmscan: convert global reclaim to per-memcg LRU lists]
> 
> That patch has this hunk in it
> 
> @@ -1886,7 +1886,7 @@ static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
>          * latencies, so it's better to scan a minimum amount there as
>          * well.
>          */
> -       if (current_is_kswapd())
> +       if (current_is_kswapd() && mz->zone->all_unreclaimable)
>                 force_scan = true;
>         if (!global_reclaim(sc))
>                 force_scan = true;
> 
> This change makes it very difficult for kswapd to force scan which was
> the fix I was interested in but the series is not suitable for backport.
> This has changed again since in 3.5-rc1 due to commit [90126375: mm/vmscan:
> push lruvec pointer into get_scan_count()] where this check became
> 
> 	if (current_is_kswapd() && zone->all_unreclaimable)
> 
> Superficially that looks ok to backport, but it's not due to a subtle
> difference in how zone is looked up in the new context.
> 
> Can you use this patch as a replacement? It is functionally much closer
> to what happens upstream while still backporting the actual fix of
> interest.

Yes, that makes more sense as that is what the patch you included does
:)

I'll go queue it up now, thanks for the backport.

greg k-h

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
  2012-07-23 13:38 ` Mel Gorman
@ 2012-07-25 22:30   ` Greg KH
  -1 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-25 22:30 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Mon, Jul 23, 2012 at 02:38:13PM +0100, Mel Gorman wrote:
> Changelog since V1
>   o Expand some of the notes					(jrnieder)
>   o Correct upstream commit SHA1				(hugh)
> 
> This series is related to the new addition to stable_kernel_rules.txt
> 
>  - Serious issues as reported by a user of a distribution kernel may also
>    be considered if they fix a notable performance or interactivity issue.
>    As these fixes are not as obvious and have a higher risk of a subtle
>    regression they should only be submitted by a distribution kernel
>    maintainer and include an addendum linking to a bugzilla entry if it
>    exists and additional information on the user-visible impact.
> 
> All of these patches have been backported to a distribution kernel and
> address some sort of performance issue in the VM. As they are not all
> obvious, I've added a "Stable note" to the top of each patch giving
> additional information on why the patch was backported. Lets see where
> the boundaries lie on how this new rule is interpreted in practice :).
> 
> Patch 1	Performance fix for tmpfs
> Patch 2 Memory hotadd fix
> Patch 3 Reduce boot time on large machines
> Patches 4-5 Reduce stalls for wait_iff_congested
> Patches 6-8 Reduce excessive reclaim of slab objects which for some workloads
> 	will reduce the amount of IO required
> Patches 9-10 limits the amount of page reclaim when THP/Compaction is active.
> 	Excessive reclaim in low memory situations can lead to stalls some
> 	of which are user visible.
> Patches 11-19 reduce the amount of churn of the LRU lists. Poor reclaim
> 	decisions can impair workloads in different ways and there have
> 	been complaints recently the reclaim decisions of modern kernels
> 	are worse than older ones.
> Patches 20-21 reduce the amount of CPU kswapd uses in some cases. This
> 	is harder to trigger but were developed due to bug reports about
> 	100% CPU usage from kswapd.
> Patches 22-25 are mostly related to interactivity when THP is enabled.
> Patches 26-30 are also related to page reclaim decisions, particularly
> 	the residency of mapped pages.
> Patches 31-34 fix a major page allocator performance regression
> 
> All of the patches will apply to 3.0-stable but the ordering of the
> patches is such that applying them to 3.2-stable and 3.4-stable should
> be straight-forward.

I can't find any of these that should have gone to 3.4-stable, given
that they all were included in 3.4 already, right?

I've queued up the whole lot for the 3.0-stable tree, thanks so much for
providing them.

greg k-h

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
@ 2012-07-25 22:30   ` Greg KH
  0 siblings, 0 replies; 125+ messages in thread
From: Greg KH @ 2012-07-25 22:30 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

On Mon, Jul 23, 2012 at 02:38:13PM +0100, Mel Gorman wrote:
> Changelog since V1
>   o Expand some of the notes					(jrnieder)
>   o Correct upstream commit SHA1				(hugh)
> 
> This series is related to the new addition to stable_kernel_rules.txt
> 
>  - Serious issues as reported by a user of a distribution kernel may also
>    be considered if they fix a notable performance or interactivity issue.
>    As these fixes are not as obvious and have a higher risk of a subtle
>    regression they should only be submitted by a distribution kernel
>    maintainer and include an addendum linking to a bugzilla entry if it
>    exists and additional information on the user-visible impact.
> 
> All of these patches have been backported to a distribution kernel and
> address some sort of performance issue in the VM. As they are not all
> obvious, I've added a "Stable note" to the top of each patch giving
> additional information on why the patch was backported. Lets see where
> the boundaries lie on how this new rule is interpreted in practice :).
> 
> Patch 1	Performance fix for tmpfs
> Patch 2 Memory hotadd fix
> Patch 3 Reduce boot time on large machines
> Patches 4-5 Reduce stalls for wait_iff_congested
> Patches 6-8 Reduce excessive reclaim of slab objects which for some workloads
> 	will reduce the amount of IO required
> Patches 9-10 limits the amount of page reclaim when THP/Compaction is active.
> 	Excessive reclaim in low memory situations can lead to stalls some
> 	of which are user visible.
> Patches 11-19 reduce the amount of churn of the LRU lists. Poor reclaim
> 	decisions can impair workloads in different ways and there have
> 	been complaints recently the reclaim decisions of modern kernels
> 	are worse than older ones.
> Patches 20-21 reduce the amount of CPU kswapd uses in some cases. This
> 	is harder to trigger but were developed due to bug reports about
> 	100% CPU usage from kswapd.
> Patches 22-25 are mostly related to interactivity when THP is enabled.
> Patches 26-30 are also related to page reclaim decisions, particularly
> 	the residency of mapped pages.
> Patches 31-34 fix a major page allocator performance regression
> 
> All of the patches will apply to 3.0-stable but the ordering of the
> patches is such that applying them to 3.2-stable and 3.4-stable should
> be straight-forward.

I can't find any of these that should have gone to 3.4-stable, given
that they all were included in 3.4 already, right?

I've queued up the whole lot for the 3.0-stable tree, thanks so much for
providing them.

greg k-h

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
  2012-07-25 22:30   ` Greg KH
@ 2012-07-25 22:48     ` Mel Gorman
  -1 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-25 22:48 UTC (permalink / raw)
  To: Greg KH; +Cc: Stable, Linux-MM, LKML

On Wed, Jul 25, 2012 at 03:30:57PM -0700, Greg KH wrote:
> > <SNIP>
> > All of the patches will apply to 3.0-stable but the ordering of the
> > patches is such that applying them to 3.2-stable and 3.4-stable should
> > be straight-forward.
> 
> I can't find any of these that should have gone to 3.4-stable, given
> that they all were included in 3.4 already, right?
> 

Yes, you're right.

At the time I wrote the changelog I had patches belonging to 3.5 included. I
later decided to drop them until after 3.5 was out. It was potentially
weird to have a 3.0-stable kernel with patches that were not in a released
3.x.0 kernel. Besides, they were very low priority. I forgot to update
the changelog to match.

> I've queued up the whole lot for the 3.0-stable tree, thanks so much for
> providing them.
> 

Thanks for reviewing them in detail and getting the flaws corrected.
I expect it'll be a bit more smooth if/when I do something like this again.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
@ 2012-07-25 22:48     ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-25 22:48 UTC (permalink / raw)
  To: Greg KH; +Cc: Stable, Linux-MM, LKML

On Wed, Jul 25, 2012 at 03:30:57PM -0700, Greg KH wrote:
> > <SNIP>
> > All of the patches will apply to 3.0-stable but the ordering of the
> > patches is such that applying them to 3.2-stable and 3.4-stable should
> > be straight-forward.
> 
> I can't find any of these that should have gone to 3.4-stable, given
> that they all were included in 3.4 already, right?
> 

Yes, you're right.

At the time I wrote the changelog I had patches belonging to 3.5 included. I
later decided to drop them until after 3.5 was out. It was potentially
weird to have a 3.0-stable kernel with patches that were not in a released
3.x.0 kernel. Besides, they were very low priority. I forgot to update
the changelog to match.

> I've queued up the whole lot for the 3.0-stable tree, thanks so much for
> providing them.
> 

Thanks for reviewing them in detail and getting the flaws corrected.
I expect it'll be a bit more smooth if/when I do something like this again.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/34] Memory management performance backports for -stable V2
  2012-07-23 13:38 ` Mel Gorman
                   ` (36 preceding siblings ...)
  (?)
@ 2012-07-30  1:13 ` Ben Hutchings
  -1 siblings, 0 replies; 125+ messages in thread
From: Ben Hutchings @ 2012-07-30  1:13 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, Linux-MM, LKML

[-- Attachment #1: Type: text/plain, Size: 2968 bytes --]

On Mon, 2012-07-23 at 14:38 +0100, Mel Gorman wrote:
> Changelog since V1
>   o Expand some of the notes					(jrnieder)
>   o Correct upstream commit SHA1				(hugh)
> 
> This series is related to the new addition to stable_kernel_rules.txt
> 
>  - Serious issues as reported by a user of a distribution kernel may also
>    be considered if they fix a notable performance or interactivity issue.
>    As these fixes are not as obvious and have a higher risk of a subtle
>    regression they should only be submitted by a distribution kernel
>    maintainer and include an addendum linking to a bugzilla entry if it
>    exists and additional information on the user-visible impact.
> 
> All of these patches have been backported to a distribution kernel and
> address some sort of performance issue in the VM. As they are not all
> obvious, I've added a "Stable note" to the top of each patch giving
> additional information on why the patch was backported. Lets see where
> the boundaries lie on how this new rule is interpreted in practice :).
>
> Patch 1	Performance fix for tmpfs
> Patch 2 Memory hotadd fix
> Patch 3 Reduce boot time on large machines
> Patches 4-5 Reduce stalls for wait_iff_congested
> Patches 6-8 Reduce excessive reclaim of slab objects which for some workloads
> 	will reduce the amount of IO required
> Patches 9-10 limits the amount of page reclaim when THP/Compaction is active.
> 	Excessive reclaim in low memory situations can lead to stalls some
> 	of which are user visible.
> Patches 11-19 reduce the amount of churn of the LRU lists. Poor reclaim
> 	decisions can impair workloads in different ways and there have
> 	been complaints recently the reclaim decisions of modern kernels
> 	are worse than older ones.
> Patches 20-21 reduce the amount of CPU kswapd uses in some cases. This
> 	is harder to trigger but were developed due to bug reports about
> 	100% CPU usage from kswapd.
> Patches 22-25 are mostly related to interactivity when THP is enabled.
> Patches 26-30 are also related to page reclaim decisions, particularly
> 	the residency of mapped pages.
> Patches 31-34 fix a major page allocator performance regression
[...]
> The patches are based on 3.0.36 but there should not be problems applying
> the series to later stable releases.
[...]

Patches 1-2, 4-15, 20-21, 31-32 correspond to commits included in Linux
3.2.  I've added the rest to the queue for 3.2.y, generally using the
versions Greg has queued for 3.0.39.

Patch 30 'mm: vmscan: convert global reclaim to per-memcg LRU lists'
needed a further context change.

For patch 33 'cpuset: mm: reduce large amounts of memory barrier related
damage v3' I folded in the two fixes Herton pointed out and you
acknowledged, and took the upstream version of the changes to
get_any_partial() in slub.c.

Ben.

-- 
Ben Hutchings
It is impossible to make anything foolproof because fools are so ingenious.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 06/34] vmscan: add shrink_slab tracepoints
  2012-07-20 15:54           ` Jonathan Nieder
@ 2012-07-23  9:20             ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-23  9:20 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Stable, LKML

On Fri, Jul 20, 2012 at 10:54:17AM -0500, Jonathan Nieder wrote:
> Mel Gorman wrote:
> > On Thu, Jul 19, 2012 at 05:07:21PM -0500, Jonathan Nieder wrote:
> 
> >> Some of the other patches of this type made sense, but I'd personally
> >> prefer if this one was dropped, yes.  Though I am just a nobody that
> >> reads patches rather than one of the relevant people. ;-)
> >
> > It's a valid point but I'm going to leave it in for now and see what the
> > general opinion is.
> 
> Ok.  To be more precise, this patch has two properties that other patches
> of the "make later patches easier to apply" class tend not to:
> 
>  * it introduces a significant functional change (adding tracepoints)
>  * it would have been very easy to skip
> 
> Have fun, and sorry for not explaining my reasoning before.
> 

Don't be sorry at all as your reasoning is solid. I'm not leaving the patch
in because I think you're wrong. I feel it is preferable to minimise the
deviation of upstream patches as much as possible but stable reviewers may
prefer to minimise functional change. We'll see where the consensus lies
if I leave the patch in but learn nothing if I take it out at this point.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 06/34] vmscan: add shrink_slab tracepoints
  2012-07-20 10:06         ` Mel Gorman
@ 2012-07-20 15:54           ` Jonathan Nieder
  2012-07-23  9:20             ` Mel Gorman
  0 siblings, 1 reply; 125+ messages in thread
From: Jonathan Nieder @ 2012-07-20 15:54 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, LKML

Mel Gorman wrote:
> On Thu, Jul 19, 2012 at 05:07:21PM -0500, Jonathan Nieder wrote:

>> Some of the other patches of this type made sense, but I'd personally
>> prefer if this one was dropped, yes.  Though I am just a nobody that
>> reads patches rather than one of the relevant people. ;-)
>
> It's a valid point but I'm going to leave it in for now and see what the
> general opinion is.

Ok.  To be more precise, this patch has two properties that other patches
of the "make later patches easier to apply" class tend not to:

 * it introduces a significant functional change (adding tracepoints)
 * it would have been very easy to skip

Have fun, and sorry for not explaining my reasoning before.

Ciao,
Jonathan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 06/34] vmscan: add shrink_slab tracepoints
  2012-07-19 22:07       ` Jonathan Nieder
@ 2012-07-20 10:06         ` Mel Gorman
  2012-07-20 15:54           ` Jonathan Nieder
  0 siblings, 1 reply; 125+ messages in thread
From: Mel Gorman @ 2012-07-20 10:06 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Stable, LKML

On Thu, Jul 19, 2012 at 05:07:21PM -0500, Jonathan Nieder wrote:
> Mel Gorman wrote:
> > On Thu, Jul 19, 2012 at 03:30:17PM -0500, Jonathan Nieder wrote:
> 
> >> It doesn't sound like it fixes a serious issue.
> >
> > You're right, it doesn't. There are a few patches in this series that
> > were applied because they made other patches easier to apply and this is
> > one of them.  I should have noted this properly. Unlike other patches of
> > this type in the series, this particular one would have been easy to work
> > around. How about this as an updated note or would you prefer it was
> > dropped entirely?
> 
> Some of the other patches of this type made sense, but I'd personally
> prefer if this one was dropped, yes.  Though I am just a nobody that
> reads patches rather than one of the relevant people. ;-)
> 

It's a valid point but I'm going to leave it in for now and see what the
general opinion is. I'm happy to go either way with this patch. I'll
repost shortly with the "stable note" updates and the linux-mm included
as I managed to screw up its mail address in the sending script.

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 06/34] vmscan: add shrink_slab tracepoints
  2012-07-19 22:04     ` Mel Gorman
@ 2012-07-19 22:07       ` Jonathan Nieder
  2012-07-20 10:06         ` Mel Gorman
  0 siblings, 1 reply; 125+ messages in thread
From: Jonathan Nieder @ 2012-07-19 22:07 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, LKML

Mel Gorman wrote:
> On Thu, Jul 19, 2012 at 03:30:17PM -0500, Jonathan Nieder wrote:

>> It doesn't sound like it fixes a serious issue.
>
> You're right, it doesn't. There are a few patches in this series that
> were applied because they made other patches easier to apply and this is
> one of them.  I should have noted this properly. Unlike other patches of
> this type in the series, this particular one would have been easy to work
> around. How about this as an updated note or would you prefer it was
> dropped entirely?

Some of the other patches of this type made sense, but I'd personally
prefer if this one was dropped, yes.  Though I am just a nobody that
reads patches rather than one of the relevant people. ;-)

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 06/34] vmscan: add shrink_slab tracepoints
  2012-07-19 20:30   ` Jonathan Nieder
@ 2012-07-19 22:04     ` Mel Gorman
  2012-07-19 22:07       ` Jonathan Nieder
  0 siblings, 1 reply; 125+ messages in thread
From: Mel Gorman @ 2012-07-19 22:04 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Stable, LKML

On Thu, Jul 19, 2012 at 03:30:17PM -0500, Jonathan Nieder wrote:
> Mel Gorman wrote:
> 
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > commit 095760730c1047c69159ce88021a7fa3833502c8 upstream.
> >
> > Stable note: Not tracked in Bugzilla. This is a diagnostic patch that
> > 	was part of a series addressing excessive slab shrinking after
> > 	GFP_NOFS failures. There is detailed information on the series'
> > 	motivation at https://lkml.org/lkml/2011/6/2/42 .
> 
> Thanks.  Why would we want this particular patch in stable@? 

That's a reasonable question and thanks for taking a look at this series.

> It doesn't sound like it fixes a serious issue.
> 

You're right, it doesn't. There are a few patches in this series that
were applied because they made other patches easier to apply and this is
one of them.  I should have noted this properly. Unlike other patches of
this type in the series, this particular one would have been easy to work
around. How about this as an updated note or would you prefer it was
dropped entirely?

Stable note: This patch makes later patches easier to apply but otherwise
	has little to justify it. It is a diagnostic patch that was part
	of a series addressing excessive slab shrinking after GFP_NOFS
	failures. There is detailed information on the series' motivation
	at https://lkml.org/lkml/2011/6/2/42 .

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 06/34] vmscan: add shrink_slab tracepoints
  2012-07-19 14:36   ` Mel Gorman
  (?)
@ 2012-07-19 20:30   ` Jonathan Nieder
  2012-07-19 22:04     ` Mel Gorman
  -1 siblings, 1 reply; 125+ messages in thread
From: Jonathan Nieder @ 2012-07-19 20:30 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stable, LKML

Mel Gorman wrote:

> From: Dave Chinner <dchinner@redhat.com>
>
> commit 095760730c1047c69159ce88021a7fa3833502c8 upstream.
>
> Stable note: Not tracked in Bugzilla. This is a diagnostic patch that
> 	was part of a series addressing excessive slab shrinking after
> 	GFP_NOFS failures. There is detailed information on the series'
> 	motivation at https://lkml.org/lkml/2011/6/2/42 .

Thanks.  Why would we want this particular patch in stable@?  It
doesn't sound like it fixes a serious issue.

Curious,
Jonathan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* [PATCH 06/34] vmscan: add shrink_slab tracepoints
  2012-07-19 14:36 [PATCH 00/34] Memory management performance backports for -stable Mel Gorman
@ 2012-07-19 14:36   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-19 14:36 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM <linux-mm, LKML, Mel Gorman

From: Dave Chinner <dchinner@redhat.com>

commit 095760730c1047c69159ce88021a7fa3833502c8 upstream.

Stable note: Not tracked in Bugzilla. This is a diagnostic patch that
	was part of a series addressing excessive slab shrinking after
	GFP_NOFS failures. There is detailed information on the series'
	motivation at https://lkml.org/lkml/2011/6/2/42 .

It is impossible to understand what the shrinkers are actually doing
without instrumenting the code, so add a some tracepoints to allow
insight to be gained.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/trace/events/vmscan.h |   77 +++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                   |    8 ++++-
 2 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index b2c33bd..36851f7 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -179,6 +179,83 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re
 	TP_ARGS(nr_reclaimed)
 );
 
+TRACE_EVENT(mm_shrink_slab_start,
+	TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
+		long nr_objects_to_shrink, unsigned long pgs_scanned,
+		unsigned long lru_pgs, unsigned long cache_items,
+		unsigned long long delta, unsigned long total_scan),
+
+	TP_ARGS(shr, sc, nr_objects_to_shrink, pgs_scanned, lru_pgs,
+		cache_items, delta, total_scan),
+
+	TP_STRUCT__entry(
+		__field(struct shrinker *, shr)
+		__field(void *, shrink)
+		__field(long, nr_objects_to_shrink)
+		__field(gfp_t, gfp_flags)
+		__field(unsigned long, pgs_scanned)
+		__field(unsigned long, lru_pgs)
+		__field(unsigned long, cache_items)
+		__field(unsigned long long, delta)
+		__field(unsigned long, total_scan)
+	),
+
+	TP_fast_assign(
+		__entry->shr = shr;
+		__entry->shrink = shr->shrink;
+		__entry->nr_objects_to_shrink = nr_objects_to_shrink;
+		__entry->gfp_flags = sc->gfp_mask;
+		__entry->pgs_scanned = pgs_scanned;
+		__entry->lru_pgs = lru_pgs;
+		__entry->cache_items = cache_items;
+		__entry->delta = delta;
+		__entry->total_scan = total_scan;
+	),
+
+	TP_printk("%pF %p: objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
+		__entry->shrink,
+		__entry->shr,
+		__entry->nr_objects_to_shrink,
+		show_gfp_flags(__entry->gfp_flags),
+		__entry->pgs_scanned,
+		__entry->lru_pgs,
+		__entry->cache_items,
+		__entry->delta,
+		__entry->total_scan)
+);
+
+TRACE_EVENT(mm_shrink_slab_end,
+	TP_PROTO(struct shrinker *shr, int shrinker_retval,
+		long unused_scan_cnt, long new_scan_cnt),
+
+	TP_ARGS(shr, shrinker_retval, unused_scan_cnt, new_scan_cnt),
+
+	TP_STRUCT__entry(
+		__field(struct shrinker *, shr)
+		__field(void *, shrink)
+		__field(long, unused_scan)
+		__field(long, new_scan)
+		__field(int, retval)
+		__field(long, total_scan)
+	),
+
+	TP_fast_assign(
+		__entry->shr = shr;
+		__entry->shrink = shr->shrink;
+		__entry->unused_scan = unused_scan_cnt;
+		__entry->new_scan = new_scan_cnt;
+		__entry->retval = shrinker_retval;
+		__entry->total_scan = new_scan_cnt - unused_scan_cnt;
+	),
+
+	TP_printk("%pF %p: unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
+		__entry->shrink,
+		__entry->shr,
+		__entry->unused_scan,
+		__entry->new_scan,
+		__entry->total_scan,
+		__entry->retval)
+);
 
 DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 72340b84..d875058 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -250,6 +250,7 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		unsigned long long delta;
 		unsigned long total_scan;
 		unsigned long max_pass;
+		int shrink_ret = 0;
 
 		max_pass = do_shrinker_shrink(shrinker, shrink, 0);
 		delta = (4 * nr_pages_scanned) / shrinker->seeks;
@@ -274,9 +275,12 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		total_scan = shrinker->nr;
 		shrinker->nr = 0;
 
+		trace_mm_shrink_slab_start(shrinker, shrink, total_scan,
+					nr_pages_scanned, lru_pages,
+					max_pass, delta, total_scan);
+
 		while (total_scan >= SHRINK_BATCH) {
 			long this_scan = SHRINK_BATCH;
-			int shrink_ret;
 			int nr_before;
 
 			nr_before = do_shrinker_shrink(shrinker, shrink, 0);
@@ -293,6 +297,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		}
 
 		shrinker->nr += total_scan;
+		trace_mm_shrink_slab_end(shrinker, shrink_ret, total_scan,
+					 shrinker->nr);
 	}
 	up_read(&shrinker_rwsem);
 out:
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 06/34] vmscan: add shrink_slab tracepoints
@ 2012-07-19 14:36   ` Mel Gorman
  0 siblings, 0 replies; 125+ messages in thread
From: Mel Gorman @ 2012-07-19 14:36 UTC (permalink / raw)
  To: Stable; +Cc: Linux-MM <linux-mm, LKML, Mel Gorman

From: Dave Chinner <dchinner@redhat.com>

commit 095760730c1047c69159ce88021a7fa3833502c8 upstream.

Stable note: Not tracked in Bugzilla. This is a diagnostic patch that
	was part of a series addressing excessive slab shrinking after
	GFP_NOFS failures. There is detailed information on the series'
	motivation at https://lkml.org/lkml/2011/6/2/42 .

It is impossible to understand what the shrinkers are actually doing
without instrumenting the code, so add a some tracepoints to allow
insight to be gained.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/trace/events/vmscan.h |   77 +++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                   |    8 ++++-
 2 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index b2c33bd..36851f7 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -179,6 +179,83 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re
 	TP_ARGS(nr_reclaimed)
 );
 
+TRACE_EVENT(mm_shrink_slab_start,
+	TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
+		long nr_objects_to_shrink, unsigned long pgs_scanned,
+		unsigned long lru_pgs, unsigned long cache_items,
+		unsigned long long delta, unsigned long total_scan),
+
+	TP_ARGS(shr, sc, nr_objects_to_shrink, pgs_scanned, lru_pgs,
+		cache_items, delta, total_scan),
+
+	TP_STRUCT__entry(
+		__field(struct shrinker *, shr)
+		__field(void *, shrink)
+		__field(long, nr_objects_to_shrink)
+		__field(gfp_t, gfp_flags)
+		__field(unsigned long, pgs_scanned)
+		__field(unsigned long, lru_pgs)
+		__field(unsigned long, cache_items)
+		__field(unsigned long long, delta)
+		__field(unsigned long, total_scan)
+	),
+
+	TP_fast_assign(
+		__entry->shr = shr;
+		__entry->shrink = shr->shrink;
+		__entry->nr_objects_to_shrink = nr_objects_to_shrink;
+		__entry->gfp_flags = sc->gfp_mask;
+		__entry->pgs_scanned = pgs_scanned;
+		__entry->lru_pgs = lru_pgs;
+		__entry->cache_items = cache_items;
+		__entry->delta = delta;
+		__entry->total_scan = total_scan;
+	),
+
+	TP_printk("%pF %p: objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
+		__entry->shrink,
+		__entry->shr,
+		__entry->nr_objects_to_shrink,
+		show_gfp_flags(__entry->gfp_flags),
+		__entry->pgs_scanned,
+		__entry->lru_pgs,
+		__entry->cache_items,
+		__entry->delta,
+		__entry->total_scan)
+);
+
+TRACE_EVENT(mm_shrink_slab_end,
+	TP_PROTO(struct shrinker *shr, int shrinker_retval,
+		long unused_scan_cnt, long new_scan_cnt),
+
+	TP_ARGS(shr, shrinker_retval, unused_scan_cnt, new_scan_cnt),
+
+	TP_STRUCT__entry(
+		__field(struct shrinker *, shr)
+		__field(void *, shrink)
+		__field(long, unused_scan)
+		__field(long, new_scan)
+		__field(int, retval)
+		__field(long, total_scan)
+	),
+
+	TP_fast_assign(
+		__entry->shr = shr;
+		__entry->shrink = shr->shrink;
+		__entry->unused_scan = unused_scan_cnt;
+		__entry->new_scan = new_scan_cnt;
+		__entry->retval = shrinker_retval;
+		__entry->total_scan = new_scan_cnt - unused_scan_cnt;
+	),
+
+	TP_printk("%pF %p: unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
+		__entry->shrink,
+		__entry->shr,
+		__entry->unused_scan,
+		__entry->new_scan,
+		__entry->total_scan,
+		__entry->retval)
+);
 
 DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 72340b84..d875058 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -250,6 +250,7 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		unsigned long long delta;
 		unsigned long total_scan;
 		unsigned long max_pass;
+		int shrink_ret = 0;
 
 		max_pass = do_shrinker_shrink(shrinker, shrink, 0);
 		delta = (4 * nr_pages_scanned) / shrinker->seeks;
@@ -274,9 +275,12 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		total_scan = shrinker->nr;
 		shrinker->nr = 0;
 
+		trace_mm_shrink_slab_start(shrinker, shrink, total_scan,
+					nr_pages_scanned, lru_pages,
+					max_pass, delta, total_scan);
+
 		while (total_scan >= SHRINK_BATCH) {
 			long this_scan = SHRINK_BATCH;
-			int shrink_ret;
 			int nr_before;
 
 			nr_before = do_shrinker_shrink(shrinker, shrink, 0);
@@ -293,6 +297,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		}
 
 		shrinker->nr += total_scan;
+		trace_mm_shrink_slab_end(shrinker, shrink_ret, total_scan,
+					 shrinker->nr);
 	}
 	up_read(&shrinker_rwsem);
 out:
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 125+ messages in thread

end of thread, other threads:[~2012-07-30  1:13 UTC | newest]

Thread overview: 125+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-23 13:38 [PATCH 00/34] Memory management performance backports for -stable V2 Mel Gorman
2012-07-23 13:38 ` Mel Gorman
2012-07-23 13:38 ` [PATCH 01/34] mm: vmstat: cache align vm_stat Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 02/34] mm: memory hotplug: Check if pages are correctly reserved on a per-section basis Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 03/34] mm: Reduce the amount of work done when updating min_free_kbytes Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-24 22:47   ` Greg KH
2012-07-24 22:47     ` Greg KH
2012-07-25  7:57     ` Mel Gorman
2012-07-25  7:57       ` Mel Gorman
2012-07-23 13:38 ` [PATCH 04/34] mm: vmscan: fix force-scanning small targets without swap Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 05/34] vmscan: clear ZONE_CONGESTED for zone with good watermark Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 06/34] vmscan: add shrink_slab tracepoints Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 07/34] vmscan: shrinker->nr updates race and go wrong Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 08/34] vmscan: reduce wind up shrinker->nr when shrinker can't do work Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 09/34] mm: limit direct reclaim for higher order allocations Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 10/34] mm: Abort reclaim/compaction if compaction can proceed Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 11/34] mm: compaction: trivial clean up in acct_isolated() Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 12/34] mm: change isolate mode from #define to bitwise type Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 13/34] mm: compaction: make isolate_lru_page() filter-aware Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 14/34] mm: zone_reclaim: " Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 15/34] mm: migration: clean up unmap_and_move() Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-25 15:45   ` Greg KH
2012-07-25 15:45     ` Greg KH
2012-07-25 16:04     ` Mel Gorman
2012-07-25 16:04       ` Mel Gorman
2012-07-25 18:03       ` Greg KH
2012-07-25 18:03         ` Greg KH
2012-07-23 13:38 ` [PATCH 16/34] mm: compaction: Allow compaction to isolate dirty pages Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-25 15:47   ` Greg KH
2012-07-25 15:47     ` Greg KH
2012-07-25 16:07     ` Mel Gorman
2012-07-25 16:07       ` Mel Gorman
2012-07-23 13:38 ` [PATCH 17/34] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 18/34] mm: page allocator: Do not call direct reclaim for THP allocations while compaction is deferred Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 19/34] mm: compaction: make isolate_lru_page() filter-aware again Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 20/34] kswapd: avoid unnecessary rebalance after an unsuccessful balancing Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 21/34] kswapd: assign new_order and new_classzone_idx after wakeup in sleeping Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 22/34] mm: compaction: Introduce sync-light migration for use by compaction Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 23/34] mm: vmscan: When reclaiming for compaction, ensure there are sufficient free pages available Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 24/34] mm: vmscan: Do not OOM if aborting reclaim to start compaction Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 25/34] mm: vmscan: Check if reclaim should really abort even if compaction_ready() is true for one zone Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-25 19:51   ` Greg KH
2012-07-25 19:51     ` Greg KH
2012-07-23 13:38 ` [PATCH 26/34] vmscan: promote shared file mapped pages Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 27/34] vmscan: activate executable pages after first usage Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 28/34] mm/vmscan.c: consider swap space when deciding whether to continue reclaim Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 29/34] mm: test PageSwapBacked in lumpy reclaim Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 30/34] mm: vmscan: Do not force kswapd to scan small targets Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-25 19:59   ` Greg KH
2012-07-25 19:59     ` Greg KH
2012-07-25 21:35     ` Mel Gorman
2012-07-25 21:35       ` Mel Gorman
2012-07-25 21:44       ` Greg KH
2012-07-25 21:44         ` Greg KH
2012-07-23 13:38 ` [PATCH 31/34] cpusets: avoid looping when storing to mems_allowed if one node remains set Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 32/34] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 33/34] cpuset: mm: Reduce large amounts of memory barrier related damage v3 Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-23 13:38 ` [PATCH 34/34] mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma Mel Gorman
2012-07-23 13:38   ` Mel Gorman
2012-07-24  5:58 ` [PATCH 00/34] Memory management performance backports for -stable V2 Mike Galbraith
2012-07-24  5:58   ` Mike Galbraith
2012-07-24  8:10   ` Mel Gorman
2012-07-24  8:10     ` Mel Gorman
2012-07-24 13:18   ` Hillf Danton
2012-07-24 13:18     ` Hillf Danton
2012-07-24 13:27     ` Mel Gorman
2012-07-24 13:27       ` Mel Gorman
2012-07-24 13:34       ` Hillf Danton
2012-07-24 13:34         ` Hillf Danton
2012-07-24 13:53         ` Mel Gorman
2012-07-24 13:53           ` Mel Gorman
2012-07-24 14:11           ` Hillf Danton
2012-07-24 14:11             ` Hillf Danton
2012-07-24 13:52     ` Mike Galbraith
2012-07-24 13:52       ` Mike Galbraith
2012-07-24 14:18       ` Hillf Danton
2012-07-24 14:18         ` Hillf Danton
2012-07-24 14:41         ` Mike Galbraith
2012-07-24 14:41           ` Mike Galbraith
2012-07-25 22:30 ` Greg KH
2012-07-25 22:30   ` Greg KH
2012-07-25 22:48   ` Mel Gorman
2012-07-25 22:48     ` Mel Gorman
2012-07-30  1:13 ` Ben Hutchings
  -- strict thread matches above, loose matches on Subject: below --
2012-07-19 14:36 [PATCH 00/34] Memory management performance backports for -stable Mel Gorman
2012-07-19 14:36 ` [PATCH 06/34] vmscan: add shrink_slab tracepoints Mel Gorman
2012-07-19 14:36   ` Mel Gorman
2012-07-19 20:30   ` Jonathan Nieder
2012-07-19 22:04     ` Mel Gorman
2012-07-19 22:07       ` Jonathan Nieder
2012-07-20 10:06         ` Mel Gorman
2012-07-20 15:54           ` Jonathan Nieder
2012-07-23  9:20             ` Mel Gorman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.