* [PATCH 0/11] Memory Compaction v5
@ 2010-03-23 12:25 Mel Gorman
  2010-03-23 12:25 ` [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
                   ` (10 more replies)
  0 siblings, 11 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 12:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

The changes from V4 are minimal. Closing of a race window and a fix
to NR_ISOLATED_* are the big changes. Mostly, this is adding a lot of
Reviewed-by's. Thanks a million to the people who reviewed this. The review
caught a lot of issues and was a big help.

Kosaki-san raised concerns about the direct compaction patch (patch 10/11);
he would prefer to see it integrated directly with lumpy reclaim. I responded
with a number of points but heard nothing further. To follow the suggestion,
the complexity of the algorithm would need to increase significantly and the
exit conditions would become trickier in order to use the same type of logic
as lumpy reclaim. Conceivably, when better data is available, that approach
could be taken, but overall I don't think it's the best starting point.

Are there any further obstacles to merging this?

Changelog since V4
  o Remove unnecessary check for PageLRU and PageUnevictable
  o Fix isolated accounting
  o Close race window between page_mapcount and rcu_read_lock
  o Added a lot more Reviewed-by tags

Changelog since V3
  o Document sysfs entries (subsequently merged independently)
  o COMPACTION should depend on MMU
  o Comment updates
  o Ensure proc/sysfs triggering of compaction fully completes
  o Rename anon_vma refcount to external_refcount
  o Rebase to mmotm on top of 2.6.34-rc1

Changelog since V2
  o Move unusable and fragmentation indices to separate proc files
  o Express indices as being between 0 and 1
  o Update copyright notice for compaction.c
  o Avoid infinite loop when split free page fails
  o Init compact_resume at least once (impacted x86 testing)
  o Fewer pages are isolated during compaction.
  o LRU lists are no longer rotated when page is busy
  o NR_ISOLATED_* is updated to avoid isolating too many pages
  o Update zone LRU stats correctly when isolating pages
  o Reference count anon_vma instead of relying on insufficient locking
    that caused use-after-free races in memory compaction
  o Watch for unmapped anon pages during migration
  o Remove unnecessary parameters on a few functions
  o Add Reviewed-by's. Note that I didn't add the Acks and Reviewed-by's
    for the proc patches as they have been split out into separate
    files and I don't know if the Acks are still valid.

Changelog since V1
  o Update help blurb on CONFIG_MIGRATION
  o Max unusable free space index is 100, not 1000
  o Move blockpfn forward properly during compaction
  o Cleanup CONFIG_COMPACTION vs CONFIG_MIGRATION confusion
  o Permissions on /proc and /sys files should be 0200
  o Reduce verbosity
  o Compact all nodes when triggered via /proc
  o Add per-node compaction via sysfs
  o Move defer_compaction out-of-line
  o Fix lock oddities in rmap_walk_anon
  o Add documentation

This patchset is a memory compaction mechanism that reduces external
fragmentation by moving GFP_MOVABLE pages to a smaller number of
pageblocks. The term "compaction" was chosen because there are a number of
mechanisms, not mutually exclusive, that can be used to defragment
memory. For example, lumpy reclaim is a form of defragmentation, as was slub
"defragmentation" (really a form of targeted reclaim). Hence, this is called
"compaction" to distinguish it from other forms of defragmentation.

In this implementation, a full compaction run involves two scanners operating
within a zone - a migration and a free scanner. The migration scanner
starts at the beginning of a zone and finds all movable pages within one
pageblock_nr_pages-sized area and isolates them on a migratepages list. The
free scanner begins at the end of the zone and searches on a per-area
basis for enough free pages to migrate all the pages on the migratepages
list. As each area is respectively migrated or exhausted of free pages,
the scanners are advanced one area.  A compaction run completes within a
zone when the two scanners meet.
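As a rough illustration of the control flow only (a toy user-space model with
made-up area handling, not the kernel implementation that appears in patch 7),
the two scanners walk towards each other like this:

#include <stdio.h>

#define NR_AREAS 16	/* stand-ins for pageblock-sized areas in a zone */

int main(void)
{
	int migrate = 0;		/* migration scanner: starts at zone start */
	int free_pos = NR_AREAS - 1;	/* free scanner: starts at zone end */

	while (migrate < free_pos) {
		printf("isolate movable pages in area %d\n", migrate);
		printf("take free pages from area %d\n", free_pos);
		/*
		 * The real scanners only advance once an area is migrated
		 * or exhausted of free pages; here each step consumes one
		 * area from each end for simplicity.
		 */
		migrate++;
		free_pos--;
	}
	printf("scanners met: compaction run complete\n");
	return 0;
}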

This method is a bit primitive but easy to understand. Greater sophistication
would require maintaining counters on a per-pageblock basis, which would have
a big impact on allocator fast paths just to improve compaction; that is a
poor trade-off.

It also does not try to relocate virtually contiguous pages so that they
become physically contiguous. However, assuming transparent hugepages were
in use, a hypothetical khugepaged might reuse the compaction code to isolate
free pages, split them and relocate userspace pages for promotion.

Memory compaction can be triggered in one of three ways. It may be triggered
explicitly by writing any value to /proc/sys/vm/compact_memory, which compacts
all of memory. It can be triggered on a per-node basis by writing any
value to /sys/devices/system/node/nodeN/compact where N is the ID of the node
to be compacted. When a process fails to allocate a high-order page, it may
compact memory in an attempt to satisfy the allocation instead of entering
direct reclaim. Explicit compaction does not finish until the two scanners
meet, while direct compaction ends as soon as a suitable page becomes
available that meets the watermarks.
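For illustration only (not part of the series), triggering the first two from
user space amounts to writing a value to the corresponding file; the snippet
below assumes node 0 exists and does minimal error handling:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd;

	/* Compact all of memory */
	fd = open("/proc/sys/vm/compact_memory", O_WRONLY);
	if (fd < 0) {
		perror("compact_memory");
		return 1;
	}
	if (write(fd, "1", 1) != 1)	/* any value triggers compaction */
		perror("write compact_memory");
	close(fd);

	/* Compact a single node, node 0 in this example */
	fd = open("/sys/devices/system/node/node0/compact", O_WRONLY);
	if (fd < 0) {
		perror("node0/compact");
		return 1;
	}
	if (write(fd, "1", 1) != 1)
		perror("write node0/compact");
	close(fd);

	return 0;
}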

The series is in 11 patches. The first three are not "core" to the series
but are important prerequisites.

Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
	patch, it's possible to use anon_vma after free if the caller is
	not holding a VMA or mmap_sem for the pages in question. While
	there should be no existing user that causes this problem,
	it's a requirement for memory compaction to be stable. The patch
	is at the start of the series for bisection reasons.
Patch 2 skips over anon pages during migration that are no longer mapped
	because there still appeared to be a small window between when
	a page was isolated and migration started during which anon_vma
	could disappear.
Patch 3 merges the KSM and migrate counts. It could be merged with patch 1
	but would be slightly harder to review.
Patch 4 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
Patch 5 exports an "unusable free space index" via /proc/unusable_index. It's
	a measure of external fragmentation that takes the size of the
	allocation request into account. It can also be calculated from
	userspace so it can be dropped if requested
Patch 6 exports a "fragmentation index" which only has meaning when an
	allocation request fails. It determines if an allocation failure
	would be due to a lack of memory or external fragmentation.
Patch 7 is the compaction mechanism although it's unreachable at this point
Patch 8 adds a means of compacting all of memory with a proc trigger
Patch 9 adds a means of compacting a specific node with a sysfs trigger
Patch 10 adds "direct compaction" before "direct reclaim" if it is
	determined there is a good chance of success.
Patch 11 temporarily disables compaction if an allocation failure occurs
	after compaction.

Testing of compaction was done in three stages. For the tests, debugging,
preemption, the sleep watchdog and lockdep were all enabled, but nothing nasty
popped out. min_free_kbytes was tuned as recommended by hugeadm to help
fragmentation avoidance and high-order allocations. It was tested on X86,
X86-64 and PPC64.

The first test represents one of the easiest cases that can be faced by
lumpy reclaim or memory compaction.

1. Machine freshly booted and configured for hugepage usage with
	a) hugeadm --create-global-mounts
	b) hugeadm --pool-pages-max DEFAULT:8G
	c) hugeadm --set-recommended-min_free_kbytes
	d) hugeadm --set-recommended-shmmax

	The min_free_kbytes here is important. Anti-fragmentation works best
	when pageblocks don't mix. hugeadm knows how to calculate a value that
	will significantly reduce the worst of external-fragmentation-related
	events as reported by the mm_page_alloc_extfrag tracepoint.

2. Load up memory
	a) Start updatedb
	b) Create, in parallel, X files of pagesize*128 in size. Wait
	   until the files are created. By parallel, I mean that 4096 instances
	   of dd were launched, one after the other using &. The crude
	   objective is to mix filesystem metadata allocations with
	   the buffer cache.
	c) Delete every second file so that pageblocks are likely to
	   have holes
	d) kill updatedb if it's still running

	At this point, the system is quiet and memory is full, but it is full
	of clean filesystem metadata and clean, unmapped buffer cache. This
	is readily migrated or discarded, so you'd expect lumpy reclaim to
	have no significant advantage over compaction, but this is at the
	proof-of-concept stage.

3. In increments, attempt to allocate 5% of memory as hugepages.
	   Measure how long it took, how successful it was, how many
	   direct reclaims took place and how many compactions. Note that
	   the compaction figures might not fully add up as compactions
	   can take place for orders other than the hugepage size

X86				vanilla		compaction
Final page count                    930                941 (attempted 1002)
pages reclaimed                   74630               3861

X86-64				vanilla		compaction
Final page count:                   916                916 (attempted 1002)
Total pages reclaimed:           122076              49800

PPC64				vanilla		compaction
Final page count:                    91                 94 (attempted 110)
Total pages reclaimed:            80252              96299

There was not a dramatic improvement in success rates, but none would be
expected in this case either. What is important is that significantly fewer
pages were reclaimed in all cases, reducing the amount of IO required to
satisfy a huge page allocation.

The second set of tests were all performance-related - kernbench, netperf,
iozone and sysbench. None showed anything too remarkable.

The last test was a high-order allocation stress test. Many kernel compiles
are started to fill memory with a pressured mix of kernel and movable
allocations. During this, an attempt is made to allocate 90% of memory
as huge pages - one at a time with small delays between attempts to avoid
flooding the IO queue.

                                             vanilla   compaction
Percentage of request allocated X86               98           99
Percentage of request allocated X86-64            93           99
Percentage of request allocated PPC64             59           76

Success rates are a little higher, particularly on PPC64 with the larger
huge pages. What is most interesting is the latency when allocating huge
pages.

X86:    http://www.csn.ul.ie/~mel/postings/compaction-20100312/highalloc-interlatency-arnold-compaction-stress-v4r3-mean.ps
X86_64: http://www.csn.ul.ie/~mel/postings/compaction-20100312/highalloc-interlatency-hydra-compaction-stress-v4r3-mean.ps
PPC64: http://www.csn.ul.ie/~mel/postings/compaction-20100312/highalloc-interlatency-powyah-compaction-stress-v4r3-mean.ps

X86 latency is reduced the least, but it depends heavily on the HIGHMEM
zone to allocate many of its huge pages, which is a relatively
straightforward job. X86-64 and PPC64 both show very significant reductions
in the average time taken to allocate huge pages. It is not reduced to zero
because the system is under enough memory pressure that reclaim is still
required for some of the allocations.

Also enlightening in the same directory are the "stddev" files. Each of
them shows that the variance between allocation times is drastically
reduced.

Andrew, assuming no major complaints, how do you feel about picking these up?

 Documentation/ABI/stable/sysfs-devices-node  |    7 +
 Documentation/ABI/testing/sysfs-devices-node |    7 +
 Documentation/filesystems/proc.txt           |   68 +++-
 Documentation/sysctl/vm.txt                  |   11 +
 drivers/base/node.c                          |    3 +
 include/linux/compaction.h                   |   76 ++++
 include/linux/mm.h                           |    1 +
 include/linux/mmzone.h                       |    7 +
 include/linux/rmap.h                         |   27 +-
 include/linux/swap.h                         |    6 +
 include/linux/vmstat.h                       |    2 +
 kernel/sysctl.c                              |   11 +
 mm/Kconfig                                   |   20 +-
 mm/Makefile                                  |    1 +
 mm/compaction.c                              |  555 ++++++++++++++++++++++++++
 mm/ksm.c                                     |    4 +-
 mm/migrate.c                                 |   22 +
 mm/page_alloc.c                              |   68 ++++
 mm/rmap.c                                    |   10 +-
 mm/vmscan.c                                  |    5 -
 mm/vmstat.c                                  |  217 ++++++++++
 21 files changed, 1101 insertions(+), 27 deletions(-)
 create mode 100644 Documentation/ABI/stable/sysfs-devices-node
 create mode 100644 Documentation/ABI/testing/sysfs-devices-node
 create mode 100644 include/linux/compaction.h
 create mode 100644 mm/compaction.c


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating
  2010-03-23 12:25 [PATCH 0/11] Memory Compaction v5 Mel Gorman
@ 2010-03-23 12:25 ` Mel Gorman
  2010-03-23 12:25 ` [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 12:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma and it does not appear to have sufficient locking to
ensure the anon_vma does not disappear from under it.

This patch copies an approach used by KSM to take a reference on the
anon_vma while pages are being migrated. This should prevent rmap_walk()
from running into nasty surprises later because the anon_vma has been freed.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/rmap.h |   23 +++++++++++++++++++++++
 mm/migrate.c         |   12 ++++++++++++
 mm/rmap.c            |   10 +++++-----
 3 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..567d43f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -29,6 +29,9 @@ struct anon_vma {
 #ifdef CONFIG_KSM
 	atomic_t ksm_refcount;
 #endif
+#ifdef CONFIG_MIGRATION
+	atomic_t migrate_refcount;
+#endif
 	/*
 	 * NOTE: the LSB of the head.next is set by
 	 * mm_take_all_locks() _after_ taking the above lock. So the
@@ -81,6 +84,26 @@ static inline int ksm_refcount(struct anon_vma *anon_vma)
 	return 0;
 }
 #endif /* CONFIG_KSM */
+#ifdef CONFIG_MIGRATION
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+	atomic_set(&anon_vma->migrate_refcount, 0);
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+	return atomic_read(&anon_vma->migrate_refcount);
+}
+#else
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+	return 0;
+}
+#endif /* CONFIG_MIGRATE */
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
diff --git a/mm/migrate.c b/mm/migrate.c
index 88000b8..98eaaf2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -547,6 +547,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	int rcu_locked = 0;
 	int charge = 0;
 	struct mem_cgroup *mem = NULL;
+	struct anon_vma *anon_vma = NULL;
 
 	if (!newpage)
 		return -ENOMEM;
@@ -603,6 +604,8 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	if (PageAnon(page)) {
 		rcu_read_lock();
 		rcu_locked = 1;
+		anon_vma = page_anon_vma(page);
+		atomic_inc(&anon_vma->migrate_refcount);
 	}
 
 	/*
@@ -642,6 +645,15 @@ skip_unmap:
 	if (rc)
 		remove_migration_ptes(page, page);
 rcu_unlock:
+
+	/* Drop an anon_vma reference if we took one */
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
+		int empty = list_empty(&anon_vma->head);
+		spin_unlock(&anon_vma->lock);
+		if (empty)
+			anon_vma_free(anon_vma);
+	}
+
 	if (rcu_locked)
 		rcu_read_unlock();
 uncharge:
diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..578d0fe 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -248,7 +248,8 @@ static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma);
+	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
+					!migrate_refcount(anon_vma);
 	spin_unlock(&anon_vma->lock);
 
 	if (empty)
@@ -273,6 +274,7 @@ static void anon_vma_ctor(void *data)
 
 	spin_lock_init(&anon_vma->lock);
 	ksm_refcount_init(anon_vma);
+	migrate_refcount_init(anon_vma);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
@@ -1338,10 +1340,8 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 	/*
 	 * Note: remove_migration_ptes() cannot use page_lock_anon_vma()
 	 * because that depends on page_mapped(); but not all its usages
-	 * are holding mmap_sem, which also gave the necessary guarantee
-	 * (that this anon_vma's slab has not already been destroyed).
-	 * This needs to be reviewed later: avoiding page_lock_anon_vma()
-	 * is risky, and currently limits the usefulness of rmap_walk().
+	 * are holding mmap_sem. Users without mmap_sem are required to
+	 * take a reference count to prevent the anon_vma disappearing
 	 */
 	anon_vma = page_anon_vma(page);
 	if (!anon_vma)
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-23 12:25 [PATCH 0/11] Memory Compaction v5 Mel Gorman
  2010-03-23 12:25 ` [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
@ 2010-03-23 12:25 ` Mel Gorman
  2010-03-23 17:22   ` Christoph Lameter
  2010-03-23 12:25 ` [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 12:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

rmap_walk_anon() was triggering errors in memory compaction that look like
use-after-free errors. The problem is that, between the page being isolated
from the LRU and rcu_read_lock() being taken, the mapcount of the page can
drop to 0 and the anon_vma gets freed. This can happen during memory
compaction if pages being migrated belong to a process that exits before
migration completes. Hence, the use-after-free race looks like

 1. Page isolated for migration
 2. Process exits
 3. page_mapcount(page) drops to zero so the anon_vma is no longer reliable
 4. unmap_and_move() takes the RCU lock but the anon_vma is already garbage
 5. try_to_unmap() is called, looks up the anon_vma and "locks" it but the
    lock is garbage.

This patch checks the mapcount after the rcu lock is taken. If the
mapcount is zero, the anon_vma is assumed to be freed and no further
action is taken.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/migrate.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 98eaaf2..6eb1efe 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -603,6 +603,19 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	 */
 	if (PageAnon(page)) {
 		rcu_read_lock();
+
+		/*
+		 * If the page has no mappings any more, just bail. An
+		 * unmapped anon page is likely to be freed soon but worse,
+		 * it's possible its anon_vma disappeared between when
+		 * the page was isolated and when we reached here while
+		 * the RCU lock was not held
+		 */
+		if (!page_mapcount(page)) {
+			rcu_read_unlock();
+			goto uncharge;
+		}
+
 		rcu_locked = 1;
 		anon_vma = page_anon_vma(page);
 		atomic_inc(&anon_vma->migrate_refcount);
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration
  2010-03-23 12:25 [PATCH 0/11] Memory Compaction v5 Mel Gorman
  2010-03-23 12:25 ` [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
  2010-03-23 12:25 ` [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
@ 2010-03-23 12:25 ` Mel Gorman
  2010-03-23 17:25   ` Christoph Lameter
  2010-03-23 23:55   ` KAMEZAWA Hiroyuki
  2010-03-23 12:25 ` [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 12:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

For clarity of review, KSM and page migration have separate refcounts on
the anon_vma. While clear, this is a waste of memory. This patch gets
KSM and page migration to share their toys in a spirit of harmony.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/rmap.h |   50 ++++++++++++++++++--------------------------------
 mm/ksm.c             |    4 ++--
 mm/migrate.c         |    4 ++--
 mm/rmap.c            |    6 ++----
 4 files changed, 24 insertions(+), 40 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 567d43f..7721674 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -26,11 +26,17 @@
  */
 struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
-#ifdef CONFIG_KSM
-	atomic_t ksm_refcount;
-#endif
-#ifdef CONFIG_MIGRATION
-	atomic_t migrate_refcount;
+#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
+
+	/*
+	 * The external_refcount is taken by either KSM or page migration
+	 * to take a reference to an anon_vma when there is no
+	 * guarantee that the vma of page tables will exist for
+	 * the duration of the operation. A caller that takes
+	 * the reference is responsible for clearing up the
+	 * anon_vma if they are the last user on release
+	 */
+	atomic_t external_refcount;
 #endif
 	/*
 	 * NOTE: the LSB of the head.next is set by
@@ -64,46 +70,26 @@ struct anon_vma_chain {
 };
 
 #ifdef CONFIG_MMU
-#ifdef CONFIG_KSM
-static inline void ksm_refcount_init(struct anon_vma *anon_vma)
+#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
+static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
 {
-	atomic_set(&anon_vma->ksm_refcount, 0);
+	atomic_set(&anon_vma->external_refcount, 0);
 }
 
-static inline int ksm_refcount(struct anon_vma *anon_vma)
+static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
 {
-	return atomic_read(&anon_vma->ksm_refcount);
+	return atomic_read(&anon_vma->external_refcount);
 }
 #else
-static inline void ksm_refcount_init(struct anon_vma *anon_vma)
+static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
 {
 }
 
-static inline int ksm_refcount(struct anon_vma *anon_vma)
+static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
 {
 	return 0;
 }
 #endif /* CONFIG_KSM */
-#ifdef CONFIG_MIGRATION
-static inline void migrate_refcount_init(struct anon_vma *anon_vma)
-{
-	atomic_set(&anon_vma->migrate_refcount, 0);
-}
-
-static inline int migrate_refcount(struct anon_vma *anon_vma)
-{
-	return atomic_read(&anon_vma->migrate_refcount);
-}
-#else
-static inline void migrate_refcount_init(struct anon_vma *anon_vma)
-{
-}
-
-static inline int migrate_refcount(struct anon_vma *anon_vma)
-{
-	return 0;
-}
-#endif /* CONFIG_MIGRATE */
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
diff --git a/mm/ksm.c b/mm/ksm.c
index a93f1b7..e45ec98 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -318,14 +318,14 @@ static void hold_anon_vma(struct rmap_item *rmap_item,
 			  struct anon_vma *anon_vma)
 {
 	rmap_item->anon_vma = anon_vma;
-	atomic_inc(&anon_vma->ksm_refcount);
+	atomic_inc(&anon_vma->external_refcount);
 }
 
 static void drop_anon_vma(struct rmap_item *rmap_item)
 {
 	struct anon_vma *anon_vma = rmap_item->anon_vma;
 
-	if (atomic_dec_and_lock(&anon_vma->ksm_refcount, &anon_vma->lock)) {
+	if (atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->lock)) {
 		int empty = list_empty(&anon_vma->head);
 		spin_unlock(&anon_vma->lock);
 		if (empty)
diff --git a/mm/migrate.c b/mm/migrate.c
index 6eb1efe..e395115 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -618,7 +618,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 
 		rcu_locked = 1;
 		anon_vma = page_anon_vma(page);
-		atomic_inc(&anon_vma->migrate_refcount);
+		atomic_inc(&anon_vma->external_refcount);
 	}
 
 	/*
@@ -660,7 +660,7 @@ skip_unmap:
 rcu_unlock:
 
 	/* Drop an anon_vma reference if we took one */
-	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->lock)) {
 		int empty = list_empty(&anon_vma->head);
 		spin_unlock(&anon_vma->lock);
 		if (empty)
diff --git a/mm/rmap.c b/mm/rmap.c
index 578d0fe..af35b75 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -248,8 +248,7 @@ static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
-					!migrate_refcount(anon_vma);
+	empty = list_empty(&anon_vma->head) && !anonvma_external_refcount(anon_vma);
 	spin_unlock(&anon_vma->lock);
 
 	if (empty)
@@ -273,8 +272,7 @@ static void anon_vma_ctor(void *data)
 	struct anon_vma *anon_vma = data;
 
 	spin_lock_init(&anon_vma->lock);
-	ksm_refcount_init(anon_vma);
-	migrate_refcount_init(anon_vma);
+	anonvma_external_refcount_init(anon_vma);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
  2010-03-23 12:25 [PATCH 0/11] Memory Compaction v5 Mel Gorman
                   ` (2 preceding siblings ...)
  2010-03-23 12:25 ` [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
@ 2010-03-23 12:25 ` Mel Gorman
  2010-03-23 12:25 ` [PATCH 05/11] Export unusable free space index via /proc/unusable_index Mel Gorman
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 12:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
being able to hot-remove memory. The main users of page migration, such as
sys_move_pages(), sys_migrate_pages() and cpuset process migration, are
only beneficial on NUMA, so this makes sense.

As memory compaction will operate within a zone and is useful on both NUMA
and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
user selects CONFIG_COMPACTION as an option.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/Kconfig |   20 ++++++++++++++++----
 1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 9c61158..04e241b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -172,17 +172,29 @@ config SPLIT_PTLOCK_CPUS
 	default "4"
 
 #
+# support for memory compaction
+config COMPACTION
+	bool "Allow for memory compaction"
+	def_bool y
+	select MIGRATION
+	depends on EXPERIMENTAL && HUGETLBFS && MMU
+	help
+	  Allows the compaction of memory for the allocation of huge pages.
+
+#
 # support for page migration
 #
 config MIGRATION
 	bool "Page migration"
 	def_bool y
-	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE
+	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION
 	help
 	  Allows the migration of the physical location of pages of processes
-	  while the virtual addresses are not changed. This is useful for
-	  example on NUMA systems to put pages nearer to the processors accessing
-	  the page.
+	  while the virtual addresses are not changed. This is useful in
+	  two situations. The first is on NUMA systems to put pages nearer
+	  to the processors accessing them. The second is when allocating huge
+	  pages as migration can relocate pages to satisfy a huge page
+	  allocation instead of reclaiming.
 
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 05/11] Export unusable free space index via /proc/unusable_index
  2010-03-23 12:25 [PATCH 0/11] Memory Compaction v5 Mel Gorman
                   ` (3 preceding siblings ...)
  2010-03-23 12:25 ` [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
@ 2010-03-23 12:25 ` Mel Gorman
  2010-03-23 17:31   ` Christoph Lameter
  2010-03-24  0:03   ` KAMEZAWA Hiroyuki
  2010-03-23 12:25 ` [PATCH 06/11] Export fragmentation index via /proc/extfrag_index Mel Gorman
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 12:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

Unusable free space index is a measure of external fragmentation that
takes the allocation size into account. For the most part, the huge page
size will be the size of interest, but not necessarily, so it is exported
on a per-order and per-zone basis via /proc/unusable_index.

The index is a value between 0 and 1. It can be expressed as a
percentage by multiplying by 100 as documented in
Documentation/filesystems/proc.txt.
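
For example, with illustrative numbers: if a zone has 8192 free pages and
free blocks of at least the requested order account for 2048 of them, then
the index for that order is (8192 - 2048) / 8192 = 0.75, i.e. 75% of the
free memory is unusable for an allocation of that size.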

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 Documentation/filesystems/proc.txt |   13 ++++-
 mm/vmstat.c                        |  120 +++++++++++++++++++++++++++++++++
 2 files changed, 132 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 5e132b5..5c4b0fb 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -452,6 +452,7 @@ Table 1-5: Kernel info in /proc
  sys         See chapter 2                                     
  sysvipc     Info of SysVIPC Resources (msg, sem, shm)		(2.4)
  tty	     Info of tty drivers
+ unusable_index Additional page allocator information (see text)(2.5)
  uptime      System uptime                                     
  version     Kernel version                                    
  video	     bttv info of video resources			(2.4)
@@ -609,7 +610,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
 available in ZONE_NORMAL, etc... 
 
 More information relevant to external fragmentation can be found in
-pagetypeinfo.
+pagetypeinfo and unusable_index
 
 > cat /proc/pagetypeinfo
 Page block order: 9
@@ -650,6 +651,16 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
 also be allocatable although a lot of filesystem metadata may have to be
 reclaimed to achieve this.
 
+> cat /proc/unusable_index
+Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
+Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
+
+The unusable free space index measures how much of the available free
+memory cannot be used to satisfy an allocation of a given size and is a
+value between 0 and 1. The higher the value, the more of free memory is
+unusable and by implication, the worse the external fragmentation is. This
+can be expressed as a percentage by multiplying by 100.
+
 ..............................................................................
 
 meminfo:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7f760cb..ca42e10 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -453,6 +453,106 @@ static int frag_show(struct seq_file *m, void *arg)
 	return 0;
 }
 
+
+struct contig_page_info {
+	unsigned long free_pages;
+	unsigned long free_blocks_total;
+	unsigned long free_blocks_suitable;
+};
+
+/*
+ * Calculate the number of free pages in a zone, how many contiguous
+ * pages are free and how many are large enough to satisfy an allocation of
+ * the target size. Note that this function makes no attempt to estimate
+ * how many suitable free blocks there *might* be if MOVABLE pages were
+ * migrated. Calculating that is possible, but expensive and can be
+ * figured out from userspace
+ */
+static void fill_contig_page_info(struct zone *zone,
+				unsigned int suitable_order,
+				struct contig_page_info *info)
+{
+	unsigned int order;
+
+	info->free_pages = 0;
+	info->free_blocks_total = 0;
+	info->free_blocks_suitable = 0;
+
+	for (order = 0; order < MAX_ORDER; order++) {
+		unsigned long blocks;
+
+		/* Count number of free blocks */
+		blocks = zone->free_area[order].nr_free;
+		info->free_blocks_total += blocks;
+
+		/* Count free base pages */
+		info->free_pages += blocks << order;
+
+		/* Count the suitable free blocks */
+		if (order >= suitable_order)
+			info->free_blocks_suitable += blocks <<
+						(order - suitable_order);
+	}
+}
+
+/*
+ * Return an index indicating how much of the available free memory is
+ * unusable for an allocation of the requested size.
+ */
+static int unusable_free_index(unsigned int order,
+				struct contig_page_info *info)
+{
+	/* No free memory is interpreted as all free memory is unusable */
+	if (info->free_pages == 0)
+		return 1000;
+
+	/*
+	 * Index should be a value between 0 and 1. Return a value to 3
+	 * decimal places.
+	 *
+	 * 0 => no fragmentation
+	 * 1 => high fragmentation
+	 */
+	return ((info->free_pages - (info->free_blocks_suitable << order)) * 1000) / info->free_pages;
+
+}
+
+static void unusable_show_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	unsigned int order;
+	int index;
+	struct contig_page_info info;
+
+	seq_printf(m, "Node %d, zone %8s ",
+				pgdat->node_id,
+				zone->name);
+	for (order = 0; order < MAX_ORDER; ++order) {
+		fill_contig_page_info(zone, order, &info);
+		index = unusable_free_index(order, &info);
+		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+	}
+
+	seq_putc(m, '\n');
+}
+
+/*
+ * Display unusable free space index
+ * XXX: Could be a lot more efficient, but it's not a critical path
+ */
+static int unusable_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	/* check memoryless node */
+	if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+		return 0;
+
+	walk_zones_in_node(m, pgdat, unusable_show_print);
+
+	return 0;
+}
+
 static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
@@ -603,6 +703,25 @@ static const struct file_operations pagetypeinfo_file_ops = {
 	.release	= seq_release,
 };
 
+static const struct seq_operations unusable_op = {
+	.start	= frag_start,
+	.next	= frag_next,
+	.stop	= frag_stop,
+	.show	= unusable_show,
+};
+
+static int unusable_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &unusable_op);
+}
+
+static const struct file_operations unusable_file_ops = {
+	.open		= unusable_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 #ifdef CONFIG_ZONE_DMA
 #define TEXT_FOR_DMA(xx) xx "_dma",
 #else
@@ -947,6 +1066,7 @@ static int __init setup_vmstat(void)
 #ifdef CONFIG_PROC_FS
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
 	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
+	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
 	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
 	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
 #endif
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 06/11] Export fragmentation index via /proc/extfrag_index
  2010-03-23 12:25 [PATCH 0/11] Memory Compaction v5 Mel Gorman
                   ` (4 preceding siblings ...)
  2010-03-23 12:25 ` [PATCH 05/11] Export unusable free space index via /proc/unusable_index Mel Gorman
@ 2010-03-23 12:25 ` Mel Gorman
  2010-03-23 17:37   ` Christoph Lameter
  2010-03-23 12:25 ` [PATCH 07/11] Memory compaction core Mel Gorman
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 12:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

Fragmentation index is a value that makes sense when an allocation of a
given size would fail. The index indicates whether an allocation failure is
due to a lack of memory (values towards 0) or due to external fragmentation
(values towards 1). For the most part, the huge page size will be the size
of interest, but not necessarily, so it is exported on a per-order and
per-zone basis via /proc/extfrag_index.
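
For example, with illustrative numbers for an order-9 request (512 pages):
if a zone has 10240 free pages but every free block is order-0, the request
fails purely because of fragmentation and the index works out to roughly
1 - (1 + 10240/512) / 10240 ~= 0.998. If instead the request fails because
very little memory is free at all, the value tends towards 0.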

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 Documentation/filesystems/proc.txt |   14 ++++++-
 mm/vmstat.c                        |   81 +++++++++++++++++++++++++++++++++
 2 files changed, 94 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 5c4b0fb..582ff3d 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -421,6 +421,7 @@ Table 1-5: Kernel info in /proc
  filesystems Supported filesystems                             
  driver	     Various drivers grouped here, currently rtc (2.4)
  execdomains Execdomains, related to security			(2.4)
+ extfrag_index Additional page allocator information (see text) (2.5)
  fb	     Frame Buffer devices				(2.4)
  fs	     File system parameters, currently nfs/exports	(2.4)
  ide         Directory containing info about the IDE subsystem 
@@ -610,7 +611,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
 available in ZONE_NORMAL, etc... 
 
 More information relevant to external fragmentation can be found in
-pagetypeinfo and unusable_index
+pagetypeinfo, unusable_index and extfrag_index.
 
 > cat /proc/pagetypeinfo
 Page block order: 9
@@ -661,6 +662,17 @@ value between 0 and 1. The higher the value, the more of free memory is
 unusable and by implication, the worse the external fragmentation is. This
 can be expressed as a percentage by multiplying by 100.
 
+> cat /proc/extfrag_index
+Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
+Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954
+
+The external fragmentation index, is only meaningful if an allocation
+would fail and indicates what the failure is due to. A value of -1 such as
+in many of the examples above states that the allocation would succeed.
+If it would fail, the value is between 0 and 1. A value tending towards
+0 implies the allocation failed due to a lack of memory. A value tending
+towards 1 implies it failed due to external fragmentation.
+
 ..............................................................................
 
 meminfo:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ca42e10..7377da6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -553,6 +553,67 @@ static int unusable_show(struct seq_file *m, void *arg)
 	return 0;
 }
 
+/*
+ * A fragmentation index only makes sense if an allocation of a requested
+ * size would fail. If that is true, the fragmentation index indicates
+ * whether external fragmentation or a lack of memory was the problem.
+ * The value can be used to determine if page reclaim or compaction
+ * should be used
+ */
+int fragmentation_index(unsigned int order, struct contig_page_info *info)
+{
+	unsigned long requested = 1UL << order;
+
+	if (!info->free_blocks_total)
+		return 0;
+
+	/* Fragmentation index only makes sense when a request would fail */
+	if (info->free_blocks_suitable)
+		return -1000;
+
+	/*
+	 * Index is between 0 and 1 so return within 3 decimal places
+	 *
+	 * 0 => allocation would fail due to lack of memory
+	 * 1 => allocation would fail due to fragmentation
+	 */
+	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
+}
+
+
+static void extfrag_show_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	unsigned int order;
+	int index;
+
+	/* Alloc on stack as interrupts are disabled for zone walk */
+	struct contig_page_info info;
+
+	seq_printf(m, "Node %d, zone %8s ",
+				pgdat->node_id,
+				zone->name);
+	for (order = 0; order < MAX_ORDER; ++order) {
+		fill_contig_page_info(zone, order, &info);
+		index = fragmentation_index(order, &info);
+		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+	}
+
+	seq_putc(m, '\n');
+}
+
+/*
+ * Display fragmentation index for orders that allocations would fail for
+ */
+static int extfrag_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	walk_zones_in_node(m, pgdat, extfrag_show_print);
+
+	return 0;
+}
+
 static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
@@ -722,6 +783,25 @@ static const struct file_operations unusable_file_ops = {
 	.release	= seq_release,
 };
 
+static const struct seq_operations extfrag_op = {
+	.start	= frag_start,
+	.next	= frag_next,
+	.stop	= frag_stop,
+	.show	= extfrag_show,
+};
+
+static int extfrag_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &extfrag_op);
+}
+
+static const struct file_operations extfrag_file_ops = {
+	.open		= extfrag_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 #ifdef CONFIG_ZONE_DMA
 #define TEXT_FOR_DMA(xx) xx "_dma",
 #else
@@ -1067,6 +1147,7 @@ static int __init setup_vmstat(void)
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
 	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
 	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
+	proc_create("extfrag_index", S_IRUGO, NULL, &extfrag_file_ops);
 	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
 	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
 #endif
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 07/11] Memory compaction core
  2010-03-23 12:25 [PATCH 0/11] Memory Compaction v5 Mel Gorman
                   ` (5 preceding siblings ...)
  2010-03-23 12:25 ` [PATCH 06/11] Export fragmentation index via /proc/extfrag_index Mel Gorman
@ 2010-03-23 12:25 ` Mel Gorman
  2010-03-23 17:56   ` Christoph Lameter
                     ` (2 more replies)
  2010-03-23 12:25 ` [PATCH 08/11] Add /proc trigger for memory compaction Mel Gorman
                   ` (3 subsequent siblings)
  10 siblings, 3 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 12:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.

A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable pages
within each area, isolating them onto a private list called migratelist.
The free scanner starts at the top of the zone, searches for suitable
areas and consumes the free pages within them, making them available for
the migration scanner. The pages isolated for migration are then migrated to
the newly isolated free pages.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 include/linux/compaction.h |    8 +
 include/linux/mm.h         |    1 +
 include/linux/swap.h       |    6 +
 include/linux/vmstat.h     |    1 +
 mm/Makefile                |    1 +
 mm/compaction.c            |  348 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |   39 +++++
 mm/vmscan.c                |    5 -
 mm/vmstat.c                |    5 +
 9 files changed, 409 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/compaction.h
 create mode 100644 mm/compaction.c

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
new file mode 100644
index 0000000..6201371
--- /dev/null
+++ b/include/linux/compaction.h
@@ -0,0 +1,8 @@
+#ifndef _LINUX_COMPACTION_H
+#define _LINUX_COMPACTION_H
+
+/* Return values for compact_zone() */
+#define COMPACT_INCOMPLETE	0
+#define COMPACT_COMPLETE	1
+
+#endif /* _LINUX_COMPACTION_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3b473a..f920815 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -335,6 +335,7 @@ void put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
+int split_free_page(struct page *page);
 
 /*
  * Compound pages have a destructor function.  Provide a
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1f59d93..cf8bba7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -151,6 +151,7 @@ enum {
 };
 
 #define SWAP_CLUSTER_MAX 32
+#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
 #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
@@ -238,6 +239,11 @@ static inline void lru_cache_add_active_file(struct page *page)
 	__lru_cache_add(page, LRU_ACTIVE_FILE);
 }
 
+/* LRU Isolation modes. */
+#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
+#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
+#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
+
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 117f0dd..56e4b44 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
diff --git a/mm/Makefile b/mm/Makefile
index 7a68d2a..ccb1f72 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_COMPACTION) += compaction.o
 obj-$(CONFIG_SMP) += percpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
diff --git a/mm/compaction.c b/mm/compaction.c
new file mode 100644
index 0000000..0d2e8aa
--- /dev/null
+++ b/mm/compaction.c
@@ -0,0 +1,348 @@
+/*
+ * linux/mm/compaction.c
+ *
+ * Memory compaction for the reduction of external fragmentation. Note that
+ * this heavily depends upon page migration to do all the real heavy
+ * lifting
+ *
+ * Copyright IBM Corp. 2007-2010 Mel Gorman <mel@csn.ul.ie>
+ */
+#include <linux/swap.h>
+#include <linux/migrate.h>
+#include <linux/compaction.h>
+#include <linux/mm_inline.h>
+#include "internal.h"
+
+/*
+ * compact_control is used to track pages being migrated and the free pages
+ * they are being migrated to during memory compaction. The free_pfn starts
+ * at the end of a zone and migrate_pfn begins at the start. Movable pages
+ * are moved to the end of a zone during a compaction run and the run
+ * completes when free_pfn <= migrate_pfn
+ */
+struct compact_control {
+	struct list_head freepages;	/* List of free pages to migrate to */
+	struct list_head migratepages;	/* List of pages being migrated */
+	unsigned long nr_freepages;	/* Number of isolated free pages */
+	unsigned long nr_migratepages;	/* Number of pages to migrate */
+	unsigned long free_pfn;		/* isolate_freepages search base */
+	unsigned long migrate_pfn;	/* isolate_migratepages search base */
+
+	/* Account for isolated anon and file pages */
+	unsigned long nr_anon;
+	unsigned long nr_file;
+
+	struct zone *zone;
+};
+
+static int release_freepages(struct list_head *freelist)
+{
+	struct page *page, *next;
+	int count = 0;
+
+	list_for_each_entry_safe(page, next, freelist, lru) {
+		list_del(&page->lru);
+		__free_page(page);
+		count++;
+	}
+
+	return count;
+}
+
+/* Isolate free pages onto a private freelist. Must hold zone->lock */
+static int isolate_freepages_block(struct zone *zone,
+				unsigned long blockpfn,
+				struct list_head *freelist)
+{
+	unsigned long zone_end_pfn, end_pfn;
+	int total_isolated = 0;
+
+	/* Get the last PFN we should scan for free pages at */
+	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	end_pfn = blockpfn + pageblock_nr_pages;
+	if (end_pfn > zone_end_pfn)
+		end_pfn = zone_end_pfn;
+
+	/* Isolate free pages. This assumes the block is valid */
+	for (; blockpfn < end_pfn; blockpfn++) {
+		struct page *page;
+		int isolated, i;
+
+		if (!pfn_valid_within(blockpfn))
+			continue;
+
+		page = pfn_to_page(blockpfn);
+		if (!PageBuddy(page))
+			continue;
+
+		/* Found a free page, break it into order-0 pages */
+		isolated = split_free_page(page);
+		total_isolated += isolated;
+		for (i = 0; i < isolated; i++) {
+			list_add(&page->lru, freelist);
+			page++;
+		}
+
+		/* If a page was split, advance to the end of it */
+		if (isolated)
+			blockpfn += isolated - 1;
+	}
+
+	return total_isolated;
+}
+
+/* Returns 1 if the page is within a block suitable for migration to */
+static int suitable_migration_target(struct page *page)
+{
+	/* If the page is a large free page, then allow migration */
+	if (PageBuddy(page) && page_order(page) >= pageblock_order)
+		return 1;
+
+	/* If the block is MIGRATE_MOVABLE, allow migration */
+	if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
+		return 1;
+
+	/* Otherwise skip the block */
+	return 0;
+}
+
+/*
+ * Based on information in the current compact_control, find blocks
+ * suitable for isolating free pages from
+ */
+static void isolate_freepages(struct zone *zone,
+				struct compact_control *cc)
+{
+	struct page *page;
+	unsigned long high_pfn, low_pfn, pfn;
+	unsigned long flags;
+	int nr_freepages = cc->nr_freepages;
+	struct list_head *freelist = &cc->freepages;
+
+	pfn = cc->free_pfn;
+	low_pfn = cc->migrate_pfn + pageblock_nr_pages;
+	high_pfn = low_pfn;
+
+	/*
+	 * Isolate free pages until enough are available to migrate the
+	 * pages on cc->migratepages. We stop searching if the migrate
+	 * and free page scanners meet or enough free pages are isolated.
+	 */
+	spin_lock_irqsave(&zone->lock, flags);
+	for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
+					pfn -= pageblock_nr_pages) {
+		int isolated;
+
+		if (!pfn_valid(pfn))
+			continue;
+
+		/* 
+		 * Check for overlapping nodes/zones. It's possible on some
+		 * configurations to have a setup like
+		 * node0 node1 node0
+		 * i.e. it's possible that all pages within a zones range of
+		 * pages do not belong to a single zone.
+		 */
+		page = pfn_to_page(pfn);
+		if (page_zone(page) != zone)
+			continue;
+
+		/* Check the block is suitable for migration */
+		if (!suitable_migration_target(page))
+			continue;
+
+		/* Found a block suitable for isolating free pages from */
+		isolated = isolate_freepages_block(zone, pfn, freelist);
+		nr_freepages += isolated;
+
+		/*
+		 * Record the highest PFN we isolated pages from. When next
+		 * looking for free pages, the search will restart here as
+		 * page migration may have returned some pages to the allocator
+		 */
+		if (isolated)
+			high_pfn = max(high_pfn, pfn);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	cc->free_pfn = high_pfn;
+	cc->nr_freepages = nr_freepages;
+}
+
+/* Update the number of anon and file isolated pages in the zone */
+static void acct_isolated(struct zone *zone, struct compact_control *cc)
+{
+	struct page *page;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+
+	list_for_each_entry(page, &cc->migratepages, lru) {
+		int lru = page_lru_base_type(page);
+		count[lru]++;
+	}
+
+	cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
+}
+
+/*
+ * Isolate all pages that can be migrated from the block pointed to by
+ * the migrate scanner within compact_control.
+ */
+static unsigned long isolate_migratepages(struct zone *zone,
+					struct compact_control *cc)
+{
+	unsigned long low_pfn, end_pfn;
+	struct list_head *migratelist;
+
+	low_pfn = cc->migrate_pfn;
+	migratelist = &cc->migratepages;
+
+	/* Do not scan outside zone boundaries */
+	if (low_pfn < zone->zone_start_pfn)
+		low_pfn = zone->zone_start_pfn;
+
+	/* Setup to scan one block but not past where we are migrating to */
+	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
+
+	/* Do not cross the free scanner or scan within a memory hole */
+	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
+		cc->migrate_pfn = end_pfn;
+		return 0;
+	}
+
+	migrate_prep();
+
+	/* Time to isolate some pages for migration */
+	spin_lock_irq(&zone->lru_lock);
+	for (; low_pfn < end_pfn; low_pfn++) {
+		struct page *page;
+		if (!pfn_valid_within(low_pfn))
+			continue;
+
+		/* Get the page and skip if free */
+		page = pfn_to_page(low_pfn);
+		if (PageBuddy(page)) {
+			low_pfn += (1 << page_order(page)) - 1;
+			continue;
+		}
+
+		/* Try isolate the page */
+		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
+			del_page_from_lru_list(zone, page, page_lru(page));
+			list_add(&page->lru, migratelist);
+			mem_cgroup_del_lru(page);
+			cc->nr_migratepages++;
+		}
+
+		/* Avoid isolating too much */
+		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
+			break;
+	}
+
+	acct_isolated(zone, cc);
+
+	spin_unlock_irq(&zone->lru_lock);
+	cc->migrate_pfn = low_pfn;
+
+	return cc->nr_migratepages;
+}
+
+/*
+ * This is a migrate-callback that "allocates" freepages by taking pages
+ * from the isolated freelists in the block we are migrating to.
+ */
+static struct page *compaction_alloc(struct page *migratepage,
+					unsigned long data,
+					int **result)
+{
+	struct compact_control *cc = (struct compact_control *)data;
+	struct page *freepage;
+
+	VM_BUG_ON(cc == NULL);
+
+	/* Isolate free pages if necessary */
+	if (list_empty(&cc->freepages)) {
+		isolate_freepages(cc->zone, cc);
+
+		if (list_empty(&cc->freepages))
+			return NULL;
+	}
+
+	freepage = list_entry(cc->freepages.next, struct page, lru);
+	list_del(&freepage->lru);
+	cc->nr_freepages--;
+
+	return freepage;
+}
+
+/*
+ * We cannot control nr_migratepages and nr_freepages fully when migration is
+ * running as migrate_pages() has no knowledge of compact_control. When
+ * migration is complete, we count the number of pages on the lists by hand.
+ */
+static void update_nr_listpages(struct compact_control *cc)
+{
+	int nr_migratepages = 0;
+	int nr_freepages = 0;
+	struct page *page;
+	list_for_each_entry(page, &cc->migratepages, lru)
+		nr_migratepages++;
+	list_for_each_entry(page, &cc->freepages, lru)
+		nr_freepages++;
+
+	cc->nr_migratepages = nr_migratepages;
+	cc->nr_freepages = nr_freepages;
+}
+
+static inline int compact_finished(struct zone *zone,
+						struct compact_control *cc)
+{
+	/* Compaction run completes if the migrate and free scanner meet */
+	if (cc->free_pfn <= cc->migrate_pfn)
+		return COMPACT_COMPLETE;
+
+	return COMPACT_INCOMPLETE;
+}
+
+static int compact_zone(struct zone *zone, struct compact_control *cc)
+{
+	int ret = COMPACT_INCOMPLETE;
+
+	/* Setup to move all movable pages to the end of the zone */
+	cc->migrate_pfn = zone->zone_start_pfn;
+	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
+	cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+	for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
+		unsigned long nr_migrate, nr_remaining;
+		if (!isolate_migratepages(zone, cc))
+			continue;
+
+		nr_migrate = cc->nr_migratepages;
+		migrate_pages(&cc->migratepages, compaction_alloc,
+						(unsigned long)cc, 0);
+		update_nr_listpages(cc);
+		nr_remaining = cc->nr_migratepages;
+
+		count_vm_event(COMPACTBLOCKS);
+		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
+		if (nr_remaining)
+			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
+
+		/* Release LRU pages not migrated */
+		if (!list_empty(&cc->migratepages)) {
+			putback_lru_pages(&cc->migratepages);
+			cc->nr_migratepages = 0;
+		}
+
+	}
+
+	/* Release free pages and check accounting */
+	cc->nr_freepages -= release_freepages(&cc->freepages);
+	VM_BUG_ON(cc->nr_freepages != 0);
+
+	return ret;
+}
+
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 882aef0..9708143 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1208,6 +1208,45 @@ void split_page(struct page *page, unsigned int order)
 }
 
 /*
+ * Similar to split_page except the page is already free. As this is only
+ * being used for migration, the migratetype of the block also changes.
+ */
+int split_free_page(struct page *page)
+{
+	unsigned int order;
+	unsigned long watermark;
+	struct zone *zone;
+
+	BUG_ON(!PageBuddy(page));
+
+	zone = page_zone(page);
+	order = page_order(page);
+
+	/* Obey watermarks or the system could deadlock */
+	watermark = low_wmark_pages(zone) + (1 << order);
+	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+		return 0;
+
+	/* Remove page from free list */
+	list_del(&page->lru);
+	zone->free_area[order].nr_free--;
+	rmv_page_order(page);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+
+	/* Split into individual pages */
+	set_page_refcounted(page);
+	split_page(page, order);
+
+	if (order >= pageblock_order - 1) {
+		struct page *endpage = page + (1 << order) - 1;
+		for (; page < endpage; page += pageblock_nr_pages)
+			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+	}
+
+	return 1 << order;
+}
+
+/*
  * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..ef89600 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -839,11 +839,6 @@ keep:
 	return nr_reclaimed;
 }
 
-/* LRU Isolation modes. */
-#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
-#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
-#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
-
 /*
  * Attempt to remove the specified page from its LRU.  Only take this page
  * if it is of the appropriate PageActive status.  Pages which are being
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7377da6..af88647 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -891,6 +891,11 @@ static const char * const vmstat_text[] = {
 	"allocstall",
 
 	"pgrotated",
+
+	"compact_blocks_moved",
+	"compact_pages_moved",
+	"compact_pagemigrate_failed",
+
 #ifdef CONFIG_HUGETLB_PAGE
 	"htlb_buddy_alloc_success",
 	"htlb_buddy_alloc_fail",
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 08/11] Add /proc trigger for memory compaction
  2010-03-23 12:25 [PATCH 0/11] Memory Compaction v5 Mel Gorman
                   ` (6 preceding siblings ...)
  2010-03-23 12:25 ` [PATCH 07/11] Memory compaction core Mel Gorman
@ 2010-03-23 12:25 ` Mel Gorman
  2010-03-23 18:25   ` Christoph Lameter
  2010-03-24 20:33   ` Andrew Morton
  2010-03-23 12:25 ` [PATCH 09/11] Add /sys trigger for per-node " Mel Gorman
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 12:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
value is written to the file, all zones are compacted. The expected user
of such a trigger is a job scheduler that prepares the system before the
target application runs.
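
(Not part of the patch, but for anyone wondering what the expected caller
looks like: below is a minimal userspace sketch of a job scheduler poking
the trigger before launching a job. Any written value works; only the write
itself matters, and the file requires CONFIG_COMPACTION and root's write
permission.)

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* The value is ignored by the handler; the write is the trigger */
		int fd = open("/proc/sys/vm/compact_memory", O_WRONLY);

		if (fd < 0) {
			perror("compact_memory");
			return 1;
		}
		if (write(fd, "1", 1) < 0)
			perror("write");
		close(fd);
		return 0;
	}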

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 Documentation/sysctl/vm.txt |   11 ++++++++
 include/linux/compaction.h  |    6 ++++
 kernel/sysctl.c             |   10 +++++++
 mm/compaction.c             |   61 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 88 insertions(+), 0 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 6c7d18c..317d3f0 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -19,6 +19,7 @@ files can be found in mm/swap.c.
 Currently, these files are in /proc/sys/vm:
 
 - block_dump
+- compact_memory
 - dirty_background_bytes
 - dirty_background_ratio
 - dirty_bytes
@@ -64,6 +65,16 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
 
 ==============================================================
 
+compact_memory
+
+Available only when CONFIG_COMPACTION is set. When an arbitrary value
+is written to the file, all zones are compacted such that free memory
+is available in contiguous blocks where possible. This can be important
+for example in the allocation of huge pages although processes will also
+directly compact memory as required.
+
+==============================================================
+
 dirty_background_bytes
 
 Contains the amount of dirty memory at which the pdflush background writeback
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 6201371..52762d2 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -5,4 +5,10 @@
 #define COMPACT_INCOMPLETE	0
 #define COMPACT_COMPLETE	1
 
+#ifdef CONFIG_COMPACTION
+extern int sysctl_compact_memory;
+extern int sysctl_compaction_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *length, loff_t *ppos);
+#endif /* CONFIG_COMPACTION */
+
 #endif /* _LINUX_COMPACTION_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1de111d..05cfee8 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -52,6 +52,7 @@
 #include <linux/slow-work.h>
 #include <linux/perf_event.h>
 #include <linux/kprobes.h>
+#include <linux/compaction.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -1124,6 +1125,15 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= drop_caches_sysctl_handler,
 	},
+#ifdef CONFIG_COMPACTION
+	{
+		.procname	= "compact_memory",
+		.data		= &sysctl_compact_memory,
+		.maxlen		= sizeof(int),
+		.mode		= 0200,
+		.proc_handler	= sysctl_compaction_handler,
+	},
+#endif /* CONFIG_COMPACTION */
 	{
 		.procname	= "min_free_kbytes",
 		.data		= &min_free_kbytes,
diff --git a/mm/compaction.c b/mm/compaction.c
index 0d2e8aa..faa9b53 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -11,6 +11,7 @@
 #include <linux/migrate.h>
 #include <linux/compaction.h>
 #include <linux/mm_inline.h>
+#include <linux/sysctl.h>
 #include "internal.h"
 
 /*
@@ -346,3 +347,63 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	return ret;
 }
 
+/* Compact all zones within a node */
+static int compact_node(int nid)
+{
+	int zoneid;
+	pg_data_t *pgdat;
+	struct zone *zone;
+
+	if (nid < 0 || nid > nr_node_ids || !node_online(nid))
+		return -EINVAL;
+	pgdat = NODE_DATA(nid);
+
+	/* Flush pending updates to the LRU lists */
+	lru_add_drain_all();
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+		struct compact_control cc;
+
+		zone = &pgdat->node_zones[zoneid];
+		if (!populated_zone(zone))
+			continue;
+
+		cc.nr_freepages = 0;
+		cc.nr_migratepages = 0;
+		cc.zone = zone;
+		cc.order = -1;
+		INIT_LIST_HEAD(&cc.freepages);
+		INIT_LIST_HEAD(&cc.migratepages);
+
+		compact_zone(zone, &cc);
+
+		VM_BUG_ON(!list_empty(&cc.freepages));
+		VM_BUG_ON(!list_empty(&cc.migratepages));
+	}
+
+	return 0;
+}
+
+/* Compact all nodes in the system */
+static int compact_nodes(void)
+{
+	int nid;
+
+	for_each_online_node(nid)
+		compact_node(nid);
+
+	return COMPACT_COMPLETE;
+}
+
+/* The written value is actually unused, all memory is compacted */
+int sysctl_compact_memory;
+
+/* This is the entry point for compacting all nodes via /proc/sys/vm */
+int sysctl_compaction_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *length, loff_t *ppos)
+{
+	if (write)
+		return compact_nodes();
+
+	return 0;
+}
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 09/11] Add /sys trigger for per-node memory compaction
  2010-03-23 12:25 [PATCH 0/11] Memory Compaction v5 Mel Gorman
                   ` (7 preceding siblings ...)
  2010-03-23 12:25 ` [PATCH 08/11] Add /proc trigger for memory compaction Mel Gorman
@ 2010-03-23 12:25 ` Mel Gorman
  2010-03-23 18:27   ` Christoph Lameter
                     ` (2 more replies)
  2010-03-23 12:25 ` [PATCH 10/11] Direct compact when a high-order allocation fails Mel Gorman
  2010-03-23 12:25 ` [PATCH 11/11] Do not compact within a preferred zone after a compaction failure Mel Gorman
  10 siblings, 3 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 12:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

This patch adds a per-node sysfs file called compact. When the file is
written to, each zone in that node is compacted. The intention is that this
would be used by something like a job scheduler in a batch system before
a job starts so that the job can allocate the maximum number of
hugepages without significant start-up cost.
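
(Again not part of the patch: a minimal userspace sketch of how a batch
scheduler might walk the per-node files before placing a hugepage-hungry
job. It assumes nodes are numbered contiguously from 0 and simply stops at
the first missing nodeN directory; that assumption is for illustration
only.)

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	static int compact_one_node(int nid)
	{
		char path[64];
		int fd;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/compact", nid);
		fd = open(path, O_WRONLY);
		if (fd < 0)
			return -1;	/* node absent or compaction not built in */
		write(fd, "1", 1);	/* the value written is irrelevant */
		close(fd);
		return 0;
	}

	int main(void)
	{
		int nid;

		for (nid = 0; compact_one_node(nid) == 0; nid++)
			;
		return 0;
	}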

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 Documentation/ABI/testing/sysfs-devices-node |    7 +++++++
 drivers/base/node.c                          |    3 +++
 include/linux/compaction.h                   |   16 ++++++++++++++++
 mm/compaction.c                              |   23 +++++++++++++++++++++++
 4 files changed, 49 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-devices-node

diff --git a/Documentation/ABI/testing/sysfs-devices-node b/Documentation/ABI/testing/sysfs-devices-node
new file mode 100644
index 0000000..0cb286a
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-node
@@ -0,0 +1,7 @@
+What:		/sys/devices/system/node/nodeX/compact
+Date:		February 2010
+Contact:	Mel Gorman <mel@csn.ul.ie>
+Description:
+		When this file is written to, all memory within that node
+		will be compacted. When it completes, memory will be free
+		in as contiguous blocks as possible.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index ad43185..15fb30d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -15,6 +15,7 @@
 #include <linux/cpu.h>
 #include <linux/device.h>
 #include <linux/swap.h>
+#include <linux/compaction.h>
 
 static struct sysdev_class_attribute *node_state_attrs[];
 
@@ -242,6 +243,8 @@ int register_node(struct node *node, int num, struct node *parent)
 		scan_unevictable_register_node(node);
 
 		hugetlb_register_node(node);
+
+		compaction_register_node(node);
 	}
 	return error;
 }
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 52762d2..c94890b 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -11,4 +11,20 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 			void __user *buffer, size_t *length, loff_t *ppos);
 #endif /* CONFIG_COMPACTION */
 
+#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+extern int compaction_register_node(struct node *node);
+extern void compaction_unregister_node(struct node *node);
+
+#else
+
+static inline int compaction_register_node(struct node *node)
+{
+	return 0;
+}
+
+static inline void compaction_unregister_node(struct node *node)
+{
+}
+#endif /* CONFIG_COMPACTION && CONFIG_SYSFS && CONFIG_NUMA */
+
 #endif /* _LINUX_COMPACTION_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index faa9b53..8df6e3d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -12,6 +12,7 @@
 #include <linux/compaction.h>
 #include <linux/mm_inline.h>
 #include <linux/sysctl.h>
+#include <linux/sysfs.h>
 #include "internal.h"
 
 /*
@@ -407,3 +408,25 @@ int sysctl_compaction_handler(struct ctl_table *table, int write,
 
 	return 0;
 }
+
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+ssize_t sysfs_compact_node(struct sys_device *dev,
+			struct sysdev_attribute *attr,
+			const char *buf, size_t count)
+{
+	compact_node(dev->id);
+
+	return count;
+}
+static SYSDEV_ATTR(compact, S_IWUSR, NULL, sysfs_compact_node);
+
+int compaction_register_node(struct node *node)
+{
+	return sysdev_create_file(&node->sysdev, &attr_compact);
+}
+
+void compaction_unregister_node(struct node *node)
+{
+	return sysdev_remove_file(&node->sysdev, &attr_compact);
+}
+#endif /* CONFIG_SYSFS && CONFIG_NUMA */
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-23 12:25 [PATCH 0/11] Memory Compaction v5 Mel Gorman
                   ` (8 preceding siblings ...)
  2010-03-23 12:25 ` [PATCH 09/11] Add /sys trigger for per-node " Mel Gorman
@ 2010-03-23 12:25 ` Mel Gorman
  2010-03-23 23:10   ` Minchan Kim
                     ` (2 more replies)
  2010-03-23 12:25 ` [PATCH 11/11] Do not compact within a preferred zone after a compaction failure Mel Gorman
  10 siblings, 3 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 12:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

Ordinarily when a high-order allocation fails, direct reclaim is entered to
free pages to satisfy the allocation.  With this patch, it is determined whether
an allocation failed due to external fragmentation rather than low memory
and, if so, the calling process will compact memory until a suitable page is
freed. Compaction by moving pages in memory is considerably cheaper than
paging out to disk and works where there are locked pages or no swap. If
compaction fails to free a page of a suitable size, then reclaim will
still occur.

Direct compaction returns as soon as possible. As each block is compacted,
a check is made for whether a suitable page has been freed and, if so,
compaction returns.
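
To make "external fragmentation rather than low memory" concrete, here is a
worked example using the index formula in the mm/vmstat.c hunk below. The
numbers are invented for illustration and assume the request cannot
currently be satisfied, so the formula applies. Consider an order-9 request
(requested = 512 pages) against a zone with 1024 free pages:

  o All 1024 pages scattered as separate order-0 blocks (1024 free blocks):
	index = 1000 - (1000 + 1024*1000/512) / 1024
	      = 1000 - 3000/1024
	      = 998
    i.e. 0.998: the failure is almost entirely due to external
    fragmentation, the index is above the 500 cutoff and compaction is
    attempted for the zone.

  o The same 1024 pages sitting in four order-8 blocks (4 free blocks):
	index = 1000 - 3000/4
	      = 250
    i.e. 0.250: the failure is mostly down to a lack of free memory, so the
    zone is skipped and direct reclaim is used instead.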

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/compaction.h |   16 +++++-
 include/linux/vmstat.h     |    1 +
 mm/compaction.c            |  118 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |   26 ++++++++++
 mm/vmstat.c                |   15 +++++-
 5 files changed, 172 insertions(+), 4 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index c94890b..b851428 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,14 +1,26 @@
 #ifndef _LINUX_COMPACTION_H
 #define _LINUX_COMPACTION_H
 
-/* Return values for compact_zone() */
+/* Return values for compact_zone() and try_to_compact_pages() */
 #define COMPACT_INCOMPLETE	0
-#define COMPACT_COMPLETE	1
+#define COMPACT_PARTIAL		1
+#define COMPACT_COMPLETE	2
 
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 			void __user *buffer, size_t *length, loff_t *ppos);
+
+extern int fragmentation_index(struct zone *zone, unsigned int order);
+extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
+			int order, gfp_t gfp_mask, nodemask_t *mask);
+#else
+static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
+			int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+	return COMPACT_INCOMPLETE;
+}
+
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 56e4b44..b4b4d34 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
 		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
+		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
diff --git a/mm/compaction.c b/mm/compaction.c
index 8df6e3d..6688700 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -34,6 +34,8 @@ struct compact_control {
 	unsigned long nr_anon;
 	unsigned long nr_file;
 
+	unsigned int order;		/* order a direct compactor needs */
+	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
 };
 
@@ -301,10 +303,31 @@ static void update_nr_listpages(struct compact_control *cc)
 static inline int compact_finished(struct zone *zone,
 						struct compact_control *cc)
 {
+	unsigned int order;
+	unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
+
 	/* Compaction run completes if the migrate and free scanner meet */
 	if (cc->free_pfn <= cc->migrate_pfn)
 		return COMPACT_COMPLETE;
 
+	/* Compaction run is not finished if the watermark is not met */
+	if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
+		return COMPACT_INCOMPLETE;
+
+	if (cc->order == -1)
+		return COMPACT_INCOMPLETE;
+
+	/* Direct compactor: Is a suitable page free? */
+	for (order = cc->order; order < MAX_ORDER; order++) {
+		/* Job done if page is free of the right migratetype */
+		if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
+			return COMPACT_PARTIAL;
+
+		/* Job done if allocation would set block type */
+		if (order >= pageblock_order && zone->free_area[order].nr_free)
+			return COMPACT_PARTIAL;
+	}
+
 	return COMPACT_INCOMPLETE;
 }
 
@@ -348,6 +371,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	return ret;
 }
 
+static inline unsigned long compact_zone_order(struct zone *zone,
+						int order, gfp_t gfp_mask)
+{
+	struct compact_control cc = {
+		.nr_freepages = 0,
+		.nr_migratepages = 0,
+		.order = order,
+		.migratetype = allocflags_to_migratetype(gfp_mask),
+		.zone = zone,
+	};
+	INIT_LIST_HEAD(&cc.freepages);
+	INIT_LIST_HEAD(&cc.migratepages);
+
+	return compact_zone(zone, &cc);
+}
+
+/**
+ * try_to_compact_pages - Direct compact to satisfy a high-order allocation
+ * @zonelist: The zonelist used for the current allocation
+ * @order: The order of the current allocation
+ * @gfp_mask: The GFP mask of the current allocation
+ * @nodemask: The allowed nodes to allocate from
+ *
+ * This is the main entry point for direct page compaction.
+ */
+unsigned long try_to_compact_pages(struct zonelist *zonelist,
+			int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	int may_enter_fs = gfp_mask & __GFP_FS;
+	int may_perform_io = gfp_mask & __GFP_IO;
+	unsigned long watermark;
+	struct zoneref *z;
+	struct zone *zone;
+	int rc = COMPACT_INCOMPLETE;
+
+	/* Check whether it is worth even starting compaction */
+	if (order == 0 || !may_enter_fs || !may_perform_io)
+		return rc;
+
+	/*
+	 * We will not stall if the necessary conditions are not met for
+	 * migration but direct reclaim seems to account stalls similarly
+	 */
+	count_vm_event(COMPACTSTALL);
+
+	/* Compact each zone in the list */
+	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
+								nodemask) {
+		int fragindex;
+		int status;
+
+		/*
+		 * Watermarks for order-0 must be met for compaction. Note
+		 * the 2UL. This is because during migration, copies of
+		 * pages need to be allocated and for a short time, the
+		 * footprint is higher
+		 */
+		watermark = low_wmark_pages(zone) + (2UL << order);
+		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+			continue;
+
+		/*
+		 * fragmentation index determines if allocation failures are
+		 * due to low memory or external fragmentation
+		 *
+		 * index of -1 implies allocations might succeed depending
+		 * 	on watermarks
+		 * index < 500 implies alloc failure is due to lack of memory
+		 *
+		 * XXX: The choice of 500 is arbitrary. Reinvestigate
+		 *      appropriately to determine a sensible default.
+		 *      and what it means when watermarks are also taken
+		 *      into account. Consider making it a sysctl
+		 */
+		fragindex = fragmentation_index(zone, order);
+		if (fragindex >= 0 && fragindex <= 500)
+			continue;
+
+		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
+			rc = COMPACT_PARTIAL;
+			break;
+		}
+
+		status = compact_zone_order(zone, order, gfp_mask);
+		rc = max(status, rc);
+
+		if (zone_watermark_ok(zone, order, watermark, 0, 0))
+			break;
+	}
+
+	return rc;
+}
+
+
 /* Compact all zones within a node */
 static int compact_node(int nid)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9708143..e301108 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -49,6 +49,7 @@
 #include <linux/debugobjects.h>
 #include <linux/kmemleak.h>
 #include <linux/memory.h>
+#include <linux/compaction.h>
 #include <trace/events/kmem.h>
 #include <linux/ftrace_event.h>
 
@@ -1765,6 +1766,31 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	cond_resched();
 
+	/* Try memory compaction for high-order allocations before reclaim */
+	if (order) {
+		*did_some_progress = try_to_compact_pages(zonelist,
+						order, gfp_mask, nodemask);
+		if (*did_some_progress != COMPACT_INCOMPLETE) {
+			page = get_page_from_freelist(gfp_mask, nodemask,
+					order, zonelist, high_zoneidx,
+					alloc_flags, preferred_zone,
+					migratetype);
+			if (page) {
+				__count_vm_event(COMPACTSUCCESS);
+				return page;
+			}
+
+			/*
+			 * It's bad if compaction run occurs and fails.
+			 * The most likely reason is that pages exist,
+			 * but not enough to satisfy watermarks.
+			 */
+			count_vm_event(COMPACTFAIL);
+
+			cond_resched();
+		}
+	}
+
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
 	p->flags |= PF_MEMALLOC;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index af88647..c88f285 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -560,7 +560,7 @@ static int unusable_show(struct seq_file *m, void *arg)
  * The value can be used to determine if page reclaim or compaction
  * should be used
  */
-int fragmentation_index(unsigned int order, struct contig_page_info *info)
+int __fragmentation_index(unsigned int order, struct contig_page_info *info)
 {
 	unsigned long requested = 1UL << order;
 
@@ -580,6 +580,14 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
 	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
 }
 
+/* Same as __fragmentation index but allocs contig_page_info on stack */
+int fragmentation_index(struct zone *zone, unsigned int order)
+{
+	struct contig_page_info info;
+
+	fill_contig_page_info(zone, order, &info);
+	return __fragmentation_index(order, &info);
+}
 
 static void extfrag_show_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
@@ -595,7 +603,7 @@ static void extfrag_show_print(struct seq_file *m,
 				zone->name);
 	for (order = 0; order < MAX_ORDER; ++order) {
 		fill_contig_page_info(zone, order, &info);
-		index = fragmentation_index(order, &info);
+		index = __fragmentation_index(order, &info);
 		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
 	}
 
@@ -895,6 +903,9 @@ static const char * const vmstat_text[] = {
 	"compact_blocks_moved",
 	"compact_pages_moved",
 	"compact_pagemigrate_failed",
+	"compact_stall",
+	"compact_fail",
+	"compact_success",
 
 #ifdef CONFIG_HUGETLB_PAGE
 	"htlb_buddy_alloc_success",
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 11/11] Do not compact within a preferred zone after a compaction failure
  2010-03-23 12:25 [PATCH 0/11] Memory Compaction v5 Mel Gorman
                   ` (9 preceding siblings ...)
  2010-03-23 12:25 ` [PATCH 10/11] Direct compact when a high-order allocation fails Mel Gorman
@ 2010-03-23 12:25 ` Mel Gorman
  2010-03-23 18:31   ` Christoph Lameter
  2010-03-24 20:53   ` Andrew Morton
  10 siblings, 2 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 12:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

The fragmentation index may indicate that a failure it due to external
fragmentation, a compaction run complete and an allocation failure still
fail. There are two obvious reasons as to why

  o Page migration cannot move all pages so fragmentation remains
  o A suitable page may exist but watermarks are not met

In the event of compaction and allocation failure, this patch prevents
compaction happening for a short interval. It's only recorded on the
preferred zone but that should be enough coverage. This could have been
implemented similar to the zonelist_cache but the increased size of the
zonelist did not appear to be justified.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/compaction.h |   35 +++++++++++++++++++++++++++++++++++
 include/linux/mmzone.h     |    7 +++++++
 mm/page_alloc.c            |    5 ++++-
 3 files changed, 46 insertions(+), 1 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index b851428..bc7059d 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -14,6 +14,32 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask);
+
+/* defer_compaction - Do not compact within a zone until a given time */
+static inline void defer_compaction(struct zone *zone, unsigned long resume)
+{
+	/*
+	 * This function is called when compaction fails to result in a page
+	 * allocation success. This is somewhat unsatisfactory as the failure
+	 * to compact has nothing to do with time and everything to do with
+	 * the requested order, the number of free pages and watermarks. How
+	 * to wait on that is more unclear, but the answer would apply to
+	 * other areas where the VM waits based on time.
+	 */
+	zone->compact_resume = resume;
+}
+
+static inline int compaction_deferred(struct zone *zone)
+{
+	/* init once if necessary */
+	if (unlikely(!zone->compact_resume)) {
+		zone->compact_resume = jiffies;
+		return 0;
+	}
+
+	return time_before(jiffies, zone->compact_resume);
+}
+
 #else
 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask)
@@ -21,6 +47,15 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	return COMPACT_INCOMPLETE;
 }
 
+static inline void defer_compaction(struct zone *zone, unsigned long resume)
+{
+}
+
+static inline int compaction_deferred(struct zone *zone)
+{
+	return 1;
+}
+
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cf9e458..bde879b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -321,6 +321,13 @@ struct zone {
 	unsigned long		*pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
 
+#ifdef CONFIG_COMPACTION
+	/*
+	 * If a compaction fails, do not try compaction again until
+	 * jiffies is after the value of compact_resume
+	 */
+	unsigned long		compact_resume;
+#endif
 
 	ZONE_PADDING(_pad1_)
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e301108..f481df2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1767,7 +1767,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	cond_resched();
 
 	/* Try memory compaction for high-order allocations before reclaim */
-	if (order) {
+	if (order && !compaction_deferred(preferred_zone)) {
 		*did_some_progress = try_to_compact_pages(zonelist,
 						order, gfp_mask, nodemask);
 		if (*did_some_progress != COMPACT_INCOMPLETE) {
@@ -1787,6 +1787,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 			 */
 			count_vm_event(COMPACTFAIL);
 
+			/* On failure, avoid compaction for a short time. */
+			defer_compaction(preferred_zone, jiffies + HZ/50);
+
 			cond_resched();
 		}
 	}
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-23 12:25 ` [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
@ 2010-03-23 17:22   ` Christoph Lameter
  2010-03-23 18:04     ` Mel Gorman
  0 siblings, 1 reply; 78+ messages in thread
From: Christoph Lameter @ 2010-03-23 17:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010, Mel Gorman wrote:

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 98eaaf2..6eb1efe 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -603,6 +603,19 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  	 */
>  	if (PageAnon(page)) {
>  		rcu_read_lock();
> +
> +		/*
> +		 * If the page has no mappings any more, just bail. An
> +		 * unmapped anon page is likely to be freed soon but worse,
> +		 * it's possible its anon_vma disappeared between when
> +		 * the page was isolated and when we reached here while
> +		 * the RCU lock was not held
> +		 */
> +		if (!page_mapcount(page)) {
> +			rcu_read_unlock();
> +			goto uncharge;
> +		}
> +
>  		rcu_locked = 1;
>  		anon_vma = page_anon_vma(page);
>  		atomic_inc(&anon_vma->migrate_refcount);

A way to make this simpler would be to move "rcu_locked = 1" before the
if statement and then do

if (!page_mapcount(page))
	goto rcu_unlock;


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration
  2010-03-23 12:25 ` [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
@ 2010-03-23 17:25   ` Christoph Lameter
  2010-03-23 23:55   ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 78+ messages in thread
From: Christoph Lameter @ 2010-03-23 17:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm


Reviewed-by: Christoph Lameter <cl@linux-foundation.org>



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] Export unusable free space index via /proc/unusable_index
  2010-03-23 12:25 ` [PATCH 05/11] Export unusable free space index via /proc/unusable_index Mel Gorman
@ 2010-03-23 17:31   ` Christoph Lameter
  2010-03-23 18:14     ` Mel Gorman
  2010-03-24  0:03   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 78+ messages in thread
From: Christoph Lameter @ 2010-03-23 17:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010, Mel Gorman wrote:

> +/*
> + * Return an index indicating how much of the available free memory is
> + * unusable for an allocation of the requested size.
> + */
> +static int unusable_free_index(unsigned int order,
> +				struct contig_page_info *info)
> +{
> +	/* No free memory is interpreted as all free memory is unusable */
> +	if (info->free_pages == 0)
> +		return 1000;


Is that assumption correct? If you have no free memory then you do not
know about the fragmentation status that would result if you would run
reclaim and free some memory. Going into a compaction mode would not be
useful. Should this not return 0 to avoid any compaction run when all
memory is allocated?

Otherwise

Reviewed-by: Christoph Lameter <cl@linux-foundation.org>


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 06/11] Export fragmentation index via /proc/extfrag_index
  2010-03-23 12:25 ` [PATCH 06/11] Export fragmentation index via /proc/extfrag_index Mel Gorman
@ 2010-03-23 17:37   ` Christoph Lameter
  0 siblings, 0 replies; 78+ messages in thread
From: Christoph Lameter @ 2010-03-23 17:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm


Reviewed-by: Christoph Lameter <cl@linux-foundation.org>



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-23 12:25 ` [PATCH 07/11] Memory compaction core Mel Gorman
@ 2010-03-23 17:56   ` Christoph Lameter
  2010-03-23 18:15     ` Mel Gorman
  2010-03-24  1:03   ` KAMEZAWA Hiroyuki
  2010-03-24 20:33   ` Andrew Morton
  2 siblings, 1 reply; 78+ messages in thread
From: Christoph Lameter @ 2010-03-23 17:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010, Mel Gorman wrote:

> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 1f59d93..cf8bba7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -238,6 +239,11 @@ static inline void lru_cache_add_active_file(struct page *page)
>  	__lru_cache_add(page, LRU_ACTIVE_FILE);
>  }
>
> +/* LRU Isolation modes. */
> +#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
> +#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
> +#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
> +
>  /* linux/mm/vmscan.c */
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  					gfp_t gfp_mask, nodemask_t *mask);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 79c8098..ef89600 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -839,11 +839,6 @@ keep:
>  	return nr_reclaimed;
>  }
>
> -/* LRU Isolation modes. */
> -#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
> -#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
> -#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
> -
>  /*
>   * Attempt to remove the specified page from its LRU.  Only take this page
>   * if it is of the appropriate PageActive status.  Pages which are being

Put the above in a separate patch?


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-23 17:22   ` Christoph Lameter
@ 2010-03-23 18:04     ` Mel Gorman
  0 siblings, 0 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 18:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, Mar 23, 2010 at 12:22:57PM -0500, Christoph Lameter wrote:
> On Tue, 23 Mar 2010, Mel Gorman wrote:
> 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 98eaaf2..6eb1efe 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -603,6 +603,19 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >  	 */
> >  	if (PageAnon(page)) {
> >  		rcu_read_lock();
> > +
> > +		/*
> > +		 * If the page has no mappings any more, just bail. An
> > +		 * unmapped anon page is likely to be freed soon but worse,
> > +		 * it's possible its anon_vma disappeared between when
> > +		 * the page was isolated and when we reached here while
> > +		 * the RCU lock was not held
> > +		 */
> > +		if (!page_mapcount(page)) {
> > +			rcu_read_unlock();
> > +			goto uncharge;
> > +		}
> > +
> >  		rcu_locked = 1;
> >  		anon_vma = page_anon_vma(page);
> >  		atomic_inc(&anon_vma->migrate_refcount);
> 
> A way to make this simpler would be to move "rcu_locked = 1" before the
> if statement and then do
> 
> if (!page_mapcount(page))
> 	goto rcu_unlock;
> 

True. Fixed.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] Export unusable free space index via /proc/unusable_index
  2010-03-23 17:31   ` Christoph Lameter
@ 2010-03-23 18:14     ` Mel Gorman
  0 siblings, 0 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 18:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, Mar 23, 2010 at 12:31:35PM -0500, Christoph Lameter wrote:
> On Tue, 23 Mar 2010, Mel Gorman wrote:
> 
> > +/*
> > + * Return an index indicating how much of the available free memory is
> > + * unusable for an allocation of the requested size.
> > + */
> > +static int unusable_free_index(unsigned int order,
> > +				struct contig_page_info *info)
> > +{
> > +	/* No free memory is interpreted as all free memory is unusable */
> > +	if (info->free_pages == 0)
> > +		return 1000;
> 
> 
> Is that assumption correct? If you have no free memory then you do not
> know about the fragmentation status that would result if you would run
> reclaim and free some memory.

True, but reclaim and the freeing of memory is a possible future event.
At the time the index is being measured, saying "there is no free memory" and
"of the free memory available, none if it is usable" has the same end-result -
an allocation attempt will fail so the value makes sense.

If it returned zero, it would be a bit confusing. As memory within the zone
gets consumed, the value for high-orders would go towards 1 until there was
no free memory when it would suddenly go to 0. If you graphed that over
time, it would look a bit strange.

> Going into a compaction mode would not be
> useful. Should this not return 0 to avoid any compaction run when all
> memory is allocated?
> 

A combination of watermarks and fragmentation_index is what is used in
the compaction decision, not unusable_free_index.
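
To spell that decision out, it boils down to roughly the sketch below - a
stubbed userspace restatement of the per-zone check in patch 10, for
illustration only and not kernel code:

	#include <stdio.h>

	enum action { SKIP_ZONE, MAYBE_ALLOC_OK, TRY_COMPACTION };

	static enum action compaction_decision(int order0_watermark_ok,
					       int fragindex)
	{
		/* Order-0 watermarks must leave headroom for migration copies */
		if (!order0_watermark_ok)
			return SKIP_ZONE;

		/* index == -1: allocation may succeed depending on watermarks */
		if (fragindex == -1)
			return MAYBE_ALLOC_OK;

		/* Low index: failure is lack of memory, not fragmentation */
		if (fragindex <= 500)
			return SKIP_ZONE;

		return TRY_COMPACTION;
	}

	int main(void)
	{
		printf("%d %d\n", compaction_decision(1, 998),
				  compaction_decision(1, 250));
		return 0;
	}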

> Otherwise
> 
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> 

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-23 17:56   ` Christoph Lameter
@ 2010-03-23 18:15     ` Mel Gorman
  2010-03-23 18:33       ` Christoph Lameter
  0 siblings, 1 reply; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 18:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, Mar 23, 2010 at 12:56:30PM -0500, Christoph Lameter wrote:
> On Tue, 23 Mar 2010, Mel Gorman wrote:
> 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 1f59d93..cf8bba7 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -238,6 +239,11 @@ static inline void lru_cache_add_active_file(struct page *page)
> >  	__lru_cache_add(page, LRU_ACTIVE_FILE);
> >  }
> >
> > +/* LRU Isolation modes. */
> > +#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
> > +#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
> > +#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
> > +
> >  /* linux/mm/vmscan.c */
> >  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  					gfp_t gfp_mask, nodemask_t *mask);
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 79c8098..ef89600 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -839,11 +839,6 @@ keep:
> >  	return nr_reclaimed;
> >  }
> >
> > -/* LRU Isolation modes. */
> > -#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
> > -#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
> > -#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
> > -
> >  /*
> >   * Attempt to remove the specified page from its LRU.  Only take this page
> >   * if it is of the appropriate PageActive status.  Pages which are being
> 
> Put the above in a separate patch?
> 

I can if you prefer but it's so small, I didn't think it obscured the
clarity of the patch anyway. I would have somewhat expected the two
patches to be merged together before going upstream.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/11] Add /proc trigger for memory compaction
  2010-03-23 12:25 ` [PATCH 08/11] Add /proc trigger for memory compaction Mel Gorman
@ 2010-03-23 18:25   ` Christoph Lameter
  2010-03-23 18:32     ` Mel Gorman
  2010-03-24 20:33   ` Andrew Morton
  1 sibling, 1 reply; 78+ messages in thread
From: Christoph Lameter @ 2010-03-23 18:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010, Mel Gorman wrote:

> diff --git a/mm/compaction.c b/mm/compaction.c
> index 0d2e8aa..faa9b53 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -346,3 +347,63 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>  	return ret;
>  }
>
> +/* Compact all zones within a node */
> +static int compact_node(int nid)
> +{
> +	int zoneid;
> +	pg_data_t *pgdat;
> +	struct zone *zone;
> +
> +	if (nid < 0 || nid > nr_node_ids || !node_online(nid))

Must be nid >= nr_node_ids.

Otherwise

Reviewed-by: Christoph Lameter <cl@linux-foundation.org>


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/11] Add /sys trigger for per-node memory compaction
  2010-03-23 12:25 ` [PATCH 09/11] Add /sys trigger for per-node " Mel Gorman
@ 2010-03-23 18:27   ` Christoph Lameter
  2010-03-23 22:45   ` Minchan Kim
  2010-03-24  0:19   ` KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 78+ messages in thread
From: Christoph Lameter @ 2010-03-23 18:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm


Reviewed-by: Christoph Lameter <cl@linux-foundation.org>



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 11/11] Do not compact within a preferred zone after a compaction failure
  2010-03-23 12:25 ` [PATCH 11/11] Do not compact within a preferred zone after a compaction failure Mel Gorman
@ 2010-03-23 18:31   ` Christoph Lameter
  2010-03-23 18:39     ` Mel Gorman
  2010-03-24 20:53   ` Andrew Morton
  1 sibling, 1 reply; 78+ messages in thread
From: Christoph Lameter @ 2010-03-23 18:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010, Mel Gorman wrote:

> The fragmentation index may indicate that a failure it due to external

s/it/is/

> fragmentation, a compaction run complete and an allocation failure still

???

> fail. There are two obvious reasons as to why
>
>   o Page migration cannot move all pages so fragmentation remains
>   o A suitable page may exist but watermarks are not met
>
> In the event of compaction and allocation failure, this patch prevents
> compaction happening for a short interval. It's only recorded on the

compaction is "recorded"? deferred?

> preferred zone but that should be enough coverage. This could have been
> implemented similar to the zonelist_cache but the increased size of the
> zonelist did not appear to be justified.

> @@ -1787,6 +1787,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  			 */
>  			count_vm_event(COMPACTFAIL);
>
> +			/* On failure, avoid compaction for a short time. */
> +			defer_compaction(preferred_zone, jiffies + HZ/50);
> +

20ms? How was that interval determined?


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/11] Add /proc trigger for memory compaction
  2010-03-23 18:25   ` Christoph Lameter
@ 2010-03-23 18:32     ` Mel Gorman
  0 siblings, 0 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 18:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, Mar 23, 2010 at 01:25:47PM -0500, Christoph Lameter wrote:
> On Tue, 23 Mar 2010, Mel Gorman wrote:
> 
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 0d2e8aa..faa9b53 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -346,3 +347,63 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> >  	return ret;
> >  }
> >
> > +/* Compact all zones within a node */
> > +static int compact_node(int nid)
> > +{
> > +	int zoneid;
> > +	pg_data_t *pgdat;
> > +	struct zone *zone;
> > +
> > +	if (nid < 0 || nid > nr_node_ids || !node_online(nid))
> 
> Must be nid >= nr_node_ids.
> 

Oops, correct. It should be "impossible" to supply an incorrect nid here
but still.

> Otherwise
> 
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> 

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-23 18:15     ` Mel Gorman
@ 2010-03-23 18:33       ` Christoph Lameter
  2010-03-23 18:58         ` Mel Gorman
  0 siblings, 1 reply; 78+ messages in thread
From: Christoph Lameter @ 2010-03-23 18:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010, Mel Gorman wrote:

> I can if you prefer but it's so small, I didn't think it obscured the
> clarity of the patch anyway. I would have somewhat expected the two
> patches to be merged together before going upstream.

It exposes the definitions needed to run __isolate_lru_page(). The
definitions could be useful for other uses of page migrations.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 11/11] Do not compact within a preferred zone after a compaction failure
  2010-03-23 18:31   ` Christoph Lameter
@ 2010-03-23 18:39     ` Mel Gorman
  2010-03-23 19:27       ` Christoph Lameter
  0 siblings, 1 reply; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 18:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, Mar 23, 2010 at 01:31:43PM -0500, Christoph Lameter wrote:
> On Tue, 23 Mar 2010, Mel Gorman wrote:
> 
> > The fragmentation index may indicate that a failure it due to external
> 
> s/it/is/
> 

Correct.

> > fragmentation, a compaction run complete and an allocation failure still
> 
> ???
> 

I was having some sort of fit when I wrote that obviously. Try this on
for size

The fragmentation index may indicate that a failure is due to external
fragmentation but after a compaction run completes, it is still possible  
for an allocation to fail.

> > fail. There are two obvious reasons as to why
> >
> >   o Page migration cannot move all pages so fragmentation remains
> >   o A suitable page may exist but watermarks are not met
> >
> > In the event of compaction and allocation failure, this patch prevents
> > compaction happening for a short interval. It's only recorded on the
> 
> compaction is "recorded"? deferred?
> 

deferred makes more sense.

What I was thinking at the time was that compact_resume was stored in struct
zone - i.e. that is where it is recorded.

> > preferred zone but that should be enough coverage. This could have been
> > implemented similar to the zonelist_cache but the increased size of the
> > zonelist did not appear to be justified.
> 
> > @@ -1787,6 +1787,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >  			 */
> >  			count_vm_event(COMPACTFAIL);
> >
> > +			/* On failure, avoid compaction for a short time. */
> > +			defer_compaction(preferred_zone, jiffies + HZ/50);
> > +
> 
> 20ms? How was that interval determined?
> 

Matches the time the page allocator would defer to an event like
congestion. The choice is somewhat arbitrary. Ideally, there would be
some sort of event that would re-enable compaction but there wasn't an
obvious candidate so I used time.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-23 18:33       ` Christoph Lameter
@ 2010-03-23 18:58         ` Mel Gorman
  2010-03-23 19:20           ` Christoph Lameter
  0 siblings, 1 reply; 78+ messages in thread
From: Mel Gorman @ 2010-03-23 18:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, Mar 23, 2010 at 01:33:19PM -0500, Christoph Lameter wrote:
> On Tue, 23 Mar 2010, Mel Gorman wrote:
> 
> > I can if you prefer but it's so small, I didn't think it obscured the
> > clarity of the patch anyway. I would have somewhat expected the two
> > patches to be merged together before going upstream.
> 
> It exposes the definitions needed to run __isolate_lru_page(). The
> definitions could be useful for other uses of page migrations.
> 

Sure. Patch split out now and looks like

=== CUT HERE ===
Subject: [PATCH 07/12] Move definition for LRU isolation modes to a header

Currently, vmscan.c defines the isolation modes for
__isolate_lru_page(). Memory compaction needs access to these modes for
isolating pages for migration.  This patch exports them.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/swap.h |    5 +++++
 mm/vmscan.c          |    5 -----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1f59d93..986b12d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page)
 	__lru_cache_add(page, LRU_ACTIVE_FILE);
 }
 
+/* LRU Isolation modes. */
+#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
+#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
+#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
+
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..ef89600 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -839,11 +839,6 @@ keep:
 	return nr_reclaimed;
 }
 
-/* LRU Isolation modes. */
-#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
-#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
-#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
-
 /*
  * Attempt to remove the specified page from its LRU.  Only take this page
  * if it is of the appropriate PageActive status.  Pages which are being
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-23 18:58         ` Mel Gorman
@ 2010-03-23 19:20           ` Christoph Lameter
  0 siblings, 0 replies; 78+ messages in thread
From: Christoph Lameter @ 2010-03-23 19:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm


Acked-by: Christoph Lameter <cl@linux-foundation.org>


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 11/11] Do not compact within a preferred zone after a compaction failure
  2010-03-23 18:39     ` Mel Gorman
@ 2010-03-23 19:27       ` Christoph Lameter
  2010-03-24 10:37         ` Mel Gorman
  0 siblings, 1 reply; 78+ messages in thread
From: Christoph Lameter @ 2010-03-23 19:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010, Mel Gorman wrote:

> I was having some sort of fit when I wrote that obviously. Try this on
> for size
>
> The fragmentation index may indicate that a failure is due to external
> fragmentation but after a compaction run completes, it is still possible
> for an allocation to fail.

Ok.

> > > fail. There are two obvious reasons as to why
> > >
> > >   o Page migration cannot move all pages so fragmentation remains
> > >   o A suitable page may exist but watermarks are not met
> > >
> > > In the event of compaction and allocation failure, this patch prevents
> > > compaction happening for a short interval. It's only recorded on the
> >
> > compaction is "recorded"? deferred?
> >
>
> deferred makes more sense.
>
> What I was thinking at the time was that compact_resume was stored in struct
> zone - i.e. that is where it is recorded.

Ok adding a dozen or more words here may be useful.

> > > preferred zone but that should be enough coverage. This could have been
> > > implemented similar to the zonelist_cache but the increased size of the
> > > zonelist did not appear to be justified.
> >
> > > @@ -1787,6 +1787,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> > >  			 */
> > >  			count_vm_event(COMPACTFAIL);
> > >
> > > +			/* On failure, avoid compaction for a short time. */
> > > +			defer_compaction(preferred_zone, jiffies + HZ/50);
> > > +
> >
> > 20ms? How was that interval determined?
> >
>
> Matches the time the page allocator would defer to an event like
> congestion. The choice is somewhat arbitrary. Ideally, there would be
> some sort of event that would re-enable compaction but there wasn't an
> obvious candidate so I used time.

There are frequent uses of HZ/10 as well, especially in vmscan.c. A longer
time may be better? HZ/50 looks like an interval used for writeout, but this
is related to reclaim?


 backing-dev.h    <global>                      283 long congestion_wait(int sync, long timeout);
1 backing-dev.c    <global>                      762 EXPORT_SYMBOL(congestion_wait);
2 usercopy_32.c    __copy_to_user_ll             754 congestion_wait(BLK_RW_ASYNC, HZ/50);
3 pktcdvd.c        pkt_make_request             2557 congestion_wait(BLK_RW_ASYNC, HZ);
4 dm-crypt.c       kcryptd_crypt_write_convert   834 congestion_wait(BLK_RW_ASYNC, HZ/100);
5 file.c           fat_file_release              137 congestion_wait(BLK_RW_ASYNC, HZ/10);
6 journal.c        reiserfs_async_progress_wait  990 congestion_wait(BLK_RW_ASYNC, HZ / 10);
7 kmem.c           kmem_alloc                     61 congestion_wait(BLK_RW_ASYNC, HZ/50);
8 kmem.c           kmem_zone_alloc               117 congestion_wait(BLK_RW_ASYNC, HZ/50);
9 xfs_buf.c        _xfs_buf_lookup_pages         343 congestion_wait(BLK_RW_ASYNC, HZ/50);
a backing-dev.c    congestion_wait               751 long congestion_wait(int sync, long timeout)
b memcontrol.c     mem_cgroup_force_empty       2858 congestion_wait(BLK_RW_ASYNC, HZ/10);
c page-writeback.c throttle_vm_writeout          674 congestion_wait(BLK_RW_ASYNC, HZ/10);
d page_alloc.c     __alloc_pages_high_priority  1753 congestion_wait(BLK_RW_ASYNC, HZ/50);
e page_alloc.c     __alloc_pages_slowpath       1924 congestion_wait(BLK_RW_ASYNC, HZ/50);
f vmscan.c         shrink_inactive_list         1136 congestion_wait(BLK_RW_ASYNC, HZ/10);
g vmscan.c         shrink_inactive_list         1220 congestion_wait(BLK_RW_ASYNC, HZ/10);
h vmscan.c         do_try_to_free_pages         1837 congestion_wait(BLK_RW_ASYNC, HZ/10);
i vmscan.c         balance_pgdat                2161 congestion_wait(BLK_RW_ASYNC, HZ/10);
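
A minimal sketch of the deferral being discussed -- assuming, per Mel's note
above, that compact_resume is a jiffies value stored in struct zone; the
compaction_deferred() helper name here is a hypothetical illustration, not
taken from the patch:

#include <linux/jiffies.h>
#include <linux/mmzone.h>

/*
 * Illustrative sketch only. Assumes struct zone has been extended with an
 * unsigned long compact_resume field, as described earlier in the thread.
 */
static void defer_compaction(struct zone *zone, unsigned long resume)
{
	/* Record when compaction may next be attempted for this zone */
	zone->compact_resume = resume;
}

static int compaction_deferred(struct zone *zone)
{
	/* Skip compaction until the resume time has passed */
	return time_before(jiffies, zone->compact_resume);
}

The allocator side would then pair defer_compaction(zone, jiffies + HZ/50) on
a compaction failure with a compaction_deferred(zone) check before the next
attempt; per Mel's reply, HZ/50 (20ms) simply matches the congestion_wait()
intervals rather than being derived from any property of compaction itself.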


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/11] Add /sys trigger for per-node memory compaction
  2010-03-23 12:25 ` [PATCH 09/11] Add /sys trigger for per-node " Mel Gorman
  2010-03-23 18:27   ` Christoph Lameter
@ 2010-03-23 22:45   ` Minchan Kim
  2010-03-24  0:19   ` KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 78+ messages in thread
From: Minchan Kim @ 2010-03-23 22:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, Mar 23, 2010 at 9:25 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> This patch adds a per-node sysfs file called compact. When the file is
> written to, each zone in that node is compacted. The intention is that this
> would be used by something like a job scheduler in a batch system before
> a job starts so that the job can allocate the maximum number of
> hugepages without significant start-up cost.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>


-- 
Kind regards,
Minchan Kim
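
To illustrate the intended batch-scheduler usage, here is a hypothetical
userspace sketch; it assumes the file appears at the standard per-node sysfs
location, which is not spelled out in the excerpt above:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: compact one node before starting a hugepage-hungry job */
static int compact_node_before_job(int nid)
{
	char path[64];
	int fd;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/compact", nid);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	/* Any write triggers compaction of every zone in the node */
	if (write(fd, "1", 1) != 1) {
		close(fd);
		return -1;
	}
	return close(fd);
}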

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-23 12:25 ` [PATCH 10/11] Direct compact when a high-order allocation fails Mel Gorman
@ 2010-03-23 23:10   ` Minchan Kim
  2010-03-24 11:11     ` Mel Gorman
  2010-03-24  1:19   ` KAMEZAWA Hiroyuki
  2010-03-24 20:48   ` Andrew Morton
  2 siblings, 1 reply; 78+ messages in thread
From: Minchan Kim @ 2010-03-23 23:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

Hi, Mel.

On Tue, Mar 23, 2010 at 9:25 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> Ordinarily when a high-order allocation fails, direct reclaim is entered to
> free pages to satisfy the allocation.  With this patch, it is determined if
> an allocation failed due to external fragmentation instead of low memory
> and if so, the calling process will compact until a suitable page is
> freed. Compaction by moving pages in memory is considerably cheaper than
> paging out to disk and works where there are locked pages or no swap. If
> compaction fails to free a page of a suitable size, then reclaim will
> still occur.
>
> Direct compaction returns as soon as possible. As each block is compacted,
> it is checked if a suitable page has been freed and if so, it returns.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>  include/linux/compaction.h |   16 +++++-
>  include/linux/vmstat.h     |    1 +
>  mm/compaction.c            |  118 ++++++++++++++++++++++++++++++++++++++++++++
>  mm/page_alloc.c            |   26 ++++++++++
>  mm/vmstat.c                |   15 +++++-
>  5 files changed, 172 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index c94890b..b851428 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -1,14 +1,26 @@
>  #ifndef _LINUX_COMPACTION_H
>  #define _LINUX_COMPACTION_H
>
> -/* Return values for compact_zone() */
> +/* Return values for compact_zone() and try_to_compact_pages() */
>  #define COMPACT_INCOMPLETE     0
> -#define COMPACT_COMPLETE       1
> +#define COMPACT_PARTIAL                1
> +#define COMPACT_COMPLETE       2
>
>  #ifdef CONFIG_COMPACTION
>  extern int sysctl_compact_memory;
>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>                        void __user *buffer, size_t *length, loff_t *ppos);
> +
> +extern int fragmentation_index(struct zone *zone, unsigned int order);
> +extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> +                       int order, gfp_t gfp_mask, nodemask_t *mask);
> +#else
> +static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> +                       int order, gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> +       return COMPACT_INCOMPLETE;
> +}
> +
>  #endif /* CONFIG_COMPACTION */
>
>  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 56e4b44..b4b4d34 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>                KSWAPD_SKIP_CONGESTION_WAIT,
>                PAGEOUTRUN, ALLOCSTALL, PGROTATED,
>                COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> +               COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
>  #ifdef CONFIG_HUGETLB_PAGE
>                HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
>  #endif
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 8df6e3d..6688700 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -34,6 +34,8 @@ struct compact_control {
>        unsigned long nr_anon;
>        unsigned long nr_file;
>
> +       unsigned int order;             /* order a direct compactor needs */
> +       int migratetype;                /* MOVABLE, RECLAIMABLE etc */
>        struct zone *zone;
>  };
>
> @@ -301,10 +303,31 @@ static void update_nr_listpages(struct compact_control *cc)
>  static inline int compact_finished(struct zone *zone,
>                                                struct compact_control *cc)
>  {
> +       unsigned int order;
> +       unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
> +
>        /* Compaction run completes if the migrate and free scanner meet */
>        if (cc->free_pfn <= cc->migrate_pfn)
>                return COMPACT_COMPLETE;
>
> +       /* Compaction run is not finished if the watermark is not met */
> +       if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
> +               return COMPACT_INCOMPLETE;
> +
> +       if (cc->order == -1)
> +               return COMPACT_INCOMPLETE;
> +
> +       /* Direct compactor: Is a suitable page free? */
> +       for (order = cc->order; order < MAX_ORDER; order++) {
> +               /* Job done if page is free of the right migratetype */
> +               if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
> +                       return COMPACT_PARTIAL;
> +
> +               /* Job done if allocation would set block type */
> +               if (order >= pageblock_order && zone->free_area[order].nr_free)
> +                       return COMPACT_PARTIAL;
> +       }
> +
>        return COMPACT_INCOMPLETE;
>  }
>
> @@ -348,6 +371,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>        return ret;
>  }
>
> +static inline unsigned long compact_zone_order(struct zone *zone,
> +                                               int order, gfp_t gfp_mask)
> +{
> +       struct compact_control cc = {
> +               .nr_freepages = 0,
> +               .nr_migratepages = 0,
> +               .order = order,
> +               .migratetype = allocflags_to_migratetype(gfp_mask),
> +               .zone = zone,
> +       };
> +       INIT_LIST_HEAD(&cc.freepages);
> +       INIT_LIST_HEAD(&cc.migratepages);
> +
> +       return compact_zone(zone, &cc);
> +}
> +
> +/**
> + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> + * @zonelist: The zonelist used for the current allocation
> + * @order: The order of the current allocation
> + * @gfp_mask: The GFP mask of the current allocation
> + * @nodemask: The allowed nodes to allocate from
> + *
> + * This is the main entry point for direct page compaction.
> + */
> +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> +                       int order, gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> +       enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> +       int may_enter_fs = gfp_mask & __GFP_FS;
> +       int may_perform_io = gfp_mask & __GFP_IO;
> +       unsigned long watermark;
> +       struct zoneref *z;
> +       struct zone *zone;
> +       int rc = COMPACT_INCOMPLETE;
> +
> +       /* Check whether it is worth even starting compaction */
> +       if (order == 0 || !may_enter_fs || !may_perform_io)
> +               return rc;
> +
> +       /*
> +        * We will not stall if the necessary conditions are not met for
> +        * migration but direct reclaim seems to account stalls similarly
> +        */

I can't understand this comment.
In the case of direct reclaim, the long time spent in shrink_zones is simply
a stall from the point of view of the allocating caller.
So "Allocation is stalled" makes sense to me.

But "Compaction is stalled" doesn't make sense to me.
How about "COMPACTION_DIRECT" like "PGSCAN_DIRECT"?
I think it's straightforward.
Naming is important since it becomes ABI.

> +       count_vm_event(COMPACTSTALL);
> +





-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration
  2010-03-23 12:25 ` [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
  2010-03-23 17:25   ` Christoph Lameter
@ 2010-03-23 23:55   ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 78+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-23 23:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010 12:25:38 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> For clarity of review, KSM and page migration have separate refcounts on
> the anon_vma. While clear, this is a waste of memory. This patch gets
> KSM and page migration to share their toys in a spirit of harmony.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
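
As context for what the sharing amounts to, a sketch of the idea -- the field
name and ifdef condition here are assumptions for illustration, not quoted
from the patch:

struct anon_vma {
	spinlock_t lock;	/* Serialize access to vma list */
#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
	/*
	 * Single reference count shared by all external users of the
	 * anon_vma (KSM and page migration), replacing separate per-user
	 * counts.
	 */
	atomic_t external_refcount;
#endif
	struct list_head head;	/* Chain of anon_vma_chains to this anon_vma */
};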


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] Export unusable free space index via /proc/unusable_index
  2010-03-23 12:25 ` [PATCH 05/11] Export unusable free space index via /proc/unusable_index Mel Gorman
  2010-03-23 17:31   ` Christoph Lameter
@ 2010-03-24  0:03   ` KAMEZAWA Hiroyuki
  2010-03-24  0:16     ` Minchan Kim
  2010-03-24 10:25     ` Mel Gorman
  1 sibling, 2 replies; 78+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-24  0:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010 12:25:40 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> Unusable free space index is a measure of external fragmentation that
> takes the allocation size into account. For the most part, the huge page
> size will be the size of interest, but not necessarily, so it is exported
> on a per-order and per-zone basis via /proc/unusable_index.
> 
> The index is a value between 0 and 1. It can be expressed as a
> percentage by multiplying by 100 as documented in
> Documentation/filesystems/proc.txt.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>  Documentation/filesystems/proc.txt |   13 ++++-
>  mm/vmstat.c                        |  120 +++++++++++++++++++++++++++++++++
>  2 files changed, 132 insertions(+), 1 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 5e132b5..5c4b0fb 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -452,6 +452,7 @@ Table 1-5: Kernel info in /proc
>   sys         See chapter 2                                     
>   sysvipc     Info of SysVIPC Resources (msg, sem, shm)		(2.4)
>   tty	     Info of tty drivers
> + unusable_index Additional page allocator information (see text)(2.5)
>   uptime      System uptime                                     
>   version     Kernel version                                    
>   video	     bttv info of video resources			(2.4)
> @@ -609,7 +610,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
>  available in ZONE_NORMAL, etc... 
>  
>  More information relevant to external fragmentation can be found in
> -pagetypeinfo.
> +pagetypeinfo and unusable_index
>  
>  > cat /proc/pagetypeinfo
>  Page block order: 9
> @@ -650,6 +651,16 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
>  also be allocatable although a lot of filesystem metadata may have to be
>  reclaimed to achieve this.
>  
> +> cat /proc/unusable_index
> +Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
> +Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
> +
> +The unusable free space index measures how much of the available free
> +memory cannot be used to satisfy an allocation of a given size and is a
> +value between 0 and 1. The higher the value, the more of the free memory is
> +unusable and by implication, the worse the external fragmentation is. This
> +can be expressed as a percentage by multiplying by 100.
> +
>  ..............................................................................
>  
>  meminfo:
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 7f760cb..ca42e10 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -453,6 +453,106 @@ static int frag_show(struct seq_file *m, void *arg)
>  	return 0;
>  }
>  
> +
> +struct contig_page_info {
> +	unsigned long free_pages;
> +	unsigned long free_blocks_total;
> +	unsigned long free_blocks_suitable;
> +};
> +
> +/*
> + * Calculate the number of free pages in a zone, how many contiguous
> + * pages are free and how many are large enough to satisfy an allocation of
> + * the target size. Note that this function makes no attempt to estimate
> + * how many suitable free blocks there *might* be if MOVABLE pages were
> + * migrated. Calculating that is possible, but expensive and can be
> + * figured out from userspace
> + */
> +static void fill_contig_page_info(struct zone *zone,
> +				unsigned int suitable_order,
> +				struct contig_page_info *info)
> +{
> +	unsigned int order;
> +
> +	info->free_pages = 0;
> +	info->free_blocks_total = 0;
> +	info->free_blocks_suitable = 0;
> +
> +	for (order = 0; order < MAX_ORDER; order++) {
> +		unsigned long blocks;
> +
> +		/* Count number of free blocks */
> +		blocks = zone->free_area[order].nr_free;
> +		info->free_blocks_total += blocks;

....what is this free_blocks_total for?

Thanks,
-Kame
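
To tie the /proc output quoted above to these counters, here is a sketch of
how an unusable index can be derived from struct contig_page_info; the helper
name and exact expression are illustrative assumptions, not lifted from the
patch:

/*
 * Sketch only (hypothetical helper). Assumes free_blocks_suitable counts
 * free chunks of at least the requested order, expressed in requested-order
 * units. The result is scaled by 1000 to match the 0.000-1.000 values
 * printed in /proc/unusable_index.
 */
static int unusable_free_index(unsigned int order,
				struct contig_page_info *info)
{
	/* No free memory at all: report everything as unusable */
	if (!info->free_pages)
		return 1000;

	/*
	 * Fraction of the free pages sitting in blocks too small to
	 * satisfy an allocation of the requested order.
	 */
	return ((info->free_pages -
			(info->free_blocks_suitable << order)) * 1000) /
				info->free_pages;
}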


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] Export unusable free space index via  /proc/unusable_index
  2010-03-24  0:16     ` Minchan Kim
@ 2010-03-24  0:13       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 78+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-24  0:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, 24 Mar 2010 09:16:07 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> Hi, Kame.
> 
> On Wed, Mar 24, 2010 at 9:03 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 23 Mar 2010 12:25:40 +0000
> > Mel Gorman <mel@csn.ul.ie> wrote:
> >
> >> Unusable free space index is a measure of external fragmentation that
> >> takes the allocation size into account. For the most part, the huge page
> >> size will be the size of interest, but not necessarily, so it is exported
> >> on a per-order and per-zone basis via /proc/unusable_index.
> >>
> >> The index is a value between 0 and 1. It can be expressed as a
> >> percentage by multiplying by 100 as documented in
> >> Documentation/filesystems/proc.txt.
> >>
> >> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> >> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >> Acked-by: Rik van Riel <riel@redhat.com>
> >> ---
> >>  Documentation/filesystems/proc.txt |   13 ++++-
> >>  mm/vmstat.c                        |  120 +++++++++++++++++++++++++++++++++
> >>  2 files changed, 132 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> >> index 5e132b5..5c4b0fb 100644
> >> --- a/Documentation/filesystems/proc.txt
> >> +++ b/Documentation/filesystems/proc.txt
> >> @@ -452,6 +452,7 @@ Table 1-5: Kernel info in /proc
> >>   sys         See chapter 2
> >>   sysvipc     Info of SysVIPC Resources (msg, sem, shm)               (2.4)
> >>   tty      Info of tty drivers
> >> + unusable_index Additional page allocator information (see text)(2.5)
> >>   uptime      System uptime
> >>   version     Kernel version
> >>   video            bttv info of video resources                       (2.4)
> >> @@ -609,7 +610,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
> >>  available in ZONE_NORMAL, etc...
> >>
> >>  More information relevant to external fragmentation can be found in
> >> -pagetypeinfo.
> >> +pagetypeinfo and unusable_index
> >>
> >>  > cat /proc/pagetypeinfo
> >>  Page block order: 9
> >> @@ -650,6 +651,16 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
> >>  also be allocatable although a lot of filesystem metadata may have to be
> >>  reclaimed to achieve this.
> >>
> >> +> cat /proc/unusable_index
> >> +Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
> >> +Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
> >> +
> >> +The unusable free space index measures how much of the available free
> >> +memory cannot be used to satisfy an allocation of a given size and is a
> >> +value between 0 and 1. The higher the value, the more of the free memory is
> >> +unusable and by implication, the worse the external fragmentation is. This
> >> +can be expressed as a percentage by multiplying by 100.
> >> +
> >>  ..............................................................................
> >>
> >>  meminfo:
> >> diff --git a/mm/vmstat.c b/mm/vmstat.c
> >> index 7f760cb..ca42e10 100644
> >> --- a/mm/vmstat.c
> >> +++ b/mm/vmstat.c
> >> @@ -453,6 +453,106 @@ static int frag_show(struct seq_file *m, void *arg)
> >>       return 0;
> >>  }
> >>
> >> +
> >> +struct contig_page_info {
> >> +     unsigned long free_pages;
> >> +     unsigned long free_blocks_total;
> >> +     unsigned long free_blocks_suitable;
> >> +};
> >> +
> >> +/*
> >> + * Calculate the number of free pages in a zone, how many contiguous
> >> + * pages are free and how many are large enough to satisfy an allocation of
> >> + * the target size. Note that this function makes no attempt to estimate
> >> + * how many suitable free blocks there *might* be if MOVABLE pages were
> >> + * migrated. Calculating that is possible, but expensive and can be
> >> + * figured out from userspace
> >> + */
> >> +static void fill_contig_page_info(struct zone *zone,
> >> +                             unsigned int suitable_order,
> >> +                             struct contig_page_info *info)
> >> +{
> >> +     unsigned int order;
> >> +
> >> +     info->free_pages = 0;
> >> +     info->free_blocks_total = 0;
> >> +     info->free_blocks_suitable = 0;
> >> +
> >> +     for (order = 0; order < MAX_ORDER; order++) {
> >> +             unsigned long blocks;
> >> +
> >> +             /* Count number of free blocks */
> >> +             blocks = zone->free_area[order].nr_free;
> >> +             info->free_blocks_total += blocks;
> >
> > ....what is this free_blocks_total for?
> 
> It's used by fragmentation_index in [06/11].
> 
Ah, I see. thanks.

-Kame
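
For concreteness, free_blocks_total feeds the fragmentation index formula
quoted later in this thread (patch 10/11 discussion):

	index = 1000 - (1000 + free_pages * 1000 / requested) / free_blocks_total

applied when no block of the requested size is already free. Two hypothetical
cases for an order-4 request (requested = 16 pages), using integer division:

	64 free pages in 64 order-0 blocks: 1000 - (1000 + 4000)/64 = 922  (external fragmentation)
	 2 free pages in  2 order-0 blocks: 1000 - (1000 +  125)/2  = 438  (lack of memory)

Values above the 500 cut-off used in try_to_compact_pages() point at
fragmentation that compaction can address; values below it point at a genuine
shortage where reclaim is the right response.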


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] Export unusable free space index via  /proc/unusable_index
  2010-03-24  0:03   ` KAMEZAWA Hiroyuki
@ 2010-03-24  0:16     ` Minchan Kim
  2010-03-24  0:13       ` KAMEZAWA Hiroyuki
  2010-03-24 10:25     ` Mel Gorman
  1 sibling, 1 reply; 78+ messages in thread
From: Minchan Kim @ 2010-03-24  0:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

Hi, Kame.

On Wed, Mar 24, 2010 at 9:03 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 23 Mar 2010 12:25:40 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
>
>> Unusable free space index is a measure of external fragmentation that
>> takes the allocation size into account. For the most part, the huge page
>> size will be the size of interest, but not necessarily, so it is exported
>> on a per-order and per-zone basis via /proc/unusable_index.
>>
>> The index is a value between 0 and 1. It can be expressed as a
>> percentage by multiplying by 100 as documented in
>> Documentation/filesystems/proc.txt.
>>
>> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
>> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> Acked-by: Rik van Riel <riel@redhat.com>
>> ---
>>  Documentation/filesystems/proc.txt |   13 ++++-
>>  mm/vmstat.c                        |  120 +++++++++++++++++++++++++++++++++
>>  2 files changed, 132 insertions(+), 1 deletions(-)
>>
>> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
>> index 5e132b5..5c4b0fb 100644
>> --- a/Documentation/filesystems/proc.txt
>> +++ b/Documentation/filesystems/proc.txt
>> @@ -452,6 +452,7 @@ Table 1-5: Kernel info in /proc
>>   sys         See chapter 2
>>   sysvipc     Info of SysVIPC Resources (msg, sem, shm)               (2.4)
>>   tty      Info of tty drivers
>> + unusable_index Additional page allocator information (see text)(2.5)
>>   uptime      System uptime
>>   version     Kernel version
>>   video            bttv info of video resources                       (2.4)
>> @@ -609,7 +610,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
>>  available in ZONE_NORMAL, etc...
>>
>>  More information relevant to external fragmentation can be found in
>> -pagetypeinfo.
>> +pagetypeinfo and unusable_index
>>
>>  > cat /proc/pagetypeinfo
>>  Page block order: 9
>> @@ -650,6 +651,16 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
>>  also be allocatable although a lot of filesystem metadata may have to be
>>  reclaimed to achieve this.
>>
>> +> cat /proc/unusable_index
>> +Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
>> +Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
>> +
>> +The unusable free space index measures how much of the available free
>> +memory cannot be used to satisfy an allocation of a given size and is a
>> +value between 0 and 1. The higher the value, the more of the free memory is
>> +unusable and by implication, the worse the external fragmentation is. This
>> +can be expressed as a percentage by multiplying by 100.
>> +
>>  ..............................................................................
>>
>>  meminfo:
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 7f760cb..ca42e10 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -453,6 +453,106 @@ static int frag_show(struct seq_file *m, void *arg)
>>       return 0;
>>  }
>>
>> +
>> +struct contig_page_info {
>> +     unsigned long free_pages;
>> +     unsigned long free_blocks_total;
>> +     unsigned long free_blocks_suitable;
>> +};
>> +
>> +/*
>> + * Calculate the number of free pages in a zone, how many contiguous
>> + * pages are free and how many are large enough to satisfy an allocation of
>> + * the target size. Note that this function makes no attempt to estimate
>> + * how many suitable free blocks there *might* be if MOVABLE pages were
>> + * migrated. Calculating that is possible, but expensive and can be
>> + * figured out from userspace
>> + */
>> +static void fill_contig_page_info(struct zone *zone,
>> +                             unsigned int suitable_order,
>> +                             struct contig_page_info *info)
>> +{
>> +     unsigned int order;
>> +
>> +     info->free_pages = 0;
>> +     info->free_blocks_total = 0;
>> +     info->free_blocks_suitable = 0;
>> +
>> +     for (order = 0; order < MAX_ORDER; order++) {
>> +             unsigned long blocks;
>> +
>> +             /* Count number of free blocks */
>> +             blocks = zone->free_area[order].nr_free;
>> +             info->free_blocks_total += blocks;
>
> ....what is this free_blocks_total for?

It's used by fragmentation_index in [06/11].

>
> Thanks,
> -Kame
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/11] Add /sys trigger for per-node memory compaction
  2010-03-23 12:25 ` [PATCH 09/11] Add /sys trigger for per-node " Mel Gorman
  2010-03-23 18:27   ` Christoph Lameter
  2010-03-23 22:45   ` Minchan Kim
@ 2010-03-24  0:19   ` KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 78+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-24  0:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010 12:25:44 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> This patch adds a per-node sysfs file called compact. When the file is
> written to, each zone in that node is compacted. The intention is that this
> would be used by something like a job scheduler in a batch system before
> a job starts so that the job can allocate the maximum number of
> hugepages without significant start-up cost.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-23 12:25 ` [PATCH 07/11] Memory compaction core Mel Gorman
  2010-03-23 17:56   ` Christoph Lameter
@ 2010-03-24  1:03   ` KAMEZAWA Hiroyuki
  2010-03-24  1:47     ` Minchan Kim
  2010-03-24 20:33   ` Andrew Morton
  2 siblings, 1 reply; 78+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-24  1:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010 12:25:42 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> This patch is the core of a mechanism which compacts memory in a zone by
> relocating movable pages towards the end of the zone.
> 
> A single compaction run involves a migration scanner and a free scanner.
> Both scanners operate on pageblock-sized areas in the zone. The migration
> scanner starts at the bottom of the zone and searches for all movable pages
> within each area, isolating them onto a private list called migratelist.
> The free scanner starts at the top of the zone and searches for suitable
> areas and consumes the free pages within them, making them available for the
> migration scanner. The pages isolated for migration are then migrated to
> the newly isolated free pages.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

I think lru_add_drain() or lru_add_drain_all() should be called somewhere
when we do __isolate_lru_page(). But it (the _all variant, at least) is slow....

But,
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


> ---
>  include/linux/compaction.h |    8 +
>  include/linux/mm.h         |    1 +
>  include/linux/swap.h       |    6 +
>  include/linux/vmstat.h     |    1 +
>  mm/Makefile                |    1 +
>  mm/compaction.c            |  348 ++++++++++++++++++++++++++++++++++++++++++++
>  mm/page_alloc.c            |   39 +++++
>  mm/vmscan.c                |    5 -
>  mm/vmstat.c                |    5 +
>  9 files changed, 409 insertions(+), 5 deletions(-)
>  create mode 100644 include/linux/compaction.h
>  create mode 100644 mm/compaction.c
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> new file mode 100644
> index 0000000..6201371
> --- /dev/null
> +++ b/include/linux/compaction.h
> @@ -0,0 +1,8 @@
> +#ifndef _LINUX_COMPACTION_H
> +#define _LINUX_COMPACTION_H
> +
> +/* Return values for compact_zone() */
> +#define COMPACT_INCOMPLETE	0
> +#define COMPACT_COMPLETE	1
> +
> +#endif /* _LINUX_COMPACTION_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f3b473a..f920815 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -335,6 +335,7 @@ void put_page(struct page *page);
>  void put_pages_list(struct list_head *pages);
>  
>  void split_page(struct page *page, unsigned int order);
> +int split_free_page(struct page *page);
>  
>  /*
>   * Compound pages have a destructor function.  Provide a
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 1f59d93..cf8bba7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -151,6 +151,7 @@ enum {
>  };
>  
>  #define SWAP_CLUSTER_MAX 32
> +#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
>  
>  #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
>  #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
> @@ -238,6 +239,11 @@ static inline void lru_cache_add_active_file(struct page *page)
>  	__lru_cache_add(page, LRU_ACTIVE_FILE);
>  }
>  
> +/* LRU Isolation modes. */
> +#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
> +#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
> +#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
> +
>  /* linux/mm/vmscan.c */
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  					gfp_t gfp_mask, nodemask_t *mask);
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 117f0dd..56e4b44 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
>  		KSWAPD_SKIP_CONGESTION_WAIT,
>  		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> +		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
>  #ifdef CONFIG_HUGETLB_PAGE
>  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
>  #endif
> diff --git a/mm/Makefile b/mm/Makefile
> index 7a68d2a..ccb1f72 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
>  obj-$(CONFIG_FS_XIP) += filemap_xip.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_COMPACTION) += compaction.o
>  obj-$(CONFIG_SMP) += percpu.o
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
>  obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> diff --git a/mm/compaction.c b/mm/compaction.c
> new file mode 100644
> index 0000000..0d2e8aa
> --- /dev/null
> +++ b/mm/compaction.c
> @@ -0,0 +1,348 @@
> +/*
> + * linux/mm/compaction.c
> + *
> + * Memory compaction for the reduction of external fragmentation. Note that
> + * this heavily depends upon page migration to do all the real heavy
> + * lifting
> + *
> + * Copyright IBM Corp. 2007-2010 Mel Gorman <mel@csn.ul.ie>
> + */
> +#include <linux/swap.h>
> +#include <linux/migrate.h>
> +#include <linux/compaction.h>
> +#include <linux/mm_inline.h>
> +#include "internal.h"
> +
> +/*
> + * compact_control is used to track pages being migrated and the free pages
> + * they are being migrated to during memory compaction. The free_pfn starts
> + * at the end of a zone and migrate_pfn begins at the start. Movable pages
> + * are moved to the end of a zone during a compaction run and the run
> + * completes when free_pfn <= migrate_pfn
> + */
> +struct compact_control {
> +	struct list_head freepages;	/* List of free pages to migrate to */
> +	struct list_head migratepages;	/* List of pages being migrated */
> +	unsigned long nr_freepages;	/* Number of isolated free pages */
> +	unsigned long nr_migratepages;	/* Number of pages to migrate */
> +	unsigned long free_pfn;		/* isolate_freepages search base */
> +	unsigned long migrate_pfn;	/* isolate_migratepages search base */
> +
> +	/* Account for isolated anon and file pages */
> +	unsigned long nr_anon;
> +	unsigned long nr_file;
> +
> +	struct zone *zone;
> +};
> +
> +static int release_freepages(struct list_head *freelist)
> +{
> +	struct page *page, *next;
> +	int count = 0;
> +
> +	list_for_each_entry_safe(page, next, freelist, lru) {
> +		list_del(&page->lru);
> +		__free_page(page);
> +		count++;
> +	}
> +
> +	return count;
> +}
> +
> +/* Isolate free pages onto a private freelist. Must hold zone->lock */
> +static int isolate_freepages_block(struct zone *zone,
> +				unsigned long blockpfn,
> +				struct list_head *freelist)
> +{
> +	unsigned long zone_end_pfn, end_pfn;
> +	int total_isolated = 0;
> +
> +	/* Get the last PFN we should scan for free pages at */
> +	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> +	end_pfn = blockpfn + pageblock_nr_pages;
> +	if (end_pfn > zone_end_pfn)
> +		end_pfn = zone_end_pfn;
> +
> +	/* Isolate free pages. This assumes the block is valid */
> +	for (; blockpfn < end_pfn; blockpfn++) {
> +		struct page *page;
> +		int isolated, i;
> +
> +		if (!pfn_valid_within(blockpfn))
> +			continue;
> +
> +		page = pfn_to_page(blockpfn);
> +		if (!PageBuddy(page))
> +			continue;
> +
> +		/* Found a free page, break it into order-0 pages */
> +		isolated = split_free_page(page);
> +		total_isolated += isolated;
> +		for (i = 0; i < isolated; i++) {
> +			list_add(&page->lru, freelist);
> +			page++;
> +		}
> +
> +		/* If a page was split, advance to the end of it */
> +		if (isolated)
> +			blockpfn += isolated - 1;
> +	}
> +
> +	return total_isolated;
> +}
> +
> +/* Returns 1 if the page is within a block suitable for migration to */
> +static int suitable_migration_target(struct page *page)
> +{
> +	/* If the page is a large free page, then allow migration */
> +	if (PageBuddy(page) && page_order(page) >= pageblock_order)
> +		return 1;
> +
> +	/* If the block is MIGRATE_MOVABLE, allow migration */
> +	if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
> +		return 1;
> +
> +	/* Otherwise skip the block */
> +	return 0;
> +}
> +
> +/*
> + * Based on information in the current compact_control, find blocks
> + * suitable for isolating free pages from
> + */
> +static void isolate_freepages(struct zone *zone,
> +				struct compact_control *cc)
> +{
> +	struct page *page;
> +	unsigned long high_pfn, low_pfn, pfn;
> +	unsigned long flags;
> +	int nr_freepages = cc->nr_freepages;
> +	struct list_head *freelist = &cc->freepages;
> +
> +	pfn = cc->free_pfn;
> +	low_pfn = cc->migrate_pfn + pageblock_nr_pages;
> +	high_pfn = low_pfn;
> +
> +	/*
> +	 * Isolate free pages until enough are available to migrate the
> +	 * pages on cc->migratepages. We stop searching if the migrate
> +	 * and free page scanners meet or enough free pages are isolated.
> +	 */
> +	spin_lock_irqsave(&zone->lock, flags);
> +	for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
> +					pfn -= pageblock_nr_pages) {
> +		int isolated;
> +
> +		if (!pfn_valid(pfn))
> +			continue;
> +
> +		/* 
> +		 * Check for overlapping nodes/zones. It's possible on some
> +		 * configurations to have a setup like
> +		 * node0 node1 node0
> +		 * i.e. it's possible that all pages within a zones range of
> +		 * pages do not belong to a single zone.
> +		 */
> +		page = pfn_to_page(pfn);
> +		if (page_zone(page) != zone)
> +			continue;
> +
> +		/* Check the block is suitable for migration */
> +		if (!suitable_migration_target(page))
> +			continue;
> +
> +		/* Found a block suitable for isolating free pages from */
> +		isolated = isolate_freepages_block(zone, pfn, freelist);
> +		nr_freepages += isolated;
> +
> +		/*
> +		 * Record the highest PFN we isolated pages from. When next
> +		 * looking for free pages, the search will restart here as
> +		 * page migration may have returned some pages to the allocator
> +		 */
> +		if (isolated)
> +			high_pfn = max(high_pfn, pfn);
> +	}
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	cc->free_pfn = high_pfn;
> +	cc->nr_freepages = nr_freepages;
> +}
> +
> +/* Update the number of anon and file isolated pages in the zone */
> +static void acct_isolated(struct zone *zone, struct compact_control *cc)
> +{
> +	struct page *page;
> +	unsigned int count[NR_LRU_LISTS] = { 0, };
> +
> +	list_for_each_entry(page, &cc->migratepages, lru) {
> +		int lru = page_lru_base_type(page);
> +		count[lru]++;
> +	}
> +
> +	cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> +	cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> +	__mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
> +	__mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
> +}
> +
> +/*
> + * Isolate all pages that can be migrated from the block pointed to by
> + * the migrate scanner within compact_control.
> + */
> +static unsigned long isolate_migratepages(struct zone *zone,
> +					struct compact_control *cc)
> +{
> +	unsigned long low_pfn, end_pfn;
> +	struct list_head *migratelist;
> +
> +	low_pfn = cc->migrate_pfn;
> +	migratelist = &cc->migratepages;
> +
> +	/* Do not scan outside zone boundaries */
> +	if (low_pfn < zone->zone_start_pfn)
> +		low_pfn = zone->zone_start_pfn;
> +
> +	/* Setup to scan one block but not past where we are migrating to */
> +	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> +
> +	/* Do not cross the free scanner or scan within a memory hole */
> +	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> +		cc->migrate_pfn = end_pfn;
> +		return 0;
> +	}
> +
> +	migrate_prep();
> +
> +	/* Time to isolate some pages for migration */
> +	spin_lock_irq(&zone->lru_lock);
> +	for (; low_pfn < end_pfn; low_pfn++) {
> +		struct page *page;
> +		if (!pfn_valid_within(low_pfn))
> +			continue;
> +
> +		/* Get the page and skip if free */
> +		page = pfn_to_page(low_pfn);
> +		if (PageBuddy(page)) {
> +			low_pfn += (1 << page_order(page)) - 1;
> +			continue;
> +		}
> +
> +		/* Try isolate the page */
> +		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
> +			del_page_from_lru_list(zone, page, page_lru(page));
> +			list_add(&page->lru, migratelist);
> +			mem_cgroup_del_lru(page);
> +			cc->nr_migratepages++;
> +		}
> +
> +		/* Avoid isolating too much */
> +		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
> +			break;
> +	}
> +
> +	acct_isolated(zone, cc);
> +
> +	spin_unlock_irq(&zone->lru_lock);
> +	cc->migrate_pfn = low_pfn;
> +
> +	return cc->nr_migratepages;
> +}
> +
> +/*
> + * This is a migrate-callback that "allocates" freepages by taking pages
> + * from the isolated freelists in the block we are migrating to.
> + */
> +static struct page *compaction_alloc(struct page *migratepage,
> +					unsigned long data,
> +					int **result)
> +{
> +	struct compact_control *cc = (struct compact_control *)data;
> +	struct page *freepage;
> +
> +	VM_BUG_ON(cc == NULL);
> +
> +	/* Isolate free pages if necessary */
> +	if (list_empty(&cc->freepages)) {
> +		isolate_freepages(cc->zone, cc);
> +
> +		if (list_empty(&cc->freepages))
> +			return NULL;
> +	}
> +
> +	freepage = list_entry(cc->freepages.next, struct page, lru);
> +	list_del(&freepage->lru);
> +	cc->nr_freepages--;
> +
> +	return freepage;
> +}
> +
> +/*
> + * We cannot control nr_migratepages and nr_freepages fully when migration is
> + * running as migrate_pages() has no knowledge of compact_control. When
> + * migration is complete, we count the number of pages on the lists by hand.
> + */
> +static void update_nr_listpages(struct compact_control *cc)
> +{
> +	int nr_migratepages = 0;
> +	int nr_freepages = 0;
> +	struct page *page;
> +	list_for_each_entry(page, &cc->migratepages, lru)
> +		nr_migratepages++;
> +	list_for_each_entry(page, &cc->freepages, lru)
> +		nr_freepages++;
> +
> +	cc->nr_migratepages = nr_migratepages;
> +	cc->nr_freepages = nr_freepages;
> +}
> +
> +static inline int compact_finished(struct zone *zone,
> +						struct compact_control *cc)
> +{
> +	/* Compaction run completes if the migrate and free scanner meet */
> +	if (cc->free_pfn <= cc->migrate_pfn)
> +		return COMPACT_COMPLETE;
> +
> +	return COMPACT_INCOMPLETE;
> +}
> +
> +static int compact_zone(struct zone *zone, struct compact_control *cc)
> +{
> +	int ret = COMPACT_INCOMPLETE;
> +
> +	/* Setup to move all movable pages to the end of the zone */
> +	cc->migrate_pfn = zone->zone_start_pfn;
> +	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
> +	cc->free_pfn &= ~(pageblock_nr_pages-1);
> +
> +	for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
> +		unsigned long nr_migrate, nr_remaining;
> +		if (!isolate_migratepages(zone, cc))
> +			continue;
> +
> +		nr_migrate = cc->nr_migratepages;
> +		migrate_pages(&cc->migratepages, compaction_alloc,
> +						(unsigned long)cc, 0);
> +		update_nr_listpages(cc);
> +		nr_remaining = cc->nr_migratepages;
> +
> +		count_vm_event(COMPACTBLOCKS);
> +		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
> +		if (nr_remaining)
> +			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
> +
> +		/* Release LRU pages not migrated */
> +		if (!list_empty(&cc->migratepages)) {
> +			putback_lru_pages(&cc->migratepages);
> +			cc->nr_migratepages = 0;
> +		}
> +
> +	}
> +
> +	/* Release free pages and check accounting */
> +	cc->nr_freepages -= release_freepages(&cc->freepages);
> +	VM_BUG_ON(cc->nr_freepages != 0);
> +
> +	return ret;
> +}
> +
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 882aef0..9708143 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1208,6 +1208,45 @@ void split_page(struct page *page, unsigned int order)
>  }
>  
>  /*
> + * Similar to split_page except the page is already free. As this is only
> + * being used for migration, the migratetype of the block also changes.
> + */
> +int split_free_page(struct page *page)
> +{
> +	unsigned int order;
> +	unsigned long watermark;
> +	struct zone *zone;
> +
> +	BUG_ON(!PageBuddy(page));
> +
> +	zone = page_zone(page);
> +	order = page_order(page);
> +
> +	/* Obey watermarks or the system could deadlock */
> +	watermark = low_wmark_pages(zone) + (1 << order);
> +	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> +		return 0;
> +
> +	/* Remove page from free list */
> +	list_del(&page->lru);
> +	zone->free_area[order].nr_free--;
> +	rmv_page_order(page);
> +	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> +
> +	/* Split into individual pages */
> +	set_page_refcounted(page);
> +	split_page(page, order);
> +
> +	if (order >= pageblock_order - 1) {
> +		struct page *endpage = page + (1 << order) - 1;
> +		for (; page < endpage; page += pageblock_nr_pages)
> +			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> +	}
> +
> +	return 1 << order;
> +}
> +
> +/*
>   * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
>   * we cheat by calling it from here, in the order > 0 path.  Saves a branch
>   * or two.
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 79c8098..ef89600 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -839,11 +839,6 @@ keep:
>  	return nr_reclaimed;
>  }
>  
> -/* LRU Isolation modes. */
> -#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
> -#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
> -#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
> -
>  /*
>   * Attempt to remove the specified page from its LRU.  Only take this page
>   * if it is of the appropriate PageActive status.  Pages which are being
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 7377da6..af88647 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -891,6 +891,11 @@ static const char * const vmstat_text[] = {
>  	"allocstall",
>  
>  	"pgrotated",
> +
> +	"compact_blocks_moved",
> +	"compact_pages_moved",
> +	"compact_pagemigrate_failed",
> +
>  #ifdef CONFIG_HUGETLB_PAGE
>  	"htlb_buddy_alloc_success",
>  	"htlb_buddy_alloc_fail",
> -- 
> 1.6.5
> 
> 


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-23 12:25 ` [PATCH 10/11] Direct compact when a high-order allocation fails Mel Gorman
  2010-03-23 23:10   ` Minchan Kim
@ 2010-03-24  1:19   ` KAMEZAWA Hiroyuki
  2010-03-24 11:40     ` Mel Gorman
  2010-03-24 20:48   ` Andrew Morton
  2 siblings, 1 reply; 78+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-24  1:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010 12:25:45 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> Ordinarily when a high-order allocation fails, direct reclaim is entered to
> free pages to satisfy the allocation.  With this patch, it is determined if
> an allocation failed due to external fragmentation instead of low memory
> and if so, the calling process will compact until a suitable page is
> freed. Compaction by moving pages in memory is considerably cheaper than
> paging out to disk and works where there are locked pages or no swap. If
> compaction fails to free a page of a suitable size, then reclaim will
> still occur.
> 
> Direct compaction returns as soon as possible. As each block is compacted,
> it is checked if a suitable page has been freed and if so, it returns.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>  include/linux/compaction.h |   16 +++++-
>  include/linux/vmstat.h     |    1 +
>  mm/compaction.c            |  118 ++++++++++++++++++++++++++++++++++++++++++++
>  mm/page_alloc.c            |   26 ++++++++++
>  mm/vmstat.c                |   15 +++++-
>  5 files changed, 172 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index c94890b..b851428 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -1,14 +1,26 @@
>  #ifndef _LINUX_COMPACTION_H
>  #define _LINUX_COMPACTION_H
>  
> -/* Return values for compact_zone() */
> +/* Return values for compact_zone() and try_to_compact_pages() */
>  #define COMPACT_INCOMPLETE	0
> -#define COMPACT_COMPLETE	1
> +#define COMPACT_PARTIAL		1
> +#define COMPACT_COMPLETE	2
>  
>  #ifdef CONFIG_COMPACTION
>  extern int sysctl_compact_memory;
>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>  			void __user *buffer, size_t *length, loff_t *ppos);
> +
> +extern int fragmentation_index(struct zone *zone, unsigned int order);
> +extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> +			int order, gfp_t gfp_mask, nodemask_t *mask);
> +#else
> +static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> +			int order, gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> +	return COMPACT_INCOMPLETE;
> +}
> +
>  #endif /* CONFIG_COMPACTION */
>  
>  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 56e4b44..b4b4d34 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		KSWAPD_SKIP_CONGESTION_WAIT,
>  		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
>  		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> +		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
>  #ifdef CONFIG_HUGETLB_PAGE
>  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
>  #endif
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 8df6e3d..6688700 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -34,6 +34,8 @@ struct compact_control {
>  	unsigned long nr_anon;
>  	unsigned long nr_file;
>  
> +	unsigned int order;		/* order a direct compactor needs */
> +	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
>  	struct zone *zone;
>  };
>  
> @@ -301,10 +303,31 @@ static void update_nr_listpages(struct compact_control *cc)
>  static inline int compact_finished(struct zone *zone,
>  						struct compact_control *cc)
>  {
> +	unsigned int order;
> +	unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
> +
>  	/* Compaction run completes if the migrate and free scanner meet */
>  	if (cc->free_pfn <= cc->migrate_pfn)
>  		return COMPACT_COMPLETE;
>  
> +	/* Compaction run is not finished if the watermark is not met */
> +	if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
> +		return COMPACT_INCOMPLETE;
> +
> +	if (cc->order == -1)
> +		return COMPACT_INCOMPLETE;
> +
> +	/* Direct compactor: Is a suitable page free? */
> +	for (order = cc->order; order < MAX_ORDER; order++) {
> +		/* Job done if page is free of the right migratetype */
> +		if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
> +			return COMPACT_PARTIAL;
> +
> +		/* Job done if allocation would set block type */
> +		if (order >= pageblock_order && zone->free_area[order].nr_free)
> +			return COMPACT_PARTIAL;
> +	}
> +
>  	return COMPACT_INCOMPLETE;
>  }
>  
> @@ -348,6 +371,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>  	return ret;
>  }
>  
> +static inline unsigned long compact_zone_order(struct zone *zone,
> +						int order, gfp_t gfp_mask)
> +{
> +	struct compact_control cc = {
> +		.nr_freepages = 0,
> +		.nr_migratepages = 0,
> +		.order = order,
> +		.migratetype = allocflags_to_migratetype(gfp_mask),
> +		.zone = zone,
> +	};
> +	INIT_LIST_HEAD(&cc.freepages);
> +	INIT_LIST_HEAD(&cc.migratepages);
> +
> +	return compact_zone(zone, &cc);
> +}
> +
> +/**
> + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> + * @zonelist: The zonelist used for the current allocation
> + * @order: The order of the current allocation
> + * @gfp_mask: The GFP mask of the current allocation
> + * @nodemask: The allowed nodes to allocate from
> + *
> + * This is the main entry point for direct page compaction.
> + */
> +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> +			int order, gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> +	int may_enter_fs = gfp_mask & __GFP_FS;
> +	int may_perform_io = gfp_mask & __GFP_IO;
> +	unsigned long watermark;
> +	struct zoneref *z;
> +	struct zone *zone;
> +	int rc = COMPACT_INCOMPLETE;
> +
> +	/* Check whether it is worth even starting compaction */
> +	if (order == 0 || !may_enter_fs || !may_perform_io)
> +		return rc;
> +
> +	/*
> +	 * We will not stall if the necessary conditions are not met for
> +	 * migration but direct reclaim seems to account stalls similarly
> +	 */
> +	count_vm_event(COMPACTSTALL);
> +
> +	/* Compact each zone in the list */
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> +								nodemask) {
> +		int fragindex;
> +		int status;
> +
> +		/*
> +		 * Watermarks for order-0 must be met for compaction. Note
> +		 * the 2UL. This is because during migration, copies of
> +		 * pages need to be allocated and for a short time, the
> +		 * footprint is higher
> +		 */
> +		watermark = low_wmark_pages(zone) + (2UL << order);
> +		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> +			continue;
> +
> +		/*
> +		 * fragmentation index determines if allocation failures are
> +		 * due to low memory or external fragmentation
> +		 *
> +		 * index of -1 implies allocations might succeed depending
> +		 * 	on watermarks
> +		 * index < 500 implies alloc failure is due to lack of memory
> +		 *
> +		 * XXX: The choice of 500 is arbitrary. Reinvestigate
> +		 *      appropriately to determine a sensible default.
> +		 *      and what it means when watermarks are also taken
> +		 *      into account. Consider making it a sysctl
> +		 */
> +		fragindex = fragmentation_index(zone, order);
> +		if (fragindex >= 0 && fragindex <= 500)
> +			continue;
> +
> +		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
> +			rc = COMPACT_PARTIAL;
> +			break;
> +		}
> +
> +		status = compact_zone_order(zone, order, gfp_mask);
> +		rc = max(status, rc);

Hm... so on each failure we scan over the whole zone until migration succeeds?
Is it meaningful that multiple tasks run direct compaction against
a zone (from zone->start_pfn to zone->end_pfn) in parallel?
e.g. running order=3 compaction while another thread runs order=5 compaction.

Can't we find a clever way to pick a [start_pfn, end_pfn) range for scanning rather than
[zone->start_pfn, zone->start_pfn + zone->spanned_pages)?

I'm sorry if I miss something...

Thanks,
-Kame


> +
> +		if (zone_watermark_ok(zone, order, watermark, 0, 0))
> +			break;
> +	}
> +
> +	return rc;
> +}
> +
> +
>  /* Compact all zones within a node */
>  static int compact_node(int nid)
>  {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9708143..e301108 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -49,6 +49,7 @@
>  #include <linux/debugobjects.h>
>  #include <linux/kmemleak.h>
>  #include <linux/memory.h>
> +#include <linux/compaction.h>
>  #include <trace/events/kmem.h>
>  #include <linux/ftrace_event.h>
>  
> @@ -1765,6 +1766,31 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  
>  	cond_resched();
>  
> +	/* Try memory compaction for high-order allocations before reclaim */
> +	if (order) {
> +		*did_some_progress = try_to_compact_pages(zonelist,
> +						order, gfp_mask, nodemask);
> +		if (*did_some_progress != COMPACT_INCOMPLETE) {
> +			page = get_page_from_freelist(gfp_mask, nodemask,
> +					order, zonelist, high_zoneidx,
> +					alloc_flags, preferred_zone,
> +					migratetype);
> +			if (page) {
> +				__count_vm_event(COMPACTSUCCESS);
> +				return page;
> +			}
> +
> +			/*
> +			 * It's bad if compaction run occurs and fails.
> +			 * The most likely reason is that pages exist,
> +			 * but not enough to satisfy watermarks.
> +			 */
> +			count_vm_event(COMPACTFAIL);
> +
> +			cond_resched();
> +		}
> +	}
> +
>  	/* We now go into synchronous reclaim */
>  	cpuset_memory_pressure_bump();
>  	p->flags |= PF_MEMALLOC;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index af88647..c88f285 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -560,7 +560,7 @@ static int unusable_show(struct seq_file *m, void *arg)
>   * The value can be used to determine if page reclaim or compaction
>   * should be used
>   */
> -int fragmentation_index(unsigned int order, struct contig_page_info *info)
> +int __fragmentation_index(unsigned int order, struct contig_page_info *info)
>  {
>  	unsigned long requested = 1UL << order;
>  
> @@ -580,6 +580,14 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
>  	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
>  }
>  
> +/* Same as __fragmentation index but allocs contig_page_info on stack */
> +int fragmentation_index(struct zone *zone, unsigned int order)
> +{
> +	struct contig_page_info info;
> +
> +	fill_contig_page_info(zone, order, &info);
> +	return __fragmentation_index(order, &info);
> +}
>  
>  static void extfrag_show_print(struct seq_file *m,
>  					pg_data_t *pgdat, struct zone *zone)
> @@ -595,7 +603,7 @@ static void extfrag_show_print(struct seq_file *m,
>  				zone->name);
>  	for (order = 0; order < MAX_ORDER; ++order) {
>  		fill_contig_page_info(zone, order, &info);
> -		index = fragmentation_index(order, &info);
> +		index = __fragmentation_index(order, &info);
>  		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
>  	}
>  
> @@ -895,6 +903,9 @@ static const char * const vmstat_text[] = {
>  	"compact_blocks_moved",
>  	"compact_pages_moved",
>  	"compact_pagemigrate_failed",
> +	"compact_stall",
> +	"compact_fail",
> +	"compact_success",
>  
>  #ifdef CONFIG_HUGETLB_PAGE
>  	"htlb_buddy_alloc_success",
> -- 
> 1.6.5
> 
> 


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-24  1:03   ` KAMEZAWA Hiroyuki
@ 2010-03-24  1:47     ` Minchan Kim
  2010-03-24  1:53       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 78+ messages in thread
From: Minchan Kim @ 2010-03-24  1:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 10:03 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 23 Mar 2010 12:25:42 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
>
>> This patch is the core of a mechanism which compacts memory in a zone by
>> relocating movable pages towards the end of the zone.
>>
>> A single compaction run involves a migration scanner and a free scanner.
>> Both scanners operate on pageblock-sized areas in the zone. The migration
>> scanner starts at the bottom of the zone and searches for all movable pages
>> within each area, isolating them onto a private list called migratelist.
>> The free scanner starts at the top of the zone and searches for suitable
>> areas and consumes the free pages within making them available for the
>> migration scanner. The pages isolated for migration are then migrated to
>> the newly isolated free pages.
>>
>> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> Acked-by: Rik van Riel <riel@redhat.com>
>> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
>
> I think lru_add_drain() or lru_add_drain_all() should be called somewhere
> when we do __isolate_lru_page(). But it's (_all is) slow....
>

migrate_prep does it.
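
For reference, migrate_prep() in mm/migrate.c is essentially just the
following - quoted from memory rather than from this series, so treat it as
a sketch of the mainline function around this time:

	int migrate_prep(void)
	{
		/*
		 * Drain the per-CPU LRU pagevecs so that pages sitting in
		 * them can be found and isolated by __isolate_lru_page().
		 */
		lru_add_drain_all();

		return 0;
	}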

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-24  1:47     ` Minchan Kim
@ 2010-03-24  1:53       ` KAMEZAWA Hiroyuki
  2010-03-24  2:10         ` Minchan Kim
  0 siblings, 1 reply; 78+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-24  1:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, 24 Mar 2010 10:47:41 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Wed, Mar 24, 2010 at 10:03 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 23 Mar 2010 12:25:42 +0000
> > Mel Gorman <mel@csn.ul.ie> wrote:
> >
> >> This patch is the core of a mechanism which compacts memory in a zone by
> >> relocating movable pages towards the end of the zone.
> >>
> >> A single compaction run involves a migration scanner and a free scanner.
> >> Both scanners operate on pageblock-sized areas in the zone. The migration
> >> scanner starts at the bottom of the zone and searches for all movable pages
> >> within each area, isolating them onto a private list called migratelist.
> >> The free scanner starts at the top of the zone and searches for suitable
> >> areas and consumes the free pages within making them available for the
> >> migration scanner. The pages isolated for migration are then migrated to
> >> the newly isolated free pages.
> >>
> >> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >> Acked-by: Rik van Riel <riel@redhat.com>
> >> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> >
> > I think lru_add_drain() or lru_add_drain_all() should be called somewhere
> > when we do __isolate_lru_page(). But it's (_all is) slow....
> >
> 
> migrate_prep does it.
> 
Thanks.

Hmm...then, lru_add_drain_all() is called at each (32-page migrate) iteration.
Isn't it too slow to be called at such a frequency?

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-24  1:53       ` KAMEZAWA Hiroyuki
@ 2010-03-24  2:10         ` Minchan Kim
  2010-03-24 10:57           ` Mel Gorman
  0 siblings, 1 reply; 78+ messages in thread
From: Minchan Kim @ 2010-03-24  2:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 10:53 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 24 Mar 2010 10:47:41 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> On Wed, Mar 24, 2010 at 10:03 AM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Tue, 23 Mar 2010 12:25:42 +0000
>> > Mel Gorman <mel@csn.ul.ie> wrote:
>> >
>> >> This patch is the core of a mechanism which compacts memory in a zone by
>> >> relocating movable pages towards the end of the zone.
>> >>
>> >> A single compaction run involves a migration scanner and a free scanner.
>> >> Both scanners operate on pageblock-sized areas in the zone. The migration
>> >> scanner starts at the bottom of the zone and searches for all movable pages
>> >> within each area, isolating them onto a private list called migratelist.
>> >> The free scanner starts at the top of the zone and searches for suitable
>> >> areas and consumes the free pages within making them available for the
>> >> migration scanner. The pages isolated for migration are then migrated to
>> >> the newly isolated free pages.
>> >>
>> >> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> >> Acked-by: Rik van Riel <riel@redhat.com>
>> >> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
>> >
>> > I think lru_add_drain() or lru_add_drain_all() should be called somewhere
>> > when we do __isolate_lru_page(). But it's (_all is) slow....
>> >
>>
>> migrate_prep does it.
>>
> Thanks.
>
> Hmm...then, lru_add_drain_all() is called at each (32-page migrate) iteration.
> Isn't it too slow to be called at such a frequency?

I agree. We can move migrate_prep() into compact_zone().

>
> Thanks,
> -Kame
>
>
>
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] Export unusable free space index via /proc/unusable_index
  2010-03-24  0:03   ` KAMEZAWA Hiroyuki
  2010-03-24  0:16     ` Minchan Kim
@ 2010-03-24 10:25     ` Mel Gorman
  1 sibling, 0 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-24 10:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 09:03:12AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 23 Mar 2010 12:25:40 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Unusable free space index is a measure of external fragmentation that
> > takes the allocation size into account. For the most part, the huge page
> > size will be the size of interest but not necessarily so it is exported
> > on a per-order and per-zone basis via /proc/unusable_index.
> > 
> > The index is a value between 0 and 1. It can be expressed as a
> > percentage by multiplying by 100 as documented in
> > Documentation/filesystems/proc.txt.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> >  Documentation/filesystems/proc.txt |   13 ++++-
> >  mm/vmstat.c                        |  120 +++++++++++++++++++++++++++++++++
> >  2 files changed, 132 insertions(+), 1 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > index 5e132b5..5c4b0fb 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -452,6 +452,7 @@ Table 1-5: Kernel info in /proc
> >   sys         See chapter 2                                     
> >   sysvipc     Info of SysVIPC Resources (msg, sem, shm)		(2.4)
> >   tty	     Info of tty drivers
> > + unusable_index Additional page allocator information (see text)(2.5)
> >   uptime      System uptime                                     
> >   version     Kernel version                                    
> >   video	     bttv info of video resources			(2.4)
> > @@ -609,7 +610,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
> >  available in ZONE_NORMAL, etc... 
> >  
> >  More information relevant to external fragmentation can be found in
> > -pagetypeinfo.
> > +pagetypeinfo and unusable_index
> >  
> >  > cat /proc/pagetypeinfo
> >  Page block order: 9
> > @@ -650,6 +651,16 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
> >  also be allocatable although a lot of filesystem metadata may have to be
> >  reclaimed to achieve this.
> >  
> > +> cat /proc/unusable_index
> > +Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
> > +Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
> > +
> > +The unusable free space index measures how much of the available free
> > +memory cannot be used to satisfy an allocation of a given size and is a
> > +value between 0 and 1. The higher the value, the more of free memory is
> > +unusable and by implication, the worse the external fragmentation is. This
> > +can be expressed as a percentage by multiplying by 100.
> > +
> >  ..............................................................................
> >  
> >  meminfo:
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 7f760cb..ca42e10 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -453,6 +453,106 @@ static int frag_show(struct seq_file *m, void *arg)
> >  	return 0;
> >  }
> >  
> > +
> > +struct contig_page_info {
> > +	unsigned long free_pages;
> > +	unsigned long free_blocks_total;
> > +	unsigned long free_blocks_suitable;
> > +};
> > +
> > +/*
> > + * Calculate the number of free pages in a zone, how many contiguous
> > + * pages are free and how many are large enough to satisfy an allocation of
> > + * the target size. Note that this function makes no attempt to estimate
> > + * how many suitable free blocks there *might* be if MOVABLE pages were
> > + * migrated. Calculating that is possible, but expensive and can be
> > + * figured out from userspace
> > + */
> > +static void fill_contig_page_info(struct zone *zone,
> > +				unsigned int suitable_order,
> > +				struct contig_page_info *info)
> > +{
> > +	unsigned int order;
> > +
> > +	info->free_pages = 0;
> > +	info->free_blocks_total = 0;
> > +	info->free_blocks_suitable = 0;
> > +
> > +	for (order = 0; order < MAX_ORDER; order++) {
> > +		unsigned long blocks;
> > +
> > +		/* Count number of free blocks */
> > +		blocks = zone->free_area[order].nr_free;
> > +		info->free_blocks_total += blocks;
> 
> ....for what this free_blocks_total is ?
> 

It's used for fragmentation_index in the next patch. By rights, they
should be in the same patch but I found it easier to re-review
fill_contig_page_info() if it was introduced as a single piece.
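
As a worked example of how the two counters feed the index, here is a
user-space toy that reuses the same arithmetic as __fragmentation_index().
Only struct contig_page_info and the formula are from the patch; the
frag_index() name, the example numbers and the comments are mine, and the
patch additionally returns early with sentinel values when there are no
free blocks at all or when a suitably sized block is already free:

	#include <stdio.h>

	/* mirrors struct contig_page_info from the patch */
	struct contig_page_info {
		unsigned long free_pages;
		unsigned long free_blocks_total;
		unsigned long free_blocks_suitable;
	};

	/*
	 * Same arithmetic as __fragmentation_index() for the interesting
	 * case: some free blocks exist but none is large enough for the
	 * request.
	 */
	static int frag_index(unsigned int order,
			      const struct contig_page_info *info)
	{
		unsigned long requested = 1UL << order;

		return 1000 - ((1000 + (info->free_pages * 1000 / requested)) /
				info->free_blocks_total);
	}

	int main(void)
	{
		/* 1000 free pages scattered across 980 small blocks */
		struct contig_page_info fragmented = { 1000, 980, 0 };
		/* only 64 free pages, in 2 blocks */
		struct contig_page_info low_memory = { 64, 2, 0 };

		printf("fragmented zone, order-9: %d\n", frag_index(9, &fragmented));
		printf("low-memory zone, order-9: %d\n", frag_index(9, &low_memory));
		return 0;
	}

That prints 997 for the first zone (plenty of free memory, so compaction is
worth trying) and 438 for the second, i.e. below the 500 cut-off used by the
direct-compaction patch, where reclaim rather than compaction is the right
response.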

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 11/11] Do not compact within a preferred zone after a compaction failure
  2010-03-23 19:27       ` Christoph Lameter
@ 2010-03-24 10:37         ` Mel Gorman
  2010-03-24 19:54           ` Christoph Lameter
  0 siblings, 1 reply; 78+ messages in thread
From: Mel Gorman @ 2010-03-24 10:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, Mar 23, 2010 at 02:27:08PM -0500, Christoph Lameter wrote:
> On Tue, 23 Mar 2010, Mel Gorman wrote:
> 
> > I was having some sort of fit when I wrote that obviously. Try this on
> > for size
> >
> > The fragmentation index may indicate that a failure is due to external
> > fragmentation but after a compaction run completes, it is still possible
> > for an allocation to fail.
> 
> Ok.
> 
> > > > fail. There are two obvious reasons as to why
> > > >
> > > >   o Page migration cannot move all pages so fragmentation remains
> > > >   o A suitable page may exist but watermarks are not met
> > > >
> > > > In the event of compaction and allocation failure, this patch prevents
> > > > compaction happening for a short interval. It's only recorded on the
> > >
> > > compaction is "recorded"? deferred?
> > >
> >
> > deferred makes more sense.
> >
> > What I was thinking at the time was that compact_resume was stored in struct
> > zone - i.e. that is where it is recorded.
> 
> Ok adding a dozen or more words here may be useful.
> 

In the event of compaction followed by an allocation failure, this patch
defers further compaction in the zone for a period of time. The zone that
is deferred is the first zone in the zonelist - i.e. the preferred zone.
To defer compaction in the other zones, the information would need to
be stored in the zonelist or implemented similar to the zonelist_cache.
This would impact the fast-paths and is not justified at this time.

?
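
In code terms it is little more than a timestamp in struct zone. A minimal
sketch of the idea - compact_resume and defer_compaction() are from the
patch as discussed above, while compaction_deferred() is an invented name
for the check on the allocation side:

	/* record when direct compaction may be attempted again in this zone */
	static inline void defer_compaction(struct zone *zone, unsigned long resume)
	{
		zone->compact_resume = resume;
	}

	/* true while the deferral window set above is still running */
	static inline int compaction_deferred(struct zone *zone)
	{
		return time_before(jiffies, zone->compact_resume);
	}

with the allocator calling defer_compaction(preferred_zone, jiffies + HZ/50)
after a compaction run that still failed to produce a usable page, as in the
hunk quoted earlier.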

> > > > preferred zone but that should be enough coverage. This could have been
> > > > implemented similar to the zonelist_cache but the increased size of the
> > > > zonelist did not appear to be justified.
> > >
> > > > @@ -1787,6 +1787,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> > > >  			 */
> > > >  			count_vm_event(COMPACTFAIL);
> > > >
> > > > +			/* On failure, avoid compaction for a short time. */
> > > > +			defer_compaction(preferred_zone, jiffies + HZ/50);
> > > > +
> > >
> > > 20ms? How was that interval determined?
> > >
> >
> > Matches the time the page allocator would defer to an event like
> > congestion. The choice is somewhat arbitrary. Ideally, there would be
> > some sort of event that would re-enable compaction but there wasn't an
> > obvious candidate so I used time.
> 
> There are frequent uses of HZ/10 as well, especially in vmscan.c. A longer
> time may be better? HZ/50 looks like an interval for writeout. But this
> is related to reclaim?
> 

HZ/10 is somewhat of an arbitrary choice as well and there isn't data on
which is better and which is worse. If the zone is full of dirty data, then
HZ/10 makes sense for IO. If it happened to be mainly clean cache but under
heavy memory pressure, then reclaim would be a relatively fast event and a
shorter wait of HZ/50 makes sense.

Thing is, if we start with a short timer and it's too short, COMPACTFAIL
will be growing steadily. If we choose a long time and it's too long, there
is no counter to indicate it was a bad choice. Hence, I'd prefer the short
timer to start with and ideally resume compaction after some event in the
future rather than depending on time.

Does that make sense?

> 
>  backing-dev.h    <global>                      283 long congestion_wait(int sync, long timeout);
> 1 backing-dev.c    <global>                      762 EXPORT_SYMBOL(congestion_wait);
> 2 usercopy_32.c    __copy_to_user_ll             754 congestion_wait(BLK_RW_ASYNC, HZ/50);
> 3 pktcdvd.c        pkt_make_request             2557 congestion_wait(BLK_RW_ASYNC, HZ);
> 4 dm-crypt.c       kcryptd_crypt_write_convert   834 congestion_wait(BLK_RW_ASYNC, HZ/100);
> 5 file.c           fat_file_release              137 congestion_wait(BLK_RW_ASYNC, HZ/10);
> 6 journal.c        reiserfs_async_progress_wait  990 congestion_wait(BLK_RW_ASYNC, HZ / 10);
> 7 kmem.c           kmem_alloc                     61 congestion_wait(BLK_RW_ASYNC, HZ/50);
> 8 kmem.c           kmem_zone_alloc               117 congestion_wait(BLK_RW_ASYNC, HZ/50);
> 9 xfs_buf.c        _xfs_buf_lookup_pages         343 congestion_wait(BLK_RW_ASYNC, HZ/50);
> a backing-dev.c    congestion_wait               751 long congestion_wait(int sync, long timeout)
> b memcontrol.c     mem_cgroup_force_empty       2858 congestion_wait(BLK_RW_ASYNC, HZ/10);
> c page-writeback.c throttle_vm_writeout          674 congestion_wait(BLK_RW_ASYNC, HZ/10);
> d page_alloc.c     __alloc_pages_high_priority  1753 congestion_wait(BLK_RW_ASYNC, HZ/50);
> e page_alloc.c     __alloc_pages_slowpath       1924 congestion_wait(BLK_RW_ASYNC, HZ/50);
> f vmscan.c         shrink_inactive_list         1136 congestion_wait(BLK_RW_ASYNC, HZ/10);
> g vmscan.c         shrink_inactive_list         1220 congestion_wait(BLK_RW_ASYNC, HZ/10);
> h vmscan.c         do_try_to_free_pages         1837 congestion_wait(BLK_RW_ASYNC, HZ/10);
> i vmscan.c         balance_pgdat                2161 congestion_wait(BLK_RW_ASYNC, HZ/10);
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-24  2:10         ` Minchan Kim
@ 2010-03-24 10:57           ` Mel Gorman
  0 siblings, 0 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-24 10:57 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 11:10:14AM +0900, Minchan Kim wrote:
> On Wed, Mar 24, 2010 at 10:53 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Wed, 24 Mar 2010 10:47:41 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >
> >> On Wed, Mar 24, 2010 at 10:03 AM, KAMEZAWA Hiroyuki
> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> > On Tue, 23 Mar 2010 12:25:42 +0000
> >> > Mel Gorman <mel@csn.ul.ie> wrote:
> >> >
> >> >> This patch is the core of a mechanism which compacts memory in a zone by
> >> >> relocating movable pages towards the end of the zone.
> >> >>
> >> >> A single compaction run involves a migration scanner and a free scanner.
> >> >> Both scanners operate on pageblock-sized areas in the zone. The migration
> >> >> scanner starts at the bottom of the zone and searches for all movable pages
> >> >> within each area, isolating them onto a private list called migratelist.
> >> >> The free scanner starts at the top of the zone and searches for suitable
> >> >> areas and consumes the free pages within making them available for the
> >> >> migration scanner. The pages isolated for migration are then migrated to
> >> >> the newly isolated free pages.
> >> >>
> >> >> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >> >> Acked-by: Rik van Riel <riel@redhat.com>
> >> >> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> >> >
> >> > I think lru_add_drain() or lru_add_drain_all() should be called somewhere
> >> > when we do __isolate_lru_page(). But it's (_all is) slow....
> >> >
> >>
> >> migrate_prep does it.
> >>

Yep.

> > Thanks.
> >
> > Hmm...then, lru_add_drain_all() is called at each (32-page migrate) iteration.
> > Isn't it too slow to be called at such a frequency?
> 
> I agree. We can move migrate_prep() into compact_zone().
> 

Indeed we can. It's moved now.
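
Roughly this shape after the move - a sketch rather than the exact diff,
using only names that appear in the series:

	static int compact_zone(struct zone *zone, struct compact_control *cc)
	{
		int ret;

		/*
		 * Drain the per-CPU LRU pagevecs once per compaction run
		 * instead of once per 32-page migration batch.
		 */
		migrate_prep();

		while ((ret = compact_finished(zone, cc)) == COMPACT_INCOMPLETE) {
			/* isolate a batch of movable pages and migrate them */
		}

		return ret;
	}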

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-23 23:10   ` Minchan Kim
@ 2010-03-24 11:11     ` Mel Gorman
  2010-03-24 11:59       ` Minchan Kim
  0 siblings, 1 reply; 78+ messages in thread
From: Mel Gorman @ 2010-03-24 11:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 08:10:40AM +0900, Minchan Kim wrote:
> Hi, Mel.
> 
> On Tue, Mar 23, 2010 at 9:25 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > Ordinarily when a high-order allocation fails, direct reclaim is entered to
> > free pages to satisfy the allocation.  With this patch, it is determined if
> > an allocation failed due to external fragmentation instead of low memory
> > and if so, the calling process will compact until a suitable page is
> > freed. Compaction by moving pages in memory is considerably cheaper than
> > paging out to disk and works where there are locked pages or no swap. If
> > compaction fails to free a page of a suitable size, then reclaim will
> > still occur.
> >
> > Direct compaction returns as soon as possible. As each block is compacted,
> > it is checked if a suitable page has been freed and if so, it returns.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> >  include/linux/compaction.h |   16 +++++-
> >  include/linux/vmstat.h     |    1 +
> >  mm/compaction.c            |  118 ++++++++++++++++++++++++++++++++++++++++++++
> >  mm/page_alloc.c            |   26 ++++++++++
> >  mm/vmstat.c                |   15 +++++-
> >  5 files changed, 172 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index c94890b..b851428 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -1,14 +1,26 @@
> >  #ifndef _LINUX_COMPACTION_H
> >  #define _LINUX_COMPACTION_H
> >
> > -/* Return values for compact_zone() */
> > +/* Return values for compact_zone() and try_to_compact_pages() */
> >  #define COMPACT_INCOMPLETE     0
> > -#define COMPACT_COMPLETE       1
> > +#define COMPACT_PARTIAL                1
> > +#define COMPACT_COMPLETE       2
> >
> >  #ifdef CONFIG_COMPACTION
> >  extern int sysctl_compact_memory;
> >  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> >                        void __user *buffer, size_t *length, loff_t *ppos);
> > +
> > +extern int fragmentation_index(struct zone *zone, unsigned int order);
> > +extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > +                       int order, gfp_t gfp_mask, nodemask_t *mask);
> > +#else
> > +static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > +                       int order, gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> > +       return COMPACT_INCOMPLETE;
> > +}
> > +
> >  #endif /* CONFIG_COMPACTION */
> >
> >  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index 56e4b44..b4b4d34 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >                KSWAPD_SKIP_CONGESTION_WAIT,
> >                PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> >                COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> > +               COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
> >  #ifdef CONFIG_HUGETLB_PAGE
> >                HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> >  #endif
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 8df6e3d..6688700 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -34,6 +34,8 @@ struct compact_control {
> >        unsigned long nr_anon;
> >        unsigned long nr_file;
> >
> > +       unsigned int order;             /* order a direct compactor needs */
> > +       int migratetype;                /* MOVABLE, RECLAIMABLE etc */
> >        struct zone *zone;
> >  };
> >
> > @@ -301,10 +303,31 @@ static void update_nr_listpages(struct compact_control *cc)
> >  static inline int compact_finished(struct zone *zone,
> >                                                struct compact_control *cc)
> >  {
> > +       unsigned int order;
> > +       unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
> > +
> >        /* Compaction run completes if the migrate and free scanner meet */
> >        if (cc->free_pfn <= cc->migrate_pfn)
> >                return COMPACT_COMPLETE;
> >
> > +       /* Compaction run is not finished if the watermark is not met */
> > +       if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
> > +               return COMPACT_INCOMPLETE;
> > +
> > +       if (cc->order == -1)
> > +               return COMPACT_INCOMPLETE;
> > +
> > +       /* Direct compactor: Is a suitable page free? */
> > +       for (order = cc->order; order < MAX_ORDER; order++) {
> > +               /* Job done if page is free of the right migratetype */
> > +               if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
> > +                       return COMPACT_PARTIAL;
> > +
> > +               /* Job done if allocation would set block type */
> > +               if (order >= pageblock_order && zone->free_area[order].nr_free)
> > +                       return COMPACT_PARTIAL;
> > +       }
> > +
> >        return COMPACT_INCOMPLETE;
> >  }
> >
> > @@ -348,6 +371,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> >        return ret;
> >  }
> >
> > +static inline unsigned long compact_zone_order(struct zone *zone,
> > +                                               int order, gfp_t gfp_mask)
> > +{
> > +       struct compact_control cc = {
> > +               .nr_freepages = 0,
> > +               .nr_migratepages = 0,
> > +               .order = order,
> > +               .migratetype = allocflags_to_migratetype(gfp_mask),
> > +               .zone = zone,
> > +       };
> > +       INIT_LIST_HEAD(&cc.freepages);
> > +       INIT_LIST_HEAD(&cc.migratepages);
> > +
> > +       return compact_zone(zone, &cc);
> > +}
> > +
> > +/**
> > + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> > + * @zonelist: The zonelist used for the current allocation
> > + * @order: The order of the current allocation
> > + * @gfp_mask: The GFP mask of the current allocation
> > + * @nodemask: The allowed nodes to allocate from
> > + *
> > + * This is the main entry point for direct page compaction.
> > + */
> > +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > +                       int order, gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> > +       enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > +       int may_enter_fs = gfp_mask & __GFP_FS;
> > +       int may_perform_io = gfp_mask & __GFP_IO;
> > +       unsigned long watermark;
> > +       struct zoneref *z;
> > +       struct zone *zone;
> > +       int rc = COMPACT_INCOMPLETE;
> > +
> > +       /* Check whether it is worth even starting compaction */
> > +       if (order == 0 || !may_enter_fs || !may_perform_io)
> > +               return rc;
> > +
> > +       /*
> > +        * We will not stall if the necessary conditions are not met for
> > +        * migration but direct reclaim seems to account stalls similarly
> > +        */
> 
> I can't understand this comment.
> In the case of direct reclaim, the time spent in shrink_zones() is just a
> stall from the point of view of the allocation caller.
> So "Allocation is stalled" makes sense to me.
> 
> But "Compaction is stalled" doesn't make sense to me.

I considered a "stall" to be when the allocator is doing work that is not
allocation-related such as page reclaim or in this case - memory compaction.

> How about "COMPACTION_DIRECT" like "PGSCAN_DIRECT"?

PGSCAN_DIRECT is a page-based counter of the number of pages scanned. The
similar naming but very different meaning could be confusing to someone not
familiar with the counters. The event being counted here is the number of
times compaction happened just like ALLOCSTALL counts the number of times
direct reclaim happened.

How about COMPACTSTALL like ALLOCSTALL? :/

> I think it's straightforward.
> Naming is important since it becomes part of the ABI.
> 
> > +       count_vm_event(COMPACTSTALL);
> > +
> 
> 
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-24  1:19   ` KAMEZAWA Hiroyuki
@ 2010-03-24 11:40     ` Mel Gorman
  2010-03-25  0:30       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 78+ messages in thread
From: Mel Gorman @ 2010-03-24 11:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 10:19:27AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 23 Mar 2010 12:25:45 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Ordinarily when a high-order allocation fails, direct reclaim is entered to
> > free pages to satisfy the allocation.  With this patch, it is determined if
> > an allocation failed due to external fragmentation instead of low memory
> > and if so, the calling process will compact until a suitable page is
> > freed. Compaction by moving pages in memory is considerably cheaper than
> > paging out to disk and works where there are locked pages or no swap. If
> > compaction fails to free a page of a suitable size, then reclaim will
> > still occur.
> > 
> > Direct compaction returns as soon as possible. As each block is compacted,
> > it is checked if a suitable page has been freed and if so, it returns.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> >  include/linux/compaction.h |   16 +++++-
> >  include/linux/vmstat.h     |    1 +
> >  mm/compaction.c            |  118 ++++++++++++++++++++++++++++++++++++++++++++
> >  mm/page_alloc.c            |   26 ++++++++++
> >  mm/vmstat.c                |   15 +++++-
> >  5 files changed, 172 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index c94890b..b851428 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -1,14 +1,26 @@
> >  #ifndef _LINUX_COMPACTION_H
> >  #define _LINUX_COMPACTION_H
> >  
> > -/* Return values for compact_zone() */
> > +/* Return values for compact_zone() and try_to_compact_pages() */
> >  #define COMPACT_INCOMPLETE	0
> > -#define COMPACT_COMPLETE	1
> > +#define COMPACT_PARTIAL		1
> > +#define COMPACT_COMPLETE	2
> >  
> >  #ifdef CONFIG_COMPACTION
> >  extern int sysctl_compact_memory;
> >  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> >  			void __user *buffer, size_t *length, loff_t *ppos);
> > +
> > +extern int fragmentation_index(struct zone *zone, unsigned int order);
> > +extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > +			int order, gfp_t gfp_mask, nodemask_t *mask);
> > +#else
> > +static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > +			int order, gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> > +	return COMPACT_INCOMPLETE;
> > +}
> > +
> >  #endif /* CONFIG_COMPACTION */
> >  
> >  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index 56e4b44..b4b4d34 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >  		KSWAPD_SKIP_CONGESTION_WAIT,
> >  		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> >  		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> > +		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
> >  #ifdef CONFIG_HUGETLB_PAGE
> >  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> >  #endif
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 8df6e3d..6688700 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -34,6 +34,8 @@ struct compact_control {
> >  	unsigned long nr_anon;
> >  	unsigned long nr_file;
> >  
> > +	unsigned int order;		/* order a direct compactor needs */
> > +	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
> >  	struct zone *zone;
> >  };
> >  
> > @@ -301,10 +303,31 @@ static void update_nr_listpages(struct compact_control *cc)
> >  static inline int compact_finished(struct zone *zone,
> >  						struct compact_control *cc)
> >  {
> > +	unsigned int order;
> > +	unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
> > +
> >  	/* Compaction run completes if the migrate and free scanner meet */
> >  	if (cc->free_pfn <= cc->migrate_pfn)
> >  		return COMPACT_COMPLETE;
> >  
> > +	/* Compaction run is not finished if the watermark is not met */
> > +	if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
> > +		return COMPACT_INCOMPLETE;
> > +
> > +	if (cc->order == -1)
> > +		return COMPACT_INCOMPLETE;
> > +
> > +	/* Direct compactor: Is a suitable page free? */
> > +	for (order = cc->order; order < MAX_ORDER; order++) {
> > +		/* Job done if page is free of the right migratetype */
> > +		if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
> > +			return COMPACT_PARTIAL;
> > +
> > +		/* Job done if allocation would set block type */
> > +		if (order >= pageblock_order && zone->free_area[order].nr_free)
> > +			return COMPACT_PARTIAL;
> > +	}
> > +
> >  	return COMPACT_INCOMPLETE;
> >  }
> >  
> > @@ -348,6 +371,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> >  	return ret;
> >  }
> >  
> > +static inline unsigned long compact_zone_order(struct zone *zone,
> > +						int order, gfp_t gfp_mask)
> > +{
> > +	struct compact_control cc = {
> > +		.nr_freepages = 0,
> > +		.nr_migratepages = 0,
> > +		.order = order,
> > +		.migratetype = allocflags_to_migratetype(gfp_mask),
> > +		.zone = zone,
> > +	};
> > +	INIT_LIST_HEAD(&cc.freepages);
> > +	INIT_LIST_HEAD(&cc.migratepages);
> > +
> > +	return compact_zone(zone, &cc);
> > +}
> > +
> > +/**
> > + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> > + * @zonelist: The zonelist used for the current allocation
> > + * @order: The order of the current allocation
> > + * @gfp_mask: The GFP mask of the current allocation
> > + * @nodemask: The allowed nodes to allocate from
> > + *
> > + * This is the main entry point for direct page compaction.
> > + */
> > +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > +			int order, gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> > +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > +	int may_enter_fs = gfp_mask & __GFP_FS;
> > +	int may_perform_io = gfp_mask & __GFP_IO;
> > +	unsigned long watermark;
> > +	struct zoneref *z;
> > +	struct zone *zone;
> > +	int rc = COMPACT_INCOMPLETE;
> > +
> > +	/* Check whether it is worth even starting compaction */
> > +	if (order == 0 || !may_enter_fs || !may_perform_io)
> > +		return rc;
> > +
> > +	/*
> > +	 * We will not stall if the necessary conditions are not met for
> > +	 * migration but direct reclaim seems to account stalls similarly
> > +	 */
> > +	count_vm_event(COMPACTSTALL);
> > +
> > +	/* Compact each zone in the list */
> > +	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> > +								nodemask) {
> > +		int fragindex;
> > +		int status;
> > +
> > +		/*
> > +		 * Watermarks for order-0 must be met for compaction. Note
> > +		 * the 2UL. This is because during migration, copies of
> > +		 * pages need to be allocated and for a short time, the
> > +		 * footprint is higher
> > +		 */
> > +		watermark = low_wmark_pages(zone) + (2UL << order);
> > +		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> > +			continue;
> > +
> > +		/*
> > +		 * fragmentation index determines if allocation failures are
> > +		 * due to low memory or external fragmentation
> > +		 *
> > +		 * index of -1 implies allocations might succeed depending
> > +		 * 	on watermarks
> > +		 * index < 500 implies alloc failure is due to lack of memory
> > +		 *
> > +		 * XXX: The choice of 500 is arbitrary. Reinvestigate
> > +		 *      appropriately to determine a sensible default.
> > +		 *      and what it means when watermarks are also taken
> > +		 *      into account. Consider making it a sysctl
> > +		 */
> > +		fragindex = fragmentation_index(zone, order);
> > +		if (fragindex >= 0 && fragindex <= 500)
> > +			continue;
> > +
> > +		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
> > +			rc = COMPACT_PARTIAL;
> > +			break;
> > +		}
> > +
> > +		status = compact_zone_order(zone, order, gfp_mask);
> > +		rc = max(status, rc);
> 
> Hm...then, scanning over the whole zone until success of migration at
> each failure?

Sorry for my lack of understanding but your question is difficult to
understand.

You might mean "scanning over the whole zonelist" rather than zone. In that
case, the zone_watermark_ok() checks before and after compaction will exit the
loop rather than moving to the next zone in the list.

I'm not sure what you mean by "at each failure". The worst-case scenario
is that a process compacts the entire zone and still fails to meet the
watermarks. The best-case scenario is that it does a small amount of
compaction in the compact_zone() loop and finds that compact_finished()
causes the loop to exit before the whole zone is compacted.

> Is it meaningful that multiple tasks run direct-compaction against
> a zone (from zone->start_pfn to zone->end_pfn) in parallel ?
> ex) running order=3 compaction while other thread runs order=5 compaction.
> 

It is meaningful in that "it will work" but there is a good chance that it's
pointless. To what degree it's pointless depends on what happened between
Compaction Process A starting and Compaction Process B starting. If kswapd is also
awake, then it might be worthwhile. By and large, the scanning is fast enough
that it won't be very noticeable.

My feeling is that multiple processes entering compaction at all is a bad
situation to be in. It implies there are multiple processes requiring
high-order pages. Maybe if transparent huge pages were merged, it'd be
expected but otherwise it'd be a surprise.

> Can't we find a clever way to find [start_pfn, end_pfn) for scanning rather than
> [zone->start_pfn, zone->start_pfn + zone->spanned_pages) ?
> 

For sure. An early iteration of these patches stored the PFNs last scanned
for migration in struct zone and would use that as a starting point. It'd
wrap around at least once when it encountered the free page scanner so
that the zone would be scanned at least once. A more convoluted
iteration stored a list of compactors in a linked list. When selecting a
pageblock to migrate pages from, it'd check the list and avoid scanning
the same block as any other process.

I dropped these modifications for a few reasons

a) It added complexity for a situation that may not be encountered in
   practice.
b) Arguably, it would also make sense to simply allow only one compactor
   within a zone at a time and use a mutex
c) I had no data on why multiple processes would be direct compacting

The last point was the most important. I wanted to avoid complexity unless
there was a good reason for it. If we do encounter a situation where
multiple compactors are causing problems, I'd be more likely to ask "why
are there so many high-order allocations happening simultaneously?" than
"how can we make compaction smarter?"

> I'm sorry if I miss something...
> 

I don't think you have. Sorry for my poor understanding if I missed
answering any of your queries.

> > +	}
> > +
> > +	return rc;
> > +}
> > +
> > +
> >  /* Compact all zones within a node */
> >  static int compact_node(int nid)
> >  {
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9708143..e301108 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -49,6 +49,7 @@
> >  #include <linux/debugobjects.h>
> >  #include <linux/kmemleak.h>
> >  #include <linux/memory.h>
> > +#include <linux/compaction.h>
> >  #include <trace/events/kmem.h>
> >  #include <linux/ftrace_event.h>
> >  
> > @@ -1765,6 +1766,31 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >  
> >  	cond_resched();
> >  
> > +	/* Try memory compaction for high-order allocations before reclaim */
> > +	if (order) {
> > +		*did_some_progress = try_to_compact_pages(zonelist,
> > +						order, gfp_mask, nodemask);
> > +		if (*did_some_progress != COMPACT_INCOMPLETE) {
> > +			page = get_page_from_freelist(gfp_mask, nodemask,
> > +					order, zonelist, high_zoneidx,
> > +					alloc_flags, preferred_zone,
> > +					migratetype);
> > +			if (page) {
> > +				__count_vm_event(COMPACTSUCCESS);
> > +				return page;
> > +			}
> > +
> > +			/*
> > +			 * It's bad if compaction run occurs and fails.
> > +			 * The most likely reason is that pages exist,
> > +			 * but not enough to satisfy watermarks.
> > +			 */
> > +			count_vm_event(COMPACTFAIL);
> > +
> > +			cond_resched();
> > +		}
> > +	}
> > +
> >  	/* We now go into synchronous reclaim */
> >  	cpuset_memory_pressure_bump();
> >  	p->flags |= PF_MEMALLOC;
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index af88647..c88f285 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -560,7 +560,7 @@ static int unusable_show(struct seq_file *m, void *arg)
> >   * The value can be used to determine if page reclaim or compaction
> >   * should be used
> >   */
> > -int fragmentation_index(unsigned int order, struct contig_page_info *info)
> > +int __fragmentation_index(unsigned int order, struct contig_page_info *info)
> >  {
> >  	unsigned long requested = 1UL << order;
> >  
> > @@ -580,6 +580,14 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
> >  	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
> >  }
> >  
> > +/* Same as __fragmentation index but allocs contig_page_info on stack */
> > +int fragmentation_index(struct zone *zone, unsigned int order)
> > +{
> > +	struct contig_page_info info;
> > +
> > +	fill_contig_page_info(zone, order, &info);
> > +	return __fragmentation_index(order, &info);
> > +}
> >  
> >  static void extfrag_show_print(struct seq_file *m,
> >  					pg_data_t *pgdat, struct zone *zone)
> > @@ -595,7 +603,7 @@ static void extfrag_show_print(struct seq_file *m,
> >  				zone->name);
> >  	for (order = 0; order < MAX_ORDER; ++order) {
> >  		fill_contig_page_info(zone, order, &info);
> > -		index = fragmentation_index(order, &info);
> > +		index = __fragmentation_index(order, &info);
> >  		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
> >  	}
> >  
> > @@ -895,6 +903,9 @@ static const char * const vmstat_text[] = {
> >  	"compact_blocks_moved",
> >  	"compact_pages_moved",
> >  	"compact_pagemigrate_failed",
> > +	"compact_stall",
> > +	"compact_fail",
> > +	"compact_success",
> >  
> >  #ifdef CONFIG_HUGETLB_PAGE
> >  	"htlb_buddy_alloc_success",
> > -- 
> > 1.6.5
> > 
> > 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-24 11:11     ` Mel Gorman
@ 2010-03-24 11:59       ` Minchan Kim
  2010-03-24 12:06         ` Minchan Kim
  2010-03-24 12:09         ` Mel Gorman
  0 siblings, 2 replies; 78+ messages in thread
From: Minchan Kim @ 2010-03-24 11:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 8:11 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Wed, Mar 24, 2010 at 08:10:40AM +0900, Minchan Kim wrote:
>> Hi, Mel.
>>
>> On Tue, Mar 23, 2010 at 9:25 PM, Mel Gorman <mel@csn.ul.ie> wrote:
>> > Ordinarily when a high-order allocation fails, direct reclaim is entered to
>> > free pages to satisfy the allocation.  With this patch, it is determined if
>> > an allocation failed due to external fragmentation instead of low memory
>> > and if so, the calling process will compact until a suitable page is
>> > freed. Compaction by moving pages in memory is considerably cheaper than
>> > paging out to disk and works where there are locked pages or no swap. If
>> > compaction fails to free a page of a suitable size, then reclaim will
>> > still occur.
>> >
>> > Direct compaction returns as soon as possible. As each block is compacted,
>> > it is checked if a suitable page has been freed and if so, it returns.
>> >
>> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> > Acked-by: Rik van Riel <riel@redhat.com>
>> > ---
>> >  include/linux/compaction.h |   16 +++++-
>> >  include/linux/vmstat.h     |    1 +
>> >  mm/compaction.c            |  118 ++++++++++++++++++++++++++++++++++++++++++++
>> >  mm/page_alloc.c            |   26 ++++++++++
>> >  mm/vmstat.c                |   15 +++++-
>> >  5 files changed, 172 insertions(+), 4 deletions(-)
>> >
>> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>> > index c94890b..b851428 100644
>> > --- a/include/linux/compaction.h
>> > +++ b/include/linux/compaction.h
>> > @@ -1,14 +1,26 @@
>> >  #ifndef _LINUX_COMPACTION_H
>> >  #define _LINUX_COMPACTION_H
>> >
>> > -/* Return values for compact_zone() */
>> > +/* Return values for compact_zone() and try_to_compact_pages() */
>> >  #define COMPACT_INCOMPLETE     0
>> > -#define COMPACT_COMPLETE       1
>> > +#define COMPACT_PARTIAL                1
>> > +#define COMPACT_COMPLETE       2
>> >
>> >  #ifdef CONFIG_COMPACTION
>> >  extern int sysctl_compact_memory;
>> >  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>> >                        void __user *buffer, size_t *length, loff_t *ppos);
>> > +
>> > +extern int fragmentation_index(struct zone *zone, unsigned int order);
>> > +extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> > +                       int order, gfp_t gfp_mask, nodemask_t *mask);
>> > +#else
>> > +static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> > +                       int order, gfp_t gfp_mask, nodemask_t *nodemask)
>> > +{
>> > +       return COMPACT_INCOMPLETE;
>> > +}
>> > +
>> >  #endif /* CONFIG_COMPACTION */
>> >
>> >  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
>> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> > index 56e4b44..b4b4d34 100644
>> > --- a/include/linux/vmstat.h
>> > +++ b/include/linux/vmstat.h
>> > @@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>> >                KSWAPD_SKIP_CONGESTION_WAIT,
>> >                PAGEOUTRUN, ALLOCSTALL, PGROTATED,
>> >                COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
>> > +               COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
>> >  #ifdef CONFIG_HUGETLB_PAGE
>> >                HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
>> >  #endif
>> > diff --git a/mm/compaction.c b/mm/compaction.c
>> > index 8df6e3d..6688700 100644
>> > --- a/mm/compaction.c
>> > +++ b/mm/compaction.c
>> > @@ -34,6 +34,8 @@ struct compact_control {
>> >        unsigned long nr_anon;
>> >        unsigned long nr_file;
>> >
>> > +       unsigned int order;             /* order a direct compactor needs */
>> > +       int migratetype;                /* MOVABLE, RECLAIMABLE etc */
>> >        struct zone *zone;
>> >  };
>> >
>> > @@ -301,10 +303,31 @@ static void update_nr_listpages(struct compact_control *cc)
>> >  static inline int compact_finished(struct zone *zone,
>> >                                                struct compact_control *cc)
>> >  {
>> > +       unsigned int order;
>> > +       unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
>> > +
>> >        /* Compaction run completes if the migrate and free scanner meet */
>> >        if (cc->free_pfn <= cc->migrate_pfn)
>> >                return COMPACT_COMPLETE;
>> >
>> > +       /* Compaction run is not finished if the watermark is not met */
>> > +       if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
>> > +               return COMPACT_INCOMPLETE;
>> > +
>> > +       if (cc->order == -1)
>> > +               return COMPACT_INCOMPLETE;
>> > +
>> > +       /* Direct compactor: Is a suitable page free? */
>> > +       for (order = cc->order; order < MAX_ORDER; order++) {
>> > +               /* Job done if page is free of the right migratetype */
>> > +               if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
>> > +                       return COMPACT_PARTIAL;
>> > +
>> > +               /* Job done if allocation would set block type */
>> > +               if (order >= pageblock_order && zone->free_area[order].nr_free)
>> > +                       return COMPACT_PARTIAL;
>> > +       }
>> > +
>> >        return COMPACT_INCOMPLETE;
>> >  }
>> >
>> > @@ -348,6 +371,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>> >        return ret;
>> >  }
>> >
>> > +static inline unsigned long compact_zone_order(struct zone *zone,
>> > +                                               int order, gfp_t gfp_mask)
>> > +{
>> > +       struct compact_control cc = {
>> > +               .nr_freepages = 0,
>> > +               .nr_migratepages = 0,
>> > +               .order = order,
>> > +               .migratetype = allocflags_to_migratetype(gfp_mask),
>> > +               .zone = zone,
>> > +       };
>> > +       INIT_LIST_HEAD(&cc.freepages);
>> > +       INIT_LIST_HEAD(&cc.migratepages);
>> > +
>> > +       return compact_zone(zone, &cc);
>> > +}
>> > +
>> > +/**
>> > + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
>> > + * @zonelist: The zonelist used for the current allocation
>> > + * @order: The order of the current allocation
>> > + * @gfp_mask: The GFP mask of the current allocation
>> > + * @nodemask: The allowed nodes to allocate from
>> > + *
>> > + * This is the main entry point for direct page compaction.
>> > + */
>> > +unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> > +                       int order, gfp_t gfp_mask, nodemask_t *nodemask)
>> > +{
>> > +       enum zone_type high_zoneidx = gfp_zone(gfp_mask);
>> > +       int may_enter_fs = gfp_mask & __GFP_FS;
>> > +       int may_perform_io = gfp_mask & __GFP_IO;
>> > +       unsigned long watermark;
>> > +       struct zoneref *z;
>> > +       struct zone *zone;
>> > +       int rc = COMPACT_INCOMPLETE;
>> > +
>> > +       /* Check whether it is worth even starting compaction */
>> > +       if (order == 0 || !may_enter_fs || !may_perform_io)
>> > +               return rc;
>> > +
>> > +       /*
>> > +        * We will not stall if the necessary conditions are not met for
>> > +        * migration but direct reclaim seems to account stalls similarly
>> > +        */
>>
>> I can't understand this comment.
>> In the case of direct reclaim, the time spent in shrink_zones() is just a
>> stall from the point of view of the allocation caller.
>> So "Allocation is stalled" makes sense to me.
>>
>> But "Compaction is stalled" doesn't make sense to me.
>
> I considered a "stall" to be when the allocator is doing work that is not
> allocation-related such as page reclaim or in this case - memory compaction.

I agree.

>
>> How about "COMPACTION_DIRECT" like "PGSCAN_DIRECT"?
>
> PGSCAN_DIRECT is a page-based counter of the number of pages scanned. The
> similar naming but very different meaning could be confusing to someone not
> familiar with the counters. The event being counted here is the number of
> times compaction happened just like ALLOCSTALL counts the number of times
> direct reclaim happened.

You're right. I just wanted to change the name to one which implies
direct compaction.
That's because I believe we will implement it in the background too.
Then it's more straightforward, I think. :-)

>
> How about COMPACTSTALL like ALLOCSTALL? :/

I wouldn't have a strong objection any more if you insist on it.

>> I think it's straightforward.
>> Naming is important since it becomes part of the ABI.
>>
>> > +       count_vm_event(COMPACTSTALL);
>> > +
>>
>>
>>
>>
>>
>> --
>> Kind regards,
>> Minchan Kim
>>
>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-24 11:59       ` Minchan Kim
@ 2010-03-24 12:06         ` Minchan Kim
  2010-03-24 12:10           ` Mel Gorman
  2010-03-24 12:09         ` Mel Gorman
  1 sibling, 1 reply; 78+ messages in thread
From: Minchan Kim @ 2010-03-24 12:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 8:59 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Wed, Mar 24, 2010 at 8:11 PM, Mel Gorman <mel@csn.ul.ie> wrote:
>> On Wed, Mar 24, 2010 at 08:10:40AM +0900, Minchan Kim wrote:
>>> Hi, Mel.
>>>
>>> On Tue, Mar 23, 2010 at 9:25 PM, Mel Gorman <mel@csn.ul.ie> wrote:
>>> > Ordinarily when a high-order allocation fails, direct reclaim is entered to
>>> > free pages to satisfy the allocation.  With this patch, it is determined if
>>> > an allocation failed due to external fragmentation instead of low memory
>>> > and if so, the calling process will compact until a suitable page is
>>> > freed. Compaction by moving pages in memory is considerably cheaper than
>>> > paging out to disk and works where there are locked pages or no swap. If
>>> > compaction fails to free a page of a suitable size, then reclaim will
>>> > still occur.
>>> >
>>> > Direct compaction returns as soon as possible. As each block is compacted,
>>> > it is checked if a suitable page has been freed and if so, it returns.
>>> >
>>> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>>> > Acked-by: Rik van Riel <riel@redhat.com>
>>> > ---
>>> >  include/linux/compaction.h |   16 +++++-
>>> >  include/linux/vmstat.h     |    1 +
>>> >  mm/compaction.c            |  118 ++++++++++++++++++++++++++++++++++++++++++++
>>> >  mm/page_alloc.c            |   26 ++++++++++
>>> >  mm/vmstat.c                |   15 +++++-
>>> >  5 files changed, 172 insertions(+), 4 deletions(-)
>>> >
>>> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>>> > index c94890b..b851428 100644
>>> > --- a/include/linux/compaction.h
>>> > +++ b/include/linux/compaction.h
>>> > @@ -1,14 +1,26 @@
>>> >  #ifndef _LINUX_COMPACTION_H
>>> >  #define _LINUX_COMPACTION_H
>>> >
>>> > -/* Return values for compact_zone() */
>>> > +/* Return values for compact_zone() and try_to_compact_pages() */
>>> >  #define COMPACT_INCOMPLETE     0
>>> > -#define COMPACT_COMPLETE       1
>>> > +#define COMPACT_PARTIAL                1
>>> > +#define COMPACT_COMPLETE       2
>>> >
>>> >  #ifdef CONFIG_COMPACTION
>>> >  extern int sysctl_compact_memory;
>>> >  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>>> >                        void __user *buffer, size_t *length, loff_t *ppos);
>>> > +
>>> > +extern int fragmentation_index(struct zone *zone, unsigned int order);
>>> > +extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
>>> > +                       int order, gfp_t gfp_mask, nodemask_t *mask);
>>> > +#else
>>> > +static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
>>> > +                       int order, gfp_t gfp_mask, nodemask_t *nodemask)
>>> > +{
>>> > +       return COMPACT_INCOMPLETE;
>>> > +}
>>> > +
>>> >  #endif /* CONFIG_COMPACTION */
>>> >
>>> >  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
>>> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>>> > index 56e4b44..b4b4d34 100644
>>> > --- a/include/linux/vmstat.h
>>> > +++ b/include/linux/vmstat.h
>>> > @@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>> >                KSWAPD_SKIP_CONGESTION_WAIT,
>>> >                PAGEOUTRUN, ALLOCSTALL, PGROTATED,
>>> >                COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
>>> > +               COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
>>> >  #ifdef CONFIG_HUGETLB_PAGE
>>> >                HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
>>> >  #endif
>>> > diff --git a/mm/compaction.c b/mm/compaction.c
>>> > index 8df6e3d..6688700 100644
>>> > --- a/mm/compaction.c
>>> > +++ b/mm/compaction.c
>>> > @@ -34,6 +34,8 @@ struct compact_control {
>>> >        unsigned long nr_anon;
>>> >        unsigned long nr_file;
>>> >
>>> > +       unsigned int order;             /* order a direct compactor needs */
>>> > +       int migratetype;                /* MOVABLE, RECLAIMABLE etc */
>>> >        struct zone *zone;
>>> >  };
>>> >
>>> > @@ -301,10 +303,31 @@ static void update_nr_listpages(struct compact_control *cc)
>>> >  static inline int compact_finished(struct zone *zone,
>>> >                                                struct compact_control *cc)
>>> >  {
>>> > +       unsigned int order;
>>> > +       unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
>>> > +
>>> >        /* Compaction run completes if the migrate and free scanner meet */
>>> >        if (cc->free_pfn <= cc->migrate_pfn)
>>> >                return COMPACT_COMPLETE;
>>> >
>>> > +       /* Compaction run is not finished if the watermark is not met */
>>> > +       if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
>>> > +               return COMPACT_INCOMPLETE;
>>> > +
>>> > +       if (cc->order == -1)
>>> > +               return COMPACT_INCOMPLETE;
>>> > +
>>> > +       /* Direct compactor: Is a suitable page free? */
>>> > +       for (order = cc->order; order < MAX_ORDER; order++) {
>>> > +               /* Job done if page is free of the right migratetype */
>>> > +               if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
>>> > +                       return COMPACT_PARTIAL;
>>> > +
>>> > +               /* Job done if allocation would set block type */
>>> > +               if (order >= pageblock_order && zone->free_area[order].nr_free)
>>> > +                       return COMPACT_PARTIAL;
>>> > +       }
>>> > +
>>> >        return COMPACT_INCOMPLETE;
>>> >  }
>>> >
>>> > @@ -348,6 +371,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>>> >        return ret;
>>> >  }
>>> >
>>> > +static inline unsigned long compact_zone_order(struct zone *zone,
>>> > +                                               int order, gfp_t gfp_mask)
>>> > +{
>>> > +       struct compact_control cc = {
>>> > +               .nr_freepages = 0,
>>> > +               .nr_migratepages = 0,
>>> > +               .order = order,
>>> > +               .migratetype = allocflags_to_migratetype(gfp_mask),
>>> > +               .zone = zone,
>>> > +       };
>>> > +       INIT_LIST_HEAD(&cc.freepages);
>>> > +       INIT_LIST_HEAD(&cc.migratepages);
>>> > +
>>> > +       return compact_zone(zone, &cc);
>>> > +}
>>> > +
>>> > +/**
>>> > + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
>>> > + * @zonelist: The zonelist used for the current allocation
>>> > + * @order: The order of the current allocation
>>> > + * @gfp_mask: The GFP mask of the current allocation
>>> > + * @nodemask: The allowed nodes to allocate from
>>> > + *
>>> > + * This is the main entry point for direct page compaction.
>>> > + */
>>> > +unsigned long try_to_compact_pages(struct zonelist *zonelist,
>>> > +                       int order, gfp_t gfp_mask, nodemask_t *nodemask)
>>> > +{
>>> > +       enum zone_type high_zoneidx = gfp_zone(gfp_mask);
>>> > +       int may_enter_fs = gfp_mask & __GFP_FS;
>>> > +       int may_perform_io = gfp_mask & __GFP_IO;
>>> > +       unsigned long watermark;
>>> > +       struct zoneref *z;
>>> > +       struct zone *zone;
>>> > +       int rc = COMPACT_INCOMPLETE;
>>> > +
>>> > +       /* Check whether it is worth even starting compaction */
>>> > +       if (order == 0 || !may_enter_fs || !may_perform_io)
>>> > +               return rc;
>>> > +
>>> > +       /*
>>> > +        * We will not stall if the necessary conditions are not met for
>>> > +        * migration but direct reclaim seems to account stalls similarly
>>> > +        */

Then, let's remove this comment.


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-24 11:59       ` Minchan Kim
  2010-03-24 12:06         ` Minchan Kim
@ 2010-03-24 12:09         ` Mel Gorman
  2010-03-24 12:25           ` Minchan Kim
  1 sibling, 1 reply; 78+ messages in thread
From: Mel Gorman @ 2010-03-24 12:09 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 08:59:45PM +0900, Minchan Kim wrote:
> On Wed, Mar 24, 2010 at 8:11 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > On Wed, Mar 24, 2010 at 08:10:40AM +0900, Minchan Kim wrote:
> >> Hi, Mel.
> >>
> >> On Tue, Mar 23, 2010 at 9:25 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> >> > Ordinarily when a high-order allocation fails, direct reclaim is entered to
> >> > free pages to satisfy the allocation.  With this patch, it is determined if
> >> > an allocation failed due to external fragmentation instead of low memory
> >> > and if so, the calling process will compact until a suitable page is
> >> > freed. Compaction by moving pages in memory is considerably cheaper than
> >> > paging out to disk and works where there are locked pages or no swap. If
> >> > compaction fails to free a page of a suitable size, then reclaim will
> >> > still occur.
> >> >
> >> > Direct compaction returns as soon as possible. As each block is compacted,
> >> > it is checked if a suitable page has been freed and if so, it returns.
> >> >
> >> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >> > Acked-by: Rik van Riel <riel@redhat.com>
> >> > ---
> >> >  include/linux/compaction.h |   16 +++++-
> >> >  include/linux/vmstat.h     |    1 +
> >> >  mm/compaction.c            |  118 ++++++++++++++++++++++++++++++++++++++++++++
> >> >  mm/page_alloc.c            |   26 ++++++++++
> >> >  mm/vmstat.c                |   15 +++++-
> >> >  5 files changed, 172 insertions(+), 4 deletions(-)
> >> >
> >> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> >> > index c94890b..b851428 100644
> >> > --- a/include/linux/compaction.h
> >> > +++ b/include/linux/compaction.h
> >> > @@ -1,14 +1,26 @@
> >> >  #ifndef _LINUX_COMPACTION_H
> >> >  #define _LINUX_COMPACTION_H
> >> >
> >> > -/* Return values for compact_zone() */
> >> > +/* Return values for compact_zone() and try_to_compact_pages() */
> >> >  #define COMPACT_INCOMPLETE     0
> >> > -#define COMPACT_COMPLETE       1
> >> > +#define COMPACT_PARTIAL                1
> >> > +#define COMPACT_COMPLETE       2
> >> >
> >> >  #ifdef CONFIG_COMPACTION
> >> >  extern int sysctl_compact_memory;
> >> >  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> >> >                        void __user *buffer, size_t *length, loff_t *ppos);
> >> > +
> >> > +extern int fragmentation_index(struct zone *zone, unsigned int order);
> >> > +extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >> > +                       int order, gfp_t gfp_mask, nodemask_t *mask);
> >> > +#else
> >> > +static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >> > +                       int order, gfp_t gfp_mask, nodemask_t *nodemask)
> >> > +{
> >> > +       return COMPACT_INCOMPLETE;
> >> > +}
> >> > +
> >> >  #endif /* CONFIG_COMPACTION */
> >> >
> >> >  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> >> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> >> > index 56e4b44..b4b4d34 100644
> >> > --- a/include/linux/vmstat.h
> >> > +++ b/include/linux/vmstat.h
> >> > @@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >> >                KSWAPD_SKIP_CONGESTION_WAIT,
> >> >                PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> >> >                COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> >> > +               COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
> >> >  #ifdef CONFIG_HUGETLB_PAGE
> >> >                HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> >> >  #endif
> >> > diff --git a/mm/compaction.c b/mm/compaction.c
> >> > index 8df6e3d..6688700 100644
> >> > --- a/mm/compaction.c
> >> > +++ b/mm/compaction.c
> >> > @@ -34,6 +34,8 @@ struct compact_control {
> >> >        unsigned long nr_anon;
> >> >        unsigned long nr_file;
> >> >
> >> > +       unsigned int order;             /* order a direct compactor needs */
> >> > +       int migratetype;                /* MOVABLE, RECLAIMABLE etc */
> >> >        struct zone *zone;
> >> >  };
> >> >
> >> > @@ -301,10 +303,31 @@ static void update_nr_listpages(struct compact_control *cc)
> >> >  static inline int compact_finished(struct zone *zone,
> >> >                                                struct compact_control *cc)
> >> >  {
> >> > +       unsigned int order;
> >> > +       unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
> >> > +
> >> >        /* Compaction run completes if the migrate and free scanner meet */
> >> >        if (cc->free_pfn <= cc->migrate_pfn)
> >> >                return COMPACT_COMPLETE;
> >> >
> >> > +       /* Compaction run is not finished if the watermark is not met */
> >> > +       if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
> >> > +               return COMPACT_INCOMPLETE;
> >> > +
> >> > +       if (cc->order == -1)
> >> > +               return COMPACT_INCOMPLETE;
> >> > +
> >> > +       /* Direct compactor: Is a suitable page free? */
> >> > +       for (order = cc->order; order < MAX_ORDER; order++) {
> >> > +               /* Job done if page is free of the right migratetype */
> >> > +               if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
> >> > +                       return COMPACT_PARTIAL;
> >> > +
> >> > +               /* Job done if allocation would set block type */
> >> > +               if (order >= pageblock_order && zone->free_area[order].nr_free)
> >> > +                       return COMPACT_PARTIAL;
> >> > +       }
> >> > +
> >> >        return COMPACT_INCOMPLETE;
> >> >  }
> >> >
> >> > @@ -348,6 +371,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> >> >        return ret;
> >> >  }
> >> >
> >> > +static inline unsigned long compact_zone_order(struct zone *zone,
> >> > +                                               int order, gfp_t gfp_mask)
> >> > +{
> >> > +       struct compact_control cc = {
> >> > +               .nr_freepages = 0,
> >> > +               .nr_migratepages = 0,
> >> > +               .order = order,
> >> > +               .migratetype = allocflags_to_migratetype(gfp_mask),
> >> > +               .zone = zone,
> >> > +       };
> >> > +       INIT_LIST_HEAD(&cc.freepages);
> >> > +       INIT_LIST_HEAD(&cc.migratepages);
> >> > +
> >> > +       return compact_zone(zone, &cc);
> >> > +}
> >> > +
> >> > +/**
> >> > + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> >> > + * @zonelist: The zonelist used for the current allocation
> >> > + * @order: The order of the current allocation
> >> > + * @gfp_mask: The GFP mask of the current allocation
> >> > + * @nodemask: The allowed nodes to allocate from
> >> > + *
> >> > + * This is the main entry point for direct page compaction.
> >> > + */
> >> > +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >> > +                       int order, gfp_t gfp_mask, nodemask_t *nodemask)
> >> > +{
> >> > +       enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> >> > +       int may_enter_fs = gfp_mask & __GFP_FS;
> >> > +       int may_perform_io = gfp_mask & __GFP_IO;
> >> > +       unsigned long watermark;
> >> > +       struct zoneref *z;
> >> > +       struct zone *zone;
> >> > +       int rc = COMPACT_INCOMPLETE;
> >> > +
> >> > +       /* Check whether it is worth even starting compaction */
> >> > +       if (order == 0 || !may_enter_fs || !may_perform_io)
> >> > +               return rc;
> >> > +
> >> > +       /*
> >> > +        * We will not stall if the necessary conditions are not met for
> >> > +        * migration but direct reclaim seems to account stalls similarly
> >> > +        */
> >>
> >> I can't understand this comment.
> >> In case of direct reclaim, shrink_zones's long time is just stall
> >> by view point of allocation customer.
> >> So "Allocation is stalled" makes sense to me.
> >>
> >> But "Compaction is stalled" doesn't make sense to me.
> >
> > I considered a "stall" to be when the allocator is doing work that is not
> > allocation-related such as page reclaim or in this case - memory compaction.
> 
> I agree.
> 
> >
> >> How about "COMPACTION_DIRECT" like "PGSCAN_DIRECT"?
> >
> > PGSCAN_DIRECT is a page-based counter of the number of pages scanned. The
> > similar naming but very different meaning could be confusing to someone not
> > familiar with the counters. The event being counted here is the number of
> > times compaction happened just like ALLOCSTALL counts the number of times
> > direct reclaim happened.
> 
> You're right. I just wanted to change the name to one which implies
> direct compaction.

I think I'd fully agree with your point if there was more than one way to
stall a process due to compaction. As it is, direct compaction is the only
way to meaningfully stall a process and I can't think of alternative stalls
in the future. Technically, a process using the sysfs or proc triggers for
compaction also stalls but it's not interesting to count those events.
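To be concrete, in this series the event is only bumped on the direct
path; the proc trigger goes through compact_nodes() and is deliberately
left uncounted:

	/* direct path, in try_to_compact_pages() */
	count_vm_event(COMPACTSTALL);

	/*
	 * proc trigger path: sysctl_compaction_handler() -> compact_nodes(),
	 * with no COMPACTSTALL accounting
	 */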

> That's because I believe we will implement it in the background, too.

This is a possibility but in that case it would be a separate process
like kcompactd and I wouldn't count it as a stall as such.

> Then it's more straightforward, I think. :-)
> 
> > How about COMPACTSTALL like ALLOCSTALL? :/
> 
> I wouldn't have a strong objection any more if you insist on it.
> 

I'm not insisting as such, I just don't think renaming it to
PGSCAN_COMPACT_X would be easier to understand.

> >> I think it's straightforward.
> >> Naming is important since it becomes part of the ABI.
> >>
> >> > +       count_vm_event(COMPACTSTALL);
> >> > +
> >>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-24 12:06         ` Minchan Kim
@ 2010-03-24 12:10           ` Mel Gorman
  0 siblings, 0 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-24 12:10 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 09:06:51PM +0900, Minchan Kim wrote:
> On Wed, Mar 24, 2010 at 8:59 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> > On Wed, Mar 24, 2010 at 8:11 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> >> On Wed, Mar 24, 2010 at 08:10:40AM +0900, Minchan Kim wrote:
> >>> Hi, Mel.
> >>>
> >>> On Tue, Mar 23, 2010 at 9:25 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> >>> > Ordinarily when a high-order allocation fails, direct reclaim is entered to
> >>> > free pages to satisfy the allocation.  With this patch, it is determined if
> >>> > an allocation failed due to external fragmentation instead of low memory
> >>> > and if so, the calling process will compact until a suitable page is
> >>> > freed. Compaction by moving pages in memory is considerably cheaper than
> >>> > paging out to disk and works where there are locked pages or no swap. If
> >>> > compaction fails to free a page of a suitable size, then reclaim will
> >>> > still occur.
> >>> >
> >>> > Direct compaction returns as soon as possible. As each block is compacted,
> >>> > it is checked if a suitable page has been freed and if so, it returns.
> >>> >
> >>> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >>> > Acked-by: Rik van Riel <riel@redhat.com>
> >>> > ---
> >>> >  include/linux/compaction.h |   16 +++++-
> >>> >  include/linux/vmstat.h     |    1 +
> >>> >  mm/compaction.c            |  118 ++++++++++++++++++++++++++++++++++++++++++++
> >>> >  mm/page_alloc.c            |   26 ++++++++++
> >>> >  mm/vmstat.c                |   15 +++++-
> >>> >  5 files changed, 172 insertions(+), 4 deletions(-)
> >>> >
> >>> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> >>> > index c94890b..b851428 100644
> >>> > --- a/include/linux/compaction.h
> >>> > +++ b/include/linux/compaction.h
> >>> > @@ -1,14 +1,26 @@
> >>> >  #ifndef _LINUX_COMPACTION_H
> >>> >  #define _LINUX_COMPACTION_H
> >>> >
> >>> > -/* Return values for compact_zone() */
> >>> > +/* Return values for compact_zone() and try_to_compact_pages() */
> >>> >  #define COMPACT_INCOMPLETE     0
> >>> > -#define COMPACT_COMPLETE       1
> >>> > +#define COMPACT_PARTIAL                1
> >>> > +#define COMPACT_COMPLETE       2
> >>> >
> >>> >  #ifdef CONFIG_COMPACTION
> >>> >  extern int sysctl_compact_memory;
> >>> >  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> >>> >                        void __user *buffer, size_t *length, loff_t *ppos);
> >>> > +
> >>> > +extern int fragmentation_index(struct zone *zone, unsigned int order);
> >>> > +extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >>> > +                       int order, gfp_t gfp_mask, nodemask_t *mask);
> >>> > +#else
> >>> > +static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >>> > +                       int order, gfp_t gfp_mask, nodemask_t *nodemask)
> >>> > +{
> >>> > +       return COMPACT_INCOMPLETE;
> >>> > +}
> >>> > +
> >>> >  #endif /* CONFIG_COMPACTION */
> >>> >
> >>> >  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> >>> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> >>> > index 56e4b44..b4b4d34 100644
> >>> > --- a/include/linux/vmstat.h
> >>> > +++ b/include/linux/vmstat.h
> >>> > @@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >>> >                KSWAPD_SKIP_CONGESTION_WAIT,
> >>> >                PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> >>> >                COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> >>> > +               COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
> >>> >  #ifdef CONFIG_HUGETLB_PAGE
> >>> >                HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> >>> >  #endif
> >>> > diff --git a/mm/compaction.c b/mm/compaction.c
> >>> > index 8df6e3d..6688700 100644
> >>> > --- a/mm/compaction.c
> >>> > +++ b/mm/compaction.c
> >>> > @@ -34,6 +34,8 @@ struct compact_control {
> >>> >        unsigned long nr_anon;
> >>> >        unsigned long nr_file;
> >>> >
> >>> > +       unsigned int order;             /* order a direct compactor needs */
> >>> > +       int migratetype;                /* MOVABLE, RECLAIMABLE etc */
> >>> >        struct zone *zone;
> >>> >  };
> >>> >
> >>> > @@ -301,10 +303,31 @@ static void update_nr_listpages(struct compact_control *cc)
> >>> >  static inline int compact_finished(struct zone *zone,
> >>> >                                                struct compact_control *cc)
> >>> >  {
> >>> > +       unsigned int order;
> >>> > +       unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
> >>> > +
> >>> >        /* Compaction run completes if the migrate and free scanner meet */
> >>> >        if (cc->free_pfn <= cc->migrate_pfn)
> >>> >                return COMPACT_COMPLETE;
> >>> >
> >>> > +       /* Compaction run is not finished if the watermark is not met */
> >>> > +       if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
> >>> > +               return COMPACT_INCOMPLETE;
> >>> > +
> >>> > +       if (cc->order == -1)
> >>> > +               return COMPACT_INCOMPLETE;
> >>> > +
> >>> > +       /* Direct compactor: Is a suitable page free? */
> >>> > +       for (order = cc->order; order < MAX_ORDER; order++) {
> >>> > +               /* Job done if page is free of the right migratetype */
> >>> > +               if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
> >>> > +                       return COMPACT_PARTIAL;
> >>> > +
> >>> > +               /* Job done if allocation would set block type */
> >>> > +               if (order >= pageblock_order && zone->free_area[order].nr_free)
> >>> > +                       return COMPACT_PARTIAL;
> >>> > +       }
> >>> > +
> >>> >        return COMPACT_INCOMPLETE;
> >>> >  }
> >>> >
> >>> > @@ -348,6 +371,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> >>> >        return ret;
> >>> >  }
> >>> >
> >>> > +static inline unsigned long compact_zone_order(struct zone *zone,
> >>> > +                                               int order, gfp_t gfp_mask)
> >>> > +{
> >>> > +       struct compact_control cc = {
> >>> > +               .nr_freepages = 0,
> >>> > +               .nr_migratepages = 0,
> >>> > +               .order = order,
> >>> > +               .migratetype = allocflags_to_migratetype(gfp_mask),
> >>> > +               .zone = zone,
> >>> > +       };
> >>> > +       INIT_LIST_HEAD(&cc.freepages);
> >>> > +       INIT_LIST_HEAD(&cc.migratepages);
> >>> > +
> >>> > +       return compact_zone(zone, &cc);
> >>> > +}
> >>> > +
> >>> > +/**
> >>> > + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> >>> > + * @zonelist: The zonelist used for the current allocation
> >>> > + * @order: The order of the current allocation
> >>> > + * @gfp_mask: The GFP mask of the current allocation
> >>> > + * @nodemask: The allowed nodes to allocate from
> >>> > + *
> >>> > + * This is the main entry point for direct page compaction.
> >>> > + */
> >>> > +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >>> > +                       int order, gfp_t gfp_mask, nodemask_t *nodemask)
> >>> > +{
> >>> > +       enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> >>> > +       int may_enter_fs = gfp_mask & __GFP_FS;
> >>> > +       int may_perform_io = gfp_mask & __GFP_IO;
> >>> > +       unsigned long watermark;
> >>> > +       struct zoneref *z;
> >>> > +       struct zone *zone;
> >>> > +       int rc = COMPACT_INCOMPLETE;
> >>> > +
> >>> > +       /* Check whether it is worth even starting compaction */
> >>> > +       if (order == 0 || !may_enter_fs || !may_perform_io)
> >>> > +               return rc;
> >>> > +
> >>> > +       /*
> >>> > +        * We will not stall if the necessary conditions are not met for
> >>> > +        * migration but direct reclaim seems to account stalls similarly
> >>> > +        */
> 
> Then, let's remove this comment.
> 

Yes, it hinders more than it helps in this case. It's deleted now.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-24 12:09         ` Mel Gorman
@ 2010-03-24 12:25           ` Minchan Kim
  0 siblings, 0 replies; 78+ messages in thread
From: Minchan Kim @ 2010-03-24 12:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 9:09 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Wed, Mar 24, 2010 at 08:59:45PM +0900, Minchan Kim wrote:
>> On Wed, Mar 24, 2010 at 8:11 PM, Mel Gorman <mel@csn.ul.ie> wrote:
>> > On Wed, Mar 24, 2010 at 08:10:40AM +0900, Minchan Kim wrote:
>> >> Hi, Mel.
>> >>
>> >> On Tue, Mar 23, 2010 at 9:25 PM, Mel Gorman <mel@csn.ul.ie> wrote:
>> >> > Ordinarily when a high-order allocation fails, direct reclaim is entered to
>> >> > free pages to satisfy the allocation.  With this patch, it is determined if
>> >> > an allocation failed due to external fragmentation instead of low memory
>> >> > and if so, the calling process will compact until a suitable page is
>> >> > freed. Compaction by moving pages in memory is considerably cheaper than
>> >> > paging out to disk and works where there are locked pages or no swap. If
>> >> > compaction fails to free a page of a suitable size, then reclaim will
>> >> > still occur.
>> >> >
>> >> > Direct compaction returns as soon as possible. As each block is compacted,
>> >> > it is checked if a suitable page has been freed and if so, it returns.
>> >> >
>> >> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> >> > Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

<snip>

>> You're right. I just wanted to change the name to one which implies
>> direct compaction.
>
> I think I'd fully agree with your point if there was more than one way to
> stall a process due to compaction. As it is, direct compaction is the only
> way to meaningfully stall a process and I can't think of alternative stalls
> in the future. Technically, a process using the sysfs or proc triggers for
> compaction also stalls but it's not interesting to count those events.
>
>> That's because I believe we will implement it in the background, too.
>
> This is a possibility but in that case it would be a separate process
> like kcompactd and I wouldn't count it as a stall as such.
>
>> Then it's more straightforward, I think. :-)
>>
>> > How about COMPACTSTALL like ALLOCSTALL? :/
>>
>> I wouldn't have a strong objection any more if you insist on it.
>>
>
> I'm not insisting as such, I just don't think renaming it to
> PGSCAN_COMPACT_X would be easier to understand.

I totally agree with your opinion.
From now on, I don't have any objection.




-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 11/11] Do not compact within a preferred zone after a compaction failure
  2010-03-24 10:37         ` Mel Gorman
@ 2010-03-24 19:54           ` Christoph Lameter
  0 siblings, 0 replies; 78+ messages in thread
From: Christoph Lameter @ 2010-03-24 19:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, 24 Mar 2010, Mel Gorman wrote:

> > > What I was thinking at the time was that compact_resume was stored in struct
> > > zone - i.e. that is where it is recorded.
> >
> > Ok adding a dozen or more words here may be useful.
> >
>
> In the event of compaction followed by an allocation failure, this patch
> defers further compaction in the zone for a period of time. The zone that
> is deferred is the first zone in the zonelist - i.e. the preferred zone.
> To defer compaction in the other zones, the information would need to
> be stored in the zonelist or implemented similar to the zonelist_cache.
> This would impact the fast-paths and is not justified at this time.
>
> ?

Ok.

> > There are frequent uses of HZ/10 as well, especially in vmscan.c. A longer
> > time may be better? HZ/50 looks like an interval for writeout. But this
> > is related to reclaim?
> >
>
> HZ/10 is somewhat of an arbitrary choice as well and there isn't data on
> which is better and which is worse. If the zone is full of dirty data, then
> HZ/10 makes sense for IO. If it happened to be mainly clean cache but under
> heavy memory pressure, then reclaim would be a relatively fast event and a
> shorter wait of HZ/50 makes sense.
>
> Thing is, if we start with a short timer and it's too short, COMPACTFAIL
> will be growing steadily. If we choose a long time and it's too long, there
> is no counter to indicate it was a bad choice. Hence, I'd prefer the short
> timer to start with and ideally resume compaction after some event in the
> future rather than depending on time.
>
> Does that make sense?

Yes.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-23 12:25 ` [PATCH 07/11] Memory compaction core Mel Gorman
  2010-03-23 17:56   ` Christoph Lameter
  2010-03-24  1:03   ` KAMEZAWA Hiroyuki
@ 2010-03-24 20:33   ` Andrew Morton
  2010-03-24 20:59     ` Jonathan Corbet
  2010-03-25  9:13     ` Mel Gorman
  2 siblings, 2 replies; 78+ messages in thread
From: Andrew Morton @ 2010-03-24 20:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010 12:25:42 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> This patch is the core of a mechanism which compacts memory in a zone by
> relocating movable pages towards the end of the zone.
> 
> A single compaction run involves a migration scanner and a free scanner.
> Both scanners operate on pageblock-sized areas in the zone. The migration
> scanner starts at the bottom of the zone and searches for all movable pages
> within each area, isolating them onto a private list called migratelist.
> The free scanner starts at the top of the zone and searches for suitable
> areas and consumes the free pages within making them available for the
> migration scanner. The pages isolated for migration are then migrated to
> the newly isolated free pages.
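For reference, the flow described above amounts to roughly the following
sketch (helper names follow the rest of the series where they appear in
this thread; this is an illustration, not the submitted code):

	static int compact_zone_sketch(struct zone *zone, struct compact_control *cc)
	{
		int ret;

		/* migrate scanner starts at the bottom, free scanner at the top */
		cc->migrate_pfn = zone->zone_start_pfn;
		cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;

		while ((ret = compact_finished(zone, cc)) == COMPACT_INCOMPLETE) {
			/*
			 * Isolate movable pages from the lowest unscanned
			 * pageblock; this advances cc->migrate_pfn.
			 */
			if (!isolate_migratepages(zone, cc))
				continue;

			/*
			 * Migrate them into free pages handed out by
			 * compaction_alloc(), which pulls from pageblocks
			 * near the top of the zone via isolate_freepages().
			 */
			migrate_pages(&cc->migratepages, compaction_alloc,
							(unsigned long)cc, 0);
		}

		return ret;
	}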

General comment: it looks like there are some codepaths which could
hold zone->lock for a long time.  It's unclear that they're all
constrained by COMPACT_CLUSTER_MAX. Is there a latency issue here?

>
> ...
>
> +static struct page *compaction_alloc(struct page *migratepage,
> +					unsigned long data,
> +					int **result)
> +{
> +	struct compact_control *cc = (struct compact_control *)data;
> +	struct page *freepage;
> +
> +	VM_BUG_ON(cc == NULL);

It's a bit strange to test this when we're about to oops anyway.  The
oops will tell us the same thing.

> +	/* Isolate free pages if necessary */
> +	if (list_empty(&cc->freepages)) {
> +		isolate_freepages(cc->zone, cc);
> +
> +		if (list_empty(&cc->freepages))
> +			return NULL;
> +	}
> +
> +	freepage = list_entry(cc->freepages.next, struct page, lru);
> +	list_del(&freepage->lru);
> +	cc->nr_freepages--;
> +
> +	return freepage;
> +}


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/11] Add /proc trigger for memory compaction
  2010-03-23 12:25 ` [PATCH 08/11] Add /proc trigger for memory compaction Mel Gorman
  2010-03-23 18:25   ` Christoph Lameter
@ 2010-03-24 20:33   ` Andrew Morton
  2010-03-26 10:46     ` Mel Gorman
  1 sibling, 1 reply; 78+ messages in thread
From: Andrew Morton @ 2010-03-24 20:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010 12:25:43 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
> value is written to the file, all zones are compacted. The expected user
> of such a trigger is a job scheduler that prepares the system before the
> target application runs.
> 
>
> ...
>
> +/* This is the entry point for compacting all nodes via /proc/sys/vm */
> +int sysctl_compaction_handler(struct ctl_table *table, int write,
> +			void __user *buffer, size_t *length, loff_t *ppos)
> +{
> +	if (write)
> +		return compact_nodes();
> +
> +	return 0;
> +}

Neato.  When I saw the overall description I was afraid that this stuff
would be fiddling with kernel threads.

The underlying compaction code can at times cause rather large amounts
of memory to be put onto private lists, so it's lost to the rest of the
kernel.  What happens if 10000 processes simultaneously write to this
thing?  It's root-only so I guess the answer is "root becomes unemployed".


I fear that the overall effect of this feature is that people will come
up with ghastly hacks which keep on poking this tunable as a workaround
for some VM shortcoming.  This will lead to more shortcomings, and
longer-lived ones.



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-23 12:25 ` [PATCH 10/11] Direct compact when a high-order allocation fails Mel Gorman
  2010-03-23 23:10   ` Minchan Kim
  2010-03-24  1:19   ` KAMEZAWA Hiroyuki
@ 2010-03-24 20:48   ` Andrew Morton
  2010-03-25  0:57     ` KAMEZAWA Hiroyuki
  2010-03-25 10:21     ` Mel Gorman
  2 siblings, 2 replies; 78+ messages in thread
From: Andrew Morton @ 2010-03-24 20:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010 12:25:45 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> Ordinarily when a high-order allocation fails, direct reclaim is entered to
> free pages to satisfy the allocation.  With this patch, it is determined if
> an allocation failed due to external fragmentation instead of low memory
> and if so, the calling process will compact until a suitable page is
> freed. Compaction by moving pages in memory is considerably cheaper than
> paging out to disk and works where there are locked pages or no swap. If
> compaction fails to free a page of a suitable size, then reclaim will
> still occur.
> 
> Direct compaction returns as soon as possible. As each block is compacted,
> it is checked if a suitable page has been freed and if so, it returns.
> 
>
> ...
>
> +static inline unsigned long compact_zone_order(struct zone *zone,
> +						int order, gfp_t gfp_mask)

Suggest that you re-review all the manual inlining in the patchset. 
It's rarely needed and often incorrect.

> +{
> +	struct compact_control cc = {
> +		.nr_freepages = 0,
> +		.nr_migratepages = 0,
> +		.order = order,
> +		.migratetype = allocflags_to_migratetype(gfp_mask),
> +		.zone = zone,
> +	};
> +	INIT_LIST_HEAD(&cc.freepages);
> +	INIT_LIST_HEAD(&cc.migratepages);
> +
> +	return compact_zone(zone, &cc);
> +}
> +
> +/**
> + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> + * @zonelist: The zonelist used for the current allocation
> + * @order: The order of the current allocation
> + * @gfp_mask: The GFP mask of the current allocation
> + * @nodemask: The allowed nodes to allocate from
> + *
> + * This is the main entry point for direct page compaction.
> + */
> +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> +			int order, gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> +	int may_enter_fs = gfp_mask & __GFP_FS;
> +	int may_perform_io = gfp_mask & __GFP_IO;
> +	unsigned long watermark;
> +	struct zoneref *z;
> +	struct zone *zone;
> +	int rc = COMPACT_INCOMPLETE;
> +
> +	/* Check whether it is worth even starting compaction */
> +	if (order == 0 || !may_enter_fs || !may_perform_io)
> +		return rc;

hm, that was sad.  All those darn wireless drivers doing their
high-order GFP_ATOMIC allocations cannot be saved?

> +	/*
> +	 * We will not stall if the necessary conditions are not met for
> +	 * migration but direct reclaim seems to account stalls similarly
> +	 */
> +	count_vm_event(COMPACTSTALL);
> +
> +	/* Compact each zone in the list */
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> +								nodemask) {

Will all of this code play nicely with memory hotplug?

> +		int fragindex;
> +		int status;
> +
> +		/*
> +		 * Watermarks for order-0 must be met for compaction. Note
> +		 * the 2UL. This is because during migration, copies of
> +		 * pages need to be allocated and for a short time, the
> +		 * footprint is higher
> +		 */
> +		watermark = low_wmark_pages(zone) + (2UL << order);
> +		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> +			continue;
> +
> +		/*
> +		 * fragmentation index determines if allocation failures are
> +		 * due to low memory or external fragmentation
> +		 *
> +		 * index of -1 implies allocations might succeed depending
> +		 * 	on watermarks
> +		 * index < 500 implies alloc failure is due to lack of memory
> +		 *
> +		 * XXX: The choice of 500 is arbitrary. Reinvestigate
> +		 *      appropriately to determine a sensible default.
> +		 *      and what it means when watermarks are also taken
> +		 *      into account. Consider making it a sysctl
> +		 */

Yes, best to make it a sysctl IMO.   It'll make optimisation far easier.
/proc/sys/vm/fragmentation_index_dont_you_dare_use_this_it_will_disappear_soon

> +		fragindex = fragmentation_index(zone, order);
> +		if (fragindex >= 0 && fragindex <= 500)
> +			continue;
> +
> +		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
> +			rc = COMPACT_PARTIAL;
> +			break;
> +		}
> +
> +		status = compact_zone_order(zone, order, gfp_mask);
> +		rc = max(status, rc);
> +
> +		if (zone_watermark_ok(zone, order, watermark, 0, 0))
> +			break;
> +	}
> +
> +	return rc;
> +}
>
> ...
>
> @@ -1765,6 +1766,31 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  
>  	cond_resched();
>  
> +	/* Try memory compaction for high-order allocations before reclaim */
> +	if (order) {
> +		*did_some_progress = try_to_compact_pages(zonelist,
> +						order, gfp_mask, nodemask);
> +		if (*did_some_progress != COMPACT_INCOMPLETE) {
> +			page = get_page_from_freelist(gfp_mask, nodemask,
> +					order, zonelist, high_zoneidx,
> +					alloc_flags, preferred_zone,
> +					migratetype);
> +			if (page) {
> +				__count_vm_event(COMPACTSUCCESS);
> +				return page;
> +			}
> +
> +			/*
> +			 * It's bad if compaction run occurs and fails.
> +			 * The most likely reason is that pages exist,
> +			 * but not enough to satisfy watermarks.
> +			 */
> +			count_vm_event(COMPACTFAIL);

This counter will get incremented if !__GFP_FS or !__GFP_IO.  Seems
wrong.

> +			cond_resched();
> +		}
> +	}
> +


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 11/11] Do not compact within a preferred zone after a compaction failure
  2010-03-23 12:25 ` [PATCH 11/11] Do not compact within a preferred zone after a compaction failure Mel Gorman
  2010-03-23 18:31   ` Christoph Lameter
@ 2010-03-24 20:53   ` Andrew Morton
  2010-03-25  9:40     ` Mel Gorman
  1 sibling, 1 reply; 78+ messages in thread
From: Andrew Morton @ 2010-03-24 20:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010 12:25:46 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> The fragmentation index may indicate that a failure is due to external
> fragmentation, yet a compaction run can complete and the allocation still
> fail. There are two obvious reasons as to why
> 
>   o Page migration cannot move all pages so fragmentation remains
>   o A suitable page may exist but watermarks are not met
> 
> In the event of compaction and allocation failure, this patch prevents
> compaction happening for a short interval. It's only recorded on the
> preferred zone but that should be enough coverage. This could have been
> implemented similar to the zonelist_cache but the increased size of the
> zonelist did not appear to be justified.
> 
>
> ...
>
> +/* defer_compaction - Do not compact within a zone until a given time */
> +static inline void defer_compaction(struct zone *zone, unsigned long resume)
> +{
> +	/*
> +	 * This function is called when compaction fails to result in a page
> +	 * allocation success. This is somewhat unsatisfactory as the failure
> +	 * to compact has nothing to do with time and everything to do with
> +	 * the requested order, the number of free pages and watermarks. How
> +	 * to wait on that is more unclear, but the answer would apply to
> +	 * other areas where the VM waits based on time.

um.  "Two wrongs don't make a right".  We should fix the other sites,
not use them as excuses ;)

What _is_ a good measure of "time" in this code?  "number of pages
scanned" is a pretty good one in reclaim.  We want something which will
adapt itself to amount-of-memory, number-of-cpus, speed-of-cpus,
nature-of-workload, etc, etc.

Is it possible to come up with some simple metric which approximately
reflects how busy this code is, then pace ourselves via that?

> +	 */
> +	zone->compact_resume = resume;
> +}
> +
> +static inline int compaction_deferred(struct zone *zone)
> +{
> +	/* init once if necessary */
> +	if (unlikely(!zone->compact_resume)) {
> +		zone->compact_resume = jiffies;
> +		return 0;
> +	}
> +
> +	return time_before(jiffies, zone->compact_resume);
> +}


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-24 20:33   ` Andrew Morton
@ 2010-03-24 20:59     ` Jonathan Corbet
  2010-03-24 21:14       ` Andrew Morton
  2010-03-24 21:19       ` Andrea Arcangeli
  2010-03-25  9:13     ` Mel Gorman
  1 sibling, 2 replies; 78+ messages in thread
From: Jonathan Corbet @ 2010-03-24 20:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Wed, 24 Mar 2010 13:33:47 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> > +	VM_BUG_ON(cc == NULL);  
> 
> It's a bit strange to test this when we're about to oops anyway.  The
> oops will tell us the same thing.

...except that we've seen a fair number of null pointer dereference
exploits that have told us something altogether different.  Are we
*sure* we don't want to test for null pointers...?

jon

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-24 20:59     ` Jonathan Corbet
@ 2010-03-24 21:14       ` Andrew Morton
  2010-03-24 21:19         ` Christoph Lameter
  2010-03-24 21:19       ` Andrea Arcangeli
  1 sibling, 1 reply; 78+ messages in thread
From: Andrew Morton @ 2010-03-24 21:14 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Mel Gorman, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Wed, 24 Mar 2010 14:59:46 -0600
Jonathan Corbet <corbet@lwn.net> wrote:

> On Wed, 24 Mar 2010 13:33:47 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > > +	VM_BUG_ON(cc == NULL);  
> > 
> > It's a bit strange to test this when we're about to oops anyway.  The
> > oops will tell us the same thing.
> 
> ...except that we've seen a fair number of null pointer dereference
> exploits that have told us something altogether different.  Are we
> *sure* we don't want to test for null pointers...?
> 

It's hard to see what the test gains us really - the kernel has
zillions of pointer derefs, any of which could be NULL if we have a
bug.  Are we more likely to have a bug here than elsewhere?

This one will oops on a plain old read, so it's a bit moot in this
case.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-24 20:59     ` Jonathan Corbet
  2010-03-24 21:14       ` Andrew Morton
@ 2010-03-24 21:19       ` Andrea Arcangeli
  2010-03-24 21:28         ` Jonathan Corbet
  1 sibling, 1 reply; 78+ messages in thread
From: Andrea Arcangeli @ 2010-03-24 21:19 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Andrew Morton, Mel Gorman, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

Hi Jonathan,

On Wed, Mar 24, 2010 at 02:59:46PM -0600, Jonathan Corbet wrote:
> On Wed, 24 Mar 2010 13:33:47 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > > +	VM_BUG_ON(cc == NULL);  
> > 
> > It's a bit strange to test this when we're about to oops anyway.  The
> > oops will tell us the same thing.
> 
> ...except that we've seen a fair number of null pointer dereference
> exploits that have told us something altogether different.  Are we
> *sure* we don't want to test for null pointers...?

Examples? Maybe WARN_ON != oops, but VM_BUG_ON is still an oops, and
without a serial console it would get lost too. I personally don't
see how it's needed. Plus those things are mostly for debug, to check
for invariant conditions; how long it takes to sort it out isn't very
relevant. So I'm in Andrew's camp ;).

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-24 21:14       ` Andrew Morton
@ 2010-03-24 21:19         ` Christoph Lameter
  0 siblings, 0 replies; 78+ messages in thread
From: Christoph Lameter @ 2010-03-24 21:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Mel Gorman, Andrea Arcangeli, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Wed, 24 Mar 2010, Andrew Morton wrote:

> > ...except that we've seen a fair number of null pointer dereference
> > exploits that have told us something altogether different.  Are we
> > *sure* we don't want to test for null pointers...?
> >
>
> It's hard to see what the test gains us really - the kernel has
> zillions of pointer derefs, any of which could be NULL if we have a
> bug.  Are we more likely to have a bug here than elsewhere?
>
> This one will oops on a plain old read, so it's a bit moot in this
> case.

If the object pointed to is larger than page size and we are
referencing a member with an offset larger than page size later then we
may create an exploit without checks.

But the structure here is certainly smaller than that. So no issue here.
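For illustration, a hypothetical layout (not code from this series)
showing why the offset matters: with a NULL pointer, the fault lands at
the field's offset rather than at address zero, and a low offset is
covered by mmap_min_addr while a larger one may not be.

	/* hypothetical struct larger than a page */
	struct big_ctl {
		char buffer[8192];
		int flags;	/* offset 8192 from the start of the struct */
	};

	static int big_ctl_flags(struct big_ctl *ctl)
	{
		/*
		 * If ctl is NULL, this reads virtual address 8192, not 0.
		 * If that address can be mapped by userspace, the bug is no
		 * longer a clean oops; an explicit NULL check would catch
		 * it before the dereference.
		 */
		return ctl->flags;
	}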




^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-24 21:19       ` Andrea Arcangeli
@ 2010-03-24 21:28         ` Jonathan Corbet
  2010-03-24 21:47           ` Andrea Arcangeli
  0 siblings, 1 reply; 78+ messages in thread
From: Jonathan Corbet @ 2010-03-24 21:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Mel Gorman, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Wed, 24 Mar 2010 22:19:24 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> > > It's a bit strange to test this when we're about to oops anyway.  The
> > > oops will tell us the same thing.  
> > 
> > ...except that we've seen a fair number of null pointer dereference
> > exploits that have told us something altogether different.  Are we
> > *sure* we don't want to test for null pointers...?  
> 
> > Examples? Maybe WARN_ON != oops, but VM_BUG_ON is still an oops, and
> > without a serial console it would get lost too. I personally don't
> > see how it's needed.

I don't quite understand the question; are you asking for examples of
exploits?

	http://lwn.net/Articles/347006/
	http://lwn.net/Articles/360328/
	http://lwn.net/Articles/342330/
	...

As to whether this particular test makes sense, I don't know.  But the
idea that we never need to test about-to-be-dereferenced pointers for
NULL does worry me a bit.

jon

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-24 21:28         ` Jonathan Corbet
@ 2010-03-24 21:47           ` Andrea Arcangeli
  2010-03-24 21:54             ` Jonathan Corbet
  2010-03-24 21:57             ` Andrea Arcangeli
  0 siblings, 2 replies; 78+ messages in thread
From: Andrea Arcangeli @ 2010-03-24 21:47 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Andrew Morton, Mel Gorman, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 03:28:54PM -0600, Jonathan Corbet wrote:
> On Wed, 24 Mar 2010 22:19:24 +0100
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > > > It's a bit strange to test this when we're about to oops anyway.  The
> > > > oops will tell us the same thing.  
> > > 
> > > ...except that we've seen a fair number of null pointer dereference
> > > exploits that have told us something altogether different.  Are we
> > > *sure* we don't want to test for null pointers...?  
> > 
> > Examples? Maybe WARN_ON != oops, but VM_BUG_ON is still an oops, and
> > without a serial console it would get lost too. I personally don't
> > see how it's needed.
> 
> I don't quite understand the question; are you asking for examples of
> exploits?
> 
> 	http://lwn.net/Articles/347006/
> 	http://lwn.net/Articles/360328/
> 	http://lwn.net/Articles/342330/
> 	...

As far as I can tell, VM_BUG_ON would make _zero_ difference there.

I think you've mistaken a VM_BUG_ON for a:

  if (could_be_null->something) {
     WARN_ON(1);
     return -ESOMETHING;
  }

adding a VM_BUG_ON(inode->something) would _still_ be as exploitable
as the null pointer dereference, because it's a DoS. It's not really a
big deal of an exploit but it _sure_ needs fixing.

The whole point is that VM_BUG_ON(!something) before something->else
won't move the needle as far as your null pointer dereference exploits
are concerned.

> As to whether this particular test makes sense, I don't know.  But the
> idea that we never need to test about-to-be-dereferenced pointers for
> NULL does worry me a bit.

Being worried is a good idea, as we don't want DoS bugs ;). It's just
that VM_BUG_ON isn't a solution to the problem (and, more importantly,
it doesn't improve the bug's detectability either); fixing the actual
bug is the solution.
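For reference, VM_BUG_ON is also compiled out entirely unless
CONFIG_DEBUG_VM is set, so it cannot serve as a production-time guard
anyway (roughly, from include/linux/mmdebug.h):

	#ifdef CONFIG_DEBUG_VM
	#define VM_BUG_ON(cond) BUG_ON(cond)
	#else
	#define VM_BUG_ON(cond) do { } while (0)
	#endif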

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-24 21:47           ` Andrea Arcangeli
@ 2010-03-24 21:54             ` Jonathan Corbet
  2010-03-24 22:06               ` Andrea Arcangeli
  2010-03-24 21:57             ` Andrea Arcangeli
  1 sibling, 1 reply; 78+ messages in thread
From: Jonathan Corbet @ 2010-03-24 21:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Mel Gorman, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Wed, 24 Mar 2010 22:47:42 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> I think you've mistaken a VM_BUG_ON for a:
> 
>   if (could_be_null->something) {
>      WARN_ON(1);
>      return -ESOMETHING;
>   }
> 
> adding a VM_BUG_ON(inode->something) would _still_ be as exploitable
> as the null pointer dereference, because it's a DoS. It's not really a
> big deal of an exploit but it _sure_ needs fixing.

Ah, but that's the point: these NULL pointer dereferences were not DoS
vulnerabilities - they were full privilege-escalation affairs.  Since
then, some problems have been fixed and some distributors have started
shipping smarter configurations.  But, on quite a few systems a NULL
dereference still has the potential to be fully exploitable; if there's
a possibility of it happening I think we should test for it.  A DoS is
a much better outcome...

jon

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-24 21:47           ` Andrea Arcangeli
  2010-03-24 21:54             ` Jonathan Corbet
@ 2010-03-24 21:57             ` Andrea Arcangeli
  1 sibling, 0 replies; 78+ messages in thread
From: Andrea Arcangeli @ 2010-03-24 21:57 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Andrew Morton, Mel Gorman, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 10:47:42PM +0100, Andrea Arcangeli wrote:
> As far as I can tell, VM_BUG_ON would make _zero_ difference there.
> 
> I think you've mistaken a VM_BUG_ON for a:
> 
>   if (could_be_null->something) {

Oops, I wrote ->something to indicate that "could_be_null" was going
to be dereferenced for ->something later, and here we're checking whether
it could be NULL before we dereference it. But now I think that could be
very confusing, as I use strict C for all the rest, so I should clarify
that in C it would be !could_be_null.

>      WARN_ON(1);
>      return -ESOMETHING;
>   }
> 
> adding a VM_BUG_ON(inode->something) would _still_ be as exploitable

Same here: it should be !inode.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-24 21:54             ` Jonathan Corbet
@ 2010-03-24 22:06               ` Andrea Arcangeli
  0 siblings, 0 replies; 78+ messages in thread
From: Andrea Arcangeli @ 2010-03-24 22:06 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Andrew Morton, Mel Gorman, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 03:54:23PM -0600, Jonathan Corbet wrote:
> Ah, but that's the point: these NULL pointer dereferences were not DoS
> vulnerabilities - they were full privilege-escalation affairs.  Since
> then, some problems have been fixed and some distributors have started
> shipping smarter configurations.  But, on quite a few systems a NULL
> dereference still has the potential to be fully exploitable; if there's
> a possibility of it happening I think we should test for it.  A DoS is
> a much better outcome...

You're pointing the finger at the lack of a VM_BUG_ON, but the finger
should be pointed at the code that should enforce mmap_min_addr. That is
the exploitable bug. I can't imagine any other way VM_BUG_ON could help
in preventing an exploit. Let's concentrate on mmap_min_addr and leave
the code fast.

If it's a small structure (<4096 bytes) we're talking about, I maintain
that VM_BUG_ON() is just pure CPU overhead.

I do agree, however, that for structures that may grow larger than 4096
bytes VM_BUG_ON isn't a bad idea, and furthermore I think it's wrong to
keep the min address at only 4096 bytes; it should be more like 100M or
something. Then all of them could go away. That is way more effective
than having to remember to add VM_BUG_ON(!null) when the CPU can do it
at zero cost.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-24 11:40     ` Mel Gorman
@ 2010-03-25  0:30       ` KAMEZAWA Hiroyuki
  2010-03-25  9:48         ` Mel Gorman
  0 siblings, 1 reply; 78+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-25  0:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, 24 Mar 2010 11:40:57 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> On Wed, Mar 24, 2010 at 10:19:27AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Tue, 23 Mar 2010 12:25:45 +0000
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > Ordinarily when a high-order allocation fails, direct reclaim is entered to
> > > free pages to satisfy the allocation.  With this patch, it is determined if
> > > an allocation failed due to external fragmentation instead of low memory
> > > and if so, the calling process will compact until a suitable page is
> > > freed. Compaction by moving pages in memory is considerably cheaper than
> > > paging out to disk and works where there are locked pages or no swap. If
> > > compaction fails to free a page of a suitable size, then reclaim will
> > > still occur.
> > > 
> > > Direct compaction returns as soon as possible. As each block is compacted,
> > > it is checked if a suitable page has been freed and if so, it returns.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > Acked-by: Rik van Riel <riel@redhat.com>
> > > ---
> > >  include/linux/compaction.h |   16 +++++-
> > >  include/linux/vmstat.h     |    1 +
> > >  mm/compaction.c            |  118 ++++++++++++++++++++++++++++++++++++++++++++
> > >  mm/page_alloc.c            |   26 ++++++++++
> > >  mm/vmstat.c                |   15 +++++-
> > >  5 files changed, 172 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > > index c94890b..b851428 100644
> > > --- a/include/linux/compaction.h
> > > +++ b/include/linux/compaction.h
> > > @@ -1,14 +1,26 @@
> > >  #ifndef _LINUX_COMPACTION_H
> > >  #define _LINUX_COMPACTION_H
> > >  
> > > -/* Return values for compact_zone() */
> > > +/* Return values for compact_zone() and try_to_compact_pages() */
> > >  #define COMPACT_INCOMPLETE	0
> > > -#define COMPACT_COMPLETE	1
> > > +#define COMPACT_PARTIAL		1
> > > +#define COMPACT_COMPLETE	2
> > >  
> > >  #ifdef CONFIG_COMPACTION
> > >  extern int sysctl_compact_memory;
> > >  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> > >  			void __user *buffer, size_t *length, loff_t *ppos);
> > > +
> > > +extern int fragmentation_index(struct zone *zone, unsigned int order);
> > > +extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > > +			int order, gfp_t gfp_mask, nodemask_t *mask);
> > > +#else
> > > +static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > > +			int order, gfp_t gfp_mask, nodemask_t *nodemask)
> > > +{
> > > +	return COMPACT_INCOMPLETE;
> > > +}
> > > +
> > >  #endif /* CONFIG_COMPACTION */
> > >  
> > >  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> > > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > > index 56e4b44..b4b4d34 100644
> > > --- a/include/linux/vmstat.h
> > > +++ b/include/linux/vmstat.h
> > > @@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > >  		KSWAPD_SKIP_CONGESTION_WAIT,
> > >  		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> > >  		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> > > +		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
> > >  #ifdef CONFIG_HUGETLB_PAGE
> > >  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> > >  #endif
> > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > index 8df6e3d..6688700 100644
> > > --- a/mm/compaction.c
> > > +++ b/mm/compaction.c
> > > @@ -34,6 +34,8 @@ struct compact_control {
> > >  	unsigned long nr_anon;
> > >  	unsigned long nr_file;
> > >  
> > > +	unsigned int order;		/* order a direct compactor needs */
> > > +	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
> > >  	struct zone *zone;
> > >  };
> > >  
> > > @@ -301,10 +303,31 @@ static void update_nr_listpages(struct compact_control *cc)
> > >  static inline int compact_finished(struct zone *zone,
> > >  						struct compact_control *cc)
> > >  {
> > > +	unsigned int order;
> > > +	unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
> > > +
> > >  	/* Compaction run completes if the migrate and free scanner meet */
> > >  	if (cc->free_pfn <= cc->migrate_pfn)
> > >  		return COMPACT_COMPLETE;
> > >  
> > > +	/* Compaction run is not finished if the watermark is not met */
> > > +	if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
> > > +		return COMPACT_INCOMPLETE;
> > > +
> > > +	if (cc->order == -1)
> > > +		return COMPACT_INCOMPLETE;
> > > +
> > > +	/* Direct compactor: Is a suitable page free? */
> > > +	for (order = cc->order; order < MAX_ORDER; order++) {
> > > +		/* Job done if page is free of the right migratetype */
> > > +		if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
> > > +			return COMPACT_PARTIAL;
> > > +
> > > +		/* Job done if allocation would set block type */
> > > +		if (order >= pageblock_order && zone->free_area[order].nr_free)
> > > +			return COMPACT_PARTIAL;
> > > +	}
> > > +
> > >  	return COMPACT_INCOMPLETE;
> > >  }
> > >  
> > > @@ -348,6 +371,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> > >  	return ret;
> > >  }
> > >  
> > > +static inline unsigned long compact_zone_order(struct zone *zone,
> > > +						int order, gfp_t gfp_mask)
> > > +{
> > > +	struct compact_control cc = {
> > > +		.nr_freepages = 0,
> > > +		.nr_migratepages = 0,
> > > +		.order = order,
> > > +		.migratetype = allocflags_to_migratetype(gfp_mask),
> > > +		.zone = zone,
> > > +	};
> > > +	INIT_LIST_HEAD(&cc.freepages);
> > > +	INIT_LIST_HEAD(&cc.migratepages);
> > > +
> > > +	return compact_zone(zone, &cc);
> > > +}
> > > +
> > > +/**
> > > + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> > > + * @zonelist: The zonelist used for the current allocation
> > > + * @order: The order of the current allocation
> > > + * @gfp_mask: The GFP mask of the current allocation
> > > + * @nodemask: The allowed nodes to allocate from
> > > + *
> > > + * This is the main entry point for direct page compaction.
> > > + */
> > > +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > > +			int order, gfp_t gfp_mask, nodemask_t *nodemask)
> > > +{
> > > +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > > +	int may_enter_fs = gfp_mask & __GFP_FS;
> > > +	int may_perform_io = gfp_mask & __GFP_IO;
> > > +	unsigned long watermark;
> > > +	struct zoneref *z;
> > > +	struct zone *zone;
> > > +	int rc = COMPACT_INCOMPLETE;
> > > +
> > > +	/* Check whether it is worth even starting compaction */
> > > +	if (order == 0 || !may_enter_fs || !may_perform_io)
> > > +		return rc;
> > > +
> > > +	/*
> > > +	 * We will not stall if the necessary conditions are not met for
> > > +	 * migration but direct reclaim seems to account stalls similarly
> > > +	 */
> > > +	count_vm_event(COMPACTSTALL);
> > > +
> > > +	/* Compact each zone in the list */
> > > +	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> > > +								nodemask) {
> > > +		int fragindex;
> > > +		int status;
> > > +
> > > +		/*
> > > +		 * Watermarks for order-0 must be met for compaction. Note
> > > +		 * the 2UL. This is because during migration, copies of
> > > +		 * pages need to be allocated and for a short time, the
> > > +		 * footprint is higher
> > > +		 */
> > > +		watermark = low_wmark_pages(zone) + (2UL << order);
> > > +		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> > > +			continue;
> > > +
> > > +		/*
> > > +		 * fragmentation index determines if allocation failures are
> > > +		 * due to low memory or external fragmentation
> > > +		 *
> > > +		 * index of -1 implies allocations might succeed depending
> > > +		 * 	on watermarks
> > > +		 * index < 500 implies alloc failure is due to lack of memory
> > > +		 *
> > > +		 * XXX: The choice of 500 is arbitrary. Reinvestigate
> > > +		 *      appropriately to determine a sensible default.
> > > +		 *      and what it means when watermarks are also taken
> > > +		 *      into account. Consider making it a sysctl
> > > +		 */
> > > +		fragindex = fragmentation_index(zone, order);
> > > +		if (fragindex >= 0 && fragindex <= 500)
> > > +			continue;
> > > +
> > > +		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
> > > +			rc = COMPACT_PARTIAL;
> > > +			break;
> > > +		}
> > > +
> > > +		status = compact_zone_order(zone, order, gfp_mask);
> > > +		rc = max(status, rc);
> > 
> > Hm...then, scanning over the whole zone until success of migration at
> > each failure?
> 
> Sorry for my lack of understanding but your question is difficult to
> understand.
> 
Thank you for the clarification.
My wording was bad. (And so was my understanding.)


> You might mean "scanning over the whole zonelist" rather than zone. In that
> case, the zone_watermark_ok before and after the compaction will exit the
> loop rather than moving to the next zone in the list.
> 
Yes.

> I'm not sure what you mean by "at each failure". The worst-case scenario
> is that a process compacts the entire zone and still fails to meet the
> watermarks. The best-case scenario is that it does a small amount of
> compaction in the compact_zone() loop and finds that compact_finished()
> causes the loop to exit before the whole zone is compacted.
> 
"at each failure" means "At each fauilure of smooth allocation of contiguous
pages of requested order", I'm sorry for short words.

And yes, compact_finished() causes the loop to exit in good cases.
(And I missed a change in compact_finished() in this patch..)

> > Is it meaningful that multiple tasks run direct-compaction against
> > a zone (from zone->start_pfn to zone->end_pfn) in parallel ?
> > ex) running order=3 compaction while other thread runs order=5 compaction.
> > 
> 
> It is meaningful in that "it will work" but there is a good chance that it's
> pointless. To what degree it's pointless depends on what happened between
> Compaction Process A starting and Compaction Process B. If kswapd is also
> awake, then it might be worthwhile. By and large, the scanning is fast enough
> that it won't be very noticeable.
> 
Hmm.

> My feeling is that multiple processes entering compaction at all is a bad
> situation to be in. It implies there are multiple processes are requiring
> high-order pages. Maybe if transparent huge pages were merged, it'd be
> expected but otherwise it'd be a surprise.
> 
At first glance, my concern was that you use
	if (order)
rather than
	if (order >= PAGE_ALLOC_COSTLY_ORDER)

With order=1, dropping _old_ file cache isn't very bad. Because
migration modifies LRU order, frequent compaction may add too much
noise to the LRU.

But yes, I have no data.

> > Can't we find a clever way to find [start_pfn, end_pfn) for scanning rather than
> > [zone->start_pfn, zone->start_pfn + zone->spanned_pages) ?
> > 
> 
> For sure. An early iteration of these patches stored the PFNs last scanned
> for migration in struct zone and would use that as a starting point. It'd
> wrap around at least once when it encountered the free page scanner so
> that the zone would be scanned at least once. A more convoluted
> iteration stored a list of compactors in a linked list. When selecting a
> pageblock to migrate pages from, it'd check the list and avoid scanning
> the same block as any other process.
> 
> I dropped these modifications for a few reasons
> 
> a) It added complexity for a situation that may not be encountered in
>    practice.
> b) Arguably, it would also make sense to simply allow only one compactor
>    within a zone at a time and use a mutex
> c) I had no data on why multiple processes would be direct compacting
> 
> The last point was the most important. I wanted to avoid complexity unless
> there was a good reason for it. If we do encounter a situation where
> multiple compactors are causing problems, I'd be more likely to ask "why
> are there so many high-order allocations happening simultaneously?" than
> "how can we make compaction smarter?"
> 

I agree that multiple requesters of "high order" pages are unusual. Here, "high order"
means order > PAGE_ALLOC_COSTLY_ORDER.
But multiple requesters of order 1, 2 or 3 are the usual case, I think.


BTW, one more question.

Because freed pages are pushed back to the buddy system by __free_page(page),
they may still sit on the zone's pcp list. In that case, compact_finished() can't
see that a free chunk exists and does more work than necessary.
How about using a function like
	 free_pcppages_bulk(zone, pcp->batch, pcp);
to bypass the pcp list and free the pages at once?

Thanks,
-Kame




* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-24 20:48   ` Andrew Morton
@ 2010-03-25  0:57     ` KAMEZAWA Hiroyuki
  2010-03-25 10:21     ` Mel Gorman
  1 sibling, 0 replies; 78+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-25  0:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, 24 Mar 2010 13:48:16 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 23 Mar 2010 12:25:45 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:

> > +	/*
> > +	 * We will not stall if the necessary conditions are not met for
> > +	 * migration but direct reclaim seems to account stalls similarly
> > +	 */
> > +	count_vm_event(COMPACTSTALL);
> > +
> > +	/* Compact each zone in the list */
> > +	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> > +								nodemask) {
> 
> Will all of this code play nicely with memory hotplug?
> 

If your concern is a race with memory hotplug, I have no worries there,
because memory hotplug marks a range of pages as "not for use" before starting.
If your concern is "code sharing", the code shared between memory hotplug and
compaction is migrate_pages().

The other parts are independent of each other.

IIUC, memory hot-remove does:

	1. select a range for removal [start, end)
	2. mark free pages as "not for use" by migratetype
	3. move all used pages out to another range
	4. finally, all pages in the range are "not for use"

Compaction does:
	1. select a target order
	2. move some free pages to a private list
	3. migrate some used pages into the pages on the private list
	4. free pages

So the techniques for isolating freed pages are different.
I think that follows from their different purposes.

"Freed pages" in compaction are
	- for use
	- a chunk of pages from anywhere is ok.

but "freed pages" in memory unplug are
	- not for use
	- the chunk of pages should be in the specified range.

To use memory hotplug's code for compaction, we would have to specify a
"not for use" range. That would make low-order compaction inefficient,
and it does not seem easy to find the best range for compaction.

For compaction, the logic used in memory hotplug is too big a hammer, I guess.

Thanks,
-Kame



* Re: [PATCH 07/11] Memory compaction core
  2010-03-24 20:33   ` Andrew Morton
  2010-03-24 20:59     ` Jonathan Corbet
@ 2010-03-25  9:13     ` Mel Gorman
  1 sibling, 0 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-25  9:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 01:33:47PM -0700, Andrew Morton wrote:
> On Tue, 23 Mar 2010 12:25:42 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > This patch is the core of a mechanism which compacts memory in a zone by
> > relocating movable pages towards the end of the zone.
> > 
> > A single compaction run involves a migration scanner and a free scanner.
> > Both scanners operate on pageblock-sized areas in the zone. The migration
> > scanner starts at the bottom of the zone and searches for all movable pages
> > within each area, isolating them onto a private list called migratelist.
> > The free scanner starts at the top of the zone and searches for suitable
> > areas and consumes the free pages within making them available for the
> > migration scanner. The pages isolated for migration are then migrated to
> > the newly isolated free pages.
> 
> General comment: it looks like there are some codepaths which could
> hold zone->lock for a long time.  It's unclear that they're all
> constrained by COMPACT_CLUSTER_MAX. Is there a a latency issue here?
> 

I don't think so. There are two points where zone-related locks are
held.

zone->lock is held in isolate_freepages() while it gets the free pages
	necessary for migration to complete. The size of the list of pages
	being migrated is constrained by COMPACT_CLUSTER_MAX so it is bounded
	by that. Worst case scenario is the zone is almost fully
	scanned.

zone->lru_lock is held in isolate_migratepages() while it gets pages for
	migration. It's released if COMPACT_CLUSTER_MAX pages are
	isolated. Again, worst case scenario is that the zone is
	almost fully scanned.

The worst-case scenario in both cases is the lock is held while the zone
is scanned. The concern would be if we managed to scan almost a full
zone and that zone is very large. I could add an additional check to
release the lock when a large number of pages has been scanned but I
don't think it's necessary. I find it very unlikely that a large zone
would not have COMPACT_CLUSTER_MAX pages found quickly for isolation.
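
For reference, the bounding in the isolation loop works roughly like
this (a simplified sketch of what the patch does, not the exact code):

	/* Sketch: lru_lock hold time is bounded by COMPACT_CLUSTER_MAX */
	spin_lock_irq(&zone->lru_lock);
	for (; low_pfn < end_pfn; low_pfn++) {
		struct page *page = pfn_to_page(low_pfn);

		if (!PageLRU(page))
			continue;

		/* Try to isolate the page for migration */
		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) != 0)
			continue;

		/* Successfully isolated: move to the private migratelist */
		del_page_from_lru_list(zone, page, page_lru(page));
		list_add(&page->lru, &cc->migratepages);
		cc->nr_migratepages++;

		/* Stop early so the lock is not held for a whole zone scan */
		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
			break;
	}
	spin_unlock_irq(&zone->lru_lock);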

> >
> > ...
> >
> > +static struct page *compaction_alloc(struct page *migratepage,
> > +					unsigned long data,
> > +					int **result)
> > +{
> > +	struct compact_control *cc = (struct compact_control *)data;
> > +	struct page *freepage;
> > +
> > +	VM_BUG_ON(cc == NULL);
> 
> It's a bit strange to test this when we're about to oops anyway.  The
> oops will tell us the same thing.
> 

It was paranoia after the bugs related to NULL-offsets but unnecessary
paranoia in this case. It would require migration to be very broken for it to
trigger. Even if it was, I cannot imagine a case where it would be exploited
because it's a small structure and not offset by any userspace-supplied
piece of data. I will drop the check.

> > +	/* Isolate free pages if necessary */
> > +	if (list_empty(&cc->freepages)) {
> > +		isolate_freepages(cc->zone, cc);
> > +
> > +		if (list_empty(&cc->freepages))
> > +			return NULL;
> > +	}
> > +
> > +	freepage = list_entry(cc->freepages.next, struct page, lru);
> > +	list_del(&freepage->lru);
> > +	cc->nr_freepages--;
> > +
> > +	return freepage;
> > +}
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 11/11] Do not compact within a preferred zone after a compaction failure
  2010-03-24 20:53   ` Andrew Morton
@ 2010-03-25  9:40     ` Mel Gorman
  0 siblings, 0 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-25  9:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 01:53:47PM -0700, Andrew Morton wrote:
> On Tue, 23 Mar 2010 12:25:46 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > The fragmentation index may indicate that a failure it due to external
> > fragmentation, a compaction run complete and an allocation failure still
> > fail. There are two obvious reasons as to why
> > 
> >   o Page migration cannot move all pages so fragmentation remains
> >   o A suitable page may exist but watermarks are not met
> > 
> > In the event of compaction and allocation failure, this patch prevents
> > compaction happening for a short interval. It's only recorded on the
> > preferred zone but that should be enough coverage. This could have been
> > implemented similar to the zonelist_cache but the increased size of the
> > zonelist did not appear to be justified.
> > 
> >
> > ...
> >
> > +/* defer_compaction - Do not compact within a zone until a given time */
> > +static inline void defer_compaction(struct zone *zone, unsigned long resume)
> > +{
> > +	/*
> > +	 * This function is called when compaction fails to result in a page
> > +	 * allocation success. This is somewhat unsatisfactory as the failure
> > +	 * to compact has nothing to do with time and everything to do with
> > +	 * the requested order, the number of free pages and watermarks. How
> > +	 * to wait on that is more unclear, but the answer would apply to
> > +	 * other areas where the VM waits based on time.
> 
> um.  "Two wrongs don't make a right".  We should fix the other sites,
> not use them as excuses ;)
> 

Heh, one of those sites is currently in dispute. Specifically, the patch
that replaces congestion_wait() with a waitqueue that is woken when
watermarks are reached. I wrote that comment around about the same time
that patch was being developed which is why I found the situation
particularly unsatisfactory.

> What _is_ a good measure of "time" in this code?  "number of pages
> scanned" is a pretty good one in reclaim. 

In this case, a strong possibility is the number of pages freed since deferral.
It's not perfect though, because under heavy memory pressure those
pages are getting allocated again and the compaction is still not going
to succeed. I could use NR_FREE_PAGES to make a guess at how much has
changed since then and whether it's worth trying to compact again, but even
that is not perfect.
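
Purely as an illustration of that kind of guess (the compact_defer_free
field and the threshold below are hypothetical, not part of this series),
it might look something like:

	/* Hypothetical: defer based on how many pages have been freed,
	 * not on elapsed time. zone->compact_defer_free is an invented
	 * field and 4UL << order an arbitrary margin for illustration. */
	static inline void defer_compaction(struct zone *zone)
	{
		zone->compact_defer_free = zone_page_state(zone, NR_FREE_PAGES);
	}

	static inline int compaction_deferred(struct zone *zone, int order)
	{
		unsigned long free_now = zone_page_state(zone, NR_FREE_PAGES);

		/* Still deferred until the free count has grown by a margin */
		return free_now < zone->compact_defer_free + (4UL << order);
	}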

Let's say, for example, that compaction failed because the zone was mostly slab
pages. If all of those were freed and replaced with migratable pages, the
counters would look similar but compaction would now succeed. I could make
some sort of guess based on the number of free, anon and file pages in the zone, but
ultimately it would be hard to tell whether the heuristic was any better than time.

I think this is only worth worrying about if a workload is found where
compact_fail is rising rapidly.

> We want something which will
> adapt itself to amount-of-memory, number-of-cpus, speed-of-cpus,
> nature-of-workload, etc, etc.
> 
> Is it possible to come up with some simple metric which approximately
> reflects how busy this code is, then pace ourselves via that?
> 

I think a simple metric would be based on free anon and file pages but
I think we would need a workload that was hitting compact_fail to devise
it properly.

> > +	 */
> > +	zone->compact_resume = resume;
> > +}
> > +
> > +static inline int compaction_deferred(struct zone *zone)
> > +{
> > +	/* init once if necessary */
> > +	if (unlikely(!zone->compact_resume)) {
> > +		zone->compact_resume = jiffies;
> > +		return 0;
> > +	}
> > +
> > +	return time_before(jiffies, zone->compact_resume);
> > +}
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-25  0:30       ` KAMEZAWA Hiroyuki
@ 2010-03-25  9:48         ` Mel Gorman
  2010-03-25  9:50           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 78+ messages in thread
From: Mel Gorman @ 2010-03-25  9:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Thu, Mar 25, 2010 at 09:30:06AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 24 Mar 2010 11:40:57 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > On Wed, Mar 24, 2010 at 10:19:27AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 23 Mar 2010 12:25:45 +0000
> > > Mel Gorman <mel@csn.ul.ie> wrote:
> > > 
> > > > Ordinarily when a high-order allocation fails, direct reclaim is entered to
> > > > free pages to satisfy the allocation.  With this patch, it is determined if
> > > > an allocation failed due to external fragmentation instead of low memory
> > > > and if so, the calling process will compact until a suitable page is
> > > > freed. Compaction by moving pages in memory is considerably cheaper than
> > > > paging out to disk and works where there are locked pages or no swap. If
> > > > compaction fails to free a page of a suitable size, then reclaim will
> > > > still occur.
> > > > 
> > > > Direct compaction returns as soon as possible. As each block is compacted,
> > > > it is checked if a suitable page has been freed and if so, it returns.
> > > > 
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > Acked-by: Rik van Riel <riel@redhat.com>
> > > > ---
> > > >  include/linux/compaction.h |   16 +++++-
> > > >  include/linux/vmstat.h     |    1 +
> > > >  mm/compaction.c            |  118 ++++++++++++++++++++++++++++++++++++++++++++
> > > >  mm/page_alloc.c            |   26 ++++++++++
> > > >  mm/vmstat.c                |   15 +++++-
> > > >  5 files changed, 172 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > > > index c94890b..b851428 100644
> > > > --- a/include/linux/compaction.h
> > > > +++ b/include/linux/compaction.h
> > > > @@ -1,14 +1,26 @@
> > > >  #ifndef _LINUX_COMPACTION_H
> > > >  #define _LINUX_COMPACTION_H
> > > >  
> > > > -/* Return values for compact_zone() */
> > > > +/* Return values for compact_zone() and try_to_compact_pages() */
> > > >  #define COMPACT_INCOMPLETE	0
> > > > -#define COMPACT_COMPLETE	1
> > > > +#define COMPACT_PARTIAL		1
> > > > +#define COMPACT_COMPLETE	2
> > > >  
> > > >  #ifdef CONFIG_COMPACTION
> > > >  extern int sysctl_compact_memory;
> > > >  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> > > >  			void __user *buffer, size_t *length, loff_t *ppos);
> > > > +
> > > > +extern int fragmentation_index(struct zone *zone, unsigned int order);
> > > > +extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > > > +			int order, gfp_t gfp_mask, nodemask_t *mask);
> > > > +#else
> > > > +static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > > > +			int order, gfp_t gfp_mask, nodemask_t *nodemask)
> > > > +{
> > > > +	return COMPACT_INCOMPLETE;
> > > > +}
> > > > +
> > > >  #endif /* CONFIG_COMPACTION */
> > > >  
> > > >  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> > > > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > > > index 56e4b44..b4b4d34 100644
> > > > --- a/include/linux/vmstat.h
> > > > +++ b/include/linux/vmstat.h
> > > > @@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > > >  		KSWAPD_SKIP_CONGESTION_WAIT,
> > > >  		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> > > >  		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> > > > +		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
> > > >  #ifdef CONFIG_HUGETLB_PAGE
> > > >  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> > > >  #endif
> > > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > > index 8df6e3d..6688700 100644
> > > > --- a/mm/compaction.c
> > > > +++ b/mm/compaction.c
> > > > @@ -34,6 +34,8 @@ struct compact_control {
> > > >  	unsigned long nr_anon;
> > > >  	unsigned long nr_file;
> > > >  
> > > > +	unsigned int order;		/* order a direct compactor needs */
> > > > +	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
> > > >  	struct zone *zone;
> > > >  };
> > > >  
> > > > @@ -301,10 +303,31 @@ static void update_nr_listpages(struct compact_control *cc)
> > > >  static inline int compact_finished(struct zone *zone,
> > > >  						struct compact_control *cc)
> > > >  {
> > > > +	unsigned int order;
> > > > +	unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
> > > > +
> > > >  	/* Compaction run completes if the migrate and free scanner meet */
> > > >  	if (cc->free_pfn <= cc->migrate_pfn)
> > > >  		return COMPACT_COMPLETE;
> > > >  
> > > > +	/* Compaction run is not finished if the watermark is not met */
> > > > +	if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
> > > > +		return COMPACT_INCOMPLETE;
> > > > +
> > > > +	if (cc->order == -1)
> > > > +		return COMPACT_INCOMPLETE;
> > > > +
> > > > +	/* Direct compactor: Is a suitable page free? */
> > > > +	for (order = cc->order; order < MAX_ORDER; order++) {
> > > > +		/* Job done if page is free of the right migratetype */
> > > > +		if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
> > > > +			return COMPACT_PARTIAL;
> > > > +
> > > > +		/* Job done if allocation would set block type */
> > > > +		if (order >= pageblock_order && zone->free_area[order].nr_free)
> > > > +			return COMPACT_PARTIAL;
> > > > +	}
> > > > +
> > > >  	return COMPACT_INCOMPLETE;
> > > >  }
> > > >  
> > > > @@ -348,6 +371,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> > > >  	return ret;
> > > >  }
> > > >  
> > > > +static inline unsigned long compact_zone_order(struct zone *zone,
> > > > +						int order, gfp_t gfp_mask)
> > > > +{
> > > > +	struct compact_control cc = {
> > > > +		.nr_freepages = 0,
> > > > +		.nr_migratepages = 0,
> > > > +		.order = order,
> > > > +		.migratetype = allocflags_to_migratetype(gfp_mask),
> > > > +		.zone = zone,
> > > > +	};
> > > > +	INIT_LIST_HEAD(&cc.freepages);
> > > > +	INIT_LIST_HEAD(&cc.migratepages);
> > > > +
> > > > +	return compact_zone(zone, &cc);
> > > > +}
> > > > +
> > > > +/**
> > > > + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> > > > + * @zonelist: The zonelist used for the current allocation
> > > > + * @order: The order of the current allocation
> > > > + * @gfp_mask: The GFP mask of the current allocation
> > > > + * @nodemask: The allowed nodes to allocate from
> > > > + *
> > > > + * This is the main entry point for direct page compaction.
> > > > + */
> > > > +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > > > +			int order, gfp_t gfp_mask, nodemask_t *nodemask)
> > > > +{
> > > > +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > > > +	int may_enter_fs = gfp_mask & __GFP_FS;
> > > > +	int may_perform_io = gfp_mask & __GFP_IO;
> > > > +	unsigned long watermark;
> > > > +	struct zoneref *z;
> > > > +	struct zone *zone;
> > > > +	int rc = COMPACT_INCOMPLETE;
> > > > +
> > > > +	/* Check whether it is worth even starting compaction */
> > > > +	if (order == 0 || !may_enter_fs || !may_perform_io)
> > > > +		return rc;
> > > > +
> > > > +	/*
> > > > +	 * We will not stall if the necessary conditions are not met for
> > > > +	 * migration but direct reclaim seems to account stalls similarly
> > > > +	 */
> > > > +	count_vm_event(COMPACTSTALL);
> > > > +
> > > > +	/* Compact each zone in the list */
> > > > +	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> > > > +								nodemask) {
> > > > +		int fragindex;
> > > > +		int status;
> > > > +
> > > > +		/*
> > > > +		 * Watermarks for order-0 must be met for compaction. Note
> > > > +		 * the 2UL. This is because during migration, copies of
> > > > +		 * pages need to be allocated and for a short time, the
> > > > +		 * footprint is higher
> > > > +		 */
> > > > +		watermark = low_wmark_pages(zone) + (2UL << order);
> > > > +		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> > > > +			continue;
> > > > +
> > > > +		/*
> > > > +		 * fragmentation index determines if allocation failures are
> > > > +		 * due to low memory or external fragmentation
> > > > +		 *
> > > > +		 * index of -1 implies allocations might succeed depending
> > > > +		 * 	on watermarks
> > > > +		 * index < 500 implies alloc failure is due to lack of memory
> > > > +		 *
> > > > +		 * XXX: The choice of 500 is arbitrary. Reinvestigate
> > > > +		 *      appropriately to determine a sensible default.
> > > > +		 *      and what it means when watermarks are also taken
> > > > +		 *      into account. Consider making it a sysctl
> > > > +		 */
> > > > +		fragindex = fragmentation_index(zone, order);
> > > > +		if (fragindex >= 0 && fragindex <= 500)
> > > > +			continue;
> > > > +
> > > > +		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
> > > > +			rc = COMPACT_PARTIAL;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		status = compact_zone_order(zone, order, gfp_mask);
> > > > +		rc = max(status, rc);
> > > 
> > > Hm...then, scanning over the whole zone until success of migration at
> > > each failure?
> > 
> > Sorry for my lack of understanding but your question is difficult to
> > understand.
> > 
>
> Thank you for clarification.
> My word was bad. (And my understanding was.)
> 
> 
> > You might mean "scanning over the whole zonelist" rather than zone. In that
> > case, the zone_watermark_ok before and after the compaction will exit the
> > loop rather than moving to the next zone in the list.
> > 
>
> Yes.
> 
> > I'm not sure what you mean by "at each failure". The worst-case scenario
> > is that a process compacts the entire zone and still fails to meet the
> > watermarks. The best-case scenario is that it does a small amount of
> > compaction in the compact_zone() loop and finds that compact_finished()
> > causes the loop to exit before the whole zone is compacted.
> > 
>
> "at each failure" means "At each fauilure of smooth allocation of contiguous
> pages of requested order", I'm sorry for short words.
> 
> And yes, compact_finished() causes the loop to exit in good cases.
> (And I missed a change in compact_finished() in this patch..)
> 

Ok.

> > > Is it meaningful that multiple tasks run direct-compaction against
> > > a zone (from zone->start_pfn to zone->end_pfn) in parallel ?
> > > ex) running order=3 compaction while other thread runs order=5 compaction.
> > > 
> > 
> > It is meaningful in that "it will work" but there is a good chance that it's
> > pointless. To what degree it's pointless depends on what happened between
> > Compaction Process A starting and Compaction Process B. If kswapd is also
> > awake, then it might be worthwhile. By and large, the scanning is fast enough
> > that it won't be very noticeable.
> > 
>
> Hmm.
> 
> > My feeling is that multiple processes entering compaction at all is a bad
> > situation to be in. It implies there are multiple processes are requiring
> > high-order pages. Maybe if transparent huge pages were merged, it'd be
> > expected but otherwise it'd be a surprise.
> > 
> At first look, my concern was that you use
> 	if (order)
> rather than
> 	if (order >= PAGE_ALLOC_COSTLY_ORDER)
> 
> if order=1, dropping _old_ file cache isn't very bad.
> Because migration modifies LRU order, frequent compaction may add too much
> noise to LRU.
> 
> But yes, I have no data.
> 

FWIW, if a workload is increasing compact_stall rapidly, it should be
investigated.

> > > Can't we find a clever way to find [start_pfn, end_pfn) for scanning rather than
> > > [zone->start_pfn, zone->start_pfn + zone->spanned_pages) ?
> > > 
> > 
> > For sure. An early iteration of these patches stored the PFNs last scanned
> > for migration in struct zone and would use that as a starting point. It'd
> > wrap around at least once when it encountered the free page scanner so
> > that the zone would be scanned at least once. A more convoluted
> > iteration stored a list of compactors in a linked list. When selecting a
> > pageblock to migrate pages from, it'd check the list and avoid scanning
> > the same block as any other process.
> > 
> > I dropped these modifications for a few reasons
> > 
> > a) It added complexity for a situation that may not be encountered in
> >    practice.
> > b) Arguably, it would also make sense to simply allow only one compactor
> >    within a zone at a time and use a mutex
> > c) I had no data on why multiple processes would be direct compacting
> > 
> > The last point was the most important. I wanted to avoid complexity unless
> > there was a good reason for it. If we do encounter a situation where
> > multiple compactors are causing problems, I'd be more likely to ask "why
> > are there so many high-order allocations happening simultaneously?" than
> > "how can we make compaction smarter?"
> > 
> 
> I agree multiple requester of "high order" is unusual. Here, "high order" means
> order > PAGE_ALLOC_COSTLY_ORDER.
> But multiple requester of order=1,2,3? is usual case, I think.
> 

Multiple requesters of order 1,2,3 do happen but there is also an expectation
that the allocator can satisfy many of them without resorting to compaction
or reclaim.

Again, if compact_stall is rising rapidly it needs investigating. Such a
situation today is likely to be hitting lumpy reclaim a lot and that is
not very satisfactory either.

> BTW, one more question.
> 
> Because freed pages are pushed back to buddy system by __free_page(page),
> they may exists in zone's pcp list.

True.

> In that case, compact_finished() can't
> find there is a free chunk and do more work.  How about using a function like
> 	 free_pcppages_bulk(zone, pcp->batch, pcp);
> to bypass pcp list and freeing pages at once ?
> 

I think you mean to drain the PCP lists while compaction is happening
but is it justified? It's potentially a lot of IPI calls just to check
if compaction can finish a little earlier. If the pages on the PCP lists
are making that much of a difference to high-order page availability, it
implies that the zone is pretty full and it's likely that compaction was
avoided and we direct reclaimed.
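
For context, the drain being discussed looks roughly like this in
mm/page_alloc.c around this kernel version (simplified; the IPIs come
from on_each_cpu()):

	/* Drain this CPU's per-cpu page lists, for all zones */
	void drain_local_pages(void *arg)
	{
		drain_pages(smp_processor_id());
	}

	/* Drain every CPU's lists: one IPI per CPU via on_each_cpu() */
	void drain_all_pages(void)
	{
		on_each_cpu(drain_local_pages, NULL, 1);
	}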

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-25  9:48         ` Mel Gorman
@ 2010-03-25  9:50           ` KAMEZAWA Hiroyuki
  2010-03-25 10:16             ` Mel Gorman
  0 siblings, 1 reply; 78+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-25  9:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Thu, 25 Mar 2010 09:48:26 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> > In that case, compact_finished() can't
> > find there is a free chunk and do more work.  How about using a function like
> > 	 free_pcppages_bulk(zone, pcp->batch, pcp);
> > to bypass pcp list and freeing pages at once ?
> > 
> 
> I think you mean to drain the PCP lists while compaction is happening
> but is it justified? It's potentially a lot of IPI calls just to check
> if compaction can finish a little earlier. If the pages on the PCP lists
> are making that much of a difference to high-order page availability, it
> implies that the zone is pretty full and it's likely that compaction was
> avoided and we direct reclaimed.
> 
Ah, sorry for being terse again. I mean draining the "local" pcp list, because
it is the thread running direct compaction that freed the pages. An IPI is not
necessary and would be overkill.

Thanks,
-Kame



* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-25  9:50           ` KAMEZAWA Hiroyuki
@ 2010-03-25 10:16             ` Mel Gorman
  2010-03-26  1:03               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 78+ messages in thread
From: Mel Gorman @ 2010-03-25 10:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Thu, Mar 25, 2010 at 06:50:21PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 25 Mar 2010 09:48:26 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > > In that case, compact_finished() can't
> > > find there is a free chunk and do more work.  How about using a function like
> > > 	 free_pcppages_bulk(zone, pcp->batch, pcp);
> > > to bypass pcp list and freeing pages at once ?
> > > 
> > 
> > I think you mean to drain the PCP lists while compaction is happening
> > but is it justified? It's potentially a lot of IPI calls just to check
> > if compaction can finish a little earlier. If the pages on the PCP lists
> > are making that much of a difference to high-order page availability, it
> > implies that the zone is pretty full and it's likely that compaction was
> > avoided and we direct reclaimed.
> > 
> Ah, sorry for my short word again. I mean draining "local" pcp list because
> a thread which run direct-compaction freed pages. IPI is not necessary and
> overkill.
> 

Ah, I see now. There are two places where pages get freed: release_freepages()
at the end of compaction, when it's too late for compact_finished() to
benefit, and within migration itself. Migration frees with either
free_page() or, far more commonly, put_page(). As free_page() is only
called on failure to migrate (rare), there is little to gain from changing it,
and I'd rather not modify how put_page() works.

I could add a variant of drain_local_pages() that drains just the local PCP of
a given zone before compact_finished() is called. The cost would be a doubling
of the number of times zone->lock is taken to do the drain. Is it
justified? It seems overkill to me to take the zone->lock just in case
compaction can finish a little earlier. It feels like it would be adding
a guaranteed cost for a potential saving.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-24 20:48   ` Andrew Morton
  2010-03-25  0:57     ` KAMEZAWA Hiroyuki
@ 2010-03-25 10:21     ` Mel Gorman
  1 sibling, 0 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-25 10:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 01:48:16PM -0700, Andrew Morton wrote:
> On Tue, 23 Mar 2010 12:25:45 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Ordinarily when a high-order allocation fails, direct reclaim is entered to
> > free pages to satisfy the allocation.  With this patch, it is determined if
> > an allocation failed due to external fragmentation instead of low memory
> > and if so, the calling process will compact until a suitable page is
> > freed. Compaction by moving pages in memory is considerably cheaper than
> > paging out to disk and works where there are locked pages or no swap. If
> > compaction fails to free a page of a suitable size, then reclaim will
> > still occur.
> > 
> > Direct compaction returns as soon as possible. As each block is compacted,
> > it is checked if a suitable page has been freed and if so, it returns.
> > 
> >
> > ...
> >
> > +static inline unsigned long compact_zone_order(struct zone *zone,
> > +						int order, gfp_t gfp_mask)
> 
> Suggest that you re-review all the manual inlining in the patchset. 
> It's rarely needed and often incorrect.
> 

I dropped both inlines. Both have only one caller and should be
automatically inlined.

> > +{
> > +	struct compact_control cc = {
> > +		.nr_freepages = 0,
> > +		.nr_migratepages = 0,
> > +		.order = order,
> > +		.migratetype = allocflags_to_migratetype(gfp_mask),
> > +		.zone = zone,
> > +	};
> > +	INIT_LIST_HEAD(&cc.freepages);
> > +	INIT_LIST_HEAD(&cc.migratepages);
> > +
> > +	return compact_zone(zone, &cc);
> > +}
> > +
> > +/**
> > + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> > + * @zonelist: The zonelist used for the current allocation
> > + * @order: The order of the current allocation
> > + * @gfp_mask: The GFP mask of the current allocation
> > + * @nodemask: The allowed nodes to allocate from
> > + *
> > + * This is the main entry point for direct page compaction.
> > + */
> > +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > +			int order, gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> > +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > +	int may_enter_fs = gfp_mask & __GFP_FS;
> > +	int may_perform_io = gfp_mask & __GFP_IO;
> > +	unsigned long watermark;
> > +	struct zoneref *z;
> > +	struct zone *zone;
> > +	int rc = COMPACT_INCOMPLETE;
> > +
> > +	/* Check whether it is worth even starting compaction */
> > +	if (order == 0 || !may_enter_fs || !may_perform_io)
> > +		return rc;
> 
> hm, that was sad.  All those darn wireless drivers doing their
> high-order GFP_ATOMIC allocations cannot be saved?
> 

Not at this time. I'd need to go through migration and make it atomic-safe
first. It's doing things like taking the page table lock which is not
interrupt-safe. It's possible that migration can distinguish between atomic
and non-atomic migrations in the future.

> > +	/*
> > +	 * We will not stall if the necessary conditions are not met for
> > +	 * migration but direct reclaim seems to account stalls similarly
> > +	 */
> > +	count_vm_event(COMPACTSTALL);
> > +
> > +	/* Compact each zone in the list */
> > +	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> > +								nodemask) {
> 
> Will all of this code play nicely with memory hotplug?
> 

I assume you mean memory hot-remove as I cannot see any problems with
memory hot-add.

Kamezawa mostly covers this, but I did miss one possibility.
suitable_migration_target() will return true if a block is
MIGRATE_ISOLATE and it has a large free page at the start. I'll fix it to
explicitly avoid MIGRATE_ISOLATE.
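
Roughly, the extra check would be along these lines at the top of
suitable_migration_target() (an illustrative fragment, not the exact
patch text):

	/* Never treat a MIGRATE_ISOLATE block (e.g. one being offlined
	 * by memory hot-remove) as a source of free pages for compaction */
	if (get_pageblock_migratetype(page) == MIGRATE_ISOLATE)
		return false;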

> > +		int fragindex;
> > +		int status;
> > +
> > +		/*
> > +		 * Watermarks for order-0 must be met for compaction. Note
> > +		 * the 2UL. This is because during migration, copies of
> > +		 * pages need to be allocated and for a short time, the
> > +		 * footprint is higher
> > +		 */
> > +		watermark = low_wmark_pages(zone) + (2UL << order);
> > +		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> > +			continue;
> > +
> > +		/*
> > +		 * fragmentation index determines if allocation failures are
> > +		 * due to low memory or external fragmentation
> > +		 *
> > +		 * index of -1 implies allocations might succeed depending
> > +		 * 	on watermarks
> > +		 * index < 500 implies alloc failure is due to lack of memory
> > +		 *
> > +		 * XXX: The choice of 500 is arbitrary. Reinvestigate
> > +		 *      appropriately to determine a sensible default.
> > +		 *      and what it means when watermarks are also taken
> > +		 *      into account. Consider making it a sysctl
> > +		 */
> 
> Yes, best to make it a sysctl IMO.   It'll make optimisation far easier.
> /proc/sys/vm/fragmentation_index_dont_you_dare_use_this_it_will_disappear_soon
> 

Will do. This is why I leave such nice notes for myself :)

Is there also scope for a feature that allows the following?

if (!mel || !mel_told_you_to_do_this)
	slap_user();

> > +		fragindex = fragmentation_index(zone, order);
> > +		if (fragindex >= 0 && fragindex <= 500)
> > +			continue;
> > +
> > +		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
> > +			rc = COMPACT_PARTIAL;
> > +			break;
> > +		}
> > +
> > +		status = compact_zone_order(zone, order, gfp_mask);
> > +		rc = max(status, rc);
> > +
> > +		if (zone_watermark_ok(zone, order, watermark, 0, 0))
> > +			break;
> > +	}
> > +
> > +	return rc;
> > +}
> >
> > ...
> >
> > @@ -1765,6 +1766,31 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >  
> >  	cond_resched();
> >  
> > +	/* Try memory compaction for high-order allocations before reclaim */
> > +	if (order) {
> > +		*did_some_progress = try_to_compact_pages(zonelist,
> > +						order, gfp_mask, nodemask);
> > +		if (*did_some_progress != COMPACT_INCOMPLETE) {
> > +			page = get_page_from_freelist(gfp_mask, nodemask,
> > +					order, zonelist, high_zoneidx,
> > +					alloc_flags, preferred_zone,
> > +					migratetype);
> > +			if (page) {
> > +				__count_vm_event(COMPACTSUCCESS);
> > +				return page;
> > +			}
> > +
> > +			/*
> > +			 * It's bad if compaction run occurs and fails.
> > +			 * The most likely reason is that pages exist,
> > +			 * but not enough to satisfy watermarks.
> > +			 */
> > +			count_vm_event(COMPACTFAIL);
> 
> This counter will get incremented if !__GFP_FS or !__GFP_IO.  Seems
> wrong.
> 

Yep, it is wrong. I'll fix it. Thanks

> > +			cond_resched();
> > +		}
> > +	}
> > +
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-25 10:16             ` Mel Gorman
@ 2010-03-26  1:03               ` KAMEZAWA Hiroyuki
  2010-03-26  9:40                 ` Mel Gorman
  0 siblings, 1 reply; 78+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-26  1:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Thu, 25 Mar 2010 10:16:54 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> On Thu, Mar 25, 2010 at 06:50:21PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 25 Mar 2010 09:48:26 +0000
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > > In that case, compact_finished() can't
> > > > find there is a free chunk and do more work.  How about using a function like
> > > > 	 free_pcppages_bulk(zone, pcp->batch, pcp);
> > > > to bypass pcp list and freeing pages at once ?
> > > > 
> > > 
> > > I think you mean to drain the PCP lists while compaction is happening
> > > but is it justified? It's potentially a lot of IPI calls just to check
> > > if compaction can finish a little earlier. If the pages on the PCP lists
> > > are making that much of a difference to high-order page availability, it
> > > implies that the zone is pretty full and it's likely that compaction was
> > > avoided and we direct reclaimed.
> > > 
> > Ah, sorry for my short word again. I mean draining "local" pcp list because
> > a thread which run direct-compaction freed pages. IPI is not necessary and
> > overkill.
> > 
> 
> Ah, I see now. There are two places that pages get freed.  release_freepages()
> at the end of compaction when it's too late for compact_finished() to be
> helped and within migration itself. Migration frees with either
> free_page() or more commonly put_page() with put_page() being the most
> frequently used. As free_page() is called on failure to migrate (rare),
> there is little help in changing it and I'd rather not modify how
> put_page() works.
> 
> I could add a variant of drain_local_pages() that drains just the local PCP of
> a given zone before compact_finished() is called. The cost would be a doubling
> of the number of times zone->lock is taken to do the drain. Is it
> justified? It seems overkill to me to take the zone->lock just in case
> compaction can finish a little earlier. It feels like it would be adding
> a guaranteed cost for a potential saving.
> 
If you want to keep the code compact, I won't ask for more.

I was only worried because memory hot-unplug used to suffer from the pagevec
and pcp lists before it used MIGRATE_ISOLATE and a proper lru_add_drain().

Thanks,
-Kame





* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-26  1:03               ` KAMEZAWA Hiroyuki
@ 2010-03-26  9:40                 ` Mel Gorman
  0 siblings, 0 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-26  9:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Fri, Mar 26, 2010 at 10:03:08AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 25 Mar 2010 10:16:54 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > On Thu, Mar 25, 2010 at 06:50:21PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 25 Mar 2010 09:48:26 +0000
> > > Mel Gorman <mel@csn.ul.ie> wrote:
> > > 
> > > > > In that case, compact_finished() can't
> > > > > find there is a free chunk and do more work.  How about using a function like
> > > > > 	 free_pcppages_bulk(zone, pcp->batch, pcp);
> > > > > to bypass pcp list and freeing pages at once ?
> > > > > 
> > > > 
> > > > I think you mean to drain the PCP lists while compaction is happening
> > > > but is it justified? It's potentially a lot of IPI calls just to check
> > > > if compaction can finish a little earlier. If the pages on the PCP lists
> > > > are making that much of a difference to high-order page availability, it
> > > > implies that the zone is pretty full and it's likely that compaction was
> > > > avoided and we direct reclaimed.
> > > > 
> > > Ah, sorry for my short word again. I mean draining "local" pcp list because
> > > a thread which run direct-compaction freed pages. IPI is not necessary and
> > > overkill.
> > > 
> > 
> > Ah, I see now. There are two places that pages get freed.  release_freepages()
> > at the end of compaction when it's too late for compact_finished() to be
> > helped and within migration itself. Migration frees with either
> > free_page() or more commonly put_page() with put_page() being the most
> > frequently used. As free_page() is called on failure to migrate (rare),
> > there is little help in changing it and I'd rather not modify how
> > put_page() works.
> > 
> > I could add a variant of drain_local_pages() that drains just the local PCP of
> > a given zone before compact_finished() is called. The cost would be a doubling
> > of the number of times zone->lock is taken to do the drain. Is it
> > justified? It seems overkill to me to take the zone->lock just in case
> > compaction can finish a little earlier. It feels like it would be adding
> > a guaranteed cost for a potential saving.
> > 
> If you want to keep code comapct, I don't ask more.
> 
> I worried about that just because memory hot-unplug were suffered by pagevec
> and pcp list before using  MIGRATE_ISOLATE and proper lru_add_drain().
> 

What I can do to cover that situation, without costing much, is to call
drain_local_pages() after compaction completes.
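
Something like the following placement, assuming drain_local_pages(NULL)
keeps its current behaviour of draining only the calling CPU's lists
(an untested, illustrative sketch):

	/* Illustrative sketch of the end of compact_zone(): after the
	 * leftover isolated free pages have been handed back via
	 * release_freepages(), flush this CPU's pcp lists so any
	 * high-order page becomes visible to the buddy allocator */
	drain_local_pages(NULL);

	return ret;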

Thanks


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 08/11] Add /proc trigger for memory compaction
  2010-03-24 20:33   ` Andrew Morton
@ 2010-03-26 10:46     ` Mel Gorman
  0 siblings, 0 replies; 78+ messages in thread
From: Mel Gorman @ 2010-03-26 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 24, 2010 at 01:33:51PM -0700, Andrew Morton wrote:
> On Tue, 23 Mar 2010 12:25:43 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
> > value is written to the file, all zones are compacted. The expected user
> > of such a trigger is a job scheduler that prepares the system before the
> > target application runs.
> > 
> >
> > ...
> >
> > +/* This is the entry point for compacting all nodes via /proc/sys/vm */
> > +int sysctl_compaction_handler(struct ctl_table *table, int write,
> > +			void __user *buffer, size_t *length, loff_t *ppos)
> > +{
> > +	if (write)
> > +		return compact_nodes();
> > +
> > +	return 0;
> > +}
> 
> Neato.  When I saw the overall description I was afraid that this stuff
> would be fiddling with kernel threads.
> 

Not yet, anyway. A kcompactd similar to kswapd has been floated in the
past, but right now there is no justification for it. Like other
suggestions made in the past, it has potential but needs data to
justify it.

> The underlying compaction code can at times cause rather large amounts
> of memory to be put onto private lists, so it's lost to the rest of the
> kernel.  What happens if 10000 processes simultaneously write to this
> thing?  It's root-only so I guess the answer is "root becomes unemployed".
> 

Well, root becomes unemployed, but I shouldn't be supplying the rope.
Let's keep min_free_kbytes as the "fall off the cliff" tunable. I added
too_many_isolated()-like logic and also handling of fatal signals.

> I fear that the overall effect of this feature is that people will come
> up with ghastly hacks which keep on poking this tunable as a workaround
> for some VM shortcoming.  This will lead to more shortcomings, and
> longer-lived ones.
> 

That would be very unfortunate and also a self-defeating measure in the short
run, let alone the long run.  I consider the tunable to be more like the
"drop_caches" tunable. It can be used for good or bad and all the bad uses
kick you in the ass because it does not resolve the underlying problem and
is expensive to use.

I had three legitimate uses in mind for it:

1. Batch systems that compact memory before a job is scheduled to reduce
   the start-up time of applications using huge pages. Depending on their
   setup, sysfs might be a better fit for them (a sketch of how such a
   scheduler might poke the trigger follows this list)

2. Illustrating a bug in direct compaction, i.e. I'd get a report of an
   allocation failure that was consistent but that works perfectly once
   the tunable is poked

3. Development uses: measuring worst-case scenarios for compaction (rare,
   obviously), stress-testing compaction to try to catch bugs in migration
   and measuring how effective compaction currently is.
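
For illustration, this is roughly how such a scheduler could poke the
trigger from userspace before launching a hugepage-backed job (an untested
sketch; any written value triggers compaction):

	#include <fcntl.h>
	#include <unistd.h>

	/* Ask the kernel to compact all zones in all nodes */
	static int compact_all_memory(void)
	{
		int fd = open("/proc/sys/vm/compact_memory", O_WRONLY);
		ssize_t rc;

		if (fd < 0)
			return -1;
		rc = write(fd, "1", 1);
		close(fd);
		return (rc == 1) ? 0 : -1;
	}

The same effect can be had from a shell with
echo 1 > /proc/sys/vm/compact_memory, bearing in mind the file is mode
0200 so only root can write to it.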

Do these justify the existence of the tunable or is the risk of abuse
too high?

This is what the isolation logic looks like:


diff --git a/mm/compaction.c b/mm/compaction.c
index e0e8100..a6a6958 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -13,6 +13,7 @@
 #include <linux/mm_inline.h>
 #include <linux/sysctl.h>
 #include <linux/sysfs.h>
+#include <linux/backing-dev.h>
 #include "internal.h"
 
 /*
@@ -197,6 +198,20 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
 }
 
+/* Similar to reclaim, but different enough that they don't share logic */
+static int too_many_isolated(struct zone *zone)
+{
+
+	unsigned long inactive, isolated;
+
+	inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
+					zone_page_state(zone, NR_INACTIVE_ANON);
+	isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
+					zone_page_state(zone, NR_ISOLATED_ANON);
+
+	return isolated > inactive;
+}
+
 /*
  * Isolate all pages that can be migrated from the block pointed to by
  * the migrate scanner within compact_control.
@@ -223,6 +238,14 @@ static unsigned long isolate_migratepages(struct zone *zone,
 		return 0;
 	}
 
+	/* Do not isolate the world */
+	while (unlikely(too_many_isolated(zone))) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+		if (fatal_signal_pending(current))
+			return 0;
+	}
+
 	/* Time to isolate some pages for migration */
 	spin_lock_irq(&zone->lru_lock);
 	for (; low_pfn < end_pfn; low_pfn++) {
@@ -309,6 +332,9 @@ static int compact_finished(struct zone *zone,
 	unsigned int order;
 	unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
 
+	if (fatal_signal_pending(current))
+		return COMPACT_PARTIAL;
+
 	/* Compaction run completes if the migrate and free scanner meet */
 	if (cc->free_pfn <= cc->migrate_pfn)
 		return COMPACT_COMPLETE;

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/11] Add /proc trigger for memory compaction
  2010-03-12 16:41 ` [PATCH 08/11] Add /proc trigger for memory compaction Mel Gorman
@ 2010-03-17  3:18   ` KOSAKI Motohiro
  0 siblings, 0 replies; 78+ messages in thread
From: KOSAKI Motohiro @ 2010-03-17  3:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
> value is written to the file, all zones are compacted. The expected user
> of such a trigger is a job scheduler that prepares the system before the
> target application runs.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 08/11] Add /proc trigger for memory compaction
  2010-03-12 16:41 [PATCH 0/11] Memory Compaction v4 Mel Gorman
@ 2010-03-12 16:41 ` Mel Gorman
  2010-03-17  3:18   ` KOSAKI Motohiro
  0 siblings, 1 reply; 78+ messages in thread
From: Mel Gorman @ 2010-03-12 16:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	linux-kernel, linux-mm

This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
value is written to the file, all zones are compacted. The expected user
of such a trigger is a job scheduler that prepares the system before the
target application runs.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 Documentation/sysctl/vm.txt |   11 ++++++++
 include/linux/compaction.h  |    6 ++++
 kernel/sysctl.c             |   10 +++++++
 mm/compaction.c             |   61 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 88 insertions(+), 0 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 6c7d18c..317d3f0 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -19,6 +19,7 @@ files can be found in mm/swap.c.
 Currently, these files are in /proc/sys/vm:
 
 - block_dump
+- compact_memory
 - dirty_background_bytes
 - dirty_background_ratio
 - dirty_bytes
@@ -64,6 +65,16 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
 
 ==============================================================
 
+compact_memory
+
+Available only when CONFIG_COMPACTION is set. When an arbitrary value
+is written to the file, all zones are compacted such that free memory
+is available in contiguous blocks where possible. This can be important
+for example in the allocation of huge pages although processes will also
+directly compact memory as required.
+
+==============================================================
+
 dirty_background_bytes
 
 Contains the amount of dirty memory at which the pdflush background writeback
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 6201371..52762d2 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -5,4 +5,10 @@
 #define COMPACT_INCOMPLETE	0
 #define COMPACT_COMPLETE	1
 
+#ifdef CONFIG_COMPACTION
+extern int sysctl_compact_memory;
+extern int sysctl_compaction_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *length, loff_t *ppos);
+#endif /* CONFIG_COMPACTION */
+
 #endif /* _LINUX_COMPACTION_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7e12adc..df3b018 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -64,6 +64,7 @@
 #include <linux/slow-work.h>
 #include <linux/perf_event.h>
 #include <linux/kprobes.h>
+#include <linux/compaction.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -1090,6 +1091,15 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= drop_caches_sysctl_handler,
 	},
+#ifdef CONFIG_COMPACTION
+	{
+		.procname	= "compact_memory",
+		.data		= &sysctl_compact_memory,
+		.maxlen		= sizeof(int),
+		.mode		= 0200,
+		.proc_handler	= sysctl_compaction_handler,
+	},
+#endif /* CONFIG_COMPACTION */
 	{
 		.procname	= "min_free_kbytes",
 		.data		= &min_free_kbytes,
diff --git a/mm/compaction.c b/mm/compaction.c
index 3cc4db5..817aa5b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -11,6 +11,7 @@
 #include <linux/migrate.h>
 #include <linux/compaction.h>
 #include <linux/mm_inline.h>
+#include <linux/sysctl.h>
 #include "internal.h"
 
 /*
@@ -351,3 +352,63 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	return ret;
 }
 
+/* Compact all zones within a node */
+static int compact_node(int nid)
+{
+	int zoneid;
+	pg_data_t *pgdat;
+	struct zone *zone;
+
+	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
+		return -EINVAL;
+	pgdat = NODE_DATA(nid);
+
+	/* Flush pending updates to the LRU lists */
+	lru_add_drain_all();
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+		struct compact_control cc;
+
+		zone = &pgdat->node_zones[zoneid];
+		if (!populated_zone(zone))
+			continue;
+
+		cc.nr_freepages = 0;
+		cc.nr_migratepages = 0;
+		cc.zone = zone;
+		cc.order = -1;
+		INIT_LIST_HEAD(&cc.freepages);
+		INIT_LIST_HEAD(&cc.migratepages);
+
+		compact_zone(zone, &cc);
+
+		VM_BUG_ON(!list_empty(&cc.freepages));
+		VM_BUG_ON(!list_empty(&cc.migratepages));
+	}
+
+	return 0;
+}
+
+/* Compact all nodes in the system */
+static int compact_nodes(void)
+{
+	int nid;
+
+	for_each_online_node(nid)
+		compact_node(nid);
+
+	return COMPACT_COMPLETE;
+}
+
+/* The written value is actually unused, all memory is compacted */
+int sysctl_compact_memory;
+
+/* This is the entry point for compacting all nodes via /proc/sys/vm */
+int sysctl_compaction_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *length, loff_t *ppos)
+{
+	if (write)
+		return compact_nodes();
+
+	return 0;
+}
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2010-03-26 10:46 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-23 12:25 [PATCH 0/11] Memory Compaction v5 Mel Gorman
2010-03-23 12:25 ` [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
2010-03-23 12:25 ` [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
2010-03-23 17:22   ` Christoph Lameter
2010-03-23 18:04     ` Mel Gorman
2010-03-23 12:25 ` [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
2010-03-23 17:25   ` Christoph Lameter
2010-03-23 23:55   ` KAMEZAWA Hiroyuki
2010-03-23 12:25 ` [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
2010-03-23 12:25 ` [PATCH 05/11] Export unusable free space index via /proc/unusable_index Mel Gorman
2010-03-23 17:31   ` Christoph Lameter
2010-03-23 18:14     ` Mel Gorman
2010-03-24  0:03   ` KAMEZAWA Hiroyuki
2010-03-24  0:16     ` Minchan Kim
2010-03-24  0:13       ` KAMEZAWA Hiroyuki
2010-03-24 10:25     ` Mel Gorman
2010-03-23 12:25 ` [PATCH 06/11] Export fragmentation index via /proc/extfrag_index Mel Gorman
2010-03-23 17:37   ` Christoph Lameter
2010-03-23 12:25 ` [PATCH 07/11] Memory compaction core Mel Gorman
2010-03-23 17:56   ` Christoph Lameter
2010-03-23 18:15     ` Mel Gorman
2010-03-23 18:33       ` Christoph Lameter
2010-03-23 18:58         ` Mel Gorman
2010-03-23 19:20           ` Christoph Lameter
2010-03-24  1:03   ` KAMEZAWA Hiroyuki
2010-03-24  1:47     ` Minchan Kim
2010-03-24  1:53       ` KAMEZAWA Hiroyuki
2010-03-24  2:10         ` Minchan Kim
2010-03-24 10:57           ` Mel Gorman
2010-03-24 20:33   ` Andrew Morton
2010-03-24 20:59     ` Jonathan Corbet
2010-03-24 21:14       ` Andrew Morton
2010-03-24 21:19         ` Christoph Lameter
2010-03-24 21:19       ` Andrea Arcangeli
2010-03-24 21:28         ` Jonathan Corbet
2010-03-24 21:47           ` Andrea Arcangeli
2010-03-24 21:54             ` Jonathan Corbet
2010-03-24 22:06               ` Andrea Arcangeli
2010-03-24 21:57             ` Andrea Arcangeli
2010-03-25  9:13     ` Mel Gorman
2010-03-23 12:25 ` [PATCH 08/11] Add /proc trigger for memory compaction Mel Gorman
2010-03-23 18:25   ` Christoph Lameter
2010-03-23 18:32     ` Mel Gorman
2010-03-24 20:33   ` Andrew Morton
2010-03-26 10:46     ` Mel Gorman
2010-03-23 12:25 ` [PATCH 09/11] Add /sys trigger for per-node " Mel Gorman
2010-03-23 18:27   ` Christoph Lameter
2010-03-23 22:45   ` Minchan Kim
2010-03-24  0:19   ` KAMEZAWA Hiroyuki
2010-03-23 12:25 ` [PATCH 10/11] Direct compact when a high-order allocation fails Mel Gorman
2010-03-23 23:10   ` Minchan Kim
2010-03-24 11:11     ` Mel Gorman
2010-03-24 11:59       ` Minchan Kim
2010-03-24 12:06         ` Minchan Kim
2010-03-24 12:10           ` Mel Gorman
2010-03-24 12:09         ` Mel Gorman
2010-03-24 12:25           ` Minchan Kim
2010-03-24  1:19   ` KAMEZAWA Hiroyuki
2010-03-24 11:40     ` Mel Gorman
2010-03-25  0:30       ` KAMEZAWA Hiroyuki
2010-03-25  9:48         ` Mel Gorman
2010-03-25  9:50           ` KAMEZAWA Hiroyuki
2010-03-25 10:16             ` Mel Gorman
2010-03-26  1:03               ` KAMEZAWA Hiroyuki
2010-03-26  9:40                 ` Mel Gorman
2010-03-24 20:48   ` Andrew Morton
2010-03-25  0:57     ` KAMEZAWA Hiroyuki
2010-03-25 10:21     ` Mel Gorman
2010-03-23 12:25 ` [PATCH 11/11] Do not compact within a preferred zone after a compaction failure Mel Gorman
2010-03-23 18:31   ` Christoph Lameter
2010-03-23 18:39     ` Mel Gorman
2010-03-23 19:27       ` Christoph Lameter
2010-03-24 10:37         ` Mel Gorman
2010-03-24 19:54           ` Christoph Lameter
2010-03-24 20:53   ` Andrew Morton
2010-03-25  9:40     ` Mel Gorman
  -- strict thread matches above, loose matches on Subject: below --
2010-03-12 16:41 [PATCH 0/11] Memory Compaction v4 Mel Gorman
2010-03-12 16:41 ` [PATCH 08/11] Add /proc trigger for memory compaction Mel Gorman
2010-03-17  3:18   ` KOSAKI Motohiro
