* [PATCH 0/11] Memory Compaction v4
@ 2010-03-12 16:41 Mel Gorman
  2010-03-12 16:41 ` [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
                   ` (10 more replies)
  0 siblings, 11 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-12 16:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	linux-kernel, linux-mm

This is a rebase on top of mmotm for merge consideration. The patches are
based on and tested against mmotm-2010-03-09-19-15 minus the sysctl changes
that cause locking-related spew all over the console during boot. The spew
problem has already been reported in the thread "mmotm 2010-03-09-19-15:
Lot of scheduling while atomic warnings related to RCU".

Changelog since V3
  o Document sysfs entries (subsequently merged independently)
  o COMPACTION should depend on MMU
  o Comment updates
  o Ensure proc/sysfs triggering of compaction fully completes
  o Rename anon_vma refcount to external_refcount
  o Rebase to mmotm on top of 2.6.34-rc1

Changelog since V2
  o Move unusable and fragmentation indices to separate proc files
  o Express indices as being between 0 and 1
  o Update copyright notice for compaction.c
  o Avoid infinite loop when split free page fails
  o Init compact_resume at least once (impacted x86 testing)
  o Fewer pages are isolated during compaction.
  o LRU lists are no longer rotated when page is busy
  o NR_ISOLATED_* is updated to avoid isolating too many pages
  o Update zone LRU stats correctly when isolating pages
  o Reference count anon_vma instead of insufficient locking with
    use-after-free races in memory compaction
  o Watch for unmapped anon pages during migration
  o Remove unnecessary parameters on a few functions
  o Add Reviewed-by's. Note that I didn't add the Acks and Reviewed-by's
    for the proc patches as they have been split out into separate
    files and I don't know if the Acks are still valid.

Changelog since V1
  o Update help blurb on CONFIG_MIGRATION
  o Max unusable free space index is 100, not 1000
  o Move blockpfn forward properly during compaction
  o Cleanup CONFIG_COMPACTION vs CONFIG_MIGRATION confusion
  o Permissions on /proc and /sys files should be 0200
  o Reduce verbosity
  o Compact all nodes when triggered via /proc
  o Add per-node compaction via sysfs
  o Move defer_compaction out-of-line
  o Fix lock oddities in rmap_walk_anon
  o Add documentation

This patchset is a memory compaction mechanism that reduces external
fragmentation by moving GFP_MOVABLE pages to a smaller number of pageblocks.
The term "compaction" was chosen as there are a number of mechanisms, not
mutually exclusive, that can be used to defragment memory. For example,
lumpy reclaim is a form of defragmentation, as was slub "defragmentation"
(really a form of targeted reclaim). Hence, this is called "compaction" to
distinguish it from other forms of defragmentation.

In this implementation, a full compaction run involves two scanners operating
within a zone - a migration and a free scanner. The migration scanner
starts at the beginning of a zone and finds all movable pages within one
pageblock_nr_pages-sized area and isolates them on a migratepages list. The
free scanner begins at the end of the zone and searches on a per-area
basis for enough free pages to migrate all the pages on the migratepages
list. As each area is respectively migrated or exhausted of free pages,
the scanners are advanced one area.  A compaction run completes within a
zone when the two scanners meet.
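
To make the control flow concrete, the following is a condensed sketch of
the run loop - essentially what compact_zone() in patch 7 does, with the
pfn alignment, isolation accounting, statistics and error handling stripped
out:

static int compact_zone_sketch(struct zone *zone, struct compact_control *cc)
{
	/* Migration scanner starts at the zone start, free scanner at the end */
	cc->migrate_pfn = zone->zone_start_pfn;
	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;

	/* A compaction run completes when the two scanners meet */
	while (cc->free_pfn > cc->migrate_pfn) {
		/* Isolate movable pages from one pageblock-sized area */
		if (!isolate_migratepages(zone, cc))
			continue;

		/*
		 * Migrate them to free pages taken from areas found by the
		 * free scanner, via the compaction_alloc() callback
		 */
		migrate_pages(&cc->migratepages, compaction_alloc,
						(unsigned long)cc, 0);

		/* Put back any pages that could not be migrated */
		putback_lru_pages(&cc->migratepages);
	}

	return COMPACT_COMPLETE;
}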

This method is a bit primitive but is easy to understand, and greater
sophistication would require maintaining counters on a per-pageblock
basis. That would have a big impact on allocator fast-paths just to improve
compaction, which is a poor trade-off.

It also does not try to relocate virtually contiguous pages so that they
become physically contiguous. However, assuming transparent hugepages were
in use, a hypothetical khugepaged might reuse the compaction code to isolate
free pages, split them and relocate userspace pages for promotion.

Memory compaction can be triggered in one of three ways. It may be triggered
explicitly by writing any value to /proc/sys/vm/compact_memory, which
compacts all of memory. It can be triggered on a per-node basis by writing
any value to /sys/devices/system/node/nodeN/compact where N is the node ID
to be compacted. Finally, when a process fails to allocate a high-order
page, it may compact memory in an attempt to satisfy the allocation instead
of entering direct reclaim. Explicit compaction does not finish until the
two scanners meet; direct compaction ends early if a suitable page becomes
available that would meet watermarks.
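
For reference, a minimal userspace sketch of the two explicit triggers is
below (node0 is just an example node; any written value starts a compaction
and root privileges are needed):

#include <fcntl.h>
#include <unistd.h>

static void trigger(const char *path)
{
	int fd = open(path, O_WRONLY);
	ssize_t nw;

	if (fd < 0)
		return;

	/* The written value is ignored; the write itself is the trigger */
	nw = write(fd, "1", 1);
	(void)nw;
	close(fd);
}

int main(void)
{
	trigger("/proc/sys/vm/compact_memory");			/* all of memory */
	trigger("/sys/devices/system/node/node0/compact");	/* node 0 only */
	return 0;
}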

The series is in 11 patches. The first three are not "core" to the series
but are important prerequisites.

Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
	patch, it's possible to use anon_vma after free if the caller is
	not holding a VMA or mmap_sem for the pages in question. While
	there should be no existing user that causes this problem,
	it's a requirement for memory compaction to be stable. The patch
	is at the start of the series for bisection reasons.
Patch 2 skips over anon pages during migration that are no longer mapped
	because there still appeared to be a small window between when
	a page was isolated and migration started during which anon_vma
	could disappear.
Patch 3 merges the KSM and migrate counts. It could be merged with patch 1
	but would be slightly harder to review.
Patch 4 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
Patch 5 exports an "unusable free space index" via /proc/unusable_index. It's
	a measure of external fragmentation that takes the size of the
	allocation request into account. It can also be calculated from
	userspace so it can be dropped if requested.
Patch 6 exports a "fragmentation index" which only has meaning when an
	allocation request fails. It determines whether an allocation failure
	would be due to a lack of memory or to external fragmentation. (A
	worked example of both indices follows this list.)
Patch 7 is the compaction mechanism although it's unreachable at this point
Patch 8 adds a means of compacting all of memory with a proc trigger
Patch 9 adds a means of compacting a specific node with a sysfs trigger
Patch 10 adds "direct compaction" before "direct reclaim" if it is
	determined there is a good chance of success.
Patch 11 temporarily disables compaction if an allocation failure occurs
	after compaction.
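
For illustration, here is a worked example of both indices using the
formulas from patches 5 and 6 (the numbers are invented). Consider a zone
with 640 free pages where 128 pages are free as order-0 pages and one
order-9 (512-page) block is free. The unusable free space index for order-9
is (640 - (1 << 9)) * 1000 / 640 = 200, reported as 0.200, i.e. 20% of the
free memory is unusable for an order-9 allocation. If instead all 640 pages
were free as order-0 pages, an order-9 request would fail and the
fragmentation index would be 1000 - ((1000 + 640 * 1000 / 512) / 640) = 997
in integer arithmetic, reported as 0.997, meaning the failure would be due
almost entirely to external fragmentation rather than a lack of memory.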

Testing of compaction was in three stages. For the tests, debugging, preempt,
the sleep watchdog and lockdep were all enabled, but nothing nasty popped
out. min_free_kbytes was tuned as recommended by hugeadm to help fragmentation
avoidance and high-order allocations. It was tested on X86, X86-64 and PPC64.

The first test represents one of the easiest cases that can be faced by
lumpy reclaim or memory compaction.

1. Machine freshly booted and configured for hugepage usage with
	a) hugeadm --create-global-mounts
	b) hugeadm --pool-pages-max DEFAULT:8G
	c) hugeadm --set-recommended-min_free_kbytes
	d) hugeadm --set-recommended-shmmax

	The min_free_kbytes here is important. Anti-fragmentation works best
	when pageblocks don't mix. hugeadm knows how to calculate a value that
	will significantly reduce the worst of external-fragmentation-related
	events as reported by the mm_page_alloc_extfrag tracepoint.

2. Load up memory
	a) Start updatedb
	b) Create X files of pagesize*128 in size in parallel. Wait
	   until the files are created. By parallel, I mean that 4096
	   instances of dd were launched, one after the other using &. The
	   crude objective is to mix filesystem metadata allocations with
	   the buffer cache.
	c) Delete every second file so that pageblocks are likely to
	   have holes
	d) kill updatedb if it's still running

	At this point, the system is quiet and memory is full, but it is full
	of clean filesystem metadata and clean buffer cache that is unmapped.
	This is readily migrated or discarded, so you'd expect lumpy reclaim
	to have no significant advantage over compaction, but this is at
	the POC stage.

3. In increments, attempt to allocate 5% of memory as hugepages.
	   Measure how long it took, how successful it was, how many
	   direct reclaims took place and how many compactions. Note that
	   the compaction figures might not fully add up as compactions
	   can take place for orders other than the hugepage size.

X86				vanilla		compaction
Final page count                    930                941 (attempted 1002)
pages reclaimed                   74630               3861

X86-64				vanilla		compaction
Final page count:                   916                916 (attempted 1002)
Total pages reclaimed:           122076              49800

PPC64				vanilla		compaction
Final page count:                    91                 94 (attempted 110)
Total pages reclaimed:            80252              96299

There was not a dramatic improvement in success rates, but one would not be
expected in this case either. What is important is that significantly fewer
pages were reclaimed in all cases, reducing the amount of IO required to
satisfy a huge page allocation.

The second set of tests was all performance-related - kernbench, netperf,
iozone and sysbench. None showed anything too remarkable.

The last test was a high-order allocation stress test. Many kernel compiles
are started to fill memory with a pressured mix of kernel and movable
allocations. During this, an attempt is made to allocate 90% of memory
as huge pages - one at a time with small delays between attempts to avoid
flooding the IO queue.

                                             vanilla   compaction
Percentage of request allocated X86               98           99
Percentage of request allocated X86-64            93           99
Percentage of request allocated PPC64             59           76

Success rates are a little higher, particularly on PPC64 with the larger
huge pages. What is most interesting is the latency when allocating huge
pages.

X86:    http://www.csn.ul.ie/~mel/postings/compaction-20100312/highalloc-interlatency-arnold-compaction-stress-v4r3-mean.ps
X86_64: http://www.csn.ul.ie/~mel/postings/compaction-20100312/highalloc-interlatency-hydra-compaction-stress-v4r3-mean.ps
PPC64: http://www.csn.ul.ie/~mel/postings/compaction-20100312/highalloc-interlatency-powyah-compaction-stress-v4r3-mean.ps

X86 latency is reduced the least, but it depends heavily on the HIGHMEM
zone to allocate many of its huge pages, which is a relatively
straightforward job. X86-64 and PPC64 both show very significant reductions
in the average time taken to allocate huge pages. It is not reduced to zero
because the system is under enough memory pressure that reclaim is still
required for some of the allocations.

Also enlightening in the same directory are the "stddev" files. Each of
them shows that the variance between allocation times is drastically
reduced.

Andrew, assuming no major complaints, how do you feel about picking these up?

 Documentation/ABI/stable/sysfs-devices-node  |    7 +
 Documentation/ABI/testing/sysfs-devices-node |    7 +
 Documentation/filesystems/proc.txt           |   68 +++-
 Documentation/sysctl/vm.txt                  |   11 +
 drivers/base/node.c                          |    3 +
 include/linux/compaction.h                   |   76 ++++
 include/linux/mm.h                           |    1 +
 include/linux/mmzone.h                       |    7 +
 include/linux/rmap.h                         |   27 +-
 include/linux/swap.h                         |    6 +
 include/linux/vmstat.h                       |    2 +
 kernel/sysctl.c                              |   11 +
 mm/Kconfig                                   |   20 +-
 mm/Makefile                                  |    1 +
 mm/compaction.c                              |  555 ++++++++++++++++++++++++++
 mm/ksm.c                                     |    4 +-
 mm/migrate.c                                 |   22 +
 mm/page_alloc.c                              |   68 ++++
 mm/rmap.c                                    |   10 +-
 mm/vmscan.c                                  |    5 -
 mm/vmstat.c                                  |  217 ++++++++++
 21 files changed, 1101 insertions(+), 27 deletions(-)
 create mode 100644 Documentation/ABI/stable/sysfs-devices-node
 create mode 100644 Documentation/ABI/testing/sysfs-devices-node
 create mode 100644 include/linux/compaction.h
 create mode 100644 mm/compaction.c



* [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating
  2010-03-12 16:41 [PATCH 0/11] Memory Compaction v4 Mel Gorman
@ 2010-03-12 16:41 ` Mel Gorman
  2010-03-14 15:01   ` Minchan Kim
                     ` (2 more replies)
  2010-03-12 16:41 ` [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
                   ` (9 subsequent siblings)
  10 siblings, 3 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-12 16:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	linux-kernel, linux-mm

rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma and it does not appear to have sufficient locking to
ensure the anon_vma does not disappear from under it.

This patch copies an approach used by KSM to take a reference on the
anon_vma while pages are being migrated. This should prevent rmap_walk()
from running into nasty surprises later because the anon_vma has been freed.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/rmap.h |   23 +++++++++++++++++++++++
 mm/migrate.c         |   12 ++++++++++++
 mm/rmap.c            |   10 +++++-----
 3 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..567d43f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -29,6 +29,9 @@ struct anon_vma {
 #ifdef CONFIG_KSM
 	atomic_t ksm_refcount;
 #endif
+#ifdef CONFIG_MIGRATION
+	atomic_t migrate_refcount;
+#endif
 	/*
 	 * NOTE: the LSB of the head.next is set by
 	 * mm_take_all_locks() _after_ taking the above lock. So the
@@ -81,6 +84,26 @@ static inline int ksm_refcount(struct anon_vma *anon_vma)
 	return 0;
 }
 #endif /* CONFIG_KSM */
+#ifdef CONFIG_MIGRATION
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+	atomic_set(&anon_vma->migrate_refcount, 0);
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+	return atomic_read(&anon_vma->migrate_refcount);
+}
+#else
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+	return 0;
+}
+#endif /* CONFIG_MIGRATE */
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
diff --git a/mm/migrate.c b/mm/migrate.c
index 88000b8..98eaaf2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -547,6 +547,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	int rcu_locked = 0;
 	int charge = 0;
 	struct mem_cgroup *mem = NULL;
+	struct anon_vma *anon_vma = NULL;
 
 	if (!newpage)
 		return -ENOMEM;
@@ -603,6 +604,8 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	if (PageAnon(page)) {
 		rcu_read_lock();
 		rcu_locked = 1;
+		anon_vma = page_anon_vma(page);
+		atomic_inc(&anon_vma->migrate_refcount);
 	}
 
 	/*
@@ -642,6 +645,15 @@ skip_unmap:
 	if (rc)
 		remove_migration_ptes(page, page);
 rcu_unlock:
+
+	/* Drop an anon_vma reference if we took one */
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
+		int empty = list_empty(&anon_vma->head);
+		spin_unlock(&anon_vma->lock);
+		if (empty)
+			anon_vma_free(anon_vma);
+	}
+
 	if (rcu_locked)
 		rcu_read_unlock();
 uncharge:
diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..578d0fe 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -248,7 +248,8 @@ static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma);
+	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
+					!migrate_refcount(anon_vma);
 	spin_unlock(&anon_vma->lock);
 
 	if (empty)
@@ -273,6 +274,7 @@ static void anon_vma_ctor(void *data)
 
 	spin_lock_init(&anon_vma->lock);
 	ksm_refcount_init(anon_vma);
+	migrate_refcount_init(anon_vma);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
@@ -1338,10 +1340,8 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 	/*
 	 * Note: remove_migration_ptes() cannot use page_lock_anon_vma()
 	 * because that depends on page_mapped(); but not all its usages
-	 * are holding mmap_sem, which also gave the necessary guarantee
-	 * (that this anon_vma's slab has not already been destroyed).
-	 * This needs to be reviewed later: avoiding page_lock_anon_vma()
-	 * is risky, and currently limits the usefulness of rmap_walk().
+	 * are holding mmap_sem. Users without mmap_sem are required to
+	 * take a reference count to prevent the anon_vma disappearing
 	 */
 	anon_vma = page_anon_vma(page);
 	if (!anon_vma)
-- 
1.6.5



* [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-12 16:41 [PATCH 0/11] Memory Compaction v4 Mel Gorman
  2010-03-12 16:41 ` [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
@ 2010-03-12 16:41 ` Mel Gorman
  2010-03-15  0:28   ` Minchan Kim
  2010-03-12 16:41 ` [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-12 16:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	linux-kernel, linux-mm

rmap_walk_anon() was triggering errors in memory compaction that look like
use-after-free errors in anon_vma. The problem appears to be that between
the page being isolated from the LRU and rcu_read_lock() being taken, the
mapcount of the page dropped to 0 and the anon_vma was freed. This patch
skips the migration of anon pages that are not mapped by anyone.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/migrate.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 98eaaf2..3c491e3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -602,6 +602,16 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	 * just care Anon page here.
 	 */
 	if (PageAnon(page)) {
+		/*
+		 * If the page has no mappings any more, just bail. An
+		 * unmapped anon page is likely to be freed soon but worse,
+		 * it's possible its anon_vma disappeared between when
+		 * the page was isolated and when we reached here while
+		 * the RCU lock was not held
+		 */
+		if (!page_mapcount(page))
+			goto uncharge;
+
 		rcu_read_lock();
 		rcu_locked = 1;
 		anon_vma = page_anon_vma(page);
-- 
1.6.5



* [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration
  2010-03-12 16:41 [PATCH 0/11] Memory Compaction v4 Mel Gorman
  2010-03-12 16:41 ` [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
  2010-03-12 16:41 ` [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
@ 2010-03-12 16:41 ` Mel Gorman
  2010-03-12 17:14   ` Rik van Riel
                     ` (2 more replies)
  2010-03-12 16:41 ` [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
                   ` (7 subsequent siblings)
  10 siblings, 3 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-12 16:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	linux-kernel, linux-mm

For clarity of review, KSM and page migration have separate refcounts on
the anon_vma. While clear, this is a waste of memory. This patch gets
KSM and page migration to share their toys in a spirit of harmony.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 include/linux/rmap.h |   50 ++++++++++++++++++--------------------------------
 mm/ksm.c             |    4 ++--
 mm/migrate.c         |    4 ++--
 mm/rmap.c            |    6 ++----
 4 files changed, 24 insertions(+), 40 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 567d43f..7721674 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -26,11 +26,17 @@
  */
 struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
-#ifdef CONFIG_KSM
-	atomic_t ksm_refcount;
-#endif
-#ifdef CONFIG_MIGRATION
-	atomic_t migrate_refcount;
+#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
+
+	/*
+	 * The external_refcount is taken by either KSM or page migration
+	 * to take a reference to an anon_vma when there is no
+	 * guarantee that the vma of page tables will exist for
+	 * the duration of the operation. A caller that takes
+	 * the reference is responsible for clearing up the
+	 * anon_vma if they are the last user on release
+	 */
+	atomic_t external_refcount;
 #endif
 	/*
 	 * NOTE: the LSB of the head.next is set by
@@ -64,46 +70,26 @@ struct anon_vma_chain {
 };
 
 #ifdef CONFIG_MMU
-#ifdef CONFIG_KSM
-static inline void ksm_refcount_init(struct anon_vma *anon_vma)
+#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
+static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
 {
-	atomic_set(&anon_vma->ksm_refcount, 0);
+	atomic_set(&anon_vma->external_refcount, 0);
 }
 
-static inline int ksm_refcount(struct anon_vma *anon_vma)
+static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
 {
-	return atomic_read(&anon_vma->ksm_refcount);
+	return atomic_read(&anon_vma->external_refcount);
 }
 #else
-static inline void ksm_refcount_init(struct anon_vma *anon_vma)
+static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
 {
 }
 
-static inline int ksm_refcount(struct anon_vma *anon_vma)
+static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
 {
 	return 0;
 }
 #endif /* CONFIG_KSM */
-#ifdef CONFIG_MIGRATION
-static inline void migrate_refcount_init(struct anon_vma *anon_vma)
-{
-	atomic_set(&anon_vma->migrate_refcount, 0);
-}
-
-static inline int migrate_refcount(struct anon_vma *anon_vma)
-{
-	return atomic_read(&anon_vma->migrate_refcount);
-}
-#else
-static inline void migrate_refcount_init(struct anon_vma *anon_vma)
-{
-}
-
-static inline int migrate_refcount(struct anon_vma *anon_vma)
-{
-	return 0;
-}
-#endif /* CONFIG_MIGRATE */
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
diff --git a/mm/ksm.c b/mm/ksm.c
index a93f1b7..e45ec98 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -318,14 +318,14 @@ static void hold_anon_vma(struct rmap_item *rmap_item,
 			  struct anon_vma *anon_vma)
 {
 	rmap_item->anon_vma = anon_vma;
-	atomic_inc(&anon_vma->ksm_refcount);
+	atomic_inc(&anon_vma->external_refcount);
 }
 
 static void drop_anon_vma(struct rmap_item *rmap_item)
 {
 	struct anon_vma *anon_vma = rmap_item->anon_vma;
 
-	if (atomic_dec_and_lock(&anon_vma->ksm_refcount, &anon_vma->lock)) {
+	if (atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->lock)) {
 		int empty = list_empty(&anon_vma->head);
 		spin_unlock(&anon_vma->lock);
 		if (empty)
diff --git a/mm/migrate.c b/mm/migrate.c
index 3c491e3..dd1ab6b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -615,7 +615,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 		rcu_read_lock();
 		rcu_locked = 1;
 		anon_vma = page_anon_vma(page);
-		atomic_inc(&anon_vma->migrate_refcount);
+		atomic_inc(&anon_vma->external_refcount);
 	}
 
 	/*
@@ -657,7 +657,7 @@ skip_unmap:
 rcu_unlock:
 
 	/* Drop an anon_vma reference if we took one */
-	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->lock)) {
 		int empty = list_empty(&anon_vma->head);
 		spin_unlock(&anon_vma->lock);
 		if (empty)
diff --git a/mm/rmap.c b/mm/rmap.c
index 578d0fe..af35b75 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -248,8 +248,7 @@ static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
-					!migrate_refcount(anon_vma);
+	empty = list_empty(&anon_vma->head) && !anonvma_external_refcount(anon_vma);
 	spin_unlock(&anon_vma->lock);
 
 	if (empty)
@@ -273,8 +272,7 @@ static void anon_vma_ctor(void *data)
 	struct anon_vma *anon_vma = data;
 
 	spin_lock_init(&anon_vma->lock);
-	ksm_refcount_init(anon_vma);
-	migrate_refcount_init(anon_vma);
+	anonvma_external_refcount_init(anon_vma);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
-- 
1.6.5



* [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
  2010-03-12 16:41 [PATCH 0/11] Memory Compaction v4 Mel Gorman
                   ` (2 preceding siblings ...)
  2010-03-12 16:41 ` [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
@ 2010-03-12 16:41 ` Mel Gorman
  2010-03-17  2:28   ` KOSAKI Motohiro
  2010-03-12 16:41 ` [PATCH 05/11] Export unusable free space index via /proc/unusable_index Mel Gorman
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-12 16:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	linux-kernel, linux-mm

CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
being able to hot-remove memory. The main users of page migration, such as
sys_move_pages(), sys_migrate_pages() and cpuset process migration, are
only beneficial on NUMA, so the restriction makes sense.

As memory compaction will operate within a zone and is useful on both NUMA
and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
user selects CONFIG_COMPACTION as an option.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/Kconfig |   20 ++++++++++++++++----
 1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 9c61158..04e241b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -172,17 +172,29 @@ config SPLIT_PTLOCK_CPUS
 	default "4"
 
 #
+# support for memory compaction
+config COMPACTION
+	bool "Allow for memory compaction"
+	def_bool y
+	select MIGRATION
+	depends on EXPERIMENTAL && HUGETLBFS && MMU
+	help
+	  Allows the compaction of memory for the allocation of huge pages.
+
+#
 # support for page migration
 #
 config MIGRATION
 	bool "Page migration"
 	def_bool y
-	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE
+	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION
 	help
 	  Allows the migration of the physical location of pages of processes
-	  while the virtual addresses are not changed. This is useful for
-	  example on NUMA systems to put pages nearer to the processors accessing
-	  the page.
+	  while the virtual addresses are not changed. This is useful in
+	  two situations. The first is on NUMA systems to put pages nearer
+	  to the processors accessing them. The second is when allocating huge
+	  pages as migration can relocate pages to satisfy a huge page
+	  allocation instead of reclaiming.
 
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
-- 
1.6.5



* [PATCH 05/11] Export unusable free space index via /proc/unusable_index
  2010-03-12 16:41 [PATCH 0/11] Memory Compaction v4 Mel Gorman
                   ` (3 preceding siblings ...)
  2010-03-12 16:41 ` [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
@ 2010-03-12 16:41 ` Mel Gorman
  2010-03-15  5:41   ` KAMEZAWA Hiroyuki
  2010-03-17  2:42   ` KOSAKI Motohiro
  2010-03-12 16:41 ` [PATCH 06/11] Export fragmentation index via /proc/extfrag_index Mel Gorman
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-12 16:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	linux-kernel, linux-mm

The unusable free space index is a measure of external fragmentation that
takes the allocation size into account. For the most part, the huge page
size will be the size of interest, but not necessarily, so it is exported
on a per-order and per-zone basis via /proc/unusable_index.

The index is a value between 0 and 1. It can be expressed as a
percentage by multiplying by 100 as documented in
Documentation/filesystems/proc.txt.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 Documentation/filesystems/proc.txt |   13 ++++-
 mm/vmstat.c                        |  120 ++++++++++++++++++++++++++++++++++++
 2 files changed, 132 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 5e132b5..5c4b0fb 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -452,6 +452,7 @@ Table 1-5: Kernel info in /proc
  sys         See chapter 2                                     
  sysvipc     Info of SysVIPC Resources (msg, sem, shm)		(2.4)
  tty	     Info of tty drivers
+ unusable_index Additional page allocator information (see text)(2.5)
  uptime      System uptime                                     
  version     Kernel version                                    
  video	     bttv info of video resources			(2.4)
@@ -609,7 +610,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
 available in ZONE_NORMAL, etc... 
 
 More information relevant to external fragmentation can be found in
-pagetypeinfo.
+pagetypeinfo and unusable_index
 
 > cat /proc/pagetypeinfo
 Page block order: 9
@@ -650,6 +651,16 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
 also be allocatable although a lot of filesystem metadata may have to be
 reclaimed to achieve this.
 
+> cat /proc/unusable_index
+Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
+Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
+
+The unusable free space index measures how much of the available free
+memory cannot be used to satisfy an allocation of a given size and is a
+value between 0 and 1. The higher the value, the more of free memory is
+unusable and by implication, the worse the external fragmentation is. This
+can be expressed as a percentage by multiplying by 100.
+
 ..............................................................................
 
 meminfo:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7f760cb..ca42e10 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -453,6 +453,106 @@ static int frag_show(struct seq_file *m, void *arg)
 	return 0;
 }
 
+
+struct contig_page_info {
+	unsigned long free_pages;
+	unsigned long free_blocks_total;
+	unsigned long free_blocks_suitable;
+};
+
+/*
+ * Calculate the number of free pages in a zone, how many contiguous
+ * pages are free and how many are large enough to satisfy an allocation of
+ * the target size. Note that this function makes no attempt to estimate
+ * how many suitable free blocks there *might* be if MOVABLE pages were
+ * migrated. Calculating that is possible, but expensive and can be
+ * figured out from userspace
+ */
+static void fill_contig_page_info(struct zone *zone,
+				unsigned int suitable_order,
+				struct contig_page_info *info)
+{
+	unsigned int order;
+
+	info->free_pages = 0;
+	info->free_blocks_total = 0;
+	info->free_blocks_suitable = 0;
+
+	for (order = 0; order < MAX_ORDER; order++) {
+		unsigned long blocks;
+
+		/* Count number of free blocks */
+		blocks = zone->free_area[order].nr_free;
+		info->free_blocks_total += blocks;
+
+		/* Count free base pages */
+		info->free_pages += blocks << order;
+
+		/* Count the suitable free blocks */
+		if (order >= suitable_order)
+			info->free_blocks_suitable += blocks <<
+						(order - suitable_order);
+	}
+}
+
+/*
+ * Return an index indicating how much of the available free memory is
+ * unusable for an allocation of the requested size.
+ */
+static int unusable_free_index(unsigned int order,
+				struct contig_page_info *info)
+{
+	/* No free memory is interpreted as all free memory is unusable */
+	if (info->free_pages == 0)
+		return 1000;
+
+	/*
+	 * Index should be a value between 0 and 1. Return a value to 3
+	 * decimal places.
+	 *
+	 * 0 => no fragmentation
+	 * 1 => high fragmentation
+	 */
+	return ((info->free_pages - (info->free_blocks_suitable << order)) * 1000) / info->free_pages;
+
+}
+
+static void unusable_show_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	unsigned int order;
+	int index;
+	struct contig_page_info info;
+
+	seq_printf(m, "Node %d, zone %8s ",
+				pgdat->node_id,
+				zone->name);
+	for (order = 0; order < MAX_ORDER; ++order) {
+		fill_contig_page_info(zone, order, &info);
+		index = unusable_free_index(order, &info);
+		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+	}
+
+	seq_putc(m, '\n');
+}
+
+/*
+ * Display unusable free space index
+ * XXX: Could be a lot more efficient, but it's not a critical path
+ */
+static int unusable_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	/* check memoryless node */
+	if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+		return 0;
+
+	walk_zones_in_node(m, pgdat, unusable_show_print);
+
+	return 0;
+}
+
 static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
@@ -603,6 +703,25 @@ static const struct file_operations pagetypeinfo_file_ops = {
 	.release	= seq_release,
 };
 
+static const struct seq_operations unusable_op = {
+	.start	= frag_start,
+	.next	= frag_next,
+	.stop	= frag_stop,
+	.show	= unusable_show,
+};
+
+static int unusable_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &unusable_op);
+}
+
+static const struct file_operations unusable_file_ops = {
+	.open		= unusable_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 #ifdef CONFIG_ZONE_DMA
 #define TEXT_FOR_DMA(xx) xx "_dma",
 #else
@@ -947,6 +1066,7 @@ static int __init setup_vmstat(void)
 #ifdef CONFIG_PROC_FS
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
 	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
+	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
 	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
 	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
 #endif
-- 
1.6.5



* [PATCH 06/11] Export fragmentation index via /proc/extfrag_index
  2010-03-12 16:41 [PATCH 0/11] Memory Compaction v4 Mel Gorman
                   ` (4 preceding siblings ...)
  2010-03-12 16:41 ` [PATCH 05/11] Export unusable free space index via /proc/unusable_index Mel Gorman
@ 2010-03-12 16:41 ` Mel Gorman
  2010-03-17  2:49   ` KOSAKI Motohiro
  2010-03-12 16:41 ` [PATCH 07/11] Memory compaction core Mel Gorman
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-12 16:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	linux-kernel, linux-mm

The fragmentation index is a value that only makes sense when an allocation
of a given size would fail. The index indicates whether an allocation failure
is due to a lack of memory (values towards 0) or due to external fragmentation
(values towards 1). For the most part, the huge page size will be the size
of interest, but not necessarily, so it is exported on a per-order and
per-zone basis via /proc/extfrag_index.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 Documentation/filesystems/proc.txt |   14 ++++++-
 mm/vmstat.c                        |   81 ++++++++++++++++++++++++++++++++++++
 2 files changed, 94 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 5c4b0fb..582ff3d 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -421,6 +421,7 @@ Table 1-5: Kernel info in /proc
  filesystems Supported filesystems                             
  driver	     Various drivers grouped here, currently rtc (2.4)
  execdomains Execdomains, related to security			(2.4)
+ extfrag_index Additional page allocator information (see text) (2.5)
  fb	     Frame Buffer devices				(2.4)
  fs	     File system parameters, currently nfs/exports	(2.4)
  ide         Directory containing info about the IDE subsystem 
@@ -610,7 +611,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
 available in ZONE_NORMAL, etc... 
 
 More information relevant to external fragmentation can be found in
-pagetypeinfo and unusable_index
+pagetypeinfo, unusable_index and extfrag_index.
 
 > cat /proc/pagetypeinfo
 Page block order: 9
@@ -661,6 +662,17 @@ value between 0 and 1. The higher the value, the more of free memory is
 unusable and by implication, the worse the external fragmentation is. This
 can be expressed as a percentage by multiplying by 100.
 
+> cat /proc/extfrag_index
+Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
+Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954
+
+The external fragmentation index is only meaningful if an allocation
+would fail and indicates what the failure is due to. A value of -1, as in
+many of the examples above, means that the allocation would succeed.
+If it would fail, the value is between 0 and 1. A value tending towards
+0 implies the allocation would fail due to a lack of memory. A value tending
+towards 1 implies it would fail due to external fragmentation.
+
 ..............................................................................
 
 meminfo:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ca42e10..7377da6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -553,6 +553,67 @@ static int unusable_show(struct seq_file *m, void *arg)
 	return 0;
 }
 
+/*
+ * A fragmentation index only makes sense if an allocation of a requested
+ * size would fail. If that is true, the fragmentation index indicates
+ * whether external fragmentation or a lack of memory was the problem.
+ * The value can be used to determine if page reclaim or compaction
+ * should be used
+ */
+int fragmentation_index(unsigned int order, struct contig_page_info *info)
+{
+	unsigned long requested = 1UL << order;
+
+	if (!info->free_blocks_total)
+		return 0;
+
+	/* Fragmentation index only makes sense when a request would fail */
+	if (info->free_blocks_suitable)
+		return -1000;
+
+	/*
+	 * Index is between 0 and 1 so return within 3 decimal places
+	 *
+	 * 0 => allocation would fail due to lack of memory
+	 * 1 => allocation would fail due to fragmentation
+	 */
+	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
+}
+
+
+static void extfrag_show_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	unsigned int order;
+	int index;
+
+	/* Alloc on stack as interrupts are disabled for zone walk */
+	struct contig_page_info info;
+
+	seq_printf(m, "Node %d, zone %8s ",
+				pgdat->node_id,
+				zone->name);
+	for (order = 0; order < MAX_ORDER; ++order) {
+		fill_contig_page_info(zone, order, &info);
+		index = fragmentation_index(order, &info);
+		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+	}
+
+	seq_putc(m, '\n');
+}
+
+/*
+ * Display fragmentation index for orders that allocations would fail for
+ */
+static int extfrag_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	walk_zones_in_node(m, pgdat, extfrag_show_print);
+
+	return 0;
+}
+
 static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
@@ -722,6 +783,25 @@ static const struct file_operations unusable_file_ops = {
 	.release	= seq_release,
 };
 
+static const struct seq_operations extfrag_op = {
+	.start	= frag_start,
+	.next	= frag_next,
+	.stop	= frag_stop,
+	.show	= extfrag_show,
+};
+
+static int extfrag_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &extfrag_op);
+}
+
+static const struct file_operations extfrag_file_ops = {
+	.open		= extfrag_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 #ifdef CONFIG_ZONE_DMA
 #define TEXT_FOR_DMA(xx) xx "_dma",
 #else
@@ -1067,6 +1147,7 @@ static int __init setup_vmstat(void)
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
 	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
 	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
+	proc_create("extfrag_index", S_IRUGO, NULL, &extfrag_file_ops);
 	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
 	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
 #endif
-- 
1.6.5



* [PATCH 07/11] Memory compaction core
  2010-03-12 16:41 [PATCH 0/11] Memory Compaction v4 Mel Gorman
                   ` (5 preceding siblings ...)
  2010-03-12 16:41 ` [PATCH 06/11] Export fragmentation index via /proc/extfrag_index Mel Gorman
@ 2010-03-12 16:41 ` Mel Gorman
  2010-03-15 13:44   ` Minchan Kim
  2010-03-17 10:31   ` KOSAKI Motohiro
  2010-03-12 16:41 ` [PATCH 08/11] Add /proc trigger for memory compaction Mel Gorman
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-12 16:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	linux-kernel, linux-mm

This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.

A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable pages
within each area, isolating them onto a private list called migratelist.
The free scanner starts at the top of the zone and searches for suitable
areas, consuming the free pages within them and making them available to the
migration scanner. The pages isolated for migration are then migrated to
the newly isolated free pages.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/compaction.h |    8 +
 include/linux/mm.h         |    1 +
 include/linux/swap.h       |    6 +
 include/linux/vmstat.h     |    1 +
 mm/Makefile                |    1 +
 mm/compaction.c            |  353 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |   39 +++++
 mm/vmscan.c                |    5 -
 mm/vmstat.c                |    5 +
 9 files changed, 414 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/compaction.h
 create mode 100644 mm/compaction.c

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
new file mode 100644
index 0000000..6201371
--- /dev/null
+++ b/include/linux/compaction.h
@@ -0,0 +1,8 @@
+#ifndef _LINUX_COMPACTION_H
+#define _LINUX_COMPACTION_H
+
+/* Return values for compact_zone() */
+#define COMPACT_INCOMPLETE	0
+#define COMPACT_COMPLETE	1
+
+#endif /* _LINUX_COMPACTION_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 925ae32..7a24ebb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -335,6 +335,7 @@ void put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
+int split_free_page(struct page *page);
 
 /*
  * Compound pages have a destructor function.  Provide a
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1f59d93..cf8bba7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -151,6 +151,7 @@ enum {
 };
 
 #define SWAP_CLUSTER_MAX 32
+#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
 #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
@@ -238,6 +239,11 @@ static inline void lru_cache_add_active_file(struct page *page)
 	__lru_cache_add(page, LRU_ACTIVE_FILE);
 }
 
+/* LRU Isolation modes. */
+#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
+#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
+#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
+
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 117f0dd..56e4b44 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
diff --git a/mm/Makefile b/mm/Makefile
index 7a68d2a..ccb1f72 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_COMPACTION) += compaction.o
 obj-$(CONFIG_SMP) += percpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
diff --git a/mm/compaction.c b/mm/compaction.c
new file mode 100644
index 0000000..3cc4db5
--- /dev/null
+++ b/mm/compaction.c
@@ -0,0 +1,353 @@
+/*
+ * linux/mm/compaction.c
+ *
+ * Memory compaction for the reduction of external fragmentation. Note that
+ * this heavily depends upon page migration to do all the real heavy
+ * lifting
+ *
+ * Copyright IBM Corp. 2007-2010 Mel Gorman <mel@csn.ul.ie>
+ */
+#include <linux/swap.h>
+#include <linux/migrate.h>
+#include <linux/compaction.h>
+#include <linux/mm_inline.h>
+#include "internal.h"
+
+/*
+ * compact_control is used to track pages being migrated and the free pages
+ * they are being migrated to during memory compaction. The free_pfn starts
+ * at the end of a zone and migrate_pfn begins at the start. Movable pages
+ * are moved to the end of a zone during a compaction run and the run
+ * completes when free_pfn <= migrate_pfn
+ */
+struct compact_control {
+	struct list_head freepages;	/* List of free pages to migrate to */
+	struct list_head migratepages;	/* List of pages being migrated */
+	unsigned long nr_freepages;	/* Number of isolated free pages */
+	unsigned long nr_migratepages;	/* Number of pages to migrate */
+	unsigned long free_pfn;		/* isolate_freepages search base */
+	unsigned long migrate_pfn;	/* isolate_migratepages search base */
+
+	/* Account for isolated anon and file pages */
+	unsigned long nr_anon;
+	unsigned long nr_file;
+
+	struct zone *zone;
+};
+
+static int release_freepages(struct list_head *freelist)
+{
+	struct page *page, *next;
+	int count = 0;
+
+	list_for_each_entry_safe(page, next, freelist, lru) {
+		list_del(&page->lru);
+		__free_page(page);
+		count++;
+	}
+
+	return count;
+}
+
+/* Isolate free pages onto a private freelist. Must hold zone->lock */
+static int isolate_freepages_block(struct zone *zone,
+				unsigned long blockpfn,
+				struct list_head *freelist)
+{
+	unsigned long zone_end_pfn, end_pfn;
+	int total_isolated = 0;
+
+	/* Get the last PFN we should scan for free pages at */
+	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	end_pfn = blockpfn + pageblock_nr_pages;
+	if (end_pfn > zone_end_pfn)
+		end_pfn = zone_end_pfn;
+
+	/* Isolate free pages. This assumes the block is valid */
+	for (; blockpfn < end_pfn; blockpfn++) {
+		struct page *page;
+		int isolated, i;
+
+		if (!pfn_valid_within(blockpfn))
+			continue;
+
+		page = pfn_to_page(blockpfn);
+		if (!PageBuddy(page))
+			continue;
+
+		/* Found a free page, break it into order-0 pages */
+		isolated = split_free_page(page);
+		total_isolated += isolated;
+		for (i = 0; i < isolated; i++) {
+			list_add(&page->lru, freelist);
+			page++;
+		}
+
+		/* If a page was split, advance to the end of it */
+		if (isolated)
+			blockpfn += isolated - 1;
+	}
+
+	return total_isolated;
+}
+
+/* Returns 1 if the page is within a block suitable for migration to */
+static int suitable_migration_target(struct page *page)
+{
+	/* If the page is a large free page, then allow migration */
+	if (PageBuddy(page) && page_order(page) >= pageblock_order)
+		return 1;
+
+	/* If the block is MIGRATE_MOVABLE, allow migration */
+	if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
+		return 1;
+
+	/* Otherwise skip the block */
+	return 0;
+}
+
+/*
+ * Based on information in the current compact_control, find blocks
+ * suitable for isolating free pages from
+ */
+static void isolate_freepages(struct zone *zone,
+				struct compact_control *cc)
+{
+	struct page *page;
+	unsigned long high_pfn, low_pfn, pfn;
+	unsigned long flags;
+	int nr_freepages = cc->nr_freepages;
+	struct list_head *freelist = &cc->freepages;
+
+	pfn = cc->free_pfn;
+	low_pfn = cc->migrate_pfn + pageblock_nr_pages;
+	high_pfn = low_pfn;
+
+	/*
+	 * Isolate free pages until enough are available to migrate the
+	 * pages on cc->migratepages. We stop searching if the migrate
+	 * and free page scanners meet or enough free pages are isolated.
+	 */
+	spin_lock_irqsave(&zone->lock, flags);
+	for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
+					pfn -= pageblock_nr_pages) {
+		int isolated;
+
+		if (!pfn_valid(pfn))
+			continue;
+
+		/* 
+		 * Check for overlapping nodes/zones. It's possible on some
+		 * configurations to have a setup like
+		 * node0 node1 node0
+		 * i.e. it's possible that all pages within a zones range of
+		 * i.e. it's possible that all pages within a zone's range of
+		 */
+		page = pfn_to_page(pfn);
+		if (page_zone(page) != zone)
+			continue;
+
+		/* Check the block is suitable for migration */
+		if (!suitable_migration_target(page))
+			continue;
+
+		/* Found a block suitable for isolating free pages from */
+		isolated = isolate_freepages_block(zone, pfn, freelist);
+		nr_freepages += isolated;
+
+		/*
+		 * Record the highest PFN we isolated pages from. When next
+		 * looking for free pages, the search will restart here as
+		 * page migration may have returned some pages to the allocator
+		 */
+		if (isolated)
+			high_pfn = max(high_pfn, pfn);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	cc->free_pfn = high_pfn;
+	cc->nr_freepages = nr_freepages;
+}
+
+/* Update the number of anon and file isolated pages in the zone */
+void update_zone_isolated(struct zone *zone, struct compact_control *cc)
+{
+	struct page *page;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+
+	list_for_each_entry(page, &cc->migratepages, lru) {
+		int lru = page_lru_base_type(page);
+		count[lru]++;
+	}
+
+	cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
+}
+
+/*
+ * Isolate all pages that can be migrated from the block pointed to by
+ * the migrate scanner within compact_control.
+ */
+static unsigned long isolate_migratepages(struct zone *zone,
+					struct compact_control *cc)
+{
+	unsigned long low_pfn, end_pfn;
+	struct list_head *migratelist;
+
+	low_pfn = cc->migrate_pfn;
+	migratelist = &cc->migratepages;
+
+	/* Do not scan outside zone boundaries */
+	if (low_pfn < zone->zone_start_pfn)
+		low_pfn = zone->zone_start_pfn;
+
+	/* Setup to scan one block but not past where we are migrating to */
+	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
+
+	/* Do not cross the free scanner or scan within a memory hole */
+	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
+		cc->migrate_pfn = end_pfn;
+		return 0;
+	}
+
+	migrate_prep();
+
+	/* Time to isolate some pages for migration */
+	spin_lock_irq(&zone->lru_lock);
+	for (; low_pfn < end_pfn; low_pfn++) {
+		struct page *page;
+		if (!pfn_valid_within(low_pfn))
+			continue;
+
+		/* Get the page and skip if free */
+		page = pfn_to_page(low_pfn);
+		if (PageBuddy(page)) {
+			low_pfn += (1 << page_order(page)) - 1;
+			continue;
+		}
+
+		if (!PageLRU(page) || PageUnevictable(page))
+			continue;
+
+		/* Try isolate the page */
+		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
+			del_page_from_lru_list(zone, page, page_lru(page));
+			list_add(&page->lru, migratelist);
+			mem_cgroup_del_lru(page);
+			cc->nr_migratepages++;
+		}
+
+		/* Avoid isolating too much */
+		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
+			break;
+	}
+
+	update_zone_isolated(zone, cc);
+
+	spin_unlock_irq(&zone->lru_lock);
+	cc->migrate_pfn = low_pfn;
+
+	return cc->nr_migratepages;
+}
+
+/*
+ * This is a migrate-callback that "allocates" freepages by taking pages
+ * from the isolated freelists in the block we are migrating to.
+ */
+static struct page *compaction_alloc(struct page *migratepage,
+					unsigned long data,
+					int **result)
+{
+	struct compact_control *cc = (struct compact_control *)data;
+	struct page *freepage;
+
+	VM_BUG_ON(cc == NULL);
+
+	/* Isolate free pages if necessary */
+	if (list_empty(&cc->freepages)) {
+		isolate_freepages(cc->zone, cc);
+
+		if (list_empty(&cc->freepages))
+			return NULL;
+	}
+
+	freepage = list_entry(cc->freepages.next, struct page, lru);
+	list_del(&freepage->lru);
+	cc->nr_freepages--;
+
+	return freepage;
+}
+
+/*
+ * We cannot control nr_migratepages and nr_freepages fully when migration is
+ * running as migrate_pages() has no knowledge of compact_control. When
+ * migration is complete, we count the number of pages on the lists by hand.
+ */
+static void update_nr_listpages(struct compact_control *cc)
+{
+	int nr_migratepages = 0;
+	int nr_freepages = 0;
+	struct page *page;
+	list_for_each_entry(page, &cc->migratepages, lru)
+		nr_migratepages++;
+	list_for_each_entry(page, &cc->freepages, lru)
+		nr_freepages++;
+
+	cc->nr_migratepages = nr_migratepages;
+	cc->nr_freepages = nr_freepages;
+}
+
+static inline int compact_finished(struct zone *zone,
+						struct compact_control *cc)
+{
+	/* Compaction run completes if the migrate and free scanner meet */
+	if (cc->free_pfn <= cc->migrate_pfn)
+		return COMPACT_COMPLETE;
+
+	return COMPACT_INCOMPLETE;
+}
+
+static int compact_zone(struct zone *zone, struct compact_control *cc)
+{
+	int ret = COMPACT_INCOMPLETE;
+
+	/* Setup to move all movable pages to the end of the zone */
+	cc->migrate_pfn = zone->zone_start_pfn;
+	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
+	cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+	for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
+		unsigned long nr_migrate, nr_remaining;
+		if (!isolate_migratepages(zone, cc))
+			continue;
+
+		nr_migrate = cc->nr_migratepages;
+		migrate_pages(&cc->migratepages, compaction_alloc,
+						(unsigned long)cc, 0);
+		update_nr_listpages(cc);
+		nr_remaining = cc->nr_migratepages;
+
+		count_vm_event(COMPACTBLOCKS);
+		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
+		if (nr_remaining)
+			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
+
+		/* Release LRU pages not migrated */
+		if (!list_empty(&cc->migratepages)) {
+			putback_lru_pages(&cc->migratepages);
+			cc->nr_migratepages = 0;
+		}
+
+		mod_zone_page_state(zone, NR_ISOLATED_ANON, -cc->nr_anon);
+		mod_zone_page_state(zone, NR_ISOLATED_FILE, -cc->nr_file);
+	}
+
+	/* Release free pages and check accounting */
+	cc->nr_freepages -= release_freepages(&cc->freepages);
+	VM_BUG_ON(cc->nr_freepages != 0);
+
+	return ret;
+}
+
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 882aef0..9708143 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1208,6 +1208,45 @@ void split_page(struct page *page, unsigned int order)
 }
 
 /*
+ * Similar to split_page except the page is already free. As this is only
+ * being used for migration, the migratetype of the block also changes.
+ */
+int split_free_page(struct page *page)
+{
+	unsigned int order;
+	unsigned long watermark;
+	struct zone *zone;
+
+	BUG_ON(!PageBuddy(page));
+
+	zone = page_zone(page);
+	order = page_order(page);
+
+	/* Obey watermarks or the system could deadlock */
+	watermark = low_wmark_pages(zone) + (1 << order);
+	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+		return 0;
+
+	/* Remove page from free list */
+	list_del(&page->lru);
+	zone->free_area[order].nr_free--;
+	rmv_page_order(page);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+
+	/* Split into individual pages */
+	set_page_refcounted(page);
+	split_page(page, order);
+
+	if (order >= pageblock_order - 1) {
+		struct page *endpage = page + (1 << order) - 1;
+		for (; page < endpage; page += pageblock_nr_pages)
+			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+	}
+
+	return 1 << order;
+}
+
+/*
  * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..ef89600 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -839,11 +839,6 @@ keep:
 	return nr_reclaimed;
 }
 
-/* LRU Isolation modes. */
-#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
-#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
-#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
-
 /*
  * Attempt to remove the specified page from its LRU.  Only take this page
  * if it is of the appropriate PageActive status.  Pages which are being
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7377da6..af88647 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -891,6 +891,11 @@ static const char * const vmstat_text[] = {
 	"allocstall",
 
 	"pgrotated",
+
+	"compact_blocks_moved",
+	"compact_pages_moved",
+	"compact_pagemigrate_failed",
+
 #ifdef CONFIG_HUGETLB_PAGE
 	"htlb_buddy_alloc_success",
 	"htlb_buddy_alloc_fail",
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 08/11] Add /proc trigger for memory compaction
  2010-03-12 16:41 [PATCH 0/11] Memory Compaction v4 Mel Gorman
                   ` (6 preceding siblings ...)
  2010-03-12 16:41 ` [PATCH 07/11] Memory compaction core Mel Gorman
@ 2010-03-12 16:41 ` Mel Gorman
  2010-03-17  3:18   ` KOSAKI Motohiro
  2010-03-12 16:41 ` [PATCH 09/11] Add /sys trigger for per-node " Mel Gorman
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-12 16:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	linux-kernel, linux-mm

This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
value is written to the file, all zones are compacted. The expected user
of such a trigger is a job scheduler that prepares the system before the
target application runs.
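
As a minimal user-space sketch of such a trigger (not part of the patch
itself, and assuming only what is documented below: any value written to
/proc/sys/vm/compact_memory compacts all zones, and the file is mode 0200
so root is required):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * What a job scheduler might run before starting a job that wants
 * hugepages. The value written is ignored; the write itself triggers
 * compaction of all zones.
 */
int main(void)
{
        int fd = open("/proc/sys/vm/compact_memory", O_WRONLY);

        if (fd < 0) {
                perror("open /proc/sys/vm/compact_memory");
                return 1;
        }
        if (write(fd, "1", 1) != 1)
                perror("write");
        close(fd);
        return 0;
}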

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 Documentation/sysctl/vm.txt |   11 ++++++++
 include/linux/compaction.h  |    6 ++++
 kernel/sysctl.c             |   10 +++++++
 mm/compaction.c             |   61 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 88 insertions(+), 0 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 6c7d18c..317d3f0 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -19,6 +19,7 @@ files can be found in mm/swap.c.
 Currently, these files are in /proc/sys/vm:
 
 - block_dump
+- compact_memory
 - dirty_background_bytes
 - dirty_background_ratio
 - dirty_bytes
@@ -64,6 +65,16 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
 
 ==============================================================
 
+compact_memory
+
+Available only when CONFIG_COMPACTION is set. When an arbitrary value
+is written to the file, all zones are compacted such that free memory
+is available in contiguous blocks where possible. This can be important
+for example in the allocation of huge pages although processes will also
+directly compact memory as required.
+
+==============================================================
+
 dirty_background_bytes
 
 Contains the amount of dirty memory at which the pdflush background writeback
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 6201371..52762d2 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -5,4 +5,10 @@
 #define COMPACT_INCOMPLETE	0
 #define COMPACT_COMPLETE	1
 
+#ifdef CONFIG_COMPACTION
+extern int sysctl_compact_memory;
+extern int sysctl_compaction_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *length, loff_t *ppos);
+#endif /* CONFIG_COMPACTION */
+
 #endif /* _LINUX_COMPACTION_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7e12adc..df3b018 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -64,6 +64,7 @@
 #include <linux/slow-work.h>
 #include <linux/perf_event.h>
 #include <linux/kprobes.h>
+#include <linux/compaction.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -1090,6 +1091,15 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= drop_caches_sysctl_handler,
 	},
+#ifdef CONFIG_COMPACTION
+	{
+		.procname	= "compact_memory",
+		.data		= &sysctl_compact_memory,
+		.maxlen		= sizeof(int),
+		.mode		= 0200,
+		.proc_handler	= sysctl_compaction_handler,
+	},
+#endif /* CONFIG_COMPACTION */
 	{
 		.procname	= "min_free_kbytes",
 		.data		= &min_free_kbytes,
diff --git a/mm/compaction.c b/mm/compaction.c
index 3cc4db5..817aa5b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -11,6 +11,7 @@
 #include <linux/migrate.h>
 #include <linux/compaction.h>
 #include <linux/mm_inline.h>
+#include <linux/sysctl.h>
 #include "internal.h"
 
 /*
@@ -351,3 +352,63 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	return ret;
 }
 
+/* Compact all zones within a node */
+static int compact_node(int nid)
+{
+	int zoneid;
+	pg_data_t *pgdat;
+	struct zone *zone;
+
+	if (nid < 0 || nid > nr_node_ids || !node_online(nid))
+		return -EINVAL;
+	pgdat = NODE_DATA(nid);
+
+	/* Flush pending updates to the LRU lists */
+	lru_add_drain_all();
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+		struct compact_control cc;
+
+		zone = &pgdat->node_zones[zoneid];
+		if (!populated_zone(zone))
+			continue;
+
+		cc.nr_freepages = 0;
+		cc.nr_migratepages = 0;
+		cc.zone = zone;
+		cc.order = -1;
+		INIT_LIST_HEAD(&cc.freepages);
+		INIT_LIST_HEAD(&cc.migratepages);
+
+		compact_zone(zone, &cc);
+
+		VM_BUG_ON(!list_empty(&cc.freepages));
+		VM_BUG_ON(!list_empty(&cc.migratepages));
+	}
+
+	return 0;
+}
+
+/* Compact all nodes in the system */
+static int compact_nodes(void)
+{
+	int nid;
+
+	for_each_online_node(nid)
+		compact_node(nid);
+
+	return COMPACT_COMPLETE;
+}
+
+/* The written value is actually unused, all memory is compacted */
+int sysctl_compact_memory;
+
+/* This is the entry point for compacting all nodes via /proc/sys/vm */
+int sysctl_compaction_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *length, loff_t *ppos)
+{
+	if (write)
+		return compact_nodes();
+
+	return 0;
+}
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 09/11] Add /sys trigger for per-node memory compaction
  2010-03-12 16:41 [PATCH 0/11] Memory Compaction v4 Mel Gorman
                   ` (7 preceding siblings ...)
  2010-03-12 16:41 ` [PATCH 08/11] Add /proc trigger for memory compaction Mel Gorman
@ 2010-03-12 16:41 ` Mel Gorman
  2010-03-17  3:18   ` KOSAKI Motohiro
  2010-03-12 16:41 ` [PATCH 10/11] Direct compact when a high-order allocation fails Mel Gorman
  2010-03-12 16:41 ` [PATCH 11/11] Do not compact within a preferred zone after a compaction failure Mel Gorman
  10 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-12 16:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	linux-kernel, linux-mm

This patch adds a per-node sysfs file called compact. When the file is
written to, each zone in that node is compacted. The intention is that this
would be used by something like a job scheduler in a batch system before
a job starts so that the job can allocate the maximum number of
hugepages without significant start-up cost.
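
As a minimal user-space sketch (not part of the patch itself), a per-node
trigger could look like this, with the node id taken from the command line:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Compact a single NUMA node before placing a job on it. */
int main(int argc, char **argv)
{
        char path[64];
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <node-id>\n", argv[0]);
                return 1;
        }
        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%s/compact", argv[1]);

        fd = open(path, O_WRONLY);
        if (fd < 0) {
                perror(path);
                return 1;
        }
        if (write(fd, "1", 1) != 1)
                perror("write");
        close(fd);
        return 0;
}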

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 Documentation/ABI/testing/sysfs-devices-node |    7 +++++++
 drivers/base/node.c                          |    3 +++
 include/linux/compaction.h                   |   16 ++++++++++++++++
 mm/compaction.c                              |   23 +++++++++++++++++++++++
 4 files changed, 49 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-devices-node

diff --git a/Documentation/ABI/testing/sysfs-devices-node b/Documentation/ABI/testing/sysfs-devices-node
new file mode 100644
index 0000000..0cb286a
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-node
@@ -0,0 +1,7 @@
+What:		/sys/devices/system/node/nodeX/compact
+Date:		February 2010
+Contact:	Mel Gorman <mel@csn.ul.ie>
+Description:
+		When this file is written to, all memory within that node
+		will be compacted. When it completes, memory will be free
+		in as contiguous blocks as possible.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index ad43185..15fb30d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -15,6 +15,7 @@
 #include <linux/cpu.h>
 #include <linux/device.h>
 #include <linux/swap.h>
+#include <linux/compaction.h>
 
 static struct sysdev_class_attribute *node_state_attrs[];
 
@@ -242,6 +243,8 @@ int register_node(struct node *node, int num, struct node *parent)
 		scan_unevictable_register_node(node);
 
 		hugetlb_register_node(node);
+
+		compaction_register_node(node);
 	}
 	return error;
 }
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 52762d2..c94890b 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -11,4 +11,20 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 			void __user *buffer, size_t *length, loff_t *ppos);
 #endif /* CONFIG_COMPACTION */
 
+#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+extern int compaction_register_node(struct node *node);
+extern void compaction_unregister_node(struct node *node);
+
+#else
+
+static inline int compaction_register_node(struct node *node)
+{
+	return 0;
+}
+
+static inline void compaction_unregister_node(struct node *node)
+{
+}
+#endif /* CONFIG_COMPACTION && CONFIG_SYSFS && CONFIG_NUMA */
+
 #endif /* _LINUX_COMPACTION_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index 817aa5b..b85343c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -12,6 +12,7 @@
 #include <linux/compaction.h>
 #include <linux/mm_inline.h>
 #include <linux/sysctl.h>
+#include <linux/sysfs.h>
 #include "internal.h"
 
 /*
@@ -412,3 +413,25 @@ int sysctl_compaction_handler(struct ctl_table *table, int write,
 
 	return 0;
 }
+
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+ssize_t sysfs_compact_node(struct sys_device *dev,
+			struct sysdev_attribute *attr,
+			const char *buf, size_t count)
+{
+	compact_node(dev->id);
+
+	return count;
+}
+static SYSDEV_ATTR(compact, S_IWUSR, NULL, sysfs_compact_node);
+
+int compaction_register_node(struct node *node)
+{
+	return sysdev_create_file(&node->sysdev, &attr_compact);
+}
+
+void compaction_unregister_node(struct node *node)
+{
+	return sysdev_remove_file(&node->sysdev, &attr_compact);
+}
+#endif /* CONFIG_SYSFS && CONFIG_NUMA */
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-12 16:41 [PATCH 0/11] Memory Compaction v4 Mel Gorman
                   ` (8 preceding siblings ...)
  2010-03-12 16:41 ` [PATCH 09/11] Add /sys trigger for per-node " Mel Gorman
@ 2010-03-12 16:41 ` Mel Gorman
  2010-03-16  2:47   ` Minchan Kim
  2010-03-19  6:21   ` KOSAKI Motohiro
  2010-03-12 16:41 ` [PATCH 11/11] Do not compact within a preferred zone after a compaction failure Mel Gorman
  10 siblings, 2 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-12 16:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	linux-kernel, linux-mm

Ordinarily when a high-order allocation fails, direct reclaim is entered to
free pages to satisfy the allocation.  With this patch, it is determined
whether an allocation failed due to external fragmentation rather than low
memory and, if so, the calling process will compact until a suitable page is
freed. Compaction by moving pages in memory is considerably cheaper than
paging out to disk and works where there are locked pages or no swap. If
compaction fails to free a page of a suitable size, then reclaim will
still occur.

Direct compaction returns as soon as possible. As each block is compacted,
a check is made for whether a suitable page has been freed and, if so, it returns.
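
The decision of whether a failure is down to fragmentation or to low memory
is made with the fragmentation index exported from mm/vmstat.c below. As a
rough stand-alone illustration (not part of the patch; the helper name and
the sample numbers are made up, and the kernel's extra early-exit cases are
not reproduced here), the index behaves like this:

#include <stdio.h>

/*
 * Mirrors the return expression of __fragmentation_index(): the closer
 * to 1000, the more the lack of a suitable block is down to external
 * fragmentation rather than a shortage of free pages.
 */
static int frag_index(unsigned int order, unsigned long free_pages,
                      unsigned long free_blocks_total)
{
        unsigned long requested = 1UL << order;

        if (!free_blocks_total)
                return 0;       /* defensive: no free blocks at all */

        return 1000 - ((1000 + (free_pages * 1000 / requested)) /
                       free_blocks_total);
}

int main(void)
{
        /* 512 free pages scattered as 512 order-0 blocks, order-4 request */
        printf("fragmented:    %d\n", frag_index(4, 512, 512)); /* 936 */

        /* only 8 free pages in 2 blocks: failure is a lack of memory */
        printf("low on memory: %d\n", frag_index(4, 8, 2));     /* 250 */
        return 0;
}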

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/compaction.h |   16 +++++-
 include/linux/vmstat.h     |    1 +
 mm/compaction.c            |  118 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |   26 ++++++++++
 mm/vmstat.c                |   15 +++++-
 5 files changed, 172 insertions(+), 4 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index c94890b..b851428 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,14 +1,26 @@
 #ifndef _LINUX_COMPACTION_H
 #define _LINUX_COMPACTION_H
 
-/* Return values for compact_zone() */
+/* Return values for compact_zone() and try_to_compact_pages() */
 #define COMPACT_INCOMPLETE	0
-#define COMPACT_COMPLETE	1
+#define COMPACT_PARTIAL		1
+#define COMPACT_COMPLETE	2
 
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 			void __user *buffer, size_t *length, loff_t *ppos);
+
+extern int fragmentation_index(struct zone *zone, unsigned int order);
+extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
+			int order, gfp_t gfp_mask, nodemask_t *mask);
+#else
+static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
+			int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+	return COMPACT_INCOMPLETE;
+}
+
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 56e4b44..b4b4d34 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
 		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
+		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
diff --git a/mm/compaction.c b/mm/compaction.c
index b85343c..15589c6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -34,6 +34,8 @@ struct compact_control {
 	unsigned long nr_anon;
 	unsigned long nr_file;
 
+	unsigned int order;		/* order a direct compactor needs */
+	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
 };
 
@@ -304,10 +306,31 @@ static void update_nr_listpages(struct compact_control *cc)
 static inline int compact_finished(struct zone *zone,
 						struct compact_control *cc)
 {
+	unsigned int order;
+	unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
+
 	/* Compaction run completes if the migrate and free scanner meet */
 	if (cc->free_pfn <= cc->migrate_pfn)
 		return COMPACT_COMPLETE;
 
+	/* Compaction run is not finished if the watermark is not met */
+	if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
+		return COMPACT_INCOMPLETE;
+
+	if (cc->order == -1)
+		return COMPACT_INCOMPLETE;
+
+	/* Direct compactor: Is a suitable page free? */
+	for (order = cc->order; order < MAX_ORDER; order++) {
+		/* Job done if page is free of the right migratetype */
+		if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
+			return COMPACT_PARTIAL;
+
+		/* Job done if allocation would set block type */
+		if (order >= pageblock_order && zone->free_area[order].nr_free)
+			return COMPACT_PARTIAL;
+	}
+
 	return COMPACT_INCOMPLETE;
 }
 
@@ -353,6 +376,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	return ret;
 }
 
+static inline unsigned long compact_zone_order(struct zone *zone,
+						int order, gfp_t gfp_mask)
+{
+	struct compact_control cc = {
+		.nr_freepages = 0,
+		.nr_migratepages = 0,
+		.order = order,
+		.migratetype = allocflags_to_migratetype(gfp_mask),
+		.zone = zone,
+	};
+	INIT_LIST_HEAD(&cc.freepages);
+	INIT_LIST_HEAD(&cc.migratepages);
+
+	return compact_zone(zone, &cc);
+}
+
+/**
+ * try_to_compact_pages - Direct compact to satisfy a high-order allocation
+ * @zonelist: The zonelist used for the current allocation
+ * @order: The order of the current allocation
+ * @gfp_mask: The GFP mask of the current allocation
+ * @nodemask: The allowed nodes to allocate from
+ *
+ * This is the main entry point for direct page compaction.
+ */
+unsigned long try_to_compact_pages(struct zonelist *zonelist,
+			int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	int may_enter_fs = gfp_mask & __GFP_FS;
+	int may_perform_io = gfp_mask & __GFP_IO;
+	unsigned long watermark;
+	struct zoneref *z;
+	struct zone *zone;
+	int rc = COMPACT_INCOMPLETE;
+
+	/* Check whether it is worth even starting compaction */
+	if (order == 0 || !may_enter_fs || !may_perform_io)
+		return rc;
+
+	/*
+	 * We will not stall if the necessary conditions are not met for
+	 * migration but direct reclaim seems to account stalls similarly
+	 */
+	count_vm_event(COMPACTSTALL);
+
+	/* Compact each zone in the list */
+	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
+								nodemask) {
+		int fragindex;
+		int status;
+
+		/*
+		 * Watermarks for order-0 must be met for compaction. Note
+		 * the 2UL. This is because during migration, copies of
+		 * pages need to be allocated and for a short time, the
+		 * footprint is higher
+		 */
+		watermark = low_wmark_pages(zone) + (2UL << order);
+		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+			continue;
+
+		/*
+		 * fragmentation index determines if allocation failures are
+		 * due to low memory or external fragmentation
+		 *
+		 * index of -1 implies allocations might succeed depending
+		 * 	on watermarks
+		 * index < 500 implies alloc failure is due to lack of memory
+		 *
+		 * XXX: The choice of 500 is arbitrary. Reinvestigate
+		 *      appropriately to determine a sensible default.
+		 *      and what it means when watermarks are also taken
+		 *      into account. Consider making it a sysctl
+		 */
+		fragindex = fragmentation_index(zone, order);
+		if (fragindex >= 0 && fragindex <= 500)
+			continue;
+
+		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
+			rc = COMPACT_PARTIAL;
+			break;
+		}
+
+		status = compact_zone_order(zone, order, gfp_mask);
+		rc = max(status, rc);
+
+		if (zone_watermark_ok(zone, order, watermark, 0, 0))
+			break;
+	}
+
+	return rc;
+}
+
+
 /* Compact all zones within a node */
 static int compact_node(int nid)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9708143..e301108 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -49,6 +49,7 @@
 #include <linux/debugobjects.h>
 #include <linux/kmemleak.h>
 #include <linux/memory.h>
+#include <linux/compaction.h>
 #include <trace/events/kmem.h>
 #include <linux/ftrace_event.h>
 
@@ -1765,6 +1766,31 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	cond_resched();
 
+	/* Try memory compaction for high-order allocations before reclaim */
+	if (order) {
+		*did_some_progress = try_to_compact_pages(zonelist,
+						order, gfp_mask, nodemask);
+		if (*did_some_progress != COMPACT_INCOMPLETE) {
+			page = get_page_from_freelist(gfp_mask, nodemask,
+					order, zonelist, high_zoneidx,
+					alloc_flags, preferred_zone,
+					migratetype);
+			if (page) {
+				__count_vm_event(COMPACTSUCCESS);
+				return page;
+			}
+
+			/*
+			 * It's bad if compaction run occurs and fails.
+			 * The most likely reason is that pages exist,
+			 * but not enough to satisfy watermarks.
+			 */
+			count_vm_event(COMPACTFAIL);
+
+			cond_resched();
+		}
+	}
+
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
 	p->flags |= PF_MEMALLOC;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index af88647..c88f285 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -560,7 +560,7 @@ static int unusable_show(struct seq_file *m, void *arg)
  * The value can be used to determine if page reclaim or compaction
  * should be used
  */
-int fragmentation_index(unsigned int order, struct contig_page_info *info)
+int __fragmentation_index(unsigned int order, struct contig_page_info *info)
 {
 	unsigned long requested = 1UL << order;
 
@@ -580,6 +580,14 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
 	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
 }
 
+/* Same as __fragmentation index but allocs contig_page_info on stack */
+int fragmentation_index(struct zone *zone, unsigned int order)
+{
+	struct contig_page_info info;
+
+	fill_contig_page_info(zone, order, &info);
+	return __fragmentation_index(order, &info);
+}
 
 static void extfrag_show_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
@@ -595,7 +603,7 @@ static void extfrag_show_print(struct seq_file *m,
 				zone->name);
 	for (order = 0; order < MAX_ORDER; ++order) {
 		fill_contig_page_info(zone, order, &info);
-		index = fragmentation_index(order, &info);
+		index = __fragmentation_index(order, &info);
 		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
 	}
 
@@ -895,6 +903,9 @@ static const char * const vmstat_text[] = {
 	"compact_blocks_moved",
 	"compact_pages_moved",
 	"compact_pagemigrate_failed",
+	"compact_stall",
+	"compact_fail",
+	"compact_success",
 
 #ifdef CONFIG_HUGETLB_PAGE
 	"htlb_buddy_alloc_success",
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 11/11] Do not compact within a preferred zone after a compaction failure
  2010-03-12 16:41 [PATCH 0/11] Memory Compaction v4 Mel Gorman
                   ` (9 preceding siblings ...)
  2010-03-12 16:41 ` [PATCH 10/11] Direct compact when a high-order allocation fails Mel Gorman
@ 2010-03-12 16:41 ` Mel Gorman
  10 siblings, 0 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-12 16:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	linux-kernel, linux-mm

The fragmentation index may indicate that a failure is due to external
fragmentation, yet a compaction run can complete and the allocation still
fail. There are two obvious reasons as to why:

  o Page migration cannot move all pages so fragmentation remains
  o A suitable page may exist but watermarks are not met

In the event of compaction followed by allocation failure, this patch defers
further compaction for a short interval. It's only recorded on the
preferred zone, but that should be enough coverage. This could have been
implemented similarly to the zonelist_cache, but the increased size of the
zonelist did not appear to be justified.
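
As a rough stand-alone illustration (not part of the patch) of the deferral
window: time_before() is the kernel's wrap-safe "earlier than" test on
jiffies, and compaction is skipped while jiffies is before compact_resume.

#include <stdio.h>

/* wrap-safe "a is earlier than b", same semantics as the kernel macro */
#define time_before(a, b)       ((long)((a) - (b)) < 0)

int main(void)
{
        unsigned long hz = 250;                 /* example HZ */
        unsigned long jiffies = 1000;           /* pretend current time */
        /* defer_compaction(zone, jiffies + HZ/50) on failure */
        unsigned long compact_resume = jiffies + hz / 50;

        printf("deferred right after failure: %d\n",
               time_before(jiffies, compact_resume));   /* 1 */

        jiffies += hz / 50 + 1;                 /* ~20ms later */
        printf("deferred after the interval:  %d\n",
               time_before(jiffies, compact_resume));   /* 0 */
        return 0;
}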

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/compaction.h |   35 +++++++++++++++++++++++++++++++++++
 include/linux/mmzone.h     |    7 +++++++
 mm/page_alloc.c            |    5 ++++-
 3 files changed, 46 insertions(+), 1 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index b851428..bc7059d 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -14,6 +14,32 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask);
+
+/* defer_compaction - Do not compact within a zone until a given time */
+static inline void defer_compaction(struct zone *zone, unsigned long resume)
+{
+	/*
+	 * This function is called when compaction fails to result in a page
+	 * allocation success. This is somewhat unsatisfactory as the failure
+	 * to compact has nothing to do with time and everything to do with
+	 * the requested order, the number of free pages and watermarks. How
+	 * to wait on that is more unclear, but the answer would apply to
+	 * other areas where the VM waits based on time.
+	 */
+	zone->compact_resume = resume;
+}
+
+static inline int compaction_deferred(struct zone *zone)
+{
+	/* init once if necessary */
+	if (unlikely(!zone->compact_resume)) {
+		zone->compact_resume = jiffies;
+		return 0;
+	}
+
+	return time_before(jiffies, zone->compact_resume);
+}
+
 #else
 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask)
@@ -21,6 +47,15 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	return COMPACT_INCOMPLETE;
 }
 
+static inline void defer_compaction(struct zone *zone, unsigned long resume)
+{
+}
+
+static inline int compaction_deferred(struct zone *zone)
+{
+	return 1;
+}
+
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 37df0b3..99b7ecc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -321,6 +321,13 @@ struct zone {
 	unsigned long		*pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
 
+#ifdef CONFIG_COMPACTION
+	/*
+	 * If a compaction fails, do not try compaction again until
+	 * jiffies is after the value of compact_resume
+	 */
+	unsigned long		compact_resume;
+#endif
 
 	ZONE_PADDING(_pad1_)
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e301108..f481df2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1767,7 +1767,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	cond_resched();
 
 	/* Try memory compaction for high-order allocations before reclaim */
-	if (order) {
+	if (order && !compaction_deferred(preferred_zone)) {
 		*did_some_progress = try_to_compact_pages(zonelist,
 						order, gfp_mask, nodemask);
 		if (*did_some_progress != COMPACT_INCOMPLETE) {
@@ -1787,6 +1787,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 			 */
 			count_vm_event(COMPACTFAIL);
 
+			/* On failure, avoid compaction for a short time. */
+			defer_compaction(preferred_zone, jiffies + HZ/50);
+
 			cond_resched();
 		}
 	}
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* Re: [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration
  2010-03-12 16:41 ` [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
@ 2010-03-12 17:14   ` Rik van Riel
  2010-03-15  5:35   ` KAMEZAWA Hiroyuki
  2010-03-17  2:06   ` KOSAKI Motohiro
  2 siblings, 0 replies; 109+ messages in thread
From: Rik van Riel @ 2010-03-12 17:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KOSAKI Motohiro, linux-kernel,
	linux-mm

On 03/12/2010 11:41 AM, Mel Gorman wrote:
> For clarity of review, KSM and page migration have separate refcounts on
> the anon_vma. While clear, this is a waste of memory. This patch gets
> KSM and page migration to share their toys in a spirit of harmony.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
> Reviewed-by: Minchan Kim<minchan.kim@gmail.com>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 01/11] mm,migration: Take a reference to the anon_vma  before migrating
  2010-03-12 16:41 ` [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
@ 2010-03-14 15:01   ` Minchan Kim
  2010-03-15  5:06   ` KAMEZAWA Hiroyuki
  2010-03-17  1:44   ` KOSAKI Motohiro
  2 siblings, 0 replies; 109+ messages in thread
From: Minchan Kim @ 2010-03-14 15:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KOSAKI Motohiro, Rik van Riel,
	linux-kernel, linux-mm

On Sat, Mar 13, 2010 at 1:41 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
> locking an anon_vma and it does not appear to have sufficient locking to
> ensure the anon_vma does not disappear from under it.
>
> This patch copies an approach used by KSM to take a reference on the
> anon_vma while pages are being migrated. This should prevent rmap_walk()
> running into nasty surprises later because anon_vma has been freed.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

BTW, this additional refcount on the anon_vma is merged with KSM's by [3/11].
Looks good to me.


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped  anonymous pages
  2010-03-12 16:41 ` [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
@ 2010-03-15  0:28   ` Minchan Kim
  2010-03-15  5:34     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 109+ messages in thread
From: Minchan Kim @ 2010-03-15  0:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KOSAKI Motohiro, Rik van Riel,
	linux-kernel, linux-mm

Hi, Mel.
On Sat, Mar 13, 2010 at 1:41 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> rmap_walk_anon() was triggering errors in memory compaction that looks like
> use-after-free errors in anon_vma. The problem appears to be that between
> the page being isolated from the LRU and rcu_read_lock() being taken, the
> mapcount of the page dropped to 0 and the anon_vma was freed. This patch
> skips the migration of anon pages that are not mapped by anyone.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>  mm/migrate.c |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 98eaaf2..3c491e3 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -602,6 +602,16 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>         * just care Anon page here.
>         */
>        if (PageAnon(page)) {
> +               /*
> +                * If the page has no mappings any more, just bail. An
> +                * unmapped anon page is likely to be freed soon but worse,
> +                * it's possible its anon_vma disappeared between when
> +                * the page was isolated and when we reached here while
> +                * the RCU lock was not held
> +                */
> +               if (!page_mapcount(page))

Looking at the code around the mapcount of the page, I got confused.
I think the mapcount of a page is protected by the pte lock,
but I can't find the pte lock in unmap_and_move.
If I am right, what protects against a race between this condition check and
rcu_read_lock?
This patch makes the race window very small, but it can't remove the race totally.

I think I am missing something.
Please point it out. :)


> +                       goto uncharge;
> +
>                rcu_read_lock();
>                rcu_locked = 1;
>                anon_vma = page_anon_vma(page);
> --
> 1.6.5
>




-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating
  2010-03-12 16:41 ` [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
  2010-03-14 15:01   ` Minchan Kim
@ 2010-03-15  5:06   ` KAMEZAWA Hiroyuki
  2010-03-17  1:44   ` KOSAKI Motohiro
  2 siblings, 0 replies; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-15  5:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KOSAKI Motohiro, Rik van Riel,
	linux-kernel, linux-mm

On Fri, 12 Mar 2010 16:41:17 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
> locking an anon_vma and it does not appear to have sufficient locking to
> ensure the anon_vma does not disappear from under it.
> 
> This patch copies an approach used by KSM to take a reference on the
> anon_vma while pages are being migrated. This should prevent rmap_walk()
> running into nasty surprises later because anon_vma has been freed.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>  include/linux/rmap.h |   23 +++++++++++++++++++++++
>  mm/migrate.c         |   12 ++++++++++++
>  mm/rmap.c            |   10 +++++-----
>  3 files changed, 40 insertions(+), 5 deletions(-)
> 
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped  anonymous pages
  2010-03-15  0:28   ` Minchan Kim
@ 2010-03-15  5:34     ` KAMEZAWA Hiroyuki
  2010-03-15  6:28       ` Minchan Kim
  2010-03-15 11:28       ` Mel Gorman
  0 siblings, 2 replies; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-15  5:34 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Mon, 15 Mar 2010 09:28:08 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> Hi, Mel.
> On Sat, Mar 13, 2010 at 1:41 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > rmap_walk_anon() was triggering errors in memory compaction that looks like
> > use-after-free errors in anon_vma. The problem appears to be that between
> > the page being isolated from the LRU and rcu_read_lock() being taken, the
> > mapcount of the page dropped to 0 and the anon_vma was freed. This patch
> > skips the migration of anon pages that are not mapped by anyone.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> >  mm/migrate.c |   10 ++++++++++
> >  1 files changed, 10 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 98eaaf2..3c491e3 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -602,6 +602,16 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >         * just care Anon page here.
> >         */
> >        if (PageAnon(page)) {
> > +               /*
> > +                * If the page has no mappings any more, just bail. An
> > +                * unmapped anon page is likely to be freed soon but worse,
> > +                * it's possible its anon_vma disappeared between when
> > +                * the page was isolated and when we reached here while
> > +                * the RCU lock was not held
> > +                */
> > +               if (!page_mapcount(page))
> 
> As looking code about mapcount of page, I got confused.
> I think mapcount of page is protected by pte lock.
> But I can't find pte lock in unmap_and_move.
There is no pte_lock.

> If I am right, what protects race between this condition check and
> rcu_read_lock?
> This patch makes race window very small but It can't remove race totally.
> 
> I think I am missing something.
> Pz, point me out. :)
> 

Hmm. This is my understanding of the old story.

At migration:
  1. we increase page_count().
  2. isolate it from LRU.
  3. call try_to_unmap() under rcu_read_lock(). Then,
  4. replace pte with swp_entry_t made by PFN, under pte_lock.
  5. do migrate.
  6. remap new pages, under pte_lock().
  7. release rcu_read_lock().

Here, we don't care whether page->mapping holds valid anon_vma or not.

Assume a racy thread which calls zap_pte_range() (or some other path).

a) The thread finds a valid pte under pte_lock and successfully calls
   page_remove_rmap().
   In this case, the migration thread finds try_to_unmap doesn't unmap any pte.
   Then, at 6, remapping the pte will not work.
b) The thread finds a migration PTE (as a swap entry) in zap_page_range().
   In this case, migration doesn't find the migration PTE and the remap fails.

Why rcu_read_lock() is necessary..
 - When page_mapcount() goes to 0, we shouldn't trust page->mapping is valid.
 - Possible cases are
	i) anon_vma (= page->mapping) is freed and used for another object.
 	ii) anon_vma (= page->mapping) is freed
	iii) anon_vma (= page->mapping) is freed and used as anon_vma again.

Here, anon_vma_cachep is created by SLAB_DESTROY_BY_RCU. Then, the possible cases
are only ii) and iii). While the anon_vma is still an anon_vma, try_to_unmap and
remap_page can work well because of the list of vmas and the address check. IOW,
the remap routine just does nothing if the anon_vma is freed.

I'm not sure by what logic "use-after-free anon_vma" is caught. But yes,
there will be cases where "anon_vma is touched after being freed", I think.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration
  2010-03-12 16:41 ` [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
  2010-03-12 17:14   ` Rik van Riel
@ 2010-03-15  5:35   ` KAMEZAWA Hiroyuki
  2010-03-17  2:06   ` KOSAKI Motohiro
  2 siblings, 0 replies; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-15  5:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KOSAKI Motohiro, Rik van Riel,
	linux-kernel, linux-mm

On Fri, 12 Mar 2010 16:41:19 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> For clarity of review, KSM and page migration have separate refcounts on
> the anon_vma. While clear, this is a waste of memory. This patch gets
> KSM and page migration to share their toys in a spirit of harmony.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 05/11] Export unusable free space index via /proc/unusable_index
  2010-03-12 16:41 ` [PATCH 05/11] Export unusable free space index via /proc/unusable_index Mel Gorman
@ 2010-03-15  5:41   ` KAMEZAWA Hiroyuki
  2010-03-15  9:48     ` Mel Gorman
  2010-03-17  2:42   ` KOSAKI Motohiro
  1 sibling, 1 reply; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-15  5:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KOSAKI Motohiro, Rik van Riel,
	linux-kernel, linux-mm

On Fri, 12 Mar 2010 16:41:21 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> Unusable free space index is a measure of external fragmentation that
> takes the allocation size into account. For the most part, the huge page
> size will be the size of interest but not necessarily so it is exported
> on a per-order and per-zone basis via /proc/unusable_index.
> 
> The index is a value between 0 and 1. It can be expressed as a
> percentage by multiplying by 100 as documented in
> Documentation/filesystems/proc.txt.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>  Documentation/filesystems/proc.txt |   13 ++++-
>  mm/vmstat.c                        |  120 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 132 insertions(+), 1 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 5e132b5..5c4b0fb 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -452,6 +452,7 @@ Table 1-5: Kernel info in /proc
>   sys         See chapter 2                                     
>   sysvipc     Info of SysVIPC Resources (msg, sem, shm)		(2.4)
>   tty	     Info of tty drivers
> + unusable_index Additional page allocator information (see text)(2.5)
>   uptime      System uptime                                     
>   version     Kernel version                                    
>   video	     bttv info of video resources			(2.4)
> @@ -609,7 +610,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
>  available in ZONE_NORMAL, etc... 
>  
>  More information relevant to external fragmentation can be found in
> -pagetypeinfo.
> +pagetypeinfo and unusable_index
>  
>  > cat /proc/pagetypeinfo
>  Page block order: 9
> @@ -650,6 +651,16 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
>  also be allocatable although a lot of filesystem metadata may have to be
>  reclaimed to achieve this.
>  
> +> cat /proc/unusable_index
> +Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
> +Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
> +
> +The unusable free space index measures how much of the available free
> +memory cannot be used to satisfy an allocation of a given size and is a
> +value between 0 and 1. The higher the value, the more of free memory is
> +unusable and by implication, the worse the external fragmentation is. This
> +can be expressed as a percentage by multiplying by 100.
> +

I'm sorry, but how is this information different from buddyinfo?

Thanks,
-Kame



>  ..............................................................................
>  
>  meminfo:
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 7f760cb..ca42e10 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -453,6 +453,106 @@ static int frag_show(struct seq_file *m, void *arg)
>  	return 0;
>  }
>  
> +
> +struct contig_page_info {
> +	unsigned long free_pages;
> +	unsigned long free_blocks_total;
> +	unsigned long free_blocks_suitable;
> +};
> +
> +/*
> + * Calculate the number of free pages in a zone, how many contiguous
> + * pages are free and how many are large enough to satisfy an allocation of
> + * the target size. Note that this function makes to attempt to estimate
> + * how many suitable free blocks there *might* be if MOVABLE pages were
> + * migrated. Calculating that is possible, but expensive and can be
> + * figured out from userspace
> + */
> +static void fill_contig_page_info(struct zone *zone,
> +				unsigned int suitable_order,
> +				struct contig_page_info *info)
> +{
> +	unsigned int order;
> +
> +	info->free_pages = 0;
> +	info->free_blocks_total = 0;
> +	info->free_blocks_suitable = 0;
> +
> +	for (order = 0; order < MAX_ORDER; order++) {
> +		unsigned long blocks;
> +
> +		/* Count number of free blocks */
> +		blocks = zone->free_area[order].nr_free;
> +		info->free_blocks_total += blocks;
> +
> +		/* Count free base pages */
> +		info->free_pages += blocks << order;
> +
> +		/* Count the suitable free blocks */
> +		if (order >= suitable_order)
> +			info->free_blocks_suitable += blocks <<
> +						(order - suitable_order);
> +	}
> +}
> +
> +/*
> + * Return an index indicating how much of the available free memory is
> + * unusable for an allocation of the requested size.
> + */
> +static int unusable_free_index(unsigned int order,
> +				struct contig_page_info *info)
> +{
> +	/* No free memory is interpreted as all free memory is unusable */
> +	if (info->free_pages == 0)
> +		return 1000;
> +
> +	/*
> +	 * Index should be a value between 0 and 1. Return a value to 3
> +	 * decimal places.
> +	 *
> +	 * 0 => no fragmentation
> +	 * 1 => high fragmentation
> +	 */
> +	return ((info->free_pages - (info->free_blocks_suitable << order)) * 1000) / info->free_pages;
> +
> +}
> +
> +static void unusable_show_print(struct seq_file *m,
> +					pg_data_t *pgdat, struct zone *zone)
> +{
> +	unsigned int order;
> +	int index;
> +	struct contig_page_info info;
> +
> +	seq_printf(m, "Node %d, zone %8s ",
> +				pgdat->node_id,
> +				zone->name);
> +	for (order = 0; order < MAX_ORDER; ++order) {
> +		fill_contig_page_info(zone, order, &info);
> +		index = unusable_free_index(order, &info);
> +		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
> +	}
> +
> +	seq_putc(m, '\n');
> +}
> +
> +/*
> + * Display unusable free space index
> + * XXX: Could be a lot more efficient, but it's not a critical path
> + */
> +static int unusable_show(struct seq_file *m, void *arg)
> +{
> +	pg_data_t *pgdat = (pg_data_t *)arg;
> +
> +	/* check memoryless node */
> +	if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
> +		return 0;
> +
> +	walk_zones_in_node(m, pgdat, unusable_show_print);
> +
> +	return 0;
> +}
> +
>  static void pagetypeinfo_showfree_print(struct seq_file *m,
>  					pg_data_t *pgdat, struct zone *zone)
>  {
> @@ -603,6 +703,25 @@ static const struct file_operations pagetypeinfo_file_ops = {
>  	.release	= seq_release,
>  };
>  
> +static const struct seq_operations unusable_op = {
> +	.start	= frag_start,
> +	.next	= frag_next,
> +	.stop	= frag_stop,
> +	.show	= unusable_show,
> +};
> +
> +static int unusable_open(struct inode *inode, struct file *file)
> +{
> +	return seq_open(file, &unusable_op);
> +}
> +
> +static const struct file_operations unusable_file_ops = {
> +	.open		= unusable_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= seq_release,
> +};
> +
>  #ifdef CONFIG_ZONE_DMA
>  #define TEXT_FOR_DMA(xx) xx "_dma",
>  #else
> @@ -947,6 +1066,7 @@ static int __init setup_vmstat(void)
>  #ifdef CONFIG_PROC_FS
>  	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
>  	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
> +	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
>  	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
>  	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
>  #endif
> -- 
> 1.6.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped  anonymous pages
  2010-03-15  5:34     ` KAMEZAWA Hiroyuki
@ 2010-03-15  6:28       ` Minchan Kim
  2010-03-15  6:44         ` KAMEZAWA Hiroyuki
  2010-03-15 11:28       ` Mel Gorman
  1 sibling, 1 reply; 109+ messages in thread
From: Minchan Kim @ 2010-03-15  6:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Mon, Mar 15, 2010 at 2:34 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 15 Mar 2010 09:28:08 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> Hi, Mel.
>> On Sat, Mar 13, 2010 at 1:41 AM, Mel Gorman <mel@csn.ul.ie> wrote:
>> > rmap_walk_anon() was triggering errors in memory compaction that looks like
>> > use-after-free errors in anon_vma. The problem appears to be that between
>> > the page being isolated from the LRU and rcu_read_lock() being taken, the
>> > mapcount of the page dropped to 0 and the anon_vma was freed. This patch
>> > skips the migration of anon pages that are not mapped by anyone.
>> >
>> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> > Acked-by: Rik van Riel <riel@redhat.com>
>> > ---
>> >  mm/migrate.c |   10 ++++++++++
>> >  1 files changed, 10 insertions(+), 0 deletions(-)
>> >
>> > diff --git a/mm/migrate.c b/mm/migrate.c
>> > index 98eaaf2..3c491e3 100644
>> > --- a/mm/migrate.c
>> > +++ b/mm/migrate.c
>> > @@ -602,6 +602,16 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>> >         * just care Anon page here.
>> >         */
>> >        if (PageAnon(page)) {
>> > +               /*
>> > +                * If the page has no mappings any more, just bail. An
>> > +                * unmapped anon page is likely to be freed soon but worse,
>> > +                * it's possible its anon_vma disappeared between when
>> > +                * the page was isolated and when we reached here while
>> > +                * the RCU lock was not held
>> > +                */
>> > +               if (!page_mapcount(page))
>>
>> As looking code about mapcount of page, I got confused.
>> I think mapcount of page is protected by pte lock.
>> But I can't find pte lock in unmap_and_move.
> There is no pte_lock.
>
>> If I am right, what protects race between this condition check and
>> rcu_read_lock?
>> This patch makes race window very small but It can't remove race totally.
>>
>> I think I am missing something.
>> Pz, point me out. :)
>>
>
> Hmm. This is my understanding of old story.
>
> At migration.
>  1. we increase page_count().
>  2. isolate it from LRU.
>  3. call try_to_unmap() under rcu_read_lock(). Then,
>  4. replace pte with swp_entry_t made by PFN. under pte_lock.
>  5. do migarate
>  6. remap new pages. under pte_lock()>
>  7. release rcu_read_lock().
>
> Here, we don't care whether page->mapping holds valid anon_vma or not.
>
> Assume a racy threads which calls zap_pte_range() (or some other)
>
> a) When the thread finds valid pte under pte_lock and successfully call
>   page_remove_rmap().
>   In this case, migration thread finds try_to_unmap doesn't unmap any pte.
>   Then, at 6, remap pte will not work.
> b) When the thread finds migrateion PTE(as swap entry) in zap_page_range().
>   In this case, migration doesn't find migrateion PTE and remap fails.
>
> Why rcu_read_lock() is necessary..
>  - When page_mapcount() goes to 0, we shouldn't trust page->mapping is valid.
>  - Possible cases are
>        i) anon_vma (= page->mapping) is freed and used for other object.
>        ii) anon_vma (= page->mapping) is freed
>        iii) anon_vma (= page->mapping) is freed and used as anon_vma again.
>
> Here, anon_vma_cachep is created  by SLAB_DESTROY_BY_RCU. Then, possible cases
> are only ii) and iii). While anon_vma is anon_vma, try_to_unmap and remap_page
> can work well because of the list of vmas and address check. IOW, remap routine
> just do nothing if anon_vma is freed.
>
> I'm not sure by what logic "use-after-free anon_vma" is caught. But yes,
> there will be case, "anon_vma is touched after freed.", I think.
>
> Thanks,
> -Kame
>

Thanks for the detailed explanation, Kame.
But I still don't understand it fully, sorry.

Mel said he met "use-after-free errors in anon_vma",
so he added the check in unmap_and_move.

if (PageAnon(page)) {
 ....
 if (!page_mapcount(page))
   goto uncharge;
 rcu_read_lock();

My concern is: what protects the racy mapcount of the page?
For example,

CPU A                                 CPU B
unmap_and_move
page_mapcount check pass    zap_pte_range
<-- some stall -->                   pte_lock
<-- some stall -->                   page_remove_rmap(map_count is zero!)
<-- some stall -->                   pte_unlock
<-- some stall -->                   anon_vma_unlink
<-- some stall -->                   anon_vma free !!!!
rcu_read_lock
anon_vma has gone!!

I think the above scenario makes the "use-after-free" error again.
What prevents the above scenario?


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped  anonymous pages
  2010-03-15  6:28       ` Minchan Kim
@ 2010-03-15  6:44         ` KAMEZAWA Hiroyuki
  2010-03-15  7:09           ` KAMEZAWA Hiroyuki
  2010-03-15  7:11           ` Minchan Kim
  0 siblings, 2 replies; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-15  6:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Mon, 15 Mar 2010 15:28:15 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Mon, Mar 15, 2010 at 2:34 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 15 Mar 2010 09:28:08 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >
> >> Hi, Mel.
> >> On Sat, Mar 13, 2010 at 1:41 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> >> > rmap_walk_anon() was triggering errors in memory compaction that looks like
> >> > use-after-free errors in anon_vma. The problem appears to be that between
> >> > the page being isolated from the LRU and rcu_read_lock() being taken, the
> >> > mapcount of the page dropped to 0 and the anon_vma was freed. This patch
> >> > skips the migration of anon pages that are not mapped by anyone.
> >> >
> >> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >> > Acked-by: Rik van Riel <riel@redhat.com>
> >> > ---
> >> >  mm/migrate.c |   10 ++++++++++
> >> >  1 files changed, 10 insertions(+), 0 deletions(-)
> >> >
> >> > diff --git a/mm/migrate.c b/mm/migrate.c
> >> > index 98eaaf2..3c491e3 100644
> >> > --- a/mm/migrate.c
> >> > +++ b/mm/migrate.c
> >> > @@ -602,6 +602,16 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >> >         * just care Anon page here.
> >> >         */
> >> >        if (PageAnon(page)) {
> >> > +               /*
> >> > +                * If the page has no mappings any more, just bail. An
> >> > +                * unmapped anon page is likely to be freed soon but worse,
> >> > +                * it's possible its anon_vma disappeared between when
> >> > +                * the page was isolated and when we reached here while
> >> > +                * the RCU lock was not held
> >> > +                */
> >> > +               if (!page_mapcount(page))
> >>
> >> As looking code about mapcount of page, I got confused.
> >> I think mapcount of page is protected by pte lock.
> >> But I can't find pte lock in unmap_and_move.
> > There is no pte_lock.
> >
> >> If I am right, what protects race between this condition check and
> >> rcu_read_lock?
> >> This patch makes race window very small but It can't remove race totally.
> >>
> >> I think I am missing something.
> >> Pz, point me out. :)
> >>
> >
> > Hmm. This is my understanding of old story.
> >
> > At migration.
> >  1. we increase page_count().
> >  2. isolate it from LRU.
> >  3. call try_to_unmap() under rcu_read_lock(). Then,
> >  4. replace pte with swp_entry_t made by PFN. under pte_lock.
> >  5. do migarate
> >  6. remap new pages. under pte_lock()>
> >  7. release rcu_read_lock().
> >
> > Here, we don't care whether page->mapping holds valid anon_vma or not.
> >
> > Assume a racy threads which calls zap_pte_range() (or some other)
> >
> > a) When the thread finds valid pte under pte_lock and successfully call
> >   page_remove_rmap().
> >   In this case, migration thread finds try_to_unmap doesn't unmap any pte.
> >   Then, at 6, remap pte will not work.
> > b) When the thread finds migrateion PTE(as swap entry) in zap_page_range().
> >   In this case, migration doesn't find migrateion PTE and remap fails.
> >
> > Why rcu_read_lock() is necessary..
> >  - When page_mapcount() goes to 0, we shouldn't trust page->mapping is valid.
> >  - Possible cases are
> >        i) anon_vma (= page->mapping) is freed and used for other object.
> >        ii) anon_vma (= page->mapping) is freed
> >        iii) anon_vma (= page->mapping) is freed and used as anon_vma again.
> >
> > Here, anon_vma_cachep is created  by SLAB_DESTROY_BY_RCU. Then, possible cases
> > are only ii) and iii). While anon_vma is anon_vma, try_to_unmap and remap_page
> > can work well because of the list of vmas and address check. IOW, remap routine
> > just do nothing if anon_vma is freed.
> >
> > I'm not sure by what logic "use-after-free anon_vma" is caught. But yes,
> > there will be case, "anon_vma is touched after freed.", I think.
> >
> > Thanks,
> > -Kame
> >
> 
> Thanks for detail explanation, Kame.
> But it can't understand me enough, Sorry.
> 
> Mel said he met "use-after-free errors in anon_vma".
> So added the check in unmap_and_move.
> 
> if (PageAnon(page)) {
>  ....
>  if (!page_mapcount(page))
>    goto uncharge;
>  rcu_read_lock();
> 
> My concern what protects racy mapcount of the page?
> For example,
> 
> CPU A                                 CPU B
> unmap_and_move
> page_mapcount check pass    zap_pte_range
> <-- some stall -->                   pte_lock
> <-- some stall -->                   page_remove_rmap(map_count is zero!)
> <-- some stall -->                   pte_unlock
> <-- some stall -->                   anon_vma_unlink
> <-- some stall -->                   anon_vma free !!!!
> rcu_read_lock
> anon_vma has gone!!
> 
> I think above scenario make error "use-after-free", again.
> What prevent above scenario?
> 
I think this patch is not complete.
I guess this hunk in [1/11] is the trigger for the race.
==
+
+	/* Drop an anon_vma reference if we took one */
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
+		int empty = list_empty(&anon_vma->head);
+		spin_unlock(&anon_vma->lock);
+		if (empty)
+			anon_vma_free(anon_vma);
+	}
==
If my understanding above is correct, this modifies a freed anon_vma.
Then, a use-after-free happens. (In the old implementation there was no refcnt,
so there were no use-after-free operations.)


So, what I can think of now is that a patch like the following is necessary.

==
static inline struct anon_vma *anon_vma_alloc(void)
{
        struct anon_vma *anon_vma;

        anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
        if (anon_vma)
                atomic_set(&anon_vma->refcnt, 1);
        return anon_vma;
}

void anon_vma_free(struct anon_vma *anon_vma)
{
        /*
         * This is called when the anon_vma is...
         * - anon_vma->head becomes empty.
         * - a refcnt incremented by migration, ksm etc. is dropped.
         * - allocated but unused.
         */
        if (atomic_dec_and_test(&anon_vma->refcnt))
                kmem_cache_free(anon_vma_cachep, anon_vma);
}
==
Then all things become simple.
Overhead is a concern, but list_empty() helps us much.
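
As a minimal sketch of how a user such as migration would pin an anon_vma under
that plain-refcount scheme (illustration only; anon_vma_get() and the example
function are assumed names, not code from this series):

==
static inline void anon_vma_get(struct anon_vma *anon_vma)
{
        atomic_inc(&anon_vma->refcnt);
}

/* Illustration only: pin the anon_vma while the mapping is known to be live */
static void pin_anon_vma_example(struct page *page)
{
        struct anon_vma *anon_vma = NULL;

        rcu_read_lock();
        if (page_mapcount(page)) {
                anon_vma = page_anon_vma(page);
                anon_vma_get(anon_vma);
        }
        rcu_read_unlock();

        /* ... migration work would go here ... */

        if (anon_vma)
                anon_vma_free(anon_vma);   /* frees only when refcnt hits 0 */
}
==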

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped  anonymous pages
  2010-03-15  6:44         ` KAMEZAWA Hiroyuki
@ 2010-03-15  7:09           ` KAMEZAWA Hiroyuki
  2010-03-15 13:48             ` Minchan Kim
  2010-03-15  7:11           ` Minchan Kim
  1 sibling, 1 reply; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-15  7:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Mon, 15 Mar 2010 15:44:59 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 15 Mar 2010 15:28:15 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
> 
> > On Mon, Mar 15, 2010 at 2:34 PM, KAMEZAWA Hiroyuki
> > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Mon, 15 Mar 2010 09:28:08 +0900
> > > Minchan Kim <minchan.kim@gmail.com> wrote:

> > I think above scenario make error "use-after-free", again.
> > What prevent above scenario?
> > 
> I think this patch is not complete. 
> I guess this patch in [1/11] is trigger for the race.
> ==
> +
> +	/* Drop an anon_vma reference if we took one */
> +	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
> +		int empty = list_empty(&anon_vma->head);
> +		spin_unlock(&anon_vma->lock);
> +		if (empty)
> +			anon_vma_free(anon_vma);
> +	}
> ==
> If my understainding in above is correct, this "modify" freed anon_vma.
> Then, use-after-free happens. (In old implementation, there are no refcnt,
> so, there is no use-after-free ops.)
> 
Sorry, my understanding above was wrong. anon_vma->lock is modified even
in the old code. Sorry for the noise.

-Kame









^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped  anonymous pages
  2010-03-15  6:44         ` KAMEZAWA Hiroyuki
  2010-03-15  7:09           ` KAMEZAWA Hiroyuki
@ 2010-03-15  7:11           ` Minchan Kim
  1 sibling, 0 replies; 109+ messages in thread
From: Minchan Kim @ 2010-03-15  7:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm, Hugh Dickins

On Mon, Mar 15, 2010 at 3:44 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com>
>> Thanks for detail explanation, Kame.
>> But it can't understand me enough, Sorry.
>>
>> Mel said he met "use-after-free errors in anon_vma".
>> So added the check in unmap_and_move.
>>
>> if (PageAnon(page)) {
>>  ....
>>  if (!page_mapcount(page))
>>    goto uncharge;
>>  rcu_read_lock();
>>
>> My concern what protects racy mapcount of the page?
>> For example,
>>
>> CPU A                                 CPU B
>> unmap_and_move
>> page_mapcount check pass    zap_pte_range
>> <-- some stall -->                   pte_lock
>> <-- some stall -->                   page_remove_rmap(map_count is zero!)
>> <-- some stall -->                   pte_unlock
>> <-- some stall -->                   anon_vma_unlink
>> <-- some stall -->                   anon_vma free !!!!
>> rcu_read_lock
>> anon_vma has gone!!
>>
>> I think above scenario make error "use-after-free", again.
>> What prevent above scenario?
>>
> I think this patch is not complete.
> I guess this patch in [1/11] is trigger for the race.
> ==
> +
> +       /* Drop an anon_vma reference if we took one */
> +       if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
> +               int empty = list_empty(&anon_vma->head);
> +               spin_unlock(&anon_vma->lock);
> +               if (empty)
> +                       anon_vma_free(anon_vma);
> +       }
> ==
> If my understainding in above is correct, this "modify" freed anon_vma.
> Then, use-after-free happens. (In old implementation, there are no refcnt,
> so, there is no use-after-free ops.)
>

I agree.
Let's wait for Mel's response.

>
> So, what I can think of now is a patch like following is necessary.
>
> ==
> static inline struct anon_vma *anon_vma_alloc(void)
> {
>        struct anon_vma *anon_vma;
>        anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
>        atomic_set(&anon_vma->refcnt, 1);
> }
>
> void anon_vma_free(struct anon_vma *anon_vma)
> {
>        /*
>         * This called when anon_vma is..
>         * - anon_vma->vma_list becomes empty.
>         * - incremetned refcnt while migration, ksm etc.. is dropped.
>         * - allocated but unused.
>         */
>        if (atomic_dec_and_test(&anon_vma->refcnt))
>                kmem_cache_free(anon_vma_cachep, anon_vma);
> }
> ==
> Then all things will go simple.
> Overhead is concern but list_empty() helps us much.

When they made things complicated without an atomic op,
there was a reasonable reason, I think. :)

My opinion depends on you and the server guys (Hugh, Rik, Andrea Arcangeli and so on).


>
> Thanks,
> -Kame
>
>
>
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 05/11] Export unusable free space index via /proc/unusable_index
  2010-03-15  5:41   ` KAMEZAWA Hiroyuki
@ 2010-03-15  9:48     ` Mel Gorman
  0 siblings, 0 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-15  9:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KOSAKI Motohiro, Rik van Riel,
	linux-kernel, linux-mm

On Mon, Mar 15, 2010 at 02:41:24PM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 12 Mar 2010 16:41:21 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Unusable free space index is a measure of external fragmentation that
> > takes the allocation size into account. For the most part, the huge page
> > size will be the size of interest but not necessarily so it is exported
> > on a per-order and per-zone basis via /proc/unusable_index.
> > 
> > The index is a value between 0 and 1. It can be expressed as a
> > percentage by multiplying by 100 as documented in
> > Documentation/filesystems/proc.txt.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> >  Documentation/filesystems/proc.txt |   13 ++++-
> >  mm/vmstat.c                        |  120 ++++++++++++++++++++++++++++++++++++
> >  2 files changed, 132 insertions(+), 1 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > index 5e132b5..5c4b0fb 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -452,6 +452,7 @@ Table 1-5: Kernel info in /proc
> >   sys         See chapter 2                                     
> >   sysvipc     Info of SysVIPC Resources (msg, sem, shm)		(2.4)
> >   tty	     Info of tty drivers
> > + unusable_index Additional page allocator information (see text)(2.5)
> >   uptime      System uptime                                     
> >   version     Kernel version                                    
> >   video	     bttv info of video resources			(2.4)
> > @@ -609,7 +610,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
> >  available in ZONE_NORMAL, etc... 
> >  
> >  More information relevant to external fragmentation can be found in
> > -pagetypeinfo.
> > +pagetypeinfo and unusable_index
> >  
> >  > cat /proc/pagetypeinfo
> >  Page block order: 9
> > @@ -650,6 +651,16 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
> >  also be allocatable although a lot of filesystem metadata may have to be
> >  reclaimed to achieve this.
> >  
> > +> cat /proc/unusable_index
> > +Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
> > +Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
> > +
> > +The unusable free space index measures how much of the available free
> > +memory cannot be used to satisfy an allocation of a given size and is a
> > +value between 0 and 1. The higher the value, the more of free memory is
> > +unusable and by implication, the worse the external fragmentation is. This
> > +can be expressed as a percentage by multiplying by 100.
> > +
> 
> I'm sorry but how this information is different from buddyinfo ?
> 

This information can be calculated from buddyinfo by hand or by scripts if
necessary. The difference is in how the information is presented. It's far
easier to see at a glance the potential fragmentation at each order with this
file than with buddyinfo. I use this information in fragmentation-tests to
graph the index over time, but I also have the necessary scripts to parse
buddyinfo, so it's not a big deal for me.

I can drop this patch if necessary because none of the core code uses
it. It was for the convenience of a user.
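
As a rough illustration of the "by hand or by scripts" point, a small userspace
sketch that derives the same index from /proc/buddyinfo might look like the
following (the buddyinfo line format, MAX_ORDER of 11 and the order-9 target
are assumptions about a typical x86 configuration):

#include <stdio.h>

#define MAX_ORDER 11

int main(void)
{
	FILE *f = fopen("/proc/buddyinfo", "r");
	char node[32], zone[32];

	if (!f)
		return 1;

	/* each line: "Node 0, zone   Normal  n0 n1 ... n10" */
	while (fscanf(f, " Node %31[^,], zone %31s", node, zone) == 2) {
		unsigned long free_pages = 0, suitable = 0, nr, index;
		unsigned int order, target = 9;	/* 2MB huge page with 4K pages */

		for (order = 0; order < MAX_ORDER; order++) {
			if (fscanf(f, "%lu", &nr) != 1)
				return 1;
			free_pages += nr << order;
			if (order >= target)
				suitable += nr << (order - target);
		}

		/* same formula as unusable_free_index(), to 3 decimal places */
		index = free_pages ?
			((free_pages - (suitable << target)) * 1000) / free_pages : 1000;
		printf("Node %s, zone %-8s %lu.%03lu\n",
		       node, zone, index / 1000, index % 1000);
	}

	fclose(f);
	return 0;
}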

> Thanks,
> -Kame
> 
> 
> 
> >  ..............................................................................
> >  
> >  meminfo:
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 7f760cb..ca42e10 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -453,6 +453,106 @@ static int frag_show(struct seq_file *m, void *arg)
> >  	return 0;
> >  }
> >  
> > +
> > +struct contig_page_info {
> > +	unsigned long free_pages;
> > +	unsigned long free_blocks_total;
> > +	unsigned long free_blocks_suitable;
> > +};
> > +
> > +/*
> > + * Calculate the number of free pages in a zone, how many contiguous
> > + * pages are free and how many are large enough to satisfy an allocation of
> > + * the target size. Note that this function makes to attempt to estimate
> > + * how many suitable free blocks there *might* be if MOVABLE pages were
> > + * migrated. Calculating that is possible, but expensive and can be
> > + * figured out from userspace
> > + */
> > +static void fill_contig_page_info(struct zone *zone,
> > +				unsigned int suitable_order,
> > +				struct contig_page_info *info)
> > +{
> > +	unsigned int order;
> > +
> > +	info->free_pages = 0;
> > +	info->free_blocks_total = 0;
> > +	info->free_blocks_suitable = 0;
> > +
> > +	for (order = 0; order < MAX_ORDER; order++) {
> > +		unsigned long blocks;
> > +
> > +		/* Count number of free blocks */
> > +		blocks = zone->free_area[order].nr_free;
> > +		info->free_blocks_total += blocks;
> > +
> > +		/* Count free base pages */
> > +		info->free_pages += blocks << order;
> > +
> > +		/* Count the suitable free blocks */
> > +		if (order >= suitable_order)
> > +			info->free_blocks_suitable += blocks <<
> > +						(order - suitable_order);
> > +	}
> > +}
> > +
> > +/*
> > + * Return an index indicating how much of the available free memory is
> > + * unusable for an allocation of the requested size.
> > + */
> > +static int unusable_free_index(unsigned int order,
> > +				struct contig_page_info *info)
> > +{
> > +	/* No free memory is interpreted as all free memory is unusable */
> > +	if (info->free_pages == 0)
> > +		return 1000;
> > +
> > +	/*
> > +	 * Index should be a value between 0 and 1. Return a value to 3
> > +	 * decimal places.
> > +	 *
> > +	 * 0 => no fragmentation
> > +	 * 1 => high fragmentation
> > +	 */
> > +	return ((info->free_pages - (info->free_blocks_suitable << order)) * 1000) / info->free_pages;
> > +
> > +}
> > +
> > +static void unusable_show_print(struct seq_file *m,
> > +					pg_data_t *pgdat, struct zone *zone)
> > +{
> > +	unsigned int order;
> > +	int index;
> > +	struct contig_page_info info;
> > +
> > +	seq_printf(m, "Node %d, zone %8s ",
> > +				pgdat->node_id,
> > +				zone->name);
> > +	for (order = 0; order < MAX_ORDER; ++order) {
> > +		fill_contig_page_info(zone, order, &info);
> > +		index = unusable_free_index(order, &info);
> > +		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
> > +	}
> > +
> > +	seq_putc(m, '\n');
> > +}
> > +
> > +/*
> > + * Display unusable free space index
> > + * XXX: Could be a lot more efficient, but it's not a critical path
> > + */
> > +static int unusable_show(struct seq_file *m, void *arg)
> > +{
> > +	pg_data_t *pgdat = (pg_data_t *)arg;
> > +
> > +	/* check memoryless node */
> > +	if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
> > +		return 0;
> > +
> > +	walk_zones_in_node(m, pgdat, unusable_show_print);
> > +
> > +	return 0;
> > +}
> > +
> >  static void pagetypeinfo_showfree_print(struct seq_file *m,
> >  					pg_data_t *pgdat, struct zone *zone)
> >  {
> > @@ -603,6 +703,25 @@ static const struct file_operations pagetypeinfo_file_ops = {
> >  	.release	= seq_release,
> >  };
> >  
> > +static const struct seq_operations unusable_op = {
> > +	.start	= frag_start,
> > +	.next	= frag_next,
> > +	.stop	= frag_stop,
> > +	.show	= unusable_show,
> > +};
> > +
> > +static int unusable_open(struct inode *inode, struct file *file)
> > +{
> > +	return seq_open(file, &unusable_op);
> > +}
> > +
> > +static const struct file_operations unusable_file_ops = {
> > +	.open		= unusable_open,
> > +	.read		= seq_read,
> > +	.llseek		= seq_lseek,
> > +	.release	= seq_release,
> > +};
> > +
> >  #ifdef CONFIG_ZONE_DMA
> >  #define TEXT_FOR_DMA(xx) xx "_dma",
> >  #else
> > @@ -947,6 +1066,7 @@ static int __init setup_vmstat(void)
> >  #ifdef CONFIG_PROC_FS
> >  	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
> >  	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
> > +	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
> >  	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
> >  	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
> >  #endif
> > -- 
> > 1.6.5
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-15  5:34     ` KAMEZAWA Hiroyuki
  2010-03-15  6:28       ` Minchan Kim
@ 2010-03-15 11:28       ` Mel Gorman
  2010-03-15 12:48         ` Minchan Kim
  1 sibling, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-15 11:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Mon, Mar 15, 2010 at 02:34:20PM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 15 Mar 2010 09:28:08 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
> 
> > Hi, Mel.
> > On Sat, Mar 13, 2010 at 1:41 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > > rmap_walk_anon() was triggering errors in memory compaction that looks like
> > > use-after-free errors in anon_vma. The problem appears to be that between
> > > the page being isolated from the LRU and rcu_read_lock() being taken, the
> > > mapcount of the page dropped to 0 and the anon_vma was freed. This patch
> > > skips the migration of anon pages that are not mapped by anyone.
> > >
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > Acked-by: Rik van Riel <riel@redhat.com>
> > > ---
> > >  mm/migrate.c |   10 ++++++++++
> > >  1 files changed, 10 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index 98eaaf2..3c491e3 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -602,6 +602,16 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> > >         * just care Anon page here.
> > >         */
> > >        if (PageAnon(page)) {
> > > +               /*
> > > +                * If the page has no mappings any more, just bail. An
> > > +                * unmapped anon page is likely to be freed soon but worse,
> > > +                * it's possible its anon_vma disappeared between when
> > > +                * the page was isolated and when we reached here while
> > > +                * the RCU lock was not held
> > > +                */
> > > +               if (!page_mapcount(page))
> > 
> > As looking code about mapcount of page, I got confused.
> > I think mapcount of page is protected by pte lock.
> > But I can't find pte lock in unmap_and_move.
>
> There is no pte_lock.
> 

Indeed. It is manipulated while some other lock is held, but it can be read
without locks held. For example, when mapping a page either the anon_vma
lock or i_mmap_lock is held, but it is read without special locking in places
like page_referenced_ksm().
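
For reference, page_mapcount() in kernels of this era is essentially just an
atomic read, which is why it can be sampled without the pte lock. Roughly
(from memory, so treat the exact form as approximate):

static inline int page_mapcount(struct page *page)
{
	return atomic_read(&(page)->_mapcount) + 1;
}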

> > If I am right, what protects race between this condition check and
> > rcu_read_lock?
> > This patch makes race window very small but It can't remove race totally.
> > 
> > I think I am missing something.
> > Pz, point me out. :)
> > 
> 
> Hmm. This is my understanding of old story.
> 
> At migration.
>   1. we increase page_count().
>   2. isolate it from LRU.
>   3. call try_to_unmap() under rcu_read_lock(). Then, 
>   4. replace pte with swp_entry_t made by PFN. under pte_lock.
>   5. do migarate 
>   6. remap new pages. under pte_lock()>
>   7. release rcu_read_lock().
> 
> Here, we don't care whether page->mapping holds valid anon_vma or not.
> 
> Assume a racy threads which calls zap_pte_range() (or some other)
> 

I believe the race being hit is related to processes exiting. A racy thread calling
zap_pte_range() while pages within were being migrated does appear to be the problem.

> a) When the thread finds valid pte under pte_lock and successfully call
>    page_remove_rmap().
>    In this case, migration thread finds try_to_unmap doesn't unmap any pte.
>    Then, at 6, remap pte will not work.
> b) When the thread finds migrateion PTE(as swap entry) in zap_page_range().
>    In this case, migration doesn't find migrateion PTE and remap fails.
> 
> Why rcu_read_lock() is necessary..
>  - When page_mapcount() goes to 0, we shouldn't trust page->mapping is valid.

I also believe this to be true.

>  - Possible cases are
> 	i) anon_vma (= page->mapping) is freed and used for other object.
>  	ii) anon_vma (= page->mapping) is freed
> 	iii) anon_vma (= page->mapping) is freed and used as anon_vma again.
> 
> Here, anon_vma_cachep is created  by SLAB_DESTROY_BY_RCU. Then, possible cases
> are only ii) and iii).

I believe it's (ii) that was being hit.

> While anon_vma is anon_vma, try_to_unmap and remap_page
> can work well because of the list of vmas and address check. IOW, remap routine
> just do nothing if anon_vma is freed.
> 
> I'm not sure by what logic "use-after-free anon_vma" is caught. But yes,
> there will be case, "anon_vma is touched after freed.", I think.
> 

The use-after-free looks like

1. page_mapcount(page) was zero so the anon_vma was no longer reliable
2. rcu lock taken, but the anon_vma at this point can already be garbage because the
   process exited
3. call try_to_unmap, which looks up the anon_vma and locks it. This causes problems

I thought the race would be closed, but there is still a very tiny window there all
right. The following alternative should close it. What do you think?

        if (PageAnon(page)) {
		rcu_read_lock();

                /*
                 * If the page has no mappings any more, just bail. An
                 * unmapped anon page is likely to be freed soon but worse,
                 * it's possible its anon_vma disappeared between when
                 * the page was isolated and when we reached here while
                 * the RCU lock was not held
                 */
                if (!page_mapcount(page)) {
			rcu_read_unlock();
                        goto uncharge;
		}

                rcu_locked = 1;
                anon_vma = page_anon_vma(page);
                atomic_inc(&anon_vma->external_refcount);
        }

The rcu_unlock label is not used here because the reference counts were not taken in
the case where page_mapcount == 0.
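
For context, the exit labels in unmap_and_move() after patch 01/11 look roughly
like this (reconstructed from the quoted hunks, so the surrounding memcg details
are approximate):

rcu_unlock:
	/* Drop an anon_vma reference if we took one (added by patch 01/11) */
	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
		int empty = list_empty(&anon_vma->head);
		spin_unlock(&anon_vma->lock);
		if (empty)
			anon_vma_free(anon_vma);
	}

	if (rcu_locked)
		rcu_read_unlock();
uncharge:
	if (!charge)
		mem_cgroup_end_migration(mem, page, newpage);
unlock:
	unlock_page(page);

Bailing straight to uncharge therefore skips dropping a reference that was
never taken.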

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-15 11:28       ` Mel Gorman
@ 2010-03-15 12:48         ` Minchan Kim
  2010-03-15 14:21           ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: Minchan Kim @ 2010-03-15 12:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Mon, 2010-03-15 at 11:28 +0000, Mel Gorman wrote:
> The use after free looks like
> 
> 1. page_mapcount(page) was zero so anon_vma was no longer reliable
> 2. rcu lock taken but the anon_vma at this point can already be garbage because the
>    process exited
> 3. call try_to_unmap, looks up tha anon_vma and locks it. This causes problems
> 
> I thought the race would be closed but there is still a very tiny window there all
> right. The following alternative should close it. What do you think?
> 
>         if (PageAnon(page)) {
> 		rcu_read_lock();
> 
>                 /*
>                  * If the page has no mappings any more, just bail. An
>                  * unmapped anon page is likely to be freed soon but worse,
>                  * it's possible its anon_vma disappeared between when
>                  * the page was isolated and when we reached here while
>                  * the RCU lock was not held
>                  */
>                 if (!page_mapcount(page)) {
> 			rcu_read_unlock();
>                         goto uncharge;
> 		}
> 
>                 rcu_locked = 1;
>                 anon_vma = page_anon_vma(page);
>                 atomic_inc(&anon_vma->external_refcount);
>         }
> 
> The rcu_unlock label is not used here because the reference counts were not taken in
> the case where page_mapcount == 0.
> 

Looks good to me. 
Please repost the above code together with a comment describing your use-after-free scenario.


-- 
Kind regards,
Minchan Kim



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-12 16:41 ` [PATCH 07/11] Memory compaction core Mel Gorman
@ 2010-03-15 13:44   ` Minchan Kim
  2010-03-15 14:41     ` Mel Gorman
  2010-03-17 10:31   ` KOSAKI Motohiro
  1 sibling, 1 reply; 109+ messages in thread
From: Minchan Kim @ 2010-03-15 13:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KOSAKI Motohiro, Rik van Riel,
	linux-kernel, linux-mm

On Fri, 2010-03-12 at 16:41 +0000, Mel Gorman wrote:
> This patch is the core of a mechanism which compacts memory in a zone by
> relocating movable pages towards the end of the zone.
> 
> A single compaction run involves a migration scanner and a free scanner.
> Both scanners operate on pageblock-sized areas in the zone. The migration
> scanner starts at the bottom of the zone and searches for all movable pages
> within each area, isolating them onto a private list called migratelist.
> The free scanner starts at the top of the zone and searches for suitable
> areas and consumes the free pages within making them available for the
> migration scanner. The pages isolated for migration are then migrated to
> the newly isolated free pages.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

There are some nitpicks below. Otherwise, it looks good to me.

..

< snip >

> +/* Update the number of anon and file isolated pages in the zone) */
                                                single parenthesis ^ 

> +void update_zone_isolated(struct zone *zone, struct compact_control *cc)
> +{
> +	struct page *page;
> +	unsigned int count[NR_LRU_LISTS] = { 0, };
> +
> +	list_for_each_entry(page, &cc->migratepages, lru) {
> +		int lru = page_lru_base_type(page);
> +		count[lru]++;
> +	}
> +
> +	cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> +	cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> +	__mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
> +	__mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
> +}
> +

< snip >

> +static unsigned long isolate_migratepages(struct zone *zone,
> +					struct compact_control *cc)
> +{
> +	unsigned long low_pfn, end_pfn;
> +	struct list_head *migratelist;
> +
> +	low_pfn = cc->migrate_pfn;
> +	migratelist = &cc->migratepages;
> +
> +	/* Do not scan outside zone boundaries */
> +	if (low_pfn < zone->zone_start_pfn)
> +		low_pfn = zone->zone_start_pfn;
> +
> +	/* Setup to scan one block but not past where we are migrating to */
> +	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> +
> +	/* Do not cross the free scanner or scan within a memory hole */
> +	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> +		cc->migrate_pfn = end_pfn;
> +		return 0;
> +	}
> +
> +	migrate_prep();
> +
> +	/* Time to isolate some pages for migration */
> +	spin_lock_irq(&zone->lru_lock);
> +	for (; low_pfn < end_pfn; low_pfn++) {
> +		struct page *page;
> +		if (!pfn_valid_within(low_pfn))
> +			continue;
> +
> +		/* Get the page and skip if free */
> +		page = pfn_to_page(low_pfn);
> +		if (PageBuddy(page)) {
> +			low_pfn += (1 << page_order(page)) - 1;
> +			continue;
> +		}
> +
> +		if (!PageLRU(page) || PageUnevictable(page))
> +			continue;

Do we need these checks?
They are done by __isolate_lru_page.

An explicit check would improve code readability,
so if you want to keep it, I don't oppose it either.
But other callers of __isolate_lru_page don't check it, either.

> +		/* Try isolate the page */
> +		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
> +			del_page_from_lru_list(zone, page, page_lru(page));
> +			list_add(&page->lru, migratelist);
> +			mem_cgroup_del_lru(page);
> +			cc->nr_migratepages++;
> +		}
> +
> +		/* Avoid isolating too much */
> +		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
> +			break;
> +	}
> +
> +	update_zone_isolated(zone, cc);
> +
> +	spin_unlock_irq(&zone->lru_lock);
> +	cc->migrate_pfn = low_pfn;
> +
> +	return cc->nr_migratepages;
> +}
> +


-- 
Kind regards,
Minchan Kim



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped  anonymous pages
  2010-03-15  7:09           ` KAMEZAWA Hiroyuki
@ 2010-03-15 13:48             ` Minchan Kim
  0 siblings, 0 replies; 109+ messages in thread
From: Minchan Kim @ 2010-03-15 13:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Mon, Mar 15, 2010 at 4:09 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 15 Mar 2010 15:44:59 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> On Mon, 15 Mar 2010 15:28:15 +0900
>> Minchan Kim <minchan.kim@gmail.com> wrote:
>>
>> > On Mon, Mar 15, 2010 at 2:34 PM, KAMEZAWA Hiroyuki
>> > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > > On Mon, 15 Mar 2010 09:28:08 +0900
>> > > Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> > I think above scenario make error "use-after-free", again.
>> > What prevent above scenario?
>> >
>> I think this patch is not complete.
>> I guess this patch in [1/11] is trigger for the race.
>> ==
>> +
>> +     /* Drop an anon_vma reference if we took one */
>> +     if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
>> +             int empty = list_empty(&anon_vma->head);
>> +             spin_unlock(&anon_vma->lock);
>> +             if (empty)
>> +                     anon_vma_free(anon_vma);
>> +     }
>> ==
>> If my understainding in above is correct, this "modify" freed anon_vma.
>> Then, use-after-free happens. (In old implementation, there are no refcnt,
>> so, there is no use-after-free ops.)
>>
> Sorry, about above, my understanding was wrong. anon_vma->lock is modifed even
> in old code. Sorry for noise.

Not at all. Your kindness always helps and cheers other people up.
In addition, it gives others a good chance to consider things seriously.

Thanks, Kame.


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-15 12:48         ` Minchan Kim
@ 2010-03-15 14:21           ` Mel Gorman
  2010-03-15 14:33             ` Minchan Kim
                               ` (2 more replies)
  0 siblings, 3 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-15 14:21 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Mon, Mar 15, 2010 at 09:48:49PM +0900, Minchan Kim wrote:
> On Mon, 2010-03-15 at 11:28 +0000, Mel Gorman wrote:
> > The use after free looks like
> > 
> > 1. page_mapcount(page) was zero so anon_vma was no longer reliable
> > 2. rcu lock taken but the anon_vma at this point can already be garbage because the
> >    process exited
> > 3. call try_to_unmap, looks up tha anon_vma and locks it. This causes problems
> > 
> > I thought the race would be closed but there is still a very tiny window there all
> > right. The following alternative should close it. What do you think?
> > 
> >         if (PageAnon(page)) {
> > 		rcu_read_lock();
> > 
> >                 /*
> >                  * If the page has no mappings any more, just bail. An
> >                  * unmapped anon page is likely to be freed soon but worse,
> >                  * it's possible its anon_vma disappeared between when
> >                  * the page was isolated and when we reached here while
> >                  * the RCU lock was not held
> >                  */
> >                 if (!page_mapcount(page)) {
> > 			rcu_read_unlock();
> >                         goto uncharge;
> > 		}
> > 
> >                 rcu_locked = 1;
> >                 anon_vma = page_anon_vma(page);
> >                 atomic_inc(&anon_vma->external_refcount);
> >         }
> > 
> > The rcu_unlock label is not used here because the reference counts were not taken in
> > the case where page_mapcount == 0.
> > 
> 
> Please, repost above code with your use-after-free scenario comment.
> 

This will be the replacement patch so.

==== CUT HERE ====
mm,migration: Do not try to migrate unmapped anonymous pages

rmap_walk_anon() was triggering errors in memory compaction that look like
use-after-free errors. The problem is that between the page being isolated
from the LRU and rcu_read_lock() being taken, the mapcount of the page
drops to 0 and the anon_vma gets freed. This can happen during memory
compaction if pages being migrated belong to a process that exits before
migration completes. Hence, the use-after-free race looks like

 1. Page isolated for migration
 2. Process exits
 3. page_mapcount(page) drops to zero so the anon_vma is no longer reliable
 4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
 5. call try_to_unmap, which looks up the anon_vma and "locks" it, but the lock
    is garbage.

This patch checks the mapcount after the rcu lock is taken. If the
mapcount is zero, the anon_vma is assumed to be freed and no further
action is taken.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/migrate.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 98eaaf2..6eb1efe 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -603,6 +603,19 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	 */
 	if (PageAnon(page)) {
 		rcu_read_lock();
+
+		/*
+		 * If the page has no mappings any more, just bail. An
+		 * unmapped anon page is likely to be freed soon but worse,
+		 * it's possible its anon_vma disappeared between when
+		 * the page was isolated and when we reached here while
+		 * the RCU lock was not held
+		 */
+		if (!page_mapcount(page)) {
+			rcu_read_unlock();
+			goto uncharge;
+		}
+
 		rcu_locked = 1;
 		anon_vma = page_anon_vma(page);
 		atomic_inc(&anon_vma->migrate_refcount);

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped  anonymous pages
  2010-03-15 14:21           ` Mel Gorman
@ 2010-03-15 14:33             ` Minchan Kim
  2010-03-15 23:49             ` KAMEZAWA Hiroyuki
  2010-03-17  2:03             ` KOSAKI Motohiro
  2 siblings, 0 replies; 109+ messages in thread
From: Minchan Kim @ 2010-03-15 14:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Mon, Mar 15, 2010 at 11:21 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Mon, Mar 15, 2010 at 09:48:49PM +0900, Minchan Kim wrote:
>> On Mon, 2010-03-15 at 11:28 +0000, Mel Gorman wrote:
>> > The use after free looks like
>> >
>> > 1. page_mapcount(page) was zero so anon_vma was no longer reliable
>> > 2. rcu lock taken but the anon_vma at this point can already be garbage because the
>> >    process exited
>> > 3. call try_to_unmap, looks up tha anon_vma and locks it. This causes problems
>> >
>> > I thought the race would be closed but there is still a very tiny window there all
>> > right. The following alternative should close it. What do you think?
>> >
>> >         if (PageAnon(page)) {
>> >             rcu_read_lock();
>> >
>> >                 /*
>> >                  * If the page has no mappings any more, just bail. An
>> >                  * unmapped anon page is likely to be freed soon but worse,
>> >                  * it's possible its anon_vma disappeared between when
>> >                  * the page was isolated and when we reached here while
>> >                  * the RCU lock was not held
>> >                  */
>> >                 if (!page_mapcount(page)) {
>> >                     rcu_read_unlock();
>> >                         goto uncharge;
>> >             }
>> >
>> >                 rcu_locked = 1;
>> >                 anon_vma = page_anon_vma(page);
>> >                 atomic_inc(&anon_vma->external_refcount);
>> >         }
>> >
>> > The rcu_unlock label is not used here because the reference counts were not taken in
>> > the case where page_mapcount == 0.
>> >
>>
>> Please, repost above code with your use-after-free scenario comment.
>>
>
> This will be the replacement patch so.
>
> ==== CUT HERE ====
> mm,migration: Do not try to migrate unmapped anonymous pages
>
> rmap_walk_anon() was triggering errors in memory compaction that look like
> use-after-free errors. The problem is that between the page being isolated
> from the LRU and rcu_read_lock() being taken, the mapcount of the page
> dropped to 0 and the anon_vma gets freed. This can happen during memory
> compaction if pages being migrated belong to a process that exits before
> migration completes. Hence, the use-after-free race looks like
>
>  1. Page isolated for migration
>  2. Process exits
>  3. page_mapcount(page) drops to zero so anon_vma was no longer reliable
>  4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
>  4. call try_to_unmap, looks up tha anon_vma and "locks" it but the lock
>    is garbage.
>
> This patch checks the mapcount after the rcu lock is taken. If the
> mapcount is zero, the anon_vma is assumed to be freed and no further
> action is taken.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-15 13:44   ` Minchan Kim
@ 2010-03-15 14:41     ` Mel Gorman
  0 siblings, 0 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-15 14:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KOSAKI Motohiro, Rik van Riel,
	linux-kernel, linux-mm

On Mon, Mar 15, 2010 at 10:44:14PM +0900, Minchan Kim wrote:
> On Fri, 2010-03-12 at 16:41 +0000, Mel Gorman wrote:
> > This patch is the core of a mechanism which compacts memory in a zone by
> > relocating movable pages towards the end of the zone.
> > 
> > A single compaction run involves a migration scanner and a free scanner.
> > Both scanners operate on pageblock-sized areas in the zone. The migration
> > scanner starts at the bottom of the zone and searches for all movable pages
> > within each area, isolating them onto a private list called migratelist.
> > The free scanner starts at the top of the zone and searches for suitable
> > areas and consumes the free pages within making them available for the
> > migration scanner. The pages isolated for migration are then migrated to
> > the newly isolated free pages.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> 

Thanks

> There is below some nitpicks. Otherwise looks good to me. 
> 
> ..
> 
> < snip >
> 
> > +/* Update the number of anon and file isolated pages in the zone) */
>                                                 single parenthesis ^ 
> 

Fixed. If a V5 becomes necessary, the fix will be included.

> > +void update_zone_isolated(struct zone *zone, struct compact_control *cc)
> > +{
> > +	struct page *page;
> > +	unsigned int count[NR_LRU_LISTS] = { 0, };
> > +
> > +	list_for_each_entry(page, &cc->migratepages, lru) {
> > +		int lru = page_lru_base_type(page);
> > +		count[lru]++;
> > +	}
> > +
> > +	cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> > +	cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> > +	__mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
> > +	__mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
> > +}
> > +
> 
> < snip >
> 
> > +static unsigned long isolate_migratepages(struct zone *zone,
> > +					struct compact_control *cc)
> > +{
> > +	unsigned long low_pfn, end_pfn;
> > +	struct list_head *migratelist;
> > +
> > +	low_pfn = cc->migrate_pfn;
> > +	migratelist = &cc->migratepages;
> > +
> > +	/* Do not scan outside zone boundaries */
> > +	if (low_pfn < zone->zone_start_pfn)
> > +		low_pfn = zone->zone_start_pfn;
> > +
> > +	/* Setup to scan one block but not past where we are migrating to */
> > +	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> > +
> > +	/* Do not cross the free scanner or scan within a memory hole */
> > +	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> > +		cc->migrate_pfn = end_pfn;
> > +		return 0;
> > +	}
> > +
> > +	migrate_prep();
> > +
> > +	/* Time to isolate some pages for migration */
> > +	spin_lock_irq(&zone->lru_lock);
> > +	for (; low_pfn < end_pfn; low_pfn++) {
> > +		struct page *page;
> > +		if (!pfn_valid_within(low_pfn))
> > +			continue;
> > +
> > +		/* Get the page and skip if free */
> > +		page = pfn_to_page(low_pfn);
> > +		if (PageBuddy(page)) {
> > +			low_pfn += (1 << page_order(page)) - 1;
> > +			continue;
> > +		}
> > +
> > +		if (!PageLRU(page) || PageUnevictable(page))
> > +			continue;
> 
> Do we need this checks?
> It is done by __isolate_lru_page. 
> 
> Explicit check would make code readability good.
> So if you mind it, I don't oppose it, either. 
> But other caller of __isolate_lru_pages don't check it, either.
> 

The checks are no longer necessary. They were made at a time when I was calling
switch (__isolate_lru_page...) in a similar pattern to what happens
in mm/vmscan.c. In that pattern, -EINVAL is considered a bug, so I
deliberately skipped over these pages.
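
For context, the mm/vmscan.c pattern being referred to looks roughly like the
following inside the isolation loop (paraphrased from memory, so details may
differ):

		switch (__isolate_lru_page(page, mode, file)) {
		case 0:
			list_move(&page->lru, dst);
			nr_taken++;
			break;

		case -EBUSY:
			/* else it is being freed elsewhere */
			list_move(&page->lru, src);
			continue;

		default:
			BUG();	/* -EINVAL and friends are treated as bugs */
		}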

Thanks

> > +		/* Try isolate the page */
> > +		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
> > +			del_page_from_lru_list(zone, page, page_lru(page));
> > +			list_add(&page->lru, migratelist);
> > +			mem_cgroup_del_lru(page);
> > +			cc->nr_migratepages++;
> > +		}
> > +
> > +		/* Avoid isolating too much */
> > +		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
> > +			break;
> > +	}
> > +
> > +	update_zone_isolated(zone, cc);
> > +
> > +	spin_unlock_irq(&zone->lru_lock);
> > +	cc->migrate_pfn = low_pfn;
> > +
> > +	return cc->nr_migratepages;
> > +}
> > +
> 
> 
> -- 
> Kind regards,
> Minchan Kim
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-15 14:21           ` Mel Gorman
  2010-03-15 14:33             ` Minchan Kim
@ 2010-03-15 23:49             ` KAMEZAWA Hiroyuki
  2010-03-17  2:12               ` KAMEZAWA Hiroyuki
  2010-03-17  2:03             ` KOSAKI Motohiro
  2 siblings, 1 reply; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-15 23:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Minchan Kim, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Mon, 15 Mar 2010 14:21:24 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> On Mon, Mar 15, 2010 at 09:48:49PM +0900, Minchan Kim wrote:
> > On Mon, 2010-03-15 at 11:28 +0000, Mel Gorman wrote:
> > > The use after free looks like
> > > 
> > > 1. page_mapcount(page) was zero so anon_vma was no longer reliable
> > > 2. rcu lock taken but the anon_vma at this point can already be garbage because the
> > >    process exited
> > > 3. call try_to_unmap, looks up tha anon_vma and locks it. This causes problems
> > > 
> > > I thought the race would be closed but there is still a very tiny window there all
> > > right. The following alternative should close it. What do you think?
> > > 
> > >         if (PageAnon(page)) {
> > > 		rcu_read_lock();
> > > 
> > >                 /*
> > >                  * If the page has no mappings any more, just bail. An
> > >                  * unmapped anon page is likely to be freed soon but worse,
> > >                  * it's possible its anon_vma disappeared between when
> > >                  * the page was isolated and when we reached here while
> > >                  * the RCU lock was not held
> > >                  */
> > >                 if (!page_mapcount(page)) {
> > > 			rcu_read_unlock();
> > >                         goto uncharge;
> > > 		}
> > > 
> > >                 rcu_locked = 1;
> > >                 anon_vma = page_anon_vma(page);
> > >                 atomic_inc(&anon_vma->external_refcount);
> > >         }
> > > 
> > > The rcu_unlock label is not used here because the reference counts were not taken in
> > > the case where page_mapcount == 0.
> > > 
> > 
> > Please, repost above code with your use-after-free scenario comment.
> > 
> 
> This will be the replacement patch so.
> 
> ==== CUT HERE ====
> mm,migration: Do not try to migrate unmapped anonymous pages
> 
> rmap_walk_anon() was triggering errors in memory compaction that look like
> use-after-free errors. The problem is that between the page being isolated
> from the LRU and rcu_read_lock() being taken, the mapcount of the page
> dropped to 0 and the anon_vma gets freed. This can happen during memory
> compaction if pages being migrated belong to a process that exits before
> migration completes. Hence, the use-after-free race looks like
> 
>  1. Page isolated for migration
>  2. Process exits
>  3. page_mapcount(page) drops to zero so anon_vma was no longer reliable
>  4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
>  4. call try_to_unmap, looks up tha anon_vma and "locks" it but the lock
>     is garbage.
> 
> This patch checks the mapcount after the rcu lock is taken. If the
> mapcount is zero, the anon_vma is assumed to be freed and no further
> action is taken.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


> ---
>  mm/migrate.c |   13 +++++++++++++
>  1 files changed, 13 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 98eaaf2..6eb1efe 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -603,6 +603,19 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  	 */
>  	if (PageAnon(page)) {
>  		rcu_read_lock();
> +
> +		/*
> +		 * If the page has no mappings any more, just bail. An
> +		 * unmapped anon page is likely to be freed soon but worse,
> +		 * it's possible its anon_vma disappeared between when
> +		 * the page was isolated and when we reached here while
> +		 * the RCU lock was not held
> +		 */
> +		if (!page_mapcount(page)) {
> +			rcu_read_unlock();
> +			goto uncharge;
> +		}
> +
>  		rcu_locked = 1;
>  		anon_vma = page_anon_vma(page);
>  		atomic_inc(&anon_vma->migrate_refcount);
> 


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-12 16:41 ` [PATCH 10/11] Direct compact when a high-order allocation fails Mel Gorman
@ 2010-03-16  2:47   ` Minchan Kim
  2010-03-19  6:21   ` KOSAKI Motohiro
  1 sibling, 0 replies; 109+ messages in thread
From: Minchan Kim @ 2010-03-16  2:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KOSAKI Motohiro, Rik van Riel,
	linux-kernel, linux-mm

On Sat, Mar 13, 2010 at 1:41 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> Ordinarily when a high-order allocation fails, direct reclaim is entered to
> free pages to satisfy the allocation.  With this patch, it is determined if
> an allocation failed due to external fragmentation instead of low memory
> and if so, the calling process will compact until a suitable page is
> freed. Compaction by moving pages in memory is considerably cheaper than
> paging out to disk and works where there are locked pages or no swap. If
> compaction fails to free a page of a suitable size, then reclaim will
> still occur.
>
> Direct compaction returns as soon as possible. As each block is compacted,
> it is checked if a suitable page has been freed and if so, it returns.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

At least, I can't find any more faults. :)

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating
  2010-03-12 16:41 ` [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
  2010-03-14 15:01   ` Minchan Kim
  2010-03-15  5:06   ` KAMEZAWA Hiroyuki
@ 2010-03-17  1:44   ` KOSAKI Motohiro
  2010-03-17 11:45     ` Mel Gorman
  2 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-17  1:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

>  rcu_unlock:
> +
> +	/* Drop an anon_vma reference if we took one */
> +	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
> +		int empty = list_empty(&anon_vma->head);
> +		spin_unlock(&anon_vma->lock);
> +		if (empty)
> +			anon_vma_free(anon_vma);
> +	}
> +

Why don't we check ksm_refcount here? Also, why doesn't drop_anon_vma()
need to check migrate_refcount?

Plus, if we add this logic, we can remove SLAB_DESTROY_BY_RCU from
anon_vma_cachep and rcu_read_lock() from unmap_and_move(), I think.
Those exist to cope with the anon_vma being recycled, but never freeing
it directly means no memory recycling.
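
A minimal sketch of the kind of shared drop helper implied here, built on the
external_refcount from patch 03/11 (the body below is an illustration, not the
posted code):

==
static void drop_anon_vma(struct anon_vma *anon_vma)
{
	if (atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->lock)) {
		int empty = list_empty(&anon_vma->head);

		spin_unlock(&anon_vma->lock);
		if (empty)
			anon_vma_free(anon_vma);
	}
}
==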




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-15 14:21           ` Mel Gorman
  2010-03-15 14:33             ` Minchan Kim
  2010-03-15 23:49             ` KAMEZAWA Hiroyuki
@ 2010-03-17  2:03             ` KOSAKI Motohiro
  2010-03-17 11:51               ` Mel Gorman
  2 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-17  2:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Minchan Kim, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Rik van Riel, linux-kernel, linux-mm

> mm,migration: Do not try to migrate unmapped anonymous pages
> 
> rmap_walk_anon() was triggering errors in memory compaction that look like
> use-after-free errors. The problem is that between the page being isolated
> from the LRU and rcu_read_lock() being taken, the mapcount of the page
> dropped to 0 and the anon_vma gets freed. This can happen during memory
> compaction if pages being migrated belong to a process that exits before
> migration completes. Hence, the use-after-free race looks like
> 
>  1. Page isolated for migration
>  2. Process exits
>  3. page_mapcount(page) drops to zero so anon_vma was no longer reliable
>  4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
>  4. call try_to_unmap, looks up tha anon_vma and "locks" it but the lock
>     is garbage.
> 
> This patch checks the mapcount after the rcu lock is taken. If the
> mapcount is zero, the anon_vma is assumed to be freed and no further
> action is taken.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>  mm/migrate.c |   13 +++++++++++++
>  1 files changed, 13 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 98eaaf2..6eb1efe 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -603,6 +603,19 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  	 */
>  	if (PageAnon(page)) {
>  		rcu_read_lock();
> +
> +		/*
> +		 * If the page has no mappings any more, just bail. An
> +		 * unmapped anon page is likely to be freed soon but worse,
> +		 * it's possible its anon_vma disappeared between when
> +		 * the page was isolated and when we reached here while
> +		 * the RCU lock was not held
> +		 */
> +		if (!page_mapcount(page)) {
> +			rcu_read_unlock();
> +			goto uncharge;
> +		}

I don't understand what this check prevents. Why doesn't the following scenario still happen?

 1. Page isolated for migration
 2. Passed this if (!page_mapcount(page)) check
 3. Process exits
 4. page_mapcount(page) drops to zero so the anon_vma is no longer reliable

Traditionally, the page migration logic could touch a garbage anon_vma, but
SLAB_DESTROY_BY_RCU prevented any disaster. Is this concept now broken?




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration
  2010-03-12 16:41 ` [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
  2010-03-12 17:14   ` Rik van Riel
  2010-03-15  5:35   ` KAMEZAWA Hiroyuki
@ 2010-03-17  2:06   ` KOSAKI Motohiro
  2 siblings, 0 replies; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-17  2:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> For clarity of review, KSM and page migration have separate refcounts on
> the anon_vma. While clear, this is a waste of memory. This patch gets
> KSM and page migration to share their toys in a spirit of harmony.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-15 23:49             ` KAMEZAWA Hiroyuki
@ 2010-03-17  2:12               ` KAMEZAWA Hiroyuki
  2010-03-17  3:00                 ` Minchan Kim
  2010-03-17 12:07                 ` Mel Gorman
  0 siblings, 2 replies; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-17  2:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Minchan Kim, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Tue, 16 Mar 2010 08:49:34 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 15 Mar 2010 14:21:24 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > On Mon, Mar 15, 2010 at 09:48:49PM +0900, Minchan Kim wrote:
> > > On Mon, 2010-03-15 at 11:28 +0000, Mel Gorman wrote:
> > > > The use after free looks like
> > > > 
> > > > 1. page_mapcount(page) was zero so anon_vma was no longer reliable
> > > > 2. rcu lock taken but the anon_vma at this point can already be garbage because the
> > > >    process exited
> > > > 3. call try_to_unmap, looks up tha anon_vma and locks it. This causes problems
> > > > 
> > > > I thought the race would be closed but there is still a very tiny window there all
> > > > right. The following alternative should close it. What do you think?
> > > > 
> > > >         if (PageAnon(page)) {
> > > > 		rcu_read_lock();
> > > > 
> > > >                 /*
> > > >                  * If the page has no mappings any more, just bail. An
> > > >                  * unmapped anon page is likely to be freed soon but worse,
> > > >                  * it's possible its anon_vma disappeared between when
> > > >                  * the page was isolated and when we reached here while
> > > >                  * the RCU lock was not held
> > > >                  */
> > > >                 if (!page_mapcount(page)) {
> > > > 			rcu_read_unlock();
> > > >                         goto uncharge;
> > > > 		}
> > > > 
> > > >                 rcu_locked = 1;
> > > >                 anon_vma = page_anon_vma(page);
> > > >                 atomic_inc(&anon_vma->external_refcount);
> > > >         }
> > > > 
> > > > The rcu_unlock label is not used here because the reference counts were not taken in
> > > > the case where page_mapcount == 0.
> > > > 
> > > 
> > > Please, repost above code with your use-after-free scenario comment.
> > > 
> > 
> > This will be the replacement patch so.
> > 
> > ==== CUT HERE ====
> > mm,migration: Do not try to migrate unmapped anonymous pages
> > 
> > rmap_walk_anon() was triggering errors in memory compaction that look like
> > use-after-free errors. The problem is that between the page being isolated
> > from the LRU and rcu_read_lock() being taken, the mapcount of the page
> > dropped to 0 and the anon_vma gets freed. This can happen during memory
> > compaction if pages being migrated belong to a process that exits before
> > migration completes. Hence, the use-after-free race looks like
> > 
> >  1. Page isolated for migration
> >  2. Process exits
> >  3. page_mapcount(page) drops to zero so anon_vma was no longer reliable
> >  4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
> >  4. call try_to_unmap, looks up tha anon_vma and "locks" it but the lock
> >     is garbage.
> > 
> > This patch checks the mapcount after the rcu lock is taken. If the
> > mapcount is zero, the anon_vma is assumed to be freed and no further
> > action is taken.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> 
> Reviewd-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 

BTW, I suspect freeing the anon_vma can still happen even when we check the mapcount.

"unmap" is 2-stage operation.
	1. unmap_vmas() => modify ptes, free pages, etc.
	2. free_pgtables() => free pgtables, unlink vma and free it.

Then, if migration is slow enough:

	Migration():				Exit():
	check mapcount
	rcu_read_lock
	pte_lock				
	replace pte with migration pte		
	pte_unlock
						pte_lock
	copy page etc...			zap pte (clear pte)
						pte_unlock
						free_pgtables
						->free vma
						->free anon_vma
	pte_lock
	remap pte with new pfn(fail)
	pte_unlock

	lock anon_vma->lock		# modification after free.
	check list is empty
	unlock anon_vma->lock
	free anon_vma
	rcu_read_unlock


Hmm. IIUC, the anon_vma is allocated SLAB_DESTROY_BY_RCU. Then, while
rcu_read_lock() is held, an anon_vma stays an anon_vma even if freed, but it
may be reused as the anon_vma of someone else.
(IOW, it may be reused but is never pushed back to general purpose memory
 until the RCU grace period passes.)
So touching anon_vma->lock never causes any corruption.

Does the use-after-free check behave correctly for SLAB_DESTROY_BY_RCU?
The above case is not a use-after-free; it's a safe and expected sequence.
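
For reference, the usual SLAB_DESTROY_BY_RCU access pattern looks roughly
like the sketch below (illustration only, not code from this series): the
object may be freed and even reused as another anon_vma, but the memory
stays type-stable until the grace period, so taking the lock cannot corrupt
unrelated memory and anything found under the lock must be re-validated.

	rcu_read_lock();
	anon_vma = page_anon_vma(page);	/* may already have been freed */
	if (anon_vma) {
		/* safe: the memory is still some anon_vma until the grace period */
		spin_lock(&anon_vma->lock);
		/* re-validate every mapping before trusting it */
		spin_unlock(&anon_vma->lock);
	}
	rcu_read_unlock();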

Thanks,
-Kame



> > ---
> >  mm/migrate.c |   13 +++++++++++++
> >  1 files changed, 13 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 98eaaf2..6eb1efe 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -603,6 +603,19 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >  	 */
> >  	if (PageAnon(page)) {
> >  		rcu_read_lock();
> > +
> > +		/*
> > +		 * If the page has no mappings any more, just bail. An
> > +		 * unmapped anon page is likely to be freed soon but worse,
> > +		 * it's possible its anon_vma disappeared between when
> > +		 * the page was isolated and when we reached here while
> > +		 * the RCU lock was not held
> > +		 */
> > +		if (!page_mapcount(page)) {
> > +			rcu_read_unlock();
> > +			goto uncharge;
> > +		}
> > +
> >  		rcu_locked = 1;
> >  		anon_vma = page_anon_vma(page);
> >  		atomic_inc(&anon_vma->migrate_refcount);
> > 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
  2010-03-12 16:41 ` [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
@ 2010-03-17  2:28   ` KOSAKI Motohiro
  2010-03-17 11:32     ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-17  2:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
> being able to hot-remove memory. The main users of page migration such as
> sys_move_pages(), sys_migrate_pages() and cpuset process migration are
> only beneficial on NUMA so it makes sense.
> 
> As memory compaction will operate within a zone and is useful on both NUMA
> and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
> user selects CONFIG_COMPACTION as an option.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/Kconfig |   20 ++++++++++++++++----
>  1 files changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 9c61158..04e241b 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -172,17 +172,29 @@ config SPLIT_PTLOCK_CPUS
>  	default "4"
>  
>  #
> +# support for memory compaction
> +config COMPACTION
> +	bool "Allow for memory compaction"
> +	def_bool y
> +	select MIGRATION
> +	depends on EXPERIMENTAL && HUGETLBFS && MMU
> +	help
> +	  Allows the compaction of memory for the allocation of huge pages.
> +

If 'select MIGRATION' works, we can remove the "depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE"
line from config MIGRATION.



> +#
>  # support for page migration
>  #
>  config MIGRATION
>  	bool "Page migration"
>  	def_bool y
> -	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE
> +	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION
>  	help
>  	  Allows the migration of the physical location of pages of processes
> -	  while the virtual addresses are not changed. This is useful for
> -	  example on NUMA systems to put pages nearer to the processors accessing
> -	  the page.
> +	  while the virtual addresses are not changed. This is useful in
> +	  two situations. The first is on NUMA systems to put pages nearer
> +	  to the processors accessing. The second is when allocating huge
> +	  pages as migration can relocate pages to satisfy a huge page
> +	  allocation instead of reclaiming.
>  
>  config PHYS_ADDR_T_64BIT
>  	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
> -- 
> 1.6.5
> 




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 05/11] Export unusable free space index via /proc/unusable_index
  2010-03-12 16:41 ` [PATCH 05/11] Export unusable free space index via /proc/unusable_index Mel Gorman
  2010-03-15  5:41   ` KAMEZAWA Hiroyuki
@ 2010-03-17  2:42   ` KOSAKI Motohiro
  1 sibling, 0 replies; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-17  2:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> Unusable free space index is a measure of external fragmentation that
> takes the allocation size into account. For the most part, the huge page
> size will be the size of interest but not necessarily so it is exported
> on a per-order and per-zone basis via /proc/unusable_index.
> 
> The index is a value between 0 and 1. It can be expressed as a
> percentage by multiplying by 100 as documented in
> Documentation/filesystems/proc.txt.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Acked-by: Rik van Riel <riel@redhat.com>

thanks.

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 06/11] Export fragmentation index via /proc/extfrag_index
  2010-03-12 16:41 ` [PATCH 06/11] Export fragmentation index via /proc/extfrag_index Mel Gorman
@ 2010-03-17  2:49   ` KOSAKI Motohiro
  2010-03-17 11:33     ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-17  2:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> +/*
> + * A fragmentation index only makes sense if an allocation of a requested
> + * size would fail. If that is true, the fragmentation index indicates
> + * whether external fragmentation or a lack of memory was the problem.
> + * The value can be used to determine if page reclaim or compaction
> + * should be used
> + */
> +int fragmentation_index(unsigned int order, struct contig_page_info *info)
> +{
> +	unsigned long requested = 1UL << order;
> +
> +	if (!info->free_blocks_total)
> +		return 0;
> +
> +	/* Fragmentation index only makes sense when a request would fail */
> +	if (info->free_blocks_suitable)
> +		return -1000;
> +
> +	/*
> +	 * Index is between 0 and 1 so return within 3 decimal places
> +	 *
> +	 * 0 => allocation would fail due to lack of memory
> +	 * 1 => allocation would fail due to fragmentation
> +	 */
> +	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
> +}

Dumb question.

Your paper (http://portal.acm.org/citation.cfm?id=1375634.1375641) says

fragmentation_index = 1 - (TotalFree/SizeRequested)/BlocksFree

but your code has an extra '1000+'. Why?



Probably I haven't understood the intention of this calculation.




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped  anonymous pages
  2010-03-17  2:12               ` KAMEZAWA Hiroyuki
@ 2010-03-17  3:00                 ` Minchan Kim
  2010-03-17  3:15                   ` KAMEZAWA Hiroyuki
  2010-03-17 12:07                 ` Mel Gorman
  1 sibling, 1 reply; 109+ messages in thread
From: Minchan Kim @ 2010-03-17  3:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 17, 2010 at 11:12 AM, KAMEZAWA Hiroyuki
> BTW, I doubt freeing anon_vma can happen even when we check mapcount.
>
> "unmap" is 2-stage operation.
>        1. unmap_vmas() => modify ptes, free pages, etc.
>        2. free_pgtables() => free pgtables, unlink vma and free it.
>
> Then, if migration is enough slow.
>
>        Migration():                            Exit():
>        check mapcount
>        rcu_read_lock
>        pte_lock
>        replace pte with migration pte
>        pte_unlock
>                                                pte_lock
>        copy page etc...                        zap pte (clear pte)
>                                                pte_unlock
>                                                free_pgtables
>                                                ->free vma
>                                                ->free anon_vma
>        pte_lock
>        remap pte with new pfn(fail)
>        pte_unlock
>
>        lock anon_vma->lock             # modification after free.
>        check list is empty

check list is empty?
Do you mean anon_vma->head?

If it is, is it possible that the list isn't empty because the anon_vma is
being used by others due to SLAB_DESTROY_BY_RCU?

But such a case is handled by page_check_address() and vma_address(), I think.

>        unlock anon_vma->lock
>        free anon_vma
>        rcu_read_unlock
>
>
> Hmm. IIUC, anon_vma is allocated as SLAB_DESTROY_BY_RCU. Then, while
> rcu_read_lock() is taken, anon_vma is anon_vma even if freed. But it
> may reused as anon_vma for someone else.
> (IOW, it may be reused but never pushed back to general purpose memory
>  until RCU grace period.)
> Then, touching anon_vma->lock never cause any corruption.
>
> Does use-after-free check for SLAB_DESTROY_BY_RCU correct behavior ?

Could you elaborate on your point?

> Above case is not use-after-free. It's safe and expected sequence.
>
> Thanks,
> -Kame
>
>
>
>> > ---
>> >  mm/migrate.c |   13 +++++++++++++
>> >  1 files changed, 13 insertions(+), 0 deletions(-)
>> >
>> > diff --git a/mm/migrate.c b/mm/migrate.c
>> > index 98eaaf2..6eb1efe 100644
>> > --- a/mm/migrate.c
>> > +++ b/mm/migrate.c
>> > @@ -603,6 +603,19 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>> >      */
>> >     if (PageAnon(page)) {
>> >             rcu_read_lock();
>> > +
>> > +           /*
>> > +            * If the page has no mappings any more, just bail. An
>> > +            * unmapped anon page is likely to be freed soon but worse,
>> > +            * it's possible its anon_vma disappeared between when
>> > +            * the page was isolated and when we reached here while
>> > +            * the RCU lock was not held
>> > +            */
>> > +           if (!page_mapcount(page)) {
>> > +                   rcu_read_unlock();
>> > +                   goto uncharge;
>> > +           }
>> > +
>> >             rcu_locked = 1;
>> >             anon_vma = page_anon_vma(page);
>> >             atomic_inc(&anon_vma->migrate_refcount);
>> >
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped  anonymous pages
  2010-03-17  3:00                 ` Minchan Kim
@ 2010-03-17  3:15                   ` KAMEZAWA Hiroyuki
  2010-03-17  4:15                     ` Minchan Kim
  2010-03-17 16:41                     ` Christoph Lameter
  0 siblings, 2 replies; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-17  3:15 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, 17 Mar 2010 12:00:15 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Wed, Mar 17, 2010 at 11:12 AM, KAMEZAWA Hiroyuki
> > BTW, I doubt freeing anon_vma can happen even when we check mapcount.
> >
> > "unmap" is 2-stage operation.
> >        1. unmap_vmas() => modify ptes, free pages, etc.
> >        2. free_pgtables() => free pgtables, unlink vma and free it.
> >
> > Then, if migration is enough slow.
> >
> >        Migration():                            Exit():
> >        check mapcount
> >        rcu_read_lock
> >        pte_lock
> >        replace pte with migration pte
> >        pte_unlock
> >                                                pte_lock
> >        copy page etc...                        zap pte (clear pte)
> >                                                pte_unlock
> >                                                free_pgtables
> >                                                ->free vma
> >                                                ->free anon_vma
> >        pte_lock
> >        remap pte with new pfn(fail)
> >        pte_unlock
> >
> >        lock anon_vma->lock             # modification after free.
> >        check list is empty
> 
> check list is empty?
> Do you mean anon_vma->head?
> 
yes.

> If it is, is it possible that that list isn't empty since anon_vma is
> used by others due to
> SLAB_DESTROY_BY_RCU?
> 
There are 4 cases.
	A) anon_vma->list is not empty because anon_vma is not freed.
	B) anon_vma->list is empty because it's freed.
	C) anon_vma->list is empty but it's reused.
	D) anon_vma->list is not empty but it's reused.
 
> but such case is handled by page_check_address, vma_address, I think.
> 
Yes. Then this corrupts nothing, as I wrote. We just modify anon_vma->lock,
and that is safe because of SLAB_DESTROY_BY_RCU.


> >        unlock anon_vma->lock
> >        free anon_vma
> >        rcu_read_unlock
> >
> >
> > Hmm. IIUC, anon_vma is allocated as SLAB_DESTROY_BY_RCU. Then, while
> > rcu_read_lock() is taken, anon_vma is anon_vma even if freed. But it
> > may reused as anon_vma for someone else.
> > (IOW, it may be reused but never pushed back to general purpose memory
> >  until RCU grace period.)
> > Then, touching anon_vma->lock never cause any corruption.
> >
> > Does use-after-free check for SLAB_DESTROY_BY_RCU correct behavior ?
> 
> Could you elaborate your point?
> 

Ah, my point is "how is the use-after-free detected?"

If the use-after-free is detected by free_pages() (DEBUG_PGALLOC), it seems
strange because DESTROY_BY_RCU guarantees that never happens.

So I assume the use-after-free is detected in the SLAB layer. If so, in the
B), C) and D) cases above there appears to be a use-after-free from the slab's
point of view, but it works as expected and there is no corruption.

Then, my question is
"Does the use-after-free check work correctly for SLAB_DESTROY_BY_RCU?"

and does that imply we need this patch?
(But this patch does prevent unnecessary page copying etc. with an easy check.)

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 09/11] Add /sys trigger for per-node memory compaction
  2010-03-12 16:41 ` [PATCH 09/11] Add /sys trigger for per-node " Mel Gorman
@ 2010-03-17  3:18   ` KOSAKI Motohiro
  0 siblings, 0 replies; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-17  3:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> This patch adds a per-node sysfs file called compact. When the file is
> written to, each zone in that node is compacted. The intention that this
> would be used by something like a job scheduler in a batch system before
> a job starts so that the job can allocate the maximum number of
> hugepages without significant start-up cost.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 08/11] Add /proc trigger for memory compaction
  2010-03-12 16:41 ` [PATCH 08/11] Add /proc trigger for memory compaction Mel Gorman
@ 2010-03-17  3:18   ` KOSAKI Motohiro
  0 siblings, 0 replies; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-17  3:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
> value is written to the file, all zones are compacted. The expected user
> of such a trigger is a job scheduler that prepares the system before the
> target application runs.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped  anonymous pages
  2010-03-17  3:15                   ` KAMEZAWA Hiroyuki
@ 2010-03-17  4:15                     ` Minchan Kim
  2010-03-17  4:19                       ` KAMEZAWA Hiroyuki
  2010-03-17 16:41                     ` Christoph Lameter
  1 sibling, 1 reply; 109+ messages in thread
From: Minchan Kim @ 2010-03-17  4:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 17, 2010 at 12:15 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 17 Mar 2010 12:00:15 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> On Wed, Mar 17, 2010 at 11:12 AM, KAMEZAWA Hiroyuki
>> > BTW, I doubt freeing anon_vma can happen even when we check mapcount.
>> >
>> > "unmap" is 2-stage operation.
>> >        1. unmap_vmas() => modify ptes, free pages, etc.
>> >        2. free_pgtables() => free pgtables, unlink vma and free it.
>> >
>> > Then, if migration is enough slow.
>> >
>> >        Migration():                            Exit():
>> >        check mapcount
>> >        rcu_read_lock
>> >        pte_lock
>> >        replace pte with migration pte
>> >        pte_unlock
>> >                                                pte_lock
>> >        copy page etc...                        zap pte (clear pte)
>> >                                                pte_unlock
>> >                                                free_pgtables
>> >                                                ->free vma
>> >                                                ->free anon_vma
>> >        pte_lock
>> >        remap pte with new pfn(fail)
>> >        pte_unlock
>> >
>> >        lock anon_vma->lock             # modification after free.
>> >        check list is empty
>>
>> check list is empty?
>> Do you mean anon_vma->head?
>>
> yes.
>
>> If it is, is it possible that that list isn't empty since anon_vma is
>> used by others due to
>> SLAB_DESTROY_BY_RCU?
>>
> There are 4 cases.
>        A) anon_vma->list is not empty because anon_vma is not freed.
>        B) anon_vma->list is empty because it's freed.
>        C) anon_vma->list is empty but it's reused.
>        D) anon_vma->list is not empty but it's reused.

E) The anon_vma memory is reused for some other object.

That can happen because we don't hold rcu_read_lock.
I think Mel hit this E) situation.

AFAIU, even a slab page of a SLAB_DESTROY_BY_RCU cache can be freed after the grace period.
Am I missing something?

>
>> but such case is handled by page_check_address, vma_address, I think.
>>
> yes. Then, this corrupt nothing, as I wrote. We just modify anon_vma->lock
> and it's safe because of SLAB_DESTROY_BY_RCU.
>
>
>> >        unlock anon_vma->lock
>> >        free anon_vma
>> >        rcu_read_unlock
>> >
>> >
>> > Hmm. IIUC, anon_vma is allocated as SLAB_DESTROY_BY_RCU. Then, while
>> > rcu_read_lock() is taken, anon_vma is anon_vma even if freed. But it
>> > may reused as anon_vma for someone else.
>> > (IOW, it may be reused but never pushed back to general purpose memory
>> >  until RCU grace period.)
>> > Then, touching anon_vma->lock never cause any corruption.
>> >
>> > Does use-after-free check for SLAB_DESTROY_BY_RCU correct behavior ?
>>
>> Could you elaborate your point?
>>
>
> Ah, my point is "how use-after-free is detected ?"
>
> If use-after-free is detected by free_pages() (DEBUG_PGALLOC), it seems
> strange because DESTROY_BY_RCU guarantee that never happens.
>
> So, I assume use-after-free is detected in SLAB layer. If so,
> in above B), C), D) case, it seems there is use-after free in slab's point
> of view but it works as expected, no corruption.
>
> Then, my question is
> "Does use-after-free check for SLAB_DESTROY_BY_RCU work correctly ?"
>

I am not sure Mel found that via DEBUG_PGALLOC,
but the E) case can be found by DEBUG_PGALLOC.

> and implies we need this patch ?
> (But this will prevent unnecessary page copy etc. by easy check.)
>
> Thanks,
> -Kame
>
>
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped  anonymous pages
  2010-03-17  4:15                     ` Minchan Kim
@ 2010-03-17  4:19                       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-17  4:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, 17 Mar 2010 13:15:14 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Wed, Mar 17, 2010 at 12:15 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Wed, 17 Mar 2010 12:00:15 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >
> >> On Wed, Mar 17, 2010 at 11:12 AM, KAMEZAWA Hiroyuki
> >> > BTW, I doubt freeing anon_vma can happen even when we check mapcount.
> >> >
> >> > "unmap" is 2-stage operation.
> >> >        1. unmap_vmas() => modify ptes, free pages, etc.
> >> >        2. free_pgtables() => free pgtables, unlink vma and free it.
> >> >
> >> > Then, if migration is enough slow.
> >> >
> >> >        Migration():                            Exit():
> >> >        check mapcount
> >> >        rcu_read_lock
> >> >        pte_lock
> >> >        replace pte with migration pte
> >> >        pte_unlock
> >> >                                                pte_lock
> >> >        copy page etc...                        zap pte (clear pte)
> >> >                                                pte_unlock
> >> >                                                free_pgtables
> >> >                                                ->free vma
> >> >                                                ->free anon_vma
> >> >        pte_lock
> >> >        remap pte with new pfn(fail)
> >> >        pte_unlock
> >> >
> >> >        lock anon_vma->lock             # modification after free.
> >> >        check list is empty
> >>
> >> check list is empty?
> >> Do you mean anon_vma->head?
> >>
> > yes.
> >
> >> If it is, is it possible that that list isn't empty since anon_vma is
> >> used by others due to
> >> SLAB_DESTROY_BY_RCU?
> >>
> > There are 4 cases.
> >        A) anon_vma->list is not empty because anon_vma is not freed.
> >        B) anon_vma->list is empty because it's freed.
> >        C) anon_vma->list is empty but it's reused.
> >        D) anon_vma->list is not empty but it's reused.
> 
> E) anon_vma is used for other object.
> 
> That's because we don't hold rcu_read_lock.
> I think Mel met this E) situation.
> 
Hmm. 

> AFAIU, even slab page of SLAB_BY_RCU can be freed after grace period.
> Do I miss something?
> 
I missed something. Sorry for the noise.

Maybe we need to check page_mapped() before calling try_to_unmap(), as
vmscan does. Thank you for your help.
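
For reference, the vmscan check being referred to is roughly the following
(a from-memory sketch of shrink_page_list() around this kernel version, so
treat the details as approximate):

	/* Only attempt the unmap while the page is still mapped */
	if (page_mapped(page) && mapping) {
		switch (try_to_unmap(page, TTU_UNMAP)) {
		case SWAP_FAIL:
			goto activate_locked;
		case SWAP_AGAIN:
			goto keep_locked;
		case SWAP_MLOCK:
			goto cull_mlocked;
		case SWAP_SUCCESS:
			; /* try to free the page below */
		}
	}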

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-12 16:41 ` [PATCH 07/11] Memory compaction core Mel Gorman
  2010-03-15 13:44   ` Minchan Kim
@ 2010-03-17 10:31   ` KOSAKI Motohiro
  2010-03-17 11:40     ` Mel Gorman
  2010-03-18 17:08     ` Mel Gorman
  1 sibling, 2 replies; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-17 10:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

nit

> +static int compact_zone(struct zone *zone, struct compact_control *cc)
> +{
> +	int ret = COMPACT_INCOMPLETE;
> +
> +	/* Setup to move all movable pages to the end of the zone */
> +	cc->migrate_pfn = zone->zone_start_pfn;
> +	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
> +	cc->free_pfn &= ~(pageblock_nr_pages-1);
> +
> +	for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
> +		unsigned long nr_migrate, nr_remaining;
> +		if (!isolate_migratepages(zone, cc))
> +			continue;
> +
> +		nr_migrate = cc->nr_migratepages;
> +		migrate_pages(&cc->migratepages, compaction_alloc,
> +						(unsigned long)cc, 0);
> +		update_nr_listpages(cc);
> +		nr_remaining = cc->nr_migratepages;
> +
> +		count_vm_event(COMPACTBLOCKS);

V1 did compaction per pageblock, but the current patch doesn't.
So, is COMPACTBLOCKS still a good name?


> +		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
> +		if (nr_remaining)
> +			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
> +
> +		/* Release LRU pages not migrated */
> +		if (!list_empty(&cc->migratepages)) {
> +			putback_lru_pages(&cc->migratepages);
> +			cc->nr_migratepages = 0;
> +		}
> +
> +		mod_zone_page_state(zone, NR_ISOLATED_ANON, -cc->nr_anon);
> +		mod_zone_page_state(zone, NR_ISOLATED_FILE, -cc->nr_file);

I think you don't need to decrease these vmstat counters here; migrate_pages() and
putback_lru_pages() already do it.


The other parts look good.





^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
  2010-03-17  2:28   ` KOSAKI Motohiro
@ 2010-03-17 11:32     ` Mel Gorman
  2010-03-17 16:37       ` Christoph Lameter
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-17 11:32 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 17, 2010 at 11:28:26AM +0900, KOSAKI Motohiro wrote:
> > CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
> > being able to hot-remove memory. The main users of page migration such as
> > sys_move_pages(), sys_migrate_pages() and cpuset process migration are
> > only beneficial on NUMA so it makes sense.
> > 
> > As memory compaction will operate within a zone and is useful on both NUMA
> > and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
> > user selects CONFIG_COMPACTION as an option.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/Kconfig |   20 ++++++++++++++++----
> >  1 files changed, 16 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 9c61158..04e241b 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -172,17 +172,29 @@ config SPLIT_PTLOCK_CPUS
> >  	default "4"
> >  
> >  #
> > +# support for memory compaction
> > +config COMPACTION
> > +	bool "Allow for memory compaction"
> > +	def_bool y
> > +	select MIGRATION
> > +	depends on EXPERIMENTAL && HUGETLBFS && MMU
> > +	help
> > +	  Allows the compaction of memory for the allocation of huge pages.
> > +
> 
> If select MIGRATION works, we can remove "depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE"
> line from config MIGRATION.
> 

I'm not quite getting why this would be an advantage. COMPACTION
requires MIGRATION but conceivably both NUMA and HOTREMOVE can work
without it.

> 
> 
> > +#
> >  # support for page migration
> >  #
> >  config MIGRATION
> >  	bool "Page migration"
> >  	def_bool y
> > -	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE
> > +	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION
> >  	help
> >  	  Allows the migration of the physical location of pages of processes
> > -	  while the virtual addresses are not changed. This is useful for
> > -	  example on NUMA systems to put pages nearer to the processors accessing
> > -	  the page.
> > +	  while the virtual addresses are not changed. This is useful in
> > +	  two situations. The first is on NUMA systems to put pages nearer
> > +	  to the processors accessing. The second is when allocating huge
> > +	  pages as migration can relocate pages to satisfy a huge page
> > +	  allocation instead of reclaiming.
> >  
> >  config PHYS_ADDR_T_64BIT
> >  	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
> > -- 
> > 1.6.5
> > 
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 06/11] Export fragmentation index via /proc/extfrag_index
  2010-03-17  2:49   ` KOSAKI Motohiro
@ 2010-03-17 11:33     ` Mel Gorman
  2010-03-23  0:22       ` KOSAKI Motohiro
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-17 11:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 17, 2010 at 11:49:49AM +0900, KOSAKI Motohiro wrote:
> > +/*
> > + * A fragmentation index only makes sense if an allocation of a requested
> > + * size would fail. If that is true, the fragmentation index indicates
> > + * whether external fragmentation or a lack of memory was the problem.
> > + * The value can be used to determine if page reclaim or compaction
> > + * should be used
> > + */
> > +int fragmentation_index(unsigned int order, struct contig_page_info *info)
> > +{
> > +	unsigned long requested = 1UL << order;
> > +
> > +	if (!info->free_blocks_total)
> > +		return 0;
> > +
> > +	/* Fragmentation index only makes sense when a request would fail */
> > +	if (info->free_blocks_suitable)
> > +		return -1000;
> > +
> > +	/*
> > +	 * Index is between 0 and 1 so return within 3 decimal places
> > +	 *
> > +	 * 0 => allocation would fail due to lack of memory
> > +	 * 1 => allocation would fail due to fragmentation
> > +	 */
> > +	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
> > +}
> 
> Dumb question.
> 
> your paper (http://portal.acm.org/citation.cfm?id=1375634.1375641) says
> 
> fragmentation_index = 1 - (TotalFree/SizeRequested)/BlocksFree
> 
> but your code have extra '1000+'. Why?

To get an approximation to three decimal places.
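
For example, with made-up numbers: order = 9 (requested = 512), free_pages =
1000 and free_blocks_total = 10 gives

	1000 - ((1000 + 1000 * 1000 / 512) / 10)
	  = 1000 - ((1000 + 1953) / 10)
	  = 1000 - 295
	  = 705

i.e. an index of roughly 0.705, computed with integer arithmetic only.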

> 
> Probably, I haven't understand the intention of this calculation.
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-17 10:31   ` KOSAKI Motohiro
@ 2010-03-17 11:40     ` Mel Gorman
  2010-03-18  2:35       ` KOSAKI Motohiro
  2010-03-18 17:08     ` Mel Gorman
  1 sibling, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-17 11:40 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 17, 2010 at 07:31:53PM +0900, KOSAKI Motohiro wrote:
> nit
> 
> > +static int compact_zone(struct zone *zone, struct compact_control *cc)
> > +{
> > +	int ret = COMPACT_INCOMPLETE;
> > +
> > +	/* Setup to move all movable pages to the end of the zone */
> > +	cc->migrate_pfn = zone->zone_start_pfn;
> > +	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
> > +	cc->free_pfn &= ~(pageblock_nr_pages-1);
> > +
> > +	for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
> > +		unsigned long nr_migrate, nr_remaining;
> > +		if (!isolate_migratepages(zone, cc))
> > +			continue;
> > +
> > +		nr_migrate = cc->nr_migratepages;
> > +		migrate_pages(&cc->migratepages, compaction_alloc,
> > +						(unsigned long)cc, 0);
> > +		update_nr_listpages(cc);
> > +		nr_remaining = cc->nr_migratepages;
> > +
> > +		count_vm_event(COMPACTBLOCKS);
> 
> V1 did compaction per pageblock. but current patch doesn't.
> so, Is COMPACTBLOCKS still good name?
> 

It's not such a minor nit. I wondered about that myself but it's still a
block - just not a pageblock. Would COMPACTCLUSTER be a better name as it's
related to COMPACT_CLUSTER_MAX?

> 
> > +		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
> > +		if (nr_remaining)
> > +			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
> > +
> > +		/* Release LRU pages not migrated */
> > +		if (!list_empty(&cc->migratepages)) {
> > +			putback_lru_pages(&cc->migratepages);
> > +			cc->nr_migratepages = 0;
> > +		}
> > +
> > +		mod_zone_page_state(zone, NR_ISOLATED_ANON, -cc->nr_anon);
> > +		mod_zone_page_state(zone, NR_ISOLATED_FILE, -cc->nr_file);
> 
> I think you don't need decrease this vmstatistics here. migrate_pages() and
> putback_lru_pages() alredy does.
> 

Hmm, I do need to decrease the vmstats here but not by this much. The
pages migrated need to be accounted for but not the ones that failed. I
missed this because migration was always succeeding. Thanks. I'll get it
fixed for V5

> other parts, looks good.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating
  2010-03-17  1:44   ` KOSAKI Motohiro
@ 2010-03-17 11:45     ` Mel Gorman
  2010-03-17 16:38       ` Christoph Lameter
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-17 11:45 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 17, 2010 at 10:44:06AM +0900, KOSAKI Motohiro wrote:
> >  rcu_unlock:
> > +
> > +	/* Drop an anon_vma reference if we took one */
> > +	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
> > +		int empty = list_empty(&anon_vma->head);
> > +		spin_unlock(&anon_vma->lock);
> > +		if (empty)
> > +			anon_vma_free(anon_vma);
> > +	}
> > +
> 
> Why don't we check ksm_refcount here?

The counts later get merged and the ksm code should be doing its own
checking. Checking both counts here would obscure what is going on and
not help after patch 3 of the series.

> Also, why drop_anon_vma() doesn't need check migrate_refcount?
> 

Same reason. Counts get merged later.


> plus, if we add this logic, we can remove SLAB_DESTROY_BY_RCU from 
> anon_vma_cachep and rcu_read_lock() from unmap_and_move(), I think.
> It is for preventing anon_vma recycle logic. but no free directly mean
> no memory recycle.
> 

This is true, but I don't think such a change belongs in this patch
series. If this series gets merged, then it would be sensible to investigate
if refcounting anon_vma is a good idea or would it be a bouncing write-shared
cacheline mess.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-17  2:03             ` KOSAKI Motohiro
@ 2010-03-17 11:51               ` Mel Gorman
  2010-03-18  0:48                 ` KOSAKI Motohiro
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-17 11:51 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 17, 2010 at 11:03:05AM +0900, KOSAKI Motohiro wrote:
> > mm,migration: Do not try to migrate unmapped anonymous pages
> > 
> > rmap_walk_anon() was triggering errors in memory compaction that look like
> > use-after-free errors. The problem is that between the page being isolated
> > from the LRU and rcu_read_lock() being taken, the mapcount of the page
> > dropped to 0 and the anon_vma gets freed. This can happen during memory
> > compaction if pages being migrated belong to a process that exits before
> > migration completes. Hence, the use-after-free race looks like
> > 
> >  1. Page isolated for migration
> >  2. Process exits
> >  3. page_mapcount(page) drops to zero so anon_vma was no longer reliable
> >  4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
> >  4. call try_to_unmap, looks up tha anon_vma and "locks" it but the lock
> >     is garbage.
> > 
> > This patch checks the mapcount after the rcu lock is taken. If the
> > mapcount is zero, the anon_vma is assumed to be freed and no further
> > action is taken.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> >  mm/migrate.c |   13 +++++++++++++
> >  1 files changed, 13 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 98eaaf2..6eb1efe 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -603,6 +603,19 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >  	 */
> >  	if (PageAnon(page)) {
> >  		rcu_read_lock();
> > +
> > +		/*
> > +		 * If the page has no mappings any more, just bail. An
> > +		 * unmapped anon page is likely to be freed soon but worse,
> > +		 * it's possible its anon_vma disappeared between when
> > +		 * the page was isolated and when we reached here while
> > +		 * the RCU lock was not held
> > +		 */
> > +		if (!page_mapcount(page)) {
> > +			rcu_read_unlock();
> > +			goto uncharge;
> > +		}
> 
> I haven't understand what prevent this check. Why don't we need following scenario?
> 
>  1. Page isolated for migration
>  2. Passed this if (!page_mapcount(page)) check
>  3. Process exits
>  4. page_mapcount(page) drops to zero so anon_vma was no longer reliable
> 
> 
> Traditionally, page migration logic is, it can touch garbarge of anon_vma, but
> SLAB_DESTROY_BY_RCU prevent any disaster. Is this broken concept?
> 

The check is made within the RCU read lock. If the count is positive at
that point but goes to zero due to a process exiting, the anon_vma will
still be valid until rcu_read_unlock() is called.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-17  2:12               ` KAMEZAWA Hiroyuki
  2010-03-17  3:00                 ` Minchan Kim
@ 2010-03-17 12:07                 ` Mel Gorman
  1 sibling, 0 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-17 12:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 17, 2010 at 11:12:34AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 16 Mar 2010 08:49:34 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Mon, 15 Mar 2010 14:21:24 +0000
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > On Mon, Mar 15, 2010 at 09:48:49PM +0900, Minchan Kim wrote:
> > > > On Mon, 2010-03-15 at 11:28 +0000, Mel Gorman wrote:
> > > > > The use after free looks like
> > > > > 
> > > > > 1. page_mapcount(page) was zero so anon_vma was no longer reliable
> > > > > 2. rcu lock taken but the anon_vma at this point can already be garbage because the
> > > > >    process exited
> > > > > 3. call try_to_unmap, looks up tha anon_vma and locks it. This causes problems
> > > > > 
> > > > > I thought the race would be closed but there is still a very tiny window there all
> > > > > right. The following alternative should close it. What do you think?
> > > > > 
> > > > >         if (PageAnon(page)) {
> > > > > 		rcu_read_lock();
> > > > > 
> > > > >                 /*
> > > > >                  * If the page has no mappings any more, just bail. An
> > > > >                  * unmapped anon page is likely to be freed soon but worse,
> > > > >                  * it's possible its anon_vma disappeared between when
> > > > >                  * the page was isolated and when we reached here while
> > > > >                  * the RCU lock was not held
> > > > >                  */
> > > > >                 if (!page_mapcount(page)) {
> > > > > 			rcu_read_unlock();
> > > > >                         goto uncharge;
> > > > > 		}
> > > > > 
> > > > >                 rcu_locked = 1;
> > > > >                 anon_vma = page_anon_vma(page);
> > > > >                 atomic_inc(&anon_vma->external_refcount);
> > > > >         }
> > > > > 
> > > > > The rcu_unlock label is not used here because the reference counts were not taken in
> > > > > the case where page_mapcount == 0.
> > > > > 
> > > > 
> > > > Please, repost above code with your use-after-free scenario comment.
> > > > 
> > > 
> > > This will be the replacement patch so.
> > > 
> > > ==== CUT HERE ====
> > > mm,migration: Do not try to migrate unmapped anonymous pages
> > > 
> > > rmap_walk_anon() was triggering errors in memory compaction that look like
> > > use-after-free errors. The problem is that between the page being isolated
> > > from the LRU and rcu_read_lock() being taken, the mapcount of the page
> > > dropped to 0 and the anon_vma gets freed. This can happen during memory
> > > compaction if pages being migrated belong to a process that exits before
> > > migration completes. Hence, the use-after-free race looks like
> > > 
> > >  1. Page isolated for migration
> > >  2. Process exits
> > >  3. page_mapcount(page) drops to zero so anon_vma was no longer reliable
> > >  4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
> > >  4. call try_to_unmap, looks up tha anon_vma and "locks" it but the lock
> > >     is garbage.
> > > 
> > > This patch checks the mapcount after the rcu lock is taken. If the
> > > mapcount is zero, the anon_vma is assumed to be freed and no further
> > > action is taken.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > Acked-by: Rik van Riel <riel@redhat.com>
> > 
> > Reviewd-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> 
> BTW, I doubt freeing anon_vma can happen even when we check mapcount.
> 

Bear in mind that without this patch, compaction can trigger
bad-dereference bugs fairly trivially. Each time it's related to taking
anon_vma->lock. It's not being caught by sl*b or page-alloc use-after-free
debugging. It's somewhat detected by lockdep, which recognises that the lock
it's trying to track is screwy.

> "unmap" is 2-stage operation.
> 	1. unmap_vmas() => modify ptes, free pages, etc.
> 	2. free_pgtables() => free pgtables, unlink vma and free it.
> 
> Then, if migration is enough slow. 
> 
> 	Migration():				Exit():
> 	check mapcount
> 	rcu_read_lock
> 	pte_lock				
> 	replace pte with migration pte		
> 	pte_unlock
> 						pte_lock
> 	copy page etc...			zap pte (clear pte)
> 						pte_unlock
> 						free_pgtables
> 						->free vma
> 						->free anon_vma
> 	pte_lock
> 	remap pte with new pfn(fail)
> 	pte_unlock
> 
> 	lock anon_vma->lock		# modification after free.

But the anon_vma is still valid. Minimally, it shouldn't be destroyed
until after the rcu_read_unlock but it's also protected by the refcount
taken by migration.

Look at anon_vma_unlink(). It checks for the anon_vma being empty with

empty = list_empty(&anon_vma->head) && !anonvma_external_refcount(anon_vma);

So though the vmas have been unmapped, the anon_vma should still not
have been freed until migration is completed. We drop our reference, see
the list is empty, free the anon_vma and call rcu_read_unlock().

> 	check list is empty
> 	unlock anon_vma->lock
> 	free anon_vma
> 	rcu_read_unlock
> 
> Hmm. IIUC, anon_vma is allocated as SLAB_DESTROY_BY_RCU. Then, while
> rcu_read_lock() is taken, anon_vma is anon_vma even if freed. But it
> may reused as anon_vma for someone else.
> (IOW, it may be reused but never pushed back to general purpose memory
>  until RCU grace period.)

I don't think it can be reused because we took the external_refcount,
preventing it from being freed.

> Then, touching anon_vma->lock never cause any corruption.
> 

It would be bad if the anon_vma is reused. We'd decrement the wrong
counter potentially leaking the anon_vma structure.

> Does use-after-free check for SLAB_DESTROY_BY_RCU correct behavior ?
> Above case is not use-after-free. It's safe and expected sequence.
> 

I don't think it's RCU that guarantees the correct behaviour here, it's
the external_refcount.
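
To make the pairing explicit, here is a minimal sketch stitched together from
the hunks already quoted in this thread (using this series' names; not a
complete function):

	if (PageAnon(page)) {
		rcu_read_lock();
		if (!page_mapcount(page)) {
			/* the anon_vma may already be gone, do not touch it */
			rcu_read_unlock();
			goto uncharge;
		}
		rcu_locked = 1;
		anon_vma = page_anon_vma(page);
		atomic_inc(&anon_vma->external_refcount);	/* pin it */
	}

	/* ... unmap, copy and remap the page ... */

	/* Drop the pin; only free the anon_vma if it is genuinely unused */
	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount,
					    &anon_vma->lock)) {
		int empty = list_empty(&anon_vma->head);
		spin_unlock(&anon_vma->lock);
		if (empty)
			anon_vma_free(anon_vma);
	}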

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
  2010-03-17 11:32     ` Mel Gorman
@ 2010-03-17 16:37       ` Christoph Lameter
  2010-03-17 23:56         ` KOSAKI Motohiro
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Lameter @ 2010-03-17 16:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Wed, 17 Mar 2010, Mel Gorman wrote:

> > If select MIGRATION works, we can remove "depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE"
> > line from config MIGRATION.
> >
>
> I'm not quite getting why this would be an advantage. COMPACTION
> requires MIGRATION but conceivable both NUMA and HOTREMOVE can work
> without it.

Avoids having to add additional CONFIG_XXX on the page migration "depends"
line in the future.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating
  2010-03-17 11:45     ` Mel Gorman
@ 2010-03-17 16:38       ` Christoph Lameter
  2010-03-18 11:12         ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Lameter @ 2010-03-17 16:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Wed, 17 Mar 2010, Mel Gorman wrote:

> This is true, but I don't think such a change belongs in this patch
> series. If this series gets merged, then it would be sensible to investigate
> if refcounting anon_vma is a good idea or would it be a bouncing write-shared
> cacheline mess.

SLAB_DESTROY_BY_RCU is there to avoid the cooling of hot cachelines by
RCU.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped  anonymous pages
  2010-03-17  3:15                   ` KAMEZAWA Hiroyuki
  2010-03-17  4:15                     ` Minchan Kim
@ 2010-03-17 16:41                     ` Christoph Lameter
  2010-03-18  0:30                       ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 109+ messages in thread
From: Christoph Lameter @ 2010-03-17 16:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, 17 Mar 2010, KAMEZAWA Hiroyuki wrote:

> Ah, my point is "how use-after-free is detected ?"

The slab layers do not check for use after free conditions if
SLAB_DESTROY_BY_RCU is set. It is legal to access the object after a
kfree() etc as long as the RCU period has not passed.

> Then, my question is
> "Does use-after-free check for SLAB_DESTROY_BY_RCU work correctly ?"

Use after free checks are not performed for SLAB_DESTROY_BY_RCU slabs.
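
For context, the anon_vma cache is one of the SLAB_DESTROY_BY_RCU users; its
creation looks roughly like this (quoted from memory, so treat as a sketch):

	anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
			0, SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_ctor);

The constructor initialises the spinlock and list head, so even a recycled
object still looks like a valid anon_vma when its lock is taken.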



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
  2010-03-17 16:37       ` Christoph Lameter
@ 2010-03-17 23:56         ` KOSAKI Motohiro
  2010-03-18 11:24           ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-17 23:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Adam Litke, Avi Kivity, David Rientjes, Rik van Riel,
	linux-kernel, linux-mm

> On Wed, 17 Mar 2010, Mel Gorman wrote:
> 
> > > If select MIGRATION works, we can remove "depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE"
> > > line from config MIGRATION.
> > >
> >
> > I'm not quite getting why this would be an advantage. COMPACTION
> > requires MIGRATION but conceivable both NUMA and HOTREMOVE can work
> > without it.
> 
> Avoids having to add additional CONFIG_XXX on the page migration "depends"
> line in the future.

Yes, Kconfig messes have frequently bitten us in the past. If we have a chance
to remove an unnecessary dependency, we should do so. That was the intention of my last mail.




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped  anonymous pages
  2010-03-17 16:41                     ` Christoph Lameter
@ 2010-03-18  0:30                       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-18  0:30 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Minchan Kim, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, 17 Mar 2010 11:41:10 -0500 (CDT)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Wed, 17 Mar 2010, KAMEZAWA Hiroyuki wrote:
> 
> > Ah, my point is "how use-after-free is detected ?"
> 
> The slab layers do not check for use after free conditions if
> SLAB_DESTROY_BY_RCU is set. It is legal to access the object after a
> kfree() etc as long as the RCU period has not passed.
> 
> > Then, my question is
> > "Does use-after-free check for SLAB_DESTROY_BY_RCU work correctly ?"
> 
> Use after free checks are not performed for SLAB_DESTROY_BY_RCU slabs.
> 
Thank you for the kind clarification. I have no more concerns.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-17 11:51               ` Mel Gorman
@ 2010-03-18  0:48                 ` KOSAKI Motohiro
  2010-03-18 11:14                   ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-18  0:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Minchan Kim, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Rik van Riel, linux-kernel, linux-mm

> > > +		/*
> > > +		 * If the page has no mappings any more, just bail. An
> > > +		 * unmapped anon page is likely to be freed soon but worse,
> > > +		 * it's possible its anon_vma disappeared between when
> > > +		 * the page was isolated and when we reached here while
> > > +		 * the RCU lock was not held
> > > +		 */
> > > +		if (!page_mapcount(page)) {
> > > +			rcu_read_unlock();
> > > +			goto uncharge;
> > > +		}
> > 
> > I haven't understand what prevent this check. Why don't we need following scenario?
> > 
> >  1. Page isolated for migration
> >  2. Passed this if (!page_mapcount(page)) check
> >  3. Process exits
> >  4. page_mapcount(page) drops to zero so anon_vma was no longer reliable
> > 
> > Traditionally, page migration logic is, it can touch garbarge of anon_vma, but
> > SLAB_DESTROY_BY_RCU prevent any disaster. Is this broken concept?
> 
> The check is made within the RCU read lock. If the count is positive at
> that point but goes to zero due to a process exiting, the anon_vma will
> still be valid until rcu_read_unlock() is called.

Thank you!

Then this logic depends on SLAB_DESTROY_BY_RCU, not on the refcount.
So I think we don't need your [1/11] patch.

Am I missing something?



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-17 11:40     ` Mel Gorman
@ 2010-03-18  2:35       ` KOSAKI Motohiro
  2010-03-18 11:43         ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-18  2:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> On Wed, Mar 17, 2010 at 07:31:53PM +0900, KOSAKI Motohiro wrote:
> > nit
> > 
> > > +static int compact_zone(struct zone *zone, struct compact_control *cc)
> > > +{
> > > +	int ret = COMPACT_INCOMPLETE;
> > > +
> > > +	/* Setup to move all movable pages to the end of the zone */
> > > +	cc->migrate_pfn = zone->zone_start_pfn;
> > > +	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
> > > +	cc->free_pfn &= ~(pageblock_nr_pages-1);
> > > +
> > > +	for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
> > > +		unsigned long nr_migrate, nr_remaining;
> > > +		if (!isolate_migratepages(zone, cc))
> > > +			continue;
> > > +
> > > +		nr_migrate = cc->nr_migratepages;
> > > +		migrate_pages(&cc->migratepages, compaction_alloc,
> > > +						(unsigned long)cc, 0);
> > > +		update_nr_listpages(cc);
> > > +		nr_remaining = cc->nr_migratepages;
> > > +
> > > +		count_vm_event(COMPACTBLOCKS);
> > 
> > V1 did compaction per pageblock. but current patch doesn't.
> > so, Is COMPACTBLOCKS still good name?
> 
> It's not such a minor nit. I wondered about that myself but it's still a
> block - just not a pageblock. Would COMPACTCLUSTER be a better name as it's
> related to COMPACT_CLUSTER_MAX?

I've looked at this code again. Honestly, I'm a bit confused even though both of your
suggestions seem reasonable.

Right now COMPACTBLOCKS is tracking the number of migrate_pages() calls, but I can't
imagine how to use it. Can you please explain the purpose of this statistic? Probably
it is only useful in combination with other stats, and the name should be consistent
with whatever that combination is.


> > > +		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
> > > +		if (nr_remaining)
> > > +			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
> > > +
> > > +		/* Release LRU pages not migrated */
> > > +		if (!list_empty(&cc->migratepages)) {
> > > +			putback_lru_pages(&cc->migratepages);
> > > +			cc->nr_migratepages = 0;
> > > +		}
> > > +
> > > +		mod_zone_page_state(zone, NR_ISOLATED_ANON, -cc->nr_anon);
> > > +		mod_zone_page_state(zone, NR_ISOLATED_FILE, -cc->nr_file);
> > 
> > I think you don't need decrease this vmstatistics here. migrate_pages() and
> > putback_lru_pages() alredy does.
> > 
> 
> Hmm, I do need to decrease the vmstats here but not by this much. The
> pages migrated need to be accounted for but not the ones that failed. I
> missed this because migration was always succeeding. Thanks. I'll get it
> fixed for V5

thanks.



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating
  2010-03-17 16:38       ` Christoph Lameter
@ 2010-03-18 11:12         ` Mel Gorman
  2010-03-18 16:31           ` Christoph Lameter
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-18 11:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 17, 2010 at 11:38:54AM -0500, Christoph Lameter wrote:
> On Wed, 17 Mar 2010, Mel Gorman wrote:
> 
> > This is true, but I don't think such a change belongs in this patch
> > series. If this series gets merged, then it would be sensible to investigate
> > if refcounting anon_vma is a good idea or would it be a bouncing write-shared
> > cacheline mess.
> 
> SLAB_DESTROY_BY_RCU is there to avoid the cooling of hot cachelines by
> RCU.
> 

Then even if we move to a full ref-count, it might still be a good idea
to preserve the SLAB_DESTROY_BY_RCU.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-18  0:48                 ` KOSAKI Motohiro
@ 2010-03-18 11:14                   ` Mel Gorman
  2010-03-19  6:21                     ` KOSAKI Motohiro
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-18 11:14 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Thu, Mar 18, 2010 at 09:48:08AM +0900, KOSAKI Motohiro wrote:
> > > > +		/*
> > > > +		 * If the page has no mappings any more, just bail. An
> > > > +		 * unmapped anon page is likely to be freed soon but worse,
> > > > +		 * it's possible its anon_vma disappeared between when
> > > > +		 * the page was isolated and when we reached here while
> > > > +		 * the RCU lock was not held
> > > > +		 */
> > > > +		if (!page_mapcount(page)) {
> > > > +			rcu_read_unlock();
> > > > +			goto uncharge;
> > > > +		}
> > > 
> > > I haven't understand what prevent this check. Why don't we need following scenario?
> > > 
> > >  1. Page isolated for migration
> > >  2. Passed this if (!page_mapcount(page)) check
> > >  3. Process exits
> > >  4. page_mapcount(page) drops to zero so anon_vma was no longer reliable
> > > 
> > > Traditionally, page migration logic is, it can touch garbarge of anon_vma, but
> > > SLAB_DESTROY_BY_RCU prevent any disaster. Is this broken concept?
> > 
> > The check is made within the RCU read lock. If the count is positive at
> > that point but goes to zero due to a process exiting, the anon_vma will
> > still be valid until rcu_read_unlock() is called.
> 
> Thank you!
> 
> then, this logic depend on SLAB_DESTROY_BY_RCU, not refcount.
> So, I think we don't need your [1/11] patch.
> 
> Am I missing something?
> 

The refcount is still needed. The anon_vma might be valid, but the
refcount is what ensures that the anon_vma is not freed and reused.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
  2010-03-17 23:56         ` KOSAKI Motohiro
@ 2010-03-18 11:24           ` Mel Gorman
  2010-03-19  6:21             ` KOSAKI Motohiro
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-18 11:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Christoph Lameter, Andrew Morton, Andrea Arcangeli, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Thu, Mar 18, 2010 at 08:56:23AM +0900, KOSAKI Motohiro wrote:
> > On Wed, 17 Mar 2010, Mel Gorman wrote:
> > 
> > > > If select MIGRATION works, we can remove "depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE"
> > > > line from config MIGRATION.
> > > >
> > >
> > > I'm not quite getting why this would be an advantage. COMPACTION
> > > requires MIGRATION but conceivable both NUMA and HOTREMOVE can work
> > > without it.
> > 
> > Avoids having to add additional CONFIG_XXX on the page migration "depends"
> > line in the future.
> 
> Yes, Kconfig mess freqently shot ourself in past days. if we have a chance
> to remove unnecessary dependency, we should do. that's my intention of the last mail.
> 

But if the depends line is removed, it could be set without NUMA, memory
hot-remove or compaction enabled. That wouldn't be very useful. I'm
missing something obvious.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-18  2:35       ` KOSAKI Motohiro
@ 2010-03-18 11:43         ` Mel Gorman
  2010-03-19  6:21           ` KOSAKI Motohiro
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-18 11:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Thu, Mar 18, 2010 at 11:35:46AM +0900, KOSAKI Motohiro wrote:
> > On Wed, Mar 17, 2010 at 07:31:53PM +0900, KOSAKI Motohiro wrote:
> > > nit
> > > 
> > > > +static int compact_zone(struct zone *zone, struct compact_control *cc)
> > > > +{
> > > > +	int ret = COMPACT_INCOMPLETE;
> > > > +
> > > > +	/* Setup to move all movable pages to the end of the zone */
> > > > +	cc->migrate_pfn = zone->zone_start_pfn;
> > > > +	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
> > > > +	cc->free_pfn &= ~(pageblock_nr_pages-1);
> > > > +
> > > > +	for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
> > > > +		unsigned long nr_migrate, nr_remaining;
> > > > +		if (!isolate_migratepages(zone, cc))
> > > > +			continue;
> > > > +
> > > > +		nr_migrate = cc->nr_migratepages;
> > > > +		migrate_pages(&cc->migratepages, compaction_alloc,
> > > > +						(unsigned long)cc, 0);
> > > > +		update_nr_listpages(cc);
> > > > +		nr_remaining = cc->nr_migratepages;
> > > > +
> > > > +		count_vm_event(COMPACTBLOCKS);
> > > 
> > > V1 did compaction per pageblock. but current patch doesn't.
> > > so, Is COMPACTBLOCKS still good name?
> > 
> > It's not such a minor nit. I wondered about that myself but it's still a
> > block - just not a pageblock. Would COMPACTCLUSTER be a better name as it's
> > related to COMPACT_CLUSTER_MAX?
> 
> I've looked at this code again. honestly I'm a abit confusing even though both your
> suggestions seems reasonable.  
> 
> now COMPACTBLOCKS is tracking #-of-called-migrate_pages. but I can't imazine
> how to use it. can you please explain this ststics purpose? probably this is only useful
> when conbination other stats, and the name should be consist with such combination one.
> 

It is intended to count how many steps compaction took; the fewer the
better, so minimally, the lower this number is the better. Specifically, the
"goodness" is related to the number of pages that were successfully allocated
due to compaction. Assuming the only high-order allocation was huge pages,
one possible calculation for "goodness" is:

hugepage_clusters = (1 << HUGETLB_PAGE_ORDER) / COMPACT_CLUSTER_MAX
goodness = (compactclusters / hugepage_clusters) / compactsuccess

The value of goodness is undefined if "compactsuccess" is 0.

Otherwise, the closer the "goodness" is to 1, the better. A value of 1
implies that compaction is selecting exactly the right blocks for migration
and the minimum number of pages are being moved around. The greater the value,
the more "useless" work compaction is doing.
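
As a worked example (illustrative numbers, assuming x86-64 2M huge pages so
HUGETLB_PAGE_ORDER is 9, i.e. 512 base pages per huge page, and a
COMPACT_CLUSTER_MAX of 32):

	hugepage_clusters = 512 / 32 = 16
	compactsuccess    = 10	(huge pages allocated thanks to compaction)
	compactclusters   = 240	(number of migrate_pages() calls)
	goodness          = (240 / 16) / 10 = 1.5

i.e. compaction did about 50% more migration passes than the theoretical
minimum of 160 for those 10 huge pages.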

If there is a mix of high orders resulting in compaction, calculating
the goodness is a lot harder and compactclusters is just a rule of thumb as
to how much work compaction is doing.

Does that make sense?

> 
> > > > +		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
> > > > +		if (nr_remaining)
> > > > +			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
> > > > +
> > > > +		/* Release LRU pages not migrated */
> > > > +		if (!list_empty(&cc->migratepages)) {
> > > > +			putback_lru_pages(&cc->migratepages);
> > > > +			cc->nr_migratepages = 0;
> > > > +		}
> > > > +
> > > > +		mod_zone_page_state(zone, NR_ISOLATED_ANON, -cc->nr_anon);
> > > > +		mod_zone_page_state(zone, NR_ISOLATED_FILE, -cc->nr_file);
> > > 
> > > I think you don't need decrease this vmstatistics here. migrate_pages() and
> > > putback_lru_pages() alredy does.
> > > 
> > 
> > Hmm, I do need to decrease the vmstats here but not by this much. The
> > pages migrated need to be accounted for but not the ones that failed. I
> > missed this because migration was always succeeding. Thanks. I'll get it
> > fixed for V5
> 
> thanks.
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating
  2010-03-18 11:12         ` Mel Gorman
@ 2010-03-18 16:31           ` Christoph Lameter
  0 siblings, 0 replies; 109+ messages in thread
From: Christoph Lameter @ 2010-03-18 16:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Thu, 18 Mar 2010, Mel Gorman wrote:

> Then even if we move to a full ref-count, it might still be a good idea
> to preserve the SLAB_DESTROY_BY_RCU.

Yes.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-17 10:31   ` KOSAKI Motohiro
  2010-03-17 11:40     ` Mel Gorman
@ 2010-03-18 17:08     ` Mel Gorman
  1 sibling, 0 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-18 17:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 17, 2010 at 07:31:53PM +0900, KOSAKI Motohiro wrote:
> nit
> 
> > +static int compact_zone(struct zone *zone, struct compact_control *cc)
> > +{
> > +	int ret = COMPACT_INCOMPLETE;
> > +
> > +	/* Setup to move all movable pages to the end of the zone */
> > +	cc->migrate_pfn = zone->zone_start_pfn;
> > +	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
> > +	cc->free_pfn &= ~(pageblock_nr_pages-1);
> > +
> > +	for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
> > +		unsigned long nr_migrate, nr_remaining;
> > +		if (!isolate_migratepages(zone, cc))
> > +			continue;
> > +
> > +		nr_migrate = cc->nr_migratepages;
> > +		migrate_pages(&cc->migratepages, compaction_alloc,
> > +						(unsigned long)cc, 0);
> > +		update_nr_listpages(cc);
> > +		nr_remaining = cc->nr_migratepages;
> > +
> > +		count_vm_event(COMPACTBLOCKS);
> 
> V1 did compaction per pageblock. but current patch doesn't.
> so, Is COMPACTBLOCKS still good name?
> 
> 
> > +		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
> > +		if (nr_remaining)
> > +			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
> > +
> > +		/* Release LRU pages not migrated */
> > +		if (!list_empty(&cc->migratepages)) {
> > +			putback_lru_pages(&cc->migratepages);
> > +			cc->nr_migratepages = 0;
> > +		}
> > +
> > +		mod_zone_page_state(zone, NR_ISOLATED_ANON, -cc->nr_anon);
> > +		mod_zone_page_state(zone, NR_ISOLATED_FILE, -cc->nr_file);
> 
> I think you don't need decrease this vmstatistics here. migrate_pages() and
> putback_lru_pages() alredy does.
> 

Actually, you're right and I was wrong. I was double decrementing the
counts. Good spot.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
  2010-03-18 11:24           ` Mel Gorman
@ 2010-03-19  6:21             ` KOSAKI Motohiro
  2010-03-19 10:16               ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-19  6:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Christoph Lameter, Andrew Morton,
	Andrea Arcangeli, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> On Thu, Mar 18, 2010 at 08:56:23AM +0900, KOSAKI Motohiro wrote:
> > > On Wed, 17 Mar 2010, Mel Gorman wrote:
> > > 
> > > > > If select MIGRATION works, we can remove "depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE"
> > > > > line from config MIGRATION.
> > > > >
> > > >
> > > > I'm not quite getting why this would be an advantage. COMPACTION
> > > > requires MIGRATION but conceivable both NUMA and HOTREMOVE can work
> > > > without it.
> > > 
> > > Avoids having to add additional CONFIG_XXX on the page migration "depends"
> > > line in the future.
> > 
> > Yes, Kconfig mess freqently shot ourself in past days. if we have a chance
> > to remove unnecessary dependency, we should do. that's my intention of the last mail.
> > 
> 
> But if the depends line is removed, it could be set without NUMA, memory
> hot-remove or compaction enabled. That wouldn't be very useful. I'm
> missing something obvious.

Perhaps I'm missing something.

My point is that force-enabling a useless config is not a good idea (yes, I agree), but config
selectability doesn't cause any failure. IOW, usefulness and dependency aren't
that closely related. Personally, I dislike _unnecessary_ dependencies.

If my opinion causes any problem, I'll withdraw it, of course.




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-12 16:41 ` [PATCH 10/11] Direct compact when a high-order allocation fails Mel Gorman
  2010-03-16  2:47   ` Minchan Kim
@ 2010-03-19  6:21   ` KOSAKI Motohiro
  2010-03-19  6:31     ` KOSAKI Motohiro
  2010-03-19 10:09     ` Mel Gorman
  1 sibling, 2 replies; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-19  6:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> @@ -1765,6 +1766,31 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  
>  	cond_resched();
>  
> +	/* Try memory compaction for high-order allocations before reclaim */
> +	if (order) {
> +		*did_some_progress = try_to_compact_pages(zonelist,
> +						order, gfp_mask, nodemask);
> +		if (*did_some_progress != COMPACT_INCOMPLETE) {
> +			page = get_page_from_freelist(gfp_mask, nodemask,
> +					order, zonelist, high_zoneidx,
> +					alloc_flags, preferred_zone,
> +					migratetype);
> +			if (page) {
> +				__count_vm_event(COMPACTSUCCESS);
> +				return page;
> +			}
> +
> +			/*
> +			 * It's bad if compaction run occurs and fails.
> +			 * The most likely reason is that pages exist,
> +			 * but not enough to satisfy watermarks.
> +			 */
> +			count_vm_event(COMPACTFAIL);
> +
> +			cond_resched();
> +		}
> +	}
> +

Hmm..Hmmm...........

Today, I've reviewed this patch and [11/11] carefully, twice, but it is hard to ack.

This patch seems to assume page compaction is faster than direct
reclaim, but often it isn't: dropping useless page cache is a very
lightweight operation, while page compaction does a lot of memcpy (i.e. CPU cache
pollution). IOW, this patch focuses very aggressively on hugepage allocation, but
it doesn't seem to take enough care to limit the damage to typical workloads.


First, in this mail I would like to clarify the current reclaim corner cases and what vmscan should do.

Now we have lumpy reclaim. It is an excellent solution for external fragmentation,
but unfortunately it has lots of corner cases.

Viewpoint 1. Unnecessary IO

isolate_pages() for lumpy reclaim frequently grabs very young pages. They are often
still dirty, so pageout() gets called a lot.

Unfortunately, page-granular IO is _very_ inefficient. It can cause lots of disk
seeks and kill disk IO bandwidth.


Viewpoint 2. Unevictable pages 

isolate_pages() for lumpy reclaim can pick up unevictable pages, which are obviously
undroppable. So if the zone has plenty of mlocked pages (not a rare case on
servers), lumpy reclaim can become quite useless.


Viewpoint 3. GFP_ATOMIC allocation failure

Obviously lumpy reclaim can't help with the GFP_ATOMIC case.


Viewpoint 4. reclaim latency

Reclaim latency directly affects page allocation latency. So if lumpy reclaim with
a lot of pageout IO is slow (and it often is), it affects page allocation latency and can
hurt the end user experience.


I really hope that automatic page migration helps to solve the above issues, but sadly this
patch seems not to.

Honestly, I think this patch would have been very impressive and useful 2-3 years ago,
because 1) we didn't have lumpy reclaim and 2) we didn't have sane reclaim bail-out.
Back then, vmscan was a very heavyweight and inefficient operation for high-order reclaim,
so the downside of adding this page migration would have been relatively well hidden. But...

We have to make an effort to reduce reclaim latency, not add new latency sources.
Instead, I would recommend tightly integrating page compaction and lumpy reclaim.
I mean 1) reusing lumpy reclaim's neighbouring-pfn page pick-up logic and 2) doing page
migration instead of pageout when the page meets some condition (for example active, dirty,
referenced or swapbacked).

This patch seems to shoot me! /me dies. R.I.P. ;-)


BTW, please don't use 'hugeadm --set-recommended-min_free_kbytes' when testing.
    Evaluating the case of free memory starvation is very important for this patch
    series, I think. I slightly suspect this patch might invoke useless compaction
    in such a case.



Bottom line: the explicit compaction via /proc can be merged soon, I think,
but this automatic compaction logic seems to need more discussion.






^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-18 11:14                   ` Mel Gorman
@ 2010-03-19  6:21                     ` KOSAKI Motohiro
  2010-03-19  8:59                       ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-19  6:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Minchan Kim, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Rik van Riel, linux-kernel, linux-mm

> > then, this logic depend on SLAB_DESTROY_BY_RCU, not refcount.
> > So, I think we don't need your [1/11] patch.
> > 
> > Am I missing something?
> > 
> 
> The refcount is still needed. The anon_vma might be valid, but the
> refcount is what ensures that the anon_vma is not freed and reused.

Please, please: why do we need both mechanisms? Christoph is very busy now and I am
the de facto reviewer of the page migration and mempolicy code. I really hope to understand
your patch.




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 07/11] Memory compaction core
  2010-03-18 11:43         ` Mel Gorman
@ 2010-03-19  6:21           ` KOSAKI Motohiro
  0 siblings, 0 replies; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-19  6:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> > > > V1 did compaction per pageblock. but current patch doesn't.
> > > > so, Is COMPACTBLOCKS still good name?
> > > 
> > > It's not such a minor nit. I wondered about that myself but it's still a
> > > block - just not a pageblock. Would COMPACTCLUSTER be a better name as it's
> > > related to COMPACT_CLUSTER_MAX?
> > 
> > I've looked at this code again. honestly I'm a abit confusing even though both your
> > suggestions seems reasonable.  
> > 
> > now COMPACTBLOCKS is tracking #-of-called-migrate_pages. but I can't imazine
> > how to use it. can you please explain this ststics purpose? probably this is only useful
> > when conbination other stats, and the name should be consist with such combination one.
> > 
> 
> It is intended to count how many steps compaction took, the fewer the
> better so minimally, the lower this number is the better. Specifically, the
> "goodness" is related to the number of pages that were successfully allocated
> due to compaction. Assuming the only high-order allocation was huge pages,
> one possible calculation for "goodness" is;
> 
> hugepage_clusters = (1 << HUGE HUGETLB_PAGE_ORDER) / COMPACT_CLUSTER_MAX
> goodness = (compactclusters / hugepage_clusters) / compactsuccess
> 
> The value of goodness is undefined if "compactsuccess" is 0.
> 
> Otherwise, the closer the "goodness" is to 1, the better. A value of 1
> implies that compaction is selecting exactly the right blocks for migration
> and the minimum number of pages are being moved around. The greater the value,
> the more "useless" work compaction is doing.
> 
> If there are a mix of high-orders that are resulting in compaction, calculating
> the goodness is a lot harder and compactcluster is just a rule of thumb as
> to how much work compaction is doing.
> 
> Does that make sense?

Sure! In that case, I now fully agree with COMPACTCLUSTER.

Thanks.



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-19  6:21   ` KOSAKI Motohiro
@ 2010-03-19  6:31     ` KOSAKI Motohiro
  2010-03-19 10:10       ` Mel Gorman
  2010-03-19 10:09     ` Mel Gorman
  1 sibling, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-19  6:31 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> Viewpoint 1. Unnecessary IO
> 
> isolate_pages() for lumpy reclaim frequently grab very young page. it is often
> still dirty. then, pageout() is called much.
> 
> Unfortunately, page size grained io is _very_ inefficient. it can makes lots disk
> seek and kill disk io bandwidth.
> 
> 
> Viewpoint 2. Unevictable pages 
> 
> isolate_pages() for lumpy reclaim can pick up unevictable page. it is obviously
> undroppable. so if the zone have plenty mlocked pages (it is not rare case on
> server use case), lumpy reclaim can become very useless.
> 
> 
> Viewpoint 3. GFP_ATOMIC allocation failure
> 
> Obviously lumpy reclaim can't help GFP_ATOMIC issue.
> 
> 
> Viewpoint 4. reclaim latency
> 
> reclaim latency directly affect page allocation latency. so if lumpy reclaim with
> much pageout io is slow (often it is), it affect page allocation latency and can
> reduce end user experience.

Viewpoint 5. End user surprise

Lumpy reclaim can cause swap-out even though the system has lots of free
memory. End users are very surprised by it and can think it is a bug.

Also, this swap activity easily confuses an administrator trying to decide when
to install more memory in the system.




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-19  6:21                     ` KOSAKI Motohiro
@ 2010-03-19  8:59                       ` Mel Gorman
  2010-03-25  2:49                         ` KOSAKI Motohiro
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-19  8:59 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Fri, Mar 19, 2010 at 03:21:41PM +0900, KOSAKI Motohiro wrote:
> > > then, this logic depend on SLAB_DESTROY_BY_RCU, not refcount.
> > > So, I think we don't need your [1/11] patch.
> > > 
> > > Am I missing something?
> > > 
> > 
> > The refcount is still needed. The anon_vma might be valid, but the
> > refcount is what ensures that the anon_vma is not freed and reused.
> 
> please please why do we need both mechanism. now cristoph is very busy and I am
> de fact reviewer of page migration and mempolicy code. I really hope to understand
> your patch.
> 

As in, why not drop the RCU protection of anon_vma altogether? Mainly, because I
think it would be reaching too far for this patchset and it should be done as
a follow-up. Putting the ref-count everywhere will change the cache-behaviour
of anon_vma more than I'd like to slip into a patchset like this. Secondly,
Christoph mentions that SLAB_DESTROY_BY_RCU is used to keep anon_vma cache-hot.
For these reasons, removing RCU from these paths and adding the refcount
in others is a patch that should stand on its own.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-19  6:21   ` KOSAKI Motohiro
  2010-03-19  6:31     ` KOSAKI Motohiro
@ 2010-03-19 10:09     ` Mel Gorman
  2010-03-25 11:08       ` KOSAKI Motohiro
  1 sibling, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-19 10:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Fri, Mar 19, 2010 at 03:21:31PM +0900, KOSAKI Motohiro wrote:
> > @@ -1765,6 +1766,31 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >  
> >  	cond_resched();
> >  
> > +	/* Try memory compaction for high-order allocations before reclaim */
> > +	if (order) {
> > +		*did_some_progress = try_to_compact_pages(zonelist,
> > +						order, gfp_mask, nodemask);
> > +		if (*did_some_progress != COMPACT_INCOMPLETE) {
> > +			page = get_page_from_freelist(gfp_mask, nodemask,
> > +					order, zonelist, high_zoneidx,
> > +					alloc_flags, preferred_zone,
> > +					migratetype);
> > +			if (page) {
> > +				__count_vm_event(COMPACTSUCCESS);
> > +				return page;
> > +			}
> > +
> > +			/*
> > +			 * It's bad if compaction run occurs and fails.
> > +			 * The most likely reason is that pages exist,
> > +			 * but not enough to satisfy watermarks.
> > +			 */
> > +			count_vm_event(COMPACTFAIL);
> > +
> > +			cond_resched();
> > +		}
> > +	}
> > +
> 
> Hmm..Hmmm...........
> 
> Today, I've reviewed this patch and [11/11] carefully twice. but It is harder to ack.
> 
> This patch seems to assume page compaction is faster than direct
> reclaim. but it often doesn't, because dropping useless page cache is very
> lightweight operation,

Two points with that:

1. It's very hard to know in advance how often direct reclaim of clean page
   cache would be enough to satisfy the allocation.

2. Even if it was faster to discard page cache, it's not necessarily
   faster when the cost of reading that page cache back-in is taken into
   account.

Lumpy reclaim tries to avoid dumping useful page cache but it is perfectly
possible for hot data to be discarded because it happened to be located
near cold data. It's impossible to know in general how much unnecessary IO
takes place as a result of lumpy reclaim because it depends heavily on the
system-state when lumpy reclaim starts.

> but page compaction makes a lot of memcpy (i.e. cpu cache
> pollution). IOW this patch is focusing to hugepage allocation very aggressively, but
> it seems not enough care to reduce typical workload damage.
> 

What typical workload is making aggressive use of high order
allocations? Typically when such a user is found, effort is spent on
finding alternatives to high-orders as opposed to worrying about the cost
of allocating them. There was a focus on huge page allocation because it
was the most useful test case that was likely to be encountered in practice.

I can adjust the allocation levels to some other value but it's not typical
for a system to make very aggressive use of other orders. I could have it
use random orders but that also is not very typical.

> At first, I would like to clarify current reclaim corner case and how
> vmscan should do at this mail.
> 
> Now we have Lumpy reclaim. It is very excellent solution for externa
> fragmentation.

In some situations, it can grind a system to trash for a time. What is far
more likely is to be dealing with a machine with no swap - something that
is common in clusters. In this case, lumpy is a lot less likely to succeed
unless the machine is very quiet. It's just not going to find the contiguous
page cache it needs to discard and anonymous pages get in the way.

> but unfortunately it have lots corner case.
> 
> Viewpoint 1. Unnecessary IO
> 
> isolate_pages() for lumpy reclaim frequently grab very young page. it is often
> still dirty. then, pageout() is called much.
> 
> Unfortunately, page size grained io is _very_ inefficient. it can makes lots disk
> seek and kill disk io bandwidth.
> 

Page-based IO like this has also been reported as being a problem for some
filesystems. When this happens, lumpy reclaim potentially stalls for a long
time waiting for the dirty data to be flushed by a flusher thread. Compaction
does not suffer from the same problem.

> Viewpoint 2. Unevictable pages 
> 
> isolate_pages() for lumpy reclaim can pick up unevictable page. it is obviously
> undroppable. so if the zone have plenty mlocked pages (it is not rare case on
> server use case), lumpy reclaim can become very useless.
> 

Also true. Potentially, compaction can deal with unevictable pages but it's
not done in this series as it's significant enough as it is and useful in
its current form.

> Viewpoint 3. GFP_ATOMIC allocation failure
> 
> Obviously lumpy reclaim can't help GFP_ATOMIC issue.
> 

Also true although right now, it's not possible to compact for GFP_ATOMIC
either. I think it could be done in some cases but I didn't try for it.
High-order GFP_ATOMIC allocations are still something we simply try and
avoid rather than deal with within the page allocator.

> Viewpoint 4. reclaim latency
> 
> reclaim latency directly affect page allocation latency. so if lumpy reclaim with
> much pageout io is slow (often it is), it affect page allocation latency and can
> reduce end user experience.
> 

Also true. When allocating huge pages on a normal desktop, for example,
it can stall the machine for a number of seconds while reclaim kicks
in.

With direct compaction, this does not happen to anywhere near the same
degree. There are still some stalls because as huge pages get allocated,
free memory drops until pages have to be reclaimed anyway. The effects
are a lot less pronounced and the operation finishes a lot faster.

> I really hope that auto page migration help to solve above issue. but sadly this 
> patch seems doesn't.
> 

How do you figure? I think it goes a long way to mitigating the worst of
the problems you laid out above.

> Honestly, I think this patch was very impressive and useful at 2-3 years ago.
> because 1) we didn't have lumpy reclaim 2) we didn't have sane reclaim bail out.
> then, old vmscan is very heavyweight and inefficient operation for high order reclaim.
> therefore the downside of adding this page migration is hidden relatively. but...
> 
> We have to make an effort to reduce reclaim latency, not adding new latency source.

I recognise that reclaim latency has been reduced but there is a wall.
The cost of reading the data back in will always be there and on
swapless systems, it might simply be impossible for lumpy reclaim to do
what it needs.

> Instead, I would recommend tightly integrate page-compaction and lumpy reclaim.
> I mean 1) reusing lumpy reclaim's neighbor pfn page pickking up logic

There are a number of difficulties with this. I'm not saying it's impossible,
but the win is not very clear-cut and there are some disadvantages.

One, there would have to be exceptions for kswapd in the path because it
really should continue reclaiming. The reclaim path is already very dense
and this would add significant complexity to that path.

The second difficulty is that the migration and free block selection
algorithm becomes a lot harder, more expensive and identifying the exit
conditions presents a significant difficulty. Right now, the selection is
based on linear scans with straight-forward selection and the exit condition
is simply when the scanners meet. With the migration scanner based on LRU,
significant care would have to be taken to ensure that appropriate free blocks
were chosen to migrate to so that we didn't "migrate from" a block in one
pass and "migrate to" in another (the reason why I went with linear scans
in the first place). Identifying when the zone has been compacted and should
just stop is no longer as straight-forward either.  You'd have to track what
blocks had been operated on in the past which is potentially a lot of state. To
maintain this state, an unknown number of structures would have to be allocated
which may re-enter the allocator presenting its own class of problems.

Third, right now it's very easy to identify when compaction is not going
to work in advance - simply check the watermarks and make a calculation
based on fragmentation. With a combined reclaim/compaction step, these
types of checks would need to be made continually - potentially
increasing the latency of reclaim albeit very slightly.
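
For reference, the kind of up-front check I mean looks roughly like this
(a sketch; the 500 threshold is the one discussed elsewhere in this thread,
not necessarily the final code):

	/* Enough free memory that compaction could plausibly succeed? */
	if (!zone_watermark_ok(zone, 0,
			low_wmark_pages(zone) + (1 << order), 0, 0))
		continue;

	/* Would the allocation fail due to fragmentation, not lack of memory? */
	fragindex = fragmentation_index(zone, order);
	if (fragindex >= 0 && fragindex <= 500)
		continue;

	/* Otherwise, compaction is worth attempting for this zone */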

Lastly, with this series, there is very little difference between direct
compaction and proc-triggered compaction. They share the same code paths
and all that differs is the exit conditions. If it was integrated into
reclaim, it becomes a lot less straight-forward to share the code.

> 2) do page
> migration instead pageout when the page is some condition (example active or dirty
> or referenced or swapbacked).
> 

Right now, it is identified when pageout should happen instead of page
migration. It's known before compaction starts if it's likely to be
successful or not.

> This patch seems shoot me! /me die. R.I.P. ;-)
> 

That seems a bit dramatic. Your alternative proposal has some significant
difficulties and is likely to be very complicated. Also, there is nothing
to say that this mechanism could not be integrated with lumpy reclaim over
time once it was shown that useless migration was going on or latencies were
increased for some workload.

This patch seems like a far more rational starting point to me than adding
more complexity to reclaim at the outset.

> btw please don't use 'hugeadm --set-recommended-min_free_kbytes' at testing.

It's somewhat important for the type of stress tests I do for huge page
allocation. Without it, fragmentation avoidance has trouble and the
results become a lot less repeatable.

>     To evaluate a case of free memory starvation is very important for this patch
>     series, I think. I slightly doubt this patch might invoke useless compaction
>     in such case.
> 

I can drop the min_free_kbytes change but the likely result will be that
allocation success rates will simply be lower. The calculations on
whether compaction should be used or not are based on watermarks which
adjust to the value of min_free_kbytes.

> At bottom line, the explict compaction via /proc can be merged soon, I think.
> but this auto compaction logic seems need more discussion.
> 

My concern would be that the compaction paths would then be used very
rarely in practice and we'd get no data on how direct compaction should
be done.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-19  6:31     ` KOSAKI Motohiro
@ 2010-03-19 10:10       ` Mel Gorman
  2010-03-25 11:22         ` KOSAKI Motohiro
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-19 10:10 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Fri, Mar 19, 2010 at 03:31:27PM +0900, KOSAKI Motohiro wrote:
> > Viewpoint 1. Unnecessary IO
> > 
> > isolate_pages() for lumpy reclaim frequently grab very young page. it is often
> > still dirty. then, pageout() is called much.
> > 
> > Unfortunately, page size grained io is _very_ inefficient. it can makes lots disk
> > seek and kill disk io bandwidth.
> > 
> > 
> > Viewpoint 2. Unevictable pages 
> > 
> > isolate_pages() for lumpy reclaim can pick up unevictable page. it is obviously
> > undroppable. so if the zone have plenty mlocked pages (it is not rare case on
> > server use case), lumpy reclaim can become very useless.
> > 
> > 
> > Viewpoint 3. GFP_ATOMIC allocation failure
> > 
> > Obviously lumpy reclaim can't help GFP_ATOMIC issue.
> > 
> > 
> > Viewpoint 4. reclaim latency
> > 
> > reclaim latency directly affect page allocation latency. so if lumpy reclaim with
> > much pageout io is slow (often it is), it affect page allocation latency and can
> > reduce end user experience.
> 
> Viewpoint 5. end user surprising
> 
> lumpy reclaim can makes swap-out even though the system have lots free
> memory. end users very surprised it and they can think it is bug.
> 
> Also, this swap activity easyly confuse that an administrator decide when
> install more memory into the system.
> 

Compaction in this case is a lot less surprising. If there is enough free
memory, compaction will trigger automatically without any reclaim.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
  2010-03-19  6:21             ` KOSAKI Motohiro
@ 2010-03-19 10:16               ` Mel Gorman
  2010-03-25  3:28                 ` KOSAKI Motohiro
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-19 10:16 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Christoph Lameter, Andrew Morton, Andrea Arcangeli, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Fri, Mar 19, 2010 at 03:21:20PM +0900, KOSAKI Motohiro wrote:
> > On Thu, Mar 18, 2010 at 08:56:23AM +0900, KOSAKI Motohiro wrote:
> > > > On Wed, 17 Mar 2010, Mel Gorman wrote:
> > > > 
> > > > > > If select MIGRATION works, we can remove "depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE"
> > > > > > line from config MIGRATION.
> > > > > >
> > > > >
> > > > > I'm not quite getting why this would be an advantage. COMPACTION
> > > > > requires MIGRATION but conceivable both NUMA and HOTREMOVE can work
> > > > > without it.
> > > > 
> > > > Avoids having to add additional CONFIG_XXX on the page migration "depends"
> > > > line in the future.
> > > 
> > > Yes, Kconfig mess freqently shot ourself in past days. if we have a chance
> > > to remove unnecessary dependency, we should do. that's my intention of the last mail.
> > > 
> > 
> > But if the depends line is removed, it could be set without NUMA, memory
> > hot-remove or compaction enabled. That wouldn't be very useful. I'm
> > missing something obvious.
> 
> Perhaps I'm missing something. 
> 
> my point is, force enabling useless config is not good idea (yes, i agree). but config 
> selectability doesn't cause any failure. IOW, usefulness and dependency aren't 
> related so much. personally I dislike _unnecessary_ dependency.
> 
> If my opinion cause any bad thing, I'll withdraw it. of course.
> 

I've changed the MIGRATION entry to

config MIGRATION
        bool "Page migration"
        def_bool y
        depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE 

i.e. it no longer depends on COMPACTION because the "select MIGRATION"
in that line is enough.

I've left NUMA and HOTREMOVE because migration is an optional feature
for those configurations.
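
For completeness, the COMPACTION entry that pulls MIGRATION in via select
looks roughly like this (a sketch of the shape, not necessarily the exact
entry in the patch):

config COMPACTION
	bool "Allow for memory compaction"
	select MIGRATION
	depends on MMU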

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 06/11] Export fragmentation index via /proc/extfrag_index
  2010-03-17 11:33     ` Mel Gorman
@ 2010-03-23  0:22       ` KOSAKI Motohiro
  2010-03-23 12:03         ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-23  0:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> > > +	/*
> > > +	 * Index is between 0 and 1 so return within 3 decimal places
> > > +	 *
> > > +	 * 0 => allocation would fail due to lack of memory
> > > +	 * 1 => allocation would fail due to fragmentation
> > > +	 */
> > > +	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
> > > +}
> > 
> > Dumb question.
> > your paper (http://portal.acm.org/citation.cfm?id=1375634.1375641) says
> > fragmentation_index = 1 - (TotalFree/SizeRequested)/BlocksFree
> > but your code have extra '1000+'. Why?
> 
> To get an approximation to three decimal places.

Do you mean this is a poor man's round-up logic?
Why don't you use DIV_ROUND_UP? Like the following:

return 1000 - (DIV_ROUND_UP(info->free_pages * 1000 / requested) /  info->free_blocks_total);



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 06/11] Export fragmentation index via /proc/extfrag_index
  2010-03-23  0:22       ` KOSAKI Motohiro
@ 2010-03-23 12:03         ` Mel Gorman
  2010-03-25  2:47           ` KOSAKI Motohiro
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-23 12:03 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Tue, Mar 23, 2010 at 09:22:04AM +0900, KOSAKI Motohiro wrote:
> > > > +	/*
> > > > +	 * Index is between 0 and 1 so return within 3 decimal places
> > > > +	 *
> > > > +	 * 0 => allocation would fail due to lack of memory
> > > > +	 * 1 => allocation would fail due to fragmentation
> > > > +	 */
> > > > +	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
> > > > +}
> > > 
> > > Dumb question.
> > > your paper (http://portal.acm.org/citation.cfm?id=1375634.1375641) says
> > > fragmentation_index = 1 - (TotalFree/SizeRequested)/BlocksFree
> > > but your code have extra '1000+'. Why?
> > 
> > To get an approximation to three decimal places.
> 
> Do you mean this is poor man's round up logic?

Not exactly.

The intention is to have a value of 968 instead of 0.968231. i.e.
instead of a value between 0 and 1, it'll be a value between 0 and 1000
that matches the first three digits after the decimal place.

> Why don't you use DIV_ROUND_UP? likes following,
> 
> return 1000 - (DIV_ROUND_UP(info->free_pages * 1000 / requested) /  info->free_blocks_total);
> 

Because it's not doing the same thing unless I missed something.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 06/11] Export fragmentation index via /proc/extfrag_index
  2010-03-23 12:03         ` Mel Gorman
@ 2010-03-25  2:47           ` KOSAKI Motohiro
  2010-03-25  8:47             ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-25  2:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> On Tue, Mar 23, 2010 at 09:22:04AM +0900, KOSAKI Motohiro wrote:
> > > > > +	/*
> > > > > +	 * Index is between 0 and 1 so return within 3 decimal places
> > > > > +	 *
> > > > > +	 * 0 => allocation would fail due to lack of memory
> > > > > +	 * 1 => allocation would fail due to fragmentation
> > > > > +	 */
> > > > > +	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
> > > > > +}
> > > > 
> > > > Dumb question.
> > > > your paper (http://portal.acm.org/citation.cfm?id=1375634.1375641) says
> > > > fragmentation_index = 1 - (TotalFree/SizeRequested)/BlocksFree
> > > > but your code have extra '1000+'. Why?
> > > 
> > > To get an approximation to three decimal places.
> > 
> > Do you mean this is poor man's round up logic?
> 
> Not exactly.
> 
> The intention is to have a value of 968 instead of 0.968231. i.e.
> instead of a value between 0 and 1, it'll be a value between 0 and 1000
> that matches the first three digits after the decimal place.

Let's consider an extreme case.

free_pages: 1
requested: 1
free_blocks_total: 1

frag_index = 1000  - ((1000 + 1*1000/1))/1 = -1000

This is not your intention, I guess.
Probably we don't need any round-up/round-down logic, because fragmentation_index
is only used in the "if (fragindex >= 0 && fragindex <= 500)" check in try_to_compact_pages().
A +1 or -1 inaccuracy can be ignored. IOW, I think we can remove the '1000+' expression.


> > Why don't you use DIV_ROUND_UP? likes following,
> > 
> > return 1000 - (DIV_ROUND_UP(info->free_pages * 1000 / requested) /  info->free_blocks_total);
> > 
> 
> Because it's not doing the same thing unless I missed something.




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-19  8:59                       ` Mel Gorman
@ 2010-03-25  2:49                         ` KOSAKI Motohiro
  2010-03-25  8:32                           ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-25  2:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Minchan Kim, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Rik van Riel, linux-kernel, linux-mm

> On Fri, Mar 19, 2010 at 03:21:41PM +0900, KOSAKI Motohiro wrote:
> > > > then, this logic depend on SLAB_DESTROY_BY_RCU, not refcount.
> > > > So, I think we don't need your [1/11] patch.
> > > > 
> > > > Am I missing something?
> > > > 
> > > 
> > > The refcount is still needed. The anon_vma might be valid, but the
> > > refcount is what ensures that the anon_vma is not freed and reused.
> > 
> > please please why do we need both mechanism. now cristoph is very busy and I am
> > de fact reviewer of page migration and mempolicy code. I really hope to understand
> > your patch.
> 
> As in, why not drop the RCU protection of anon_vma altogeter? Mainly, because I
> think it would be reaching too far for this patchset and it should be done as
> a follow-up. Putting the ref-count everywhere will change the cache-behaviour
> of anon_vma more than I'd like to slip into a patchset like this. Secondly,
> Christoph mentions that SLAB_DESTROY_BY_RCU is used to keep anon_vma cache-hot.
> For these reasons, removing RCU from these paths and adding the refcount
> in others is a patch that should stand on its own.

Hmmm...
I haven't understood your point, so I guess I was wrong.

Probably my last question was unclear. I mean:

1) If we still need SLAB_DESTROY_BY_RCU, why do we need to add a refcount?
    What difference is there between normal page migration and compaction?
2) If we add a refcount, which race does it solve?

IOW, does this patch fix an old issue or a compaction-specific issue?




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
  2010-03-19 10:16               ` Mel Gorman
@ 2010-03-25  3:28                 ` KOSAKI Motohiro
  0 siblings, 0 replies; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-25  3:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Christoph Lameter, Andrew Morton,
	Andrea Arcangeli, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> On Fri, Mar 19, 2010 at 03:21:20PM +0900, KOSAKI Motohiro wrote:
> > > On Thu, Mar 18, 2010 at 08:56:23AM +0900, KOSAKI Motohiro wrote:
> > > > > On Wed, 17 Mar 2010, Mel Gorman wrote:
> > > > > 
> > > > > > > If select MIGRATION works, we can remove "depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE"
> > > > > > > line from config MIGRATION.
> > > > > > >
> > > > > >
> > > > > > I'm not quite getting why this would be an advantage. COMPACTION
> > > > > > requires MIGRATION but conceivable both NUMA and HOTREMOVE can work
> > > > > > without it.
> > > > > 
> > > > > Avoids having to add additional CONFIG_XXX on the page migration "depends"
> > > > > line in the future.
> > > > 
> > > > Yes, Kconfig mess freqently shot ourself in past days. if we have a chance
> > > > to remove unnecessary dependency, we should do. that's my intention of the last mail.
> > > > 
> > > 
> > > But if the depends line is removed, it could be set without NUMA, memory
> > > hot-remove or compaction enabled. That wouldn't be very useful. I'm
> > > missing something obvious.
> > 
> > Perhaps I'm missing something. 
> > 
> > my point is, force enabling useless config is not good idea (yes, i agree). but config 
> > selectability doesn't cause any failure. IOW, usefulness and dependency aren't 
> > related so much. personally I dislike _unnecessary_ dependency.
> > 
> > If my opinion cause any bad thing, I'll withdraw it. of course.
> > 
> 
> I've changed the MIGRATION entry to
> 
> config MIGRATION
>         bool "Page migration"
>         def_bool y
>         depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE 
> 
> i.e. it no longer depends on COMPACTION because the "select MIGRATION"
> in that line is enough.
> 
> I've left NUMA and HOTREMOVE because migration is an optional feature
> for those configurations.

ok... I don't oppose it anymore. Let's dive into Kconfig select hell ;)



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25  2:49                         ` KOSAKI Motohiro
@ 2010-03-25  8:32                           ` Mel Gorman
  2010-03-25  8:56                             ` KOSAKI Motohiro
  2010-03-25  9:02                             ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-25  8:32 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Thu, Mar 25, 2010 at 11:49:23AM +0900, KOSAKI Motohiro wrote:
> > On Fri, Mar 19, 2010 at 03:21:41PM +0900, KOSAKI Motohiro wrote:
> > > > > then, this logic depend on SLAB_DESTROY_BY_RCU, not refcount.
> > > > > So, I think we don't need your [1/11] patch.
> > > > > 
> > > > > Am I missing something?
> > > > > 
> > > > 
> > > > The refcount is still needed. The anon_vma might be valid, but the
> > > > refcount is what ensures that the anon_vma is not freed and reused.
> > > 
> > > please please why do we need both mechanism. now cristoph is very busy and I am
> > > de fact reviewer of page migration and mempolicy code. I really hope to understand
> > > your patch.
> > 
> > As in, why not drop the RCU protection of anon_vma altogeter? Mainly, because I
> > think it would be reaching too far for this patchset and it should be done as
> > a follow-up. Putting the ref-count everywhere will change the cache-behaviour
> > of anon_vma more than I'd like to slip into a patchset like this. Secondly,
> > Christoph mentions that SLAB_DESTROY_BY_RCU is used to keep anon_vma cache-hot.
> > For these reasons, removing RCU from these paths and adding the refcount
> > in others is a patch that should stand on its own.
> 
> Hmmm...
> I haven't understand your mention because I guess I was wrong.
> 
> probably my last question was unclear. I mean,
> 
> 1) If we still need SLAB_DESTROY_BY_RCU, why do we need to add refcount?
>     Which difference is exist between normal page migration and compaction?

The processes typically calling migration today own the pages they are moving
and are not going to exit unexpectedly during migration.

> 2) If we added refcount, which race will solve?
> 

The process exiting and the last anon_vma being dropped while compaction
is running. This can be reliably triggered with compaction.

> IOW, Is this patch fix old issue or compaction specific issue?
> 

Strictly speaking, it's an old issue but in practice it's impossible to
trigger because the process migrating always owns the page. Compaction
moves pages belonging to arbitrary processes.
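
Roughly, in the style of the scenario quoted earlier (an illustration of
the timing, not a trace of the actual code):

 1. Compaction isolates an anonymous page belonging to some other process.
 2. While the mapcount is still positive, the anon_vma is looked up and a
    reference (the external_refcount from patch 1/11) is taken.
 3. The owning process exits in the middle of the migration; the mapcount
    drops to zero and the last vma detaches from the anon_vma.
 4. Migration later dereferences the anon_vma again to fix up the migration
    PTEs. Without the reference, the anon_vma could already have been freed
    and reused; with it, the structure stays pinned until migration drops
    the reference.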

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 06/11] Export fragmentation index via /proc/extfrag_index
  2010-03-25  2:47           ` KOSAKI Motohiro
@ 2010-03-25  8:47             ` Mel Gorman
  2010-03-25 11:20               ` KOSAKI Motohiro
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-25  8:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Thu, Mar 25, 2010 at 11:47:17AM +0900, KOSAKI Motohiro wrote:
> > On Tue, Mar 23, 2010 at 09:22:04AM +0900, KOSAKI Motohiro wrote:
> > > > > > +	/*
> > > > > > +	 * Index is between 0 and 1 so return within 3 decimal places
> > > > > > +	 *
> > > > > > +	 * 0 => allocation would fail due to lack of memory
> > > > > > +	 * 1 => allocation would fail due to fragmentation
> > > > > > +	 */
> > > > > > +	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
> > > > > > +}
> > > > > 
> > > > > Dumb question.
> > > > > your paper (http://portal.acm.org/citation.cfm?id=1375634.1375641) says
> > > > > fragmentation_index = 1 - (TotalFree/SizeRequested)/BlocksFree
> > > > > but your code have extra '1000+'. Why?
> > > > 
> > > > To get an approximation to three decimal places.
> > > 
> > > Do you mean this is poor man's round up logic?
> > 
> > Not exactly.
> > 
> > The intention is to have a value of 968 instead of 0.968231. i.e.
> > instead of a value between 0 and 1, it'll be a value between 0 and 1000
> > that matches the first three digits after the decimal place.
> 
> Let's consider extream case.
> 
> free_pages: 1
> requested: 1
> free_blocks_total: 1
> 
> frag_index = 1000  - ((1000 + 1*1000/1))/1 = -1000
> 
> This is not your intension, I guess. 

Why not?

See this comment

/* Fragmentation index only makes sense when a request would fail */

In your example, there is a free page of the requested size so the allocation
would succeed. In this case, fragmentation index does indeed go negative
but the value is not useful.

> Probably we don't need any round_up/round_down logic. because fragmentation_index
> is only used "if (fragindex >= 0 && fragindex <= 500)" check in try_to_compact_pages().
> +1 or -1 inaccurate can be ignored. iow, I think we can remove '1000+' expression.
> 

This isn't about rounding, it's about having a value that normally is
between 0 and 1 expressed as a number between 0 and 1000 because we
can't use double in the kernel.

> 
> > > Why don't you use DIV_ROUND_UP? likes following,
> > > 
> > > return 1000 - (DIV_ROUND_UP(info->free_pages * 1000 / requested) /  info->free_blocks_total);
> > > 
> > 
> > Because it's not doing the same thing unless I missed something.
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25  8:32                           ` Mel Gorman
@ 2010-03-25  8:56                             ` KOSAKI Motohiro
  2010-03-25  9:18                               ` Mel Gorman
  2010-03-25  9:02                             ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-25  8:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Minchan Kim, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Rik van Riel, linux-kernel, linux-mm

> On Thu, Mar 25, 2010 at 11:49:23AM +0900, KOSAKI Motohiro wrote:
> > > On Fri, Mar 19, 2010 at 03:21:41PM +0900, KOSAKI Motohiro wrote:
> > > > > > then, this logic depend on SLAB_DESTROY_BY_RCU, not refcount.
> > > > > > So, I think we don't need your [1/11] patch.
> > > > > > 
> > > > > > Am I missing something?
> > > > > > 
> > > > > 
> > > > > The refcount is still needed. The anon_vma might be valid, but the
> > > > > refcount is what ensures that the anon_vma is not freed and reused.
> > > > 
> > > > please please why do we need both mechanism. now cristoph is very busy and I am
> > > > de fact reviewer of page migration and mempolicy code. I really hope to understand
> > > > your patch.
> > > 
> > > As in, why not drop the RCU protection of anon_vma altogeter? Mainly, because I
> > > think it would be reaching too far for this patchset and it should be done as
> > > a follow-up. Putting the ref-count everywhere will change the cache-behaviour
> > > of anon_vma more than I'd like to slip into a patchset like this. Secondly,
> > > Christoph mentions that SLAB_DESTROY_BY_RCU is used to keep anon_vma cache-hot.
> > > For these reasons, removing RCU from these paths and adding the refcount
> > > in others is a patch that should stand on its own.
> > 
> > Hmmm...
> > I haven't understand your mention because I guess I was wrong.
> > 
> > probably my last question was unclear. I mean,
> > 
> > 1) If we still need SLAB_DESTROY_BY_RCU, why do we need to add refcount?
> >     Which difference is exist between normal page migration and compaction?
> 
> The processes typically calling migration today own the page they are moving
> and is not going to exit unexpectedly during migration.
> 
> > 2) If we added refcount, which race will solve?
> > 
> 
> The process exiting and the last anon_vma being dropped while compaction
> is running. This can be reliably triggered with compaction.
> 
> > IOW, Is this patch fix old issue or compaction specific issue?
> 
> Strictly speaking, it's an old issue but in practice it's impossible to
> trigger because the process migrating always owns the page. Compaction
> moves pages belonging to arbitrary processes.

Do you mean the current memory hotplug code is broken???
I think compaction needs the refcount, and hotplug also needs it; both migrate another
task's pages.

But I haven't seen a hotplug failure. Am I missing something? Or does compaction
have its own specific race situation?



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25  8:32                           ` Mel Gorman
  2010-03-25  8:56                             ` KOSAKI Motohiro
@ 2010-03-25  9:02                             ` KAMEZAWA Hiroyuki
  2010-03-25  9:09                               ` KOSAKI Motohiro
  2010-03-25 16:16                               ` Minchan Kim
  1 sibling, 2 replies; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-25  9:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Minchan Kim, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Thu, 25 Mar 2010 08:32:35 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> On Thu, Mar 25, 2010 at 11:49:23AM +0900, KOSAKI Motohiro wrote:
> > > On Fri, Mar 19, 2010 at 03:21:41PM +0900, KOSAKI Motohiro wrote: 
> > Hmmm...
> > I haven't understand your mention because I guess I was wrong.
> > 
> > probably my last question was unclear. I mean,
> > 
> > 1) If we still need SLAB_DESTROY_BY_RCU, why do we need to add refcount?
> >     Which difference is exist between normal page migration and compaction?
> 
> The processes typically calling migration today own the page they are moving
> and is not going to exit unexpectedly during migration.
> 
> > 2) If we added refcount, which race will solve?
> > 
> 
> The process exiting and the last anon_vma being dropped while compaction
> is running. This can be reliably triggered with compaction.
> 
> > IOW, Is this patch fix old issue or compaction specific issue?
> > 
> 
> Strictly speaking, it's an old issue but in practice it's impossible to
> trigger because the process migrating always owns the page. Compaction
> moves pages belonging to arbitrary processes.
> 
Kosaki-san,

 IIUC, the race in memory-hotunplug was fixed by this patch [2/11].

 But, this behavior of unmap_and_move() requires access to _freed_
 objects (spinlock). Even if it's safe because of SLAB_DESTROY_BY_RCU,
 it's not a good habit in general.

 After direct compaction, page-migration will be one of the "core" pieces of
 memory management code. Then, I agree with patch [1/11] as our direction for
 keeping sanity and pointing the way to further updates. Maybe adding the
 refcnt and removing RCU in the future is good.

 IMHO, pushing this patch [2/11] as a "BUGFIX" independent of this set, and
 adding anon_vma->refcnt [1/11] and [3/11] in the 1st direct-compaction patch
 series to show the direction, will make sense.
 (I think merging 1/11 and 3/11 will be okay...)

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25  9:09                               ` KOSAKI Motohiro
@ 2010-03-25  9:08                                 ` KAMEZAWA Hiroyuki
  2010-03-25  9:21                                 ` Mel Gorman
  1 sibling, 0 replies; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-25  9:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Minchan Kim, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Thu, 25 Mar 2010 18:09:34 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Thu, 25 Mar 2010 08:32:35 +0000
> > Mel Gorman <mel@csn.ul.ie> wrote:

> >  IIUC, the race in memory-hotunplug was fixed by this patch [2/11].
> > 
> >  But, this behavior of unmap_and_move() requires access to _freed_
> >  objects (spinlock). Even if it's safe because of SLAB_DESTROY_BY_RCU,
> >  it't not good habit in general.
> > 
> >  After direct compaction, page-migration will be one of "core" code of
> >  memory management. Then, I agree to patch [1/11] as our direction for
> >  keeping sanity and showing direction to more updates. Maybe adding
> >  refcnt and removing RCU in futuer is good.
> 
> But Christoph seems oppose to remove SLAB_DESTROY_BY_RCU. then refcount
> is meaningless now. I agree you if we will remove SLAB_DESTROY_BY_RCU
> in the future.
> 
Removing rcu_read_lock/unlock in unmap_and_move() and removing
SLAB_DESTROY_BY_RCU are different stories.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25  9:02                             ` KAMEZAWA Hiroyuki
@ 2010-03-25  9:09                               ` KOSAKI Motohiro
  2010-03-25  9:08                                 ` KAMEZAWA Hiroyuki
  2010-03-25  9:21                                 ` Mel Gorman
  2010-03-25 16:16                               ` Minchan Kim
  1 sibling, 2 replies; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-25  9:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, Mel Gorman, Minchan Kim, Andrew Morton,
	Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Rik van Riel, linux-kernel, linux-mm

> On Thu, 25 Mar 2010 08:32:35 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > On Thu, Mar 25, 2010 at 11:49:23AM +0900, KOSAKI Motohiro wrote:
> > > > On Fri, Mar 19, 2010 at 03:21:41PM +0900, KOSAKI Motohiro wrote: 
> > > Hmmm...
> > > I haven't understand your mention because I guess I was wrong.
> > > 
> > > probably my last question was unclear. I mean,
> > > 
> > > 1) If we still need SLAB_DESTROY_BY_RCU, why do we need to add refcount?
> > >     Which difference is exist between normal page migration and compaction?
> > 
> > The processes typically calling migration today own the page they are moving
> > and is not going to exit unexpectedly during migration.
> > 
> > > 2) If we added refcount, which race will solve?
> > > 
> > 
> > The process exiting and the last anon_vma being dropped while compaction
> > is running. This can be reliably triggered with compaction.
> > 
> > > IOW, Is this patch fix old issue or compaction specific issue?
> > > 
> > 
> > Strictly speaking, it's an old issue but in practice it's impossible to
> > trigger because the process migrating always owns the page. Compaction
> > moves pages belonging to arbitrary processes.
> > 
> Kosaki-san,
> 
>  IIUC, the race in memory-hotunplug was fixed by this patch [2/11].
> 
>  But, this behavior of unmap_and_move() requires access to _freed_
>  objects (spinlock). Even if it's safe because of SLAB_DESTROY_BY_RCU,
>  it't not good habit in general.
> 
>  After direct compaction, page-migration will be one of "core" code of
>  memory management. Then, I agree to patch [1/11] as our direction for
>  keeping sanity and showing direction to more updates. Maybe adding
>  refcnt and removing RCU in futuer is good.

But Christoph seems opposed to removing SLAB_DESTROY_BY_RCU, so the refcount
is meaningless now. I would agree with you if we were going to remove
SLAB_DESTROY_BY_RCU in the future.

A refcount is easier to understand than the RCU trick.


>  IMHO, pushing this patch [2/11] as "BUGFIX" independent of this set and
>  adding anon_vma->refcnt [1/11] and [3/11] in 1st Direct-compaction patch
>  series  to show the direction will makse sense.
>  (I think merging 1/11 and 3/11 will be okay...)

agreed.

> 
> Thanks,
> -Kame
> 
> 




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25  8:56                             ` KOSAKI Motohiro
@ 2010-03-25  9:18                               ` Mel Gorman
  0 siblings, 0 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-25  9:18 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Thu, Mar 25, 2010 at 05:56:25PM +0900, KOSAKI Motohiro wrote:
> > On Thu, Mar 25, 2010 at 11:49:23AM +0900, KOSAKI Motohiro wrote:
> > > > On Fri, Mar 19, 2010 at 03:21:41PM +0900, KOSAKI Motohiro wrote:
> > > > > > > then, this logic depend on SLAB_DESTROY_BY_RCU, not refcount.
> > > > > > > So, I think we don't need your [1/11] patch.
> > > > > > > 
> > > > > > > Am I missing something?
> > > > > > > 
> > > > > > 
> > > > > > The refcount is still needed. The anon_vma might be valid, but the
> > > > > > refcount is what ensures that the anon_vma is not freed and reused.
> > > > > 
> > > > > please please why do we need both mechanism. now cristoph is very busy and I am
> > > > > de fact reviewer of page migration and mempolicy code. I really hope to understand
> > > > > your patch.
> > > > 
> > > > As in, why not drop the RCU protection of anon_vma altogeter? Mainly, because I
> > > > think it would be reaching too far for this patchset and it should be done as
> > > > a follow-up. Putting the ref-count everywhere will change the cache-behaviour
> > > > of anon_vma more than I'd like to slip into a patchset like this. Secondly,
> > > > Christoph mentions that SLAB_DESTROY_BY_RCU is used to keep anon_vma cache-hot.
> > > > For these reasons, removing RCU from these paths and adding the refcount
> > > > in others is a patch that should stand on its own.
> > > 
> > > Hmmm...
> > > I haven't understand your mention because I guess I was wrong.
> > > 
> > > probably my last question was unclear. I mean,
> > > 
> > > 1) If we still need SLAB_DESTROY_BY_RCU, why do we need to add refcount?
> > >     Which difference is exist between normal page migration and compaction?
> > 
> > The processes typically calling migration today own the page they are moving
> > and is not going to exit unexpectedly during migration.
> > 
> > > 2) If we added refcount, which race will solve?
> > > 
> > 
> > The process exiting and the last anon_vma being dropped while compaction
> > is running. This can be reliably triggered with compaction.
> > 
> > > IOW, Is this patch fix old issue or compaction specific issue?
> > 
> > Strictly speaking, it's an old issue but in practice it's impossible to
> > trigger because the process migrating always owns the page. Compaction
> > moves pages belonging to arbitrary processes.
> 
> Do you mean current memroy hotplug code is broken???

I hadn't considered the memory hotplug case but you're right, it's possible
it's at risk.

While compaction can trigger this problem reliably, it's not exactly easy
to trigger. I was triggering it under very heavy memory load with a large
number of very short lived processes (specifically, an excessive compile-based
load). It's possible that memory hotplug has not been tested under similar
situations.

> I think compaction need refcount, hotplug also need it. both they migrate another
> task's page.
> 
> but , I haven't seen hotplug failure. Am I  missing something? or the compaction
> have its specific race situation?
> 

It's worth double-checking.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25  9:09                               ` KOSAKI Motohiro
  2010-03-25  9:08                                 ` KAMEZAWA Hiroyuki
@ 2010-03-25  9:21                                 ` Mel Gorman
  2010-03-25  9:41                                   ` KAMEZAWA Hiroyuki
  2010-03-25 14:35                                   ` Christoph Lameter
  1 sibling, 2 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-25  9:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Minchan Kim, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Thu, Mar 25, 2010 at 06:09:34PM +0900, KOSAKI Motohiro wrote:
> > On Thu, 25 Mar 2010 08:32:35 +0000
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > On Thu, Mar 25, 2010 at 11:49:23AM +0900, KOSAKI Motohiro wrote:
> > > > > On Fri, Mar 19, 2010 at 03:21:41PM +0900, KOSAKI Motohiro wrote: 
> > > > Hmmm...
> > > > I haven't understand your mention because I guess I was wrong.
> > > > 
> > > > probably my last question was unclear. I mean,
> > > > 
> > > > 1) If we still need SLAB_DESTROY_BY_RCU, why do we need to add refcount?
> > > >     Which difference is exist between normal page migration and compaction?
> > > 
> > > The processes typically calling migration today own the page they are moving
> > > and is not going to exit unexpectedly during migration.
> > > 
> > > > 2) If we added refcount, which race will solve?
> > > > 
> > > 
> > > The process exiting and the last anon_vma being dropped while compaction
> > > is running. This can be reliably triggered with compaction.
> > > 
> > > > IOW, Is this patch fix old issue or compaction specific issue?
> > > > 
> > > 
> > > Strictly speaking, it's an old issue but in practice it's impossible to
> > > trigger because the process migrating always owns the page. Compaction
> > > moves pages belonging to arbitrary processes.
> > > 
> > Kosaki-san,
> > 
> >  IIUC, the race in memory-hotunplug was fixed by this patch [2/11].
> > 
> >  But, this behavior of unmap_and_move() requires access to _freed_
> >  objects (spinlock). Even if it's safe because of SLAB_DESTROY_BY_RCU,
> >  it't not good habit in general.
> > 
> >  After direct compaction, page-migration will be one of "core" code of
> >  memory management. Then, I agree to patch [1/11] as our direction for
> >  keeping sanity and showing direction to more updates. Maybe adding
> >  refcnt and removing RCU in futuer is good.
> 
> But Christoph seems oppose to remove SLAB_DESTROY_BY_RCU. then refcount
> is meaningless now.

Christoph is opposed to removing it because of cache-hotness issues more
so than use-after-free concerns. The refcount is needed with or without
SLAB_DESTROY_BY_RCU.

> I agree you if we will remove SLAB_DESTROY_BY_RCU
> in the future.
> 
> refcount is easy understanding than rcu trick.
> 
> 
> >  IMHO, pushing this patch [2/11] as "BUGFIX" independent of this set and
> >  adding anon_vma->refcnt [1/11] and [3/11] in 1st Direct-compaction patch
> >  series  to show the direction will makse sense.
> >  (I think merging 1/11 and 3/11 will be okay...)
> 
> agreed.
> 
> > 
> > Thanks,
> > -Kame
> > 
> > 
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25  9:21                                 ` Mel Gorman
@ 2010-03-25  9:41                                   ` KAMEZAWA Hiroyuki
  2010-03-25  9:59                                     ` KOSAKI Motohiro
  2010-03-25 14:35                                   ` Christoph Lameter
  1 sibling, 1 reply; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-25  9:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Minchan Kim, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Thu, 25 Mar 2010 09:21:32 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> On Thu, Mar 25, 2010 at 06:09:34PM +0900, KOSAKI Motohiro wrote:
> > > On Thu, 25 Mar 2010 08:32:35 +0000
> > > Mel Gorman <mel@csn.ul.ie> wrote:
> > > 
> > > > On Thu, Mar 25, 2010 at 11:49:23AM +0900, KOSAKI Motohiro wrote:
> > > > > > On Fri, Mar 19, 2010 at 03:21:41PM +0900, KOSAKI Motohiro wrote: 
> > > > > Hmmm...
> > > > > I haven't understand your mention because I guess I was wrong.
> > > > > 
> > > > > probably my last question was unclear. I mean,
> > > > > 
> > > > > 1) If we still need SLAB_DESTROY_BY_RCU, why do we need to add refcount?
> > > > >     Which difference is exist between normal page migration and compaction?
> > > > 
> > > > The processes typically calling migration today own the page they are moving
> > > > and is not going to exit unexpectedly during migration.
> > > > 
> > > > > 2) If we added refcount, which race will solve?
> > > > > 
> > > > 
> > > > The process exiting and the last anon_vma being dropped while compaction
> > > > is running. This can be reliably triggered with compaction.
> > > > 
> > > > > IOW, Is this patch fix old issue or compaction specific issue?
> > > > > 
> > > > 
> > > > Strictly speaking, it's an old issue but in practice it's impossible to
> > > > trigger because the process migrating always owns the page. Compaction
> > > > moves pages belonging to arbitrary processes.
> > > > 
> > > Kosaki-san,
> > > 
> > >  IIUC, the race in memory-hotunplug was fixed by this patch [2/11].
> > > 
> > >  But, this behavior of unmap_and_move() requires access to _freed_
> > >  objects (spinlock). Even if it's safe because of SLAB_DESTROY_BY_RCU,
> > >  it't not good habit in general.
> > > 
> > >  After direct compaction, page-migration will be one of "core" code of
> > >  memory management. Then, I agree to patch [1/11] as our direction for
> > >  keeping sanity and showing direction to more updates. Maybe adding
> > >  refcnt and removing RCU in futuer is good.
> > 
> > But Christoph seems oppose to remove SLAB_DESTROY_BY_RCU. then refcount
> > is meaningless now.
> 
> Christoph is opposed to removing it because of cache-hotness issues more
> so than use-after-free concerns. The refcount is needed with or without
> SLAB_DESTROY_BY_RCU.
> 

I wonder if the code that is easiest to read would be something like the following.
==

        if (PageAnon(page)) {
                struct anon_vma *anon_vma = page_lock_anon_vma(page);
		/* to take this lock, this page must be mapped. */
		if (!anon_vma)
			goto uncharge;
		/* increase the refcnt while the anon_vma lock is held */
		atomic_inc(&anon_vma->external_refcount);
		page_unlock_anon_vma(anon_vma);
        }
	....
==
and
==
void anon_vma_free(struct anon_vma *anon_vma)
{
	/*
	 * To increase the refcnt of an anon_vma, anon_vma->lock must be held
	 * via page_lock_anon_vma(), which means the anon_vma still has a
	 * "mapped" page. If this anon_vma is freed by unmap or exit, all
	 * pages under it must already be unmapped, so checking the refcnt
	 * without the lock is ok.
	 */
	if (atomic_read(&anon_vma->external_refcount) > 0)
		return;		/* do nothing */
	kmem_cache_free(anon_vma_cachep, anon_vma);
}
==

Then, rcu_read_lock can be removed in a clean way.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25  9:41                                   ` KAMEZAWA Hiroyuki
@ 2010-03-25  9:59                                     ` KOSAKI Motohiro
  2010-03-25 10:12                                       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-25  9:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, Mel Gorman, Minchan Kim, Andrew Morton,
	Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Rik van Riel, linux-kernel, linux-mm

> > > > Kosaki-san,
> > > > 
> > > >  IIUC, the race in memory-hotunplug was fixed by this patch [2/11].
> > > > 
> > > >  But, this behavior of unmap_and_move() requires access to _freed_
> > > >  objects (spinlock). Even if it's safe because of SLAB_DESTROY_BY_RCU,
> > > >  it't not good habit in general.
> > > > 
> > > >  After direct compaction, page-migration will be one of "core" code of
> > > >  memory management. Then, I agree to patch [1/11] as our direction for
> > > >  keeping sanity and showing direction to more updates. Maybe adding
> > > >  refcnt and removing RCU in futuer is good.
> > > 
> > > But Christoph seems oppose to remove SLAB_DESTROY_BY_RCU. then refcount
> > > is meaningless now.
> > 
> > Christoph is opposed to removing it because of cache-hotness issues more
> > so than use-after-free concerns. The refcount is needed with or without
> > SLAB_DESTROY_BY_RCU.
> > 
> 
> I wonder a code which the easiest to be read will be like following.
> ==
> 
>         if (PageAnon(page)) {
>                 struct anon_vma anon = page_lock_anon_vma(page);
> 		/* to take this lock, this page must be mapped. */
> 		if (!anon_vma)
> 			goto uncharge;
> 		increase refcnt
> 		page_unlock_anon_vma(anon);
>         }
> 	....
> ==

This seems very good and acceptable to me. This refcnt usage
obviously reduces the rcu-lock holding time.

I still think that having no refcount doesn't cause any disaster, but I agree
this is a step forward as a patch.

thanks.


> and
> ==
> void anon_vma_free(struct anon_vma *anon)
> {
> 	/*
> 	 * To increase refcnt of anon-vma, anon_vma->lock should be held by
> 	 * page_lock_anon_vma(). It means anon_vma has a "mapped" page.
> 	 * If this anon is freed by unmap or exit, all pages under this anon
> 	 * must be unmapped. Then, just checking refcnt without lock is ok.
> 	 */
> 	if (check refcnt > 0)
> 		return do nothing
> 	kmem_cache_free(anon);
> }
> ==
> 
> Then, rcu_read_lock can be removed in clean way.
> 
> Thanks,
> -Kame
> 
> 




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25  9:59                                     ` KOSAKI Motohiro
@ 2010-03-25 10:12                                       ` KAMEZAWA Hiroyuki
  2010-03-25 13:39                                         ` Mel Gorman
  2010-03-25 15:29                                         ` Minchan Kim
  0 siblings, 2 replies; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-25 10:12 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Minchan Kim, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Thu, 25 Mar 2010 18:59:25 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > > > > Kosaki-san,
> > > > > 
> > > > >  IIUC, the race in memory-hotunplug was fixed by this patch [2/11].
> > > > > 
> > > > >  But, this behavior of unmap_and_move() requires access to _freed_
> > > > >  objects (spinlock). Even if it's safe because of SLAB_DESTROY_BY_RCU,
> > > > >  it't not good habit in general.
> > > > > 
> > > > >  After direct compaction, page-migration will be one of "core" code of
> > > > >  memory management. Then, I agree to patch [1/11] as our direction for
> > > > >  keeping sanity and showing direction to more updates. Maybe adding
> > > > >  refcnt and removing RCU in futuer is good.
> > > > 
> > > > But Christoph seems oppose to remove SLAB_DESTROY_BY_RCU. then refcount
> > > > is meaningless now.
> > > 
> > > Christoph is opposed to removing it because of cache-hotness issues more
> > > so than use-after-free concerns. The refcount is needed with or without
> > > SLAB_DESTROY_BY_RCU.
> > > 
> > 
> > I wonder a code which the easiest to be read will be like following.
> > ==
> > 
> >         if (PageAnon(page)) {
> >                 struct anon_vma anon = page_lock_anon_vma(page);
> > 		/* to take this lock, this page must be mapped. */
> > 		if (!anon_vma)
> > 			goto uncharge;
> > 		increase refcnt
> > 		page_unlock_anon_vma(anon);
> >         }
> > 	....
> > ==
> 
> This seems very good and acceptable to me. This refcnt usage
> obviously reduce rcu-lock holding time.
> 
> I still think no refcount doesn't cause any disaster. but I agree
> this is forward step patch.
> 

BTW, with the above change and the change in patch [2/11], a page that has
turned into SwapCache and is unmapped but not yet freed will never be migrated.

Mel, could you change the check to something like this?

	if (PageAnon(page)) {
		rcu_read_lock();
		if (!page_mapcount(page)) {
			rcu_read_unlock();
			if (!PageSwapCache(page))
				goto uncharge;
			/* unmapped swap cache can be migrated */
		} else {
			...
		}
	.....
	} else 


Thx,
-Kame



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-19 10:09     ` Mel Gorman
@ 2010-03-25 11:08       ` KOSAKI Motohiro
  2010-03-25 15:11         ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-25 11:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> > Hmm..Hmmm...........
> > 
> > Today, I've reviewed this patch and [11/11] carefully twice. but It is harder to ack.
> > 
> > This patch seems to assume page compaction is faster than direct
> > reclaim. but it often doesn't, because dropping useless page cache is very
> > lightweight operation,
> 
> Two points with that;
> 
> 1. It's very hard to know in advance how often direct reclaim of clean page
>    cache would be enough to satisfy the allocation.

Yeah, this is the main reason why I'd suggest tightly integrating vmscan and compaction.


> 2. Even if it was faster to discard page cache, it's not necessarily
>    faster when the cost of reading that page cache back-in is taken into
>    account

_If_ this is a useful page, you are right. But please remember, in the typical
case the system has lots of no-longer-used pages.

> 
> Lumpy reclaim tries to avoid dumping useful page cache but it is perfectly
> possible for hot data to be discarded because it happened to be located
> near cold data. 

Yeah, I fully agree.

> It's impossible to know in general how much unnecessary IO
> takes place as a result of lumpy reclaim because it depends heavily on the
> system-state when lumpy reclaim starts.

I think this explains why vmscan and compaction shouldn't be separated.
Yes, only vmscan can know it.


> > but page compaction makes a lot of memcpy (i.e. cpu cache
> > pollution). IOW this patch is focusing to hugepage allocation very aggressively, but
> > it seems not enough care to reduce typical workload damage.
> > 
> 
> What typical workload is making aggressive use of high order
> allocations? Typically when such a user is found, effort is spent on
> finding alternatives to high-orders as opposed to worrying about the cost
> of allocating them. There was a focus on huge page allocation because it
> was the most useful test case that was likely to be encountered in practice.
> 
> I can adjust the allocation levels to some other value but it's not typical
> for a system to make very aggressive use of other orders. I could have it
> use random orders but also is not very typical.

If this compaction is triggered only for order-9 allocations, I don't oppose it at all.
PAGE_ALLOC_COSTLY_ORDER is probably also acceptable. I agree that huge page
allocation has caused lots of trouble, but for low orders, in the case where the
system has lots of no-longer-used pages, your logic is worse than the current one.
I worry about it.

My point is, we have to distinguish between discarding useful cached pages and
discarding no-longer-accessed pages. The latter is nearly zero cost. Please
don't consider the page discard itself to be bad; it is the correct page life cycle.
Refusing to discard useless cached pages can end up reducing IO throughput.

> 
> > At first, I would like to clarify current reclaim corner case and how
> > vmscan should do at this mail.
> > 
> > Now we have Lumpy reclaim. It is very excellent solution for externa
> > fragmentation.
> 
> In some situations, it can grind a system to trash for a time. What is far
> more likely is to be dealing with a machine with no swap - something that
> is common in clusters. In this case, lumpy is a lot less likely to succeed
> unless the machine is very quiet. It's just not going to find the contiguous
> page cache it needs to discard and anonymous pages get in the way.
> 
> > but unfortunately it have lots corner case.
> > 
> > Viewpoint 1. Unnecessary IO
> > 
> > isolate_pages() for lumpy reclaim frequently grab very young page. it is often
> > still dirty. then, pageout() is called much.
> > 
> > Unfortunately, page size grained io is _very_ inefficient. it can makes lots disk
> > seek and kill disk io bandwidth.
> > 
> 
> Page-based IO like this has also been reported as being a problem for some
> filesystems. When this happens, lumpy reclaim potentially stalls for a long
> time waiting for the dirty data to be flushed by a flusher thread. Compaction
> does not suffer from the same problem.
> 
> > Viewpoint 2. Unevictable pages 
> > 
> > isolate_pages() for lumpy reclaim can pick up unevictable page. it is obviously
> > undroppable. so if the zone have plenty mlocked pages (it is not rare case on
> > server use case), lumpy reclaim can become very useless.
> > 
> 
> Also true. Potentially, compaction can deal with unevictable pages but it's
> not done in this series as it's significant enough as it is and useful in
> its current form.
> 
> > Viewpoint 3. GFP_ATOMIC allocation failure
> > 
> > Obviously lumpy reclaim can't help GFP_ATOMIC issue.
> > 
> 
> Also true although right now, it's not possible to compact for GFP_ATOMIC
> either. I think it could be done on some cases but I didn't try for it.
> High-order GFP_ATOMIC allocations are still something we simply try and
> avoid rather than deal with within the page allocator.
> 
> > Viewpoint 4. reclaim latency
> > 
> > reclaim latency directly affect page allocation latency. so if lumpy reclaim with
> > much pageout io is slow (often it is), it affect page allocation latency and can
> > reduce end user experience.
> > 
> 
> Also true. When allocation huge pages on a normal desktop for example,
> it scan stall the machine for a number of seconds while reclaim kicks
> in.
> 
> With direct compaction, this does not happen to anywhere near the same
> degree. There are still some stalls because as huge pages get allocated,
> free memory drops until pages have to be reclaimed anyway. The effects
> are a lot less prononced and the operation finishes a lot faster.
> 
> > I really hope that auto page migration help to solve above issue. but sadly this 
> > patch seems doesn't.
> > 
> 
> How do you figure? I think it goes a long way to mitigating the worst of
> the problems you laid out above.

Both lumpy reclaim and page compaction have some advantages and some disadvantages.
However, we already have lumpy reclaim. I hope you remember we are attacking a
very narrow corner case; we have to consider reducing the downside of compaction
as the first priority.
A big benefit that comes with a big downside seems no good.

So, I'd suggest either:
1) don't change the caller sites, but invoke compaction only in very limited situations, or
2) invoke compaction only in situations where lumpy reclaim is a poor fit

In my last mail, I proposed (2), but you seemed to get a bad impression. So
now I propose (1). I mean we should _start_ by treating compaction as a
hugepage-allocation assistance feature, not a generic allocation change.

btw, I hope patch 11/11 gets dropped or improved ;-)



> > Honestly, I think this patch was very impressive and useful at 2-3 years ago.
> > because 1) we didn't have lumpy reclaim 2) we didn't have sane reclaim bail out.
> > then, old vmscan is very heavyweight and inefficient operation for high order reclaim.
> > therefore the downside of adding this page migration is hidden relatively. but...
> > 
> > We have to make an effort to reduce reclaim latency, not adding new latency source.
> 
> I recognise that reclaim latency has been reduced but there is a wall.

If it is a wall, we have to fix this! :)


> The cost of reading the data back in will always be there and on
> swapless systems, it might simply be impossible for lumpy reclaim to do
> what it needs.

Well, I didn't and don't think compaction is useless. I haven't said
compaction is useless. I've been talking about how to avoid the downside mess.


> > Instead, I would recommend tightly integrate page-compaction and lumpy reclaim.
> > I mean 1) reusing lumpy reclaim's neighbor pfn page pickking up logic
> 
> There are a number of difficulties with this. I'm not saying it's impossible,
> but the win is not very clear-cut and there are some disadvantages.
> 
> One, there would have to be exceptions for kswapd in the path because it
> really should continue reclaiming. The reclaim path is already very dense
> and this would add significant compliexity to that path.
> 
> The second difficulty is that the migration and free block selection
> algorithm becomes a lot harder, more expensive and identifying the exit
> conditions presents a significant difficultly. Right now, the selection is
> based on linear scans with straight-forward selection and the exit condition
> is simply when the scanners meet. With the migration scanner based on LRU,
> significant care would have to be taken to ensure that appropriate free blocks
> were chosen to migrate to so that we didn't "migrate from" a block in one
> pass and "migrate to" in another (the reason why I went with linear scans
> in the first place). Identifying when the zone has been compacted and should
> just stop is no longer as straight-forward either.  You'd have to track what
> blocks had been operated on in the past which is potentially a lot of state. To
> maintain this state, an unknown number structures would have to be allocated
> which may re-enter the allocator presenting its own class of problems.
> 
> Third, right now it's very easy to identify when compaction is not going
> to work in advance - simply check the watermarks and make a calculation
> based on fragmentation. With a combined reclaim/compaction step, these
> type of checks would need to be made continually - potentially
> increasing the latency of reclaim albeit very slightly.
> 
> Lastly, with this series, there is very little difference between direct
> compaction and proc-triggered compaction. They share the same code paths
> and all that differs is the exit conditions. If it was integrated into
> reclaim, it becomes a lot less straight-forward to share the code.
> 
> > 2) do page
> > migration instead pageout when the page is some condition (example active or dirty
> > or referenced or swapbacked).
> > 
> 
> Right now, it is identifed when pageout should happen instead of page
> migration. It's known before compaction starts if it's likely to be
> successful or not.
> 

Patch 11/11 says it's known whether compaction is likely to be successful or not,
but not exactly. I think you and I don't have a very different analysis of the
current behaviour; I am merely more pessimistic than you.



> > This patch seems shoot me! /me die. R.I.P. ;-)
> 
> That seems a bit dramatic. Your alternative proposal has some significant
> difficulties and is likely to be very complicated. Also, there is nothing
> to say that this mechanism could not be integrated with lumpy reclaim over
> time once it was shown that useless migration was going on or latencies were
> increased for some workload.
> 
> This patch seems like a far more rational starting point to me than adding
> more complexity to reclaim at the outset.
> 
> > btw please don't use 'hugeadm --set-recommended-min_free_kbytes' at testing.
> 
> It's somewhat important for the type of stress tests I do for huge page
> allocation. Without it, fragmentation avoidance has trouble and the
> results become a lot less repeatable.
> 
> >     To evaluate a case of free memory starvation is very important for this patch
> >     series, I think. I slightly doubt this patch might invoke useless compaction
> >     in such case.
> > 
> 
> I can drop the min_free_kbytes change but the likely result will be that
> allocation success rates will simply be lower. The calculations on
> whether compaction should be used or not are based on watermarks which
> adjust to the value of min_free_kbytes.

Then, do we need a min_free_kbytes auto-adjustment trick?


> > At bottom line, the explict compaction via /proc can be merged soon, I think.
> > but this auto compaction logic seems need more discussion.
> 
> My concern would be that the compaction paths would then be used very
> rarely in practice and we'd get no data on how direct compaction should
> be done.

I almost agree.

Again, I think this patch is attacking a corner-case issue, so I hope it
doesn't create new corner cases. I don't think your approach is completely
broken.

But please remember, compaction might now cause very large LRU shuffling
in the compaction-failure case. That means vmscan might discard very wrong
pages. I worry a lot about that.



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 06/11] Export fragmentation index via /proc/extfrag_index
  2010-03-25  8:47             ` Mel Gorman
@ 2010-03-25 11:20               ` KOSAKI Motohiro
  2010-03-25 14:11                 ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-25 11:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> On Thu, Mar 25, 2010 at 11:47:17AM +0900, KOSAKI Motohiro wrote:
> > > On Tue, Mar 23, 2010 at 09:22:04AM +0900, KOSAKI Motohiro wrote:
> > > > > > > +	/*
> > > > > > > +	 * Index is between 0 and 1 so return within 3 decimal places
> > > > > > > +	 *
> > > > > > > +	 * 0 => allocation would fail due to lack of memory
> > > > > > > +	 * 1 => allocation would fail due to fragmentation
> > > > > > > +	 */
> > > > > > > +	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
> > > > > > > +}
> > > > > > 
> > > > > > Dumb question.
> > > > > > your paper (http://portal.acm.org/citation.cfm?id=1375634.1375641) says
> > > > > > fragmentation_index = 1 - (TotalFree/SizeRequested)/BlocksFree
> > > > > > but your code have extra '1000+'. Why?
> > > > > 
> > > > > To get an approximation to three decimal places.
> > > > 
> > > > Do you mean this is poor man's round up logic?
> > > 
> > > Not exactly.
> > > 
> > > The intention is to have a value of 968 instead of 0.968231. i.e.
> > > instead of a value between 0 and 1, it'll be a value between 0 and 1000
> > > that matches the first three digits after the decimal place.
> > 
> > Let's consider extream case.
> > 
> > free_pages: 1
> > requested: 1
> > free_blocks_total: 1
> > 
> > frag_index = 1000  - ((1000 + 1*1000/1))/1 = -1000
> > 
> > This is not your intension, I guess. 
> 
> Why not?
> 
> See this comment
> 
> /* Fragmentation index only makes sense when a request would fail */
> 
> In your example, there is a free page of the requested size so the allocation
> would succeed. In this case, fragmentation index does indeed go negative
> but the value is not useful.
>
> > Probably we don't need any round_up/round_down logic. because fragmentation_index
> > is only used "if (fragindex >= 0 && fragindex <= 500)" check in try_to_compact_pages().
> > +1 or -1 inaccurate can be ignored. iow, I think we can remove '1000+' expression.
> > 
> 
> This isn't about rounding, it's about having a value that normally is
> between 0 and 1 expressed as a number between 0 and 1000 because we
> can't use double in the kernel.

Sorry, my example was wrong. Here is a new example.

free_pages: 4
requested: 2
free_blocks_total: 4

theory: 1 - (TotalFree/SizeRequested)/BlocksFree
            = 1 - (4/2)/4 = 0.5

code : 1000 - ((1000 + 4*1000/2))/4 = 1000 - (1000 + 2000)/4 = 1000/4 = 250


I don't think this code is just picking up three decimal places. It seems it
might cause many more compaction invocations than the theory suggests.



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-19 10:10       ` Mel Gorman
@ 2010-03-25 11:22         ` KOSAKI Motohiro
  0 siblings, 0 replies; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-25 11:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> > Viewpoint 5. end user surprising
> > 
> > lumpy reclaim can makes swap-out even though the system have lots free
> > memory. end users very surprised it and they can think it is bug.
> > 
> > Also, this swap activity easyly confuse that an administrator decide when
> > install more memory into the system.
> > 
> 
> Compaction in this case is a lot less surprising. If there is enough free
> memory, compaction will trigger automatically without any reclaim.

I fully agree with this.



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25 10:12                                       ` KAMEZAWA Hiroyuki
@ 2010-03-25 13:39                                         ` Mel Gorman
  2010-03-26  3:07                                           ` KOSAKI Motohiro
  2010-03-25 15:29                                         ` Minchan Kim
  1 sibling, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-25 13:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Thu, Mar 25, 2010 at 07:12:29PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 25 Mar 2010 18:59:25 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > > > > Kosaki-san,
> > > > > > 
> > > > > >  IIUC, the race in memory-hotunplug was fixed by this patch [2/11].
> > > > > > 
> > > > > >  But, this behavior of unmap_and_move() requires access to _freed_
> > > > > >  objects (spinlock). Even if it's safe because of SLAB_DESTROY_BY_RCU,
> > > > > >  it't not good habit in general.
> > > > > > 
> > > > > >  After direct compaction, page-migration will be one of "core" code of
> > > > > >  memory management. Then, I agree to patch [1/11] as our direction for
> > > > > >  keeping sanity and showing direction to more updates. Maybe adding
> > > > > >  refcnt and removing RCU in futuer is good.
> > > > > 
> > > > > But Christoph seems oppose to remove SLAB_DESTROY_BY_RCU. then refcount
> > > > > is meaningless now.
> > > > 
> > > > Christoph is opposed to removing it because of cache-hotness issues more
> > > > so than use-after-free concerns. The refcount is needed with or without
> > > > SLAB_DESTROY_BY_RCU.
> > > > 
> > > 
> > > I wonder a code which the easiest to be read will be like following.
> > > ==
> > > 
> > >         if (PageAnon(page)) {
> > >                 struct anon_vma anon = page_lock_anon_vma(page);
> > > 		/* to take this lock, this page must be mapped. */
> > > 		if (!anon_vma)
> > > 			goto uncharge;
> > > 		increase refcnt
> > > 		page_unlock_anon_vma(anon);
> > >         }
> > > 	....
> > > ==
> > 
> > This seems very good and acceptable to me. This refcnt usage
> > obviously reduce rcu-lock holding time.
> > 
> > I still think no refcount doesn't cause any disaster. but I agree
> > this is forward step patch.
> > 
> 
> BTW, by above change and the change in patch [2/11], 
> "A page turnd to be SwapCache and free unmapped but not freed"
> page will be never migrated.
> 

Good point.

> Mel, could you change the check as this ??
> 
> 	if (PageAnon(page)) {
> 		rcu_read_lock();
> 		if (!page_mapcount(page)) {
> 			rcu_read_unlock();
> 			if (!PageSwapCache(page))
> 				goto uncharge;
> 			/* unmapped swap cache can be migrated */
> 		} else {
> 			...
> 		}
> 	.....
> 	} else 
> 

There were minor changes in how the rcu_read_lock is taken and released
based on other comments. With your suggestion, the block now looks like:

        if (PageAnon(page)) {
                rcu_read_lock();
                rcu_locked = 1;

                /*
                 * If the page has no mappings any more, just bail. An
                 * unmapped anon page is likely to be freed soon but worse,
                 * it's possible its anon_vma disappeared between when
                 * the page was isolated and when we reached here while
                 * the RCU lock was not held
                 */
                if (!page_mapcount(page) && !PageSwapCache(page))
                        goto rcu_unlock;

                anon_vma = page_anon_vma(page);
                atomic_inc(&anon_vma->external_refcount);
        }

Thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 06/11] Export fragmentation index via /proc/extfrag_index
  2010-03-25 11:20               ` KOSAKI Motohiro
@ 2010-03-25 14:11                 ` Mel Gorman
  2010-03-26  3:10                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-25 14:11 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Thu, Mar 25, 2010 at 08:20:04PM +0900, KOSAKI Motohiro wrote:
> > On Thu, Mar 25, 2010 at 11:47:17AM +0900, KOSAKI Motohiro wrote:
> > > > On Tue, Mar 23, 2010 at 09:22:04AM +0900, KOSAKI Motohiro wrote:
> > > > > > > > +	/*
> > > > > > > > +	 * Index is between 0 and 1 so return within 3 decimal places
> > > > > > > > +	 *
> > > > > > > > +	 * 0 => allocation would fail due to lack of memory
> > > > > > > > +	 * 1 => allocation would fail due to fragmentation
> > > > > > > > +	 */
> > > > > > > > +	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
> > > > > > > > +}
> > > > > > > 
> > > > > > > Dumb question.
> > > > > > > your paper (http://portal.acm.org/citation.cfm?id=1375634.1375641) says
> > > > > > > fragmentation_index = 1 - (TotalFree/SizeRequested)/BlocksFree
> > > > > > > but your code have extra '1000+'. Why?
> > > > > > 
> > > > > > To get an approximation to three decimal places.
> > > > > 
> > > > > Do you mean this is poor man's round up logic?
> > > > 
> > > > Not exactly.
> > > > 
> > > > The intention is to have a value of 968 instead of 0.968231. i.e.
> > > > instead of a value between 0 and 1, it'll be a value between 0 and 1000
> > > > that matches the first three digits after the decimal place.
> > > 
> > > Let's consider extream case.
> > > 
> > > free_pages: 1
> > > requested: 1
> > > free_blocks_total: 1
> > > 
> > > frag_index = 1000  - ((1000 + 1*1000/1))/1 = -1000
> > > 
> > > This is not your intension, I guess. 
> > 
> > Why not?
> > 
> > See this comment
> > 
> > /* Fragmentation index only makes sense when a request would fail */
> > 
> > In your example, there is a free page of the requested size so the allocation
> > would succeed. In this case, fragmentation index does indeed go negative
> > but the value is not useful.
> >
> > > Probably we don't need any round_up/round_down logic. because fragmentation_index
> > > is only used "if (fragindex >= 0 && fragindex <= 500)" check in try_to_compact_pages().
> > > +1 or -1 inaccurate can be ignored. iow, I think we can remove '1000+' expression.
> > > 
> > 
> > This isn't about rounding, it's about having a value that normally is
> > between 0 and 1 expressed as a number between 0 and 1000 because we
> > can't use double in the kernel.
> 
> Sorry, My example was wrong. new example is here.
> 
> free_pages: 4
> requested: 2
> free_blocks_total: 4
> 
> theory: 1 - (TotalFree/SizeRequested)/BlocksFree
>             = 1 - (4/2)/4 = 0.5
> 
> code : 1000 - ((1000 + 4*1000/2))/4 = 1000 - (1000 + 2000)/4 = 1000/4 = 250
> 
> I don't think this is three decimal picking up code. This seems might makes
> lots compaction invocation rather than theory.
> 

Ok, I cannot apologise for this enough.

Since that paper was published, further work showed that the equation could
be much improved. As part of that, I updated the equation to the following;

double index = 1    - ( (1    + ((double)info->free_pages        / requested)) / info->free_blocks_total);

or when approximated to three decimal places

int index =    1000 - ( (1000 + (        info->free_pages * 1000 / requested)) / info->free_blocks_total);

Your analysis of the paper is perfect. When slotted into a driver program
with your example figures, I get the following results

old equation = 0.500000
current equation = 0.250000
integer approximation = 250
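
For reference, a minimal standalone sketch of such a driver program with the
example figures above (ordinary userspace C, not the kernel code itself):

#include <stdio.h>

int main(void)
{
	/* example inputs from earlier in this thread */
	long free_pages = 4, requested = 2, free_blocks_total = 4;

	/* equation as published in the paper */
	double old_eq = 1.0 - ((double)free_pages / requested) / free_blocks_total;

	/* updated equation used by the patch */
	double cur_eq = 1.0 - (1.0 + ((double)free_pages / requested)) / free_blocks_total;

	/* integer approximation, scaled to 0..1000 as in the kernel code */
	long approx = 1000 - ((1000 + (free_pages * 1000 / requested)) / free_blocks_total);

	printf("old equation = %f\n", old_eq);            /* 0.500000 */
	printf("current equation = %f\n", cur_eq);        /* 0.250000 */
	printf("integer approximation = %ld\n", approx);  /* 250 */
	return 0;
}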

The code as-is is correct and is what I intended. My explanation on the
other hand sucks and I should have remembered that I updated the equation since
I published that paper 2 years ago.

Again, I am extremely sorry for misleading you.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25  9:21                                 ` Mel Gorman
  2010-03-25  9:41                                   ` KAMEZAWA Hiroyuki
@ 2010-03-25 14:35                                   ` Christoph Lameter
  1 sibling, 0 replies; 109+ messages in thread
From: Christoph Lameter @ 2010-03-25 14:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Minchan Kim, Andrew Morton,
	Andrea Arcangeli, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Thu, 25 Mar 2010, Mel Gorman wrote:

> Christoph is opposed to removing it because of cache-hotness issues more
> so than use-after-free concerns. The refcount is needed with or without
> SLAB_DESTROY_BY_RCU.

The issue is only performance. If we can preserve the cache hotness in a
different way (or do things in a completely different manner) then it's
fine.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-25 11:08       ` KOSAKI Motohiro
@ 2010-03-25 15:11         ` Mel Gorman
  2010-03-26  6:01           ` KOSAKI Motohiro
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-25 15:11 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Rik van Riel, linux-kernel, linux-mm

On Thu, Mar 25, 2010 at 08:08:20PM +0900, KOSAKI Motohiro wrote:
> > > Hmm..Hmmm...........
> > > 
> > > Today, I've reviewed this patch and [11/11] carefully twice. but It is harder to ack.
> > > 
> > > This patch seems to assume page compaction is faster than direct
> > > reclaim, but often it isn't, because dropping useless page cache is a very
> > > lightweight operation,
> > 
> > Two points with that;
> > 
> > 1. It's very hard to know in advance how often direct reclaim of clean page
> >    cache would be enough to satisfy the allocation.
> 
> Yeah, This is main reason why I'd suggest tightly integrate vmscan and compaction.
> 
> 
> > 2. Even if it was faster to discard page cache, it's not necessarily
> >    faster when the cost of reading that page cache back-in is taken into
> >    account
> 
> _If_ this is a useful page, you are right. But please remember, in the typical
> case the system has lots of pages that are no longer used.
> 

But we don't *know* that for sure. Lumpy reclaim for example can take an
unused clean page that happened to be surrounded by active hot pages and
throw out the whole lot.

I am not against integrating compaction with lumpy reclaim ultimately,
but I think we should have a good handle on the simple case first before
altering reclaim. In particular, I have concerns about how to efficiently
select pageblocks to migrate to when integrated with lumpy reclaim and how it
should be decided when to reclaim and when to compact with an integrated path.

I think it would be best if we had this basis of comparison and a workload
that turned out to be compaction-intensive to gain a full understanding of
the best integration between reclaim and compaction.

> > Lumpy reclaim tries to avoid dumping useful page cache but it is perfectly
> > possible for hot data to be discarded because it happened to be located
> > near cold data. 
> 
> Yeah, I fully agree.
> 
> > It's impossible to know in general how much unnecessary IO
> > takes place as a result of lumpy reclaim because it depends heavily on the
> > system-state when lumpy reclaim starts.
> 
> I think this explains why vmscan and compaction shouldn't be separated.
>
> Yes, only vmscan can know it.
> 

vmscan doesn't know how much unnecessary IO it generated as a result of
its actions. We can make a guess at it indirectly from vmstat but
that's about it.

> > > but page compaction does a lot of memcpy (i.e. CPU cache
> > > pollution). IOW, this patch focuses very aggressively on hugepage allocation, but
> > > it does not seem to take enough care to reduce the damage to typical workloads.
> > > 
> > 
> > What typical workload is making aggressive use of high order
> > allocations? Typically when such a user is found, effort is spent on
> > finding alternatives to high-orders as opposed to worrying about the cost
> > of allocating them. There was a focus on huge page allocation because it
> > was the most useful test case that was likely to be encountered in practice.
> > 
> > I can adjust the allocation levels to some other value but it's not typical
> > for a system to make very aggressive use of other orders. I could have it
> > use random orders but that also is not very typical.
> 
> If this compaction is triggered only for order-9 allocations, I don't oppose it at all.
> PAGE_ALLOC_COSTLY_ORDER is probably also acceptable. I agree huge page
> allocation has caused a lot of trouble. But for low orders, when the system
> has lots of pages that are no longer used, your logic is worse than the current one.
> I worry about it.
> 

If you insist, I can limit direct compaction for > PAGE_ALLOC_COSTLY_ORDER. The
allocator is already meant to be able to handle these orders without special
assistance and it'd avoid compaction becoming a crutch for subsystems that
suddenly decide it's a great idea to use order-1 or order-2 heavily.

> My point is, we have to distinguish between discarding useful cached pages and
> discarding pages that are no longer accessed. The latter is nearly zero cost.

I am not opposed to moving in this sort of direction, particularly if we
disable compaction for the lower orders. I believe what you are suggesting
is that the allocator would take the steps below (roughly sketched in code after the list)

1. Try allocate from lists
2. If that fails, do something like zone_reclaim_mode and lumpy reclaim
   only pages which are cheap to discard
3. If that fails, try compaction to move around the active pages
4. If that fails, lumpy reclaim 
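
In rough pseudo-code, just to illustrate the ordering (the function names and
signatures here are simplified or made up; cheap_lumpy_reclaim() in particular
is a hypothetical helper, this is not the real allocator slow path):

/*
 * Illustrative sketch only. cheap_lumpy_reclaim() is a hypothetical helper
 * for step 2 and the other calls are heavily simplified stand-ins for the
 * real allocator/reclaim/compaction entry points.
 */
static struct page *alloc_high_order(gfp_t gfp_mask, unsigned int order)
{
	struct page *page;

	/* 1. Try the free lists directly */
	page = get_page_from_freelist(gfp_mask, order);
	if (page)
		return page;

	/* 2. Cheap reclaim: only discard pages that cost nothing to drop */
	cheap_lumpy_reclaim(order);
	page = get_page_from_freelist(gfp_mask, order);
	if (page)
		return page;

	/* 3. Compaction: move active pages around instead of discarding them */
	try_to_compact_pages(gfp_mask, order);
	page = get_page_from_freelist(gfp_mask, order);
	if (page)
		return page;

	/* 4. Full lumpy reclaim as the last resort */
	try_to_free_pages(gfp_mask, order);
	return get_page_from_freelist(gfp_mask, order);
}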

> please
> don't consider page discard itself bad; it is the correct page life cycle.
> Trying to protect useless cached pages from being discarded can reduce IO throughput.
> 

I don't consider it bad as such but I had generally considered compaction to
be better than discarding pages. I take your point though that if we compact
many old pages, it might be a net loss.

> > 
> > > First, I would like to clarify the current reclaim corner cases and what
> > > vmscan should do, in this mail.
> > > 
> > > Now we have lumpy reclaim. It is an excellent solution for external
> > > fragmentation.
> > 
> > In some situations, it can grind a system into thrashing for a time. What is far
> > more likely is to be dealing with a machine with no swap - something that
> > is common in clusters. In this case, lumpy is a lot less likely to succeed
> > unless the machine is very quiet. It's just not going to find the contiguous
> > page cache it needs to discard and anonymous pages get in the way.
> > 
> > > but unfortunately it has lots of corner cases.
> > > 
> > > Viewpoint 1. Unnecessary IO
> > > 
> > > isolate_pages() for lumpy reclaim frequently grabs very young pages. They are often
> > > still dirty, so pageout() gets called a lot.
> > > 
> > > Unfortunately, page-sized granular IO is _very_ inefficient. It can cause lots of disk
> > > seeks and kill disk IO bandwidth.
> > > 
> > 
> > Page-based IO like this has also been reported as being a problem for some
> > filesystems. When this happens, lumpy reclaim potentially stalls for a long
> > time waiting for the dirty data to be flushed by a flusher thread. Compaction
> > does not suffer from the same problem.
> > 
> > > Viewpoint 2. Unevictable pages 
> > > 
> > > isolate_pages() for lumpy reclaim can pick up unevictable pages, which are obviously
> > > undroppable. So if the zone has plenty of mlocked pages (not a rare case for
> > > server use), lumpy reclaim can become quite useless.
> > > 
> > 
> > Also true. Potentially, compaction can deal with unevictable pages but it's
> > not done in this series as it's significant enough as it is and useful in
> > its current form.
> > 
> > > Viewpoint 3. GFP_ATOMIC allocation failure
> > > 
> > > Obviously lumpy reclaim can't help GFP_ATOMIC issue.
> > > 
> > 
> > Also true although right now, it's not possible to compact for GFP_ATOMIC
> > either. I think it could be done in some cases but I didn't try for it.
> > High-order GFP_ATOMIC allocations are still something we simply try and
> > avoid rather than deal with within the page allocator.
> > 
> > > Viewpoint 4. reclaim latency
> > > 
> > > reclaim latency directly affects page allocation latency. So if lumpy reclaim with
> > > a lot of pageout IO is slow (and it often is), it affects page allocation latency and can
> > > hurt the end-user experience.
> > > 
> > 
> > Also true. When allocating huge pages on a normal desktop, for example,
> > it can stall the machine for a number of seconds while reclaim kicks
> > in.
> > 
> > With direct compaction, this does not happen to anywhere near the same
> > degree. There are still some stalls because as huge pages get allocated,
> > free memory drops until pages have to be reclaimed anyway. The effects
> > are a lot less pronounced and the operation finishes a lot faster.
> > 
> > > I really hope that automatic page migration helps to solve the above issues, but sadly this
> > > patch seems not to.
> > > 
> > 
> > How do you figure? I think it goes a long way to mitigating the worst of
> > the problems you laid out above.
> 
> Both lumpy reclaim and page compaction have advantages and disadvantages.
> However, we already have lumpy reclaim. I hope you remember we are attacking
> a very narrow corner case; we have to consider reducing the downside of compaction
> as the first priority.
> A big benefit that comes with a big downside seems no good.
> 
> So, I'd suggest one of two ways:
> 1) don't change the callers, but invoke compaction only in very limited situations, or

I'm ok with enabling compaction only for >= PAGE_ALLOC_COSTLY_ORDER.
This will likely limit it to just huge pages for the moment but even
that would be very useful to me on swapless systems

> 2) invoke compaction only in situations that lumpy reclaim is unfit for
> 
> In my last mail, I proposed (2), but you seemed to get a bad impression. So
> now I propose (1).

1 would be my preference to start with.

After merge, I'd look into "cheap" lumpy reclaim which is used as a
first option, then compaction, then full direct reclaim. Would that be
satisfactory?

> I mean we will _start_ by treating compaction as a
> hugepage allocation assistance feature, not a generic allocation change.
> 

Agreed.

> btw, I hope drop or improve patch 11/11 ;-)
> 

I expect it to be improved over time. The compactfail counter is there to
identify when a bad situation occurs so that the workload can be better
understood. There are different heuristics that could be applied there to
avoid the wait but all of them have disadvantages.

> > > Honestly, I think this patch would have been very impressive and useful 2-3 years ago,
> > > because 1) we didn't have lumpy reclaim and 2) we didn't have sane reclaim bail out.
> > > Back then, the old vmscan was a very heavyweight and inefficient operation for high-order reclaim,
> > > so the downside of adding this page migration would have been relatively hidden. but...
> > > 
> > > We have to make an effort to reduce reclaim latency, not adding new latency source.
> > 
> > I recognise that reclaim latency has been reduced but there is a wall.
> 
> If it is a wall, we have to fix this! :)

Well, the wall I had in mind was IO bandwidth :)

> 
> > The cost of reading the data back in will always be there and on
> > swapless systems, it might simply be impossible for lumpy reclaim to do
> > what it needs.
> 
> Well, I didn't and don't think compaction is useless. I haven't said
> compaction is useless. I've been talking about how to avoid the downsides.
> 
> > > Instead, I would recommend tightly integrating page compaction and lumpy reclaim.
> > > I mean 1) reusing lumpy reclaim's neighbor-pfn page picking logic
> > 
> > There are a number of difficulties with this. I'm not saying it's impossible,
> > but the win is not very clear-cut and there are some disadvantages.
> > 
> > One, there would have to be exceptions for kswapd in the path because it
> > really should continue reclaiming. The reclaim path is already very dense
> > and this would add significant complexity to that path.
> > 
> > The second difficulty is that the migration and free block selection
> > algorithm becomes a lot harder, more expensive and identifying the exit
> > conditions presents a significant difficultly. Right now, the selection is
> > based on linear scans with straight-forward selection and the exit condition
> > is simply when the scanners meet. With the migration scanner based on LRU,
> > significant care would have to be taken to ensure that appropriate free blocks
> > were chosen to migrate to so that we didn't "migrate from" a block in one
> > pass and "migrate to" in another (the reason why I went with linear scans
> > in the first place). Identifying when the zone has been compacted and should
> > just stop is no longer as straight-forward either.  You'd have to track what
> > blocks had been operated on in the past which is potentially a lot of state. To
> > maintain this state, an unknown number of structures would have to be allocated
> > which may re-enter the allocator presenting its own class of problems.
> > 
> > Third, right now it's very easy to identify when compaction is not going
> > to work in advance - simply check the watermarks and make a calculation
> > based on fragmentation. With a combined reclaim/compaction step, these
> > type of checks would need to be made continually - potentially
> > increasing the latency of reclaim albeit very slightly.
> > 
> > Lastly, with this series, there is very little difference between direct
> > compaction and proc-triggered compaction. They share the same code paths
> > and all that differs is the exit conditions. If it was integrated into
> > reclaim, it becomes a lot less straight-forward to share the code.
> > 
> > > 2) do page
> > > migration instead of pageout when the page meets some condition (for example active or dirty
> > > or referenced or swapbacked).
> > > 
> > 
> > Right now, it is identified when pageout should happen instead of page
> > migration. It's known before compaction starts if it's likely to be
> > successful or not.
> > 
> 
> patch 11/11 says, it's known whether it is likely to be successful or not, but not exactly.

Indeed. For example, it might not have been possible to migrate the necessary
pages because they were pagetables, slab etc. It might also be simply memory
pressure. It might look like there should be enough pages for compaction but
there are too many processes allocating at the same time.

> I think you and I don't have very different analyses of the current behavior.
> I am merely more pessimistic than you.
> 

Of course I'm optimistic :)

> 
> 
> > > This patch seems shoot me! /me die. R.I.P. ;-)
> > 
> > That seems a bit dramatic. Your alternative proposal has some significant
> > difficulties and is likely to be very complicated. Also, there is nothing
> > to say that this mechanism could not be integrated with lumpy reclaim over
> > time once it was shown that useless migration was going on or latencies were
> > increased for some workload.
> > 
> > This patch seems like a far more rational starting point to me than adding
> > more complexity to reclaim at the outset.
> > 
> > > btw please don't use 'hugeadm --set-recommended-min_free_kbytes' at testing.
> > 
> > It's somewhat important for the type of stress tests I do for huge page
> > allocation. Without it, fragmentation avoidance has trouble and the
> > results become a lot less repeatable.
> > 
> > >     Evaluating the case of free memory starvation is very important for this patch
> > >     series, I think. I slightly suspect this patch might invoke useless compaction
> > >     in such a case.
> > > 
> > 
> > I can drop the min_free_kbytes change but the likely result will be that
> > allocation success rates will simply be lower. The calculations on
> > whether compaction should be used or not are based on watermarks which
> > adjust to the value of min_free_kbytes.
> 
> Then, do we need a min_free_kbytes auto-adjustment trick?
> 

I have considered this in the past. Specifically that it would be auto-adjusted
the first time a huge page was allocated. I never got around to it though.

> 
> > > At bottom line, the explict compaction via /proc can be merged soon, I think.
> > > but this auto compaction logic seems need more discussion.
> > 
> > My concern would be that the compaction paths would then be used very
> > rarely in practice and we'd get no data on how direct compaction should
> > be done.
> 
> I almost agree.
> 
> Again, I think this patch is attacking a corner-case issue, so I don't
> want it to create new corner cases. I don't think your approach is
> completely broken.
> 
> But please remember, compaction might now cause very large LRU shuffling
> in the compaction failure case. It means vmscan might discard the wrong pages.
> I worry a lot about it.
> 

Would disabling compaction for the lower orders alleviate your concerns?
I have also taken note to investigate how much LRU churn can be avoided.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25 10:12                                       ` KAMEZAWA Hiroyuki
  2010-03-25 13:39                                         ` Mel Gorman
@ 2010-03-25 15:29                                         ` Minchan Kim
  2010-03-26  0:58                                           ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 109+ messages in thread
From: Minchan Kim @ 2010-03-25 15:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

Hi, Kame. 

On Thu, 2010-03-25 at 19:12 +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 25 Mar 2010 18:59:25 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > > > > Kosaki-san,
> > > > > > 
> > > > > >  IIUC, the race in memory-hotunplug was fixed by this patch [2/11].
> > > > > > 
> > > > > >  But, this behavior of unmap_and_move() requires access to _freed_
> > > > > >  objects (spinlock). Even if it's safe because of SLAB_DESTROY_BY_RCU,
> > > > > >  it's not a good habit in general.
> > > > > > 
> > > > > >  After direct compaction, page-migration will be one of "core" code of
> > > > > >  memory management. Then, I agree to patch [1/11] as our direction for
> > > > > >  keeping sanity and showing direction to more updates. Maybe adding
> > > > > >  refcnt and removing RCU in the future is good.
> > > > > 
> > > > > But Christoph seems opposed to removing SLAB_DESTROY_BY_RCU. Then the refcount
> > > > > is meaningless now.
> > > > 
> > > > Christoph is opposed to removing it because of cache-hotness issues more
> > > > so than use-after-free concerns. The refcount is needed with or without
> > > > SLAB_DESTROY_BY_RCU.
> > > > 
> > > 
> > > I wonder if the code that is easiest to read would be like the following.
> > > ==
> > > 
> > >         if (PageAnon(page)) {
> > >                 struct anon_vma anon = page_lock_anon_vma(page);
> > > 		/* to take this lock, this page must be mapped. */
> > > 		if (!anon_vma)
> > > 			goto uncharge;
> > > 		increase refcnt
> > > 		page_unlock_anon_vma(anon);
> > > 		page_unlock_anon_vma(anon_vma);
> > > 	....
> > > ==
> > 
> > This seems very good and acceptable to me. This refcnt usage
> > obviously reduces the rcu-lock holding time.
> > 
> > I still think that having no refcount wouldn't cause any disaster, but I agree
> > this patch is a step forward.
> > 
> 
> BTW, with the above change and the change in patch [2/11],
> a page that "turned into SwapCache and was fully unmapped but not freed"
> will never be migrated.
> 
> Mel, could you change the check as this ??
> 
> 	if (PageAnon(page)) {
> 		rcu_read_lock();
> 		if (!page_mapcount(page)) {
> 			rcu_read_unlock();
> 			if (!PageSwapCache(page))
> 				goto uncharge;
> 			/* unmapped swap cache can be migrated */


In which cases do we have PageAnon && (page_mapcount == 0) && PageSwapCache?
Looking over the code that calls add_to_swap_cache, I found the following places.

1) shrink_page_list
I think this case doesn't matter by isolate_lru_xxx.

2) shmem_swapin
It seems to be !PageAnon

3) shmem_writepage
It seems to be !PageAnon. 

4) do_swap_page
page_add_anon_rmap increases _mapcount before setting page->mapping to anon_vma. 
So It doesn't matter. 


I think following codes in unmap_and_move seems to handle 3) case. 

---
         * Corner case handling:
         * 1. When a new swap-cache page is read into, it is added to the LRU
         * and treated as swapcache but it has no rmap yet.
        ...
        if (!page->mapping) {
                if (!PageAnon(page) && page_has_private(page)) {
                ....
                }    
                goto skip_unmap;
        }    

---

Do we really check PageSwapCache in there?
Do I miss any case?



-- 
Kind regards,
Minchan Kim



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25  9:02                             ` KAMEZAWA Hiroyuki
  2010-03-25  9:09                               ` KOSAKI Motohiro
@ 2010-03-25 16:16                               ` Minchan Kim
  1 sibling, 0 replies; 109+ messages in thread
From: Minchan Kim @ 2010-03-25 16:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Thu, 2010-03-25 at 18:02 +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 25 Mar 2010 08:32:35 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > On Thu, Mar 25, 2010 at 11:49:23AM +0900, KOSAKI Motohiro wrote:
> > > > On Fri, Mar 19, 2010 at 03:21:41PM +0900, KOSAKI Motohiro wrote: 
> > > Hmmm...
> > > I haven't understand your mention because I guess I was wrong.
> > > 
> > > probably my last question was unclear. I mean,
> > > 
> > > 1) If we still need SLAB_DESTROY_BY_RCU, why do we need to add refcount?
> > >     Which difference is exist between normal page migration and compaction?
> > 
> > The processes typically calling migration today own the page they are moving
> > and are not going to exit unexpectedly during migration.
> > 
> > > 2) If we added refcount, which race will solve?
> > > 
> > 
> > The process exiting and the last anon_vma being dropped while compaction
> > is running. This can be reliably triggered with compaction.
> > 
> > > IOW, Is this patch fix old issue or compaction specific issue?
> > > 
> > 
> > Strictly speaking, it's an old issue but in practice it's impossible to
> > trigger because the process migrating always owns the page. Compaction
> > moves pages belonging to arbitrary processes.
> > 
> Kosaki-san,
> 
>  IIUC, the race in memory-hotunplug was fixed by this patch [2/11].
> 
>  But, this behavior of unmap_and_move() requires access to _freed_
>  objects (spinlock). Even if it's safe because of SLAB_DESTROY_BY_RCU,
>  it's not a good habit in general.

I agree kosaki's opinion. 

I guess Mel ran into the problem before this patch.
Apparently, it had a problem like Mel's description.
But we can close the race window with this patch,
so we don't need the new ref counter.

At least, rcu_read_lock prevents the anon_vma from being freed,
so we can hold the anon_vma's spinlock, although it's not a good habit.
As for anon_vma reuse due to SLAB_XXX_RCU, page_check_address and
vma_address can prevent try_to_unmap from doing the wrong thing.


>  After direct compaction, page-migration will be one of "core" code of
>  memory management. Then, I agree to patch [1/11] as our direction for
>  keeping sanity and showing direction to more updates. Maybe adding
>  refcnt and removing RCU in the future is good.


I agree. (use one locking rule)
I don't mean that we have to remove SLAB_XXX_RCU.
I want to reduce the two locking rules to just one if we can.
As far as possible, I hope to hide rcu_read_lock with Kame's version.
(Kame's version copied & pasted)
==

       if (PageAnon(page)) {
               struct anon_vma *anon_vma = page_lock_anon_vma(page);
               /* to take this lock, this page must be mapped. */
               if (!anon_vma)
                       goto uncharge;
               increase refcnt
               page_unlock_anon_vma(anon_vma);
       }
       ....
==
and
==
void anon_vma_free(struct anon_vma *anon)
{
       /*
        * To increase refcnt of anon-vma, anon_vma->lock should be held by
        * page_lock_anon_vma(). It means anon_vma has a "mapped" page.
        * If this anon is freed by unmap or exit, all pages under this anon
        * must be unmapped. Then, just checking refcnt without lock is ok.
        */
       if (check refcnt > 0)
               return do nothing
       kmem_cache_free(anon);
}
==
Many locking rules would make things very hard for many contributors.

> 
>  IMHO, pushing this patch [2/11] as "BUGFIX" independent of this set and
>  adding anon_vma->refcnt [1/11] and [3/11] in 1st Direct-compaction patch
>  series to show the direction will make sense.
>  (I think merging 1/11 and 3/11 will be okay...)

Yes. To reduce locking, we can enhance it step by step after merging
[1/11] and [3/11] if others don't oppose it any more.

> 
> Thanks,
> -Kame
> 
> 


-- 
Kind regards,
Minchan Kim



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25 15:29                                         ` Minchan Kim
@ 2010-03-26  0:58                                           ` KAMEZAWA Hiroyuki
  2010-03-26  1:39                                             ` Minchan Kim
  0 siblings, 1 reply; 109+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-26  0:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Fri, 26 Mar 2010 00:29:01 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> Hi, Kame. 
<snip>

> In which cases do we have PageAnon && (page_mapcount == 0) && PageSwapCache?
> Looking over the code that calls add_to_swap_cache, I found the following places.
> 
> 1) shrink_page_list
> I think this case doesn't matter by isolate_lru_xxx.
> 
> 2) shmem_swapin
> It seems to be !PageAnon
> 
> 3) shmem_writepage
> It seems to be !PageAnon. 
> 
> 4) do_swap_page
> page_add_anon_rmap increases _mapcount before setting page->mapping to anon_vma. 
> So It doesn't matter. 

> 
> 
> I think following codes in unmap_and_move seems to handle 3) case. 
> 
> ---
>          * Corner case handling:
>          * 1. When a new swap-cache page is read into, it is added to the LRU
>          * and treated as swapcache but it has no rmap yet.
>         ...
>         if (!page->mapping) {
>                 if (!PageAnon(page) && page_has_private(page)) {
>                 ....
>                 }    
>                 goto skip_unmap;
>         }    
> 
> ---
> 
> Do we really check PageSwapCache in there?
> Do I miss any case?
> 

When a page is fully unmapped, page->mapping is not cleared.

from rmap.c
==
 734 void page_remove_rmap(struct page *page)
 735 {
	....
 758         /*
 759          * It would be tidy to reset the PageAnon mapping here,
 760          * but that might overwrite a racing page_add_anon_rmap
 761          * which increments mapcount after us but sets mapping
 762          * before us: so leave the reset to free_hot_cold_page,
 763          * and remember that it's only reliable while mapped.
 764          * Leaving it set also helps swapoff to reinstate ptes
 765          * faster for those pages still in swapcache.
 766          */
 767 }
==

What happens at memory reclaim is...

	the first vmscan
	1. isolate a page from LRU.
	2. add_to_swap_cache it.
	3. try_to_unmap it
	4. pageout it (PG_reclaim && PG_writeback)
	5. move page to the tail of LRU.
	.....<after some time>
	6. I/O ends and PG_writeback is cleared.

Here, in above cycle, the page is not freed. Still in LRU list.
	next vmscan
	7. isolate a page from LRU.
	8. finds a unmapped clean SwapCache
	9. drop it.

So, to _free_ unmapped SwapCache, sequence 7-9 must happen.
If enough memory is freed by the first iteration of vmscan before the I/O ends,
the next vmscan doesn't happen. Then, we have an "unmapped clean SwapCache which has
an anon_vma pointer on page->mapping" on the LRU.
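
So, if we want such pages to stay migratable, the bail-out in
unmap_and_move() would need to look something like the sketch below (a rough
sketch only, reusing the names already in this series rather than the exact
code in Mel's tree):

	if (PageAnon(page)) {
		rcu_read_lock();
		rcu_locked = 1;

		/*
		 * Bail out only when the anon page is unmapped _and_ not
		 * swapcache, so that an unmapped-but-clean swapcache page
		 * like the one described above can still be migrated.
		 */
		if (!page_mapped(page) && !PageSwapCache(page))
			goto rcu_unlock;

		anon_vma = page_anon_vma(page);
		atomic_inc(&anon_vma->external_refcount);
	}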

Thanks,
-Kame


	



	



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped  anonymous pages
  2010-03-26  0:58                                           ` KAMEZAWA Hiroyuki
@ 2010-03-26  1:39                                             ` Minchan Kim
  0 siblings, 0 replies; 109+ messages in thread
From: Minchan Kim @ 2010-03-26  1:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Fri, Mar 26, 2010 at 9:58 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 26 Mar 2010 00:29:01 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> Hi, Kame.
> <snip>
>
>> In which cases do we have PageAnon && (page_mapcount == 0) && PageSwapCache?
>> Looking over the code that calls add_to_swap_cache, I found the following places.
>>
>> 1) shrink_page_list
>> I think this case doesn't matter by isolate_lru_xxx.
>>
>> 2) shmem_swapin
>> It seems to be !PageAnon
>>
>> 3) shmem_writepage
>> It seems to be !PageAnon.
>>
>> 4) do_swap_page
>> page_add_anon_rmap increases _mapcount before setting page->mapping to anon_vma.
>> So It doesn't matter.
>
>>
>>
>> I think following codes in unmap_and_move seems to handle 3) case.
>>
>> ---
>>          * Corner case handling:
>>          * 1. When a new swap-cache page is read into, it is added to the LRU
>>          * and treated as swapcache but it has no rmap yet.
>>         ...
>>         if (!page->mapping) {
>>                 if (!PageAnon(page) && page_has_private(page)) {
>>                 ....
>>                 }
>>                 goto skip_unmap;
>>         }
>>
>> ---
>>
>> Do we really check PageSwapCache in there?
>> Do I miss any case?
>>
>
> When a page is fully unmapped, page->mapping is not cleared.
>
> from rmap.c
> ==
>  734 void page_remove_rmap(struct page *page)
>  735 {
>        ....
>  758         /*
>  759          * It would be tidy to reset the PageAnon mapping here,
>  760          * but that might overwrite a racing page_add_anon_rmap
>  761          * which increments mapcount after us but sets mapping
>  762          * before us: so leave the reset to free_hot_cold_page,
>  763          * and remember that it's only reliable while mapped.
>  764          * Leaving it set also helps swapoff to reinstate ptes
>  765          * faster for those pages still in swapcache.
>  766          */
>  767 }
> ==
>
> What happens at memory reclaim is...
>
>        the first vmscan
>        1. isolate a page from LRU.
>        2. add_to_swap_cache it.
>        3. try_to_unmap it
>        4. pageout it (PG_reclaim && PG_writeback)
>        5. move page to the tail of LRU.
>        .....<after some time>
>        6. I/O ends and PG_writeback is cleared.
>
> Here, in above cycle, the page is not freed. Still in LRU list.
>        next vmscan
>        7. isolate a page from LRU.
>        8. finds a unmapped clean SwapCache
>        9. drop it.
>
> So, to _free_ unmapped SwapCache, sequence 7-9 must happen.
> If enough memory is freed by the first iteration of vmscan before the I/O ends,
> the next vmscan doesn't happen. Then, we have an "unmapped clean SwapCache which has
> an anon_vma pointer on page->mapping" on the LRU.

Thanks for opening my eyes, Kame. :)



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-25 13:39                                         ` Mel Gorman
@ 2010-03-26  3:07                                           ` KOSAKI Motohiro
  2010-03-26 13:49                                             ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-26  3:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Minchan Kim, Andrew Morton,
	Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Rik van Riel, linux-kernel, linux-mm

very small nit

> There were minor changes in how the rcu_read_lock is taken and released
> based on other comments. With your suggestion, the block now looks like;
> 
>         if (PageAnon(page)) {
>                 rcu_read_lock();
>                 rcu_locked = 1;
> 
>                 /*
>                  * If the page has no mappings any more, just bail. An
> >                  * unmapped anon page is likely to be freed soon but worse,
> >                  * it's possible its anon_vma disappeared between when
>                  * the page was isolated and when we reached here while
>                  * the RCU lock was not held
>                  */
>                 if (!page_mapcount(page) && !PageSwapCache(page))

                        page_mapped?

>                         goto rcu_unlock;
> 
>                 anon_vma = page_anon_vma(page);
>                 atomic_inc(&anon_vma->external_refcount);
>         }




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 06/11] Export fragmentation index via /proc/extfrag_index
  2010-03-25 14:11                 ` Mel Gorman
@ 2010-03-26  3:10                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-26  3:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> On Thu, Mar 25, 2010 at 08:20:04PM +0900, KOSAKI Motohiro wrote:
> > > On Thu, Mar 25, 2010 at 11:47:17AM +0900, KOSAKI Motohiro wrote:
> > > > > On Tue, Mar 23, 2010 at 09:22:04AM +0900, KOSAKI Motohiro wrote:
> > > > > > > > > +	/*
> > > > > > > > > +	 * Index is between 0 and 1 so return within 3 decimal places
> > > > > > > > > +	 *
> > > > > > > > > +	 * 0 => allocation would fail due to lack of memory
> > > > > > > > > +	 * 1 => allocation would fail due to fragmentation
> > > > > > > > > +	 */
> > > > > > > > > +	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
> > > > > > > > > +}
> > > > > > > > 
> > > > > > > > Dumb question.
> > > > > > > > your paper (http://portal.acm.org/citation.cfm?id=1375634.1375641) says
> > > > > > > > fragmentation_index = 1 - (TotalFree/SizeRequested)/BlocksFree
> > > > > > > > but your code have extra '1000+'. Why?
> > > > > > > 
> > > > > > > To get an approximation to three decimal places.
> > > > > > 
> > > > > > Do you mean this is poor man's round up logic?
> > > > > 
> > > > > Not exactly.
> > > > > 
> > > > > The intention is to have a value of 968 instead of 0.968231. i.e.
> > > > > instead of a value between 0 and 1, it'll be a value between 0 and 1000
> > > > > that matches the first three digits after the decimal place.
> > > > 
> > > > Let's consider extream case.
> > > > 
> > > > free_pages: 1
> > > > requested: 1
> > > > free_blocks_total: 1
> > > > 
> > > > frag_index = 1000  - ((1000 + 1*1000/1))/1 = -1000
> > > > 
> > > > This is not your intension, I guess. 
> > > 
> > > Why not?
> > > 
> > > See this comment
> > > 
> > > /* Fragmentation index only makes sense when a request would fail */
> > > 
> > > In your example, there is a free page of the requested size so the allocation
> > > would succeed. In this case, fragmentation index does indeed go negative
> > > but the value is not useful.
> > >
> > > > Probably we don't need any round_up/round_down logic. because fragmentation_index
> > > > is only used "if (fragindex >= 0 && fragindex <= 500)" check in try_to_compact_pages().
> > > > +1 or -1 inaccurate can be ignored. iow, I think we can remove '1000+' expression.
> > > > 
> > > 
> > > This isn't about rounding, it's about having a value that normally is
> > > between 0 and 1 expressed as a number between 0 and 1000 because we
> > > can't use double in the kernel.
> > 
> > Sorry, my example was wrong. A new example is here.
> > 
> > free_pages: 4
> > requested: 2
> > free_blocks_total: 4
> > 
> > theory: 1 - (TotalFree/SizeRequested)/BlocksFree
> >             = 1 - (4/2)/4 = 0.5
> > 
> > code : 1000 - ((1000 + 4*1000/2))/4 = 1000 - (1000 + 2000)/4 = 1000/4 = 250
> > 
> > I don't think this is just code that picks the value up to three decimal places. It seems
> > this might cause a lot more compaction invocations than the theory would.
> > 
> 
> Ok, I cannot apologise for this enough.
> 
> Since that paper was published, further work showed that the equation could
> be much improved. As part of that, I updated the equation to the following;
> 
> double index = 1    - ( (1    + ((double)info->free_pages        / requested)) / info->free_blocks_total);
> 
> or when approximated to three decimal places
> 
> int index =    1000 - ( (1000 + (        info->free_pages * 1000 / requested)) / info->free_blocks_total);
> 
> Your analysis of the paper is perfect. When slotted into a driver program
> with your example figures, I get the following results
> 
> old equation = 0.500000
> current equation = 0.250000
> integer approximation = 250
> 
> The code as-is is correct and is what I intended. My explanation on the
> other hand sucks and I should have remembered that I updated the equation since
> I published that paper 2 years ago.
> 
> Again, I am extremely sorry for misleading you.

No worries at all. It is merely review. I have no objection to this equation if it is intentional. :)




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 10/11] Direct compact when a high-order allocation fails
  2010-03-25 15:11         ` Mel Gorman
@ 2010-03-26  6:01           ` KOSAKI Motohiro
  0 siblings, 0 replies; 109+ messages in thread
From: KOSAKI Motohiro @ 2010-03-26  6:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

> If you insist, I can limit direct compaction for > PAGE_ALLOC_COSTLY_ORDER. The
> allocator is already meant to be able to handle these orders without special
> assistance and it'd avoid compaction becoming a crutch for subsystems that
> suddenly decide it's a great idea to use order-1 or order-2 heavily.
> 
> > My point is, we have to distinguish between discarding useful cached pages and
> > discarding pages that are no longer accessed. The latter is nearly zero cost.
> 
> I am not opposed to moving in this sort of direction, particularly if we
> disable compaction for the lower orders. I believe what you are suggesting
> is that the allocator would take the steps
> 
> 1. Try allocate from lists
> 2. If that fails, do something like zone_reclaim_mode and lumpy reclaim
>    only pages which are cheap to discard
> 3. If that fails, try compaction to move around the active pages
> 4. If that fails, lumpy reclaim 

This seems to make a lot of sense.
I think the todos are

1) Almost no system uses zone_reclaim now. We need to consider whether to
    enable zone_reclaim by default or not.
2) The current zone_reclaim doesn't have a light reclaim mode; it starts reclaim at priority=5.
    We need to consider whether to add a new zone reclaim mode or not.


> > please
> > don't consider page discard itself bad; it is the correct page life cycle.
> > Trying to protect useless cached pages from being discarded can reduce IO throughput.
> 
> I don't consider it bad as such but I had generally considered compaction to
> be better than discarding pages. I take your point though that if we compact
> many old pages, it might be a net loss.

thanks.


> > > How do you figure? I think it goes a long way to mitigating the worst of
> > > the problems you laid out above.
> > 
> > Both lumpy reclaim and page compaction have advantages and disadvantages.
> > However, we already have lumpy reclaim. I hope you remember we are attacking
> > a very narrow corner case; we have to consider reducing the downside of compaction
> > as the first priority.
> > A big benefit that comes with a big downside seems no good.
> > 
> > So, I'd suggest one of two ways:
> > 1) don't change the callers, but invoke compaction only in very limited situations, or
> 
> I'm ok with enabling compaction only for >= PAGE_ALLOC_COSTLY_ORDER.
> This will likely limit it to just huge pages for the moment but even
> that would be very useful to me on swapless systems

Agreed! thanks.

sidenote: I don't think this is only a feature for swapless systems. For example, btrfs
doesn't have a pageout implementation, which means btrfs can't use lumpy reclaim.
Page compaction can help to solve this issue.


> > 2) invoke compaction only in situations that lumpy reclaim is unfit for
> > 
> > In my last mail, I proposed (2), but you seemed to get a bad impression. So
> > now I propose (1).
> 
> 1 would be my preference to start with.
> 
> After merge, I'd look into "cheap" lumpy reclaim which is used as a
> first option, then compaction, then full direct reclaim. Would that be
> satisfactory?

Yeah! this is very nice for me!


> > I mean we will _start_ by treating compaction as a
> > hugepage allocation assistance feature, not a generic allocation change.
> > 
> 
> Agreed.
> 
> > btw, I hope drop or improve patch 11/11 ;-)
> 
> I expect it to be improved over time. The compactfail counter is there to
> identify when a bad situation occurs so that the workload can be better
> understood. There are different heuristics that could be applied there to
> avoid the wait but all of them have disadvantages.

great!


> > > > Honestly, I think this patch would have been very impressive and useful 2-3 years ago,
> > > > because 1) we didn't have lumpy reclaim and 2) we didn't have sane reclaim bail out.
> > > > Back then, the old vmscan was a very heavyweight and inefficient operation for high-order reclaim,
> > > > so the downside of adding this page migration would have been relatively hidden. but...
> > > > 
> > > > We have to make an effort to reduce reclaim latency, not adding new latency source.
> > > 
> > > I recognise that reclaim latency has been reduced but there is a wall.
> > 
> > If it is a wall, we have to fix this! :)
> 
> Well, the wall I had in mind was IO bandwidth :)

ok, I caught your meaning.

> > > Right now, it is identified when pageout should happen instead of page
> > > migration. It's known before compaction starts if it's likely to be
> > > successful or not.
> > > 
> > 
> > patch 11/11 says, it's known whether it is likely to be successful or not, but not exactly.
> 
> Indeed. For example, it might not have been possible to migrate the necessary
> pages because they were pagetables, slab etc. It might also be simply memory
> pressure. It might look like there should be enough pages for compaction but
> there are too many processes allocating at the same time.

agreed.


> > > I can drop the min_free_kbytes change but the likely result will be that
> > > allocation success rates will simply be lower. The calculations on
> > > whether compaction should be used or not are based on watermarks which
> > > adjust to the value of min_free_kbytes.
> > 
> > Then, do we need a min_free_kbytes auto-adjustment trick?
> 
> I have considered this in the past. Specifically that it would be auto-adjusted
> the first time a huge page was allocated. I never got around to it though.

Hmhm, ok.
we can discuss it as separate patch and separate thread.


> > But please remember, compaction might now cause very large LRU shuffling
> > in the compaction failure case. It means vmscan might discard the wrong pages.
> > I worry a lot about it.
> > 
> 
> Would disabling compaction for the lower orders alleviate your concerns?
> I have also taken note to investigate how much LRU churn can be avoided.

that's really great.

I'm looking forward to your v6 post :)



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-26  3:07                                           ` KOSAKI Motohiro
@ 2010-03-26 13:49                                             ` Mel Gorman
  0 siblings, 0 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-26 13:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Minchan Kim, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	Rik van Riel, linux-kernel, linux-mm

On Fri, Mar 26, 2010 at 12:07:02PM +0900, KOSAKI Motohiro wrote:
> very small nit
> 
> > There were minor changes in how the rcu_read_lock is taken and released
> > based on other comments. With your suggestion, the block now looks like;
> > 
> >         if (PageAnon(page)) {
> >                 rcu_read_lock();
> >                 rcu_locked = 1;
> > 
> >                 /*
> >                  * If the page has no mappings any more, just bail. An
> >                  * unmapped anon page is likely to be freed soon but worse,
> >                  * it's possible its anon_vma disappeared between when
> >                  * the page was isolated and when we reached here while
> >                  * the RCU lock was not held
> >                  */
> >                 if (!page_mapcount(page) && !PageSwapCache(page))
> 
>                         page_mapped?
> 

Will be fixed in V6.

Thanks

> >                         goto rcu_unlock;
> > 
> >                 anon_vma = page_anon_vma(page);
> >                 atomic_inc(&anon_vma->external_refcount);
> >         }
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-23 17:22   ` Christoph Lameter
@ 2010-03-23 18:04     ` Mel Gorman
  0 siblings, 0 replies; 109+ messages in thread
From: Mel Gorman @ 2010-03-23 18:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, Mar 23, 2010 at 12:22:57PM -0500, Christoph Lameter wrote:
> On Tue, 23 Mar 2010, Mel Gorman wrote:
> 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 98eaaf2..6eb1efe 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -603,6 +603,19 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >  	 */
> >  	if (PageAnon(page)) {
> >  		rcu_read_lock();
> > +
> > +		/*
> > +		 * If the page has no mappings any more, just bail. An
> > +		 * unmapped anon page is likely to be freed soon but worse,
> > +		 * it's possible its anon_vma disappeared between when
> > +		 * the page was isolated and when we reached here while
> > +		 * the RCU lock was not held
> > +		 */
> > +		if (!page_mapcount(page)) {
> > +			rcu_read_unlock();
> > +			goto uncharge;
> > +		}
> > +
> >  		rcu_locked = 1;
> >  		anon_vma = page_anon_vma(page);
> >  		atomic_inc(&anon_vma->migrate_refcount);
> 
> A way to make this simpler would be to move "rcu_locked = 1" before the
> if statement and then do
> 
> if (!page_mapcount(page))
> 	goto rcu_unlock;
> 

True. Fixed.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-23 12:25 ` [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
@ 2010-03-23 17:22   ` Christoph Lameter
  2010-03-23 18:04     ` Mel Gorman
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Lameter @ 2010-03-23 17:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 23 Mar 2010, Mel Gorman wrote:

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 98eaaf2..6eb1efe 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -603,6 +603,19 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  	 */
>  	if (PageAnon(page)) {
>  		rcu_read_lock();
> +
> +		/*
> +		 * If the page has no mappings any more, just bail. An
> +		 * unmapped anon page is likely to be freed soon but worse,
> +		 * it's possible its anon_vma disappeared between when
> +		 * the page was isolated and when we reached here while
> +		 * the RCU lock was not held
> +		 */
> +		if (!page_mapcount(page)) {
> +			rcu_read_unlock();
> +			goto uncharge;
> +		}
> +
>  		rcu_locked = 1;
>  		anon_vma = page_anon_vma(page);
>  		atomic_inc(&anon_vma->migrate_refcount);

A way to make this simpler would be to move "rcu_locked = 1" before the
if statement and then do

if (!page_mapcount(page))
	goto rcu_unlock;


^ permalink raw reply	[flat|nested] 109+ messages in thread

* [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-23 12:25 [PATCH 0/11] Memory Compaction v5 Mel Gorman
@ 2010-03-23 12:25 ` Mel Gorman
  2010-03-23 17:22   ` Christoph Lameter
  0 siblings, 1 reply; 109+ messages in thread
From: Mel Gorman @ 2010-03-23 12:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

rmap_walk_anon() was triggering errors in memory compaction that look like
use-after-free errors. The problem is that between the page being isolated
from the LRU and rcu_read_lock() being taken, the mapcount of the page
dropped to 0 and the anon_vma gets freed. This can happen during memory
compaction if pages being migrated belong to a process that exits before
migration completes. Hence, the use-after-free race looks like

 1. Page isolated for migration
 2. Process exits
 3. page_mapcount(page) drops to zero so anon_vma was no longer reliable
 4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
 5. call try_to_unmap, looks up the anon_vma and "locks" it but the lock
    is garbage.

This patch checks the mapcount after the rcu lock is taken. If the
mapcount is zero, the anon_vma is assumed to be freed and no further
action is taken.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/migrate.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 98eaaf2..6eb1efe 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -603,6 +603,19 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	 */
 	if (PageAnon(page)) {
 		rcu_read_lock();
+
+		/*
+		 * If the page has no mappings any more, just bail. An
+		 * unmapped anon page is likely to be freed soon but worse,
+		 * it's possible its anon_vma disappeared between when
+		 * the page was isolated and when we reached here while
+		 * the RCU lock was not held
+		 */
+		if (!page_mapcount(page)) {
+			rcu_read_unlock();
+			goto uncharge;
+		}
+
 		rcu_locked = 1;
 		anon_vma = page_anon_vma(page);
 		atomic_inc(&anon_vma->migrate_refcount);
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 109+ messages in thread

end of thread, other threads:[~2010-03-26 13:49 UTC | newest]

Thread overview: 109+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-12 16:41 [PATCH 0/11] Memory Compaction v4 Mel Gorman
2010-03-12 16:41 ` [PATCH 01/11] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
2010-03-14 15:01   ` Minchan Kim
2010-03-15  5:06   ` KAMEZAWA Hiroyuki
2010-03-17  1:44   ` KOSAKI Motohiro
2010-03-17 11:45     ` Mel Gorman
2010-03-17 16:38       ` Christoph Lameter
2010-03-18 11:12         ` Mel Gorman
2010-03-18 16:31           ` Christoph Lameter
2010-03-12 16:41 ` [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
2010-03-15  0:28   ` Minchan Kim
2010-03-15  5:34     ` KAMEZAWA Hiroyuki
2010-03-15  6:28       ` Minchan Kim
2010-03-15  6:44         ` KAMEZAWA Hiroyuki
2010-03-15  7:09           ` KAMEZAWA Hiroyuki
2010-03-15 13:48             ` Minchan Kim
2010-03-15  7:11           ` Minchan Kim
2010-03-15 11:28       ` Mel Gorman
2010-03-15 12:48         ` Minchan Kim
2010-03-15 14:21           ` Mel Gorman
2010-03-15 14:33             ` Minchan Kim
2010-03-15 23:49             ` KAMEZAWA Hiroyuki
2010-03-17  2:12               ` KAMEZAWA Hiroyuki
2010-03-17  3:00                 ` Minchan Kim
2010-03-17  3:15                   ` KAMEZAWA Hiroyuki
2010-03-17  4:15                     ` Minchan Kim
2010-03-17  4:19                       ` KAMEZAWA Hiroyuki
2010-03-17 16:41                     ` Christoph Lameter
2010-03-18  0:30                       ` KAMEZAWA Hiroyuki
2010-03-17 12:07                 ` Mel Gorman
2010-03-17  2:03             ` KOSAKI Motohiro
2010-03-17 11:51               ` Mel Gorman
2010-03-18  0:48                 ` KOSAKI Motohiro
2010-03-18 11:14                   ` Mel Gorman
2010-03-19  6:21                     ` KOSAKI Motohiro
2010-03-19  8:59                       ` Mel Gorman
2010-03-25  2:49                         ` KOSAKI Motohiro
2010-03-25  8:32                           ` Mel Gorman
2010-03-25  8:56                             ` KOSAKI Motohiro
2010-03-25  9:18                               ` Mel Gorman
2010-03-25  9:02                             ` KAMEZAWA Hiroyuki
2010-03-25  9:09                               ` KOSAKI Motohiro
2010-03-25  9:08                                 ` KAMEZAWA Hiroyuki
2010-03-25  9:21                                 ` Mel Gorman
2010-03-25  9:41                                   ` KAMEZAWA Hiroyuki
2010-03-25  9:59                                     ` KOSAKI Motohiro
2010-03-25 10:12                                       ` KAMEZAWA Hiroyuki
2010-03-25 13:39                                         ` Mel Gorman
2010-03-26  3:07                                           ` KOSAKI Motohiro
2010-03-26 13:49                                             ` Mel Gorman
2010-03-25 15:29                                         ` Minchan Kim
2010-03-26  0:58                                           ` KAMEZAWA Hiroyuki
2010-03-26  1:39                                             ` Minchan Kim
2010-03-25 14:35                                   ` Christoph Lameter
2010-03-25 16:16                               ` Minchan Kim
2010-03-12 16:41 ` [PATCH 03/11] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
2010-03-12 17:14   ` Rik van Riel
2010-03-15  5:35   ` KAMEZAWA Hiroyuki
2010-03-17  2:06   ` KOSAKI Motohiro
2010-03-12 16:41 ` [PATCH 04/11] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
2010-03-17  2:28   ` KOSAKI Motohiro
2010-03-17 11:32     ` Mel Gorman
2010-03-17 16:37       ` Christoph Lameter
2010-03-17 23:56         ` KOSAKI Motohiro
2010-03-18 11:24           ` Mel Gorman
2010-03-19  6:21             ` KOSAKI Motohiro
2010-03-19 10:16               ` Mel Gorman
2010-03-25  3:28                 ` KOSAKI Motohiro
2010-03-12 16:41 ` [PATCH 05/11] Export unusable free space index via /proc/unusable_index Mel Gorman
2010-03-15  5:41   ` KAMEZAWA Hiroyuki
2010-03-15  9:48     ` Mel Gorman
2010-03-17  2:42   ` KOSAKI Motohiro
2010-03-12 16:41 ` [PATCH 06/11] Export fragmentation index via /proc/extfrag_index Mel Gorman
2010-03-17  2:49   ` KOSAKI Motohiro
2010-03-17 11:33     ` Mel Gorman
2010-03-23  0:22       ` KOSAKI Motohiro
2010-03-23 12:03         ` Mel Gorman
2010-03-25  2:47           ` KOSAKI Motohiro
2010-03-25  8:47             ` Mel Gorman
2010-03-25 11:20               ` KOSAKI Motohiro
2010-03-25 14:11                 ` Mel Gorman
2010-03-26  3:10                   ` KOSAKI Motohiro
2010-03-12 16:41 ` [PATCH 07/11] Memory compaction core Mel Gorman
2010-03-15 13:44   ` Minchan Kim
2010-03-15 14:41     ` Mel Gorman
2010-03-17 10:31   ` KOSAKI Motohiro
2010-03-17 11:40     ` Mel Gorman
2010-03-18  2:35       ` KOSAKI Motohiro
2010-03-18 11:43         ` Mel Gorman
2010-03-19  6:21           ` KOSAKI Motohiro
2010-03-18 17:08     ` Mel Gorman
2010-03-12 16:41 ` [PATCH 08/11] Add /proc trigger for memory compaction Mel Gorman
2010-03-17  3:18   ` KOSAKI Motohiro
2010-03-12 16:41 ` [PATCH 09/11] Add /sys trigger for per-node " Mel Gorman
2010-03-17  3:18   ` KOSAKI Motohiro
2010-03-12 16:41 ` [PATCH 10/11] Direct compact when a high-order allocation fails Mel Gorman
2010-03-16  2:47   ` Minchan Kim
2010-03-19  6:21   ` KOSAKI Motohiro
2010-03-19  6:31     ` KOSAKI Motohiro
2010-03-19 10:10       ` Mel Gorman
2010-03-25 11:22         ` KOSAKI Motohiro
2010-03-19 10:09     ` Mel Gorman
2010-03-25 11:08       ` KOSAKI Motohiro
2010-03-25 15:11         ` Mel Gorman
2010-03-26  6:01           ` KOSAKI Motohiro
2010-03-12 16:41 ` [PATCH 11/11] Do not compact within a preferred zone after a compaction failure Mel Gorman
2010-03-23 12:25 [PATCH 0/11] Memory Compaction v5 Mel Gorman
2010-03-23 12:25 ` [PATCH 02/11] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
2010-03-23 17:22   ` Christoph Lameter
2010-03-23 18:04     ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).