* [PATCH 0/14] Memory Compaction v6
@ 2010-03-30  9:14 ` Mel Gorman
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

These are mostly minor changes based on feedback received on V5. The
biggest difference is the last patch. Kamezawa Hiroyuki pointed out that
PageSwapCache pages should be allowed to migrate even though they are
unmapped anonymous pages. This was not as straightforward as expected,
so I kept the responsible patch separate and at the end of the series.
If it's not correct, it can easily be dropped until it is resolved.

Changelog since V5
  o Rebase to mmotm-2010-03-24-14-48
  o Add more reviewed-by's
  o Correct one spelling error in vmstat.c and clarify some patch leaders
  o Split the LRU isolation modes into a separate patch
  o Correct a NID change
  o Call migrate_prep less frequently
  o Remove unnecessary inlining
  o Do not interfere with memory hot-remove
  o Do not compact for orders <= PAGE_ALLOC_COSTLY_ORDER
  o page_mapped instead of page_mapcount and allow swapcache to migrate
  o Avoid too many pages being isolated for migration
  o Handle PageSwapCache pages during migration

Changelog since V4
  o Remove unnecessary check for PageLRU and PageUnevictable
  o Fix isolated accounting
  o Close race window between page_mapcount and rcu_read_lock
  o Added a lot more Reviewed-by tags

Changelog since V3
  o Document sysfs entries (subsequently merged independently)
  o COMPACTION should depend on MMU
  o Comment updates
  o Ensure proc/sysfs triggering of compaction fully completes
  o Rename anon_vma refcount to external_refcount
  o Rebase to mmotm on top of 2.6.34-rc1

Changelog since V2
  o Move unusable and fragmentation indices to separate proc files
  o Express indices as being between 0 and 1
  o Update copyright notice for compaction.c
  o Avoid infinite loop when split free page fails
  o Init compact_resume at least once (impacted x86 testing)
  o Fewer pages are isolated during compaction.
  o LRU lists are no longer rotated when page is busy
  o NR_ISOLATED_* is updated to avoid isolating too many pages
  o Update zone LRU stats correctly when isolating pages
  o Reference count anon_vma instead of relying on insufficient locking that
    had use-after-free races in memory compaction
  o Watch for unmapped anon pages during migration
  o Remove unnecessary parameters on a few functions
  o Add Reviewed-by's. Note that I didn't add the Acks and Reviewed-bys
    for the proc patches as they have been split out into separate
    files and I don't know if the Acks are still valid.

Changelog since V1
  o Update help blurb on CONFIG_MIGRATION
  o Max unusable free space index is 100, not 1000
  o Move blockpfn forward properly during compaction
  o Cleanup CONFIG_COMPACTION vs CONFIG_MIGRATION confusion
  o Permissions on /proc and /sys files should be 0200
  o Reduce verbosity
  o Compact all nodes when triggered via /proc
  o Add per-node compaction via sysfs
  o Move defer_compaction out-of-line
  o Fix lock oddities in rmap_walk_anon
  o Add documentation

This patchset is a memory compaction mechanism that reduces external
fragmentation by moving GFP_MOVABLE pages into a smaller number of
pageblocks. The term "compaction" was chosen because there are a number
of mechanisms, not mutually exclusive, that can be used to defragment
memory. For example, lumpy reclaim is a form of defragmentation, as was
slub "defragmentation" (really a form of targeted reclaim). Hence, this
is called "compaction" to distinguish it from other forms of
defragmentation.

In this implementation, a full compaction run involves two scanners operating
within a zone - a migration and a free scanner. The migration scanner
starts at the beginning of a zone and finds all movable pages within one
pageblock_nr_pages-sized area and isolates them on a migratepages list. The
free scanner begins at the end of the zone and searches on a per-area
basis for enough free pages to migrate all the pages on the migratepages
list. As each area is respectively migrated or exhausted of free pages,
the scanners are advanced one area.  A compaction run completes within a
zone when the two scanners meet.
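
To illustrate the idea, below is a toy userspace sketch of the two-scanner
design. The zone model, frame markers and function names are invented for
illustration only; it ignores pageblock-sized areas, page isolation and the
real migrate_pages() API, so it is not the mm/compaction.c implementation.

#include <stdio.h>
#include <string.h>

/*
 * Toy illustration of the two-scanner design: a "zone" is an array of
 * frames where 'M' is a movable page, 'U' is unmovable and '.' is free.
 * The migration scanner walks forward from the start of the zone, the
 * free scanner walks backward from the end, and the run completes when
 * they meet.
 */
static void compact_zone_sketch(char *zone, size_t nr_frames)
{
	size_t migrate_pfn = 0;		/* migration scanner, moves forward */
	size_t free_pfn = nr_frames;	/* free scanner, moves backward */

	while (migrate_pfn < free_pfn) {
		if (zone[migrate_pfn] != 'M') {
			migrate_pfn++;	/* nothing movable here */
			continue;
		}

		/* Advance the free scanner to the next free frame */
		while (free_pfn > migrate_pfn && zone[free_pfn - 1] != '.')
			free_pfn--;
		if (free_pfn <= migrate_pfn)
			break;		/* scanners met, run complete */

		/* "Migrate" the movable page into the free frame */
		zone[free_pfn - 1] = 'M';
		zone[migrate_pfn] = '.';
		migrate_pfn++;
	}
}

int main(void)
{
	char zone[] = "M.U.MM.U..M.M...";

	printf("before: %s\n", zone);
	compact_zone_sketch(zone, strlen(zone));
	printf("after:  %s\n", zone);
	return 0;
}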

This method is a bit primitive but it is easy to understand, and greater
sophistication would require maintaining counters on a per-pageblock
basis. That would have a significant impact on allocator fast paths just
to improve compaction, which is a poor trade-off.

It also does not try to relocate virtually contiguous pages to be physically
contiguous. However, assuming transparent hugepages were in use, a
hypothetical khugepaged might reuse compaction code to isolate free pages,
split them and relocate userspace pages for promotion.

Memory compaction can be triggered in one of three ways. It may be triggered
explicitly by writing any value to /proc/sys/vm/compact_memory, which compacts
all of memory. It can be triggered on a per-node basis by writing any
value to /sys/devices/system/node/nodeN/compact where N is the node ID to
be compacted. Finally, when a process fails to allocate a high-order page, it
may compact memory in an attempt to satisfy the allocation instead of entering
direct reclaim. Explicit compaction does not finish until the two scanners
meet; direct compaction ends if a suitable page becomes available that
would meet watermarks.
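
As a minimal sketch, the two explicit triggers can be exercised from
userspace as below. The node path is an example for node 0; both files
accept any written value and, per the changelog, are writable by root only.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write a value to one of the explicit compaction trigger files */
static int trigger_compaction(const char *path)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return -1;
	}
	/* Any value written starts compaction */
	if (write(fd, "1", 1) < 0)
		perror(path);
	return close(fd);
}

int main(void)
{
	/* Compact all of memory */
	trigger_compaction("/proc/sys/vm/compact_memory");

	/* Compact a single node, here node 0 */
	trigger_compaction("/sys/devices/system/node/node0/compact");
	return 0;
}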

The series is in 14 patches. The first three are not "core" to the series
but are important pre-requisites.

Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
	patch, it's possible to use anon_vma after free if the caller is
	not holding a VMA or mmap_sem for the pages in question. While
	there should be no existing user that causes this problem,
	it's a requirement for memory compaction to be stable. The patch
	is at the start of the series for bisection reasons.
Patch 2 skips over anon pages during migration that are no longer mapped
	because there still appeared to be a small window between when
	a page was isolated and migration started during which anon_vma
	could disappear.
Patch 3 merges the KSM and migrate counts. It could be merged with patch 1
	but would be slightly harder to review.
Patch 4 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
Patch 5 exports an "unusable free space index" via /proc/unusable_index. It's
	a measure of external fragmentation that takes the size of the
	allocation request into account. It can also be calculated from
	userspace, so it can be dropped if requested
Patch 6 exports a "fragmentation index" which only has meaning when an
	allocation request fails. It determines if an allocation failure
	would be due to a lack of memory or external fragmentation.
Patch 7 moves the definition for LRU isolation modes for use by compaction
Patch 8 is the compaction mechanism although it's unreachable at this point
Patch 9 adds a means of compacting all of memory with a proc trigger
Patch 10 adds a means of compacting a specific node with a sysfs trigger
Patch 11 adds "direct compaction" before "direct reclaim" if it is
	determined there is a good chance of success.
Patch 12 adds a sysctl that allows tuning of the threshold at which the
	kernel will compact or direct reclaim
Patch 13 temporarily disables compaction if an allocation failure occurs
	after compaction.
Patch 14 allows the migration of PageSwapCache pages. This patch was not
	as straightforward as expected; rmap_walk and migration needed extra
	smarts to avoid problems under heavy memory pressure. It's possible
	that memory hot-remove could be affected.

Testing of compaction was in three stages.  For the test, debugging, preempt,
the sleep watchdog and lockdep were all enabled but nothing nasty popped
out. min_free_kbytes was tuned as recommended by hugeadm to help fragmentation
avoidance and high-order allocations. It was tested on X86, X86-64 and PPC64.

The first test represents one of the easiest cases that can be faced for
lumpy reclaim or memory compaction.

1. Machine freshly booted and configured for hugepage usage with
	a) hugeadm --create-global-mounts
	b) hugeadm --pool-pages-max DEFAULT:8G
	c) hugeadm --set-recommended-min_free_kbytes
	d) hugeadm --set-recommended-shmmax

	The min_free_kbytes here is important. Anti-fragmentation works best
	when pageblocks don't mix. hugeadm knows how to calculate a value that
	will significantly reduce the worst of external-fragmentation-related
	events as reported by the mm_page_alloc_extfrag tracepoint.

2. Load up memory
	a) Start updatedb
	b) Create X files in parallel, each pagesize*128 in size. Wait
	   until the files are created. By parallel, I mean that 4096 instances
	   of dd were launched, one after the other using &. The crude
	   objective is to mix filesystem metadata allocations with
	   the buffer cache.
	c) Delete every second file so that pageblocks are likely to
	   have holes
	d) kill updatedb if it's still running

	At this point, the system is quiet and memory is full, but it is full
	of clean filesystem metadata and clean buffer cache that is unmapped.
	This is readily migrated or discarded, so you'd expect lumpy reclaim
	to have no significant advantage over compaction, but this is at
	the POC stage.

3. In increments, attempt to allocate 5% of memory as hugepages.
	   Measure how long it took, how successful it was, and how many
	   direct reclaims and compactions took place. Note that the
	   compaction figures might not fully add up as compactions
	   can take place for orders other than the hugepage size

X86				vanilla		compaction
Final page count:                   915                916 (attempted 1002)
Total pages reclaimed:            88872               4920

X86-64				vanilla		compaction
Final page count:                   901                901 (attempted 1002)
Total pages reclaimed:           137573              67434

PPC64				vanilla		compaction
Final page count:                    89                 92 (attempted 110)
Total pages reclaimed:            84727               8822

There was not a dramatic improvement in success rates, but one would not be
expected in this case either. What is important is that far fewer pages were
reclaimed in all cases, reducing the amount of IO required to satisfy a huge
page allocation.

The second batch of tests was all performance-related - kernbench, netperf,
iozone and sysbench. None showed anything too remarkable.

The last test was a high-order allocation stress test. Many kernel compiles
are started to fill memory with a pressured mix of unmovable and movable
allocations. During this, an attempt is made to allocate 90% of memory
as huge pages - one at a time with small delays between attempts to avoid
flooding the IO queue.

                                             vanilla   compaction
Percentage of request allocated X86               96           99
Percentage of request allocated X86-64            96           98
Percentage of request allocated PPC64             51           70

Success rates are a little higher, particularly on PPC64 with its larger
huge pages. What is most interesting is the latency when allocating huge
pages.

X86:    http://www.csn.ul.ie/~mel/postings/compaction-20100329/highalloc-interlatency-arnold-compaction-stress-v6r24-mean.ps
X86-64: http://www.csn.ul.ie/~mel/postings/compaction-20100329/highalloc-interlatency-hydra-compaction-stress-v6r24-mean.ps
PPC64:  http://www.csn.ul.ie/~mel/postings/compaction-20100329/highalloc-interlatency-powyah-compaction-stress-v6r24-mean.ps

X86 latency is reduced the least, but it depends heavily on the HIGHMEM
zone to allocate many of its huge pages, which is a relatively
straightforward job. X86-64 and PPC64 both show reductions in the average
time taken to allocate huge pages. It is not reduced to zero because the
system is under enough memory pressure that reclaim is still required for
some of the allocations.

Also enlightening in the same directory are the "stddev" files. Each of
them shows that the variance between allocation times is heavily reduced.

 Documentation/ABI/testing/sysfs-devices-node |    7 +
 Documentation/filesystems/proc.txt           |   25 +-
 Documentation/sysctl/vm.txt                  |   29 ++-
 drivers/base/node.c                          |    3 +
 include/linux/compaction.h                   |   81 ++++
 include/linux/mm.h                           |    1 +
 include/linux/mmzone.h                       |    7 +
 include/linux/rmap.h                         |   27 +-
 include/linux/swap.h                         |    6 +
 include/linux/vmstat.h                       |    2 +
 kernel/sysctl.c                              |   25 ++
 mm/Kconfig                                   |   18 +-
 mm/Makefile                                  |    1 +
 mm/compaction.c                              |  589 ++++++++++++++++++++++++++
 mm/ksm.c                                     |    4 +-
 mm/migrate.c                                 |   28 ++
 mm/page_alloc.c                              |   73 ++++
 mm/rmap.c                                    |   16 +-
 mm/vmscan.c                                  |    5 -
 mm/vmstat.c                                  |  218 ++++++++++
 20 files changed, 1135 insertions(+), 30 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-devices-node
 create mode 100644 include/linux/compaction.h
 create mode 100644 mm/compaction.c


* [PATCH 01/14] mm,migration: Take a reference to the anon_vma before migrating
  2010-03-30  9:14 ` Mel Gorman
@ 2010-03-30  9:14   ` Mel Gorman
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma and it does not appear to have sufficient locking to
ensure the anon_vma does not disappear from under it.

This patch copies an approach used by KSM to take a reference on the
anon_vma while pages are being migrated. This should prevent rmap_walk()
from running into nasty surprises later because the anon_vma has been freed.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/rmap.h |   23 +++++++++++++++++++++++
 mm/migrate.c         |   12 ++++++++++++
 mm/rmap.c            |   10 +++++-----
 3 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..567d43f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -29,6 +29,9 @@ struct anon_vma {
 #ifdef CONFIG_KSM
 	atomic_t ksm_refcount;
 #endif
+#ifdef CONFIG_MIGRATION
+	atomic_t migrate_refcount;
+#endif
 	/*
 	 * NOTE: the LSB of the head.next is set by
 	 * mm_take_all_locks() _after_ taking the above lock. So the
@@ -81,6 +84,26 @@ static inline int ksm_refcount(struct anon_vma *anon_vma)
 	return 0;
 }
 #endif /* CONFIG_KSM */
+#ifdef CONFIG_MIGRATION
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+	atomic_set(&anon_vma->migrate_refcount, 0);
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+	return atomic_read(&anon_vma->migrate_refcount);
+}
+#else
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+	return 0;
+}
+#endif /* CONFIG_MIGRATE */
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
diff --git a/mm/migrate.c b/mm/migrate.c
index 6903abf..06e6316 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -542,6 +542,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	int rcu_locked = 0;
 	int charge = 0;
 	struct mem_cgroup *mem = NULL;
+	struct anon_vma *anon_vma = NULL;
 
 	if (!newpage)
 		return -ENOMEM;
@@ -598,6 +599,8 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	if (PageAnon(page)) {
 		rcu_read_lock();
 		rcu_locked = 1;
+		anon_vma = page_anon_vma(page);
+		atomic_inc(&anon_vma->migrate_refcount);
 	}
 
 	/*
@@ -637,6 +640,15 @@ skip_unmap:
 	if (rc)
 		remove_migration_ptes(page, page);
 rcu_unlock:
+
+	/* Drop an anon_vma reference if we took one */
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
+		int empty = list_empty(&anon_vma->head);
+		spin_unlock(&anon_vma->lock);
+		if (empty)
+			anon_vma_free(anon_vma);
+	}
+
 	if (rcu_locked)
 		rcu_read_unlock();
 uncharge:
diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..578d0fe 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -248,7 +248,8 @@ static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma);
+	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
+					!migrate_refcount(anon_vma);
 	spin_unlock(&anon_vma->lock);
 
 	if (empty)
@@ -273,6 +274,7 @@ static void anon_vma_ctor(void *data)
 
 	spin_lock_init(&anon_vma->lock);
 	ksm_refcount_init(anon_vma);
+	migrate_refcount_init(anon_vma);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
@@ -1338,10 +1340,8 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 	/*
 	 * Note: remove_migration_ptes() cannot use page_lock_anon_vma()
 	 * because that depends on page_mapped(); but not all its usages
-	 * are holding mmap_sem, which also gave the necessary guarantee
-	 * (that this anon_vma's slab has not already been destroyed).
-	 * This needs to be reviewed later: avoiding page_lock_anon_vma()
-	 * is risky, and currently limits the usefulness of rmap_walk().
+	 * are holding mmap_sem. Users without mmap_sem are required to
+	 * take a reference count to prevent the anon_vma disappearing
 	 */
 	anon_vma = page_anon_vma(page);
 	if (!anon_vma)
-- 
1.6.5


* [PATCH 02/14] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-03-30  9:14 ` Mel Gorman
@ 2010-03-30  9:14   ` Mel Gorman
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

rmap_walk_anon() was triggering errors in memory compaction that look like
use-after-free errors. The problem is that, between the page being isolated
from the LRU and rcu_read_lock() being taken, the mapcount of the page can
drop to 0 and the anon_vma can be freed. This can happen during memory
compaction if pages being migrated belong to a process that exits before
migration completes. Hence, the use-after-free race looks like

 1. Page is isolated for migration
 2. Process exits
 3. page_mapcount(page) drops to zero so the anon_vma is no longer reliable
 4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
 5. try_to_unmap() is called, looks up the anon_vma and "locks" it but the
    lock is garbage.

This patch checks the mapcount after the rcu lock is taken. If the
mapcount is zero, the anon_vma is assumed to be freed and no further
action is taken.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/migrate.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 06e6316..5c5c1bd 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -599,6 +599,17 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	if (PageAnon(page)) {
 		rcu_read_lock();
 		rcu_locked = 1;
+
+		/*
+		 * If the page has no mappings any more, just bail. An
+		 * unmapped anon page is likely to be freed soon but worse,
+		 * it's possible its anon_vma disappeared between when
+		 * the page was isolated and when we reached here while
+		 * the RCU lock was not held
+		 */
+		if (!page_mapped(page))
+			goto rcu_unlock;
+
 		anon_vma = page_anon_vma(page);
 		atomic_inc(&anon_vma->migrate_refcount);
 	}
-- 
1.6.5


* [PATCH 03/14] mm: Share the anon_vma ref counts between KSM and page migration
  2010-03-30  9:14 ` Mel Gorman
@ 2010-03-30  9:14   ` Mel Gorman
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

For clarity of review, KSM and page migration have separate refcounts on
the anon_vma. While clear, this is a waste of memory. This patch gets
KSM and page migration to share their toys in a spirit of harmony.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/rmap.h |   50 ++++++++++++++++++--------------------------------
 mm/ksm.c             |    4 ++--
 mm/migrate.c         |    4 ++--
 mm/rmap.c            |    6 ++----
 4 files changed, 24 insertions(+), 40 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 567d43f..7721674 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -26,11 +26,17 @@
  */
 struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
-#ifdef CONFIG_KSM
-	atomic_t ksm_refcount;
-#endif
-#ifdef CONFIG_MIGRATION
-	atomic_t migrate_refcount;
+#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
+
+	/*
+	 * The external_refcount is taken by either KSM or page migration
+	 * to take a reference to an anon_vma when there is no
+	 * guarantee that the vma of page tables will exist for
+	 * the duration of the operation. A caller that takes
+	 * the reference is responsible for clearing up the
+	 * anon_vma if they are the last user on release
+	 */
+	atomic_t external_refcount;
 #endif
 	/*
 	 * NOTE: the LSB of the head.next is set by
@@ -64,46 +70,26 @@ struct anon_vma_chain {
 };
 
 #ifdef CONFIG_MMU
-#ifdef CONFIG_KSM
-static inline void ksm_refcount_init(struct anon_vma *anon_vma)
+#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
+static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
 {
-	atomic_set(&anon_vma->ksm_refcount, 0);
+	atomic_set(&anon_vma->external_refcount, 0);
 }
 
-static inline int ksm_refcount(struct anon_vma *anon_vma)
+static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
 {
-	return atomic_read(&anon_vma->ksm_refcount);
+	return atomic_read(&anon_vma->external_refcount);
 }
 #else
-static inline void ksm_refcount_init(struct anon_vma *anon_vma)
+static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
 {
 }
 
-static inline int ksm_refcount(struct anon_vma *anon_vma)
+static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
 {
 	return 0;
 }
 #endif /* CONFIG_KSM */
-#ifdef CONFIG_MIGRATION
-static inline void migrate_refcount_init(struct anon_vma *anon_vma)
-{
-	atomic_set(&anon_vma->migrate_refcount, 0);
-}
-
-static inline int migrate_refcount(struct anon_vma *anon_vma)
-{
-	return atomic_read(&anon_vma->migrate_refcount);
-}
-#else
-static inline void migrate_refcount_init(struct anon_vma *anon_vma)
-{
-}
-
-static inline int migrate_refcount(struct anon_vma *anon_vma)
-{
-	return 0;
-}
-#endif /* CONFIG_MIGRATE */
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
diff --git a/mm/ksm.c b/mm/ksm.c
index 8cdfc2a..3666d43 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -318,14 +318,14 @@ static void hold_anon_vma(struct rmap_item *rmap_item,
 			  struct anon_vma *anon_vma)
 {
 	rmap_item->anon_vma = anon_vma;
-	atomic_inc(&anon_vma->ksm_refcount);
+	atomic_inc(&anon_vma->external_refcount);
 }
 
 static void drop_anon_vma(struct rmap_item *rmap_item)
 {
 	struct anon_vma *anon_vma = rmap_item->anon_vma;
 
-	if (atomic_dec_and_lock(&anon_vma->ksm_refcount, &anon_vma->lock)) {
+	if (atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->lock)) {
 		int empty = list_empty(&anon_vma->head);
 		spin_unlock(&anon_vma->lock);
 		if (empty)
diff --git a/mm/migrate.c b/mm/migrate.c
index 5c5c1bd..35aad2a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -611,7 +611,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 			goto rcu_unlock;
 
 		anon_vma = page_anon_vma(page);
-		atomic_inc(&anon_vma->migrate_refcount);
+		atomic_inc(&anon_vma->external_refcount);
 	}
 
 	/*
@@ -653,7 +653,7 @@ skip_unmap:
 rcu_unlock:
 
 	/* Drop an anon_vma reference if we took one */
-	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->lock)) {
 		int empty = list_empty(&anon_vma->head);
 		spin_unlock(&anon_vma->lock);
 		if (empty)
diff --git a/mm/rmap.c b/mm/rmap.c
index 578d0fe..af35b75 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -248,8 +248,7 @@ static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
-					!migrate_refcount(anon_vma);
+	empty = list_empty(&anon_vma->head) && !anonvma_external_refcount(anon_vma);
 	spin_unlock(&anon_vma->lock);
 
 	if (empty)
@@ -273,8 +272,7 @@ static void anon_vma_ctor(void *data)
 	struct anon_vma *anon_vma = data;
 
 	spin_lock_init(&anon_vma->lock);
-	ksm_refcount_init(anon_vma);
-	migrate_refcount_init(anon_vma);
+	anonvma_external_refcount_init(anon_vma);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
-- 
1.6.5


* [PATCH 04/14] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
  2010-03-30  9:14 ` Mel Gorman
@ 2010-03-30  9:14   ` Mel Gorman
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
being able to hot-remove memory. The main users of page migration such as
sys_move_pages(), sys_migrate_pages() and cpuset process migration are
only beneficial on NUMA, so that dependency makes sense.

As memory compaction will operate within a zone and is useful on both NUMA
and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
user selects CONFIG_COMPACTION as an option.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/Kconfig |   18 +++++++++++++++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 9c61158..4fd75a0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -172,6 +172,16 @@ config SPLIT_PTLOCK_CPUS
 	default "4"
 
 #
+# support for memory compaction
+config COMPACTION
+	bool "Allow for memory compaction"
+	def_bool y
+	select MIGRATION
+	depends on EXPERIMENTAL && HUGETLBFS && MMU
+	help
+	  Allows the compaction of memory for the allocation of huge pages.
+
+#
 # support for page migration
 #
 config MIGRATION
@@ -180,9 +190,11 @@ config MIGRATION
 	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE
 	help
 	  Allows the migration of the physical location of pages of processes
-	  while the virtual addresses are not changed. This is useful for
-	  example on NUMA systems to put pages nearer to the processors accessing
-	  the page.
+	  while the virtual addresses are not changed. This is useful in
+	  two situations. The first is on NUMA systems to put pages nearer
+	  to the processors accessing. The second is when allocating huge
+	  pages as migration can relocate pages to satisfy a huge page
+	  allocation instead of reclaiming.
 
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
-- 
1.6.5


* [PATCH 05/14] Export unusable free space index via /proc/unusable_index
  2010-03-30  9:14 ` Mel Gorman
@ 2010-03-30  9:14   ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

Unusable free space index is a measure of external fragmentation that
takes the allocation size into account. For the most part, the huge page
size will be the size of interest, but not necessarily, so the index is
exported on a per-order and per-zone basis via /proc/unusable_index.

The index is a value between 0 and 1. It can be expressed as a
percentage by multiplying by 100 as documented in
Documentation/filesystems/proc.txt.
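
As a purely illustrative aid (not part of the patch), the following
user-space sketch recomputes the index from made-up per-order free counts,
mirroring the fill_contig_page_info()/unusable_free_index() logic added to
vmstat.c below. The nr_free counts and the order of interest are
hypothetical:

  /* Hypothetical worked example of the unusable free space index */
  #include <stdio.h>

  #define MAX_ORDER 11

  int main(void)
  {
  	/* made-up nr_free counts for one zone, orders 0..10 */
  	unsigned long nr_free[MAX_ORDER] = { 200, 100, 50, 25, 12, 6, 3, 1, 0, 0, 0 };
  	unsigned int suitable_order = 4;	/* order of interest */
  	unsigned long free_pages = 0, suitable = 0, index;
  	unsigned int order;

  	for (order = 0; order < MAX_ORDER; order++) {
  		/* count free base pages and blocks large enough for the request */
  		free_pages += nr_free[order] << order;
  		if (order >= suitable_order)
  			suitable += nr_free[order] << (order - suitable_order);
  	}

  	if (!free_pages)
  		return 0;	/* no free memory: index would be 1.000 */

  	/* fraction of free memory in blocks too small for the request */
  	index = (free_pages - (suitable << suitable_order)) * 1000 / free_pages;
  	printf("order-%u unusable index: %lu.%03lu\n",
  	       suitable_order, index / 1000, index % 1000);
  	return 0;
  }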

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 Documentation/filesystems/proc.txt |   13 ++++-
 mm/vmstat.c                        |  120 ++++++++++++++++++++++++++++++++++++
 2 files changed, 132 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 74d2605..e87775a 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -453,6 +453,7 @@ Table 1-5: Kernel info in /proc
  sys         See chapter 2                                     
  sysvipc     Info of SysVIPC Resources (msg, sem, shm)		(2.4)
  tty	     Info of tty drivers
+ unusable_index Additional page allocator information (see text)(2.5)
  uptime      System uptime                                     
  version     Kernel version                                    
  video	     bttv info of video resources			(2.4)
@@ -610,7 +611,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
 available in ZONE_NORMAL, etc... 
 
 More information relevant to external fragmentation can be found in
-pagetypeinfo.
+pagetypeinfo and unusable_index
 
 > cat /proc/pagetypeinfo
 Page block order: 9
@@ -651,6 +652,16 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
 also be allocatable although a lot of filesystem metadata may have to be
 reclaimed to achieve this.
 
+> cat /proc/unusable_index
+Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
+Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
+
+The unusable free space index measures how much of the available free
+memory cannot be used to satisfy an allocation of a given size and is a
+value between 0 and 1. The higher the value, the more of free memory is
+unusable and by implication, the worse the external fragmentation is. This
+can be expressed as a percentage by multiplying by 100.
+
 ..............................................................................
 
 meminfo:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7f760cb..2fb4986 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -453,6 +453,106 @@ static int frag_show(struct seq_file *m, void *arg)
 	return 0;
 }
 
+
+struct contig_page_info {
+	unsigned long free_pages;
+	unsigned long free_blocks_total;
+	unsigned long free_blocks_suitable;
+};
+
+/*
+ * Calculate the number of free pages in a zone, how many contiguous
+ * pages are free and how many are large enough to satisfy an allocation of
+ * the target size. Note that this function makes no attempt to estimate
+ * how many suitable free blocks there *might* be if MOVABLE pages were
+ * migrated. Calculating that is possible, but expensive and can be
+ * figured out from userspace
+ */
+static void fill_contig_page_info(struct zone *zone,
+				unsigned int suitable_order,
+				struct contig_page_info *info)
+{
+	unsigned int order;
+
+	info->free_pages = 0;
+	info->free_blocks_total = 0;
+	info->free_blocks_suitable = 0;
+
+	for (order = 0; order < MAX_ORDER; order++) {
+		unsigned long blocks;
+
+		/* Count number of free blocks */
+		blocks = zone->free_area[order].nr_free;
+		info->free_blocks_total += blocks;
+
+		/* Count free base pages */
+		info->free_pages += blocks << order;
+
+		/* Count the suitable free blocks */
+		if (order >= suitable_order)
+			info->free_blocks_suitable += blocks <<
+						(order - suitable_order);
+	}
+}
+
+/*
+ * Return an index indicating how much of the available free memory is
+ * unusable for an allocation of the requested size.
+ */
+static int unusable_free_index(unsigned int order,
+				struct contig_page_info *info)
+{
+	/* No free memory is interpreted as all free memory is unusable */
+	if (info->free_pages == 0)
+		return 1000;
+
+	/*
+	 * Index should be a value between 0 and 1. Return a value to 3
+	 * decimal places.
+	 *
+	 * 0 => no fragmentation
+	 * 1 => high fragmentation
+	 */
+	return div_u64((info->free_pages - (info->free_blocks_suitable << order)) * 1000ULL, info->free_pages);
+
+}
+
+static void unusable_show_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	unsigned int order;
+	int index;
+	struct contig_page_info info;
+
+	seq_printf(m, "Node %d, zone %8s ",
+				pgdat->node_id,
+				zone->name);
+	for (order = 0; order < MAX_ORDER; ++order) {
+		fill_contig_page_info(zone, order, &info);
+		index = unusable_free_index(order, &info);
+		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+	}
+
+	seq_putc(m, '\n');
+}
+
+/*
+ * Display unusable free space index
+ * XXX: Could be a lot more efficient, but it's not a critical path
+ */
+static int unusable_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	/* check memoryless node */
+	if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+		return 0;
+
+	walk_zones_in_node(m, pgdat, unusable_show_print);
+
+	return 0;
+}
+
 static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
@@ -603,6 +703,25 @@ static const struct file_operations pagetypeinfo_file_ops = {
 	.release	= seq_release,
 };
 
+static const struct seq_operations unusable_op = {
+	.start	= frag_start,
+	.next	= frag_next,
+	.stop	= frag_stop,
+	.show	= unusable_show,
+};
+
+static int unusable_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &unusable_op);
+}
+
+static const struct file_operations unusable_file_ops = {
+	.open		= unusable_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 #ifdef CONFIG_ZONE_DMA
 #define TEXT_FOR_DMA(xx) xx "_dma",
 #else
@@ -947,6 +1066,7 @@ static int __init setup_vmstat(void)
 #ifdef CONFIG_PROC_FS
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
 	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
+	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
 	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
 	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
 #endif
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 06/14] Export fragmentation index via /proc/extfrag_index
  2010-03-30  9:14 ` Mel Gorman
@ 2010-03-30  9:14   ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

Fragmentation index is a value that is only meaningful when an allocation
of a given size would fail. The index indicates whether an allocation
failure is due to a lack of memory (values towards 0) or due to external
fragmentation (values towards 1). For the most part, the huge page size
will be the size of interest, but not necessarily, so the index is exported
on a per-order and per-zone basis via /proc/extfrag_index.
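
For illustration only (not part of the patch), a user-space sketch of the
calculation performed by fragmentation_index() below, using made-up zone
statistics for a request that would fail:

  /* Hypothetical worked example of the external fragmentation index */
  #include <stdio.h>

  int main(void)
  {
  	unsigned long free_pages = 1000;	/* made-up free base pages */
  	unsigned long free_blocks_total = 400;	/* made-up free blocks, all orders */
  	unsigned long free_blocks_suitable = 0;	/* the request would fail */
  	unsigned int order = 4;
  	unsigned long requested = 1UL << order;
  	long index;

  	if (!free_blocks_total) {
  		printf("0.000 (no free memory at all)\n");
  		return 0;
  	}
  	if (free_blocks_suitable) {
  		printf("-1.000 (request would succeed)\n");
  		return 0;
  	}

  	/* 1 - (1 + free_pages/requested) / free_blocks_total, to 3 decimals */
  	index = 1000 - (long)((1000 + free_pages * 1000 / requested) /
  			      free_blocks_total);
  	printf("order-%u extfrag index: %ld.%03ld\n",
  	       order, index / 1000, index % 1000);
  	return 0;
  }

With these made-up numbers the index comes out at 0.842, i.e. the
hypothetical failure would be mostly due to external fragmentation rather
than a lack of memory.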

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 Documentation/filesystems/proc.txt |   14 ++++++-
 mm/vmstat.c                        |   82 ++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index e87775a..c041638 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -422,6 +422,7 @@ Table 1-5: Kernel info in /proc
  filesystems Supported filesystems                             
  driver	     Various drivers grouped here, currently rtc (2.4)
  execdomains Execdomains, related to security			(2.4)
+ extfrag_index Additional page allocator information (see text) (2.5)
  fb	     Frame Buffer devices				(2.4)
  fs	     File system parameters, currently nfs/exports	(2.4)
  ide         Directory containing info about the IDE subsystem 
@@ -611,7 +612,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
 available in ZONE_NORMAL, etc... 
 
 More information relevant to external fragmentation can be found in
-pagetypeinfo and unusable_index
+pagetypeinfo, unusable_index and extfrag_index.
 
 > cat /proc/pagetypeinfo
 Page block order: 9
@@ -662,6 +663,17 @@ value between 0 and 1. The higher the value, the more of free memory is
 unusable and by implication, the worse the external fragmentation is. This
 can be expressed as a percentage by multiplying by 100.
 
+> cat /proc/extfrag_index
+Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
+Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954
+
+The external fragmentation index is only meaningful if an allocation
+would fail and indicates what the failure is due to. A value of -1 such as
+in many of the examples above states that the allocation would succeed.
+If it would fail, the value is between 0 and 1. A value tending towards
+0 implies the allocation failed due to a lack of memory. A value tending
+towards 1 implies it failed due to external fragmentation.
+
 ..............................................................................
 
 meminfo:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2fb4986..351e491 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -15,6 +15,7 @@
 #include <linux/cpu.h>
 #include <linux/vmstat.h>
 #include <linux/sched.h>
+#include <linux/math64.h>
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -553,6 +554,67 @@ static int unusable_show(struct seq_file *m, void *arg)
 	return 0;
 }
 
+/*
+ * A fragmentation index only makes sense if an allocation of a requested
+ * size would fail. If that is true, the fragmentation index indicates
+ * whether external fragmentation or a lack of memory was the problem.
+ * The value can be used to determine if page reclaim or compaction
+ * should be used
+ */
+int fragmentation_index(unsigned int order, struct contig_page_info *info)
+{
+	unsigned long requested = 1UL << order;
+
+	if (!info->free_blocks_total)
+		return 0;
+
+	/* Fragmentation index only makes sense when a request would fail */
+	if (info->free_blocks_suitable)
+		return -1000;
+
+	/*
+	 * Index is between 0 and 1 so return within 3 decimal places
+	 *
+	 * 0 => allocation would fail due to lack of memory
+	 * 1 => allocation would fail due to fragmentation
+	 */
+	return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
+}
+
+
+static void extfrag_show_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	unsigned int order;
+	int index;
+
+	/* Alloc on stack as interrupts are disabled for zone walk */
+	struct contig_page_info info;
+
+	seq_printf(m, "Node %d, zone %8s ",
+				pgdat->node_id,
+				zone->name);
+	for (order = 0; order < MAX_ORDER; ++order) {
+		fill_contig_page_info(zone, order, &info);
+		index = fragmentation_index(order, &info);
+		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+	}
+
+	seq_putc(m, '\n');
+}
+
+/*
+ * Display fragmentation index for orders that allocations would fail for
+ */
+static int extfrag_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	walk_zones_in_node(m, pgdat, extfrag_show_print);
+
+	return 0;
+}
+
 static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
@@ -722,6 +784,25 @@ static const struct file_operations unusable_file_ops = {
 	.release	= seq_release,
 };
 
+static const struct seq_operations extfrag_op = {
+	.start	= frag_start,
+	.next	= frag_next,
+	.stop	= frag_stop,
+	.show	= extfrag_show,
+};
+
+static int extfrag_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &extfrag_op);
+}
+
+static const struct file_operations extfrag_file_ops = {
+	.open		= extfrag_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 #ifdef CONFIG_ZONE_DMA
 #define TEXT_FOR_DMA(xx) xx "_dma",
 #else
@@ -1067,6 +1148,7 @@ static int __init setup_vmstat(void)
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
 	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
 	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
+	proc_create("extfrag_index", S_IRUGO, NULL, &extfrag_file_ops);
 	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
 	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
 #endif
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 07/14] Move definition for LRU isolation modes to a header
  2010-03-30  9:14 ` Mel Gorman
@ 2010-03-30  9:14   ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

Currently, vmscan.c defines the isolation modes for
__isolate_lru_page(). Memory compaction needs access to these modes for
isolating pages for migration.  This patch exports them.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/swap.h |    5 +++++
 mm/vmscan.c          |    5 -----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1f59d93..986b12d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page)
 	__lru_cache_add(page, LRU_ACTIVE_FILE);
 }
 
+/* LRU Isolation modes. */
+#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
+#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
+#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
+
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..ef89600 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -839,11 +839,6 @@ keep:
 	return nr_reclaimed;
 }
 
-/* LRU Isolation modes. */
-#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
-#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
-#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
-
 /*
  * Attempt to remove the specified page from its LRU.  Only take this page
  * if it is of the appropriate PageActive status.  Pages which are being
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 08/14] Memory compaction core
  2010-03-30  9:14 ` Mel Gorman
@ 2010-03-30  9:14   ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.

A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable pages
within each area, isolating them onto a private list called migratelist.
The free scanner starts at the top of the zone and searches for suitable
areas, consuming the free pages within them and making them available to
the migration scanner. The pages isolated for migration are then migrated
to the newly isolated free pages.
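
The convergence of the two scanners can be pictured with the following
schematic user-space sketch (an editorial illustration, not part of the
patch, and deliberately ignoring the isolation and migration details; the
PFN range and pageblock size are made up):

  /* Two scanners walking towards each other over made-up PFNs */
  #include <stdio.h>

  int main(void)
  {
  	unsigned long zone_start_pfn = 0, zone_end_pfn = 1024;	/* hypothetical zone */
  	unsigned long pageblock_nr_pages = 128;			/* hypothetical pageblock */
  	unsigned long migrate_pfn = zone_start_pfn;
  	unsigned long free_pfn = zone_end_pfn;

  	while (free_pfn > migrate_pfn) {
  		/* isolate movable pages in [migrate_pfn, migrate_pfn + pageblock) ... */
  		migrate_pfn += pageblock_nr_pages;
  		/* ... take free pages from the block ending at free_pfn and migrate */
  		free_pfn -= pageblock_nr_pages;
  		printf("migrate scanner %lu, free scanner %lu\n",
  		       migrate_pfn, free_pfn);
  	}
  	return 0;	/* scanners met: the compaction run is complete */
  }

In the real implementation the free scanner only retreats when more free
pages are needed, but the run still completes when free_pfn <= migrate_pfn.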

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 include/linux/compaction.h |    9 +
 include/linux/mm.h         |    1 +
 include/linux/swap.h       |    1 +
 include/linux/vmstat.h     |    1 +
 mm/Makefile                |    1 +
 mm/compaction.c            |  379 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |   39 +++++
 mm/vmstat.c                |    5 +
 8 files changed, 436 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/compaction.h
 create mode 100644 mm/compaction.c

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
new file mode 100644
index 0000000..dbebe58
--- /dev/null
+++ b/include/linux/compaction.h
@@ -0,0 +1,9 @@
+#ifndef _LINUX_COMPACTION_H
+#define _LINUX_COMPACTION_H
+
+/* Return values for compact_zone() */
+#define COMPACT_INCOMPLETE	0
+#define COMPACT_PARTIAL		1
+#define COMPACT_COMPLETE	2
+
+#endif /* _LINUX_COMPACTION_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3b473a..f920815 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -335,6 +335,7 @@ void put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
+int split_free_page(struct page *page);
 
 /*
  * Compound pages have a destructor function.  Provide a
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 986b12d..cf8bba7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -151,6 +151,7 @@ enum {
 };
 
 #define SWAP_CLUSTER_MAX 32
+#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
 #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 117f0dd..56e4b44 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
diff --git a/mm/Makefile b/mm/Makefile
index 7a68d2a..ccb1f72 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_COMPACTION) += compaction.o
 obj-$(CONFIG_SMP) += percpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
diff --git a/mm/compaction.c b/mm/compaction.c
new file mode 100644
index 0000000..4041209
--- /dev/null
+++ b/mm/compaction.c
@@ -0,0 +1,379 @@
+/*
+ * linux/mm/compaction.c
+ *
+ * Memory compaction for the reduction of external fragmentation. Note that
+ * this heavily depends upon page migration to do all the real heavy
+ * lifting
+ *
+ * Copyright IBM Corp. 2007-2010 Mel Gorman <mel@csn.ul.ie>
+ */
+#include <linux/swap.h>
+#include <linux/migrate.h>
+#include <linux/compaction.h>
+#include <linux/mm_inline.h>
+#include <linux/backing-dev.h>
+#include "internal.h"
+
+/*
+ * compact_control is used to track pages being migrated and the free pages
+ * they are being migrated to during memory compaction. The free_pfn starts
+ * at the end of a zone and migrate_pfn begins at the start. Movable pages
+ * are moved to the end of a zone during a compaction run and the run
+ * completes when free_pfn <= migrate_pfn
+ */
+struct compact_control {
+	struct list_head freepages;	/* List of free pages to migrate to */
+	struct list_head migratepages;	/* List of pages being migrated */
+	unsigned long nr_freepages;	/* Number of isolated free pages */
+	unsigned long nr_migratepages;	/* Number of pages to migrate */
+	unsigned long free_pfn;		/* isolate_freepages search base */
+	unsigned long migrate_pfn;	/* isolate_migratepages search base */
+
+	/* Account for isolated anon and file pages */
+	unsigned long nr_anon;
+	unsigned long nr_file;
+
+	struct zone *zone;
+};
+
+static int release_freepages(struct list_head *freelist)
+{
+	struct page *page, *next;
+	int count = 0;
+
+	list_for_each_entry_safe(page, next, freelist, lru) {
+		list_del(&page->lru);
+		__free_page(page);
+		count++;
+	}
+
+	return count;
+}
+
+/* Isolate free pages onto a private freelist. Must hold zone->lock */
+static int isolate_freepages_block(struct zone *zone,
+				unsigned long blockpfn,
+				struct list_head *freelist)
+{
+	unsigned long zone_end_pfn, end_pfn;
+	int total_isolated = 0;
+
+	/* Get the last PFN we should scan for free pages at */
+	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	end_pfn = blockpfn + pageblock_nr_pages;
+	if (end_pfn > zone_end_pfn)
+		end_pfn = zone_end_pfn;
+
+	/* Isolate free pages. This assumes the block is valid */
+	for (; blockpfn < end_pfn; blockpfn++) {
+		struct page *page;
+		int isolated, i;
+
+		if (!pfn_valid_within(blockpfn))
+			continue;
+
+		page = pfn_to_page(blockpfn);
+		if (!PageBuddy(page))
+			continue;
+
+		/* Found a free page, break it into order-0 pages */
+		isolated = split_free_page(page);
+		total_isolated += isolated;
+		for (i = 0; i < isolated; i++) {
+			list_add(&page->lru, freelist);
+			page++;
+		}
+
+		/* If a page was split, advance to the end of it */
+		if (isolated)
+			blockpfn += isolated - 1;
+	}
+
+	return total_isolated;
+}
+
+/* Returns 1 if the page is within a block suitable for migration to */
+static int suitable_migration_target(struct page *page)
+{
+
+	int migratetype = get_pageblock_migratetype(page);
+
+	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
+	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
+		return 0;
+
+	/* If the page is a large free page, then allow migration */
+	if (PageBuddy(page) && page_order(page) >= pageblock_order)
+		return 1;
+
+	/* If the block is MIGRATE_MOVABLE, allow migration */
+	if (migratetype == MIGRATE_MOVABLE)
+		return 1;
+
+	/* Otherwise skip the block */
+	return 0;
+}
+
+/*
+ * Based on information in the current compact_control, find blocks
+ * suitable for isolating free pages from
+ */
+static void isolate_freepages(struct zone *zone,
+				struct compact_control *cc)
+{
+	struct page *page;
+	unsigned long high_pfn, low_pfn, pfn;
+	unsigned long flags;
+	int nr_freepages = cc->nr_freepages;
+	struct list_head *freelist = &cc->freepages;
+
+	pfn = cc->free_pfn;
+	low_pfn = cc->migrate_pfn + pageblock_nr_pages;
+	high_pfn = low_pfn;
+
+	/*
+	 * Isolate free pages until enough are available to migrate the
+	 * pages on cc->migratepages. We stop searching if the migrate
+	 * and free page scanners meet or enough free pages are isolated.
+	 */
+	spin_lock_irqsave(&zone->lock, flags);
+	for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
+					pfn -= pageblock_nr_pages) {
+		int isolated;
+
+		if (!pfn_valid(pfn))
+			continue;
+
+		/*
+		 * Check for overlapping nodes/zones. It's possible on some
+		 * configurations to have a setup like
+		 * node0 node1 node0
+		 * i.e. it's possible that all pages within a zones range of
+		 * pages do not belong to a single zone.
+		 */
+		page = pfn_to_page(pfn);
+		if (page_zone(page) != zone)
+			continue;
+
+		/* Check the block is suitable for migration */
+		if (!suitable_migration_target(page))
+			continue;
+
+		/* Found a block suitable for isolating free pages from */
+		isolated = isolate_freepages_block(zone, pfn, freelist);
+		nr_freepages += isolated;
+
+		/*
+		 * Record the highest PFN we isolated pages from. When next
+		 * looking for free pages, the search will restart here as
+		 * page migration may have returned some pages to the allocator
+		 */
+		if (isolated)
+			high_pfn = max(high_pfn, pfn);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	cc->free_pfn = high_pfn;
+	cc->nr_freepages = nr_freepages;
+}
+
+/* Update the number of anon and file isolated pages in the zone */
+static void acct_isolated(struct zone *zone, struct compact_control *cc)
+{
+	struct page *page;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+
+	list_for_each_entry(page, &cc->migratepages, lru) {
+		int lru = page_lru_base_type(page);
+		count[lru]++;
+	}
+
+	cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
+}
+
+/* Similar to reclaim, but different enough that they don't share logic */
+static int too_many_isolated(struct zone *zone)
+{
+
+	unsigned long inactive, isolated;
+
+	inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
+					zone_page_state(zone, NR_INACTIVE_ANON);
+	isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
+					zone_page_state(zone, NR_ISOLATED_ANON);
+
+	return isolated > inactive;
+}
+
+/*
+ * Isolate all pages that can be migrated from the block pointed to by
+ * the migrate scanner within compact_control.
+ */
+static unsigned long isolate_migratepages(struct zone *zone,
+					struct compact_control *cc)
+{
+	unsigned long low_pfn, end_pfn;
+	struct list_head *migratelist;
+
+	low_pfn = cc->migrate_pfn;
+	migratelist = &cc->migratepages;
+
+	/* Do not scan outside zone boundaries */
+	if (low_pfn < zone->zone_start_pfn)
+		low_pfn = zone->zone_start_pfn;
+
+	/* Setup to scan one block but not past where we are migrating to */
+	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
+
+	/* Do not cross the free scanner or scan within a memory hole */
+	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
+		cc->migrate_pfn = end_pfn;
+		return 0;
+	}
+
+	/* Do not isolate the world */
+	while (unlikely(too_many_isolated(zone))) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+		if (fatal_signal_pending(current))
+			return 0;
+	}
+
+	/* Time to isolate some pages for migration */
+	spin_lock_irq(&zone->lru_lock);
+	for (; low_pfn < end_pfn; low_pfn++) {
+		struct page *page;
+		if (!pfn_valid_within(low_pfn))
+			continue;
+
+		/* Get the page and skip if free */
+		page = pfn_to_page(low_pfn);
+		if (PageBuddy(page)) {
+			low_pfn += (1 << page_order(page)) - 1;
+			continue;
+		}
+
+		/* Try isolate the page */
+		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
+			del_page_from_lru_list(zone, page, page_lru(page));
+			list_add(&page->lru, migratelist);
+			mem_cgroup_del_lru(page);
+			cc->nr_migratepages++;
+		}
+
+		/* Avoid isolating too much */
+		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
+			break;
+	}
+
+	acct_isolated(zone, cc);
+
+	spin_unlock_irq(&zone->lru_lock);
+	cc->migrate_pfn = low_pfn;
+
+	return cc->nr_migratepages;
+}
+
+/*
+ * This is a migrate-callback that "allocates" freepages by taking pages
+ * from the isolated freelists in the block we are migrating to.
+ */
+static struct page *compaction_alloc(struct page *migratepage,
+					unsigned long data,
+					int **result)
+{
+	struct compact_control *cc = (struct compact_control *)data;
+	struct page *freepage;
+
+	/* Isolate free pages if necessary */
+	if (list_empty(&cc->freepages)) {
+		isolate_freepages(cc->zone, cc);
+
+		if (list_empty(&cc->freepages))
+			return NULL;
+	}
+
+	freepage = list_entry(cc->freepages.next, struct page, lru);
+	list_del(&freepage->lru);
+	cc->nr_freepages--;
+
+	return freepage;
+}
+
+/*
+ * We cannot control nr_migratepages and nr_freepages fully when migration is
+ * running as migrate_pages() has no knowledge of compact_control. When
+ * migration is complete, we count the number of pages on the lists by hand.
+ */
+static void update_nr_listpages(struct compact_control *cc)
+{
+	int nr_migratepages = 0;
+	int nr_freepages = 0;
+	struct page *page;
+	list_for_each_entry(page, &cc->migratepages, lru)
+		nr_migratepages++;
+	list_for_each_entry(page, &cc->freepages, lru)
+		nr_freepages++;
+
+	cc->nr_migratepages = nr_migratepages;
+	cc->nr_freepages = nr_freepages;
+}
+
+static inline int compact_finished(struct zone *zone,
+						struct compact_control *cc)
+{
+	if (fatal_signal_pending(current))
+		return COMPACT_PARTIAL;
+
+	/* Compaction run completes if the migrate and free scanner meet */
+	if (cc->free_pfn <= cc->migrate_pfn)
+		return COMPACT_COMPLETE;
+
+	return COMPACT_INCOMPLETE;
+}
+
+static int compact_zone(struct zone *zone, struct compact_control *cc)
+{
+	int ret = COMPACT_INCOMPLETE;
+
+	/* Setup to move all movable pages to the end of the zone */
+	cc->migrate_pfn = zone->zone_start_pfn;
+	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
+	cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+	migrate_prep();
+
+	for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
+		unsigned long nr_migrate, nr_remaining;
+		if (!isolate_migratepages(zone, cc))
+			continue;
+
+		nr_migrate = cc->nr_migratepages;
+		migrate_pages(&cc->migratepages, compaction_alloc,
+						(unsigned long)cc, 0);
+		update_nr_listpages(cc);
+		nr_remaining = cc->nr_migratepages;
+
+		count_vm_event(COMPACTBLOCKS);
+		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
+		if (nr_remaining)
+			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
+
+		/* Release LRU pages not migrated */
+		if (!list_empty(&cc->migratepages)) {
+			putback_lru_pages(&cc->migratepages);
+			cc->nr_migratepages = 0;
+		}
+
+	}
+
+	/* Release free pages and check accounting */
+	cc->nr_freepages -= release_freepages(&cc->freepages);
+	VM_BUG_ON(cc->nr_freepages != 0);
+
+	return ret;
+}
+
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 624cba4..3cf947d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1208,6 +1208,45 @@ void split_page(struct page *page, unsigned int order)
 }
 
 /*
+ * Similar to split_page except the page is already free. As this is only
+ * being used for migration, the migratetype of the block also changes.
+ */
+int split_free_page(struct page *page)
+{
+	unsigned int order;
+	unsigned long watermark;
+	struct zone *zone;
+
+	BUG_ON(!PageBuddy(page));
+
+	zone = page_zone(page);
+	order = page_order(page);
+
+	/* Obey watermarks or the system could deadlock */
+	watermark = low_wmark_pages(zone) + (1 << order);
+	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+		return 0;
+
+	/* Remove page from free list */
+	list_del(&page->lru);
+	zone->free_area[order].nr_free--;
+	rmv_page_order(page);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+
+	/* Split into individual pages */
+	set_page_refcounted(page);
+	split_page(page, order);
+
+	if (order >= pageblock_order - 1) {
+		struct page *endpage = page + (1 << order) - 1;
+		for (; page < endpage; page += pageblock_nr_pages)
+			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+	}
+
+	return 1 << order;
+}
+
+/*
  * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 351e491..3a69b48 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -892,6 +892,11 @@ static const char * const vmstat_text[] = {
 	"allocstall",
 
 	"pgrotated",
+
+	"compact_blocks_moved",
+	"compact_pages_moved",
+	"compact_pagemigrate_failed",
+
 #ifdef CONFIG_HUGETLB_PAGE
 	"htlb_buddy_alloc_success",
 	"htlb_buddy_alloc_fail",
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 08/14] Memory compaction core
@ 2010-03-30  9:14   ` Mel Gorman
  0 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.

A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable pages
within each area, isolating them onto a private list called migratelist.
The free scanner starts at the top of the zone and searches for suitable
areas, consuming the free pages within them and making them available to
the migration scanner. The pages isolated for migration are then migrated
to the newly isolated free pages.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 include/linux/compaction.h |    9 +
 include/linux/mm.h         |    1 +
 include/linux/swap.h       |    1 +
 include/linux/vmstat.h     |    1 +
 mm/Makefile                |    1 +
 mm/compaction.c            |  379 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |   39 +++++
 mm/vmstat.c                |    5 +
 8 files changed, 436 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/compaction.h
 create mode 100644 mm/compaction.c

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
new file mode 100644
index 0000000..dbebe58
--- /dev/null
+++ b/include/linux/compaction.h
@@ -0,0 +1,9 @@
+#ifndef _LINUX_COMPACTION_H
+#define _LINUX_COMPACTION_H
+
+/* Return values for compact_zone() */
+#define COMPACT_INCOMPLETE	0
+#define COMPACT_PARTIAL		1
+#define COMPACT_COMPLETE	2
+
+#endif /* _LINUX_COMPACTION_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3b473a..f920815 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -335,6 +335,7 @@ void put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
+int split_free_page(struct page *page);
 
 /*
  * Compound pages have a destructor function.  Provide a
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 986b12d..cf8bba7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -151,6 +151,7 @@ enum {
 };
 
 #define SWAP_CLUSTER_MAX 32
+#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
 #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 117f0dd..56e4b44 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
diff --git a/mm/Makefile b/mm/Makefile
index 7a68d2a..ccb1f72 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_COMPACTION) += compaction.o
 obj-$(CONFIG_SMP) += percpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
diff --git a/mm/compaction.c b/mm/compaction.c
new file mode 100644
index 0000000..4041209
--- /dev/null
+++ b/mm/compaction.c
@@ -0,0 +1,379 @@
+/*
+ * linux/mm/compaction.c
+ *
+ * Memory compaction for the reduction of external fragmentation. Note that
+ * this heavily depends upon page migration to do all the real heavy
+ * lifting
+ *
+ * Copyright IBM Corp. 2007-2010 Mel Gorman <mel@csn.ul.ie>
+ */
+#include <linux/swap.h>
+#include <linux/migrate.h>
+#include <linux/compaction.h>
+#include <linux/mm_inline.h>
+#include <linux/backing-dev.h>
+#include "internal.h"
+
+/*
+ * compact_control is used to track pages being migrated and the free pages
+ * they are being migrated to during memory compaction. The free_pfn starts
+ * at the end of a zone and migrate_pfn begins at the start. Movable pages
+ * are moved to the end of a zone during a compaction run and the run
+ * completes when free_pfn <= migrate_pfn
+ */
+struct compact_control {
+	struct list_head freepages;	/* List of free pages to migrate to */
+	struct list_head migratepages;	/* List of pages being migrated */
+	unsigned long nr_freepages;	/* Number of isolated free pages */
+	unsigned long nr_migratepages;	/* Number of pages to migrate */
+	unsigned long free_pfn;		/* isolate_freepages search base */
+	unsigned long migrate_pfn;	/* isolate_migratepages search base */
+
+	/* Account for isolated anon and file pages */
+	unsigned long nr_anon;
+	unsigned long nr_file;
+
+	struct zone *zone;
+};
+
+static int release_freepages(struct list_head *freelist)
+{
+	struct page *page, *next;
+	int count = 0;
+
+	list_for_each_entry_safe(page, next, freelist, lru) {
+		list_del(&page->lru);
+		__free_page(page);
+		count++;
+	}
+
+	return count;
+}
+
+/* Isolate free pages onto a private freelist. Must hold zone->lock */
+static int isolate_freepages_block(struct zone *zone,
+				unsigned long blockpfn,
+				struct list_head *freelist)
+{
+	unsigned long zone_end_pfn, end_pfn;
+	int total_isolated = 0;
+
+	/* Get the last PFN we should scan for free pages at */
+	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	end_pfn = blockpfn + pageblock_nr_pages;
+	if (end_pfn > zone_end_pfn)
+		end_pfn = zone_end_pfn;
+
+	/* Isolate free pages. This assumes the block is valid */
+	for (; blockpfn < end_pfn; blockpfn++) {
+		struct page *page;
+		int isolated, i;
+
+		if (!pfn_valid_within(blockpfn))
+			continue;
+
+		page = pfn_to_page(blockpfn);
+		if (!PageBuddy(page))
+			continue;
+
+		/* Found a free page, break it into order-0 pages */
+		isolated = split_free_page(page);
+		total_isolated += isolated;
+		for (i = 0; i < isolated; i++) {
+			list_add(&page->lru, freelist);
+			page++;
+		}
+
+		/* If a page was split, advance to the end of it */
+		if (isolated)
+			blockpfn += isolated - 1;
+	}
+
+	return total_isolated;
+}
+
+/* Returns 1 if the page is within a block suitable for migration to */
+static int suitable_migration_target(struct page *page)
+{
+
+	int migratetype = get_pageblock_migratetype(page);
+
+	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
+	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
+		return 0;
+
+	/* If the page is a large free page, then allow migration */
+	if (PageBuddy(page) && page_order(page) >= pageblock_order)
+		return 1;
+
+	/* If the block is MIGRATE_MOVABLE, allow migration */
+	if (migratetype == MIGRATE_MOVABLE)
+		return 1;
+
+	/* Otherwise skip the block */
+	return 0;
+}
+
+/*
+ * Based on information in the current compact_control, find blocks
+ * suitable for isolating free pages from
+ */
+static void isolate_freepages(struct zone *zone,
+				struct compact_control *cc)
+{
+	struct page *page;
+	unsigned long high_pfn, low_pfn, pfn;
+	unsigned long flags;
+	int nr_freepages = cc->nr_freepages;
+	struct list_head *freelist = &cc->freepages;
+
+	pfn = cc->free_pfn;
+	low_pfn = cc->migrate_pfn + pageblock_nr_pages;
+	high_pfn = low_pfn;
+
+	/*
+	 * Isolate free pages until enough are available to migrate the
+	 * pages on cc->migratepages. We stop searching if the migrate
+	 * and free page scanners meet or enough free pages are isolated.
+	 */
+	spin_lock_irqsave(&zone->lock, flags);
+	for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
+					pfn -= pageblock_nr_pages) {
+		int isolated;
+
+		if (!pfn_valid(pfn))
+			continue;
+
+		/*
+		 * Check for overlapping nodes/zones. It's possible on some
+		 * configurations to have a setup like
+		 * node0 node1 node0
+		 * i.e. it's possible that all pages within a zones range of
+		 * pages do not belong to a single zone.
+		 */
+		page = pfn_to_page(pfn);
+		if (page_zone(page) != zone)
+			continue;
+
+		/* Check the block is suitable for migration */
+		if (!suitable_migration_target(page))
+			continue;
+
+		/* Found a block suitable for isolating free pages from */
+		isolated = isolate_freepages_block(zone, pfn, freelist);
+		nr_freepages += isolated;
+
+		/*
+		 * Record the highest PFN we isolated pages from. When next
+		 * looking for free pages, the search will restart here as
+		 * page migration may have returned some pages to the allocator
+		 */
+		if (isolated)
+			high_pfn = max(high_pfn, pfn);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	cc->free_pfn = high_pfn;
+	cc->nr_freepages = nr_freepages;
+}
+
+/* Update the number of anon and file isolated pages in the zone */
+static void acct_isolated(struct zone *zone, struct compact_control *cc)
+{
+	struct page *page;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+
+	list_for_each_entry(page, &cc->migratepages, lru) {
+		int lru = page_lru_base_type(page);
+		count[lru]++;
+	}
+
+	cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
+}
+
+/* Similar to reclaim, but different enough that they don't share logic */
+static int too_many_isolated(struct zone *zone)
+{
+
+	unsigned long inactive, isolated;
+
+	inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
+					zone_page_state(zone, NR_INACTIVE_ANON);
+	isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
+					zone_page_state(zone, NR_ISOLATED_ANON);
+
+	return isolated > inactive;
+}
+
+/*
+ * Isolate all pages that can be migrated from the block pointed to by
+ * the migrate scanner within compact_control.
+ */
+static unsigned long isolate_migratepages(struct zone *zone,
+					struct compact_control *cc)
+{
+	unsigned long low_pfn, end_pfn;
+	struct list_head *migratelist;
+
+	low_pfn = cc->migrate_pfn;
+	migratelist = &cc->migratepages;
+
+	/* Do not scan outside zone boundaries */
+	if (low_pfn < zone->zone_start_pfn)
+		low_pfn = zone->zone_start_pfn;
+
+	/* Setup to scan one block but not past where we are migrating to */
+	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
+
+	/* Do not cross the free scanner or scan within a memory hole */
+	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
+		cc->migrate_pfn = end_pfn;
+		return 0;
+	}
+
+	/* Do not isolate the world */
+	while (unlikely(too_many_isolated(zone))) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+		if (fatal_signal_pending(current))
+			return 0;
+	}
+
+	/* Time to isolate some pages for migration */
+	spin_lock_irq(&zone->lru_lock);
+	for (; low_pfn < end_pfn; low_pfn++) {
+		struct page *page;
+		if (!pfn_valid_within(low_pfn))
+			continue;
+
+		/* Get the page and skip if free */
+		page = pfn_to_page(low_pfn);
+		if (PageBuddy(page)) {
+			low_pfn += (1 << page_order(page)) - 1;
+			continue;
+		}
+
+		/* Try isolate the page */
+		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
+			del_page_from_lru_list(zone, page, page_lru(page));
+			list_add(&page->lru, migratelist);
+			mem_cgroup_del_lru(page);
+			cc->nr_migratepages++;
+		}
+
+		/* Avoid isolating too much */
+		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
+			break;
+	}
+
+	acct_isolated(zone, cc);
+
+	spin_unlock_irq(&zone->lru_lock);
+	cc->migrate_pfn = low_pfn;
+
+	return cc->nr_migratepages;
+}
+
+/*
+ * This is a migrate-callback that "allocates" freepages by taking pages
+ * from the isolated freelists in the block we are migrating to.
+ */
+static struct page *compaction_alloc(struct page *migratepage,
+					unsigned long data,
+					int **result)
+{
+	struct compact_control *cc = (struct compact_control *)data;
+	struct page *freepage;
+
+	/* Isolate free pages if necessary */
+	if (list_empty(&cc->freepages)) {
+		isolate_freepages(cc->zone, cc);
+
+		if (list_empty(&cc->freepages))
+			return NULL;
+	}
+
+	freepage = list_entry(cc->freepages.next, struct page, lru);
+	list_del(&freepage->lru);
+	cc->nr_freepages--;
+
+	return freepage;
+}
+
+/*
+ * We cannot control nr_migratepages and nr_freepages fully when migration is
+ * running as migrate_pages() has no knowledge of compact_control. When
+ * migration is complete, we count the number of pages on the lists by hand.
+ */
+static void update_nr_listpages(struct compact_control *cc)
+{
+	int nr_migratepages = 0;
+	int nr_freepages = 0;
+	struct page *page;
+	list_for_each_entry(page, &cc->migratepages, lru)
+		nr_migratepages++;
+	list_for_each_entry(page, &cc->freepages, lru)
+		nr_freepages++;
+
+	cc->nr_migratepages = nr_migratepages;
+	cc->nr_freepages = nr_freepages;
+}
+
+static inline int compact_finished(struct zone *zone,
+						struct compact_control *cc)
+{
+	if (fatal_signal_pending(current))
+		return COMPACT_PARTIAL;
+
+	/* Compaction run completes if the migrate and free scanner meet */
+	if (cc->free_pfn <= cc->migrate_pfn)
+		return COMPACT_COMPLETE;
+
+	return COMPACT_INCOMPLETE;
+}
+
+static int compact_zone(struct zone *zone, struct compact_control *cc)
+{
+	int ret = COMPACT_INCOMPLETE;
+
+	/* Setup to move all movable pages to the end of the zone */
+	cc->migrate_pfn = zone->zone_start_pfn;
+	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
+	cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+	migrate_prep();
+
+	for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
+		unsigned long nr_migrate, nr_remaining;
+		if (!isolate_migratepages(zone, cc))
+			continue;
+
+		nr_migrate = cc->nr_migratepages;
+		migrate_pages(&cc->migratepages, compaction_alloc,
+						(unsigned long)cc, 0);
+		update_nr_listpages(cc);
+		nr_remaining = cc->nr_migratepages;
+
+		count_vm_event(COMPACTBLOCKS);
+		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
+		if (nr_remaining)
+			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
+
+		/* Release LRU pages not migrated */
+		if (!list_empty(&cc->migratepages)) {
+			putback_lru_pages(&cc->migratepages);
+			cc->nr_migratepages = 0;
+		}
+
+	}
+
+	/* Release free pages and check accounting */
+	cc->nr_freepages -= release_freepages(&cc->freepages);
+	VM_BUG_ON(cc->nr_freepages != 0);
+
+	return ret;
+}
+
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 624cba4..3cf947d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1208,6 +1208,45 @@ void split_page(struct page *page, unsigned int order)
 }
 
 /*
+ * Similar to split_page except the page is already free. As this is only
+ * being used for migration, the migratetype of the block also changes.
+ */
+int split_free_page(struct page *page)
+{
+	unsigned int order;
+	unsigned long watermark;
+	struct zone *zone;
+
+	BUG_ON(!PageBuddy(page));
+
+	zone = page_zone(page);
+	order = page_order(page);
+
+	/* Obey watermarks or the system could deadlock */
+	watermark = low_wmark_pages(zone) + (1 << order);
+	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+		return 0;
+
+	/* Remove page from free list */
+	list_del(&page->lru);
+	zone->free_area[order].nr_free--;
+	rmv_page_order(page);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+
+	/* Split into individual pages */
+	set_page_refcounted(page);
+	split_page(page, order);
+
+	if (order >= pageblock_order - 1) {
+		struct page *endpage = page + (1 << order) - 1;
+		for (; page < endpage; page += pageblock_nr_pages)
+			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+	}
+
+	return 1 << order;
+}
+
+/*
  * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 351e491..3a69b48 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -892,6 +892,11 @@ static const char * const vmstat_text[] = {
 	"allocstall",
 
 	"pgrotated",
+
+	"compact_blocks_moved",
+	"compact_pages_moved",
+	"compact_pagemigrate_failed",
+
 #ifdef CONFIG_HUGETLB_PAGE
 	"htlb_buddy_alloc_success",
 	"htlb_buddy_alloc_fail",
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 09/14] Add /proc trigger for memory compaction
  2010-03-30  9:14 ` Mel Gorman
@ 2010-03-30  9:14   ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
value is written to the file, all zones are compacted. The expected user
of such a trigger is a job scheduler that prepares the system before the
target application runs.
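
For illustration only (not part of this patch), a job scheduler could
trigger the compaction pass from userspace roughly as follows. The value
written is irrelevant, only the write itself matters, and the file exists
only when CONFIG_COMPACTION is set:

  /* Userspace sketch: trigger a system-wide compaction pass (needs root) */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/proc/sys/vm/compact_memory", O_WRONLY);

          if (fd < 0) {
                  perror("open");  /* no CONFIG_COMPACTION, or not root */
                  return 1;
          }
          if (write(fd, "1", 1) != 1)
                  perror("write");
          close(fd);
          return 0;
  }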

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 Documentation/sysctl/vm.txt |   11 +++++++
 include/linux/compaction.h  |    6 ++++
 kernel/sysctl.c             |   10 +++++++
 mm/compaction.c             |   62 ++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 88 insertions(+), 1 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 56366a5..803c018 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -19,6 +19,7 @@ files can be found in mm/swap.c.
 Currently, these files are in /proc/sys/vm:
 
 - block_dump
+- compact_memory
 - dirty_background_bytes
 - dirty_background_ratio
 - dirty_bytes
@@ -64,6 +65,16 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
 
 ==============================================================
 
+compact_memory
+
+Available only when CONFIG_COMPACTION is set. When an arbitrary value
+is written to the file, all zones are compacted such that free memory
+is available in contiguous blocks where possible. This can be important
+for example in the allocation of huge pages although processes will also
+directly compact memory as required.
+
+==============================================================
+
 dirty_background_bytes
 
 Contains the amount of dirty memory at which the pdflush background writeback
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index dbebe58..fef591b 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -6,4 +6,10 @@
 #define COMPACT_PARTIAL		1
 #define COMPACT_COMPLETE	2
 
+#ifdef CONFIG_COMPACTION
+extern int sysctl_compact_memory;
+extern int sysctl_compaction_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *length, loff_t *ppos);
+#endif /* CONFIG_COMPACTION */
+
 #endif /* _LINUX_COMPACTION_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 455f394..3838928 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -53,6 +53,7 @@
 #include <linux/slow-work.h>
 #include <linux/perf_event.h>
 #include <linux/kprobes.h>
+#include <linux/compaction.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -1102,6 +1103,15 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= drop_caches_sysctl_handler,
 	},
+#ifdef CONFIG_COMPACTION
+	{
+		.procname	= "compact_memory",
+		.data		= &sysctl_compact_memory,
+		.maxlen		= sizeof(int),
+		.mode		= 0200,
+		.proc_handler	= sysctl_compaction_handler,
+	},
+#endif /* CONFIG_COMPACTION */
 	{
 		.procname	= "min_free_kbytes",
 		.data		= &min_free_kbytes,
diff --git a/mm/compaction.c b/mm/compaction.c
index 4041209..615b811 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -12,6 +12,7 @@
 #include <linux/compaction.h>
 #include <linux/mm_inline.h>
 #include <linux/backing-dev.h>
+#include <linux/sysctl.h>
 #include "internal.h"
 
 /*
@@ -322,7 +323,7 @@ static void update_nr_listpages(struct compact_control *cc)
 	cc->nr_freepages = nr_freepages;
 }
 
-static inline int compact_finished(struct zone *zone,
+static int compact_finished(struct zone *zone,
 						struct compact_control *cc)
 {
 	if (fatal_signal_pending(current))
@@ -377,3 +378,62 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	return ret;
 }
 
+/* Compact all zones within a node */
+static int compact_node(int nid)
+{
+	int zoneid;
+	pg_data_t *pgdat;
+	struct zone *zone;
+
+	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
+		return -EINVAL;
+	pgdat = NODE_DATA(nid);
+
+	/* Flush pending updates to the LRU lists */
+	lru_add_drain_all();
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+		struct compact_control cc;
+
+		zone = &pgdat->node_zones[zoneid];
+		if (!populated_zone(zone))
+			continue;
+
+		cc.nr_freepages = 0;
+		cc.nr_migratepages = 0;
+		cc.zone = zone;
+		INIT_LIST_HEAD(&cc.freepages);
+		INIT_LIST_HEAD(&cc.migratepages);
+
+		compact_zone(zone, &cc);
+
+		VM_BUG_ON(!list_empty(&cc.freepages));
+		VM_BUG_ON(!list_empty(&cc.migratepages));
+	}
+
+	return 0;
+}
+
+/* Compact all nodes in the system */
+static int compact_nodes(void)
+{
+	int nid;
+
+	for_each_online_node(nid)
+		compact_node(nid);
+
+	return COMPACT_COMPLETE;
+}
+
+/* The written value is actually unused, all memory is compacted */
+int sysctl_compact_memory;
+
+/* This is the entry point for compacting all nodes via /proc/sys/vm */
+int sysctl_compaction_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *length, loff_t *ppos)
+{
+	if (write)
+		return compact_nodes();
+
+	return 0;
+}
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 10/14] Add /sys trigger for per-node memory compaction
  2010-03-30  9:14 ` Mel Gorman
@ 2010-03-30  9:14   ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

This patch adds a per-node sysfs file called compact. When the file is
written to, each zone in that node is compacted. The intention is that this
would be used by something like a job scheduler in a batch system before a
job starts, so that the job can allocate the maximum number of hugepages
without significant start-up cost.
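
As a usage sketch (again, not part of the patch itself), a scheduler that
knows which node a job will run on could compact just that node before the
job starts. The node number below is only an example; substitute the node
the job is bound to:

  /* Userspace sketch: compact a single NUMA node before launching a job.
   * Requires CONFIG_COMPACTION, CONFIG_NUMA, CONFIG_SYSFS and root. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          const char *path = "/sys/devices/system/node/node0/compact";
          int fd = open(path, O_WRONLY);

          if (fd < 0) {
                  perror(path);
                  return 1;
          }
          if (write(fd, "1", 1) != 1)
                  perror("write");
          close(fd);
          return 0;
  }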

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/ABI/testing/sysfs-devices-node |    7 +++++++
 drivers/base/node.c                          |    3 +++
 include/linux/compaction.h                   |   16 ++++++++++++++++
 mm/compaction.c                              |   23 +++++++++++++++++++++++
 4 files changed, 49 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-devices-node

diff --git a/Documentation/ABI/testing/sysfs-devices-node b/Documentation/ABI/testing/sysfs-devices-node
new file mode 100644
index 0000000..453a210
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-node
@@ -0,0 +1,7 @@
+What:		/sys/devices/system/node/nodeX/compact
+Date:		February 2010
+Contact:	Mel Gorman <mel@csn.ul.ie>
+Description:
+		When this file is written to, all memory within that node
+		will be compacted. When it completes, memory will be freed
+		into blocks which have as many contiguous pages as possible
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 93b3ac6..07cdcc6 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -15,6 +15,7 @@
 #include <linux/cpu.h>
 #include <linux/device.h>
 #include <linux/swap.h>
+#include <linux/compaction.h>
 
 static struct sysdev_class_attribute *node_state_attrs[];
 
@@ -245,6 +246,8 @@ int register_node(struct node *node, int num, struct node *parent)
 		scan_unevictable_register_node(node);
 
 		hugetlb_register_node(node);
+
+		compaction_register_node(node);
 	}
 	return error;
 }
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index fef591b..c4ab05f 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -12,4 +12,20 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 			void __user *buffer, size_t *length, loff_t *ppos);
 #endif /* CONFIG_COMPACTION */
 
+#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+extern int compaction_register_node(struct node *node);
+extern void compaction_unregister_node(struct node *node);
+
+#else
+
+static inline int compaction_register_node(struct node *node)
+{
+	return 0;
+}
+
+static inline void compaction_unregister_node(struct node *node)
+{
+}
+#endif /* CONFIG_COMPACTION && CONFIG_SYSFS && CONFIG_NUMA */
+
 #endif /* _LINUX_COMPACTION_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index 615b811..b058bae 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -13,6 +13,7 @@
 #include <linux/mm_inline.h>
 #include <linux/backing-dev.h>
 #include <linux/sysctl.h>
+#include <linux/sysfs.h>
 #include "internal.h"
 
 /*
@@ -437,3 +438,25 @@ int sysctl_compaction_handler(struct ctl_table *table, int write,
 
 	return 0;
 }
+
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+ssize_t sysfs_compact_node(struct sys_device *dev,
+			struct sysdev_attribute *attr,
+			const char *buf, size_t count)
+{
+	compact_node(dev->id);
+
+	return count;
+}
+static SYSDEV_ATTR(compact, S_IWUSR, NULL, sysfs_compact_node);
+
+int compaction_register_node(struct node *node)
+{
+	return sysdev_create_file(&node->sysdev, &attr_compact);
+}
+
+void compaction_unregister_node(struct node *node)
+{
+	return sysdev_remove_file(&node->sysdev, &attr_compact);
+}
+#endif /* CONFIG_SYSFS && CONFIG_NUMA */
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 11/14] Direct compact when a high-order allocation fails
  2010-03-30  9:14 ` Mel Gorman
@ 2010-03-30  9:14   ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

Ordinarily, when a high-order allocation fails, direct reclaim is entered to
free pages to satisfy the allocation. With this patch, the kernel first
determines whether the allocation failed because of external fragmentation
rather than low memory and, if so, the calling process compacts memory until
a suitable page is freed. Compaction by moving pages in memory is
considerably cheaper than paging out to disk and works where pages are
locked or there is no swap. If compaction fails to free a page of a suitable
size, reclaim still occurs.

Direct compaction returns as soon as possible. After each block is
compacted, the kernel checks whether a suitable page has been freed and, if
so, returns immediately.
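
As a rough illustration (not part of this patch), the kind of request that
can now enter direct compaction is an allocation from process context whose
order exceeds PAGE_ALLOC_COSTLY_ORDER (3) and whose GFP mask allows FS and
IO, as GFP_KERNEL does. A hypothetical test module making such a request,
which may now attempt compaction before falling back to reclaim when the
free lists cannot satisfy it directly, could look like this:

  /* Hypothetical test module: a single order-5 (32-page) allocation from
   * process context is eligible for direct compaction under this patch. */
  #include <linux/module.h>
  #include <linux/errno.h>
  #include <linux/gfp.h>

  static struct page *test_page;

  static int __init highorder_init(void)
  {
          test_page = alloc_pages(GFP_KERNEL, 5);
          return test_page ? 0 : -ENOMEM;
  }

  static void __exit highorder_exit(void)
  {
          if (test_page)
                  __free_pages(test_page, 5);
  }

  module_init(highorder_init);
  module_exit(highorder_exit);
  MODULE_LICENSE("GPL");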

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 include/linux/compaction.h |   20 ++++++--
 include/linux/vmstat.h     |    1 +
 mm/compaction.c            |  117 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |   31 ++++++++++++
 mm/vmstat.c                |   15 +++++-
 5 files changed, 178 insertions(+), 6 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index c4ab05f..faa3faf 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,15 +1,27 @@
 #ifndef _LINUX_COMPACTION_H
 #define _LINUX_COMPACTION_H
 
-/* Return values for compact_zone() */
-#define COMPACT_INCOMPLETE	0
-#define COMPACT_PARTIAL		1
-#define COMPACT_COMPLETE	2
+/* Return values for compact_zone() and try_to_compact_pages() */
+#define COMPACT_SKIPPED		0
+#define COMPACT_INCOMPLETE	1
+#define COMPACT_PARTIAL		2
+#define COMPACT_COMPLETE	3
 
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 			void __user *buffer, size_t *length, loff_t *ppos);
+
+extern int fragmentation_index(struct zone *zone, unsigned int order);
+extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
+			int order, gfp_t gfp_mask, nodemask_t *mask);
+#else
+static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
+			int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+	return COMPACT_INCOMPLETE;
+}
+
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 56e4b44..b4b4d34 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
 		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
+		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
diff --git a/mm/compaction.c b/mm/compaction.c
index b058bae..e8ef511 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -35,6 +35,8 @@ struct compact_control {
 	unsigned long nr_anon;
 	unsigned long nr_file;
 
+	unsigned int order;		/* order a direct compactor needs */
+	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
 };
 
@@ -327,6 +329,9 @@ static void update_nr_listpages(struct compact_control *cc)
 static int compact_finished(struct zone *zone,
 						struct compact_control *cc)
 {
+	unsigned int order;
+	unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
+
 	if (fatal_signal_pending(current))
 		return COMPACT_PARTIAL;
 
@@ -334,6 +339,24 @@ static int compact_finished(struct zone *zone,
 	if (cc->free_pfn <= cc->migrate_pfn)
 		return COMPACT_COMPLETE;
 
+	/* Compaction run is not finished if the watermark is not met */
+	if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
+		return COMPACT_INCOMPLETE;
+
+	if (cc->order == -1)
+		return COMPACT_INCOMPLETE;
+
+	/* Direct compactor: Is a suitable page free? */
+	for (order = cc->order; order < MAX_ORDER; order++) {
+		/* Job done if page is free of the right migratetype */
+		if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
+			return COMPACT_PARTIAL;
+
+		/* Job done if allocation would set block type */
+		if (order >= pageblock_order && zone->free_area[order].nr_free)
+			return COMPACT_PARTIAL;
+	}
+
 	return COMPACT_INCOMPLETE;
 }
 
@@ -379,6 +402,99 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	return ret;
 }
 
+static unsigned long compact_zone_order(struct zone *zone,
+						int order, gfp_t gfp_mask)
+{
+	struct compact_control cc = {
+		.nr_freepages = 0,
+		.nr_migratepages = 0,
+		.order = order,
+		.migratetype = allocflags_to_migratetype(gfp_mask),
+		.zone = zone,
+	};
+	INIT_LIST_HEAD(&cc.freepages);
+	INIT_LIST_HEAD(&cc.migratepages);
+
+	return compact_zone(zone, &cc);
+}
+
+/**
+ * try_to_compact_pages - Direct compact to satisfy a high-order allocation
+ * @zonelist: The zonelist used for the current allocation
+ * @order: The order of the current allocation
+ * @gfp_mask: The GFP mask of the current allocation
+ * @nodemask: The allowed nodes to allocate from
+ *
+ * This is the main entry point for direct page compaction.
+ */
+unsigned long try_to_compact_pages(struct zonelist *zonelist,
+			int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	int may_enter_fs = gfp_mask & __GFP_FS;
+	int may_perform_io = gfp_mask & __GFP_IO;
+	unsigned long watermark;
+	struct zoneref *z;
+	struct zone *zone;
+	int rc = COMPACT_SKIPPED;
+
+	/*
+	 * Check whether it is worth even starting compaction. The order check is
+	 * made because an assumption is made that the page allocator can satisfy
+	 * the "cheaper" orders without taking special steps
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER || !may_enter_fs || !may_perform_io)
+		return rc;
+
+	count_vm_event(COMPACTSTALL);
+
+	/* Compact each zone in the list */
+	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
+								nodemask) {
+		int fragindex;
+		int status;
+
+		/*
+		 * Watermarks for order-0 must be met for compaction. Note
+		 * the 2UL. This is because during migration, copies of
+		 * pages need to be allocated and for a short time, the
+		 * footprint is higher
+		 */
+		watermark = low_wmark_pages(zone) + (2UL << order);
+		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+			continue;
+
+		/*
+		 * fragmentation index determines if allocation failures are
+		 * due to low memory or external fragmentation
+		 *
+		 * index of -1 implies allocations might succeed depending
+		 * 	on watermarks
+		 * index towards 0 implies failure is due to lack of memory
+		 * index towards 1000 implies failure is due to fragmentation
+		 *
+		 * Only compact if a failure would be due to fragmentation.
+		 */
+		fragindex = fragmentation_index(zone, order);
+		if (fragindex >= 0 && fragindex <= 500)
+			continue;
+
+		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
+			rc = COMPACT_PARTIAL;
+			break;
+		}
+
+		status = compact_zone_order(zone, order, gfp_mask);
+		rc = max(status, rc);
+
+		if (zone_watermark_ok(zone, order, watermark, 0, 0))
+			break;
+	}
+
+	return rc;
+}
+
+
 /* Compact all zones within a node */
 static int compact_node(int nid)
 {
@@ -403,6 +519,7 @@ static int compact_node(int nid)
 		cc.nr_freepages = 0;
 		cc.nr_migratepages = 0;
 		cc.zone = zone;
+		cc.order = -1;
 		INIT_LIST_HEAD(&cc.freepages);
 		INIT_LIST_HEAD(&cc.migratepages);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3cf947d..7a2e4a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -49,6 +49,7 @@
 #include <linux/debugobjects.h>
 #include <linux/kmemleak.h>
 #include <linux/memory.h>
+#include <linux/compaction.h>
 #include <trace/events/kmem.h>
 #include <linux/ftrace_event.h>
 
@@ -1768,6 +1769,36 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	cond_resched();
 
+	/* Try memory compaction for high-order allocations before reclaim */
+	if (order) {
+		*did_some_progress = try_to_compact_pages(zonelist,
+						order, gfp_mask, nodemask);
+		if (*did_some_progress != COMPACT_SKIPPED) {
+
+			/* Page migration frees to the PCP lists but we want merging */
+			drain_pages(get_cpu());
+			put_cpu();
+
+			page = get_page_from_freelist(gfp_mask, nodemask,
+					order, zonelist, high_zoneidx,
+					alloc_flags, preferred_zone,
+					migratetype);
+			if (page) {
+				__count_vm_event(COMPACTSUCCESS);
+				return page;
+			}
+
+			/*
+			 * It's bad if compaction run occurs and fails.
+			 * The most likely reason is that pages exist,
+			 * but not enough to satisfy watermarks.
+			 */
+			count_vm_event(COMPACTFAIL);
+
+			cond_resched();
+		}
+	}
+
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
 	p->flags |= PF_MEMALLOC;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3a69b48..2780a36 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -561,7 +561,7 @@ static int unusable_show(struct seq_file *m, void *arg)
  * The value can be used to determine if page reclaim or compaction
  * should be used
  */
-int fragmentation_index(unsigned int order, struct contig_page_info *info)
+int __fragmentation_index(unsigned int order, struct contig_page_info *info)
 {
 	unsigned long requested = 1UL << order;
 
@@ -581,6 +581,14 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
 	return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
 }
 
+/* Same as __fragmentation index but allocs contig_page_info on stack */
+int fragmentation_index(struct zone *zone, unsigned int order)
+{
+	struct contig_page_info info;
+
+	fill_contig_page_info(zone, order, &info);
+	return __fragmentation_index(order, &info);
+}
 
 static void extfrag_show_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
@@ -596,7 +604,7 @@ static void extfrag_show_print(struct seq_file *m,
 				zone->name);
 	for (order = 0; order < MAX_ORDER; ++order) {
 		fill_contig_page_info(zone, order, &info);
-		index = fragmentation_index(order, &info);
+		index = __fragmentation_index(order, &info);
 		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
 	}
 
@@ -896,6 +904,9 @@ static const char * const vmstat_text[] = {
 	"compact_blocks_moved",
 	"compact_pages_moved",
 	"compact_pagemigrate_failed",
+	"compact_stall",
+	"compact_fail",
+	"compact_success",
 
 #ifdef CONFIG_HUGETLB_PAGE
 	"htlb_buddy_alloc_success",
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 12/14] Add a tunable that decides when memory should be compacted and when it should be reclaimed
  2010-03-30  9:14 ` Mel Gorman
@ 2010-03-30  9:14   ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

The kernel applies some heuristics when deciding if memory should be
compacted or reclaimed to satisfy a high-order allocation. One of these is
based on the fragmentation index. If the index is below 500, memory will not
be compacted. This choice is arbitrary and not based on data. To help
optimise the system and set a sensible default for this value, this patch
adds a sysctl extfrag_threshold. The kernel will only compact memory if the
fragmentation index is above the extfrag_threshold.
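
To make the index semantics concrete, here is a userspace sketch (not
kernel code) of the integer formula used by __fragmentation_index() in this
series; the helper name and the zone values are made up for illustration:

  /* index = 1000 - (1000 + free_pages * 1000 / requested) / free_blocks_total
   * Values near 0 mean failure is from lack of memory; values near 1000 mean
   * enough pages are free but they are badly fragmented. */
  #include <stdio.h>

  static long frag_index(unsigned int order, unsigned long free_pages,
                         unsigned long free_blocks_total)
  {
          unsigned long requested = 1UL << order;

          return 1000 - (long)((1000 + free_pages * 1000 / requested) /
                               free_blocks_total);
  }

  int main(void)
  {
          /* Order-4 request (16 pages) in two hypothetical zones */
          printf("low memory: %ld\n", frag_index(4, 16, 2));     /* -> 0   */
          printf("fragmented: %ld\n", frag_index(4, 1600, 400)); /* -> 748 */
          return 0;
  }

With the default threshold of 500, the first (low memory) zone would fall
back to reclaim while the second (fragmented) zone would be compacted.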

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/sysctl/vm.txt |   18 ++++++++++++++++--
 include/linux/compaction.h  |    3 +++
 kernel/sysctl.c             |   15 +++++++++++++++
 mm/compaction.c             |   12 +++++++++++-
 4 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 803c018..878b1b4 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -27,6 +27,7 @@ Currently, these files are in /proc/sys/vm:
 - dirty_ratio
 - dirty_writeback_centisecs
 - drop_caches
+- extfrag_threshold
 - hugepages_treat_as_movable
 - hugetlb_shm_group
 - laptop_mode
@@ -131,8 +132,7 @@ out to disk.  This tunable expresses the interval between those wakeups, in
 
 Setting this to zero disables periodic writeback altogether.
 
-==============================================================
-
+============================================================== 
 drop_caches
 
 Writing to this will cause the kernel to drop clean caches, dentries and
@@ -150,6 +150,20 @@ user should run `sync' first.
 
 ==============================================================
 
+extfrag_threshold
+
+This parameter affects whether the kernel will compact memory or direct
+reclaim to satisfy a high-order allocation. /proc/extfrag_index shows what
+the fragmentation index for each order is in each zone in the system. Values
+tending towards 0 imply allocations would fail due to lack of memory,
+values towards 1000 imply failures are due to fragmentation and -1 implies
+that the allocation will succeed as long as watermarks are met.
+
+The kernel will not compact memory in a zone if the
+fragmentation index is <= extfrag_threshold. The default value is 500.
+
+==============================================================
+
 hugepages_treat_as_movable
 
 This parameter is only useful when kernelcore= is specified at boot time to
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index faa3faf..ae98afc 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -11,6 +11,9 @@
 extern int sysctl_compact_memory;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 			void __user *buffer, size_t *length, loff_t *ppos);
+extern int sysctl_extfrag_threshold;
+extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *length, loff_t *ppos);
 
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3838928..b8f292e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -243,6 +243,11 @@ static int min_sched_shares_ratelimit = 100000; /* 100 usec */
 static int max_sched_shares_ratelimit = NSEC_PER_SEC; /* 1 second */
 #endif
 
+#ifdef CONFIG_COMPACTION
+static int min_extfrag_threshold = 0;
+static int max_extfrag_threshold = 1000;
+#endif
+
 static struct ctl_table kern_table[] = {
 	{
 		.procname	= "sched_child_runs_first",
@@ -1111,6 +1116,16 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0200,
 		.proc_handler	= sysctl_compaction_handler,
 	},
+	{
+		.procname	= "extfrag_threshold",
+		.data		= &sysctl_extfrag_threshold,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_extfrag_handler,
+		.extra1		= &min_extfrag_threshold,
+		.extra2		= &max_extfrag_threshold,
+	},
+
 #endif /* CONFIG_COMPACTION */
 	{
 		.procname	= "min_free_kbytes",
diff --git a/mm/compaction.c b/mm/compaction.c
index e8ef511..3bb65d7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -418,6 +418,8 @@ static unsigned long compact_zone_order(struct zone *zone,
 	return compact_zone(zone, &cc);
 }
 
+int sysctl_extfrag_threshold = 500;
+
 /**
  * try_to_compact_pages - Direct compact to satisfy a high-order allocation
  * @zonelist: The zonelist used for the current allocation
@@ -476,7 +478,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 		 * Only compact if a failure would be due to fragmentation.
 		 */
 		fragindex = fragmentation_index(zone, order);
-		if (fragindex >= 0 && fragindex <= 500)
+		if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
 			continue;
 
 		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
@@ -556,6 +558,14 @@ int sysctl_compaction_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+int sysctl_extfrag_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *length, loff_t *ppos)
+{
+	proc_dointvec_minmax(table, write, buffer, length, ppos);
+
+	return 0;
+}
+
 #if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
 ssize_t sysfs_compact_node(struct sys_device *dev,
 			struct sysdev_attribute *attr,
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 13/14] Do not compact within a preferred zone after a compaction failure
  2010-03-30  9:14 ` Mel Gorman
@ 2010-03-30  9:14   ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

The fragmentation index may indicate that a failure is due to external
fragmentation but after a compaction run completes, it is still possible
for an allocation to fail. There are two obvious reasons why:

  o Page migration cannot move all pages so fragmentation remains
  o A suitable page may exist but watermarks are not met

In the event of compaction followed by an allocation failure, this patch
defers further compaction in the zone for a period of time. The zone that
is deferred is the first zone in the zonelist - i.e. the preferred zone.
To defer compaction in the other zones, the information would need to be
stored in the zonelist or implemented similarly to the zonelist_cache.
This would impact the fast-paths and is not justified at this time.
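
As an illustration only, a self-contained userspace sketch of the
time-window deferral pattern this patch introduces. The resume timestamp
stands in for zone->compact_resume and the ~20ms back-off mirrors the
HZ/50 value used below; all names here are hypothetical, not kernel API:

/* defer_demo.c - illustration of "do not retry before a resume time" */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdbool.h>
#include <time.h>

static struct timespec resume_at;	/* plays the role of zone->compact_resume */

/* Record the earliest time the expensive operation may be retried */
static void defer_work(long defer_ms)
{
	clock_gettime(CLOCK_MONOTONIC, &resume_at);
	resume_at.tv_nsec += defer_ms * 1000000L;
	resume_at.tv_sec  += resume_at.tv_nsec / 1000000000L;
	resume_at.tv_nsec %= 1000000000L;
}

/* Returns true while the retry window has not yet expired */
static bool work_deferred(void)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	if (now.tv_sec != resume_at.tv_sec)
		return now.tv_sec < resume_at.tv_sec;
	return now.tv_nsec < resume_at.tv_nsec;
}

int main(void)
{
	defer_work(20);	/* back off for ~20ms, the equivalent of HZ/50 */
	printf("deferred immediately after failure: %s\n",
	       work_deferred() ? "yes" : "no");
	return 0;
}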

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/compaction.h |   35 +++++++++++++++++++++++++++++++++++
 include/linux/mmzone.h     |    7 +++++++
 mm/page_alloc.c            |    5 ++++-
 3 files changed, 46 insertions(+), 1 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index ae98afc..2a02719 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -18,6 +18,32 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask);
+
+/* defer_compaction - Do not compact within a zone until a given time */
+static inline void defer_compaction(struct zone *zone, unsigned long resume)
+{
+	/*
+	 * This function is called when compaction fails to result in a page
+	 * allocation success. This is somewhat unsatisfactory as the failure
+	 * to compact has nothing to do with time and everything to do with
+	 * the requested order, the number of free pages and watermarks. How
+	 * to wait on that is more unclear, but the answer would apply to
+	 * other areas where the VM waits based on time.
+	 */
+	zone->compact_resume = resume;
+}
+
+static inline int compaction_deferred(struct zone *zone)
+{
+	/* init once if necessary */
+	if (unlikely(!zone->compact_resume)) {
+		zone->compact_resume = jiffies;
+		return 0;
+	}
+
+	return time_before(jiffies, zone->compact_resume);
+}
+
 #else
 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask)
@@ -25,6 +51,15 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	return COMPACT_INCOMPLETE;
 }
 
+static inline void defer_compaction(struct zone *zone, unsigned long resume)
+{
+}
+
+static inline int compaction_deferred(struct zone *zone)
+{
+	return 1;
+}
+
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cf9e458..bde879b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -321,6 +321,13 @@ struct zone {
 	unsigned long		*pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
 
+#ifdef CONFIG_COMPACTION
+	/*
+	 * If a compaction fails, do not try compaction again until
+	 * jiffies is after the value of compact_resume
+	 */
+	unsigned long		compact_resume;
+#endif
 
 	ZONE_PADDING(_pad1_)
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7a2e4a2..66823bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1770,7 +1770,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	cond_resched();
 
 	/* Try memory compaction for high-order allocations before reclaim */
-	if (order) {
+	if (order && !compaction_deferred(preferred_zone)) {
 		*did_some_progress = try_to_compact_pages(zonelist,
 						order, gfp_mask, nodemask);
 		if (*did_some_progress != COMPACT_SKIPPED) {
@@ -1795,6 +1795,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 			 */
 			count_vm_event(COMPACTFAIL);
 
+			/* On failure, avoid compaction for a short time. */
+			defer_compaction(preferred_zone, jiffies + HZ/50);
+
 			cond_resched();
 		}
 	}
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-03-30  9:14 ` Mel Gorman
@ 2010-03-30  9:14   ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

PageAnon pages that are unmapped may or may not have an anon_vma so
are not currently migrated. However, a swap cache page can be migrated
and fits this description. This patch identifies page swap caches and
allows them to be migrated.
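
As an illustration only, a compilable userspace restatement of the
decision this patch changes. The structure and function names are
hypothetical stand-ins for the kernel predicates PageAnon(),
page_mapped() and PageSwapCache():

#include <stdbool.h>
#include <stdio.h>

struct page_state {
	bool anon;		/* PageAnon()      */
	bool mapped;		/* page_mapped()   */
	bool swapcache;		/* PageSwapCache() */
};

/* Before the patch: an unmapped anonymous page is never migrated
 * because it may have no anon_vma to take a reference on. */
static bool migratable_before(const struct page_state *p)
{
	return !p->anon || p->mapped;
}

/* After the patch: an unmapped anonymous page may still be migrated
 * when it sits in the swap cache, since its data is then reachable
 * through the swap cache rather than an anon_vma. */
static bool migratable_after(const struct page_state *p)
{
	return !p->anon || p->mapped || p->swapcache;
}

int main(void)
{
	struct page_state unmapped_swapcache = { true, false, true };

	printf("before: %d, after: %d\n",
	       migratable_before(&unmapped_swapcache),
	       migratable_after(&unmapped_swapcache));
	return 0;
}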

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/migrate.c |   15 ++++++++++-----
 mm/rmap.c    |    6 ++++--
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 35aad2a..f9bf37e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -203,6 +203,9 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 	void **pslot;
 
 	if (!mapping) {
+		if (PageSwapCache(page))
+			SetPageSwapCache(newpage);
+
 		/* Anonymous page without mapping */
 		if (page_count(page) != 1)
 			return -EAGAIN;
@@ -607,11 +610,13 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 		 * the page was isolated and when we reached here while
 		 * the RCU lock was not held
 		 */
-		if (!page_mapped(page))
-			goto rcu_unlock;
-
-		anon_vma = page_anon_vma(page);
-		atomic_inc(&anon_vma->external_refcount);
+		if (!page_mapped(page)) {
+			if (!PageSwapCache(page))
+				goto rcu_unlock;
+		} else {
+			anon_vma = page_anon_vma(page);
+			atomic_inc(&anon_vma->external_refcount);
+		}
 	}
 
 	/*
diff --git a/mm/rmap.c b/mm/rmap.c
index af35b75..d5ea1f2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1394,9 +1394,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
 
 	if (unlikely(PageKsm(page)))
 		return rmap_walk_ksm(page, rmap_one, arg);
-	else if (PageAnon(page))
+	else if (PageAnon(page)) {
+		if (PageSwapCache(page))
+			return SWAP_AGAIN;
 		return rmap_walk_anon(page, rmap_one, arg);
-	else
+	} else
 		return rmap_walk_file(page, rmap_one, arg);
 }
 #endif /* CONFIG_MIGRATION */
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-03-30  9:14   ` Mel Gorman
@ 2010-03-31  5:26     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-31  5:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, 30 Mar 2010 10:14:49 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> PageAnon pages that are unmapped may or may not have an anon_vma so
> are not currently migrated. However, a swap cache page can be migrated
> and fits this description. This patch identifies page swap caches and
> allows them to be migrated.
> 

Some comments.

> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/migrate.c |   15 ++++++++++-----
>  mm/rmap.c    |    6 ++++--
>  2 files changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 35aad2a..f9bf37e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -203,6 +203,9 @@ static int migrate_page_move_mapping(struct address_space *mapping,
>  	void **pslot;
>  
>  	if (!mapping) {
> +		if (PageSwapCache(page))
> +			SetPageSwapCache(newpage);
> +

Migration of SwapCache requires radix-tree replacement, IOW, 
 mapping == NULL && PageSwapCache is BUG.

So, this never happens.


>  		/* Anonymous page without mapping */
>  		if (page_count(page) != 1)
>  			return -EAGAIN;
> @@ -607,11 +610,13 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  		 * the page was isolated and when we reached here while
>  		 * the RCU lock was not held
>  		 */
> -		if (!page_mapped(page))
> -			goto rcu_unlock;
> -
> -		anon_vma = page_anon_vma(page);
> -		atomic_inc(&anon_vma->external_refcount);
> +		if (!page_mapped(page)) {
> +			if (!PageSwapCache(page))
> +				goto rcu_unlock;
> +		} else {
> +			anon_vma = page_anon_vma(page);
> +			atomic_inc(&anon_vma->external_refcount);
> +		}
>  	}
>  
>  	/*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index af35b75..d5ea1f2 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1394,9 +1394,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
>  
>  	if (unlikely(PageKsm(page)))
>  		return rmap_walk_ksm(page, rmap_one, arg);
> -	else if (PageAnon(page))
> +	else if (PageAnon(page)) {
> +		if (PageSwapCache(page))
> +			return SWAP_AGAIN;
>  		return rmap_walk_anon(page, rmap_one, arg);

SwapCache can be in the state (PageSwapCache(page) && page_mapped(page)) == true.

Please see do_swap_page(), PageSwapCache bit is cleared only when

do_swap_page()...
       swap_free(entry);
        if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
                try_to_free_swap(page);

Then, PageSwapCache is cleared only when swap is freeable even if mapped.

rmap_walk_anon() should be called and the check is not necessary.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-03-31  5:26     ` KAMEZAWA Hiroyuki
@ 2010-03-31 11:27       ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-03-31 11:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 31, 2010 at 02:26:23PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 30 Mar 2010 10:14:49 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > PageAnon pages that are unmapped may or may not have an anon_vma so
> > are not currently migrated. However, a swap cache page can be migrated
> > and fits this description. This patch identifies page swap caches and
> > allows them to be migrated.
> > 
> 
> Some comments.
> 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/migrate.c |   15 ++++++++++-----
> >  mm/rmap.c    |    6 ++++--
> >  2 files changed, 14 insertions(+), 7 deletions(-)
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 35aad2a..f9bf37e 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -203,6 +203,9 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> >  	void **pslot;
> >  
> >  	if (!mapping) {
> > +		if (PageSwapCache(page))
> > +			SetPageSwapCache(newpage);
> > +
> 
> Migration of SwapCache requires radix-tree replacement, IOW, 
>  mapping == NULL && PageSwapCache is BUG.
> 
> So, this never happens.
> 

Correct. In the initial attempt to allow PageSwapCache pages to move, I
encountered a number of bugs. This was an "it can't be happening but something
is odd" change that I meant to drop in the final version but forgot. Thanks

> 
> >  		/* Anonymous page without mapping */
> >  		if (page_count(page) != 1)
> >  			return -EAGAIN;
> > @@ -607,11 +610,13 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >  		 * the page was isolated and when we reached here while
> >  		 * the RCU lock was not held
> >  		 */
> > -		if (!page_mapped(page))
> > -			goto rcu_unlock;
> > -
> > -		anon_vma = page_anon_vma(page);
> > -		atomic_inc(&anon_vma->external_refcount);
> > +		if (!page_mapped(page)) {
> > +			if (!PageSwapCache(page))
> > +				goto rcu_unlock;
> > +		} else {
> > +			anon_vma = page_anon_vma(page);
> > +			atomic_inc(&anon_vma->external_refcount);
> > +		}
> >  	}
> >  
> >  	/*
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index af35b75..d5ea1f2 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1394,9 +1394,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
> >  
> >  	if (unlikely(PageKsm(page)))
> >  		return rmap_walk_ksm(page, rmap_one, arg);
> > -	else if (PageAnon(page))
> > +	else if (PageAnon(page)) {
> > +		if (PageSwapCache(page))
> > +			return SWAP_AGAIN;
> >  		return rmap_walk_anon(page, rmap_one, arg);
> 
> SwapCache can be in the state (PageSwapCache(page) && page_mapped(page)) == true.
> 
> Please see do_swap_page(), PageSwapCache bit is cleared only when
> 
> do_swap_page()...
>        swap_free(entry);
>         if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
>                 try_to_free_swap(page);
> 
> Then, PageSwapCache is cleared only when swap is freeable even if mapped.
> 
> rmap_walk_anon() should be called and the check is not necessary.
> 

I follow your reasoning. What caught me is the following comment;

         * Corner case handling:
         * 1. When a new swap-cache page is read into, it is added to the LRU
         * and treated as swapcache but it has no rmap yet.
         * Calling try_to_unmap() against a page->mapping==NULL page will
         * trigger a BUG.  So handle it here.

and the fact that without the check the following bug is triggered;

[  476.541321] BUG: unable to handle kernel NULL pointer dereference at (null)
[  476.590328] IP: [<ffffffff81072162>] __bfs+0x1a1/0x1d7
[  476.590328] PGD 3781c067 PUD 371b2067 PMD 0 
[  476.590328] Oops: 0000 [#1] PREEMPT SMP 
[  476.590328] last sysfs file: /sys/block/sr0/capability
[  476.590328] CPU 0 
[  476.590328] Modules linked in: highalloc trace_allocmap buddyinfo vmregress_core oprofile dm_crypt loop i2c_piix4 evdev i2c_core processor serio_raw tpm_tis tpm tpm_bios shpchp pci_hotplug button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci ide_pci_generic libata ide_core scsi_mod ohci_hcd r8169 mii ehci_hcd floppy thermal fan thermal_sys
[  477.296405] 
[  477.296405] Pid: 4343, comm: bench-stresshig Not tainted 2.6.34-rc2-mm1-compaction-v7r3 #1 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[  477.296405] RIP: 0010:[<ffffffff81072162>]  [<ffffffff81072162>] __bfs+0x1a1/0x1d7
[  477.296405] RSP: 0000:ffff88007817d4c8  EFLAGS: 00010046
[  477.296405] RAX: ffffffff81c80170 RBX: ffff88007817d528 RCX: 0000000000000000
[  477.296405] RDX: ffff88007817d528 RSI: 0000000000000000 RDI: ffff88007817d528
[  477.296405] RBP: ffff88007817d518 R08: 0000000000000000 R09: 0000000000000000
[  477.296405] R10: ffffffff816a1d08 R11: 0000000000000046 R12: 0000000000000000
[  477.922839] R13: ffffffff81513887 R14: ffffffff81c80170 R15: 0000000000000000
[  477.922839] FS:  00007f8d853d26e0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
[  478.123091] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  478.123091] CR2: 0000000000000000 CR3: 0000000037a0e000 CR4: 00000000000006f0
[  478.123091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  478.123091] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  478.123091] Process bench-stresshig (pid: 4343, threadinfo ffff88007817c000, task ffff88007ebea000)
[  478.566030] Stack:
[  478.566030]  ffff88007817d570 ffffffff81070f92 0000000000000000 0000000000000002
[  478.566030] <0> ffff88007817d528 ffff88007817d528 ffff88007ebea668 ffffffff81513887
[  478.566030] <0> ffff88007ebea000 ffff88007ebea000 ffff88007817d598 ffffffff81074201
[  478.566030] Call Trace:
[  478.566030]  [<ffffffff81070f92>] ? usage_match+0x0/0x17
[  478.566030]  [<ffffffff81074201>] check_usage_backwards+0x93/0xca
[  478.566030]  [<ffffffff8107416e>] ? check_usage_backwards+0x0/0xca
[  478.566030]  [<ffffffff81073544>] mark_lock+0x31d/0x52f
[  478.566030]  [<ffffffff8107515a>] __lock_acquire+0x801/0x1776
[  478.566030]  [<ffffffff810761c5>] lock_acquire+0xf6/0x122
[  478.566030]  [<ffffffff810ef121>] ? rmap_walk+0x5c/0x16d
[  478.566030]  [<ffffffff812fcfeb>] _raw_spin_lock+0x3b/0x47
[  478.566030]  [<ffffffff810ef121>] ? rmap_walk+0x5c/0x16d
[  478.566030]  [<ffffffff810ef121>] rmap_walk+0x5c/0x16d
[  478.566030]  [<ffffffff81106396>] ? remove_migration_pte+0x0/0x234
[  478.566030]  [<ffffffff81300dc1>] ? sub_preempt_count+0x9/0x83
[  478.566030]  [<ffffffff81106914>] ? migrate_page_copy+0xa0/0x1ed
[  478.566030]  [<ffffffff81106ea4>] migrate_pages+0x3fc/0x5d3
[  478.566030]  [<ffffffff81106c56>] ? migrate_pages+0x1ae/0x5d3
[  478.566030]  [<ffffffff81073a24>] ? trace_hardirqs_on_caller+0x110/0x134
[  478.566030]  [<ffffffff81107e11>] ? compaction_alloc+0x0/0x283
[  478.566030]  [<ffffffff811079b0>] ? compact_zone+0x14e/0x4bd

Granted, what I should have spotted was that a more relevant check was for
the specific corner case like in the revised patch below. Please note that
if I put a WARN_ON in the check in rmap.c, it can and does trigger. If this
situation really is never meant to occur, there is a race that needs to be
closed before PageSwapCache can be migrated.

==== CUT HERE ====

mm,migration: Allow the migration of PageSwapCache pages

PageAnon pages that are unmapped may or may not have an anon_vma so
are not currently migrated. However, a swap cache page can be migrated
and fits this description. This patch identifies page swap caches and
allows them to be migrated.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/migrate.c |   12 +++++++-----
 mm/rmap.c    |    7 +++++--
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 35aad2a..2284d79 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -607,11 +607,13 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 		 * the page was isolated and when we reached here while
 		 * the RCU lock was not held
 		 */
-		if (!page_mapped(page))
-			goto rcu_unlock;
-
-		anon_vma = page_anon_vma(page);
-		atomic_inc(&anon_vma->external_refcount);
+		if (!page_mapped(page)) {
+			if (!PageSwapCache(page))
+				goto rcu_unlock;
+		} else {
+			anon_vma = page_anon_vma(page);
+			atomic_inc(&anon_vma->external_refcount);
+		}
 	}
 
 	/*
diff --git a/mm/rmap.c b/mm/rmap.c
index af35b75..7d63c68 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1394,9 +1394,12 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
 
 	if (unlikely(PageKsm(page)))
 		return rmap_walk_ksm(page, rmap_one, arg);
-	else if (PageAnon(page))
+	else if (PageAnon(page)) {
+		/* See comment on corner case handling in mm/migrate.c */
+		if (PageSwapCache(page) && !page_mapped(page))
+			return SWAP_AGAIN;
 		return rmap_walk_anon(page, rmap_one, arg);
-	else
+	} else
 		return rmap_walk_file(page, rmap_one, arg);
 }
 #endif /* CONFIG_MIGRATION */

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-03-31 11:27       ` Mel Gorman
@ 2010-03-31 23:57         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-31 23:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, 31 Mar 2010 12:27:30 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> On Wed, Mar 31, 2010 at 02:26:23PM +0900, KAMEZAWA Hiroyuki wrote:

> > Then, PageSwapCache is cleared only when swap is freeable even if mapped.
> > 
> > rmap_walk_anon() should be called and the check is not necessary.
> > 
> 
> I follow your reasoning. What caught me is the following comment;
> 
>          * Corner case handling:
>          * 1. When a new swap-cache page is read into, it is added to the LRU
>          * and treated as swapcache but it has no rmap yet.
>          * Calling try_to_unmap() against a page->mapping==NULL page will
>          * trigger a BUG.  So handle it here.
> 
> and the fact that without the check the following bug is triggered;
> 
> [  476.541321] BUG: unable to handle kernel NULL pointer dereference at (null)
> [  476.590328] IP: [<ffffffff81072162>] __bfs+0x1a1/0x1d7
> [  476.590328] PGD 3781c067 PUD 371b2067 PMD 0 
> [  476.590328] Oops: 0000 [#1] PREEMPT SMP 
> [  476.590328] last sysfs file: /sys/block/sr0/capability
> [  476.590328] CPU 0 
> [  476.590328] Modules linked in: highalloc trace_allocmap buddyinfo vmregress_core oprofile dm_crypt loop i2c_piix4 evdev i2c_core processor serio_raw tpm_tis tpm tpm_bios shpchp pci_hotplug button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci ide_pci_generic libata ide_core scsi_mod ohci_hcd r8169 mii ehci_hcd floppy thermal fan thermal_sys
> [  477.296405] 
> [  477.296405] Pid: 4343, comm: bench-stresshig Not tainted 2.6.34-rc2-mm1-compaction-v7r3 #1 GA-MA790GP-UD4H/GA-MA790GP-UD4H
> [  477.296405] RIP: 0010:[<ffffffff81072162>]  [<ffffffff81072162>] __bfs+0x1a1/0x1d7
> [  477.296405] RSP: 0000:ffff88007817d4c8  EFLAGS: 00010046
> [  477.296405] RAX: ffffffff81c80170 RBX: ffff88007817d528 RCX: 0000000000000000
> [  477.296405] RDX: ffff88007817d528 RSI: 0000000000000000 RDI: ffff88007817d528
> [  477.296405] RBP: ffff88007817d518 R08: 0000000000000000 R09: 0000000000000000
> [  477.296405] R10: ffffffff816a1d08 R11: 0000000000000046 R12: 0000000000000000
> [  477.922839] R13: ffffffff81513887 R14: ffffffff81c80170 R15: 0000000000000000
> [  477.922839] FS:  00007f8d853d26e0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
> [  478.123091] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  478.123091] CR2: 0000000000000000 CR3: 0000000037a0e000 CR4: 00000000000006f0
> [  478.123091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  478.123091] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  478.123091] Process bench-stresshig (pid: 4343, threadinfo ffff88007817c000, task ffff88007ebea000)
> [  478.566030] Stack:
> [  478.566030]  ffff88007817d570 ffffffff81070f92 0000000000000000 0000000000000002
> [  478.566030] <0> ffff88007817d528 ffff88007817d528 ffff88007ebea668 ffffffff81513887
> [  478.566030] <0> ffff88007ebea000 ffff88007ebea000 ffff88007817d598 ffffffff81074201
> [  478.566030] Call Trace:
> [  478.566030]  [<ffffffff81070f92>] ? usage_match+0x0/0x17
> [  478.566030]  [<ffffffff81074201>] check_usage_backwards+0x93/0xca
> [  478.566030]  [<ffffffff8107416e>] ? check_usage_backwards+0x0/0xca
> [  478.566030]  [<ffffffff81073544>] mark_lock+0x31d/0x52f
> [  478.566030]  [<ffffffff8107515a>] __lock_acquire+0x801/0x1776
> [  478.566030]  [<ffffffff810761c5>] lock_acquire+0xf6/0x122
> [  478.566030]  [<ffffffff810ef121>] ? rmap_walk+0x5c/0x16d
> [  478.566030]  [<ffffffff812fcfeb>] _raw_spin_lock+0x3b/0x47
> [  478.566030]  [<ffffffff810ef121>] ? rmap_walk+0x5c/0x16d
> [  478.566030]  [<ffffffff810ef121>] rmap_walk+0x5c/0x16d
> [  478.566030]  [<ffffffff81106396>] ? remove_migration_pte+0x0/0x234
> [  478.566030]  [<ffffffff81300dc1>] ? sub_preempt_count+0x9/0x83
> [  478.566030]  [<ffffffff81106914>] ? migrate_page_copy+0xa0/0x1ed
> [  478.566030]  [<ffffffff81106ea4>] migrate_pages+0x3fc/0x5d3
> [  478.566030]  [<ffffffff81106c56>] ? migrate_pages+0x1ae/0x5d3
> [  478.566030]  [<ffffffff81073a24>] ? trace_hardirqs_on_caller+0x110/0x134
> [  478.566030]  [<ffffffff81107e11>] ? compaction_alloc+0x0/0x283
> [  478.566030]  [<ffffffff811079b0>] ? compact_zone+0x14e/0x4bd
> 
> Granted, what I should have spotted was that a more relevant check was for
> the specific corner case like in the revised patch below. Please note that
> if I put a WARN_ON in the check in rmap.c, it can and does trigger. If this
> situation really is never meant to occur, there is a race that needs to be
> closed before PageSwapCache can be migrated.
> 
> ==== CUT HERE ====
> 
> mm,migration: Allow the migration of PageSwapCache pages
> 
> PageAnon pages that are unmapped may or may not have an anon_vma so
> are not currently migrated. However, a swap cache page can be migrated
> and fits this description. This patch identifies page swap caches and
> allows them to be migrated.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/migrate.c |   12 +++++++-----
>  mm/rmap.c    |    7 +++++--
>  2 files changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 35aad2a..2284d79 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -607,11 +607,13 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  		 * the page was isolated and when we reached here while
>  		 * the RCU lock was not held
>  		 */
> -		if (!page_mapped(page))
> -			goto rcu_unlock;
> -
> -		anon_vma = page_anon_vma(page);
> -		atomic_inc(&anon_vma->external_refcount);
> +		if (!page_mapped(page)) {
> +			if (!PageSwapCache(page))
> +				goto rcu_unlock;
> +		} else {
> +			anon_vma = page_anon_vma(page);
> +			atomic_inc(&anon_vma->external_refcount);
> +		}
>  	}
>  
>  	/*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index af35b75..7d63c68 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1394,9 +1394,12 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
>  
>  	if (unlikely(PageKsm(page)))
>  		return rmap_walk_ksm(page, rmap_one, arg);
> -	else if (PageAnon(page))
> +	else if (PageAnon(page)) {
> +		/* See comment on corner case handling in mm/migrate.c */
> +		if (PageSwapCache(page) && !page_mapped(page))
> +			return SWAP_AGAIN;
>  		return rmap_walk_anon(page, rmap_one, arg);

In this path rmap_walk_anon() is called against unmapped pages, so
!page_mapped() is always true and the behaviour will be no different from
the previous version.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 72+ messages in thread

> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 35aad2a..2284d79 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -607,11 +607,13 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  		 * the page was isolated and when we reached here while
>  		 * the RCU lock was not held
>  		 */
> -		if (!page_mapped(page))
> -			goto rcu_unlock;
> -
> -		anon_vma = page_anon_vma(page);
> -		atomic_inc(&anon_vma->external_refcount);
> +		if (!page_mapped(page)) {
> +			if (!PageSwapCache(page))
> +				goto rcu_unlock;
> +		} else {
> +			anon_vma = page_anon_vma(page);
> +			atomic_inc(&anon_vma->external_refcount);
> +		}
>  	}
>  
>  	/*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index af35b75..7d63c68 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1394,9 +1394,12 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
>  
>  	if (unlikely(PageKsm(page)))
>  		return rmap_walk_ksm(page, rmap_one, arg);
> -	else if (PageAnon(page))
> +	else if (PageAnon(page)) {
> +		/* See comment on corner case handling in mm/migrate.c */
> +		if (PageSwapCache(page) && !page_mapped(page))
> +			return SWAP_AGAIN;
>  		return rmap_walk_anon(page, rmap_one, arg);

rmap_walk_anon() is called against unmapped pages.
Then, !page_mapped() is always true, so the behavior will not be different
from the previous version.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache  pages
  2010-03-31 23:57         ` KAMEZAWA Hiroyuki
@ 2010-04-01  2:39           ` Minchan Kim
  -1 siblings, 0 replies; 72+ messages in thread
From: Minchan Kim @ 2010-04-01  2:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Thu, Apr 1, 2010 at 8:57 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com>
>
> rmap_walk_anon() is called against unmapped pages.
> Then, !page_mapped() is always true. So, the behavior will not be different from
> the last one.
>

rmap_walk_anon() can also be called when try_to_unmap() fails.
In that case, page_mapped() is still true in unmap_and_move() and
remove_migration_ptes() can be called.

But I am not sure this part of Mel's patch is really needed.
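
To make this concrete, the path I mean is roughly the following. This is
only a sketch of the relevant part of unmap_and_move() as I read it in
this series, not the exact code (error handling, locking and the
file-backed path are left out):

==
	/*
	 * try_to_unmap() may fail to convert every pte into a migration
	 * entry.  The page then stays mapped, move_to_new_page() is not
	 * attempted, and remove_migration_ptes() walks the rmap of the
	 * *old* page -- so rmap_walk()/rmap_walk_anon() does run against
	 * a page with page_mapped() == true.
	 */
	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);

	if (!page_mapped(page))
		rc = move_to_new_page(newpage, page);

	if (rc)
		remove_migration_ptes(page, page);
==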

> Thanks,
> -Kame
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache  pages
  2010-03-31  5:26     ` KAMEZAWA Hiroyuki
@ 2010-04-01  2:43       ` Minchan Kim
  -1 siblings, 0 replies; 72+ messages in thread
From: Minchan Kim @ 2010-04-01  2:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Wed, Mar 31, 2010 at 2:26 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 30 Mar 2010 10:14:49 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
>
>> PageAnon pages that are unmapped may or may not have an anon_vma so
>> are not currently migrated. However, a swap cache page can be migrated
>> and fits this description. This patch identifies page swap caches and
>> allows them to be migrated.
>>
>
> Some comments.
>
>> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> ---
>>  mm/migrate.c |   15 ++++++++++-----
>>  mm/rmap.c    |    6 ++++--
>>  2 files changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 35aad2a..f9bf37e 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -203,6 +203,9 @@ static int migrate_page_move_mapping(struct address_space *mapping,
>>       void **pslot;
>>
>>       if (!mapping) {
>> +             if (PageSwapCache(page))
>> +                     SetPageSwapCache(newpage);
>> +
>
> Migration of SwapCache requires radix-tree replacement, IOW,
>  mapping == NULL && PageSwapCache is BUG.
>
> So, this never happens.
>
>
>>               /* Anonymous page without mapping */
>>               if (page_count(page) != 1)
>>                       return -EAGAIN;
>> @@ -607,11 +610,13 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>>                * the page was isolated and when we reached here while
>>                * the RCU lock was not held
>>                */
>> -             if (!page_mapped(page))
>> -                     goto rcu_unlock;
>> -
>> -             anon_vma = page_anon_vma(page);
>> -             atomic_inc(&anon_vma->external_refcount);
>> +             if (!page_mapped(page)) {
>> +                     if (!PageSwapCache(page))
>> +                             goto rcu_unlock;
>> +             } else {
>> +                     anon_vma = page_anon_vma(page);
>> +                     atomic_inc(&anon_vma->external_refcount);
>> +             }
>>       }
>>
>>       /*
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index af35b75..d5ea1f2 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1394,9 +1394,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
>>
>>       if (unlikely(PageKsm(page)))
>>               return rmap_walk_ksm(page, rmap_one, arg);
>> -     else if (PageAnon(page))
>> +     else if (PageAnon(page)) {
>> +             if (PageSwapCache(page))
>> +                     return SWAP_AGAIN;
>>               return rmap_walk_anon(page, rmap_one, arg);
>
> SwapCache has a condition as (PageSwapCache(page) && page_mapped(page) == true.
>

In the case of tmpfs, a page can be in swapcache but not mapped.

> Please see do_swap_page(), PageSwapCache bit is cleared only when
>
> do_swap_page()...
>       swap_free(entry);
>        if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
>                try_to_free_swap(page);
>
> Then, PageSwapCache is cleared only when swap is freeable even if mapped.
>
> rmap_walk_anon() should be called and the check is not necessary.

Frankly speaking, I don't understand what problem Mel hit, why he added
the SwapCache check in rmap_walk, and why you say we don't need it.

Could you explain in more detail if you don't mind?

>
> Thanks,
> -Kame
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache  pages
  2010-04-01  2:43       ` Minchan Kim
@ 2010-04-01  3:01         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-01  3:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Thu, 1 Apr 2010 11:43:18 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Wed, Mar 31, 2010 at 2:26 PM, KAMEZAWA Hiroyuki       /*
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index af35b75..d5ea1f2 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1394,9 +1394,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
> >>
> >>       if (unlikely(PageKsm(page)))
> >>               return rmap_walk_ksm(page, rmap_one, arg);
> >> -     else if (PageAnon(page))
> >> +     else if (PageAnon(page)) {
> >> +             if (PageSwapCache(page))
> >> +                     return SWAP_AGAIN;
> >>               return rmap_walk_anon(page, rmap_one, arg);
> >
> > SwapCache has a condition as (PageSwapCache(page) && page_mapped(page) == true.
> >
> 
> In case of tmpfs, page has swapcache but not mapped.
> 
> > Please see do_swap_page(), PageSwapCache bit is cleared only when
> >
> > do_swap_page()...
> >       swap_free(entry);
> >        if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> >                try_to_free_swap(page);
> >
> > Then, PageSwapCache is cleared only when swap is freeable even if mapped.
> >
> > rmap_walk_anon() should be called and the check is not necessary.
> 
> Frankly speaking, I don't understand what is Mel's problem, why he added
> Swapcache check in rmap_walk, and why do you said we don't need it.
> 
> Could you explain more detail if you don't mind?
> 
I may be missing something.

unmap_and_move()
 1. try_to_unmap(TTU_MIGRATION)
 2. move_to_newpage
 3. remove_migration_ptes
	-> rmap_walk()

Then, to map back the page we unmapped, we call rmap_walk().

Assume a SwapCache which is mapped, then, PageAnon(page) == true.

 At 1. try_to_unmap() will rewrite pte with swp_entry of SwapCache.
       mapcount goes to 0.
 At 2. SwapCache is copied to a new page.
 At 3. The new page is mapped back to the place. Now, newpage's mapcount is 0.
       Before patch, the new page is mapped back to all ptes.
       After patch, the new page is not mapped back because its mapcount is 0.

I don't think shared SwapCache of anon pages is unusual behavior, so the
logic before the patch is more attractive.

If SwapCache is not mapped before "1", we skip "1" and rmap_walk will do nothing
because page->mapping is NULL.
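
In code terms, step 3 is just an rmap walk. This is only a sketch from my
reading of the sources this series is against, not the exact code, but it
shows where rmap_walk() comes in:

==
	/* step 3: rewrite the migration entries installed at step 1 */
	static void remove_migration_ptes(struct page *old, struct page *new)
	{
		rmap_walk(new, remove_migration_pte, old);
	}

	/* rmap_walk() then dispatches on the page type */
	int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
		struct vm_area_struct *, unsigned long, void *), void *arg)
	{
		if (unlikely(PageKsm(page)))
			return rmap_walk_ksm(page, rmap_one, arg);
		else if (PageAnon(page))
			return rmap_walk_anon(page, rmap_one, arg); /* takes anon_vma->lock */
		else
			return rmap_walk_file(page, rmap_one, arg);
	}
==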

Thanks,
-Kame

















^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache  pages
  2010-04-01  3:01         ` KAMEZAWA Hiroyuki
@ 2010-04-01  4:44           ` Minchan Kim
  -1 siblings, 0 replies; 72+ messages in thread
From: Minchan Kim @ 2010-04-01  4:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Thu, Apr 1, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 1 Apr 2010 11:43:18 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> On Wed, Mar 31, 2010 at 2:26 PM, KAMEZAWA Hiroyuki       /*
>> >> diff --git a/mm/rmap.c b/mm/rmap.c
>> >> index af35b75..d5ea1f2 100644
>> >> --- a/mm/rmap.c
>> >> +++ b/mm/rmap.c
>> >> @@ -1394,9 +1394,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
>> >>
>> >>       if (unlikely(PageKsm(page)))
>> >>               return rmap_walk_ksm(page, rmap_one, arg);
>> >> -     else if (PageAnon(page))
>> >> +     else if (PageAnon(page)) {
>> >> +             if (PageSwapCache(page))
>> >> +                     return SWAP_AGAIN;
>> >>               return rmap_walk_anon(page, rmap_one, arg);
>> >
>> > SwapCache has a condition as (PageSwapCache(page) && page_mapped(page) == true.
>> >
>>
>> In case of tmpfs, page has swapcache but not mapped.
>>
>> > Please see do_swap_page(), PageSwapCache bit is cleared only when
>> >
>> > do_swap_page()...
>> >       swap_free(entry);
>> >        if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
>> >                try_to_free_swap(page);
>> >
>> > Then, PageSwapCache is cleared only when swap is freeable even if mapped.
>> >
>> > rmap_walk_anon() should be called and the check is not necessary.
>>
>> Frankly speaking, I don't understand what is Mel's problem, why he added
>> Swapcache check in rmap_walk, and why do you said we don't need it.
>>
>> Could you explain more detail if you don't mind?
>>
> I may miss something.
>
> unmap_and_move()
>  1. try_to_unmap(TTU_MIGRATION)
>  2. move_to_newpage
>  3. remove_migration_ptes
>        -> rmap_walk()
>
> Then, to map a page back we unmapped we call rmap_walk().
>
> Assume a SwapCache which is mapped, then, PageAnon(page) == true.
>
>  At 1. try_to_unmap() will rewrite pte with swp_entry of SwapCache.
>       mapcount goes to 0.
>  At 2. SwapCache is copied to a new page.
>  At 3. The new page is mapped back to the place. Now, newpage's mapcount is 0.
>       Before patch, the new page is mapped back to all ptes.
>       After patch, the new page is not mapped back because its mapcount is 0.
>
> I don't think shared SwapCache of anon is not an usual behavior, so, the logic
> before patch is more attractive.
>
> If SwapCache is not mapped before "1", we skip "1" and rmap_walk will do nothing
> because page->mapping is NULL.
>

Thanks. I agree. We don't need the check.
Then my question is why Mel added the check in rmap_walk.
He mentioned that some BUG triggered and that he fixed things after this patch.
What is it?
Is it really related to this logic?
I don't think so, or we are missing something.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache  pages
  2010-04-01  4:44           ` Minchan Kim
@ 2010-04-01  5:42             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-01  5:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Thu, 1 Apr 2010 13:44:29 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Thu, Apr 1, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 1 Apr 2010 11:43:18 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >
> >> On Wed, Mar 31, 2010 at 2:26 PM, KAMEZAWA Hiroyuki       /*
> >> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> >> index af35b75..d5ea1f2 100644
> >> >> --- a/mm/rmap.c
> >> >> +++ b/mm/rmap.c
> >> >> @@ -1394,9 +1394,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
> >> >>
> >> >>       if (unlikely(PageKsm(page)))
> >> >>               return rmap_walk_ksm(page, rmap_one, arg);
> >> >> -     else if (PageAnon(page))
> >> >> +     else if (PageAnon(page)) {
> >> >> +             if (PageSwapCache(page))
> >> >> +                     return SWAP_AGAIN;
> >> >>               return rmap_walk_anon(page, rmap_one, arg);
> >> >
> >> > SwapCache has a condition as (PageSwapCache(page) && page_mapped(page) == true.
> >> >
> >>
> >> In case of tmpfs, page has swapcache but not mapped.
> >>
> >> > Please see do_swap_page(), PageSwapCache bit is cleared only when
> >> >
> >> > do_swap_page()...
> >> >       swap_free(entry);
> >> >        if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> >> >                try_to_free_swap(page);
> >> >
> >> > Then, PageSwapCache is cleared only when swap is freeable even if mapped.
> >> >
> >> > rmap_walk_anon() should be called and the check is not necessary.
> >>
> >> Frankly speaking, I don't understand what is Mel's problem, why he added
> >> Swapcache check in rmap_walk, and why do you said we don't need it.
> >>
> >> Could you explain more detail if you don't mind?
> >>
> > I may miss something.
> >
> > unmap_and_move()
> >  1. try_to_unmap(TTU_MIGRATION)
> >  2. move_to_newpage
> >  3. remove_migration_ptes
> >        -> rmap_walk()
> >
> > Then, to map a page back we unmapped we call rmap_walk().
> >
> > Assume a SwapCache which is mapped, then, PageAnon(page) == true.
> >
> >  At 1. try_to_unmap() will rewrite pte with swp_entry of SwapCache.
> >       mapcount goes to 0.
> >  At 2. SwapCache is copied to a new page.
> >  At 3. The new page is mapped back to the place. Now, newpage's mapcount is 0.
> >       Before patch, the new page is mapped back to all ptes.
> >       After patch, the new page is not mapped back because its mapcount is 0.
> >
> > I don't think shared SwapCache of anon is not an usual behavior, so, the logic
> > before patch is more attractive.
> >
> > If SwapCache is not mapped before "1", we skip "1" and rmap_walk will do nothing
> > because page->mapping is NULL.
> >
> 
> Thanks. I agree. We don't need the check.
> Then, my question is why Mel added the check in rmap_walk.
> He mentioned some BUG trigger and fixed things after this patch.
> What's it?
> Is it really related to this logic?
> I don't think so or we are missing something.
> 
Hmm. Considering again.

Now.
	if (PageAnon(page)) {
		rcu_locked = 1;
		rcu_read_lock();
		if (!page_mapped(page)) {
			if (!PageSwapCache(page))
				goto rcu_unlock;
		} else {
			anon_vma = page_anon_vma(page);
			atomic_inc(&anon_vma->external_refcount);
		}


Maybe this is a fix.

==
	skip_remap = 0;
	if (PageAnon(page)) {
		rcu_read_lock();
		if (!page_mapped(page)) {
			if (!PageSwapCache(page))
				goto rcu_unlock;
			/*
			 * We can't be sure whether this anon_vma is still valid because
			 * !page_mapped(page). Then, we do the migration (radix-tree
			 * replacement) but don't remap it, which would touch the
			 * anon_vma in page->mapping.
			 */
			skip_remap = 1;
			goto skip_unmap;
		} else {
			anon_vma = page_anon_vma(page);
			atomic_inc(&anon_vma->external_refcount);
		}
	}	
	.....copy page, radix-tree replacement,....

	if (!rc && !skip_remap)
		 remove_migration_ptes(page, page);
==

Thanks,
-Kame












^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-04-01  4:44           ` Minchan Kim
@ 2010-04-01  9:30             ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-04-01  9:30 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Thu, Apr 01, 2010 at 01:44:29PM +0900, Minchan Kim wrote:
> On Thu, Apr 1, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 1 Apr 2010 11:43:18 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >
> >> On Wed, Mar 31, 2010 at 2:26 PM, KAMEZAWA Hiroyuki       /*
> >> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> >> index af35b75..d5ea1f2 100644
> >> >> --- a/mm/rmap.c
> >> >> +++ b/mm/rmap.c
> >> >> @@ -1394,9 +1394,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
> >> >>
> >> >>       if (unlikely(PageKsm(page)))
> >> >>               return rmap_walk_ksm(page, rmap_one, arg);
> >> >> -     else if (PageAnon(page))
> >> >> +     else if (PageAnon(page)) {
> >> >> +             if (PageSwapCache(page))
> >> >> +                     return SWAP_AGAIN;
> >> >>               return rmap_walk_anon(page, rmap_one, arg);
> >> >
> >> > SwapCache has a condition as (PageSwapCache(page) && page_mapped(page) == true.
> >> >
> >>
> >> In case of tmpfs, page has swapcache but not mapped.
> >>
> >> > Please see do_swap_page(), PageSwapCache bit is cleared only when
> >> >
> >> > do_swap_page()...
> >> >       swap_free(entry);
> >> >        if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> >> >                try_to_free_swap(page);
> >> >
> >> > Then, PageSwapCache is cleared only when swap is freeable even if mapped.
> >> >
> >> > rmap_walk_anon() should be called and the check is not necessary.
> >>
> >> Frankly speaking, I don't understand what is Mel's problem, why he added
> >> Swapcache check in rmap_walk, and why do you said we don't need it.
> >>
> >> Could you explain more detail if you don't mind?
> >>
> > I may miss something.
> >
> > unmap_and_move()
> >  1. try_to_unmap(TTU_MIGRATION)
> >  2. move_to_newpage
> >  3. remove_migration_ptes
> >        -> rmap_walk()
> >
> > Then, to map a page back we unmapped we call rmap_walk().
> >
> > Assume a SwapCache which is mapped, then, PageAnon(page) == true.
> >
> >  At 1. try_to_unmap() will rewrite pte with swp_entry of SwapCache.
> >       mapcount goes to 0.
> >  At 2. SwapCache is copied to a new page.
> >  At 3. The new page is mapped back to the place. Now, newpage's mapcount is 0.
> >       Before patch, the new page is mapped back to all ptes.
> >       After patch, the new page is not mapped back because its mapcount is 0.
> >
> > I don't think shared SwapCache of anon is not an usual behavior, so, the logic
> > before patch is more attractive.
> >
> > If SwapCache is not mapped before "1", we skip "1" and rmap_walk will do nothing
> > because page->mapping is NULL.
> >
> 
> Thanks. I agree. We don't need the check.
> Then, my question is why Mel added the check in rmap_walk.
> He mentioned some BUG trigger and fixed things after this patch.
> What's it?

If I remove the check for (PageSwapCache(page) && !page_mapped(page))
in rmap_walk(), then the bug below occurs. The first report is lockdep
complaining because it is handed a bad lock, implying that anon_vma->lock
is already invalid. The bug that triggers after it is in the list walk.
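
Roughly, the walk that blows up is the following. This is a simplified
sketch of rmap_walk_anon() in this tree, from memory rather than verbatim,
but it shows both failure points in the traces below:

==
	anon_vma = page_anon_vma(page);	/* derived from page->mapping */
	if (!anon_vma)
		return SWAP_AGAIN;
	/* lockdep's "non-static key" warning if the anon_vma was freed */
	spin_lock(&anon_vma->lock);
	/* the NULL pointer dereference happens in this list walk */
	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct vm_area_struct *vma = avc->vma;
		unsigned long address = vma_address(page, vma);

		if (address == -EFAULT)
			continue;
		ret = rmap_one(page, vma, address, arg);
		if (ret != SWAP_AGAIN)
			break;
	}
	spin_unlock(&anon_vma->lock);
==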

[  373.951347] INFO: trying to register non-static key.
[  373.984314] the code is fine but needs lockdep annotation.
[  374.020512] turning off the locking correctness validator.
[  374.020512] Pid: 4272, comm: bench-stresshig Not tainted 2.6.34-rc2-mm1-compaction-v7r5 #2
[  374.020512] Call Trace:
[  374.020512]  [<ffffffff810758f2>] __lock_acquire+0xf99/0x1776
[  374.020512]  [<ffffffff810761c5>] lock_acquire+0xf6/0x122
[  374.020512]  [<ffffffff810ef121>] ? rmap_walk+0x5c/0x16d
[  374.020512]  [<ffffffff812fcfeb>] _raw_spin_lock+0x3b/0x47
[  374.020512]  [<ffffffff810ef121>] ? rmap_walk+0x5c/0x16d
[  374.020512]  [<ffffffff810ef121>] rmap_walk+0x5c/0x16d
[  374.020512]  [<ffffffff81106396>] ? remove_migration_pte+0x0/0x234
[  374.677618]  [<ffffffff81300dc1>] ? sub_preempt_count+0x9/0x83
[  374.677618]  [<ffffffff81106914>] ? migrate_page_copy+0xa0/0x1ed
[  374.677618]  [<ffffffff81106ea4>] migrate_pages+0x3fc/0x5d3
[  374.880569]  [<ffffffff81106c56>] ? migrate_pages+0x1ae/0x5d3
[  374.994700]  [<ffffffff81073a24>] ? trace_hardirqs_on_caller+0x110/0x134
[  375.018405]  [<ffffffff81107e11>] ? compaction_alloc+0x0/0x283
[  375.097256]  [<ffffffff811079b0>] ? compact_zone+0x14e/0x4bd
[  375.097256]  [<ffffffff812fd851>] ? _raw_spin_unlock_irq+0x30/0x5d
[  375.097256]  [<ffffffff81073a24>] ? trace_hardirqs_on_caller+0x110/0x134
[  375.097256]  [<ffffffff81107b43>] compact_zone+0x2e1/0x4bd
[  375.097256]  [<ffffffff811082f2>] try_to_compact_pages+0x1de/0x248
[  375.516928]  [<ffffffff810d3cd2>] __alloc_pages_nodemask+0x45a/0x81c
[  375.516928]  [<ffffffff812fde14>] ? restore_args+0x0/0x30
[  375.620035]  [<ffffffff8103995e>] ? finish_task_switch+0x0/0xe3
[  375.684491]  [<ffffffff810fe297>] alloc_pages_current+0x9b/0xa4
[  375.803591]  [<ffffffffa00a9a58>] test_alloc_runtest+0x781/0x140a [highalloc]
[  375.803591]  [<ffffffff81076398>] ? lock_release_non_nested+0x97/0x267
[  375.803591]  [<ffffffffa00aa7ce>] vmr_write_proc+0xed/0x102 [highalloc]
[  375.803591]  [<ffffffff81300dc1>] ? sub_preempt_count+0x9/0x83
[  375.803591]  [<ffffffff812fd92e>] ? _raw_spin_unlock+0x35/0x51
[  375.803591]  [<ffffffff810e5a17>] ? do_wp_page+0x6af/0x763
[  375.803591]  [<ffffffff8115bb2a>] ? proc_file_write+0x45/0x92
[  376.322379]  [<ffffffff8115bb5d>] proc_file_write+0x78/0x92
[  376.349787]  [<ffffffff8115bae5>] ? proc_file_write+0x0/0x92
[  376.349787]  [<ffffffff8115bae5>] ? proc_file_write+0x0/0x92
[  376.349787]  [<ffffffff8115647a>] proc_reg_write+0x89/0xa6
[  376.349787]  [<ffffffff8110c1f6>] vfs_write+0xb3/0x15a
[  376.349787]  [<ffffffff8110c36b>] sys_write+0x4c/0x73
[  376.349787]  [<ffffffff81002d32>] system_call_fastpath+0x16/0x1b
[  376.786203] BUG: unable to handle kernel NULL pointer dereference at (null)
[  376.857874] IP: [<ffffffff810ef170>] rmap_walk+0xab/0x16d
[  376.929206] PGD 7f561067 PUD 7eba2067 PMD 0 
[  376.942703] Oops: 0000 [#1] PREEMPT SMP 
[  376.942703] last sysfs file: /sys/block/sr0/capability
[  377.072011] CPU 3 
[  377.116386] Modules linked in: highalloc trace_allocmap buddyinfo vmregress_core oprofile dm_crypt loop i2c_piix4 evdev processor serio_raw tpm_tis tpm tpm_bios i2c_core shpchp pci_hotplug button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod sg sr_mod sd_mod cdrom ata_generic ahci libahci r8169 libata mii ide_pci_generic ide_core ehci_hcd ohci_hcd scsi_mod floppy thermal fan thermal_sys
[  377.520011] 
[  377.520011] Pid: 4272, comm: bench-stresshig Not tainted 2.6.34-rc2-mm1-compaction-v7r5 #2 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[  377.637060] RIP: 0010:[<ffffffff810ef170>]  [<ffffffff810ef170>] rmap_walk+0xab/0x16d
[  377.787277] RSP: 0000:ffff880037a797a8  EFLAGS: 00010202
[  377.787277] RAX: 0000000000000000 RBX: ffffffffffffffe0 RCX: 0000000000000000
[  377.895088] RDX: 0000000000000101 RSI: ffffffff8152ea0f RDI: ffffffff810ef121
[  377.895088] RBP: ffff880037a79828 R08: ffff880037a79458 R09: ffff880037044000
[  377.895088] R10: ffffffff81067358 R11: ffff880037a79228 R12: 0000000000000001
[  377.895088] R13: ffff88007bbf6af0 R14: ffffea00019bd798 R15: ffff88007bbf6b28
[  377.895088] FS:  00007fa3e984d6e0(0000) GS:ffff880002380000(0000) knlGS:0000000000000000
[  378.366669] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  378.366669] CR2: 0000000000000000 CR3: 000000003784d000 CR4: 00000000000006e0
[  378.366669] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  378.366669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  378.366669] Process bench-stresshig (pid: 4272, threadinfo ffff880037a78000, task ffff880037044000)
[  378.800010] Stack:
[  378.800010]  ffffea000027f920 ffffffff81106396 ffff880037a797f8 ffffffff81300dc1
[  378.907796] <0> ffff880037a797f8 ffffffff81106914 ffffea000027f920 ffffea000027f920
[  378.907796] <0> 0000000000000000 ffffea00019bd798 ffff880037a79828 ffffffff816a1cf0
[  378.907796] Call Trace:
[  378.907796]  [<ffffffff81106396>] ? remove_migration_pte+0x0/0x234
[  379.214225]  [<ffffffff81300dc1>] ? sub_preempt_count+0x9/0x83
[  379.296228]  [<ffffffff81106914>] ? migrate_page_copy+0xa0/0x1ed
[  379.296228]  [<ffffffff81106ea4>] migrate_pages+0x3fc/0x5d3
[  379.296228]  [<ffffffff81106c56>] ? migrate_pages+0x1ae/0x5d3
[  379.492124]  [<ffffffff81073a24>] ? trace_hardirqs_on_caller+0x110/0x134
[  379.492124]  [<ffffffff81107e11>] ? compaction_alloc+0x0/0x283
[  379.492124]  [<ffffffff811079b0>] ? compact_zone+0x14e/0x4bd
[  379.714743]  [<ffffffff812fd851>] ? _raw_spin_unlock_irq+0x30/0x5d
[  379.714743]  [<ffffffff81073a24>] ? trace_hardirqs_on_caller+0x110/0x134
[  379.714743]  [<ffffffff81107b43>] compact_zone+0x2e1/0x4bd
[  379.714743]  [<ffffffff811082f2>] try_to_compact_pages+0x1de/0x248
[  380.001915]  [<ffffffff810d3cd2>] __alloc_pages_nodemask+0x45a/0x81c
[  380.093011]  [<ffffffff812fde14>] ? restore_args+0x0/0x30
[  380.160604]  [<ffffffff8103995e>] ? finish_task_switch+0x0/0xe3
[  380.160604]  [<ffffffff810fe297>] alloc_pages_current+0x9b/0xa4
[  380.160604]  [<ffffffffa00a9a58>] test_alloc_runtest+0x781/0x140a [highalloc]
[  380.160604]  [<ffffffff81076398>] ? lock_release_non_nested+0x97/0x267
[  380.160604]  [<ffffffffa00aa7ce>] vmr_write_proc+0xed/0x102 [highalloc]
[  380.527282]  [<ffffffff81300dc1>] ? sub_preempt_count+0x9/0x83
[  380.600599]  [<ffffffff812fd92e>] ? _raw_spin_unlock+0x35/0x51
[  380.640179]  [<ffffffff810e5a17>] ? do_wp_page+0x6af/0x763
[  380.722097]  [<ffffffff8115bb2a>] ? proc_file_write+0x45/0x92
[  380.776200]  [<ffffffff8115bb5d>] proc_file_write+0x78/0x92
[  380.776200]  [<ffffffff8115bae5>] ? proc_file_write+0x0/0x92
[  380.936426]  [<ffffffff8115bae5>] ? proc_file_write+0x0/0x92
[  380.936426]  [<ffffffff8115647a>] proc_reg_write+0x89/0xa6
[  380.936426]  [<ffffffff8110c1f6>] vfs_write+0xb3/0x15a
[  380.936426]  [<ffffffff8110c36b>] sys_write+0x4c/0x73
[  381.197157]  [<ffffffff81002d32>] system_call_fastpath+0x16/0x1b
[  381.197157] Code: 22 48 3b 56 10 73 1c 48 83 fa f2 74 16 48 8b 4d 80 4c 89 f7 ff 55 88 83 f8 01 41 89 c4 0f 85 a8 00 00 00 48 8b 43 20 48 8d 58 e0 <48> 8b 43 20 0f 18 08 48 8d 43 20 49 39 c7 75 ab e9 8b 00 00 00 
[  381.512188] RIP  [<ffffffff810ef170>] rmap_walk+0xab/0x16d
[  381.541457]  RSP <ffff880037a797a8>
[  381.541457] CR2: 0000000000000000
[  381.667153] ---[ end trace b72e829e744f4e05 ]---
[  381.722475] note: bench-stresshig[4272] exited with preempt_count 2
[  381.797590] BUG: scheduling while atomic: bench-stresshig/4272/0x10000003
[  381.878912] INFO: lockdep is turned off.
[  381.925924] Modules linked in: highalloc trace_allocmap buddyinfo vmregress_core oprofile dm_crypt loop i2c_piix4 evdev processor serio_raw tpm_tis tpm tpm_bios i2c_core shpchp pci_hotplug button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod sg sr_mod sd_mod cdrom ata_generic ahci libahci r8169 libata mii ide_pci_generic ide_core ehci_hcd ohci_hcd scsi_mod floppy thermal fan thermal_sys
[  382.368391] Pid: 4272, comm: bench-stresshig Tainted: G      D     2.6.34-rc2-mm1-compaction-v7r5 #2
[  382.477829] Call Trace:
[  382.507155]  [<ffffffff81072e3d>] ? __debug_show_held_locks+0x1b/0x24
[  382.584339]  [<ffffffff81039959>] __schedule_bug+0x77/0x7c
[  382.650075]  [<ffffffff812fa32d>] schedule+0xcc/0x723
[  382.710610]  [<ffffffff8103bd9d>] __cond_resched+0x18/0x24
[  382.776348]  [<ffffffff812faac0>] _cond_resched+0x29/0x34
[  382.841046]  [<ffffffff810e6521>] unmap_vmas+0x76e/0x96b
[  382.904702]  [<ffffffff810eb14f>] exit_mmap+0xd5/0x17a
[  382.966280]  [<ffffffff81043be0>] mmput+0x46/0xf0
[  383.022654]  [<ffffffff81048179>] ? exit_mm+0xd9/0x14c
[  383.084231]  [<ffffffff810481dd>] exit_mm+0x13d/0x14c
[  383.144767]  [<ffffffff812fd879>] ? _raw_spin_unlock_irq+0x58/0x5d
[  383.218825]  [<ffffffff812237f6>] ? tty_audit_exit+0x28/0x91
[  383.286643]  [<ffffffff81049e6b>] do_exit+0x20f/0x70d
[  383.347179]  [<ffffffff810472e4>] ? kmsg_dump+0x153/0x16d
[  383.411878]  [<ffffffff812fed94>] oops_end+0xbe/0xc6
[  383.471373]  [<ffffffff81028005>] no_context+0x1f8/0x207
[  383.535029]  [<ffffffff810281e7>] __bad_area_nosemaphore+0x1d3/0x1f9
[  383.611170]  [<ffffffff810758f2>] ? __lock_acquire+0xf99/0x1776
[  383.682107]  [<ffffffff812fcdd6>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[  383.759289]  [<ffffffff810a38e9>] ? __rcu_process_callbacks+0xa7/0x30b
[  383.837507]  [<ffffffff81028220>] bad_area_nosemaphore+0x13/0x15
[  383.909484]  [<ffffffff81300c4e>] do_page_fault+0x24e/0x3b8
[  383.976259]  [<ffffffff81067358>] ? up+0x14/0x3e
[  384.031597]  [<ffffffff812fe075>] page_fault+0x25/0x30
[  384.093169]  [<ffffffff81067358>] ? up+0x14/0x3e
[  384.148504]  [<ffffffff810ef121>] ? rmap_walk+0x5c/0x16d
[  384.212163]  [<ffffffff810ef170>] ? rmap_walk+0xab/0x16d
[  384.275818]  [<ffffffff810ef121>] ? rmap_walk+0x5c/0x16d
[  384.339476]  [<ffffffff81106396>] ? remove_migration_pte+0x0/0x234
[  384.413536]  [<ffffffff81300dc1>] ? sub_preempt_count+0x9/0x83
[  384.483434]  [<ffffffff81106914>] ? migrate_page_copy+0xa0/0x1ed
[  384.555412]  [<ffffffff81106ea4>] migrate_pages+0x3fc/0x5d3
[  384.622190]  [<ffffffff81106c56>] ? migrate_pages+0x1ae/0x5d3
[  384.691046]  [<ffffffff81073a24>] ? trace_hardirqs_on_caller+0x110/0x134
[  384.771347]  [<ffffffff81107e11>] ? compaction_alloc+0x0/0x283
[  384.841246]  [<ffffffff811079b0>] ? compact_zone+0x14e/0x4bd
[  384.909062]  [<ffffffff812fd851>] ? _raw_spin_unlock_irq+0x30/0x5d
[  384.983120]  [<ffffffff81073a24>] ? trace_hardirqs_on_caller+0x110/0x134
[  385.063421]  [<ffffffff81107b43>] compact_zone+0x2e1/0x4bd
[  385.129158]  [<ffffffff811082f2>] try_to_compact_pages+0x1de/0x248
[  385.203215]  [<ffffffff810d3cd2>] __alloc_pages_nodemask+0x45a/0x81c
[  385.279353]  [<ffffffff812fde14>] ? restore_args+0x0/0x30
[  385.344053]  [<ffffffff8103995e>] ? finish_task_switch+0x0/0xe3
[  385.414988]  [<ffffffff810fe297>] alloc_pages_current+0x9b/0xa4
[  385.485927]  [<ffffffffa00a9a58>] test_alloc_runtest+0x781/0x140a [highalloc]
[  385.571427]  [<ffffffff81076398>] ? lock_release_non_nested+0x97/0x267
[  385.649647]  [<ffffffffa00aa7ce>] vmr_write_proc+0xed/0x102 [highalloc]
[  385.728907]  [<ffffffff81300dc1>] ? sub_preempt_count+0x9/0x83
[  385.798800]  [<ffffffff812fd92e>] ? _raw_spin_unlock+0x35/0x51
[  385.868700]  [<ffffffff810e5a17>] ? do_wp_page+0x6af/0x763
[  385.934436]  [<ffffffff8115bb2a>] ? proc_file_write+0x45/0x92
[  386.003294]  [<ffffffff8115bb5d>] proc_file_write+0x78/0x92
[  386.070072]  [<ffffffff8115bae5>] ? proc_file_write+0x0/0x92
[  386.137888]  [<ffffffff8115bae5>] ? proc_file_write+0x0/0x92
[  386.205708]  [<ffffffff8115647a>] proc_reg_write+0x89/0xa6
[  386.271442]  [<ffffffff8110c1f6>] vfs_write+0xb3/0x15a
[  386.333019]  [<ffffffff8110c36b>] sys_write+0x4c/0x73
[  386.393556]  [<ffffffff81002d32>] system_call_fastpath+0x16/0x1b

> Is it really related to this logic?
> I don't think so or we are missing something.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
@ 2010-04-01  9:30             ` Mel Gorman
  0 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-04-01  9:30 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Thu, Apr 01, 2010 at 01:44:29PM +0900, Minchan Kim wrote:
> On Thu, Apr 1, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 1 Apr 2010 11:43:18 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >
> >> On Wed, Mar 31, 2010 at 2:26 PM, KAMEZAWA Hiroyuki       /*
> >> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> >> index af35b75..d5ea1f2 100644
> >> >> --- a/mm/rmap.c
> >> >> +++ b/mm/rmap.c
> >> >> @@ -1394,9 +1394,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
> >> >>
> >> >>       if (unlikely(PageKsm(page)))
> >> >>               return rmap_walk_ksm(page, rmap_one, arg);
> >> >> -     else if (PageAnon(page))
> >> >> +     else if (PageAnon(page)) {
> >> >> +             if (PageSwapCache(page))
> >> >> +                     return SWAP_AGAIN;
> >> >>               return rmap_walk_anon(page, rmap_one, arg);
> >> >
> >> > SwapCache has a condition as (PageSwapCache(page) && page_mapped(page) == true.
> >> >
> >>
> >> In case of tmpfs, page has swapcache but not mapped.
> >>
> >> > Please see do_swap_page(), PageSwapCache bit is cleared only when
> >> >
> >> > do_swap_page()...
> >> >       swap_free(entry);
> >> >        if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> >> >                try_to_free_swap(page);
> >> >
> >> > Then, PageSwapCache is cleared only when swap is freeable even if mapped.
> >> >
> >> > rmap_walk_anon() should be called and the check is not necessary.
> >>
> >> Frankly speaking, I don't understand what is Mel's problem, why he added
> >> Swapcache check in rmap_walk, and why do you said we don't need it.
> >>
> >> Could you explain more detail if you don't mind?
> >>
> > I may miss something.
> >
> > unmap_and_move()
> >  1. try_to_unmap(TTU_MIGRATION)
> >  2. move_to_newpage
> >  3. remove_migration_ptes
> >        -> rmap_walk()
> >
> > Then, to map a page back we unmapped we call rmap_walk().
> >
> > Assume a SwapCache which is mapped, then, PageAnon(page) == true.
> >
> >  At 1. try_to_unmap() will rewrite pte with swp_entry of SwapCache.
> >       mapcount goes to 0.
> >  At 2. SwapCache is copied to a new page.
> >  At 3. The new page is mapped back to the place. Now, newpage's mapcount is 0.
> >       Before patch, the new page is mapped back to all ptes.
> >       After patch, the new page is not mapped back because its mapcount is 0.
> >
> > I don't think shared SwapCache of anon is not an usual behavior, so, the logic
> > before patch is more attractive.
> >
> > If SwapCache is not mapped before "1", we skip "1" and rmap_walk will do nothing
> > because page->mapping is NULL.
> >
> 
> Thanks. I agree. We don't need the check.
> Then, my question is why Mel added the check in rmap_walk.
> He mentioned some BUG trigger and fixed things after this patch.
> What's it?

If I remove the check for (PageSwapCache(page) && !page_mapped(page))
in rmap_walk(), then the bug below occurs. The first one is lockdep going
bad because it's accessing a bad lock implying that anon_vma->lock is
already invalid. The bug that triggers after it is the list walk.

[  373.951347] INFO: trying to register non-static key.
[  373.984314] the code is fine but needs lockdep annotation.
[  374.020512] turning off the locking correctness validator.
[  374.020512] Pid: 4272, comm: bench-stresshig Not tainted 2.6.34-rc2-mm1-compaction-v7r5 #2
[  374.020512] Call Trace:
[  374.020512]  [<ffffffff810758f2>] __lock_acquire+0xf99/0x1776
[  374.020512]  [<ffffffff810761c5>] lock_acquire+0xf6/0x122
[  374.020512]  [<ffffffff810ef121>] ? rmap_walk+0x5c/0x16d
[  374.020512]  [<ffffffff812fcfeb>] _raw_spin_lock+0x3b/0x47
[  374.020512]  [<ffffffff810ef121>] ? rmap_walk+0x5c/0x16d
[  374.020512]  [<ffffffff810ef121>] rmap_walk+0x5c/0x16d
[  374.020512]  [<ffffffff81106396>] ? remove_migration_pte+0x0/0x234
[  374.677618]  [<ffffffff81300dc1>] ? sub_preempt_count+0x9/0x83
[  374.677618]  [<ffffffff81106914>] ? migrate_page_copy+0xa0/0x1ed
[  374.677618]  [<ffffffff81106ea4>] migrate_pages+0x3fc/0x5d3
[  374.880569]  [<ffffffff81106c56>] ? migrate_pages+0x1ae/0x5d3
[  374.994700]  [<ffffffff81073a24>] ? trace_hardirqs_on_caller+0x110/0x134
[  375.018405]  [<ffffffff81107e11>] ? compaction_alloc+0x0/0x283
[  375.097256]  [<ffffffff811079b0>] ? compact_zone+0x14e/0x4bd
[  375.097256]  [<ffffffff812fd851>] ? _raw_spin_unlock_irq+0x30/0x5d
[  375.097256]  [<ffffffff81073a24>] ? trace_hardirqs_on_caller+0x110/0x134
[  375.097256]  [<ffffffff81107b43>] compact_zone+0x2e1/0x4bd
[  375.097256]  [<ffffffff811082f2>] try_to_compact_pages+0x1de/0x248
[  375.516928]  [<ffffffff810d3cd2>] __alloc_pages_nodemask+0x45a/0x81c
[  375.516928]  [<ffffffff812fde14>] ? restore_args+0x0/0x30
[  375.620035]  [<ffffffff8103995e>] ? finish_task_switch+0x0/0xe3
[  375.684491]  [<ffffffff810fe297>] alloc_pages_current+0x9b/0xa4
[  375.803591]  [<ffffffffa00a9a58>] test_alloc_runtest+0x781/0x140a [highalloc]
[  375.803591]  [<ffffffff81076398>] ? lock_release_non_nested+0x97/0x267
[  375.803591]  [<ffffffffa00aa7ce>] vmr_write_proc+0xed/0x102 [highalloc]
[  375.803591]  [<ffffffff81300dc1>] ? sub_preempt_count+0x9/0x83
[  375.803591]  [<ffffffff812fd92e>] ? _raw_spin_unlock+0x35/0x51
[  375.803591]  [<ffffffff810e5a17>] ? do_wp_page+0x6af/0x763
[  375.803591]  [<ffffffff8115bb2a>] ? proc_file_write+0x45/0x92
[  376.322379]  [<ffffffff8115bb5d>] proc_file_write+0x78/0x92
[  376.349787]  [<ffffffff8115bae5>] ? proc_file_write+0x0/0x92
[  376.349787]  [<ffffffff8115bae5>] ? proc_file_write+0x0/0x92
[  376.349787]  [<ffffffff8115647a>] proc_reg_write+0x89/0xa6
[  376.349787]  [<ffffffff8110c1f6>] vfs_write+0xb3/0x15a
[  376.349787]  [<ffffffff8110c36b>] sys_write+0x4c/0x73
[  376.349787]  [<ffffffff81002d32>] system_call_fastpath+0x16/0x1b
[  376.786203] BUG: unable to handle kernel NULL pointer dereference at (null)
[  376.857874] IP: [<ffffffff810ef170>] rmap_walk+0xab/0x16d
[  376.929206] PGD 7f561067 PUD 7eba2067 PMD 0 
[  376.942703] Oops: 0000 [#1] PREEMPT SMP 
[  376.942703] last sysfs file: /sys/block/sr0/capability
[  377.072011] CPU 3 
[  377.116386] Modules linked in: highalloc trace_allocmap buddyinfo vmregress_core oprofile dm_crypt loop i2c_piix4 evdev processor serio_raw tpm_tis tpm tpm_bios i2c_core shpchp pci_hotplug button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod sg sr_mod sd_mod cdrom ata_generic ahci libahci r8169 libata mii ide_pci_generic ide_core ehci_hcd ohci_hcd scsi_mod floppy thermal fan thermal_sys
[  377.520011] 
[  377.520011] Pid: 4272, comm: bench-stresshig Not tainted 2.6.34-rc2-mm1-compaction-v7r5 #2 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[  377.637060] RIP: 0010:[<ffffffff810ef170>]  [<ffffffff810ef170>] rmap_walk+0xab/0x16d
[  377.787277] RSP: 0000:ffff880037a797a8  EFLAGS: 00010202
[  377.787277] RAX: 0000000000000000 RBX: ffffffffffffffe0 RCX: 0000000000000000
[  377.895088] RDX: 0000000000000101 RSI: ffffffff8152ea0f RDI: ffffffff810ef121
[  377.895088] RBP: ffff880037a79828 R08: ffff880037a79458 R09: ffff880037044000
[  377.895088] R10: ffffffff81067358 R11: ffff880037a79228 R12: 0000000000000001
[  377.895088] R13: ffff88007bbf6af0 R14: ffffea00019bd798 R15: ffff88007bbf6b28
[  377.895088] FS:  00007fa3e984d6e0(0000) GS:ffff880002380000(0000) knlGS:0000000000000000
[  378.366669] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  378.366669] CR2: 0000000000000000 CR3: 000000003784d000 CR4: 00000000000006e0
[  378.366669] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  378.366669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  378.366669] Process bench-stresshig (pid: 4272, threadinfo ffff880037a78000, task ffff880037044000)
[  378.800010] Stack:
[  378.800010]  ffffea000027f920 ffffffff81106396 ffff880037a797f8 ffffffff81300dc1
[  378.907796] <0> ffff880037a797f8 ffffffff81106914 ffffea000027f920 ffffea000027f920
[  378.907796] <0> 0000000000000000 ffffea00019bd798 ffff880037a79828 ffffffff816a1cf0
[  378.907796] Call Trace:
[  378.907796]  [<ffffffff81106396>] ? remove_migration_pte+0x0/0x234
[  379.214225]  [<ffffffff81300dc1>] ? sub_preempt_count+0x9/0x83
[  379.296228]  [<ffffffff81106914>] ? migrate_page_copy+0xa0/0x1ed
[  379.296228]  [<ffffffff81106ea4>] migrate_pages+0x3fc/0x5d3
[  379.296228]  [<ffffffff81106c56>] ? migrate_pages+0x1ae/0x5d3
[  379.492124]  [<ffffffff81073a24>] ? trace_hardirqs_on_caller+0x110/0x134
[  379.492124]  [<ffffffff81107e11>] ? compaction_alloc+0x0/0x283
[  379.492124]  [<ffffffff811079b0>] ? compact_zone+0x14e/0x4bd
[  379.714743]  [<ffffffff812fd851>] ? _raw_spin_unlock_irq+0x30/0x5d
[  379.714743]  [<ffffffff81073a24>] ? trace_hardirqs_on_caller+0x110/0x134
[  379.714743]  [<ffffffff81107b43>] compact_zone+0x2e1/0x4bd
[  379.714743]  [<ffffffff811082f2>] try_to_compact_pages+0x1de/0x248
[  380.001915]  [<ffffffff810d3cd2>] __alloc_pages_nodemask+0x45a/0x81c
[  380.093011]  [<ffffffff812fde14>] ? restore_args+0x0/0x30
[  380.160604]  [<ffffffff8103995e>] ? finish_task_switch+0x0/0xe3
[  380.160604]  [<ffffffff810fe297>] alloc_pages_current+0x9b/0xa4
[  380.160604]  [<ffffffffa00a9a58>] test_alloc_runtest+0x781/0x140a [highalloc]
[  380.160604]  [<ffffffff81076398>] ? lock_release_non_nested+0x97/0x267
[  380.160604]  [<ffffffffa00aa7ce>] vmr_write_proc+0xed/0x102 [highalloc]
[  380.527282]  [<ffffffff81300dc1>] ? sub_preempt_count+0x9/0x83
[  380.600599]  [<ffffffff812fd92e>] ? _raw_spin_unlock+0x35/0x51
[  380.640179]  [<ffffffff810e5a17>] ? do_wp_page+0x6af/0x763
[  380.722097]  [<ffffffff8115bb2a>] ? proc_file_write+0x45/0x92
[  380.776200]  [<ffffffff8115bb5d>] proc_file_write+0x78/0x92
[  380.776200]  [<ffffffff8115bae5>] ? proc_file_write+0x0/0x92
[  380.936426]  [<ffffffff8115bae5>] ? proc_file_write+0x0/0x92
[  380.936426]  [<ffffffff8115647a>] proc_reg_write+0x89/0xa6
[  380.936426]  [<ffffffff8110c1f6>] vfs_write+0xb3/0x15a
[  380.936426]  [<ffffffff8110c36b>] sys_write+0x4c/0x73
[  381.197157]  [<ffffffff81002d32>] system_call_fastpath+0x16/0x1b
[  381.197157] Code: 22 48 3b 56 10 73 1c 48 83 fa f2 74 16 48 8b 4d 80 4c 89 f7 ff 55 88 83 f8 01 41 89 c4 0f 85 a8 00 00 00 48 8b 43 20 48 8d 58 e0 <48> 8b 43 20 0f 18 08 48 8d 43 20 49 39 c7 75 ab e9 8b 00 00 00 
[  381.512188] RIP  [<ffffffff810ef170>] rmap_walk+0xab/0x16d
[  381.541457]  RSP <ffff880037a797a8>
[  381.541457] CR2: 0000000000000000
[  381.667153] ---[ end trace b72e829e744f4e05 ]---
[  381.722475] note: bench-stresshig[4272] exited with preempt_count 2
[  381.797590] BUG: scheduling while atomic: bench-stresshig/4272/0x10000003
[  381.878912] INFO: lockdep is turned off.
[  381.925924] Modules linked in: highalloc trace_allocmap buddyinfo vmregress_core oprofile dm_crypt loop i2c_piix4 evdev processor serio_raw tpm_tis tpm tpm_bios i2c_core shpchp pci_hotplug button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod sg sr_mod sd_mod cdrom ata_generic ahci libahci r8169 libata mii ide_pci_generic ide_core ehci_hcd ohci_hcd scsi_mod floppy thermal fan thermal_sys
[  382.368391] Pid: 4272, comm: bench-stresshig Tainted: G      D     2.6.34-rc2-mm1-compaction-v7r5 #2
[  382.477829] Call Trace:
[  382.507155]  [<ffffffff81072e3d>] ? __debug_show_held_locks+0x1b/0x24
[  382.584339]  [<ffffffff81039959>] __schedule_bug+0x77/0x7c
[  382.650075]  [<ffffffff812fa32d>] schedule+0xcc/0x723
[  382.710610]  [<ffffffff8103bd9d>] __cond_resched+0x18/0x24
[  382.776348]  [<ffffffff812faac0>] _cond_resched+0x29/0x34
[  382.841046]  [<ffffffff810e6521>] unmap_vmas+0x76e/0x96b
[  382.904702]  [<ffffffff810eb14f>] exit_mmap+0xd5/0x17a
[  382.966280]  [<ffffffff81043be0>] mmput+0x46/0xf0
[  383.022654]  [<ffffffff81048179>] ? exit_mm+0xd9/0x14c
[  383.084231]  [<ffffffff810481dd>] exit_mm+0x13d/0x14c
[  383.144767]  [<ffffffff812fd879>] ? _raw_spin_unlock_irq+0x58/0x5d
[  383.218825]  [<ffffffff812237f6>] ? tty_audit_exit+0x28/0x91
[  383.286643]  [<ffffffff81049e6b>] do_exit+0x20f/0x70d
[  383.347179]  [<ffffffff810472e4>] ? kmsg_dump+0x153/0x16d
[  383.411878]  [<ffffffff812fed94>] oops_end+0xbe/0xc6
[  383.471373]  [<ffffffff81028005>] no_context+0x1f8/0x207
[  383.535029]  [<ffffffff810281e7>] __bad_area_nosemaphore+0x1d3/0x1f9
[  383.611170]  [<ffffffff810758f2>] ? __lock_acquire+0xf99/0x1776
[  383.682107]  [<ffffffff812fcdd6>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[  383.759289]  [<ffffffff810a38e9>] ? __rcu_process_callbacks+0xa7/0x30b
[  383.837507]  [<ffffffff81028220>] bad_area_nosemaphore+0x13/0x15
[  383.909484]  [<ffffffff81300c4e>] do_page_fault+0x24e/0x3b8
[  383.976259]  [<ffffffff81067358>] ? up+0x14/0x3e
[  384.031597]  [<ffffffff812fe075>] page_fault+0x25/0x30
[  384.093169]  [<ffffffff81067358>] ? up+0x14/0x3e
[  384.148504]  [<ffffffff810ef121>] ? rmap_walk+0x5c/0x16d
[  384.212163]  [<ffffffff810ef170>] ? rmap_walk+0xab/0x16d
[  384.275818]  [<ffffffff810ef121>] ? rmap_walk+0x5c/0x16d
[  384.339476]  [<ffffffff81106396>] ? remove_migration_pte+0x0/0x234
[  384.413536]  [<ffffffff81300dc1>] ? sub_preempt_count+0x9/0x83
[  384.483434]  [<ffffffff81106914>] ? migrate_page_copy+0xa0/0x1ed
[  384.555412]  [<ffffffff81106ea4>] migrate_pages+0x3fc/0x5d3
[  384.622190]  [<ffffffff81106c56>] ? migrate_pages+0x1ae/0x5d3
[  384.691046]  [<ffffffff81073a24>] ? trace_hardirqs_on_caller+0x110/0x134
[  384.771347]  [<ffffffff81107e11>] ? compaction_alloc+0x0/0x283
[  384.841246]  [<ffffffff811079b0>] ? compact_zone+0x14e/0x4bd
[  384.909062]  [<ffffffff812fd851>] ? _raw_spin_unlock_irq+0x30/0x5d
[  384.983120]  [<ffffffff81073a24>] ? trace_hardirqs_on_caller+0x110/0x134
[  385.063421]  [<ffffffff81107b43>] compact_zone+0x2e1/0x4bd
[  385.129158]  [<ffffffff811082f2>] try_to_compact_pages+0x1de/0x248
[  385.203215]  [<ffffffff810d3cd2>] __alloc_pages_nodemask+0x45a/0x81c
[  385.279353]  [<ffffffff812fde14>] ? restore_args+0x0/0x30
[  385.344053]  [<ffffffff8103995e>] ? finish_task_switch+0x0/0xe3
[  385.414988]  [<ffffffff810fe297>] alloc_pages_current+0x9b/0xa4
[  385.485927]  [<ffffffffa00a9a58>] test_alloc_runtest+0x781/0x140a [highalloc]
[  385.571427]  [<ffffffff81076398>] ? lock_release_non_nested+0x97/0x267
[  385.649647]  [<ffffffffa00aa7ce>] vmr_write_proc+0xed/0x102 [highalloc]
[  385.728907]  [<ffffffff81300dc1>] ? sub_preempt_count+0x9/0x83
[  385.798800]  [<ffffffff812fd92e>] ? _raw_spin_unlock+0x35/0x51
[  385.868700]  [<ffffffff810e5a17>] ? do_wp_page+0x6af/0x763
[  385.934436]  [<ffffffff8115bb2a>] ? proc_file_write+0x45/0x92
[  386.003294]  [<ffffffff8115bb5d>] proc_file_write+0x78/0x92
[  386.070072]  [<ffffffff8115bae5>] ? proc_file_write+0x0/0x92
[  386.137888]  [<ffffffff8115bae5>] ? proc_file_write+0x0/0x92
[  386.205708]  [<ffffffff8115647a>] proc_reg_write+0x89/0xa6
[  386.271442]  [<ffffffff8110c1f6>] vfs_write+0xb3/0x15a
[  386.333019]  [<ffffffff8110c36b>] sys_write+0x4c/0x73
[  386.393556]  [<ffffffff81002d32>] system_call_fastpath+0x16/0x1b
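
For reference, the walk that blows up is the anon_vma iteration done for
anonymous pages (rmap_walk_anon(), inlined into rmap_walk() here). A rough
sketch of that loop as it looks around 2.6.34-mm; helper and field names are
from memory, so treat them as approximate rather than as the exact tree:

static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
		struct vm_area_struct *, unsigned long, void *), void *arg)
{
	struct anon_vma *anon_vma;
	struct anon_vma_chain *avc;
	int ret = SWAP_AGAIN;

	/* page->mapping may point at an anon_vma that has been freed */
	anon_vma = page_anon_vma(page);
	if (!anon_vma)
		return ret;
	spin_lock(&anon_vma->lock);	/* the bogus lock lockdep warns about */
	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		/* walking a stale list is the NULL dereference above */
		struct vm_area_struct *vma = avc->vma;
		unsigned long address = vma_address(page, vma);

		if (address == -EFAULT)
			continue;
		ret = rmap_one(page, vma, address, arg);
		if (ret != SWAP_AGAIN)
			break;
	}
	spin_unlock(&anon_vma->lock);
	return ret;
}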

> Is it really related to this logic?
> I don't think so or we are missing something.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache  pages
  2010-04-01  9:30             ` Mel Gorman
@ 2010-04-01 10:42               ` Minchan Kim
  -1 siblings, 0 replies; 72+ messages in thread
From: Minchan Kim @ 2010-04-01 10:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Thu, Apr 1, 2010 at 6:30 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Thu, Apr 01, 2010 at 01:44:29PM +0900, Minchan Kim wrote:
>> On Thu, Apr 1, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Thu, 1 Apr 2010 11:43:18 +0900
>> > Minchan Kim <minchan.kim@gmail.com> wrote:
>> >
>> >> On Wed, Mar 31, 2010 at 2:26 PM, KAMEZAWA Hiroyuki       /*
>> >> >> diff --git a/mm/rmap.c b/mm/rmap.c
>> >> >> index af35b75..d5ea1f2 100644
>> >> >> --- a/mm/rmap.c
>> >> >> +++ b/mm/rmap.c
>> >> >> @@ -1394,9 +1394,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
>> >> >>
>> >> >>       if (unlikely(PageKsm(page)))
>> >> >>               return rmap_walk_ksm(page, rmap_one, arg);
>> >> >> -     else if (PageAnon(page))
>> >> >> +     else if (PageAnon(page)) {
>> >> >> +             if (PageSwapCache(page))
>> >> >> +                     return SWAP_AGAIN;
>> >> >>               return rmap_walk_anon(page, rmap_one, arg);
>> >> >
>> >> > SwapCache has a condition as (PageSwapCache(page) && page_mapped(page) == true.
>> >> >
>> >>
>> >> In case of tmpfs, page has swapcache but not mapped.
>> >>
>> >> > Please see do_swap_page(), PageSwapCache bit is cleared only when
>> >> >
>> >> > do_swap_page()...
>> >> >       swap_free(entry);
>> >> >        if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
>> >> >                try_to_free_swap(page);
>> >> >
>> >> > Then, PageSwapCache is cleared only when swap is freeable even if mapped.
>> >> >
>> >> > rmap_walk_anon() should be called and the check is not necessary.
>> >>
>> >> Frankly speaking, I don't understand what is Mel's problem, why he added
>> >> Swapcache check in rmap_walk, and why do you said we don't need it.
>> >>
>> >> Could you explain more detail if you don't mind?
>> >>
>> > I may miss something.
>> >
>> > unmap_and_move()
>> >  1. try_to_unmap(TTU_MIGRATION)
>> >  2. move_to_newpage
>> >  3. remove_migration_ptes
>> >        -> rmap_walk()
>> >
>> > Then, to map a page back we unmapped we call rmap_walk().
>> >
>> > Assume a SwapCache which is mapped, then, PageAnon(page) == true.
>> >
>> >  At 1. try_to_unmap() will rewrite pte with swp_entry of SwapCache.
>> >       mapcount goes to 0.
>> >  At 2. SwapCache is copied to a new page.
>> >  At 3. The new page is mapped back to the place. Now, newpage's mapcount is 0.
>> >       Before patch, the new page is mapped back to all ptes.
>> >       After patch, the new page is not mapped back because its mapcount is 0.
>> >
>> > I don't think shared SwapCache of anon is not an usual behavior, so, the logic
>> > before patch is more attractive.
>> >
>> > If SwapCache is not mapped before "1", we skip "1" and rmap_walk will do nothing
>> > because page->mapping is NULL.
>> >
>>
>> Thanks. I agree. We don't need the check.
>> Then, my question is why Mel added the check in rmap_walk.
>> He mentioned some BUG trigger and fixed things after this patch.
>> What's it?
>
> If I remove the check for (PageSwapCache(page) && !page_mapped(page))
> in rmap_walk(), then the bug below occurs. The first one is lockdep going
> bad because it's accessing a bad lock implying that anon_vma->lock is
> already invalid. The bug that triggers after it is the list walk.

Thanks. I think it's possible. It's a subtle problem.
Assume !page_mapped(page) && PageAnon(page) && PageSwapCache(page):

0. PageAnon check
1. race window <---- anon_vma free!!!!
2. rcu_read_lock()
3. skip_unmap
4. move_to_new_page
5. newpage->mapping = page->mapping <--- !!!! It's invalid
6.     mapping->a_ops->migratepage
7.         radix tree change, copy page (still new page anon is NULL)
8.     remove_migration_ptes
9.     rmap_walk
10.         PageAnon is true --> we are deceived.
11.         rmap_walk_anon -> go bomb!

Does it make sense?
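
A sketch of the relevant part of unmap_and_move() as it stands (illustrative
and condensed from mm/migrate.c of this tree, not a patch), annotated with the
step numbers above:

	if (PageAnon(page)) {		/* 0. PageAnon check                  */
					/* 1. window: the last mapper can     */
					/*    exit here and free the anon_vma */
		rcu_read_lock();	/* 2. taken only after that window    */
		rcu_locked = 1;
		if (!page_mapped(page)) {
			if (!PageSwapCache(page))
				goto rcu_unlock;
			/* 3. unmapped swapcache continues with nothing
			 *    pinning the anon_vma                          */
		} else {
			anon_vma = page_anon_vma(page);
			atomic_inc(&anon_vma->external_refcount);
		}
	}
	/* 4-7. move_to_new_page(): newpage->mapping = page->mapping (stale),
	 *      radix-tree replacement, page copy                            */
	/* 8-11. remove_migration_ptes() -> rmap_walk() sees PageAnon and
	 *       walks the freed anon_vma -> the oops earlier in the thread  */
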
-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 72+ messages in thread


* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache  pages
  2010-04-01  5:42             ` KAMEZAWA Hiroyuki
@ 2010-04-01 10:51               ` Minchan Kim
  -1 siblings, 0 replies; 72+ messages in thread
From: Minchan Kim @ 2010-04-01 10:51 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Thu, Apr 1, 2010 at 2:42 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 1 Apr 2010 13:44:29 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> On Thu, Apr 1, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Thu, 1 Apr 2010 11:43:18 +0900
>> > Minchan Kim <minchan.kim@gmail.com> wrote:
>> >
>> >> On Wed, Mar 31, 2010 at 2:26 PM, KAMEZAWA Hiroyuki       /*
>> >> >> diff --git a/mm/rmap.c b/mm/rmap.c
>> >> >> index af35b75..d5ea1f2 100644
>> >> >> --- a/mm/rmap.c
>> >> >> +++ b/mm/rmap.c
>> >> >> @@ -1394,9 +1394,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
>> >> >>
>> >> >>       if (unlikely(PageKsm(page)))
>> >> >>               return rmap_walk_ksm(page, rmap_one, arg);
>> >> >> -     else if (PageAnon(page))
>> >> >> +     else if (PageAnon(page)) {
>> >> >> +             if (PageSwapCache(page))
>> >> >> +                     return SWAP_AGAIN;
>> >> >>               return rmap_walk_anon(page, rmap_one, arg);
>> >> >
>> >> > SwapCache has a condition as (PageSwapCache(page) && page_mapped(page) == true.
>> >> >
>> >>
>> >> In case of tmpfs, page has swapcache but not mapped.
>> >>
>> >> > Please see do_swap_page(), PageSwapCache bit is cleared only when
>> >> >
>> >> > do_swap_page()...
>> >> >       swap_free(entry);
>> >> >        if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
>> >> >                try_to_free_swap(page);
>> >> >
>> >> > Then, PageSwapCache is cleared only when swap is freeable even if mapped.
>> >> >
>> >> > rmap_walk_anon() should be called and the check is not necessary.
>> >>
>> >> Frankly speaking, I don't understand what is Mel's problem, why he added
>> >> Swapcache check in rmap_walk, and why do you said we don't need it.
>> >>
>> >> Could you explain more detail if you don't mind?
>> >>
>> > I may miss something.
>> >
>> > unmap_and_move()
>> >  1. try_to_unmap(TTU_MIGRATION)
>> >  2. move_to_newpage
>> >  3. remove_migration_ptes
>> >        -> rmap_walk()
>> >
>> > Then, to map a page back we unmapped we call rmap_walk().
>> >
>> > Assume a SwapCache which is mapped, then, PageAnon(page) == true.
>> >
>> >  At 1. try_to_unmap() will rewrite pte with swp_entry of SwapCache.
>> >       mapcount goes to 0.
>> >  At 2. SwapCache is copied to a new page.
>> >  At 3. The new page is mapped back to the place. Now, newpage's mapcount is 0.
>> >       Before patch, the new page is mapped back to all ptes.
>> >       After patch, the new page is not mapped back because its mapcount is 0.
>> >
>> > I don't think shared SwapCache of anon is not an usual behavior, so, the logic
>> > before patch is more attractive.
>> >
>> > If SwapCache is not mapped before "1", we skip "1" and rmap_walk will do nothing
>> > because page->mapping is NULL.
>> >
>>
>> Thanks. I agree. We don't need the check.
>> Then, my question is why Mel added the check in rmap_walk.
>> He mentioned some BUG trigger and fixed things after this patch.
>> What's it?
>> Is it really related to this logic?
>> I don't think so or we are missing something.
>>
> Hmm. Consiering again.
>
> Now.
>        if (PageAnon(page)) {
>                rcu_locked = 1;
>                rcu_read_lock();
>                if (!page_mapped(page)) {
>                        if (!PageSwapCache(page))
>                                goto rcu_unlock;
>                } else {
>                        anon_vma = page_anon_vma(page);
>                        atomic_inc(&anon_vma->external_refcount);
>                }
>
>
> Maybe this is a fix.
>
> ==
>        skip_remap = 0;
>        if (PageAnon(page)) {
>                rcu_read_lock();
>                if (!page_mapped(page)) {
>                        if (!PageSwapCache(page))
>                                goto rcu_unlock;
>                        /*
>                         * We can't convice this anon_vma is valid or not because
>                         * !page_mapped(page). Then, we do migration(radix-tree replacement)
>                         * but don't remap it which touches anon_vma in page->mapping.
>                         */
>                        skip_remap = 1;
>                        goto skip_unmap;
>                } else {
>                        anon_vma = page_anon_vma(page);
>                        atomic_inc(&anon_vma->external_refcount);
>                }
>        }
>        .....copy page, radix-tree replacement,....
>

It's not enough.
We use remove_migration_ptes() in move_to_new_page(), too,
and we have to prevent that as well.
We can check PageSwapCache(page) in move_to_new_page() and then
skip remove_migration_ptes().

ex)
static int move_to_new_page(....)
{
     int swapcache = PageSwapCache(page);
     ...
     if (!swapcache) {
         if (!rc)
             remove_migration_ptes(page, newpage);
         else
             newpage->mapping = NULL;
     }
}

And we have to close the race between the PageAnon(page) check and rcu_read_lock().
If we don't, the anon_vma could be freed in the middle of the operation.
I mean (see the sketch after the excerpt below):

         * of migration. File cache pages are no problem because of page_lock()
         * File Caches may use write_page() or lock_page() in migration, then,
         * just care Anon page here.
         */
        if (PageAnon(page)) {
                !!! RACE !!!!
                rcu_read_lock();
                rcu_locked = 1;

+
+               /*
+                * If the page has no mappings any more, just bail. An
+                * unmapped anon page is likely to be freed soon but worse,
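
A minimal sketch of what closing that race could look like (illustrative only,
not a tested patch; whether the window matters while the page is locked is
exactly what is debated below): take the RCU read lock before the PageAnon()
test, so the anon_vma memory at least cannot be returned to the allocator in
between:

	rcu_read_lock();
	rcu_locked = 1;
	if (PageAnon(page)) {
		/*
		 * The anon_vma memory is stable for the RCU section,
		 * although it may already be empty or reused if the
		 * page became unmapped.
		 */
		if (!page_mapped(page)) {
			if (!PageSwapCache(page))
				goto rcu_unlock;
		} else {
			anon_vma = page_anon_vma(page);
			atomic_inc(&anon_vma->external_refcount);
		}
	}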


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 72+ messages in thread


* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-04-01 10:51               ` Minchan Kim
@ 2010-04-01 17:36                 ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-04-01 17:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Thu, Apr 01, 2010 at 07:51:31PM +0900, Minchan Kim wrote:
> On Thu, Apr 1, 2010 at 2:42 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 1 Apr 2010 13:44:29 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >
> >> On Thu, Apr 1, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> > On Thu, 1 Apr 2010 11:43:18 +0900
> >> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >> >
> >> >> On Wed, Mar 31, 2010 at 2:26 PM, KAMEZAWA Hiroyuki       /*
> >> >> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> >> >> index af35b75..d5ea1f2 100644
> >> >> >> --- a/mm/rmap.c
> >> >> >> +++ b/mm/rmap.c
> >> >> >> @@ -1394,9 +1394,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
> >> >> >>
> >> >> >>       if (unlikely(PageKsm(page)))
> >> >> >>               return rmap_walk_ksm(page, rmap_one, arg);
> >> >> >> -     else if (PageAnon(page))
> >> >> >> +     else if (PageAnon(page)) {
> >> >> >> +             if (PageSwapCache(page))
> >> >> >> +                     return SWAP_AGAIN;
> >> >> >>               return rmap_walk_anon(page, rmap_one, arg);
> >> >> >
> >> >> > SwapCache has a condition as (PageSwapCache(page) && page_mapped(page) == true.
> >> >> >
> >> >>
> >> >> In case of tmpfs, page has swapcache but not mapped.
> >> >>
> >> >> > Please see do_swap_page(), PageSwapCache bit is cleared only when
> >> >> >
> >> >> > do_swap_page()...
> >> >> >       swap_free(entry);
> >> >> >        if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> >> >> >                try_to_free_swap(page);
> >> >> >
> >> >> > Then, PageSwapCache is cleared only when swap is freeable even if mapped.
> >> >> >
> >> >> > rmap_walk_anon() should be called and the check is not necessary.
> >> >>
> >> >> Frankly speaking, I don't understand what is Mel's problem, why he added
> >> >> Swapcache check in rmap_walk, and why do you said we don't need it.
> >> >>
> >> >> Could you explain more detail if you don't mind?
> >> >>
> >> > I may miss something.
> >> >
> >> > unmap_and_move()
> >> >  1. try_to_unmap(TTU_MIGRATION)
> >> >  2. move_to_newpage
> >> >  3. remove_migration_ptes
> >> >        -> rmap_walk()
> >> >
> >> > Then, to map a page back we unmapped we call rmap_walk().
> >> >
> >> > Assume a SwapCache which is mapped, then, PageAnon(page) == true.
> >> >
> >> >  At 1. try_to_unmap() will rewrite pte with swp_entry of SwapCache.
> >> >       mapcount goes to 0.
> >> >  At 2. SwapCache is copied to a new page.
> >> >  At 3. The new page is mapped back to the place. Now, newpage's mapcount is 0.
> >> >       Before patch, the new page is mapped back to all ptes.
> >> >       After patch, the new page is not mapped back because its mapcount is 0.
> >> >
> >> > I don't think shared SwapCache of anon is not an usual behavior, so, the logic
> >> > before patch is more attractive.
> >> >
> >> > If SwapCache is not mapped before "1", we skip "1" and rmap_walk will do nothing
> >> > because page->mapping is NULL.
> >> >
> >>
> >> Thanks. I agree. We don't need the check.
> >> Then, my question is why Mel added the check in rmap_walk.
> >> He mentioned some BUG trigger and fixed things after this patch.
> >> What's it?
> >> Is it really related to this logic?
> >> I don't think so or we are missing something.
> >>
> > Hmm. Consiering again.
> >
> > Now.
> >        if (PageAnon(page)) {
> >                rcu_locked = 1;
> >                rcu_read_lock();
> >                if (!page_mapped(page)) {
> >                        if (!PageSwapCache(page))
> >                                goto rcu_unlock;
> >                } else {
> >                        anon_vma = page_anon_vma(page);
> >                        atomic_inc(&anon_vma->external_refcount);
> >                }
> >
> >
> > Maybe this is a fix.
> >
> > ==
> >        skip_remap = 0;
> >        if (PageAnon(page)) {
> >                rcu_read_lock();
> >                if (!page_mapped(page)) {
> >                        if (!PageSwapCache(page))
> >                                goto rcu_unlock;
> >                        /*
> >                         * We can't convice this anon_vma is valid or not because
> >                         * !page_mapped(page). Then, we do migration(radix-tree replacement)
> >                         * but don't remap it which touches anon_vma in page->mapping.
> >                         */
> >                        skip_remap = 1;
> >                        goto skip_unmap;
> >                } else {
> >                        anon_vma = page_anon_vma(page);
> >                        atomic_inc(&anon_vma->external_refcount);
> >                }
> >        }
> >        .....copy page, radix-tree replacement,....
> >
> 
> It's not enough.
> we uses remove_migration_ptes in  move_to_new_page, too.
> We have to prevent it.
> We can check PageSwapCache(page) in move_to_new_page and then
> skip remove_migration_ptes.
> 
> ex)
> static int move_to_new_page(....)
> {
>      int swapcache = PageSwapCache(page);
>      ...
>      if (!swapcache)
>          if(!rc)
>              remove_migration_ptes
>          else
>              newpage->mapping = NULL;
> }
> 

This I agree with.

> And we have to close race between PageAnon(page) and rcu_read_lock.

Not so sure on this. The page is locked at this point and that should
prevent it from becoming !PageAnon.

> If we don't do it, anon_vma could be free in the middle of operation.
> I means
> 
>          * of migration. File cache pages are no problem because of page_lock()
>          * File Caches may use write_page() or lock_page() in migration, then,
>          * just care Anon page here.
>          */
>         if (PageAnon(page)) {
>                 !!! RACE !!!!
>                 rcu_read_lock();
>                 rcu_locked = 1;
> 
> +
> +               /*
> +                * If the page has no mappings any more, just bail. An
> +                * unmapped anon page is likely to be freed soon but worse,
> 

I am not sure this race exists because the page is locked, but a key
observation has been made: a page that is unmapped can be migrated if it is
PageSwapCache, but it may not have a valid anon_vma. Hence, in the
!page_mapped case, the key is to not use the anon_vma at all. How about the
following patch?

==== CUT HERE ====

mm,migration: Allow the migration of PageSwapCache pages

PageAnon pages that are unmapped may or may not have a valid anon_vma, so they
are not currently migrated. However, a swap cache page fits this description
and can be migrated safely. This patch identifies anonymous swap cache pages
and allows them to be migrated, but ensures that no attempt is made to remap
such pages, as that could touch an already freed anon_vma.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
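
In effect, the patch below makes unmap_and_move() treat anonymous pages
according to the decision sketched by this helper. The helper and the enum are
made up purely to summarise the logic and are not part of the patch:

/* Hypothetical summary of the anon-page handling, for illustration only */
enum anon_migrate_mode {
	ANON_DONT_MIGRATE,	/* unmapped, not swapcache: just bail         */
	ANON_MIGRATE_ONLY,	/* unmapped swapcache: migrate, never remap   */
	ANON_MIGRATE_REMAP,	/* mapped: pin anon_vma, migrate, remap ptes  */
};

static enum anon_migrate_mode anon_page_migrate_mode(struct page *page)
{
	if (page_mapped(page))
		return ANON_MIGRATE_REMAP;	/* external_refcount pins anon_vma */
	if (PageSwapCache(page))
		return ANON_MIGRATE_ONLY;	/* safe_to_remap = 0 below         */
	return ANON_DONT_MIGRATE;		/* the rcu_unlock bail-out below   */
}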

diff --git a/mm/migrate.c b/mm/migrate.c
index 35aad2a..5d0218b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -484,7 +484,8 @@ static int fallback_migrate_page(struct address_space *mapping,
  *   < 0 - error code
  *  == 0 - success
  */
-static int move_to_new_page(struct page *newpage, struct page *page)
+static int move_to_new_page(struct page *newpage, struct page *page,
+						int safe_to_remap)
 {
 	struct address_space *mapping;
 	int rc;
@@ -519,10 +520,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
 	else
 		rc = fallback_migrate_page(mapping, newpage, page);
 
-	if (!rc)
-		remove_migration_ptes(page, newpage);
-	else
-		newpage->mapping = NULL;
+	if (safe_to_remap) {
+		if (!rc)
+			remove_migration_ptes(page, newpage);
+		else
+			newpage->mapping = NULL;
+	}
 
 	unlock_page(newpage);
 
@@ -539,6 +542,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	int rc = 0;
 	int *result = NULL;
 	struct page *newpage = get_new_page(page, private, &result);
+	int safe_to_remap = 1;
 	int rcu_locked = 0;
 	int charge = 0;
 	struct mem_cgroup *mem = NULL;
@@ -600,18 +604,26 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 		rcu_read_lock();
 		rcu_locked = 1;
 
-		/*
-		 * If the page has no mappings any more, just bail. An
-		 * unmapped anon page is likely to be freed soon but worse,
-		 * it's possible its anon_vma disappeared between when
-		 * the page was isolated and when we reached here while
-		 * the RCU lock was not held
-		 */
-		if (!page_mapped(page))
-			goto rcu_unlock;
+		/* Determine how to safely use anon_vma */
+		if (!page_mapped(page)) {
+			if (!PageSwapCache(page))
+				goto rcu_unlock;
 
-		anon_vma = page_anon_vma(page);
-		atomic_inc(&anon_vma->external_refcount);
+			/*
+			 * We cannot be sure that the anon_vma of an unmapped
+			 * page is safe to use. In this case, the page still
+			 * page is safe to use. In this case the page is still
+			 * migrated but its ptes are not remapped afterwards.
+			safe_to_remap = 0;
+		} else { 
+			/*
+			 * Take a reference count on the anon_vma if the
+			 * page is mapped so that it is guaranteed to
+			 * exist when the page is remapped later
+			 */
+			anon_vma = page_anon_vma(page);
+			atomic_inc(&anon_vma->external_refcount);
+		}
 	}
 
 	/*
@@ -646,9 +658,9 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 
 skip_unmap:
 	if (!page_mapped(page))
-		rc = move_to_new_page(newpage, page);
+		rc = move_to_new_page(newpage, page, safe_to_remap);
 
-	if (rc)
+	if (rc && safe_to_remap)
 		remove_migration_ptes(page, page);
 rcu_unlock:
 

^ permalink raw reply related	[flat|nested] 72+ messages in thread


* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache  pages
  2010-04-01 17:36                 ` Mel Gorman
@ 2010-04-02  0:20                   ` Minchan Kim
  -1 siblings, 0 replies; 72+ messages in thread
From: Minchan Kim @ 2010-04-02  0:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Fri, Apr 2, 2010 at 2:36 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Thu, Apr 01, 2010 at 07:51:31PM +0900, Minchan Kim wrote:
>> On Thu, Apr 1, 2010 at 2:42 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Thu, 1 Apr 2010 13:44:29 +0900
>> > Minchan Kim <minchan.kim@gmail.com> wrote:
>> >
>> >> On Thu, Apr 1, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
>> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >> > On Thu, 1 Apr 2010 11:43:18 +0900
>> >> > Minchan Kim <minchan.kim@gmail.com> wrote:
>> >> >
>> >> >> On Wed, Mar 31, 2010 at 2:26 PM, KAMEZAWA Hiroyuki       /*
>> >> >> >> diff --git a/mm/rmap.c b/mm/rmap.c
>> >> >> >> index af35b75..d5ea1f2 100644
>> >> >> >> --- a/mm/rmap.c
>> >> >> >> +++ b/mm/rmap.c
>> >> >> >> @@ -1394,9 +1394,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
>> >> >> >>
>> >> >> >>       if (unlikely(PageKsm(page)))
>> >> >> >>               return rmap_walk_ksm(page, rmap_one, arg);
>> >> >> >> -     else if (PageAnon(page))
>> >> >> >> +     else if (PageAnon(page)) {
>> >> >> >> +             if (PageSwapCache(page))
>> >> >> >> +                     return SWAP_AGAIN;
>> >> >> >>               return rmap_walk_anon(page, rmap_one, arg);
>> >> >> >
>> >> >> > SwapCache has a condition as (PageSwapCache(page) && page_mapped(page) == true.
>> >> >> >
>> >> >>
>> >> >> In case of tmpfs, page has swapcache but not mapped.
>> >> >>
>> >> >> > Please see do_swap_page(), PageSwapCache bit is cleared only when
>> >> >> >
>> >> >> > do_swap_page()...
>> >> >> >       swap_free(entry);
>> >> >> >        if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
>> >> >> >                try_to_free_swap(page);
>> >> >> >
>> >> >> > Then, PageSwapCache is cleared only when swap is freeable even if mapped.
>> >> >> >
>> >> >> > rmap_walk_anon() should be called and the check is not necessary.
>> >> >>
>> >> >> Frankly speaking, I don't understand what is Mel's problem, why he added
>> >> >> Swapcache check in rmap_walk, and why do you said we don't need it.
>> >> >>
>> >> >> Could you explain more detail if you don't mind?
>> >> >>
>> >> > I may miss something.
>> >> >
>> >> > unmap_and_move()
>> >> >  1. try_to_unmap(TTU_MIGRATION)
>> >> >  2. move_to_newpage
>> >> >  3. remove_migration_ptes
>> >> >        -> rmap_walk()
>> >> >
>> >> > Then, to map a page back we unmapped we call rmap_walk().
>> >> >
>> >> > Assume a SwapCache which is mapped, then, PageAnon(page) == true.
>> >> >
>> >> >  At 1. try_to_unmap() will rewrite pte with swp_entry of SwapCache.
>> >> >       mapcount goes to 0.
>> >> >  At 2. SwapCache is copied to a new page.
>> >> >  At 3. The new page is mapped back to the place. Now, newpage's mapcount is 0.
>> >> >       Before patch, the new page is mapped back to all ptes.
>> >> >       After patch, the new page is not mapped back because its mapcount is 0.
>> >> >
>> >> > I don't think shared SwapCache of anon is not an usual behavior, so, the logic
>> >> > before patch is more attractive.
>> >> >
>> >> > If SwapCache is not mapped before "1", we skip "1" and rmap_walk will do nothing
>> >> > because page->mapping is NULL.
>> >> >
>> >>
>> >> Thanks. I agree. We don't need the check.
>> >> Then, my question is why Mel added the check in rmap_walk.
>> >> He mentioned some BUG trigger and fixed things after this patch.
>> >> What's it?
>> >> Is it really related to this logic?
>> >> I don't think so or we are missing something.
>> >>
>> > Hmm. Consiering again.
>> >
>> > Now.
>> >        if (PageAnon(page)) {
>> >                rcu_locked = 1;
>> >                rcu_read_lock();
>> >                if (!page_mapped(page)) {
>> >                        if (!PageSwapCache(page))
>> >                                goto rcu_unlock;
>> >                } else {
>> >                        anon_vma = page_anon_vma(page);
>> >                        atomic_inc(&anon_vma->external_refcount);
>> >                }
>> >
>> >
>> > Maybe this is a fix.
>> >
>> > ==
>> >        skip_remap = 0;
>> >        if (PageAnon(page)) {
>> >                rcu_read_lock();
>> >                if (!page_mapped(page)) {
>> >                        if (!PageSwapCache(page))
>> >                                goto rcu_unlock;
>> >                        /*
>> >                         * We can't be sure whether this anon_vma is valid because
>> >                         * !page_mapped(page). Then, we do migration(radix-tree replacement)
>> >                         * but don't remap it which touches anon_vma in page->mapping.
>> >                         */
>> >                        skip_remap = 1;
>> >                        goto skip_unmap;
>> >                } else {
>> >                        anon_vma = page_anon_vma(page);
>> >                        atomic_inc(&anon_vma->external_refcount);
>> >                }
>> >        }
>> >        .....copy page, radix-tree replacement,....
>> >
>>
>> It's not enough.
>> we uses remove_migration_ptes in  move_to_new_page, too.
>> We have to prevent it.
>> We can check PageSwapCache(page) in move_to_new_page and then
>> skip remove_migration_ptes.
>>
>> ex)
>> static int move_to_new_page(....)
>> {
>>      int swapcache = PageSwapCache(page);
>>      ...
>>      if (!swapcache)
>>          if(!rc)
>>              remove_migration_ptes
>>          else
>>              newpage->mapping = NULL;
>> }
>>
>
> This I agree with.
>
>> And we have to close race between PageAnon(page) and rcu_read_lock.
>
> Not so sure on this. The page is locked at this point and that should
> prevent it from becoming !PageAnon

The page lock can't prevent the anon_vma from being freed.
That guarantee holds only for file-backed pages, I think.
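
To make the lifetime rules concrete, here is a minimal sketch of how the
series pins the anon_vma while the page is still mapped. The external_refcount
lines mirror the code quoted above; the release side is paraphrased from the
rest of the series, so treat the exact shape as an assumption:

	/*
	 * An anon_vma lives as long as some vma references it, not as long
	 * as the page does.  While at least one pte still maps the page,
	 * this reference keeps the anon_vma alive across the migration:
	 */
	anon_vma = page_anon_vma(page);
	atomic_inc(&anon_vma->external_refcount);

	/* ... unmap, copy to newpage, remove migration ptes ... */

	/*
	 * Dropped once remapping is done.  An already-unmapped page has no
	 * pte left to guarantee the anon_vma above is still allocated,
	 * which is exactly the case being debated here.
	 */
	if (atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->lock)) {
		int empty = list_empty(&anon_vma->head);
		spin_unlock(&anon_vma->lock);
		if (empty)
			anon_vma_free(anon_vma);
	}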

>> If we don't do it, anon_vma could be free in the middle of operation.
>> I means
>>
>>          * of migration. File cache pages are no problem because of page_lock()
>>          * File Caches may use write_page() or lock_page() in migration, then,
>>          * just care Anon page here.
>>          */
>>         if (PageAnon(page)) {
>>                 !!! RACE !!!!
>>                 rcu_read_lock();
>>                 rcu_locked = 1;
>>
>> +
>> +               /*
>> +                * If the page has no mappings any more, just bail. An
>> +                * unmapped anon page is likely to be freed soon but worse,
>>
>
> I am not sure this race exists because the page is locked but a key
> observation has been made - A page that is unmapped can be migrated if
> it's PageSwapCache but it may not have a valid anon_vma. Hence, in the
> !page_mapped case, the key is to not use anon_vma. How about the
> following patch?

I like this. Kame, what's your opinion?
Please take a look at my comment below.

>
> ==== CUT HERE ====
>
> mm,migration: Allow the migration of PageSwapCache pages
>
> PageAnon pages that are unmapped may or may not have an anon_vma, so they
> are not currently migrated. However, a swap cache page can be migrated and
> fits this description. This patch identifies swap cache pages and allows
> them to be migrated, but ensures that no attempt is made to remap the pages
> in a way that would potentially access an already freed anon_vma.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 35aad2a..5d0218b 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -484,7 +484,8 @@ static int fallback_migrate_page(struct address_space *mapping,
>  *   < 0 - error code
>  *  == 0 - success
>  */
> -static int move_to_new_page(struct page *newpage, struct page *page)
> +static int move_to_new_page(struct page *newpage, struct page *page,
> +                                               int safe_to_remap)
>  {
>        struct address_space *mapping;
>        int rc;
> @@ -519,10 +520,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
>        else
>                rc = fallback_migrate_page(mapping, newpage, page);
>
> -       if (!rc)
> -               remove_migration_ptes(page, newpage);
> -       else
> -               newpage->mapping = NULL;
> +       if (safe_to_remap) {
> +               if (!rc)
> +                       remove_migration_ptes(page, newpage);
> +               else
> +                       newpage->mapping = NULL;
> +       }
>
>        unlock_page(newpage);
>
> @@ -539,6 +542,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>        int rc = 0;
>        int *result = NULL;
>        struct page *newpage = get_new_page(page, private, &result);
> +       int safe_to_remap = 1;
>        int rcu_locked = 0;
>        int charge = 0;
>        struct mem_cgroup *mem = NULL;
> @@ -600,18 +604,26 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>                rcu_read_lock();
>                rcu_locked = 1;
>
> -               /*
> -                * If the page has no mappings any more, just bail. An
> -                * unmapped anon page is likely to be freed soon but worse,
> -                * it's possible its anon_vma disappeared between when
> -                * the page was isolated and when we reached here while
> -                * the RCU lock was not held
> -                */
> -               if (!page_mapped(page))
> -                       goto rcu_unlock;
> +               /* Determine how to safely use anon_vma */
> +               if (!page_mapped(page)) {
> +                       if (!PageSwapCache(page))
> +                               goto rcu_unlock;
>
> -               anon_vma = page_anon_vma(page);
> -               atomic_inc(&anon_vma->external_refcount);
> +                       /*
> +                        * We cannot be sure that the anon_vma of an unmapped
> +                        * page is safe to use. In this case, the page still

How about changing the comment?
"In this case, the swapcache page still "
Also, I want to rename "safe_to_remap" to "remap_swapcache".
I think this problem is specific to swapcache pages,
so I want the name to say that explicitly, even though the code
already makes it clear that only swapcache is involved.
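
Concretely, that rename applied to the hunk quoted above would give something
like the sketch below (just an illustration of the naming, not a final patch):

static int move_to_new_page(struct page *newpage, struct page *page,
						int remap_swapcache)
{
	...
	/*
	 * remap_swapcache == 0 means this is an unmapped swapcache page:
	 * it was migrated via the radix-tree replacement only, and there is
	 * no anon_vma known to be valid, so do not touch the ptes at all.
	 */
	if (remap_swapcache) {
		if (!rc)
			remove_migration_ptes(page, newpage);
		else
			newpage->mapping = NULL;
	}

	unlock_page(newpage);
	...
}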


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-04-01 17:36                 ` Mel Gorman
@ 2010-04-02  0:21                   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-02  0:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Minchan Kim, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Thu, 1 Apr 2010 18:36:41 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> > > ==
> > >        skip_remap = 0;
> > >        if (PageAnon(page)) {
> > >                rcu_read_lock();
> > >                if (!page_mapped(page)) {
> > >                        if (!PageSwapCache(page))
> > >                                goto rcu_unlock;
> > >                        /*
> > >                         * We can't be sure whether this anon_vma is valid because
> > >                         * !page_mapped(page). Then, we do migration(radix-tree replacement)
> > >                         * but don't remap it which touches anon_vma in page->mapping.
> > >                         */
> > >                        skip_remap = 1;
> > >                        goto skip_unmap;
> > >                } else {
> > >                        anon_vma = page_anon_vma(page);
> > >                        atomic_inc(&anon_vma->external_refcount);
> > >                }
> > >        }
> > >        .....copy page, radix-tree replacement,....
> > >
> > 
> > It's not enough.
> > we uses remove_migration_ptes in  move_to_new_page, too.
> > We have to prevent it.
> > We can check PageSwapCache(page) in move_to_new_page and then
> > skip remove_migration_ptes.
> > 
> > ex)
> > static int move_to_new_page(....)
> > {
> >      int swapcache = PageSwapCache(page);
> >      ...
> >      if (!swapcache)
> >          if(!rc)
> >              remove_migration_ptes
> >          else
> >              newpage->mapping = NULL;
> > }
> > 
> 
> This I agree with.
> 
me, too.


> I am not sure this race exists because the page is locked but a key
> observation has been made - A page that is unmapped can be migrated if
> it's PageSwapCache but it may not have a valid anon_vma. Hence, in the
> !page_mapped case, the key is to not use anon_vma. How about the
> following patch?
> 

Seems good to me. But (see below)


> ==== CUT HERE ====
> 
> mm,migration: Allow the migration of PageSwapCache pages
> 
> PageAnon pages that are unmapped may or may not have an anon_vma, so they
> are not currently migrated. However, a swap cache page can be migrated and
> fits this description. This patch identifies swap cache pages and allows
> them to be migrated, but ensures that no attempt is made to remap the pages
> in a way that would potentially access an already freed anon_vma.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 35aad2a..5d0218b 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -484,7 +484,8 @@ static int fallback_migrate_page(struct address_space *mapping,
>   *   < 0 - error code
>   *  == 0 - success
>   */
> -static int move_to_new_page(struct page *newpage, struct page *page)
> +static int move_to_new_page(struct page *newpage, struct page *page,
> +						int safe_to_remap)
>  {
>  	struct address_space *mapping;
>  	int rc;
> @@ -519,10 +520,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
>  	else
>  		rc = fallback_migrate_page(mapping, newpage, page);
>  
> -	if (!rc)
> -		remove_migration_ptes(page, newpage);
> -	else
> -		newpage->mapping = NULL;
> +	if (safe_to_remap) {
> +		if (!rc)
> +			remove_migration_ptes(page, newpage);
> +		else
> +			newpage->mapping = NULL;
> +	}
>  
	if (rc)
		newpage->mapping = NULL;
	else if (safe_to_remap)
		remove_migration_ptes(page, newpage);

is better. The old code cleared newpage->mapping whenever rc != 0.

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-04-02  0:20                   ` Minchan Kim
@ 2010-04-02  8:51                     ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-04-02  8:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm

On Fri, Apr 02, 2010 at 09:20:27AM +0900, Minchan Kim wrote:
> On Fri, Apr 2, 2010 at 2:36 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > On Thu, Apr 01, 2010 at 07:51:31PM +0900, Minchan Kim wrote:
> >> On Thu, Apr 1, 2010 at 2:42 PM, KAMEZAWA Hiroyuki
> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> > On Thu, 1 Apr 2010 13:44:29 +0900
> >> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >> >
> >> >> On Thu, Apr 1, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
> >> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> >> > On Thu, 1 Apr 2010 11:43:18 +0900
> >> >> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >> >> >
> >> >> >> On Wed, Mar 31, 2010 at 2:26 PM, KAMEZAWA Hiroyuki wrote:
> >> >> >> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> >> >> >> index af35b75..d5ea1f2 100644
> >> >> >> >> --- a/mm/rmap.c
> >> >> >> >> +++ b/mm/rmap.c
> >> >> >> >> @@ -1394,9 +1394,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
> >> >> >> >>
> >> >> >> >>       if (unlikely(PageKsm(page)))
> >> >> >> >>               return rmap_walk_ksm(page, rmap_one, arg);
> >> >> >> >> -     else if (PageAnon(page))
> >> >> >> >> +     else if (PageAnon(page)) {
> >> >> >> >> +             if (PageSwapCache(page))
> >> >> >> >> +                     return SWAP_AGAIN;
> >> >> >> >>               return rmap_walk_anon(page, rmap_one, arg);
> >> >> >> >
> >> >> >> > SwapCache has a condition as (PageSwapCache(page) && page_mapped(page) == true.
> >> >> >> >
> >> >> >>
> >> >> >> In case of tmpfs, page has swapcache but not mapped.
> >> >> >>
> >> >> >> > Please see do_swap_page(), PageSwapCache bit is cleared only when
> >> >> >> >
> >> >> >> > do_swap_page()...
> >> >> >> >       swap_free(entry);
> >> >> >> >        if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> >> >> >> >                try_to_free_swap(page);
> >> >> >> >
> >> >> >> > Then, PageSwapCache is cleared only when swap is freeable even if mapped.
> >> >> >> >
> >> >> >> > rmap_walk_anon() should be called and the check is not necessary.
> >> >> >>
> >> >> >> Frankly speaking, I don't understand what is Mel's problem, why he added
> >> >> >> Swapcache check in rmap_walk, and why do you said we don't need it.
> >> >> >>
> >> >> >> Could you explain more detail if you don't mind?
> >> >> >>
> >> >> > I may miss something.
> >> >> >
> >> >> > unmap_and_move()
> >> >> >  1. try_to_unmap(TTU_MIGRATION)
> >> >> >  2. move_to_newpage
> >> >> >  3. remove_migration_ptes
> >> >> >        -> rmap_walk()
> >> >> >
> >> >> > Then, to map a page back we unmapped we call rmap_walk().
> >> >> >
> >> >> > Assume a SwapCache which is mapped, then, PageAnon(page) == true.
> >> >> >
> >> >> >  At 1. try_to_unmap() will rewrite pte with swp_entry of SwapCache.
> >> >> >       mapcount goes to 0.
> >> >> >  At 2. SwapCache is copied to a new page.
> >> >> >  At 3. The new page is mapped back to the place. Now, newpage's mapcount is 0.
> >> >> >       Before patch, the new page is mapped back to all ptes.
> >> >> >       After patch, the new page is not mapped back because its mapcount is 0.
> >> >> >
> >> >> > I don't think shared SwapCache of anon is not an usual behavior, so, the logic
> >> >> > before patch is more attractive.
> >> >> >
> >> >> > If SwapCache is not mapped before "1", we skip "1" and rmap_walk will do nothing
> >> >> > because page->mapping is NULL.
> >> >> >
> >> >>
> >> >> Thanks. I agree. We don't need the check.
> >> >> Then, my question is why Mel added the check in rmap_walk.
> >> >> He mentioned some BUG trigger and fixed things after this patch.
> >> >> What's it?
> >> >> Is it really related to this logic?
> >> >> I don't think so or we are missing something.
> >> >>
> >> > Hmm. Consiering again.
> >> >
> >> > Now.
> >> >        if (PageAnon(page)) {
> >> >                rcu_locked = 1;
> >> >                rcu_read_lock();
> >> >                if (!page_mapped(page)) {
> >> >                        if (!PageSwapCache(page))
> >> >                                goto rcu_unlock;
> >> >                } else {
> >> >                        anon_vma = page_anon_vma(page);
> >> >                        atomic_inc(&anon_vma->external_refcount);
> >> >                }
> >> >
> >> >
> >> > Maybe this is a fix.
> >> >
> >> > ==
> >> >        skip_remap = 0;
> >> >        if (PageAnon(page)) {
> >> >                rcu_read_lock();
> >> >                if (!page_mapped(page)) {
> >> >                        if (!PageSwapCache(page))
> >> >                                goto rcu_unlock;
> >> >                        /*
> >> >                         * We can't be sure whether this anon_vma is valid because
> >> >                         * !page_mapped(page). Then, we do migration(radix-tree replacement)
> >> >                         * but don't remap it which touches anon_vma in page->mapping.
> >> >                         */
> >> >                        skip_remap = 1;
> >> >                        goto skip_unmap;
> >> >                } else {
> >> >                        anon_vma = page_anon_vma(page);
> >> >                        atomic_inc(&anon_vma->external_refcount);
> >> >                }
> >> >        }
> >> >        .....copy page, radix-tree replacement,....
> >> >
> >>
> >> It's not enough.
> >> we uses remove_migration_ptes in  move_to_new_page, too.
> >> We have to prevent it.
> >> We can check PageSwapCache(page) in move_to_new_page and then
> >> skip remove_migration_ptes.
> >>
> >> ex)
> >> static int move_to_new_page(....)
> >> {
> >>      int swapcache = PageSwapCache(page);
> >>      ...
> >>      if (!swapcache)
> >>          if(!rc)
> >>              remove_migration_ptes
> >>          else
> >>              newpage->mapping = NULL;
> >> }
> >>
> >
> > This I agree with.
> >
> >> And we have to close race between PageAnon(page) and rcu_read_lock.
> >
> > Not so sure on this. The page is locked at this point and that should
> > prevent it from becoming !PageAnon
> 
> The page lock can't prevent the anon_vma from being freed.

True, it can't in itself but it is a bug to free a locked page. As PageAnon
is cleared by the page allocator (see comments in page_remove_rmap) and we
have taken a reference to this page when isolating for migration, I still
don't see how it is possible for PageAnon to get cleared from underneath us.
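
As a rough sketch of the two guarantees in play (the call sites are
approximate and from memory, so treat them as an assumption rather than the
exact mmotm code):

	/* isolation: pin the page so it cannot be freed under us */
	if (unlikely(!get_page_unless_zero(page)))
		return -EBUSY;	/* already on its way back to the allocator */

	/* unmap_and_move(): the page is then held locked for the whole operation */
	lock_page(page);
	/*
	 * PageAnon is only cleared when the page goes back to the page
	 * allocator (see the comments in page_remove_rmap()), which cannot
	 * happen while we hold the reference taken above.  So the flag is
	 * stable here even though the anon_vma it once pointed at may
	 * already be gone.
	 */
	...
	unlock_page(page);
	put_page(page);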

> That guarantee holds only for file-backed pages, I think.
> 
> >> If we don't do it, anon_vma could be free in the middle of operation.
> >> I means
> >>
> >>          * of migration. File cache pages are no problem because of page_lock()
> >>          * File Caches may use write_page() or lock_page() in migration, then,
> >>          * just care Anon page here.
> >>          */
> >>         if (PageAnon(page)) {
> >>                 !!! RACE !!!!
> >>                 rcu_read_lock();
> >>                 rcu_locked = 1;
> >>
> >> +
> >> +               /*
> >> +                * If the page has no mappings any more, just bail. An
> >> +                * unmapped anon page is likely to be freed soon but worse,
> >>
> >
> > I am not sure this race exists because the page is locked but a key
> > observation has been made - A page that is unmapped can be migrated if
> > it's PageSwapCache but it may not have a valid anon_vma. Hence, in the
> > !page_mapped case, the key is to not use anon_vma. How about the
> > following patch?
> 
> I like this. Kame, what's your opinion?
> Please take a look at my comment below.
> 
> >
> > ==== CUT HERE ====
> >
> > mm,migration: Allow the migration of PageSwapCache pages
> >
> > PageAnon pages that are unmapped may or may not have an anon_vma, so they
> > are not currently migrated. However, a swap cache page can be migrated and
> > fits this description. This patch identifies swap cache pages and allows
> > them to be migrated, but ensures that no attempt is made to remap the pages
> > in a way that would potentially access an already freed anon_vma.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 35aad2a..5d0218b 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -484,7 +484,8 @@ static int fallback_migrate_page(struct address_space *mapping,
> >  *   < 0 - error code
> >  *  == 0 - success
> >  */
> > -static int move_to_new_page(struct page *newpage, struct page *page)
> > +static int move_to_new_page(struct page *newpage, struct page *page,
> > +                                               int safe_to_remap)
> >  {
> >        struct address_space *mapping;
> >        int rc;
> > @@ -519,10 +520,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
> >        else
> >                rc = fallback_migrate_page(mapping, newpage, page);
> >
> > -       if (!rc)
> > -               remove_migration_ptes(page, newpage);
> > -       else
> > -               newpage->mapping = NULL;
> > +       if (safe_to_remap) {
> > +               if (!rc)
> > +                       remove_migration_ptes(page, newpage);
> > +               else
> > +                       newpage->mapping = NULL;
> > +       }
> >
> >        unlock_page(newpage);
> >
> > @@ -539,6 +542,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >        int rc = 0;
> >        int *result = NULL;
> >        struct page *newpage = get_new_page(page, private, &result);
> > +       int safe_to_remap = 1;
> >        int rcu_locked = 0;
> >        int charge = 0;
> >        struct mem_cgroup *mem = NULL;
> > @@ -600,18 +604,26 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >                rcu_read_lock();
> >                rcu_locked = 1;
> >
> > -               /*
> > -                * If the page has no mappings any more, just bail. An
> > -                * unmapped anon page is likely to be freed soon but worse,
> > -                * it's possible its anon_vma disappeared between when
> > -                * the page was isolated and when we reached here while
> > -                * the RCU lock was not held
> > -                */
> > -               if (!page_mapped(page))
> > -                       goto rcu_unlock;
> > +               /* Determine how to safely use anon_vma */
> > +               if (!page_mapped(page)) {
> > +                       if (!PageSwapCache(page))
> > +                               goto rcu_unlock;
> >
> > -               anon_vma = page_anon_vma(page);
> > -               atomic_inc(&anon_vma->external_refcount);
> > +                       /*
> > +                        * We cannot be sure that the anon_vma of an unmapped
> > +                        * page is safe to use. In this case, the page still
> 
> How about changing the comment?
> "In this case, the swapcache page still "
> Also, I want to rename "safe_to_remap" to "remap_swapcache".

Done.

> I think this problem is specific to swapcache pages,
> so I want the name to say that explicitly, even though the code
> already makes it clear that only swapcache is involved.
> 

Sure. Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-04-02  0:21                   ` KAMEZAWA Hiroyuki
@ 2010-04-02  8:52                     ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-04-02  8:52 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, Andrew Morton, Andrea Arcangeli, Christoph Lameter,
	Adam Litke, Avi Kivity, David Rientjes, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Fri, Apr 02, 2010 at 09:21:50AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 1 Apr 2010 18:36:41 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > > > ==
> > > >        skip_remap = 0;
> > > >        if (PageAnon(page)) {
> > > >                rcu_read_lock();
> > > >                if (!page_mapped(page)) {
> > > >                        if (!PageSwapCache(page))
> > > >                                goto rcu_unlock;
> > > >                        /*
> > > >                         * We can't be sure whether this anon_vma is valid because
> > > >                         * !page_mapped(page). Then, we do migration(radix-tree replacement)
> > > >                         * but don't remap it which touches anon_vma in page->mapping.
> > > >                         */
> > > >                        skip_remap = 1;
> > > >                        goto skip_unmap;
> > > >                } else {
> > > >                        anon_vma = page_anon_vma(page);
> > > >                        atomic_inc(&anon_vma->external_refcount);
> > > >                }
> > > >        }
> > > >        .....copy page, radix-tree replacement,....
> > > >
> > > 
> > > It's not enough.
> > > we uses remove_migration_ptes in  move_to_new_page, too.
> > > We have to prevent it.
> > > We can check PageSwapCache(page) in move_to_new_page and then
> > > skip remove_migration_ptes.
> > > 
> > > ex)
> > > static int move_to_new_page(....)
> > > {
> > >      int swapcache = PageSwapCache(page);
> > >      ...
> > >      if (!swapcache)
> > >          if(!rc)
> > >              remove_migration_ptes
> > >          else
> > >              newpage->mapping = NULL;
> > > }
> > > 
> > 
> > This I agree with.
> > 
> me, too.
> 
> 
> > I am not sure this race exists because the page is locked but a key
> > observation has been made - A page that is unmapped can be migrated if
> > it's PageSwapCache but it may not have a valid anon_vma. Hence, in the
> > !page_mapped case, the key is to not use anon_vma. How about the
> > following patch?
> > 
> 
> Seems good to me. But (see below)
> 
> 
> > ==== CUT HERE ====
> > 
> > mm,migration: Allow the migration of PageSwapCache pages
> > 
> > PageAnon pages that are unmapped may or may not have an anon_vma, so they
> > are not currently migrated. However, a swap cache page can be migrated and
> > fits this description. This patch identifies swap cache pages and allows
> > them to be migrated, but ensures that no attempt is made to remap the pages
> > in a way that would potentially access an already freed anon_vma.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 35aad2a..5d0218b 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -484,7 +484,8 @@ static int fallback_migrate_page(struct address_space *mapping,
> >   *   < 0 - error code
> >   *  == 0 - success
> >   */
> > -static int move_to_new_page(struct page *newpage, struct page *page)
> > +static int move_to_new_page(struct page *newpage, struct page *page,
> > +						int safe_to_remap)
> >  {
> >  	struct address_space *mapping;
> >  	int rc;
> > @@ -519,10 +520,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
> >  	else
> >  		rc = fallback_migrate_page(mapping, newpage, page);
> >  
> > -	if (!rc)
> > -		remove_migration_ptes(page, newpage);
> > -	else
> > -		newpage->mapping = NULL;
> > +	if (safe_to_remap) {
> > +		if (!rc)
> > +			remove_migration_ptes(page, newpage);
> > +		else
> > +			newpage->mapping = NULL;
> > +	}
> >  
> 	if (rc)
> 		newpage->mapping = NULL;
> 	else if (safe_to_remap)
> 		remove_migration_ptes(page, newpage);
> 
> is better. The old code cleared newpage->mapping whenever rc != 0.
> 

True, done.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-04-07  0:06     ` Andrew Morton
@ 2010-04-07 16:49       ` Mel Gorman
  -1 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-04-07 16:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Tue, Apr 06, 2010 at 05:06:23PM -0700, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:48 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > PageAnon pages that are unmapped may or may not have an anon_vma so are
> > not currently migrated. However, a swap cache page can be migrated and
> > fits this description. This patch identifies swap cache pages and allows
> > them to be migrated but ensures that no attempt is made to remap the pages,
> > which would potentially access an already freed anon_vma.
> > 
> > ...
> >
> > @@ -484,7 +484,8 @@ static int fallback_migrate_page(struct address_space *mapping,
> >   *   < 0 - error code
> >   *  == 0 - success
> >   */
> > -static int move_to_new_page(struct page *newpage, struct page *page)
> > +static int move_to_new_page(struct page *newpage, struct page *page,
> > +						int remap_swapcache)
> 
> You're not a fan of `bool'.
> 

This function existed before compaction and returns an error code rather
than a true/false value, so the new parameter stays an int to match the
existing style.

> >  {
> >  	struct address_space *mapping;
> >  	int rc;
> > @@ -519,10 +520,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
> >  	else
> >  		rc = fallback_migrate_page(mapping, newpage, page);
> >  
> > -	if (!rc)
> > -		remove_migration_ptes(page, newpage);
> > -	else
> > +	if (rc) {
> >  		newpage->mapping = NULL;
> > +	} else {
> > +		if (remap_swapcache) 
> > +			remove_migration_ptes(page, newpage);
> > +	}
> >  
> >  	unlock_page(newpage);
> >  
> > @@ -539,6 +542,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >  	int rc = 0;
> >  	int *result = NULL;
> >  	struct page *newpage = get_new_page(page, private, &result);
> > +	int remap_swapcache = 1;
> >  	int rcu_locked = 0;
> >  	int charge = 0;
> >  	struct mem_cgroup *mem = NULL;
> > @@ -600,18 +604,27 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >  		rcu_read_lock();
> >  		rcu_locked = 1;
> >  
> > -		/*
> > -		 * If the page has no mappings any more, just bail. An
> > -		 * unmapped anon page is likely to be freed soon but worse,
> > -		 * it's possible its anon_vma disappeared between when
> > -		 * the page was isolated and when we reached here while
> > -		 * the RCU lock was not held
> > -		 */
> > -		if (!page_mapped(page))
> > -			goto rcu_unlock;
> > +		/* Determine how to safely use anon_vma */
> > +		if (!page_mapped(page)) {
> > +			if (!PageSwapCache(page))
> > +				goto rcu_unlock;
> >  
> > -		anon_vma = page_anon_vma(page);
> > -		atomic_inc(&anon_vma->external_refcount);
> > +			/*
> > +			 * We cannot be sure that the anon_vma of an unmapped
> > +			 * swapcache page is safe to use.
> 
> Why not?  A full explanation here would be nice.

Patch below.

> 
> > 			   In this case, the
> > +			 * swapcache page gets migrated but the pages are not
> > +			 * remapped
> > +			 */
> > +			remap_swapcache = 0;
> > +		} else { 
> > +			/*
> > +			 * Take a reference count on the anon_vma if the
> > +			 * page is mapped so that it is guaranteed to
> > +			 * exist when the page is remapped later
> > +			 */
> > +			anon_vma = page_anon_vma(page);
> > +			atomic_inc(&anon_vma->external_refcount);
> > +		}
> >  	}
> >  
> >  	/*
> > @@ -646,9 +659,9 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >  
> >  skip_unmap:
> >  	if (!page_mapped(page))
> > -		rc = move_to_new_page(newpage, page);
> > +		rc = move_to_new_page(newpage, page, remap_swapcache);
> >  
> > -	if (rc)
> > +	if (rc && remap_swapcache)
> >  		remove_migration_ptes(page, page);
> >  rcu_unlock:
> 

If you prefer, a patch that updates the comment accordingly is as follows

==== CUT HERE ====
mm,compaction: Expand comment on unmapped swapcache pages

The comment on the handling of anon_vma for unmapped pages is a bit
sparse. Expand it.

This is a fix to the patch "mm,migration: Allow the migration of
PageSwapCache pages"

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/migrate.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 0356e64..281a239 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -611,9 +611,15 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 
 			/*
 			 * We cannot be sure that the anon_vma of an unmapped
-			 * swapcache page is safe to use. In this case, the
-			 * swapcache page gets migrated but the pages are not
-			 * remapped
+			 * swapcache page is safe to use because we don't
+			 * know in advance if the VMA that this page belonged
+			 * to still exists. If the VMA and others sharing the
+			 * data have been freed, then the anon_vma could
+			 * already be invalid.
+			 *
+			 * To avoid this possibility, swapcache pages get
+			 * migrated but are not remapped when migration
+			 * completes
 			 */
 			remap_swapcache = 0;
 		} else { 
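
To make the hazard concrete, here is an illustrative sequence of events (not a
hunk from this series) that the expanded comment is guarding against:

	/*
	 * CPU A (reclaim, then munmap/exit)    CPU B (migration)
	 *
	 * add_to_swap() + try_to_unmap():
	 *   page is now PageSwapCache and
	 *   !page_mapped()
	 *                                       page is isolated for migration
	 * the last VMA attached to the
	 * anon_vma is torn down and the
	 * anon_vma is freed (and its memory
	 * may be reused for something else)
	 *                                       unmap_and_move():
	 *                                         page->mapping still encodes the
	 *                                         stale anon_vma, so do not walk
	 *                                         it; migrate the data but set
	 *                                         remap_swapcache = 0 and skip
	 *                                         remove_migration_ptes()
	 */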

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-04-02 16:02   ` Mel Gorman
@ 2010-04-07  0:06     ` Andrew Morton
  -1 siblings, 0 replies; 72+ messages in thread
From: Andrew Morton @ 2010-04-07  0:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Fri,  2 Apr 2010 17:02:48 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> PageAnon pages that are unmapped may or may not have an anon_vma so are
> not currently migrated. However, a swap cache page can be migrated and
> fits this description. This patch identifies swap cache pages and allows
> them to be migrated but ensures that no attempt is made to remap the pages,
> which would potentially access an already freed anon_vma.
> 
> ...
>
> @@ -484,7 +484,8 @@ static int fallback_migrate_page(struct address_space *mapping,
>   *   < 0 - error code
>   *  == 0 - success
>   */
> -static int move_to_new_page(struct page *newpage, struct page *page)
> +static int move_to_new_page(struct page *newpage, struct page *page,
> +						int remap_swapcache)

You're not a fan of `bool'.

>  {
>  	struct address_space *mapping;
>  	int rc;
> @@ -519,10 +520,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
>  	else
>  		rc = fallback_migrate_page(mapping, newpage, page);
>  
> -	if (!rc)
> -		remove_migration_ptes(page, newpage);
> -	else
> +	if (rc) {
>  		newpage->mapping = NULL;
> +	} else {
> +		if (remap_swapcache) 
> +			remove_migration_ptes(page, newpage);
> +	}
>  
>  	unlock_page(newpage);
>  
> @@ -539,6 +542,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  	int rc = 0;
>  	int *result = NULL;
>  	struct page *newpage = get_new_page(page, private, &result);
> +	int remap_swapcache = 1;
>  	int rcu_locked = 0;
>  	int charge = 0;
>  	struct mem_cgroup *mem = NULL;
> @@ -600,18 +604,27 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  		rcu_read_lock();
>  		rcu_locked = 1;
>  
> -		/*
> -		 * If the page has no mappings any more, just bail. An
> -		 * unmapped anon page is likely to be freed soon but worse,
> -		 * it's possible its anon_vma disappeared between when
> -		 * the page was isolated and when we reached here while
> -		 * the RCU lock was not held
> -		 */
> -		if (!page_mapped(page))
> -			goto rcu_unlock;
> +		/* Determine how to safely use anon_vma */
> +		if (!page_mapped(page)) {
> +			if (!PageSwapCache(page))
> +				goto rcu_unlock;
>  
> -		anon_vma = page_anon_vma(page);
> -		atomic_inc(&anon_vma->external_refcount);
> +			/*
> +			 * We cannot be sure that the anon_vma of an unmapped
> +			 * swapcache page is safe to use.

Why not?  A full explanation here would be nice.

> 			   In this case, the
> +			 * swapcache page gets migrated but the pages are not
> +			 * remapped
> +			 */
> +			remap_swapcache = 0;
> +		} else { 
> +			/*
> +			 * Take a reference count on the anon_vma if the
> +			 * page is mapped so that it is guaranteed to
> +			 * exist when the page is remapped later
> +			 */
> +			anon_vma = page_anon_vma(page);
> +			atomic_inc(&anon_vma->external_refcount);
> +		}
>  	}
>  
>  	/*
> @@ -646,9 +659,9 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  
>  skip_unmap:
>  	if (!page_mapped(page))
> -		rc = move_to_new_page(newpage, page);
> +		rc = move_to_new_page(newpage, page, remap_swapcache);
>  
> -	if (rc)
> +	if (rc && remap_swapcache)
>  		remove_migration_ptes(page, page);
>  rcu_unlock:


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache  pages
  2010-04-02 16:02   ` Mel Gorman
@ 2010-04-06 15:37     ` Minchan Kim
  -1 siblings, 0 replies; 72+ messages in thread
From: Minchan Kim @ 2010-04-06 15:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Sat, Apr 3, 2010 at 1:02 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> PageAnon pages that are unmapped may or may not have an anon_vma so are
> not currently migrated. However, a swap cache page can be migrated and
> fits this description. This patch identifies swap cache pages and allows
> them to be migrated but ensures that no attempt is made to remap the pages,
> which would potentially access an already freed anon_vma.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Thanks for your effort, Mel.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-04-02 16:02   ` Mel Gorman
@ 2010-04-06  6:54     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-06  6:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm

On Fri,  2 Apr 2010 17:02:48 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> PageAnon pages that are unmapped may or may not have an anon_vma so are
> not currently migrated. However, a swap cache page can be migrated and
> fits this description. This patch identifies swap cache pages and allows
> them to be migrated but ensures that no attempt is made to remap the pages,
> which would potentially access an already freed anon_vma.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Seems nice to me.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
@ 2010-04-02 16:02   ` Mel Gorman
  0 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm

PageAnon pages that are unmapped may or may not have an anon_vma so are
not currently migrated. However, a swap cache page can be migrated and
fits this description. This patch identifies swap cache pages and allows
them to be migrated but ensures that no attempt is made to remap the pages,
which would potentially access an already freed anon_vma.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/migrate.c |   47 ++++++++++++++++++++++++++++++-----------------
 1 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 35aad2a..0356e64 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -484,7 +484,8 @@ static int fallback_migrate_page(struct address_space *mapping,
  *   < 0 - error code
  *  == 0 - success
  */
-static int move_to_new_page(struct page *newpage, struct page *page)
+static int move_to_new_page(struct page *newpage, struct page *page,
+						int remap_swapcache)
 {
 	struct address_space *mapping;
 	int rc;
@@ -519,10 +520,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
 	else
 		rc = fallback_migrate_page(mapping, newpage, page);
 
-	if (!rc)
-		remove_migration_ptes(page, newpage);
-	else
+	if (rc) {
 		newpage->mapping = NULL;
+	} else {
+		if (remap_swapcache) 
+			remove_migration_ptes(page, newpage);
+	}
 
 	unlock_page(newpage);
 
@@ -539,6 +542,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	int rc = 0;
 	int *result = NULL;
 	struct page *newpage = get_new_page(page, private, &result);
+	int remap_swapcache = 1;
 	int rcu_locked = 0;
 	int charge = 0;
 	struct mem_cgroup *mem = NULL;
@@ -600,18 +604,27 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 		rcu_read_lock();
 		rcu_locked = 1;
 
-		/*
-		 * If the page has no mappings any more, just bail. An
-		 * unmapped anon page is likely to be freed soon but worse,
-		 * it's possible its anon_vma disappeared between when
-		 * the page was isolated and when we reached here while
-		 * the RCU lock was not held
-		 */
-		if (!page_mapped(page))
-			goto rcu_unlock;
+		/* Determine how to safely use anon_vma */
+		if (!page_mapped(page)) {
+			if (!PageSwapCache(page))
+				goto rcu_unlock;
 
-		anon_vma = page_anon_vma(page);
-		atomic_inc(&anon_vma->external_refcount);
+			/*
+			 * We cannot be sure that the anon_vma of an unmapped
+			 * swapcache page is safe to use. In this case, the
+			 * swapcache page gets migrated but the pages are not
+			 * remapped
+			 */
+			remap_swapcache = 0;
+		} else { 
+			/*
+			 * Take a reference count on the anon_vma if the
+			 * page is mapped so that it is guaranteed to
+			 * exist when the page is remapped later
+			 */
+			anon_vma = page_anon_vma(page);
+			atomic_inc(&anon_vma->external_refcount);
+		}
 	}
 
 	/*
@@ -646,9 +659,9 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 
 skip_unmap:
 	if (!page_mapped(page))
-		rc = move_to_new_page(newpage, page);
+		rc = move_to_new_page(newpage, page, remap_swapcache);
 
-	if (rc)
+	if (rc && remap_swapcache)
 		remove_migration_ptes(page, page);
 rcu_unlock:
 
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2010-04-07 16:50 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-30  9:14 [PATCH 0/14] Memory Compaction v6 Mel Gorman
2010-03-30  9:14 ` Mel Gorman
2010-03-30  9:14 ` [PATCH 01/14] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
2010-03-30  9:14   ` Mel Gorman
2010-03-30  9:14 ` [PATCH 02/14] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
2010-03-30  9:14   ` Mel Gorman
2010-03-30  9:14 ` [PATCH 03/14] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
2010-03-30  9:14   ` Mel Gorman
2010-03-30  9:14 ` [PATCH 04/14] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
2010-03-30  9:14   ` Mel Gorman
2010-03-30  9:14 ` [PATCH 05/14] Export unusable free space index via /proc/unusable_index Mel Gorman
2010-03-30  9:14   ` Mel Gorman
2010-03-30  9:14 ` [PATCH 06/14] Export fragmentation index via /proc/extfrag_index Mel Gorman
2010-03-30  9:14   ` Mel Gorman
2010-03-30  9:14 ` [PATCH 07/14] Move definition for LRU isolation modes to a header Mel Gorman
2010-03-30  9:14   ` Mel Gorman
2010-03-30  9:14 ` [PATCH 08/14] Memory compaction core Mel Gorman
2010-03-30  9:14   ` Mel Gorman
2010-03-30  9:14 ` [PATCH 09/14] Add /proc trigger for memory compaction Mel Gorman
2010-03-30  9:14   ` Mel Gorman
2010-03-30  9:14 ` [PATCH 10/14] Add /sys trigger for per-node " Mel Gorman
2010-03-30  9:14   ` Mel Gorman
2010-03-30  9:14 ` [PATCH 11/14] Direct compact when a high-order allocation fails Mel Gorman
2010-03-30  9:14   ` Mel Gorman
2010-03-30  9:14 ` [PATCH 12/14] Add a tunable that decides when memory should be compacted and when it should be reclaimed Mel Gorman
2010-03-30  9:14   ` Mel Gorman
2010-03-30  9:14 ` [PATCH 13/14] Do not compact within a preferred zone after a compaction failure Mel Gorman
2010-03-30  9:14   ` Mel Gorman
2010-03-30  9:14 ` [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages Mel Gorman
2010-03-30  9:14   ` Mel Gorman
2010-03-31  5:26   ` KAMEZAWA Hiroyuki
2010-03-31  5:26     ` KAMEZAWA Hiroyuki
2010-03-31 11:27     ` Mel Gorman
2010-03-31 11:27       ` Mel Gorman
2010-03-31 23:57       ` KAMEZAWA Hiroyuki
2010-03-31 23:57         ` KAMEZAWA Hiroyuki
2010-04-01  2:39         ` Minchan Kim
2010-04-01  2:39           ` Minchan Kim
2010-04-01  2:43     ` Minchan Kim
2010-04-01  2:43       ` Minchan Kim
2010-04-01  3:01       ` KAMEZAWA Hiroyuki
2010-04-01  3:01         ` KAMEZAWA Hiroyuki
2010-04-01  4:44         ` Minchan Kim
2010-04-01  4:44           ` Minchan Kim
2010-04-01  5:42           ` KAMEZAWA Hiroyuki
2010-04-01  5:42             ` KAMEZAWA Hiroyuki
2010-04-01 10:51             ` Minchan Kim
2010-04-01 10:51               ` Minchan Kim
2010-04-01 17:36               ` Mel Gorman
2010-04-01 17:36                 ` Mel Gorman
2010-04-02  0:20                 ` Minchan Kim
2010-04-02  0:20                   ` Minchan Kim
2010-04-02  8:51                   ` Mel Gorman
2010-04-02  8:51                     ` Mel Gorman
2010-04-02  0:21                 ` KAMEZAWA Hiroyuki
2010-04-02  0:21                   ` KAMEZAWA Hiroyuki
2010-04-02  8:52                   ` Mel Gorman
2010-04-02  8:52                     ` Mel Gorman
2010-04-01  9:30           ` Mel Gorman
2010-04-01  9:30             ` Mel Gorman
2010-04-01 10:42             ` Minchan Kim
2010-04-01 10:42               ` Minchan Kim
2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
2010-04-02 16:02 ` [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages Mel Gorman
2010-04-02 16:02   ` Mel Gorman
2010-04-06  6:54   ` KAMEZAWA Hiroyuki
2010-04-06  6:54     ` KAMEZAWA Hiroyuki
2010-04-06 15:37   ` Minchan Kim
2010-04-06 15:37     ` Minchan Kim
2010-04-07  0:06   ` Andrew Morton
2010-04-07  0:06     ` Andrew Morton
2010-04-07 16:49     ` Mel Gorman
2010-04-07 16:49       ` Mel Gorman
