linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] Direct Migration V9: Overview
@ 2006-01-10 22:41 Christoph Lameter
  2006-01-10 22:41 ` [PATCH 1/5] Direct Migration V9: PageSwapCache checks Christoph Lameter
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Christoph Lameter @ 2006-01-10 22:41 UTC (permalink / raw)
  To: akpm
  Cc: Cliff Wickman, linux-kernel, Christoph Lameter, lhms-devel,
	Hirokazu Takahashi, KAMEZAWA Hiroyuki

Swap migration is now in Linus' tree. So this is the first direct page
migration patchset against his tree (2.6.15-git6; no changes apart
from the rediff). It is also posted on the off chance that we decide to
have the full thing in 2.6.16 instead of only swap migration.
Maybe this can now get into Andrew's tree?

-----

Page migration allows the moving of the physical location of pages between
nodes in a numa system while the process is running. This means that the
virtual addresses that the process sees do not change. However, the
system rearranges the physical location of those pages.

The main intent of the page migration patches here is to reduce the latency
of memory access by moving pages near to the processor where the process
accessing that memory is running.

The migration patchsets allow a process to manually relocate the node
on which its pages are located through the MPOL_MF_MOVE and MPOL_MF_MOVE_ALL
options while setting a new memory policy. The pages of a process can also
be relocated from another process using the sys_migrate_pages() function
call, which takes two sets of nodes and moves pages of a process that are
located on the source nodes to the destination nodes.
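
As a concrete illustration, here is a minimal userspace sketch (my
example, not part of the patchset) that moves a process's pages from
node 0 to node 1 through the migrate_pages() system call that came in
with swap migration. It assumes the architecture headers define
__NR_migrate_pages; there is no glibc wrapper yet.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

int main(int argc, char **argv)
{
	pid_t pid = (argc > 1) ? atoi(argv[1]) : getpid();
	unsigned long old_nodes = 1UL << 0;	/* source: node 0 */
	unsigned long new_nodes = 1UL << 1;	/* target: node 1 */
	long left;

	/* get_nodes() in the kernel reads maxnode - 1 bits per mask,
	 * so pass one more bit than the masks actually contain. */
	left = syscall(__NR_migrate_pages, pid,
		       8 * sizeof(unsigned long) + 1,
		       &old_nodes, &new_nodes);
	if (left < 0)
		perror("migrate_pages");
	else
		printf("%ld pages could not be moved\n", left);
	return 0;
}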

Manual migration is very useful if, for example, the scheduler has relocated
a process to a processor on a distant node. A batch scheduler or an
administrator can detect the situation and move the pages of the process
nearer to the new processor.

Larger installations usually partition the system using cpusets into
sections of nodes. Paul Jackson has equipped cpusets with the ability to
move pages when a task is moved to another cpuset. This allows automatic
control over the locality of a process. If a task is moved to a new cpuset
then all its pages are moved with it so that the performance of the
process does not degrade dramatically (as is the case today).

The swap migration patchset in 2.6.15-git6 works by simply evicting
the pages. The pages must be faulted back in. The pages are then typically
reallocated by the system near the node where the process is executing.
For swap migration the destination of the move is controlled by the
allocation policy. Cpusets set the allocation policy before calling
sys_migrate_pages() in order to move the pages as intended.

No allocation policy changes are performed for sys_migrate_pages(). This
means that the pages may not be faulted in to the specified nodes if no
allocation policy was set by other means. The pages will just end up
near the node where the fault occurred.  The direct migration patchset
extends the migration functionality to avoid going through swap.
The destination node of the relocation is controllable during the actual
moving of pages. The crutch of using the allocation policy to relocate
pages is no longer necessary: the pages are moved directly to the target.

The direct migration patchset allows all migration techniques to preserve
the relative location of pages within a group of nodes, so that the memory
allocation pattern a process has generated is kept intact even after the
process is migrated. This is necessary in order to preserve the memory
latencies. Processes will run with similar performance after migration.

This patch makes sys_migrate_pages() finally work as intended but does not
make any significant modifications to APIs.

Benefits over swap migration:

1. It makes migrate_pages() actually migrate pages instead of just
   swapping pages from a set of nodes out.

2. It is faster because the page does not need to be written to swap space.

3. It does not use swap space and therefore there is no danger of running
   out of swap space.

4. The need to write back a dirty page before migration is avoided through
   a filesystem-specific method.

5. Direct migration allows the preservation of the relative location of a page
   within a set of nodes.

Many of the ideas for this code were originally developed in the memory
hotplug project and we hope that this code will also allow the hotplug
project to build on this patch in order to reach its goals.

The patchset consists of five patches (only the first two are necessary to
have basic direct migration support):

1. SwapCache patch

   SwapCache pages may have changed their type after lock_page() if the
   page was migrated. Check for this and retry lookup if the page is no
   longer a SwapCache page.

2. migrate_pages()

   Basic direct migration with fallback to swap if all other attempts
   fail.

3. remove_from_swap()

   Page migration installs swap ptes for anonymous pages in order to
   preserve the information contained in the page tables. This patch
   removes the swap ptes after migration and replaces them with regular
   ptes.

4. upgrade of MPOL_MF_MOVE and sys_migrate_pages()

   Add logic to mm/mempolicy.c to allow the policy layer to control
   direct page migration. Thanks to Paul Jackson for the iterative
   logic to move between sets of nodes.

5. buffer_migrate_pages() patch

   Allow migration without writing back dirty pages. Add filesystem dependent
   migration support for ext2/ext3 and xfs. Use swapper space to setup a
   method to migrate anonymous pages without writeback.

Credits (also in mm/vmscan.c):

The idea for this scheme of page migration was first developed in the context
of the memory hotplug project. The main authors of the migration code from
the memory hotplug project are:

IWAMOTO Toshihiro <iwamoto@valinux.co.jp>
Hirokazu Takahashi <taka@valinux.co.jp>
Dave Hansen <haveblue@us.ibm.com>

Changes V8->V9:
- Patchset against 2.6.15-git6

Changes V7->V8:
- Patchset against 2.6.15-rc5-mm3
- Export more functions so that filesystems are able to implement their own
  migrate_page() function.
- Fix remove_from_swap() to remove the page from the swap cache in addition
  to replacing swap ptes. Call with the page lock on the new page.
- Fix copying of struct page {} field to avoid using the macros that process
  field information.

Changes V7->V7:
- Rediff against 2.6.15-rc5-mm2

Changes V6->V7:
- Patchset against 2.6.15-rc5-mm1
- Fix one occurrence of page->mapping in migrate_page_remove_references()
- Update description

Changes V5->V6:
- Patchset against 2.6.15-rc3-mm1
- Remove checks for page count increases while migrating after Andrew assured
  me that this cannot happen. Revise documentation to reflect that. If this is
  the case then we will have no need to include the unwind code from the
  hotplug project in the future.
- Fix a wrong reference to page instead of newpage when calling
  remove_from_swap().

Changes V4->V5:
- Patchset against 2.6.15-rc2-mm1
- Update policy layer patch to use the generic check_range in 2.6.15-rc2-mm1.
- Remove try_to_unmap patch since VM_RESERVED vanished under us and therefore
  there is no point anymore in distinguishing between permanent and transitional
  failures.

Changes V3->V4:
- Patchset against 2.6.15-rc1-mm2 + two swap migration fixes posted today.
- Remove what is already in 2.6.15-rc1-mm2 which results in a significant
  cleanup of the code.

Changes V2->V3:
- Patchset against 2.6.14-mm2
- Fix single processor build and builds without CONFIG_MIGRATION
- export symbols for filesystems that are modules and for
  modules using migrate_pages().
- Paul Jackson's cpuset migration support is in 2.6.14-mm2 so
  this patchset can be easily applied to -mm2 to get from swap
  based to direct page migration.

Changes V1->V2:
- Call node_remap with the right parameters in do_migrate_pages().
- Take radix tree lock while examining page count to avoid races with
  find_get_page() and various *_get_pages based on it.
- Convert direct ptes to swap ptes before radix tree update to avoid
  more races.
- Fix problem if CONFIG_MIGRATION is off for buffer_migrate_page
- Add documentation about page migration
- Change migrate_pages() api so that the caller can decide what
  to do about the migrated pages (badmem handling and hotplug
  have to remove those pages for good).
- Drop config patch (already in mm)
- Add try_to_unmap patch
- Patchset now against 2.6.14-mm1 without requiring additional patches.




* [PATCH 1/5] Direct Migration V9: PageSwapCache checks
  2006-01-10 22:41 [PATCH 0/5] Direct Migration V9: Overview Christoph Lameter
@ 2006-01-10 22:41 ` Christoph Lameter
  2006-01-10 22:41 ` [PATCH 2/5] Direct Migration V9: migrate_pages() extension Christoph Lameter
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Christoph Lameter @ 2006-01-10 22:41 UTC (permalink / raw)
  To: akpm
  Cc: Cliff Wickman, linux-kernel, Christoph Lameter, lhms-devel,
	Hirokazu Takahashi, KAMEZAWA Hiroyuki

Check for PageSwapCache after looking up and locking a swap page.

The page migration code may change a swap pte to point to a different page
under lock_page().

If that happens then the vm must retry the lookup operation in the swap
space to find the correct page number. There are a couple of locations
in the VM where a lock_page() is done on a swap page. In these locations
we need to check afterwards if the page was migrated. If the page was migrated
then the old page that was looked up before was freed and no longer has the
PageSwapCache bit set.
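
The pattern added by the hunks below can be summarized by the following
hypothetical helper (a sketch for illustration only; the patch
open-codes the check at each call site):

/*
 * Lock a page that was found in the swap cache and detect whether it
 * was migrated while we slept in lock_page(). Returns 1 if the page
 * is still valid, 0 if the caller must redo the swap cache lookup.
 */
static int lock_swap_page_or_retry(struct page *page)
{
	lock_page(page);
	if (PageSwapCache(page))
		return 1;
	/* Page migration has occurred; the old page was freed. */
	unlock_page(page);
	page_cache_release(page);
	return 0;
}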

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.15/mm/memory.c
===================================================================
--- linux-2.6.15.orig/mm/memory.c	2006-01-10 09:43:04.000000000 -0800
+++ linux-2.6.15/mm/memory.c	2006-01-10 14:31:39.000000000 -0800
@@ -1871,6 +1871,7 @@ static int do_swap_page(struct mm_struct
 		goto out;
 
 	entry = pte_to_swp_entry(orig_pte);
+again:
 	page = lookup_swap_cache(entry);
 	if (!page) {
  		swapin_readahead(entry, address, vma);
@@ -1894,6 +1895,12 @@ static int do_swap_page(struct mm_struct
 
 	mark_page_accessed(page);
 	lock_page(page);
+	if (!PageSwapCache(page)) {
+		/* Page migration has occurred */
+		unlock_page(page);
+		page_cache_release(page);
+		goto again;
+	}
 
 	/*
 	 * Back out if somebody else already faulted in this pte.
Index: linux-2.6.15/mm/shmem.c
===================================================================
--- linux-2.6.15.orig/mm/shmem.c	2006-01-10 09:43:04.000000000 -0800
+++ linux-2.6.15/mm/shmem.c	2006-01-10 14:31:39.000000000 -0800
@@ -1028,6 +1028,14 @@ repeat:
 			page_cache_release(swappage);
 			goto repeat;
 		}
+		if (!PageSwapCache(swappage)) {
+			/* Page migration has occurred */
+			shmem_swp_unmap(entry);
+			spin_unlock(&info->lock);
+			unlock_page(swappage);
+			page_cache_release(swappage);
+			goto repeat;
+		}
 		if (PageWriteback(swappage)) {
 			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
Index: linux-2.6.15/mm/swapfile.c
===================================================================
--- linux-2.6.15.orig/mm/swapfile.c	2006-01-10 09:43:04.000000000 -0800
+++ linux-2.6.15/mm/swapfile.c	2006-01-10 14:31:39.000000000 -0800
@@ -644,6 +644,7 @@ static int try_to_unuse(unsigned int typ
 		 */
 		swap_map = &si->swap_map[i];
 		entry = swp_entry(type, i);
+again:
 		page = read_swap_cache_async(entry, NULL, 0);
 		if (!page) {
 			/*
@@ -678,6 +679,12 @@ static int try_to_unuse(unsigned int typ
 		wait_on_page_locked(page);
 		wait_on_page_writeback(page);
 		lock_page(page);
+		if (!PageSwapCache(page)) {
+		/* Page migration has occurred */
+			unlock_page(page);
+			page_cache_release(page);
+			goto again;
+		}
 		wait_on_page_writeback(page);
 
 		/*


* [PATCH 2/5] Direct Migration V9: migrate_pages() extension
  2006-01-10 22:41 [PATCH 0/5] Direct Migration V9: Overview Christoph Lameter
  2006-01-10 22:41 ` [PATCH 1/5] Direct Migration V9: PageSwapCache checks Christoph Lameter
@ 2006-01-10 22:41 ` Christoph Lameter
  2006-01-11  5:46   ` Andrew Morton
  2006-01-10 22:41 ` [PATCH 3/5] Direct Migration V9: remove_from_swap() to remove swap ptes Christoph Lameter
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Christoph Lameter @ 2006-01-10 22:41 UTC (permalink / raw)
  To: akpm
  Cc: Cliff Wickman, linux-kernel, Christoph Lameter, lhms-devel,
	Hirokazu Takahashi, KAMEZAWA Hiroyuki

Add direct migration support with fallback to swap.

Direct migration support on top of the swap based page migration facility.

This allows the direct migration of anonymous pages and the migration of
file backed pages by dropping the associated buffers (requires writeout).

Fall back to swap out if necessary.

The patch is based on lots of patches from the hotplug project but the code
was restructured, documented and simplified as much as possible.

Note that an additional patch that defines the migratepage() method
for filesystems is necessary in order to avoid writeback for anonymous
and file backed pages.
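
For illustration, a hedged sketch of a minimal kernel-side caller of the
interfaces this patch exports (move_one_page() is hypothetical and not
part of the patchset; it assumes isolate_lru_page() returns 1 on
success, as in this tree):

static int move_one_page(struct page *page, int dest_node)
{
	LIST_HEAD(pagelist);
	LIST_HEAD(newlist);
	LIST_HEAD(moved);
	LIST_HEAD(failed);
	struct page *newpage;
	int nr_left;

	newpage = alloc_pages_node(dest_node, GFP_HIGHUSER, 0);
	if (!newpage)
		return -ENOMEM;
	list_add(&newpage->lru, &newlist);

	/* Take the page off the LRU; this also pins it for us. */
	if (isolate_lru_page(page) != 1) {
		__free_page(newpage);
		return -EBUSY;
	}
	list_add(&page->lru, &pagelist);

	nr_left = migrate_pages(&pagelist, &newlist, &moved, &failed);

	/* Pages still on a list keep the reference taken by
	 * isolate_lru_page(); put them all back on the LRU. */
	putback_lru_pages(&pagelist);	/* retryable, not moved */
	putback_lru_pages(&failed);	/* permanent failures */
	putback_lru_pages(&moved);	/* old pages that were migrated */

	/* Free the new page if migrate_pages() never consumed it. */
	if (!list_empty(&newlist)) {
		list_del(&newpage->lru);
		__free_page(newpage);
	}
	return nr_left ? -EAGAIN : 0;
}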

V7->V8:
 - Fixed by KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>:
    copy raw page->index and page->mapping to new page.

V6->V7:
 - Patch against 2.6.15-rc5-mm1.
 - Replace one occurrence of page->mapping with page_mapping(page) in
   migrate_pages_remove_references()

V4->V5:
 - Patch against 2.6.15-rc2-mm1 + double unlock fix + consolidation patch

V3->V4:
- Remove components already in the swap migration patch

V1->V2:
- Change migrate_pages() so that it can return pagelist for failed and
  moved pages. No longer free the old pages but allow caller to dispose
  of them.
- Unmap pages before changing reverse map under tree lock. Take
  a write_lock instead of a read_lock.
- Add documentation

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.15/include/linux/swap.h
===================================================================
--- linux-2.6.15.orig/include/linux/swap.h	2006-01-10 09:43:04.000000000 -0800
+++ linux-2.6.15/include/linux/swap.h	2006-01-10 14:31:42.000000000 -0800
@@ -178,6 +178,9 @@ extern int vm_swappiness;
 #ifdef CONFIG_MIGRATION
 extern int isolate_lru_page(struct page *p);
 extern int putback_lru_pages(struct list_head *l);
+extern int migrate_page(struct page *, struct page *);
+extern int migrate_page_remove_references(struct page *, struct page *, int);
+extern void migrate_page_copy(struct page *, struct page *);
 extern int migrate_pages(struct list_head *l, struct list_head *t,
 		struct list_head *moved, struct list_head *failed);
 #endif
Index: linux-2.6.15/Documentation/vm/page_migration
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.15/Documentation/vm/page_migration	2006-01-10 14:31:42.000000000 -0800
@@ -0,0 +1,129 @@
+Page migration
+--------------
+
+Page migration allows the moving of the physical location of pages between
+nodes in a numa system while the process is running. This means that the
+virtual addresses that the process sees do not change. However, the
+system rearranges the physical location of those pages.
+
+The main intent of page migration is to reduce the latency of memory access
+by moving pages near to the processor where the process accessing that memory
+is running.
+
+Page migration allows a process to manually relocate the node on which its
+pages are located through the MPOL_MF_MOVE and MPOL_MF_MOVE_ALL options while
+setting a new memory policy. The pages of a process can also be relocated
+from another process using the sys_migrate_pages() function call, which
+takes two sets of nodes and moves pages of a process that are located
+on the source nodes to the destination nodes.
+
+Manual migration is very useful if, for example, the scheduler has relocated
+a process to a processor on a distant node. A batch scheduler or an
+administrator may detect the situation and move the pages of the process
+nearer to the new processor. At some point in the future we may have
+some mechanism in the scheduler that will automatically move the pages.
+
+Larger installations usually partition the system using cpusets into
+sections of nodes. Paul Jackson has equipped cpusets with the ability to
+move pages when a task is moved to another cpuset. This allows automatic
+control over the locality of a process. If a task is moved to a new cpuset
+then all its pages are moved with it so that the performance of the
+process does not degrade dramatically (as is the case today).
+
+Page migration allows all migration techniques to preserve the relative
+location of pages within a group of nodes, so that the memory allocation
+pattern a process has generated is kept intact even after the process is
+migrated. This is necessary in order to preserve the memory latencies.
+Processes will run with similar performance after migration.
+
+Page migration occurs in several steps. First comes a high level
+description for those trying to use migrate_pages() and then
+a low level description of how the details work.
+
+A. Use of migrate_pages()
+-------------------------
+
+1. Remove pages from the LRU.
+
+   Lists of pages to be migrated are generated by scanning over
+   pages and moving them into lists. This is done by
+   calling isolate_lru_page() or __isolate_lru_page().
+   Calling isolate_lru_page() increases the reference count of the
+   page so that it cannot vanish under us.
+
+2. Generate a list of newly allocated pages to move the contents
+   of the first list to.
+
+3. The migrate_pages() function is called which attempts
+   to do the migration. It returns the moved pages in the
+   list specified as the third parameter and the failed
+   migrations in the fourth parameter. The first parameter
+   will contain the pages that could still be retried.
+
+4. The leftover pages of various types are returned
+   to the LRU using putback_lru_pages() or otherwise
+   disposed of. The pages will still have the refcount as
+   increased by isolate_lru_page()!
+
+B. Operation of migrate_pages()
+--------------------------------
+
+migrate_pages() does several passes over its list of pages. A page is moved
+if all references to the page are removable at the time.
+
+Steps:
+
+1. Lock the page to be migrated
+
+2. Ensure that writeback is complete.
+
+3. Make sure that the page has an assigned swap cache entry if
+   it is an anonymous page. The swap cache reference is necessary
+   to preserve the information contained in the page table maps.
+
+4. Prep the new page that we want to move to. It is locked
+   and set to not being uptodate so that all accesses to the new
+   page immediately block while we are moving references.
+
+5. All the page table references to the page are either dropped (file backed)
+   or converted to swap references (anonymous pages). This should decrease the
+   reference count.
+
+6. The radix tree lock is taken
+
+7. The refcount of the page is examined and we back out if references remain
+   otherwise we know that we are the only one referencing this page.
+
+8. The radix tree is checked and if it does not contain the pointer to this
+   page then we back out.
+
+9. The mapping is checked. If the mapping is gone then a truncate action may
+   be in progress and we back out.
+
+10. The new page is prepped with some settings from the old page so that
+    accesses to the new page will find the correct settings.
+
+11. The radix tree is changed to point to the new page.
+
+12. The reference count of the old page is dropped because the reference has now
+    been removed.
+
+13. The radix tree lock is dropped.
+
+14. The page contents are copied to the new page.
+
+15. The remaining page flags are copied to the new page.
+
+16. The old page flags are cleared to indicate that the page does
+    not carry any useful information anymore.
+
+17. Queued up writeback on the new page is triggered.
+
+18. If swap ptes were generated for the page then they are removed again.
+
+19. The locks are dropped from the old and new page.
+
+20. The new page is moved to the LRU.
+
+Christoph Lameter, December 19, 2005.
+
Index: linux-2.6.15/mm/vmscan.c
===================================================================
--- linux-2.6.15.orig/mm/vmscan.c	2006-01-10 09:43:04.000000000 -0800
+++ linux-2.6.15/mm/vmscan.c	2006-01-10 14:31:42.000000000 -0800
@@ -649,6 +649,164 @@ retry:
 	return -EAGAIN;
 }
 /*
+ * Page migration was first developed in the context of the memory hotplug
+ * project. The main authors of the migration code are:
+ *
+ * IWAMOTO Toshihiro <iwamoto@valinux.co.jp>
+ * Hirokazu Takahashi <taka@valinux.co.jp>
+ * Dave Hansen <haveblue@us.ibm.com>
+ * Christoph Lameter <clameter@sgi.com>
+ */
+
+/*
+ * Remove references for a page and establish the new page with the correct
+ * basic settings to be able to stop accesses to the page.
+ */
+int migrate_page_remove_references(struct page *newpage, struct page *page, int nr_refs)
+{
+	struct address_space *mapping = page_mapping(page);
+	struct page **radix_pointer;
+	int i;
+
+	/*
+	 * Avoid doing any of the following work if the page count
+	 * indicates that the page is in use or truncate has removed
+	 * the page.
+	 */
+	if (!mapping || page_mapcount(page) + nr_refs != page_count(page))
+		return 1;
+
+	/*
+	 * Establish swap ptes for anonymous pages or destroy pte
+	 * maps for files.
+	 *
+	 * In order to reestablish file backed mappings the fault handlers
+	 * will take the radix tree_lock which may then be used to stop
+	 * processes from accessing this page until the new page is ready.
+	 *
+	 * A process accessing via a swap pte (an anonymous page) will take a
+	 * page_lock on the old page which will block the process until the
+	 * migration attempt is complete. At that time the PageSwapCache bit
+	 * will be examined. If the page was migrated then the PageSwapCache
+	 * bit will be clear and the operation to retrieve the page will be
+	 * retried which will find the new page in the radix tree. Then a new
+	 * direct mapping may be generated based on the radix tree contents.
+	 *
+	 * If the page was not migrated then the PageSwapCache bit
+	 * is still set and the operation may continue.
+	 */
+	for (i = 0; i < 10 && page_mapped(page); i++) {
+		int rc = try_to_unmap(page);
+
+		if (rc == SWAP_SUCCESS)
+			break;
+		/*
+		 * If there are other runnable processes then running
+		 * them may make it possible to unmap the page
+		 */
+		schedule();
+	}
+
+	/*
+	 * Give up if we were unable to remove all mappings.
+	 */
+	if (page_mapcount(page))
+		return 1;
+
+	write_lock_irq(&mapping->tree_lock);
+
+	radix_pointer = (struct page **)radix_tree_lookup_slot(
+						&mapping->page_tree,
+						page_index(page));
+
+	if (!page_mapping(page) ||
+	    page_count(page) != nr_refs ||
+	    *radix_pointer != page) {
+		write_unlock_irq(&mapping->tree_lock);
+		return 1;
+	}
+
+	/*
+	 * Now we know that no one else is looking at the page.
+	 *
+	 * Certain minimal information about a page must be available
+	 * in order for other subsystems to properly handle the page if they
+	 * find it through the radix tree update before we are finished
+	 * copying the page.
+	 */
+	get_page(newpage);
+	newpage->index = page->index;
+	newpage->mapping = page->mapping;
+	if (PageSwapCache(page)) {
+		SetPageSwapCache(newpage);
+		set_page_private(newpage, page_private(page));
+	}
+
+	*radix_pointer = newpage;
+	__put_page(page);
+	write_unlock_irq(&mapping->tree_lock);
+
+	return 0;
+}
+
+/*
+ * Copy the page to its new location
+ */
+void migrate_page_copy(struct page *newpage, struct page *page)
+{
+	copy_highpage(newpage, page);
+
+	if (PageError(page))
+		SetPageError(newpage);
+	if (PageReferenced(page))
+		SetPageReferenced(newpage);
+	if (PageUptodate(page))
+		SetPageUptodate(newpage);
+	if (PageActive(page))
+		SetPageActive(newpage);
+	if (PageChecked(page))
+		SetPageChecked(newpage);
+	if (PageMappedToDisk(page))
+		SetPageMappedToDisk(newpage);
+
+	if (PageDirty(page)) {
+		clear_page_dirty_for_io(page);
+		set_page_dirty(newpage);
+	}
+
+	ClearPageSwapCache(page);
+	ClearPageActive(page);
+	ClearPagePrivate(page);
+	set_page_private(page, 0);
+	page->mapping = NULL;
+
+	/*
+	 * If any waiters have accumulated on the new page then
+	 * wake them up.
+	 */
+	if (PageWriteback(newpage))
+		end_page_writeback(newpage);
+}
+
+/*
+ * Common logic to directly migrate a single page suitable for
+ * pages that do not use PagePrivate.
+ *
+ * Pages are locked upon entry and exit.
+ */
+int migrate_page(struct page *newpage, struct page *page)
+{
+	BUG_ON(PageWriteback(page));	/* Writeback must be complete */
+
+	if (migrate_page_remove_references(newpage, page, 2))
+		return -EAGAIN;
+
+	migrate_page_copy(newpage, page);
+
+	return 0;
+}
+
+/*
  * migrate_pages
  *
  * Two lists are passed to this function. The first list
@@ -661,11 +819,6 @@ retry:
  * are movable anymore because t has become empty
  * or no retryable pages exist anymore.
  *
- * SIMPLIFIED VERSION: This implementation of migrate_pages
- * is only swapping out pages and never touches the second
- * list. The direct migration patchset
- * extends this function to avoid the use of swap.
- *
  * Return: Number of pages not migrated when "to" ran empty.
  */
 int migrate_pages(struct list_head *from, struct list_head *to,
@@ -686,6 +839,9 @@ redo:
 	retry = 0;
 
 	list_for_each_entry_safe(page, page2, from, lru) {
+		struct page *newpage = NULL;
+		struct address_space *mapping;
+
 		cond_resched();
 
 		rc = 0;
@@ -693,6 +849,9 @@ redo:
 			/* page was freed from under us. So we are done. */
 			goto next;
 
+		if (to && list_empty(to))
+			break;
+
 		/*
 		 * Skip locked pages during the first two passes to give the
 		 * functions holding the lock time to release the page. Later we
@@ -729,12 +888,64 @@ redo:
 			}
 		}
 
+		if (!to) {
+			rc = swap_page(page);
+			goto next;
+		}
+
+		newpage = lru_to_page(to);
+		lock_page(newpage);
+
 		/*
-		 * Page is properly locked and writeback is complete.
+		 * Pages are properly locked and writeback is complete.
 		 * Try to migrate the page.
 		 */
-		rc = swap_page(page);
-		goto next;
+		mapping = page_mapping(page);
+		if (!mapping)
+			goto unlock_both;
+
+		/*
+		 * Trigger writeout if page is dirty
+		 */
+		if (PageDirty(page)) {
+			switch (pageout(page, mapping)) {
+			case PAGE_KEEP:
+			case PAGE_ACTIVATE:
+				goto unlock_both;
+
+			case PAGE_SUCCESS:
+				unlock_page(newpage);
+				goto next;
+
+			case PAGE_CLEAN:
+				; /* try to migrate the page below */
+			}
+		}
+		/*
+		 * If we have no buffer or can release the buffer
+		 * then do a simple migration.
+		 */
+		if (!page_has_buffers(page) ||
+		    try_to_release_page(page, GFP_KERNEL)) {
+			rc = migrate_page(newpage, page);
+			goto unlock_both;
+		}
+
+		/*
+		 * On early passes with mapped pages simply
+		 * retry. There may be a lock held for some
+		 * buffers that may go away. Later
+		 * swap them out.
+		 */
+		if (pass > 4) {
+			unlock_page(newpage);
+			newpage = NULL;
+			rc = swap_page(page);
+			goto next;
+		}
+
+unlock_both:
+		unlock_page(newpage);
 
 unlock_page:
 		unlock_page(page);
@@ -747,7 +958,10 @@ next:
 			list_move(&page->lru, failed);
 			nr_failed++;
 		} else {
-			/* Success */
+			if (newpage)
+				/* Successful migration. Return new page to LRU */
+				move_to_lru(newpage);
+
 			list_move(&page->lru, moved);
 		}
 	}


* [PATCH 3/5] Direct Migration V9: remove_from_swap() to remove swap ptes
  2006-01-10 22:41 [PATCH 0/5] Direct Migration V9: Overview Christoph Lameter
  2006-01-10 22:41 ` [PATCH 1/5] Direct Migration V9: PageSwapCache checks Christoph Lameter
  2006-01-10 22:41 ` [PATCH 2/5] Direct Migration V9: migrate_pages() extension Christoph Lameter
@ 2006-01-10 22:41 ` Christoph Lameter
  2006-01-10 22:41 ` [PATCH 4/5] Direct Migration V9: upgrade MPOL_MF_MOVE and sys_migrate_pages() Christoph Lameter
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Christoph Lameter @ 2006-01-10 22:41 UTC (permalink / raw)
  To: akpm
  Cc: Cliff Wickman, linux-kernel, Christoph Lameter, lhms-devel,
	Hirokazu Takahashi, KAMEZAWA Hiroyuki

Add remove_from_swap()

remove_from_swap() allows the restoration of the pte entries that existed
before page migration occurred for anonymous pages by walking the reverse
maps. This reduces swap use and establishes regular ptes without the need
for page faults.

V7->V8:
- Move the removing of the page from the swap entries and from the swap
  cache to migrate page so that it can be done while the page lock
  on the new page is held.
- Unlock anon_vma
- Remove the page from the page cache

V5->V6:
- Somehow V5 did a remove_from_swap for the old page. Changed to new
  page

V3->V4:
- Add new function remove_vma_swap in swapfile.c to encapsulate
  the functionality needed instead of exporting unuse_vma.
- Add #ifdef CONFIG_MIGRATION

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.15/include/linux/swap.h
===================================================================
--- linux-2.6.15.orig/include/linux/swap.h	2006-01-10 14:31:42.000000000 -0800
+++ linux-2.6.15/include/linux/swap.h	2006-01-10 14:31:45.000000000 -0800
@@ -231,6 +231,9 @@ extern int remove_exclusive_swap_page(st
 struct backing_dev_info;
 
 extern spinlock_t swap_lock;
+#ifdef CONFIG_MIGRATION
+extern int remove_vma_swap(struct vm_area_struct *vma, struct page *page);
+#endif
 
 /* linux/mm/thrash.c */
 extern struct mm_struct * swap_token_mm;
Index: linux-2.6.15/mm/swapfile.c
===================================================================
--- linux-2.6.15.orig/mm/swapfile.c	2006-01-10 14:31:39.000000000 -0800
+++ linux-2.6.15/mm/swapfile.c	2006-01-10 14:31:45.000000000 -0800
@@ -552,6 +552,16 @@ static int unuse_mm(struct mm_struct *mm
 	return 0;
 }
 
+#ifdef CONFIG_MIGRATION
+int remove_vma_swap(struct vm_area_struct *vma, struct page *page)
+{
+	swp_entry_t entry = { .val = page_private(page) };
+
+	return unuse_vma(vma, entry, page);
+}
+#endif
+
+
 /*
  * Scan swap_map from current position to next entry still in use.
  * Recycle to start on reaching the end, returning 0 when empty.
Index: linux-2.6.15/mm/rmap.c
===================================================================
--- linux-2.6.15.orig/mm/rmap.c	2006-01-10 09:43:04.000000000 -0800
+++ linux-2.6.15/mm/rmap.c	2006-01-10 14:31:45.000000000 -0800
@@ -205,6 +205,35 @@ out:
 	return anon_vma;
 }
 
+#ifdef CONFIG_MIGRATION
+/*
+ * Remove an anonymous page from swap replacing the swap pte's
+ * through real pte's pointing to valid pages and then releasing
+ * the page from the swap cache.
+ *
+ * Must hold page lock on page.
+ */
+void remove_from_swap(struct page *page)
+{
+	struct anon_vma *anon_vma;
+	struct vm_area_struct *vma;
+
+	if (!PageAnon(page) || !PageSwapCache(page))
+		return;
+
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		return;
+
+	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
+		remove_vma_swap(vma, page);
+
+	spin_unlock(&anon_vma->lock);
+
+	delete_from_swap_cache(page);
+}
+#endif
+
 /*
  * At what user virtual address is page expected in vma?
  */
Index: linux-2.6.15/include/linux/rmap.h
===================================================================
--- linux-2.6.15.orig/include/linux/rmap.h	2006-01-10 09:43:04.000000000 -0800
+++ linux-2.6.15/include/linux/rmap.h	2006-01-10 14:31:45.000000000 -0800
@@ -92,6 +92,9 @@ static inline void page_dup_rmap(struct 
  */
 int page_referenced(struct page *, int is_locked);
 int try_to_unmap(struct page *);
+#ifdef CONFIG_MIGRATION
+void remove_from_swap(struct page *page);
+#endif
 
 /*
  * Called from mm/filemap_xip.c to unmap empty zero page
Index: linux-2.6.15/mm/vmscan.c
===================================================================
--- linux-2.6.15.orig/mm/vmscan.c	2006-01-10 14:31:42.000000000 -0800
+++ linux-2.6.15/mm/vmscan.c	2006-01-10 14:31:45.000000000 -0800
@@ -803,6 +803,15 @@ int migrate_page(struct page *newpage, s
 
 	migrate_page_copy(newpage, page);
 
+	/*
+	 * Remove auxiliary swap entries and replace
+	 * them with real ptes.
+	 *
+	 * Note that a real pte entry will allow processes that are not
+	 * waiting on the page lock to use the new page via the page tables
+	 * before the new page is unlocked.
+	 */
+	remove_from_swap(newpage);
 	return 0;
 }
 


* [PATCH 4/5] Direct Migration V9: upgrade MPOL_MF_MOVE and sys_migrate_pages()
  2006-01-10 22:41 [PATCH 0/5] Direct Migration V9: Overview Christoph Lameter
                   ` (2 preceding siblings ...)
  2006-01-10 22:41 ` [PATCH 3/5] Direct Migration V9: remove_from_swap() to remove swap ptes Christoph Lameter
@ 2006-01-10 22:41 ` Christoph Lameter
  2006-01-10 22:41 ` [PATCH 5/5] Direct Migration V9: Avoid writeback / page_migrate() method Christoph Lameter
  2006-01-11  3:26 ` [PATCH 0/5] Direct Migration V9: Overview KAMEZAWA Hiroyuki
  5 siblings, 0 replies; 16+ messages in thread
From: Christoph Lameter @ 2006-01-10 22:41 UTC (permalink / raw)
  To: akpm
  Cc: Cliff Wickman, linux-kernel, Christoph Lameter, lhms-devel,
	Hirokazu Takahashi, KAMEZAWA Hiroyuki

Modify policy layer to support direct page migration

- Add migrate_pages_to() allowing the migration of a list of pages to a
  specified node or to a vma with a specific allocation policy, in sets
  of MIGRATE_CHUNK_SIZE pages

- Modify do_migrate_pages() to do a staged move of pages from the
  source nodes to the target nodes.
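
As a worked illustration (example values mine, not from the original
posting): with from_nodes = {0,1} and to_nodes = {2,3}, node_remap()
pairs nodes by their ordinal position in the two masks, so source 0
maps to destination 2 and source 1 to destination 3. do_migrate_pages()
first picks a pair whose destination is not itself a pending source
(0 -> 2 qualifies, since node 2 is not in the remaining source set),
migrates it, clears node 0 from the working mask, and then moves
1 -> 3. Preferring such "empty" destinations avoids piling incoming
pages onto a node that still has outgoing pages to shed.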

V3->V4: Fixed up to be based on the swap migration code in 2.6.15-rc1-mm2.

V1->V2:
- Migrate processes in chunks of MIGRATE_CHUNK_SIZE

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.15/mm/mempolicy.c
===================================================================
--- linux-2.6.15.orig/mm/mempolicy.c	2006-01-10 09:43:04.000000000 -0800
+++ linux-2.6.15/mm/mempolicy.c	2006-01-10 14:31:49.000000000 -0800
@@ -95,6 +95,9 @@
 #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
 #define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2)		/* Gather statistics */
 
+/* The number of pages to migrate per call to migrate_pages() */
+#define MIGRATE_CHUNK_SIZE 256
+
 static kmem_cache_t *policy_cache;
 static kmem_cache_t *sn_cache;
 
@@ -565,24 +568,96 @@ static void migrate_page_add(struct vm_a
 	}
 }
 
-static int swap_pages(struct list_head *pagelist)
+/*
+ * Migrate the list 'pagelist' of pages to a certain destination.
+ *
+ * Specify the destination with either a non-NULL vma or dest >= 0.
+ * Return the number of pages not migrated or an error code.
+ */
+static int migrate_pages_to(struct list_head *pagelist,
+	struct vm_area_struct *vma, int dest)
 {
+	LIST_HEAD(newlist);
 	LIST_HEAD(moved);
 	LIST_HEAD(failed);
-	int n;
+	int err = 0;
+	int nr_pages;
+	struct page *page;
+	struct list_head *p;
 
-	n = migrate_pages(pagelist, NULL, &moved, &failed);
-	putback_lru_pages(&failed);
-	putback_lru_pages(&moved);
+redo:
+	nr_pages = 0;
+	list_for_each(p, pagelist) {
+		if (vma)
+			page = alloc_page_vma(GFP_HIGHUSER, vma,
+						vma->vm_start);
+		else
+			page = alloc_pages_node(dest, GFP_HIGHUSER, 0);
 
-	return n;
+		if (!page) {
+			err = -ENOMEM;
+			goto out;
+		}
+		list_add(&page->lru, &newlist);
+		nr_pages++;
+		if (nr_pages > MIGRATE_CHUNK_SIZE)
+			break;
+	}
+	err = migrate_pages(pagelist, &newlist, &moved, &failed);
+
+	putback_lru_pages(&moved);	/* Call release pages instead ?? */
+
+	if (err >= 0 && list_empty(&newlist) && !list_empty(pagelist))
+		goto redo;
+out:
+	/* Return leftover allocated pages */
+	while (!list_empty(&newlist)) {
+		page = list_entry(newlist.next, struct page, lru);
+		list_del(&page->lru);
+		__free_page(page);
+	}
+	list_splice(&failed, pagelist);
+	if (err < 0)
+		return err;
+
+	/* Calculate number of leftover pages */
+	nr_pages = 0;
+	list_for_each(p, pagelist)
+		nr_pages++;
+	return nr_pages;
 }
 
 /*
- * For now migrate_pages simply swaps out the pages from nodes that are in
- * the source set but not in the target set. In the future, we would
- * want a function that moves pages between the two nodesets in such
- * a way as to preserve the physical layout as much as possible.
+ * Migrate pages from one node to a target node.
+ * Returns error or the number of pages not migrated.
+ */
+int migrate_to_node(struct mm_struct *mm, int source, int dest, int flags)
+{
+	nodemask_t nmask;
+	LIST_HEAD(pagelist);
+	int err = 0;
+
+	nodes_clear(nmask);
+	node_set(source, nmask);
+
+	check_range(mm, mm->mmap->vm_start, TASK_SIZE, &nmask,
+		    flags | MPOL_MF_DISCONTIG_OK,
+	            &pagelist);
+
+	if (!list_empty(&pagelist)) {
+
+		err = migrate_pages_to(&pagelist, NULL, dest);
+
+		if (!list_empty(&pagelist))
+			putback_lru_pages(&pagelist);
+
+	}
+	return err;
+}
+
+/*
+ * Move pages between the two nodesets so as to preserve the physical
+ * layout as much as possible.
  *
  * Returns the number of page that could not be moved.
  */
@@ -590,22 +665,76 @@ int do_migrate_pages(struct mm_struct *m
 	const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags)
 {
 	LIST_HEAD(pagelist);
-	int count = 0;
-	nodemask_t nodes;
+	int busy = 0;
+	int err = 0;
+	nodemask_t tmp;
 
-	nodes_andnot(nodes, *from_nodes, *to_nodes);
+	down_read(&mm->mmap_sem);
 
-	down_read(&mm->mmap_sem);
-	check_range(mm, mm->mmap->vm_start, TASK_SIZE, &nodes,
-			flags | MPOL_MF_DISCONTIG_OK, &pagelist);
+/* Find a 'source' bit set in 'tmp' whose corresponding 'dest'
+ * bit in 'to' is not also set in 'tmp'.  Clear the found 'source'
+ * bit in 'tmp', and return that <source, dest> pair for migration.
+ * The pair of nodemasks 'to' and 'from' define the map.
+ *
+ * If no pair of bits is found that way, fallback to picking some
+ * pair of 'source' and 'dest' bits that are not the same.  If the
+ * 'source' and 'dest' bits are the same, this represents a node
+ * that will be migrating to itself, so no pages need move.
+ *
+ * If no bits are left in 'tmp', or if all remaining bits left
+ * in 'tmp' correspond to the same bit in 'to', return false
+ * (nothing left to migrate).
+ *
+ * This lets us pick a pair of nodes to migrate between, such that
+ * if possible the dest node is not already occupied by some other
+ * source node, minimizing the risk of overloading the memory on a
+ * node that would happen if we migrated incoming memory to a node
+ * before migrating outgoing memory source that same node.
+ *
+ * A single scan of tmp is sufficient.  As we go, we remember the
+ * most recent <s, d> pair that moved (s != d).  If we find a pair
+ * that not only moved, but what's better, moved to an empty slot
+ * (d is not set in tmp), then we break out then, with that pair.
+ * Otherwise when we finish scanning tmp, we at least have the
+ * most recent <s, d> pair that moved.  If we get all the way through
+ * the scan of tmp without finding any node that moved, much less
+ * moved to an empty node, then there is nothing left worth migrating.
+ */
 
-	if (!list_empty(&pagelist)) {
-		count = swap_pages(&pagelist);
-		putback_lru_pages(&pagelist);
+	tmp = *from_nodes;
+	while (!nodes_empty(tmp)) {
+		int s, d;
+		int source = -1;
+		int dest = 0;
+
+		for_each_node_mask(s, tmp) {
+
+			d = node_remap(s, *from_nodes, *to_nodes);
+			if (s == d)
+				continue;
+
+			source = s;	/* Node moved. Memorize */
+			dest = d;
+
+			/* dest not in remaining from nodes? */
+			if (!node_isset(dest, tmp))
+				break;
+		}
+		if (source == -1)
+			break;
+
+		node_clear(source, tmp);
+		err = migrate_to_node(mm, source, dest, flags);
+		if (err > 0)
+			busy += err;
+		if (err < 0)
+			break;
 	}
 
 	up_read(&mm->mmap_sem);
-	return count;
+	if (err < 0)
+		return err;
+	return busy;
 }
 
 long do_mbind(unsigned long start, unsigned long len,
@@ -665,8 +794,9 @@ long do_mbind(unsigned long start, unsig
 		int nr_failed = 0;
 
 		err = mbind_range(vma, start, end, new);
+
 		if (!list_empty(&pagelist))
-			nr_failed = swap_pages(&pagelist);
+			nr_failed = migrate_pages_to(&pagelist, vma, -1);
 
 		if (!err && nr_failed && (flags & MPOL_MF_STRICT))
 			err = -EIO;


* [PATCH 5/5] Direct Migration V9: Avoid writeback / page_migrate() method
  2006-01-10 22:41 [PATCH 0/5] Direct Migration V9: Overview Christoph Lameter
                   ` (3 preceding siblings ...)
  2006-01-10 22:41 ` [PATCH 4/5] Direct Migration V9: upgrade MPOL_MF_MOVE and sys_migrate_pages() Christoph Lameter
@ 2006-01-10 22:41 ` Christoph Lameter
  2006-01-11  6:03   ` Andrew Morton
  2006-01-11  6:25   ` Andrew Morton
  2006-01-11  3:26 ` [PATCH 0/5] Direct Migration V9: Overview KAMEZAWA Hiroyuki
  5 siblings, 2 replies; 16+ messages in thread
From: Christoph Lameter @ 2006-01-10 22:41 UTC (permalink / raw)
  To: akpm
  Cc: Cliff Wickman, linux-kernel, Christoph Lameter, lhms-devel,
	Hirokazu Takahashi, KAMEZAWA Hiroyuki

Migrate a page with buffers without requiring writeback

This introduces a new address space operation migratepage() that
may be used by a filesystem to implement its own version of page migration.

A version is provided that migrates buffers attached to pages. Some
filesystems (ext2, ext3, xfs) are modified to utilize this feature.

The swapper address space operations are modified so that a regular
migrate_page() will occur for anonymous pages without writeback
(migrate_pages() forces every anonymous page to have a swap entry).
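
For a filesystem author, opting in is a one-line addition to the
address_space_operations. A sketch (examplefs is hypothetical; only the
migration-relevant method is shown):

static struct address_space_operations examplefs_aops = {
	.readpage	= simple_readpage,
	/* move the attached buffers over, no writeback required */
	.migratepage	= buffer_migrate_page,
};

A filesystem whose pages never carry private data could set
.migratepage = migrate_page instead, as the swapper address space
does below.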

V7->V8:
- Export more functions in order for loadable filesystems to be able
  to define their own migration function.

V2->V3:
- export functions for filesystems that are modules and for modules that
  perform migration by calling migrate_pages().
- Fix macro name clash. Fix build on UP and systems without CONFIG_MIGRATION

V1->V2:
- Fix CONFIG_MIGRATION handling

Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.15/include/linux/fs.h
===================================================================
--- linux-2.6.15.orig/include/linux/fs.h	2006-01-10 09:43:04.000000000 -0800
+++ linux-2.6.15/include/linux/fs.h	2006-01-10 14:31:54.000000000 -0800
@@ -369,6 +369,8 @@ struct address_space_operations {
 			loff_t offset, unsigned long nr_segs);
 	struct page* (*get_xip_page)(struct address_space *, sector_t,
 			int);
+	/* migrate the contents of a page to the specified target */
+	int (*migratepage) (struct page *, struct page *);
 };
 
 struct backing_dev_info;
@@ -1713,6 +1715,12 @@ extern void simple_release_fs(struct vfs
 
 extern ssize_t simple_read_from_buffer(void __user *, size_t, loff_t *, const void *, size_t);
 
+#ifdef CONFIG_MIGRATION
+extern int buffer_migrate_page(struct page *, struct page *);
+#else
+#define buffer_migrate_page NULL
+#endif
+
 extern int inode_change_ok(struct inode *, struct iattr *);
 extern int __must_check inode_setattr(struct inode *, struct iattr *);
 
Index: linux-2.6.15/mm/swap_state.c
===================================================================
--- linux-2.6.15.orig/mm/swap_state.c	2006-01-10 09:43:04.000000000 -0800
+++ linux-2.6.15/mm/swap_state.c	2006-01-10 14:31:54.000000000 -0800
@@ -27,6 +27,7 @@ static struct address_space_operations s
 	.writepage	= swap_writepage,
 	.sync_page	= block_sync_page,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
+	.migratepage	= migrate_page,
 };
 
 static struct backing_dev_info swap_backing_dev_info = {
Index: linux-2.6.15/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.15.orig/fs/xfs/linux-2.6/xfs_aops.c	2006-01-02 19:21:10.000000000 -0800
+++ linux-2.6.15/fs/xfs/linux-2.6/xfs_aops.c	2006-01-10 14:31:54.000000000 -0800
@@ -1347,4 +1347,5 @@ struct address_space_operations linvfs_a
 	.commit_write		= generic_commit_write,
 	.bmap			= linvfs_bmap,
 	.direct_IO		= linvfs_direct_IO,
+	.migratepage		= buffer_migrate_page,
 };
Index: linux-2.6.15/fs/buffer.c
===================================================================
--- linux-2.6.15.orig/fs/buffer.c	2006-01-10 09:43:04.000000000 -0800
+++ linux-2.6.15/fs/buffer.c	2006-01-10 14:31:54.000000000 -0800
@@ -3049,6 +3049,71 @@ asmlinkage long sys_bdflush(int func, lo
 }
 
 /*
+ * Migration function for pages with buffers. This function can only be used
+ * if the underlying filesystem guarantees that no other references to "page"
+ * exist.
+ */
+#ifdef CONFIG_MIGRATION
+int buffer_migrate_page(struct page *newpage, struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	struct buffer_head *bh, *head;
+
+	if (!mapping)
+		return -EAGAIN;
+
+	if (!page_has_buffers(page))
+		return migrate_page(newpage, page);
+
+	head = page_buffers(page);
+
+	if (migrate_page_remove_references(newpage, page, 3))
+		return -EAGAIN;
+
+	spin_lock(&mapping->private_lock);
+
+	bh = head;
+	do {
+		get_bh(bh);
+		lock_buffer(bh);
+		bh = bh->b_this_page;
+
+	} while (bh != head);
+
+	ClearPagePrivate(page);
+	set_page_private(newpage, page_private(page));
+	set_page_private(page, 0);
+	put_page(page);
+	get_page(newpage);
+
+	bh = head;
+	do {
+		set_bh_page(bh, newpage, bh_offset(bh));
+		bh = bh->b_this_page;
+
+	} while (bh != head);
+
+	SetPagePrivate(newpage);
+	spin_unlock(&mapping->private_lock);
+
+	migrate_page_copy(newpage, page);
+
+	spin_lock(&mapping->private_lock);
+	bh = head;
+	do {
+		unlock_buffer(bh);
+		put_bh(bh);
+		bh = bh->b_this_page;
+
+	} while (bh != head);
+	spin_unlock(&mapping->private_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(buffer_migrate_page);
+#endif
+
+/*
  * Buffer-head allocation
  */
 static kmem_cache_t *bh_cachep;
Index: linux-2.6.15/fs/ext3/inode.c
===================================================================
--- linux-2.6.15.orig/fs/ext3/inode.c	2006-01-02 19:21:10.000000000 -0800
+++ linux-2.6.15/fs/ext3/inode.c	2006-01-10 14:31:54.000000000 -0800
@@ -1559,6 +1559,7 @@ static struct address_space_operations e
 	.invalidatepage	= ext3_invalidatepage,
 	.releasepage	= ext3_releasepage,
 	.direct_IO	= ext3_direct_IO,
+	.migratepage	= buffer_migrate_page,
 };
 
 static struct address_space_operations ext3_writeback_aops = {
@@ -1572,6 +1573,7 @@ static struct address_space_operations e
 	.invalidatepage	= ext3_invalidatepage,
 	.releasepage	= ext3_releasepage,
 	.direct_IO	= ext3_direct_IO,
+	.migratepage	= buffer_migrate_page,
 };
 
 static struct address_space_operations ext3_journalled_aops = {
Index: linux-2.6.15/fs/ext2/inode.c
===================================================================
--- linux-2.6.15.orig/fs/ext2/inode.c	2006-01-02 19:21:10.000000000 -0800
+++ linux-2.6.15/fs/ext2/inode.c	2006-01-10 14:31:54.000000000 -0800
@@ -706,6 +706,7 @@ struct address_space_operations ext2_aop
 	.bmap			= ext2_bmap,
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
+	.migratepage		= buffer_migrate_page,
 };
 
 struct address_space_operations ext2_aops_xip = {
@@ -723,6 +724,7 @@ struct address_space_operations ext2_nob
 	.bmap			= ext2_bmap,
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
+	.migratepage		= buffer_migrate_page,
 };
 
 /*
Index: linux-2.6.15/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- linux-2.6.15.orig/fs/xfs/linux-2.6/xfs_buf.c	2006-01-02 19:21:10.000000000 -0800
+++ linux-2.6.15/fs/xfs/linux-2.6/xfs_buf.c	2006-01-10 14:31:54.000000000 -0800
@@ -1568,6 +1568,7 @@ xfs_mapping_buftarg(
 	struct address_space	*mapping;
 	static struct address_space_operations mapping_aops = {
 		.sync_page = block_sync_page,
+		.migratepage = fail_migrate_page,
 	};
 
 	inode = new_inode(bdev->bd_inode->i_sb);
Index: linux-2.6.15/mm/vmscan.c
===================================================================
--- linux-2.6.15.orig/mm/vmscan.c	2006-01-10 14:31:45.000000000 -0800
+++ linux-2.6.15/mm/vmscan.c	2006-01-10 14:31:54.000000000 -0800
@@ -604,6 +604,15 @@ int putback_lru_pages(struct list_head *
 }
 
 /*
+ * Non migratable page
+ */
+int fail_migrate_page(struct page *newpage, struct page *page)
+{
+	return -EIO;
+}
+EXPORT_SYMBOL(fail_migrate_page);
+
+/*
  * swapout a single page
  * page is locked upon entry, unlocked on exit
  */
@@ -648,6 +657,8 @@ unlock_retry:
 retry:
 	return -EAGAIN;
 }
+EXPORT_SYMBOL(swap_page);
+
 /*
  * Page migration was first developed in the context of the memory hotplug
  * project. The main authors of the migration code are:
@@ -748,6 +759,7 @@ int migrate_page_remove_references(struc
 
 	return 0;
 }
+EXPORT_SYMBOL(migrate_page_remove_references);
 
 /*
  * Copy the page to its new location
@@ -787,6 +799,7 @@ void migrate_page_copy(struct page *newp
 	if (PageWriteback(newpage))
 		end_page_writeback(newpage);
 }
+EXPORT_SYMBOL(migrate_page_copy);
 
 /*
  * Common logic to directly migrate a single page suitable for
@@ -814,6 +827,7 @@ int migrate_page(struct page *newpage, s
 	remove_from_swap(newpage);
 	return 0;
 }
+EXPORT_SYMBOL(migrate_page);
 
 /*
  * migrate_pages
@@ -913,6 +927,11 @@ redo:
 		if (!mapping)
 			goto unlock_both;
 
+		if (mapping->a_ops->migratepage) {
+			rc = mapping->a_ops->migratepage(newpage, page);
+			goto unlock_both;
+		}
+
 		/*
 		 * Trigger writeout if page is dirty
 		 */
@@ -982,6 +1001,7 @@ next:
 
 	return nr_failed + retry;
 }
+EXPORT_SYMBOL(migrate_pages);
 
 static void lru_add_drain_per_cpu(void *dummy)
 {
@@ -1024,6 +1044,7 @@ redo:
 	}
 	return rc;
 }
+EXPORT_SYMBOL(isolate_lru_page);
 #endif
 
 /*
Index: linux-2.6.15/include/linux/swap.h
===================================================================
--- linux-2.6.15.orig/include/linux/swap.h	2006-01-10 14:31:45.000000000 -0800
+++ linux-2.6.15/include/linux/swap.h	2006-01-10 14:31:54.000000000 -0800
@@ -183,6 +183,11 @@ extern int migrate_page_remove_reference
 extern void migrate_page_copy(struct page *, struct page *);
 extern int migrate_pages(struct list_head *l, struct list_head *t,
 		struct list_head *moved, struct list_head *failed);
+extern int fail_migrate_page(struct page *, struct page *);
+#else
+/* Possible settings for the migratepage() method in address_space_operations */
+#define migrate_page NULL
+#define fail_migrate_page NULL
 #endif
 
 #ifdef CONFIG_MMU
Index: linux-2.6.15/mm/rmap.c
===================================================================
--- linux-2.6.15.orig/mm/rmap.c	2006-01-10 14:31:45.000000000 -0800
+++ linux-2.6.15/mm/rmap.c	2006-01-10 14:31:54.000000000 -0800
@@ -232,6 +232,7 @@ void remove_from_swap(struct page *page)
 
 	delete_from_swap_cache(page);
 }
+EXPORT_SYMBOL(remove_from_swap);
 #endif
 
 /*


* Re: [PATCH 0/5] Direct Migration V9: Overview
  2006-01-10 22:41 [PATCH 0/5] Direct Migration V9: Overview Christoph Lameter
                   ` (4 preceding siblings ...)
  2006-01-10 22:41 ` [PATCH 5/5] Direct Migration V9: Avoid writeback / page_migrate() method Christoph Lameter
@ 2006-01-11  3:26 ` KAMEZAWA Hiroyuki
  2006-01-11  6:10   ` Christoph Lameter
  2006-01-11  6:18   ` Christoph Lameter
  5 siblings, 2 replies; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-01-11  3:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Cliff Wickman, linux-kernel, lhms-devel, Hirokazu Takahashi

Hi,
Christoph Lameter wrote:
> Swap migration is now in Linus' tree. So this is the first direct page
Congratz.
> migration patchset against his tree (2.6.15-git6; no changes apart
> from the rediff). It is also posted on the off chance that we decide to
> have the full thing in 2.6.16 instead of only swap migration.
> Maybe this can now get into Andrew's tree?
> 
I don't find any problem with this patch now.
'How to do memory hot-removing' is another issue.
I'm glad if this goes to the -mm tree.


By the way, what are the limitations on migratable pages?
I think the current limitation is just hugetlb pages and mlocked pages, right?
Could you make that clear and add a comment or doc before going to -mm?

-- Kame




> -----
> 
> Page migration allows the moving of the physical location of pages between
> nodes in a numa system while the process is running. This means that the
> virtual addresses that the process sees do not change. However, the
> system rearranges the physical location of those pages.
> 
> The main intend of page migration patches here is to reduce the latency of
> memory access by moving pages near to the processor where the process
> accessing that memory is running.
> 
> The migration patchsets allow a process to manually relocate the node
> on which its pages are located through the MF_MOVE and MF_MOVE_ALL options
> while setting a new memory policy. The pages of process can also be relocated
> from another process using the sys_migrate_pages() function call. The
> migrate_pages function call takes two sets of nodes and moves pages of a
> process that are located on the from nodes to the destination nodes.
> 
> Manual migration is very useful if for example the scheduler has relocated
> a process to a processor on a distant node. A batch scheduler or an
> administrator can detect  the situation and move the pages of the process
> nearer to the new processor.
> 
> Larger installations usually partition the system using cpusets into
> sections of nodes. Paul Jackson has  equipped cpusets with the ability to
> move pages when a task is moved to another cpuset. This allows automatic
> control over locality of a process. If a task is moved to a new cpuset
> then also all its pages are moved with it so that the performance of the
> process does not sink dramatically (as is the case today).
> 
> The swap migration patchset in 2.6.16-git6 works by simply evicting
> the page. The pages must be faulted back in. The pages are then typically
> reallocated by the system near the node where the process is executing.
> For swap migration the destination of the move is controlled by the
> allocation policy. Cpusets set the allocation policy before calling
> sys_migrate_pages() in order to move the pages as intended.
> 
> No allocation policy changes are performed for sys_migrate_pages(). This
> means that the pages may not faulted in to the specified nodes if no
> allocation policy was set by other means. The pages will just end up
> near the node where the fault occurred.  The direct migration patchset
> extends the migration functionality to avoid going through swap.
> The destination node of the relation is controllable during the actual
> moving of pages. The crutch of using the allocation policy to relocate
> is not necessary any and the pages are moved directly to the target.
> 
> The direct migration patchset allows the preservation of the relative
> location of pages within a group of nodes for all migration techniques which
> will preserve a particular memory allocation pattern generated even after
> migrating a process. This is necessary in order to preserve the memory
> latencies. Processes will run with similar performance after migration.
> 
> This patch makes sys_migrate_pages() finally work as intended but does not
> do any significant modifications to APIs.
> 
> Benefits over swap migration:
> 
> 1. It makes migrates_pages() actually migrate pages instead of just
>    swapping pages from a set of nodes out.
> 
> 2. Its faster because the page does not need to be written to swap space.
> 
> 3. It does not use swap space and therefore there is no danger of running
>    out of swap space.
> 
> 4. The need to write back a dirty page before migration is avoided through
>    a file system specific method.
> 
> 5. Direct migration allows the preservation of the relative location of a page
>    within a set of nodes.
> 
> Many of the ideas for this code were originally developed in the memory
> hotplug project and we hope that this code also will allow the hotplug
> project to build on this patch in order to get to their goals.
> 
> The patchset consists of five patches (only the first two are necessary to
> have basic direct migration support):
> 
> 1. SwapCache patch
> 
>    SwapCache pages may have changed their type after lock_page() if the
>    page was migrated. Check for this and retry the lookup if the page is
>    no longer a SwapCache page (see the retry sketch after this list).
> 
> 2. migrate_pages()
> 
>    Basic direct migration with fallback to swap if all other attempts
>    fail.
> 
> 3. remove_from_swap()
> 
>    Page migration installs swap ptes for anonymous pages in order to
>    preserve the information contained in the page tables. This patch
>    removes the swap ptes after migration and replaces them with regular
>    ptes.
> 
> 4. upgrade of MPOL_MF_MOVE and sys_migrate_pages()
> 
>    Add logic to mm/mempolicy.c to allow the policy layer to control
>    direct page migration. Thanks to Paul Jackson for the iterative
>    logic to move between sets of nodes (see the node_remap() sketch
>    after this list).
> 
> 5. buffer_migrate_pages() patch
> 
>    Allow migration without writing back dirty pages. Add filesystem-dependent
>    migration support for ext2/ext3 and xfs. Use swapper space to set up a
>    method to migrate anonymous pages without writeback (see the hook wiring
>    sketch after this list).
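Three small sketches follow for patches 1, 4 and 5. They are illustrative
reconstructions based on the descriptions above, not the patch code
itself; any helper or name outside the well-known kernel API is an
assumption.

First, the retry pattern of patch 1 around a swap-cache lookup (a
fragment; the real patch touches several lookup sites):

again:
	page = lookup_swap_cache(entry);
	if (page) {
		lock_page(page);
		if (!PageSwapCache(page)) {
			/* The page was migrated while we slept in
			 * lock_page() and is no longer a SwapCache
			 * page; drop it and redo the lookup. */
			unlock_page(page);
			page_cache_release(page);
			goto again;
		}
	}

Second, the relative-placement idea of patch 4, assuming the node_remap()
helper from <linux/nodemask.h>:

/* e.g. from = {1,2,3}, to = {7,8,9}: a page on node 2 (the second
 * node of "from") moves to node 8 (the second node of "to"),
 * preserving its relative position within the set. */
static int destination_node(int source, nodemask_t from, nodemask_t to)
{
	return node_remap(source, from, to);
}

Third, how a filesystem would wire up the patch-5 method in its
address_space_operations (the hook name shown is the one mainline
eventually adopted, .migratepage; this patchset's own name may differ):

static struct address_space_operations ext2_aops = {
	.readpage	= ext2_readpage,
	.writepage	= ext2_writepage,
	/* ... */
	.migratepage	= buffer_migrate_page,
};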
> 
> Credits (also in mm/vmscan.c):
> 
> The idea for this scheme of page migration was first developed in the context
> of the memory hotplug project. The main authors of the migration code from
> the memory hotplug project are:
> 
> IWAMOTO Toshihiro <iwamoto@valinux.co.jp>
> Hirokazu Takahashi <taka@valinux.co.jp>
> Dave Hansen <haveblue@us.ibm.com>
> 
> Changes V8->V9:
> - Patchset against 2.6.15-git6
> 
> Changes V7->V8:
> - Patchset against 2.6.15-rc5-mm3
> - Export more functions so that filesystems are able to implement their own
>   migrate_page() function.
> - Fix remove_from_swap() to remove the page from the swap cache in addition
>   to replacing swap ptes. Call with the page lock on the new page.
> - Fix copying of struct page {} field to avoid using the macros that process
>   field information.
> 
> Changes V7->V7:
> - Rediff against 2.6.15-rc5-mm2
> 
> Changes V6->V7:
> - Patchset against 2.6.15-rc5-mm1
> - Fix one occurrence of page->mapping in migrate_page_remove_references()
> - Update description
> 
> Changes V5->V6:
> - Patchset against 2.6.15-rc3-mm1
> - Remove checks for page count increases while migrating after Andrew assured
>   me that this cannot happen. Revise documentation to reflect that. If this is
>   the case then we will have no need to include the unwind code from the
>   hotplug project in the future.
> - Fixed a wrong reference to page instead of newpage when calling
>   remove_from_swap.
> 
> Changes V4->V5:
> - Patchset against 2.6.15-rc2-mm1
> - Update policy layer patch to use the generic check_range in 2.6.15-rc2-mm1.
> - Remove try_to_unmap patch since VM_RESERVED vanished under us and therefore
>   there is no point anymore in distinguishing between permanent and
>   transitional failures.
> 
> Changes V3->V4:
> - Patchset against 2.6.15-rc1-mm2 + two swap migration fixes posted today.
> - Remove what is already in 2.6.15-rc1-mm2 which results in a significant
>   cleanup of the code.
> 
> Changes V2->V3:
> - Patchset against 2.6.14-mm2
> - Fix single processor build and builds without CONFIG_MIGRATION
> - export symbols for filesystems that are modules and for
>   modules using migrate_pages().
> - Paul Jackson's cpuset migration support is in 2.6.14-mm2 so
>   this patchset can be easily applied to -mm2 to get from swap
>   based to direct page migration.
> 
> Changes V1->V2:
> - Call node_remap with the right parameters in do_migrate_pages().
> - Take radix tree lock while examining page count to avoid races with
>   find_get_page() and various *_get_pages based on it.
> - Convert direct ptes to swap ptes before radix tree update to avoid
>   more races.
> - Fix problem if CONFIG_MIGRATION is off for buffer_migrate_page
> - Add documentation about page migration
> - Change migrate_pages() api so that the caller can decide what
>   to do about the migrated pages (badmem handling and hotplug
>   have to remove those pages for good).
> - Drop config patch (already in mm)
> - Add try_to_unmap patch
> - Patchset now against 2.6.14-mm1 without requiring additional patches.
> 

* Re: [PATCH 2/5] Direct Migration V9: migrate_pages() extension
  2006-01-10 22:41 ` [PATCH 2/5] Direct Migration V9: migrate_pages() extension Christoph Lameter
@ 2006-01-11  5:46   ` Andrew Morton
  2006-01-12  3:25     ` Christoph Lameter
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2006-01-11  5:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: cpw, linux-kernel, clameter, lhms-devel, taka, kamezawa.hiroyu

Christoph Lameter <clameter@sgi.com> wrote:
>
> +	for(i = 0; i < 10 && page_mapped(page); i++) {
>  +		int rc = try_to_unmap(page);
>  +
>  +		if (rc == SWAP_SUCCESS)
>  +			break;
>  +		/*
>  +		 * If there are other runnable processes then running
>  +		 * them may make it possible to unmap the page
>  +		 */
>  +		schedule();
>  +	}

The schedule() in state TASK_RUNNING simply won't do anything unless this
process happens to have been preempted.  You'll find that an ndelay(100) is
about as useful.

So I'd suggest that this part needs a bit of a rethink.  If we really need
to run other processes then try a schedule_timeout_uninterruptible(1).  If
not, just remove the loop.
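Roughly, the suggested variant would look like this (a sketch of the
alternative described above applied to the quoted loop, not a tested
patch):

	/* Sleep for a tick between attempts so that other runnable
	 * tasks really get to run and can drop their mappings. */
	for (i = 0; i < 10 && page_mapped(page); i++) {
		if (try_to_unmap(page) == SWAP_SUCCESS)
			break;
		schedule_timeout_uninterruptible(1);
	}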

Please stick a printk in there, work out how often and under which
workloads that loop is actually doing something useful.


* Re: [PATCH 5/5] Direct Migration V9: Avoid writeback / page_migrate() method
  2006-01-10 22:41 ` [PATCH 5/5] Direct Migration V9: Avoid writeback / page_migrate() method Christoph Lameter
@ 2006-01-11  6:03   ` Andrew Morton
  2006-01-11  6:38     ` Christoph Lameter
  2006-01-11  6:25   ` Andrew Morton
  1 sibling, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2006-01-11  6:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: cpw, linux-kernel, clameter, lhms-devel, taka, kamezawa.hiroyu

Christoph Lameter <clameter@sgi.com> wrote:
>
> +	spin_lock(&mapping->private_lock);
>  +
>  +	bh = head;
>  +	do {
>  +		get_bh(bh);
>  +		lock_buffer(bh);
>  +		bh = bh->b_this_page;
>  +
>  +	} while (bh != head);
>  +

Guys, lock_buffer() sleeps and cannot be called inside spinlock.

Please, always enable kernel preemption and all debug options when testing
your code.


* Re: [PATCH 0/5] Direct Migration V9: Overview
  2006-01-11  3:26 ` [PATCH 0/5] Direct Migration V9: Overview KAMEZAWA Hiroyuki
@ 2006-01-11  6:10   ` Christoph Lameter
  2006-01-11  6:18   ` Christoph Lameter
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Lameter @ 2006-01-11  6:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: akpm, Cliff Wickman, linux-kernel, lhms-devel, Hirokazu Takahashi

On Wed, 11 Jan 2006, KAMEZAWA Hiroyuki wrote:

> By the way, what is the limitation of migratable pages?
> I think the current limitation is just Hugetlb pages and mlocked pages, right?

Hugetlb is the only limitation. I have a patch here to allow the moving of
mlocked pages. Basically one only needs to guarantee that those are not
swapped out. But I'd like to wait with that for a while.

> Could you make it clear and add a comment or doc before going to -mm?

Will add a patch to that effect.



* Re: [PATCH 0/5] Direct Migration V9: Overview
  2006-01-11  3:26 ` [PATCH 0/5] Direct Migration V9: Overview KAMEZAWA Hiroyuki
  2006-01-11  6:10   ` Christoph Lameter
@ 2006-01-11  6:18   ` Christoph Lameter
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Lameter @ 2006-01-11  6:18 UTC (permalink / raw)
  To: akpm
  Cc: KAMEZAWA Hiroyuki, Cliff Wickman, linux-kernel, lhms-devel,
	Hirokazu Takahashi

On Wed, 11 Jan 2006, KAMEZAWA Hiroyuki wrote:

> I think the current limitation is just Hugetlb pages and mlocked pages, right?
> Could you make it clear and add a comment or doc before going to -mm?

These are checked by the code in mm/mempolicy.c. Add some
comments to migrate_pages() stating the limitations.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.15/mm/vmscan.c
===================================================================
--- linux-2.6.15.orig/mm/vmscan.c	2006-01-10 22:13:37.000000000 -0800
+++ linux-2.6.15/mm/vmscan.c	2006-01-10 22:16:45.000000000 -0800
@@ -842,6 +842,12 @@ EXPORT_SYMBOL(migrate_page);
  * are movable anymore because to has become empty
  * or no retryable pages exist anymore.
  *
+ * Limitations:
+ * Cannot migrate mlocked pages because there is a danger
+ * that those may be swapped out. Also cannot migrate
+ * huge pages. We also cannot migrate pages of VMAs with
+ * special attributes (VM_IO, VM_LOCKED, VM_PFNMAP).
+ *
  * Return: Number of pages not migrated when "to" ran empty.
  */
 int migrate_pages(struct list_head *from, struct list_head *to,


* Re: [PATCH 5/5] Direct Migration V9: Avoid writeback / page_migrate() method
  2006-01-10 22:41 ` [PATCH 5/5] Direct Migration V9: Avoid writeback / page_migrate() method Christoph Lameter
  2006-01-11  6:03   ` Andrew Morton
@ 2006-01-11  6:25   ` Andrew Morton
  1 sibling, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2006-01-11  6:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: cpw, linux-kernel, clameter, lhms-devel, taka, kamezawa.hiroyu

Christoph Lameter <clameter@sgi.com> wrote:
>
> +int buffer_migrate_page(struct page *newpage, struct page *page)
>  +{
>  +	struct address_space *mapping = page->mapping;
>  +	struct buffer_head *bh, *head;
>  +
>  +	if (!mapping)
>  +		return -EAGAIN;
>  +
>  +	if (!page_has_buffers(page))
>  +		return migrate_page(newpage, page);
>  +
>  +	head = page_buffers(page);
>  +
>  +	if (migrate_page_remove_references(newpage, page, 3))
>  +		return -EAGAIN;
>  +
>  +	spin_lock(&mapping->private_lock);

Why are you taking ->private_lock here?

address_space.private_lock protects the list of buffers at
address_space.private_list.  For a regular file (or directory or long
symlink..) that list contains buffers against the blockdev mapping (a
different address_space) which need to be synced for a successful fsync of
this file.  ie: dirty metadata for this file.

So we have two situations:

a) page->mapping->host refers to a regular file/dir/etc

   Here, mapping->private_list holds potentially-dirty buffers against
   the blockdev mapping (a different address_space).

   Nothing needs to be done.

b) page->mapping->host refers to a blockdev (/dev/hda1's pages)

   Here, mapping->private_list is actually always empty.

   Nothing needs to be done.


   BUT, page_buffers(page) refers to buffers which might be on some
   other address_space's ->private_list.  Because a blockdev may have dirty
   buffers which some other address_space needs to write out for its sync.

   blockdevmapping->private_lock is the correct lock for these buffers. 
   Each regular file has a copy of blockdevmapping in its ->assoc_mapping,
   so all files end up taking the same lock when manipulating their
   ->private_list.

   As long as you've taken a ref on the blockdev mapping's buffers and
   locked them then nobody will be starting I/O against them or fiddling
   with ->b_page while you do the swizzle (I think).

AFAIK nobody ever used address_space.private_list for anything apart from
the associated buffers, but that's just a btw.

Anyway, ->private_lock is purely for protecting the thing at
->private_list, so I suspect this locking is simply unneeded.

Please explain the reasoning behind taking this lock.  In fact, that should
have been commented, in the spirit of buffer.c's glorious overcommenting,
which I'm sure you enjoyed ;)


* Re: [PATCH 5/5] Direct Migration V9: Avoid writeback / page_migrate() method
  2006-01-11  6:03   ` Andrew Morton
@ 2006-01-11  6:38     ` Christoph Lameter
  2006-01-11  6:49       ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Lameter @ 2006-01-11  6:38 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: Andrew Morton, cpw, linux-kernel, lhms-devel, taka

On Tue, 10 Jan 2006, Andrew Morton wrote:

> Christoph Lameter <clameter@sgi.com> wrote:
> >
> > +	spin_lock(&mapping->private_lock);
> >  +
> >  +	bh = head;
> >  +	do {
> >  +		get_bh(bh);
> >  +		lock_buffer(bh);
> >  +		bh = bh->b_this_page;
> >  +
> >  +	} while (bh != head);
> >  +
> 
> Guys, lock_buffer() sleeps and cannot be called inside spinlock.

I took it the way it was in the hotplug patches. We are taking the
spinlock here to protect the scan over the list of bh's of this page,
right?

Is it not sufficient to have the page locked to guarantee that the list of 
buffers is not changed? Seems that ext3 does that (see 
ext3_ordered_writepage() etc).

like this?

Index: linux-2.6.15/fs/buffer.c
===================================================================
--- linux-2.6.15.orig/fs/buffer.c	2006-01-10 22:13:37.000000000 -0800
+++ linux-2.6.15/fs/buffer.c	2006-01-10 22:37:28.000000000 -0800
@@ -3070,8 +3070,6 @@ int buffer_migrate_page(struct page *new
 	if (migrate_page_remove_references(newpage, page, 3))
 		return -EAGAIN;
 
-	spin_lock(&mapping->private_lock);
-
 	bh = head;
 	do {
 		get_bh(bh);
@@ -3094,11 +3092,9 @@ int buffer_migrate_page(struct page *new
 	} while (bh != head);
 
 	SetPagePrivate(newpage);
-	spin_unlock(&mapping->private_lock);
 
 	migrate_page_copy(newpage, page);
 
-	spin_lock(&mapping->private_lock);
 	bh = head;
 	do {
 		unlock_buffer(bh);
@@ -3106,7 +3102,6 @@ int buffer_migrate_page(struct page *new
 		bh = bh->b_this_page;
 
 	} while (bh != head);
-	spin_unlock(&mapping->private_lock);
 
 	return 0;
 }


* Re: [PATCH 5/5] Direct Migration V9: Avoid writeback / page_migrate() method
  2006-01-11  6:38     ` Christoph Lameter
@ 2006-01-11  6:49       ` Andrew Morton
  2006-01-11  6:52         ` Christoph Lameter
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2006-01-11  6:49 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: kamezawa.hiroyu, cpw, linux-kernel, lhms-devel, taka

Christoph Lameter <clameter@engr.sgi.com> wrote:
>
> On Tue, 10 Jan 2006, Andrew Morton wrote:
> 
> > Christoph Lameter <clameter@sgi.com> wrote:
> > >
> > > +	spin_lock(&mapping->private_lock);
> > >  +
> > >  +	bh = head;
> > >  +	do {
> > >  +		get_bh(bh);
> > >  +		lock_buffer(bh);
> > >  +		bh = bh->b_this_page;
> > >  +
> > >  +	} while (bh != head);
> > >  +
> > 
> > Guys, lock_buffer() sleeps and cannot be called inside spinlock.
> 
> I took it the way it was in the hotplug patches. We are taking the
> spinlock here to protect the scan over the list of bh's of this page,
> right?
> 
> Is it not sufficient to have the page locked to guarantee that the list of 
> buffers is not changed? Seems that ext3 does that (see 
> ext3_ordered_writepage() etc).

Yes, the page lock protects the buffer ring.

> like this?
> 
> Index: linux-2.6.15/fs/buffer.c
> ===================================================================
> --- linux-2.6.15.orig/fs/buffer.c	2006-01-10 22:13:37.000000000 -0800
> +++ linux-2.6.15/fs/buffer.c	2006-01-10 22:37:28.000000000 -0800
> @@ -3070,8 +3070,6 @@ int buffer_migrate_page(struct page *new
>  	if (migrate_page_remove_references(newpage, page, 3))
>  		return -EAGAIN;
>  
> -	spin_lock(&mapping->private_lock);
> -
>  	bh = head;
>  	do {
>  		get_bh(bh);
> @@ -3094,11 +3092,9 @@ int buffer_migrate_page(struct page *new
>  	} while (bh != head);
>  
>  	SetPagePrivate(newpage);
> -	spin_unlock(&mapping->private_lock);
>  
>  	migrate_page_copy(newpage, page);
>  
> -	spin_lock(&mapping->private_lock);
>  	bh = head;
>  	do {
>  		unlock_buffer(bh);
> @@ -3106,7 +3102,6 @@ int buffer_migrate_page(struct page *new
>  		bh = bh->b_this_page;
>  
>  	} while (bh != head);
> -	spin_unlock(&mapping->private_lock);
>  
>  	return 0;
>  }

Seems right, I think.


So let's see.  Suppose the kernel is about to dink with a page's buffer
ring.  It does:

	get_page(page);
	lock_page(page);
	dink_with(page_buffers(page));

how do these patches ensure that the page doesn't get migrated under my
feet?


* Re: [PATCH 5/5] Direct Migration V9: Avoid writeback / page_migrate() method
  2006-01-11  6:49       ` Andrew Morton
@ 2006-01-11  6:52         ` Christoph Lameter
  0 siblings, 0 replies; 16+ messages in thread
From: Christoph Lameter @ 2006-01-11  6:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: kamezawa.hiroyu, cpw, linux-kernel, lhms-devel, taka

On Tue, 10 Jan 2006, Andrew Morton wrote:

> So let's see.  Suppose the kernel is about to dink with a page's buffer
> ring.  It does:
> 
> 	get_page(page);
> 	lock_page(page);
> 	dink_with(page_buffers(page));
> 
> how do these patches ensure that the page doesn't get migrated under my
> feet?

The page is locked when buffer_migrate_page is called. Thus 

lock_page

will block.

If the refcount was increased before the migration code takes the
tree_lock, then the migration will be aborted.
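
A sketch of that guard, reconstructed from the description (the exact
expected count and the return convention are illustrative):

	/* With the radix tree write-locked, an unexpected extra
	 * reference means another user (e.g. the lock_page() caller
	 * above) got in first; abort this migration attempt. */
	write_lock_irq(&mapping->tree_lock);
	if (page_count(page) != expected_count) {
		write_unlock_irq(&mapping->tree_lock);
		return -EAGAIN;	/* let the caller retry or fall back */
	}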



* Re: [PATCH 2/5] Direct Migration V9: migrate_pages() extension
  2006-01-11  5:46   ` Andrew Morton
@ 2006-01-12  3:25     ` Christoph Lameter
  0 siblings, 0 replies; 16+ messages in thread
From: Christoph Lameter @ 2006-01-12  3:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: cpw, linux-kernel, lhms-devel, taka, kamezawa.hiroyu

On Tue, 10 Jan 2006, Andrew Morton wrote:

> So I'd suggest that this part needs a bit of a rethink.  If we really need
> to run other processes then try a schedule_timeout_uninterruptible(1).  If
> not, just remove the loop.
> 
> Please stick a printk in there, work out how often and under which
> workloads that loop is actually doing something useful.

The loop is there to call try_to_unmap until it replaces all ptes
with swap ptes. The problem is that try_to_unmap may return SWAP_FAIL
and do nothing if a pte was recently referenced.

Let's go back to the way the old hotplug code did it. Modify
try_to_unmap() to take an additional parameter so that the reference
bit in the ptes does not cause a SWAP_FAIL. Another option would be to
modify try_to_unmap to return an additional status, SWAP_REFERENCE, and
call try_to_unmap until another status is returned.




Add a parameter to try_to_unmap and friends to not return
SWAP_FAIL if a newly referenced pte is encountered.

Then replace the loop in migrate_page_remove_references()
with an invocation of try_to_unmap with that parameter.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.15/mm/vmscan.c
===================================================================
--- linux-2.6.15.orig/mm/vmscan.c	2006-01-11 19:22:01.000000000 -0800
+++ linux-2.6.15/mm/vmscan.c	2006-01-11 19:22:07.000000000 -0800
@@ -472,7 +472,7 @@ static int shrink_list(struct list_head 
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page)) {
+			switch (try_to_unmap(page, 0)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -621,7 +621,7 @@ static int swap_page(struct page *page)
 	struct address_space *mapping = page_mapping(page);
 
 	if (page_mapped(page) && mapping)
-		if (try_to_unmap(page) != SWAP_SUCCESS)
+		if (try_to_unmap(page, 0) != SWAP_SUCCESS)
 			goto unlock_retry;
 
 	if (PageDirty(page)) {
@@ -677,7 +677,6 @@ int migrate_page_remove_references(struc
 {
 	struct address_space *mapping = page_mapping(page);
 	struct page **radix_pointer;
-	int i;
 
 	/*
 	 * Avoid doing any of the following work if the page count
@@ -706,17 +705,7 @@ int migrate_page_remove_references(struc
 	 * If the page was not migrated then the PageSwapCache bit
 	 * is still set and the operation may continue.
 	 */
-	for(i = 0; i < 10 && page_mapped(page); i++) {
-		int rc = try_to_unmap(page);
-
-		if (rc == SWAP_SUCCESS)
-			break;
-		/*
-		 * If there are other runnable processes then running
-		 * them may make it possible to unmap the page
-		 */
-		schedule();
-	}
+	try_to_unmap(page, 1);
 
 	/*
 	 * Give up if we were unable to remove all mappings.
Index: linux-2.6.15/mm/rmap.c
===================================================================
--- linux-2.6.15.orig/mm/rmap.c	2006-01-11 19:22:00.000000000 -0800
+++ linux-2.6.15/mm/rmap.c	2006-01-11 19:22:07.000000000 -0800
@@ -571,7 +571,8 @@ void page_remove_rmap(struct page *page)
  * Subfunctions of try_to_unmap: try_to_unmap_one called
  * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
  */
-static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma)
+static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
+				int ignore_refs)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -594,7 +595,8 @@ static int try_to_unmap_one(struct page 
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if ((vma->vm_flags & VM_LOCKED) ||
-			ptep_clear_flush_young(vma, address, pte)) {
+			(ptep_clear_flush_young(vma, address, pte)
+				&& !ignore_refs)) {
 		ret = SWAP_FAIL;
 		goto out_unmap;
 	}
@@ -728,7 +730,7 @@ static void try_to_unmap_cluster(unsigne
 	pte_unmap_unlock(pte - 1, ptl);
 }
 
-static int try_to_unmap_anon(struct page *page)
+static int try_to_unmap_anon(struct page *page, int ignore_refs)
 {
 	struct anon_vma *anon_vma;
 	struct vm_area_struct *vma;
@@ -739,7 +741,7 @@ static int try_to_unmap_anon(struct page
 		return ret;
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
-		ret = try_to_unmap_one(page, vma);
+		ret = try_to_unmap_one(page, vma, ignore_refs);
 		if (ret == SWAP_FAIL || !page_mapped(page))
 			break;
 	}
@@ -756,7 +758,7 @@ static int try_to_unmap_anon(struct page
  *
  * This function is only called from try_to_unmap for object-based pages.
  */
-static int try_to_unmap_file(struct page *page)
+static int try_to_unmap_file(struct page *page, int ignore_refs)
 {
 	struct address_space *mapping = page->mapping;
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -770,7 +772,7 @@ static int try_to_unmap_file(struct page
 
 	spin_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
-		ret = try_to_unmap_one(page, vma);
+		ret = try_to_unmap_one(page, vma, ignore_refs);
 		if (ret == SWAP_FAIL || !page_mapped(page))
 			goto out;
 	}
@@ -855,16 +857,16 @@ out:
  * SWAP_AGAIN	- we missed a mapping, try again later
  * SWAP_FAIL	- the page is unswappable
  */
-int try_to_unmap(struct page *page)
+int try_to_unmap(struct page *page, int ignore_refs)
 {
 	int ret;
 
 	BUG_ON(!PageLocked(page));
 
 	if (PageAnon(page))
-		ret = try_to_unmap_anon(page);
+		ret = try_to_unmap_anon(page, ignore_refs);
 	else
-		ret = try_to_unmap_file(page);
+		ret = try_to_unmap_file(page, ignore_refs);
 
 	if (!page_mapped(page))
 		ret = SWAP_SUCCESS;
Index: linux-2.6.15/include/linux/rmap.h
===================================================================
--- linux-2.6.15.orig/include/linux/rmap.h	2006-01-11 19:22:19.000000000 -0800
+++ linux-2.6.15/include/linux/rmap.h	2006-01-11 19:22:22.000000000 -0800
@@ -91,7 +91,7 @@ static inline void page_dup_rmap(struct 
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *, int is_locked);
-int try_to_unmap(struct page *);
+int try_to_unmap(struct page *, int ignore_refs);
 #ifdef CONFIG_MIGRATION
 void remove_from_swap(struct page *page);
 #endif
@@ -114,7 +114,7 @@ unsigned long page_address_in_vma(struct
 #define anon_vma_link(vma)	do {} while (0)
 
 #define page_referenced(page,l) TestClearPageReferenced(page)
-#define try_to_unmap(page)	SWAP_FAIL
+#define try_to_unmap(page, refs) SWAP_FAIL
 
 #endif	/* CONFIG_MMU */
 

