* [PATCH 0/3] Fix migration races in rmap_walk() V2
From: Mel Gorman @ 2010-04-27 21:30 UTC (permalink / raw)
  To: Linux-MM, LKML
  Cc: Minchan Kim, KAMEZAWA Hiroyuki, Mel Gorman, Christoph Lameter,
	Andrea Arcangeli, Rik van Riel, Andrew Morton

After V1, it was clear that execve was still racing: the test survived longer
but eventually died in an exec-related race. An additional part of the test was
created that hammers exec() and reproduces the problem typically within 10
minutes rather than several hours. The problem was that the VMA is moved under
lock but the page tables are not. Migration then fails to remove the migration
PTE from its new location and a BUG is triggered later. The third patch in this
series is a candidate fix.

Changelog since V1
  o Handle the execve race
  o Be sure that rmap_walk() releases the correct VMA lock
  o Hold the anon_vma lock for the address lookup and the page remap
  o Add reviewed-bys

There are a number of races between migration and other operations that can
leave a migration PTE behind. Broadly speaking, migration works by locking a
page, unmapping it, putting a migration PTE (which looks like a swap entry) in
place, copying the page, and then remapping the page while removing the old
migration PTE. If a fault occurs on the migration PTE in the meantime, the
faulting process waits until migration completes.
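
As an illustration only, the sequence above condensed into a sketch; the
function names are the real ones from mm/migrate.c of this era, but error
handling and most details are omitted, so treat it as pseudocode rather than
the actual implementation:

	lock_page(page);			/* migration holds the page lock */
	try_to_unmap(page, TTU_MIGRATION);	/* PTEs become migration entries */
	migrate_page_copy(newpage, page);	/* copy data and page state */
	remove_migration_ptes(page, newpage);	/* rmap walk: install real PTEs again */
	unlock_page(page);			/* waiting faults may now proceed */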

The problem is that there are some races that either allow migration PTEs to
be copied or a migration PTE to be left behind. Migration still completes and
the page is unlocked but later a fault will call migration_entry_to_page()
and BUG() because the page is not locked. This series aims to close some
of these races.
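
For reference, the BUG comes from the rule that migration entries may only be
dereferenced while the corresponding page is locked; the check is roughly the
following (a simplified sketch, not necessarily the exact source):

	static inline struct page *migration_entry_to_page(swp_entry_t entry)
	{
		struct page *p = pfn_to_page(swp_offset(entry));
		/*
		 * Any use of migration entries may only occur while the
		 * corresponding page is locked
		 */
		BUG_ON(!PageLocked(p));
		return p;
	}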

Patch 1 alters fork() to restart page table copying when a migration PTE is
	encountered.

Patch 2 has vma_adjust() acquire the anon_vma lock and makes rmap_walk()
	aware that VMAs on the chain may have different anon_vma locks that
	also need to be acquired.

Patch 3 notes that while a VMA is moved under the anon_vma lock, the page
	tables are not similarly protected. Where migration PTEs are
	encountered, they are cleaned up.

The reproduction case was as follows:

1. Run kernel compilation in a loop
2. Start three processes, each of which creates one mapping. The three stress
   different aspects of the problem. The operations they undertake are:
	a) Forks a hundred children, each of which faults the mapping
		Purpose: stress test migration pte removal
	b) Forks a hundred children, each of which punches a hole in the mapping
	   and faults what remains
		Purpose: stress test VMA manipulations during migration
	c) Forks a hundred children, each of which execs and calls echo
		Purpose: stress test the execve race
3. Constantly compact memory using /proc/sys/vm/compact_memory so migration
   is active all the time (a minimal userspace loop for doing this is sketched
   below). In theory, you could also force this using sys_move_pages or memory
   hot-remove but it'd be nowhere near as easy to test.
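
A minimal sketch of such a loop, for illustration only (it assumes a kernel
with compaction enabled and must run as root; writing any value to
compact_memory triggers a compaction pass):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
	    for (;;) {
		    FILE *f = fopen("/proc/sys/vm/compact_memory", "w");

		    if (!f) {
			    perror("compact_memory");
			    return 1;
		    }
		    fputs("1\n", f);
		    fclose(f);
		    usleep(100000);	/* don't hammer the sysctl flat out */
	    }
	    return 0;
    }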

At the time of sending, the test has been running for several hours without
problems, with a workload that would fail within a few minutes without the
patches.

 include/linux/migrate.h |    7 +++++++
 mm/ksm.c                |   22 ++++++++++++++++++++--
 mm/memory.c             |   25 +++++++++++++++----------
 mm/migrate.c            |    2 +-
 mm/mmap.c               |    6 ++++++
 mm/mremap.c             |   29 +++++++++++++++++++++++++++++
 mm/rmap.c               |   28 +++++++++++++++++++++++-----
 7 files changed, 101 insertions(+), 18 deletions(-)


* [PATCH 1/3] mm,migration: During fork(), wait for migration to end if migration PTE is encountered
From: Mel Gorman @ 2010-04-27 21:30 UTC (permalink / raw)
  To: Linux-MM, LKML
  Cc: Minchan Kim, KAMEZAWA Hiroyuki, Mel Gorman, Christoph Lameter,
	Andrea Arcangeli, Rik van Riel, Andrew Morton

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

During page migration, a pte is replaced with a migration entry, which has a
similar format to a swap entry, and the real pfn is put back at the end of
migration. But there is a race with fork()'s copy_page_range().
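
As a rough illustration (a condensed sketch of the unmap and fault sides, not
the full kernel functions):

	/* unmap side, roughly what try_to_unmap_one() does for migration: */
	entry = make_migration_entry(page, pte_write(pteval));
	set_pte_at(mm, address, ptep, swp_entry_to_pte(entry));

	/* fault side: recognise the entry and wait for migration to finish */
	entry = pte_to_swp_entry(*ptep);
	if (is_migration_entry(entry))
		migration_entry_wait(mm, pmd, address);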

Assume page migration on CPU A and a fork on CPU B. On CPU A, a page of
a process is under migration. On CPU B, that page's pte is being copied.

	CPUA			CPU B
				do_fork()
				copy_mm() (from process 1 to process2)
				insert new vma to mmap_list (if inode/anon_vma)
	pte_lock(process1)
	unmap a page
	insert migration_entry
	pte_unlock(process1)

	migrate page copy
				copy_page_range
	remap new page by rmap_walk()
	pte_lock(process2)
	found no pte.
	pte_unlock(process2)
				pte lock(process2)
				pte lock(process1)
				copy migration entry to process2
				pte unlock(process1)
				pte unlock(process2)
	pte_lock(process1)
	replace migration entry
	to new page's pte.
	pte_unlock(process1)

Some serialization is therefore necessary. IIUC, this is a very rare event,
but it is reproducible if a lot of migration is happening while the following
program runs in parallel.

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>

    #define SIZE (24*1048576UL)
    #define CHILDREN 100
    int main()
    {
	    int i = 0;
	    pid_t pids[CHILDREN];
	    char *buf = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
			    MAP_PRIVATE|MAP_ANONYMOUS,
			    0, 0);
	    if (buf == MAP_FAILED) {
		    perror("mmap");
		    exit(-1);
	    }

	    /* no children forked yet; mark all slots empty for the check below */
	    memset(pids, -1, sizeof(pids));

	    while (++i) {
		    int j = i % CHILDREN;

		    if (j == 0) {
			    printf("Waiting on children\n");
			    for (j = 0; j < CHILDREN; j++) {
				    memset(buf, i, SIZE);
				    if (pids[j] != -1)
					    waitpid(pids[j], NULL, 0);
			    }
			    j = 0;
		    }

		    if ((pids[j] = fork()) == 0) {
			    memset(buf, i, SIZE);
			    exit(EXIT_SUCCESS);
		    }
	    }

	    munmap(buf, SIZE);
    }

This patch makes copy_page_range() wait for the end of migration: when a
migration entry is encountered, the page table lock is released and the copy
is retried.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/memory.c |   25 +++++++++++++++----------
 1 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..800d77f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -675,16 +675,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			}
 			if (likely(!non_swap_entry(entry)))
 				rss[MM_SWAPENTS]++;
-			else if (is_write_migration_entry(entry) &&
-					is_cow_mapping(vm_flags)) {
-				/*
-				 * COW mappings require pages in both parent
-				 * and child to be set to read.
-				 */
-				make_migration_entry_read(&entry);
-				pte = swp_entry_to_pte(entry);
-				set_pte_at(src_mm, addr, src_pte, pte);
-			}
+			else
+				BUG();
 		}
 		goto out_set_pte;
 	}
@@ -760,6 +752,19 @@ again:
 			progress++;
 			continue;
 		}
+		if (unlikely(!pte_present(*src_pte) && !pte_file(*src_pte))) {
+			entry = pte_to_swp_entry(*src_pte);
+			if (is_migration_entry(entry)) {
+				/*
+				 * Because copying pte has the race with
+				 * pte rewriting of migraton, release lock
+				 * and retry.
+				 */
+				progress = 0;
+				entry.val = 0;
+				break;
+			}
+		}
 		entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
 							vma, addr, rss);
 		if (entry.val)
-- 
1.6.5



* [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
From: Mel Gorman @ 2010-04-27 21:30 UTC (permalink / raw)
  To: Linux-MM, LKML
  Cc: Minchan Kim, KAMEZAWA Hiroyuki, Mel Gorman, Christoph Lameter,
	Andrea Arcangeli, Rik van Riel, Andrew Morton

vma_adjust() updates anon VMA information without taking any locks; file-backed
mappings, in contrast, are protected by the i_mmap_lock. This lack of locking
can result in races with page migration: during rmap_walk(), vma_address() can
return -EFAULT for an address that will soon be valid. This leaves a dangling
migration PTE behind which can later cause a BUG_ON to trigger when the page
is faulted in.

With the recent anon_vma changes, more than one anon_vma->lock may need to be
taken on an anon_vma_chain, but a second lock cannot simply be spun on without
risking deadlock. Instead, the rmap walker trylocks the other VMA's anon_vma
lock. If the attempt fails, the locks are released and the walk is restarted.
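
Distilled, the locking pattern is the sketch below; this is only an
illustration of the idea, the real code is in the mm/rmap.c and mm/ksm.c hunks
that follow:

	spin_lock(&anon_vma->lock);	/* lock belonging to the page's anon_vma */
	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct anon_vma *locked = NULL;

		if (avc->vma->anon_vma != anon_vma) {
			locked = avc->vma->anon_vma;
			/* never spin on a second anon_vma lock: ABBA deadlock */
			if (!spin_trylock(&locked->lock)) {
				spin_unlock(&anon_vma->lock);
				goto retry;	/* drop everything and start over */
			}
		}
		/* ... do the per-VMA work here ... */
		if (locked)
			spin_unlock(&locked->lock);
	}
	spin_unlock(&anon_vma->lock);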

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/ksm.c  |   22 ++++++++++++++++++++--
 mm/mmap.c |    6 ++++++
 mm/rmap.c |   28 +++++++++++++++++++++++-----
 3 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..0c09927 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1668,15 +1668,28 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
 again:
 	hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
 		struct anon_vma *anon_vma = rmap_item->anon_vma;
+		struct anon_vma *locked_vma;
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
 		spin_lock(&anon_vma->lock);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
+
+			/* See comment in mm/rmap.c#rmap_walk_anon on locking */
+			locked_vma = NULL;
+			if (anon_vma != vma->anon_vma) {
+				locked_vma = vma->anon_vma;
+				if (!spin_trylock(&locked_vma->lock)) {
+					spin_unlock(&anon_vma->lock);
+					goto again;
+				}
+			}
+
 			if (rmap_item->address < vma->vm_start ||
 			    rmap_item->address >= vma->vm_end)
-				continue;
+				goto next_vma;
+
 			/*
 			 * Initially we examine only the vma which covers this
 			 * rmap_item; but later, if there is still work to do,
@@ -1684,9 +1697,14 @@ again:
 			 * were forked from the original since ksmd passed.
 			 */
 			if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
-				continue;
+				goto next_vma;
 
 			ret = rmap_one(page, vma, rmap_item->address, arg);
+
+next_vma:
+			if (locked_vma)
+				spin_unlock(&locked_vma->lock);
+
 			if (ret != SWAP_AGAIN) {
 				spin_unlock(&anon_vma->lock);
 				goto out;
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61d6f1d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,9 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
+	if (vma->anon_vma)
+		spin_lock(&vma->anon_vma->lock);
+
 	if (root) {
 		flush_dcache_mmap_lock(mapping);
 		vma_prio_tree_remove(vma, root);
@@ -620,6 +623,9 @@ again:			remove_next = 1 + (end > next->vm_end);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
 
+	if (vma->anon_vma)
+		spin_unlock(&vma->anon_vma->lock);
+
 	if (remove_next) {
 		if (file) {
 			fput(file);
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..f7ed89f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1358,7 +1358,7 @@ int try_to_munlock(struct page *page)
 static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 		struct vm_area_struct *, unsigned long, void *), void *arg)
 {
-	struct anon_vma *anon_vma;
+	struct anon_vma *anon_vma, *locked_vma;
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
@@ -1368,16 +1368,34 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 	 * are holding mmap_sem. Users without mmap_sem are required to
 	 * take a reference count to prevent the anon_vma disappearing
 	 */
+retry:
 	anon_vma = page_anon_vma(page);
 	if (!anon_vma)
 		return ret;
 	spin_lock(&anon_vma->lock);
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
 		struct vm_area_struct *vma = avc->vma;
-		unsigned long address = vma_address(page, vma);
-		if (address == -EFAULT)
-			continue;
-		ret = rmap_one(page, vma, address, arg);
+		unsigned long address;
+
+		/*
+		 * Guard against deadlocks by not spinning against
+		 * vma->anon_vma->lock. On contention release and retry
+		 */
+		locked_vma = NULL;
+		if (anon_vma != vma->anon_vma) {
+			locked_vma = vma->anon_vma;
+			if (!spin_trylock(&locked_vma->lock)) {
+				spin_unlock(&anon_vma->lock);
+				goto retry;
+			}
+		}
+		address = vma_address(page, vma);
+		if (address != -EFAULT)
+			ret = rmap_one(page, vma, address, arg);
+
+		if (locked_vma)
+			spin_unlock(&locked_vma->lock);
+
 		if (ret != SWAP_AGAIN)
 			break;
 	}
-- 
1.6.5



* [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
From: Mel Gorman @ 2010-04-27 21:30 UTC (permalink / raw)
  To: Linux-MM, LKML
  Cc: Minchan Kim, KAMEZAWA Hiroyuki, Mel Gorman, Christoph Lameter,
	Andrea Arcangeli, Rik van Riel, Andrew Morton

During exec(), a temporary stack is set up and later moved to its final
location. There is a race between migration and exec whereby a migration
PTE can be placed in the temporary stack. When this VMA is moved under the
lock, migration no longer knows where the PTE is, fails to remove the PTE
and the migration PTE gets copied to the new location. This later causes
a bug when the migration PTE is discovered but the page is not locked.

This patch handles the situation by removing the migration PTE when page
tables are being moved, in case migration fails to find it. The alternative
would require significant modification to vma_adjust() and the locks taken
to ensure that a VMA move and page table copy are atomic with respect to
migration.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/migrate.h |    7 +++++++
 mm/migrate.c            |    2 +-
 mm/mremap.c             |   29 +++++++++++++++++++++++++++++
 3 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 7a07b17..05d2292 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -22,6 +22,8 @@ extern int migrate_prep(void);
 extern int migrate_vmas(struct mm_struct *mm,
 		const nodemask_t *from, const nodemask_t *to,
 		unsigned long flags);
+extern int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
+			unsigned long addr, void *old);
 #else
 #define PAGE_MIGRATION 0
 
@@ -42,5 +44,10 @@ static inline int migrate_vmas(struct mm_struct *mm,
 #define migrate_page NULL
 #define fail_migrate_page NULL
 
+static inline int remove_migration_pte(struct page *new,
+		struct vm_area_struct *vma, unsigned long addr, void *old)
+{
+}
+
 #endif /* CONFIG_MIGRATION */
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 4afd6fe..053fd39 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -75,7 +75,7 @@ void putback_lru_pages(struct list_head *l)
 /*
  * Restore a potential migration pte to a working pte entry
  */
-static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
+int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 				 unsigned long addr, void *old)
 {
 	struct mm_struct *mm = vma->vm_mm;
diff --git a/mm/mremap.c b/mm/mremap.c
index cde56ee..601bba0 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -13,12 +13,14 @@
 #include <linux/ksm.h>
 #include <linux/mman.h>
 #include <linux/swap.h>
+#include <linux/swapops.h>
 #include <linux/capability.h>
 #include <linux/fs.h>
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/mmu_notifier.h>
+#include <linux/migrate.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -78,10 +80,13 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
 	unsigned long old_start;
+	swp_entry_t entry;
+	struct page *page;
 
 	old_start = old_addr;
 	mmu_notifier_invalidate_range_start(vma->vm_mm,
 					    old_start, old_end);
+restart:
 	if (vma->vm_file) {
 		/*
 		 * Subtle point from Rajesh Venkatasubramanian: before
@@ -111,6 +116,12 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 				   new_pte++, new_addr += PAGE_SIZE) {
 		if (pte_none(*old_pte))
 			continue;
+		if (unlikely(!pte_present(*old_pte))) {
+			entry = pte_to_swp_entry(*old_pte);
+			if (is_migration_entry(entry))
+				break;
+		}
+
 		pte = ptep_clear_flush(vma, old_addr, old_pte);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		set_pte_at(mm, new_addr, new_pte, pte);
@@ -123,6 +134,24 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
+
+	/*
+	 * In this context, we cannot call migration_entry_wait() as we
+	 * are racing with migration. If migration finishes between when
+	 * PageLocked was checked and migration_entry_wait takes the
+	 * locks, it'll BUG. Instead, lock the page and remove the PTE
+	 * before restarting.
+	 */
+	if (old_addr != old_end) {
+		page = pfn_to_page(swp_offset(entry));
+		get_page(page);
+		lock_page(page);
+		remove_migration_pte(page, vma, old_addr, page);
+		unlock_page(page);
+		put_page(page);
+		goto restart;
+	}
+
 	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }
 
-- 
1.6.5



* Re: [PATCH 1/3] mm,migration: During fork(), wait for migration to end if migration PTE is encountered
From: Andrea Arcangeli @ 2010-04-27 22:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Rik van Riel, Andrew Morton

Ok I had a first look:

On Tue, Apr 27, 2010 at 10:30:50PM +0100, Mel Gorman wrote:
> 	CPUA			CPU B
> 				do_fork()
> 				copy_mm() (from process 1 to process2)
> 				insert new vma to mmap_list (if inode/anon_vma)

Insert to the tail of the anon_vma list...

> 	pte_lock(process1)
> 	unmap a page
> 	insert migration_entry
> 	pte_unlock(process1)
> 
> 	migrate page copy
> 				copy_page_range
> 	remap new page by rmap_walk()

rmap_walk will walk process1 first! It's at the head; the vmas with
unmapped ptes are at the tail, so process1 is walked before process2.

> 	pte_lock(process2)
> 	found no pte.
> 	pte_unlock(process2)
> 				pte lock(process2)
> 				pte lock(process1)
> 				copy migration entry to process2
> 				pte unlock(process1)
> 				pte unlock(process2)
> 	pte_lock(process1)
> 	replace migration entry
> 	to new page's pte.
> 	pte_unlock(process1)

rmap_walk has to lock down process1 before process2; this is the
ordering issue I already mentioned in an earlier email. So it cannot
happen and this patch is unnecessary.

The ordering is fundamental and, as said, anon_vma_link already adds new
vmas to the _tail_ of the anon-vma list, and this is why it has to add to
the tail. If anon_vma_link added new vmas to the head of the list, the
above bug could materialize, but it adds to the tail, so it cannot happen.

In mainline, anon_vma_link is called anon_vma_chain_link; see the
list_add_tail there, which provides this guarantee.
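
For reference, the mainline helper looks roughly like this (reconstructed here
for illustration; check the tree for the exact version):

	static void anon_vma_chain_link(struct vm_area_struct *vma,
					struct anon_vma_chain *avc,
					struct anon_vma *anon_vma)
	{
		avc->vma = vma;
		avc->anon_vma = anon_vma;
		list_add(&avc->same_vma, &vma->anon_vma_chain);
		/* new vmas go to the tail, after the already-mapped ones */
		list_add_tail(&avc->same_anon_vma, &anon_vma->head);
	}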

Because process1 is walked first by CPU A, the migration entry is
replaced by the final pte before the copy-migration-entry step runs.
Alternatively, if copy-migration-entry runs before process1 is walked,
the migration entry will be copied and found in process2.

Comments welcome.
Andrea


* Re: [PATCH 0/3] Fix migration races in rmap_walk() V2
From: Christoph Lameter @ 2010-04-27 22:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	Rik van Riel, Andrew Morton

On Tue, 27 Apr 2010, Mel Gorman wrote:

> The problem is that there are some races that either allow migration PTEs to
> be copied or a migration PTE to be left behind. Migration still completes and
> the page is unlocked but later a fault will call migration_entry_to_page()
> and BUG() because the page is not locked. This series aims to close some
> of these races.

In general migration ptes were designed to cause the code encountering
them to go to sleep until migration is finished and a regular pte is
available. Looks like we are tolerating the handling of migration entries.

I never imagined copying page tables with this. There could be recursion
issues of various kinds because page migration requires operations on the
page tables while performing migration. A simple fix would be to *not*
migrate page table pages.

> Patch 1 alters fork() to restart page table copying when a migration PTE is
> 	encountered.

Can we simply wait like in the fault path?

> Patch 3 notes that while a VMA is moved under the anon_vma lock, the page
> 	tables are not similarly protected. Where migration PTEs are
> 	encountered, they are cleaned up.

This means they are copied / moved etc. and "cleaned up" while the page is
unlocked. Migration entries are not supposed to exist when a page is not
locked.


* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
From: Andrea Arcangeli @ 2010-04-27 22:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Rik van Riel, Andrew Morton

On Tue, Apr 27, 2010 at 10:30:52PM +0100, Mel Gorman wrote:
> During exec(), a temporary stack is setup and moved later to its final
> location. There is a race between migration and exec whereby a migration
> PTE can be placed in the temporary stack. When this VMA is moved under the
> lock, migration no longer knows where the PTE is, fails to remove the PTE
> and the migration PTE gets copied to the new location.  This later causes
> a bug when the migration PTE is discovered but the page is not locked.

This is the real bug; patch 1 should be rejected and its explanation trace
has the ordering wrong. The ordering is subtle but fundamental to preventing
that race: split_huge_page also requires the same anon-vma list_add_tail to
avoid the same race between fork and rmap_walk. It should already work fine
with both the old and the new anon-vma code as they both add new vmas to the
tail of the list.

So the bug, in very short, is that "move_page_tables runs out of sync
with vma_adjust in shift_arg_pages"?

> This patch handles the situation by removing the migration PTE when page
> tables are being moved in case migration fails to find them. The alternative
> would require significant modification to vma_adjust() and the locks taken
> to ensure a VMA move and page table copy is atomic with respect to migration.

I'll now evaluate the fix and see if I can find any other
way to handle this.

Great, I'm quite sure that with patch 3 we'll move the needle and fix the
bug; it perfectly explains why we only get the oops inside execve on the
stack page.

I didn't check patch 2 yet, but it's only relevant for the new anon-vma
code; I suggest handling it separately from the rest.


* Re: [PATCH 0/3] Fix migration races in rmap_walk() V2
From: Andrea Arcangeli @ 2010-04-27 22:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton

On Tue, Apr 27, 2010 at 05:27:36PM -0500, Christoph Lameter wrote:
> Can we simply wait like in the fault path?

There is no bug there, no need to wait either. I already audited it
before, and I didn't see any bug. Unless you can show a bug with CPU A
running the rmap_walk on process1 before process2, there is no bug to
fix there.

> 
> > Patch 3 notes that while a VMA is moved under the anon_vma lock, the page
> > 	tables are not similarly protected. Where migration PTEs are
> > 	encountered, they are cleaned up.
> 
> This means they are copied / moved etc and "cleaned" up in a state when
> the page was unlocked. Migration entries are not supposed to exist when
> a page is not locked.

Patch 3 is real, and the first thought I had was to lock down the page
before running vma_adjust and unlock after move_page_tables. But these
are virtual addresses. Maybe there's a simpler way to keep migration
away while we run those two operations.


* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
From: Andrea Arcangeli @ 2010-04-27 22:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 12:30:04AM +0200, Andrea Arcangeli wrote:
> I'll now evaluate the fix and see if I can find any other
> way to handle this.


I think a better fix for the bug mentioned in patch 3 is the below. It seems
to work fine on aa.git with the old (stable) 2.6.33 anon-vma code. I'm not
sure if it also works with the new anon-vma code in mainline, but at first
glance I think it should. At that point we should be single threaded, so it
shouldn't matter if anon_vma is temporarily NULL.

Then you'll have to re-evaluate the mainline-only vma_adjust fixes in patch 2
in light of the below (I didn't check patch 2 in detail).

Please try to reproduce with the below applied.

----
Subject: fix race between shift_arg_pages and rmap_walk

From: Andrea Arcangeli <aarcange@redhat.com>

migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated but can't
be removed if the second rmap_walk fails to find the page.

So shift_arg_pages must run atomically with respect to rmap_walk, and
it's enough to run it under the anon_vma lock to make it atomic.

And split_huge_page() will have the same requirements as migrate.c
already has.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/fsnotify.h>
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -503,6 +504,7 @@ static int shift_arg_pages(struct vm_are
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
 	struct mmu_gather *tlb;
+	struct anon_vma *anon_vma;
 
 	BUG_ON(new_start > new_end);
 
@@ -513,6 +515,12 @@ static int shift_arg_pages(struct vm_are
 	if (vma != find_vma(mm, new_start))
 		return -EFAULT;
 
+	anon_vma = vma->anon_vma;
+	/* stop rmap_walk or it won't find the stack pages */
+	spin_lock(&anon_vma->lock);
+	/* avoid vma_adjust to take any further anon_vma lock */
+	vma->anon_vma = NULL;
+
 	/*
 	 * cover the whole range: [new_start, old_end)
 	 */
@@ -551,6 +559,9 @@ static int shift_arg_pages(struct vm_are
 	 */
 	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
 
+	vma->anon_vma = anon_vma;
+	spin_unlock(&anon_vma->lock);
+
 	return 0;
 }
 


* Re: [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
  2010-04-27 21:30   ` Mel Gorman
@ 2010-04-27 23:10     ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-27 23:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Rik van Riel, Andrew Morton

On Tue, Apr 27, 2010 at 10:30:51PM +0100, Mel Gorman wrote:
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f90ea92..61d6f1d 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -578,6 +578,9 @@ again:			remove_next = 1 + (end > next->vm_end);
>  		}
>  	}
>  
> +	if (vma->anon_vma)
> +		spin_lock(&vma->anon_vma->lock);
> +
>  	if (root) {
>  		flush_dcache_mmap_lock(mapping);
>  		vma_prio_tree_remove(vma, root);
> @@ -620,6 +623,9 @@ again:			remove_next = 1 + (end > next->vm_end);
>  	if (mapping)
>  		spin_unlock(&mapping->i_mmap_lock);
>  
> +	if (vma->anon_vma)
> +		spin_unlock(&vma->anon_vma->lock);
> +
>  	if (remove_next) {
>  		if (file) {
>  			fput(file);

The old code did:

    /*
     * When changing only vma->vm_end, we don't really need
     * anon_vma lock.
     */
    if (vma->anon_vma && (insert || importer || start !=  vma->vm_start))
	anon_vma = vma->anon_vma;
    if (anon_vma) {
        spin_lock(&anon_vma->lock);

Why did it become unconditional? (And I have no idea why the check was
removed.)

But I'm not sure about this part... this is really only a question; I
may well be wrong, I just don't get it.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 1/3] mm,migration: During fork(), wait for migration to end if migration PTE is encountered
  2010-04-27 22:22     ` Andrea Arcangeli
@ 2010-04-27 23:52       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-27 23:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, 28 Apr 2010 00:22:45 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> Ok I had a first look:
> 
> On Tue, Apr 27, 2010 at 10:30:50PM +0100, Mel Gorman wrote:
> > 	CPUA			CPU B
> > 				do_fork()
> > 				copy_mm() (from process 1 to process2)
> > 				insert new vma to mmap_list (if inode/anon_vma)
> 
> Insert to the tail of the anon_vma list...
> 
> > 	pte_lock(process1)
> > 	unmap a page
> > 	insert migration_entry
> > 	pte_unlock(process1)
> > 
> > 	migrate page copy
> > 				copy_page_range
> > 	remap new page by rmap_walk()
> 
> rmap_walk will walk process1 first! It's at the head, the vmas with
> unmapped ptes are at the tail so process1 is walked before process2.
> 
> > 	pte_lock(process2)
> > 	found no pte.
> > 	pte_unlock(process2)
> > 				pte lock(process2)
> > 				pte lock(process1)
> > 				copy migration entry to process2
> > 				pte unlock(process1)
> > 				pte unlokc(process2)
> > 	pte_lock(process1)
> > 	replace migration entry
> > 	to new page's pte.
> > 	pte_unlock(process1)
> 
> rmap_walk has to lock down process1 before process2, this is the
> ordering issue I already mentioned in earlier email. So it cannot
> happen and this patch is unnecessary.
> 
> The ordering is fundamental and as said anon_vma_link already adds new
> vmas to the _tail_ of the anon-vma. And this is why it has to add to
> the tail. If anon_vma_link would add new vmas to the head of the list,
> the above bug could materialize, but it doesn't so it cannot happen.
> 
> In mainline anon_vma_link is called anon_vma_chain_link, see the
> list_add_tail there to provide this guarantee.
> 
> Because process1 is walked first by CPU A, the migration entry is
> replaced by the final pte before copy-migration-entry
> runs. Alternatively if copy-migration-entry runs before before
> process1 is walked, the migration entry will be copied and found in
> process 2.
> 

I already explained this doesn't happen and said "I'm sorry".

But considering maintenance, it's not necessary to copy migration ptes,
and we don't have to keep a fundamental risk in the migration circus.

So, I don't say "we don't need this patch."

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-27 21:30   ` Mel Gorman
@ 2010-04-28  0:03     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-28  0:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, Minchan Kim, Christoph Lameter, Andrea Arcangeli,
	Rik van Riel, Andrew Morton

On Tue, 27 Apr 2010 22:30:52 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> During exec(), a temporary stack is setup and moved later to its final
> location. There is a race between migration and exec whereby a migration
> PTE can be placed in the temporary stack. When this VMA is moved under the
> lock, migration no longer knows where the PTE is, fails to remove the PTE
> and the migration PTE gets copied to the new location.  This later causes
> a bug when the migration PTE is discovered but the page is not locked.
> 
> This patch handles the situation by removing the migration PTE when page
> tables are being moved in case migration fails to find them. The alternative
> would require significant modification to vma_adjust() and the locks taken
> to ensure a VMA move and page table copy is atomic with respect to migration.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Mel, I don't like this fix. Consider the following:

 1. try_to_unmap(oldpage)
 2. copy and replace
 3. remove_migration_ptes(oldpage, newpage)

What this patch handles is the case where step 3, remove_migration_ptes(),
fails to remap the page and the migration PTE remains behind. The fact
that the new page is not mapped means get_page() has not been called
against the new page, so the new page may already have been freed by the
time we restart move_ptes().

I bet calling __get_user_pages_fast() before vma_adjust() is the way to go.
When page_count(page) != page_mapcount(page) + 1, migration skips the page.
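A rough, untested sketch of that idea (__get_user_pages_fast() is the
existing fast GUP helper; pin_stack_pages()/unpin_stack_pages() are made
up here purely for illustration):

	/*
	 * Pin every page in [start, end) before vma_adjust() so that
	 * migration sees page_count(page) != page_mapcount(page) + 1
	 * and skips the page while the vma and the page tables are
	 * temporarily out of sync.
	 */
	static int pin_stack_pages(unsigned long start, unsigned long end,
				   struct page **pages)
	{
		int nr_pages = (end - start) >> PAGE_SHIFT;

		/* fast variant: walks the page tables, no mmap_sem games */
		return __get_user_pages_fast(start, nr_pages, 1, pages);
	}

	static void unpin_stack_pages(struct page **pages, int nr)
	{
		while (nr--)
			put_page(pages[nr]);	/* drop the extra references */
	}

The extra reference taken by the pin is not accounted as a mapping, which
is exactly what makes the page_count()/page_mapcount() test fire.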

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-28  0:03     ` KAMEZAWA Hiroyuki
@ 2010-04-28  0:08       ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28  0:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 09:03:02AM +0900, KAMEZAWA Hiroyuki wrote:
> I bet calling __get_user_pages_fast() before vma_adjust() is the way to go. 
> When page_count(page) != page_mapcount(page) +1, migration skip it.

My proposed fix avoids walking the page tables one more time and
mangling the page counts. Can you check it? It works, but it needs more
review.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 0/3] Fix migration races in rmap_walk() V2
  2010-04-27 22:32     ` Andrea Arcangeli
@ 2010-04-28  0:13       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-28  0:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Mel Gorman, Linux-MM, LKML, Minchan Kim,
	Rik van Riel, Andrew Morton

On Wed, 28 Apr 2010 00:32:42 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Tue, Apr 27, 2010 at 05:27:36PM -0500, Christoph Lameter wrote:
> > Can we simply wait like in the fault path?
> 
> There is no bug there, no need to wait either. I already audited it
> before, and I didn't see any bug. Unless you can show a bug with CPU A
> running the rmap_walk on process1 before process2, there is no bug to
> fix there.
> 
I think there is no bug, either. But that safety is fragile.


> > 
> > > Patch 3 notes that while a VMA is moved under the anon_vma lock, the page
> > > 	tables are not similarly protected. Where migration PTEs are
> > > 	encountered, they are cleaned up.
> > 
> > This means they are copied / moved etc and "cleaned" up in a state when
> > the page was unlocked. Migration entries are not supposed to exist when
> > a page is not locked.
> 
> patch 3 is real, and the first thought I had was to lock down the page
> before running vma_adjust and unlock after move_page_tables. But these
> are virtual addresses. Maybe there's a simpler way to keep migration
> away while we run those two operations.
> 

Doing some check in move_ptes() after vma_adjust() is not safe.
IOW, while the vma's information and the information in the page table
are inconsistent... objrmap is broken and migration will cause a panic.

Then... I think there are 2 ways:
  1. use a seqcounter in "mm_struct" as in the previous patch and take it at mremap,
or
  2. get_user_pages_fast() when doing the remap.
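A minimal sketch of option 1, purely as an illustration (mremap_seq is a
hypothetical new field in mm_struct; the seqcount primitives are the
standard ones from <linux/seqlock.h>):

	/* hypothetical new field in struct mm_struct */
	seqcount_t mremap_seq;

	/* writer side (mremap/exec), serialized by mmap_sem held for write */
	write_seqcount_begin(&mm->mremap_seq);
	/* vma_adjust(), move_page_tables(), ... */
	write_seqcount_end(&mm->mremap_seq);

	/* reader side (rmap_walk): retry if a remap raced with the walk */
	unsigned seq;

	do {
		seq = read_seqcount_begin(&mm->mremap_seq);
		/* look up vma_address() and the pte for this mm ... */
	} while (read_seqcount_retry(&mm->mremap_seq, seq));

The walker would have to sample the seqcount of whichever mm owns the
vma it is currently visiting, so this is sketch-level only.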

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 1/3] mm,migration: During fork(), wait for migration to end if migration PTE is encountered
  2010-04-27 23:52       ` KAMEZAWA Hiroyuki
@ 2010-04-28  0:18         ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28  0:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 08:52:03AM +0900, KAMEZAWA Hiroyuki wrote:
> I already explained this doesn't happend and said "I'm sorry".

Oops I must have overlooked it sorry! I just seen the trace quoted in
the comment of the patch and that at least would need correction
before it can be pushed in mainline, or it creates huge confusion to
see a reverse trace for CPU A for an already tricky piece of code.

> But considering maintainance, it's not necessary to copy migration ptes
> and we don't have to keep a fundamental risks of migration circus.
> 
> So, I don't say "we don't need this patch."

split_huge_page also has the same requirement and there is no bug to
fix, so I don't see why to make special changes for just migrate.c
when we still have to list_add_tail for split_huge_page.

Furthermore this patch isn't fixing anything in any case and it looks
a noop to me. If the order ever gets inverted, and process2 ptes are
scanned before process1 ptes in the rmap_walk, sure the
copy-page-tables will break and stop until the process1 rmap_walk will
complete, but that is not enough! You have to repeat the rmap_walk of
process1 if the order ever gets inverted and this isn't happening in
the patch so I don't see how it could make any difference even just
for migrate.c (obviously not for split_huge_page).

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 1/3] mm,migration: During fork(), wait for migration to end if migration PTE is encountered
  2010-04-28  0:18         ` Andrea Arcangeli
@ 2010-04-28  0:19           ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28  0:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 02:18:21AM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 28, 2010 at 08:52:03AM +0900, KAMEZAWA Hiroyuki wrote:
> > I already explained this doesn't happend and said "I'm sorry".
> 
> Oops I must have overlooked it sorry! I just seen the trace quoted in
> the comment of the patch and that at least would need correction
> before it can be pushed in mainline, or it creates huge confusion to
> see a reverse trace for CPU A for an already tricky piece of code.
> 
> > But considering maintainance, it's not necessary to copy migration ptes
> > and we don't have to keep a fundamental risks of migration circus.
> > 
> > So, I don't say "we don't need this patch."
> 
> split_huge_page also has the same requirement and there is no bug to
> fix, so I don't see why to make special changes for just migrate.c
> when we still have to list_add_tail for split_huge_page.
> 
> Furthermore this patch isn't fixing anything in any case and it looks
> a noop to me. If the order ever gets inverted, and process2 ptes are
> scanned before process1 ptes in the rmap_walk, sure the
> copy-page-tables will break and stop until the process1 rmap_walk will
> complete, but that is not enough! You have to repeat the rmap_walk of
> process1 if the order ever gets inverted and this isn't happening in
  ^^^^^^^2
> the patch so I don't see how it could make any difference even just
> for migrate.c (obviously not for split_huge_page).

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 0/3] Fix migration races in rmap_walk() V2
  2010-04-28  0:13       ` KAMEZAWA Hiroyuki
@ 2010-04-28  0:20         ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28  0:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Lameter, Mel Gorman, Linux-MM, LKML, Minchan Kim,
	Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 09:13:45AM +0900, KAMEZAWA Hiroyuki wrote:
> Doing some check in move_ptes() after vma_adjust() is not safe.
> IOW, while the vma's information and the information in the page table
> are inconsistent... objrmap is broken and migration will cause a panic.
>
> Then... I think there are 2 ways:
>   1. use a seqcounter in "mm_struct" as in the previous patch and take it at mremap,
> or
>   2. get_user_pages_fast() when doing the remap.

3. take the anon_vma->lock
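For what it's worth, option 3 would look roughly like the exec.c fix
quoted earlier in this thread, just applied around the pte move in
mm/mremap.c (a sketch only, against the 2.6.33-style anon_vma spinlock):

	struct anon_vma *anon_vma = vma->anon_vma;

	if (anon_vma)
		spin_lock(&anon_vma->lock);	/* keep rmap_walk out */

	/* ... move the ptes from the old range to the new range ... */

	if (anon_vma)
		spin_unlock(&anon_vma->lock);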

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 1/3] mm,migration: During fork(), wait for migration to end if migration PTE is encountered
  2010-04-28  0:19           ` Andrea Arcangeli
@ 2010-04-28  0:28             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-28  0:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, 28 Apr 2010 02:19:11 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Wed, Apr 28, 2010 at 02:18:21AM +0200, Andrea Arcangeli wrote:
> > On Wed, Apr 28, 2010 at 08:52:03AM +0900, KAMEZAWA Hiroyuki wrote:
> > > I already explained this doesn't happend and said "I'm sorry".
> > 
> > Oops I must have overlooked it sorry! I just seen the trace quoted in
> > the comment of the patch and that at least would need correction
> > before it can be pushed in mainline, or it creates huge confusion to
> > see a reverse trace for CPU A for an already tricky piece of code.
> > 
> > > But considering maintainance, it's not necessary to copy migration ptes
> > > and we don't have to keep a fundamental risks of migration circus.
> > > 
> > > So, I don't say "we don't need this patch."
> > 
> > split_huge_page also has the same requirement and there is no bug to
> > fix, so I don't see why to make special changes for just migrate.c
> > when we still have to list_add_tail for split_huge_page.
> > 
> > Furthermore this patch isn't fixing anything in any case and it looks
> > a noop to me. If the order ever gets inverted, and process2 ptes are
> > scanned before process1 ptes in the rmap_walk, sure the
> > copy-page-tables will break and stop until the process1 rmap_walk will
> > complete, but that is not enough! You have to repeat the rmap_walk of
> > process1 if the order ever gets inverted and this isn't happening in
>   ^^^^^^^2

Why do we have to remove by rmap_walk() a migration_pte which doesn't exist?

Anyway, I agree there is no oops. But there are risks, because migration
is a feature which people don't tend to take care of (like memcg ;).
I like a conservative approach for this kind of feature.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-28  0:08       ` Andrea Arcangeli
@ 2010-04-28  0:36         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-28  0:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, 28 Apr 2010 02:08:33 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Wed, Apr 28, 2010 at 09:03:02AM +0900, KAMEZAWA Hiroyuki wrote:
> > I bet calling __get_user_pages_fast() before vma_adjust() is the way to go. 
> > When page_count(page) != page_mapcount(page) +1, migration skip it.
> 
> My proposed fix avoids to walk the pagetables once more time and to
> mangle over the page counts. Can you check it? It works but it needs
> more review.

Sure... but can we avoid the temporary objrmap breakage (inconsistency)
with it?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-27 22:58       ` Andrea Arcangeli
@ 2010-04-28  0:39         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-28  0:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, 28 Apr 2010 00:58:52 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Wed, Apr 28, 2010 at 12:30:04AM +0200, Andrea Arcangeli wrote:
> > I'll now evaluate the fix and see if I can find any other
> > way to handle this.
> 
> 
> I think a better fix for bug mentioned in patch 3, is like below. This
> seems to work fine on aa.git with the old (stable) 2.6.33 anon-vma
> code. Not sure if this also works with the new anon-vma code in
> mainline but at first glance I think it should. At that point we
> should be single threaded so it shouldn't matter if anon_vma is
> temporary null.
> 
> Then you've to re-evaluate the vma_adjust fixes for mainline-only in
> patch 2 at the light of the below (I didn't check patch 2 in detail).
> 
> Please try to reproduce with the below applied.
> 
> ----
> Subject: fix race between shift_arg_pages and rmap_walk
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> migrate.c requires rmap to be able to find all ptes mapping a page at
> all times, otherwise the migration entry can be instantiated, but it
> can't be removed if the second rmap_walk fails to find the page.
> 
> So shift_arg_pages must run atomically with respect of rmap_walk, and
> it's enough to run it under the anon_vma lock to make it atomic.
> 
> And split_huge_page() will have the same requirements as migrate.c
> already has.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Seems nice.

I'll test this but I think we need to take care of do_mremap(), too.
And it's more complicated....

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 1/3] mm,migration: During fork(), wait for migration to end if migration PTE is encountered
  2010-04-28  0:28             ` KAMEZAWA Hiroyuki
@ 2010-04-28  0:59               ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28  0:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 09:28:02AM +0900, KAMEZAWA Hiroyuki wrote:
> Why do we have to remove by rmap_walk() a migration_pte which doesn't exist?

I had already thought about this for split_huge_page: because
split_huge_page, which already waits inside copy_huge_pmd if any
pmd_trans_splitting bit is set, doesn't remove the requirement for
list_add_tail in anon_vma_[chain_]link, this patch doesn't remove it
either.

But it's more complex than my previous explanation now that I think
about it more, so no wonder it's not clear yet (my fault).

So the thing is: if you patch it this way, the second rmap_walk, run by
remove_migration_ptes, will be safe even if the ordering is inverted,
but the first rmap_walk, the one that establishes the migration entry,
will still break if the rmap walk scans the child before the parent.

The only objective of the patch is to remove the list_add_tail
requirement but I'll show that it's still required by the first
rmap_walk in try_to_unmap below:

CPU A	    		  	      	CPU B
				      	fork
try_to_unmap
try to set migration entry in child (null pte still, nothing done)
					copy pte and map page in child pte (no
				      	migration entry set on child)
set migration entry in parent

parent ends up with migration entry but child not, breaking migration
in another way. so even with the patch, the only way to be safe is
that rmap_walk always scans the parent _first_.
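The ordering guarantee relied on here comes from how a new vma is linked
into the anon_vma list; roughly, from memory of the 2.6.34-era
mm/rmap.c, so treat the details as approximate:

	static void anon_vma_chain_link(struct vm_area_struct *vma,
					struct anon_vma_chain *avc,
					struct anon_vma *anon_vma)
	{
		avc->vma = vma;
		avc->anon_vma = anon_vma;
		list_add(&avc->same_vma, &vma->anon_vma_chain);
		/* tail insertion keeps the parent ahead of its children */
		list_add_tail(&avc->same_anon_vma, &anon_vma->head);
	}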

So this patch is a noop from every angle; it just moves the ordering
requirement from remove_migration_ptes to try_to_unmap, but it doesn't
remove it as far as I can tell. So it only slows down fork a bit for no
good reason.

split_huge_page is about the same. If the ordering is inverted,
split_huge_page could set a splitting pmd in the parent only, like the
above try_to_unmap would set the migration pte in the parent only.

> Anyway, I agree there are no oops. But there are risks because migration is
> a feature which people don't tend to take care of (as memcg ;)
> I like conservative approach for this kind of features.

Don't worry, migrate will run 24/7 the moment THP is in ;).

What migration really changed (and it's beneficial, so split_huge_page
isn't introducing a new requirement) is that rmap has to be exact at all
times. If only swap were using it, nothing would go wrong from
vma_adjust/move_page_tables/vma_adjust not being in an atomic section
against the rmap_walk (luckily, the anon_vma lock of the process stack
vma is enough to make it atomic against rmap_walk).

On a side note: split_huge_page isn't using migration because:

1) migrate.c can't deal with pmd or huge pmd (or compound pages at all)
2) we need it to split in place for gup (used by futex, o_direct,
   vmware workstation etc... and they definitely can't be slowed down
   by having to split the hugepage before gup returns, the only one
   that wouldn't be able to prevent gup splitting the hugepage without
   causing issues to split_huge_page is kvm thanks to the mmu
   notifier, but to do so it'd require to introduce
   a gup_fast variant that wouldn't be increasing the page count, but
   because of o_direct and friends we can't avoid to deal with staying
   in place in split_huge_page), while migrate
   by definition is about migrating something from here to there,
   not doing anything in place
3) we need to convert page structures from compound to not compound in
   place (besides the fact that the physical memory is in place too),
   inside of the two rmap walks, which is definitely specialized
   enough not to fit into the migrate framework

But memory compaction is the definitive user of migration code, as
it's only moving not-yet-huge pages from here to there, and THP
will run memory compaction every time we miss a hugepage in the buddy.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-28  0:39         ` KAMEZAWA Hiroyuki
@ 2010-04-28  1:05           ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28  1:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 09:39:48AM +0900, KAMEZAWA Hiroyuki wrote:
> Seems nice.

What did you mean by objrmap inconsistency? I think this is single
threaded there; userland hasn't run yet and I don't think page faults
can run. Maybe it's safer to add a VM_BUG_ON(vma->anon_vma) just
before vma->anon_vma = anon_vma, to be sure nothing ran in between.

> I'll test this but I think we need to take care of do_mremap(), too.
> And it's more complicated....

do_mremap has to be safe by:

1) adjusting page->index atomically with the pte updates inside pt
   lock (while it moves from one pte to another)

2) having both vmas src and dst (not overlapping) indexed in the
   proper anon_vmas before move_page_table runs

As long as the areas don't overlap it shouldn't be difficult to enforce
the above two invariants. exec.c is magic because it works on
overlapping areas: src and dst are the same vma and it's indexed into
just one anon-vma. So we have to stop the rmap_walks before we mangle
over the vma with vma_adjust and move_page_tables and truncate the end
of the vma with vma_adjust again, and finally we resume the rmap_walks.

I'm not entirely sure of the above, so review greatly appreciated ;)

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-28  1:05           ` Andrea Arcangeli
@ 2010-04-28  1:09             ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28  1:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 03:05:43AM +0200, Andrea Arcangeli wrote:
> 1) adjusting page->index atomically with the pte updates inside pt
>    lock (while it moves from one pte to another)

Actually, no need for this at all! Of course the dst vma will have
vma->vm_pgoff adjusted instead... never mind. So I don't see a problem
there.

I think this is very specific to how exec.c abuses move_page_tables by
passing the same vma as src and dst: it obviously cannot be indexed in
two anon-vmas because there's a single vma and a single vma->anon_vma,
and src and dst obviously cannot have two different vm_pgoff values
either, because there's a single vma and there can't be two different
vma->vm_pgoff.

So I'm very hopeful do_mremap is already fully safe...

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-28  1:05           ` Andrea Arcangeli
@ 2010-04-28  1:18             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-28  1:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, 28 Apr 2010 03:05:43 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Wed, Apr 28, 2010 at 09:39:48AM +0900, KAMEZAWA Hiroyuki wrote:
> > Seems nice.
> 
> What did you mean with objrmap inconsistency? I think this is single
> threaded there, userland didn't run yet and I don't think page faults
> could run. Maybe it's safer to add a VM_BUG_ON(vma->anon_vma) just
> before vma->anon_vma = anon_vma to be sure nothing run in between.
> 
I mean the following relationship:

	vma_address(vma, page1) <-> address <-> pte <-> page2
	
	If page1 == page2, objrmap is consistent.
	If page1 != page2, objrmap is inconsistent.
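To make that concrete, rmap derives the address from the vma fields
roughly as follows (an approximation of vma_address() in mm/rmap.c), so
vm_start/vm_pgoff and the page tables have to agree for page1 == page2
to hold:

	static unsigned long vma_address(struct page *page,
					 struct vm_area_struct *vma)
	{
		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
		unsigned long address;

		address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
		if (unlikely(address < vma->vm_start || address >= vma->vm_end))
			return -EFAULT;	/* the page doesn't map into this vma */
		return address;
	}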

> > I'll test this but I think we need to take care of do_mremap(), too.
> > And it's more complicated....
> 
> do_mremap has to be safe by:
> 
> 1) adjusting page->index atomically with the pte updates inside pt
>    lock (while it moves from one pte to another)
> 
I reviewed do_mremap again, and I think it's safe.


	new_vma = copy_vma();
	....                   ----------(*)
	move_ptes().
	munmap unnecessary range.

At (*), if page1 != page2, rmap_walk will not run correctly.
But it seems copy_vma() keeps page1 == page2.

As I reported, when there is a problem, vma_address() returns an address but
that address doesn't contain the migration pte.

BTW, page->index is not updated; we just keep [start_address, pgoff] at a
sane value.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-27 22:58       ` Andrea Arcangeli
@ 2010-04-28  1:29         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-28  1:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, 28 Apr 2010 00:58:52 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Wed, Apr 28, 2010 at 12:30:04AM +0200, Andrea Arcangeli wrote:
> > I'll now evaluate the fix and see if I can find any other
> > way to handle this.
> 
> 
> I think a better fix for bug mentioned in patch 3, is like below. This
> seems to work fine on aa.git with the old (stable) 2.6.33 anon-vma
> code. Not sure if this also works with the new anon-vma code in
> mainline but at first glance I think it should. At that point we
> should be single threaded so it shouldn't matter if anon_vma is
> temporary null.
> 
> Then you've to re-evaluate the vma_adjust fixes for mainline-only in
> patch 2 at the light of the below (I didn't check patch 2 in detail).
> 
> Please try to reproduce with the below applied.
> 
> ----
> Subject: fix race between shift_arg_pages and rmap_walk
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> migrate.c requires rmap to be able to find all ptes mapping a page at
> all times, otherwise the migration entry can be instantiated, but it
> can't be removed if the second rmap_walk fails to find the page.
> 
> So shift_arg_pages must run atomically with respect of rmap_walk, and
> it's enough to run it under the anon_vma lock to make it atomic.
> 
> And split_huge_page() will have the same requirements as migrate.c
> already has.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Hmm.. Mel's patch 2/3 takes vma->anon_vma->lock in vma_adjust(),
so this patch clears vma->anon_vma...

Some comments below.

> ---
> 
> diff --git a/fs/exec.c b/fs/exec.c
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -55,6 +55,7 @@
>  #include <linux/fsnotify.h>
>  #include <linux/fs_struct.h>
>  #include <linux/pipe_fs_i.h>
> +#include <linux/rmap.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -503,6 +504,7 @@ static int shift_arg_pages(struct vm_are
>  	unsigned long new_start = old_start - shift;
>  	unsigned long new_end = old_end - shift;
>  	struct mmu_gather *tlb;
> +	struct anon_vma *anon_vma;
>  
>  	BUG_ON(new_start > new_end);
>  
> @@ -513,6 +515,12 @@ static int shift_arg_pages(struct vm_are
>  	if (vma != find_vma(mm, new_start))
>  		return -EFAULT;
>  
> +	anon_vma = vma->anon_vma;
> +	/* stop rmap_walk or it won't find the stack pages */

	/*
	 * We adjust the vma and move the page tables in sequence. While
	 * updating, the (vma, page) <-> address <-> pte relationship is
	 * unstable. We hold anon_vma->lock to keep rmap_walk() safe.
	 * (see mm/rmap.c)
	 */


> +	spin_lock(&anon_vma->lock);
> +	/* avoid vma_adjust to take any further anon_vma lock */
> +	vma->anon_vma = NULL;
> +
>  	/*
>  	 * cover the whole range: [new_start, old_end)
>  	 */
> @@ -551,6 +559,9 @@ static int shift_arg_pages(struct vm_are
>  	 */
>  	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
>  
> +	vma->anon_vma = anon_vma;
> +	spin_unlock(&anon_vma->lock);
> +

I think we can unlock this just after move_page_tables().


Thanks,
-kame


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-28  1:18             ` KAMEZAWA Hiroyuki
@ 2010-04-28  1:36               ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28  1:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 10:18:58AM +0900, KAMEZAWA Hiroyuki wrote:
> BTW, page->index is not updated, we just keep [start_address, pgoff] to be
> sane value.

Yes, I corrected myself too; the end result is the same, and adjusting
the new vma's vm_pgoff available in the anon-vma list is obviously
faster and simpler. When the src and dst ranges don't overlap and
src_vma != dst_vma (vma, new_vma in the code), things are a lot simpler
there...
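
(This is roughly what move_vma() arranges for the regular mremap path: the
destination vma gets a vm_pgoff computed so that the unchanged page->index
still resolves to the right address in the new vma. A sketch from memory of
that era's mm/mremap.c, not a verbatim quote:)

	/* move_vma(): pick new_pgoff so page->index stays valid in new_vma */
	new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff);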

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-28  1:29         ` KAMEZAWA Hiroyuki
@ 2010-04-28  1:44           ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28  1:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 10:29:28AM +0900, KAMEZAWA Hiroyuki wrote:
> Hmm..Mel's patch 2/3 takes vma->anon_vma->lock in vma_adjust(),
> so this patch clears vma->anon_vma...

Yep, it should be safe with patch 2 applied too. And I'm unsure why Mel's
patch locks the anon_vma also when vm_start != start. See the other
email I sent about patch 2.

> I think we can unlock this just after move_page_tables().

Checking this, I can't see where exactly vma->vm_pgoff is adjusted
during the atomic section I protected with the anon_vma->lock.
For a moment it looks like these pages become unmovable.

I guess this is why I initially thought it was move_page_tables that
adjusted page->index. If it doesn't, then vma->vm_pgoff has to be moved
down by shift >> PAGE_SHIFT, and that doesn't seem to be happening,
which would be an unrelated bug.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-28  1:44           ` Andrea Arcangeli
@ 2010-04-28  2:12             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-28  2:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, 28 Apr 2010 03:44:34 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Wed, Apr 28, 2010 at 10:29:28AM +0900, KAMEZAWA Hiroyuki wrote:
> > Hmm..Mel's patch 2/3 takes vma->anon_vma->lock in vma_adjust(),
> > so this patch clears vma->anon_vma...
> 
> yep, it should be safe with patch 2 applied too. And I'm unsure why Mel's
> patch locks the anon_vma also when vm_start != start. See the other
> email I sent about patch 2.
> 
> > I think we can unlock this just after move_page_tables().
> 
> Checking this, I can't see where exactly is vma->vm_pgoff adjusted
> during the atomic section I protected with the anon_vma->lock?
> For a moment it looks like these pages become unmovable.
> 
The page can be replaced with a migration pte before the 1st vma_adjust.

The key is the
	(vma, page) <-> address <-> pte <-> page
relationship.

	vma_adjust()
	(*)
	move_page_tables();
	(**)
	vma_adjust();

At (*), vma_address(vma, page) returns a _new_ address, but the pte is not
updated. This is critical for rmap_walk. We're safe at (**).
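
(Spelled out, at (*) the walk does roughly the following; a simplified sketch
of rmap_walk_anon() calling remove_migration_pte(), not the exact mm/rmap.c /
mm/migrate.c code, and using the old-style anon_vma list for brevity:)

	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
		address = vma_address(new_page, vma); /* already the _new_ address at (*) */
		if (address == -EFAULT)
			continue;
		/*
		 * Looks up the pte at 'address', finds no pte there yet, so
		 * the migration pte still sitting at the pre-move address is
		 * never converted back, and a later fault hits the BUG.
		 */
		remove_migration_pte(new_page, vma, address, old_page);
	}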


> I guess this is why I thought initially that it was move_page_tables
> to adjust the page->index. If it doesn't then the vma->vm_pgoff has to
> be moved down of shift >>PAGE_SHIFT and it doesn't seem to be
> happening which is an unrelated bug.
> 

Anyway, I have no strong opinion about the placement of unlock(anon_vma->lock).

I wonder whether the reason we don't see this when testing memory hotplug
is that memory hotplug disables new page allocation in the migration range,
so this exec() path rarely gets hold of a page that can be a migration target.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-28  2:12             ` KAMEZAWA Hiroyuki
@ 2010-04-28  2:42               ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28  2:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 11:12:48AM +0900, KAMEZAWA Hiroyuki wrote:
> The page can be replaced with migration_pte before the 1st vma_adjust.
> 
> The key is 
> 	(vma, page) <-> address <-> pte <-> page
> relationship.
> 
> 	vma_adjust() 
> 	(*)
> 	move_pagetables();
> 	(**)
> 	vma_adjust();
> 
> At (*), vma_address(vma, page) returns a _new_ address. But pte is not
> updated. This is critical for rmap_walk. We're safe at (**).

Yes, I agree we can move the unlock to (**) because the last vma_adjust
is only there to truncate vm_end. In fact it looks super
heavyweight to call vma_adjust for that instead of just using
vma->vm_end = new_end, considering we're under mmap_sem, fully anonymous,
etc... In fact I think even the first vma_adjust looks too
heavyweight, and it doesn't bring any simplicity or added safety
considering this works in place and there's nothing to wonder about
vm_next or vma_merge or vm_file or anything else vma_adjust is good at.

So the confusion I had about vm_pgoff is because everything that moves
vm_start down also moves vm_pgoff down, like the stack growsdown code, but
of course those don't move the pages down too; so we must not alter
vm_pgoff here, only vm_start, along with the page tables, inside the
anon_vma lock to be fully safe. Also I forgot to unlock in case of
-ENOMEM ;)
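
(Making the arithmetic explicit, with the simplified vma_address() form
sketched earlier in the thread; an editorial example, not from the original
mail. A stack page with index I starts out at

	addr_old = vm_start_old + ((I - vm_pgoff) << PAGE_SHIFT)

and after vm_start and the pte have both been moved down by 'shift' with
vm_pgoff left untouched,

	addr_new = (vm_start_old - shift) + ((I - vm_pgoff) << PAGE_SHIFT)
	         = addr_old - shift

which is exactly where the pte now lives, so rmap keeps finding it. Adjusting
vm_pgoff as the growsdown code does would break this, because here the pages
really do move.)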

This is a new try; the rest is for a later time... hope this helps!

Thanks!

----
Subject: fix race between shift_arg_pages and rmap_walk

From: Andrea Arcangeli <aarcange@redhat.com>

migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.

So shift_arg_pages must run atomically with respect of rmap_walk, and
it's enough to run it under the anon_vma lock to make it atomic.

And split_huge_page() will have the same requirements as migrate.c
already has.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/fsnotify.h>
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -502,6 +503,7 @@ static int shift_arg_pages(struct vm_are
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
+	unsigned long moved_length;
 	struct mmu_gather *tlb;
 
 	BUG_ON(new_start > new_end);
@@ -514,16 +516,26 @@ static int shift_arg_pages(struct vm_are
 		return -EFAULT;
 
 	/*
+	 * Stop the rmap walk or it won't find the stack pages, we've
+	 * to keep the lock hold until all pages are moved to the new
+	 * vm_start so their page->index will be always found
+	 * consistent with the unchanged vm_pgoff.
+	 */
+	spin_lock(&vma->anon_vma->lock);
+
+	/*
 	 * cover the whole range: [new_start, old_end)
 	 */
-	vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL);
+	vma->vm_start = new_start;
 
 	/*
 	 * move the page tables downwards, on failure we rely on
 	 * process cleanup to remove whatever mess we made.
 	 */
-	if (length != move_page_tables(vma, old_start,
-				       vma, new_start, length))
+	moved_length = move_page_tables(vma, old_start,
+					vma, new_start, length);
+	spin_unlock(&vma->anon_vma->lock);
+	if (length != moved_length) 
 		return -ENOMEM;
 
 	lru_add_drain();
@@ -549,7 +561,7 @@ static int shift_arg_pages(struct vm_are
 	/*
 	 * shrink the vma to just the new range.
 	 */
-	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
+	vma->vm_end = new_end;
 
 	return 0;
 }

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-28  2:42               ` Andrea Arcangeli
@ 2010-04-28  2:49                 ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-28  2:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, 28 Apr 2010 04:42:27 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Wed, Apr 28, 2010 at 11:12:48AM +0900, KAMEZAWA Hiroyuki wrote:
> > The page can be replaced with migration_pte before the 1st vma_adjust.
> > 
> > The key is 
> > 	(vma, page) <-> address <-> pte <-> page
> > relationship.
> > 
> > 	vma_adjust() 
> > 	(*)
> > 	move_pagetables();
> > 	(**)
> > 	vma_adjust();
> > 
> > At (*), vma_address(vma, page) returns a _new_ address. But pte is not
> > updated. This is critical for rmap_walk. We're safe at (**).
> 
> Yes I agree we can move the unlock at (**) because the last vma_adjust
> is only there to truncate the vm_end. In fact it looks super
> heavyweight to call vma_adjust for that instead of just using
> vma->vm_end = new_end considering we're under mmap_sem, full anonymous
> etc... In fact I think even the first vma_adjust looks too
> heavyweight and it doesn't bring any simplicity or added safety
> considering this works in place and there's nothing to wonder about
> vm_next or vma_merge or vm_file or anything that vma_adjust is good at.
> 
> So the confusion I had about vm_pgoff is because all things that moves
> vm_start down, also move vm_pgoff down like stack growsdown but of
> course those don't move the pages down too, so we must not alter
> vm_pgoff here just vm_start along with the pagetables inside the
> anon_vma lock to be fully safe. Also I forgot to unlock in case of
> -ENOMEM ;)
> 
> this is a new try, next is for a later time... hope this helps!
> 
> Thanks!
> 
> ----
> Subject: fix race between shift_arg_pages and rmap_walk
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> migrate.c requires rmap to be able to find all ptes mapping a page at
> all times, otherwise the migration entry can be instantiated, but it
> can't be removed if the second rmap_walk fails to find the page.
> 
> So shift_arg_pages must run atomically with respect of rmap_walk, and
> it's enough to run it under the anon_vma lock to make it atomic.
> 
> And split_huge_page() will have the same requirements as migrate.c
> already has.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Seems good.
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

I'll test this and report if I see trouble again.

Unfortunately, I'll have a week of holidays (in Japan) from 4/29 to 5/05
and my office is nearly closed, so please take no mail from me as a good
sign.


Thanks,
-Kame


> ---
> 
> diff --git a/fs/exec.c b/fs/exec.c
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -55,6 +55,7 @@
>  #include <linux/fsnotify.h>
>  #include <linux/fs_struct.h>
>  #include <linux/pipe_fs_i.h>
> +#include <linux/rmap.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -502,6 +503,7 @@ static int shift_arg_pages(struct vm_are
>  	unsigned long length = old_end - old_start;
>  	unsigned long new_start = old_start - shift;
>  	unsigned long new_end = old_end - shift;
> +	unsigned long moved_length;
>  	struct mmu_gather *tlb;
>  
>  	BUG_ON(new_start > new_end);
> @@ -514,16 +516,26 @@ static int shift_arg_pages(struct vm_are
>  		return -EFAULT;
>  
>  	/*
> +	 * Stop the rmap walk or it won't find the stack pages, we've
> +	 * to keep the lock hold until all pages are moved to the new
> +	 * vm_start so their page->index will be always found
> +	 * consistent with the unchanged vm_pgoff.
> +	 */
> +	spin_lock(&vma->anon_vma->lock);
> +
> +	/*
>  	 * cover the whole range: [new_start, old_end)
>  	 */
> -	vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL);
> +	vma->vm_start = new_start;
>  
>  	/*
>  	 * move the page tables downwards, on failure we rely on
>  	 * process cleanup to remove whatever mess we made.
>  	 */
> -	if (length != move_page_tables(vma, old_start,
> -				       vma, new_start, length))
> +	moved_length = move_page_tables(vma, old_start,
> +					vma, new_start, length);
> +	spin_unlock(&vma->anon_vma->lock);
> +	if (length != moved_length) 
>  		return -ENOMEM;
>  
>  	lru_add_drain();
> @@ -549,7 +561,7 @@ static int shift_arg_pages(struct vm_are
>  	/*
>  	 * shrink the vma to just the new range.
>  	 */
> -	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
> +	vma->vm_end = new_end;
>  
>  	return 0;
>  }
> 


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-28  2:49                 ` KAMEZAWA Hiroyuki
@ 2010-04-28  7:28                   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-28  7:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, Mel Gorman, Linux-MM, LKML, Minchan Kim,
	Christoph Lameter, Rik van Riel, Andrew Morton

On Wed, 28 Apr 2010 11:49:44 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 28 Apr 2010 04:42:27 +0200
> Andrea Arcangeli <aarcange@redhat.com> wrote:
 
> > migrate.c requires rmap to be able to find all ptes mapping a page at
> > all times, otherwise the migration entry can be instantiated, but it
> > can't be removed if the second rmap_walk fails to find the page.
> > 
> > So shift_arg_pages must run atomically with respect of rmap_walk, and
> > it's enough to run it under the anon_vma lock to make it atomic.
> > 
> > And split_huge_page() will have the same requirements as migrate.c
> > already has.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> 
> Seems good.
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> I'll test this and report if I see trouble again.
> 
> Unfortunately, I'll have a week of holidays (in Japan) in 4/29-5/05,
> my office is nearly closed. So, please consider no-mail-from-me is
> good information.
> 
Here is the bad news. When move_page_tables() fails, some ptes are moved
but others are not, and... there is no rollback routine.
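
(The failure mode, spelled out: move_page_tables() works one pmd at a time,
so an allocation failure half way through leaves the earlier ptes already at
the new address while the later ones are still at the old one. A simplified
sketch of the existing loop, matching the hunk context in the patch below
rather than the exact mm/mremap.c code:)

	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
		next = (old_addr + PMD_SIZE) & PMD_MASK;
		if (next - 1 > old_end)
			next = old_end;
		extent = next - old_addr;
		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
		if (!old_pmd)
			continue;
		new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
		if (!new_pmd)
			break;	/* ptes before this point have already moved */
		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
			  vma, new_pmd, new_addr);
	}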

I bet the best way to fix this mess is to
 - disable the overlapping move of the arg pages
 - use do_mremap().

But maybe you guys want to fix this directly.
Here is a tentative fix from me. But don't trust me..
==
Subject: fix race between shift_arg_pages and rmap_walk

From: Andrea Arcangeli <aarcange@redhat.com>

migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.

So shift_arg_pages must run atomically with respect of rmap_walk, and
it's enough to run it under the anon_vma lock to make it atomic.

And split_huge_page() will have the same requirements as migrate.c
already has.

And, when moving overlapping ptes with move_page_tables(), the move cannot
be rolled back the way mremap does. This patch changes move_page_tables()'s
behavior so that if it fails, no ptes are moved.

Changelog:
 - modified move_page_tables() to move ptes atomically because
   "some ptes are moved but others are not" is critical for rmap_walk().
 - free page tables on failure rather than leaving it all to do_exit().
   If not, the objrmap will be inconsistent until exit() frees everything.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

---
 fs/exec.c   |   67 +++++++++++++++++++++++++++++++++++++-----------------------
 mm/mremap.c |   28 ++++++++++++++++++++++---
 2 files changed, 67 insertions(+), 28 deletions(-)

Index: mel-test/fs/exec.c
===================================================================
--- mel-test.orig/fs/exec.c
+++ mel-test/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/fsnotify.h>
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -503,7 +504,10 @@ static int shift_arg_pages(struct vm_are
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
+	unsigned long moved_length;
 	struct mmu_gather *tlb;
+	int ret;
+	unsigned long unused_pgd_start, unused_pgd_end, floor, ceiling;
 
 	BUG_ON(new_start > new_end);
 
@@ -517,41 +521,54 @@ static int shift_arg_pages(struct vm_are
 	/*
 	 * cover the whole range: [new_start, old_end)
 	 */
-	if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
-		return -ENOMEM;
+	spin_lock(&vma->anon_vma->lock);
+	vma->vm_start = new_start;
 
-	/*
-	 * move the page tables downwards, on failure we rely on
-	 * process cleanup to remove whatever mess we made.
-	 */
 	if (length != move_page_tables(vma, old_start,
-				       vma, new_start, length))
-		return -ENOMEM;
-
-	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
-	if (new_end > old_start) {
+				       vma, new_start, length)) {
+		vma->vm_start = old_start;
 		/*
-		 * when the old and new regions overlap clear from new_end.
-		 */
-		free_pgd_range(tlb, new_end, old_end, new_end,
-			vma->vm_next ? vma->vm_next->vm_start : 0);
+ 		 * we have to free [new_start, new_start+length] pgds
+ 		 * which we've allocated above.
+ 		 */
+		if (new_end > old_start) {
+			unused_pgd_start = new_start;
+			unused_pgd_end = old_start;
+		} else {
+			unused_pgd_start = new_start;
+			unused_pgd_end = new_end;
+		}
+		floor = new_start;
+		ceiling = old_start;
+		ret = -ENOMEM;
 	} else {
-		/*
-		 * otherwise, clean from old_start; this is done to not touch
-		 * the address space in [new_end, old_start) some architectures
-		 * have constraints on va-space that make this illegal (IA64) -
-		 * for the others its just a little faster.
-		 */
-		free_pgd_range(tlb, old_start, old_end, new_end,
-			vma->vm_next ? vma->vm_next->vm_start : 0);
+		if (new_end > old_start) {
+			unused_pgd_start = new_end;
+			unused_pgd_end = old_end;
+		} else {
+			unused_pgd_start = old_start;
+			unused_pgd_end = old_end;
+		}
+		floor = new_end;
+		if (vma->vm_next)
+			ceiling = vma->vm_next->vm_start;
+		else
+			ceiling = 0;
+		ret = 0;
 	}
+	spin_unlock(&vma->anon_vma->lock);
+
+	lru_add_drain();
+	tlb = tlb_gather_mmu(mm, 0);
+	/* Free unnecessary PGDS */
+	free_pgd_range(tlb, unused_pgd_start, unused_pgd_end, floor, ceiling);
 	tlb_finish_mmu(tlb, new_end, old_end);
 
 	/*
 	 * Shrink the vma to just the new range.  Always succeeds.
 	 */
-	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
+	if (!ret)
+		vma->vm_end = new_end;
 
 	return 0;
 }
Index: mel-test/mm/mremap.c
===================================================================
--- mel-test.orig/mm/mremap.c
+++ mel-test/mm/mremap.c
@@ -134,22 +134,44 @@ unsigned long move_page_tables(struct vm
 {
 	unsigned long extent, next, old_end;
 	pmd_t *old_pmd, *new_pmd;
+	unsigned long from_addr, to_addr;
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
-	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
+	/* At first, copy required pmd in the range */
+	for (from_addr = old_addr, to_addr = new_addr;
+	     from_addr < old_end; from_addr += extent, to_addr += extent) {
 		cond_resched();
 		next = (old_addr + PMD_SIZE) & PMD_MASK;
 		if (next - 1 > old_end)
 			next = old_end;
 		extent = next - old_addr;
-		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
+		old_pmd = get_old_pmd(vma->vm_mm, from_addr);
 		if (!old_pmd)
 			continue;
-		new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
+		new_pmd = alloc_new_pmd(vma->vm_mm, to_addr);
 		if (!new_pmd)
 			break;
+		next = (to_addr + PMD_SIZE) & PMD_MASK;
+		if (extent > next - to_addr)
+			extent = next - to_addr;
+	}
+	/* -ENOMEM ? */
+	if (from_addr < old_end) /* the caller must free remaining pmds. */
+		return 0;
+
+	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
+		cond_resched();
+		next = (old_addr + PMD_SIZE) & PMD_MASK;
+		if (next - 1 > old_end)
+			next = old_end;
+		extent = next - old_addr;
+		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
+		if (!old_pmd)
+			continue;
+		new_pmd = get_new_pmd(vma->vm_mm, new_addr);
+		BUG_ON(!new_pmd);
 		next = (new_addr + PMD_SIZE) & PMD_MASK;
 		if (extent > next - new_addr)
 			extent = next - new_addr;










^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 1/3] mm,migration: During fork(), wait for migration to end if migration PTE is encountered
  2010-04-27 23:52       ` KAMEZAWA Hiroyuki
@ 2010-04-28  8:24         ` Mel Gorman
  -1 siblings, 0 replies; 132+ messages in thread
From: Mel Gorman @ 2010-04-28  8:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 08:52:03AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 28 Apr 2010 00:22:45 +0200
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > Ok I had a first look:
> > 
> > On Tue, Apr 27, 2010 at 10:30:50PM +0100, Mel Gorman wrote:
> > > 	CPUA			CPU B
> > > 				do_fork()
> > > 				copy_mm() (from process 1 to process2)
> > > 				insert new vma to mmap_list (if inode/anon_vma)
> > 
> > Insert to the tail of the anon_vma list...
> > 
> > > 	pte_lock(process1)
> > > 	unmap a page
> > > 	insert migration_entry
> > > 	pte_unlock(process1)
> > > 
> > > 	migrate page copy
> > > 				copy_page_range
> > > 	remap new page by rmap_walk()
> > 
> > rmap_walk will walk process1 first! It's at the head, the vmas with
> > unmapped ptes are at the tail so process1 is walked before process2.
> > 
> > > 	pte_lock(process2)
> > > 	found no pte.
> > > 	pte_unlock(process2)
> > > 				pte lock(process2)
> > > 				pte lock(process1)
> > > 				copy migration entry to process2
> > > 				pte unlock(process1)
> > > 				pte unlokc(process2)
> > > 	pte_lock(process1)
> > > 	replace migration entry
> > > 	to new page's pte.
> > > 	pte_unlock(process1)
> > 
> > rmap_walk has to lock down process1 before process2, this is the
> > ordering issue I already mentioned in earlier email. So it cannot
> > happen and this patch is unnecessary.
> > 
> > The ordering is fundamental and as said anon_vma_link already adds new
> > vmas to the _tail_ of the anon-vma. And this is why it has to add to
> > the tail. If anon_vma_link would add new vmas to the head of the list,
> > the above bug could materialize, but it doesn't so it cannot happen.
> > 
> > In mainline anon_vma_link is called anon_vma_chain_link, see the
> > list_add_tail there to provide this guarantee.
> > 
> > Because process1 is walked first by CPU A, the migration entry is
> > replaced by the final pte before copy-migration-entry
> > runs. Alternatively if copy-migration-entry runs before before
> > process1 is walked, the migration entry will be copied and found in
> > process 2.
> > 
> 
> I already explained this doesn't happen and said "I'm sorry".
> 

And after going through it again, I'm happy that this was a red herring.
The patch is now dropped.

> But considering maintainance, it's not necessary to copy migration ptes
> and we don't have to keep a fundamental risks of migration circus.
> 

Even if it's not strictly necessary, migration should be able to (and in this
case can) find all of its migration ptes. An extra one being copied doesn't
matter as long as it can be found on the chain. It's not like the execve
problem, where a migration PTE gets moved to a place it can't be found.
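
(For completeness, the ordering guarantee relied on here is simply that a
newly forked child's vma is linked at the tail of the anon_vma list, so
rmap_walk always reaches the parent before the child. A minimal sketch of the
mainline helper mentioned above, from memory, so treat it as approximate:)

	static void anon_vma_chain_link(struct vm_area_struct *vma,
					struct anon_vma_chain *avc,
					struct anon_vma *anon_vma)
	{
		avc->vma = vma;
		avc->anon_vma = anon_vma;
		list_add(&avc->same_vma, &vma->anon_vma_chain);
		/* tail insertion: parents are walked before their children */
		list_add_tail(&avc->same_anon_vma, &anon_vma->head);
	}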

> So, I don't say "we don't need this patch."
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-27 21:30   ` Mel Gorman
@ 2010-04-28  8:30     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-28  8:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, Minchan Kim, Christoph Lameter, Andrea Arcangeli,
	Rik van Riel, Andrew Morton

On Tue, 27 Apr 2010 22:30:52 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> During exec(), a temporary stack is setup and moved later to its final
> location. There is a race between migration and exec whereby a migration
> PTE can be placed in the temporary stack. When this VMA is moved under the
> lock, migration no longer knows where the PTE is, fails to remove the PTE
> and the migration PTE gets copied to the new location.  This later causes
> a bug when the migration PTE is discovered but the page is not locked.
> 
> This patch handles the situation by removing the migration PTE when page
> tables are being moved in case migration fails to find them. The alternative
> would require significant modification to vma_adjust() and the locks taken
> to ensure a VMA move and page table copy is atomic with respect to migration.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Here is my final proposal (before going on vacation).

I think this is very simple. The biggest problem is that when move_page_tables()
fails, setup_arg_pages() passes it all on to exit() ;)

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This is a band-aid patch for avoiding unmap->remap of stack pages while the
process is under exec(). At exec, the pages for the stack are moved by
setup_arg_pages(). During this, the (vma,page)<->address relationship can be
in a broken state.
Moreover, if moving the ptes fails, pages with invalid rmap remain in the
page table and objrmap for those pages is completely broken until exit()
frees everything up.

This patch adds vma->unstable_rmap. While unstable_rmap != 0, vma_address()
always returns -EFAULT and try_to_unmap() fails.
(IOW, the pages for the stack are pinned until setup_arg_pages() ends.)

This also prevents page migration because the page's mapcount never goes
to 0 until exec() fixes it up.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/exec.c                |    4 +++-
 include/linux/mm_types.h |    5 +++++
 mm/rmap.c                |    5 +++++
 3 files changed, 13 insertions(+), 1 deletion(-)

Index: mel-test/fs/exec.c
===================================================================
--- mel-test.orig/fs/exec.c
+++ mel-test/fs/exec.c
@@ -250,7 +250,8 @@ static int __bprm_mm_init(struct linux_b
 	err = insert_vm_struct(mm, vma);
 	if (err)
 		goto err;
-
+	/* prevent rmap_walk, try_to_unmap() etc..until we get fixed rmap */
+	vma->unstable_rmap = 1;
 	mm->stack_vm = mm->total_vm = 1;
 	up_write(&mm->mmap_sem);
 	bprm->p = vma->vm_end - sizeof(void *);
@@ -653,6 +654,7 @@ int setup_arg_pages(struct linux_binprm 
 		ret = -EFAULT;
 
 out_unlock:
+	vma->unstable_rmap = 0;
 	up_write(&mm->mmap_sem);
 	return ret;
 }
Index: mel-test/include/linux/mm_types.h
===================================================================
--- mel-test.orig/include/linux/mm_types.h
+++ mel-test/include/linux/mm_types.h
@@ -183,6 +183,11 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+	/*
+ 	 * updated only under down_write(mmap_sem). while this is not 0,
+ 	 * objrmap is not trustable.
+ 	 */
+	int unstable_rmap;
 };
 
 struct core_thread {
Index: mel-test/mm/rmap.c
===================================================================
--- mel-test.orig/mm/rmap.c
+++ mel-test/mm/rmap.c
@@ -332,6 +332,11 @@ vma_address(struct page *page, struct vm
 {
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 	unsigned long address;
+#ifdef CONFIG_MIGRATION
+	/* While unstable_rmap is set, we cannot trust objrmap */
+	if (unlikely(vma->unstable_rmap))
+		return -EFAULT;
+#endif
 
 	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
  2010-04-27 23:10     ` Andrea Arcangeli
@ 2010-04-28  9:15       ` Mel Gorman
  -1 siblings, 0 replies; 132+ messages in thread
From: Mel Gorman @ 2010-04-28  9:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 01:10:07AM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 27, 2010 at 10:30:51PM +0100, Mel Gorman wrote:
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index f90ea92..61d6f1d 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -578,6 +578,9 @@ again:			remove_next = 1 + (end > next->vm_end);
> >  		}
> >  	}
> >  
> > +	if (vma->anon_vma)
> > +		spin_lock(&vma->anon_vma->lock);
> > +
> >  	if (root) {
> >  		flush_dcache_mmap_lock(mapping);
> >  		vma_prio_tree_remove(vma, root);
> > @@ -620,6 +623,9 @@ again:			remove_next = 1 + (end > next->vm_end);
> >  	if (mapping)
> >  		spin_unlock(&mapping->i_mmap_lock);
> >  
> > +	if (vma->anon_vma)
> > +		spin_unlock(&vma->anon_vma->lock);
> > +
> >  	if (remove_next) {
> >  		if (file) {
> >  			fput(file);
> 
> The old code did:
> 
>     /*
>      * When changing only vma->vm_end, we don't really need
>      * anon_vma lock.
>      */
>     if (vma->anon_vma && (insert || importer || start !=  vma->vm_start))
> 	anon_vma = vma->anon_vma;
>     if (anon_vma) {
>         spin_lock(&anon_vma->lock);
> 
> why did it become unconditional? (and no idea why it was removed)
> 

It became unconditional because I wasn't sure how the optimisation interacted
with the new anon_vma changes (it doesn't matter, it should have been safe).
At the time the patch was introduced, the bug looked like a race where VMAs
on the list had their details modified. I thought vma_address() was returning
-EFAULT when it shouldn't and, while this may still be possible, it wasn't
the prime cause of the bug.

The more important race was in execve, between when a VMA gets moved and
when the page tables are copied. The anon_vma locks are fine for the VMA move
but the page table copy happens later. What the patch did was alter the
timing of the race. rmap_walk() was finding the VMA of the new stack being
set up by exec, failing to lock it and backing off. By the time it restarted
and got back to that VMA, it had already been moved, making the bug simply
harder to reproduce because the race window was so small.

So, the VMA list does not appear to be messed up, but there still needs to
be protection against modification of the details of VMAs that are already
on the list. For that, a seq counter would have been enough and
lighter-weight than acquiring anon_vma->lock every time in vma_adjust().
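
Roughly the sort of thing I mean (a sketch only, untested and not a posted
patch; the mm->vma_seq field is hypothetical):

	#include <linux/seqlock.h>

	/* hypothetical field added to mm_struct: seqcount_t vma_seq; */

	/* writer side, around the VMA update in vma_adjust(),
	 * with mmap_sem already held for write */
	write_seqcount_begin(&mm->vma_seq);
	/* ... update vma->vm_start, vm_end, vm_pgoff ... */
	write_seqcount_end(&mm->vma_seq);

	/* reader side, e.g. in the rmap walk before trusting the lookup */
	unsigned int seq;
	unsigned long address;

	do {
		seq = read_seqcount_begin(&mm->vma_seq);
		address = vma_address(page, vma);
	} while (read_seqcount_retry(&mm->vma_seq, seq));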

I'll drop this patch again as the execve race looks the most important.

> But I'm not sure about this part.... this is really only a question, I
> may well be wrong, I just don't get it.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 0/3] Fix migration races in rmap_walk() V2
  2010-04-27 22:32     ` Andrea Arcangeli
@ 2010-04-28  9:17       ` Mel Gorman
  -1 siblings, 0 replies; 132+ messages in thread
From: Mel Gorman @ 2010-04-28  9:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Linux-MM, LKML, Minchan Kim,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 12:32:42AM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 27, 2010 at 05:27:36PM -0500, Christoph Lameter wrote:
> > Can we simply wait like in the fault path?
> 
> There is no bug there, no need to wait either. I already audited it
> before, and I didn't see any bug. Unless you can show a bug with CPU A
> running the rmap_walk on process1 before process2, there is no bug to
> fix there.
> 

Yes, this patch is now dropped.

> > 
> > > Patch 3 notes that while a VMA is moved under the anon_vma lock, the page
> > > 	tables are not similarly protected. Where migration PTEs are
> > > 	encountered, they are cleaned up.
> > 
> > This means they are copied / moved etc and "cleaned" up in a state when
> > the page was unlocked. Migration entries are not supposed to exist when
> > a page is not locked.
> 
> patch 3 is real, and the first thought I had was to lock down the page
> before running vma_adjust and unlock after move_page_tables. But these
> are virtual addresses. Maybe there's a simpler way to keep migration
> away while we run those two operations.
> 

I see there is a large discussion on that patch so I'll read that rather
than commenting here.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-28  7:28                   ` KAMEZAWA Hiroyuki
@ 2010-04-28 10:48                     ` Mel Gorman
  -1 siblings, 0 replies; 132+ messages in thread
From: Mel Gorman @ 2010-04-28 10:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

Thanks to you both for looking into this. I far prefer this general approach
to cleaning up the migration PTEs as the page tables get copied. While that
might "work", it's sloppy in the same way that having migration_entry_wait()
do the cleanup was sloppy. It's far preferable to make the VMA move and the
page table copy atomic with respect to anon_vma->lock.
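
In other words, something with roughly this shape around shift_arg_pages()
(a sketch only, not the actual posted patch, with the vma_adjust() and
move_page_tables() calls mirroring what shift_arg_pages() already does; as
noted further down, it is problematic because move_page_tables() allocates
page tables with GFP_KERNEL under the spinlock):

	struct anon_vma *anon_vma = vma->anon_vma;

	spin_lock(&anon_vma->lock);
	/* move the VMA to cover [new_start, old_end) ... */
	vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL);
	/* ... and the page tables, while rmap_walk() is excluded */
	moved = move_page_tables(vma, old_start, vma, new_start, length);
	spin_unlock(&anon_vma->lock);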

On Wed, Apr 28, 2010 at 04:28:38PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 28 Apr 2010 11:49:44 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Wed, 28 Apr 2010 04:42:27 +0200
> > Andrea Arcangeli <aarcange@redhat.com> wrote:
>  
> > > migrate.c requires rmap to be able to find all ptes mapping a page at
> > > all times, otherwise the migration entry can be instantiated, but it
> > > can't be removed if the second rmap_walk fails to find the page.
> > > 
> > > So shift_arg_pages must run atomically with respect of rmap_walk, and
> > > it's enough to run it under the anon_vma lock to make it atomic.
> > > 
> > > And split_huge_page() will have the same requirements as migrate.c
> > > already has.
> > > 
> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > Seems good.
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > I'll test this and report if I see trouble again.
> > 
> > Unfortunately, I'll have a week of holidays (in Japan) in 4/29-5/05,
> > my office is nearly closed. So, please consider no-mail-from-me is
> > good information.
> > 
> Here is bad news. When move_page_tables() fails, "some ptes" are moved
> but others are not and....there is no rollback routine.
> 

The biggest problem is that the reverse mapping is temporarily out of sync
until do_exit() gets rid of the mess, but how serious is that really?

If there is a migration entry in there, the mapcount should already be zero
and migration holds a reference to the page to prevent it from going away.
rmap_walk() may then miss the migration PTE so it gets left behind.
Ordinarily this would be bad, but in exec() we cannot be faulting this page
so we won't trigger the BUG in swapops. Instead, do_exit() will ultimately
skip over the migration PTE, doing nothing with the page, but as the
mapcount is still zero the page won't leak.

> I bet the best way to fix this mess up is 
>  - disable overlap moving of arg pages
>  - use do_mremap().
> 
> But maybe you guys want to fix this directly.
> Here is a temporal fix from me. But don't trust me..

I see the point of your patch but I'm not yet seeing why it is necessary to
back out if move_page_tables() fails.

That said, both patches have a greater problem. Both of them hold a spinlock
(anon_vma->lock) while calling into the page allocator with GFP_KERNEL (to
allocate the page tables). We don't want to change that to GFP_ATOMIC, so
either we need to allocate the pages in advance or special-case rmap_walk()
to not walk processes that are in exec.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 0/3] Fix migration races in rmap_walk() V2
  2010-04-28  0:20         ` Andrea Arcangeli
@ 2010-04-28 14:23           ` Mel Gorman
  -1 siblings, 0 replies; 132+ messages in thread
From: Mel Gorman @ 2010-04-28 14:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KAMEZAWA Hiroyuki, Christoph Lameter, Linux-MM, LKML,
	Minchan Kim, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 02:20:56AM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 28, 2010 at 09:13:45AM +0900, KAMEZAWA Hiroyuki wrote:
> > Doing some check in move_ptes() after vma_adjust() is not safe.
> > IOW, when vma's information and information in page-table is incosistent...objrmap
> > is broken and migartion will cause panic.
> > 
> > Then...I think there are 2 ways.
> >   1. use seqcounter in "mm_struct" as previous patch and lock it at mremap.
> > or
> >   2. get_user_pages_fast() when do remap.
> 
> 3 take the anon_vma->lock
> 

I've spent the day looking at ways the anon_vma lock could be held while the
page tables are being allocated. The schemes were all way too hairy just to
cover a migration corner case.

As this is particular to exec, I'm wondering if Kamezawa's additional
proposal of simply skipping migration of pages within the temporary stack
might be the best solution overall in terms of effectiveness and simplicity.
His patch introduced a new field in the VMA, which shouldn't be necessary,
and it altered vma_address(), which is also unnecessary.

Here is a different version of the same basic idea to skip temporary VMAs
during migration. Maybe go with this?

(As a heads-up, I'll also be going offline in about 24 hours until Tuesday
morning. The area I'm in has zero internet access)

==== CUT HERE ====
mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks

Page migration requires rmap to be able to find all ptes mapping a page
at all times, otherwise the migration entry can be instantiated, but it
is possible to leave one behind if the second rmap_walk fails to find
the page.  If this page is later faulted, migration_entry_to_page() will
call BUG because the page is not locked, indicating the page was already
migrated but the migration PTE was not cleaned up.

There is a race between shift_arg_pages and migration that allows this bug
to trigger. A temporary stack is set up during exec and later moved. If
migration moves a page in the temporary stack and the VMA is then removed,
the migration PTE may not be found, leading to a BUG when the stack is
faulted.

Ideally, shift_arg_pages would run atomically with respect to rmap_walk by
holding the anon_vma lock, but this is problematic as pages must be allocated
for page tables. Instead, this patch identifies when it is about to migrate
pages from a temporary stack and leaves them alone.  Memory hot-remove will
try again, sys_move_pages() wouldn't be operating during exec(), and memory
compaction will just continue to another page without concern.

[kamezawa.hiroyu@jp.fujitsu.com: Idea for having migration skip the stacks]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/rmap.c |   31 ++++++++++++++++++++++++++++++-
 1 files changed, 30 insertions(+), 1 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..5aaf4df 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1141,6 +1141,21 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	return ret;
 }
 
+static bool is_vma_temporary_stack(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags != VM_STACK_FLAGS)
+		return false;
+
+	/*
+	 * Only during exec will the total VM consumed by a process
+	 * be exactly the same as the stack
+	 */
+	if (vma->vm_mm->stack_vm == 1 && vma->vm_mm->total_vm == 1)
+		return true;
+
+	return false;
+}
+
 /**
  * try_to_unmap_anon - unmap or unlock anonymous page using the object-based
  * rmap method
@@ -1169,7 +1184,21 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
 		struct vm_area_struct *vma = avc->vma;
-		unsigned long address = vma_address(page, vma);
+		unsigned long address;
+
+		/*
+		 * During exec, a temporary VMA is setup and later moved.
+		 * The VMA is moved under the anon_vma lock but not the
+		 * page tables leading to a race where migration cannot
+		 * find the migration ptes. Rather than increasing the
+		 * locking requirements of exec(), migration skips
+		 * temporary VMAs until after exec() completes.
+		 */
+		if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
+				is_vma_temporary_stack(vma))
+			continue;
+
+		address = vma_address(page, vma);
 		if (address == -EFAULT)
 			continue;
 		ret = try_to_unmap_one(page, vma, address, flags);

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved
  2010-04-28  8:30     ` KAMEZAWA Hiroyuki
@ 2010-04-28 14:46       ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28 14:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
	Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 05:30:54PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 27 Apr 2010 22:30:52 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > During exec(), a temporary stack is setup and moved later to its final
> > location. There is a race between migration and exec whereby a migration
> > PTE can be placed in the temporary stack. When this VMA is moved under the
> > lock, migration no longer knows where the PTE is, fails to remove the PTE
> > and the migration PTE gets copied to the new location.  This later causes
> > a bug when the migration PTE is discovered but the page is not locked.
> > 
> > This patch handles the situation by removing the migration PTE when page
> > tables are being moved in case migration fails to find them. The alternative
> > would require significant modification to vma_adjust() and the locks taken
> > to ensure a VMA move and page table copy is atomic with respect to migration.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> Here is my final proposal (before going vacation.)
> 
> I think this is very simple. The biggest problem is when move_page_range
> fails, setup_arg_pages pass it all to exit() ;)
> 
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> This is an band-aid patch for avoiding unmap->remap of stack pages
> while it's udner exec(). At exec, pages for stack is moved by
> setup_arg_pages(). Under this, (vma,page)<->address relationship
> can be in broken state.
> Moreover, if moving ptes fails, pages with not-valid-rmap remains
> in the page table and objrmap for the page is completely broken
> until exit() frees all up.
> 
> This patch adds vma->broken_rmap. If broken_rmap != 0, vma_address()
> returns -EFAULT always and try_to_unmap() fails.
> (IOW, the pages for stack are pinned until setup_arg_pages() ends.)
> 
> And this prevents page migration because the page's mapcount never
> goes to 0 until exec() fixes it up.

I don't get it: I don't see the pinning, and returning -EFAULT is not a
solution for things that cannot fail (i.e. remove_migration_ptes and
split_huge_page). Plus there's no point in returning failure to rmap_walk
when we can just stop the rmap_walk with the proper lock.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 0/3] Fix migration races in rmap_walk() V2
  2010-04-28 14:23           ` Mel Gorman
@ 2010-04-28 14:57             ` Mel Gorman
  -1 siblings, 0 replies; 132+ messages in thread
From: Mel Gorman @ 2010-04-28 14:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KAMEZAWA Hiroyuki, Christoph Lameter, Linux-MM, LKML,
	Minchan Kim, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 03:23:56PM +0100, Mel Gorman wrote:
> On Wed, Apr 28, 2010 at 02:20:56AM +0200, Andrea Arcangeli wrote:
> > On Wed, Apr 28, 2010 at 09:13:45AM +0900, KAMEZAWA Hiroyuki wrote:
> > > Doing some check in move_ptes() after vma_adjust() is not safe.
> > > IOW, when vma's information and information in page-table is incosistent...objrmap
> > > is broken and migartion will cause panic.
> > > 
> > > Then...I think there are 2 ways.
> > >   1. use seqcounter in "mm_struct" as previous patch and lock it at mremap.
> > > or
> > >   2. get_user_pages_fast() when do remap.
> > 
> > 3 take the anon_vma->lock
> > 
> 
> <SNIP>
>
> Here is a different version of the same basic idea to skip temporary VMAs
> during migration. Maybe go with this?
> 
> +static bool is_vma_temporary_stack(struct vm_area_struct *vma)
> +{
> +	if (vma->vm_flags != VM_STACK_FLAGS)
> +		return false;
> +
> +	/*
> +	 * Only during exec will the total VM consumed by a process
> +	 * be exacly the same as the stack
> +	 */
> +	if (vma->vm_mm->stack_vm == 1 && vma->vm_mm->total_vm == 1)
> +		return true;
> +
> +	return false;
> +}
> +

The assumption about the vm flags is of course totally wrong. VM_EXEC might
be applied, as well as default flags from the mm.  The following is the same
basic idea: skip VMAs belonging to processes in exec rather than trying
to hold anon_vma->lock across move_page_tables(). Not tested yet.

==== CUT HERE ====
mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks

Page migration requires rmap to be able to find all ptes mapping a page
at all times, otherwise the migration entry can be instantiated, but it
is possible to leave one behind if the second rmap_walk fails to find
the page.  If this page is later faulted, migration_entry_to_page() will
call BUG because the page is not locked, indicating the page was already
migrated but the migration PTE was not cleaned up.

There is a race between shift_arg_pages and migration that allows this bug
to trigger. A temporary stack is set up during exec and later moved. If
migration moves a page in the temporary stack and the VMA is then removed,
the migration PTE may not be found, leading to a BUG when the stack is
faulted.

Ideally, shift_arg_pages would run atomically with respect to rmap_walk by
holding the anon_vma lock, but this is problematic as pages must be allocated
for page tables. Instead, this patch skips processes in exec by assuming
that an mm with stack_vm == 1 and total_vm == 1 belongs to a process in
exec() that hasn't finalised the temporary stack yet.  Memory hot-remove
will try again, sys_move_pages() wouldn't be operating during exec(), and
memory compaction will just continue to another page without concern.

[kamezawa.hiroyu@jp.fujitsu.com: Idea for having migration skip temporary stacks]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/rmap.c |   28 +++++++++++++++++++++++++++-
 1 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..9e20188 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1141,6 +1141,18 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	return ret;
 }
 
+static bool is_vma_in_exec(struct vm_area_struct *vma)
+{
+	/*
+	 * Only during exec will the total VM consumed by a process
+	 * be exactly the same as the stack and both equal to 1
+	 */
+	if (vma->vm_mm->stack_vm == 1 && vma->vm_mm->total_vm == 1)
+		return true;
+
+	return false;
+}
+
 /**
  * try_to_unmap_anon - unmap or unlock anonymous page using the object-based
  * rmap method
@@ -1169,7 +1181,21 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
 		struct vm_area_struct *vma = avc->vma;
-		unsigned long address = vma_address(page, vma);
+		unsigned long address;
+
+		/*
+		 * During exec, a temporary VMA is setup and later moved.
+		 * The VMA is moved under the anon_vma lock but not the
+		 * page tables leading to a race where migration cannot
+		 * find the migration ptes. Rather than increasing the
+		 * locking requirements of exec, migration skips
+		 * VMAs in processes calling exec.
+		 */
+		if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
+				is_vma_in_exec(vma))
+			continue;
+
+		address = vma_address(page, vma);
 		if (address == -EFAULT)
 			continue;
 		ret = try_to_unmap_one(page, vma, address, flags);

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: [PATCH 0/3] Fix migration races in rmap_walk() V2
  2010-04-28 14:57             ` Mel Gorman
@ 2010-04-28 15:16               ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28 15:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KAMEZAWA Hiroyuki, Christoph Lameter, Linux-MM, LKML,
	Minchan Kim, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 03:57:38PM +0100, Mel Gorman wrote:
> On Wed, Apr 28, 2010 at 03:23:56PM +0100, Mel Gorman wrote:
> > On Wed, Apr 28, 2010 at 02:20:56AM +0200, Andrea Arcangeli wrote:
> > > On Wed, Apr 28, 2010 at 09:13:45AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > Doing some check in move_ptes() after vma_adjust() is not safe.
> > > > IOW, when vma's information and information in page-table is incosistent...objrmap
> > > > is broken and migartion will cause panic.
> > > > 
> > > > Then...I think there are 2 ways.
> > > >   1. use seqcounter in "mm_struct" as previous patch and lock it at mremap.
> > > > or
> > > >   2. get_user_pages_fast() when do remap.
> > > 
> > > 3 take the anon_vma->lock
> > > 
> > 
> > <SNIP>
> >
> > Here is a different version of the same basic idea to skip temporary VMAs
> > during migration. Maybe go with this?
> > 
> > +static bool is_vma_temporary_stack(struct vm_area_struct *vma)
> > +{
> > +	if (vma->vm_flags != VM_STACK_FLAGS)
> > +		return false;
> > +
> > +	/*
> > +	 * Only during exec will the total VM consumed by a process
> > +	 * be exacly the same as the stack
> > +	 */
> > +	if (vma->vm_mm->stack_vm == 1 && vma->vm_mm->total_vm == 1)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> 
> The assumptions on the vm flags is of course totally wrong. VM_EXEC might
> be applied as well as default flags from the mm.  The following is the same
> basic idea, skip VMAs belonging to processes in exec rather than trying
> to hold anon_vma->lock across move_page_tables(). Not tested yet.

This is better than the other one, which made it look like people could
break rmap at arbitrary times; at least this shows it's only ok during
execve, before anything has run. But if we can't take the anon-vma lock,
then rather than adding these special checks to every rmap_walk that has
to be accurate, we should just delay the linkage of the stack vma into
its anon-vma until after the page tables have been moved.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 0/3] Fix migration races in rmap_walk() V2
  2010-04-28 15:16               ` Andrea Arcangeli
@ 2010-04-28 15:23                 ` Mel Gorman
  -1 siblings, 0 replies; 132+ messages in thread
From: Mel Gorman @ 2010-04-28 15:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KAMEZAWA Hiroyuki, Christoph Lameter, Linux-MM, LKML,
	Minchan Kim, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 05:16:14PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 28, 2010 at 03:57:38PM +0100, Mel Gorman wrote:
> > On Wed, Apr 28, 2010 at 03:23:56PM +0100, Mel Gorman wrote:
> > > On Wed, Apr 28, 2010 at 02:20:56AM +0200, Andrea Arcangeli wrote:
> > > > On Wed, Apr 28, 2010 at 09:13:45AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > Doing some check in move_ptes() after vma_adjust() is not safe.
> > > > > IOW, when vma's information and information in page-table is incosistent...objrmap
> > > > > is broken and migartion will cause panic.
> > > > > 
> > > > > Then...I think there are 2 ways.
> > > > >   1. use seqcounter in "mm_struct" as previous patch and lock it at mremap.
> > > > > or
> > > > >   2. get_user_pages_fast() when do remap.
> > > > 
> > > > 3 take the anon_vma->lock
> > > > 
> > > 
> > > <SNIP>
> > >
> > > Here is a different version of the same basic idea to skip temporary VMAs
> > > during migration. Maybe go with this?
> > > 
> > > +static bool is_vma_temporary_stack(struct vm_area_struct *vma)
> > > +{
> > > +	if (vma->vm_flags != VM_STACK_FLAGS)
> > > +		return false;
> > > +
> > > +	/*
> > > +	 * Only during exec will the total VM consumed by a process
> > > +	 * be exacly the same as the stack
> > > +	 */
> > > +	if (vma->vm_mm->stack_vm == 1 && vma->vm_mm->total_vm == 1)
> > > +		return true;
> > > +
> > > +	return false;
> > > +}
> > > +
> > 
> > The assumptions on the vm flags is of course totally wrong. VM_EXEC might
> > be applied as well as default flags from the mm.  The following is the same
> > basic idea, skip VMAs belonging to processes in exec rather than trying
> > to hold anon_vma->lock across move_page_tables(). Not tested yet.
> 
> This is better than the other, that made it look like people could set
> broken rmap at arbitrary times, at least this shows it's ok only
> during execve before anything run, but if we can't take the anon-vma
> lock really better than having to make these special checks inside
> every rmap_walk that has to be accurate, we should just delay the
> linkage of the stack vma into its anon-vma, until after the move_pages.
> 

Is it possible to delay the linkage like that? As arguments get copied into
the temporary stack before it gets moved, I'd have expected the normal fault
path to prepare and attach the anon_vma. We could special case it but
that isn't very palatable either.
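
For reference, my rough recollection of the ordering in fs/exec.c; treat
this as a sketch from memory rather than a quote of the code:

	bprm_mm_init()
	  __bprm_mm_init()	/* inserts the temporary stack VMA */
	copy_strings()		/* get_arg_page()/get_user_pages() fault the
				 * argument pages into that VMA, which ends up
				 * calling anon_vma_prepare() and linking the
				 * VMA into an anon_vma */
	setup_arg_pages()
	  shift_arg_pages()	/* vma_adjust() + move_page_tables() move the
				 * stack to its final location */

So by the time the stack is moved the anon_vma linkage already exists,
which is why delaying it doesn't look straightforward to me.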

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
  2010-04-28  9:15       ` Mel Gorman
@ 2010-04-28 15:35         ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28 15:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 10:15:55AM +0100, Mel Gorman wrote:
> It became unconditional because I wasn't sure of the optimisation versus the
> new anon_vma changes (doesn't matter, should have been safe). At the time

Changeset 287d97ac032136724143cde8d5964b414d562ee3 is meant to explain
the removal of the lock but I don't get it from the comments. Or at
least I don't get from that comment why we can't resurrect the plain
old deleted code that looked fine to me. Like there is no reason to
take the lock if start == vma->vm_start.

> So, the VMA list does not appear to be messed up but there still needs
> to be protection against modification of VMA details that are already on
> the list. For that, the seq counter would have been enough and
> lighter-weight than acquiring the anon_vma->lock every time in
> vma_adjust().
> 
> I'll drop this patch again as the execve race looks the most important.

You mean you're dropping patch 2 too? I agree with dropping patch 1, but
to me having to take the anon_vma lock of every vma->anon_vma that we
walk seems a must, otherwise expand_downwards and vma_adjust won't be ok.
Plus we need to re-add the anon_vma lock to vma_adjust; it can't be safe
to alter vm_pgoff and vm_start outside of the anon_vma->lock. Or am I
mistaken?

Patch 2 wouldn't help the swapops crash we reproduced because at that
point the anon_vma of the stack is the local one, it's just after
execve.

vma_adjust and expand_downwards would alter vm_pgoff and vm_start while
taking only the vma->anon_vma->lock, where vma->anon_vma is the _local_
anon_vma of the vma. But a vma in mainline can be indexed in an arbitrary
number of anon_vmas, so to stop vma_adjust/expand_downwards from breaking
migration, rmap_walk would need to take _all_ the anon_vma->locks, one
for every anon_vma that the vma is indexed into. Or alternatively, as you
implemented, rmap_walk would need to check whether the vma it found
belongs to an anon_vma different from the original one and take that
vma->anon_vma->lock (i.e. the lock of the local anon_vma of every vma)
before it can actually read vma->vm_pgoff/vm_start inside vma_address.

If the above is right, it also means the new anon-vma changes break the
whole locking of transparent hugepage: see wait_split_huge_page, which
does a spin_unlock_wait(&anon_vma->lock) assuming that waiting on the
"local" anon-vma is enough, when in fact the hugepage may be shared and
belong to the parent's vma->anon_vma rather than to the local one of the
last child, which is then waiting on the wrong lock. So I may have to
rewrite this part of the THP locking to solve this. And for me it's not
enough to just take more locks inside the rmap walks in split_huge_page,
as I used the anon_vma lock outside too.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
  2010-04-28 15:35         ` Andrea Arcangeli
@ 2010-04-28 15:39           ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28 15:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Rik van Riel, Andrew Morton

Another way (not sure if it's good or bad, but it'd clearly avoid the
restarting of locks in rmap_walk) would be to allocate a shared lock and
still share it the way the anon_vma itself was shared before. Then we'd
have shorter chains to walk but a coarser lock, so we'd only have to take
one lock to be safe.

spin_lock(&vma->anon_vma->shared_anon_vma_lock->lock)

something like that...
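
To make it slightly more concrete, a minimal sketch; every name below is
invented and nothing like this exists today:

	/*
	 * All anon_vmas cloned from the same root would point at the same
	 * shared lock (set up at anon_vma clone time), so rmap_walk only
	 * ever has to take one lock per page.
	 */
	struct anon_vma_shared_lock {
		spinlock_t lock;
		atomic_t refcount;	/* freed when the last anon_vma drops it */
	};

	/* struct anon_vma would then grow a ->shared_lock pointer, and the
	 * users would simply do: */
	spin_lock(&vma->anon_vma->shared_lock->lock);
	/* ... walk the chain or update vm_start/vm_pgoff safely ... */
	spin_unlock(&vma->anon_vma->shared_lock->lock);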

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 0/3] Fix migration races in rmap_walk() V2
  2010-04-28 15:23                 ` Mel Gorman
@ 2010-04-28 15:45                   ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28 15:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KAMEZAWA Hiroyuki, Christoph Lameter, Linux-MM, LKML,
	Minchan Kim, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 04:23:54PM +0100, Mel Gorman wrote:
> Is it possible to delay the linkage like that? As arguments get copied into
> the temporary stack before it gets moved, I'd have expected the normal fault
> path to prepare and attach the anon_vma. We could special case it but
> that isn't very palatable either.

I'm not sure what is more palatable, but I feel it should be fixed in
execve.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
  2010-04-28 15:35         ` Andrea Arcangeli
@ 2010-04-28 15:55           ` Mel Gorman
  -1 siblings, 0 replies; 132+ messages in thread
From: Mel Gorman @ 2010-04-28 15:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 05:35:25PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 28, 2010 at 10:15:55AM +0100, Mel Gorman wrote:
> > It became unconditional because I wasn't sure of the optimisation versus the
> > new anon_vma changes (doesn't matter, should have been safe). At the time
> 
> Changeset 287d97ac032136724143cde8d5964b414d562ee3 is meant to explain
> the removal of the lock but I don't get it from the comments. Or at
> least I don't get from that comment why we can't resurrect the plain
> old deleted code that looked fine to me.

Frankly, I don't understand why it was safe to drop the lock either.
Maybe it was a mistake but I still haven't convinced myself I fully
understand the subtleties of the anon_vma changes.

> Like there is no reason to
> take the lock if start == vma->vm_start.
> 
> > So, the VMA list does not appear to be messed up but there still needs
> > to be protection against modification of VMA details that are already on
> > the list. For that, the seq counter would have been enough and
> > lighter-weight than acquiring the anon_vma->lock every time in
> > vma_adjust().
> > 
> > I'll drop this patch again as the execve race looks the most important.
> 
> You mean you're dropping patch 2 too?

Temporarily at least, until I figure out whether execve was the only
problem. The locking in vma_adjust didn't look like the prime reason for
the crash, but the lack of locking there is still very odd.

> I agree dropping patch 1 but
> to me the having to take all the anon_vma locks for every
> vma->anon_vma->lock that we walk seems a must, otherwise
> expand_downwards and vma_adjust won't be ok, plus we need to re-add
> the anon_vma lock to vma_adjust, it can't be safe to alter vm_pgoff
> and vm_start outside of the anon_vma->lock. Or I am mistaken?
> 

No, you're not. If nothing else, vma_address can return the wrong value
because the VMA's vm_start and vm_pgoff were in the process of being
updated but not yet fully updated. It's hard to see how vma_address would
return the wrong value and miss a migration PTE as a result, but it's a
possibility. It's probably a lot more important for transparent hugepage
support.
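
For reference, vma_address() is roughly the following (paraphrased from
mm/rmap.c, so check the source for the exact version). Both fields are
read without any lock, so a half-updated VMA gives a bogus address:

	static inline unsigned long
	vma_address(struct page *page, struct vm_area_struct *vma)
	{
		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
		unsigned long address;

		address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
		if (unlikely(address < vma->vm_start || address >= vma->vm_end))
			return -EFAULT;	/* page does not map into this vma */
		return address;
	}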

> Patch 2 wouldn't help the swapops crash we reproduced because at that
> point the anon_vma of the stack is the local one, it's just after
> execve.
> 
> vma_adjust and expand_downards would alter vm_pgoff and vm_start while
> taking only the vma->anon_vma->lock where the vma->anon_vma is the
> _local_ one of the vma. 

True, although in the case of expand_downwards, it's highly unlikely that
there is also a migration PTE to be cleaned up. It's hard to see how a
migration PTE would be left behind in that case but it still looks wrong to
be updating the VMA fields without locking.

> But a vma in mainline can be indexed in
> infinite anon_vmas, so to prevent breaking migration
> vma_adjust/expand_downards the rmap_walk would need to take _all_
> anon_vma->locks for every anon_vma that the vma is indexed into.

I felt this would be too heavy in the common case, which is why I made
rmap_walk() do the try-lock-or-back-off instead; rmap_walk is typically
in far less critical paths.
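
The shape of it is roughly the following; this is a simplified sketch of
the idea in patch 2 rather than the patch itself:

	/* inside rmap_walk_anon(), with anon_vma->lock already held */
	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct vm_area_struct *vma = avc->vma;
		struct anon_vma *locked = anon_vma;
		unsigned long address;

		if (vma->anon_vma != anon_vma) {
			/*
			 * vm_start/vm_pgoff of this VMA are protected by its
			 * own anon_vma lock, not the one we already hold.
			 * Trylock it; on failure drop everything and restart
			 * the walk instead of risking a deadlock.
			 */
			if (!spin_trylock(&vma->anon_vma->lock))
				goto backoff_and_retry;
			locked = vma->anon_vma;
		}

		address = vma_address(page, vma);
		if (address != -EFAULT)
			ret = rmap_one(page, vma, address, arg);

		if (locked != anon_vma)
			spin_unlock(&locked->lock);

		if (ret != SWAP_AGAIN)
			break;
	}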

> Or
> alternatively like you implemented rmap_walk would need to check if
> the vma we found in the rmap_walk is different from the original
> anon_vma and to take the vma->anon_vma->lock (so taking the
> anon_vma->lock of the local anon_vma of every vma) before it can
> actually read the vma->vm_pgoff/vm_start inside vma_address.
> 

To be absolutely sure, yes this is required. I don't think we've been hitting
this exact problem in these tests but it still is a better plan than adjusting
VMA details without locks.

> If the above is right it also means the new anon-vma changes also break
> the whole locking of transparent hugepage, see wait_split_huge_page,
> it does a spin_unlock_wait(&anon_vma->lock) thinking that waiting the
> "local" anon-vma is enough, when in fact the hugepage may be shared
> and belonging to the parent parent_vma->anon_vma and not to the local
> one of the last child that is waiting on the wrong lock. So I may have
> to rewrite this part of the thp locking to solve this. And for me it's
> not enough to just taking more locks inside the rmap walks inside
> split_huge_page as I used the anon_vma lock outside too.
> 

No fun. That potentially could be a lot of locks to take to split the
page.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
  2010-04-28 15:55           ` Mel Gorman
@ 2010-04-28 16:23             ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28 16:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 04:55:58PM +0100, Mel Gorman wrote:
> Frankly, I don't understand why it was safe to drop the lock either.
> Maybe it was a mistake but I still haven't convinced myself I fully
> understand the subtleties of the anon_vma changes.

I understand the design but I'm also unsure about the details. It's just
that this lock gets split, and when you update vm_start/vm_pgoff with
only the vma->anon_vma->lock held, the vma may be queued in multiple
other anon_vmas, so you're only serializing, and only safe, for the pages
that have page_mapcount 1 and point to the local anon_vma ==
vma->anon_vma, not for any other shared page.

The removal of the vma->anon_vma->lock from vma_adjust just seems an
unrelated mistake to me too, but I don't know for sure why yet.

Basically vma_adjust needs the anon_vma lock just like expand_downwards has.
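
Something along these lines, purely as a sketch of the idea and not a
tested patch:

	/* in vma_adjust(), around the fields that rmap_walk reads */
	if (vma->anon_vma)
		spin_lock(&vma->anon_vma->lock);

	vma->vm_start = start;
	vma->vm_end = end;
	vma->vm_pgoff = pgoff;

	if (vma->anon_vma)
		spin_unlock(&vma->anon_vma->lock);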

After you fix vma_adjust to be as safe as expand_downwards, you also have
to take care of the rmap_walk that may run on a page->mapping anon_vma
that isn't the vma->anon_vma: you're not taking that shared page's
anon_vma->lock when you change the vma's vm_pgoff/vm_start. If rmap_walk
fails to find a pte because of that, migrate will crash.

> Temporarily at least until I figured out if execve was the only problem. The

For aa.git it is sure enough. And as long as you only see the migrate
crash in execve it's also sure enough.

> locking in vma_adjust didn't look like the prime reason for the crash
> but the lack of locking there is still very odd.

And I think it needs fixing to be safe.

> 
> > I agree dropping patch 1 but
> > to me the having to take all the anon_vma locks for every
> > vma->anon_vma->lock that we walk seems a must, otherwise
> > expand_downwards and vma_adjust won't be ok, plus we need to re-add
> > the anon_vma lock to vma_adjust, it can't be safe to alter vm_pgoff
> > and vm_start outside of the anon_vma->lock. Or I am mistaken?
> > 
> 
> No, you're not. If nothing else, vma_address can return the wrong value because
> the VMAs vm_start and vm_pgoff were in the process of being updated but not
> fully updated. It's hard to see how vma_address would return the wrong value
> and miss a migration PTE as a result but it's a possibility.  It's probably
> a lot more important for transparent hugepage support.

For the rmap_walk itself, migrate and split_huge_page are identical; the
main problem for transparent hugepage support is that I used the
anon_vma->lock in a wider way, taking it well before the rmap_walk, so
I'm screwed in a worse way than migrate is.

So I may have to change from anon_vma->lock to the compound_lock in
wait_split_huge_page(). But I'll still need the restarting loop of
anon_vma locks inside the two rmap_walks run by split_huge_page. Problem
is, I would have preferred to do this locking change later as a pure
optimization rather than as a requirement for merging and running stable,
as it'll make things slightly more complex.

BTW, if we were to share the lock across all anon_vmas as I mentioned
in prev email, and just reduce the chain length, then it'd solve all
issues for rmap_walk in migrate and also for THP completely.

> > Patch 2 wouldn't help the swapops crash we reproduced because at that
> > point the anon_vma of the stack is the local one, it's just after
> > execve.
> > 
> > vma_adjust and expand_downards would alter vm_pgoff and vm_start while
> > taking only the vma->anon_vma->lock where the vma->anon_vma is the
> > _local_ one of the vma. 
> 
> True, although in the case of expand_downwards, it's highly unlikely that
> there is also a migration PTE to be cleaned up. It's hard to see how a
> migration PTE would be left behind in that case but it still looks wrong to
> be updating the VMA fields without locking.

Every time we fail to find the PTE, it can also mean try_to_unmap just
failed to instantiate the migration pte, leading to random memory
corruption in migrate. If a task forks and some of the stack pages at the
bottom of the stack are shared, but the top of the stack isn't shared (so
the vma->anon_vma->lock only protects the top and not the bottom),
migrate should be able to silently corrupt random memory right now
because of this.

> > But a vma in mainline can be indexed in
> > infinite anon_vmas, so to prevent breaking migration
> > vma_adjust/expand_downards the rmap_walk would need to take _all_
> > anon_vma->locks for every anon_vma that the vma is indexed into.
> 
> I felt this would be too heavy in the common case which is why I made
> rmap_walk() do the try-lock-or-back-off instead because rmap_walk is typically
> in far less critical paths.

If we could take all the locks, it'd make life easier as it'd already
implement the "shared lock" but without sharing it. It won't provide much
runtime benefit though (rmap_walk will be even slower than with a real
shared lock, and vma_adjust/expand_downwards will be slightly faster).

> > Or
> > alternatively like you implemented rmap_walk would need to check if
> > the vma we found in the rmap_walk is different from the original
> > anon_vma and to take the vma->anon_vma->lock (so taking the
> > anon_vma->lock of the local anon_vma of every vma) before it can
> > actually read the vma->vm_pgoff/vm_start inside vma_address.
> > 
> 
> To be absolutly sure, yes this is required. I don't think we've been hitting
> this exact problem in these tests but it still is a better plan than adjusting
> VMA details without locks.

We've not been hitting it unless somebody crashed with random
corruption.

> > If the above is right it also means the new anon-vma changes also break
> > the whole locking of transparent hugepage, see wait_split_huge_page,
> > it does a spin_unlock_wait(&anon_vma->lock) thinking that waiting the
> > "local" anon-vma is enough, when in fact the hugepage may be shared
> > and belonging to the parent parent_vma->anon_vma and not to the local
> > one of the last child that is waiting on the wrong lock. So I may have
> > to rewrite this part of the thp locking to solve this. And for me it's
> > not enough to just taking more locks inside the rmap walks inside
> > split_huge_page as I used the anon_vma lock outside too.
> > 
> 
> No fun. That potentially could be a lot of locks to take to split the
> page.

compound_lock should be able to replace it in a more granular way, but
this isn't exactly the time I was planning to apply scalability
optimizations to THP, as they may introduce other issues I can't foresee
right now.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
  2010-04-28 16:23             ` Andrea Arcangeli
@ 2010-04-28 17:34               ` Mel Gorman
  -1 siblings, 0 replies; 132+ messages in thread
From: Mel Gorman @ 2010-04-28 17:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 06:23:05PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 28, 2010 at 04:55:58PM +0100, Mel Gorman wrote:
> > Frankly, I don't understand why it was safe to drop the lock either.
> > Maybe it was a mistake but I still haven't convinced myself I fully
> > understand the subtleties of the anon_vma changes.
> 
> I understand the design but I'm also unsure about the details. it's
> just this lock that gets splitted and when you update the
> vm_start/pgoff with only the vma->anon_vma->lock, the vma may be
> queued in multiple other anon_vmas, and you're only serializing
> and safe for the pages that have page_mapcount 1, and point to the
> local anon_vma == vma->anon_vma, not any other shared page.
> 
> The removal of the vma->anon_vma->lock from vma_adjust just seems an
> unrelated mistake to me too, but I don't know for sure why yet.
> Basically vma_adjust needs the anon_vma lock like expand_downards has.
> 

Well, in the easiest case, the details of the VMA (particularly vm_start
and vm_pgoff) can confuse callers of vma_address during rmap_walk. In the
case of migration, it can then return false positives or negatives.

At best I'm fuzzy on the details though.

> After you fix vma_adjust to be as safe as expand_downards you've also
> to take care of the rmap_walk that may run on a page->mapping =
> anon_vma that isn't the vma->anon_vma and you're not taking that
> anon_vma->lock of the shared page, when you change the vma
> vm_pgoff/vm_start.

Is this not what the try-lock-different-vmas-or-backoff-and-retry logic
in patch 2 is doing or am I missing something else?

> If rmap_walk finds to find a pte, becauase of that,
> migrate will crash.
> 
> > Temporarily at least until I figured out if execve was the only problem. The
> 
> For aa.git it is sure enough. And as long as you only see the migrate
> crash in execve it's also sure enough.
> 

I can only remember seeing the execve-related crash but I'd rather the
locking was correct too.

Problem is, I've seen at least one crash due to execve() even with the
check made in try_to_unmap_anon to not migrate within the temporary
stack. I'm not sure how it could have happened.

> > locking in vma_adjust didn't look like the prime reason for the crash
> > but the lack of locking there is still very odd.
> 
> And I think it needs fixing to be safe.
> 
> > 
> > > I agree dropping patch 1 but
> > > to me the having to take all the anon_vma locks for every
> > > vma->anon_vma->lock that we walk seems a must, otherwise
> > > expand_downwards and vma_adjust won't be ok, plus we need to re-add
> > > the anon_vma lock to vma_adjust, it can't be safe to alter vm_pgoff
> > > and vm_start outside of the anon_vma->lock. Or I am mistaken?
> > > 
> > 
> > No, you're not. If nothing else, vma_address can return the wrong value because
> > the VMAs vm_start and vm_pgoff were in the process of being updated but not
> > fully updated. It's hard to see how vma_address would return the wrong value
> > and miss a migration PTE as a result but it's a possibility.  It's probably
> > a lot more important for transparent hugepage support.
> 
> For the rmap_walk itself, migrate and split_huge_page are identical,
> the main problem of transparent hugepage support is that I used the
> anon_vma->lock in a wider way and taken well before the rmap_walk, so
> I'm screwed in a worse way than migrate.
> 

Ok.

> So I may have to change from anon_vma->lock to the compound_lock in
> wait_split_huge_page(). But I'll still need the restarting loop of
> anon_vma locks then inside the two rmap_walk run by split_huge_page.
> Problem is, I would have preferred to do this locking change later as
> a pure optimization than as a requirement for merging and running
> stable, as it'll make things slightly more complex.
> 
> BTW, if we were to share the lock across all anon_vmas as I mentioned
> in prev email, and just reduce the chain length, then it'd solve all
> issues for rmap_walk in migrate and also for THP completely.
> 

It might be where we end up eventually. I'm simply loath to introduce
another set of rules to anon_vma locking if it can be at all avoided.

> > > Patch 2 wouldn't help the swapops crash we reproduced because at that
> > > point the anon_vma of the stack is the local one, it's just after
> > > execve.
> > > 
> > > vma_adjust and expand_downards would alter vm_pgoff and vm_start while
> > > taking only the vma->anon_vma->lock where the vma->anon_vma is the
> > > _local_ one of the vma. 
> > 
> > True, although in the case of expand_downwards, it's highly unlikely that
> > there is also a migration PTE to be cleaned up. It's hard to see how a
> > migration PTE would be left behind in that case but it still looks wrong to
> > be updating the VMA fields without locking.
> 
> Every time we fail to find the PTE, it can also mean try_to_unmap just
> failed to instantiate the migration pte leading to random memory
> corruption in migrate.

How so? The old PTE should have been left in place, the page count of the
page should remain positive, and migration should not occur.

> If a task fork and the some of the stack pages
> at the bottom of the stack are shared, but the top of the stack isn't
> shared (so the vma->anon_vma->lock only protects the top and not the
> bottom) migrate should be able to silently random corrupt memory right
> now because of this.
> 
> > > But a vma in mainline can be indexed in
> > > infinite anon_vmas, so to prevent breaking migration
> > > vma_adjust/expand_downards the rmap_walk would need to take _all_
> > > anon_vma->locks for every anon_vma that the vma is indexed into.
> > 
> > I felt this would be too heavy in the common case which is why I made
> > rmap_walk() do the try-lock-or-back-off instead because rmap_walk is typically
> > in far less critical paths.
> 
> If we could take all locks, it'd make life easier as it'd already
> implement the "shared lock" but without sharing it. It won't provide
> much runtime benefit though (just rmap_walk will be even slower than
> real shared lock, and vma_adjust/expand_downards will be slightly faster).
> 

Because the list could be very large, it would make more sense to
introduce the shared lock if this is what was required.

> > > Or
> > > alternatively like you implemented rmap_walk would need to check if
> > > the vma we found in the rmap_walk is different from the original
> > > anon_vma and to take the vma->anon_vma->lock (so taking the
> > > anon_vma->lock of the local anon_vma of every vma) before it can
> > > actually read the vma->vm_pgoff/vm_start inside vma_address.
> > > 
> > 
> > To be absolutly sure, yes this is required. I don't think we've been hitting
> > this exact problem in these tests but it still is a better plan than adjusting
> > VMA details without locks.
> 
> We've not been hitting it unless somebody crashed with random
> corruption.
> 

Not that I've seen. Still just the crashes within execve.

> > > If the above is right it also means the new anon-vma changes also break
> > > the whole locking of transparent hugepage, see wait_split_huge_page,
> > > it does a spin_unlock_wait(&anon_vma->lock) thinking that waiting the
> > > "local" anon-vma is enough, when in fact the hugepage may be shared
> > > and belonging to the parent parent_vma->anon_vma and not to the local
> > > one of the last child that is waiting on the wrong lock. So I may have
> > > to rewrite this part of the thp locking to solve this. And for me it's
> > > not enough to just taking more locks inside the rmap walks inside
> > > split_huge_page as I used the anon_vma lock outside too.
> > > 
> > 
> > No fun. That potentially could be a lot of locks to take to split the
> > page.
> 
> compound_lock should be able to replace it in a more granular way, but
> it's not exactly the time I was looking to apply scalar optimization
> to THP as it may introduce other issues I can't foresee right now.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
@ 2010-04-28 17:34               ` Mel Gorman
  0 siblings, 0 replies; 132+ messages in thread
From: Mel Gorman @ 2010-04-28 17:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 06:23:05PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 28, 2010 at 04:55:58PM +0100, Mel Gorman wrote:
> > Frankly, I don't understand why it was safe to drop the lock either.
> > Maybe it was a mistake but I still haven't convinced myself I fully
> > understand the subtleties of the anon_vma changes.
> 
> I understand the design but I'm also unsure about the details. it's
> just this lock that gets splitted and when you update the
> vm_start/pgoff with only the vma->anon_vma->lock, the vma may be
> queued in multiple other anon_vmas, and you're only serializing
> and safe for the pages that have page_mapcount 1, and point to the
> local anon_vma == vma->anon_vma, not any other shared page.
> 
> The removal of the vma->anon_vma->lock from vma_adjust just seems an
> unrelated mistake to me too, but I don't know for sure why yet.
> Basically vma_adjust needs the anon_vma lock like expand_downards has.
> 

Well, in the easiest case, the details of the VMA (particularly vm_start
and vm_pgoff) can confuse callers of vma_address during rmap_walk. In the
case of migration, it will return other false positives or negatives.

At best I'm fuzzy on the details though.

> After you fix vma_adjust to be as safe as expand_downards you've also
> to take care of the rmap_walk that may run on a page->mapping =
> anon_vma that isn't the vma->anon_vma and you're not taking that
> anon_vma->lock of the shared page, when you change the vma
> vm_pgoff/vm_start.

Is this not what the try-lock-different-vmas-or-backoff-and-retry logic
in patch 2 is doing or am I missing something else?

> If rmap_walk finds to find a pte, becauase of that,
> migrate will crash.
> 
> > Temporarily at least until I figured out if execve was the only problem. The
> 
> For aa.git it is sure enough. And as long as you only see the migrate
> crash in execve it's also sure enough.
> 

I can only remember seeing the execve-related crash but I'd rather the
locking was correct too.

Problem is, I've seen at least one crash due to execve() even with the
check made in try_to_unmap_anon to not migrate within the temporary
stack. I'm not sure how it could have happened.

> > locking in vma_adjust didn't look like the prime reason for the crash
> > but the lack of locking there is still very odd.
> 
> And I think it needs fixing to be safe.
> 
> > 
> > > I agree dropping patch 1 but
> > > to me the having to take all the anon_vma locks for every
> > > vma->anon_vma->lock that we walk seems a must, otherwise
> > > expand_downwards and vma_adjust won't be ok, plus we need to re-add
> > > the anon_vma lock to vma_adjust, it can't be safe to alter vm_pgoff
> > > and vm_start outside of the anon_vma->lock. Or I am mistaken?
> > > 
> > 
> > No, you're not. If nothing else, vma_address can return the wrong value because
> > the VMAs vm_start and vm_pgoff were in the process of being updated but not
> > fully updated. It's hard to see how vma_address would return the wrong value
> > and miss a migration PTE as a result but it's a possibility.  It's probably
> > a lot more important for transparent hugepage support.
> 
> For the rmap_walk itself, migrate and split_huge_page are identical,
> the main problem of transparent hugepage support is that I used the
> anon_vma->lock in a wider way and taken well before the rmap_walk, so
> I'm screwed in a worse way than migrate.
> 

Ok.

> So I may have to change from anon_vma->lock to the compound_lock in
> wait_split_huge_page(). But I'll still need the restarting loop of
> anon_vma locks then inside the two rmap_walk run by split_huge_page.
> Problem is, I would have preferred to do this locking change later as
> a pure optimization than as a requirement for merging and running
> stable, as it'll make things slightly more complex.
> 
> BTW, if we were to share the lock across all anon_vmas as I mentioned
> in prev email, and just reduce the chain length, then it'd solve all
> issues for rmap_walk in migrate and also for THP completely.
> 

It might be where we end up eventually. I'm simply loath to introduce
another set of rules to anon_vma locking if it can be at all avoided.

> > > Patch 2 wouldn't help the swapops crash we reproduced because at that
> > > point the anon_vma of the stack is the local one, it's just after
> > > execve.
> > > 
> > > vma_adjust and expand_downwards would alter vm_pgoff and vm_start while
> > > taking only the vma->anon_vma->lock where the vma->anon_vma is the
> > > _local_ one of the vma. 
> > 
> > True, although in the case of expand_downwards, it's highly unlikely that
> > there is also a migration PTE to be cleaned up. It's hard to see how a
> > migration PTE would be left behind in that case but it still looks wrong to
> > be updating the VMA fields without locking.
> 
> Every time we fail to find the PTE, it can also mean try_to_unmap just
> failed to instantiate the migration pte leading to random memory
> corruption in migrate.

How so? The old PTE should have been left in place, the page count of
the page should remain positive, and migration should not occur.
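
That reasoning is based on the back-out path in unmap_and_move();
paraphrasing the flow from memory rather than quoting it exactly, it is
approximately

	/* establish migration ptes or remove ptes */
	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MMU_NOTIFIER|TTU_IGNORE_ACCESS);

	if (!page_mapped(page))
		rc = move_to_new_page(newpage, page);

	if (rc)
		/* back out: restore any migration ptes already installed */
		remove_migration_ptes(page, page);

so if a PTE was never converted, the page stays mapped, the move is
skipped, and any migration PTEs already installed are torn down.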

> If a task forks and some of the stack pages
> at the bottom of the stack are shared, but the top of the stack isn't
> shared (so the vma->anon_vma->lock only protects the top and not the
> bottom) migrate should be able to silently corrupt random memory right
> now because of this.
> 
> > > But a vma in mainline can be indexed in
> > > infinite anon_vmas, so to prevent breaking migration
> > > vma_adjust/expand_downwards the rmap_walk would need to take _all_
> > > anon_vma->locks for every anon_vma that the vma is indexed into.
> > 
> > I felt this would be too heavy in the common case which is why I made
> > rmap_walk() do the try-lock-or-back-off instead because rmap_walk is typically
> > in far less critical paths.
> 
> If we could take all locks, it'd make life easier as it'd already
> implement the "shared lock" but without sharing it. It won't provide
> much runtime benefit though (just rmap_walk will be even slower than
> real shared lock, and vma_adjust/expand_downwards will be slightly faster).
> 

Because the list could be very large, it would make more sense to
introduce the shared lock if this is what was required.

> > > Or
> > > alternatively like you implemented rmap_walk would need to check if
> > > the vma we found in the rmap_walk is different from the original
> > > anon_vma and to take the vma->anon_vma->lock (so taking the
> > > anon_vma->lock of the local anon_vma of every vma) before it can
> > > actually read the vma->vm_pgoff/vm_start inside vma_address.
> > > 
> > 
> > To be absolutely sure, yes this is required. I don't think we've been hitting
> > this exact problem in these tests but it still is a better plan than adjusting
> > VMA details without locks.
> 
> We've not been hitting it unless somebody crashed with random
> corruption.
> 

Not that I've seen. Still just the crashes within execve.

> > > If the above is right it also means the new anon-vma changes also break
> > > the whole locking of transparent hugepage, see wait_split_huge_page,
> > > it does a spin_unlock_wait(&anon_vma->lock) thinking that waiting on the
> > > "local" anon-vma is enough, when in fact the hugepage may be shared
> > > and belong to the parent vma->anon_vma and not to the local
> > > one of the last child that is waiting on the wrong lock. So I may have
> > > to rewrite this part of the thp locking to solve this. And for me it's
> > > not enough to just take more locks inside the rmap walks inside
> > > split_huge_page as I used the anon_vma lock outside too.
> > > 
> > 
> > No fun. That potentially could be a lot of locks to take to split the
> > page.
> 
> compound_lock should be able to replace it in a more granular way, but
> it's not exactly the time I was looking to apply scalar optimization
> to THP as it may introduce other issues I can't foresee right now.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 132+ messages in thread

* [RFC PATCH] take all anon_vma locks in anon_vma_lock
  2010-04-28 16:23             ` Andrea Arcangeli
@ 2010-04-28 17:47               ` Rik van Riel
  -1 siblings, 0 replies; 132+ messages in thread
From: Rik van Riel @ 2010-04-28 17:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrew Morton

Take all the locks for all the anon_vmas in anon_vma_lock, this properly
excludes migration and the transparent hugepage code from VMA changes done
by mmap/munmap/mprotect/expand_stack/etc...

Also document the locking rules for the same_vma list in the anon_vma_chain
and remove the anon_vma_lock call from expand_upwards, which does not need it.

Signed-off-by: Rik van Riel <riel@redhat.com>

--- 
Posted quickly as an RFC patch, only compile tested so far.
Andrea, Mel, does this look like a reasonable approach?

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..1eef42c 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -52,11 +52,15 @@ struct anon_vma {
  * all the anon_vmas associated with this VMA.
  * The "same_anon_vma" list contains the anon_vma_chains
  * which link all the VMAs associated with this anon_vma.
+ *
+ * The "same_vma" list is locked by either having mm->mmap_sem
+ * locked for writing, or having mm->mmap_sem locked for reading
+ * AND holding the mm->page_table_lock.
  */
 struct anon_vma_chain {
 	struct vm_area_struct *vma;
 	struct anon_vma *anon_vma;
-	struct list_head same_vma;   /* locked by mmap_sem & page_table_lock */
+	struct list_head same_vma;	/* see above */
 	struct list_head same_anon_vma;	/* locked by anon_vma->lock */
 };
 
@@ -90,11 +94,14 @@ static inline struct anon_vma *page_anon_vma(struct page *page)
 	return page_rmapping(page);
 }
 
-static inline void anon_vma_lock(struct vm_area_struct *vma)
+static inline void anon_vma_lock(struct vm_area_struct *vma, void *nest_lock)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
-	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+	if (anon_vma) {
+		struct anon_vma_chain *avc;
+		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+			spin_lock_nest_lock(&avc->anon_vma->lock, nest_lock);
+	}
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..2c13bbb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -452,7 +452,7 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 		spin_lock(&mapping->i_mmap_lock);
 		vma->vm_truncate_count = mapping->truncate_count;
 	}
-	anon_vma_lock(vma);
+	anon_vma_lock(vma, &mm->mmap_sem);
 
 	__vma_link(mm, vma, prev, rb_link, rb_parent);
 	__vma_link_file(vma);
@@ -1705,12 +1705,11 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		return -EFAULT;
 
 	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
+	 * Unlike expand_downwards, we do not need to take the anon_vma lock,
+	 * because we leave vma->vm_start and vma->pgoff untouched. 
+	 * This means rmap lookups of pages inside this VMA stay valid
+	 * throughout the stack expansion.
 	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
-	anon_vma_lock(vma);
 
 	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
@@ -1721,7 +1720,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	if (address < PAGE_ALIGN(address+4))
 		address = PAGE_ALIGN(address+4);
 	else {
-		anon_vma_unlock(vma);
 		return -ENOMEM;
 	}
 	error = 0;
@@ -1737,7 +1735,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		if (!error)
 			vma->vm_end = address;
 	}
-	anon_vma_unlock(vma);
 	return error;
 }
 #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
@@ -1749,6 +1746,7 @@ static int expand_downwards(struct vm_area_struct *vma,
 				   unsigned long address)
 {
 	int error;
+	struct mm_struct *mm = vma->vm_mm;
 
 	/*
 	 * We must make sure the anon_vma is allocated
@@ -1762,7 +1760,8 @@ static int expand_downwards(struct vm_area_struct *vma,
 	if (error)
 		return error;
 
-	anon_vma_lock(vma);
+	spin_lock(&mm->page_table_lock);
+	anon_vma_lock(vma, &mm->page_table_lock);
 
 	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
@@ -1784,6 +1783,8 @@ static int expand_downwards(struct vm_area_struct *vma,
 		}
 	}
 	anon_vma_unlock(vma);
+	spin_unlock(&mm->page_table_lock);
+
 	return error;
 }
 

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
  2010-04-28 17:34               ` Mel Gorman
@ 2010-04-28 17:58                 ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28 17:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 06:34:17PM +0100, Mel Gorman wrote:
> Well, in the easiest case, the details of the VMA (particularly vm_start
> and vm_pgoff) can confuse callers of vma_address during rmap_walk. In the
> case of migration, it can then return false positives or negatives.

False positives are fine ;). The only problem is false negatives...

> > After you fix vma_adjust to be as safe as expand_downwards you've also
> > to take care of the rmap_walk that may run on a page->mapping =
> > anon_vma that isn't the vma->anon_vma and you're not taking that
> > anon_vma->lock of the shared page, when you change the vma
> > vm_pgoff/vm_start.
> 
> Is this not what the try-lock-different-vmas-or-backoff-and-retry logic
> in patch 2 is doing or am I missing something else?

Yes, exactly. This is why patch 2 can't be dropped, both for the
vma_adjust and the rmap_walk, which are really two separate issues.

> How so? The old PTE should have been left in place, the page count of
> the page should remain positive, and migration should not occur.

Right, the only problem is for remove_migration_ptes (and for both
split_huge_page rmap_walks). For migrate the only issue is the second
rmap_walk.
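
For context on why a missed entry there is fatal: once migration
unlocks the page, the next fault on a leftover migration PTE goes
through migration_entry_to_page(), which is roughly (from
include/linux/swapops.h, give or take):

	static inline struct page *migration_entry_to_page(swp_entry_t entry)
	{
		struct page *p = pfn_to_page(swp_offset(entry));
		/*
		 * Any use of migration entries may only occur while the
		 * corresponding page is locked
		 */
		BUG_ON(!PageLocked(p));
		return p;
	}

which is the swapops BUG discussed earlier in the thread.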

> Because the list could be very large, it would make more sense to
> introduce the shared lock if this is what was required.

Kind of agree, we'll see...

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [RFC PATCH] take all anon_vma locks in anon_vma_lock
  2010-04-28 17:47               ` Rik van Riel
@ 2010-04-28 18:03                 ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28 18:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrew Morton

On Wed, Apr 28, 2010 at 01:47:19PM -0400, Rik van Riel wrote:
>  static inline void anon_vma_unlock(struct vm_area_struct *vma)

Never mind, as this is an RFC; the lock is clear enough.

> @@ -1762,7 +1760,8 @@ static int expand_downwards(struct vm_area_struct *vma,
>  	if (error)
>  		return error;
>  
> -	anon_vma_lock(vma);
> +	spin_lock(&mm->page_table_lock);
> +	anon_vma_lock(vma, &mm->page_table_lock);

This will cause a lock inversion (page_table_lock can only be taken
after the anon_vma lock). I don't immediately see why the
page_table_lock is needed here, though?
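
For reference, the lock nesting documented at the top of mm/rmap.c is
roughly (excerpt from memory, not quoted exactly):

	 *   mm->mmap_sem
	 *     page->flags PG_locked (lock_page)
	 *       mapping->i_mmap_lock
	 *         anon_vma->lock
	 *           mm->page_table_lock or pte_lock

so the hunk above ends up taking page_table_lock before anon_vma->lock,
the reverse of that documented order.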

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [RFC PATCH] take all anon_vma locks in anon_vma_lock
  2010-04-28 18:03                 ` Andrea Arcangeli
@ 2010-04-28 18:09                   ` Rik van Riel
  -1 siblings, 0 replies; 132+ messages in thread
From: Rik van Riel @ 2010-04-28 18:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrew Morton

On 04/28/2010 02:03 PM, Andrea Arcangeli wrote:
> On Wed, Apr 28, 2010 at 01:47:19PM -0400, Rik van Riel wrote:
>>   static inline void anon_vma_unlock(struct vm_area_struct *vma)
>
> never mind as this is RFC, lock is clear enough
>
>> @@ -1762,7 +1760,8 @@ static int expand_downwards(struct vm_area_struct *vma,
>>   	if (error)
>>   		return error;
>>
>> -	anon_vma_lock(vma);
>> +	spin_lock(&mm->page_table_lock);
>> +	anon_vma_lock(vma,&mm->page_table_lock);
>
> This will cause a lock inversion (page_table_lock can only be taken
> after the anon_vma lock). I don't immediately see why the
> page_table_lock here though?

We need to safely walk the vma->anon_vma_chain /
anon_vma_chain->same_vma list.

So much for using the mmap_sem for read + the
page_table_lock to lock the anon_vma_chain list.

We'll need a new lock somewhere, probably in the
mm_struct since one per process seems plenty.

I'll add that in the next version of the patch.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 132+ messages in thread

* [RFC PATCH -v2] take all anon_vma locks in anon_vma_lock
  2010-04-28 17:47               ` Rik van Riel
@ 2010-04-28 18:25                 ` Rik van Riel
  -1 siblings, 0 replies; 132+ messages in thread
From: Rik van Riel @ 2010-04-28 18:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrew Morton

Take all the locks for all the anon_vmas in anon_vma_lock, this properly
excludes migration and the transparent hugepage code from VMA changes done
by mmap/munmap/mprotect/expand_stack/etc...

Also document the locking rules for the same_vma list in the anon_vma_chain
and remove the anon_vma_lock call from expand_upwards, which does not need it.

Signed-off-by: Rik van Riel <riel@redhat.com>

--- 
Posted quickly as an RFC patch, only compile tested so far.
Andrea, Mel, does this look like a reasonable approach?

v2:
 - also change anon_vma_unlock to walk the loop
 - add calls to anon_vma_lock & anon_vma_unlock to vma_adjust
 - introduce a new lock for the vma->anon_vma_chain list, to prevent
   the lock inversion that Andrea pointed out

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..a0679c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -239,6 +239,7 @@ struct mm_struct {
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
 	spinlock_t page_table_lock;		/* Protects page tables and some counters */
+	spinlock_t anon_vma_chain_lock;		/* Protects vma->anon_vma_chain, with mmap_sem */
 
 	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
 						 * together off init_mm.mmlist, and are protected
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..492e7ca 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -52,11 +52,15 @@ struct anon_vma {
  * all the anon_vmas associated with this VMA.
  * The "same_anon_vma" list contains the anon_vma_chains
  * which link all the VMAs associated with this anon_vma.
+ *
+ * The "same_vma" list is locked by either having mm->mmap_sem
+ * locked for writing, or having mm->mmap_sem locked for reading
+ * AND holding the mm->anon_vma_chain_lock.
  */
 struct anon_vma_chain {
 	struct vm_area_struct *vma;
 	struct anon_vma *anon_vma;
-	struct list_head same_vma;   /* locked by mmap_sem & page_table_lock */
+	struct list_head same_vma;	/* see above */
 	struct list_head same_anon_vma;	/* locked by anon_vma->lock */
 };
 
@@ -90,18 +94,24 @@ static inline struct anon_vma *page_anon_vma(struct page *page)
 	return page_rmapping(page);
 }
 
-static inline void anon_vma_lock(struct vm_area_struct *vma)
+static inline void anon_vma_lock(struct vm_area_struct *vma, void *nest_lock)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
-	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+	if (anon_vma) {
+		struct anon_vma_chain *avc;
+		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+			spin_lock_nest_lock(&avc->anon_vma->lock, nest_lock);
+	}
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
-	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+	if (anon_vma) {
+		struct anon_vma_chain *avc;
+		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+			spin_unlock(&avc->anon_vma->lock);
+	}
 }
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 44b0791..83b1ba2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -468,6 +468,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 	mm->nr_ptes = 0;
 	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
 	spin_lock_init(&mm->page_table_lock);
+	spin_lock_init(&mm->anon_vma_chain_lock);
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
 	mm_init_aio(mm);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 57aba0d..3ce8a1f 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,7 @@ struct mm_struct init_mm = {
 	.mm_count	= ATOMIC_INIT(1),
 	.mmap_sem	= __RWSEM_INITIALIZER(init_mm.mmap_sem),
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
+	.anon_vma_chain_lock =  __SPIN_LOCK_UNLOCKED(init_mm.anon_vma_chain_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	.cpu_vm_mask	= CPU_MASK_ALL,
 };
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..4602358 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -452,7 +452,7 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 		spin_lock(&mapping->i_mmap_lock);
 		vma->vm_truncate_count = mapping->truncate_count;
 	}
-	anon_vma_lock(vma);
+	anon_vma_lock(vma, &mm->mmap_sem);
 
 	__vma_link(mm, vma, prev, rb_link, rb_parent);
 	__vma_link_file(vma);
@@ -578,6 +578,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
+	anon_vma_lock(vma, &mm->mmap_sem);
 	if (root) {
 		flush_dcache_mmap_lock(mapping);
 		vma_prio_tree_remove(vma, root);
@@ -599,6 +600,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 		vma_prio_tree_insert(vma, root);
 		flush_dcache_mmap_unlock(mapping);
 	}
+	anon_vma_unlock(vma);
 
 	if (remove_next) {
 		/*
@@ -1705,12 +1707,11 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		return -EFAULT;
 
 	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
+	 * Unlike expand_downwards, we do not need to take the anon_vma lock,
+	 * because we leave vma->vm_start and vma->pgoff untouched. 
+	 * This means rmap lookups of pages inside this VMA stay valid
+	 * throughout the stack expansion.
 	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
-	anon_vma_lock(vma);
 
 	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
@@ -1721,7 +1722,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	if (address < PAGE_ALIGN(address+4))
 		address = PAGE_ALIGN(address+4);
 	else {
-		anon_vma_unlock(vma);
 		return -ENOMEM;
 	}
 	error = 0;
@@ -1737,7 +1737,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		if (!error)
 			vma->vm_end = address;
 	}
-	anon_vma_unlock(vma);
 	return error;
 }
 #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
@@ -1749,6 +1748,7 @@ static int expand_downwards(struct vm_area_struct *vma,
 				   unsigned long address)
 {
 	int error;
+	struct mm_struct *mm = vma->vm_mm;
 
 	/*
 	 * We must make sure the anon_vma is allocated
@@ -1762,7 +1762,8 @@ static int expand_downwards(struct vm_area_struct *vma,
 	if (error)
 		return error;
 
-	anon_vma_lock(vma);
+	spin_lock(&mm->anon_vma_chain_lock);
+	anon_vma_lock(vma, &mm->anon_vma_chain_lock);
 
 	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
@@ -1784,6 +1785,8 @@ static int expand_downwards(struct vm_area_struct *vma,
 		}
 	}
 	anon_vma_unlock(vma);
+	spin_unlock(&mm->anon_vma_chain_lock);
+
 	return error;
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 526704e..aa27132 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -23,6 +23,7 @@
  * inode->i_mutex	(while writing or truncating, not reading or faulting)
  *   inode->i_alloc_sem (vmtruncate_range)
  *   mm->mmap_sem
+ *   mm->anon_vma_chain_lock (mmap_sem for read, protects vma->anon_vma_chain)
  *     page->flags PG_locked (lock_page)
  *       mapping->i_mmap_lock
  *         anon_vma->lock
@@ -135,8 +136,8 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 		}
 		spin_lock(&anon_vma->lock);
 
-		/* page_table_lock to protect against threads */
-		spin_lock(&mm->page_table_lock);
+		/* anon_vma_chain_lock to protect against threads */
+		spin_lock(&mm->anon_vma_chain_lock);
 		if (likely(!vma->anon_vma)) {
 			vma->anon_vma = anon_vma;
 			avc->anon_vma = anon_vma;
@@ -145,7 +146,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 			list_add(&avc->same_anon_vma, &anon_vma->head);
 			allocated = NULL;
 		}
-		spin_unlock(&mm->page_table_lock);
+		spin_unlock(&mm->anon_vma_chain_lock);
 
 		spin_unlock(&anon_vma->lock);
 		if (unlikely(allocated)) {

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: [RFC PATCH -v2] take all anon_vma locks in anon_vma_lock
  2010-04-28 18:25                 ` Rik van Riel
@ 2010-04-28 19:07                   ` Mel Gorman
  -1 siblings, 0 replies; 132+ messages in thread
From: Mel Gorman @ 2010-04-28 19:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrew Morton

On Wed, Apr 28, 2010 at 02:25:10PM -0400, Rik van Riel wrote:
> Take all the locks for all the anon_vmas in anon_vma_lock, this properly
> excludes migration and the transparent hugepage code from VMA changes done
> by mmap/munmap/mprotect/expand_stack/etc...
> 
> Also document the locking rules for the same_vma list in the anon_vma_chain
> and remove the anon_vma_lock call from expand_upwards, which does not need it.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> 
> --- 
> Posted quickly as an RFC patch, only compile tested so far.
> Andrea, Mel, does this look like a reasonable approach?
> 

Well, it looks nice but I am wary of making much in the way of comment
on its correctness. Every time I think I understand the new anon_vma
changes, I find something else to make me doubt myself :)

> v2:
>  - also change anon_vma_unlock to walk the loop
>  - add calls to anon_vma_lock & anon_vma_unlock to vma_adjust
>  - introduce a new lock for the vma->anon_vma_chain list, to prevent
>    the lock inversion that Andrea pointed out
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index b8bb9a6..a0679c6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -239,6 +239,7 @@ struct mm_struct {
>  	int map_count;				/* number of VMAs */
>  	struct rw_semaphore mmap_sem;
>  	spinlock_t page_table_lock;		/* Protects page tables and some counters */
> +	spinlock_t anon_vma_chain_lock;		/* Protects vma->anon_vma_chain, with mmap_sem */
>  
>  	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
>  						 * together off init_mm.mmlist, and are protected
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index d25bd22..492e7ca 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -52,11 +52,15 @@ struct anon_vma {
>   * all the anon_vmas associated with this VMA.
>   * The "same_anon_vma" list contains the anon_vma_chains
>   * which link all the VMAs associated with this anon_vma.
> + *
> + * The "same_vma" list is locked by either having mm->mmap_sem
> + * locked for writing, or having mm->mmap_sem locked for reading
> + * AND holding the mm->anon_vma_chain_lock.
>   */
>  struct anon_vma_chain {
>  	struct vm_area_struct *vma;
>  	struct anon_vma *anon_vma;
> -	struct list_head same_vma;   /* locked by mmap_sem & page_table_lock */
> +	struct list_head same_vma;	/* see above */
>  	struct list_head same_anon_vma;	/* locked by anon_vma->lock */
>  };
>  
> @@ -90,18 +94,24 @@ static inline struct anon_vma *page_anon_vma(struct page *page)
>  	return page_rmapping(page);
>  }
>  
> -static inline void anon_vma_lock(struct vm_area_struct *vma)
> +static inline void anon_vma_lock(struct vm_area_struct *vma, void *nest_lock)
>  {
>  	struct anon_vma *anon_vma = vma->anon_vma;
> -	if (anon_vma)
> -		spin_lock(&anon_vma->lock);
> +	if (anon_vma) {
> +		struct anon_vma_chain *avc;
> +		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
> +			spin_lock_nest_lock(&avc->anon_vma->lock, nest_lock);
> +	}
>  }

This doesn't build with LOCKDEP enabled. As we discussed on IRC, I
changed nest_lock to spinlock_t * and always made it the
anon_vma_chain_lock. That vaguely makes more sense than sometimes
depending on the mmap_sem semaphore but I'm not 100% solid on it.
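
Roughly, what I ended up with locally looks like the following; a
sketch of the change described above, not a diff I have posted or
tested yet:

	static inline void anon_vma_lock(struct vm_area_struct *vma,
					 spinlock_t *nest_lock)
	{
		struct anon_vma *anon_vma = vma->anon_vma;
		if (anon_vma) {
			struct anon_vma_chain *avc;
			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
				spin_lock_nest_lock(&avc->anon_vma->lock, nest_lock);
		}
	}

with every caller passing &mm->anon_vma_chain_lock. With lockdep
enabled, spin_lock_nest_lock() needs nest_lock->dep_map, which is why
the void * version does not compile.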

Walking along like this and taking the necessary locks seems reasonable
though and should be usable by transparent hugepage support whereas my
approach probably wasn't.

>  
>  static inline void anon_vma_unlock(struct vm_area_struct *vma)
>  {
>  	struct anon_vma *anon_vma = vma->anon_vma;
> -	if (anon_vma)
> -		spin_unlock(&anon_vma->lock);
> +	if (anon_vma) {
> +		struct anon_vma_chain *avc;
> +		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
> +			spin_unlock(&avc->anon_vma->lock);
> +	}
>  }
>  
>  /*
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 44b0791..83b1ba2 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -468,6 +468,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
>  	mm->nr_ptes = 0;
>  	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
>  	spin_lock_init(&mm->page_table_lock);
> +	spin_lock_init(&mm->anon_vma_chain_lock);
>  	mm->free_area_cache = TASK_UNMAPPED_BASE;
>  	mm->cached_hole_size = ~0UL;
>  	mm_init_aio(mm);
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index 57aba0d..3ce8a1f 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -15,6 +15,7 @@ struct mm_struct init_mm = {
>  	.mm_count	= ATOMIC_INIT(1),
>  	.mmap_sem	= __RWSEM_INITIALIZER(init_mm.mmap_sem),
>  	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
> +	.anon_vma_chain_lock =  __SPIN_LOCK_UNLOCKED(init_mm.anon_vma_chain_lock),
>  	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
>  	.cpu_vm_mask	= CPU_MASK_ALL,
>  };
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f90ea92..4602358 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -452,7 +452,7 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
>  		spin_lock(&mapping->i_mmap_lock);
>  		vma->vm_truncate_count = mapping->truncate_count;
>  	}
> -	anon_vma_lock(vma);
> +	anon_vma_lock(vma, &mm->mmap_sem);
>  

So I changed this and others like it to the anon_vma_chain_lock. I
haven't actually tested it yet.

>  	__vma_link(mm, vma, prev, rb_link, rb_parent);
>  	__vma_link_file(vma);
> @@ -578,6 +578,7 @@ again:			remove_next = 1 + (end > next->vm_end);
>  		}
>  	}
>  
> +	anon_vma_lock(vma, &mm->mmap_sem);
>  	if (root) {
>  		flush_dcache_mmap_lock(mapping);
>  		vma_prio_tree_remove(vma, root);
> @@ -599,6 +600,7 @@ again:			remove_next = 1 + (end > next->vm_end);
>  		vma_prio_tree_insert(vma, root);
>  		flush_dcache_mmap_unlock(mapping);
>  	}
> +	anon_vma_unlock(vma);
>  
>  	if (remove_next) {
>  		/*
> @@ -1705,12 +1707,11 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>  		return -EFAULT;
>  
>  	/*
> -	 * We must make sure the anon_vma is allocated
> -	 * so that the anon_vma locking is not a noop.
> +	 * Unlike expand_downwards, we do not need to take the anon_vma lock,
> +	 * because we leave vma->vm_start and vma->pgoff untouched. 
> +	 * This means rmap lookups of pages inside this VMA stay valid
> +	 * throughout the stack expansion.
>  	 */
> -	if (unlikely(anon_vma_prepare(vma)))
> -		return -ENOMEM;
> -	anon_vma_lock(vma);
>  
>  	/*
>  	 * vma->vm_start/vm_end cannot change under us because the caller
> @@ -1721,7 +1722,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>  	if (address < PAGE_ALIGN(address+4))
>  		address = PAGE_ALIGN(address+4);
>  	else {
> -		anon_vma_unlock(vma);
>  		return -ENOMEM;
>  	}
>  	error = 0;
> @@ -1737,7 +1737,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>  		if (!error)
>  			vma->vm_end = address;
>  	}
> -	anon_vma_unlock(vma);
>  	return error;
>  }
>  #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
> @@ -1749,6 +1748,7 @@ static int expand_downwards(struct vm_area_struct *vma,
>  				   unsigned long address)
>  {
>  	int error;
> +	struct mm_struct *mm = vma->vm_mm;
>  
>  	/*
>  	 * We must make sure the anon_vma is allocated
> @@ -1762,7 +1762,8 @@ static int expand_downwards(struct vm_area_struct *vma,
>  	if (error)
>  		return error;
>  
> -	anon_vma_lock(vma);
> +	spin_lock(&mm->anon_vma_chain_lock);
> +	anon_vma_lock(vma, &mm->anon_vma_chain_lock);
>  
>  	/*
>  	 * vma->vm_start/vm_end cannot change under us because the caller
> @@ -1784,6 +1785,8 @@ static int expand_downwards(struct vm_area_struct *vma,
>  		}
>  	}
>  	anon_vma_unlock(vma);
> +	spin_unlock(&mm->anon_vma_chain_lock);
> +
>  	return error;
>  }
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 526704e..aa27132 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -23,6 +23,7 @@
>   * inode->i_mutex	(while writing or truncating, not reading or faulting)
>   *   inode->i_alloc_sem (vmtruncate_range)
>   *   mm->mmap_sem
> + *   mm->anon_vma_chain_lock (mmap_sem for read, protects vma->anon_vma_chain)
>   *     page->flags PG_locked (lock_page)
>   *       mapping->i_mmap_lock
>   *         anon_vma->lock
> @@ -135,8 +136,8 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  		}
>  		spin_lock(&anon_vma->lock);
>  
> -		/* page_table_lock to protect against threads */
> -		spin_lock(&mm->page_table_lock);
> +		/* anon_vma_chain_lock to protect against threads */
> +		spin_lock(&mm->anon_vma_chain_lock);
>  		if (likely(!vma->anon_vma)) {
>  			vma->anon_vma = anon_vma;
>  			avc->anon_vma = anon_vma;
> @@ -145,7 +146,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  			list_add(&avc->same_anon_vma, &anon_vma->head);
>  			allocated = NULL;
>  		}
> -		spin_unlock(&mm->page_table_lock);
> +		spin_unlock(&mm->anon_vma_chain_lock);
>  
>  		spin_unlock(&anon_vma->lock);
>  		if (unlikely(allocated)) {
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [RFC PATCH -v2] take all anon_vma locks in anon_vma_lock
@ 2010-04-28 19:07                   ` Mel Gorman
  0 siblings, 0 replies; 132+ messages in thread
From: Mel Gorman @ 2010-04-28 19:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrew Morton

On Wed, Apr 28, 2010 at 02:25:10PM -0400, Rik van Riel wrote:
> Take all the locks for all the anon_vmas in anon_vma_lock, this properly
> excludes migration and the transparent hugepage code from VMA changes done
> by mmap/munmap/mprotect/expand_stack/etc...
> 
> Also document the locking rules for the same_vma list in the anon_vma_chain
> and remove the anon_vma_lock call from expand_upwards, which does not need it.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> 
> --- 
> Posted quickly as an RFC patch, only compile tested so far.
> Andrea, Mel, does this look like a reasonable approach?
> 

Well, it looks nice but I am wary of making much in the way of comment
on its correctness. Every time I think I understand the new anon_vma
changes, I find something else to make me doubt myself :)

> v2:
>  - also change anon_vma_unlock to walk the loop
>  - add calls to anon_vma_lock & anon_vma_unlock to vma_adjust
>  - introduce a new lock for the vma->anon_vma_chain list, to prevent
>    the lock inversion that Andrea pointed out
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index b8bb9a6..a0679c6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -239,6 +239,7 @@ struct mm_struct {
>  	int map_count;				/* number of VMAs */
>  	struct rw_semaphore mmap_sem;
>  	spinlock_t page_table_lock;		/* Protects page tables and some counters */
> +	spinlock_t anon_vma_chain_lock;		/* Protects vma->anon_vma_chain, with mmap_sem */
>  
>  	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
>  						 * together off init_mm.mmlist, and are protected
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index d25bd22..492e7ca 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -52,11 +52,15 @@ struct anon_vma {
>   * all the anon_vmas associated with this VMA.
>   * The "same_anon_vma" list contains the anon_vma_chains
>   * which link all the VMAs associated with this anon_vma.
> + *
> + * The "same_vma" list is locked by either having mm->mmap_sem
> + * locked for writing, or having mm->mmap_sem locked for reading
> + * AND holding the mm->anon_vma_chain_lock.
>   */
>  struct anon_vma_chain {
>  	struct vm_area_struct *vma;
>  	struct anon_vma *anon_vma;
> -	struct list_head same_vma;   /* locked by mmap_sem & page_table_lock */
> +	struct list_head same_vma;	/* see above */
>  	struct list_head same_anon_vma;	/* locked by anon_vma->lock */
>  };
>  
> @@ -90,18 +94,24 @@ static inline struct anon_vma *page_anon_vma(struct page *page)
>  	return page_rmapping(page);
>  }
>  
> -static inline void anon_vma_lock(struct vm_area_struct *vma)
> +static inline void anon_vma_lock(struct vm_area_struct *vma, void *nest_lock)
>  {
>  	struct anon_vma *anon_vma = vma->anon_vma;
> -	if (anon_vma)
> -		spin_lock(&anon_vma->lock);
> +	if (anon_vma) {
> +		struct anon_vma_chain *avc;
> +		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
> +			spin_lock_nest_lock(&avc->anon_vma->lock, nest_lock);
> +	}
>  }

This doesn't build with LOCKDEP enabled. As we discussed on IRC, I
changed nest_lock to spinlock_t * and always made it the
anon_vma_chain_lock. That vaguely makes more sense than sometimes
depending on the mmap_sem semaphore but I'm not 100% solid on it.

Walking along like this and taking the necessary locks seems reasonable
though and should be usable by transparent hugepage support where as my
approach probably wasn't.

>  
>  static inline void anon_vma_unlock(struct vm_area_struct *vma)
>  {
>  	struct anon_vma *anon_vma = vma->anon_vma;
> -	if (anon_vma)
> -		spin_unlock(&anon_vma->lock);
> +	if (anon_vma) {
> +		struct anon_vma_chain *avc;
> +		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
> +			spin_unlock(&avc->anon_vma->lock);
> +	}
>  }
>  
>  /*
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 44b0791..83b1ba2 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -468,6 +468,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
>  	mm->nr_ptes = 0;
>  	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
>  	spin_lock_init(&mm->page_table_lock);
> +	spin_lock_init(&mm->anon_vma_chain_lock);
>  	mm->free_area_cache = TASK_UNMAPPED_BASE;
>  	mm->cached_hole_size = ~0UL;
>  	mm_init_aio(mm);
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index 57aba0d..3ce8a1f 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -15,6 +15,7 @@ struct mm_struct init_mm = {
>  	.mm_count	= ATOMIC_INIT(1),
>  	.mmap_sem	= __RWSEM_INITIALIZER(init_mm.mmap_sem),
>  	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
> +	.anon_vma_chain_lock =  __SPIN_LOCK_UNLOCKED(init_mm.anon_vma_chain_lock),
>  	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
>  	.cpu_vm_mask	= CPU_MASK_ALL,
>  };
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f90ea92..4602358 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -452,7 +452,7 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
>  		spin_lock(&mapping->i_mmap_lock);
>  		vma->vm_truncate_count = mapping->truncate_count;
>  	}
> -	anon_vma_lock(vma);
> +	anon_vma_lock(vma, &mm->mmap_sem);
>  

so I changed this and others like it to the anon_vma_chain_lock. I
haven't actually tested yet.

>  	__vma_link(mm, vma, prev, rb_link, rb_parent);
>  	__vma_link_file(vma);
> @@ -578,6 +578,7 @@ again:			remove_next = 1 + (end > next->vm_end);
>  		}
>  	}
>  
> +	anon_vma_lock(vma, &mm->mmap_sem);
>  	if (root) {
>  		flush_dcache_mmap_lock(mapping);
>  		vma_prio_tree_remove(vma, root);
> @@ -599,6 +600,7 @@ again:			remove_next = 1 + (end > next->vm_end);
>  		vma_prio_tree_insert(vma, root);
>  		flush_dcache_mmap_unlock(mapping);
>  	}
> +	anon_vma_unlock(vma);
>  
>  	if (remove_next) {
>  		/*
> @@ -1705,12 +1707,11 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>  		return -EFAULT;
>  
>  	/*
> -	 * We must make sure the anon_vma is allocated
> -	 * so that the anon_vma locking is not a noop.
> +	 * Unlike expand_downwards, we do not need to take the anon_vma lock,
> +	 * because we leave vma->vm_start and vma->pgoff untouched. 
> +	 * This means rmap lookups of pages inside this VMA stay valid
> +	 * throughout the stack expansion.
>  	 */
> -	if (unlikely(anon_vma_prepare(vma)))
> -		return -ENOMEM;
> -	anon_vma_lock(vma);
>  
>  	/*
>  	 * vma->vm_start/vm_end cannot change under us because the caller
> @@ -1721,7 +1722,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>  	if (address < PAGE_ALIGN(address+4))
>  		address = PAGE_ALIGN(address+4);
>  	else {
> -		anon_vma_unlock(vma);
>  		return -ENOMEM;
>  	}
>  	error = 0;
> @@ -1737,7 +1737,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>  		if (!error)
>  			vma->vm_end = address;
>  	}
> -	anon_vma_unlock(vma);
>  	return error;
>  }
>  #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
> @@ -1749,6 +1748,7 @@ static int expand_downwards(struct vm_area_struct *vma,
>  				   unsigned long address)
>  {
>  	int error;
> +	struct mm_struct *mm = vma->vm_mm;
>  
>  	/*
>  	 * We must make sure the anon_vma is allocated
> @@ -1762,7 +1762,8 @@ static int expand_downwards(struct vm_area_struct *vma,
>  	if (error)
>  		return error;
>  
> -	anon_vma_lock(vma);
> +	spin_lock(&mm->anon_vma_chain_lock);
> +	anon_vma_lock(vma, &mm->anon_vma_chain_lock);
>  
>  	/*
>  	 * vma->vm_start/vm_end cannot change under us because the caller
> @@ -1784,6 +1785,8 @@ static int expand_downwards(struct vm_area_struct *vma,
>  		}
>  	}
>  	anon_vma_unlock(vma);
> +	spin_unlock(&mm->anon_vma_chain_lock);
> +
>  	return error;
>  }
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 526704e..aa27132 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -23,6 +23,7 @@
>   * inode->i_mutex	(while writing or truncating, not reading or faulting)
>   *   inode->i_alloc_sem (vmtruncate_range)
>   *   mm->mmap_sem
> + *   mm->anon_vma_chain_lock (mmap_sem for read, protects vma->anon_vma_chain)
>   *     page->flags PG_locked (lock_page)
>   *       mapping->i_mmap_lock
>   *         anon_vma->lock
> @@ -135,8 +136,8 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  		}
>  		spin_lock(&anon_vma->lock);
>  
> -		/* page_table_lock to protect against threads */
> -		spin_lock(&mm->page_table_lock);
> +		/* anon_vma_chain_lock to protect against threads */
> +		spin_lock(&mm->anon_vma_chain_lock);
>  		if (likely(!vma->anon_vma)) {
>  			vma->anon_vma = anon_vma;
>  			avc->anon_vma = anon_vma;
> @@ -145,7 +146,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  			list_add(&avc->same_anon_vma, &anon_vma->head);
>  			allocated = NULL;
>  		}
> -		spin_unlock(&mm->page_table_lock);
> +		spin_unlock(&mm->anon_vma_chain_lock);
>  
>  		spin_unlock(&anon_vma->lock);
>  		if (unlikely(allocated)) {
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 132+ messages in thread

* [RFC PATCH -v3] take all anon_vma locks in anon_vma_lock
  2010-04-28 18:25                 ` Rik van Riel
@ 2010-04-28 20:17                   ` Rik van Riel
  -1 siblings, 0 replies; 132+ messages in thread
From: Rik van Riel @ 2010-04-28 20:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrew Morton

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..a0679c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -239,6 +239,7 @@ struct mm_struct {
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
 	spinlock_t page_table_lock;		/* Protects page tables and some counters */
+	spinlock_t anon_vma_chain_lock;		/* Protects vma->anon_vma_chain, with mmap_sem */
 
 	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
 						 * together off init_mm.mmlist, and are protected
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..703c472 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -52,11 +52,15 @@ struct anon_vma {
  * all the anon_vmas associated with this VMA.
  * The "same_anon_vma" list contains the anon_vma_chains
  * which link all the VMAs associated with this anon_vma.
+ *
+ * The "same_vma" list is locked by either having mm->mmap_sem
+ * locked for writing, or having mm->mmap_sem locked for reading
+ * AND holding the mm->anon_vma_chain_lock.
  */
 struct anon_vma_chain {
 	struct vm_area_struct *vma;
 	struct anon_vma *anon_vma;
-	struct list_head same_vma;   /* locked by mmap_sem & page_table_lock */
+	struct list_head same_vma;	/* see above */
 	struct list_head same_anon_vma;	/* locked by anon_vma->lock */
 };
 
@@ -90,18 +94,24 @@ static inline struct anon_vma *page_anon_vma(struct page *page)
 	return page_rmapping(page);
 }
 
-static inline void anon_vma_lock(struct vm_area_struct *vma)
-{
-	struct anon_vma *anon_vma = vma->anon_vma;
-	if (anon_vma)
-		spin_lock(&anon_vma->lock);
-}
+#define anon_vma_lock(vma, nest_lock)					\
+({									\
+	struct anon_vma *anon_vma = vma->anon_vma;			\
+	if (anon_vma) {							\
+		struct anon_vma_chain *avc;				\
+		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) \
+			spin_lock_nest_lock(&avc->anon_vma->lock, nest_lock); \
+	}								\
+})
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
-	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+	if (anon_vma) {
+		struct anon_vma_chain *avc;
+		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+			spin_unlock(&avc->anon_vma->lock);
+	}
 }
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 44b0791..83b1ba2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -468,6 +468,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 	mm->nr_ptes = 0;
 	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
 	spin_lock_init(&mm->page_table_lock);
+	spin_lock_init(&mm->anon_vma_chain_lock);
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
 	mm_init_aio(mm);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 57aba0d..3ce8a1f 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,7 @@ struct mm_struct init_mm = {
 	.mm_count	= ATOMIC_INIT(1),
 	.mmap_sem	= __RWSEM_INITIALIZER(init_mm.mmap_sem),
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
+	.anon_vma_chain_lock =  __SPIN_LOCK_UNLOCKED(init_mm.anon_vma_chain_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	.cpu_vm_mask	= CPU_MASK_ALL,
 };
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..4602358 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -452,7 +452,7 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 		spin_lock(&mapping->i_mmap_lock);
 		vma->vm_truncate_count = mapping->truncate_count;
 	}
-	anon_vma_lock(vma);
+	anon_vma_lock(vma, &mm->mmap_sem);
 
 	__vma_link(mm, vma, prev, rb_link, rb_parent);
 	__vma_link_file(vma);
@@ -578,6 +578,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
+	anon_vma_lock(vma, &mm->mmap_sem);
 	if (root) {
 		flush_dcache_mmap_lock(mapping);
 		vma_prio_tree_remove(vma, root);
@@ -599,6 +600,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 		vma_prio_tree_insert(vma, root);
 		flush_dcache_mmap_unlock(mapping);
 	}
+	anon_vma_unlock(vma);
 
 	if (remove_next) {
 		/*
@@ -1705,12 +1707,11 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		return -EFAULT;
 
 	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
+	 * Unlike expand_downwards, we do not need to take the anon_vma lock,
+	 * because we leave vma->vm_start and vma->pgoff untouched. 
+	 * This means rmap lookups of pages inside this VMA stay valid
+	 * throughout the stack expansion.
 	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
-	anon_vma_lock(vma);
 
 	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
@@ -1721,7 +1722,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	if (address < PAGE_ALIGN(address+4))
 		address = PAGE_ALIGN(address+4);
 	else {
-		anon_vma_unlock(vma);
 		return -ENOMEM;
 	}
 	error = 0;
@@ -1737,7 +1737,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		if (!error)
 			vma->vm_end = address;
 	}
-	anon_vma_unlock(vma);
 	return error;
 }
 #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
@@ -1749,6 +1748,7 @@ static int expand_downwards(struct vm_area_struct *vma,
 				   unsigned long address)
 {
 	int error;
+	struct mm_struct *mm = vma->vm_mm;
 
 	/*
 	 * We must make sure the anon_vma is allocated
@@ -1762,7 +1762,8 @@ static int expand_downwards(struct vm_area_struct *vma,
 	if (error)
 		return error;
 
-	anon_vma_lock(vma);
+	spin_lock(&mm->anon_vma_chain_lock);
+	anon_vma_lock(vma, &mm->anon_vma_chain_lock);
 
 	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
@@ -1784,6 +1785,8 @@ static int expand_downwards(struct vm_area_struct *vma,
 		}
 	}
 	anon_vma_unlock(vma);
+	spin_unlock(&mm->anon_vma_chain_lock);
+
 	return error;
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 526704e..98d6289 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -23,6 +23,7 @@
  * inode->i_mutex	(while writing or truncating, not reading or faulting)
  *   inode->i_alloc_sem (vmtruncate_range)
  *   mm->mmap_sem
+ *   mm->anon_vma_chain_lock (mmap_sem for read, protects vma->anon_vma_chain)
  *     page->flags PG_locked (lock_page)
  *       mapping->i_mmap_lock
  *         anon_vma->lock
@@ -133,10 +134,11 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 				goto out_enomem_free_avc;
 			allocated = anon_vma;
 		}
+
+		/* anon_vma_chain_lock to protect against threads */
+		spin_lock(&mm->anon_vma_chain_lock);
 		spin_lock(&anon_vma->lock);
 
-		/* page_table_lock to protect against threads */
-		spin_lock(&mm->page_table_lock);
 		if (likely(!vma->anon_vma)) {
 			vma->anon_vma = anon_vma;
 			avc->anon_vma = anon_vma;
@@ -145,9 +147,9 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 			list_add(&avc->same_anon_vma, &anon_vma->head);
 			allocated = NULL;
 		}
-		spin_unlock(&mm->page_table_lock);
-
 		spin_unlock(&anon_vma->lock);
+		spin_unlock(&mm->anon_vma_chain_lock);
+
 		if (unlikely(allocated)) {
 			anon_vma_free(allocated);
 			anon_vma_chain_free(avc);


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: [PATCH 0/3] Fix migration races in rmap_walk() V2
  2010-04-28 15:45                   ` Andrea Arcangeli
@ 2010-04-28 20:40                     ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28 20:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KAMEZAWA Hiroyuki, Christoph Lameter, Linux-MM, LKML,
	Minchan Kim, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 05:45:08PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 28, 2010 at 04:23:54PM +0100, Mel Gorman wrote:
> > Is it possible to delay the linkage like that? As arguments get copied into
> > the temporary stack before it gets moved, I'd have expected the normal fault
> > path to prepare and attach the anon_vma. We could special case it but
> > that isn't very palatable either.
> 
> I'm not sure what is more palatable, but I feel it should be fixed in
> execve.

Ok, the best idea I have had so far is to add a temporary fake vma to
the anon_vma list, with the old vm_start and the same vm_pgoff, before
shifting down vma->vm_start and calling move_page_tables. Then, after
the move is complete, we remove the fake vma. All the fast paths remain
unmodified and no magic is required. I'll try to fix this against the
old stable anon-vma code and test it in aa.git first, as the code will
differ; if it works, anybody can port it to the new anon-vma code.
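
For illustration only, here is a condensed sketch of the idea described
above, using the 2.6.33-era anon_vma calls that the actual
shift_arg_pages() patch (posted later in this thread) relies on. The
helper name shift_vma_start_down() and the error handling are invented
for the example; the patch itself is authoritative. It assumes
vma->anon_vma has already been prepared.

#include <linux/mm.h>
#include <linux/rmap.h>
#include <linux/slab.h>

/* Sketch only: keep the old range visible to rmap while the page
 * tables move to the new, shifted-down range. */
static int shift_vma_start_down(struct vm_area_struct *vma,
				unsigned long old_start,
				unsigned long new_start,
				unsigned long length)
{
	struct vm_area_struct *tmp_vma;

	/* Fake VMA that keeps the old range indexed on the anon_vma list. */
	tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
	if (!tmp_vma)
		return -ENOMEM;
	*tmp_vma = *vma;

	spin_lock(&vma->anon_vma->lock);
	vma->vm_start = new_start;	/* vma now covers [new_start, old_end) */
	__anon_vma_link(tmp_vma);	/* old_start stays reachable by rmap */
	spin_unlock(&vma->anon_vma->lock);

	/* Both ranges are visible to rmap_walk() while the PTEs move. */
	if (length != move_page_tables(vma, old_start, vma, new_start, length)) {
		anon_vma_unlink(tmp_vma);
		kmem_cache_free(vm_area_cachep, tmp_vma);
		return -ENOMEM;
	}

	/* The new range alone is enough for rmap; drop the fake VMA. */
	anon_vma_unlink(tmp_vma);
	kmem_cache_free(vm_area_cachep, tmp_vma);
	return 0;
}

The real patch does the equivalent inline in shift_arg_pages() in
fs/exec.c, replacing the two vma_adjust() calls.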

^ permalink raw reply	[flat|nested] 132+ messages in thread

* [RFC PATCH -v3] take all anon_vma locks in anon_vma_lock
  2010-04-28 20:17                   ` Rik van Riel
@ 2010-04-28 20:57                     ` Rik van Riel
  -1 siblings, 0 replies; 132+ messages in thread
From: Rik van Riel @ 2010-04-28 20:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrew Morton

Take all the locks for all the anon_vmas in anon_vma_lock; this properly
excludes migration and the transparent hugepage code from the VMA changes
done by mmap/munmap/mprotect/expand_stack/etc...

Unfortunately, this requires adding a new lock (mm->anon_vma_chain_lock),
otherwise we have an unavoidable lock ordering conflict.  This changes the
locking rules for the "same_vma" list: it is protected by either mm->mmap_sem
held for write, or mm->mmap_sem held for read plus the new
mm->anon_vma_chain_lock.  That limits the places where the new lock has to
be taken to two locations: anon_vma_prepare and expand_downwards.

Document the locking rules for the same_vma list in struct anon_vma_chain
and remove the anon_vma_lock call from expand_upwards, which does not need it.

Signed-off-by: Rik van Riel <riel@redhat.com>

--- 
Posted quickly as an RFC patch, only compile tested so far.
Andrea, Mel, does this look like a reasonable approach?

v3:
 - change anon_vma_unlock into a macro so lockdep works right
 - fix lock ordering in anon_vma_prepare
v2:
 - also change anon_vma_unlock to walk the loop
 - add calls to anon_vma_lock & anon_vma_unlock to vma_adjust
 - introduce a new lock for the vma->anon_vma_chain list, to prevent
   the lock inversion that Andrea pointed out
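
To make the new rule for the "same_vma" list concrete, a minimal sketch
(not part of the patch) of a walker that holds mmap_sem only for read
and therefore needs the new spinlock before touching the list; the
helper name count_anon_vmas() is invented for the example.

#include <linux/mm.h>
#include <linux/rmap.h>

/* Sketch, not part of the patch: with mmap_sem held only for read,
 * vma->anon_vma_chain may be walked only under anon_vma_chain_lock. */
static int count_anon_vmas(struct vm_area_struct *vma)
{
	struct mm_struct *mm = vma->vm_mm;
	struct anon_vma_chain *avc;
	int n = 0;

	/* caller holds down_read(&mm->mmap_sem) */
	spin_lock(&mm->anon_vma_chain_lock);
	list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
		n++;	/* avc and avc->anon_vma are stable here */
	spin_unlock(&mm->anon_vma_chain_lock);

	return n;
}

Paths that already hold mmap_sem for write (mmap/munmap/vma_adjust) do
not need the spinlock, which is why the patch only takes it in
anon_vma_prepare and expand_downwards.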

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..a0679c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -239,6 +239,7 @@ struct mm_struct {
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
 	spinlock_t page_table_lock;		/* Protects page tables and some counters */
+	spinlock_t anon_vma_chain_lock;		/* Protects vma->anon_vma_chain, with mmap_sem */
 
 	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
 						 * together off init_mm.mmlist, and are protected
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..703c472 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -52,11 +52,15 @@ struct anon_vma {
  * all the anon_vmas associated with this VMA.
  * The "same_anon_vma" list contains the anon_vma_chains
  * which link all the VMAs associated with this anon_vma.
+ *
+ * The "same_vma" list is locked by either having mm->mmap_sem
+ * locked for writing, or having mm->mmap_sem locked for reading
+ * AND holding the mm->anon_vma_chain_lock.
  */
 struct anon_vma_chain {
 	struct vm_area_struct *vma;
 	struct anon_vma *anon_vma;
-	struct list_head same_vma;   /* locked by mmap_sem & page_table_lock */
+	struct list_head same_vma;	/* see above */
 	struct list_head same_anon_vma;	/* locked by anon_vma->lock */
 };
 
@@ -90,18 +94,24 @@ static inline struct anon_vma *page_anon_vma(struct page *page)
 	return page_rmapping(page);
 }
 
-static inline void anon_vma_lock(struct vm_area_struct *vma)
-{
-	struct anon_vma *anon_vma = vma->anon_vma;
-	if (anon_vma)
-		spin_lock(&anon_vma->lock);
-}
+#define anon_vma_lock(vma, nest_lock)					\
+({									\
+	struct anon_vma *anon_vma = vma->anon_vma;			\
+	if (anon_vma) {							\
+		struct anon_vma_chain *avc;				\
+		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) \
+			spin_lock_nest_lock(&avc->anon_vma->lock, nest_lock); \
+	}								\
+})
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
-	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+	if (anon_vma) {
+		struct anon_vma_chain *avc;
+		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+			spin_unlock(&avc->anon_vma->lock);
+	}
 }
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 44b0791..83b1ba2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -468,6 +468,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 	mm->nr_ptes = 0;
 	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
 	spin_lock_init(&mm->page_table_lock);
+	spin_lock_init(&mm->anon_vma_chain_lock);
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
 	mm_init_aio(mm);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 57aba0d..3ce8a1f 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,7 @@ struct mm_struct init_mm = {
 	.mm_count	= ATOMIC_INIT(1),
 	.mmap_sem	= __RWSEM_INITIALIZER(init_mm.mmap_sem),
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
+	.anon_vma_chain_lock =  __SPIN_LOCK_UNLOCKED(init_mm.anon_vma_chain_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	.cpu_vm_mask	= CPU_MASK_ALL,
 };
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..4602358 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -452,7 +452,7 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 		spin_lock(&mapping->i_mmap_lock);
 		vma->vm_truncate_count = mapping->truncate_count;
 	}
-	anon_vma_lock(vma);
+	anon_vma_lock(vma, &mm->mmap_sem);
 
 	__vma_link(mm, vma, prev, rb_link, rb_parent);
 	__vma_link_file(vma);
@@ -578,6 +578,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
+	anon_vma_lock(vma, &mm->mmap_sem);
 	if (root) {
 		flush_dcache_mmap_lock(mapping);
 		vma_prio_tree_remove(vma, root);
@@ -599,6 +600,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 		vma_prio_tree_insert(vma, root);
 		flush_dcache_mmap_unlock(mapping);
 	}
+	anon_vma_unlock(vma);
 
 	if (remove_next) {
 		/*
@@ -1705,12 +1707,11 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		return -EFAULT;
 
 	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
+	 * Unlike expand_downwards, we do not need to take the anon_vma lock,
+	 * because we leave vma->vm_start and vma->pgoff untouched. 
+	 * This means rmap lookups of pages inside this VMA stay valid
+	 * throughout the stack expansion.
 	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
-	anon_vma_lock(vma);
 
 	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
@@ -1721,7 +1722,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	if (address < PAGE_ALIGN(address+4))
 		address = PAGE_ALIGN(address+4);
 	else {
-		anon_vma_unlock(vma);
 		return -ENOMEM;
 	}
 	error = 0;
@@ -1737,7 +1737,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		if (!error)
 			vma->vm_end = address;
 	}
-	anon_vma_unlock(vma);
 	return error;
 }
 #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
@@ -1749,6 +1748,7 @@ static int expand_downwards(struct vm_area_struct *vma,
 				   unsigned long address)
 {
 	int error;
+	struct mm_struct *mm = vma->vm_mm;
 
 	/*
 	 * We must make sure the anon_vma is allocated
@@ -1762,7 +1762,8 @@ static int expand_downwards(struct vm_area_struct *vma,
 	if (error)
 		return error;
 
-	anon_vma_lock(vma);
+	spin_lock(&mm->anon_vma_chain_lock);
+	anon_vma_lock(vma, &mm->anon_vma_chain_lock);
 
 	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
@@ -1784,6 +1785,8 @@ static int expand_downwards(struct vm_area_struct *vma,
 		}
 	}
 	anon_vma_unlock(vma);
+	spin_unlock(&mm->anon_vma_chain_lock);
+
 	return error;
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 526704e..98d6289 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -23,6 +23,7 @@
  * inode->i_mutex	(while writing or truncating, not reading or faulting)
  *   inode->i_alloc_sem (vmtruncate_range)
  *   mm->mmap_sem
+ *   mm->anon_vma_chain_lock (mmap_sem for read, protects vma->anon_vma_chain)
  *     page->flags PG_locked (lock_page)
  *       mapping->i_mmap_lock
  *         anon_vma->lock
@@ -133,10 +134,11 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 				goto out_enomem_free_avc;
 			allocated = anon_vma;
 		}
+
+		/* anon_vma_chain_lock to protect against threads */
+		spin_lock(&mm->anon_vma_chain_lock);
 		spin_lock(&anon_vma->lock);
 
-		/* page_table_lock to protect against threads */
-		spin_lock(&mm->page_table_lock);
 		if (likely(!vma->anon_vma)) {
 			vma->anon_vma = anon_vma;
 			avc->anon_vma = anon_vma;
@@ -145,9 +147,9 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 			list_add(&avc->same_anon_vma, &anon_vma->head);
 			allocated = NULL;
 		}
-		spin_unlock(&mm->page_table_lock);
-
 		spin_unlock(&anon_vma->lock);
+		spin_unlock(&mm->anon_vma_chain_lock);
+
 		if (unlikely(allocated)) {
 			anon_vma_free(allocated);
 			anon_vma_chain_free(avc);


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: [PATCH 0/3] Fix migration races in rmap_walk() V2
  2010-04-28 20:40                     ` Andrea Arcangeli
@ 2010-04-28 21:05                       ` Andrea Arcangeli
  -1 siblings, 0 replies; 132+ messages in thread
From: Andrea Arcangeli @ 2010-04-28 21:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KAMEZAWA Hiroyuki, Christoph Lameter, Linux-MM, LKML,
	Minchan Kim, Rik van Riel, Andrew Morton

On Wed, Apr 28, 2010 at 10:40:43PM +0200, Andrea Arcangeli wrote:
> Ok the best idea so far I had is to add a fake temporary fake vma to
> the anon_vma list with the old vm_start and same vm_pgoff before
> shifting down vma->vm_start and calling move_page_tables. Then after
> the move is complete we remove the fake vma. So all the fast paths
> will remain unmodified and no magic is required. I'll try to fix this
> for the old stable anon-vma code and test in aa.git first as the code
> will differ. If it works ok anybody can port it to new anon-vma code.

Here is the new attempt, against the aa.git / 2.6.33 anon-vma code.

----
Subject: fix race between shift_arg_pages and rmap_walk

From: Andrea Arcangeli <aarcange@redhat.com>

migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.

And split_huge_page() will have the same requirements as migrate.c
already has.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/fsnotify.h>
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -502,7 +503,9 @@ static int shift_arg_pages(struct vm_are
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
+	unsigned long moved_length;
 	struct mmu_gather *tlb;
+	struct vm_area_struct *tmp_vma;
 
 	BUG_ON(new_start > new_end);
 
@@ -514,16 +517,36 @@ static int shift_arg_pages(struct vm_are
 		return -EFAULT;
 
 	/*
+	 * We need to create a fake temporary vma and index it in the
+	 * anon_vma list in order to allow the pages to be reachable
+	 * at all times by the rmap walk for migrate, while
+	 * move_page_tables() is running.
+	 */
+	tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	if (!tmp_vma)
+		return -ENOMEM;
+	*tmp_vma = *vma;
+
+	spin_lock(&vma->anon_vma->lock);
+	/*
 	 * cover the whole range: [new_start, old_end)
 	 */
-	vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL);
+	vma->vm_start = new_start;
+	__anon_vma_link(tmp_vma);
+	spin_unlock(&vma->anon_vma->lock);
 
 	/*
 	 * move the page tables downwards, on failure we rely on
 	 * process cleanup to remove whatever mess we made.
 	 */
-	if (length != move_page_tables(vma, old_start,
-				       vma, new_start, length))
+	moved_length = move_page_tables(vma, old_start,
+					vma, new_start, length);
+
+	/* rmap walk will already find all pages using the new_start */
+	anon_vma_unlink(tmp_vma);
+	kmem_cache_free(vm_area_cachep, tmp_vma);
+
+	if (length != moved_length) 
 		return -ENOMEM;
 
 	lru_add_drain();
@@ -549,7 +572,7 @@ static int shift_arg_pages(struct vm_are
 	/*
 	 * shrink the vma to just the new range.
 	 */
-	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
+	vma->vm_end = new_end;
 
 	return 0;
 }

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [RFC PATCH -v3] take all anon_vma locks in anon_vma_lock
  2010-04-28 20:57                     ` Rik van Riel
@ 2010-04-29  0:28                       ` Minchan Kim
  -1 siblings, 0 replies; 132+ messages in thread
From: Minchan Kim @ 2010-04-29  0:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Mel Gorman, Linux-MM, LKML, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrew Morton

On Thu, Apr 29, 2010 at 5:57 AM, Rik van Riel <riel@redhat.com> wrote:
> Take all the locks for all the anon_vmas in anon_vma_lock, this properly
> excludes migration and the transparent hugepage code from VMA changes done
> by mmap/munmap/mprotect/expand_stack/etc...
>
> Unfortunately, this requires adding a new lock (mm->anon_vma_chain_lock),
> otherwise we have an unavoidable lock ordering conflict.  This changes the
> locking rules for the "same_vma" list to be either mm->mmap_sem for write,
> or mm->mmap_sem for read plus the new mm->anon_vma_chain lock.  This limits
> the place where the new lock is taken to 2 locations - anon_vma_prepare and
> expand_downwards.
>
> Document the locking rules for the same_vma list in the anon_vma_chain and
> remove the anon_vma_lock call from expand_upwards, which does not need it.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>

This patch makes things simple, so I like it.
Actually, I wanted this all-at-once locking approach, but I was worried
about how the patch would affect the AIM7 workload, which is the
scalability case that motivated Rik's anon_vma_chain work. Since Rik
himself is sending the patch, I assume it does not hurt the scalability
of that workload too badly.

Let's wait for the test results, unless Rik already knows AIM7 is fine.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [RFC PATCH -v3] take all anon_vma locks in anon_vma_lock
  2010-04-29  0:28                       ` Minchan Kim
@ 2010-04-29  2:10                         ` Rik van Riel
  -1 siblings, 0 replies; 132+ messages in thread
From: Rik van Riel @ 2010-04-29  2:10 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrea Arcangeli, Mel Gorman, Linux-MM, LKML, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrew Morton

On 04/28/2010 08:28 PM, Minchan Kim wrote:
> On Thu, Apr 29, 2010 at 5:57 AM, Rik van Riel<riel@redhat.com>  wrote:
>> Take all the locks for all the anon_vmas in anon_vma_lock, this properly
>> excludes migration and the transparent hugepage code from VMA changes done
>> by mmap/munmap/mprotect/expand_stack/etc...
>>
>> Unfortunately, this requires adding a new lock (mm->anon_vma_chain_lock),
>> otherwise we have an unavoidable lock ordering conflict.  This changes the
>> locking rules for the "same_vma" list to be either mm->mmap_sem for write,
>> or mm->mmap_sem for read plus the new mm->anon_vma_chain lock.  This limits
>> the place where the new lock is taken to 2 locations - anon_vma_prepare and
>> expand_downwards.
>>
>> Document the locking rules for the same_vma list in the anon_vma_chain and
>> remove the anon_vma_lock call from expand_upwards, which does not need it.
>>
>> Signed-off-by: Rik van Riel<riel@redhat.com>
>
> This patch makes things simple. So I like this.
> Actually, I wanted this all-at-once locks approach.
> But I was worried about that how the patch affects AIM 7 workload
> which is cause of anon_vma_chain about scalability by Rik.
> But now Rik himself is sending the patch. So I assume the patch
> couldn't decrease scalability of the workload heavily.

The thing is, the number of anon_vmas attached to a VMA is
small (depth of the tree, so for apache or aim the typical
depth is 2). This N is between 1 and 3.

The problem we had originally is the _width_ of the tree,
where every sibling process was attached to the same anon_vma
and the rmap code had to walk the page tables of all the
processes, for every privately owned page in each child process.
For large server workloads, this N is between a few hundred and
a few thousand.

What matters most at this point is correctness - we need to be
able to exclude rmap walks when messing with a VMA in any way
that breaks lookups, because rmap walks for page migration and
hugepage conversion have to be 100% reliable.

That is not a constraint I had in mind with the original
anon_vma changes, so the code needs to be fixed up now...

I doubt that taking one or two extra spinlocks in the code
paths changed by this patch (mmap/munmap/...) is going to make
any measurable difference, since all of those paths are taken
pretty infrequently.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [RFC PATCH -v3] take all anon_vma locks in anon_vma_lock
  2010-04-29  2:10                         ` Rik van Riel
@ 2010-04-29  2:55                           ` Minchan Kim
  -1 siblings, 0 replies; 132+ messages in thread
From: Minchan Kim @ 2010-04-29  2:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Mel Gorman, Linux-MM, LKML, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrew Morton

On Thu, Apr 29, 2010 at 11:10 AM, Rik van Riel <riel@redhat.com> wrote:
> On 04/28/2010 08:28 PM, Minchan Kim wrote:
>>
>> On Thu, Apr 29, 2010 at 5:57 AM, Rik van Riel<riel@redhat.com>  wrote:
>>>
>>> Take all the locks for all the anon_vmas in anon_vma_lock, this properly
>>> excludes migration and the transparent hugepage code from VMA changes
>>> done
>>> by mmap/munmap/mprotect/expand_stack/etc...
>>>
>>> Unfortunately, this requires adding a new lock (mm->anon_vma_chain_lock),
>>> otherwise we have an unavoidable lock ordering conflict.  This changes
>>> the
>>> locking rules for the "same_vma" list to be either mm->mmap_sem for
>>> write,
>>> or mm->mmap_sem for read plus the new mm->anon_vma_chain lock.  This
>>> limits
>>> the place where the new lock is taken to 2 locations - anon_vma_prepare
>>> and
>>> expand_downwards.
>>>
>>> Document the locking rules for the same_vma list in the anon_vma_chain
>>> and
>>> remove the anon_vma_lock call from expand_upwards, which does not need
>>> it.
>>>
>>> Signed-off-by: Rik van Riel<riel@redhat.com>
>>
>> This patch makes things simple. So I like this.
>> Actually, I wanted this all-at-once locks approach.
>> But I was worried about that how the patch affects AIM 7 workload
>> which is cause of anon_vma_chain about scalability by Rik.
>> But now Rik himself is sending the patch. So I assume the patch
>> couldn't decrease scalability of the workload heavily.
>
> The thing is, the number of anon_vmas attached to a VMA is
> small (depth of the tree, so for apache or aim the typical
> depth is 2). This N is between 1 and 3.
>
> The problem we had originally is the _width_ of the tree,
> where every sibling process was attached to the same anon_vma
> and the rmap code had to walk the page tables of all the
> processes, for every privately owned page in each child process.
> For large server workloads, this N is between a few hundred and
> a few thousand.
>
> What matters most at this point is correctness - we need to be
> able to exclude rmap walks when messing with a VMA in any way
> that breaks lookups, because rmap walks for page migration and
> hugepage conversion have to be 100% reliable.
>
> That is not a constraint I had in mind with the original
> anon_vma changes, so the code needs to be fixed up now...

Yes, I understand that.

When you first posted the anon_vma_chain patches, the concern I raised
was about the parent's VMA, not the child's: the parent's VMA can still
have N anon_vmas attached. AFAIR you said it was a trade-off and at
least better than the old code, and I agreed. I just want to remind you
of it, because this patch makes that corner case a little worse: we now
have to take all N locks. :)

Am I missing something? Can we simply ignore the latency of that case,
since it happens so infrequently? I am not against this patch; I just
want to hear your opinion.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [RFC PATCH -v3] take all anon_vma locks in anon_vma_lock
  2010-04-29  2:55                           ` Minchan Kim
@ 2010-04-29  6:42                             ` Minchan Kim
  -1 siblings, 0 replies; 132+ messages in thread
From: Minchan Kim @ 2010-04-29  6:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Mel Gorman, Linux-MM, LKML, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrew Morton

On Thu, Apr 29, 2010 at 11:55 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Thu, Apr 29, 2010 at 11:10 AM, Rik van Riel <riel@redhat.com> wrote:
>> On 04/28/2010 08:28 PM, Minchan Kim wrote:
>>>
>>> On Thu, Apr 29, 2010 at 5:57 AM, Rik van Riel<riel@redhat.com>  wrote:
>>>>
>>>> Take all the locks for all the anon_vmas in anon_vma_lock, this properly
>>>> excludes migration and the transparent hugepage code from VMA changes
>>>> done
>>>> by mmap/munmap/mprotect/expand_stack/etc...
>>>>
>>>> Unfortunately, this requires adding a new lock (mm->anon_vma_chain_lock),
>>>> otherwise we have an unavoidable lock ordering conflict.  This changes
>>>> the
>>>> locking rules for the "same_vma" list to be either mm->mmap_sem for
>>>> write,
>>>> or mm->mmap_sem for read plus the new mm->anon_vma_chain lock.  This
>>>> limits
>>>> the place where the new lock is taken to 2 locations - anon_vma_prepare
>>>> and
>>>> expand_downwards.
>>>>
>>>> Document the locking rules for the same_vma list in the anon_vma_chain
>>>> and
>>>> remove the anon_vma_lock call from expand_upwards, which does not need
>>>> it.
>>>>
>>>> Signed-off-by: Rik van Riel<riel@redhat.com>
>>>
>>> This patch makes things simple. So I like this.
>>> Actually, I wanted this all-at-once locks approach.
>>> But I was worried about that how the patch affects AIM 7 workload
>>> which is cause of anon_vma_chain about scalability by Rik.
>>> But now Rik himself is sending the patch. So I assume the patch
>>> couldn't decrease scalability of the workload heavily.
>>
>> The thing is, the number of anon_vmas attached to a VMA is
>> small (depth of the tree, so for apache or aim the typical
>> depth is 2). This N is between 1 and 3.
>>
>> The problem we had originally is the _width_ of the tree,
>> where every sibling process was attached to the same anon_vma
>> and the rmap code had to walk the page tables of all the
>> processes, for every privately owned page in each child process.
>> For large server workloads, this N is between a few hundred and
>> a few thousand.
>>
>> What matters most at this point is correctness - we need to be
>> able to exclude rmap walks when messing with a VMA in any way
>> that breaks lookups, because rmap walks for page migration and
>> hugepage conversion have to be 100% reliable.
>>
>> That is not a constraint I had in mind with the original
>> anon_vma changes, so the code needs to be fixed up now...
>
> Yes, I understand that.
>
> When you posted the anon_vma_chain patches, the concern I raised was
> about the parent's VMA, not the child's one.
> The parent's VMA still has N anon_vmas.
> AFAIR, you said it was a trade-off and at least better than the old
> scheme, and I agreed.  I just want to remind you of it, because this
> patch makes that case worse.  :)
> The corner case is that we have to take all N locks.
>
> Am I missing something?
> Can't we simply ignore the latency of that case, given that it should
> happen only infrequently?
> I am not against this patch; I just want to hear your opinion.

/me slaps self.

It's about the height of the tree, and I can't imagine a scenario with
a very tall tree (fork->fork->fork->...->fork).
So, as Rik pointed out, it's not a big overhead, at least in terms of
latency, I think.

I support this approach.
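
To put numbers on that (a minimal userspace sketch, purely
illustrative; the depth arithmetic below is my reading of the
anon_vma_chain scheme, not kernel code):

/*
 * Userspace model only: with anon_vma_chains, the number of locks the
 * proposed anon_vma_lock() takes on a VMA is the length of its
 * same_vma chain, which grows with fork depth, not with the number of
 * sibling processes.
 */
#include <stdio.h>

int main(void)
{
	int depth;

	for (depth = 0; depth <= 4; depth++) {
		/*
		 * A VMA in a process at fork depth d carries its own
		 * anon_vma plus one inherited from each ancestor, so
		 * its same_vma list has d + 1 entries and the proposed
		 * anon_vma_lock() takes d + 1 spinlocks.
		 */
		printf("fork depth %d -> %d anon_vma lock(s)\n",
		       depth, depth + 1);
	}
	return 0;
}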

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [RFC PATCH -v3] take all anon_vma locks in anon_vma_lock
  2010-04-29  0:28                       ` Minchan Kim
@ 2010-04-29  7:37                         ` Mel Gorman
  -1 siblings, 0 replies; 132+ messages in thread
From: Mel Gorman @ 2010-04-29  7:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, Andrea Arcangeli, Linux-MM, LKML,
	KAMEZAWA Hiroyuki, Christoph Lameter, Andrew Morton

On Thu, Apr 29, 2010 at 09:28:25AM +0900, Minchan Kim wrote:
> On Thu, Apr 29, 2010 at 5:57 AM, Rik van Riel <riel@redhat.com> wrote:
> > Take all the locks for all the anon_vmas in anon_vma_lock, this properly
> > excludes migration and the transparent hugepage code from VMA changes done
> > by mmap/munmap/mprotect/expand_stack/etc...
> >
> > Unfortunately, this requires adding a new lock (mm->anon_vma_chain_lock),
> > otherwise we have an unavoidable lock ordering conflict.  This changes the
> > locking rules for the "same_vma" list to be either mm->mmap_sem for write,
> > or mm->mmap_sem for read plus the new mm->anon_vma_chain lock.  This limits
> > the place where the new lock is taken to 2 locations - anon_vma_prepare and
> > expand_downwards.
> >
> > Document the locking rules for the same_vma list in the anon_vma_chain and
> > remove the anon_vma_lock call from expand_upwards, which does not need it.
> >
> > Signed-off-by: Rik van Riel <riel@redhat.com>
> 
> This patch makes things simple. So I like this.

Agreed.

> Actually, I wanted this all-at-once locks approach.
> But I was worried about that how the patch affects AIM 7 workload
> which is cause of anon_vma_chain about scalability by Rik.

I had similar concerns. I'm surprised how it worked out.

> But now Rik himself is sending the patch. So I assume the patch
> couldn't decrease scalability of the workload heavily.
> 
> Let's wait result of test if Rik doesn't have a problem of AIM7.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [RFC PATCH -v3] take all anon_vma locks in anon_vma_lock
  2010-04-28 20:57                     ` Rik van Riel
@ 2010-04-29  8:15                       ` Mel Gorman
  -1 siblings, 0 replies; 132+ messages in thread
From: Mel Gorman @ 2010-04-29  8:15 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrew Morton

On Wed, Apr 28, 2010 at 04:57:34PM -0400, Rik van Riel wrote:
> Take all the locks for all the anon_vmas in anon_vma_lock, this properly
> excludes migration and the transparent hugepage code from VMA changes done
> by mmap/munmap/mprotect/expand_stack/etc...
> 

In vma_adjust(), what prevents something like rmap_walk() seeing partial
updates while the following lines execute?

        vma->vm_start = start;
        vma->vm_end = end;
        vma->vm_pgoff = pgoff;
        if (adjust_next) {
                next->vm_start += adjust_next << PAGE_SHIFT;
                next->vm_pgoff += adjust_next;
        }

They would appear to happen outside the lock, even with this patch. The
update happened within the lock in 2.6.33.
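
To spell out why a torn update would matter (a throwaway userspace
model, not kernel code; toy_vma_address() only approximates the
vma_address() calculation in mm/rmap.c): an rmap walker derives the
page's virtual address from vm_start and vm_pgoff together, so seeing
the new value of one field combined with the old value of the other
yields an address that points at the wrong place.

#include <stdio.h>

#define PAGE_SHIFT 12UL

struct toy_vma {
	unsigned long vm_start;		/* address of the first page */
	unsigned long vm_pgoff;		/* page offset of vm_start */
};

/* Roughly the vma_address() calculation, illustrative only */
static unsigned long toy_vma_address(const struct toy_vma *vma,
				     unsigned long pgoff)
{
	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
}

int main(void)
{
	unsigned long pgoff = 0x110;	/* the page being looked up */

	/* vma_adjust() trims 0x10 pages off the front of the VMA */
	struct toy_vma before = { 0x100000, 0x100 };
	struct toy_vma after  = { 0x110000, 0x110 };
	/* torn view: new vm_start observed with the old vm_pgoff */
	struct toy_vma torn   = { after.vm_start, before.vm_pgoff };

	printf("before: %#lx\n", toy_vma_address(&before, pgoff));
	printf("after : %#lx\n", toy_vma_address(&after, pgoff));
	printf("torn  : %#lx  <-- wrong address\n",
	       toy_vma_address(&torn, pgoff));
	return 0;
}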

> Unfortunately, this requires adding a new lock (mm->anon_vma_chain_lock),
> otherwise we have an unavoidable lock ordering conflict.  This changes the
> locking rules for the "same_vma" list to be either mm->mmap_sem for write, 
> or mm->mmap_sem for read plus the new mm->anon_vma_chain lock.  This limits
> the place where the new lock is taken to 2 locations - anon_vma_prepare and
> expand_downwards.
> 
> Document the locking rules for the same_vma list in the anon_vma_chain and
> remove the anon_vma_lock call from expand_upwards, which does not need it.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> 
> --- 
> Posted quickly as an RFC patch, only compile tested so far.
> Andrea, Mel, does this look like a reasonable approach?
> 

Yes.

> v3:
>  - change anon_vma_unlock into a macro so lockdep works right
>  - fix lock ordering in anon_vma_prepare
> v2:
>  - also change anon_vma_unlock to walk the loop
>  - add calls to anon_vma_lock & anon_vma_unlock to vma_adjust
>  - introduce a new lock for the vma->anon_vma_chain list, to prevent
>    the lock inversion that Andrea pointed out
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index b8bb9a6..a0679c6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -239,6 +239,7 @@ struct mm_struct {
>  	int map_count;				/* number of VMAs */
>  	struct rw_semaphore mmap_sem;
>  	spinlock_t page_table_lock;		/* Protects page tables and some counters */
> +	spinlock_t anon_vma_chain_lock;		/* Protects vma->anon_vma_chain, with mmap_sem */
>  
>  	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
>  						 * together off init_mm.mmlist, and are protected
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index d25bd22..703c472 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -52,11 +52,15 @@ struct anon_vma {
>   * all the anon_vmas associated with this VMA.
>   * The "same_anon_vma" list contains the anon_vma_chains
>   * which link all the VMAs associated with this anon_vma.
> + *
> + * The "same_vma" list is locked by either having mm->mmap_sem
> + * locked for writing, or having mm->mmap_sem locked for reading
> + * AND holding the mm->anon_vma_chain_lock.
>   */
>  struct anon_vma_chain {
>  	struct vm_area_struct *vma;
>  	struct anon_vma *anon_vma;
> -	struct list_head same_vma;   /* locked by mmap_sem & page_table_lock */
> +	struct list_head same_vma;	/* see above */
>  	struct list_head same_anon_vma;	/* locked by anon_vma->lock */
>  };
>  
> @@ -90,18 +94,24 @@ static inline struct anon_vma *page_anon_vma(struct page *page)
>  	return page_rmapping(page);
>  }
>  
> -static inline void anon_vma_lock(struct vm_area_struct *vma)
> -{
> -	struct anon_vma *anon_vma = vma->anon_vma;
> -	if (anon_vma)
> -		spin_lock(&anon_vma->lock);
> -}
> +#define anon_vma_lock(vma, nest_lock)					\
> +({									\
> +	struct anon_vma *anon_vma = vma->anon_vma;			\
> +	if (anon_vma) {							\
> +		struct anon_vma_chain *avc;				\
> +		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) \
> +			spin_lock_nest_lock(&avc->anon_vma->lock, nest_lock); \
> +	}								\
> +})
>  
>  static inline void anon_vma_unlock(struct vm_area_struct *vma)
>  {
>  	struct anon_vma *anon_vma = vma->anon_vma;
> -	if (anon_vma)
> -		spin_unlock(&anon_vma->lock);
> +	if (anon_vma) {
> +		struct anon_vma_chain *avc;
> +		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
> +			spin_unlock(&avc->anon_vma->lock);
> +	}
>  }
>  
>  /*
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 44b0791..83b1ba2 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -468,6 +468,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
>  	mm->nr_ptes = 0;
>  	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
>  	spin_lock_init(&mm->page_table_lock);
> +	spin_lock_init(&mm->anon_vma_chain_lock);
>  	mm->free_area_cache = TASK_UNMAPPED_BASE;
>  	mm->cached_hole_size = ~0UL;
>  	mm_init_aio(mm);
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index 57aba0d..3ce8a1f 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -15,6 +15,7 @@ struct mm_struct init_mm = {
>  	.mm_count	= ATOMIC_INIT(1),
>  	.mmap_sem	= __RWSEM_INITIALIZER(init_mm.mmap_sem),
>  	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
> +	.anon_vma_chain_lock =  __SPIN_LOCK_UNLOCKED(init_mm.anon_vma_chain_lock),
>  	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
>  	.cpu_vm_mask	= CPU_MASK_ALL,
>  };
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f90ea92..4602358 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -452,7 +452,7 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
>  		spin_lock(&mapping->i_mmap_lock);
>  		vma->vm_truncate_count = mapping->truncate_count;
>  	}
> -	anon_vma_lock(vma);
> +	anon_vma_lock(vma, &mm->mmap_sem);
>  
>  	__vma_link(mm, vma, prev, rb_link, rb_parent);
>  	__vma_link_file(vma);
> @@ -578,6 +578,7 @@ again:			remove_next = 1 + (end > next->vm_end);
>  		}
>  	}
>  
> +	anon_vma_lock(vma, &mm->mmap_sem);
>  	if (root) {
>  		flush_dcache_mmap_lock(mapping);
>  		vma_prio_tree_remove(vma, root);
> @@ -599,6 +600,7 @@ again:			remove_next = 1 + (end > next->vm_end);
>  		vma_prio_tree_insert(vma, root);
>  		flush_dcache_mmap_unlock(mapping);
>  	}
> +	anon_vma_unlock(vma);
>  
>  	if (remove_next) {
>  		/*
> @@ -1705,12 +1707,11 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>  		return -EFAULT;
>  
>  	/*
> -	 * We must make sure the anon_vma is allocated
> -	 * so that the anon_vma locking is not a noop.
> +	 * Unlike expand_downwards, we do not need to take the anon_vma lock,
> +	 * because we leave vma->vm_start and vma->pgoff untouched. 
> +	 * This means rmap lookups of pages inside this VMA stay valid
> +	 * throughout the stack expansion.
>  	 */
> -	if (unlikely(anon_vma_prepare(vma)))
> -		return -ENOMEM;
> -	anon_vma_lock(vma);
>  
>  	/*
>  	 * vma->vm_start/vm_end cannot change under us because the caller
> @@ -1721,7 +1722,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>  	if (address < PAGE_ALIGN(address+4))
>  		address = PAGE_ALIGN(address+4);
>  	else {
> -		anon_vma_unlock(vma);
>  		return -ENOMEM;
>  	}
>  	error = 0;
> @@ -1737,7 +1737,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>  		if (!error)
>  			vma->vm_end = address;
>  	}
> -	anon_vma_unlock(vma);
>  	return error;
>  }
>  #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
> @@ -1749,6 +1748,7 @@ static int expand_downwards(struct vm_area_struct *vma,
>  				   unsigned long address)
>  {
>  	int error;
> +	struct mm_struct *mm = vma->vm_mm;
>  
>  	/*
>  	 * We must make sure the anon_vma is allocated
> @@ -1762,7 +1762,8 @@ static int expand_downwards(struct vm_area_struct *vma,
>  	if (error)
>  		return error;
>  
> -	anon_vma_lock(vma);
> +	spin_lock(&mm->anon_vma_chain_lock);
> +	anon_vma_lock(vma, &mm->anon_vma_chain_lock);
>  
>  	/*
>  	 * vma->vm_start/vm_end cannot change under us because the caller
> @@ -1784,6 +1785,8 @@ static int expand_downwards(struct vm_area_struct *vma,
>  		}
>  	}
>  	anon_vma_unlock(vma);
> +	spin_unlock(&mm->anon_vma_chain_lock);
> +
>  	return error;
>  }
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 526704e..98d6289 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -23,6 +23,7 @@
>   * inode->i_mutex	(while writing or truncating, not reading or faulting)
>   *   inode->i_alloc_sem (vmtruncate_range)
>   *   mm->mmap_sem
> + *   mm->anon_vma_chain_lock (mmap_sem for read, protects vma->anon_vma_chain)
>   *     page->flags PG_locked (lock_page)
>   *       mapping->i_mmap_lock
>   *         anon_vma->lock
> @@ -133,10 +134,11 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  				goto out_enomem_free_avc;
>  			allocated = anon_vma;
>  		}
> +
> +		/* anon_vma_chain_lock to protect against threads */
> +		spin_lock(&mm->anon_vma_chain_lock);
>  		spin_lock(&anon_vma->lock);
>  
> -		/* page_table_lock to protect against threads */
> -		spin_lock(&mm->page_table_lock);
>  		if (likely(!vma->anon_vma)) {
>  			vma->anon_vma = anon_vma;
>  			avc->anon_vma = anon_vma;
> @@ -145,9 +147,9 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  			list_add(&avc->same_anon_vma, &anon_vma->head);
>  			allocated = NULL;
>  		}
> -		spin_unlock(&mm->page_table_lock);
> -
>  		spin_unlock(&anon_vma->lock);
> +		spin_unlock(&mm->anon_vma_chain_lock);
> +
>  		if (unlikely(allocated)) {
>  			anon_vma_free(allocated);
>  			anon_vma_chain_free(avc);
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [RFC PATCH -v3] take all anon_vma locks in anon_vma_lock
  2010-04-29  8:15                       ` Mel Gorman
@ 2010-04-29  8:32                         ` Minchan Kim
  -1 siblings, 0 replies; 132+ messages in thread
From: Minchan Kim @ 2010-04-29  8:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Andrea Arcangeli, Linux-MM, LKML,
	KAMEZAWA Hiroyuki, Christoph Lameter, Andrew Morton

On Thu, 2010-04-29 at 09:15 +0100, Mel Gorman wrote:
> On Wed, Apr 28, 2010 at 04:57:34PM -0400, Rik van Riel wrote:
> > Take all the locks for all the anon_vmas in anon_vma_lock, this properly
> > excludes migration and the transparent hugepage code from VMA changes done
> > by mmap/munmap/mprotect/expand_stack/etc...
> > 
> 
> In vma_adjust(), what prevents something like rmap_walk() seeing partial
> updates while the following lines execute?
> 
>         vma->vm_start = start;
>         vma->vm_end = end;
>         vma->vm_pgoff = pgoff;
>         if (adjust_next) {
>                 next->vm_start += adjust_next << PAGE_SHIFT;
>                 next->vm_pgoff += adjust_next;
>         }
> They would appear to happen outside the lock, even with this patch. The
> update happened within the lock in 2.6.33.
> 
> 
> 
This part does it. :)

----
@@ -578,6 +578,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
+	anon_vma_lock(vma, &mm->mmap_sem);
 	if (root) {
 		flush_dcache_mmap_lock(mapping);
 		vma_prio_tree_remove(vma, root);
@@ -599,6 +600,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 		vma_prio_tree_insert(vma, root);
 		flush_dcache_mmap_unlock(mapping);
 	}
+	anon_vma_unlock(vma);
---

But we still need a patch for shift_arg_pages.



-- 
Kind regards,
Minchan Kim



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [RFC PATCH -v3] take all anon_vma locks in anon_vma_lock
  2010-04-29  8:32                         ` Minchan Kim
@ 2010-04-29  8:44                           ` Mel Gorman
  -1 siblings, 0 replies; 132+ messages in thread
From: Mel Gorman @ 2010-04-29  8:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, Andrea Arcangeli, Linux-MM, LKML,
	KAMEZAWA Hiroyuki, Christoph Lameter, Andrew Morton

On Thu, Apr 29, 2010 at 05:32:17PM +0900, Minchan Kim wrote:
> On Thu, 2010-04-29 at 09:15 +0100, Mel Gorman wrote:
> > On Wed, Apr 28, 2010 at 04:57:34PM -0400, Rik van Riel wrote:
> > > Take all the locks for all the anon_vmas in anon_vma_lock, this properly
> > > excludes migration and the transparent hugepage code from VMA changes done
> > > by mmap/munmap/mprotect/expand_stack/etc...
> > > 
> > 
> > In vma_adjust(), what prevents something like rmap_walk() seeing partial
> > updates while the following lines execute?
> > 
> >         vma->vm_start = start;
> >         vma->vm_end = end;
> >         vma->vm_pgoff = pgoff;
> >         if (adjust_next) {
> >                 next->vm_start += adjust_next << PAGE_SHIFT;
> >                 next->vm_pgoff += adjust_next;
> >         }
> > They would appear to happen outside the lock, even with this patch. The
> > update happened within the lock in 2.6.33.
> > 
> > 
> > 
> This part does it. :)
> 
> ----
> @@ -578,6 +578,7 @@ again:                      remove_next = 1 + (end >
> next->vm_end);
>                 }   
>         }   
>  
> +       anon_vma_lock(vma, &mm->mmap_sem);
>         if (root) {
>                 flush_dcache_mmap_lock(mapping);
>                 vma_prio_tree_remove(vma, root);
> @@ -599,6 +600,7 @@ again:                      remove_next = 1 + (end >
> next->vm_end);
>                 vma_prio_tree_insert(vma, root);
>                 flush_dcache_mmap_unlock(mapping);
>         }   
> +       anon_vma_unlock(vma);
> ---
> 

I'm blind. You're right.

> But we still need patch about shift_arg_pages.
> 

Assuming you are referring to migration, it's easiest to just not migrate
pages within the stack until after shift_arg_pages runs. The locks
cannot be held during move_page_tables() because the page allocator is
called. It could be done in two stages where pages are allocated outside
the lock and then passed to move_page_tables() but I don't think
increasing the cost of exec() is justified just so a page can be
migrated during exec.
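
For anyone following along, a rough userspace model of the window in
question (the ordering is my reading of shift_arg_pages(): the stack
VMA is moved first and its page tables only afterwards; everything
below is a toy, not kernel code):

/*
 * Userspace model: once the VMA has been shifted down but the page
 * tables have not been moved yet, a lookup driven by the updated VMA
 * computes the post-move address and misses the slot where the
 * (migration) PTE still sits.  All values are illustrative.
 */
#include <stdio.h>

#define PAGE_SHIFT 12UL

int main(void)
{
	unsigned long shift = 0x20UL << PAGE_SHIFT;	/* stack moved down */
	unsigned long old_vm_start = 0xbffe0000UL;
	unsigned long new_vm_start = old_vm_start - shift;

	/* first page of the stack VMA */
	unsigned long pte_slot    = old_vm_start;	/* PTEs not moved yet */
	unsigned long rmap_lookup = new_vm_start;	/* VMA already moved  */

	printf("lookup at %#lx, PTE still at %#lx -> %s\n",
	       rmap_lookup, pte_slot,
	       rmap_lookup == pte_slot ?
			"found" : "missed, migration PTE left behind");
	return 0;
}

Not migrating those pages until exec has finished setting up the stack
closes that window without holding any locks across move_page_tables().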

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [RFC PATCH -v3] take all anon_vma locks in anon_vma_lock
  2010-04-29  2:55                           ` Minchan Kim
@ 2010-04-29 15:39                             ` Rik van Riel
  -1 siblings, 0 replies; 132+ messages in thread
From: Rik van Riel @ 2010-04-29 15:39 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrea Arcangeli, Mel Gorman, Linux-MM, LKML, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrew Morton

On 04/28/2010 10:55 PM, Minchan Kim wrote:

> When you posted the anon_vma_chain patches, the concern I raised was
> about the parent's VMA, not the child's one.
> The parent's VMA still has N anon_vmas.

No, it is the other way around.

The anon_vma of the parent is also present in all of the
children, so the parent anon_vma is attached to N vmas.

However, the parent vma only has 1 anon_vma attached to
it, and each of the children will have 2 anon_vmas.

That is what should keep any locking overhead with this
patch minimal.

Yes, a deep fork bomb can slow itself down.  Too bad,
don't do that :)
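
A small userspace model of those two lists, using the numbers above
(the same_vma / same_anon_vma meanings follow the rmap.h comment in
the patch; the toy setup itself is an assumption):

/*
 * Userspace model: one parent plus NCHILDREN forked children sharing
 * the parent's anon_vma.  same_vma length is what anon_vma_lock()
 * would walk per VMA; same_anon_vma length is what an rmap walk would
 * traverse.
 */
#include <stdio.h>

#define NCHILDREN 100	/* the "wide" apache/AIM-style case */

struct toy_anon_vma { int same_anon_vma_len; };
struct toy_vma      { int same_vma_len; };

int main(void)
{
	struct toy_anon_vma parent_av = { 0 };
	struct toy_vma parent_vma = { 0 };
	struct toy_vma child_vma[NCHILDREN];
	int i;

	/* The parent VMA has only its own anon_vma. */
	parent_vma.same_vma_len = 1;
	parent_av.same_anon_vma_len = 1;

	for (i = 0; i < NCHILDREN; i++) {
		/* Each child VMA: its own anon_vma plus the parent's. */
		child_vma[i].same_vma_len = 2;
		/* The parent anon_vma is linked from every child VMA. */
		parent_av.same_anon_vma_len++;
	}

	printf("locks taken on the parent VMA   : %d\n",
	       parent_vma.same_vma_len);
	printf("locks taken on a child VMA      : %d\n",
	       child_vma[0].same_vma_len);
	printf("VMAs reached via parent anon_vma: %d\n",
	       parent_av.same_anon_vma_len);
	return 0;
}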

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 132+ messages in thread

end of thread, other threads:[~2010-04-30 19:27 UTC | newest]

Thread overview: 132+ messages
2010-04-27 21:30 [PATCH 0/3] Fix migration races in rmap_walk() V2 Mel Gorman
2010-04-27 21:30 ` Mel Gorman
2010-04-27 21:30 ` [PATCH 1/3] mm,migration: During fork(), wait for migration to end if migration PTE is encountered Mel Gorman
2010-04-27 21:30   ` Mel Gorman
2010-04-27 22:22   ` Andrea Arcangeli
2010-04-27 22:22     ` Andrea Arcangeli
2010-04-27 23:52     ` KAMEZAWA Hiroyuki
2010-04-27 23:52       ` KAMEZAWA Hiroyuki
2010-04-28  0:18       ` Andrea Arcangeli
2010-04-28  0:18         ` Andrea Arcangeli
2010-04-28  0:19         ` Andrea Arcangeli
2010-04-28  0:19           ` Andrea Arcangeli
2010-04-28  0:28           ` KAMEZAWA Hiroyuki
2010-04-28  0:28             ` KAMEZAWA Hiroyuki
2010-04-28  0:59             ` Andrea Arcangeli
2010-04-28  0:59               ` Andrea Arcangeli
2010-04-28  8:24       ` Mel Gorman
2010-04-28  8:24         ` Mel Gorman
2010-04-27 21:30 ` [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
2010-04-27 21:30   ` Mel Gorman
2010-04-27 23:10   ` Andrea Arcangeli
2010-04-27 23:10     ` Andrea Arcangeli
2010-04-28  9:15     ` Mel Gorman
2010-04-28  9:15       ` Mel Gorman
2010-04-28 15:35       ` Andrea Arcangeli
2010-04-28 15:35         ` Andrea Arcangeli
2010-04-28 15:39         ` Andrea Arcangeli
2010-04-28 15:39           ` Andrea Arcangeli
2010-04-28 15:55         ` Mel Gorman
2010-04-28 15:55           ` Mel Gorman
2010-04-28 16:23           ` Andrea Arcangeli
2010-04-28 16:23             ` Andrea Arcangeli
2010-04-28 17:34             ` Mel Gorman
2010-04-28 17:34               ` Mel Gorman
2010-04-28 17:58               ` Andrea Arcangeli
2010-04-28 17:58                 ` Andrea Arcangeli
2010-04-28 17:47             ` [RFC PATCH] take all anon_vma locks in anon_vma_lock Rik van Riel
2010-04-28 17:47               ` Rik van Riel
2010-04-28 18:03               ` Andrea Arcangeli
2010-04-28 18:03                 ` Andrea Arcangeli
2010-04-28 18:09                 ` Rik van Riel
2010-04-28 18:09                   ` Rik van Riel
2010-04-28 18:25               ` [RFC PATCH -v2] " Rik van Riel
2010-04-28 18:25                 ` Rik van Riel
2010-04-28 19:07                 ` Mel Gorman
2010-04-28 19:07                   ` Mel Gorman
2010-04-28 20:17                 ` [RFC PATCH -v3] " Rik van Riel
2010-04-28 20:17                   ` Rik van Riel
2010-04-28 20:57                   ` Rik van Riel
2010-04-28 20:57                     ` Rik van Riel
2010-04-29  0:28                     ` Minchan Kim
2010-04-29  0:28                       ` Minchan Kim
2010-04-29  2:10                       ` Rik van Riel
2010-04-29  2:10                         ` Rik van Riel
2010-04-29  2:55                         ` Minchan Kim
2010-04-29  2:55                           ` Minchan Kim
2010-04-29  6:42                           ` Minchan Kim
2010-04-29  6:42                             ` Minchan Kim
2010-04-29 15:39                           ` Rik van Riel
2010-04-29 15:39                             ` Rik van Riel
2010-04-29  7:37                       ` Mel Gorman
2010-04-29  7:37                         ` Mel Gorman
2010-04-29  8:15                     ` Mel Gorman
2010-04-29  8:15                       ` Mel Gorman
2010-04-29  8:32                       ` Minchan Kim
2010-04-29  8:32                         ` Minchan Kim
2010-04-29  8:44                         ` Mel Gorman
2010-04-29  8:44                           ` Mel Gorman
2010-04-27 21:30 ` [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved Mel Gorman
2010-04-27 21:30   ` Mel Gorman
2010-04-27 22:30   ` Andrea Arcangeli
2010-04-27 22:30     ` Andrea Arcangeli
2010-04-27 22:58     ` Andrea Arcangeli
2010-04-27 22:58       ` Andrea Arcangeli
2010-04-28  0:39       ` KAMEZAWA Hiroyuki
2010-04-28  0:39         ` KAMEZAWA Hiroyuki
2010-04-28  1:05         ` Andrea Arcangeli
2010-04-28  1:05           ` Andrea Arcangeli
2010-04-28  1:09           ` Andrea Arcangeli
2010-04-28  1:09             ` Andrea Arcangeli
2010-04-28  1:18           ` KAMEZAWA Hiroyuki
2010-04-28  1:18             ` KAMEZAWA Hiroyuki
2010-04-28  1:36             ` Andrea Arcangeli
2010-04-28  1:36               ` Andrea Arcangeli
2010-04-28  1:29       ` KAMEZAWA Hiroyuki
2010-04-28  1:29         ` KAMEZAWA Hiroyuki
2010-04-28  1:44         ` Andrea Arcangeli
2010-04-28  1:44           ` Andrea Arcangeli
2010-04-28  2:12           ` KAMEZAWA Hiroyuki
2010-04-28  2:12             ` KAMEZAWA Hiroyuki
2010-04-28  2:42             ` Andrea Arcangeli
2010-04-28  2:42               ` Andrea Arcangeli
2010-04-28  2:49               ` KAMEZAWA Hiroyuki
2010-04-28  2:49                 ` KAMEZAWA Hiroyuki
2010-04-28  7:28                 ` KAMEZAWA Hiroyuki
2010-04-28  7:28                   ` KAMEZAWA Hiroyuki
2010-04-28 10:48                   ` Mel Gorman
2010-04-28 10:48                     ` Mel Gorman
2010-04-28  0:03   ` KAMEZAWA Hiroyuki
2010-04-28  0:03     ` KAMEZAWA Hiroyuki
2010-04-28  0:08     ` Andrea Arcangeli
2010-04-28  0:08       ` Andrea Arcangeli
2010-04-28  0:36       ` KAMEZAWA Hiroyuki
2010-04-28  0:36         ` KAMEZAWA Hiroyuki
2010-04-28  8:30   ` KAMEZAWA Hiroyuki
2010-04-28  8:30     ` KAMEZAWA Hiroyuki
2010-04-28 14:46     ` Andrea Arcangeli
2010-04-28 14:46       ` Andrea Arcangeli
2010-04-27 22:27 ` [PATCH 0/3] Fix migration races in rmap_walk() V2 Christoph Lameter
2010-04-27 22:27   ` Christoph Lameter
2010-04-27 22:32   ` Andrea Arcangeli
2010-04-27 22:32     ` Andrea Arcangeli
2010-04-28  0:13     ` KAMEZAWA Hiroyuki
2010-04-28  0:13       ` KAMEZAWA Hiroyuki
2010-04-28  0:20       ` Andrea Arcangeli
2010-04-28  0:20         ` Andrea Arcangeli
2010-04-28 14:23         ` Mel Gorman
2010-04-28 14:23           ` Mel Gorman
2010-04-28 14:57           ` Mel Gorman
2010-04-28 14:57             ` Mel Gorman
2010-04-28 15:16             ` Andrea Arcangeli
2010-04-28 15:16               ` Andrea Arcangeli
2010-04-28 15:23               ` Mel Gorman
2010-04-28 15:23                 ` Mel Gorman
2010-04-28 15:45                 ` Andrea Arcangeli
2010-04-28 15:45                   ` Andrea Arcangeli
2010-04-28 20:40                   ` Andrea Arcangeli
2010-04-28 20:40                     ` Andrea Arcangeli
2010-04-28 21:05                     ` Andrea Arcangeli
2010-04-28 21:05                       ` Andrea Arcangeli
2010-04-28  9:17     ` Mel Gorman
2010-04-28  9:17       ` Mel Gorman
