* [PATCH 0/2] Fix migration races in rmap_walk()
From: Mel Gorman @ 2010-04-26 22:37 UTC
To: Linux-MM, LKML
Cc: Minchan Kim, KAMEZAWA Hiroyuki, Mel Gorman, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Andrew Morton
After digging around a lot, I believe the following two patches are the
best way to close the race that allows a migration PTE to be left behind
triggering a BUG check in migration_entry_to_page().
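For reference, the check being tripped is the PageLocked() assertion in
migration_entry_to_page(). Roughly, from include/linux/swapops.h of this
era (a sketch, not a verbatim quote):

        static inline struct page *migration_entry_to_page(swp_entry_t entry)
        {
                struct page *p = pfn_to_page(swp_offset(entry));
                /*
                 * Any use of migration entries may only occur while the
                 * corresponding page is locked
                 */
                BUG_ON(!PageLocked(p));
                return p;
        }

A migration PTE that survives after migration completes and the page is
unlocked will trigger this the next time the entry is dereferenced.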
Patch one has fork() wait for migration to complete. Patch two has
vma_adjust() acquire the anon_vma lock it is aware of and makes rmap_walk()
aware that different VMAs can be encountered during the walk.
I dropped the use of the seq counter because races remained. For example,
while the seq counter would catch when vma_adjust() and rmap_walk() were
looking at the same VMA, there was still insufficient protection for the
VMA list being modified.
The reproduction case was as follows:
1. Run kernel compilation in a loop
2. Start two processes that repeatedly fork()ed and manipulated mappings
3. Constantly compact memory using /proc/sys/vm/compact_memory (a trigger
   loop is sketched below)
4. Optionally add/remove swap
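For step 3, something like the following minimal sketch is enough (it
assumes a kernel with compaction support and needs root to write the
trigger):

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                /* Each write of "1" triggers a compaction pass. */
                for (;;) {
                        int fd = open("/proc/sys/vm/compact_memory", O_WRONLY);
                        if (fd < 0) {
                                perror("open");
                                return 1;
                        }
                        if (write(fd, "1", 1) < 0)
                                perror("write");
                        close(fd);
                }
        }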
With these two patches applied, I was unable to trigger the bug check
in migration_entry_to_page() but it would be really helpful if Rik could
comment on the anon_vma locking requirements and whether patch 2 is 100%
safe or not. The tests have only been running 8 hours but I'm posting now
anyway and will see how it survives running for a few days.
The other issues raised about expand_downwards will need to be re-examined
to see if they still exist, and transparent hugepage support will need
further thought on whether split_huge_page() can deal with these situations.
mm/ksm.c | 13 +++++++++++++
mm/memory.c | 25 ++++++++++++++++---------
mm/mmap.c | 6 ++++++
mm/rmap.c | 23 ++++++++++++++++++++---
4 files changed, 55 insertions(+), 12 deletions(-)
* [PATCH 1/2] mm,migration: During fork(), wait for migration to end if migration PTE is encountered
From: Mel Gorman @ 2010-04-26 22:37 UTC
To: Linux-MM, LKML
Cc: Minchan Kim, KAMEZAWA Hiroyuki, Mel Gorman, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Andrew Morton
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
At page migration, we replace a pte with a migration_entry, which has a
similar format to a swap_entry, and replace it with the real pfn at the
end of migration. But there is a race with fork()'s copy_page_range().
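For reference, a migration entry packs the page's pfn into the swap-entry
format; roughly, from include/linux/swapops.h of this era (a sketch):

        static inline swp_entry_t make_migration_entry(struct page *page,
                                                        int write)
        {
                BUG_ON(!PageLocked(page));
                return swp_entry(write ? SWP_MIGRATION_WRITE :
                                        SWP_MIGRATION_READ,
                                page_to_pfn(page));
        }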
Assume page migration on CPU A and fork() on CPU B. On CPU A, a page of
a process is under migration. On CPU B, a page's pte is under copy.
        CPU A                           CPU B
                                        do_fork()
                                        copy_mm() (from process 1 to process 2)
                                        insert new vma to mmap_list (if inode/anon_vma)
        pte_lock(process1)
        unmap a page
        insert migration_entry
        pte_unlock(process1)

        migrate page copy
                                        copy_page_range
        remap new page by rmap_walk()
        pte_lock(process2)
        found no pte.
        pte_unlock(process2)
                                        pte_lock(process2)
                                        pte_lock(process1)
                                        copy migration entry to process2
                                        pte_unlock(process1)
                                        pte_unlock(process2)
        pte_lock(process1)
        replace migration entry
        to new page's pte.
        pte_unlock(process1)
Thus, some serialization is necessary. IIUC, this is a very rare event but
it is reproducible if a lot of migration is happening with the following
program running in parallel.
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>

    #define SIZE (24*1048576UL)
    #define CHILDREN 100
    int main(void)
    {
            int i = 0, j;
            pid_t pids[CHILDREN];
            char *buf = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                            MAP_PRIVATE|MAP_ANONYMOUS,
                            -1, 0);
            if (buf == MAP_FAILED) {
                    perror("mmap");
                    exit(-1);
            }

            /* Mark all slots unused so the first wait pass skips them */
            for (j = 0; j < CHILDREN; j++)
                    pids[j] = -1;

            while (++i) {
                    j = i % CHILDREN;

                    if (j == 0) {
                            printf("Waiting on children\n");
                            for (j = 0; j < CHILDREN; j++) {
                                    memset(buf, i, SIZE);
                                    if (pids[j] != -1)
                                            waitpid(pids[j], NULL, 0);
                            }
                            j = 0;
                    }

                    if ((pids[j] = fork()) == 0) {
                            memset(buf, i, SIZE);
                            exit(EXIT_SUCCESS);
                    }
            }

            munmap(buf, SIZE);
            return 0;
    }
copy_page_range() can wait for the end of migration: on finding a migration
entry, it breaks out with progress = 0, drops both pte locks, reschedules,
and retries the range until the entry is gone.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/memory.c | 24 +++++++++++++++---------
1 files changed, 15 insertions(+), 9 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 833952d..36dadd4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -675,15 +675,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                         }
                         if (likely(!non_swap_entry(entry)))
                                 rss[MM_SWAPENTS]++;
-                        else if (is_write_migration_entry(entry) &&
-                                        is_cow_mapping(vm_flags)) {
-                                /*
-                                 * COW mappings require pages in both parent
-                                 * and child to be set to read.
-                                 */
-                                make_migration_entry_read(&entry);
-                                pte = swp_entry_to_pte(entry);
-                                set_pte_at(src_mm, addr, src_pte, pte);
+                        else {
+                                BUG();
                         }
                 }
                 goto out_set_pte;
@@ -760,6 +753,19 @@ again:
                         progress++;
                         continue;
                 }
+                if (unlikely(!pte_present(*src_pte) && !pte_file(*src_pte))) {
+                        entry = pte_to_swp_entry(*src_pte);
+                        if (is_migration_entry(entry)) {
+                                /*
+                                 * Because copying the pte races with pte
+                                 * rewriting of migration, release the lock
+                                 * and retry.
+                                 */
+                                progress = 0;
+                                entry.val = 0;
+                                break;
+                        }
+                }
                 entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
                                                 vma, addr, rss);
                 if (entry.val)
--
1.6.5
* [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
From: Mel Gorman @ 2010-04-26 22:37 UTC
To: Linux-MM, LKML
Cc: Minchan Kim, KAMEZAWA Hiroyuki, Mel Gorman, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Andrew Morton
vma_adjust() updates anon VMA information without any locks taken; in
contrast, file-backed mappings are protected by i_mmap_lock. This lack of
locking can result in races with page migration: during rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid,
leaving a dangling migration PTE behind which can later cause a BUG_ON
to trigger when the page is faulted in.
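For reference, vma_address() computes the page's virtual address from
vm_start and vm_pgoff, both of which vma_adjust() rewrites; roughly, from
mm/rmap.c of this era (a sketch):

        static inline unsigned long
        vma_address(struct page *page, struct vm_area_struct *vma)
        {
                pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
                unsigned long address;

                address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
                if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
                        /* page should be within @vma mapping range */
                        return -EFAULT;
                }
                return address;
        }

If the walker reads vm_start/vm_pgoff mid-update, the range check fails
and the page is silently skipped.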
With the recent anon_vma changes, there can be more than one anon_vma->lock
to take in an anon_vma_chain, but a second lock cannot be spun upon without
risking deadlock. Instead, the rmap walker trylocks the locks of the other
anon_vmas; if an attempt fails, the locks are dropped and the operation is
restarted.
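Distilled, the locking pattern is "lock A, trylock B, and on contention
drop everything and restart". A minimal userspace sketch of the idea, with
pthread spinlocks standing in for anon_vma->lock (the names here are
illustrative, not kernel API):

        #include <pthread.h>

        static pthread_spinlock_t lock_a, lock_b;

        static void walk_both(void)
        {
        retry:
                pthread_spin_lock(&lock_a);
                if (pthread_spin_trylock(&lock_b)) {
                        /*
                         * Contended: drop what we hold and start over
                         * rather than spin, which could deadlock against
                         * a thread holding lock_b and waiting for lock_a.
                         */
                        pthread_spin_unlock(&lock_a);
                        goto retry;
                }
                /* ... walk the list under both locks here ... */
                pthread_spin_unlock(&lock_b);
                pthread_spin_unlock(&lock_a);
        }

        int main(void)
        {
                pthread_spin_init(&lock_a, PTHREAD_PROCESS_PRIVATE);
                pthread_spin_init(&lock_b, PTHREAD_PROCESS_PRIVATE);
                walk_both();
                return 0;
        }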
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/ksm.c | 13 +++++++++++++
mm/mmap.c | 6 ++++++
mm/rmap.c | 22 +++++++++++++++++++---
3 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..baa5b4d 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1674,9 +1674,22 @@ again:
                 spin_lock(&anon_vma->lock);
                 list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
                         vma = vmac->vma;
+
+                        /* See comment in mm/rmap.c#rmap_walk_anon on locking */
+                        if (anon_vma != vma->anon_vma) {
+                                if (!spin_trylock(&vma->anon_vma->lock)) {
+                                        spin_unlock(&anon_vma->lock);
+                                        goto again;
+                                }
+                        }
+
                         if (rmap_item->address < vma->vm_start ||
                             rmap_item->address >= vma->vm_end)
                                 continue;
+
+                        if (anon_vma != vma->anon_vma)
+                                spin_unlock(&vma->anon_vma->lock);
+
                         /*
                          * Initially we examine only the vma which covers this
                          * rmap_item; but later, if there is still work to do,
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61d6f1d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,9 @@ again:          remove_next = 1 + (end > next->vm_end);
                 }
         }
 
+        if (vma->anon_vma)
+                spin_lock(&vma->anon_vma->lock);
+
         if (root) {
                 flush_dcache_mmap_lock(mapping);
                 vma_prio_tree_remove(vma, root);
@@ -620,6 +623,9 @@ again:          remove_next = 1 + (end > next->vm_end);
         if (mapping)
                 spin_unlock(&mapping->i_mmap_lock);
 
+        if (vma->anon_vma)
+                spin_unlock(&vma->anon_vma->lock);
+
         if (remove_next) {
                 if (file) {
                         fput(file);
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..bc313a6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1368,15 +1368,31 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
          * are holding mmap_sem. Users without mmap_sem are required to
          * take a reference count to prevent the anon_vma disappearing
          */
+retry:
         anon_vma = page_anon_vma(page);
         if (!anon_vma)
                 return ret;
         spin_lock(&anon_vma->lock);
         list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
                 struct vm_area_struct *vma = avc->vma;
-                unsigned long address = vma_address(page, vma);
-                if (address == -EFAULT)
-                        continue;
+                unsigned long address;
+
+                /*
+                 * Guard against deadlocks by not spinning against
+                 * vma->anon_vma->lock. If contention is found, release our
+                 * lock and try again until VMA list can be traversed without
+                 * contention.
+                 */
+                if (anon_vma != vma->anon_vma) {
+                        if (!spin_trylock(&vma->anon_vma->lock)) {
+                                spin_unlock(&anon_vma->lock);
+                                goto retry;
+                        }
+                }
+                address = vma_address(page, vma);
+                if (anon_vma != vma->anon_vma)
+                        spin_unlock(&vma->anon_vma->lock);
+
                 ret = rmap_one(page, vma, address, arg);
                 if (ret != SWAP_AGAIN)
                         break;
--
1.6.5
* Re: [PATCH 0/2] Fix migration races in rmap_walk()
From: Andrea Arcangeli @ 2010-04-26 23:04 UTC
To: Mel Gorman
Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Rik van Riel, Andrew Morton
On Mon, Apr 26, 2010 at 11:37:56PM +0100, Mel Gorman wrote:
> The other issues raised about expand_downwards will need to be re-examined to
> see if they still exist and transparent hugepage support will need further
> thinking to see if split_huge_page() can deal with these situations.
So patch 1 is for aa.git too, and patch 2 is only for mainline with
the new anon-vma changes (patch 2 not needed in current aa.git, and if
I apply it, it'll deadlock so...) right?
split_huge_page is somewhat simpler and more strict in its checking
than migrate.c in this respect, and yes patch 2 will also need to be
extended to cover split_huge_page the moment I stop backing out the
new anon-vma code (but it won't be any different, whatever works for
migrate will also work for split_huge_page later).
For now I'm much more interested in patch 1. I'll leave patch 2 to
mainline digestion and check it later; I hope to find all issues fixed by
the time transparent hugepage gets merged.
Patch 1 is very interesting: I looked at the race between fork and migrate
yesterday and didn't see issues, so I'm going to read your patch 1 in
detail now to understand the problem you're fixing.
Good you posted this fast, so I can try to help ;)
Andrea
* Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
From: Minchan Kim @ 2010-04-26 23:05 UTC
To: Mel Gorman
Cc: Linux-MM, LKML, KAMEZAWA Hiroyuki, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Andrew Morton
On Tue, Apr 27, 2010 at 7:37 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> vma_adjust() is updating anon VMA information without any locks taken.
> In contrast, file-backed mappings use the i_mmap_lock and this lack of
> locking can result in races with page migration. During rmap_walk(),
> vma_address() can return -EFAULT for an address that will soon be valid.
> This leaves a dangling migration PTE behind which can later cause a BUG_ON
> to trigger when the page is faulted in.
>
> With the recent anon_vma changes, there can be more than one anon_vma->lock
> that can be taken in a anon_vma_chain but a second lock cannot be spinned
> upon in case of deadlock. Instead, the rmap walker tries to take locks of
> different anon_vma's. If the attempt fails, the operation is restarted.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> mm/ksm.c | 13 +++++++++++++
> mm/mmap.c | 6 ++++++
> mm/rmap.c | 22 +++++++++++++++++++---
> 3 files changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 3666d43..baa5b4d 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1674,9 +1674,22 @@ again:
> spin_lock(&anon_vma->lock);
> list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
> vma = vmac->vma;
> +
> + /* See comment in mm/rmap.c#rmap_walk_anon on locking */
> + if (anon_vma != vma->anon_vma) {
> + if (!spin_trylock(&vma->anon_vma->lock)) {
> + spin_unlock(&anon_vma->lock);
> + goto again;
> + }
> + }
> +
> if (rmap_item->address < vma->vm_start ||
> rmap_item->address >= vma->vm_end)
> continue;
> +
> + if (anon_vma != vma->anon_vma)
> + spin_unlock(&vma->anon_vma->lock);
> +
> /*
> * Initially we examine only the vma which covers this
> * rmap_item; but later, if there is still work to do,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f90ea92..61d6f1d 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -578,6 +578,9 @@ again: remove_next = 1 + (end > next->vm_end);
> }
> }
>
> + if (vma->anon_vma)
> + spin_lock(&vma->anon_vma->lock);
> +
> if (root) {
> flush_dcache_mmap_lock(mapping);
> vma_prio_tree_remove(vma, root);
> @@ -620,6 +623,9 @@ again: remove_next = 1 + (end > next->vm_end);
> if (mapping)
> spin_unlock(&mapping->i_mmap_lock);
>
> + if (vma->anon_vma)
> + spin_unlock(&vma->anon_vma->lock);
> +
> if (remove_next) {
> if (file) {
> fput(file);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 85f203e..bc313a6 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1368,15 +1368,31 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
> * are holding mmap_sem. Users without mmap_sem are required to
> * take a reference count to prevent the anon_vma disappearing
> */
> +retry:
> anon_vma = page_anon_vma(page);
> if (!anon_vma)
> return ret;
> spin_lock(&anon_vma->lock);
> list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> struct vm_area_struct *vma = avc->vma;
> - unsigned long address = vma_address(page, vma);
> - if (address == -EFAULT)
> - continue;
> + unsigned long address;
> +
> + /*
> + * Guard against deadlocks by not spinning against
> + * vma->anon_vma->lock. If contention is found, release our
> + * lock and try again until VMA list can be traversed without
> + * contention.
> + */
> + if (anon_vma != vma->anon_vma) {
> + if (!spin_trylock(&vma->anon_vma->lock)) {
> + spin_unlock(&anon_vma->lock);
> + goto retry;
> + }
> + }
> + address = vma_address(page, vma);
> + if (anon_vma != vma->anon_vma)
> + spin_unlock(&vma->anon_vma->lock);
> +
if (address == -EFAULT)
continue;
> ret = rmap_one(page, vma, address, arg);
> if (ret != SWAP_AGAIN)
> break;
> --
> 1.6.5
>
>
--
Kind regards,
Minchan Kim
* Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
From: Minchan Kim @ 2010-04-26 23:15 UTC
To: Mel Gorman
Cc: Linux-MM, LKML, KAMEZAWA Hiroyuki, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Andrew Morton
On Tue, Apr 27, 2010 at 7:37 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> vma_adjust() is updating anon VMA information without any locks taken.
> In contrast, file-backed mappings use the i_mmap_lock and this lack of
> locking can result in races with page migration. During rmap_walk(),
> vma_address() can return -EFAULT for an address that will soon be valid.
> This leaves a dangling migration PTE behind which can later cause a BUG_ON
> to trigger when the page is faulted in.
>
> With the recent anon_vma changes, there can be more than one anon_vma->lock
> that can be taken in a anon_vma_chain but a second lock cannot be spinned
> upon in case of deadlock. Instead, the rmap walker tries to take locks of
> different anon_vma's. If the attempt fails, the operation is restarted.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Actually, I worry about a rollback approach like this. If we rarely need
to serialize anon_vmas, it's fine. But I am not sure how often we will
need anon_vma locking like this; whenever we do, we will have to use a
rollback approach like this again in the future. In my opinion, that's
not good.
Rik, can't we make the anon_vma locks simpler?
Anyway, Mel's patch is the best option for now.
I hope we can improve anon_vma locking without a rollback approach in the
near future.
Thanks, Mel.
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] mm,migration: During fork(), wait for migration to end if migration PTE is encountered
From: Minchan Kim @ 2010-04-26 23:28 UTC
To: Mel Gorman
Cc: Linux-MM, LKML, KAMEZAWA Hiroyuki, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Andrew Morton
On Tue, Apr 27, 2010 at 7:37 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> At page migration, we replace pte with migration_entry, which has
> similar format as swap_entry and replace it with real pfn at the
> end of migration. But there is a race with fork()'s copy_page_range().
>
> Assume page migraion on CPU A and fork in CPU B. On CPU A, a page of
> a process is under migration. On CPU B, a page's pte is under copy.
>
> CPUA CPU B
> do_fork()
> copy_mm() (from process 1 to process2)
> insert new vma to mmap_list (if inode/anon_vma)
> pte_lock(process1)
> unmap a page
> insert migration_entry
> pte_unlock(process1)
>
> migrate page copy
> copy_page_range
> remap new page by rmap_walk()
> pte_lock(process2)
> found no pte.
> pte_unlock(process2)
> pte lock(process2)
> pte lock(process1)
> copy migration entry to process2
> pte unlock(process1)
> pte unlokc(process2)
> pte_lock(process1)
> replace migration entry
> to new page's pte.
> pte_unlock(process1)
>
> Then, some serialization is necessary. IIUC, this is very rare event but
> it is reproducible if a lot of migration is happening a lot with the
> following program running in parallel.
>
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> #include <sys/mman.h>
>
> #define SIZE (24*1048576UL)
> #define CHILDREN 100
> int main()
> {
> int i = 0;
> pid_t pids[CHILDREN];
> char *buf = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS,
> 0, 0);
> if (buf == MAP_FAILED) {
> perror("mmap");
> exit(-1);
> }
>
> while (++i) {
> int j = i % CHILDREN;
>
> if (j == 0) {
> printf("Waiting on children\n");
> for (j = 0; j < CHILDREN; j++) {
> memset(buf, i, SIZE);
> if (pids[j] != -1)
> waitpid(pids[j], NULL, 0);
> }
> j = 0;
> }
>
> if ((pids[j] = fork()) == 0) {
> memset(buf, i, SIZE);
> exit(EXIT_SUCCESS);
> }
> }
>
> munmap(buf, SIZE);
> }
>
> copy_page_range() can wait for the end of migration.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
--
Kind regards,
Minchan Kim
* Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
From: KAMEZAWA Hiroyuki @ 2010-04-27 0:07 UTC
To: Mel Gorman
Cc: Linux-MM, LKML, Minchan Kim, Christoph Lameter, Andrea Arcangeli,
Rik van Riel, Andrew Morton
On Mon, 26 Apr 2010 23:37:58 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> vma_adjust() is updating anon VMA information without any locks taken.
> In contrast, file-backed mappings use the i_mmap_lock and this lack of
> locking can result in races with page migration. During rmap_walk(),
> vma_address() can return -EFAULT for an address that will soon be valid.
> This leaves a dangling migration PTE behind which can later cause a BUG_ON
> to trigger when the page is faulted in.
>
> With the recent anon_vma changes, there can be more than one anon_vma->lock
> that can be taken in a anon_vma_chain but a second lock cannot be spinned
> upon in case of deadlock. Instead, the rmap walker tries to take locks of
> different anon_vma's. If the attempt fails, the operation is restarted.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
OK, always acquiring vma->anon_vma->lock sounds very safe (but slow).
I'll test this, too.
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/ksm.c | 13 +++++++++++++
> mm/mmap.c | 6 ++++++
> mm/rmap.c | 22 +++++++++++++++++++---
> 3 files changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 3666d43..baa5b4d 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1674,9 +1674,22 @@ again:
> spin_lock(&anon_vma->lock);
> list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
> vma = vmac->vma;
> +
> + /* See comment in mm/rmap.c#rmap_walk_anon on locking */
> + if (anon_vma != vma->anon_vma) {
> + if (!spin_trylock(&vma->anon_vma->lock)) {
> + spin_unlock(&anon_vma->lock);
> + goto again;
> + }
> + }
> +
> if (rmap_item->address < vma->vm_start ||
> rmap_item->address >= vma->vm_end)
> continue;
> +
> + if (anon_vma != vma->anon_vma)
> + spin_unlock(&vma->anon_vma->lock);
> +
> /*
> * Initially we examine only the vma which covers this
> * rmap_item; but later, if there is still work to do,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f90ea92..61d6f1d 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -578,6 +578,9 @@ again: remove_next = 1 + (end > next->vm_end);
> }
> }
>
> + if (vma->anon_vma)
> + spin_lock(&vma->anon_vma->lock);
> +
> if (root) {
> flush_dcache_mmap_lock(mapping);
> vma_prio_tree_remove(vma, root);
> @@ -620,6 +623,9 @@ again: remove_next = 1 + (end > next->vm_end);
> if (mapping)
> spin_unlock(&mapping->i_mmap_lock);
>
> + if (vma->anon_vma)
> + spin_unlock(&vma->anon_vma->lock);
> +
> if (remove_next) {
> if (file) {
> fput(file);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 85f203e..bc313a6 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1368,15 +1368,31 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
> * are holding mmap_sem. Users without mmap_sem are required to
> * take a reference count to prevent the anon_vma disappearing
> */
> +retry:
> anon_vma = page_anon_vma(page);
> if (!anon_vma)
> return ret;
> spin_lock(&anon_vma->lock);
> list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> struct vm_area_struct *vma = avc->vma;
> - unsigned long address = vma_address(page, vma);
> - if (address == -EFAULT)
> - continue;
> + unsigned long address;
> +
> + /*
> + * Guard against deadlocks by not spinning against
> + * vma->anon_vma->lock. If contention is found, release our
> + * lock and try again until VMA list can be traversed without
> + * contention.
> + */
> + if (anon_vma != vma->anon_vma) {
> + if (!spin_trylock(&vma->anon_vma->lock)) {
> + spin_unlock(&anon_vma->lock);
> + goto retry;
> + }
> + }
> + address = vma_address(page, vma);
> + if (anon_vma != vma->anon_vma)
> + spin_unlock(&vma->anon_vma->lock);
> +
> ret = rmap_one(page, vma, address, arg);
> if (ret != SWAP_AGAIN)
> break;
> --
> 1.6.5
>
>
* Re: [PATCH 1/2] mm,migration: During fork(), wait for migration to end if migration PTE is encountered
From: Rik van Riel @ 2010-04-27 0:08 UTC
To: Mel Gorman
Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Andrew Morton
On 04/26/2010 06:37 PM, Mel Gorman wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> At page migration, we replace pte with migration_entry, which has
> similar format as swap_entry and replace it with real pfn at the
> end of migration. But there is a race with fork()'s copy_page_range().
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
* Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
From: KAMEZAWA Hiroyuki @ 2010-04-27 0:30 UTC
To: Minchan Kim
Cc: Mel Gorman, Linux-MM, LKML, Christoph Lameter, Andrea Arcangeli,
Rik van Riel, Andrew Morton
On Tue, 27 Apr 2010 08:05:26 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:
> On Tue, Apr 27, 2010 at 7:37 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > vma_adjust() is updating anon VMA information without any locks taken.
> > In contrast, file-backed mappings use the i_mmap_lock and this lack of
> > locking can result in races with page migration. During rmap_walk(),
> > vma_address() can return -EFAULT for an address that will soon be valid.
> > This leaves a dangling migration PTE behind which can later cause a BUG_ON
> > to trigger when the page is faulted in.
> >
> > With the recent anon_vma changes, there can be more than one anon_vma->lock
> > that can be taken in a anon_vma_chain but a second lock cannot be spinned
> > upon in case of deadlock. Instead, the rmap walker tries to take locks of
> > different anon_vma's. If the attempt fails, the operation is restarted.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > mm/ksm.c | 13 +++++++++++++
> > mm/mmap.c | 6 ++++++
> > mm/rmap.c | 22 +++++++++++++++++++---
> > 3 files changed, 38 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 3666d43..baa5b4d 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -1674,9 +1674,22 @@ again:
> > spin_lock(&anon_vma->lock);
> > list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
> > vma = vmac->vma;
> > +
> > + /* See comment in mm/rmap.c#rmap_walk_anon on locking */
> > + if (anon_vma != vma->anon_vma) {
> > + if (!spin_trylock(&vma->anon_vma->lock)) {
> > + spin_unlock(&anon_vma->lock);
> > + goto again;
> > + }
> > + }
> > +
> > if (rmap_item->address < vma->vm_start ||
> > rmap_item->address >= vma->vm_end)
> > continue;
> > +
> > + if (anon_vma != vma->anon_vma)
> > + spin_unlock(&vma->anon_vma->lock);
> > +
> > /*
> > * Initially we examine only the vma which covers this
> > * rmap_item; but later, if there is still work to do,
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index f90ea92..61d6f1d 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -578,6 +578,9 @@ again: remove_next = 1 + (end > next->vm_end);
> > }
> > }
> >
> > + if (vma->anon_vma)
> > + spin_lock(&vma->anon_vma->lock);
> > +
> > if (root) {
> > flush_dcache_mmap_lock(mapping);
> > vma_prio_tree_remove(vma, root);
> > @@ -620,6 +623,9 @@ again: remove_next = 1 + (end > next->vm_end);
> > if (mapping)
> > spin_unlock(&mapping->i_mmap_lock);
> >
> > + if (vma->anon_vma)
> > + spin_unlock(&vma->anon_vma->lock);
> > +
> > if (remove_next) {
> > if (file) {
> > fput(file);
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 85f203e..bc313a6 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1368,15 +1368,31 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
> > * are holding mmap_sem. Users without mmap_sem are required to
> > * take a reference count to prevent the anon_vma disappearing
> > */
> > +retry:
> > anon_vma = page_anon_vma(page);
> > if (!anon_vma)
> > return ret;
> > spin_lock(&anon_vma->lock);
> > list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> > struct vm_area_struct *vma = avc->vma;
> > - unsigned long address = vma_address(page, vma);
> > - if (address == -EFAULT)
> > - continue;
> > + unsigned long address;
> > +
> > + /*
> > + * Guard against deadlocks by not spinning against
> > + * vma->anon_vma->lock. If contention is found, release our
> > + * lock and try again until VMA list can be traversed without
> > + * contention.
> > + */
> > + if (anon_vma != vma->anon_vma) {
> > + if (!spin_trylock(&vma->anon_vma->lock)) {
> > + spin_unlock(&anon_vma->lock);
> > + goto retry;
> > + }
> > + }
> > + address = vma_address(page, vma);
> > + if (anon_vma != vma->anon_vma)
> > + spin_unlock(&vma->anon_vma->lock);
> > +
>
> if (address == -EFAULT)
> continue;
>
Yes, thank you for pointing that out.
-Kame
* Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-04-26 22:37 ` Mel Gorman
@ 2010-04-27 0:30 ` Rik van Riel
0 siblings, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2010-04-27 0:30 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Andrew Morton
On 04/26/2010 06:37 PM, Mel Gorman wrote:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 85f203e..bc313a6 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1368,15 +1368,31 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
> * are holding mmap_sem. Users without mmap_sem are required to
> * take a reference count to prevent the anon_vma disappearing
> */
> +retry:
> anon_vma = page_anon_vma(page);
> if (!anon_vma)
> return ret;
> spin_lock(&anon_vma->lock);
> list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> struct vm_area_struct *vma = avc->vma;
> - unsigned long address = vma_address(page, vma);
> - if (address == -EFAULT)
> - continue;
> + unsigned long address;
> +
> + /*
> + * Guard against deadlocks by not spinning against
> + * vma->anon_vma->lock. If contention is found, release our
> + * lock and try again until the VMA list can be traversed without
> + * contention.
> + */
> + if (anon_vma != vma->anon_vma) {
> + if (!spin_trylock(&vma->anon_vma->lock)) {
> + spin_unlock(&anon_vma->lock);
> + goto retry;
> + }
> + }
If you're part way down the list, surely you'll need to
unlock multiple anon_vmas here before going to retry?
Otherwise you could forget to unlock one that you already
locked, and live-lock here.
--
All rights reversed
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-04-27 0:30 ` Rik van Riel
@ 2010-04-27 0:31 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-27 0:31 UTC (permalink / raw)
To: Rik van Riel
Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
Andrea Arcangeli, Andrew Morton
On Mon, 26 Apr 2010 20:30:41 -0400
Rik van Riel <riel@redhat.com> wrote:
> On 04/26/2010 06:37 PM, Mel Gorman wrote:
>
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 85f203e..bc313a6 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1368,15 +1368,31 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
> > * are holding mmap_sem. Users without mmap_sem are required to
> > * take a reference count to prevent the anon_vma disappearing
> > */
> > +retry:
> > anon_vma = page_anon_vma(page);
> > if (!anon_vma)
> > return ret;
> > spin_lock(&anon_vma->lock);
> > list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> > struct vm_area_struct *vma = avc->vma;
> > - unsigned long address = vma_address(page, vma);
> > - if (address == -EFAULT)
> > - continue;
> > + unsigned long address;
> > +
> > + /*
> > + * Guard against deadlocks by not spinning against
> > + * vma->anon_vma->lock. If contention is found, release our
> > + * lock and try again until the VMA list can be traversed without
> > + * contention.
> > + */
> > + if (anon_vma != vma->anon_vma) {
> > + if (!spin_trylock(&vma->anon_vma->lock)) {
> > + spin_unlock(&anon_vma->lock);
> > + goto retry;
> > + }
> > + }
>
> If you're part way down the list, surely you'll need to
> unlock multiple anon_vmas here before going to retry?
>
vma->anon_vma->lock is released after vma_address().
-Kame
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-04-27 0:31 ` KAMEZAWA Hiroyuki
@ 2010-04-27 2:13 ` Rik van Riel
0 siblings, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2010-04-27 2:13 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
Andrea Arcangeli, Andrew Morton
On 04/26/2010 08:31 PM, KAMEZAWA Hiroyuki wrote:
>> If you're part way down the list, surely you'll need to
>> unlock multiple anon_vmas here before going to retry?
>>
> vma->anon_vma->lock is released after vma_address().
Doh, never mind. Too much code in my mind at once...
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-04-27 0:07 ` KAMEZAWA Hiroyuki
@ 2010-04-27 3:50 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-27 3:50 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Andrew Morton
On Tue, 27 Apr 2010 09:07:06 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 26 Apr 2010 23:37:58 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > vma_adjust() is updating anon VMA information without any locks taken.
> > In contrast, file-backed mappings use the i_mmap_lock and this lack of
> > locking can result in races with page migration. During rmap_walk(),
> > vma_address() can return -EFAULT for an address that will soon be valid.
> > This leaves a dangling migration PTE behind which can later cause a BUG_ON
> > to trigger when the page is faulted in.
> >
> > With the recent anon_vma changes, there can be more than one anon_vma->lock
> > that can be taken in an anon_vma_chain, but a second lock cannot be spun
> > on, as that could deadlock. Instead, the rmap walker tries to take locks of
> > different anon_vma's. If the attempt fails, the operation is restarted.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>
> Ok, acquiring vma->anon_vma->spin_lock always sounds very safe.
> (but slow.)
>
> I'll test this, too.
>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
Sorry, it reproduced. It seems to be the same bug as before the patch:
mapcount 1 -> unmap -> remap -> mapcount 0. And it was SwapCache.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-04-27 3:50 ` KAMEZAWA Hiroyuki
@ 2010-04-27 4:03 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-27 4:03 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Mel Gorman, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Andrew Morton
On Tue, 27 Apr 2010 12:50:40 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 27 Apr 2010 09:07:06 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Mon, 26 Apr 2010 23:37:58 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:
> >
> > > vma_adjust() is updating anon VMA information without any locks taken.
> > > In contrast, file-backed mappings use the i_mmap_lock and this lack of
> > > locking can result in races with page migration. During rmap_walk(),
> > > vma_address() can return -EFAULT for an address that will soon be valid.
> > > This leaves a dangling migration PTE behind which can later cause a BUG_ON
> > > to trigger when the page is faulted in.
> > >
> > > With the recent anon_vma changes, there can be more than one anon_vma->lock
> > > that can be taken in an anon_vma_chain, but a second lock cannot be spun
> > > on, as that could deadlock. Instead, the rmap walker tries to take locks of
> > > different anon_vma's. If the attempt fails, the operation is restarted.
> > >
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >
> > Ok, acquiring vma->anon_vma->spin_lock always sounds very safe.
> > (but slow.)
> >
> > I'll test this, too.
> >
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
>
> Sorry, it reproduced. It seems to be the same bug as before the patch:
> mapcount 1 -> unmap -> remap -> mapcount 0. And it was SwapCache.
>
Sorry again. It was _not_ SwapCache. Just SwapBacked.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-04-27 3:50 ` KAMEZAWA Hiroyuki
@ 2010-04-27 8:59 ` Mel Gorman
0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2010-04-27 8:59 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Linux-MM, LKML, Minchan Kim, Christoph Lameter, Andrea Arcangeli,
Rik van Riel, Andrew Morton
On Tue, Apr 27, 2010 at 12:50:40PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 27 Apr 2010 09:07:06 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Mon, 26 Apr 2010 23:37:58 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:
> >
> > > vma_adjust() is updating anon VMA information without any locks taken.
> > > In contrast, file-backed mappings use the i_mmap_lock and this lack of
> > > locking can result in races with page migration. During rmap_walk(),
> > > vma_address() can return -EFAULT for an address that will soon be valid.
> > > This leaves a dangling migration PTE behind which can later cause a BUG_ON
> > > to trigger when the page is faulted in.
> > >
> > > With the recent anon_vma changes, there can be more than one anon_vma->lock
> > > that can be taken in an anon_vma_chain, but a second lock cannot be spun
> > > on, as that could deadlock. Instead, the rmap walker tries to take locks of
> > > different anon_vma's. If the attempt fails, the operation is restarted.
> > >
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >
> > Ok, acquiring vma->anon_vma->spin_lock always sounds very safe.
> > (but slow.)
> >
> > I'll test this, too.
> >
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
>
> Sorry, it reproduced. It seems to be the same bug as before the patch:
> mapcount 1 -> unmap -> remap -> mapcount 0. And it was SwapCache.
>
Same here, reproduced after 18 hours.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-04-27 8:59 ` Mel Gorman
@ 2010-04-27 9:09 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-27 9:09 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, Minchan Kim, Christoph Lameter, Andrea Arcangeli,
Rik van Riel, Andrew Morton
On Tue, 27 Apr 2010 09:59:51 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> On Tue, Apr 27, 2010 at 12:50:40PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Tue, 27 Apr 2010 09:07:06 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Mon, 26 Apr 2010 23:37:58 +0100
> > > Mel Gorman <mel@csn.ul.ie> wrote:
> > >
> > > > vma_adjust() is updating anon VMA information without any locks taken.
> > > > In contrast, file-backed mappings use the i_mmap_lock and this lack of
> > > > locking can result in races with page migration. During rmap_walk(),
> > > > vma_address() can return -EFAULT for an address that will soon be valid.
> > > > This leaves a dangling migration PTE behind which can later cause a BUG_ON
> > > > to trigger when the page is faulted in.
> > > >
> > > > With the recent anon_vma changes, there can be more than one anon_vma->lock
> > > > that can be taken in an anon_vma_chain, but a second lock cannot be spun
> > > > on, as that could deadlock. Instead, the rmap walker tries to take locks of
> > > > different anon_vma's. If the attempt fails, the operation is restarted.
> > > >
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > >
> > > Ok, acquiring vma->anon_vma->spin_lock always sounds very safe.
> > > (but slow.)
> > >
> > > I'll test this, too.
> > >
> > > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > >
> >
> > Sorry, it reproduced. It seems to be the same bug as before the patch:
> > mapcount 1 -> unmap -> remap -> mapcount 0. And it was SwapCache.
> >
>
> Same here, reproduced after 18 hours.
>
Hmm. It seems rmap_one() is called, so the race is not in vma_address()
but in remove_migration_pte().
I added more hooks for debugging, but it has not reproduced yet.
(Then again, it seems strange to have a race in remove_migration_pte(),
so I doubt my debug code, too ;)
Thanks,
-Kame
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-04-26 23:05 ` Minchan Kim
@ 2010-04-27 9:17 ` Mel Gorman
0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2010-04-27 9:17 UTC (permalink / raw)
To: Minchan Kim
Cc: Linux-MM, LKML, KAMEZAWA Hiroyuki, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Andrew Morton
On Tue, Apr 27, 2010 at 08:05:26AM +0900, Minchan Kim wrote:
> On Tue, Apr 27, 2010 at 7:37 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > vma_adjust() is updating anon VMA information without any locks taken.
> > In contrast, file-backed mappings use the i_mmap_lock and this lack of
> > locking can result in races with page migration. During rmap_walk(),
> > vma_address() can return -EFAULT for an address that will soon be valid.
> > This leaves a dangling migration PTE behind which can later cause a BUG_ON
> > to trigger when the page is faulted in.
> >
> > With the recent anon_vma changes, there can be more than one anon_vma->lock
> > that can be taken in an anon_vma_chain, but a second lock cannot be spun
> > on, as that could deadlock. Instead, the rmap walker tries to take locks of
> > different anon_vma's. If the attempt fails, the operation is restarted.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > mm/ksm.c | 13 +++++++++++++
> > mm/mmap.c | 6 ++++++
> > mm/rmap.c | 22 +++++++++++++++++++---
> > 3 files changed, 38 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 3666d43..baa5b4d 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -1674,9 +1674,22 @@ again:
> > spin_lock(&anon_vma->lock);
> > list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
> > vma = vmac->vma;
> > +
> > + /* See comment in mm/rmap.c#rmap_walk_anon on locking */
> > + if (anon_vma != vma->anon_vma) {
> > + if (!spin_trylock(&vma->anon_vma->lock)) {
> > + spin_unlock(&anon_vma->lock);
> > + goto again;
> > + }
> > + }
> > +
> > if (rmap_item->address < vma->vm_start ||
> > rmap_item->address >= vma->vm_end)
> > continue;
> > +
> > + if (anon_vma != vma->anon_vma)
> > + spin_unlock(&vma->anon_vma->lock);
> > +
> > /*
> > * Initially we examine only the vma which covers this
> > * rmap_item; but later, if there is still work to do,
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index f90ea92..61d6f1d 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -578,6 +578,9 @@ again: remove_next = 1 + (end > next->vm_end);
> > }
> > }
> >
> > + if (vma->anon_vma)
> > + spin_lock(&vma->anon_vma->lock);
> > +
> > if (root) {
> > flush_dcache_mmap_lock(mapping);
> > vma_prio_tree_remove(vma, root);
> > @@ -620,6 +623,9 @@ again: remove_next = 1 + (end > next->vm_end);
> > if (mapping)
> > spin_unlock(&mapping->i_mmap_lock);
> >
> > + if (vma->anon_vma)
> > + spin_unlock(&vma->anon_vma->lock);
> > +
> > if (remove_next) {
> > if (file) {
> > fput(file);
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 85f203e..bc313a6 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1368,15 +1368,31 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
> > * are holding mmap_sem. Users without mmap_sem are required to
> > * take a reference count to prevent the anon_vma disappearing
> > */
> > +retry:
> > anon_vma = page_anon_vma(page);
> > if (!anon_vma)
> > return ret;
> > spin_lock(&anon_vma->lock);
> > list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> > struct vm_area_struct *vma = avc->vma;
> > - unsigned long address = vma_address(page, vma);
> > - if (address == -EFAULT)
> > - continue;
> > + unsigned long address;
> > +
> > + /*
> > + * Guard against deadlocks by not spinning against
> > + * vma->anon_vma->lock. If contention is found, release our
> > + * lock and try again until the VMA list can be traversed without
> > + * contention.
> > + */
> > + if (anon_vma != vma->anon_vma) {
> > + if (!spin_trylock(&vma->anon_vma->lock)) {
> > + spin_unlock(&anon_vma->lock);
> > + goto retry;
> > + }
> > + }
> > + address = vma_address(page, vma);
> > + if (anon_vma != vma->anon_vma)
> > + spin_unlock(&vma->anon_vma->lock);
> > +
>
> if (address == -EFAULT)
> continue;
>
Correct. Thanks.
> > ret = rmap_one(page, vma, address, arg);
> > if (ret != SWAP_AGAIN)
> > break;
> > --
> > 1.6.5
> >
> >
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-04-27 9:09 ` KAMEZAWA Hiroyuki
@ 2010-04-27 10:29 ` Mel Gorman
0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2010-04-27 10:29 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Linux-MM, LKML, Minchan Kim, Christoph Lameter, Andrea Arcangeli,
Rik van Riel, Andrew Morton
On Tue, Apr 27, 2010 at 06:09:49PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 27 Apr 2010 09:59:51 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > On Tue, Apr 27, 2010 at 12:50:40PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 27 Apr 2010 09:07:06 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > >
> > > > On Mon, 26 Apr 2010 23:37:58 +0100
> > > > Mel Gorman <mel@csn.ul.ie> wrote:
> > > >
> > > > > vma_adjust() is updating anon VMA information without any locks taken.
> > > > > In contrast, file-backed mappings use the i_mmap_lock and this lack of
> > > > > locking can result in races with page migration. During rmap_walk(),
> > > > > vma_address() can return -EFAULT for an address that will soon be valid.
> > > > > This leaves a dangling migration PTE behind which can later cause a BUG_ON
> > > > > to trigger when the page is faulted in.
> > > > >
> > > > > With the recent anon_vma changes, there can be more than one anon_vma->lock
> > > > > that can be taken in an anon_vma_chain, but a second lock cannot be spun
> > > > > on, as that could deadlock. Instead, the rmap walker tries to take locks of
> > > > > different anon_vma's. If the attempt fails, the operation is restarted.
> > > > >
> > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > >
> > > > Ok, acquiring vma->anon_vma->spin_lock always sounds very safe.
> > > > (but slow.)
> > > >
> > > > I'll test this, too.
> > > >
> > > > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > >
> > >
> > > Sorry, it reproduced. It seems to be the same bug as before the patch:
> > > mapcount 1 -> unmap -> remap -> mapcount 0. And it was SwapCache.
> > >
> >
> > Same here, reproduced after 18 hours.
> >
> Hmm. It seems rmap_one() is called, so the race is not in vma_address()
> but in remove_migration_pte().
It could have been in both but the vma lock should have been held across
the rmap_one. It still reproduces but it's still the right thing to do.
This is the current version of patch 2/2.
==== CUT HERE ====
[PATCH] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
vma_adjust() is updating anon VMA information without any locks taken.
In contrast, file-backed mappings use the i_mmap_lock and this lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a BUG_ON
to trigger when the page is faulted in.
With the recent anon_vma changes, there can be more than one anon_vma->lock
that can be taken in an anon_vma_chain, but a second lock cannot be spun
on, as that could deadlock. Instead, the rmap walker tries to take locks of
different anon_vma's. If the attempt fails, the operation is restarted.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/ksm.c | 19 +++++++++++++++++--
mm/mmap.c | 6 ++++++
mm/rmap.c | 27 +++++++++++++++++++++++----
3 files changed, 46 insertions(+), 6 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..87c7531 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1674,9 +1674,19 @@ again:
spin_lock(&anon_vma->lock);
list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
vma = vmac->vma;
+
+ /* See comment in mm/rmap.c#rmap_walk_anon on locking */
+ if (anon_vma != vma->anon_vma) {
+ if (!spin_trylock(&vma->anon_vma->lock)) {
+ spin_unlock(&anon_vma->lock);
+ goto again;
+ }
+ }
+
if (rmap_item->address < vma->vm_start ||
rmap_item->address >= vma->vm_end)
- continue;
+ goto next_vma;
+
/*
* Initially we examine only the vma which covers this
* rmap_item; but later, if there is still work to do,
@@ -1684,9 +1694,14 @@ again:
* were forked from the original since ksmd passed.
*/
if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
- continue;
+ goto next_vma;
ret = rmap_one(page, vma, rmap_item->address, arg);
+
+next_vma:
+ if (anon_vma != vma->anon_vma)
+ spin_unlock(&vma->anon_vma->lock);
+
if (ret != SWAP_AGAIN) {
spin_unlock(&anon_vma->lock);
goto out;
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61d6f1d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,9 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
+ if (vma->anon_vma)
+ spin_lock(&vma->anon_vma->lock);
+
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -620,6 +623,9 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
+ if (vma->anon_vma)
+ spin_unlock(&vma->anon_vma->lock);
+
if (remove_next) {
if (file) {
fput(file);
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..7c2b7a9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1368,18 +1368,37 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
* are holding mmap_sem. Users without mmap_sem are required to
* take a reference count to prevent the anon_vma disappearing
*/
+retry:
anon_vma = page_anon_vma(page);
if (!anon_vma)
return ret;
spin_lock(&anon_vma->lock);
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
- unsigned long address = vma_address(page, vma);
- if (address == -EFAULT)
- continue;
- ret = rmap_one(page, vma, address, arg);
+ unsigned long address;
+
+ /*
+ * Guard against deadlocks by not spinning against
+ * vma->anon_vma->lock. If contention is found, release our lock and
+ * try again until the VMA list can be traversed without worrying about
+ * the details of the VMA changing underneath us.
+ */
+ if (anon_vma != vma->anon_vma) {
+ if (!spin_trylock(&vma->anon_vma->lock)) {
+ spin_unlock(&anon_vma->lock);
+ goto retry;
+ }
+ }
+ address = vma_address(page, vma);
+ if (address != -EFAULT)
+ ret = rmap_one(page, vma, address, arg);
+
+ if (anon_vma != vma->anon_vma)
+ spin_unlock(&vma->anon_vma->lock);
+
if (ret != SWAP_AGAIN)
break;
+
}
spin_unlock(&anon_vma->lock);
return ret;
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-04-27 10:29 ` Mel Gorman
@ 2010-04-27 15:37 ` Andrea Arcangeli
0 siblings, 0 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2010-04-27 15:37 UTC (permalink / raw)
To: Mel Gorman
Cc: KAMEZAWA Hiroyuki, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Rik van Riel, Andrew Morton
On Tue, Apr 27, 2010 at 11:29:05AM +0100, Mel Gorman wrote:
> It could have been in both but the vma lock should have been held across
> the rmap_one. It still reproduces but it's still the right thing to do.
> This is the current version of patch 2/2.
Well, keep in mind I reproduced the swapops bug with the 2.6.33 anon-vma
code, so it's unlikely that focusing on patch 2 will fix the bug in
swapops.h. If this is a bug in the new anon-vma code, it needs fixing
of course! But I doubt this bug is related to the swapops BUG in execve
on the bprm->p args.
I've yet to check patch 1 in detail, sorry. I'll let you know my
opinion about it as soon as I've gone through it.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-04-27 15:37 ` Andrea Arcangeli
@ 2010-04-27 16:35 ` Mel Gorman
0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2010-04-27 16:35 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: KAMEZAWA Hiroyuki, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Rik van Riel, Andrew Morton
On Tue, Apr 27, 2010 at 05:37:59PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 27, 2010 at 11:29:05AM +0100, Mel Gorman wrote:
> > It could have been in both but the vma lock should have been held across
> > the rmap_one. It still reproduces but it's still the right thing to do.
> > This is the current version of patch 2/2.
>
> Well, keep in mind I reproduced the swapops bug with the 2.6.33 anon-vma
> code, so it's unlikely that focusing on patch 2 will fix the bug in
> swapops.h. If this is a bug in the new anon-vma code, it needs fixing
> of course! But I doubt this bug is related to the swapops BUG in execve
> on the bprm->p args.
>
Why do you doubt it's related to execve? From what I've seen today,
there is a race in execve where the VMA gets moved (under the anon_vma lock)
before the page tables are copied with move_ptes() (without a lock). In that
case, execve can encounter and copy migration PTEs before migration removes
them. This also applies to mainline because that path takes only the RCU
lock and not the anon_vma->lock.
I have a prototype that "handles" the situation with the new anon_vma
code by removing the migration PTEs it finds while moving the page
tables, but it needs more work before it can be released.
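As a rough sketch of the idea (not the actual prototype, which removes
the entries): a hypothetical helper called from move_ptes() could instead
wait for migration to finish before the PTE is moved, in the spirit of
what patch 1 does for fork(). The helper name and call site below are
illustrative only.

	#include <linux/swapops.h>	/* pte_to_swp_entry, is_migration_entry */

	/*
	 * Hypothetical sketch: before moving a PTE in move_ptes(), wait
	 * for any migration entry covering it to be resolved.
	 */
	static void wait_on_migration_pte(struct mm_struct *mm, pmd_t *pmd,
					  unsigned long addr, pte_t pte)
	{
		swp_entry_t entry;

		/* Only a non-present, non-none PTE can be a migration entry */
		if (pte_none(pte) || pte_present(pte))
			return;

		entry = pte_to_swp_entry(pte);
		if (!is_migration_entry(entry))
			return;

		/*
		 * Sleeps until migration unlocks the page and replaces the
		 * migration entry; must be called without the page table lock.
		 */
		migration_entry_wait(mm, pmd, addr);
	}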
An alternative would be to split vma_adjust() into locked and unlocked
versions. shift_arg_pages() could then take the anon_vma lock to cover
both the VMA move and the page table copy here:
	/*
	 * cover the whole range: [new_start, old_end)
	 */
	if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
		return -ENOMEM;

	/*
	 * move the page tables downwards, on failure we rely on
	 * process cleanup to remove whatever mess we made.
	 */
	if (length != move_page_tables(vma, old_start,
		       vma, new_start, length))
		return -ENOMEM;
It'd be messy to split up the locking of vma_adjust() like this though,
and exec() would hold the anon_vma lock for longer just to guard against
migration. It's not clear this is better than having move_ptes() handle
the migration PTEs it encounters.
> I've yet to check patch 1 in detail, sorry. I'll let you know my
> opinion on it as soon as I've gone through it.
>
No problem. I still need to revisit all of these patches once I am
confident the swapops bug cannot be triggered any more.
^ permalink raw reply [flat|nested] 42+ messages in thread
end of thread, other threads:[~2010-04-27 16:36 UTC | newest]
Thread overview: 42+ messages
2010-04-26 22:37 [PATCH 0/2] Fix migration races in rmap_walk() Mel Gorman
2010-04-26 22:37 ` [PATCH 1/2] mm,migration: During fork(), wait for migration to end if migration PTE is encountered Mel Gorman
2010-04-26 23:28 ` Minchan Kim
2010-04-27 0:08 ` Rik van Riel
2010-04-26 22:37 ` [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
2010-04-26 23:05 ` Minchan Kim
2010-04-27 0:30 ` KAMEZAWA Hiroyuki
2010-04-27 9:17 ` Mel Gorman
2010-04-26 23:15 ` Minchan Kim
2010-04-27 0:07 ` KAMEZAWA Hiroyuki
2010-04-27 3:50 ` KAMEZAWA Hiroyuki
2010-04-27 4:03 ` KAMEZAWA Hiroyuki
2010-04-27 8:59 ` Mel Gorman
2010-04-27 9:09 ` KAMEZAWA Hiroyuki
2010-04-27 10:29 ` Mel Gorman
2010-04-27 15:37 ` Andrea Arcangeli
2010-04-27 16:35 ` Mel Gorman
2010-04-27 0:30 ` Rik van Riel
2010-04-27 0:31 ` KAMEZAWA Hiroyuki
2010-04-27 2:13 ` Rik van Riel
2010-04-26 23:04 ` [PATCH 0/2] Fix migration races in rmap_walk() Andrea Arcangeli