* [BUGFIX][mm][PATCH] fix migration race in rmap_walk
@ 2010-04-23  3:01 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-23  3:01 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, minchan.kim, Christoph Lameter, akpm, linux-kernel


This patch itself is for -mm, but it may need to go to the -stable tree for
memory hotplug. (We've got no report of anyone hitting this race, though...)

This one is the simplest fix, I think, and it works well on my test set.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

In rmap.c, when checking the rmap via the vma chain reached from page->mapping,
anon_vma->lock or mapping->i_mmap_lock is held and we enter the following loop.

	for_each_vma_in_this_rmap_link(list from page->mapping) {
		unsigned long address = vma_address(page, vma);
		if (address == -EFAULT)
			continue;
		....
	}

vma_address() checks the vma's [start, end, pgoff] against page->index.

But the vma's [start, end, pgoff] is updated without these locks held, so
vma_address() can race with such an update and may return a wrong result.
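
For reference, vma_address() in mm/rmap.c looks roughly like this (paraphrased
from memory, not a verbatim copy); every vma field it reads can change under
us because mmap_sem is not held here:

	static inline unsigned long
	vma_address(struct page *page, struct vm_area_struct *vma)
	{
		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
		unsigned long address;

		/* vm_start, vm_end, vm_pgoff are read without mmap_sem */
		address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
		if (unlikely(address < vma->vm_start || address >= vma->vm_end))
			return -EFAULT;	/* page is not mapped in this vma */
		return address;
	}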

This behavior is no problem for the usual routines such as try_to_unmap() etc.
But for page migration, rmap_walk() has to find all of the migration_ptes
with which the migration code overwrote valid ptes. For that, this race is
critical and causes a BUG because a migration_pte is sometimes not removed.
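
The BUG below fires in migration_entry_wait(); the check at
include/linux/swapops.h:105 is, roughly, the locked-page assertion in
migration_entry_to_page() (paraphrased from memory, not verbatim):

	static inline struct page *migration_entry_to_page(swp_entry_t entry)
	{
		struct page *p = pfn_to_page(swp_offset(entry));
		/*
		 * Any use of migration entries may only occur while the
		 * corresponding page is locked.
		 */
		BUG_ON(!PageLocked(p));
		return p;
	}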

Apr 21 17:27:47 localhost kernel: ------------[ cut here ]------------
Apr 21 17:27:47 localhost kernel: kernel BUG at include/linux/swapops.h:105!
Apr 21 17:27:47 localhost kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
Apr 21 17:27:47 localhost kernel: last sysfs file: /sys/devices/virtual/net/br0/statistics/collisions
Apr 21 17:27:47 localhost kernel: CPU 3
Apr 21 17:27:47 localhost kernel: Modules linked in: fuse sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput ioatdma ppdev parport_pc i5000_edac bnx2 iTCO_wdt edac_core iTCO_vendor_support shpchp parport e1000e kvm_intel dca kvm i2c_i801 i2c_core i5k_amb pcspkr megaraid_sas [last unloaded: microcode]
Apr 21 17:27:47 localhost kernel:
Apr 21 17:27:47 localhost kernel: Pid: 27892, comm: cc1 Tainted: G        W   2.6.34-rc4-mm1+ #4 D2519/PRIMERGY          
Apr 21 17:27:47 localhost kernel: RIP: 0010:[<ffffffff8114e9cf>]  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
Apr 21 17:27:47 localhost kernel: RSP: 0000:ffff88008d9efe08  EFLAGS: 00010246
Apr 21 17:27:47 localhost kernel: RAX: ffffea0000000000 RBX: ffffea0000241100 RCX: 0000000000000001
Apr 21 17:27:47 localhost kernel: RDX: 000000000000a4e0 RSI: ffff880621a4ab00 RDI: 000000000149c03e
Apr 21 17:27:47 localhost kernel: RBP: ffff88008d9efe38 R08: 0000000000000000 R09: 0000000000000000
Apr 21 17:27:47 localhost kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff880621a4aae8
Apr 21 17:27:47 localhost kernel: R13: 00000000bf811000 R14: 000000000149c03e R15: 0000000000000000
Apr 21 17:27:47 localhost kernel: FS:  00007fe6abc90700(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
Apr 21 17:27:47 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 21 17:27:47 localhost kernel: CR2: 00007fe6a37279a0 CR3: 000000008d942000 CR4: 00000000000006e0
Apr 21 17:27:47 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 21 17:27:47 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr 21 17:27:47 localhost kernel: Process cc1 (pid: 27892, threadinfo ffff88008d9ee000, task ffff8800b23ec820)
Apr 21 17:27:47 localhost kernel: Stack:
Apr 21 17:27:47 localhost kernel: ffffea000101aee8 ffff880621a4aae8 ffff88008d9efe38 00007fe6a37279a0
Apr 21 17:27:47 localhost kernel: <0> ffff8805d9706d90 ffff880621a4aa00 ffff88008d9efef8 ffffffff81126d05
Apr 21 17:27:47 localhost kernel: <0> ffff88008d9efec8 0000000000000246 0000000000000000 ffffffff81586533
Apr 21 17:27:47 localhost kernel: Call Trace:
Apr 21 17:27:47 localhost kernel: [<ffffffff81126d05>] handle_mm_fault+0x995/0x9b0
Apr 21 17:27:47 localhost kernel: [<ffffffff81586533>] ? do_page_fault+0x103/0x330
Apr 21 17:27:47 localhost kernel: [<ffffffff8104bf40>] ? finish_task_switch+0x0/0xf0
Apr 21 17:27:47 localhost kernel: [<ffffffff8158659e>] do_page_fault+0x16e/0x330
Apr 21 17:27:47 localhost kernel: [<ffffffff81582f35>] page_fault+0x25/0x30
Apr 21 17:27:47 localhost kernel: Code: 53 08 85 c9 0f 84 32 ff ff ff 8d 41 01 89 4d d8 89 45 d4 8b 75 d4 8b 45 d8 f0 0f b1 32 89 45 dc 8b 45 dc 39 c8 74 aa 89 c1 eb d7 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
Apr 21 17:27:47 localhost kernel: RIP  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
Apr 21 17:27:47 localhost kernel: RSP <ffff88008d9efe08>
Apr 21 17:27:47 localhost kernel: ---[ end trace 4860ab585c1fcddb ]---



This patch adds vma_address_safe(), and updates [start, end, pgoff]
under a seq counter.

Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/mm_types.h |    2 ++
 mm/mmap.c                |   15 ++++++++++++++-
 mm/rmap.c                |   25 ++++++++++++++++++++++++-
 3 files changed, 40 insertions(+), 2 deletions(-)

Index: linux-2.6.34-rc5-mm1/include/linux/mm_types.h
===================================================================
--- linux-2.6.34-rc5-mm1.orig/include/linux/mm_types.h
+++ linux-2.6.34-rc5-mm1/include/linux/mm_types.h
@@ -12,6 +12,7 @@
 #include <linux/completion.h>
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
+#include <linux/seqlock.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -183,6 +184,7 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+	seqcount_t updating;	/* works like seqlock for updating vma info. */
 };
 
 struct core_thread {
Index: linux-2.6.34-rc5-mm1/mm/mmap.c
===================================================================
--- linux-2.6.34-rc5-mm1.orig/mm/mmap.c
+++ linux-2.6.34-rc5-mm1/mm/mmap.c
@@ -491,6 +491,16 @@ __vma_unlink(struct mm_struct *mm, struc
 		mm->mmap_cache = prev;
 }
 
+static void adjust_start_vma(struct vm_area_struct *vma)
+{
+	write_seqcount_begin(&vma->updating);
+}
+
+static void adjust_end_vma(struct vm_area_struct *vma)
+{
+	write_seqcount_end(&vma->updating);
+}
+
 /*
  * We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that
  * is already present in an i_mmap tree without adjusting the tree.
@@ -584,13 +594,16 @@ again:			remove_next = 1 + (end > next->
 		if (adjust_next)
 			vma_prio_tree_remove(next, root);
 	}
-
+	adjust_start_vma(vma);
 	vma->vm_start = start;
 	vma->vm_end = end;
 	vma->vm_pgoff = pgoff;
+	adjust_end_vma(vma);
 	if (adjust_next) {
+		adjust_start_vma(next);
 		next->vm_start += adjust_next << PAGE_SHIFT;
 		next->vm_pgoff += adjust_next;
+		adjust_end_vma(next);
 	}
 
 	if (root) {
Index: linux-2.6.34-rc5-mm1/mm/rmap.c
===================================================================
--- linux-2.6.34-rc5-mm1.orig/mm/rmap.c
+++ linux-2.6.34-rc5-mm1/mm/rmap.c
@@ -342,6 +342,23 @@ vma_address(struct page *page, struct vm
 }
 
 /*
+ * vma's address check is racy if we don't hold mmap_sem. This function
+ * gives a safe way for accessing the [start, end, pgoff] tuple of vma.
+ */
+
+static inline unsigned long vma_address_safe(struct page *page,
+		struct vm_area_struct *vma)
+{
+	unsigned long ret, safety;
+
+	do {
+		safety = read_seqcount_begin(&vma->updating);
+		ret = vma_address(page, vma);
+	} while (read_seqcount_retry(&vma->updating, safety));
+	return ret;
+}
+
+/*
  * At what user virtual address is page expected in vma?
  * checking that the page matches the vma.
  */
@@ -1372,7 +1389,13 @@ static int rmap_walk_anon(struct page *p
 	spin_lock(&anon_vma->lock);
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
 		struct vm_area_struct *vma = avc->vma;
-		unsigned long address = vma_address(page, vma);
+		unsigned long address;
+
+		/*
+		 * In page migration, this race is critical. So, use
+		 * safe version.
+		 */
+		address = vma_address_safe(page, vma);
 		if (address == -EFAULT)
 			continue;
 		ret = rmap_one(page, vma, address, arg);


* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-23  3:01 ` KAMEZAWA Hiroyuki
@ 2010-04-23  5:11   ` Minchan Kim
  -1 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2010-04-23  5:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Mel Gorman, Christoph Lameter, akpm, linux-kernel

On Fri, Apr 23, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> This patch itself is for -mm ..but may need to go -stable tree for memory
> hotplug. (but we've got no report to hit this race...)
>
> This one is the simplest, I think and works well on my test set.
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> In rmap.c, at checking rmap in vma chain in page->mapping, anon_vma->lock
> or mapping->i_mmap_lock is held and enter following loop.
>
>        for_each_vma_in_this_rmap_link(list from page->mapping) {
>                unsigned long address = vma_address(page, vma);
>                if (address == -EFAULT)
>                        continue;
>                ....
>        }
>
> vma_address is checking [start, end, pgoff] v.s. page->index.
>
> But vma's [start, end, pgoff] is updated without locks. vma_address()
> can hit a race and may return wrong result.
>
> This bahavior is no problem in usual routine as try_to_unmap() etc...
> But for page migration, rmap_walk() has to find all migration_ptes
> which migration code overwritten valid ptes. This race is critical and cause
> BUG that a migration_pte is sometimes not removed.
>
> pr 21 17:27:47 localhost kernel: ------------[ cut here ]------------
> Apr 21 17:27:47 localhost kernel: kernel BUG at include/linux/swapops.h:105!
> Apr 21 17:27:47 localhost kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> Apr 21 17:27:47 localhost kernel: last sysfs file: /sys/devices/virtual/net/br0/statistics/collisions
> Apr 21 17:27:47 localhost kernel: CPU 3
> Apr 21 17:27:47 localhost kernel: Modules linked in: fuse sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput ioatdma ppdev parport_pc i5000_edac bnx2 iTCO_wdt edac_core iTCO_vendor_support shpchp parport e1000e kvm_intel dca kvm i2c_i801 i2c_core i5k_amb pcspkr megaraid_sas [last unloaded: microcode]
> Apr 21 17:27:47 localhost kernel:
> Apr 21 17:27:47 localhost kernel: Pid: 27892, comm: cc1 Tainted: G        W   2.6.34-rc4-mm1+ #4 D2519/PRIMERGY
> Apr 21 17:27:47 localhost kernel: RIP: 0010:[<ffffffff8114e9cf>]  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
> Apr 21 17:27:47 localhost kernel: RSP: 0000:ffff88008d9efe08  EFLAGS: 00010246
> Apr 21 17:27:47 localhost kernel: RAX: ffffea0000000000 RBX: ffffea0000241100 RCX: 0000000000000001
> Apr 21 17:27:47 localhost kernel: RDX: 000000000000a4e0 RSI: ffff880621a4ab00 RDI: 000000000149c03e
> Apr 21 17:27:47 localhost kernel: RBP: ffff88008d9efe38 R08: 0000000000000000 R09: 0000000000000000
> Apr 21 17:27:47 localhost kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff880621a4aae8
> Apr 21 17:27:47 localhost kernel: R13: 00000000bf811000 R14: 000000000149c03e R15: 0000000000000000
> Apr 21 17:27:47 localhost kernel: FS:  00007fe6abc90700(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
> Apr 21 17:27:47 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr 21 17:27:47 localhost kernel: CR2: 00007fe6a37279a0 CR3: 000000008d942000 CR4: 00000000000006e0
> Apr 21 17:27:47 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Apr 21 17:27:47 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Apr 21 17:27:47 localhost kernel: Process cc1 (pid: 27892, threadinfo ffff88008d9ee000, task ffff8800b23ec820)
> Apr 21 17:27:47 localhost kernel: Stack:
> Apr 21 17:27:47 localhost kernel: ffffea000101aee8 ffff880621a4aae8 ffff88008d9efe38 00007fe6a37279a0
> Apr 21 17:27:47 localhost kernel: <0> ffff8805d9706d90 ffff880621a4aa00 ffff88008d9efef8 ffffffff81126d05
> Apr 21 17:27:47 localhost kernel: <0> ffff88008d9efec8 0000000000000246 0000000000000000 ffffffff81586533
> Apr 21 17:27:47 localhost kernel: Call Trace:
> Apr 21 17:27:47 localhost kernel: [<ffffffff81126d05>] handle_mm_fault+0x995/0x9b0
> Apr 21 17:27:47 localhost kernel: [<ffffffff81586533>] ? do_page_fault+0x103/0x330
> Apr 21 17:27:47 localhost kernel: [<ffffffff8104bf40>] ? finish_task_switch+0x0/0xf0
> Apr 21 17:27:47 localhost kernel: [<ffffffff8158659e>] do_page_fault+0x16e/0x330
> Apr 21 17:27:47 localhost kernel: [<ffffffff81582f35>] page_fault+0x25/0x30
> Apr 21 17:27:47 localhost kernel: Code: 53 08 85 c9 0f 84 32 ff ff ff 8d 41 01 89 4d d8 89 45 d4 8b 75 d4 8b 45 d8 f0 0f b1 32 89 45 dc 8b 45 dc 39 c8 74 aa 89 c1 eb d7 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
> Apr 21 17:27:47 localhost kernel: RIP  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
> Apr 21 17:27:47 localhost kernel: RSP <ffff88008d9efe08>
> Apr 21 17:27:47 localhost kernel: ---[ end trace 4860ab585c1fcddb ]---
>
>
>
> This patch adds vma_address_safe(). And update [start, end, pgoff]
> under seq counter.
>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

That's exactly the same as what I have in mind. :)
But I am hesitating, because AFAIR we are trying to get rid of seqlock. Right?
But in this case, seqlock is good, I think. :)
-- 
Kind regards,
Minchan Kim

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-23  5:11   ` Minchan Kim
@ 2010-04-23  5:27     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-23  5:27 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, Mel Gorman, Christoph Lameter, akpm, linux-kernel

On Fri, 23 Apr 2010 14:11:37 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Fri, Apr 23, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > This patch itself is for -mm ..but may need to go -stable tree for memory
> > hotplug. (but we've got no report to hit this race...)
> >
> > This one is the simplest, I think and works well on my test set.
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > In rmap.c, at checking rmap in vma chain in page->mapping, anon_vma->lock
> > or mapping->i_mmap_lock is held and enter following loop.
> >
> >        for_each_vma_in_this_rmap_link(list from page->mapping) {
> >                unsigned long address = vma_address(page, vma);
> >                if (address == -EFAULT)
> >                        continue;
> >                ....
> >        }
> >
> > vma_address is checking [start, end, pgoff] v.s. page->index.
> >
> > But vma's [start, end, pgoff] is updated without locks. vma_address()
> > can hit a race and may return wrong result.
> >
> > This bahavior is no problem in usual routine as try_to_unmap() etc...
> > But for page migration, rmap_walk() has to find all migration_ptes
> > which migration code overwritten valid ptes. This race is critical and cause
> > BUG that a migration_pte is sometimes not removed.
> >
> > pr 21 17:27:47 localhost kernel: ------------[ cut here ]------------
> > Apr 21 17:27:47 localhost kernel: kernel BUG at include/linux/swapops.h:105!
> > Apr 21 17:27:47 localhost kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> > Apr 21 17:27:47 localhost kernel: last sysfs file: /sys/devices/virtual/net/br0/statistics/collisions
> > Apr 21 17:27:47 localhost kernel: CPU 3
> > Apr 21 17:27:47 localhost kernel: Modules linked in: fuse sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput ioatdma ppdev parport_pc i5000_edac bnx2 iTCO_wdt edac_core iTCO_vendor_support shpchp parport e1000e kvm_intel dca kvm i2c_i801 i2c_core i5k_amb pcspkr megaraid_sas [last unloaded: microcode]
> > Apr 21 17:27:47 localhost kernel:
> > Apr 21 17:27:47 localhost kernel: Pid: 27892, comm: cc1 Tainted: G        W   2.6.34-rc4-mm1+ #4 D2519/PRIMERGY
> > Apr 21 17:27:47 localhost kernel: RIP: 0010:[<ffffffff8114e9cf>]  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
> > Apr 21 17:27:47 localhost kernel: RSP: 0000:ffff88008d9efe08  EFLAGS: 00010246
> > Apr 21 17:27:47 localhost kernel: RAX: ffffea0000000000 RBX: ffffea0000241100 RCX: 0000000000000001
> > Apr 21 17:27:47 localhost kernel: RDX: 000000000000a4e0 RSI: ffff880621a4ab00 RDI: 000000000149c03e
> > Apr 21 17:27:47 localhost kernel: RBP: ffff88008d9efe38 R08: 0000000000000000 R09: 0000000000000000
> > Apr 21 17:27:47 localhost kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff880621a4aae8
> > Apr 21 17:27:47 localhost kernel: R13: 00000000bf811000 R14: 000000000149c03e R15: 0000000000000000
> > Apr 21 17:27:47 localhost kernel: FS:  00007fe6abc90700(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
> > Apr 21 17:27:47 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Apr 21 17:27:47 localhost kernel: CR2: 00007fe6a37279a0 CR3: 000000008d942000 CR4: 00000000000006e0
> > Apr 21 17:27:47 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > Apr 21 17:27:47 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Apr 21 17:27:47 localhost kernel: Process cc1 (pid: 27892, threadinfo ffff88008d9ee000, task ffff8800b23ec820)
> > Apr 21 17:27:47 localhost kernel: Stack:
> > Apr 21 17:27:47 localhost kernel: ffffea000101aee8 ffff880621a4aae8 ffff88008d9efe38 00007fe6a37279a0
> > Apr 21 17:27:47 localhost kernel: <0> ffff8805d9706d90 ffff880621a4aa00 ffff88008d9efef8 ffffffff81126d05
> > Apr 21 17:27:47 localhost kernel: <0> ffff88008d9efec8 0000000000000246 0000000000000000 ffffffff81586533
> > Apr 21 17:27:47 localhost kernel: Call Trace:
> > Apr 21 17:27:47 localhost kernel: [<ffffffff81126d05>] handle_mm_fault+0x995/0x9b0
> > Apr 21 17:27:47 localhost kernel: [<ffffffff81586533>] ? do_page_fault+0x103/0x330
> > Apr 21 17:27:47 localhost kernel: [<ffffffff8104bf40>] ? finish_task_switch+0x0/0xf0
> > Apr 21 17:27:47 localhost kernel: [<ffffffff8158659e>] do_page_fault+0x16e/0x330
> > Apr 21 17:27:47 localhost kernel: [<ffffffff81582f35>] page_fault+0x25/0x30
> > Apr 21 17:27:47 localhost kernel: Code: 53 08 85 c9 0f 84 32 ff ff ff 8d 41 01 89 4d d8 89 45 d4 8b 75 d4 8b 45 d8 f0 0f b1 32 89 45 dc 8b 45 dc 39 c8 74 aa 89 c1 eb d7 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
> > Apr 21 17:27:47 localhost kernel: RIP  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
> > Apr 21 17:27:47 localhost kernel: RSP <ffff88008d9efe08>
> > Apr 21 17:27:47 localhost kernel: ---[ end trace 4860ab585c1fcddb ]---
> >
> >
> >
> > This patch adds vma_address_safe(). And update [start, end, pgoff]
> > under seq counter.
> >
> > Cc: Mel Gorman <mel@csn.ul.ie>
> > Cc: Minchan Kim <minchan.kim@gmail.com>
> > Cc: Christoph Lameter <cl@linux-foundation.org>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> That's exactly same what I have in mind. :)
> But I am hesitating. That's because AFAIR, we try to remove seqlock. Right?

Ah... is "don't use seqlock" the trend?

> But in this case, seqlock is good, I think. :)
> 
BTW, this isn't a seqlock but a seqcount :)

I'm still testing. What I suspect, other than vma_address(), is fork().
At fork(), the following _may_ happen (but I'm not sure).

	chain vma.
	copy page table.
	   -> migration entry is copied, too.

At remap,
	for each vma
	    look into page table and replace.

Then,
						rmap_walk().
	fork(parent, child)
						look into child's page table.
						=> we find nothing.
	spin_lock(child's page table);
	spin_lock(parent's page table);
	copy migration entry
	spin_unlock(parent's page table)
	spin_unlock(child's page table)
						update parent's page table

If we always find the parent's page table before the child's, there is no race.
But I can't get a clear picture of prio_tree's list order. Hmm.
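
For reference, the pte-lock ordering in copy_pte_range() (mm/memory.c) is
roughly the following (a paraphrased fragment from memory, not verbatim);
the child's (dst) pte lock is taken first, then the parent's (src):

	/* child (dst) side: allocate and lock the destination pte page */
	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
	if (!dst_pte)
		return -ENOMEM;
	/* parent (src) side: map and lock, nested inside the child's lock */
	src_pte = pte_offset_map_nested(src_pmd, addr);
	src_ptl = pte_lockptr(src_mm, src_pmd);
	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);

	/* ... copy ptes; swap/migration entries are copied as-is ... */

	spin_unlock(src_ptl);
	pte_unmap_nested(src_pte);
	pte_unmap_unlock(dst_pte, dst_ptl);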

Thanks,
-Kame









* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-23  5:27     ` KAMEZAWA Hiroyuki
@ 2010-04-23  7:00       ` Minchan Kim
  -1 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2010-04-23  7:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Mel Gorman, Christoph Lameter, akpm, linux-kernel

On Fri, Apr 23, 2010 at 2:27 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 23 Apr 2010 14:11:37 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> On Fri, Apr 23, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >
>> > This patch itself is for -mm ..but may need to go -stable tree for memory
>> > hotplug. (but we've got no report to hit this race...)
>> >
>> > This one is the simplest, I think and works well on my test set.
>> > ==
>> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> >
>> > In rmap.c, at checking rmap in vma chain in page->mapping, anon_vma->lock
>> > or mapping->i_mmap_lock is held and enter following loop.
>> >
>> >        for_each_vma_in_this_rmap_link(list from page->mapping) {
>> >                unsigned long address = vma_address(page, vma);
>> >                if (address == -EFAULT)
>> >                        continue;
>> >                ....
>> >        }
>> >
>> > vma_address is checking [start, end, pgoff] v.s. page->index.
>> >
>> > But vma's [start, end, pgoff] is updated without locks. vma_address()
>> > can hit a race and may return wrong result.
>> >
>> > This bahavior is no problem in usual routine as try_to_unmap() etc...
>> > But for page migration, rmap_walk() has to find all migration_ptes
>> > which migration code overwritten valid ptes. This race is critical and cause
>> > BUG that a migration_pte is sometimes not removed.
>> >
>> > pr 21 17:27:47 localhost kernel: ------------[ cut here ]------------
>> > Apr 21 17:27:47 localhost kernel: kernel BUG at include/linux/swapops.h:105!
>> > Apr 21 17:27:47 localhost kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
>> > Apr 21 17:27:47 localhost kernel: last sysfs file: /sys/devices/virtual/net/br0/statistics/collisions
>> > Apr 21 17:27:47 localhost kernel: CPU 3
>> > Apr 21 17:27:47 localhost kernel: Modules linked in: fuse sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput ioatdma ppdev parport_pc i5000_edac bnx2 iTCO_wdt edac_core iTCO_vendor_support shpchp parport e1000e kvm_intel dca kvm i2c_i801 i2c_core i5k_amb pcspkr megaraid_sas [last unloaded: microcode]
>> > Apr 21 17:27:47 localhost kernel:
>> > Apr 21 17:27:47 localhost kernel: Pid: 27892, comm: cc1 Tainted: G        W   2.6.34-rc4-mm1+ #4 D2519/PRIMERGY
>> > Apr 21 17:27:47 localhost kernel: RIP: 0010:[<ffffffff8114e9cf>]  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
>> > Apr 21 17:27:47 localhost kernel: RSP: 0000:ffff88008d9efe08  EFLAGS: 00010246
>> > Apr 21 17:27:47 localhost kernel: RAX: ffffea0000000000 RBX: ffffea0000241100 RCX: 0000000000000001
>> > Apr 21 17:27:47 localhost kernel: RDX: 000000000000a4e0 RSI: ffff880621a4ab00 RDI: 000000000149c03e
>> > Apr 21 17:27:47 localhost kernel: RBP: ffff88008d9efe38 R08: 0000000000000000 R09: 0000000000000000
>> > Apr 21 17:27:47 localhost kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff880621a4aae8
>> > Apr 21 17:27:47 localhost kernel: R13: 00000000bf811000 R14: 000000000149c03e R15: 0000000000000000
>> > Apr 21 17:27:47 localhost kernel: FS:  00007fe6abc90700(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
>> > Apr 21 17:27:47 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > Apr 21 17:27:47 localhost kernel: CR2: 00007fe6a37279a0 CR3: 000000008d942000 CR4: 00000000000006e0
>> > Apr 21 17:27:47 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > Apr 21 17:27:47 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> > Apr 21 17:27:47 localhost kernel: Process cc1 (pid: 27892, threadinfo ffff88008d9ee000, task ffff8800b23ec820)
>> > Apr 21 17:27:47 localhost kernel: Stack:
>> > Apr 21 17:27:47 localhost kernel: ffffea000101aee8 ffff880621a4aae8 ffff88008d9efe38 00007fe6a37279a0
>> > Apr 21 17:27:47 localhost kernel: <0> ffff8805d9706d90 ffff880621a4aa00 ffff88008d9efef8 ffffffff81126d05
>> > Apr 21 17:27:47 localhost kernel: <0> ffff88008d9efec8 0000000000000246 0000000000000000 ffffffff81586533
>> > Apr 21 17:27:47 localhost kernel: Call Trace:
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff81126d05>] handle_mm_fault+0x995/0x9b0
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff81586533>] ? do_page_fault+0x103/0x330
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff8104bf40>] ? finish_task_switch+0x0/0xf0
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff8158659e>] do_page_fault+0x16e/0x330
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff81582f35>] page_fault+0x25/0x30
>> > Apr 21 17:27:47 localhost kernel: Code: 53 08 85 c9 0f 84 32 ff ff ff 8d 41 01 89 4d d8 89 45 d4 8b 75 d4 8b 45 d8 f0 0f b1 32 89 45 dc 8b 45 dc 39 c8 74 aa 89 c1 eb d7 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
>> > Apr 21 17:27:47 localhost kernel: RIP  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
>> > Apr 21 17:27:47 localhost kernel: RSP <ffff88008d9efe08>
>> > Apr 21 17:27:47 localhost kernel: ---[ end trace 4860ab585c1fcddb ]---
>> >
>> >
>> >
>> > This patch adds vma_address_safe(). And update [start, end, pgoff]
>> > under seq counter.
>> >
>> > Cc: Mel Gorman <mel@csn.ul.ie>
>> > Cc: Minchan Kim <minchan.kim@gmail.com>
>> > Cc: Christoph Lameter <cl@linux-foundation.org>
>> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>> That's exactly same what I have in mind. :)
>> But I am hesitating. That's because AFAIR, we try to remove seqlock. Right?
>
> Ah,..."don't use seqlock" is trend ?
>
>> But in this case, seqlock is good, I think. :)
>>
> BTW, this isn't seqlock but seq_counter :)
>
> I'm still testing. What I doubt other than vma_address() is fork().
> at fork(), followings _may_ happen. (but I'm not sure).
>
>        chain vma.
>        copy page table.
>           -> migration entry is copied, too.
>
> At remap,
>        for each vma
>            look into page table and replace.
>
> Then,
>                                                rmap_walk().
>        fork(parent, child)
>                                                look into child's page table.
>                                                => we fond nothing.
>        spin_lock(child's pagetable);
>        spin_lock(parant's page table);
>        copy migration entry
>        spin_unlock(paranet's page table)
>        spin_unlock(child's page table)
>                                                update parent's paga table
>
> If we always find parant's page table before child's , there is no race.
> But I can't get prit_tree's list order as clear image. Hmm.
>
> Thanks,
> -Kame
>

That's a good point, Kame.
I looked into prio_tree quickly.
If I understand it right, the list order is backward.

dup_mmap calls vma_prio_tree_add.

 * prio_tree_root
 *      |
 *      A       vm_set.head
 *     / \      /
 *    L   R -> H-I-J-K-M-N-O-P-Q-S
 *    ^   ^    <-- vm_set.list -->
 *  tree nodes
 *

Maybe the parent's and children's vmas are H~S.
Then, the comment says:

"vma->shared.vm_set.parent != NULL    ==> a tree node"
So vma_prio_tree_add() calls not list_add_tail but list_add.

Anyway, I think the order isn't mixed.
So, could we traverse it backward in rmap?
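
For reference, vma_prio_tree_add() (mm/prio_tree.c) is roughly the following
(paraphrased from memory, not verbatim); note the list_add() vs list_add_tail()
distinction mentioned above:

	void vma_prio_tree_add(struct vm_area_struct *vma,
			       struct vm_area_struct *old)
	{
		/* vma maps exactly the same file range as old */
		vma->shared.vm_set.head = NULL;
		vma->shared.vm_set.parent = NULL;

		if (!old->shared.vm_set.parent)
			/* list_add(): vma is linked directly after old */
			list_add(&vma->shared.vm_set.list,
				 &old->shared.vm_set.list);
		else if (old->shared.vm_set.head)
			/* append vma at the tail of the vm_set list */
			list_add_tail(&vma->shared.vm_set.list,
				&old->shared.vm_set.head->shared.vm_set.list);
		else {
			/* first duplicate: old and vma reference each other */
			INIT_LIST_HEAD(&vma->shared.vm_set.list);
			vma->shared.vm_set.head = old;
			old->shared.vm_set.head = vma;
		}
	}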

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
@ 2010-04-23  7:00       ` Minchan Kim
  0 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2010-04-23  7:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Mel Gorman, Christoph Lameter, akpm, linux-kernel

On Fri, Apr 23, 2010 at 2:27 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 23 Apr 2010 14:11:37 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> On Fri, Apr 23, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >
>> > This patch itself is for -mm ..but may need to go -stable tree for memory
>> > hotplug. (but we've got no report to hit this race...)
>> >
>> > This one is the simplest, I think and works well on my test set.
>> > ==
>> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> >
>> > In rmap.c, at checking rmap in vma chain in page->mapping, anon_vma->lock
>> > or mapping->i_mmap_lock is held and enter following loop.
>> >
>> >        for_each_vma_in_this_rmap_link(list from page->mapping) {
>> >                unsigned long address = vma_address(page, vma);
>> >                if (address == -EFAULT)
>> >                        continue;
>> >                ....
>> >        }
>> >
>> > vma_address is checking [start, end, pgoff] v.s. page->index.
>> >
>> > But vma's [start, end, pgoff] is updated without locks. vma_address()
>> > can hit a race and may return wrong result.
>> >
>> > This bahavior is no problem in usual routine as try_to_unmap() etc...
>> > But for page migration, rmap_walk() has to find all migration_ptes
>> > which migration code overwritten valid ptes. This race is critical and cause
>> > BUG that a migration_pte is sometimes not removed.
>> >
>> > pr 21 17:27:47 localhost kernel: ------------[ cut here ]------------
>> > Apr 21 17:27:47 localhost kernel: kernel BUG at include/linux/swapops.h:105!
>> > Apr 21 17:27:47 localhost kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
>> > Apr 21 17:27:47 localhost kernel: last sysfs file: /sys/devices/virtual/net/br0/statistics/collisions
>> > Apr 21 17:27:47 localhost kernel: CPU 3
>> > Apr 21 17:27:47 localhost kernel: Modules linked in: fuse sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput ioatdma ppdev parport_pc i5000_edac bnx2 iTCO_wdt edac_core iTCO_vendor_support shpchp parport e1000e kvm_intel dca kvm i2c_i801 i2c_core i5k_amb pcspkr megaraid_sas [last unloaded: microcode]
>> > Apr 21 17:27:47 localhost kernel:
>> > Apr 21 17:27:47 localhost kernel: Pid: 27892, comm: cc1 Tainted: G        W   2.6.34-rc4-mm1+ #4 D2519/PRIMERGY
>> > Apr 21 17:27:47 localhost kernel: RIP: 0010:[<ffffffff8114e9cf>]  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
>> > Apr 21 17:27:47 localhost kernel: RSP: 0000:ffff88008d9efe08  EFLAGS: 00010246
>> > Apr 21 17:27:47 localhost kernel: RAX: ffffea0000000000 RBX: ffffea0000241100 RCX: 0000000000000001
>> > Apr 21 17:27:47 localhost kernel: RDX: 000000000000a4e0 RSI: ffff880621a4ab00 RDI: 000000000149c03e
>> > Apr 21 17:27:47 localhost kernel: RBP: ffff88008d9efe38 R08: 0000000000000000 R09: 0000000000000000
>> > Apr 21 17:27:47 localhost kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff880621a4aae8
>> > Apr 21 17:27:47 localhost kernel: R13: 00000000bf811000 R14: 000000000149c03e R15: 0000000000000000
>> > Apr 21 17:27:47 localhost kernel: FS:  00007fe6abc90700(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
>> > Apr 21 17:27:47 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > Apr 21 17:27:47 localhost kernel: CR2: 00007fe6a37279a0 CR3: 000000008d942000 CR4: 00000000000006e0
>> > Apr 21 17:27:47 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > Apr 21 17:27:47 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> > Apr 21 17:27:47 localhost kernel: Process cc1 (pid: 27892, threadinfo ffff88008d9ee000, task ffff8800b23ec820)
>> > Apr 21 17:27:47 localhost kernel: Stack:
>> > Apr 21 17:27:47 localhost kernel: ffffea000101aee8 ffff880621a4aae8 ffff88008d9efe38 00007fe6a37279a0
>> > Apr 21 17:27:47 localhost kernel: <0> ffff8805d9706d90 ffff880621a4aa00 ffff88008d9efef8 ffffffff81126d05
>> > Apr 21 17:27:47 localhost kernel: <0> ffff88008d9efec8 0000000000000246 0000000000000000 ffffffff81586533
>> > Apr 21 17:27:47 localhost kernel: Call Trace:
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff81126d05>] handle_mm_fault+0x995/0x9b0
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff81586533>] ? do_page_fault+0x103/0x330
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff8104bf40>] ? finish_task_switch+0x0/0xf0
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff8158659e>] do_page_fault+0x16e/0x330
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff81582f35>] page_fault+0x25/0x30
>> > Apr 21 17:27:47 localhost kernel: Code: 53 08 85 c9 0f 84 32 ff ff ff 8d 41 01 89 4d d8 89 45 d4 8b 75 d4 8b 45 d8 f0 0f b1 32 89 45 dc 8b 45 dc 39 c8 74 aa 89 c1 eb d7 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
>> > Apr 21 17:27:47 localhost kernel: RIP  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
>> > Apr 21 17:27:47 localhost kernel: RSP <ffff88008d9efe08>
>> > Apr 21 17:27:47 localhost kernel: ---[ end trace 4860ab585c1fcddb ]---
>> >
>> >
>> >
>> > This patch adds vma_address_safe(). And update [start, end, pgoff]
>> > under seq counter.
>> >
>> > Cc: Mel Gorman <mel@csn.ul.ie>
>> > Cc: Minchan Kim <minchan.kim@gmail.com>
>> > Cc: Christoph Lameter <cl@linux-foundation.org>
>> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>> That's exactly same what I have in mind. :)
>> But I am hesitating. That's because AFAIR, we try to remove seqlock. Right?
>
> Ah,..."don't use seqlock" is trend ?
>
>> But in this case, seqlock is good, I think. :)
>>
> BTW, this isn't seqlock but seq_counter :)
>
> I'm still testing. What I doubt other than vma_address() is fork().
> at fork(), followings _may_ happen. (but I'm not sure).
>
>        chain vma.
>        copy page table.
>           -> migration entry is copied, too.
>
> At remap,
>        for each vma
>            look into page table and replace.
>
> Then,
>                                                rmap_walk().
>        fork(parent, child)
>                                                look into child's page table.
>                                                => we fond nothing.
>        spin_lock(child's pagetable);
>        spin_lock(parant's page table);
>        copy migration entry
>        spin_unlock(paranet's page table)
>        spin_unlock(child's page table)
>                                                update parent's paga table
>
> If we always find parant's page table before child's , there is no race.
> But I can't get prit_tree's list order as clear image. Hmm.
>
> Thanks,
> -Kame
>

That's a good point, Kame.
I looked into prio_tree quickly.
If I understand it right, the list order is backward.

dup_mmap() calls vma_prio_tree_add().

 * prio_tree_root
 *      |
 *      A       vm_set.head
 *     / \      /
 *    L   R -> H-I-J-K-M-N-O-P-Q-S
 *    ^   ^    <-- vm_set.list -->
 *  tree nodes
 *

Maybe the parent's and children's vmas are among H~S.
Then, the comment says:

"vma->shared.vm_set.parent != NULL    ==> a tree node"
So vma_prio_tree_add() calls list_add(), not list_add_tail().

Anyway, I think the order isn't mixed.
So, could we traverse it backward in rmap?

-- 
Kind regards,
Minchan Kim
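
Below is a minimal userspace sketch of the list-ordering point above. It only
mimics the insertion/iteration semantics of the kernel's list_add() (insert
right after the given position); the struct, the "parent"/"child" labels and
the walk directions are illustrative stand-ins, not the real prio_tree/vma
code.

/*
 * Toy circular doubly linked list, mimicking list_add() semantics only.
 * With "insert first", a forward walk sees the most recently linked entry
 * (the child's vma) before the parent's, while a backward walk sees the
 * parent's vma first -- the order the fork() race wants.
 */
#include <stdio.h>

struct node {
	struct node *prev, *next;
	const char *name;
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

/* insert right after 'pos', like the kernel's list_add() */
static void list_add_after(struct node *pos, struct node *item)
{
	item->next = pos->next;
	item->prev = pos;
	pos->next->prev = item;
	pos->next = item;
}

int main(void)
{
	struct node head = { .name = "head" };
	struct node parent = { .name = "parent" }, child = { .name = "child" };
	struct node *p;

	list_init(&head);
	list_add_after(&head, &parent);	/* parent's vma linked first */
	list_add_after(&head, &child);	/* child's vma linked at fork() time */

	printf("forward walk :");	/* prints: child parent */
	for (p = head.next; p != &head; p = p->next)
		printf(" %s", p->name);

	printf("\nbackward walk:");	/* prints: parent child */
	for (p = head.prev; p != &head; p = p->prev)
		printf(" %s", p->name);
	printf("\n");
	return 0;
}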


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-23  7:00       ` Minchan Kim
@ 2010-04-23  7:17         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-23  7:17 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, Mel Gorman, Christoph Lameter, akpm, linux-kernel

On Fri, 23 Apr 2010 16:00:31 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Fri, Apr 23, 2010 at 2:27 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Fri, 23 Apr 2010 14:11:37 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >
> >> On Fri, Apr 23, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> >
> >> > This patch itself is for -mm ..but may need to go -stable tree for memory
> >> > hotplug. (but we've got no report to hit this race...)
> >> >
> >> > This one is the simplest, I think and works well on my test set.
> >> > ==
> >> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> >
> >> > In rmap.c, at checking rmap in vma chain in page->mapping, anon_vma->lock
> >> > or mapping->i_mmap_lock is held and enter following loop.
> >> >
> >> >        for_each_vma_in_this_rmap_link(list from page->mapping) {
> >> >                unsigned long address = vma_address(page, vma);
> >> >                if (address == -EFAULT)
> >> >                        continue;
> >> >                ....
> >> >        }
> >> >
> >> > vma_address is checking [start, end, pgoff] v.s. page->index.
> >> >
> >> > But vma's [start, end, pgoff] is updated without locks. vma_address()
> >> > can hit a race and may return wrong result.
> >> >
> >> > This bahavior is no problem in usual routine as try_to_unmap() etc...
> >> > But for page migration, rmap_walk() has to find all migration_ptes
> >> > which migration code overwritten valid ptes. This race is critical and cause
> >> > BUG that a migration_pte is sometimes not removed.
> >> >
> >> > pr 21 17:27:47 localhost kernel: ------------[ cut here ]------------
> >> > Apr 21 17:27:47 localhost kernel: kernel BUG at include/linux/swapops.h:105!
> >> > Apr 21 17:27:47 localhost kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> >> > Apr 21 17:27:47 localhost kernel: last sysfs file: /sys/devices/virtual/net/br0/statistics/collisions
> >> > Apr 21 17:27:47 localhost kernel: CPU 3
> >> > Apr 21 17:27:47 localhost kernel: Modules linked in: fuse sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput ioatdma ppdev parport_pc i5000_edac bnx2 iTCO_wdt edac_core iTCO_vendor_support shpchp parport e1000e kvm_intel dca kvm i2c_i801 i2c_core i5k_amb pcspkr megaraid_sas [last unloaded: microcode]
> >> > Apr 21 17:27:47 localhost kernel:
> >> > Apr 21 17:27:47 localhost kernel: Pid: 27892, comm: cc1 Tainted: G        W   2.6.34-rc4-mm1+ #4 D2519/PRIMERGY
> >> > Apr 21 17:27:47 localhost kernel: RIP: 0010:[<ffffffff8114e9cf>]  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
> >> > Apr 21 17:27:47 localhost kernel: RSP: 0000:ffff88008d9efe08  EFLAGS: 00010246
> >> > Apr 21 17:27:47 localhost kernel: RAX: ffffea0000000000 RBX: ffffea0000241100 RCX: 0000000000000001
> >> > Apr 21 17:27:47 localhost kernel: RDX: 000000000000a4e0 RSI: ffff880621a4ab00 RDI: 000000000149c03e
> >> > Apr 21 17:27:47 localhost kernel: RBP: ffff88008d9efe38 R08: 0000000000000000 R09: 0000000000000000
> >> > Apr 21 17:27:47 localhost kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff880621a4aae8
> >> > Apr 21 17:27:47 localhost kernel: R13: 00000000bf811000 R14: 000000000149c03e R15: 0000000000000000
> >> > Apr 21 17:27:47 localhost kernel: FS:  00007fe6abc90700(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
> >> > Apr 21 17:27:47 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> > Apr 21 17:27:47 localhost kernel: CR2: 00007fe6a37279a0 CR3: 000000008d942000 CR4: 00000000000006e0
> >> > Apr 21 17:27:47 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> > Apr 21 17:27:47 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >> > Apr 21 17:27:47 localhost kernel: Process cc1 (pid: 27892, threadinfo ffff88008d9ee000, task ffff8800b23ec820)
> >> > Apr 21 17:27:47 localhost kernel: Stack:
> >> > Apr 21 17:27:47 localhost kernel: ffffea000101aee8 ffff880621a4aae8 ffff88008d9efe38 00007fe6a37279a0
> >> > Apr 21 17:27:47 localhost kernel: <0> ffff8805d9706d90 ffff880621a4aa00 ffff88008d9efef8 ffffffff81126d05
> >> > Apr 21 17:27:47 localhost kernel: <0> ffff88008d9efec8 0000000000000246 0000000000000000 ffffffff81586533
> >> > Apr 21 17:27:47 localhost kernel: Call Trace:
> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff81126d05>] handle_mm_fault+0x995/0x9b0
> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff81586533>] ? do_page_fault+0x103/0x330
> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff8104bf40>] ? finish_task_switch+0x0/0xf0
> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff8158659e>] do_page_fault+0x16e/0x330
> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff81582f35>] page_fault+0x25/0x30
> >> > Apr 21 17:27:47 localhost kernel: Code: 53 08 85 c9 0f 84 32 ff ff ff 8d 41 01 89 4d d8 89 45 d4 8b 75 d4 8b 45 d8 f0 0f b1 32 89 45 dc 8b 45 dc 39 c8 74 aa 89 c1 eb d7 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
> >> > Apr 21 17:27:47 localhost kernel: RIP  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
> >> > Apr 21 17:27:47 localhost kernel: RSP <ffff88008d9efe08>
> >> > Apr 21 17:27:47 localhost kernel: ---[ end trace 4860ab585c1fcddb ]---
> >> >
> >> >
> >> >
> >> > This patch adds vma_address_safe(). And update [start, end, pgoff]
> >> > under seq counter.
> >> >
> >> > Cc: Mel Gorman <mel@csn.ul.ie>
> >> > Cc: Minchan Kim <minchan.kim@gmail.com>
> >> > Cc: Christoph Lameter <cl@linux-foundation.org>
> >> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >>
> >> That's exactly same what I have in mind. :)
> >> But I am hesitating. That's because AFAIR, we try to remove seqlock. Right?
> >
> > Ah,..."don't use seqlock" is trend ?
> >
> >> But in this case, seqlock is good, I think. :)
> >>
> > BTW, this isn't seqlock but seq_counter :)
> >
> > I'm still testing. What I doubt other than vma_address() is fork().
> > at fork(), followings _may_ happen. (but I'm not sure).
> >
> >        chain vma.
> >        copy page table.
> >           -> migration entry is copied, too.
> >
> > At remap,
> >        for each vma
> >            look into page table and replace.
> >
> > Then,
> >                                                rmap_walk().
> >        fork(parent, child)
> >                                                look into child's page table.
> >                                                => we fond nothing.
> >        spin_lock(child's pagetable);
> >        spin_lock(parant's page table);
> >        copy migration entry
> >        spin_unlock(paranet's page table)
> >        spin_unlock(child's page table)
> >                                                update parent's paga table
> >
> > If we always find parant's page table before child's , there is no race.
> > But I can't get prit_tree's list order as clear image. Hmm.
> >
> > Thanks,
> > -Kame
> >
> 
> That's good point, Kame.
> I looked into prio_tree quickly.
> If I understand it right, list order is backward.
> 
> dup_mmap calls vma_prio_tree_add.
> 
>  * prio_tree_root
>  *      |
>  *      A       vm_set.head
>  *     / \      /
>  *    L   R -> H-I-J-K-M-N-O-P-Q-S
>  *    ^   ^    <-- vm_set.list -->
>  *  tree nodes
>  *
> 
> Maybe, parent and childs's vma are H~S.
> Then, comment said.
> 
> "vma->shared.vm_set.parent != NULL    ==> a tree node"
> So vma_prio_tree_add call not list_add_tail but list_add.
> 
Ah, thank you for the explanation.

> Anyway, I think order isn't mixed.
> So, could we traverse it by backward in rmap?
> 
Doesn't it make the prio-tree code dirty?

Here is another idea... but, hmm, does this make fork() slow in some cases?

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

At page migration, we replace a pte with a migration_entry, which has
a similar format to a swap_entry, and replace it with the real pfn at
the end of migration. But there is a race with fork()'s copy_page_range().

Assume page migration on CPU A and fork() on CPU B. On CPU A, a page of
a process is under migration. On CPU B, that page's pte is being copied.


	CPUA			CPU B
				do_fork()
				copy_mm() (from process 1 to process2)
				insert new vma to mmap_list (if inode/anon_vma)
	pte_lock(process1)
	unmap a page
	insert migration_entry
	pte_unlock(process1)

	migrate page copy
				copy_page_range
	remap new page by rmap_walk()
	pte_lock(process2)
	found no pte.
	pte_unlock(process2)
				pte_lock(process2)
				pte_lock(process1)
				copy migration entry to process2
				pte_unlock(process1)
				pte_unlock(process2)
	pte_lock(process1)
	replace migration entry
	to new page's pte.
	pte_unlock(process1)

Then, some serialization is necessary. IIUC, this is a very rare event,
so I think copy_page_range() can wait for the end of migration.


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/memory.c |   24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

Index: linux-2.6.34-rc4-mm1/mm/memory.c
===================================================================
--- linux-2.6.34-rc4-mm1.orig/mm/memory.c
+++ linux-2.6.34-rc4-mm1/mm/memory.c
@@ -675,15 +675,8 @@ copy_one_pte(struct mm_struct *dst_mm, s
 			}
 			if (likely(!non_swap_entry(entry)))
 				rss[MM_SWAPENTS]++;
-			else if (is_write_migration_entry(entry) &&
-					is_cow_mapping(vm_flags)) {
-				/*
-				 * COW mappings require pages in both parent
-				 * and child to be set to read.
-				 */
-				make_migration_entry_read(&entry);
-				pte = swp_entry_to_pte(entry);
-				set_pte_at(src_mm, addr, src_pte, pte);
+			else {
+				BUG();
 			}
 		}
 		goto out_set_pte;
@@ -760,6 +753,19 @@ again:
 			progress++;
 			continue;
 		}
+		if (unlikely(!pte_present(*src_pte) && !pte_file(*src_pte))) {
+			entry = pte_to_swp_entry(*src_pte);
+			if (is_migration_entry(entry)) {
+				/*
+				 * Because copying pte has the race with
+				 * pte rewriting of migration, release lock
+				 * and retry.
+				 */
+				progress = 0;
+				entry.val = 0;
+				break;
+			}
+		}
 		entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
 							vma, addr, rss);
 		if (entry.val)
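
As a rough illustration of the design choice above -- retry with the locks
dropped instead of sleeping inside copy_page_range() -- here is a small
userspace model. It assumes only that fork() and the migration remap
serialize on the same pte lock; the thread names, the mutex and the flag
are stand-ins, not kernel code.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;	/* "pte lock" */
static int pte_is_migration_entry = 1;	/* set before this model starts */

static void *migration_thread(void *arg)
{
	usleep(1000);			/* ... page copy in progress ... */
	pthread_mutex_lock(&ptl);	/* rmap_walk(): remap the new page */
	pte_is_migration_entry = 0;	/* migration entry -> real pte */
	pthread_mutex_unlock(&ptl);
	return NULL;
}

static void *fork_copy_thread(void *arg)
{
	int retries = 0;

	for (;;) {
		pthread_mutex_lock(&ptl);
		if (!pte_is_migration_entry) {
			/* safe to copy the pte now */
			pthread_mutex_unlock(&ptl);
			break;
		}
		/* like the patch: drop the lock and retry */
		pthread_mutex_unlock(&ptl);
		retries++;
		sched_yield();
	}
	printf("copied the pte after %d retries\n", retries);
	return NULL;
}

int main(void)
{
	pthread_t m, f;

	pthread_create(&m, NULL, migration_thread, NULL);
	pthread_create(&f, NULL, fork_copy_thread, NULL);
	pthread_join(m, NULL);
	pthread_join(f, NULL);
	return 0;
}

Dropping the lock between passes is what lets the migration side make
progress and clear the entry, so fork() only spins for the duration of one
migration in the worst case.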







^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-23  7:17         ` KAMEZAWA Hiroyuki
@ 2010-04-23  7:53           ` Minchan Kim
  -1 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2010-04-23  7:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Mel Gorman, Christoph Lameter, akpm, linux-kernel

On Fri, Apr 23, 2010 at 4:17 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 23 Apr 2010 16:00:31 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> On Fri, Apr 23, 2010 at 2:27 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Fri, 23 Apr 2010 14:11:37 +0900
>> > Minchan Kim <minchan.kim@gmail.com> wrote:
>> >
>> >> On Fri, Apr 23, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
>> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >> >
>> >> > This patch itself is for -mm ..but may need to go -stable tree for memory
>> >> > hotplug. (but we've got no report to hit this race...)
>> >> >
>> >> > This one is the simplest, I think and works well on my test set.
>> >> > ==
>> >> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> >> >
>> >> > In rmap.c, at checking rmap in vma chain in page->mapping, anon_vma->lock
>> >> > or mapping->i_mmap_lock is held and enter following loop.
>> >> >
>> >> >        for_each_vma_in_this_rmap_link(list from page->mapping) {
>> >> >                unsigned long address = vma_address(page, vma);
>> >> >                if (address == -EFAULT)
>> >> >                        continue;
>> >> >                ....
>> >> >        }
>> >> >
>> >> > vma_address is checking [start, end, pgoff] v.s. page->index.
>> >> >
>> >> > But vma's [start, end, pgoff] is updated without locks. vma_address()
>> >> > can hit a race and may return wrong result.
>> >> >
>> >> > This bahavior is no problem in usual routine as try_to_unmap() etc...
>> >> > But for page migration, rmap_walk() has to find all migration_ptes
>> >> > which migration code overwritten valid ptes. This race is critical and cause
>> >> > BUG that a migration_pte is sometimes not removed.
>> >> >
>> >> > pr 21 17:27:47 localhost kernel: ------------[ cut here ]------------
>> >> > Apr 21 17:27:47 localhost kernel: kernel BUG at include/linux/swapops.h:105!
>> >> > Apr 21 17:27:47 localhost kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
>> >> > Apr 21 17:27:47 localhost kernel: last sysfs file: /sys/devices/virtual/net/br0/statistics/collisions
>> >> > Apr 21 17:27:47 localhost kernel: CPU 3
>> >> > Apr 21 17:27:47 localhost kernel: Modules linked in: fuse sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput ioatdma ppdev parport_pc i5000_edac bnx2 iTCO_wdt edac_core iTCO_vendor_support shpchp parport e1000e kvm_intel dca kvm i2c_i801 i2c_core i5k_amb pcspkr megaraid_sas [last unloaded: microcode]
>> >> > Apr 21 17:27:47 localhost kernel:
>> >> > Apr 21 17:27:47 localhost kernel: Pid: 27892, comm: cc1 Tainted: G        W   2.6.34-rc4-mm1+ #4 D2519/PRIMERGY
>> >> > Apr 21 17:27:47 localhost kernel: RIP: 0010:[<ffffffff8114e9cf>]  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
>> >> > Apr 21 17:27:47 localhost kernel: RSP: 0000:ffff88008d9efe08  EFLAGS: 00010246
>> >> > Apr 21 17:27:47 localhost kernel: RAX: ffffea0000000000 RBX: ffffea0000241100 RCX: 0000000000000001
>> >> > Apr 21 17:27:47 localhost kernel: RDX: 000000000000a4e0 RSI: ffff880621a4ab00 RDI: 000000000149c03e
>> >> > Apr 21 17:27:47 localhost kernel: RBP: ffff88008d9efe38 R08: 0000000000000000 R09: 0000000000000000
>> >> > Apr 21 17:27:47 localhost kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff880621a4aae8
>> >> > Apr 21 17:27:47 localhost kernel: R13: 00000000bf811000 R14: 000000000149c03e R15: 0000000000000000
>> >> > Apr 21 17:27:47 localhost kernel: FS:  00007fe6abc90700(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
>> >> > Apr 21 17:27:47 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> >> > Apr 21 17:27:47 localhost kernel: CR2: 00007fe6a37279a0 CR3: 000000008d942000 CR4: 00000000000006e0
>> >> > Apr 21 17:27:47 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> >> > Apr 21 17:27:47 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> >> > Apr 21 17:27:47 localhost kernel: Process cc1 (pid: 27892, threadinfo ffff88008d9ee000, task ffff8800b23ec820)
>> >> > Apr 21 17:27:47 localhost kernel: Stack:
>> >> > Apr 21 17:27:47 localhost kernel: ffffea000101aee8 ffff880621a4aae8 ffff88008d9efe38 00007fe6a37279a0
>> >> > Apr 21 17:27:47 localhost kernel: <0> ffff8805d9706d90 ffff880621a4aa00 ffff88008d9efef8 ffffffff81126d05
>> >> > Apr 21 17:27:47 localhost kernel: <0> ffff88008d9efec8 0000000000000246 0000000000000000 ffffffff81586533
>> >> > Apr 21 17:27:47 localhost kernel: Call Trace:
>> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff81126d05>] handle_mm_fault+0x995/0x9b0
>> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff81586533>] ? do_page_fault+0x103/0x330
>> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff8104bf40>] ? finish_task_switch+0x0/0xf0
>> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff8158659e>] do_page_fault+0x16e/0x330
>> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff81582f35>] page_fault+0x25/0x30
>> >> > Apr 21 17:27:47 localhost kernel: Code: 53 08 85 c9 0f 84 32 ff ff ff 8d 41 01 89 4d d8 89 45 d4 8b 75 d4 8b 45 d8 f0 0f b1 32 89 45 dc 8b 45 dc 39 c8 74 aa 89 c1 eb d7 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
>> >> > Apr 21 17:27:47 localhost kernel: RIP  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
>> >> > Apr 21 17:27:47 localhost kernel: RSP <ffff88008d9efe08>
>> >> > Apr 21 17:27:47 localhost kernel: ---[ end trace 4860ab585c1fcddb ]---
>> >> >
>> >> >
>> >> >
>> >> > This patch adds vma_address_safe(). And update [start, end, pgoff]
>> >> > under seq counter.
>> >> >
>> >> > Cc: Mel Gorman <mel@csn.ul.ie>
>> >> > Cc: Minchan Kim <minchan.kim@gmail.com>
>> >> > Cc: Christoph Lameter <cl@linux-foundation.org>
>> >> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> >>
>> >> That's exactly same what I have in mind. :)
>> >> But I am hesitating. That's because AFAIR, we try to remove seqlock. Right?
>> >
>> > Ah,..."don't use seqlock" is trend ?
>> >
>> >> But in this case, seqlock is good, I think. :)
>> >>
>> > BTW, this isn't seqlock but seq_counter :)
>> >
>> > I'm still testing. What I doubt other than vma_address() is fork().
>> > at fork(), followings _may_ happen. (but I'm not sure).
>> >
>> >        chain vma.
>> >        copy page table.
>> >           -> migration entry is copied, too.
>> >
>> > At remap,
>> >        for each vma
>> >            look into page table and replace.
>> >
>> > Then,
>> >                                                rmap_walk().
>> >        fork(parent, child)
>> >                                                look into child's page table.
>> >                                                => we fond nothing.
>> >        spin_lock(child's pagetable);
>> >        spin_lock(parant's page table);
>> >        copy migration entry
>> >        spin_unlock(paranet's page table)
>> >        spin_unlock(child's page table)
>> >                                                update parent's paga table
>> >
>> > If we always find parant's page table before child's , there is no race.
>> > But I can't get prit_tree's list order as clear image. Hmm.
>> >
>> > Thanks,
>> > -Kame
>> >
>>
>> That's good point, Kame.
>> I looked into prio_tree quickly.
>> If I understand it right, list order is backward.
>>
>> dup_mmap calls vma_prio_tree_add.
>>
>>  * prio_tree_root
>>  *      |
>>  *      A       vm_set.head
>>  *     / \      /
>>  *    L   R -> H-I-J-K-M-N-O-P-Q-S
>>  *    ^   ^    <-- vm_set.list -->
>>  *  tree nodes
>>  *
>>
>> Maybe, parent and childs's vma are H~S.
>> Then, comment said.
>>
>> "vma->shared.vm_set.parent != NULL    ==> a tree node"
>> So vma_prio_tree_add call not list_add_tail but list_add.
>>
> Ah, thank you for explanation.
>
>> Anyway, I think order isn't mixed.
>> So, could we traverse it by backward in rmap?
>>
> Doesn't it make prio-tree code dirty ?
>
> Here is another idea....but ..hmm. Does this make fork() slow in some cases ?

Yes, this idea looks good to me. :)
Great, Kame.

But as you said, migration is rare,
so we wouldn't lose much performance in most cases.

Actually, if I understand prio_tree right, I think backward walking of
the prio_tree is not bad.
I don't think it makes the code dirty. :)
I admit opinions differ on that.

I like both ideas.
I'll pass the decision to others.  :)

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-23  7:53           ` Minchan Kim
@ 2010-04-23  7:55             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-23  7:55 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, Mel Gorman, Christoph Lameter, akpm, linux-kernel

On Fri, 23 Apr 2010 16:53:44 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Fri, Apr 23, 2010 at 4:17 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Fri, 23 Apr 2010 16:00:31 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >
> >> On Fri, Apr 23, 2010 at 2:27 PM, KAMEZAWA Hiroyuki
> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> > On Fri, 23 Apr 2010 14:11:37 +0900
> >> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >> >
> >> >> On Fri, Apr 23, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
> >> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> >> >
> >> >> > This patch itself is for -mm ..but may need to go -stable tree for memory
> >> >> > hotplug. (but we've got no report to hit this race...)
> >> >> >
> >> >> > This one is the simplest, I think and works well on my test set.
> >> >> > ==
> >> >> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> >> >
> >> >> > In rmap.c, at checking rmap in vma chain in page->mapping, anon_vma->lock
> >> >> > or mapping->i_mmap_lock is held and enter following loop.
> >> >> >
> >> >> >        for_each_vma_in_this_rmap_link(list from page->mapping) {
> >> >> >                unsigned long address = vma_address(page, vma);
> >> >> >                if (address == -EFAULT)
> >> >> >                        continue;
> >> >> >                ....
> >> >> >        }
> >> >> >
> >> >> > vma_address is checking [start, end, pgoff] v.s. page->index.
> >> >> >
> >> >> > But vma's [start, end, pgoff] is updated without locks. vma_address()
> >> >> > can hit a race and may return wrong result.
> >> >> >
> >> >> > This bahavior is no problem in usual routine as try_to_unmap() etc...
> >> >> > But for page migration, rmap_walk() has to find all migration_ptes
> >> >> > which migration code overwritten valid ptes. This race is critical and cause
> >> >> > BUG that a migration_pte is sometimes not removed.
> >> >> >
> >> >> > pr 21 17:27:47 localhost kernel: ------------[ cut here ]------------
> >> >> > Apr 21 17:27:47 localhost kernel: kernel BUG at include/linux/swapops.h:105!
> >> >> > Apr 21 17:27:47 localhost kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> >> >> > Apr 21 17:27:47 localhost kernel: last sysfs file: /sys/devices/virtual/net/br0/statistics/collisions
> >> >> > Apr 21 17:27:47 localhost kernel: CPU 3
> >> >> > Apr 21 17:27:47 localhost kernel: Modules linked in: fuse sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput ioatdma ppdev parport_pc i5000_edac bnx2 iTCO_wdt edac_core iTCO_vendor_support shpchp parport e1000e kvm_intel dca kvm i2c_i801 i2c_core i5k_amb pcspkr megaraid_sas [last unloaded: microcode]
> >> >> > Apr 21 17:27:47 localhost kernel:
> >> >> > Apr 21 17:27:47 localhost kernel: Pid: 27892, comm: cc1 Tainted: G        W   2.6.34-rc4-mm1+ #4 D2519/PRIMERGY
> >> >> > Apr 21 17:27:47 localhost kernel: RIP: 0010:[<ffffffff8114e9cf>]  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
> >> >> > Apr 21 17:27:47 localhost kernel: RSP: 0000:ffff88008d9efe08  EFLAGS: 00010246
> >> >> > Apr 21 17:27:47 localhost kernel: RAX: ffffea0000000000 RBX: ffffea0000241100 RCX: 0000000000000001
> >> >> > Apr 21 17:27:47 localhost kernel: RDX: 000000000000a4e0 RSI: ffff880621a4ab00 RDI: 000000000149c03e
> >> >> > Apr 21 17:27:47 localhost kernel: RBP: ffff88008d9efe38 R08: 0000000000000000 R09: 0000000000000000
> >> >> > Apr 21 17:27:47 localhost kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff880621a4aae8
> >> >> > Apr 21 17:27:47 localhost kernel: R13: 00000000bf811000 R14: 000000000149c03e R15: 0000000000000000
> >> >> > Apr 21 17:27:47 localhost kernel: FS:  00007fe6abc90700(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
> >> >> > Apr 21 17:27:47 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> >> > Apr 21 17:27:47 localhost kernel: CR2: 00007fe6a37279a0 CR3: 000000008d942000 CR4: 00000000000006e0
> >> >> > Apr 21 17:27:47 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> >> > Apr 21 17:27:47 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >> >> > Apr 21 17:27:47 localhost kernel: Process cc1 (pid: 27892, threadinfo ffff88008d9ee000, task ffff8800b23ec820)
> >> >> > Apr 21 17:27:47 localhost kernel: Stack:
> >> >> > Apr 21 17:27:47 localhost kernel: ffffea000101aee8 ffff880621a4aae8 ffff88008d9efe38 00007fe6a37279a0
> >> >> > Apr 21 17:27:47 localhost kernel: <0> ffff8805d9706d90 ffff880621a4aa00 ffff88008d9efef8 ffffffff81126d05
> >> >> > Apr 21 17:27:47 localhost kernel: <0> ffff88008d9efec8 0000000000000246 0000000000000000 ffffffff81586533
> >> >> > Apr 21 17:27:47 localhost kernel: Call Trace:
> >> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff81126d05>] handle_mm_fault+0x995/0x9b0
> >> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff81586533>] ? do_page_fault+0x103/0x330
> >> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff8104bf40>] ? finish_task_switch+0x0/0xf0
> >> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff8158659e>] do_page_fault+0x16e/0x330
> >> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff81582f35>] page_fault+0x25/0x30
> >> >> > Apr 21 17:27:47 localhost kernel: Code: 53 08 85 c9 0f 84 32 ff ff ff 8d 41 01 89 4d d8 89 45 d4 8b 75 d4 8b 45 d8 f0 0f b1 32 89 45 dc 8b 45 dc 39 c8 74 aa 89 c1 eb d7 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
> >> >> > Apr 21 17:27:47 localhost kernel: RIP  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
> >> >> > Apr 21 17:27:47 localhost kernel: RSP <ffff88008d9efe08>
> >> >> > Apr 21 17:27:47 localhost kernel: ---[ end trace 4860ab585c1fcddb ]---
> >> >> >
> >> >> >
> >> >> >
> >> >> > This patch adds vma_address_safe(). And update [start, end, pgoff]
> >> >> > under seq counter.
> >> >> >
> >> >> > Cc: Mel Gorman <mel@csn.ul.ie>
> >> >> > Cc: Minchan Kim <minchan.kim@gmail.com>
> >> >> > Cc: Christoph Lameter <cl@linux-foundation.org>
> >> >> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> >>
> >> >> That's exactly same what I have in mind. :)
> >> >> But I am hesitating. That's because AFAIR, we try to remove seqlock. Right?
> >> >
> >> > Ah,..."don't use seqlock" is trend ?
> >> >
> >> >> But in this case, seqlock is good, I think. :)
> >> >>
> >> > BTW, this isn't seqlock but seq_counter :)
> >> >
> >> > I'm still testing. What I doubt other than vma_address() is fork().
> >> > at fork(), followings _may_ happen. (but I'm not sure).
> >> >
> >> >        chain vma.
> >> >        copy page table.
> >> >           -> migration entry is copied, too.
> >> >
> >> > At remap,
> >> >        for each vma
> >> >            look into page table and replace.
> >> >
> >> > Then,
> >> >                                                rmap_walk().
> >> >        fork(parent, child)
> >> >                                                look into child's page table.
> >> >                                                => we fond nothing.
> >> >        spin_lock(child's pagetable);
> >> >        spin_lock(parant's page table);
> >> >        copy migration entry
> >> >        spin_unlock(paranet's page table)
> >> >        spin_unlock(child's page table)
> >> >                                                update parent's paga table
> >> >
> >> > If we always find parant's page table before child's , there is no race.
> >> > But I can't get prit_tree's list order as clear image. Hmm.
> >> >
> >> > Thanks,
> >> > -Kame
> >> >
> >>
> >> That's good point, Kame.
> >> I looked into prio_tree quickly.
> >> If I understand it right, list order is backward.
> >>
> >> dup_mmap calls vma_prio_tree_add.
> >>
> >>  * prio_tree_root
> >>  *      |
> >>  *      A       vm_set.head
> >>  *     / \      /
> >>  *    L   R -> H-I-J-K-M-N-O-P-Q-S
> >>  *    ^   ^    <-- vm_set.list -->
> >>  *  tree nodes
> >>  *
> >>
> >> Maybe, parent and childs's vma are H~S.
> >> Then, comment said.
> >>
> >> "vma->shared.vm_set.parent != NULL    ==> a tree node"
> >> So vma_prio_tree_add call not list_add_tail but list_add.
> >>
> > Ah, thank you for explanation.
> >
> >> Anyway, I think order isn't mixed.
> >> So, could we traverse it by backward in rmap?
> >>
> > Doesn't it make prio-tree code dirty ?
> >
> > Here is another idea....but ..hmm. Does this make fork() slow in some cases ?
> 
> Yes. I think this idea is good to me. :)
> Great, Kame.
> 
> But as you said, migration is rare.
> so we wouldn't lost much performance in many case.
> 
> Actually, If I understand prio_tree right, I think backward walking of
> prio_tree is nod bad.
> I don't think it make code dirty. :)
> I admit it's different per people.
> 
It's okay to me. My concern was the difficulty of maintenance.

> I like both ideas.
> I passes decision to others.  :)
> 
Me, too.

Maybe the remaining problem is that we need a test case to hit this race.

Thanks,
-Kame
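
To make the missing test case mentioned above concrete, here is a rough
sketch of the kind of stress reproducer it would take: keep migrating a
set of shared anonymous pages between two NUMA nodes while a child process
fork()s constantly, hoping to hit the copy_page_range() window. It assumes
a NUMA machine with at least two nodes, root (MPOL_MF_MOVE_ALL needs
CAP_SYS_NICE because the pages are shared), and libnuma's move_pages(2)
wrapper from <numaif.h> (build with -lnuma); whether this loop is actually
tight enough to trigger the race is unverified, so treat it purely as an
illustration.

#include <numaif.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPAGES 256

int main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	void *pages[NPAGES];
	int nodes[NPAGES], status[NPAGES];
	int i, iter;
	pid_t forker;
	char *buf;

	buf = mmap(NULL, NPAGES * pagesz, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 1, NPAGES * pagesz);	/* fault the pages in */
	for (i = 0; i < NPAGES; i++)
		pages[i] = buf + i * pagesz;

	forker = fork();
	if (forker == 0) {
		/* child: fork()s in a tight loop, so copy_page_range() keeps
		 * running over page tables that still map the shared pages */
		for (;;) {
			pid_t p = fork();
			if (p == 0)
				_exit(0);
			waitpid(p, NULL, 0);
		}
	}

	/* parent: bounce the shared pages between node 0 and node 1 */
	for (iter = 0; iter < 100000; iter++) {
		for (i = 0; i < NPAGES; i++)
			nodes[i] = iter & 1;
		if (move_pages(0, NPAGES, pages, nodes, status,
			       MPOL_MF_MOVE_ALL) < 0)
			perror("move_pages");
	}
	kill(forker, SIGKILL);
	waitpid(forker, NULL, 0);
	return 0;
}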


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
@ 2010-04-23  7:55             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-23  7:55 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, Mel Gorman, Christoph Lameter, akpm, linux-kernel

On Fri, 23 Apr 2010 16:53:44 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Fri, Apr 23, 2010 at 4:17 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Fri, 23 Apr 2010 16:00:31 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >
> >> On Fri, Apr 23, 2010 at 2:27 PM, KAMEZAWA Hiroyuki
> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> > On Fri, 23 Apr 2010 14:11:37 +0900
> >> > Minchan Kim <minchan.kim@gmail.com> wrote:
> >> >
> >> >> On Fri, Apr 23, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
> >> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> >> >
> >> >> > This patch itself is for -mm ..but may need to go -stable tree for memory
> >> >> > hotplug. (but we've got no report to hit this race...)
> >> >> >
> >> >> > This one is the simplest, I think and works well on my test set.
> >> >> > ==
> >> >> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> >> >
> >> >> > In rmap.c, at checking rmap in vma chain in page->mapping, anon_vma->lock
> >> >> > or mapping->i_mmap_lock is held and enter following loop.
> >> >> >
> >> >> > A  A  A  A for_each_vma_in_this_rmap_link(list from page->mapping) {
> >> >> > A  A  A  A  A  A  A  A unsigned long address = vma_address(page, vma);
> >> >> > A  A  A  A  A  A  A  A if (address == -EFAULT)
> >> >> > A  A  A  A  A  A  A  A  A  A  A  A continue;
> >> >> > A  A  A  A  A  A  A  A ....
> >> >> > A  A  A  A }
> >> >> >
> >> >> > vma_address is checking [start, end, pgoff] v.s. page->index.
> >> >> >
> >> >> > But vma's [start, end, pgoff] is updated without locks. vma_address()
> >> >> > can hit a race and may return wrong result.
> >> >> >
> >> >> > This bahavior is no problem in usual routine as try_to_unmap() etc...
> >> >> > But for page migration, rmap_walk() has to find all migration_ptes
> >> >> > which migration code overwritten valid ptes. This race is critical and cause
> >> >> > BUG that a migration_pte is sometimes not removed.
> >> >> >
> >> >> > pr 21 17:27:47 localhost kernel: ------------[ cut here ]------------
> >> >> > Apr 21 17:27:47 localhost kernel: kernel BUG at include/linux/swapops.h:105!
> >> >> > Apr 21 17:27:47 localhost kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> >> >> > Apr 21 17:27:47 localhost kernel: last sysfs file: /sys/devices/virtual/net/br0/statistics/collisions
> >> >> > Apr 21 17:27:47 localhost kernel: CPU 3
> >> >> > Apr 21 17:27:47 localhost kernel: Modules linked in: fuse sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput ioatdma ppdev parport_pc i5000_edac bnx2 iTCO_wdt edac_core iTCO_vendor_support shpchp parport e1000e kvm_intel dca kvm i2c_i801 i2c_core i5k_amb pcspkr megaraid_sas [last unloaded: microcode]
> >> >> > Apr 21 17:27:47 localhost kernel:
> >> >> > Apr 21 17:27:47 localhost kernel: Pid: 27892, comm: cc1 Tainted: G A  A  A  A W A  2.6.34-rc4-mm1+ #4 D2519/PRIMERGY
> >> >> > Apr 21 17:27:47 localhost kernel: RIP: 0010:[<ffffffff8114e9cf>] A [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
> >> >> > Apr 21 17:27:47 localhost kernel: RSP: 0000:ffff88008d9efe08 A EFLAGS: 00010246
> >> >> > Apr 21 17:27:47 localhost kernel: RAX: ffffea0000000000 RBX: ffffea0000241100 RCX: 0000000000000001
> >> >> > Apr 21 17:27:47 localhost kernel: RDX: 000000000000a4e0 RSI: ffff880621a4ab00 RDI: 000000000149c03e
> >> >> > Apr 21 17:27:47 localhost kernel: RBP: ffff88008d9efe38 R08: 0000000000000000 R09: 0000000000000000
> >> >> > Apr 21 17:27:47 localhost kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff880621a4aae8
> >> >> > Apr 21 17:27:47 localhost kernel: R13: 00000000bf811000 R14: 000000000149c03e R15: 0000000000000000
> >> >> > Apr 21 17:27:47 localhost kernel: FS: A 00007fe6abc90700(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
> >> >> > Apr 21 17:27:47 localhost kernel: CS: A 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> >> > Apr 21 17:27:47 localhost kernel: CR2: 00007fe6a37279a0 CR3: 000000008d942000 CR4: 00000000000006e0
> >> >> > Apr 21 17:27:47 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> >> > Apr 21 17:27:47 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >> >> > Apr 21 17:27:47 localhost kernel: Process cc1 (pid: 27892, threadinfo ffff88008d9ee000, task ffff8800b23ec820)
> >> >> > Apr 21 17:27:47 localhost kernel: Stack:
> >> >> > Apr 21 17:27:47 localhost kernel: ffffea000101aee8 ffff880621a4aae8 ffff88008d9efe38 00007fe6a37279a0
> >> >> > Apr 21 17:27:47 localhost kernel: <0> ffff8805d9706d90 ffff880621a4aa00 ffff88008d9efef8 ffffffff81126d05
> >> >> > Apr 21 17:27:47 localhost kernel: <0> ffff88008d9efec8 0000000000000246 0000000000000000 ffffffff81586533
> >> >> > Apr 21 17:27:47 localhost kernel: Call Trace:
> >> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff81126d05>] handle_mm_fault+0x995/0x9b0
> >> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff81586533>] ? do_page_fault+0x103/0x330
> >> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff8104bf40>] ? finish_task_switch+0x0/0xf0
> >> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff8158659e>] do_page_fault+0x16e/0x330
> >> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff81582f35>] page_fault+0x25/0x30
> >> >> > Apr 21 17:27:47 localhost kernel: Code: 53 08 85 c9 0f 84 32 ff ff ff 8d 41 01 89 4d d8 89 45 d4 8b 75 d4 8b 45 d8 f0 0f b1 32 89 45 dc 8b 45 dc 39 c8 74 aa 89 c1 eb d7 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
> >> >> > Apr 21 17:27:47 localhost kernel: RIP A [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
> >> >> > Apr 21 17:27:47 localhost kernel: RSP <ffff88008d9efe08>
> >> >> > Apr 21 17:27:47 localhost kernel: ---[ end trace 4860ab585c1fcddb ]---
> >> >> >
> >> >> >
> >> >> >
> >> >> > This patch adds vma_address_safe(). And update [start, end, pgoff]
> >> >> > under seq counter.
> >> >> >
> >> >> > Cc: Mel Gorman <mel@csn.ul.ie>
> >> >> > Cc: Minchan Kim <minchan.kim@gmail.com>
> >> >> > Cc: Christoph Lameter <cl@linux-foundation.org>
> >> >> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> >>
> >> >> That's exactly the same as what I have in mind. :)
> >> >> But I am hesitating. That's because, AFAIR, we are trying to remove seqlock. Right?
> >> >
> >> > Ah, ..."don't use seqlock" is the trend?
> >> >
> >> >> But in this case, a seqlock is good, I think. :)
> >> >>
> >> > BTW, this isn't a seqlock but a seq counter :)
> >> >
> >> > I'm still testing. What I suspect, other than vma_address(), is fork().
> >> > At fork(), the following _may_ happen (but I'm not sure).
> >> >
> >> >         chain vma.
> >> >         copy page table.
> >> >            -> migration entry is copied, too.
> >> >
> >> > At remap,
> >> >         for each vma
> >> >             look into page table and replace.
> >> >
> >> > Then,
> >> >                                                 rmap_walk().
> >> >         fork(parent, child)
> >> >                                                 look into child's page table.
> >> >                                                 => we find nothing.
> >> >         spin_lock(child's page table);
> >> >         spin_lock(parent's page table);
> >> >         copy migration entry
> >> >         spin_unlock(parent's page table)
> >> >         spin_unlock(child's page table)
> >> >                                                 update parent's page table
> >> >
> >> > If we always find the parent's page table before the child's, there is no race.
> >> > But I can't get a clear image of prio_tree's list order. Hmm.
> >> >
> >> > Thanks,
> >> > -Kame
> >> >
> >>
> >> That's a good point, Kame.
> >> I looked into prio_tree quickly.
> >> If I understand it right, the list order is backward.
> >>
> >> dup_mmap calls vma_prio_tree_add.
> >>
> >>  * prio_tree_root
> >>  *      |
> >>  *      A       vm_set.head
> >>  *     / \      /
> >>  *    L   R -> H-I-J-K-M-N-O-P-Q-S
> >>  *    ^   ^    <-- vm_set.list -->
> >>  *  tree nodes
> >>  *
> >>
> >> Maybe the parent's and children's vmas are H~S.
> >> Then, the comment says:
> >>
> >> "vma->shared.vm_set.parent != NULL    ==> a tree node"
> >> So vma_prio_tree_add calls not list_add_tail but list_add.
> >>
> > Ah, thank you for the explanation.
> >
> >> Anyway, I think the order isn't mixed.
> >> So, could we traverse it backward in rmap?
> >>
> > Doesn't it make the prio-tree code dirty?
> >
> > Here is another idea... but... hmm. Does this make fork() slow in some cases?
> 
> Yes. I think this idea looks good to me. :)
> Great, Kame.
> 
> But as you said, migration is rare,
> so we wouldn't lose much performance in most cases.
> 
> Actually, if I understand prio_tree right, I think backward walking of
> the prio_tree is not bad.
> I don't think it makes the code dirty. :)
> I admit opinions differ from person to person.
> 
It's okay to me. My concern was the difficulty of maintenance.

> I like both ideas.
> I pass the decision to others.  :)
> 
Me, too.

Maybe a problem is that we need a test case to hit this race.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-23  3:01 ` KAMEZAWA Hiroyuki
@ 2010-04-23  9:59   ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2010-04-23  9:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, minchan.kim, Christoph Lameter, akpm, linux-kernel

On Fri, Apr 23, 2010 at 12:01:48PM +0900, KAMEZAWA Hiroyuki wrote:
> This patch itself is for -mm ..but may need to go -stable tree for memory
> hotplug. (but we've got no report to hit this race...)
> 

Only because it's very difficult to hit. Even when running compaction
constantly, it can take anywhere between 10 minutes and 2 hours for me
to reproduce it.

> This one is the simplest, I think and works well on my test set.
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> In rmap.c, at checking rmap in vma chain in page->mapping, anon_vma->lock
> or mapping->i_mmap_lock is held and enter following loop.
> 
> 	for_each_vma_in_this_rmap_link(list from page->mapping) {
> 		unsigned long address = vma_address(page, vma);
> 		if (address == -EFAULT)
> 			continue;
> 		....
> 	}
> 
> vma_address is checking [start, end, pgoff] v.s. page->index.
> 
> But vma's [start, end, pgoff] is updated without locks. vma_address()
> can hit a race and may return wrong result.
> 
> This behavior is no problem in usual routines such as try_to_unmap() etc...
> But for page migration, rmap_walk() has to find all migration_ptes
> with which the migration code overwrote valid ptes. This race is critical and causes
> a BUG in that a migration_pte is sometimes not removed.
> 
> Apr 21 17:27:47 localhost kernel: ------------[ cut here ]------------
> Apr 21 17:27:47 localhost kernel: kernel BUG at include/linux/swapops.h:105!
> Apr 21 17:27:47 localhost kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> Apr 21 17:27:47 localhost kernel: last sysfs file: /sys/devices/virtual/net/br0/statistics/collisions
> Apr 21 17:27:47 localhost kernel: CPU 3
> Apr 21 17:27:47 localhost kernel: Modules linked in: fuse sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput ioatdma ppdev parport_pc i5000_edac bnx2 iTCO_wdt edac_core iTCO_vendor_support shpchp parport e1000e kvm_intel dca kvm i2c_i801 i2c_core i5k_amb pcspkr megaraid_sas [last unloaded: microcode]
> Apr 21 17:27:47 localhost kernel:
> Apr 21 17:27:47 localhost kernel: Pid: 27892, comm: cc1 Tainted: G        W   2.6.34-rc4-mm1+ #4 D2519/PRIMERGY          
> Apr 21 17:27:47 localhost kernel: RIP: 0010:[<ffffffff8114e9cf>]  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
> Apr 21 17:27:47 localhost kernel: RSP: 0000:ffff88008d9efe08  EFLAGS: 00010246
> Apr 21 17:27:47 localhost kernel: RAX: ffffea0000000000 RBX: ffffea0000241100 RCX: 0000000000000001
> Apr 21 17:27:47 localhost kernel: RDX: 000000000000a4e0 RSI: ffff880621a4ab00 RDI: 000000000149c03e
> Apr 21 17:27:47 localhost kernel: RBP: ffff88008d9efe38 R08: 0000000000000000 R09: 0000000000000000
> Apr 21 17:27:47 localhost kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff880621a4aae8
> Apr 21 17:27:47 localhost kernel: R13: 00000000bf811000 R14: 000000000149c03e R15: 0000000000000000
> Apr 21 17:27:47 localhost kernel: FS:  00007fe6abc90700(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
> Apr 21 17:27:47 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr 21 17:27:47 localhost kernel: CR2: 00007fe6a37279a0 CR3: 000000008d942000 CR4: 00000000000006e0
> Apr 21 17:27:47 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Apr 21 17:27:47 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Apr 21 17:27:47 localhost kernel: Process cc1 (pid: 27892, threadinfo ffff88008d9ee000, task ffff8800b23ec820)
> Apr 21 17:27:47 localhost kernel: Stack:
> Apr 21 17:27:47 localhost kernel: ffffea000101aee8 ffff880621a4aae8 ffff88008d9efe38 00007fe6a37279a0
> Apr 21 17:27:47 localhost kernel: <0> ffff8805d9706d90 ffff880621a4aa00 ffff88008d9efef8 ffffffff81126d05
> Apr 21 17:27:47 localhost kernel: <0> ffff88008d9efec8 0000000000000246 0000000000000000 ffffffff81586533
> Apr 21 17:27:47 localhost kernel: Call Trace:
> Apr 21 17:27:47 localhost kernel: [<ffffffff81126d05>] handle_mm_fault+0x995/0x9b0
> Apr 21 17:27:47 localhost kernel: [<ffffffff81586533>] ? do_page_fault+0x103/0x330
> Apr 21 17:27:47 localhost kernel: [<ffffffff8104bf40>] ? finish_task_switch+0x0/0xf0
> Apr 21 17:27:47 localhost kernel: [<ffffffff8158659e>] do_page_fault+0x16e/0x330
> Apr 21 17:27:47 localhost kernel: [<ffffffff81582f35>] page_fault+0x25/0x30
> Apr 21 17:27:47 localhost kernel: Code: 53 08 85 c9 0f 84 32 ff ff ff 8d 41 01 89 4d d8 89 45 d4 8b 75 d4 8b 45 d8 f0 0f b1 32 89 45 dc 8b 45 dc 39 c8 74 aa 89 c1 eb d7 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
> Apr 21 17:27:47 localhost kernel: RIP  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
> Apr 21 17:27:47 localhost kernel: RSP <ffff88008d9efe08>
> Apr 21 17:27:47 localhost kernel: ---[ end trace 4860ab585c1fcddb ]---
> 
> This patch adds vma_address_safe(). And update [start, end, pgoff]
> under seq counter. 
> 

I had considered this idea as well, as it is vaguely similar to how zones get
resized with a seqlock. I was hoping that the existing locking on anon_vma
would be usable by backing off until it was uncontended, but maybe not, so let's
check out this approach.

> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/mm_types.h |    2 ++
>  mm/mmap.c                |   15 ++++++++++++++-
>  mm/rmap.c                |   25 ++++++++++++++++++++++++-
>  3 files changed, 40 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6.34-rc5-mm1/include/linux/mm_types.h
> ===================================================================
> --- linux-2.6.34-rc5-mm1.orig/include/linux/mm_types.h
> +++ linux-2.6.34-rc5-mm1/include/linux/mm_types.h
> @@ -12,6 +12,7 @@
>  #include <linux/completion.h>
>  #include <linux/cpumask.h>
>  #include <linux/page-debug-flags.h>
> +#include <linux/seqlock.h>
>  #include <asm/page.h>
>  #include <asm/mmu.h>
>  
> @@ -183,6 +184,7 @@ struct vm_area_struct {
>  #ifdef CONFIG_NUMA
>  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>  #endif
> +	seqcount_t updating;	/* works like seqlock for updating vma info. */
>  };
>  

#ifdef CONFIG_MIGRATION ?

Minor issue, but would you consider matching the naming used when altering
the size of zones? e.g. seqcount_t span_seqcounter, vma_span_seqbegin(),
vma_span_seqend() etc.?
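
Something along these lines is roughly what I have in mind (untested sketch
only; the field name and the helpers are just the suggestion above, with the
counter kept in vm_area_struct as in your patch):

/* in struct vm_area_struct */
#ifdef CONFIG_MIGRATION
	seqcount_t span_seqcounter;	/* guards vm_start/vm_end/vm_pgoff updates */
#endif

/* in mm/mmap.c */
#ifdef CONFIG_MIGRATION
static void vma_span_seqbegin(struct vm_area_struct *vma)
{
	write_seqcount_begin(&vma->span_seqcounter);
}

static void vma_span_seqend(struct vm_area_struct *vma)
{
	write_seqcount_end(&vma->span_seqcounter);
}
#else
static inline void vma_span_seqbegin(struct vm_area_struct *vma)
{
}

static inline void vma_span_seqend(struct vm_area_struct *vma)
{
}
#endif /* CONFIG_MIGRATION */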

>  struct core_thread {
> Index: linux-2.6.34-rc5-mm1/mm/mmap.c
> ===================================================================
> --- linux-2.6.34-rc5-mm1.orig/mm/mmap.c
> +++ linux-2.6.34-rc5-mm1/mm/mmap.c
> @@ -491,6 +491,16 @@ __vma_unlink(struct mm_struct *mm, struc
>  		mm->mmap_cache = prev;
>  }
>  
> +static void adjust_start_vma(struct vm_area_struct *vma)
> +{
> +	write_seqcount_begin(&vma->updating);
> +}
> +
> +static void adjust_end_vma(struct vm_area_struct *vma)
> +{
> +	write_seqcount_end(&vma->updating);
> +}
> +
>  /*
>   * We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that
>   * is already present in an i_mmap tree without adjusting the tree.
> @@ -584,13 +594,16 @@ again:			remove_next = 1 + (end > next->
>  		if (adjust_next)
>  			vma_prio_tree_remove(next, root);
>  	}
> -
> +	adjust_start_vma(vma);
>  	vma->vm_start = start;
>  	vma->vm_end = end;
>  	vma->vm_pgoff = pgoff;
> +	adjust_end_vma(vma);
>  	if (adjust_next) {
> +		adjust_start_vma(next);
>  		next->vm_start += adjust_next << PAGE_SHIFT;
>  		next->vm_pgoff += adjust_next;
> +		adjust_end_vma(next);
>  	}

I'm not 100% sure about this. I think either the seqcounter needs
a larger span (and possibly to be based on the mm instead of the VMA), or
rmap_walk_[anon|ksm] has to do a full restart if there is a simultaneous
update.

Let's take this case:

VMA A		- Any given VMA with a page being migrated

During migration, munmap() is called so the VMA is now being split to
give

VMA A-lower	- The lower part of the VMA
hole		- A hole due to munmap
VMA A-upper	- The new VMA inserted as a result of munmap that
		  spans the page being migrated.

vma_adjust() takes the seq counters when updating the range of VMA
A-lower, but VMA A-upper is not linked in yet and rmap_walk_[anon|ksm] is
now looking at the wrong VMA.

In this case, rmap_walk_anon() would correctly check the range of
VMA A-lower but still get the wrong answer because it should have been
checking the new VMA A-upper.

I believe my backoff-if-lock-contended patch caught this situation
by always restarting the entire operation if there was lock contention.
Once the lock was acquired, the VMA list could be different, so the walk had
to be restarted to be sure we were checking the right VMA.

For the seqcounter, the adjust_start_vma() needs to be at the beginning
of the operation and adjust_end_vma() must be after all the VMA updates
have completed, including any adjustments of the prio trees, the anon_vma
lists etc. However, to avoid deadlocks rmap_walk_anon() needs to release its
anon_vma->lock, otherwise the VMA list update will deadlock waiting on the
same lock.
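
In other words, roughly this placement in vma_adjust() (a rough sketch only;
error paths and the i_mmap handling are omitted, and the elided steps are in
the comment):

	if (vma->anon_vma)
		vma_span_seqbegin(vma);		/* before vma_prio_tree_remove() etc. */
	if (adjust_next && next->anon_vma)
		vma_span_seqbegin(next);

	/*
	 * ... remove from the prio tree, update vm_start/vm_end/vm_pgoff,
	 * adjust 'next', re-insert into the tree, fix up the anon_vma lists ...
	 */

	if (adjust_next && next->anon_vma)
		vma_span_seqend(next);
	if (vma->anon_vma)
		vma_span_seqend(vma);		/* only after everything is linked back in */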

>  
>  	if (root) {
> Index: linux-2.6.34-rc5-mm1/mm/rmap.c
> ===================================================================
> --- linux-2.6.34-rc5-mm1.orig/mm/rmap.c
> +++ linux-2.6.34-rc5-mm1/mm/rmap.c
> @@ -342,6 +342,23 @@ vma_address(struct page *page, struct vm
>  }
>  
>  /*
> + * vma's address check is racy if we don't hold mmap_sem. This function
> + * gives a safe way for accessing the [start, end, pgoff] tuple of vma.
> + */
> +
> +static inline unsigned long vma_address_safe(struct page *page,
> +		struct vm_area_struct *vma)
> +{
> +	unsigned long ret, safety;
> +
> +	do {
> +		safety = read_seqcount_begin(&vma->updating);
> +		ret = vma_address(page, vma);
> +	} while (read_seqcount_retry(&vma->updating, safety));
> +	return ret;
> +}
> +
> +/*
>   * At what user virtual address is page expected in vma?
>   * checking that the page matches the vma.
>   */
> @@ -1372,7 +1389,13 @@ static int rmap_walk_anon(struct page *p
>  	spin_lock(&anon_vma->lock);
>  	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
>  		struct vm_area_struct *vma = avc->vma;
> -		unsigned long address = vma_address(page, vma);
> +		unsigned long address;
> +
> +		/*
> +		 * In page migration, this race is critical. So, use
> +		 * safe version.
> +		 */
> +		address = vma_address_safe(page, vma);

If I'm right above about maybe checking the wrong VMA due to an munmap,
vma_address_safe isn't the right thing as such. Once the seqcounter covers
the entire set of VMA updates, it would then look like:

unsigned long safety = read_seqcount_begin(&vma->updating);
address = vma_address(page, vma);
if (read_seqcount_retry(&vma->updating, safety)) {
	/*
	 * We raced against an updater of the VMA without mmap_sem held.
	 * Release the anon_vma lock to allow the update to complete and
	 * restart the operation
	 */
	spin_unlock(&anon_vma->lock);
	goto restart;
}

where the restart label looks up the page's anon_vma again, reacquires
the lock and starts the walk of the anon_vma_chain list again.
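
Put together, rmap_walk_anon() would be shaped something like this (a rough,
untested sketch only, reusing your vma->updating counter and the existing
SWAP_AGAIN convention):

static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
		struct vm_area_struct *, unsigned long, void *), void *arg)
{
	struct anon_vma *anon_vma;
	struct anon_vma_chain *avc;
	int ret = SWAP_AGAIN;

restart:
	anon_vma = page_anon_vma(page);
	if (!anon_vma)
		return ret;
	spin_lock(&anon_vma->lock);
	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct vm_area_struct *vma = avc->vma;
		unsigned long safety = read_seqcount_begin(&vma->updating);
		unsigned long address = vma_address(page, vma);

		if (read_seqcount_retry(&vma->updating, safety)) {
			/* Raced with a VMA update; the list may have changed */
			spin_unlock(&anon_vma->lock);
			goto restart;
		}
		if (address == -EFAULT)
			continue;
		ret = rmap_one(page, vma, address, arg);
		if (ret != SWAP_AGAIN)
			break;
	}
	spin_unlock(&anon_vma->lock);
	return ret;
}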

What do you think?

>  		if (address == -EFAULT)
>  			continue;
>  		ret = rmap_one(page, vma, address, arg);
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-23  9:59   ` Mel Gorman
@ 2010-04-23 15:58     ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2010-04-23 15:58 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, minchan.kim, Christoph Lameter, akpm, linux-kernel

On Fri, Apr 23, 2010 at 10:59:22AM +0100, Mel Gorman wrote:
> On Fri, Apr 23, 2010 at 12:01:48PM +0900, KAMEZAWA Hiroyuki wrote:
> > This patch itself is for -mm ..but may need to go -stable tree for memory
> > hotplug. (but we've got no report to hit this race...)
> > 
> 
> Only because it's very difficult to hit. Even when running compaction
> constantly, it can take anywhere between 10 minutes and 2 hours for me
> to reproduce it.
> 
> > This one is the simplest, I think and works well on my test set.
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > In rmap.c, at checking rmap in vma chain in page->mapping, anon_vma->lock
> > or mapping->i_mmap_lock is held and enter following loop.
> > 
> > 	for_each_vma_in_this_rmap_link(list from page->mapping) {
> > 		unsigned long address = vma_address(page, vma);
> > 		if (address == -EFAULT)
> > 			continue;
> > 		....
> > 	}
> > 
> > vma_address is checking [start, end, pgoff] v.s. page->index.
> > 
> > But vma's [start, end, pgoff] is updated without locks. vma_address()
> > can hit a race and may return wrong result.
> > 
> > This bahavior is no problem in usual routine as try_to_unmap() etc...
> > But for page migration, rmap_walk() has to find all migration_ptes
> > which migration code overwritten valid ptes. This race is critical and cause
> > BUG that a migration_pte is sometimes not removed.
> > 
> > pr 21 17:27:47 localhost kernel: ------------[ cut here ]------------
> > Apr 21 17:27:47 localhost kernel: kernel BUG at include/linux/swapops.h:105!
> > Apr 21 17:27:47 localhost kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> > Apr 21 17:27:47 localhost kernel: last sysfs file: /sys/devices/virtual/net/br0/statistics/collisions
> > Apr 21 17:27:47 localhost kernel: CPU 3
> > Apr 21 17:27:47 localhost kernel: Modules linked in: fuse sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput ioatdma ppdev parport_pc i5000_edac bnx2 iTCO_wdt edac_core iTCO_vendor_support shpchp parport e1000e kvm_intel dca kvm i2c_i801 i2c_core i5k_amb pcspkr megaraid_sas [last unloaded: microcode]
> > Apr 21 17:27:47 localhost kernel:
> > Apr 21 17:27:47 localhost kernel: Pid: 27892, comm: cc1 Tainted: G        W   2.6.34-rc4-mm1+ #4 D2519/PRIMERGY          
> > Apr 21 17:27:47 localhost kernel: RIP: 0010:[<ffffffff8114e9cf>]  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
> > Apr 21 17:27:47 localhost kernel: RSP: 0000:ffff88008d9efe08  EFLAGS: 00010246
> > Apr 21 17:27:47 localhost kernel: RAX: ffffea0000000000 RBX: ffffea0000241100 RCX: 0000000000000001
> > Apr 21 17:27:47 localhost kernel: RDX: 000000000000a4e0 RSI: ffff880621a4ab00 RDI: 000000000149c03e
> > Apr 21 17:27:47 localhost kernel: RBP: ffff88008d9efe38 R08: 0000000000000000 R09: 0000000000000000
> > Apr 21 17:27:47 localhost kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff880621a4aae8
> > Apr 21 17:27:47 localhost kernel: R13: 00000000bf811000 R14: 000000000149c03e R15: 0000000000000000
> > Apr 21 17:27:47 localhost kernel: FS:  00007fe6abc90700(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
> > Apr 21 17:27:47 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Apr 21 17:27:47 localhost kernel: CR2: 00007fe6a37279a0 CR3: 000000008d942000 CR4: 00000000000006e0
> > Apr 21 17:27:47 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > Apr 21 17:27:47 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Apr 21 17:27:47 localhost kernel: Process cc1 (pid: 27892, threadinfo ffff88008d9ee000, task ffff8800b23ec820)
> > Apr 21 17:27:47 localhost kernel: Stack:
> > Apr 21 17:27:47 localhost kernel: ffffea000101aee8 ffff880621a4aae8 ffff88008d9efe38 00007fe6a37279a0
> > Apr 21 17:27:47 localhost kernel: <0> ffff8805d9706d90 ffff880621a4aa00 ffff88008d9efef8 ffffffff81126d05
> > Apr 21 17:27:47 localhost kernel: <0> ffff88008d9efec8 0000000000000246 0000000000000000 ffffffff81586533
> > Apr 21 17:27:47 localhost kernel: Call Trace:
> > Apr 21 17:27:47 localhost kernel: [<ffffffff81126d05>] handle_mm_fault+0x995/0x9b0
> > Apr 21 17:27:47 localhost kernel: [<ffffffff81586533>] ? do_page_fault+0x103/0x330
> > Apr 21 17:27:47 localhost kernel: [<ffffffff8104bf40>] ? finish_task_switch+0x0/0xf0
> > Apr 21 17:27:47 localhost kernel: [<ffffffff8158659e>] do_page_fault+0x16e/0x330
> > Apr 21 17:27:47 localhost kernel: [<ffffffff81582f35>] page_fault+0x25/0x30
> > Apr 21 17:27:47 localhost kernel: Code: 53 08 85 c9 0f 84 32 ff ff ff 8d 41 01 89 4d d8 89 45 d4 8b 75 d4 8b 45 d8 f0 0f b1 32 89 45 dc 8b 45 dc 39 c8 74 aa 89 c1 eb d7 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
> > Apr 21 17:27:47 localhost kernel: RIP  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
> > Apr 21 17:27:47 localhost kernel: RSP <ffff88008d9efe08>
> > Apr 21 17:27:47 localhost kernel: ---[ end trace 4860ab585c1fcddb ]---
> > 
> > This patch adds vma_address_safe(). And update [start, end, pgoff]
> > under seq counter. 
> > 
> 
> I had considered this idea as well as it is vaguely similar to how zones get
> resized with a seqlock. I was hoping that the existing locking on anon_vma
> would be usable by backing off until uncontended but maybe not so lets
> check out this approach.
> 

A possible combination of the two approaches is as follows. It mostly uses the
anon_vma lock, except where the anon_vma differs between the page
and the VMAs being walked, in which case it uses the seq counter. I've
had it running for a few hours now without problems, but I'll leave it
running for at least 24 hours.

==== CUT HERE ====
 mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information by protecting against vma_adjust with a combination of locks and seq counter

vma_adjust() is updating anon VMA information without any locks taken.
In contrast, file-backed mappings use the i_mmap_lock. This lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a
BUG_ON to trigger when the page is faulted in.

With the recent anon_vma changes, there is no single anon_vma->lock that
can be taken that is safe for rmap_walk() to guard against changes by
vma_adjust(). Instead, a lock can be taken on one VMA while changes
happen to another.

What this patch does is protect against updates with a combination of
locks and seq counters. First, the vma->anon_vma lock is taken by
vma_adjust() and the sequence counter starts. The lock is released and
the sequence ended when the VMA updates are complete.

The lock serialises rmap_walk_anon when the page and VMA share the same
anon_vma. Where the anon_vmas do not match, the seq counter is checked.
If a change is noticed, rmap_walk_anon drops its locks and starts again
from scratch as the VMA list may have changed. The dangling migration
PTE bug was not triggered after several hours of stress testing with
this patch applied.

[kamezawa.hiroyu@jp.fujitsu.com: Use of a seq counter]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mm_types.h |   13 +++++++++++++
 mm/ksm.c                 |   17 +++++++++++++++--
 mm/mmap.c                |   30 ++++++++++++++++++++++++++++++
 mm/rmap.c                |   25 ++++++++++++++++++++++++-
 4 files changed, 82 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..fcd5db2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
 #include <linux/completion.h>
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
+#include <linux/seqlock.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -240,6 +241,18 @@ struct mm_struct {
 	struct rw_semaphore mmap_sem;
 	spinlock_t page_table_lock;		/* Protects page tables and some counters */
 
+#ifdef CONFIG_MIGRATION
+	/*
+	 * During migration, rmap_walk walks all the VMAs mapping a particular
+	 * page to remove the migration ptes. It does this without mmap_sem
+	 * held, and the semaphore is unnecessarily heavy to take in this case.
+	 * File-backed VMAs are protected by the i_mmap_lock and anon-VMAs are
+	 * protected by this seq counter. If the seq counter changes while
+	 * the migration PTE is being removed, the operation restarts.
+	 */
+	seqcount_t span_seqcounter;
+#endif
+
 	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
 						 * together off init_mm.mmlist, and are protected
 						 * by mmlist_lock
diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..613c762 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1671,11 +1671,24 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
+retry:
 		spin_lock(&anon_vma->lock);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
+			unsigned long update_race;
+			bool outside;
 			vma = vmac->vma;
-			if (rmap_item->address < vma->vm_start ||
-			    rmap_item->address >= vma->vm_end)
+
+			/* See comment in rmap_walk_anon about reading anon VMA info */
+			update_race = read_seqcount_begin(&vma->vm_mm->span_seqcounter);
+			outside = rmap_item->address < vma->vm_start ||
+						rmap_item->address >= vma->vm_end;
+			if (anon_vma != vma->anon_vma &&
+					read_seqcount_retry(&vma->vm_mm->span_seqcounter, update_race)) {
+				spin_unlock(&anon_vma->lock);
+				goto retry;
+			}
+
+			if (outside)
 				continue;
 			/*
 			 * Initially we examine only the vma which covers this
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..1508c43 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -491,6 +491,26 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
 		mm->mmap_cache = prev;
 }
 
+#ifdef CONFIG_MIGRATION
+static void vma_span_seqbegin(struct vm_area_struct *vma)
+{
+	write_seqcount_begin(&vma->vm_mm->span_seqcounter);
+}
+
+static void vma_span_seqend(struct vm_area_struct *vma)
+{
+	write_seqcount_end(&vma->vm_mm->span_seqcounter);
+}
+#else
+static inline void vma_span_seqbegin(struct vm_area_struct *vma)
+{
+}
+
+static inline void vma_span_seqend(struct vm_area_struct *vma)
+{
+}
+#endif /* CONFIG_MIGRATION */
+
 /*
  * We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that
  * is already present in an i_mmap tree without adjusting the tree.
@@ -578,6 +598,11 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
+	if (vma->anon_vma) {
+		spin_lock(&vma->anon_vma->lock);
+		vma_span_seqbegin(vma);
+	}
+
 	if (root) {
 		flush_dcache_mmap_lock(mapping);
 		vma_prio_tree_remove(vma, root);
@@ -620,6 +645,11 @@ again:			remove_next = 1 + (end > next->vm_end);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
 
+	if (vma->anon_vma) {
+		vma_span_seqend(vma);
+		spin_unlock(&vma->anon_vma->lock);
+	}
+
 	if (remove_next) {
 		if (file) {
 			fput(file);
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..b2aec5d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1368,13 +1368,35 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 	 * are holding mmap_sem. Users without mmap_sem are required to
 	 * take a reference count to prevent the anon_vma disappearing
 	 */
+retry:
 	anon_vma = page_anon_vma(page);
 	if (!anon_vma)
 		return ret;
 	spin_lock(&anon_vma->lock);
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
+		unsigned long update_race;
 		struct vm_area_struct *vma = avc->vma;
-		unsigned long address = vma_address(page, vma);
+		unsigned long address;
+
+		/*
+		 * We do not hold mmap_sem and there is no guarantee that the
+		 * page's anon_vma matches the VMA's anon_vma, so we cannot hold
+		 * the same lock. Instead, the page's anon_vma lock protects
+		 * the VMA list from changing underneath us and the
+		 * seq counter protects against updates of the VMA information.
+		 *
+		 * If the seq counter has changed, then the VMA information is
+		 * being updated. We release the anon_vma lock so the update
+		 * completes, and restart the entire operation.
+		 */
+		update_race = read_seqcount_begin(&vma->vm_mm->span_seqcounter);
+		address = vma_address(page, vma);
+		if (anon_vma != vma->anon_vma && 
+				read_seqcount_retry(&vma->vm_mm->span_seqcounter, update_race)) {
+			spin_unlock(&anon_vma->lock);
+			goto retry;
+		}
+
 		if (address == -EFAULT)
 			continue;
 		ret = rmap_one(page, vma, address, arg);

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-23 15:58     ` Mel Gorman
@ 2010-04-24  2:02       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-24  2:02 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, minchan.kim, Christoph Lameter, akpm, linux-kernel

On Fri, 23 Apr 2010 16:58:01 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> > I had considered this idea as well as it is vaguely similar to how zones get
> > resized with a seqlock. I was hoping that the existing locking on anon_vma
> > would be usable by backing off until uncontended but maybe not so lets
> > check out this approach.
> > 
> 
> A possible combination of the two approaches is as follows. It uses the
> anon_vma lock mostly except where the anon_vma differs between the page
> and the VMAs being walked in which case it uses the seq counter. I've
> had it running a few hours now without problems but I'll leave it
> running at least 24 hours.
> 
ok, I'll try this, too.


> ==== CUT HERE ====
>  mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information by protecting against vma_adjust with a combination of locks and seq counter
> 
> vma_adjust() is updating anon VMA information without any locks taken.
> In constract, file-backed mappings use the i_mmap_lock. This lack of
> locking can result in races with page migration. During rmap_walk(),
> vma_address() can return -EFAULT for an address that will soon be valid.
> This leaves a dangling migration PTE behind which can later cause a
> BUG_ON to trigger when the page is faulted in.
> 
> With the recent anon_vma changes, there is no single anon_vma->lock that
> can be taken that is safe for rmap_walk() to guard against changes by
> vma_adjust(). Instead, a lock can be taken on one VMA while changes
> happen to another.
> 
> What this patch does is protect against updates with a combination of
> locks and seq counters. First, the vma->anon_vma lock is taken by
> vma_adjust() and the sequence counter starts. The lock is released and
> the sequence ended when the VMA updates are complete.
> 
> The lock serialses rmap_walk_anon when the page and VMA share the same
> anon_vma. Where the anon_vmas do not match, the seq counter is checked.
> If a change is noticed, rmap_walk_anon drops its locks and starts again
> from scratch as the VMA list may have changed. The dangling migration
> PTE bug was not triggered after several hours of stress testing with
> this patch applied.
> 
> [kamezawa.hiroyu@jp.fujitsu.com: Use of a seq counter]
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

I think this patch is nice!

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-24  2:02       ` KAMEZAWA Hiroyuki
@ 2010-04-24 10:43         ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2010-04-24 10:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, minchan.kim, Christoph Lameter, akpm, linux-kernel

On Sat, Apr 24, 2010 at 11:02:00AM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 23 Apr 2010 16:58:01 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > > I had considered this idea as well as it is vaguely similar to how zones get
> > > resized with a seqlock. I was hoping that the existing locking on anon_vma
> > > would be usable by backing off until uncontended but maybe not so lets
> > > check out this approach.
> > > 
> > 
> > A possible combination of the two approaches is as follows. It uses the
> > anon_vma lock mostly except where the anon_vma differs between the page
> > and the VMAs being walked in which case it uses the seq counter. I've
> > had it running a few hours now without problems but I'll leave it
> > running at least 24 hours.
> > 
> ok, I'll try this, too.
> 
> 
> > ==== CUT HERE ====
> >  mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information by protecting against vma_adjust with a combination of locks and seq counter
> > 
> > vma_adjust() is updating anon VMA information without any locks taken.
> > In contrast, file-backed mappings use the i_mmap_lock. This lack of
> > locking can result in races with page migration. During rmap_walk(),
> > vma_address() can return -EFAULT for an address that will soon be valid.
> > This leaves a dangling migration PTE behind which can later cause a
> > BUG_ON to trigger when the page is faulted in.
> > 
> > With the recent anon_vma changes, there is no single anon_vma->lock that
> > can be taken that is safe for rmap_walk() to guard against changes by
> > vma_adjust(). Instead, a lock can be taken on one VMA while changes
> > happen to another.
> > 
> > What this patch does is protect against updates with a combination of
> > locks and seq counters. First, the vma->anon_vma lock is taken by
> > vma_adjust() and the sequence counter starts. The lock is released and
> > the sequence ended when the VMA updates are complete.
> > 
> > The lock serialises rmap_walk_anon when the page and VMA share the same
> > anon_vma. Where the anon_vmas do not match, the seq counter is checked.
> > If a change is noticed, rmap_walk_anon drops its locks and starts again
> > from scratch as the VMA list may have changed. The dangling migration
> > PTE bug was not triggered after several hours of stress testing with
> > this patch applied.
> > 
> > [kamezawa.hiroyu@jp.fujitsu.com: Use of a seq counter]
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> I think this patch is nice!
> 

It looks nice but it still broke after 28 hours of running. The
seq-counter is still insufficient to catch all changes that are made to
the list. I'm beginning to wonder if a) this really can be fully safely
locked with the anon_vma changes and b) if it has to be a spinlock to
catch the majority of cases but still a lazy cleanup if there happens to
be a race. It's unsatisfactory and I'm expecting I'll either have some
insight to the new anon_vma changes that allow it to be locked or Rik
knows how to restore the original behaviour which as Andrea pointed out
was safe.

> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-24 10:43         ` Mel Gorman
@ 2010-04-25 23:49           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-25 23:49 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, minchan.kim, Christoph Lameter, akpm, linux-kernel

On Sat, 24 Apr 2010 11:43:24 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> On Sat, Apr 24, 2010 at 11:02:00AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Fri, 23 Apr 2010 16:58:01 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > > I had considered this idea as well as it is vaguely similar to how zones get
> > > > resized with a seqlock. I was hoping that the existing locking on anon_vma
> > > > would be usable by backing off until uncontended but maybe not so lets
> > > > check out this approach.
> > > > 
> > > 
> > > A possible combination of the two approaches is as follows. It uses the
> > > anon_vma lock mostly except where the anon_vma differs between the page
> > > and the VMAs being walked in which case it uses the seq counter. I've
> > > had it running a few hours now without problems but I'll leave it
> > > running at least 24 hours.
> > > 
> > ok, I'll try this, too.
> > 
> > 
> > > ==== CUT HERE ====
> > >  mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information by protecting against vma_adjust with a combination of locks and seq counter
> > > 
> > > vma_adjust() is updating anon VMA information without any locks taken.
> > > In contrast, file-backed mappings use the i_mmap_lock. This lack of
> > > locking can result in races with page migration. During rmap_walk(),
> > > vma_address() can return -EFAULT for an address that will soon be valid.
> > > This leaves a dangling migration PTE behind which can later cause a
> > > BUG_ON to trigger when the page is faulted in.
> > > 
> > > With the recent anon_vma changes, there is no single anon_vma->lock that
> > > can be taken that is safe for rmap_walk() to guard against changes by
> > > vma_adjust(). Instead, a lock can be taken on one VMA while changes
> > > happen to another.
> > > 
> > > What this patch does is protect against updates with a combination of
> > > locks and seq counters. First, the vma->anon_vma lock is taken by
> > > vma_adjust() and the sequence counter starts. The lock is released and
> > > the sequence ended when the VMA updates are complete.
> > > 
> > > The lock serialises rmap_walk_anon when the page and VMA share the same
> > > anon_vma. Where the anon_vmas do not match, the seq counter is checked.
> > > If a change is noticed, rmap_walk_anon drops its locks and starts again
> > > from scratch as the VMA list may have changed. The dangling migration
> > > PTE bug was not triggered after several hours of stress testing with
> > > this patch applied.
> > > 
> > > [kamezawa.hiroyu@jp.fujitsu.com: Use of a seq counter]
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > 
> > I think this patch is nice!
> > 
> 
> It looks nice but it still broke after 28 hours of running. The
> seq-counter is still insufficient to catch all changes that are made to
> the list. I'm beginning to wonder if a) this really can be fully safely
> locked with the anon_vma changes and b) if it has to be a spinlock to
> catch the majority of cases but still a lazy cleanup if there happens to
> be a race. It's unsatisfactory and I'm expecting I'll either have some
> insight to the new anon_vma changes that allow it to be locked or Rik
> knows how to restore the original behaviour which as Andrea pointed out
> was safe.
> 
Ouch. Hmm, how about the race in fork() I pointed out?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-25 23:49           ` KAMEZAWA Hiroyuki
@ 2010-04-26  2:53             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-26  2:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, linux-mm, minchan.kim, Christoph Lameter, akpm, linux-kernel

On Mon, 26 Apr 2010 08:49:01 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Sat, 24 Apr 2010 11:43:24 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:

> > It looks nice but it still broke after 28 hours of running. The
> > seq-counter is still insufficient to catch all changes that are made to
> > the list. I'm beginning to wonder if a) this really can be fully safely
> > locked with the anon_vma changes and b) if it has to be a spinlock to
> > catch the majority of cases but still a lazy cleanup if there happens to
> > be a race. It's unsatisfactory and I'm expecting I'll either have some
> > insight to the new anon_vma changes that allow it to be locked or Rik
> > knows how to restore the original behaviour which as Andrea pointed out
> > was safe.
> > 
> Ouch. Hmm, how about the race in fork() I pointed out ?
> 
Forget this. Sorry for noise.

==
This is a memo for myself.

*) At fork, when copying a vma for a file mapping, vma_prio_tree_add() is called
   before the page tables are copied.
   There are several patterns.

Assume tasks named t1,t2,t3,t4,t5 with their own vmas v1,v2,v3,v4,v5, each of which
maps a range in its address space.

(a) t1 forks t2.
   v1 is in the prio_tree; v2 for t2 will be pointed to by the ->head pointer.

   \
    v1  --(head)---> v2 
   /  \
  ?    ?

  vma_prio_tree_foreach() order : v1->v2.


(b) after (a), t2 forks t3. (list_add() is used.)
 
    \
     v1 --(head)--> v2 ->(list.next)->v3
    /  \
   ?    ?

   vma_prio_tree_foreach() order : v1->v2->v3

(c) after (b), t1 forks t4.

    \
     v1 --(head)--> v2 ->(list.next)->v3->v4
    /  \              
   ?    ?

    vma_prio_tree_foreach() order : v1->v2->v3->v4

(d) after (c), t4 forks t5.

    \
     v1 --(head)--> v2 ->(list.next)->v3->v4->v5
    /  \               
   ?    ?
    vma_prio_tree_foreach() order : v1->v2->v3->v4->v5

(e) after (c), t3 forks t5.
    \
     v1 --(head)--> v2 ->(list.next)->v3->v5->v4
    /  \               
   ?    ?
    vma_prio_tree_foreach() order : v1->v2->v3->v5->v4

.....in any case, it seems vma_prio_tree_foreach() finds
the parent's vma first.
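
For reference, the file-side walk that consumes this ordering looks roughly
like the following (a simplified sketch of the prio_tree walk in rmap of
this era; the real code passes an rmap_one() callback and does more than
this):

	struct address_space *mapping = page->mapping;
	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
	struct vm_area_struct *vma;
	struct prio_tree_iter iter;

	spin_lock(&mapping->i_mmap_lock);
	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
		/* the parent's vma is reached first, then the children in
		 * the ->head / list order sketched above */
		unsigned long address = vma_address(page, vma);
		if (address == -EFAULT)
			continue;
		/* ... check/remove the migration pte for this mapping ... */
	}
	spin_unlock(&mapping->i_mmap_lock);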

Thx,
-Kame
















^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-23 15:58     ` Mel Gorman
@ 2010-04-26  4:00       ` Minchan Kim
  -1 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2010-04-26  4:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KAMEZAWA Hiroyuki, linux-mm, Christoph Lameter, akpm, linux-kernel

Hi, Mel.

On Sat, Apr 24, 2010 at 12:58 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Fri, Apr 23, 2010 at 10:59:22AM +0100, Mel Gorman wrote:
>> On Fri, Apr 23, 2010 at 12:01:48PM +0900, KAMEZAWA Hiroyuki wrote:
>> > This patch itself is for -mm ..but may need to go -stable tree for memory
>> > hotplug. (but we've got no report to hit this race...)
>> >
>>
>> Only because it's very difficult to hit. Even when running compaction
>> constantly, it can take anywhere between 10 minutes and 2 hours for me
>> to reproduce it.
>>
>> > This one is the simplest, I think and works well on my test set.
>> > ==
>> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> >
>> > In rmap.c, at checking rmap in vma chain in page->mapping, anon_vma->lock
>> > or mapping->i_mmap_lock is held and enter following loop.
>> >
>> >     for_each_vma_in_this_rmap_link(list from page->mapping) {
>> >             unsigned long address = vma_address(page, vma);
>> >             if (address == -EFAULT)
>> >                     continue;
>> >             ....
>> >     }
>> >
>> > vma_address is checking [start, end, pgoff] v.s. page->index.
>> >
>> > But vma's [start, end, pgoff] is updated without locks. vma_address()
>> > can hit a race and may return wrong result.
>> >
>> > This bahavior is no problem in usual routine as try_to_unmap() etc...
>> > But for page migration, rmap_walk() has to find all migration_ptes
>> > which migration code overwritten valid ptes. This race is critical and cause
>> > BUG that a migration_pte is sometimes not removed.
>> >
>> > pr 21 17:27:47 localhost kernel: ------------[ cut here ]------------
>> > Apr 21 17:27:47 localhost kernel: kernel BUG at include/linux/swapops.h:105!
>> > Apr 21 17:27:47 localhost kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
>> > Apr 21 17:27:47 localhost kernel: last sysfs file: /sys/devices/virtual/net/br0/statistics/collisions
>> > Apr 21 17:27:47 localhost kernel: CPU 3
>> > Apr 21 17:27:47 localhost kernel: Modules linked in: fuse sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput ioatdma ppdev parport_pc i5000_edac bnx2 iTCO_wdt edac_core iTCO_vendor_support shpchp parport e1000e kvm_intel dca kvm i2c_i801 i2c_core i5k_amb pcspkr megaraid_sas [last unloaded: microcode]
>> > Apr 21 17:27:47 localhost kernel:
>> > Apr 21 17:27:47 localhost kernel: Pid: 27892, comm: cc1 Tainted: G        W   2.6.34-rc4-mm1+ #4 D2519/PRIMERGY
>> > Apr 21 17:27:47 localhost kernel: RIP: 0010:[<ffffffff8114e9cf>]  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
>> > Apr 21 17:27:47 localhost kernel: RSP: 0000:ffff88008d9efe08  EFLAGS: 00010246
>> > Apr 21 17:27:47 localhost kernel: RAX: ffffea0000000000 RBX: ffffea0000241100 RCX: 0000000000000001
>> > Apr 21 17:27:47 localhost kernel: RDX: 000000000000a4e0 RSI: ffff880621a4ab00 RDI: 000000000149c03e
>> > Apr 21 17:27:47 localhost kernel: RBP: ffff88008d9efe38 R08: 0000000000000000 R09: 0000000000000000
>> > Apr 21 17:27:47 localhost kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff880621a4aae8
>> > Apr 21 17:27:47 localhost kernel: R13: 00000000bf811000 R14: 000000000149c03e R15: 0000000000000000
>> > Apr 21 17:27:47 localhost kernel: FS:  00007fe6abc90700(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
>> > Apr 21 17:27:47 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > Apr 21 17:27:47 localhost kernel: CR2: 00007fe6a37279a0 CR3: 000000008d942000 CR4: 00000000000006e0
>> > Apr 21 17:27:47 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > Apr 21 17:27:47 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> > Apr 21 17:27:47 localhost kernel: Process cc1 (pid: 27892, threadinfo ffff88008d9ee000, task ffff8800b23ec820)
>> > Apr 21 17:27:47 localhost kernel: Stack:
>> > Apr 21 17:27:47 localhost kernel: ffffea000101aee8 ffff880621a4aae8 ffff88008d9efe38 00007fe6a37279a0
>> > Apr 21 17:27:47 localhost kernel: <0> ffff8805d9706d90 ffff880621a4aa00 ffff88008d9efef8 ffffffff81126d05
>> > Apr 21 17:27:47 localhost kernel: <0> ffff88008d9efec8 0000000000000246 0000000000000000 ffffffff81586533
>> > Apr 21 17:27:47 localhost kernel: Call Trace:
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff81126d05>] handle_mm_fault+0x995/0x9b0
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff81586533>] ? do_page_fault+0x103/0x330
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff8104bf40>] ? finish_task_switch+0x0/0xf0
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff8158659e>] do_page_fault+0x16e/0x330
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff81582f35>] page_fault+0x25/0x30
>> > Apr 21 17:27:47 localhost kernel: Code: 53 08 85 c9 0f 84 32 ff ff ff 8d 41 01 89 4d d8 89 45 d4 8b 75 d4 8b 45 d8 f0 0f b1 32 89 45 dc 8b 45 dc 39 c8 74 aa 89 c1 eb d7 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
>> > Apr 21 17:27:47 localhost kernel: RIP  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
>> > Apr 21 17:27:47 localhost kernel: RSP <ffff88008d9efe08>
>> > Apr 21 17:27:47 localhost kernel: ---[ end trace 4860ab585c1fcddb ]---
>> >
>> > This patch adds vma_address_safe(). And update [start, end, pgoff]
>> > under seq counter.
>> >
>>
>> I had considered this idea as well as it is vaguely similar to how zones get
>> resized with a seqlock. I was hoping that the existing locking on anon_vma
>> would be usable by backing off until uncontended but maybe not so lets
>> check out this approach.
>>
>
> A possible combination of the two approaches is as follows. It uses the
> anon_vma lock mostly except where the anon_vma differs between the page
> and the VMAs being walked in which case it uses the seq counter. I've
> had it running a few hours now without problems but I'll leave it
> running at least 24 hours.
>
> ==== CUT HERE ====
>  mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information by protecting against vma_adjust with a combination of locks and seq counter
>
> vma_adjust() is updating anon VMA information without any locks taken.
> In contrast, file-backed mappings use the i_mmap_lock. This lack of
> locking can result in races with page migration. During rmap_walk(),
> vma_address() can return -EFAULT for an address that will soon be valid.
> This leaves a dangling migration PTE behind which can later cause a
> BUG_ON to trigger when the page is faulted in.
>
> With the recent anon_vma changes, there is no single anon_vma->lock that
> can be taken that is safe for rmap_walk() to guard against changes by
> vma_adjust(). Instead, a lock can be taken on one VMA while changes
> happen to another.
>
> What this patch does is protect against updates with a combination of
> locks and seq counters. First, the vma->anon_vma lock is taken by
> vma_adjust() and the sequence counter starts. The lock is released and
> the sequence ended when the VMA updates are complete.
>
> The lock serialises rmap_walk_anon when the page and VMA share the same
> anon_vma. Where the anon_vmas do not match, the seq counter is checked.
> If a change is noticed, rmap_walk_anon drops its locks and starts again
> from scratch as the VMA list may have changed. The dangling migration
> PTE bug was not triggered after several hours of stress testing with
> this patch applied.
>
> [kamezawa.hiroyu@jp.fujitsu.com: Use of a seq counter]
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  include/linux/mm_types.h |   13 +++++++++++++
>  mm/ksm.c                 |   17 +++++++++++++++--
>  mm/mmap.c                |   30 ++++++++++++++++++++++++++++++
>  mm/rmap.c                |   25 ++++++++++++++++++++++++-
>  4 files changed, 82 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index b8bb9a6..fcd5db2 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -12,6 +12,7 @@
>  #include <linux/completion.h>
>  #include <linux/cpumask.h>
>  #include <linux/page-debug-flags.h>
> +#include <linux/seqlock.h>
>  #include <asm/page.h>
>  #include <asm/mmu.h>
>
> @@ -240,6 +241,18 @@ struct mm_struct {
>        struct rw_semaphore mmap_sem;
>        spinlock_t page_table_lock;             /* Protects page tables and some counters */
>
> +#ifdef CONFIG_MIGRATION
> +       /*
> +        * During migration, rmap_walk walks all the VMAs mapping a particular
> +        * page to remove the migration ptes. It does this without mmap_sem
> +        * held, and the semaphore is unnecessarily heavy to take in this case.
> +        * File-backed VMAs are protected by the i_mmap_lock and anon-VMAs are
> +        * protected by this seq counter. If the seq counter changes while
> +        * the migration PTE is being removed, the operation restarts.
> +        */
> +       seqcount_t span_seqcounter;
> +#endif
> +
>        struct list_head mmlist;                /* List of maybe swapped mm's.  These are globally strung
>                                                 * together off init_mm.mmlist, and are protected
>                                                 * by mmlist_lock
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 3666d43..613c762 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1671,11 +1671,24 @@ again:
>                struct anon_vma_chain *vmac;
>                struct vm_area_struct *vma;
>
> +retry:
>                spin_lock(&anon_vma->lock);
>                list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
> +                       unsigned long update_race;
> +                       bool outside;
>                        vma = vmac->vma;
> -                       if (rmap_item->address < vma->vm_start ||
> -                           rmap_item->address >= vma->vm_end)
> +
> +                       /* See comment in rmap_walk_anon about reading anon VMA info */
> +                       update_race = read_seqcount_begin(&vma->vm_mm->span_seqcounter);
> +                       outside = rmap_item->address < vma->vm_start ||
> +                                               rmap_item->address >= vma->vm_end;
> +                       if (anon_vma != vma->anon_vma &&
> +                                       read_seqcount_retry(&vma->vm_mm->span_seqcounter, update_race)) {
> +                               spin_unlock(&anon_vma->lock);
> +                               goto retry;
> +                       }
> +
> +                       if (outside)
>                                continue;
>                        /*
>                         * Initially we examine only the vma which covers this
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f90ea92..1508c43 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -491,6 +491,26 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
>                mm->mmap_cache = prev;
>  }
>
> +#ifdef CONFIG_MIGRATION
> +static void vma_span_seqbegin(struct vm_area_struct *vma)
> +{
> +       write_seqcount_begin(&vma->vm_mm->span_seqcounter);
> +}
> +
> +static void vma_span_seqend(struct vm_area_struct *vma)
> +{
> +       write_seqcount_end(&vma->vm_mm->span_seqcounter);
> +}
> +#else
> +static inline void vma_span_seqbegin(struct vm_area_struct *vma)
> +{
> +}
> +
> +static inline void vma_span_seqend(struct vm_area_struct *vma)
> +{
> +}
> +#endif /* CONFIG_MIGRATION */
> +
>  /*
>  * We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that
>  * is already present in an i_mmap tree without adjusting the tree.
> @@ -578,6 +598,11 @@ again:                     remove_next = 1 + (end > next->vm_end);
>                }
>        }
>
> +       if (vma->anon_vma) {
> +               spin_lock(&vma->anon_vma->lock);


Actually, I can't understand why we need to hold vma->anon_vma->lock.
I think the seqcounter is enough.

I looked at your scenario.

More exactly, your scenario about unmap is as follows.

1. VMA A-Lower - VMA A-Upper (including the hole)
2. VMA A-Lower - VMA hole - VMA A-Upper (excluding the hole)
3. VMA A-Lower - the hole is removed at last - VMA A-Upper.

I mean VMA A-Upper is already linked onto the vma list through
__insert_vm_struct() atomically (by the seqcounter).
So rmap can find the proper entry, I think.

What am I missing? :)
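
(For reference, the unlocked reads under discussion are the ones
vma_address() does on vm_start/vm_end/vm_pgoff; roughly, simplified from
mm/rmap.c:)

	static inline unsigned long
	vma_address(struct page *page, struct vm_area_struct *vma)
	{
		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
		unsigned long address;

		/* racy if vm_start/vm_end/vm_pgoff move underneath us */
		address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
		if (unlikely(address < vma->vm_start || address >= vma->vm_end))
			return -EFAULT;
		return address;
	}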

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-25 23:49           ` KAMEZAWA Hiroyuki
@ 2010-04-26  4:06             ` Minchan Kim
  -1 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2010-04-26  4:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, linux-mm, Christoph Lameter, akpm, linux-kernel

On Mon, Apr 26, 2010 at 8:49 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Sat, 24 Apr 2010 11:43:24 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
>
>> On Sat, Apr 24, 2010 at 11:02:00AM +0900, KAMEZAWA Hiroyuki wrote:
>> > On Fri, 23 Apr 2010 16:58:01 +0100
>> > Mel Gorman <mel@csn.ul.ie> wrote:
>> >
>> > > > I had considered this idea as well as it is vaguely similar to how zones get
>> > > > resized with a seqlock. I was hoping that the existing locking on anon_vma
>> > > > would be usable by backing off until uncontended but maybe not so lets
>> > > > check out this approach.
>> > > >
>> > >
>> > > A possible combination of the two approaches is as follows. It uses the
>> > > anon_vma lock mostly except where the anon_vma differs between the page
>> > > and the VMAs being walked in which case it uses the seq counter. I've
>> > > had it running a few hours now without problems but I'll leave it
>> > > running at least 24 hours.
>> > >
>> > ok, I'll try this, too.
>> >
>> >
>> > > ==== CUT HERE ====
>> > >  mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information by protecting against vma_adjust with a combination of locks and seq counter
>> > >
>> > > vma_adjust() is updating anon VMA information without any locks taken.
>> > > In contrast, file-backed mappings use the i_mmap_lock. This lack of
>> > > locking can result in races with page migration. During rmap_walk(),
>> > > vma_address() can return -EFAULT for an address that will soon be valid.
>> > > This leaves a dangling migration PTE behind which can later cause a
>> > > BUG_ON to trigger when the page is faulted in.
>> > >
>> > > With the recent anon_vma changes, there is no single anon_vma->lock that
>> > > can be taken that is safe for rmap_walk() to guard against changes by
>> > > vma_adjust(). Instead, a lock can be taken on one VMA while changes
>> > > happen to another.
>> > >
>> > > What this patch does is protect against updates with a combination of
>> > > locks and seq counters. First, the vma->anon_vma lock is taken by
>> > > vma_adjust() and the sequence counter starts. The lock is released and
>> > > the sequence ended when the VMA updates are complete.
>> > >
>> > > The lock serialises rmap_walk_anon when the page and VMA share the same
>> > > anon_vma. Where the anon_vmas do not match, the seq counter is checked.
>> > > If a change is noticed, rmap_walk_anon drops its locks and starts again
>> > > from scratch as the VMA list may have changed. The dangling migration
>> > > PTE bug was not triggered after several hours of stress testing with
>> > > this patch applied.
>> > >
>> > > [kamezawa.hiroyu@jp.fujitsu.com: Use of a seq counter]
>> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> >
>> > I think this patch is nice!
>> >
>>
>> It looks nice but it still broke after 28 hours of running. The
>> seq-counter is still insufficient to catch all changes that are made to
>> the list. I'm beginning to wonder if a) this really can be fully safely
>> locked with the anon_vma changes and b) if it has to be a spinlock to
>> catch the majority of cases but still a lazy cleanup if there happens to
>> be a race. It's unsatisfactory and I'm expecting I'll either have some
>> insight to the new anon_vma changes that allow it to be locked or Rik
>> knows how to restore the original behaviour which as Andrea pointed out
>> was safe.
>>
> Ouch. Hmm, how about the race in fork() I pointed out ?

I thought it was possible.
Mel's test would take a long time to trigger the BUG.
So I think we could solve one of the problems. The remaining one is the fork
race, I think.
Mel, could you retry your test with Kame's patch below?
http://lkml.org/lkml/2010/4/23/58


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-26  2:53             ` KAMEZAWA Hiroyuki
@ 2010-04-26  4:31               ` Minchan Kim
  -1 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2010-04-26  4:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, linux-mm, Christoph Lameter, akpm, linux-kernel

On Mon, Apr 26, 2010 at 11:53 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 26 Apr 2010 08:49:01 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> On Sat, 24 Apr 2010 11:43:24 +0100
>> Mel Gorman <mel@csn.ul.ie> wrote:
>
>> > It looks nice but it still broke after 28 hours of running. The
>> > seq-counter is still insufficient to catch all changes that are made to
>> > the list. I'm beginning to wonder if a) this really can be fully safely
>> > locked with the anon_vma changes and b) if it has to be a spinlock to
>> > catch the majority of cases but still a lazy cleanup if there happens to
>> > be a race. It's unsatisfactory and I'm expecting I'll either have some
>> > insight to the new anon_vma changes that allow it to be locked or Rik
>> > knows how to restore the original behaviour which as Andrea pointed out
>> > was safe.
>> >
>> Ouch. Hmm, how about the race in fork() I pointed out ?
>>
> Forget this. Sorry for noise.

Yes. It was due to my wrong explanation.
Sorry for that, Kame.


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-25 23:49           ` KAMEZAWA Hiroyuki
@ 2010-04-26  9:28             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-26  9:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, linux-mm, minchan.kim, Christoph Lameter, akpm, linux-kernel

On Mon, 26 Apr 2010 08:49:01 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Sat, 24 Apr 2010 11:43:24 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:

> > It looks nice but it still broke after 28 hours of running. The
> > seq-counter is still insufficient to catch all changes that are made to
> > the list. I'm beginning to wonder if a) this really can be fully safely
> > locked with the anon_vma changes and b) if it has to be a spinlock to
> > catch the majority of cases but still a lazy cleanup if there happens to
> > be a race. It's unsatisfactory and I'm expecting I'll either have some
> > insight to the new anon_vma changes that allow it to be locked or Rik
> > knows how to restore the original behaviour which as Andrea pointed out
> > was safe.
> > 
> Ouch. 

Ok, reproduced. Here is status in my test + printk().

 * A race doesn't seem to happen if swap=off. 
    I need to swapon to cause the bug.
 * Before unmap, mapcount=1, SwapCache for anonymous memory.
   The old page's flags were SwapCache, Active, Uptodate, Referenced, Locked.
 * After remap, mapcount=0, return code=0.
   The new page's flags after remap were SwapCache, Active, Dirty, Uptodate, Referenced.

(Hmm, the dirty bit can be added by try_to_unmap().)

-Kame




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-26  9:28             ` KAMEZAWA Hiroyuki
@ 2010-04-26  9:48               ` Minchan Kim
  -1 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2010-04-26  9:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, linux-mm, Christoph Lameter, akpm, linux-kernel

On Mon, Apr 26, 2010 at 6:28 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 26 Apr 2010 08:49:01 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> On Sat, 24 Apr 2010 11:43:24 +0100
>> Mel Gorman <mel@csn.ul.ie> wrote:
>
>> > It looks nice but it still broke after 28 hours of running. The
>> > seq-counter is still insufficient to catch all changes that are made to
>> > the list. I'm beginning to wonder if a) this really can be fully safely
>> > locked with the anon_vma changes and b) if it has to be a spinlock to
>> > catch the majority of cases but still a lazy cleanup if there happens to
>> > be a race. It's unsatisfactory and I'm expecting I'll either have some
>> > insight to the new anon_vma changes that allow it to be locked or Rik
>> > knows how to restore the original behaviour which as Andrea pointed out
>> > was safe.
>> >
>> Ouch.
>
> Ok, reproduced. Here is status in my test + printk().
>
>  * A race doesn't seem to happen if swap=off.
>    I need to swapon to cause the bug

FYI,

Do you have a swapon/off bomb test?
When I saw your mail, I felt it might be the culprit.

http://lkml.org/lkml/2010/4/22/762.

It is just a guess. I don't have time to look into it now.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-26  9:48               ` Minchan Kim
@ 2010-04-26  9:49                 ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-26  9:49 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Mel Gorman, linux-mm, Christoph Lameter, akpm, linux-kernel

On Mon, 26 Apr 2010 18:48:42 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Mon, Apr 26, 2010 at 6:28 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 26 Apr 2010 08:49:01 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> >> On Sat, 24 Apr 2010 11:43:24 +0100
> >> Mel Gorman <mel@csn.ul.ie> wrote:
> >
> >> > It looks nice but it still broke after 28 hours of running. The
> >> > seq-counter is still insufficient to catch all changes that are made to
> >> > the list. I'm beginning to wonder if a) this really can be fully safely
> >> > locked with the anon_vma changes and b) if it has to be a spinlock to
> >> > catch the majority of cases but still a lazy cleanup if there happens to
> >> > be a race. It's unsatisfactory and I'm expecting I'll either have some
> >> > insight to the new anon_vma changes that allow it to be locked or Rik
> >> > knows how to restore the original behaviour which as Andrea pointed out
> >> > was safe.
> >> >
> >> Ouch.
> >
> > Ok, reproduced. Here is status in my test + printk().
> >
> >  * A race doesn't seem to happen if swap=off.
> >    I need to swapon to cause the bug
> 
> FYI,
> 
> Do you have a swapon/off bomb test?

No. Just running the test under swapoff, and running the same test after swapon.


> When I saw your mail, I felt it might be the culprit.
> 
> http://lkml.org/lkml/2010/4/22/762.
> 
> It is just a guess. I don't have time to look into it now.
> 
Hmm. BTW.

==
static int expand_downwards(struct vm_area_struct *vma,
                            unsigned long address)
{
        ....
        /* Somebody else might have raced and expanded it already */
        if (address < vma->vm_start) {
                unsigned long size, grow;

                size = vma->vm_end - address;
                grow = (vma->vm_start - address) >> PAGE_SHIFT;

                error = acct_stack_growth(vma, size, grow);
                if (!error) {
                        vma->vm_start = address;
                        vma->vm_pgoff -= grow;
                }
        }
==

I feel this part needs care. No?
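
For reference, vma_address() boils down to roughly the following (a
simplified sketch of the helper in mm/rmap.c; the page->index conversion is
simplified). If vm_start and vm_pgoff are updated as above while a walker
does this arithmetic without any common lock, it can read a mismatched pair
and transiently compute an address outside [vm_start, vm_end), i.e. get
-EFAULT for a page that is in fact mapped.
==
/* Simplified sketch of vma_address(); see mm/rmap.c for the real code. */
static unsigned long sketch_vma_address(struct page *page,
                                        struct vm_area_struct *vma)
{
        pgoff_t pgoff = page->index;
        unsigned long address;

        /*
         * If vm_start/vm_pgoff change underneath us, the two reads can be
         * inconsistent and the result lands outside the VMA.
         */
        address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
        if (address < vma->vm_start || address >= vma->vm_end)
                return -EFAULT;         /* treated as "not mapped here" */
        return address;
}
==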

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-26  9:49                 ` KAMEZAWA Hiroyuki
@ 2010-04-26 10:07                   ` Minchan Kim
  -1 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2010-04-26 10:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, linux-mm, Christoph Lameter, akpm, linux-kernel

On Mon, Apr 26, 2010 at 6:49 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 26 Apr 2010 18:48:42 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> On Mon, Apr 26, 2010 at 6:28 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Mon, 26 Apr 2010 08:49:01 +0900
>> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >
>> >> On Sat, 24 Apr 2010 11:43:24 +0100
>> >> Mel Gorman <mel@csn.ul.ie> wrote:
>> >
>> >> > It looks nice but it still broke after 28 hours of running. The
>> >> > seq-counter is still insufficient to catch all changes that are made to
>> >> > the list. I'm beginning to wonder if a) this really can be fully safely
>> >> > locked with the anon_vma changes and b) if it has to be a spinlock to
>> >> > catch the majority of cases but still a lazy cleanup if there happens to
>> >> > be a race. It's unsatisfactory and I'm expecting I'll either have some
>> >> > insight to the new anon_vma changes that allow it to be locked or Rik
>> >> > knows how to restore the original behaviour which as Andrea pointed out
>> >> > was safe.
>> >> >
>> >> Ouch.
>> >
>> > Ok, reproduced. Here is status in my test + printk().
>> >
>> >  * A race doesn't seem to happen if swap=off.
>> >    I need to swapon to cause the bug
>>
>> FYI,
>>
>> Do you have a swapon/off bomb test?
>
> No. Just running the test under swapoff, and running the same test after swapon.
>
>
>> When I saw your mail, I felt it might be the culprit.
>>
>> http://lkml.org/lkml/2010/4/22/762.
>>
>> It is just a guess. I don't have time to look into it now.
>>
> Hmm. BTW.
>
> ==
> static int expand_downwards(struct vm_area_struct *vma,
>                                   unsigned long address)
> {
>   ....
>       /* Somebody else might have raced and expanded it already */
>        if (address < vma->vm_start) {
>                unsigned long size, grow;
>
>                size = vma->vm_end - address;
>                grow = (vma->vm_start - address) >> PAGE_SHIFT;
>
>                error = acct_stack_growth(vma, size, grow);
>                if (!error) {
>                        vma->vm_start = address;
>                        vma->vm_pgoff -= grow;
>                }
>        }
> ==
>
> I feel this part needs care. No?

Yes. Andrea pointed it out.
I haven't followed the whole thread yet, but it seems Mel and Andrea
want to restore the anon_vma's old atomicity rather than healing the races one by one.



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk
  2010-04-26  9:48               ` Minchan Kim
@ 2010-04-26 11:36                 ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2010-04-26 11:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, linux-mm, Christoph Lameter, akpm, linux-kernel

On Mon, Apr 26, 2010 at 06:48:42PM +0900, Minchan Kim wrote:
> On Mon, Apr 26, 2010 at 6:28 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 26 Apr 2010 08:49:01 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> >> On Sat, 24 Apr 2010 11:43:24 +0100
> >> Mel Gorman <mel@csn.ul.ie> wrote:
> >
> >> > It looks nice but it still broke after 28 hours of running. The
> >> > seq-counter is still insufficient to catch all changes that are made to
> >> > the list. I'm beginning to wonder if a) this really can be fully safely
> >> > locked with the anon_vma changes and b) if it has to be a spinlock to
> >> > catch the majority of cases but still a lazy cleanup if there happens to
> >> > be a race. It's unsatisfactory and I'm expecting I'll either have some
> >> > insight to the new anon_vma changes that allow it to be locked or Rik
> >> > knows how to restore the original behaviour which as Andrea pointed out
> >> > was safe.
> >> >
> >> Ouch.
> >
> > Ok, reproduced. Here is status in my test + printk().
> >
> >  * A race doesn't seem to happen if swap=off.
> >    I need to swapon to cause the bug
> 
> FYI,
> 
> Do you have a swapon/off bomb test?
> When I saw your mail, I felt it might be the culprit.
> 
> http://lkml.org/lkml/2010/4/22/762.
> 
> It is just a guess. I don't have time to look into it now.
> 

I haven't tried a swapon/off test, but that patch certainly closes an
important race. A fork-heavy test will routinely hit the problem, and
applying the patch makes it very difficult to reproduce. I've added it to
my stack while I continue trying to pin down when the VMA changes make a
difference.

I'm relooking at the seq counter approach. It appears the logic is only
very rarely triggered, so reproducing the problem is hard. I'm still not
convinced that just locking the anon_vma isn't the answer there. If it
takes the lock, and since expand_downwards already locks, then together
with the fork-based patch I think the races might be closed, but I'm not
100% certain yet.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2010-04-26 11:37 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-23  3:01 [BUGFIX][mm][PATCH] fix migration race in rmap_walk KAMEZAWA Hiroyuki
2010-04-23  3:01 ` KAMEZAWA Hiroyuki
2010-04-23  5:11 ` Minchan Kim
2010-04-23  5:11   ` Minchan Kim
2010-04-23  5:27   ` KAMEZAWA Hiroyuki
2010-04-23  5:27     ` KAMEZAWA Hiroyuki
2010-04-23  7:00     ` Minchan Kim
2010-04-23  7:00       ` Minchan Kim
2010-04-23  7:17       ` KAMEZAWA Hiroyuki
2010-04-23  7:17         ` KAMEZAWA Hiroyuki
2010-04-23  7:53         ` Minchan Kim
2010-04-23  7:53           ` Minchan Kim
2010-04-23  7:55           ` KAMEZAWA Hiroyuki
2010-04-23  7:55             ` KAMEZAWA Hiroyuki
2010-04-23  9:59 ` Mel Gorman
2010-04-23  9:59   ` Mel Gorman
2010-04-23 15:58   ` Mel Gorman
2010-04-23 15:58     ` Mel Gorman
2010-04-24  2:02     ` KAMEZAWA Hiroyuki
2010-04-24  2:02       ` KAMEZAWA Hiroyuki
2010-04-24 10:43       ` Mel Gorman
2010-04-24 10:43         ` Mel Gorman
2010-04-25 23:49         ` KAMEZAWA Hiroyuki
2010-04-25 23:49           ` KAMEZAWA Hiroyuki
2010-04-26  2:53           ` KAMEZAWA Hiroyuki
2010-04-26  2:53             ` KAMEZAWA Hiroyuki
2010-04-26  4:31             ` Minchan Kim
2010-04-26  4:31               ` Minchan Kim
2010-04-26  4:06           ` Minchan Kim
2010-04-26  4:06             ` Minchan Kim
2010-04-26  9:28           ` KAMEZAWA Hiroyuki
2010-04-26  9:28             ` KAMEZAWA Hiroyuki
2010-04-26  9:48             ` Minchan Kim
2010-04-26  9:48               ` Minchan Kim
2010-04-26  9:49               ` KAMEZAWA Hiroyuki
2010-04-26  9:49                 ` KAMEZAWA Hiroyuki
2010-04-26 10:07                 ` Minchan Kim
2010-04-26 10:07                   ` Minchan Kim
2010-04-26 11:36               ` Mel Gorman
2010-04-26 11:36                 ` Mel Gorman
2010-04-26  4:00     ` Minchan Kim
2010-04-26  4:00       ` Minchan Kim
