All of lore.kernel.org
* mm: kernel BUG at mm/mprotect.c:149
@ 2014-03-06 21:12 ` Sasha Levin
From: Sasha Levin @ 2014-03-06 21:12 UTC (permalink / raw)
  To: Rik van Riel; +Cc: David Rientjes, Andrew Morton, linux-mm, LKML

Hi Rik,

While fuzzing with trinity inside a KVM tools guest running the latest -next kernel, I've hit the
following spew. It seems to have been introduced by your patch "mm,numa: reorganize change_pmd_range()".

[  886.745765] kernel BUG at mm/mprotect.c:149!
[  886.746831] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  886.748511] Dumping ftrace buffer:
[  886.749998]    (ftrace buffer empty)
[  886.750110] Modules linked in:
[  886.751396] CPU: 20 PID: 26219 Comm: trinity-c216 Tainted: G        W    3.14.0-rc5-next-20140305-sasha-00011-ge06f5f3-dirty #105
[  886.751396] task: ffff8800b6c80000 ti: ffff880228436000 task.ti: ffff880228436000
[  886.751396] RIP: 0010:[<ffffffff812aab33>]  [<ffffffff812aab33>] change_protection_range+0x3b3/0x500
[  886.751396] RSP: 0000:ffff880228437da8  EFLAGS: 00010282
[  886.751396] RAX: 8000000527c008e5 RBX: 00007f647916e000 RCX: 0000000000000000
[  886.751396] RDX: ffff8802ef488e40 RSI: 00007f6479000000 RDI: 8000000527c008e5
[  886.751396] RBP: ffff880228437e78 R08: 0000000000000000 R09: 0000000000000000
[  886.751396] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802ef488e40
[  886.751396] R13: 00007f6479000000 R14: 00007f647916e000 R15: 00007f646e34e000
[  886.751396] FS:  00007f64b28d4700(0000) GS:ffff88052ba00000(0000) knlGS:0000000000000000
[  886.751396] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  886.751396] CR2: 00007f64aed83af8 CR3: 0000000206e52000 CR4: 00000000000006a0
[  886.751396] Stack:
[  886.751396]  ffff880200000001 ffffffff8447f152 ffff880228437dd8 ffff880228af3000
[  886.769142]  00007f646e34e000 00007f647916dfff 0000000000000000 00007f647916e000
[  886.769142]  ffff880206e527f0 00007f647916dfff 0000000000000000 0000000000000000
[  886.769142] Call Trace:
[  886.769142]  [<ffffffff8447f152>] ? preempt_count_sub+0xe2/0x120
[  886.769142]  [<ffffffff812aaca5>] change_protection+0x25/0x30
[  886.769142]  [<ffffffff812c3eeb>] change_prot_numa+0x1b/0x30
[  886.769142]  [<ffffffff8118df49>] task_numa_work+0x279/0x360
[  886.769142]  [<ffffffff8116c57e>] task_work_run+0xae/0xf0
[  886.769142]  [<ffffffff8106ffbe>] do_notify_resume+0x8e/0xe0
[  886.769142]  [<ffffffff8447a93b>] retint_signal+0x4d/0x92
[  886.769142] Code: 49 8b 3c 24 48 83 3d fc 2e ba 04 00 75 12 0f 0b 0f 1f 84 00 00 00 00 00 eb fe 66 0f 1f 44 00 00 48 89 f8 66 66 66 90 84 c0 79 0d <0f> 0b 0f 1f 00 eb fe 66 0f 1f 44 00 00 8b 4d 9c 44 8b 4d 98 89
[  886.769142] RIP  [<ffffffff812aab33>] change_protection_range+0x3b3/0x500
[  886.769142]  RSP <ffff880228437da8>


Thanks,
Sasha



* [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-06 21:12 ` Sasha Levin
@ 2014-03-06 22:31   ` Rik van Riel
From: Rik van Riel @ 2014-03-06 22:31 UTC (permalink / raw)
  To: Sasha Levin
  Cc: David Rientjes, Andrew Morton, linux-mm, LKML, mgorman, hhuang,
	knoel, aarcange

On Thu, 06 Mar 2014 16:12:28 -0500
Sasha Levin <sasha.levin@oracle.com> wrote:

> While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the
> following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()".

That patch should not introduce any functional changes, except for
the VM_BUG_ON that catches the fact that we fell through to the 4kB
pte handling code, despite having just handled a THP pmd...

Does this patch fix the issue?

Mel, am I overlooking anything obvious? :)

---8<---

Subject: mm,numa,mprotect: always continue after finding a stable thp page

When turning a THP pmd into a NUMA one, change_huge_pmd() will
return 0 when the pmd is already a NUMA pmd.

However, change_pmd_range() would fall through to the code that
handles 4kB pages, instead of continuing on to the next pmd.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
---
 mm/mprotect.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 61f0a07..4746608 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -138,8 +138,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 						pages += HPAGE_PMD_NR;
 						nr_huge_updates++;
 					}
-					continue;
 				}
+				continue;
 			}
 			/* fall through, the trans huge pmd just split */
 		}



* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-06 22:31   ` Rik van Riel
@ 2014-03-06 22:52     ` Rik van Riel
From: Rik van Riel @ 2014-03-06 22:52 UTC (permalink / raw)
  To: Sasha Levin
  Cc: David Rientjes, Andrew Morton, linux-mm, LKML, mgorman, hhuang,
	knoel, aarcange

On 03/06/2014 05:31 PM, Rik van Riel wrote:
> On Thu, 06 Mar 2014 16:12:28 -0500
> Sasha Levin <sasha.levin@oracle.com> wrote:
>
>> While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the
>> following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()".
>
> That patch should not introduce any functional changes, except for
> the VM_BUG_ON that catches the fact that we fell through to the 4kB
> pte handling code, despite having just handled a THP pmd...
>
> Does this patch fix the issue?
>
> Mel, am I overlooking anything obvious? :)
>
> ---8<---
>
> Subject: mm,numa,mprotect: always continue after finding a stable thp page
>
> When turning a thp pmds into a NUMA one, change_huge_pmd will
> return 0 when the pmd already is a NUMA pmd.

I did miss something obvious.  In this case, the code returns 1.

> However, change_pmd_range would fall through to the code that
> handles 4kB pages, instead of continuing on to the next pmd.

Maybe the case that I missed is when khugepaged is in the
process of collapsing pages into a transparent huge page?

If the virtual CPU gets de-scheduled by the host for long
enough, it would be possible for khugepaged to run on
another virtual CPU, and turn the pmd into a THP pmd,
before that VM_BUG_ON test.

I see that khugepaged takes the mmap_sem for writing in the
collapse code, and it looks like task_numa_work takes the
mmap_sem for reading, so I guess that may not be possible?

Andrea, would you happen to know what case I am missing?



* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-06 22:31   ` Rik van Riel
@ 2014-03-07  1:04     ` Sasha Levin
From: Sasha Levin @ 2014-03-07  1:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: David Rientjes, Andrew Morton, linux-mm, LKML, mgorman, hhuang,
	knoel, aarcange

On 03/06/2014 05:31 PM, Rik van Riel wrote:
> On Thu, 06 Mar 2014 16:12:28 -0500
> Sasha Levin<sasha.levin@oracle.com>  wrote:
>
>> >While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the
>> >following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()".
> That patch should not introduce any functional changes, except for
> the VM_BUG_ON that catches the fact that we fell through to the 4kB
> pte handling code, despite having just handled a THP pmd...
>
> Does this patch fix the issue?

I'm seeing a different issue with this patch:

[  625.886532] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[  625.888056] IP: [<ffffffff811aca2c>] __lock_acquire+0xbc/0x580
[  625.888969] PGD 427842067 PUD 41b321067 PMD 0
[  625.889775] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  625.890773] Dumping ftrace buffer:
[  625.891454]    (ftrace buffer empty)
[  625.892041] Modules linked in:
[  625.892610] CPU: 9 PID: 18991 Comm: trinity-c198 Tainted: G        W    3.14.0-rc5-next-20140305-sasha-00012-g00c5c8f-dirty #110
[  625.894293] task: ffff8804278cb000 ti: ffff88041d89a000 task.ti: ffff88041d89a000
[  625.895678] RIP: 0010:[<ffffffff811aca2c>]  [<ffffffff811aca2c>] __lock_acquire+0xbc/0x580
[  625.896232] RSP: 0018:ffff88041d89bbe8  EFLAGS: 00010002
[  625.896232] RAX: 0000000000000082 RBX: 0000000000000018 RCX: 0000000000000000
[  625.896232] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
[  625.896232] RBP: ffff88041d89bc58 R08: 0000000000000001 R09: 0000000000000000
[  625.896232] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8804278cb000
[  625.896232] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[  625.896232] FS:  00007f144ab81700(0000) GS:ffff880a2b800000(0000) knlGS:0000000000000000
[  625.896232] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  625.896232] CR2: 0000000000000018 CR3: 00000004290cf000 CR4: 00000000000006a0
[  625.896232] Stack:
[  625.896232]  ffff8802295abc18 ffff8804278cb000 ffff880a2b9d84c0 ffff880a2b9d84d0
[  625.896232]  ffff88041d89bc18 ffffffff810bb084 ffff88041d89bc28 ffffffff8107912d
[  625.896232]  ffff88041d89bc58 ffff8804278cb000 0000000000000000 0000000000000001
[  625.896232] Call Trace:
[  625.896232]  [<ffffffff810bb084>] ? kvm_clock_read+0x24/0x50
[  625.896232]  [<ffffffff8107912d>] ? sched_clock+0x1d/0x30
[  625.896232]  [<ffffffff811ad072>] lock_acquire+0x182/0x1d0
[  625.896232]  [<ffffffff812aa473>] ? change_pte_range+0xa3/0x3b0
[  625.896232]  [<ffffffff8107912d>] ? sched_clock+0x1d/0x30
[  625.896232]  [<ffffffff844793a0>] _raw_spin_lock+0x40/0x80
[  625.896232]  [<ffffffff812aa473>] ? change_pte_range+0xa3/0x3b0
[  625.896232]  [<ffffffff812aa473>] change_pte_range+0xa3/0x3b0
[  625.896232]  [<ffffffff812aab28>] change_protection_range+0x3a8/0x4d0
[  625.896232]  [<ffffffff8447f152>] ? preempt_count_sub+0xe2/0x120
[  625.896232]  [<ffffffff812aac75>] change_protection+0x25/0x30
[  625.896232]  [<ffffffff812c3ebb>] change_prot_numa+0x1b/0x30
[  625.896232]  [<ffffffff8118df49>] task_numa_work+0x279/0x360
[  625.896232]  [<ffffffff8116c57e>] task_work_run+0xae/0xf0
[  625.896232]  [<ffffffff8106ffbe>] do_notify_resume+0x8e/0xe0
[  625.896232]  [<ffffffff84484422>] int_signal+0x12/0x17
[  625.896232] Code: c2 37 f6 6b 85 be fa 0b 00 00 48 c7 c7 a1 4f 6c 85 e8 09 60 f9 ff 31 c0 e9 9c 04 00 00 66 90 44 8b 1d 39 1f cd 04 45 85 db 74 0c <48> 81 3b 40 52 76 87 75 06 0f 1f 00 45 31 c0 83 fe 01 77 0c 89
[  625.896232] RIP  [<ffffffff811aca2c>] __lock_acquire+0xbc/0x580
[  625.896232]  RSP <ffff88041d89bbe8>
[  625.896232] CR2: 0000000000000018


Thanks,
Sasha



* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-06 22:52     ` Rik van Riel
@ 2014-03-07 14:06       ` Mel Gorman
From: Mel Gorman @ 2014-03-07 14:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Sasha Levin, David Rientjes, Andrew Morton, linux-mm, LKML,
	hhuang, knoel, aarcange

On Thu, Mar 06, 2014 at 05:52:47PM -0500, Rik van Riel wrote:
> On 03/06/2014 05:31 PM, Rik van Riel wrote:
> >On Thu, 06 Mar 2014 16:12:28 -0500
> >Sasha Levin <sasha.levin@oracle.com> wrote:
> >
> >>While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the
> >>following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()".
> >
> >That patch should not introduce any functional changes, except for
> >the VM_BUG_ON that catches the fact that we fell through to the 4kB
> >pte handling code, despite having just handled a THP pmd...
> >
> >Does this patch fix the issue?
> >
> >Mel, am I overlooking anything obvious? :)
> >
> >---8<---
> >
> >Subject: mm,numa,mprotect: always continue after finding a stable thp page
> >
> >When turning a thp pmds into a NUMA one, change_huge_pmd will
> >return 0 when the pmd already is a NUMA pmd.
> 
> I did miss something obvious.  In this case, the code returns 1.
> 
> >However, change_pmd_range would fall through to the code that
> >handles 4kB pages, instead of continuing on to the next pmd.
> 
> Maybe the case that I missed is when khugepaged is in the
> process of collapsing pages into a transparent huge page?
> 
> If the virtual CPU gets de-scheduled by the host for long
> enough, it would be possible for khugepaged to run on
> another virtual CPU, and turn the pmd into a THP pmd,
> before that VM_BUG_ON test.
> 
> I see that khugepaged takes the mmap_sem for writing in the
> collapse code, and it looks like task_numa_work takes the
> mmap_sem for reading, so I guess that may not be possible?
> 

mmap_sem will prevent a parallel collapse but what prevents something
like the following?

							do_huge_pmd_wp_page
change_pmd_range
if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd))
	continue;
							pmdp_clear_flush(vma, haddr, pmd);
if (pmd_trans_huge(*pmd)) {
	.... path not taken ....
}
							page_add_new_anon_rmap(new_page, vma, haddr);
							set_pmd_at(mm, haddr, pmd, entry);
VM_BUG_ON(pmd_trans_huge(*pmd));

We do not hold the page table lock during the pmd_trans_huge() check, and
we do not recheck it under the PTL in change_pte_range().

-- 
Mel Gorman
SUSE Labs



* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-07 14:06       ` Mel Gorman
@ 2014-03-07 15:09         ` Mel Gorman
  -1 siblings, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2014-03-07 15:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Sasha Levin, David Rientjes, Andrew Morton, linux-mm, LKML,
	hhuang, knoel, aarcange

On Fri, Mar 07, 2014 at 02:06:50PM +0000, Mel Gorman wrote:
> On Thu, Mar 06, 2014 at 05:52:47PM -0500, Rik van Riel wrote:
> > On 03/06/2014 05:31 PM, Rik van Riel wrote:
> > >On Thu, 06 Mar 2014 16:12:28 -0500
> > >Sasha Levin <sasha.levin@oracle.com> wrote:
> > >
> > >>While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the
> > >>following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()".
> > >
> > >That patch should not introduce any functional changes, except for
> > >the VM_BUG_ON that catches the fact that we fell through to the 4kB
> > >pte handling code, despite having just handled a THP pmd...
> > >
> > >Does this patch fix the issue?
> > >
> > >Mel, am I overlooking anything obvious? :)
> > >
> > >---8<---
> > >
> > >Subject: mm,numa,mprotect: always continue after finding a stable thp page
> > >
> > >When turning a thp pmd into a NUMA one, change_huge_pmd will
> > >return 0 when the pmd already is a NUMA pmd.
> > 
> > I did miss something obvious.  In this case, the code returns 1.
> > 
> > >However, change_pmd_range would fall through to the code that
> > >handles 4kB pages, instead of continuing on to the next pmd.
> > 
> > Maybe the case that I missed is when khugepaged is in the
> > process of collapsing pages into a transparent huge page?
> > 
> > If the virtual CPU gets de-scheduled by the host for long
> > enough, it would be possible for khugepaged to run on
> > another virtual CPU, and turn the pmd into a THP pmd,
> > before that VM_BUG_ON test.
> > 
> > I see that khugepaged takes the mmap_sem for writing in the
> > collapse code, and it looks like task_numa_work takes the
> > mmap_sem for reading, so I guess that may not be possible?
> > 
> 
> mmap_sem will prevent a parallel collapse but what prevents something
> like the following?
> 
> 							do_huge_pmd_wp_page
> change_pmd_range
> if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd))
> 	continue;
> 							pmdp_clear_flush(vma, haddr, pmd);
> if (pmd_trans_huge(*pmd)) {
> 	.... path not taken ....
> }
> 							page_add_new_anon_rmap(new_page, vma, haddr);
> 							set_pmd_at(mm, haddr, pmd, entry);
> VM_BUG_ON(pmd_trans_huge(*pmd));
> 
> We do not hold the page table lock during the pmd_trans_huge check and we
> do not recheck it under the PTL in change_pte_range().
> 

This is a completely untested prototype. It rechecks pmd_trans_huge
under the lock and falls through if it hits a parallel split. It's not
perfect because it could decide to fall through just because there was
no prot_numa work to do, but it's for illustration purposes. Secondly,
I noted that you are calling invalidate for every pmd range. Is that not
a lot of invalidations? We could achieve the same by just tracking the
address of the first invalidation.

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2afc40e..a0050fc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -46,6 +46,20 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	unsigned long pages = 0;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+
+	/* transhuge page could have faulted in parallel */
+	if (unlikely(pmd_trans_huge(*pmd))) {
+		int nr_ptes;
+
+		pte_unmap_unlock(pte, ptl);
+		nr_ptes = change_huge_pmd(vma, pmd, addr, newprot, prot_numa);
+		if (nr_ptes)
+			return nr_ptes;
+
+		/* Page was split so retake the ptl and handle ptes */
+		pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	}
+
 	arch_enter_lazy_mmu_mode();
 	do {
 		oldpte = *pte;
@@ -106,14 +120,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, int dirty_accountable, int prot_numa,
+		unsigned long mni_start, unsigned long mni_end)
 {
 	pmd_t *pmd;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long next;
 	unsigned long pages = 0;
 	unsigned long nr_huge_updates = 0;
-	unsigned long mni_start = 0;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -124,10 +138,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			continue;
 
 		/* invoke the mmu notifier if the pmd is populated */
-		if (!mni_start) {
-			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start, end);
-		}
+		if (!mni_start)
+			mmu_notifier_invalidate_range_start(mm, addr, mni_end);
 
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
@@ -141,8 +153,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 						pages += HPAGE_PMD_NR;
 						nr_huge_updates++;
 					}
-					continue;
 				}
+				continue;
 			}
 			/* fall through, the trans huge pmd just split */
 		}
@@ -152,9 +164,6 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pages += this_pages;
 	} while (pmd++, addr = next, addr != end);
 
-	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end);
-
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
 	return pages;
@@ -167,16 +176,25 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 	pud_t *pud;
 	unsigned long next;
 	unsigned long pages = 0;
+	unsigned long mni_start = 0;
 
 	pud = pud_offset(pgd, addr);
 	do {
+		unsigned long this_pages;
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		pages += change_pmd_range(vma, pud, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+		this_pages = change_pmd_range(vma, pud, addr, next, newprot,
+				 dirty_accountable, prot_numa, mni_start,
+				 end);
+		if (this_pages)
+			mni_start = addr;
+		pages += this_pages;
 	} while (pud++, addr = next, addr != end);
 
+	if (mni_start)
+		mmu_notifier_invalidate_range_end(vma->vm_mm, mni_start, end);
+
 	return pages;
 }
 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-07 15:09         ` Mel Gorman
@ 2014-03-07 15:13           ` Mel Gorman
  -1 siblings, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2014-03-07 15:13 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Sasha Levin, David Rientjes, Andrew Morton, linux-mm, LKML,
	hhuang, knoel, aarcange

On Fri, Mar 07, 2014 at 03:09:23PM +0000, Mel Gorman wrote:
> On Fri, Mar 07, 2014 at 02:06:50PM +0000, Mel Gorman wrote:
> > On Thu, Mar 06, 2014 at 05:52:47PM -0500, Rik van Riel wrote:
> > > On 03/06/2014 05:31 PM, Rik van Riel wrote:
> > > >On Thu, 06 Mar 2014 16:12:28 -0500
> > > >Sasha Levin <sasha.levin@oracle.com> wrote:
> > > >
> > > >>While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the
> > > >>following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()".
> > > >
> > > >That patch should not introduce any functional changes, except for
> > > >the VM_BUG_ON that catches the fact that we fell through to the 4kB
> > > >pte handling code, despite having just handled a THP pmd...
> > > >
> > > >Does this patch fix the issue?
> > > >
> > > >Mel, am I overlooking anything obvious? :)
> > > >
> > > >---8<---
> > > >
> > > >Subject: mm,numa,mprotect: always continue after finding a stable thp page
> > > >
> > > >When turning a thp pmd into a NUMA one, change_huge_pmd will
> > > >return 0 when the pmd already is a NUMA pmd.
> > > 
> > > I did miss something obvious.  In this case, the code returns 1.
> > > 
> > > >However, change_pmd_range would fall through to the code that
> > > >handles 4kB pages, instead of continuing on to the next pmd.
> > > 
> > > Maybe the case that I missed is when khugepaged is in the
> > > process of collapsing pages into a transparent huge page?
> > > 
> > > If the virtual CPU gets de-scheduled by the host for long
> > > enough, it would be possible for khugepaged to run on
> > > another virtual CPU, and turn the pmd into a THP pmd,
> > > before that VM_BUG_ON test.
> > > 
> > > I see that khugepaged takes the mmap_sem for writing in the
> > > collapse code, and it looks like task_numa_work takes the
> > > mmap_sem for reading, so I guess that may not be possible?
> > > 
> > 
> > mmap_sem will prevent a parallel collapse but what prevents something
> > like the following?
> > 
> > 							do_huge_pmd_wp_page
> > change_pmd_range
> > if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd))
> > 	continue;
> > 							pmdp_clear_flush(vma, haddr, pmd);
> > if (pmd_trans_huge(*pmd)) {
> > 	.... path not taken ....
> > }
> > 							page_add_new_anon_rmap(new_page, vma, haddr);
> > 							set_pmd_at(mm, haddr, pmd, entry);
> > VM_BUG_ON(pmd_trans_huge(*pmd));
> > 
> > We do not hold the page table lock during the pmd_trans_huge check and we
> > do not recheck it under the PTL in change_pte_range().
> > 
> 
> This is a completely untested prototype. It rechecks pmd_trans_huge
> under the lock and falls through if it hit a parallel split. It's not
> perfect because it could decide to fall through just because there was
> no prot_numa work to do but it's for illustration purposes.

Nope, if there was no prot_numa work to do, it returns 1, so the check is
ok. It's only if 0 is returned that we need to fall through and really
handle the ptes.
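
That convention can be pinned down with a tiny simulation (a hypothetical
stub; the names below are invented for illustration and are not the real
mm/huge_memory.c code): a nonzero return means the huge pmd was handled,
including the nothing-to-do case, and only a zero return, meaning the pmd
was split, should reach the pte path.

```c
#include <stdbool.h>

/* Hypothetical model of the return convention discussed above;
 * change_huge_pmd_stub() stands in for change_huge_pmd() and is not
 * the real kernel function. */
enum pmd_state { PMD_HUGE_NUMA, PMD_HUGE_PLAIN, PMD_SPLIT };

static int change_huge_pmd_stub(enum pmd_state s)
{
	switch (s) {
	case PMD_HUGE_NUMA:	return 1; /* already NUMA: no work, handled */
	case PMD_HUGE_PLAIN:	return 1; /* converted to a NUMA pmd */
	case PMD_SPLIT:		return 0; /* split under us: handle ptes */
	}
	return 0;
}

/* Caller logic mirroring change_pmd_range(): only a zero return falls
 * through to the 4kB pte-handling code. */
static bool falls_through_to_ptes(enum pmd_state s)
{
	return change_huge_pmd_stub(s) == 0;
}
```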

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-07 15:09         ` Mel Gorman
@ 2014-03-07 18:27           ` Mel Gorman
  -1 siblings, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2014-03-07 18:27 UTC (permalink / raw)
  To: Rik van Riel, Sasha Levin
  Cc: David Rientjes, Andrew Morton, linux-mm, LKML, hhuang, knoel, aarcange

On Fri, Mar 07, 2014 at 03:09:23PM +0000, Mel Gorman wrote:
> On Fri, Mar 07, 2014 at 02:06:50PM +0000, Mel Gorman wrote:
> > On Thu, Mar 06, 2014 at 05:52:47PM -0500, Rik van Riel wrote:
> > > On 03/06/2014 05:31 PM, Rik van Riel wrote:
> > > >On Thu, 06 Mar 2014 16:12:28 -0500
> > > >Sasha Levin <sasha.levin@oracle.com> wrote:
> > > >
> > > >>While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the
> > > >>following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()".
> > > >
> > > >That patch should not introduce any functional changes, except for
> > > >the VM_BUG_ON that catches the fact that we fell through to the 4kB
> > > >pte handling code, despite having just handled a THP pmd...
> > > >
> > > >Does this patch fix the issue?
> > > >
> > > >Mel, am I overlooking anything obvious? :)
> > > >
> > > >---8<---
> > > >
> > > >Subject: mm,numa,mprotect: always continue after finding a stable thp page
> > > >
> > > >When turning a thp pmd into a NUMA one, change_huge_pmd will
> > > >return 0 when the pmd already is a NUMA pmd.
> > > 
> > > I did miss something obvious.  In this case, the code returns 1.
> > > 
> > > >However, change_pmd_range would fall through to the code that
> > > >handles 4kB pages, instead of continuing on to the next pmd.
> > > 
> > > Maybe the case that I missed is when khugepaged is in the
> > > process of collapsing pages into a transparent huge page?
> > > 
> > > If the virtual CPU gets de-scheduled by the host for long
> > > enough, it would be possible for khugepaged to run on
> > > another virtual CPU, and turn the pmd into a THP pmd,
> > > before that VM_BUG_ON test.
> > > 
> > > I see that khugepaged takes the mmap_sem for writing in the
> > > collapse code, and it looks like task_numa_work takes the
> > > mmap_sem for reading, so I guess that may not be possible?
> > > 
> > 
> > mmap_sem will prevent a parallel collapse but what prevents something
> > like the following?
> > 
> > 							do_huge_pmd_wp_page
> > change_pmd_range
> > if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd))
> > 	continue;
> > 							pmdp_clear_flush(vma, haddr, pmd);
> > if (pmd_trans_huge(*pmd)) {
> > 	.... path not taken ....
> > }
> > 							page_add_new_anon_rmap(new_page, vma, haddr);
> > 							set_pmd_at(mm, haddr, pmd, entry);
> > VM_BUG_ON(pmd_trans_huge(*pmd));
> > 
> > We do not hold the page table lock during the pmd_trans_huge check and we
> > do not recheck it under the PTL in change_pte_range().
> > 
> 
> This is a completely untested prototype. It rechecks pmd_trans_huge
> under the lock and falls through if it hits a parallel split. It's not
> perfect because it could decide to fall through just because there was
> no prot_numa work to do, but it's for illustration purposes. Secondly,
> I noted that you are calling invalidate for every pmd range. Is that not
> a lot of invalidations? We could achieve the same by just tracking the
> address of the first invalidation.
> 

And there were other minor issues. This is still untested, but Sasha,
can you try it out please? I discussed this with Rik on IRC for a bit and
reckon this should be sufficient if the correct race has been identified.

The race can only really happen for prot_numa updates, and it's ok to bail
on those updates if a race occurs because all we miss is a few hinting
faults. That simplifies the patch considerably, but I threw in some
comments to explain it.

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2afc40e..72061a2 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -46,6 +46,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	unsigned long pages = 0;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+
+	/*
+	 * For a prot_numa update we only hold mmap_sem for read so there is a
+	 * potential race with faulting where a pmd was temporarily none so
+	 * recheck it under the lock and bail if we race
+	 */
+	if (prot_numa && unlikely(pmd_trans_huge(*pmd))) {
+		pte_unmap_unlock(pte, ptl);
+		return 0;
+	}
+
 	arch_enter_lazy_mmu_mode();
 	do {
 		oldpte = *pte;
@@ -141,12 +152,13 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 						pages += HPAGE_PMD_NR;
 						nr_huge_updates++;
 					}
+
+					/* huge pmd was handled */
 					continue;
 				}
 			}
 			/* fall through, the trans huge pmd just split */
 		}
-		VM_BUG_ON(pmd_trans_huge(*pmd));
 		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
 				 dirty_accountable, prot_numa);
 		pages += this_pages;

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-07 18:27           ` Mel Gorman
@ 2014-03-11 16:28             ` Mel Gorman
  -1 siblings, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2014-03-11 16:28 UTC (permalink / raw)
  To: Rik van Riel, Sasha Levin
  Cc: David Rientjes, Andrew Morton, linux-mm, LKML, hhuang, knoel, aarcange

On Fri, Mar 07, 2014 at 06:27:45PM +0000, Mel Gorman wrote:
> > This is a completely untested prototype. It rechecks pmd_trans_huge
> > under the lock and falls through if it hits a parallel split. It's not
> > perfect because it could decide to fall through just because there was
> > no prot_numa work to do, but it's for illustration purposes. Secondly,
> > I noted that you are calling invalidate for every pmd range. Is that not
> > a lot of invalidations? We could achieve the same by just tracking the
> > address of the first invalidation.
> > 
> 
> And there were other minor issues. This is still untested but Sasha,
> can you try it out please? I discussed this with Rik on IRC for a bit and
> reckon this should be sufficient if the correct race has been identified.
> 

Any luck with this patch, Sasha? It passed basic tests here, but I had
not seen the issue trigger either.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-11 16:28             ` Mel Gorman
@ 2014-03-11 16:51               ` Sasha Levin
  -1 siblings, 0 replies; 52+ messages in thread
From: Sasha Levin @ 2014-03-11 16:51 UTC (permalink / raw)
  To: Mel Gorman, Rik van Riel
  Cc: David Rientjes, Andrew Morton, linux-mm, LKML, hhuang, knoel, aarcange

On 03/11/2014 12:28 PM, Mel Gorman wrote:
> On Fri, Mar 07, 2014 at 06:27:45PM +0000, Mel Gorman wrote:
>>> This is a completely untested prototype. It rechecks pmd_trans_huge
>>> under the lock and falls through if it hit a parallel split. It's not
>>> perfect because it could decide to fall through just because there was
>>> no prot_numa work to do but it's for illustration purposes. Secondly,
>>> I noted that you are calling invalidate for every pmd range. Is that not
>>> a lot of invalidations? We could do the same by just tracking the address
>>> of the first invalidation.
>>>
>>
>> And there were other minor issues. This is still untested but Sasha,
>> can you try it out please? I discussed this with Rik on IRC for a bit and
>> reckon this should be sufficient if the correct race has been identified.
>>
>
> Any luck with this patch Sasha? It passed basic tests here but I had not
> seen the issue trigger either.
>

Sorry, I've been stuck in my weekend project of getting lockdep to work with page locks :)

It takes a moment to test, so just to be sure - I should have only this last patch applied?
Without the one in the original mail?


Thanks,
Sasha


* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-11 16:51               ` Sasha Levin
@ 2014-03-11 17:00                 ` Rik van Riel
  -1 siblings, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2014-03-11 17:00 UTC (permalink / raw)
  To: Sasha Levin, Mel Gorman
  Cc: David Rientjes, Andrew Morton, linux-mm, LKML, hhuang, knoel, aarcange

On 03/11/2014 12:51 PM, Sasha Levin wrote:
> On 03/11/2014 12:28 PM, Mel Gorman wrote:
>> On Fri, Mar 07, 2014 at 06:27:45PM +0000, Mel Gorman wrote:
>>>> This is a completely untested prototype. It rechecks pmd_trans_huge
>>>> under the lock and falls through if it hit a parallel split. It's not
>>>> perfect because it could decide to fall through just because there was
>>>> no prot_numa work to do but it's for illustration purposes. Secondly,
>>>> I noted that you are calling invalidate for every pmd range. Is that
>>>> not
>>>> a lot of invalidations? We could do the same by just tracking the
>>>> address
>>>> of the first invalidation.
>>>>
>>>
>>> And there were other minor issues. This is still untested but Sasha,
>>> can you try it out please? I discussed this with Rik on IRC for a bit
>>> and
>>> reckon this should be sufficient if the correct race has been
>>> identified.
>>>
>>
>> Any luck with this patch Sasha? It passed basic tests here but I had not
>> seen the issue trigger either.
>>
>
> Sorry, I've been stuck in my weekend project of getting lockdep to work
> with page locks :)
>
> It takes a moment to test, so just to be sure - I should have only this
> last patch applied?
> Without the one in the original mail?

Indeed, only this patch should do it.



* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-11 17:33                   ` Sasha Levin
@ 2014-03-11 17:32                     ` Rik van Riel
  -1 siblings, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2014-03-11 17:32 UTC (permalink / raw)
  To: Sasha Levin, Mel Gorman
  Cc: David Rientjes, Andrew Morton, linux-mm, LKML, hhuang, knoel, aarcange

On 03/11/2014 01:33 PM, Sasha Levin wrote:
> On 03/11/2014 01:00 PM, Rik van Riel wrote:
>> On 03/11/2014 12:51 PM, Sasha Levin wrote:
>>> On 03/11/2014 12:28 PM, Mel Gorman wrote:
>>>> On Fri, Mar 07, 2014 at 06:27:45PM +0000, Mel Gorman wrote:
>>>>>> This is a completely untested prototype. It rechecks pmd_trans_huge
>>>>>> under the lock and falls through if it hit a parallel split. It's not
>>>>>> perfect because it could decide to fall through just because there
>>>>>> was
>>>>>> no prot_numa work to do but it's for illustration purposes. Secondly,
>>>>>> I noted that you are calling invalidate for every pmd range. Is that
>>>>>> not
>>>>>> a lot of invalidations? We could do the same by just tracking the
>>>>>> address
>>>>>> of the first invalidation.
>>>>>>
>>>>>
>>>>> And there were other minor issues. This is still untested but Sasha,
>>>>> can you try it out please? I discussed this with Rik on IRC for a bit
>>>>> and
>>>>> reckon this should be sufficient if the correct race has been
>>>>> identified.
>>>>>
>>>>
>>>> Any luck with this patch Sasha? It passed basic tests here but I had
>>>> not
>>>> seen the issue trigger either.
>>>>
>>>
>>> Sorry, I've been stuck in my weekend project of getting lockdep to work
>>> with page locks :)
>>>
>>> It takes a moment to test, so just to be sure - I should have only this
>>> last patch applied?
>>> Without the one in the original mail?
>>
>> Indeed, only this patch should do it.
>
> Okay. So just this patch on top of the latest -next shows the following
> issues:

OK, those are all issues with Davidlohr Bueso's
per-thread vma cache patch :)



* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-11 17:00                 ` Rik van Riel
@ 2014-03-11 17:33                   ` Sasha Levin
  -1 siblings, 0 replies; 52+ messages in thread
From: Sasha Levin @ 2014-03-11 17:33 UTC (permalink / raw)
  To: Rik van Riel, Mel Gorman
  Cc: David Rientjes, Andrew Morton, linux-mm, LKML, hhuang, knoel, aarcange

On 03/11/2014 01:00 PM, Rik van Riel wrote:
> On 03/11/2014 12:51 PM, Sasha Levin wrote:
>> On 03/11/2014 12:28 PM, Mel Gorman wrote:
>>> On Fri, Mar 07, 2014 at 06:27:45PM +0000, Mel Gorman wrote:
>>>>> This is a completely untested prototype. It rechecks pmd_trans_huge
>>>>> under the lock and falls through if it hit a parallel split. It's not
>>>>> perfect because it could decide to fall through just because there was
>>>>> no prot_numa work to do but it's for illustration purposes. Secondly,
>>>>> I noted that you are calling invalidate for every pmd range. Is that
>>>>> not
>>>>> a lot of invalidations? We could do the same by just tracking the
>>>>> address
>>>>> of the first invalidation.
>>>>>
>>>>
>>>> And there were other minor issues. This is still untested but Sasha,
>>>> can you try it out please? I discussed this with Rik on IRC for a bit
>>>> and
>>>> reckon this should be sufficient if the correct race has been
>>>> identified.
>>>>
>>>
>>> Any luck with this patch Sasha? It passed basic tests here but I had not
>>> seen the issue trigger either.
>>>
>>
>> Sorry, I've been stuck in my weekend project of getting lockdep to work
>> with page locks :)
>>
>> It takes a moment to test, so just to be sure - I should have only this
>> last patch applied?
>> Without the one in the original mail?
>
> Indeed, only this patch should do it.

Okay. So just this patch on top of the latest -next shows the following issues:

1. BUG in task_numa_work:

[  439.417171] BUG: unable to handle kernel paging request at ffff880e17530c00
[  439.418216] IP: [<ffffffff81299385>] vmacache_find+0x75/0xa0
[  439.419073] PGD 8904067 PUD 1028fcb067 PMD 1028f10067 PTE 8000000e17530060
[  439.420340] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  439.420340] Dumping ftrace buffer:
[  439.420340]    (ftrace buffer empty)
[  439.420340] Modules linked in:
[  439.420340] CPU: 12 PID: 9937 Comm: trinity-c212 Tainted: G        W    3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137
[  439.420340] task: ffff880e1a45b000 ti: ffff880e1a490000 task.ti: ffff880e1a490000
[  439.420340] RIP: 0010:[<ffffffff81299385>]  [<ffffffff81299385>] vmacache_find+0x75/0xa0
[  439.420340] RSP: 0018:ffff880e1a491e68  EFLAGS: 00010286
[  439.420340] RAX: ffff880e17530c00 RBX: 0000000000000000 RCX: 0000000000000001
[  439.420340] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880e1a45b000
[  439.420340] RBP: ffff880e1a491e68 R08: 0000000000000000 R09: 0000000000000000
[  439.420340] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880e1ab75000
[  439.420340] R13: ffff880e1ab75000 R14: 0000000000010000 R15: 0000000000000000
[  439.420340] FS:  00007f3458c05700(0000) GS:ffff880d2b800000(0000) knlGS:0000000000000000
[  439.420340] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  439.420340] CR2: ffff880e17530c00 CR3: 0000000e1a472000 CR4: 00000000000006a0
[  439.420340] Stack:
[  439.420340]  ffff880e1a491e98 ffffffff812a7610 ffffffff8118de40 0000000000000117
[  439.420340]  00000001000036da ffff880e1a45b000 ffff880e1a491ef8 ffffffff8118de4b
[  439.420340]  ffff880e1a491ec8 ffffffff81269575 ffff880e1ab750a8 ffff880e1a45b000
[  439.420340] Call Trace:
[  439.420340]  [<ffffffff812a7610>] find_vma+0x20/0x90
[  439.420340]  [<ffffffff8118de40>] ? task_numa_work+0x130/0x360
[  439.420340]  [<ffffffff8118de4b>] task_numa_work+0x13b/0x360
[  439.420340]  [<ffffffff81269575>] ? context_tracking_user_exit+0x195/0x1d0
[  439.420340]  [<ffffffff8116c5be>] task_work_run+0xae/0xf0
[  439.420340]  [<ffffffff8106ffbe>] do_notify_resume+0x8e/0xe0
[  439.420340]  [<ffffffff844b17a2>] int_signal+0x12/0x17
[  439.420340] Code: 42 10 00 00 00 00 48 c7 42 18 00 00 00 00 eb 38 66 0f 1f 44 00 00 31 d2 48 89 c7 48 63 ca 48 8b 84 cf b8 07 00 00 48 85 c0 74 0b <48> 39 30 77 06 48 3b 70 08 72 12 ff c2 83 fa 04 75 de 66 0f 1f
[  439.420340] RIP  [<ffffffff81299385>] vmacache_find+0x75/0xa0
[  439.420340]  RSP <ffff880e1a491e68>
[  439.420340] CR2: ffff880e17530c00

2. Similar to the above, but with a different trace:

[  304.212158] BUG: unable to handle kernel paging request at ffff88020d37f800
[  304.220420] IP: [<ffffffff81299385>] vmacache_find+0x75/0xa0
[  304.220420] PGD 8904067 PUD 102effb067 PMD 102ef91067 PTE 800000020d37f060
[  304.220420] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  304.220420] Dumping ftrace buffer:
[  304.220420]    (ftrace buffer empty)
[  304.220420] Modules linked in:
[  304.220420] CPU: 30 PID: 11157 Comm: trinity-c393 Tainted: G    B   W    3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137
[  304.220420] task: ffff881020a90000 ti: ffff88101fda2000 task.ti: ffff88101fda2000
[  304.260466] RIP: 0010:[<ffffffff81299385>]  [<ffffffff81299385>] vmacache_find+0x75/0xa0
[  304.260466] RSP: 0000:ffff88101fda3da8  EFLAGS: 00010286
[  304.260466] RAX: ffff88020d37f800 RBX: 00007f7f92f680fc RCX: 0000000000000000
[  304.260466] RDX: 0000000000000000 RSI: 00007f7f92f680fc RDI: ffff881020a90000
[  304.260466] RBP: ffff88101fda3da8 R08: 0000000000000001 R09: 0000000000000000
[  304.260466] R10: 0000000000000000 R11: 0000000000000000 R12: ffff881021358000
[  304.260466] R13: 0000000000000000 R14: ffff8810213580a8 R15: ffff881021358000
[  304.260466] FS:  00007f7f92f8d700(0000) GS:ffff880f2ba00000(0000) knlGS:0000000000000000
[  304.260466] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  304.260466] CR2: ffff88020d37f800 CR3: 000000101fd6b000 CR4: 00000000000006a0
[  304.260466] Stack:
[  304.260466]  ffff88101fda3dd8 ffffffff812a7610 ffffffff844abd82 00007f7f92f680fc
[  304.260466]  00007f7f92f680fc 00000000000000a8 ffff88101fda3ef8 ffffffff844abdd6
[  304.260466]  ffff88101fda3e08 ffffffff811a60d6 0000000000000000 ffff881020a90000
[  304.260466] Call Trace:
[  304.260466]  [<ffffffff812a7610>] find_vma+0x20/0x90
[  304.260466]  [<ffffffff844abd82>] ? __do_page_fault+0x302/0x5d0
[  304.260466]  [<ffffffff844abdd6>] __do_page_fault+0x356/0x5d0
[  304.260466]  [<ffffffff811a60d6>] ? trace_hardirqs_off_caller+0x16/0x1a0
[  304.260466]  [<ffffffff8118ab46>] ? vtime_account_user+0x96/0xb0
[  304.260466]  [<ffffffff844ac4d2>] ? preempt_count_sub+0xe2/0x120
[  304.260466]  [<ffffffff81269567>] ? context_tracking_user_exit+0x187/0x1d0
[  304.260466]  [<ffffffff811a60d6>] ? trace_hardirqs_off_caller+0x16/0x1a0
[  304.260466]  [<ffffffff844ac115>] do_page_fault+0x45/0x70
[  304.260466]  [<ffffffff844ab3c6>] do_async_page_fault+0x36/0x100
[  304.260466]  [<ffffffff844a7f58>] async_page_fault+0x28/0x30
[  304.260466] Code: 42 10 00 00 00 00 48 c7 42 18 00 00 00 00 eb 38 66 0f 1f 44 00 00 31 d2 48 89 c7 48 63 ca 48 8b 84 cf b8 07 00 00 48 85 c0 74 0b <48> 39 30 77 06 48 3b 70 08 72 12 ff c2 83 fa 04 75 de 66 0f 1f
[  304.260466] RIP  [<ffffffff81299385>] vmacache_find+0x75/0xa0
[  304.260466]  RSP <ffff88101fda3da8>
[  304.260466] CR2: ffff88020d37f800

3. This one is a new issue. It might be related to something else, but it seems connected as I've never
seen it before this patch:

[  560.473342] kernel BUG at mm/mmap.c:439!
[  560.473766] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  560.474723] Dumping ftrace buffer:
[  560.475410]    (ftrace buffer empty)
[  560.476098] Modules linked in:
[  560.476241] CPU: 10 PID: 17037 Comm: trinity-c84 Tainted: G        W    3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137
[  560.477651] task: ffff880424310000 ti: ffff8803eec4c000 task.ti: ffff8803eec4c000
[  560.478595] RIP: 0010:[<ffffffff812a6ef5>]  [<ffffffff812a6ef5>] validate_mm+0x115/0x140
[  560.479749] RSP: 0018:ffff8803eec4de98  EFLAGS: 00010296
[  560.480471] RAX: 0000000000000012 RBX: 00007fff8ae88000 RCX: 0000000000000006
[  560.481490] RDX: 0000000000000006 RSI: ffff880424310cf0 RDI: 0000000000000282
[  560.481490] RBP: ffff8803eec4dec8 R08: 0000000000000001 R09: 0000000000000001
[  560.481490] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8803f61aa000
[  560.481490] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000048
[  560.481490] FS:  00007f799ddf6700(0000) GS:ffff880b2b800000(0000) knlGS:0000000000000000
[  560.481490] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  560.481490] CR2: 0000000004502048 CR3: 00000003f8bcb000 CR4: 00000000000006a0
[  560.481490] Stack:
[  560.481490]  ffff8803eec4dec8 0000000000000000 ffff8803f61aa000 ffff8800a22fae00
[  560.481490]  0000000000000000 00007f799d4dc000 ffff8803eec4df28 ffffffff812a8fb7
[  560.481490]  ffffffff00000001 ffff880b21453600 ffff8800a22fae10 ffff8803f61aa008
[  560.481490] Call Trace:
[  560.481490]  [<ffffffff812a8fb7>] do_munmap+0x307/0x360
[  560.493775]  [<ffffffff812a9056>] ? vm_munmap+0x46/0x80
[  560.493775]  [<ffffffff812a9064>] vm_munmap+0x54/0x80
[  560.493775]  [<ffffffff812a90bc>] SyS_munmap+0x2c/0x40
[  560.493775]  [<ffffffff844b1690>] tracesys+0xdd/0xe2
[  560.493775] Code: 32 fd ff ff 41 8b 74 24 58 39 f0 74 19 89 c2 48 c7 c7 19 f5 6d 85 31 c0 e8 45 8c 1f 03 eb 0c 0f 1f 80 00 00 00 00 45 85 ed 74 0d <0f> 0b 66 0f 1f 84 00 00 00 00 00 eb fe 48 83 c4 08 5b 41 5c 41
[  560.493775] RIP  [<ffffffff812a6ef5>] validate_mm+0x115/0x140
[  560.493775]  RSP <ffff8803eec4de98>


Thanks,
Sasha


* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-11 17:33                   ` Sasha Levin
@ 2014-03-11 18:06                     ` Mel Gorman
  -1 siblings, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2014-03-11 18:06 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Rik van Riel, David Rientjes, Andrew Morton, linux-mm, LKML,
	hhuang, knoel, aarcange

On Tue, Mar 11, 2014 at 01:33:00PM -0400, Sasha Levin wrote:
> Okay. So just this patch on top of the latest -next shows the following issues:
> 
> 1. BUG in task_numa_work:
> 
> [  439.417171] BUG: unable to handle kernel paging request at ffff880e17530c00
> [  439.418216] IP: [<ffffffff81299385>] vmacache_find+0x75/0xa0
> [  439.419073] PGD 8904067 PUD 1028fcb067 PMD 1028f10067 PTE 8000000e17530060
> [  439.420340] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [  439.420340] Dumping ftrace buffer:
> [  439.420340]    (ftrace buffer empty)
> [  439.420340] Modules linked in:
> [  439.420340] CPU: 12 PID: 9937 Comm: trinity-c212 Tainted: G        W    3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137
> [  439.420340] task: ffff880e1a45b000 ti: ffff880e1a490000 task.ti: ffff880e1a490000
> [  439.420340] RIP: 0010:[<ffffffff81299385>]  [<ffffffff81299385>] vmacache_find+0x75/0xa0
> [  439.420340] RSP: 0018:ffff880e1a491e68  EFLAGS: 00010286
> [  439.420340] RAX: ffff880e17530c00 RBX: 0000000000000000 RCX: 0000000000000001
> [  439.420340] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880e1a45b000
> [  439.420340] RBP: ffff880e1a491e68 R08: 0000000000000000 R09: 0000000000000000
> [  439.420340] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880e1ab75000
> [  439.420340] R13: ffff880e1ab75000 R14: 0000000000010000 R15: 0000000000000000
> [  439.420340] FS:  00007f3458c05700(0000) GS:ffff880d2b800000(0000) knlGS:0000000000000000
> [  439.420340] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  439.420340] CR2: ffff880e17530c00 CR3: 0000000e1a472000 CR4: 00000000000006a0
> [  439.420340] Stack:
> [  439.420340]  ffff880e1a491e98 ffffffff812a7610 ffffffff8118de40 0000000000000117
> [  439.420340]  00000001000036da ffff880e1a45b000 ffff880e1a491ef8 ffffffff8118de4b
> [  439.420340]  ffff880e1a491ec8 ffffffff81269575 ffff880e1ab750a8 ffff880e1a45b000
> [  439.420340] Call Trace:
> [  439.420340]  [<ffffffff812a7610>] find_vma+0x20/0x90
> [  439.420340]  [<ffffffff8118de40>] ? task_numa_work+0x130/0x360
> [  439.420340]  [<ffffffff8118de4b>] task_numa_work+0x13b/0x360
> [  439.420340]  [<ffffffff81269575>] ? context_tracking_user_exit+0x195/0x1d0
> [  439.420340]  [<ffffffff8116c5be>] task_work_run+0xae/0xf0
> [  439.420340]  [<ffffffff8106ffbe>] do_notify_resume+0x8e/0xe0
> [  439.420340]  [<ffffffff844b17a2>] int_signal+0x12/0x17
> [  439.420340] Code: 42 10 00 00 00 00 48 c7 42 18 00 00 00 00 eb 38 66 0f 1f 44 00 00 31 d2 48 89 c7 48 63 ca 48 8b 84 cf b8 07 00 00 48 85 c0 74 0b <48> 39 30 77 06 48 3b 70 08 72 12 ff c2 83 fa 04 75 de 66 0f 1f
> [  439.420340] RIP  [<ffffffff81299385>] vmacache_find+0x75/0xa0
> [  439.420340]  RSP <ffff880e1a491e68>
> [  439.420340] CR2: ffff880e17530c00
> 

Ok, this does not look related. It looks like damage from the VMA caching
patches, possibly a use-after-free. Based on the second trace you posted,
I'm skeptical that it's related to automatic NUMA balancing as such.

What does addr2line -e vmlinux ffffffff81299385 say? I want to be sure
it looks like a vma dereference without risking a mistake decoding it
myself.

1. Does this bug trigger even if automatic NUMA balancing is disabled?
2. Does this bug trigger if DEBUG_PAGEALLOC is disabled? If it's a
	use-after-free then the bug would still be there.
3. Can you test with the following patches reverted please?

	e15d25d9c827b4346a36a3a78dd566d5ad353402 mm-per-thread-vma-caching-fix-fix
	e440e20dc76803cdab616b4756c201d5c72857f2 mm-per-thread-vma-caching-fix
	0d9ad4220e6d73f63a9eeeaac031b92838f75bb3 mm: per-thread vma caching

The last patch will not revert cleanly (at least it didn't for me) but it
was just a case of git rm'ing the two affected files, removing any include
of vmacache.h and committing the rest.

Thanks!

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-11 18:06                     ` Mel Gorman
@ 2014-03-11 19:18                       ` Sasha Levin
  -1 siblings, 0 replies; 52+ messages in thread
From: Sasha Levin @ 2014-03-11 19:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, David Rientjes, Andrew Morton, linux-mm, LKML,
	hhuang, knoel, aarcange

On 03/11/2014 02:06 PM, Mel Gorman wrote:
> On Tue, Mar 11, 2014 at 01:33:00PM -0400, Sasha Levin wrote:
>> Okay. So just this patch on top of the latest -next shows the following issues:
>>
>> 1. BUG in task_numa_work:
>>
>> [  439.417171] BUG: unable to handle kernel paging request at ffff880e17530c00
>> [  439.418216] IP: [<ffffffff81299385>] vmacache_find+0x75/0xa0
>> [  439.419073] PGD 8904067 PUD 1028fcb067 PMD 1028f10067 PTE 8000000e17530060
>> [  439.420340] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>> [  439.420340] Dumping ftrace buffer:
>> [  439.420340]    (ftrace buffer empty)
>> [  439.420340] Modules linked in:
>> [  439.420340] CPU: 12 PID: 9937 Comm: trinity-c212 Tainted: G        W    3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137
>> [  439.420340] task: ffff880e1a45b000 ti: ffff880e1a490000 task.ti: ffff880e1a490000
>> [  439.420340] RIP: 0010:[<ffffffff81299385>]  [<ffffffff81299385>] vmacache_find+0x75/0xa0
>> [  439.420340] RSP: 0018:ffff880e1a491e68  EFLAGS: 00010286
>> [  439.420340] RAX: ffff880e17530c00 RBX: 0000000000000000 RCX: 0000000000000001
>> [  439.420340] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880e1a45b000
>> [  439.420340] RBP: ffff880e1a491e68 R08: 0000000000000000 R09: 0000000000000000
>> [  439.420340] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880e1ab75000
>> [  439.420340] R13: ffff880e1ab75000 R14: 0000000000010000 R15: 0000000000000000
>> [  439.420340] FS:  00007f3458c05700(0000) GS:ffff880d2b800000(0000) knlGS:0000000000000000
>> [  439.420340] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [  439.420340] CR2: ffff880e17530c00 CR3: 0000000e1a472000 CR4: 00000000000006a0
>> [  439.420340] Stack:
>> [  439.420340]  ffff880e1a491e98 ffffffff812a7610 ffffffff8118de40 0000000000000117
>> [  439.420340]  00000001000036da ffff880e1a45b000 ffff880e1a491ef8 ffffffff8118de4b
>> [  439.420340]  ffff880e1a491ec8 ffffffff81269575 ffff880e1ab750a8 ffff880e1a45b000
>> [  439.420340] Call Trace:
>> [  439.420340]  [<ffffffff812a7610>] find_vma+0x20/0x90
>> [  439.420340]  [<ffffffff8118de40>] ? task_numa_work+0x130/0x360
>> [  439.420340]  [<ffffffff8118de4b>] task_numa_work+0x13b/0x360
>> [  439.420340]  [<ffffffff81269575>] ? context_tracking_user_exit+0x195/0x1d0
>> [  439.420340]  [<ffffffff8116c5be>] task_work_run+0xae/0xf0
>> [  439.420340]  [<ffffffff8106ffbe>] do_notify_resume+0x8e/0xe0
>> [  439.420340]  [<ffffffff844b17a2>] int_signal+0x12/0x17
>> [  439.420340] Code: 42 10 00 00 00 00 48 c7 42 18 00 00 00 00 eb 38 66 0f 1f 44 00 00 31 d2 48 89 c7 48 63 ca 48 8b 84 cf b8 07 00 00 48 85 c0 74 0b <48> 39 30 77 06 48 3b 70 08 72 12 ff c2 83 fa 04 75 de 66 0f 1f
>> [  439.420340] RIP  [<ffffffff81299385>] vmacache_find+0x75/0xa0
>> [  439.420340]  RSP <ffff880e1a491e68>
>> [  439.420340] CR2: ffff880e17530c00
>>
>
> Ok, this does not look related. It looks like damage from the VMA caching
> patches, possibly a use-after free. I'm skeptical that it's related to
> automatic NUMA balancing as such based on the second trace you posted.

It's a bit weird because I hit this trace twice right after applying the patch,
having never seen it previously.

But it's very possible since there are quite a lot of different mm issues around right now.

> What does addr2line -e vmlinux ffffffff81299385 say? I want to be sure
> it looks like a vma dereference without risking making a mistake
> decoding it.

mm/vmacache.c:75

Or:

	if (vma && vma->vm_start <= addr && vma->vm_end > addr)
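The lookup that faults can be modelled in plain userspace C. This is a sketch with hypothetical names (`task_model`, `cache_find`), not the kernel code from mm/vmacache.c; it only illustrates why a stale slot is fatal: a per-thread array caches vma pointers, and a slot left pointing at a freed VMA is dereferenced exactly at the `vma->vm_start` comparison quoted above, which is consistent with the faulting address in the trace being an old vma pointer.

```c
#include <assert.h>
#include <stddef.h>

#define VMACACHE_SIZE 4

struct vm_area_struct {
	unsigned long vm_start;	/* inclusive */
	unsigned long vm_end;	/* exclusive */
};

struct task_model {
	/* small per-thread array of recently used vma pointers */
	struct vm_area_struct *vmacache[VMACACHE_SIZE];
};

/*
 * Model of the check decoded by addr2line: a slot hits when addr falls
 * inside [vm_start, vm_end). If a slot holds a dangling pointer, the
 * vm_start load below is the use-after-free dereference.
 */
static struct vm_area_struct *cache_find(struct task_model *t,
					 unsigned long addr)
{
	for (int i = 0; i < VMACACHE_SIZE; i++) {
		struct vm_area_struct *vma = t->vmacache[i];

		if (vma && vma->vm_start <= addr && vma->vm_end > addr)
			return vma;
	}
	return NULL;
}
```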

> 1. Does this bug trigger even if automatic NUMA balancing is disabled?

(assuming we're talking about kernel BUG at mm/mmap.c:439!)

Yes, same result with numa_balancing=disable.

> 2. Does this bug trigger if DEBUG_PAGEALLOC is disabled? If it's a
> 	use-after free then the bug would still be there.

Still there.

> 3. Can you test with the following patches reverted please?
>
> 	e15d25d9c827b4346a36a3a78dd566d5ad353402 mm-per-thread-vma-caching-fix-fix
> 	e440e20dc76803cdab616b4756c201d5c72857f2 mm-per-thread-vma-caching-fix
> 	0d9ad4220e6d73f63a9eeeaac031b92838f75bb3 mm: per-thread vma caching
>
> The last patch will not revert cleanly (least it didn't for me) but it
> was just a case of git rm the two affected files, remove any include of
> vmacache.h and commit the rest.

Don't see the issues I've reported before now.



Thanks,
Sasha

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-11 19:18                       ` Sasha Levin
@ 2014-03-11 19:28                         ` Andrew Morton
  -1 siblings, 0 replies; 52+ messages in thread
From: Andrew Morton @ 2014-03-11 19:28 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Mel Gorman, Rik van Riel, David Rientjes, linux-mm, LKML, hhuang,
	knoel, aarcange, Davidlohr Bueso

On Tue, 11 Mar 2014 15:18:02 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:

> > 3. Can you test with the following patches reverted please?
> >
> > 	e15d25d9c827b4346a36a3a78dd566d5ad353402 mm-per-thread-vma-caching-fix-fix
> > 	e440e20dc76803cdab616b4756c201d5c72857f2 mm-per-thread-vma-caching-fix
> > 	0d9ad4220e6d73f63a9eeeaac031b92838f75bb3 mm: per-thread vma caching
> >
> > The last patch will not revert cleanly (least it didn't for me) but it
> > was just a case of git rm the two affected files, remove any include of
> > vmacache.h and commit the rest.
> 
> Don't see the issues I've reported before now.

This is foggy.  Do you mean that all the bugs went away when
per-thread-vma-caching was reverted?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-11 18:06                     ` Mel Gorman
@ 2014-03-11 19:31                       ` Davidlohr Bueso
  -1 siblings, 0 replies; 52+ messages in thread
From: Davidlohr Bueso @ 2014-03-11 19:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Sasha Levin, Rik van Riel, David Rientjes, Andrew Morton,
	linux-mm, LKML, hhuang, knoel, aarcange

On Tue, 2014-03-11 at 18:06 +0000, Mel Gorman wrote:
> On Tue, Mar 11, 2014 at 01:33:00PM -0400, Sasha Levin wrote:
> > Okay. So just this patch on top of the latest -next shows the following issues:
> > 
> > 1. BUG in task_numa_work:
> > 
> > [  439.417171] BUG: unable to handle kernel paging request at ffff880e17530c00
> > [  439.418216] IP: [<ffffffff81299385>] vmacache_find+0x75/0xa0
> > [  439.419073] PGD 8904067 PUD 1028fcb067 PMD 1028f10067 PTE 8000000e17530060
> > [  439.420340] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > [  439.420340] Dumping ftrace buffer:
> > [  439.420340]    (ftrace buffer empty)
> > [  439.420340] Modules linked in:
> > [  439.420340] CPU: 12 PID: 9937 Comm: trinity-c212 Tainted: G        W    3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137
> > [  439.420340] task: ffff880e1a45b000 ti: ffff880e1a490000 task.ti: ffff880e1a490000
> > [  439.420340] RIP: 0010:[<ffffffff81299385>]  [<ffffffff81299385>] vmacache_find+0x75/0xa0
> > [  439.420340] RSP: 0018:ffff880e1a491e68  EFLAGS: 00010286
> > [  439.420340] RAX: ffff880e17530c00 RBX: 0000000000000000 RCX: 0000000000000001
> > [  439.420340] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880e1a45b000
> > [  439.420340] RBP: ffff880e1a491e68 R08: 0000000000000000 R09: 0000000000000000
> > [  439.420340] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880e1ab75000
> > [  439.420340] R13: ffff880e1ab75000 R14: 0000000000010000 R15: 0000000000000000
> > [  439.420340] FS:  00007f3458c05700(0000) GS:ffff880d2b800000(0000) knlGS:0000000000000000
> > [  439.420340] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [  439.420340] CR2: ffff880e17530c00 CR3: 0000000e1a472000 CR4: 00000000000006a0
> > [  439.420340] Stack:
> > [  439.420340]  ffff880e1a491e98 ffffffff812a7610 ffffffff8118de40 0000000000000117
> > [  439.420340]  00000001000036da ffff880e1a45b000 ffff880e1a491ef8 ffffffff8118de4b
> > [  439.420340]  ffff880e1a491ec8 ffffffff81269575 ffff880e1ab750a8 ffff880e1a45b000
> > [  439.420340] Call Trace:
> > [  439.420340]  [<ffffffff812a7610>] find_vma+0x20/0x90
> > [  439.420340]  [<ffffffff8118de40>] ? task_numa_work+0x130/0x360
> > [  439.420340]  [<ffffffff8118de4b>] task_numa_work+0x13b/0x360
> > [  439.420340]  [<ffffffff81269575>] ? context_tracking_user_exit+0x195/0x1d0
> > [  439.420340]  [<ffffffff8116c5be>] task_work_run+0xae/0xf0
> > [  439.420340]  [<ffffffff8106ffbe>] do_notify_resume+0x8e/0xe0
> > [  439.420340]  [<ffffffff844b17a2>] int_signal+0x12/0x17
> > [  439.420340] Code: 42 10 00 00 00 00 48 c7 42 18 00 00 00 00 eb 38 66 0f 1f 44 00 00 31 d2 48 89 c7 48 63 ca 48 8b 84 cf b8 07 00 00 48 85 c0 74 0b <48> 39 30 77 06 48 3b 70 08 72 12 ff c2 83 fa 04 75 de 66 0f 1f
> > [  439.420340] RIP  [<ffffffff81299385>] vmacache_find+0x75/0xa0
> > [  439.420340]  RSP <ffff880e1a491e68>
> > [  439.420340] CR2: ffff880e17530c00
> > 
> 
> Ok, this does not look related. It looks like damage from the VMA caching
> patches, possibly a use-after free. I'm skeptical that it's related to
> automatic NUMA balancing as such based on the second trace you posted.

Indeed, this issue is separate. It was reported a few days ago by
Fengguang: http://lkml.org/lkml/2014/3/9/201

Apparently this has only been seen in DEBUG_PAGEALLOC + trinity
scenarios, just like in Sasha's trace. I suspect some invalidation is
missing (not sure what DEBUG_PAGEALLOC enables differently in this
respect): by the time we do the vmacache_find() the vma is bogus, which
triggers the paging request error.
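The invalidation being suspected here can be sketched as a sequence-number scheme in userspace C. This is a model with hypothetical names (`mm_model`, `cache_valid`), not the kernel implementation: the mm keeps a counter bumped on every VMA change, each thread snapshots it when filling its cache, and a lookup only trusts the cache when the snapshots match. A missed bump on some unmap path would leave stale pointers looking valid, which is the failure mode described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VMACACHE_SIZE 4

/* Per-mm state: one counter covering all VMA modifications. */
struct mm_model {
	unsigned int seqnum;		/* bumped on any VMA change */
};

/* Per-thread state: a cache plus the seqnum it was filled under. */
struct thread_model {
	unsigned int seqnum;		/* snapshot at last cache fill */
	void *vmacache[VMACACHE_SIZE];	/* cached vma pointers */
	struct mm_model *mm;
};

/*
 * Must be called on every path that frees or changes VMAs (munmap
 * and friends). Skipping this on some path is exactly the suspected
 * bug: stale cached pointers would still pass validation.
 */
static void cache_invalidate(struct mm_model *mm)
{
	mm->seqnum++;
}

/* Returns 1 if the thread's cache may be used; flushes it otherwise. */
static int cache_valid(struct thread_model *t)
{
	if (t->seqnum != t->mm->seqnum) {
		memset(t->vmacache, 0, sizeof(t->vmacache));
		t->seqnum = t->mm->seqnum;
		return 0;
	}
	return 1;
}
```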


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page
  2014-03-11 19:28                         ` Andrew Morton
@ 2014-03-11 19:33                           ` Sasha Levin
  -1 siblings, 0 replies; 52+ messages in thread
From: Sasha Levin @ 2014-03-11 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, David Rientjes, linux-mm, LKML, hhuang,
	knoel, aarcange, Davidlohr Bueso

On 03/11/2014 03:28 PM, Andrew Morton wrote:
> On Tue, 11 Mar 2014 15:18:02 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:
>
>>> 3. Can you test with the following patches reverted please?
>>>
>>> 	e15d25d9c827b4346a36a3a78dd566d5ad353402 mm-per-thread-vma-caching-fix-fix
>>> 	e440e20dc76803cdab616b4756c201d5c72857f2 mm-per-thread-vma-caching-fix
>>> 	0d9ad4220e6d73f63a9eeeaac031b92838f75bb3 mm: per-thread vma caching
>>>
>>> The last patch will not revert cleanly (least it didn't for me) but it
>>> was just a case of git rm the two affected files, remove any include of
>>> vmacache.h and commit the rest.
>>
>> Don't see the issues I've reported before now.
>
> This is foggy.  Do you mean that all the bugs went away when
> per-thread-vma-caching was reverted?

No, sorry, just the vmacache_find and the mm/mmap.c:439 BUGs.


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes
  2014-03-11 19:28                         ` Andrew Morton
@ 2014-03-12 10:36                           ` Mel Gorman
  -1 siblings, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2014-03-12 10:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Sasha Levin, Rik van Riel, David Rientjes, hhuang, knoel,
	aarcange, Davidlohr Bueso, linux-mm, LKML

Andrew, this should go with the patches 
mmnuma-reorganize-change_pmd_range.patch
mmnuma-reorganize-change_pmd_range-fix.patch
move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch
in mmotm please.

Thanks.

---8<---
From: Mel Gorman <mgorman@suse.de>
Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes

Sasha Levin reported the following bug using trinity:

[  886.745765] kernel BUG at mm/mprotect.c:149!
[  886.746831] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  886.748511] Dumping ftrace buffer:
[  886.749998]    (ftrace buffer empty)
[  886.750110] Modules linked in:
[  886.751396] CPU: 20 PID: 26219 Comm: trinity-c216 Tainted: G        W    3.14.0-rc5-next-20140305-sasha-00011-ge06f5f3-dirty #105
[  886.751396] task: ffff8800b6c80000 ti: ffff880228436000 task.ti: ffff880228436000
[  886.751396] RIP: 0010:[<ffffffff812aab33>]  [<ffffffff812aab33>] change_protection_range+0x3b3/0x500
[  886.751396] RSP: 0000:ffff880228437da8  EFLAGS: 00010282
[  886.751396] RAX: 8000000527c008e5 RBX: 00007f647916e000 RCX: 0000000000000000
[  886.751396] RDX: ffff8802ef488e40 RSI: 00007f6479000000 RDI: 8000000527c008e5
[  886.751396] RBP: ffff880228437e78 R08: 0000000000000000 R09: 0000000000000000
[  886.751396] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802ef488e40
[  886.751396] R13: 00007f6479000000 R14: 00007f647916e000 R15: 00007f646e34e000
[  886.751396] FS:  00007f64b28d4700(0000) GS:ffff88052ba00000(0000) knlGS:0000000000000000
[  886.751396] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  886.751396] CR2: 00007f64aed83af8 CR3: 0000000206e52000 CR4: 00000000000006a0
[  886.751396] Stack:
[  886.751396]  ffff880200000001 ffffffff8447f152 ffff880228437dd8 ffff880228af3000
[  886.769142]  00007f646e34e000 00007f647916dfff 0000000000000000 00007f647916e000
[  886.769142]  ffff880206e527f0 00007f647916dfff 0000000000000000 0000000000000000
[  886.769142] Call Trace:
[  886.769142]  [<ffffffff8447f152>] ? preempt_count_sub+0xe2/0x120
[  886.769142]  [<ffffffff812aaca5>] change_protection+0x25/0x30
[  886.769142]  [<ffffffff812c3eeb>] change_prot_numa+0x1b/0x30
[  886.769142]  [<ffffffff8118df49>] task_numa_work+0x279/0x360
[  886.769142]  [<ffffffff8116c57e>] task_work_run+0xae/0xf0
[  886.769142]  [<ffffffff8106ffbe>] do_notify_resume+0x8e/0xe0
[  886.769142]  [<ffffffff8447a93b>] retint_signal+0x4d/0x92
[  886.769142] Code: 49 8b 3c 24 48 83 3d fc 2e ba 04 00 75 12 0f 0b 0f 1f 84 00 00 00 00 00 eb fe 66 0f 1f 44 00 00 48 89 f8 66 66 66 90 84 c0 79 0d <0f> 0b 0f 1f 00 eb fe 66 0f 1f 44 00 00 8b 4d 9c 44 8b 4d 98
+89
[  886.769142] RIP  [<ffffffff812aab33>] change_protection_range+0x3b3/0x500
[  886.769142]  RSP <ffff880228437da8>

The VM_BUG_ON was added by the patch "mm,numa: reorganize
change_pmd_range". The race existed without the patch but was just harder
to hit.

The problem is that a transhuge check is made without holding the PTL. It's
possible at the time of the check that a parallel fault clears the pmd
and inserts a new one which then triggers the VM_BUG_ON check.  This patch
removes the VM_BUG_ON but fixes the race by rechecking transhuge under the
PTL when marking page tables for NUMA hinting and bailing if a race occurred.
It is not a problem for calls to mprotect() as they hold mmap_sem for write.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mprotect.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2afc40e..72061a2 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -46,6 +46,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	unsigned long pages = 0;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+
+	/*
+	 * For a prot_numa update we only hold mmap_sem for read so there is a
+	 * potential race with faulting where a pmd was temporarily none so
+	 * recheck it under the lock and bail if we race
+	 */
+	if (prot_numa && unlikely(pmd_trans_huge(*pmd))) {
+		pte_unmap_unlock(pte, ptl);
+		return 0;
+	}
+
 	arch_enter_lazy_mmu_mode();
 	do {
 		oldpte = *pte;
@@ -141,12 +152,13 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 						pages += HPAGE_PMD_NR;
 						nr_huge_updates++;
 					}
+
+					/* huge pmd was handled */
 					continue;
 				}
 			}
 			/* fall through, the trans huge pmd just split */
 		}
-		VM_BUG_ON(pmd_trans_huge(*pmd));
 		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
 				 dirty_accountable, prot_numa);
 		pages += this_pages;
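The logic of the fix can be modelled in plain userspace C. This is a sketch with made-up names (pthread mutexes standing in for the PTL, a boolean for the transhuge state), not the kernel implementation: the huge state is rechecked under the lock and the update bails with 0 pages if a parallel fault won the race.

```c
/*
 * Simplified userspace model of the fix: the transhuge state is checked
 * again under the page-table lock and the update bails if it changed.
 * All names are illustrative; this is not the kernel code.
 */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;
static bool pmd_is_huge;	/* stands in for pmd_trans_huge(*pmd) */

/* Returns the number of "ptes" updated, or 0 if we raced. */
static int change_pte_range_model(bool prot_numa)
{
	int pages = 0;

	pthread_mutex_lock(&ptl);

	/*
	 * prot_numa holds mmap_sem only for read, so the huge state may
	 * have changed since the caller looked: recheck under the lock.
	 */
	if (prot_numa && pmd_is_huge) {
		pthread_mutex_unlock(&ptl);
		return 0;
	}

	for (int i = 0; i < 512; i++)	/* pretend to update 512 ptes */
		pages++;

	pthread_mutex_unlock(&ptl);
	return pages;
}
```

An mprotect()-style caller (prot_numa false) never bails in this model, matching the commit message's point that mmap_sem held for write excludes the race.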

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes
  2014-03-12 10:36                           ` Mel Gorman
@ 2014-03-12 12:16                             ` Rik van Riel
  -1 siblings, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2014-03-12 12:16 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Sasha Levin, David Rientjes, hhuang, knoel, aarcange,
	Davidlohr Bueso, linux-mm, LKML

On 03/12/2014 06:36 AM, Mel Gorman wrote:
> Andrew, this should go with the patches 
> mmnuma-reorganize-change_pmd_range.patch
> mmnuma-reorganize-change_pmd_range-fix.patch
> move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch
> in mmotm please.
> 
> Thanks.

That would be nice indeed :)

I am still not entirely sure why the kernel did not hit this race
before my reorganize change_pmd_range patch. Maybe gcc used to do
one load and now it does two?
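The one-load-versus-two-loads concern can be illustrated with a small userspace model (illustrative names only, nothing from the kernel): if each check dereferences the shared pmd value again, a write landing between the two loads makes them disagree, while checks made against a single snapshot always agree.

```c
/*
 * Userspace illustration (not kernel code) of why two separate loads of
 * a shared pmd-like value can disagree while one snapshot cannot. The
 * "parallel fault" is modelled as a direct call between the two checks.
 */
static unsigned long shared_pmd;	/* bit 0 set == "trans huge" */

static int is_huge(unsigned long v)
{
	return v & 1;
}

static void parallel_fault(void)
{
	shared_pmd = 1;			/* the fault installs a huge pmd */
}

/* Two dereferences: the second can see a value the first did not. */
static int checks_with_two_loads_agree(void)
{
	int first = is_huge(shared_pmd);
	parallel_fault();		/* models the racing fault */
	int second = is_huge(shared_pmd);
	return first == second;
}

/* One dereference: every test runs against the same snapshot. */
static int checks_with_one_load_agree(void)
{
	unsigned long snap = shared_pmd;
	parallel_fault();
	return is_huge(snap) == is_huge(snap);
}
```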

> The problem is that a transhuge check is made without holding the PTL. It's
> possible at the time of the check that a parallel fault clears the pmd
> and inserts a new one which then triggers the VM_BUG_ON check.  This patch
> removes the VM_BUG_ON but fixes the race by rechecking transhuge under the
> PTL when marking page tables for NUMA hinting and bailing if a race occurred.
> It is not a problem for calls to mprotect() as they hold mmap_sem for write.
> 
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes
  2014-03-12 10:36                           ` Mel Gorman
@ 2014-03-15  3:15                             ` Sasha Levin
  -1 siblings, 0 replies; 52+ messages in thread
From: Sasha Levin @ 2014-03-15  3:15 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Rik van Riel, David Rientjes, hhuang, knoel, aarcange,
	Davidlohr Bueso, linux-mm, LKML

On 03/12/2014 06:36 AM, Mel Gorman wrote:
> Andrew, this should go with the patches
> mmnuma-reorganize-change_pmd_range.patch
> mmnuma-reorganize-change_pmd_range-fix.patch
> move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch
> in mmotm please.
>
> Thanks.
>
> ---8<---
> From: Mel Gorman<mgorman@suse.de>
> Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes
>
> Sasha Levin reported the following bug using trinity

I'm seeing a different issue with this patch. A NULL ptr deref occurs in the
pte_offset_map_lock() macro right before the new recheck code:

[ 1877.093980] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 1877.095174] IP: __lock_acquire+0xbc/0x5a0 (kernel/locking/lockdep.c:3069)
[ 1877.096069] PGD 6dee7a067 PUD 6dee7b067 PMD 0
[ 1877.096821] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 1877.097706] Dumping ftrace buffer:
[ 1877.098281]    (ftrace buffer empty)
[ 1877.098825] Modules linked in:
[ 1877.099327] CPU: 19 PID: 27913 Comm: trinity-c100 Tainted: G        W     3.14.0-rc6-next-20140314-sasha-00012-g5590866 #219
[ 1877.100044] task: ffff8808f4280000 ti: ffff8806e1e54000 task.ti: ffff8806e1e54000
[ 1877.100044] RIP:  __lock_acquire+0xbc/0x5a0 (kernel/locking/lockdep.c:3069)
[ 1877.100044] RSP: 0000:ffff8806e1e55be8  EFLAGS: 00010002
[ 1877.100044] RAX: 0000000000000082 RBX: 0000000000000018 RCX: 0000000000000000
[ 1877.100044] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
[ 1877.100044] RBP: ffff8806e1e55c58 R08: 0000000000000001 R09: 0000000000000000
[ 1877.100044] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8808f4280000
[ 1877.100044] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[ 1877.100044] FS:  00007fe3fe152700(0000) GS:ffff88042ba00000(0000) knlGS:0000000000000000
[ 1877.100044] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1877.100044] CR2: 0000000000000018 CR3: 00000006dee79000 CR4: 00000000000006a0
[ 1877.100044] DR0: 0000000000698000 DR1: 0000000000698000 DR2: 0000000000698000
[ 1877.100044] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 000000000009060a
[ 1877.100044] Stack:
[ 1877.100044]  ffff8806e1e55c18 ffffffff81184e95 ffff8808f4280038 00000000001d8500
[ 1877.100044]  ffff88042bbd8500 0000000000000013 ffff8806e1e55c48 ffffffff81185108
[ 1877.100044]  ffffffff87c13bd0 ffff8808f4280000 0000000000000000 0000000000000001
[ 1877.100044] Call Trace:
[ 1877.100044]  ? sched_clock_local+0x25/0x90 (kernel/sched/clock.c:205)
[ 1877.100044]  ? sched_clock_cpu+0xb8/0x100 (kernel/sched/clock.c:310)
[ 1877.100044]  lock_acquire+0x182/0x1d0 (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[ 1877.100044]  ? change_pte_range+0xa3/0x410 (mm/mprotect.c:55)
[ 1877.100044]  ? __lock_release+0x1e2/0x200 (kernel/locking/lockdep.c:3506)
[ 1877.100044]  _raw_spin_lock+0x40/0x80 (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
[ 1877.100044]  ? change_pte_range+0xa3/0x410 (mm/mprotect.c:55)
[ 1877.100044]  ? _raw_spin_unlock+0x35/0x60 (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[ 1877.100044]  change_pte_range+0xa3/0x410 (mm/mprotect.c:55)
[ 1877.100044]  change_protection_range+0x3a8/0x4d0 (mm/mprotect.c:164 mm/mprotect.c:188 mm/mprotect.c:213)
[ 1877.100044]  ? preempt_count_sub+0xe2/0x120 (kernel/sched/core.c:2529)
[ 1877.100044]  change_protection+0x25/0x30 (mm/mprotect.c:237)
[ 1877.100044]  change_prot_numa+0x1b/0x30 (mm/mempolicy.c:559)
[ 1877.100044]  task_numa_work+0x279/0x360 (kernel/sched/fair.c:1911)
[ 1877.100044]  task_work_run+0xae/0xf0 (kernel/task_work.c:125)
[ 1877.100044]  do_notify_resume+0x8e/0xe0 (include/linux/tracehook.h:196 arch/x86/kernel/signal.c:751)
[ 1877.100044]  retint_signal+0x4d/0x92 (arch/x86/kernel/entry_64.S:1096)
[ 1877.100044] Code: c2 6f 3b 6d 85 be fa 0b 00 00 48 c7 c7 ce 94 6d 85 e8 f9 78 f9 ff 31 c0 e9 bc 04 00 00 66 90 44 8b 1d 29 69 cd 04 45 85 db 74 0c <48> 81 3b 80 f2 75 87 75 06 0f 1f 00 45 31 c0 83 fe 01 77 0c 89
[ 1877.100044] RIP  __lock_acquire+0xbc/0x5a0 (kernel/locking/lockdep.c:3069)
[ 1877.100044]  RSP <ffff8806e1e55be8>
[ 1877.100044] CR2: 0000000000000018


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes
  2014-03-15  3:15                             ` Sasha Levin
@ 2014-03-19 14:38                               ` Mel Gorman
  -1 siblings, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2014-03-19 14:38 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Andrew Morton, Rik van Riel, David Rientjes, hhuang, knoel,
	aarcange, Davidlohr Bueso, linux-mm, LKML

On Fri, Mar 14, 2014 at 11:15:37PM -0400, Sasha Levin wrote:
> On 03/12/2014 06:36 AM, Mel Gorman wrote:
> >Andrew, this should go with the patches
> >mmnuma-reorganize-change_pmd_range.patch
> >mmnuma-reorganize-change_pmd_range-fix.patch
> >move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch
> >in mmotm please.
> >
> >Thanks.
> >
> >---8<---
> >From: Mel Gorman<mgorman@suse.de>
> >Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes
> >
> >Sasha Levin reported the following bug using trinity
> 
> I'm seeing a different issue with this patch. A NULL ptr deref occurs in the
> pte_offset_map_lock() macro right before the new recheck code:
> 

This on top?

I tried testing it but got all sorts of carnage that trinity threw up
in the mix, and ordinary testing does not trigger the race. I've no idea
which of the current mess of trinity-exposed bugs you've encountered
have already been fixed.

---8<---
From: Mel Gorman <mgorman@suse.de>
Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes -fix

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mprotect.c | 40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 66973db..c43d557 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -36,6 +36,34 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 }
 #endif
 
+/*
+ * For a prot_numa update we only hold mmap_sem for read so there is a
+ * potential race with faulting where a pmd was temporarily none. This
+ * function checks for a transhuge pmd under the appropriate lock. It
+ * returns a pte if it was successfully locked or NULL if it raced with
+ * a transhuge insertion.
+ */
+static pte_t *lock_pte_protection(struct vm_area_struct *vma, pmd_t *pmd,
+			unsigned long addr, int prot_numa, spinlock_t **ptl)
+{
+	pte_t *pte;
+	spinlock_t *pmdl;
+
+	/* !prot_numa is protected by mmap_sem held for write */
+	if (!prot_numa)
+		return pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
+
+	pmdl = pmd_lock(vma->vm_mm, pmd);
+	if (unlikely(pmd_trans_huge(*pmd) || pmd_none(*pmd))) {
+		spin_unlock(pmdl);
+		return NULL;
+	}
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
+	spin_unlock(pmdl);
+	return pte;
+}
+
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable, int prot_numa)
@@ -45,17 +73,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	spinlock_t *ptl;
 	unsigned long pages = 0;
 
-	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
-
-	/*
-	 * For a prot_numa update we only hold mmap_sem for read so there is a
-	 * potential race with faulting where a pmd was temporarily none so
-	 * recheck it under the lock and bail if we race
-	 */
-	if (prot_numa && unlikely(pmd_trans_huge(*pmd))) {
-		pte_unmap_unlock(pte, ptl);
+	pte = lock_pte_protection(vma, pmd, addr, prot_numa, &ptl);
+	if (!pte)
 		return 0;
-	}
 
 	arch_enter_lazy_mmu_mode();
 	do {
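The locking order in lock_pte_protection() above can be sketched as a userspace model, with pthread mutexes standing in for the pmd and pte spinlocks and all names hypothetical: take the pmd-level lock, validate that a page table is still there, take the pte-level lock while still holding the pmd lock, then drop the pmd lock.

```c
/*
 * Userspace sketch of lock_pte_protection(): validate the pmd under its
 * own lock before taking the pte lock, so the pte lock is never derived
 * from a pmd that a parallel fault cleared. Illustrative, not kernel code.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct mm_model {
	pthread_mutex_t pmdl;		/* models pmd_lock() */
	pthread_mutex_t ptl;		/* models the pte lock */
	bool pmd_trans_huge;
	bool pmd_none;
	int page_table[512];		/* the ptes behind the pmd */
};

/* Returns the page table with ptl held, or NULL if we raced. */
static int *lock_pte_protection_model(struct mm_model *mm, bool prot_numa)
{
	/* !prot_numa callers hold mmap_sem for write: no race possible. */
	if (!prot_numa) {
		pthread_mutex_lock(&mm->ptl);
		return mm->page_table;
	}

	pthread_mutex_lock(&mm->pmdl);
	if (mm->pmd_trans_huge || mm->pmd_none) {
		pthread_mutex_unlock(&mm->pmdl);
		return NULL;		/* bail: a parallel fault won */
	}

	pthread_mutex_lock(&mm->ptl);
	pthread_mutex_unlock(&mm->pmdl);
	return mm->page_table;
}
```

This mirrors the patch's behaviour: the pte lock is only reached once the pmd is known, under its own lock, to point at a real page table, which avoids the NULL dereference Sasha hit in pte_offset_map_lock().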

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes
  2014-03-19 14:38                               ` Mel Gorman
@ 2014-03-21 22:06                                 ` Andrew Morton
  -1 siblings, 0 replies; 52+ messages in thread
From: Andrew Morton @ 2014-03-21 22:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Sasha Levin, Rik van Riel, David Rientjes, hhuang, knoel,
	aarcange, Davidlohr Bueso, linux-mm, LKML

On Wed, 19 Mar 2014 14:38:32 +0000 Mel Gorman <mgorman@suse.de> wrote:

> On Fri, Mar 14, 2014 at 11:15:37PM -0400, Sasha Levin wrote:
> > On 03/12/2014 06:36 AM, Mel Gorman wrote:
> > >Andrew, this should go with the patches
> > >mmnuma-reorganize-change_pmd_range.patch
> > >mmnuma-reorganize-change_pmd_range-fix.patch
> > >move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch
> > >in mmotm please.
> > >
> > >Thanks.
> > >
> > >---8<---
> > >From: Mel Gorman<mgorman@suse.de>
> > >Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes
> > >
> > >Sasha Levin reported the following bug using trinity
> > 
> > I'm seeing a different issue with this patch. A NULL ptr deref occurs in the
> > pte_offset_map_lock() macro right before the new recheck code:
> > 
> 
> This on top?
> 
> I tried testing it but got all sorts of carnage that trinity threw up
> in the mix, and ordinary testing does not trigger the race. I've no idea
> which of the current mess of trinity-exposed bugs you've already
> encountered and got fixed.
> 

Where are we at with this one?


With this patch I'm now sitting on

mmnuma-reorganize-change_pmd_range.patch
mmnuma-reorganize-change_pmd_range-fix.patch
mm-numa-recheck-for-transhuge-pages-under-lock-during-protection-changes.patch
mm-numa-recheck-for-transhuge-pages-under-lock-during-protection-changes-fix.patch
move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch

Where the first four are prep work for the fifth.

move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch
is a performance optimisation for KVM guests.  Should I drop all five,
try again next time?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes
  2014-03-19 14:38                               ` Mel Gorman
@ 2014-03-21 22:24                                 ` Rik van Riel
  -1 siblings, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2014-03-21 22:24 UTC (permalink / raw)
  To: Mel Gorman, Sasha Levin
  Cc: Andrew Morton, David Rientjes, hhuang, knoel, aarcange,
	Davidlohr Bueso, linux-mm, LKML

On 03/19/2014 10:38 AM, Mel Gorman wrote:
> On Fri, Mar 14, 2014 at 11:15:37PM -0400, Sasha Levin wrote:
>> On 03/12/2014 06:36 AM, Mel Gorman wrote:
>>> Andrew, this should go with the patches
>>> mmnuma-reorganize-change_pmd_range.patch
>>> mmnuma-reorganize-change_pmd_range-fix.patch
>>> move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch
>>> in mmotm please.
>>>
>>> Thanks.
>>>
>>> ---8<---
>>> From: Mel Gorman<mgorman@suse.de>
>>> Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes
>>>
>>> Sasha Levin reported the following bug using trinity
>>
>> I'm seeing a different issue with this patch. A NULL ptr deref occurs in the
>> pte_offset_map_lock() macro right before the new recheck code:
>>
> 
> This on top?
> 
> I tried testing it but got all sorts of carnage that trinity threw up
> in the mix, and ordinary testing does not trigger the race. I've no idea
> which of the current mess of trinity-exposed bugs you've already
> encountered and got fixed.

Eeeep indeed.  If we re-test the transhuge status, we need to
take the pmd lock, and not the potentially non-existent pte
lock.

Good catch.

> ---8<---
> From: Mel Gorman <mgorman@suse.de>
> Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes -fix
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

> ---
>  mm/mprotect.c | 40 ++++++++++++++++++++++++++++++----------
>  1 file changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 66973db..c43d557 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -36,6 +36,34 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
>  }
>  #endif
>  
> +/*
> + * For a prot_numa update we only hold mmap_sem for read so there is a
> + * potential race with faulting where a pmd was temporarily none. This
> + * function checks for a transhuge pmd under the appropriate lock. It
> + * returns a pte if it was successfully locked or NULL if it raced with
> + * a transhuge insertion.
> + */
> +static pte_t *lock_pte_protection(struct vm_area_struct *vma, pmd_t *pmd,
> +			unsigned long addr, int prot_numa, spinlock_t **ptl)
> +{
> +	pte_t *pte;
> +	spinlock_t *pmdl;
> +
> +	/* !prot_numa is protected by mmap_sem held for write */
> +	if (!prot_numa)
> +		return pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
> +
> +	pmdl = pmd_lock(vma->vm_mm, pmd);
> +	if (unlikely(pmd_trans_huge(*pmd) || pmd_none(*pmd))) {
> +		spin_unlock(pmdl);
> +		return NULL;
> +	}
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
> +	spin_unlock(pmdl);
> +	return pte;
> +}
> +
>  static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  		unsigned long addr, unsigned long end, pgprot_t newprot,
>  		int dirty_accountable, int prot_numa)
> @@ -45,17 +73,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  	spinlock_t *ptl;
>  	unsigned long pages = 0;
>  
> -	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> -
> -	/*
> -	 * For a prot_numa update we only hold mmap_sem for read so there is a
> -	 * potential race with faulting where a pmd was temporarily none so
> -	 * recheck it under the lock and bail if we race
> -	 */
> -	if (prot_numa && unlikely(pmd_trans_huge(*pmd))) {
> -		pte_unmap_unlock(pte, ptl);
> +	pte = lock_pte_protection(vma, pmd, addr, prot_numa, &ptl);
> +	if (!pte)
>  		return 0;
> -	}
>  
>  	arch_enter_lazy_mmu_mode();
>  	do {
> 


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes
  2014-03-21 22:06                                 ` Andrew Morton
@ 2014-03-21 22:36                                   ` Sasha Levin
  -1 siblings, 0 replies; 52+ messages in thread
From: Sasha Levin @ 2014-03-21 22:36 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman
  Cc: Rik van Riel, David Rientjes, hhuang, knoel, aarcange,
	Davidlohr Bueso, linux-mm, LKML

On 03/21/2014 06:06 PM, Andrew Morton wrote:
> On Wed, 19 Mar 2014 14:38:32 +0000 Mel Gorman<mgorman@suse.de>  wrote:
>
>> >On Fri, Mar 14, 2014 at 11:15:37PM -0400, Sasha Levin wrote:
>>> > >On 03/12/2014 06:36 AM, Mel Gorman wrote:
>>>> > > >Andrew, this should go with the patches
>>>> > > >mmnuma-reorganize-change_pmd_range.patch
>>>> > > >mmnuma-reorganize-change_pmd_range-fix.patch
>>>> > > >move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch
>>>> > > >in mmotm please.
>>>> > > >
>>>> > > >Thanks.
>>>> > > >
>>>> > > >---8<---
>>>> > > >From: Mel Gorman<mgorman@suse.de>
>>>> > > >Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes
>>>> > > >
>>>> > > >Sasha Levin reported the following bug using trinity
>>> > >
>>> > >I'm seeing a different issue with this patch. A NULL ptr deref occurs in the
>>> > >pte_offset_map_lock() macro right before the new recheck code:
>>> > >
>> >
>> >This on top?
>> >
>> >I tried testing it but got all sorts of carnage that trinity threw up
>> >in the mix, and ordinary testing does not trigger the race. I've no idea
>> >which of the current mess of trinity-exposed bugs you've already
>> >encountered and got fixed.
>> >
> Where are we at with this one?

Looking good here, haven't seen any of the issues reported in this thread.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 52+ messages in thread


end of thread, other threads:[~2014-03-21 22:37 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-06 21:12 mm: kernel BUG at mm/mprotect.c:149 Sasha Levin
2014-03-06 22:31 ` [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Rik van Riel
2014-03-06 22:31 ` Rik van Riel
2014-03-06 22:52   ` Rik van Riel
2014-03-07 14:06     ` Mel Gorman
2014-03-07 15:09       ` Mel Gorman
2014-03-07 15:13         ` Mel Gorman
2014-03-07 18:27         ` Mel Gorman
2014-03-11 16:28           ` Mel Gorman
2014-03-11 16:51             ` Sasha Levin
2014-03-11 17:00               ` Rik van Riel
2014-03-11 17:33                 ` Sasha Levin
2014-03-11 17:32                   ` Rik van Riel
2014-03-11 18:06                   ` Mel Gorman
2014-03-11 19:18                     ` Sasha Levin
2014-03-11 19:28                       ` Andrew Morton
2014-03-11 19:33                         ` Sasha Levin
2014-03-12 10:36                         ` [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes Mel Gorman
2014-03-12 12:16                           ` Rik van Riel
2014-03-15  3:15                           ` Sasha Levin
2014-03-19 14:38                             ` Mel Gorman
2014-03-21 22:06                               ` Andrew Morton
2014-03-21 22:36                                 ` Sasha Levin
2014-03-21 22:24                               ` Rik van Riel
2014-03-11 19:31                     ` [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Davidlohr Bueso
2014-03-07  1:04   ` Sasha Levin
