linux-kernel.vger.kernel.org archive mirror
* kernel BUG at mm/huge_memory.c:212!
@ 2012-11-27 21:18 Jiri Slaby
  2012-11-27 23:47 ` David Rientjes
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Jiri Slaby @ 2012-11-27 21:18 UTC (permalink / raw)
  To: linux-mm, LKML

Hi,

I've hit BUG_ON(atomic_dec_and_test(&huge_zero_refcount)) in
put_huge_zero_page right now. There are some "Bad rss-counter state"
before that, but those are perhaps unrelated as I saw many of them in
the previous -next. But even with yesterday's next I got the BUG.

[ 7395.654928] BUG: Bad rss-counter state mm:ffff8800088289c0 idx:1 val:-1
[ 7417.652911] BUG: Bad rss-counter state mm:ffff880008829a00 idx:1 val:-1
[ 7423.317027] BUG: Bad rss-counter state mm:ffff8800088296c0 idx:1 val:-1
[ 7463.737596] BUG: Bad rss-counter state mm:ffff88000882ad80 idx:1 val:-2
[ 7486.462237] BUG: Bad rss-counter state mm:ffff880008829040 idx:1 val:-2
[ 7499.118560] BUG: Bad rss-counter state mm:ffff880008829040 idx:1 val:-2
[ 7507.000464] BUG: Bad rss-counter state mm:ffff880008828000 idx:1 val:-2
[ 7512.898902] BUG: Bad rss-counter state mm:ffff880008829380 idx:1 val:-2
[ 7522.299066] BUG: Bad rss-counter state mm:ffff8800088296c0 idx:1 val:-2
[ 7530.471048] BUG: Bad rss-counter state mm:ffff8800088296c0 idx:1 val:-2
[ 7597.602661] BUG: 'atomic_dec_and_test(&huge_zero_refcount)' is true!
[ 7597.602683] ------------[ cut here ]------------
[ 7597.602711] kernel BUG at /l/latest/linux/mm/huge_memory.c:212!
[ 7597.602732] invalid opcode: 0000 [#1] SMP
[ 7597.602751] Modules linked in: vfat fat dvb_usb_dib0700 dib0090 dib7000p dib7000m dib0070 dib8000 dib3000mc dibx000_common microcode
[ 7597.602811] CPU 1
[ 7597.602823] Pid: 1221, comm: java Not tainted 3.7.0-rc6-next-20121126_64+ #1698 To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M.
[ 7597.602867] RIP: 0010:[<ffffffff8116839e>]  [<ffffffff8116839e>] put_huge_zero_page+0x2e/0x30
[ 7597.602902] RSP: 0000:ffff8801a58cdd48  EFLAGS: 00010292
[ 7597.602921] RAX: 0000000000000038 RBX: ffff880183cc0d00 RCX: 0000000000000007
[ 7597.602944] RDX: 00000000000000b5 RSI: 0000000000000046 RDI: ffffffff81dc605c
[ 7597.602967] RBP: ffff8801a58cdd48 R08: 746127203a475542 R09: 000000000000047b
[ 7597.602990] R10: 6365645f63696d6f R11: 7365745f646e615f R12: 00007fd4b3e00000
[ 7597.603014] R13: 00007fd4b3dcc000 R14: ffff8801bdebab00 R15: 8000000001d94225
[ 7597.603037] FS:  00007fd4c7ebe700(0000) GS:ffff8801cbc80000(0000) knlGS:0000000000000000
[ 7597.603064] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7597.603083] CR2: 00007fd4b3dcc498 CR3: 000000017d6bc000 CR4: 00000000000007e0
[ 7597.603106] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7597.603129] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 7597.603152] Process java (pid: 1221, threadinfo ffff8801a58cc000, task ffff8801a4655be0)
[ 7597.603178] Stack:
[ 7597.603187]  ffff8801a58cddc8 ffffffff8116b8d4 ffff8801a38cb000 ffff8801bdebab00
[ 7597.603219]  ffff880183cc0d00 00000001a38cb067 ffffea0006cccb40 ffff8801a3911cf0
[ 7597.603250]  00000001b332d000 00007fd4b3c00000 ffff880183cc0d00 00007fd4b3dcc498
[ 7597.603282] Call Trace:
[ 7597.603293]  [<ffffffff8116b8d4>] do_huge_pmd_wp_page+0x7e4/0x900
[ 7597.603316]  [<ffffffff81148755>] handle_mm_fault+0x145/0x330
[ 7597.603337]  [<ffffffff81071e45>] __do_page_fault+0x145/0x480
[ 7597.603358]  [<ffffffff810b42c5>] ? sched_clock_local+0x25/0xa0
[ 7597.603378]  [<ffffffff810b4ec8>] ? __enqueue_entity+0x78/0x80
[ 7597.603400]  [<ffffffff810d0efd>] ? sys_futex+0x8d/0x190
[ 7597.603420]  [<ffffffff810721be>] do_page_fault+0xe/0x10
[ 7597.603440]  [<ffffffff816b7c72>] page_fault+0x22/0x30
[ 7597.603458] Code: 66 90 f0 ff 0d c0 05 cf 00 0f 94 c0 84 c0 75 02 f3 c3 55 48 c7 c6 60 51 97 81 48 c7 c7 1a 82 94 81 48 89 e5 31 c0 e8 25 60 54 00 <0f> 0b 66 66 66 66 90 55 48 89 e5 53 48 83 ec 08 48 83 7e 08 00
[ 7597.603640] RIP  [<ffffffff8116839e>] put_huge_zero_page+0x2e/0x30
[ 7597.603664]  RSP <ffff8801a58cdd48>
[ 7597.636299] ---[ end trace 241e96a56fc0cf87 ]---
[ 7612.907136] SysRq : Keyboard mode set to system default

thanks,
-- 
js
suse labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: kernel BUG at mm/huge_memory.c:212!
  2012-11-27 21:18 kernel BUG at mm/huge_memory.c:212! Jiri Slaby
@ 2012-11-27 23:47 ` David Rientjes
  2012-11-29  7:38 ` Bob Liu
  2012-11-30 15:03 ` [PATCH 0/2] " Kirill A. Shutemov
  2 siblings, 0 replies; 12+ messages in thread
From: David Rientjes @ 2012-11-27 23:47 UTC (permalink / raw)
  To: Jiri Slaby, Kirill A. Shutemov; +Cc: linux-mm, LKML

On Tue, 27 Nov 2012, Jiri Slaby wrote:

> Hi,
> 
> I've hit BUG_ON(atomic_dec_and_test(&huge_zero_refcount)) in
> put_huge_zero_page right now. There are some "Bad rss-counter state"
> before that, but those are perhaps unrelated as I saw many of them in
> the previous -next. But even with yesterday's next I got the BUG.
> 
> [ 7395.654928] BUG: Bad rss-counter state mm:ffff8800088289c0 idx:1 val:-1
> [ 7417.652911] BUG: Bad rss-counter state mm:ffff880008829a00 idx:1 val:-1
> [ 7423.317027] BUG: Bad rss-counter state mm:ffff8800088296c0 idx:1 val:-1
> [ 7463.737596] BUG: Bad rss-counter state mm:ffff88000882ad80 idx:1 val:-2
> [ 7486.462237] BUG: Bad rss-counter state mm:ffff880008829040 idx:1 val:-2
> [ 7499.118560] BUG: Bad rss-counter state mm:ffff880008829040 idx:1 val:-2
> [ 7507.000464] BUG: Bad rss-counter state mm:ffff880008828000 idx:1 val:-2
> [ 7512.898902] BUG: Bad rss-counter state mm:ffff880008829380 idx:1 val:-2
> [ 7522.299066] BUG: Bad rss-counter state mm:ffff8800088296c0 idx:1 val:-2
> [ 7530.471048] BUG: Bad rss-counter state mm:ffff8800088296c0 idx:1 val:-2
> [ 7597.602661] BUG: 'atomic_dec_and_test(&huge_zero_refcount)' is true!
> [ 7597.602683] ------------[ cut here ]------------
> [ 7597.602711] kernel BUG at /l/latest/linux/mm/huge_memory.c:212!
> [ 7597.602732] invalid opcode: 0000 [#1] SMP
> [ 7597.602751] Modules linked in: vfat fat dvb_usb_dib0700 dib0090
> dib7000p dib7000m dib0070 dib8000 dib3000mc dibx000_common microcode
> [ 7597.602811] CPU 1
> [ 7597.602823] Pid: 1221, comm: java Not tainted
> 3.7.0-rc6-next-20121126_64+ #1698 To Be Filled By O.E.M. To Be Filled By
> O.E.M./To be filled by O.E.M.
> [ 7597.602867] RIP: 0010:[<ffffffff8116839e>]  [<ffffffff8116839e>]
> put_huge_zero_page+0x2e/0x30
> [ 7597.602902] RSP: 0000:ffff8801a58cdd48  EFLAGS: 00010292
> [ 7597.602921] RAX: 0000000000000038 RBX: ffff880183cc0d00 RCX:
> 0000000000000007
> [ 7597.602944] RDX: 00000000000000b5 RSI: 0000000000000046 RDI:
> ffffffff81dc605c
> [ 7597.602967] RBP: ffff8801a58cdd48 R08: 746127203a475542 R09:
> 000000000000047b
> [ 7597.602990] R10: 6365645f63696d6f R11: 7365745f646e615f R12:
> 00007fd4b3e00000
> [ 7597.603014] R13: 00007fd4b3dcc000 R14: ffff8801bdebab00 R15:
> 8000000001d94225
> [ 7597.603037] FS:  00007fd4c7ebe700(0000) GS:ffff8801cbc80000(0000)
> knlGS:0000000000000000
> [ 7597.603064] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7597.603083] CR2: 00007fd4b3dcc498 CR3: 000000017d6bc000 CR4:
> 00000000000007e0
> [ 7597.603106] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 7597.603129] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [ 7597.603152] Process java (pid: 1221, threadinfo ffff8801a58cc000,
> task ffff8801a4655be0)
> [ 7597.603178] Stack:
> [ 7597.603187]  ffff8801a58cddc8 ffffffff8116b8d4 ffff8801a38cb000
> ffff8801bdebab00
> [ 7597.603219]  ffff880183cc0d00 00000001a38cb067 ffffea0006cccb40
> ffff8801a3911cf0
> [ 7597.603250]  00000001b332d000 00007fd4b3c00000 ffff880183cc0d00
> 00007fd4b3dcc498
> [ 7597.603282] Call Trace:
> [ 7597.603293]  [<ffffffff8116b8d4>] do_huge_pmd_wp_page+0x7e4/0x900
> [ 7597.603316]  [<ffffffff81148755>] handle_mm_fault+0x145/0x330
> [ 7597.603337]  [<ffffffff81071e45>] __do_page_fault+0x145/0x480
> [ 7597.603358]  [<ffffffff810b42c5>] ? sched_clock_local+0x25/0xa0
> [ 7597.603378]  [<ffffffff810b4ec8>] ? __enqueue_entity+0x78/0x80
> [ 7597.603400]  [<ffffffff810d0efd>] ? sys_futex+0x8d/0x190
> [ 7597.603420]  [<ffffffff810721be>] do_page_fault+0xe/0x10
> [ 7597.603440]  [<ffffffff816b7c72>] page_fault+0x22/0x30
> [ 7597.603458] Code: 66 90 f0 ff 0d c0 05 cf 00 0f 94 c0 84 c0 75 02 f3
> c3 55 48 c7 c6 60 51 97 81 48 c7 c7 1a 82 94 81 48 89 e5 31 c0 e8 25 60
> 54 00 <0f> 0b 66 66 66 66 90 55 48 89 e5 53 48 83 ec 08 48 83 7e 08 00
> [ 7597.603640] RIP  [<ffffffff8116839e>] put_huge_zero_page+0x2e/0x30
> [ 7597.603664]  RSP <ffff8801a58cdd48>
> [ 7597.636299] ---[ end trace 241e96a56fc0cf87 ]---
> [ 7612.907136] SysRq : Keyboard mode set to system default
> 

Thanks for the report.  Adding Kirill to the cc since this is from the 
huge zero page patchset sitting in next and is due to the refcounting on 
lazy allocation.
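
For readers unfamiliar with the scheme, the following is a minimal user-space
sketch of lazy allocation with a reference count, written with C11 atomics. It
is not the kernel code: the names get_zero_buf(), put_zero_buf() and
shrink_zero_buf() are hypothetical, and the detail that the first allocation
leaves the count at 2 (one reference for the caller, one base reference
dropped only by a reclaimer) is an assumption about how the patchset works.
The only part taken from the report is the invariant itself: a put on behalf
of a mapping must never bring the count to zero, which is what the BUG_ON in
put_huge_zero_page() checks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define ZERO_BUF_SIZE 4096  /* stand-in size for the demo */

static atomic_int zero_refcount;    /* 0 means "no buffer allocated" */
static _Atomic(void *) zero_buf;    /* lazily allocated shared buffer */

/* Increment *v unless it is zero; report whether the increment happened. */
static bool inc_not_zero(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 0) {
        /* On failure the CAS reloads `old`, so the loop re-tests it. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}

/* Take a reference, allocating the buffer on first use. */
static void *get_zero_buf(void)
{
    for (;;) {
        if (inc_not_zero(&zero_refcount))
            return atomic_load(&zero_buf);

        void *buf = calloc(1, ZERO_BUF_SIZE);
        if (!buf)
            return NULL;

        void *expected = NULL;
        if (!atomic_compare_exchange_strong(&zero_buf, &expected, buf)) {
            free(buf);      /* another thread published first; retry */
            continue;
        }
        /* Assumed scheme: one reference for the caller plus a base
         * reference that only shrink_zero_buf() drops. */
        atomic_store(&zero_refcount, 2);
        return buf;
    }
}

/* Drop a user reference. Returns false if the count reached zero, which is
 * the condition the BUG_ON in the report fires on. */
static bool put_zero_buf(void)
{
    return atomic_fetch_sub(&zero_refcount, 1) != 1;
}

/* Reclaim: free the buffer only when the base reference is the last one. */
static bool shrink_zero_buf(void)
{
    int expected = 1;

    if (!atomic_compare_exchange_strong(&zero_refcount, &expected, 0))
        return false;
    free(atomic_exchange(&zero_buf, (void *)NULL));
    return true;
}
```

With balanced get/put pairs the count never drops below the base reference. A
single extra put, for example from two page faults that both believe they
removed the same mapping, is enough to make put_zero_buf() return false.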

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: kernel BUG at mm/huge_memory.c:212!
  2012-11-27 21:18 kernel BUG at mm/huge_memory.c:212! Jiri Slaby
  2012-11-27 23:47 ` David Rientjes
@ 2012-11-29  7:38 ` Bob Liu
  2012-11-30 15:03 ` [PATCH 0/2] " Kirill A. Shutemov
  2 siblings, 0 replies; 12+ messages in thread
From: Bob Liu @ 2012-11-29  7:38 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: linux-mm, LKML, kirill.shutemov

Hi Jiri,

On Wed, Nov 28, 2012 at 5:18 AM, Jiri Slaby <jslaby@suse.cz> wrote:
> Hi,
>
> I've hit BUG_ON(atomic_dec_and_test(&huge_zero_refcount)) in
> put_huge_zero_page right now. There are some "Bad rss-counter state"
> before that, but those are perhaps unrelated as I saw many of them in
> the previous -next. But even with yesterday's next I got the BUG.
>

Could you please give more details about your test, or about how to trigger this BUG?
I'm using a kernel with the huge zero page feature but haven't seen it yet.

> [ 7395.654928] BUG: Bad rss-counter state mm:ffff8800088289c0 idx:1 val:-1
> [ 7417.652911] BUG: Bad rss-counter state mm:ffff880008829a00 idx:1 val:-1
> [ 7423.317027] BUG: Bad rss-counter state mm:ffff8800088296c0 idx:1 val:-1
> [ 7463.737596] BUG: Bad rss-counter state mm:ffff88000882ad80 idx:1 val:-2
> [ 7486.462237] BUG: Bad rss-counter state mm:ffff880008829040 idx:1 val:-2
> [ 7499.118560] BUG: Bad rss-counter state mm:ffff880008829040 idx:1 val:-2
> [ 7507.000464] BUG: Bad rss-counter state mm:ffff880008828000 idx:1 val:-2
> [ 7512.898902] BUG: Bad rss-counter state mm:ffff880008829380 idx:1 val:-2
> [ 7522.299066] BUG: Bad rss-counter state mm:ffff8800088296c0 idx:1 val:-2
> [ 7530.471048] BUG: Bad rss-counter state mm:ffff8800088296c0 idx:1 val:-2
> [ 7597.602661] BUG: 'atomic_dec_and_test(&huge_zero_refcount)' is true!
> [ 7597.602683] ------------[ cut here ]------------
> [ 7597.602711] kernel BUG at /l/latest/linux/mm/huge_memory.c:212!
> [ 7597.602732] invalid opcode: 0000 [#1] SMP
> [ 7597.602751] Modules linked in: vfat fat dvb_usb_dib0700 dib0090
> dib7000p dib7000m dib0070 dib8000 dib3000mc dibx000_common microcode
> [ 7597.602811] CPU 1
> [ 7597.602823] Pid: 1221, comm: java Not tainted
> 3.7.0-rc6-next-20121126_64+ #1698 To Be Filled By O.E.M. To Be Filled By
> O.E.M./To be filled by O.E.M.
> [ 7597.602867] RIP: 0010:[<ffffffff8116839e>]  [<ffffffff8116839e>]
> put_huge_zero_page+0x2e/0x30
> [ 7597.602902] RSP: 0000:ffff8801a58cdd48  EFLAGS: 00010292
> [ 7597.602921] RAX: 0000000000000038 RBX: ffff880183cc0d00 RCX:
> 0000000000000007
> [ 7597.602944] RDX: 00000000000000b5 RSI: 0000000000000046 RDI:
> ffffffff81dc605c
> [ 7597.602967] RBP: ffff8801a58cdd48 R08: 746127203a475542 R09:
> 000000000000047b
> [ 7597.602990] R10: 6365645f63696d6f R11: 7365745f646e615f R12:
> 00007fd4b3e00000
> [ 7597.603014] R13: 00007fd4b3dcc000 R14: ffff8801bdebab00 R15:
> 8000000001d94225
> [ 7597.603037] FS:  00007fd4c7ebe700(0000) GS:ffff8801cbc80000(0000)
> knlGS:0000000000000000
> [ 7597.603064] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7597.603083] CR2: 00007fd4b3dcc498 CR3: 000000017d6bc000 CR4:
> 00000000000007e0
> [ 7597.603106] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 7597.603129] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [ 7597.603152] Process java (pid: 1221, threadinfo ffff8801a58cc000,
> task ffff8801a4655be0)
> [ 7597.603178] Stack:
> [ 7597.603187]  ffff8801a58cddc8 ffffffff8116b8d4 ffff8801a38cb000
> ffff8801bdebab00
> [ 7597.603219]  ffff880183cc0d00 00000001a38cb067 ffffea0006cccb40
> ffff8801a3911cf0
> [ 7597.603250]  00000001b332d000 00007fd4b3c00000 ffff880183cc0d00
> 00007fd4b3dcc498
> [ 7597.603282] Call Trace:
> [ 7597.603293]  [<ffffffff8116b8d4>] do_huge_pmd_wp_page+0x7e4/0x900
> [ 7597.603316]  [<ffffffff81148755>] handle_mm_fault+0x145/0x330
> [ 7597.603337]  [<ffffffff81071e45>] __do_page_fault+0x145/0x480
> [ 7597.603358]  [<ffffffff810b42c5>] ? sched_clock_local+0x25/0xa0
> [ 7597.603378]  [<ffffffff810b4ec8>] ? __enqueue_entity+0x78/0x80
> [ 7597.603400]  [<ffffffff810d0efd>] ? sys_futex+0x8d/0x190
> [ 7597.603420]  [<ffffffff810721be>] do_page_fault+0xe/0x10
> [ 7597.603440]  [<ffffffff816b7c72>] page_fault+0x22/0x30
> [ 7597.603458] Code: 66 90 f0 ff 0d c0 05 cf 00 0f 94 c0 84 c0 75 02 f3
> c3 55 48 c7 c6 60 51 97 81 48 c7 c7 1a 82 94 81 48 89 e5 31 c0 e8 25 60
> 54 00 <0f> 0b 66 66 66 66 90 55 48 89 e5 53 48 83 ec 08 48 83 7e 08 00
> [ 7597.603640] RIP  [<ffffffff8116839e>] put_huge_zero_page+0x2e/0x30
> [ 7597.603664]  RSP <ffff8801a58cdd48>
> [ 7597.636299] ---[ end trace 241e96a56fc0cf87 ]---
> [ 7612.907136] SysRq : Keyboard mode set to system default
>

Btw: could you try the patch below? I think it might be
related, but I'm not sure.
Thank you!

(Sorry, I can only use web email at the moment, so the patch formatting may be broken.)
------------
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4489e16..d282d80 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1096,7 +1096,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm)

 static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd, unsigned long haddr)
+		pmd_t *pmd, pmd_t orig_pmd, unsigned long haddr)
 {
 	pgtable_t pgtable;
 	pmd_t _pmd;
@@ -1125,6 +1125,10 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

 	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
+		WARN_ON(1);
+		goto out_free_page;
+	}
 	pmdp_clear_flush(vma, haddr, pmd);
 	/* leave pmd empty until pte is filled */

@@ -1156,6 +1160,14 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
 	ret |= VM_FAULT_WRITE;
 out:
 	return ret;
+out_free_page:
+	spin_unlock(&mm->page_table_lock);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mem_cgroup_uncharge_start();
+	mem_cgroup_uncharge_page(page);
+	put_page(page);
+	mem_cgroup_uncharge_end();
+	goto out;
 }

 static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
@@ -1302,7 +1314,7 @@ alloc:
 		count_vm_event(THP_FAULT_FALLBACK);
 		if (is_huge_zero_pmd(orig_pmd)) {
 			ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
-					address, pmd, haddr);
+					address, pmd, orig_pmd, haddr);
 		} else {
 			ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
 					pmd, orig_pmd, page, haddr);

-- 
Regards,
--Bob
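
The shape of the bail-out path in the patch above (prepare a page, take the
lock, re-check, and on mismatch release everything through a label) can be
illustrated outside the kernel. The following is a minimal user-space sketch
with a pthread mutex; struct slot, replace_payload() and live_allocs are
hypothetical names invented for the illustration, and unlike the patch, which
leaves ret untouched on the bail-out, the sketch returns -1 so that the caller
can see which path ran.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct slot {
    pthread_mutex_t lock;
    int value;      /* stand-in for the pmd contents */
    int *payload;   /* stand-in for the page installed in the slot */
};

/* Demo bookkeeping: allocations not yet freed (not thread-safe). */
static int live_allocs;

/* Replace the payload if the slot still holds `snapshot`.
 * Returns 0 on success, -1 if allocation failed or the slot changed. */
static int replace_payload(struct slot *s, int snapshot)
{
    int ret = 0;
    int *fresh = malloc(sizeof(*fresh));   /* prepared before taking the lock */

    if (!fresh)
        return -1;
    *fresh = snapshot;
    live_allocs++;

    pthread_mutex_lock(&s->lock);
    if (s->value != snapshot)
        goto out_free;                     /* somebody else got there first */

    if (s->payload) {
        free(s->payload);
        live_allocs--;
    }
    s->payload = fresh;
    s->value = snapshot + 1;
    pthread_mutex_unlock(&s->lock);
out:
    return ret;

out_free:
    /* Undo in reverse order of acquisition: the lock, then the allocation. */
    pthread_mutex_unlock(&s->lock);
    free(fresh);
    live_allocs--;
    ret = -1;
    goto out;
}
```

Calling replace_payload() twice with the same snapshot makes the second call
take the out_free path, and live_allocs stays at 1, so the page prepared by
the losing caller is not leaked.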

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 0/2] kernel BUG at mm/huge_memory.c:212!
  2012-11-27 21:18 kernel BUG at mm/huge_memory.c:212! Jiri Slaby
  2012-11-27 23:47 ` David Rientjes
  2012-11-29  7:38 ` Bob Liu
@ 2012-11-30 15:03 ` Kirill A. Shutemov
  2012-11-30 15:03   ` [PATCH 1/2] thp: fix anononymous page accounting in fallback path for COW of HZP Kirill A. Shutemov
                     ` (2 more replies)
  2 siblings, 3 replies; 12+ messages in thread
From: Kirill A. Shutemov @ 2012-11-30 15:03 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: linux-mm, LKML, David Rientjes, Bob Liu, Andrew Morton,
	Andrea Arcangeli, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Hi Jiri,

Sorry for the late answer. It took time to reproduce and debug the issue.

Could you test the two patches below in this thread? I expect them to fix both
issues: put_huge_zero_page() and Bad rss-counter state.

Kirill A. Shutemov (2):
  thp: fix anononymous page accounting in fallback path for COW of HZP
  thp: avoid race on multiple parallel page faults to the same page

 mm/huge_memory.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

-- 
1.7.11.7


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] thp: fix anononymous page accounting in fallback path for COW of HZP
  2012-11-30 15:03 ` [PATCH 0/2] " Kirill A. Shutemov
@ 2012-11-30 15:03   ` Kirill A. Shutemov
  2012-12-03  3:14     ` Bob Liu
  2012-11-30 15:03   ` [PATCH 2/2] thp: avoid race on multiple parallel page faults to the same page Kirill A. Shutemov
  2012-12-03 13:02   ` [PATCH 0/2] kernel BUG at mm/huge_memory.c:212! Jiri Slaby
  2 siblings, 1 reply; 12+ messages in thread
From: Kirill A. Shutemov @ 2012-11-30 15:03 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: linux-mm, LKML, David Rientjes, Bob Liu, Andrew Morton,
	Andrea Arcangeli, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Don't forget to account the newly allocated page in the fallback path for
copy-on-write of the huge zero page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57f0024..9d6f521 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1164,6 +1164,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
 	pmd_populate(mm, pmd, pgtable);
 	spin_unlock(&mm->page_table_lock);
 	put_huge_zero_page();
+	inc_mm_counter(mm, MM_ANONPAGES);
 
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] thp: avoid race on multiple parallel page faults to the same page
  2012-11-30 15:03 ` [PATCH 0/2] " Kirill A. Shutemov
  2012-11-30 15:03   ` [PATCH 1/2] thp: fix anononymous page accounting in fallback path for COW of HZP Kirill A. Shutemov
@ 2012-11-30 15:03   ` Kirill A. Shutemov
  2012-12-03  2:29     ` Bob Liu
  2012-12-03 13:02   ` [PATCH 0/2] kernel BUG at mm/huge_memory.c:212! Jiri Slaby
  2 siblings, 1 reply; 12+ messages in thread
From: Kirill A. Shutemov @ 2012-11-30 15:03 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: linux-mm, LKML, David Rientjes, Bob Liu, Andrew Morton,
	Andrea Arcangeli, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The pmd value is stable only with mm->page_table_lock taken. After taking
the lock we need to check that nobody modified the pmd before changing it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9d6f521..51cb8fe 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -770,17 +770,20 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif
 
-static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
 		unsigned long zero_pfn)
 {
 	pmd_t entry;
+	if (!pmd_none(*pmd))
+		return false;
 	entry = pfn_pmd(zero_pfn, vma->vm_page_prot);
 	entry = pmd_wrprotect(entry);
 	entry = pmd_mkhuge(entry);
 	set_pmd_at(mm, haddr, pmd, entry);
 	pgtable_trans_huge_deposit(mm, pgtable);
 	mm->nr_ptes++;
+	return true;
 }
 
 int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -800,6 +803,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				transparent_hugepage_use_zero_page()) {
 			pgtable_t pgtable;
 			unsigned long zero_pfn;
+			bool set;
 			pgtable = pte_alloc_one(mm, haddr);
 			if (unlikely(!pgtable))
 				return VM_FAULT_OOM;
@@ -810,9 +814,13 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				goto out;
 			}
 			spin_lock(&mm->page_table_lock);
-			set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+			set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
 					zero_pfn);
 			spin_unlock(&mm->page_table_lock);
+			if (!set) {
+				pte_free(mm, pgtable);
+				put_huge_zero_page();
+			}
 			return 0;
 		}
 		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
@@ -1046,14 +1054,16 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 */
 	if (is_huge_zero_pmd(pmd)) {
 		unsigned long zero_pfn;
+		bool set;
 		/*
 		 * get_huge_zero_page() will never allocate a new page here,
 		 * since we already have a zero page to copy. It just takes a
 		 * reference.
 		 */
 		zero_pfn = get_huge_zero_page();
-		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+		set = set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
 				zero_pfn);
+		BUG_ON(!set); /* unexpected !pmd_none(dst_pmd) */
 		ret = 0;
 		goto out_unlock;
 	}
@@ -1110,7 +1120,7 @@ unlock:
 
 static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd, unsigned long haddr)
+		pmd_t *pmd, pmd_t orig_pmd, unsigned long haddr)
 {
 	pgtable_t pgtable;
 	pmd_t _pmd;
@@ -1139,6 +1149,9 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 
 	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		goto out_free_page;
+
 	pmdp_clear_flush(vma, haddr, pmd);
 	/* leave pmd empty until pte is filled */
 
@@ -1171,6 +1184,12 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
 	ret |= VM_FAULT_WRITE;
 out:
 	return ret;
+out_free_page:
+	spin_unlock(&mm->page_table_lock);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mem_cgroup_uncharge_page(page);
+	put_page(page);
+	goto out;
 }
 
 static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
@@ -1317,7 +1336,7 @@ alloc:
 		count_vm_event(THP_FAULT_FALLBACK);
 		if (is_huge_zero_pmd(orig_pmd)) {
 			ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
-					address, pmd, haddr);
+					address, pmd, orig_pmd, haddr);
 		} else {
 			ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
 					pmd, orig_pmd, page, haddr);
-- 
1.7.11.7
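
The first half of this patch changes set_huge_zero_page() to report whether it
actually installed the entry, so that a fault which lost the race can hand
back the page table and the zero page reference it took beforehand. A minimal
user-space sketch of that pattern follows. All names (struct entry,
set_if_none(), fault(), zero_refs, spare_tables) are hypothetical, the
counters are plain ints because the demo is sequential, and the two calls to
fault() stand in for two parallel page faults on the same address.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct entry {
    pthread_mutex_t lock;   /* plays the role of mm->page_table_lock */
    unsigned long val;      /* plays the role of the pmd; 0 means "none" */
};

static int zero_refs;       /* references taken on the shared zero page */
static int spare_tables;    /* page tables allocated for deposit */

/* Install new_val only if the entry is still empty. Caller holds e->lock. */
static bool set_if_none(struct entry *e, unsigned long new_val)
{
    if (e->val != 0)
        return false;
    e->val = new_val;
    return true;
}

/* One page fault: prepare resources unlocked, install under the lock, and
 * hand the resources back if another fault won the race. */
static bool fault(struct entry *e, unsigned long new_val)
{
    bool set;

    spare_tables++;         /* like pte_alloc_one() */
    zero_refs++;            /* like get_huge_zero_page() */

    pthread_mutex_lock(&e->lock);
    set = set_if_none(e, new_val);
    pthread_mutex_unlock(&e->lock);

    if (!set) {
        spare_tables--;     /* like pte_free() */
        zero_refs--;        /* like put_huge_zero_page() */
    }
    return set;
}
```

After two faults on the same entry, exactly one reference and one page table
remain outstanding, matching the single installed entry. Without the check,
the second fault would overwrite the entry and the counts would no longer
match the number of mappings.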


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] thp: avoid race on multiple parallel page faults to the same page
  2012-11-30 15:03   ` [PATCH 2/2] thp: avoid race on multiple parallel page faults to the same page Kirill A. Shutemov
@ 2012-12-03  2:29     ` Bob Liu
  0 siblings, 0 replies; 12+ messages in thread
From: Bob Liu @ 2012-12-03  2:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jiri Slaby, linux-mm, LKML, David Rientjes, Andrew Morton,
	Andrea Arcangeli

On Fri, Nov 30, 2012 at 11:03 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> The pmd value is stable only with mm->page_table_lock taken. After taking
> the lock we need to check that nobody modified the pmd before changing it.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Reviewed-by: Bob Liu <lliubbo@gmail.com>

> ---
>  mm/huge_memory.c | 29 ++++++++++++++++++++++++-----
>  1 file changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9d6f521..51cb8fe 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -770,17 +770,20 @@ static inline struct page *alloc_hugepage(int defrag)
>  }
>  #endif
>
> -static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
> +static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
>                 struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
>                 unsigned long zero_pfn)
>  {
>         pmd_t entry;
> +       if (!pmd_none(*pmd))
> +               return false;
>         entry = pfn_pmd(zero_pfn, vma->vm_page_prot);
>         entry = pmd_wrprotect(entry);
>         entry = pmd_mkhuge(entry);
>         set_pmd_at(mm, haddr, pmd, entry);
>         pgtable_trans_huge_deposit(mm, pgtable);
>         mm->nr_ptes++;
> +       return true;
>  }
>
>  int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
> @@ -800,6 +803,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>                                 transparent_hugepage_use_zero_page()) {
>                         pgtable_t pgtable;
>                         unsigned long zero_pfn;
> +                       bool set;
>                         pgtable = pte_alloc_one(mm, haddr);
>                         if (unlikely(!pgtable))
>                                 return VM_FAULT_OOM;
> @@ -810,9 +814,13 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>                                 goto out;
>                         }
>                         spin_lock(&mm->page_table_lock);
> -                       set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
> +                       set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
>                                         zero_pfn);
>                         spin_unlock(&mm->page_table_lock);
> +                       if (!set) {
> +                               pte_free(mm, pgtable);
> +                               put_huge_zero_page();
> +                       }
>                         return 0;
>                 }
>                 page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
> @@ -1046,14 +1054,16 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>          */
>         if (is_huge_zero_pmd(pmd)) {
>                 unsigned long zero_pfn;
> +               bool set;
>                 /*
>                  * get_huge_zero_page() will never allocate a new page here,
>                  * since we already have a zero page to copy. It just takes a
>                  * reference.
>                  */
>                 zero_pfn = get_huge_zero_page();
> -               set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
> +               set = set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
>                                 zero_pfn);
> +               BUG_ON(!set); /* unexpected !pmd_none(dst_pmd) */
>                 ret = 0;
>                 goto out_unlock;
>         }
> @@ -1110,7 +1120,7 @@ unlock:
>
>  static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
>                 struct vm_area_struct *vma, unsigned long address,
> -               pmd_t *pmd, unsigned long haddr)
> +               pmd_t *pmd, pmd_t orig_pmd, unsigned long haddr)
>  {
>         pgtable_t pgtable;
>         pmd_t _pmd;
> @@ -1139,6 +1149,9 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
>         mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
>
>         spin_lock(&mm->page_table_lock);
> +       if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +               goto out_free_page;
> +
>         pmdp_clear_flush(vma, haddr, pmd);
>         /* leave pmd empty until pte is filled */
>
> @@ -1171,6 +1184,12 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
>         ret |= VM_FAULT_WRITE;
>  out:
>         return ret;
> +out_free_page:
> +       spin_unlock(&mm->page_table_lock);
> +       mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +       mem_cgroup_uncharge_page(page);
> +       put_page(page);
> +       goto out;
>  }
>
>  static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> @@ -1317,7 +1336,7 @@ alloc:
>                 count_vm_event(THP_FAULT_FALLBACK);
>                 if (is_huge_zero_pmd(orig_pmd)) {
>                         ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
> -                                       address, pmd, haddr);
> +                                       address, pmd, orig_pmd, haddr);
>                 } else {
>                         ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
>                                         pmd, orig_pmd, page, haddr);
> --
> 1.7.11.7
>



-- 
Regards,
--Bob

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] thp: fix anononymous page accounting in fallback path for COW of HZP
  2012-11-30 15:03   ` [PATCH 1/2] thp: fix anononymous page accounting in fallback path for COW of HZP Kirill A. Shutemov
@ 2012-12-03  3:14     ` Bob Liu
  2012-12-03  8:15       ` Kirill A. Shutemov
  0 siblings, 1 reply; 12+ messages in thread
From: Bob Liu @ 2012-12-03  3:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jiri Slaby, linux-mm, LKML, David Rientjes, Andrew Morton,
	Andrea Arcangeli

On Fri, Nov 30, 2012 at 11:03 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Don't forget to account the newly allocated page in the fallback path for
> copy-on-write of the huge zero page.
>

What about the fallback path in do_huge_pmd_wp_page_fallback()?
I think we should also account the newly allocated page there.

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/huge_memory.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 57f0024..9d6f521 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1164,6 +1164,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
>         pmd_populate(mm, pmd, pgtable);
>         spin_unlock(&mm->page_table_lock);
>         put_huge_zero_page();
> +       inc_mm_counter(mm, MM_ANONPAGES);
>
>         mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>
> --
> 1.7.11.7
>

-- 
Regards,
--Bob

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] thp: fix anononymous page accounting in fallback path for COW of HZP
  2012-12-03  3:14     ` Bob Liu
@ 2012-12-03  8:15       ` Kirill A. Shutemov
  0 siblings, 0 replies; 12+ messages in thread
From: Kirill A. Shutemov @ 2012-12-03  8:15 UTC (permalink / raw)
  To: Bob Liu
  Cc: Jiri Slaby, linux-mm, LKML, David Rientjes, Andrew Morton,
	Andrea Arcangeli

On Mon, Dec 03, 2012 at 11:14:38AM +0800, Bob Liu wrote:
> On Fri, Nov 30, 2012 at 11:03 PM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > Don't forget to account the newly allocated page in the fallback path for
> > copy-on-write of the huge zero page.
> >
> 
> What about the fallback path in do_huge_pmd_wp_page_fallback()?
> I think we should also account the newly allocated page there.

No. Normal huge pages have already been accounted on fork(). See
copy_huge_pmd().

The huge zero page (like the 4k zero page) doesn't contribute to RSS, so we need to
account the page which replaces the huge zero page on COW.
-- 
 Kirill A. Shutemov
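
The accounting argument above can be replayed with a toy counter. The
following is a minimal sketch, not kernel code: struct mm, map_zero_page(),
cow_zero_page(), unmap_anon_page() and leftover_anon() are hypothetical
helpers, and the assumption that idx:1 in the "Bad rss-counter state" messages
corresponds to MM_ANONPAGES is based on the counter order MM_FILEPAGES,
MM_ANONPAGES. The `account` flag switches between the behaviour before and
after patch 1/2.

```c
#include <assert.h>
#include <stdbool.h>

enum { MM_FILEPAGES, MM_ANONPAGES, NR_COUNTERS };

struct mm {
    long rss[NR_COUNTERS];
};

/* Mapping the shared zero page does not contribute to RSS. */
static void map_zero_page(struct mm *mm)
{
    (void)mm;
}

/* COW replaces the zero page with a private anonymous page. With
 * account == false the increment is forgotten, as before the fix. */
static void cow_zero_page(struct mm *mm, bool account)
{
    if (account)
        mm->rss[MM_ANONPAGES]++;
}

/* Unmapping a private anonymous page always decrements the counter. */
static void unmap_anon_page(struct mm *mm)
{
    mm->rss[MM_ANONPAGES]--;
}

/* At teardown the counter should be back at zero; return what is left. */
static long leftover_anon(const struct mm *mm)
{
    return mm->rss[MM_ANONPAGES];
}
```

Running map, COW and unmap once with account set to false leaves -1 behind,
which has the same shape as the val:-1 lines in the report. With account set
to true the counter returns to zero.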


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] kernel BUG at mm/huge_memory.c:212!
  2012-11-30 15:03 ` [PATCH 0/2] " Kirill A. Shutemov
  2012-11-30 15:03   ` [PATCH 1/2] thp: fix anononymous page accounting in fallback path for COW of HZP Kirill A. Shutemov
  2012-11-30 15:03   ` [PATCH 2/2] thp: avoid race on multiple parallel page faults to the same page Kirill A. Shutemov
@ 2012-12-03 13:02   ` Jiri Slaby
  2012-12-12  5:36     ` Bob Liu
  2 siblings, 1 reply; 12+ messages in thread
From: Jiri Slaby @ 2012-12-03 13:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, LKML, David Rientjes, Bob Liu, Andrew Morton, Andrea Arcangeli

On 11/30/2012 04:03 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Hi Jiri,
> 
> Sorry for late answer. It took time to reproduce and debug the issue.
> 
> Could you test two patches below by thread. I expect it to fix both
> issues: put_huge_zero_page() and Bad rss-counter state.

Hi, yes. Since I applied the patches last Thursday, it hasn't recurred.

> Kirill A. Shutemov (2):
>   thp: fix anononymous page accounting in fallback path for COW of HZP
>   thp: avoid race on multiple parallel page faults to the same page
> 
>  mm/huge_memory.c | 30 +++++++++++++++++++++++++-----
>  1 file changed, 25 insertions(+), 5 deletions(-)

thanks,
-- 
js
suse labs

* Re: [PATCH 0/2] kernel BUG at mm/huge_memory.c:212!
  2012-12-03 13:02   ` [PATCH 0/2] kernel BUG at mm/huge_memory.c:212! Jiri Slaby
@ 2012-12-12  5:36     ` Bob Liu
  2012-12-12 10:59       ` Kirill A. Shutemov
  0 siblings, 1 reply; 12+ messages in thread
From: Bob Liu @ 2012-12-12  5:36 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Kirill A. Shutemov, linux-mm, LKML, David Rientjes,
	Andrew Morton, Andrea Arcangeli

On Mon, Dec 3, 2012 at 9:02 PM, Jiri Slaby <jslaby@suse.cz> wrote:
> On 11/30/2012 04:03 PM, Kirill A. Shutemov wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> Hi Jiri,
>>
>> Sorry for late answer. It took time to reproduce and debug the issue.
>>
>> Could you test two patches below by thread. I expect it to fix both
>> issues: put_huge_zero_page() and Bad rss-counter state.
>
> Hi, yes, since applying the patches on the last Thu, it didn't recur.
>
>> Kirill A. Shutemov (2):
>>   thp: fix anononymous page accounting in fallback path for COW of HZP
>>   thp: avoid race on multiple parallel page faults to the same page
>>
>>  mm/huge_memory.c | 30 +++++++++++++++++++++++++-----
>>  1 file changed, 25 insertions(+), 5 deletions(-)
>

I still saw this bug on 3.7.0-rc8, but it's hard to reproduce.
It has appeared only once.

-- 
Regards,
--Bob

* Re: [PATCH 0/2] kernel BUG at mm/huge_memory.c:212!
  2012-12-12  5:36     ` Bob Liu
@ 2012-12-12 10:59       ` Kirill A. Shutemov
  0 siblings, 0 replies; 12+ messages in thread
From: Kirill A. Shutemov @ 2012-12-12 10:59 UTC (permalink / raw)
  To: Bob Liu
  Cc: Jiri Slaby, linux-mm, LKML, David Rientjes, Andrew Morton,
	Andrea Arcangeli

On Wed, Dec 12, 2012 at 01:36:36PM +0800, Bob Liu wrote:
> On Mon, Dec 3, 2012 at 9:02 PM, Jiri Slaby <jslaby@suse.cz> wrote:
> > On 11/30/2012 04:03 PM, Kirill A. Shutemov wrote:
> >> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >>
> >> Hi Jiri,
> >>
> >> Sorry for late answer. It took time to reproduce and debug the issue.
> >>
> >> Could you test two patches below by thread. I expect it to fix both
> >> issues: put_huge_zero_page() and Bad rss-counter state.
> >
> > Hi, yes, since applying the patches on the last Thu, it didn't recur.
> >
> >> Kirill A. Shutemov (2):
> >>   thp: fix anononymous page accounting in fallback path for COW of HZP
> >>   thp: avoid race on multiple parallel page faults to the same page
> >>
> >>  mm/huge_memory.c | 30 +++++++++++++++++++++++++-----
> >>  1 file changed, 25 insertions(+), 5 deletions(-)
> >
> 
> I still saw this bug on 3.7.0-rc8, but it's hard to reproduce it.
> It appears only once.

I guess the patch you've posted fixes the issue, right?

It's useful to enable debug_cow to test the fallback path:

echo 1 > /sys/kernel/mm/transparent_hugepage/debug_cow
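
With debug_cow set, a small userspace exerciser along these lines could be
used to drive a read fault followed by a write fault on a THP-eligible
region. It is a sketch, not a confirmed reproducer: touch_huge_region() is a
hypothetical helper, the 2 MiB huge page size is an x86-64 assumption, and
whether the kernel actually maps the huge zero page for the read depends on
the THP configuration of the running system.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)  /* assumed x86-64 huge page size */

/*
 * Map an anonymous region, read it first (so a zero page can be mapped),
 * then write to it (copy-on-write). Returns 0 if the data looks sane,
 * -1 on any failure.
 */
static int touch_huge_region(void)
{
    size_t len = 2 * HPAGE_SIZE;  /* over-allocate so we can align */
    unsigned char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return -1;

    /* Round up to a huge-page boundary inside the mapping. */
    uintptr_t aligned = ((uintptr_t)raw + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
    volatile unsigned char *p = (volatile unsigned char *)aligned;

#ifdef MADV_HUGEPAGE
    /* Only a hint; failure is ignored on kernels without THP. */
    (void)madvise((void *)aligned, HPAGE_SIZE, MADV_HUGEPAGE);
#endif

    unsigned char before = p[0];  /* read fault */
    p[0] = 0x5a;                  /* write fault: COW of whatever was mapped */

    /* The written byte must stick; untouched memory must still read zero. */
    int ok = (before == 0) && (p[0] == 0x5a) && (p[4096] == 0);

    munmap(raw, len);
    return ok ? 0 : -1;
}
```

Running this in a loop from several threads or processes, then checking
dmesg for rss-counter or BUG messages, is one way such a test might be used.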

-- 
 Kirill A. Shutemov

end of thread, other threads:[~2012-12-12 10:57 UTC | newest]

Thread overview: 12+ messages
2012-11-27 21:18 kernel BUG at mm/huge_memory.c:212! Jiri Slaby
2012-11-27 23:47 ` David Rientjes
2012-11-29  7:38 ` Bob Liu
2012-11-30 15:03 ` [PATCH 0/2] " Kirill A. Shutemov
2012-11-30 15:03   ` [PATCH 1/2] thp: fix anononymous page accounting in fallback path for COW of HZP Kirill A. Shutemov
2012-12-03  3:14     ` Bob Liu
2012-12-03  8:15       ` Kirill A. Shutemov
2012-11-30 15:03   ` [PATCH 2/2] thp: avoid race on multiple parallel page faults to the same page Kirill A. Shutemov
2012-12-03  2:29     ` Bob Liu
2012-12-03 13:02   ` [PATCH 0/2] kernel BUG at mm/huge_memory.c:212! Jiri Slaby
2012-12-12  5:36     ` Bob Liu
2012-12-12 10:59       ` Kirill A. Shutemov
