* [PATCH] x86: mm: Fix vmalloc_fault oops during lazy MMU updates.
@ 2013-02-17  2:35 Samu Kallio
  2013-02-21 12:33 ` Konrad Rzeszutek Wilk
From: Samu Kallio @ 2013-02-17  2:35 UTC (permalink / raw)
  To: LKML; +Cc: Samu Kallio

In paravirtualized x86_64 kernels, vmalloc_fault may cause an oops
when lazy MMU updates are enabled, because set_pgd effects are being
deferred.

One instance of this problem is during process mm cleanup with memory
cgroups enabled. The chain of events is as follows:

- zap_pte_range enables lazy MMU updates
- zap_pte_range eventually calls mem_cgroup_charge_statistics,
  which accesses the vmalloc'd mem_cgroup per-cpu stat area
- vmalloc_fault is triggered which tries to sync the corresponding
  PGD entry with set_pgd, but the update is deferred
- vmalloc_fault oopses due to a mismatch in the PUD entries

Calling arch_flush_lazy_mmu_mode immediately after set_pgd makes the
changes visible to the consistency checks.

Signed-off-by: Samu Kallio <samu.kallio@aberdeencloud.com>
---
 arch/x86/mm/fault.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8e13ecb..0a45298 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -378,10 +378,12 @@ static noinline __kprobes int vmalloc_fault(unsigned long address)
 	if (pgd_none(*pgd_ref))
 		return -1;
 
-	if (pgd_none(*pgd))
+	if (pgd_none(*pgd)) {
 		set_pgd(pgd, *pgd_ref);
-	else
+		arch_flush_lazy_mmu_mode();
+	} else {
 		BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
+	}
 
 	/*
 	 * Below here mismatches are bugs because these lower tables
-- 
1.8.1.3
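
To see the deferral outside the kernel, here is a small userspace model
of the failure (an illustration only -- the function names and values
below are made up and none of this is kernel code): "set_pgd" merely
queues the update while lazy mode is on, so a consistency check that
reads the live table still sees an empty entry unless the queue is
flushed first.

#include <stdbool.h>
#include <stdio.h>

#define NO_ENTRY  0UL
#define REF_ENTRY 0x854046000UL        /* arbitrary non-zero "pgd" value */

static unsigned long live_pgd   = NO_ENTRY;  /* what consistency checks read */
static unsigned long queued_pgd = NO_ENTRY;  /* what lazy mode batches up    */
static bool lazy_mode;

static void set_pgd_model(unsigned long val)
{
	if (lazy_mode)
		queued_pgd = val;      /* deferred, like a batched hypercall */
	else
		live_pgd = val;
}

static void flush_lazy_model(void)     /* arch_flush_lazy_mmu_mode() analogue */
{
	if (queued_pgd != NO_ENTRY) {
		live_pgd = queued_pgd;
		queued_pgd = NO_ENTRY;
	}
}

int main(void)
{
	lazy_mode = true;              /* zap_pte_range entered lazy MMU mode  */
	set_pgd_model(REF_ENTRY);      /* vmalloc_fault syncs the PGD entry    */
#ifdef WITH_FIX
	flush_lazy_model();            /* the patch: flush right after set_pgd */
#endif
	/* The lower-level consistency check: without the flush the live
	 * entry is still empty and the kernel's BUG_ON() would fire here. */
	if (live_pgd != REF_ENTRY) {
		puts("mismatch -> would BUG()");
		return 1;
	}
	puts("entries consistent");
	return 0;
}

Compiled as-is it reports the mismatch; compiled with -DWITH_FIX it
passes, which is the whole effect of the one-line flush in the patch.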



* Re: x86: mm: Fix vmalloc_fault oops during lazy MMU updates.
  2013-02-17  2:35 [PATCH] x86: mm: Fix vmalloc_fault oops during lazy MMU updates Samu Kallio
@ 2013-02-21 12:33 ` Konrad Rzeszutek Wilk
  2013-02-21 15:56   ` Samu Kallio
From: Konrad Rzeszutek Wilk @ 2013-02-21 12:33 UTC (permalink / raw)
  To: Samu Kallio, mingo, Jeremy Fitzhardinge; +Cc: LKML

On Sun, Feb 17, 2013 at 02:35:52AM -0000, Samu Kallio wrote:
> In paravirtualized x86_64 kernels, vmalloc_fault may cause an oops
> when lazy MMU updates are enabled, because set_pgd effects are being
> deferred.
> 
> One instance of this problem is during process mm cleanup with memory
> cgroups enabled. The chain of events is as follows:
> 
> - zap_pte_range enables lazy MMU updates
> - zap_pte_range eventually calls mem_cgroup_charge_statistics,
>   which accesses the vmalloc'd mem_cgroup per-cpu stat area
> - vmalloc_fault is triggered which tries to sync the corresponding
>   PGD entry with set_pgd, but the update is deferred
> - vmalloc_fault oopses due to a mismatch in the PUD entries
> 
> Calling arch_flush_lazy_mmu_mode immediately after set_pgd makes the
> changes visible to the consistency checks.

How do you reproduce this? Is there a BUG() or WARN() trace that
is triggered when this happens?

Also, please CC me next time.
> 
> Signed-off-by: Samu Kallio <samu.kallio@aberdeencloud.com>
> 
> ---
> arch/x86/mm/fault.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 8e13ecb..0a45298 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -378,10 +378,12 @@ static noinline __kprobes int vmalloc_fault(unsigned long address)
>  	if (pgd_none(*pgd_ref))
>  		return -1;
>  
> -	if (pgd_none(*pgd))
> +	if (pgd_none(*pgd)) {
>  		set_pgd(pgd, *pgd_ref);
> -	else
> +		arch_flush_lazy_mmu_mode();
> +	} else {
>  		BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
> +	}
>  
>  	/*
>  	 * Below here mismatches are bugs because these lower tables


* Re: x86: mm: Fix vmalloc_fault oops during lazy MMU updates.
  2013-02-21 12:33 ` Konrad Rzeszutek Wilk
@ 2013-02-21 15:56   ` Samu Kallio
  2013-02-23  1:06     ` Konrad Rzeszutek Wilk
From: Samu Kallio @ 2013-02-21 15:56 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: mingo, Jeremy Fitzhardinge, LKML

On Thu, Feb 21, 2013 at 2:33 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Sun, Feb 17, 2013 at 02:35:52AM -0000, Samu Kallio wrote:
>> In paravirtualized x86_64 kernels, vmalloc_fault may cause an oops
>> when lazy MMU updates are enabled, because set_pgd effects are being
>> deferred.
>>
>> One instance of this problem is during process mm cleanup with memory
>> cgroups enabled. The chain of events is as follows:
>>
>> - zap_pte_range enables lazy MMU updates
>> - zap_pte_range eventually calls mem_cgroup_charge_statistics,
>>   which accesses the vmalloc'd mem_cgroup per-cpu stat area
>> - vmalloc_fault is triggered which tries to sync the corresponding
>>   PGD entry with set_pgd, but the update is deferred
>> - vmalloc_fault oopses due to a mismatch in the PUD entries
>>
>> Calling arch_flush_lazy_mmu_mode immediately after set_pgd makes the
>> changes visible to the consistency checks.
>
> How do you reproduce this? Is there a BUG() or WARN() trace that
> is triggered when this happens?

In my case I've seen this triggered on an Amazon EC2 (Xen PV) instance
under heavy load spawning many LXC containers. The best I can say at
this point is that the frequency of this bug seems to be linked to how
busy the machine is.

The earliest report of this problem was from 3.3:
    http://comments.gmane.org/gmane.linux.kernel.cgroups/5540
I can personally confirm the issue since 3.5.

Here's a sample bug report from a 3.7 kernel (vanilla with Xen XSAVE patch
for EC2 compatibility). The latest kernel version I have tested and seen this
problem occur is 3.7.9.

[11852214.733630] ------------[ cut here ]------------
[11852214.733642] kernel BUG at arch/x86/mm/fault.c:397!
[11852214.733648] invalid opcode: 0000 [#1] SMP
[11852214.733654] Modules linked in: veth xt_nat xt_comment fuse btrfs
libcrc32c zlib_deflate ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat
xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
bridge stp llc iptable_filter ip_tables x_tables ghash_clmulni_intel
aesni_intel aes_x86_64 ablk_helper cryptd xts lrw gf128mul microcode
ext4 crc16 jbd2 mbcache
[11852214.733695] CPU 1
[11852214.733700] Pid: 1617, comm: qmgr Not tainted 3.7.0-1-ec2 #1
[11852214.733705] RIP: e030:[<ffffffff8143018d>]  [<ffffffff8143018d>] vmalloc_fault+0x14b/0x249
[11852214.733725] RSP: e02b:ffff88083e57d7f8  EFLAGS: 00010046
[11852214.733730] RAX: 0000000854046000 RBX: ffffe8ffffc80d70 RCX: ffff880000000000
[11852214.733736] RDX: 00003ffffffff000 RSI: ffff880854046ff8 RDI: 0000000000000000
[11852214.733744] RBP: ffff88083e57d818 R08: 0000000000000000 R09: ffff880000000ff8
[11852214.733750] R10: 0000000000007ff0 R11: 0000000000000001 R12: ffff880854686e88
[11852214.733758] R13: ffffffff8180ce88 R14: ffff88083e57d948 R15: 0000000000000000
[11852214.733768] FS:  00007ff3bf0f8740(0000) GS:ffff88088b480000(0000) knlGS:0000000000000000
[11852214.733777] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[11852214.733782] CR2: ffffe8ffffc80d70 CR3: 0000000854686000 CR4: 0000000000002660
[11852214.733790] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[11852214.733796] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[11852214.733803] Process qmgr (pid: 1617, threadinfo ffff88083e57c000, task ffff88084474b3e0)
[11852214.733810] Stack:
[11852214.733814]  0000000000000029 0000000000000002 ffffe8ffffc80d70 ffff88083e57d948
[11852214.733828]  ffff88083e57d928 ffffffff8103e0c7 0000000000000000 ffff88083e57d8d0
[11852214.733840]  ffff88084474b3e0 0000000000000060 0000000000000000 0000000000006cf6
[11852214.733852] Call Trace:
[11852214.733861]  [<ffffffff8103e0c7>] __do_page_fault+0x2c7/0x4a0
[11852214.733871]  [<ffffffff81004ac2>] ? xen_mc_flush+0xb2/0x1b0
[11852214.733880]  [<ffffffff810032ce>] ? xen_end_context_switch+0x1e/0x30
[11852214.733888]  [<ffffffff810043cb>] ? xen_write_msr_safe+0x9b/0xc0
[11852214.733900]  [<ffffffff810125b3>] ? __switch_to+0x163/0x4a0
[11852214.733907]  [<ffffffff8103e2de>] do_page_fault+0xe/0x10
[11852214.733919]  [<ffffffff81437f98>] page_fault+0x28/0x30
[11852214.733930]  [<ffffffff8115e873>] ? mem_cgroup_charge_statistics.isra.12+0x13/0x50
[11852214.733940]  [<ffffffff8116012e>] __mem_cgroup_uncharge_common+0xce/0x2d0
[11852214.733948]  [<ffffffff81007fee>] ? xen_pte_val+0xe/0x10
[11852214.733958]  [<ffffffff8116391a>] mem_cgroup_uncharge_page+0x2a/0x30
[11852214.733966]  [<ffffffff81139e78>] page_remove_rmap+0xf8/0x150
[11852214.733976]  [<ffffffff8112d78a>] ? vm_normal_page+0x1a/0x80
[11852214.733984]  [<ffffffff8112e5b3>] unmap_single_vma+0x573/0x860
[11852214.733994]  [<ffffffff81114520>] ? release_pages+0x1f0/0x230
[11852214.734004]  [<ffffffff810054aa>] ? __xen_pgd_walk+0x16a/0x260
[11852214.734018]  [<ffffffff8112f0b2>] unmap_vmas+0x52/0xa0
[11852214.734026]  [<ffffffff81136e08>] exit_mmap+0x98/0x170
[11852214.734034]  [<ffffffff8104b929>] mmput+0x59/0x110
[11852214.734043]  [<ffffffff81053d95>] exit_mm+0x105/0x130
[11852214.734051]  [<ffffffff814376e0>] ? _raw_spin_lock_irq+0x10/0x40
[11852214.734059]  [<ffffffff81053f27>] do_exit+0x167/0x900
[11852214.734070]  [<ffffffff8106093d>] ? __sigqueue_free+0x3d/0x50
[11852214.734079]  [<ffffffff81060b9e>] ? __dequeue_signal+0x10e/0x1f0
[11852214.734087]  [<ffffffff810549ff>] do_group_exit+0x3f/0xb0
[11852214.734097]  [<ffffffff81063431>] get_signal_to_deliver+0x1c1/0x5e0
[11852214.734107]  [<ffffffff8101334f>] do_signal+0x3f/0x960
[11852214.734114]  [<ffffffff811aae61>] ? ep_poll+0x2a1/0x360
[11852214.734122]  [<ffffffff81083420>] ? try_to_wake_up+0x2d0/0x2d0
[11852214.734129]  [<ffffffff81013cd8>] do_notify_resume+0x48/0x60
[11852214.734138]  [<ffffffff81438a5a>] int_signal+0x12/0x17
[11852214.734143] Code: ff ff 3f 00 00 48 21 d0 4c 8d 0c 30 ff 14 25 b8 f3 81 81 48 21 d0 48 01 c6 48 83 3e 00 0f 84 fa 00 00 00 49 8b 39 48 85 ff 75 02 <0f> 0b ff 14 25 e0 f3 81 81 49 89 c0 48 8b 3e ff 14 25 e0 f3 81
[11852214.734212] RIP  [<ffffffff8143018d>] vmalloc_fault+0x14b/0x249
[11852214.734222]  RSP <ffff88083e57d7f8>
[11852214.734231] ---[ end trace 81ac798210f95867 ]---
[11852214.734237] Fixing recursive fault but reboot is needed!
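
For reference, the BUG at arch/x86/mm/fault.c:397 above is the
lower-level consistency check mentioned in the patch description; in a
3.7-era vmalloc_fault() the code around that point looks roughly like
this (paraphrased from memory, the exact line may differ):

	pud = pud_offset(pgd, address);
	pud_ref = pud_offset(pgd_ref, address);
	if (pud_none(*pud_ref))
		return -1;

	/* With set_pgd() still queued in the lazy batch, *pud is
	 * resolved through a stale PGD entry and this check fires. */
	if (pud_none(*pud) || pud_page_vaddr(*pud) != pud_page_vaddr(*pud_ref))
		BUG();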

> Also, please CC me next time.

Will do. I originally CC'd Jeremy since he made some lazy MMU related
cleanups in arch/x86/mm/fault.c, and I thought he might have a comment
on this.


* Re: x86: mm: Fix vmalloc_fault oops during lazy MMU updates.
  2013-02-21 15:56   ` Samu Kallio
@ 2013-02-23  1:06     ` Konrad Rzeszutek Wilk
From: Konrad Rzeszutek Wilk @ 2013-02-23  1:06 UTC (permalink / raw)
  To: Samu Kallio, mingo; +Cc: Jeremy Fitzhardinge, LKML, xen-devel

On Thu, Feb 21, 2013 at 05:56:35PM +0200, Samu Kallio wrote:
> On Thu, Feb 21, 2013 at 2:33 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> > On Sun, Feb 17, 2013 at 02:35:52AM -0000, Samu Kallio wrote:
> >> In paravirtualized x86_64 kernels, vmalloc_fault may cause an oops
> >> when lazy MMU updates are enabled, because set_pgd effects are being
> >> deferred.
> >>
> >> One instance of this problem is during process mm cleanup with memory
> >> cgroups enabled. The chain of events is as follows:
> >>
> >> - zap_pte_range enables lazy MMU updates
> >> - zap_pte_range eventually calls mem_cgroup_charge_statistics,
> >>   which accesses the vmalloc'd mem_cgroup per-cpu stat area
> >> - vmalloc_fault is triggered which tries to sync the corresponding
> >>   PGD entry with set_pgd, but the update is deferred
> >> - vmalloc_fault oopses due to a mismatch in the PUD entries
> >>
> >> Calling arch_flush_lazy_mmu_mode immediately after set_pgd makes the
> >> changes visible to the consistency checks.
> >
> > How do you reproduce this? Is there a BUG() or WARN() trace that
> > is triggered when this happens?
> 
> In my case I've seen this triggered on an Amazon EC2 (Xen PV) instance
> under heavy load spawning many LXC containers. The best I can say at
> this point is that the frequency of this bug seems to be linked to how
> busy the machine is.
> 
> The earliest report of this problem was from 3.3:
>     http://comments.gmane.org/gmane.linux.kernel.cgroups/5540
> I can personally confirm the issue since 3.5.
> 
> Here's a sample bug report from a 3.7 kernel (vanilla with Xen XSAVE patch
> for EC2 compatibility). The latest kernel version I have tested and seen this
> problem occur is 3.7.9.

Ingo,

I am OK with this patch. Are you OK taking this in or should I take
it (and add the nice RIP below)?

It should also have CC: stable@vger.kernel.org on it.

FYI, there is also a Red Hat bug for this: https://bugzilla.redhat.com/show_bug.cgi?id=914737


