* [RFC PATCH] Fix: x86 unaligned __memcpy to/from virtual memory
@ 2015-06-24 16:14 Mathieu Desnoyers
  2015-06-24 17:00 ` Linus Torvalds
  0 siblings, 1 reply; 9+ messages in thread
From: Mathieu Desnoyers @ 2015-06-24 16:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Mathieu Desnoyers, Ingo Molnar, H. Peter Anvin,
	Linus Torvalds, x86

When trying to change memory allocation from kmalloc to vmalloc to
handle memory fragmentation for reallocation of a growing string within
a kernel module, our test suite started to trigger a kernel OOPS. It
triggers when the string is copied into a ring buffer using memcpy,
piece-wise.

Here is the OOPS:

[ 4078.314978] BUG: unable to handle kernel paging request at ffffc900038d995e
[ 4078.315824] IP: [<ffffffff81316f12>] __memcpy+0x12/0x20
[ 4078.315824] PGD 236c92067 PUD 236c93067 PMD bac0c067 PTE 0
[ 4078.315824] Oops: 0000 [#1] SMP
[ 4078.315824] Modules linked in: lttng_probe_workqueue(O) lttng_probe_vmscan(O) lttng_probe_udp(O) lttng_probe_timer(O) lttng_probe_sunrpc(O) lttng_probe_statedump(O) lttng_probe_sock(O) lttng_probe_skb(O) lttng_probe_signal(O) lttng_probe_scsi(O) lttng_probe_sched(O) lttng_probe_regmap(O) lttng_probe_rcu(O) lttng_probe_random(O) lttng_probe_printk(O) lttng_probe_power(O) lttng_probe_net(O) lttng_probe_napi(O) lttng_probe_module(O) lttng_probe_kmem(O) lttng_probe_jbd2(O) lttng_probe_irq(O) lttng_probe_ext4(O) lttng_probe_compaction(O) lttng_probe_block(O) lttng_types(O) lttng_ring_buffer_metadata_mmap_client(O) lttng_ring_buffer_client_mmap_overwrite(O) lttng_ring_buffer_client_mmap_discard(O) lttng_ring_buffer_metadata_client(O) lttng_ring_buffer_client_overwrite(O) lttng_ring_buffer_client_discard(O) lttng_tracer(O) lttng_statedump(O) lttng_kprobes(O) lttng_lib_ring_buffer(O) lttng_kretprobes(O) virtio_blk virtio_net virtio_pci virtio_ring virtio
[ 4078.315824] CPU: 5 PID: 4258 Comm: lttng-consumerd Tainted: G           O    4.1.0 #7
[ 4078.315824] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 4078.315824] task: ffff8802350c3660 ti: ffff8800bae84000 task.ti: ffff8800bae84000
[ 4078.315824] RIP: 0010:[<ffffffff81316f12>]  [<ffffffff81316f12>] __memcpy+0x12/0x20
[ 4078.315824] RSP: 0018:ffff8800bae87da0  EFLAGS: 00010246
[ 4078.315824] RAX: ffff880235439025 RBX: 0000000000000fd8 RCX: 00000000000001fb
[ 4078.315824] RDX: 0000000000000000 RSI: ffffc900038d995e RDI: ffff880235439025
[ 4078.315824] RBP: ffff8800bae87db8 R08: ffff8800bacecc00 R09: 0000000000008000
[ 4078.315824] R10: 0000000000000000 R11: 0000000000000246 R12: ffff8800bae87dc8
[ 4078.315824] R13: ffff88023466e800 R14: 0000000000000fd8 R15: 0000000000000fd8
[ 4078.315824] FS:  00007f5d3b1cc700(0000) GS:ffff8802372a0000(0000) knlGS:0000000000000000
[ 4078.315824] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 4078.315824] CR2: ffffc900038d995e CR3: 00000000bb1ed000 CR4: 00000000000006e0
[ 4078.315824] Stack:
[ 4078.315824]  ffffffffa01ac797 ffff8800bb5bd480 ffff8800bb5bd4d0 ffff8800bae87e48
[ 4078.315824]  ffffffffa0073060 ffff88023466e800 0000000000000000 0000000000000fd8
[ 4078.315824]  ffffffff00000001 ffff8800bacecc00 0000000000000fd8 0000000000008025
[ 4078.315824] Call Trace:
[ 4078.315824]  [<ffffffffa01ac797>] ? lttng_event_write+0x87/0xb0 [lttng_ring_buffer_metadata_client]
[ 4078.315824]  [<ffffffffa0073060>] lttng_metadata_output_channel+0xd0/0x120 [lttng_tracer]
[ 4078.315824]  [<ffffffffa00755f9>] lttng_metadata_ring_buffer_ioctl+0x79/0xd0 [lttng_tracer]
[ 4078.315824]  [<ffffffff8117ba10>] do_vfs_ioctl+0x2e0/0x4e0
[ 4078.315824]  [<ffffffff812b35c7>] ? file_has_perm+0x87/0xa0
[ 4078.315824]  [<ffffffff8117bc91>] SyS_ioctl+0x81/0xa0
[ 4078.315824]  [<ffffffff810115d1>] ? syscall_trace_leave+0xd1/0xe0
[ 4078.315824]  [<ffffffff818bbd37>] tracesys_phase2+0x84/0x89
[ 4078.315824] Code: 5b 5d c3 66 0f 1f 44 00 00 e8 6b fc ff ff eb e1 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3
[ 4078.315824] RIP  [<ffffffff81316f12>] __memcpy+0x12/0x20
[ 4078.315824]  RSP <ffff8800bae87da0>
[ 4078.315824] CR2: ffffc900038d995e
[ 4078.315824] ---[ end trace a05b652829ceda48 ]---

This points to the rep movsq instruction in arch/x86/lib/memcpy_64.S:__memcpy.
This could be reproduced on my Lenovo x240 laptop (i7 CPU), and within a
virtual machine running on an Intel(R) Xeon(R) CPU E5-2630 v3 host.
Interestingly, with the VM having the rep_good flag (but not erms), the issue
triggers. However, if the VM has both rep_good and erms flags, the issue does
not trigger.

Moreover, if I call vmalloc_sync_all() just after each vmalloc()
allocation, the issue does not trigger.

It looks like there is some bad interaction between this implementation
of memcpy and vmalloc faults in the kernel, in cases where the source or
destination addresses are not aligned on multiples of 8 bytes. I'm not
sure if the right fix is to fix __memcpy or to look into this issue from
a vmalloc fault handler perspective.

This fix only covers x86-64. It would be interesting to check whether
x86-32 rep; movsl memcpy is also affected.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: x86@kernel.org
---
 arch/x86/lib/memcpy_64.S | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index b046664..df1ba95 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -29,6 +29,15 @@ ENTRY(__memcpy)
 ENTRY(memcpy)
 	ALTERNATIVE_2 "jmp memcpy_orig", "", X86_FEATURE_REP_GOOD, \
 		      "jmp memcpy_erms", X86_FEATURE_ERMS
+	/*
+	 * Use memcpy_orig when the source or destination address is not
+	 * aligned on a multiple of 8 bytes. This takes care of vmalloc
+	 * fault issues with unaligned rep movsq accesses.
+	 */
+	movq %rsi, %rax
+	orq %rdi, %rax
+	andl $7, %eax
+	jnz memcpy_orig
 
 	movq %rdi, %rax
 	movq %rdx, %rcx
-- 
1.9.1
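
For readers less used to AT&T assembly, the four instructions added by the patch above can be restated in C. This is only an illustrative sketch: the function name is made up, and addresses are passed as plain integers so it can be tried in user space.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Mirrors the added check: OR the two addresses together and test the
 * low three bits. The result is true only when BOTH addresses are
 * 8-byte aligned, i.e. when the patched memcpy would still fall
 * through to the rep movsq path instead of jumping to memcpy_orig.
 * (Hypothetical helper name, not a kernel function.)
 */
static bool takes_movsq_path(uint64_t dst, uint64_t src)
{
    return ((dst | src) & 7) == 0;
}
```

With the register values from the OOPS (RDI ending in ...025, RSI ending in ...95e) the check fails, so the patched code would have taken the memcpy_orig path.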



* Re: [RFC PATCH] Fix: x86 unaligned __memcpy to/from virtual memory
  2015-06-24 16:14 [RFC PATCH] Fix: x86 unaligned __memcpy to/from virtual memory Mathieu Desnoyers
@ 2015-06-24 17:00 ` Linus Torvalds
  2015-06-24 18:49   ` Mathieu Desnoyers
  0 siblings, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2015-06-24 17:00 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Linux Kernel Mailing List, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers

On Wed, Jun 24, 2015 at 9:14 AM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> When trying to change memory allocation from kmalloc to vmalloc to
> handle memory fragmentation for reallocation of a growing string within
> a kernel module, our test suite started to trigger a kernel OOPS. It
> triggers when the string is copied into a ring buffer using memcpy,
> piece-wise.

I hate your patch, just because it doesn't make sense. The "when
non-aligned, don't do movsq" might make sense for performance, but it
does *not* make sense for correctness.

Why would "rep movsq" trigger the oops, but memcpy_orig not? I think
the fundamental bug is something else.

I don't see *what* the bug is, though.

Very odd.

x86 people, can you see anything there? It does look like
vmalloc_fault() *should* have triggered, so why didn't it? The address
is definitely in the VMALLOC_START/END range, and the error code is
0000, so how come vmalloc_fault() didn't handle this?

> This points to the rep movsq instruction in arch/x86/lib/memcpy_64.S:__memcpy.
> This could be reproduced on my Lenovo x240 laptop (i7 CPU), and within a
> virtual machine running on an Intel(R) Xeon(R) CPU E5-2630 v3 host.
> Interestingly, with the VM having the rep_good flag (but not erms), the issue
> triggers. However, if the VM has both rep_good and erms flags, the issue does
> not trigger.

With ERMS, I think we end up using just "rep movsb" instead. But there
should be absolutely no difference in fault patterns.

I see the QEMU part, is this just regular kvm? Could you add a debug
printk to the vmalloc_fault() caller and then reproduce the oops? It
shouldn't trigger enough to be a horrible logging problem.

                  Linus
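
The two observations above (the address lies in the vmalloc range, and the error code is 0000) can be checked mechanically. The sketch below is user-space C, not kernel code. The range constants are assumed from the documented x86-64 memory map for 4-level paging without KASLR, and the bit names follow the architectural page-fault error code layout; treat both as assumptions if your kernel differs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed vmalloc/ioremap window for x86-64, 4-level paging, no KASLR. */
#define VMALLOC_START 0xffffc90000000000ULL
#define VMALLOC_END   0xffffe8ffffffffffULL

/* Low bits of the x86 page-fault error code. */
#define PF_PROT  (1u << 0) /* 0: page not present, 1: protection fault */
#define PF_WRITE (1u << 1) /* 0: read access,      1: write access     */
#define PF_USER  (1u << 2) /* 0: kernel mode,      1: user mode        */

/* Same shape as the initial range check in vmalloc_fault(). */
static bool in_vmalloc_range(uint64_t addr)
{
    return addr >= VMALLOC_START && addr < VMALLOC_END;
}

/* Error code 0000 decodes to: kernel-mode read of a non-present page. */
static bool is_kernel_read_of_missing_page(unsigned int error_code)
{
    return (error_code & (PF_PROT | PF_WRITE | PF_USER)) == 0;
}
```

Feeding in CR2 from the first OOPS (ffffc900038d995e) gives an in-range address, while the destination (ffff880235439025) is a direct-mapped address outside that window.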


* Re: [RFC PATCH] Fix: x86 unaligned __memcpy to/from virtual memory
  2015-06-24 17:00 ` Linus Torvalds
@ 2015-06-24 18:49   ` Mathieu Desnoyers
  2015-06-24 18:53     ` H. Peter Anvin
  2015-06-24 19:15     ` Linus Torvalds
  0 siblings, 2 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2015-06-24 18:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Linux Kernel Mailing List, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers

----- On Jun 24, 2015, at 1:00 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Wed, Jun 24, 2015 at 9:14 AM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>> When trying to change memory allocation from kmalloc to vmalloc to
>> handle memory fragmentation for reallocation of a growing string within
>> a kernel module, our test suite started to trigger a kernel OOPS. It
>> triggers when the string is copied into a ring buffer using memcpy,
>> piece-wise.
> 
> I hate your patch, just because it doesn't make sense. The "when
> non-aligned, don't do movsq" might make sense for performance, but it
> does *not* make sense for correctness.
> 
> Why would "rep movsq" trigger the oops, but memcpy_orig not? I think
> the fundamental bug is something else.
> 
> I don't see *what* the bug is, though.
> 
> Very odd.
> 
> x86 people, can you see anything there? It does look like
> vmalloc_fault() *should* have triggered, so why didn't it? The address
> is definitely in the VMALLOC_START/END range, and the error code is
> 0000, so how come vmalloc_fault() didn't handle this?
> 
>> This points to the rep movsq instruction in arch/x86/lib/memcpy_64.S:__memcpy.
>> This could be reproduced on my Lenovo x240 laptop (i7 CPU), and within a
>> virtual machine running on an Intel(R) Xeon(R) CPU E5-2630 v3 host.
>> Interestingly, with the VM having the rep_good flag (but not erms), the issue
>> triggers. However, if the VM has both rep_good and erms flags, the issue does
>> not trigger.
> 
> With ERMS, I think we end up using just "rep movsb" instead. But there
> should be absolutely no difference in fault patterns.
> 
> I see the QEMU part, is this just regular kvm?

Yes, this is just regular kvm.

> Could you add a debug
> printk to the vmalloc_fault() caller and then reproduce the oops? It
> shouldn't trigger enough to be a horrible logging problem.

Here is the output. I added the printk just after the initial range
check within vmalloc_fault. What is weird is that the fault happens
on an aligned source address. It's the destination which is unaligned.
Let me know if you need more info.

[   53.084521] DEBUG: vmalloc_fault at address 0xffffc9000746e000
[   53.085460] BUG: unable to handle kernel paging request at ffffc9000746e000
[   53.085460] IP:
[   53.090220]  [<ffffffff81316f12>] __memcpy+0x12/0x20
[   53.090220] PGD 236c92067 PUD 236c93067 PMD 22e840067 PTE 0
[   53.090220] Oops: 0000 [#1] SMP 
[   53.090220] Modules linked in: lttng_probe_workqueue(O) lttng_probe_vmscan(O) lttng_probe_udp(O) lttng_probe_timer(O) lttng_probe_sunrpc(O) lttng_probe_statedump(O) lttng_probe_sock(O) lttng_probe_skb(O) lttng_probe_signal(O) lttng_probe_scsi(O) lttng_probe_sched(O) lttng_probe_regmap(O) lttng_probe_rcu(O) lttng_probe_random(O) lttng_probe_power(O) lttng_probe_net(O) lttng_probe_napi(O) lttng_probe_module(O) lttng_probe_kmem(O) lttng_probe_jbd2(O) lttng_probe_irq(O) lttng_probe_ext4(O) lttng_probe_compaction(O) lttng_probe_block(O) lttng_types(O) lttng_ring_buffer_metadata_mmap_client(O) lttng_ring_buffer_client_mmap_overwrite(O) lttng_ring_buffer_client_mmap_discard(O) lttng_ring_buffer_metadata_client(O) lttng_ring_buffer_client_overwrite(O) lttng_ring_buffer_client_discard(O) lttng_tracer(O) lttng_statedump(O) lttng_kprobes(O) lttng_lib_ring_buffer(O) lttng_kretprobes(O) virtio_blk virtio_net virtio_pci virtio_ring virtio [last unloaded: lttng_statedump]
[   53.090220] CPU: 4 PID: 3532 Comm: lttng-consumerd Tainted: G           O    4.1.0+ #10
[   53.090220] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[   53.090220] task: ffff880235355aa0 ti: ffff8800bb6d0000 task.ti: ffff8800bb6d0000
[   53.090220] RIP: 0010:[<ffffffff81316f12>]  [<ffffffff81316f12>] __memcpy+0x12/0x20
[   53.090220] RSP: 0018:ffff8800bb6d3da0  EFLAGS: 00010206
[   53.090220] RAX: ffff8802355b3025 RBX: 0000000000000fdb RCX: 00000000000001fb
[   53.090220] RDX: 0000000000000003 RSI: ffffc9000746e000 RDI: ffff8802355b3025
[   53.090220] RBP: ffff8800bb6d3db8 R08: ffff880231cd7200 R09: 0000000000000025
[   53.090220] R10: 0000000000000000 R11: 0000000000001000 R12: ffff8800bb6d3dc8
[   53.090220] R13: ffff88022e437400 R14: 0000000000000fdb R15: 0000000000000fdb
[   53.090220] FS:  00007f24d8bbc700(0000) GS:ffff880237280000(0000) knlGS:0000000000000000
[   53.090220] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   53.090220] CR2: ffffc9000746e000 CR3: 00000000ba6d6000 CR4: 00000000000006e0
[   53.090220] Stack:
[   53.090220]  ffffffffa05ac797 ffff8802334fb300 ffff8802334fb350 ffff8800bb6d3e48
[   53.090220]  ffffffffa0473060 ffff88022e437400 0000000000000000 0000000000000fdb
[   53.090220]  ffffffff00000001 ffff880231cd7200 0000000000000fdb 0000000000000025
[   53.090220] Call Trace:
[   53.090220]  [<ffffffffa05ac797>] ? lttng_event_write+0x87/0xb0 [lttng_ring_buffer_metadata_client]
[   53.090220]  [<ffffffffa0473060>] lttng_metadata_output_channel+0xd0/0x120 [lttng_tracer]
[   53.090220]  [<ffffffffa04755f9>] lttng_metadata_ring_buffer_ioctl+0x79/0xd0 [lttng_tracer]
[   53.090220]  [<ffffffff8117ba10>] do_vfs_ioctl+0x2e0/0x4e0
[   53.090220]  [<ffffffff812b35c7>] ? file_has_perm+0x87/0xa0
[   53.090220]  [<ffffffff8117bc91>] SyS_ioctl+0x81/0xa0
[   53.090220]  [<ffffffff818bbd37>] tracesys_phase2+0x84/0x89
[   53.090220] Code: 5b 5d c3 66 0f 1f 44 00 00 e8 6b fc ff ff eb e1 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 
[   53.090220] RIP  [<ffffffff81316f12>] __memcpy+0x12/0x20
[   53.090220]  RSP <ffff8800bb6d3da0>
[   53.090220] CR2: ffffc9000746e000
[   53.090220] ---[ end trace 850d7bf1b41647ee ]---



-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH] Fix: x86 unaligned __memcpy to/from virtual memory
  2015-06-24 18:49   ` Mathieu Desnoyers
@ 2015-06-24 18:53     ` H. Peter Anvin
  2015-06-24 19:15     ` Linus Torvalds
  1 sibling, 0 replies; 9+ messages in thread
From: H. Peter Anvin @ 2015-06-24 18:53 UTC (permalink / raw)
  To: Mathieu Desnoyers, Linus Torvalds
  Cc: Thomas Gleixner, Linux Kernel Mailing List, Ingo Molnar,
	the arch/x86 maintainers

On 06/24/2015 11:49 AM, Mathieu Desnoyers wrote:
> 
> [   53.084521] DEBUG: vmalloc_fault at address 0xffffc9000746e000
> [   53.085460] BUG: unable to handle kernel paging request at ffffc9000746e000
> [   53.085460] IP:
> [   53.090220]  [<ffffffff81316f12>] __memcpy+0x12/0x20
> [   53.090220] PGD 236c92067 PUD 236c93067 PMD 22e840067 PTE 0
> [   53.090220] Oops: 0000 [#1] SMP 
> [   53.090220] Modules linked in: lttng_probe_workqueue(O) lttng_probe_vmscan(O) lttng_probe_udp(O) lttng_probe_timer(O) lttng_probe_sunrpc(O) lttng_probe_statedump(O) lttng_probe_sock(O) lttng_probe_skb(O) lttng_probe_signal(O) lttng_probe_scsi(O) lttng_probe_sched(O) lttng_probe_regmap(O) lttng_probe_rcu(O) lttng_probe_random(O) lttng_probe_power(O) lttng_probe_net(O) lttng_probe_napi(O) lttng_probe_module(O) lttng_probe_kmem(O) lttng_probe_jbd2(O) lttng_probe_irq(O) lttng_probe_ext4(O) lttng_probe_compaction(O) lttng_probe_block(O) lttng_types(O) lttng_ring_buffer_metadata_mmap_client(O) lttng_ring_buffer_client_mmap_overwrite(O) lttng_ring_buffer_client_mmap_discard(O) lttng_ring_buffer_metadata_client(O) lttng_ring_buffer_client_overwrite(O) lttng_ring_buffer_client_discard(O) lttng_tracer(O) lttng_statedump(O) lttng_kprobes(O) lttng_lib_ring_buffer(O) lttng_kretprobes(O) virtio_blk virtio_net virtio_pci virtio_ring virtio [last unloaded: lttng_statedump]
> [   53.090220] CPU: 4 PID: 3532 Comm: lttng-consumerd Tainted: G           O    4.1.0+ #10
> [   53.090220] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> [   53.090220] task: ffff880235355aa0 ti: ffff8800bb6d0000 task.ti: ffff8800bb6d0000
> [   53.090220] RIP: 0010:[<ffffffff81316f12>]  [<ffffffff81316f12>] __memcpy+0x12/0x20
> [   53.090220] RSP: 0018:ffff8800bb6d3da0  EFLAGS: 00010206
> [   53.090220] RAX: ffff8802355b3025 RBX: 0000000000000fdb RCX: 00000000000001fb
> [   53.090220] RDX: 0000000000000003 RSI: ffffc9000746e000 RDI: ffff8802355b3025

Okay, RSI is at the start of a page, but isn't even unaligned.  RDI is
unaligned, but that shouldn't matter at all.

So I think the problem is really that you are simply outrunning your
input buffer.

	-hpa
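
The register reading above can be reproduced with a little arithmetic. This is a hedged user-space sketch: the helper names are invented for illustration and a 4096-byte page is assumed. It relies only on what the patch context shows, namely that __memcpy saves the original destination in RAX (movq %rdi, %rax) before rep movsq starts advancing RDI.

```c
#include <stdint.h>

#define PAGE_BYTES 4096ULL /* assumed x86 base page size */

/* Offset of an address within its page; 0 means "start of a page". */
static uint64_t page_offset(uint64_t addr)
{
    return addr & (PAGE_BYTES - 1);
}

/*
 * Bytes written before the fault: RAX holds the destination as it was
 * on entry, RDI the destination as advanced by rep movsq so far.
 */
static uint64_t bytes_copied_before_fault(uint64_t rax, uint64_t rdi)
{
    return rdi - rax;
}
```

For the quoted OOPS, RSI has page offset 0, RDI has page offset 0x25, and RDI still equals RAX, so within this call the very first source read is the one that faulted.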



* Re: [RFC PATCH] Fix: x86 unaligned __memcpy to/from virtual memory
  2015-06-24 18:49   ` Mathieu Desnoyers
  2015-06-24 18:53     ` H. Peter Anvin
@ 2015-06-24 19:15     ` Linus Torvalds
  2015-06-24 23:54       ` Mathieu Desnoyers
  1 sibling, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2015-06-24 19:15 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Linux Kernel Mailing List, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers

On Wed, Jun 24, 2015 at 11:49 AM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Here is the output. I added the printk just after the initial range
> check within vmalloc_fault.

Good. Can you add printk's to the error return paths too, so that we
see which one it is that triggers?

If it is a valid vmalloc address, then vmalloc_fault() _should_ just
fix it up and return 0.  Clearly it doesn't, and hits one of the
"return -1" cases instead.

In particular, that

        pgd_ref = pgd_offset_k(address);

should return the reference page table pointer for init_mm, which is
what vmalloc() itself *should* be populating.

The fact that it sounds like one of the "pud/pmd/pte_none()" checks
for the reference ends up returning true, seems to indicate that the
page tables haven't been filled in even for the reference address.
Which is really really odd.

I'm really inclined to think that it's something in lttng, because
it's so odd. A race with vunmap() on another CPU? How could
vmalloc_fault() not see the reference page table contents?

                 Linus
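
The behaviour described above (fix up from the reference page table and return 0, or return -1 when the reference itself has no mapping) can be modelled in a few lines. This is a deliberately simplified toy, not the kernel's vmalloc_fault(): it uses a single flat table of "present" flags instead of the real multi-level page tables, and every name in it is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_ENTRIES 8

/* One flat "page table": just a present bit per slot. */
struct toy_pgtable {
    bool present[TOY_ENTRIES];
};

/*
 * Toy fault handler: 'ref' plays the role of init_mm's reference page
 * table, 'task' the page table of the faulting context.
 * Returns 0 if the mapping could be copied from the reference,
 * -1 if the reference has no mapping either (the oops case).
 */
static int toy_vmalloc_fault(struct toy_pgtable *task,
                             const struct toy_pgtable *ref, size_t idx)
{
    if (idx >= TOY_ENTRIES)
        return -1;              /* outside the modelled range */
    if (!ref->present[idx])
        return -1;              /* nothing to copy: genuine bad access */
    task->present[idx] = true;  /* lazy sync from the reference */
    return 0;
}
```

In this model, a fault on a slot that was never populated in the reference, or that has already been torn down there, cannot be fixed up, which is the situation the "PTE 0" line in the oops suggests.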


* Re: [RFC PATCH] Fix: x86 unaligned __memcpy to/from virtual memory
  2015-06-24 19:15     ` Linus Torvalds
@ 2015-06-24 23:54       ` Mathieu Desnoyers
  2015-06-25  0:33         ` Mathieu Desnoyers
  2015-06-25  0:37         ` Linus Torvalds
  0 siblings, 2 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2015-06-24 23:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Linux Kernel Mailing List, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers

----- On Jun 24, 2015, at 3:15 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Wed, Jun 24, 2015 at 11:49 AM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> Here is the output. I added the printk just after the initial range
>> check within vmalloc_fault.
> 
> Good. Can you add printk's to the error return paths too, so that we
> see which one it is that triggers?

OK, see below. This time the fault occurred at an unaligned address.
It fails on the !pte_present(*pte_ref) check.

> 
> If it is a valid vmalloc address, then vmalloc_fault() _should_ just
> fix it up and return 0.  Clearly it doesn't, and hits one of the
> "return -1" cases instead.
> 
> In particular, that
> 
>        pgd_ref = pgd_offset_k(address);
> 
> should return the reference page table pointer for init_mm, which is
> what vmalloc() itself *should* be populating.
> 
> The fact that it sounds like one of the "pud/pmd/pte_none()" checks
> for the reference ends up returning true, seems to indicate that the
> page tables haven't been filled in even for the reference address.
> Which is really really odd.
> 
> I'm really inclined to think that it's something in lttng, because
> it's so odd. A race with vunmap() on another CPU? How could
> vmalloc_fault() not see the reference page table contents?

[   69.719102] DEBUG: vmalloc_fault at address 0xffffc900077e0f3b
[   69.720216] DEBUG: !pte_present(*pte_ref) error
[   69.720216] BUG: unable to handle kernel paging request at ffffc900077e0f3b
[   69.720216] IP: [<ffffffff81316f72>] __memcpy+0x12/0x20
[   69.720216] PGD 236c92067 PUD 236c93067 PMD 22cc87067 PTE 0
[   69.720216] Oops: 0000 [#1] SMP 
[   69.720216] Modules linked in: lttng_probe_workqueue(O) lttng_probe_vmscan(O) lttng_probe_udp(O) lttng_probe_timer(O) lttng_probe_sunrpc(O) lttng_probe_statedump(O) lttng_probe_sock(O) lttng_probe_skb(O) lttng_probe_signal(O) lttng_probe_scsi(O) lttng_probe_sched(O) lttng_probe_regmap(O) lttng_probe_rcu(O) lttng_probe_random(O) lttng_probe_power(O) lttng_probe_net(O) lttng_probe_napi(O) lttng_probe_module(O) lttng_probe_kmem(O) lttng_probe_jbd2(O) lttng_probe_irq(O) lttng_probe_ext4(O) lttng_probe_compaction(O) lttng_probe_block(O) lttng_types(O) lttng_ring_buffer_metadata_mmap_client(O) lttng_ring_buffer_client_mmap_overwrite(O) lttng_ring_buffer_client_mmap_discard(O) lttng_ring_buffer_metadata_client(O) lttng_ring_buffer_client_overwrite(O) lttng_ring_buffer_client_discard(O) lttng_tracer(O) lttng_statedump(O) lttng_kprobes(O) lttng_lib_ring_buffer(O) lttng_kretprobes(O) virtio_net virtio_blk virtio_pci virtio_ring virtio [last unloaded: lttng_statedump]
[   69.720216] CPU: 12 PID: 3536 Comm: lttng-consumerd Tainted: G           O    4.1.0+ #11
[   69.720216] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[   69.720216] task: ffff8800bab3b660 ti: ffff88023344c000 task.ti: ffff88023344c000
[   69.720216] RIP: 0010:[<ffffffff81316f72>]  [<ffffffff81316f72>] __memcpy+0x12/0x20
[   69.720216] RSP: 0018:ffff88023344fda0  EFLAGS: 00010246
[   69.720216] RAX: ffff880233135025 RBX: 0000000000000fd8 RCX: 00000000000001fb
[   69.720216] RDX: 0000000000000000 RSI: ffffc900077e0f3b RDI: ffff880233135025
[   69.720216] RBP: ffff88023344fdb8 R08: ffff880037ae3a00 R09: 0000000000005000
[   69.720216] R10: 0000000000000000 R11: 0000000000000246 R12: ffff88023344fdc8
[   69.720216] R13: ffff8800bb33f000 R14: 0000000000000fd8 R15: 0000000000000fd8
[   69.720216] FS:  00007fb2d28ae700(0000) GS:ffff880237380000(0000) knlGS:0000000000000000
[   69.720216] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   69.720216] CR2: ffffc900077e0f3b CR3: 000000023329d000 CR4: 00000000000006e0
[   69.720216] Stack:
[   69.720216]  ffffffffa05a6797 ffff88023527ad00 ffff88023527ad50 ffff88023344fe48
[   69.720216]  ffffffffa046d060 ffff8800bb33f000 0000000000000000 0000000000000fd8
[   69.720216]  ffffffff00000001 ffff880037ae3a00 0000000000000fd8 0000000000005025
[   69.720216] Call Trace:
[   69.720216]  [<ffffffffa05a6797>] ? lttng_event_write+0x87/0xb0 [lttng_ring_buffer_metadata_client]
[   69.720216]  [<ffffffffa046d060>] lttng_metadata_output_channel+0xd0/0x120 [lttng_tracer]
[   69.720216]  [<ffffffffa046f5f9>] lttng_metadata_ring_buffer_ioctl+0x79/0xd0 [lttng_tracer]
[   69.720216]  [<ffffffff8117ba70>] do_vfs_ioctl+0x2e0/0x4e0
[   69.720216]  [<ffffffff812b3627>] ? file_has_perm+0x87/0xa0
[   69.720216]  [<ffffffff8117bcf1>] SyS_ioctl+0x81/0xa0
[   69.720216]  [<ffffffff810115d1>] ? syscall_trace_leave+0xd1/0xe0
[   69.720216]  [<ffffffff818bbdb7>] tracesys_phase2+0x84/0x89
[   69.720216] Code: 5b 5d c3 66 0f 1f 44 00 00 e8 6b fc ff ff eb e1 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 
[   69.720216] RIP  [<ffffffff81316f72>] __memcpy+0x12/0x20
[   69.720216]  RSP <ffff88023344fda0>
[   69.720216] CR2: ffffc900077e0f3b
[   69.720216] ---[ end trace c35549bf8e0386a2 ]---


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH] Fix: x86 unaligned __memcpy to/from virtual memory
  2015-06-24 23:54       ` Mathieu Desnoyers
@ 2015-06-25  0:33         ` Mathieu Desnoyers
  2015-06-25  0:37         ` Linus Torvalds
  1 sibling, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2015-06-25  0:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Linux Kernel Mailing List, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers

----- On Jun 24, 2015, at 7:54 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
> ----- On Jun 24, 2015, at 3:15 PM, Linus Torvalds torvalds@linux-foundation.org
> wrote:
> 
>> On Wed, Jun 24, 2015 at 11:49 AM, Mathieu Desnoyers
>> <mathieu.desnoyers@efficios.com> wrote:
>>>
>>> Here is the output. I added the printk just after the initial range
>>> check within vmalloc_fault.
>> 
>> Good. Can you add printk's to the error return paths too, so that we
>> see which one it is that triggers?
> 
> OK, see below. This time the fault occurred at an unaligned address.
> It fails on the !pte_present(*pte_ref) check.

I just tried to do a bytewise copy in C rather than call
memcpy, and I got the fault to trigger. So I guess I was on
the wrong track assuming __memcpy would be the culprit.
What is odd is that if I issue vmalloc_sync_all() after each
vmalloc call, the OOPS never triggers. It is clearly a test
case that ends up stressing vfree/vmalloc.

[   34.751984] DEBUG: vmalloc_fault at address 0xffffc90007290000
[   34.753188] DEBUG: !pte_present(*pte_ref) error
[   34.753188] BUG: unable to handle kernel paging request at ffffc90007290000
[   34.753188] IP: [<ffffffffa05c77c0>] lttng_event_write+0x90/0xd0 [lttng_ring_buffer_metadata_client]
[   34.753188] PGD 236c92067 PUD 236c93067 PMD b6964067 PTE 0
[   34.753188] Oops: 0000 [#1] SMP 
[   34.753188] Modules linked in: lttng_probe_workqueue(O) lttng_probe_vmscan(O) lttng_probe_udp(O) lttng_probe_timer(O) lttng_probe_sunrpc(O) lttng_probe_statedump(O) lttng_probe_sock(O) lttng_probe_skb(O) lttng_probe_signal(O) lttng_probe_scsi(O) lttng_probe_sched(O) lttng_probe_regmap(O) lttng_probe_rcu(O) lttng_probe_random(O) lttng_probe_power(O) lttng_probe_net(O) lttng_probe_napi(O) lttng_probe_module(O) lttng_probe_kmem(O) lttng_probe_jbd2(O) lttng_probe_irq(O) lttng_probe_ext4(O) lttng_probe_compaction(O) lttng_probe_block(O) lttng_types(O) lttng_ring_buffer_metadata_mmap_client(O) lttng_ring_buffer_client_mmap_overwrite(O) lttng_ring_buffer_client_mmap_discard(O) lttng_ring_buffer_metadata_client(O) lttng_ring_buffer_client_overwrite(O) lttng_ring_buffer_client_discard(O) lttng_tracer(O) lttng_statedump(O) lttng_kprobes(O) lttng_lib_ring_buffer(O) lttng_kretprobes(O) virtio_blk virtio_net virtio_pci virtio_ring virtio [last unloaded: lttng_statedump]
[   34.753188] CPU: 26 PID: 3563 Comm: lttng-consumerd Tainted: G           O    4.1.0+ #11
[   34.753188] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[   34.753188] task: ffff880234d94880 ti: ffff88022af6c000 task.ti: ffff88022af6c000
[   34.753188] RIP: 0010:[<ffffffffa05c77c0>]  [<ffffffffa05c77c0>] lttng_event_write+0x90/0xd0 [lttng_ring_buffer_metadata_client]
[   34.753188] RSP: 0018:ffff88022af6fda8  EFLAGS: 00010212
[   34.753188] RAX: 000000000000009d RBX: 0000000000000fd8 RCX: 0000000000000025
[   34.753188] RDX: ffff8800b7681120 RSI: ffffc9000728ff63 RDI: 0000000000000000
[   34.753188] RBP: ffff88022af6fdb8 R08: 000000000000009d R09: ffff88022ea33025
[   34.753188] R10: 000000000000003b R11: 0000000000000246 R12: ffff88022af6fdc8
[   34.753188] R13: ffff880231565c00 R14: 0000000000000fd8 R15: 0000000000000fd8
[   34.753188] FS:  00007fd64b5f2700(0000) GS:ffff880237540000(0000) knlGS:0000000000000000
[   34.753188] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   34.753188] CR2: ffffc90007290000 CR3: 0000000233803000 CR4: 00000000000006e0
[   34.753188] Stack:
[   34.753188]  ffff880234cbff00 ffff880234cbff50 ffff88022af6fe48 ffffffffa048e060
[   34.753188]  ffff880231565c00 0000000000000000 0000000000000fd8 ffffffff00000001
[   34.753188]  ffff88023155d000 0000000000000fd8 0000000000004025 0000000000004025
[   34.753188] Call Trace:
[   34.753188]  [<ffffffffa048e060>] lttng_metadata_output_channel+0xd0/0x120 [lttng_tracer]
[   34.753188]  [<ffffffffa04905f9>] lttng_metadata_ring_buffer_ioctl+0x79/0xd0 [lttng_tracer]
[   34.753188]  [<ffffffff8117ba70>] do_vfs_ioctl+0x2e0/0x4e0
[   34.753188]  [<ffffffff812b3627>] ? file_has_perm+0x87/0xa0
[   34.753188]  [<ffffffff8117bcf1>] SyS_ioctl+0x81/0xa0
[   34.753188]  [<ffffffff810115d1>] ? syscall_trace_leave+0xd1/0xe0
[   34.753188]  [<ffffffff818bbdb7>] tracesys_phase2+0x84/0x89
[   34.753188] Code: d9 48 0f 47 cb 48 39 cb 75 46 48 8d 57 02 25 ff 0f 00 00 45 31 c0 48 89 c1 31 c0 48 c1 e2 04 4c 01 ca 66 0f 1f 84 00 00 00 00 00 <44> 0f b6 14 06 49 89 c9 4c 03 0a 41 83 c0 01 45 88 14 01 49 63 
[   34.753188] RIP  [<ffffffffa05c77c0>] lttng_event_write+0x90/0xd0 [lttng_ring_buffer_metadata_client]
[   34.753188]  RSP <ffff88022af6fda8>
[   34.753188] CR2: ffffc90007290000
[   34.753188] ---[ end trace 28951381246c3a2e ]---
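
The register dump above lines up with a bytewise loop walking off the end of a mapped page. The sketch below is plain user-space arithmetic with an invented helper name. It assumes a 4096-byte page and that RSI holds the source base and RAX the loop index, which is an inference from the fact that RSI + RAX equals CR2, not something stated in the thread.

```c
#include <stdint.h>

#define PAGE_BYTES 4096ULL /* assumed x86 base page size */

/* Number of bytes from 'addr' up to (not including) the next page start. */
static uint64_t bytes_until_next_page(uint64_t addr)
{
    return PAGE_BYTES - (addr & (PAGE_BYTES - 1));
}
```

With RSI = ffffc9000728ff63 this gives 0x9d, which matches RAX, and RSI + 0x9d is ffffc90007290000, the faulting address in CR2. Under the stated assumption, the loop read the rest of one page without trouble and faulted on the first byte of the next one.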


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH] Fix: x86 unaligned __memcpy to/from virtual memory
  2015-06-24 23:54       ` Mathieu Desnoyers
  2015-06-25  0:33         ` Mathieu Desnoyers
@ 2015-06-25  0:37         ` Linus Torvalds
  2015-06-25 12:58           ` Mathieu Desnoyers
  1 sibling, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2015-06-25  0:37 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Linux Kernel Mailing List, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers

On Wed, Jun 24, 2015 at 4:54 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> OK, see below. This time the fault occurred at an unaligned address.
> It fails on the !pte_present(*pte_ref) check.

So every time, %rcx is 0x001fb.

Once, your rdx value (which is remaining bytes after the movsq) was 3,
the other two times it's 0.

What's so magical about that 4056-byte copy (+3 bytes once)? Are you
*sure* that copy is valid?

                   Linus
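
The numbers above follow from how the rep movsq path splits the length, as visible in the Code: bytes of the oops (shift the count right by 3 for the quadword loop, mask with 7 for the byte tail). A small user-space sketch, with a made-up struct and function name:

```c
#include <stdint.h>

/* How a byte count is divided between rep movsq and the rep movsb tail. */
struct movsq_split {
    uint64_t quads; /* value loaded into RCX for rep movsq */
    uint64_t tail;  /* value left in RDX for the trailing rep movsb */
};

static struct movsq_split split_len(uint64_t len)
{
    struct movsq_split s;

    s.quads = len >> 3; /* number of whole 8-byte words */
    s.tail  = len & 7;  /* 0..7 leftover bytes */
    return s;
}
```

A length of 0xfd8 (4056) yields RCX = 0x1fb and RDX = 0, and 0xfdb (4059) yields RCX = 0x1fb and RDX = 3. Those lengths match the RBX/R14/R15 values in the respective dumps, assuming those registers carry the caller's length.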


* Re: [RFC PATCH] Fix: x86 unaligned __memcpy to/from virtual memory
  2015-06-25  0:37         ` Linus Torvalds
@ 2015-06-25 12:58           ` Mathieu Desnoyers
  0 siblings, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2015-06-25 12:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Linux Kernel Mailing List, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers

----- On Jun 24, 2015, at 8:37 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Wed, Jun 24, 2015 at 4:54 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> OK, see below. This time the fault occurred at an unaligned address.
>> It fails on the !pte_present(*pte_ref) check.
> 
> So every time, %rcx is 0x001fb.
> 
> Once, your rdx value (which is remaining bytes after the movsq) was 3,
> the other two times it's 0.
> 
> What's so magical about that 4056-byte copy (+3 bytes once)? Are you
> *sure* that copy is valid?

*grumble* after another round of inspection, it appears that the cause
is a missing lock in lttng-modules metadata handling. The race never
triggered any safety net until I tried to move to vmalloc.

The updater was just pulling the carpet from under the feet of the
reader when doing the reallocation.

The fact that the OOPS disappeared with different CPU configurations
and when calling vmalloc_sync_all() after vmalloc() was just due to
timing. Sorry for the noise!

Thanks,

Mathieu
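
To illustrate the class of bug described above (a reader copying out of a buffer while an updater reallocates it), here is a minimal user-space sketch of the general remedy: take the same lock around both the reallocation and the copy. It is not the lttng-modules fix, which is not shown in this thread. All names are hypothetical, and a C11 atomic_flag spinlock stands in for whatever lock the real code would use.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Growable string shared between one updater and one or more readers. */
struct metadata_buf {
    atomic_flag lock; /* stand-in for a kernel mutex */
    char *data;
    size_t len;
};

static void buf_lock(struct metadata_buf *b)
{
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void buf_unlock(struct metadata_buf *b)
{
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

/* Updater: grow the buffer. realloc() may move it, so hold the lock. */
static int buf_append(struct metadata_buf *b, const char *s, size_t n)
{
    int ret = -1;

    buf_lock(b);
    char *p = realloc(b->data, b->len + n);
    if (p) {
        memcpy(p + b->len, s, n);
        b->data = p;
        b->len += n;
        ret = 0;
    }
    buf_unlock(b);
    return ret;
}

/* Reader: copy one piece out under the same lock; returns bytes copied. */
static size_t buf_read(struct metadata_buf *b, size_t off, char *out, size_t n)
{
    size_t copied = 0;

    buf_lock(b);
    if (off < b->len) {
        copied = (b->len - off < n) ? b->len - off : n;
        memcpy(out, b->data + off, copied);
    }
    buf_unlock(b);
    return copied;
}
```

Because the reader dereferences b->data only while holding the lock, it can never be left holding a pointer to memory the updater has just freed or moved.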

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

