kvm.vger.kernel.org archive mirror
* Advice on oops - memory trap on non-memory access instruction (invalid CR2?)
@ 2019-10-14  3:32 Guilherme G. Piccoli
From: Guilherme G. Piccoli @ 2019-10-14  3:32 UTC
  To: kvm, linux-acpi, linux-mm, platform-driver-x86, x86, iommu
  Cc: gpiccoli, Guilherme G. Piccoli, gavin.guo, halves,
	ioanna-maria.alifieraki, jay.vosburgh, mfo

Hello kernel community, I'm investigating a recurring problem and am
seeking some advice - perhaps somebody reading this has seen a similar
issue. I've CC'd the mailing lists I thought would be interested;
apologies if I missed any or if I shouldn't have included some.

We have a kernel memory oops due to invalid read/write, but the trap
happens in a non-memory access instruction.

Example in [0] below. We can see a read access to offset 0x458,
apparently while KVM was sending an IPI. However, the "Code" line (and
RIP analysis with objdump on the vmlinux image) shows the trapping
instruction as:

2b:*84 c0 test %al,%al <-- trapping instruction

This instruction clearly shouldn't trap on an invalid memory access.
Also, the 0x458 offset does not seem to be present in the code, based on
the assembly analysis in [1]. We have had 3 or 4 more reports like this:
some show an invalid address on write (again #PF), some a #GP - in all
of them, the trapping insn is a non-memory-related opcode.
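
As a cross-check on the objdump result, the offset of the marked byte
can be recomputed from the "Code:" line of the oops in [0]. This is only
a rough Python sketch: the helper name split_code_line is made up for
illustration, and reading e8 as a call opcode assumes the dump is
aligned on an instruction boundary at that point.

```python
# "Code:" line copied from the oops in [0]; <..> marks the byte at RIP.
CODE = (
    "d4 ff ff ff ff 75 0d 81 7a 10 ff 00 00 00 0f 84 7d 01 00 00 4c 8b "
    "45 c0 48 8b 75 c8 48 8d 4d d4 4c 89 fa 4c 89 f7 e8 ca be ff ff <84> c0 "
    "0f 85 0c 01 00 00 41 8b 86 f0 09 00 00 85 c0 0f 8e fd 00"
)


def split_code_line(code):
    """Hypothetical helper: split an oops Code: dump at the <..> marker.

    Returns (bytes before RIP, bytes from RIP onward).
    """
    tokens = code.split()
    marked = [i for i, tok in enumerate(tokens) if tok.startswith("<")]
    if len(marked) != 1:
        raise ValueError("expected exactly one <..> marker")
    raw = bytes(int(tok.strip("<>"), 16) for tok in tokens)
    return raw[:marked[0]], raw[marked[0]:]


before, at_rip = split_code_line(CODE)

rip_offset = len(before)           # offset of the marked byte in the dump
insn_at_rip = at_rip[:2]           # 84 c0 is the encoding of test %al,%al
# Assumption: the 5 bytes before RIP form one instruction (e8 + rel32).
prev_is_call = before[-5] == 0xE8

print(f"offset of marked byte: {rip_offset:#x}")
print("bytes at RIP:", " ".join(f"{b:02x}" for b in insn_at_rip))
print("preceded by e8 + rel32 (near call):", prev_is_call)
```

The computed offset agrees with the "2b:" shown by objdump, and the
bytes just before RIP look like a near call, so RIP would be the return
address of that call.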

We understand x86 has (or should have) precise exceptions, so our
current hypotheses are:

(a) Invalid CR2 - perhaps due to a System Management Interrupt, firmware
code executed and caused an invalid memory access, polluting CR2.

(b) Error in processor - there are some errata on Xeon processors, which
Intel claims were never observed in commercial systems.

(c) Error in kernel reporting when the oops happens - though we
investigated this in depth, and the exception handlers are quite concise
assembly routines that just stack the processor-generated data.

(d) Some KVM/vAPIC-related failure that may be caused by a bad guest
access to the memory-mapped APIC area during interrupt virtualization.

(e) Intel processors do not provide precise interrupts.

All of them are unlikely - maybe I'm not seeing something obvious, hence
this advice request. Below there's a more detailed analysis of the
registers of the aforementioned oops splat [2].

We are aware that this kernel version is old; unfortunately the user
reporting this issue is unable to update right now. Any
direction/suggestion/advice on how to obtain more data or prove/disprove
some of our hypotheses is highly appreciated. Questions are welcome too;
feel free to respond with any ideas you might have.

Thanks,


Guilherme
--


[0]
BUG: unable to handle kernel NULL pointer dereference at 0000000000000458
IP: [<ffffffffc079baf6>] kvm_irq_delivery_to_apic+0x56/0x220 [kvm]
PGD 0
Oops: 0000 [#1] SMP
Modules linked in: <...>
CPU: 40 PID: 78274 Comm: qemu-system-x86 Tainted: P W  OE
4.4.0-45-generic #66~14.04.1-Ubuntu
Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.1.7 06/16/2016
task: ffff8800594dd280 ti: ffff880169168000 task.ti: ffff880169168000
RIP: 0010:[<ffffffffc079baf6>]  [<ffffffffc079baf6>]
kvm_irq_delivery_to_apic+0x56/0x220 [kvm]
RSP: 0018:ffff88016916bbe8  EFLAGS: 00010282
RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000003
RDX: 0000000000000040 RSI: 0000000000000010 RDI: ffff88016916bba8
RBP: ffff88016916bc30 R08: 0000000000000004 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000008fd
R13: 0000000000000004 R14: ffff88004d3e8000 R15: ffff88016916bc40
FS:  00007fbd67fff700(0000) GS:ffff881ffeb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000458 CR3: 00000001961a9000 CR4: 00000000003426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 0000000000000001 0000000000000000 ffff882194b81400 0000000194b81410
 0000000000000300 00000000000008fd 0000000000000004 ffff882194b81400
 0000000000000001 ffff88016916bc78 ffffffffc0796d20 08000000000000fd
Call Trace:
 [<addr>] apic_reg_write+0x110/0x5f0 [kvm]
 [<addr>] kvm_apic_write_nodecode+0x4b/0x60 [kvm]
 [<addr>] handle_apic_write+0x1e/0x30 [kvm_intel]
 [<addr>] vmx_handle_exit+0x288/0xbf0 [kvm_intel]
 [<addr>] vcpu_enter_guest+0x8b4/0x10a0 [kvm]
 [<addr>] ? kvm_vcpu_block+0x191/0x2d0 [kvm]
 [<addr>] ? prepare_to_wait_event+0xf0/0xf0
 [<addr>] kvm_arch_vcpu_ioctl_run+0xc4/0x3d0 [kvm]
 [<addr>] kvm_vcpu_ioctl+0x2ab/0x640 [kvm]
 [<addr>] do_vfs_ioctl+0x2dd/0x4c0
 [<addr>] ? __audit_syscall_entry+0xaf/0x100
 [<addr>] ? do_audit_syscall_entry+0x66/0x70
 [<addr>] SyS_ioctl+0x79/0x90
 [<addr>] entry_SYSCALL_64_fastpath+0x16/0x75
Code: d4 ff ff ff ff 75 0d 81 7a 10 ff 00 00 00 0f 84 7d 01 00 00 4c 8b
45 c0 48 8b 75 c8 48 8d 4d d4 4c 89 fa 4c 89 f7 e8 ca be ff ff <84> c0
0f 85 0c 01 00 00 41 8b 86 f0 09 00 00 85 c0 0f 8e fd 00
RIP  [<ffffffffc079baf6>] kvm_irq_delivery_to_apic+0x56/0x220 [kvm]
RSP <ffff88016916bbe8> CR2: 0000000000000458
--
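
The "Oops: 0000" value in the splat above is the hardware page-fault
error code in hex, which is where the "read access" reading comes from.
A minimal Python sketch of the decoding follows; the table only covers
the low five architectural bits, and the function name decode_pf_error
is made up for illustration.

```python
# Low bits of the x86 page-fault error code pushed by the CPU on #PF.
# Each entry: (bit, meaning if set, meaning if clear or None).
PF_BITS = [
    (0, "protection violation", "page not present"),
    (1, "write", "read"),
    (2, "user mode", "kernel mode"),
    (3, "reserved bit set in paging structure", None),
    (4, "instruction fetch", None),
]


def decode_pf_error(code):
    """Hypothetical helper: describe a #PF error code as a list of strings."""
    out = []
    for bit, if_set, if_clear in PF_BITS:
        text = if_set if code & (1 << bit) else if_clear
        if text is not None:
            out.append(text)
    return out


# Error code from the "Oops: 0000 [#1] SMP" line.
print(decode_pf_error(0x0000))
```

For 0x0000 this gives a kernel-mode read of a not-present page, which is
consistent with the "NULL pointer dereference" wording in the first line
of the splat.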


[1] Assembly analysis: https://pastebin.ubuntu.com/p/hdHNmvFtd8/
--


[2] More detailed analysis of registers:

%rax = 1 [return from kvm_irq_delivery_to_apic_fast()]

%rbx = 0x300 [ICR_LO register - this value comes from
kvm_apic_write_nodecode(), in which the offset / register is assigned to
%ebx]

%rdi = &bitmap
%rsi = 16 (0x10) from "for_each_set_bit(i, &bitmap, 16)" in function
kvm_irq_delivery_to_apic_fast().

%rcx = i in above loop
%rdx = 64 (0x40 - BITS_PER_LONG, set inside find_next_bit() in the above
loop)

%r8 = 4 ->  accumulates the return of kvm_apic_set_irq() - it means 4
IRQs were delivered successfully. It could have been zeroed in the
process, and IRQs that were discarded don't accumulate here, so the
value doesn't say much.

%r14 = (struct kvm*) apic->vcpu->kvm
%r15 = (kvm_lapic_irq*) irq [stack-like addr, as it came from
apic_send_ipi(), in which irq is declared in stack - from the stack
dump, it is 0xffffffffc0796d20]

%r12 = apic->regs[ICR_LO] -> important register, describes the IPI data;
value of 0x8fd means:

bits 0-7 (vector): 253
bits 8-10 (delivery mode): 0 -> fixed
bit 11 (destination mode): 1 -> logical
bit 12 (delivery status): 0 -> idle
bit 14 (level): 0 -> De-assert [oddity: Intel SDM vol 3 (10.6.1) claims
this should be 1 in Xeon processors]
bit 15 (trigger mode): 0 -> Edge
bits 18-19 (shorthand): No
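
The bit breakdown above can be reproduced mechanically. This is just a
Python sketch using the field positions from the list; the function and
key names are made up for illustration and are not the kernel's own.

```python
def decode_icr_low(value):
    """Hypothetical helper: split the low ICR dword into its fields."""
    return {
        "vector": value & 0xFF,                   # bits 0-7
        "delivery_mode": (value >> 8) & 0x7,      # bits 8-10, 0 = fixed
        "dest_mode": (value >> 11) & 0x1,         # bit 11, 1 = logical
        "delivery_status": (value >> 12) & 0x1,   # bit 12, 0 = idle
        "level": (value >> 14) & 0x1,             # bit 14, 0 = de-assert
        "trigger_mode": (value >> 15) & 0x1,      # bit 15, 0 = edge
        "shorthand": (value >> 18) & 0x3,         # bits 18-19, 0 = none
    }


# Value of %r12 in the oops.
fields = decode_icr_low(0x8FD)
for name, val in fields.items():
    print(f"{name:16} = {val}")
```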

%r13 = irq.dest_id == apic->regs[ICR_HI] / some transformation of this
register <it's a xapic system, not x2apic>


* Re: Advice on oops - memory trap on non-memory access instruction (invalid CR2?)
@ 2019-10-14 14:10 ` Thomas Gleixner
From: Thomas Gleixner @ 2019-10-14 14:10 UTC
  To: Guilherme G. Piccoli
  Cc: kvm, linux-acpi, linux-mm, platform-driver-x86, x86, iommu,
	Guilherme G. Piccoli, gavin.guo, halves, ioanna-maria.alifieraki,
	jay.vosburgh, mfo

On Mon, 14 Oct 2019, Guilherme G. Piccoli wrote:
> Modules linked in: <...>
> CPU: 40 PID: 78274 Comm: qemu-system-x86 Tainted: P W  OE

Tainted: P     - Proprietary module loaded ...

Try again without that module

Tainted: W     - Warning issued before

Are you sure that that warning is harmless and unrelated?
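
For reference, the remaining letters in that taint string can be spelled
out the same way. This is a small Python sketch covering only the four
flags that appear in this oops; the table and the name explain_taint are
made up for illustration and do not cover every taint flag.

```python
# Meanings of the taint letters seen in "Tainted: P W  OE".
TAINT = {
    "P": "proprietary module was loaded",
    "W": "a kernel warning was issued earlier",
    "O": "out-of-tree module was loaded",
    "E": "unsigned module was loaded",
}


def explain_taint(flags):
    """Hypothetical helper: map each non-blank taint letter to its meaning."""
    return [
        (c, TAINT.get(c, "not covered by this sketch"))
        for c in flags
        if not c.isspace()
    ]


for flag, meaning in explain_taint("P W  OE"):
    print(f"{flag}: {meaning}")
```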

> 4.4.0-45-generic #66~14.04.1-Ubuntu

Does the same problem happen with a not so dead kernel? CR2 handling got
quite some updates/fixes since then.

Thanks,

	tglx




* Re: Advice on oops - memory trap on non-memory access instruction (invalid CR2?)
@ 2019-10-15 15:21   ` Guilherme G. Piccoli
From: Guilherme G. Piccoli @ 2019-10-15 15:21 UTC
  To: Thomas Gleixner
  Cc: kvm, linux-acpi, linux-mm, platform-driver-x86, x86, iommu,
	Guilherme G. Piccoli, gavin.guo, halves, ioanna-maria.alifieraki,
	jay.vosburgh, mfo

On 14/10/2019 11:10, Thomas Gleixner wrote:
> On Mon, 14 Oct 2019, Guilherme G. Piccoli wrote:
>> Modules linked in: <...>
>> CPU: 40 PID: 78274 Comm: qemu-system-x86 Tainted: P W  OE
> 
> Tainted: P     - Proprietary module loaded ...
> 
> Try again without that module

Thanks, Thomas, for the prompt response. This is some ScaleIO stuff; I
guess it's part of the customer setup, and I agree it would be better
not to have this kind of module loaded. Still, the analysis of the oops
shows a quite odd situation, and we'd like to have at least a strong
clue before saying the ScaleIO stuff is the culprit.

> 
> Tainted: W     - Warning issued before
> 
> Are you sure that that warning is harmless and unrelated?
> 

Sorry I didn't mention that before; the warning is:

[5946866.593060] WARNING: CPU: 42 PID: 173056 at
/build/linux-lts-xenial-80t3lB/linux-lts-xenial-4.4.0/arch/x86/events/intel/core.c:1868
intel_pmu_handle_irq+0x2d4/0x470()
[5946866.593061] perfevents: irq loop stuck!

It happened ~700 days before the oops (yeah, the uptime is quite large,
about 900 days when the oops happened heh).


>> 4.4.0-45-generic #66~14.04.1-Ubuntu
> 
> Does the same problem happen with a not so dead kernel? CR2 handling got
> quite some updates/fixes since then.

Unfortunately we have no way to test that for now, but your comment is
quite interesting - we can take a look at the CR2 fixes since v4.4.

But what do you think about getting a #PF while the instruction pointed
to by the oops Code section (and the RIP address) is not a
memory-related insn?

Thanks,


Guilherme
> 
> Thanks,
> 
> 	tglx
> 
> 
