xen-devel.lists.xenproject.org archive mirror
* regression in recent pvops kernels, dom0 crashes early
@ 2021-05-13 10:24 Olaf Hering
  2021-05-13 12:11 ` Andrew Cooper
  2021-05-17 10:54 ` Jan Beulich
  0 siblings, 2 replies; 13+ messages in thread
From: Olaf Hering @ 2021-05-13 10:24 UTC (permalink / raw)
  To: xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 1173 bytes --]

Recent pvops dom0 kernels fail to boot on this particular ProLiant BL465c G5 box.
It happens to work with every Xen and a 4.4 based sle12sp3 kernel, but fails with every Xen and a 4.12 based sle12sp4 (and every newer) kernel.

Any idea what is going on?

....
(XEN) Freed 256kB init memory.
(XEN) mm.c:1758:d0 Bad L1 flags 800000
(XEN) traps.c:458:d0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
(XEN) domain_crash_sync called from entry.S: fault at ffff82d08022a2a0 create_bounce_frame+0x133/0x143
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.4.20170405T152638.6bf0560e12-9.xen44  x86_64  debug=y  Not tainted ]----
....

....
(XEN) Freed 656kB init memory
(XEN) mm.c:2165:d0v0 Bad L1 flags 800000
(XEN) d0v0 Unhandled invalid opcode fault/trap [#6, ec=ffffffff]
(XEN) domain_crash_sync called from entry.S: fault at ffff82d04031a016 x86_64/entry.S#create_bounce_frame+0x15d/0x177
(XEN) Domain 0 (vcpu#0) crashed on cpu#5:
(XEN) ----[ Xen-4.15.20210504T145803.280d472f4f-6.xen415  x86_64  debug=y  Not tainted ]----
....

I can probably cycle through all kernels between 4.4 and 4.12 to see where it broke.


Olaf

[-- Attachment #1.2: xen_404.sle12sp4.txt.gz --]
[-- Type: application/gzip, Size: 3742 bytes --]

[-- Attachment #1.3: xen_415.tw.txt.gz --]
[-- Type: application/gzip, Size: 4191 bytes --]


* Re: regression in recent pvops kernels, dom0 crashes early
  2021-05-13 10:24 regression in recent pvops kernels, dom0 crashes early Olaf Hering
@ 2021-05-13 12:11 ` Andrew Cooper
  2021-05-13 12:22   ` Olaf Hering
  2021-05-13 12:29   ` Olaf Hering
  2021-05-17 10:54 ` Jan Beulich
  1 sibling, 2 replies; 13+ messages in thread
From: Andrew Cooper @ 2021-05-13 12:11 UTC (permalink / raw)
  To: Olaf Hering, xen-devel

On 13/05/2021 11:24, Olaf Hering wrote:
> Recent pvops dom0 kernels fail to boot on this particular ProLiant BL465c G5 box.
> It happens to work with every Xen and a 4.4 based sle12sp3 kernel, but fails with every Xen and a 4.12 based sle12sp4 (and every newer) kernel.
>
> Any idea what is going on?
>
> ....
> (XEN) Freed 256kB init memory.
> (XEN) mm.c:1758:d0 Bad L1 flags 800000
> (XEN) traps.c:458:d0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
> (XEN) domain_crash_sync called from entry.S: fault at ffff82d08022a2a0 create_bounce_frame+0x133/0x143
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> (XEN) ----[ Xen-4.4.20170405T152638.6bf0560e12-9.xen44  x86_64  debug=y  Not tainted ]----
> ....
>
> ....
> (XEN) Freed 656kB init memory
> (XEN) mm.c:2165:d0v0 Bad L1 flags 800000
> (XEN) d0v0 Unhandled invalid opcode fault/trap [#6, ec=ffffffff]
> (XEN) domain_crash_sync called from entry.S: fault at ffff82d04031a016 x86_64/entry.S#create_bounce_frame+0x15d/0x177
> (XEN) Domain 0 (vcpu#0) crashed on cpu#5:
> (XEN) ----[ Xen-4.15.20210504T145803.280d472f4f-6.xen415  x86_64  debug=y  Not tainted ]----
> ....
>
> I can probably cycle through all kernels between 4.4 and 4.12 to see where it broke.

"Unhandled invalid opcode fault/trap" is "Xen tried to raise #UD with
the guest, and it hasn't set up a handler yet".  The Bad L1 flags
earlier means there was an attempted edit to a pagetable which was
rejected by Xen.

These two things aren't obviously related by a single action in Xen, so
I expect the pagetable modification failed, and the guest fell into a
bad error path.


If I'm counting bits correctly, that is Xen rejecting the use of the NX
bit, which is suspicious.  Do you have the full Xen boot log on this
box?  I wonder if we've some problem clobbering the XD-disable bit.
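
For anyone wanting to redo the bit counting: a minimal sketch, assuming
Xen reports L1 flags as a 24-bit value in which the low 12 bits are PTE
bits 0-11 and the high 12 bits are PTE bits 52-63. Under that assumption
0x800000 maps to PTE bit 63, which is NX. The helper names are made up
for illustration and are not Xen functions.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed layout (not taken from the thread): flag bits 0-11 are PTE
 * bits 0-11, flag bits 12-23 are PTE bits 52-63.
 */
static uint64_t flags_to_pte_mask(uint32_t flags)
{
    /* Shift the upper 12 flag bits up by 40 so bit 23 lands on bit 63. */
    return ((uint64_t)(flags & ~0xFFFu) << 40) | (flags & 0xFFFu);
}

/* True if the 24-bit flags value has the bit that corresponds to PTE bit 63 (NX). */
static bool flags_have_nx(uint32_t flags)
{
    return (flags_to_pte_mask(flags) >> 63) & 1u;
}
```

With the value from the log, flags_have_nx(0x800000) is true, while a
low flag such as 0x001 (present) is not affected by the shift.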

~Andrew




* Re: regression in recent pvops kernels, dom0 crashes early
  2021-05-13 12:11 ` Andrew Cooper
@ 2021-05-13 12:22   ` Olaf Hering
  2021-05-13 12:29     ` Andrew Cooper
  2021-05-13 12:29   ` Olaf Hering
  1 sibling, 1 reply; 13+ messages in thread
From: Olaf Hering @ 2021-05-13 12:22 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 402 bytes --]

On Thu, 13 May 2021 13:11:10 +0100, Andrew Cooper <andrew.cooper3@citrix.com> wrote:

> If I'm counting bits correctly, that is Xen rejecting the use of the NX
> bit, which is suspicious.  Do you have the full Xen boot log on this
> box?  I wonder if we've some problem clobbering the XD-disable bit.


Yes, it was attached.
Is there any other Xen cmdline knob to enable more debug?

Olaf


* Re: regression in recent pvops kernels, dom0 crashes early
  2021-05-13 12:22   ` Olaf Hering
@ 2021-05-13 12:29     ` Andrew Cooper
  2021-05-13 12:31       ` Olaf Hering
  2021-05-13 13:00       ` Olaf Hering
  0 siblings, 2 replies; 13+ messages in thread
From: Andrew Cooper @ 2021-05-13 12:29 UTC (permalink / raw)
  To: Olaf Hering; +Cc: xen-devel

On 13/05/2021 13:22, Olaf Hering wrote:
> On Thu, 13 May 2021 13:11:10 +0100, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
>> If I'm counting bits correctly, that is Xen rejecting the use of the NX
>> bit, which is suspicious.  Do you have the full Xen boot log on this
>> box?  I wonder if we've some problem clobbering the XD-disable bit.
>
> Yes, it was attached.
> Is there any other Xen cmdline knob to enable more debug?

Urgh sorry - I've not had enough coffee yet today.

Warning: NX (Execute Disable) protection not active

And this is an AMD box not an Intel box, so no XD-disable nonsense (that
I'm aware of).

So, the two options are:
1) This box legitimately doesn't have NX, and the dom0 kernel is buggy
for trying to use it.
2) This box does actually have NX, Xen has failed to turn it on, and
dom0 (through non CPUID means) thinks that NX is usable.

Can we first establish whether this box really does, or does not, have NX?

~Andrew




* Re: regression in recent pvops kernels, dom0 crashes early
  2021-05-13 12:11 ` Andrew Cooper
  2021-05-13 12:22   ` Olaf Hering
@ 2021-05-13 12:29   ` Olaf Hering
  1 sibling, 0 replies; 13+ messages in thread
From: Olaf Hering @ 2021-05-13 12:29 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1081 bytes --]

On Thu, 13 May 2021 13:11:10 +0100, Andrew Cooper <andrew.cooper3@citrix.com> wrote:

> If I'm counting bits correctly, that is Xen rejecting the use of the NX
> bit, which is suspicious.

I tried 'dom0=pvh,debug':

...
(XEN) mcheck_poll: Machine check polling timer started.
(XEN) Running stub recovery selftests...
(XEN) Fixup #UD[0000]: ffff82d07fffe040 [ffff82d07fffe040] -> ffff82d040394a17
(XEN) Fixup #GP[0000]: ffff82d07fffe041 [ffff82d07fffe041] -> ffff82d040394a17
(XEN) Fixup #SS[0000]: ffff82d07fffe040 [ffff82d07fffe040] -> ffff82d040394a17
(XEN) Fixup #BP[0000]: ffff82d07fffe041 [ffff82d07fffe041] -> ffff82d040394a17
(XEN) HPET: 0 timers usable for broadcast (3 total)
(XEN) Warning: NX (Execute Disable) protection not active
(XEN) Dom0 has maximum 864 PIRQs
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Presently, iommu must be enabled for PVH hardware domain
(XEN) ****************************************
...

The other logs have:
(XEN) Warning: NX (Execute Disable) protection not active

Olaf


* Re: regression in recent pvops kernels, dom0 crashes early
  2021-05-13 12:29     ` Andrew Cooper
@ 2021-05-13 12:31       ` Olaf Hering
  2021-05-13 13:00       ` Olaf Hering
  1 sibling, 0 replies; 13+ messages in thread
From: Olaf Hering @ 2021-05-13 12:31 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1334 bytes --]

On Thu, 13 May 2021 13:29:32 +0100, Andrew Cooper <andrew.cooper3@citrix.com> wrote:

> Can we first establish whether this box really does, or does not, have NX?

According to lscpu of a native boot: no.

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             2
NUMA node(s):          2
Vendor ID:             AuthenticAMD
CPU family:            16
Model:                 2
Model name:            Quad-Core AMD Opteron(tm) Processor 2356
Stepping:              3
CPU MHz:               2300.057
BogoMIPS:              4600.11
Virtualization:        AMD-V
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              2048K
NUMA node0 CPU(s):     0,2,4,6
NUMA node1 CPU(s):     1,3,5,7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs hw_pstate vmmcall npt lbrv svm_lock
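
The check behind that "no" is simply whether "nx" appears as its own
entry in the Flags list (on a box that exposes it, it normally sits
between "syscall" and "mmxext"). A minimal sketch of that whole-token
test, assuming the flags are separated by single spaces as in the output
above; has_cpu_flag() is a name made up for this example:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if `flag` occurs in `flags` as a complete space-separated
 * token. A plain strstr() is not enough: "nx" must not match inside
 * some longer flag name.
 */
static bool has_cpu_flag(const char *flags, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = flags;

    if (len == 0)
        return false;

    while ((p = strstr(p, flag)) != NULL) {
        bool starts_token = (p == flags) || (p[-1] == ' ');
        bool ends_token = (p[len] == '\0') || (p[len] == ' ');

        if (starts_token && ends_token)
            return true;
        p++; /* keep searching after this partial match */
    }
    return false;
}
```

Run against the Flags line above, has_cpu_flag(flags, "nx") is false,
while has_cpu_flag(flags, "syscall") is true.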

Olaf


* Re: regression in recent pvops kernels, dom0 crashes early
  2021-05-13 12:29     ` Andrew Cooper
  2021-05-13 12:31       ` Olaf Hering
@ 2021-05-13 13:00       ` Olaf Hering
  2021-05-13 13:09         ` Andrew Cooper
  1 sibling, 1 reply; 13+ messages in thread
From: Olaf Hering @ 2021-05-13 13:00 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 384 bytes --]

On Thu, 13 May 2021 13:29:32 +0100, Andrew Cooper <andrew.cooper3@citrix.com> wrote:

> Warning: NX (Execute Disable) protection not active

There was a knob in the BIOS, it was set to "Disabled" for some reason.
Once enabled, the flag is seen and the dom0 starts fine.

If Xen is booted with 'cpuid=no-nx', the dom0 crashes again.

Thanks for the help, Andrew.


Olaf


* Re: regression in recent pvops kernels, dom0 crashes early
  2021-05-13 13:00       ` Olaf Hering
@ 2021-05-13 13:09         ` Andrew Cooper
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Cooper @ 2021-05-13 13:09 UTC (permalink / raw)
  To: Olaf Hering; +Cc: xen-devel

On 13/05/2021 14:00, Olaf Hering wrote:
> On Thu, 13 May 2021 13:29:32 +0100, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
>> Warning: NX (Execute Disable) protection not active
> There was a knob in the BIOS, it was set to "Disabled" for some reason.
> Once enabled, the flag is seen and the dom0 starts fine.
>
> If Xen is booted with 'cpuid=no-nx', the dom0 crashes again.
>
> Thanks for the help, Andrew.

Well - I wouldn't say we're quite done yet.

Clearly between sle12sp3 and sle12sp4 you've picked up a regression
where Linux decides to use NX despite its absence.

If NX is a mandatory feature now, then dom0 ought to error out cleanly
stating this fact.

~Andrew



* Re: regression in recent pvops kernels, dom0 crashes early
  2021-05-13 10:24 regression in recent pvops kernels, dom0 crashes early Olaf Hering
  2021-05-13 12:11 ` Andrew Cooper
@ 2021-05-17 10:54 ` Jan Beulich
  2021-05-19 18:42   ` Olaf Hering
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2021-05-17 10:54 UTC (permalink / raw)
  To: Olaf Hering; +Cc: xen-devel

On 13.05.2021 12:24, Olaf Hering wrote:
> Recent pvops dom0 kernels fail to boot on this particular ProLiant BL465c G5 box.
> It happens to work with every Xen and a 4.4 based sle12sp3 kernel, but fails with every Xen and a 4.12 based sle12sp4 (and every newer) kernel.
> 
> Any idea what is going on?
> 
> ....
> (XEN) Freed 256kB init memory.
> (XEN) mm.c:1758:d0 Bad L1 flags 800000
> (XEN) traps.c:458:d0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
> (XEN) domain_crash_sync called from entry.S: fault at ffff82d08022a2a0 create_bounce_frame+0x133/0x143
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> (XEN) ----[ Xen-4.4.20170405T152638.6bf0560e12-9.xen44  x86_64  debug=y  Not tainted ]----
> ....
> 
> ....
> (XEN) Freed 656kB init memory
> (XEN) mm.c:2165:d0v0 Bad L1 flags 800000
> (XEN) d0v0 Unhandled invalid opcode fault/trap [#6, ec=ffffffff]
> (XEN) domain_crash_sync called from entry.S: fault at ffff82d04031a016 x86_64/entry.S#create_bounce_frame+0x15d/0x177
> (XEN) Domain 0 (vcpu#0) crashed on cpu#5:
> (XEN) ----[ Xen-4.15.20210504T145803.280d472f4f-6.xen415  x86_64  debug=y  Not tainted ]----
> ....
> 
> I can probably cycle through all kernels between 4.4 and 4.12 to see where it broke.

I didn't try to figure out where exactly it broke, but could you give the
patch below a try, perhaps on top of my almost-one-year-old submission at
https://lkml.org/lkml/2020/5/27/1035?

Jan

x86/Xen: swap NX determination and GDT setup on BSP

xen_setup_gdt(), via xen_load_gdt_boot(), wants to adjust page tables.
For this to work when NX is not available, x86_configure_nx() needs to
be called first.

Reported-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -1262,16 +1262,16 @@ asmlinkage __visible void __init xen_sta
 	/* Get mfn list */
 	xen_build_dynamic_phys_to_machine();
 
+	/* Work out if we support NX */
+	get_cpu_cap(&boot_cpu_data);
+	x86_configure_nx();
+
 	/*
 	 * Set up kernel GDT and segment registers, mainly so that
 	 * -fstack-protector code can be executed.
 	 */
 	xen_setup_gdt(0);
 
-	/* Work out if we support NX */
-	get_cpu_cap(&boot_cpu_data);
-	x86_configure_nx();
-
 	/* Determine virtual and physical address sizes */
 	get_cpu_address_sizes(&boot_cpu_data);
 



* Re: regression in recent pvops kernels, dom0 crashes early
  2021-05-17 10:54 ` Jan Beulich
@ 2021-05-19 18:42   ` Olaf Hering
  2021-05-20  7:03     ` Jan Beulich
  0 siblings, 1 reply; 13+ messages in thread
From: Olaf Hering @ 2021-05-19 18:42 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 479 bytes --]

On Mon, 17 May 2021 12:54:02 +0200, Jan Beulich <jbeulich@suse.com> wrote:

> x86/Xen: swap NX determination and GDT setup on BSP
> 
> xen_setup_gdt(), via xen_load_gdt_boot(), wants to adjust page tables.
> For this to work when NX is not available, x86_configure_nx() needs to
> be called first.


Thanks. I tried this patch on-top of the SLE15-SP3 kernel branch.
Without the patch booting fails as reported.
With the patch the dom0 starts as expected.
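
Why the ordering matters can be shown with a toy model. This is a
sketch under assumptions, not kernel code: it assumes page protections
are filtered through a global mask that initially permits NX, and that
the NX setup step clears NX from that mask when the CPU lacks it. All
names below (supported_pte_mask, configure_nx, make_pte,
gdt_remap_uses_nx) are hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_PRESENT (1ULL << 0)
#define PAGE_NX      (1ULL << 63)  /* NX is bit 63 of an x86-64 PTE */

/* Toy stand-in for the mask of PTE bits the CPU supports; NX allowed at first. */
static uint64_t supported_pte_mask = ~0ULL;

/* Toy stand-in for the NX setup step: drop NX from the mask if unavailable. */
static void configure_nx(bool cpu_has_nx)
{
    if (cpu_has_nx)
        supported_pte_mask |= PAGE_NX;
    else
        supported_pte_mask &= ~PAGE_NX;
}

/* Build a PTE; requested protection bits are filtered through the mask. */
static uint64_t make_pte(uint64_t frame, uint64_t prot)
{
    return frame | (prot & supported_pte_mask);
}

/* Would a non-executable mapping created right now carry the NX bit? */
static bool gdt_remap_uses_nx(void)
{
    uint64_t pte = make_pte(0x1000, PAGE_PRESENT | PAGE_NX);

    return (pte & PAGE_NX) != 0;
}
```

In this model, calling gdt_remap_uses_nx() before configure_nx(false)
yields a PTE with NX set, which a hypervisor without NX would reject;
calling it afterwards yields a PTE without NX.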


Olaf


* Re: regression in recent pvops kernels, dom0 crashes early
  2021-05-19 18:42   ` Olaf Hering
@ 2021-05-20  7:03     ` Jan Beulich
  2021-05-20  7:45       ` Olaf Hering
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2021-05-20  7:03 UTC (permalink / raw)
  To: Olaf Hering; +Cc: xen-devel

On 19.05.2021 20:42, Olaf Hering wrote:
> On Mon, 17 May 2021 12:54:02 +0200, Jan Beulich <jbeulich@suse.com> wrote:
> 
>> x86/Xen: swap NX determination and GDT setup on BSP
>>
>> xen_setup_gdt(), via xen_load_gdt_boot(), wants to adjust page tables.
>> For this to work when NX is not available, x86_configure_nx() needs to
>> be called first.
> 
> 
> Thanks. I tried this patch on-top of the SLE15-SP3 kernel branch.
> Without the patch booting fails as reported.
> With the patch the dom0 starts as expected.

Just to be sure - you did not need the other patch that I said I suspect
is needed as a prereq?

Jan



* Re: regression in recent pvops kernels, dom0 crashes early
  2021-05-20  7:03     ` Jan Beulich
@ 2021-05-20  7:45       ` Olaf Hering
  2021-05-20  9:42         ` Olaf Hering
  0 siblings, 1 reply; 13+ messages in thread
From: Olaf Hering @ 2021-05-20  7:45 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 265 bytes --]

On Thu, 20 May 2021 09:03:34 +0200, Jan Beulich <jbeulich@suse.com> wrote:

> Just to be sure - you did not need the other patch that I said I suspect
> is needed as a prereq?

Yes, I needed just this single patch which moves x86_configure_nx up.


Olaf


* Re: regression in recent pvops kernels, dom0 crashes early
  2021-05-20  7:45       ` Olaf Hering
@ 2021-05-20  9:42         ` Olaf Hering
  0 siblings, 0 replies; 13+ messages in thread
From: Olaf Hering @ 2021-05-20  9:42 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 448 bytes --]

On Thu, 20 May 2021 09:45:03 +0200, Olaf Hering <olaf@aepfle.de> wrote:

> On Thu, 20 May 2021 09:03:34 +0200, Jan Beulich <jbeulich@suse.com> wrote:
> 
> > Just to be sure - you did not need the other patch that I said I suspect
> > is needed as a prereq?  
> Yes, I needed just this single patch which moves x86_configure_nx up.

I tried the very same approach with the SLE12-SP4-LTSS branch, which also fixed dom0 boot.


Olaf
