* dom0 PV looping on search_pre_exception_table()
@ 2020-12-08 17:57 Manuel Bouyer
  2020-12-08 18:13 ` Andrew Cooper
  0 siblings, 1 reply; 25+ messages in thread
From: Manuel Bouyer @ 2020-12-08 17:57 UTC (permalink / raw)
  To: xen-devel

Hello,
for the first time I tried to boot a xen kernel from devel with
a NetBSD PV dom0. The kernel boots, but when the first userland process
is launched, it seems to enter a loop involving search_pre_exception_table()
(I see an endless stream from the dprintk() at arch/x86/extable.c:202)

With xen 4.13 I see it, but exactly once:
(XEN) extable.c:202: Pre-exception: ffff82d08038c304 -> ffff82d08038c8c8

with devel:
(XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
(XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
(XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
(XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
(XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
[...]

the dom0 kernel is the same.

At first glance it looks like a fault in the guest is not handled as it should,
and the userland process keeps faulting on the same address.

Any idea what to look at ?

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-08 17:57 dom0 PV looping on search_pre_exception_table() Manuel Bouyer
@ 2020-12-08 18:13 ` Andrew Cooper
  2020-12-09  8:39   ` Jan Beulich
  2020-12-09 10:15   ` Manuel Bouyer
  0 siblings, 2 replies; 25+ messages in thread
From: Andrew Cooper @ 2020-12-08 18:13 UTC (permalink / raw)
  To: Manuel Bouyer, xen-devel

On 08/12/2020 17:57, Manuel Bouyer wrote:
> Hello,
> for the first time I tried to boot a xen kernel from devel with
> a NetBSD PV dom0. The kernel boots, but when the first userland process
> is launched, it seems to enter a loop involving search_pre_exception_table()
> (I see an endless stream from the dprintk() at arch/x86/extable.c:202)
>
> With xen 4.13 I see it, but exactly once:
> (XEN) extable.c:202: Pre-exception: ffff82d08038c304 -> ffff82d08038c8c8
>
> with devel:
> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
> [...]
>
> the dom0 kernel is the same.
>
> At first glance it looks like a fault in the guest is not handled as it should,
> and the userland process keeps faulting on the same address.
>
> Any idea what to look at ?

That is a reoccurring fault on IRET back to guest context, and is
probably caused by some unwise-in-hindsight cleanup which doesn't
escalate the failure to the failsafe callback.
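
For background, the pre-exception table is just a whitelist of (faulting
instruction, fixup) pairs for the few instructions on the exit-to-guest path
which are allowed to fault, the IRET being the relevant one here.  A minimal
sketch of the lookup, simplified to absolute addresses and a linear search
(the real table in xen/arch/x86/extable.c stores sorted 32-bit relative
offsets):

/*
 * Simplified sketch, not Xen's actual implementation: the real
 * exception_table_entry uses sorted 32-bit relative offsets and a
 * binary search.
 */
struct pre_ex_entry {
    unsigned long addr;   /* instruction which is allowed to fault   */
    unsigned long fixup;  /* recovery code to continue at instead    */
};

static unsigned long lookup_pre_ex(const struct pre_ex_entry *tbl,
                                   unsigned int nr, unsigned long ip)
{
    for ( unsigned int i = 0; i < nr; i++ )
        if ( tbl[i].addr == ip )
            return tbl[i].fixup;

    return 0; /* not a whitelisted fault site -> treat as a hypervisor bug */
}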

This ought to give something useful to debug with:

~Andrew

diff --git a/xen/arch/x86/extable.c b/xen/arch/x86/extable.c
index 70972f1085..62a7bcbe38 100644
--- a/xen/arch/x86/extable.c
+++ b/xen/arch/x86/extable.c
@@ -191,6 +191,10 @@ static int __init stub_selftest(void)
 __initcall(stub_selftest);
 #endif
 
+#include <xen/sched.h>
+#include <xen/softirq.h>
+const char *vec_name(unsigned int vec);
+
 unsigned long
 search_pre_exception_table(struct cpu_user_regs *regs)
 {
@@ -199,7 +203,13 @@ search_pre_exception_table(struct cpu_user_regs *regs)
         __start___pre_ex_table, __stop___pre_ex_table-1, addr);
     if ( fixup )
     {
-        dprintk(XENLOG_INFO, "Pre-exception: %p -> %p\n", _p(addr), _p(fixup));
+        printk(XENLOG_ERR "IRET fault: %s[%04x]\n",
+               vec_name(regs->entry_vector), regs->error_code);
+
+        domain_crash(current->domain);
+        for ( ;; )
+            do_softirq();
+
         perfc_incr(exception_fixed);
     }
     return fixup;
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 0459cee9fb..1059f3ce66 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -687,7 +687,7 @@ const char *trapstr(unsigned int trapnr)
     return trapnr < ARRAY_SIZE(strings) ? strings[trapnr] : "???";
 }
 
-static const char *vec_name(unsigned int vec)
+const char *vec_name(unsigned int vec)
 {
     static const char names[][4] = {
 #define P(x) [X86_EXC_ ## x] = "#" #x



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-08 18:13 ` Andrew Cooper
@ 2020-12-09  8:39   ` Jan Beulich
  2020-12-09  9:49     ` Manuel Bouyer
  2020-12-09 10:15   ` Manuel Bouyer
  1 sibling, 1 reply; 25+ messages in thread
From: Jan Beulich @ 2020-12-09  8:39 UTC (permalink / raw)
  To: Andrew Cooper, Manuel Bouyer; +Cc: xen-devel

On 08.12.2020 19:13, Andrew Cooper wrote:
> On 08/12/2020 17:57, Manuel Bouyer wrote:
>> Hello,
>> for the first time I tried to boot a xen kernel from devel with
>> a NetBSD PV dom0. The kernel boots, but when the first userland process
>> is launched, it seems to enter a loop involving search_pre_exception_table()
>> (I see an endless stream from the dprintk() at arch/x86/extable.c:202)
>>
>> With xen 4.13 I see it, but exactly once:
>> (XEN) extable.c:202: Pre-exception: ffff82d08038c304 -> ffff82d08038c8c8
>>
>> with devel:
>> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
>> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
>> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
>> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
>> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
>> [...]
>>
>> the dom0 kernel is the same.
>>
>> At first glance it looks like a fault in the guest is not handled as it should,
>> and the userland process keeps faulting on the same address.
>>
>> Any idea what to look at ?
> 
> That is a reoccurring fault on IRET back to guest context, and is
> probably caused by some unwise-in-hindsight cleanup which doesn't
> escalate the failure to the failsafe callback.

But is this a 32-bit Dom0? 64-bit ones get well-known selectors
installed for CS and SS by create_bounce_frame(), and we don't
permit registration of non-canonical trap handler entry point
addresses.

I have to admit I also find curious the difference between 4.13
and master.

Jan


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-09  8:39   ` Jan Beulich
@ 2020-12-09  9:49     ` Manuel Bouyer
  0 siblings, 0 replies; 25+ messages in thread
From: Manuel Bouyer @ 2020-12-09  9:49 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Wed, Dec 09, 2020 at 09:39:49AM +0100, Jan Beulich wrote:
> On 08.12.2020 19:13, Andrew Cooper wrote:
> > On 08/12/2020 17:57, Manuel Bouyer wrote:
> >> Hello,
> >> for the first time I tried to boot a xen kernel from devel with
> >> a NetBSD PV dom0. The kernel boots, but when the first userland process
> >> is launched, it seems to enter a loop involving search_pre_exception_table()
> >> (I see an endless stream from the dprintk() at arch/x86/extable.c:202)
> >>
> >> With xen 4.13 I see it, but exactly once:
> >> (XEN) extable.c:202: Pre-exception: ffff82d08038c304 -> ffff82d08038c8c8
> >>
> >> with devel:
> >> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
> >> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
> >> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
> >> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
> >> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
> >> [...]
> >>
> >> the dom0 kernel is the same.
> >>
> >> At first glance it looks like a fault in the guest is not handled as it should,
> >> and the userland process keeps faulting on the same address.
> >>
> >> Any idea what to look at ?
> > 
> > That is a reoccurring fault on IRET back to guest context, and is
> > probably caused by some unwise-in-hindsight cleanup which doesn't
> > escalate the failure to the failsafe callback.
> 
> But is this a 32-bit Dom0? 64-bit ones get well-known selectors
> installed for CS and SS by create_bounce_frame(), and we don't
> permit registration of non-canonical trap handler entry point
> addresses.

No, it's a 64-bit dom0.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-08 18:13 ` Andrew Cooper
  2020-12-09  8:39   ` Jan Beulich
@ 2020-12-09 10:15   ` Manuel Bouyer
  2020-12-09 13:28     ` Andrew Cooper
  1 sibling, 1 reply; 25+ messages in thread
From: Manuel Bouyer @ 2020-12-09 10:15 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

On Tue, Dec 08, 2020 at 06:13:46PM +0000, Andrew Cooper wrote:
> On 08/12/2020 17:57, Manuel Bouyer wrote:
> > Hello,
> > for the first time I tried to boot a xen kernel from devel with
> > a NetBSD PV dom0. The kernel boots, but when the first userland process
> > is launched, it seems to enter a loop involving search_pre_exception_table()
> > (I see an endless stream from the dprintk() at arch/x86/extable.c:202)
> >
> > With xen 4.13 I see it, but exactly once:
> > (XEN) extable.c:202: Pre-exception: ffff82d08038c304 -> ffff82d08038c8c8
> >
> > with devel:
> > (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
> > (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
> > (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
> > (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
> > (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
> > [...]
> >
> > the dom0 kernel is the same.
> >
> > At first glance it looks like a fault in the guest is not handled as it should,
> > and the userland process keeps faulting on the same address.
> >
> > Any idea what to look at ?
> 
> That is a reoccurring fault on IRET back to guest context, and is
> probably caused by some unwise-in-hindsight cleanup which doesn't
> escalate the failure to the failsafe callback.
> 
> This ought to give something useful to debug with:

thanks, I got:
(XEN) IRET fault: #PF[0000]                                                 
(XEN) domain_crash called from extable.c:209                                
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:                                   
(XEN) ----[ Xen-4.15-unstable  x86_64  debug=y   Tainted:   C   ]----       
(XEN) CPU:    0                                                             
(XEN) RIP:    0047:[<00007f7e184007d0>]                                     
(XEN) RFLAGS: 0000000000000202   EM: 0   CONTEXT: pv guest (d0v0)           
(XEN) rax: ffff82d04038c309   rbx: 0000000000000000   rcx: 000000000000e008 
(XEN) rdx: 0000000000010086   rsi: ffff83007fcb7f78   rdi: 000000000000e010 
(XEN) rbp: 0000000000000000   rsp: 00007f7fff53e3e0   r8:  0000000e00000000 
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000 
(XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000 
(XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 0000000000002660 
(XEN) cr3: 0000000079cdb000   cr2: 00007f7fff53e3e0                         
(XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: ffffffff80cf2dc0    
(XEN) ds: 0023   es: 0023   fs: 0000   gs: 0000   ss: 003f   cs: 0047       
(XEN) Guest stack trace from rsp=00007f7fff53e3e0:          
(XEN)    0000000000000001 00007f7fff53e8f8 0000000000000000 0000000000000000
(XEN)    0000000000000003 000000004b600040 0000000000000004 0000000000000038
(XEN)    0000000000000005 0000000000000008 0000000000000006 0000000000001000
(XEN)    0000000000000007 00007f7e18400000 0000000000000008 0000000000000000
(XEN)    0000000000000009 000000004b601cd0 00000000000007d0 0000000000000000
(XEN)    00000000000007d1 0000000000000000 00000000000007d2 0000000000000000
(XEN)    00000000000007d3 0000000000000000 000000000000000d 00007f7fff53f000
(XEN)    00000000000007de 00007f7fff53e4e0 0000000000000000 0000000000000000
(XEN)    6e692f6e6962732f 0000000000007469 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-09 10:15   ` Manuel Bouyer
@ 2020-12-09 13:28     ` Andrew Cooper
  2020-12-09 13:59       ` Manuel Bouyer
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Cooper @ 2020-12-09 13:28 UTC (permalink / raw)
  To: Manuel Bouyer; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 5677 bytes --]

On 09/12/2020 10:15, Manuel Bouyer wrote:
> On Tue, Dec 08, 2020 at 06:13:46PM +0000, Andrew Cooper wrote:
>> On 08/12/2020 17:57, Manuel Bouyer wrote:
>>> Hello,
>>> for the first time I tried to boot a xen kernel from devel with
>>> a NetBSD PV dom0. The kernel boots, but when the first userland process
>>> is launched, it seems to enter a loop involving search_pre_exception_table()
>>> (I see an endless stream from the dprintk() at arch/x86/extable.c:202)
>>>
>>> With xen 4.13 I see it, but exactly once:
>>> (XEN) extable.c:202: Pre-exception: ffff82d08038c304 -> ffff82d08038c8c8
>>>
>>> with devel:
>>> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
>>> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
>>> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
>>> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
>>> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
>>> [...]
>>>
>>> the dom0 kernel is the same.
>>>
>>> At first glance it looks like a fault in the guest is not handled as it should,
>>> and the userland process keeps faulting on the same address.
>>>
>>> Any idea what to look at ?
>> That is a reoccurring fault on IRET back to guest context, and is
>> probably caused by some unwise-in-hindsight cleanup which doesn't
>> escalate the failure to the failsafe callback.
>>
>> This ought to give something useful to debug with:
> thanks, I got:
> (XEN) IRET fault: #PF[0000]                                                 
> (XEN) domain_crash called from extable.c:209                                
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:                                   
> (XEN) ----[ Xen-4.15-unstable  x86_64  debug=y   Tainted:   C   ]----       
> (XEN) CPU:    0                                                             
> (XEN) RIP:    0047:[<00007f7e184007d0>]                                     
> (XEN) RFLAGS: 0000000000000202   EM: 0   CONTEXT: pv guest (d0v0)           
> (XEN) rax: ffff82d04038c309   rbx: 0000000000000000   rcx: 000000000000e008 
> (XEN) rdx: 0000000000010086   rsi: ffff83007fcb7f78   rdi: 000000000000e010 
> (XEN) rbp: 0000000000000000   rsp: 00007f7fff53e3e0   r8:  0000000e00000000 
> (XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000 
> (XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000 
> (XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 0000000000002660 
> (XEN) cr3: 0000000079cdb000   cr2: 00007f7fff53e3e0                         
> (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: ffffffff80cf2dc0    
> (XEN) ds: 0023   es: 0023   fs: 0000   gs: 0000   ss: 003f   cs: 0047       
> (XEN) Guest stack trace from rsp=00007f7fff53e3e0:          
> (XEN)    0000000000000001 00007f7fff53e8f8 0000000000000000 0000000000000000
> (XEN)    0000000000000003 000000004b600040 0000000000000004 0000000000000038
> (XEN)    0000000000000005 0000000000000008 0000000000000006 0000000000001000
> (XEN)    0000000000000007 00007f7e18400000 0000000000000008 0000000000000000
> (XEN)    0000000000000009 000000004b601cd0 00000000000007d0 0000000000000000
> (XEN)    00000000000007d1 0000000000000000 00000000000007d2 0000000000000000
> (XEN)    00000000000007d3 0000000000000000 000000000000000d 00007f7fff53f000
> (XEN)    00000000000007de 00007f7fff53e4e0 0000000000000000 0000000000000000
> (XEN)    6e692f6e6962732f 0000000000007469 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.

Pagefaults on IRET come either from stack accesses for operands (not the
case here as Xen is otherwise working fine), or from segment selector
loads for %cs and %ss.

In this example, %ss is in the LDT, which specifically does use
pagefaults to promote the frame to PGT_segdesc.

I suspect that what is happening is that handle_ldt_mapping_fault() is
failing to promote the page (for some reason), and we're taking the "In
hypervisor mode? Leave it to the #PF handler to fix up." path due to the
confusion in context, and Xen's #PF handler is concluding "nothing else
to do".

The older behaviour of escalating to the failsafe callback would have
broken this cycle by rewriting %ss and re-entering the kernel.
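
For reference, which table a selector refers to is encoded in bit 2 (TI) of
the selector itself, which is how the dump above shows %ss = 0x003f living in
the LDT.  A quick standalone sketch, using the values from the crash:

#include <stdio.h>

/* Decode an x86 segment selector: bits 15:3 index, bit 2 TI (1 = LDT,
 * 0 = GDT), bits 1:0 RPL.  Values taken from the register dump above. */
int main(void)
{
    const unsigned int sels[] = { 0x003f /* %ss */, 0x0047 /* %cs */ };

    for ( unsigned int i = 0; i < sizeof(sels) / sizeof(sels[0]); i++ )
        printf("%#06x: index %u, %s, RPL %u\n", sels[i], sels[i] >> 3,
               (sels[i] & 4) ? "LDT" : "GDT", sels[i] & 3);

    return 0; /* prints: index 7, LDT, RPL 3  and  index 8, LDT, RPL 3 */
}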


Please try the attached debugging patch, which is an extension of what I
gave you yesterday.  First, it ought to print %cr2, which I expect will
point to Xen's virtual mapping of the vcpu's LDT.  The logic ought to
loop a few times so we can inspect the hypervisor codepaths which are
effectively livelocked in this state, and I've also instrumented
check_descriptor() failures because I've got a gut feeling that is the
root cause of the problem.

~Andrew

[-- Attachment #2: 0001-extable-dbg.patch --]
[-- Type: text/x-patch, Size: 2272 bytes --]

From 841a6950fec5b43b370653e0c833a54fed64882e Mon Sep 17 00:00:00 2001
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Wed, 9 Dec 2020 12:50:38 +0000
Subject: extable-dbg


diff --git a/xen/arch/x86/extable.c b/xen/arch/x86/extable.c
index 70972f1085..88b05bef38 100644
--- a/xen/arch/x86/extable.c
+++ b/xen/arch/x86/extable.c
@@ -191,6 +191,10 @@ static int __init stub_selftest(void)
 __initcall(stub_selftest);
 #endif
 
+#include <xen/sched.h>
+#include <xen/softirq.h>
+const char *vec_name(unsigned int vec);
+
 unsigned long
 search_pre_exception_table(struct cpu_user_regs *regs)
 {
@@ -199,7 +203,21 @@ search_pre_exception_table(struct cpu_user_regs *regs)
         __start___pre_ex_table, __stop___pre_ex_table-1, addr);
     if ( fixup )
     {
-        dprintk(XENLOG_INFO, "Pre-exception: %p -> %p\n", _p(addr), _p(fixup));
+        static int count;
+
+        printk(XENLOG_ERR "IRET fault: %s[%04x]\n",
+               vec_name(regs->entry_vector), regs->error_code);
+
+        if ( regs->entry_vector == X86_EXC_PF )
+            printk(XENLOG_ERR "%%cr2 %016lx\n", read_cr2());
+
+        if ( count++ > 2 )
+        {
+            domain_crash(current->domain);
+            for ( ;; )
+                do_softirq();
+        }
+
         perfc_incr(exception_fixed);
     }
     return fixup;
diff --git a/xen/arch/x86/pv/descriptor-tables.c b/xen/arch/x86/pv/descriptor-tables.c
index 39c1a2311a..6bc58bba67 100644
--- a/xen/arch/x86/pv/descriptor-tables.c
+++ b/xen/arch/x86/pv/descriptor-tables.c
@@ -282,6 +282,10 @@ int validate_segdesc_page(struct page_info *page)
 
     unmap_domain_page(descs);
 
+    if ( i != 512 )
+        printk_once("Check Descriptor failed: idx %u, a: %08x, b: %08x\n",
+                    i, descs[i].a, descs[i].b);
+
     return i == 512 ? 0 : -EINVAL;
 }
 
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 0459cee9fb..1059f3ce66 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -687,7 +687,7 @@ const char *trapstr(unsigned int trapnr)
     return trapnr < ARRAY_SIZE(strings) ? strings[trapnr] : "???";
 }
 
-static const char *vec_name(unsigned int vec)
+const char *vec_name(unsigned int vec)
 {
     static const char names[][4] = {
 #define P(x) [X86_EXC_ ## x] = "#" #x

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-09 13:28     ` Andrew Cooper
@ 2020-12-09 13:59       ` Manuel Bouyer
  2020-12-09 14:41         ` Andrew Cooper
  0 siblings, 1 reply; 25+ messages in thread
From: Manuel Bouyer @ 2020-12-09 13:59 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

On Wed, Dec 09, 2020 at 01:28:54PM +0000, Andrew Cooper wrote:
> 
> Pagefaults on IRET come either from stack accesses for operands (not the
> case here as Xen is otherwise working fine), or from segment selector
> loads for %cs and %ss.
> 
> In this example, %ss is in the LDT, which specifically does use
> pagefaults to promote the frame to PGT_segdesc.
> 
> I suspect that what is happening is that handle_ldt_mapping_fault() is
> failing to promote the page (for some reason), and we're taking the "In
> hypervisor mode? Leave it to the #PF handler to fix up." path due to the
> confusion in context, and Xen's #PF handler is concluding "nothing else
> to do".
> 
> The older behaviour of escalating to the failsafe callback would have
> broken this cycle by rewriting %ss and re-entering the kernel.
> 
> 
> Please try the attached debugging patch, which is an extension of what I
> gave you yesterday.  First, it ought to print %cr2, which I expect will
> point to Xen's virtual mapping of the vcpu's LDT.  The logic ought to
> loop a few times so we can inspect the hypervisor codepaths which are
> effectively livelocked in this state, and I've also instrumented
> check_descriptor() failures because I've got a gut feeling that is the
> root cause of the problem.

here's the output:
(XEN) IRET fault: #PF[0000]
(XEN) %cr2 ffff820000010040                                                    
(XEN) IRET fault: #PF[0000]                                                    
(XEN) %cr2 ffff820000010040                                                 
(XEN) IRET fault: #PF[0000]
(XEN) %cr2 ffff820000010040
(XEN) IRET fault: #PF[0000]
(XEN) %cr2 ffff820000010040
(XEN) domain_crash called from extable.c:216
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.15-unstable  x86_64  debug=y   Tainted:   C   ]----
(XEN) CPU:    0
(XEN) RIP:    0047:[<00007f7ff60007d0>]
(XEN) RFLAGS: 0000000000000202   EM: 0   CONTEXT: pv guest (d0v0)
(XEN) rax: ffff82d04038c309   rbx: 0000000000000000   rcx: 000000000000e008
(XEN) rdx: 0000000000010086   rsi: ffff83007fcb7f78   rdi: 000000000000e010
(XEN) rbp: 0000000000000000   rsp: 00007f7fff4876c0   r8:  0000000e00000000
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
(XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 0000000000002660
(XEN) cr3: 0000000079cdb000   cr2: ffffa1000000a040
(XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: ffffffff80cf2dc0
(XEN) ds: 0023   es: 0023   fs: 0000   gs: 0000   ss: 003f   cs: 0047
(XEN) Guest stack trace from rsp=00007f7fff4876c0:
(XEN)    0000000000000001 00007f7fff487bd8 0000000000000000 0000000000000000
(XEN)    0000000000000003 00000000aee00040 0000000000000004 0000000000000038
(XEN)    0000000000000005 0000000000000008 0000000000000006 0000000000001000
(XEN)    0000000000000007 00007f7ff6000000 0000000000000008 0000000000000000
(XEN)    0000000000000009 00000000aee01cd0 00000000000007d0 0000000000000000
(XEN)    00000000000007d1 0000000000000000 00000000000007d2 0000000000000000
(XEN)    00000000000007d3 0000000000000000 000000000000000d 00007f7fff488000
(XEN)    00000000000007de 00007f7fff4877c0 0000000000000000 0000000000000000
(XEN)    6e692f6e6962732f 0000000000007469 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-09 13:59       ` Manuel Bouyer
@ 2020-12-09 14:41         ` Andrew Cooper
  2020-12-09 15:44           ` Manuel Bouyer
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Cooper @ 2020-12-09 14:41 UTC (permalink / raw)
  To: Manuel Bouyer; +Cc: xen-devel

On 09/12/2020 13:59, Manuel Bouyer wrote:
> On Wed, Dec 09, 2020 at 01:28:54PM +0000, Andrew Cooper wrote:
>> Pagefaults on IRET come either from stack accesses for operands (not the
>> case here as Xen is otherwise working fine), or from segment selector
>> loads for %cs and %ss.
>>
>> In this example, %ss is in the LDT, which specifically does use
>> pagefaults to promote the frame to PGT_segdesc.
>>
>> I suspect that what is happening is that handle_ldt_mapping_fault() is
>> failing to promote the page (for some reason), and we're taking the "In
>> hypervisor mode? Leave it to the #PF handler to fix up." path due to the
>> confusion in context, and Xen's #PF handler is concluding "nothing else
>> to do".
>>
>> The older behaviour of escalating to the failsafe callback would have
>> broken this cycle by rewriting %ss and re-entering the kernel.
>>
>>
>> Please try the attached debugging patch, which is an extension of what I
>> gave you yesterday.  First, it ought to print %cr2, which I expect will
>> point to Xen's virtual mapping of the vcpu's LDT.  The logic ought to
>> loop a few times so we can inspect the hypervisor codepaths which are
>> effectively livelocked in this state, and I've also instrumented
>> check_descriptor() failures because I've got a gut feeling that is the
>> root cause of the problem.
> here's the output:
> (XEN) IRET fault: #PF[0000]
> (XEN) %cr2 ffff820000010040                                                    
> (XEN) IRET fault: #PF[0000]                                                    
> (XEN) %cr2 ffff820000010040                                                 
> (XEN) IRET fault: #PF[0000]
> (XEN) %cr2 ffff820000010040
> (XEN) IRET fault: #PF[0000]
> (XEN) %cr2 ffff820000010040
> (XEN) domain_crash called from extable.c:216
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> (XEN) ----[ Xen-4.15-unstable  x86_64  debug=y   Tainted:   C   ]----
> (XEN) CPU:    0
> (XEN) RIP:    0047:[<00007f7ff60007d0>]
> (XEN) RFLAGS: 0000000000000202   EM: 0   CONTEXT: pv guest (d0v0)
> (XEN) rax: ffff82d04038c309   rbx: 0000000000000000   rcx: 000000000000e008
> (XEN) rdx: 0000000000010086   rsi: ffff83007fcb7f78   rdi: 000000000000e010
> (XEN) rbp: 0000000000000000   rsp: 00007f7fff4876c0   r8:  0000000e00000000
> (XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
> (XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
> (XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 0000000000002660
> (XEN) cr3: 0000000079cdb000   cr2: ffffa1000000a040
> (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: ffffffff80cf2dc0
> (XEN) ds: 0023   es: 0023   fs: 0000   gs: 0000   ss: 003f   cs: 0047
> (XEN) Guest stack trace from rsp=00007f7fff4876c0:
> (XEN)    0000000000000001 00007f7fff487bd8 0000000000000000 0000000000000000
> (XEN)    0000000000000003 00000000aee00040 0000000000000004 0000000000000038
> (XEN)    0000000000000005 0000000000000008 0000000000000006 0000000000001000
> (XEN)    0000000000000007 00007f7ff6000000 0000000000000008 0000000000000000
> (XEN)    0000000000000009 00000000aee01cd0 00000000000007d0 0000000000000000
> (XEN)    00000000000007d1 0000000000000000 00000000000007d2 0000000000000000
> (XEN)    00000000000007d3 0000000000000000 000000000000000d 00007f7fff488000
> (XEN)    00000000000007de 00007f7fff4877c0 0000000000000000 0000000000000000
> (XEN)    6e692f6e6962732f 0000000000007469 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.

Huh, so it is the LDT, but we're not getting as far as inspecting the
target frame.

I wonder if the LDT is set up correctly.  How about this incremental delta?

~Andrew

diff --git a/xen/arch/x86/extable.c b/xen/arch/x86/extable.c
index 88b05bef38..be59a3e216 100644
--- a/xen/arch/x86/extable.c
+++ b/xen/arch/x86/extable.c
@@ -203,13 +203,16 @@ search_pre_exception_table(struct cpu_user_regs *regs)
         __start___pre_ex_table, __stop___pre_ex_table-1, addr);
     if ( fixup )
     {
+        struct vcpu *curr = current;
         static int count;
 
         printk(XENLOG_ERR "IRET fault: %s[%04x]\n",
                vec_name(regs->entry_vector), regs->error_code);
 
         if ( regs->entry_vector == X86_EXC_PF )
-            printk(XENLOG_ERR "%%cr2 %016lx\n", read_cr2());
+            printk(XENLOG_ERR "%%cr2 %016lx, LDT base %016lx, limit %04x\n",
+                   read_cr2(), curr->arch.pv.ldt_base,
+                   (curr->arch.pv.ldt_ents << 3) | 7);
 
         if ( count++ > 2 )
         {
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 1059f3ce66..3ac07a84c3 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1233,6 +1233,8 @@ static int handle_ldt_mapping_fault(unsigned int offset,
     }
     else
     {
+        printk(XENLOG_ERR "*** pv_map_ldt_shadow_page(%#x) failed\n", offset);
+
         /* In hypervisor mode? Leave it to the #PF handler to fix up. */
         if ( !guest_mode(regs) )
             return 0;



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-09 14:41         ` Andrew Cooper
@ 2020-12-09 15:44           ` Manuel Bouyer
  2020-12-09 16:00             ` Andrew Cooper
  0 siblings, 1 reply; 25+ messages in thread
From: Manuel Bouyer @ 2020-12-09 15:44 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

On Wed, Dec 09, 2020 at 02:41:23PM +0000, Andrew Cooper wrote:
> 
> Huh, so it is the LDT, but we're not getting as far as inspecting the
> target frame.
> 
> I wonder if the LDT is set up correctly.

I guess it is, otherwise it wouldn't boot with a Xen 4.13 kernel, would it ?

> How about this incremental delta?

Here's the output
(XEN) IRET fault: #PF[0000]                                                    
(XEN) %cr2 ffff820000010040, LDT base ffffc4800000a000, limit 0057             
(XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
(XEN) IRET fault: #PF[0000]                                                    
(XEN) %cr2 ffff820000010040, LDT base ffffc4800000a000, limit 0057             
(XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
(XEN) IRET fault: #PF[0000]                                                 
(XEN) %cr2 ffff820000010040, LDT base ffffc4800000a000, limit 0057
(XEN) *** pv_map_ldt_shadow_page(0x40) failed
(XEN) IRET fault: #PF[0000]
(XEN) %cr2 ffff820000010040, LDT base ffffc4800000a000, limit 0057
(XEN) domain_crash called from extable.c:219
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.15-unstable  x86_64  debug=y   Tainted:   C   ]----
(XEN) CPU:    0
(XEN) RIP:    0047:[<00007f7ecaa007d0>]
(XEN) RFLAGS: 0000000000000202   EM: 0   CONTEXT: pv guest (d0v0)
(XEN) rax: ffff82d04038c309   rbx: 0000000000000000   rcx: 000000000000e008
(XEN) rdx: 0000000000010086   rsi: ffff83007fcb7f78   rdi: 000000000000e010
(XEN) rbp: 0000000000000000   rsp: 00007f7fff32e3f0   r8:  0000000e00000000
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
(XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 0000000000002660
(XEN) cr3: 0000000079cdb000   cr2: ffffc4800000a040
(XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: ffffffff80cf2dc0
(XEN) ds: 0023   es: 0023   fs: 0000   gs: 0000   ss: 003f   cs: 0047
(XEN) Guest stack trace from rsp=00007f7fff32e3f0:
(XEN)    0000000000000001 00007f7fff32e908 0000000000000000 0000000000000000
(XEN)    0000000000000003 0000000173e00040 0000000000000004 0000000000000038
(XEN)    0000000000000005 0000000000000008 0000000000000006 0000000000001000
(XEN)    0000000000000007 00007f7ecaa00000 0000000000000008 0000000000000000
(XEN)    0000000000000009 0000000173e01cd0 00000000000007d0 0000000000000000
(XEN)    00000000000007d1 0000000000000000 00000000000007d2 0000000000000000
(XEN)    00000000000007d3 0000000000000000 000000000000000d 00007f7fff32f000
(XEN)    00000000000007de 00007f7fff32e4f0 0000000000000000 0000000000000000
(XEN)    6e692f6e6962732f 0000000000007469 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-09 15:44           ` Manuel Bouyer
@ 2020-12-09 16:00             ` Andrew Cooper
  2020-12-09 16:30               ` Manuel Bouyer
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Cooper @ 2020-12-09 16:00 UTC (permalink / raw)
  To: Manuel Bouyer; +Cc: xen-devel

On 09/12/2020 15:44, Manuel Bouyer wrote:
> On Wed, Dec 09, 2020 at 02:41:23PM +0000, Andrew Cooper wrote:
>> Huh, so it is the LDT, but we're not getting as far as inspecting the
>> target frame.
>>
>> I wonder if the LDT is set up correctly.
> I guess it is, otherwise it wouldn't boot with a Xen 4.13 kernel, would it ?

Well - you said you always saw it once on 4.13, which clearly shows that
something was wonky, but it managed to unblock itself.

>> How about this incremental delta?
> Here's the output
> (XEN) IRET fault: #PF[0000]                                                    
> (XEN) %cr2 ffff820000010040, LDT base ffffc4800000a000, limit 0057             
> (XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
> (XEN) IRET fault: #PF[0000]                                                    
> (XEN) %cr2 ffff820000010040, LDT base ffffc4800000a000, limit 0057             
> (XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
> (XEN) IRET fault: #PF[0000]                                                 

Ok, so the promotion definitely fails, but we don't get as far as
inspecting the content of the LDT frame.  This probably means it failed
to change the page type, which probably means there are still
outstanding writeable references.
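
The rule being enforced, in a heavily simplified sketch (this is not Xen's
actual count_info/type_info machinery), is that a frame may only take on the
descriptor-page type once nothing can still write to it:

#include <stdbool.h>

/* Simplified sketch of type-count promotion, not Xen's real implementation. */
enum frame_type { TYPE_NONE, TYPE_WRITABLE, TYPE_SEG_DESC };

struct frame {
    enum frame_type type;
    unsigned int type_refs;    /* outstanding uses of the current type */
};

static bool promote_to_seg_desc(struct frame *f)
{
    if ( f->type == TYPE_SEG_DESC )
    {
        f->type_refs++;        /* already validated; just take a reference */
        return true;
    }

    if ( f->type_refs )        /* e.g. still mapped writable somewhere */
        return false;

    /* ... would validate every descriptor on the frame here ... */
    f->type = TYPE_SEG_DESC;
    f->type_refs = 1;
    return true;
}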

I'm expecting the final printk to be the one which triggers.

~Andrew

diff --git a/xen/arch/x86/pv/mm.c b/xen/arch/x86/pv/mm.c
index 5d74d11cba..2823dc2894 100644
--- a/xen/arch/x86/pv/mm.c
+++ b/xen/arch/x86/pv/mm.c
@@ -87,14 +87,23 @@ bool pv_map_ldt_shadow_page(unsigned int offset)
 
     gl1e = guest_get_eff_kern_l1e(linear);
     if ( unlikely(!(l1e_get_flags(gl1e) & _PAGE_PRESENT)) )
+    {
+        printk(XENLOG_ERR "*** LDT: gl1e %"PRIpte" not present\n", gl1e.l1);
         return false;
+    }
 
     page = get_page_from_gfn(currd, l1e_get_pfn(gl1e), NULL, P2M_ALLOC);
     if ( unlikely(!page) )
+    {
+        printk(XENLOG_ERR "*** LDT: failed to get gfn %05lx reference\n",
+               l1e_get_pfn(gl1e));
         return false;
+    }
 
     if ( unlikely(!get_page_type(page, PGT_seg_desc_page)) )
     {
+        printk(XENLOG_ERR "*** LDT: bad type: caf %016lx, taf=%016lx\n",
+               page->count_info, page->u.inuse.type_info);
         put_page(page);
         return false;
     }



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-09 16:00             ` Andrew Cooper
@ 2020-12-09 16:30               ` Manuel Bouyer
  2020-12-09 18:08                 ` Andrew Cooper
  0 siblings, 1 reply; 25+ messages in thread
From: Manuel Bouyer @ 2020-12-09 16:30 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

On Wed, Dec 09, 2020 at 04:00:02PM +0000, Andrew Cooper wrote:
> [...]
> >> I wonder if the LDT is set up correctly.
> > I guess it is, otherwise it wouldn't boot with a Xen 4.13 kernel, would it ?
> 
> Well - you said you always saw it once on 4.13, which clearly shows that
> something was wonky, but it managed to unblock itself.
> 
> >> How about this incremental delta?
> > Here's the output
> > (XEN) IRET fault: #PF[0000]                                                    
> > (XEN) %cr2 ffff820000010040, LDT base ffffc4800000a000, limit 0057             
> > (XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
> > (XEN) IRET fault: #PF[0000]                                                    
> > (XEN) %cr2 ffff820000010040, LDT base ffffc4800000a000, limit 0057             
> > (XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
> > (XEN) IRET fault: #PF[0000]                                                 
> 
> Ok, so the promotion definitely fails, but we don't get as far as
> inspecting the content of the LDT frame.  This probably means it failed
> to change the page type, which probably means there are still
> outstanding writeable references.
> 
> I'm expecting the final printk to be the one which triggers.

It's not. 
Here's the output:
(XEN) IRET fault: #PF[0000]                                                    
(XEN) %cr2 ffff820000010040, LDT base ffffbd000000a000, limit 0057             
(XEN) *** LDT: gl1e 0000000000000000 not present                               
(XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
(XEN) IRET fault: #PF[0000]                                                    
(XEN) %cr2 ffff820000010040, LDT base ffffbd000000a000, limit 0057             
(XEN) *** LDT: gl1e 0000000000000000 not present                               
(XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
(XEN) IRET fault: #PF[0000]                                                    
(XEN) %cr2 ffff820000010040, LDT base ffffbd000000a000, limit 0057          
(XEN) *** LDT: gl1e 0000000000000000 not present
(XEN) *** pv_map_ldt_shadow_page(0x40) failed
(XEN) IRET fault: #PF[0000]
(XEN) %cr2 ffff820000010040, LDT base ffffbd000000a000, limit 0057
(XEN) domain_crash called from extable.c:219
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.15-unstable  x86_64  debug=y   Tainted:   C   ]----
(XEN) CPU:    0
(XEN) RIP:    0047:[<00007f7f5dc007d0>]
(XEN) RFLAGS: 0000000000000202   EM: 0   CONTEXT: pv guest (d0v0)
(XEN) rax: ffff82d04038c309   rbx: 0000000000000000   rcx: 000000000000e008
(XEN) rdx: 0000000000010086   rsi: ffff83007fcb7f78   rdi: 000000000000e010
(XEN) rbp: 0000000000000000   rsp: 00007f7fffcfc8d0   r8:  0000000e00000000
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
(XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 0000000000002660
(XEN) cr3: 0000000079cdb000   cr2: ffffbd000000a040
(XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: ffffffff80cf2dc0
(XEN) ds: 0023   es: 0023   fs: 0000   gs: 0000   ss: 003f   cs: 0047
(XEN) Guest stack trace from rsp=00007f7fffcfc8d0:
(XEN)    0000000000000001 00007f7fffcfcde8 0000000000000000 0000000000000000
(XEN)    0000000000000003 000000000e200040 0000000000000004 0000000000000038
(XEN)    0000000000000005 0000000000000008 0000000000000006 0000000000001000
(XEN)    0000000000000007 00007f7f5dc00000 0000000000000008 0000000000000000
(XEN)    0000000000000009 000000000e201cd0 00000000000007d0 0000000000000000
(XEN)    00000000000007d1 0000000000000000 00000000000007d2 0000000000000000
(XEN)    00000000000007d3 0000000000000000 000000000000000d 00007f7fffcfd000
(XEN)    00000000000007de 00007f7fffcfc9d0 0000000000000000 0000000000000000
(XEN)    6e692f6e6962732f 0000000000007469 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-09 16:30               ` Manuel Bouyer
@ 2020-12-09 18:08                 ` Andrew Cooper
  2020-12-09 18:57                   ` Manuel Bouyer
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Cooper @ 2020-12-09 18:08 UTC (permalink / raw)
  To: Manuel Bouyer; +Cc: xen-devel

On 09/12/2020 16:30, Manuel Bouyer wrote:
> On Wed, Dec 09, 2020 at 04:00:02PM +0000, Andrew Cooper wrote:
>> [...]
>>>> I wonder if the LDT is set up correctly.
>>> I guess it is, otherwise it wouldn't boot with a Xen 4.13 kernel, would it ?
>> Well - you said you always saw it once on 4.13, which clearly shows that
>> something was wonky, but it managed to unblock itself.
>>
>>>> How about this incremental delta?
>>> Here's the output
>>> (XEN) IRET fault: #PF[0000]                                                    
>>> (XEN) %cr2 ffff820000010040, LDT base ffffc4800000a000, limit 0057             
>>> (XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
>>> (XEN) IRET fault: #PF[0000]                                                    
>>> (XEN) %cr2 ffff820000010040, LDT base ffffc4800000a000, limit 0057             
>>> (XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
>>> (XEN) IRET fault: #PF[0000]                                                 
>> Ok, so the promotion definitely fails, but we don't get as far as
>> inspecting the content of the LDT frame.  This probably means it failed
>> to change the page type, which probably means there are still
>> outstanding writeable references.
>>
>> I'm expecting the final printk to be the one which triggers.
> It's not. 
> Here's the output:
> (XEN) IRET fault: #PF[0000]
> (XEN) %cr2 ffff820000010040, LDT base ffffbd000000a000, limit 0057
> (XEN) *** LDT: gl1e 0000000000000000 not present
> (XEN) *** pv_map_ldt_shadow_page(0x40) failed
> (XEN) IRET fault: #PF[0000]
> (XEN) %cr2 ffff820000010040, LDT base ffffbd000000a000, limit 0057
> (XEN) *** LDT: gl1e 0000000000000000 not present
> (XEN) *** pv_map_ldt_shadow_page(0x40) failed

Ok.  So the mapping registered for the LDT is not yet present.  Xen
should be raising #PF with the guest, and would be in every case other
than the weird context on IRET, where we've confused bad guest state
with bad hypervisor state.

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 3ac07a84c3..35c24ed668 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1235,10 +1235,6 @@ static int handle_ldt_mapping_fault(unsigned int offset,
     {
         printk(XENLOG_ERR "*** pv_map_ldt_shadow_page(%#x) failed\n", offset);
 
-        /* In hypervisor mode? Leave it to the #PF handler to fix up. */
-        if ( !guest_mode(regs) )
-            return 0;
-
         /* Access would have become non-canonical? Pass #GP[sel] back. */
         if ( unlikely(!is_canonical_address(curr->arch.pv.ldt_base + offset)) )
         {


This bodge ought to cause a #PF to be delivered suitably, but may make
other corner cases not quite work correctly, so isn't a clean fix.
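
For reference, the is_canonical_address() check kept in that path is just the
usual x86-64 rule that bits 63:47 must be a sign-extension of bit 47; a
minimal sketch, assuming 48-bit virtual addresses:

#include <stdbool.h>
#include <stdint.h>

/* Sketch of the canonical-address rule: with 48-bit virtual addresses,
 * bits 63:47 must all equal bit 47. */
static inline bool canonical(uint64_t addr)
{
    return ((int64_t)addr >> 47) == ((int64_t)addr >> 63);
}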

~Andrew


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-09 18:08                 ` Andrew Cooper
@ 2020-12-09 18:57                   ` Manuel Bouyer
  2020-12-09 19:08                     ` Andrew Cooper
  0 siblings, 1 reply; 25+ messages in thread
From: Manuel Bouyer @ 2020-12-09 18:57 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

On Wed, Dec 09, 2020 at 06:08:53PM +0000, Andrew Cooper wrote:
> On 09/12/2020 16:30, Manuel Bouyer wrote:
> > On Wed, Dec 09, 2020 at 04:00:02PM +0000, Andrew Cooper wrote:
> >> [...]
> >>>> I wonder if the LDT is set up correctly.
> >>> I guess it is, otherwise it wouldn't boot with a Xen 4.13 kernel, would it ?
> >> Well - you said you always saw it once on 4.13, which clearly shows that
> >> something was wonky, but it managed to unblock itself.
> >>
> >>>> How about this incremental delta?
> >>> Here's the output
> >>> (XEN) IRET fault: #PF[0000]                                                    
> >>> (XEN) %cr2 ffff820000010040, LDT base ffffc4800000a000, limit 0057             
> >>> (XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
> >>> (XEN) IRET fault: #PF[0000]                                                    
> >>> (XEN) %cr2 ffff820000010040, LDT base ffffc4800000a000, limit 0057             
> >>> (XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
> >>> (XEN) IRET fault: #PF[0000]                                                 
> >> Ok, so the promotion definitely fails, but we don't get as far as
> >> inspecting the content of the LDT frame.  This probably means it failed
> >> to change the page type, which probably means there are still
> >> outstanding writeable references.
> >>
> >> I'm expecting the final printk to be the one which triggers.
> > It's not. 
> > Here's the output:
> > (XEN) IRET fault: #PF[0000]
> > (XEN) %cr2 ffff820000010040, LDT base ffffbd000000a000, limit 0057
> > (XEN) *** LDT: gl1e 0000000000000000 not present
> > (XEN) *** pv_map_ldt_shadow_page(0x40) failed
> > (XEN) IRET fault: #PF[0000]
> > (XEN) %cr2 ffff820000010040, LDT base ffffbd000000a000, limit 0057
> > (XEN) *** LDT: gl1e 0000000000000000 not present
> > (XEN) *** pv_map_ldt_shadow_page(0x40) failed
> 
> Ok.  So the mapping registered for the LDT is not yet present.  Xen
> should be raising #PF with the guest, and would be in every case other
> than the weird context on IRET, where we've confused bad guest state
> with bad hypervisor state.

Unfortunately it doesn't fix the problem. I'm now getting a loop of
(XEN) *** LDT: gl1e 0000000000000000 not present                               
(XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-09 18:57                   ` Manuel Bouyer
@ 2020-12-09 19:08                     ` Andrew Cooper
  2020-12-10  9:51                       ` Manuel Bouyer
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Cooper @ 2020-12-09 19:08 UTC (permalink / raw)
  To: Manuel Bouyer; +Cc: xen-devel

On 09/12/2020 18:57, Manuel Bouyer wrote:
> On Wed, Dec 09, 2020 at 06:08:53PM +0000, Andrew Cooper wrote:
>> On 09/12/2020 16:30, Manuel Bouyer wrote:
>>> On Wed, Dec 09, 2020 at 04:00:02PM +0000, Andrew Cooper wrote:
>>>> [...]
>>>>>> I wonder if the LDT is set up correctly.
>>>>> I guess it is, otherwise it wouldn't boot with a Xen 4.13 kernel, would it ?
>>>> Well - you said you always saw it once on 4.13, which clearly shows that
>>>> something was wonky, but it managed to unblock itself.
>>>>
>>>>>> How about this incremental delta?
>>>>> Here's the output
>>>>> (XEN) IRET fault: #PF[0000]                                                    
>>>>> (XEN) %cr2 ffff820000010040, LDT base ffffc4800000a000, limit 0057             
>>>>> (XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
>>>>> (XEN) IRET fault: #PF[0000]                                                    
>>>>> (XEN) %cr2 ffff820000010040, LDT base ffffc4800000a000, limit 0057             
>>>>> (XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
>>>>> (XEN) IRET fault: #PF[0000]                                                 
>>>> Ok, so the promotion definitely fails, but we don't get as far as
>>>> inspecting the content of the LDT frame.  This probably means it failed
>>>> to change the page type, which probably means there are still
>>>> outstanding writeable references.
>>>>
>>>> I'm expecting the final printk to be the one which triggers.
>>> It's not. 
>>> Here's the output:
>>> (XEN) IRET fault: #PF[0000]
>>> (XEN) %cr2 ffff820000010040, LDT base ffffbd000000a000, limit 0057
>>> (XEN) *** LDT: gl1e 0000000000000000 not present
>>> (XEN) *** pv_map_ldt_shadow_page(0x40) failed
>>> (XEN) IRET fault: #PF[0000]
>>> (XEN) %cr2 ffff820000010040, LDT base ffffbd000000a000, limit 0057
>>> (XEN) *** LDT: gl1e 0000000000000000 not present
>>> (XEN) *** pv_map_ldt_shadow_page(0x40) failed
>> Ok.  So the mapping registered for the LDT is not yet present.  Xen
>> should be raising #PF with the guest, and would be in every case other
>> than the weird context on IRET, where we've confused bad guest state
>> with bad hypervisor state.
> Unfortunately it doesn't fix the problem. I'm now getting a loop of
> (XEN) *** LDT: gl1e 0000000000000000 not present                               
> (XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  

Oh of course - we don't follow the exit-to-guest path on the way out here.

As a gross hack to check that we've at least diagnosed the issue
appropriately, could you modify NetBSD to explicitly load the %ss
selector into %es (or any other free segment) before first entering user
context?

If it is a sequence of LDT demand-faulting issues, that should cause them
to be fully resolved before Xen's IRET becomes the first actual LDT load.
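
Something along these lines, purely as an illustration of the idea
(placeholder names, not NetBSD's actual code):

#include <stdint.h>

/*
 * Illustration only: touch the user %ss selector from kernel context so
 * that any LDT demand-fault is taken (and fixed up, or turned into a
 * visible failure) before Xen's IRET is the first thing to load it.
 * 'user_ss' stands for whatever selector the trapframe hands back to Xen.
 */
static inline void prefault_user_ss(uint16_t user_ss)
{
    asm volatile ( "mov %0, %%es" :: "r" (user_ss) );
}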

~Andrew


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-09 19:08                     ` Andrew Cooper
@ 2020-12-10  9:51                       ` Manuel Bouyer
  2020-12-10 10:41                         ` Jan Beulich
                                           ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Manuel Bouyer @ 2020-12-10  9:51 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

On Wed, Dec 09, 2020 at 07:08:41PM +0000, Andrew Cooper wrote:
> Oh of course - we don't follow the exit-to-guest path on the way out here.
> 
> As a gross hack to check that we've at least diagnosed the issue
> appropriately, could you modify NetBSD to explicitly load the %ss
> selector into %es (or any other free segment) before first entering user
> context?

If I understood it properly, the user %ss is loaded by Xen from the
trapframe when the guest switches from kernel to user mode, isn't it ?
So you mean setting %es to the same value in the trapframe ?

Actually I used %fs because %es is set equal to %ds.
Xen 4.13 boots fine with this change, but with 4.15 I get a loop of:


(XEN) *** LDT: gl1e 0000000000000000 not present                               
(XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
[  12.3586540] Process (pid 1) got sig 11                                      

which means that the dom0 gets the trap, and decides that the fault address
is not mapped. Without the change the dom0 doesn't show the
"Process (pid 1) got sig 11"

I activated the NetBSD trap debug code, and this shows:
[   6.7165877] kern.module.path=/stand/amd64-xen/9.1/modules
(XEN) *** LDT: gl1e 0000000000000000 not present
(XEN) *** pv_map_ldt_shadow_page(0x40) failed
[   6.9462322] pid 1.1 (init): signal 11 code=1 (trap 0x6) @rip 0x7f7ef0c007d0 addr 0xffffbd800000a040 error=14
[   7.0647896] trapframe 0xffffbd80381cff00
[   7.1126288] rip 0x00007f7ef0c007d0  rsp 0x00007f7fff10aa30  rfl 0x0000000000000202
[   7.2041518] rdi 000000000000000000  rsi 000000000000000000  rdx 000000000000000000
[   7.2956758] rcx 000000000000000000  r8  000000000000000000  r9  000000000000000000
[   7.3872013] r10 000000000000000000  r11 000000000000000000  r12 000000000000000000
[   7.4787216] r13 000000000000000000  r14 000000000000000000  r15 000000000000000000
[   7.5702439] rbp 000000000000000000  rbx 0x00007f7fff10afe0  rax 000000000000000000
[   7.6617663] cs 0x47  ds 0x23  es 0x23  fs 0000  gs 0000  ss 0x3f
[   7.7345663] fsbase 000000000000000000 gsbase 000000000000000000

so it looks like something resets %fs to 0 ...

Anyway the fault address 0xffffbd800000a040 is in the hypervisor's range,
isn't it ?

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--



* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-10  9:51                       ` Manuel Bouyer
@ 2020-12-10 10:41                         ` Jan Beulich
  2020-12-10 15:51                         ` Andrew Cooper
  2020-12-11  8:58                         ` Jan Beulich
  2 siblings, 0 replies; 25+ messages in thread
From: Jan Beulich @ 2020-12-10 10:41 UTC (permalink / raw)
  To: Manuel Bouyer, Andrew Cooper; +Cc: xen-devel

On 10.12.2020 10:51, Manuel Bouyer wrote:
> On Wed, Dec 09, 2020 at 07:08:41PM +0000, Andrew Cooper wrote:
>> Oh of course - we don't follow the exit-to-guest path on the way out here.
>>
>> As a gross hack to check that we've at least diagnosed the issue
>> appropriately, could you modify NetBSD to explicitly load the %ss
>> selector into %es (or any other free segment) before first entering user
>> context?
> 
> If I understood it properly, the user %ss is loaded by Xen from the
> trapframe when the guest switches from kernel to user mode, isn't it ?
> So you mean setting %es to the same value in the trapframe ?
> 
> Actually I used %fs because %es is set equal to %ds.
> Xen 4.13 boots fine with this change, but with 4.15 I get a loop of:
> 
> 
> (XEN) *** LDT: gl1e 0000000000000000 not present                               
> (XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
> [  12.3586540] Process (pid 1) got sig 11                                      
> 
> which means that the dom0 gets the trap, and decides that the fault address
> is not mapped. Without the change the dom0 doesn't show the
> "Process (pid 1) got sig 11"
> 
> I activated the NetBSD trap debug code, and this shows:
> [   6.7165877] kern.module.path=/stand/amd64-xen/9.1/modules
> (XEN) *** LDT: gl1e 0000000000000000 not present
> (XEN) *** pv_map_ldt_shadow_page(0x40) failed
> [   6.9462322] pid 1.1 (init): signal 11 code=1 (trap 0x6) @rip 0x7f7ef0c007d0 addr 0xffffbd800000a040 error=14
> [   7.0647896] trapframe 0xffffbd80381cff00
> [   7.1126288] rip 0x00007f7ef0c007d0  rsp 0x00007f7fff10aa30  rfl 0x0000000000000202
> [   7.2041518] rdi 000000000000000000  rsi 000000000000000000  rdx 000000000000000000
> [   7.2956758] rcx 000000000000000000  r8  000000000000000000  r9  000000000000000000
> [   7.3872013] r10 000000000000000000  r11 000000000000000000  r12 000000000000000000
> [   7.4787216] r13 000000000000000000  r14 000000000000000000  r15 000000000000000000
> [   7.5702439] rbp 000000000000000000  rbx 0x00007f7fff10afe0  rax 000000000000000000
> [   7.6617663] cs 0x47  ds 0x23  es 0x23  fs 0000  gs 0000  ss 0x3f
> [   7.7345663] fsbase 000000000000000000 gsbase 000000000000000000
> 
> so it looks like something resets %fs to 0 ...
> 
> Anyway the fault address 0xffffbd800000a040 is in the hypervisor's range,
> isn't it ?

No, the hypervisor range is 0xffff800000000000-0xffff880000000000.

Jan



* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-10  9:51                       ` Manuel Bouyer
  2020-12-10 10:41                         ` Jan Beulich
@ 2020-12-10 15:51                         ` Andrew Cooper
  2020-12-10 17:03                           ` Manuel Bouyer
  2020-12-11  8:58                         ` Jan Beulich
  2 siblings, 1 reply; 25+ messages in thread
From: Andrew Cooper @ 2020-12-10 15:51 UTC (permalink / raw)
  To: Manuel Bouyer; +Cc: xen-devel

On 10/12/2020 09:51, Manuel Bouyer wrote:
> On Wed, Dec 09, 2020 at 07:08:41PM +0000, Andrew Cooper wrote:
>> Oh of course - we don't follow the exit-to-guest path on the way out here.
>>
>> As a gross hack to check that we've at least diagnosed the issue
>> appropriately, could you modify NetBSD to explicitly load the %ss
>> selector into %es (or any other free segment) before first entering user
>> context?
> If I understood it properly, the user %ss is loaded by Xen from the
> trapframe when the guest switches from kernel to user mode, isn't it ?

Yes.  The kernel invokes HYPERCALL_iret, and Xen copies/audits the
provided trapframe, and uses it to actually enter userspace.

> So you mean setting %es to the same value in the trapframe ?

Yes - specifically I wanted to force the LDT reference to happen in a
context where demand-faulting should work, so all the mappings get set
up properly before we first encounter the LDT reference in Xen's IRET
instruction.

And to be clear, there is definitely a bug needing fixing here in Xen in
terms of handling IRET faults caused by guest state.  However, it looks
like this isn't the root of the problem - merely some very weird
collateral damage.

> Actually I used %fs because %es is set equal to %ds.
> Xen 4.13 boots fine with this change, but with 4.15 I get a loop of:
>
>
> (XEN) *** LDT: gl1e 0000000000000000 not present                               
> (XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  
> [  12.3586540] Process (pid 1) got sig 11                                      
>
> which means that the dom0 gets the trap, and decides that the fault address
> is not mapped. Without the change the dom0 doesn't show the
> "Process (pid 1) got sig 11"
>
> I activated the NetBSD trap debug code, and this shows:
> [   6.7165877] kern.module.path=/stand/amd64-xen/9.1/modules
> (XEN) *** LDT: gl1e 0000000000000000 not present                                
> (XEN) *** pv_map_ldt_shadow_page(0x40) failed
> [   6.9462322] pid 1.1 (init): signal 11 code=1 (trap 0x6) @rip 0x7f7ef0c007d0 addr 0xffffbd800000a040 error=14
> [   7.0647896] trapframe 0xffffbd80381cff00
> [   7.1126288] rip 0x00007f7ef0c007d0  rsp 0x00007f7fff10aa30  rfl 0x0000000000000202
> [   7.2041518] rdi 000000000000000000  rsi 000000000000000000  rdx 000000000000000000
> [   7.2956758] rcx 000000000000000000  r8  000000000000000000  r9  000000000000000000
> [   7.3872013] r10 000000000000000000  r11 000000000000000000  r12 000000000000000000
> [   7.4787216] r13 000000000000000000  r14 000000000000000000  r15 000000000000000000
> [   7.5702439] rbp 000000000000000000  rbx 0x00007f7fff10afe0  rax 000000000000000000
> [   7.6617663] cs 0x47  ds 0x23  es 0x23  fs 0000  gs 0000  ss 0x3f
> [   7.7345663] fsbase 000000000000000000 gsbase 000000000000000000
>
> so it looks like something resets %fs to 0 ...
>
> Anyway the fault address 0xffffbd800000a040 is in the hypervisor's range,
> isn't it ?

No.  It's the kernel's LDT.  From previous debugging:
> (XEN) %cr2 ffff820000010040, LDT base ffffbd000000a000, limit 0057

LDT handling in Xen is a bit complicated.  To maintain host safety, we
must map it into Xen's range, and we explicitly support a PV guest doing
on-demand mapping of the LDT.  (This pertains to the experimental
Windows XP PV support which never made it beyond a prototype.  Windows
can page out the LDT.)  Either way, we lazily map the LDT frames on
first use.

So %cr2 is the real hardware faulting address, and is in the Xen range. 
We spot that it is an LDT access, and try to lazily map the frame (at
LDT base), but find that the kernel's virtual address mapping
0xffffbd000000a000 is not present (the gl1e printk).

Therefore, we pass #PF to the guest kernel, adjusting vCR2 to what would
have happened had Xen not mapped the real LDT elsewhere, which is
expected to cause the guest kernel to do whatever demand mapping is
necessary to pull the LDT back in.
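
In pseudo-C, that path looks roughly like this (not the literal Xen source;
the helper names are approximate):

/* Rough sketch of the fault path described above, not the real code. */
static bool handle_ldt_fault(unsigned int offset, struct cpu_user_regs *regs)
{
    struct vcpu *curr = current;

    /* Try to copy the guest's LDT frame into Xen's mapping area. */
    if ( pv_map_ldt_shadow_page(offset) )
        return true;                  /* mapped - retry the faulting access */

    if ( guest_mode(regs) )
    {
        /* Forward #PF, rewriting vCR2 to the guest's own LDT address. */
        pv_inject_page_fault(0, curr->arch.pv.ldt_base + offset);
        return true;
    }

    return false;                     /* faulted in Xen context - a Xen bug */
}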


I suppose it is worth taking a step back and ascertaining how exactly
NetBSD handles (or, should be handling) the LDT.

Do you mind elaborating on how it is supposed to work?

~Andrew



* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-10 15:51                         ` Andrew Cooper
@ 2020-12-10 17:03                           ` Manuel Bouyer
  2020-12-10 17:18                             ` Andrew Cooper
  0 siblings, 1 reply; 25+ messages in thread
From: Manuel Bouyer @ 2020-12-10 17:03 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

On Thu, Dec 10, 2020 at 03:51:46PM +0000, Andrew Cooper wrote:
> > [   7.6617663] cs 0x47  ds 0x23  es 0x23  fs 0000  gs 0000  ss 0x3f
> > [   7.7345663] fsbase 000000000000000000 gsbase 000000000000000000
> >
> > so it looks like something resets %fs to 0 ...
> >
> > Anyway the fault address 0xffffbd800000a040 is in the hypervisor's range,
> > isn't it ?
> 
> No.  It's the kernel's LDT.  From previous debugging:
> > (XEN) %cr2 ffff820000010040, LDT base ffffbd000000a000, limit 0057
> 
> LDT handling in Xen is a bit complicated.  To maintain host safety, we
> must map it into Xen's range, and we explicitly support a PV guest doing
> on-demand mapping of the LDT.  (This pertains to the experimental
> Windows XP PV support which never made it beyond a prototype.  Windows
> can page out the LDT.)  Either way, we lazily map the LDT frames on
> first use.
> 
> So %cr2 is the real hardware faulting address, and is in the Xen range. 
> We spot that it is an LDT access, and try to lazily map the frame (at
> LDT base), but find that the kernel's virtual address mapping
> 0xffffbd000000a000 is not present (the gl1e printk).
> 
> Therefore, we pass #PF to the guest kernel, adjusting vCR2 to what would
> have happened had Xen not mapped the real LDT elsewhere, which is
> expected to cause the guest kernel to do whatever demand mapping is
> necessary to pull the LDT back in.
> 
> 
> I suppose it is worth taking a step back and ascertaining how exactly
> NetBSD handles (or, should be handling) the LDT.
> 
> Do you mind elaborating on how it is supposed to work?

Note that I'm not familiar with this selector stuff; and I usually get
it wrong the first time I go back to it.

AFAIK, in the Xen PV case, a page is allocated and mapped in kernel
space, and registered to Xen with MMUEXT_SET_LDT.
From what I found, in the common case the LDT is the same for all processes.
Does it make sense ?
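
To spell out what I mean, the registration step boils down to something like
this (a simplified sketch against the public Xen interface; the wrapper name
and header path are mine rather than the actual NetBSD code):

#include <sys/types.h>          /* vaddr_t */
#include <xen/xen.h>            /* struct mmuext_op; header path varies */

static void
pv_register_ldt(vaddr_t ldt_va, unsigned int nentries)
{
        struct mmuext_op op;

        op.cmd = MMUEXT_SET_LDT;
        op.arg1.linear_addr = ldt_va;   /* page-aligned kernel VA of the LDT */
        op.arg2.nr_ents = nentries;

        /* One op, no progress counter, acting on the current domain. */
        HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF);
}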

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--



* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-10 17:03                           ` Manuel Bouyer
@ 2020-12-10 17:18                             ` Andrew Cooper
  2020-12-10 17:35                               ` Manuel Bouyer
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Cooper @ 2020-12-10 17:18 UTC (permalink / raw)
  To: Manuel Bouyer; +Cc: xen-devel

On 10/12/2020 17:03, Manuel Bouyer wrote:
> On Thu, Dec 10, 2020 at 03:51:46PM +0000, Andrew Cooper wrote:
>>> [   7.6617663] cs 0x47  ds 0x23  es 0x23  fs 0000  gs 0000  ss 0x3f
>>> [   7.7345663] fsbase 000000000000000000 gsbase 000000000000000000
>>>
>>> so it looks like something resets %fs to 0 ...
>>>
>>> Anyway the fault address 0xffffbd800000a040 is in the hypervisor's range,
>>> isn't it ?
>> No.  It's the kernel's LDT.  From previous debugging:
>>> (XEN) %cr2 ffff820000010040, LDT base ffffbd000000a000, limit 0057
>> LDT handling in Xen is a bit complicated.  To maintain host safety, we
>> must map it into Xen's range, and we explicitly support a PV guest doing
>> on-demand mapping of the LDT.  (This pertains to the experimental
>> Windows XP PV support which never made it beyond a prototype.  Windows
>> can page out the LDT.)  Either way, we lazily map the LDT frames on
>> first use.
>>
>> So %cr2 is the real hardware faulting address, and is in the Xen range. 
>> We spot that it is an LDT access, and try to lazily map the frame (at
>> LDT base), but find that the kernel's virtual address mapping
>> 0xffffbd000000a000 is not present (the gl1e printk).
>>
>> Therefore, we pass #PF to the guest kernel, adjusting vCR2 to what would
>> have happened had Xen not mapped the real LDT elsewhere, which is
>> expected to cause the guest kernel to do whatever demand mapping is
>> necessary to pull the LDT back in.
>>
>>
>> I suppose it is worth taking a step back and ascertaining how exactly
>> NetBSD handles (or, should be handling) the LDT.
>>
>> Do you mind elaborating on how it is supposed to work?
> Note that I'm not familiar with this selector stuff; and I usually get
> it wrong the first time I go back to it.
>
> AFAIK, in the Xen PV case, a page is allocated and mapped in kernel
> space, and registered to Xen with MMUEXT_SET_LDT.
> From what I found, in the common case the LDT is the same for all processes.
> Does it make sense ?

The debugging earlier shows that MMUEXT_SET_LDT has indeed been called. 
Presumably 0xffffbd000000a000 is a plausible virtual address for NetBSD
to position the LDT?

However, Xen finds the mapping not-present when trying to demand-map it,
hence why the #PF is forwarded to the kernel.

The way we pull guest virtual addresses was altered by XSA-286 (released
not too long ago despite its apparent age), but *should* have been no
functional change.  I wonder if we accidentally broke something there. 
What exactly are you running, Xen-wise, with the 4.13 version?

Given that this is init failing, presumably the issue would repro with
the net installer version too?

~Andrew



* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-10 17:18                             ` Andrew Cooper
@ 2020-12-10 17:35                               ` Manuel Bouyer
  2020-12-10 21:01                                 ` Andrew Cooper
  0 siblings, 1 reply; 25+ messages in thread
From: Manuel Bouyer @ 2020-12-10 17:35 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

On Thu, Dec 10, 2020 at 05:18:39PM +0000, Andrew Cooper wrote:
> The debugging earlier shows that MMUEXT_SET_LDT has indeed been called. 
> Presumably 0xffffbd000000a000 is a plausible virtual address for NetBSD
> to position the LDT?

Yes, it is. 

> 
> However, Xen finds the mapping not-present when trying to demand-map it,
> hence why the #PF is forwarded to the kernel.
> 
> The way we pull guest virtual addresses was altered by XSA-286 (released
> not too long ago despite its apparent age), but *should* have been no
> functional change.  I wonder if we accidentally broke something there. 
> What exactly are you running, Xen-wise, with the 4.13 version?

It is 4.13.2, with the patch for XSA351

> 
> Given that this is init failing, presumably the issue would repro with
> the net installer version too?

Hopefully yes, maybe even as a domU. But I don't have a linux dom0 to test.

If you have a Xen setup you can test with
http://ftp.netbsd.org/pub/NetBSD/NetBSD-9.1/amd64/binary/kernel/netbsd-INSTALL_XEN3_DOMU.gz

note that this won't boot as a dom0 kernel.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--



* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-10 17:35                               ` Manuel Bouyer
@ 2020-12-10 21:01                                 ` Andrew Cooper
  2020-12-11 10:47                                   ` Manuel Bouyer
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Cooper @ 2020-12-10 21:01 UTC (permalink / raw)
  To: Manuel Bouyer; +Cc: xen-devel

On 10/12/2020 17:35, Manuel Bouyer wrote:
> On Thu, Dec 10, 2020 at 05:18:39PM +0000, Andrew Cooper wrote:
>> However, Xen finds the mapping not-present when trying to demand-map it,
>> hence why the #PF is forwarded to the kernel.
>>
>> The way we pull guest virtual addresses was altered by XSA-286 (released
>> not too long ago despite its apparent age), but *should* have been no
>> functional change.  I wonder if we accidentally broke something there. 
>> What exactly are you running, Xen-wise, with the 4.13 version?
> It is 4.13.2, with the patch for XSA351

Thanks,

>> Given that this is init failing, presumably the issue would repro with
>> the net installer version too?
> Hopefully yes, maybe even as a domU. But I don't have a linux dom0 to test.
>
> If you have a Xen setup you can test with
> http://ftp.netbsd.org/pub/NetBSD/NetBSD-9.1/amd64/binary/kernel/netbsd-INSTALL_XEN3_DOMU.gz
>
> note that this won't boot as a dom0 kernel.

I've repro'd the problem.

When I modify Xen to explicitly demand-map the LDT in the MMUEXT_SET_LDT
hypercall, everything works fine.

Specifically, this delta:

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 723cc1070f..71a791d877 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3742,12 +3742,31 @@ long do_mmuext_op(
             else if ( (curr->arch.pv.ldt_ents != ents) ||
                       (curr->arch.pv.ldt_base != ptr) )
             {
+                unsigned int err = 0, tmp;
+
                 if ( pv_destroy_ldt(curr) )
                     flush_tlb_local();
 
                 curr->arch.pv.ldt_base = ptr;
                 curr->arch.pv.ldt_ents = ents;
                 load_LDT(curr);
+
+                printk("Probe new LDT\n");
+                asm volatile (
+                    "mov %%es, %[tmp];\n\t"
+                    "1: mov %[sel], %%es;\n\t"
+                    "mov %[tmp], %%es;\n\t"
+                    "2:\n\t"
+                    ".section .fixup,\"ax\"\n"
+                    "3: mov $1, %[err];\n\t"
+                    "jmp 2b\n\t"
+                    ".previous\n\t"
+                    _ASM_EXTABLE(1b, 3b)
+                    : [err] "+r" (err),
+                      [tmp] "=&r" (tmp)
+                    : [sel] "r" (0x3f)
+                    : "memory");
+                printk("  => err %u\n", err);
             }
             break;
         }

Which stashes %es, explicitly loads init's %ss selector to trigger the
#PF and Xen's lazy mapping, then restores %es.

(XEN) d1v0 Dropping PAT write of 0007010600070106
(XEN) Probe new LDT
(XEN) *** LDT Successful map, slot 0
(XEN)   => err 0
(XEN) d1 L1TF-vulnerable L4e 0000000801e88000 - Shadowing

And the domain is up and running:

# xl list
Name                                        ID   Mem VCPUs    State   Time(s)
Domain-0                                     0  2656     8    r-----      44.6
netbsd                                       1   256     1    -b----       5.3

(Probably confused about the fact I gave it no disk...)

Now, in this case, we find that the virtual address provided for the LDT
is mapped, so we successfully copy the mapping into Xen's area, and init
runs happily.

So the mystery is why the LDT virtual address is not-present when Xen
tries to lazily map the LDT at the normal point...

Presumably you've got no Meltdown mitigations going on within the NetBSD
kernel?  (I suspect not, seeing as changing Xen changes the behaviour,
but it is worth asking).

~Andrew



* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-10  9:51                       ` Manuel Bouyer
  2020-12-10 10:41                         ` Jan Beulich
  2020-12-10 15:51                         ` Andrew Cooper
@ 2020-12-11  8:58                         ` Jan Beulich
  2020-12-11 11:15                           ` Manuel Bouyer
  2 siblings, 1 reply; 25+ messages in thread
From: Jan Beulich @ 2020-12-11  8:58 UTC (permalink / raw)
  To: Manuel Bouyer; +Cc: xen-devel, Andrew Cooper

On 10.12.2020 10:51, Manuel Bouyer wrote:
> On Wed, Dec 09, 2020 at 07:08:41PM +0000, Andrew Cooper wrote:
>> Oh of course - we don't follow the exit-to-guest path on the way out here.
>>
>> As a gross hack to check that we've at least diagnosed the issue
>> appropriately, could you modify NetBSD to explicitly load the %ss
>> selector into %es (or any other free segment) before first entering user
>> context?
> 
> If I understood it properly, the user %ss is loaded by Xen from the
> trapframe when the guest switches from kernel to user mode, isn't it ?
> So you mean setting %es to the same value in the trapframe ?
> 
> Actually I used %fs because %es is set equal to %ds.
> Xen 4.13 boots fine with this change, but with 4.15 I get a loop of:
> 
> 
> (XEN) *** LDT: gl1e 0000000000000000 not present                               
> (XEN) *** pv_map_ldt_shadow_page(0x40) failed                                  

Could you please revert 9ff970564764 ("x86/mm: drop guest_get_eff_l1e()")?
I think there was a thinko there in that the change can't be split from
the bigger one which was part of the originally planned set for XSA-286.
We mustn't avoid the switching of page tables as long as
guest_get_eff{,_kern}_l1e() makes use of the linear page tables.
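
For reference, before that change the kernel-view helper was roughly of the
following shape (reconstructed from memory, so details may differ); the point
is the explicit page-table switch around the read through the linear page
tables:

static l1_pgentry_t guest_get_eff_kern_l1e(unsigned long linear)
{
    struct vcpu *curr = current;
    const bool user_mode = !(curr->arch.flags & TF_kernel_mode);
    l1_pgentry_t l1e;

    if ( user_mode )
        toggle_guest_pt(curr);          /* switch to the guest kernel's %cr3 */

    l1e = guest_get_eff_l1e(linear);    /* read via the linear page tables */

    if ( user_mode )
        toggle_guest_pt(curr);

    return l1e;
}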

Jan



* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-10 21:01                                 ` Andrew Cooper
@ 2020-12-11 10:47                                   ` Manuel Bouyer
  0 siblings, 0 replies; 25+ messages in thread
From: Manuel Bouyer @ 2020-12-11 10:47 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

On Thu, Dec 10, 2020 at 09:01:12PM +0000, Andrew Cooper wrote:
> I've repro'd the problem.
> 
> When I modify Xen to explicitly demand-map the LDT in the MMUEXT_SET_LDT
> hypercall, everything works fine.
> 
> Specifically, this delta:
> 
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 723cc1070f..71a791d877 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -3742,12 +3742,31 @@ long do_mmuext_op(
>              else if ( (curr->arch.pv.ldt_ents != ents) ||
>                        (curr->arch.pv.ldt_base != ptr) )
>              {
> +                unsigned int err = 0, tmp;
> +
>                  if ( pv_destroy_ldt(curr) )
>                      flush_tlb_local();
>  
>                  curr->arch.pv.ldt_base = ptr;
>                  curr->arch.pv.ldt_ents = ents;
>                  load_LDT(curr);
> +
> +                printk("Probe new LDT\n");
> +                asm volatile (
> +                    "mov %%es, %[tmp];\n\t"
> +                    "1: mov %[sel], %%es;\n\t"
> +                    "mov %[tmp], %%es;\n\t"
> +                    "2:\n\t"
> +                    ".section .fixup,\"ax\"\n"
> +                    "3: mov $1, %[err];\n\t"
> +                    "jmp 2b\n\t"
> +                    ".previous\n\t"
> +                    _ASM_EXTABLE(1b, 3b)
> +                    : [err] "+r" (err),
> +                      [tmp] "=&r" (tmp)
> +                    : [sel] "r" (0x3f)
> +                    : "memory");
> +                printk("  => err %u\n", err);
>              }
>              break;
>          }
> 
> Which stashes %es, explicitly loads init's %ss selector to trigger the
> #PF and Xen's lazy mapping, then restores %es.

Yes, this works for dom0 too; I have it running multiuser.

> [...]
> 
> Presumably you've got no Meltdown mitigations going on within the NetBSD
> kernel?  (I suspect not, seeing as changing Xen changes the behaviour,
> but it is worth asking).

No, there are no Meltdown mitigations for PV in NetBSD. As I see it,
for amd64 at least, the Xen kernel has to do it anyway, so it's not useful
to implement it in the guest's kernel. Did I miss something ?

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--



* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-11  8:58                         ` Jan Beulich
@ 2020-12-11 11:15                           ` Manuel Bouyer
  2020-12-11 13:56                             ` Andrew Cooper
  0 siblings, 1 reply; 25+ messages in thread
From: Manuel Bouyer @ 2020-12-11 11:15 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper

On Fri, Dec 11, 2020 at 09:58:54AM +0100, Jan Beulich wrote:
> Could you please revert 9ff970564764 ("x86/mm: drop guest_get_eff_l1e()")?
> I think there was a thinko there in that the change can't be split from
> the bigger one which was part of the originally planned set for XSA-286.
> We mustn't avoid the switching of page tables as long as
> guest_get_eff{,_kern}_l1e() makes use of the linear page tables.

Yes, reverting this commit also makes the dom0 boot.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--



* Re: dom0 PV looping on search_pre_exception_table()
  2020-12-11 11:15                           ` Manuel Bouyer
@ 2020-12-11 13:56                             ` Andrew Cooper
  0 siblings, 0 replies; 25+ messages in thread
From: Andrew Cooper @ 2020-12-11 13:56 UTC (permalink / raw)
  To: Manuel Bouyer, Jan Beulich; +Cc: xen-devel

On 11/12/2020 11:15, Manuel Bouyer wrote:
> On Fri, Dec 11, 2020 at 09:58:54AM +0100, Jan Beulich wrote:
>> Could you please revert 9ff970564764 ("x86/mm: drop guest_get_eff_l1e()")?
>> I think there was a thinko there in that the change can't be split from
>> the bigger one which was part of the originally planned set for XSA-286.
>> We mustn't avoid the switching of page tables as long as
>> guest_get_eff{,_kern}_l1e() makes use of the linear page tables.
> Yes, reverting this commit also makes the dom0 boot.
>

This was going to be my next area of investigation.  Thanks for confirming.

In hindsight, the bug is very obvious...

~Andrew



