On 09/12/2020 10:15, Manuel Bouyer wrote:
> On Tue, Dec 08, 2020 at 06:13:46PM +0000, Andrew Cooper wrote:
>> On 08/12/2020 17:57, Manuel Bouyer wrote:
>>> Hello,
>>> for the first time I tried to boot a xen kernel from devel with
>>> a NetBSD PV dom0. The kernel boots, but when the first userland process
>>> is launched, it seems to enter a loop involving search_pre_exception_table()
>>> (I see an endless stream from the dprintk() at arch/x86/extable.c:202)
>>>
>>> With xen 4.13 I see it, but exactly once:
>>> (XEN) extable.c:202: Pre-exception: ffff82d08038c304 -> ffff82d08038c8c8
>>>
>>> with devel:
>>> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
>>> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
>>> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
>>> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
>>> (XEN) extable.c:202: Pre-exception: ffff82d040393309 -> ffff82d0403938c8        
>>> [...]
>>>
>>> the dom0 kernel is the same.
>>>
>>> At first glance it looks like a fault in the guest is not handled as it should,
>>> and the userland process keeps faulting on the same address.
>>>
>>> Any idea what to look at ?
>> That is a reoccurring fault on IRET back to guest context, and is
>> probably caused by some unwise-in-hindsight cleanup which doesn't
>> escalate the failure to the failsafe callback.
>>
>> This ought to give something useful to debug with:
> thanks, I got:
> (XEN) IRET fault: #PF[0000]                                                 
> (XEN) domain_crash called from extable.c:209                                
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:                                   
> (XEN) ----[ Xen-4.15-unstable  x86_64  debug=y   Tainted:   C   ]----       
> (XEN) CPU:    0                                                             
> (XEN) RIP:    0047:[<00007f7e184007d0>]                                     
> (XEN) RFLAGS: 0000000000000202   EM: 0   CONTEXT: pv guest (d0v0)           
> (XEN) rax: ffff82d04038c309   rbx: 0000000000000000   rcx: 000000000000e008 
> (XEN) rdx: 0000000000010086   rsi: ffff83007fcb7f78   rdi: 000000000000e010 
> (XEN) rbp: 0000000000000000   rsp: 00007f7fff53e3e0   r8:  0000000e00000000 
> (XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000 
> (XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000 
> (XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 0000000000002660 
> (XEN) cr3: 0000000079cdb000   cr2: 00007f7fff53e3e0                         
> (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: ffffffff80cf2dc0    
> (XEN) ds: 0023   es: 0023   fs: 0000   gs: 0000   ss: 003f   cs: 0047       
> (XEN) Guest stack trace from rsp=00007f7fff53e3e0:          
> (XEN)    0000000000000001 00007f7fff53e8f8 0000000000000000 0000000000000000
> (XEN)    0000000000000003 000000004b600040 0000000000000004 0000000000000038
> (XEN)    0000000000000005 0000000000000008 0000000000000006 0000000000001000
> (XEN)    0000000000000007 00007f7e18400000 0000000000000008 0000000000000000
> (XEN)    0000000000000009 000000004b601cd0 00000000000007d0 0000000000000000
> (XEN)    00000000000007d1 0000000000000000 00000000000007d2 0000000000000000
> (XEN)    00000000000007d3 0000000000000000 000000000000000d 00007f7fff53f000
> (XEN)    00000000000007de 00007f7fff53e4e0 0000000000000000 0000000000000000
> (XEN)    6e692f6e6962732f 0000000000007469 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.

Pagefaults on IRET come either from stack accesses for operands (not the
case here as Xen is otherwise working fine), or from segment selector
loads for %cs and %ss.
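
For reference, the "#PF[0000]" in the log above can be decoded with a few
lines of C.  This is only an illustrative sketch, not code from the Xen
tree: the macro and function names are made up here, and only the bit
positions (the architectural x86 page-fault error code layout) are taken
as given.  An error code of 0 reads as a supervisor-mode data read of a
not-present page, which is what an implicit descriptor-table access would
be expected to produce.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Architectural x86 #PF error code bits (names are illustrative only). */
#define EX_PFEC_PRESENT (1u << 0) /* 0: page not present, 1: protection violation */
#define EX_PFEC_WRITE   (1u << 1) /* 0: read access, 1: write access */
#define EX_PFEC_USER    (1u << 2) /* 0: supervisor access, 1: user access */
#define EX_PFEC_RSVD    (1u << 3) /* reserved bit set in a paging entry */
#define EX_PFEC_INSN    (1u << 4) /* fault was an instruction fetch */

struct ex_pfec {
    bool present, write, user, rsvd, insn_fetch;
};

/* Split an error code into its individual flags. */
static inline struct ex_pfec ex_decode_pfec(uint32_t ec)
{
    struct ex_pfec d = {
        .present    = (ec & EX_PFEC_PRESENT) != 0,
        .write      = (ec & EX_PFEC_WRITE) != 0,
        .user       = (ec & EX_PFEC_USER) != 0,
        .rsvd       = (ec & EX_PFEC_RSVD) != 0,
        .insn_fetch = (ec & EX_PFEC_INSN) != 0,
    };
    return d;
}

/* Print a one-line human-readable summary of an error code. */
static inline void ex_print_pfec(uint32_t ec)
{
    struct ex_pfec d = ex_decode_pfec(ec);

    printf("#PF[%04x]: %s, %s, %s%s%s\n", (unsigned int)ec,
           d.present ? "protection violation" : "not present",
           d.write ? "write" : "read",
           d.user ? "user" : "supervisor",
           d.rsvd ? ", reserved bit set" : "",
           d.insn_fetch ? ", instruction fetch" : "");
}
```

Calling ex_print_pfec(0x0000) reports a not-present, read, supervisor
access.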

In this example, %ss is in the LDT, which specifically does use
pagefaults to promote the frame to PGT_segdesc.
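
To show how the selector values in the register dump map onto GDT vs LDT,
here is a small sketch of the standard x86 selector layout (bits 0-1 RPL,
bit 2 table indicator, bits 3-15 index).  The helper names are
hypothetical and exist only for this example; they are not Xen functions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Requested privilege level: low two bits of the selector. */
static inline unsigned int ex_sel_rpl(uint16_t sel)
{
    return sel & 3u;
}

/* Table indicator: bit 2 set means the descriptor lives in the LDT. */
static inline bool ex_sel_is_ldt(uint16_t sel)
{
    return (sel & 4u) != 0;
}

/* Descriptor index within the selected table: bits 3-15. */
static inline unsigned int ex_sel_index(uint16_t sel)
{
    return sel >> 3;
}

/* Print one selector in decoded form. */
static inline void ex_print_selector(const char *name, uint16_t sel)
{
    printf("%s=%04x: %s index %u, RPL %u\n", name, (unsigned int)sel,
           ex_sel_is_ldt(sel) ? "LDT" : "GDT",
           ex_sel_index(sel), ex_sel_rpl(sel));
}
```

With the values from the dump, ss=003f decodes as LDT index 7, RPL 3, and
cs=0047 as LDT index 8, RPL 3, while ds/es=0023 are GDT index 4, RPL 3.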

I suspect that what is happening is that handle_ldt_mapping_fault() is
failing to promote the page (for some reason), and we're taking the "In
hypervisor mode? Leave it to the #PF handler to fix up." path due to the
confusion in context, and Xen's #PF handler is concluding "nothing else
to do".

The older behaviour of escalating to the failsafe callback would have
broken this cycle by rewriting %ss and re-entering the kernel.


Please try the attached debugging patch, which is an extension of what I
gave you yesterday.  First, it ought to print %cr2, which I expect will
point to Xen's virtual mapping of the vcpu's LDT.  The logic ought to
loop a few times so we can inspect the hypervisor codepaths which are
effectively livelocked in this state, and I've also instrumented
check_descriptor() failures because I've got a gut feeling that is the
root cause of the problem.

~Andrew