On 04/16/2018 11:21 PM, George Dunlap wrote:
> On Mon, Apr 16, 2018 at 7:46 PM, Razvan Cojocaru wrote:
>> On 04/16/2018 08:47 PM, George Dunlap wrote:
>>> On 04/13/2018 03:44 PM, Razvan Cojocaru wrote:
>>>> On 04/11/2018 11:04 AM, Razvan Cojocaru wrote:
>>>>> Debugging continues.
>>>>
>>>> Finally, the attached patch seems to get the display unstuck in my
>>>> scenario, although for one guest I get:
>>>>
>>>> (XEN) d2v0 Unexpected vmexit: reason 49
>>>> (XEN) domain_crash called from vmx.c:4120
>>>> (XEN) Domain 2 (vcpu#0) crashed on cpu#1:
>>>> (XEN) ----[ Xen-4.11-unstable x86_64 debug=y Not tainted ]----
>>>> (XEN) CPU: 1
>>>> (XEN) RIP: 0010:[]
>>>> (XEN) RFLAGS: 0000000000010246 CONTEXT: hvm guest (d2v0)
>>>> (XEN) rax: fffff88003000000 rbx: fffff900c0083db0 rcx: 00000000aa55aa55
>>>> (XEN) rdx: fffffa80041bdc41 rsi: fffff900c00c69a0 rdi: 0000000000000001
>>>> (XEN) rbp: 0000000000000000 rsp: fffff88002ee9ef0 r8: fffffa80041bdc40
>>>> (XEN) r9: fffff80001810e80 r10: fffffa800342aa70 r11: fffff88002ee9e80
>>>> (XEN) r12: 0000000000000005 r13: 0000000000000001 r14: fffff900c00c08b0
>>>> (XEN) r15: 0000000000000001 cr0: 0000000080050031 cr4: 00000000000406f8
>>>> (XEN) cr3: 00000000ef771000 cr2: fffff900c00c8000
>>>> (XEN) fsb: 00000000fffde000 gsb: fffff80001810d00 gss: 000007fffffdc000
>>>> (XEN) ds: 002b es: 002b fs: 0053 gs: 002b ss: 0018 cs: 0010
>>>>
>>>> i.e. EXIT_REASON_EPT_MISCONFIG - so not out of the woods yet. I am hoping
>>>> somebody more familiar with the code can point to a more elegant
>>>> solution if one exists.
>>>
>>> I think I have an idea what's going on, but it's complicated. :-)
>>>
>>> Basically, the logdirty functionality isn't simple, and needs careful
>>> thought on how to integrate it. I'll write some more tomorrow, and see
>>> if I can come up with a solution.
>>
>> I think I know why this happens for the one guest - the other guests
>> start at a certain resolution display-wise and stay that way until shutdown.
>>
>> This particular guest starts with a larger screen, then goes to roughly
>> 2/3rds of it, then tries to go back to the initial larger one - at which
>> point the above happens. I assume this corresponds to some pages being
>> removed and/or added. I'll test this theory more tomorrow - if it's
>> correct I should be able to reproduce the crash (with the patch) by
>> simply resetting the screen resolution (increasing it).
>
> The trick is that p2m_change_type doesn't actually iterate over the
> entire p2m range, individually changing entries as it goes. Instead
> it misconfigures the entries at the top-level, which causes the kinds
> of faults shown above. As it gets faults for each entry, it checks
> the current type, the logdirty ranges, and the global logdirty bit to
> determine what the new types should be.
>
> Your patch makes it so that all the altp2ms now get the
> misconfiguration when the logdirty range is changed; but clearly
> handling the misconfiguration isn't integrated properly with the
> altp2m system yet. Doing it right may take some thought.

FWIW, the attached patch has solved the misconfig-related domain crash
for me (though I'm very likely missing some subtleties). It all seems to
work as expected when enabling altp2m and switching early to a new view.
However, now I have domUs with a frozen display when I disconnect the
introspection application (that is, after I switch back to the default
view and disable altp2m on the domain).

Thanks,
Razvan
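
To make the lazy type change described in the quoted explanation easier
to follow, here is a toy model in Python. It is only a sketch under
stated assumptions: ToyP2M, change_logdirty_range, handle_misconfig and
the type strings are all hypothetical names, not Xen code or Xen APIs,
and a plain set stands in for the misconfigured top-level entries. It
shows just the idea that a range change marks entries stale, and that
each entry's new type is decided on its first fault from the current
type, the logdirty ranges, and the global logdirty bit.

```python
# Toy model only: every name here is hypothetical, none is a Xen API.
from dataclasses import dataclass, field

RAM_RW = "ram_rw"
RAM_LOGDIRTY = "ram_logdirty"


@dataclass
class ToyP2M:
    types: dict                                          # gfn -> current type
    logdirty_ranges: list = field(default_factory=list)  # inclusive (start, end)
    global_logdirty: bool = False
    # Stand-in for "misconfigured" entries that must be re-evaluated.
    needs_recalc: set = field(default_factory=set)

    def change_logdirty_range(self, start, end):
        """Record the range and mark all entries stale; rewrite nothing."""
        self.logdirty_ranges.append((start, end))
        self.needs_recalc = set(self.types)

    def handle_misconfig(self, gfn):
        """Resolve one stale entry, as if handling a fault on it."""
        if gfn in self.needs_recalc:
            in_range = any(s <= gfn <= e for s, e in self.logdirty_ranges)
            # Only ordinary RAM types are switched; others keep their type.
            if self.types[gfn] in (RAM_RW, RAM_LOGDIRTY):
                want_logdirty = self.global_logdirty or in_range
                self.types[gfn] = RAM_LOGDIRTY if want_logdirty else RAM_RW
            self.needs_recalc.discard(gfn)
        return self.types[gfn]


p2m = ToyP2M(types={0: RAM_RW, 1: RAM_RW, 2: "mmio"})
p2m.change_logdirty_range(0, 0)   # cheap: no per-entry walk happens here
first = p2m.handle_misconfig(0)   # type is decided only on this "fault"
```

In this model, a second table that is never marked stale would keep
serving its old types after a range change, which is one way to picture
why the alternate views need the same treatment.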