Re: [PATCH v2] x86/HVM: correct hvmemul_map_linear_addr() for multi-page case

From: "Jan Beulich" <JBeulich@suse.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: xen-devel <xen-devel@lists.xenproject.org>,
	Paul Durrant <paul.durrant@citrix.com>
Subject: Re: [PATCH v2] x86/HVM: correct hvmemul_map_linear_addr() for multi-page case
Date: Mon, 08 Oct 2018 05:53:29 -0600	[thread overview]
Message-ID: <5BBB453902000078001EF88D@prv1-mh.provo.novell.com> (raw)
In-Reply-To: <b1bf6cb0-f3d3-28fc-2062-e8bf3411c35e@citrix.com>

>>> On 25.09.18 at 17:30, <andrew.cooper3@citrix.com> wrote:
> On 25/09/18 13:41, Jan Beulich wrote:
>>>>> On 20.09.18 at 14:41, <andrew.cooper3@citrix.com> wrote:
>>> On 13/09/18 11:12, Jan Beulich wrote:
>>>> The function does two translations in one go for a single guest access.
>>>> Any failure of the first translation step (guest linear -> guest
>>>> physical), resulting in #PF, ought to take precedence over any failure
>>>> of the second step (guest physical -> host physical).
>>> Why?  What is the basis of this presumption?
>>>
>>> As far as what real hardware does...
>>>
>>> This test sets up a ballooned page and a read-only page.  I.e. a second
>>> stage fault on the first part of a misaligned access, and a first stage
>>> fault on the second part of the access.
>>>
>>> (d1) --- Xen Test Framework ---
>>> (d1) Environment: HVM 64bit (Long mode 4 levels)
>>> (d1) Test splitfault
>>> (d1) About to read
>>> (XEN) *** EPT qual 0000000000000181, gpa 000000000011cffc
>>> (d1) Reading PTR: got 00000000ffffffff
>>> (d1) About to write
>>> (XEN) *** EPT qual 0000000000000182, gpa 000000000011cffc
>>> (d1) ******************************
>>> (d1) PANIC: Unhandled exception at 0008:00000000001047e0
>>> (d1) Vec 14 #PF[-d-sWP] %cr2 000000000011d000
>>> (d1) ******************************
>>>
>>> The second stage fault is recognised first, which is contrary to your
>>> presumption, i.e. the code in its current form appears to be correct.
>> Coming back to this example of yours: As a first step, are we in
>> agreement that with the exception of very complex instructions
>> (FSAVE, FXSAVE, XSAVE etc) instructions are supposed to work in an
>> all-or-nothing manner when it comes to updating of architectural
>> state (be it registers or memory)?
> 
> No.  Read Chapter Intel Vol3 8.1 and 8.2, which makes it quite clear
> that misaligned accesses may be split access, and observably so to other
> processors in the system.
> 
> I've even found a new bit in it which says that >quadword SSE accesses
> may even result in a partial write being completed before #PF is raised.

As said before, I'm all with you with the "may" part for FPU, SSE, and
alike more complex accesses. Short of being able to think of a way to
compare (in a more direct way than via EPT behavior, as your test
did) native and emulated behavior (I had coded something up until I
realized that it doesn't test what I want to test), I've written a small
test for real hardware. Would this debugging patch

--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -87,7 +87,7 @@ GLOBAL(boot_cpu_compat_gdt_table)
  * of physical memory. In any case the VGA hole should be mapped with type UC.
  * Uses 1x 4k page.
  */
-l1_identmap:
+GLOBAL(l1_identmap)
         pfn = 0
         .rept L1_PAGETABLE_ENTRIES
         /* VGA hole (0xa0000-0xc0000) should be mapped UC-. */
--- a/xen/arch/x86/ioport_emulate.c
+++ b/xen/arch/x86/ioport_emulate.c
@@ -112,8 +112,48 @@ static struct dmi_system_id __initdata i
     { }
 };
 
+extern unsigned long l1_identmap[L1_PAGETABLE_ENTRIES];//temp
 static int __init ioport_quirks_init(void)
 {
+ {//temp
+  unsigned char*hole = __va(0xb0000);
+  unsigned long l1e = l1_identmap[0xb0];
+  int i;
+
+  show_page_walk((long)hole);
+
+  l1_identmap[0xb0] &= ~_PAGE_RW;
+  l1_identmap[0xb0] |= _PAGE_PWT;
+  for(;;) {
+   flush_area_all(hole, FLUSH_TLB|FLUSH_TLB_GLOBAL);
+   show_page_walk((long)hole);
+
+   for(i = 3; i > 0; --i) {
+    unsigned long cr2 = 0;
+
+    memset(hole - 4, 0, 4);
+    asm("1: addl $-1, %0; 2:\n\t"
+        ".pushsection .fixup,\"ax\"\n"
+        "3: mov %%cr2, %1; jmp 2b\n\t"
+        ".popsection\n\t"
+        _ASM_EXTABLE(1b, 3b)
+        : "+m" (hole[-i]), "+r" (cr2) :: "memory");
+
+    printk("CR2=%lx data: %02x %02x %02x %02x\n", cr2,
+           hole[-4], hole[-3], hole[-2], hole[-1]);
+   }
+
+   if(l1_identmap[0xb0] & (1UL << (paddr_bits - 1)))
+    break;
+
+   l1_identmap[0xb0] |= ((1UL << paddr_bits) - 1) & PAGE_MASK;
+  }
+
+  l1_identmap[0xb0] = l1e;
+  flush_area_all(hole, FLUSH_TLB|FLUSH_TLB_GLOBAL);
+  show_page_walk((long)hole);
+ }
+
     dmi_check_system(ioport_quirks_tbl);
     return 0;
 }

producing this output

(XEN) Pagetable walk from ffff8300000b0000:
(XEN)  L4[0x106] = 80000000cf0ba063 ffffffffffffffff
(XEN)  L3[0x000] = 00000000cf0b4063 ffffffffffffffff
(XEN)  L2[0x000] = 00000000cf0b3063 ffffffffffffffff
(XEN)  L1[0x0b0] = 00000000000b0373 ffffffffffffffff
(XEN) Pagetable walk from ffff8300000b0000:
(XEN)  L4[0x106] = 80000000cf0ba063 ffffffffffffffff
(XEN)  L3[0x000] = 00000000cf0b4063 ffffffffffffffff
(XEN)  L2[0x000] = 00000000cf0b3063 ffffffffffffffff
(XEN)  L1[0x0b0] = 00000000000b0379 ffffffffffffffff
(XEN) CR2=ffff8300000b0000 data: 00 00 00 00
(XEN) CR2=ffff8300000b0000 data: 00 00 00 00
(XEN) CR2=ffff8300000b0000 data: 00 00 00 00
(XEN) Pagetable walk from ffff8300000b0000:
(XEN)  L4[0x106] = 80000000cf0ba063 ffffffffffffffff
(XEN)  L3[0x000] = 00000000cf0b4063 ffffffffffffffff
(XEN)  L2[0x000] = 00000000cf0b3063 ffffffffffffffff
(XEN)  L1[0x0b0] = 0000000ffffff379 ffffffffffffffff
(XEN) CR2=ffff8300000b0000 data: 00 00 00 00
(XEN) CR2=ffff8300000b0000 data: 00 00 00 00
(XEN) CR2=ffff8300000b0000 data: 00 00 00 00
(XEN) Pagetable walk from ffff8300000b0000:
(XEN)  L4[0x106] = 80000000cf0ba063 ffffffffffffffff
(XEN)  L3[0x000] = 00000000cf0b4063 ffffffffffffffff
(XEN)  L2[0x000] = 00000000cf0b3063 ffffffffffffffff
(XEN)  L1[0x0b0] = 00000000000b0373 ffffffffffffffff

convince you that simple (ALU) accesses at the very least do not get
split, and hence our emulation shouldn't do so either. (I admit that I
think I see another change, beyond the patch here, to be necessary
for this to work consistently in all cases: In the "known_gla() returns
true" case we'd also have to perform the second page walk _before_
trying to actually write anything. But I had questioned that
conditional anyway, first in
https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg00155.html
and then with a shortened version left in the description of commit
3bdec530a5 ["x86/HVM: split page straddling emulated accesses in
more cases"]. Paul wanted that behavior to be mirrored to
hvmemul_rmw() instead.)

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel